nips nips2010 nips2010-193 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. [sent-2, score-0.232]
2 Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. [sent-3, score-0.375]
3 We provide a complete characterization of online learnability in the supervised setting. [sent-5, score-0.601]
4 1 Introduction In the online learning framework, the learner is faced with a sequence of data appearing at discrete time intervals. [sent-6, score-0.382]
5 In contrast to the classical “batch” learning scenario where the learner is being evaluated after the sequence is completely revealed, in the online framework the learner is evaluated at every round. [sent-7, score-0.591]
6 In the “batch” setting the data are typically assumed to be drawn i.i.d. with an unknown distribution, while in the online framework we relax or eliminate any stochastic assumptions on the data source. [sent-11, score-0.255]
7 As such, the online learning problem can be phrased as a repeated two-player game between the learner (player) and the adversary (Nature). [sent-12, score-0.646]
8 On each round t = 1, . . . , T , the Learner chooses ft ∈ F, the Adversary picks xt ∈ X , and the Learner suffers loss ft (xt ). [sent-17, score-0.745]
9 At the end of T rounds we define regret as the difference between the cumulative loss of the player as compared to the cumulative loss of the best fixed comparator. [sent-18, score-0.253]
10 For the given pair (F, X ), the problem is said to be online learnable if there exists an algorithm for the learner such that regret grows sublinearly. [sent-19, score-0.69]
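To make the protocol above concrete, here is a minimal sketch of the repeated game and the regret computation. The function class F, the instance space X, and both players' strategies are toy placeholders, not constructions from the paper.

```python
import random

# Toy instantiation of the repeated game described above: the function class F,
# the instance space X, and both strategies are placeholders (uniform random play).
F = [lambda x: x, lambda x: 1 - x, lambda x: 0.5]   # losses take values in [0, 1]
X = [0.0, 0.25, 0.5, 0.75, 1.0]
T = 100

learner_loss = 0.0
history = []
for t in range(T):
    f_t = random.choice(F)          # Learner chooses f_t in F
    x_t = random.choice(X)          # Adversary picks x_t in X
    learner_loss += f_t(x_t)        # Learner suffers loss f_t(x_t)
    history.append(x_t)

# Regret: cumulative loss minus the loss of the best fixed comparator in hindsight.
best_fixed_loss = min(sum(f(x) for x in history) for f in F)
print("regret:", learner_loss - best_fixed_loss)
```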
11 There has been a lot of interest in a particular setting of the online learning model, called online convex optimization. [sent-21, score-0.559]
12 In this setting, the loss incurred by the learner is written as xt (ft ), and the function xt is assumed to be convex in its argument. [sent-22, score-0.534]
13 The online learning model also subsumes the prediction setting. [sent-27, score-0.301]
14 In the latter, the learner’s choice of a Y-valued function gt leads to the loss ℓ(gt (zt ), yt ) according to a fixed loss function ℓ : Y × Y → R. [sent-28, score-0.191]
15 The choice of the learner is equivalently written as ft (x) = ℓ(gt (z), y), and xt = (zt , yt ) is the choice of the adversary. [sent-29, score-0.685]
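As a small illustration of this reduction (the absolute loss and the threshold predictor below are illustrative choices, not anything prescribed by the paper), a predictor gt and a fixed loss ℓ can be wrapped into a single function ft over pairs xt = (zt , yt ):

```python
# Wrapping a predictor g_t and a fixed loss ell into f_t(x_t) = ell(g_t(z_t), y_t);
# the absolute loss and the threshold predictor are illustrative choices only.
def absolute_loss(y_hat, y):
    return abs(y_hat - y)

def as_loss_function(g_t, loss=absolute_loss):
    def f_t(x_t):
        z_t, y_t = x_t              # adversary's move is the pair (z_t, y_t)
        return loss(g_t(z_t), y_t)
    return f_t

g = lambda z: 1.0 if z > 0 else -1.0
f = as_loss_function(g)
print(f((0.3, -1.0)))               # loss when predicting +1 on an example labeled -1
```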
16 It is well-known that learnability in the binary case (that is, Y = {−1, +1}) is completely characterized by finiteness of the Vapnik-Chervonenkis combinatorial dimension of the function class [32, 31]. [sent-36, score-0.589]
17 The last two dimensions were shown to characterize learnability [3] and uniform convergence of means to expectations for function classes. [sent-38, score-0.293]
18 In contrast to the classical learning setting, there has been surprisingly little work on characterizing learnability for the online learning framework. [sent-39, score-0.63]
19 Littlestone [19] has shown that, in the setting of prediction of binary outcomes, a certain combinatorial property of the binary-valued function class characterizes learnability in the realizable case. [sent-40, score-0.626]
20 In parallel to [7], minimax analysis of online convex optimization yielded new insights into the value of the game, its minimax dual representation, as well as algorithm-independent upper and lower bounds [1, 27]. [sent-42, score-0.499]
21 In this paper, we build upon these results and the findings of [7] to develop a theory of online learning. [sent-43, score-0.285]
22 We show that in the online learning model, a notion which we call Sequential Rademacher complexity allows us to easily prove learnability for a vast array of problems. [sent-44, score-0.605]
23 The role of this complexity is similar to the role of the Rademacher complexity in statistical learning theory. [sent-45, score-0.176]
24 We show that finiteness of this scale-sensitive version, which we call the fat-shattering dimension, is necessary and sufficient for learnability in the prediction setting. [sent-47, score-0.308]
25 If the problem is learnable in the supervised setting, then it is learnable by this algorithm. [sent-51, score-0.466]
26 Along the way we develop analogues of Massart’s finite class lemma, the Dudley integral upper bound on the Sequential Rademacher complexity, appropriately defined packing and covering numbers, and even an analogue of the Sauer-Shelah combinatorial lemma. [sent-52, score-0.543]
27 While the spirit of the online theory is that it provides a “temporal” generalization of the “batch” learning problem, not all the results from statistical learning theory transfer to our setting. [sent-60, score-0.315]
28 For instance, two distinct notions of a packing set exist for trees, and these notions can be seen to coincide in “batch” learning. [sent-61, score-0.237]
29 The fact that many notions of statistical learning theory can be extended to the online learning model is indeed remarkable. [sent-62, score-0.348]
30 2 Preliminaries By phrasing the online learning model as a repeated game and considering its minimax value, we naturally arrive at an important object in combinatorial game theory: trees. [sent-63, score-0.852]
31 Unless specified, all trees considered in this paper are rooted binary trees with equal-depth paths from the root to the leaves. [sent-64, score-0.231]
32 While it is useful to have the tree picture in mind when reading the paper, it is also necessary to precisely define trees as mathematical objects. [sent-65, score-0.195]
33 Given some set Z, a Z-valued tree of depth T is a sequence (z1 , . . . , zT ) of mappings zt : {±1}t−1 → Z. [sent-67, score-0.241]
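A minimal sketch of this object, with an arbitrary toy assignment of values to nodes; the tree is stored as one mapping per level, keyed by sign prefixes:

```python
from itertools import product

# A Z-valued tree of depth T stored as one mapping per level, keyed by the sign
# prefix (eps_1, ..., eps_{t-1}); the values assigned to nodes are arbitrary toys.
T = 3
Z = [0, 1, 2, 3]

tree = []
for t in range(1, T + 1):
    level = {prefix: Z[(t + sum(prefix)) % len(Z)]
             for prefix in product((-1, +1), repeat=t - 1)}
    tree.append(level)

def node(tree, t, eps):
    """Value z_t(eps): the node reached at level t along the path eps in {-1,+1}^T."""
    return tree[t - 1][tuple(eps[:t - 1])]

path = (+1, -1, +1)
print([node(tree, t, path) for t in range(1, T + 1)])
```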
34 3 Value of the Game Fix the sets F and X and consider the online learning model stated in the introduction. [sent-99, score-0.255]
35 We define the value of the game as
$$V_T(F, X) = \inf_{q_1 \in Q} \sup_{x_1 \in X} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T \in Q} \sup_{x_T \in X} \mathbb{E}_{f_T \sim q_T} \left[ \sum_{t=1}^{T} f_t(x_t) \, - \, \inf_{f \in F} \sum_{t=1}^{T} f(x_t) \right] \quad (1)$$
where ft has distribution qt . [sent-105, score-1.2]
36 We consider here the adaptive adversary who gets to choose each xt based on the history of moves f1:t−1 and x1:t−1 . [sent-106, score-0.299]
37 $$\sup_{p_1} \mathbb{E}_{x_1 \sim p_1} \cdots \sup_{p_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^{T} \inf_{f_t \in F} \mathbb{E}_{x_t \sim p_t}\left[f_t(x_t)\right] \, - \, \inf_{f \in F} \sum_{t=1}^{T} f(x_t) \right] \quad (2)$$ [sent-115, score-0.536]
38 The question of learnability in the online learning model is now reduced to the study of VT (F, X ), taking Eq. (2) as the starting point. [sent-116, score-0.517]
39 A class F is said to be online learnable with respect to the given X if $\limsup_{T \to \infty} V_T(F, X)/T = 0$. [sent-120, score-0.622]
40 The rest of the paper is aimed at understanding the value of the game VT (F, X ) for various function classes F. [sent-121, score-0.191]
41 A natural generalization of Rademacher complexity [18, 6, 21], the sequential analogue possesses many of the nice properties of its classical cousin. [sent-124, score-0.307]
42 The properties are proved in Section 7 and then used to show learnability for many of the examples in Section 8. [sent-125, score-0.262]
43 The Sequential Rademacher Complexity of a function class F ⊆ RX is defined as $R_T(F) = \sup_{x} \mathbb{E}_{\epsilon} \left[ \sup_{f \in F} \sum_{t=1}^{T} \epsilon_t f(x_t(\epsilon)) \right]$, where the outer supremum is taken over all X -valued trees of depth T and ε = (ε1 , . . . , εT ) is a sequence of i.i.d. Rademacher random variables. [sent-129, score-0.481]
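For a small finite class and one fixed tree, the inner quantity (before the outer supremum over trees) can be computed exactly by enumerating the 2^T sign paths. The sketch below uses a made-up class and tree purely for illustration:

```python
from itertools import product

# Exact value of E_eps[ sup_{f in F} sum_t eps_t * f(x_t(eps)) ] for one fixed toy
# tree and a small finite class; the definition then takes a supremum over trees.
T = 3
F = [lambda u: u, lambda u: -u, lambda u: u * u]    # toy class of real-valued functions

def x_node(t, eps):
    # x_t(eps) may depend only on the prefix (eps_1, ..., eps_{t-1}); toy assignment
    return 0.5 * sum(eps[:t - 1]) + 0.1 * t

total = 0.0
for eps in product((-1, +1), repeat=T):
    total += max(sum(eps[t - 1] * f(x_node(t, eps)) for t in range(1, T + 1)) for f in F)

print("sequential Rademacher average on this tree:", total / 2 ** T)
```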
44 The minimax value of a randomized game is bounded as VT (F) ≤ 2RT (F) . [sent-137, score-0.307]
45 In the sequel, these combinatorial parameters are shown to control the growth of covering numbers on trees. [sent-147, score-0.282]
46 In the setting of prediction, the combinatorial parameters are shown to exactly characterize learnability (see Section 6). [sent-148, score-0.395]
47 An X -valued tree x of depth d is shattered by a function class F ⊆ {±1}X if for all ε ∈ {±1}d , there exists f ∈ F such that f (xt (ε)) = εt for all t ∈ [d]. [sent-150, score-0.346]
48 The Littlestone dimension Ldim(F, X ) is the largest d such that F shatters an X -valued tree of depth d. [sent-151, score-0.287]
49 An X -valued tree x of depth d is α-shattered by a function class F ⊆ RX if there exists an R-valued witness tree s of depth d such that ∀ε ∈ {±1}d , ∃f ∈ F s.t. εt (f (xt (ε)) − st (ε)) ≥ α/2 for all t ∈ [d]. [sent-153, score-0.587]
50 The fat-shattering dimension fatα (F, X ) at scale α is the largest d such that F α-shatters an X -valued tree of depth d. [sent-156, score-0.287]
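A brute-force check of the binary shattering condition above, for a toy finite class and a toy tree (the actual definitions allow arbitrary classes and trees, so this is only an illustration):

```python
from itertools import product

# Brute-force check of the shattering condition for binary-valued classes: every
# sign path eps must be realized by some f in F along the tree. All data are toys.
d = 2
F = [lambda u: 1 if u >= 0 else -1,
     lambda u: 1 if u >= 1 else -1,
     lambda u: -1,
     lambda u: 1]

def x_node(t, eps):
    return sum(eps[:t - 1]) + 0.5       # toy X-valued tree over the reals

def shatters(F, x_node, d):
    for eps in product((-1, +1), repeat=d):
        realized = any(all(f(x_node(t, eps)) == eps[t - 1] for t in range(1, d + 1))
                       for f in F)
        if not realized:
            return False
    return True

print(shatters(F, x_node, d))           # True would witness Ldim(F) >= d via this tree
```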
51 Let us mention that if the trees x are defined by constant mappings xt (ε) = xt , the combinatorial parameters coincide with the Vapnik-Chervonenkis dimension and with the scale-sensitive dimension Pγ . [sent-159, score-0.877]
52 In particular, the “size” of a function class is known to be related to the complexity of learning from i.i.d. data. [sent-162, score-0.193]
53 The classical way to measure “size” is through a cover or a packing set. [sent-166, score-0.188]
54 A set V of R-valued trees of depth T is an α-cover (with respect to the ℓp -norm) of F ⊆ RX on a tree x of depth T if ∀f ∈ F, ∀ε ∈ {±1}T , ∃v ∈ V s.t. $\frac{1}{T}\sum_{t=1}^{T} |v_t(\epsilon) - f(x_t(\epsilon))|^p \le \alpha^p$. [sent-169, score-0.115]
55 The covering number Np (α, F, x) of a function class F on a given tree x is the size of the smallest such cover. [sent-171, score-0.321]
56 Further define Np (α, F, T ) = supx Np (α, F, x), the maximal ℓp covering number of F over depth-T trees. [sent-172, score-0.299]
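The cover condition can be checked directly for small toy instances by enumerating all sign paths; the class, tree, and candidate cover below are illustrative placeholders:

```python
from itertools import product

# Checking the alpha-cover condition in the l_p sense on one tree: every f and
# every sign path eps must be approximated by some tree v in V. Toy instances only.
def is_alpha_cover(V, F, x_node, T, alpha, p=2):
    for f in F:
        for eps in product((-1, +1), repeat=T):
            ok = any(sum(abs(v(t, eps) - f(x_node(t, eps))) ** p
                         for t in range(1, T + 1)) / T <= alpha ** p
                     for v in V)
            if not ok:
                return False
    return True

T = 2
F = [lambda u: u, lambda u: 0.9 * u]
x_node = lambda t, eps: float(t) + 0.5 * sum(eps[:t - 1])
V = [lambda t, eps: x_node(t, eps)]      # a single candidate R-valued tree
print(is_alpha_cover(V, F, x_node, T, alpha=0.3))
```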
57 In particular, a set V of R-valued trees of depth T is a 0-cover of F ⊆ RX on a tree x of depth T if for all f ∈ F and ε ∈ {±1}T , there exists v ∈ V s.t. vt (ε) = f (xt (ε)) for all t ∈ [T ]. [sent-173, score-0.475]
58 While in the i.i.d. setting there is a notion of packing number that upper and lower bounds the covering number, in the sequential counterpart such an analog fails. [sent-181, score-0.324]
59 We now show that the covering numbers are bounded in terms of the fat-shattering dimension. [sent-199, score-0.183]
60 Next, we present a bound similar to Massart’s finite class lemma [20, Lemma 5. [sent-209, score-0.211]
61 However, as we show next, Lemma 5 goes well beyond just finite classes and can be used to get an analog of Dudley entropy bound [10] for the online setting through a chaining argument. [sent-217, score-0.353]
62 The Integrated complexity of a function class F ⊆ [−1, 1]X is defined as $D_T(F) = \inf_{\alpha} \left\{ 4T\alpha + 12 \int_{\alpha}^{1} \sqrt{T \log N_2(\delta, F, T)} \, d\delta \right\}$. [sent-219, score-0.304]
63 For any function class F ⊆ [−1, 1]X , RT (F) ≤ DT (F). 6 Supervised Learning In this section we study the supervised learning problem where the player picks a function ft ∈ RX at any time t, the adversary provides an input-target pair (xt , yt ), and the player suffers loss |ft (xt ) − yt |. [sent-223, score-0.856]
64 As we are interested in prediction, we allow ft to be outside of F. [sent-225, score-0.243]
65 To formally define the value of the online supervised learning game, fix a set of labels Y ⊆ [−1, 1]. [sent-228, score-0.339]
66 Now, the supervised game is obtained using the pair (FS , X × Y) and we accordingly define $V^S_T(F) = V_T(F_S, X \times Y)$. [sent-230, score-0.275]
67 For the supervised learning game played with a function class F ⊆ [−1, 1]X , for any T ≥ 1,
$$\frac{1}{4\sqrt{2}} \sup_{\alpha} \left\{ \alpha \sqrt{T \min\{\mathrm{fat}_\alpha, T\}} \right\} \;\le\; V^S_T(F) \;\le\; R_T(F) \;\le\; D_T(F) \;\le\; \inf_{\alpha} \left\{ 4T\alpha + 12\sqrt{T} \int_{\alpha}^{1} \sqrt{\mathrm{fat}_\beta \log\left(\tfrac{2eT}{\beta}\right)} \, d\beta \right\} \quad (3)$$
Theorem 8. [sent-234, score-0.562]
68 For any function class F ⊆ [−1, 1]X , F is online learnable in the supervised setting if and only if fatα (F) is finite for any α > 0. [sent-235, score-0.635]
69 Moreover, if the function class is online learnable, then the value of the supervised game $V^S_T(F)$, the Sequential Rademacher complexity R(F), and the Integrated complexity D(F) are within a multiplicative factor of $O(\log^{3/2} T)$ of each other. [sent-236, score-0.811]
70 For the binary classification game played with a function class F we have that $K_1 \sqrt{T \min\{\mathrm{Ldim}(F), T\}} \le V^{\mathrm{Binary}}_T(F) \le K_2 \sqrt{T\,\mathrm{Ldim}(F) \log T}$ for some universal constants K1 , K2 . [sent-238, score-0.339]
71 Improper learning strategies simply output a prediction ŷt ∈ Y rather than a function ft ∈ F. [sent-242, score-0.378]
72 Since a proper learning strategy can always be used as an improper learning strategy, we trivially have that if a class is online learnable in the supervised setting then it is improperly online learnable. [sent-243, score-0.988]
73 Because of the above-mentioned property of the lower bound in Proposition 7, we also have the non-trivial reverse implication: if a class is improperly online learnable in the supervised setting, it is online learnable. [sent-244, score-0.982]
74 1 Generic Algorithm We shall now present a generic improper learning algorithm for the supervised setting that achieves a low regret bound whenever the function class is online learnable. [sent-246, score-0.694]
75 Using these experts along with the exponentially weighted experts algorithm, we shall provide the generic algorithm for online supervised learning. [sent-257, score-0.592]
76 . . . , YL ): V1 ← F; for t = 1 to T do: Rt (x) = {r ∈ Bα : fatα (Vt (r, x)) = maxr′∈Bα fatα (Vt (r′ , x))}; for each x ∈ X , let $f_t(x) = \frac{1}{|R_t(x)|} \sum_{r \in R_t(x)} r$; if t ∈ {i1 , . . . , iL } then [sent-264, score-0.243]
77 play f′t , receive xt , and update Vt+1 = Vt (f′t (xt ), xt ); else play f′t = ft , receive xt , and set Vt+1 = Vt ; end if; end for. For each L ≤ fatα (F) and every possible choice of 1 ≤ i1 < . . . [sent-270, score-1.407]
78 Each expert outputs a function ft ∈ F at every round t. [sent-278, score-0.293]
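A minimal sketch of the exponentially weighted aggregation step over a pool of experts, with placeholder experts and absolute loss rather than the paper's discretized experts:

```python
import math

# Exponentially weighted aggregation over a pool of experts: predictions are the
# weighted average of expert outputs, and weights decay with each expert's loss.
# The experts and the data sequence below are placeholders, not the paper's experts.
experts = [lambda x: 1.0, lambda x: -1.0, lambda x: 0.0]
T = 50
eta = math.sqrt(8 * math.log(len(experts)) / T)   # step size of order sqrt(log N / T)
weights = [1.0] * len(experts)

for t in range(T):
    x_t = (t % 7) / 7.0
    y_t = 1.0 if x_t > 0.5 else -1.0
    preds = [e(x_t) for e in experts]
    y_hat = sum(w * p for w, p in zip(weights, preds)) / sum(weights)
    weights = [w * math.exp(-eta * abs(p - y_t)) for w, p in zip(weights, preds)]

print("final normalized weights:",
      [round(w / sum(weights), 3) for w in weights])
```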
79 In particular, the contraction inequality due to Ledoux and Talagrand allows one to pass from a composition of a Lipschitz function with a class to the function class itself. [sent-287, score-0.21]
80 The next lemma bounds the Sequential Rademacher complexity for the product of classes. [sent-296, score-0.179]
81 The above result actually allows us to recover the O(√T ) regret bounds of online mirror descent (including Zinkevich’s online gradient descent) obtained in the online convex optimization literature. [sent-329, score-0.962]
82 It is easy to bound the value of the convex game by that of the linear game [2], i. [sent-332, score-0.477]
83 The online convex optimization setting includes supervised learning using convex losses and linear predictors, and so our theorem also proves existence of O(√T ) regret algorithms in that setting. [sent-336, score-0.517]
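As a concrete instance of such an O(√T)-regret method, here is a sketch of projected online gradient descent (Zinkevich-style) with hinge loss and linear predictors; the data stream, step sizes, and projection radius are illustrative choices, not taken from the paper:

```python
import numpy as np

# Projected online gradient descent with hinge loss and linear predictors; the
# stream, labels, step sizes, and radius are illustrative. Regret vs. the best
# fixed w in the ball grows as O(sqrt(T)) for convex Lipschitz losses.
rng = np.random.default_rng(0)
T, d, radius = 200, 5, 1.0
w = np.zeros(d)
total_loss = 0.0

for t in range(1, T + 1):
    x_t = rng.uniform(-1, 1, size=d)            # adversary's example
    y_t = 1.0 if x_t[0] + 0.1 > 0 else -1.0     # toy label rule
    margin = y_t * (w @ x_t)
    total_loss += max(0.0, 1.0 - margin)        # hinge loss, convex in w
    grad = -y_t * x_t if margin < 1.0 else np.zeros(d)
    w = w - (radius / np.sqrt(t)) * grad        # O(1/sqrt(t)) step size
    norm = np.linalg.norm(w)
    if norm > radius:                           # project back onto the ball
        w *= radius / norm

print("average loss:", total_loss / T)
```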
84 As far as we know, this is the first general margin based mistake bound in the online setting for a general function class. [sent-340, score-0.301]
85 For any function class F ⊂ RX bounded by B, there exists a randomized player strategy π such that for any sequence (x1 , y1 ), . [sent-342, score-0.209]
86 . . . , (xT , yT ) ∈ (X × {±1})T ,
$$\sum_{t=1}^{T} \mathbb{E}_{f_t \sim \pi_t(x_{1:t-1})}\left[\mathbf{1}\{f_t(x_t)\, y_t < 0\}\right] \;\le\; \inf_{\gamma > 0} \left\{ \inf_{f \in F} \sum_{t=1}^{T} \mathbf{1}\{f(x_t)\, y_t < \gamma\} + \frac{4}{\gamma} R_T(F) + \sqrt{T \log\log\left(\frac{B}{\gamma}\right)} \right\}$$
Example: Neural Networks and Decision Trees. We now consider a k-layer 1-norm neural network. [sent-345, score-0.222]
87 The theory we have developed provides us with enough tools to control the sequential Rademacher complexity of classes like the above that are built using simpler components. [sent-347, score-0.222]
88 We can also prove online learnability of decision trees under appropriate restrictions on their depth and number of leaves. [sent-355, score-0.751]
89 The structural results enjoyed by the sequential Rademacher complexity (esp. [sent-357, score-0.192]
90 Example: Transductive Learning and Prediction of Individual Sequences Let F ⊂ RX and let N∞ (α, F) be the classical pointwise (over X ) covering number at scale α. [sent-359, score-0.197]
91 To ensure online learnability, it is sufficient to consider an assumption on the dependence of N∞ (α, F) on α. [sent-363, score-0.255]
92 An obvious example of such a class is a VC-type class with N∞ (α, F) ≤ (c/α)d for some c which can depend on n. [sent-364, score-0.21]
93 Assuming that F ⊂ [0, 1]X , the value of the game is upper bounded by $2 D_T(F) \le 4\sqrt{dT \log c}$. [sent-365, score-0.225]
94 In particular, for binary prediction, using the Sauer-Shelah lemma ensures that the value of the game is at most $4\sqrt{dT \log(eT)}$, matching the result of [15] up to a constant 2. [sent-366, score-0.294]
95 Formally, we define static experts as mappings f : {1, . [sent-369, score-0.178]
96 A natural open question posed by the authors is whether there is an online variant of Isotron. [sent-381, score-0.255]
97 Before even attempting a quest for such an algorithm, we can ask a more basic question: is the (Idealized) SIM problem even learnable in the online framework? [sent-382, score-0.446]
98 Using the machinery we developed, it is not hard to show that the class H is online learnable in the supervised setting. [sent-384, score-0.635]
99 A stochastic view of optimal regret through minimax duality. [sent-391, score-0.199]
100 Optimal strategies and minimax lower bounds for online convex games. [sent-399, score-0.417]
wordName wordTfidf (topN-words)
[('rademacher', 0.284), ('fat', 0.277), ('learnability', 0.262), ('online', 0.255), ('vt', 0.246), ('ft', 0.243), ('xt', 0.226), ('game', 0.191), ('learnable', 0.191), ('depth', 0.14), ('combinatorial', 0.133), ('learner', 0.127), ('rx', 0.126), ('regret', 0.117), ('covering', 0.115), ('ldim', 0.115), ('inf', 0.111), ('experts', 0.109), ('class', 0.105), ('sequential', 0.104), ('tree', 0.101), ('littlestone', 0.099), ('trees', 0.094), ('yt', 0.089), ('complexity', 0.088), ('supervised', 0.084), ('minimax', 0.082), ('classical', 0.082), ('rt', 0.081), ('packing', 0.074), ('adversary', 0.073), ('sup', 0.071), ('player', 0.07), ('mappings', 0.069), ('isotron', 0.069), ('notions', 0.063), ('dudley', 0.06), ('lemma', 0.06), ('batch', 0.059), ('fj', 0.057), ('shai', 0.053), ('chaining', 0.052), ('improper', 0.052), ('niteness', 0.052), ('fk', 0.05), ('expert', 0.05), ('fw', 0.049), ('convex', 0.049), ('qt', 0.048), ('theorem', 0.047), ('fs', 0.047), ('prediction', 0.046), ('bound', 0.046), ('ekt', 0.046), ('improperly', 0.046), ('shelah', 0.046), ('dimension', 0.046), ('dt', 0.045), ('alon', 0.044), ('supx', 0.044), ('binary', 0.043), ('discretization', 0.042), ('eft', 0.04), ('vid', 0.04), ('armed', 0.04), ('symmetrization', 0.04), ('bartlett', 0.039), ('transductive', 0.039), ('zt', 0.038), ('analogues', 0.037), ('ext', 0.037), ('realizable', 0.037), ('massart', 0.037), ('complexities', 0.037), ('coincide', 0.037), ('games', 0.036), ('gt', 0.036), ('np', 0.035), ('shall', 0.035), ('sim', 0.035), ('il', 0.034), ('bounded', 0.034), ('proposition', 0.034), ('numbers', 0.034), ('proceeds', 0.034), ('loss', 0.033), ('colt', 0.033), ('analogue', 0.033), ('abernethy', 0.033), ('rakhlin', 0.033), ('ya', 0.033), ('yl', 0.033), ('sridharan', 0.033), ('corollary', 0.032), ('bounding', 0.032), ('cover', 0.032), ('characterizing', 0.031), ('bounds', 0.031), ('theory', 0.03), ('proving', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
2 0.26882362 243 nips-2010-Smoothness, Low Noise and Fast Rates
Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We establish an excess risk bound of Õ(H R²_n + √(H L∗) R_n) for ERM with an H-smooth loss function and a hypothesis class with Rademacher complexity R_n, where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where R_n = √(R/n), this translates to a learning rate of Õ(RH/n) in the separable (L∗ = 0) case and Õ(RH/n + √(L∗ RH/n)) more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1
3 0.266491 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
4 0.25006801 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
5 0.23805435 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.15789972 226 nips-2010-Repeated Games against Budgeted Adversaries
7 0.15110281 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
8 0.13707604 192 nips-2010-Online Classification with Specificity Constraints
9 0.125545 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
10 0.12336045 39 nips-2010-Bayesian Action-Graph Games
11 0.11920076 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
12 0.11008801 15 nips-2010-A Theory of Multiclass Boosting
13 0.10907853 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
14 0.10545329 265 nips-2010-The LASSO risk: asymptotic results and real world examples
15 0.1015175 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
16 0.09273874 194 nips-2010-Online Learning for Latent Dirichlet Allocation
17 0.09052138 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
18 0.090490945 202 nips-2010-Parallelized Stochastic Gradient Descent
19 0.087014563 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices
20 0.086660311 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information
topicId topicWeight
[(0, 0.252), (1, -0.062), (2, 0.216), (3, 0.016), (4, 0.023), (5, 0.146), (6, -0.166), (7, -0.053), (8, -0.098), (9, 0.307), (10, 0.116), (11, -0.025), (12, -0.08), (13, 0.027), (14, -0.036), (15, 0.119), (16, 0.022), (17, -0.13), (18, -0.038), (19, -0.072), (20, 0.081), (21, 0.102), (22, 0.063), (23, -0.047), (24, -0.205), (25, 0.01), (26, -0.021), (27, 0.028), (28, -0.122), (29, 0.048), (30, 0.08), (31, -0.006), (32, 0.017), (33, 0.02), (34, -0.0), (35, -0.04), (36, 0.015), (37, 0.021), (38, 0.009), (39, 0.051), (40, 0.054), (41, 0.027), (42, 0.068), (43, -0.072), (44, -0.049), (45, -0.099), (46, 0.045), (47, 0.091), (48, 0.006), (49, 0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96414 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
2 0.74890876 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
3 0.74374598 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
4 0.66540104 226 nips-2010-Repeated Games against Budgeted Adversaries
Author: Jacob D. Abernethy, Manfred K. Warmuth
Abstract: We study repeated zero-sum games against an adversary on a budget. Given that an adversary has some constraint on the sequence of actions that he plays, we consider what ought to be the player’s best mixed strategy with knowledge of this budget. We show that, for a general class of normal-form games, the minimax strategy is indeed efficiently computable and relies on a “random playout” technique. We give three diverse applications of this new algorithmic template: a cost-sensitive “Hedge” setting, a particular problem in Metrical Task Systems, and the design of combinatorial prediction markets. 1
5 0.59005272 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
6 0.57773513 243 nips-2010-Smoothness, Low Noise and Fast Rates
7 0.55625272 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
8 0.53612125 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
9 0.52722198 183 nips-2010-Non-Stochastic Bandit Slate Problems
10 0.48835206 220 nips-2010-Random Projection Trees Revisited
11 0.48506197 192 nips-2010-Online Classification with Specificity Constraints
12 0.47359112 15 nips-2010-A Theory of Multiclass Boosting
13 0.46517038 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
14 0.45897311 202 nips-2010-Parallelized Stochastic Gradient Descent
15 0.43334889 191 nips-2010-On the Theory of Learnining with Privileged Information
16 0.42965785 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
17 0.42186582 158 nips-2010-Learning via Gaussian Herding
18 0.41209257 61 nips-2010-Direct Loss Minimization for Structured Prediction
19 0.41052243 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
20 0.40379629 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
topicId topicWeight
[(13, 0.061), (17, 0.01), (27, 0.055), (30, 0.124), (39, 0.213), (45, 0.157), (50, 0.037), (52, 0.036), (53, 0.012), (60, 0.051), (77, 0.066), (78, 0.017), (90, 0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.81741971 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: We develop a theory of online learning by defining several complexity measures. Among them are analogues of Rademacher complexity, covering numbers and fatshattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. We apply these results to various learning problems. We provide a complete characterization of online learnability in the supervised setting. 1
2 0.78197533 182 nips-2010-New Adaptive Algorithms for Online Classification
Author: Francesco Orabona, Koby Crammer
Abstract: We propose a general framework to online learning for classification problems with time-varying potential functions in the adversarial setting. This framework allows to design and prove relative mistake bounds for any generic loss function. The mistake bounds can be specialized for the hinge loss, allowing to recover and improve the bounds of known online classification algorithms. By optimizing the general bound we derive a new online classification algorithm, called NAROW, that hybridly uses adaptive- and fixed- second order information. We analyze the properties of the algorithm and illustrate its performance using synthetic dataset. 1
3 0.7702477 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
4 0.76512402 105 nips-2010-Getting lost in space: Large sample analysis of the resistance distance
Author: Ulrike V. Luxburg, Agnes Radl, Matthias Hein
Abstract: The commute distance between two vertices in a graph is the expected time it takes a random walk to travel from the first to the second vertex and back. We study the behavior of the commute distance as the size of the underlying graph increases. We prove that the commute distance converges to an expression that does not take into account the structure of the graph at all and that is completely meaningless as a distance function on the graph. Consequently, the use of the raw commute distance for machine learning purposes is strongly discouraged for large graphs and in high dimensions. As an alternative we introduce the amplified commute distance that corrects for the undesired large sample effects. 1
5 0.76111543 172 nips-2010-Multi-Stage Dantzig Selector
Author: Ji Liu, Peter Wonka, Jieping Ye
Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ Rn×m (m n) and a noisy observation vector y ∈ Rn satisfying y = Xβ ∗ + where is the noise vector following a Gaussian distribution N (0, σ 2 I), how to recover the signal (or parameter vector) β ∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β ∗ . We show that if X obeys a certain condition, then with a large probability the difference between the solution ˆ β estimated by the proposed method and the true solution β ∗ measured in terms of the lp norm (p ≥ 1) is bounded as ˆ β − β∗ p ≤ C(s − N )1/p log m + ∆ σ, where C is a constant, s is the number of nonzero entries in β ∗ , ∆ is independent of m and is much smaller than the first term, and N is the number of entries of √ β ∗ larger than a certain value in the order of O(σ log m). The proposed method improves the estimation bound of the standard Dantzig selector approximately √ √ from Cs1/p log mσ to C(s − N )1/p log mσ where the value N depends on the number of large entries in β ∗ . When N = s, the proposed algorithm achieves the oracle solution with a high probability. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. 1
6 0.711923 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
7 0.71077287 222 nips-2010-Random Walk Approach to Regret Minimization
8 0.70972937 220 nips-2010-Random Projection Trees Revisited
9 0.7029373 117 nips-2010-Identifying graph-structured activation patterns in networks
10 0.69255096 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
11 0.69054902 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
12 0.6893903 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
13 0.68819451 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
14 0.6849131 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
15 0.68126631 280 nips-2010-Unsupervised Kernel Dimension Reduction
16 0.68076682 265 nips-2010-The LASSO risk: asymptotic results and real world examples
17 0.68039179 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
18 0.68003523 243 nips-2010-Smoothness, Low Noise and Fast Rates
19 0.67985559 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
20 0.6796692 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing