nips nips2007 nips2007-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study the rates of growth of the regret in online convex optimization. [sent-8, score-0.607]
2 First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. [sent-9, score-0.106]
3 We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between √T and log T. [sent-10, score-0.318]
4 1 Introduction The problem of online convex optimization can be formulated as a repeated game between a player and an adversary. [sent-13, score-0.282]
5 At round t, the player chooses an action xt from some convex subset K of Rn , and then the adversary chooses a convex loss function ft . [sent-14, score-1.216]
6 The player aims to ensure that the total loss, $\sum_{t=1}^T f_t(x_t)$, is not much larger than the smallest total loss $\sum_{t=1}^T f_t(x)$ of any fixed action x. [sent-15, score-0.995]
7 The difference between the total loss and its optimal value for a fixed action is known as the regret, which we denote $R_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x)$. [sent-16, score-0.965]
8 Many problems of online prediction of individual sequences can be viewed as special cases of online convex optimization, including prediction with expert advice, sequential probability assignment, and sequential investment [1]. [sent-17, score-0.357]
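To make the quantity concrete, the regret can be computed directly from the losses. A minimal Python sketch; the comparator `x_star` (a minimizer of the cumulative loss over K, e.g. from an offline solver) is an assumed input, not part of the original text:

```python
def regret(losses, plays, x_star):
    """Regret R_T of the plays x_1..x_T against a fixed comparator x_star.

    losses : list of loss functions f_t mapping a point to a real number
    plays  : list of points x_t chosen by the player
    x_star : assumed minimizer of sum_t f_t(x) over the set K
    """
    total = sum(f(x) for f, x in zip(losses, plays))
    best = sum(f(x_star) for f in losses)
    return total - best
```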
9 A central question in all these cases is how the regret grows with the number of rounds of the game. [sent-18, score-0.367]
10 Zinkevich [2] considered the following gradient descent algorithm, with step size η_t = Θ(1/√t). [sent-19, score-0.138]
11 (Here, Π_K(v) denotes the Euclidean projection of v onto the convex set K.) [sent-20, score-0.153]
12 2: for t = 1 to T do 3: Predict x_t, observe f_t. 4: Update $x_{t+1} = \Pi_K(x_t - \eta_t \nabla f_t(x_t))$. [sent-23, score-0.864]
13 5: end for. Zinkevich showed that the regret of this algorithm grows as √T, where T is the number of rounds of the game. [sent-25, score-0.406]
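A runnable sketch of this projected gradient descent; `project` (the Euclidean projection Π_K) and the gradient oracles are assumed inputs supplied by the caller:

```python
import numpy as np

def ogd(x0, grads, project, T):
    """Online gradient descent with eta_t = 1/sqrt(t), in the style of Zinkevich [2].

    x0      : initial point in K
    grads   : grads[t-1](x) returns a (sub)gradient of f_t at x
    project : Euclidean projection onto the convex set K
    """
    x, plays = x0, []
    for t in range(1, T + 1):
        plays.append(x)                        # predict x_t, then observe f_t
        eta = 1.0 / np.sqrt(t)                 # Theta(1/sqrt(t)) step size
        x = project(x - eta * grads[t - 1](x))
    return plays
```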
14 This rate cannot be improved in general for arbitrary convex loss functions. [sent-26, score-0.175]
15 However, this is not the case if the loss functions are uniformly convex, for instance, if all ft have second derivative at least H > 0. [sent-27, score-0.559]
16 Recently, Hazan et al. [3] showed that in this case it is possible for the regret to grow only logarithmically with T, using the same algorithm but with the smaller step size η_t = 1/(Ht). [sent-28, score-0.395]
17 The algorithm that achieves logarithmic regret must know in advance a lower bound on the convexity of the loss functions, since this bound is used to determine the step size. [sent-30, score-0.656]
18 It is natural to ask if this is essential: is there an algorithm that can adapt to the convexity of the loss functions and achieve the same regret rates in both cases—O(log T) for uniformly convex functions and O(√T) for arbitrary convex functions? [sent-31, score-0.859]
19 In this paper, we present an adaptive algorithm of this kind. [sent-32, score-0.069]
20 The key technique is regularization: We consider the online gradient descent (OGD) algorithm, but we add a uniformly convex function, the quadratic $\lambda_t \|x\|^2$, to each loss function f_t(x). [sent-33, score-0.909]
21 This corresponds to shrinking the algorithm’s actions xt towards the origin. [sent-34, score-0.41]
22 It leads to a regret bound of the form $R_T \le c \sum_{t=1}^T \lambda_t + p(\lambda_1, \ldots, \lambda_T)$. [sent-35, score-0.405]
23 The first term on the right hand side can be viewed as a bias term; it increases with λ_t because the presence of the regularization might lead the algorithm away from the optimum. [sent-39, score-0.083]
24 The second term is a penalty for the flatness of the loss functions that becomes smaller as the regularization increases. [sent-40, score-0.143]
25 We show that choosing the regularization coefficient λt so as to balance these two terms in the bound on the regret up to round t is nearly optimal in a strong sense. [sent-41, score-0.48]
26 Let K be a convex subset of $\mathbb{R}^n$ and suppose that $\sup_{x \in K} \|x\| \le D$. [sent-47, score-0.166]
27 Let H_t be the largest value such that for any x* ∈ K, $f_t(x^*) \ge f_t(x_t) + \nabla f_t(x_t)^\top (x^* - x_t) + \frac{H_t}{2} \|x^* - x_t\|^2$. (1) [sent-50, score-1.69]
28 In particular, if $\nabla^2 f_t - H_t \cdot I \succeq 0$, then the above inequality is satisfied. [sent-51, score-0.485]
29 2: for t = 1 to T do 3: Predict xt , observe ft . [sent-57, score-0.864]
30 6: Update $x_{t+1} = \Pi_K(x_t - \eta_{t+1}(\nabla f_t(x_t) + \lambda_t x_t))$. [sent-60, score-0.845]
31 The regret of Algorithm 2 is bounded by $R_T \le 3 \inf_{\lambda_1^*, \ldots, \lambda_T^*} \Big( D^2 \lambda^*_{1:T} + \sum_{t=1}^T \frac{(G_t + \lambda_t^* D)^2}{H_{1:t} + \lambda^*_{1:t}} \Big)$, the Euclidean specialization (A = B = D) of the general-norm bound of Section 4. [sent-63, score-0.388]
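A runnable sketch of Algorithm 2 under the definitions above. The step size $\eta_{t+1} = 1/(H_{1:t} + \lambda_{1:t})$ is as stated in Section 3; the balancing choice of λ_t borrows step 4 of Algorithm 4 (Section 4) with A = B = D, which is one plausible instantiation rather than a verbatim quote. The oracles `observe` and `project` are assumptions:

```python
import numpy as np

def adaptive_ogd(x0, project, observe, T, D, G):
    """Sketch of Adaptive Online Gradient Descent (Algorithm 2).

    observe(t, x) is assumed to return (g_t, H_t): a gradient of f_t at x
    and the curvature parameter H_t from condition (1).  D bounds the norm
    of points in K and G bounds the gradient norms.
    """
    x, plays = x0, []
    H_sum = lam_sum = 0.0
    for t in range(1, T + 1):
        plays.append(x)                          # predict x_t, observe f_t
        g_t, H_t = observe(t, x)
        H_sum += H_t
        # balancing choice: positive root of lam^2 + s*lam - 2G^2/(3D^2) = 0
        s = H_sum + lam_sum
        lam = 0.5 * (np.sqrt(s * s + 8.0 * G * G / (3.0 * D * D)) - s)
        lam_sum += lam
        eta = 1.0 / (H_sum + lam_sum)            # eta_{t+1} = 1/(H_{1:t} + lam_{1:t})
        x = project(x - eta * (g_t + lam * x))   # regularized gradient step
    return plays
```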
32 While Algorithm 2 is stated with the squared Euclidean norm as a regularizer, we show that it is straightforward to generalize our technique to other regularization functions that are uniformly convex with respect to other norms. [sent-67, score-0.358]
33 This leads to adaptive versions of the mirror descent algorithm analyzed recently in [4, 5]. [sent-68, score-0.163]
34 2 Preliminary results The following theorem gives a regret bound for the OGD algorithm with a particular choice of step size. [sent-69, score-0.507]
35 The virtue of the theorem is that the step size can be set without knowledge of the uniform lower bound on Ht , which is required in the original algorithm of [3]. [sent-70, score-0.142]
36 Then the regret of OGD is bounded as $R_T \le \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{H_{1:t}}$; writing $H_{1:t} = t \cdot \frac{1}{t} \sum_{s=1}^t H_s$, this is at most $\frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{t \min_{t' \le T} \frac{1}{t'} \sum_{s=1}^{t'} H_s}$. [sent-76, score-0.347]
37 Note that nothing prevents H_t from being negative or zero, implying that the same algorithm gives logarithmic regret even when some of the functions are linear or concave, as long as the partial averages $\frac{1}{t} \sum_{s=1}^t H_s$ are positive and not too small. [sent-78, score-0.45]
38 The above result already provides an important extension to the log-regret algorithm of [3]: no prior knowledge on the uniform convexity of the functions is needed, and the bound is in terms of the observed sequence {H_t}. [sent-79, score-0.202]
39 If H_1 > 0 and H_t = 0 for all t > 1, then $\sum_{s=1}^t H_s = H_1$, resulting in a linear regret bound. [sent-81, score-0.347]
40 However, we know from [2] that an O(√T) bound can be obtained. [sent-82, score-0.058]
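The calculation behind this remark, written out as a sketch (assuming a uniform gradient bound G_t ≤ G):

```latex
% With H_1 > 0 and H_t = 0 for all t > 1, H_{1:t} = H_1, so
R_T \;\le\; \frac{1}{2}\sum_{t=1}^{T}\frac{G_t^2}{H_{1:t}}
    \;=\; \frac{1}{2}\sum_{t=1}^{T}\frac{G_t^2}{H_1}
    \;\le\; \frac{G^2 T}{2H_1} \;=\; O(T).
```

This is linear in T, whereas the Θ(1/√t) step size already guarantees O(√T).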
41 In the next section we provide an algorithm which interpolates between an O(log T) and an O(√T) bound on the regret depending on the curvature of the observed functions. [sent-83, score-0.498]
42 3 Adaptive Regularization Suppose the environment plays a sequence of ft ’s with curvature Ht ≥ 0. [sent-84, score-0.522]
43 Instead of performing gradient descent on these functions, we step in the direction of the gradient of $\tilde f_t(x) = f_t(x) + \frac{\lambda_t}{2} \|x\|^2$, where the regularization parameter λ_t ≥ 0 is chosen appropriately at each step as a function of the curvature of the previous functions. [sent-85, score-1.203]
44 We remind the reader that K is assumed to be centered around the origin, for otherwise we would instead use $\|x - x_0\|^2$ to shrink the actions x_t towards the origin x_0. [sent-86, score-0.429]
45 If the Online Gradient Descent algorithm is performed on the functions $\tilde f_t(x) = f_t(x) + \frac{\lambda_t}{2} \|x\|^2$ with $\eta_{t+1} = \frac{1}{H_{1:t} + \lambda_{1:t}}$, then for any sequence of non-negative λ_1, …, λ_T, $R_T \le \frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}$. [sent-91, score-0.992]
46 By Theorem 2.1 applied to the functions $\tilde f_t$, $\sum_{t=1}^T \big( f_t(x_t) + \frac{\lambda_t}{2} \|x_t\|^2 \big) \le \min_x \sum_{t=1}^T \big( f_t(x) + \frac{\lambda_t}{2} \|x\|^2 \big) + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}$. [sent-97, score-1.787]
47 Indeed, it is easy to verify that condition (1) for f_t implies the corresponding statement for $\tilde f_t$ with $\tilde H_t = H_t + \lambda_t$. [sent-98, score-0.466]
48 Furthermore, by linearity, the bound on the gradient of $\tilde f_t$ is $G_t + \lambda_t \|x_t\| \le G_t + \lambda_t D$. [sent-99, score-0.962]
49 Define $x^* = \arg\min_x \sum_{t=1}^T f_t(x)$. [sent-100, score-0.946]
50 Then, dropping the $\|x_t\|^2$ terms and bounding $\|x^*\|^2 \le D^2$, $\sum_{t=1}^T f_t(x_t) \le \sum_{t=1}^T f_t(x^*) + \frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}$, which proves the theorem. [sent-101, score-1.296]
51 It turns out that for appropriate choices of {λ_t}, the above theorem recovers the O(√T) bound on the regret for linear functions [2] and the O(log T) bound for strongly convex functions [3]. [sent-104, score-0.804]
52 Moreover, under specific assumptions on the sequence {H_t}, we can define a sequence {λ_t} which produces intermediate rates between log T and √T. [sent-105, score-0.117]
53 These results are exhibited in corollaries at the end of this section. [sent-106, score-0.063]
54 While the influence of the regularization parameters λ_t on the first sum is trivial, the influence on the second sum is more involved, as all terms for t' ≥ t depend on λ_t. [sent-112, score-0.059]
55 Let {λ*_t} be the optimal sequence of non-negative regularization coefficients. [sent-127, score-0.085]
56 The lemma above is the key to the proof of the near-optimal bounds for Algorithm 2. [sent-136, score-0.05]
57 Without loss of generality, G_1 ≠ 0, for otherwise x_1 minimizes f_1(x) and the regret is negative on that round. [sent-150, score-0.391]
58 Hence, the algorithm has a bound on the performance which is 6 times the bound obtained by the best offline adaptive choice of regularization coefficients. [sent-151, score-0.262]
59 We also remark that if the diameter D is unknown, the regularization coefficients λt can still be chosen by balancing as in Eq. [sent-153, score-0.079]
60 This choice of λ_t, however, increases the bound on the regret suffered by Algorithm 2 by a factor of O(D²). [sent-155, score-0.449]
61 The theorem above not only recovers the rate of increase of regret of [3] and [2], but also provides intermediate rates. [sent-157, score-0.405]
62 Since the theorem guarantees that Algorithm 2 is competitive with the best choice of the parameters, we conclude that Algorithm 2 achieves the same rates. [sent-160, score-0.053]
63 Then for any sequence of convex functions {f_t}, the bound on the regret of Algorithm 2 is O(√T). [sent-164, score-0.602]
64 The lemma effectively describes an algorithm for an online problem with a competitive ratio of 2. [sent-170, score-0.149]
65 In the full version of this paper we give a lower bound strictly larger than one on the competitive ratio achievable by any online algorithm for this problem. [sent-171, score-0.207]
66 Hence, the regret of Algorithm 2 can never increase faster than √T. [sent-172, score-0.347]
67 Then the bound on the regret of Algorithm 2 is O(log T). [sent-177, score-0.405]
68 The above proof also recovers the result of Theorem 2. [sent-181, score-0.053]
69 The following Corollary shows a spectrum of rates under assumptions on the curvature of functions. [sent-183, score-0.069]
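The kind of calculation behind such a corollary, sketched under the illustrative assumption $H_t = H\, t^{-\alpha}$ for some $\alpha \in (0, 1)$ (the exact constants and conditions in the corollary may differ):

```latex
H_{1:t} \;=\; H\sum_{s=1}^{t} s^{-\alpha} \;=\; \Theta\!\left(t^{1-\alpha}\right),
\qquad
\frac{1}{2}\sum_{t=1}^{T}\frac{G_t^2}{H_{1:t}}
  \;=\; O\!\left(\sum_{t=1}^{T} t^{\alpha-1}\right)
  \;=\; O\!\left(T^{\alpha}\right).
```

The regret thus interpolates smoothly: α → 0 recovers O(log T), and α = 1/2 gives O(√T).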
70 4 Generalization to different norms The original online gradient descent (OGD) algorithm as analyzed by Zinkevich [2] used the Euclidean distance of the current point from the optimum as a potential function. [sent-200, score-0.32]
71 The logarithmic regret bounds of [3] for strongly convex functions were also stated for the Euclidean norm, and such was the presentation above. [sent-201, score-0.628]
72 As such, our results above for adaptive regularization carry over to the general setting, as we state below. [sent-203, score-0.104]
73 A function g over a convex set K is called H-strongly convex with respect to a convex function h if, for all x, y ∈ K, $g(x) \ge g(y) + \nabla g(y)^\top (x - y) + \frac{H}{2} B_h(x, y)$. [sent-207, score-0.409]
74 This notion of strong convexity generalizes the Euclidean notion: the function $g(x) = \|x\|^2$ is 2-strongly convex with respect to $h(x) = \|x\|^2$ (in this case $B_h(x, y) = \|x - y\|^2$). [sent-210, score-0.235]
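A quick check of the Euclidean example just given:

```latex
% With h(x) = \|x\|^2: B_h(x,y) = \|x\|^2 - \|y\|^2 - 2y^\top(x-y) = \|x-y\|^2.
% For g(x) = \|x\|^2:
g(x) - g(y) - \nabla g(y)^\top(x-y)
  \;=\; \|x\|^2 - \|y\|^2 - 2y^\top(x-y)
  \;=\; \|x-y\|^2
  \;=\; \tfrac{2}{2}\,B_h(x,y),
```

so the definition holds with H = 2.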
75 Henceforth we also refer to the dual norm of a given norm, defined by $\|y\|_* = \sup_{\|x\| \le 1} y^\top x$. [sent-214, score-0.05]
76 For the case of ℓ_p norms, we have $\|y\|_* = \|y\|_q$ where q satisfies $\frac{1}{p} + \frac{1}{q} = 1$, and by Hölder's inequality, $y^\top x \le \|y\|_* \|x\| \le \frac{1}{2} \|y\|_*^2 + \frac{1}{2} \|x\|^2$ (this holds for norms other than ℓ_p as well). [sent-215, score-0.087]
77 For simplicity, the reader may think of the functions g, h as convex and differentiable. [sent-216, score-0.187]
78 The following algorithm is a generalization of the OGD algorithm to general strongly convex functions (see the derivation in [6]). [sent-217, score-0.253]
79 Algorithm 3 General-Norm Online Gradient Descent 1: Input: convex function h 2: Initialize x1 arbitrarily. [sent-219, score-0.131]
80 3: for t = 1 to T do 4: Predict xt , observe ft . [sent-220, score-0.864]
81 5: Compute $\eta_{t+1}$ and let $y_{t+1}$ be such that $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1} \nabla f_t(x_t)$. [sent-221, score-0.472]
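A sketch of the step-5 update for the ℓ_p-norm setting, assuming the potential $h(x) = \frac{1}{2}\|x\|_q^2$, whose gradient map has a closed form. The Bregman projection back onto K (the elided step 6) is omitted, i.e. this is the unconstrained case $K = \mathbb{R}^n$:

```python
import numpy as np

def grad_half_norm_sq(x, r):
    """Gradient of h(x) = 0.5 * ||x||_r^2 (the norm 'link' function)."""
    nrm = np.linalg.norm(x, ord=r)
    if nrm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (r - 1) / nrm ** (r - 2)

def general_norm_step(x, grad, eta, q):
    """One step of Algorithm 3 with h(x) = 0.5*||x||_q^2 and K = R^n.

    By Fenchel conjugacy, the inverse of grad h is grad h*, where
    h*(theta) = 0.5*||theta||_p^2 with 1/p + 1/q = 1.
    """
    p = q / (q - 1.0)
    theta = grad_half_norm_sq(x, q) - 2.0 * eta * grad  # update in gradient space
    return grad_half_norm_sq(theta, p)                  # map back via inverse link
```

With q = 2 both link functions are the identity and the step reduces to the Euclidean update $x_t - 2\eta_{t+1}\nabla f_t(x_t)$ (up to the factor-2 convention used in the algorithm).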
82 As a first step, let us generalize the bound of [3], as well as Theorem 2. [sent-224, score-0.079]
83 Suppose that, for each t, f_t is an H_t-strongly convex function with respect to h, and let h be such that $B_h(x, y) \ge \|x - y\|^2$ for some norm $\|\cdot\|$. [sent-227, score-0.669]
84 By assumption on the functions f_t, for any x* ∈ K, $f_t(x_t) - f_t(x^*) \le \nabla f_t(x_t)^\top (x_t - x^*) - \frac{H_t}{2} B_h(x^*, x_t)$. [sent-232, score-0.885]
85 By a well-known property of Bregman divergences (see [6]), it holds that for any vectors x, y, z, $(x - y)^\top (\nabla h(z) - \nabla h(y)) = B_h(x, y) - B_h(x, z) + B_h(y, z)$. [sent-233, score-1.392]
86 Combining both observations, the per-round inequalities can be summed; the projection step can only decrease the Bregman divergence of y_{t+1} and x*, since x* ∈ K is in the convex set. [sent-236, score-0.16]
87 Summing over all iterations and recalling that $\eta_{t+1} = 1/H_{1:t}$, $2R_T \le \sum_{t=2}^T B_h(x^*, x_t)\Big(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} - H_t\Big) + B_h(x^*, x_1)\Big(\frac{1}{\eta_2} - H_1\Big) + \sum_{t=1}^T \frac{1}{\eta_{t+1}} B_h(x_t, y_{t+1}) = \sum_{t=1}^T \frac{1}{\eta_{t+1}} B_h(x_t, y_{t+1})$, (4) since this choice of step size makes the first two coefficients vanish. [sent-237, score-0.394]
88 Since the set of points of nondifferentiability of convex functions has measure zero, convexity is the only property that we require. [sent-238, score-0.225]
89 Indeed, for nondifferentiable functions, the algorithm would choose a point $\tilde x_t$, which is x_t with the addition of a small random perturbation. [sent-239, score-0.828]
90 With probability one, the functions would be smooth at the perturbed point, and the perturbation could be made arbitrarily small so that the regret rate would not be affected. [sent-240, score-0.387]
91 By definition of Bregman divergence, and the dual norm inequality stated before, $B_h(x_t, y_{t+1}) + B_h(y_{t+1}, x_t) = (\nabla h(x_t) - \nabla h(y_{t+1}))^\top (x_t - y_{t+1}) = 2\eta_{t+1} \nabla f_t(x_t)^\top (x_t - y_{t+1}) \le \eta_{t+1}^2 \|\nabla_t\|_*^2 + \|x_t - y_{t+1}\|^2$. [sent-242, score-1.345]
92 Thus, by our assumption $B_h(x, y) \ge \|x - y\|^2$, we have $B_h(x_t, y_{t+1}) \le \eta_{t+1}^2 \|\nabla_t\|_*^2 + \|x_t - y_{t+1}\|^2 - B_h(y_{t+1}, x_t) \le \eta_{t+1}^2 \|\nabla_t\|_*^2$; plugging this back into (4) gives $R_T \le \frac{1}{2} \sum_{t=1}^T \eta_{t+1} \|\nabla_t\|_*^2$. [sent-243, score-0.788]
93 The following algorithm is an analogue of Algorithm 2 and Theorem 4. [sent-248, score-0.05]
94 Let g(x) be 1-strongly convex with respect to the convex function h. [sent-252, score-0.278]
95 2: for t = 1 to T do 3: Predict x_t, observe f_t. 4: Compute $\lambda_t = \frac{1}{2} \Big( \sqrt{(H_{1:t} + \lambda_{1:t-1})^2 + 8 G_t^2 / (A^2 + 2B^2)} - (H_{1:t} + \lambda_{1:t-1}) \Big)$. [sent-253, score-0.864]
96 6: Let $y_{t+1}$ be such that $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1} (\nabla f_t(x_t) + \lambda_t \nabla g(x_t))$. [sent-255, score-0.451]
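The λ_t computation in step 4 is the positive root of the quadratic $\lambda^2 + (H_{1:t} + \lambda_{1:t-1})\lambda - 2G_t^2/(A^2 + 2B^2) = 0$; a direct transcription:

```python
import math

def balancing_lambda(H_1t, lam_prev, G_t, A, B):
    """Positive root of lam^2 + s*lam - 2*G_t^2/(A^2 + 2*B^2) = 0,
    where s = H_{1:t} + lam_{1:t-1}, as in step 4 of Algorithm 4."""
    s = H_1t + lam_prev
    return 0.5 * (math.sqrt(s * s + 8.0 * G_t ** 2 / (A ** 2 + 2.0 * B ** 2)) - s)
```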
97 Suppose that each f_t is an H_t-strongly convex function with respect to h, and let g be 1-strongly convex with respect to h. [sent-259, score-0.766]
98 The regret of Algorithm 4 is bounded by $R_T \le \inf_{\lambda^*_1, \ldots, \lambda^*_T} \Big( (A^2 + 2B^2) \lambda^*_{1:T} + \sum_{t=1}^T \frac{(G_t + \lambda^*_t B)^2}{H_{1:t} + \lambda^*_{1:t}} \Big)$. [sent-262, score-0.388]
99 If the norm in the above theorem is the Euclidean norm and $g(x) = \frac{1}{2}\|x\|^2$, then $\sup_{x \in K} \|x\| = A = B$ and we recover the Euclidean results above. [sent-265, score-0.16]
100 Proving relative loss bounds for on-line learning algorithms using Bregman divergences. [sent-292, score-0.059]
wordName wordTfidf (topN-words)
[('ft', 0.451), ('bh', 0.447), ('xt', 0.394), ('regret', 0.347), ('ht', 0.294), ('yt', 0.162), ('gt', 0.162), ('convex', 0.131), ('ogd', 0.109), ('online', 0.105), ('rt', 0.103), ('bregman', 0.097), ('hazan', 0.095), ('ct', 0.086), ('descent', 0.079), ('theorem', 0.06), ('gradient', 0.059), ('regularization', 0.059), ('bound', 0.058), ('hs', 0.056), ('euclidean', 0.054), ('convexity', 0.054), ('supx', 0.054), ('zinkevich', 0.051), ('norm', 0.05), ('curvature', 0.045), ('adaptive', 0.045), ('loss', 0.044), ('minx', 0.044), ('inf', 0.041), ('functions', 0.04), ('logarithmic', 0.039), ('norms', 0.038), ('recovers', 0.036), ('suppose', 0.035), ('strongly', 0.034), ('inequality', 0.034), ('gentile', 0.032), ('player', 0.03), ('division', 0.03), ('corollary', 0.03), ('divergence', 0.029), ('elad', 0.029), ('rakhlin', 0.029), ('shai', 0.029), ('yoram', 0.029), ('berkeley', 0.028), ('sequence', 0.026), ('analogue', 0.026), ('corollaries', 0.026), ('suffered', 0.026), ('divergences', 0.024), ('interpolates', 0.024), ('al', 0.024), ('algorithm', 0.024), ('rates', 0.024), ('initialize', 0.024), ('uniformly', 0.024), ('exhibited', 0.022), ('intermediate', 0.022), ('stated', 0.022), ('projection', 0.022), ('coef', 0.021), ('let', 0.021), ('competitive', 0.02), ('diameter', 0.02), ('rounds', 0.02), ('observe', 0.019), ('action', 0.019), ('log', 0.019), ('origin', 0.019), ('lemma', 0.018), ('choice', 0.018), ('bartlett', 0.018), ('oracle', 0.017), ('advance', 0.017), ('proof', 0.017), ('uc', 0.016), ('reader', 0.016), ('repeated', 0.016), ('respect', 0.016), ('predict', 0.016), ('atness', 0.016), ('bor', 0.016), ('investment', 0.016), ('kale', 0.016), ('nicol', 0.016), ('nondifferentiable', 0.016), ('satyen', 0.016), ('shrinking', 0.016), ('technique', 0.016), ('round', 0.016), ('bounds', 0.015), ('end', 0.015), ('achieves', 0.015), ('analyzed', 0.015), ('holds', 0.015), ('ca', 0.015), ('verify', 0.015), ('amit', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
2 0.28450239 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
3 0.25476146 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.
4 0.25257161 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
5 0.22307026 146 nips-2007-On higher-order perceptron algorithms
Author: Claudio Gentile, Fabio Vitale, Cristian Brotto
Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms.
6 0.20257983 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
7 0.18567853 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
8 0.17553049 165 nips-2007-Regret Minimization in Games with Incomplete Information
9 0.16503642 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
10 0.15085325 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
11 0.14434764 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
12 0.13292795 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
13 0.13084976 213 nips-2007-Variational Inference for Diffusion Processes
14 0.10693339 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
15 0.096772745 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
16 0.096471436 62 nips-2007-Convex Learning with Invariances
17 0.084267214 101 nips-2007-How SVMs can estimate quantiles and the median
18 0.083421536 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
19 0.079121687 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.075218812 118 nips-2007-Learning with Transformation Invariant Kernels
topicId topicWeight
[(0, -0.206), (1, -0.288), (2, 0.079), (3, 0.135), (4, 0.376), (5, -0.087), (6, -0.05), (7, -0.005), (8, -0.075), (9, -0.08), (10, 0.235), (11, 0.166), (12, -0.08), (13, 0.155), (14, -0.095), (15, 0.005), (16, -0.025), (17, -0.079), (18, -0.012), (19, 0.057), (20, -0.041), (21, -0.058), (22, 0.0), (23, 0.017), (24, -0.063), (25, 0.007), (26, 0.02), (27, -0.039), (28, 0.016), (29, 0.083), (30, 0.065), (31, 0.03), (32, -0.097), (33, -0.027), (34, -0.153), (35, 0.001), (36, -0.108), (37, 0.078), (38, -0.026), (39, -0.055), (40, -0.11), (41, 0.044), (42, 0.091), (43, -0.009), (44, -0.016), (45, 0.111), (46, 0.027), (47, -0.008), (48, -0.016), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.9763363 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
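The interpolation described in this abstract can be illustrated with a short sketch. The code below is our own reading of the idea, not the paper's exact algorithm: the extra regularization schedule lambda_t (1/sqrt(t) whenever no curvature is observed) and the function names project, grad, and curvature are assumptions made for this example.

```python
import numpy as np

def adaptive_ogd(project, grad, curvature, x0, T):
    """Sketch of an adaptive online gradient descent step-size rule.

    project:   Euclidean projection onto the convex set K
    grad:      grad(t, x) -> gradient of f_t at x
    curvature: curvature(t) -> a lower bound H_t >= 0 on the strong
               convexity of f_t (0 for merely convex losses)
    """
    x = x0
    cum = 0.0  # running sum of H_s + lambda_s seen so far
    history = []
    for t in range(1, T + 1):
        H_t = curvature(t)
        # Illustrative choice: add regularization only when no curvature
        # is observed, so no a priori lower bound on H_t is needed.
        lam_t = 0.0 if H_t > 0 else 1.0 / np.sqrt(t)
        cum += H_t + lam_t
        eta_t = 1.0 / cum                 # step size adapts to observed curvature
        x = project(x - eta_t * grad(t, x))
        history.append(x)
    return history
```

With all H_t ≥ H > 0 the step size reduces to Θ(1/(Ht)), the schedule that yields logarithmic regret; with H_t = 0 throughout, the added lambda_t makes eta_t = Θ(1/√t), recovering the √T rate, which is the interpolation the abstract claims.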
2 0.66518986 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.63680685 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning. 1
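The properties above come from an explore-then-exploit schedule organized in epochs. The sketch below is our own hedged illustration: the epoch-length schedule (one exploitation step per epoch index) stands in for the sample-complexity-driven choice in the paper, and the names fit, best_arm, and pull are assumptions of this example.

```python
import random

def epoch_greedy(contexts, K, fit, best_arm, pull):
    """Sketch of an Epoch-Greedy-style contextual bandit loop.

    contexts: iterator of side-information vectors
    K:        number of arms
    fit:      fit(samples) -> hypothesis trained on exploration samples
    best_arm: best_arm(h, x) -> arm recommended by hypothesis h
    pull:     pull(x, a) -> observed reward for arm a in context x
    """
    samples = []          # (context, arm, reward) triples from exploration
    it = iter(contexts)
    epoch = 0
    while True:
        epoch += 1
        # One exploration step per epoch: pull a uniformly random arm.
        try:
            x = next(it)
        except StopIteration:
            return
        a = random.randrange(K)
        samples.append((x, a, pull(x, a)))
        # Exploitation steps: act greedily with the current hypothesis.
        # The schedule "epoch many steps" is an illustrative stand-in
        # for the sample-complexity-based epoch length in the paper.
        h = fit(samples)
        for _ in range(epoch):
            try:
                x = next(it)
            except StopIteration:
                return
            pull(x, best_arm(h, x))
```

Note that the loop never consults a horizon T, matching property 1 in the abstract.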
4 0.60551983 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
5 0.58985579 146 nips-2007-On higher-order perceptron algorithms
Author: Claudio Gentile, Fabio Vitale, Cristian Brotto
Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms. 1 Introduction and preliminaries The problem of on-line learning linear-threshold functions from labeled data is one which has spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: on the one hand, linear functions combine flexibility with analytical and computational tractability; on the other hand, online algorithms provide efficient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) has greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the specific task of incremental learning (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.): any list of references would not do justice to the existing work on this subject. In this paper, we are interested in the problem of online learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where "higher" means here "higher than one", i.e., higher than first-order descent algorithms such as gradient descent or standard Perceptron-like algorithms). Contrary to other higher-order algorithms, such as the ridge regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and flexible manner second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simplified form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this has also computational advantages, allowing us to formulate an efficient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the fly between higher-order and first-order updates. This is especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. (The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reflects the authors' views.)
The second contribution of this paper is an experimental investigation of our algorithm on artificial and real-world datasets from various domains: we compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm defined in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental findings are the following: 1) Higher-order Perceptron always outperforms the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneficial in terms of convergence speed); 2) when dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the specific task at hand. Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, . . . the prediction algorithm is given an instance vector in Rn (for simplicity, all vectors are normalized, i.e., ||x_t|| = 1, where || · || is the Euclidean norm unless otherwise specified), and then guesses the binary label y_t ∈ {−1, 1} associated with x_t. We denote the algorithm's prediction by ŷ_t ∈ {−1, 1}. Then the true label y_t is disclosed. In the case when ŷ_t ≠ y_t we say that the algorithm has made a prediction mistake. We call an example a pair (x_t, y_t), and a sequence of examples S any sequence S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ Rn : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge-loss (or soft-margin) D_γ(u; S) of S w.r.t. a linear classifier u at a given margin value γ > 0: D_γ(u; S) = Σ_{t=1}^T max{0, γ − y_t uᵀx_t} (observe that D_γ(u; S) vanishes if and only if u separates S with margin at least γ). A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, . . . , T} of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to refine D_γ(u; S) so as to include only trials in M: D_γ(u; S) = Σ_{t∈M} max{0, γ − y_t uᵀx_t}. This gives bounds on A's performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed sequence S̃ = ((x̃_{i1}, y_{i1}), (x̃_{i2}, y_{i2}), . . . , (x̃_{im}, y_{im})), where each instance x_{ik} gets transformed into x̃_{ik} through a mapping depending only on the past behavior of the algorithm (i.e., only on examples up to trial t = i_{k−1}). As we will see in Section 3, this new sequence S̃ tends to be "more separable" than the original sequence, in the sense that if S is linearly separable with some margin, then the transformed sequence S̃ is likely to be separable with a larger margin.
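The cumulative hinge loss defined above is simple to compute directly. The snippet below is our own illustration (the names are ours), evaluating D_γ(u; S) either over all trials or only over a set M of mistaken trials, as in the refined definition.

```python
import numpy as np

def hinge_loss(u, X, y, gamma, M=None):
    """Cumulative hinge loss D_gamma(u; S) over the trials in M
    (all trials if M is None). Rows of X are unit-norm instances,
    y holds labels in {-1, +1}, and u is a unit-norm comparator."""
    idx = range(len(y)) if M is None else M
    return sum(max(0.0, gamma - y[t] * np.dot(u, X[t])) for t in idx)
```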
2 The Higher-order Perceptron algorithm The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ_1, ρ_2, ..., and maintains a product matrix B_k (initialized to the identity matrix I) and a sum vector v_k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector x_t ∈ Rn, the algorithm computes its binary prediction value ŷ_t as the sign of the inner product between vector B_{k−1} v_{k−1} and vector B_{k−1} x_t. If ŷ_t ≠ y_t then matrix B_{k−1} is updated multiplicatively as B_k = B_{k−1}(I − ρ_k x_t x_tᵀ) while vector v_{k−1} is updated additively through the standard Perceptron rule v_k = v_{k−1} + y_t x_t. The new matrix B_k and the new vector v_k will be used in the next trial. If ŷ_t = y_t no update is performed (hence the algorithm is mistake driven). Observe that ρ_k = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm's behavior be substantially different from Perceptron's) we need to ensure Σ_{k=1}^∞ ρ_k = ∞. In the sequel, our standard choice will be ρ_k = c/k, with c ∈ (0, 1). See Sections 3 and 4.

Parameters: ρ_1, ρ_2, ... ∈ [0, 1). Initialization: B_0 = I; v_0 = 0; k = 1.
Repeat for t = 1, 2, . . . , T:
  1. Get instance x_t ∈ Rn, ||x_t|| = 1;
  2. Predict ŷ_t = SGN(w_{k−1}ᵀ x_t) ∈ {−1, +1}, where w_{k−1} = B_{k−1}ᵀ B_{k−1} v_{k−1};
  3. Get label y_t ∈ {−1, +1};
  4. If ŷ_t ≠ y_t then: v_k = v_{k−1} + y_t x_t; B_k = B_{k−1}(I − ρ_k x_t x_tᵀ); k ← k + 1.
Figure 1: The Higher-order Perceptron algorithm (for p = 2).

Implementing Higher-order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits. 1) Primal version. We store and update an n×n matrix A_k = B_kᵀ B_k and an n-dimensional column vector v_k. Matrix A_k is updated as A_k = A_{k−1} − ρ A_{k−1} x xᵀ − ρ x xᵀ A_{k−1} + ρ² (xᵀ A_{k−1} x) x xᵀ, taking O(n²) operations, while v_k is updated as in Figure 1. Computing the algorithm's margin vᵀAx can then be carried out in time quadratic in the dimension n of the input space. 2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by X_k the n × k matrix whose columns are the n-dimensional instance vectors x_1, ..., x_k where a mistake occurred so far, and by y_k the k-dimensional column vector of the corresponding labels. We store and update the k × k matrix D_k = [d_{i,j}^{(k)}], the k × k diagonal matrix H_k = DIAG{h_k}, h_k = (h_1^{(k)}, ..., h_k^{(k)}) = X_kᵀ X_k y_k, and the k-dimensional column vector g_k = y_k + D_k H_k 1_k, being 1_k a vector of k ones. If we interpret the primal matrix A_k above as A_k = I + Σ_{i,j=1}^k d_{i,j}^{(k)} x_i x_jᵀ, it is not hard to show that the margin value w_{k−1}ᵀ x is equal to g_{k−1}ᵀ X_{k−1}ᵀ x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k²) extra inner products by updating D and H in the following way. We let D_0 and H_0 be empty matrices. Then, given D_{k−1} and H_{k−1} = DIAG{h_{k−1}}, we have (observe that, by construction, D_k is a symmetric matrix)

  D_k = [ D_{k−1}      −ρ_k b_k      ]
        [ −ρ_k b_kᵀ    d_{k,k}^{(k)} ],  where b_k = D_{k−1} X_{k−1}ᵀ x_k, and d_{k,k}^{(k)} = ρ_k² x_kᵀ X_{k−1} b_k − 2ρ_k + ρ_k².

On the other hand, H_k = DIAG{h_{k−1} + y_k X_{k−1}ᵀ x_k, h_k^{(k)}}, with h_k^{(k)} = y_{k−1}ᵀ X_{k−1}ᵀ x_k + y_k. Observe that on trials when ρ_k = 0 matrix D_{k−1} is padded with a zero row and a zero column. This amounts to saying that matrix A_k = I + Σ_{i,j=1}^k d_{i,j}^{(k)} x_i x_jᵀ is not updated, i.e., A_k = A_{k−1}.
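A minimal runnable sketch of the primal p = 2 algorithm in Figure 1 follows. This is our own illustration rather than the authors' code; for clarity it stores B explicitly, so each mistake costs O(n²).

```python
import numpy as np

def higher_order_perceptron(X, y, c=0.4):
    """Sketch of the Higher-order Perceptron (p = 2).
    X: (T, n) array of unit-norm instances; y: labels in {-1, +1}."""
    T, n = X.shape
    B = np.eye(n)          # product matrix B_{k-1}, initially I
    v = np.zeros(n)        # Perceptron sum vector v_{k-1}
    k, mistakes = 1, 0
    for t in range(T):
        x = X[t]
        w = B.T @ (B @ v)                  # w_{k-1} = B^T B v_{k-1}
        y_hat = 1 if w @ x >= 0 else -1    # sign of <B v, B x>
        if y_hat != y[t]:
            mistakes += 1
            rho = c / k                    # rho_k = c/k, c in (0, 1)
            v = v + y[t] * x               # additive Perceptron update
            B = B @ (np.eye(n) - rho * np.outer(x, x))  # multiplicative update
            k += 1
    return v, B, mistakes
```

Setting c = 0 leaves B at the identity, so the sketch degenerates into the standard Perceptron, matching the convention FO = HO2(0) used in Section 4.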
A closer look at the update mechanism for D and H just described allows us to conclude that the overall number of extra inner products needed to compute g_k is actually quadratic only in the number of past mistaken trials having ρ_k > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4). 3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute B_k z by unwrapping B_k as in B_k z = B_{k−1}(I − ρ x xᵀ)z = B_{k−1} z′, where the vector z′ = z − ρ x (xᵀz) can be calculated in time O(n). Thus computing the margin vᵀ B_{k−1}ᵀ B_{k−1} x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efficient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping g : θ ∈ Rn → ∇(‖θ‖_p²/2) ∈ Rn (this mapping has also been used in [12, 11]; recall that setting p = O(log n) yields an algorithm similar to Winnow [18], and notice that p = 2 yields g = identity). Then, in Figure 1, we normalize instance vectors x_t w.r.t. the p-norm, we define w_{k−1} = B_{k−1}ᵀ g(B_{k−1} v_{k−1}), and generalize the matrix update as B_k = B_{k−1}(I − ρ_k x_t g(x_t)ᵀ). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the "second-order" information contained in the matrix B_k. One can easily see that the above-mentioned argument for computing the margin g(B_{k−1} v_{k−1})ᵀ B_{k−1} x in time O(nk) still holds.
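The unwrapping observation in item 3) translates directly into code. The helper below is our own sketch: it stores only the (ρ_i, x_i) pairs that define the rank-one factors, so applying B_k to a vector costs O(nk) with no matrix ever formed.

```python
import numpy as np

def apply_B(factors, z):
    """Compute B_k z, where
    B_k = (I - rho_1 x_1 x_1^T) (I - rho_2 x_2 x_2^T) ... (I - rho_k x_k x_k^T)
    and factors = [(rho_1, x_1), ..., (rho_k, x_k)].
    The rightmost factor acts first, each application taking O(n)."""
    for rho, x in reversed(factors):
        z = z - rho * x * (x @ z)
    return z
```

For p = 2 the margin ⟨B_{k−1} v_{k−1}, B_{k−1} x_t⟩ then takes two such calls per trial, which is the O(nk) cost claimed above.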
3 Analysis We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classifier over the transformed sequence

  S̃ = (B_0 x_{t(1)}, y_{t(1)}), (B_1 x_{t(2)}, y_{t(2)}), (B_2 x_{t(3)}, y_{t(3)}), . . . ,   (1)

being t(k) the trial where the k-th mistake occurs, and B_k the k-th matrix produced by the algorithm. Observe that each feature vector x_{t(k)} gets transformed by a matrix B_k depending on past examples only. This is relevant to the argument that S̃ tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat "on-line structure" does not seem to be shared by other competing higher-order algorithms, such as the "ridge regression-like" algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2. Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T). Let the sequence of parameters ρ_k satisfy 0 ≤ ρ_k ≤ (1 − c)/(1 + |v_{k−1}ᵀ x_t|), where x_t is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m of mistakes satisfies

  m ≤ α D_γ(u; S̃_c)/γ + α²/(2γ²) + (α/γ) √( α D_γ(u; S̃_c)/γ + α²/(4γ²) ),   (2)

holding for any γ > 0 and any unit norm vector u ∈ Rn, where α = α(c) = (2 − c)/c. (The subscript c in S̃_c emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρ_k = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences; see, e.g., [12].) Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We fix an arbitrary sequence S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T) and let M ⊆ {1, 2, . . . , T} be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||B_k v_k||² over mistaken trials. Notice that the matrix B_kᵀ B_k is positive semidefinite for any k. We can write ||B_k v_k||² = ||B_{k−1}(I − ρ_k x_t x_tᵀ)(v_{k−1} + y_t x_t)||² (from the update rules v_k = v_{k−1} + y_t x_t and B_k = B_{k−1}(I − ρ_k x_t x_tᵀ)) = ||B_{k−1} v_{k−1} + y_t (1 − ρ_k y_t v_{k−1}ᵀ x_t − ρ_k) B_{k−1} x_t||² = ||B_{k−1} v_{k−1}||² + 2 y_t r_k v_{k−1}ᵀ B_{k−1}ᵀ B_{k−1} x_t + r_k² ||B_{k−1} x_t||² (using ||x_t|| = 1), where we set for brevity r_k = 1 − ρ_k y_t v_{k−1}ᵀ x_t − ρ_k. We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure r_k ≥ 0. Observe that y_t v_{k−1}ᵀ x_t ≥ 0 implies r_k ≥ 0 if and only if ρ_k ≤ 1/(1 + y_t v_{k−1}ᵀ x_t). On the other hand, if y_t v_{k−1}ᵀ x_t < 0 then, in order for r_k to be nonnegative, it suffices to pick ρ_k ≤ 1. In both cases ρ_k ≤ (1 − c)/(1 + |v_{k−1}ᵀ x_t|) implies r_k ≥ c > 0, and also r_k² ≤ (1 + ρ_k |v_{k−1}ᵀ x_t| − ρ_k)² ≤ (2 − c)². Now, using y_t v_{k−1}ᵀ B_{k−1}ᵀ B_{k−1} x_t ≤ 0 (combined with r_k ≥ 0), we conclude that ||B_k v_k||² − ||B_{k−1} v_{k−1}||² ≤ (2 − c)² ||B_{k−1} x_t||² = (2 − c)² x_tᵀ A_{k−1} x_t, where we set A_k = B_kᵀ B_k. A simple (and crude) upper bound on the last term follows by observing that ||x_t|| = 1 implies x_tᵀ A_{k−1} x_t ≤ ||A_{k−1}||, the spectral norm (largest eigenvalue) of A_{k−1}. (A slightly more refined bound can be derived which depends on the trace of the matrices I − A_k; details will be given in the full version of this paper.) Since a factor matrix of the form (I − ρ x xᵀ) with ρ ≤ 1 and ||x|| = 1 has spectral norm one, we have x_tᵀ A_{k−1} x_t ≤ ||A_{k−1}|| ≤ Π_{i=1}^{k−1} ||I − ρ_i x_{t(i)} x_{t(i)}ᵀ||² ≤ 1. Therefore, summing over k = 1, . . . , m = |M| (or, equivalently, over t ∈ M) and using v_0 = 0 yields the upper bound

  ||B_m v_m||² ≤ (2 − c)² m.   (3)

To find a lower bound of the left-hand side of (3), we first pick any unit norm vector u ∈ Rn, and apply the standard Cauchy-Schwartz inequality: ||B_m v_m|| ≥ uᵀ B_m v_m. Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write uᵀ B_k v_k − uᵀ B_{k−1} v_{k−1} = r_k y_t uᵀ B_{k−1} x_t ≥ r_k (γ − max{0, γ − y_t uᵀ B_{k−1} x_t}), where the last inequality follows from r_k ≥ 0 and holds for any margin value γ > 0. We sum the above over k = 1, . . . , m and exploit c ≤ r_k ≤ 2 − c after rearranging terms. This gets ||B_m v_m|| ≥ uᵀ B_m v_m ≥ c γ m − (2 − c) D_γ(u; S̃_c). Combining with (3) and solving for m gives the claimed bound. From the above result one can see that our algorithm might be viewed as a standard Perceptron algorithm operating on the transformed sequence S̃_c in (1). We now give a qualitative argument, which is suggestive of the improved margin properties of S̃_c. Assume for simplicity that all examples (x_t, y_t) in the original sequence are correctly classified by hyperplane u with the same margin γ = y_t uᵀ x_t > 0, where t = t(k). According to Theorem 1, the parameters ρ_1, ρ_2, . . . should be small positive numbers. Assume, again for simplicity, that all ρ_k are set to the same small enough value ρ > 0. Then, up to first order, matrix B_k = Π_{i=1}^k (I − ρ x_{t(i)} x_{t(i)}ᵀ) can be approximated as B_k ≈ I − ρ Σ_{i=1}^k x_{t(i)} x_{t(i)}ᵀ.
Then, to the extent that the above approximation holds, we can write (again, a similar argument holds in the more general setting p ≥ 2; the reader should notice how important the dependence of B_k on the past is to this argument):

  y_t uᵀ B_{k−1} x_t ≈ y_t uᵀ (I − ρ Σ_{i=1}^{k−1} x_{t(i)} x_{t(i)}ᵀ) x_t = y_t uᵀ x_t − ρ Σ_{i=1}^{k−1} (y_{t(i)} uᵀ x_{t(i)}) (y_t y_{t(i)} x_{t(i)}ᵀ x_t) = γ − ρ γ y_t v_{k−1}ᵀ x_t.

Now, y_t v_{k−1}ᵀ x_t is the margin of the (first-order) Perceptron vector v_{k−1} over a mistaken trial for the Higher-order Perceptron vector w_{k−1}. Since the two vectors v_{k−1} and w_{k−1} are correlated (recall that v_{k−1}ᵀ w_{k−1} = v_{k−1}ᵀ B_{k−1}ᵀ B_{k−1} v_{k−1} = ||B_{k−1} v_{k−1}||² ≥ 0) the mistaken condition y_t w_{k−1}ᵀ x_t ≤ 0 is more likely to imply y_t v_{k−1}ᵀ x_t ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update B_{k−1} → B_k might be skipped (this is equivalent to setting ρ_k = 0), and still Theorem 1 would hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence. 4 Experiments We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artificial and real-world, from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as first-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2. Task 1: DNA microarrays and artificial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We first tested on a couple of well-known DNA microarray datasets. For each dataset, we first generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the "Breast" dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The "Lymphoma" dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artificially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively. The datasets have been generated at random by first generating two (normalized) target vectors u ∈ {−1, 0, +1}^500, where the first 50 components are selected independently at random in {−1, +1} and the remaining 450
components are 0. Then we set η = 0.0 for the first target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]^500 and then normalized. If u · x_t ≥ γ then a +1 label is associated with x_t. If u · x_t ≤ −γ then a −1 label is associated with x_t. The labels so obtained are flipped with probability η. If |u · x_t| < γ then x_t is rejected and a new vector x_t is drawn. We call the two datasets "Artificial0.0" and "Artificial0.1". We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used. Task 2: Text categorization. The text categorization datasets are derived from the first 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF-IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classification problems by "binarizing" consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent categories, respectively, within the first 20,000 news stories of RCV1 (we did not use the most frequent category because of its significant overlap with the other ones). We call these datasets RCV1x, where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not significantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions. Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks, the USPS dataset and the MNIST dataset [16], and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel. Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higher-order Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was "implicit primal" when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set ρ_k = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8 (this setting fulfills the condition on ρ_k stated in Theorem 1). Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron.
On a mistaken round t = t(k), this algorithm sets ρ_k = c/k if y_t v_{k−1}ᵀ x_t ≥ 0, and ρ_k = 0 otherwise (thus, when y_t v_{k−1}ᵀ x_t < 0 the matrix B_{k−1} is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO ("First Order"), the Second-order algorithm is denoted by SO ("Second Order"), while the Higher-order algorithm with norm parameter p and ρ_k = c/k is abbreviated as HOp(c). Thus, for instance, FO = HO2(0). Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HOp(c) with c > 0 outperforms HOp(0). On the other hand, the comparison HOp(c) vs. SO depends on the specific task. On the DNA datasets, HOp(c) with c > 0 is clearly superior in Breast. On Lymphoma, HOp(c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO2 turns out to be only slightly worse than SO. On the artificial datasets HOp(c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO2 tends to perform better than SO. On USPS, HO2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO2 cuts the matrix updates roughly by half, still maintaining a good performance. In all cases HO2 (either sparse or not) significantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary classification, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HOp can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efficient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]). Table 1: Training and test error on the two datasets "Breast" and "Lymphoma". Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassified patterns in the test set. The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO2 differs from FO because of the c parameter. We emphasized the comparison HO7(0) vs. HO7(c) with best c among the tested values. According to a Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered significant. In bold are the smallest figures achieved on each row of the table.
            FO            HO2           HO4           HO7(0)        HO7           HO10          SO
            TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST
  BREAST    45.2  23.4%   21.7  16.4%   24.5  13.3%   47.4  15.7%   24.5  12.0%   32.4  13.5%   29.6  15.0%
  LYMPHOMA  22.1  11.8%   19.6  10.0%   18.9  10.0%   23.0  11.5%   20.0  11.5%   23.1  11.9%   19.3   9.6%

[Figure 2 comprises four panels: training updates vs. training epochs (top row) and test error rates vs. training epochs (bottom row), on Artificial0.0 (left) and Artificial0.1 (right), for FO = HO2(0.0), HO2(0.4), HO4(0.4), HO7(0.0), HO7(0.4), and SO (a = 0.2).] Figure 2: Experiments on the two artificial datasets (Artificial0.0, on the left, and Artificial0.1, on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artificial0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difficulty telling apart the two kinds of algorithms HOp(0.0) and HOp(c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former. References [1] A. Alizadeh, et al. (2000). Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511. [2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342. [3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150. [4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246. [5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classifiers with on-line and active learning. JMLR, 6, 1579–1619. [6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485. Table 2: Experimental results on the four binary classification tasks derived from RCV1. "Train" denotes the number of training corrections, while "Test" gives the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest figures achieved for each of the 8 combinations of dataset (RCV1x, x = 70, 101, 4, 59) and phase (training or test).
            FO            HO2           SO
            TRAIN TEST    TRAIN TEST    TRAIN TEST
  RCV1_70    993  7.20%    941  6.83%    880  6.95%
  RCV1_101   673  6.39%    665  5.81%    677  5.48%
  RCV1_4     803  6.14%    783  5.94%    819  6.05%
  RCV1_59    767  6.45%    762  6.04%    760  6.84%

Table 3: Experimental results on the OCR tasks. "Train" denotes the total number of training corrections, summed over the 10 categories, while "Test" denotes the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO2 we also report (in parentheses) the number of matrix updates during training. In bold are the smallest figures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test).

                  FO             HO2            Sparse HO2            SO
                  TRAIN  TEST    TRAIN  TEST    TRAIN        TEST     TRAIN  TEST
  USPS   GAUSS     1385  6.53%     945  4.76%     965 (440)   5.13%    1003  5.05%
  USPS   POLY      1609  7.37%    1090  5.71%    1081 (551)   5.52%    1054  5.53%
  MNIST  GAUSS     5834  2.10%    5351  1.79%    5363 (2596)  1.81%    5684  1.82%
  MNIST  POLY      8148  3.04%    6404  2.27%    6476 (3311)  2.28%    6440  2.03%

[7] N. Cesa-Bianchi, A. Conconi & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal on Computing, 34(3), 640–668. [8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linear-threshold algorithms. JMLR, 7, 1205–1230. [9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297. [10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a fixed budget. NIPS 18, MIT Press, pp. 259–266. [11] C. Gentile (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213–242. [12] C. Gentile (2003). The Robustness of the p-norm Algorithms. Machine Learning, 53(3), pp. 265–299. [13] A.J. Grove, N. Littlestone & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210. [14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984. [15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97, 325–343. [16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60. [17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387. [18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318. [19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261. [20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004. [21] A.B.J. Novikoff (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622. [22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/. [23] S. Shalev-Shwartz & Y. Singer (2006). Online Learning Meets Optimization in the Dual. COLT 2006, pp. 423–437. [24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press. [25] V. Vovk (2001). Competitive on-line statistics. International Statistical Review, 69, 213–248.
6 0.5234499 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
7 0.44417384 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
8 0.43604565 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
9 0.42636451 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
10 0.40871572 86 nips-2007-Exponential Family Predictive Representations of State
11 0.37420139 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
12 0.36828271 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
13 0.34872836 62 nips-2007-Convex Learning with Invariances
14 0.33884761 165 nips-2007-Regret Minimization in Games with Incomplete Information
15 0.33707088 213 nips-2007-Variational Inference for Diffusion Processes
16 0.32269642 101 nips-2007-How SVMs can estimate quantiles and the median
17 0.31047246 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
18 0.30431703 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
19 0.28639039 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
20 0.28517953 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
topicId topicWeight
[(5, 0.038), (13, 0.041), (16, 0.084), (18, 0.014), (19, 0.017), (21, 0.032), (31, 0.011), (34, 0.015), (35, 0.052), (43, 0.274), (46, 0.011), (47, 0.061), (49, 0.011), (83, 0.159), (90, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.73924798 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
2 0.58525115 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models
Author: Tatyana Sharpee
Abstract: This paper compares a family of methods for characterizing neural feature selectivity with natural stimuli in the framework of the linear-nonlinear model. In this model, the neural firing rate is a nonlinear function of a small number of relevant stimulus components. The relevant stimulus dimensions can be found by maximizing one of the family of objective functions, Rényi divergences of different orders [1, 2]. We show that maximizing one of them, Rényi divergence of order 2, is equivalent to least-square fitting of the linear-nonlinear model to neural data. Next, we derive reconstruction errors in relevant dimensions found by maximizing Rényi divergences of arbitrary order in the asymptotic limit of large spike numbers. We find that the smallest errors are obtained with Rényi divergence of order 1, also known as Kullback-Leibler divergence. This corresponds to finding relevant dimensions by maximizing mutual information [2]. We numerically test how these optimization schemes perform in the regime of low signal-to-noise ratio (small number of spikes and increasing neural noise) for model visual neurons. We find that optimization schemes based on either least square fitting or information maximization perform well even when the number of spikes is small. Information maximization provides slightly, but significantly, better reconstructions than least square fitting. This makes the problem of finding relevant dimensions, together with the problem of lossy compression [3], one of the examples where information-theoretic measures are no more data limited than those derived from least squares. 1
3 0.58509964 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy
Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1
4 0.58174253 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
Author: Lars Buesing, Wolfgang Maass
Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1
6 0.56929487 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
7 0.56814903 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.56777728 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.56752509 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
10 0.56662285 195 nips-2007-The Generalized FITC Approximation
11 0.56603152 185 nips-2007-Stable Dual Dynamic Programming
12 0.56593841 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
13 0.56587332 40 nips-2007-Bundle Methods for Machine Learning
14 0.56503409 49 nips-2007-Colored Maximum Variance Unfolding
15 0.56402534 102 nips-2007-Incremental Natural Actor-Critic Algorithms
16 0.56397879 116 nips-2007-Learning the structure of manifolds using random projections
17 0.56321311 70 nips-2007-Discriminative K-means for Clustering
18 0.56239671 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
19 0.56202084 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
20 0.56132305 200 nips-2007-The Tradeoffs of Large Scale Learning