nips nips2007 nips2007-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study the rates of growth of the regret in online convex optimization. [sent-8, score-0.607]
2 First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. [sent-9, score-0.106]
3 We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between √T and log T. [sent-10, score-0.318]
4 1 Introduction The problem of online convex optimization can be formulated as a repeated game between a player and an adversary. [sent-13, score-0.282]
5 At round t, the player chooses an action xt from some convex subset K of Rn , and then the adversary chooses a convex loss function ft . [sent-14, score-1.216]
6 The player aims to ensure that the total loss, $\sum_{t=1}^T f_t(x_t)$, is not much larger than the smallest total loss $\sum_{t=1}^T f_t(x)$ of any fixed action x. [sent-15, score-0.995]
7 The difference between the total loss and its optimal value for a fixed action is known as the regret, which we denote $R_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x)$. [sent-16, score-0.965]
8 Many problems of online prediction of individual sequences can be viewed as special cases of online convex optimization, including prediction with expert advice, sequential probability assignment, and sequential investment [1]. [sent-17, score-0.357]
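To make the quantity concrete, the regret can be computed directly from the losses. A minimal Python sketch; the comparator `x_star` (a minimizer of the cumulative loss over K, e.g. from an offline solver) is an assumed input, not part of the original text:

```python
def regret(losses, plays, x_star):
    """Regret R_T of the plays x_1..x_T against a fixed comparator x_star.

    losses : list of loss functions f_t mapping a point to a real number
    plays  : list of points x_t chosen by the player
    x_star : assumed minimizer of sum_t f_t(x) over the set K
    """
    total = sum(f(x) for f, x in zip(losses, plays))
    best = sum(f(x_star) for f in losses)
    return total - best
```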
9 A central question in all these cases is how the regret grows with the number of rounds of the game. [sent-18, score-0.367]
10 Zinkevich [2] considered the following gradient descent algorithm, with step size η_t = Θ(1/√t). [sent-19, score-0.138]
11 (Here, Π_K(v) denotes the Euclidean projection of v onto the convex set K.) [sent-20, score-0.153]
12 2: for t = 1 to T do 3: Predict x_t, observe f_t. 4: Update $x_{t+1} = \Pi_K(x_t - \eta_t \nabla f_t(x_t))$. [sent-23, score-0.864]
13 5: end for. Zinkevich showed that the regret of this algorithm grows as √T, where T is the number of rounds of the game. [sent-25, score-0.406]
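A runnable sketch of this projected gradient descent; `project` (the Euclidean projection Π_K) and the gradient oracles are assumed inputs supplied by the caller:

```python
import numpy as np

def ogd(x0, grads, project, T):
    """Online gradient descent with eta_t = 1/sqrt(t), in the style of Zinkevich [2].

    x0      : initial point in K
    grads   : grads[t-1](x) returns a (sub)gradient of f_t at x
    project : Euclidean projection onto the convex set K
    """
    x, plays = x0, []
    for t in range(1, T + 1):
        plays.append(x)                        # predict x_t, then observe f_t
        eta = 1.0 / np.sqrt(t)                 # Theta(1/sqrt(t)) step size
        x = project(x - eta * grads[t - 1](x))
    return plays
```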
14 This rate cannot be improved in general for arbitrary convex loss functions. [sent-26, score-0.175]
15 However, this is not the case if the loss functions are uniformly convex, for instance, if all ft have second derivative at least H > 0. [sent-27, score-0.559]
16 Recently, Hazan et al. [3] showed that in this case it is possible for the regret to grow only logarithmically with T, using the same algorithm but with the smaller step size η_t = 1/(Ht). [sent-28, score-0.395]
17 The algorithm that achieves logarithmic regret must know in advance a lower bound on the convexity of the loss functions, since this bound is used to determine the step size. [sent-30, score-0.656]
18 It is natural to ask if this is essential: is there an algorithm that can adapt to the convexity of the loss functions and achieve the same regret rates in both cases—O(log T) for uniformly convex functions and O(√T) for arbitrary convex functions? [sent-31, score-0.859]
19 In this paper, we present an adaptive algorithm of this kind. [sent-32, score-0.069]
20 The key technique is regularization: We consider the online gradient descent (OGD) algorithm, but we add a uniformly convex function, the quadratic $\lambda_t \|x\|^2$, to each loss function f_t(x). [sent-33, score-0.909]
21 This corresponds to shrinking the algorithm’s actions xt towards the origin. [sent-34, score-0.41]
22 It leads to a regret bound of the form $R_T \le c \sum_{t=1}^T \lambda_t + p(\lambda_1, \ldots, \lambda_T)$. [sent-35, score-0.405]
23 The first term on the right hand side can be viewed as a bias term; it increases with λ_t because the presence of the regularization might lead the algorithm away from the optimum. [sent-39, score-0.083]
24 The second term is a penalty for the flatness of the loss functions that becomes smaller as the regularization increases. [sent-40, score-0.143]
25 We show that choosing the regularization coefficient λt so as to balance these two terms in the bound on the regret up to round t is nearly optimal in a strong sense. [sent-41, score-0.48]
26 Let K be a convex subset of $\mathbb{R}^n$ and suppose that $\sup_{x \in K} \|x\| \le D$. [sent-47, score-0.166]
27 Let H_t be the largest value such that for any x* ∈ K, $f_t(x^*) \ge f_t(x_t) + \nabla f_t(x_t)^\top (x^* - x_t) + \frac{H_t}{2} \|x^* - x_t\|^2$. (1) [sent-50, score-1.69]
28 In particular, if $\nabla^2 f_t - H_t \cdot I \succeq 0$, then the above inequality is satisfied. [sent-51, score-0.485]
29 2: for t = 1 to T do 3: Predict xt , observe ft . [sent-57, score-0.864]
30 6: Update $x_{t+1} = \Pi_K(x_t - \eta_{t+1}(\nabla f_t(x_t) + \lambda_t x_t))$. [sent-60, score-0.845]
31 The regret of Algorithm 2 is bounded by $R_T \le 3 \inf_{\lambda_1^*, \ldots, \lambda_T^*} \Big( D^2 \lambda^*_{1:T} + \sum_{t=1}^T \frac{(G_t + \lambda_t^* D)^2}{H_{1:t} + \lambda^*_{1:t}} \Big)$, the Euclidean specialization (A = B = D) of the general-norm bound of Section 4. [sent-63, score-0.388]
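A runnable sketch of Algorithm 2 under the definitions above. The step size $\eta_{t+1} = 1/(H_{1:t} + \lambda_{1:t})$ is as stated in Section 3; the balancing choice of λ_t borrows step 4 of Algorithm 4 (Section 4) with A = B = D, which is one plausible instantiation rather than a verbatim quote. The oracles `observe` and `project` are assumptions:

```python
import numpy as np

def adaptive_ogd(x0, project, observe, T, D, G):
    """Sketch of Adaptive Online Gradient Descent (Algorithm 2).

    observe(t, x) is assumed to return (g_t, H_t): a gradient of f_t at x
    and the curvature parameter H_t from condition (1).  D bounds the norm
    of points in K and G bounds the gradient norms.
    """
    x, plays = x0, []
    H_sum = lam_sum = 0.0
    for t in range(1, T + 1):
        plays.append(x)                          # predict x_t, observe f_t
        g_t, H_t = observe(t, x)
        H_sum += H_t
        # balancing choice: positive root of lam^2 + s*lam - 2G^2/(3D^2) = 0
        s = H_sum + lam_sum
        lam = 0.5 * (np.sqrt(s * s + 8.0 * G * G / (3.0 * D * D)) - s)
        lam_sum += lam
        eta = 1.0 / (H_sum + lam_sum)            # eta_{t+1} = 1/(H_{1:t} + lam_{1:t})
        x = project(x - eta * (g_t + lam * x))   # regularized gradient step
    return plays
```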
32 While Algorithm 2 is stated with the squared Euclidean norm as a regularizer, we show that it is straightforward to generalize our technique to other regularization functions that are uniformly convex with respect to other norms. [sent-67, score-0.358]
33 This leads to adaptive versions of the mirror descent algorithm analyzed recently in [4, 5]. [sent-68, score-0.163]
34 2 Preliminary results The following theorem gives a regret bound for the OGD algorithm with a particular choice of step size. [sent-69, score-0.507]
35 The virtue of the theorem is that the step size can be set without knowledge of the uniform lower bound on Ht , which is required in the original algorithm of [3]. [sent-70, score-0.142]
36 Then the regret of OGD is bounded as $R_T \le \frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{H_{1:t}}$; writing $H_{1:t} = t \cdot \frac{1}{t} \sum_{s=1}^t H_s$, this is at most $\frac{1}{2} \sum_{t=1}^T \frac{G_t^2}{t \min_{t' \le T} \frac{1}{t'} \sum_{s=1}^{t'} H_s}$. [sent-76, score-0.347]
37 Note that nothing prevents H_t from being negative or zero, implying that the same algorithm gives logarithmic regret even when some of the functions are linear or concave, as long as the partial averages $\frac{1}{t} \sum_{s=1}^t H_s$ are positive and not too small. [sent-78, score-0.45]
38 The above result already provides an important extension to the log-regret algorithm of [3]: no prior knowledge on the uniform convexity of the functions is needed, and the bound is in terms of the observed sequence {H_t}. [sent-79, score-0.202]
39 If H_1 > 0 and H_t = 0 for all t > 1, then $\sum_{s=1}^t H_s = H_1$, resulting in a linear regret bound. [sent-81, score-0.347]
40 However, we know from [2] that an O(√T) bound can be obtained. [sent-82, score-0.058]
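The calculation behind this remark, written out as a sketch (assuming a uniform gradient bound G_t ≤ G):

```latex
% With H_1 > 0 and H_t = 0 for all t > 1, H_{1:t} = H_1, so
R_T \;\le\; \frac{1}{2}\sum_{t=1}^{T}\frac{G_t^2}{H_{1:t}}
    \;=\; \frac{1}{2}\sum_{t=1}^{T}\frac{G_t^2}{H_1}
    \;\le\; \frac{G^2 T}{2H_1} \;=\; O(T).
```

This is linear in T, whereas the Θ(1/√t) step size already guarantees O(√T).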
41 In the next section we provide an algorithm which interpolates between an O(log T) and an O(√T) bound on the regret depending on the curvature of the observed functions. [sent-83, score-0.498]
42 3 Adaptive Regularization Suppose the environment plays a sequence of ft ’s with curvature Ht ≥ 0. [sent-84, score-0.522]
43 Instead of performing gradient descent on these functions, we step in the direction of the gradient of $\tilde f_t(x) = f_t(x) + \frac{\lambda_t}{2} \|x\|^2$, where the regularization parameter λ_t ≥ 0 is chosen appropriately at each step as a function of the curvature of the previous functions. [sent-85, score-1.203]
44 We remind the reader that K is assumed to be centered around the origin, for otherwise we would instead use $\|x - x_0\|^2$ to shrink the actions x_t towards the origin x_0. [sent-86, score-0.429]
45 If the Online Gradient Descent algorithm is performed on the functions $\tilde f_t(x) = f_t(x) + \frac{\lambda_t}{2} \|x\|^2$ with $\eta_{t+1} = \frac{1}{H_{1:t} + \lambda_{1:t}}$, then for any sequence of non-negative λ_1, …, λ_T, $R_T \le \frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}$. [sent-91, score-0.992]
46 By Theorem 2.1 applied to the functions $\tilde f_t$, $\sum_{t=1}^T \big( f_t(x_t) + \frac{\lambda_t}{2} \|x_t\|^2 \big) \le \min_x \sum_{t=1}^T \big( f_t(x) + \frac{\lambda_t}{2} \|x\|^2 \big) + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}$. [sent-97, score-1.787]
47 Indeed, it is easy to verify that condition (1) for f_t implies the corresponding statement for $\tilde f_t$ with $\tilde H_t = H_t + \lambda_t$. [sent-98, score-0.466]
48 Furthermore, by linearity, the bound on the gradient of $\tilde f_t$ is $G_t + \lambda_t \|x_t\| \le G_t + \lambda_t D$. [sent-99, score-0.962]
49 Define $x^* = \arg\min_x \sum_{t=1}^T f_t(x)$. [sent-100, score-0.946]
50 Then, dropping the $\|x_t\|^2$ terms and bounding $\|x^*\|^2 \le D^2$, $\sum_{t=1}^T f_t(x_t) \le \sum_{t=1}^T f_t(x^*) + \frac{1}{2} D^2 \lambda_{1:T} + \frac{1}{2} \sum_{t=1}^T \frac{(G_t + \lambda_t D)^2}{H_{1:t} + \lambda_{1:t}}$, which proves the theorem. [sent-101, score-1.296]
51 It turns out that for appropriate choices of {λ_t}, the above theorem recovers the O(√T) bound on the regret for linear functions [2] and the O(log T) bound for strongly convex functions [3]. [sent-104, score-0.804]
52 Moreover, under specific assumptions on the sequence {H_t}, we can define a sequence {λ_t} which produces intermediate rates between log T and √T. [sent-105, score-0.117]
53 These results are exhibited in corollaries at the end of this section. [sent-106, score-0.063]
54 While the influence of the regularization parameters λ_t on the first sum is trivial, the influence on the second sum is more involved, as all terms for t' ≥ t depend on λ_t. [sent-112, score-0.059]
55 Let {λ*_t} be the optimal sequence of non-negative regularization coefficients. [sent-127, score-0.085]
56 The lemma above is the key to the proof of the near-optimal bounds for Algorithm 2. [sent-136, score-0.05]
57 Without loss of generality, G_1 ≠ 0, for otherwise x_1 minimizes f_1(x) and the regret is negative on that round. [sent-150, score-0.391]
58 Hence, the algorithm has a bound on the performance which is 6 times the bound obtained by the best offline adaptive choice of regularization coefficients. [sent-151, score-0.262]
59 We also remark that if the diameter D is unknown, the regularization coefficients λt can still be chosen by balancing as in Eq. [sent-153, score-0.079]
60 This choice of λ_t, however, increases the bound on the regret suffered by Algorithm 2 by a factor of O(D²). [sent-155, score-0.449]
61 The theorem above not only recovers the rate of increase of regret of [3] and [2], but also provides intermediate rates. [sent-157, score-0.405]
62 Since the theorem guarantees that Algorithm 2 is competitive with the best choice of the parameters, we conclude that Algorithm 2 achieves the same rates. [sent-160, score-0.053]
63 Then for any sequence of convex functions {f_t}, the bound on the regret of Algorithm 2 is O(√T). [sent-164, score-0.602]
64 The lemma effectively describes an algorithm for an online problem with a competitive ratio of 2. [sent-170, score-0.149]
65 In the full version of this paper we give a lower bound strictly larger than one on the competitive ratio achievable by any online algorithm for this problem. [sent-171, score-0.207]
66 Hence, the regret of Algorithm 2 can never increase faster than √T. [sent-172, score-0.347]
67 Then the bound on the regret of Algorithm 2 is O(log T). [sent-177, score-0.405]
68 The above proof also recovers the result of Theorem 2. [sent-181, score-0.053]
69 The following Corollary shows a spectrum of rates under assumptions on the curvature of functions. [sent-183, score-0.069]
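The kind of calculation behind such a corollary, sketched under the illustrative assumption $H_t = H\, t^{-\alpha}$ for some $\alpha \in (0, 1)$ (the exact constants and conditions in the corollary may differ):

```latex
H_{1:t} \;=\; H\sum_{s=1}^{t} s^{-\alpha} \;=\; \Theta\!\left(t^{1-\alpha}\right),
\qquad
\frac{1}{2}\sum_{t=1}^{T}\frac{G_t^2}{H_{1:t}}
  \;=\; O\!\left(\sum_{t=1}^{T} t^{\alpha-1}\right)
  \;=\; O\!\left(T^{\alpha}\right).
```

The regret thus interpolates smoothly: α → 0 recovers O(log T), and α = 1/2 gives O(√T).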
70 4 Generalization to different norms The original online gradient descent (OGD) algorithm as analyzed by Zinkevich [2] used the Euclidean distance of the current point from the optimum as a potential function. [sent-200, score-0.32]
71 The logarithmic regret bounds of [3] for strongly convex functions were also stated for the Euclidean norm, and such was the presentation above. [sent-201, score-0.628]
72 As such, our results above for adaptive regularization carry over to the general setting, as we state below. [sent-203, score-0.104]
73 A function g over a convex set K is called H-strongly convex with respect to a convex function h if, for all x, y ∈ K, $g(x) \ge g(y) + \nabla g(y)^\top (x - y) + \frac{H}{2} B_h(x, y)$. [sent-207, score-0.409]
74 This notion of strong convexity generalizes the Euclidean notion: the function $g(x) = \|x\|^2$ is 2-strongly convex with respect to $h(x) = \|x\|^2$ (in this case $B_h(x, y) = \|x - y\|^2$). [sent-210, score-0.235]
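A quick check of the Euclidean example just given:

```latex
% With h(x) = \|x\|^2: B_h(x,y) = \|x\|^2 - \|y\|^2 - 2y^\top(x-y) = \|x-y\|^2.
% For g(x) = \|x\|^2:
g(x) - g(y) - \nabla g(y)^\top(x-y)
  \;=\; \|x\|^2 - \|y\|^2 - 2y^\top(x-y)
  \;=\; \|x-y\|^2
  \;=\; \tfrac{2}{2}\,B_h(x,y),
```

so the definition holds with H = 2.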
75 Henceforth we also refer to the dual norm of a given norm, defined by $\|y\|_* = \sup_{\|x\| \le 1} y^\top x$. [sent-214, score-0.05]
76 For the case of ℓ_p norms, we have $\|y\|_* = \|y\|_q$ where q satisfies $\frac{1}{p} + \frac{1}{q} = 1$, and by Hölder's inequality, $y^\top x \le \|y\|_* \|x\| \le \frac{1}{2} \|y\|_*^2 + \frac{1}{2} \|x\|^2$ (this holds for norms other than ℓ_p as well). [sent-215, score-0.087]
77 For simplicity, the reader may think of the functions g, h as convex and differentiable. [sent-216, score-0.187]
78 The following algorithm is a generalization of the OGD algorithm to general strongly convex functions (see the derivation in [6]). [sent-217, score-0.253]
79 Algorithm 3 General-Norm Online Gradient Descent 1: Input: convex function h 2: Initialize x1 arbitrarily. [sent-219, score-0.131]
80 3: for t = 1 to T do 4: Predict xt , observe ft . [sent-220, score-0.864]
81 5: Compute $\eta_{t+1}$ and let $y_{t+1}$ be such that $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1} \nabla f_t(x_t)$. [sent-221, score-0.472]
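A sketch of the step-5 update for the ℓ_p-norm setting, assuming the potential $h(x) = \frac{1}{2}\|x\|_q^2$, whose gradient map has a closed form. The Bregman projection back onto K (the elided step 6) is omitted, i.e. this is the unconstrained case $K = \mathbb{R}^n$:

```python
import numpy as np

def grad_half_norm_sq(x, r):
    """Gradient of h(x) = 0.5 * ||x||_r^2 (the norm 'link' function)."""
    nrm = np.linalg.norm(x, ord=r)
    if nrm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (r - 1) / nrm ** (r - 2)

def general_norm_step(x, grad, eta, q):
    """One step of Algorithm 3 with h(x) = 0.5*||x||_q^2 and K = R^n.

    By Fenchel conjugacy, the inverse of grad h is grad h*, where
    h*(theta) = 0.5*||theta||_p^2 with 1/p + 1/q = 1.
    """
    p = q / (q - 1.0)
    theta = grad_half_norm_sq(x, q) - 2.0 * eta * grad  # update in gradient space
    return grad_half_norm_sq(theta, p)                  # map back via inverse link
```

With q = 2 both link functions are the identity and the step reduces to the Euclidean update $x_t - 2\eta_{t+1}\nabla f_t(x_t)$ (up to the factor-2 convention used in the algorithm).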
82 As a first step, let us generalize the bound of [3], as well as Theorem 2. [sent-224, score-0.079]
83 Suppose that, for each t, f_t is an H_t-strongly convex function with respect to h, and let h be such that $B_h(x, y) \ge \|x - y\|^2$ for some norm $\|\cdot\|$. [sent-227, score-0.669]
84 By assumption on the functions f_t, for any x* ∈ K, $f_t(x_t) - f_t(x^*) \le \nabla f_t(x_t)^\top (x_t - x^*) - \frac{H_t}{2} B_h(x^*, x_t)$. [sent-232, score-0.885]
85 By a well-known property of Bregman divergences (see [6]), it holds that for any vectors x, y, z, $(x - y)^\top (\nabla h(z) - \nabla h(y)) = B_h(x, y) - B_h(x, z) + B_h(y, z)$. [sent-233, score-1.392]
86 Combining both observations, the per-round inequalities can be summed; the projection step can only decrease the Bregman divergence of y_{t+1} and x*, since x* ∈ K is in the convex set. [sent-236, score-0.16]
87 Summing over all iterations and recalling that $\eta_{t+1} = 1/H_{1:t}$, $2R_T \le \sum_{t=2}^T B_h(x^*, x_t)\Big(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} - H_t\Big) + B_h(x^*, x_1)\Big(\frac{1}{\eta_2} - H_1\Big) + \sum_{t=1}^T \frac{1}{\eta_{t+1}} B_h(x_t, y_{t+1}) = \sum_{t=1}^T \frac{1}{\eta_{t+1}} B_h(x_t, y_{t+1})$, (4) since this choice of step size makes the first two coefficients vanish. [sent-237, score-0.394]
88 Since the set of points of nondifferentiability of convex functions has measure zero, convexity is the only property that we require. [sent-238, score-0.225]
89 Indeed, for nondifferentiable functions, the algorithm would choose a point $\tilde x_t$, which is x_t with the addition of a small random perturbation. [sent-239, score-0.828]
90 With probability one, the functions would be smooth at the perturbed point, and the perturbation could be made arbitrarily small so that the regret rate would not be affected. [sent-240, score-0.387]
91 By definition of Bregman divergence, and the dual norm inequality stated before, $B_h(x_t, y_{t+1}) + B_h(y_{t+1}, x_t) = (\nabla h(x_t) - \nabla h(y_{t+1}))^\top (x_t - y_{t+1}) = 2\eta_{t+1} \nabla f_t(x_t)^\top (x_t - y_{t+1}) \le \eta_{t+1}^2 \|\nabla_t\|_*^2 + \|x_t - y_{t+1}\|^2$. [sent-242, score-1.345]
92 Thus, by our assumption $B_h(x, y) \ge \|x - y\|^2$, we have $B_h(x_t, y_{t+1}) \le \eta_{t+1}^2 \|\nabla_t\|_*^2 + \|x_t - y_{t+1}\|^2 - B_h(y_{t+1}, x_t) \le \eta_{t+1}^2 \|\nabla_t\|_*^2$; plugging this back into (4) gives $R_T \le \frac{1}{2} \sum_{t=1}^T \eta_{t+1} \|\nabla_t\|_*^2$. [sent-243, score-0.788]
93 The following algorithm is an analogue of Algorithm 2 and Theorem 4. [sent-248, score-0.05]
94 Let g(x) be 1-strongly convex with respect to the convex function h. [sent-252, score-0.278]
95 2: for t = 1 to T do 3: Predict x_t, observe f_t. 4: Compute $\lambda_t = \frac{1}{2} \Big( \sqrt{(H_{1:t} + \lambda_{1:t-1})^2 + 8 G_t^2 / (A^2 + 2B^2)} - (H_{1:t} + \lambda_{1:t-1}) \Big)$. [sent-253, score-0.864]
96 6: Let $y_{t+1}$ be such that $\nabla h(y_{t+1}) = \nabla h(x_t) - 2\eta_{t+1} (\nabla f_t(x_t) + \lambda_t \nabla g(x_t))$. [sent-255, score-0.451]
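The λ_t computation in step 4 is the positive root of the quadratic $\lambda^2 + (H_{1:t} + \lambda_{1:t-1})\lambda - 2G_t^2/(A^2 + 2B^2) = 0$; a direct transcription:

```python
import math

def balancing_lambda(H_1t, lam_prev, G_t, A, B):
    """Positive root of lam^2 + s*lam - 2*G_t^2/(A^2 + 2*B^2) = 0,
    where s = H_{1:t} + lam_{1:t-1}, as in step 4 of Algorithm 4."""
    s = H_1t + lam_prev
    return 0.5 * (math.sqrt(s * s + 8.0 * G_t ** 2 / (A ** 2 + 2.0 * B ** 2)) - s)
```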
97 Suppose that each f_t is an H_t-strongly convex function with respect to h, and let g be 1-strongly convex with respect to h. [sent-259, score-0.766]
98 The regret of Algorithm 4 is bounded by $R_T \le \inf_{\lambda^*_1, \ldots, \lambda^*_T} \Big( (A^2 + 2B^2) \lambda^*_{1:T} + \sum_{t=1}^T \frac{(G_t + \lambda^*_t B)^2}{H_{1:t} + \lambda^*_{1:t}} \Big)$. [sent-262, score-0.388]
99 If the norm in the above theorem is the Euclidean norm and $g(x) = \frac{1}{2}\|x\|^2$, then $\sup_{x \in K} \|x\| = A = B$ and we recover the Euclidean results above. [sent-265, score-0.16]
100 Proving relative loss bounds for on-line learning algorithms using Bregman divergences. [sent-292, score-0.059]
wordName wordTfidf (topN-words)
[('ft', 0.451), ('bh', 0.447), ('xt', 0.394), ('regret', 0.347), ('ht', 0.294), ('yt', 0.162), ('gt', 0.162), ('convex', 0.131), ('ogd', 0.109), ('online', 0.105), ('rt', 0.103), ('bregman', 0.097), ('hazan', 0.095), ('ct', 0.086), ('descent', 0.079), ('theorem', 0.06), ('gradient', 0.059), ('regularization', 0.059), ('bound', 0.058), ('hs', 0.056), ('euclidean', 0.054), ('convexity', 0.054), ('supx', 0.054), ('zinkevich', 0.051), ('norm', 0.05), ('curvature', 0.045), ('adaptive', 0.045), ('loss', 0.044), ('minx', 0.044), ('inf', 0.041), ('functions', 0.04), ('logarithmic', 0.039), ('norms', 0.038), ('recovers', 0.036), ('suppose', 0.035), ('strongly', 0.034), ('inequality', 0.034), ('gentile', 0.032), ('player', 0.03), ('division', 0.03), ('corollary', 0.03), ('divergence', 0.029), ('elad', 0.029), ('rakhlin', 0.029), ('shai', 0.029), ('yoram', 0.029), ('berkeley', 0.028), ('sequence', 0.026), ('analogue', 0.026), ('corollaries', 0.026), ('suffered', 0.026), ('divergences', 0.024), ('interpolates', 0.024), ('al', 0.024), ('algorithm', 0.024), ('rates', 0.024), ('initialize', 0.024), ('uniformly', 0.024), ('exhibited', 0.022), ('intermediate', 0.022), ('stated', 0.022), ('projection', 0.022), ('coef', 0.021), ('let', 0.021), ('competitive', 0.02), ('diameter', 0.02), ('rounds', 0.02), ('observe', 0.019), ('action', 0.019), ('log', 0.019), ('origin', 0.019), ('lemma', 0.018), ('choice', 0.018), ('bartlett', 0.018), ('oracle', 0.017), ('advance', 0.017), ('proof', 0.017), ('uc', 0.016), ('reader', 0.016), ('repeated', 0.016), ('respect', 0.016), ('predict', 0.016), ('atness', 0.016), ('bor', 0.016), ('investment', 0.016), ('kale', 0.016), ('nicol', 0.016), ('nondifferentiable', 0.016), ('satyen', 0.016), ('shrinking', 0.016), ('technique', 0.016), ('round', 0.016), ('bounds', 0.015), ('end', 0.015), ('achieves', 0.015), ('analyzed', 0.015), ('holds', 0.015), ('ca', 0.015), ('verify', 0.015), ('amit', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
2 0.28450239 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
3 0.25476146 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.
4 0.25257161 86 nips-2007-Exponential Family Predictive Representations of State
Author: David Wingate, Satinder S. Baveja
Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1
5 0.22307026 146 nips-2007-On higher-order perceptron algorithms
Author: Claudio Gentile, Fabio Vitale, Cristian Brotto
Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms.
6 0.20257983 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
7 0.18567853 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
8 0.17553049 165 nips-2007-Regret Minimization in Games with Incomplete Information
9 0.16503642 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
10 0.15085325 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
11 0.14434764 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
12 0.13292795 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
13 0.13084976 213 nips-2007-Variational Inference for Diffusion Processes
14 0.10693339 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
15 0.096772745 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
16 0.096471436 62 nips-2007-Convex Learning with Invariances
17 0.084267214 101 nips-2007-How SVMs can estimate quantiles and the median
18 0.083421536 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
19 0.079121687 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
20 0.075218812 118 nips-2007-Learning with Transformation Invariant Kernels
topicId topicWeight
[(0, -0.206), (1, -0.288), (2, 0.079), (3, 0.135), (4, 0.376), (5, -0.087), (6, -0.05), (7, -0.005), (8, -0.075), (9, -0.08), (10, 0.235), (11, 0.166), (12, -0.08), (13, 0.155), (14, -0.095), (15, 0.005), (16, -0.025), (17, -0.079), (18, -0.012), (19, 0.057), (20, -0.041), (21, -0.058), (22, 0.0), (23, 0.017), (24, -0.063), (25, 0.007), (26, 0.02), (27, -0.039), (28, 0.016), (29, 0.083), (30, 0.065), (31, 0.03), (32, -0.097), (33, -0.027), (34, -0.153), (35, 0.001), (36, -0.108), (37, 0.078), (38, -0.026), (39, -0.055), (40, -0.11), (41, 0.044), (42, 0.091), (43, -0.009), (44, -0.016), (45, 0.111), (46, 0.027), (47, -0.008), (48, -0.016), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.9763363 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
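The interpolation described in this abstract can be illustrated with a short sketch. The code below is our own reading of the idea, not the paper's exact algorithm: the extra regularization schedule lambda_t (1/sqrt(t) whenever no curvature is observed) and the function names project, grad, and curvature are assumptions made for this example.

```python
import numpy as np

def adaptive_ogd(project, grad, curvature, x0, T):
    """Sketch of an adaptive online gradient descent step-size rule.

    project:   Euclidean projection onto the convex set K
    grad:      grad(t, x) -> gradient of f_t at x
    curvature: curvature(t) -> a lower bound H_t >= 0 on the strong
               convexity of f_t (0 for merely convex losses)
    """
    x = x0
    cum = 0.0  # running sum of H_s + lambda_s seen so far
    history = []
    for t in range(1, T + 1):
        H_t = curvature(t)
        # Illustrative choice: add regularization only when no curvature
        # is observed, so no a priori lower bound on H_t is needed.
        lam_t = 0.0 if H_t > 0 else 1.0 / np.sqrt(t)
        cum += H_t + lam_t
        eta_t = 1.0 / cum                 # step size adapts to observed curvature
        x = project(x - eta_t * grad(t, x))
        history.append(x)
    return history
```

With all H_t ≥ H > 0 the step size reduces to Θ(1/(Ht)), the schedule that yields logarithmic regret; with H_t = 0 throughout, the added lambda_t makes eta_t = Θ(1/√t), recovering the √T rate, which is the interpolation the abstract claims.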
2 0.66518986 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.63680685 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T^{2/3} S^{1/3}) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning. 1
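The properties above come from an explore-then-exploit schedule organized in epochs. The sketch below is our own hedged illustration: the epoch-length schedule (one exploitation step per epoch index) stands in for the sample-complexity-driven choice in the paper, and the names fit, best_arm, and pull are assumptions of this example.

```python
import random

def epoch_greedy(contexts, K, fit, best_arm, pull):
    """Sketch of an Epoch-Greedy-style contextual bandit loop.

    contexts: iterator of side-information vectors
    K:        number of arms
    fit:      fit(samples) -> hypothesis trained on exploration samples
    best_arm: best_arm(h, x) -> arm recommended by hypothesis h
    pull:     pull(x, a) -> observed reward for arm a in context x
    """
    samples = []          # (context, arm, reward) triples from exploration
    it = iter(contexts)
    epoch = 0
    while True:
        epoch += 1
        # One exploration step per epoch: pull a uniformly random arm.
        try:
            x = next(it)
        except StopIteration:
            return
        a = random.randrange(K)
        samples.append((x, a, pull(x, a)))
        # Exploitation steps: act greedily with the current hypothesis.
        # The schedule "epoch many steps" is an illustrative stand-in
        # for the sample-complexity-based epoch length in the paper.
        h = fit(samples)
        for _ in range(epoch):
            try:
                x = next(it)
            except StopIteration:
                return
            pull(x, best_arm(h, x))
```

Note that the loop never consults a horizon T, matching property 1 in the abstract.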
4 0.60551983 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
5 0.58985579 146 nips-2007-On higher-order perceptron algorithms
Author: Claudio Gentile, Fabio Vitale, Cristian Brotto
Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms. 1 Introduction and preliminaries The problem of on-line learning linear-threshold functions from labeled data is one which has spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: on the one hand, linear functions combine flexibility with analytical and computational tractability; on the other hand, online algorithms provide efficient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) has greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the specific task of incremental learning (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.): any list of references would not do justice to the existing work on this subject. In this paper, we are interested in the problem of online learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where "higher" means here "higher than one", i.e., higher than first-order descent algorithms such as gradient descent or standard Perceptron-like algorithms). Contrary to other higher-order algorithms, such as the ridge regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and flexible manner second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simplified form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this has also computational advantages, allowing us to formulate an efficient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the fly between higher-order and first-order updates. This is especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. (The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reflects the authors' views.)
The second contribution of this paper is an experimental investigation of our algorithm on artificial and real-world datasets from various domains: we compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm defined in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental findings are the following: 1) Higher-order Perceptron always outperforms the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneficial in terms of convergence speed); 2) when dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the specific task at hand. Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, . . . the prediction algorithm is given an instance vector in Rn (for simplicity, all vectors are normalized, i.e., ||x_t|| = 1, where || · || is the Euclidean norm unless otherwise specified), and then guesses the binary label y_t ∈ {−1, 1} associated with x_t. We denote the algorithm's prediction by ŷ_t ∈ {−1, 1}. Then the true label y_t is disclosed. In the case when ŷ_t ≠ y_t we say that the algorithm has made a prediction mistake. We call an example a pair (x_t, y_t), and a sequence of examples S any sequence S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ Rn : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge-loss (or soft-margin) D_γ(u; S) of S w.r.t. a linear classifier u at a given margin value γ > 0: D_γ(u; S) = Σ_{t=1}^T max{0, γ − y_t uᵀx_t} (observe that D_γ(u; S) vanishes if and only if u separates S with margin at least γ). A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, . . . , T} of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to refine D_γ(u; S) so as to include only trials in M: D_γ(u; S) = Σ_{t∈M} max{0, γ − y_t uᵀx_t}. This gives bounds on A's performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed sequence S̃ = ((x̃_{i1}, y_{i1}), (x̃_{i2}, y_{i2}), . . . , (x̃_{im}, y_{im})), where each instance x_{ik} gets transformed into x̃_{ik} through a mapping depending only on the past behavior of the algorithm (i.e., only on examples up to trial t = i_{k−1}). As we will see in Section 3, this new sequence S̃ tends to be "more separable" than the original sequence, in the sense that if S is linearly separable with some margin, then the transformed sequence S̃ is likely to be separable with a larger margin.
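The cumulative hinge loss defined above is simple to compute directly. The snippet below is our own illustration (the names are ours), evaluating D_γ(u; S) either over all trials or only over a set M of mistaken trials, as in the refined definition.

```python
import numpy as np

def hinge_loss(u, X, y, gamma, M=None):
    """Cumulative hinge loss D_gamma(u; S) over the trials in M
    (all trials if M is None). Rows of X are unit-norm instances,
    y holds labels in {-1, +1}, and u is a unit-norm comparator."""
    idx = range(len(y)) if M is None else M
    return sum(max(0.0, gamma - y[t] * np.dot(u, X[t])) for t in idx)
```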
2 The Higher-order Perceptron algorithm The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ_1, ρ_2, ..., and maintains a product matrix B_k (initialized to the identity matrix I) and a sum vector v_k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector x_t ∈ Rn, the algorithm computes its binary prediction value ŷ_t as the sign of the inner product between vector B_{k−1} v_{k−1} and vector B_{k−1} x_t. If ŷ_t ≠ y_t then matrix B_{k−1} is updated multiplicatively as B_k = B_{k−1}(I − ρ_k x_t x_tᵀ) while vector v_{k−1} is updated additively through the standard Perceptron rule v_k = v_{k−1} + y_t x_t. The new matrix B_k and the new vector v_k will be used in the next trial. If ŷ_t = y_t no update is performed (hence the algorithm is mistake driven). Observe that ρ_k = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm's behavior be substantially different from Perceptron's) we need to ensure Σ_{k=1}^∞ ρ_k = ∞. In the sequel, our standard choice will be ρ_k = c/k, with c ∈ (0, 1). See Sections 3 and 4.

Parameters: ρ_1, ρ_2, ... ∈ [0, 1). Initialization: B_0 = I; v_0 = 0; k = 1.
Repeat for t = 1, 2, . . . , T:
  1. Get instance x_t ∈ Rn, ||x_t|| = 1;
  2. Predict ŷ_t = SGN(w_{k−1}ᵀ x_t) ∈ {−1, +1}, where w_{k−1} = B_{k−1}ᵀ B_{k−1} v_{k−1};
  3. Get label y_t ∈ {−1, +1};
  4. If ŷ_t ≠ y_t then: v_k = v_{k−1} + y_t x_t; B_k = B_{k−1}(I − ρ_k x_t x_tᵀ); k ← k + 1.
Figure 1: The Higher-order Perceptron algorithm (for p = 2).

Implementing Higher-order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits. 1) Primal version. We store and update an n×n matrix A_k = B_kᵀ B_k and an n-dimensional column vector v_k. Matrix A_k is updated as A_k = A_{k−1} − ρ A_{k−1} x xᵀ − ρ x xᵀ A_{k−1} + ρ² (xᵀ A_{k−1} x) x xᵀ, taking O(n²) operations, while v_k is updated as in Figure 1. Computing the algorithm's margin vᵀAx can then be carried out in time quadratic in the dimension n of the input space. 2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by X_k the n × k matrix whose columns are the n-dimensional instance vectors x_1, ..., x_k where a mistake occurred so far, and by y_k the k-dimensional column vector of the corresponding labels. We store and update the k × k matrix D_k = [d_{i,j}^{(k)}], the k × k diagonal matrix H_k = DIAG{h_k}, h_k = (h_1^{(k)}, ..., h_k^{(k)}) = X_kᵀ X_k y_k, and the k-dimensional column vector g_k = y_k + D_k H_k 1_k, being 1_k a vector of k ones. If we interpret the primal matrix A_k above as A_k = I + Σ_{i,j=1}^k d_{i,j}^{(k)} x_i x_jᵀ, it is not hard to show that the margin value w_{k−1}ᵀ x is equal to g_{k−1}ᵀ X_{k−1}ᵀ x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k²) extra inner products by updating D and H in the following way. We let D_0 and H_0 be empty matrices. Then, given D_{k−1} and H_{k−1} = DIAG{h_{k−1}}, we have (observe that, by construction, D_k is a symmetric matrix)

  D_k = [ D_{k−1}      −ρ_k b_k      ]
        [ −ρ_k b_kᵀ    d_{k,k}^{(k)} ],  where b_k = D_{k−1} X_{k−1}ᵀ x_k, and d_{k,k}^{(k)} = ρ_k² x_kᵀ X_{k−1} b_k − 2ρ_k + ρ_k².

On the other hand, H_k = DIAG{h_{k−1} + y_k X_{k−1}ᵀ x_k, h_k^{(k)}}, with h_k^{(k)} = y_{k−1}ᵀ X_{k−1}ᵀ x_k + y_k. Observe that on trials when ρ_k = 0 matrix D_{k−1} is padded with a zero row and a zero column. This amounts to saying that matrix A_k = I + Σ_{i,j=1}^k d_{i,j}^{(k)} x_i x_jᵀ is not updated, i.e., A_k = A_{k−1}.
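A minimal runnable sketch of the primal p = 2 algorithm in Figure 1 follows. This is our own illustration rather than the authors' code; for clarity it stores B explicitly, so each mistake costs O(n²).

```python
import numpy as np

def higher_order_perceptron(X, y, c=0.4):
    """Sketch of the Higher-order Perceptron (p = 2).
    X: (T, n) array of unit-norm instances; y: labels in {-1, +1}."""
    T, n = X.shape
    B = np.eye(n)          # product matrix B_{k-1}, initially I
    v = np.zeros(n)        # Perceptron sum vector v_{k-1}
    k, mistakes = 1, 0
    for t in range(T):
        x = X[t]
        w = B.T @ (B @ v)                  # w_{k-1} = B^T B v_{k-1}
        y_hat = 1 if w @ x >= 0 else -1    # sign of <B v, B x>
        if y_hat != y[t]:
            mistakes += 1
            rho = c / k                    # rho_k = c/k, c in (0, 1)
            v = v + y[t] * x               # additive Perceptron update
            B = B @ (np.eye(n) - rho * np.outer(x, x))  # multiplicative update
            k += 1
    return v, B, mistakes
```

Setting c = 0 leaves B at the identity, so the sketch degenerates into the standard Perceptron, matching the convention FO = HO2(0) used in Section 4.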
A closer look at the update mechanism for D and H just described allows us to conclude that the overall number of extra inner products needed to compute g_k is actually quadratic only in the number of past mistaken trials having ρ_k > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4). 3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute B_k z by unwrapping B_k as in B_k z = B_{k−1}(I − ρ x xᵀ)z = B_{k−1} z′, where the vector z′ = z − ρ x (xᵀz) can be calculated in time O(n). Thus computing the margin vᵀ B_{k−1}ᵀ B_{k−1} x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efficient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping g : θ ∈ Rn → ∇(‖θ‖_p²/2) ∈ Rn (this mapping has also been used in [12, 11]; recall that setting p = O(log n) yields an algorithm similar to Winnow [18], and notice that p = 2 yields g = identity). Then, in Figure 1, we normalize instance vectors x_t w.r.t. the p-norm, we define w_{k−1} = B_{k−1}ᵀ g(B_{k−1} v_{k−1}), and generalize the matrix update as B_k = B_{k−1}(I − ρ_k x_t g(x_t)ᵀ). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the "second-order" information contained in the matrix B_k. One can easily see that the above-mentioned argument for computing the margin g(B_{k−1} v_{k−1})ᵀ B_{k−1} x in time O(nk) still holds.
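The unwrapping observation in item 3) translates directly into code. The helper below is our own sketch: it stores only the (ρ_i, x_i) pairs that define the rank-one factors, so applying B_k to a vector costs O(nk) with no matrix ever formed.

```python
import numpy as np

def apply_B(factors, z):
    """Compute B_k z, where
    B_k = (I - rho_1 x_1 x_1^T) (I - rho_2 x_2 x_2^T) ... (I - rho_k x_k x_k^T)
    and factors = [(rho_1, x_1), ..., (rho_k, x_k)].
    The rightmost factor acts first, each application taking O(n)."""
    for rho, x in reversed(factors):
        z = z - rho * x * (x @ z)
    return z
```

For p = 2 the margin ⟨B_{k−1} v_{k−1}, B_{k−1} x_t⟩ then takes two such calls per trial, which is the O(nk) cost claimed above.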
3 Analysis We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classifier over the transformed sequence

  S̃ = (B_0 x_{t(1)}, y_{t(1)}), (B_1 x_{t(2)}, y_{t(2)}), (B_2 x_{t(3)}, y_{t(3)}), . . . ,   (1)

being t(k) the trial where the k-th mistake occurs, and B_k the k-th matrix produced by the algorithm. Observe that each feature vector x_{t(k)} gets transformed by a matrix B_k depending on past examples only. This is relevant to the argument that S̃ tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat "on-line structure" does not seem to be shared by other competing higher-order algorithms, such as the "ridge regression-like" algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2. Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T). Let the sequence of parameters ρ_k satisfy 0 ≤ ρ_k ≤ (1 − c)/(1 + |v_{k−1}ᵀ x_t|), where x_t is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m of mistakes satisfies

  m ≤ α D_γ(u; S̃_c)/γ + α²/(2γ²) + (α/γ) √( α D_γ(u; S̃_c)/γ + α²/(4γ²) ),   (2)

holding for any γ > 0 and any unit norm vector u ∈ Rn, where α = α(c) = (2 − c)/c. (The subscript c in S̃_c emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρ_k = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences; see, e.g., [12].) Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We fix an arbitrary sequence S = (x_1, y_1), (x_2, y_2), . . . , (x_T, y_T) and let M ⊆ {1, 2, . . . , T} be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||B_k v_k||² over mistaken trials. Notice that the matrix B_kᵀ B_k is positive semidefinite for any k. We can write ||B_k v_k||² = ||B_{k−1}(I − ρ_k x_t x_tᵀ)(v_{k−1} + y_t x_t)||² (from the update rules v_k = v_{k−1} + y_t x_t and B_k = B_{k−1}(I − ρ_k x_t x_tᵀ)) = ||B_{k−1} v_{k−1} + y_t (1 − ρ_k y_t v_{k−1}ᵀ x_t − ρ_k) B_{k−1} x_t||² = ||B_{k−1} v_{k−1}||² + 2 y_t r_k v_{k−1}ᵀ B_{k−1}ᵀ B_{k−1} x_t + r_k² ||B_{k−1} x_t||² (using ||x_t|| = 1), where we set for brevity r_k = 1 − ρ_k y_t v_{k−1}ᵀ x_t − ρ_k. We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure r_k ≥ 0. Observe that y_t v_{k−1}ᵀ x_t ≥ 0 implies r_k ≥ 0 if and only if ρ_k ≤ 1/(1 + y_t v_{k−1}ᵀ x_t). On the other hand, if y_t v_{k−1}ᵀ x_t < 0 then, in order for r_k to be nonnegative, it suffices to pick ρ_k ≤ 1. In both cases ρ_k ≤ (1 − c)/(1 + |v_{k−1}ᵀ x_t|) implies r_k ≥ c > 0, and also r_k² ≤ (1 + ρ_k |v_{k−1}ᵀ x_t| − ρ_k)² ≤ (2 − c)². Now, using y_t v_{k−1}ᵀ B_{k−1}ᵀ B_{k−1} x_t ≤ 0 (combined with r_k ≥ 0), we conclude that ||B_k v_k||² − ||B_{k−1} v_{k−1}||² ≤ (2 − c)² ||B_{k−1} x_t||² = (2 − c)² x_tᵀ A_{k−1} x_t, where we set A_k = B_kᵀ B_k. A simple (and crude) upper bound on the last term follows by observing that ||x_t|| = 1 implies x_tᵀ A_{k−1} x_t ≤ ||A_{k−1}||, the spectral norm (largest eigenvalue) of A_{k−1}. (A slightly more refined bound can be derived which depends on the trace of the matrices I − A_k; details will be given in the full version of this paper.) Since a factor matrix of the form (I − ρ x xᵀ) with ρ ≤ 1 and ||x|| = 1 has spectral norm one, we have x_tᵀ A_{k−1} x_t ≤ ||A_{k−1}|| ≤ Π_{i=1}^{k−1} ||I − ρ_i x_{t(i)} x_{t(i)}ᵀ||² ≤ 1. Therefore, summing over k = 1, . . . , m = |M| (or, equivalently, over t ∈ M) and using v_0 = 0 yields the upper bound

  ||B_m v_m||² ≤ (2 − c)² m.   (3)

To find a lower bound of the left-hand side of (3), we first pick any unit norm vector u ∈ Rn, and apply the standard Cauchy-Schwartz inequality: ||B_m v_m|| ≥ uᵀ B_m v_m. Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write uᵀ B_k v_k − uᵀ B_{k−1} v_{k−1} = r_k y_t uᵀ B_{k−1} x_t ≥ r_k (γ − max{0, γ − y_t uᵀ B_{k−1} x_t}), where the last inequality follows from r_k ≥ 0 and holds for any margin value γ > 0. We sum the above over k = 1, . . . , m and exploit c ≤ r_k ≤ 2 − c after rearranging terms. This gets ||B_m v_m|| ≥ uᵀ B_m v_m ≥ c γ m − (2 − c) D_γ(u; S̃_c). Combining with (3) and solving for m gives the claimed bound. From the above result one can see that our algorithm might be viewed as a standard Perceptron algorithm operating on the transformed sequence S̃_c in (1). We now give a qualitative argument, which is suggestive of the improved margin properties of S̃_c. Assume for simplicity that all examples (x_t, y_t) in the original sequence are correctly classified by hyperplane u with the same margin γ = y_t uᵀ x_t > 0, where t = t(k). According to Theorem 1, the parameters ρ_1, ρ_2, . . . should be small positive numbers. Assume, again for simplicity, that all ρ_k are set to the same small enough value ρ > 0. Then, up to first order, matrix B_k = Π_{i=1}^k (I − ρ x_{t(i)} x_{t(i)}ᵀ) can be approximated as B_k ≈ I − ρ Σ_{i=1}^k x_{t(i)} x_{t(i)}ᵀ.
Then, to the extent that the above approximation holds, we can write (again, a similar argument holds in the more general setting p ≥ 2; the reader should notice how important the dependence of B_k on the past is to this argument):

  y_t uᵀ B_{k−1} x_t ≈ y_t uᵀ (I − ρ Σ_{i=1}^{k−1} x_{t(i)} x_{t(i)}ᵀ) x_t = y_t uᵀ x_t − ρ Σ_{i=1}^{k−1} (y_{t(i)} uᵀ x_{t(i)}) (y_t y_{t(i)} x_{t(i)}ᵀ x_t) = γ − ρ γ y_t v_{k−1}ᵀ x_t.

Now, y_t v_{k−1}ᵀ x_t is the margin of the (first-order) Perceptron vector v_{k−1} over a mistaken trial for the Higher-order Perceptron vector w_{k−1}. Since the two vectors v_{k−1} and w_{k−1} are correlated (recall that v_{k−1}ᵀ w_{k−1} = v_{k−1}ᵀ B_{k−1}ᵀ B_{k−1} v_{k−1} = ||B_{k−1} v_{k−1}||² ≥ 0) the mistaken condition y_t w_{k−1}ᵀ x_t ≤ 0 is more likely to imply y_t v_{k−1}ᵀ x_t ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update B_{k−1} → B_k might be skipped (this is equivalent to setting ρ_k = 0), and still Theorem 1 would hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence. 4 Experiments We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artificial and real-world, from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as first-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2. Task 1: DNA microarrays and artificial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We first tested on a couple of well-known DNA microarray datasets. For each dataset, we first generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the "Breast" dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The "Lymphoma" dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artificially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively. The datasets have been generated at random by first generating two (normalized) target vectors u ∈ {−1, 0, +1}^500, where the first 50 components are selected independently at random in {−1, +1} and the remaining 450
components are 0. Then we set η = 0.0 for the first target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]^500 and then normalized. If u · x_t ≥ γ then a +1 label is associated with x_t. If u · x_t ≤ −γ then a −1 label is associated with x_t. The labels so obtained are flipped with probability η. If |u · x_t| < γ then x_t is rejected and a new vector x_t is drawn. We call the two datasets "Artificial0.0" and "Artificial0.1". We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used. Task 2: Text categorization. The text categorization datasets are derived from the first 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF-IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classification problems by "binarizing" consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent categories, respectively, within the first 20,000 news stories of RCV1 (we did not use the most frequent category because of its significant overlap with the other ones). We call these datasets RCV1x, where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not significantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions. Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks, the USPS dataset and the MNIST dataset [16], and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel. Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higher-order Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was "implicit primal" when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set ρ_k = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8 (this setting fulfills the condition on ρ_k stated in Theorem 1). Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron.
On a mistaken round t = t(k), this algorithm sets ρ_k = c/k if y_t v_{k−1}ᵀ x_t ≥ 0, and ρ_k = 0 otherwise (thus, when y_t v_{k−1}ᵀ x_t < 0 the matrix B_{k−1} is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO ("First Order"), the Second-order algorithm is denoted by SO ("Second Order"), while the Higher-order algorithm with norm parameter p and ρ_k = c/k is abbreviated as HOp(c). Thus, for instance, FO = HO2(0). Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HOp(c) with c > 0 outperforms HOp(0). On the other hand, the comparison HOp(c) vs. SO depends on the specific task. On the DNA datasets, HOp(c) with c > 0 is clearly superior in Breast. On Lymphoma, HOp(c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO2 turns out to be only slightly worse than SO. On the artificial datasets HOp(c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO2 tends to perform better than SO. On USPS, HO2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO2 cuts the matrix updates roughly by half, still maintaining a good performance. In all cases HO2 (either sparse or not) significantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary classification, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HOp can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efficient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]). Table 1: Training and test error on the two datasets "Breast" and "Lymphoma". Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassified patterns in the test set. The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO2 differs from FO because of the c parameter. We emphasized the comparison HO7(0) vs. HO7(c) with best c among the tested values. According to a Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered significant. In bold are the smallest figures achieved on each row of the table.
            FO            HO2           HO4           HO7(0)        HO7           HO10          SO
            TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST    TRAIN TEST
  BREAST    45.2  23.4%   21.7  16.4%   24.5  13.3%   47.4  15.7%   24.5  12.0%   32.4  13.5%   29.6  15.0%
  LYMPHOMA  22.1  11.8%   19.6  10.0%   18.9  10.0%   23.0  11.5%   20.0  11.5%   23.1  11.9%   19.3   9.6%

[Figure 2 comprises four panels: training updates vs. training epochs (top row) and test error rates vs. training epochs (bottom row), on Artificial0.0 (left) and Artificial0.1 (right), for FO = HO2(0.0), HO2(0.4), HO4(0.4), HO7(0.0), HO7(0.4), and SO (a = 0.2).] Figure 2: Experiments on the two artificial datasets (Artificial0.0, on the left, and Artificial0.1, on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artificial0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difficulty telling apart the two kinds of algorithms HOp(0.0) and HOp(c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former. References [1] A. Alizadeh, et al. (2000). Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511. [2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342. [3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150. [4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246. [5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classifiers with on-line and active learning. JMLR, 6, 1579–1619. [6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485. Table 2: Experimental results on the four binary classification tasks derived from RCV1. "Train" denotes the number of training corrections, while "Test" gives the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest figures achieved for each of the 8 combinations of dataset (RCV1x, x = 70, 101, 4, 59) and phase (training or test).
            FO            HO2           SO
            TRAIN TEST    TRAIN TEST    TRAIN TEST
  RCV1_70    993  7.20%    941  6.83%    880  6.95%
  RCV1_101   673  6.39%    665  5.81%    677  5.48%
  RCV1_4     803  6.14%    783  5.94%    819  6.05%
  RCV1_59    767  6.45%    762  6.04%    760  6.84%

Table 3: Experimental results on the OCR tasks. "Train" denotes the total number of training corrections, summed over the 10 categories, while "Test" denotes the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO2 we also report (in parentheses) the number of matrix updates during training. In bold are the smallest figures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test).

                  FO             HO2            Sparse HO2            SO
                  TRAIN  TEST    TRAIN  TEST    TRAIN        TEST     TRAIN  TEST
  USPS   GAUSS     1385  6.53%     945  4.76%     965 (440)   5.13%    1003  5.05%
  USPS   POLY      1609  7.37%    1090  5.71%    1081 (551)   5.52%    1054  5.53%
  MNIST  GAUSS     5834  2.10%    5351  1.79%    5363 (2596)  1.81%    5684  1.82%
  MNIST  POLY      8148  3.04%    6404  2.27%    6476 (3311)  2.28%    6440  2.03%

[7] N. Cesa-Bianchi, A. Conconi & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal on Computing, 34(3), 640–668. [8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linear-threshold algorithms. JMLR, 7, 1205–1230. [9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297. [10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a fixed budget. NIPS 18, MIT Press, pp. 259–266. [11] C. Gentile (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213–242. [12] C. Gentile (2003). The Robustness of the p-norm Algorithms. Machine Learning, 53(3), pp. 265–299. [13] A.J. Grove, N. Littlestone & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210. [14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984. [15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97, 325–343. [16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60. [17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387. [18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318. [19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261. [20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004. [21] A.B.J. Novikoff (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622. [22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/. [23] S. Shalev-Shwartz & Y. Singer (2006). Online Learning Meets Optimization in the Dual. COLT 2006, pp. 423–437. [24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press. [25] V. Vovk (2001). Competitive on-line statistics. International Statistical Review, 69, 213–248.
6 0.5234499 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
7 0.44417384 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
8 0.43604565 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
9 0.42636451 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
10 0.40871572 86 nips-2007-Exponential Family Predictive Representations of State
11 0.37420139 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
12 0.36828271 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
13 0.34872836 62 nips-2007-Convex Learning with Invariances
14 0.33884761 165 nips-2007-Regret Minimization in Games with Incomplete Information
15 0.33707088 213 nips-2007-Variational Inference for Diffusion Processes
16 0.32269642 101 nips-2007-How SVMs can estimate quantiles and the median
17 0.31047246 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
18 0.30431703 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
19 0.28639039 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
20 0.28517953 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
topicId topicWeight
[(5, 0.038), (13, 0.041), (16, 0.084), (18, 0.014), (19, 0.017), (21, 0.032), (31, 0.011), (34, 0.015), (35, 0.052), (43, 0.274), (46, 0.011), (47, 0.061), (49, 0.011), (83, 0.159), (90, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.73924798 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
2 0.58525115 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models
Author: Tatyana Sharpee
Abstract: This paper compares a family of methods for characterizing neural feature selectivity with natural stimuli in the framework of the linear-nonlinear model. In this model, the neural firing rate is a nonlinear function of a small number of relevant stimulus components. The relevant stimulus dimensions can be found by maximizing one of the family of objective functions, Rényi divergences of different orders [1, 2]. We show that maximizing one of them, Rényi divergence of order 2, is equivalent to least-square fitting of the linear-nonlinear model to neural data. Next, we derive reconstruction errors in relevant dimensions found by maximizing Rényi divergences of arbitrary order in the asymptotic limit of large spike numbers. We find that the smallest errors are obtained with Rényi divergence of order 1, also known as Kullback-Leibler divergence. This corresponds to finding relevant dimensions by maximizing mutual information [2]. We numerically test how these optimization schemes perform in the regime of low signal-to-noise ratio (small number of spikes and increasing neural noise) for model visual neurons. We find that optimization schemes based on either least square fitting or information maximization perform well even when the number of spikes is small. Information maximization provides slightly, but significantly, better reconstructions than least square fitting. This makes the problem of finding relevant dimensions, together with the problem of lossy compression [3], one of the examples where information-theoretic measures are no more data limited than those derived from least squares. 1
3 0.58509964 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy
Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1
4 0.58174253 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
Author: Lars Buesing, Wolfgang Maass
Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1
6 0.56929487 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
7 0.56814903 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.56777728 63 nips-2007-Convex Relaxations of Latent Variable Training
9 0.56752509 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
10 0.56662285 195 nips-2007-The Generalized FITC Approximation
11 0.56603152 185 nips-2007-Stable Dual Dynamic Programming
12 0.56593841 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
13 0.56587332 40 nips-2007-Bundle Methods for Machine Learning
14 0.56503409 49 nips-2007-Colored Maximum Variance Unfolding
15 0.56402534 102 nips-2007-Incremental Natural Actor-Critic Algorithms
16 0.56397879 116 nips-2007-Learning the structure of manifolds using random projections
17 0.56321311 70 nips-2007-Discriminative K-means for Clustering
18 0.56239671 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
19 0.56202084 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
20 0.56132305 200 nips-2007-The Tradeoffs of Large Scale Learning