nips nips2007 nips2007-206 knowledge-graph by maker-knowledge-mining

206 nips-2007-Topmoumoute Online Natural Gradient Algorithm


Source: pdf

Author: Nicolas Le Roux, Pierre-Antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Topmoumoute online natural gradient algorithm Pierre-Antoine Manzagol University of Montreal manzagop@iro. [sent-1, score-0.437]

2 ca Abstract Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. [sent-8, score-0.497]

3 The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. [sent-9, score-0.443]

4 Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. [sent-10, score-0.792]

5 We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets. [sent-11, score-0.598]

6 Consider the optimization of a cost on a training set with access to a validation set. [sent-15, score-0.2]

7 As the end objective is a good solution with respect to generalization, one often uses early stopping: optimizing the training error while monitoring the validation error to fight overfitting. [sent-16, score-0.264]
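As a rough illustration of this protocol (a minimal sketch, not from the paper; the `step` and `valid_err` callables and the patience scheme are assumptions):

```python
def train_with_early_stopping(step, valid_err, max_iters=10000, patience=20):
    """Optimize the training cost while monitoring validation error.

    `step` performs one descent update on the training cost; `valid_err`
    returns the current validation error (both supplied by the caller).
    """
    best, since_best = float("inf"), 0
    for _ in range(max_iters):
        step()                      # descend on the training cost
        v = valid_err()
        if v < best:                # validation improved: reset patience
            best, since_best = v, 0
        else:                       # validation stalled or worsened
            since_best += 1
            if since_best >= patience:
                break               # early stop to fight overfitting
    return best
```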

8 In terms of gradients, the gradient of the cost on the training set is never collinear with the true gradient, and the dot product between the two eventually becomes negative. [sent-19, score-0.421]

9 From this standpoint, we discover new justifications behind the natural gradient [1]. [sent-23, score-0.379]

10 Depending on certain assumptions, it corresponds either to the direction minimizing the probability of increasing generalization error, or to the direction in which the generalization error is expected to decrease the fastest. [sent-24, score-0.487]

11 Unfortunately, natural gradient algorithms suffer from poor scaling properties, both with respect to computation time and memory, when the number of parameters becomes large. [sent-25, score-0.412]

12 To address this issue, we propose a generally applicable online approximation of natural gradient that scales linearly with the number of parameters (and requires computation time comparable to stochastic gradient descent). [sent-26, score-0.966]

13 1 Natural gradient Let L be a cost defined as L(θ) = ∫ L(x, θ) p(x) dx, where L is a loss function over some parameters θ and over the random variable x with distribution p(x). [sent-28, score-0.369]

14 In the case of non-convex optimization, gradient descent is a successful technique. [sent-31, score-0.413]

15 The approach consists in progressively updating θ using the gradient g = dL/dθ. [sent-32, score-0.349]

16 [1] showed that the parameter space is a Riemannian space of metric C (the covariance of the gradients), and introduced the natural gradient as the direction of steepest descent in this space. [sent-33, score-0.783]

17 The natural gradient direction is therefore given by C^{-1} g. [sent-34, score-0.518]
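As a concrete (hypothetical) illustration of this direction, here is a minimal numpy sketch that estimates C from a batch of per-example gradients and solves for C^{-1} g; the ridge term `lam` is an added numerical-stability assumption, not part of the definition:

```python
import numpy as np

def natural_gradient_direction(per_example_grads, lam=1e-4):
    """Return (C + lam I)^{-1} g for an (n, p) array of per-example gradients,
    where g is the mean gradient and C the empirical covariance."""
    g = per_example_grads.mean(axis=0)
    centered = per_example_grads - g
    C = centered.T @ centered / len(per_example_grads)   # empirical covariance
    return np.linalg.solve(C + lam * np.eye(len(g)), g)  # natural direction

# Usage: v = natural_gradient_direction(np.random.randn(128, 10))
```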

18 [6] showed that, in the case of a mean squared cost function, the Hessian is equal to the sum of the covariance matrix of the gradients and of an additional term that vanishes to 0 as the training error goes down. [sent-37, score-0.518]

19 Indeed, when the data are generated from the model, the Hessian and the covariance matrix are equal. [sent-38, score-0.197]

20 There are two important differences: the covariance matrix C is positive-definite, which makes the technique more stable, but contains no explicit second order information. [sent-39, score-0.232]

21 The covariance matrix accounts for slight variations in the set of training samples. [sent-41, score-0.249]

22 2 A new justification for natural gradient Until now, we supposed we had access to the true distribution p. [sent-44, score-0.379]

23 a gradient g) that, although close to the true cost (resp. [sent-47, score-0.369]

24 Instead of only monitoring the validation error, we consider using as descent direction an estimate of the direction that maximizes the probability of reducing the generalization error. [sent-55, score-0.559]

25 We know that if v^T g is negative then the generalization error drops (for a reasonably small step) when stepping in the direction of v. [sent-58, score-0.283]
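The claim rests on a first-order Taylor expansion; as a sketch:

```latex
% Stepping from \theta to \theta + \eta v with a small step size \eta > 0:
L(\theta + \eta v) \;\approx\; L(\theta) + \eta\, v^{\top} g ,
% so the generalization error drops precisely when v^{\top} g < 0.
```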

26 With a rough approximation, one can consider the g_i's as draws from the true gradient distribution and assume all the gradients are independent and identically distributed. [sent-62, score-0.395]

27 The central limit theorem then gives ḡ ∼ N(g, C/n) (1), where ḡ is by definition the gradient on the training set and C is the true covariance matrix of ∂L(x, θ)/∂θ w.r.t. p(x). [sent-63, score-0.534]

28 We will now show that, both in the Bayesian setting (with a Gaussian prior) and in the frequentist setting (with some restrictions over the type of gradient considered), the natural gradient is optimal in some sense. [sent-64, score-0.772]

29 We would thus like to define a posterior over g given the samples g_i in order to have a posterior distribution over v^T g for any given direction v. [sent-67, score-0.217]

30 In the case of σ = ∞, C_σ = I and this is the batch gradient descent. [sent-74, score-0.317]

31 This direction is the natural gradient and does not depend on the value of σ. [sent-81, score-0.518]

32 Since ḡ ∼ N(g, C/n), for a direction of the form v = M ḡ we have v^T g = ḡ^T M^T g ∼ N(g^T M^T g, g^T M^T C M g / n) (6). The matrix M* which minimizes the probability of v^T g being positive satisfies M* = argmin_M g^T M g / √(g^T M^T C M g) (7). The numerator of the derivative of this quantity is g g^T M^T C M g g^T − 2 C M g g^T M g g^T. [sent-87, score-0.586]

33 Thus, for this derivative to be 0 for all g, one must have M ∝ C^{-1} and we obtain the same result as in the Bayesian case: the natural gradient represents the direction minimizing the probability of increasing the generalization error. [sent-89, score-0.638]
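Sign conventions aside, this optimality is easy to check numerically: by Cauchy-Schwarz, the standardized mean g^T M g / √(g^T M^T C M g) is maximized exactly when M ∝ C^{-1}. A small illustrative sketch (the data and candidate matrices are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
C = A @ A.T + np.eye(p)      # an arbitrary positive definite covariance
g = rng.standard_normal(p)

def standardized_mean(M):
    # mean over std of v^T g for v = M gbar (up to a common sqrt(n) factor)
    return (g @ M @ g) / np.sqrt(g @ M.T @ C @ M @ g)

for name, M in [("identity", np.eye(p)),
                ("random PSD", A @ A.T),
                ("natural C^{-1}", np.linalg.inv(C))]:
    print(name, standardized_mean(M))
# The natural-gradient choice attains the largest ratio, i.e. the step is the
# least likely to increase the generalization error.
```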

34 3 Online natural gradient The previous sections provided a number of justifications for using the natural gradient. [sent-90, score-0.473]

35 Indeed, considering p as the number of parameters and n as the number of examples, a direct batch implementation of the natural gradient is O(p^2) in space and O(np^2 + p^3) in time, associated respectively with the gradients' covariance storage, computation and inversion. [sent-92, score-0.581]

36 This section reviews existing low complexity implementations of the natural gradient, before proposing TONGA, a new low complexity, online and generally applicable implementation suited to large scale problems. [sent-93, score-0.316]

37 In the previous sections we assumed the true covariance matrix C to be known. [sent-94, score-0.197]

38 3.1 Low complexity natural gradient implementations [9] proposes a method specific to the case of multilayer perceptrons. [sent-97, score-0.483]

39 By operating on blocks of the covariance matrix, this approach attains a lower computational complexity.^1 [sent-98, score-0.221]

40 However, the technique is quite involved, specific to multilayer perceptrons, and requires two assumptions: Gaussian distributed inputs and a number of hidden units much smaller than the number of input units. [sent-99, score-0.181]

41 [2] offers a more general approach based on the Sherman-Morrison formula used in Kalman filters: the technique maintains an empirical estimate of the inverse covariance matrix that can be updated in O(p^2). [sent-100, score-0.252]
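As a rough sketch of such an O(p^2) update via Sherman-Morrison (an illustration under assumptions, not the exact Kalman-style recursion of [2]; the decay factor `gamma` is invented for the example):

```python
import numpy as np

def sm_inverse_update(P, g, gamma=0.999):
    """Given P = C^{-1}, return (gamma * C + g g^T)^{-1} in O(p^2)
    using the Sherman-Morrison formula."""
    P = P / gamma                 # inverse of the decayed covariance
    Pg = P @ g
    return P - np.outer(Pg, Pg) / (1.0 + g @ Pg)

# Usage sketch: P = np.eye(p); then P = sm_inverse_update(P, g) per gradient.
```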

42 The first uses conjugate gradient descent to solve Cv = g. [sent-104, score-0.413]

43 This technique is used in the minibatch setting, where Cv can be computed cheaply through two matrix-vector products. [sent-107, score-0.196]

44 However, estimating the gradient covariance only from a small number of examples in one minibatch yields unstable estimates. [sent-108, score-0.544]

45 3.2 TONGA Existing techniques fail to provide an implementation of the natural gradient adequate for the large scale setting. [sent-110, score-0.399]

46 TONGA was designed to address these issues, which it does by maintaining a low-rank approximation of the covariance and by casting both the problem of finding the low-rank approximation and that of computing the natural gradient in a lower dimensional space, thereby attaining a much lower complexity. [sent-112, score-0.872]

47 What we exploit here is that although a covariance matrix needs many gradients to be estimated, we can take advantage of an observed property that it generally varies smoothly as training proceeds and moves in parameter space. [sent-113, score-0.359]

48 3.2.1 Computing the natural gradient direction between two eigendecompositions Even though our motivation for the use of natural gradient implied the covariance matrix of the empirical gradients, we will use the second moment (i.e., the uncentered covariance matrix). [sent-116, score-1.094]

49 Indeed, in the batch setting, we have (assuming C is the centered covariance matrix and g the mean) v = C^{-1} g, thus Cv = g. [sent-120, score-0.229]

50 But then, (C + gg^T) v = g + gg^T v = g (1 + g^T v), and thus v̄ = (C + gg^T)^{-1} g = v / (1 + g^T v) (8). (^1 Though the technique allows for a compact representation of the covariance matrix, the working memory requirement remains the same.) [sent-121, score-0.544]

51 Even though the direction is the same, the scale changes and the norm of the direction is bounded by 1 / (||g|| cos(g, v)). [sent-122, score-0.298]
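Equation (8) is easy to verify numerically; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6
A = rng.standard_normal((p, p))
C = A @ A.T + np.eye(p)                          # centered covariance
g = rng.standard_normal(p)

v = np.linalg.solve(C, g)                        # v = C^{-1} g
v_bar = np.linalg.solve(C + np.outer(g, g), g)   # using the second moment

# Same direction, norm shrunk by (1 + g^T v), as in equation (8):
assert np.allclose(v_bar, v / (1.0 + g @ v))
```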

52 When presented with a new gradient, we integrate its information using the following update formula^2: C_t = γ Ĉ_{t−1} + g_t g_t^T (9), where Ĉ_0 = 0 and Ĉ_{t−1} is the low-rank approximation at time step t − 1. [sent-124, score-0.738]

53 C_t is now likely of greater rank, and the problem resides in computing its low-rank approximation Ĉ_t. [sent-125, score-0.178]

54 Writing Ĉ_{t−1} = X_{t−1} X_{t−1}^T, we have C_t = X_t X_t^T with X_t = [√γ X_{t−1}, g_t]. With such covariance matrices, computing the (regularized) natural direction v_t amounts to v_t = (C_t + λI)^{-1} g_t = (X_t X_t^T + λI)^{-1} g_t (10), i.e., v_t = (X_t X_t^T + λI)^{-1} X_t y_t with y_t = [0, …, 0, 1]^T (11). [sent-126, score-1.366]

55 Using the Woodbury identity with positive definite matrices [7], we have v_t = X_t (X_t^T X_t + λI)^{-1} y_t (12). If X_t is of size p × r (with r < p, thus yielding a covariance matrix of rank r), the cost of this computation is O(pr^2 + r^3). [sent-130, score-0.522]

56 However, since the Gram matrix G_t = X_t^T X_t can be rewritten as G_t = [[γ G_{t−1}, √γ X_{t−1}^T g_t], [√γ g_t^T X_{t−1}, g_t^T g_t]] (13), the cost of computing G_t from G_{t−1} reduces to O(pr + r^3). [sent-131, score-1.824]
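Putting equations (9)-(12) together, here is a minimal numpy sketch of one TONGA update (parameter values are illustrative; the Gram matrix is recomputed in O(pr^2) for clarity, whereas equation (13) updates it incrementally from G_{t−1}):

```python
import numpy as np

def tonga_step(X, g_t, lam=1e-3, gamma=0.995):
    """Append g_t to the low-rank factor X (p x r, with C_{t-1} ~= X X^T)
    and return the regularized natural direction of equation (12)."""
    X = np.hstack([np.sqrt(gamma) * X, g_t[:, None]])  # X_t = [sqrt(gamma) X_{t-1}, g_t]
    r = X.shape[1]
    y = np.zeros(r)
    y[-1] = 1.0                                        # so that g_t = X_t y_t
    G = X.T @ X                                        # r x r Gram matrix
    v = X @ np.linalg.solve(G + lam * np.eye(r), y)    # v_t = X_t (G_t + lam I)^{-1} y_t
    return X, v

# Usage sketch: X = np.zeros((p, 0)); X, v = tonga_step(X, g) on each gradient,
# re-truncating X to rank k every few steps (see the sketch below).
```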

57 This can be done at low cost using its relation to that of G_t: writing G_t = V D V^T, we have C_t = (X_t V D^{-1/2}) D (X_t V D^{-1/2})^T (14). The cost of such an eigendecomposition is O(kr^2 + pkr) (for the eigendecomposition of the Gram matrix and the computation of the eigenvectors, respectively). [sent-136, score-0.455]

58 Since the cost of computing the natural direction is O(pr + r^3), it is computationally more efficient to let the rank of X_t grow for several steps (using formula 12 in between) and then compute the eigendecomposition using C_{t+b} = X_{t+b} X_{t+b}^T with X_{t+b} = [γ^{b/2} U_t, γ^{(b−1)/2} g_{t+1}, …, g_{t+b}], where U_t is the low-rank factor kept at the last eigendecomposition. [sent-137, score-0.478]
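A sketch of this Gram-based re-truncation (regularization and the γ re-weighting are omitted; `eps` guards against tiny eigenvalues):

```python
import numpy as np

def truncate_rank(X, k, eps=1e-12):
    """Keep the top-k eigen-directions of C = X X^T using only the small
    Gram matrix G = X^T X (eq. 14): if G = V D V^T, the columns of
    X V D^{-1/2} are orthonormal eigenvectors of C."""
    G = X.T @ X
    d, V = np.linalg.eigh(G)                    # eigenvalues, ascending
    d, V = d[-k:], V[:, -k:]                    # keep the k largest
    U = X @ (V / np.sqrt(np.maximum(d, eps)))   # orthonormal eigenvectors of C
    return U * np.sqrt(np.maximum(d, eps))      # factor with U U^T ~= C (rank k)
```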

59 3.2.3 Computational complexity The computational complexity of TONGA depends on the complexity of updating the low-rank approximation and on the complexity of computing the natural gradient. [sent-143, score-0.438]

60 The cost of updating the approximation is in O(k(k + b)^2 + p(k + b)k) (as above, using r = k + b). [sent-144, score-0.186]

61 The cost of computing the natural gradient v_t is in O(p(k + b) + (k + b)^3) (again, as above, using r = k + b). [sent-145, score-0.519]

62 Assuming k + b ∈ O(√p) and k ≤ b, TONGA's total computational cost per natural gradient computation is then O(pb). [sent-146, score-0.496]

63 Furthermore, by operating on minibatch gradients of size b′, we end up with a cost per example of O(pb / b′). [sent-147, score-0.295]

64 Choosing b′ = b yields O(p) per example, the same as stochastic gradient descent. [sent-148, score-0.449]

65 The result is an approximate natural gradient with low complexity, general applicability and flexibility over the tradeoff between computation and the quality of the estimate. [sent-151, score-0.412]

66 (^2 The second term is not weighted by (1 − γ) so that the influence of g_t in C_t is the same for all t, even t = 0.) [sent-152, score-0.28]

67 4 Block-diagonal online natural gradient for neural networks One might wonder if there are better approximations of the covariance matrix C than computing its first k eigenvectors. [sent-157, score-0.664]

68 One possibility is a block-diagonal approximation from which to retain only the first k eigenvectors of every block (the value of k can be different for each block). [sent-158, score-0.276]

69 Indeed, [4] showed that the Hessian of a neural network with one hidden layer trained with the cross-entropy cost converges to a block diagonal matrix during optimization. [sent-159, score-0.428]

70 These blocks are composed of the weights linking all the hidden units to one output unit and all the input units to one hidden unit. [sent-160, score-0.247]

71 Figure 1 shows the correlation between the standard stochastic gradients of the parameters of a 16-50-26 neural network. [sent-162, score-0.253]

72 The first blocks represent the weights going from the input units to each hidden unit (thus 50 blocks of size 17, bias included) and the following represent the weights going from the hidden units to each output unit (26 blocks of size 51). [sent-163, score-0.349]

73 Thus, instead of selecting only k eigenvectors to represent the full covariance matrix, we can select k eigenvectors for every block, yielding the same total cost. [sent-165, score-0.305]
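A sketch of this per-block truncation, assuming the per-example gradients have already been split into one (n, p_b) array per block (names and shapes are assumptions):

```python
import numpy as np

def blockwise_lowrank(grad_blocks, k):
    """For each block, keep the top-k eigenvectors of the gradient second
    moment, using the n x n Gram trick instead of a p_b x p_b eigenproblem."""
    factors = []
    for Gb in grad_blocks:                       # Gb: (n, p_b) gradients
        n = len(Gb)
        d, V = np.linalg.eigh(Gb @ Gb.T / n)     # small n x n eigenproblem
        d, V = d[-k:], V[:, -k:]                 # top-k eigenpairs
        Ub = Gb.T @ (V / np.sqrt(np.maximum(n * d, 1e-12)))
        factors.append(Ub * np.sqrt(d))          # rank-k factor per block
    return factors                               # block-diagonal approximation
```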

74 In the matrices shown in figure 1, which are of size 2176, a value of k = 5 yields an approximation of rank 380. [sent-167, score-0.193]

75 The block diagonal approximation is therefore very cost effective. [sent-176, score-0.403]

76 [Figure 2: ratio of the squared Frobenius norms for the full matrix and block diagonal approximations of the covariance; panels show (a) a full view and a zoom.] [sent-178, score-0.382]

78 5 Experiments We performed a small number of experiments with TONGA approximating the full covariance matrix, keeping the overhead of the natural gradient small (i.e., limiting the rank of the approximation). [sent-197, score-0.66]

79 Regrettably, TONGA performed only as well as stochastic gradient descent, while being rather sensitive to the hyperparameter values. [sent-198, score-0.428]

80 The following experiments, on the other hand, use TONGA with the block diagonal approximation and yield impressive results. [sent-199, score-0.319]

81 We believe this is a reflection of the phenomenon illustrated in figure 2: the block diagonal approximation makes for a very cost effective approximation of the covariance matrix. [sent-200, score-0.608]

82 All the experiments optimized hyperparameters on a validation set (not shown here) and selected the best set of hyperparameters for testing, while trying to keep the overhead due to natural gradient calculations small. [sent-201, score-0.589]

83 When λ goes to infinity, TONGA becomes the standard stochastic gradient algorithm. [sent-205, score-0.451]
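This limit is easy to see numerically: as λ grows, the regularized direction (C + λI)^{-1} g aligns with the plain gradient g. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
C = A @ A.T
g = rng.standard_normal(p)

for lam in [1e-2, 1e2, 1e6]:
    v = np.linalg.solve(C + lam * np.eye(p), g)
    cos = v @ g / (np.linalg.norm(v) * np.linalg.norm(g))
    print(lam, cos)   # cosine tends to 1: TONGA degenerates into plain SGD
```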

84 [Figure 3: Comparison between stochastic gradient and TONGA on the MNIST dataset (50000 training examples), in terms of training and test classification error and Negative Log-Likelihood (NLL), plotted against CPU time in seconds.] [sent-241, score-0.584]

85 Figure 3 shows that in terms of training CPU time (which includes the overhead due to TONGA), TONGA converges much faster in training NLL, as well as in test classification error and test NLL, than ordinary stochastic and minibatch gradient descent on this task. [sent-243, score-0.921]

86 One can also note that minibatch stochastic gradient is able to profit from matrix-matrix multiplications, but this advantage is mainly seen in training classification error. [sent-244, score-0.581]

87 Figure 4 shows that in terms of training CPU time, TONGA allows much faster convergence than ordinary stochastic gradient descent on this task, as well as lower classification error. [sent-251, score-0.673]

88 [Figure 4: Comparison between stochastic gradient descent and block-diagonal TONGA w.r.t. NLL and classification error, on training and validation sets for the rectangles problem (900,000 training examples).] [sent-299, score-0.608]

92 6 Discussion [3] reviews the different gradient descent techniques in the online setting and discusses their respective properties. [sent-303, score-0.515]

93 A search direction v = M g (with g the gradient and M a positive semidefinite matrix) is optimal (in terms of convergence speed) when M converges to H^{-1}. [sent-306, score-0.443]

94 Given the aforementioned relationship between the covariance and the Hessian matrices, the natural gradient is close to optimal in the sense defined above, provided the model has enough capacity. [sent-308, score-0.516]

95 On mixture models where the block-diagonal approximation is appropriate, it allows us to maintain an approximation of much higher rank than a standard low-rank approximation of the full covariance matrix. [sent-309, score-0.442]

96 First, by looking for the descent direction with either the greatest probability of not increasing generalization error or the largest expected decrease in generalization error, we obtain new justifications for the natural gradient descent direction. [sent-311, score-1.122]

97 Second, we present an online low-rank approximation of natural gradient descent with computational complexity and CPU time similar to stochastic gradient descent. [sent-312, score-0.809]

98 In a number of experimental comparisons we find this optimization technique to beat stochastic gradient in terms of speed and generalization (or in generalization for a given amount of training time). [sent-313, score-0.649]

99 Adaptive method of realizing natural gradient learning for multilayer perceptrons. [sent-323, score-0.427]

100 Complexity issues in natural gradient descent method for training multi-layer perceptrons. [sent-379, score-0.559]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tonga', 0.527), ('batchsize', 0.324), ('gradient', 0.285), ('gt', 0.28), ('xt', 0.197), ('ct', 0.168), ('block', 0.149), ('cpu', 0.146), ('stochastic', 0.143), ('nll', 0.141), ('direction', 0.139), ('covariance', 0.137), ('descent', 0.128), ('gg', 0.124), ('gradients', 0.11), ('cv', 0.108), ('diagonal', 0.102), ('minibatch', 0.101), ('natural', 0.094), ('cost', 0.084), ('rank', 0.077), ('approximation', 0.068), ('frobenius', 0.068), ('generalization', 0.067), ('units', 0.065), ('frequentist', 0.064), ('eigendecomposition', 0.064), ('validation', 0.064), ('seconds', 0.064), ('matrix', 0.06), ('eigenvectors', 0.059), ('online', 0.058), ('gi', 0.058), ('hessian', 0.056), ('vt', 0.056), ('error', 0.052), ('training', 0.052), ('classification', 0.052), ('blocks', 0.051), ('multilayer', 0.048), ('montreal', 0.043), ('overhead', 0.043), ('norms', 0.042), ('cm', 0.041), ('ght', 0.041), ('topmoumoute', 0.041), ('zoom', 0.041), ('rectangles', 0.041), ('span', 0.039), ('ratio', 0.037), ('hyperparameters', 0.036), ('detrimental', 0.035), ('orr', 0.035), ('technique', 0.035), ('updating', 0.034), ('complexity', 0.033), ('hidden', 0.033), ('low', 0.033), ('computation', 0.033), ('mnist', 0.032), ('batch', 0.032), ('keep', 0.031), ('derivative', 0.03), ('amari', 0.03), ('riemannian', 0.03), ('progressively', 0.03), ('wonder', 0.03), ('stopping', 0.029), ('tting', 0.029), ('justi', 0.027), ('matrices', 0.027), ('yielding', 0.026), ('cations', 0.025), ('sake', 0.025), ('yang', 0.025), ('kept', 0.025), ('negative', 0.025), ('full', 0.024), ('faster', 0.024), ('goes', 0.023), ('increasing', 0.023), ('implementations', 0.023), ('directions', 0.023), ('ordinary', 0.022), ('reviews', 0.022), ('pr', 0.022), ('early', 0.022), ('monitoring', 0.022), ('bring', 0.022), ('gram', 0.022), ('yt', 0.022), ('setting', 0.022), ('deep', 0.021), ('yields', 0.021), ('default', 0.021), ('scale', 0.02), ('samples', 0.02), ('formula', 0.02), ('train', 0.02), ('convergence', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

Author: Nicolas Le Roux, Pierre-Antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.

2 0.18567853 21 nips-2007-Adaptive Online Gradient Descent

Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett

Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.

3 0.11673644 213 nips-2007-Variational Inference for Diffusion Processes

Author: Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, John S. Shawe-taylor

Abstract: Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy. 1

4 0.10862534 146 nips-2007-On higher-order perceptron algorithms

Author: Claudio Gentile, Fabio Vitale, Cristian Brotto

Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the ”logarithmic behavior” of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms. 1 Introduction and preliminaries The problem of on-line learning linear-threshold functions from labeled data is one which have spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: On the one hand, linear functions combine flexiblity with analytical and computational tractability, on the other hand, online algorithms provide efficient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) have greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the specific task of incremental learning (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.) : Any list of references would not do justice of the existing work on this subject. In this paper, we are interested in the problem of online learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where ”higher” means here ”higher than one”, i.e., ”higher than first-order” descent algorithms such as gradientdescent or standard Perceptron-like algorithms”). Contrary to other higher-order algorithms, such as the ridge regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and flexible manner second-order statistics about the data with the ”logarithmic behavior” of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simplified form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this has also computational advantages, allowing us to formulate an efficient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the fly between higher-order and first-order updates. This is ∗ The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778. This publication only reflects the authors’ views. especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. 
The second contribution of this paper is an experimental investigation of our algorithm on artificial and real-world datasets from various domains: We compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm defined in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental findings are the following: 1) Higher-order Perceptron is always outperforming the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneficial in terms of convergence speed); 2) When dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the specific task at hand. Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, . . . the prediction algorithm is given an instance vector in Rn (for simplicity, all vectors are normalized, i.e., ||xt || = 1, where || · || is the Euclidean norm unless otherwise specified), and then guesses the binary label yt ∈ {−1, 1} associated with xt . We denote the algorithm’s prediction by yt ∈ {−1, 1}. Then the true label yt is disclosed. In the case when yt = yt we say that the algorithm has made a prediction mistake. We call an example a pair (xt , yt ), and a sequence of examples S any sequence S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ Rn : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge-loss (or soft-margin) Dγ (u; S) of S w.r.t. a T linear classifier u at a given margin value γ > 0: Dγ (u; S) = t=1 max{0, γ − yt u xt } (observe that Dγ (u; S) vanishes if and only if u separates S with margin at least γ. A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, . . . , T } of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to refine Dγ (u; S) so as to include only trials in M: Dγ (u; S) = t∈M max{0, γ − yt u xt }. This gives bounds on A’s performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed ˜ sequence S = ((˜ i1 , yi1 ), (˜ i2 , yi2 ), . . . , (˜ im , yim )), where each instance xik gets transformed x x x ˜ into xik through a mapping depending only on the past behavior of the algorithm (i.e., only on ˜ examples up to trial t = ik−1 ). As we will see in Section 3, this new sequence S tends to be ”more separable” than the original sequence, in the sense that if S is linearly separable with some margin, ˜ then the transformed sequence S is likely to be separable with a larger margin. 
2 The Higher-order Perceptron algorithm The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ1 , ρ2 , ..., and maintains a product matrix Bk (initialized to the identity matrix I) and a sum vector v k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector xt ∈ Rn , the algorithm computes its binary prediction value yt as the sign of the inner product between vector Bk−1 v k−1 and vector Bk−1 xt . If yt = yt then matrix Bk−1 is updates multiplicatively as Bk = Bk−1 (I − ρk xt xt ) while vector v k−1 is updated additively through the standard Perceptron rule v k = v k−1 + yt xt . The new matrix Bk and the new vector v k will be used in the next trial. If yt = yt no update is performed (hence the algorithm is mistake driven). Observe that ρk = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm’s ∞ behavior be substantially different from Perceptron’s) we need to ensure k=1 ρk = ∞. In the sequel, our standard choice will be ρk = c/k, with c ∈ (0, 1). See Sections 3 and 4. Implementing Higher-Order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits. 1) Primal version. We store and update an n×n matrix Ak = Bk Bk and an n-dimensional column Parameters: ρ1 , ρ2 , ... ∈ [0, 1). Initialization: B0 = I; v 0 = 0; k = 1. Repeat for t = 1, 2, . . . , T : 1. Get instance xt ∈ Rn , ||xt || = 1; 2. Predict yt = SGN(wk−1 xt ) ∈ {−1, +1}, where wk−1 = Bk−1 Bk−1 v k−1 ; 3. Get label yt ∈ {−1, +1}; v k = v k−1 + yt xt 4. if yt = yt then: Bk k = Bk−1 (I − ρk xt xt ) ← k + 1. Figure 1: The Higher-order Perceptron algorithm (for p = 2). vector v k . Matrix Ak is updated as Ak = Ak−1 − ρAk−1 xx − ρxx Ak−1 + ρ2 (x Ak−1 x)xx , taking O(n2 ) operations, while v k is updated as in Figure 1. Computing the algorithm’s margin v Ax can then be carried out in time quadratic in the dimension n of the input space. 2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by Xk the n × k matrix whose columns are the n-dimensional instance vectors x1 , ..., xk where a mistake occurred so far, and y k be the k-dimensional column vector of the corresponding (k) labels. We store and update the k × k matrix Dk = [di,j ]k i,j=1 , the k × k diagonal matrix Hk = DIAG {hk }, (k) (k) hk = (h1 , ..., hk ) = Xk Xk y k , and the k-dimensional column vector g k = y k + Dk Hk 1k , being 1k a vector of k ones. If we interpret the primal matrix Ak above as Ak = (k) k I + i,j=1 di,j xi xj , it is not hard to show that the margin value wk−1 x is equal to g k−1 Xk−1 x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k 2 ) extra inner products by updating D and H in the following way. We let D0 and H0 be empty matrices. Then, given Dk−1 and Hk−1 = DIAG{hk−1 }, we have1 Dk = Dk−1 −ρk bk (k) , where bk = Dk−1 Xk−1 xk , and dk,k = ρ2 xk Xk−1 bk − 2ρk + ρ2 . On (k) k k −ρk bk dk,k the other hand, Hk = DIAG {hk−1 (k) (k) + yk Xk−1 xk , hk }, with hk = y k−1 Xk−1 xk + yk . Observe that on trials when ρk = 0 matrix Dk−1 is padded with a zero row and a zero column. (k) k This amounts to say that matrix Ak = I + i,j=1 di,j xi xj , is not updated, i.e., Ak = Ak−1 . 
A closer look at the above update mechanism allows us to conclude that the overall extra inner products needed to compute g k is actually quadratic only in the number of past mistaken trials having ρk > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4). 3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute Bk z by unwrapping Bk as in Bk z = Bk−1 (I − ρxx )z = Bk−1 z , where vector z = (z − ρx x z) can be calculated in time O(n). Thus computing the margin v Bk−1 Bk−1 x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efficient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping2 g : θ ∈ Rn → θ ||θ||2 / 2 ∈ Rn . Then, in Figure 1, we p normalize instance vectors xt w.r.t. the p-norm, we define wk−1 = Bk−1 g(Bk−1 v k−1 ), and generalize the matrix update as Bk = Bk−1 (I − ρk xt g(xt ) ). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the ”second-order” information contained in the matrix Bk . One can easily see that the above-mentioned argument for computing the margin g(Bk−1 v k−1 ) Bk−1 x in time O(nk) still holds. 1 Observe that, by construction, Dk is a symmetric matrix. This mapping has also been used in [12, 11]. Recall that setting p = O(log n) yields an algorithm similar to Winnow [18]. Also, notice that p = 2 yields g = identity. 2 3 Analysis We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classifier over the transformed sequence ˜ S = (B0 xt(1) , yt(1) ), (B1 xt(2) , yt(2) ), (B2 xt(3) , yt(3) ), . . . , (1) being t(k) the trial where the k-th mistake occurs, and Bk the k-th matrix produced by the algorithm. Observe that each feature vector xt(k) gets transformed by a matrix Bk depending on past examples ˜ only. This is relevant to the argument that S tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat ”on-line structure” does not seem to be shared by other competing higher-order algorithms, such as the ”ridge regression-like” algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2. Theorem 1 Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ). Let the sequence of parameters ρk satisfy 0 ≤ ρk ≤ 1−c , where xt is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m 1+|v k−1 xt | of mistakes satisfies3 ˜ ˜ Dγ (u; Sc )) α2 α Dγ (u; Sc )) α2 m≤α + 2+ α + 2, (2) γ 2γ γ γ 4γ holding for any γ > 0 and any unit norm vector u ∈ Rn , where α = α(c) = (2 − c)/c. Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We fix an arbitrary sequence S = (x1 , y1 ), (x2 , y2 ), . . . , (xT , yT ) and let M ⊆ {1, 2, . . . 
, T } be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||Bk v k ||2 over mistaken trials. Notice that the matrix Bk Bk is positive semidefinite for any k. We can write ||Bk v k ||2 = ||Bk−1 (I − ρk xt xt ) (v k−1 + yt xt ) ||2 (from the update rule v k = v k−1 + yt xt and Bk = Bk−1 (I − ρk xt xt ) ) = ||Bk−1 v k−1 + yt (1 − ρk yt v k−1 xt − ρk )Bk−1 xt ||2 2 = ||Bk−1 v k−1 || + 2 yt rk v k−1 Bk−1 Bk−1 xt + (using ||xt || = 1) 2 rk ||Bk−1 xt ||2 , where we set for brevity rk = 1 − ρk yt v k−1 xt − ρk . We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure rk ≥ 0. Observe that yt v k−1 xt ≥ 0 implies rk ≥ 0 if and only if ρk ≤ 1/(1 + yt v k−1 xt ). On the other hand, if yt v k−1 xt < 0 then, in order for rk to be nonnegative, it suffices to pick ρk ≤ 1. In both cases ρk ≤ (1 − c)/(1 + |v k−1 xt |) implies 2 rk ≥ c > 0, and also rk ≤ (1+ρk |v k−1 xt |−ρk )2 ≤ (2−c)2 . Now, using yt v k−1 Bk−1 Bk−1 xt ≤ 0 (combined with rk ≥ 0), we conclude that ||Bk v k ||2 − ||Bk−1 v k−1 ||2 ≤ (2 − c)2 ||Bk−1 xt ||2 = (2 − c)2 xt Ak−1 xt , where we set Ak = Bk Bk . A simple4 (and crude) upper bound on the last term follows by observing that ||xt || = 1 implies xt Ak−1 xt ≤ ||Ak−1 ||, the spectral norm (largest eigenvalue) of Ak−1 . Since a factor matrix of the form (I − ρ xx ) with ρ ≤ 1 and ||x|| = 1 has k−1 spectral norm one, we have xt Ak−1 xt ≤ ||Ak−1 || ≤ i=1 ||I − ρi xt(i) xt(i) ||2 ≤ 1. Therefore, summing over k = 1, . . . , m = |M| (or, equivalently, over t ∈ M) and using v 0 = 0 yields the upper bound ||Bm v m ||2 ≤ (2 − c)2 m. (3) To find a lower bound of the left-hand side of (3), we first pick any unit norm vector u ∈ Rn , and apply the standard Cauchy-Schwartz inequality: ||Bm v m || ≥ u Bm v m . Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write u Bk v k − u Bk−1 v k−1 = rk yt u Bk−1 xt ≥ rk (γ − max{0, γ − yt u Bk−1 xt }), where the last inequality follows from rk ≥ 0 and holds for any margin value γ > 0. We sum 3 ˜ The subscript c in Sc emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρk = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences (see, e.g., [12]). 4 A slightly more refined bound can be derived which depends on the trace of matrices I − Ak . Details will be given in the full version of this paper. the above over k = 1, . . . , m and exploit c ≤ rk ≤ 2 − c after rearranging terms. This gets ˜ ||Bm v m || ≥ u Bm v m ≥ c γ m − (2 − c)Dγ (u; Sc ). Combining with (3) and solving for m gives the claimed bound. From the above result one can see that our algorithm might be viewed as a standard Perceptron ˜ algorithm operating on the transformed sequence Sc in (1). We now give a qualitative argument, ˜ which is suggestive of the improved margin properties of Sc . Assume for simplicity that all examples (xt , yt ) in the original sequence are correctly classified by hyperplane u with the same margin γ = yt u xt > 0, where t = t(k). According to Theorem 1, the parameters ρ1 , ρ2 , . . . should be small positive numbers. Assume, again for simplicity, that all ρk are set to the same small enough k value ρ > 0. Then, up to first order, matrix Bk = i=1 (I − ρ xt(i) xt(i) ) can be approximated as Bk I −ρ k i=1 xt(i) xt(i) . 
Then, to the extent that the above approximation holds, we can write:5 yt u Bk−1 xt = yt u I −ρ = yt u xt − ρ yt k−1 i=1 xt(i) xt(i) xt = yt u k−1 i=1 I −ρ k−1 i=1 yt(i) xt(i) yt(i) xt(i) xt yt(i) u xt(i) yt(i) xt(i) xt = γ − ρ γ yt v k−1 xt . Now, yt v k−1 xt is the margin of the (first-order) Perceptron vector v k−1 over a mistaken trial for the Higher-order Perceptron vector wk−1 . Since the two vectors v k−1 and wk−1 are correlated (recall that v k−1 wk−1 = v k−1 Bk−1 Bk−1 v k−1 = ||Bk−1 v k−1 ||2 ≥ 0) the mistaken condition yt wk−1 xt ≤ 0 is more likely to imply yt v k−1 xt ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update Bk−1 → Bk might be skipped (this is equivalent to setting ρk = 0), still Theorem 1 would hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence. 4 Experiments We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artificial and real-world from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as first-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2. Task 1: DNA microarrays and artificial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We first tested on a couple of well-known DNA microarray datasets. For each dataset, we first generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the “Breast” dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The “Lymphoma” dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-Cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artificially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively. The datasets have been generated at random by first generating two (normalized) target vectors u ∈ {−1, 0, +1}500 , where the first 50 components are selected independently at random in {−1, +1} and the remaining 450 5 Again, a similar argument holds in the more general setting p ≥ 2. 
The reader should notice how important the dependence of Bk on the past is to this argument. components are 0. Then we set η = 0.0 for the first target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]500 and then normalized. If u · xt ≥ γ then a +1 label is associated with xt . If u · xt ≤ −γ then a −1 label is associated with xt . The labels so obtained are flipped with probability η. If |u · xt | < γ then xt is rejected and a new vector xt is drawn. We call the two datasets ”Artificial 0.0 ” and ”Artificial 0.1 ”. We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used. Task 2: Text categorization. The text categorization datasets are derived from the first 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF - IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classification problems by “binarizing” consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent6 categories, respectively, within the first 20,000 news stories of RCV1. We call these datasets RCV1x , where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not significantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions. Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks: the USPS dataset and the MNIST dataset [16] and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel. Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higherorder Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was ”implicit primal” when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set7 ρk = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8. Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron. 
On a mistaken round t = t(k), this algorithm sets ρk = c/k if yt v k−1 xt ≥ 0, and ρk = 0 otherwise (thus, when yt v k−1 xt < 0 the matrix Bk−1 is not updated). For the sake of brevity, the standard Perceptron algorithm is called FO (”First Order”), the Second-order algorithm is denoted by SO (”Second Order”), while the Higher-order algorithm with norm parameter p and ρk = c/k is abbreviated as HOp (c). Thus, for instance, FO = HO2 (0). Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HOp (c) with c > 0 outperforms HOp (0). On the other hand, the comparison HOp (c) vs. SO depends on the specific task. On the DNA datasets, HOp (c) with c > 0 is clearly superior in Breast. On Lymphoma, HOp (c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO2 turns out to be only slightly worse than SO. On the artificial datasets HOp (c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO2 tends to perform better than SO. On USPS, HO2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO2 cuts the matrix updates roughly by half, still maintaining a good performance. In all cases HO2 (either sparse or not) significantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary clas6 7 We did not use the most frequent category because of its significant overlap with the other ones. Notice that this setting fulfills the condition on ρk stated in Theorem 1. Table 1: Training and test error on the two datasets ”Breast” and ”Lymphoma”. Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassified patterns in the test set, The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO2 differs from FO because of the c parameter. We emphasized the comparison HO7 (0) vs. HO7 (c) with best c among the tested values. According to Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered significant. In bold are the smallest figures achieved on each row of the table. 
FO TRAIN TEST TRAIN TEST LYMPHOMA HO 2 HO 4 HO 7 (0) HO 7 HO 10 SO 45.2 23.4% 22.1 11.8% 21.7 16.4% 19.6 10.0% 24.5 13.3% 18.9 10.0% 47.4 15.7% 23.0 11.5% 24.5 12.0% 20.0 11.5% 32.4 13.5 23.1 11.9% 29.6 15.0% 19.3 9.6% FO = HO 2(0.0) Training updates vs training epochs on Artificial 0.0 SO # of training updates 800 * HO 4(0.4) 600 HO 7(0.0) * 400 300 * * * * * SO 2400 HO 2(0.4) 700 500 HO 7 (0.4) * 2000 * 1200 400 2 3 5 10 15 20 * 1 * * 2 3 * Test error rates * * * (a = 0.2) HO 2(0.4) HO 4(0.4) * * * * HO 7(0.0) HO 7 (0.4) 14% Test error rates (minus 10%) FO = HO 2(0.0) SO 18% * HO 7(0.0) HO 7(0.4) 5 10 15 20 # of training epochs Test error rates vs training epochs on Artificial 0.0 22% * * # of training epochs 26% (a = 0.2) HO 2(0.4) HO 4(0.4) 1600 800 * 1 FO = HO 2(0.0) Training updates vs training epochs on Artificial 0.1 (a = 0.2) # of training updates B REAST FO = HO 2(0.0) Test error rates vs training epochs on Artificial 0.1 SO 26% 22% * * * * * * * * (a = 0.2) HO 2(0.4) HO 4(0.4) 18% HO 7(0.0) 14% HO 7 (0.4) 10% 10% 6% 6% 1 2 3 5 10 # of training epochs 15 20 1 2 3 5 10 15 20 # of training epochs Figure 2: Experiments on the two artificial datasets (Artificial0.0 , on the left, and Artificial0.1 , on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artificial0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difficulty telling apart the two kinds of algorithms HOp (0.0) and HOp (c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former. sification, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HOp can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efficient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]). References [1] A. Alizadeh, et al. (2000). Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511. [2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342. [3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150. [4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential familiy of distributions. Machine Learning, 43(3), 211–246. [5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classifiers with on-line and active learning. JMLR, 6, 1579–1619. [6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485. Table 2: Experimental results on the four binary classification tasks derived from RCV1. ”Train” denotes the number of training corrections, while ”Test” gives the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest figures achieved for each of the 8 combinations of dataset (RCV1x , x = 70, 101, 4, 59) and phase (training or test). 
FO TRAIN TEST 993 673 803 767 7.20% 6.39% 6.14% 6.45% RCV170 RCV1101 RCV14 RCV159 HO 2 TRAIN TEST 941 665 783 762 6.83% 5.81% 5.94% 6.04% SO TRAIN TEST 880 677 819 760 6.95% 5.48% 6.05% 6.84% Table 3: Experimental results on the OCR tasks. ”Train” denotes the total number of training corrections, summed over the 10 categories, while ”Test” denotes the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO2 we also reported (in parentheses) the number of matrix updates during training. In bold are the smallest figures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test). FO TRAIN U SPS M NIST G AUSS P OLY G AUSS P OLY TEST 1385 1609 5834 8148 6.53% 7.37% 2.10% 3.04% HO 2 TRAIN TEST 945 1090 5351 6404 4.76% 5.71% 1.79% 2.27% Sparse HO2 SO TRAIN TEST TRAIN TEST 965 (440) 1081 (551) 5363 (2596) 6476 (3311) 5.13% 5.52% 1.81% 2.28% 1003 1054 5684 6440 5.05% 5.53% 1.82% 2.03% [7] N. Cesa-Bianchi, A. Conconi & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal of Computing, 34(3), 640–668. [8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linearthreshold algorithms. JMLR, 7, 1205–1230. [9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297. [10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a fixed budget. NIPS 18, MIT Press, pp. 259–266. [11] C. Gentile (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213–242. [12] C. Gentile (2003). The Robustness of the p-norm Algorithms. Machine Learning, 53(3), pp. 265–299. [13] A.J. Grove, N. Littlestone & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210. [14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984. [15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97, 325–343. [16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60. [17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387. [18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318. [19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261. [20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004. [21] A.B.J. Novikov (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622. [22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/. [23] S. Shalev-Shwartz & Y. Singer (2006). Online Learning Meets Optimization in the Dual. COLT 2006, pp. 423–437. [24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press. [25] Vovk, V. (2001). Competitive on-line statistics. International Statistical Review, 69, 213-248.

5 0.10052869 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

Author: Omer Bobrowski, Ron Meir, Shy Shoham, Yonina Eldar

Abstract: It is becoming increasingly evident that organisms acting in uncertain dynamical environments often employ exact or approximate Bayesian statistical calculations in order to continuously estimate the environmental state, integrate information from multiple sensory modalities, form predictions and choose actions. What is less clear is how these putative computations are implemented by cortical neural networks. An additional level of complexity is introduced because these networks observe the world through spike trains received from primary sensory afferents, rather than directly. A recent line of research has described mechanisms by which such computations can be implemented using a network of neurons whose activity directly represents a probability distribution across the possible “world states”. Much of this work, however, uses various approximations, which severely restrict the domain of applicability of these implementations. Here we make use of rigorous mathematical results from the theory of continuous time point process filtering, and show how optimal real-time state estimation and prediction may be implemented in a general setting using linear neural networks. We demonstrate the applicability of the approach with several examples, and relate the required network properties to the statistical nature of the environment, thereby quantifying the compatibility of a given network with its environment. 1

6 0.099477738 102 nips-2007-Incremental Natural Actor-Critic Algorithms

7 0.095244482 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

8 0.094413236 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

9 0.087325618 199 nips-2007-The Price of Bandit Information for Online Optimization

10 0.080808818 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

11 0.078786612 135 nips-2007-Multi-task Gaussian Process Prediction

12 0.077551752 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

13 0.076450989 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

14 0.062386632 200 nips-2007-The Tradeoffs of Large Scale Learning

15 0.058002811 145 nips-2007-On Sparsity and Overcompleteness in Image Models

16 0.055271331 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields

17 0.05448978 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration

18 0.054191347 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

19 0.054145314 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering

20 0.05373596 156 nips-2007-Predictive Matrix-Variate t Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.197), (1, -0.058), (2, 0.035), (3, 0.05), (4, 0.043), (5, -0.008), (6, -0.078), (7, -0.043), (8, -0.076), (9, -0.086), (10, 0.171), (11, 0.076), (12, 0.011), (13, 0.111), (14, -0.107), (15, 0.077), (16, -0.115), (17, -0.057), (18, -0.001), (19, 0.043), (20, -0.013), (21, -0.0), (22, -0.016), (23, 0.015), (24, -0.024), (25, -0.025), (26, 0.009), (27, 0.013), (28, 0.051), (29, -0.018), (30, -0.066), (31, -0.092), (32, -0.1), (33, -0.141), (34, -0.094), (35, -0.06), (36, 0.063), (37, -0.021), (38, 0.122), (39, -0.144), (40, -0.034), (41, 0.061), (42, 0.223), (43, 0.036), (44, 0.071), (45, 0.065), (46, -0.071), (47, 0.115), (48, -0.136), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96562034 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.

2 0.61429012 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

3 0.59718019 21 nips-2007-Adaptive Online Gradient Descent

Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett

Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between √T and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1

4 0.51269352 146 nips-2007-On higher-order perceptron algorithms

Author: Claudio Gentile, Fabio Vitale, Cristian Brotto

Abstract: A new algorithm for on-line learning linear-threshold functions is proposed which efficiently combines second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms. An initial theoretical analysis is provided suggesting that our algorithm might be viewed as a standard Perceptron algorithm operating on a transformed sequence of examples with improved margin properties. We also report on experiments carried out on datasets from diverse domains, with the goal of comparing to known Perceptron algorithms (first-order, second-order, additive, multiplicative). Our learning procedure seems to generalize quite well, and converges faster than the corresponding multiplicative baseline algorithms.

1 Introduction and preliminaries

The problem of on-line learning linear-threshold functions from labeled data is one which has spurred a substantial amount of research in Machine Learning. The relevance of this task from both the theoretical and the practical point of view is widely recognized: on the one hand, linear functions combine flexibility with analytical and computational tractability; on the other hand, on-line algorithms provide efficient methods for processing massive amounts of data. Moreover, the widespread use of kernel methods in Machine Learning (e.g., [24]) has greatly improved the scope of this learning technology, thereby increasing even further the general attention towards the specific task of incremental learning of (generalized) linear functions. Many models/algorithms have been proposed in the literature (stochastic, adversarial, noisy, etc.): any list of references would not do justice to the existing work on this subject. In this paper, we are interested in the problem of on-line learning linear-threshold functions from adversarially generated examples. We introduce a new family of algorithms, collectively called the Higher-order Perceptron algorithm (where "higher" means here "higher than one", i.e., higher than first-order descent algorithms such as gradient descent or standard Perceptron-like algorithms). Contrary to other higher-order algorithms, such as the ridge-regression-like algorithms considered in, e.g., [4, 7], Higher-order Perceptron has the ability to put together in a principled and flexible manner second-order statistics about the data with the "logarithmic behavior" of multiplicative/dual-norm algorithms (e.g., [18, 19, 6, 13, 15, 20]). Our algorithm exploits a simplified form of the inverse data matrix, lending itself to be easily combined with the dual norms machinery introduced by [13] (see also [12, 23]). As we will see, this has also computational advantages, allowing us to formulate an efficient (subquadratic) implementation. Our contribution is twofold. First, we provide an initial theoretical analysis suggesting that our algorithm might be seen as a standard Perceptron algorithm [21] operating on a transformed sequence of examples with improved margin properties. The same analysis also suggests a simple (but principled) way of switching on the fly between higher-order and first-order updates. This is especially convenient when we deal with kernel functions, a major concern being the sparsity of the computed solution. (The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant n. 506778; this publication only reflects the authors' views.)
The second contribution of this paper is an experimental investigation of our algorithm on artificial and real-world datasets from various domains: we compared Higher-order Perceptron to baseline Perceptron algorithms, like the Second-order Perceptron algorithm defined in [7] and the standard (p-norm) Perceptron algorithm, as in [13, 12]. We found in our experiments that Higher-order Perceptron generalizes quite well. Among our experimental findings are the following: 1) Higher-order Perceptron is always outperforming the corresponding multiplicative (p-norm) baseline (thus the stored data matrix is always beneficial in terms of convergence speed); 2) when dealing with Euclidean norms (p = 2), the comparison to Second-order Perceptron is less clear and depends on the specific task at hand.

Learning protocol and notation. Our algorithm works in the well-known mistake bound model of on-line learning, as introduced in [18, 2], and further investigated by many authors (e.g., [19, 6, 13, 15, 7, 20, 23] and references therein). Prediction proceeds in a sequence of trials. In each trial t = 1, 2, ... the prediction algorithm is given an instance vector in R^n (for simplicity, all vectors are normalized, i.e., ||x_t|| = 1, where ||·|| is the Euclidean norm unless otherwise specified), and then guesses the binary label y_t ∈ {−1, 1} associated with x_t. We denote the algorithm's prediction by ŷ_t ∈ {−1, 1}. Then the true label y_t is disclosed. In the case when ŷ_t ≠ y_t we say that the algorithm has made a prediction mistake. We call an example a pair (x_t, y_t), and a sequence of examples S any sequence S = (x_1, y_1), (x_2, y_2), ..., (x_T, y_T). In this paper, we are competing against the class of linear-threshold predictors, parametrized by normal vectors u ∈ {v ∈ R^n : ||v|| = 1}. In this case, a common way of measuring the (relative) prediction performance of an algorithm A is to compare the total number of mistakes of A on S to some measure of the linear separability of S. One such measure (e.g., [24]) is the cumulative hinge loss (or soft margin) D_γ(u; S) of S w.r.t. a linear classifier u at a given margin value γ > 0: D_γ(u; S) = Σ_{t=1}^T max{0, γ − y_t u^T x_t} (observe that D_γ(u; S) vanishes if and only if u separates S with margin at least γ). A mistake-driven algorithm A is one which updates its internal state only upon mistakes. One can therefore associate with the run of A on S a subsequence M = M(S, A) ⊆ {1, ..., T} of mistaken trials. Now, the standard analysis of these algorithms allows us to restrict the behavior of the comparison class to mistaken trials only and, as a consequence, to refine D_γ(u; S) so as to include only trials in M: D_γ(u; S) = Σ_{t∈M} max{0, γ − y_t u^T x_t}. This gives bounds on A's performance relative to the best u over a sequence of examples produced (or, actually, selected) by A during its on-line functioning. Our analysis in Section 3 goes one step further: the number of mistakes of A on S is contrasted to the cumulative hinge loss of the best u on a transformed sequence S̃ = ((x̃_{i_1}, y_{i_1}), (x̃_{i_2}, y_{i_2}), ..., (x̃_{i_m}, y_{i_m})), where each instance x_{i_k} gets transformed into x̃_{i_k} through a mapping depending only on the past behavior of the algorithm (i.e., only on examples up to trial t = i_{k−1}). As we will see in Section 3, this new sequence S̃ tends to be "more separable" than the original sequence, in the sense that if S is linearly separable with some margin, then the transformed sequence S̃ is likely to be separable with a larger margin.
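To make the notation concrete, here is a minimal numpy sketch of the cumulative hinge loss just defined; the function name, the row-wise array layout, and the optional restriction to a set of mistaken trials are illustrative choices of ours, not part of the paper.

```python
import numpy as np

def cumulative_hinge_loss(u, X, y, gamma, trials=None):
    # D_gamma(u; S) = sum_t max(0, gamma - y_t * <u, x_t>),
    # optionally restricted to a subset of trials (e.g., the mistaken set M).
    idx = np.arange(len(y)) if trials is None else np.asarray(trials)
    margins = y[idx] * (X[idx] @ u)
    return float(np.maximum(0.0, gamma - margins).sum())
```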
2 The Higher-order Perceptron algorithm

The algorithm (described in Figure 1) takes as input a sequence of nonnegative parameters ρ_1, ρ_2, ..., and maintains a product matrix B_k (initialized to the identity matrix I) and a sum vector v_k (initialized to 0). Both of them are indexed by k, a counter storing the current number of mistakes (plus one). Upon receiving the t-th normalized instance vector x_t ∈ R^n, the algorithm computes its binary prediction value ŷ_t as the sign of the inner product between vector B_{k−1} v_{k−1} and vector B_{k−1} x_t. If ŷ_t ≠ y_t then matrix B_{k−1} is updated multiplicatively as B_k = B_{k−1}(I − ρ_k x_t x_t^T) while vector v_{k−1} is updated additively through the standard Perceptron rule v_k = v_{k−1} + y_t x_t. The new matrix B_k and the new vector v_k will be used in the next trial. If ŷ_t = y_t no update is performed (hence the algorithm is mistake driven). Observe that ρ_k = 0 for any k makes this algorithm degenerate into the standard Perceptron algorithm [21]. Moreover, one can easily see that, in order to let this algorithm exploit the information collected in the matrix B (and let the algorithm's behavior be substantially different from Perceptron's) we need to ensure Σ_{k=1}^∞ ρ_k = ∞. In the sequel, our standard choice will be ρ_k = c/k, with c ∈ (0, 1). See Sections 3 and 4. Implementing Higher-order Perceptron can be done in many ways. Below, we quickly describe three of them, each one having its own merits.

Figure 1: The Higher-order Perceptron algorithm (for p = 2).
Parameters: ρ_1, ρ_2, ... ∈ [0, 1).
Initialization: B_0 = I; v_0 = 0; k = 1.
Repeat for t = 1, 2, ..., T:
1. Get instance x_t ∈ R^n, ||x_t|| = 1;
2. Predict ŷ_t = SGN(w_{k−1}^T x_t) ∈ {−1, +1}, where w_{k−1} = B_{k−1}^T B_{k−1} v_{k−1};
3. Get label y_t ∈ {−1, +1};
4. If ŷ_t ≠ y_t then: v_k = v_{k−1} + y_t x_t; B_k = B_{k−1}(I − ρ_k x_t x_t^T); k ← k + 1.

1) Primal version. We store and update an n×n matrix A_k = B_k^T B_k and an n-dimensional column vector v_k. Matrix A_k is updated as A_k = A_{k−1} − ρ A_{k−1} x x^T − ρ x x^T A_{k−1} + ρ² (x^T A_{k−1} x) x x^T, taking O(n²) operations, while v_k is updated as in Figure 1. Computing the algorithm's margin v^T A x can then be carried out in time quadratic in the dimension n of the input space. 2) Dual version. This implementation allows us the use of kernel functions (e.g., [24]). Let us denote by X_k the n × k matrix whose columns are the n-dimensional instance vectors x_1, ..., x_k where a mistake occurred so far, and by y_k the k-dimensional column vector of the corresponding labels. We store and update the k × k matrix D_k = [d_{i,j}^{(k)}]_{i,j=1}^k, the k × k diagonal matrix H_k = DIAG{h_k}, h_k = (h_1^{(k)}, ..., h_k^{(k)}) = X_k^T X_k y_k, and the k-dimensional column vector g_k = y_k + D_k H_k 1_k, being 1_k a vector of k ones. If we interpret the primal matrix A_k above as A_k = I + Σ_{i,j=1}^k d_{i,j}^{(k)} x_i x_j^T, it is not hard to show that the margin value w_{k−1}^T x is equal to g_{k−1}^T X_{k−1}^T x, and can be computed through O(k) extra inner products. Now, on the k-th mistake, vector g can be updated with O(k²) extra inner products by updating D and H in the following way. We let D_0 and H_0 be empty matrices. Then, given D_{k−1} and H_{k−1} = DIAG{h_{k−1}}, we have (observe that, by construction, D_k is a symmetric matrix)

D_k = [ D_{k−1}      −ρ_k b_k
        −ρ_k b_k^T   d_{k,k}^{(k)} ],  where b_k = D_{k−1} X_{k−1}^T x_k  and  d_{k,k}^{(k)} = ρ_k² x_k^T X_{k−1} b_k − 2ρ_k + ρ_k².

On the other hand, H_k = DIAG{h_{k−1} + y_k X_{k−1}^T x_k, h_k^{(k)}}, with h_k^{(k)} = y_{k−1}^T X_{k−1}^T x_k + y_k. Observe that on trials when ρ_k = 0 matrix D_{k−1} is padded with a zero row and a zero column. This amounts to saying that matrix A_k = I + Σ_{i,j=1}^k d_{i,j}^{(k)} x_i x_j^T is not updated, i.e., A_k = A_{k−1}.
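As a concrete reading of Figure 1, the following numpy sketch runs the p = 2 Higher-order Perceptron with the dense product matrix stored explicitly (O(n²) per mistake); the function name and the default choice ρ_k = c/k with c = 0.4 are illustrative assumptions of ours.

```python
import numpy as np

def higher_order_perceptron(X, y, c=0.4):
    # X: (T, n) array of unit-norm instances; y: labels in {-1, +1}.
    T, n = X.shape
    B = np.eye(n)      # product matrix B_{k-1}, initialized to I
    v = np.zeros(n)    # Perceptron sum vector v_{k-1}
    k = 1              # mistake counter (plus one)
    for t in range(T):
        x = X[t]
        w = B.T @ (B @ v)                  # w_{k-1} = B^T B v_{k-1}
        y_hat = 1.0 if w @ x >= 0 else -1.0
        if y_hat != y[t]:                  # mistake-driven updates only
            rho = c / k                    # standard choice rho_k = c/k
            v = v + y[t] * x               # additive Perceptron rule
            B = B @ (np.eye(n) - rho * np.outer(x, x))  # multiplicative rule
            k += 1
    return v, B
```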
A closer look at the above update mechanism allows us to conclude that the overall number of extra inner products needed to compute g_k is actually quadratic only in the number of past mistaken trials having ρ_k > 0. This turns out to be especially important when using a sparse version of our algorithm which, on a mistaken trial, decides whether to update both B and v or just v (see Section 4). 3) Implicit primal version and the dual norms algorithm. This is based on the simple observation that for any vector z we can compute B_k z by unwrapping B_k as in B_k z = B_{k−1}(I − ρ x x^T) z = B_{k−1} z′, where vector z′ = z − ρ x (x^T z) can be calculated in time O(n). Thus computing the margin v^T B_{k−1}^T B_{k−1} x actually takes O(nk). Maintaining this implicit representation for the product matrix B can be convenient when an efficient dual version is likely to be unavailable, as is the case for the multiplicative (or, more generally, dual norms) extension of our algorithm. We recall that a multiplicative algorithm is useful when learning sparse target hyperplanes (e.g., [18, 15, 3, 12, 11, 20]). We obtain a dual norms algorithm by introducing a norm parameter p ≥ 2, and the associated gradient mapping g : θ ∈ R^n ↦ ∇(||θ||_p² / 2) ∈ R^n. Then, in Figure 1, we normalize instance vectors x_t w.r.t. the p-norm, we define w_{k−1} = B_{k−1}^T g(B_{k−1} v_{k−1}), and generalize the matrix update as B_k = B_{k−1}(I − ρ_k x_t g(x_t)^T). As we will see, the resulting algorithm combines the multiplicative behavior of the p-norm algorithms with the "second-order" information contained in the matrix B_k. One can easily see that the above-mentioned argument for computing the margin g(B_{k−1} v_{k−1})^T B_{k−1} x in time O(nk) still holds. (This mapping has also been used in [12, 11]. Recall that setting p = O(log n) yields an algorithm similar to Winnow [18]. Also, notice that p = 2 yields g = identity.)
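A minimal sketch of the two ingredients just described, the p-norm gradient mapping g and the O(nk) unwrapped application of B_k, under the assumption that the stored factors are kept as (ρ_i, x_i) pairs; the names are ours.

```python
import numpy as np

def pnorm_grad(theta, p):
    # g(theta) = gradient of ||theta||_p^2 / 2; for p = 2 this is the identity.
    norm = np.linalg.norm(theta, ord=p)
    if norm == 0.0:
        return np.zeros_like(theta)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm ** (p - 2)

def apply_B(factors, z):
    # B_k = (I - rho_1 x_1 x_1^T) ... (I - rho_k x_k x_k^T); the rightmost
    # (most recent) factor acts on z first.  Cost: O(n) per stored factor.
    for rho, x in reversed(factors):
        z = z - rho * x * (x @ z)
    return z
```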
3 Analysis

We express the performance of the Higher-order Perceptron algorithm in terms of the hinge-loss behavior of the best linear classifier over the transformed sequence

S̃ = (B_0 x_{t(1)}, y_{t(1)}), (B_1 x_{t(2)}, y_{t(2)}), (B_2 x_{t(3)}, y_{t(3)}), ...,   (1)

being t(k) the trial where the k-th mistake occurs, and B_k the k-th matrix produced by the algorithm. Observe that each feature vector x_{t(k)} gets transformed by a matrix depending on past examples only. This is relevant to the argument that S̃ tends to have a larger margin than the original sequence (see the discussion at the end of this section). This neat "on-line structure" does not seem to be shared by other competing higher-order algorithms, such as the "ridge-regression-like" algorithms considered, e.g., in [25, 4, 7, 23]. For the sake of simplicity, we state the theorem below only in the case p = 2. A more general statement holds when p ≥ 2.

Theorem 1. Let the Higher-order Perceptron algorithm in Figure 1 be run on a sequence of examples S = (x_1, y_1), (x_2, y_2), ..., (x_T, y_T). Let the sequence of parameters ρ_k satisfy 0 ≤ ρ_k ≤ (1 − c)/(1 + |v_{k−1}^T x_t|), where x_t is the k-th mistaken instance vector, and c ∈ (0, 1]. Then the total number m of mistakes satisfies

m ≤ α D̃_γ/γ + α²/(2γ²) + (α/γ) √(α D̃_γ/γ + α²/(4γ²)),   (2)

holding for any γ > 0 and any unit norm vector u ∈ R^n, where α = α(c) = (2 − c)/c and D̃_γ is shorthand for D_γ(u; S̃_c). (The subscript c in S̃_c emphasizes the dependence of the transformed sequence on the choice of c. Note that in the special case c = 1 we have ρ_k = 0 for any k and α = 1, thereby recovering the standard Perceptron bound for nonseparable sequences; see, e.g., [12].)

Proof. The analysis deliberately mimics the standard Perceptron convergence analysis [21]. We fix an arbitrary sequence S = (x_1, y_1), (x_2, y_2), ..., (x_T, y_T) and let M ⊆ {1, 2, ..., T} be the set of trials where the algorithm in Figure 1 made a mistake. Let t = t(k) be the trial where the k-th mistake occurred. We study the evolution of ||B_k v_k||² over mistaken trials. Notice that the matrix B_k^T B_k is positive semidefinite for any k. We can write ||B_k v_k||² = ||B_{k−1}(I − ρ_k x_t x_t^T)(v_{k−1} + y_t x_t)||² (from the update rules v_k = v_{k−1} + y_t x_t and B_k = B_{k−1}(I − ρ_k x_t x_t^T)) = ||B_{k−1} v_{k−1} + y_t (1 − ρ_k y_t v_{k−1}^T x_t − ρ_k) B_{k−1} x_t||² = ||B_{k−1} v_{k−1}||² + 2 y_t r_k v_{k−1}^T B_{k−1}^T B_{k−1} x_t + r_k² ||B_{k−1} x_t||² (using ||x_t|| = 1), where we set for brevity r_k = 1 − ρ_k y_t v_{k−1}^T x_t − ρ_k. We proceed by upper and lower bounding the above chain of equalities. To this end, we need to ensure r_k ≥ 0. Observe that y_t v_{k−1}^T x_t ≥ 0 implies r_k ≥ 0 if and only if ρ_k ≤ 1/(1 + y_t v_{k−1}^T x_t). On the other hand, if y_t v_{k−1}^T x_t < 0 then, in order for r_k to be nonnegative, it suffices to pick ρ_k ≤ 1. In both cases ρ_k ≤ (1 − c)/(1 + |v_{k−1}^T x_t|) implies r_k ≥ c > 0, and also r_k² ≤ (1 + ρ_k |v_{k−1}^T x_t| − ρ_k)² ≤ (2 − c)². Now, using y_t v_{k−1}^T B_{k−1}^T B_{k−1} x_t ≤ 0 (combined with r_k ≥ 0), we conclude that ||B_k v_k||² − ||B_{k−1} v_{k−1}||² ≤ (2 − c)² ||B_{k−1} x_t||² = (2 − c)² x_t^T A_{k−1} x_t, where we set A_k = B_k^T B_k. A simple (and crude) upper bound on the last term follows by observing that ||x_t|| = 1 implies x_t^T A_{k−1} x_t ≤ ||A_{k−1}||, the spectral norm (largest eigenvalue) of A_{k−1}. (A slightly more refined bound can be derived which depends on the trace of the matrices I − A_k; details will be given in the full version of this paper.) Since a factor matrix of the form (I − ρ x x^T) with ρ ≤ 1 and ||x|| = 1 has spectral norm one, we have x_t^T A_{k−1} x_t ≤ ||A_{k−1}|| ≤ Π_{i=1}^{k−1} ||I − ρ_i x_{t(i)} x_{t(i)}^T||² ≤ 1. Therefore, summing over k = 1, ..., m = |M| (or, equivalently, over t ∈ M) and using v_0 = 0 yields the upper bound

||B_m v_m||² ≤ (2 − c)² m.   (3)

To find a lower bound on the left-hand side of (3), we first pick any unit norm vector u ∈ R^n, and apply the standard Cauchy-Schwartz inequality: ||B_m v_m|| ≥ u^T B_m v_m. Then, we observe that for a generic trial t = t(k) the update rule of our algorithm allows us to write u^T B_k v_k − u^T B_{k−1} v_{k−1} = r_k y_t u^T B_{k−1} x_t ≥ r_k (γ − max{0, γ − y_t u^T B_{k−1} x_t}), where the last inequality follows from r_k ≥ 0 and holds for any margin value γ > 0. We sum the above over k = 1, ..., m and exploit c ≤ r_k ≤ 2 − c after rearranging terms. This gets ||B_m v_m|| ≥ u^T B_m v_m ≥ c γ m − (2 − c) D_γ(u; S̃_c). Combining with (3) and solving for m gives the claimed bound.
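For readability, here is the final step of the proof restated in display form: combining the lower bound just derived with the upper bound (3) and solving the resulting quadratic inequality in √m recovers (2). This display is our own restatement of the bound reconstructed above.

```latex
% From  c\gamma m - (2-c)\,\tilde D_\gamma \le \|B_m v_m\| \le (2-c)\sqrt{m},
% divide by c\gamma, set \alpha = (2-c)/c, and solve the quadratic in \sqrt{m}:
\[
m \;\le\; \frac{\alpha \tilde D_\gamma}{\gamma}
  \;+\; \frac{\alpha^{2}}{2\gamma^{2}}
  \;+\; \frac{\alpha}{\gamma}\,
        \sqrt{\frac{\alpha \tilde D_\gamma}{\gamma}
              + \frac{\alpha^{2}}{4\gamma^{2}}}\,,
\qquad \tilde D_\gamma := D_\gamma(u;\tilde S_c).
\]
```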
From the above result one can see that our algorithm might be viewed as a standard Perceptron algorithm operating on the transformed sequence S̃_c in (1). We now give a qualitative argument, which is suggestive of the improved margin properties of S̃_c. Assume for simplicity that all examples (x_t, y_t) in the original sequence are correctly classified by hyperplane u with the same margin γ = y_t u^T x_t > 0, where t = t(k). According to Theorem 1, the parameters ρ_1, ρ_2, ... should be small positive numbers. Assume, again for simplicity, that all ρ_k are set to the same small enough value ρ > 0. Then, up to first order, matrix B_k = Π_{i=1}^k (I − ρ x_{t(i)} x_{t(i)}^T) can be approximated as B_k ≈ I − ρ Σ_{i=1}^k x_{t(i)} x_{t(i)}^T. Then, to the extent that the above approximation holds, we can write (again, a similar argument holds in the more general setting p ≥ 2; the reader should notice how important the dependence of B_k on the past is to this argument):

y_t u^T B_{k−1} x_t ≈ y_t u^T (I − ρ Σ_{i=1}^{k−1} x_{t(i)} x_{t(i)}^T) x_t = y_t u^T x_t − ρ y_t Σ_{i=1}^{k−1} (u^T x_{t(i)}) (x_{t(i)}^T x_t) = γ − ρ γ y_t v_{k−1}^T x_t,

using u^T x_{t(i)} = γ y_{t(i)} and v_{k−1} = Σ_{i=1}^{k−1} y_{t(i)} x_{t(i)}. Now, y_t v_{k−1}^T x_t is the margin of the (first-order) Perceptron vector v_{k−1} over a mistaken trial for the Higher-order Perceptron vector w_{k−1}. Since the two vectors v_{k−1} and w_{k−1} are correlated (recall that v_{k−1}^T w_{k−1} = v_{k−1}^T B_{k−1}^T B_{k−1} v_{k−1} = ||B_{k−1} v_{k−1}||² ≥ 0) the mistaken condition y_t w_{k−1}^T x_t ≤ 0 is more likely to imply y_t v_{k−1}^T x_t ≤ 0 than the opposite. This tends to yield a margin larger than the original margin γ. As we mentioned in Section 2, this is also advantageous from a computational standpoint, since in those cases the matrix update B_{k−1} → B_k might be skipped (this is equivalent to setting ρ_k = 0), and Theorem 1 would still hold. Though the above might be the starting point of a more thorough theoretical understanding of the margin properties of our algorithm, in this paper we prefer to stop early and leave any further investigation to collecting experimental evidence.

4 Experiments

We tested the empirical performance of our algorithm by conducting a number of experiments on a collection of datasets, both artificial and real-world, from diverse domains (Optical Character Recognition, text categorization, DNA microarrays). The main goal of these experiments was to compare Higher-order Perceptron (with both p = 2 and p > 2) to known Perceptron-like algorithms, such as first-order [21] and second-order Perceptron [7], in terms of training accuracy (i.e., convergence speed) and test set accuracy. The results are contained in Tables 1, 2, 3, and in Figure 2. Task 1: DNA microarrays and artificial data. The goal here was to test the convergence properties of our algorithms on sparse target learning tasks. We first tested on a couple of well-known DNA microarray datasets. For each dataset, we first generated a number of random training/test splits (our random splits also included random permutations of the training set). The reported results are averaged over these random splits. The two DNA datasets are: i. The ER+/ER− dataset from [14]. Here the task is to analyze expression profiles of breast cancer and classify breast tumors according to ER (Estrogen Receptor) status. This dataset (which we call the "Breast" dataset) contains 58 expression profiles concerning 3389 genes. We randomly split 1000 times into a training set of size 47 and a test set of size 11. ii. The "Lymphoma" dataset [1]. Here the goal is to separate cancerous and normal tissues in a large B-cell lymphoma problem. The dataset contains 96 expression profiles concerning 4026 genes. We randomly split the dataset into a training set of size 60 and a test set of size 36. Again, the random split was performed 1000 times. On both datasets, the tested algorithms have been run by cycling 5 times over the current training set. No kernel functions have been used. We also artificially generated two (moderately) sparse learning problems with margin γ ≥ 0.005 at labeling noise levels η = 0.0 (linearly separable) and η = 0.1, respectively. The datasets have been generated at random by first generating two (normalized) target vectors u ∈ {−1, 0, +1}^500, where the first 50 components are selected independently at random in {−1, +1} and the remaining 450 components are 0. Then we set η = 0.0 for the first target and η = 0.1 for the second one and, corresponding to each of the two settings, we randomly generated 1000 training examples and 1000 test examples. The instance vectors are chosen at random from [−1, +1]^500 and then normalized. If u^T x_t ≥ γ then a +1 label is associated with x_t. If u^T x_t ≤ −γ then a −1 label is associated with x_t. The labels so obtained are flipped with probability η. If |u^T x_t| < γ then x_t is rejected and a new vector x_t is drawn. We call the two datasets "Artificial 0.0" and "Artificial 0.1" (a sketch of this generator appears at the end of this section). We tested our algorithms by training over an increasing number of epochs and checking the evolution of the corresponding test set accuracy. Again, no kernel functions have been used. Task 2: Text categorization. The text categorization datasets are derived from the first 20,000 newswire stories in the Reuters Corpus Volume 1 (RCV1, [22]). A standard TF-IDF bag-of-words encoding was used to transform each news story into a normalized vector of real attributes. We built four binary classification problems by "binarizing" consecutive news stories against the four target categories 70, 101, 4, and 59. These are the 2nd, 3rd, 4th, and 5th most frequent categories, respectively, within the first 20,000 news stories of RCV1 (we did not use the most frequent category because of its significant overlap with the other ones). We call these datasets RCV1_x, where x = 70, 101, 4, 59. Each dataset was split into a training set of size 10,000 and a test set of the same size. All algorithms have been trained for a single epoch. We initially tried polynomial kernels, then realized that kernel functions did not significantly alter our conclusions on this task. Thus the reported results refer to algorithms with no kernel functions. Task 3: Optical character recognition (OCR). We used two well-known OCR benchmarks: the USPS dataset and the MNIST dataset [16], and followed standard experimental setups, such as the one in [9], including the one-versus-rest scheme for reducing a multiclass problem to a set of binary tasks. We used for each algorithm the standard Gaussian and polynomial kernels, with parameters chosen via 5-fold cross validation on the training set across standard ranges. Again, all algorithms have been trained for a single epoch over the training set. The results in Table 3 only refer to the best parameter settings for each kernel. Algorithms. We implemented the standard Perceptron algorithm (with and without kernels), the Second-order Perceptron algorithm, as described in [7] (with and without kernels), and our Higher-order Perceptron algorithm. The implementation of the latter algorithm (for both p = 2 and p > 2) was "implicit primal" when tested on the sparse learning tasks, and in dual variables for the other two tasks. When using Second-order Perceptron, we set its parameter a (see [7] for details) by testing on a generous range of values. For brevity, only the settings achieving the best results are reported. On the sparse learning tasks we tried Higher-order Perceptron with norm p = 2, 4, 7, 10, while on the other two tasks we set p = 2. In any case, for each value of p, we set ρ_k = c/k, with c = 0, 0.2, 0.4, 0.6, 0.8 (notice that this setting fulfills the condition on ρ_k stated in Theorem 1). Since c = 0 corresponds to a standard p-norm Perceptron algorithm [13, 12] we tried to emphasize the comparison c = 0 vs. c > 0. Finally, when using kernels on the OCR tasks, we also compared to a sparse dual version of Higher-order Perceptron.
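A sketch of the Task 1 artificial data generator described above (rejection sampling around the margin band, plus label flipping); the function signature and the use of numpy's Generator API are illustrative choices of ours.

```python
import numpy as np

def make_artificial(n_examples, n=500, k=50, gamma=0.005, eta=0.0, rng=None):
    # Sparse target: first k components in {-1, +1}, the rest 0, then normalized.
    rng = np.random.default_rng(rng)
    u = np.zeros(n)
    u[:k] = rng.choice([-1.0, 1.0], size=k)
    u /= np.linalg.norm(u)
    X, y = [], []
    while len(X) < n_examples:
        x = rng.uniform(-1.0, 1.0, size=n)
        x /= np.linalg.norm(x)              # normalized instance vector
        m = u @ x
        if abs(m) < gamma:                  # reject points inside the margin band
            continue
        label = 1.0 if m >= gamma else -1.0
        if rng.random() < eta:              # flip the label with probability eta
            label = -label
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y)
```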
On a mistaken round t = t(k), this algorithm sets ρ_k = c/k if y_t v_{k−1}^T x_t ≥ 0, and ρ_k = 0 otherwise (thus, when y_t v_{k−1}^T x_t < 0 the matrix B_{k−1} is not updated); a one-line sketch of this gating rule follows the conclusions below. For the sake of brevity, the standard Perceptron algorithm is called FO ("First Order"), the Second-order algorithm is denoted by SO ("Second Order"), while the Higher-order algorithm with norm parameter p and ρ_k = c/k is abbreviated as HO_p(c). Thus, for instance, FO = HO_2(0). Results and conclusions. Our Higher-order Perceptron algorithm seems to deliver interesting results. In all our experiments HO_p(c) with c > 0 outperforms HO_p(0). On the other hand, the comparison HO_p(c) vs. SO depends on the specific task. On the DNA datasets, HO_p(c) with c > 0 is clearly superior on Breast. On Lymphoma, HO_p(c) gets worse as p increases. This is a good indication that, in general, a multiplicative algorithm is not suitable for this dataset. In any case, HO_2 turns out to be only slightly worse than SO. On the artificial datasets HO_p(c) with c > 0 is always better than the corresponding p-norm Perceptron algorithm. On the text categorization tasks, HO_2 tends to perform better than SO. On USPS, HO_2 is superior to the other competitors, while on MNIST it performs similarly when combined with Gaussian kernels (though it turns out to be relatively sparser), while it is slightly inferior to SO when using polynomial kernels. The sparse version of HO_2 cuts the matrix updates roughly by half, still maintaining a good performance. In all cases HO_2 (either sparse or not) significantly outperforms FO. In conclusion, the Higher-order Perceptron algorithm is an interesting tool for on-line binary classification, having the ability to combine multiplicative (or nonadditive) and second-order behavior into a single inference procedure. Like other algorithms, HO_p can be extended (details omitted due to space limitations) in several ways through known worst-case learning technologies, such as large margin (e.g., [17, 11]), label-efficient/active learning (e.g., [5, 8]), and bounded memory (e.g., [10]).
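The sparse variant mentioned at the start of this section amounts to gating the matrix update on the sign of the first-order margin; a one-line sketch (the function name is ours):

```python
def sparse_rho(k, c, y_t, v, x_t):
    # Sparse HO_2: perform the matrix update (rho_k = c/k) only when the
    # first-order margin y_t * <v_{k-1}, x_t> is nonnegative on a mistake.
    return c / k if y_t * (v @ x_t) >= 0 else 0.0
```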
Table 1: Training and test error on the two datasets "Breast" and "Lymphoma". Training error is the average total number of updates over 5 training epochs, while test error is the average fraction of misclassified patterns in the test set. The results refer to the same training/test splits. For each algorithm, only the best setting is shown (best training and best test setting coincided in these experiments). Thus, for instance, HO_2 differs from FO because of the c parameter. We emphasized the comparison HO_7(0) vs. HO_7(c) with best c among the tested values. According to the Wilcoxon signed rank test, an error difference of 0.5% or larger might be considered significant. In bold are the smallest figures achieved on each row of the table.

                   FO      HO_2    HO_4    HO_7(0)  HO_7    HO_10   SO
Breast    train    45.2    22.1    21.7    19.6     24.5    18.9    47.4
          test     23.4%   11.8%   16.4%   10.0%    13.3%   10.0%   15.7%
Lymphoma  train    23.0    24.5    20.0    32.4     23.1    29.6    19.3
          test     11.5%   12.0%   11.5%   13.5%    11.9%   15.0%   9.6%

[Figure 2 shows four plots: training updates vs. training epochs (top row) and test error rates vs. training epochs (bottom row), on Artificial 0.0 (left column) and Artificial 0.1 (right column), with curves for FO = HO_2(0.0), SO (a = 0.2), HO_2(0.4), HO_4(0.4), HO_7(0.0), and HO_7(0.4).]

Figure 2: Experiments on the two artificial datasets (Artificial 0.0, on the left, and Artificial 0.1, on the right). The plots give training and test behavior as a function of the number of training epochs. Notice that the test set in Artificial 0.1 is affected by labelling noise of rate 10%. Hence, a visual comparison between the two plots at the bottom can only be made once we shift down the y-axis of the noisy plot by 10%. On the other hand, the two training plots (top) are not readily comparable. The reader might have difficulty telling apart the two kinds of algorithms HO_p(0.0) and HO_p(c) with c > 0. In practice, the latter turned out to be always slightly superior in performance to the former.

Table 2: Experimental results on the four binary classification tasks derived from RCV1. "Train" denotes the number of training corrections, while "Test" gives the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. In bold are the smallest figures achieved for each of the 8 combinations of dataset (RCV1_x, x = 70, 101, 4, 59) and phase (training or test).

              FO                HO_2              SO
              train   test      train   test      train   test
RCV1_70       993     7.20%     941     6.83%     880     6.95%
RCV1_101      673     6.39%     665     5.81%     677     5.48%
RCV1_4        803     6.14%     783     5.94%     819     6.05%
RCV1_59       767     6.45%     762     6.04%     760     6.84%

Table 3: Experimental results on the OCR tasks. "Train" denotes the total number of training corrections, summed over the 10 categories, while "Test" denotes the fraction of misclassified patterns in the test set. Only the results corresponding to the best test set accuracy are shown. For the sparse version of HO_2 we also report (in parentheses) the number of matrix updates during training. In bold are the smallest figures achieved for each of the 8 combinations of dataset (USPS or MNIST), kernel type (Gaussian or Polynomial), and phase (training or test).

                 FO               HO_2             Sparse HO_2            SO
                 train   test     train   test     train        test      train   test
USPS   Gauss     1385    6.53%    945     4.76%    965 (440)    5.13%     1003    5.05%
       Poly      1609    7.37%    1090    5.71%    1081 (551)   5.52%     1054    5.53%
MNIST  Gauss     5834    2.10%    5351    1.79%    5363 (2596)  1.81%     5684    1.82%
       Poly      8148    3.04%    6404    2.27%    6476 (3311)  2.28%     6440    2.03%

References
[1] A. Alizadeh, et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
[2] D. Angluin (1988). Queries and concept learning. Machine Learning, 2(4), 319–342.
[3] P. Auer & M.K. Warmuth (1998). Tracking the best disjunction. Machine Learning, 32(2), 127–150.
[4] K.S. Azoury & M.K. Warmuth (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3), 211–246.
[5] A. Bordes, S. Ertekin, J. Weston, & L. Bottou (2005). Fast kernel classifiers with on-line and active learning. JMLR, 6, 1579–1619.
[6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, & M.K. Warmuth (1997). How to use expert advice. J. ACM, 44(3), 427–485.
[7] N. Cesa-Bianchi, A. Conconi, & C. Gentile (2005). A second-order perceptron algorithm. SIAM Journal of Computing, 34(3), 640–668.
[8] N. Cesa-Bianchi, C. Gentile, & L. Zaniboni (2006). Worst-case analysis of selective sampling for linear-threshold algorithms. JMLR, 7, 1205–1230.
[9] C. Cortes & V. Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
[10] O. Dekel, S. Shalev-Shwartz, & Y. Singer (2006). The Forgetron: a kernel-based Perceptron on a fixed budget. NIPS 18, MIT Press, pp. 259–266.
[11] C. Gentile (2001). A new approximate maximal margin classification algorithm. JMLR, 2, 213–242.
[12] C. Gentile (2003). The robustness of the p-norm algorithms. Machine Learning, 53(3), 265–299.
[13] A.J. Grove, N. Littlestone, & D. Schuurmans (2001). General convergence results for linear discriminant updates. Machine Learning Journal, 43(3), 173–210.
[14] S. Gruvberger, et al. (2001). Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61, 5979–5984.
[15] J. Kivinen, M.K. Warmuth, & P. Auer (1997). The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97, 325–343.
[16] Y. Le Cun, et al. (1995). Comparison of learning algorithms for handwritten digit recognition. ICANN 1995, pp. 53–60.
[17] Y. Li & P. Long (2002). The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 361–387.
[18] N. Littlestone (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4), 285–318.
[19] N. Littlestone & M.K. Warmuth (1994). The weighted majority algorithm. Information and Computation, 108(2), 212–261.
[20] P. Long & X. Wu (2004). Mistake bounds for maximum entropy discrimination. NIPS 2004.
[21] A.B.J. Novikov (1962). On convergence proofs on perceptrons. Proc. of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615–622.
[22] Reuters: 2000. http://about.reuters.com/researchandstandards/corpus/.
[23] S. Shalev-Shwartz & Y. Singer (2006). Online learning meets optimization in the dual. COLT 2006, pp. 423–437.
[24] B. Schoelkopf & A. Smola (2002). Learning with kernels. MIT Press.
[25] V. Vovk (2001). Competitive on-line statistics. International Statistical Review, 69, 213–248.

5 0.50641686 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

Author: Emre Kiciman, David Maltz, John C. Platt

Abstract: Web servers on the Internet need to maintain high reliability, but the cause of intermittent failures of web transactions is non-obvious. We use approximate Bayesian inference to diagnose problems with web services. This diagnosis problem is far larger than any previously attempted: it requires inference of 10^4 possible faults from 10^5 observations. Further, such inference must be performed in less than a second. Inference can be done at this speed by combining a mean-field variational approximation and the use of stochastic gradient descent to optimize a variational cost function. We use this fast inference to diagnose a time series of anomalous HTTP requests taken from a real web service. The inference is fast enough to analyze network logs with billions of entries in a matter of hours. 1

6 0.49497128 213 nips-2007-Variational Inference for Diffusion Processes

7 0.49416655 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

8 0.48770761 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

9 0.41256383 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

10 0.4068315 40 nips-2007-Bundle Methods for Machine Learning

11 0.39218503 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

12 0.37870038 185 nips-2007-Stable Dual Dynamic Programming

13 0.37641075 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

14 0.37247494 96 nips-2007-Heterogeneous Component Analysis

15 0.36636788 200 nips-2007-The Tradeoffs of Large Scale Learning

16 0.36295253 156 nips-2007-Predictive Matrix-Variate t Models

17 0.35090557 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

18 0.34865129 203 nips-2007-The rat as particle filter

19 0.34357879 180 nips-2007-Sparse Feature Learning for Deep Belief Networks

20 0.34121245 27 nips-2007-Anytime Induction of Cost-sensitive Trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (13, 0.034), (16, 0.448), (21, 0.044), (34, 0.03), (35, 0.021), (47, 0.1), (49, 0.022), (83, 0.1), (85, 0.016), (90, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93053454 60 nips-2007-Contraction Properties of VLSI Cooperative Competitive Neural Networks of Spiking Neurons

Author: Emre Neftci, Elisabetta Chicca, Giacomo Indiveri, Jean-jeacques Slotine, Rodney J. Douglas

Abstract: A non–linear dynamic system is called contracting if initial conditions are forgotten exponentially fast, so that all trajectories converge to a single trajectory. We use contraction theory to derive an upper bound for the strength of recurrent connections that guarantees contraction for complex neural networks. Specifically, we apply this theory to a special class of recurrent networks, often called Cooperative Competitive Networks (CCNs), which are an abstract representation of the cooperative-competitive connectivity observed in cortex. This specific type of network is believed to play a major role in shaping cortical responses and selecting the relevant signal among distractors and noise. In this paper, we analyze contraction of combined CCNs of linear threshold units and verify the results of our analysis in a hybrid analog/digital VLSI CCN comprising spiking neurons and dynamic synapses. 1

2 0.91799325 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior

Author: Sebastian Gerwinn, Matthias Bethge, Jakob H. Macke, Matthias Seeger

Abstract: Generalized linear models are the most commonly used tools to describe the stimulus selectivity of sensory neurons. Here we present a Bayesian treatment of such models. Using the expectation propagation algorithm, we are able to approximate the full posterior distribution over all weights. In addition, we use a Laplacian prior to favor sparse solutions. Therefore, stimulus features that do not critically influence neural activity will be assigned zero weights and thus be effectively excluded by the model. This feature selection mechanism facilitates both the interpretation of the neuron model as well as its predictive abilities. The posterior distribution can be used to obtain confidence intervals which makes it possible to assess the statistical significance of the solution. In neural data analysis, the available amount of experimental measurements is often limited whereas the parameter space is large. In such a situation, both regularization by a sparsity prior and uncertainty estimates for the model parameters are essential. We apply our method to multi-electrode recordings of retinal ganglion cells and use our uncertainty estimate to test the statistical significance of functional couplings between neurons. Furthermore we used the sparsity of the Laplace prior to select those filters from a spike-triggered covariance analysis that are most informative about the neural response. 1

same-paper 3 0.88745356 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.

4 0.88079423 170 nips-2007-Robust Regression with Twinned Gaussian Processes

Author: Andrew Naish-guzman, Sean Holden

Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].

5 0.72206968 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

Author: Jonathan W. Pillow, Peter E. Latham

Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.

6 0.7063303 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models

7 0.68308055 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

8 0.63619089 195 nips-2007-The Generalized FITC Approximation

9 0.61009997 164 nips-2007-Receptive Fields without Spike-Triggering

10 0.60773587 205 nips-2007-Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity

11 0.57422024 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons

12 0.54101431 35 nips-2007-Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms

13 0.53938109 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence

14 0.51976526 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis

15 0.51775217 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

16 0.51488334 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

17 0.48910707 213 nips-2007-Variational Inference for Diffusion Processes

18 0.48607078 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

19 0.48180068 21 nips-2007-Adaptive Online Gradient Descent

20 0.47916564 174 nips-2007-Selecting Observations against Adversarial Objectives