nips nips2003 nips2003-102 knowledge-graph by maker-knowledge-mining

102 nips-2003-Large Scale Online Learning


Source: pdf

Author: Léon Bottou, Yann L. Cun

Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We consider situations where training data is abundant and computing resources are comparatively scarce. [sent-4, score-0.217]

2 We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. [sent-5, score-1.058]

3 Network traffic itself provides new and abundant sources of data in the form of server log files. [sent-9, score-0.154]

4 This remark suggests that learning algorithms must process increasing amounts of data using comparatively smaller computing resources. [sent-12, score-0.233]

5 This work assumes that datasets have grown to practically infinite sizes and discusses which learning algorithms asymptotically provide the best generalization performance using limited computing resources. [sent-13, score-0.311]

6 • Online algorithms operate by repetitively drawing a fresh random example and adjusting the parameters on the basis of this single example only. [sent-14, score-0.23]

7 • Batch algorithms avoid this issue by completely optimizing the cost function defined on a set of training examples. [sent-17, score-0.199]

8 On the other hand, such algorithms cannot process as many examples because they must iterate several times over the training set to achieve the optimum. [sent-18, score-0.227]

9 As datasets grow to practically infinite sizes, we argue that online algorithms outperform learning algorithms that operate by repetitively sweeping over a training set. [sent-19, score-0.766]

10 2 Gradient Based Learning Many learning algorithms optimize an empirical cost function Cn (θ) that can be expressed as the average of a large number of terms L(z, θ). [sent-20, score-0.186]

11 Each term measures the cost associated with running a model with parameter vector θ on independent examples z i (typically input/output pairs zi = (xi , yi ). [sent-21, score-0.303]

12 • Online gradient: Parameter updates are performed on the basis of a single sample zt picked randomly at each iteration: θ(t) = θ(t − 1) − (1/t) Φt ∂L/∂θ(zt, θ(t − 1))   (3), where Φt is again an appropriately chosen positive definite symmetric matrix. [sent-23, score-0.182]
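
A minimal runnable sketch of update (3), assuming a toy quadratic loss; the helper names (online_gradient_step, loss_grad, scaling) and the data distribution are illustrative, not from the paper.

```python
import numpy as np

def online_gradient_step(theta, z, t, loss_grad, scaling_matrix):
    # theta(t) = theta(t-1) - (1/t) * Phi_t * dL/dtheta(z_t, theta(t-1))   -- update (3)
    phi_t = scaling_matrix(t)                 # positive definite symmetric matrix Phi_t
    return theta - (1.0 / t) * (phi_t @ loss_grad(z, theta))

# Toy illustration: L(z, theta) = 0.5 * ||theta - z||^2, a fresh z drawn at every iteration.
rng = np.random.default_rng(0)
theta = np.zeros(3)
loss_grad = lambda z, th: th - z              # dL/dtheta for the toy loss
scaling = lambda t: np.eye(3)                 # identity scaling (plain stochastic gradient)
for t in range(1, 20001):
    z = rng.normal(loc=1.0, scale=0.5, size=3)
    theta = online_gradient_step(theta, z, t, loss_grad, scaling)
print(theta)                                  # fluctuates in a shrinking bowl around (1, 1, 1)
```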

13 Very often the examples zt are chosen by cycling over a randomly permuted training set. [sent-24, score-0.403]

14 This paper however considers situations where the supply of training samples is practically unlimited. [sent-26, score-0.181]

15 Each iteration of the online algorithm utilizes a fresh sample, unlikely to have been presented to the system before. [sent-27, score-0.665]

16 Simple batch algorithms converge linearly1 to the optimum θ*n of the empirical cost. [sent-28, score-0.557]

17 Careful choices of Φk make the convergence super-linear or even quadratic2 in favorable cases (Dennis and Schnabel, 1983). [sent-29, score-0.227]

18 Whereas online algorithms may converge to the general area of the optimum at least as fast as batch algorithms (Le Cun et al. [sent-30, score-1.005]

19 , 1998), the optimization proceeds rather slowly during the final convergence phase (Bottou and Murata, 2002). [sent-31, score-0.238]

20 The noisy gradient estimate causes the parameter vector to fluctuate around the optimum in a bowl whose size decreases like 1/t at best. [sent-32, score-0.206]

21 This is the fundamental difference between optimization speed and learning speed. [sent-35, score-0.174]

22 1 Linear convergence speed: log 1/|θ(k) − θ*n|² grows linearly with k. [sent-36, score-0.297]

23 2 Quadratic convergence speed: log log 1/|θ(k) − θ*n|² grows linearly with k. [sent-37, score-0.376]
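
A small numeric illustration of the two footnotes (synthetic error sequences, not from the paper): under linear convergence log 1/|θ(k) − θ*n|² grows linearly in k, under quadratic convergence log log 1/|θ(k) − θ*n|² does.

```python
import math

e_lin, e_quad = 0.5, 0.5                       # synthetic starting errors |theta(k) - theta*_n|
for k in range(1, 7):
    e_lin *= 0.5                               # linear convergence: error shrinks by a constant factor
    e_quad = e_quad ** 2                       # quadratic convergence: error is squared each step
    print(k,
          round(math.log(1.0 / e_lin ** 2), 2),              # grows roughly linearly in k
          round(math.log(math.log(1.0 / e_quad ** 2)), 2))   # also grows roughly linearly in k
```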

24 3 Learning Speed Running an efficient batch algorithm on a training set of size n quickly yields the empirical optimum θ*n. [sent-38, score-0.636]

25 The sequence of empirical optima θ*n usually converges to the solution θ* when the training set size n increases. [sent-39, score-0.118]

26 In contrast, online algorithms randomly draw one example zt at each iteration. [sent-40, score-0.623]

27 When these examples are drawn from a set of n examples, the online algorithm minimizes the empirical error Cn . [sent-41, score-0.604]

28 When these examples are drawn from the asymptotic distribution p(z), it minimizes the expected cost C∞ . [sent-42, score-0.241]

29 Because the supply of training samples is practically unlimited, each iteration of the online algorithm utilizes a fresh example. [sent-43, score-0.846]

30 The parameter vectors θ(t) thus directly converge to the optimum θ ∗ of the expected cost C∞ . [sent-45, score-0.216]

31 The convergence speeds of the batch θ*n and online θ(t) sequences were first compared by Murata and Amari (1999). [sent-46, score-1.124]

32 θ*n = θ*n−1 − (1/n) Ψn ∂L/∂θ(zn, θ*n−1) + O(1/n²)   (5), with Ψn = [ (1/n) Σi=1..n ∂²L/∂θ∂θ(zi, θ*n−1) ]⁻¹ → H⁻¹ as n → ∞. Relation (5) describes the sequence θ*n as a recursive stochastic process that is essentially similar to the online learning algorithm (3). [sent-50, score-0.571]

33 Each iteration of this “algorithm” consists in picking a fresh example zn and updating the parameters according to (5). [sent-51, score-0.224]

34 We can however apply the mathematics of online learning algorithms to this stochastic process. [sent-53, score-0.534]

35 The similarity between (5) and (3) suggests that both the batch and online sequences converge at the same speed for adequate choices of the scaling matrix Φt . [sent-54, score-1.159]

36 Under customary regularity conditions, the following asymptotic speed result holds when the scaling matrix Φt converges to the inverse H−1 of the Hessian matrix. [sent-55, score-0.323]

37 E|θ(t) − θ*|² + o(1/t) = E|θ*t − θ*|² + o(1/t) = tr(H⁻¹ G H⁻¹)/t   (6). This convergence speed expression has been discovered many times. [sent-56, score-0.368]
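
A sketch of evaluating the constant in (6) for assumed toy matrices: given the Hessian H of the expected cost at θ* and the covariance G of the per-example gradients, the predicted excess error after t examples is tr(H⁻¹ G H⁻¹)/t. The values of H and G below are made up for illustration.

```python
import numpy as np

def predicted_excess_error(H, G, t):
    # tr(H^{-1} G H^{-1}) / t  -- the asymptotic constant of result (6)
    H_inv = np.linalg.inv(H)
    return np.trace(H_inv @ G @ H_inv) / t

H = np.array([[2.0, 0.3], [0.3, 1.0]])    # assumed Hessian at theta*
G = np.array([[1.0, 0.1], [0.1, 0.5]])    # assumed covariance of per-example gradients
for t in (10**3, 10**4, 10**5):
    print(t, predicted_excess_error(H, G, t))   # decreases like 1/t
```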

38 Murata and Amari (1999) address generic stochastic gradient algorithms with a constant scaling matrix. [sent-58, score-0.251]

39 Our result (Bottou and Le Cun, 2003) holds when the scaling matrix Φt depends on the previously seen examples, and also holds when the stochastic update is perturbed by unspecified second order terms, as in equation (5). [sent-59, score-0.24]

40 Result (6) applies to both the online θ(t) and batch θ*n sequences. [sent-61, score-0.762]

41 This constant is neither affected by the second order terms of (5) nor by the convergence speed of the scaling matrix Φt toward H−1 . [sent-63, score-0.463]

42 Equation (6) then indicates that the convergence speed saturates the Cramer-Rao bound. [sent-65, score-0.327]

43 This fact was known in the case of the natural gradient algorithm (Amari, 1998). [sent-66, score-0.134]

44 It remains true for a large class of online learning algorithms. [sent-67, score-0.424]

45 Result (6) suggests that the scaling matrix Φt should be a full rank approximation of the Hessian H. [sent-68, score-0.131]

46 The computational cost of each iteration can be drastically reduced by maintaining only a coarse approximation of the Hessian (e. [sent-70, score-0.159]

47 A proper setup ensures that the convergence speed remains O (1/t) despite a less favorable constant factor. [sent-74, score-0.366]

48 The similar nature of the convergence of the batch and online sequences can be summarized as follows. [sent-75, score-0.985]

49 Consider two optimally designed batch and online learning algorithms. [sent-76, score-0.797]

50 The best generalization error is asymptotically achieved by the learning algorithm that uses the most examples within the allowed time. [sent-77, score-0.338]

51 4 Computational Cost The discussion so far has established that a properly designed online learning algorithm performs as well as any batch learning algorithm for the same number of examples. [sent-78, score-0.93]

52 We now establish that, given the same computing resources, an online learning algorithm can asymptotically process more examples than a batch learning algorithm. [sent-79, score-1.105]

53 Each iteration of a batch learning algorithm running on N training examples requires a time K1 N + K2 . [sent-80, score-0.732]

54 Result (6) provides the following asymptotic equivalence: (θ*N − θ*)² ∼ 1/N. The batch algorithm must perform enough iterations to approximate θ*N with at least the same accuracy (∼ 1/N). [sent-82, score-0.539]
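
A quick check (synthetic, not from the paper) that an idealized quadratically convergent batch optimizer needs a number of iterations growing roughly like log log N to reach the required accuracy ∼ 1/N.

```python
import math

def iterations_to_accuracy(N, e0=0.5):
    # Count iterations of an idealized quadratically convergent method until error <= 1/N.
    e, k = e0, 0
    while e > 1.0 / N:
        e, k = e * e, k + 1
    return k

for N in (10**3, 10**6, 10**12, 10**24):
    # The iteration count grows like log log N up to constants.
    print(N, iterations_to_accuracy(N), round(math.log(math.log(N)), 2))
```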

55 An efficient algorithm with quadratic convergence achieves this after a number of iterations asymptotically proportional to log log N . [sent-83, score-0.566]

56 Running an online learning algorithm requires a constant time K3 per processed example. [sent-84, score-0.503]

57 Let us call T the number of examples processed by the online learning algorithm using the same computing resources as the batch algorithm. [sent-85, score-1.016]

58 We then have: K3 T ∼ (K1 N + K2) log log N =⇒ T ∼ N log log N. The parameter θ(T) of the online algorithm also converges according to (6). [sent-86, score-0.809]
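
A small numeric illustration of the time-budget argument with assumed cost constants K1, K2, K3 (the values below are arbitrary): for the same computing time the online algorithm processes T ∼ N log log N examples, which is the source of the O(log log N) accuracy factor discussed in the following sentences.

```python
import math

K1, K2, K3 = 1.0, 100.0, 1.0                # assumed per-example / per-iteration time constants
for N in (10**4, 10**6, 10**9):
    batch_time = (K1 * N + K2) * math.log(math.log(N))   # ~log log N full passes over N examples
    T = batch_time / K3                                   # examples the online algorithm processes instead
    print(N, int(T), round(T / N, 2))                     # the ratio T/N grows like log log N
```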

59 Comparing the accuracies of both algorithms shows that the online algorithm asymptotically provides a better solution by a factor O (log log N ). [sent-87, score-0.733]

60 (θ(T) − θ*)² ∼ 1/(N log log N), whereas (θ*N − θ*)² ∼ 1/N. This log log N factor corresponds to the number of iterations required by the batch algorithm. [sent-88, score-0.728]

61 Experience shows however that online algorithms are considerably easier to implement. [sent-91, score-0.441]

62 Each iteration of the batch algorithm involves a large summation over all the available examples. [sent-92, score-0.486]

63 On the other hand, each iteration of the online algorithm only involves one random example which can then be discarded. [sent-94, score-0.502]

64 The examples are input/output pairs (x, y) with x ∈ R20 and y = ±1. [sent-96, score-0.092]

65 f(x) = 1.71 tanh(0.66x) is the standard sigmoid discussed in LeCun et al. [sent-101, score-0.095]

66 The sigmoid generates various curvature conditions in the parameter space, including negative curvature and plateaus. [sent-103, score-0.21]
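
A sketch of the experimental model and loss, assuming the reconstruction above: ŷ = f(θᵀx) with f(x) = 1.71 tanh(0.66 x) and squared error on labels y = ±1. The exact constants and the data generation below are assumptions, not taken verbatim from the paper.

```python
import numpy as np

def f(s):
    return 1.71 * np.tanh(0.66 * s)          # assumed standard sigmoid (see reconstructed sentence 65)

def f_prime(s):
    return 1.71 * 0.66 * (1.0 - np.tanh(0.66 * s) ** 2)

def loss(theta, x, y):
    return 0.5 * (y - f(theta @ x)) ** 2     # assumed squared loss on labels y = +/-1

def loss_grad(theta, x, y):
    s = theta @ x
    return -(y - f(s)) * f_prime(s) * x

# Illustrative data in the spirit of the experiment: x in R^20, y = +/-1.
rng = np.random.default_rng(1)
theta_star = rng.normal(size=20)
x = rng.normal(size=20)
y = np.sign(theta_star @ x)
print(loss(np.zeros(20), x, y), loss_grad(np.zeros(20), x, y)[:3])
```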

67 This simple model represents well the final convergence phase of the learning process. [sent-104, score-0.273]

68 Two separate sets for training and testing were drawn with 1 000 000 examples each. [sent-115, score-0.253]

69 Each learning algorithm is trained using various numbers of examples taken sequentially from the beginning of the permuted sets. [sent-117, score-0.222]

70 Batch-Newton algorithm The reference batch algorithm uses the Newton-Raphson algorithm with Gauss-Newton approximation (Le Cun et al. [sent-119, score-0.562]

71 Each iteration visits all the training examples and computes both the gradient g and the Gauss-Newton approximation H of the Hessian matrix. [sent-121, score-0.232]
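
A sketch of one Batch-Newton iteration under the model sketched earlier: the full gradient g and the Gauss-Newton approximation H of the Hessian are accumulated over every training example, then θ ← θ − H⁻¹g. The small ridge term is added here only for numerical safety and is not mentioned in the paper.

```python
import numpy as np

def batch_newton_step(theta, X, Y, f, f_prime, ridge=1e-8):
    # One batch iteration: visit all training examples, accumulate gradient g and Gauss-Newton H.
    d = theta.shape[0]
    g = np.zeros(d)
    H = ridge * np.eye(d)
    for x, y in zip(X, Y):
        s = theta @ x
        r = y - f(s)                              # residual of the squared loss 0.5 * r^2
        g += -r * f_prime(s) * x                  # exact gradient term
        H += (f_prime(s) ** 2) * np.outer(x, x)   # Gauss-Newton term (drops the r * f'' part)
    return theta - np.linalg.solve(H, g)          # Newton-Raphson update with H ~ Hessian
```

Iterating this step with the f and f_prime sketched above gives one plausible reading of the reference batch algorithm.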

72 Online-Kalman algorithm The online algorithm performs a single sequential sweep over the training examples. [sent-125, score-0.628]

73 The parameter vector is updated after processing each example (xt, yt) as follows: θt = θt−1 − (1/τ) Φt ∂L/∂θ(xt, yt, θt−1). The scalar τ = max(20, t − 40) makes sure that the first few examples do not cause impractically large parameter updates. [sent-126, score-0.234]

74 The scaling matrix Φt is equal to the inverse of a leaky average of the per-example Gauss-Newton approximation of the Hessian. [sent-127, score-0.094]

75 Φt = [ (1 − 2/τ) Φt−1⁻¹ + (2/τ) (f′(θt−1 xt))² xt xtᵀ ]⁻¹. The implementation avoids the matrix inversions by directly computing Φt from Φt−1 using the matrix inversion lemma. [sent-128, score-0.27]
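
Because the bracketed leaky average is a rank-one modification of Φt−1⁻¹, Φt follows from Φt−1 via the matrix inversion lemma (Sherman-Morrison) with no explicit inversion, as the sentence notes. The sketch below assumes the reconstructed formula above; the function name and the consistency check are illustrative.

```python
import numpy as np

def update_scaling_matrix(Phi, x, c, tau):
    # Sherman-Morrison update of Phi_t = [ (1-2/tau) Phi_{t-1}^{-1} + (2/tau) c^2 x x^T ]^{-1},
    # where c = f'(theta_{t-1} . x_t); no matrix is ever inverted explicitly.
    alpha = 1.0 - 2.0 / tau
    beta = (2.0 / tau) * c ** 2
    Phi_x = Phi @ x
    return (Phi - np.outer(Phi_x, Phi_x) * (beta / (alpha + beta * (x @ Phi_x)))) / alpha

# Consistency check against a direct inversion (illustrative sizes and values).
rng = np.random.default_rng(0)
d, tau, c = 5, 25.0, 0.8
Phi = np.eye(d)
x = rng.normal(size=d)
direct = np.linalg.inv((1 - 2 / tau) * np.linalg.inv(Phi) + (2 / tau) * c**2 * np.outer(x, x))
print(np.allclose(update_scaling_matrix(Phi, x, c, tau), direct))  # True
```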

76 The resulting algorithm slightly differs from the Adaptive Natural Gradient algorithm (Amari, Park, and Fukumizu, 1998). [sent-139, score-0.098]

77 The 1/t (or 1/τ ) schedule is asymptotically optimal. [sent-141, score-0.132]

78 Results The optimal parameter vector θ* was first computed on the testing set using the Batch-Newton approach. [sent-142, score-0.094]

79 Figure 1 plots the average squared distance between the optimal parameter vector θ ∗ and the parameter vector θ achieved on training sets of various sizes. [sent-144, score-0.193]

80 Both the batch points and the online points join the theoretical prediction when the training set size increases. [sent-146, score-0.881]

81 The online algorithm gradually becomes more efficient when the training set size increases. [sent-148, score-0.521]

82 This happens because the batch algorithm needs to perform additional iterations in order to maintain the same level of accuracy. [sent-149, score-0.461]

83 Figure 3 displays a logarithmic plot of the difference between the MSE and the best achievable MSE, that is to say the MSE achieved by parameter vector θ ∗ . [sent-151, score-0.102]

84 Both algorithms yield virtually identical errors for the same training set size. [sent-153, score-0.135]

85 This suggests that the small differences shown in figure 1 occur along the low curvature directions of the cost function. [sent-154, score-0.152]

86 The online algorithm always provides higher accuracy in significantly less time. [sent-156, score-0.47]

87 As expected from the theoretical argument, the online algorithm asymptotically outperforms the super-linear Newton-Raphson algorithm3 . [sent-157, score-0.606]

88 More importantly, the online algorithm achieves this result by performing a single sweep over the training data. [sent-158, score-0.579]

89 Figure 3: Average test MSE as a function of the number of examples (left). [sent-168, score-0.092]

90 Conclusion Many popular algorithms do not scale well to large numbers of examples because they were designed with small data sets in mind. [sent-175, score-0.144]

91 Our baseline super-linear batch algorithm learns in N log log N time. [sent-177, score-0.58]

92 We demonstrate that adequate online algorithms asymptotically achieve the same generalization performance in N time after a single sweep on the training set. [sent-178, score-0.805]

93 The convergence of learning algorithms is usually described in terms of a search phase followed by a final convergence phase (Bottou and Murata, 2002). [sent-179, score-0.563]

94 , 1998) suggests that online algorithms outperform batch algorithms during the search phase. [sent-181, score-0.945]

95 The present work provides both theoretical and experimental evidence that an adequate online algorithm outperforms any batch algorithm during the final convergence phase as well. [sent-182, score-1.227]

96 Appendix4: Sketch of the convergence speed proof. Lemma — Let (ut) be a sequence of positive reals verifying the following recurrence: ut = (1 − α/t + o(1/t)) ut−1 + β/t² + o(1/t²)   (7). The lemma states that t ut → β/(α − 1) when α > 1 and β > 0. [sent-183, score-0.556]

97 However, it is easy to illustrate this convergence with simple numerical simulations. [sent-185, score-0.188]
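
The sentence above notes that the convergence is easy to illustrate numerically; here is one such simulation with assumed constants, checking that t · ut approaches β/(α − 1) for recurrence (7) with the o(·) terms dropped.

```python
alpha, beta = 2.0, 3.0          # assumed constants with alpha > 1, beta > 0
u = 1.0                         # arbitrary positive starting value u_0
for t in range(1, 200001):
    u = (1.0 - alpha / t) * u + beta / t ** 2   # recurrence (7) without the o(.) terms
print(200000 * u, beta / (alpha - 1.0))          # t * u_t approaches beta / (alpha - 1)
```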

98 Convergence speed — Consider the following recursive stochastic process: θ(t) = θ(t − 1) − (1/t) Φt ∂L/∂θ(zt, θ(t − 1)) + O(1/t²)   (8). Our discussion addresses the final convergence phase of this process. [sent-186, score-0.475]

99 There are many possible flavors of convergence, including uniform convergence, almost sure convergence, convergence in probability, etc. [sent-202, score-0.22]

100 The result still holds because the contribution of zt vanishes quickly when t grows large. [sent-210, score-0.286]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('online', 0.389), ('batch', 0.373), ('cun', 0.235), ('bottou', 0.224), ('mse', 0.213), ('murata', 0.205), ('convergence', 0.188), ('zt', 0.182), ('amari', 0.143), ('speed', 0.139), ('asymptotically', 0.132), ('fresh', 0.127), ('hessian', 0.107), ('xt', 0.097), ('le', 0.097), ('ut', 0.097), ('examples', 0.092), ('chambers', 0.088), ('filled', 0.088), ('hollow', 0.088), ('gradient', 0.085), ('training', 0.083), ('log', 0.079), ('cn', 0.078), ('fukumizu', 0.076), ('circles', 0.076), ('nec', 0.07), ('au', 0.07), ('optimum', 0.066), ('remark', 0.066), ('cost', 0.064), ('iteration', 0.064), ('practically', 0.062), ('adequate', 0.061), ('schnabel', 0.059), ('tsypkin', 0.059), ('stochastic', 0.058), ('sweep', 0.058), ('scaling', 0.056), ('zi', 0.056), ('labs', 0.056), ('parameter', 0.055), ('sigmoid', 0.053), ('algorithms', 0.052), ('curvature', 0.051), ('repetitively', 0.051), ('phase', 0.05), ('algorithm', 0.049), ('hastie', 0.048), ('resources', 0.048), ('sketch', 0.047), ('achievable', 0.047), ('unspeci', 0.046), ('permuted', 0.046), ('yann', 0.046), ('asymptotic', 0.046), ('holds', 0.044), ('abundant', 0.043), ('comparatively', 0.043), ('dennis', 0.043), ('milliseconds', 0.043), ('outperform', 0.042), ('toward', 0.042), ('et', 0.042), ('lecun', 0.041), ('tr', 0.041), ('recursive', 0.04), ('drawn', 0.039), ('gray', 0.039), ('favorable', 0.039), ('hundred', 0.039), ('iterations', 0.039), ('testing', 0.039), ('matrix', 0.038), ('nal', 0.038), ('cpu', 0.037), ('america', 0.037), ('suggests', 0.037), ('running', 0.036), ('theoretical', 0.036), ('supply', 0.036), ('utilizes', 0.036), ('empirical', 0.035), ('learning', 0.035), ('proof', 0.035), ('sequences', 0.035), ('zn', 0.033), ('princeton', 0.033), ('bengio', 0.033), ('vapnik', 0.032), ('park', 0.032), ('sure', 0.032), ('provides', 0.032), ('relation', 0.031), ('maintaining', 0.031), ('converge', 0.031), ('generalization', 0.03), ('grows', 0.03), ('processed', 0.03), ('quickly', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 102 nips-2003-Large Scale Online Learning

Author: Léon Bottou, Yann L. Cun

Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1

2 0.17084116 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

3 0.16796437 145 nips-2003-Online Classification on a Budget

Author: Koby Crammer, Jaz Kandola, Yoram Singer

Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1

4 0.14865099 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

Author: Justin Werfel, Xiaohui Xie, H. S. Seung

Abstract: Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; with sufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective. 1

5 0.12710249 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

6 0.11177588 176 nips-2003-Sequential Bayesian Kernel Regression

7 0.087727889 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

8 0.086135104 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

9 0.083712578 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

10 0.078681402 171 nips-2003-Semi-Definite Programming by Perceptron Learning

11 0.07700289 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

12 0.075852036 78 nips-2003-Gaussian Processes in Reinforcement Learning

13 0.074161045 146 nips-2003-Online Learning of Non-stationary Sequences

14 0.071874052 41 nips-2003-Boosting versus Covering

15 0.068837434 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

16 0.068615548 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

17 0.068384647 55 nips-2003-Distributed Optimization in Adaptive Networks

18 0.059725866 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

19 0.05878 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

20 0.057475798 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.22), (1, 0.029), (2, -0.05), (3, -0.125), (4, 0.139), (5, 0.062), (6, 0.082), (7, 0.197), (8, 0.06), (9, 0.067), (10, -0.071), (11, -0.007), (12, 0.012), (13, 0.063), (14, -0.049), (15, -0.123), (16, -0.076), (17, -0.073), (18, 0.083), (19, 0.108), (20, 0.071), (21, -0.046), (22, -0.077), (23, 0.096), (24, -0.066), (25, 0.111), (26, 0.009), (27, -0.008), (28, -0.019), (29, -0.015), (30, -0.062), (31, 0.025), (32, -0.073), (33, 0.085), (34, -0.011), (35, 0.073), (36, -0.143), (37, 0.004), (38, 0.04), (39, 0.008), (40, 0.079), (41, 0.004), (42, -0.201), (43, -0.014), (44, -0.053), (45, -0.01), (46, -0.071), (47, 0.004), (48, 0.033), (49, -0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96287918 102 nips-2003-Large Scale Online Learning

Author: Léon Bottou, Yann L. Cun

Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1

2 0.68896413 145 nips-2003-Online Classification on a Budget

Author: Koby Crammer, Jaz Kandola, Yoram Singer

Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1
4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work. 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x1 , y1 ), . . . , (xt , yt ), . . .. we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1} but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. The instances for which αt > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4] the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework although the constants βt , αt and ct are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. In Input: Tolerance β. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. Insert Ct ← Ct−1 ∪ {t}. 2. Set αt = 1. 3. Compute wt ← wt−1 + yt αt xt . 4. DistillCache(Ct , wt , (α1 , . . . , αt )). Output : H(x) = sign(wT · x). Figure 1: The aggressive Perceptron algorithm with a variable-size cache. this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. 
(See for instance [7] for formal results connecting the number of support patterns to the generalization error.) In Sec. 3 we present a formal analysis and Input: C, w, (α1 , . . . , αt ). the algorithmic details of our approach. Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). patterns in an online setting. Denote by Ct the indices of patterns which consti• if no such i exists then return. tute the classification vector wt . That is, • Remove the example i : i ∈ Ct if and only if αi > 0 on round 1. αi = 0. t when xt is received. The online classification algorithms discussed above keep 2. w ← w − αi yi xi . enlarging Ct – once an example is added 3. C ← C/{i} to Ct it will never be deleted. However, Return : C, w, (α1 , . . . , αt ). as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . We suggest a modification of the online algorithm structure as follows. Whenever yt i 0. Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound wT 2 from above and below and compare the 2 t bounds. Denote by αi the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by αi to be the weight of the ith example on round ˜t t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. First, we derive a lower bound on wT 2 by bounding the term wT · u from below in a recursive manner. T αt yt (xt · u) wT · u = t∈CT T αt = γ S . ≥ γ (1) t∈CT We now turn to upper bound wT 2 . Recall that each example may be added to the cache and removed from the cache a single time. Let us write wT 2 as a telescopic sum, wT 2 = ( wT 2 ˜ − wT 2 ˜ ) + ( wT 2 − wT −1 2 ˜ ) + . . . + ( w1 2 − w0 2 ) . (2) We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, ˜ ( wt 2 − wt−1 2 ) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of summands in Equ. (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). 
We now turn to a formal analysis of the variable-size cache and bound the number of support patterns it can hold.

Theorem: Let (x1, y1), . . . , (xT, yT) be a sequence of examples enclosed in a ball of radius R, and assume there exists a unit vector u (with ||u|| = 1) such that yt(xt · u) ≥ γ > 0 for all t. Then the number of support patterns S constituting the cache is at most S ≤ (R² + 2β)/γ².

Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound ||wT||² from above and below and compare the two bounds. Denote by αi^t the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, denote by α̃i^t the weight of the ith example on round t after stage 3, before calling the DistillCache procedure (Fig. 2). We analogously denote by wt and w̃t the corresponding instantaneous classifiers. First, we derive a lower bound on ||wT||² by bounding the term wT · u from below:

wT · u = Σt∈CT αt^T yt(xt · u) ≥ γ Σt∈CT αt^T = γ S .   (1)

We now turn to upper-bounding ||wT||². Recall that each example may be added to the cache and removed from the cache at most once. Let us write ||wT||² as a telescopic sum,

||wT||² = (||wT||² − ||w̃T||²) + (||w̃T||² − ||wT−1||²) + . . . + (||w1||² − ||w̃1||²) + (||w̃1||² − ||w0||²) .   (2)

We now consider three different scenarios that may occur for each new example. The first case is when the tth example is not inserted into the cache at all; in this case (||w̃t||² − ||wt−1||²) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds; then

||w̃t||² = ||wt−1 + yt xt||² = ||wt−1||² + 2yt(wt−1 · xt) + ||xt||² .

Since we inserted (xt, yt), the condition yt(wt−1 · xt) ≤ β must hold. Combining this with the assumption that the examples are enclosed in a ball of radius R we get

(||w̃t||² − ||wt−1||²) ≤ 2β + R² .

The last scenario occurs when an example is inserted into the cache on some round t and is later removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of the corresponding summands in Equ. (2):

(||w̃t||² − ||wt−1||²) + (||wt+p||² − ||w̃t+p||²)
  = 2yt(wt−1 · xt) + ||xt||² − 2yt(w̃t+p · xt) + ||xt||²
  = 2 [ yt(wt−1 · xt) − yt((w̃t+p − yt xt) · xt) ]
  ≤ 2 [ β − yt((w̃t+p − yt xt) · xt) ] .

Based on the form of the cache update we know that yt((w̃t+p − yt xt) · xt) ≥ β, and thus

(||w̃t||² − ||wt−1||²) + (||wt+p||² − ||w̃t+p||²) ≤ 0 .

Summarizing all three cases, we see that only the examples which persist in the cache contribute a factor of R² + 2β each to the bound on the telescopic sum of Equ. (2), while the rest of the examples do not contribute anything to the bound. Hence, we can bound the norm of wT as follows:

||wT||² ≤ S (R² + 2β) .   (3)

We finish the proof by applying the Cauchy-Schwarz inequality and the assumption ||u|| = 1. Combining Equ. (1) and Equ. (3) we get

γ² S² ≤ (wT · u)² ≤ ||wT||² ||u||² ≤ S (2β + R²) ,

which gives the desired bound.

4 Experiments

In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also briefly describe another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of answering the following questions. First, what is the effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data)? Second, is the algorithm described in Fig. 2 able to find a cache size that achieves the best generalization performance? To examine each question separately we used a modified version of the algorithm described in Fig. 2 in which we restricted ourselves to a fixed, bounded cache. This modified algorithm (which we refer to as the fixed-budget Perceptron) simulates the original Perceptron algorithm with one notable difference: when the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it, so the number of support patterns can never exceed the pre-determined limit. The modified algorithm is described in Fig. 3. It deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3).

Input: tolerance β, cache limit n.
Initialize: set αt = 0 for all t, w0 = 0, C0 = ∅.
Loop: for t = 1, 2, . . . , T
• Get a new instance xt ∈ Rn.
• Predict ŷt = sign(xt · wt−1).
• Get the new label yt.
• If yt(xt · wt−1) ≤ β, update:
  1. If |Ct−1| = n, remove one example:
     (a) Find i = arg maxj∈Ct−1 { yj((wt−1 − αj yj xj) · xj) }.
     (b) Update wt−1 ← wt−1 − αi yi xi.
     (c) Remove Ct−1 ← Ct−1 \ {i}.
  2. Insert Ct ← Ct−1 ∪ {t}.
  3. Set αt = 1.
  4. Compute wt ← wt−1 + yt αt xt.
Output: H(x) = sign(wT · x).
Figure 3: The aggressive Perceptron algorithm with a fixed-size cache.
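The eviction rule of line 1(a) in Fig. 3 can be sketched as follows for the linear case. This is again our own illustration rather than the authors' code, and the helper names are ours.

```python
import numpy as np

def evict_highest_margin(cache, w):
    """Lines 1(a)-(c) of Fig. 3 (linear case): discard the cached example that
    attains the highest margin once its own contribution is removed from w."""
    def margin_without(item):
        x_j, y_j, a_j = item[1]
        return y_j * np.dot(w - a_j * y_j * x_j, x_j)
    i, (x_i, y_i, a_i) = max(cache.items(), key=margin_without)
    return {k: v for k, v in cache.items() if k != i}, w - a_i * y_i * x_i

def fixed_budget_perceptron(stream, beta, dim, n):
    """Aggressive Perceptron whose cache never holds more than n patterns."""
    w = np.zeros(dim)
    cache = {}                                # t -> (x_t, y_t, alpha_t)
    for t, (x_t, y_t) in enumerate(stream):
        if y_t * np.dot(w, x_t) <= beta:
            if len(cache) == n:               # cache full: evict before inserting
                cache, w = evict_highest_margin(cache, w)
            cache[t] = (x_t, y_t, 1.0)
            w = w + y_t * x_t
    return w, cache
```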
Despite the simplicity of the original Perceptron algorithm [6], its good generalization performance on many datasets is remarkable. During the last few years a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problems directly, without any recourse to performing reductions.

The algorithms were evaluated on three natural datasets: mnist¹, usps² and letter³. The characteristics of these datasets are summarized in Table 1.

Table 1: Description of the datasets used in the experiments.
Name     No. of training examples    No. of test examples    No. of classes    No. of attributes
mnist    60000                       10000                   10                784
letter   16000                       4000                    26                16
usps     7291                        2007                    10                256

A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we evaluated are online, the specific ordering of the examples may well affect the generalization performance. We therefore report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free-parameter optimization was carried out; instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and an RBF kernel for the letter data set. (The variance of the RBF kernel was identical to the one used in [1].)

We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3, which uses a fixed limited budget; here we enumerated the cache-size limit in each experiment we performed. The sizes we tested are dataset dependent, but for each dataset we evaluated at least 10 different sizes. We note that such an enumeration cannot be done in an online fashion and that the goal of employing the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the run of the algorithm. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and considers all the examples at once during training, it serves as the gold-standard algorithm against which we want to compare performance.

¹ Available from http://www.research.att.com/~yann
² Available from ftp.kyb.tuebingen.mpg.de
³ Available from http://www.ics.uci.edu/~mlearn/MLRepository.html

Figure 4: Results on the three data sets - mnist (left), usps (center) and letter (right). Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size. (The three rows of plots report, from top to bottom, the test error, the online training errors and the training margin errors, each against the number of support patterns.)
Note that for the multiclass SVM we report the results obtained with the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig. 4. This figure is composed of three columns of plots, each corresponding to a different dataset - mnist (left), usps (center) and letter (right). In each plot the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the dependency of the performance on the size of the cache. The top row of the three columns shows the generalization error; thus the y-axis designates the test error of an algorithm on unseen data at the end of training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes for which the algorithm exhibits good performance. In fact, for mnist and usps there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case – the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns. The plot of online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test-error plots with the online-error plots, we see a close similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm. The third row gives the online training margin errors, which translate directly into the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come at a price: the algorithm makes significantly more insertions into the cache than the fixed-size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions, and the end result is very good overall performance.
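The close agreement between online error and test error suggests a simple size-selection heuristic. The sketch below is our own illustration, not a procedure from the paper: it assumes a hypothetical run_fixed_budget(stream, n) helper that runs the fixed-budget learner with cache limit n and returns its online error rate, and it picks the smallest cache size whose online error is within a small tolerance of the best one.

```python
def select_cache_size(make_stream, run_fixed_budget, sizes, tolerance=0.01):
    """Use the online training error as a proxy for the test error to pick a
    cache limit for the fixed-budget learner.

    `run_fixed_budget(stream, n)` is assumed to run the learner with cache
    limit n and return its fraction of online prediction errors. Among the
    candidate sizes, return the smallest one whose online error is within
    `tolerance` of the best observed online error.
    """
    errors = {n: run_fixed_budget(make_stream(), n) for n in sizes}
    best = min(errors.values())
    return min(n for n, e in errors.items() if e <= best + tolerance)
```

Here make_stream is called once per candidate size so that each run receives a fresh pass over the training data.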
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization with similar numbers of support patterns. While the SVM is still somewhat better in both respects on the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.

5 Summary

We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and both the size of the resulting classifier and its error rate are comparable to those of the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache; for instance, the weights of examples might convey information on their importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements will make the proposed approach a viable alternative to the SVM and other batch algorithms.

Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341.

References
[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[3] W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

3 0.64739347 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

Author: Justin Werfel, Xiaohui Xie, H. S. Seung

Abstract: Gradient-following learning methods can encounter problems of implementation in many applications, and stochastic variants are frequently used to overcome these difficulties. We derive quantitative learning curves for three online training methods used with a linear perceptron: direct gradient descent, node perturbation, and weight perturbation. The maximum learning rate for the stochastic methods scales inversely with the first power of the dimensionality of the noise injected into the system; with sufficiently small learning rate, all three methods give identical learning curves. These results suggest guidelines for when these stochastic methods will be limited in their utility, and considerations for architectures in which they will be effective. 1

4 0.63283658 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

5 0.54691261 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

Author: Eiji Mizutani, James Demmel

Abstract: The online incremental gradient (or backpropagation) algorithm is widely considered to be the fastest method for solving large-scale neural-network (NN) learning problems. In contrast, we show that an appropriately implemented iterative batch-mode (or block-mode) learning method can be much faster. For example, it is three times faster in the UCI letter classification problem (26 outputs, 16,000 data items, 6,066 parameters with a two-hidden-layer multilayer perceptron) and 353 times faster in a nonlinear regression problem arising in color recipe prediction (10 outputs, 1,000 data items, 2,210 parameters with a neuro-fuzzy modular network). The three principal innovative ingredients in our algorithm are the following: First, we use scaled trust-region regularization with inner-outer iteration to solve the associated “overdetermined” nonlinear least squares problem, where the inner iteration performs a truncated (or inexact) Newton method. Second, we employ Pearlmutter’s implicit sparse Hessian matrix-vector multiply algorithm to construct the Krylov subspaces used to solve for the truncated Newton update. Third, we exploit sparsity (for preconditioning) in the matrices resulting from the NNs having many outputs. 1

6 0.4848254 44 nips-2003-Can We Learn to Beat the Best Stock

7 0.47931775 51 nips-2003-Design of Experiments via Information Theory

8 0.45330393 176 nips-2003-Sequential Bayesian Kernel Regression

9 0.43701708 187 nips-2003-Training a Quantum Neural Network

10 0.41453499 196 nips-2003-Wormholes Improve Contrastive Divergence

11 0.36786076 171 nips-2003-Semi-Definite Programming by Perceptron Learning

12 0.36706185 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

13 0.36538249 55 nips-2003-Distributed Optimization in Adaptive Networks

14 0.36088794 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation

15 0.35584638 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science

16 0.34192082 146 nips-2003-Online Learning of Non-stationary Sequences

17 0.33525121 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

18 0.33409414 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

19 0.3305963 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

20 0.32602057 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.055), (11, 0.026), (29, 0.02), (30, 0.019), (35, 0.054), (48, 0.319), (53, 0.092), (69, 0.026), (71, 0.071), (76, 0.041), (85, 0.086), (91, 0.11), (99, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79940975 102 nips-2003-Large Scale Online Learning

Author: Léon Bottou, Yann L. Cun

Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1

2 0.72734731 120 nips-2003-Locality Preserving Projections

Author: Xiaofei He, Partha Niyogi

Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.

3 0.68356895 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

4 0.52673841 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

5 0.52495819 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

6 0.52356374 113 nips-2003-Learning with Local and Global Consistency

7 0.51877475 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

8 0.51837444 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

9 0.51727539 143 nips-2003-On the Dynamics of Boosting

10 0.51727068 126 nips-2003-Measure Based Regularization

11 0.51693898 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

12 0.51607049 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

13 0.5149051 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

14 0.51465952 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

15 0.51353598 3 nips-2003-AUC Optimization vs. Error Rate Minimization

16 0.51291144 107 nips-2003-Learning Spectral Clustering

17 0.51285946 158 nips-2003-Policy Search by Dynamic Programming

18 0.51250237 172 nips-2003-Semi-Supervised Learning with Trees

19 0.511518 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

20 0.51098949 30 nips-2003-Approximability of Probability Distributions