nips nips2003 nips2003-148 knowledge-graph by maker-knowledge-mining

148 nips-2003-Online Passive-Aggressive Algorithms


Source: pdf

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a unified view for online classification, regression, and uniclass problems. [sent-4, score-0.305]

2 We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. [sent-6, score-0.269]

3 A conversion of our main online algorithm to the setting of batch learning is also discussed. [sent-7, score-0.15]

4 The end result is new algorithms and accompanying loss bounds for the hinge-loss. [sent-8, score-0.203]

5 Specifically, we discuss online classification, online regression, and online uniclass prediction. [sent-10, score-0.52]

6 For concreteness we assume that these instances are vectors in Rn and denote the instance received on round t by xt . [sent-12, score-0.27]

7 Our goal in the uniclass problem is to find a center-point in Rn with a small Euclidean distance to all of the instances. [sent-15, score-0.198]

8 After receiving xt we extend a prediction ŷt using f . [sent-18, score-0.399]

9 For regression the prediction is simply ŷt = f (xt ) while for classification ŷt = sign(f (xt )). [sent-19, score-0.438]

10 After extending the prediction ŷt , we receive the true outcome yt . [sent-20, score-0.367]

11 We then suffer an instantaneous loss based on the discrepancy between yt and f (xt ). [sent-21, score-0.419]

12 The goal of the online learning algorithm is to minimize the cumulative loss. [sent-22, score-0.13]

13 For regression the ε-insensitive loss is ℓε(w; (x, y)) = 0 if |y − w · x| ≤ ε and |y − w · x| − ε otherwise, (1) while for classification the ε-insensitive loss is defined to be ℓε(w; (x, y)) = 0 if y(w · x) ≥ ε and ε − y(w · x) otherwise. (2) [sent-24, score-0.366]

14 As in other online algorithms the weight vector w is updated after receiving the feedback yt . [sent-25, score-0.35]

15 Therefore, we denote by wt the vector used for prediction on round t. [sent-26, score-0.801]

16 The setting for uniclass is slightly different as we only observe a sequence of instances. [sent-29, score-0.188]

17 The goal of the uniclass algorithm is to find a center-point w such that all instances xt fall within a radius of ε from w. [sent-30, score-0.251]

18 The vector wt therefore plays the role of the instantaneous center and is adapted after observing each instance xt . [sent-32, score-0.998]

19 If an example xt falls within a Euclidean distance ε from wt then we suffer no loss. [sent-33, score-0.97]

20 Otherwise, the loss is the distance between xt and a ball of radius ε centered at wt . [sent-34, score-1.131]

21 Formally, the uniclass loss is ℓε(wt ; xt ) = 0 if ‖xt − wt‖ ≤ ε and ‖xt − wt‖ − ε otherwise. (3) [sent-35, score-1.676]

22 In the next sections we give additive and multiplicative online algorithms for the above learning problems and prove respective online loss bounds. [sent-36, score-0.448]
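For concreteness, here is a minimal sketch of the three losses in Eqs. (1)-(3); the function names and the NumPy dependency are our illustrative assumptions, not the paper's code:

```python
import numpy as np

def regression_loss(w, x, y, eps):
    # Eq. (1): epsilon-insensitive absolute error
    return max(0.0, abs(y - np.dot(w, x)) - eps)

def classification_loss(w, x, y, eps):
    # Eq. (2): epsilon-insensitive margin loss, with y in {-1, +1}
    return max(0.0, eps - y * np.dot(w, x))

def uniclass_loss(w, x, eps):
    # Eq. (3): distance of x to the ball of radius eps centered at w
    return max(0.0, np.linalg.norm(x - w) - eps)
```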

23 Li and Long [14] were among the first to suggest the idea of converting a batch optimization problem into an online task. [sent-40, score-0.152]

24 In particular, Gentile and Warmuth [6] generalized and adapted techniques from [11] to the hinge loss which is closely related to the losses defined in Eqs. (1)-(2). [sent-42, score-0.23]

25 [10] discussed a general framework for gradient-based online learning where some of their bounds bear similarities to the bounds presented in this paper. [sent-45, score-0.167]

26 Our work also generalizes and greatly improves online loss bounds for classification given in [3]. [sent-46, score-0.277]

27 The main difference between these algorithms and the online algorithms presented in this paper lies in the analysis: while we derive worst case, finite horizon loss bounds, the optimization community is mostly concerned with asymptotic convergence properties. [sent-50, score-0.321]

28 Let zt = (xt , yt ) denote the instance-target pair received on round t where in the case of uniclass we set yt = 1 as a placeholder. [sent-54, score-0.565]

29 For a given example zt , let δ(w; zt ) denote the discrepancy of w on zt : for classification we set the discrepancy to be −yt (w · xt ) (the negative of the margin), for regression it is |yt − w · xt |, and for uniclass it is ‖xt − w‖. [sent-55, score-3.832]

30 Fixing zt , we also view δ(w; zt ) as a convex function of w. [sent-56, score-0.916]

31 To conclude, the loss in all three problems can be derived by applying the same hinge loss to different (problem dependent) discrepancies. [sent-61, score-0.355]

32 3 An Additive Algorithm for the Realizable Case Equipped with the simple unified notion of loss we describe in this section a single online algorithm that is applicable to all three problems. [sent-62, score-0.271]

33 Namely, we assume that ℓε⋆(w⋆ ; zt ) = 0 for all t, which implies that yt (w⋆ · xt ) ≥ |ε⋆| (classification). [sent-64, score-0.829]

34 The general method we use for deriving our on-line update rule is to define the new weight vector wt+1 as the solution to the following projection problem: wt+1 = argmin_w (1/2)‖w − wt‖² s.t. ℓ(w; zt ) = 0 , (5) [sent-70, score-0.823]

35 namely, wt+1 is set to be the projection of wt onto the set of all weight vectors that attain a loss of zero. [sent-72, score-1.367]

36 For the case of classification, C is a half space, C = {w : −yt (w · xt ) ≤ ε}. [sent-74, score-0.195]

37 For regression C is an ε-hyper-slab, C = {w : |w · xt − yt | ≤ ε}, and for uniclass it is a ball of radius ε centered at xt , C = {w : ‖w − xt‖ ≤ ε}. [sent-75, score-1.056]

38 This optimization problem attempts to keep wt+1 as close to wt as possible, while forcing wt+1 to achieve a zero loss on the most recent example. [sent-78, score-0.919]

39 The resulting algorithm is passive whenever the loss is zero, that is, wt+1 = wt whenever ℓ(wt ; zt ) = 0. [sent-79, score-1.385]

40 In contrast, on rounds for which ℓ(wt ; zt ) > 0 we aggressively force wt+1 to satisfy the constraint ℓ(wt+1 ; zt ) = 0. [sent-80, score-0.9]
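As a hedged worked example (the numbers are ours, not from the paper): take classification with ε = 1, wt = (0, 0), xt = (1, 0), and yt = +1. Then ℓ(wt ; zt ) = 1 − yt (wt · xt ) = 1 > 0, so the round is aggressive: vt = yt xt = (1, 0), τt = ℓ(wt ; zt )/‖vt‖² = 1, and wt+1 = (1, 0) attains yt (wt+1 · xt ) = 1, i.e. zero loss; a repeat of the same example would then be a passive round.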

41 The solution to the optimization problem in Eq. (5) yields the following update rule: if ℓ(wt ; zt ) > 0 then wt+1 = wt + τt vt , (6) where vt (see Table 1) is minus the gradient of the discrepancy. [sent-85, score-0.459]

42 On every round the additive PA algorithm (Figure 1) gets a new instance zt ∈ Rn, suffers the loss ℓ(wt ; zt ), and, whenever the loss is positive, sets vt and τt and applies the update of Eq. (6). [sent-86, score-1.844]

44 (Note that although the discrepancy might not be differentiable everywhere, its gradient exists whenever the loss is positive.) [sent-90, score-0.286]

45 To solve Eq. (5), first note that the equality constraint ℓ(w; zt ) = 0 is equivalent to the inequality constraint δ(w; zt ) ≤ ε. [sent-94, score-0.907]

46 The Lagrangian of the optimization problem is L(w, τ ) = (1/2)‖w − wt‖² + τ (δ(w; zt ) − ε) , (7) and Figure 2 gives an illustration of the update: wt+1 is found by projecting the current vector wt onto the set of vectors attaining a zero loss on zt . [sent-95, score-4.813]

47 To find a saddle point of L we first differentiate L with respect to w and use the fact that vt is minus the gradient of the discrepancy to get ∇w L = w − wt + τ ∇w δ = 0 ⇒ w = wt + τ vt . [sent-98, score-1.214]

48 Hence, whenever τ is positive (as in the case of non-zero loss), the inequality constraint δ(w; zt ) ≤ ε becomes an equality. [sent-100, score-0.483]

49 Simple algebraic manipulations yield that the value τ for which δ(w; zt ) = ε is, for all three problems, equal to τt = ℓ(wt ; zt )/‖vt‖² . [sent-101, score-1.092]
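To make the update concrete, below is a hedged sketch of the additive PA algorithm for binary classification; the function name, the stream interface, and the default ε are our assumptions, not the paper's code:

```python
import numpy as np

def pa_classify(stream, dim, eps=1.0):
    # Additive PA for binary classification: passive when the loss is zero,
    # aggressive (projection onto the zero-loss set) otherwise.
    w = np.zeros(dim)
    for x, y in stream:                        # y in {-1, +1}, x nonzero
        loss = max(0.0, eps - y * np.dot(w, x))    # Eq. (2)
        if loss > 0.0:
            v = y * x                          # minus gradient of -y(w . x)
            tau = loss / np.dot(v, v)          # tau_t = loss / ||v_t||^2
            w = w + tau * v                    # Eq. (6)
    return w
```

After the aggressive step, y(w · x) increases by exactly tau · ‖x‖² = loss, so the constraint ℓ(wt+1 ; zt ) = 0 holds, matching the projection interpretation above.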

50 However, in the case of uniclass initializing w1 to be the zero vector might incur large losses if, for instance, all the instances are located far away from the origin. [sent-107, score-0.271]

51 A more sensible choice for uniclass is to initialize w1 to be one of the examples. [sent-108, score-0.201]

52 To conclude this section we note that for all three cases the weight vector wt is a linear combination of the instances. [sent-110, score-0.804]
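Because wt is a linear combination of the instances, the same algorithm can be run with a kernel by storing coefficients instead of an explicit weight vector; the bookkeeping below is a sketch under that observation (names and interface are ours):

```python
def kernel_pa_classify(stream, kernel, eps=1.0):
    # Represent w_t implicitly as sum_i alpha_i x_i, with alpha_i = tau_i * y_i.
    # e.g. kernel = lambda a, b: (1.0 + a @ b) ** 2
    support, alpha = [], []
    for x, y in stream:
        score = sum(a * kernel(s, x) for a, s in zip(alpha, support))
        loss = max(0.0, eps - y * score)
        if loss > 0.0:
            tau = loss / kernel(x, x)   # ||v_t||^2 = K(x_t, x_t) in feature space
            support.append(x)
            alpha.append(tau * y)       # w <- w + tau * y * x
    return support, alpha
```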

53 4 Analysis The following theorem provides a unified loss bound for all three settings. [sent-112, score-0.225]

54 Assume that there exist w⋆ and ε⋆ such that ℓε⋆(w⋆ ; zt ) = 0 for all t. [sent-121, score-0.455]

55 Then if the additive PA algorithm is run with ε ≥ ε⋆ , the following bound holds for any T ≥ 1: Σ_{t=1}^{T} (ℓ(wt ; zt ))² + 2(ε − ε⋆) Σ_{t=1}^{T} ℓ(wt ; zt ) ≤ B ‖w⋆ − w1‖² , (8) where for classification and regression B is a bound on the squared norm of the instances (∀t : B ≥ ‖xt‖²) and B = 1 for uniclass. [sent-122, score-1.384]
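As a quick numeric sanity check of Eq. (8) (entirely our construction: the classification loss of Eq. (2) with ε = 1, a comparison vector w⋆ with margin 1 so ε = ε⋆ and the second term drops, and w1 = 0), one can run the additive update on a synthetically separable stream and verify that the cumulative squared loss never exceeds B ‖w⋆ − w1‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -2.0])
X = rng.normal(size=(500, 2))
X = X[np.abs(X @ w_star) >= 1.0]   # keep examples w_star labels with margin >= 1
y = np.sign(X @ w_star)

w, total_sq_loss = np.zeros(2), 0.0
for x, yt in zip(X, y):
    loss = max(0.0, 1.0 - yt * np.dot(w, x))
    total_sq_loss += loss ** 2
    if loss > 0.0:
        w += (loss / np.dot(x, x)) * yt * x
B = (X ** 2).sum(axis=1).max()                        # max squared instance norm
assert total_sq_loss <= B * np.dot(w_star, w_star)    # Eq. (8) with w1 = 0
```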

56 In the following we prove the lower bound ∆t ≥ ℓ(wt ; zt ) (ℓ(wt ; zt ) + 2(ε − ε⋆)) / B . (10) [sent-127, score-0.954]

57 First note that we do not modify wt if ℓ(wt ; zt ) = 0. [sent-128, score-1.192]

58 Therefore, this inequality trivially holds when ℓ(wt ; zt ) = 0 and thus we can restrict ourselves to rounds on which the discrepancy is larger than ε, which implies that ℓ(wt ; zt ) = δ(wt ; zt ) − ε. [sent-129, score-1.485]

59 Let t be such a round; then by rewriting wt+1 as wt + τt vt we get ∆t = ‖wt − w⋆‖² − ‖wt+1 − w⋆‖² = ‖wt − w⋆‖² − ‖wt + τt vt − w⋆‖² = −τt² ‖vt‖² + 2τt vt · (w⋆ − wt ) . (11) [sent-130, score-5.424]

60 Using the fact that −vt is the gradient of the convex function δ(w; zt ) at wt we have δ(w⋆ ; zt ) − δ(wt ; zt ) ≥ (−vt ) · (w⋆ − wt ) . (12) [sent-131, score-2.862]

61 Rearranging Eq. (12) we get vt · (w⋆ − wt ) ≥ (δ(wt ; zt ) − ε) + (ε − δ(w⋆ ; zt )) . (13) [sent-133, score-1.835]

62 Recall that δ(wt ; zt ) − ε = ℓ(wt ; zt ) and that ε⋆ ≥ δ(w⋆ ; zt ). [sent-134, score-1.329]

63 Therefore, (δ(wt ; zt ) − ε) + (ε − δ(w⋆ ; zt )) ≥ ℓ(wt ; zt ) + (ε − ε⋆) . (14) [sent-135, score-1.329]

64 Combining Eqs. (11) and (13)-(14) we get ∆t ≥ −τt² ‖vt‖² + 2τt (ℓ(wt ; zt ) + (ε − ε⋆)) = τt (−τt ‖vt‖² + 2 ℓ(wt ; zt ) + 2(ε − ε⋆)) . (15) [sent-138, score-1.902]

65 Plugging τt = ℓ(wt ; zt )/‖vt‖² into Eq. (15) we get ∆t ≥ ℓ(wt ; zt ) (ℓ(wt ; zt ) + 2(ε − ε⋆)) / ‖vt‖² . [sent-140, score-1.097]

66 For uniclass ‖vt‖² is always equal to 1 by construction, and for classification and regression we have ‖vt‖² = ‖xt‖² ≤ B, which gives ∆t ≥ ℓ(wt ; zt ) (ℓ(wt ; zt ) + 2(ε − ε⋆)) / B . [sent-141, score-1.713]

67 Combining this lower bound with Eq. (9) we get Σ_{t=1}^{T} (ℓ(wt ; zt ))² + 2(ε − ε⋆) Σ_{t=1}^{T} ℓ(wt ; zt ) ≤ B ‖w⋆ − w1‖² . [sent-143, score-0.916]

68 Due to the realizability assumption, there exist w⋆ and ε⋆ such that for all t, ℓε⋆(w⋆ ; zt ) = 0, which implies that yt (w⋆ · xt ) ≥ −ε⋆ . [sent-148, score-0.841]

69 Dividing w⋆ by its norm we can rewrite the latter as yt (ŵ · xt ) ≥ ε̂ where ŵ = w⋆/‖w⋆‖ and ε̂ = |ε⋆|/‖w⋆‖ . [sent-149, score-0.395]

70 Now, setting ε = −1 we get that ℓ(w; z) = [1 − y(w · x)]+ – the hinge loss for classification. [sent-151, score-0.233]

71 We can thus apply the theorem above to obtain two loss bounds for the hinge loss in a classification setting. [sent-153, score-0.375]

72 The second term in Eq. (8) vanishes as ε = ε⋆ = −1, and thus Σ_{t=1}^{T} ([1 − yt (wt · xt )]+)² ≤ B ‖w⋆‖² = B/(ε̂)² . [sent-155, score-0.371]

73 We can immediately use this bound to derive a mistake bound for the PA algorithm. [sent-158, score-0.129]

74 Note that the algorithm makes a prediction mistake iff yt (wt · xt ) ≤ 0. [sent-159, score-0.416]

75 In this case, [1 − yt (wt · xt )]+ ≥ 1 and therefore the number of prediction mistakes is bounded by B/(ε̂)² . [sent-160, score-0.397]
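For instance (an illustrative numeric reading, not from the paper): if ‖xt‖² ≤ 1 for all t, so B = 1, and the data is separable with normalized margin ε̂ = 0.1, then the PA algorithm makes at most 1/(0.1)² = 100 prediction mistakes on the sequence.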

76 This bound is common to online algorithms for classification such as ROMMA [14]. [sent-161, score-0.165]

77 To get a bound on the cumulative hinge loss, we set w⋆ = 2ŵ/ε̂, which implies that ε⋆ = −2; plugging into Eq. (8) and dropping the squared-loss term we get 2(−1 − ε⋆) Σ_{t=1}^{T} [1 − yt (wt · xt )]+ ≤ B ‖w⋆‖² . [sent-165, score-0.538]

78 Since ε⋆ = −2, we can further simplify the above to Σ_{t=1}^{T} [1 − yt (wt · xt )]+ ≤ 2B/(ε̂)² . [sent-166, score-0.371]

79 To conclude this section, we would like to point out that the PA online algorithm can also be used as a building block for a batch algorithm. [sent-167, score-0.162]

80 By the theorem above, T is at most B ‖w⋆ − w1‖²/β, and by construction the loss of wT on any z ∈ S is at most √β. [sent-180, score-0.152]

81 Since wT achieves a small empirical loss and its norm is small, it can be shown using classical techniques (cf. [sent-182, score-0.165]

82 [15]) that the loss of wT on unseen data is small as well. [sent-183, score-0.141]

83 The case of uniclass is more involved and will be discussed in detail elsewhere. [sent-199, score-0.188]

84 The first is the insensitivity parameter ε, which defines the loss function as in the realizable case. [sent-201, score-0.238]

85 However, in this case we do not assume that there exists w⋆ that achieves zero loss over the sequence. [sent-202, score-0.154]

86 We instead measure the loss of the online algorithm relative to the loss of any vector w⋆ . [sent-203, score-0.413]

87 That is, if the loss on example zt is zero then wt+1 = wt . [sent-207, score-1.346]

88 In case the loss is positive the update rule is wt+1 = wt + τt vt where vt is the same as in the realizable case. [sent-208, score-1.336]

89 However, the scaling factor τt is modified and is set to τt = ℓ(wt ; zt ) / (‖vt‖² + γ) . [sent-209, score-0.443]
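In code, the only change relative to the realizable sketch above is the damped step size; a hedged fragment (names ours, not the paper's implementation):

```python
import numpy as np

def pa_unrealizable_step(w, v, loss, gamma):
    # Damped aggressive step: gamma > 0 caps how far a single (possibly
    # noisy) example can move the weight vector.
    tau = loss / (np.dot(v, v) + gamma)
    return w + tau * v
```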

90 The following theorem provides a loss bound for the online algorithm relative to the loss of any fixed weight vector w⋆ . [sent-210, score-0.687]

91 Then if the PA algorithm for the unrealizable case is run with ε and with γ > 0, the following bound holds for any T ≥ 1 and a constant B satisfying B ≥ ‖xt‖²: Σ_{t=1}^{T} (ℓ(wt ; zt ))² ≤ (γ + B) ‖w⋆ − w1‖² + (1 + B/γ) Σ_{t=1}^{T} (ℓ(w⋆ ; zt ))² . [sent-219, score-1.233]

92 The multiplicative PA algorithm maintains a weight vector wt ∈ Pn where Pn = {x ∈ Rn : xj ≥ 0 for all j, Σ_{j=1}^{n} xj = 1}. [sent-224, score-0.827]

93 The multiplicative update of wt is wt+1,j = (1/Zt ) wt,j e^{τt vt,j} , where vt is the same as the one used in the additive algorithm (Table 1), τt now becomes 4 ℓ(wt ; zt )/‖vt‖∞² for regression and classification and ℓ(wt ; zt )/(8 ‖vt‖∞²) for uniclass, and Zt = Σ_{j=1}^{n} wt,j e^{τt vt,j} is a normalization factor. [sent-225, score-2.546]
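A minimal sketch of this multiplicative step, assuming w lies on the simplex and τt is computed as described above (function name ours):

```python
import numpy as np

def multiplicative_pa_step(w, v, tau):
    u = w * np.exp(tau * v)   # componentwise exponentiated update
    return u / u.sum()        # dividing by Z_t keeps w_{t+1} in P^n
```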

94 For the multiplicative PA we can prove the following loss bound. [sent-226, score-0.192]

95 Assume that there exist w⋆ and ε⋆ such that ℓε⋆(w⋆ ; zt ) = 0 for all t. [sent-235, score-0.455]

96 An interesting question is whether the unified view of classification, regression, and uniclass can be exported and used with other algorithms for classification such as ROMMA [14] and ALMA [5]. [sent-239, score-0.211]

97 Another, rather general direction for possible extension surfaces when replacing the Euclidean distance between wt+1 and wt with other distances and divergences such as the Bregman divergence. [sent-240, score-0.769]

98 In this case it might be possible to derive general loss bounds, see for example [12]. [sent-242, score-0.156]

99 We are currently exploring generalizations of our framework to other decision tasks such as distance-learning [16] and online convex programming [17]. [sent-243, score-0.123]

100 Learning additive models online with fast evaluating kernels. [sent-291, score-0.14]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.749), ('zt', 0.443), ('xt', 0.195), ('uniclass', 0.188), ('vt', 0.181), ('yt', 0.176), ('loss', 0.141), ('online', 0.105), ('discrepancy', 0.074), ('regression', 0.071), ('hinge', 0.062), ('unrealizable', 0.059), ('pa', 0.056), ('realizable', 0.056), ('classi', 0.054), ('bound', 0.049), ('kivinen', 0.043), ('rn', 0.041), ('insensitivity', 0.041), ('cation', 0.038), ('additive', 0.035), ('multiplicative', 0.032), ('bounds', 0.031), ('batch', 0.031), ('instances', 0.031), ('herbster', 0.031), ('get', 0.03), ('uni', 0.03), ('update', 0.028), ('losses', 0.027), ('round', 0.025), ('norm', 0.024), ('theorem', 0.024), ('dre', 0.023), ('colt', 0.021), ('inequality', 0.021), ('gentile', 0.02), ('romma', 0.02), ('accompanying', 0.02), ('weight', 0.02), ('whenever', 0.019), ('prove', 0.019), ('instance', 0.019), ('helmbold', 0.019), ('rearranging', 0.019), ('convex', 0.018), ('holds', 0.018), ('margin', 0.018), ('algorithmic', 0.018), ('radius', 0.018), ('ball', 0.018), ('bregman', 0.017), ('gradient', 0.017), ('proof', 0.017), ('discuss', 0.017), ('mistake', 0.016), ('suffer', 0.016), ('optimization', 0.016), ('warmuth', 0.016), ('crammer', 0.016), ('wj', 0.015), ('pn', 0.015), ('derive', 0.015), ('prediction', 0.015), ('implies', 0.015), ('settings', 0.014), ('algorithm', 0.014), ('projection', 0.014), ('concludes', 0.014), ('trivially', 0.014), ('algebraic', 0.014), ('rounds', 0.014), ('euclidean', 0.013), ('otherwise', 0.013), ('zero', 0.013), ('singer', 0.013), ('initialize', 0.013), ('let', 0.013), ('receiving', 0.013), ('minus', 0.012), ('instantaneous', 0.012), ('namely', 0.012), ('vector', 0.012), ('view', 0.012), ('conclude', 0.012), ('equals', 0.012), ('run', 0.012), ('exist', 0.012), ('cumulative', 0.011), ('algorithms', 0.011), ('numerous', 0.011), ('community', 0.011), ('therefore', 0.011), ('three', 0.011), ('worst', 0.011), ('nips', 0.011), ('construction', 0.011), ('distance', 0.01), ('divergences', 0.01), ('exponentiated', 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

2 0.53928995 145 nips-2003-Online Classification on a Budget

Author: Koby Crammer, Jaz Kandola, Yoram Singer

Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times.

3 0.17084116 102 nips-2003-Large Scale Online Learning

Author: Léon Bottou, Yann L. Cun

Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1

4 0.16689864 176 nips-2003-Sequential Bayesian Kernel Regression

Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet

Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.

5 0.12115408 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

Author: Liva Ralaivola, Florence D'alché-buc

Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1

6 0.11983229 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

7 0.11259157 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

8 0.10088013 122 nips-2003-Margin Maximizing Loss Functions

9 0.098828018 41 nips-2003-Boosting versus Covering

10 0.098805115 146 nips-2003-Online Learning of Non-stationary Sequences

11 0.088550046 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels

12 0.079151466 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

13 0.07576827 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

14 0.06579791 108 nips-2003-Learning a Distance Metric from Relative Comparisons

15 0.059586927 48 nips-2003-Convex Methods for Transduction

16 0.052691888 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

17 0.052348338 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

18 0.044490725 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

19 0.043778472 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

20 0.043732665 171 nips-2003-Semi-Definite Programming by Perceptron Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.147), (1, 0.019), (2, -0.059), (3, -0.299), (4, 0.264), (5, 0.03), (6, 0.14), (7, 0.406), (8, 0.269), (9, 0.075), (10, 0.01), (11, -0.061), (12, -0.051), (13, 0.111), (14, -0.044), (15, -0.134), (16, -0.105), (17, -0.034), (18, -0.008), (19, 0.157), (20, 0.129), (21, -0.11), (22, 0.107), (23, 0.132), (24, 0.077), (25, 0.083), (26, 0.024), (27, -0.139), (28, 0.047), (29, 0.048), (30, 0.119), (31, -0.078), (32, -0.086), (33, -0.015), (34, 0.007), (35, 0.049), (36, 0.067), (37, 0.056), (38, -0.038), (39, -0.012), (40, -0.044), (41, 0.036), (42, -0.0), (43, -0.034), (44, -0.068), (45, -0.057), (46, -0.016), (47, 0.063), (48, 0.04), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98331469 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer


2 0.90926826 145 nips-2003-Online Classification on a Budget

Author: Koby Crammer, Jaz Kandola, Yoram Singer

Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1 Introduction and Motivation Kernel-based methods are widely being used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1] which can be used in conjunction with kernel functions. A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. A number of attempts have been made to speed up the training and testing of SVM’s by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 
4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work. 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x1 , y1 ), . . . , (xt , yt ), . . .. we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1} but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. The instances for which αt > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4] the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework although the constants βt , αt and ct are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. In Input: Tolerance β. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. Insert Ct ← Ct−1 ∪ {t}. 2. Set αt = 1. 3. Compute wt ← wt−1 + yt αt xt . 4. DistillCache(Ct , wt , (α1 , . . . , αt )). Output : H(x) = sign(wT · x). Figure 1: The aggressive Perceptron algorithm with a variable-size cache. this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. 
(See for instance [7] for formal results connecting the number of support patterns to the generalization error.) In Sec. 3 we present a formal analysis and Input: C, w, (α1 , . . . , αt ). the algorithmic details of our approach. Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). patterns in an online setting. Denote by Ct the indices of patterns which consti• if no such i exists then return. tute the classification vector wt . That is, • Remove the example i : i ∈ Ct if and only if αi > 0 on round 1. αi = 0. t when xt is received. The online classification algorithms discussed above keep 2. w ← w − αi yi xi . enlarging Ct – once an example is added 3. C ← C/{i} to Ct it will never be deleted. However, Return : C, w, (α1 , . . . , αt ). as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . We suggest a modification of the online algorithm structure as follows. Whenever yt i 0. Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound wT 2 from above and below and compare the 2 t bounds. Denote by αi the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by αi to be the weight of the ith example on round ˜t t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. First, we derive a lower bound on wT 2 by bounding the term wT · u from below in a recursive manner. T αt yt (xt · u) wT · u = t∈CT T αt = γ S . ≥ γ (1) t∈CT We now turn to upper bound wT 2 . Recall that each example may be added to the cache and removed from the cache a single time. Let us write wT 2 as a telescopic sum, wT 2 = ( wT 2 ˜ − wT 2 ˜ ) + ( wT 2 − wT −1 2 ˜ ) + . . . + ( w1 2 − w0 2 ) . (2) We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, ˜ ( wt 2 − wt−1 2 ) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of summands in Equ. (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). 
3 Analysis

Theorem 1 Let (x1, y1), . . . , (xT, yT) be a sequence of examples enclosed in a ball of radius R, and assume that there exists a unit vector u such that yt(u · xt) ≥ γ > 0 for all t. Then the number of support patterns constituting the cache is at most S ≤ (R² + 2β)/γ².

Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound ‖wT‖² from above and below and compare the two bounds. Denote by α_i^t the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by α̃_i^t the weight of the ith example on round t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously denote by wt and w̃t the corresponding instantaneous classifiers. First, we derive a lower bound on ‖wT‖² by bounding the term wT · u from below in a recursive manner,

wT · u = Σ_{t∈CT} αt yt (xt · u) ≥ γ Σ_{t∈CT} αt = γS. (1)

We now turn to upper bound ‖wT‖². Recall that each example may be added to the cache and removed from the cache a single time. Let us write ‖wT‖² as a telescopic sum,

‖wT‖² = (‖wT‖² − ‖w̃T‖²) + (‖w̃T‖² − ‖wT−1‖²) + · · · + (‖w̃1‖² − ‖w0‖²). (2)

We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, (‖w̃t‖² − ‖wt−1‖²) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds; thus,

‖w̃t‖² = ‖wt−1 + yt xt‖² = ‖wt−1‖² + 2yt(wt−1 · xt) + ‖xt‖².

Since we inserted (xt, yt), the condition yt(wt−1 · xt) ≤ β must hold. Combining this with the assumption that the examples are enclosed in a ball of radius R we get, (‖w̃t‖² − ‖wt−1‖²) ≤ 2β + R². The last scenario occurs when an example is inserted into the cache on some round t, and is then later removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of the summands in Equ. (2),

(‖w̃t‖² − ‖wt−1‖²) + (‖wt+p‖² − ‖w̃t+p‖²)
= 2yt(wt−1 · xt) + ‖xt‖² − 2yt(w̃t+p · xt) + ‖xt‖²
= 2 [yt(wt−1 · xt) − yt((w̃t+p − yt xt) · xt)]
≤ 2 [β − yt((w̃t+p − yt xt) · xt)].

Based on the form of the cache update we know that yt((w̃t+p − yt xt) · xt) ≥ β, and thus,

(‖w̃t‖² − ‖wt−1‖²) + (‖wt+p‖² − ‖w̃t+p‖²) ≤ 0.

Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R² + 2β each to the bound of the telescopic sum of Equ. (2), and the rest of the examples do not contribute anything to the bound. Hence, we can bound the norm of wT as follows,

‖wT‖² ≤ S(R² + 2β). (3)

We finish up the proof by applying the Cauchy-Schwarz inequality and the assumption ‖u‖ = 1. Combining Equ. (1) and Equ. (3) we get,

γ²S² ≤ (wT · u)² ≤ ‖wT‖² ‖u‖² ≤ S(2β + R²),

which gives the desired bound.

Input: Tolerance β, Cache Limit n.
Initialize: Set ∀t αt = 0, w0 = 0, C0 = ∅.
Loop: For t = 1, 2, . . . , T
• Get a new instance xt ∈ Rn.
• Predict ŷt = sign(xt · wt−1).
• Get a new label yt.
• if yt(xt · wt−1) ≤ β update:
  1. If |Ct−1| = n remove one example:
     (a) Find i = arg max_{j∈Ct−1} { yj((wt−1 − αj yj xj) · xj) }.
     (b) Update wt−1 ← wt−1 − αi yi xi.
     (c) Remove Ct−1 ← Ct−1 \ {i}.
  2. Insert Ct ← Ct−1 ∪ {t}.
  3. Set αt = 1.
  4. Compute wt ← wt−1 + yt αt xt.
Output: H(x) = sign(wT · x).
Figure 3: The aggressive Perceptron algorithm with a fixed-size cache.

4 Experiments

In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also briefly describe another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of answering the following questions. First, what is the effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data), and second, would the algorithm described in Fig. 2 be able to find an optimal cache size that achieves the best generalization performance. To examine each question separately we used a modified version of the algorithm described by Fig. 2 in which we restricted ourselves to a fixed bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) simulates the original Perceptron algorithm with one notable difference: when the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. This modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3).

Name     No. of Training Examples   No. of Test Examples   No. of Classes   No. of Attributes
mnist    60000                      10000                  10               784
letter   16000                      4000                   26               16
usps     7291                       2007                   10               256
Table 1: Description of the datasets used in experiments.

Despite the simplicity of the original Perceptron algorithm [6], its good generalization performance on many datasets is remarkable. During the last few years a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen to use the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problems directly, without any recourse to performing reductions.
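For comparison, here is a sketch of the fixed-budget variant of Fig. 3 under the same assumed cache representation; the helper name evict_weakest is hypothetical, and the eviction rule implements line 1(a) of the figure in kernel form.

```python
import numpy as np

def evict_weakest(cache, kernel):
    """Line 1(a) of Fig. 3: discard the cached example that attains the highest
    margin after the removal of its own contribution."""
    def residual_margin(i):
        _, y_i, x_i = cache[i]
        return y_i * sum(a * y * kernel(x, x_i)
                         for j, (a, y, x) in enumerate(cache) if j != i)
    del cache[max(range(len(cache)), key=residual_margin)]

def budget_perceptron(stream, beta, limit, kernel=lambda u, v: float(np.dot(u, v))):
    """Aggressive Perceptron with a hard cap on the number of support patterns."""
    cache = []
    for x_t, y_t in stream:
        s_t = sum(a * y * kernel(x_i, x_t) for a, y, x_i in cache)
        if y_t * s_t <= beta:
            if len(cache) == limit:   # make room before inserting (step 1 of Fig. 3)
                evict_weakest(cache, kernel)
            cache.append((1.0, y_t, x_t))
    return cache
```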
The algorithms were evaluated on three natural datasets: mnist1, usps2 and letter3. The characteristics of these datasets are summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we evaluated are online, the specific ordering of the examples may affect the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free-parameter optimization was carried out; instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and an RBF kernel for the letter data set. (The variance of the RBF kernel was identical to the one used in [1].) We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3 which uses a fixed limited budget; here we enumerated over cache-size limits in each experiment we performed. The different sizes that we tested are dataset dependent, but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion, and the goal of employing the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the run of the algorithm. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and during the training process it considers all the examples at once, it serves as our gold-standard algorithm against which we want to compare performance.

1 Available from http://www.research.att.com/~yann
2 Available from ftp.kyb.tuebingen.mpg.de
3 Available from http://www.ics.uci.edu/~mlearn/MLRepository.html
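As a small aside, the two kernels the experiments report can be written as follows; the RBF variance below is an assumed placeholder, since the paper takes its value from [1] without restating it.

```python
import numpy as np

def homogeneous_poly_kernel(x1, x2, degree=9):
    # Degree-9 homogeneous polynomial kernel, as reported for mnist and usps.
    return float(np.dot(x1, x2)) ** degree

def rbf_kernel(x1, x2, sigma=1.0):
    # RBF kernel, as reported for letter; sigma = 1.0 is an assumed placeholder value.
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))
```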
[Figure 4 plots omitted: three columns (mnist, usps, letter); the rows show the test error, the online training errors, and the training margin errors as a function of the number of support patterns, for the Fixed, Adaptive, SVM and MIRA variants.]
Figure 4: Results on three data sets - mnist (left), usps (center) and letter (right). Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size.

Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig. 4. This figure is composed of three different plots organized in columns. Each of these plots corresponds to a different dataset - mnist (left), usps (center) and letter (right). In each of the three plots the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the dependency of the performance on the size of the cache. The top row of the three columns shows the generalization error; thus the y-axis designates the test error of an algorithm on unseen data at the end of the training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes for which the algorithm exhibits good performance. In fact, for MNIST and USPS there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case – the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, the overall results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns. The plot of online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test error plots with the online error plots we see a close similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm. The third row gives the online training margin errors, which translate directly into the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come at a price: the algorithm makes significantly more insertions into the cache than the fixed-size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions, and the end result is very good overall performance.
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization performance and use similar numbers of support patterns. While the SVM is still somewhat better in both respects on the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.

5 Summary

We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and its performance, both in the size of the resulting classifier and in its error rate, is comparable to that of the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of examples might convey information on their importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements will make the proposed approach a viable alternative to SVM and other batch algorithms.

Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341.

References
[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[3] M. Mézard and W. Krauth. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing (MIT Press, 1988).)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

3 0.54846859 102 nips-2003-Large Scale Online Learning

Author: Léon Bottou, Yann L. Cun

Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1

4 0.4284665 176 nips-2003-Sequential Bayesian Kernel Regression

Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet

Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.

5 0.32242072 44 nips-2003-Can We Learn to Beat the Best Stock

Author: Allan Borodin, Ran El-Yaniv, Vincent Gogan

Abstract: A novel algorithm for actively trading stocks is presented. While traditional universal algorithms (and technical trading heuristics) attempt to predict winners or trends, our approach relies on predictable statistical relations between all pairs of stocks in the market. Our empirical results on historical markets provide strong evidence that this type of technical trading can "beat the market" and moreover, can beat the best stock in the market. In doing so we utilize a new idea for smoothing critical parameters in the context of expert learning.

6 0.30015323 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

7 0.29635623 146 nips-2003-Online Learning of Non-stationary Sequences

8 0.27721319 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

9 0.27491903 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels

10 0.26983458 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

11 0.23496853 122 nips-2003-Margin Maximizing Loss Functions

12 0.21272534 48 nips-2003-Convex Methods for Transduction

13 0.21269491 14 nips-2003-A Nonlinear Predictive State Representation

14 0.21094607 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

15 0.20354362 108 nips-2003-Learning a Distance Metric from Relative Comparisons

16 0.19149376 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion

17 0.17546386 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion

18 0.17123276 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

19 0.1697403 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

20 0.16917853 1 nips-2003-1-norm Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.031), (11, 0.012), (24, 0.329), (30, 0.015), (35, 0.054), (45, 0.011), (53, 0.072), (69, 0.012), (71, 0.074), (72, 0.012), (76, 0.034), (85, 0.1), (91, 0.082), (99, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81340718 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

2 0.8094759 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis

Author: James T. Kwok, Brian Mak, Simon Ho

Abstract: Eigenvoice speaker adaptation has been shown to be effective when only a small amount of adaptation data is available. At the heart of the method is principal component analysis (PCA) employed to find the most important eigenvoices. In this paper, we postulate that nonlinear PCA, in particular kernel PCA, may be even more effective. One major challenge is to map the feature-space eigenvoices back to the observation space so that the state observation likelihoods can be computed during the estimation of eigenvoice weights and subsequent decoding. Our solution is to compute kernel PCA using composite kernels, and we will call our new method kernel eigenvoice speaker adaptation. On the TIDIGITS corpus, we found that compared with a speaker-independent model, our kernel eigenvoice adaptation method can reduce the word error rate by 28–33% while the standard eigenvoice approach can only match the performance of the speaker-independent model. 1

3 0.67689222 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

Author: Liam Paninski, Eero P. Simoncelli, Jonathan W. Pillow

Abstract: Recent work has examined the estimation of models of stimulus-driven neural activity in which some linear filtering process is followed by a nonlinear, probabilistic spiking stage. We analyze the estimation of one such model for which this nonlinear step is implemented by a noisy, leaky, integrate-and-fire mechanism with a spike-dependent aftercurrent. This model is a biophysically plausible alternative to models with Poisson (memory-less) spiking, and has been shown to effectively reproduce various spiking statistics of neurons in vivo. However, the problem of estimating the model from extracellular spike train data has not been examined in depth. We formulate the problem in terms of maximum likelihood estimation, and show that the computational problem of maximizing the likelihood is tractable. Our main contribution is an algorithm and a proof that this algorithm is guaranteed to find the global optimum with reasonable speed. We demonstrate the effectiveness of our estimator with numerical simulations. A central issue in computational neuroscience is the characterization of the functional relationship between sensory stimuli and neural spike trains. A common model for this relationship consists of linear filtering of the stimulus, followed by a nonlinear, probabilistic spike generation process. The linear filter is typically interpreted as the neuron’s “receptive field,” while the spiking mechanism accounts for simple nonlinearities like rectification and response saturation. Given a set of stimuli and (extracellularly) recorded spike times, the characterization problem consists of estimating both the linear filter and the parameters governing the spiking mechanism. One widely used model of this type is the Linear-Nonlinear-Poisson (LNP) cascade model, in which spikes are generated according to an inhomogeneous Poisson process, with rate determined by an instantaneous (“memoryless”) nonlinear function of the filtered input. This model has a number of desirable features, including conceptual simplicity and computational tractability. Additionally, reverse correlation analysis provides a simple unbiased estimator for the linear filter [5], and the properties of estimators (for both the linear filter and static nonlinearity) have been thoroughly analyzed, even for the case of highly non-symmetric or “naturalistic” stimuli [12]. One important drawback of the LNP model, * JWP and LP contributed equally to this work. We thank E.J. Chichilnisky for helpful discussions. L−NLIF model LNP model )ekips(P Figure 1: Simulated responses of LNLIF and LNP models to 20 repetitions of a fixed 100-ms stimulus segment of temporal white noise. Top: Raster of responses of L-NLIF model, where σnoise /σsignal = 0.5 and g gives a membrane time constant of 15 ms. The top row shows the fixed (deterministic) response of the model with σnoise set to zero. Middle: Raster of responses of LNP model, with parameters fit with standard methods from a long run of the L-NLIF model responses to nonrepeating stimuli. Bottom: (Black line) Post-stimulus time histogram (PSTH) of the simulated L-NLIF response. (Gray line) PSTH of the LNP model. Note that the LNP model fails to preserve the fine temporal structure of the spike trains, relative to the L-NLIF model. 001 05 0 )sm( emit however, is that Poisson processes do not accurately capture the statistics of neural spike trains [2, 9, 16, 1]. 
In particular, the probability of observing a spike is not a functional of the stimulus only; it is also strongly affected by the recent history of spiking. The leaky integrate-and-fire (LIF) model provides a biophysically more realistic spike mechanism with a simple form of spike-history dependence. This model is simple, wellunderstood, and has dynamics that are entirely linear except for a nonlinear “reset” of the membrane potential following a spike. Although this model’s overriding linearity is often emphasized (due to the approximately linear relationship between input current and firing rate, and lack of active conductances), the nonlinear reset has significant functional importance for the model’s response properties. In previous work, we have shown that standard reverse correlation analysis fails when applied to a neuron with deterministic (noise-free) LIF spike generation; we developed a new estimator for this model, and demonstrated that a change in leakiness of such a mechanism might underlie nonlinear effects of contrast adaptation in macaque retinal ganglion cells [15]. We and others have explored other “adaptive” properties of the LIF model [17, 13, 19]. In this paper, we consider a model consisting of a linear filter followed by noisy LIF spike generation with a spike-dependent after-current; this is essentially the standard LIF model driven by a noisy, filtered version of the stimulus, with an additional current waveform injected following each spike. We will refer to this as the the “L-NLIF” model. The probabilistic nature of this model provides several important advantages over the deterministic version we have considered previously. First, an explicit noise model allows us to couch the problem in the terms of classical estimation theory. This, in turn, provides a natural “cost function” (likelihood) for model assessment and leads to more efficient estimation of the model parameters. Second, noise allows us to explicitly model neural firing statistics, and could provide a rigorous basis for a metric distance between spike trains, useful in other contexts [18]. Finally, noise influences the behavior of the model itself, giving rise to phenomena not observed in the purely deterministic model [11]. Our main contribution here is to show that the maximum likelihood estimator (MLE) for the L-NLIF model is computationally tractable. Specifically, we describe an algorithm for computing the likelihood function, and prove that this likelihood function contains no non-global maxima, implying that the MLE can be computed efficiently using standard ascent techniques. The desirable statistical properties of this estimator (e.g. consistency, efficiency) are all inherited “for free” from classical estimation theory. Thus, we have a compact and powerful model for the neural code, and a well-motivated, efficient way to estimate the parameters of this model from extracellular data. The Model We consider a model for which the (dimensionless) subthreshold voltage variable V evolves according to i−1 dV = − gV (t) + k · x(t) + j=0 h(t − tj ) dt + σNt , (1) and resets to Vr whenever V = 1. Here, g denotes the leak conductance, k · x(t) the projection of the input signal x(t) onto the linear kernel k, h is an “afterpotential,” a current waveform of fixed amplitude and shape whose value depends only on the time since the last spike ti−1 , and Nt is an unobserved (hidden) noise process with scale parameter σ. 
Without loss of generality, the “leak” and “threshold” potential are set at 0 and 1, respectively, so the cell spikes whenever V = 1, and V decays back to 0 with time constant 1/g in the absence of input. Note that the nonlinear behavior of the model is completely determined by only a few parameters, namely {g, σ, Vr }, and h (where the function h is allowed to take values in some low-dimensional vector space). The dynamical properties of this type of “spike response model” have been extensively studied [7]; for example, it is known that this class of models can effectively capture much of the behavior of apparently more biophysically realistic models (e.g. Hodgkin-Huxley). Figures 1 and 2 show several simple comparisons of the L-NLIF and LNP models. In 1, note the fine structure of spike timing in the responses of the L-NLIF model, which is qualitatively similar to in vivo experimental observations [2, 16, 9]). The LNP model fails to capture this fine temporal reproducibility. At the same time, the L-NLIF model is much more flexible and representationally powerful, as demonstrated in Fig. 2: by varying V r or h, for example, we can match a wide variety of dynamical behaviors (e.g. adaptation, bursting, bistability) known to exist in biological neurons. The Estimation Problem Our problem now is to estimate the model parameters {k, σ, g, Vr , h} from a sufficiently rich, dynamic input sequence x(t) together with spike times {ti }. A natural choice is the maximum likelihood estimator (MLE), which is easily proven to be consistent and statistically efficient here. To compute the MLE, we need to compute the likelihood and develop an algorithm for maximizing it. The tractability of the likelihood function for this model arises directly from the linearity of the subthreshold dynamics of voltage V (t) during an interspike interval. In the noiseless case [15], the voltage trace during an interspike interval t ∈ [ti−1 , ti ] is given by the solution to equation (1) with σ = 0:   V0 (t) = Vr e−gt + t ti−1 i−1 k · x(s) + j=0 h(s − tj ) e−g(t−s) ds, (2) A stimulus h current responses 0 0 0 1 )ces( t 0 2. 0 t stimulus x 0 B c responses c=1 h current 0 c=2 2. 0 c=5 1 )ces( t t 0 0 stimulus C 0 h current responses Figure 2: Illustration of diverse behaviors of L-NLIF model. A: Firing rate adaptation. A positive DC current (top) was injected into three model cells differing only in their h currents (shown on left: top, h = 0; middle, h depolarizing; bottom, h hyperpolarizing). Voltage traces of each cell’s response (right, with spikes superimposed) exhibit rate facilitation for depolarizing h (middle), and rate adaptation for hyperpolarizing h (bottom). B: Bursting. The response of a model cell with a biphasic h current (left) is shown as a function of the three different levels of DC current. For small current levels (top), the cell responds rhythmically. For larger currents (middle and bottom), the cell responds with regular bursts of spikes. C: Bistability. The stimulus (top) is a positive followed by a negative current pulse. Although a cell with no h current (middle) responds transiently to the positive pulse, a cell with biphasic h (bottom) exhibits a bistable response: the positive pulse puts it into a stable firing regime which persists until the arrival of a negative pulse. 0 0 1 )ces( t 0 5 0. t 0 which is simply a linear convolution of the input current with a negative exponential. 
It is easy to see that adding Gaussian noise to the voltage during each time step induces a Gaussian density over V (t), since linear dynamics preserve Gaussianity [8]. This density is uniquely characterized by its first two moments; the mean is given by (2), and its covariance T is σ 2 Eg Eg , where Eg is the convolution operator corresponding to e−gt . Note that this density is highly correlated for nearby points in time, since noise is integrated by the linear dynamics. Intuitively, smaller leak conductance g leads to stronger correlation in V (t) at nearby time points. We denote this Gaussian density G(xi , k, σ, g, Vr , h), where index i indicates the ith spike and the corresponding stimulus chunk xi (i.e. the stimuli that influence V (t) during the ith interspike interval). Now, on any interspike interval t ∈ [ti−1 , ti ], the only information we have is that V (t) is less than threshold for all times before ti , and exceeds threshold during the time bin containing ti . This translates to a set of linear constraints on V (t), expressed in terms of the set Ci = ti−1 ≤t < 1 ∩ V (ti ) ≥ 1 . Therefore, the likelihood that the neuron first spikes at time ti , given a spike at time ti−1 , is the probability of the event V (t) ∈ Ci , which is given by Lxi ,ti (k, σ, g, Vr , h) = G(xi , k, σ, g, Vr , h), Ci the integral of the Gaussian density G(xi , k, σ, g, Vr , h) over the set Ci . sulumits Figure 3: Behavior of the L-NLIF model during a single interspike interval, for a single (repeated) input current (top). Top middle: Ten simulated voltage traces V (t), evaluated up to the first threshold crossing, conditional on a spike at time zero (Vr = 0). Note the strong correlation between neighboring time points, and the sparsening of the plot as traces are eliminated by spiking. Bottom Middle: Time evolution of P (V ). Each column represents the conditional distribution of V at the corresponding time (i.e. for all traces that have not yet crossed threshold). Bottom: Probability density of the interspike interval (isi) corresponding to this particular input. Note that probability mass is concentrated at the points where input drives V0 (t) close to threshold. rhtV secart V 0 rhtV )V(P 0 )isi(P 002 001 )cesm( t 0 0 Spiking resets V to Vr , meaning that the noise contribution to V in different interspike intervals is independent. This “renewal” property, in turn, implies that the density over V (t) for an entire experiment factorizes into a product of conditionally independent terms, where each of these terms is one of the Gaussian integrals derived above for a single interspike interval. The likelihood for the entire spike train is therefore the product of these terms over all observed spikes. Putting all the pieces together, then, the full likelihood is L{xi ,ti } (k, σ, g, Vr , h) = G(xi , k, σ, g, Vr , h), i Ci where the product, again, is over all observed spike times {ti } and corresponding stimulus chunks {xi }. Now that we have an expression for the likelihood, we need to be able to maximize it. Our main result now states, basically, that we can use simple ascent algorithms to compute the MLE without getting stuck in local maxima. Theorem 1. The likelihood L{xi ,ti } (k, σ, g, Vr , h) has no non-global extrema in the parameters (k, σ, g, Vr , h), for any data {xi , ti }. The proof [14] is based on the log-concavity of L{xi ,ti } (k, σ, g, Vr , h) under a certain parametrization of (k, σ, g, Vr , h). 
Now that we have an expression for the likelihood, we need to be able to maximize it. Our main result states, basically, that we can use simple ascent algorithms to compute the MLE without getting stuck in local maxima.

Theorem 1. The likelihood L_{{x_i, t_i}}(k, σ, g, V_r, h) has no non-global extrema in the parameters (k, σ, g, V_r, h), for any data {x_i, t_i}.

The proof [14] is based on the log-concavity of L_{{x_i, t_i}}(k, σ, g, V_r, h) under a certain parametrization of (k, σ, g, V_r, h). The classical approach for establishing the nonexistence of non-global maxima of a given function uses concavity, which corresponds roughly to the function having everywhere non-positive second derivatives. However, the basic idea can be extended with the use of any invertible function: if f has no non-global extrema, neither will g(f), for any strictly increasing real function g. The logarithm is a natural choice for g in any probabilistic context in which independence plays a role, since sums are easier to work with than products. Moreover, concavity of a function f is strictly stronger than log-concavity, so log-concavity can be a powerful tool even in situations where concavity is useless (the Gaussian density is log-concave but not concave, for example). Our proof relies on a theorem [3] establishing the log-concavity of integrals of log-concave functions, and proceeds by making a correspondence between this type of integral and the integrals that appear in the definition of the L-NLIF likelihood above. We should also note that the proof extends without difficulty to other noise processes that generate log-concave densities (white noise, with its standard Gaussian density, is one example); for instance, the proof is nearly identical if N_t is allowed to be colored or non-Gaussian noise, with possibly nonzero drift.

Computational methods and numerical results

Theorem 1 tells us that we can ascend the likelihood surface without fear of getting stuck in local maxima. But how do we actually compute the likelihood? This is a nontrivial problem: we need to be able to quickly compute (or at least approximate, in a rational way) integrals of multivariate Gaussian densities G over simple but high-dimensional orthants C_i. We discuss two ways to compute these integrals; each has its own advantages.

The first technique can be termed “density evolution” [10, 13]. The method is based on the following well-known fact from the theory of stochastic differential equations [8]: given the data (x_i, t_{i-1}), the probability density of the voltage process V(t) up to the next spike t_i satisfies the partial differential (Fokker-Planck) equation

$$\frac{\partial P(V,t)}{\partial t} = \frac{\sigma^2}{2}\,\frac{\partial^2 P}{\partial V^2} + g\,\frac{\partial\big[(V - V_{eq}(t))\,P\big]}{\partial V}, \qquad (3)$$

under the boundary conditions P(V, t_{i-1}) = δ(V − V_r) and P(V_{th}, t) = 0, where V_{eq}(t) is the instantaneous equilibrium potential

$$V_{eq}(t) = \frac{1}{g}\Big( k \cdot x(t) + \sum_{j=0}^{i-1} h(t - t_j) \Big).$$

Moreover, the conditional firing rate f(t) satisfies

$$\int_{t_{i-1}}^{t} f(s)\, ds = 1 - \int P(V, t)\, dV.$$

Thus standard techniques for solving the drift-diffusion evolution equation (3) lead to a fast method for computing f(t) (as illustrated in Fig. 3). Finally, the likelihood L_{x_i, t_i}(k, σ, g, V_r, h) is simply f(t_i).

While elegant and efficient, this density-evolution technique turns out to be slightly more powerful than what we need for the MLE: recall that we do not need the conditional rate function f at all times t, but only at the set of spike times {t_i}, and thus we can turn to more specialized techniques for faster performance. We employ a rapid technique for computing the likelihood using an algorithm due to Genz [6], designed to compute exactly the kinds of multidimensional Gaussian probability integrals considered here. This algorithm works well when the orthants C_i are defined by fewer than roughly 10 linear constraints on V(t).
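As a hedged illustration of the density-evolution route, the sketch below integrates equation (3) with a deliberately crude forward-Euler finite-difference scheme: it propagates P(V, t) from a (smoothed) delta at V_r with an absorbing barrier at threshold, and reads off f(t) as the probability mass absorbed per unit time. The grid spacing, the lower voltage cutoff, the smoothing of the initial delta, and the explicit scheme (stable only for dt small relative to dV^2/σ^2) are all our own choices.

```python
import numpy as np

def density_evolution(Veq, g, sigma, Vr=0.0, dt=1e-4, dV=0.01, Vmin=-3.0):
    """Forward-Euler sketch of the Fokker-Planck equation (3).

    Veq : equilibrium potential V_eq(t), one value per time bin of width dt.
    Returns f(t), the conditional firing rate (absorbed mass per unit time).
    """
    V = np.arange(Vmin, 1.0 + dV, dV)         # voltage grid up to V_th = 1
    # smoothed stand-in for the delta initial condition P(V, t_{i-1})
    P = np.exp(-0.5 * ((V - Vr) / (5 * dV)) ** 2)
    P /= P.sum() * dV
    D = 0.5 * sigma ** 2
    f, prev_mass = np.zeros(len(Veq)), 1.0
    for n in range(len(Veq)):
        drift = g * (V - Veq[n]) * P           # g (V - V_eq(t)) P
        dPdt = D * np.gradient(np.gradient(P, dV), dV) + np.gradient(drift, dV)
        P = np.maximum(P + dt * dPdt, 0.0)
        P[-1] = 0.0                            # absorbing boundary P(V_th, t) = 0
        mass = P.sum() * dV
        f[n] = max(prev_mass - mass, 0.0) / dt # since int f ds = 1 - int P dV
        prev_mass = mass
    return f
```

The single-interval likelihood is then f evaluated at the bin containing t_i, as noted above; production code would use an implicit (e.g. Crank-Nicolson) solver rather than this explicit one.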
The number of actual constraints on V(t) during an interspike interval (t_{i-1}, t_i) grows linearly with the length of the interval; thus, to use this algorithm in typical data situations, we adopt a strategy proposed in our work on the deterministic form of the model [15], in which we discard all but a small subset of the constraints. The key point is that, because the noise is strongly correlated and the constraints only figure significantly when V(t) is driven close to threshold, a small number of constraints often suffices to approximate the true likelihood to a high degree of precision.

[Figure 4: Demonstration of the estimator’s performance on simulated data. Dashed lines show the true kernel k and after-current h; k is a 12-sample function chosen to resemble the biphasic temporal impulse response of a macaque retinal ganglion cell, while h is a function specified in a five-dimensional vector space, whose shape induces a slight degree of burstiness in the model’s spike responses. The L-NLIF model was simulated with parameters g = 0.05 (corresponding to a membrane time constant of 20 time samples), σ_noise = 0.5, and V_r = 0. The stimulus was 30,000 time samples of white Gaussian noise with a standard deviation of 0.5. With only 600 spikes of output, the estimator is able to retrieve an estimate of k (gray curve) that closely matches the true kernel. Note that the spike-triggered average (black curve), which is an unbiased estimator for the kernel of an LNP neuron [5], differs significantly from this true kernel (see also [15]).]

The accuracy of this approach improves with the number of constraints considered, but computation is fastest with fewer constraints. Therefore, because ascending the likelihood function requires evaluating the likelihood at many different points, we can make the ascent much quicker by applying a version of the coarse-to-fine idea. Let L_k denote the approximation to the likelihood obtained by allowing only k constraints in the above algorithm. Then we know, by a proof identical to that of Theorem 1, that L_k has no local maxima; in addition, by the above logic, L_k → L as k grows. It takes little additional effort to prove that argmax L_k → argmax L; thus, we can efficiently ascend the true likelihood surface by ascending the “coarse” approximants L_k and then gradually “refining” the approximation by letting k increase (a minimal sketch of this scheme is given below). An application of this algorithm to simulated data is shown in Fig. 4. Further applications to both simulated and real data will be presented elsewhere.
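As referenced above, here is a minimal sketch of the coarse approximant L_k for a single interval, reusing the mean/covariance conventions of the interval_loglik sketch earlier. The selection rule (keep the k − 1 subthreshold bins whose noiseless mean comes closest to threshold, plus the spike bin, which must always be kept) and the refinement schedule are our own choices; the sketch assumes k ≥ 2.

```python
import numpy as np
from scipy.stats import multivariate_normal

def coarse_loglik(mu, cov, k):
    """L_k: interval log-likelihood using only the k most informative
    constraints (bins where the mean voltage is nearest threshold)."""
    n = len(mu)
    idx = np.argsort(1.0 - mu[:-1])[: k - 1]   # subthreshold bins closest to 1
    keep = np.sort(np.append(idx, n - 1))      # the spike bin is always kept
    m, C = mu[keep], cov[np.ix_(keep, keep)]
    ones = np.ones(len(keep))
    p_sub = multivariate_normal.cdf(ones[:-1], mean=m[:-1], cov=C[:-1, :-1])
    p_all = multivariate_normal.cdf(ones, mean=m, cov=C)
    return np.log(max(p_sub - p_all, 1e-300))

# Coarse-to-fine ascent: maximize the summed coarse_loglik over all intervals
# with a small k, then re-optimize from that optimum as k grows, e.g. k = 3, 6, 10.
```

Each refinement starts from the previous optimum, so the expensive high-k evaluations are only performed near the solution.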
Discussion

We have shown here that the L-NLIF model, which couples a linear filtering stage to a biophysically plausible and flexible model of neuronal spiking, can be efficiently estimated from extracellular physiological data using maximum likelihood. Moreover, the model lends itself directly to analysis via tools from the modern theory of point processes. For example, once we have obtained our estimate of the parameters (k, σ, g, V_r, h), how do we verify that the resulting model provides an adequate description of the data? This important “model validation” question has been the focus of some recent elegant research, under the rubric of “time rescaling” techniques [4]. While we lack the room here to review these methods in detail, we note that they depend essentially on knowledge of the conditional firing rate function f(t). Recall that we showed how to compute this function efficiently in the preceding section, and examined some of its qualitative properties in the L-NLIF context in Figs. 2 and 3. We are currently applying the model to physiological data recorded both in vivo and in vitro, in order to assess whether it accurately accounts for the stimulus preferences and spiking statistics of real neurons. One long-term goal of this research is to elucidate the different roles of stimulus-driven and stimulus-independent activity in the spiking patterns of both single cells and multineuronal ensembles.

References

[1] B. Aguera y Arcas and A. Fairhall. What causes a neuron to spike? Neural Computation, 15:1789–1807, 2003.
[2] M. Berry and M. Meister. Refractoriness and neural precision. Journal of Neuroscience, 18:2200–2211, 1998.
[3] V. Bogachev. Gaussian Measures. AMS, New York, 1998.
[4] E. Brown, R. Barbieri, V. Ventura, R. Kass, and L. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14:325–346, 2002.
[5] E. Chichilnisky. A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12:199–213, 2001.
[6] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–149, 1992.
[7] W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[8] S. Karlin and H. Taylor. A Second Course in Stochastic Processes. Academic Press, New York, 1981.
[9] J. Keat, P. Reinagel, R. Reid, and M. Meister. Predicting every spike: a model for the responses of visual neurons. Neuron, 30:803–817, 2001.
[10] B. Knight, A. Omurtag, and L. Sirovich. The approach of a neuron population firing rate to a new equilibrium: an exact theoretical result. Neural Computation, 12:1045–1055, 2000.
[11] J. Levin and J. Miller. Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380:165–168, 1996.
[12] L. Paninski. Convergence properties of some spike-triggered analysis techniques. Network: Computation in Neural Systems, 14:437–464, 2003.
[13] L. Paninski, B. Lau, and A. Reyes. Noise-driven adaptation: in vitro and mathematical analysis. Neurocomputing, 52:877–883, 2003.
[14] L. Paninski, J. Pillow, and E. Simoncelli. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Submitted manuscript (cns.nyu.edu/~liam), 2004.
[15] J. Pillow and E. Simoncelli. Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52:109–115, 2003.
[16] D. Reich, J. Victor, and B. Knight. The power ratio and the interval map: spiking models and extracellular recordings. The Journal of Neuroscience, 18:10090–10104, 1998.
[17] M. Rudd and L. Brown. Noise adaptation in integrate-and-fire neurons. Neural Computation, 9:1047–1069, 1997.
[18] J. Victor. How the brain uses time to represent and process visual information. Brain Research, 886:33–46, 2000.
[19] Y. Yu and T. Lee. Dynamical mechanisms underlying contrast gain control in single neurons. Physical Review E, 68:011901, 2003.

4 0.48422116 41 nips-2003-Boosting versus Covering

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all of their positive or all of their negative predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k^2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k^2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction, as well as in simple experiments based on artificial data. Further, we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

5 0.46679527 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

6 0.4662905 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

7 0.46582752 145 nips-2003-Online Classification on a Budget

8 0.46566299 124 nips-2003-Max-Margin Markov Networks

9 0.46472922 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

10 0.46343112 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

11 0.46271834 3 nips-2003-AUC Optimization vs. Error Rate Minimization

12 0.45564061 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

13 0.45436823 78 nips-2003-Gaussian Processes in Reinforcement Learning

14 0.454211 158 nips-2003-Policy Search by Dynamic Programming

15 0.4539676 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

16 0.45384237 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

17 0.45371351 126 nips-2003-Measure Based Regularization

18 0.45318633 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

19 0.45307088 117 nips-2003-Linear Response for Approximate Inference

20 0.45305434 100 nips-2003-Laplace Propagation