nips2003-145: Online Classification on a Budget
Source: pdf
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM), although the latter, being a batch algorithm, accesses each training example multiple times.

1 Introduction and Motivation

Kernel-based methods are widely used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick provides an effective way to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite the generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane, many of which have been inspired by Rosenblatt's Perceptron algorithm [6], an online learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin into the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1], all of which can be used in conjunction with kernel functions.

A notable limitation of kernel-based methods is their computational complexity, since the amount of computer memory that they require to store the so-called support patterns grows linearly with the number of prediction errors. A number of attempts have been made to speed up the training and testing of SVMs by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps.

This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache.
Sec. 4 provides experimental details, on three real-world datasets, that illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work.

2 Problem Setting and Algorithms

This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x_1, y_1), . . . , (x_t, y_t), . . .. We assume that each instance is a vector x_t ∈ R^n and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1}, but we relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = Σ_t α_t y_t x_t where α_t ≥ 0. The instances for which α_t > 0 are referred to as support patterns. Under this representation the output of the classifier depends solely on inner-products of the form x · x_t, so kernel functions can easily be employed by simply replacing the standard scalar product with a function K(·, ·) which satisfies Mercer's conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign(Σ_t α_t y_t K(x, x_t)).

The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the t-th round, an online algorithm receives an instance x_t, computes the inner-products s_t = Σ_{i<t} α_i y_i K(x_i, x_t) and outputs the prediction ŷ_t = sign(s_t). It then receives the correct label y_t and, whenever y_t s_t ≤ β_t, updates the classifier by adding the new example to the expansion with a weight α_t governed by a parameter c_t > 0. The various online algorithms differ in the way the values of the parameters β_t, α_t and c_t are set. A notable example of an online algorithm is the Perceptron algorithm [6], for which we set β_t = 0, α_t = 1 and c_t = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework, although the constants β_t, α_t and c_t are not as simple as the ones employed by the Perceptron algorithm.

An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so-called support patterns grows linearly with the number of prediction errors. In this paper we shift the focus to the problem of devising online algorithms which are budget-conscious, as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and the classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classifier with a small number of support patterns is intuitively "simpler", and hence is more likely to exhibit good generalization properties than a complex classifier with a large number of support patterns (see for instance [7] for formal results connecting the number of support patterns to the generalization error).

Input: Tolerance β.
Initialize: Set ∀t α_t = 0, w_0 = 0, C_0 = ∅.
Loop: For t = 1, 2, . . . , T
  • Get a new instance x_t ∈ R^n.
  • Predict ŷ_t = sign(x_t · w_{t−1}).
  • Get the new label y_t.
  • If y_t (x_t · w_{t−1}) ≤ β update:
    1. Insert C_t ← C_{t−1} ∪ {t}.
    2. Set α_t = 1.
    3. Compute w_t ← w_{t−1} + y_t α_t x_t.
    4. DistillCache(C_t, w_t, (α_1, . . . , α_t)).
Output: H(x) = sign(w_T · x).

Figure 1: The aggressive Perceptron algorithm with a variable-size cache.
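To make this common structure concrete, the following is a minimal Python sketch of the aggressive Perceptron of Fig. 1 in its kernel form, with the margin threshold fixed to a constant tolerance β and all insertion weights set to 1, as in the figure; the DistillCache step of stage 4 is deferred to a later sketch (after Fig. 3). The function names and the error-counting convention are ours, not the paper's.

```python
def aggressive_kernel_perceptron(X, y, kernel, beta=0.01):
    # One pass of the aggressive Perceptron (Fig. 1) in kernel form,
    # without the DistillCache deletion step (sketched separately later).
    # X: sequence of instances, y: labels in {-1, +1},
    # kernel: function mapping two instances to a scalar, beta: tolerance.
    cache = []           # indices of support patterns (the cache C_t)
    alpha = {}           # their weights alpha_t (always 1 here)
    online_errors = 0    # prediction mistakes made during the single pass
    for t in range(len(y)):
        # s_t = sum_{i in cache} alpha_i * y_i * K(x_i, x_t)
        s_t = sum(alpha[i] * y[i] * kernel(X[i], X[t]) for i in cache)
        if y[t] * s_t <= 0:
            online_errors += 1
        if y[t] * s_t <= beta:   # margin error: insert the new example
            cache.append(t)
            alpha[t] = 1.0
    return cache, alpha, online_errors

def predict(x, X, y, cache, alpha, kernel):
    # h(x) = sign( sum_t alpha_t * y_t * K(x, x_t) )
    s = sum(alpha[i] * y[i] * kernel(X[i], x) for i in cache)
    return 1.0 if s >= 0.0 else -1.0
```

With a linear kernel, e.g. kernel = lambda a, b: sum(p * q for p, q in zip(a, b)), and β = 0 the update reduces to the classical kernel Perceptron.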
In Sec. 3 we present a formal analysis and the algorithmic details of our approach. Let us now provide a general overview of how to restrict the number of support patterns in an online setting. Denote by C_t the indices of patterns which constitute the classification vector w_t. That is, i ∈ C_t if and only if α_i > 0 on round t when x_t is received. The online classification algorithms discussed above keep enlarging C_t – once an example is added to C_t it will never be deleted. However, as the online algorithm receives more examples, the performance of the classifier improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due to the lack of support patterns in early rounds. As more examples are observed, the old examples may be replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache C_t. We suggest a modification of the online algorithm structure as follows. Whenever y_t (x_t · w_{t−1}) ≤ β the algorithm inserts the new example as before and then calls the DistillCache procedure (Fig. 2), which repeatedly removes cached examples that still attain a margin of at least β once their own contribution to the classifier is discounted. The resulting algorithm with a variable-size cache is given in Fig. 1.

Input: C, w, (α_1, . . . , α_t).
Loop:
  • Choose i ∈ C such that β ≤ y_i ((w − α_i y_i x_i) · x_i).
  • If no such i exists then return.
  • Remove the example i:
    1. w ← w − α_i y_i x_i.
    2. Set α_i = 0.
    3. C ← C \ {i}.
Return: C, w, (α_1, . . . , α_t).

Figure 2: DistillCache

3 Analysis

Theorem 1 Let (x_1, y_1), . . . , (x_T, y_T) be a sequence of examples such that ||x_t|| ≤ R for all t, and assume that there exists a vector u with ||u|| = 1 and a scalar γ > 0 such that y_t (x_t · u) ≥ γ for all t. Then the number of support patterns constituting the cache is at most S ≤ (R² + 2β)/γ².

Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound ||w_T||² from above and below and compare the two bounds. Denote by α_i^t the weight of the i-th example at the end of round t (after stage 4 of the algorithm). Similarly, denote by α̃_i^t the weight of the i-th example on round t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously denote by w_t and w̃_t the corresponding instantaneous classifiers. First, we derive a lower bound on ||w_T||² by bounding the term w_T · u from below,

  w_T · u = Σ_{t∈C_T} α_t^T y_t (x_t · u) ≥ γ Σ_{t∈C_T} α_t^T = γ S .    (1)

We now turn to upper bound ||w_T||². Recall that each example may be added to the cache and removed from the cache a single time. Let us write ||w_T||² as a telescopic sum,

  ||w_T||² = (||w_T||² − ||w̃_T||²) + (||w̃_T||² − ||w_{T−1}||²) + . . . + (||w_1||² − ||w̃_1||²) + (||w̃_1||² − ||w_0||²) .    (2)

We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the t-th example into the cache at all. In this case, (||w̃_t||² − ||w_{t−1}||²) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds; thus,

  ||w̃_t||² = ||w_{t−1} + y_t x_t||² = ||w_{t−1}||² + 2 y_t (w_{t−1} · x_t) + ||x_t||² .

Since we inserted (x_t, y_t), the condition y_t (w_{t−1} · x_t) ≤ β must hold. Combining this with the assumption that the examples are enclosed in a ball of radius R we get, (||w̃_t||² − ||w_{t−1}||²) ≤ 2β + R². The last scenario occurs when an example is inserted into the cache on some round t, and is then later removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of the summands in Equ. (2),

  (||w̃_t||² − ||w_{t−1}||²) + (||w_{t+p}||² − ||w̃_{t+p}||²)
  = 2 y_t (w_{t−1} · x_t) + ||x_t||² − 2 y_t (w̃_{t+p} · x_t) + ||x_t||²
  = 2 [ y_t (w_{t−1} · x_t) − y_t ((w̃_{t+p} − y_t x_t) · x_t) ]
  ≤ 2 [ β − y_t ((w̃_{t+p} − y_t x_t) · x_t) ] .

Based on the form of the cache update we know that y_t ((w̃_{t+p} − y_t x_t) · x_t) ≥ β, and thus,

  (||w̃_t||² − ||w_{t−1}||²) + (||w_{t+p}||² − ||w̃_{t+p}||²) ≤ 0 .

Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R² + 2β each to the bound of the telescopic sum of Equ. (2), and the rest of the examples do not contribute anything to the bound. Hence, we can bound the norm of w_T as follows,

  ||w_T||² ≤ S (R² + 2β) .    (3)

We finish the proof by applying the Cauchy-Schwarz inequality and the assumption ||u|| = 1. Combining Equ. (1) and Equ. (3) we get,

  γ² S² ≤ (w_T · u)² ≤ ||w_T||² ||u||² ≤ S (2β + R²) ,

which gives the desired bound.

4 Experiments

In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also briefly describe another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of answering the following questions. First, what is the effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data)? Second, is the algorithm described in Fig. 2 able to find an optimal cache size, that is, one that achieves the best generalization performance? To examine each question separately we used a modified version of the algorithm described in Fig. 2 in which we restricted ourselves to a fixed, bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) simulates the original Perceptron algorithm with one notable difference. When the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. This modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3).

Input: Tolerance β, Cache Limit n.
Initialize: Set ∀t α_t = 0, w_0 = 0, C_0 = ∅.
Loop: For t = 1, 2, . . . , T
  • Get a new instance x_t ∈ R^n.
  • Predict ŷ_t = sign(x_t · w_{t−1}).
  • Get the new label y_t.
  • If y_t (x_t · w_{t−1}) ≤ β update:
    1. If |C_{t−1}| = n remove one example:
      (a) Find i = arg max_{j∈C_{t−1}} { y_j ((w_{t−1} − α_j y_j x_j) · x_j) }.
      (b) Update w_{t−1} ← w_{t−1} − α_i y_i x_i.
      (c) Remove C_{t−1} ← C_{t−1} \ {i}.
    2. Insert C_t ← C_{t−1} ∪ {t}.
    3. Set α_t = 1.
    4. Compute w_t ← w_{t−1} + y_t α_t x_t.
Output: H(x) = sign(w_T · x).

Figure 3: The aggressive Perceptron algorithm with a fixed-size cache.

Name    No. of Training Examples   No. of Test Examples   No. of Classes   No. of Attributes
mnist   60000                      10000                  10               784
letter  16000                      4000                   26               16
usps    7291                       2007                   10               256

Table 1: Description of the datasets used in experiments.

Despite the simplicity of the original Perceptron algorithm [6], its good generalization performance on many datasets is remarkable. During the last few years a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen to use the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problems directly, without any recourse to performing reductions.
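Before turning to the datasets, here is a sketch, in the same hedged spirit as the earlier one, of the two deletion rules used above. It relies on rewriting the margin test in kernel form: y_i((w − α_i y_i x_i) · x_i) = y_i Σ_{j≠i} α_j y_j K(x_j, x_i), i.e. the margin of a cached example measured after its own contribution is removed. The function names are ours.

```python
def loo_margin(i, cache, alpha, X, y, kernel):
    # Margin of cached example i against the classifier obtained by removing i:
    # y_i * sum_{j in cache, j != i} alpha_j * y_j * K(x_j, x_i).
    return y[i] * sum(alpha[j] * y[j] * kernel(X[j], X[i])
                      for j in cache if j != i)

def distill_cache(cache, alpha, X, y, kernel, beta):
    # Variable-size deletion (Fig. 2): repeatedly discard any cached example
    # whose margin, after its own removal, is still at least beta.
    removed_some = True
    while removed_some:
        removed_some = False
        for i in list(cache):
            if loo_margin(i, cache, alpha, X, y, kernel) >= beta:
                cache.remove(i)
                del alpha[i]
                removed_some = True
                break
    return cache, alpha

def evict_for_budget(cache, alpha, X, y, kernel):
    # Fixed-budget deletion (Fig. 3, step 1(a)): when the cache is full, discard
    # the example that would attain the highest margin after its own removal.
    i = max(cache, key=lambda j: loo_margin(j, cache, alpha, X, y, kernel))
    cache.remove(i)
    del alpha[i]
    return cache, alpha
```

Both rules score cached examples by the same quantity; the variable-size version thresholds it at β, while the fixed-budget version simply removes the maximizer whenever the limit is reached.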
The algorithms were evaluated on three natural datasets: mnist¹, usps² and letter³. The characteristics of these datasets have been summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we have evaluated are online, it is not implausible for the specific ordering of examples to affect the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free parameter optimization was carried out; instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and an RBF kernel for the letter data set. (The variance of the RBF kernel was identical to the one used in [1].) We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3 which uses a fixed limited budget. Here we enumerated the cache size limit in each experiment we performed. The different sizes that we tested are dataset dependent, but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion, and the goal of employing the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the run of the algorithm. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online, since during the training process it considers all the examples at once, it serves as our gold-standard algorithm against which we want to compare performance.

¹ Available from http://www.research.att.com/~yann
² Available from ftp.kyb.tuebingen.mpg.de
³ Available from http://www.ics.uci.edu/~mlearn/MLRepository.html
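For reference, a sketch of the two kernels used in these experiments. The degree-9 homogeneous polynomial kernel follows the text directly; for the RBF kernel the paper reuses the variance from [1], so the parameter sigma2 and the exact normalization below are assumptions of this sketch.

```python
import numpy as np

def poly9_kernel(x, z):
    # Homogeneous polynomial kernel of degree 9, used for the mnist and usps sets.
    return float(np.dot(x, z)) ** 9

def make_rbf_kernel(sigma2):
    # RBF kernel used for the letter set; sigma2 is treated as a free parameter.
    def rbf(x, z):
        d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
        return float(np.exp(-np.dot(d, d) / (2.0 * sigma2)))
    return rbf
```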
[Figure 4 plots omitted. For each data set, the three rows of panels show the test error, the training online errors, and the training margin errors (y-axes) as a function of the number of support patterns (x-axis), for the Fixed, Adaptive and MIRA algorithms, with the SVM shown in the test-error row.]

Figure 4: Results on the three data sets - mnist (left), usps (center) and letter (right). Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size.

Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig. 4. This figure is composed of three different columns of plots, each of which corresponds to a different dataset - mnist (left), usps (center) and letter (right). In each of the plots the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the dependency of the performance on the size of the cache. The top row of the three columns shows the generalization error; thus the y-axis designates the test error of an algorithm on unseen data at the end of training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes for which the algorithm exhibits good performance. In fact, for mnist and usps there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case – the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns.

The plot of the online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test error plots with the online error plots we see a close similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm.

The third row gives the online training margin errors, which translate directly into the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come at a price. Namely, the algorithm makes significantly more insertions into the cache than the fixed-size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions and the end result is very good overall performance.
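This observation suggests a simple selection recipe for the fixed-size variant: run the single-pass algorithm for a few candidate cache limits, count the online prediction mistakes each run makes, and keep the limit with the fewest. The sketch below assumes a hypothetical, caller-supplied training routine; it is not part of the paper's algorithms.

```python
def select_cache_size(candidate_sizes, online_errors_for_size):
    # Pick the cache limit whose single online pass made the fewest prediction
    # mistakes. `online_errors_for_size` is a caller-supplied callable
    # (hypothetical) that trains the fixed-budget algorithm with the given
    # limit and returns its count of online prediction errors.
    errors = {n: online_errors_for_size(n) for n in candidate_sizes}
    best = min(errors, key=errors.get)
    return best, errors

# Toy usage with a stand-in error curve (a real run would train on the data):
# best_n, errs = select_cache_size([500, 1000, 2000, 4000], lambda n: abs(n - 2000))
```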
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization with a similar number of support patterns. While the SVM is still somewhat better in both respects on the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.

5 Summary

We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and both the size of the resulting classifier and its error rate are comparable to the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of examples might relay information on their importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements will make the proposed approach a viable alternative to the SVM and other batch algorithms.

Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341.

References
[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[3] W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. [sent-15, score-0.082]
2 Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. [sent-16, score-0.257]
3 The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. [sent-18, score-0.284]
4 However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. [sent-20, score-0.194]
5 A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. [sent-22, score-0.249]
6 These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. [sent-24, score-0.206]
7 A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. [sent-26, score-0.366]
8 In this paper we devise an online algorithm that is not only sparse but also generalizes well. [sent-28, score-0.321]
9 To achieve this goal our algorithm employs an insertion and deletion process. [sent-29, score-0.179]
10 Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). [sent-31, score-0.281]
11 As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. [sent-32, score-0.44]
12 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. [sent-35, score-0.508]
13 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. [sent-37, score-0.298]
14 4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. [sent-39, score-0.321]
15 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. [sent-41, score-0.304]
16 we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. [sent-49, score-0.237]
17 The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. [sent-53, score-0.437]
18 The instances for which αt > 0 are referred to as support patterns. [sent-54, score-0.108]
19 Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. [sent-55, score-0.239]
20 The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). [sent-56, score-0.437]
21 The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. [sent-57, score-0.338]
22 On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. [sent-59, score-0.566]
23 The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. [sent-60, score-0.491]
24 A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. [sent-61, score-0.589]
25 This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. [sent-64, score-0.303]
26 this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. [sent-86, score-0.516]
27 Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. [sent-88, score-0.174]
28 Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. [sent-89, score-0.422]
29 (See for instance [7] for formal results connecting the number of support patterns to the generalization error. [sent-90, score-0.335]
30 Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). [sent-97, score-0.162]
31 Denote by Ct the indices of patterns which consti• if no such i exists then return. [sent-99, score-0.124]
32 as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. [sent-112, score-0.423]
33 Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. [sent-113, score-0.765]
34 As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. [sent-114, score-0.468]
35 We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . [sent-115, score-0.806]
36 We suggest a modification of the online algorithm structure as follows. [sent-116, score-0.301]
37 Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . [sent-118, score-0.63]
38 Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. [sent-119, score-0.102]
39 We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. [sent-124, score-1.078]
40 Recall that each example may be added to the cache and removed from the cache a single time. [sent-128, score-0.796]
41 The first case is when we did not insert the tth example into the cache at all. [sent-134, score-0.462]
42 The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . [sent-136, score-1.883]
43 Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. [sent-137, score-0.709]
44 Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . [sent-138, score-0.601]
45 The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. [sent-139, score-0.975]
46 (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. [sent-141, score-0.539]
47 ˜ = 2yt (wt−1 · xt ) + xt 2 − 2yt (wt+p · xt ) + xt ˜ = 2 [yt (wt−1 · xt ) − yt ((wt+p − yt xt ) · xt )] ˜ ≤ 2 [β − yt ((wt+p − yt xt ) · xt )] . [sent-160, score-2.808]
48 2 ˜ Based on the form of the cache update we know that yt ((wt+p − yt xt ) · xt ) ≥ β, and thus, ˜ ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 − wt+p 2 ) ≤ 0 . [sent-161, score-1.811]
49 Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R2 + 2β each to the bound of the telescopic sum of Equ. [sent-162, score-0.541]
50 4 Experiments In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. [sent-169, score-0.349]
51 We also describe shortly another variant that sets a hard limit on the number of support patterns. [sent-170, score-0.126]
52 First, what is effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data), and second, would the algorithm described in Fig. [sent-172, score-0.386]
53 2 be able to find an optimal cache size that is able to achieve the best generalization performance. [sent-173, score-0.481]
54 This modified algorithm (which we refer to as the fixed budget Perceptron) Name mnist letter usps No. [sent-176, score-0.402]
55 simulates the original Perceptron algorithm with one notable difference. [sent-181, score-0.081]
56 When the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. [sent-182, score-0.738]
57 With this modification the number of support patterns can never exceed the pre-determined limit. [sent-183, score-0.252]
58 The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. [sent-186, score-0.148]
59 Despite the simplicity of the original Perceptron algorithm [6] its good generalization performance on many datasets is remarkable. [sent-188, score-0.207]
60 During the last few year a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. [sent-189, score-0.324]
61 In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. [sent-190, score-0.301]
62 A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. [sent-194, score-0.126]
63 Since all of the algorithms that we have evaluated are online, it is not implausible for the specific ordering of examples to affect the generalization performance. [sent-195, score-0.175]
64 We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. [sent-196, score-0.235]
65 More specifically, the margin parameter was set to β = 0. [sent-198, score-0.103]
66 A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and a RBF kernel for letter data set. [sent-200, score-0.393]
67 The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. [sent-203, score-0.374]
68 Here we enumerated the cache size limit in each experiment we performed. [sent-206, score-0.439]
69 We would like to note that such an enumeration cannot be done in an online fashion and the goal of employing the the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. [sent-208, score-0.759]
70 2 that adapts the cache size during the running of the algorithms. [sent-210, score-0.421]
71 Whilst this algorithm is not online and during the training process it considers all the examples at once, this algorithm serves as our gold-standard algorithm against which we want to compare 1 Available from http://www. [sent-212, score-0.489]
72 2 500 4 1000 1500 x 10 mnist 4 x 10 2000 2500 # Support Patterns 3000 3500 1000 usps 6500 Fixed Adaptive MIRA 5. [sent-261, score-0.202]
73 2 4 x 10 500 1000 1500 2000 2500 # Support Patterns 3000 3500 1000 2000 3000 4000 5000 6000 # Support Patterns 7000 8000 9000 Figure 4: Results on a three data sets - mnist (left), usps (center) and letter (right). [sent-281, score-0.303]
74 Each point in a plot designates the test error (y-axis) vs. [sent-282, score-0.105]
75 Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size. [sent-284, score-0.847]
76 Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. [sent-286, score-0.293]
77 Each of these plots corresponds to a different dataset - mnist (left), usps (center) and letter (right). [sent-289, score-0.361]
78 In each of the three plots the x-axis designates number of support patterns the algorithm uses. [sent-290, score-0.368]
79 The results for the fixed-size cache are connected with a line to emphasize the performance dependency on the size of the cache. [sent-291, score-0.441]
80 Thus the y-axis designates the test error of an algorithm on unseen data at the end of the training. [sent-293, score-0.168]
81 Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache size where the algorithm exhibits good performance. [sent-294, score-0.98]
82 In fact for MNIST and USPS there are sizes for which the test error of the algorithm is better than SVM’s test error. [sent-295, score-0.116]
83 Naturally, we cannot fix the correct size in hindsight so the question is whether the algorithm with variable cache size is a viable automatic size-selection method. [sent-296, score-0.513]
84 Analyzing each of the datasets in turn reveals that this is indeed the case – the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. [sent-297, score-0.436]
85 The results are somewhat less impressive for the letter dataset which contains less examples per class. [sent-298, score-0.184]
86 Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data and the variable-size method finds a very good cache size while making it also comparable to the SVM in terms of performance. [sent-300, score-0.727]
87 The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance not only in terms of generalization error but also in terms of number of support patterns. [sent-301, score-0.436]
88 The plot of online training error against the number of support patterns, in row 2 of Fig 4, can be considered to be a good on-the-fly validation of generalization performance. [sent-302, score-0.513]
89 As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. [sent-303, score-0.533]
90 Comparing the test error plots with the online error plots we see a nice similarity between the qualitative behavior of the two errors. [sent-304, score-0.412]
91 Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm. [sent-305, score-0.699]
92 The third row gives the online training margin errors that translates directly to the number of insertions into the cache. [sent-306, score-0.507]
93 Here we see that the good test error and compactness of the algorithm with a variable cache size come with a price. [sent-307, score-0.539]
94 Namely, the algorithm makes significantly more insertions into the cache than the fixed size version of the algorithm. [sent-308, score-0.52]
95 However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions and the end result is very good overall performance. [sent-309, score-0.149]
96 In summary, the online algorithm with a variable cache and SVM obtains similar levels of generalization and also number of support patterns. [sent-310, score-0.897]
97 While the SVM is still somewhat better in both aspects for the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data. [sent-311, score-0.438]
98 5 Summary We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. [sent-312, score-0.349]
99 We are currently looking at alternative criteria for the deletions of examples from the cache. [sent-315, score-0.098]
100 Incorporating prior knowledge to the insertion and deletion scheme might also prove important. [sent-317, score-0.134]
wordName wordTfidf (topN-words)
[('wt', 0.539), ('cache', 0.398), ('mira', 0.341), ('online', 0.256), ('yt', 0.225), ('xt', 0.212), ('ct', 0.207), ('patterns', 0.124), ('mnist', 0.114), ('support', 0.108), ('perceptron', 0.108), ('margin', 0.103), ('letter', 0.101), ('usps', 0.088), ('svm', 0.085), ('fixed', 0.073), ('deletion', 0.072), ('classi', 0.069), ('insertion', 0.062), ('examples', 0.062), ('generalization', 0.06), ('datasets', 0.06), ('adaptive', 0.06), ('round', 0.057), ('budget', 0.054), ('designates', 0.054), ('distillcache', 0.054), ('infused', 0.054), ('insertions', 0.054), ('relaxed', 0.051), ('inserted', 0.047), ('algorithm', 0.045), ('sign', 0.043), ('plots', 0.037), ('multiclass', 0.037), ('training', 0.036), ('deletions', 0.036), ('enhancements', 0.036), ('jaz', 0.036), ('insert', 0.036), ('notable', 0.036), ('algorithmic', 0.034), ('bound', 0.032), ('translates', 0.031), ('romma', 0.031), ('telescopic', 0.031), ('error', 0.031), ('obtains', 0.03), ('store', 0.03), ('cation', 0.029), ('tth', 0.028), ('tolerance', 0.028), ('alma', 0.028), ('maximal', 0.028), ('algorithms', 0.028), ('kernel', 0.027), ('errors', 0.027), ('old', 0.026), ('evaluated', 0.025), ('aggressive', 0.025), ('fig', 0.025), ('mistake', 0.025), ('loop', 0.025), ('instance', 0.025), ('viable', 0.024), ('crammer', 0.024), ('size', 0.023), ('permutations', 0.023), ('whilst', 0.023), ('modi', 0.023), ('good', 0.022), ('remove', 0.022), ('memory', 0.022), ('dataset', 0.021), ('rn', 0.021), ('test', 0.02), ('singer', 0.02), ('never', 0.02), ('past', 0.02), ('additive', 0.02), ('performance', 0.02), ('get', 0.02), ('sparse', 0.02), ('redundant', 0.02), ('jerusalem', 0.02), ('stage', 0.019), ('ers', 0.019), ('hebrew', 0.019), ('batch', 0.019), ('incorporate', 0.019), ('prediction', 0.019), ('scenario', 0.018), ('formal', 0.018), ('unseen', 0.018), ('limit', 0.018), ('overview', 0.018), ('israel', 0.018), ('yi', 0.018), ('contribute', 0.018), ('reveals', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999899 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1 Introduction and Motivation Kernel-based methods are widely being used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1] which can be used in conjunction with kernel functions. A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. A number of attempts have been made to speed up the training and testing of SVM’s by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 
4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work. 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x1 , y1 ), . . . , (xt , yt ), . . .. we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1} but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. The instances for which αt > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4] the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework although the constants βt , αt and ct are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. In Input: Tolerance β. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. Insert Ct ← Ct−1 ∪ {t}. 2. Set αt = 1. 3. Compute wt ← wt−1 + yt αt xt . 4. DistillCache(Ct , wt , (α1 , . . . , αt )). Output : H(x) = sign(wT · x). Figure 1: The aggressive Perceptron algorithm with a variable-size cache. this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. 
(See for instance [7] for formal results connecting the number of support patterns to the generalization error.) In Sec. 3 we present a formal analysis and Input: C, w, (α1 , . . . , αt ). the algorithmic details of our approach. Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). patterns in an online setting. Denote by Ct the indices of patterns which consti• if no such i exists then return. tute the classification vector wt . That is, • Remove the example i : i ∈ Ct if and only if αi > 0 on round 1. αi = 0. t when xt is received. The online classification algorithms discussed above keep 2. w ← w − αi yi xi . enlarging Ct – once an example is added 3. C ← C/{i} to Ct it will never be deleted. However, Return : C, w, (α1 , . . . , αt ). as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . We suggest a modification of the online algorithm structure as follows. Whenever yt i 0. Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound wT 2 from above and below and compare the 2 t bounds. Denote by αi the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by αi to be the weight of the ith example on round ˜t t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. First, we derive a lower bound on wT 2 by bounding the term wT · u from below in a recursive manner. T αt yt (xt · u) wT · u = t∈CT T αt = γ S . ≥ γ (1) t∈CT We now turn to upper bound wT 2 . Recall that each example may be added to the cache and removed from the cache a single time. Let us write wT 2 as a telescopic sum, wT 2 = ( wT 2 ˜ − wT 2 ˜ ) + ( wT 2 − wT −1 2 ˜ ) + . . . + ( w1 2 − w0 2 ) . (2) We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, ˜ ( wt 2 − wt−1 2 ) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of summands in Equ. (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). 
ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. If |Ct | = n remove one example: (a) Find i = arg maxj∈Ct {yj (wt−1 − αj yj xj )}. (b) Update wt−1 ← wt−1 − αi yi xi . (c) Remove Ct−1 ← Ct−1 /{i} 2. Insert Ct ← Ct−1 ∪ {t}. 3. Set αt = 1. 4. Compute wt ← wt−1 + yt αt xt . Output : H(x) = sign(wT · x). Figure 3: The aggressive Perceptron algorithm with as fixed-size cache. ˜ = 2yt (wt−1 · xt ) + xt 2 − 2yt (wt+p · xt ) + xt ˜ = 2 [yt (wt−1 · xt ) − yt ((wt+p − yt xt ) · xt )] ˜ ≤ 2 [β − yt ((wt+p − yt xt ) · xt )] . 2 ˜ Based on the form of the cache update we know that yt ((wt+p − yt xt ) · xt ) ≥ β, and thus, ˜ ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 − wt+p 2 ) ≤ 0 . Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R2 + 2β each to the bound of the telescopic sum of Equ. (2) and the rest of the examples do contribute anything to the bound. Hence, we can bound the norm of wT as follows, wT 2 ≤ S R2 + 2β . (3) We finish up the proof by applying the Cauchy-Swartz inequality and the assumption u = 1. Combining Equ. (1) and Equ. (3) we get, γ 2 S 2 ≤ (wT · u)2 ≤ wT 2 u 2 ≤ S(2β + R2 ) , which gives the desired bound. 4 Experiments In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also describe shortly another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of trying to answer the following questions. First, what is effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data), and second, would the algorithm described in Fig. 2 be able to find an optimal cache size that is able to achieve the best generalization performance. To examine each question separately we used a modified version of the algorithm described by Fig. 2 in which we restricted ourselves to have a fixed bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) Name mnist letter usps No. of Training Examples 60000 16000 7291 No. of Test Examples 10000 4000 2007 No. of Classes 10 26 10 No. of Attributes 784 16 256 Table 1: Description of the datasets used in experiments. simulates the original Perceptron algorithm with one notable difference. When the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. This modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3). Despite the simplicity of the original Perceptron algorithm [6] its good generalization performance on many datasets is remarkable. During the last few year a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen to use the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problem directly without any recourse to performing reductions. 
The algorithms were evaluated on three natural datasets: mnist1 , usps2 and letter3 . The characteristics of these datasets has been summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we have evaluated are online, it is not implausible for the specific ordering of examples to affect the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free parameter optimization was carried out and instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and a RBF kernel for letter data set. (The variance of the RBF kernel was identical to the one used in [1].) We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3 which uses a fixed limited budget. Here we enumerated the cache size limit in each experiment we performed. The different sizes that we tested are dataset dependent but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion and the goal of employing the the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the running of the algorithms. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and during the training process it considers all the examples at once, this algorithm serves as our gold-standard algorithm against which we want to compare 1 Available from http://www.research.att.com/˜yann Available from ftp.kyb.tuebingen.mpg.de 3 Available from http://www.ics.uci.edu/˜mlearn/MLRepository.html 2 usps mnist Fixed Adaptive SVM MIRA 1.8 4.8 4.7 letter Fixed Adaptive SVM MIRA 5.5 1.7 4.6 5 1.5 1.4 Test Error 4.5 Test Error Test Error 1.6 Fixed Adaptive SVM MIRA 6 4.4 4.3 4.5 4 3.5 4.2 4.1 3 4 2.5 1.3 1.2 3.9 0.2 0.4 0.6 0.8 1 1.2 1.4 # Support Patterns 1.6 1.8 2 2.2 500 4 2 1000 1500 x 10 mnist 2000 2500 # Support Patterns 3000 3500 1000 2000 3000 usps Fixed Adaptive MIRA 1550 7000 8000 9000 letter Fixed Adaptive MIRA 270 4000 5000 6000 # Support Patterns Fixed Adaptive MIRA 1500 265 1500 1400 260 Training Online Errors Training Online Errors Training Online Errors 1450 1450 255 250 245 1400 1350 1300 1350 240 1250 235 1300 0.2 0.4 0.6 0.8 1 1.2 1.4 # Support Patterns 1.6 1.8 2 2.2 500 4 1000 1500 x 10 mnist 4 x 10 2000 2500 # Support Patterns 3000 3500 1000 usps 6500 Fixed Adaptive MIRA 5.5 2000 3000 4000 5000 6000 # Support Patterns 7000 Fixed Adaptive MIRA 1.5 6000 9000 letter 4 x 10 1.6 Fixed Adaptive MIRA 8000 4 3.5 3 1.4 5500 Training Margin Errors Training Margin Errors Training Margin Errors 5 4.5 5000 4500 1.3 1.2 1.1 4000 1 2.5 3500 0.9 2 0.2 0.4 0.6 0.8 1 1.2 1.4 # Support Patterns 1.6 1.8 2 2.2 4 x 10 500 1000 1500 2000 2500 # Support Patterns 3000 3500 1000 2000 3000 4000 5000 6000 # Support Patterns 7000 8000 9000 Figure 4: Results on a three data sets - mnist (left), usps (center) and letter (right). 
Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size.

Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig. 4. This figure is composed of three columns of plots, each column corresponding to a different dataset - mnist (left), usps (center) and letter (right). In each of the plots the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the performance dependency on the size of the cache. The top row of the three columns shows the generalization error. Thus the y-axis designates the test error of an algorithm on unseen data at the end of the training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes where the algorithm exhibits good performance. In fact for MNIST and USPS there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case - the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns.

The plot of online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test error plots with the online error plots we see a close similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm.

The third row gives the online training margin errors, which translate directly into the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come at a price. Namely, the algorithm makes significantly more insertions into the cache than the fixed-size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions, and the end result is very good overall performance.
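As a side note on the kernels used in the experimental setup above, here is a hedged sketch of how the kernelized score over a cache of support patterns could be computed; the function names and the default RBF width are illustrative (the paper reuses the RBF variance from [1] without restating it) and not part of the paper.

```python
import numpy as np

def poly_kernel(x, z, degree=9):
    # homogeneous polynomial kernel of degree 9, as used for mnist and usps
    return np.dot(x, z) ** degree

def rbf_kernel(x, z, sigma=1.0):
    # RBF kernel, as used for letter; sigma = 1.0 is a placeholder value
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def cached_score(x, cache, X, y, kernel=poly_kernel):
    # score of a kernelized budget classifier: sum over the cached support
    # patterns, with all alphas equal to 1 as in Figs. 1-3
    return sum(y[j] * kernel(X[j], x) for j in cache)
```

The prediction is then sign(cached_score(x, ...)), and the insertion and deletion tests of Figs. 2 and 3 compare such scores against the margin parameter β.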
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization and similar numbers of support patterns. While the SVM is still somewhat better in both respects for the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.

5 Summary

We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and its performance, both in the size of the resulting classifier and in its error rate, is comparable to the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of examples might relay information on their importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements would make the proposed approach a viable alternative to SVM and other batch algorithms.

Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341.

References

[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[3] W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
2 0.53928995 148 nips-2003-Online Passive-Aggressive Algorithms
Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer
Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1
3 0.16796437 102 nips-2003-Large Scale Online Learning
Author: Léon Bottou, Yann L. Cun
Abstract: We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidences are presented. 1
4 0.16247363 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
5 0.14228311 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis
Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers. 1
6 0.10923147 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
7 0.10548456 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
8 0.095548496 48 nips-2003-Convex Methods for Transduction
9 0.093678288 64 nips-2003-Estimating Internal Variables and Parameters of a Learning Agent by a Particle Filter
10 0.085302733 171 nips-2003-Semi-Definite Programming by Perceptron Learning
11 0.072955318 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
12 0.071161985 1 nips-2003-1-norm Support Vector Machines
13 0.070537828 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
14 0.070304319 122 nips-2003-Margin Maximizing Loss Functions
15 0.066443488 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
16 0.062074378 124 nips-2003-Max-Margin Markov Networks
17 0.059669703 108 nips-2003-Learning a Distance Metric from Relative Comparisons
18 0.056951642 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
19 0.055972073 26 nips-2003-An MDP-Based Approach to Online Mechanism Design
20 0.055272818 146 nips-2003-Online Learning of Non-stationary Sequences
simIndex simValue paperId paperTitle
same-paper 1 0.96435541 145 nips-2003-Online Classification on a Budget
2 0.96271664 148 nips-2003-Online Passive-Aggressive Algorithms
3 0.59346813 102 nips-2003-Large Scale Online Learning
4 0.47481579 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Claudio Gentile
Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1
6 0.34419927 44 nips-2003-Can We Learn to Beat the Best Stock
7 0.31934774 48 nips-2003-Convex Methods for Transduction
8 0.31880778 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
9 0.30048746 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
10 0.28693402 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
11 0.25410774 108 nips-2003-Learning a Distance Metric from Relative Comparisons
12 0.25368157 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
13 0.24101582 146 nips-2003-Online Learning of Non-stationary Sequences
14 0.23428352 171 nips-2003-Semi-Definite Programming by Perceptron Learning
15 0.22747119 14 nips-2003-A Nonlinear Predictive State Representation
16 0.22507036 1 nips-2003-1-norm Support Vector Machines
17 0.21987109 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
18 0.2196689 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
19 0.20870827 3 nips-2003-AUC Optimization vs. Error Rate Minimization
20 0.20865211 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
topicId topicWeight
[(0, 0.053), (11, 0.011), (29, 0.015), (30, 0.015), (35, 0.056), (53, 0.099), (71, 0.1), (72, 0.282), (76, 0.034), (85, 0.114), (91, 0.098), (99, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.79070306 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1 Introduction and Motivation Kernel-based methods are widely being used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1] which can be used in conjunction with kernel functions. A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. A number of attempts have been made to speed up the training and testing of SVM’s by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 
4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work. 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x1 , y1 ), . . . , (xt , yt ), . . .. we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1} but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. The instances for which αt > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4] the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework although the constants βt , αt and ct are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. In Input: Tolerance β. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. Insert Ct ← Ct−1 ∪ {t}. 2. Set αt = 1. 3. Compute wt ← wt−1 + yt αt xt . 4. DistillCache(Ct , wt , (α1 , . . . , αt )). Output : H(x) = sign(wT · x). Figure 1: The aggressive Perceptron algorithm with a variable-size cache. this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. 
(See for instance [7] for formal results connecting the number of support patterns to the generalization error.) In Sec. 3 we present a formal analysis and Input: C, w, (α1 , . . . , αt ). the algorithmic details of our approach. Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). patterns in an online setting. Denote by Ct the indices of patterns which consti• if no such i exists then return. tute the classification vector wt . That is, • Remove the example i : i ∈ Ct if and only if αi > 0 on round 1. αi = 0. t when xt is received. The online classification algorithms discussed above keep 2. w ← w − αi yi xi . enlarging Ct – once an example is added 3. C ← C/{i} to Ct it will never be deleted. However, Return : C, w, (α1 , . . . , αt ). as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . We suggest a modification of the online algorithm structure as follows. Whenever yt i 0. Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound wT 2 from above and below and compare the 2 t bounds. Denote by αi the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by αi to be the weight of the ith example on round ˜t t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. First, we derive a lower bound on wT 2 by bounding the term wT · u from below in a recursive manner. T αt yt (xt · u) wT · u = t∈CT T αt = γ S . ≥ γ (1) t∈CT We now turn to upper bound wT 2 . Recall that each example may be added to the cache and removed from the cache a single time. Let us write wT 2 as a telescopic sum, wT 2 = ( wT 2 ˜ − wT 2 ˜ ) + ( wT 2 − wT −1 2 ˜ ) + . . . + ( w1 2 − w0 2 ) . (2) We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, ˜ ( wt 2 − wt−1 2 ) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of summands in Equ. (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). 
ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. If |Ct | = n remove one example: (a) Find i = arg maxj∈Ct {yj (wt−1 − αj yj xj )}. (b) Update wt−1 ← wt−1 − αi yi xi . (c) Remove Ct−1 ← Ct−1 /{i} 2. Insert Ct ← Ct−1 ∪ {t}. 3. Set αt = 1. 4. Compute wt ← wt−1 + yt αt xt . Output : H(x) = sign(wT · x). Figure 3: The aggressive Perceptron algorithm with as fixed-size cache. ˜ = 2yt (wt−1 · xt ) + xt 2 − 2yt (wt+p · xt ) + xt ˜ = 2 [yt (wt−1 · xt ) − yt ((wt+p − yt xt ) · xt )] ˜ ≤ 2 [β − yt ((wt+p − yt xt ) · xt )] . 2 ˜ Based on the form of the cache update we know that yt ((wt+p − yt xt ) · xt ) ≥ β, and thus, ˜ ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 − wt+p 2 ) ≤ 0 . Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R2 + 2β each to the bound of the telescopic sum of Equ. (2) and the rest of the examples do contribute anything to the bound. Hence, we can bound the norm of wT as follows, wT 2 ≤ S R2 + 2β . (3) We finish up the proof by applying the Cauchy-Swartz inequality and the assumption u = 1. Combining Equ. (1) and Equ. (3) we get, γ 2 S 2 ≤ (wT · u)2 ≤ wT 2 u 2 ≤ S(2β + R2 ) , which gives the desired bound. 4 Experiments In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also describe shortly another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of trying to answer the following questions. First, what is effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data), and second, would the algorithm described in Fig. 2 be able to find an optimal cache size that is able to achieve the best generalization performance. To examine each question separately we used a modified version of the algorithm described by Fig. 2 in which we restricted ourselves to have a fixed bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) Name mnist letter usps No. of Training Examples 60000 16000 7291 No. of Test Examples 10000 4000 2007 No. of Classes 10 26 10 No. of Attributes 784 16 256 Table 1: Description of the datasets used in experiments. simulates the original Perceptron algorithm with one notable difference. When the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. This modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3). Despite the simplicity of the original Perceptron algorithm [6] its good generalization performance on many datasets is remarkable. During the last few year a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen to use the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problem directly without any recourse to performing reductions. 
The algorithms were evaluated on three natural datasets: mnist1 , usps2 and letter3 . The characteristics of these datasets has been summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we have evaluated are online, it is not implausible for the specific ordering of examples to affect the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free parameter optimization was carried out and instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and a RBF kernel for letter data set. (The variance of the RBF kernel was identical to the one used in [1].) We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3 which uses a fixed limited budget. Here we enumerated the cache size limit in each experiment we performed. The different sizes that we tested are dataset dependent but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion and the goal of employing the the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the running of the algorithms. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and during the training process it considers all the examples at once, this algorithm serves as our gold-standard algorithm against which we want to compare 1 Available from http://www.research.att.com/˜yann Available from ftp.kyb.tuebingen.mpg.de 3 Available from http://www.ics.uci.edu/˜mlearn/MLRepository.html 2 usps mnist Fixed Adaptive SVM MIRA 1.8 4.8 4.7 letter Fixed Adaptive SVM MIRA 5.5 1.7 4.6 5 1.5 1.4 Test Error 4.5 Test Error Test Error 1.6 Fixed Adaptive SVM MIRA 6 4.4 4.3 4.5 4 3.5 4.2 4.1 3 4 2.5 1.3 1.2 3.9 0.2 0.4 0.6 0.8 1 1.2 1.4 # Support Patterns 1.6 1.8 2 2.2 500 4 2 1000 1500 x 10 mnist 2000 2500 # Support Patterns 3000 3500 1000 2000 3000 usps Fixed Adaptive MIRA 1550 7000 8000 9000 letter Fixed Adaptive MIRA 270 4000 5000 6000 # Support Patterns Fixed Adaptive MIRA 1500 265 1500 1400 260 Training Online Errors Training Online Errors Training Online Errors 1450 1450 255 250 245 1400 1350 1300 1350 240 1250 235 1300 0.2 0.4 0.6 0.8 1 1.2 1.4 # Support Patterns 1.6 1.8 2 2.2 500 4 1000 1500 x 10 mnist 4 x 10 2000 2500 # Support Patterns 3000 3500 1000 usps 6500 Fixed Adaptive MIRA 5.5 2000 3000 4000 5000 6000 # Support Patterns 7000 Fixed Adaptive MIRA 1.5 6000 9000 letter 4 x 10 1.6 Fixed Adaptive MIRA 8000 4 3.5 3 1.4 5500 Training Margin Errors Training Margin Errors Training Margin Errors 5 4.5 5000 4500 1.3 1.2 1.1 4000 1 2.5 3500 0.9 2 0.2 0.4 0.6 0.8 1 1.2 1.4 # Support Patterns 1.6 1.8 2 2.2 4 x 10 500 1000 1500 2000 2500 # Support Patterns 3000 3500 1000 2000 3000 4000 5000 6000 # Support Patterns 7000 8000 9000 Figure 4: Results on a three data sets - mnist (left), usps (center) and letter (right). 
Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size. performance. Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig 4. This figure is composed of three different plots organized in columns. Each of these plots corresponds to a different dataset - mnist (left), usps (center) and letter (right). In each of the three plots the x-axis designates number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the performance dependency on the size of the cache. The top row of the three columns shows the generalization error. Thus the y-axis designates the test error of an algorithm on unseen data at the end of the training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache size where the algorithm exhibits good performance. In fact for MNIST and USPS there are sizes for which the test error of the algorithm is better than SVM’s test error. Naturally, we cannot fix the correct size in hindsight so the question is whether the algorithm with variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case – the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset which contains less examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data and the variable-size method finds a very good cache size while making it also comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance not only in terms of generalization error but also in terms of number of support patterns. The plot of online training error against the number of support patterns, in row 2 of Fig 4, can be considered to be a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test error plots with the online error plots we see a nice similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm. The third row gives the online training margin errors that translates directly to the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come with a price. Namely, the algorithm makes significantly more insertions into the cache than the fixed size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions and the end result is very good overall performance. 
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization error and use a similar number of support patterns. While the SVM is still somewhat better in both respects on the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.

5 Summary

We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and its performance, in terms of both the size of the resulting classifier and its error rate, is comparable to that of the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of an example might convey information about its importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements will make the proposed approach a viable alternative to the SVM and other batch algorithms.

Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341.

References
[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.
[3] W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
2 0.66778606 107 nips-2003-Learning Spectral Clustering
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1
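As a quick illustration of the eigenstructure-based partitioning this abstract refers to, here is a generic two-way spectral clustering sketch based on the normalized-cut relaxation. It is not the cost function or learning algorithm proposed in the paper; the Gaussian affinity, its bandwidth, and the toy data are illustrative assumptions.

```python
# Generic two-way spectral clustering via the normalized-cut relaxation:
# build an affinity matrix, take the eigenvector of the second-smallest
# eigenvalue of the normalized Laplacian, and threshold it.
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian affinity
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_sym)             # ascending eigenvalues
    u = eigvecs[:, 1]                                    # 2nd-smallest eigenvector
    v = D_inv_sqrt @ u                                   # relaxed cluster indicator
    return (v > np.median(v)).astype(int)

# two Gaussian blobs as a toy similarity structure
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(spectral_bipartition(X))
```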
3 0.6058473 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
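For reference, the AUC that the abstract contrasts with the error rate can be computed directly as the Wilcoxon-Mann-Whitney statistic, i.e., the fraction of (positive, negative) pairs on which the positive example receives the higher score. The sketch below uses made-up toy scores and shows two classifiers with the same error rate but different AUC values.

```python
# AUC as a pairwise ranking statistic (Wilcoxon-Mann-Whitney): the fraction of
# (positive, negative) pairs on which the positive example scores higher.
# The toy scores are made-up values, chosen to give equal error rates.
import numpy as np

def auc_wmw(pos_scores, neg_scores):
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(np.mean(pos > neg))          # ignores ties for simplicity

def error_rate(pos_scores, neg_scores, threshold=0.0):
    fn = np.sum(np.asarray(pos_scores) <= threshold)   # false negatives
    fp = np.sum(np.asarray(neg_scores) > threshold)    # false positives
    return float(fn + fp) / (len(pos_scores) + len(neg_scores))

# two classifiers with the same error rate but different AUC values
pos_a, neg_a = [2.0, 1.5, 0.5, -0.2], [-1.0, -0.8, -0.5, 0.3]
pos_b, neg_b = [0.1, 0.2, 0.3, -0.2], [-0.1, -0.3, -0.4, 0.3]
print(error_rate(pos_a, neg_a), auc_wmw(pos_a, neg_a))   # 0.25, 0.875
print(error_rate(pos_b, neg_b), auc_wmw(pos_b, neg_b))   # 0.25, 0.6875
```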
4 0.60325933 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
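To illustrate the general idea of forward feature selection mentioned in the abstract, here is a generic greedy selection sketch. It is not the cascade-learning algorithm of the paper; the nearest-centroid scoring criterion and the synthetic data are illustrative stand-ins for the real weak-classifier criterion and image features.

```python
# Generic greedy forward feature selection: repeatedly add the single feature
# that most improves a simple training-accuracy score.  The nearest-centroid
# classifier used for scoring is an illustrative stand-in only.
import numpy as np

def centroid_accuracy(X, y, feats):
    Xf = X[:, feats]
    c_pos, c_neg = Xf[y == 1].mean(0), Xf[y == -1].mean(0)
    pred = np.where(((Xf - c_pos) ** 2).sum(1) < ((Xf - c_neg) ** 2).sum(1), 1, -1)
    return float(np.mean(pred == y))

def forward_select(X, y, n_select):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        scores = {f: centroid_accuracy(X, y, selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)     # greedily add the best feature
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 25))
y = np.sign(X[:, 3] + X[:, 7] + 0.5 * rng.normal(size=300))  # two informative features
print(forward_select(X, y, n_select=3))
```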
5 0.60180008 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman
Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1
6 0.5976699 41 nips-2003-Boosting versus Covering
7 0.59561539 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
8 0.59465533 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
9 0.59361452 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
10 0.59237605 124 nips-2003-Max-Margin Markov Networks
11 0.59073567 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
12 0.58957505 126 nips-2003-Measure Based Regularization
13 0.58926171 143 nips-2003-On the Dynamics of Boosting
14 0.58765405 78 nips-2003-Gaussian Processes in Reinforcement Learning
15 0.58664513 117 nips-2003-Linear Response for Approximate Inference
16 0.58611822 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
17 0.58602923 113 nips-2003-Learning with Local and Global Consistency
18 0.58504677 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
19 0.58485311 187 nips-2003-Training a Quantum Neural Network
20 0.58484262 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates