nips nips2003 nips2003-176 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a method for sequential Bayesian kernel regression. [sent-7, score-0.626]
2 Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. [sent-9, score-0.353]
3 It is thus applicable to truly sequential data sets and batch data sets alike. [sent-11, score-0.468]
4 The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. [sent-12, score-0.164]
5 Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies. [sent-13, score-0.345]
6 1 Introduction Bayesian kernel methods, including the popular Relevance Vector Machine (RVM) [10, 11], have proved to be effective tools for regression and classification. [sent-14, score-0.351]
7 For the RVM the sparsity constraints are elegantly formulated within a Bayesian framework, and the result of the estimation is a mixture of kernel functions that rely on only a small fraction of the data points. [sent-15, score-0.269]
8 Furthermore, the RVM does not require any constraints on the types of kernel functions, and provides a probabilistic output, rather than a hard decision. [sent-18, score-0.175]
9 Standard batch methods for kernel regression suffer from a computational drawback in that they are iterative in nature, with a computational complexity that is normally cubic in the number of data points at each iteration. [sent-19, score-0.555]
10 In this paper we propose a full Bayesian formulation for kernel regression on sequential data. [sent-22, score-0.53]
11 It is equally applicable to batch data sets by presenting the data points one at a time, with the order of presentation being unimportant. [sent-24, score-0.28]
12 As opposed to batch strategies that attempt to find the optimal solution conditional on all the data, the sequential strategy includes the data one at a time, so that the posterior exhibits a tempering effect as the amount of data increases. [sent-26, score-0.569]
13 The algorithm itself is based on a generalisation of Importance Sampling, and recursively updates a sample based approximation of the posterior distribution as more data points become available. [sent-28, score-0.385]
14 The proposal distribution is defined on an augmented parameter space, and is formulated in terms of model moves, reminiscent of the Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) algorithm [5]. [sent-29, score-0.16]
15 For kernel regression these moves may include update moves to refine the kernel locations, birth moves to add new kernels to better explain the increasing data, and death moves to eliminate erroneous or redundant kernels. [sent-30, score-1.858]
16 In Section 2 we outline the details of the model for sequential Bayesian kernel regression. [sent-32, score-0.395]
17 In Section 3 we present the sequential estimation algorithm. [sent-33, score-0.279]
18 It can, in fact, be applied to any model for which the posterior can be evaluated up to a normalising constant. [sent-35, score-0.199]
19 We illustrate the performance of the algorithm on two standard regression data sets in Section 4, before concluding with some remarks in Section 5. [sent-36, score-0.17]
20 2 Model Description The data is assumed to arrive sequentially as input-output pairs (xt , yt ), t = 1, 2, · · · , xt ∈ Rd , yt ∈ R. [sent-37, score-0.792]
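To make the regression model concrete, the short sketch below (Python with numpy) builds a kernel design matrix for a given set of centres and evaluates the corresponding predictor. The Gaussian kernel and its width are illustrative assumptions on our part, since this extract does not reproduce the likelihood equation (1).

```python
import numpy as np

def gaussian_kernel(X, mu, width=1.0):
    """Gaussian kernel between the rows of X and a single centre mu.
    The kernel choice and the width are illustrative assumptions."""
    return np.exp(-np.sum((X - mu) ** 2, axis=-1) / (2.0 * width ** 2))

def design_matrix(X, centres, width=1.0):
    """Design matrix with a leading column of ones (bias) and one column per kernel centre."""
    Phi = np.ones((X.shape[0], len(centres) + 1))
    for j, mu in enumerate(centres):
        Phi[:, j + 1] = gaussian_kernel(X, mu, width)
    return Phi

def predict(X, centres, beta, width=1.0):
    """Kernel regression mean: f(x) = beta_0 + sum_j beta_j K(x, mu_j)."""
    return design_matrix(X, centres, width) @ beta
```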
21 For k ∈ {1, · · · , kmax} the priors are p(βk, σβ²) = N(βk | 0, σβ² Ik+1) IG(σβ² | aβ, bβ), p(Uk) = ∏_{l=1}^{k} ∑_{s=1}^{t} δ_{xs}(µl)/t, and p(σy²) = IG(σy² | ay, by), where δx(·) denotes the Dirac delta function with mass at x, and IG(·|a, b) denotes the Inverted Gamma distribution with parameters a and b. [sent-41, score-0.181]
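The following sketch draws a configuration from the priors just listed. The truncated Poisson, isotropic Gaussian and Inverted Gamma forms come from the text; the particular hyperparameter values and the use of distinct (rather than independently drawn) centres are our illustrative assumptions.

```python
import numpy as np
from scipy import stats

def sample_prior(X_seen, lam=1.0, k_max=50, a_beta=2.0, b_beta=1.0,
                 a_y=2.0, b_y=1.0, rng=None):
    """Draw (k, centres, beta, sigma_beta^2, sigma_y^2) from the priors above.
    The Inverted Gamma hyperparameters here are illustrative (the experiments in
    Section 4 use vague values), and the centres are drawn without replacement,
    which mirrors the birth move rather than the literal i.i.d.-uniform grid prior."""
    rng = rng or np.random.default_rng(0)
    t = X_seen.shape[0]
    # truncated Poisson prior on the number of kernels k
    ks = np.arange(0, k_max + 1)
    pk = stats.poisson.pmf(ks, lam)
    k = int(min(rng.choice(ks, p=pk / pk.sum()), t))   # the grid supplies at most t centres
    # kernel centres uniform over the grid of inputs seen so far
    centres = X_seen[rng.choice(t, size=k, replace=False)]
    # variances from Inverted Gamma priors (sampled as reciprocals of Gamma draws)
    sigma2_beta = 1.0 / rng.gamma(a_beta, 1.0 / b_beta)
    sigma2_y = 1.0 / rng.gamma(a_y, 1.0 / b_y)
    # regression coefficients from an isotropic Gaussian prior
    beta = rng.normal(0.0, np.sqrt(sigma2_beta), size=k + 1)
    return k, centres, beta, sigma2_beta, sigma2_y
```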
22 The prior on the number of kernels is set to be a truncated Poisson distribution, with the mean λ and the maximum number of kernels kmax assumed to be fixed and known. [sent-42, score-0.311]
23 The regression coefficients are drawn from an isotropic Gaussian prior with variance σβ² in each direction. [sent-43, score-0.172]
24 The prior for the kernel centres is assumed to be uniform over the grid formed by the input data points available at the current time step. [sent-46, score-0.522]
25 Given the likelihood and prior in (1) and (2), respectively, it is straightforward to obtain an expression for the full posterior distribution p(k, θ k |Yt ). [sent-50, score-0.226]
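Since the likelihood (1) is not reproduced in this extract, the sketch below evaluates the joint (non-marginalised) posterior up to a constant under an assumed Gaussian likelihood with a Gaussian kernel; the paper's marginal posterior (3) instead integrates the coefficients out analytically, and the Inverted Gamma hyperprior terms are omitted here because the experiments use vague values.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def log_posterior_unnorm(y, X, centres, beta, sigma2_y, sigma2_beta, lam=1.0, width=1.0):
    """log p(k, theta_k | Y_t) up to an additive constant, for the non-marginalised model.
    The Gaussian likelihood is an assumption standing in for equation (1)."""
    t, k = X.shape[0], len(centres)
    # design matrix: bias column plus one Gaussian kernel column per centre (assumed kernel)
    Phi = np.column_stack([np.ones(t)] +
                          [np.exp(-np.sum((X - mu) ** 2, axis=1) / (2 * width ** 2))
                           for mu in centres])
    log_lik = np.sum(stats.norm.logpdf(y, loc=Phi @ beta, scale=np.sqrt(sigma2_y)))
    log_prior_beta = np.sum(stats.norm.logpdf(beta, loc=0.0, scale=np.sqrt(sigma2_beta)))
    log_prior_k = k * np.log(lam) - gammaln(k + 1)   # truncated Poisson, up to its normaliser
    log_prior_centres = -k * np.log(t)               # each centre uniform over the t grid points
    return log_lik + log_prior_beta + log_prior_k + log_prior_centres
```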
26 It will be our objective to approximate this distribution recursively in time as more data becomes available, using Monte Carlo techniques. [sent-52, score-0.157]
27 Once we have samples for the kernel centres, we will require new samples for the unknown parameters (σy², σβ²) at the next time step. [sent-53, score-0.43]
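The conditional posteriors (4) and (5) are not reproduced in this extract; as a hedged stand-in, the sketch below uses the standard conjugate Inverted Gamma updates for the non-marginalised Gaussian model, which is the usual form such posteriors take, rather than the authors' exact expressions.

```python
import numpy as np

def sample_variances(y, Phi, beta, a_y=0.0, b_y=0.0, a_beta=0.0, b_beta=0.0, rng=None):
    """Draw (sigma_y^2, sigma_beta^2) from Inverted Gamma conditionals.
    These are generic conjugate updates; treat them as a plausible stand-in for the
    paper's (4) and (5), which are not available in this extract."""
    rng = rng or np.random.default_rng(0)
    t, k_plus_1 = Phi.shape
    resid = y - Phi @ beta
    sigma2_y = 1.0 / rng.gamma(a_y + 0.5 * t, 1.0 / (b_y + 0.5 * resid @ resid))
    sigma2_beta = 1.0 / rng.gamma(a_beta + 0.5 * k_plus_1, 1.0 / (b_beta + 0.5 * beta @ beta))
    return sigma2_y, sigma2_beta
```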
28 Since the number of kernel functions to use is unknown, the marginal posterior in (3) is defined over a discrete space of variable dimension. [sent-55, score-0.375]
29 In the next section we will present a generalised importance sampling strategy to obtain Monte Carlo approximations for distributions of this nature recursively as more data becomes available. [sent-56, score-0.429]
30 3 Sequential Estimation Recall that it is our objective to recursively update a Monte Carlo representation of the posterior distribution for the kernel regression parameters as more data becomes available. [sent-57, score-0.633]
31 The method we propose here is based on a generalisation of the popular importance sampling technique. [sent-58, score-0.347]
32 Its application extends to any model for which the posterior can be evaluated up to a normalising constant. [sent-59, score-0.234]
33 We will thus first present the general strategy, before outlining the details for sequential kernel regression. [sent-60, score-0.395]
34 3.1 Generalised Importance Sampling Our aim is to recursively update a sample-based approximation of the posterior p(k, θk|Yt) of a model parameterised by θk as more data becomes available. [sent-62, score-0.292]
35 The efficiency of importance sampling hinges on the ability to design a good proposal distribution, i.e. one that is close to the target distribution. [sent-63, score-0.34]
36 Designing an efficient proposal distribution to generate samples directly in the target parameter space is difficult. [sent-66, score-0.274]
37 Apart from some weak assumptions, which we will discuss shortly, the distribution q′t is entirely arbitrary, and may depend on the data and the time step. [sent-71, score-0.263]
38 In fact, in the application to the RVM we consider here we will set it to q′t(k′, θk′|k, θk) = δ(k,θk)(k′, θk′), so that it effectively disappears from the expression above. [sent-72, score-0.212]
39 A similar strategy of augmenting the space to simplify the importance sampling procedure has been exploited before in [7] to develop efficient Sequential Monte Carlo (SMC) samplers for a wide range of models. [sent-73, score-0.284]
40 To generate samples in this joint space we define the proposal for importance sampling to be of the form Qt(k, θk; k′, θk′) = p(k′, θk′|Yt−1) qt(k, θk|k′, θk′), (7) where qt may again depend on the data and the time step. [sent-74, score-0.685]
41 This proposal embodies the sequential character of our algorithm. [sent-75, score-0.319]
42 Similar to SMC methods [3] it generates samples for the parameters at the current time step by incrementally refining the posterior at the previous time step through the distribution qt . [sent-76, score-0.549]
43 Designing efficient incremental proposals is much easier than constructing proposals that generate samples directly in the target parameter space, since the posterior is unlikely to undergo dramatic changes over consecutive time steps. [sent-77, score-0.365]
44 To compensate for the discrepancy between the proposal in (7) and the joint posterior in (6) the importance weight takes the form Wt(k, θk; k′, θk′) = p(k, θk|Yt) q′t(k′, θk′|k, θk) / [p(k′, θk′|Yt−1) qt(k, θk|k′, θk′)]. (8) [sent-78, score-0.447]
45 Due to the construction of the joint in (6), marginal samples in the target parameter space associated with this weighting will indeed be distributed according to the target posterior p(k, θk|Yt). [sent-79, score-0.421]
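A minimal, model-agnostic sketch of one update under (7)-(8) is given below, with the auxiliary kernel q′t taken to be a delta as in the RVM application, so that only qt appears in the weight. The function arguments (posterior evaluators, move proposer and its log density) are placeholders for problem-specific code, not part of the paper.

```python
import numpy as np

def generalised_is_update(particles, log_post_t, log_post_tm1, propose, log_q, rng=None):
    """One generalised importance-sampling update following (7)-(8). The incoming
    particles are assumed equally weighted (e.g. after resampling); log_post_t and
    log_post_tm1 evaluate the unnormalised posteriors at times t and t-1, and
    propose / log_q implement a move and its log density. All are placeholders."""
    rng = rng or np.random.default_rng(0)
    new_particles, log_w = [], []
    for old in particles:
        new = propose(old, rng)
        log_w.append(log_post_t(new) - log_post_tm1(old) - log_q(new, old))
        new_particles.append(new)
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())          # normalise in a numerically stable way
    return new_particles, w / w.sum()
```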
46 As might be expected the importance weight in (8) is similar in form to the acceptance ratio for the RJ-MCMC algorithm [5]. [sent-80, score-0.196]
47 One notable difference is that the reversibility condition is not required, so that for a given qt, q′t may be arbitrary, as long as the ratio in (8) is well-defined. [sent-81, score-0.344]
48 In practice it is often necessary to design a number of candidate moves to obtain an efficient algorithm. [sent-82, score-0.227]
49 Examples include update moves to refine the model parameters in the light of the new data, birth moves to add new parameters to better explain the new data, death moves to remove redundant or erroneous parameters, and many more. [sent-83, score-1.164]
50 We will denote the set of candidate moves at time t by {αt,i, qt,i, q′t,i}, i = 1, · · · , M, where αt,i is the probability of choosing move i, with ∑_{i=1}^{M} αt,i = 1. [sent-84, score-0.465]
51 For each move i the importance weight is computed by substituting the corresponding qt,i and q′t,i into (8). [sent-85, score-0.405]
52 Note that the probability of choosing a particular move may depend on the old state and the time step, so that moves may be included or excluded as is appropriate. [sent-86, score-0.465]
53 3.2 Sequential Kernel Regression We will now present the details for sequential kernel regression. [sent-88, score-0.395]
54 Our main concern will be the recursive estimation of the marginal posterior for the kernel centres in (3). [sent-89, score-0.565]
55 This distribution is conditional on the parameters (σy², σβ²), for which samples can be obtained at each time step from the corresponding posteriors in (4) and (5). [sent-90, score-0.208]
56 To sample for the new kernel centres we will consider three kinds of moves: a zero move qt,1 , a birth move qt,2 , and a death move qt,3 . [sent-91, score-1.296]
57 The birth move adds a new kernel at a uniformly randomly chosen location over the grid of unoccupied input data points. [sent-93, score-0.754]
58 The death move removes a uniformly randomly chosen kernel. [sent-94, score-0.448]
59 For k = 0 only the birth move is possible, whereas the birth move is impossible for k = kmax or k = t. [sent-95, score-0.858]
60 Similar to [5] we set the move probabilities to αt,2 = c min{1, p(k + 1)/p(k)}, αt,3 = c min{1, p(k − 1)/p(k)}, and αt,1 = 1 − αt,2 − αt,3 in all other cases. [sent-96, score-0.209]
61 In the above c ∈ (0, 1) is a parameter that tunes the relative frequency of the dimension changing moves to the zero move. [sent-97, score-0.227]
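The move-probability rule above can be coded directly; in the sketch below the value of c and the Poisson mean are illustrative assumptions (the exact value of c is truncated in this extract), and the boundary cases follow the text: only a birth move at k = 0, and no birth at k = kmax or k = t.

```python
import numpy as np
from scipy import stats

def move_probabilities(k, t, lam=1.0, k_max=50, c=0.25):
    """Probabilities (zero, birth, death) for the current model order k at time t.
    The ratio p(k+1)/p(k) of the Poisson prior is unaffected by truncation."""
    if k == 0:
        return np.array([0.0, 1.0, 0.0])     # only the birth move is possible
    p = stats.poisson.pmf
    birth = c * min(1.0, p(k + 1, lam) / p(k, lam)) if (k < k_max and k < t) else 0.0
    death = c * min(1.0, p(k - 1, lam) / p(k, lam))
    return np.array([1.0 - birth - death, birth, death])
```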
62 For these choices the importance weight in (8) becomes Wt,i(k, Uk; k′, Uk′) ∝ [|Bk|^{1/2} exp(−(Ytᵀ Pk Yt − Yt−1ᵀ Pk′ Yt−1)/2σy²)] / [|Bk′|^{1/2} (2πσy²)^{1/2} (σβ²)^{(k−k′)/2}] × [p(k) p(Uk)] / [p(k′) p(Uk′) qt,i(k, Uk|k′, Uk′)], where the prior ratio contributes a factor λ^{k−k′} from the truncated Poisson prior together with the corresponding factorial and grid terms. [sent-98, score-0.196]
63 The primed variables are those corresponding to the posterior at time t − 1. [sent-100, score-0.151]
64 For the zero move the parameters are left unchanged, so that the expression for qt,1 in the importance weight becomes unity. [sent-101, score-0.487]
65 This is often a good move to choose, and captures the notion that the posterior rarely changes dramatically over consecutive time steps. [sent-102, score-0.36]
66 For the birth move one new kernel is added, so that k = k′ + 1. [sent-103, score-0.548]
67 The centre for this kernel is uniformly randomly chosen from the grid of unoccupied input data points. [sent-104, score-0.381]
68 This means that the expression for qt,2 in the importance weight reduces to 1/(t − k′), since there are t − k′ such data points. [sent-105, score-0.271]
69 Similarly, the death move removes a uniformly randomly chosen kernel, so that k = k′ − 1. [sent-106, score-0.448]
70 In this case the expression for qt,3 in the importance weight reduces to 1/k′. [sent-107, score-0.236]
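The three moves and their proposal densities can be summarised as follows; the representation of the centres as rows of the current input grid is our assumption.

```python
import numpy as np

def apply_move(move, centres, X_seen, rng):
    """Apply a zero, birth or death move to the (k', d) array of kernel centres and
    return the new centres together with log q_{t,i}(new | old): 0 for the zero move,
    -log(t - k') for a birth over the unoccupied grid points, and -log k' for a death."""
    t, k = X_seen.shape[0], len(centres)
    if move == "zero":
        return centres, 0.0
    if move == "birth":
        occupied = {tuple(mu) for mu in centres}
        free = [x for x in X_seen if tuple(x) not in occupied]   # t - k' candidate locations
        new = free[rng.integers(len(free))]
        return np.vstack([centres, new[None, :]]), -np.log(len(free))
    if move == "death":
        drop = rng.integers(k)
        return np.delete(centres, drop, axis=0), -np.log(k)
    raise ValueError(f"unknown move: {move}")
```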
71 However, we found that the simple moves presented yield satisfactory results while keeping the computational complexity acceptable. [sent-111, score-0.227]
72 Algorithm 1: Sequential Kernel Regression Inputs: • Kernel function K(·, ·), model parameters (λ, kmax, ay, by, aβ, bβ), fraction of dimension-changing moves c, and the number of samples N used to approximate the posterior. [sent-113, score-0.664]
73 Generalised Importance Sampling Step: t > 0 • For i = 1 · · · N: sample a move j(i) so that P(j(i) = l) = αt,l. [sent-115, score-0.209]
74 Else if j(i) = 2 (birth move), form Uk(i) by uniformly randomly adding a kernel at one of the unoccupied data points, and set k(i) = k(i) + 1. [sent-117, score-0.354]
75 For i = 1 · · · N, compute the importance weights Wt(i) and normalise. [sent-119, score-0.161]
76 Resampling Step: t > 0 • Multiply / discard samples {k(i), θk(i)} with respect to high / low importance weights {Wt(i)} to obtain N new samples {k(i), θk(i)}. [sent-121, score-0.319]
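Putting the pieces together, a structural outline of Algorithm 1 might look as follows; the four function arguments stand for the components sketched elsewhere in this section (move probabilities, move application, marginal posterior evaluation and resampling), and the initialisation of every particle with an empty kernel set at t = 1 is a simplification of ours, not the paper's.

```python
import numpy as np

def sequential_kernel_regression(X, y, move_probabilities, apply_move,
                                 log_marginal_posterior, resample,
                                 n_particles=100, rng=None):
    """Structural outline of Algorithm 1. log_marginal_posterior should evaluate
    log p(k, U_k | Y_t) up to a constant (and must handle k = 0)."""
    rng = rng or np.random.default_rng(0)
    particles = [np.empty((0, X.shape[1])) for _ in range(n_particles)]
    for t in range(2, len(y) + 1):
        Xt, yt = X[:t], y[:t]
        new_particles, log_w = [], np.empty(n_particles)
        for i, old in enumerate(particles):
            move = rng.choice(["zero", "birth", "death"], p=move_probabilities(len(old), t))
            new, log_q = apply_move(move, old, Xt, rng)
            log_w[i] = (log_marginal_posterior(new, Xt, yt)
                        - log_marginal_posterior(old, Xt[:-1], yt[:-1]) - log_q)
            new_particles.append(new)
        w = np.exp(log_w - log_w.max())
        particles = resample(new_particles, w / w.sum(), rng)   # multiply / discard samples
    return particles
```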
77 The resampling step is required to avoid degeneracy of the sample-based representation. [sent-127, score-0.163]
78 It can be performed by standard procedures such as multinomial resampling [4], stratified resampling [6], or minimum entropy resampling [2]. [sent-128, score-0.375]
79 All these schemes are unbiased, so that the number of times Ni that the sample (k(i), θk(i)) appears after resampling satisfies E(Ni) = N Wt(i). [sent-129, score-0.163]
80 Thus, resampling essentially multiplies samples with high importance weights, and discards those with low importance weights. [sent-130, score-0.526]
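Two of the unbiased schemes mentioned above, multinomial and stratified resampling, are easy to sketch:

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Multinomial resampling [4]: E(N_i) = N * W_t^(i), as stated above."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx]

def stratified_resample(particles, weights, rng):
    """Stratified resampling [6]: one uniform draw per equal-probability stratum;
    also unbiased, with lower variance than multinomial resampling."""
    n = len(particles)
    u = (np.arange(n) + rng.random(n)) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), u), n - 1)
    return [particles[i] for i in idx]
```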
81 4 Experiments and Results In this section we illustrate the performance of the proposed sequential estimation algorithm on two standard regression data sets. [sent-135, score-0.449]
82 The training data is taken to be the sinc function. [sent-138, score-0.19]
83 In all the runs we presented these points to the sequential estimation algorithm in random order. [sent-142, score-0.311]
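A possible data-generation and streaming setup for this experiment is sketched below; the number of points, the input range and the noise level are assumptions of ours, since the extract truncates the experimental details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sinc training set (sizes and noise level assumed, not taken from the paper).
x = np.linspace(-10.0, 10.0, 50)
y = np.sinc(x / np.pi) + rng.normal(0.0, 0.1, size=x.shape)   # np.sinc(z) = sin(pi z)/(pi z)

# Present the points to the sequential estimation algorithm in random order, as in the text.
order = rng.permutation(len(x))
x_stream, y_stream = x[order, None], y[order]                 # (t, 1) inputs, scalar outputs
```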
84 We set the fixed parameters of the model to (λ, kmax, ay, by, aβ, bβ) = (1, 50, 0, 0, 0, 0). [sent-145, score-0.236]
85 The fraction of dimension change moves was set to c = 0. [sent-147, score-0.227]
86 In Table 1 we compare the results of the proposed sequential estimation algorithm with a number of batch strategies for the SVM and RVM. [sent-155, score-0.515]
87 The results for the batch algorithms are reproduced from [1, 9]. [sent-156, score-0.178]
88 The error for the sequential algorithm is slightly higher. [sent-157, score-0.22]
89 This is due to the stochastic nature of the algorithm, and the fact that it uses only very simple moves that take no account of the characteristics of the data during the move proposal. [sent-158, score-0.471]
90 Figure 1: Results for the sinc experiment. [sent-172, score-0.155]
91 Table 1: Comparative performance results for the sinc data. [sent-192, score-0.155]
92 For the model and algorithm parameters we used values similar to those for the sinc experiment, except for setting λ = 5 to allow a larger number of kernels. [sent-198, score-0.197]
93 The test error is comparable to that of the batch strategies, but far fewer kernels are required. [sent-201, score-0.244]
94 5 Conclusions In this paper we proposed a sequential estimation strategy for Bayesian kernel regression. [sent-210, score-0.497]
95 Our algorithm is based on a generalisation of importance sampling, and incrementally updates a Monte Carlo representation of the target posterior distribution as more data points become available. [sent-211, score-0.56]
96 It is further non-iterative, and requires only a single pass over the data, thus overcoming some of the computational difficulties associated with batch estimation strategies for kernel regression. [sent-213, score-0.509]
97 Our algorithm is more general than the kernel regression problem considered here. [sent-214, score-0.31]
98 Its application extends to any model for which the posterior can be evaluated up to a normalising constant. [sent-215, score-0.234]
99 Initial experiments on two standard regression data sets showed our algorithm to compare favourably with existing batch estimation strategies for kernel regression. [sent-216, score-0.713]
100 Sparse Bayesian learning for regression and classification using Markov chain Monte Carlo. [sent-287, score-0.165]
wordName wordTfidf (topN-words)
[('yt', 0.349), ('uk', 0.231), ('moves', 0.227), ('sequential', 0.22), ('move', 0.209), ('rvm', 0.204), ('kk', 0.179), ('batch', 0.178), ('kernel', 0.175), ('qt', 0.172), ('birth', 0.164), ('bk', 0.164), ('importance', 0.161), ('centres', 0.157), ('sinc', 0.155), ('regression', 0.135), ('death', 0.135), ('resampling', 0.125), ('posterior', 0.122), ('monte', 0.121), ('kmax', 0.112), ('carlo', 0.109), ('ig', 0.102), ('relevance', 0.101), ('proposal', 0.099), ('wt', 0.085), ('ay', 0.082), ('sampling', 0.08), ('samples', 0.079), ('normalising', 0.077), ('inverted', 0.076), ('bayesian', 0.071), ('target', 0.069), ('doucet', 0.068), ('unoccupied', 0.067), ('housing', 0.067), ('mmse', 0.067), ('recursively', 0.066), ('kernels', 0.066), ('generalisation', 0.065), ('pk', 0.065), ('estimation', 0.059), ('strategies', 0.058), ('gamma', 0.055), ('marginal', 0.052), ('vermaak', 0.052), ('uniformly', 0.051), ('kt', 0.049), ('incrementally', 0.049), ('comparative', 0.049), ('godsill', 0.045), ('favourably', 0.045), ('gordon', 0.044), ('boston', 0.044), ('generalised', 0.044), ('strategy', 0.043), ('bishop', 0.042), ('parameters', 0.042), ('reproduced', 0.041), ('smc', 0.041), ('erroneous', 0.041), ('reversible', 0.041), ('svm', 0.041), ('popular', 0.041), ('expression', 0.04), ('coef', 0.039), ('pass', 0.039), ('designing', 0.039), ('sample', 0.038), ('prior', 0.037), ('determination', 0.036), ('extends', 0.035), ('weight', 0.035), ('data', 0.035), ('reminiscent', 0.034), ('jump', 0.034), ('cients', 0.033), ('proposals', 0.033), ('points', 0.032), ('freitas', 0.031), ('tipping', 0.031), ('posteriors', 0.031), ('update', 0.031), ('joint', 0.03), ('clean', 0.03), ('ik', 0.03), ('xs', 0.03), ('assumed', 0.03), ('chain', 0.03), ('editors', 0.029), ('time', 0.029), ('xt', 0.029), ('redundant', 0.028), ('existing', 0.028), ('removes', 0.027), ('distribution', 0.027), ('grid', 0.027), ('vt', 0.027), ('randomly', 0.026), ('unknown', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
2 0.16689864 148 nips-2003-Online Passive-Aggressive Algorithms
Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer
Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1
3 0.16247363 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times.
4 0.16024606 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
Author: Ryan C. Kelly, Tai Sing Lee
Abstract: Decoding is a strategy that allows us to assess the amount of information neurons can provide about certain aspects of the visual scene. In this study, we develop a method based on Bayesian sequential updating and the particle filtering algorithm to decode the activity of V1 neurons in awake monkeys. A distinction in our method is the use of Volterra kernels to filter the particles, which live in a high dimensional space. This parametric Bayesian decoding scheme is compared to the optimal linear decoder and is shown to work consistently better than the linear optimal decoder. Interestingly, our results suggest that for decoding in real time, spike trains of as few as 10 independent but similar neurons would be sufficient for decoding a critical scene variable in a particular class of visual stimuli. The reconstructed variable can predict the neural activity about as well as the actual signal with respect to the Volterra kernels. 1
5 0.12781747 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
Author: Kazuyuki Samejima, Kenji Doya, Yasumasa Ueda, Minoru Kimura
Abstract: When we model a higher order functions, such as learning and memory, we face a difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose novel method for estimating hidden variables of a learning agent, such as connection weights from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply particle filter for estimating internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data. 1
6 0.12215657 48 nips-2003-Convex Methods for Transduction
7 0.11177588 102 nips-2003-Large Scale Online Learning
8 0.10367723 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
9 0.096853852 196 nips-2003-Wormholes Improve Contrastive Divergence
10 0.093197882 51 nips-2003-Design of Experiments via Information Theory
11 0.08986745 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
12 0.089421839 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
13 0.081116408 112 nips-2003-Learning to Find Pre-Images
14 0.07426405 73 nips-2003-Feature Selection in Clustering Problems
15 0.063674599 155 nips-2003-Perspectives on Sparse Bayesian Learning
16 0.063555844 141 nips-2003-Nonstationary Covariance Functions for Gaussian Process Regression
17 0.062844269 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
18 0.061711702 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems
19 0.061313599 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
20 0.061194319 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
topicId topicWeight
[(0, -0.23), (1, 0.021), (2, -0.035), (3, -0.105), (4, 0.121), (5, 0.087), (6, 0.158), (7, 0.202), (8, 0.174), (9, 0.063), (10, 0.0), (11, -0.02), (12, -0.093), (13, -0.014), (14, 0.095), (15, -0.027), (16, -0.029), (17, 0.072), (18, 0.007), (19, 0.107), (20, -0.052), (21, 0.033), (22, 0.056), (23, 0.035), (24, 0.031), (25, -0.07), (26, 0.04), (27, 0.028), (28, 0.012), (29, -0.049), (30, -0.089), (31, -0.161), (32, 0.003), (33, -0.035), (34, 0.06), (35, -0.046), (36, 0.004), (37, -0.125), (38, 0.047), (39, 0.062), (40, 0.059), (41, 0.054), (42, -0.051), (43, -0.008), (44, -0.098), (45, -0.024), (46, 0.01), (47, -0.003), (48, -0.131), (49, 0.075)]
simIndex simValue paperId paperTitle
same-paper 1 0.94626665 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
2 0.60002881 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
Author: Ryan C. Kelly, Tai Sing Lee
Abstract: Decoding is a strategy that allows us to assess the amount of information neurons can provide about certain aspects of the visual scene. In this study, we develop a method based on Bayesian sequential updating and the particle filtering algorithm to decode the activity of V1 neurons in awake monkeys. A distinction in our method is the use of Volterra kernels to filter the particles, which live in a high dimensional space. This parametric Bayesian decoding scheme is compared to the optimal linear decoder and is shown to work consistently better than the linear optimal decoder. Interestingly, our results suggest that for decoding in real time, spike trains of as few as 10 independent but similar neurons would be sufficient for decoding a critical scene variable in a particular class of visual stimuli. The reconstructed variable can predict the neural activity about as well as the actual signal with respect to the Volterra kernels. 1
3 0.59351069 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1 Introduction and Motivation Kernel-based methods are widely being used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1] which can be used in conjunction with kernel functions. A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. A number of attempts have been made to speed up the training and testing of SVM’s by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 
4 provides experimental details, evaluated on three real world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work. 2 Problem Setting and Algorithms This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x1 , y1 ), . . . , (xt , yt ), . . .. we assume that each instance is a vector xt ∈ Rn and each label belongs to a finite set Y. In this and the next section we assume that Y = {−1, +1} but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w · x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = t αt yt xt where αt ≥ 0. The instances for which αt > 0 are referred to as support patterns. Under this assumption, the output of the classifier solely depends on inner-products of the form x · xt the use of kernel functions can easily be employed simply by replacing the standard scalar product with a function K(·, ·) which satisfies Mercer conditions [7]. The resulting classification rule takes the form h(x) = sign(w · x) = sign( t αt yt K(x, xt )). The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the tth round, an online algorithm receives an instance xt , computes the inner-products st = i 0. The various online algorithms differ in the way the values of the parameters βt , αt and ct are set. A notable example of an online algorithm is the Perceptron algorithm [6] for which we set βt = 0, αt = 1 and ct = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4] the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework although the constants βt , αt and ct are not as simple as the ones employed by the Perceptron algorithm. An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so called support patterns grows linearly with the number prediction errors. In Input: Tolerance β. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. Insert Ct ← Ct−1 ∪ {t}. 2. Set αt = 1. 3. Compute wt ← wt−1 + yt αt xt . 4. DistillCache(Ct , wt , (α1 , . . . , αt )). Output : H(x) = sign(wT · x). Figure 1: The aggressive Perceptron algorithm with a variable-size cache. this paper we shift the focus to the problem of devising online algorithms which are budget-conscious as they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classier with a small number of support patterns is intuitively ”simpler”, and hence are likely to exhibit good generalization properties rather than complex classifiers with large numbers of support patterns. 
(See for instance [7] for formal results connecting the number of support patterns to the generalization error.) In Sec. 3 we present a formal analysis and Input: C, w, (α1 , . . . , αt ). the algorithmic details of our approach. Loop: Let us now provide a general overview • Choose i ∈ C such that of how to restrict the number of support β ≤ yi (w − αi yi xi ). patterns in an online setting. Denote by Ct the indices of patterns which consti• if no such i exists then return. tute the classification vector wt . That is, • Remove the example i : i ∈ Ct if and only if αi > 0 on round 1. αi = 0. t when xt is received. The online classification algorithms discussed above keep 2. w ← w − αi yi xi . enlarging Ct – once an example is added 3. C ← C/{i} to Ct it will never be deleted. However, Return : C, w, (α1 , . . . , αt ). as the online algorithm receives more examples, the performance of the classifier Figure 2: DistillCache improves, and some of the past examples may have become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due the lack of support patterns in early rounds. As more examples are observed, the old examples maybe replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache Ct . We suggest a modification of the online algorithm structure as follows. Whenever yt i 0. Then the number of support patterns constituting the cache is at most S ≤ (R2 + 2β)/γ 2 . Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound wT 2 from above and below and compare the 2 t bounds. Denote by αi the weight of the ith example at the end of round t (after stage 4 of the algorithm). Similarly, we denote by αi to be the weight of the ith example on round ˜t t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously ˜ denote by wt and wt the corresponding instantaneous classifiers. First, we derive a lower bound on wT 2 by bounding the term wT · u from below in a recursive manner. T αt yt (xt · u) wT · u = t∈CT T αt = γ S . ≥ γ (1) t∈CT We now turn to upper bound wT 2 . Recall that each example may be added to the cache and removed from the cache a single time. Let us write wT 2 as a telescopic sum, wT 2 = ( wT 2 ˜ − wT 2 ˜ ) + ( wT 2 − wT −1 2 ˜ ) + . . . + ( w1 2 − w0 2 ) . (2) We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the tth example into the cache at all. In this case, ˜ ( wt 2 − wt−1 2 ) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds, thus, ˜ wt 2 = wt−1 + yt xt 2 = wt−1 2 + 2yt (wt−1 · xt ) + xt 2 . Since we inserted (xt , yt ), the condition yt (wt−1 · xt ) ≤ β must hold. Combining this ˜ with the assumption that the examples are enclosed in a ball of radius R we get, ( wt 2 − wt−1 2 ) ≤ 2β + R2 . The last scenario occurs when an example is inserted into the cache on some round t, and is then later on removed from the cache on round t + p for p > 0. As in the previous case we can bound the value of summands in Equ. (2), ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 ˜ − wt+p 2 ) Input: Tolerance β, Cache Limit n. Initialize: Set ∀t αt = 0 , w0 = 0 , C0 = ∅. Loop: For t = 1, 2, . . . , T • Get a new instance xt ∈ Rn . • Predict yt = sign (yt (xt · wt−1 )). 
ˆ • Get a new label yt . • if yt (xt · wt−1 ) ≤ β update: 1. If |Ct | = n remove one example: (a) Find i = arg maxj∈Ct {yj (wt−1 − αj yj xj )}. (b) Update wt−1 ← wt−1 − αi yi xi . (c) Remove Ct−1 ← Ct−1 /{i} 2. Insert Ct ← Ct−1 ∪ {t}. 3. Set αt = 1. 4. Compute wt ← wt−1 + yt αt xt . Output : H(x) = sign(wT · x). Figure 3: The aggressive Perceptron algorithm with as fixed-size cache. ˜ = 2yt (wt−1 · xt ) + xt 2 − 2yt (wt+p · xt ) + xt ˜ = 2 [yt (wt−1 · xt ) − yt ((wt+p − yt xt ) · xt )] ˜ ≤ 2 [β − yt ((wt+p − yt xt ) · xt )] . 2 ˜ Based on the form of the cache update we know that yt ((wt+p − yt xt ) · xt ) ≥ β, and thus, ˜ ˜ ( wt 2 − wt−1 2 ) + ( wt+p 2 − wt+p 2 ) ≤ 0 . Summarizing all three cases we see that only the examples which persist in the cache contribute a factor of R2 + 2β each to the bound of the telescopic sum of Equ. (2) and the rest of the examples do contribute anything to the bound. Hence, we can bound the norm of wT as follows, wT 2 ≤ S R2 + 2β . (3) We finish up the proof by applying the Cauchy-Swartz inequality and the assumption u = 1. Combining Equ. (1) and Equ. (3) we get, γ 2 S 2 ≤ (wT · u)2 ≤ wT 2 u 2 ≤ S(2β + R2 ) , which gives the desired bound. 4 Experiments In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also describe shortly another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of trying to answer the following questions. First, what is effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data), and second, would the algorithm described in Fig. 2 be able to find an optimal cache size that is able to achieve the best generalization performance. To examine each question separately we used a modified version of the algorithm described by Fig. 2 in which we restricted ourselves to have a fixed bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) Name mnist letter usps No. of Training Examples 60000 16000 7291 No. of Test Examples 10000 4000 2007 No. of Classes 10 26 10 No. of Attributes 784 16 256 Table 1: Description of the datasets used in experiments. simulates the original Perceptron algorithm with one notable difference. When the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. This modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3). Despite the simplicity of the original Perceptron algorithm [6] its good generalization performance on many datasets is remarkable. During the last few year a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper, we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen to use the Margin Infused Relaxed Algorithm (MIRA) as our baseline algorithm since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problem directly without any recourse to performing reductions. 
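Continuing the sketch above, the two deletion rules just described can be read as follows: DistillCache (Figure 2) repeatedly removes any cached example whose margin remains at least β once its own contribution to w is subtracted, while the fixed-size variant (Figure 3) evicts the single example attaining the largest such margin when the cache is full. The Python below is a hedged reconstruction under those readings; helper names are assumptions, and margins are evaluated through the kernel expansion rather than an explicit weight vector.

def margin_without_self(cache, j, kernel):
    # y_j * ((w - alpha_j y_j x_j) . x_j), computed from the cached expansion
    x_j, y_j, a_j = cache[j]
    s = sum(a * y * kernel(x, x_j) for x, y, a in cache)
    return y_j * (s - a_j * y_j * kernel(x_j, x_j))

def distill_cache(cache, beta, kernel):
    # Figure 2: delete examples that have become redundant given the current cache;
    # stop once no cached example satisfies the deletion condition.
    changed = True
    while changed:
        changed = False
        for j in range(len(cache)):
            if margin_without_self(cache, j, kernel) >= beta:
                del cache[j]
                changed = True
                break
    return cache

def evict_one(cache, kernel):
    # Figure 3, steps 1(a)-(c): discard the example that attains the largest margin
    # after removing its own contribution, then return the reduced cache.
    j = max(range(len(cache)), key=lambda i: margin_without_self(cache, i, kernel))
    del cache[j]
    return cache

Passing distill_cache as the distill argument of the earlier sketch reproduces the variable-size algorithm; evict_one is reused below when selecting a fixed cache size.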
The algorithms were evaluated on three natural datasets: mnist, usps and letter (available from http://www.research.att.com/˜yann, ftp.kyb.tuebingen.mpg.de and http://www.ics.uci.edu/˜mlearn/MLRepository.html, respectively). The characteristics of these datasets have been summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we have evaluated are online, it is not implausible for the specific ordering of examples to affect the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free parameter optimization was carried out; instead we simply used the values reported in [1]. More specifically, the margin parameter was set to β = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps data sets, and an RBF kernel for the letter data set. (The variance of the RBF kernel was identical to the one used in [1].) We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3, which uses a fixed limited budget. Here we enumerated the cache size limit in each experiment we performed. The different sizes that we tested are dataset dependent, but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion, and the goal of employing the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the running of the algorithm. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and during the training process it considers all the examples at once, it serves as our gold-standard algorithm against which we want to compare performance. [Figure 4 plot panels omitted: test error (top row), training online errors (middle row) and training margin errors (bottom row), each plotted against the number of support patterns for the Fixed, Adaptive, SVM and MIRA variants on mnist, usps and letter.] Figure 4: Results on three data sets - mnist (left), usps (center) and letter (right). 
Each point in a plot designates the test error (y-axis) vs. the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size. Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training. The results are summarized in Fig. 4. This figure is composed of three different plots organized in columns. Each of these plots corresponds to a different dataset - mnist (left), usps (center) and letter (right). In each of the three plots the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the performance dependency on the size of the cache. The top row of the three columns shows the generalization error. Thus the y-axis designates the test error of an algorithm on unseen data at the end of the training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes for which the algorithm exhibits good performance. In fact, for mnist and usps there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case: the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while also remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns. The plot of online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance. As the plots indicate, for the fixed and adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test error plots with the online error plots we see a close similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm. The third row gives the online training margin errors, which translate directly into the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come at a price. Namely, the algorithm makes significantly more insertions into the cache than the fixed-size version of the algorithm. However, as the upper two sets of plots indicate, the surplus in insertions is later taken care of by excess deletions, and the end result is very good overall performance. 
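The observation above, that a low online training error tracks good generalization, suggests a simple selection recipe for the fixed-budget variant: run one pass per candidate budget and keep the budget that incurs the fewest online mistakes. The sketch below is only an illustration of that heuristic under the earlier assumed helpers (evict_one and a kernel function); as already noted, such an enumeration is not itself an online procedure.

def online_error(stream, budget, beta, kernel):
    # One pass of the fixed-size-cache algorithm, counting online prediction
    # mistakes (sign errors) along the way.
    cache, mistakes = [], 0
    for x_t, y_t in stream:
        s_t = sum(a * y * kernel(x, x_t) for x, y, a in cache)
        if y_t * s_t <= 0:
            mistakes += 1                       # online (prediction) error
        if y_t * s_t <= beta:                   # margin error -> cache update
            if len(cache) == budget:
                cache = evict_one(cache, kernel)
            cache.append((x_t, y_t, 1.0))
    return mistakes

def pick_cache_size(stream_factory, candidate_budgets, beta, kernel):
    # stream_factory() must return a fresh pass over the training sequence.
    return min(candidate_budgets,
               key=lambda n: online_error(stream_factory(), n, beta, kernel))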
In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization error and similar numbers of support patterns. While the SVM is still somewhat better in both aspects for the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data. 5 Summary We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and its performance, in terms of both the size of the resulting classifier and its error rate, is comparable to the SVM. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of examples might convey information on their importance for accurate classification. Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements would make the proposed approach a viable alternative to the SVM and other batch algorithms. Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by the EU project KerMIT No. IST-2000-25341. References [1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003. [2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001. [3] W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987. [4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1–3):361–387, 2002. [5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962. [6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.) [7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
4 0.57116747 196 nips-2003-Wormholes Improve Contrastive Divergence
Author: Max Welling, Andriy Mnih, Geoffrey E. Hinton
Abstract: In models that define probabilities via energies, maximum likelihood learning typically involves using Markov Chain Monte Carlo to sample from the model’s distribution. If the Markov chain is started at the data distribution, learning often works well even if the chain is only run for a few time steps [3]. But if the data distribution contains modes separated by regions of very low density, brief MCMC will not ensure that different modes have the correct relative energies because it cannot move particles from one mode to another. We show how to improve brief MCMC by allowing long-range moves that are suggested by the data distribution. If the model is approximately correct, these long-range moves have a reasonable acceptance rate.
5 0.56550771 148 nips-2003-Online Passive-Aggressive Algorithms
Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer
Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1
6 0.54957211 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
7 0.46200418 102 nips-2003-Large Scale Online Learning
8 0.44613177 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
9 0.43788326 48 nips-2003-Convex Methods for Transduction
10 0.42643714 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
11 0.42432672 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
12 0.41222182 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
13 0.40246516 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
14 0.38889697 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
15 0.38783726 112 nips-2003-Learning to Find Pre-Images
16 0.38290775 51 nips-2003-Design of Experiments via Information Theory
17 0.37662393 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
18 0.36565542 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
19 0.36319613 169 nips-2003-Sample Propagation
20 0.34164488 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
topicId topicWeight
[(0, 0.044), (11, 0.017), (29, 0.022), (30, 0.013), (33, 0.345), (35, 0.07), (53, 0.096), (69, 0.033), (71, 0.058), (76, 0.077), (85, 0.066), (91, 0.066), (99, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.80898571 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
2 0.69892687 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
3 0.69213629 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of pattern instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
4 0.57789701 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classification error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not. More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classified positive total positive incorrectly classified negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context. 
In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 2 Definition and properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of ROC curve. The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1X the indicator function of a set X. Lemma 1 ([8]) Let c be a fixed classifier. Let x1 , . . . , xm be the output of c on the positive examples and y1 , . . . , yn its output on the negative examples. Then, the AUC, A, associated to c is given by: m n i=1 j=1 1xi >yj (1) A= mn that is the value of the Wilcoxon-Mann-Whitney statistic [8]. Proof. The proof is based on the observation that the AUC value is exactly the probability P (X > Y ) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8]. Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC. 2 An attempt in that direction was made by [15], but, unfortunately, the authors’ analysis and the result are both wrong. Threshold θ k − x Positive examples x Negative examples n − x Negative examples m − (k − x) Positive examples Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false negative examples. 3 The Expected Value of the AUC In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. 
Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate. Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 3 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts: A1 + A2 + A3 A= , with (2) mn A1 = the sum over all pairs (xi , yj ) with xi and yj in distinct regions; A2 = the sum over all pairs (xi , yj ) with xi and yj in the region above the threshold; A3 = the sum over all pairs (xi , yj ) with xi and yj in the region below the threshold. The first term, A1 , is easy to compute. Since there are (m − (k − x)) positive examples above the threshold and n − x negative examples below the threshold, A1 is given by: A1 = (m − (k − x))(n − x) (3) To compute A2 , we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α1 < . . . < αx denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A2 , x A2 = (N − αi ) − (x − i) (4) i=1 where the first term N − αi represents the number of examples ranked higher than the ith example and the second term x − i discounts the number of negative examples incorrectly ranked higher than the ith example. Similarly, let α1 < . . . < αk−x denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A3 is given by: x A3 = (N − αj ) − (x − j) (5) j=1 with N = n − x + (k − x) and x = k − x. Combining the expressions of A1 , A2 , and A3 leads to: A= A1 + A2 + A3 (k − 2x)2 + k ( =1+ − mn 2mn x i=1 αi + mn x j=1 αj ) (6) Lemma 2 For a fixed x, the average value of the AUC A is given by: < A >x = 1 − x n + k−x m 2 (7) x Proof. The proof is based on the computation of the average values of i=1 αi and x j=1 αj for a given x. We start by computing the average value < αi >x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α1 . . . αi−1 and αi+1 . . . αx , when the value of αi is fixed at say αi = l. We have i ≤ l ≤ N − (x − i) since there need to be at least i − 1 positions before αi and N − (x − i) above. There are l − 1 possible positions for α1 . . . αi−1 and N − l possible positions for αi+1 . . . αx . Since the total number of ways of choosing the x positions for α1 . . . 
αx out of N is N , the average value < αi >x is: x N −(x−i) l=i < αi >x = l l−1 i−1 N −l x−i (8) N x Thus, x < αi >x = x i=1 i=1 Using the classical identity: x < αi >x = N −(x−i) l−1 l i−1 l=i N x u p1 +p2 =p p1 N l=1 l N −1 x−1 N x i=1 N −l x−i v p2 = = N l=1 = u+v p N (N + 1) 2 x l−1 i=1 i−1 N x l N −l x−i (9) , we can write: N −1 x−1 N x = x(N + 1) 2 (10) Similarly, we have: x < αj >x = j=1 x Replacing < i=1 αi >x and < Eq. (10) and Eq. (11) leads to: x j=1 x (N + 1) 2 (11) αj >x in Eq. (6) by the expressions given by (k − 2x)2 + k − x(N + 1) − x (N + 1) =1− 2mn which ends the proof of the lemma. < A >x = 1 + x n + k−x m 2 (12) Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the accuracy rates for the positive and negative classes. Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC A over all classifications with k errors is given by: < A >= 1 − k (n − m)2 (m + n + 1) − m+n 4mn k−1 m+n x=0 x k m+n+1 x=0 x k − m+n (13) Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classifications for a given x. There are N possible ways of x choosing the positions of the x misclassified negative examples, and similarly N possible x ways of choosing the positions of the x = k − x misclassified positive examples. Thus, in view of Lemma 2, the average AUC is given by: < A >= k N x=0 x N x (1 − k N x=0 x N x k−x x n+ m 2 ) (14) r=0.05 r=0.01 r=0.1 r=0.25 0.0 0.1 0.2 r=0.5 0.3 Error rate 0.4 0.5 .00 .05 .10 .15 .20 .25 0.5 0.6 0.7 0.8 0.9 1.0 Mean value of the AUC Relative standard deviation r=0.01 r=0.05 r=0.1 0.0 0.1 r=0.25 0.2 0.3 Error rate r=0.5 0.4 0.5 Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate. Each curve corresponds to a fixed ratio of r = n/(n + m). The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m. This expression can be simplified into Eq. (13)3 using the following novel identities: k X N x x=0 k X N x x x=0 ! N x ! ! N x ! = = ! k X n+m+1 x x=0 (15) ! k X (k − x)(m − n) + k n + m + 1 2 x x=0 (16) that we obtained by using Zeilberger’s algorithm4 and numerous combinatorial ’tricks’. From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n = m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio of n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. In other words, on average, there seems nothing to be gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k = 0. 4 The Variance of the AUC 2 Let D = mn + (k−2x) +k , a = i=1 αi , a = j=1 αj , and α = a + a . Then, by 2 Eq. (6), mnA = D − α. 
Thus, the variance of the AUC, σ 2 (A), is given by: (mn)2 σ 2 (A) x x = < (D − α)2 − (< D > − < α >)2 > = < D2 > − < D >2 + < α2 > − < α >2 −2(< αD > − < α >< D >) (17) As before, to compute the average of a term X over all classifications, we can first determine its average < X >x for a fixed x, and then use the function F defined by: F (Y ) = k N N x=0 x x k N N x=0 x x Y (18) and < X >= F (< X >x ). A crucial step in computing the exact value of the variance of x the AUC is to determine the value of the terms of the type < a2 >x =< ( i=1 αi )2 >x . 3 An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors’ analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false. 4 We thank Neil Sloane for having pointed us to Zeilberger’s algorithm and Maple package. x Lemma 3 For a fixed x, the average of ( i=1 αi )2 is given by: x(N + 1) < a2 > x = (3N x + 2x + N ) 12 (19) Proof. By definition of a, < a2 >x = b + 2c with: x x α2 >x i b =< c =< αi αj >x (20) 1≤i
5 0.47739649 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
Author: Alexander T. Ihler, Erik B. Sudderth, William T. Freeman, Alan S. Willsky
Abstract: The problem of approximating the product of several Gaussian mixture distributions arises in a number of contexts, including the nonparametric belief propagation (NBP) inference algorithm and the training of product of experts models. This paper develops two multiscale algorithms for sampling from a product of Gaussian mixtures, and compares their performance to existing methods. The first is a multiscale variant of previously proposed Monte Carlo techniques, with comparable theoretical guarantees but improved empirical convergence rates. The second makes use of approximate kernel density evaluation methods to construct a fast approximate sampler, which is guaranteed to sample points to within a tunable parameter of their true probability. We compare both multiscale samplers on a set of computational examples motivated by NBP, demonstrating significant improvements over existing methods. 1
6 0.46413597 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
7 0.46291974 196 nips-2003-Wormholes Improve Contrastive Divergence
8 0.46016079 189 nips-2003-Tree-structured Approximations by Expectation Propagation
9 0.45626244 78 nips-2003-Gaussian Processes in Reinforcement Learning
10 0.45541966 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
11 0.45524758 122 nips-2003-Margin Maximizing Loss Functions
12 0.45120624 113 nips-2003-Learning with Local and Global Consistency
13 0.45029819 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
14 0.45001087 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
15 0.44944355 126 nips-2003-Measure Based Regularization
16 0.44887498 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
17 0.44823021 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
18 0.44710416 112 nips-2003-Learning to Find Pre-Images
19 0.44593915 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
20 0.44523504 158 nips-2003-Policy Search by Dynamic Programming