nips nips2002 nips2002-6 knowledge-graph by maker-knowledge-mining

6 nips-2002-A Formulation for Minimax Probability Machine Regression


Source: pdf

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. [sent-6, score-1.211]

2 Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. [sent-7, score-0.228]

3 The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al. [sent-8, score-0.55]

4 ] and uses Mercer Kernels to obtain nonlinear regression models. [sent-9, score-0.332]

5 MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. [sent-10, score-0.36]

6 1 Introduction The problem of constructing a regression model can be posed as maximizing the minimum probability of future predictions being within some bound of the true regression function. [sent-11, score-0.87]

7 We refer to this regression framework as minimax probability machine regression (MPMR). [sent-12, score-0.703]

8 For MPMR to be useful in practice, it must make minimal assumptions about the distributions underlying the true regression function, since accurate estimation of these distributions is prohibitive on anything but the most trivial regression problems. [sent-13, score-0.634]

9 As with the minimax probability machine classification (MPMC) framework proposed in [1], we avoid the use of detailed distribution knowledge by obtaining a worst case bound on the probability that the regression model is within some ε > 0 of the true regression function. [sent-14, score-0.911]

10 In [1], this formulation is used to construct nonlinear classifiers (MPMC) that maximize the minimum probability of correct classification on future data. [sent-17, score-0.22]

11 ) for building nonlinear regression functions which maximize the minimum probability that the future predictions will be within ε of the true regression function. [sent-20, score-0.812]

12 classification boundary) between these two classes corresponds to a regression surface, which we term the minimax probability machine regression model. [sent-24, score-0.625]

13 The proposed MPMR formulation is unique because it directly computes a bound on the probability that the regression model is within ±ε of the true regression function (see Theorem 1 below). [sent-25, score-0.912]

14 The theoretical foundations of MPMR are formalized in Section 2. [sent-26, score-0.02]

15 Experimental results on synthetic and real data are given in Section 3, verifying the accuracy of the minimax probability regression bound and the efficacy of the regression models. [sent-27, score-0.789]

16 Proofs of the two theorems presented in this paper are given in the appendix. [sent-28, score-0.026]

17 2 Regression Model We assume that learning data is generated from some unknown regression function f : ℝ^d → ℝ that has the form: y = f (x) + ρ (2) where x ∈ ℝ^d are generated according to some bounded distribution Λ, y ∈ ℝ, E[ρ] = 0, Var[ρ] = σ², and σ ∈ ℝ is finite. [sent-33, score-0.336]

18 , xid ) ∈ ℝ^d is generated from the distribution Λ, and yi ∈ ℝ. [sent-43, score-0.177]

19 Therefore we can estimate the predictive power of a regression function by a bound on the minimum probability that we are within ε of the true regression function. [sent-45, score-0.81]
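
Written out explicitly, the lower-bound quantity referred to here (the one that (4) denotes and that reappears as (13) in Theorem 1 below) is

    Ω = inf Pr{ |ŷ − y| ≤ ε },

where ŷ is the regression model's prediction and y the true output.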

20 We refer to a regression function that directly estimates (4) as a minimax probability machine regression (MPMR) model. [sent-46, score-0.658]

21 The proposed MPMR formulation is based on the kernel formulation for minimax probability machine classification (MPMC) presented in [1]. [sent-47, score-0.292]

22 Therefore, the MPMR model has the form: ŷ = f̂(x) = Σ_{i=1}^{N} βi K(xi , x) + b (5) where K(xi , x) = ϕ(xi )ϕ(x) is a kernel satisfying Mercer's Conditions, xi , ∀i ∈ {1, . [sent-48, score-0.093]
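
A minimal numpy sketch of evaluating a model of this form, with an illustrative RBF kernel (names here are ours; this is not the authors' released Matlab/C implementation):

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        # A Mercer kernel: K(a, b) = exp(-gamma * ||a - b||^2)
        return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

    def mpmr_predict(x, X_train, beta, b, kernel=rbf_kernel):
        # Equation (5): y_hat = f_hat(x) = sum_i beta_i * K(x_i, x) + b
        return sum(beta[i] * kernel(X_train[i], x) for i in range(len(beta))) + b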

23 1 Kernel Based MPM Classification Before formalizing the MPMR algorithm for calculating βi and b from the training data Γ, we first describe the MPMC formulation upon which it is based. [sent-53, score-0.082]

24 In [1], the binary classification problem is posed as one of maximizing the probability of correctly classifying future data. [sent-54, score-0.118]

25 Specifically, two sets of points are considered, here symbolized by {u1 , . [sent-55, score-0.087]

26 , uNu }, ui ∈ ℝ^m , belonging to the first class, and {v1 , . [sent-61, score-0.098]

27 The points ui are assumed to be generated from a distribution that has mean u and a covariance matrix Σ u , and correspondingly, the points vi are assumed to be generated from a distribution that has mean v and a covariance matrix Σv . [sent-68, score-0.478]
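
For orientation, the linear MPMC problem from [1] that the kernel version below generalizes can be paraphrased as follows (our restatement; the symbols ū, v̄ for the class means and a* for the optimizer are our notation, and κ is the quantity the text refers to as (9)):

    minimize over a ≠ 0:   ‖Σu^{1/2} a‖ + ‖Σv^{1/2} a‖   subject to   aᵀ(ū − v̄) = 1,
    κ = 1 / ( ‖Σu^{1/2} a*‖ + ‖Σv^{1/2} a*‖ ),   worst-case correct-classification probability α = κ² / (1 + κ²).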

28 For the nonlinear kernel formulation, these points are mapped into a higher dimensional space ϕ : ℝ^m → ℝ^h as follows: u → ϕ(u) with corresponding mean and covariance matrix (ϕ(u), Σϕ(u) ), and v → ϕ(v) with corresponding mean and covariance matrix (ϕ(v), Σϕ(v) ). [sent-69, score-0.313]

29 The binary classifier derived in [1] has the form (c = −1 for the first class and c = +1 for the second): Nu +Nv γi K c (zi , z) + bc c = sign (6) i=1 where K c (zi , z) = ϕ(zi )ϕ(z), zi = ui for i = 1, . [sent-70, score-0.373]
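
A literal sketch of this decision rule (equation (6)), assuming the weights γ and the offset bc have already been obtained; names are ours:

    def mpmc_classify(z, Z, gamma, b_c, kernel_c):
        # Equation (6): c = sign( sum_i gamma_i * K^c(z_i, z) + b_c ),
        # with c = -1 for the first class (the u points) and c = +1 for the second (the v points).
        s = sum(gamma[i] * kernel_c(Z[i], z) for i in range(len(Z))) + b_c
        return -1 if s < 0 else +1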

30 , γNu +Nv ), bc obtained by solving the following optimization problem: min_γ ‖K̃u γ‖/√Nu + ‖K̃v γ‖/√Nv s.t. … (7) [sent-79, score-0.095]

31 a square matrix consisting of the elements Kij = K c (zi , zj )); and finally Kv contains the last Nv rows of the Gram matrix K. [sent-83, score-0.304]
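
A small numpy sketch of how these Gram-matrix blocks and their centered versions (the K̃u and K̃v appearing in (7) and again in the appendix) could be assembled; function and variable names are ours:

    import numpy as np

    def gram_blocks(Z_u, Z_v, kernel_c):
        # Stack the Nu class-u points and the Nv class-v points into one list z_1, ..., z_{Nu+Nv}.
        Z = list(Z_u) + list(Z_v)
        K = np.array([[kernel_c(zi, zj) for zj in Z] for zi in Z])   # Gram matrix K_ij = K^c(z_i, z_j)
        Nu = len(Z_u)
        Ku, Kv = K[:Nu, :], K[Nu:, :]                # first Nu rows / last Nv rows of K
        k_u, k_v = Ku.mean(axis=0), Kv.mean(axis=0)  # row averages k_u, k_v
        Ku_tilde = Ku - k_u                          # K~_u: each row of K_u minus the class-u average row
        Kv_tilde = Kv - k_v                          # K~_v: each row of K_v minus the class-v average row
        return Ku_tilde, Kv_tilde, k_u, k_v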

32 2 Kernel Based MPM Regression In order to use the above MPMC formulation for our proposed MPMR framework, we first take the original learning data Γ and create two classes of points ui ∈ ℝ^{d+1} and vi ∈ ℝ^{d+1} , for i = 1, . [sent-86, score-0.297]

33 , xid ) Given these two sets of points, we obtain γ by minimizing equation (7). [sent-95, score-0.05]
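
A sketch of this step, assuming (as in (11) and in the ±ε shifting used in the proofs below) that each class is the training data with the output shifted by ±ε and placed as an extra first coordinate:

    import numpy as np

    def make_two_classes(X, y, eps):
        # X: (N, d) inputs, y: (N,) outputs.
        # u_i = (y_i + eps, x_i1, ..., x_id),  v_i = (y_i - eps, x_i1, ..., x_id), both in R^(d+1).
        U = np.column_stack([y + eps, X])
        V = np.column_stack([y - eps, X])
        return U, V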

34 , xd ) generated from the distribution Λ, calculating ŷ, the regression model output (5), involves finding a ŷ that solves equation (12), where z = (ŷ, x1 , . [sent-100, score-0.463]

35 , xd ), and, recalling from above, zi = ui for i = 1, . [sent-103, score-0.413]

36 If K^c(zi , z) is nonlinear, solving (12) for ŷ is in general a nonlinear single variable optimization problem, which can be solved using a root finding algorithm (for example the Newton-Raphson Method outlined in [4]). [sent-110, score-0.04]
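
A sketch of that root-finding step (not the authors' implementation): treat the classifier output as a one-variable function of ŷ for the fixed input x and drive it to zero, here with Newton-Raphson and a numerical derivative:

    import numpy as np

    def solve_y_hat(x, Z, gamma, b_c, kernel_c, y0=0.0, tol=1e-8, max_iter=100):
        # g(y_hat) = sum_i gamma_i * K^c(z_i, (y_hat, x)) + b_c; the regression output is the root of g.
        def g(yh):
            z = np.concatenate(([yh], x))
            return sum(gamma[i] * kernel_c(Z[i], z) for i in range(len(Z))) + b_c
        yh, h = y0, 1e-6
        for _ in range(max_iter):
            gy = g(yh)
            if abs(gy) < tol:
                break
            dg = (g(yh + h) - g(yh - h)) / (2.0 * h)   # central-difference derivative
            if dg == 0:
                break
            yh -= gy / dg
        return yh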

37 However, below we present a specific form of nonlinear K^c(zi , z) that allows (12) to be solved analytically. [sent-111, score-0.04]

38 It is interesting to note that the above formulation of a regression model can be derived using any binary classification algorithm, and is not limited to the MPMC algorithm. [sent-112, score-0.391]

39 Specifically, if a binary classifier is built to separate any two sets of points (11), then finding a crossing point ŷ where the classifier separates these classes for some input x = (x1 , . [sent-113, score-0.07]

40 , xd ), is equivalent to finding the output of the regression model for input x = (x1 , . [sent-116, score-0.42]

41 It would be interesting to explore the efficacy of various classification algorithms for this type of regression model formulation. [sent-120, score-0.292]

42 , xd ) generated according to the distribution Λ, assume that there exists only one ŷ that solves equation (12). [sent-125, score-0.171]

43 Assume also perfect knowledge of the statistics u, Σu , v, Σv . [sent-126, score-0.047]

44 Then, the minimum probability that ŷ is within ε of y (as defined in (2)) is given by: Ω = inf Pr {|ŷ − y| ≤ ε} = κ² / (1 + κ²) (13) where κ is defined in (9). [sent-127, score-0.159]

45 Therefore, from the above theorem, the MPMC framework directly computes the lower bound on the probability that the regression model is within ε of the function that generated the learning data Γ (i. [sent-129, score-0.498]

46 However, one key requirement of the theorem is perfect knowledge of the statistics u, Σu , v, Σv . [sent-132, score-0.11]

47 In the actual implementation of MPMR, these statistics are estimated from Γ, and it is an open question (which we address in Section 3) as to how accurately Ω can be estimated from real data. [sent-133, score-0.023]

48 In order to avoid the use of nonlinear optimization techniques to solve (12) for ŷ, we restrict the form of the kernel K^c(zi , z) to the following: K^c(zi , z) = yi ŷ + K(xi , x) (14) where K(xi , x) = ϕ(xi )ϕ(x) is a kernel satisfying Mercer's Conditions; where z = (ŷ, x1 , . [sent-134, score-0.213]

49 , N ; and where zi = vi−N , ỹi = yi−N − ε for i = N + 1, . [sent-140, score-0.177]
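
A sketch of this restricted kernel, with the (shifted) output stored as the first coordinate of each augmented point as above. Because K^c is then linear in ŷ, the decision equation (12) becomes linear in ŷ, which is what lets Theorem 2 below give a closed-form solution:

    def restricted_kernel(z_i, z, base_kernel):
        # Equation (14): K^c(z_i, z) = y_i * y_hat + K(x_i, x),
        # where the first coordinate of each augmented point holds its output value.
        y_i, x_i = z_i[0], z_i[1:]
        y_hat, x = z[0], z[1:]
        return y_i * y_hat + base_kernel(x_i, x)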

50 Given this restriction on K^c(zi , z), we now state our final theorem, which uses the following lemma (Lemma 1; Proof: See Appendix). [sent-144, score-0.063]

51 Lemma 1: k̃u − k̃v = 2ε ỹ. Theorem 2: Assume that (14) is true. [sent-145, score-0.68]

52 (Table 1 caption fragment) … within ε = 0.2 of y; predicted Ω: predicted probability that the model is within ε = 0.2 of y. [sent-147, score-0.205]

53 (Table 1 header residue: mean (std) columns, mean squared error.) [sent-151, score-0.2]

54 Therefore, Theorem 2 establishes that the MPMR formulation proposed in this paper has a closed form analytical solution, and its computational complexity is equivalent to solving a linear system of 2N − 1 equations in 2N − 1 unknowns. [sent-171, score-0.102]

55 Toy Sinc Data: Our toy example uses the noisy sinc function yi = sin(πxi )/(πxi ) + νi i = 1, . [sent-176, score-0.211]

56 , N , where νi is drawn from a Gaussian distribution with mean 0 and variance σ 2 [5]. [sent-179, score-0.04]

57 We use an RBF kernel K(a, b) = exp(−|a − b|²) and N = 100 training examples. [sent-180, score-0.034]
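
A sketch of generating this toy data (the input range and random seed are our choices for illustration; np.sinc(x) computes sin(πx)/(πx)):

    import numpy as np

    rng = np.random.default_rng(0)
    N, sigma = 100, 0.1                          # N training examples; sigma is the noise std (Table 1 varies the noise level)
    x = rng.uniform(-3.0, 3.0, size=N)           # illustrative input range
    y = np.sinc(x) + rng.normal(0.0, sigma, N)   # y_i = sin(pi x_i)/(pi x_i) + nu_i, noise with mean 0 and variance sigma^2
    K = np.exp(-(x[:, None] - x[None, :]) ** 2)  # RBF Gram matrix, K(a, b) = exp(-|a - b|^2)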

58 Figure 1 (d) and (e) illustrate how different tube sizes 0.05 ≤ ε ≤ 2 affect the mean squared error (on 100 random test points), the predicted Ω and measured percentage of test data within ε (here called MPTDε) of the regression model. [sent-183, score-0.026] [sent-184, score-0.531]

60 The average mean squared error in (e) has a small deviation (0.0453) over all tested ε and always was within the range 0.… [sent-186, score-0.112] [sent-187, score-0.04]

62 This indicates that the accuracy of the regression model is essentially independent from the choice of ε. [sent-190, score-0.292]

63 Also note that the mean predicted Ω is a lower bound on the mean MPTDε. [sent-191, score-0.227]

64 The tightness of this lower bound varies for different amounts of noise (Table 1) and different choices of ε (Figure 1 d). [sent-192, score-0.085]

65 Boston Housing Data: We test MPMR on the widely used Boston housing regression data available from the UCI repository. [sent-193, score-0.37]

66 Following the experiments done in [5], we use the RBF kernel K(a, b) = exp(−‖a − b‖/(2σ²)), where 2σ² = 0.… [sent-194, score-0.034]

67 The Boston housing data contains 506 examples, which we randomly divided into N = 481 training examples and 25 testing examples for each test run. [sent-197, score-0.114]
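
A sketch of that per-run split (loading the 506-example Boston housing data from the UCI repository is assumed to have been done already; names are ours):

    import numpy as np

    def boston_split(X, y, n_train=481, rng=None):
        # Randomly divide the 506 examples into 481 training and 25 test points for one run.
        rng = np.random.default_rng() if rng is None else rng
        idx = rng.permutation(len(y))
        return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]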

68 Results are reported in Table 2 for 1) average mean squared errors and the standard deviation; 2) MPTDε: fraction of test points that are within ε of y and the standard deviation; 3) predicted Ω: predicted probability that the model is within ε of y and standard deviation. [sent-206, score-0.399]

69 We first note that the results compare favorably to those reported for other state of the art regression algorithms [5], even though … [sent-207, score-0.313]

70 (Figure 1 residue: legend and panel labels) learning examples; true regression function; MPMR function; MPMR function ± ε; Percentage of Test Data within ±ε; a) ε = 0.…; d) MPTDε and predicted Ω; e) mean squared error on test data. [sent-208, sent-209, sent-214, sent-246]

74 (Table 2 caption fragment) mean squared errors and the standard deviation; MPTDε: fraction of test points that are within ε of y and the standard deviation; predicted Ω: predicted probability that the model is within ε of y and standard deviation. [sent-262, score-0.399]

75 Second, as with the toy data, the errors are relatively independent of ε. [sent-341, score-0.039]

76 Finally, we note that the mean predicted Ω is lower than the measured average MPTDε, thus validating that the MPMR algorithm does indeed predict an effective lower bound on the probability that the regression model is within ε of the true regression function. [sent-342, score-0.93]

77 4 Discussion and Conclusion We formalize the regression problem as one of maximizing the minimum probability, Ω, that the regression model is within ±ε of the true regression function. [sent-343, score-1.023]

78 By estimating mean and covariance matrix statistics of the regression data (and making no other assumptions on the underlying true regression function distributions), the proposed minimax probability machine regression (MPMR) algorithm obtains a direct estimate of Ω. [sent-344, score-1.189]

79 Two theorems are presented proving that, given perfect knowledge of the mean and covariance statistics of the true regression function, the proposed MPMR algorithm directly computes the exact lower probability bound Ω. [sent-345, score-0.657]

80 We are unaware of any other nonlinear regression model formulation that has this property. [sent-346, score-0.414]

81 Future research will focus on a theoretical analysis of the conditions under which the accuracy of the regression model is independent of ε. [sent-348, score-0.292]

82 Also, we are analyzing the rate, as a function of sample size, at which estimates of the lower probability bound Ω converge to the true value. [sent-349, score-0.176]

83 Finally, the proposed minimax probability machine regression framework is a new formulation of the regression problem, and therefore its properties can only be fully understood through extensive experimentation. [sent-350, score-0.805]

84 We are currently applying MPMR to a wide variety of regression problems and have made Matlab / C source code available (http://www. [sent-351, score-0.292]

85 Optimal inequalities in probability theory: A convex optimization approach. [sent-380, score-0.041]

86 Shrinko ing the tube: A new support vector regression algorithm. [sent-400, score-0.292]

87 This point will have a corresponding y (defined in (2)), and from (10), the probability that z +ε = (y + ε, x1 , . [sent-413, score-0.041]

88 , xd ) will be classified correctly (as belonging to class u) by (6) is α. [sent-416, score-0.154]

89 Furthermore, the classification boundary occurs uniquely at the point where z = (ŷ, x1 , . [sent-417, score-0.048]

90 , xd ), where, from the assumptions, ŷ is the unique solution to (12). [sent-420, score-0.148]

91 Similarly, for the same point y, the probability that z−ε = (y − ε, x1 , . [sent-421, score-0.041]

92 , xd ) will be classified correctly (as belonging to class v) by (6) is also α, and the classification boundary occurs uniquely at the point where z = (ŷ, x1 , . [sent-424, score-0.202]

93 , xd ) are, with probability α, on the correct side of the regression surface, defined by z = (ˆ, x1 , . [sent-434, score-0.461]

94 Therefore, z+ε differs from z by at most +ε in the first dimension, and z−ε differs from z by at most −ε in the first dimension. [sent-438, score-0.034]

95 Thus, the minimum bound on the probability that |y − y | ≤ ε is α (defined in (10)), which has the same form as Ω. [sent-439, score-0.136]

96 Part 2: The values zi are defined as: z1 = u1 , . [sent-442, score-0.213]

97 Since Ku = Ku − 1N ku we have the following term for a single matrix entry: ˜ [Ku ]i,j = K c (ui , zj ) − 1 N N l=1 K c (ul , zj ) i = 1, . [sent-449, score-0.886]

98 , 2N ˜ Similarly the matrix entries for Kv look like: N 1 c ˜ [Kv ]i,j = K (vi , zj ) − N l=1 K c (vl , zj ) i = 1, . [sent-454, score-0.539]

99 (7) collapses to min K 2 Formulating this minimization with the use of the orthogonal matrix F and an initial vector ˜ γo this becomes (see [1]): min Ku (γo + Ft) 2 with respect to t ∈ 2N −1 . [sent-462, score-0.09]

100 Therefore in order to find the minimum we must solve 2N − 1 h(t) = K 2 d linear equations: 0 = dti h(t) i = 1, . [sent-464, score-0.038]
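
A sketch of the last step described in sentences 99 and 100, under the setup stated there (an orthogonal matrix F spanning the feasible directions, an initial feasible vector γo, and h(t) = ‖K̃u(γo + Ft)‖²); setting the gradient of h to zero yields a (2N − 1)-dimensional linear system:

    import numpy as np

    def minimize_h(Ku_tilde, F, gamma0):
        # h(t) = || Ku_tilde @ (gamma0 + F t) ||^2; grad h = 2 F^T G (gamma0 + F t), with G = Ku_tilde^T Ku_tilde.
        # Setting grad h = 0 gives the (2N-1) x (2N-1) linear system (F^T G F) t = -F^T G gamma0.
        G = Ku_tilde.T @ Ku_tilde
        t = np.linalg.solve(F.T @ G @ F, -F.T @ G @ gamma0)
        return gamma0 + F @ t   # the gamma that minimizes h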


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mpmr', 0.515), ('ku', 0.347), ('kv', 0.333), ('regression', 0.292), ('zj', 0.258), ('zi', 0.213), ('nv', 0.2), ('nu', 0.198), ('mpmc', 0.133), ('mptd', 0.133), ('xd', 0.128), ('yi', 0.105), ('vl', 0.092), ('std', 0.088), ('vi', 0.086), ('formulation', 0.082), ('minimax', 0.078), ('classi', 0.073), ('ui', 0.072), ('bc', 0.071), ('sinc', 0.067), ('theorem', 0.063), ('predicted', 0.062), ('xi', 0.059), ('housing', 0.058), ('mpm', 0.058), ('bound', 0.057), ('symbolized', 0.05), ('xid', 0.05), ('true', 0.05), ('ft', 0.046), ('probability', 0.041), ('mean', 0.04), ('squared', 0.04), ('nonlinear', 0.04), ('within', 0.04), ('inf', 0.04), ('toy', 0.039), ('minimum', 0.038), ('covariance', 0.038), ('points', 0.037), ('boston', 0.037), ('mercer', 0.035), ('proof', 0.035), ('cacy', 0.035), ('completes', 0.035), ('mse', 0.035), ('kernel', 0.034), ('cation', 0.033), ('grudic', 0.033), ('mimimax', 0.033), ('mpdt', 0.033), ('deviation', 0.032), ('verifying', 0.029), ('zn', 0.029), ('boundary', 0.028), ('surface', 0.028), ('lower', 0.028), ('boulder', 0.026), ('tube', 0.026), ('theorems', 0.026), ('belonging', 0.026), ('lemma', 0.026), ('matlab', 0.025), ('min', 0.024), ('perfect', 0.024), ('matrix', 0.023), ('statistics', 0.023), ('colorado', 0.022), ('lanckriet', 0.022), ('posed', 0.022), ('ul', 0.022), ('yl', 0.022), ('generated', 0.022), ('nding', 0.021), ('art', 0.021), ('denominator', 0.021), ('shifting', 0.021), ('un', 0.021), ('solves', 0.021), ('part', 0.021), ('test', 0.02), ('proposed', 0.02), ('unique', 0.02), ('formalized', 0.02), ('uniquely', 0.02), ('future', 0.019), ('orthogonal', 0.019), ('maximizing', 0.019), ('computes', 0.018), ('examples', 0.018), ('proofs', 0.018), ('gram', 0.018), ('percentage', 0.017), ('fraction', 0.017), ('binary', 0.017), ('er', 0.017), ('differs', 0.017), ('rbf', 0.016), ('separates', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

2 0.12033209 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

3 0.1123213 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

4 0.081032455 120 nips-2002-Kernel Design Using Boosting

Author: Koby Crammer, Joseph Keshet, Yoram Singer

Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel. 1

5 0.077732049 110 nips-2002-Incremental Gaussian Processes

Author: Joaquin Quiñonero-candela, Ole Winther

Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(103 − 104 ) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database is realistic.

6 0.075515576 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

7 0.068793423 119 nips-2002-Kernel Dependency Estimation

8 0.067786582 178 nips-2002-Robust Novelty Detection with Single-Class MPM

9 0.06660369 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

10 0.06641335 45 nips-2002-Boosted Dyadic Kernel Discriminants

11 0.062350333 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

12 0.062156323 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

13 0.059081919 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

14 0.05872383 96 nips-2002-Generalized² Linear² Models

15 0.057627339 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking

16 0.057288662 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

17 0.055732567 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

18 0.055564493 156 nips-2002-On the Complexity of Learning the Kernel Matrix

19 0.055475757 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

20 0.051751617 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.153), (1, -0.075), (2, 0.03), (3, -0.042), (4, 0.061), (5, -0.028), (6, -0.026), (7, 0.059), (8, 0.025), (9, -0.015), (10, 0.042), (11, -0.088), (12, 0.041), (13, -0.063), (14, 0.042), (15, -0.027), (16, 0.041), (17, 0.006), (18, 0.052), (19, 0.071), (20, 0.034), (21, 0.066), (22, -0.018), (23, -0.05), (24, 0.14), (25, -0.043), (26, -0.026), (27, 0.039), (28, 0.005), (29, -0.079), (30, -0.042), (31, 0.111), (32, 0.065), (33, 0.049), (34, -0.111), (35, -0.069), (36, 0.082), (37, -0.03), (38, 0.177), (39, -0.099), (40, 0.007), (41, -0.187), (42, 0.005), (43, -0.091), (44, -0.01), (45, -0.077), (46, -0.002), (47, 0.12), (48, 0.127), (49, 0.174)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91759562 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models. 1

2 0.61605406 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

3 0.56499904 178 nips-2002-Robust Novelty Detection with Single-Class MPM

Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet

Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the

4 0.51307285 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1

5 0.50699031 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking

Author: Jun Morimoto, Christopher G. Atkeson

Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.

6 0.46237785 110 nips-2002-Incremental Gaussian Processes

7 0.36345604 45 nips-2002-Boosted Dyadic Kernel Discriminants

8 0.36043206 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

9 0.35524526 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

10 0.32731456 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

11 0.32372224 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

12 0.31418836 85 nips-2002-Fast Kernels for String and Tree Matching

13 0.31127384 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression

14 0.30615816 120 nips-2002-Kernel Design Using Boosting

15 0.3061474 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

16 0.30435181 185 nips-2002-Speeding up the Parti-Game Algorithm

17 0.29651237 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

18 0.29212007 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

19 0.29160622 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

20 0.28561074 96 nips-2002-Generalized² Linear² Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.307), (23, 0.016), (42, 0.072), (54, 0.119), (55, 0.043), (57, 0.013), (68, 0.022), (72, 0.017), (74, 0.052), (92, 0.097), (98, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76370734 6 nips-2002-A Formulation for Minimax Probability Machine Regression

Author: Thomas Strohmann, Gregory Z. Grudic

Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer Kernels to obtain nonlinear regression models. MPMR is tested on both toy and real world data, verifying the accuracy of the Ω bound, and the efficacy of the regression models.

2 0.57928878 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA.

3 0.57436979 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

Author: Christopher M. Bishop, David Spiegelhalter, John Winn

Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling.

4 0.57169533 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages

Author: Dörthe Malzahn, Manfred Opper

Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.
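For orientation, the brute-force Monte-Carlo average that the analytical approximation is compared against can be sketched as below: resample the data with replacement, fit a plain Gaussian-process regressor (posterior mean only), and score it on the points left out of each resample. The kernel, noise level, number of resamples, and toy data are illustrative choices, not the paper's settings.

import numpy as np

def gp_mean(Xtr, ytr, Xte, length=1.0, noise=0.1):
    # Posterior mean of GP regression with an RBF kernel.
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length**2)
    K = k(Xtr, Xtr) + noise**2 * np.eye(len(Xtr))
    return k(Xte, Xtr) @ np.linalg.solve(K, ytr)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(60)

errs = []
for _ in range(200):                                 # bootstrap resamples
    idx = rng.integers(0, len(X), len(X))            # sample with replacement
    out = np.setdiff1d(np.arange(len(X)), idx)       # points left out this round
    if len(out) == 0:
        continue
    pred = gp_mean(X[idx], y[idx], X[out])
    errs.append(np.mean((pred - y[out]) ** 2))
print("Monte-Carlo bootstrap estimate of test MSE:", round(float(np.mean(errs)), 4))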

5 0.56835389 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

Author: Peter Sykacek, Stephen J. Roberts

Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as a variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of inferring competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
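The variational algorithm itself is not reproduced here; as a point of reference, the underlying idea of treating model parameters as the latent state of a first-order Markov process can be sketched with an ordinary linear-Gaussian Kalman filter over regression weights. The drift variance q and observation noise r are illustrative, and the classification and variational aspects of the paper are deliberately left out.

import numpy as np

def kalman_weight_filter(X, y, q=1e-3, r=0.1):
    d = X.shape[1]
    w = np.zeros(d)                  # posterior mean of the weights
    P = np.eye(d)                    # posterior covariance of the weights
    for x_t, y_t in zip(X, y):
        P = P + q * np.eye(d)        # predict: weights drift slowly over time
        s = x_t @ P @ x_t + r        # innovation variance
        k = P @ x_t / s              # Kalman gain
        w = w + k * (y_t - x_t @ w)  # correct with the new observation
        P = P - np.outer(k, x_t @ P)
    return w

# Usage: recover a linear relationship from a stream of observations.
rng = np.random.default_rng(3)
X = rng.standard_normal((300, 2))
true_w = np.array([1.0, -0.5])
y = X @ true_w + 0.1 * rng.standard_normal(300)
print(np.round(kalman_weight_filter(X, y), 2))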

6 0.56372035 45 nips-2002-Boosted Dyadic Kernel Discriminants

7 0.56157649 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

8 0.56109202 46 nips-2002-Boosting Density Estimation

9 0.56089628 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization

10 0.56066161 110 nips-2002-Incremental Gaussian Processes

11 0.5576247 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

12 0.55613923 10 nips-2002-A Model for Learning Variance Components of Natural Images

13 0.55606568 140 nips-2002-Margin Analysis of the LVQ Algorithm

14 0.55517077 161 nips-2002-PAC-Bayes & Margins

15 0.55284935 27 nips-2002-An Impossibility Theorem for Clustering

16 0.55122447 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

17 0.5498746 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

18 0.54945523 53 nips-2002-Clustering with the Fisher Score

19 0.54861528 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

20 0.54854846 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition