Source: pdf
Author: Thomas Strohmann, Gregory Z. Grudic
Abstract: We formulate the regression problem as one of maximizing the minimum probability, symbolized by Ω, that future predicted outputs of the regression model will be within some ±ε bound of the true regression function. Our formulation is unique in that we obtain a direct estimate of this lower probability bound Ω. The proposed framework, minimax probability machine regression (MPMR), is based on the recently described minimax probability machine classification algorithm [Lanckriet et al.] and uses Mercer kernels to obtain nonlinear regression models. MPMR is tested on both toy and real-world data, verifying the accuracy of the Ω bound and the efficacy of the regression models.
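The construction that underlies MPMR (and that the appendix proofs below rely on) is simple to illustrate: each training pair (x_i, y_i) is duplicated into two classification points, u_i = (y_i + ε, x_i) and v_i = (y_i − ε, x_i), and the regression estimate ŷ(x) is the value of the output coordinate at which the learned classification boundary is crossed. The following toy sketch shows only this construction; it substitutes an ordinary least-squares linear classifier for the minimax probability machine of [1], and the helper names (make_shifted_classes, fit_linear_stand_in, predict) are illustrative, not from the paper.

import numpy as np

# Toy sketch of the MPMR data construction (illustrative only): the MPM
# classifier of [1] is replaced here by a plain least-squares linear classifier.

def make_shifted_classes(X, y, eps):
    """Turn each (x_i, y_i) into u_i = (y_i + eps, x_i) and v_i = (y_i - eps, x_i)."""
    U = np.column_stack([y + eps, X])            # class u, label +1
    V = np.column_stack([y - eps, X])            # class v, label -1
    Z = np.vstack([U, V])
    labels = np.concatenate([np.ones(len(U)), -np.ones(len(V))])
    return Z, labels

def fit_linear_stand_in(Z, labels):
    """Least-squares linear classifier: returns weights w and offset b."""
    A = np.column_stack([Z, np.ones(len(Z))])
    coef, *_ = np.linalg.lstsq(A, labels, rcond=None)
    return coef[:-1], coef[-1]

def predict(X, w, b):
    """Regression estimate: solve w[0]*y_hat + w[1:].x + b = 0 for y_hat."""
    return -(X @ w[1:] + b) / w[0]

# usage on noisy linear data: the prediction at x = 0.5 should be roughly 1.0
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(50)
Z, labels = make_shifted_classes(X, y, eps=0.2)
w, b = fit_linear_stand_in(Z, labels)
print(predict(np.array([[0.5]]), w, b))

With the actual minimax probability machine classifier in place of the stand-in, the same boundary additionally comes with the lower bound Ω on the probability that |y − ŷ| ≤ ε (Theorem 1 in the appendix).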
[1] G. R. G. Lanckriet, L. E. Ghaoui, C. Bhattacharyya, and M. I. Jordan. Minimax probability machine. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[2] A. W. Marshall and I. Olkin. Multivariate Chebyshev inequalities. Annals of Mathematical Statistics, 31(4):1001–1014, 1960.
[3] I. Popescu and D. Bertsimas. Optimal inequalities in probability theory: A convex optimization approach. Technical Report TM62, INSEAD, Dept. Math. O.R., Cambridge, MA, 2001.
[4] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, New York, NY, 1988.
[5] Bernhard Schölkopf, Peter L. Bartlett, Alex J. Smola, and Robert Williamson. Shrinking the tube: A new support vector regression algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, Cambridge, MA, 1999. The MIT Press.

Appendix: Proofs of Theorems 1 and 2

Proof of Theorem 1: Consider any point $x = (x_1, \ldots, x_d)$ generated according to the distribution $\Lambda$. This point has a corresponding $y$ (defined in (2)), and from (10), the probability that $z_{+\varepsilon} = (y + \varepsilon, x_1, \ldots, x_d)$ will be classified correctly (as belonging to class $u$) by (6) is $\alpha$. Furthermore, the classification boundary occurs uniquely at the point $z = (\hat{y}, x_1, \ldots, x_d)$, where, from the assumptions, $\hat{y}$ is the unique solution to (12). Similarly, for the same point, the probability that $z_{-\varepsilon} = (y - \varepsilon, x_1, \ldots, x_d)$ will be classified correctly (as belonging to class $v$) by (6) is also $\alpha$, and the classification boundary again occurs uniquely at $z = (\hat{y}, x_1, \ldots, x_d)$. Therefore, both $z_{+\varepsilon}$ and $z_{-\varepsilon}$ are, with probability $\alpha$, on the correct side of the regression surface defined by $z = (\hat{y}, x_1, \ldots, x_d)$. Moreover, $z_{+\varepsilon}$ differs from $z$ by at most $+\varepsilon$ in the first dimension, and $z_{-\varepsilon}$ differs from $z$ by at most $-\varepsilon$ in the first dimension. Thus, the minimum bound on the probability that $|y - \hat{y}| \le \varepsilon$ is $\alpha$ (defined in (10)), which has the same form as $\Omega$. This completes the proof.

Proof of Lemma 1:
$$[\tilde{k}_u]_i - [\tilde{k}_v]_i = \frac{1}{N}\sum_{l=1}^{N} K_c(u_l, z_i) - \frac{1}{N}\sum_{l=1}^{N} K_c(v_l, z_i) = \frac{1}{N}\sum_{l=1}^{N}\Big[(y_l + \varepsilon)y_i + K(x_l, x_i) - \big((y_l - \varepsilon)y_i + K(x_l, x_i)\big)\Big] = \frac{1}{N}\, N\, 2\varepsilon y_i = 2\varepsilon y_i$$

Proof of Theorem 2:

Part 1: Plugging (14) into (12), we get:
$$0 = \sum_{i=1}^{2N} \gamma_i \big[y_i \hat{y} + K(x_i, x)\big] + b_c$$
$$0 = \sum_{i=1}^{N} \gamma_i \big[(y_i + \varepsilon)\hat{y} + K(x_i, x)\big] + \sum_{i=1}^{N} \gamma_{i+N} \big[(y_i - \varepsilon)\hat{y} + K(x_i, x)\big] + b_c$$
$$0 = \sum_{i=1}^{N} \Big\{(\gamma_i + \gamma_{i+N}) \big[y_i \hat{y} + K(x_i, x)\big] + (\gamma_i - \gamma_{i+N})\,\varepsilon\hat{y}\Big\} + b_c$$
When we solve analytically for $\hat{y}$, giving (5), the coefficients $\beta_i$ and the offset $b$ have a denominator that looks like:
$$-\sum_{i=1}^{N} \big[(\gamma_i + \gamma_{i+N})\, y_i + (\gamma_i - \gamma_{i+N})\,\varepsilon\big] = -\gamma^T \tilde{y}$$
Applying Lemma 1 and (7) we obtain $1 = \gamma^T(\tilde{k}_u - \tilde{k}_v) = \gamma^T\, 2\varepsilon\tilde{y} \iff -\gamma^T\tilde{y} = -\frac{1}{2\varepsilon}$ for the denominator of $\beta_i$ and $b$.

Part 2: The values $z_i$ are defined as $z_1 = u_1, \ldots, z_N = u_N$, $z_{N+1} = v_1 = u_1 - (2\varepsilon, 0, \cdots, 0)^T, \ldots, z_{2N} = v_N = u_N - (2\varepsilon, 0, \cdots, 0)^T$. Since $\tilde{K}_u = K_u - \mathbf{1}_N \tilde{k}_u^T$, a single matrix entry is
$$[\tilde{K}_u]_{i,j} = K_c(u_i, z_j) - \frac{1}{N}\sum_{l=1}^{N} K_c(u_l, z_j), \qquad i = 1, \ldots, N,\; j = 1, \ldots, 2N.$$
Similarly, the matrix entries of $\tilde{K}_v$ look like
$$[\tilde{K}_v]_{i,j} = K_c(v_i, z_j) - \frac{1}{N}\sum_{l=1}^{N} K_c(v_l, z_j), \qquad i = 1, \ldots, N,\; j = 1, \ldots, 2N.$$
We show that these entries are the same for all $i$ and $j$:
$$[\tilde{K}_u]_{i,j} = K_c\big(v_i + (2\varepsilon, 0, \cdots, 0)^T, z_j\big) - \frac{1}{N}\sum_{l=1}^{N} K_c\big(v_l + (2\varepsilon, 0, \cdots, 0)^T, z_j\big)$$
$$= K_c(v_i, z_j) + 2\varepsilon\,[z_j]_1 - \frac{1}{N}\Big(\sum_{l=1}^{N} K_c(v_l, z_j) + N\, 2\varepsilon\,[z_j]_1\Big)$$
$$= K_c(v_i, z_j) + 2\varepsilon\,[z_j]_1 - \frac{1}{N}\sum_{l=1}^{N} K_c(v_l, z_j) - 2\varepsilon\,[z_j]_1$$
$$= K_c(v_i, z_j) - \frac{1}{N}\sum_{l=1}^{N} K_c(v_l, z_j) = [\tilde{K}_v]_{i,j}$$
This completes the proof of Part 2.

Part 3: From Part 2 we know that $\tilde{K}_u = \tilde{K}_v$. Therefore, the minimization problem (7) collapses to $\min \|\tilde{K}_u \gamma\|_2^2$ with respect to $\gamma$ (the constant factor involving $N$ can be removed). Formulating this minimization with the use of the orthogonal matrix $F$ and an initial vector $\gamma_0$, this becomes (see [1]): $\min \|\tilde{K}_u(\gamma_0 + Ft)\|_2^2$ with respect to $t \in \mathbb{R}^{2N-1}$. We set $h(t) = \|\tilde{K}_u(\gamma_0 + Ft)\|_2^2$.
Therefore, in order to find the minimum we must solve $2N - 1$ linear equations: $0 = \frac{d}{dt_i} h(t)$, $i = 1, \ldots, 2N - 1$. This completes the proof of Part 3.
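For concreteness, the $2N - 1$ linear equations of Part 3 can be written out explicitly (a standard expansion of the quadratic $h$, stated here for convenience rather than quoted from the paper):
$$h(t) = (\gamma_0 + Ft)^T \tilde{K}_u^T \tilde{K}_u (\gamma_0 + Ft), \qquad \frac{\partial h}{\partial t} = 2F^T \tilde{K}_u^T \tilde{K}_u (\gamma_0 + Ft) = 0 \;\Longleftrightarrow\; \big(F^T \tilde{K}_u^T \tilde{K}_u F\big)\, t = -F^T \tilde{K}_u^T \tilde{K}_u \gamma_0 .$$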
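A minimal numerical sanity check of Lemma 1, Part 2, and the reduced problem of Part 3, assuming the augmented kernel $K_c\big((y, x), (y', x')\big) = y\,y' + K(x, x')$ with an RBF kernel $K$ (the variable names mirror the appendix, but the snippet is an illustration, not the paper's implementation):

import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 8, 3, 0.3
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def K(a, b):                        # RBF kernel on the input part (any Mercer kernel works)
    return np.exp(-np.sum((a - b) ** 2))

def Kc(p, q):                       # augmented kernel on points (y, x)
    return p[0] * q[0] + K(p[1:], q[1:])

U = np.column_stack([y + eps, X])   # u_i = (y_i + eps, x_i)
V = np.column_stack([y - eps, X])   # v_i = (y_i - eps, x_i)
Z = np.vstack([U, V])               # z_1, ..., z_2N

# centered kernel vectors (length 2N) and matrices (N x 2N)
ku = np.array([np.mean([Kc(u, z) for u in U]) for z in Z])
kv = np.array([np.mean([Kc(v, z) for v in V]) for z in Z])
Ku_t = np.array([[Kc(u, z) for z in Z] for u in U]) - ku
Kv_t = np.array([[Kc(v, z) for z in Z] for v in V]) - kv

print(np.allclose(ku - kv, 2 * eps * Z[:, 0]))    # Lemma 1
print(np.allclose(Ku_t, Kv_t))                    # Theorem 2, Part 2

# Theorem 2, Part 3: minimize h(t) = ||Ku_t (gamma0 + F t)||^2, where gamma0
# satisfies the constraint gamma0 . (ku - kv) = 1 and the columns of F are
# orthogonal to (ku - kv).
delta = ku - kv
gamma0 = delta / (delta @ delta)
F = np.linalg.svd(delta[None, :])[2][1:].T        # 2N x (2N - 1) orthonormal basis
A = Ku_t @ F
# lstsq returns a (minimum-norm) solution of the 2N - 1 normal equations
#   (F^T Ku_t^T Ku_t F) t = -F^T Ku_t^T Ku_t gamma0
t = np.linalg.lstsq(A, -(Ku_t @ gamma0), rcond=None)[0]
gamma = gamma0 + F @ t
print(np.isclose(gamma @ delta, 1.0))             # constraint still holds

The first two checks are exactly the identities proved in Lemma 1 and Part 2; the last line verifies that the constraint $\gamma^T(\tilde{k}_u - \tilde{k}_v) = 1$ is preserved by the parametrization $\gamma = \gamma_0 + Ft$.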