jmlr jmlr2006 jmlr2006-40 knowledge-graph by maker-knowledge-mining

40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization


Source: pdf

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λ̃σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 145 Tremont Street, Boston, MA 02111, USA. Editor: Gabor Lugosi. Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. [sent-7, score-0.329]

2 For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting solution as σ → ∞. [sent-8, score-0.144]

3 In particular, we show that if we set the regularization parameter λ = λ̃σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). [sent-9, score-0.626]

4 One popular family of techniques is Tikhonov regularization in a Reproducing Kernel Hilbert Space (RKHS) (Evgeniou et al. [sent-18, score-0.154]

5 , 2000): inf_{f∈H} { nλ ||f||²_κ + ∑_{i=1}^n v(f(x_i), y_i) }. [sent-19, score-0.22]

6 κ i=1 Here, v : R × R → R is a loss function indicating the price we pay when we see xi , predict f (xi ), and the true value is yi . [sent-20, score-0.164]

7 The regularization constant, λ > 0, controls the trade-off between fitting the training set accurately (minimizing the penalties) and forcing f to be smooth in H . [sent-22, score-0.154]

8 The solution to the Tikhonov regularization problem can be written in the form f(x) = ∑_{i=1}^n c_i κ(x_i, x). [sent-28, score-0.201]

9 i=1 In practice, solving a Tikhonov regularization problem is equivalent to finding the expansion coefficients ci . [sent-29, score-0.268]

10 One popular choice for κ is the Gaussian kernel κ_σ(x, x′) = exp(−||x − x′||²/(2σ²)), where σ is the bandwidth of the Gaussian. [sent-30, score-0.133]

11 Common choices for v include the square loss, v(ŷ, y) = (ŷ − y)², and the hinge loss, v(ŷ, y) = max{0, 1 − ŷy}, which lead to regularized least squares and support vector machines, respectively. [sent-31, score-0.131]

12 Our work was originally motivated by the empirical observation that on a range of tasks, regularized least squares achieved very good performance with very large σ. [sent-32, score-0.131]

13 Regularized least squares (RLS) is an especially simple Tikhonov regularization algorithm: “training” RLS simply involves solving a system of linear equations. [sent-37, score-0.196]
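
As a concrete illustration of that linear-system view (a minimal sketch of ours, not code from the paper), the snippet below trains RLS with a Gaussian kernel by solving (K + nλI)c = y, the closed form quoted later in this summary; the data, σ, and λ values are arbitrary.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Gaussian kernel matrix between data sets A (m x d) and B (k x d).
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def rls_train(X, y, sigma, lam):
    # "Training" RLS: solve the linear system (K + n*lam*I) c = y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def rls_predict(X_test, X_train, c, sigma):
    # Evaluate f(x) = sum_i c_i * kappa_sigma(x, x_i).
    return gaussian_kernel(X_test, X_train, sigma) @ c

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)
c = rls_train(X, y, sigma=0.5, lam=1e-3)
print(rls_predict(X[:5], X, c, sigma=0.5))
```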

14 In Lippert and Rifkin (2006), we studied the limit of this expression as σ → ∞, showing that if we set λ = λ̃σ^{−2p−1} for p a positive integer, the infinite-σ limit converges (pointwise) to the degree-p polynomial with minimal empirical risk on the training set. [sent-40, score-0.251]

15 The asymptotic predictions are equivalent to those we would get if we simply fit an (unregularized) degree p polynomial to our training data. [sent-41, score-0.135]
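
A rough numerical illustration of this equivalence (ours, not an experiment from the paper): with p = 1 and the λ = λ̃σ^{−2p−1} scaling quoted above (λ̃ = 1 here, an arbitrary choice), large-σ RLS predictions should approach an unregularized degree-1 least-squares fit. For finite σ the agreement is only approximate, and floating-point conditioning limits how large σ can usefully be taken.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 1))
y = 2.0 * X[:, 0] - 0.5 + 0.2 * rng.normal(size=30)
X_test = np.linspace(-1, 1, 5).reshape(-1, 1)

def gaussian_kernel(A, B, sigma):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def rls_predict(sigma, lam):
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + n * lam * np.eye(n), y)
    return gaussian_kernel(X_test, X, sigma) @ c

# Unregularized degree-1 (linear) least-squares fit on the same data.
V = np.hstack([np.ones_like(X), X])
w, *_ = np.linalg.lstsq(V, y, rcond=None)
poly_pred = np.hstack([np.ones_like(X_test), X_test]) @ w

p = 1
for sigma in [2.0, 10.0, 50.0]:
    lam = sigma ** (-2 * p - 1)   # lambda = lambda_tilde * sigma^(-2p-1), lambda_tilde = 1
    gap = np.max(np.abs(rls_predict(sigma, lam) - poly_pred))
    print(f"sigma={sigma:5.1f}  max |RLS - poly fit| = {gap:.4f}")  # gap shrinks as sigma grows
```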

16 In the current work, we unify and generalize these results, showing that the occurrence of these polynomial approximation limits is a general phenomenon, which holds across all convex loss functions and a wide variety of kernels taking the form κσ (x, x′ ) = κ(x/σ, x′ /σ). [sent-44, score-0.307]

17 Our main result is that for a convex loss function and a valid kernel, if we take σ → ∞ and λ = λ̃σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋. [sent-45, score-0.363]

18 In the case where p ∈ Z, there is residual regularization on the degree-p coefficients of the limiting polynomial. [sent-46, score-0.289]

19 The first is the notion of epi-convergence, a functional convergence under which limits of minimizers converge to minimizers of limits. [sent-48, score-0.378]

20 This notion allows us to characterize the limiting Tikhonov regularization problem in a mathematically precise way. [sent-49, score-0.248]

21 The idea is that instead of working with the expansion coefficients in the RKHS (ci ), we can write the regularization problem directly in terms of the predicted values (yi ). [sent-51, score-0.221]

22 Tikhonov regularization is given by inf_{f∈H} { nλ ||f||²_κ + ∑_{i=1}^n v(f(x_i), y_i) }.   (1) [sent-64, score-0.374]

23 Tikhonov regularization can be used for both classification and regression tasks, but we refer to the function f as the regularized solution in all cases. [sent-65, score-0.243]

24 We call the left-hand portion the regularization term, and the right-hand portion the loss term. [sent-66, score-0.195]

25 Aside from convexity, we will be unconcerned with the form of the loss function and often take the loss term in the optimization in (1) to be some convex function V : Rn → R which is minimized by the vector y of yi ’s. [sent-69, score-0.207]

26 Additionally, our main result requires not only genericity of the data, but also that n > d_p, where p is the degree of the asymptotic regularized solution. [sent-102, score-0.147]

27 We say that a kernel is valid if every finite upper-left submatrix of M is symmetric and positive definite; in this case, we also say that the infinite matrix M is symmetric positive definite. [sent-111, score-0.155]

28 Lemma 1 If κ(x, x′) = ∑_{c≥0} (x · x′)^c g_c(x) g_c(x′) for some analytic functions g_c(x) such that g_c(0) ≠ 0, then κ is a valid kernel. [sent-117, score-0.351]

29 We note that κ(x, x′) = exp(−||x − x′||²/2) can be written in the form of Lemma 1: κ(x, x′) = exp(−||x||²/2) exp(x · x′) exp(−||x′||²/2) = ∑_{c=0}^∞ ((x · x′)^c / c!) exp(−||x||²/2) exp(−||x′||²/2). [sent-122, score-0.132]

30 = ∑_{c=0}^∞ (x · x′)^c g_c(x) g_c(x′), where g_c(x) = (1/√c!) exp(−||x||²/2). [sent-123, score-0.234]
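
The expansion above is easy to verify numerically. The sketch below (ours) truncates the series at C terms and compares it with the Gaussian kernel evaluated directly; the truncation level C = 30 is an arbitrary choice that is ample for data drawn from a standard normal.

```python
import numpy as np
from math import factorial, sqrt

def gaussian(x, xp):
    return np.exp(-0.5 * np.dot(x - xp, x - xp))

def g(v, c):
    # g_c(v) = exp(-||v||^2 / 2) / sqrt(c!)
    return np.exp(-0.5 * np.dot(v, v)) / sqrt(factorial(c))

def truncated_expansion(x, xp, C=30):
    # sum_{c < C} (x . xp)^c * g_c(x) * g_c(xp)
    return sum((np.dot(x, xp) ** c) * g(x, c) * g(xp, c) for c in range(C))

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
print(gaussian(x, xp), truncated_expansion(x, xp))   # the two values agree closely
```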

31 We will consider kernel functions κ_σ which are parametrized by a bandwidth parameter σ (or s = 1/σ). [sent-125, score-0.133]

32 Using the representer theorem and basic facts about RKHS, the standard Tikhonov regularization problem (1) can be written in terms of the expansion coefficients c and the kernel matrix K: inf_{c∈R^n} { nλ c^T K c + ∑_{i=1}^n v([Kc]_i, y_i) }.   (2) [sent-130, score-0.38]

33 If the kernel matrix K is invertible (which is the case for a Gaussian kernel and a generic data set), then c = K^{−1} y, and we can rewrite the minimization as inf_y { nλ y^T K^{−1} y + V(y) }.   (3) [sent-132, score-0.332]

34 While problem 2 is explicit in the coefficients of the expansion of the regularized solution, problem 3 is explicit in the predicted values y_i. [sent-134, score-0.236]

35 The purpose behind our choice of formulation is to avoid the unnecessary complexities which result from replacing κ with κσ and taking limits as both ci and κσ (x, xi ) change separately with σ: note that in problem 3, only the regularization term is varying with σ. [sent-135, score-0.338]
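
A small numeric check of this reparametrization (our sketch, not the paper's code), using the square loss so that both problems have closed forms: minimizing problem 2 over c and minimizing problem 3 over y produce the same predicted values, since y = Kc.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 8, 0.1
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)      # a generic symmetric positive definite kernel matrix
t = rng.normal(size=n)           # observed target values

# Problem 2 (coefficient form), square loss: min_c n*lam*c'Kc + ||Kc - t||^2  =>  (K + n*lam*I) c = t
c = np.linalg.solve(K + n * lam * np.eye(n), t)
y_from_c = K @ c

# Problem 3 (value form), square loss: min_y n*lam*y'K^{-1}y + ||y - t||^2  =>  (n*lam*K^{-1} + I) y = t
y_direct = np.linalg.solve(n * lam * np.linalg.inv(K) + np.eye(n), t)

print(np.max(np.abs(y_from_c - y_direct)))   # ~0: the two formulations agree
```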

36 In this section, we will show how our formulation achieves this, by allowing us to state a single optimization problem which simultaneously solves the Tikhonov regularization problem on the training data and evaluates the resulting function on the test data. [sent-136, score-0.154]

37 Theorem 2 Let K = (K00, K01; K10, K11). For any V : R^n → R, if ẏ minimizes y^T K^{−1} y + V(y_1)   (4), then ẏ_1 minimizes y_1^T K_11^{−1} y_1 + V(y_1)   (5), since inf_y {y^T K^{−1} y + V(y_1)} = inf_{y_1} { inf_{y_0} {y^T K^{−1} y} + V(y_1) }. [sent-138, score-0.232]

38 Let K^{−1} = K̄ = (K̄00, K̄01; K̄10, K̄11). Thus inf_{y_0} y^T K^{−1} y = y_1^T (K̄_11 − K̄_10 K̄_00^{−1} K̄_01) y_1 = y_1^T K_11^{−1} y_1, by (19) of Lemma 15. [sent-143, score-0.232]
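
The identity behind Theorem 2 can be checked numerically. The sketch below (ours) draws a random positive definite K, minimizes y^T K^{-1} y over the y_0 block with y_1 held fixed, and compares the minimum with y_1^T K_11^{-1} y_1.

```python
import numpy as np

rng = np.random.default_rng(4)
n0, n1 = 4, 3
A = rng.normal(size=(n0 + n1, n0 + n1))
K = A @ A.T + (n0 + n1) * np.eye(n0 + n1)   # random symmetric positive definite K
Kbar = np.linalg.inv(K)                      # Kbar = K^{-1}, partitioned into blocks below
Kbar00, Kbar01 = Kbar[:n0, :n0], Kbar[:n0, n0:]
K11 = K[n0:, n0:]

y1 = rng.normal(size=n1)
# The minimizer over y0 of [y0; y1]^T Kbar [y0; y1] is y0 = -Kbar00^{-1} Kbar01 y1.
y0 = -np.linalg.solve(Kbar00, Kbar01 @ y1)
y = np.concatenate([y0, y1])

val_full = y @ Kbar @ y                       # inf over y0 of y^T K^{-1} y
val_reduced = y1 @ np.linalg.solve(K11, y1)   # y1^T K11^{-1} y1
print(val_full, val_reduced)                  # equal up to round-off
```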

39 When taking limits, we are going to work directly with the yi , and we are going to avoid dealing with the (divergent) limits of the ci . [sent-155, score-0.221]

40 Since the component of our objective which depends on the limiting parameter is a quadratic form, we will eventually specialize the results to quadratic forms. [sent-160, score-0.182]

41 Definition 4 (epigraphs) Given a function f : Rn → (−∞, ∞], its epigraph, epi f is the subset of R × Rn given by epi f = {(z, x) : z ≥ f (x)}. [sent-161, score-0.336]

42 We call f closed, convex, or proper if those statements are true of epi f (proper referring to epi f being neither ∅ nor R^{n+1}). [sent-162, score-0.336]

43 Additionally, since we will be studying parameterized functions, fs , for 0 < s as s → 0, we say that such a family of functions is eventually convex (or closed, or proper) when there exists some s0 > 0 such that fs is convex (or closed, or proper) for all 0 < s < s0 . [sent-165, score-1.014]

44 We review the definition of lim inf and lim sup for functions of a single variable. [sent-166, score-0.505]

45 Definition 5 For h : (0, ∞) → (−∞, ∞], lim inf_{s→0} h(s) = sup_{s>0} inf_{s′∈(0,s)} h(s′) and lim sup_{s→0} h(s) = inf_{s>0} sup_{s′∈(0,s)} h(s′). [sent-168, score-0.863]

46 A useful alternate characterization, which is immediate from the definition, is lim inf_{s→0} h(s) = h0 iff ∀ε > 0, ∃s0, ∀s ∈ (0, s0) : h(s) ≥ h0 − ε, and lim sup_{s→0} h(s) = h0 iff ∀ε > 0, ∃s0, ∀s ∈ (0, s0) : h(s) ≤ h0 + ε, where either inequality can be strict if h0 < ∞. [sent-170, score-0.326]

47 Definition 6 (epi-limits) We say lim_{s→0} f_s = f if for all x0 ∈ R^n, both the following properties hold. Property 1: every x : [0, ∞) → R^n continuous at x(0) = x0 satisfies lim inf_{s→0} f_s(x(s)) ≥ f(x0)   (7). Property 2: there exists x : [0, ∞) → R^n continuous at x(0) = x0 satisfying lim sup_{s→0} f_s(x(s)) ≤ f(x0)   (8). [sent-171, score-1.891]

48 Figure 1: (property 1) of Definition 6 says that paths of points within epi f_s cannot end up below epi f, while (property 2) says that at least one such path hits every point of epi f. [sent-172, score-0.966]

49 This notion of functional limit is called an epigraphical limit (or epi-limit). [sent-173, score-0.153]

50 Less formally, (property 1) is the condition that paths of the form (x(s), fs (x(s))) are, asymptotically, inside epi f , while (property 2) asserts the existence of a path which hits the boundary of epi f , as depicted in figure 1. [sent-174, score-0.798]

51 Considering (property 1) with the function x(s) = x0 , it is clear that the epigraphical limit minorizes the pointwise limit (assuming both exist), but the two need not coincide. [sent-175, score-0.204]

52 An example of this distinction is given by the family of functions f_s(x) = (2/s) x(x − s) + 1, illustrated by Figure 2. [sent-176, score-0.462]

53 The pointwise and epi-limits of quadratic forms agree when the limiting quadratic form is finite, but the example in the figure is not of that sort. [sent-181, score-0.233]

54 Theorem 7 Let f_s : R^n → (−∞, ∞] be eventually ccp (closed, convex, and proper), with lim_{s→0} f_s = f. [sent-185, score-0.924]

55 If f_s, f have unique minimizers ẋ(s), ẋ, then lim_{s→0} ẋ(s) = ẋ and lim_{s→0} f_s(ẋ(s)) = inf_x f(x). [sent-186, score-1.532]

56 Let s0 > 0 be such that ∀s ∈ (0, s0) : f_s(x̂(s)) < f(ẋ) + δ and x̂(s) ∈ B_δ. [sent-191, score-0.462]

57 Thus, ∀s ∈ (0, s0) : inf_{x∈B_δ} f_s(x) < f(ẋ) + δ. [sent-192, score-0.462]

58 Figure 2: The function above, f_s(x) = (2/s) x(x − s) + 1, has different pointwise and epi-limits, having values 1 and 0, respectively, at x = 0, and ∞ for all other x. [sent-196, score-0.513]

59 By property 1 of Definition 6, ∀x ∈ R^n, lim inf_{s→0} f_s(x) ≥ f(x); in particular, ∀x ∈ ∂B_δ, ∃s1 ∈ (0, s0), ∀s ∈ (0, s1) : f_s(x) ≥ f(x) − δ ≥ f(ẋ) + δ. [sent-197, score-1.087]

60 Since ∂B_δ is compact, we can choose s1 ∈ (0, s0) such that ∀x ∈ ∂B_δ, s ∈ (0, s1) : f_s(x) ≥ f(ẋ) + δ. [sent-198, score-0.462]

61 Thus ∀x ∈ ∂B_δ, s < s1 : f_s(x) ≥ f(ẋ) + δ > inf_{x∈B_δ} f_s(x), and therefore ẋ(s) ∈ B_δ by the convexity of f_s. [sent-199, score-1.386]

62 In particular, lim sup_{s→0} f_s(x̂(s)) ≤ f(ẋ) and f(ẋ) ≤ lim inf_{s→0} f_s(ẋ(s)). [sent-203, score-1.25]

63 Since ∀s : f_s(ẋ(s)) ≤ f_s(x̂(s)), we have lim sup_{s→0} f_s(ẋ(s)) ≤ f(ẋ), and hence f(ẋ) ≤ lim inf_{s→0} f_s(ẋ(s)) ≤ lim sup_{s→0} f_s(ẋ(s)) ≤ f(ẋ). [sent-204, score-2.799]
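
A toy illustration of Theorem 7 (our own example, not one from the paper): take f_s(x) = x²/s + (x − 1)², a diverging quadratic penalty plus a fixed loss, whose epi-limit is f(0) = 1 and f(x) = ∞ for x ≠ 0. The minimizers s/(1 + s) converge to 0, the minimizer of f, and the minimum values 1/(1 + s) converge to inf f = 1, as the theorem predicts.

```python
import numpy as np

def f_s(x, s):
    # Eventually ccp family: diverging quadratic penalty plus a fixed convex loss.
    return x ** 2 / s + (x - 1.0) ** 2

for s in [1.0, 0.1, 0.01, 0.001]:
    xs = np.linspace(-1.0, 2.0, 300001)
    vals = f_s(xs, s)
    i = np.argmin(vals)
    # Closed form: argmin = s / (1 + s) -> 0, min value = 1 / (1 + s) -> 1.
    print(f"s={s:7.3f}  argmin~{xs[i]: .5f}  min~{vals[i]:.5f}")
```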

64 We now apply this theorem to characterize limits of quadratic forms (which are becoming infinite in the limit). [sent-205, score-0.138]

65 If f_s(x, y) = (x, Z(s)^{−1} y)^T [A(s), B(s)^T; B(s), C(s)] (x, Z(s)^{−1} y), then lim_{s→0} f_s = f, where f(x, y) = x^T (A(0) − B(0)^T C(0)^{−1} B(0)) x if y = 0, and ∞ otherwise. [sent-211, score-0.924]

66 Proof Completing the square, f_s(x, y) = ||x||²_{Ã(s)} + ||Z(s)^{−1} y + C(s)^{−1} B(s) x||²_{C(s)}, where ||v||²_W = v^T W v and Ã(s) = A(s) − B(s)^T C(s)^{−1} B(s). [sent-212, score-0.462]

67 Thus, for all s < s1 : f_s(x(s), y(s)) > c||y(0)||²/(4z(s)), and hence lim inf_{s→0} f_s(x(s), y(s)) = ∞, which implies property 1 of Definition 6 (and property 2, since lim inf ≤ lim sup). [sent-220, score-1.553]

68 Otherwise (y(0) = 0), f_s(x(s), y(s)) ≥ ||x(s)||²_{Ã(s)}, and thus lim_{s→0} ||x(s)||²_{Ã(s)} ≤ lim inf_{s→0} f_s(x(s), y(s)), i.e. ||x(0)||²_{Ã(0)} ≤ lim inf_{s→0} f_s(x(s), y(s)) (property 1). [sent-221, score-1.231]

69 f_s(x(s), y(s)) = ||x(s)||²_{Ã(s)} when y(s) = −Z(s) C(s)^{−1} B(s) x(s) (which is continuous and vanishing at s = 0), and thus lim sup_{s→0} f_s(x(s), y(s)) = lim_{s→0} ||x(s)||²_{Ã(s)} = ||x(0)||²_{Ã(0)} (property 2). [sent-222, score-0.827]
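
The completed-square identity in this proof can be verified numerically along the path y(s) = −Z(s) C(s)^{-1} B(s) x(s). The sketch below (ours) takes constant A, B, C and scalar Z(s) = s for simplicity: along that path f_s stays equal to x^T (A − B^T C^{-1} B) x, while it blows up at any fixed y ≠ 0.

```python
import numpy as np

rng = np.random.default_rng(5)
nx, ny = 3, 2
B = rng.normal(size=(ny, nx))
C = 2.0 * np.eye(ny)                                  # positive definite C
A = B.T @ np.linalg.solve(C, B) + np.eye(nx)          # makes the full block matrix positive definite
M = np.block([[A, B.T], [B, C]])

def f_s(x, y, s):
    v = np.concatenate([x, y / s])                    # [x; Z(s)^{-1} y] with Z(s) = s
    return v @ M @ v

x = rng.normal(size=nx)
limit = x @ (A - B.T @ np.linalg.solve(C, B)) @ x     # x^T (A - B^T C^{-1} B) x
for s in [1.0, 0.1, 0.01]:
    y_path = -s * np.linalg.solve(C, B @ x)           # y(s) = -Z(s) C^{-1} B x, vanishing at s = 0
    print(s, f_s(x, y_path, s), limit, f_s(x, np.ones(ny), s))   # last column diverges
```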

70 If t  Z1 (s)qa A(s) B(s)t   B(s) D(s) qb fs (qa , qb , qc ) =  −1 q Z2 (s) c C(s) E(s)  then lims→0 fs = f , where f (qa , qb , qc ) =   C(s)t Z1 (s)qa  E(s)t   qb −1 q F(s) Z2 (s) c ∞ qtb (D(0) − E(0)t F(0)−1 E(0))qb 865 qc = 0 . [sent-226, score-1.687]

71 z=0 L IPPERT AND R IFKIN Proof We apply Lemma 9 to the quadratic form given by  t  t t t qa Z1 AZ1 Z1 Bt Z1Ct  qb   BZ1 D Et −1 (CZ1 E ) F Z2 qc   qa   qb  −1 Z2 qc (s dependence suppressed). [sent-227, score-0.668]

72 We will have occasion to apply Corollary 10 when some of qa , qb and qc are empty. [sent-228, score-0.312]

73 Kernel Expansions and Regularization Limits In this section, we present our key result, characterizing the asymptotic behavior of the regularization term of Tikhonov regularization. [sent-231, score-0.154]

74 We define a family of quadratic forms on the polynomials in x; these forms will turn out to be the limits of the quadratic Tikhonov regularizer. [sent-232, score-0.247]

75 For any p > 0, define R_p^κ : f → [0, ∞] as follows: if p ∉ Z, R_p^κ(f) = 0 when f(x) = ∑_{0≤i≤d_⌊p⌋} q_i x^{I_i}, and ∞ else. [sent-234, score-0.164]

76 If p ∈ Z, R_p^κ(f) = (q_{d_{p−1}+1}, …, q_{d_p})^T C (q_{d_{p−1}+1}, …, q_{d_p}) when f has this form, and ∞ else, where C = (M_bb − M_ba M_aa^{−1} M_ab)^{−1}, and M_aa and (M_aa, M_ab; M_ba, M_bb) are the d_{p−1} × d_{p−1} and d_p × d_p upper-left submatrices of K. [sent-240, score-0.199]

77 Because our data set is generic, v_α(X) is non-singular, and the interpolating polynomial through the points (x_i, y_i) over the monomials {x^{I_i} : i < n} is given by f(x) = v_α(x) v_α(X)^{−1} y. [sent-245, score-0.206]
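
A one-dimensional sketch (ours) of this construction: v_α(x) is the row of the first n monomials, v_α(X) is the resulting Vandermonde-type matrix, and f(x) = v_α(x) v_α(X)^{-1} y interpolates the training values exactly for generic (here, distinct) points.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
X = rng.uniform(-1, 1, size=n)     # generic 1-D data points (distinct with probability one)
y = rng.normal(size=n)

def v_alpha(x, n):
    # Row of the first n monomials 1, x, x^2, ..., x^(n-1).
    return np.array([x ** i for i in range(n)])

V = np.vstack([v_alpha(xi, n) for xi in X])   # v_alpha(X), non-singular for distinct points
q = np.linalg.solve(V, y)                     # coefficients q = v_alpha(X)^{-1} y

f = lambda x: v_alpha(x, n) @ q
residuals = np.array([f(xi) for xi in X]) - y
print(np.max(np.abs(residuals)))              # ~0: exact interpolation
```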

78 We now state and prove our key result, showing the convergence of the regularization term of Tikhonov regularization to R_p^κ. [sent-246, score-0.308]

79 Then lim_{s→0} f_s = f, where f(y) = R_p^κ(q̃) and q̃(x) = v_α(x)q̃ = ∑_{0≤i<n} q̃_i x^{I_i}; for 0 ≤ i < n, 0 ≤ j < ∞, the (i, j)th entry of χ(s) is s^{|I_{j+n}|−|I_i|} χ_{ij}, and |I_{j+n}| − |I_i| ≥ 0. [sent-250, score-0.625]

80 Therefore, lim_{s→0} M̃(s) = (I  χ(0)) M (I  χ(0))^T. Summarizing, f_s(y) = s^{2p} y^T κ(sX, sX)^{−1} y = (v_α(X)^{−1} y)^T (s^p Σ_α(1/s)) M̃(s)^{−1} (s^p Σ_α(1/s)) (v_α(X)^{−1} y) = q̃^T (s^p Σ_α(1/s)) M̃(s)^{−1} (s^p Σ_α(1/s)) q̃, where q̃ ≡ v_α(X)^{−1} y. [sent-255, score-0.554]

81 If (hi = ∅ or) q̃_hi = 0, the limit is q̃_mi^T (M_mi,mi − M_mi,lo M_lo,lo^{−1} M_lo,mi)^{−1} q̃_mi, hence f_s(y) → R_p^κ(q̃) for p ∈ Z. [sent-275, score-0.632]

82 In other words, we can get polynomial behavior of degree ⌊p⌋ for any p, but we must have at least d_⌊p⌋ = O(d^⌊p⌋) generic data points in order to do so. [sent-285, score-0.169]

83 The Asymptotic Regularized Solution. By Theorem 12, the regularization term (under certain conditions) becomes a penalty on degree > p behavior of the regularized solution. [sent-301, score-0.301]

84 Since the loss function is fixed as σ, λ → ∞, the objective function in (1) approaches a limiting constrained optimization problem. [sent-302, score-0.135]

85 Let ḟ_σ, ḟ_∞ ∈ H be the unique minimizers of nλ(σ) ||f||²_{κ_σ} + ∑_{i=1}^n v(f(x_i), y_i)   (10) and nλ̃ R_p^κ(f) + ∑_{i=1}^n v(f(x_i), y_i)   (11), respectively. [sent-306, score-0.302]

86 Then ∀x0 ∈ R^d such that X_0 = (x_0  X) is generic, lim_{σ→∞} ḟ_σ(x0) = ḟ_∞(x0). [sent-307, score-0.163]

87 Let g_∞(q) = nλ̃ R_p^κ(q) + ∑_{i=1}^n v(q(x_i), y_i), with minimizer q̇_∞. [sent-311, score-0.158]

88 For strictly convex loss functions, such as the square loss used in regularized least squares, problem 11 will have a unique minimizer as well. [sent-315, score-0.294]

89 In these cases, Theorem 12 still determines the value of the limiting solution, but Theorem 14 does not completely determine the limiting minimizer. [sent-317, score-0.188]

90 33 of Rockafellar and Wets (2004) provides a generalization of Theorem 14 which applies when the minimizers are non-unique (and even when the objective functions are non-convex, as long as certain local convexity conditions hold). [sent-319, score-0.142]

91 It can be shown that the minimizer of problem 10 will converge to one of the minimizers of problem 11, though not knowing which one, we cannot predict the limiting regularized solution. [sent-320, score-0.403]

92 Additionally, we note that our work has focused on “standard” Tikhonov regularization problems, in which the function f is “completely” regularized. [sent-322, score-0.154]

93 In this case, inf_{b∈R, f∈H} { nλ ||f||_{κ_σ} + ∑_{i=1}^n (1 − (f(x_i) + b) y_i)_+ } = inf_b inf_f { nλ ||f||_{κ_σ} + ∑_{i=1}^n (1 − (f(x_i) + b) y_i)_+ } → inf_b inf_f { nλ̃ R_p^κ(f) + ∑_{i=1}^n (1 − (f(x_i) + b) y_i)_+ }, with our results applying to the inner optimization problem (where b is fixed). [sent-325, score-0.7]

94 Finally, we note that the limiting problem is one where all polynomials of degree < p are free, and hence, the bias term is “absorbed” into what is already free in the limiting problem. [sent-328, score-0.311]

95 The proof is based on an expansion of the kernel function (Equation 2. [sent-332, score-0.146]

96 In this case, for any degree p, an asymptotic regime was identified in which the regularized solution approached the least squares degree-p polynomial. [sent-338, score-0.189]

97 The result hinges upon the simultaneous cancellation effects between the coefficients c(σ, λ) and the kernel function κ_σ in the kernel expansion of f(x), with f(x) and c(σ, λ) given by f(x) = ∑_i c_i(σ, λ) κ_σ(x, x_i) and c(σ, λ) = (κ_σ(X, X) + nλI)^{−1} y, when κ_σ(x, x′) = exp(−||x − x′||²/σ²). [sent-339, score-0.315]

98 Note that in our previous work, we did not work with the value-based formulation of learning, and we were forced to take the limit of an expression combining training and testing kernel products, exploiting the explicit nature of the regularized least squares equations. [sent-342, score-0.268]

99 This in turn allowed us to avoid discussing the limits of the ci , which we do not know how to characterize. [sent-374, score-0.141]

100 We are not suggesting that practitioners wishing to do polynomial approximation use Gaussian kernels with extreme σ, λ values; there is no difficulty in using standard polynomial kernels directly, and using extreme σ and λ values invites numerical difficulties. [sent-375, score-0.254]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fs', 0.462), ('ifkin', 0.206), ('ippert', 0.206), ('tikhonov', 0.204), ('epi', 0.168), ('ikhonov', 0.168), ('imits', 0.168), ('lim', 0.163), ('regularization', 0.154), ('lims', 0.15), ('lippert', 0.15), ('nfinite', 0.143), ('egularization', 0.143), ('minimizers', 0.142), ('inf', 0.14), ('bt', 0.132), ('qb', 0.127), ('gc', 0.117), ('infs', 0.112), ('rn', 0.1), ('qa', 0.1), ('limits', 0.094), ('limiting', 0.094), ('sups', 0.094), ('yt', 0.092), ('regularized', 0.089), ('qc', 0.085), ('yi', 0.08), ('kernel', 0.079), ('minimizer', 0.078), ('polynomial', 0.077), ('jjt', 0.075), ('qhi', 0.075), ('submatrices', 0.071), ('expansion', 0.067), ('rls', 0.065), ('polynomials', 0.065), ('keerthi', 0.065), ('xii', 0.065), ('qd', 0.064), ('wets', 0.064), ('degree', 0.058), ('limit', 0.058), ('rockafellar', 0.057), ('ba', 0.057), ('ccp', 0.056), ('maa', 0.056), ('ryan', 0.056), ('ux', 0.056), ('rkhs', 0.056), ('bandwidth', 0.054), ('rifkin', 0.054), ('cients', 0.053), ('sx', 0.052), ('lemma', 0.052), ('pointwise', 0.051), ('kernels', 0.05), ('monomials', 0.049), ('poly', 0.049), ('ci', 0.047), ('convex', 0.045), ('quadratic', 0.044), ('xi', 0.043), ('squares', 0.042), ('residual', 0.041), ('loss', 0.041), ('mi', 0.04), ('mercer', 0.04), ('coef', 0.039), ('lo', 0.039), ('sup', 0.039), ('gaussian', 0.039), ('symmetric', 0.038), ('hi', 0.038), ('epigraphical', 0.037), ('gci', 0.037), ('ggt', 0.037), ('grace', 0.037), ('infx', 0.037), ('llt', 0.037), ('mba', 0.037), ('mbb', 0.037), ('qmi', 0.037), ('tomaso', 0.037), ('phenomenon', 0.037), ('dc', 0.037), ('ii', 0.037), ('unregularized', 0.036), ('qi', 0.036), ('rd', 0.036), ('ct', 0.034), ('lin', 0.034), ('expansions', 0.034), ('ross', 0.034), ('generic', 0.034), ('exp', 0.033), ('valued', 0.033), ('uy', 0.032), ('honda', 0.032), ('kc', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λ̃σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

2 0.12094339 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

3 0.10766488 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

4 0.10403949 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

5 0.079906218 45 jmlr-2006-Learning Coordinate Covariances via Gradients

Author: Sayan Mukherjee, Ding-Xuan Zhou

Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds

6 0.07672976 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

7 0.076013036 70 jmlr-2006-Online Passive-Aggressive Algorithms

8 0.071194254 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

9 0.067099437 93 jmlr-2006-Universal Kernels

10 0.066193104 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

11 0.066167466 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

12 0.061470687 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

13 0.057312079 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

14 0.056795269 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

15 0.051957563 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

16 0.050479032 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

17 0.049634553 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

18 0.048994463 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

19 0.04841391 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

20 0.047345527 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.258), (1, -0.026), (2, 0.155), (3, -0.175), (4, 0.056), (5, 0.117), (6, 0.031), (7, 0.033), (8, 0.011), (9, -0.076), (10, 0.117), (11, -0.07), (12, -0.142), (13, -0.115), (14, -0.051), (15, -0.055), (16, 0.08), (17, 0.027), (18, -0.169), (19, -0.089), (20, -0.086), (21, -0.03), (22, -0.011), (23, -0.001), (24, 0.079), (25, -0.162), (26, 0.166), (27, -0.095), (28, -0.044), (29, 0.078), (30, 0.075), (31, 0.002), (32, 0.007), (33, 0.154), (34, 0.139), (35, 0.015), (36, -0.117), (37, 0.003), (38, -0.004), (39, -0.013), (40, 0.055), (41, -0.015), (42, 0.038), (43, -0.039), (44, -0.019), (45, 0.056), (46, -0.033), (47, -0.1), (48, -0.223), (49, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93559885 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λ̃σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

2 0.56582838 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

3 0.51960725 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

4 0.41851559 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

Author: Shai Shalev-Shwartz, Yoram Singer

Abstract: We discuss the problem of learning to rank labels from a real valued feedback associated with each label. We cast the feedback as a preferences graph where the nodes of the graph are the labels and edges express preferences over labels. We tackle the learning problem by defining a loss function for comparing a predicted graph with a feedback graph. This loss is materialized by decomposing the feedback graph into bipartite sub-graphs. We then adopt the maximum-margin framework which leads to a quadratic optimization problem with linear constraints. While the size of the problem grows quadratically with the number of the nodes in the feedback graph, we derive a problem of a significantly smaller size and prove that it attains the same minimum. We then describe an efficient algorithm, called SOPOPO, for solving the reduced problem by employing a soft projection onto the polyhedron defined by a reduced set of constraints. We also describe and analyze a wrapper procedure for batch learning when multiple graphs are provided for training. We conclude with a set of experiments which show significant improvements in run time over a state of the art interior-point algorithm.

5 0.41420799 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

6 0.39645371 93 jmlr-2006-Universal Kernels

7 0.39322159 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

8 0.32096964 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

9 0.29794282 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

10 0.29302597 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

11 0.28776991 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

12 0.28479409 45 jmlr-2006-Learning Coordinate Covariances via Gradients

13 0.26918858 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

14 0.25883174 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting     (Special Topic on Inductive Programming)

15 0.25260067 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

16 0.25213069 44 jmlr-2006-Large Scale Transductive SVMs

17 0.24811716 70 jmlr-2006-Online Passive-Aggressive Algorithms

18 0.23623724 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations

19 0.23572108 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

20 0.23380607 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.021), (17, 0.045), (34, 0.013), (36, 0.046), (45, 0.021), (50, 0.095), (63, 0.036), (76, 0.01), (78, 0.024), (81, 0.038), (84, 0.386), (90, 0.061), (91, 0.026), (96, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91706687 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

Author: Pannagadatta K. Shivaswamy, Chiranjib Bhattacharyya, Alexander J. Smola

Abstract: We propose a novel second order cone programming formulation for designing robust classifiers which can handle uncertainty in observations. Similar formulations are also derived for designing regression functions which are robust to uncertainties in the regression setting. The proposed formulations are independent of the underlying distribution, requiring only the existence of second order moments. These formulations are then specialized to the case of missing values in observations for both classification and regression problems. Experiments show that the proposed formulations outperform imputation.

same-paper 2 0.87151003 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

3 0.50351071 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

4 0.48491931 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

5 0.47583142 77 jmlr-2006-Quantile Regression Forests

Author: Nicolai Meinshausen

Abstract: Random forests were introduced as a machine learning tool in Breiman (2001) and have since proven to be very popular and powerful for high-dimensional regression and classification. For regression, random forests give an accurate approximation of the conditional mean of a response variable. It is shown here that random forests provide information about the full conditional distribution of the response variable, not only about the conditional mean. Conditional quantiles can be inferred with quantile regression forests, a generalisation of random forests. Quantile regression forests give a non-parametric and accurate way of estimating conditional quantiles for high-dimensional predictor variables. The algorithm is shown to be consistent. Numerical examples suggest that the algorithm is competitive in terms of predictive power. Keywords: quantile regression, random forests, adaptive neighborhood regression

6 0.47437292 44 jmlr-2006-Large Scale Transductive SVMs

7 0.46744847 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

8 0.45892656 70 jmlr-2006-Online Passive-Aggressive Algorithms

9 0.45768315 45 jmlr-2006-Learning Coordinate Covariances via Gradients

10 0.45429021 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

11 0.45290381 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

12 0.45283657 65 jmlr-2006-Nonparametric Quantile Estimation

13 0.44669342 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

14 0.44024655 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)

15 0.43725148 83 jmlr-2006-Sparse Boosting

16 0.43222976 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

17 0.42696372 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

18 0.42528623 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

19 0.4251883 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

20 0.42036942 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data