nips nips2006 nips2006-30 knowledge-graph by maker-knowledge-mining

30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers


Source: pdf

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. [sent-2, score-1.023]

2 We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. [sent-3, score-0.6]

3 Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. [sent-4, score-0.741]

4 1 Introduction The theoretical understanding of support vector machines (SVMs) and related kernel-based methods has been substantially improved in recent years. [sent-5, score-0.044]

5 For example using Talagrand’s concentration inequality and local Rademacher averages it has recently been shown that SVMs for classification can learn with rates up to n^{−1} under somewhat realistic assumptions on the data-generating distribution (see [9, 11] and the related work [2]). [sent-6, score-0.293]

6 On the other hand, the oracle inequality in [2] only holds for distributions having Tsybakov noise exponent ∞, and hence it describes a situation which is rarely met in practice. [sent-9, score-0.649]

7 The goal of this work is to overcome these shortcomings by establishing a general oracle inequality (see Theorem 3.1). [sent-10, score-0.564]

8 The key ingredient of this oracle inequality is the observation that for most commonly used loss functions it is possible to “clip” the decision function of the algorithm before beginning with the theoretical analysis. [sent-12, score-0.625]

9 In addition, a careful choice of the weighted empirical process to which Talagrand’s inequality is applied makes the “shrinking technique” superfluous. [sent-13, score-0.246]

10 Finally, by explicitly dealing with ε-approximate minimizers of the regularized risk our results also apply to actual SVM algorithms. [sent-14, score-0.14]

11 With the help of the general oracle inequality we then establish an oracle inequality for SVM type algorithms (see Theorem 2.1) [sent-15, score-1.104]

12 as well as a simple oracle inequality for model selection (see Theorem 4.2). [sent-16, score-0.576]

13 For the former, we show that it leads to improved learning rates for, e.g., SVMs using Gaussian RBF kernels. [sent-18, score-0.058]

14 Using the model selection theorem we then show how our new oracle inequality for SVMs can be used to analyze a simple parameter selection procedure based on a validation set that achieves the same learning rates without prior knowledge on the noise exponents. [sent-21, score-0.771]

15 The rest of this work is organized as follows: In Section 2 we present our oracle inequality for SVM type algorithms. [sent-22, score-0.542]

16 We then discuss its implications and analyze the simple parameter selection procedure when using Gaussian RBF kernels. [sent-23, score-0.034]

17 In Section 3 we then present and prove the general oracle inequality. [sent-24, score-0.327]

18 The proof of Theorem 2.1 as well as the oracle inequality for model selection can be found in Section 4. [sent-26, score-0.576]

19 2 Main Results Throughout this work we assume that X is a compact metric space, Y ⊂ [−1, 1] is compact, P is a Borel probability measure on X × Y , and F is a set of functions over X such that 0 ∈ F. [sent-27, score-0.053]

20 Often F is a reproducing kernel Hilbert space (RKHS) H of continuous functions over X with closed unit ball BH . [sent-28, score-0.082]

21 It is well-known that H can then be continuously embedded into the space of continuous functions C(X) equipped with the usual maximum norm ‖·‖∞. [sent-29, score-0.078]

22 In order to avoid constants we always assume that this embedding has norm 1, i.e. ‖f‖∞ ≤ ‖f‖_H for all f ∈ H. [sent-31, score-0.048]

23 The functions L will serve as loss functions and consequently let us recall that the associated L-risk of a measurable function f : X → R is defined by R_{L,P}(f) := E_{(x,y)∼P} L(y, f(x)). [sent-37, score-0.288]

24 The learning schemes we are mainly interested in are based on an optimization problem of the form f_{P,λ} := arg min_{f∈H} λ‖f‖²_H + R_{L,P}(f), (1) where λ > 0. [sent-43, score-0.043]

25 If we replace P by a training set T = ((x1, y1), . . . , (xn, yn)) ∈ (X × Y)^n, identified with its empirical measure, then f_{T,λ} denotes the empirical estimator of the above learning scheme. [sent-47, score-0.062]
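
For concreteness, here is a minimal sketch (not the authors' code) of how the empirical estimator f_{T,λ} of (1) can be computed for the hinge loss with an off-the-shelf solver. It assumes scikit-learn is available; the mapping C = 1/(2λn) between (1) and the usual soft-margin SVM objective, and gamma = σ² for the kernel k_σ below, are standard correspondences, while the function name is illustrative. Note that scikit-learn's SVC additionally fits an intercept that the formulation (1) does not have.

```python
# Sketch only: empirical estimator f_{T,lambda} of (1) for the hinge loss.
# lambda*||f||_H^2 + (1/n)*sum_i hinge(y_i, f(x_i)) matches the C-SVM objective
# (1/2)*||f||^2 + C*sum_i hinge_i under C = 1/(2*lambda*n), and gamma = sigma**2
# matches k_sigma(x, x') = exp(-sigma^2 * ||x - x'||^2).
import numpy as np
from sklearn.svm import SVC

def fit_f_T_lambda(X, y, lam, sigma):
    """Return a callable x -> f_{T,lambda}(x) (real-valued decision function)."""
    n = len(y)
    svm = SVC(C=1.0 / (2.0 * lam * n), kernel="rbf", gamma=sigma ** 2)
    svm.fit(X, y)                      # y should take values in {-1, +1}
    return svm.decision_function
```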

26 Furthermore, in order to deal with the complexity of the RKHSs used, let us recall that for a subset A ⊂ E of a Banach space E the covering numbers are defined by N(A, ε, E) := min{ n ≥ 1 : ∃ x1, . . . , xn ∈ E with A ⊂ ∪_{i=1}^{n} (xi + εB_E) }, ε > 0, [sent-54, score-0.103]

27 where B_E denotes the closed unit ball of E. [sent-57, score-0.058]

28 For our main results we are particularly interested in covering numbers in the Hilbert space L2(T_X) which consists of all equivalence classes of functions f : X × Y → R and which is equipped with the norm ‖f‖_{L2(T_X)} := ( (1/n) Σ_{i=1}^{n} f(xi)² )^{1/2}. [sent-65, score-0.127]
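
For intuition, a small sketch (not from the paper) of how the empirical ‖·‖_{L2(T_X)} norm and a crude covering-number bound can be computed for a finite family of functions represented by their value vectors on the sample. A greedy ε-net gives an upper bound on N(A, ε, L2(T_X)), since the selected centers form a valid, if not minimal, ε-cover; the function names are illustrative.

```python
# Sketch only: empirical L2(T_X) norm and a greedy upper bound on the covering
# number N(A, eps, L2(T_X)) for a finite family A of functions, each given by
# its value vector (f(x_1), ..., f(x_n)) on the sample points.
import numpy as np

def empirical_l2_norm(values):
    # ||f||_{L2(T_X)} = ( (1/n) * sum_i f(x_i)^2 )^(1/2)
    return np.sqrt(np.mean(np.asarray(values, dtype=float) ** 2))

def greedy_covering_bound(F_values, eps):
    """F_values: array of shape (m, n); returns an upper bound on N(A, eps, .)."""
    centers = []
    for f in np.asarray(F_values, dtype=float):
        if all(empirical_l2_norm(f - c) > eps for c in centers):
            centers.append(f)          # f is not yet covered, so open a new eps-ball
    # every f is within eps of some chosen center, so the centers are an eps-cover
    return len(centers)
```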

29 Learning schemes of the form (1) typically produce functions f_{P,λ} with ‖f_{P,λ}‖∞ → ∞ as λ → 0 (see e. [sent-70, score-0.055]

30 Unfortunately, this behaviour has a serious negative impact on the learning rates when directly employing standard tools such as Hoeffding’s, Bernstein’s or Talagrand’s inequality. [sent-73, score-0.058]

31 For, e.g., the hinge loss it is obvious that clipping the function f_{P,λ} at −1 and 1 does not worsen the corresponding risks. [sent-76, score-0.169]

32 Following this simple observation we will consider loss functions L that satisfy the clipping condition L(y, t) ≥ L(y, 1) if t ≥ 1 and L(y, t) ≥ L(y, −1) if t ≤ −1, (3) for all y ∈ Y. [sent-77, score-0.157]

33 Recall that this type of loss function was already considered in [4, 11], but the clipping idea actually goes back to [1]. [sent-78, score-0.125]

34 Moreover, it is elementary to check that most commonly used loss functions including the hinge loss and the least squares loss satisfy (3). [sent-79, score-0.261]

35 Given a function f : X → R we now define its clipped version f̂ : X → [−1, 1] by f̂(x) := 1 if f(x) > 1, f̂(x) := f(x) if f(x) ∈ [−1, 1], and f̂(x) := −1 if f(x) < −1. [sent-80, score-0.101]
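
A minimal sketch of the clipping operator, together with a quick numerical check (using the hinge loss purely as an illustration) that clipping never increases the loss, which is exactly what condition (3) guarantees and why R_{L,P}(f̂) ≤ R_{L,P}(f):

```python
# Sketch only: the clipping operator \hat f and a check of condition (3) for the
# hinge loss L(y, t) = max(0, 1 - y*t).
import numpy as np

def clip_decision_function(values):
    """\hat f(x): truncate f(x) to the interval [-1, 1]."""
    return np.clip(values, -1.0, 1.0)

def hinge(y, t):
    return np.maximum(0.0, 1.0 - y * t)

t = np.linspace(-5.0, 5.0, 201)
for y in (-1.0, 1.0):
    # clipping at -1 and 1 never increases the hinge loss, so risks only improve
    assert np.all(hinge(y, clip_decision_function(t)) <= hinge(y, t) + 1e-12)
```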

36 It is clear from (3) that we always have L(y, f̂(x)) ≤ L(y, f(x)) and consequently we obtain R_{L,P}(f̂) ≤ R_{L,P}(f) for all distributions P. [sent-81, score-0.047]

37 sup_{y∈Y, −1≤t1,t2≤1} |L(y, t1) − L(y, t2)| / |t1 − t2| (4) With the help of these definitions we can now state our main result which establishes an oracle inequality for clipped versions of f_{T,λ}: Theorem 2.1 [sent-83, score-0.814]

38 Let P be a distribution on X × Y and let L be a loss function which satisfies (3) and (4). [sent-84, score-0.079]

39 In addition, we assume that we have a variance bound of the form E_P(L ◦ f̂ − L ◦ f*_{L,P})² ≤ v · (E_P(L ◦ f̂ − L ◦ f*_{L,P}))^ϑ (6) for constants v ≥ 1, ϑ ∈ [0, 1] and all measurable f : X → R. [sent-87, score-0.313]

40 Moreover, suppose that H satisfies sup_{T∈(X×Y)^n} log N(B_H, ε, L2(T_X)) ≤ a ε^{−2p}, ε > 0, (7) for some constants p ∈ (0, 1) and a ≥ 1. [sent-88, score-0.219]

41 Then there exists a constant K_{p,v} depending only on p and v such that for all τ ≥ 1 we have with probability not less than 1 − 3e^{−τ} that R_{L,P}(f̂_{T,λ}) − R*_{L,P} ≤ K_{p,v} (a/(λ^p n))^{1/(2−ϑ+p(ϑ−1))} + K_{p,v} a/(λ^p n) + 5 (32vτ/n)^{1/(2−ϑ)} + 140τ/n + 14B(f0)τ/(3n) + 8a(f0) + 4ε. (8) [sent-90, score-0.096]

42 The above oracle inequality has some interesting consequences as the following examples illustrate. [sent-91, score-0.542]

43 Example 2.2 (Learning rates for a single kernel) Assume that in Theorem 2.1 [sent-93, score-0.058]

44 we have a Lipschitz continuous loss function such as the hinge loss. [sent-94, score-0.114]

45 In addition assume that the approximation error function satisfies a(λ) ≤ cλ^β, λ > 0, for some constants c > 0 and β ∈ (0, 1]. [sent-95, score-0.072]

46 The next example investigates SVMs that use a Gaussian RBF kernel whose width may vary with the sample size: Example 2.3 [sent-98, score-0.041]

47 Furthermore assume that we are interested in binary classification using the hinge loss and the Gaussian RKHSs Hσ that belong to the RBF kernels kσ(x1, x2) := exp(−σ²‖x1 − x2‖²) with width σ > 0. [sent-100, score-0.137]

48 If P has geometric noise exponent α ∈ (0, ∞) in the sense of [9] then it was shown in [9] that there exists a function f0 ∈ Hσ with ‖f0‖∞ ≤ 1 and aσ(f0) ≤ c (σ^d λ + σ^{−αd}), σ > 0, λ > 0, where c > 0 is a constant independent of λ and σ. [sent-101, score-0.108]

49 Now assume that P has Tsybakov noise exponent q ∈ [0, ∞] in the sense of [9]. [sent-105, score-0.081]

50 Note that these rates are superior to those obtained in [9, Theorem 2. [sent-108, score-0.058]

51 However, these optimal parameters require us to know certain characteristics of the distribution such as the approximation exponent β or the noise exponents α and q. [sent-111, score-0.137]

52 The following example shows that the oracle inequality of Theorem 2.1 can nevertheless be used to achieve these rates without such prior knowledge. [sent-112, score-0.542]

53 We write T0 for the first n samples and T1 for the last n samples. [sent-115, score-0.034]

54 Moreover, let Σ ⊂ [1, n^{1/d}) and Λ ⊂ (0, 1] be finite sets with cardinalities mΣ and mΛ, respectively. [sent-117, score-0.052]

55 Now using a simple model selection approach (see e. [sent-120, score-0.034]

56 With such parameter sets it is then easy to check that we obtain exactly the rates we have found in Example 2.3, [sent-127, score-0.058]

57 but without knowing the noise exponents α and q a-priori. [sent-128, score-0.107]
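
A sketch of the validation-based selection of Example 2.4 (with illustrative grids and helper names, not the paper's exact choices): train an SVM on the first half T0 for every pair in Σ × Λ, clip the decision functions, and keep the pair with the smallest empirical hinge risk on the second half T1.

```python
# Sketch only: parameter selection over finite grids Sigma x Lambda using a
# training half (X0, y0) and a validation half (X1, y1); assumes scikit-learn.
import numpy as np
from sklearn.svm import SVC

def hinge_risk(y, f_values):
    return np.mean(np.maximum(0.0, 1.0 - y * f_values))

def select_sigma_lambda(X0, y0, X1, y1, Sigma, Lambda):
    best_pair, best_risk = None, np.inf
    n0 = len(y0)
    for sigma in Sigma:
        for lam in Lambda:
            svm = SVC(C=1.0 / (2.0 * lam * n0), kernel="rbf", gamma=sigma ** 2)
            svm.fit(X0, y0)
            f1 = np.clip(svm.decision_function(X1), -1.0, 1.0)   # clipped decision function
            risk = hinge_risk(y1, f1)
            if risk < best_risk:
                best_pair, best_risk = (sigma, lam), risk
    return best_pair

# Illustrative grids whose cardinalities grow only polynomially in n:
n, d = 1000, 2
Sigma = np.linspace(1.0, n ** (1.0 / d), num=10)
Lambda = np.logspace(-4.0, 0.0, num=10)
```

Since the grids contain only polynomially many candidates, the log m terms in the model-selection bound of Theorem 4.2 are of lower order and do not affect the resulting learning rates.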

58 3 An oracle inequality for clipped penalized ERM. Theorem 2.1 [sent-129, score-0.662]

59 is a consequence of a far more general oracle inequality on clipped penalized empirical risk minimizers. [sent-130, score-0.726]

60 Since this result is of independent interest we now present it together with its proof in detail. [sent-131, score-0.04]

61 To this end recall that a subroot is a nondecreasing function ϕ : [0, ∞) → [0, ∞) such that ϕ(r)/√r is nonincreasing in r. [sent-132, score-0.131]

62 Now the general oracle inequality is: Theorem 3.1 [sent-137, score-0.542]

63 Let P ≠ ∅ be a set of (hyper-)parameters, F be a set of measurable functions f : X → R with 0 ∈ F, and Ω : P × F → [0, ∞] be a function. [sent-138, score-0.103]

64 Let P be a distribution on X × Y and L be a loss function which satisfies (3) and (4). [sent-139, score-0.051]

65 In addition, we assume that we have a variance bound of the form (6) for constants v ≥ 1, ϑ ∈ [0, 1] and all measurable f : X → R. [sent-142, score-0.142]

66 Furthermore, suppose that there exists a subroot ϕn with E_{T∼P^n} E_{σ∼ν} sup_{(p,f)∈P×F: Ω(p,f)+E_P(L◦f̂−L◦f*_{L,P})≤r} |R_σ(L ◦ f̂ − L ◦ f*_{L,P})| ≤ ϕn(r), r > 0. (9) [sent-143, score-0.283]

67 Then for all τ ≥ 1 and all r satisfying r ≥ max{ 120ϕn(r), (32vτ/n)^{1/(2−ϑ)}, 28τ/n } (10) we have with probability not less than 1 − 3e^{−τ} that Ω(p_{T,Ω}, f_{T,Ω}) + R_{L,P}(f̂_{T,Ω}) − R*_{L,P} ≤ 5r + 14B(f0)τ/(3n) + 8aΩ(p0, f0) + 4ε. [sent-145, score-0.048]

68 2] shows that with probability not less than 1 − e^{−τ} we have E_T h1 − E_P h1 < √(2τB·E_P h1 / n) + 2Bτ/(3n). [sent-154, score-0.076]

69 Now using √(ab) ≤ a/2 + b/2 we find √(2τB·E_P h1 · n^{−1}) ≤ E_P h1 + Bτ/(2n), and consequently we have E_T h1 < 2E_P h1 + 7Bτ/(6n) with probability not less than 1 − e^{−τ}. [sent-155, score-0.047]

70 In addition, our variance bound gives E_P(h2 − E_P h2)² ≤ E_P h2² ≤ v(E_P h2)^ϑ, and consequently, Bernstein’s inequality shows that with probability not less than 1 − e^{−τ} we have E_T h2 − E_P h2 < √(2τ v (E_P h2)^ϑ / n) + 4τ/n. [sent-159, score-0.286]

71 Now, for q^{−1} + (q′)^{−1} = 1 the elementary inequality ab ≤ a^q q^{−1} + b^{q′} (q′)^{−1} holds, and hence for q := 2/(2−ϑ), q′ := 2/ϑ, a := (2^{1−ϑ} ϑ^ϑ τ v)^{1/2} · n^{−1/2}, and b := (2E_P h2)^{ϑ/2} we obtain √(2τ v (E_P h2)^ϑ / n) ≤ (1 − ϑ/2) (2^{1−ϑ} ϑ^ϑ vτ/n)^{1/(2−ϑ)} + E_P h2. [sent-160, score-0.301]

72 Since elementary calculations show that (1 − ϑ/2) (2^{1−ϑ} ϑ^ϑ)^{1/(2−ϑ)} ≤ 1 we obtain √(2τ v (E_P h2)^ϑ / n) ≤ (2vτ/n)^{1/(2−ϑ)} + E_P h2. [sent-161, score-0.032]

73 Therefore we have with probability not less than 1 − e^{−τ} that E_T h2 − E_P h2 < E_P h2 + (2vτ/n)^{1/(2−ϑ)} + 4τ/n. [sent-162, score-0.048]

74 To this end we write h_f := L ◦ f̂ − L ◦ f*_{L,P}, f ∈ F. [sent-164, score-0.516]

75 Moreover, with probability not less than 1 − e^{−τ} we have R_{L,T}(f0) − R_{L,T}(f*_{L,P}) − R_{L,P}(f0) + R_{L,P}(f*_{L,P}) < E_P h1 + 7Bτ/(6n), and for r > 0 we define [sent-165, score-0.964]

76 G_r := { (E_P h_f − h_f) / (Ω(p, f) + E_P(h_f) + r) : (p, f) ∈ P × F }. Then for g_{p,f} := (E_P h_f − h_f)/(Ω(p, f) + E_P(h_f) + r) ∈ G_r we have E_P g_{p,f} = 0 and ‖g_{p,f}‖∞ = sup_{z∈Z} |E_P h_f − h_f(z)| / (Ω(p, f) + E_P(h_f) + r) ≤ ‖E_P h_f − h_f‖∞ / r ≤ 6/r. [sent-166, score-2.581]

77 In addition, the inequality a^ϑ b^{2−ϑ} ≤ (a + b)² and the variance bound assumption (6) imply that E_P g²_{p,f} ≤ E_P h_f² / (E_P(h_f) + r)² ≤ E_P h_f² / (r^{2−ϑ} (E_P h_f)^ϑ) ≤ v / r^{2−ϑ}. [sent-167, score-0.238]

78 Now define Φ(r) := E_{T∼P^n} sup_{(p,f)∈P×F} (E_P h_f − E_T h_f) / (Ω(p, f) + E_P(h_f) + r). [sent-168, score-1.617]

79 Standard symmetrization then yields E_{T∼P^n} sup_{(p,f)∈P×F: Ω(p,f)+E_P(h_f)≤r} |E_P h_f − E_T h_f| ≤ 2 E_{T∼P^n} E_{σ∼ν} sup_{(p,f)∈P×F: Ω(p,f)+E_P(h_f)≤r} |R_σ h_f|, and hence Lemma 3. [sent-169, score-1.814]

80 Therefore applying Talagrand’s inequality in the version of [3] to the class G_r we obtain P^n( T ∈ Z^n : sup_{g∈G_r} E_T g ≤ 30ϕn(r)/r + (2τv/(n r^{2−ϑ}))^{1/2} + 7τ/(nr) ) ≥ 1 − e^{−τ}. [sent-171, score-0.428]

81 Let us define ε_r := 30ϕn(r)/r + (2τv/(n r^{2−ϑ}))^{1/2} + 7τ/(nr). [sent-172, score-0.042]

82 This shows 1 − ε_r ≥ 1/4, and hence we obtain with probability not less than 1 − 3e^{−τ} that Ω(p_{T,Ω}, f_{T,Ω}) + R_{L,P}(f̂_{T,Ω}) − R*_{L,P} ≤ 120ϕn(r) + (32τ v r^ϑ/n)^{1/2} + 2(2 − ϑ)(2vτ/n)^{1/(2−ϑ)} + 44τ/n + 14Bτ/(3n) + 4aΩ(p0, f0) + 4(R_{L,P}(f0) − R*_{L,P}) + 4ε. [sent-176, score-0.116]

83 However we also have 120ϕn(r) ≤ r, (32τ v r^ϑ/n)^{1/2} ≤ r, 44τ/n ≤ 5r/3, and 2(2 − ϑ)(2vτ/n)^{1/(2−ϑ)} ≤ r, and hence we find the assertion. [sent-177, score-0.133]

84 For the proof of Theorem 3.1 we need the following: [sent-178, score-0.04]

85 Define Φ(r) := E_{T∼P^n} sup_{(p,f)∈P×F} (E_T W(f) − E_P W(f)) / (a(p, f) + r) and suppose that there exists a subroot Ψ such that E_{T∼P^n} sup_{(p,f)∈P×F: a(p,f)≤r} (E_T W(f) − E_P W(f)) ≤ Ψ(r), r > 0. [sent-183, score-0.058]

86 However, since Ψ is a subroot we obtain that Ψ(r x^{i+1}) ≤ x^{(i+1)/2} Ψ(r), and we set x := 4. [sent-186, score-0.085]

87 For the proof of Theorem 2.1 let us state the following proposition which follows directly from [8] (see also [9, Prop. [sent-189, score-0.066]

88 7]) together with simple considerations on covering numbers: Proposition 4.1 [sent-191, score-0.048]

89 Let F := H be a RKHS, P := {p0} be a singleton, and Ω(p0, f) := λ‖f‖²_H. If (7) is satisfied then there exists a constant c_p depending only on p such that (9) is satisfied for ϕn(r) := c_p max{ v^{ϑ(1−p)/2} r^{ϑ(1−p)/2} (r^p a/(λ^p n))^{1/2}, (r^p a/(λ^p n))^{1/(1+p)} }. [sent-192, score-0.122]

90 Proof of Theorem 2.1: From the covering bound assumption we observe that Proposition 4.1 applies. [sent-195, score-0.071]

91 In particular, (10) holds for r := max{ K_{p,v} (a/(λ^p n))^{1/(2−ϑ+p(ϑ−1))}, K_{p,v} a/(λ^p n), (32vτ/n)^{1/(2−ϑ)}, 28τ/n } (18). Finally, for the parameter selection approach in Example 2.4 [sent-199, score-0.034]

92 we need the following oracle inequality for model selection: Theorem 4.2 [sent-200, score-0.542]

93 Let P be a distribution on X × Y and let L be a loss function which satisfies (3), (4), and the variance bound (6). [sent-201, score-0.102]

94 Furthermore, let F := {f1, . . . , fm} be a finite set of functions mapping X into [−1, 1]. [sent-205, score-0.032]

95 Then there exists a universal constant K such that for all τ ≥ 1 we have with probability not less than 1 − 3e^{−τ} that R_{L,P}(f_T) − R*_{L,P} ≤ 5 (K log m / n)^{1/(2−ϑ)} + 5 (32vτ/n)^{1/(2−ϑ)} + (5K log m + 154τ)/n + 8 min_{f∈F}(R_{L,P}(f) − R*_{L,P}). [sent-207, score-0.098]

96 Proof: Since all functions fi already map into [−1, 1] we do not have to consider the clipping operator. [sent-208, score-0.106]

97 Then the cardinality of F_r is smaller than or equal to m and hence we have N(L ◦ F_r − L ◦ f*_{L,P}, ε, L2(T)) ≤ m for all ε > 0. [sent-210, score-0.05]

98 7]) we hence obtain that (9) is satisfied for ϕn(r) := (c/√n) max{ r^{ϑ/2} √(v log m), (log m)/√n }, where c is a universal constant. [sent-214, score-0.049]

99 A Bennett concentration inequality and its application to suprema of empirical processes. [sent-232, score-0.266]

100 Fast rates for support vector machines using Gaussian kernels. [sent-275, score-0.102]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ep', 0.535), ('hf', 0.482), ('oracle', 0.327), ('ft', 0.265), ('inequality', 0.215), ('sup', 0.171), ('rxi', 0.17), ('vr', 0.107), ('fp', 0.104), ('clipped', 0.101), ('subroot', 0.085), ('talagrand', 0.085), ('pt', 0.084), ('tx', 0.074), ('clipping', 0.074), ('theorem', 0.074), ('measurable', 0.071), ('gr', 0.068), ('et', 0.065), ('rates', 0.058), ('exponents', 0.056), ('exponent', 0.052), ('fr', 0.051), ('minimizers', 0.051), ('steinwart', 0.051), ('loss', 0.051), ('satis', 0.049), ('covering', 0.048), ('constants', 0.048), ('consequently', 0.047), ('bernstein', 0.045), ('hinge', 0.044), ('svms', 0.044), ('alamos', 0.043), ('bh', 0.043), ('rkhss', 0.043), ('tsybakov', 0.043), ('nr', 0.042), ('proof', 0.04), ('svm', 0.039), ('proposition', 0.038), ('regularized', 0.038), ('bep', 0.037), ('cp', 0.037), ('ying', 0.034), ('write', 0.034), ('selection', 0.034), ('rbf', 0.034), ('risk', 0.033), ('elementary', 0.032), ('functions', 0.032), ('ball', 0.031), ('empirical', 0.031), ('furthermore', 0.03), ('rkhs', 0.03), ('los', 0.03), ('shrinking', 0.03), ('obviously', 0.029), ('noise', 0.029), ('let', 0.028), ('rademacher', 0.028), ('ab', 0.028), ('lipschitz', 0.028), ('moreover', 0.028), ('exists', 0.027), ('recall', 0.027), ('equipped', 0.027), ('less', 0.027), ('xn', 0.027), ('hence', 0.026), ('support', 0.025), ('cardinality', 0.024), ('addition', 0.024), ('bound', 0.023), ('arbitrarily', 0.023), ('universal', 0.023), ('schemes', 0.023), ('width', 0.022), ('knowing', 0.022), ('establishing', 0.022), ('depending', 0.021), ('probability', 0.021), ('interested', 0.02), ('es', 0.02), ('concentration', 0.02), ('classi', 0.02), ('establish', 0.02), ('wu', 0.019), ('penalized', 0.019), ('machines', 0.019), ('continuous', 0.019), ('hilbert', 0.019), ('nondecreasing', 0.019), ('banach', 0.019), ('attaining', 0.019), ('hoeffding', 0.019), ('hush', 0.019), ('investigates', 0.019), ('lemma', 0.018), ('dealing', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1

2 0.14902875 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

Author: Mark Girolami, Mingjun Zhong

Abstract: By adopting Gaussian process priors a fully Bayesian solution to the problem of integrating possibly heterogeneous data sets within a classification setting is presented. Approximate inference schemes employing Variational & Expectation Propagation based methods are developed and rigorously assessed. We demonstrate our approach to integrating multiple data sets on a large scale protein fold prediction problem where we infer the optimal combinations of covariance functions and achieve state-of-the-art performance without resorting to any ad hoc parameter tuning and classifier combination. 1

3 0.12492105 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

4 0.12476881 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1

5 0.092801839 116 nips-2006-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1

6 0.088303357 21 nips-2006-AdaBoost is Consistent

7 0.084485948 61 nips-2006-Convex Repeated Games and Fenchel Duality

8 0.057191197 186 nips-2006-Support Vector Machines on a Budget

9 0.054492462 154 nips-2006-Optimal Change-Detection and Spiking Neurons

10 0.050512608 20 nips-2006-Active learning for misspecified generalized linear models

11 0.050232604 109 nips-2006-Learnability and the doubling dimension

12 0.049360801 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

13 0.049229749 181 nips-2006-Stability of $K$-Means Clustering

14 0.048710458 117 nips-2006-Learning on Graph with Laplacian Regularization

15 0.047732685 79 nips-2006-Fast Iterative Kernel PCA

16 0.046830501 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

17 0.043974753 169 nips-2006-Relational Learning with Gaussian Processes

18 0.043969378 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

19 0.04194035 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds

20 0.041500505 5 nips-2006-A Kernel Method for the Two-Sample-Problem


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.131), (1, 0.041), (2, -0.137), (3, 0.011), (4, -0.034), (5, 0.126), (6, 0.013), (7, -0.013), (8, -0.023), (9, 0.021), (10, 0.058), (11, 0.009), (12, -0.006), (13, -0.064), (14, 0.017), (15, 0.051), (16, 0.007), (17, -0.008), (18, -0.034), (19, -0.065), (20, -0.013), (21, 0.191), (22, -0.006), (23, 0.05), (24, -0.045), (25, -0.099), (26, -0.041), (27, -0.024), (28, -0.031), (29, 0.115), (30, 0.068), (31, -0.029), (32, 0.002), (33, 0.063), (34, 0.057), (35, -0.11), (36, -0.072), (37, -0.18), (38, 0.02), (39, -0.018), (40, -0.036), (41, -0.077), (42, -0.098), (43, 0.016), (44, -0.01), (45, 0.103), (46, 0.079), (47, 0.07), (48, 0.01), (49, -0.13)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94796968 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1

2 0.68260437 21 nips-2006-AdaBoost is Consistent

Author: Peter L. Bartlett, Mikhail Traskin

Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after nν iterations—for sample size n and ν < 1—the sequence of risks of the classifiers it produces approaches the Bayes risk if Bayes risk L∗ > 0. 1

3 0.52497256 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

4 0.50763452 116 nips-2006-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1

5 0.47243071 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds

Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein

Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1

6 0.44525599 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions

7 0.42358071 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors

8 0.38852614 109 nips-2006-Learnability and the doubling dimension

9 0.3780973 181 nips-2006-Stability of $K$-Means Clustering

10 0.3473323 5 nips-2006-A Kernel Method for the Two-Sample-Problem

11 0.34349144 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

12 0.33983147 61 nips-2006-Convex Repeated Games and Fenchel Duality

13 0.32221484 203 nips-2006-implicit Online Learning with Kernels

14 0.30089715 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

15 0.29955643 50 nips-2006-Chained Boosting

16 0.29396749 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

17 0.27962869 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

18 0.27470347 194 nips-2006-Towards a general independent subspace analysis

19 0.27027687 136 nips-2006-Multi-Instance Multi-Label Learning with Application to Scene Classification

20 0.26722944 108 nips-2006-Large Scale Hidden Semi-Markov SVMs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.107), (3, 0.015), (4, 0.342), (7, 0.084), (9, 0.048), (22, 0.069), (44, 0.087), (57, 0.043), (65, 0.049), (69, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7899701 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We establish a general oracle inequality for clipped approximate minimizers of regularized empirical risks and apply this inequality to support vector machine (SVM) type algorithms. We then show that for SVMs using Gaussian RBF kernels for classification this oracle inequality leads to learning rates that are faster than the ones established in [9]. Finally, we use our oracle inequality to show that a simple parameter selection approach based on a validation set can yield the same fast learning rates without knowing the noise exponents which were required to be known a-priori in [9]. 1

2 0.48431653 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1

3 0.48359469 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

4 0.47948995 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

5 0.47875956 117 nips-2006-Learning on Graph with Laplacian Regularization

Author: Rie K. Ando, Tong Zhang

Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.

6 0.47830078 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

7 0.47689128 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

8 0.47431672 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

9 0.474213 175 nips-2006-Simplifying Mixture Models through Function Approximation

10 0.4741132 5 nips-2006-A Kernel Method for the Two-Sample-Problem

11 0.47337094 21 nips-2006-AdaBoost is Consistent

12 0.4720419 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

13 0.47102788 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

14 0.47094741 169 nips-2006-Relational Learning with Gaussian Processes

15 0.47072485 109 nips-2006-Learnability and the doubling dimension

16 0.47004241 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

17 0.46998113 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

18 0.46968815 35 nips-2006-Approximate inference using planar graph decomposition

19 0.4695797 163 nips-2006-Prediction on a Graph with a Perceptron

20 0.469372 61 nips-2006-Convex Repeated Games and Fenchel Duality