nips nips2010 nips2010-278 knowledge-graph by maker-knowledge-mining

278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification


Source: pdf

Author: Tobias Glasmachers

Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Steinwart was the first to prove universal consistency of support vector machine classification. [sent-2, score-0.351]

2 His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. [sent-3, score-0.145]

3 Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. [sent-5, score-0.351]

4 Our proof extends Steinwart’s techniques to the multi-class case. [sent-6, score-0.055]

5 1 Introduction Support vector machines (SVMs) as proposed in [1, 8] are powerful classifiers, especially in the binary case of two possible classes. [sent-7, score-0.039]

6 This is trivially the case for general techniques such as one-versus-one architectures and the one-versus-all approach, which combine a set of binary machines into a multi-class decision maker. [sent-9, score-0.197]

7 Recently, consistency of multi-class support vector machines has been investigated based on properties of the loss function Ψ measuring empirical risk in machine training [7]. [sent-11, score-0.303]

8 This work is conceptually related to Fisher consistency, in contrast to universal statistical consistency, see [3, 5]. [sent-13, score-0.045]

9 0-1-risk (Bayes risk): lim_{n→∞} R(f̂n) = R∗. The classifiers f̂n are assumed to result from structural risk minimization [8], that is, the space Fn for which we obtain f̂n = arg min{RΨ(f) | f ∈ Fn} grows suitably with the size of the training set such that SB holds. [sent-17, score-0.288]

10 The confusion around the consistency of multi-class machines arises from mixing the equivalence and the implication in statement (1). [sent-18, score-0.259]

11 Examples 1 and 2 in [7] show that the loss functions Ψ used in the machines by Crammer and Singer [2] and by Weston and Watkins [9] are not classification calibrated, thus SA = false. [sent-19, score-0.069]

12 Then it is deduced that the corresponding machines are not consistent (SC = false), although it can be deduced only that the implication SB ⇒ SC does not hold. [sent-20, score-0.249]

13 We argue that the consistency of a machine is not necessarily determined by properties of its loss function. [sent-22, score-0.119]

14 Thus, we generalize Steinwart’s universal consistency theorem for binary SVMs (Theorem 2 in [6]) to the multi-class support vector machine [2] proposed by Crammer and Singer: Theorem 2. [sent-24, score-0.418]

15 Let X ⊂ Rd be compact and k : X × X → R be a universal kernel with N((X, dk), ε) ∈ O(ε^{−α}) for some α > 0. [sent-25, score-0.486]

16 The theorem not only establishes the universal consistency of the multi-class SVM by Crammer and Singer, it also gives precise conditions for how exactly the complexity control parameter needs to be coupled to the training set size in order to obtain universal consistency. [sent-29, score-0.541]

17 Moreover, the rigorous proof of this statement implies that the common belief on the inconsistency of the popular multi-class SVM by Crammer and Singer is wrong. [sent-30, score-0.12]

18 In contrast to the conceptually simpler binary case we have q > 2. [sent-44, score-0.084]

19 We call a function f on X induced by k if there exists w ∈ H such that f(x) = ⟨w, φ(x)⟩. [sent-50, score-0.075]

20 Let dk(x, x′) := ‖φ(x) − φ(x′)‖H = √(k(x, x) − 2k(x, x′) + k(x′, x′)) denote the metric induced on X by the kernel k. [sent-51, score-0.272]
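As an illustrative aside (not from the paper), the induced metric dk can be computed from kernel evaluations alone; the sketch below uses a Gaussian kernel as a stand-in for a universal kernel on a compact X ⊂ Rd, and all names are ours.

    import numpy as np

    def gaussian_kernel(x, xp, gamma=1.0):
        # k(x, x') = exp(-gamma * ||x - x'||^2)
        return np.exp(-gamma * np.sum((x - xp) ** 2))

    def d_k(x, xp, k=gaussian_kernel):
        # d_k(x, x') = ||phi(x) - phi(x')||_H = sqrt(k(x,x) - 2 k(x,x') + k(x',x'))
        return np.sqrt(max(k(x, x) - 2.0 * k(x, xp) + k(xp, xp), 0.0))

    x, xp = np.array([0.0, 0.0]), np.array([1.0, 0.5])
    print(d_k(x, xp))  # distance in the feature space induced by k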

21 Analogous to Steinwart [6], we require that the input space X is a compact subset of Rd, and define the notion of a universal kernel: Definition 1. [sent-52, score-0.261]

22 (Definition 2 in [6]) A continuous positive definite kernel function k : X × X → R on a compact subset X ⊂ Rd is called universal if the set of induced functions is dense in the space C0(X) of continuous functions, i.e. [sent-53, score-0.344]

23 , for all g ∈ C0(X) and all ε > 0 there exists an induced function f with ‖g − f‖∞ < ε. [sent-55, score-0.075]

24 Intuitively, this property makes sure that the feature space of a kernel is rich enough to achieve consistency for all possible data generating distributions. [sent-56, score-0.184]

25 For a detailed treatment of universal kernels we refer to [6]. [sent-57, score-0.181]

26 An SVM classifier for a q-class problem is given in the form of a vector-valued function f : X → Rq with component functions fu : X → R, u ∈ Y (sometimes restricted by the so-called sum-to-zero constraint ∑_{u∈Y} fu = 0). [sent-58, score-0.518]

27 Each of its components takes the form fu(x) = ⟨wu, φ(x)⟩ + bu with wu ∈ H and bu ∈ R. [sent-59, score-0.907]

28 2 Here, the arbitrary rule for breaking ties favors the smallest class index. [sent-65, score-0.074]
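The classifier decides for the class with the largest component fu(x) (the standard rule), with ties broken toward the smallest class index as noted above; a minimal sketch with a linear feature map φ(x) = x (purely illustrative, 0-based class indices):

    import numpy as np

    def decide(x, W, b):
        # f_u(x) = <w_u, phi(x)> + b_u with phi(x) = x for illustration;
        # np.argmax returns the first maximizer, so ties favor the smallest class index.
        f = W @ x + b
        return int(np.argmax(f))

    W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # rows play the role of w_1, w_2, w_3
    b = np.zeros(3)
    print(decide(np.array([0.5, 0.5]), W, b))  # all components tie at 0.5 -> class 0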

29 , wq) ∈ Hq, as the solution of the quadratic program: minimize ∑_{u∈Y} ⟨wu, wu⟩ + (C/ℓ) · ∑_{i=1}^{ℓ} ξi s. [sent-74, score-0.681]

30 t. ⟨wyi − wu, φ(xi)⟩ ≥ 1 − ξi (2) ∀ i ∈ {1, . [sent-76, score-0.253]
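The following is not the paper's construction but an illustrative sketch of program (2) for the special case of a linear feature map φ(x) = x and no bias terms, written with the cvxpy modelling library; the explicit non-negativity of the slacks and all names are our own choices.

    import cvxpy as cp
    import numpy as np

    def train_crammer_singer_linear(X, y, q, C):
        # Sketch of program (2) with phi(x) = x: minimize sum_u <w_u, w_u> + (C/l) * sum_i xi_i
        # subject to <w_{y_i} - w_u, x_i> >= 1 - xi_i for all i and all u != y_i.
        l, d = X.shape
        W = cp.Variable((q, d))           # rows play the role of w_1, ..., w_q
        xi = cp.Variable(l, nonneg=True)  # slack variables
        constraints = []
        for i in range(l):
            for u in range(q):
                if u != y[i]:
                    constraints.append((W[y[i]] - W[u]) @ X[i] >= 1 - xi[i])
        objective = cp.Minimize(cp.sum_squares(W) + (C / l) * cp.sum(xi))
        cp.Problem(objective, constraints).solve()
        return W.value, xi.value

    X_data = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
    y_data = [0, 1, 2, 0]
    W_opt, xi_opt = train_crammer_singer_linear(X_data, y_data, q=3, C=10.0)
    print(xi_opt)

For a nonlinear kernel one would expand each wu over the mapped training points, but the structure of the program stays the same.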

31 The slack variables in the optimum can be written as ξi = max_{v∈Y\{yi}} [1 − (fyi(xi) − fv(xi))]+ ≥ [1 − δh(xi),yi − fyi(xi) + fh(xi)(xi)]+ , (3) with the auxiliary function [t]+ := max{0, t}. [sent-80, score-0.189]
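For a given decision function, the optimal slack of equation (3) can be evaluated directly; a short sketch in our own notation (0-based labels):

    import numpy as np

    def slack(f_x, y_i):
        # xi_i = [max_{v != y_i} 1 - (f_{y_i}(x_i) - f_v(x_i))]_+ as in equation (3)
        diffs = 1.0 - (f_x[y_i] - np.delete(f_x, y_i))
        return max(float(np.max(diffs)), 0.0)

    f_x = np.array([0.2, 1.1, -0.4])  # values f_u(x_i) for q = 3 classes
    print(slack(f_x, 1))  # the label matches the largest component, so the slack is small (about 0.1)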

32 We denote the function induced by the solution of this problem by f = fT,k,C = (⟨w1, ·⟩, . [sent-81, score-0.047]

33 3 The Central Construction In this section we introduce a number of definitions and constructions preparing the proofs in the later sections. [sent-92, score-0.084]

34 Most of the differences to the binary case are incorporated into these constructions such that the lemmas and theorems proven later on naturally extend to the multi-class case. [sent-93, score-0.227]

35 Let ∆ := {p ∈ Rq | pu ≥ 0 ∀ u ∈ Y and ∑_{u∈Y} pu = 1} denote the probability simplex over Y. [sent-94, score-0.199]

36 We introduce the covering number of the metric space (X, dk) as N((X, dk), ε) := min{ n | ∃ {x1, . [sent-95, score-0.407]

37 , xn} ⊂ X such that X ⊂ ∪_{i=1}^{n} B(xi, ε) }, with B(x, ε) = {x′ ∈ X | dk(x, x′) < ε} being the open ball of radius ε > 0 around x ∈ X. [sent-98, score-0.189]
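On a finite sample, a greedy ε-net gives a simple upper estimate of such covering numbers; the following sketch is only an illustration (Euclidean distance on random points; dk from above could be plugged in instead) and plays no role in the proof.

    import numpy as np

    def greedy_covering_number(points, eps, dist):
        # Greedily pick centers until every point lies in some open ball B(center, eps);
        # the number of chosen centers upper-bounds the covering number of the sample.
        centers, uncovered = [], list(range(len(points)))
        while uncovered:
            c = uncovered[0]
            centers.append(points[c])
            uncovered = [i for i in uncovered if dist(points[i], points[c]) >= eps]
        return len(centers)

    rng = np.random.default_rng(0)
    sample = rng.uniform(size=(200, 2))  # points from a compact subset of R^2
    euclid = lambda a, b: float(np.linalg.norm(a - b))
    print(greedy_covering_number(sample, 0.2, euclid))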

38 Next we construct a partition of a large part of the input space X into suitable subsets. [sent-99, score-0.094]

39 In a first step we partition the probability simplex, then we transfer this partition to the input space, and finally we discard small subsets of negligible impact. [sent-100, score-0.245]

40 The resulting partition has a number of properties that are important for the proofs of several lemmas in the next section. [sent-101, score-0.217]

41 We split the simplex ∆ into a partition of ‘classification-aligned’ subsets ∆y := κ−1({y}) = {p ∈ ∆ | py > pu for u < y and py ≥ pu for u > y} for y ∈ Y, on which the decision function κ decides for class y. [sent-104, score-0.521]

42 Then we combine both constructions to the partition Γ := ∪_{y∈Y} { γ̃ ∩ ∆y | γ̃ ∈ Γ̃ and γ̃ ∩ ∆y ≠ ∅ } (note that any other deterministic rule for breaking ties can be realized by permuting the class indices) [sent-109, score-0.256]

43 of ∆ into classification-aligned subsets of side length upper bounded by τ. [sent-110, score-0.116]

44 We have the trivial upper bound |Γ| ≤ D := q · (1/τ + 1)q for the size of the partition. [sent-111, score-0.052]
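To make the construction concrete, the following illustrative sketch (our own discretization, 0-based classes) assigns each p ∈ ∆ to a cell of Γ by combining a grid of side length τ with the decision κ(p), and checks the trivial bound |Γ| ≤ D = q · (1/τ + 1)^q on a random sample.

    import numpy as np

    def gamma_cell(p, tau):
        # Cell of Gamma containing p: the tau-grid cell of p intersected with Delta_{kappa(p)}.
        grid = tuple(int(np.floor(pu / tau)) for pu in p)  # side length at most tau per coordinate
        return grid, int(np.argmax(p))                     # argmax = kappa with smallest-index tie-breaking

    q, tau = 3, 0.25
    rng = np.random.default_rng(0)
    sample = rng.dirichlet(np.ones(q), size=5000)  # random points of the probability simplex
    cells = {gamma_cell(p, tau) for p in sample}
    D = q * (1.0 / tau + 1) ** q
    print(len(cells), D)
    assert len(cells) <= D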

45 The partition Γ will serve as an index set in a number of cases. [sent-112, score-0.094]

46 The first one of these is the partition X = ∪_{γ∈Γ} Xγ with Xγ := {x ∈ X | P(y|x) ∈ γ}. [sent-113, score-0.094]

47 Thus, for each γ ∈ Γ there exists a compact subset K̃γ ⊂ Xγ with P(K̃γ) ≥ (1 − τ/2) · P(Xγ). [sent-115, score-0.108]

48 each K̃γ = ∪_{A∈Ãγ} A such that the diameter of each A ∈ Ãγ is sufficiently small; all of these sets are summarized in the partition Ã = ∪_{γ∈Γ} Ãγ. [sent-117, score-0.172]

49 Now we drop all A ∈ Ãγ below a certain probability mass, resulting in Aγ := { A ∈ Ãγ | PX(A) ≥ τ/(2M) }, (4) with M := D · N((X, dk), σ). [sent-118, score-0.189]

50 We denote the Bayes-optimal decision by y(Xγ) = y(γ) := κ(p) for any p ∈ γ, and for y ∈ Y we define the lower and upper bounds Ly(Xγ) = Ly(γ) := inf_{p∈γ} py and Uy(Xγ) = Uy(γ) := sup_{p∈γ} py on the corresponding components in the probability simplex. [sent-122, score-0.197]

51 We canonically extend these definitions to the above defined sets Kγ, K̃γ, and A ∈ A, which are all subsets of exactly one of the sets Xγ, by defining y(S) := y(γ) for all non-empty subsets S ⊂ Xγ. [sent-123, score-0.182]

52 The resulting construction has the following properties: (P1) The decision function κ is constant on each set γ ∈ Γ, and thus h = κ ◦ f is constant on each set Xγ as well as on each of their subsets, most importantly on each A ∈ A. [sent-124, score-0.058]

53 (P2) For each y ∈ Y , the side length Uy (γ) − Ly (γ) of each set γ ∈ Γ is upper bounded by τ . [sent-125, score-0.059]

54 (P4) The cardinality of the partition Γ is upper bounded by D = q · (1/τ + 1)q , which depends only on τ and q, but not on T , k, or C. [sent-127, score-0.19]

55 (P5) The cardinality of the partition A is upper bounded by M = D · N((X, dk), τ/(2√C)), which is finite by Lemma 1. [sent-128, score-0.379]

56 (P6) The set ∪_{A∈A} A = ∪_{γ∈Γ} Kγ ⊂ X covers a probability mass (w. [sent-129, score-0.072]

57 (P8) Each A ∈ A has diameter less than σ = τ/(2√C), that is, for x, x′ ∈ A we have dk(x, x′) < σ. [sent-137, score-0.233]

58 With properties (P2) and (P6) it is straightforward to obtain the inequality 1 − ∑_{γ∈Γ} Ly(γ)(γ) · PX(Xγ) − τ ≤ R∗ ≤ 1 − ∑_{γ∈Γ} Ly(γ)(γ) · PX(Xγ). (A tight bound would be in O(τ^{1−q}).) [sent-138, score-0.093]

59 We combine the properties of all these sets in the set Fℓ := ∩_{u∈Y} ∩_{A∈A} FℓA,u of training sets of size ℓ, with the same lower bound on the number of training examples in all sets A ∈ A, and for all classes u ∈ Y. [sent-149, score-0.225]

60 4 Preparations The proof of our main result follows the proofs of Theorems 1 and 2 in [6] as closely as possible. [sent-150, score-0.084]

61 For the sake of clarity we organize the proof such that all six lemmas in this section directly correspond to Lemmas 1-6 in [6]. [sent-151, score-0.193]

62 (Lemma 1 from [6]) Let k : X × X → R be a universal kernel on a compact subset X of Rd and Φ : X → H be a feature map of k. [sent-153, score-0.297]

63 Then Φ is continuous and dk(x, x′) := ‖Φ(x) − Φ(x′)‖ defines a metric on X such that id : (X, ‖·‖) → (X, dk) is continuous. [sent-154, score-0.378]

64 In particular, N ((X, dk ), ε) is finite for all ε > 0. [sent-155, score-0.189]

65 Let X ⊂ Rd be compact and let k : X × X → R be a universal kernel. [sent-157, score-0.261]

66 Then, for all ε > 0 and all pairwise disjoint and compact (or empty) subsets K̃u ⊂ X, u ∈ Y, there exists an induced function f : X → [−1/2 · (1 + ε), 1/2 · (1 + ε)]^q; x → (⟨w1∗, φ(x)⟩, . [sent-158, score-0.069]

67 , ⟨wq∗, φ(x)⟩)T, such that fu(x) ∈ [1/2, 1/2 · (1 + ε)] if x ∈ K̃u and fu(x) ∈ [−1/2 · (1 + ε), −1/2] if x ∈ K̃v for some v ∈ Y \ {u}, for all u ∈ Y. [sent-161, score-0.434]

68 This lemma directly corresponds to Lemma 2 in [6], with slightly different cases. [sent-163, score-0.1]

69 The probability of the training sets Fℓ is lower bounded by Pℓ(Fℓ) ≥ 1 − q · M · exp(−(1/8)(τ^6/M^2)ℓ). [sent-166, score-0.144]

70 , (xℓ , yℓ ) ∈ (X × Y )ℓ and define the binary variables zi := 1{A×{u}} (xi , yi ), where the indicator function 1S (s) is one for s ∈ S and zero otherwise. [sent-173, score-0.207]

71 , ℓ} | xn ∈ A, yn = u} = ∑_{i=1}^{ℓ} zi found in the definition of FℓA,u in a form suitable for the application of Hoeffding’s inequality. [sent-177, score-0.115]

72 The inequality, applied to the variables zi, states Pℓ( ∑_{i=1}^{ℓ} zi ≤ (1 − τ) · E · ℓ ) ≤ exp(−2(τE)^2 ℓ), where E := E[zi] = ∫_{A×{u}} dP(x, y) = ∫_A P(u|x) dPX(x) ≥ Lu(A) · PX(A) > 0. [sent-178, score-0.275]

73 Due to E > 0 we can use the relation ∑_{i=1}^{ℓ} zi ≤ (1 − τ) · E · ℓ ⇒ ∑_{i=1}^{ℓ} zi < (1 − τ/2) · E · ℓ in order to obtain Hoeffding’s formula for the case of strict inequality, Pℓ( ∑_{i=1}^{ℓ} zi < (1 − τ) · E · ℓ ) ≤ exp(−(1/2)(τE)^2 ℓ). [sent-179, score-0.572]

74 Combining E ≥ Lu(A) · PX(A) and ∑_{i=1}^{ℓ} zi < (1 − τ) · Lu(A) · PX(A) · ℓ ⇔ T ∈ (X × Y)ℓ \ FℓA,u we obtain Pℓ( (X × Y)ℓ \ FℓA,u ) = Pℓ( ∑_{i=1}^{ℓ} zi < (1 − τ) · Lu(A) · PX(A) · ℓ ) ≤ Pℓ( ∑_{i=1}^{ℓ} zi < (1 − τ) · E · ℓ ) ≤ exp(−(1/2)(τE)^2 ℓ) ≤ exp(−(1/2)(τ Lu(A) PX(A))^2 ℓ). [sent-180, score-0.32]

75 Applying these to the previous inequality results in Pℓ( (X × Y)ℓ \ FℓA,u ) ≤ exp(−(1/2)(τ^3/(2M))^2 ℓ) = exp(−(1/8)(τ^6/M^2)ℓ), which also holds in the case Lu(A) = 0 treated earlier. [sent-182, score-0.157]

76 Finally, we use the union bound 1 − Pℓ(Fℓ) = 1 − Pℓ( ∩_{u∈Y} ∩_{A∈A} FℓA,u ) = Pℓ( ∪_{u∈Y} ∪_{A∈A} (X × Y)ℓ \ FℓA,u ) ≤ |Y| · |A| · exp(−(1/8)(τ^6/M^2)ℓ) ≤ q · M · exp(−(1/8)(τ^6/M^2)ℓ) and properties (P4) and (P5) to prove the assertion. [sent-183, score-0.116]
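As a purely numerical illustration (the values of q, M, and τ below are arbitrary choices, not quantities from the paper), the lower bound of Lemma 3 can be evaluated to see for which sample sizes ℓ it becomes non-trivial:

    import numpy as np

    def lemma3_lower_bound(ell, q, M, tau):
        # P^l(F_l) >= 1 - q * M * exp(-(1/8) * (tau^6 / M^2) * l)
        return 1.0 - q * M * np.exp(-(tau ** 6 / M ** 2) * ell / 8.0)

    q, M, tau = 3, 50.0, 0.1
    for ell in (10 ** 6, 10 ** 9, 10 ** 12):
        print(ell, lemma3_lower_bound(ell, q, M, tau))  # tends to 1 as ell grows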

77 The lemma follows directly from the definition of ηh , even with equality. [sent-187, score-0.1]

78 For all training sets T ∈ Fℓ the SVM solution given by (w1, . [sent-190, score-0.066]

79 , wq) fulfills ∑_{u∈Y} ⟨wu, wu⟩ + (C/ℓ) · ∑_{i=1}^{ℓ} ξi ≤ ∑_{u∈Y} ⟨wu∗, wu∗⟩ + C(R∗ + 2τ), with wu∗ as defined in Lemma 2. [sent-199, score-0.681]

80 The optimality of the SVM solution for the primal problem (2) implies ∑_{u∈Y} ⟨wu, wu⟩ + (C/ℓ) ∑_{i=1}^{ℓ} ξi ≤ ∑_{u∈Y} ⟨wu∗, wu∗⟩ + (C/ℓ) ∑_{i=1}^{ℓ} ξi∗. [sent-201, score-0.506]

81 ∑_{u∈Y} ⟨wu∗, wu∗⟩ + (C/ℓ) ∑_{i=1}^{ℓ} ξi∗ for any feasible choice of the slack variables ξi∗. We choose the values of these variables as ξi∗ = 1 + τ for P(y | xi) ∈ ∆yi and zero otherwise, which corresponds to a feasible solution according to the construction of wu∗ in Lemma 2. [sent-202, score-0.946]

82 , ℓ} | P(y | xi) ∈ ∆yi } denote the number ℓ∗ of training examples correctly classified by the Bayes rule expressed by ∆yi (or κ). [sent-207, score-0.083]

83 For all training sets T ∈ Fℓ the sum of the slack variables (ξ1 , . [sent-212, score-0.113]

84 Thus, we have ∑_{u∈Y} ‖wu‖^2 ≤ C in the optimum, and we deduce ‖wu‖ ≤ √C for each u ∈ Y. [sent-221, score-0.506]

85 Thus, property (P8) makes sure that |fu (x) − fu (x′ )| ≤ τ /2 for all x, x′ ∈ A and u ∈ Y . [sent-222, score-0.288]
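The step behind this is the Cauchy–Schwarz inequality |fu(x) − fu(x′)| = |⟨wu, φ(x) − φ(x′)⟩| ≤ ‖wu‖ · dk(x, x′), combined with ‖wu‖ ≤ √C and (P8); a small numerical check with an explicit finite kernel expansion (purely illustrative, all names ours):

    import numpy as np

    rng = np.random.default_rng(1)
    Z = rng.normal(size=(5, 2))      # expansion points defining w = sum_j beta_j phi(z_j)
    beta = rng.normal(size=5)
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2))  # Gaussian kernel

    def f(x):
        # f(x) = <w, phi(x)> = sum_j beta_j k(z_j, x)
        return sum(b * k(z, x) for b, z in zip(beta, Z))

    K = np.array([[k(a, b) for b in Z] for a in Z])
    w_norm = np.sqrt(beta @ K @ beta)  # ||w||_H
    x, xp = rng.normal(size=2), rng.normal(size=2)
    dk = np.sqrt(k(x, x) - 2 * k(x, xp) + k(xp, xp))
    assert abs(f(x) - f(xp)) <= w_norm * dk + 1e-9  # Cauchy-Schwarz in the RKHS
    print(abs(f(x) - f(xp)), w_norm * dk)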

86 The proof works through the following series of inequalities. [sent-223, score-0.055]

87 The second inequality is clear from the definition of FℓA,u together with |fu (x) − fu (x′ )| ≤ τ /2 within each A ∈ A. [sent-226, score-0.326]

88 For the third inequality we use that the case u = h(x) does not contribute, and the non-negativity of fh(x) (x) − fu (x). [sent-227, score-0.326]

89 In the next steps we make use of ∑_{u∈Y} Lu(A) ≥ 1 − q · τ and the lower bound Lh(x)(x) ≤ P(h(x)|x) = 1 − Eh(x) = 1 − s(x) − ηh(x), which can be deduced from properties (P1) and (P2). [sent-228, score-0.097]

90 5 Proof of the Main Result Just like the lemmas, we organize our theorems analogously to the ones found in [6]. [sent-229, score-0.083]

91 We start with a detailed but technical auxiliary result. [sent-230, score-0.058]

92 Consider wu∗ as defined in Lemma 2; then we combine Lemmas 5 and 6 to bound ∫_X ηh(x) dPX(x) − q · τ in terms of (1 − τ)^2 · R∗, the difference (1/C) ( ∑_{u∈Y} ‖wu∗‖^2 − ∑_{u∈Y} ‖wu‖^2 ), and R∗ + 2τ. [sent-240, score-0.792]

93 Using a − τ ≤ (1 − τ) · a for any a ∈ [0, 1], we derive ∫_X ηh(x) dPX(x) ≤ (1/C) ∑_{u∈Y} ‖wu∗‖^2 + (q + 4) · τ. [sent-241, score-0.253]

94 With the choice C∗ := τ^{−1} · ∑_{u∈Y} ‖wu∗‖^2 and the condition C ≥ C∗ we obtain ∫_X ηh(x) dPX(x) ≤ (q + 5) · τ = ε. [sent-242, score-0.253]

95 Up to constants, this short proof coincides with the proof of Theorem 2 in [6]. [sent-244, score-0.11]

96 Because of the importance of the statement and the brevity of the proof we repeat it here: Since ℓ · Cℓ → ∞ there exists an integer ℓ0 such that ℓ · Cℓ ≥ C ∗ for all ℓ ≥ ℓ0 . [sent-245, score-0.116]

97 Thus for ℓ ≥ ℓ0 Theorem 1 yields Pr( {T ∈ (X × Y)ℓ | R(fT,k,Cℓ) ≤ R∗ + ε} ) ≥ 1 − qMℓ · exp(−(1/8)(τ^6/Mℓ^2)ℓ), where Mℓ = D · N((X, dk), τ/(2√Cℓ)). [sent-246, score-0.234]

98 Moreover, by the assumption on the covering numbers of (X, dk) we obtain Mℓ^2 ∈ O((ℓ · Cℓ)^2) and thus ℓ · Mℓ^{−2} → ∞. [sent-247, score-0.218]

99 6 Conclusion We have proven the universal consistency of the popular multi-class SVM by Crammer and Singer. [sent-248, score-0.3]

100 The proof itself can be understood as an extension of Steinwart’s universal consistency result for binary SVMs. [sent-250, score-0.394]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('px', 0.396), ('dpx', 0.306), ('lu', 0.279), ('fu', 0.259), ('wu', 0.253), ('ly', 0.231), ('dk', 0.189), ('universal', 0.181), ('wq', 0.175), ('crammer', 0.134), ('consistency', 0.119), ('uy', 0.118), ('zi', 0.115), ('fh', 0.115), ('fn', 0.112), ('sb', 0.11), ('svm', 0.103), ('lemma', 0.1), ('steinwart', 0.1), ('partition', 0.094), ('lemmas', 0.094), ('sc', 0.089), ('eh', 0.088), ('rq', 0.08), ('compact', 0.08), ('watkins', 0.077), ('pu', 0.077), ('classi', 0.077), ('singer', 0.076), ('py', 0.072), ('bu', 0.071), ('deduced', 0.071), ('machines', 0.069), ('inequality', 0.067), ('weston', 0.066), ('auxiliaury', 0.058), ('fyi', 0.058), ('multicategory', 0.058), ('subsets', 0.057), ('constructions', 0.055), ('proof', 0.055), ('nq', 0.054), ('yi', 0.053), ('ful', 0.052), ('support', 0.051), ('xi', 0.051), ('induced', 0.047), ('lh', 0.047), ('tewari', 0.047), ('sa', 0.047), ('slack', 0.047), ('conceptually', 0.045), ('simplex', 0.045), ('exp', 0.045), ('tobias', 0.044), ('organize', 0.044), ('diameter', 0.044), ('mass', 0.043), ('pr', 0.043), ('rd', 0.042), ('borel', 0.042), ('ku', 0.042), ('qm', 0.042), ('ties', 0.042), ('theorems', 0.039), ('binary', 0.039), ('implication', 0.038), ('limn', 0.038), ('hoeffding', 0.038), ('lls', 0.038), ('cardinality', 0.037), ('kernel', 0.036), ('bayes', 0.036), ('universally', 0.036), ('sets', 0.034), ('combine', 0.033), ('statement', 0.033), ('bounded', 0.033), ('risk', 0.032), ('breaking', 0.032), ('training', 0.032), ('belief', 0.032), ('construction', 0.031), ('ers', 0.03), ('proofs', 0.029), ('covers', 0.029), ('lin', 0.029), ('sure', 0.029), ('covering', 0.029), ('trivially', 0.029), ('feasible', 0.029), ('theorem', 0.028), ('exists', 0.028), ('decision', 0.027), ('upper', 0.026), ('nition', 0.026), ('bound', 0.026), ('preparations', 0.026), ('zq', 0.026), ('fv', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

Author: Tobias Glasmachers

Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1

2 0.16261967 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

Author: Andreas Christmann, Ingo Steinwart

Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1

3 0.15314342 222 nips-2010-Random Walk Approach to Regret Minimization

Author: Hariharan Narayanan, Alexander Rakhlin

Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1

4 0.11415188 250 nips-2010-Spectral Regularization for Support Estimation

Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo

Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.

5 0.10735481 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

Author: Gilles Blanchard, Nicole Krämer

Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1

6 0.084928654 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars

7 0.084258035 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

8 0.077702552 223 nips-2010-Rates of convergence for the cluster tree

9 0.075470395 243 nips-2010-Smoothness, Low Noise and Fast Rates

10 0.071568601 142 nips-2010-Learning Bounds for Importance Weighting

11 0.070900105 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis

12 0.070421338 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

13 0.069762677 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

14 0.068525903 63 nips-2010-Distributed Dual Averaging In Networks

15 0.065357409 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

16 0.064471796 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

17 0.060620915 151 nips-2010-Learning from Candidate Labeling Sets

18 0.060517646 265 nips-2010-The LASSO risk: asymptotic results and real world examples

19 0.059546079 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

20 0.056856357 108 nips-2010-Graph-Valued Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, 0.04), (2, 0.129), (3, -0.025), (4, 0.097), (5, 0.084), (6, 0.029), (7, -0.027), (8, 0.035), (9, 0.034), (10, 0.016), (11, -0.079), (12, -0.101), (13, -0.006), (14, -0.043), (15, 0.018), (16, 0.115), (17, 0.039), (18, 0.04), (19, -0.056), (20, -0.025), (21, -0.055), (22, 0.018), (23, 0.046), (24, 0.033), (25, 0.07), (26, -0.005), (27, -0.075), (28, -0.065), (29, -0.036), (30, -0.01), (31, 0.002), (32, -0.083), (33, 0.026), (34, 0.043), (35, 0.057), (36, -0.122), (37, 0.087), (38, 0.061), (39, 0.022), (40, -0.049), (41, 0.011), (42, -0.016), (43, 0.094), (44, 0.024), (45, -0.004), (46, 0.011), (47, -0.093), (48, 0.047), (49, 0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92973804 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

Author: Tobias Glasmachers

Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1

2 0.69288486 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression

Author: Gilles Blanchard, Nicole Krämer

Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1

3 0.68914348 250 nips-2010-Spectral Regularization for Support Estimation

Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo

Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.

4 0.68601286 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

Author: Andreas Christmann, Ingo Steinwart

Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1

5 0.56512153 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

Author: Sivan Sabato, Nathan Srebro, Naftali Tishby

Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1

6 0.55780452 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars

7 0.55074668 142 nips-2010-Learning Bounds for Importance Weighting

8 0.50315034 191 nips-2010-On the Theory of Learnining with Privileged Information

9 0.50302052 223 nips-2010-Rates of convergence for the cluster tree

10 0.49871424 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

11 0.48193476 222 nips-2010-Random Walk Approach to Regret Minimization

12 0.47511104 205 nips-2010-Permutation Complexity Bound on Out-Sample Error

13 0.46501219 63 nips-2010-Distributed Dual Averaging In Networks

14 0.46187487 274 nips-2010-Trading off Mistakes and Don't-Know Predictions

15 0.46101037 148 nips-2010-Learning Networks of Stochastic Differential Equations

16 0.45375943 280 nips-2010-Unsupervised Kernel Dimension Reduction

17 0.45091781 233 nips-2010-Scrambled Objects for Least-Squares Regression

18 0.44823387 202 nips-2010-Parallelized Stochastic Gradient Descent

19 0.43752566 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

20 0.43740121 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.017), (27, 0.028), (30, 0.065), (45, 0.14), (50, 0.038), (52, 0.069), (60, 0.441), (77, 0.042), (78, 0.018), (90, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89628881 223 nips-2010-Rates of convergence for the cluster tree

Author: Kamalika Chaudhuri, Sanjoy Dasgupta

Abstract: For a density f on Rd , a high-density cluster is any connected component of {x : f (x) ≥ λ}, for some λ > 0. The set of all high-density clusters form a hierarchy called the cluster tree of f . We present a procedure for estimating the cluster tree given samples from f . We give finite-sample convergence rates for our algorithm, as well as lower bounds on the sample complexity of this estimation problem. 1

same-paper 2 0.8645578 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

Author: Tobias Glasmachers

Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1

3 0.85301173 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts

Author: Sashank J. Reddi, Sunita Sarawagi, Sundar Vishwanathan

Abstract: We propose a new LP relaxation for obtaining the MAP assignment of a binary MRF with pairwise potentials. Our relaxation is derived from reducing the MAP assignment problem to an instance of a recently proposed Bipartite Multi-cut problem where the LP relaxation is guaranteed to provide an O(log k) approximation where k is the number of vertices adjacent to non-submodular edges in the MRF. We then propose a combinatorial algorithm to efficiently solve the LP and also provide a lower bound by concurrently solving its dual to within an approximation. The algorithm is up to an order of magnitude faster and provides better MAP scores and bounds than the state of the art message passing algorithm of [1] that tightens the local marginal polytope with third-order marginal constraints. 1

4 0.79885316 104 nips-2010-Generative Local Metric Learning for Nearest Neighbor Classification

Author: Yung-kyun Noh, Byoung-tak Zhang, Daniel D. Lee

Abstract: We consider the problem of learning a local metric to enhance the performance of nearest neighbor classification. Conventional metric learning methods attempt to separate data distributions in a purely discriminative manner; here we show how to take advantage of information from parametric generative models. We focus on the bias in the information-theoretic error arising from finite sampling effects, and find an appropriate local metric that maximally reduces the bias based upon knowledge from generative models. As a byproduct, the asymptotic theoretical analysis in this work relates metric learning with dimensionality reduction, which was not understood from previous discriminative approaches. Empirical experiments show that this learned local metric enhances the discriminative nearest neighbor performance on various datasets using simple class conditional generative models. 1

5 0.74315631 62 nips-2010-Discriminative Clustering by Regularized Information Maximization

Author: Andreas Krause, Pietro Perona, Ryan G. Gomes

Abstract: Is there a principled way to learn a probabilistic discriminative classifier from an unlabeled data set? We present a framework that simultaneously clusters the data and trains a discriminative classifier. We call it Regularized Information Maximization (RIM). RIM optimizes an intuitive information-theoretic objective function which balances class separation, class balance and classifier complexity. The approach can flexibly incorporate different likelihood functions, express prior assumptions about the relative size of different classes and incorporate partial labels for semi-supervised learning. In particular, we instantiate the framework to unsupervised, multi-class kernelized logistic regression. Our empirical evaluation indicates that RIM outperforms existing methods on several real data sets, and demonstrates that RIM is an effective model selection method. 1

6 0.66731369 229 nips-2010-Reward Design via Online Gradient Ascent

7 0.55140167 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs

8 0.5475055 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

9 0.53752983 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability

10 0.53393412 279 nips-2010-Universal Kernels on Non-Standard Input Spaces

11 0.527888 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning

12 0.5269267 270 nips-2010-Tight Sample Complexity of Large-Margin Learning

13 0.52331001 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars

14 0.5211193 102 nips-2010-Generalized roof duality and bisubmodular functions

15 0.5166297 4 nips-2010-A Computational Decision Theory for Interactive Assistants

16 0.51493967 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

17 0.51450217 222 nips-2010-Random Walk Approach to Regret Minimization

18 0.51257205 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework

19 0.51190656 250 nips-2010-Spectral Regularization for Support Estimation

20 0.51056516 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models