jmlr jmlr2013 jmlr2013-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
Reference: text
sentIndex sentText sentNum sentScore
1 In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. [sent-4, score-0.126]
2 Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. [sent-5, score-0.388]
3 In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. [sent-7, score-0.151]
4 This can be done in the following way: (1) select a few training data points which are close to the testing data point, (2) determine a predictor based on the selected training data points by use of a (global) learning algorithm, and (3) calculate the prediction in the testing data point. [sent-30, score-0.123]
5 That is, the prediction in a testing point x is given by that SVM which is calculated based on the kn training points which are nearest to x; the natural number kn acts as a hyperparameter. [sent-37, score-1.27]
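As an illustration of steps (1)–(3) and of this local fit, here is a minimal NumPy sketch of a localized regularized kernel method; it uses a least-squares loss as a stand-in for the SVM loss, a Gaussian kernel, and the helper names gaussian_kernel and local_predict — all illustrative assumptions, not the article's specification.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between the rows of A and B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def local_predict(x, X_train, y_train, k_n, lam, gamma=1.0):
    """Localized regularized kernel prediction at a single testing point x:
    (1) pick the k_n training points closest to x,
    (2) fit a regularized kernel estimator on those points only,
    (3) evaluate the local fit at x."""
    # (1) k_n nearest neighbors by Euclidean distance in the input space.
    dists = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(dists)[:k_n]
    X_loc, y_loc = X_train[idx], y_train[idx]

    # (2) Regularized least-squares fit in the RKHS (representer theorem):
    #     minimizing (1/k_n) * sum_i (y_i - f(x_i))^2 + lam * ||f||_H^2
    #     gives coefficients alpha solving (K + k_n * lam * I) alpha = y.
    K = gaussian_kernel(X_loc, X_loc, gamma)
    alpha = np.linalg.solve(K + k_n * lam * np.eye(k_n), y_loc)

    # (3) Prediction at the testing point.
    k_x = gaussian_kernel(x[None, :], X_loc, gamma)[0]
    return float(k_x @ alpha)
```

For a training sample of size n, one would call local_predict at every testing point, for example with k_n = int(b * n ** 0.75) as in the simulation section below.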
6 In order to decide which training points are the kn closest ones to x, a metric on the input space is needed. [sent-38, score-0.58]
7 (2006) is that distances (for selecting the kn nearest neighbors) are not measured in the input space but in the feature space (i. [sent-45, score-0.648]
8 There, data points are not selected according to a fixed number kn of nearest neighbors as in kNN; instead, those training data points are selected which are contained in a fixed neighborhood about the testing point x. [sent-52, score-0.802]
9 Here, kn nearest neighbors are selected by use of the ordinary Euclidean metric on the input space X ⊂ R p so that this approach is closest to Zhang et al. [sent-59, score-0.732]
10 This means that, in general, the set of the kn nearest neighbors to a testing point x is not necessarily unique because different training points might have the same distance to x. [sent-62, score-0.84]
11 Then, for the new input variables Xi′ := (Xi, Ui), … Figure 1: Neighborhood (dotted circle) determined by the k nearest neighbors of a testing point (empty point) for k = 3. [sent-74, score-0.194]
12 The left figure shows a situation without distance ties at the border of the neighborhood (dotted circle). [sent-75, score-0.139]
13 The right figure shows a situation with distance ties at the border of the neighborhood (empty point): only one of the two data points (filled points) at the border may belong to the k = 3 nearest neighbors; choosing between these two candidates is done by randomization here. [sent-76, score-0.24]
14 In Zakai and Ritov (2009), localizing is not done by fixed numbers kn of nearest neighbors (as in kNN and SVM-KNN) but by fixed sizes (radii) Rn of neighborhoods (similar as in Cheng et al. [sent-93, score-0.779]
15 In contrast, whether a data point xi0 belongs to the kn nearest neighbors depends on the whole sample; that is, the corresponding indicator variables are not independent and one has to work with random sets of indexes. [sent-98, score-0.732]
16 In particular, the kNN-approach leads to random sizes of neighborhoods which depend on the testing point x while Zakai and Ritov (2009) deal with deterministic sequences of radii Rn which do not depend on the testing point x. [sent-99, score-0.119]
17 For kNN, consistency requires that the number of selected neighbors kn goes to infinity but not too fast for n → ∞. [sent-104, score-0.694]
18 Accordingly, in case of SVM-KNN, the interplay between the convergence of kn and the convergence of λn is crucial. [sent-107, score-0.58]
19 Theorem 1 below gives precise conditions on kn and λn which guarantee consistency of SVM-KNN. [sent-108, score-0.61]
20 In Theorem 1, it is assumed that kn, n ∈ N, is a predefined deterministic sequence. [sent-109, score-0.58]
21 This enables a local regularization of the complexity of the predictor which is an important motivation for localizing a global algorithm as already stated above. [sent-112, score-0.153]
22 … y = f(ξ) for a point ξ ∈ X, a kNN-rule is based on the kn nearest neighbors of ξ. [sent-138, score-0.732]
23 The kn nearest neighbors of ξ ∈ R p within x1 , . [sent-139, score-0.732]
24 , n} such that ♯(I) = kn and max_{i∈I} |xi − ξ| < min_{j∉I} |xj − ξ|. [sent-145, score-0.58]
25 , |xi − ξ| = |x j − ξ|) so that the kn nearest neighbors are not unique and an index set I as defined above does not exist. [sent-148, score-0.732]
26 We say that zi = (xi , ui ) is (strictly) closer to ζ = (ξ, u) ∈ X × (0, 1) than z j = (x j , u j ) if |xi − ξ| < |x j − ξ|; and, in case of a distance tie |xi − ξ| = |x j − ξ|, we say that zi = (xi , ui ) is (strictly) closer to ζ = (ξ, u) than z j = (x j , u j ), if |ui − u| < |u j − u|. [sent-173, score-0.34]
27 The following is a precise definition of “nearest neighbors” which also takes into account distance ties in the xi and the ui . [sent-176, score-0.157]
28 , n} such that ♯(I) = kn, max_{i∈I} |xi − ξ| ≤ min_{i∉I} |xi − ξ| and max_{j∈I∩J} |uj − u| < min_{j∈J\I} |uj − u| (2) where J = { j ∈ {1, . [sent-190, score-0.58]
29 So, definitions (2) and (3) are a modification of (1) in order to deal with distance ties in the xi. [sent-215, score-0.118]
30 Note that, due to the lexicographic order, the values ui and u are only relevant in case of distance ties (at the border of the neighborhood given by the kn nearest neighbors). [sent-216, score-0.826]
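A small sketch of this tie-breaking device, assuming plain NumPy: the distance in the input space is the primary sort key and the auxiliary uniform coordinate |Ui − u| only breaks ties, in the spirit of the lexicographic order described above; the function name knn_indices_with_ties and the toy data are illustrative only.

```python
import numpy as np

def knn_indices_with_ties(X, U, xi, u, k_n):
    """k_n neighbor indices of the extended testing point (xi, u):
    primary key |x_i - xi|, secondary key |U_i - u| for breaking distance ties."""
    x_dist = np.linalg.norm(X - xi, axis=1)
    u_dist = np.abs(U - u)
    # np.lexsort uses the last key as the primary sort key.
    order = np.lexsort((u_dist, x_dist))
    return order[:k_n]

rng = np.random.default_rng(0)
n, p, k_n = 20, 2, 3
X = rng.normal(size=(n, p))
U = rng.uniform(size=n)               # auxiliary uniform coordinates of the training points
xi, u = np.zeros(p), rng.uniform()    # testing point with its own auxiliary coordinate
print(knn_indices_with_ties(X, U, xi, u, k_n))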
31 (4) That is, In,ζ contains the indexes of the kn -nearest neighbors of ζ. [sent-221, score-0.664]
32 Then, the vector of the kn -nearest neighbors is Dn,ζ := (Zi1 ,Yi1 ), . [sent-226, score-0.664]
33 (1/kn) ∑_{i∈In,ζ} Yi (5). The SVM-KNN method replaces the mean by an SVM. [sent-231, score-0.58]
34 The empirical support vector machine fDn ,λDn uniquely exists for every λDn ∈ (0, ∞) and every data-set Dn ∈ (X × Y )n if t → L(y,t) is convex for every y ∈ Y . [sent-250, score-0.204]
35 The prediction of the SVM-KNN learning algorithm in ζ = (ξ, u) ∈ X × (0, 1) is given by fDn,ζ,Λn,ζ(ξ) with fDn,ζ,Λn,ζ = arg min_{f∈H} (1/kn) ∑_{i∈In,ζ} L(Yi, f(Xi)) + Λn,ζ ‖f‖²_H (7) where ω → Λn,ζ(ω) is a random regularization parameter depending on n and ζ. [sent-251, score-0.605]
36 That is, the method calculates the empirical SVM fDn,ζ ,Λn,ζ for the kn nearest neighbors (given by the index set In,ζ ) and uses the value fDn,ζ ,Λn,ζ (ξ) for the prediction in ζ. [sent-252, score-0.732]
37 In order to get consistency, the choice of the number of neighbors kn and the data-driven local choice of the regularization parameter λ = Λn,ξ need some care. [sent-265, score-0.711]
38 Possible choices for kn and λn are, for example, kn = b · n^0.75. [sent-267, score-1.16]
39 Settings: Choose a sequence kn ∈ N, n ∈ N, such that k1 ≤ k2 ≤ k3 ≤ . [sent-270, score-0.58]
40 ≤ lim_{n→∞} kn = ∞ and kn/n ց 0 for n → ∞, and a sequence λn ∈ (0, ∞), n ∈ N, such that lim_{n→∞} λn = 0 and lim_{n→∞} λn^{3/2} · kn/√n = ∞ (9), and a constant c ∈ (0, ∞), and a sequence cn ∈ [0, ∞) such that lim_{n→∞} cn/√λn = 0. [sent-273, score-1.916]
41 For every ζ = (ξ, u) ∈ X × (0, 1), define Λ̃n,ζ = (1/kn) ∑_{i∈In,ζ} |Xi − ξ|^{3/2} and choose random regularization parameters Λn,ζ such that X × (0, 1) × Ω → (0, ∞), (ξ, u, ω) = (ζ, ω) → Λn,ζ(ω) is measurable and c · max{λn, Λ̃n,ζ} ≤ Λn,ζ ≤ (c + cn) · max{λn, Λ̃n,ζ} for all ζ ∈ X × (0, 1). [sent-274, score-0.754]
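A sketch of this data-driven choice of the regularization parameter. The schedules kn = ⌊b·n^0.75⌋ and λn = n^(−1/10) are illustrative assumptions only (the exponent −1/10 merely makes λn^{3/2}·kn/√n = b·n^{0.1} → ∞, consistent with (9)); the function returns the lower end c·max(λn, Λ̃n,ζ) of the admissible interval.

```python
import numpy as np

def local_regularization(X_train, xi, n, b=1.0, c=1.0):
    """Data-driven regularization parameter at a testing point xi:
    Lambda_tilde is the mean of |X_i - xi|^(3/2) over the k_n nearest neighbors,
    and the returned value is the lower end c * max(lambda_n, Lambda_tilde) of the
    admissible interval [c * max(...), (c + c_n) * max(...)]."""
    k_n = int(np.floor(b * n ** 0.75))
    lambda_n = n ** (-0.1)                       # assumed deterministic schedule, see lead-in
    dists = np.linalg.norm(X_train - xi, axis=1)
    neighbors = np.argsort(dists)[:k_n]
    Lambda_tilde = np.mean(dists[neighbors] ** 1.5)
    return c * max(lambda_n, Lambda_tilde), neighbors
```

The index set of the kn nearest neighbors is computed anyway, so Λ̃n,ζ comes at essentially no extra cost, as noted below.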
42 Let L : [−M, M] × R → [0, ∞) be a convex loss function with the following local Lipschitz property: there are some b0, b1 ∈ (0, ∞) and q ∈ [0, 1] such that, for every a ∈ (0, ∞), sup_{y∈[−M,M]} |L(y, t1) − L(y, t2)| ≤ |L|a,1 · |t1 − t2| for all t1, t2 ∈ [−a, a] (12) for |L|a,1 = b0 + b1 a^q. [sent-277, score-0.134]
43 Then, every SVM-KNN defined by (7,8) according to the above settings and clipped at M, fDn : ζ = (ξ, u) → fDn,ζ,Λn,ζ(ξ), is risk-consistent, that is, RP(fDn) → inf_{f : X → R measurable} RP(f) =: RP* for n → ∞ in probability. [sent-292, score-0.195]
44 (14) The deterministic λn prevents the regularization parameters from decreasing to 0 too fast and (9) controls the interplay between kn and λn . [sent-300, score-0.605]
45 Note that the calculation of Λ̃n,ζ is computationally fast as In,ζ (the index set of the kn nearest neighbors) has to be calculated anyway. [sent-302, score-0.648]
46 As it is assumed in Theorem 1 that limn→∞ kn /n = 0 (i. [sent-308, score-0.58]
47 Instead, it would also be possible to assume that limn→∞ kn /n = 1 so that the method (asymptotically) acts as an ordinary SVM. [sent-316, score-0.58]
48 If convergence of the fraction kn /n to 1 is fast enough, then universal consistency of such a method follows from universal consistency of SVM. [sent-317, score-0.686]
49 In case of SVM-KNN, the number of nearest neighbors is equal to kn = b · n^0.75. [sent-375, score-0.732]
50 The exponent 0.75 for the definition of kn is in accordance with the settings in Section 3. [sent-387, score-0.58]
51 For each testing point ξ, the prediction is calculated by a local SVM on the kn nearest neighbors. [sent-392, score-0.712]
52 Similarly to the case of SVM-KNN, the number of nearest neighbors in the classical kNN method is equal to kn = c · n^0.75. [sent-401, score-0.759]
53 For every scenario j and every learning algorithm, the values MAE j,r ( f⋆ ), r ∈ {1, . [sent-417, score-0.184]
54 Table 1: The average MAE_j of the mean absolute error over the 500 runs for classical SVMs and SVM-KNN for scenarios j = 1 and j = 2. It turns out that SVM-KNN is clearly better than the classical SVM in scenario 1, while the classical SVM is clearly better than SVM-KNN in scenario 2. [sent-428, score-0.177]
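A schematic of how such a table can be produced, assuming a per-scenario data generator make_scenario and fitting routines fit_svm and fit_svm_knn exist; these names are placeholders for illustration, not the article's simulation code.

```python
import numpy as np

def average_mae(fit, make_scenario, runs=500, seed=0):
    """Average mean absolute error of one learning method over repeated simulation runs."""
    rng = np.random.default_rng(seed)
    maes = []
    for _ in range(runs):
        X_tr, y_tr, X_te, y_te = make_scenario(rng)   # assumed data generator per scenario
        f_hat = fit(X_tr, y_tr)                       # assumed to return a prediction function
        maes.append(np.mean(np.abs(f_hat(X_te) - y_te)))
    return float(np.mean(maes))

# Comparison in the spirit of Table 1:
# average_mae(fit_svm, make_scenario) vs. average_mae(fit_svm_knn, make_scenario)
```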
55 This problem is avoided by localized learners such as SVM-KNN, which is a main motivation for localizing global learning algorithms. [sent-435, score-0.131]
56 , 500} for classical SVMs and SVM-KNN for scenarios j = 1 and j = 2. Figure 4: Values of the hyperparameter γ selected by cross validation for the classical SVM in the 500 runs for each scenario (panels: scenario 1, scenario 2). [sent-461, score-0.226]
57 In this article, it has been shown for a large class of regularized kernel methods (including SVM) that suitably localized versions (called SVM-KNN) are universally consistent. [sent-520, score-0.151]
58 For every ζ = (ξ, u) ∈ X × (0, 1), there is a smallest rn,ξ ∈ [0, ∞] such that Q(|Xi − ξ| ≤ rn,ξ) ≥ kn/n and there is an sn,ζ ∈ [0, ∞) such that Q(|Xi − ξ| < rn,ξ or |Xi − ξ| = rn,ξ, |Ui − u| < sn,ζ) = kn/n. [sent-528, score-0.772]
59 Define Bn,ζ = (Brn,ξ(ξ) × (0, 1)) ∪ (∂Brn,ξ(ξ) × Bsn,ζ(u)). Roughly speaking, Bn,ζ is a neighborhood around ζ = (ξ, u) with probability kn/n, which is in line with our tie-breaking strategy. [sent-530, score-0.608]
60 Then, (PX ⊗ Unif(0,1))(Bn,ζ) = Q(Zi ∈ Bn,ζ) = kn/n, where Unif(0,1) denotes the uniform distribution on (0, 1). [sent-531, score-0.58]
61 Let Pn,ζ be the conditional distribution of Zi given Zi ∈ Bn,ζ, that is, Pn,ζ(B) = Q(Zi ∈ B ∩ Bn,ζ)/Q(Zi ∈ Bn,ζ) = (n/kn) · Q(Zi ∈ B ∩ Bn,ζ) for all B ∈ BX×(0,1). [sent-532, score-0.58]
62 Then, for every ζ ∈ Z, n ∈ N, and every integrable g : Z × Y → R, (n/kn) ∫_{Z×Y} I_{Bn,ζ}(z) g(z, y) Q_{Z,Y}(d(z, y)) = ∫_Z ∫_Y g(z, y) P(dy|z) Pn,ζ(dz). (16) [sent-539, score-0.136]
63 When this does not lead to confusion, the conditional distribution of the pair of random variables (Zi, Yi) given Zi ∈ Bn,ζ is also denoted by Pn,ζ. [sent-540, score-0.58]
64 (n/kn) ∫_{Z×Y} I_{Bn,ζ}(z) g(z, y) Q_{Z,Y}(d(z, y)) = ∫_{Z×Y} g(z, y) Pn,ζ(d(z, y)). (17) The following lemma is an immediate consequence of the definitions and well known facts about the support of measures, see, for example, Parthasarathy (1967, II. [sent-542, score-0.614]
65 This means: while the sets In,ζ and Dn,ζ consist of a fixed number of nearest neighbors, the sets I⋆n,ζ and D⋆n,ζ consist of all those neighbors which lie in a fixed neighborhood. [sent-568, score-0.152]
66 As the probability that Zi ∈ Bn,ζ is kn/n, we expect that, for large n, the index sets In,ζ and I⋆n,ζ and the vectors of data points Dn,ζ and D⋆n,ζ are similar. [sent-569, score-0.58]
67 In combination with (12), this implies that, for every ξ ∈ X and for every f1, f2 ∈ H, | ∫ L(y, f1(ξ)) P(dy|ξ) − ∫ L(y, f2(ξ)) P(dy|ξ) | ≤ |L|M,1 · ‖K‖_∞ · ‖f1 − f2‖_H. Define ‖L(·, 0)‖_∞ . [sent-578, score-0.222]
68 Lemma 3 Let V be a separable Hilbert space and, for every n ∈ N, let Ψn : Z × Y → V be a Borel-measurable function such that, for every bounded subset B ⊂ Z, sup_{n∈N} sup_{z∈B, y∈Y} ‖Ψn(z, y)‖ < ∞. [sent-582, score-0.224]
69 Then, for every ζ ∈ X, λn^{-3/2} · (n/kn) · ‖ (1/n) ∑_{i=1}^n Ψn(Zi, Yi) I_{Bn,ζ}(Zi) − ∫ Ψn(z, y) I_{Bn,ζ}(z) Q_{Z,Y}(d(z, y)) ‖ → 0 in probability. [sent-583, score-0.648]
70 Hence, there is a constant b ∈ (0, ∞) such that, for every n ≥ n0, sup_{(z,y)∈Z×Y} ‖Ψn(z, y) I_{Bn,ζ}(z)‖_V ≤ b. For every n ≥ n0 and τ ∈ (0, ∞), define an,τ := 2b · … and the event An,τ = { ‖ (1/n) ∑_{i=1}^n Ψn(Zi, Yi) I_{Bn,ζ}(Zi) − ∫ Ψn I_{Bn,ζ} dQ_{Z,Y} ‖_V ≤ an,τ }. [sent-589, score-0.18]
71 (20) −3 1 −1 2 Define τn := λn kn n− 2 and εn := λn 2 nkn an,τn for every n ≥ n0 . [sent-595, score-0.648]
72 Then, for every ω ∈ An,τ, λn^{-3/2} · (n/kn) · ‖ (1/n) ∑_{i=1}^n Ψn(Zi(ω), Yi(ω)) I_{Bn,ζ}(Zi(ω)) − ∫ Ψn I_{Bn,ζ} dQ_{Z,Y} ‖ < εn. [sent-596, score-0.648]
73 V According to (9), 3 εn = n·an,τn 3 2 λn kn = 2bn 3 2 λn kn 2 λn kn √ + nn √ 3 2 1 λn kn +√ n nn = 2b · n 3 2 + λn kn √ 1 +√ n λn kn n 3 2 − − 0. [sent-597, score-3.48]
74 Hence, for every ε > 0, there is an nε ∈ N such that ε > εn for every n ≥ nε and, therefore, Q( λn^{-3/2} · (n/kn) · ‖ (1/n) ∑_{i=1}^n Ψn(Zi, Yi) I_{Bn,ζ}(Zi) − ∫ Ψn I_{Bn,ζ} dQ_{Z,Y} ‖_V > ε ) ≤ Q(∁An,τn) ≤ e^{−τn} by (20). [sent-598, score-0.716]
75 (c) For every ζ = (ξ, u) ∈ R^p × (0, 1) and every Λ : Ω → (0, ∞) measurable with respect to A and the Borel-σ-algebra, the map Ω → R, ω → fD⋆n,ζ(ω),Λ(ω)(ξ) is measurable with respect to A and B. [sent-606, score-0.27]
76 The set B_J^(1) says that no Xℓ is closer to ξ than the kn nearest neighbors. [sent-619, score-0.648]
77 The sets B_J^(2) and B_J^(3) state that J specifies all those Xj which lie at the border of the neighborhood given by the nearest neighbors. [sent-620, score-0.129]
78 The set B_J^(4) is concerned with all data points which lie at the border: the nearest neighbors among them have strictly smaller |Ui − u| than those which do not belong to the nearest neighbors. [sent-621, score-0.22]
79 Since B_J^(t) is measurable for every t ∈ {1, 2, 3, 4} and J ⊂ {1, . [sent-626, score-0.124]
80 Now, we can prove parts (b) and (c): for every I ⊂ {1, 2, . [sent-653, score-0.648]
81 Then, (b) follows from (a), and (c) follows from measurability of τ⋆n,ζ : ω → Ĩn,ζ(ω) for every fixed ζ = (ξ, u). [sent-676, score-0.134]
82 Proof of Theorem 1 In the main part of the proof, it is shown that, for PX ⊗ Unif(0,1)-almost every ζ = (ξ, u) ∈ X × (0, 1), 0 ≤ ∫ L(y, fDn,ζ,Λn,ζ(ξ)) P(dy|ζ) − inf_{t∈R} ∫ L(y, t) P(dy|ζ) → 0 for n → ∞ in probability. [sent-682, score-0.117]
83 For every measurable f : X → R, ∫ L(y, f(ξ)) P(dy|ξ) ≥ inf_{t∈R} ∫ L(y, t) P(dy|ξ). Hence, RP* ≥ ∫ inf_{t∈R} ∫ L(y, t) P(dy|ξ) PX(dξ) and, therefore, (23) implies EQ[RP(fDn)] − RP* → 0 for n → ∞ and, as RP(fDn) ≥ RP*, EQ|RP(fDn) − RP*| → 0. [sent-689, score-0.222]
84 Proof Statement (28) follows from Lemma 3 because the definitions imply λn^{-3/2} · (♯(I⋆n,ζ) − kn)/kn = λn^{-3/2} · (n/kn) · [ (1/n) ∑_{i=1}^n I_{Bn,ζ}(Zi) − ∫ I_{Bn,ζ}(z) Q_{Z,Y}(d(z, y)) ]. [sent-698, score-1.74]
85 In order to prove (29) note that (28) implies that ♯(I⋆n,ζ)/kn → 1 in probability and, therefore, also kn/♯(I⋆n,ζ) → 1 in probability. [sent-699, score-0.58]
86 Hence, (28) implies (29) because λn^{-3/2} · (♯(I⋆n,ζ) − kn)/♯(I⋆n,ζ) = λn^{-3/2} · (♯(I⋆n,ζ) − kn)/kn · kn/♯(I⋆n,ζ). [sent-700, score-2.32]
87 Therefore, ♯(In,ζ \ I⋆n,ζ) ≤ |♯(I⋆n,ζ) − kn| and ♯(I⋆n,ζ \ In,ζ) ≤ |♯(I⋆n,ζ) − kn|. [sent-703, score-1.16]
88 kn + ⋆ ⋆ kn ♯ In,ζ ) kn ♯(In,ζ ) Therefore, (30) follows from (28) and (29). [sent-706, score-1.74]
89 As ξ ∈ B0, we have PX(Bε(ξ)) > 0 and, therefore, PX(Bε(ξ)) − kn/n > (1/2) PX(Bε(ξ)) > 0 for n large enough (see Lemma 2). [sent-708, score-0.58]
90 , n} : Xi ∈ Bε(ξ) }) < kn ) = Q( (1/n) ∑_{i=1}^n I_{Bε(ξ)}(Xi) < kn/n ) = Q( PX(Bε(ξ)) − (1/n) ∑_{i=1}^n I_{Bε(ξ)}(Xi) > PX(Bε(ξ)) − kn/n ) ≤ Q( PX(Bε(ξ)) − (1/n) ∑_{i=1}^n I_{Bε(ξ)}(Xi) > (1/2) PX(Bε(ξ)) ) and the law of large numbers. [sent-712, score-1.74]
91 An application of Lemma 3 for Ψn((x, v), y) = |x − ξ|^β and (16) yields that it suffices to prove λn^{-3/2} · [ (1/kn) ∑_{i∈In,ζ} |Xi − ξ|^β − (1/kn) ∑_{i=1}^n |Xi − ξ|^β I_{Bn,ζ}(Zi) ] → 0 for n → ∞ (35) in probability in order to prove statement (32). [sent-714, score-1.16]
92 It follows from (28) that λn^{-3/2} · (♯(I⋆n,ζ) − kn)/kn → 0 for n → ∞ in probability, and it follows from Lemma 3 for Ψn(z, y) = λn^{1/2} hn,ζ(z, y) Φ(x), z = (x, v), that λn^{-1} · (n/kn) · ‖ (1/n) ∑_{i=1}^n hn,ζ(Zi, Yi) Φ(Xi) I_{Bn,ζ}(Zi) − ∫ hn,ζ Φ I_{Bn,ζ} dQ_{Z,Y} ‖ → 0 in probability. [sent-742, score-1.74]
93 As RPn,ζ(0) ≤ sup_{y∈[−M,M]} L(y, 0) < ∞ and lim_{n→∞} sup_{x∈Brn,ξ(ξ)} ∂1,1 K(x, x) = ∂1,1 K(ξ, ξ), it remains to show that (1/Λn,ζ) ∫ |x − ξ| Pn,ζ(d(x, v)) → 0 in probability. [sent-764, score-0.13]
94 For every f ∈ H, define the map Af : M1(X × Y) × [0, ∞) → R by Af(P0, λ) = ∫ L(y, f(x)) P0(d(x, y)) + λ‖f‖²_H (45) for every P0 ∈ M1(X × Y) and λ ∈ [0, ∞). [sent-770, score-0.158]
95 Then, for every integrable g : X × Y → R, ∫ g(x, y) P̃n,ζ(d(x, y)) = ∫ g(x, y) Pn,ζ(d(x, v, y)) = ∫_Z ∫_Y g(x, y) P(dy|x) Pn,ζ(d(x, v)) and, according to the definitions (45) and (6), inf_{f∈H} Af(P̃n,ζ, λ) = ∫ L(y, fP̃n,ζ,λ(x)) Pn,ζ(d(x, v, y)) + λ‖fP̃n,ζ,λ‖²_H (47) for every λ ∈ (0, ∞) and n ∈ N. [sent-791, score-0.185]
96 (b) Assume that L has the local Lipschitz property that, for every a ∈ (0, ∞), there is an |L|a,1 ∈ (0, ∞) such that sup_{y∈Y0} |L(y, t1) − L(y, t2)| ≤ |L|a,1 · |t1 − t2| for all t1, t2 ∈ [−a, a]. [sent-817, score-0.134]
97 (i) Then, for every probability measure P0 on (X0 × Y0, B_{X0×Y0}) such that RP0(0) < ∞ and for every λ0, λ1 ∈ (0, ∞), it holds that ‖fP0,λ1 − fP0,λ0‖_H ≤ |λ1 − λ0| √(2 RP0(0)) . [sent-818, score-0.136]
98 10), there is a bounded measurable map h : X0 × Y0 → R such that h(x, y) ∈ ∂L(y, fP0,λ0(x)) for every (x, y) ∈ X0 × Y0 and fP0,λ0 = −(1/(2λ0)) ∫ h Φ dP0. [sent-857, score-0.146]
99 10) that there is a measurable function hP1,λ : X0 × Y0 → R which fulfills (50) and ‖fP1,λ − fP2,λ‖_H ≤ (1/λ) ‖ ∫ hP1,λ Φ dP1 − ∫ hP1,λ Φ dP2 ‖_H = (1/λ) sup_{f∈H, ‖f‖_H≤1} ⟨ ∫ hP1,λ Φ dP1 − ∫ hP1,λ Φ dP2, f ⟩_H. That is, we have shown (ii). [sent-878, score-0.144]
100 On the strong universal consistency of nearest neighbor regression function estimates. [sent-934, score-0.121]
wordName wordTfidf (topN-words)
[('kn', 0.58), ('fdn', 0.338), ('fpn', 0.231), ('fd', 0.23), ('pn', 0.176), ('dy', 0.167), ('brn', 0.14), ('ersions', 0.132), ('ibn', 0.132), ('niversal', 0.132), ('ocalized', 0.132), ('knn', 0.124), ('zi', 0.119), ('blanzieri', 0.116), ('egularized', 0.113), ('px', 0.108), ('rp', 0.103), ('christmann', 0.103), ('ernel', 0.094), ('svm', 0.092), ('onsistency', 0.088), ('neighbors', 0.084), ('dn', 0.077), ('steinwart', 0.076), ('pdn', 0.074), ('summand', 0.071), ('ethods', 0.069), ('nearest', 0.068), ('every', 0.068), ('denkowski', 0.066), ('measurability', 0.066), ('svms', 0.064), ('bj', 0.064), ('localized', 0.064), ('bn', 0.059), ('measurable', 0.056), ('ties', 0.054), ('unif', 0.053), ('mae', 0.051), ('segata', 0.05), ('inf', 0.049), ('scenario', 0.048), ('bh', 0.047), ('localizing', 0.047), ('zakai', 0.045), ('sup', 0.044), ('mm', 0.043), ('lim', 0.042), ('testing', 0.042), ('enrico', 0.041), ('indn', 0.041), ('rkhs', 0.04), ('xi', 0.04), ('predictor', 0.039), ('zn', 0.039), ('ui', 0.039), ('hn', 0.039), ('limn', 0.037), ('ib', 0.035), ('radii', 0.035), ('ic', 0.034), ('bx', 0.034), ('lemma', 0.034), ('bryl', 0.033), ('fremlin', 0.033), ('hable', 0.033), ('eq', 0.033), ('universally', 0.033), ('border', 0.033), ('cheng', 0.033), ('rd', 0.032), ('consistency', 0.03), ('pd', 0.029), ('regularized', 0.029), ('neighborhood', 0.028), ('article', 0.028), ('rn', 0.028), ('hyperparameter', 0.028), ('classical', 0.027), ('tv', 0.026), ('regularization', 0.025), ('kernel', 0.025), ('cn', 0.025), ('bayreuth', 0.025), ('bsn', 0.025), ('fdi', 0.025), ('devroye', 0.025), ('ritov', 0.025), ('distance', 0.024), ('universal', 0.023), ('map', 0.022), ('fp', 0.022), ('clipped', 0.022), ('local', 0.022), ('hilbert', 0.022), ('localize', 0.021), ('szl', 0.021), ('ikn', 0.021), ('continuity', 0.021), ('global', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
2 0.16992381 32 jmlr-2013-Differential Privacy for Functions and Functional Data
Author: Rob Hall, Alessandro Rinaldo, Larry Wasserman
Abstract: Differential privacy is a rigorous cryptographically-motivated characterization of data privacy which may be applied when releasing summaries of a database. Previous work has focused mainly on methods for which the output is a finite dimensional vector, or an element of some discrete set. We develop methods for releasing functions while preserving differential privacy. Specifically, we show that adding an appropriate Gaussian process to the function of interest yields differential privacy. When the functions lie in the reproducing kernel Hilbert space (RKHS) generated by the covariance kernel of the Gaussian process, then the correct noise level is established by measuring the “sensitivity” of the function in the RKHS norm. As examples we consider kernel density estimation, kernel support vector machines, and functions in RKHSs. Keywords: differential privacy, density estimation, Gaussian processes, reproducing kernel Hilbert space
3 0.1571499 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
4 0.10736075 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
Author: Manfred Opper, Ulrich Paquet, Ole Winther
Abstract: Expectation Propagation (EP) provides a framework for approximate inference. When the model under consideration is over a latent Gaussian field, with the approximation being Gaussian, we show how these approximations can systematically be corrected. A perturbative expansion is made of the exact but intractable correction, and can be applied to the model’s partition function and other moments of interest. The correction is expressed over the higher-order cumulants which are neglected by EP’s local matching of moments. Through the expansion, we see that EP is correct to first order. By considering higher orders, corrections of increasing polynomial complexity can be applied to the approximation. The second order provides a correction in quadratic time, which we apply to an array of Gaussian process and Ising models. The corrections generalize to arbitrarily complex approximating families, which we illustrate on tree-structured Ising model approximations. Furthermore, they provide a polynomial-time assessment of the approximation error. We also provide both theoretical and practical insights on the exactness of the EP solution. Keywords: expectation consistent inference, expectation propagation, perturbation correction, Wick expansions, Ising model, Gaussian process
5 0.098134018 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
Author: Sumio Watanabe
Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion
6 0.081597671 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
7 0.046561062 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
8 0.044910733 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
9 0.043649133 104 jmlr-2013-Sparse Single-Index Model
10 0.042822395 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
11 0.04147042 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
12 0.040681481 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
13 0.040556889 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
14 0.040124603 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
15 0.038485426 76 jmlr-2013-Nonparametric Sparsity and Regularization
16 0.036548916 86 jmlr-2013-Parallel Vector Field Embedding
17 0.032504488 108 jmlr-2013-Stochastic Variational Inference
18 0.031940468 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
19 0.031104274 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
20 0.029235899 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
topicId topicWeight
[(0, -0.18), (1, 0.049), (2, 0.076), (3, -0.065), (4, -0.013), (5, -0.165), (6, -0.049), (7, 0.299), (8, -0.013), (9, -0.082), (10, -0.117), (11, 0.021), (12, 0.268), (13, 0.005), (14, -0.195), (15, 0.101), (16, 0.052), (17, 0.227), (18, 0.197), (19, -0.07), (20, 0.034), (21, 0.074), (22, -0.17), (23, 0.062), (24, -0.062), (25, 0.04), (26, 0.051), (27, 0.089), (28, 0.137), (29, -0.005), (30, 0.054), (31, -0.017), (32, -0.023), (33, -0.123), (34, -0.063), (35, -0.02), (36, 0.006), (37, -0.035), (38, -0.134), (39, -0.103), (40, 0.036), (41, -0.035), (42, 0.015), (43, -0.096), (44, 0.051), (45, 0.095), (46, 0.011), (47, 0.044), (48, -0.052), (49, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.94681329 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
2 0.63336396 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
3 0.53650743 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
Author: Sumio Watanabe
Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion
4 0.411239 32 jmlr-2013-Differential Privacy for Functions and Functional Data
Author: Rob Hall, Alessandro Rinaldo, Larry Wasserman
Abstract: Differential privacy is a rigorous cryptographically-motivated characterization of data privacy which may be applied when releasing summaries of a database. Previous work has focused mainly on methods for which the output is a finite dimensional vector, or an element of some discrete set. We develop methods for releasing functions while preserving differential privacy. Specifically, we show that adding an appropriate Gaussian process to the function of interest yields differential privacy. When the functions lie in the reproducing kernel Hilbert space (RKHS) generated by the covariance kernel of the Gaussian process, then the correct noise level is established by measuring the “sensitivity” of the function in the RKHS norm. As examples we consider kernel density estimation, kernel support vector machines, and functions in RKHSs. Keywords: differential privacy, density estimation, Gaussian processes, reproducing kernel Hilbert space
5 0.37552845 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
Author: Manfred Opper, Ulrich Paquet, Ole Winther
Abstract: Expectation Propagation (EP) provides a framework for approximate inference. When the model under consideration is over a latent Gaussian field, with the approximation being Gaussian, we show how these approximations can systematically be corrected. A perturbative expansion is made of the exact but intractable correction, and can be applied to the model’s partition function and other moments of interest. The correction is expressed over the higher-order cumulants which are neglected by EP’s local matching of moments. Through the expansion, we see that EP is correct to first order. By considering higher orders, corrections of increasing polynomial complexity can be applied to the approximation. The second order provides a correction in quadratic time, which we apply to an array of Gaussian process and Ising models. The corrections generalize to arbitrarily complex approximating families, which we illustrate on tree-structured Ising model approximations. Furthermore, they provide a polynomial-time assessment of the approximation error. We also provide both theoretical and practical insights on the exactness of the EP solution. Keywords: expectation consistent inference, expectation propagation, perturbation correction, Wick expansions, Ising model, Gaussian process
6 0.29044962 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
7 0.26257011 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
8 0.24816562 104 jmlr-2013-Sparse Single-Index Model
9 0.2469147 76 jmlr-2013-Nonparametric Sparsity and Regularization
10 0.22525455 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
12 0.21057615 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
13 0.20277828 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
14 0.19245207 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
15 0.19198658 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
16 0.18887514 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
17 0.18358298 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
18 0.17906156 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
19 0.17747422 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
20 0.17034332 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
topicId topicWeight
[(0, 0.016), (5, 0.111), (6, 0.026), (10, 0.048), (23, 0.047), (53, 0.04), (68, 0.017), (70, 0.026), (71, 0.456), (75, 0.037), (85, 0.026), (87, 0.013), (93, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.71578699 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
2 0.62762982 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
Author: Rami Mahdi, Jason Mezey
Abstract: Constraint-based learning of Bayesian networks (BN) from limited data can lead to multiple testing problems when recovering dense areas of the skeleton and to conflicting results in the orientation of edges. In this paper, we present a new constraint-based algorithm, light mutual min (LMM) for improved accuracy of BN learning from small sample data. LMM improves the assessment of candidate edges by using a ranking criterion that considers conditional independence on neighboring variables at both sides of an edge simultaneously. The algorithm also employs an adaptive relaxation of constraints that, selectively, allows some nodes not to condition on some neighbors. This relaxation aims at reducing the incorrect rejection of true edges connecting high degree nodes due to multiple testing. LMM additionally incorporates a new criterion for ranking v-structures that is used to recover the completed partially directed acyclic graph (CPDAG) and to resolve conflicting v-structures, a common problem in small sample constraint-based learning. Using simulated data, each of these components of LMM is shown to significantly improve network inference compared to commonly applied methods when learning from limited data, including more accurate recovery of skeletons and CPDAGs compared to the PC, MaxMin, and MaxMin hill climbing algorithms. A proof of asymptotic correctness is also provided for LMM for recovering the correct skeleton and CPDAG. Keywords: Bayesian networks, skeleton, constraint-based learning, mutual min
3 0.62283027 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel
Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive
4 0.30060479 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
Author: Partha Niyogi
Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates
5 0.29662755 108 jmlr-2013-Stochastic Variational Inference
Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
6 0.29044658 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
7 0.2875447 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.28544569 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
9 0.28513741 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
11 0.28280321 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
12 0.28257507 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
13 0.28210878 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
14 0.28199753 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
15 0.2818526 73 jmlr-2013-Multicategory Large-Margin Unified Machines
17 0.2813372 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
18 0.28085005 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
19 0.28040603 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
20 0.28038046 76 jmlr-2013-Nonparametric Sparsity and Regularization