jmlr jmlr2013 jmlr2013-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
Reference: text
sentIndex sentText sentNum sentScore
1 We prove that this estimator is consistent and that its rate of convergence is optimal. [sent-5, score-0.075]
2 Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. [sent-6, score-0.11]
3 Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics 1. [sent-7, score-0.336]
4 For fixed x in Rd , our goal is to estimate the regression function m(x) = E[Y |X = x] using the data Dn . [sent-15, score-0.084]
5 A regression function estimate m_n(x) is said to be weakly consistent if the mean integrated squared error E[(m_n(X) − m(X))^2] tends to 0 as the sample size n goes to infinity, and is said to be universally weakly consistent if this property holds for all distributions of (X,Y) with E[Y^2] < ∞. [sent-16, score-0.543]
6 Then, for x in R^d, the k Nearest Neighbors (kNN) estimate for the regression function m is defined by m_n^{kNN}(x) = (1/k) ∑_{i=1}^{k} Y_{(i,n)}(x), where (X_{(1,n)}(x), Y_{(1,n)}(x)), … [sent-18, score-0.084]
7 Let us denote by N_k(x) the set of the k nearest neighbors of x in D_n, by N'_k(X_i) the set of the k nearest neighbors of X_i in (D_n \ {X_i}) ∪ {x}, and by ℳ_k(x) = { X_i ∈ N_k(x) : x ∈ N'_k(X_i) } the set of the Mutual Nearest Neighbors (MNN) of x. [sent-28, score-0.81]
8 Denoting M_k(x) = |ℳ_k(x)| the number of mutual nearest neighbors of x, M_k(x) is a random variable taking values between 0 and k. [sent-29, score-0.521]
9 The mutual nearest neighbors regression estimate is then defined as m_n(x) = (1/M_k(x)) ∑_{i : X_i ∈ ℳ_k(x)} Y_i, with the convention that 0/0 = 0. [sent-30, score-1.042]
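For concreteness, here is a minimal NumPy sketch of the estimate defined in sentences 7-9. The function name, the brute-force distance computations, and the rule that ties are resolved in favor of the query point are illustrative choices made here, not details taken from the paper.

```python
import numpy as np

def mnn_regression(x, X, Y, k):
    """Mutual Nearest Neighbors (MNN) regression estimate at a query point x.

    X is an (n, d) array of covariates, Y an (n,) array of responses,
    and k the number of neighbors. Returns the average of Y_i over the
    mutual nearest neighbors of x, with the convention 0/0 = 0.
    """
    # Distances from the query point to every sample point.
    d_x = np.linalg.norm(X - x, axis=1)
    # N_k(x): indices of the k nearest neighbors of x in the sample.
    nk_x = np.argsort(d_x)[:k]

    mutual = []
    for i in nk_x:
        # Distances from X_i to the other sample points (X_i excluded).
        d_i = np.linalg.norm(X - X[i], axis=1)
        d_i[i] = np.inf
        # x belongs to N'_k(X_i), the k nearest neighbors of X_i in the sample
        # with X_i removed and x added, iff fewer than k sample points are
        # strictly closer to X_i than x is (ties resolved in favor of x).
        if np.sum(d_i < d_x[i]) < k:
            mutual.append(i)

    return float(np.mean(Y[mutual])) if mutual else 0.0

# Toy usage on simulated data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
Y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)
print(mnn_regression(np.array([0.3, -0.2]), X, Y, k=20))
```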
10 First, in contrast to the kNN estimate, the MNN estimate is symmetric. [sent-32, score-0.063]
11 This means that, when averaging over the neighbors of x in the sample Dn , we only consider the points for which x is itself one of the k nearest neighbors. [sent-33, score-0.405]
12 Since the cost of computing the distance between a pair of d-dimensional vectors is O(d), there are n(n − 1)/2 such pairs in D_n, and (quick)sorting a vector of size n costs O(n log n), the cost of this precomputation is O((d + log n) n^2). [sent-37, score-0.12]
13 Indeed, for a new point x, computing the distances to the Xi ’s and finding the k nearest neighbors has a cost in O ((d + log k)n). [sent-39, score-0.45]
14 For the mutual nearest neighbors, one must also check, for each of these k nearest neighbors, whether x is one of its k nearest neighbors, hence an additional cost of O(k). [sent-40, score-0.797]
15 In this case, the algorithmic cost for the kNN rule is of course the same as before, that is in O ((d + log k)n), while the cost for the MNN rule is O ((k + 1)(d + log k)n) = O (k(d + log k)n). [sent-42, score-0.175]
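A possible organization of this two-phase scheme is sketched below: the quadratic precomputation stores, for every sample point, the distance to its k-th nearest neighbor, so that the mutuality test at query time costs O(1) per candidate. The class name and the tie-breaking convention (ties in favor of the query point, as in the sketch after sentence 9) are assumptions made for this illustration.

```python
import numpy as np

class MNNRegressor:
    """MNN regression with the quadratic precomputation discussed above."""

    def __init__(self, X, Y, k):
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(Y, dtype=float)
        self.k = k
        # Pairwise distances between sample points: O(d n^2) work, O(n^2) memory.
        D = np.linalg.norm(self.X[:, None, :] - self.X[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)              # a point is not its own neighbor
        # Distance from each X_i to its k-th nearest neighbor among the other X_j.
        self.kth_dist = np.sort(D, axis=1)[:, k - 1]

    def predict(self, x):
        d_x = np.linalg.norm(self.X - x, axis=1)
        # N_k(x) without a full sort (cf. the O((d + log k)n) query cost above).
        nk_x = np.argpartition(d_x, self.k)[:self.k]
        # x is a mutual neighbor of X_i iff ||x - X_i|| does not exceed the distance
        # from X_i to its k-th nearest neighbor in the sample (ties go to x).
        mutual = nk_x[d_x[nk_x] <= self.kth_dist[nk_x]]
        return float(self.Y[mutual].mean()) if mutual.size else 0.0
```

One would then call MNNRegressor(X, Y, k).predict(x) on the same toy data as in the previous sketch; the per-query work is dominated by the kNN search, with only an O(k) extra cost for the mutuality checks, as stated above.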
16 The term mutual nearest neighbors seems to date back to Chidananda Gowda and Krishna (1978, 1979) in the context of clustering. [sent-43, score-0.521]
17 Interestingly, the latter reports experimental results showing that, on standard data sets, MNN estimates perform better than standard nearest neighbors estimates as well as other widely used classification rules. [sent-48, score-0.405]
18 Without claiming that MNN estimates always outperform standard nearest neighbors estimates, a heuristic explanation for this better behavior in some situations is related to the existence of hubs in high dimensional data. [sent-49, score-0.504]
19 According to Radovanović et al. (2010), hubness is an aspect of the curse of dimensionality, as increasing the dimensionality results in the emergence of hubs under widely applicable conditions. [sent-52, score-0.099]
20 These authors have also conducted several simulations to show how the existence of “bad” hubs negatively affects the kNN classifier (see Section 7. [sent-53, score-0.099]
21 In our context, the existence of hubs might not affect the performance of MNN estimates, and one could even consider the MNN rule as a variant of the kNN rule which automatically reduces the role of these hubs. [sent-57, score-0.163]
22 However, to the best of our knowledge, little if anything is known about the theoretical properties of the mutual nearest neighbors estimator. [sent-58, score-0.521]
23 In Section 3, we go one step further and show that the rate of convergence of this estimate is, in fact, optimal when d ≥ 2. [sent-61, score-0.083]
24 Since the parameter k = kn of the estimate with the optimal rate of convergence depends on the unknown distribution of (X,Y ), especially on the smoothness of the regression function, we also present adaptive (i. [sent-62, score-0.261]
25 , data-dependent) choices for kn that preserve the minimax optimality of the estimate. [sent-64, score-0.153]
26 Consistency. To prove the consistency of the MNN estimator, we write m_n(x) = ∑_{i=1}^{n} W_i(x, X_1, … [sent-66, score-0.448]
27 This representation brings the MNN estimator into the general framework of weighted nearest neighbors, as studied for example in Stone (1977). [sent-70, score-0.24]
28 Accordingly, let us define the random variable B as the number of nearest neighbors X_i which belong to S_{x, d_{(k+1)}/2}. [sent-84, score-0.405]
29 Given d_{(k+1)} and defining p_k as p_k = µ(S_{x, d_{(k+1)}/2}) / µ(S_{x, d_{(k+1)}}), where µ stands for the law of X, we will justify in the proof of Theorem 1 that B has a binomial distribution with parameters k and p_k. [sent-85, score-0.628]
30 This latter remark is of crucial importance for showing the following consistency results as well as for establishing the rates of convergence of Section 3. [sent-87, score-0.067]
31 Theorem 1 Suppose that the distribution µ of X is absolutely continuous on Rd , that Y is bounded and that the regression function m is µ almost everywhere continuous. [sent-89, score-0.146]
32 If k → ∞, k/n → 0, and k/ log n → ∞, then mn is strongly consistent, that is mn (X) − m(X) → 0, with probability one. [sent-90, score-0.835]
33 Indeed, for µ almost every x and for every ε > 0, P(|mn (x) − m(x)| > ε) → 0, when n goes to infinity, provided that k → ∞ and k/n → 0. [sent-92, score-0.083]
34 The weak (i.e., L2) consistency of Theorem 2 below is then just a straightforward consequence of the dominated convergence theorem. [sent-95, score-0.067]
35 Notice that a standard way to prove the weak consistency of weighted nearest neighbors rules is to check the five conditions of Stone’s universal consistency theorem (see Stone, 1977, Theorem 1). [sent-96, score-0.516]
36 The additional constraints in Theorem 2 are sufficient and are in fact the same as for the layered nearest neighbor estimate studied in Biau and Devroye (2010), as well as for the affine invariant nearest neighbor estimate investigated in Biau et al. [sent-98, score-0.666]
37 Theorem 2 Suppose that the distribution µ of X is absolutely continuous on Rd , that Y is bounded and that the regression function m is µ almost everywhere continuous. [sent-100, score-0.146]
38 If k → ∞ and k/n → 0, then mn is weakly consistent, that is E[(mn (X) − m(X))2 ] → 0. [sent-101, score-0.433]
39 Indeed, an inspection of the proof of Theorem 1 indicates that consistency holds as long as, for µ almost every x, lim inf_{h→0} µ(S_{x,h/2}) / µ(S_{x,h}) > 0. [sent-103, score-0.175]
40 Interestingly, this condition is linked to the notion of doubling measure in geometric measure theory. [sent-104, score-0.183]
41 Thus, we can relax the condition of Theorems 1 and 2 to only require that the probability measure µ is asymptotically doubling almost surely. [sent-109, score-0.261]
42 It is also readily seen that any discrete probability measure is asymptotically doubling almost surely. [sent-114, score-0.261]
43 Singular continuous probability measures can also be asymptotically doubling, as is seen in the following example. [sent-115, score-0.246]
44 Hence, for every x in C, lim inf_{h→0} µ(S_{x,h/2}) / µ(S_{x,h}) ≥ 1/2, and µ is asymptotically doubling almost surely. [sent-119, score-0.356]
45 Next, we give an example of a singular continuous distribution that is not asymptotically doubling with probability one. [sent-120, score-0.207]
46 Note that X takes values in the standard Cantor ternary set C, but that the law µ of X is not the uniform law on it: obviously, in the triadic expansion of X, the 2’s are much more likely than the 0’s. [sent-122, score-0.148]
47 Nevertheless, a direct application of the Borel-Cantelli Lemma ensures that, µ almost surely, there is an infinite number of 0’s in the triadic expansion of X. [sent-123, score-0.078]
48 Then, by construction, for each N in I_x, there exists an h = h_N(x) ∈ [1/3^N, 2/3^N] such that µ_N(S_{x,h/2}) / µ_N(S_{x,h}) = 1/N, which implies lim inf_{h→0} µ(S_{x,h/2}) / µ(S_{x,h}) = 0. [sent-125, score-0.073]
49 Consequently, µ is almost surely not asymptotically doubling. [sent-126, score-0.109]
50 However, even on this pathological probability space, it is not obvious that we can define a regression function m and a distribution for Y such that the mutual nearest neighbors rule would fail to be consistent. [sent-127, score-0.608]
51 To conclude this section, let us finally notice that, in the context of adaptation to local intrinsic dimension of kNN regression, similar ideas related to the doubling property also appear in a recent paper by Kpotufe (2011). [sent-128, score-0.239]
52 …, 2002) that for the class F, the optimal minimax rate of convergence is n^{−2/(d+2)}. [sent-132, score-0.084]
53 In particular, one has lim inf_{n→∞} inf_{m̂_n} sup_{(X,Y)∈F} E[m̂_n(X) − m(X)]^2 / ( ((ρC)^d σ^2)^{2/(d+2)} n^{−2/(d+2)} ) ≥ ∆, for some positive constant ∆ independent of C, ρ and σ^2. [sent-133, score-0.524]
54 Here the infimum is taken over all estimates m̂_n, that is, over all measurable functions of the data. [sent-134, score-0.407]
55 It turns out that, for d ≥ 2 and a suitable choice of the sequence (k_n), the MNN estimate m_n achieves the optimum rate for the class F, that is, lim sup_{n→∞} sup_{(X,Y)∈F} E[m_n(X) − m(X)]^2 / ( ((ρC)^d σ^2)^{2/(d+2)} n^{−2/(d+2)} ) ≤ Λ, for some positive Λ independent of C, ρ and σ^2. [sent-135, score-0.493]
56 Let µ be a probability measure on Rd with compact support S (µ) with diameter 2ρ. [sent-137, score-0.073]
57 Here p is defined in (3) as p = inf_{(x,h) ∈ S(µ)×(0,2ρ]} µ(S_{x,h/2}) / µ(S_{x,h}). It is readily seen that if µ is absolutely continuous with density f, a sufficient condition is that there exist two strictly positive real numbers a and A such that for almost every x in S(µ), we have a ≤ f(x) ≤ A. [sent-139, score-0.113]
58 Suppose that the law µ of X has a compact support S (µ) with diameter 2ρ, and that µ is doubling, with p defined as in (3). [sent-141, score-0.108]
59 (i) E[m_n(X) − m(X)]^2 ≤ 2σ^2/(kp) + 32ρ^2 C^2 (k/n)^{2/d} + L_m^2 (1 − p)^k. [sent-145, score-0.062]
60 (ii) If d ≥ 2, there exists a sequence (k_n) with k_n ∝ n^{2/(d+2)} such that E[m_n(X) − m(X)]^2 ≤ (Λ + o(1)) ((ρC)^d σ^2 / n)^{2/(d+2)}, for some positive constant Λ independent of ρ, C and σ^2. [sent-148, score-0.123]
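The choice k_n ∝ n^{2/(d+2)} in (ii) comes from balancing the first two terms of the bound in (i). The following LaTeX sketch spells out that standard calculation; it assumes the bound reads as reconstructed above (variance term 2σ^2/(kp), bias term 32ρ^2 C^2 (k/n)^{2/d}).

```latex
% Balancing the variance and bias terms of the bound in (i):
%   E[m_n(X)-m(X)]^2 \le \frac{2\sigma^2}{kp} + 32\rho^2 C^2 \Big(\frac{k}{n}\Big)^{2/d} + L_m^2 (1-p)^k .
% With k_n \propto n^{2/(d+2)}, the two leading terms are of the same order:
\[
\frac{1}{k_n} \asymp n^{-\frac{2}{d+2}},
\qquad
\Big(\frac{k_n}{n}\Big)^{2/d}
  \asymp \Big(n^{\frac{2}{d+2}-1}\Big)^{2/d}
  = \Big(n^{-\frac{d}{d+2}}\Big)^{2/d}
  = n^{-\frac{2}{d+2}},
\]
% while the last term decays exponentially in k_n and is negligible. Optimizing the
% proportionality constant, k_n \asymp (\sigma^2/(\rho^2 C^2))^{d/(d+2)}\, n^{2/(d+2)},
% gives a bound of order ((\rho C)^d \sigma^2)^{2/(d+2)} n^{-2/(d+2)}, which matches (ii).
```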
61 This low-dimensional phenomenon is also known to hold for the traditional kNN regression estimate, which does not achieve the optimal rate in dimension 1 (see Problem 6. [sent-154, score-0.083]
62 In Corollary 1, the parameter k_n of the estimate with the optimal rate of convergence for the class F depends on the unknown distribution of (X,Y), especially on the smoothness of the regression function as measured by the Lipschitz constant C. [sent-157, score-0.261]
63 To conclude this section, we present a data-dependent way for choosing the resampling size k_n and show that, for bounded Y, the estimate with parameter chosen in such an adaptive way achieves the optimal rate of convergence. [sent-158, score-0.18]
64 The second half is used to choose k by picking k̂_n ∈ K = {1, … [sent-164, score-0.123]
65 Define the estimate m_n(x) = m_{k̂_n, ⌊n/2⌋}(x, D_n), and note that m_n depends on the entire data D_n. [sent-168, score-0.843]
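A possible implementation of this data-splitting scheme, reusing the mnn_regression sketch given after sentence 9; the grid K and the plain mean-squared validation error are assumptions made here for illustration (the exact grid in the extract above is truncated).

```python
import numpy as np

def select_k_by_splitting(X, Y, K=None):
    """Data-splitting choice of k for the MNN estimate (a sketch, not the paper's exact procedure).

    The first half of the sample is used to build the MNN estimates m_{k, floor(n/2)};
    the second half is used to pick the k in K minimizing the empirical squared error.
    """
    n = len(Y)
    m = n // 2
    X1, Y1 = X[:m], Y[:m]        # first half: defines m_{k, floor(n/2)}
    X2, Y2 = X[m:], Y[m:]        # second half: selects k
    if K is None:
        K = range(1, m + 1)      # assumed grid; the paper's K is truncated above

    def empirical_risk(k):
        # Brute-force evaluation, kept simple for clarity.
        preds = np.array([mnn_regression(x, X1, Y1, k) for x in X2])
        return float(np.mean((preds - Y2) ** 2))

    k_hat = min(K, key=empirical_risk)
    # The resulting estimate m_n(.) = m_{k_hat, floor(n/2)}(., D_n) is evaluated with
    # the first-half sample and the data-chosen k_hat, so it depends on all of D_n.
    return k_hat, lambda x: mnn_regression(x, X1, Y1, k_hat)
```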
66 If |Y | ≤ L < ∞ almost surely, a straightforward adaptation of Theorem 7. [sent-169, score-0.066]
67 Immediately from Corollary 1, we can conclude: Theorem 4 Suppose that |Y | ≤ L almost surely, and let mn be the MNN estimate with k ∈ K = {1, . [sent-172, score-0.475]
68 Thus, the expected error of the estimate obtained via data-splitting is bounded from above, up to a constant, by the corresponding minimax lower bound for the class F of regression functions, with the optimal dependence on C and ρ. [sent-177, score-0.114]
69 Setting m̃_n(x) = ∑_{i=1}^{n} W_i m(X_i), we have P(|m_n(x) − m(x)| > 2ε) ≤ P( M_k(x) < k/2^{d+1} ) + P( |m_n(x) − m̃_n(x)| > ε, M_k(x) ≥ k/2^{d+1} ) + P( |m̃_n(x) − m(x)| > ε, M_k(x) ≥ k/2^{d+1} ). [sent-182, score-0.814]
70 …1 in Cérou and Guyader, 2006) that L(X̃⋆, … [sent-197, score-0.079]
71 Next, given d_{(k+1)}, denote p_k = µ(S_{x, d_{(k+1)}/2}) / µ(S_{x, d_{(k+1)}}) = ∫_{S_{x, d_{(k+1)}/2}} f(u) du / ∫_{S_{x, d_{(k+1)}}} f(u) du, where the denominator is strictly positive since x belongs to the support of µ. [sent-204, score-0.185]
72 Concerning p_k, recall that Lebesgue’s differentiation theorem ensures that for λ-almost all x ∈ R^d, (1/λ(S_{x,δ})) ∫_{S_{x,δ}} f(u) du → f(x) when δ tends to 0 (see for example Theorem A. [sent-205, score-0.21]
73 Given δ, and defining B as the number of X_i's among the k nearest neighbors of x which belong to S_{x,δ/2}, then according to (5) the random variable B has binomial distribution B(k, p(δ)) and (1) implies M_k(x) ≥ B, so that P( M_k(x) < k/2^{d+1} | d_{(k+1)} = δ ) ≤ P( B < k/2^{d+1} | p(δ) ). [sent-216, score-0.443]
74 In this respect, Hoeffding’s inequality and (6) lead to P( B < k/2^{d+1} | p(δ) ) ≤ exp( −2 (p(δ) − 1/2^{d+1})^2 k ) ≤ exp( −k/2^{2d+3} ), which is summable in n provided that k/log n → ∞. [sent-217, score-0.077]
75 This time, we write P( |m_n(x) − m̃_n(x)| > ε, M_k(x) ≥ k/2^{d+1} ) = E[ 1{M_k(x) ≥ k/2^{d+1}} P( |m_n(x) − m̃_n(x)| > ε | X_1, … [sent-219, score-0.077]
76 Consequently, Lemma 6 in Devroye (1982) leads to P( |m_n(x) − m̃_n(x)| > ε, M_k(x) ≥ k/2^{d+1} ) ≤ 2 exp( −kε^2 / (2^{d+3}(2L^2 + Lε)) ), which is summable in n for all ε > 0, provided that k/log n → ∞. [sent-237, score-0.484]
77 Putting all things together, we have proved that for any ε > 0, if k/n → 0, then for n large enough we have P(|m_n(x) − m(x)| > 2ε) ≤ 2 exp( −k/2^{2d+3} ) + 2 exp( −kε^2 / (2^{d+3}(2L^2 + Lε)) ) + exp( −n q_0^2 / 2 ) + exp( −n q_1^2 / 2 ), which is summable in n for all ε > 0, provided that k/log n → ∞. [sent-241, score-0.077]
78 Since this is true for µ almost every x, the strong consistency is established. [sent-242, score-0.102]
79 Proof of Theorem 3. As previously, setting m̃_n(x) = ∑_{i=1}^{n} W_i m(X_i), the proof of Theorem 3 will rely on the variance/bias decomposition E[m_n(x) − m(x)]^2 = E[m_n(x) − m̃_n(x)]^2 + E[m̃_n(x) − m(x)]^2. (7) [sent-244, score-0.814]
80 The first term is easily bounded by noting that, for all x in R^d, E[m_n(x) − m̃_n(x)]^2 = E[ ( ∑_{i=1}^{n} W_i (Y_i − m(X_i)) )^2 ] = E[ ∑_{i=1}^{n} W_i^2 (Y_i − m(X_i))^2 ] = E[ ∑_{i=1}^{n} W_i^2 E[(Y_i − m(X_i))^2 | X_i] ] = E[ ∑_{i=1}^{n} W_i^2 σ^2(X_i) ] ≤ σ^2 E[ ∑_{i=1}^{n} W_i^2 ]. [sent-245, score-0.407]
81 As in the proof of Theorem 1, given d_{(k+1)}, denote p_k = µ(S_{x, d_{(k+1)}/2}) / µ(S_{x, d_{(k+1)}}), and define B as the number of X_i's among the k nearest neighbors of x which belong to S_{x, d_{(k+1)}/2}. [sent-247, score-0.59]
82 Then, given p_k, the random variable B has binomial distribution B(k, p_k), and (1) implies M_k(x) ≥ B. (8) [sent-248, score-0.408]
83 In particular, 1{M_k(x) ≠ 0} / M_k(x) ≤ 2/(1 + M_k(x)) ≤ 2/(1 + B), so that E[ ∑_{i=1}^{n} W_i^2 ] ≤ E[ E[ 2/(1 + B) | p_k ] ] = 2 E[ (1 − (1 − p_k)^{k+1}) / ((k + 1) p_k) ]. [sent-249, score-0.962]
84 Since p_k ≥ p, we are led to E[m_n(x) − m̃_n(x)]^2 ≤ 2σ^2/(kp), and integrating with respect to the distribution of X yields E[m_n(X) − m̃_n(X)]^2 ≤ 2σ^2/(kp). [sent-250, score-0.407]
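The last equality in sentence 83 is the standard expectation identity for a binomial random variable; here is a short verification in LaTeX, under the assumption (used in the proof) that B has distribution B(k, p_k) given p_k.

```latex
% For B \sim \mathcal{B}(k,p) with p \in (0,1] (amsmath assumed for \binom):
\[
\mathbb{E}\Big[\frac{1}{1+B}\Big]
  = \sum_{j=0}^{k} \frac{1}{1+j}\binom{k}{j} p^{j}(1-p)^{k-j}
  = \frac{1}{(k+1)p}\sum_{j=0}^{k} \binom{k+1}{j+1} p^{j+1}(1-p)^{k-j}
  = \frac{1-(1-p)^{k+1}}{(k+1)p},
\]
% using \frac{1}{1+j}\binom{k}{j} = \frac{1}{k+1}\binom{k+1}{j+1}. In particular
% \mathbb{E}[1/(1+B)] \le 1/((k+1)p), which is the step leading to the 2\sigma^2/(kp) bound.
```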
85 Concerning the bias term in (7), again fix x in R^d, denote by L_m an upper bound of the continuous function m on the compact S(µ), and write E[m̃_n(x) − m(x)]^2 ≤ E[ (m̃_n(x) − m(x))^2 1{M_k(x) > 0} ] + L_m^2 P(M_k(x) = 0). [sent-251, score-0.112]
86 The second term is bounded thanks to (8): P(M_k(x) = 0 | p_k) ≤ P(B = 0 | p_k) = (1 − p_k)^k, so that P(M_k(x) = 0) ≤ E[(1 − p_k)^k], and since p_k ≥ p, P(M_k(x) = 0) ≤ E[(1 − p_k)^k] ≤ (1 − p)^k. [sent-252, score-0.74]
87 Since any mutual nearest neighbor of x belongs to its k nearest neighbors, we deduce E[ (m̃_n(x) − m(x))^2 1{M_k(x) > 0} ] ≤ C^2 E[ ‖X_{(k,n)}(x) − x‖^2 ]. [sent-255, score-0.138]
88 Next, let us denote ρ = inf{ r > 0 : ∃ x_0 ∈ R^d such that S(µ) ⊂ S_{x_0, r} }, and notice that 2ρ is an upper bound of the diameter of S(µ). [sent-257, score-0.12]
89 Acknowledgments. Arnaud Guyader wishes to thank Hervé Jégou for having made him aware of the mutual nearest neighbors rule, and Frédéric Cérou for helpful discussions on the second version of the article. [sent-267, score-0.651]
90 The authors are greatly indebted to Luigi Ambrosio and Giovanni Alberti for fruitful discussions on doubling measure theory. [sent-268, score-0.183]
91 Special functions of bounded variation in doubling metric measure spaces. [sent-280, score-0.183]
92 On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. [sent-292, score-0.647]
93 On the rate of convergence of the bagged nearest neighbor estimate. [sent-298, score-0.382]
94 An affine invariant k-nearest neighbor regression estimate. [sent-305, score-0.07]
95 Agglomerative clustering using the concept of mutual nearest neighbourhood. [sent-315, score-0.335]
96 The condensed nearest neighbor rule using the concept of mutual nearest neighborhood. [sent-320, score-0.656]
97 Necessary and sufficient conditions for the pointwise convergence of nearest neighbor regression function estimates. [sent-324, score-0.37]
98 Residual variance estimation using a nearest neighbor statistic. [sent-393, score-0.289]
99 Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors. [sent-410, score-0.219]
100 Hubs in space: popular nearest neighbors in high-dimensional data. [sent-416, score-0.405]
wordName wordTfidf (topN-words)
[('mk', 0.435), ('mn', 0.407), ('mnn', 0.296), ('nearest', 0.219), ('neighbors', 0.186), ('pk', 0.185), ('doubling', 0.183), ('engartner', 0.158), ('uyader', 0.158), ('stimate', 0.138), ('utual', 0.138), ('knn', 0.136), ('dn', 0.126), ('kn', 0.123), ('earest', 0.118), ('eighbors', 0.118), ('mutual', 0.116), ('ambrosio', 0.099), ('guyader', 0.099), ('hubs', 0.099), ('devroye', 0.094), ('biau', 0.092), ('sx', 0.086), ('lm', 0.086), ('rd', 0.081), ('ibragimov', 0.079), ('rou', 0.079), ('neighbor', 0.07), ('arnaud', 0.067), ('kp', 0.062), ('du', 0.061), ('chidananda', 0.059), ('gowda', 0.059), ('radovanovi', 0.059), ('summable', 0.056), ('regression', 0.055), ('wi', 0.052), ('gy', 0.052), ('rennes', 0.051), ('gou', 0.051), ('xi', 0.049), ('nonparametric', 0.047), ('diameter', 0.047), ('nk', 0.046), ('inf', 0.044), ('stone', 0.041), ('consistency', 0.041), ('aviation', 0.039), ('bagged', 0.039), ('cantor', 0.039), ('liiti', 0.039), ('randolph', 0.039), ('ternary', 0.039), ('triadic', 0.039), ('usaf', 0.039), ('almost', 0.039), ('asymptotically', 0.039), ('binomial', 0.038), ('law', 0.035), ('alamos', 0.034), ('inen', 0.034), ('contrarily', 0.034), ('rule', 0.032), ('surely', 0.031), ('minimax', 0.03), ('nauk', 0.03), ('precomputation', 0.03), ('layered', 0.03), ('discriminatory', 0.03), ('qin', 0.03), ('nick', 0.03), ('convention', 0.03), ('theorem', 0.029), ('notice', 0.029), ('estimate', 0.029), ('lim', 0.029), ('rate', 0.028), ('los', 0.028), ('neighbour', 0.028), ('absolutely', 0.028), ('adaptation', 0.027), ('convergence', 0.026), ('krzy', 0.026), ('compact', 0.026), ('weakly', 0.026), ('differentiation', 0.025), ('cost', 0.024), ('continuous', 0.024), ('ix', 0.023), ('un', 0.023), ('every', 0.022), ('ey', 0.021), ('assertion', 0.021), ('sorting', 0.021), ('log', 0.021), ('xk', 0.021), ('estimator', 0.021), ('vd', 0.021), ('cn', 0.02), ('lipschitz', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
2 0.1571499 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
3 0.10331292 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
4 0.078606822 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
5 0.075032294 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
6 0.073945895 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
7 0.058704253 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
8 0.053585846 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
9 0.048112474 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
10 0.046948008 104 jmlr-2013-Sparse Single-Index Model
11 0.043552916 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
12 0.042822484 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
13 0.041255988 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
14 0.037560351 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
15 0.035994984 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.034083791 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
17 0.033957757 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
18 0.031901632 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
19 0.030934064 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
20 0.03051788 59 jmlr-2013-Large-scale SVD and Manifold Learning
topicId topicWeight
[(0, -0.178), (1, 0.051), (2, 0.075), (3, 0.02), (4, -0.04), (5, -0.104), (6, -0.046), (7, 0.102), (8, 0.072), (9, -0.054), (10, -0.15), (11, -0.001), (12, 0.284), (13, -0.011), (14, -0.236), (15, 0.158), (16, -0.023), (17, 0.155), (18, 0.101), (19, -0.026), (20, 0.183), (21, -0.029), (22, -0.248), (23, -0.093), (24, -0.096), (25, 0.11), (26, -0.112), (27, 0.096), (28, 0.005), (29, -0.01), (30, 0.145), (31, -0.069), (32, 0.128), (33, -0.086), (34, -0.011), (35, -0.13), (36, -0.054), (37, 0.01), (38, 0.12), (39, 0.009), (40, -0.062), (41, 0.06), (42, -0.057), (43, 0.06), (44, 0.002), (45, 0.027), (46, 0.012), (47, 0.005), (48, -0.049), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9609381 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
2 0.67285526 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
3 0.52091527 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
4 0.46849045 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
5 0.36685079 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.33285266 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
7 0.32381266 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
8 0.29335338 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
9 0.25693727 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
10 0.24851482 104 jmlr-2013-Sparse Single-Index Model
11 0.23250315 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
12 0.22686522 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
13 0.22656858 68 jmlr-2013-Machine Learning with Operational Costs
14 0.21654072 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
15 0.21577331 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
16 0.20786774 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
17 0.20358342 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
18 0.20018625 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
19 0.19754419 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.1836686 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
topicId topicWeight
[(5, 0.111), (6, 0.021), (10, 0.061), (20, 0.016), (23, 0.049), (68, 0.021), (70, 0.036), (75, 0.031), (85, 0.048), (87, 0.017), (95, 0.48)]
simIndex simValue paperId paperTitle
1 0.79476452 95 jmlr-2013-Ranking Forests
Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis
Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm R ANDOM F OREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Cl´ mencon and Vayatis (2009c). The present work e ¸ describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called R ANK ING F OREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization
same-paper 2 0.76298648 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
3 0.2919271 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.29123574 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.28877452 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.28806934 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
7 0.28665572 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
9 0.2852309 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
10 0.28287515 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.28213927 73 jmlr-2013-Multicategory Large-Margin Unified Machines
12 0.28133667 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
13 0.28120688 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
15 0.27891997 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
16 0.27886078 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
17 0.27883261 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
18 0.27882656 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
19 0.27838516 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
20 0.2778016 68 jmlr-2013-Machine Learning with Operational Costs