jmlr jmlr2013 jmlr2013-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
Reference: text
sentIndex sentText sentNum sentScore
1 We prove that this estimator is consistent and that its rate of convergence is optimal. [sent-5, score-0.075]
2 Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. [sent-6, score-0.11]
3 Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics 1. [sent-7, score-0.336]
4 For fixed x in Rd , our goal is to estimate the regression function m(x) = E[Y |X = x] using the data Dn . [sent-15, score-0.084]
5 A regression function estimate m_n(x) is said to be weakly consistent if the mean integrated squared error E[(m_n(X) − m(X))^2] tends to 0 as the sample size n goes to infinity, and is said to be universally weakly consistent if this property holds for all distributions of (X,Y) with E[Y^2] < ∞. [sent-16, score-0.543]
6 Then, for x in R^d, the k Nearest Neighbors (kNN) estimate for the regression function m is defined by m_n^{kNN}(x) = (1/k) ∑_{i=1}^{k} Y_{(i,n)}(x), where (X_{(1,n)}(x), Y_{(1,n)}(x)), … [sent-18, score-0.084]
7 Let us denote by N_k(x) the set of the k nearest neighbors of x in D_n, by N'_k(X_i) the set of the k nearest neighbors of X_i in (D_n \ {X_i}) ∪ {x}, and by ℳ_k(x) = { X_i ∈ N_k(x) : x ∈ N'_k(X_i) } the set of the Mutual Nearest Neighbors (MNN) of x. [sent-28, score-0.81]
8 Denoting M_k(x) = |ℳ_k(x)| the number of mutual nearest neighbors of x, M_k(x) is a random variable taking values between 0 and k. [sent-29, score-0.521]
9 The mutual nearest neighbors regression estimate is then defined as m_n(x) = (1/M_k(x)) ∑_{i : X_i ∈ ℳ_k(x)} Y_i, with the convention that 0/0 = 0. [sent-30, score-1.042]
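For concreteness, here is a minimal NumPy sketch of the estimate defined in sentences 7-9. The function name, the brute-force distance computations, and the rule that ties are resolved in favor of the query point are illustrative choices made here, not details taken from the paper.

```python
import numpy as np

def mnn_regression(x, X, Y, k):
    """Mutual Nearest Neighbors (MNN) regression estimate at a query point x.

    X is an (n, d) array of covariates, Y an (n,) array of responses,
    and k the number of neighbors. Returns the average of Y_i over the
    mutual nearest neighbors of x, with the convention 0/0 = 0.
    """
    # Distances from the query point to every sample point.
    d_x = np.linalg.norm(X - x, axis=1)
    # N_k(x): indices of the k nearest neighbors of x in the sample.
    nk_x = np.argsort(d_x)[:k]

    mutual = []
    for i in nk_x:
        # Distances from X_i to the other sample points (X_i excluded).
        d_i = np.linalg.norm(X - X[i], axis=1)
        d_i[i] = np.inf
        # x belongs to N'_k(X_i), the k nearest neighbors of X_i in the sample
        # with X_i removed and x added, iff fewer than k sample points are
        # strictly closer to X_i than x is (ties resolved in favor of x).
        if np.sum(d_i < d_x[i]) < k:
            mutual.append(i)

    return float(np.mean(Y[mutual])) if mutual else 0.0

# Toy usage on simulated data.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
Y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)
print(mnn_regression(np.array([0.3, -0.2]), X, Y, k=20))
```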
10 First, in contrast to the kNN estimate, the MNN estimate is symmetric. [sent-32, score-0.063]
11 This means that, when averaging over the neighbors of x in the sample Dn , we only consider the points for which x is itself one of the k nearest neighbors. [sent-33, score-0.405]
12 Since the cost of computing the distance between a pair of d-dimensional vectors is O(d), there are n(n − 1)/2 such pairs in D_n, and (quick)sorting a vector of size n costs O(n log n), the cost of this precomputation is O((d + log n) n^2). [sent-37, score-0.12]
13 Indeed, for a new point x, computing the distances to the Xi ’s and finding the k nearest neighbors has a cost in O ((d + log k)n). [sent-39, score-0.45]
14 For the mutual nearest neighbors, one must also check, for each of these k nearest neighbors, whether x is one of its k nearest neighbors, hence an additional cost of O(k). [sent-40, score-0.797]
15 In this case, the algorithmic cost for the kNN rule is of course the same as before, that is in O ((d + log k)n), while the cost for the MNN rule is O ((k + 1)(d + log k)n) = O (k(d + log k)n). [sent-42, score-0.175]
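A possible organization of this two-phase scheme is sketched below: the quadratic precomputation stores, for every sample point, the distance to its k-th nearest neighbor, so that the mutuality test at query time costs O(1) per candidate. The class name and the tie-breaking convention (ties in favor of the query point, as in the sketch after sentence 9) are assumptions made for this illustration.

```python
import numpy as np

class MNNRegressor:
    """MNN regression with the quadratic precomputation discussed above."""

    def __init__(self, X, Y, k):
        self.X = np.asarray(X, dtype=float)
        self.Y = np.asarray(Y, dtype=float)
        self.k = k
        # Pairwise distances between sample points: O(d n^2) work, O(n^2) memory.
        D = np.linalg.norm(self.X[:, None, :] - self.X[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)              # a point is not its own neighbor
        # Distance from each X_i to its k-th nearest neighbor among the other X_j.
        self.kth_dist = np.sort(D, axis=1)[:, k - 1]

    def predict(self, x):
        d_x = np.linalg.norm(self.X - x, axis=1)
        # N_k(x) without a full sort (cf. the O((d + log k)n) query cost above).
        nk_x = np.argpartition(d_x, self.k)[:self.k]
        # x is a mutual neighbor of X_i iff ||x - X_i|| does not exceed the distance
        # from X_i to its k-th nearest neighbor in the sample (ties go to x).
        mutual = nk_x[d_x[nk_x] <= self.kth_dist[nk_x]]
        return float(self.Y[mutual].mean()) if mutual.size else 0.0
```

One would then call MNNRegressor(X, Y, k).predict(x) on the same toy data as in the previous sketch; the per-query work is dominated by the kNN search, with only an O(k) extra cost for the mutuality checks, as stated above.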
16 The term mutual nearest neighbors seems to date back to Chidananda Gowda and Krishna (1978, 1979) in the context of clustering. [sent-43, score-0.521]
17 Interestingly, the latter reports experimental results showing that, on standard data sets, MNN estimates perform better than standard nearest neighbors estimates as well as other widely used classification rules. [sent-48, score-0.405]
18 Without claiming that MNN estimates always outperform standard nearest neighbors estimates, a heuristic explanation for this better behavior in some situations is related to the existence of hubs in high dimensional data. [sent-49, score-0.504]
19 According to Radovanović et al. (2010), hubness is an aspect of the curse of dimensionality, as increasing the dimensionality results in the emergence of hubs under widely applicable conditions. [sent-52, score-0.099]
20 These authors have also conducted several simulations to show how the existence of “bad” hubs negatively affects the kNN classifier (see Section 7. [sent-53, score-0.099]
21 In our context, the existence of hubs might not affect the performance of MNN estimates, and one could even consider the MNN rule as a variant of the kNN rule which automatically reduces the role of these hubs. [sent-57, score-0.163]
22 However, to the best of our knowledge, little if anything is known about the theoretical properties of the mutual nearest neighbors estimator. [sent-58, score-0.521]
23 In Section 3, we go one step further and show that the rate of convergence of this estimate is, in fact, optimal when d ≥ 2. [sent-61, score-0.083]
24 Since the parameter k = kn of the estimate with the optimal rate of convergence depends on the unknown distribution of (X,Y ), especially on the smoothness of the regression function, we also present adaptive (i. [sent-62, score-0.261]
25 , data-dependent) choices for kn that preserve the minimax optimality of the estimate. [sent-64, score-0.153]
26 Consistency. To prove the consistency of the MNN estimator, we write m_n(x) = ∑_{i=1}^{n} W_i(x, X_1, … [sent-66, score-0.448]
27 This representation brings the MNN estimator into the general framework of weighted nearest neighbors, as studied for example in Stone (1977). [sent-70, score-0.24]
28 Accordingly, let us define the random variable B as the number of nearest neighbors X_i which belong to S_{x, d_{(k+1)}/2}. [sent-84, score-0.405]
29 Given d_{(k+1)} and defining p_k as p_k = µ(S_{x, d_{(k+1)}/2}) / µ(S_{x, d_{(k+1)}}), where µ stands for the law of X, we will justify in the proof of Theorem 1 that B has a binomial distribution with parameters k and p_k. [sent-85, score-0.628]
30 This latter remark is of crucial importance for showing the following consistency results as well as for establishing the rates of convergence of Section 3. [sent-87, score-0.067]
31 Theorem 1 Suppose that the distribution µ of X is absolutely continuous on Rd , that Y is bounded and that the regression function m is µ almost everywhere continuous. [sent-89, score-0.146]
32 If k → ∞, k/n → 0, and k/ log n → ∞, then mn is strongly consistent, that is mn (X) − m(X) → 0, with probability one. [sent-90, score-0.835]
33 Indeed, for µ almost every x and for every ε > 0, P(|mn (x) − m(x)| > ε) → 0, when n goes to infinity, provided that k → ∞ and k/n → 0. [sent-92, score-0.083]
34 The weak (i.e., L2) consistency of Theorem 2 below is then just a straightforward consequence of the dominated convergence theorem. [sent-95, score-0.067]
35 Notice that a standard way to prove the weak consistency of weighted nearest neighbors rules is to check the five conditions of Stone’s universal consistency theorem (see Stone, 1977, Theorem 1). [sent-96, score-0.516]
36 The additional constraints in Theorem 2 are sufficient and are in fact the same as for the layered nearest neighbor estimate studied in Biau and Devroye (2010), as well as for the affine invariant nearest neighbor estimate investigated in Biau et al. [sent-98, score-0.666]
37 Theorem 2 Suppose that the distribution µ of X is absolutely continuous on Rd , that Y is bounded and that the regression function m is µ almost everywhere continuous. [sent-100, score-0.146]
38 If k → ∞ and k/n → 0, then mn is weakly consistent, that is E[(mn (X) − m(X))2 ] → 0. [sent-101, score-0.433]
39 Indeed, an inspection of the proof of Theorem 1 indicates that consistency holds as long as, for µ almost every x, lim inf_{h→0} µ(S_{x,h/2}) / µ(S_{x,h}) > 0. [sent-103, score-0.175]
40 Interestingly, this condition is linked to the notion of doubling measure in geometric measure theory. [sent-104, score-0.183]
41 Thus, we can relax the condition of Theorems 1 and 2 to only require that the probability measure µ is asymptotically doubling almost surely. [sent-109, score-0.261]
42 It is also readily seen that any discrete probability measure is asymptotically doubling almost surely. [sent-114, score-0.261]
43 Singular continuous probability measures can also be asymptotically doubling, as is seen in the following example. [sent-115, score-0.246]
44 Hence, for every x in C, lim inf_{h→0} µ(S_{x,h/2}) / µ(S_{x,h}) ≥ 1/2, and µ is asymptotically doubling almost surely. [sent-119, score-0.356]
45 Next, we give an example of a singular continuous distribution that is not asymptotically doubling with probability one. [sent-120, score-0.207]
46 Note that X takes values in the standard Cantor ternary set C, but that the law µ of X is not the uniform law on it: obviously, in the triadic expansion of X, the 2’s are much more likely than the 0’s. [sent-122, score-0.148]
47 Nevertheless, a direct application of the Borel-Cantelli Lemma ensures that, µ almost surely, there is an infinite number of 0’s in the triadic expansion of X. [sent-123, score-0.078]
48 Then, by construction, for each N in I_x, there exists an h = h_N(x) ∈ [1/3^N, 2/3^N] such that µ_N(S_{x,h/2}) / µ_N(S_{x,h}) = 1/N, which implies lim inf_{h→0} µ(S_{x,h/2}) / µ(S_{x,h}) = 0. [sent-125, score-0.073]
49 Consequently, µ is almost surely not asymptotically doubling. [sent-126, score-0.109]
50 However, even on this pathological probability space, it is not obvious that we can define a regression function m and a distribution for Y such that the mutual nearest neighbors rule would fail to be consistent. [sent-127, score-0.608]
51 To conclude this section, let us finally notice that, in the context of adaptation to local intrinsic dimension of kNN regression, similar ideas related to the doubling property also appear in a recent paper by Kpotufe (2011). [sent-128, score-0.239]
52 …, 2002) that for the class F, the optimal minimax rate of convergence is n^{−2/(d+2)}. [sent-132, score-0.084]
53 In particular, one has lim inf_{n→∞} inf_{m̂_n} sup_{(X,Y)∈F} E[m̂_n(X) − m(X)]^2 / ( ((ρC)^d σ^2)^{2/(d+2)} n^{−2/(d+2)} ) ≥ ∆, for some positive constant ∆ independent of C, ρ and σ^2. [sent-133, score-0.524]
54 Here the infimum is taken over all estimates m̂_n, that is, over all measurable functions of the data. [sent-134, score-0.407]
55 It turns out that, for d ≥ 2 and a suitable choice of the sequence (k_n), the MNN estimate m_n achieves the optimum rate for the class F, that is, lim sup_{n→∞} sup_{(X,Y)∈F} E[m_n(X) − m(X)]^2 / ( ((ρC)^d σ^2)^{2/(d+2)} n^{−2/(d+2)} ) ≤ Λ, for some positive Λ independent of C, ρ and σ^2. [sent-135, score-0.493]
56 Let µ be a probability measure on Rd with compact support S (µ) with diameter 2ρ. [sent-137, score-0.073]
57 Here p is defined in (3) as p = inf_{(x,h) ∈ S(µ)×(0,2ρ]} µ(S_{x,h/2}) / µ(S_{x,h}). It is readily seen that if µ is absolutely continuous with density f, a sufficient condition is that there exist two strictly positive real numbers a and A such that for almost every x in S(µ), we have a ≤ f(x) ≤ A. [sent-139, score-0.113]
58 Suppose that the law µ of X has a compact support S (µ) with diameter 2ρ, and that µ is doubling, with p defined as in (3). [sent-141, score-0.108]
59 (i) E[m_n(X) − m(X)]^2 ≤ 2σ^2/(kp) + 32ρ^2 C^2 (k/n)^{2/d} + L_m^2 (1 − p)^k. [sent-145, score-0.062]
60 (ii) If d ≥ 2, there exists a sequence (k_n) with k_n ∝ n^{2/(d+2)} such that E[m_n(X) − m(X)]^2 ≤ (Λ + o(1)) ((ρC)^d σ^2 / n)^{2/(d+2)}, for some positive constant Λ independent of ρ, C and σ^2. [sent-148, score-0.123]
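The choice k_n ∝ n^{2/(d+2)} in (ii) comes from balancing the first two terms of the bound in (i). The following LaTeX sketch spells out that standard calculation; it assumes the bound reads as reconstructed above (variance term 2σ^2/(kp), bias term 32ρ^2 C^2 (k/n)^{2/d}).

```latex
% Balancing the variance and bias terms of the bound in (i):
%   E[m_n(X)-m(X)]^2 \le \frac{2\sigma^2}{kp} + 32\rho^2 C^2 \Big(\frac{k}{n}\Big)^{2/d} + L_m^2 (1-p)^k .
% With k_n \propto n^{2/(d+2)}, the two leading terms are of the same order:
\[
\frac{1}{k_n} \asymp n^{-\frac{2}{d+2}},
\qquad
\Big(\frac{k_n}{n}\Big)^{2/d}
  \asymp \Big(n^{\frac{2}{d+2}-1}\Big)^{2/d}
  = \Big(n^{-\frac{d}{d+2}}\Big)^{2/d}
  = n^{-\frac{2}{d+2}},
\]
% while the last term decays exponentially in k_n and is negligible. Optimizing the
% proportionality constant, k_n \asymp (\sigma^2/(\rho^2 C^2))^{d/(d+2)}\, n^{2/(d+2)},
% gives a bound of order ((\rho C)^d \sigma^2)^{2/(d+2)} n^{-2/(d+2)}, which matches (ii).
```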
61 This low-dimensional phenomenon is also known to hold for the traditional kNN regression estimate, which does not achieve the optimal rate in dimension 1 (see Problem 6. [sent-154, score-0.083]
62 In Corollary 1, the parameter k_n of the estimate with the optimal rate of convergence for the class F depends on the unknown distribution of (X,Y), especially on the smoothness of the regression function as measured by the Lipschitz constant C. [sent-157, score-0.261]
63 To conclude this section, we present a data-dependent way for choosing the resampling size k_n and show that, for bounded Y, the estimate with parameter chosen in such an adaptive way achieves the optimal rate of convergence. [sent-158, score-0.18]
64 The second half is used to choose k by picking k̂_n ∈ K = {1, … [sent-164, score-0.123]
65 Define the estimate m_n(x) = m_{k̂_n, ⌊n/2⌋}(x, D_n), and note that m_n depends on the entire data D_n. [sent-168, score-0.843]
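A possible implementation of this data-splitting scheme, reusing the mnn_regression sketch given after sentence 9; the grid K and the plain mean-squared validation error are assumptions made here for illustration (the exact grid in the extract above is truncated).

```python
import numpy as np

def select_k_by_splitting(X, Y, K=None):
    """Data-splitting choice of k for the MNN estimate (a sketch, not the paper's exact procedure).

    The first half of the sample is used to build the MNN estimates m_{k, floor(n/2)};
    the second half is used to pick the k in K minimizing the empirical squared error.
    """
    n = len(Y)
    m = n // 2
    X1, Y1 = X[:m], Y[:m]        # first half: defines m_{k, floor(n/2)}
    X2, Y2 = X[m:], Y[m:]        # second half: selects k
    if K is None:
        K = range(1, m + 1)      # assumed grid; the paper's K is truncated above

    def empirical_risk(k):
        # Brute-force evaluation, kept simple for clarity.
        preds = np.array([mnn_regression(x, X1, Y1, k) for x in X2])
        return float(np.mean((preds - Y2) ** 2))

    k_hat = min(K, key=empirical_risk)
    # The resulting estimate m_n(.) = m_{k_hat, floor(n/2)}(., D_n) is evaluated with
    # the first-half sample and the data-chosen k_hat, so it depends on all of D_n.
    return k_hat, lambda x: mnn_regression(x, X1, Y1, k_hat)
```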
66 If |Y | ≤ L < ∞ almost surely, a straightforward adaptation of Theorem 7. [sent-169, score-0.066]
67 Immediately from Corollary 1, we can conclude: Theorem 4 Suppose that |Y | ≤ L almost surely, and let mn be the MNN estimate with k ∈ K = {1, . [sent-172, score-0.475]
68 Thus, the expected error of the estimate obtained via data-splitting is bounded from above, up to a constant, by the corresponding minimax lower bound for the class F of regression functions, with the optimal dependence on C and ρ. [sent-177, score-0.114]
69 Setting m̃_n(x) = ∑_{i=1}^{n} W_i m(X_i), we have P(|m_n(x) − m(x)| > 2ε) ≤ P( M_k(x) < k/2^{d+1} ) + P( |m_n(x) − m̃_n(x)| > ε, M_k(x) ≥ k/2^{d+1} ) + P( |m̃_n(x) − m(x)| > ε, M_k(x) ≥ k/2^{d+1} ). [sent-182, score-0.814]
70 …1 in Cérou and Guyader, 2006) that L(X̃⋆, … [sent-197, score-0.079]
71 Next, given d_{(k+1)}, denote p_k = µ(S_{x, d_{(k+1)}/2}) / µ(S_{x, d_{(k+1)}}) = ∫_{S_{x, d_{(k+1)}/2}} f(u) du / ∫_{S_{x, d_{(k+1)}}} f(u) du, where the denominator is strictly positive since x belongs to the support of µ. [sent-204, score-0.185]
72 Concerning p_k, recall that Lebesgue’s differentiation theorem ensures that for λ-almost all x ∈ R^d, (1/λ(S_{x,δ})) ∫_{S_{x,δ}} f(u) du → f(x) when δ tends to 0 (see for example Theorem A. [sent-205, score-0.21]
73 Given δ, and defining B as the number of X_i's among the k nearest neighbors of x which belong to S_{x,δ/2}, then according to (5) the random variable B has binomial distribution B(k, p(δ)) and (1) implies M_k(x) ≥ B, so that P( M_k(x) < k/2^{d+1} | d_{(k+1)} = δ ) ≤ P( B < k/2^{d+1} | p(δ) ). [sent-216, score-0.443]
74 In this respect, Hoeffding’s inequality and (6) lead to P( B < k/2^{d+1} | p(δ) ) ≤ exp( −2 (p(δ) − 1/2^{d+1})^2 k ) ≤ exp( −k/2^{2d+3} ), which is summable in n provided that k/log n → ∞. [sent-217, score-0.077]
75 This time, we write P( |m_n(x) − m̃_n(x)| > ε, M_k(x) ≥ k/2^{d+1} ) = E[ 1{M_k(x) ≥ k/2^{d+1}} P( |m_n(x) − m̃_n(x)| > ε | X_1, … [sent-219, score-0.077]
76 Consequently, Lemma 6 in Devroye (1982) leads to P( |m_n(x) − m̃_n(x)| > ε, M_k(x) ≥ k/2^{d+1} ) ≤ 2 exp( −kε^2 / (2^{d+3}(2L^2 + Lε)) ), which is summable in n for all ε > 0, provided that k/log n → ∞. [sent-237, score-0.484]
77 Putting all things together, we have proved that for any ε > 0, if k/n → 0, then for n large enough we have P(|m_n(x) − m(x)| > 2ε) ≤ 2 exp( −k/2^{2d+3} ) + 2 exp( −kε^2 / (2^{d+3}(2L^2 + Lε)) ) + exp( −n q_0^2 / 2 ) + exp( −n q_1^2 / 2 ), which is summable in n for all ε > 0, provided that k/log n → ∞. [sent-241, score-0.077]
78 Since this is true for µ almost every x, the strong consistency is established. [sent-242, score-0.102]
79 Proof of Theorem 3. As previously, setting m̃_n(x) = ∑_{i=1}^{n} W_i m(X_i), the proof of Theorem 3 will rely on the variance/bias decomposition E[m_n(x) − m(x)]^2 = E[m_n(x) − m̃_n(x)]^2 + E[m̃_n(x) − m(x)]^2. (7) [sent-244, score-0.814]
80 The first term is easily bounded by noting that, for all x in R^d, E[m_n(x) − m̃_n(x)]^2 = E[ ( ∑_{i=1}^{n} W_i (Y_i − m(X_i)) )^2 ] = E[ ∑_{i=1}^{n} W_i^2 (Y_i − m(X_i))^2 ] = E[ ∑_{i=1}^{n} W_i^2 E[(Y_i − m(X_i))^2 | X_i] ] = E[ ∑_{i=1}^{n} W_i^2 σ^2(X_i) ] ≤ σ^2 E[ ∑_{i=1}^{n} W_i^2 ]. [sent-245, score-0.407]
81 As in the proof of Theorem 1, given d_{(k+1)}, denote p_k = µ(S_{x, d_{(k+1)}/2}) / µ(S_{x, d_{(k+1)}}), and define B as the number of X_i's among the k nearest neighbors of x which belong to S_{x, d_{(k+1)}/2}. [sent-247, score-0.59]
82 Then, given p_k, the random variable B has binomial distribution B(k, p_k), and (1) implies M_k(x) ≥ B. (8) [sent-248, score-0.408]
83 In particular, 1{M_k(x) ≠ 0} / M_k(x) ≤ 2/(1 + M_k(x)) ≤ 2/(1 + B), so that E[ ∑_{i=1}^{n} W_i^2 ] ≤ E[ E[ 2/(1 + B) | p_k ] ] = 2 E[ (1 − (1 − p_k)^{k+1}) / ((k + 1) p_k) ]. [sent-249, score-0.962]
84 Since p_k ≥ p, we are led to E[m_n(x) − m̃_n(x)]^2 ≤ 2σ^2/(kp), and integrating with respect to the distribution of X yields E[m_n(X) − m̃_n(X)]^2 ≤ 2σ^2/(kp). [sent-250, score-0.407]
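The last equality in sentence 83 is the standard expectation identity for a binomial random variable; here is a short verification in LaTeX, under the assumption (used in the proof) that B has distribution B(k, p_k) given p_k.

```latex
% For B \sim \mathcal{B}(k,p) with p \in (0,1] (amsmath assumed for \binom):
\[
\mathbb{E}\Big[\frac{1}{1+B}\Big]
  = \sum_{j=0}^{k} \frac{1}{1+j}\binom{k}{j} p^{j}(1-p)^{k-j}
  = \frac{1}{(k+1)p}\sum_{j=0}^{k} \binom{k+1}{j+1} p^{j+1}(1-p)^{k-j}
  = \frac{1-(1-p)^{k+1}}{(k+1)p},
\]
% using \frac{1}{1+j}\binom{k}{j} = \frac{1}{k+1}\binom{k+1}{j+1}. In particular
% \mathbb{E}[1/(1+B)] \le 1/((k+1)p), which is the step leading to the 2\sigma^2/(kp) bound.
```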
85 Concerning the bias term in (7), again fix x in R^d, denote by L_m an upper bound of the continuous function m on the compact S(µ), and write E[m̃_n(x) − m(x)]^2 ≤ E[ (m̃_n(x) − m(x))^2 1{M_k(x) > 0} ] + L_m^2 P(M_k(x) = 0). [sent-251, score-0.112]
86 The second term is bounded thanks to (8): P(M_k(x) = 0 | p_k) ≤ P(B = 0 | p_k) = (1 − p_k)^k, so that P(M_k(x) = 0) ≤ E[(1 − p_k)^k], and since p_k ≥ p, P(M_k(x) = 0) ≤ E[(1 − p_k)^k] ≤ (1 − p)^k. [sent-252, score-0.74]
87 Since any mutual nearest neighbor of x belongs to its k nearest neighbors, we deduce E[ (m̃_n(x) − m(x))^2 1{M_k(x) > 0} ] ≤ C^2 E[ ‖X_{(k,n)}(x) − x‖^2 ]. [sent-255, score-0.138]
88 Next, let us denote ρ = inf{ r > 0 : ∃ x_0 ∈ R^d such that S(µ) ⊂ S_{x_0, r} }, and notice that 2ρ is an upper bound of the diameter of S(µ). [sent-257, score-0.12]
89 Acknowledgments. Arnaud Guyader wishes to thank Hervé Jégou for having made him aware of the mutual nearest neighbors rule, and Frédéric Cérou for helpful discussions on the second version of the article. [sent-267, score-0.651]
90 The authors are greatly indebted to Luigi Ambrosio and Giovanni Alberti for fruitful discussions on doubling measure theory. [sent-268, score-0.183]
91 Special functions of bounded variation in doubling metric measure spaces. [sent-280, score-0.183]
92 On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. [sent-292, score-0.647]
93 On the rate of convergence of the bagged nearest neighbor estimate. [sent-298, score-0.382]
94 An affine invariant k-nearest neighbor regression estimate. [sent-305, score-0.07]
95 Agglomerative clustering using the concept of mutual nearest neighbourhood. [sent-315, score-0.335]
96 The condensed nearest neighbor rule using the concept of mutual nearest neighborhood. [sent-320, score-0.656]
97 Necessary and sufficient conditions for the pointwise convergence of nearest neighbor regression function estimates. [sent-324, score-0.37]
98 Residual variance estimation using a nearest neighbor statistic. [sent-393, score-0.289]
99 Hello neighbor: accurate object retrieval with k-reciprocal nearest neighbors. [sent-410, score-0.219]
100 Hubs in space: popular nearest neighbors in high-dimensional data. [sent-416, score-0.405]
wordName wordTfidf (topN-words)
[('mk', 0.435), ('mn', 0.407), ('mnn', 0.296), ('nearest', 0.219), ('neighbors', 0.186), ('pk', 0.185), ('doubling', 0.183), ('engartner', 0.158), ('uyader', 0.158), ('stimate', 0.138), ('utual', 0.138), ('knn', 0.136), ('dn', 0.126), ('kn', 0.123), ('earest', 0.118), ('eighbors', 0.118), ('mutual', 0.116), ('ambrosio', 0.099), ('guyader', 0.099), ('hubs', 0.099), ('devroye', 0.094), ('biau', 0.092), ('sx', 0.086), ('lm', 0.086), ('rd', 0.081), ('ibragimov', 0.079), ('rou', 0.079), ('neighbor', 0.07), ('arnaud', 0.067), ('kp', 0.062), ('du', 0.061), ('chidananda', 0.059), ('gowda', 0.059), ('radovanovi', 0.059), ('summable', 0.056), ('regression', 0.055), ('wi', 0.052), ('gy', 0.052), ('rennes', 0.051), ('gou', 0.051), ('xi', 0.049), ('nonparametric', 0.047), ('diameter', 0.047), ('nk', 0.046), ('inf', 0.044), ('stone', 0.041), ('consistency', 0.041), ('aviation', 0.039), ('bagged', 0.039), ('cantor', 0.039), ('liiti', 0.039), ('randolph', 0.039), ('ternary', 0.039), ('triadic', 0.039), ('usaf', 0.039), ('almost', 0.039), ('asymptotically', 0.039), ('binomial', 0.038), ('law', 0.035), ('alamos', 0.034), ('inen', 0.034), ('contrarily', 0.034), ('rule', 0.032), ('surely', 0.031), ('minimax', 0.03), ('nauk', 0.03), ('precomputation', 0.03), ('layered', 0.03), ('discriminatory', 0.03), ('qin', 0.03), ('nick', 0.03), ('convention', 0.03), ('theorem', 0.029), ('notice', 0.029), ('estimate', 0.029), ('lim', 0.029), ('rate', 0.028), ('los', 0.028), ('neighbour', 0.028), ('absolutely', 0.028), ('adaptation', 0.027), ('convergence', 0.026), ('krzy', 0.026), ('compact', 0.026), ('weakly', 0.026), ('differentiation', 0.025), ('cost', 0.024), ('continuous', 0.024), ('ix', 0.023), ('un', 0.023), ('every', 0.022), ('ey', 0.021), ('assertion', 0.021), ('sorting', 0.021), ('log', 0.021), ('xk', 0.021), ('estimator', 0.021), ('vd', 0.021), ('cn', 0.02), ('lipschitz', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
2 0.1571499 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
3 0.10331292 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
4 0.078606822 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
5 0.075032294 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
Author: Bruno Scherrer
Abstract: We consider the discrete-time infinite-horizon optimal control problem formalized by Markov decision processes (Puterman, 1994; Bertsekas and Tsitsiklis, 1996). We revisit the work of Bertsekas and Ioffe (1996), that introduced λ policy iteration—a family of algorithms parametrized by a parameter λ—that generalizes the standard algorithms value and policy iteration, and has some deep connections with the temporal-difference algorithms described by Sutton and Barto (1998). We deepen the original theory developed by the authors by providing convergence rate bounds which generalize standard bounds for value iteration described for instance by Puterman (1994). Then, the main contribution of this paper is to develop the theory of this algorithm when it is used in an approximate form. We extend and unify the separate analyzes developed by Munos for approximate value iteration (Munos, 2007) and approximate policy iteration (Munos, 2003), and provide performance bounds in the discounted and the undiscounted situations. Finally, we revisit the use of this algorithm in the training of a Tetris playing controller as originally done by Bertsekas and Ioffe (1996). Our empirical results are different from those of Bertsekas and Ioffe (which were originally qualified as “paradoxical” and “intriguing”). We track down the reason to be a minor implementation error of the algorithm, which suggests that, in practice, λ policy iteration may be more stable than previously thought. Keywords: stochastic optimal control, reinforcement learning, Markov decision processes, analysis of algorithms
6 0.073945895 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
7 0.058704253 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
8 0.053585846 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
9 0.048112474 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
10 0.046948008 104 jmlr-2013-Sparse Single-Index Model
11 0.043552916 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
12 0.042822484 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
13 0.041255988 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
14 0.037560351 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
15 0.035994984 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
16 0.034083791 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
17 0.033957757 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
18 0.031901632 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
19 0.030934064 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
20 0.03051788 59 jmlr-2013-Large-scale SVD and Manifold Learning
topicId topicWeight
[(0, -0.178), (1, 0.051), (2, 0.075), (3, 0.02), (4, -0.04), (5, -0.104), (6, -0.046), (7, 0.102), (8, 0.072), (9, -0.054), (10, -0.15), (11, -0.001), (12, 0.284), (13, -0.011), (14, -0.236), (15, 0.158), (16, -0.023), (17, 0.155), (18, 0.101), (19, -0.026), (20, 0.183), (21, -0.029), (22, -0.248), (23, -0.093), (24, -0.096), (25, 0.11), (26, -0.112), (27, 0.096), (28, 0.005), (29, -0.01), (30, 0.145), (31, -0.069), (32, 0.128), (33, -0.086), (34, -0.011), (35, -0.13), (36, -0.054), (37, 0.01), (38, 0.12), (39, 0.009), (40, -0.062), (41, 0.06), (42, -0.057), (43, 0.06), (44, 0.002), (45, 0.027), (46, 0.012), (47, 0.005), (48, -0.049), (49, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9609381 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
2 0.67285526 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
Author: Robert Hable
Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN
3 0.52091527 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
4 0.46849045 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
5 0.36685079 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.33285266 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
7 0.32381266 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
8 0.29335338 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
9 0.25693727 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
10 0.24851482 104 jmlr-2013-Sparse Single-Index Model
11 0.23250315 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
12 0.22686522 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
13 0.22656858 68 jmlr-2013-Machine Learning with Operational Costs
14 0.21654072 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
15 0.21577331 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
16 0.20786774 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
17 0.20358342 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
18 0.20018625 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
19 0.19754419 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.1836686 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
topicId topicWeight
[(5, 0.111), (6, 0.021), (10, 0.061), (20, 0.016), (23, 0.049), (68, 0.021), (70, 0.036), (75, 0.031), (85, 0.048), (87, 0.017), (95, 0.48)]
simIndex simValue paperId paperTitle
1 0.79476452 95 jmlr-2013-Ranking Forests
Author: Stéphan Clémençon, Marine Depecker, Nicolas Vayatis
Abstract: The present paper examines how the aggregation and feature randomization principles underlying the algorithm R ANDOM F OREST (Breiman, 2001) can be adapted to bipartite ranking. The approach taken here is based on nonparametric scoring and ROC curve optimization in the sense of the AUC criterion. In this problem, aggregation is used to increase the performance of scoring rules produced by ranking trees, as those developed in Cl´ mencon and Vayatis (2009c). The present work e ¸ describes the principles for building median scoring rules based on concepts from rank aggregation. Consistency results are derived for these aggregated scoring rules and an algorithm called R ANK ING F OREST is presented. Furthermore, various strategies for feature randomization are explored through a series of numerical experiments on artificial data sets. Keywords: bipartite ranking, nonparametric scoring, classification data, ROC optimization, AUC criterion, tree-based ranking rules, bootstrap, bagging, rank aggregation, median ranking, feature randomization
same-paper 2 0.76298648 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
3 0.2919271 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.29123574 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
5 0.28877452 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
6 0.28806934 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
7 0.28665572 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
9 0.2852309 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
10 0.28287515 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.28213927 73 jmlr-2013-Multicategory Large-Margin Unified Machines
12 0.28133667 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
13 0.28120688 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
15 0.27891997 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
16 0.27886078 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
17 0.27883261 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
18 0.27882656 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
19 0.27838516 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
20 0.2778016 68 jmlr-2013-Machine Learning with Operational Costs