jmlr jmlr2010 jmlr2010-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distribution-free strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
Reference: text
sentIndex sentText sentNum sentScore
1 A third test statistic is defined as a kernel-based independence measure. [sent-9, score-0.396]
2 Distribution-free strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. [sent-11, score-0.38]
3 All tests reject the null hypothesis of independence if the test statistics become large. [sent-14, score-0.443]
4 Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test 1. [sent-16, score-0.242]
5 The first is to partition the underlying space, and to evaluate the test statistic on the resulting discrete empirical measures. [sent-27, score-0.296]
6 Previous multivariate hypothesis tests in this framework, using the L1 divergence measure, include homogeneity tests (to determine whether two random variables have the same distribution), by Biau and Györfi. [sent-29, score-0.295]
7 The log-likelihood has also been employed on discretised spaces as a statistic for goodness-of-fit testing, by Györfi and Vajda (2002). [sent-34, score-0.222]
8 We provide generalizations of both the L1 and log-likelihood based tests to the problem of testing independence, representing to our knowledge the first application of these techniques to independence testing. [sent-35, score-0.244]
9 We obtain two kinds of tests for each statistic: first, we derive strong consistent tests, meaning that both on H0 and on its complement the tests make a.s. no error after a random sample size. [sent-36, score-0.306]
10 Our strong consistent tests are distribution-free, meaning they require no conditions on the distribution being tested; and universal, meaning the test threshold holds independent of the distribution. [sent-40, score-0.283]
11 Moreover, the thresholds for the asymptotic tests are distribution-independent. [sent-43, score-0.223]
12 We provide two new results: a strong consistent test of independence based on a tighter large deviation bound than that of Gretton et al. [sent-50, score-0.238]
13 (2005a), and an empirical comparison of the limiting distributions of the kernel-based statistic for fixed and decreasing kernel bandwidth, as used in asymptotic tests. [sent-51, score-0.372]
14 The associated independence test is consistent under appropriate assumptions. [sent-61, score-0.214]
15 Two difficulties arise when using this statistic in a test, however. [sent-62, score-0.222]
16 Second, and more importantly, the quality of the empirical distribution function estimates becomes poor as the dimensionality of the spaces $\mathbb{R}^d$ and $\mathbb{R}^{d'}$ increases, which limits the utility of the statistic in a multivariate setting. [sent-64, score-0.222]
17 The large deviation result is used to formulate a distribution-free strong consistent test of independence, which rejects the null hypothesis if the test statistic becomes large. [sent-72, score-0.643]
18 Both a distribution-free strong consistent test and an asymptotically α-level test are presented for the log-likelihood statistic in Section 3. [sent-74, score-0.458]
19 Section 4 contains a review of kernel-based independence statistics, and describes the associated hypothesis tests for both the fixed-bandwidth and variable-bandwidth cases. [sent-75, score-0.274]
20 Given the partition $\mathcal{Q}_n = \{B_{n,1}, \dots, B_{n,m'_n}\}$ of $\mathbb{R}^{d'}$, we define the L1 test statistic comparing $\nu_n$ and $\mu_{n,1}\times\mu_{n,2}$ as
$$L_n(\nu_n, \mu_{n,1}\times\mu_{n,2}) = \sum_{A\in\mathcal{P}_n}\sum_{B\in\mathcal{Q}_n} \left|\nu_n(A\times B) - \mu_{n,1}(A)\cdot\mu_{n,2}(B)\right|. $$ [sent-103, score-0.296]
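The partitioned statistic is simple to compute from cell counts. Below is a minimal sketch (not the authors' code) for one-dimensional X and Y over an equal-width product partition; the function name and the numpy histogramming are our own illustration, and higher-dimensional partitions work the same way.

```python
import numpy as np

def l1_statistic(x, y, m, m_prime):
    """L_n(nu_n, mu_{n,1} x mu_{n,2}) over an m x m' product partition."""
    n = len(x)
    joint, _, _ = np.histogram2d(x, y, bins=(m, m_prime))
    nu = joint / n                       # empirical joint measure nu_n(A x B)
    mu1 = nu.sum(axis=1)                 # empirical marginal mu_{n,1}(A)
    mu2 = nu.sum(axis=0)                 # empirical marginal mu_{n,2}(B)
    return float(np.abs(nu - np.outer(mu1, mu2)).sum())
```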
21 1 Strongly Consistent Test For testing a simple hypothesis versus a composite alternative, Györfi and van der Meulen (1990) introduced a related goodness-of-fit test statistic $L_n$ defined as
$$L_n(\mu_{n,1},\mu_1) = \sum_{A\in\mathcal{P}_n} \left|\mu_{n,1}(A) - \mu_1(A)\right|. $$ [sent-106, score-0.416]
22 Theorem 1 yields a strong consistent test of independence, which rejects the null hypothesis if Ln (νn , µn,1 × µn,2 ) becomes large. [sent-114, score-0.347]
23 Corollary 2 Consider the test which rejects H0 when
$$L_n(\nu_n, \mu_{n,1}\times\mu_{n,2}) > c_1\sqrt{\frac{m_n m'_n + m_n + m'_n}{n}} \approx c_1\sqrt{\frac{m_n m'_n}{n}}, \qquad (5)$$
where $c_1 > \sqrt{2\ln 2} \approx 1.177$. [sent-116, score-1.246]
24 Assume that the conditions
$$\lim_{n\to\infty}\frac{m_n m'_n}{n} = 0 \qquad (6)$$
and
$$\lim_{n\to\infty}\frac{m_n}{\ln n} = \infty, \qquad \lim_{n\to\infty}\frac{m'_n}{\ln n} = \infty, \qquad (7)$$
are satisfied. [sent-118, score-0.972]
25 Therefore the conditions (7) imply
$$\sum_{n=1}^{\infty} P\left\{L_n(\nu_n,\mu_{n,1}\times\mu_{n,2}) > c_1\sqrt{\frac{m_n m'_n + m_n + m'_n}{n}}\right\} < \infty,$$
and the proof under the null hypothesis is completed by the Borel-Cantelli lemma. [sent-126, score-0.481]
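The decision rule of Corollary 2 thus reduces to a single comparison. A sketch, with the constant as in (5); the wrapper itself is our own:

```python
import numpy as np

def l1_strong_consistent_reject(Ln, n, m, m_prime, c1=1.2):
    # Any c1 > sqrt(2 ln 2) ~ 1.177 is admissible; 1.2 is an arbitrary valid choice.
    threshold = c1 * np.sqrt((m * m_prime + m + m_prime) / n)
    return Ln > threshold
```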
26 2 Asymptotic α-level Test Beirlant, Györfi, and Lugosi (1994) proved, under the conditions
$$\lim_{n\to\infty}\frac{m_n}{n} = 0, \qquad \lim_{n\to\infty} m_n = \infty, \qquad (10)$$
and
$$\lim_{n\to\infty}\max_{j=1,\dots,m_n}\mu_1(A_{n,j}) = 0, \qquad (11)$$ [sent-133, score-0.789]
27 Theorem 3 yields the asymptotic null distribution of a consistent independence test, which rejects the null hypothesis if Ln (νn , µn,1 × µn,2 ) becomes large. [sent-141, score-0.519]
28 Consider the test which rejects H0 when
$$L_n(\nu_n,\mu_{n,1}\times\mu_{n,2}) > c_2\sqrt{\frac{m_n m'_n}{n}} + \frac{\sigma}{\sqrt{n}}\,\Phi^{-1}(1-\alpha) \approx c_2\sqrt{\frac{m_n m'_n}{n}},$$
where $\sigma^2 = 1 - 2/\pi$ and $c_2 = \sqrt{2/\pi} \approx 0.798$. [sent-144, score-0.801]
29 In particular, comparing c2 above with c1 in (5), both tests behave identically with respect to $\sqrt{m_n m'_n/n}$ for large enough n, but c2 is smaller. [sent-149, score-0.454]
30 Thus the α-level test rejects the null hypothesis if
$$L_n(\nu_n,\mu_{n,1}\times\mu_{n,2}) > C_n + \frac{\sigma}{\sqrt{n}}\,\Phi^{-1}(1-\alpha). $$ [sent-151, score-0.283]
31 As $C_n$ depends on the unknown distribution, we apply the upper bound $C_n \le \sqrt{2/\pi}\,\sqrt{m_n m'_n/n}$ (see Equation (22) in Appendix A for the definition of $C_n$, and Equation (23) for the bound), so decreasing the error probability. [sent-152, score-0.333]
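A sketch of the resulting asymptotic α-level decision, using the distribution-free bound on $C_n$; scipy's normal quantile plays the role of $\Phi^{-1}$, and the helper name is ours:

```python
import numpy as np
from scipy.stats import norm

def l1_alpha_level_reject(Ln, n, m, m_prime, alpha=0.05):
    sigma = np.sqrt(1.0 - 2.0 / np.pi)                  # sigma^2 = 1 - 2/pi
    cn_bound = np.sqrt(2.0 / np.pi * m * m_prime / n)   # upper bound on C_n
    return Ln > cn_bound + sigma / np.sqrt(n) * norm.ppf(1.0 - alpha)
```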
32 Log-likelihood Statistic In the literature on goodness-of-fit testing, the I-divergence, Kullback-Leibler divergence, or log-likelihood statistic
$$I_n(\mu_{n,1},\mu_1) = \sum_{j=1}^{m_n} \mu_{n,1}(A_{n,j})\,\log\frac{\mu_{n,1}(A_{n,j})}{\mu_1(A_{n,j})}$$
plays an important role. [sent-154, score-0.378]
33 For testing independence, the corresponding log-likelihood test statistic is defined as
$$I_n(\nu_n,\mu_{n,1}\times\mu_{n,2}) = \sum_{A\in\mathcal{P}_n}\sum_{B\in\mathcal{Q}_n} \nu_n(A\times B)\,\log\frac{\nu_n(A\times B)}{\mu_{n,1}(A)\cdot\mu_{n,2}(B)}. $$ [sent-155, score-0.319]
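The statistic is again a function of the cell counts alone; a minimal sketch (our own helper, using the usual convention 0 log 0 = 0 for empty cells):

```python
import numpy as np

def loglik_statistic(x, y, m, m_prime):
    n = len(x)
    joint, _, _ = np.histogram2d(x, y, bins=(m, m_prime))
    nu = joint / n
    prod = np.outer(nu.sum(axis=1), nu.sum(axis=0))  # mu_{n,1}(A) mu_{n,2}(B)
    mask = nu > 0                                    # 0 log 0 = 0 convention
    return float(np.sum(nu[mask] * np.log(nu[mask] / prod[mask])))
```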
34 Kallenberg (1985), and Quine and Robinson (1985) proved that, for all ε > 0,
$$P\{I_n(\mu_{n,1},\mu_1) > \varepsilon\} \le \binom{n+m_n-1}{m_n-1}\, e^{-n\varepsilon} \le e^{m_n\log(n+m_n) - n\varepsilon}. $$ [sent-160, score-0.333]
35 Note that using an alternative bound due to Barron (1989, Equation 3. [sent-161, score-0.333]
36 A large deviation based test can be introduced such that the test rejects independence if
$$I_n(\nu_n,\mu_{n,1}\times\mu_{n,2}) \ge \frac{m_n m'_n\,(\log(n+m_n m'_n)+1)}{n}. $$ [sent-163, score-1.016]
37 Therefore condition (7) implies
$$\sum_{n=1}^{\infty} P\left\{I_n(\nu_n,\mu_{n,1}\times\mu_{n,2}) > \frac{m_n m'_n\,(\log(n+m_n m'_n)+1)}{n}\right\} < \infty,$$
and by the Borel-Cantelli lemma we have strong consistency under the null hypothesis. [sent-165, score-0.806]
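The universal threshold itself is a one-liner; a sketch of the strong consistent decision (function name ours, threshold as reconstructed above):

```python
import numpy as np

def loglik_strong_consistent_reject(In, n, m, m_prime):
    mm = m * m_prime
    return In > mm * (np.log(n + mm) + 1.0) / n
```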
38 Note that due to the form of the universal test threshold, strong consistency under H1 requires the condition
$$\lim_{n\to\infty}\frac{m_n m'_n\,\log(n+m_n m'_n)}{n} = 0,$$
as compared to (6). [sent-175, score-0.826]
39 (1990), and Györfi and Vajda (2002) proved that under (10) and (11),
$$\frac{2nI_n(\mu_{n,1},\mu_1) - m_n}{\sqrt{2m_n}} \xrightarrow{D} \mathcal{N}(0,1). \qquad (14)$$ [sent-178, score-0.333]
40 We can derive this statistic in a number of ways. [sent-183, score-0.222]
41 The most immediate interpretation, introduced by Rosenblatt (1975), defines the statistic as the L2 distance between the joint density estimate and the product of marginal density estimates. [sent-184, score-0.272]
42 The Rosenblatt-Parzen kernel density estimates of the density of (X,Y) and of X are respectively
$$f_n(x,y) = \frac{1}{n}\sum_{i=1}^{n} K_h(x-X_i)\,K'_{h'}(y-Y_i) \quad\text{and}\quad f_{n,1}(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x-X_i), \qquad (15)$$
with $f_{n,2}(y)$ defined by analogy, where $K_h(x) = h^{-d}K(x/h)$. [sent-187, score-0.212]
43 Rosenblatt (1975) introduced the kernel-based independence statistic
$$\int \left(f_n(x,y) - f_{n,1}(x)\,f_{n,2}(y)\right)^2 dx\,dy. $$ [sent-188, score-0.372]
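For one-dimensional X and Y, the integral can be approximated on a grid; the sketch below is our own illustration (in practice the integral can often be evaluated in closed form), using Gaussian kernels for K and K':

```python
import numpy as np

def rosenblatt_statistic_1d(x, y, h, h_prime, grid_size=128):
    gx = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid_size)
    gy = np.linspace(y.min() - 3 * h_prime, y.max() + 3 * h_prime, grid_size)
    # K_h(u) = K(u/h)/h with a standard normal K
    Kx = np.exp(-0.5 * ((gx[:, None] - x[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    Ky = np.exp(-0.5 * ((gy[:, None] - y[None, :]) / h_prime) ** 2) / (h_prime * np.sqrt(2 * np.pi))
    f_joint = Kx @ Ky.T / len(x)                         # f_n(x, y) on the grid
    f_prod = np.outer(Kx.mean(axis=1), Ky.mean(axis=1))  # f_{n,1}(x) f_{n,2}(y)
    dx, dy = gx[1] - gx[0], gy[1] - gy[0]
    return float(((f_joint - f_prod) ** 2).sum() * dx * dy)
```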
44 That said, a number of important differences exist between the characteristic function-based statistic and that of Rosenblatt (1975). [sent-193, score-0.277]
45 A further generalization of the statistic is presented by Gretton et al. [sent-199, score-0.222]
46 While the bounded continuous functions are too rich a class to permit the construction of a covariance-based test statistic on a sample, Fukumizu et al. [sent-211, score-0.296]
47 The test statistic in (16) is then interpreted as a biased empirical estimate of H(ν; F , G ). [sent-222, score-0.296]
48 Note that characteristic kernels need not be inner products of square integrable probability density functions: an example is the kernel $L_h(x_1, x_2) = \exp(x_1^\top x_2/h)$ from Steinwart (2001, Section 3, Example 1), which is universal, hence characteristic on compact subsets of $\mathbb{R}^d$. [sent-230, score-0.274]
49 Finally, while we have focused on a kernel dependence measure based on the covariance, alternative kernel dependence measures exist based on the canonical correlation. [sent-233, score-0.221]
50 When used as a statistic in an independence test, the kernel contingency was found empirically to have power superior to the HSIC-based test. [sent-244, score-0.397]
51 1 Strongly Consistent Test The empirical statistic Tn was previously shown by Gretton et al. [sent-246, score-0.222]
52 Because
$$E\{\tilde T_n\} \approx E\{T_n\} \le \frac{L_h(0)\,L'_{h'}(0)}{n},$$
we choose the threshold
$$\left(\sqrt{\frac{4\,L_h(0)\,L'_{h'}(0)\ln n}{n}} + \sqrt{\frac{L_h(0)\,L'_{h'}(0)}{n}}\right)^2 = \frac{L_h(0)\,L'_{h'}(0)}{n}\left(\sqrt{4\ln n}+1\right)^2,$$
that is, we reject the hypothesis of independence if
$$T_n > \frac{\|K\|_2^2\,\|K'\|_2^2}{n\,h^d\,h'^{d'}}\left(\sqrt{4\ln n}+1\right)^2. $$ [sent-283, score-0.585]
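Written with kernel matrices, $T_n$ takes the familiar biased (V-statistic) HSIC form, and the rejection rule is a single comparison. A sketch under our own naming, with $L_h(0)$ and $L'_{h'}(0)$ passed in as constants:

```python
import numpy as np

def tn_biased(K, L):
    """T_n = (1/n^2) tr(K H L H), H = I - (1/n) 11^T; K and L are n x n
    kernel matrices on the X and Y samples respectively."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n ** 2

def kernel_strong_consistent_reject(Tn, n, lh0, lh0_prime):
    # Distribution-free threshold (L_h(0) L'_{h'}(0)/n)(sqrt(4 ln n) + 1)^2.
    return Tn > lh0 * lh0_prime / n * (np.sqrt(4.0 * np.log(n)) + 1.0) ** 2
```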
53 2 Approximately α-level Tests We now describe the asymptotic limit distribution of the test statistic Tn in (16). [sent-294, score-0.371]
54 Thus the asymptotic α-level test rejects the null hypothesis if
$$T_n > E\{T_n\} + \frac{\sigma}{n\,h^{d/2}\,h'^{d'/2}}\,\Phi^{-1}(1-\alpha),$$
where $E\{T_n\}$ may be replaced by its upper bound $L_h(0)\,L'_{h'}(0)/n = \|K\|_2^2\,\|K'\|_2^2/(n h^d h'^{d'})$. [sent-304, score-0.432]
55 A difficulty in using the statistic (16) in a hypothesis test therefore arises due to the form of the null distribution of the statistic, which is a function of the unknown distribution over X and Y , whether or not h is fixed. [sent-314, score-0.444]
56 Second, we mixed these random variables using a rotation matrix parametrised by an angle θ, varying from 0 to π/4 (a zero angle meant the data were independent, while dependence became easier to detect as the angle increased to π/4: see the two plots in Figure 2). [sent-398, score-0.326]
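A sketch of this data-generating step (our own stand-in: the rotation is as described, while the source marginals in the experiments come from a separate benchmark family, for which uniforms serve here as an illustrative substitute):

```python
import numpy as np

def rotated_sample(n, theta, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    s = rng.uniform(-1.0, 1.0, size=(n, 2))    # independent non-Gaussian sources
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    z = s @ R.T                                # mix by rotation; theta = 0 keeps independence
    return z[:, 0], z[:, 1]
```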
57 Although no tests are reliable for small θ, several tests do well as θ approaches π/4 (besides the case of n = 128, d = 2). [sent-457, score-0.242]
58 In the remaining cases, performance of the log-likelihood test and the L1 test is comparable, besides in the case n = 512, d = 2, where the log-likelihood test has an advantage. [sent-462, score-0.222]
59 The superior performance of the log-likelihood test compared with the χ2 test (in the cases d = 1 and d = 2) might arise due to the different convergence properties of the two test statistics. [sent-463, score-0.222]
60 In particular, we note the superior convergence behaviour of the goodness-of-fit statistic for the log likelihood (Equation 13), as compared with the χ2 statistic (Equation 24 in Appendix B), in terms of the dependence of the latter on the number mn of partitions used. [sent-464, score-0.839]
61 By analogy, we anticipate the log-likelihood independence statistic In (νn , µn,1 × µn,2 ) will also converge faster than the Pearson χ2 independence statistic χ2 (νn , µn,1 × µn,2 ), and thus provide better test performance. [sent-465, score-0.718]
63 In the final row, the performance of the Ker(g) and Ker(n) tests is plotted for a large bandwidth h = 3, and α = 0. [sent-537, score-0.212]
65 Figure 4: Rate of acceptance of H0 for the distribution-free (Free) and shuffling-based (Shuff) null distribution quantiles, using the L1 test statistic (axes: angle ×π/4 versus % acceptance of H0). [sent-587, score-0.307]
66 Unlike the L1 and log-likelihood tests, the kernel test thresholds in our experiments are themselves finite sample estimates (which we have not attempted to account for, and which could impact on test performance). [sent-589, score-0.272]
67 It is of interest to further investigate the null distribution approximation strategies for the kernel tests, and in particular to determine the effect on test performance of the observations made in Figure 1. [sent-591, score-0.244]
68 An alternative approach to obtaining null distribution quantiles for test thresholds is via a shuffling procedure: the ordering of the Y1, . . . , Yn is permuted while the X sample is kept fixed, which simulates draws from the null distribution. [sent-599, score-0.227]
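A sketch of such a shuffling estimate of the null quantile (the wrapper is hypothetical; `statistic` is any of the statistics above, with partition sizes or bandwidths fixed beforehand, for example via functools.partial):

```python
import numpy as np

def shuffled_null_quantile(statistic, x, y, alpha=0.05, n_perm=200, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    null_stats = [statistic(x, rng.permutation(y)) for _ in range(n_perm)]
    return float(np.quantile(null_stats, 1.0 - alpha))

# Reject H0 at level alpha when statistic(x, y) exceeds this quantile.
```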
70 Figure 5: Rate of acceptance of H0 for the distribution-free (Free) and shuffling-based (Shuff) null distribution quantiles, using the log-likelihood test statistic (axes: angle ×π/4 versus % acceptance of H0). [sent-654, score-0.307]
71 Unlike the L1 and log-likelihood tests we have proposed, the resulting test threshold is an empirical estimate, and the convergence behaviour of this estimate is not accounted for. [sent-656, score-0.239]
72 In the d = 3 case, however, the Like test gives too large a Type I error, and thus the Type II performance of the two approaches cannot be compared (although for n = 2048, the Like test is observed to approach the asymptotic regime, and the Type I performance is closer to the target value). [sent-664, score-0.223]
73 This comparison was made for the kernel statistic on these data by Gretton et al. [sent-666, score-0.297]
74 The asymptotic L1 and log-likelihood tests require that the distributions be non-atomic, but make no assumptions apart from this: in particular, the test thresholds are not functions of the distribution. [sent-670, score-0.297]
75 The kernel statistic is interpretable as either an L2 distance between kernel density estimates (if the kernel bandwidth shrinks for increasing sample size), or as the Hilbert-Schmidt norm of a covariance operator between reproducing kernel Hilbert spaces (if the kernel bandwidth is fixed). [sent-671, score-0.804]
76 We have provided a novel strong consistent test for the kernel statistic, as well as reviewing two asymptotically α-level tests (for both fixed and shrinking kernel bandwidth). [sent-672, score-0.433]
77 Unlike the L1 and log-likelihood tests, the thresholds for the kernel asymptotic tests are distribution dependent. [sent-673, score-0.298]
78 We also gave conjectures regarding the strong consistent test and asymptotically α-level test for the Pearson χ2 distance. [sent-674, score-0.259]
79 Our experiments showed the asymptotic tests to be capable of detecting dependence for both univariate and multi-dimensional variables (of up to three dimensions each), for variables having no linear correlation. [sent-675, score-0.216]
80 The kernel tests had lower Type II error than the L1 and log-likelihood tests for a given Type I error; however, we should bear in mind that the kernel test thresholds were finite sample estimates, and the resulting convergence issues have not been addressed. [sent-676, score-0.493]
81 Second, there is as yet no distribution-free asymptotic threshold for the kernel test, which could be based on a tighter bound on the variance of the test statistic under the null distribution. [sent-680, score-0.565]
82 Third, the asymptotic distribution of the kernel statistic with fixed bandwidth is presently a heuristic: it would therefore be of interest to replace this with a null distribution estimate having appropriate convergence guarantees. [sent-681, score-0.558]
83 Let $g_{njk}$ ($j = 1,\dots,m_n$, $k = 1,\dots,m'_n$) be real measurable functions, and let
$$M_n := \sum_{j=1}^{m_n}\sum_{k=1}^{m'_n} g_{njk}\left(\nu_{N_n}(A_{nj}\times B_{nk}) - \mu_{N_n,1}(A_{nj})\,\mu_{N_n,2}(B_{nk})\right). $$ [sent-718, score-0.383]
84 Then
$$\frac{1}{\sigma}\sum_{j=1}^{m_n}\sum_{k=1}^{m'_n} g_{njk}\left(\nu_n(A_{nj}\times B_{nk}) - \mu_{n,1}(A_{nj})\,\mu_{n,2}(B_{nk})\right) \xrightarrow{D} \mathcal{N}(0,1). $$ [sent-720, score-0.383]
85 Define the two characteristic functions
$$\Phi_n(t,v) := E\exp\left(\imath t M_n + \imath v\,\frac{N_n - n}{\sqrt{n}}\right)$$
and
$$\Psi_n(t) := E\exp\left(\imath t \sum_{j=1}^{m_n}\sum_{k=1}^{m'_n} g_{njk}\left(\nu_n(A_{nj}\times B_{nk}) - \mu_{n,1}(A_{nj})\,\mu_{n,2}(B_{nk})\right)\right). $$ [sent-722, score-0.438]
86 We begin with the result
$$E\{\exp(\imath t M_n + \imath u N_n)\} = \sum_{l=0}^{\infty} E\{\exp(\imath t M_n) \mid N_n = l\}\,e^{\imath u l}\,p_n(l),$$
where $p_n(l)$ is the probability distribution of the Poisson(n) random variable $N_n$, $p_n(l) = P\{N_n = l\} = e^{-n} n^l / l!$ [sent-723, score-0.432]
87 with the Stirling approximation to obtain $\sqrt{2\pi n}\,p_n(n) = \sqrt{2\pi n}\,e^{-n} n^n / n! \approx 1$. [sent-727, score-0.216]
88 Let
$$S_n := t\sqrt{n}\sum_{j=1}^{m_n}\sum_{k=1}^{m'_n}\Big[\nu_{N_n}(A_{nj}\times B_{nk}) - \mu_{N_n,1}(A_{nj})\,\mu_{N_n,2}(B_{nk}) - E\big\{\nu_{N_n}(A_{nj}\times B_{nk}) - \mu_{N_n,1}(A_{nj})\,\mu_{N_n,2}(B_{nk})\big\}\Big] + v\sqrt{n}\left(\frac{N_n}{n} - 1\right). $$ [sent-737, score-0.333]
89 In particular, we require the variance of the Poissonized statistic Sn . [sent-739, score-0.222]
90 Finally, we use the variable $Z_{A\times B}$ in defining a distribution-free upper bound on $C_n$, which we use in our asymptotically α-level independence test,
$$C_n = \sum_{A\in\mathcal{P}_n}\sum_{B\in\mathcal{Q}_n} E\{|\nu_{N_n}(A\times B) - \mu_{N_n,1}(A)\cdot\mu_{N_n,2}(B)|\} \approx \sum_{A\in\mathcal{P}_n}\sum_{B\in\mathcal{Q}_n} E\{|Z_{A\times B}|\}\sqrt{\mu_1(A)\mu_2(B)/n} \le \sqrt{\frac{2}{\pi}}\sqrt{\frac{m_n m'_n}{n}}. \qquad (23)$$
Appendix B. [sent-744, score-0.457]
91 For λ = 1, we have the Pearson χ² statistic:
$$\chi_n^2(\mu_{n,1},\mu_1) = D_{n,1}(\mu_{n,1},\mu_1) = \sum_{j=1}^{m_n}\frac{\left(\mu_{n,1}(A_{n,j}) - \mu_1(A_{n,j})\right)^2}{\mu_1(A_{n,j})}. $$ [sent-748, score-0.333]
92 For testing independence, we employ the Pearson χ² test statistic
$$\chi_n^2(\nu_n,\mu_{n,1}\times\mu_{n,2}) = \sum_{A\in\mathcal{P}_n}\sum_{B\in\mathcal{Q}_n}\frac{\left(\nu_n(A\times B) - \mu_{n,1}(A)\cdot\mu_{n,2}(B)\right)^2}{\mu_{n,1}(A)\cdot\mu_{n,2}(B)}. $$ [sent-749, score-0.319]
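A sketch mirroring the earlier partition-based helpers; cells whose estimated product measure is zero also have a zero numerator, so they are skipped:

```python
import numpy as np

def pearson_chi2_statistic(x, y, m, m_prime):
    n = len(x)
    joint, _, _ = np.histogram2d(x, y, bins=(m, m_prime))
    nu = joint / n
    prod = np.outer(nu.sum(axis=1), nu.sum(axis=0))  # mu_{n,1}(A) mu_{n,2}(B)
    mask = prod > 0
    return float(np.sum((nu[mask] - prod[mask]) ** 2 / prod[mask]))
```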
93 1 Strongly Consistent Test Quine and Robinson (1985) proved that, for all ε > 0,
$$P\{\chi_n^2(\mu_{n,1},\mu_1) > \varepsilon\} \le \binom{n+m_n-1}{m_n-1}\, e^{-\frac{n\sqrt{\varepsilon}\,\log m_n}{\sqrt{2 m_n}}} \le e^{m_n\log(n+m_n) - \frac{n\sqrt{\varepsilon}\,\log m_n}{\sqrt{2 m_n}}}. \qquad (24)$$ [sent-751, score-0.333]
94 A large deviation-based test can be introduced that rejects independence if
$$\chi_n^2(\nu_n,\mu_{n,1}\times\mu_{n,2}) \ge \left(\frac{2\,(m_n m'_n)^{3/2}\,(\log(n+m_n m'_n)+1)}{n\,\log(m_n m'_n)}\right)^2. $$ [sent-752, score-0.901]
95 Therefore the conditions (7) imply
$$\sum_{n=1}^{\infty} P\left\{\chi_n^2(\nu_n,\mu_{n,1}\times\mu_{n,2}) > \left(\frac{2\,(m_n m'_n)^{3/2}\,(\log(n+m_n m'_n)+1)}{n\,\log(m_n m'_n)}\right)^2\right\} < \infty,$$
and by the Borel-Cantelli lemma we have strong consistency under the null hypothesis. [sent-754, score-0.473]
96 (1990), and Györfi and Vajda (2002) proved that under (10) and (11),
$$\frac{n\chi_n^2(\mu_{n,1},\mu_1) - m_n}{\sqrt{2m_n}} \xrightarrow{D} \mathcal{N}(0,1),$$
which is the same asymptotic normality result as for $2nI_n(\mu_{n,1},\mu_1)$ (see Equation (14) in Section 3). [sent-760, score-0.436]
97 We conjecture that under the conditions (6) and (12),
$$\frac{n\chi_n^2(\nu_n,\mu_{n,1}\times\mu_{n,2}) - m_n m'_n}{\sqrt{2\,m_n m'_n}} \xrightarrow{D} \mathcal{N}(0,1). $$ [sent-762, score-0.333]
98 Thus, as for the log-likelihood statistic, the hypothesis of independence is rejected if
$$\chi_n^2(\nu_n,\mu_{n,1}\times\mu_{n,2}) \ge \frac{\Phi^{-1}(1-\alpha)\,\sqrt{2\,m_n m'_n} + m_n m'_n}{n}. $$ [sent-763, score-0.486]
99 On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. [sent-815, score-0.318]
100 Distribution free tests of independence based on the sample distribution function. [sent-822, score-0.221]
wordName wordTfidf (topN-words)
[('lh', 0.37), ('mn', 0.333), ('samp', 0.265), ('tn', 0.241), ('kh', 0.225), ('statistic', 0.222), ('nn', 0.216), ('qn', 0.186), ('gretton', 0.152), ('dim', 0.144), ('pn', 0.144), ('onparametric', 0.141), ('orfi', 0.141), ('acceptance', 0.138), ('gy', 0.128), ('bnk', 0.124), ('za', 0.123), ('tests', 0.121), ('ndependence', 0.121), ('onsistent', 0.121), ('retton', 0.121), ('ln', 0.112), ('ests', 0.109), ('angle', 0.102), ('independence', 0.1), ('beirlant', 0.097), ('null', 0.095), ('bandwidth', 0.091), ('fukumizu', 0.076), ('asymptotic', 0.075), ('kernel', 0.075), ('test', 0.074), ('mh', 0.061), ('rejects', 0.061), ('ker', 0.06), ('var', 0.057), ('gamma', 0.055), ('characteristic', 0.055), ('shuf', 0.054), ('feuerverger', 0.053), ('ntn', 0.053), ('tmn', 0.053), ('hypothesis', 0.053), ('fn', 0.05), ('rd', 0.046), ('vajda', 0.045), ('kankainen', 0.044), ('zb', 0.044), ('lim', 0.041), ('cn', 0.041), ('rosenblatt', 0.041), ('consistent', 0.04), ('sriperumbudur', 0.038), ('hd', 0.037), ('nhd', 0.035), ('shuff', 0.035), ('transitional', 0.035), ('pearson', 0.035), ('kernels', 0.035), ('barron', 0.034), ('rkhs', 0.034), ('quantiles', 0.031), ('canonical', 0.031), ('mixture', 0.03), ('integrable', 0.029), ('normality', 0.028), ('borel', 0.028), ('cdf', 0.027), ('biau', 0.027), ('thresholds', 0.027), ('dauxois', 0.027), ('inglot', 0.027), ('meulen', 0.027), ('pears', 0.027), ('poissonized', 0.027), ('quine', 0.027), ('gn', 0.026), ('multimodal', 0.026), ('unimodal', 0.026), ('density', 0.025), ('goodness', 0.025), ('strong', 0.024), ('threshold', 0.024), ('asymptotically', 0.024), ('poisson', 0.024), ('jk', 0.024), ('bach', 0.023), ('testing', 0.023), ('conjectures', 0.023), ('log', 0.022), ('nonparametric', 0.021), ('zn', 0.021), ('consistency', 0.021), ('arthur', 0.021), ('behaviour', 0.02), ('dependence', 0.02), ('normal', 0.02), ('der', 0.019), ('lugosi', 0.019), ('type', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999845 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
2 0.19563422 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
3 0.13974661 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
4 0.094014578 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
5 0.088903084 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
6 0.074043483 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
7 0.051804475 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
8 0.050215069 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
9 0.04778925 6 jmlr-2010-A Rotation Test to Verify Latent Structure
10 0.047623172 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
11 0.047307134 102 jmlr-2010-Semi-Supervised Novelty Detection
12 0.044493828 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
13 0.043185171 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
14 0.042965744 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
15 0.038797047 44 jmlr-2010-Graph Kernels
16 0.037477065 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
17 0.035553988 53 jmlr-2010-Inducing Tree-Substitution Grammars
18 0.03427637 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
19 0.033821914 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
20 0.031851053 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars
topicId topicWeight
[(0, -0.177), (1, -0.03), (2, 0.031), (3, 0.022), (4, -0.171), (5, -0.041), (6, -0.265), (7, 0.058), (8, 0.085), (9, -0.037), (10, -0.107), (11, 0.063), (12, -0.215), (13, -0.107), (14, 0.11), (15, -0.015), (16, -0.12), (17, -0.042), (18, -0.106), (19, 0.15), (20, -0.072), (21, 0.375), (22, -0.145), (23, 0.089), (24, -0.012), (25, 0.16), (26, 0.12), (27, 0.036), (28, -0.049), (29, -0.068), (30, -0.017), (31, 0.054), (32, 0.039), (33, -0.15), (34, -0.029), (35, 0.024), (36, -0.077), (37, 0.027), (38, -0.004), (39, 0.033), (40, 0.002), (41, 0.186), (42, -0.04), (43, 0.012), (44, -0.003), (45, 0.001), (46, 0.032), (47, 0.047), (48, 0.093), (49, 0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.9666099 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
2 0.70757699 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
3 0.47918957 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
4 0.30955678 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
Author: Miki Aoyagi
Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function
5 0.30080697 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
6 0.300192 102 jmlr-2010-Semi-Supervised Novelty Detection
7 0.25885421 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
8 0.25785512 6 jmlr-2010-A Rotation Test to Verify Latent Structure
9 0.24674428 72 jmlr-2010-Matrix Completion from Noisy Entries
10 0.2189181 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
11 0.21110044 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
12 0.20030911 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
13 0.1625289 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
14 0.16140436 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
15 0.16066794 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
16 0.16064219 10 jmlr-2010-An Exponential Model for Infinite Rankings
17 0.15750195 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
18 0.15699622 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
19 0.15691125 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
20 0.14990719 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
topicId topicWeight
[(1, 0.013), (3, 0.115), (4, 0.013), (8, 0.014), (15, 0.017), (21, 0.014), (32, 0.046), (36, 0.034), (37, 0.038), (75, 0.088), (81, 0.015), (83, 0.414), (85, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.73927909 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
2 0.46320674 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
3 0.36984506 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
4 0.3667767 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
5 0.33158267 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
6 0.32056195 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
7 0.31997854 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
8 0.31993198 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
9 0.31622452 102 jmlr-2010-Semi-Supervised Novelty Detection
10 0.31587771 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
11 0.31573322 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
13 0.31284899 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
14 0.31047705 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
15 0.30968189 84 jmlr-2010-On Spectral Learning
16 0.30953959 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
17 0.30654323 25 jmlr-2010-Composite Binary Losses
18 0.30639711 60 jmlr-2010-Learnability, Stability and Uniform Convergence
19 0.30579847 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
20 0.30523345 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization