nips nips2013 nips2013-44 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. [sent-6, score-0.492]
2 Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. [sent-7, score-0.365]
3 The choice of block size allows control over the tradeoff between test power and computation time. [sent-8, score-0.284]
4 A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. [sent-10, score-1.034]
5 Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. [sent-11, score-0.869]
6 Given samples {x_i}_{i=1}^n ∼ P and {y_i}_{i=1}^n ∼ Q in R^d, the two sample problem consists in testing whether to accept or reject the null hypothesis H0 that P = Q, vs the alternative hypothesis HA that P and Q are different. [sent-17, score-0.611]
7 This problem has recently been addressed using measures of similarity computed in a reproducing kernel Hilbert space (RKHS), which apply in very general settings where P and Q might be distributions over high dimensional data or structured objects. [sent-18, score-0.34]
8 When used in testing, it is necessary to determine whether the empirical estimate of the relevant similarity measure is sufficiently large as to give the hypothesis P = Q low probability; i. [sent-20, score-0.153]
9 The test power denotes the probability of correctly rejecting the null hypothesis, given that P ≠ Q. [sent-23, score-0.471]
10 Unfortunately, this statistic is degenerate under the null hypothesis H0 that P = Q, and its asymptotic distribution takes the form of an infinite weighted sum of independent χ2 variables (it is asymptotically Gaussian under the alternative hypothesis HA that P ≠ Q). [sent-25, score-0.831]
11 Two methods for empirically estimating the null distribution in a consistent way have been proposed: the bootstrap [10], and a method requiring an eigendecomposition of the kernel matrices computed on the merged samples from P and Q [7]. [sent-26, score-0.888]
12 Another approach is to approximate the null distribution by a member of a simpler parametric family (for instance, a Pearson curve approximation), however this has no consistency guarantees. [sent-28, score-0.383]
13 While this has much greater variance than the U-statistic, it also has a simpler null distribution: being an average over i.i.d. [sent-30, score-0.353]
14 terms, the central limit theorem gives an asymptotically Normal distribution, under both H0 and HA . [sent-33, score-0.248]
15 Kernel selection for the U-statistic is a much harder question due to the complex form of the null distribution, and remains an open problem. [sent-35, score-0.32]
16 A major reason MMDl is faster is that its null distribution is straightforward to compute, since it is Gaussian and its variance can be calculated at the same cost as the test statistic. [sent-37, score-0.486]
17 A reasonable next step would be to find a compromise between these two extremes: to construct a statistic with a lower variance than MMDl , while retaining an asymptotically Gaussian null distribution (hence remaining faster than tests based on MMDu ). [sent-38, score-0.712]
18 We study a family of such test statistics, where we split the data into blocks of size B, compute the quadratic-time MMDu on each block, and then average the resulting statistics. [sent-39, score-0.183]
19 As long as we choose the size B of blocks such that n/B → ∞, we are still guaranteed asymptotic Normality by the central limit theorem, and the null distribution can be computed at the same cost as the test statistic. [sent-41, score-0.695]
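As a concrete illustration of this construction, the following is a minimal sketch (not the authors' code) of the B-test statistic with a Gaussian kernel: the samples are split into n/B consecutive blocks, the unbiased quadratic-time MMD estimate is computed within each block, and the block values are averaged. The names gaussian_kernel, block_mmd2 and b_test_statistic are illustrative choices.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def block_mmd2(x, y, sigma=1.0):
    """Unbiased quadratic-time MMD^2 estimate computed within one block of size B."""
    B = x.shape[0]
    Kxx = gaussian_kernel(x, x, sigma)
    Kyy = gaussian_kernel(y, y, sigma)
    Kxy = gaussian_kernel(x, y, sigma)
    np.fill_diagonal(Kxx, 0.0)            # drop a == b terms of the U-statistic
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)            # drop the paired cross terms as well
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (B * (B - 1))

def b_test_statistic(x, y, B, sigma=1.0):
    """Average the per-block MMD^2 estimates over the n // B consecutive blocks."""
    n = x.shape[0]
    blocks = np.array([block_mmd2(x[i:i + B], y[i:i + B], sigma)
                       for i in range(0, (n // B) * B, B)])
    return blocks.mean(), blocks
```

With n = 2000 points per sample and B = 32, this costs on the order of n·B = 64,000 kernel evaluations per kernel matrix, rather than the n^2 = 4,000,000 required by the full quadratic-time statistic.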
20 For a given sample size n, however, the power of the test can increase dramatically over the MMDl test, even for moderate block sizes B, making much better use of the available data with only a small increase in computation. [sent-42, score-0.359]
21 The block averaging scheme was originally proposed in [13], as an instance of a two-stage U-statistic, to be applied when the degree of degeneracy of the U-statistic is indeterminate. [sent-43, score-0.229]
22 Differences with respect to our method are that Ho and Shieh compute the block statistics by sampling with replacement [13, (b) p. [sent-44, score-0.135]
23 863], and propose to obtain the variance of the test statistic via Monte Carlo, jackknife, or bootstrap techniques, whereas we use closed form expressions. [sent-45, score-0.517]
24 Finally, the kernel learning approach of [9] applies straightforwardly, allowing us to maximize test power over a given kernel family. [sent-48, score-0.659]
25 We begin with a brief review of kernel methods, and of the maximum mean discrepancy. [sent-52, score-0.255]
26 The central idea employed in the construction of the B-test is to generate a low variance MMD estimate by averaging multiple low variance kernel statistics computed over blocks of samples. [sent-54, score-0.563]
27 We show simple sufficient conditions on the block size for consistency of the estimator. [sent-55, score-0.168]
28 Furthermore, we analyze the properties of the finite sample estimate, and propose a consistent strategy for setting the block size as a function of the number of samples. [sent-56, score-0.205]
29 1 Definition and asymptotics of the block-MMD Let Fk be an RKHS defined on a topological space X with reproducing kernel k, and P a Borel probability measure on X . [sent-58, score-0.289]
30 The mean embedding of P in Fk, written µk(P) ∈ Fk, is defined such that E_{x∼P} f(x) = ⟨f, µk(P)⟩ for all f ∈ Fk. [Figure 1 residue: histograms of the statistic under H0 and HA, with the approximated 5% quantile of H0 marked.] [sent-59, score-0.232]
31 As we vary B, we trade off the quality of the finite sample Gaussian approximation to the null distribution, as in Theorem 2. [sent-74, score-0.33]
32 (the empirical distribution for the smallest block size fails a level 0.05 Kolmogorov-Smirnov (KS) normality test [16, 20]), and a Gaussian approximation results in a conservative test threshold (vertical green line). [sent-78, score-0.329]
33 The remaining empirical distributions all pass a KS normality test. [sent-79, score-0.15]
34 (2) When the kernel k is characteristic, then ηk (P, Q) = 0 iff P = Q [21]. [sent-83, score-0.255]
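For reference, assuming ηk(P, Q) in (2) denotes the squared RKHS distance between the mean embeddings (the population MMD^2), its standard expansion in terms of kernel expectations is:

```latex
\eta_k(P,Q)
  \;=\; \bigl\lVert \mu_k(P) - \mu_k(Q) \bigr\rVert_{\mathcal{F}_k}^{2}
  \;=\; \mathbb{E}_{x,x' \sim P}\, k(x,x')
  \;+\; \mathbb{E}_{y,y' \sim Q}\, k(y,y')
  \;-\; 2\, \mathbb{E}_{x \sim P,\, y \sim Q}\, k(x,y),
```

which is zero if and only if P = Q when k is characteristic, as stated above.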
35 η̂_k(i) = [B(B−1)]^(-1) Σ_{a=(i−1)B+1}^{iB} Σ_{b=(i−1)B+1, b≠a}^{iB} [k(x_a, x_b) + k(y_a, y_b) − k(x_a, y_b) − k(x_b, y_a)]   (3). The B-test statistic is an MMD estimate obtained by averaging the η̂_k(i). [sent-88, score-0.171]
36 Although setting B = n would lead to the lowest variance estimate of the MMD, computing sound thresholds for a given p-value is expensive, involving repeated bootstrap sampling [5, 14], or computing the eigenvalues of a Gram matrix [7]. [sent-90, score-0.316]
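For contrast, below is a generic sketch of the expensive route mentioned here: estimating the null distribution of the full-sample statistic by recomputing it on random re-partitions of the pooled data. This is a simplified stand-in for the bootstrap procedures of [5, 14], not a reproduction of them, and permutation_threshold is an illustrative name; the statistic argument would be the quadratic-time MMD^2 computed on all n samples, so each repetition costs O(n^2).

```python
import numpy as np

def permutation_threshold(x, y, statistic, alpha=0.05, n_perms=500, seed=0):
    """Estimate the (1 - alpha) null quantile of `statistic` by re-partitioning
    the pooled sample; `statistic(x, y)` returns a scalar two-sample statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = x.shape[0]                        # assumes equal sample sizes
    null_stats = np.empty(n_perms)
    for p in range(n_perms):
        idx = rng.permutation(pooled.shape[0])
        null_stats[p] = statistic(pooled[idx[:n]], pooled[idx[n:]])
    return np.quantile(null_stats, 1.0 - alpha)
```

The B-test avoids this repeated quadratic-time work because its null distribution is approximated in closed form, as described next.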
37 variables, and averaging them allows us to apply the central limit theorem in order to estimate p-values from a normal distribution. [sent-97, score-0.26]
38 η̂_k = (B/n) Σ_{i=1}^{n/B} η̂_k(i)   (4). We would like to apply the central limit theorem to the variables η̂_k(i), i = 1, . . . , n/B. [sent-99, score-0.184]
39 The central limit theorem implies that under HA, η̂_k →_D N(MMD^2, σ_u^2 (B · n/B)^(-1)) = N(MMD^2, σ_u^2 n^(-1)).   (7) [sent-111, score-0.184]
40 This result shows that the distribution of η̂_k under HA is asymptotically independent of the block size, B. [sent-112, score-0.229]
41 Turning to the null hypothesis, [11, Theorem 8] additionally implies that under H0 for every i, Theorem 2. [sent-113, score-0.289]
42 The central limit theorem implies that under H0, η̂_k →_D N(0, C (B^2 · n/B)^(-1)). The asymptotic distributions for η̂_k under H0 and HA are Gaussian, and consequently it is easy to calculate the distribution quantiles and test thresholds. [sent-119, score-0.491]
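The following is a minimal sketch of how the Gaussian approximation can be turned into a threshold and p-value. It plugs in the empirical variance of the per-block statistics as the null variance estimate, which is one simple choice and not necessarily the exact closed-form expression used in the paper; b_test and block_values are illustrative names, and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

def b_test(block_values, alpha=0.05):
    """Gaussian-approximation B-test given the per-block MMD^2 estimates.

    Under H0 the average of the m = n // B block values is approximately
    N(0, var(block) / m), so thresholds and p-values come from Normal tables."""
    block_values = np.asarray(block_values, dtype=float)
    m = block_values.size
    eta_hat = block_values.mean()
    # empirical plug-in for the std of the mean under H0 (an assumption, see lead-in)
    null_std = block_values.std(ddof=1) / np.sqrt(m)
    threshold = stats.norm.ppf(1.0 - alpha) * null_std
    p_value = stats.norm.sf(eta_hat / null_std)   # one-sided: reject for large values
    return eta_hat, threshold, p_value, bool(eta_hat > threshold)
```

The variance estimate is obtained from the same block values used for the statistic, so the threshold comes at essentially no extra cost.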
43 A related strategy of averaging over data blocks to deal with large sample sizes has recently been developed in [15], with the goal of efficiently computing bootstrapped estimates of statistics of interest (e. [sent-122, score-0.198]
44 Briefly, the approach splits the data (of size n) into s subsamples each of size B, computes an estimate of the n-fold bootstrap on each block, and averages these estimates. [sent-125, score-0.222]
45 The central limit theorem implies that the empirical mean of {η̂_k(i)}_{i=1}^{n/B} is asymptotically Normally distributed. [sent-129, score-0.235]
46 4 Under both H0 and HA , thresholds computed from normal distribution tables are asymptotically unbiased. [sent-139, score-0.157]
47 The skewness for the entire sum is finite and positive, C = Σ_{l=1}^{∞} 8 λ_l^3   (11), as λ_l ≥ 0 for all l due to the positive definiteness of the kernel k. [sent-143, score-0.315]
48 At smaller sample sizes, test thresholds obtained from the standard Normal table may therefore be inaccurate, as they do not account for this skew. [sent-145, score-0.174]
49 In our experiments, this bias caused the tests to be overly conservative, with lower Type I error than the design level required (Figures 2 and 5). [sent-146, score-0.235]
50 3 Finite Sample Case In the finite sample case, we apply the Berry-Esséen theorem, which gives conservative bounds on the ℓ∞ convergence of a series of finite sample random variables to a Gaussian distribution [4]. [sent-148, score-0.187]
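For completeness, the classical Berry-Esséen bound for i.i.d. summands Z_1, ..., Z_m with variance σ^2 and finite third absolute moment (stated here in its standard textbook form, with C an absolute constant and Φ the standard Normal CDF) is:

```latex
\sup_{t \in \mathbb{R}}
\left|\, \Pr\!\left( \frac{1}{\sigma\sqrt{m}} \sum_{i=1}^{m} \bigl(Z_i - \mathbb{E}Z_i\bigr) \le t \right) - \Phi(t) \,\right|
\;\le\; \frac{C\, \mathbb{E}\lvert Z_1 - \mathbb{E}Z_1 \rvert^{3}}{\sigma^{3} \sqrt{m}} .
```

Applied with Z_i the block statistics and m = n/B, the moment ratio depends on the block distribution (hence on B and the underlying P and Q, but not on n), so the departure from Normality decays like (n/B)^(-1/2).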
51 , it is dependent only on the underlying distributions of the samples and not on the sample size. [sent-163, score-0.155]
52 While the asymptotic results indicate that convergence to an optimal predictor is fastest for larger B, the finite sample results support decreasing the size of B in order to have a sufficient number n/B of samples for application of the central limit theorem. [sent-172, score-0.328]
53 When B is small, we have many samples, hence the null distribution is close to the asymptotic limit provided by the central limit theorem, and the Type I error is estimated accurately. [sent-175, score-0.672]
54 The disadvantage of a small B is a lower test power for a given sample size. [sent-176, score-0.19]
55 Conversely, if we increase B, we will have a lower variance empirical distribution for H0 , hence higher test power, but we may have a poor estimate of the number of Type I errors (Figure 1). [sent-177, score-0.297]
56 In this setting the number of samples available for application of the central limit theorem will be ⌊n^(1−γ)⌋. [sent-179, score-0.247]
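As a small worked example (assuming the setting referred to here is a block size of the form B = ⌊n^γ⌋ for some 0 < γ < 1; the helper name choose_block_size is illustrative):

```python
def choose_block_size(n, gamma=0.5):
    """Block size B = floor(n**gamma); roughly n**(1 - gamma) blocks remain for the CLT."""
    B = max(2, int(n ** gamma))
    return B, n // B

for n in (1000, 2000, 10000):
    B, m = choose_block_size(n, gamma=0.5)   # gamma = 0.5 corresponds to B ~ sqrt(n)
    print(f"n={n:6d}  B={B:4d}  blocks available for the CLT={m:4d}")
```

γ = 1/2 corresponds to the B = √n setting that appears in the experiments below; larger γ lowers the variance of the statistic but leaves fewer blocks for the Normal approximation.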
57 We evaluate B-test performance in comparison to the MMDl and MMDu estimators, where for the latter we compare across different strategies for null distribution quantile estimation. [sent-185, score-0.365]
58 [Table 1 residue: the σ = median row shows an entry of 8332, several entries marked ×, and the note "> 60000, or 2h per iteration timeout".] Table 1: Sample complexity for tests on the distributions described in Figure 3. [sent-196, score-0.29]
[Figure 2 residue: panels plot the empirical vs. expected Type I error against the size of the inner block (2 to 128), for B = √n and B = n/2.]
63 1 Synthetic data Following previous work on kernel hypothesis testing [9], our synthetic distributions are 5 × 5 grids of 2D Gaussians. [sent-227, score-0.526]
64 These distributions have proved to be very challenging for existing non-parametric two-sample tests [9]. [sent-231, score-0.188]
65 We employed three different kernel selection strategies in the hypothesis test. [sent-232, score-0.388]
66 First, we used a Gaussian kernel with σ = 1, which approximately matches the scale of the variance of each Gaussian in mixture P . [sent-233, score-0.319]
67 Next, we set σ equal to the median pairwise distance over the training data, which is a standard way to choose the Gaussian kernel. [Figure 3: Synthetic data distributions P and Q.] [sent-235, score-0.476]
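A minimal sketch of this median heuristic (setting σ to the median pairwise distance; here it is computed over the pooled sample, which is one common variant, and median_heuristic_sigma is an illustrative name):

```python
import numpy as np

def median_heuristic_sigma(x, y):
    """Gaussian kernel bandwidth set to the median pairwise distance of the pooled sample."""
    z = np.vstack([x, y])
    sq = np.sum(z**2, axis=1)[:, None] + np.sum(z**2, axis=1)[None, :] - 2.0 * z @ z.T
    dists = np.sqrt(np.maximum(sq, 0.0))
    return float(np.median(dists[np.triu_indices_from(dists, k=1)]))  # off-diagonal pairs only
```

As discussed further below, this heuristic can fail when the lengthscale of the difference between P and Q does not match the lengthscale of the overall data variation.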
68 Finally, we applied a kernel learning strategy, in which the kernel was optimized to maximize the test power for the alternative P = Q [9]. [sent-238, score-0.659]
69 This approach returned a non-negative linear combination of base kernels, where half the data were used in learning the kernel weights (these data were excluded from the testing phase). [sent-239, score-0.332]
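The full procedure in [9] learns a non-negative linear combination of base kernels by maximizing a power criterion on the held-out half of the data. As a much-simplified, hedged sketch of the same idea, one can instead pick a single bandwidth from a candidate grid by maximizing the ratio of the statistic to its estimated standard deviation on the training half; this is not the authors' optimization, and select_bandwidth and block_statistic are illustrative names.

```python
import numpy as np

def select_bandwidth(x_train, y_train, candidate_sigmas, block_statistic, B=32, eps=1e-8):
    """Pick the bandwidth maximizing statistic / std -- a crude proxy for test power.

    `block_statistic(x, y, B, sigma)` is assumed to return the per-block MMD^2 values."""
    best_sigma, best_ratio = None, -np.inf
    for sigma in candidate_sigmas:
        blocks = np.asarray(block_statistic(x_train, y_train, B, sigma))
        ratio = blocks.mean() / (blocks.std(ddof=1) / np.sqrt(blocks.size) + eps)
        if ratio > best_ratio:
            best_sigma, best_ratio = sigma, ratio
    return best_sigma
```

As in the text, the half of the data used for selection would then be excluded from the test itself.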
70 For methods using Pearson curves and the Gram matrix spectrum, we drew 500 samples from the null distribution estimates to obtain the 1 − α quantiles, for a test of level α. [sent-246, score-0.554]
71 We considered only the setting with σ = 1 and σ set to the median pairwise distance, as kernel selection is not yet solved for tests using MMDu [9]. [sent-249, score-0.554]
72 [Figure 4: Synthetic experiment: number of Type II errors vs B, given a fixed probability α of Type I errors.] In the first experiment we set the Type I error to be 5%, and we recorded the Type II error. [sent-254, score-0.246]
73 We conducted these experiments on 2000 samples over 1000 repetitions, with varying block size B. [sent-256, score-0.385]
74 As B grows, the Type II error drops quickly when the kernel is appropriately chosen; the kernel selection method is described in [9], and closely approximates the baseline performance of the well-informed user choice. [sent-257, score-0.39]
75 Figure 4 presents results for different kernel choice strategies, as a function of B. [sent-258, score-0.255]
76 As discussed in [9, Section 5], the reason for this failure is that the lengthscale of the difference between the distributions P and Q differs from the lengthscale of the main data variation as captured by the median, which gives too broad a kernel for the data. [sent-261, score-0.454]
77 We again fixed the same Type I error for all methods, but this time we also fixed a Type II error of 5%, increasing the number of samples until the latter error rate was achieved. [sent-263, score-0.264]
78 For the user-chosen kernel (σ = 1, Figure 2(a)), the number of Type I errors closely matches the targeted test level. [sent-272, score-0.447]
79 When the median heuristic is used, however, the test is overly conservative, and makes fewer Type I errors than required (Figure 2(b)). [sent-273, score-0.285]
80 This indicates that for this choice of σ, we are not in the asymptotic regime, and our Gaussian null distribution approximation is inaccurate. [sent-274, score-0.393]
81 This setting coincides with a block size substantially larger than 2 (MMDl ), and therefore achieves lower Type II errors while retaining the targeted Type I error. [sent-276, score-0.224]
82 As the feature vector had size 1000, we set the block size B = √1000 ≈ 32. [sent-285, score-0.135]
83 Figure 5 shows the average number of Type I errors as a function of B: in this case, all kernel selection strategies result in conservative tests (lower Type I error than required), indicating that more samples are needed to reach the asymptotic regime. [sent-287, score-0.751]
84 [Table 2 residue: the table compares kernel parameters (σ = 1, σ = median, multiple kernels), methods (B-test; MMDu with Gram matrix spectrum or Bootstrap null estimates), block settings (B = 2, B = √n, B = n/2, B = 2000), and Type I / Type II error rates; numeric entries are not recoverable here.] [sent-290, score-0.435]
85 Table 2: A comparison of consistent tests on the music experiment described in Section 3.2. [sent-313, score-0.313]
86 Here computation time is reported for the test achieving the stated error rates. [sent-315, score-0.17]
87 [Figure 5 residue: panels show the empirical vs. expected Type I error as a function of the size of the inner block (2 to 128), for B = √n.] [sent-319, score-0.134]
88 Figure 5: Empirical Type I error rate for α = 5% on the music data (Section 3.2). [sent-340, score-0.649]
89 (a) A single kernel test with σ = 1, (b) a single kernel test with σ = median, and (c) for multiple kernels. [sent-342, score-0.716]
90 The results show that the B-test has a much better sample complexity than MMDl over all tested kernel selection strategies. [sent-345, score-0.327]
91 Moreover, it is an order of magnitude faster than any test that consistently estimates the null distribution for MMDu (i. [sent-346, score-0.453]
92 , the Gram matrix eigenspectrum and bootstrap estimates): these estimates are impractical at large sample sizes, due to their computational complexity. [sent-348, score-0.294]
93 First, we have observed that the empirical Type I error rate is often conservative, and is less than the 5% targeted by the threshold based on a Gaussian null distribution assumption (Figures 2 and 5). [sent-352, score-0.477]
94 Equation (7) implies that the size of B does not influence the asymptotic variance under HA , however we observe in Figure 1 that the empirical variance of HA drops with larger B. [sent-355, score-0.253]
95 This is because, for these P and Q and small B, the null and alternative distributions have considerable overlap. [sent-356, score-0.34]
96 Hence, given the distributions are effectively indistinguishable at these sample sizes n, the variance of the alternative distribution as a function of B behaves more like that of H0 (cf. [sent-357, score-0.22]
97 Finally, [13] propose an alternative approach for U-statistic based testing when the degree of degeneracy is known: a new U-statistic (the TU-statistic) is written in terms of products of centred U-statistics computed on the individual blocks, and a test is formulated using this TU-statistic. [sent-360, score-0.231]
98 Ho and Shieh show that a TU-statistic based test can be asymptotically more powerful than a test using a single U-statistic on the whole sample, when the latter is degenerate under H0 , and nondegenerate under HA . [sent-361, score-0.312]
99 On the convergence of moments in the central limit theorem. [sent-366, score-0.191]
100 Kernels based tests with nonasymptotic bootstrap approaches for two-sample problems. [sent-395, score-0.359]
wordName wordTfidf (topN-words)
[('mmdu', 0.368), ('ha', 0.323), ('null', 0.289), ('mmd', 0.26), ('kernel', 0.255), ('bootstrap', 0.222), ('mmdl', 0.216), ('type', 0.215), ('tests', 0.137), ('block', 0.135), ('statistic', 0.128), ('gretton', 0.113), ('gram', 0.106), ('test', 0.103), ('hypothesis', 0.102), ('median', 0.102), ('central', 0.088), ('pearson', 0.087), ('fk', 0.081), ('testing', 0.077), ('conservative', 0.075), ('spectrum', 0.074), ('asymptotic', 0.074), ('lengthscale', 0.074), ('shieh', 0.074), ('discrepancy', 0.069), ('borel', 0.068), ('rkhs', 0.068), ('error', 0.067), ('variance', 0.064), ('ii', 0.064), ('asymptotically', 0.064), ('samples', 0.063), ('limit', 0.062), ('skewness', 0.06), ('ho', 0.06), ('kernels', 0.051), ('empirical', 0.051), ('fukumizu', 0.051), ('degeneracy', 0.051), ('distributions', 0.051), ('blocks', 0.049), ('sriperumbudur', 0.049), ('quantiles', 0.049), ('errors', 0.049), ('sehnsucht', 0.049), ('normality', 0.048), ('quantile', 0.046), ('power', 0.046), ('music', 0.045), ('sch', 0.044), ('inner', 0.044), ('malte', 0.043), ('averaging', 0.043), ('arthur', 0.042), ('degenerate', 0.042), ('gaussian', 0.042), ('moments', 0.041), ('sample', 0.041), ('gamma', 0.041), ('synthetic', 0.041), ('sejdinovic', 0.04), ('wrongly', 0.04), ('amplitude', 0.04), ('targeted', 0.04), ('distance', 0.039), ('curves', 0.038), ('kely', 0.037), ('ib', 0.037), ('karsten', 0.037), ('zl', 0.037), ('visible', 0.035), ('histogram', 0.035), ('theorem', 0.034), ('sizes', 0.034), ('reproducing', 0.034), ('normal', 0.033), ('rasch', 0.033), ('ez', 0.033), ('rejecting', 0.033), ('consistency', 0.033), ('lkopf', 0.032), ('fisher', 0.032), ('harchaoui', 0.032), ('family', 0.031), ('ex', 0.031), ('selection', 0.031), ('estimates', 0.031), ('borgwardt', 0.031), ('ch', 0.031), ('overly', 0.031), ('thresholds', 0.03), ('hilbert', 0.03), ('distribution', 0.03), ('discriminant', 0.03), ('bernhard', 0.03), ('consistent', 0.029), ('ks', 0.029), ('nb', 0.029), ('pairwise', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
2 0.21857095 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
3 0.19208193 9 nips-2013-A Kernel Test for Three-Variable Interactions
Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma
Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.
4 0.17720604 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
Author: Chris Hinrichs, Vamsi Ithapu, Qinyuan Sun, Sterling C. Johnson, Vikas Singh
Abstract: Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomena, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub–sampled — on the order of 0.5% — matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable. 1
5 0.15571344 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf
Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1
6 0.14840332 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
7 0.14695737 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
8 0.13745368 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
9 0.12119702 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
10 0.095036201 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
11 0.090892494 289 nips-2013-Scalable kernels for graphs with continuous attributes
12 0.087314546 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
13 0.079868078 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
14 0.07981921 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
15 0.07945209 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
16 0.076499201 224 nips-2013-On the Sample Complexity of Subspace Learning
17 0.075591981 36 nips-2013-Annealing between distributions by averaging moments
18 0.069960386 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
19 0.067949392 137 nips-2013-High-Dimensional Gaussian Process Bandits
20 0.067637123 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
topicId topicWeight
[(0, 0.192), (1, 0.083), (2, 0.019), (3, 0.034), (4, -0.002), (5, 0.064), (6, 0.029), (7, 0.006), (8, -0.121), (9, 0.024), (10, -0.025), (11, -0.091), (12, -0.111), (13, -0.112), (14, 0.259), (15, 0.143), (16, 0.022), (17, 0.102), (18, -0.155), (19, -0.095), (20, -0.019), (21, -0.068), (22, -0.012), (23, -0.001), (24, -0.065), (25, 0.002), (26, 0.014), (27, 0.009), (28, 0.037), (29, 0.034), (30, 0.162), (31, 0.037), (32, 0.042), (33, -0.026), (34, -0.011), (35, -0.071), (36, -0.023), (37, 0.088), (38, 0.015), (39, 0.037), (40, -0.004), (41, 0.034), (42, 0.061), (43, 0.074), (44, -0.012), (45, 0.017), (46, 0.028), (47, -0.122), (48, 0.004), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95543271 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
2 0.91307944 9 nips-2013-A Kernel Test for Three-Variable Interactions
Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma
Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.
3 0.8684535 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf
Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1
4 0.74190187 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: q We address the problem of estimating the ratio p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
5 0.71881825 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.70323759 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
7 0.62732857 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
8 0.62369162 327 nips-2013-The Randomized Dependence Coefficient
9 0.62054926 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
10 0.5401479 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
11 0.53257757 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
12 0.51422834 289 nips-2013-Scalable kernels for graphs with continuous attributes
13 0.5137195 76 nips-2013-Correlated random features for fast semi-supervised learning
14 0.44683844 224 nips-2013-On the Sample Complexity of Subspace Learning
15 0.4453325 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
16 0.41006252 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
17 0.40559921 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
18 0.38487527 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
19 0.3822813 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport
20 0.38102081 36 nips-2013-Annealing between distributions by averaging moments
topicId topicWeight
[(16, 0.026), (19, 0.044), (33, 0.178), (34, 0.113), (36, 0.211), (41, 0.025), (49, 0.042), (56, 0.127), (70, 0.026), (85, 0.032), (89, 0.038), (93, 0.052), (95, 0.012)]
simIndex simValue paperId paperTitle
1 0.90719682 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
2 0.88317746 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
3 0.87830907 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
same-paper 4 0.85956043 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
5 0.85273063 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
6 0.7978937 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
7 0.78711456 233 nips-2013-Online Robust PCA via Stochastic Optimization
8 0.77189475 201 nips-2013-Multi-Task Bayesian Optimization
9 0.77137703 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
10 0.77007622 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
11 0.76902229 249 nips-2013-Polar Operators for Structured Sparse Estimation
12 0.76839721 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
13 0.76623464 324 nips-2013-The Fast Convergence of Incremental PCA
14 0.7646246 75 nips-2013-Convex Two-Layer Modeling
15 0.76412702 173 nips-2013-Least Informative Dimensions
16 0.7641089 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
17 0.76384538 232 nips-2013-Online PCA for Contaminated Data
18 0.7637974 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
19 0.76296008 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
20 0.76254541 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization