nips nips2013 nips2013-44 knowledge-graph by maker-knowledge-mining

44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test


Source: pdf

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. [sent-6, score-0.492]

2 Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. [sent-7, score-0.365]

3 The choice of block size allows control over the tradeoff between test power and computation time. [sent-8, score-0.284]

4 A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. [sent-10, score-1.034]

5 Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. [sent-11, score-0.869]

6 Given samples {xi}, i = 1, …, n, drawn from P and {yi}, i = 1, …, n, drawn from Q over R^d, the two-sample problem consists in testing whether to accept or reject the null hypothesis H0 that P = Q, vs the alternative hypothesis HA that P and Q are different. [sent-17, score-0.611]

7 This problem has recently been addressed using measures of similarity computed in a reproducing kernel Hilbert space (RKHS), which apply in very general settings where P and Q might be distributions over high dimensional data or structured objects. [sent-18, score-0.34]

8 When used in testing, it is necessary to determine whether the empirical estimate of the relevant similarity measure is sufficiently large as to give the hypothesis P = Q low probability; i. [sent-20, score-0.153]

9 The test power denotes the probability of correctly rejecting the null hypothesis, given that P ≠ Q. [sent-23, score-0.471]

10 Unfortunately, this statistic is degenerate under the null hypothesis H0 that P = Q, and its asymptotic distribution takes the form of an infinite weighted sum of independent χ2 variables (it is asymptotically Gaussian under the alternative hypothesis HA that P ≠ Q). [sent-25, score-0.831]

11 Two methods for empirically estimating the null distribution in a consistent way have been proposed: the bootstrap [10], and a method requiring an eigendecomposition of the kernel matrices computed on the merged samples from P and Q [7]. [sent-26, score-0.888]

12 Another approach is to approximate the null distribution by a member of a simpler parametric family (for instance, a Pearson curve approximation), however this has no consistency guarantees. [sent-28, score-0.383]

13 While this has much greater variance than the U-statistic, it also has a simpler null distribution: being an average over i.i.d. [sent-30, score-0.353]

14 terms, the central limit theorem gives an asymptotically Normal distribution, under both H0 and HA . [sent-33, score-0.248]

15 Kernel selection for the U-statistic is a much harder question due to the complex form of the null distribution, and remains an open problem. [sent-35, score-0.32]

16 A major reason MMDl is faster is that its null distribution is straightforward to compute, since it is Gaussian and its variance can be calculated at the same cost as the test statistic. [sent-37, score-0.486]

17 A reasonable next step would be to find a compromise between these two extremes: to construct a statistic with a lower variance than MMDl , while retaining an asymptotically Gaussian null distribution (hence remaining faster than tests based on MMDu ). [sent-38, score-0.712]

18 We study a family of such test statistics, where we split the data into blocks of size B, compute the quadratic-time MMDu on each block, and then average the resulting statistics. [sent-39, score-0.183]
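
To make the block construction just described concrete, here is a minimal NumPy sketch; it is an illustrative reimplementation based only on the description above (the names gaussian_kernel, mmd_u_block and b_test_statistic are ours, not the authors'), assuming the standard unbiased quadratic-time MMD estimator within each block and a Gaussian kernel.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # RBF kernel matrix between the rows of X and the rows of Y
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def mmd_u_block(X, Y, sigma=1.0):
    # Unbiased quadratic-time MMD^2 estimate (eta_k) on a single block of size B
    B = X.shape[0]
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)  # drop the a == b terms
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)
    return (Kxx.sum() + Kyy.sum() - 2.0 * Kxy.sum()) / (B * (B - 1))

def b_test_statistic(X, Y, B, sigma=1.0):
    # Split the n paired samples into n // B blocks, compute MMD_u on each, then average
    n = X.shape[0]
    etas = np.array([mmd_u_block(X[i * B:(i + 1) * B], Y[i * B:(i + 1) * B], sigma)
                     for i in range(n // B)])
    return etas.mean(), etas
```

Each block costs O(B²) kernel evaluations, so the full statistic costs roughly O(nB), interpolating between the linear-time pairing (B = 2) and the quadratic-time single block (B = n).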

19 As long as we choose the size B of blocks such that n/B → ∞, we are still guaranteed asymptotic Normality by the central limit theorem, and the null distribution can be computed at the same cost as the test statistic. [sent-41, score-0.695]

20 For a given sample size n, however, the power of the test can increase dramatically over the MMDl test, even for moderate block sizes B, making much better use of the available data with only a small increase in computation. [sent-42, score-0.359]

21 The block averaging scheme was originally proposed in [13], as an instance of a two-stage U-statistic, to be applied when the degree of degeneracy of the U-statistic is indeterminate. [sent-43, score-0.229]

22 Differences with respect to our method are that Ho and Shieh compute the block statistics by sampling with replacement [13, (b) p. [sent-44, score-0.135]

23 863], and propose to obtain the variance of the test statistic via Monte Carlo, jackknife, or bootstrap techniques, whereas we use closed form expressions. [sent-45, score-0.517]

24 Finally, the kernel learning approach of [9] applies straightforwardly, allowing us to maximize test power over a given kernel family. [sent-48, score-0.659]

25 We begin with a brief review of kernel methods, and of the maximum mean discrepancy. [sent-52, score-0.255]

26 The central idea employed in the construction of the B-test is to generate a low variance MMD estimate by averaging multiple low variance kernel statistics computed over blocks of samples. [sent-54, score-0.563]

27 We show simple sufficient conditions on the block size for consistency of the estimator. [sent-55, score-0.168]

28 Furthermore, we analyze the properties of the finite sample estimate, and propose a consistent strategy for setting the block size as a function of the number of samples. [sent-56, score-0.205]

29 1 Definition and asymptotics of the block-MMD Let Fk be an RKHS defined on a topological space X with reproducing kernel k, and P a Borel probability measure on X . [sent-58, score-0.289]

30 The mean embedding of P in Fk, written µk(p) ∈ Fk, is defined such that … [Figure 1 residue: histograms of the statistic under HA and under H0, with the approximated 5% quantile of H0 marked.] [sent-59, score-0.232]

31 As we vary B, we trade off the quality of the finite sample Gaussian approximation to the null distribution, as in Theorem 2. [sent-74, score-0.33]

32 05 Kolmogorov-Smirnov (KS) normality test [16, 20]), and a Gaussian approximation results in a conservative test threshold (vertical green line). [sent-78, score-0.329]

33 The remaining empirical distributions all pass a KS normality test. [sent-79, score-0.15]

34 (2) When the kernel k is characteristic, then ηk (P, Q) = 0 iff P = Q [21]. [sent-83, score-0.255]

35 ηk(i) is the unbiased quadratic-time MMDu estimate computed on the i-th block, i.e. over the sample indices a, b = (i−1)B+1, …, iB with b ≠ a (Eq. 3). The B-test statistic is an MMD estimate obtained by averaging the ηk(i). [sent-88, score-0.171]

36 Although setting B = n would lead to the lowest variance estimate of the MMD, computing sound thresholds for a given p-value is expensive, involving repeated bootstrap sampling [5, 14], or computing the eigenvalues of a Gram matrix [7]. [sent-90, score-0.316]
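
For contrast with the quadratic-time route just mentioned, a permutation-style resampling estimate of the MMDu null threshold (one common way to implement the bootstrap null) might look like the sketch below; it assumes the hypothetical mmd_u_block helper from the earlier sketch and mainly illustrates why this approach costs O(n²) per resample.

```python
import numpy as np

def permutation_threshold(X, Y, sigma=1.0, n_perm=500, alpha=0.05, seed=0):
    # Estimate the (1 - alpha) null quantile of MMD_u by repeatedly re-splitting the pooled sample
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = X.shape[0]
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(Z.shape[0])
        null_stats.append(mmd_u_block(Z[idx[:n]], Z[idx[n:]], sigma))
    return np.quantile(null_stats, 1.0 - alpha)
```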

37 variables, and averaging them allows us to apply the central limit theorem in order to estimate p-values from a normal distribution. [sent-97, score-0.26]

38 ηk = (B/n) Σ_{i=1,…,n/B} ηk(i) (Eq. 4). We would like to apply the central limit theorem to the variables ηk(i), i = 1, …, n/B. [sent-99, score-0.184]

39 The central limit theorem then implies that under HA, ηk → N(MMD², σu²(B · n/B)⁻¹) = N(MMD², σu² n⁻¹) in distribution. [sent-111, score-0.184]

40 This result (Eq. 7) shows that the distribution of ηk under HA is asymptotically independent of the block size, B. [sent-112, score-0.229]

41 Turning to the null hypothesis, [11, Theorem 8] additionally implies that under H0 for every i, Theorem 2. [sent-113, score-0.289]

42 The central limit theorem implies that under H0, ηk → N(0, C(B² · n/B)⁻¹) in distribution. The asymptotic distributions for ηk under H0 and HA are Gaussian, and consequently it is easy to calculate the distribution quantiles and test thresholds. [sent-119, score-0.491]
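
Because both limits are Gaussian, thresholds and p-values come from standard normal quantiles. The sketch below studentizes the averaged statistic with the empirical variance of the block statistics (e.g. the etas array returned by the b_test_statistic sketch above) as a stand-in for the closed-form variance expressions used in the paper, so it is only an approximation.

```python
import numpy as np
from scipy.stats import norm

def b_test_decision(etas, alpha=0.05):
    # etas: per-block statistics eta_k(i); under H0 their mean is approximately 0
    etas = np.asarray(etas)
    se = etas.std(ddof=1) / np.sqrt(len(etas))  # standard error of the averaged statistic
    stat = etas.mean()
    p_value = norm.sf(stat / se)  # one-sided: a large MMD estimate favours H_A
    return p_value, p_value < alpha  # reject H0 when p_value < alpha
```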

43 A related strategy of averaging over data blocks to deal with large sample sizes has recently been developed in [15], with the goal of efficiently computing bootstrapped estimates of statistics of interest (e. [sent-122, score-0.198]

44 Briefly, the approach splits the data (of size n) into s subsamples each of size B, computes an estimate of the n-fold bootstrap on each block, and averages these estimates. [sent-125, score-0.222]

45 The central limit theorem implies that the empirical mean of {ηk(i)}, i = 1, …, n/B, … [sent-129, score-0.235]

46 4 Under both H0 and HA , thresholds computed from normal distribution tables are asymptotically unbiased. [sent-139, score-0.157]

47 The skewness for the entire sum is finite and positive, C = Σ_{l=1}^{∞} 8λl³ (Eq. 11), as λl ≥ 0 for all l due to the positive definiteness of the kernel k. [sent-143, score-0.315]

48 At smaller sample sizes, test thresholds obtained from the standard Normal table may therefore be inaccurate, as they do not account for this skew. [sent-145, score-0.174]

49 In our experiments, this bias caused the tests to be overly conservative, with lower Type I error than the design level required (Figures 2 and 5). [sent-146, score-0.235]

50 Finite Sample Case: In the finite sample case, we apply the Berry-Esséen theorem, which gives conservative bounds on the ℓ∞ convergence of a series of finite sample random variables to a Gaussian distribution [4]. [sent-148, score-0.187]

51 , it is dependent only on the underlying distributions of the samples and not on the sample size. [sent-163, score-0.155]

52 While the asymptotic results indicate that convergence to an optimal predictor is fastest for larger B, the finite sample results support decreasing the size of B in order to have a sufficient number n/B of samples for application of the central limit theorem. [sent-172, score-0.328]

53 When B is small, we have many samples, hence the null distribution is close to the asymptotic limit provided by the central limit theorem, and the Type I error is estimated accurately. [sent-175, score-0.672]

54 The disadvantage of a small B is a lower test power for a given sample size. [sent-176, score-0.19]

55 Conversely, if we increase B, we will have a lower variance empirical distribution for H0 , hence higher test power, but we may have a poor estimate of the number of Type I errors (Figure 1). [sent-177, score-0.297]

56 In this setting the number of samples available for application of the central limit theorem will be ⌊n^(1−γ)⌋. [sent-179, score-0.247]
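
For instance, choosing B = ⌊n^γ⌋ makes the number of blocks grow roughly like n^(1−γ); γ = 1/2 recovers the B = √n setting used in the experiments below. A one-line helper (our own naming):

```python
import math

def block_size(n, gamma=0.5):
    # B = floor(n**gamma); the number of blocks n // B then grows roughly like n**(1 - gamma)
    return max(2, math.floor(n ** gamma))
```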

57 We evaluate B-test performance in comparison to the MMDl and MMDu estimators, where for the latter we compare across different strategies for null distribution quantile estimation. [sent-185, score-0.365]

58 [Table 1 residue: '> 60000, or 2h per iteration timeout', σ = median.] Table 1: Sample complexity for tests on the distributions described in Figure 3. [sent-196, score-0.29]

59–62 [Figure 2 residue: empirical vs. expected Type I error as a function of the size of the inner block, for B = √n and B = n/2.]

63 Synthetic data: Following previous work on kernel hypothesis testing [9], our synthetic distributions are 5 × 5 grids of 2D Gaussians. [sent-227, score-0.526]

64 These distributions have proved to be very challenging for existing non-parametric two-sample tests [9]. [sent-231, score-0.188]

65 We employed three different kernel selection strategies in the hypothesis test. [sent-232, score-0.388]

66 First, we used a Gaussian kernel with σ = 1, which approximately matches the scale of the variance of each Gaussian in mixture P . [sent-233, score-0.319]

67 Next, we set σ equal to the median pairwise distance over the training data, which is a standard way to choose σ for the Gaussian kernel. [Figure 3: Synthetic data distributions P and Q.] [sent-235, score-0.476]
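
The median heuristic mentioned here is usually computed as the median of the pairwise Euclidean distances over the training points; a short sketch (the function name is ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_sigma(Z):
    # Z stacks the training points row-wise; sigma = median pairwise Euclidean distance
    return float(np.median(pdist(Z)))
```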

68 Finally, we applied a kernel learning strategy, in which the kernel was optimized to maximize the test power for the alternative P ≠ Q [9]. [sent-238, score-0.659]

69 This approach returned a non-negative linear combination of base kernels, where half the data were used in learning the kernel weights (these data were excluded from the testing phase). [sent-239, score-0.332]

70 For methods using Pearson curves and the Gram matrix spectrum, we drew 500 samples from the null distribution estimates to obtain the 1 − α quantiles, for a test of level α. [sent-246, score-0.554]

71 We considered only the setting with σ = 1 and σ set to the median pairwise distance, as kernel selection is not yet solved for tests using MMDu [9]. [sent-249, score-0.554]

72 [Figure 4 caption: Synthetic experiment: number of Type II errors vs B, given a fixed probability α of Type I errors. As B grows, the Type II error drops quickly when the kernel is appropriately chosen. The kernel selection method is described in [9], and closely approximates the baseline performance of the well-informed …] In the first experiment we set the Type I error to be 5%, and we recorded the Type II error. [sent-254, score-0.246]

73 We conducted these experiments on 2000 samples over 1000 repetitions, with varying block size, B. [sent-256, score-0.385]

75 Figure 4 presents results for different kernel choice strategies, as a function of B. [sent-258, score-0.255]

76 As discussed in [9, Section 5], the reason for this failure is that the lengthscale of the difference between the distributions P and Q differs from the lengthscale of the main data variation as captured by the median, which gives too broad a kernel for the data. [sent-261, score-0.454]

77 We again fixed the same Type I error for all methods, but this time we also fixed a Type II error of 5%, increasing the number of samples until the latter error rate was achieved. [sent-263, score-0.264]

78 For the user-chosen kernel (σ = 1, Figure 2(a)), the number of Type I errors closely matches the targeted test level. [sent-272, score-0.447]

79 When the median heuristic is used, however, the test is overly conservative, and makes fewer Type I errors than required (Figure 2(b)). [sent-273, score-0.285]

80 This indicates that for this choice of σ, we are not in the asymptotic regime, and our Gaussian null distribution approximation is inaccurate. [sent-274, score-0.393]

81 This setting coincides with a block size substantially larger than 2 (MMDl ), and therefore achieves lower Type II errors while retaining the targeted Type I error. [sent-276, score-0.224]

82 As the feature vector had size 1000, we set the block size B = √1000 ≈ 32. [sent-285, score-0.135]

83 Figure 5 shows the average number of Type I errors as a function of B: in this case, all kernel selection strategies result in conservative tests (lower Type I error than required), indicating that more samples are needed to reach the asymptotic regime. [sent-287, score-0.751]

84 The results show that the B-test has a much better … [Table 2 residue: methods B-test (B = 2, B = √n, B = n/2), Gram matrix spectrum, and Bootstrap, with kernel parameters σ = 1, σ = median, and multiple kernels; columns report Type I and Type II error.] [sent-290, score-0.435]

85 Table 2: A comparison of consistent tests on the music experiment described in Section 3. [sent-313, score-0.313]

86 Here computation time is reported for the test achieving the stated error rates. [sent-315, score-0.17]

87 [Figure 5 residue: legend 'Empirical Type I error' / 'Expected Type I error', B = √n.] [sent-319, score-0.134]

88 [Figure 5 residue: x-axis 'Size of inner block', panels (a)–(c).] Figure 5: Empirical Type I error rate for α = 5% on the music data (Section 3. [sent-340, score-0.649]

89 (a) A single kernel test with σ = 1, (b) A single kernel test with σ = median, and (c) for multiple kernels. [sent-342, score-0.716]

90 sample complexity than MMDl over all tested kernel selection strategies. [sent-345, score-0.327]

91 Moreover, it is an order of magnitude faster than any test that consistently estimates the null distribution for MMDu (i. [sent-346, score-0.453]

92 , the Gram matrix eigenspectrum and bootstrap estimates): these estimates are impractical at large sample sizes, due to their computational complexity. [sent-348, score-0.294]

93 First, we have observed that the empirical Type I error rate is often conservative, and is less than the 5% targeted by the threshold based on a Gaussian null distribution assumption (Figures 2 and 5). [sent-352, score-0.477]

94 Equation (7) implies that the size of B does not influence the asymptotic variance under HA; however, we observe in Figure 1 that the empirical variance under HA drops with larger B. [sent-355, score-0.253]

95 This is because, for these P and Q and small B, the null and alternative distributions have considerable overlap. [sent-356, score-0.34]

96 Hence, given the distributions are effectively indistinguishable at these sample sizes n, the variance of the alternative distribution as a function of B behaves more like that of H0 (cf. [sent-357, score-0.22]

97 Finally, [13] propose an alternative approach for U-statistic based testing when the degree of degeneracy is known: a new U-statistic (the TU-statistic) is written in terms of products of centred U-statistics computed on the individual blocks, and a test is formulated using this TU-statistic. [sent-360, score-0.231]

98 Ho and Shieh show that a TU-statistic based test can be asymptotically more powerful than a test using a single U-statistic on the whole sample, when the latter is degenerate under H0 , and nondegenerate under HA . [sent-361, score-0.312]

99 On the convergence of moments in the central limit theorem. [sent-366, score-0.191]

100 Kernels based tests with nonasymptotic bootstrap approaches for two-sample problems. [sent-395, score-0.359]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mmdu', 0.368), ('ha', 0.323), ('null', 0.289), ('mmd', 0.26), ('kernel', 0.255), ('bootstrap', 0.222), ('mmdl', 0.216), ('type', 0.215), ('tests', 0.137), ('block', 0.135), ('statistic', 0.128), ('gretton', 0.113), ('gram', 0.106), ('test', 0.103), ('hypothesis', 0.102), ('median', 0.102), ('central', 0.088), ('pearson', 0.087), ('fk', 0.081), ('testing', 0.077), ('conservative', 0.075), ('spectrum', 0.074), ('asymptotic', 0.074), ('lengthscale', 0.074), ('shieh', 0.074), ('discrepancy', 0.069), ('borel', 0.068), ('rkhs', 0.068), ('error', 0.067), ('variance', 0.064), ('ii', 0.064), ('asymptotically', 0.064), ('samples', 0.063), ('limit', 0.062), ('skewness', 0.06), ('ho', 0.06), ('kernels', 0.051), ('empirical', 0.051), ('fukumizu', 0.051), ('degeneracy', 0.051), ('distributions', 0.051), ('blocks', 0.049), ('sriperumbudur', 0.049), ('quantiles', 0.049), ('errors', 0.049), ('sehnsucht', 0.049), ('normality', 0.048), ('quantile', 0.046), ('power', 0.046), ('music', 0.045), ('sch', 0.044), ('inner', 0.044), ('malte', 0.043), ('averaging', 0.043), ('arthur', 0.042), ('degenerate', 0.042), ('gaussian', 0.042), ('moments', 0.041), ('sample', 0.041), ('gamma', 0.041), ('synthetic', 0.041), ('sejdinovic', 0.04), ('wrongly', 0.04), ('amplitude', 0.04), ('targeted', 0.04), ('distance', 0.039), ('curves', 0.038), ('kely', 0.037), ('ib', 0.037), ('karsten', 0.037), ('zl', 0.037), ('visible', 0.035), ('histogram', 0.035), ('theorem', 0.034), ('sizes', 0.034), ('reproducing', 0.034), ('normal', 0.033), ('rasch', 0.033), ('ez', 0.033), ('rejecting', 0.033), ('consistency', 0.033), ('lkopf', 0.032), ('fisher', 0.032), ('harchaoui', 0.032), ('family', 0.031), ('ex', 0.031), ('selection', 0.031), ('estimates', 0.031), ('borgwardt', 0.031), ('ch', 0.031), ('overly', 0.031), ('thresholds', 0.03), ('hilbert', 0.03), ('distribution', 0.03), ('discriminant', 0.03), ('bernhard', 0.03), ('consistent', 0.029), ('ks', 0.029), ('nb', 0.029), ('pairwise', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

2 0.21857095 173 nips-2013-Least Informative Dimensions

Author: Fabian Sinz, Anna Stockl, Jan Grewe, Jan Benda

Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1

3 0.19208193 9 nips-2013-A Kernel Test for Three-Variable Interactions

Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma

Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.

4 0.17720604 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

Author: Chris Hinrichs, Vamsi Ithapu, Qinyuan Sun, Sterling C. Johnson, Vikas Singh

Abstract: Multiple hypothesis testing is a significant problem in nearly all neuroimaging studies. In order to correct for this phenomena, we require a reliable estimate of the Family-Wise Error Rate (FWER). The well known Bonferroni correction method, while simple to implement, is quite conservative, and can substantially under-power a study because it ignores dependencies between test statistics. Permutation testing, on the other hand, is an exact, non-parametric method of estimating the FWER for a given α-threshold, but for acceptably low thresholds the computational burden can be prohibitive. In this paper, we show that permutation testing in fact amounts to populating the columns of a very large matrix P. By analyzing the spectrum of this matrix, under certain conditions, we see that P has a low-rank plus a low-variance residual decomposition which makes it suitable for highly sub–sampled — on the order of 0.5% — matrix completion methods. Based on this observation, we propose a novel permutation testing methodology which offers a large speedup, without sacrificing the fidelity of the estimated FWER. Our evaluations on four different neuroimaging datasets show that a computational speedup factor of roughly 50× can be achieved while recovering the FWER distribution up to very high accuracy. Further, we show that the estimated α-threshold is also recovered faithfully, and is stable. 1

5 0.15571344 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf

Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1

6 0.14840332 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

7 0.14695737 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

8 0.13745368 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

9 0.12119702 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

10 0.095036201 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

11 0.090892494 289 nips-2013-Scalable kernels for graphs with continuous attributes

12 0.087314546 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory

13 0.079868078 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

14 0.07981921 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

15 0.07945209 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

16 0.076499201 224 nips-2013-On the Sample Complexity of Subspace Learning

17 0.075591981 36 nips-2013-Annealing between distributions by averaging moments

18 0.069960386 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

19 0.067949392 137 nips-2013-High-Dimensional Gaussian Process Bandits

20 0.067637123 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.192), (1, 0.083), (2, 0.019), (3, 0.034), (4, -0.002), (5, 0.064), (6, 0.029), (7, 0.006), (8, -0.121), (9, 0.024), (10, -0.025), (11, -0.091), (12, -0.111), (13, -0.112), (14, 0.259), (15, 0.143), (16, 0.022), (17, 0.102), (18, -0.155), (19, -0.095), (20, -0.019), (21, -0.068), (22, -0.012), (23, -0.001), (24, -0.065), (25, 0.002), (26, 0.014), (27, 0.009), (28, 0.037), (29, 0.034), (30, 0.162), (31, 0.037), (32, 0.042), (33, -0.026), (34, -0.011), (35, -0.071), (36, -0.023), (37, 0.088), (38, 0.015), (39, 0.037), (40, -0.004), (41, 0.034), (42, 0.061), (43, 0.074), (44, -0.012), (45, 0.017), (46, 0.028), (47, -0.122), (48, 0.004), (49, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95543271 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

2 0.91307944 9 nips-2013-A Kernel Test for Three-Variable Interactions

Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma

Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.

3 0.8684535 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.

Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf

Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1

4 0.74190187 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: We address the problem of estimating the ratio q/p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1

5 0.71881825 173 nips-2013-Least Informative Dimensions

Author: Fabian Sinz, Anna Stockl, Jan Grewe, Jan Benda

Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1

6 0.70323759 156 nips-2013-Learning Kernels Using Local Rademacher Complexity

7 0.62732857 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

8 0.62369162 327 nips-2013-The Randomized Dependence Coefficient

9 0.62054926 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space

10 0.5401479 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

11 0.53257757 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

12 0.51422834 289 nips-2013-Scalable kernels for graphs with continuous attributes

13 0.5137195 76 nips-2013-Correlated random features for fast semi-supervised learning

14 0.44683844 224 nips-2013-On the Sample Complexity of Subspace Learning

15 0.4453325 40 nips-2013-Approximate Inference in Continuous Determinantal Processes

16 0.41006252 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

17 0.40559921 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

18 0.38487527 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

19 0.3822813 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

20 0.38102081 36 nips-2013-Annealing between distributions by averaging moments


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.026), (19, 0.044), (33, 0.178), (34, 0.113), (36, 0.211), (41, 0.025), (49, 0.042), (56, 0.127), (70, 0.026), (85, 0.032), (89, 0.038), (93, 0.052), (95, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90719682 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

2 0.88317746 188 nips-2013-Memory Limited, Streaming PCA

Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain

Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1

3 0.87830907 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Author: Karin C. Knudson, Jonathan W. Pillow

Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1

same-paper 4 0.85956043 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

5 0.85273063 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

Author: Prashanth L.A., Mohammad Ghavamzadeh

Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1

6 0.7978937 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

7 0.78711456 233 nips-2013-Online Robust PCA via Stochastic Optimization

8 0.77189475 201 nips-2013-Multi-Task Bayesian Optimization

9 0.77137703 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

10 0.77007622 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

11 0.76902229 249 nips-2013-Polar Operators for Structured Sparse Estimation

12 0.76839721 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

13 0.76623464 324 nips-2013-The Fast Convergence of Incremental PCA

14 0.7646246 75 nips-2013-Convex Two-Layer Modeling

15 0.76412702 173 nips-2013-Least Informative Dimensions

16 0.7641089 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting

17 0.76384538 232 nips-2013-Online PCA for Contaminated Data

18 0.7637974 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

19 0.76296008 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

20 0.76254541 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization