nips nips2012 nips2012-264 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
Reference: text
sentIndex sentText sentNum sentScore
1 Optimal kernel choice for large-scale two-sample tests Arthur Gretton,1,3 Bharath Sriperumbudur,1 Dino Sejdinovic,1 Heiko Strathmann2 1 Gatsby Unit and 2 CSD, CSML, UCL, UK; 3 MPI for Intelligent Systems, Germany {arthur. [sent-1, score-0.345]
2 Abstract Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. [sent-13, score-0.713]
3 One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. [sent-14, score-0.958]
4 The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. [sent-15, score-0.472]
5 For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. [sent-17, score-0.47]
6 The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. [sent-18, score-0.499]
7 These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. [sent-19, score-0.482]
8 In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics. [sent-20, score-0.805]
9 In the setting of statistical hypothesis testing, this corresponds to choosing whether to reject the null hypothesis H0 that the generating distributions p and q are the same, vs. [sent-22, score-0.35]
10 the alternative hypothesis HA that distributions p and q are different, given a set of independent observations drawn from each. [sent-23, score-0.141]
11 A number of recent approaches to two-sample testing have made use of mappings of the distributions to a reproducing kernel Hilbert space (RKHS); or have sought out RKHS functions with large amplitude where the probability mass of p and q differs most [8, 10, 15, 17, 7]. [sent-24, score-0.522]
12 The most straightforward test statistic is the norm of the difference between distribution embeddings, and is called the maximum mean discrepancy (MMD). [sent-25, score-0.463]
13 One difficulty in using this statistic in a hypothesis test, however, is that the MMD depends on the choice of the kernel. [sent-26, score-0.424]
14 When a radial basis function kernel (such as the Gaussian kernel) is used, one simple choice is to set the kernel width to the median distance between points in the aggregate sample [8, 7]. [sent-28, score-0.73]
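As a concrete illustration of the median heuristic, a minimal numpy sketch might look as follows (the function names and the exp(-||x-y||^2/(2*sigma^2)) parameterization are our own choices, not taken from the paper):

    import numpy as np

    def median_heuristic_bandwidth(X, Y):
        # Set the Gaussian kernel width to the median pairwise distance
        # in the aggregate sample Z = [X; Y].
        Z = np.vstack([X, Y])
        sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
        upper = sq_dists[np.triu_indices_from(sq_dists, k=1)]
        return np.median(np.sqrt(upper))

    def gauss_kernel(x, y, sigma):
        # Vectorized Gaussian (RBF) kernel between paired rows of x and y.
        return np.exp(-np.sum((x - y) ** 2, axis=-1) / (2.0 * sigma ** 2))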
15 An alternative heuristic is to choose the kernel that maximizes the test statistic [15]: in experiments, this was found to reliably outperform the median approach. [sent-30, score-0.856]
16 If the statistic is to be applied in hypothesis testing, however, then this choice of kernel does not explicitly address the question of test performance. [sent-32, score-0.815]
17 We propose a new approach to kernel choice for hypothesis testing, which explicitly optimizes the performance of the hypothesis test. [sent-33, score-0.559]
18 Our kernel choice minimizes Type II error (the probability of wrongly accepting H0 when p ≠ q), given an upper bound on Type I error (the probability of wrongly rejecting H0 when p = q). [sent-34, score-0.533]
19 We address the case of the linear time statistic in [7, Section 6], for which both the test statistic and the parameters of the null distribution can be computed in O(n), for sample size n. [sent-37, score-0.812]
20 This has a higher variance at a given n than the U-statistic estimate costing O(n^2) used in [8, 7], since the latter is the minimum variance unbiased estimator. [sent-38, score-0.156]
21 Thus, we would use the quadratic time statistic in the “limited data, unlimited time” scenario, as it extracts the most possible information from the data available. [sent-39, score-0.411]
22 The linear time statistic is used in the “unlimited data, limited time” scenario, since it is the cheapest statistic that still incorporates each datapoint: it does not require the data to be stored, and is thus appropriate for analyzing data streams. [sent-40, score-0.597]
23 As a further consequence of the streaming data setting, we learn the kernel parameter on a separate sample to the sample used in testing; i.e., [sent-41, score-0.345]
24 unlike the classical testing scenario, we use a training set to learn the kernel parameters. [sent-43, score-0.383]
25 An advantage of this setting is that our null distribution remains straightforward, and the test threshold can be computed without a costly bootstrap procedure. [sent-44, score-0.322]
26 We begin our presentation in Section 2 with a review of the maximum mean discrepancy, its linear time estimate, and the associated asymptotic distribution and test. [sent-45, score-0.16]
27 In Section 3 we describe a criterion for kernel choice to maximize the Hodges and Lehmann asymptotic relative efficiency. [sent-46, score-0.528]
28 We demonstrate the convergence of the empirical estimate of this criterion when the family of kernels is a linear combination of base kernels (with non-negative coefficients), and of the kernel coefficients themselves. [sent-47, score-0.829]
29 In Section 4, we provide an optimization procedure to learn the kernel weights. [sent-48, score-0.312]
30 We observe that a principled kernel choice for testing outperforms competing heuristics, including the previous best-performing heuristic in [15]. [sent-50, score-0.45]
31 2 Maximum mean discrepancy, and a linear time estimate We begin with a brief review of kernel methods, and of the maximum mean discrepancy [8, 7, 14]. [sent-56, score-0.482]
32 We then describe the family of kernels over which we optimize, and the linear time estimate of the MMD. [sent-57, score-0.257]
33 2.1 MMD for a family of kernels Let F_k be a reproducing kernel Hilbert space (RKHS) defined on a topological space X with reproducing kernel k, and p a Borel probability measure on X. [sent-59, score-0.949]
34 The maximum mean discrepancy (MMD) between Borel probability measures p and q is defined as the RKHS-distance between the mean embeddings of p and q. [sent-63, score-0.181]
35 By introducing h_k(x, x', y, y') := k(x, x') + k(y, y') − k(x, y') − k(x', y), we can write η_k(p, q) = E_{xx'yy'} h_k(x, x', y, y') =: E_v h_k(v), (2) where we have defined the random vector v := [x, x', y, y']. [sent-71, score-0.441]
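To make the definition of h_k and the linear-time estimate concrete, here is a small numpy sketch (our own illustration, not the authors' code); it reuses gauss_kernel from the sketch above and pairs consecutive observations so that each sample is touched once:

    def h_k(x, xp, y, yp, kernel):
        # h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y)
        return kernel(x, xp) + kernel(y, yp) - kernel(x, yp) - kernel(xp, y)

    def linear_time_mmd(X, Y, kernel):
        # Average h_k over disjoint blocks v_i = (x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i});
        # O(n) time, no need to hold the whole sample in memory at once.
        n = (min(len(X), len(Y)) // 2) * 2
        h = h_k(X[0:n:2], X[1:n:2], Y[0:n:2], Y[1:n:2], kernel)
        return h.mean(), h          # the linear-time MMD estimate and the per-block h values

For instance, one might call linear_time_mmd(X, Y, lambda a, b: gauss_kernel(a, b, sigma)) with sigma set by the median heuristic above.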
36 The Gaussian kernels used in the present work are characteristic. [sent-75, score-0.143]
37 Our goal is to select a kernel for hypothesis testing from a particular family K of kernels, which we now define. [sent-76, score-0.534]
38 Let {k_u}_{u=1}^d be a set of positive definite functions k_u : X × X → R. [sent-77, score-0.308]
39 Let K := { k : k = Σ_{u=1}^d β_u k_u, Σ_{u=1}^d β_u = D, β_u ≥ 0, ∀u ∈ {1, . . . , d} }. (3) [sent-78, score-0.308]
40 Each k ∈ K is associated uniquely with an RKHS F_k, and we assume the kernels are bounded, |k_u| ≤ K, ∀u ∈ {1, . . . , d}. [sent-82, score-0.143]
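A member of K can be represented directly as a non-negative combination of base kernels; the following sketch (our own, with the normalization D = 1 assumed) builds on gauss_kernel above:

    import numpy as np

    def make_base_kernels(bandwidths):
        # One Gaussian base kernel k_u per bandwidth.
        return [lambda a, b, s=s: gauss_kernel(a, b, s) for s in bandwidths]

    def combined_kernel(beta, base_kernels):
        # k = sum_u beta_u k_u with beta_u >= 0 and sum_u beta_u = D (here D = 1).
        beta = np.asarray(beta, dtype=float)
        assert np.all(beta >= 0.0)
        beta = beta / beta.sum()
        return lambda a, b: sum(b_u * k_u(a, b) for b_u, k_u in zip(beta, base_kernels))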
41 2.2 Empirical estimate of the MMD, asymptotic distribution, and test We now describe an empirical estimate of the maximum mean discrepancy, given i.i.d. samples from p and q. [sent-104, score-0.292]
42 We use the linear time estimate of [7, Section 6], for which both the test statistic and the parameters of the null distribution can be computed in time O(n). [sent-114, score-0.569]
43 This has a higher variance at a given n than a U-statistic estimate costing O(n^2), since the latter is the minimum variance unbiased estimator [13, Ch. [sent-115, score-0.156]
44 3] that the linear time statistic yields better performance at a given computational cost than the quadratic time statistic, when sufficient data are available (bearing in mind that consistent estimates of the null distribution in the latter case are computationally demanding [9]). [sent-118, score-0.529]
45 Moreover, the linear time statistic does not require the sample to be stored in memory, and is thus suited to data streaming contexts, where a large number of observations arrive in sequence. [sent-119, score-0.386]
46 We use η̌_k to denote the empirical statistic computed over the samples being tested, to distinguish it from the training sample estimate η̂_k used in selecting the kernel. [sent-121, score-0.353]
47 Given the family of kernels K in (3), this can be written η̌_k = β^⊤ η̌, where we again use the convention η̌ = (η̌_1, η̌_2, . . . , η̌_d)^⊤. [sent-122, score-0.187]
48 The statistic η̌_k has expectation zero under the null hypothesis H0 that p = q, and has positive expectation under the alternative hypothesis HA that p ≠ q. [sent-126, score-0.668]
49 Unlike the case of a quadratic time statistic, the null and alternative distributions differ only in mean; by contrast, the quadratic time statistic has as its null distribution an infinite weighted sum of χ² variables [7, Section 5], and a Gaussian alternative distribution. [sent-132, score-0.722]
50 The population variance can be written σ_k² = E_v h_k²(v) − E_{v,v'}[h_k(v) h_k(v')] = (1/2) E_{v,v'}[(h_k(v) − h_k(v'))²]. [sent-135, score-0.209]
51 Expanding in terms of the kernel coefficients β, we get σ_k² := β^⊤ Q_k β, where Q_k := cov(h) is the covariance matrix of h. [sent-136, score-0.346]
52 A linear time estimate for the variance is σ̌_k² = β^⊤ Q̌_k β, where (Q̌_k)_{uu'} = (4/n) Σ_{i=1}^{n/4} h_{Δ,u}(w_i) h_{Δ,u'}(w_i), (7) and w_i := [v_{2i−1}, v_{2i}],¹ h_{Δ,k}(w_i) := h_k(v_{2i−1}) − h_k(v_{2i}). [sent-137, score-0.431]
53 From (5), a test of asymptotic level α using the statistic η̌_k will have the threshold t_{k,α} = n^{−1/2} σ_k √2 Φ^{−1}(1 − α), (8) bearing in mind the asymptotic distribution of the test statistic, and that η_k(p, p) = 0. [sent-140, score-0.809]
54 This threshold is computed empirically by replacing σ_k with its estimate σ̌_k (computed using the data being tested), which yields a test of the desired asymptotic level. [sent-141, score-0.303]
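Continuing the sketch, the variance estimate (7) and the threshold (8) might be computed as follows (again our own illustration; scipy.stats.norm.ppf supplies Φ^{−1}):

    import numpy as np
    from scipy.stats import norm

    def variance_estimate(H):
        # H has shape (d, n/2): row u holds h_u(v_i) for base kernel k_u.
        # Implements (7): (Q)_{uu'} = (4/n) * sum_i h_{Delta,u}(w_i) h_{Delta,u'}(w_i).
        d, n_blocks = H.shape
        n_blocks = (n_blocks // 2) * 2
        n = 2 * n_blocks                                # sample size per distribution
        H_delta = H[:, 0:n_blocks:2] - H[:, 1:n_blocks:2]
        return (4.0 / n) * (H_delta @ H_delta.T)

    def test_threshold(beta, Q, n, alpha=0.05):
        # Threshold (8): t_{k,alpha} = n^{-1/2} * sigma_k * sqrt(2) * Phi^{-1}(1 - alpha).
        sigma_k = np.sqrt(beta @ Q @ beta)
        return sigma_k * np.sqrt(2.0) * norm.ppf(1.0 - alpha) / np.sqrt(n)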
55 The asymptotic distribution (5) holds only when the kernel is fixed, and does not depend on the sample X, Y . [sent-142, score-0.443]
56 If the kernel were a function of the data, then a test would require large deviation probabilities over the supremum of the Gaussian process indexed by the kernel parameters (e. [sent-143, score-0.73]
57 Instead, we set aside a portion of the data to learn the kernel (the “training data”), and use the remainder to construct a test using the learned kernel parameters. [sent-147, score-0.703]
58 3 Choice of kernel The choice of kernel will affect both the test statistic itself, (4), and its asymptotic variance, (6). [sent-148, score-1.151]
59 The asymptotic probability of a Type II error is therefore P(η̌_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q) √n / (σ_k √2) ). [sent-152, score-0.184]
60 Therefore, the kernel minimizing this error probability is k_* = arg sup_{k∈K} η_k(p, q) σ_k^{−1}, (9) with the associated test threshold t_{k_*,α}. [sent-154, score-0.676]
61 We therefore estimate t_{k_*,α} by a regularized empirical estimate t̂_{k̂_*,α}, where k̂_* = arg sup_{k∈K} η̂_k (σ̂_{k,λ})^{−1}. ¹ This vector is the concatenation of two four-dimensional vectors, and has eight dimensions. [sent-156, score-0.262]
62 Then if λ_m = Θ(m^{−1/3}), sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} = O_P(m^{−1/3}) and k̂_* → k_* in probability. [sent-162, score-0.304]
63 We remark that other families of kernels may be worth considering, besides K. [sent-172, score-0.143]
64 For instance, we could use a family of RBF kernels with continuous bandwidth parameter θ ≥ 0. [sent-173, score-0.253]
65 4 Optimization procedure We wish to select the kernel k̂_* = Σ_{u=1}^d β̂_u^* k_u ∈ K that maximizes the ratio η̂_k / σ̂_{k,λ}. [sent-175, score-0.723]
66 We perform this optimization over training data, then use the resulting parameters β̂_* to construct a hypothesis test on the data to be tested (which must be independent of the training data, and drawn from the same p, q). [sent-176, score-0.186]
67 2, this gives us the test threshold without requiring a bootstrap procedure. [sent-178, score-0.186]
69 Figure 1: Left: Feature selection results, Type II error vs. number of dimensions, average over 5000 trials, m = n = 10^4 (methods compared: max-ratio, opt, l2, max-mmd). [sent-191, score-1.032]
70 Right: Gaussian grid results, Type II error vs. ε, the eigenvalue ratio for the covariance of the Gaussians in q; average over 1500 trials, m = n = 10^4 (methods compared: max-ratio, opt, l2, max-mmd, xval, median). [sent-193, score-0.257]
71 Indeed, since all of the squared MMD estimates calculated on the training data using each of the base kernels are negative, it is unlikely the statistic computed on the data used for the test will exceed the (always positive) threshold. [sent-208, score-0.571]
72 Therefore, when no entries in η̂ are positive, we (arbitrarily) select a single base kernel k_u with largest η̂_u/σ̂_{u,λ}. [sent-209, score-0.713]
73 This problem can be solved by interior point methods, or, if the number of kernels d is large, we could use proximal gradient methods. [sent-211, score-0.143]
74 Therefore, the overall computational cost of the proposed test is linear in the number of samples, and quadratic in the number of kernels. [sent-213, score-0.157]
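As a rough sketch of this optimization (our own simplification, not the authors' interior-point solver): because the ratio η̂_k/σ̂_{k,λ} is invariant to rescaling β, one can restrict β to the simplex and run projected gradient ascent on the ratio directly; the values of lam, steps, and lr below are illustrative.

    import numpy as np

    def project_simplex(v):
        # Euclidean projection onto {beta >= 0, sum(beta) = 1}.
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    def optimize_beta(eta_hat, Q_hat, lam=1e-4, steps=500, lr=0.1):
        # Projected gradient ascent on r(beta) = beta'eta / sqrt(beta'(Q + lam I)beta).
        d = len(eta_hat)
        A = Q_hat + lam * np.eye(d)
        beta = np.full(d, 1.0 / d)
        for _ in range(steps):
            den = np.sqrt(beta @ A @ beta)
            grad = eta_hat / den - (beta @ eta_hat) * (A @ beta) / den ** 3
            beta = project_simplex(beta + lr * grad)
        return beta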
75 5 Experiments We compared our kernel selection strategy to alternative approaches, with a focus on challenging problems that benefit from careful kernel choice. [sent-214, score-0.709]
76 In our first experiment, we investigated a synthetic data set for which the best kernel in the family K of linear combinations in (3) outperforms the best individual kernel from the set {k_u}_{u=1}^d. [sent-215, score-0.697]
77 Our base kernel set {k_u}_{u=1}^d contained only d univariate kernels with fixed bandwidth (one for each dimension): in other words, this was a feature selection problem. [sent-219, score-0.637]
78 We used two kernel selection strategies arising from our criterion in (9): opt - the kernel from the set K that maximizes the ratio η̂_k/σ̂_{k,λ}, as described in Section 4, and max-ratio - the single base kernel k_u with largest η̂_u/σ̂_{u,λ}. [sent-220, score-1.684]
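The two rules can be sketched directly from the quantities defined earlier (our own code, reusing variance_estimate and optimize_beta from the sketches above; the value of lam is illustrative): max-ratio scores each base kernel by η̂_u/σ̂_{u,λ}, and opt runs the β optimization of Section 4.

    import numpy as np

    def select_max_ratio(H_train, lam=1e-4):
        # H_train: (d, n/2) training-sample h_u(v_i) values, one row per base kernel.
        eta_hat = H_train.mean(axis=1)                  # per-kernel linear-time MMD estimates
        Q_hat = variance_estimate(H_train)
        sigma_hat = np.sqrt(np.diag(Q_hat) + lam)       # regularized per-kernel std. deviations
        return int(np.argmax(eta_hat / sigma_hat))

    def select_opt(H_train, lam=1e-4):
        # Full weight vector over base kernels, via the projected-gradient sketch above.
        return optimize_beta(H_train.mean(axis=1), variance_estimate(H_train), lam=lam)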
81 Figure 2: Left: amplitude modulated signals, four samples from each of p and q prior to noise being added. Right: Type II error vs. added noise σ_ε (methods compared: max-ratio, opt, median, l2, max-mmd). [sent-230, score-0.143]
82 An alternative kernel selection procedure is simply to maximize the MMD on the training data, which is equivalent to minimizing the error in classifying p vs. q. [sent-236, score-0.45]
83 In this case, it is necessary to bound the norm of β, since the test statistic can otherwise be increased without limit by rescaling the β entries. [sent-238, score-0.363]
84 We employed two such kernel selection strategies: max-mmd - a single base kernel k_u that maximizes η̂_u (as proposed in [15]), and l2 - a kernel from the set K that maximizes η̂_k subject to the constraint ‖β‖_2 ≤ 1 on the vector of weights. [sent-239, score-1.44]
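These baselines also have simple forms (our own sketch, continuing the numpy code above; we keep the non-negativity constraint from K, so the l2 solution is proportional to the positive part of η̂ whenever some entry is positive):

    import numpy as np

    def select_max_mmd(eta_hat):
        # Single base kernel with the largest eta_hat_u.
        return int(np.argmax(eta_hat))

    def select_l2(eta_hat):
        # Maximize beta'eta_hat over beta >= 0, ||beta||_2 <= 1: proportional to [eta_hat]_+.
        beta = np.maximum(eta_hat, 0.0)
        if beta.sum() == 0.0:                           # degenerate case: no positive entry
            beta = np.eye(len(eta_hat))[int(np.argmax(eta_hat))]
        return beta / np.linalg.norm(beta)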
85 We see that opt and l2 perform much better than max-ratio and max-mmd, with the former each having large β̂_* weights in both the relevant dimensions, whereas the latter are permitted to choose only a single kernel. [sent-241, score-0.141]
86 In these cases, a good choice of kernel becomes crucial. [sent-245, score-0.345]
87 In the second experiment, p and q were both grids of Gaussians in two dimensions, where p had unit covariance matrices in each mixture component, and q was a grid of correlated Gaussians with a ratio ε of largest to smallest covariance eigenvalues. [sent-247, score-0.216]
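A hedged sketch of such a data set (grid spacing, grid size, and the 45-degree rotation of the q components below are our own assumptions for illustration; only the eigenvalue ratio ε follows the text):

    import numpy as np

    def sample_gaussian_grid(n, eps, grid=5, spacing=10.0, seed=None):
        rng = np.random.default_rng(seed)
        centers = np.array([[i * spacing, j * spacing]
                            for i in range(grid) for j in range(grid)])
        X = centers[rng.integers(len(centers), size=n)] + rng.standard_normal((n, 2))
        theta = np.pi / 4.0                              # assumed rotation of q's components
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        L = np.diag([np.sqrt(eps), 1.0])                 # eigenvalue ratio eps (largest/smallest)
        Y = centers[rng.integers(len(centers), size=n)] + rng.standard_normal((n, 2)) @ (R @ L).T
        return X, Y                                      # samples from p and q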
88 We compared opt, max-ratio, max-mmd, and l2, as well as an additional approach, xval, for which we chose the best kernel from {k_u}_{u=1}^d by five-fold cross-validation, following [17]. [sent-251, score-0.216]
89 We made repeated splits to obtain the average validation error, and chose the kernel with the highest average MMD on the validation sets (equivalently, the lowest average linear loss). [sent-254, score-0.341]
90 Our base kernels {k_u}_{u=1}^d in (3) were multivariate isotropic Gaussians with bandwidth varying between 2^{−10} and 2^{15}, with a multiplicative step-size of 20. [sent-256, score-0.303]
91 The median heuristic fails entirely, yielding the 95% error expected under the null hypothesis. [sent-259, score-0.324]
92 In our final experiment, the distributions p, q were short samples of amplitude modulated (AM) signals, which were carrier sinusoids with amplitudes scaled by different audio signals for p and q. [sent-261, score-0.341]
93 These signals took the form y(t) = cos(ω_c t) (A s(t) + o_c) + n(t), where y(t) is the AM signal at time t, s(t) is an audio signal, ω_c is the frequency of the carrier signal, A is an amplitude scaling parameter, o_c is a constant offset, and n(t) is i.i.d. noise. [sent-262, score-0.408]
94 The audio was sampled at 8kHz, the carrier was at 24kHz, and the resulting AM signals were sampled at 120kHz. [sent-270, score-0.198]
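A sketch of the AM construction with the stated sampling rates (the audio source, amplitude A, offset o_c, and noise level below are placeholders of our own, not values from the paper):

    import numpy as np

    def am_signal(audio, fs=120_000, fc=24_000, A=0.5, oc=1.0, noise_std=0.1, seed=None):
        # y(t) = cos(w_c t) * (A s(t) + o_c) + n(t), with n(t) i.i.d. Gaussian noise.
        # `audio` is assumed to be already upsampled from 8 kHz to the AM rate fs.
        rng = np.random.default_rng(seed)
        s = np.asarray(audio, dtype=float)
        t = np.arange(len(s)) / fs
        y = np.cos(2.0 * np.pi * fc * t) * (A * s + oc)
        return y + noise_std * rng.standard_normal(len(s))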
95 Our base kernels {k_u}_{u=1}^d in (3) were multivariate isotropic Gaussians with bandwidth varying between 2^{−15} and 2^{15}, with a multiplicative step-size of 2, and we set λ_n = 10^{−5}. [sent-275, score-0.303]
96 We remark that in the second and third experiments, simply choosing the kernel k_u with largest ratio η̂_u/σ̂_{u,λ} does as well or better than solving for β̂_* in (11). [sent-278, score-0.711]
97 The max-ratio strategy is thus recommended when a single best kernel exists in the set {k_u}_{u=1}^d, although it clearly fails when a linear combination of several kernels is needed (as in the first experiment). [sent-279, score-0.512]
98 These include an empirical verification that the Type I error is close to the design parameter α, and that kernels are not chosen at extreme values when the null hypothesis holds, additional AM experiments, and further synthetic benchmarks. [sent-281, score-0.439]
99 6 Conclusions We have proposed a criterion to explicitly optimize the Hodges and Lehmann asymptotic relative efficiency for the kernel two-sample test: the kernel parameters are chosen to minimize the asymptotic Type II error at a given Type I error. [sent-282, score-0.991]
100 A promising next step would be to optimize over the parameters of a single kernel (e. [sent-284, score-0.312]
wordName wordTfidf (topN-words)
[('mmd', 0.466), ('kernel', 0.312), ('ku', 0.308), ('statistic', 0.284), ('sup', 0.152), ('supk', 0.151), ('hk', 0.147), ('kernels', 0.143), ('opt', 0.141), ('null', 0.136), ('asymptotic', 0.131), ('hypothesis', 0.107), ('ev', 0.106), ('fk', 0.101), ('discrepancy', 0.1), ('fukumizu', 0.092), ('gretton', 0.092), ('rkhs', 0.09), ('type', 0.088), ('gaussians', 0.086), ('embeddings', 0.081), ('hodges', 0.079), ('xval', 0.079), ('test', 0.079), ('signals', 0.078), ('median', 0.073), ('tk', 0.072), ('testing', 0.071), ('amplitude', 0.07), ('carrier', 0.07), ('oc', 0.07), ('reproducing', 0.069), ('bandwidth', 0.066), ('base', 0.065), ('hilbert', 0.064), ('borel', 0.063), ('ratio', 0.063), ('ii', 0.063), ('qk', 0.057), ('grid', 0.057), ('lehmann', 0.055), ('bootstrap', 0.055), ('error', 0.053), ('bearing', 0.053), ('exx', 0.053), ('hook', 0.053), ('witness', 0.053), ('criterion', 0.052), ('threshold', 0.052), ('selection', 0.051), ('audio', 0.05), ('vs', 0.05), ('quadratic', 0.049), ('csml', 0.047), ('modulated', 0.045), ('family', 0.044), ('csd', 0.043), ('costing', 0.043), ('cov', 0.041), ('estimate', 0.041), ('op', 0.041), ('wrongly', 0.041), ('unlimited', 0.041), ('sriperumbudur', 0.041), ('maximizes', 0.04), ('stored', 0.04), ('wald', 0.039), ('ucl', 0.039), ('curran', 0.039), ('massimiliano', 0.039), ('mpi', 0.037), ('extracts', 0.037), ('harchaoui', 0.037), ('borgwardt', 0.037), ('rasch', 0.037), ('variance', 0.036), ('sch', 0.035), ('covariance', 0.034), ('heuristic', 0.034), ('alternative', 0.034), ('cients', 0.034), ('choice', 0.033), ('streaming', 0.033), ('lanckriet', 0.032), ('scenario', 0.031), ('wi', 0.031), ('demanding', 0.031), ('ha', 0.031), ('isotropic', 0.029), ('linear', 0.029), ('largest', 0.028), ('max', 0.028), ('fails', 0.028), ('arg', 0.028), ('samples', 0.028), ('trials', 0.028), ('rbf', 0.027), ('deviation', 0.027), ('gaussian', 0.027), ('population', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
2 0.18488675 188 nips-2012-Learning from Distributions via Support Measure Machines
Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf
Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1
3 0.14905517 231 nips-2012-Multiple Operator-valued Kernel Learning
Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach
Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1
4 0.14541832 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
5 0.12564354 338 nips-2012-The Perturbed Variation
Author: Maayan Harel, Shie Mannor
Abstract: We introduce a new discrepancy score between two distributions that gives an indication on their similarity. While much research has been done to determine if two samples come from exactly the same distribution, much less research considered the problem of determining if two finite samples come from similar distributions. The new score gives an intuitive interpretation of similarity; it optimally perturbs the distributions so that they best fit each other. The score is defined between distributions, and can be efficiently estimated from samples. We provide convergence bounds of the estimated score, and develop hypothesis testing procedures that test if two data sets come from similar distributions. The statistical power of this procedures is presented in simulations. We also compare the score’s capacity to detect similarity with that of other known measures on real data. 1
6 0.12122462 32 nips-2012-Active Comparison of Prediction Models
8 0.11567621 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
9 0.10061247 291 nips-2012-Reducing statistical time-series problems to binary classification
10 0.099020451 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
11 0.098181814 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
12 0.096362986 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
13 0.091731429 330 nips-2012-Supervised Learning with Similarity Functions
14 0.09143164 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
15 0.091313303 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
16 0.090200707 282 nips-2012-Proximal Newton-type methods for convex optimization
17 0.085624762 168 nips-2012-Kernel Latent SVM for Visual Recognition
18 0.083866529 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
19 0.082896121 16 nips-2012-A Polynomial-time Form of Robust Regression
20 0.082369126 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
topicId topicWeight
[(0, 0.2), (1, 0.038), (2, 0.033), (3, -0.05), (4, 0.082), (5, 0.055), (6, 0.006), (7, 0.156), (8, -0.024), (9, -0.125), (10, -0.012), (11, -0.007), (12, 0.118), (13, -0.087), (14, 0.058), (15, -0.191), (16, 0.028), (17, 0.066), (18, -0.061), (19, -0.075), (20, 0.019), (21, -0.1), (22, 0.131), (23, -0.168), (24, 0.027), (25, 0.084), (26, 0.041), (27, -0.036), (28, -0.01), (29, 0.01), (30, 0.012), (31, -0.062), (32, -0.079), (33, -0.024), (34, 0.044), (35, -0.014), (36, -0.138), (37, -0.039), (38, 0.024), (39, 0.069), (40, 0.102), (41, -0.01), (42, -0.06), (43, 0.078), (44, -0.042), (45, -0.004), (46, 0.083), (47, 0.031), (48, -0.061), (49, -0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.96648765 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
2 0.8368184 231 nips-2012-Multiple Operator-valued Kernel Learning
Author: Hachem Kadri, Alain Rakotomamonjy, Philippe Preux, Francis R. Bach
Abstract: Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an r -norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces. 1
3 0.82131749 188 nips-2012-Learning from Distributions via Support Measure Machines
Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf
Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values 10−3 ). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
5 0.75959957 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
Author: Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou
Abstract: Both random Fourier features and the Nyström method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent from the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield impressively better generalization error bound than random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets. 1
6 0.69746649 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
7 0.66801149 167 nips-2012-Kernel Hyperalignment
8 0.66657472 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
9 0.63341725 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
10 0.61989635 95 nips-2012-Density-Difference Estimation
11 0.61032563 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
12 0.60301208 330 nips-2012-Supervised Learning with Similarity Functions
13 0.58302695 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies
14 0.56890416 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
15 0.56212419 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
16 0.50022095 145 nips-2012-Gradient Weights help Nonparametric Regressors
17 0.49755532 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
18 0.4970696 222 nips-2012-Multi-Task Averaging
19 0.49522299 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests
20 0.46999955 338 nips-2012-The Perturbed Variation
topicId topicWeight
[(0, 0.048), (17, 0.012), (21, 0.023), (38, 0.152), (39, 0.017), (42, 0.036), (44, 0.015), (54, 0.02), (55, 0.057), (74, 0.058), (76, 0.184), (80, 0.065), (84, 0.191), (92, 0.042)]
simIndex simValue paperId paperTitle
1 0.91673183 59 nips-2012-Bayesian nonparametric models for bipartite graphs
Author: Francois Caron
Abstract: We develop a novel Bayesian nonparametric model for random bipartite graphs. The model is based on the theory of completely random measures and is able to handle a potentially infinite number of nodes. We show that the model has appealing properties and in particular it may exhibit a power-law behavior. We derive a posterior characterization, a generative process for network growth, and a simple Gibbs sampler for posterior simulation. Our model is shown to be well fitted to several real-world social networks. 1
2 0.91420722 102 nips-2012-Distributed Non-Stochastic Experts
Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic
Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal O(√(log(n) T)) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is O(√(log(n) k T)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(√(k^{5(1+ε)/6} T)) and communication O(T/k^ε), for any value of ε ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near optimal in regret vs communication trade-off. 1
3 0.90225476 234 nips-2012-Multiresolution analysis on the symmetric group
Author: Risi Kondor, Walter Dempsey
Abstract: There is no generally accepted way to define wavelets on permutations. We address this issue by introducing the notion of coset based multiresolution analysis (CMRA) on the symmetric group, find the corresponding wavelet functions, and describe a fast wavelet transform for sparse signals. We discuss potential applications in ranking, sparse approximation, and multi-object tracking. 1
4 0.87822735 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
Author: Feng Cao, Soumya Ray
Abstract: We describe an approach to incorporating Bayesian priors in the MAXQ framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies. 1
same-paper 5 0.86722547 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
6 0.79822367 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
7 0.79537123 215 nips-2012-Minimizing Uncertainty in Pipelines
8 0.79282039 148 nips-2012-Hamming Distance Metric Learning
9 0.79220396 319 nips-2012-Sparse Prediction with the $k$-Support Norm
10 0.79166937 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
11 0.79012269 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
12 0.7899943 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
13 0.78991562 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
14 0.78976393 69 nips-2012-Clustering Sparse Graphs
15 0.78965342 188 nips-2012-Learning from Distributions via Support Measure Machines
16 0.78920591 75 nips-2012-Collaborative Ranking With 17 Parameters
17 0.78900689 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
18 0.78875536 304 nips-2012-Selecting Diverse Features via Spectral Regularization
19 0.78866184 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
20 0.78850597 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video