nips nips2007 nips2007-7 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. [sent-16, score-0.566]
2 We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). [sent-17, score-1.066]
3 The resulting test costs O(m2 ), where m is the sample size. [sent-18, score-0.129]
4 We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. [sent-19, score-0.21]
5 Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists. [sent-20, score-0.506]
6 1 Introduction Kernel independence measures have been widely applied in recent machine learning literature, most commonly in independent component analysis (ICA) [2, 11], but also in fitting graphical models [1] and in feature selection [22]. [sent-21, score-0.237]
7 There is presently no way to tell whether the empirical estimates of these dependence measures indicate a statistically significant dependence, however. [sent-23, score-0.206]
8 In other words, we are interested in the threshold an empirical kernel dependence estimate must exceed, before we can dismiss with high probability the hypothesis that the underlying variables are independent. [sent-24, score-0.419]
9 Statistical tests of independence have been associated with a broad variety of dependence measures. [sent-25, score-0.543]
10 Classical tests such as Spearman’s ρ and Kendall’s τ are widely applied, however they are not guaranteed to detect all modes of dependence between the random variables. [sent-26, score-0.341]
11 Contingency table-based methods, and in particular the power-divergence family of test statistics [17], are the best known general purpose tests of independence, but are limited to relatively low dimensions, since they require a partitioning of the space in which each random variable resides. [sent-27, score-0.293]
12 Characteristic function-based tests [6, 13] have also been proposed, which are more general than kernel density-based tests [19], although to our knowledge they have been used only to compare univariate random variables. [sent-28, score-0.541]
13 In this paper we present three main results: first, and most importantly, we show how to test whether statistically significant dependence is detected by a particular kernel independence measure, the Hilbert Schmidt independence criterion (HSIC, from [9]). [sent-29, score-0.943]
14 That is, we provide a fast (O(m2 ) for sample size m) and accurate means of obtaining a threshold which HSIC will only exceed with small probability, when the underlying variables are independent. [sent-30, score-0.099]
15 Second, we show the distribution of our empirical test statistic in the large sample limit can be straightforwardly parameterised in terms of kernels on the data. [sent-31, score-0.361]
16 Third, we apply our test to structured data (in this case, by establishing the statistical dependence between a text and its translation). [sent-32, score-0.287]
17 To our knowledge, ours is the first independence test for structured data. [sent-33, score-0.332]
18 In Section 3, we describe how to determine whether the dependence returned via HSIC is statistically significant, by proposing a hypothesis test with HSIC as its statistic. [sent-35, score-0.331]
19 In particular, we show that this test can be parameterised using a combination of covariance operator norms and norms of mean elements of the random variables in feature space. [sent-36, score-0.385]
20 Finally, in Section 4, we give our experimental results, both for testing dependence between random vectors (which could be used for instance to verify convergence in independent subspace analysis [25]), and for testing dependence between text and its translation. [sent-37, score-0.478]
21 Software to implement the test may be downloaded from http://www. [sent-38, score-0.095]
22 We begin with a description of our kernel dependence criterion, leaving to the following section the question of whether this dependence is significant. [sent-49, score-0.437]
23 This presentation is largely a review of material from [9, 11, 22], the main difference being that we establish links to the characteristic function-based independence criteria in [6, 13]. [sent-50, score-0.329]
24 Let F be an RKHS, with the continuous feature mapping φ(x) ∈ F from each x ∈ X, such that the inner product between the features is given by the kernel function k(x, x′) := ⟨φ(x), φ(x′)⟩. [sent-51, score-0.149]
25 Likewise, let G be a second RKHS on Y with kernel l(·, ·) and feature map ψ(y). [sent-52, score-0.149]
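For context, since equation (2) is referenced below but not reproduced in this extract: the population HSIC of [9] is the squared Hilbert-Schmidt norm of the cross-covariance operator between F and G. A compact restatement, using the mean elements and covariance operators that appear later in the extract:
$$\mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G}) := \|C_{xy}\|_{\mathrm{HS}}^2, \qquad C_{xy} := \mathbf{E}_{xy}\big[(\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y)\big],$$
with $\mu_x := \mathbf{E}_x\,\phi(x)$ and $\mu_y := \mathbf{E}_y\,\psi(y)$; for universal kernels this quantity is zero if and only if x and y are independent [9].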
26 An unbiased estimator of (2) is a sum of three U-statistics [21, 22],
$$\mathrm{HSIC}_s(Z) = \frac{1}{(m)_2} \sum_{(i,j) \in \mathbf{i}_2^m} k_{ij} l_{ij} + \frac{1}{(m)_4} \sum_{(i,j,q,r) \in \mathbf{i}_4^m} k_{ij} l_{qr} - \frac{2}{(m)_3} \sum_{(i,j,q) \in \mathbf{i}_3^m} k_{ij} l_{iq}, \qquad (3)$$
where $(m)_n := \frac{m!}{(m-n)!}$. [sent-60, score-0.503]
27 Here $\mathbf{i}_r^m$ denotes the set of all $r$-tuples drawn without replacement from $\{1, \ldots, m\}$ ($r$ being the number of indices below the sum), $k_{ij} := k(x_i, x_j)$, and $l_{ij} := l(y_i, y_j)$. [sent-65, score-0.174]
28 The biased statistic used as our test statistic can be written $\mathrm{HSIC}_b(Z) = \frac{1}{m^2}\operatorname{tr}(KHLH)$ (4), where $K$ is the $m \times m$ matrix with entries $k_{ij}$, $L$ likewise has entries $l_{ij}$, $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$, and $\mathbf{1}$ is an $m \times 1$ vector of ones (the cost of computing this statistic is $O(m^2)$). [sent-69, score-0.249]
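Since the trace form in (4) is the quantity the test actually computes, a minimal numerical sketch may help; this is not the authors' released code, and the Gaussian kernel widths sigma_x and sigma_y are illustrative parameters (the experiments below set them by the median heuristic).

import numpy as np

def hsic_b(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased statistic HSIC_b(Z) = trace(K H L H) / m^2 with Gaussian kernels.
    X and Y are arrays of shape (m, d_x) and (m, d_y) holding the paired sample Z."""
    def gauss_gram(A, sigma):
        # k(a, a') = exp(-||a - a'||^2 / sigma^2), matching the kernel defined above
        sq = np.sum(A ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
        return np.exp(-d2 / sigma ** 2)
    m = X.shape[0]
    K = gauss_gram(X, sigma_x)
    L = gauss_gram(Y, sigma_y)
    # Center K (equivalent to H K H with H = I - (1/m) 1 1^T), then take the
    # entrywise product with the symmetric L: trace(K H L H) = sum(HKH * L), all in O(m^2).
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    return np.sum(Kc * L) / m ** 2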
29 When a Gaussian kernel $k_{ij} := \exp\!\left(-\sigma^{-2} \|x_i - x_j\|^2\right)$ is used (or a kernel deriving from [6, Eq. 10]), the latter statistic is equivalent to the characteristic function-based statistic of [6]. [sent-70, score-0.426] [sent-72, score-0.304]
31 Another related test described in [4] is based on the functional canonical correlation between F and G, rather than the covariance: in this sense the test statistic resembles those in [2]. [sent-78, score-0.38]
32 The approach in [4] differs with both the present work and [2], however, in that the function spaces F and G are represented by finite sets of basis functions (specifically B-spline kernels) when computing the empirical test statistic. [sent-79, score-0.095]
33 3 Test description We now describe a statistical test of independence for two random variables, based on the test statistic HSICb (Z). [sent-80, score-0.548]
34 Given the sample Z defined earlier, the statistical test T(Z) : (X × Y)^m → {0, 1} is used to distinguish between the null hypothesis H0 : Pxy = Px Py and the alternative hypothesis H1 : Pxy ≠ Px Py. [sent-85, score-0.252]
35 This is achieved by comparing the test statistic, in our case HSICb (Z), with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population HSIC indicates Pxy = Px Py ). [sent-86, score-0.38]
36 The acceptance region of the test is thus defined as any real number below the threshold. [sent-87, score-0.249]
37 Since the test is based on a finite sample, it is possible that an incorrect answer will be returned: the Type I error is defined as the probability of rejecting H0 based on the observed sample, despite x and y being independent. [sent-88, score-0.095]
38 The level α of a test is an upper bound on the Type I error, and is a design parameter of the test, used to set the test threshold. [sent-90, score-0.19]
39 A consistent test achieves a level α, and a Type II error of zero, in the large sample limit. [sent-91, score-0.129]
40 How, then, do we set the threshold of the test given α? [sent-92, score-0.128]
41 We then use the 1 − α quantile of this distribution as the test threshold. [sent-94, score-0.195]
42 We shall see, however, that the null distribution has a complex form, and cannot be evaluated directly. [sent-97, score-0.096]
43 Thus, in the second part of this section, we describe ways to accurately approximate the 1 − α quantile of this distribution. [sent-98, score-0.1]
44 Asymptotic distribution of HSICb(Z). We now describe the distribution of the test statistic in (4); the first theorem holds under H1. [sent-99, score-0.249]
45 Define
$$h_{ijqr} := \frac{1}{4!} \sum_{(t,u,v,w)}^{(i,j,q,r)} \left( k_{tu} l_{tu} + k_{tu} l_{vw} - 2 k_{tu} l_{tv} \right), \qquad (5)$$
where the sum represents all ordered quadruples (t, u, v, w) drawn without replacement from (i, j, q, r), and assume $\mathbf{E}\,h^2 < \infty$. [sent-104, score-0.115]
46 The variance is
$$\sigma_u^2 = 16 \left( \mathbf{E}_i \left[ \left( \mathbf{E}_{j,q,r}\, h_{ijqr} \right)^2 \right] - \mathrm{HSIC}(P_{xy}, \mathcal{F}, \mathcal{G})^2 \right), \qquad (6)$$
where $\mathbf{E}_{j,q,r} := \mathbf{E}_{z_j, z_q, z_r}$. [sent-106, score-0.116]
47 Proof: We first rewrite (4) as a single V-statistic,
$$\mathrm{HSIC}_b(Z) = \frac{1}{m^4} \sum_{i,j,q,r}^{m} h_{ijqr}, \qquad (7)$$
where we note that $h_{ijqr}$ defined in (5) does not change with permutation of its indices. [sent-107, score-0.232]
48 The second theorem applies under H0. Theorem 2: Under H0, the U-statistic HSICs(Z) corresponding to the V-statistic in (7) is degenerate, meaning $\mathbf{E}_i\, h_{ijqr} = 0$. [sent-112, score-0.149]
49 ..., and $\lambda_l$ are the solutions to the eigenvalue problem $\lambda_l \psi_l(z_j) = \int h_{ijqr}\, \psi_l(z_i)\, dF_{i,q,r}$, where the integral is over the distribution of variables $z_i$, $z_q$, and $z_r$. [sent-118, score-0.148]
50 One approach, taken by [6], is to use a Monte Carlo resampling technique: the ordering of the Y sample is permuted repeatedly while that of X is kept fixed, and the 1 − α quantile is obtained from the resulting distribution of HSICb values. [sent-124, score-0.134]
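A minimal sketch of this resampling approach (HSICp in the experiments below); the function name and the reuse of hsic_b from the earlier sketch are our own conventions, and 200 permutations matches the number used later in the experiments.

import numpy as np

def hsic_permutation_threshold(X, Y, stat, alpha=0.05, n_perm=200, seed=0):
    """1 - alpha quantile of the permutation null: the ordering of the Y sample
    is permuted repeatedly while X is kept fixed, and the statistic recomputed."""
    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        null_stats[b] = stat(X, Y[rng.permutation(Y.shape[0])])
    return np.quantile(null_stats, 1.0 - alpha)

# e.g. reject H0 (independence) when hsic_b(X, Y) exceeds
# hsic_permutation_threshold(X, Y, hsic_b).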
51 A second approach, taken by [13, p. 34], is to approximate the null distribution as a two-parameter Gamma distribution [12]. [sent-127, score-0.096]
52 We note the Gamma approximation is quite accurate, especially in areas of high probability (which we use to compute the test quantile). [sent-137, score-0.095]
53 ...characteristic functions, and not in our more general kernel setting (see also [14]). [sent-145, score-0.149]
54 We provide much simpler expressions for both quantities, in terms of norms of the mean elements $\mu_x$ and $\mu_y$, and the covariance operators $C_{xx} := \mathbf{E}_x[(\phi(x) - \mu_x) \otimes (\phi(x) - \mu_x)]$ and $C_{yy}$, in feature space. [sent-149, score-0.107]
55 The main advantage of our new expressions is that they are computed entirely in terms of kernels, which makes possible the application of the test to any domains on which kernels can be defined, and not only Rd . [sent-150, score-0.193]
56 An empirical estimate of this statistic is obtained by replacing the norms above with their empirical counterparts, e.g. $\|\hat{\mu}_x\|^2 = (m)_2^{-1} \sum_{(i,j) \in \mathbf{i}_2^m} k_{ij}$, bearing in mind that this results in a (generally negligible) bias of $O(m^{-1})$ in the estimate of $\|\mu_x\|^2 \|\mu_y\|^2$. [sent-152, score-0.358]
57 Proofs of both theorems may be found in [10], where we also compare with the original characteristic function-based expressions in [13]. [sent-155, score-0.138]
58 We remark that these parameters, like the original test statistic in (4), may be computed in O(m2 ). [sent-156, score-0.216]
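A sketch of how the resulting Gamma approximation (HSICg in the experiments below) could be turned into a test threshold. The moment-matching parameterisation shown here (shape = mean^2/variance, scale = m*variance/mean for the scaled statistic m*HSIC_b) is our reading of the approximation in (9), and est_mean / est_var stand for the O(m^2) empirical estimates just described; treat this as a sketch under those assumptions.

from scipy.stats import gamma

def hsic_gamma_threshold(est_mean, est_var, m, alpha=0.05):
    """1 - alpha quantile of the Gamma approximation to the null distribution
    of m * HSIC_b, obtained by matching its first two moments."""
    shape = est_mean ** 2 / est_var   # shape parameter of the Gamma
    scale = m * est_var / est_mean    # scale parameter of the Gamma
    return gamma.ppf(1.0 - alpha, a=shape, scale=scale)

# Reject H0 when m * hsic_b(X, Y) exceeds the returned threshold.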
59 4 Experiments General tests of statistical independence are most useful for data having complex interactions that simple correlation does not detect. [sent-157, score-0.434]
60 We investigate two cases where this situation arises: first, we test vectors in Rd which have a dependence relation but no correlation, as occurs in independent subspace analysis; and second, we study the statistical dependence between a text and its translation. [sent-158, score-0.495]
61 Independence of subspaces One area where independence tests have been applied is in determining the convergence of algorithms for independent component analysis (ICA), which involves separating random variables that have been linearly mixed, using only their mutual independence. [sent-159, score-0.431]
62 Contingency table-based tests have been applied [15] in this context, while the test of [13] has been used in [14] for verifying ICA outcomes when the data are stationary random processes (through using a subset of samples with a sufficiently large delay between them). [sent-161, score-0.257]
63 Contingency table-based tests may be less useful in the case of independent subspace analysis (ISA, see e. [sent-162, score-0.226]
64 Thus, characteristic function-based tests [6, 13] and kernel independence measures might work better for this problem. [sent-165, score-0.61]
65 In our experiments, we tested the independence of random vectors, as a way of verifying the solutions of independent subspace analysis. [sent-166, score-0.301]
66 Second, we mixed these random variables using a rotation matrix parameterised by an angle θ, varying from 0 to π/4 (a zero angle means the data are independent, while dependence becomes easier to detect as the angle increases to π/4: see the two plots in Figure 2, top left). [sent-170, score-0.556]
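The mixing step can be written in a few lines; the sketch below only illustrates the experimental setup (independent inputs rotated by an angle theta) and is not the authors' experiment code, with the sampling of the inputs left to the caller.

import numpy as np

def mix_by_rotation(x, y, theta):
    """Mix two independent univariate samples with a 2-D rotation by theta.
    At theta = 0 the outputs equal the inputs (independent); dependence becomes
    easier to detect as theta grows towards pi/4, as described above."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s],
                  [s,  c]])
    mixed = np.column_stack([x, y]) @ R.T
    return mixed[:, 0], mixed[:, 1]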
67 We compared two different methods for computing the 1 − α quantile of the HSIC null distribution: repeated random permutation of the Y sample ordering as in [6] (HSICp), where we used 200 permutations; and Gamma approximation (HSICg) as in [13], based on (9). [sent-175, score-0.23]
68 We used a Gaussian kernel, with kernel size set to the median distance between points in input space. [sent-176, score-0.149]
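The kernel-size choice in the previous sentence is the usual median heuristic; a one-line sketch (pdist is from SciPy, and treating the median pairwise distance directly as sigma is one common convention assumed here):

import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic(X):
    """Median of the pairwise Euclidean distances between input points,
    used as the Gaussian kernel size sigma."""
    return np.median(pdist(X))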
69 (Ku and Fine did not specify a space partitioning strategy for higher dimensions, since they dealt only with univariate random variables). [sent-180, score-0.104]
70 The functional correlation-based test (fCorr) is described in [4]: the main differences with respect to our test are that it uses the spectrum of the functional correlation operator, rather than the covariance operator; and that it approximates the RKHSs F and G by finite sets of basis functions. [sent-182, score-0.293]
71 Parameter settings were as in [4, Table 1], with the second order B-spline kernel and a twofold dyadic partitioning. [sent-183, score-0.149]
72 The y-intercept on these plots corresponds to the acceptance rate of H0 at independence, or 1 − (Type I error), and should be close to the design parameter of 1 − α = 0.95. [sent-186, score-0.154]
73 Elsewhere, the plots indicate acceptance of H0 where the underlying variables are dependent, i.e. the Type II error. [sent-188, score-0.186]
74 As expected, we observe that dependence becomes easier to detect as θ increases from 0 to π/4, when m increases, and when d decreases. [sent-191, score-0.179]
75 The PD and fCorr tests perform poorly at m = 128, but approach the performance of HSIC-based tests for increasing m (although PD remains slightly worse than HSIC at m = 512 and d = 1, while fCorr becomes slightly worse again than PD). [sent-192, score-0.324]
76 PD also scales very badly with d, and never rejects the null hypothesis when d = 4, even for m = 2048. [sent-193, score-0.157]
77 Although HSIC-based tests are unreliable for small θ, they generally do well as θ approaches π/4 (besides m = 128, d = 2). [sent-194, score-0.162]
78 Dependence and independence between text In this section, we demonstrate independence testing on text. [sent-196, score-0.561]
79 Our goal was to test whether there exists a statistical dependence between a text and its translation.
Figure 2, top left: scatter plots of the samples after mixing by rotations θ = π/8 and θ = π/4 (axes X and Y); the other panels are labelled by sample size and dimension (e.g. Samp:512, Dim:1), with % acceptance of H0 on the y-axis.
81 We remark that the random variables appear “more dependent” as the angle θ increases, although their correlation is always zero. [sent-234, score-0.152]
82 Remaining plots: Rate of acceptance of H0 for the PD, fCorr, HSICp, and HSICg tests. [sent-235, score-0.154]
83 We used the k-spectrum kernel of [16], computed according to the method of [24]. [sent-244, score-0.149]
84 We compared this kernel with a simple kernel between bags of words [3, pp. [sent-246, score-0.298]
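For concreteness, both kernels admit very small sketches; the implementations below are our own illustrations (the paper computes the k-spectrum kernel with the faster method of [24], which is not reproduced here), and the function names are hypothetical.

from collections import Counter

def spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of the k-mer (length-k substring)
    count vectors of two character strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(c * ct[g] for g, c in cs.items())

def bow_kernel(s, t):
    """Simple bag-of-words kernel: inner product of word count vectors."""
    cs, ct = Counter(s.split()), Counter(t.split())
    return sum(c * ct[w] for w, c in cs.items())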
85 Our results demonstrate the excellent performance of the HSICp test on this task: even for small sample sizes, HSICp with a spectral kernel always achieves zero Type II error, and a Type I error close to the design value (0.05).
86 We further observe for m = 10 that HSICp with the spectral kernel always has better Type II error than the bag-of words kernel. [sent-251, score-0.149]
87 This suggests that a kernel with a more sophisticated encoding of text structure induces a more sensitive test, although for larger sample sizes, the advantage vanishes. [sent-252, score-0.231]
88 Thus, while the test threshold for HSICg at m = 50 still fell between the dependent and independent values of HSICb , this was not the result of an accurate modelling of the null distribution. [sent-255, score-0.261]
89 5 Conclusion We have introduced a test of whether significant statistical dependence is obtained by a kernel dependence measure, the Hilbert-Schmidt independence criterion (HSIC). [sent-258, score-0.819]
90 In our experiments, HSIC-based tests always outperformed the contingency table [17] and functional correlation [4] approaches, for both univariate random variables and higher dimensional vectors which were dependent but uncorrelated. [sent-260, score-0.449]
91 We would therefore recommend HSIC-based tests for checking the convergence of independent component analysis and independent subspace analysis. [sent-261, score-0.226]
92 Finally, our test also applies on structured domains, being able to detect the dependence between a text and its translation.
Table 1: Independence tests for cross-language dependence detection.
93 BOW(10) denotes a bag of words kernel and m = 10 sample size, Spec(50) is a k-spectrum kernel with m = 50. [sent-264, score-0.332]
94 The first entry in each cell is the null acceptance rate of the test under H0 (i.e. it should be close to 0.95); the second entry is the null acceptance rate under H1 (the Type II error, small is better). [sent-265, score-0.345] [sent-268, score-0.25]
96 Another application along these lines might be in testing dependence between data of completely different types, such as images and captions. [sent-319, score-0.183]
97 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-366, score-0.149]
98 A resampling test for the total independence of stationary time series: Application to the performance evaluation of ica algorithms. [sent-420, score-0.417]
99 The spectrum kernel: A string kernel for SVM protein classification. [sent-433, score-0.149]
100 The influence of the kernel on the consistency of support vector machines. [sent-477, score-0.149]
wordName wordTfidf (topN-words)
[('hsicb', 0.482), ('hsic', 0.307), ('independence', 0.237), ('hsicp', 0.231), ('hsicg', 0.193), ('pxy', 0.172), ('tests', 0.162), ('acceptance', 0.154), ('kernel', 0.149), ('dependence', 0.144), ('samp', 0.135), ('kij', 0.128), ('statistic', 0.121), ('fcorr', 0.116), ('hijqr', 0.116), ('mhsicb', 0.116), ('quantile', 0.1), ('null', 0.096), ('test', 0.095), ('dim', 0.086), ('ica', 0.085), ('angle', 0.085), ('contingency', 0.081), ('pd', 0.077), ('gretton', 0.074), ('py', 0.074), ('norms', 0.07), ('gamma', 0.07), ('univariate', 0.068), ('operator', 0.068), ('cxy', 0.067), ('subspace', 0.064), ('px', 0.064), ('characteristic', 0.062), ('kernels', 0.061), ('hypothesis', 0.061), ('fisheries', 0.058), ('hsics', 0.058), ('spec', 0.058), ('english', 0.052), ('type', 0.052), ('extracts', 0.051), ('fukumizu', 0.051), ('nicta', 0.051), ('criterion', 0.05), ('anu', 0.05), ('bow', 0.05), ('parameterised', 0.05), ('im', 0.05), ('french', 0.049), ('text', 0.048), ('lij', 0.046), ('schmidt', 0.046), ('zl', 0.046), ('mpi', 0.043), ('ex', 0.043), ('song', 0.04), ('teo', 0.04), ('rotation', 0.04), ('theorems', 0.039), ('testing', 0.039), ('bearing', 0.039), ('cxx', 0.039), ('cyy', 0.039), ('discretisation', 0.039), ('emphasise', 0.039), ('entrywise', 0.039), ('immigration', 0.039), ('ktu', 0.039), ('liq', 0.039), ('spearman', 0.039), ('ey', 0.039), ('sch', 0.038), ('singular', 0.037), ('expressions', 0.037), ('dependent', 0.037), ('australia', 0.037), ('replacement', 0.037), ('ii', 0.036), ('partitioning', 0.036), ('detect', 0.035), ('correlation', 0.035), ('optimisation', 0.034), ('functional', 0.034), ('sample', 0.034), ('agriculture', 0.034), ('exx', 0.034), ('exy', 0.034), ('lqr', 0.034), ('threshold', 0.033), ('theorem', 0.033), ('hilbert', 0.033), ('variables', 0.032), ('smola', 0.032), ('statistically', 0.031), ('ku', 0.031), ('paragraphs', 0.031), ('presently', 0.031), ('rkhss', 0.031), ('presentation', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
2 0.20429285 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
3 0.1825431 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui
Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1
4 0.14664389 49 nips-2007-Colored Maximum Variance Unfolding
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distancepreserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information. 1
5 0.095503435 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
6 0.070727222 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
7 0.068810582 160 nips-2007-Random Features for Large-Scale Kernel Machines
8 0.066991985 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
9 0.065091871 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
10 0.064592831 109 nips-2007-Kernels on Attributed Pointsets with Applications
11 0.061952781 125 nips-2007-Markov Chain Monte Carlo with People
12 0.061126202 97 nips-2007-Hidden Common Cause Relations in Relational Learning
13 0.054627731 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
14 0.05204656 164 nips-2007-Receptive Fields without Spike-Triggering
15 0.048629411 101 nips-2007-How SVMs can estimate quantiles and the median
16 0.048002712 110 nips-2007-Learning Bounds for Domain Adaptation
17 0.04756549 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
18 0.046530873 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
19 0.04641477 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
20 0.045695454 62 nips-2007-Convex Learning with Invariances
topicId topicWeight
[(0, -0.17), (1, 0.034), (2, -0.049), (3, 0.059), (4, -0.009), (5, 0.02), (6, 0.015), (7, 0.004), (8, -0.165), (9, 0.084), (10, 0.076), (11, -0.021), (12, 0.071), (13, 0.013), (14, -0.111), (15, -0.211), (16, 0.183), (17, -0.025), (18, -0.013), (19, 0.061), (20, 0.239), (21, 0.16), (22, 0.051), (23, -0.053), (24, 0.022), (25, 0.046), (26, 0.14), (27, -0.067), (28, 0.027), (29, -0.035), (30, -0.005), (31, -0.046), (32, 0.078), (33, -0.095), (34, 0.096), (35, -0.149), (36, -0.138), (37, 0.147), (38, 0.049), (39, 0.017), (40, 0.088), (41, 0.122), (42, -0.059), (43, 0.019), (44, 0.023), (45, -0.039), (46, 0.023), (47, -0.061), (48, -0.063), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.95173407 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
2 0.90463912 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
3 0.72374517 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui
Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1
4 0.69823807 49 nips-2007-Colored Maximum Variance Unfolding
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distancepreserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information. 1
5 0.48141623 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
6 0.41878957 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
7 0.40908957 109 nips-2007-Kernels on Attributed Pointsets with Applications
8 0.40636528 101 nips-2007-How SVMs can estimate quantiles and the median
9 0.39541167 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
10 0.3812117 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
11 0.33295327 97 nips-2007-Hidden Common Cause Relations in Relational Learning
12 0.32586038 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
13 0.30098042 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
14 0.29679698 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
15 0.2963284 160 nips-2007-Random Features for Large-Scale Kernel Machines
16 0.29188555 99 nips-2007-Hierarchical Penalization
17 0.28052127 24 nips-2007-An Analysis of Inference with the Universum
18 0.26743251 70 nips-2007-Discriminative K-means for Clustering
19 0.26443046 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
20 0.25721756 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
topicId topicWeight
[(5, 0.062), (13, 0.062), (16, 0.025), (19, 0.016), (21, 0.058), (31, 0.02), (34, 0.015), (35, 0.029), (46, 0.027), (47, 0.062), (49, 0.072), (55, 0.256), (83, 0.107), (85, 0.023), (87, 0.017), (90, 0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.74983174 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
2 0.55344033 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
Author: Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, Santiago Fernández
Abstract: In online handwriting recognition the trajectory of the pen is recorded during writing. Although the trajectory provides a compact and complete representation of the written output, it is hard to transcribe directly, because each letter is spread over many pen locations. Most recognition systems therefore employ sophisticated preprocessing techniques to put the inputs into a more localised form. However these techniques require considerable human effort, and are specific to particular languages and alphabets. This paper describes a system capable of directly transcribing raw online handwriting data. The system consists of an advanced recurrent neural network with an output layer designed for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained online database, we record excellent results using either raw or preprocessed data, well outperforming a state-of-the-art HMM based system in both cases. 1
3 0.55304289 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang
Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
4 0.54344368 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
Author: Kristiaan Pelckmans, Johan Suykens, Bart D. Moor
Abstract: This paper1 explores the use of a Maximal Average Margin (MAM) optimality principle for the design of learning algorithms. It is shown that the application of this risk minimization principle results in a class of (computationally) simple learning machines similar to the classical Parzen window classifier. A direct relation with the Rademacher complexities is established, as such facilitating analysis and providing a notion of certainty of prediction. This analysis is related to Support Vector Machines by means of a margin transformation. The power of the MAM principle is illustrated further by application to ordinal regression tasks, resulting in an O(n) algorithm able to process large datasets in reasonable time. 1
5 0.54291391 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
6 0.53665721 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
7 0.53537792 63 nips-2007-Convex Relaxations of Latent Variable Training
8 0.53513283 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
9 0.53419769 185 nips-2007-Stable Dual Dynamic Programming
10 0.53321183 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
11 0.53057474 156 nips-2007-Predictive Matrix-Variate t Models
12 0.52997065 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
13 0.52957332 49 nips-2007-Colored Maximum Variance Unfolding
14 0.5272631 84 nips-2007-Expectation Maximization and Posterior Constraints
15 0.52699482 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
16 0.52670723 134 nips-2007-Multi-Task Learning via Conic Programming
17 0.52613848 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
18 0.52605224 115 nips-2007-Learning the 2-D Topology of Images
19 0.52571863 24 nips-2007-An Analysis of Inference with the Universum
20 0.52512002 200 nips-2007-The Tradeoffs of Large Scale Learning