nips nips2007 nips2007-108 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. [sent-11, score-0.361]
2 Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. [sent-12, score-0.353]
3 1 Introduction Measuring dependence of random variables is one of the main concerns of statistical inference. [sent-15, score-0.137]
4 A typical example is the inference of a graphical model, which expresses the relations among variables in terms of independence and conditional independence. [sent-16, score-0.175]
5 Independent component analysis employs a measure of independence as the objective function, and feature selection in supervised learning looks for a set of features on which the response variable most depends. [sent-17, score-0.144]
6 Kernel methods have been successfully used for capturing (conditional) dependence of variables [1, 5, 8, 9, 16]. [sent-18, score-0.16]
7 With the ability to represent high order moments, mapping of variables into reproducing kernel Hilbert spaces (RKHSs) allows us to infer properties of the distributions, such as independence and homogeneity [7]. [sent-19, score-0.284]
8 A drawback of previous kernel dependence measures, however, is that their value depends not only on the distribution of the variables, but also on the kernel, in contrast to measures such as mutual information. [sent-20, score-0.334]
9 In this paper, we propose to use the Hilbert-Schmidt norm of the normalized conditional cross-covariance operator, and show that this operator encodes the dependence structure of random variables. [sent-21, score-0.301]
10 Our criterion includes a measure of unconditional dependence as a special case. [sent-22, score-0.138]
11 We prove in the limit of infinite data, under assumptions on the richness of the RKHS, that this measure has an explicit integral expression which depends only on the probability densities of the variables, despite being defined in terms of kernels. [sent-23, score-0.051]
12 Furthermore, we provide a general formulation for the “richness” of an RKHS, and a theoretically motivated kernel selection method. [sent-25, score-0.12]
13 2 Measuring conditional dependence with kernels The probability law of a random variable X is denoted by $P_X$, and the space of square-integrable functions with respect to a probability P by $L^2(P)$. [sent-27, score-0.246]
14 The null space and the range of an operator T are written N (T ) and R(T ), respectively. [sent-29, score-0.131]
15 2.1 Dependence measures with normalized cross-covariance operators Covariance operators on RKHSs have been successfully used for capturing dependence and conditional dependence of random variables, by incorporating higher-order moments [5, 8, 16]. [sent-31, score-0.436]
16 Suppose we have a random variable (X, Y ) on X × Y, and RKHSs HX and HY on X and Y, respectively, with measurable positive definite kernels kX and kY . [sent-33, score-0.119]
17 The cross-covariance operator $\Sigma_{YX} : H_X \to H_Y$ is defined as the unique bounded operator that satisfies $\langle g, \Sigma_{YX} f \rangle_{H_Y} = \mathrm{Cov}[f(X), g(Y)]$ $(= E[f(X)g(Y)] - E[f(X)]\,E[g(Y)])$ (1) for all $f \in H_X$ and $g \in H_Y$. [sent-36, score-0.262]
18 The operator $\Sigma_{YX}$ naturally extends the covariance matrix $C_{YX}$ on Euclidean spaces, and represents higher-order correlations of X and Y through f(X) and g(Y) with nonlinear kernels. [sent-38, score-0.158]
19 It is known [2] that the cross-covariance operator can be decomposed into the covariance of the marginals and the correlation; that is, there exists a unique bounded operator $V_{YX}$ such that $\Sigma_{YX} = \Sigma_{YY}^{1/2}\, V_{YX}\, \Sigma_{XX}^{1/2}$, (2) with $R(V_{YX}) \subset R(\Sigma_{YY})$ and $N(V_{YX})^{\perp} \subset R(\Sigma_{XX})$. [sent-39, score-0.289]
20 The operator norm of VY X is less than or equal to 1. [sent-40, score-0.156]
21 We call VY X the normalized cross-covariance operator (NOCCO, see also [4]). [sent-41, score-0.131]
22 While the operator $V_{YX}$ encodes the same information regarding the dependence of X and Y as $\Sigma_{YX}$, the former expresses this information more directly, with less influence of the marginals. [sent-42, score-0.244]
23 This relation can be understood as an analogue to the difference between the covariance $\mathrm{Cov}[X, Y]$ and the correlation $\mathrm{Cov}[X, Y] / (\mathrm{Var}(X)\,\mathrm{Var}(Y))^{1/2}$. [sent-43, score-0.05]
24 Note also that kernel canonical correlation analysis [1] uses the largest eigenvalue of VY X and its corresponding eigenfunctions [4]. [sent-44, score-0.169]
25 We then define the normalized conditional cross-covariance operator, $V_{YX|Z} = V_{YX} - V_{YZ} V_{ZX}$, (3) for measuring the conditional dependence of X and Y given Z, where $V_{YZ}$ and $V_{ZX}$ are defined similarly to Eq. (2). [sent-46, score-0.205]
26 The operator $V_{YX|Z}$ may be better understood by expressing it as $V_{YX|Z} = \Sigma_{YY}^{-1/2}\bigl(\Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}\bigr)\Sigma_{XX}^{-1/2}$, where $\Sigma_{YX|Z} = \Sigma_{YX} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZX}$ can be interpreted as a nonlinear extension of the conditional covariance matrix $C_{YX} - C_{YZ}C_{ZZ}^{-1}C_{ZX}$ of Gaussian random variables. [sent-48, score-0.184]
27 The operator $\Sigma_{YX}$ can be used to determine the independence of X and Y: roughly speaking, $\Sigma_{YX} = O$ if and only if $X \perp\!\!\!\perp Y$. [sent-49, score-0.25]
28 Similarly, a relation between $\Sigma_{YX|Z}$ and conditional independence, $X \perp\!\!\!\perp Y \mid Z$, has been established in [5]: if the extended variables $\ddot{X} = (X, Z)$ and $\ddot{Y} = (Y, Z)$ are used, $X \perp\!\!\!\perp Y \mid Z$ is equivalent to $\Sigma_{\ddot{Y}\ddot{X}|Z} = O$. [sent-50, score-0.056]
29 Noting that the conditions $\Sigma_{YX} = O$ and $\Sigma_{\ddot{Y}\ddot{X}|Z} = O$ are equivalent to $V_{YX} = O$ and $V_{\ddot{Y}\ddot{X}|Z} = O$, respectively, we propose to use the Hilbert-Schmidt norms of the latter operators as dependence measures. [sent-52, score-0.163]
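For reference, the two proposed measures can be written compactly as follows (a restatement in the notation above; the paper's own numbered equations for these definitions are not reproduced in this excerpt):

$$ I^{\mathrm{NOCCO}}(X, Y) = \|V_{YX}\|_{HS}^{2}, \qquad I^{\mathrm{COND}}(X, Y \mid Z) = \|V_{\ddot{Y}\ddot{X}|Z}\|_{HS}^{2}. $$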
30 Recall that an operator $A : H_1 \to H_2$ is called Hilbert-Schmidt if for complete orthonormal systems (CONSs) $\{\phi_i\}$ of $H_1$ and $\{\psi_j\}$ of $H_2$, the sum $\sum_{i,j} \langle \psi_j, A\phi_i \rangle_{H_2}^2$ is finite (see [13]). [sent-53, score-0.131]
31 For a Hilbert-Schmidt operator A, the Hilbert-Schmidt (HS) norm $\|A\|_{HS}$ is defined by $\|A\|_{HS}^2 = \sum_{i,j} \langle \psi_j, A\phi_i \rangle_{H_2}^2$. [sent-54, score-0.156]
32 A sufficient condition that these operators are Hilbert-Schmidt will be discussed in Section 2. [sent-57, score-0.05]
33 The HS norm of the finite rank operator VY X|Z is easy to calculate. [sent-71, score-0.156]
34 The empirical dependence measures are then $\hat{I}_n^{\mathrm{COND}} \equiv \|V^{(n)}_{\ddot{Y}\ddot{X}|Z}\|_{HS}^2 = \mathrm{Tr}\bigl[R_{\ddot{Y}} R_{\ddot{X}} - 2 R_{\ddot{Y}} R_{\ddot{X}} R_Z + R_{\ddot{Y}} R_Z R_{\ddot{X}} R_Z\bigr]$ (7) and $\hat{I}_n^{\mathrm{NOCCO}}(X, Y) \equiv \|V^{(n)}_{YX}\|_{HS}^2 = \mathrm{Tr}\bigl[R_Y R_X\bigr]$ (8), where the extended variables are used for $\hat{I}_n^{\mathrm{COND}}$. [sent-73, score-0.192]
35 These empirical estimators, and the use of $\varepsilon_n$, will be justified in Section 2.4. [sent-74, score-0.048]
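A minimal sketch of how Eqs. (7)-(8) can be computed from data. The construction $R_X = G_X (G_X + n\varepsilon_n I)^{-1}$ from the centered Gram matrix $G_X$ is not shown in this excerpt and is an assumption here, as are the Gaussian kernel width and the value of $\varepsilon_n$; all function names are illustrative.

```python
import numpy as np

def centered_gram(x, sigma=1.0):
    """Centered Gaussian Gram matrix G = H K H, with H = I - (1/n) 11^T."""
    x = np.atleast_2d(np.asarray(x, dtype=float)).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def normalized_gram(G, eps_n):
    """R = G (G + n * eps_n * I)^(-1): the regularized, 'correlation-like' matrix."""
    n = G.shape[0]
    return G @ np.linalg.inv(G + n * eps_n * np.eye(n))

def i_nocco(x, y, sigma=1.0, eps_n=1e-3):
    """Empirical I_n^NOCCO = Tr[R_Y R_X], Eq. (8)."""
    Rx = normalized_gram(centered_gram(x, sigma), eps_n)
    Ry = normalized_gram(centered_gram(y, sigma), eps_n)
    return float(np.trace(Ry @ Rx))

def i_cond(x, y, z, sigma=1.0, eps_n=1e-3):
    """Empirical I_n^COND of Eq. (7), with extended variables (X,Z) and (Y,Z)."""
    x = np.atleast_2d(np.asarray(x, dtype=float)).reshape(len(x), -1)
    y = np.atleast_2d(np.asarray(y, dtype=float)).reshape(len(y), -1)
    z = np.atleast_2d(np.asarray(z, dtype=float)).reshape(len(z), -1)
    Rxd = normalized_gram(centered_gram(np.hstack([x, z]), sigma), eps_n)
    Ryd = normalized_gram(centered_gram(np.hstack([y, z]), sigma), eps_n)
    Rz = normalized_gram(centered_gram(z, sigma), eps_n)
    return float(np.trace(Ryd @ Rxd - 2.0 * Ryd @ Rxd @ Rz + Ryd @ Rz @ Rxd @ Rz))
```

The conditional statistic simply evaluates the trace expression of Eq. (7); $\varepsilon_n$ is the small regularization constant whose role is discussed with the consistency results.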
36 2.2 Inference on probabilities by characteristic kernels To relate $I^{\mathrm{NOCCO}}$ and $I^{\mathrm{COND}}$ to independence and conditional independence, respectively, the RKHS should contain a sufficiently rich class of functions, able to represent all higher-order moments. [sent-78, score-0.301]
37 Similar notions have already appeared in the literature: universal kernels on compact domains [15] and Gaussian kernels on the entire $\mathbb{R}^m$ characterize independence via the cross-covariance operator [8, 1]. [sent-79, score-0.447]
38 We now discuss a unified class of kernels for inference on probabilities. [sent-80, score-0.077]
39 Let (X , B) be a measurable space, X a random variable on X , and (H, k) an RKHS on X satisfying assumption (A-1). [sent-81, score-0.042]
40 The kernel k is said to be characteristic¹ if the map $M_k$, which sends a probability P to its mean element $m_P = E_{X \sim P}[k(\cdot, X)] \in H$, is injective, or equivalently, if the condition $E_{X\sim P}[f(X)] = E_{X\sim Q}[f(X)]$ $(\forall f \in H)$ implies $P = Q$. [sent-85, score-0.12]
41 The notion of a characteristic kernel is an analogue of the characteristic function $E_P[e^{\sqrt{-1}\, u^T X}]$, which is the expectation of the Fourier kernel $k_F(x, u) = e^{\sqrt{-1}\, u^T x}$. [sent-86, score-0.386]
42 Noting that mP = mQ iff EP [k(u, X)] = EQ [k(u, X)] for all u ∈ X , the definition of a characteristic kernel generalizes the well-known property of the characteristic function that EP [kF (u, X)] uniquely determines a Borel probability P on Rm . [sent-87, score-0.266]
43 The next lemma is useful to show that a kernel is characteristic. [sent-88, score-0.12]
44 ¹ Although the same notion was called “probability-determining” in [5], we call it “characteristic” by analogy with the characteristic function. [sent-89, score-0.073]
45 Suppose that (H, k) is an RKHS on a measurable space (X , B) with k measurable and bounded. [sent-92, score-0.084]
46 If H + R (the direct sum of the two RKHSs) is dense in L q (X , P ) for any probability P on (X , B), the kernel k is characteristic. [sent-93, score-0.154]
47 By the assumption, for any ε > 0 and a measurable set A, there is a function f ∈ H and c ∈ R such that |EP [f (X)] + c − P (A)| < ε and |EQ [f (Y )] + c − Q(A)| < ε, from which we have |P (A) − Q(A)| < 2ε. [sent-96, score-0.042]
48 For a compact metric space, it is easy to see that the RKHS given by a universal kernel [15] is dense in L2 (P ) for any P , and thus characteristic (see also [7] Theorem 3). [sent-99, score-0.227]
49 It is also important to consider kernels on non-compact spaces, since many standard random variables, such as Gaussian variables, are defined on non-compact spaces. [sent-100, score-0.077]
50 The next theorem implies that many kernels on the entire Rm , including Gaussian and Laplacian, are characteristic. [sent-101, score-0.103]
51 Let $\phi(z)$ be a continuous positive function on $\mathbb{R}^m$ with Fourier transform $\tilde{\phi}(u)$, and let k be a kernel of the form $k(x, y) = \phi(x - y)$. [sent-104, score-0.12]
52 If for any $\xi \in \mathbb{R}^m$ there exists $\tau_0$ such that $\int \tilde{\phi}(\tau(u + \xi))^2 / \tilde{\phi}(u)\, du < \infty$ for all $\tau > \tau_0$, then the RKHS associated with k is dense in $L^2(P)$ for any Borel probability P on $\mathbb{R}^m$. [sent-105, score-0.034]
53 Hence k is characteristic with respect to the Borel σ-field. [sent-106, score-0.073]
54 The assumptions to relate the operators with independence are well described by using characteristic kernels and denseness. [sent-107, score-0.319]
55 In addition to (A-1), assume that the product $k_{\ddot{X}} k_Y$ is a characteristic kernel on $(\mathcal{X} \times \mathcal{Z}) \times \mathcal{Y}$, and $H_Z + \mathbb{R}$ is dense in $L^2(P_Z)$. [sent-113, score-0.227]
56 From the above results, we can guarantee that $V_{YX}$ and $V_{\ddot{Y}\ddot{X}|Z}$ will detect independence and conditional independence, if we use a Gaussian or Laplacian kernel either on a compact set or on the whole of $\mathbb{R}^m$. [sent-115, score-0.286]
57 2.3 Kernel-free integral expression of the measures A remarkable property of $I^{\mathrm{NOCCO}}$ and $I^{\mathrm{COND}}$ is that, under some assumptions, they do not depend on the kernels, having integral expressions that contain only the probability density functions. [sent-118, score-0.132]
58 Let $\mu_X$ and $\mu_Y$ be measures on $\mathcal{X}$ and $\mathcal{Y}$, respectively, and assume that the probabilities $P_{XY}$ and $E_Z[P_{X|Z} \otimes P_{Y|Z}]$ are absolutely continuous with respect to $\mu_X \times \mu_Y$, with probability density functions $p_{XY}$ and $p_{X \perp\!\!\!\perp Y|Z}$, respectively. [sent-121, score-0.055]
59 While the empirical estimate from finite samples depends on the choice of kernels, it is a desirable property for the empirical dependence measure to converge to a value that depends only on the distributions of the variables. [sent-134, score-0.186]
60 Eq. (9) shows that, under the assumptions, $I^{\mathrm{NOCCO}}$ is equal to the mean square contingency, a well-known dependence measure [14] commonly used for discrete variables. [sent-136, score-0.137]
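For orientation (the displayed Eq. (9) itself is not reproduced in this excerpt), the mean square contingency in density form is standardly written as below; under the theorem's assumptions this is the value that $I^{\mathrm{NOCCO}}$ attains:

$$ \int_{\mathcal{X} \times \mathcal{Y}} \left( \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} - 1 \right)^{2} p_X(x)\, p_Y(y)\, d\mu_X\, d\mu_Y . $$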
61 By the consistency result in Section 2.4, $\hat{I}_n^{\mathrm{NOCCO}}$ works as a consistent kernel estimator of the mean square contingency. [sent-138, score-0.144]
62 Eq. (9) can be compared with the mutual information, $MI(X, Y) = \int_{\mathcal{X} \times \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)}\, d\mu_X\, d\mu_Y$. [sent-140, score-0.046]
63 Both the mutual information and the mean square contingency are nonnegative, and equal to zero if and only if X and Y are independent. [sent-141, score-0.102]
64 While the mutual information is the best-known dependence measure, its finite-sample empirical estimate is not straightforward, especially for continuous variables. [sent-143, score-0.183]
65 2.4 Consistency of the measures It is important to ask whether the empirical measures converge to the population values $I^{\mathrm{COND}}$ and $I^{\mathrm{NOCCO}}$, since this provides a theoretical justification for the empirical measures. [sent-146, score-0.158]
66 It is known [4] that $V^{(n)}_{YX}$ converges in probability to $V_{YX}$ in operator norm. [sent-147, score-0.131]
67 Although the proof is analogous to the operator-norm case, the argument for the HS norm is more involved. [sent-149, score-0.131]
68 2.5 Choice of kernels As with all empirical measures, the sample estimates $\hat{I}_n^{\mathrm{NOCCO}}$ and $\hat{I}_n^{\mathrm{COND}}$ depend on the kernel, and the problem of choosing a kernel has yet to be solved. [sent-156, score-0.221]
69 Unlike in supervised learning, there are no easy criteria for choosing a kernel for dependence measures. [sent-157, score-0.233]
70 We propose a method of choosing a kernel by considering the large-sample behavior. [sent-158, score-0.12]
71 The basic idea is that a kernel should be chosen so that the covariance operator detects independence of variables as effectively as possible. [sent-160, score-0.447]
72 It has been recently shown [10] that, under independence of the variables, nHSIC approaches an asymptotic distribution whose variance can be computed. [sent-161, score-0.119]
73 [Figure caption, right panel:] The marks “o” and “+” show $\hat{I}_n^{\mathrm{NOCCO}}$ for each angle and the 95th percentile of the permutation test, respectively. [sent-173, score-0.085]
74 We choose a kernel so that the bootstrapped variance $\mathrm{Var}_B[n\mathrm{HSIC}]$ of nHSIC is close to this theoretical limit variance. [sent-175, score-0.12]
75 We can expect that the chosen kernel uses the data effectively. [sent-182, score-0.12]
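A rough sketch of this kernel-selection rule as described above. The theoretical limit variance of nHSIC under independence (from [10]) is not derived in this excerpt, so it is passed in as a precomputed `target_var`; the paired bootstrap, the candidate grid, and all names are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(x, sigma):
    """Uncentered Gaussian Gram matrix."""
    x = np.atleast_2d(np.asarray(x, dtype=float)).reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def nhsic(K, L):
    """n * HSIC computed from uncentered Gram matrices (biased empirical estimate)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(H @ K @ H @ L)) / n

def choose_sigma(x, y, sigmas, target_var, n_boot=200, seed=0):
    """Pick the width whose bootstrapped Var_B[nHSIC] is closest to target_var."""
    rng = np.random.default_rng(seed)
    n = len(y)
    best_sigma, best_gap = None, np.inf
    for sigma in sigmas:
        K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)        # paired bootstrap resample
            stats.append(nhsic(K[np.ix_(idx, idx)], L[np.ix_(idx, idx)]))
        gap = abs(np.var(stats) - target_var)
        if gap < best_gap:
            best_sigma, best_gap = sigma, gap
    return best_sigma
```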
76 In the next section we see that the method gives a reasonable result for $\hat{I}_n^{\mathrm{NOCCO}}$ and $\hat{I}_n^{\mathrm{COND}}$. 3 Experiments To evaluate the dependence measures, we use a permutation test of independence for data sets with various degrees of dependence. [sent-184, score-0.297]
77 In the following experiments, we always use Gaussian kernels $k(x_1, x_2) = \exp\bigl(-\frac{1}{2\sigma^2}\|x_1 - x_2\|^2\bigr)$ and choose $\sigma$ by the method proposed in Section 2.5. [sent-197, score-0.077]
78 The random variables X (0) and Y (0) are independent and uniformly distributed on [−2, 2] and [a, b] ∪ [−b, −a], respectively, so that (X (0) , Y (0) ) has a scalar covariance matrix. [sent-200, score-0.051]
79 We perform permutation tests with $\hat{I}_n^{\mathrm{NOCCO}}$, $\mathrm{HSIC} = \|\Sigma^{(n)}_{YX}\|_{HS}^2$, and the mutual information (MI). [sent-204, score-0.154]
80 Since $\hat{I}_n^{\mathrm{NOCCO}}$ is an estimate of the mean square contingency, we also apply a relevant contingency-table-based independence test ([12]), partitioning the variables into bins. [sent-206, score-0.167]
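A sketch of the permutation test used in these experiments, written around the $\hat{I}_n^{\mathrm{NOCCO}}$ sketch given after Eq. (8); the number of permutations and the significance level are illustrative choices, not values taken from the paper.

```python
import numpy as np

def permutation_test_independence(x, y, stat_fn, n_perm=500, alpha=0.05, seed=0):
    """Permutation test of independence: the null distribution of the dependence
    statistic is formed by randomly re-pairing the Y sample with the X sample;
    independence is rejected when the observed statistic exceeds the
    (1 - alpha) quantile of the permuted statistics."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = stat_fn(x, y)
    null_stats = [stat_fn(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    threshold = float(np.quantile(null_stats, 1.0 - alpha))
    return observed > threshold, observed, threshold

# Example with the earlier sketch (hypothetical data arrays x_sample, y_sample):
# reject, obs, thr = permutation_test_independence(
#     x_sample, y_sample, lambda a, b: i_nocco(a, b, sigma=1.0, eps_n=1e-3))
```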
81 We evaluate a chaotic time series derived from the coupled Hénon map. [sent-219, score-0.066]
82 Table 1: Comparison of dependence measures. [sent-240, score-0.113]
83 The number of times independence is accepted out of 100 permutation tests is shown. [sent-241, score-0.25]
84 “Median” is a heuristic method [8] that chooses $\sigma$ as the median of the pairwise distances of the data. [sent-246, score-0.045]
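A small sketch of that heuristic (assuming Euclidean data; the function name is illustrative):

```python
import numpy as np

def median_heuristic_sigma(x):
    """'Median' heuristic of [8]: sigma = median of the pairwise distances."""
    x = np.atleast_2d(np.asarray(x, dtype=float)).reshape(len(x), -1)
    d = np.sqrt(np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1))
    return float(np.median(d[np.triu_indices(len(x), k=1)]))
```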
85 [Figure caption, panels (c,d):] examples of $\hat{I}_n^{\mathrm{NOCCO}}$ and the thresholds of the permutation test with significance level 5% (black “+”). [sent-262, score-0.065]
86 Table 2 shows the results of permutation tests of independence for the instantaneous pairs $(X(t), Y(t))_{t=1}^{100}$. [sent-263, score-0.227]
87 We also use $\hat{I}_n^{\mathrm{COND}}$ to detect the causal structure of the same time series. [sent-265, score-0.05]
88 In Table 3, it is remarkable that $\hat{I}_n^{\mathrm{COND}}$ detects the small causal influence from $X_t$ to $Y_{t+1}$ even at small values of the coupling $\gamma$. [sent-269, score-0.055]
89 The data consist of three variables: creatinine clearance (C), digoxin clearance (D), and urine flow (U). [sent-273, score-0.06]
90 Table 4 shows the results of the permutation tests and a comparison with the linear method. [sent-279, score-0.108]
91 Table 2: Results of the independence tests for the chaotic time series. [sent-288, score-0.228]
92 The number of times independence was accepted out of 100 permutation tests is shown. [sent-289, score-0.25]
93 Table 3: Results of the permutation test of non-causality for the chaotic time series. [sent-305, score-0.131]
94 The number of times non-causality was accepted out of 100 tests is shown. [sent-306, score-0.066]
95 4 Concluding remarks There are many dependence measures, and further theoretical and experimental comparison is important. [sent-326, score-0.113]
96 That said, one unambiguous strength of the kernel measure we propose is its kernel-free population expression. [sent-327, score-0.145]
97 It is interesting to ask if other classical dependence measures, such as the mutual information, can be estimated by kernels (in a broader sense than the expansion about independence of [9]). [sent-328, score-0.384]
98 A relevant measure is the kernel generalized variance (KGV [1]), which is based on a sum of the logarithm of the eigenvalues of $V_{YX}$, while $I^{\mathrm{NOCCO}}$ is the sum of their squares. [sent-329, score-0.145]
99 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-363, score-0.141]
100 On the influence of the kernel on the consistency of support vector machines. [sent-436, score-0.12]
wordName wordTfidf (topN-words)
[('vy', 0.648), ('occo', 0.425), ('hs', 0.177), ('px', 0.176), ('py', 0.174), ('hy', 0.145), ('operator', 0.131), ('hx', 0.121), ('kernel', 0.12), ('independence', 0.119), ('dependence', 0.113), ('kx', 0.113), ('vzx', 0.106), ('pxy', 0.09), ('kernels', 0.077), ('nhsic', 0.076), ('ry', 0.076), ('rz', 0.076), ('characteristic', 0.073), ('hsic', 0.072), ('rx', 0.072), ('rkhs', 0.072), ('xx', 0.071), ('chaotic', 0.066), ('permutation', 0.065), ('ky', 0.064), ('mx', 0.061), ('fukumizu', 0.061), ('measures', 0.055), ('yt', 0.054), ('mi', 0.054), ('operators', 0.05), ('gretton', 0.048), ('rkhss', 0.048), ('xt', 0.048), ('mutual', 0.046), ('con', 0.046), ('corr', 0.046), ('gz', 0.046), ('scholkopf', 0.046), ('spemannstra', 0.046), ('median', 0.045), ('tests', 0.043), ('measurable', 0.042), ('borel', 0.04), ('gx', 0.04), ('mp', 0.039), ('rm', 0.035), ('dense', 0.034), ('gy', 0.034), ('cy', 0.034), ('conditional', 0.032), ('contingency', 0.032), ('bach', 0.032), ('clearance', 0.03), ('conss', 0.03), ('kgv', 0.03), ('nocco', 0.03), ('ppxyy', 0.03), ('thresh', 0.03), ('varb', 0.03), ('ez', 0.03), ('causal', 0.029), ('expansion', 0.029), ('measuring', 0.028), ('covariance', 0.027), ('xp', 0.026), ('detects', 0.026), ('eigenfunctions', 0.026), ('mq', 0.026), ('pz', 0.026), ('richness', 0.026), ('tional', 0.026), ('zx', 0.026), ('theorem', 0.026), ('bingen', 0.026), ('mk', 0.026), ('cov', 0.025), ('measure', 0.025), ('norm', 0.025), ('variables', 0.024), ('kz', 0.024), ('zz', 0.024), ('square', 0.024), ('table', 0.024), ('medical', 0.024), ('empirical', 0.024), ('capturing', 0.023), ('correlation', 0.023), ('accepted', 0.023), ('kf', 0.023), ('respectively', 0.022), ('cybernetics', 0.022), ('hz', 0.022), ('laplacian', 0.021), ('detect', 0.021), ('reproducing', 0.021), ('angle', 0.02), ('bins', 0.02), ('germany', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
2 0.20429285 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
3 0.15479878 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. It proves novel stability-based generalization bounds that hold even with this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.
4 0.11696299 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui
Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1
5 0.088784881 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
6 0.069195978 49 nips-2007-Colored Maximum Variance Unfolding
7 0.064728506 101 nips-2007-How SVMs can estimate quantiles and the median
8 0.059122406 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
9 0.058930092 109 nips-2007-Kernels on Attributed Pointsets with Applications
10 0.05379511 160 nips-2007-Random Features for Large-Scale Kernel Machines
11 0.053569168 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
12 0.050002988 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
13 0.049505949 146 nips-2007-On higher-order perceptron algorithms
14 0.047957875 185 nips-2007-Stable Dual Dynamic Programming
15 0.047788717 21 nips-2007-Adaptive Online Gradient Descent
16 0.046073146 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
17 0.044948041 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
18 0.042930946 135 nips-2007-Multi-task Gaussian Process Prediction
19 0.039520942 164 nips-2007-Receptive Fields without Spike-Triggering
20 0.039312657 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
topicId topicWeight
[(0, -0.13), (1, 0.009), (2, -0.025), (3, 0.079), (4, -0.006), (5, -0.017), (6, 0.005), (7, 0.003), (8, -0.152), (9, 0.042), (10, 0.103), (11, -0.001), (12, 0.054), (13, 0.039), (14, -0.111), (15, -0.152), (16, 0.174), (17, -0.015), (18, -0.032), (19, 0.074), (20, 0.238), (21, 0.123), (22, 0.059), (23, -0.087), (24, 0.006), (25, 0.078), (26, 0.188), (27, -0.047), (28, 0.044), (29, -0.039), (30, -0.026), (31, -0.009), (32, 0.044), (33, -0.036), (34, 0.032), (35, -0.152), (36, -0.147), (37, 0.072), (38, 0.013), (39, 0.014), (40, 0.093), (41, 0.047), (42, -0.034), (43, -0.007), (44, 0.057), (45, 0.094), (46, 0.141), (47, -0.088), (48, -0.015), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.95518655 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
2 0.84946018 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
3 0.64210284 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
Author: Moulines Eric, Francis R. Bach, Zaïd Harchaoui
Abstract: We propose to investigate test statistics for testing homogeneity based on kernel Fisher discriminant analysis. Asymptotic null distributions under null hypothesis are derived, and consistency against fixed alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial and real datasets is provided. 1
4 0.57543886 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. It proves novel stability-based generalization bounds that hold even with this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.
5 0.54304004 49 nips-2007-Colored Maximum Variance Unfolding
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distancepreserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information. 1
6 0.42260554 101 nips-2007-How SVMs can estimate quantiles and the median
7 0.41712016 118 nips-2007-Learning with Transformation Invariant Kernels
8 0.35031545 109 nips-2007-Kernels on Attributed Pointsets with Applications
9 0.29583722 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
10 0.28796414 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
11 0.27222162 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
12 0.26611817 99 nips-2007-Hierarchical Penalization
13 0.24342415 15 nips-2007-A general agnostic active learning algorithm
14 0.23559482 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
15 0.23406591 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
16 0.23237012 160 nips-2007-Random Features for Large-Scale Kernel Machines
17 0.22338587 46 nips-2007-Cluster Stability for Finite Samples
18 0.21865633 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
19 0.21209037 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
20 0.21100003 185 nips-2007-Stable Dual Dynamic Programming
topicId topicWeight
[(5, 0.054), (13, 0.03), (16, 0.03), (19, 0.015), (21, 0.046), (31, 0.012), (34, 0.015), (35, 0.02), (46, 0.369), (47, 0.038), (49, 0.068), (55, 0.016), (83, 0.095), (85, 0.029), (90, 0.072)]
simIndex simValue paperId paperTitle
1 0.75189376 81 nips-2007-Estimating disparity with confidence from energy neurons
Author: Eric K. Tsang, Bertram E. Shi
Abstract: The peak location in a population of phase-tuned neurons has been shown to be a more reliable estimator for disparity than the peak location in a population of position-tuned neurons. Unfortunately, the disparity range covered by a phasetuned population is limited by phase wraparound. Thus, a single population cannot cover the large range of disparities encountered in natural scenes unless the scale of the receptive fields is chosen to be very large, which results in very low resolution depth estimates. Here we describe a biologically plausible measure of the confidence that the stimulus disparity is inside the range covered by a population of phase-tuned neurons. Based upon this confidence measure, we propose an algorithm for disparity estimation that uses many populations of high-resolution phase-tuned neurons that are biased to different disparity ranges via position shifts between the left and right eye receptive fields. The population with the highest confidence is used to estimate the stimulus disparity. We show that this algorithm outperforms a previously proposed coarse-to-fine algorithm for disparity estimation, which uses disparity estimates from coarse scales to select the populations used at finer scales and can effectively detect occlusions.
same-paper 2 0.73306024 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
3 0.70958984 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full informa√ tion case, the upper bound on the regret is O∗ ( nT ), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm √ which achieves O∗ (n3/2 T ) regret — all previous (nontrivial) bounds here were O(poly(n)T 2/3 ) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is √ √ exponential ( T K√ vs. T log K). We also present lower bounds showing that this gap is at least n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
4 0.50892836 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
5 0.46456295 189 nips-2007-Supervised Topic Models
Author: Jon D. Mcauliffe, David M. Blei
Abstract: We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression. 1
6 0.41520101 7 nips-2007-A Kernel Statistical Test of Independence
7 0.37123027 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
8 0.3707816 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
9 0.36904669 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
10 0.36695266 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
11 0.36407498 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
12 0.36396843 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
13 0.3634007 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
14 0.35922351 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
15 0.3584885 70 nips-2007-Discriminative K-means for Clustering
16 0.35730222 185 nips-2007-Stable Dual Dynamic Programming
17 0.35709575 156 nips-2007-Predictive Matrix-Variate t Models
18 0.35630113 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
19 0.35602146 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
20 0.35438111 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs