jmlr jmlr2005 jmlr2005-41 knowledge-graph by maker-knowledge-mining

41 jmlr-2005-Kernel Methods for Measuring Independence


Source: pdf

Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf

Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis. © 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 MPI for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany. Editor: Aapo Hyvärinen. Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. [sent-14, score-0.376]

2 We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. [sent-17, score-0.483]

3 The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. [sent-19, score-0.277]

4 Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis. © 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf. [sent-20, score-0.241]

5 For instance, one well known measure of statistical dependence between two random variables is the mutual information (Cover and Thomas, 1991), which for random vectors x, y is zero if and only if the random vectors are independent. [sent-23, score-0.216]

6 This was first accomplished by Bach and Jordan (2002a), who introduced kernel dependence functionals that significantly outperformed alternative approaches, including for source distributions that are difficult for standard ICA methods to deal with. [sent-26, score-0.225]

7 The first, which we call the constrained covariance (COCO), is simply the spectral norm of the covariance operator between reproducing kernel Hilbert spaces. [sent-28, score-0.229]
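
To make this concrete, here is a minimal sketch of an empirical constrained covariance computed from centred Gram matrices with a Gaussian kernel. The estimator form used here, (1/m)·sqrt of the largest eigenvalue of the product of the centred Gram matrices, is our own working of the standard derivation rather than a quotation from the paper, and the function names and kernel width are arbitrary choices.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def empirical_coco(x, y, sigma=1.0):
    # Largest empirical "kernel covariance" between unit balls of two RKHSs:
    # (1/m) * sqrt(lambda_max(K~ L~)), with K~ = HKH and L~ = HLH centred Grams.
    m = x.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kc = H @ gaussian_gram(x, sigma) @ H
    Lc = H @ gaussian_gram(y, sigma) @ H
    lam_max = np.max(np.real(np.linalg.eigvals(Kc @ Lc)))
    return np.sqrt(max(lam_max, 0.0)) / m

rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=(500, 1))
print(empirical_coco(s, rng.uniform(-1, 1, size=(500, 1))))  # small: independent
print(empirical_coco(s, s ** 2))                             # larger: dependent
```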

8 The second functional, called the kernel mutual information (KMI), is a more sophisticated measure of dependence, being a function of the entire spectrum of the covariance operator. [sent-30, score-0.272]

9 We show that the KMI is an upper bound near independence on a Parzen window estimate of the mutual information, which becomes tight (i. [sent-31, score-0.263]

10 Indeed, Rényi (1959) suggested using the functional covariance or correlation to measure the dependence of random variables (implementation details depend on the nature of the function spaces chosen: the use of RKHSs is a more recent innovation). [sent-37, score-0.166]

11 , 1993), which is a regularised estimate of the spectral norm of the correlation operator between reproducing kernel Hilbert spaces. [sent-39, score-0.166]

12 It follows from the properties of COCO that the KCC is zero at independence for universal kernels, since the correlation differs from the covariance only in its normalisation: at independence, where both the KCC and COCO are zero, this normalisation is immaterial. [sent-40, score-0.209]

13 Another kernel method for dependence measurement, the kernel generalised variance (KGV) (Bach and Jordan, 2002a), extends the KCC by incorporating the entire spectrum of its associated [sent-42, score-0.296]

14 sources, in the absence of information about the source distributions beyond their mutual independence (Hyvärinen et al. [sent-46, score-0.244]

15 Columns show whether the functional is covariance or correlation based, and rows indicate whether the dependence measure is the maximum singular value of the covariance/correlation operator, or a bound on the mutual information. [sent-57, score-0.345]

16 A relation between the KGV and the mutual information is also proposed by Bach and Jordan (2002a), who rely on a limiting argument in which the RKHS kernel size approaches zero (no Parzen window estimate is invoked): our discussion of this proof is given in Appendix B. [sent-60, score-0.261]

17 The COCO and KCC dependence functionals for the 2-variable case are described in Section 2, and it is shown that these measure independence when the associated kernels are universal. [sent-66, score-0.23]

18 In particular, the kernel mutual information is introduced in Definition 14, its use as an independence measure is justified by Theorem 15, and its relation to the mutual information is provided in Theorem 16. [sent-69, score-0.442]

19 This is similar to the kernel canonical correlation (KCC) introduced by Bach and Jordan (2002a), although we shall see that the latter is more restrictive in its choice of f , g. [sent-85, score-0.168]

20 3 that this quantity is a measure of independence when computed in universal RKHSs (it follows that the KCC also requires a universal RKHS, as do all independence criteria that are based on the covariance in RKHSs). [sent-95, score-0.274]

21 It turns out that unit-radius balls in universal reproducing kernel Hilbert spaces constitute function classes that yield non-trivial dependence estimates. [sent-180, score-0.177]

22 Theorem 6 (COCO(Px,y ; F, G) is only zero at independence for universal kernels) Denote by F and G RKHSs with universal kernels on the compact metric spaces X and Y , respectively, and let F, G be the unit balls in F and G . [sent-184, score-0.173]

23 Further discussion and applications of the kernel canonical correlation include Akaho (2001); Bach and Jordan (2002a); Hardoon et al. [sent-215, score-0.168]

24 In particular, a much more extensive discussion of the properties of canonical correlation analysis and its kernelisation may be found in these studies, and this section simply summarises the properties and derivations relevant to our requirements for independence measurement. [sent-219, score-0.17]

25 Definition 7 (Kernel canonical correlation (KCC)) The kernel canonical correlation is defined as $\mathrm{KCC}(P_{x,y}; \mathcal{F}, \mathcal{G}) = \sup_{f\in\mathcal{F},\, g\in\mathcal{G}} \mathrm{corr}(f(x), g(y)) = \sup_{f\in\mathcal{F},\, g\in\mathcal{G}} \dfrac{E_{x,y}(f(x)g(y)) - E_x(f(x))\,E_y(g(y))}{\sqrt{E_x(f^2(x)) - E_x^2(f(x))}\,\sqrt{E_y(g^2(y)) - E_y^2(g(y))}}$. [sent-222, score-0.264]
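
A rough numerical sketch of a regularised empirical KCC built from Gram matrices follows. The particular regularisation used here (adding κm·I to the centred Gram matrices before squaring them) is an assumption made for illustration and may differ in detail from the estimator of Bach and Jordan (2002a); the function name and default κ are ours.

```python
import numpy as np
from scipy.linalg import eigh

def regularised_kcc(K, L, kappa=1e-2):
    # First kernel canonical correlation from m x m Gram matrices K, L,
    # as the largest generalised eigenvalue of a block problem.
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kc, Lc = H @ K @ H, H @ L @ H
    Kr = Kc + kappa * m * np.eye(m)      # regularised centred Gram (assumption)
    Lr = Lc + kappa * m * np.eye(m)
    A = np.block([[np.zeros((m, m)), Kc @ Lc],
                  [Lc @ Kc, np.zeros((m, m))]])
    B = np.block([[Kr @ Kr, np.zeros((m, m))],
                  [np.zeros((m, m)), Lr @ Lr]])
    return eigh(A, B, eigvals_only=True)[-1]   # largest generalised eigenvalue
```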

26 A problem with using the kernel canonical correlation to measure independence is discussed in various forms by Bach and Jordan (2002a); Fukumizu et al. [sent-226, score-0.242]

27 Finally, we show that the regularised kernel canonical correlation is a measure of independence, as long as the functions attaining the supremum have bounded variance. [sent-242, score-0.193]

28 Theorem 10 (KCC (Px,y ; F , G , κ) = 0 only at independence for universal kernels) Denote by F and G RKHSs with universal kernels on the compact metric spaces X and Y , respectively, and assume that var ( f (x)) < ∞ and var (g (y)) < ∞. [sent-243, score-0.223]

29 We present the kernel mutual information (KMI) in Definition 14, and prove it to be zero if and only if the empirical COCO is zero (Theorem 15), which justifies using the KMI as a measure of independence. [sent-255, score-0.22]

30 We then show the KMI upper bounds a Parzen window estimate of the mutual information near independence (Theorem 16). [sent-256, score-0.263]

31 That said, the KGV appears to be a regularised empirical estimate of the mutual information for Gaussian processes of Baker 8. [sent-264, score-0.173]

32 Introduced in the discrete approximation to the mutual information (1970), although to our knowledge the convergence of the KGV to this population quantity is not yet established. [sent-265, score-0.176]

33 1 The KMI, the KGV, and the Mutual Information Three intermediate steps are required to obtain the KMI from the mutual information: an approximation to the MI which is accurate near independence, a Parzen window estimate of this approximation, and finally a bound on the empirical estimate. [sent-270, score-0.189]

34 3 that the discrete mutual information may be approximated by the Gaussian mutual information (GMI), by doing a Taylor expansion of both quantities to second order around independence. [sent-279, score-0.296]
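
To spell out the Taylor-expansion step, here is the standard second-order expansion of the discrete mutual information around independence (a worked statement of a textbook identity, not a quotation from the paper):

```latex
% Write P_{\hat{x}\hat{y}}(i,j) = P_{\hat{x}}(i)P_{\hat{y}}(j)\,(1+\varepsilon_{ij}),
% so that every \varepsilon_{ij} vanishes at independence.  Then
% I(\hat{x};\hat{y}) = \sum_{i,j} P_{\hat{x}}(i)P_{\hat{y}}(j)(1+\varepsilon_{ij})\log(1+\varepsilon_{ij});
% the first-order term drops out because
% \sum_{i,j}\bigl(P_{\hat{x}\hat{y}}(i,j)-P_{\hat{x}}(i)P_{\hat{y}}(j)\bigr)=0, leaving
\begin{equation*}
I(\hat{x};\hat{y}) \;\approx\; \frac{1}{2}\sum_{i,j} P_{\hat{x}}(i)P_{\hat{y}}(j)\,\varepsilon_{ij}^{2}
 \;=\; \frac{1}{2}\sum_{i,j}
 \frac{\bigl(P_{\hat{x}\hat{y}}(i,j)-P_{\hat{x}}(i)P_{\hat{y}}(j)\bigr)^{2}}{P_{\hat{x}}(i)P_{\hat{y}}(j)} .
\end{equation*}
```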

35 5, we give an upper bound on the empirical GMI, which constitutes the kernel mutual information. [sent-286, score-0.22]

36 6 that the regularised kernel generalised variance (KGV) proposed by Bach and Jordan (2002a) is an upper bound on the KMI, and hence on the Gaussian mutual information, under certain circumstances. [sent-289, score-0.329]

37 1 Mutual Information Between Two Multivariate Gaussian Random Variables. We begin by introducing the Gaussian mutual information and its relation with the canonical correlation. [sent-294, score-0.198]

38 If $x_G$, $y_G$ are Gaussian random vectors in $\mathbb{R}^{l_x}$, $\mathbb{R}^{l_y}$ respectively, with joint covariance matrix $C := \begin{pmatrix} C_{xx} & C_{xy} \\ C_{xy}^\top & C_{yy} \end{pmatrix}$, then the mutual information between them can be written $I(x_G; y_G) = -\frac{1}{2}\log\frac{|C|}{|C_{xx}|\,|C_{yy}|}$, (7) where $|\cdot|$ is the determinant. [sent-297, score-0.222]
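
As a quick numerical check of (7), here is a small sketch (the function and variable names are ours):

```python
import numpy as np

def gaussian_mi(C, lx):
    # Mutual information (7) for jointly Gaussian x_G (dim lx) and y_G,
    # given their joint covariance matrix C.
    Cxx, Cyy = C[:lx, :lx], C[lx:, lx:]
    logdet = lambda A: np.linalg.slogdet(A)[1]
    return -0.5 * (logdet(C) - logdet(Cxx) - logdet(Cyy))

# 1-D sanity check: correlation r gives I = -0.5 * log(1 - r^2)
r = 0.8
C = np.array([[1.0, r], [r, 1.0]])
print(gaussian_mi(C, 1), -0.5 * np.log(1 - r ** 2))  # both about 0.511
```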

39 Finally, we denote as $p_x$ the vector for which $(p_x)_i = P_{\hat{x}}(i)$, with a similar $p_y$ definition. [sent-330, score-0.382]

40 The mutual information between $\hat{x}$ and $\hat{y}$ is defined as $I(\hat{x};\hat{y}) = \sum_{i=1}^{l_x}\sum_{j=1}^{l_y} P_{\hat{x},\hat{y}}(i,j)\,\log\frac{P_{\hat{x},\hat{y}}(i,j)}{P_{\hat{x}}(i)\,P_{\hat{y}}(j)}$. [sent-331, score-0.273]
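
For reference, this discrete mutual information can be computed directly from a joint probability table (a small illustrative sketch; the example tables are arbitrary):

```python
import numpy as np

def discrete_mi(Pxy):
    # I(x;y) = sum_ij Pxy(i,j) * log( Pxy(i,j) / (Px(i) Py(j)) )
    px = Pxy.sum(axis=1, keepdims=True)
    py = Pxy.sum(axis=0, keepdims=True)
    mask = Pxy > 0                        # treat 0 * log 0 as 0
    return np.sum(Pxy[mask] * np.log(Pxy[mask] / (px @ py)[mask]))

print(discrete_mi(np.array([[0.3, 0.1], [0.1, 0.5]])))      # > 0: dependent
print(discrete_mi(np.outer([0.4, 0.6], [0.5, 0.5])))        # ~ 0: independent
```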

41 0 otherwise. In summary, $E_{x,y}(\check{x}\check{y}^\top) = P_{xy}$ (12), $E_x(\check{x}) = p_x$ (13), $E_x(\check{x}\check{x}^\top) = D_x$ (14), where $D_x = \mathrm{diag}(p_x)$. [sent-346, score-0.206]

42 Using these results, it is possible to define the covariances $C_{xy} = E_{x,y}(\check{x}\check{y}^\top) - E_x(\check{x})E_y(\check{y})^\top = P_{xy} - p_x p_y^\top$ (15), $C_{xx} = E_x(\check{x}\check{x}^\top) - E_x(\check{x})E_x(\check{x})^\top = D_x - p_x p_x^\top$ (16), and $C_{yy} = E_y(\check{y}\check{y}^\top) - E_y(\check{y})E_y(\check{y})^\top = D_y - p_y p_y^\top$ (17). [sent-347, score-1.146]

43 (17) We may therefore define Gaussian random variables xG , yG with the same covariance structure as ˇ ˇ x, y, and with mutual information given by (7). [sent-348, score-0.2]

44 3 that the mutual information for this Gaussian case is $I(x_G; y_G) = -\frac{1}{2}\log\left|I_{l_y} - \left(P_{xy} - p_x p_y^\top\right)^\top D_x^{-1}\left(P_{xy} - p_x p_y^\top\right) D_y^{-1}\right|$, (18) which can also be expressed in the singular value form (9). [sent-350, score-0.943]
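
Using the reconstruction of (18) above, one can check numerically that near independence it tracks the exact discrete mutual information (a sketch; the weakly dependent example table is arbitrary):

```python
import numpy as np

def gmi_from_joint(Pxy):
    # Gaussian mutual information (18) computed from a discrete joint table.
    px, py = Pxy.sum(axis=1), Pxy.sum(axis=0)
    A = Pxy - np.outer(px, py)
    M = A.T @ np.diag(1.0 / px) @ A @ np.diag(1.0 / py)
    return -0.5 * np.linalg.slogdet(np.eye(len(py)) - M)[1]

# joint table with a small dependence (the perturbation has zero row/column sums)
Pxy = np.outer([0.4, 0.6], [0.3, 0.7]) + 0.01 * np.array([[1, -1], [-1, 1]])
px, py = Pxy.sum(axis=1), Pxy.sum(axis=0)
exact_mi = np.sum(Pxy * np.log(Pxy / np.outer(px, py)))
print(gmi_from_joint(Pxy), exact_mi)    # both about 1e-3, and nearly equal
```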

45 4 Kernel Density Estimates of the Gaussian Mutual Information. In this section, we describe a kernel density estimate of the approximate mutual information in (18): this is the point at which our reasoning diverges from the approach of Bach and Jordan (2002a). [sent-358, score-0.22]

46 according to some unknown distribution with density $p_x$, the associated Parzen window estimate of this density is written $\hat{p}_x(x) = \frac{1}{m}\sum_{l=1}^{m}\kappa(x_l - x)$. [sent-364, score-0.453]
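
A minimal sketch of such a Parzen window estimate with a Gaussian window follows (the window width and grid are arbitrary choices made here for illustration):

```python
import numpy as np

def parzen_estimate(samples, x, sigma=0.25):
    # p_hat(x) = (1/m) * sum_l kappa(x_l - x), with a Gaussian window of width sigma
    u = (samples[:, None] - x[None, :]) / sigma
    return np.mean(np.exp(-0.5 * u ** 2), axis=0) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
samples = rng.normal(size=2000)
grid = np.linspace(-3, 3, 7)
print(np.round(parzen_estimate(samples, grid), 3))
print(np.round(np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi), 3))  # true density
```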

47 (21) Of the four matrices in this definition, $D_l^{(x)}$ is a diagonal matrix of unnormalised Parzen window estimates of $p_x$ at the grid points, with $i$th diagonal entry $\sum_{l=1}^{m}\kappa(q_i - x_l)$. [sent-376, score-0.336]

48 $D_l^{(y)}$ is the equivalent diagonal matrix for $p_y$, and $K_l$ is the matrix with $(i,j)$th entry $\kappa(q_i - x_j)$. [sent-394, score-0.198]

49 The main disadvantage in using this approximation to the mutual information is that it is exceedingly computationally inefficient, in that it requires a kernel density estimate at each point in a fine grid. [sent-445, score-0.22]

50 5 The KMI: An Upper Bound on the Mutual Information. We now define the kernel mutual information, and show it is both a valid dependence criterion (Theorem 15), and an upper bound on the Parzen GMI in Lemma 13 (Theorem 16). [sent-449, score-0.288]

51 3 definition of Ki (x) and K j (y), we use the notation κ(x) and κ(y) to denote the Parzen windows for the estimates px (x) and py (y), respectively, even though these may not be identical kernel functions. [sent-460, score-0.454]

52 That said, we approximate the mutual information, rather than the entropy; in addition, the KMI is computed in the limit of infinitely small grid size, which removes the need for binning. [sent-467, score-0.181]

53 $\ldots,m\}$ and $\sum_{i=1}^{m}\kappa(y_i - y_j)$ (the expressions above are alternative, unnormalised estimates of $\min_{x\in X} p_x(x)$ and $\min_{y\in Y} p_y(y)$, respectively; the right hand expressions are used so as to obtain the KMI entirely in terms of the sample $z$). [sent-480, score-0.382]

54 In particular, the approximate nature of the inequality (25) arises from our use of empirical estimates for lower bounds on px (x) and py (y) (see the proof for details). [sent-484, score-0.382]

55 6 The KGV: An Alternative Upper Bound on the Mutual Information. Bach and Jordan (2002a) propose two related quantities as independence functionals: the kernel canonical correlation (KCC), as discussed in Section 2. [sent-487, score-0.242]

56 2 Multivariate COCO and KMI We now describe how our dependence functionals may be generalised to more than two random variables. [sent-505, score-0.215]

57 It is instructive to compare with the KCC-based dependence functional for more than two variables, which uses the smallest eigenvalue of a matrix of correlations (with diagonal terms equal to one, rather than zero), where this correlation matrix has only positive eigenvalues. [sent-598, score-0.197]

58 We next introduce a generalisation of the kernel mutual information to more than two variables. [sent-599, score-0.22]

59 Definition 21 (Multivariate KMI) The kernel mutual information for more than two random variables is defined as in (30), $\mathrm{KMI}(z; \mathcal{F}_{X_1}, \ldots$ [sent-601, score-0.246]

60 1 Efficient Computation of Kernel Dependence Functionals We note that COCO requires us to determine the eigenvalue of maximum magnitude for an mn × mn matrix (see (29)), and the KMI is a determinant of an mn × mn matrix, as specified in (30). [sent-641, score-0.165]
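
The cost of these $mn \times mn$ eigenvalue and determinant computations is usually tamed with low-rank approximations of the Gram matrices, in the spirit of the incomplete Cholesky factorisation used by Bach and Jordan (2002a). The pivoted incomplete Cholesky below is a generic sketch of that idea, not the paper's implementation; the tolerance, function name and demo data are our own choices.

```python
import numpy as np

def incomplete_cholesky(K, tol=1e-6, max_rank=None):
    # Pivoted incomplete Cholesky of a PSD matrix K: returns G with K ~= G @ G.T.
    m = K.shape[0]
    max_rank = max_rank or m
    d = np.array(np.diag(K), dtype=float)   # residual diagonal (working copy)
    G = np.zeros((m, max_rank))
    for j in range(max_rank):
        i = int(np.argmax(d))               # pivot: largest residual diagonal
        if d[i] <= tol:
            return G[:, :j]
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
K = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
G = incomplete_cholesky(K, tol=1e-4)
print(G.shape, np.linalg.norm(K - G @ G.T))   # rank well below 300, small error
```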

61 The reverse does not hold, however: pairwise independence does not guarantee mutual independence. [sent-767, score-0.222]

62 As pointed out by Bach and Jordan (2002a), kernel dependence functionals are well suited to this problem, since they also apply straightforwardly to multivariate random variables: it suffices to define appropriate Gram matrices. [sent-793, score-0.203]

63 3 Gradient Descent on the Stiefel Manifold We now describe the method used to minimise our kernel dependence functionals over possible choices of the orthogonal demixing matrix W. [sent-801, score-0.256]
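
One generic way to carry out such a constrained minimisation is to move along geodesics of the orthogonal group. The step below is a standard sketch of that idea and is not claimed to be the paper's exact update rule; the step size, the toy objective and the absence of any line search are all placeholder choices.

```python
import numpy as np
from scipy.linalg import expm

def orthogonal_gradient_step(W, grad, eta=0.1):
    # Project the Euclidean gradient onto the tangent space at W via the
    # skew-symmetric matrix A = grad W^T - W grad^T, then move along the
    # geodesic exp(-eta A) W; W stays orthogonal up to round-off.
    A = grad @ W.T - W @ grad.T
    return expm(-eta * A) @ W

# toy example: descend on J(W) = 0.5 * ||W - I||_F^2, whose gradient is W - I
rng = np.random.default_rng(1)
W = np.linalg.qr(rng.normal(size=(3, 3)))[0]
print("initial J:", 0.5 * np.sum((W - np.eye(3)) ** 2))
for _ in range(300):
    W = orthogonal_gradient_step(W, W - np.eye(3), eta=0.1)
print("final   J:", 0.5 * np.sum((W - np.eye(3)) ** 2))      # smaller than initial
print(np.allclose(W @ W.T, np.eye(3)))                        # True: still orthogonal
```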

64 We begin with a brief investigation into the form taken by the various kernel dependence functionals for a selection of the data in Table 3. [sent-884, score-0.203]

65 We used a Gaussian kernel with σ = 1 and precision ε = 2 × 10−5 for all kernel dependence functionals, and κ = 2 × 10−2 for the KCC and KGV. [sent-1226, score-0.212]

66 The first quantity is analogous to the kernel canonical correlation (KCC), which is the spectral norm of the correlation operator; the second is analogous to the kernel generalised variance (KGV), which is a function of the empirical correlation operator spectrum (see Table 1 in the introduction). [sent-1245, score-0.439]

67 Second, we link the KMI and the KGV with the mutual information, proving the KMI is an upper bound near independence on the Parzen window estimate of the mutual information, and the KGV is a looser upper bound under certain conditions. [sent-1248, score-0.411]

68 Since independence of two random variables implies that the entire spectrum of the associated covariance (or correlation) operator is zero, it comes as no surprise that measures using the whole spectrum are more robust than those using only the largest singular value. [sent-1254, score-0.18]

69 The absence of a separate regularisation parameter in our kernel functionals therefore greatly simplifies model selection, especially if the observations are known to be corrupted by outliers. [sent-1261, score-0.16]

70 the link between the Gaussian mutual information and the discrete mutual information, described in Section 3. [sent-1272, score-0.296]

71 Many real life problems do not fit neatly into the linear ICA framework: we now outline ways in which our kernel dependence functionals might be used to improve performance in these more difficult signal separation problems. [sent-1284, score-0.225]

72 In both cases, however, the time dependence greatly assists in separating signals into independent components, the idea being that the independence of different random processes should hold not only between samples drawn at the same time, but also between samples drawn at different times. [sent-1289, score-0.185]

73 This generalisation has been investigated, using the mutual information as a dependence measure, by Stögbauer et al. [sent-1295, score-0.216]

74 Fitting these models requires in particular that the mutual information between pairs of random variables be maximised: thus, Bach and Jordan compare the KGV to a Parzen window estimate of the mutual information in this context. [sent-1315, score-0.337]

75 3, which states that $I(x_G; y_G) = -\frac{1}{2}\log\left|I_{l_y} - \left(P_{xy} - p_x p_y^\top\right)^\top D_x^{-1}\left(P_{xy} - p_x p_y^\top\right) D_y^{-1}\right|$. [sent-1352, score-0.764]

76 Note that both $D_x - p_x p_x^\top$ and $D_y - p_y p_y^\top$ have rank $l_x - 1$ and $l_y - 1$ respectively, and are not invertible. [sent-1355, score-0.889]

77 To see this, we make the expansions $D_x - p_x p_x^\top = D_x\left(I_{l_x} - \mathbf{1}_{l_x} p_x^\top\right) = D_x E_x$ and $D_y - p_y p_y^\top = D_y\left(I_{l_y} - \mathbf{1}_{l_y} p_y^\top\right) = D_y E_y$. [sent-1356, score-1.271]

78 where $E_x := I_{l_x} - \mathbf{1}_{l_x} p_x^\top$ and $E_y := I_{l_y} - \mathbf{1}_{l_y} p_y^\top$ have zero eigenvalues corresponding to the eigenvectors $\frac{1}{\sqrt{l_x}}\mathbf{1}_{l_x}$ and $\frac{1}{\sqrt{l_y}}\mathbf{1}_{l_y}$, respectively. [sent-1360, score-0.507]

79 In addition, we note that $\left(P_{xy} - p_x p_y^\top\right)E_y = \left(P_{xy} - p_x p_y^\top\right)\left(I_{l_y} - \mathbf{1}_{l_y} p_y^\top\right) = P_{xy} - p_x p_y^\top - P_{xy}\mathbf{1}_{l_y} p_y^\top + p_x p_y^\top \mathbf{1}_{l_y} p_y^\top = P_{xy} - p_x p_y^\top - p_x p_y^\top + p_x p_y^\top = P_{xy} - p_x p_y^\top$, with an analogous result for $\left(P_{xy} - p_x p_y^\top\right)^\top E_x$. [sent-1361, score-4.632]

80 We may therefore write (42) as $\begin{pmatrix} 0 & \left(P_{xy} - p_x p_y^\top\right)E_y \\ \left(P_{xy} - p_x p_y^\top\right)^\top E_x & 0 \end{pmatrix}\begin{pmatrix} c_i \\ d_i \end{pmatrix} = \rho_i \begin{pmatrix} D_x E_x & 0 \\ 0 & D_y E_y \end{pmatrix}\begin{pmatrix} c_i \\ d_i \end{pmatrix}$, from which we obtain a generalised eigenvalue problem with identical eigenvalues $\rho_i$, $\begin{pmatrix} 0 & P_{xy} - p_x p_y^\top \\ \left(P_{xy} - p_x p_y^\top\right)^\top & 0 \end{pmatrix}\begin{pmatrix} e_i \\ f_i \end{pmatrix} = \rho_i \begin{pmatrix} D_x & 0 \\ 0 & D_y \end{pmatrix}\begin{pmatrix} e_i \\ f_i \end{pmatrix}$. [sent-1362, score-1.415]

81 4 Details of Definition 13 In this section, we derive the Parzen window estimate of the Gaussian mutual information provided in Definition 13. [sent-1365, score-0.189]

82 We therefore define the vectors $\hat{p}_x$, $\hat{p}_y$, and the matrix $\hat{P}_{xy}$, using the expectations in (12)-(14) computed with these kernel expressions: $E_{x,y}(\check{x}\check{y}^\top) = \hat{P}_{xy}$ (43), $E_x(\check{x}) = \hat{p}_x$ (44), $E_x(\check{x}\check{x}^\top) = \hat{D}_x$ (45). [sent-1368, score-0.682]

83 κ (qlx − xm ) where we write the above in such a manner as to indicate lx results to re-write (43)-(45) as respectively m and ly and ∆x ∆y m Kl Ll − Dx ≈ ˆ ˆ Pxy − px py ≈ m. [sent-1419, score-0.507]

84 $\sum_{l=1}^{m}\kappa(r_{l_y} - y_l)$ and (47) (48). With these substitutions, we can rewrite $\hat{D}_x^{-1/2}\left(\hat{P}_{xy} - \hat{p}_x\hat{p}_y^\top\right)\hat{D}_y^{-1/2} \approx \left(D_l^{(x)}\right)^{-1/2}\cdots$, which results in our definition. [sent-1451, score-0.382]

85 According to (22), $D_l^{(x)}$ is a diagonal matrix with $j$th entry $\frac{1}{\Delta_x}\sum_{i=1}^{m}\kappa(x_i - q_j)$, which is an unnormalised Parzen window estimate of $p_x$ at grid point $q_j$ (an analogous result holds for $D_l^{(y)}$). [sent-1466, score-0.302]

86 $\ldots,m\}$ $\sum_{i=1}^{m}\kappa(y_i - y_j)$, which are respectively the smallest (unnormalised) Parzen window estimates of $p_x$ and $p_y$ at any sample point: these approach the smallest values of $p_x$ on $X$, and of $p_y$ on $Y$, as the sample size increases (the densities are bounded away from zero by assumption). [sent-1488, score-0.805]

87 $= \nu_z^{-1}\sum_{p=1}^{l_x}\kappa(x_i - q_p)\,\kappa(x_j - q_p) \approx \nu_z^{-1}\int_X \kappa(x_i - q)\,\kappa(x_j - q)\,dq = \nu_z^{-1}\,k(x_i, x_j)$, where we recover our RKHS kernel as the convolution of the kernel density functions at each pair of data points. [sent-1492, score-0.198]
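
For Gaussian windows this convolution has a closed form that is easy to verify numerically: convolving two Parzen windows of width σ yields a Gaussian RKHS kernel of width σ√2. The sketch below uses arbitrary numbers and a simple Riemann sum; none of the specific values come from the paper.

```python
import numpy as np

sigma, xi, xj = 0.5, 0.3, 1.1
kappa = lambda u: np.exp(-u ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# numerical convolution of the two windows over a fine grid
q = np.linspace(-10.0, 10.0, 20001)
numeric = np.sum(kappa(xi - q) * kappa(xj - q)) * (q[1] - q[0])

# closed form: Gaussian kernel of width sigma * sqrt(2), i.e. variance 2 sigma^2
closed = np.exp(-(xi - xj) ** 2 / (4 * sigma ** 2)) / (2 * sigma * np.sqrt(np.pi))
print(numeric, closed)   # agree to several decimal places
```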

88 $\ldots, l_y\}$ using the probability mass in the resulting bins (this multinomial distribution is described by the matrix $\hat{P}_{xy}$ of joint probabilities, with marginal distribution vectors $p_x$ and $p_y$). [sent-1568, score-0.475]

89 3, we approximate $I(\hat{x};\hat{y})$ using the Gaussian mutual information $I(x_G; y_G)$ between vectors $x_G, y_G$, defined to have the same covariance as $\check{x}$ and $\check{y}$, where $\hat{x} = i$ is equivalent to $(\check{x})_i = 1$ and $(\check{x})_j = 0$ for $j \neq i$ (likewise for $\check{y}$). [sent-1572, score-0.2]

90 Rather than replacing $\hat{x}$ and $\hat{y}$ by $x_G$ and $y_G$, we may instead replace them with the smoothed approximations $\mathsf{k}_l = \Delta_x\,(k(x, q_1), \ldots, k(x, q_{l_x}))^\top$ and $\mathsf{l}_l = \Delta_y\,(l(y, r_1), \ldots$ [sent-1586, score-0.209]

91 We can of course specify the Gaussian mutual information $I(\mathsf{k}_l; \mathsf{l}_l)$ between these smoothed vectors, using the appropriate log ratio of determinants. [sent-1598, score-0.23]

92 First, does this smoothed approximation I(kl ; ll ) approach the Gaussian mutual information I (xG ; yG ) as the kernel size drops? [sent-1600, score-0.302]

93 px ≈ ∆x Ex (kl ) , (54) under appropriate conditions, with similar results for the terms in y. [sent-1606, score-0.206]

94 $\sigma_x \to 0$ (55). To compute the covariance structure of the vectors in (53), we require expressions for the expectations $E_{x,y}(\mathsf{k}_l\mathsf{l}_l^\top)$, $E_x(\mathsf{k}_l)$, $E_x(\mathsf{k}_l\mathsf{k}_l^\top)$, $E_y(\mathsf{l}_l\mathsf{l}_l^\top)$, and $E_y(\mathsf{l}_l)$. [sent-1608, score-0.565]

95 The expectation of individual entries in the matrix $\mathsf{k}_l\mathsf{l}_l^\top$ is $E_{x,y}[k(q_i, x)\,l(r_j, y)] = \int_X\int_Y k(x - q_i)\,l(y - r_j)\,p_{x,y}(x, y)\,dx\,dy = \left[k(x)\,l(y) \ast p_{x,y}(x, y)\right](q_i, r_j)$. [sent-1609, score-0.259]

96 We use a sans-serif font to define kl and ll , to indicate that these are random vectors. [sent-1610, score-0.171]

97 Similarly, $E_x[k(q_i, x)\,k(q_j, x)] = \int_X k(x - q_i)\,k(x - q_j)\,p_x(x)\,dx \approx \begin{cases}\left[k^2(x) \ast p_x(x)\right](q_i) & i = j \\ 0 & \text{otherwise}\end{cases}$, where the above assumes $\sigma_x/\Delta_x \ll 1$. [sent-1614, score-0.272]

98 Note, however, that $k^2(x - q_i) = \frac{1}{2\pi\sigma_x^2}\exp\left(-\frac{(x - q_i)^2}{\sigma_x^2}\right) = \frac{1}{2\sigma_x\sqrt{\pi}} \times \frac{1}{\sqrt{\pi\sigma_x^2}}\exp\left(-\frac{(x - q_i)^2}{\sigma_x^2}\right)$ (56), and thus $k^2(x)$ is not a probability density (the integral over $\mathbb{R}$ is equal to $\frac{1}{2\sigma_x\sqrt{\pi}}$). [sent-1615, score-0.198]

99 Finally, $E_x[k(q_i, x)] = \int_{\mathbb{R}} k(x - q_i)\,p_x(x)\,dx = \left[k(x) \ast p_x(x)\right](q_i)$ (57). [sent-1616, score-0.272]

100 2), an empirical estimate of $I(\mathsf{k}_l; \mathsf{l}_l)$ is obtained via the usual expression (9), where $\rho_i$ are the solutions to the generalised eigenvalue problem $\begin{pmatrix} 0 & K_l H (L_l)^\top \\ L_l H (K_l)^\top & 0 \end{pmatrix}\begin{pmatrix} c_i \\ d_i \end{pmatrix} = \rho_i \begin{pmatrix} K_l H (K_l)^\top & 0 \\ 0 & L_l H (L_l)^\top \end{pmatrix}\begin{pmatrix} c_i \\ d_i \end{pmatrix}$, (58) and $K_l$ and $L_l$ are defined in Section 3. [sent-1629, score-0.351]
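
A sketch of how a problem of this form could be solved numerically follows. The centring matrix H and the small ridge added to the right-hand side are our own choices, made only to keep the metric matrix positive definite, and the function name is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def generalised_eigs(Kl, Ll, ridge=1e-8):
    # Kl: (lx, m) window evaluations at the x-grid points; Ll: (ly, m) at the y-grid.
    # Solves a generalised eigenvalue problem of the form (58).
    lx, m = Kl.shape
    ly = Ll.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    A = np.block([[np.zeros((lx, lx)), Kl @ H @ Ll.T],
                  [Ll @ H @ Kl.T, np.zeros((ly, ly))]])
    B = np.block([[Kl @ H @ Kl.T, np.zeros((lx, ly))],
                  [np.zeros((ly, lx)), Ll @ H @ Ll.T]])
    B += ridge * np.eye(lx + ly)
    return eigh(A, B, eigvals_only=True)   # the rho_i, in ascending order
```

In the canonical-correlation convention the returned rho_i would then enter an expression of the form -1/2 * sum_i log(1 - rho_i^2); whether (9) takes exactly that form cannot be confirmed from this extract, so that final combination is not reproduced here.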


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('coco', 0.461), ('kmi', 0.398), ('kgv', 0.369), ('kcc', 0.23), ('px', 0.206), ('py', 0.176), ('mutual', 0.148), ('bach', 0.146), ('pxy', 0.121), ('easuring', 0.113), ('ndependence', 0.113), ('ousquet', 0.113), ('retton', 0.113), ('ica', 0.11), ('xg', 0.105), ('erbrich', 0.095), ('mola', 0.095), ('gretton', 0.092), ('kl', 0.089), ('parzen', 0.087), ('generalised', 0.084), ('ll', 0.082), ('rkhss', 0.075), ('independence', 0.074), ('kernel', 0.072), ('ly', 0.071), ('ch', 0.071), ('dependence', 0.068), ('ethods', 0.066), ('qi', 0.066), ('ernel', 0.064), ('functionals', 0.063), ('yg', 0.063), ('ex', 0.058), ('dl', 0.056), ('jordan', 0.055), ('lx', 0.054), ('covariance', 0.052), ('canonical', 0.05), ('ey', 0.05), ('jade', 0.047), ('cfica', 0.046), ('correlation', 0.046), ('di', 0.045), ('amari', 0.044), ('signals', 0.043), ('radical', 0.042), ('rly', 0.042), ('window', 0.041), ('eigenvalue', 0.039), ('cxy', 0.038), ('qlx', 0.038), ('dx', 0.037), ('gram', 0.037), ('universal', 0.037), ('fukumizu', 0.035), ('hyv', 0.034), ('xl', 0.034), ('gn', 0.034), ('grid', 0.033), ('zn', 0.032), ('rinen', 0.032), ('stiefel', 0.032), ('demixing', 0.031), ('sources', 0.031), ('ki', 0.031), ('dy', 0.031), ('singular', 0.031), ('rkhs', 0.031), ('mixing', 0.03), ('lkopf', 0.03), ('blind', 0.03), ('constrained', 0.03), ('fxn', 0.029), ('leurgans', 0.029), ('musical', 0.029), ('polishing', 0.029), ('kurtosis', 0.029), ('population', 0.028), ('bc', 0.028), ('ci', 0.028), ('mn', 0.026), ('cyy', 0.025), ('utual', 0.025), ('kernels', 0.025), ('regularisation', 0.025), ('regularised', 0.025), ('music', 0.025), ('var', 0.025), ('hilbert', 0.024), ('cov', 0.023), ('operator', 0.023), ('cholesky', 0.023), ('source', 0.022), ('outlier', 0.022), ('fastica', 0.022), ('unmixing', 0.022), ('horn', 0.022), ('signal', 0.022), ('matrix', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 41 jmlr-2005-Kernel Methods for Measuring Independence

Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf

Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis. © 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf.

2 0.12784609 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

3 0.10871425 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies

Author: Motoaki Kawanabe, Klaus-Robert Müller

Abstract: A blind separation problem where the sources are not independent, but have variance dependencies is discussed. For this scenario Hyvärinen and Hurri (2004) proposed an algorithm which requires no assumption on distributions of sources and no parametric model of dependencies between components. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) to variance dependencies and study estimating functions for blind separation of such dependent sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent model as well under mild conditions, although they should in principle not. Our results indicate that separation can be done based only on normalized sources which are adjusted to have stationary variances and is not affected by the dependent activity levels. We also study the asymptotic distribution of the quasi maximum likelihood method and the stability of the natural gradient learning in detail. Simulation results of artificial and realistic examples match well with our theoretical findings. Keywords: blind source separation, variance dependencies, independent component analysis, semiparametric statistical models, estimating functions

4 0.097867787 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

5 0.082570426 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture

Author: Luís B. Almeida

Abstract: When acquiring an image of a paper document, the image printed on the back page sometimes shows through. The mixture of the front- and back-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind source separation. This paper addresses a difficult version of this problem, corresponding to the use of “onion skin” paper, which results in a relatively strong nonlinearity of the mixture, which becomes close to singular in the lighter regions of the images. The separation is achieved through the MISEP technique, which is an extension of the well known INFOMAX method. The separation results are assessed with objective quality measures. They show an improvement over the results obtained with linear separation, but have room for further improvement. Keywords: ICA, blind source separation, nonlinear mixtures, nonlinear separation, image mixture, image separation

6 0.065001853 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

7 0.060046822 3 jmlr-2005-A Classification Framework for Anomaly Detection

8 0.056267537 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

9 0.050402574 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

10 0.050363213 25 jmlr-2005-Denoising Source Separation

11 0.048594717 64 jmlr-2005-Semigroup Kernels on Measures

12 0.0400378 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

13 0.039550517 39 jmlr-2005-Information Bottleneck for Gaussian Variables

14 0.03913128 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

15 0.036828939 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

16 0.031809144 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

17 0.031025507 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

18 0.028916532 36 jmlr-2005-Gaussian Processes for Ordinal Regression

19 0.028016772 32 jmlr-2005-Expectation Consistent Approximate Inference

20 0.026795158 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.207), (1, 0.225), (2, -0.264), (3, 0.064), (4, -0.06), (5, -0.001), (6, -0.002), (7, -0.232), (8, -0.101), (9, 0.048), (10, -0.199), (11, 0.009), (12, 0.045), (13, 0.111), (14, -0.135), (15, -0.017), (16, -0.023), (17, 0.07), (18, -0.067), (19, -0.072), (20, 0.047), (21, -0.007), (22, -0.015), (23, -0.002), (24, 0.072), (25, 0.047), (26, 0.082), (27, -0.058), (28, 0.022), (29, -0.119), (30, -0.096), (31, -0.021), (32, 0.067), (33, 0.068), (34, -0.063), (35, 0.149), (36, 0.093), (37, -0.162), (38, -0.042), (39, -0.201), (40, -0.02), (41, 0.081), (42, -0.202), (43, -0.047), (44, -0.286), (45, 0.101), (46, 0.202), (47, 0.147), (48, 0.073), (49, 0.176)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93812376 41 jmlr-2005-Kernel Methods for Measuring Independence

Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf

Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis. © 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf.

2 0.36246696 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

3 0.34869739 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies

Author: Motoaki Kawanabe, Klaus-Robert Müller

Abstract: A blind separation problem where the sources are not independent, but have variance dependencies is discussed. For this scenario Hyvärinen and Hurri (2004) proposed an algorithm which requires no assumption on distributions of sources and no parametric model of dependencies between components. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) to variance dependencies and study estimating functions for blind separation of such dependent sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent model as well under mild conditions, although they should in principle not. Our results indicate that separation can be done based only on normalized sources which are adjusted to have stationary variances and is not affected by the dependent activity levels. We also study the asymptotic distribution of the quasi maximum likelihood method and the stability of the natural gradient learning in detail. Simulation results of artificial and realistic examples match well with our theoretical findings. Keywords: blind source separation, variance dependencies, independent component analysis, semiparametric statistical models, estimating functions

4 0.29111522 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures

Author: Asela Gunawardana, William Byrne

Abstract: The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-step approximation prevents convergence to local maxima in likelihood. Keywords: EM, variational EM, incremental EM, convergence, information geometry

5 0.27948228 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming

6 0.23346198 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

7 0.21597312 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods

8 0.19511542 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture

9 0.19312026 39 jmlr-2005-Information Bottleneck for Gaussian Variables

10 0.16427554 64 jmlr-2005-Semigroup Kernels on Measures

11 0.13850816 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

12 0.13695128 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression

13 0.13203293 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks

14 0.13184246 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

15 0.1227334 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

16 0.12239178 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines

17 0.11291547 3 jmlr-2005-A Classification Framework for Anomaly Detection

18 0.10967642 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error

19 0.10466976 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

20 0.10397214 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.041), (13, 0.02), (17, 0.013), (19, 0.019), (36, 0.046), (37, 0.033), (42, 0.011), (43, 0.03), (44, 0.02), (47, 0.011), (52, 0.059), (59, 0.014), (70, 0.033), (80, 0.015), (88, 0.071), (90, 0.474), (94, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86167496 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization

Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer

Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.

same-paper 2 0.82080621 41 jmlr-2005-Kernel Methods for Measuring Independence

Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf

Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis. © 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf.

3 0.36946407 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

Author: Aapo Hyvärinen

Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence

4 0.36382389 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader

Author: Marcus Hutter, Jan Poland

Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate √ must be adaptively tuned. The natural choice of complexity/current loss renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction

5 0.35314199 57 jmlr-2005-Multiclass Boosting for Weak Classifiers

Author: Günther Eibl, Karl-Peter Pfeiffer

Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps

6 0.35004631 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

7 0.32327577 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies

8 0.32281363 48 jmlr-2005-Learning the Kernel Function via Regularization

9 0.31917638 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

10 0.31781873 36 jmlr-2005-Gaussian Processes for Ordinal Regression

11 0.3176578 54 jmlr-2005-Managing Diversity in Regression Ensembles

12 0.31761375 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning

13 0.31661084 3 jmlr-2005-A Classification Framework for Anomaly Detection

14 0.31619498 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning

15 0.30968922 20 jmlr-2005-Clustering with Bregman Divergences

16 0.30956772 32 jmlr-2005-Expectation Consistent Approximate Inference

17 0.30888873 39 jmlr-2005-Information Bottleneck for Gaussian Variables

18 0.30742976 11 jmlr-2005-Algorithmic Stability and Meta-Learning

19 0.29648018 47 jmlr-2005-Learning from Examples as an Inverse Problem

20 0.29568461 67 jmlr-2005-Stability of Randomized Learning Algorithms