jmlr jmlr2006 jmlr2006-36 knowledge-graph by maker-knowledge-mining

36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution


Source: pdf

Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “non-Gaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. [sent-18, score-0.108]

2 By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. [sent-19, score-0.227]

3 As a final step, the target space itself is estimated by applying PCA to this family of vectors. [sent-20, score-0.088]

4 1 Setting and General Principle We want to emphasize from the beginning that we do not assume the Gaussian components to be of smaller order of magnitude than the signal components; all components are instead typically of the same amplitude. [sent-33, score-0.087]

5 The framework we consider is on the other hand very close to that of projection pursuit (denoted PP in short in the sequel) algorithms (Friedman and Tukey, 1974; Huber, 1985; Hyvärinen et al. [sent-43, score-0.194]

6 The goal of projection pursuit methods is to extract non-Gaussian components in a general setting, i. [sent-45, score-0.173]

7 Projection pursuit methods typically proceed by fixing a single index which measures the non-Gaussianity (or ’interestingness’) of a projection direction. [sent-48, score-0.178]

8 However, it is known that some projection indices are suitable for finding super-Gaussian components (heavy-tailed distribution) while others are suited for identifying sub-Gaussian components (light-tailed distribution) (Hyvärinen et al. [sent-50, score-0.209]

9 To summarize: existing methods for the setting we consider typically proceed by defining an appropriate interestingness index and then computing a projection that maximizes this index (projection pursuit methods, and some ICA methods). [sent-53, score-0.178]
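
As a rough illustration of this strategy (not the paper's procedure), the sketch below scores candidate unit directions of a toy 2D data set by the absolute excess kurtosis, a standard non-Gaussianity index, and keeps the maximizer; the data, the particular index and the crude grid search are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(s):
    """Absolute excess kurtosis of a 1D sample; roughly 0 for Gaussian data."""
    s = (s - s.mean()) / s.std()
    return abs(np.mean(s ** 4) - 3.0)

rng = np.random.default_rng(0)
# Toy 2D data: first coordinate uniform (sub-Gaussian), second Gaussian.
X = np.column_stack([rng.uniform(-1, 1, 5000), rng.normal(0, 1, 5000)])

# Crude projection pursuit: grid-search a unit direction maximizing the index.
angles = np.linspace(0, np.pi, 180, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
scores = [excess_kurtosis(X @ w) for w in directions]
print(directions[int(np.argmax(scores))])  # close to (1, 0), the non-Gaussian axis
```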

10 Clearly, a multi-dimensional Gaussian subspace is a reasonable candidate for an undesired component (our idea could be generalized by defining, say, a Laplacian subspace to be uninformative). [sent-55, score-0.198]

11 2 Presentation of the Method: Technically, our new approach to identifying the non-Gaussian subspace uses a very general semiparametric framework. [sent-58, score-0.099]

12 Using a whole family of different nonlinear functions h then yields a family of different vectors β(h) which all approximately lie in, and span, the non-Gaussian subspace. [sent-62, score-0.146]

13 We finally perform PCA on this family of vectors to extract the principal directions and estimate the target space. [sent-63, score-0.153]

14 Note that no particular structural assumption is made about the noise covariance matrix Γ . [sent-88, score-0.09]

15 Assume the signal S is contained in a lower-dimensional linear subspace E of dimension m < d . [sent-89, score-0.151]

16 Lemma 1 The density p(x) for the model (1) with the m-dimensional signal S and an independent Gaussian noise N can be represented as p(x) = g(Tx) φ_Γ(x), where T is a linear operator from R^d to R^m, g(·) is some function on R^m and φ_Γ(x) is the density of the Gaussian component. [sent-93, score-0.169]

17 Note that the above density representation is not unique, as the parameters g, T, Γ are not identifiable from the density p. [sent-95, score-0.086]

18 In particular, it is useful to notice that if the noise N is standard normal, then the operator T can be taken equal to the projector on the signal space E. [sent-97, score-0.117]

19 Therefore, in this case, K(T) coincides with E⊥, the orthogonal complementary subspace to E. [sent-98, score-0.125]

20 In the general situation with “colored” Gaussian noise, the signal space E does not coincide with the orthogonal complement I = K(T)⊥ of the kernel of the operator T. [sent-99, score-0.08]

21 However, the density representation of Lemma 1 shows that the subspace K(T) is non-informative and contains only noise. [sent-100, score-0.142]

22 This definition implements the general point of view outlined in the introduction, namely: we define what is considered uninteresting; the target space is then defined indirectly as the orthogonal complement of the uninteresting component. [sent-103, score-0.089]

23 One can decompose the noise N of model (1) into a component N1 belonging to the signal space E and an independent component N2; it can then be shown that N2 belongs to the subspace K(T) defined above. [sent-106, score-0.13]

24 In this view, the space I is orthogonal to the independent noise component, and projecting the data onto I amounts to cancelling this independent noise component by an orthogonal projection. [sent-107, score-0.133]

25 An alternative point of view would be to disregard the input space geometry altogether, and to first map the data linearly to a reference space where it has covariance identity (“whitening” transform), which would be closer to a traditional ICA analysis. [sent-111, score-0.084]

26 For simplicity, we will stick to our original goal of orthogonal projection in the original space. [sent-114, score-0.111]

27 In what follows, we denote by I the m-dimensional linear subspace in R^d generated by the adjoint operator T*: I = K(T)⊥ = ℑ(T*), where ℑ(·) denotes the range of an operator. [sent-119, score-0.122]

28 The proposed goal is therefore to estimate I by some subspace Î computed from an i.i.d. sample of the data. [sent-121, score-0.099]

29 We measure the closeness of the two subspaces I and Î by the following error function: E(Î, I) = (2m)^{-1} ‖Π_Î − Π_I‖²_Frob = m^{-1} ∑_{i=1}^{m} ‖(I_d − Π_Î) v_i‖², (3) where Π_I denotes the orthogonal projection on I, (v_i)_{i=1}^{m} is an orthonormal basis of I and I_d is the identity matrix. [sent-130, score-0.111]
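
A minimal sketch of this error criterion, written directly from Equation (3); the two subspaces are assumed to be passed in as matrices whose columns form orthonormal bases (the variable names are ours).

```python
import numpy as np

def subspace_error(V_true, V_est):
    """E(I_est, I) = (2m)^{-1} ||P_est - P_true||_Frob^2 for two m-dimensional
    subspaces of R^d, given d x m matrices with orthonormal columns."""
    m = V_true.shape[1]
    P_true = V_true @ V_true.T      # orthogonal projector onto I
    P_est = V_est @ V_est.T         # orthogonal projector onto the estimate
    return np.linalg.norm(P_est - P_true, "fro") ** 2 / (2 * m)

e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
print(subspace_error(e1, e1), subspace_error(e1, e2))  # 0.0 (equal) and 1.0 (orthogonal)
```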

30 In the general case where the covariance matrix Σ is different from identity, provided it is non-degenerate, we can apply a whitening operation (also known as the Mahalanobis transform). [sent-139, score-0.137]

31 Note that if the density function of X is of the form p(x) = g(Tx) φ_Γ(x), then by change of variable the density function of Z = AX is given by q(z) = c_A g(TA^{-1} z) φ_{AΓA^⊤}(z), where c_A is a normalization constant depending on A. [sent-141, score-0.114]

32 The first step is to apply the whitening transform to the data (where the true covariance matrix Σ is estimated by the empirical covariance Σ̂). [sent-148, score-0.241]
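
A small sketch of this whitening step; centering the data and computing the inverse square root through an eigendecomposition are our implementation choices, not details taken from the text.

```python
import numpy as np

def whiten(X):
    """Return Y = Sigma_hat^{-1/2} (X - mean) so that Y has identity covariance."""
    Xc = X - X.mean(axis=0)
    Sigma_hat = np.cov(Xc, rowvar=False)               # empirical covariance
    eigval, eigvec = np.linalg.eigh(Sigma_hat)
    W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T    # Sigma_hat^{-1/2}
    return Xc @ W, W

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Y, W = whiten(X)
print(np.allclose(np.cov(Y, rowvar=False), np.eye(5)))      # True
```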

33 This gives rise to the estimated vector β̂(h) = (1/n) ∑_{i=1}^{n} [Y_i h(Y_i) − ∇h(Y_i)], (5) which we expect to be close to the non-Gaussian subspace up to some estimation error. [sent-155, score-0.156]
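
The estimator in Equation (5) translates directly into code; the sketch below uses a single illustrative test function h(y) = tanh(w·y) for a fixed direction w and assumes the data Y are already whitened.

```python
import numpy as np

def beta_hat(Y, h, grad_h):
    """Equation (5): beta_hat(h) = (1/n) * sum_i [ Y_i h(Y_i) - grad h(Y_i) ]."""
    hy = h(Y)                                           # shape (n,)
    return (Y * hy[:, None] - grad_h(Y)).mean(axis=0)

w = np.array([1.0, 0.0])                                # illustrative direction
h = lambda Y: np.tanh(Y @ w)
grad_h = lambda Y: (1 - np.tanh(Y @ w) ** 2)[:, None] * w   # gradient w.r.t. y

rng = np.random.default_rng(2)
# Whitened toy data: uniform (non-Gaussian) first coordinate, Gaussian second.
Y = np.column_stack([rng.uniform(-np.sqrt(3), np.sqrt(3), 20000),
                     rng.normal(0, 1, 20000)])
print(beta_hat(Y, h, grad_h))   # roughly aligned with the e1 (non-Gaussian) axis
```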

34 At this point, the natural next step is to consider a whole family of functions {h_i}_{i=1}^{L}, giving rise to an associated vector family {β_i}_{i=1}^{L}, all lying close to the target subspace, where β_i := β̂(h_i). [sent-156, score-0.17]

35 The final step is to recover the non-Gaussian subspace from this set. [sent-157, score-0.099]

36 2 Normalization of the Vectors: When extracting information on the target subspace from the set of vectors {β_i}_{i=1}^{L}, attention should be paid to how the functions {h_i}_{i=1}^{L} are normalized. [sent-163, score-0.158]

37 We argue in more detail that the norm of β(h) after normalization is directly linked to the amount of information brought by this vector about the target subspace. [sent-171, score-0.096]

38 To obtain a spherical error ball, we would have to apply a (linear) error whitening transform separately to each β(h) . [sent-184, score-0.096]

39 However, in that case the error whitening transform would be different for each h , and the information of the vector family about the target subspace would then be lost. [sent-185, score-0.283]

40 To illustrate the relation between the norm of the (normalized) β and the amount of information on the non-Gaussian subspace brought by β, we plot in the right part of Figure 3 the relation between ‖β‖ and ‖Π_J β‖/‖β‖ = cos(θ(β)). [sent-187, score-0.13]
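
The quantity ‖Π_J β‖/‖β‖ = cos(θ(β)) used in that plot can be computed as in the following sketch, where the non-Gaussian subspace J is assumed to be given through an orthonormal basis.

```python
import numpy as np

def cos_angle_to_subspace(beta, B):
    """cos(theta(beta)) = ||P_J beta|| / ||beta||, with B a d x m matrix whose
    columns are an orthonormal basis of the subspace J."""
    proj = B @ (B.T @ beta)                 # P_J beta
    return np.linalg.norm(proj) / np.linalg.norm(beta)

B = np.array([[1.0], [0.0], [0.0]])         # J = span(e1) in R^3
print(cos_angle_to_subspace(np.array([1.0, 0.0, 0.0]), B))   # 1.0 (inside J)
print(cos_angle_to_subspace(np.array([0.0, 1.0, 1.0]), B))   # 0.0 (orthogonal to J)
```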

41 As expected, the vectors β with highest norm are indeed much closer to the non-Gaussian subspace in general. [sent-188, score-0.152]

42 It is interesting to notice that this equation precisely coincides with one iteration of a well-known projection pursuit algorithm, FastICA (Hyvärinen, 1999). [sent-207, score-0.145]
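
For reference, a minimal sketch of such a FastICA-style fixed-point update on whitened data for a single direction ω with the tanh nonlinearity; treat it as an illustration of the generic update ω ← E[Y g(ωᵀY)] − E[g′(ωᵀY)] ω followed by renormalization, not as a transcription of Equation (8).

```python
import numpy as np

def fastica_step(Y, w, g=np.tanh, g_prime=lambda s: 1 - np.tanh(s) ** 2):
    """One FastICA-style fixed-point update for a single unit direction w."""
    s = Y @ w                                           # current 1D projection
    w_new = (Y * g(s)[:, None]).mean(axis=0) - g_prime(s).mean() * w
    return w_new / np.linalg.norm(w_new)                # renormalize

rng = np.random.default_rng(3)
Y = np.column_stack([rng.uniform(-np.sqrt(3), np.sqrt(3), 20000),
                     rng.normal(0, 1, 20000)])          # whitened toy data
w = np.array([np.cos(1.0), np.sin(1.0)])                # arbitrary starting direction
for _ in range(20):
    w = fastica_step(Y, w)
print(w)   # expected to end up close to the non-Gaussian axis (plus or minus e1)
```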

43 Since we are using projection pursuit as a heuristic to find suitable parameters ω for a fixed f, the theoretical setting proposed here can also be seen as a suitable framework for combining projection pursuit results when using different index functions f. [sent-276, score-0.323]

44 We now consider a finite family of possible choices {f_k}_{k=1}^{L} which are then combined. [sent-281, score-0.083]

45 The choice to involve the empirical covariance directly in the bound, instead of the population one, was made to emphasize that the estimation error for the covariance itself is also taken into account in the bound. [sent-302, score-0.148]

46 Let Σ̂ denote the empirical covariance matrix of the data sample (X_i); put Y_i = Σ̂^{-1/2} X_i, the empirically whitened data. [sent-316, score-0.155]

47 From the family {v^{(k)}}, throw away vectors having norm smaller than the threshold ε. [sent-330, score-0.104]
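
Putting the last steps together, a compact sketch of this aggregation: drop the short (uninformative) vectors, then run an uncentered PCA on the survivors and keep the m leading directions as the estimated non-Gaussian subspace. The threshold value, the dimension m and the synthetic vectors are assumptions made for the example.

```python
import numpy as np

def recover_subspace(betas, m, eps):
    """betas: (L, d) array of (normalized) estimated vectors. Keep those with
    norm >= eps, then return the m leading principal directions of the kept set
    as a d x m orthonormal basis of the estimated non-Gaussian subspace."""
    norms = np.linalg.norm(betas, axis=1)
    kept = betas[norms >= eps]
    # PCA without centering: we care about directions through the origin.
    _, _, Vt = np.linalg.svd(kept, full_matrices=False)
    return Vt[:m].T

rng = np.random.default_rng(4)
# Synthetic vectors: most lie near span(e1, e2) in R^5, a few are tiny noise.
signal = rng.normal(size=(40, 2)) @ np.eye(2, 5) + 0.05 * rng.normal(size=(40, 5))
betas = np.vstack([signal, 0.01 * rng.normal(size=(10, 5))])
print(recover_subspace(betas, m=2, eps=0.5))   # columns roughly span e1, e2
```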

48 We start by considering an idealized case where whitening is done using the true covariance matrix Σ: Y = Σ^{-1/2} X. [sent-339, score-0.137]

49 Theorem 3 Let {h_k}_{k=1}^{L} be a family of smooth functions from R^d to R such that sup_{y,k} max(‖∇h_k(y)‖, |h_k(y)|) < B, and assume that X has covariance matrix Σ with ‖Σ^{-1}‖ ≤ K², and such that for some λ_0 > 0 the following inequality holds: E[exp(λ_0 ‖X‖)] ≤ a_0 < ∞. [sent-341, score-0.234]

50 The above inequality tells us that the rate of convergence of the estimated vectors to the target space is in this case of order n^{-1/2} (classical “parametric” rate). [sent-356, score-0.08]

51 Theorem 4 implies that the vectors γ(h_k) obtained from any h(x) converge to the unknown non-Gaussian subspace I uniformly at a rate of order √(log(n)/n). [sent-375, score-0.121]

52 Then we only have to apply the bound to a discretized family of size L = O(ε^{1−d}), giving rise only to an additional factor in the bound of order d log(ε^{−1}). [sent-382, score-0.144]

53 The result in terms of the initial data ensures the overall consistency of the approach, but the convergence in the whitened space is equally interesting since we use it as the main working space for the algorithm and the bound itself is more precise. [sent-386, score-0.094]

54 Each of these ranges was divided into 1000 equispaced values, thus yielding a family {f_k} of size 4000 (Fourier functions count twice because of the sine and cosine parts). [sent-395, score-0.083]
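
A sketch of how such a parametrized family could be generated; the three functional forms below (a tanh sigmoid, a Gauss-weighted cubic, and the sine/cosine parts of a Fourier function) and their parameter ranges are illustrative assumptions, chosen only to reproduce the 1000-values-per-range, 4000-function count.

```python
import numpy as np

def build_family(n_values=1000, a_range=(0.5, 5.0), s_range=(0.5, 5.0),
                 b_range=(0.5, 5.0)):
    """Return a list of scalar functions f_k: R -> R, to be applied to
    projections w . y. Fourier entries count twice (sine and cosine)."""
    family = []
    for a in np.linspace(*a_range, n_values):       # sigmoid-type functions
        family.append(lambda s, a=a: np.tanh(a * s))
    for sig in np.linspace(*s_range, n_values):     # Gauss-weighted cubics
        family.append(lambda s, sig=sig: s ** 3 * np.exp(-s ** 2 / (2 * sig ** 2)))
    for b in np.linspace(*b_range, n_values):       # Fourier: sine and cosine
        family.append(lambda s, b=b: np.sin(b * s))
        family.append(lambda s, b=b: np.cos(b * s))
    return family

print(len(build_family()))   # 4000
```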

55 The data sets are: (a) 2D independent Gaussian mixtures, (b) 2D isotropic super-Gaussian, (c) 2D isotropic uniform and (d) dependent 1D Laplacian + 1D uniform. [sent-406, score-0.081]

56 Figure 7: Performance comparison plots (for error criterion E(Î, I)) of NGCA versus FastICA; top: versus pow3 index; bottom: versus tanh index. [sent-515, score-0.286]

57 (D) Dependent super- and sub-Gaussian: 1-dimensional Laplacian with density proportional to exp(−|x_Lap|) and 1-dimensional dependent uniform U(c, c + 1), where c = 0 for |x_Lap| ≤ log 2 and c = −1 otherwise. [sent-519, score-0.105]

58 The profiles of the density functions of the non-Gaussian components in the above data sets are described in Figure 5. [sent-521, score-0.094]

59 Because PP tends to get trapped into local optima of the index function it optimizes, we restarted it 10 times with random starting points and took the subspace achieving the best index value. [sent-526, score-0.165]
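
A sketch of this restart-and-pick-the-best strategy under simplifying assumptions: the PP optimizer is stood in for by a tanh fixed-point iteration on a single direction, and runs are scored by absolute excess kurtosis; the optimizer, the index and the data are illustrative, not the experimental setup itself.

```python
import numpy as np

def index_value(s):
    """Simple non-Gaussianity index: absolute excess kurtosis of a projection."""
    s = (s - s.mean()) / s.std()
    return abs(np.mean(s ** 4) - 3.0)

def pp_one_run(Y, rng, n_iter=50):
    """One projection-pursuit run from a random starting direction."""
    w = rng.normal(size=Y.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        s = Y @ w
        w = (Y * np.tanh(s)[:, None]).mean(axis=0) - (1 - np.tanh(s) ** 2).mean() * w
        w /= np.linalg.norm(w)
    return w, index_value(Y @ w)

rng = np.random.default_rng(5)
Y = np.column_stack([rng.uniform(-np.sqrt(3), np.sqrt(3), 20000),
                     rng.normal(size=(20000, 3))])      # 1 non-Gaussian + 3 Gaussian dims
runs = [pp_one_run(Y, rng) for _ in range(10)]          # 10 random restarts
best_w, best_index = max(runs, key=lambda r: r[1])
print(best_w, best_index)   # best run expected near plus/minus e1
```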

60 It is known that PP(tanh) is suitable for finding super-Gaussian components (heavy-tailed distribution) while PP(pow3) is suitable for finding sub-Gaussian components (light-tailed distribution) (Hyvärinen et al. [sent-529, score-0.105]

61 This can be observed in the data sets (B) and (C): PP(tanh) works well for the data set (B) and PP(pow3) works well for the data set (C), although the upper-quantile is very large for the data set (C) (because of PP getting trapped in local minima). [sent-531, score-0.092]

62 In the synthetic data sets used so far, the data was always generated with a covariance matrix equal to identity. [sent-585, score-0.107]

63 The results are depicted in Figure 9, where we observe again a transition to a failure mode when the covariance matrix is too badly conditioned. [sent-633, score-0.107]

64 However, this result is not entirely surprising, as we expect this type of failure mode to be caused by too large an estimation error in the covariance matrix and therefore in the whitening/dewhitening steps. [sent-636, score-0.133]

65 We compare the NGCA methodology described in the previous section, projection pursuit (“vanilla” FastICA) using the tanh or the pow3 index, and Isomap (non-linear projection method, see Tenenbaum et al. [sent-649, score-0.516]

66 A 3D projection of the data was computed using these methods, which was in turn projected in 2D to draw the figure; this last projection was chosen manually so as to make the cluster structure as visible as possible in each case. [sent-652, score-0.193]

67 We see that the NGCA methodology gives a much more relevant projection than PP using either tanh or pow3 alone: we can distinguish 10-11 clusters versus at most 5 for the PP methods and 7-8 for Isomap. [sent-653, score-0.423]

68 Finally, we confirm, by applying the projection found to held-out test data (i. [sent-655, score-0.108]

69 This, in passing, shows one advantage of a linear projection method, namely that it can be extended to new data in a straightforward way. [sent-658, score-0.108]

70 Presumably, an important difference between the NGCA projection and the others comes from the Fourier functions, since they are not present in either of the PP methods. [sent-659, score-0.085]

71 It can be confirmed by looking at the vector norms that Fourier functions are more relevant for this data set; they gave rise to estimated vectors with generally higher norms and consequently had a sizable influence on the choice of the projection. [sent-660, score-0.096]

72 For the middle right panel, the 2D projection found from the middle left panel was used to visualize additional held out test data. [sent-667, score-0.085]

73 This is confirmed in Figure 10: even without the relevant Fourier functions, NGCA yields a projection where 8 clusters can be distinguished, and the classes are much more clearly separated than with PP methods. [sent-669, score-0.137]

74 Conclusion We proposed a new semi-parametric framework for constructing a linear projection to separate an uninteresting multivariate Gaussian ‘noise’ subspace of possibly large amplitude from the ‘signal-of-interest’ subspace. [sent-673, score-0.21]

75 To estimate the non-Gaussian subspace from the set of vectors obtained, PCA is finally performed after suitable renormalization and elimination of uninformative vectors. [sent-675, score-0.15]

76 Note that, in general, PP methods need to pre-specify a projection index used to search for non-Gaussian components. [sent-682, score-0.118]

77 Extending the proposed framework to non-linear projection scenarios (Cox and Cox, 1994; Schölkopf et al. [sent-687, score-0.085]

78 Denote by Π_E the projector from R^d to R^m which corresponds to the subspace E. [sent-704, score-0.133]

79 Let also E⊥ be the subspace complementary to E and Π_{E⊥} denote the projector on E⊥. [sent-705, score-0.133]

80 It is clear that the density of ΠE S + N1 in Rm can be represented as the product g(x1 )φ(x1 ) for some function g and the standard normal density φ(x1 ) , x1 ∈ Rm . [sent-708, score-0.086]

81 Note that the linear mapping T = Π_E characterizes the signal subspace E. [sent-710, score-0.13]

82 Namely, E is the image ℑ(T ∗ ) of the dual operator T ∗ while E ⊥ is the null subspace (kernel) of T : E ⊥ = K(T ) . [sent-711, score-0.122]

83 Next we drop the assumption of the standard normal noise and assume only that the covariance matrix Γ of the noise is non-degenerate. [sent-712, score-0.119]

84 The transformed signal Γ^{-1/2} S belongs to the transformed subspace Γ^{-1/2} E. [sent-714, score-0.13]

85 From (15), using ∇p(x) = (∇ log p(x)) p(x), we have β(h) = ∫ ψ_h(x) ∇ log p(x) p(x) dx. [sent-723, score-0.124]
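
For intuition, a short sketch (under the usual assumption that h(x) p(x) vanishes at infinity) of the integration-by-parts identity behind such expressions:

```latex
% Integration by parts, assuming h(x) p(x) -> 0 as |x| -> infinity:
\int \nabla h(x)\, p(x)\, dx = -\int h(x)\, \nabla p(x)\, dx
                             = -\int h(x)\, \nabla \log p(x)\, p(x)\, dx .
% Hence
\mathbb{E}\bigl[\, X\, h(X) - \nabla h(X) \,\bigr]
    = \int h(x)\, \bigl( x + \nabla \log p(x) \bigr)\, p(x)\, dx .
% If p(x) = g(Tx) \phi(x) with \phi the standard normal density, then
% x + \nabla \log p(x) = T^{\top} \nabla (\log g)(Tx), so this expectation
% lies in the range of T^{\top}, i.e. in the non-Gaussian index space.
```

This is consistent with Lemma 1 and with the estimator of Equation (5), where the population expectation is replaced by an empirical average over the whitened sample.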

86 It follows that h^{(k)}(x) is such that E[exp(λ_0 |h^{(k)}(x)|/(BK))] ≤ a_0 exp(λ_0 K), and hence satisfies the assumption of Lemma 5. [sent-733, score-0.096]

87 Now summing over the coordinates, taking the square root and using √(a + b) ≤ √a + √b leads to: ‖β̂_Y − β_Y‖ ≤ 2√2 σ_Y(h) √((log δ^{-1} + log d)/n) + C_2(λ_0, a_0, B, d, K) log(nδ^{-1}) log δ^{-1} / n^{3/4}, (17) with probability at least 1 − 4δ. [sent-735, score-0.186]

88 To turn this into a uniform bound over the family {h_k}_{k=1}^{L}, we simply apply this inequality separately to each function in the family with δ replaced by δ/L. [sent-736, score-0.123]

89 Let σ̂² = (2n(n−1))^{-1} ∑_{i≠j} (X_i − X_j)². Then for any δ < 1/4 the following holds with probability at least 1 − 4δ, where c is a universal constant: |µ̂ − µ| ≤ √(2σ̂² log δ^{-1}/n) + c λ_0^{-1} max(1, log(na_0 δ^{-1})) (log δ^{-1}/n)^{3/4} + log δ^{-1}/n. [sent-748, score-0.248]

90 By Markov’s inequality, it holds that P[|X| > t] ≤ a_0 exp(−λ_0 t). Fixing A = log(nδ^{-1} a_0)/λ_0 for the rest of the proof, it follows by taking t = A in the above inequality that for any 1 ≤ i ≤ n: P[X_i^A ≠ X_i] ≤ δ/n. [sent-751, score-0.131]

91 We now deal with the third term: we have T_3 = |E[X 1{|X| > A}]| ≤ E[|X| 1{|X| > A}] = ∫_0^∞ P[|X| 1{|X| > A} > t] dt ≤ A P[|X| > A] + ∫_A^∞ a_0 exp(−λ_0 t) dt ≤ a_0 (A + λ_0^{-1}) exp(−λ_0 A) = δ (1 + log(nδ^{-1} a_0)) / (nλ_0). [sent-753, score-0.158]

92 Finally, for the second term, since |X^A| ≤ A = λ_0^{-1} log(nδ^{-1} a_0), Bernstein’s inequality ensures that with probability at least 1 − 2δ the following holds: |(1/n) ∑_{i=1}^{n} X_i^A − E[X^A]| ≤ √(2 Var[X^A] log δ^{-1}/n) + 2 λ_0^{-1} log(nδ^{-1} a_0) log δ^{-1} / n. [sent-754, score-0.269]

93 Applying McDiarmid’s inequality (McDiarmid, 1989) to the random variable σ̂^A yields that with probability 1 − δ we have |(σ̂^A)² − Var[X^A]| ≤ 4A² √(log δ^{-1}/n); finally, except for samples in the set Ω_A which we have already excluded above, we have σ̂^A = σ̂. [sent-762, score-0.083]

94 One remaining technicality is to replace the term σ_Y(h) (which cannot be evaluated from the data since it depends on the exactly whitened data Y_i) in (21) by σ_Ŷ(h), which can be evaluated from the data. [sent-773, score-0.117]

95 Proof This is an application of Chernoff’s bounding method: R_n := log P[n^{-1/2} ∑_{i=1}^{n} (ξ_i − m) > z] ≤ −µz√n + log E[exp(∑_{i=1}^{n} µ(ξ_i − m))] = −µz√n + n log E[exp(µ(ξ_1 − m))], where the above inequality is Markov’s. [sent-797, score-0.255]

96 Using this and the assumption, taking expectation in (22) yields that for c̄ = (1/2) c e_0 > 0: ∀µ ∈ R, |µ| < µ_0/2 ⇒ E[exp(µX)] ≤ 1 + µE[X] + c̄µ² ≤ exp(µE[X] + c̄µ²), implying E[exp(µ(X − E[X]))] ≤ exp(c̄µ²); taking logarithms on both sides yields the conclusion. [sent-807, score-0.096]

97 ... ≤ exp{−0.5 c^{-1}(κ − C) d log n} ≤ 1/n, so that R*_n ≤ 1/n + P[sup_{θ ∈ B_{d,2ε}} (· · ·) > (κ − C) d log n], provided that κ is chosen so that c^{-1}(κ − C) d/2 > d/2 + 1. [sent-830, score-0.062]

98 We apply a mean distance linkage clustering algorithm to data projected to lower dimension using various techniques: NGCA, FastICA, PCA, locally linear embedding (LLE, Roweis and Saul, 2000), Isomap (Tenenbaum et al. [sent-848, score-0.084]
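
A sketch of this clustering step using SciPy's hierarchical clustering; we read "mean distance linkage" as average linkage, and the 3D stand-in data and the choice of three clusters are assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
# Stand-in for data already projected to 3D by NGCA / PCA / LLE / Isomap:
# three well-separated blobs.
X_proj = np.vstack([rng.normal(c, 0.3, size=(50, 3)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X_proj, method="average")          # mean-distance (average) linkage
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))   # three clusters of about 50 points
```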

99 Although this information is not used in determining the clustering, we will use it as a yardstick to measure whether the clustering gives rise to relevant structure discovery. [sent-854, score-0.091]

100 Recent attempts at formalizing criteria for clustering have proposed that clustering stability should be a relevant criterion for data clustering (see, e. [sent-859, score-0.184]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ngca', 0.572), ('pp', 0.376), ('fastica', 0.293), ('tanh', 0.286), ('awanabe', 0.151), ('lanchard', 0.151), ('pokoiny', 0.151), ('istribution', 0.143), ('ugiyama', 0.128), ('igh', 0.121), ('imensional', 0.121), ('omponents', 0.121), ('uller', 0.115), ('isomap', 0.1), ('subspace', 0.099), ('bd', 0.089), ('lle', 0.086), ('projection', 0.085), ('aussian', 0.082), ('hk', 0.079), ('whitening', 0.076), ('whitened', 0.071), ('oil', 0.067), ('log', 0.062), ('yi', 0.061), ('covariance', 0.061), ('pursuit', 0.06), ('fourier', 0.056), ('pca', 0.056), ('hyv', 0.055), ('family', 0.051), ('xia', 0.05), ('dist', 0.05), ('rinen', 0.049), ('exp', 0.048), ('density', 0.043), ('clustering', 0.04), ('rd', 0.039), ('target', 0.037), ('xh', 0.035), ('projector', 0.034), ('index', 0.033), ('clusters', 0.032), ('fk', 0.032), ('tenenbaum', 0.032), ('rise', 0.031), ('signal', 0.031), ('norm', 0.031), ('wine', 0.029), ('xx', 0.029), ('noise', 0.029), ('vowel', 0.029), ('nongaussian', 0.029), ('isotropic', 0.029), ('renormalization', 0.029), ('yh', 0.029), ('normalization', 0.028), ('components', 0.028), ('rm', 0.027), ('mode', 0.027), ('orthogonal', 0.026), ('cox', 0.026), ('nl', 0.026), ('ica', 0.026), ('cos', 0.026), ('visualization', 0.026), ('strasse', 0.026), ('uninteresting', 0.026), ('estimation', 0.026), ('hyperbolic', 0.025), ('spokoiny', 0.025), ('rigorous', 0.025), ('fraunhofer', 0.023), ('kekul', 0.023), ('data', 0.023), ('directions', 0.023), ('dx', 0.023), ('roweis', 0.023), ('operator', 0.023), ('smooth', 0.022), ('ball', 0.022), ('vectors', 0.022), ('nonlinear', 0.022), ('usps', 0.022), ('kawanabe', 0.022), ('harmeling', 0.021), ('vicinity', 0.021), ('inequality', 0.021), ('gaussian', 0.021), ('dimension', 0.021), ('stability', 0.021), ('laplacian', 0.02), ('transform', 0.02), ('relevant', 0.02), ('principal', 0.02), ('dence', 0.019), ('indices', 0.019), ('failure', 0.019), ('bernstein', 0.019), ('fhg', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “non-Gaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.

2 0.23247567 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

3 0.071717918 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

Author: Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen

Abstract: In recent years, several methods have been proposed for the discovery of causal structure from non-experimental data. Such methods make various assumptions on the data generating process to facilitate its identification from purely observational data. Continuing this line of research, we show how to discover the complete causal structure of continuous-valued data, under the assumptions that (a) the data generating process is linear, (b) there are no unobserved confounders, and (c) disturbance variables have non-Gaussian distributions of non-zero variances. The solution relies on the use of the statistical method known as independent component analysis, and does not require any pre-specified time-ordering of the variables. We provide a complete Matlab package for performing this LiNGAM analysis (short for Linear Non-Gaussian Acyclic Model), and demonstrate the effectiveness of the method using artificially generated data and real-world data. Keywords: independent component analysis, non-Gaussianity, causal discovery, directed acyclic graph, non-experimental data

4 0.065167658 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

5 0.054086186 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

6 0.04799078 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

7 0.043844637 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

8 0.042100985 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

9 0.040688943 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

10 0.039871886 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

11 0.039844345 93 jmlr-2006-Universal Kernels

12 0.039129663 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

13 0.038307548 45 jmlr-2006-Learning Coordinate Covariances via Gradients

14 0.035911631 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

15 0.032634772 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

16 0.030158972 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

17 0.029228764 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

18 0.026644375 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

19 0.026541926 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

20 0.026501117 92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.173), (1, -0.072), (2, -0.051), (3, -0.143), (4, -0.082), (5, 0.051), (6, -0.142), (7, 0.018), (8, -0.114), (9, 0.433), (10, 0.121), (11, -0.008), (12, 0.047), (13, -0.278), (14, 0.064), (15, -0.07), (16, 0.077), (17, -0.148), (18, 0.197), (19, -0.262), (20, 0.115), (21, 0.026), (22, 0.182), (23, 0.079), (24, -0.021), (25, 0.131), (26, -0.122), (27, -0.069), (28, -0.022), (29, 0.019), (30, 0.002), (31, -0.01), (32, 0.066), (33, 0.02), (34, -0.046), (35, -0.002), (36, 0.074), (37, 0.012), (38, 0.004), (39, -0.029), (40, 0.035), (41, -0.015), (42, -0.025), (43, -0.04), (44, 0.023), (45, 0.045), (46, 0.038), (47, -0.033), (48, 0.002), (49, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91635239 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “non-Gaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.

2 0.83434027 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

3 0.26084557 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

Author: Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen

Abstract: In recent years, several methods have been proposed for the discovery of causal structure from non-experimental data. Such methods make various assumptions on the data generating process to facilitate its identification from purely observational data. Continuing this line of research, we show how to discover the complete causal structure of continuous-valued data, under the assumptions that (a) the data generating process is linear, (b) there are no unobserved confounders, and (c) disturbance variables have non-Gaussian distributions of non-zero variances. The solution relies on the use of the statistical method known as independent component analysis, and does not require any pre-specified time-ordering of the variables. We provide a complete Matlab package for performing this LiNGAM analysis (short for Linear Non-Gaussian Acyclic Model), and demonstrate the effectiveness of the method using artificially generated data and real-world data. Keywords: independent component analysis, non-Gaussianity, causal discovery, directed acyclic graph, non-experimental data

4 0.19737388 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

5 0.18037201 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

Author: Masashi Sugiyama

Abstract: The goal of active learning is to determine the locations of training input points so that the generalization error is minimized. We discuss the problem of active learning in linear regression scenarios. Traditional active learning methods using least-squares learning often assume that the model used for learning is correctly specified. In many practical situations, however, this assumption may not be fulfilled. Recently, active learning methods using “importance”-weighted least-squares learning have been proposed, which are shown to be robust against misspecification of models. In this paper, we propose a new active learning method also using the weighted least-squares learning, which we call ALICE (Active Learning using the Importance-weighted least-squares learning based on Conditional Expectation of the generalization error). An important difference from existing methods is that we predict the conditional expectation of the generalization error given training input points, while existing methods predict the full expectation of the generalization error. Due to this difference, the training input design can be fine-tuned depending on the realization of training input points. Theoretically, we prove that the proposed active learning criterion is a more accurate predictor of the single-trial generalization error than the existing criterion. Numerical studies with toy and benchmark data sets show that the proposed method compares favorably to existing methods. Keywords: Active Learning, Conditional Expectation of Generalization Error, Misspecification of Models, Importance-Weighted Least-Squares Learning, Covariate Shift.

6 0.17406195 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

7 0.17086339 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

8 0.17024352 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

9 0.16303383 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

10 0.15905377 45 jmlr-2006-Learning Coordinate Covariances via Gradients

11 0.15725546 93 jmlr-2006-Universal Kernels

12 0.1498614 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

13 0.14788994 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

14 0.13979444 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

15 0.1393526 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

16 0.13746971 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

17 0.129344 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

18 0.12692398 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

19 0.12404966 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

20 0.11976947 49 jmlr-2006-Learning Parts-Based Representations of Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.462), (35, 0.01), (36, 0.064), (45, 0.022), (50, 0.072), (63, 0.049), (68, 0.011), (76, 0.017), (78, 0.038), (81, 0.039), (84, 0.023), (90, 0.029), (91, 0.021), (96, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87086928 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

Author: Tonatiuh Peña Centeno, Neil D. Lawrence

Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.

same-paper 2 0.84960234 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “non-Gaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low-dimensional non-Gaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimation error tends to zero at a parametric rate, uniformly over the family. Numerical examples demonstrate the usefulness of our method.

3 0.44799867 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

Author: Kazuho Watanabe, Sumio Watanabe

Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity

4 0.40526852 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

5 0.38868111 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

6 0.38407025 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

7 0.3725276 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

8 0.37168244 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

9 0.36558235 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

10 0.36553895 49 jmlr-2006-Learning Parts-Based Representations of Data

11 0.35653946 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

12 0.35420915 47 jmlr-2006-Learning Image Components for Object Recognition

13 0.35367602 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

14 0.34607744 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

15 0.34478396 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)

16 0.34317204 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

17 0.34211498 16 jmlr-2006-Bounds for Linear Multi-Task Learning

18 0.34073639 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis

19 0.3406682 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

20 0.33788508 66 jmlr-2006-On Model Selection Consistency of Lasso