jmlr jmlr2006 jmlr2006-45 knowledge-graph by maker-knowledge-mining

45 jmlr-2006-Learning Coordinate Covariances via Gradients


Source: pdf

Author: Sayan Mukherjee, Ding-Xuan Zhou

Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. [sent-7, score-0.174]

2 Since statistical models based on shrinkage or regularization (Vapnik, 1998; West, 2003) have had success in the framework of both classification and regression, we formulate the problem of learning coordinate covariation and relevance in this framework. [sent-14, score-0.203]

3 We first describe the Tikhonov regularization method for classification and regression in order to define notation and basic concepts. [sent-15, score-0.105]

4 We also motivate the algorithm and give an intuition of how the gradient can be used to learn coordinate covariation and relevance. [sent-17, score-0.214]

5 If we assign a penalty functional $\Omega : \mathcal{H} \to \mathbb{R}_+$ on $\mathcal{H}$ and choose a loss function $V : \mathbb{R}^2 \to \mathbb{R}_+$, the Tikhonov regularization scheme in $\mathcal{H}$ associated with $(V, \Omega)$ is defined for a sample $z = \{(x_i, y_i)\}_{i=1}^m \in (X \times Y)^m$ and $\lambda > 0$ as $f_z^V = \arg\min_{f \in \mathcal{H}} \big\{ \frac{1}{m}\sum_{i=1}^m V(y_i, f(x_i)) + \lambda\,\Omega(f) \big\}$. (1) [sent-23, score-0.632]

6 That is, for any finite set of distinct points $\{x_1, \dots, x_m\} \subset X$, the matrix $\big(K(x_i, x_j)\big)_{i,j=1}^m$ is positive semidefinite. [sent-27, score-0.165]

7 If $\mathcal{H} = \mathcal{H}_K$ and $\Omega(f) = \|f\|_K^2$ in (1), the reproducing property (2) tells us that $f_z^V = \sum_{i=1}^m c_i K_{x_i}$ and the coefficients $\{c_i\}_{i=1}^m$ can be found by solving an optimization problem in $\mathbb{R}^m$. [sent-31, score-0.533]

8 Assume that $\rho$ is a probability distribution on $Z := X \times Y$ and that $z = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ is a random sample independently drawn according to $\rho$. [sent-32, score-0.124]

9 When the loss function is the least-square loss $V(y,t) = (y-t)^2$, the algorithm (1) is least-square regression and the objective is to learn the regression function $f_\rho(x) = \int_Y y\, d\rho(y|x)$, $x \in X$ (3), from the random sample $z$. [sent-33, score-0.11]
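
For concreteness, a minimal sketch of this regularized least-squares scheme with a Gaussian kernel is given below; by the reproducing property, the coefficients of $f_z^V = \sum_i c_i K_{x_i}$ solve an $m \times m$ linear system. The kernel bandwidth, regularization value, and toy data are placeholder choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # K(x, x') = exp(-|x - x'|^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def tikhonov_square_loss(X, y, lam=1e-2, sigma=1.0):
    """Minimize (1/m) sum_i (y_i - f(x_i))^2 + lam * ||f||_K^2 over the RKHS.
    With f = sum_i c_i K_{x_i}, the first-order condition reduces to the
    m x m system (K + m * lam * I) c = y."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + m * lam * np.eye(m), y)
    return c, K

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)
    c, K = tikhonov_square_loss(X, y)
    print("mean training residual:", np.abs(K @ c - y).mean())
```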

10 Results in learning theory (..., 2005; Smale and Zhou, 2006b) show, for this least-square regression algorithm, the convergence of $f_z^V$ to $f_\rho$ in the metric $\|\cdot\|_\rho$ under the assumption that $f_\rho$ lies in the closure of $\mathcal{H}_K$ and $\lambda = \lambda(m) \to 0$ as $m \to \infty$. [sent-40, score-0.481]

11 Work on classification (..., 2000; Schoelkopf and Smola, 2001; Vapnik, 1998; Wu and Zhou, 2005) has shown that $\mathrm{sgn}(f_z^V)$ converges to the Bayes rule $\mathrm{sgn}(f_\rho)$ with respect to the misclassification error $\mathcal{R}(\mathrm{sgn}(f)) = \mathrm{Prob}\{\mathrm{sgn}(f(x)) \neq y\}$. [sent-48, score-0.442]

12 2 Learning the Gradient: In this paper we are interested in learning the gradient of $f_\rho$ from the sampled function values. [sent-50, score-0.109]

13 The gradient of $f_\rho$ is the vector of functions (if the partial derivatives exist) $\nabla f_\rho = \big(\frac{\partial f_\rho}{\partial x^1}, \dots, \frac{\partial f_\rho}{\partial x^n}\big)^T$. [sent-56, score-0.112]

14 In the graph setting we are given a sparse sample on the manifold, which can be thought of as the vertices of a graph, and the distances between the points give the weight matrix of the graph. [sent-64, score-0.145]

15 For regression the algorithm to learn gradients will use least-square loss to minimize the error of the Taylor expansion at sample points. [sent-67, score-0.115]

16 The empirical error on sample points $x = x_i$, $u = x_j$ will be measured by the square loss $\big(g(u) - g(x) - \nabla g(x)\cdot(u-x)\big)^2$, which in empirical form is $\big(y_i - y_j + \vec f(x_i)\cdot(x_j - x_i)\big)^2$. [sent-74, score-0.274]

17 Definition 1 The least-square type learning scheme is defined for the sample $z \in Z^m$ as $\vec f_{z,\lambda} := \arg\min_{\vec f \in \mathcal{H}_K^n} \frac{1}{m^2}\sum_{i,j=1}^m w^{(s)}_{i,j}\big(y_i - y_j + \vec f(x_i)\cdot(x_j - x_i)\big)^2 + \lambda\,\|\vec f\|_K^2$, (7) where $\lambda, s$ are two positive constants called the regularization parameters. [sent-87, score-0.265]
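
To make Definition 1 concrete, the sketch below evaluates the objective (7) for a candidate coefficient matrix, with $\vec f = \sum_i c_i K_{x_i}$. The weights are taken here as $w^{(s)}_{i,j} = \exp(-|x_i - x_j|^2/(2s^2))$ without any normalizing constant that the paper's weight function (6) may carry, so treat the weight normalization, $s$, and $\lambda$ as placeholder assumptions.

```python
import numpy as np

def gradient_learning_objective(X, y, C, K, s=1.0, lam=1e-3):
    """Empirical objective of Definition 1 for f = sum_i C[i] * K_{x_i}.
    X: (m, n) inputs, y: (m,) labels, C: (m, n) coefficient vectors,
    K: (m, m) kernel matrix."""
    m, n = X.shape
    # Unnormalized Gaussian weights w_{ij} (normalization is an assumption here).
    W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * s ** 2))
    F = K @ C                                      # F[i] = f(x_i) in R^n
    diff = X[None, :, :] - X[:, None, :]           # diff[i, j] = x_j - x_i
    resid = y[:, None] - y[None, :] + (F[:, None, :] * diff).sum(-1)
    empirical = (W * resid ** 2).sum() / m ** 2
    penalty = lam * np.sum(K * (C @ C.T))          # ||f||_K^2 = sum_{ij} K_ij c_i . c_j
    return empirical + penalty
```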

18 Definition 2 The regularization scheme for classification is defined for the sample $z \in Z^m$ as $\vec f_{z,\lambda} = \arg\min_{\vec f \in \mathcal{H}_K^n} \frac{1}{m^2}\sum_{i,j=1}^m w^{(s)}_{i,j}\,\phi\big(y_i\big(y_j + \vec f(x_i)\cdot(x_i - x_j)\big)\big) + \lambda\,\|\vec f\|_K^2$. [sent-89, score-0.19]

19 The inner products reflect the nature of the measure, which is often supported on a low-dimensional manifold embedded in a high-dimensional space. [sent-98, score-0.142]

20 Remark 4 Estimation of coordinate covariation is not possible in standard regression models that allow for variable selection such as: recursive feature elimination (RFE) (Guyon et al. [sent-100, score-0.176]

21 In Section 4, we show for a Gaussian weight function (6) how a particular choice of the two regularization parameters leads to rates of convergence of our estimate of the gradient to the true gradient, fz,λ to ∇ fρ . [sent-107, score-0.143]

22 The utility of the algorithm is demonstrated in Section 5 in applications to simulated data as well as gene expression data. [sent-108, score-0.174]

23 The following theorem is an analog of the standard representer theorem (Schoelkopf and Smola, 2001); it states that the minimizer of the optimization problem defined by (7) has the form $\vec f_{z,\lambda} = \sum_{i=1}^m c_{i,z} K_{x_i}$ (9) with $c_z = (c_{1,z}, \dots, c_{m,z})$. [sent-117, score-0.15]

24 For $i = 1, \dots, m$, let $B_i = \sum_{j=1}^m w_{i,j}(x_j - x_i)(x_j - x_i)^T \in \mathbb{R}^{n\times n}$ and $Y_i = \sum_{j=1}^m w_{i,j}(y_j - y_i)(x_j - x_i) \in \mathbb{R}^n$. (10) [sent-124, score-0.317]

25 Then $\vec f_{z,\lambda} = \sum_{i=1}^m c_{i,z} K_{x_i}$ with $c_z = (c_{1,z}, \dots, c_{m,z})$. [sent-125, score-0.15]

26 Then $\vec f(x_i)\cdot(x_j - x_i) = \sum_{p=1}^m K(x_p, x_i)\, c_p\cdot(x_j - x_i) = \sum_{p=1}^m K(x_p, x_i)(x_j - x_i)^T c_p$ and $\|\vec f\|_K^2 = \sum_{i,j=1}^m K(x_i, x_j)\, c_i\cdot c_j$. [sent-135, score-0.375]

27 Define the empirical error $\mathcal{E}_z$ as $\mathcal{E}_z(\vec f) = \frac{1}{m^2}\sum_{i,j=1}^m w^{(s)}_{i,j}\big(y_i - y_j + \vec f(x_i)\cdot(x_j - x_i)\big)^2$. [sent-136, score-0.167]

28 For each $q \in \{1,\dots,m\}$ and $k \in \{1,\dots,n\}$, $\frac{\partial}{\partial c_q^k}\big(\mathcal{E}_z(\vec f) + \lambda\|\vec f\|_K^2\big) = 2\lambda\sum_{i=1}^m K(x_q, x_i)\, c_i^k + \frac{2}{m^2}\sum_{i,j=1}^m w_{i,j}\big(y_i - y_j + \sum_{p=1}^m K(x_p, x_i)(x_j - x_i)^T c_p\big)\, K(x_q, x_i)\,(x_j^k - x_i^k)$. [sent-147, score-0.392]

29 Then we know that $\vec f_{z,\lambda} = \sum_{i=1}^m \tilde c_{i,z} K_{x_i}$, where $\tilde c_z = \{\tilde c_{i,z}\}_{i=1}^m$ is the solution to the linear system $\lambda c_i + \frac{1}{m^2}\sum_{j=1}^m w_{i,j}\big(y_i - y_j + \sum_{p=1}^m K(x_p, x_i)(x_j - x_i)^T c_p\big)(x_j - x_i) = 0$, $i = 1, \dots, m$. [sent-152, score-0.467]
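
A direct, unreduced NumPy sketch of this linear system is given below: it assembles $B_i$ and $Y_i$ as in (10) and solves the $mn \times mn$ system $m^2\lambda\, c_i + B_i \sum_p K(x_i, x_p)\, c_p = Y_i$ implied by the first-order conditions above. This is only an illustration of the system's structure (a dense solve of cost roughly $(mn)^3$); the paper's memory- and time-efficient implementation uses the reduced system of Algorithm 1 below, and the function and variable names here are placeholders.

```python
import numpy as np

def solve_gradient_coefficients(X, y, K, W, lam):
    """Solve for c_z = (c_1, ..., c_m), c_i in R^n, from the first-order
    conditions: m^2*lam*c_i + B_i * sum_p K(x_i, x_p) c_p = Y_i, with
    B_i = sum_j W_ij (x_j - x_i)(x_j - x_i)^T and
    Y_i = sum_j W_ij (y_j - y_i)(x_j - x_i).
    X: (m, n), y: (m,), K: (m, m) kernel matrix, W: (m, m) weights."""
    m, n = X.shape
    diff = X[None, :, :] - X[:, None, :]              # diff[i, j] = x_j - x_i
    dy = y[None, :] - y[:, None]                      # dy[i, j]   = y_j - y_i
    B = np.einsum('ij,ijp,ijq->ipq', W, diff, diff)   # (m, n, n)
    Y = np.einsum('ij,ij,ijp->ip', W, dy, diff)       # (m, n)
    # Block (i, k) of the system matrix is m^2*lam*I_n*delta_ik + B_i*K(x_i, x_k).
    A = (m ** 2 * lam) * np.eye(m * n)
    A = A.reshape(m, n, m, n) + np.einsum('ipq,ik->ipkq', B, K)
    c = np.linalg.solve(A.reshape(m * n, m * n), Y.reshape(-1))
    return c.reshape(m, n)                            # row i is c_{i,z}
```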

30 Since $(x_j - x_i)^T c_p$ is a scalar, $\big[(x_j - x_i)^T c_p\big](x_j - x_i) = (x_j - x_i)(x_j - x_i)^T c_p$. [sent-156, score-0.375]

31 However, to find $(\vec f_{z,\lambda})_\ell$ by minimizing over $f \in \mathcal{H}_K$, one needs to replace $y_i - y_j$ by $y_i - y_j + \sum_{k\neq\ell}(\vec f_{z,\lambda})_k(x_i)\,(x_j^k - x_i^k)$. [sent-163, score-0.184]

32 Consider the matrix involving the data $x$ given by $M_x = [x_1 - x_m,\ x_2 - x_m,\ \dots,\ x_m - x_m] \in \mathbb{R}^{n\times m}$. [sent-177, score-0.297]

33 The theory of the singular value decomposition tells us that there exists an $n \times n$ orthogonal matrix $V = [V_1, V_2, \dots, V_n]$. [sent-183, score-0.096]

34 The $j$-th column of $M_x$ equals $x_j - x_m = \sum_{\ell=1}^d \sigma_\ell\, U_\ell^j\, V_\ell$, and $x_j - x_i = \sum_{\ell=1}^d \sigma_\ell\big(U_\ell^j - U_\ell^i\big)V_\ell$. (16) [sent-206, score-0.207]

35 Now we can reduce the matrix size by solving an approximation to the linear system derived from the singular values. [sent-208, score-0.096]

36 The error between $b_z = (b_{i,z})_{i=1}^m$ and $c_z$ can be bounded as $\|b_z - c_z\|_{\ell^2(\mathbb{R}^{mn})} \le \frac{4M\sigma_{S'+1}}{m^2\lambda}\big(\sqrt{d}\,\Delta_m + \cdots\big)$, where $\Delta_m := \max_{1\le i\le m}\sum_{j=1}^m w_{i,j}$. [sent-234, score-0.54]

37 Then $\vec f_{z,\lambda} = \sum_{i=1}^m\sum_{\ell=1}^d c_{i,z}^\ell\, V_\ell\, K_{x_i}$, where $c_z$ satisfies (19) with $\bar B_i$ and $\bar Y_i$ given by (17) and (18) with $S' = d$, respectively. [sent-239, score-0.15]

38 Algorithm 1: Approximation algorithm with reduced matrix size to approximate $f_{z,\lambda}$. Input: inputs $(x_i)_{i=1}^m$, labels $(y_i)_{i=1}^m$, kernel $K$, weights $(w_{i,j})$, eigenvalue threshold $\varepsilon > 0$. Return: coefficients $(b_{i,z})_{i=1}^m$. [sent-244, score-0.297]

39 Set $M_x = [x_1 - x_m,\ x_2 - x_m,\ \dots,\ x_{m-1} - x_m,\ x_m - x_m] \in \mathbb{R}^{n\times m}$; given $M_x$, compute the singular value decomposition (14) with orthogonal matrices $U, V$ and singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{S'} > \varepsilon$. [sent-247, score-0.522]

40 Set $t_j = \big(\sigma_1 U_1^j, \dots, \sigma_{S'} U_{S'}^j\big)^T \in \mathbb{R}^{S'}$ for $1 \le j \le m$; $\bar B_i = \sum_{j=1}^m w_{i,j}(t_j - t_i)(t_j - t_i)^T$ for $1 \le i \le m$; $\bar Y_i = \sum_{j=1}^m w_{i,j}(y_j - y_i)(t_j - t_i)$ for $1 \le i \le m$. [sent-250, score-0.356]

41 Form $\bar Y = (\bar Y_1, \dots, \bar Y_m)^T$ and the $mS' \times mS'$ block matrix $\widetilde K = \big[\bar B_i\, K(x_i, x_j)\big]_{i,j=1}^m$. [sent-262, score-0.252]

42 Solve $\big(m^2\lambda\, I_{mS'} + \widetilde K\big)\hat b_z = \bar Y$ for $\hat b_z = (\hat b_{1,z}, \dots, \hat b_{m,z})^T \in \mathbb{R}^{mS'}$ (21); set $b_{i,z} = \sum_{\ell=1}^{S'} \hat b_{i,z}^\ell V_\ell$, so that $f_{z,\lambda,S'} = \sum_{i=1}^m b_{i,z} K_{x_i}$ is an approximation of $f_{z,\lambda}$; return $(b_{i,z})_{i=1}^m$. [sent-265, score-0.12]
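
A minimal NumPy sketch of Algorithm 1 as reconstructed above is given next; the kernel matrix, weights, threshold $\varepsilon$, and variable names are assumptions for illustration, not the authors' reference implementation. The point of the reduction is that the linear system solved is $mS' \times mS'$ rather than $mn \times mn$, which is where the memory and time savings claimed in the abstract come from.

```python
import numpy as np

def algorithm1(X, y, K, W, lam, eps=1e-8):
    """Reduced-size approximation of f_{z,lambda} (Algorithm 1 sketch).
    Returns coefficient vectors b_{i,z} in R^n such that
    f_{z,lambda,S'} = sum_i b_{i,z} K_{x_i}.
    X: (m, n), y: (m,), K: (m, m) kernel matrix, W: (m, m) weights."""
    m, n = X.shape
    Mx = (X - X[-1]).T                           # columns x_j - x_m, shape (n, m)
    V, sig, Ut = np.linalg.svd(Mx, full_matrices=True)
    S = int(np.sum(sig > eps))                   # keep singular values above threshold
    U = Ut.T                                     # columns are right singular vectors U_l
    T = U[:, :S] * sig[:S]                       # row j is t_j = (sig_l * U_l^j)_{l<=S}
    dT = T[None, :, :] - T[:, None, :]           # dT[i, j] = t_j - t_i
    dy = y[None, :] - y[:, None]
    Bbar = np.einsum('ij,ijp,ijq->ipq', W, dT, dT)     # (m, S, S)
    Ybar = np.einsum('ij,ij,ijp->ip', W, dy, dT)       # (m, S)
    A = (m ** 2 * lam) * np.eye(m * S)
    A = A.reshape(m, S, m, S) + np.einsum('ipq,ik->ipkq', Bbar, K)
    bhat = np.linalg.solve(A.reshape(m * S, m * S), Ybar.reshape(-1)).reshape(m, S)
    b = bhat @ V[:, :S].T                        # b_{i,z} = sum_l bhat_i^l V_l
    return b                                      # shape (m, n)
```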

43 The idea behind the proof for the convergence of the gradient consists of simultaneously controlling a sample or estimation error term and a regularization or approximation error term. [sent-278, score-0.175]

44 The second term, the regularization error, does not depend on the sample and we use functional analysis to bound this quantity. [sent-280, score-0.098]

45 Consider the following integral operator on the space $(L^2_{\rho_X})^n$ with norm $\|\vec f\|_\rho = \big(\sum_{\ell=1}^n \|f^\ell\|_\rho^2\big)^{1/2}$. Proposition 12: Let $L_{K,s} : (L^2_{\rho_X})^n \to (L^2_{\rho_X})^n$ be the integral operator defined by $L_{K,s}\vec f = \int_X\int_X w(x-u)\,(u-x)\,K_x\,(u-x)^T \vec f(x)\, d\rho_X(x)\, d\rho_X(u)$; it is a positive operator on $(L^2_{\rho_X})^n$. [sent-304, score-0.219]

46 That is, $F(z) = \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m w_{i,j}\,(x_j - x_i)\,K_{x_i}\,(y_j - y_i) - \frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m w_{i,j}\,(x_j - x_i)\,K_{x_i}\,(x_j - x_i)^T \vec f_\lambda(x_i)$. [sent-348, score-0.317]

47 By independence, the expected value of $F(z)$ equals $\frac{1}{m^2}\sum_{i=1}^m \mathbb{E}_{z_i}\sum_{j\neq i}\mathbb{E}_{z_j}\big[w_{i,j}(x_j - x_i)K_{x_i}\big((y_j - y_i) - (x_j - x_i)^T\vec f_\lambda(x_i)\big)\big] = \frac{m-1}{m^2}\sum_{i=1}^m \mathbb{E}_{z_i}\int_X w(x_i - u)(u - x_i)K_{x_i}\big((f_\rho(u) - y_i) - (u - x_i)^T\vec f_\lambda(x_i)\big)\,d\rho_X(u)$. [sent-349, score-0.484]

48 We know that $F(z) - \mathbb{E}_{z_i}(F(z))$ equals $\frac{1}{m^2}\sum_{j\neq i} w(x_i - x_j)(x_j - x_i)\big[K_{x_i}\big(y_j - y_i - (x_j - x_i)^T\vec f_\lambda(x_i)\big) + K_{x_j}\big(y_j - y_i - (x_j - x_i)^T\vec f_\lambda(x_j)\big)\big] - \frac{1}{m^2}\sum_{j\neq i}\int_X w(x - x_j)(x_j - x)\big[K_{x}\big(y_j - f_\rho(x) - (x_j - x)^T\vec f_\lambda(x)\big) + K_{x_j}\big(y_j - f_\rho(x) - (x_j - x)^T\vec f_\lambda(x_j)\big)\big]\,d\rho_X(x)$. [sent-360, score-0.409]

49 The operator $L_K$ is also a positive operator on $(L^2_{\rho_X})^n$, satisfying $\big\|L_{K,s} - V_\rho\, n\,(2\pi)^{n/2} L_K\big\|_{(L^2_{\rho_X})^n\to(L^2_{\rho_X})^n} \le s^{\tau}\big(\kappa^2 c_\rho M_{2+\tau} + J_4 + c_\rho J_2\big)$ for all $0 < s \le 1$. [sent-385, score-0.108]

50 Hence $\int_X\Big(\int_{\mathbb{R}^n} \frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\frac{u-x}{s}\Big(\frac{u-x}{s}\Big)^T du\Big)\, p(x)\, K_x\, \vec f(x)\, d\rho_X(x)$. [sent-391, score-0.141]

51 It follows that $\Big\|\int_X\Big(\int_{\mathbb{R}^n\setminus X} \frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\frac{u-x}{s}\Big(\frac{u-x}{s}\Big)^T du\Big)\, p(x)\, K_x\, \vec f(x)\, d\rho_X(x)\Big\|_K \le \int_X\Big(\int_{\mathbb{R}^n\setminus X} \frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\Big|\frac{u-x}{s}\Big|^2 du\Big)\,\kappa\,|\vec f(x)|\,p(x)\, d\rho_X(x)$. [sent-392, score-0.138]

52 Hence $\int_{\mathbb{R}^n\setminus X} \frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\Big|\frac{u-x}{s}\Big|^2 du \le s\int_{\mathbb{R}^n\setminus X} \frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\Big|\frac{u-x}{s}\Big|^4 du \le s\,J_4$. [sent-395, score-0.21]

53 It follows from (23) that $\int_{X\setminus X_s}\Big(\int_{\mathbb{R}^n\setminus X}\frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\Big|\frac{u-x}{s}\Big|^2 du\Big)\,\kappa\,|\vec f(x)|\,p(x)\,d\rho_X(x) \le s\,\kappa\, c_\rho M_4 \int_{X\setminus X_s}|\vec f(x)|\,d\rho_X(x)$, which is bounded by $s\,\kappa\, c_\rho M_4\,\|\vec f\|_\rho$. [sent-396, score-0.213]

54 For the subset $X_s$, we use the Cauchy-Schwarz inequality and (23) to obtain $\int_{X_s}\Big(\int_{\mathbb{R}^n\setminus X}\frac{1}{s^n}\, e^{-\frac{|u-x|^2}{2s^2}}\,\Big|\frac{u-x}{s}\Big|^2 du\Big)\,\kappa\,|\vec f(x)|\,p(x)\,d\rho_X(x)$, which is bounded by $\kappa\, c_\rho M_2\,\rho_X(X_s)\,\|\vec f\|_\rho$. [sent-397, score-0.141]

55 So we know (see Cucker and Smale, 2001) that the operator $L_K$ can be used to define the reproducing kernel Hilbert space: let $L_K^r$ be the $r$-th power of the positive operator $L_K$ on $(L^2_{\rho_X})^n$, with $L_K^{1/2}$ having range in $\mathcal{H}_K^n$. [sent-406, score-0.166]

56 Our idea is to rank the importance of variables according to the norm of their partial derivatives $\frac{\partial f_\rho}{\partial x^\ell}$, since a small norm implies small changes of the function with respect to this variable. [sent-449, score-0.149]

57 Definition 22 The empirical gradient matrix (EGM), $F_z$, is the $n \times m$ matrix whose columns are $\vec f_{z,\lambda}(x_j)$ with $j = 1, \dots, m$. [sent-457, score-0.143]

58 The empirical covariance matrix (ECM), $\Xi_z$, is the $n \times n$ matrix of inner products of the directional derivatives of two coordinates, $\mathrm{Cov}(\vec f_{z,\lambda}) := \big[\langle f^p_{z,\lambda}, f^q_{z,\lambda}\rangle_K\big]_{p,q=1}^n = \sum_{i,j=1}^m c_{i,z}\, c_{j,z}^T\, K(x_i, x_j)$. [sent-461, score-0.102]
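
Given the coefficient vectors $c_{i,z}$ (rows of C below) and the kernel matrix, the EGM, the ECM, and the per-coordinate RKHS norms used to rank variables follow directly from these definitions; a minimal sketch:

```python
import numpy as np

def egm_ecm_ranking(C, K):
    """C: (m, n) array whose i-th row is c_{i,z}; K: (m, m) kernel matrix.
    Returns the EGM (n x m), the ECM (n x n), and the variables ranked by the
    RKHS norm of the corresponding partial-derivative estimate."""
    egm = (K @ C).T                    # column j is f_{z,lambda}(x_j) in R^n
    ecm = C.T @ K @ C                  # Cov[p, q] = sum_{i,j} c_i^p K_ij c_j^q
    norms = np.sqrt(np.diag(ecm))      # ||f^l||_K for each coordinate l
    ranking = np.argsort(-norms)       # most relevant variables first
    return egm, ecm, norms, ranking
```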

59 For the three groups of samples (the last being $\{y_i\}_{i=21}^{30}$), $y_i = x_i\cdot w_1 + \mathcal{N}(0,\sigma_y)$, $y_i = x_i\cdot w_2 + \mathcal{N}(0,\sigma_y)$, and $y_i = x_i\cdot w_3 + \mathcal{N}(0,\sigma_y)$, respectively. [sent-516, score-0.535]
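
A toy generator in the spirit of this simulation is sketched below; the dimensionality, the sparsity pattern of $w_1, w_2, w_3$, the group sizes, and the noise level are placeholder assumptions rather than the paper's exact construction.

```python
import numpy as np

def simulate_linear_groups(m_per_group=10, n=80, sigma_y=0.5, seed=0):
    """Three groups of samples, each following its own linear model
    y = x . w_g + N(0, sigma_y); w_1, w_2, w_3 are sparse placeholder vectors."""
    rng = np.random.default_rng(seed)
    w = np.zeros((3, n))
    w[0, :10], w[1, 10:20], w[2, 20:30] = 1.0, 1.0, 1.0   # assumed sparsity pattern
    X, y = [], []
    for g in range(3):
        Xg = rng.normal(size=(m_per_group, n))
        X.append(Xg)
        y.append(Xg @ w[g] + sigma_y * rng.normal(size=m_per_group))
    return np.vstack(X), np.concatenate(y)
```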

60 Perhaps more interesting is that we can evaluate the gradient at each sample $\{x_i\}_{i=1}^m$. [sent-529, score-0.109]

61 2 Gene Expression Data: In computational biology, specifically in the subfield of gene expression analysis, variable selection and estimation of covariation are of fundamental importance. [sent-549, score-0.264]

62 The expression level of a gene is proportional to the number of copies of mRNA transcribed by that gene. [sent-551, score-0.174]

63 This readout of gene expression is considered a proxy of the state of the cell. [sent-552, score-0.174]

64 The goals of gene expression analysis include using the expression level of the genes to predict classes, for example tissue morphology or treatment outcome, or real-valued quantities such as toxicity or sensitivity. [sent-553, score-0.35]

65 Fundamental to understanding the biology giving rise to the outcome or toxicity is determining which genes are most relevant for the prediction. [sent-554, score-0.117]

66 , 2000) and estimating the genes most relevant to this discrimination. [sent-560, score-0.117]

67 Expression levels of n = 7,129 genes and expressed sequence tags (ESTs) were measured via an oligonucleotide microarray for each sample. [sent-562, score-0.117]

68 Table 1: Number of errors in classification for various values of S using the genes corresponding to the dimensions with the largest norms. genes (S): 5, 55, 105, 155, 205, 255, 305, 355, 405, 455; test errors: 1, 3, 2, 1, 1, 1, 1, 1, 1, 1. [sent-573, score-0.267]

69 Plotting the decay of the elements of the normalized hyperplane $w^0$ (the solution of a linear SVM) results in a plot much more like that of the Fisher score than the gradient statistic. [sent-582, score-0.119]

70 Figure 2: The decay of $s^\rho_{(\ell)}$ (blue) and $s^F_{(\ell)}$ (red) over: a) all the genes/dimensions, b) the top 2000 genes/dimensions, c) the top 1000 genes/dimensions, d) the top 500 genes/dimensions. [sent-620, score-0.15]

71 The covariation in the coordinates is plotted for the top 50 dimensions ordered in the same way as the EGM (see figure (3b)). [sent-624, score-0.195]

72 Figure 3: a) The EGM for the top 50 dimensions ordered by clustering the EGM, and b) the ECM for the top 50 dimensions ordered in the same way (axes: samples and dimensions). [sent-642, score-0.138]

73 We examine a gene expression data set with 15 male and 17 female samples from lymphoblastoid cell lines (unpublished). [sent-643, score-0.208]

74 Expression levels of n = 22,283 probes corresponding to genes and expressed sequence tags (ESTs) were measured via an oligonucleotide microarray for each sample. [sent-644, score-0.223]

75 In Figure 4(a-d) we plot the relative magnitude sequence $s^\rho_{(\ell)}$ for the genes as compared to the relative Fisher score $s^F_{(\ell)}$, and we again see the quicker decay for the gradient norms. [sent-645, score-0.236]
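
For reference, a per-gene Fisher (signal-to-noise) score of the kind used as the baseline ranking can be computed from class means and variances; the exact normalization behind $s^F_{(\ell)}$ is not reproduced here, so the common $(\mu_1-\mu_2)^2/(\sigma_1^2+\sigma_2^2)$ form below, rescaled to relative magnitudes, is an assumption.

```python
import numpy as np

def fisher_scores(X, labels):
    """Per-gene Fisher score for a two-class problem.
    X: (m, n) expression matrix (samples x genes); labels: (m,) array in {0, 1}."""
    X0, X1 = X[labels == 0], X[labels == 1]
    num = (X0.mean(0) - X1.mean(0)) ** 2
    den = X0.var(0) + X1.var(0) + 1e-12          # guard against zero variance
    score = num / den
    return score / score.sum()                    # relative magnitudes, as plotted
```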

76 Figure 4: The decay of $s^\rho_{(\ell)}$ (blue) and $s^F_{(\ell)}$ (red) over: a) all the genes/dimensions, b) the top 200 genes/dimensions, c) the top 100 genes/dimensions, d) the top 50 genes/dimensions. [sent-677, score-0.15]

77 From a priori biological knowledge we would predict that the most discriminative genes for gender would be those on the Y chromosome as well as genes on the X chromosome known to escape X inactivation. [sent-678, score-0.314]

78 The reason that all the genes on the X chromosome would not be expected to be discriminative is dosage compensation in expression, which compensates for the fact that women have two X chromosomes and men have one. [sent-679, score-0.216]

79 However, there are genes known to escape X inactivation and these should be differentially expressed. [sent-681, score-0.202]

80 We obtained a list of such genes by combining lists reported in two sources (Carrel et al. [sent-682, score-0.117]

81 There were 35 probes in the X inactivation set and 66 probes corresponding to genes on the Y chromosome. [sent-685, score-0.369]

82 An important caveat is that while these 101 probes would be expected to be differentially expressed they would not all be expected to rank at the top of a list of genes that are differentially expressed. [sent-686, score-0.349]

83 This is due to the fact that in the cell line or tissue of question there may be other genes that are more strongly differentially expressed due to local conditions. [sent-687, score-0.162]

84 , 2000) which reduced the number of probes to about 12,000. [sent-690, score-0.106]

85 This data set was then standardized (the expression values for each gene were recentered and scaled to have zero mean and standard deviation one). [sent-691, score-0.174]
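
The standardization described here is a per-gene z-score; a minimal sketch, assuming genes are the columns of the expression matrix:

```python
import numpy as np

def standardize_genes(X):
    """Recenter and rescale each gene (column) to zero mean, unit standard deviation."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0        # leave constant genes unchanged rather than dividing by zero
    return (X - mu) / sd
```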

86 We found that 16 of the 101 probes appeared in the top ranked 500 probes. [sent-693, score-0.142]

87 Ranking by the Fisher score we found 22 of the top 101 probes in the top ranked 500 probes. [sent-694, score-0.178]
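
The overlap counts in this paragraph can be summarized by a hypergeometric tail probability; as the next sentence notes, the independence assumption behind it is questionable here, so the sketch below (using the totals mentioned in the text: roughly 12,000 probes after preprocessing, a 101-probe reference set, and the top 500 ranked probes) is only an indicative enrichment calculation.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(overlap, top_k=500, reference_set=101, total_probes=12000):
    """P(X >= overlap) for X ~ Hypergeom(total_probes, reference_set, top_k)."""
    return hypergeom.sf(overlap - 1, total_probes, reference_set, top_k)

print(enrichment_pvalue(16))   # gradient-norm ranking
print(enrichment_pvalue(22))   # Fisher-score ranking
```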

88 However, the assumptions of independence in the model which gives rise to the hypergeometric distribution are completely inappropriate in this problem (the probes tend to be strongly correlated). [sent-697, score-0.106]

89 However, in most data sets, and when variable selection is meaningful, the data are concentrated on a much lower-dimensional manifold embedded in the high-dimensional space. [sent-717, score-0.142]

90 Semi-supervised setting: Intrinsic properties of the manifold X can be further studied by unlabelled data. [sent-720, score-0.12]

91 Approximate $c_z$ by $\bar c_z$, which is defined below. [sent-745]

92 $\bar c_z = \big(m^2\lambda\, I_{mn} + \mathrm{diag}\{B_1, \cdots, B_m\}\,[K(x_i, x_j) I_n]_{i,j=1}^m\big)^{-1}(\bar Y_1, \dots, \bar Y_m)^T$, where $Y_i - \bar Y_i = \sum_{j=1}^m w_{i,j}(y_j - y_i)\sum_{\ell=S'+1}^{d}\sigma_\ell\big(U_\ell^j - U_\ell^i\big)V_\ell$. [sent-748]

93 It follows that $\|\bar c_z - c_z\|_{\ell^2(\mathbb{R}^{mn})} \le \frac{1}{m^2\lambda}\big(\sum_{i=1}^m \|\bar Y_i - Y_i\|_{\ell^2(\mathbb{R}^n)}^2\big)^{1/2} \le \frac{4M\sigma_{S'+1}\sqrt{d - S'}}{m^2\lambda}\,\Delta_m$. [sent-752]

94 Approximate $\bar c_z$ by $\bar b_z$, which is defined by $\bar b_z = \big(m^2\lambda\, I_{mn} + \mathrm{diag}\{\bar B_1, \cdots, \bar B_m\}\,[K(x_i, x_j) I_n]_{i,j=1}^m\big)^{-1}(\bar Y_1, \dots, \bar Y_m)^T$, where $\bar B_i = \sum_{j=1}^m w_{i,j}\sum_{\ell=1}^{S'}\sum_{p=1}^{S'}\sigma_\ell\sigma_p\big(U_\ell^j - U_\ell^i\big)\big(U_p^j - U_p^i\big)V_\ell V_p^T$. [sent-754]

95 The $\ell^2(\mathbb{R}^n)$ norm of the second term in the expression of $(B_i - \bar B_i)\,b$ can be bounded in the same way, and we thus have $\big\|(B_i - \bar B_i)\,b\big\|_{\ell^2(\mathbb{R}^n)} \le 4\sigma_{S'+1}\sigma_1\, |b|_{\ell^2(\mathbb{R}^n)} \sum_{j=1}^m w_{i,j}$. [sent-761]

96 Then we have the following estimate for the operator norm of the difference of the diagonal operators: $\big\|\mathrm{diag}\{B_1, \cdots, B_m\} - \mathrm{diag}\{\bar B_1, \cdots, \bar B_m\}\big\| \le 4\sigma_{S'+1}\sigma_1 \max_{1\le i\le m}\sum_{j=1}^m w_{i,j}$. [sent-762]

97 Applying this to our setting, we have $\|\bar b_z - \bar c_z\|_{\ell^2(\mathbb{R}^{mn})} \le \frac{4\kappa^2 m\,\sigma_{S'+1}\sigma_1}{(m^2\lambda)^2}\,\max_{1\le i\le m}\sum_{j=1}^m w_{i,j}\;\big\|(\bar Y_1, \dots, \bar Y_m)^T\big\|_{\ell^2(\mathbb{R}^{mn})}$. [sent-766]

98 Therefore, $b_{i,z} = b^*_{i,z}$ for each $i$, and $\bar b_z = b_z$ gives the desired coefficients for the function $f_{z,\lambda,S'}$. [sent-785]

99 Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. [sent-877, score-0.174]

100 Bayesian estimation of gene expression index and Bayesian kernel models. [sent-894, score-0.174]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fz', 0.442), ('kxi', 0.238), ('lk', 0.233), ('ovariances', 0.198), ('radients', 0.198), ('wi', 0.183), ('oordinate', 0.168), ('ukherjee', 0.168), ('cz', 0.15), ('xm', 0.132), ('ez', 0.13), ('kx', 0.128), ('hou', 0.128), ('rn', 0.128), ('bz', 0.12), ('sx', 0.12), ('genes', 0.117), ('gene', 0.115), ('ezi', 0.112), ('rmn', 0.112), ('probes', 0.106), ('upj', 0.106), ('egm', 0.101), ('yi', 0.092), ('bm', 0.09), ('covariation', 0.09), ('diam', 0.09), ('mx', 0.083), ('manifold', 0.08), ('aml', 0.08), ('pinelis', 0.079), ('gradient', 0.077), ('smale', 0.077), ('xi', 0.075), ('zeros', 0.073), ('sn', 0.072), ('du', 0.069), ('xs', 0.069), ('dx', 0.066), ('cmat', 0.066), ('regularization', 0.066), ('diag', 0.065), ('singular', 0.063), ('bi', 0.062), ('earning', 0.059), ('expression', 0.059), ('reproducing', 0.058), ('norm', 0.057), ('sf', 0.056), ('sigma', 0.056), ('ecm', 0.055), ('operator', 0.054), ('bmat', 0.053), ('ktilde', 0.053), ('subramanian', 0.053), ('proposition', 0.051), ('transpose', 0.05), ('sgn', 0.048), ('zhou', 0.048), ('coordinate', 0.047), ('slonim', 0.046), ('yv', 0.045), ('differentially', 0.045), ('gradients', 0.044), ('tikhonov', 0.043), ('mukherjee', 0.043), ('decay', 0.042), ('vec', 0.04), ('unlabelled', 0.04), ('chromosome', 0.04), ('imn', 0.04), ('inactivation', 0.04), ('nrm', 0.04), ('ytilde', 0.04), ('rkhs', 0.039), ('regression', 0.039), ('dp', 0.037), ('banach', 0.037), ('vp', 0.037), ('coordinates', 0.036), ('top', 0.036), ('derivatives', 0.035), ('golub', 0.035), ('samples', 0.034), ('duke', 0.034), ('sayan', 0.034), ('schoelkopf', 0.034), ('hong', 0.034), ('kong', 0.034), ('dimensions', 0.033), ('ci', 0.033), ('matrix', 0.033), ('sample', 0.032), ('rs', 0.032), ('genome', 0.032), ('dimensional', 0.031), ('hilbert', 0.031), ('dence', 0.03), ('xq', 0.03), ('tamayo', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 45 jmlr-2006-Learning Coordinate Covariances via Gradients

Author: Sayan Mukherjee, Ding-Xuan Zhou

Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds

2 0.49557623 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

3 0.11153743 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

4 0.079906218 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

5 0.068195499 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır

Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant

6 0.063707925 93 jmlr-2006-Universal Kernels

7 0.059449486 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

8 0.059376981 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

9 0.059368908 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

10 0.054241512 75 jmlr-2006-Policy Gradient in Continuous Time

11 0.05302529 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

12 0.047434051 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

13 0.04725901 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

14 0.046401989 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals

15 0.044611137 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

16 0.044318441 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

17 0.040806994 31 jmlr-2006-Exact 1-Norm Support Vector Machines Via Unconstrained Convex Differentiable Minimization     (Special Topic on Machine Learning and Optimization)

18 0.040005479 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

19 0.0389378 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

20 0.038844042 81 jmlr-2006-Some Discriminant-Based PAC Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.279), (1, -0.126), (2, 0.013), (3, -0.497), (4, 0.216), (5, -0.456), (6, 0.174), (7, -0.049), (8, 0.226), (9, 0.024), (10, -0.066), (11, -0.082), (12, 0.069), (13, 0.073), (14, 0.002), (15, 0.047), (16, 0.027), (17, 0.02), (18, 0.078), (19, -0.021), (20, -0.048), (21, 0.025), (22, 0.04), (23, 0.019), (24, -0.063), (25, 0.059), (26, -0.012), (27, 0.017), (28, 0.047), (29, 0.011), (30, 0.001), (31, 0.003), (32, 0.012), (33, 0.029), (34, -0.019), (35, 0.007), (36, -0.01), (37, 0.032), (38, -0.0), (39, 0.011), (40, -0.005), (41, 0.03), (42, 0.035), (43, 0.028), (44, 0.022), (45, 0.001), (46, -0.032), (47, -0.029), (48, -0.004), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92944133 45 jmlr-2006-Learning Coordinate Covariances via Gradients

Author: Sayan Mukherjee, Ding-Xuan Zhou

Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds

2 0.924245 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

3 0.2809464 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

4 0.26476559 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

5 0.25333714 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır

Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant

6 0.23538204 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

7 0.21023534 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

8 0.20805119 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

9 0.192479 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

10 0.17476101 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

11 0.17403263 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals

12 0.1727414 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests

13 0.16949354 93 jmlr-2006-Universal Kernels

14 0.16838302 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

15 0.16821432 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

16 0.16549225 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

17 0.16245073 75 jmlr-2006-Policy Gradient in Continuous Time

18 0.15742686 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

19 0.15591398 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

20 0.15432121 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.027), (35, 0.013), (36, 0.047), (45, 0.014), (46, 0.459), (50, 0.052), (63, 0.034), (75, 0.013), (76, 0.021), (77, 0.01), (78, 0.023), (81, 0.038), (84, 0.037), (90, 0.061), (91, 0.024), (96, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76893276 45 jmlr-2006-Learning Coordinate Covariances via Gradients

Author: Sayan Mukherjee, Ding-Xuan Zhou

Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds

2 0.6484524 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memoryefficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.1 Keywords: Markov networks, large-margin methods, structured prediction, extragradient, Bregman projections

3 0.5183441 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

4 0.30343479 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals

Author: Dana Pe'er, Amos Tanay, Aviv Regev

Abstract: In recent years, there has been a growing interest in applying Bayesian networks and their extensions to reconstruct regulatory networks from gene expression data. Since the gene expression domain involves a large number of variables and a limited number of samples, it poses both computational and statistical challenges to Bayesian network learning algorithms. Here we define a constrained family of Bayesian network structures suitable for this domain and devise an efficient search algorithm that utilizes these structural constraints to find high scoring networks from data. Interestingly, under reasonable assumptions on the underlying probability distribution, we can provide performance guarantees on our algorithm. Evaluation on real data from yeast and mouse, demonstrates that our method cannot only reconstruct a high quality model of the yeast regulatory network, but is also the first method to scale to the complexity of mammalian networks and successfully reconstructs a reasonable model over thousands of variables. Keywords: Bayesian networks, structure learning, gene networks, gene expression, approximation algorithms

5 0.25435081 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

6 0.25432253 70 jmlr-2006-Online Passive-Aggressive Algorithms

7 0.25312617 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

8 0.25055072 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

9 0.24428076 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

10 0.24105483 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

11 0.2398584 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

12 0.23836577 88 jmlr-2006-Streamwise Feature Selection

13 0.23611388 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

14 0.2358928 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

15 0.2336387 78 jmlr-2006-Rearrangement Clustering: Pitfalls, Remedies, and Applications

16 0.23355544 66 jmlr-2006-On Model Selection Consistency of Lasso

17 0.23148321 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

18 0.23117119 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

19 0.23103522 44 jmlr-2006-Large Scale Transductive SVMs

20 0.23050328 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation