jmlr jmlr2006 jmlr2006-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
Reference: text
sentIndex sentText sentNum sentScore
1 The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. [sent-8, score-0.13]
2 In a number of applications, such as the analysis of gene expression data, classical questions from statistical modeling arise: which variables are relevant, and how do these variables interact? [sent-13, score-0.13]
3 An example of this is that genes co-regulated by a biological pathway may be modeled as features that covary. [sent-15, score-0.097]
4 Statistical models based on shrinkage or regularization were applied to the problem of learning coordinate covariation and relevance for regression problems in Mukherjee and Zhou (2006). [sent-22, score-0.147]
5 2 Learning the Classification Function and Gradient In this paper we are interested in simultaneously learning fφ and its gradient from the sample values z = {(xi, yi)}_{i=1}^m. [sent-40, score-0.142]
6 The gradient of fφ is the vector of functions (if the partial derivatives exist) ∇fφ = (∂fφ/∂x^1, . . . , ∂fφ/∂x^n)^T. [sent-45, score-0.123]
7 This is one of the reasons we use a logistic model rather than learning the gradient of a {−1, 1} classification function. [sent-54, score-0.156]
8 At first glance, the problem of estimating the gradient is equivalent to that of computing classical numerical derivatives in inverse problems. [sent-57, score-0.123]
9 Other weight functions can be used as long as the bandwidth of the weight function decreases with the number of samples. [sent-77, score-0.236]
10 We may expect that minimizing this error functional using functions in a hypothesis space H_K^{n+1} leads to gz and fz such that gz(u) + fz(x) · (x − u) ≈ fφ(x) ≈ fφ(u) + ∇fφ(x) · (x − u), for x ≈ u. [sent-80, score-1.836]
11 This in general leads to gz ≈ fφ and fz ≈ ∇ fφ . [sent-81, score-0.918]
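For concreteness, the empirical error functional that this first-order Taylor argument suggests (its exact weighting and argument order are inferred here from the matrix form in Equation (5) below, so treat it as a reconstruction rather than a quotation) is

```latex
\mathcal{E}_z(g,\vec f)\;=\;\frac{1}{m^{2}}\sum_{i,j=1}^{m} w^{(s)}_{i,j}\,
\phi\!\Big(y_i\big(g(x_j)+\vec f(x_i)\cdot(x_i-x_j)\big)\Big),
```

so that, whenever xi ≈ xj, the argument of φ behaves like yi fφ(xi).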
12 Definition 2 Given a sample z ∈ Z^m we can estimate the classification function, gz, and its gradient, fz, as follows: (gz, fz) = arg min_{(g,f) ∈ H_K^{n+1}} { Ez(g, f) + (λ/2)(||g||_K^2 + ||f||_K^2) }   (3), where s, λ > 0 are the regularization parameters and, for f = (f^1, . . . [sent-89, score-1.397]
13 In Section 3, we prove the convergence of our estimate of the gradient, fz , to the true gradient of the classification function, ∇ fφ . [sent-107, score-0.547]
14 The utility of the algorithm is illustrated in Section 4 on simulated data as well as gene expression data. [sent-108, score-0.13]
15 The following theorem is an analog of the standard representer theorem (Wahba, 1990; Schoelkopf and Smola, 2001) that states the minimizer of the optimization problem defined by Equation (3) has a finite dimensional representation. [sent-115, score-0.108]
16 Proposition 4 Given a sample z ∈ Z^m the solution of (3) exists and takes the form gz(x) = ∑_{i=1}^m αi,z K(x, xi) and fz(x) = ∑_{i=1}^m ci,z K(x, xi)   (4), with cz = (c1,z, . . . [sent-116, score-1.08]
17 Proof The existence follows from the convexity of φ and of the norm functionals ||g||_K^2 and ||f||_K^2; suppose (gz, fz) is a minimizer. [sent-123, score-0.453]
18 We can write the functions gz ∈ H_K and fz ∈ H_K^n as follows. [sent-124, score-0.918]
19 K Suppose gz = g + g⊥ and fz = f + f⊥ where g and each element of f is in the span of {Kx1 , , . [sent-125, score-0.918]
20 But ||gz||_K^2 = ||g + g⊥||_K^2 > ||g||_K^2 and ||fz||_K^2 = ||f + f⊥||_K^2 > ||f||_K^2 unless g⊥ = 0 and f⊥ = 0. [sent-131, score-0.918]
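As a small illustration of the representer form in Equation (4), the sketch below evaluates gz and fz at a new point from the coefficients αz ∈ R^m and cz ∈ R^{n×m}; the Gaussian kernel and all function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # K(u, v) = exp(-||u - v||^2 / (2 sigma^2)); any Mercer kernel could be substituted.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def evaluate_estimates(x_new, X, alpha_z, c_z, sigma=1.0):
    """Evaluate gz(x_new) and fz(x_new) from the kernel expansions of Equation (4).

    X       : (m, n) training inputs x_1, ..., x_m
    alpha_z : (m,)   coefficients of the classification function estimate
    c_z     : (n, m) coefficients of the gradient estimate (column i is c_{i,z})
    """
    k = np.array([gaussian_kernel(x_new, xi, sigma) for xi in X])  # (m,) kernel evaluations
    g_val = alpha_z @ k   # gz(x_new) = sum_i alpha_{i,z} K(x_new, x_i)
    f_val = c_z @ k       # fz(x_new) = sum_i c_{i,z} K(x_new, x_i), an n-vector
    return g_val, f_val
```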
21 Define the matrix C = (c1, . . . , cm) ∈ R^{n×m} (when optimized these will be the coefficients cz in the gradient expansion) and the vector α ∈ R^m (when optimized the vector will be αz). [sent-138, score-0.184]
22 The optimization function can be written in matrix form as Φ(C, α) = (1/m^2) ∑_{i,j=1}^m wi,j φ(yi(kj α + ki C^T(xi − xj))) + (λ/2)(α^T K α + Tr(C K C^T))   (5), where Tr(M) is the trace of a matrix M. [sent-143, score-0.106]
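A sketch of the objective in Equation (5), assuming the logistic loss φ(t) = log(1 + e^{-t}) (consistent with the paper's logistic model) and taking kj, ki to be rows of the kernel matrix; the function name and layout are illustrative, not the authors' implementation.

```python
import numpy as np

def phi(t):
    # logistic loss log(1 + exp(-t)), written stably
    return np.logaddexp(0.0, -t)

def objective_Phi(C, alpha, K, W, X, y, lam):
    """Phi(C, alpha) from Equation (5).

    K : (m, m) kernel matrix, K[i, j] = K(x_i, x_j); K[j] plays the role of k_j
    W : (m, m) weights w_{i,j};  X : (m, n) inputs;  y : (m,) labels in {-1, +1}
    C : (n, m) gradient coefficients;  alpha : (m,) function coefficients
    """
    m = K.shape[0]
    loss = 0.0
    for i in range(m):
        for j in range(m):
            t = K[j] @ alpha + (K[i] @ C.T) @ (X[i] - X[j])
            loss += W[i, j] * phi(y[i] * t)
    loss /= m ** 2
    reg = 0.5 * lam * (alpha @ K @ alpha + np.trace(C @ K @ C.T))
    return loss + reg
```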
23 Write h^0 = (h^0_1, . . . , h^0_m)^T, where h^0_j(α, C) = (1/m^2) ∑_{i=1}^m wi,j φ′(yi(kj α + ki C^T(xj − xi))) yi + λ αj and, for i = 1, . . . [sent-155, score-0.132]
24 , m, hi(α, C) = (1/m^2) ∑_{j=1}^m wi,j φ′(yi(kj α + ki C^T(xj − xi))) yi (xi − xj) + λ ci. [sent-158, score-0.132]
25 Proposition 6 If the solution to the equation h(α,C) = 0 exists, then the coefficients c z in the representation of fz satisfy the constraint for every i = 1, . [sent-161, score-0.453]
26 So ∇Φ(αz , cz ) = 0 and (αz , cz ) gives the representation of gz and fz . [sent-169, score-1.098]
27 Remark 7 We know the solution (gz, fz) exists and is even unique. [sent-172, score-0.453]
28 We apply the well known approach of singular value decomposition to the matrix involving the data x given by Mx = (x1 − xm , x2 − xm , . [sent-178, score-0.307]
29 Set βi ∈ Rd to satisfy xi − xm = V βi for i = 1, . [sent-205, score-0.175]
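A minimal sketch of this dimension-reduction step, under the assumption that V is truncated to the numerical rank of Mx (names are illustrative):

```python
import numpy as np

def reduce_coordinates(X):
    """Build Mx = [x_1 - x_m, ..., x_m - x_m], take its SVD Mx = V Sigma U^T,
    and return V together with beta_i = V^T (x_i - x_m), so that x_i - x_m = V beta_i."""
    Mx = (X - X[-1]).T                         # (n, m): column i is x_i - x_m
    V, S, _ = np.linalg.svd(Mx, full_matrices=False)
    d = int(np.sum(S > 1e-12))                 # numerical rank of the centered data
    V = V[:, :d]                               # (n, d)
    betas = (X - X[-1]) @ V                    # (m, d): row i is beta_i
    return V, betas
```

Since each xi − xm lies in the column space of V, the reconstruction xi − xm = V betai is exact.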
30 Write u^0 = (u^0_1, . . . , u^0_m)^T, where u^0_j(γ^0, γ) = (1/m^2) ∑_{i=1}^m wi,j φ′(yi(kj γ^0 + ki γ^T(βi − βj))) yi + λ γ^0_j and, for i = 1, . . . [sent-218, score-0.128]
31 , m, ui(γ^0, γ) = (1/m^2) ∑_{j=1}^m wi,j φ′(yi(kj γ^0 + ki γ^T(βi − βj))) yi (βi − βj) + λ γi. [sent-221, score-0.128]
32 , m, solve ∇Φ(α,C) = 0 and hence yield a representation of gz and fz respectively. [sent-228, score-0.918]
33 at each iteration the matrix a ∈ R^{m×m} is defined by its components ai,j = wi,j φ(yi(kj γ^0 + ki γ^T(βi − βj))), where ki is the i-th column of the kernel matrix. [sent-273, score-0.17]
34 at each iteration the matrix A ∈ R^{m×m} is defined by its components Ai,j = wi,j φ(yi(kj γ^0 + ki γ^T(βi − βj))). [sent-274, score-0.109]
35 the coefficients of the classification function estimate, αz ∈ R^m, and gz = ∑_{i=1}^m αi,z K(x, xi). [sent-344, score-0.501]
36 Algorithm 1: Algorithm for computing gz and fz. input: inputs x = (x1, . . . [sent-350, score-0.918]
37 , ym)^T, kernel K, weights (wi,j), regularization parameters s, λ > 0 and threshold ε > 0; return: coefficients αz and (ci,z)_{i=1}^m; Mx = [x1 − xm, x2 − xm, . . . [sent-356, score-0.304]
38 , x_{m−1} − xm, xm − xm]; [V, Σ, U] = svd(Mx); η0 = 0; stop ← false; t ← 0; repeat . . . [sent-359, score-0.417]
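Algorithm 1 itself runs a Newton-type iteration in the reduced coordinates; as a hedged stand-in, the sketch below hands the reduced objective to a generic quasi-Newton solver and maps the solution back through cz = V Γ. The logistic loss, Gaussian kernel and weights, the back-mapping, and every name here are assumptions made for illustration, not the authors' code, and the double loop favours clarity over speed.

```python
import numpy as np
from scipy.optimize import minimize

def fit_gradient_classifier(X, y, lam=1e-3, s=1.0, sigma=1.0):
    """Estimate alpha_z (function coefficients) and c_z (gradient coefficients)."""
    m, n = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))       # kernel matrix
    W = np.exp(-sq / (2.0 * s ** 2))           # weights w_{i,j} with bandwidth s

    Mx = (X - X[-1]).T                         # reduced coordinates: x_i - x_m = V beta_i
    V, S, _ = np.linalg.svd(Mx, full_matrices=False)
    d = int(np.sum(S > 1e-12))
    V = V[:, :d]
    B = (X - X[-1]) @ V                        # (m, d), row i is beta_i

    phi = lambda t: np.logaddexp(0.0, -t)      # logistic loss

    def objective(theta):
        gamma0 = theta[:m]                     # plays the role of alpha
        G = theta[m:].reshape(d, m)            # reduced gradient coefficients gamma_i
        val = 0.0
        for i in range(m):
            for j in range(m):
                t = K[j] @ gamma0 + (K[i] @ G.T) @ (B[i] - B[j])
                val += W[i, j] * phi(y[i] * t)
        val /= m ** 2
        val += 0.5 * lam * (gamma0 @ K @ gamma0 + np.trace(G @ K @ G.T))
        return val

    res = minimize(objective, np.zeros(m + d * m), method="L-BFGS-B")
    alpha_z = res.x[:m]
    Gamma = res.x[m:].reshape(d, m)
    c_z = V @ Gamma                            # assumed back-mapping c_i = V gamma_i
    return alpha_z, c_z
```

Because V has orthonormal columns, Tr(Γ K Γ^T) equals Tr(C K C^T) for C = V Γ, so the reduced regularizer matches the original one.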
39 We will show that under certain conditions, gz → fφ and fz → ∇ fφ as λ, s → 0. [sent-370, score-0.918]
40 Choose λ = λ(m) = m^{−1/(3(n+2+2θ))} and s = s(m) = m^{−1/(6(n+2+2θ))}. Then there exists a constant C > 0 such that for any 0 < η < 1, with confidence 1 − η, ||gz − fφ||_{L^2_{ρX}} ≤ C log(4/η) (1/m)^{θ/(6(n+2+2θ))} and ||fz − ∇fφ||_{L^2_{ρX}} ≤ C log(4/η) (1/m)^{θ/(6(n+2+2θ))}. [sent-376, score-0.918]
41 Substituting r0 into this upper bound and by the choice of λ, s, we obtain with confidence at least 1 − 2δ: max{||gz − fφ||^2_{L^2_{ρX}}, ||fz − ∇fφ||^2_{L^2_{ρX}}} ≤ C log(2/δ) (1/m)^{θ/(3(n+2+2θ))}, where C = C̃((r0)^2 + B_{r0}(L_{r0} r0 + M_{r0} + 2)). [sent-406, score-0.918]
42 The idea is to rank the importance of variables according to the norm of their partial derivatives ∂fφ/∂x^j, since a small norm implies small changes of the classification function with respect to this variable. [sent-413, score-0.139]
43 So we shall use the norms of the components of fz to rank the variables. [sent-415, score-0.493]
44 Definition 13 The relative magnitude of the norm for the variables is defined as s_ℓ = ||fz^ℓ||_K / (∑_{j=1}^n ||fz^j||_K^2)^{1/2}, ℓ = 1, . . . [sent-416, score-0.961]
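Using ||fz^ℓ||_K^2 = c^ℓ K (c^ℓ)^T, where c^ℓ ∈ R^m is the ℓ-th row of the coefficient matrix cz, the relative norms of Definition 13 can be computed as in this sketch (variable names are illustrative):

```python
import numpy as np

def relative_norms(c_z, K):
    """Relative magnitude s_l of each variable (Definition 13).

    c_z : (n, m) gradient coefficients; row l corresponds to the component fz^l
    K   : (m, m) kernel matrix
    """
    norms_sq = np.einsum('lm,mk,lk->l', c_z, K, c_z)   # ||fz^l||_K^2 for each l
    norms = np.sqrt(np.maximum(norms_sq, 0.0))
    return norms / np.sqrt(norms_sq.sum())

# variables can then be ranked by np.argsort(-relative_norms(c_z, K))
```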
45 Definition 14 The empirical gradient matrix (EGM), Fz , is the n × m matrix whose columns are fz (x j ) with j = 1, . [sent-421, score-0.605]
46 The empirical covariance matrix (ECM), Ξz, is the n × n matrix of inner products of the directional derivatives of pairs of coordinates: Cov(fz) := [⟨fz^p, fz^q⟩_K]_{p,q=1}^n = cz K cz^T = ∑_{i,j=1}^m ci,z c_{j,z}^T K(xi, xj). [sent-425, score-1.571]
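A sketch of the EGM and ECM of Definition 14, using Ξz = cz K cz^T (names are illustrative):

```python
import numpy as np

def empirical_gradient_and_covariance(c_z, K):
    """EGM F_z (n x m): column j is fz(x_j).  ECM Xi_z (n x n): Xi_z[p, q] = <fz^p, fz^q>_K."""
    F_z = c_z @ K            # column j is sum_i c_{i,z} K(x_i, x_j)
    Xi_z = c_z @ K @ c_z.T   # = sum_{i,j} c_{i,z} c_{j,z}^T K(x_i, x_j)
    return F_z, Xi_z
```

Large off-diagonal entries of Xi_z indicate pairs of coordinates that covary with respect to the classification function and can be used to cluster variables.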
47 We ran our algorithm on draws of the above data using a linear kernel and report both the results of the gradient estimate as well as the classification function passed through a logistic function. [sent-460, score-0.156]
48 Drawing twenty samples from the two respective classes results in a design matrix x that is 80 × 40 where the first twenty samples belong to class −1 and the remaining to class +1. [sent-461, score-0.187]
49 In Figures 1b and 2b we plot the norm of each component of the estimate of the gradient, {||(fz)^ℓ||_K}_{ℓ=1}^{80}. [sent-470, score-0.508]
50 Drawing thirty samples from the two respective classes results in a design matrix x that is 200 × 60 where the first thirty samples belong to class −1 and the remaining to class +1. [sent-555, score-0.141]
51 In the following we report both the results of the gradient estimate as well as the classification function passed through a logistic function. [sent-558, score-0.156]
52 The norms of the first two coordinates are much larger than the norm of any of the other coordinates, min_{i=1,2} ||fz^i||_K ≫ max_{i=3,. . . [sent-560, score-0.593]
53 The classification accuracy improves when we rerun our algorithm using only the dimensions with nonzero norms (3f). [sent-567, score-0.119]
54 3 Gene Expression Data In computational biology, specifically in the subfield of gene expression analysis, variable selection and estimation of covariation are of fundamental importance. [sent-570, score-0.24]
55 The expression level of a gene is proportional to the number of copies of mRNA transcribed by that gene. [sent-572, score-0.13]
56 This readout of gene expression is considered a proxy of the state of the cell. [sent-573, score-0.13]
57 The goals of gene expression analysis include using the expression level of the genes to predict classes, for example tissue morphology or treatment outcome, or real-valued quantities such as drug toxicity or sensitivity. [sent-574, score-0.222]
58 Table 1: Number of errors in classification for various values of S using the genes corresponding to dimensions with the largest norms; genes (S): 50, 100, 200, 300, 400, 500, 1,000, 3,000, 7,129; test errors: 2, 1, 1, 1, 1, 1, 1, 1, 2, respectively. [sent-630, score-0.191]
59 Plotting the decay of the elements of the normalized hyperplane w/||w|| that is the solution of a linear SVM, or the solution of regularized linear logistic regression, results in a plot much more like that of the Fisher score than the gradient statistic. [sent-639, score-0.181]
60 This matrix indicates how the dimensions covary and can be used to construct clusters of genes. [sent-644, score-0.14]
61 Discussion We introduce an algorithm that learns a classification function and its gradient from sample data in the logistic regression framework. [sent-687, score-0.156]
62 Logistic regression models: An alternative optimization problem was proposed in Mukherjee and Zhou (2006) for estimating the gradient fz in the binary regression problem: f_{z,λ} = arg min_{f ∈ H_K^n} { (1/m^2) ∑_{i,j=1}^m w^{(s)}_{i,j} φ(yi(yj + f(xi) · (xi − xj))) + λ ||f||_K^2 }. [sent-706, score-0.595]
63 This formulation does have an interesting interpretation for variable selection in that variables that are relevant in the classification problem will have large gradient norms and those not relevant will have norms near zero. [sent-708, score-0.201]
64 However, in most data sets and when variable selection is meaningful the data are concentrated on a much lower dimensional manifold embedded in the high dimensional space. [sent-718, score-0.112]
65 The basic problem is, given model systems with known genetic or molecular perturbations, to infer gene expression “signatures of pathways” (sets of genes that characterize the perturbation in the model system). [sent-728, score-0.186]
66 Our proposal is that by studying the gradient of the classification function we can determine which variables are salient with respect to the classification problem and how these variables covary. [sent-740, score-0.127]
67 The conceptual key is that an estimate of the gradient allows us to measure coordinate covariation, since the inner product between partial derivatives ⟨∂fφ/∂x^j, ∂fφ/∂x^ℓ⟩ indicates the covariance of the j-th and ℓ-th coordinates with respect to variation in the classification function fφ. [sent-741, score-0.308]
68 These simulations suggest that the method does work for variable selection and some degree of covariation can be estimated. [sent-746, score-0.11]
69 The efficacy of the method was clearly demonstrated on the simulated data and applying the method to gene expression data as well as images of digits (Mukherjee et al. [sent-747, score-0.13]
70 For this purpose, we prove two facts which we state in Lemmas 18 and 19, and define the local error function of t ∈ R at x ∈ X as errx(t) = E_{y|x}[φ(yt)] = φ(t)P(1|x) + φ(−t)P(−1|x), which is a twice differentiable, univariate convex function for every x ∈ X. [sent-808, score-0.4]
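As an illustrative check of the minimizer property stated next in Lemma 18 (i), take the logistic loss φ(t) = log(1 + e^{-t}) (an assumption consistent with the paper's logistic model, not a quotation of the proof):

```latex
\operatorname{err}_x(t)=\log(1+e^{-t})\,P(1|x)+\log(1+e^{t})\,P(-1|x),\qquad
\operatorname{err}_x'(t)=-\frac{P(1|x)}{1+e^{t}}+\frac{P(-1|x)}{1+e^{-t}},
```

and errx'(t) = 0 exactly when e^t = P(1|x)/P(−1|x), so the minimizer is the log-odds t = fφ(x) = log(P(1|x)/P(−1|x)).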
71 Lemma 18 For almost every x ∈ X, the following hold: (i) fφ(x) is a minimizer of the function errx(t), i. [sent-809, score-0.4]
72 (ii) If T > max{|t|, ||fφ||_∞}, then (1/2) q1(T)(t − fφ(x))^2 ≤ errx(t) − errx(fφ(x)) ≤ (1/2) q2(T)(t − fφ(x))^2. [sent-812, score-0.8]
73 (iii) If T ≥ max{|t|, 3||fφ||_∞}, there exists a constant c1 > 0 such that errx(t) − errx(fφ(x)) ≥ c1 max{q1(T), 1/T}(t − fφ(x))^2. [sent-813, score-0.8]
74 Proof The first conclusion is a direct consequence of the fact that R(f) = ∫_X errx(f(x)) dρX(x). [sent-814, score-0.4]
75 X Note that (errx ) ( fφ (x)) = 0 since f φ (x) is a minimizer of errx (t). [sent-815, score-0.4]
76 By a Taylor series expansion, there exists t0 between t and fφ(x) such that errx(t) − errx(fφ(x)) = (1/2)(errx)″(t0)(t − fφ(x))^2. [sent-816, score-0.8]
77 To show (iii), write errx(t) − errx(fφ(x)) = ∫_{fφ(x)}^{t} ∫_{fφ(x)}^{r} (errx)″(a) da dr. Since (errx)″(a) is positive, if t ≥ 3||fφ||_∞ [sent-819, score-1.6]
78 =: 3Mφ, then errx(t) − errx(fφ(x)) ≥ ∫_{2Mφ}^{t} ∫_{|fφ(x)|}^{2Mφ} (errx)″(a) da dr ≥ q1(2Mφ) Mφ (|t| − 2Mφ) and, if t ≤ −3Mφ, then errx(t) − errx(fφ(x)) ≥ ∫_{t}^{−2Mφ} ∫_{−2Mφ}^{−|fφ(x)|} (errx)″(a) da dr ≥ q1(2Mφ) Mφ (|t| − 2Mφ). [sent-820, score-0.8]
79 So, if |t| > 3Mφ, errx(t) − errx(fφ(x)) ≥ q1(2Mφ) Mφ (|t| − 2Mφ) ≥ (3 q1(2Mφ) Mφ / (16T)) (t − fφ(x))^2, where we have used the facts |t| − 2Mφ ≥ (1/4)|t − fφ(x)| and |t − fφ(x)| ≤ T + Mφ ≤ (4/3)T. [sent-821, score-0.8]
80 On the other hand, by (ii), if |t| ≤ 3Mφ, errx(t) − errx(fφ(x)) ≥ (1/2) q1(3Mφ)(t − fφ(x))^2 ≥ (3 q1(3Mφ) Mφ / (2T)) (t − fφ(x))^2. [sent-822, score-0.8]
81 2 2T Hence for all |t| ≤ T, errx (t) − errx ( fφ (x)) ≥ 3q1 (3Mφ )Mφ (t − fφ (x))2 . [sent-823, score-0.8]
82 Together with (ii), we obtain errx(t) − errx(fφ(x)) ≥ c1 max{q1(T), 1/T}(t − fφ(x))^2, with c1 = min{1/2, 3 q1(3Mφ) Mφ / 16}. [sent-824, score-0.8]
83 In order to apply Theorem 15 to (gz, fz), we need a bound on ||gz||_K^2 + ||fz||_K^2; we first give a rough bound. [sent-850, score-0.918]
84 Lemma 21 For every s > 0 and λ > 0, ||gz||_K^2 + ||fz||_K^2 ≤ 2φ(0)/(λ s^{n+2}). [sent-851, score-1.371]
85 Proof The conclusion follows from the fact that (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ Ez(gz, fz) + (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ Ez(0, 0) + 0 ≤ φ(0)/s^{n+2}. [sent-853, score-1.836]
86 2 Bounding the Excess Error In this section, we bound the quantity E (gz , fz ) − Rs . [sent-859, score-0.453]
87 Theorem 23 If (fφ, ∇fφ) ∈ H_K^{n+1}, and (gz, fz) and (gλ, fλ) are in Fr for some r ≥ 1, then with confidence 1 − δ, E(gz, fz) − Rs ≤ C2 ((Lr r + Mr) log(2/δ)/(√m s^{n+2}) + s^2 + λ), where C2 > 0 is a constant depending on φ and ρ but not on r, s and λ. [sent-861, score-0.945]
88 Proposition 24 The following holds: E(gz, fz) − Rs + (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ S(z) + A(λ), where S(z) = E(gz, fz) − Ez(gz, fz) + Ez(gλ, fλ) − E(gλ, fλ) and A(λ) = inf_{(g,f) ∈ H_K^{n+1}} { E(g, f) − Rs + (λ/2)(||g||_K^2 + ||f||_K^2) }. [sent-863, score-2.277]
89 In fact, if both (gz, fz) and (gλ, fλ) are in Fr for some r > 0, then S(z) ≤ 2 S(z, r). [sent-865, score-0.453]
90 It is easy to verify that S(z, r) − S(z^i, r) = sup_{(g,f) ∈ Fr} |Ez(g, f) − E(g, f)| − sup_{(g,f) ∈ Fr} |E_{z^i}(g, f) − E(g, f)| ≤ sup_{(g,f) ∈ Fr} |Ez(g, f) − E_{z^i}(g, f)| ≤ ((2m − 1)/m^2) Mr. [sent-870, score-0.201]
91 By Lemma 18 (ii), we have E(fφ, ∇fφ) − Rs = ∫_X ∫_X w(x − u)[errx(fφ(u) + ∇fφ(x) · (x − u)) − errx(fφ(x))] dρX(u) dρX(x) ≤ q2(M̃φ) ∫_X ∫_X w(x − u)(fφ(u) − fφ(x) + ∇fφ(x) · (x − u))^2 dρX(u) dρX(x), which is then bounded in terms of cK, cρ, κ, D and ||∇fφ||_K, where M̃φ = κ||fφ||_K + κD||∇fφ||_K. [sent-894, score-0.8]
92 Notice that both theorems need a bound r so that (gz , fz ) and (gλ , fλ ) are in Fr . [sent-899, score-0.453]
93 In Lemma 21 we have shown ||gz||_K^2 + ||fz||_K^2 ≤ 2φ(0)/(λ s^{n+2}). [sent-900, score-0.918]
94 So √(2φ(0)/(λ s^{n+2})) is a universal bound for r such that (gz, fz) and (gλ, fλ) are in Fr. [sent-902, score-0.453]
95 Lemma 30 Under the assumptions of Theorem 10, with confidence at least 1 − δ, ||gz||_K^2 + ||fz||_K^2 ≤ c5 (1 + s^2/λ + (L_{λ,s}/√(λ s^{n+2}) + M_{λ,s}) log(2/δ) / (√m λ s^{n+2})) for some c5 > 0. [sent-909, score-0.918]
96 Proof By the fact that E(gz, fz) − Rs > 0 and Proposition 24, we have (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ S(z) + A(λ). [sent-910, score-1.371]
97 Since both (gz, fz) and (gλ, fλ) are in F_{√(2φ(0)/(λ s^{n+2}))}, we apply Proposition 27 to get, with probability at least 1 − δ, S(z) ≤ c3 (L_{λ,s} √(2φ(0)/(λ s^{n+2})) + M_{λ,s}) log(2/δ) / (√m s^{n+2}). [sent-911, score-0.453]
98 By Theorems 15 and 23 we have, with probability at least 1 − δ, that both ||gz − fφ||^2_{L^2_{ρX}} and ||fz − ∇fφ||^2_{L^2_{ρX}} are bounded by max{C0, C1} (r^2 s^θ + C2 Br ((Lr r + Mr) log(2/δ)/(√m s^{n+2}) + s^2 + λ) s^{−θ})   (12), if both (gz, fz) and (gλ, fλ) are in Fr for some r > 1. [sent-915, score-0.945]
99 By Lemmas 29 and 30 we can state that both {(gz, fz) ∈ Fr} and {(gλ, fλ) ∈ Fr} hold with probability at least 1 − δ if r^2 = max(c4, c5, 1)(1 + s^2/λ + (L_{λ,s}/√(λ s^{n+2}) + M_{λ,s}) log(2/δ) / (√m λ s^{n+2})). [sent-916, score-0.453]
100 Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. [sent-992, score-0.13]
wordName wordTfidf (topN-words)
[('gz', 0.465), ('fz', 0.453), ('errx', 0.4), ('ovariation', 0.173), ('hk', 0.157), ('ez', 0.156), ('ukherjee', 0.156), ('oordinate', 0.147), ('xm', 0.139), ('rs', 0.125), ('sn', 0.118), ('fr', 0.112), ('gene', 0.094), ('gradient', 0.094), ('cz', 0.09), ('lassification', 0.09), ('covariation', 0.083), ('rkhs', 0.081), ('br', 0.079), ('xs', 0.079), ('dimensions', 0.079), ('mukherjee', 0.077), ('mr', 0.074), ('mx', 0.068), ('sup', 0.067), ('aml', 0.066), ('dud', 0.065), ('logistic', 0.062), ('lr', 0.06), ('genes', 0.056), ('ck', 0.055), ('norm', 0.055), ('yi', 0.048), ('wi', 0.048), ('classi', 0.046), ('prob', 0.046), ('rm', 0.046), ('twenty', 0.045), ('ecm', 0.045), ('rn', 0.045), ('tzs', 0.043), ('pathway', 0.041), ('norms', 0.04), ('ms', 0.039), ('proposition', 0.039), ('coordinate', 0.038), ('vx', 0.037), ('duke', 0.037), ('egm', 0.037), ('xi', 0.036), ('expression', 0.036), ('covariance', 0.034), ('samples', 0.034), ('reproducing', 0.033), ('manifold', 0.033), ('salient', 0.033), ('covary', 0.032), ('kict', 0.032), ('qiang', 0.032), ('tusher', 0.032), ('ki', 0.032), ('excess', 0.031), ('slonim', 0.03), ('coordinates', 0.03), ('derivatives', 0.029), ('matrix', 0.029), ('theorem', 0.029), ('guyon', 0.029), ('golub', 0.029), ('taylor', 0.028), ('xz', 0.028), ('rfe', 0.028), ('sayan', 0.028), ('schoelkopf', 0.028), ('sf', 0.028), ('um', 0.028), ('selection', 0.027), ('sgn', 0.026), ('cation', 0.026), ('regularization', 0.026), ('dimensional', 0.026), ('evgeniou', 0.025), ('rademacher', 0.025), ('decay', 0.025), ('bm', 0.025), ('leukemia', 0.025), ('dt', 0.024), ('representer', 0.024), ('lemmas', 0.023), ('tibshirani', 0.023), ('mcdiarmid', 0.023), ('zhou', 0.022), ('dence', 0.022), ('risk', 0.022), ('dadr', 0.022), ('dtd', 0.022), ('explanatory', 0.022), ('hermes', 0.022), ('nonlinearly', 0.022), ('thirty', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
2 0.49557623 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
3 0.096002266 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
4 0.071299292 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
5 0.063100882 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
6 0.061479144 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
7 0.059801664 75 jmlr-2006-Policy Gradient in Continuous Time
8 0.057312079 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
9 0.055756748 93 jmlr-2006-Universal Kernels
10 0.051296264 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
11 0.050728902 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.047376629 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
14 0.043587558 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
15 0.042100985 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
16 0.040424887 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
17 0.039287318 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
18 0.038389809 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
19 0.038273521 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
20 0.036970589 44 jmlr-2006-Large Scale Transductive SVMs
topicId topicWeight
[(0, 0.26), (1, -0.111), (2, 0.023), (3, -0.51), (4, 0.221), (5, -0.448), (6, 0.162), (7, -0.051), (8, 0.231), (9, 0.015), (10, -0.071), (11, -0.081), (12, 0.093), (13, 0.052), (14, -0.006), (15, 0.045), (16, 0.032), (17, -0.003), (18, 0.085), (19, -0.021), (20, -0.094), (21, 0.018), (22, 0.023), (23, 0.031), (24, -0.045), (25, 0.035), (26, -0.019), (27, 0.029), (28, 0.037), (29, 0.025), (30, -0.02), (31, 0.001), (32, -0.019), (33, 0.032), (34, -0.044), (35, 0.041), (36, -0.006), (37, 0.048), (38, -0.015), (39, 0.001), (40, -0.035), (41, 0.03), (42, 0.023), (43, 0.025), (44, 0.013), (45, -0.008), (46, -0.026), (47, -0.028), (48, -0.006), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.92131054 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
2 0.90831846 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
3 0.25771427 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
4 0.25644857 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
5 0.25041285 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
Author: Ross A. Lippert, Ryan M. Rifkin
Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines
7 0.18740323 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
8 0.1854711 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
9 0.18431783 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
10 0.18109234 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
11 0.17109466 75 jmlr-2006-Policy Gradient in Continuous Time
12 0.16653331 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
13 0.16651878 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
14 0.14349245 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
15 0.14092249 93 jmlr-2006-Universal Kernels
16 0.13992697 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
17 0.1387215 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
topicId topicWeight
[(8, 0.024), (35, 0.013), (36, 0.058), (45, 0.022), (46, 0.207), (50, 0.073), (63, 0.036), (76, 0.022), (77, 0.19), (78, 0.022), (81, 0.054), (84, 0.029), (90, 0.062), (91, 0.042), (96, 0.051)]
simIndex simValue paperId paperTitle
1 0.74707091 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
same-paper 2 0.72180927 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memoryefficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.1 Keywords: Markov networks, large-margin methods, structured prediction, extragradient, Bregman projections
4 0.44461423 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
5 0.43646285 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
7 0.43408537 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals
8 0.4198254 66 jmlr-2006-On Model Selection Consistency of Lasso
9 0.41480902 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
10 0.41381159 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
11 0.41194105 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.41059858 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
13 0.40874931 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
14 0.40796423 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
15 0.40452296 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
16 0.40126801 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
17 0.39965358 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
18 0.3995153 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
19 0.39839196 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
20 0.39778143 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms