jmlr jmlr2006 jmlr2006-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
Reference: text
sentIndex sentText sentNum sentScore
1 The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. [sent-8, score-0.13]
2 In a number of applications, such as the analysis of gene expression data, classical questions from statistical modeling arise: which variables are relevant, and how do these variables interact? [sent-13, score-0.13]
3 An example of this is that genes co-regulated by a biological pathway may be modeled as features that covary. [sent-15, score-0.097]
4 Statistical models based on shrinkage or regularization were applied to the problem of learning coordinate covariation and relevance for regression problems in Mukherjee and Zhou (2006). [sent-22, score-0.147]
5 2 Learning the Classification Function and Gradient In this paper we are interested in simultaneously learning fφ and its gradient from the sample values z = {(xi, yi)}_{i=1}^m. [sent-40, score-0.142]
6 The gradient of fφ is the vector of functions (if the partial derivatives exist) ∇fφ = (∂fφ/∂x^1, . . . , ∂fφ/∂x^n)^T. [sent-45, score-0.123]
7 This is one of the reasons we use a logistic model rather than learning the gradient of a {−1, 1} classification function. [sent-54, score-0.156]
8 At first glance, the problem of estimating the gradient is equivalent to that of computing classical numerical derivatives in inverse problems. [sent-57, score-0.123]
9 Other weight functions can be used as long as the bandwidth of the weight function decreases with the number of samples. [sent-77, score-0.236]
10 We may expect that minimizing this error functional using functions in a hypothesis space H_K^{n+1} leads to gz and fz such that gz(u) + fz(x) · (x − u) ≈ fφ(x) ≈ fφ(u) + ∇fφ(x) · (x − u), for x ≈ u. [sent-80, score-1.836]
11 This in general leads to gz ≈ fφ and fz ≈ ∇ fφ . [sent-81, score-0.918]
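For concreteness, the empirical error functional that this first-order Taylor argument suggests (its exact weighting and argument order are inferred here from the matrix form in Equation (5) below, so treat it as a reconstruction rather than a quotation) is

```latex
\mathcal{E}_z(g,\vec f)\;=\;\frac{1}{m^{2}}\sum_{i,j=1}^{m} w^{(s)}_{i,j}\,
\phi\!\Big(y_i\big(g(x_j)+\vec f(x_i)\cdot(x_i-x_j)\big)\Big),
```

so that, whenever xi ≈ xj, the argument of φ behaves like yi fφ(xi).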
12 Definition 2 Given a sample z ∈ Z^m we can estimate the classification function, gz, and its gradient, fz, as follows: (gz, fz) = arg min_{(g,f) ∈ H_K^{n+1}} { Ez(g, f) + (λ/2)(||g||_K^2 + ||f||_K^2) }   (3), where s, λ > 0 are the regularization parameters and, for f = (f^1, . . . [sent-89, score-1.397]
13 In Section 3, we prove the convergence of our estimate of the gradient, fz , to the true gradient of the classification function, ∇ fφ . [sent-107, score-0.547]
14 The utility of the algorithm is illustrated in Section 4 on simulated data as well as gene expression data. [sent-108, score-0.13]
15 The following theorem is an analog of the standard representer theorem (Wahba, 1990; Schoelkopf and Smola, 2001) that states the minimizer of the optimization problem defined by Equation (3) has a finite dimensional representation. [sent-115, score-0.108]
16 Proposition 4 Given a sample z ∈ Z^m the solution of (3) exists and takes the form gz(x) = ∑_{i=1}^m αi,z K(x, xi) and fz(x) = ∑_{i=1}^m ci,z K(x, xi)   (4), with cz = (c1,z, . . . [sent-116, score-1.08]
17 Proof The existence follows from the convexity of φ and of the norm functionals ||g||_K^2 and ||f||_K^2; suppose (gz, fz) is a minimizer. [sent-123, score-0.453]
18 We can write the functions gz ∈ H_K and fz ∈ H_K^n as follows. [sent-124, score-0.918]
19 K Suppose gz = g + g⊥ and fz = f + f⊥ where g and each element of f is in the span of {Kx1 , , . [sent-125, score-0.918]
20 But ||gz||_K^2 = ||g + g⊥||_K^2 > ||g||_K^2 and ||fz||_K^2 = ||f + f⊥||_K^2 > ||f||_K^2 unless g⊥ = 0 and f⊥ = 0. [sent-131, score-0.918]
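As a small illustration of the representer form in Equation (4), the sketch below evaluates gz and fz at a new point from the coefficients αz ∈ R^m and cz ∈ R^{n×m}; the Gaussian kernel and all function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # K(u, v) = exp(-||u - v||^2 / (2 sigma^2)); any Mercer kernel could be substituted.
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def evaluate_estimates(x_new, X, alpha_z, c_z, sigma=1.0):
    """Evaluate gz(x_new) and fz(x_new) from the kernel expansions of Equation (4).

    X       : (m, n) training inputs x_1, ..., x_m
    alpha_z : (m,)   coefficients of the classification function estimate
    c_z     : (n, m) coefficients of the gradient estimate (column i is c_{i,z})
    """
    k = np.array([gaussian_kernel(x_new, xi, sigma) for xi in X])  # (m,) kernel evaluations
    g_val = alpha_z @ k   # gz(x_new) = sum_i alpha_{i,z} K(x_new, x_i)
    f_val = c_z @ k       # fz(x_new) = sum_i c_{i,z} K(x_new, x_i), an n-vector
    return g_val, f_val
```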
21 Define the matrix C = (c1, . . . , cm) ∈ R^{n×m} (when optimized these will be the coefficients cz in the gradient expansion) and the vector α ∈ R^m (when optimized the vector will be αz). [sent-138, score-0.184]
22 The optimization function can be written in matrix form as Φ(C, α) = (1/m^2) ∑_{i,j=1}^m wi,j φ(yi(kj α + ki C^T(xi − xj))) + (λ/2)(α^T K α + Tr(C K C^T))   (5), where Tr(M) is the trace of a matrix M. [sent-143, score-0.106]
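A sketch of the objective in Equation (5), assuming the logistic loss φ(t) = log(1 + e^{-t}) (consistent with the paper's logistic model) and taking kj, ki to be rows of the kernel matrix; the function name and layout are illustrative, not the authors' implementation.

```python
import numpy as np

def phi(t):
    # logistic loss log(1 + exp(-t)), written stably
    return np.logaddexp(0.0, -t)

def objective_Phi(C, alpha, K, W, X, y, lam):
    """Phi(C, alpha) from Equation (5).

    K : (m, m) kernel matrix, K[i, j] = K(x_i, x_j); K[j] plays the role of k_j
    W : (m, m) weights w_{i,j};  X : (m, n) inputs;  y : (m,) labels in {-1, +1}
    C : (n, m) gradient coefficients;  alpha : (m,) function coefficients
    """
    m = K.shape[0]
    loss = 0.0
    for i in range(m):
        for j in range(m):
            t = K[j] @ alpha + (K[i] @ C.T) @ (X[i] - X[j])
            loss += W[i, j] * phi(y[i] * t)
    loss /= m ** 2
    reg = 0.5 * lam * (alpha @ K @ alpha + np.trace(C @ K @ C.T))
    return loss + reg
```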
23 Write h^0 = (h^0_1, . . . , h^0_m)^T, where h^0_j(α, C) = (1/m^2) ∑_{i=1}^m wi,j φ′(yi(kj α + ki C^T(xj − xi))) yi + λ αj and, for i = 1, . . . [sent-155, score-0.132]
24 , m, hi(α, C) = (1/m^2) ∑_{j=1}^m wi,j φ′(yi(kj α + ki C^T(xj − xi))) yi (xi − xj) + λ ci. [sent-158, score-0.132]
25 Proposition 6 If the solution to the equation h(α,C) = 0 exists, then the coefficients c z in the representation of fz satisfy the constraint for every i = 1, . [sent-161, score-0.453]
26 So ∇Φ(αz , cz ) = 0 and (αz , cz ) gives the representation of gz and fz . [sent-169, score-1.098]
27 Remark 7 We know the solution (gz, fz) exists and is even unique. [sent-172, score-0.453]
28 We apply the well known approach of singular value decomposition to the matrix involving the data x given by Mx = (x1 − xm , x2 − xm , . [sent-178, score-0.307]
29 Set βi ∈ Rd to satisfy xi − xm = V βi for i = 1, . [sent-205, score-0.175]
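A minimal sketch of this dimension-reduction step, under the assumption that V is truncated to the numerical rank of Mx (names are illustrative):

```python
import numpy as np

def reduce_coordinates(X):
    """Build Mx = [x_1 - x_m, ..., x_m - x_m], take its SVD Mx = V Sigma U^T,
    and return V together with beta_i = V^T (x_i - x_m), so that x_i - x_m = V beta_i."""
    Mx = (X - X[-1]).T                         # (n, m): column i is x_i - x_m
    V, S, _ = np.linalg.svd(Mx, full_matrices=False)
    d = int(np.sum(S > 1e-12))                 # numerical rank of the centered data
    V = V[:, :d]                               # (n, d)
    betas = (X - X[-1]) @ V                    # (m, d): row i is beta_i
    return V, betas
```

Since each xi − xm lies in the column space of V, the reconstruction xi − xm = V betai is exact.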
30 Write u^0 = (u^0_1, . . . , u^0_m)^T, where u^0_j(γ^0, γ) = (1/m^2) ∑_{i=1}^m wi,j φ′(yi(kj γ^0 + ki γ^T(βi − βj))) yi + λ γ^0_j and, for i = 1, . . . [sent-218, score-0.128]
31 , m, ui(γ^0, γ) = (1/m^2) ∑_{j=1}^m wi,j φ′(yi(kj γ^0 + ki γ^T(βi − βj))) yi (βi − βj) + λ γi. [sent-221, score-0.128]
32 , m, solve ∇Φ(α,C) = 0 and hence yield a representation of gz and fz respectively. [sent-228, score-0.918]
33 at each iteration the matrix a ∈ R^{m×m} is defined by its components ai,j = wi,j φ(yi(kj γ^0 + ki γ^T(βi − βj))), where ki is the i-th column of the kernel matrix. [sent-273, score-0.17]
34 at each iteration the matrix A ∈ R^{m×m} is defined by its components Ai,j = wi,j φ(yi(kj γ^0 + ki γ^T(βi − βj))). [sent-274, score-0.109]
35 the coefficients of the classification function estimate, αz ∈ R^m, and gz = ∑_{i=1}^m αi,z K(x, xi). [sent-344, score-0.501]
36 Algorithm 1: Algorithm for computing gz and fz. input: inputs x = (x1, . . . [sent-350, score-0.918]
37 , ym)^T, kernel K, weights (wi,j), regularization parameters s, λ > 0 and threshold ε > 0; return: coefficients αz and (ci,z)_{i=1}^m; Mx = [x1 − xm, x2 − xm, . . . [sent-356, score-0.304]
38 , x_{m−1} − xm, xm − xm]; [V, Σ, U] = svd(Mx); η0 = 0; stop ← false; t ← 0; repeat . . . [sent-359, score-0.417]
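Algorithm 1 itself runs a Newton-type iteration in the reduced coordinates; as a hedged stand-in, the sketch below hands the reduced objective to a generic quasi-Newton solver and maps the solution back through cz = V Γ. The logistic loss, Gaussian kernel and weights, the back-mapping, and every name here are assumptions made for illustration, not the authors' code, and the double loop favours clarity over speed.

```python
import numpy as np
from scipy.optimize import minimize

def fit_gradient_classifier(X, y, lam=1e-3, s=1.0, sigma=1.0):
    """Estimate alpha_z (function coefficients) and c_z (gradient coefficients)."""
    m, n = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))       # kernel matrix
    W = np.exp(-sq / (2.0 * s ** 2))           # weights w_{i,j} with bandwidth s

    Mx = (X - X[-1]).T                         # reduced coordinates: x_i - x_m = V beta_i
    V, S, _ = np.linalg.svd(Mx, full_matrices=False)
    d = int(np.sum(S > 1e-12))
    V = V[:, :d]
    B = (X - X[-1]) @ V                        # (m, d), row i is beta_i

    phi = lambda t: np.logaddexp(0.0, -t)      # logistic loss

    def objective(theta):
        gamma0 = theta[:m]                     # plays the role of alpha
        G = theta[m:].reshape(d, m)            # reduced gradient coefficients gamma_i
        val = 0.0
        for i in range(m):
            for j in range(m):
                t = K[j] @ gamma0 + (K[i] @ G.T) @ (B[i] - B[j])
                val += W[i, j] * phi(y[i] * t)
        val /= m ** 2
        val += 0.5 * lam * (gamma0 @ K @ gamma0 + np.trace(G @ K @ G.T))
        return val

    res = minimize(objective, np.zeros(m + d * m), method="L-BFGS-B")
    alpha_z = res.x[:m]
    Gamma = res.x[m:].reshape(d, m)
    c_z = V @ Gamma                            # assumed back-mapping c_i = V gamma_i
    return alpha_z, c_z
```

Because V has orthonormal columns, Tr(Γ K Γ^T) equals Tr(C K C^T) for C = V Γ, so the reduced regularizer matches the original one.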
39 We will show that under certain conditions, gz → fφ and fz → ∇ fφ as λ, s → 0. [sent-370, score-0.918]
40 Choose λ = λ(m) = m^{−1/(3(n+2+2θ))} and s = s(m) = m^{−1/(6(n+2+2θ))}. Then there exists a constant C > 0 such that for any 0 < η < 1, with confidence 1 − η, ||gz − fφ||_{L^2_{ρX}} ≤ C log(4/η) (1/m)^{θ/(6(n+2+2θ))} and ||fz − ∇fφ||_{L^2_{ρX}} ≤ C log(4/η) (1/m)^{θ/(6(n+2+2θ))}. [sent-376, score-0.918]
41 Substituting r0 into this upper bound and by the choice of λ, s, we obtain with confidence at least 1 − 2δ: max{||gz − fφ||^2_{L^2_{ρX}}, ||fz − ∇fφ||^2_{L^2_{ρX}}} ≤ C log(2/δ) (1/m)^{θ/(3(n+2+2θ))}, where C = C̃((r0)^2 + B_{r0}(L_{r0} r0 + M_{r0} + 2)). [sent-406, score-0.918]
42 The idea is to rank the importance of variables according to the norm of their partial derivatives ∂fφ/∂x^j, since a small norm implies small changes of the classification function with respect to this variable. [sent-413, score-0.139]
43 So we shall use the norms of the components of fz to rank the variables. [sent-415, score-0.493]
44 Definition 13 The relative magnitude of the norm for the variables is defined as s_ℓ = ||fz^ℓ||_K / (∑_{j=1}^n ||fz^j||_K^2)^{1/2}, ℓ = 1, . . . [sent-416, score-0.961]
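Using ||fz^ℓ||_K^2 = c^ℓ K (c^ℓ)^T, where c^ℓ ∈ R^m is the ℓ-th row of the coefficient matrix cz, the relative norms of Definition 13 can be computed as in this sketch (variable names are illustrative):

```python
import numpy as np

def relative_norms(c_z, K):
    """Relative magnitude s_l of each variable (Definition 13).

    c_z : (n, m) gradient coefficients; row l corresponds to the component fz^l
    K   : (m, m) kernel matrix
    """
    norms_sq = np.einsum('lm,mk,lk->l', c_z, K, c_z)   # ||fz^l||_K^2 for each l
    norms = np.sqrt(np.maximum(norms_sq, 0.0))
    return norms / np.sqrt(norms_sq.sum())

# variables can then be ranked by np.argsort(-relative_norms(c_z, K))
```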
45 Definition 14 The empirical gradient matrix (EGM), Fz , is the n × m matrix whose columns are fz (x j ) with j = 1, . [sent-421, score-0.605]
46 The empirical covariance matrix (ECM), Ξz, is the n × n matrix of inner products of the directional derivatives of pairs of coordinates: Cov(fz) := [⟨fz^p, fz^q⟩_K]_{p,q=1}^n = cz K cz^T = ∑_{i,j=1}^m ci,z c_{j,z}^T K(xi, xj). [sent-425, score-1.571]
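A sketch of the EGM and ECM of Definition 14, using Ξz = cz K cz^T (names are illustrative):

```python
import numpy as np

def empirical_gradient_and_covariance(c_z, K):
    """EGM F_z (n x m): column j is fz(x_j).  ECM Xi_z (n x n): Xi_z[p, q] = <fz^p, fz^q>_K."""
    F_z = c_z @ K            # column j is sum_i c_{i,z} K(x_i, x_j)
    Xi_z = c_z @ K @ c_z.T   # = sum_{i,j} c_{i,z} c_{j,z}^T K(x_i, x_j)
    return F_z, Xi_z
```

Large off-diagonal entries of Xi_z indicate pairs of coordinates that covary with respect to the classification function and can be used to cluster variables.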
47 We ran our algorithm on draws of the above data using a linear kernel and report both the results of the gradient estimate as well as the classification function passed through a logistic function. [sent-460, score-0.156]
48 Drawing twenty samples from the two respective classes results in a design matrix x that is 80 × 40 where the first twenty samples belong to class −1 and the remaining to class +1. [sent-461, score-0.187]
49 In Figures 1b and 2b we plot the norm of each component of the estimate of the gradient, {||(fz)^ℓ||_K}_{ℓ=1}^{80}. [sent-470, score-0.508]
50 Drawing thirty samples from the two respective classes results in a design matrix x that is 200 × 60 where the first thirty samples belong to class −1 and the remaining to class +1. [sent-555, score-0.141]
51 In the following we report both the results of the gradient estimate as well as the classification function passed through a logistic function. [sent-558, score-0.156]
52 The norms of the first two coordinates are much larger than the norm of any of the other coordinates, min_{i=1,2} ||fz^i||_K ≫ max_{i=3,. . . [sent-560, score-0.593]
53 The classification accuracy improves when we rerun our algorithm using only the dimensions with nonzero norms (3f). [sent-567, score-0.119]
54 3 Gene Expression Data In computational biology, specifically in the subfield of gene expression analysis, variable selection and estimation of covariation are of fundamental importance. [sent-570, score-0.24]
55 The expression level of a gene is proportional to the number of copies of mRNA transcribed by that gene. [sent-572, score-0.13]
56 This readout of gene expression is considered a proxy of the state of the cell. [sent-573, score-0.13]
57 The goals of gene expression analysis include using the expression level of the genes to predict classes, for example tissue morphology or treatment outcome, or real-valued quantities such as drug toxicity or sensitivity. [sent-574, score-0.222]
58 Table 1: Number of errors in classification for various values of S using the genes corresponding to dimensions with the largest norms; genes (S): 50, 100, 200, 300, 400, 500, 1,000, 3,000, 7,129; test errors: 2, 1, 1, 1, 1, 1, 1, 1, 2, respectively. [sent-630, score-0.191]
59 Plotting the decay of the elements of the normalized hyperplane w/||w|| that is the solution of a linear SVM, or the solution of regularized linear logistic regression, results in a plot much more like that of the Fisher score than the gradient statistic. [sent-639, score-0.181]
60 This matrix indicates how the dimensions covary and can be used to construct clusters of genes. [sent-644, score-0.14]
61 Discussion We introduce an algorithm that learns a classification function and its gradient from sample data in the logistic regression framework. [sent-687, score-0.156]
62 Logistic regression models: An alternative optimization problem was proposed in Mukherjee and Zhou (2006) for estimating the gradient fz in the binary regression problem: f_{z,λ} = arg min_{f ∈ H_K^n} { (1/m^2) ∑_{i,j=1}^m w^{(s)}_{i,j} φ(yi(yj + f(xi) · (xi − xj))) + λ ||f||_K^2 }. [sent-706, score-0.595]
63 This formulation does have an interesting interpretation for variable selection in that variables that are relevant in the classification problem will have large gradient norms and those not relevant will have norms near zero. [sent-708, score-0.201]
64 However, in most data sets and when variable selection is meaningful the data are concentrated on a much lower dimensional manifold embedded in the high dimensional space. [sent-718, score-0.112]
65 The basic problem is, given model systems with known genetic or molecular perturbations, to infer gene expression “signatures of pathways” (sets of genes that characterize the perturbation in the model system). [sent-728, score-0.186]
66 Our proposal is that by studying the gradient of the classification function we can determine which variables are salient with respect to the classification problem and how these variables covary. [sent-740, score-0.127]
67 The conceptual key is that an estimate of the gradient allows us to measure coordinate covariation, since the inner product between partial derivatives ⟨∂fφ/∂x^j, ∂fφ/∂x^ℓ⟩ indicates the covariance of the j-th and ℓ-th coordinates with respect to variation in the classification function fφ. [sent-741, score-0.308]
68 These simulations suggest that the method does work for variable selection and some degree of covariation can be estimated. [sent-746, score-0.11]
69 The efficacy of the method was clearly demonstrated on the simulated data and applying the method to gene expression data as well as images of digits (Mukherjee et al. [sent-747, score-0.13]
70 For this purpose, we prove two facts which we state in Lemmas 18 and 19, and define the local error function of t ∈ R at x ∈ X as errx(t) = E_{y|x}[φ(yt)] = φ(t)P(1|x) + φ(−t)P(−1|x), which is a twice differentiable, univariate convex function for every x ∈ X. [sent-808, score-0.4]
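As an illustrative check of the minimizer property stated next in Lemma 18 (i), take the logistic loss φ(t) = log(1 + e^{-t}) (an assumption consistent with the paper's logistic model, not a quotation of the proof):

```latex
\operatorname{err}_x(t)=\log(1+e^{-t})\,P(1|x)+\log(1+e^{t})\,P(-1|x),\qquad
\operatorname{err}_x'(t)=-\frac{P(1|x)}{1+e^{t}}+\frac{P(-1|x)}{1+e^{-t}},
```

and errx'(t) = 0 exactly when e^t = P(1|x)/P(−1|x), so the minimizer is the log-odds t = fφ(x) = log(P(1|x)/P(−1|x)).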
71 Lemma 18 For almost every x ∈ X, the following hold: (i) fφ(x) is a minimizer of the function errx(t), i. [sent-809, score-0.4]
72 (ii) If T > max{|t|, ||fφ||_∞}, then (1/2) q1(T)(t − fφ(x))^2 ≤ errx(t) − errx(fφ(x)) ≤ (1/2) q2(T)(t − fφ(x))^2. [sent-812, score-0.8]
73 (iii) If T ≥ max{|t|, 3||fφ||_∞}, there exists a constant c1 > 0 such that errx(t) − errx(fφ(x)) ≥ c1 max{q1(T), 1/T}(t − fφ(x))^2. [sent-813, score-0.8]
74 Proof The first conclusion is a direct consequence of the fact that R(f) = ∫_X errx(f(x)) dρX(x). [sent-814, score-0.4]
75 X Note that (errx ) ( fφ (x)) = 0 since f φ (x) is a minimizer of errx (t). [sent-815, score-0.4]
76 By a Taylor series expansion, there exists t0 between t and fφ(x) such that errx(t) − errx(fφ(x)) = (1/2)(errx)″(t0)(t − fφ(x))^2. [sent-816, score-0.8]
77 To show (iii), write errx(t) − errx(fφ(x)) = ∫_{fφ(x)}^{t} ∫_{fφ(x)}^{r} (errx)″(a) da dr. Since (errx)″(a) is positive, if t ≥ 3||fφ||_∞ [sent-819, score-1.6]
78 =: 3Mφ, then errx(t) − errx(fφ(x)) ≥ ∫_{2Mφ}^{t} ∫_{|fφ(x)|}^{2Mφ} (errx)″(a) da dr ≥ q1(2Mφ) Mφ (|t| − 2Mφ) and, if t ≤ −3Mφ, then errx(t) − errx(fφ(x)) ≥ ∫_{t}^{−2Mφ} ∫_{−2Mφ}^{−|fφ(x)|} (errx)″(a) da dr ≥ q1(2Mφ) Mφ (|t| − 2Mφ). [sent-820, score-0.8]
79 So, if |t| > 3Mφ, errx(t) − errx(fφ(x)) ≥ q1(2Mφ) Mφ (|t| − 2Mφ) ≥ (3 q1(2Mφ) Mφ / (16T)) (t − fφ(x))^2, where we have used the facts |t| − 2Mφ ≥ (1/4)|t − fφ(x)| and |t − fφ(x)| ≤ T + Mφ ≤ (4/3)T. [sent-821, score-0.8]
80 On the other hand, by (ii), if |t| ≤ 3Mφ, errx(t) − errx(fφ(x)) ≥ (1/2) q1(3Mφ)(t − fφ(x))^2 ≥ (3 q1(3Mφ) Mφ / (2T)) (t − fφ(x))^2. [sent-822, score-0.8]
81 2 2T Hence for all |t| ≤ T, errx (t) − errx ( fφ (x)) ≥ 3q1 (3Mφ )Mφ (t − fφ (x))2 . [sent-823, score-0.8]
82 Together with (ii), we obtain errx(t) − errx(fφ(x)) ≥ c1 max{q1(T), 1/T}(t − fφ(x))^2, with c1 = min{1/2, 3 q1(3Mφ) Mφ / 16}. [sent-824, score-0.8]
83 In order to apply Theorem 15 to (gz, fz), we need a bound on ||gz||_K^2 + ||fz||_K^2; we first give a rough bound. [sent-850, score-0.918]
84 Lemma 21 For every s > 0 and λ > 0, ||gz||_K^2 + ||fz||_K^2 ≤ 2φ(0)/(λ s^{n+2}). [sent-851, score-1.371]
85 Proof The conclusion follows from the fact that (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ Ez(gz, fz) + (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ Ez(0, 0) + 0 ≤ φ(0)/s^{n+2}. [sent-853, score-1.836]
86 2 Bounding the Excess Error In this section, we bound the quantity E (gz , fz ) − Rs . [sent-859, score-0.453]
87 Theorem 23 If (fφ, ∇fφ) ∈ H_K^{n+1}, and (gz, fz) and (gλ, fλ) are in Fr for some r ≥ 1, then with confidence 1 − δ, E(gz, fz) − Rs ≤ C2 ((Lr r + Mr) log(2/δ)/(√m s^{n+2}) + s^2 + λ), where C2 > 0 is a constant depending on φ and ρ but not on r, s and λ. [sent-861, score-0.945]
88 Proposition 24 The following holds: E(gz, fz) − Rs + (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ S(z) + A(λ), where S(z) = E(gz, fz) − Ez(gz, fz) + Ez(gλ, fλ) − E(gλ, fλ) and A(λ) = inf_{(g,f) ∈ H_K^{n+1}} { E(g, f) − Rs + (λ/2)(||g||_K^2 + ||f||_K^2) }. [sent-863, score-2.277]
89 In fact, if both (gz, fz) and (gλ, fλ) are in Fr for some r > 0, then S(z) ≤ 2 S(z, r). [sent-865, score-0.453]
90 It is easy to verify that S(z, r) − S(z^i, r) = sup_{(g,f) ∈ Fr} |Ez(g, f) − E(g, f)| − sup_{(g,f) ∈ Fr} |E_{z^i}(g, f) − E(g, f)| ≤ sup_{(g,f) ∈ Fr} |Ez(g, f) − E_{z^i}(g, f)| ≤ ((2m − 1)/m^2) Mr. [sent-870, score-0.201]
91 By Lemma 18 (ii), we have E(fφ, ∇fφ) − Rs = ∫_X ∫_X w(x − u)[errx(fφ(u) + ∇fφ(x) · (x − u)) − errx(fφ(x))] dρX(u) dρX(x) ≤ q2(M̃φ) ∫_X ∫_X w(x − u)(fφ(u) − fφ(x) + ∇fφ(x) · (x − u))^2 dρX(u) dρX(x), which is then bounded in terms of cK, cρ, κ, D and ||∇fφ||_K, where M̃φ = κ||fφ||_K + κD||∇fφ||_K. [sent-894, score-0.8]
92 Notice that both theorems need a bound r so that (gz , fz ) and (gλ , fλ ) are in Fr . [sent-899, score-0.453]
93 In Lemma 21 we have shown ||gz||_K^2 + ||fz||_K^2 ≤ 2φ(0)/(λ s^{n+2}). [sent-900, score-0.918]
94 So √(2φ(0)/(λ s^{n+2})) is a universal bound for r such that (gz, fz) and (gλ, fλ) are in Fr. [sent-902, score-0.453]
95 Lemma 30 Under the assumptions of Theorem 10, with confidence at least 1 − δ, ||gz||_K^2 + ||fz||_K^2 ≤ c5 (1 + s^2/λ + (L_{λ,s}/√(λ s^{n+2}) + M_{λ,s}) log(2/δ) / (√m λ s^{n+2})) for some c5 > 0. [sent-909, score-0.918]
96 Proof By the fact that E(gz, fz) − Rs > 0 and Proposition 24, we have (λ/2)(||gz||_K^2 + ||fz||_K^2) ≤ S(z) + A(λ). [sent-910, score-1.371]
97 Since both (gz, fz) and (gλ, fλ) are in F_{√(2φ(0)/(λ s^{n+2}))}, we apply Proposition 27 to get, with probability at least 1 − δ, S(z) ≤ c3 (L_{λ,s} √(2φ(0)/(λ s^{n+2})) + M_{λ,s}) log(2/δ) / (√m s^{n+2}). [sent-911, score-0.453]
98 By Theorems 15 and 23 we have, with probability at least 1 − δ, that both ||gz − fφ||^2_{L^2_{ρX}} and ||fz − ∇fφ||^2_{L^2_{ρX}} are bounded by max{C0, C1} (r^2 s^θ + C2 Br ((Lr r + Mr) log(2/δ)/(√m s^{n+2}) + s^2 + λ) s^{−θ})   (12), if both (gz, fz) and (gλ, fλ) are in Fr for some r > 1. [sent-915, score-0.945]
99 By Lemmas 29 and 30 we can state that both {(gz, fz) ∈ Fr} and {(gλ, fλ) ∈ Fr} hold with probability at least 1 − δ if r^2 = max(c4, c5, 1)(1 + s^2/λ + (L_{λ,s}/√(λ s^{n+2}) + M_{λ,s}) log(2/δ) / (√m λ s^{n+2})). [sent-916, score-0.453]
100 Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. [sent-992, score-0.13]
wordName wordTfidf (topN-words)
[('gz', 0.465), ('fz', 0.453), ('errx', 0.4), ('ovariation', 0.173), ('hk', 0.157), ('ez', 0.156), ('ukherjee', 0.156), ('oordinate', 0.147), ('xm', 0.139), ('rs', 0.125), ('sn', 0.118), ('fr', 0.112), ('gene', 0.094), ('gradient', 0.094), ('cz', 0.09), ('lassification', 0.09), ('covariation', 0.083), ('rkhs', 0.081), ('br', 0.079), ('xs', 0.079), ('dimensions', 0.079), ('mukherjee', 0.077), ('mr', 0.074), ('mx', 0.068), ('sup', 0.067), ('aml', 0.066), ('dud', 0.065), ('logistic', 0.062), ('lr', 0.06), ('genes', 0.056), ('ck', 0.055), ('norm', 0.055), ('yi', 0.048), ('wi', 0.048), ('classi', 0.046), ('prob', 0.046), ('rm', 0.046), ('twenty', 0.045), ('ecm', 0.045), ('rn', 0.045), ('tzs', 0.043), ('pathway', 0.041), ('norms', 0.04), ('ms', 0.039), ('proposition', 0.039), ('coordinate', 0.038), ('vx', 0.037), ('duke', 0.037), ('egm', 0.037), ('xi', 0.036), ('expression', 0.036), ('covariance', 0.034), ('samples', 0.034), ('reproducing', 0.033), ('manifold', 0.033), ('salient', 0.033), ('covary', 0.032), ('kict', 0.032), ('qiang', 0.032), ('tusher', 0.032), ('ki', 0.032), ('excess', 0.031), ('slonim', 0.03), ('coordinates', 0.03), ('derivatives', 0.029), ('matrix', 0.029), ('theorem', 0.029), ('guyon', 0.029), ('golub', 0.029), ('taylor', 0.028), ('xz', 0.028), ('rfe', 0.028), ('sayan', 0.028), ('schoelkopf', 0.028), ('sf', 0.028), ('um', 0.028), ('selection', 0.027), ('sgn', 0.026), ('cation', 0.026), ('regularization', 0.026), ('dimensional', 0.026), ('evgeniou', 0.025), ('rademacher', 0.025), ('decay', 0.025), ('bm', 0.025), ('leukemia', 0.025), ('dt', 0.024), ('representer', 0.024), ('lemmas', 0.023), ('tibshirani', 0.023), ('mcdiarmid', 0.023), ('zhou', 0.022), ('dence', 0.022), ('risk', 0.022), ('dadr', 0.022), ('dtd', 0.022), ('explanatory', 0.022), ('hermes', 0.022), ('nonlinearly', 0.022), ('thirty', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
2 0.49557623 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
3 0.096002266 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
4 0.071299292 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
5 0.063100882 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
Author: Andrea Caponnetto, Alexander Rakhlin
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes
6 0.061479144 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
7 0.059801664 75 jmlr-2006-Policy Gradient in Continuous Time
8 0.057312079 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
9 0.055756748 93 jmlr-2006-Universal Kernels
10 0.051296264 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
11 0.050728902 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.047376629 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
14 0.043587558 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
15 0.042100985 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
16 0.040424887 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
17 0.039287318 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
18 0.038389809 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
19 0.038273521 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
20 0.036970589 44 jmlr-2006-Large Scale Transductive SVMs
topicId topicWeight
[(0, 0.26), (1, -0.111), (2, 0.023), (3, -0.51), (4, 0.221), (5, -0.448), (6, 0.162), (7, -0.051), (8, 0.231), (9, 0.015), (10, -0.071), (11, -0.081), (12, 0.093), (13, 0.052), (14, -0.006), (15, 0.045), (16, 0.032), (17, -0.003), (18, 0.085), (19, -0.021), (20, -0.094), (21, 0.018), (22, 0.023), (23, 0.031), (24, -0.045), (25, 0.035), (26, -0.019), (27, 0.029), (28, 0.037), (29, 0.025), (30, -0.02), (31, 0.001), (32, -0.019), (33, 0.032), (34, -0.044), (35, 0.041), (36, -0.006), (37, 0.048), (38, -0.015), (39, 0.001), (40, -0.035), (41, 0.03), (42, 0.023), (43, 0.025), (44, 0.013), (45, -0.008), (46, -0.026), (47, -0.028), (48, -0.006), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.92131054 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
2 0.90831846 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
3 0.25771427 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani
Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines
4 0.25644857 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
5 0.25041285 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
Author: Ross A. Lippert, Ryan M. Rifkin
Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting ˜ solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λσ−2p , the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines
7 0.18740323 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
8 0.1854711 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
9 0.18431783 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
10 0.18109234 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
11 0.17109466 75 jmlr-2006-Policy Gradient in Continuous Time
12 0.16653331 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
13 0.16651878 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
14 0.14349245 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
15 0.14092249 93 jmlr-2006-Universal Kernels
16 0.13992697 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
17 0.1387215 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
topicId topicWeight
[(8, 0.024), (35, 0.013), (36, 0.058), (45, 0.022), (46, 0.207), (50, 0.073), (63, 0.036), (76, 0.022), (77, 0.19), (78, 0.022), (81, 0.054), (84, 0.029), (90, 0.062), (91, 0.042), (96, 0.051)]
simIndex simValue paperId paperTitle
1 0.74707091 45 jmlr-2006-Learning Coordinate Covariances via Gradients
Author: Sayan Mukherjee, Ding-Xuan Zhou
Abstract: We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for the problem of variable selection as well as determining variable covariance is illustrated on simulated data as well as two gene expression data sets. For square loss we provide a very efficient implementation with respect to both memory and time. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds
same-paper 2 0.72180927 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
Author: Sayan Mukherjee, Qiang Wu
Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification
Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memoryefficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.1 Keywords: Markov networks, large-margin methods, structured prediction, extragradient, Bregman projections
4 0.44461423 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
5 0.43646285 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
7 0.43408537 62 jmlr-2006-MinReg: A Scalable Algorithm for Learning Parsimonious Regulatory Networks in Yeast and Mammals
8 0.4198254 66 jmlr-2006-On Model Selection Consistency of Lasso
9 0.41480902 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
10 0.41381159 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
11 0.41194105 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.41059858 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
13 0.40874931 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
14 0.40796423 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
15 0.40452296 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
16 0.40126801 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
17 0.39965358 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
18 0.3995153 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
19 0.39839196 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
20 0.39778143 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms