jmlr jmlr2009 jmlr2009-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Reference: text
sentIndex sentText sentNum sentScore
1 Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. [sent-10, score-0.685]
2 Specifically, we use a high dimensional Gaussian copula with nonparametric marginals, which we refer to as a nonparanormal distribution. [sent-19, score-0.739]
3 The nonparanormal extends additive models to the graphical model setting. [sent-30, score-0.716]
4 This semiparametric copula results in a nonparametric extension of the normal that we call the nonparanormal distribution. [sent-50, score-0.781]
5 The nonparanormal depends on the functions { f j }, and a mean µ and covariance matrix Σ, all of which are to be estimated from data. [sent-51, score-0.708]
6 We propose a nonparametric estimator for the functions { f j }, and show how the graphical lasso can be used to estimate the graph in the high dimensional setting. [sent-53, score-0.233]
7 The estimator Ω̂ can be computed efficiently using the glasso algorithm (Friedman et al., 2008). [sent-81, score-0.424]
8 We observe a new vector X ∼ P and define the prediction risk to be R(Σ) = −E[log p(X; µ0, Σ)] = −∫ log p(x; µ0, Σ) dP(x). [sent-90, score-0.284]
9 We present our theoretical analysis on these properties of the nonparanormal in Section 5. [sent-103, score-0.638]
10 X = (X1, . . . , Xp)^T has a nonparanormal distribution if there exist functions { fj }, j = 1, . . . , p, such that Z ≡ f(X) ∼ N(µ, Σ), where f(X) = ( f1(X1), . . . , fp(Xp)). [sent-108, score-0.638]
11 The joint density is pX(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)( f(x) − µ)^T Σ^{−1} ( f(x) − µ) } ∏_{j=1}^p | f′j(xj)| (Equation 2). Lemma 1 The nonparanormal distribution NPN(µ, Σ, f ) is a Gaussian copula when the fj ’s are monotone and differentiable. [sent-114, score-0.711]
12 The following basic fact says that the independence graph of the nonparanormal is encoded in Ω = Σ^{−1}, as for the parametric normal. [sent-142, score-0.671]
13 Lemma 2 If X ∼ NPN(µ, Σ, f ) is nonparanormal and each fj is differentiable, then Xi ⊥ Xj | X\{i, j} if and only if Ωij = 0, where Ω = Σ^{−1}. [sent-143, score-0.638]
14 Figure 2 shows three examples of 2-dimensional nonparanormal densities. [sent-162, score-0.638]
15 The ℓ1-regularized estimator is Ω̂n = arg min_Ω { tr(Ω Sn( f̂ )) − log |Ω| + λ ‖Ω‖₁ } (9), where λ is a regularization parameter and ‖Ω‖₁ = ∑_{j,k} |Ωjk|. [sent-208, score-0.31]
16 The nonparanormal is analogous to a sparse additive regression model (Ravikumar et al. [sent-210, score-0.689]
17 However, while sparse additive models use a regularized risk criterion to fit univariate transformations, our nonparanormal estimator uses a two-step procedure: the marginal functions fj are first estimated, and the graphical lasso is then applied to the transformed data. [sent-212, score-0.743]
18 The first step is non-iterative and computationally efficient, with no tuning parameters; it also makes the nonparanormal amenable to theoretical analysis. [sent-216, score-0.638]
19 Then for any ε ≥ CM √( log p log² n / n^{1/2} ) and sufficiently large n, we have P( max_{j,k} | Sn( f̂ )jk − Sn( f )jk | > 2ε ) ≤ 2/(π log(np)) + 2 exp( 2 log p − n^{1/2} ε² / (1232π² log² n) ) + 2 exp( 2 log p − n^{1/2}/(8π log n) ) + o(1). [sent-240, score-0.83]
20 Taking ε = CM √( log p log² n / n^{1/2} ) gives P( max_{j,k} | Sn( f̂ )jk − Sn( f )jk | > 2CM √( log p log² n / n^{1/2} ) ) = o(1); hence max_{j,k} | Sn( f̂ )jk − Sn( f )jk | = OP( √( log p log² n / n^{1/2} ) ). [sent-245, score-0.49]
21 Then the nonparanormal estimator Ω̂n of (9) satisfies ‖Ω̂n − Ω0‖_F = OP( √( (s + p) log p log² n / n^{1/2} ) ) and ‖Ω̂n − Ω0‖₂ = OP( √( s log p log² n / n^{1/2} ) ), where s ≡ Card({ (i, j) ∈ {1, . . . , p}² : Ω0(i, j) ≠ 0, i ≠ j }) is the number of nonzero off-diagonal entries of Ω0. [sent-254, score-0.976]
22 Then the nonparanormal estimator Ω̂n satisfies P( G(Ω̂n, Ω0) ) ≥ 1 − o(1), where G(Ω̂n, Ω0) is the event that sign( Ω̂n( j, k) ) = sign( Ω0( j, k) ) for all ( j, k) ∈ S. [sent-296, score-0.706]
23 The definition in this theorem uses the fact (from Lemma 11) that sup_x |Φ^{−1}( F̃j(x) )| ≤ √(2 log n) when δn = 1/( 4n^{1/4} √(π log n) ). [sent-299, score-0.284]
24 In the next theorem, we do not assume the true model is nonparanormal and define the population and sample risks as R( f, Ω) = (1/2){ tr( Ω E[ f(X) f(X)^T ] ) − log |Ω| − p log(2π) } and R̂( f, Ω) = (1/2){ tr[ Ω Sn( f ) ] − log |Ω| − p log(2π) }. [sent-300, score-0.944]
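To make the sample risk above concrete, here is a minimal Python sketch (a sketch only; the function name is ours, and the additive constant in p log(2π) does not affect which (f, Ω) minimizes the risk):

```python
# Minimal sketch of the sample risk defined above:
#   R_hat(f, Omega) = 0.5 * ( tr(Omega @ S) - log|Omega| - p * log(2*pi) ),
# where S is the sample covariance of the transformed data f(X).
import numpy as np

def sample_risk(Omega, S):
    p = S.shape[0]
    sign, logdet = np.linalg.slogdet(Omega)
    if sign <= 0:
        raise ValueError("Omega must be positive definite")
    return 0.5 * (np.trace(Omega @ S) - logdet - p * np.log(2 * np.pi))
```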
25 Theorem 8 Suppose that p ≤ e^{n^ξ} for some ξ < 1, and define the classes Mn = { f : R → R : f is monotone with ‖ f ‖_∞ ≤ C √(log n) } and Cn = { Ω : ‖Ω‖₁ ≤ Ln }. [sent-301, score-0.386]
26 Then R( f̂n, Ω̂n ) − inf_{( f, Ω) ∈ Mn^p ⊕ Cn} R( f, Ω) = OP( Ln √( log n / n^{1−ξ} ) ); hence the Winsorized estimator of ( f, Ω) with δn = 1/( 4n^{1/4} √(π log n) ) is persistent whenever Ln = o( n^{(1−ξ)/2} / √(log n) ). [sent-303, score-0.251]
27 We mainly compare the ℓ1-regularized nonparanormal and Gaussian (paranormal) models, computed using the graphical lasso algorithm (glasso) of Friedman et al. (2008). [sent-308, score-0.741]
28 Note that we can reuse the glasso implementation to fit a sparse nonparanormal. [sent-311, score-0.392]
29 In particular, after computing the Winsorized sample covariance Sn( f̂ ), we pass this matrix to the glasso routine to carry out the optimization Ω̂n = arg min_Ω { tr(Ω Sn( f̂ )) − log |Ω| + λn ‖Ω‖₁ }. [sent-312, score-0.585]
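Since this two-step procedure underlies all of the experiments, here is a minimal Python sketch of it. It is a sketch under stated assumptions: scikit-learn's graphical_lasso stands in for the glasso routine, the Winsorization level follows the δn = 1/(4n^{1/4}√(π log n)) choice quoted above, the marginal mean and scale factors of f̂j are omitted for brevity, and the helper names are ours, not the paper's.

```python
# Sketch of the two-step nonparanormal estimator: (1) Gaussianize each column with a
# Winsorized empirical CDF, (2) run the graphical lasso on the resulting covariance.
import numpy as np
from scipy.stats import norm
from sklearn.covariance import graphical_lasso

def winsorized_normal_scores(X):
    """Step 1: estimate each f_j via a Winsorized empirical CDF and map through Phi^{-1}."""
    n, p = X.shape
    delta_n = 1.0 / (4.0 * n**0.25 * np.sqrt(np.pi * np.log(n)))
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1       # ranks 1..n (ties broken arbitrarily)
        F_hat = ranks / float(n)                          # empirical CDF at the sample points
        F_tilde = np.clip(F_hat, delta_n, 1.0 - delta_n)  # Winsorize away from 0 and 1
        Z[:, j] = norm.ppf(F_tilde)                       # f_hat_j(x) = Phi^{-1}(F_tilde_j(x))
    return Z

def nonparanormal_glasso(X, lam):
    """Step 2: pass the Winsorized sample covariance S_n(f_hat) to the graphical lasso."""
    Z = winsorized_normal_scores(X)
    S = np.cov(Z, rowvar=False)
    _, Omega_hat = graphical_lasso(S, alpha=lam)          # returns (covariance, precision)
    return Omega_hat
```

The nonzero pattern of the returned precision matrix then gives the estimated graph, exactly as in the Gaussian case.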
30 To sample data from the nonparanormal distribution, we also require g ≡ f −1 ; two different transformations g are employed. [sent-345, score-0.677]
31 It can be seen how the cdf and power transformations map a univariate normal distribution into a highly skewed and a bi-modal distribution, respectively. [sent-387, score-0.229]
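A hedged sketch of such a sampler: draw Z ∼ N(µ, Σ) and apply g component-wise. The two transforms below are illustrative stand-ins for the paper's cdf and power transformations, whose exact parameterizations may differ.

```python
# Sketch of sampling from a nonparanormal: X = g(Z) with Z ~ N(mu, Sigma), g = f^{-1}.
import numpy as np
from scipy.stats import norm

def sample_nonparanormal(n, mu, Sigma, g, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mu, Sigma, size=n)  # latent Gaussian vectors
    return g(Z)                                     # monotone transform applied element-wise

def g_cdf(z):
    # Gaussian-cdf style transform: squashes the latent variable into (0, 1),
    # giving skewed marginals when the latent mean is away from zero.
    return norm.cdf(z)

def g_power(z, alpha=3.0):
    # Odd power transform; keeps monotonicity while reshaping the marginals.
    return np.sign(z) * np.abs(z) ** alpha
```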
32 (Figure residue: panels labeled "cdf", "power", and the glasso regularization path.) [sent-388, score-0.579]
33 Figure 4: Regularization paths for the glasso and nonparanormal with n = 500 (top) and n = 200 (bottom). [sent-528, score-1.046]
34 For non-Gaussian distributions, the nonparanormal better separates the relevant and irrelevant dimensions. [sent-530, score-0.657]
35 In each case, both the glasso and the nonparanormal are applied to estimate the graph. [sent-533, score-1.008]
36 For the cdf transformation and the power transformation, the nonparanormal separates the relevant and the irrelevant dimensions very well. [sent-540, score-0.861]
37 This is intuitive due to the high skewness of the nonparanormal distribution in this case. [sent-547, score-0.638]
38 The curves clearly show how the performance of both methods improves with sample size, and that the nonparanormal is superior to the Gaussian model in most cases. [sent-585, score-0.653]
39 (Figure residue: ROC-style axis 1−FP with curves for the nonparanormal and the glasso.) [sent-605, score-0.74]
40 It’s clear from the tables that the nonparanormal achieves significantly smaller errors than the glasso if the true distribution of the data is not multivariate Gaussian and achieves performance comparable to the glasso when the true distribution is exactly multivariate Gaussian. [sent-684, score-1.432]
41 It’s clear that when the glasso estimates the graph incorrectly, the mistakes include both false positives and negatives. [sent-686, score-0.403]
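One concrete way to tabulate these mistakes is to compare the supports of the estimated and true precision matrices. The sketch below assumes FPE and FNE are the fractions of erroneous edges among true non-edges and true edges, respectively; the paper's exact normalization may differ.

```python
# Sketch: false positive / false negative edge rates of an estimated graph.
# An edge (i, j), i < j, is declared present when |Omega[i, j]| exceeds a small tolerance.
import numpy as np

def edge_errors(Omega_hat, Omega_true, tol=1e-8):
    p = Omega_true.shape[0]
    iu = np.triu_indices(p, k=1)            # strictly upper-triangular index pairs
    est = np.abs(Omega_hat[iu]) > tol
    tru = np.abs(Omega_true[iu]) > tol
    fp = np.sum(est & ~tru)                 # edges claimed but absent in the true graph
    fn = np.sum(~est & tru)                 # true edges that were missed
    fpe = fp / max(np.sum(~tru), 1)         # assumed normalization: per true non-edge
    fne = fn / max(np.sum(tru), 1)          # assumed normalization: per true edge
    return fpe, fne
```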
42 (Table header: for each n, FPE (sd(FPE)) and FNE (sd(FNE)) for the nonparanormal and the glasso.) [sent-687, score-0.37]
43 For both FPE and FNE, the nonparanormal performs much better in general. [sent-760, score-0.638]
44 (Table header: for each n, FPE (sd(FPE)) and FNE (sd(FNE)) for the nonparanormal and the glasso.) [sent-761, score-0.37]
45 For both FPE and FNE, the nonparanormal performs much better in general. [sent-834, score-0.638]
46 Comparison in the Gaussian Case: the previous experiments indicate that the nonparanormal works almost as well as the glasso in the Gaussian case. [sent-837, score-1.008]
47 For multivariate Gaussian models, Figure 9 shows results with (n, p, s) = (50, 40, 1/8) and (50, 100, 1/15). (Table header: for each n, FPE (sd(FPE)) and FNE (sd(FNE)) for the nonparanormal and the glasso.) [sent-840, score-0.397]
48 The two methods behave similarly; the glasso is slightly better. [sent-913, score-0.37]
49 From the mean ROC curves, we see that the nonparanormal does indeed behave worse than the glasso, suggesting some efficiency loss. [sent-915, score-0.638]
50 The Case When p ≫ n: Figure 10 shows results from a simulation of the nonparanormal using cdf transformations with n = 200, p = 500, and sparsity level s = 1/40. [sent-919, score-0.833]
51 The boxplot shows that the nonparanormal outperforms the glasso. [sent-920, score-0.638]
52 A typical run of the regularization paths confirms this conclusion, showing that the nonparanormal path separates the relevant and irrelevant dimensions very well. [sent-921, score-0.759]
53 In contrast, with the glasso the relevant variables are “buried” among the irrelevant variables. [sent-922, score-0.389]
54 A subset of 40 genes from the isoprenoid pathway is chosen, and we study the associations among them using both the paranormal and nonparanormal models. [sent-928, score-0.682]
55 For this data set (Wille et al., 2004), our study shows that the results of the nonparanormal and the glasso are very different over a wide range of regularization parameters. [sent-930, score-1.035]
56 This suggests the nonparanormal could support different scientific conclusions. [sent-931, score-0.638]
57 The dashed (black) lines in the symmetric difference plots indicate edges found by the glasso but not the nonparanormal, and vice-versa for the solid (red) lines. [sent-1123, score-0.399]
58 For each λ, we show the estimated graph from the nonparanormal in the first column. [sent-1134, score-0.69]
59 Figure 10: For the cdf transformation with n = 200, p = 500, s = 1/40, comparison of the boxplots and a typical run of the regularization paths. [sent-1199, score-0.249]
60 The nonparanormal paths separate the relevant from the irrelevant dimensions well. [sent-1200, score-0.695]
61 The closest glasso graph is found by searching over the regularization path of the glasso fit and finding the graph having the smallest symmetric difference with the nonparanormal graph. [sent-1202, score-0.507]
62 The closest glasso fit is different, with edges selected by the glasso not selected by the nonparanormal, and vice-versa. [sent-1204, score-0.754]
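A sketch of that comparison, reusing the conventions of the earlier snippets (sweep the glasso path and keep the graph whose edge set has the smallest symmetric difference with the nonparanormal graph; the helper names are ours):

```python
# Sketch: find the glasso graph along a regularization path that is closest, in
# symmetric difference of edge sets, to a given nonparanormal graph.
import numpy as np
from sklearn.covariance import graphical_lasso

def edge_set(Omega, tol=1e-8):
    p = Omega.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(Omega[i, j]) > tol}

def closest_glasso_graph(S_gaussian, npn_edges, lambdas):
    best = None
    for lam in lambdas:
        _, Omega_lam = graphical_lasso(S_gaussian, alpha=lam)
        diff = len(edge_set(Omega_lam) ^ npn_edges)     # symmetric difference size
        if best is None or diff < best[0]:
            best = (diff, lam, Omega_lam)
    return best  # (symmetric-difference size, lambda, estimated precision matrix)
```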
63 (Figure residue: the nonparanormal regularization path.) [sent-1207, score-0.675]
64 Lemma 12 (Distribution function of the transformed random variable) For any α ∈ (−∞, ∞), Φ^{−1}( Fj( gj( α √(log n) ) ) ) = α √(log n). [sent-1242, score-0.284]
65 Figure 12: The nonparanormal estimated graph for three values of λ (left column), [sent-1250, score-0.69]
66 the closest glasso estimated graph from the full path (middle), and the symmetric difference graph (right); one of the values shown is λ = 0.30857. [sent-1253, score-0.507]
67 Proof Using Mill's inequality, we have P( max_{1≤i≤n} Wi > √(α log n) ) ≤ ∑_{i=1}^n P( Wi > √(α log n) ) ≤ n φ(√(α log n)) / √(α log n) = 1/( n^{α/2−1} √(2πα log n) ), from which the result follows. [sent-1271, score-0.875]
68 Lemma 14 For any α > 0 that satisfies 1 − δn − Φ(√(α log n)) > 0 for all n, we have P( F̂j( gj(√(α log n)) ) > 1 − δn ) ≤ exp( −2n ( 1 − δn − Φ(√(α log n)) )² ) [sent-1272, score-0.448]
69 (13) and P( F̂j( gj(−√(α log n)) ) < δn ) ≤ exp( −2n ( 1 − δn − Φ(√(α log n)) )² ). [sent-1273, score-0.306]
70 (14) Proof Using Hoeffding's inequality, P( F̂j( gj(√(α log n)) ) > 1 − δn ) = P( F̂j( gj(√(α log n)) ) − Fj( gj(√(α log n)) ) > 1 − δn − Fj( gj(√(α log n)) ) ) ≤ exp( −2n ( 1 − δn − Fj( gj(√(α log n)) ) )² ). [sent-1274, score-0.732]
71 We split the interval [ gj(−√(M log n)), gj(√(M log n)) ] into two parts: the middle Mn ≡ [ gj(−√(β log n)), gj(√(β log n)) ] and the ends En ≡ [ gj(−√(M log n)), gj(−√(β log n)) ] ∪ [ gj(√(β log n)), gj(√(M log n)) ]. [sent-1278, score-1.136]
72 Lemma 16 For all n, we have sup_{t∈En} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| < √( 2(M + 2) log n ) for all j ∈ {1, . . . , p}. [sent-1288, score-0.222]
73 Proof From Lemma 12 and the definition of En, we have sup_{t∈En} |Φ^{−1}( Fj(t) )| ∈ [ 0, √(M log n) ]. [sent-1292, score-0.222]
74 Therefore, from Equation (11), sup_{t∈En} |Φ^{−1}( F̃j(t) )| ∈ [ 0, √(2 log n) ]. The result follows from the triangle inequality, since √(M log n) + √(2 log n) ≤ √( 2(M + 2) log n ). [sent-1294, score-0.222]
75 Let ∆i( j, k) ≡ f̂j(Xij) f̂k(Xik) − fj(Xij) fk(Xik) and Θt,s( j, k) ≡ f̂j(t) f̂k(s) − fj(t) fk(s). [sent-1299, score-0.524]
76 Such a θ1 guarantees that nε/(4θ1) − nA √(log n)/n^β = nA √(log n)/n^β > 0. [sent-1320, score-0.284]
77 Using Bernstein's inequality, for β = 1/2, P( (1/n) ∑_{i=1}^n 1{Xij ∈ En, Xik ∈ En} > ε/(4θ1) ) ≤ P( ∑_{i=1}^n ( 1{Xij ∈ En} − P(X1j ∈ En) ) > nA √(log n)/n^β ) ≤ exp( − c1 n^{2−β} log n / ( c2 n^{1−β/2} √(log n) + c3 n^{1−β/2} √(log n) ) ), where c1, c2, c3 > 0 are generic constants. [sent-1322, score-0.59]
78 = o(1). Now, we analyze the first term: P( max_{j,k} sup_{t∈En, s∈En} |Θt,s( j, k)| > θ1 ) ≤ p² P( sup_{t∈En, s∈En} |Θt,s( j, k)| > θ1 ) = p² P( sup_{t∈En, s∈En} | f̂j(t) f̂k(s) − fj(t) fk(s) | > θ1 ). [sent-1324, score-0.525]
79 we have θ1/3 = ε n^{β/2}/(24A √(log n)) ≥ CM √(log p log² n) / (24A √(log n)) ≥ √( 2(M + 2) log n ). [sent-1327, score-0.284]
80 This implies that θ1 ≥ 3 √( 2(M + 2) log n ) and θ1 ≥ 3 √(M log n) √( 2(M + 2) log n ). [sent-1328, score-0.71]
81 Using Lemma 14, we have p² P( F̂j( gj(√(β log n)) ) > 1 − δn ) ≤ p² exp( −2n δn² ) = exp( 2 log p − n^{1−β}/(16πβ log n) ) (15) and p² P( F̂j( gj(−√(β log n)) ) < δn ) ≤ exp( 2 log p − n^{1−β}/(16πβ log n) ). [sent-1340, score-0.918]
82 From (15) and (16), it is easy to see that P(Bn^c) ≤ 2 exp( 2 log p − n^{1/2}/(8π log n) ). [sent-1345, score-0.306]
83 From the definition of f̂j, we have p² P( sup_{t∈Mn} | f̂j(t) − fj(t) | > ε/(12 √(β log n)) ) ≤ p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)), Bn ) + P(Bn^c). [sent-1346, score-0.524]
84 ≤ p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)), Bn ) + 2 exp( 2 log p − n^{1/2}/(8π log n) ). [sent-1347, score-0.59]
85 Define T1n ≡ max{ F̃j( gj(√(β log n)) ), 1 − δn } and T2n ≡ 1 − min{ F̃j( gj(−√(β log n)) ), δn }. From Equation (12) and the fact that 1 − δn ≥ Φ(√(β log n)), we have that T1n = T2n = 1 − δn. [sent-1348, score-0.307]
86 Thus, by the mean value theorem, P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ P( (Φ^{−1})′( max{T1n, T2n} ) sup_{t∈Mn} | F̃j(t) − Fj(t) | > ε/(12 √(β log n)) ) = P( (Φ^{−1})′(1 − δn) sup_{t∈Mn} | F̃j(t) − Fj(t) | > ε/(12 √(β log n)) ). [sent-1349, score-0.666]
87 Finally, using the Dvoretzky-Kiefer-Wolfowitz inequality, P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ P( sup_{t∈Mn} | F̃j(t) − Fj(t) | > ε/( 12 (Φ^{−1})′(1 − δn) √(β log n) ) ) ≤ 2 exp( − 2nε² / ( 144 β log n [ (Φ^{−1})′(1 − δn) ]² ) ). [sent-1351, score-0.608]
88 Furthermore, by Lemma 11, (Φ^{−1})′(1 − δn) = 1/φ( Φ^{−1}(1 − δn) ) ≤ 1/φ( √(2 log(1/δn)) ) = √(2π)/δn = 8π n^{β/2} √(β log n). [sent-1352, score-0.284]
89 This implies that p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ 2 exp( 2 log p − n^{1/2} ε² / (1232 π² log² n) ). [sent-1353, score-0.386]
90 In summary, we have P( max_{j,k} (1/n) ∑_{i: Xij ∈ Mn, Xik ∈ En} |∆i( j, k)| > ε/4 ) ≤ 2 exp( 2 log p − n^{1/2} ε² / (1232 π² log² n) ) + 2 exp( 2 log p − n^{1/2}/(8π log n) ). This finishes the proof. [sent-1354, score-0.328]
91 Proof of Theorem 8 First note that the population and sample risks are R( f, Ω) = (1/2){ tr( Ω E[ f(X) f(X)^T ] ) − log |Ω| − p log(2π) } and R̂( f, Ω) = (1/2){ tr[ Ω Sn( f ) ] − log |Ω| − p log(2π) }. [sent-1357, score-0.306]
92 Therefore, for all ( f, Ω) ∈ Mn^p ⊕ Cn, we have |R̂( f, Ω) − R( f, Ω)| = (1/2) | tr( Ω ( E[ f f^T ] − Sn( f ) ) ) | ≤ (1/2) ‖Ω‖₁ max_{j,k} sup_{fj, fk ∈ Mn} | E[ fj(Xj) fk(Xk) ] − Sn( f )jk | ≤ (Ln/2) max_{j,k} sup_{fj, fk ∈ Mn} | E[ fj(Xj) fk(Xk) ] − Sn( f )jk |. [sent-1358, score-0.685]
93 Now, if F is a class of functions, we have E[ sup_{g∈F} | µ̂(g) − µ(g) | ] ≤ C J_{[]}( ‖F‖_∞, F ) / √n (17) for some C > 0, where F(x) = sup_{g∈F} |g(x)|, µ(g) = E[g(X)] and µ̂(g) = n^{−1} ∑_{i=1}^n g(Xi) (see Corollary 19. [sent-1359, score-0.418]
94 Here the bracketing integral is defined to be J_{[]}(δ, F) = ∫_0^δ √( log N_{[]}(u, F) ) du, where log N_{[]}(ε, F) is the bracketing entropy. [sent-1361, score-0.342]
95 Then the bracketing entropy satisfies log N_{[]}( u, Pn,p ) ≤ 2 log p + K C √(log n)/u and the bracketing integral satisfies J_{[]}( C √(log n), Pn,p ) = O( √(log n log p) ). [sent-1367, score-0.91]
96 It follows from (17) and Markov's inequality that max_{j,k} sup_{fj, fk ∈ Mn} | Sn( f )jk − E[ fj(Xj) fk(Xk) ] | = OP( √( log n log p / n ) ); therefore, sup_{( f, Ω) ∈ Mn^p ⊕ Cn} |R̂( f, Ω) − R( f, Ω)| = OP( Ln √( log n log p / n ) ). [sent-1368, score-1.001]
97 = OP( Ln √(log n) / n^{(1−ξ)/2} ) = OP( Ln √( log n / n^{1−ξ} ) ), since log p ≤ n^ξ. [sent-1370, score-0.568]
98 The nonparanormal can be viewed as an extension of sparse additive models to the setting of graphical models. [sent-1373, score-0.738]
99 Computationally, fitting a high dimensional nonparanormal is no more difficult than estimating a multivariate Gaussian, and indeed one can exploit existing software for the graphical lasso. [sent-1379, score-0.731]
100 Our experimental results indicate that the sparse nonparanormal can give very different results than a sparse Gaussian graphical model. [sent-1380, score-0.731]
wordName wordTfidf (topN-words)
[('nonparanormal', 0.638), ('fj', 0.404), ('glasso', 0.37), ('en', 0.229), ('log', 0.142), ('cdf', 0.14), ('fk', 0.131), ('afferty', 0.117), ('mn', 0.11), ('fne', 0.11), ('onparanormal', 0.11), ('wasserman', 0.105), ('iu', 0.099), ('fpe', 0.093), ('sup', 0.08), ('sn', 0.079), ('fp', 0.069), ('ravikumar', 0.067), ('jk', 0.065), ('oracle', 0.06), ('copula', 0.058), ('fn', 0.055), ('estimator', 0.054), ('lasso', 0.054), ('covariance', 0.051), ('graphical', 0.049), ('op', 0.046), ('sd', 0.045), ('rothman', 0.041), ('winsorized', 0.041), ('semiparametric', 0.041), ('transformations', 0.039), ('transform', 0.039), ('paths', 0.038), ('path', 0.037), ('larry', 0.034), ('npn', 0.034), ('gaussian', 0.034), ('graph', 0.033), ('transformation', 0.032), ('power', 0.032), ('boxplots', 0.031), ('additive', 0.029), ('bracketing', 0.029), ('xik', 0.029), ('coefficient', 0.027), ('paranormal', 0.027), ('regularization', 0.027), ('multivariate', 0.027), ('nonparametric', 0.026), ('undirected', 0.025), ('persistency', 0.023), ('max', 0.023), ('sparse', 0.022), ('omparison', 0.022), ('consistency', 0.022), ('tr', 0.022), ('exp', 0.022), ('klaassen', 0.021), ('score', 0.02), ('ij', 0.02), ('bn', 0.02), ('lambda', 0.019), ('cm', 0.019), ('ik', 0.019), ('estimated', 0.019), ('irrelevant', 0.019), ('normal', 0.018), ('lemma', 0.018), ('isoprenoid', 0.017), ('wille', 0.017), ('pradeep', 0.017), ('lafferty', 0.017), ('han', 0.017), ('dimensional', 0.017), ('sparsity', 0.016), ('cn', 0.016), ('np', 0.016), ('parallels', 0.016), ('subtle', 0.016), ('symmetric', 0.015), ('microarray', 0.015), ('monotone', 0.015), ('density', 0.015), ('der', 0.015), ('curves', 0.015), ('ln', 0.015), ('event', 0.014), ('persistent', 0.014), ('vaart', 0.014), ('wellner', 0.014), ('edges', 0.014), ('graphs', 0.014), ('xi', 0.014), ('abramovich', 0.014), ('mill', 0.014), ('sparsistent', 0.014), ('stimated', 0.014), ('tsukahara', 0.014), ('winsorization', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
3 0.066515863 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
4 0.045343906 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
Author: Sylvain Arlot, Pascal Massart
Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram
5 0.04377462 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras
Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC
6 0.042388145 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
7 0.04155729 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
8 0.035395078 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
9 0.033063829 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
10 0.031190328 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
11 0.030743968 46 jmlr-2009-Learning Halfspaces with Malicious Noise
12 0.030588953 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
13 0.029792065 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
14 0.028645994 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
15 0.028634639 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
16 0.027323795 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
17 0.026645236 16 jmlr-2009-Classification with Gaussians and Convex Loss
18 0.023814257 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
19 0.023258146 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
20 0.022642605 23 jmlr-2009-Discriminative Learning Under Covariate Shift
topicId topicWeight
[(0, 0.121), (1, -0.033), (2, 0.072), (3, 0.045), (4, -0.093), (5, 0.008), (6, -0.112), (7, -0.032), (8, -0.035), (9, 0.084), (10, 0.096), (11, 0.012), (12, -0.306), (13, 0.118), (14, 0.158), (15, -0.071), (16, -0.043), (17, 0.174), (18, -0.021), (19, -0.079), (20, -0.138), (21, 0.096), (22, -0.086), (23, -0.168), (24, 0.217), (25, 0.057), (26, 0.028), (27, -0.004), (28, -0.105), (29, 0.041), (30, 0.007), (31, 0.203), (32, 0.009), (33, 0.129), (34, -0.051), (35, -0.055), (36, -0.087), (37, -0.001), (38, 0.008), (39, -0.001), (40, 0.154), (41, -0.012), (42, -0.012), (43, 0.012), (44, 0.108), (45, -0.136), (46, -0.067), (47, 0.088), (48, 0.032), (49, -0.19)]
simIndex simValue paperId paperTitle
same-paper 1 0.93090165 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
3 0.3597292 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
4 0.27941266 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
Author: Sylvain Arlot, Pascal Massart
Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram
5 0.20919675 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
6 0.19719334 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
7 0.18423809 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
8 0.17639431 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
9 0.1696412 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
10 0.15997693 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
11 0.15577269 46 jmlr-2009-Learning Halfspaces with Malicious Noise
12 0.15339474 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
13 0.14997414 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
15 0.1429297 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
16 0.1398547 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
17 0.13705696 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
18 0.13544345 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
19 0.13117228 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
20 0.12691468 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
topicId topicWeight
[(6, 0.389), (8, 0.031), (11, 0.046), (19, 0.017), (21, 0.012), (26, 0.014), (38, 0.019), (45, 0.012), (47, 0.019), (52, 0.036), (55, 0.055), (58, 0.029), (66, 0.103), (68, 0.034), (90, 0.05), (96, 0.035)]
simIndex simValue paperId paperTitle
1 0.78208333 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths
Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin
Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths
same-paper 2 0.68710846 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Author: Junning Li, Z. Jane Wang
Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
4 0.34457996 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
5 0.34415948 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
6 0.34157556 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
7 0.34130475 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
8 0.34114707 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
9 0.33492774 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
10 0.33486262 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
11 0.33360684 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
12 0.33288604 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
13 0.33281177 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
14 0.33106431 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
15 0.32967886 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
16 0.32929888 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
17 0.32793635 18 jmlr-2009-Consistency and Localizability
18 0.32663625 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
19 0.32630956 78 jmlr-2009-Refinement of Reproducing Kernels
20 0.32546458 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis