jmlr jmlr2009 jmlr2009-94 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Reference: text
sentIndex sentText sentNum sentScore
1 Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. [sent-10, score-0.685]
2 Specifically, we use a high dimensional Gaussian copula with nonparametric marginals, which we refer to as a nonparanormal distribution. [sent-19, score-0.739]
3 The nonparanormal extends additive models to the graphical model setting. [sent-30, score-0.716]
4 This semiparametric copula results in a nonparametric extension of the normal that we call the nonparanormal distribution. [sent-50, score-0.781]
5 The nonparanormal depends on the functions { f j }, and a mean µ and covariance matrix Σ, all of which are to be estimated from data. [sent-51, score-0.708]
6 We propose a nonparametric estimator for the functions { f j }, and show how the graphical lasso can be used to estimate the graph in the high dimensional setting. [sent-53, score-0.233]
7 The estimator Ω̂ can be computed efficiently using the glasso algorithm (Friedman et al., 2008). [sent-81, score-0.424]
8 We observe a new vector X ∼ P and define the prediction risk to be R(Σ) = −E[log p(X; µ0, Σ)] = −∫ log p(x; µ0, Σ) dP(x). [sent-90, score-0.284]
9 We present our theoretical analysis on these properties of the nonparanormal in Section 5. [sent-103, score-0.638]
10 X = (X1, . . . , Xp)^T has a nonparanormal distribution if there exist functions { fj }, j = 1, . . . , p, such that Z ≡ f(X) ∼ N(µ, Σ), where f(X) = ( f1(X1), . . . , fp(Xp)). [sent-108, score-0.638]
11 The joint density is pX(x) = (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)( f(x) − µ)^T Σ^{−1} ( f(x) − µ) } ∏_{j=1}^p | f′j(xj)| (Equation 2). Lemma 1 The nonparanormal distribution NPN(µ, Σ, f ) is a Gaussian copula when the fj ’s are monotone and differentiable. [sent-114, score-0.711]
12 The following basic fact says that the independence graph of the nonparanormal is encoded in Ω = Σ^{−1}, as for the parametric normal. [sent-142, score-0.671]
13 Lemma 2 If X ∼ NPN(µ, Σ, f ) is nonparanormal and each fj is differentiable, then Xi ⊥ Xj | X\{i, j} if and only if Ωij = 0, where Ω = Σ^{−1}. [sent-143, score-0.638]
14 Figure 2 shows three examples of 2-dimensional nonparanormal densities. [sent-162, score-0.638]
15 The ℓ1-regularized estimator is Ω̂n = arg min_Ω { tr(Ω Sn( f̂ )) − log |Ω| + λ ‖Ω‖₁ } (9), where λ is a regularization parameter and ‖Ω‖₁ = ∑_{j,k} |Ωjk|. [sent-208, score-0.31]
16 The nonparanormal is analogous to a sparse additive regression model (Ravikumar et al. [sent-210, score-0.689]
17 However, while sparse additive models use a regularized risk criterion to fit univariate transformations, our nonparanormal estimator uses a two-step procedure: the marginal functions fj are first estimated, and the graphical lasso is then applied to the transformed data. [sent-212, score-0.743]
18 The first step is non-iterative and computationally efficient, with no tuning parameters; it also makes the nonparanormal amenable to theoretical analysis. [sent-216, score-0.638]
19 Then for any ε ≥ CM √( log p log² n / n^{1/2} ) and sufficiently large n, we have P( max_{j,k} | Sn( f̂ )jk − Sn( f )jk | > 2ε ) ≤ 2/(π log(np)) + 2 exp( 2 log p − n^{1/2} ε² / (1232π² log² n) ) + 2 exp( 2 log p − n^{1/2}/(8π log n) ) + o(1). [sent-240, score-0.83]
20 Taking ε = CM √( log p log² n / n^{1/2} ) gives P( max_{j,k} | Sn( f̂ )jk − Sn( f )jk | > 2CM √( log p log² n / n^{1/2} ) ) = o(1); hence max_{j,k} | Sn( f̂ )jk − Sn( f )jk | = OP( √( log p log² n / n^{1/2} ) ). [sent-245, score-0.49]
21 Then the nonparanormal estimator Ω̂n of (9) satisfies ‖Ω̂n − Ω0‖_F = OP( √( (s + p) log p log² n / n^{1/2} ) ) and ‖Ω̂n − Ω0‖₂ = OP( √( s log p log² n / n^{1/2} ) ), where s ≡ Card({ (i, j) ∈ {1, . . . , p}² : Ω0(i, j) ≠ 0, i ≠ j }) is the number of nonzero off-diagonal entries of Ω0. [sent-254, score-0.976]
22 Then the nonparanormal estimator Ω̂n satisfies P( G(Ω̂n, Ω0) ) ≥ 1 − o(1), where G(Ω̂n, Ω0) is the event that sign( Ω̂n( j, k) ) = sign( Ω0( j, k) ) for all ( j, k) ∈ S. [sent-296, score-0.706]
23 The definition in this theorem uses the fact (from Lemma 11) that sup_x |Φ^{−1}( F̃j(x) )| ≤ √(2 log n) when δn = 1/( 4n^{1/4} √(π log n) ). [sent-299, score-0.284]
24 In the next theorem, we do not assume the true model is nonparanormal and define the population and sample risks as R( f, Ω) = (1/2){ tr( Ω E[ f(X) f(X)^T ] ) − log |Ω| − p log(2π) } and R̂( f, Ω) = (1/2){ tr[ Ω Sn( f ) ] − log |Ω| − p log(2π) }. [sent-300, score-0.944]
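To make the sample risk above concrete, here is a minimal Python sketch (a sketch only; the function name is ours, and the additive constant in p log(2π) does not affect which (f, Ω) minimizes the risk):

```python
# Minimal sketch of the sample risk defined above:
#   R_hat(f, Omega) = 0.5 * ( tr(Omega @ S) - log|Omega| - p * log(2*pi) ),
# where S is the sample covariance of the transformed data f(X).
import numpy as np

def sample_risk(Omega, S):
    p = S.shape[0]
    sign, logdet = np.linalg.slogdet(Omega)
    if sign <= 0:
        raise ValueError("Omega must be positive definite")
    return 0.5 * (np.trace(Omega @ S) - logdet - p * np.log(2 * np.pi))
```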
25 Theorem 8 Suppose that p ≤ e^{n^ξ} for some ξ < 1, and define the classes Mn = { f : R → R : f is monotone with ‖ f ‖_∞ ≤ C √(log n) } and Cn = { Ω : ‖Ω‖₁ ≤ Ln }. [sent-301, score-0.386]
26 Then R( f̂n, Ω̂n ) − inf_{( f, Ω) ∈ Mn^p ⊕ Cn} R( f, Ω) = OP( Ln √( log n / n^{1−ξ} ) ); hence the Winsorized estimator of ( f, Ω) with δn = 1/( 4n^{1/4} √(π log n) ) is persistent whenever Ln = o( n^{(1−ξ)/2} / √(log n) ). [sent-303, score-0.251]
27 We mainly compare the ℓ1-regularized nonparanormal and Gaussian (paranormal) models, computed using the graphical lasso algorithm (glasso) of Friedman et al. (2008). [sent-308, score-0.741]
28 Note that we can reuse the glasso implementation to fit a sparse nonparanormal. [sent-311, score-0.392]
29 In particular, after computing the Winsorized sample covariance Sn( f̂ ), we pass this matrix to the glasso routine to carry out the optimization Ω̂n = arg min_Ω { tr(Ω Sn( f̂ )) − log |Ω| + λn ‖Ω‖₁ }. [sent-312, score-0.585]
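Since this two-step procedure underlies all of the experiments, here is a minimal Python sketch of it. It is a sketch under stated assumptions: scikit-learn's graphical_lasso stands in for the glasso routine, the Winsorization level follows the δn = 1/(4n^{1/4}√(π log n)) choice quoted above, the marginal mean and scale factors of f̂j are omitted for brevity, and the helper names are ours, not the paper's.

```python
# Sketch of the two-step nonparanormal estimator: (1) Gaussianize each column with a
# Winsorized empirical CDF, (2) run the graphical lasso on the resulting covariance.
import numpy as np
from scipy.stats import norm
from sklearn.covariance import graphical_lasso

def winsorized_normal_scores(X):
    """Step 1: estimate each f_j via a Winsorized empirical CDF and map through Phi^{-1}."""
    n, p = X.shape
    delta_n = 1.0 / (4.0 * n**0.25 * np.sqrt(np.pi * np.log(n)))
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1       # ranks 1..n (ties broken arbitrarily)
        F_hat = ranks / float(n)                          # empirical CDF at the sample points
        F_tilde = np.clip(F_hat, delta_n, 1.0 - delta_n)  # Winsorize away from 0 and 1
        Z[:, j] = norm.ppf(F_tilde)                       # f_hat_j(x) = Phi^{-1}(F_tilde_j(x))
    return Z

def nonparanormal_glasso(X, lam):
    """Step 2: pass the Winsorized sample covariance S_n(f_hat) to the graphical lasso."""
    Z = winsorized_normal_scores(X)
    S = np.cov(Z, rowvar=False)
    _, Omega_hat = graphical_lasso(S, alpha=lam)          # returns (covariance, precision)
    return Omega_hat
```

The nonzero pattern of the returned precision matrix then gives the estimated graph, exactly as in the Gaussian case.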
30 To sample data from the nonparanormal distribution, we also require g ≡ f −1 ; two different transformations g are employed. [sent-345, score-0.677]
31 It can be seen how the cdf and power transformations map a univariate normal distribution into a highly skewed and a bi-modal distribution, respectively. [sent-387, score-0.229]
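A hedged sketch of such a sampler: draw Z ∼ N(µ, Σ) and apply g component-wise. The two transforms below are illustrative stand-ins for the paper's cdf and power transformations, whose exact parameterizations may differ.

```python
# Sketch of sampling from a nonparanormal: X = g(Z) with Z ~ N(mu, Sigma), g = f^{-1}.
import numpy as np
from scipy.stats import norm

def sample_nonparanormal(n, mu, Sigma, g, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mu, Sigma, size=n)  # latent Gaussian vectors
    return g(Z)                                     # monotone transform applied element-wise

def g_cdf(z):
    # Gaussian-cdf style transform: squashes the latent variable into (0, 1),
    # giving skewed marginals when the latent mean is away from zero.
    return norm.cdf(z)

def g_power(z, alpha=3.0):
    # Odd power transform; keeps monotonicity while reshaping the marginals.
    return np.sign(z) * np.abs(z) ** alpha
```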
32 (Figure residue: panels labeled "cdf", "power", and the glasso regularization path.) [sent-388, score-0.579]
33 Figure 4: Regularization paths for the glasso and nonparanormal with n = 500 (top) and n = 200 (bottom). [sent-528, score-1.046]
34 For non-Gaussian distributions, the nonparanormal better separates the relevant and irrelevant dimensions. [sent-530, score-0.657]
35 In each case, both the glasso and the nonparanormal are applied to estimate the graph. [sent-533, score-1.008]
36 For the cdf transformation and the power transformation, the nonparanormal separates the relevant and the irrelevant dimensions very well. [sent-540, score-0.861]
37 This is intuitive due to the high skewness of the nonparanormal distribution in this case. [sent-547, score-0.638]
38 The curves clearly show how the performance of both methods improves with sample size, and that the nonparanormal is superior to the Gaussian model in most cases. [sent-585, score-0.653]
39 (Figure residue: ROC-style axis 1−FP with curves for the nonparanormal and the glasso.) [sent-605, score-0.74]
40 It’s clear from the tables that the nonparanormal achieves significantly smaller errors than the glasso if the true distribution of the data is not multivariate Gaussian and achieves performance comparable to the glasso when the true distribution is exactly multivariate Gaussian. [sent-684, score-1.432]
41 It’s clear that when the glasso estimates the graph incorrectly, the mistakes include both false positives and negatives. [sent-686, score-0.403]
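One concrete way to tabulate these mistakes is to compare the supports of the estimated and true precision matrices. The sketch below assumes FPE and FNE are the fractions of erroneous edges among true non-edges and true edges, respectively; the paper's exact normalization may differ.

```python
# Sketch: false positive / false negative edge rates of an estimated graph.
# An edge (i, j), i < j, is declared present when |Omega[i, j]| exceeds a small tolerance.
import numpy as np

def edge_errors(Omega_hat, Omega_true, tol=1e-8):
    p = Omega_true.shape[0]
    iu = np.triu_indices(p, k=1)            # strictly upper-triangular index pairs
    est = np.abs(Omega_hat[iu]) > tol
    tru = np.abs(Omega_true[iu]) > tol
    fp = np.sum(est & ~tru)                 # edges claimed but absent in the true graph
    fn = np.sum(~est & tru)                 # true edges that were missed
    fpe = fp / max(np.sum(~tru), 1)         # assumed normalization: per true non-edge
    fne = fn / max(np.sum(tru), 1)          # assumed normalization: per true edge
    return fpe, fne
```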
42 (Table header: for each n, FPE (sd(FPE)) and FNE (sd(FNE)) for the nonparanormal and the glasso.) [sent-687, score-0.37]
43 For both FPE and FNE, the nonparanormal performs much better in general. [sent-760, score-0.638]
44 (Table header: for each n, FPE (sd(FPE)) and FNE (sd(FNE)) for the nonparanormal and the glasso.) [sent-761, score-0.37]
45 For both FPE and FNE, the nonparanormal performs much better in general. [sent-834, score-0.638]
46 Comparison in the Gaussian Case: the previous experiments indicate that the nonparanormal works almost as well as the glasso in the Gaussian case. [sent-837, score-1.008]
47 For multivariate Gaussian models, Figure 9 shows results with (n, p, s) = (50, 40, 1/8) and (50, 100, 1/15). (Table header: for each n, FPE (sd(FPE)) and FNE (sd(FNE)) for the nonparanormal and the glasso.) [sent-840, score-0.397]
48 The two methods behave similarly; the glasso is slightly better. [sent-913, score-0.37]
49 From the mean ROC curves, we see that the nonparanormal does indeed behave worse than the glasso, suggesting some efficiency loss. [sent-915, score-0.638]
50 The Case When p ≫ n: Figure 10 shows results from a simulation of the nonparanormal using cdf transformations with n = 200, p = 500, and sparsity level s = 1/40. [sent-919, score-0.833]
51 The boxplot shows that the nonparanormal outperforms the glasso. [sent-920, score-0.638]
52 A typical run of the regularization paths confirms this conclusion, showing that the nonparanormal path separates the relevant and irrelevant dimensions very well. [sent-921, score-0.759]
53 In contrast, with the glasso the relevant variables are “buried” among the irrelevant variables. [sent-922, score-0.389]
54 A subset of 40 genes from the isoprenoid pathway is chosen, and we study the associations among them using both the paranormal and nonparanormal models. [sent-928, score-0.682]
55 For this data set (Wille et al., 2004), our study shows that the results of the nonparanormal and the glasso are very different over a wide range of regularization parameters. [sent-930, score-1.035]
56 This suggests the nonparanormal could support different scientific conclusions. [sent-931, score-0.638]
57 The dashed (black) lines in the symmetric difference plots indicate edges found by the glasso but not the nonparanormal, and vice-versa for the solid (red) lines. [sent-1123, score-0.399]
58 For each λ, we show the estimated graph from the nonparanormal in the first column. [sent-1134, score-0.69]
59 Figure 10: For the cdf transformation with n = 200, p = 500, s = 1/40, comparison of the boxplots and a typical run of the regularization paths. [sent-1199, score-0.249]
60 The nonparanormal paths separate the relevant from the irrelevant dimensions well. [sent-1200, score-0.695]
61 The closest glasso graph is found by searching over the regularization path of the glasso fit and finding the graph having the smallest symmetric difference with the nonparanormal graph. [sent-1202, score-0.507]
62 The closest glasso fit is different, with edges selected by the glasso not selected by the nonparanormal, and vice-versa. [sent-1204, score-0.754]
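A sketch of that comparison, reusing the conventions of the earlier snippets (sweep the glasso path and keep the graph whose edge set has the smallest symmetric difference with the nonparanormal graph; the helper names are ours):

```python
# Sketch: find the glasso graph along a regularization path that is closest, in
# symmetric difference of edge sets, to a given nonparanormal graph.
import numpy as np
from sklearn.covariance import graphical_lasso

def edge_set(Omega, tol=1e-8):
    p = Omega.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(Omega[i, j]) > tol}

def closest_glasso_graph(S_gaussian, npn_edges, lambdas):
    best = None
    for lam in lambdas:
        _, Omega_lam = graphical_lasso(S_gaussian, alpha=lam)
        diff = len(edge_set(Omega_lam) ^ npn_edges)     # symmetric difference size
        if best is None or diff < best[0]:
            best = (diff, lam, Omega_lam)
    return best  # (symmetric-difference size, lambda, estimated precision matrix)
```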
63 (Figure residue: the nonparanormal regularization path.) [sent-1207, score-0.675]
64 Lemma 12 (Distribution function of the transformed random variable) For any α ∈ (−∞, ∞), Φ^{−1}( Fj( gj( α √(log n) ) ) ) = α √(log n). [sent-1242, score-0.284]
65 Figure 12: The nonparanormal estimated graph for three values of λ (left column), [sent-1250, score-0.69]
66 the closest glasso estimated graph from the full path (middle), and the symmetric difference graph (right); one of the values shown is λ = 0.30857. [sent-1253, score-0.507]
67 Proof Using Mill's inequality, we have P( max_{1≤i≤n} Wi > √(α log n) ) ≤ ∑_{i=1}^n P( Wi > √(α log n) ) ≤ n φ(√(α log n)) / √(α log n) = 1/( n^{α/2−1} √(2πα log n) ), from which the result follows. [sent-1271, score-0.875]
68 Lemma 14 For any α > 0 that satisfies 1 − δn − Φ(√(α log n)) > 0 for all n, we have P( F̂j( gj(√(α log n)) ) > 1 − δn ) ≤ exp( −2n ( 1 − δn − Φ(√(α log n)) )² ) [sent-1272, score-0.448]
69 (13) and P( F̂j( gj(−√(α log n)) ) < δn ) ≤ exp( −2n ( 1 − δn − Φ(√(α log n)) )² ). [sent-1273, score-0.306]
70 (14) Proof Using Hoeffding's inequality, P( F̂j( gj(√(α log n)) ) > 1 − δn ) = P( F̂j( gj(√(α log n)) ) − Fj( gj(√(α log n)) ) > 1 − δn − Fj( gj(√(α log n)) ) ) ≤ exp( −2n ( 1 − δn − Fj( gj(√(α log n)) ) )² ). [sent-1274, score-0.732]
71 We split the interval [ gj(−√(M log n)), gj(√(M log n)) ] into two parts: the middle Mn ≡ [ gj(−√(β log n)), gj(√(β log n)) ] and the ends En ≡ [ gj(−√(M log n)), gj(−√(β log n)) ] ∪ [ gj(√(β log n)), gj(√(M log n)) ]. [sent-1278, score-1.136]
72 Lemma 16 For all n, we have sup_{t∈En} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| < √( 2(M + 2) log n ) for all j ∈ {1, . . . , p}. [sent-1288, score-0.222]
73 Proof From Lemma 12 and the definition of En, we have sup_{t∈En} |Φ^{−1}( Fj(t) )| ∈ [ 0, √(M log n) ]. [sent-1292, score-0.222]
74 Therefore, from Equation (11), sup_{t∈En} |Φ^{−1}( F̃j(t) )| ∈ [ 0, √(2 log n) ]. The result follows from the triangle inequality, since √(M log n) + √(2 log n) ≤ √( 2(M + 2) log n ). [sent-1294, score-0.222]
75 Let ∆i( j, k) ≡ f̂j(Xij) f̂k(Xik) − fj(Xij) fk(Xik) and Θt,s( j, k) ≡ f̂j(t) f̂k(s) − fj(t) fk(s). [sent-1299, score-0.524]
76 Such a θ1 guarantees that nε/(4θ1) − nA √(log n)/n^β = nA √(log n)/n^β > 0. [sent-1320, score-0.284]
77 Using Bernstein's inequality, for β = 1/2, P( (1/n) ∑_{i=1}^n 1{Xij ∈ En, Xik ∈ En} > ε/(4θ1) ) ≤ P( ∑_{i=1}^n ( 1{Xij ∈ En} − P(X1j ∈ En) ) > nA √(log n)/n^β ) ≤ exp( − c1 n^{2−β} log n / ( c2 n^{1−β/2} √(log n) + c3 n^{1−β/2} √(log n) ) ), where c1, c2, c3 > 0 are generic constants. [sent-1322, score-0.59]
78 = o(1). Now, we analyze the first term: P( max_{j,k} sup_{t∈En, s∈En} |Θt,s( j, k)| > θ1 ) ≤ p² P( sup_{t∈En, s∈En} |Θt,s( j, k)| > θ1 ) = p² P( sup_{t∈En, s∈En} | f̂j(t) f̂k(s) − fj(t) fk(s) | > θ1 ). [sent-1324, score-0.525]
79 we have θ1/3 = ε n^{β/2}/(24A √(log n)) ≥ CM √(log p log² n) / (24A √(log n)) ≥ √( 2(M + 2) log n ). [sent-1327, score-0.284]
80 This implies that θ1 ≥ 3 √( 2(M + 2) log n ) and θ1 ≥ 3 √(M log n) √( 2(M + 2) log n ). [sent-1328, score-0.71]
81 Using Lemma 14, we have p² P( F̂j( gj(√(β log n)) ) > 1 − δn ) ≤ p² exp( −2n δn² ) = exp( 2 log p − n^{1−β}/(16πβ log n) ) (15) and p² P( F̂j( gj(−√(β log n)) ) < δn ) ≤ exp( 2 log p − n^{1−β}/(16πβ log n) ). [sent-1340, score-0.918]
82 From (15) and (16), it is easy to see that P(Bn^c) ≤ 2 exp( 2 log p − n^{1/2}/(8π log n) ). [sent-1345, score-0.306]
83 From the definition of f̂j, we have p² P( sup_{t∈Mn} | f̂j(t) − fj(t) | > ε/(12 √(β log n)) ) ≤ p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)), Bn ) + P(Bn^c). [sent-1346, score-0.524]
84 ≤ p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)), Bn ) + 2 exp( 2 log p − n^{1/2}/(8π log n) ). [sent-1347, score-0.59]
85 Define T1n ≡ max{ F̃j( gj(√(β log n)) ), 1 − δn } and T2n ≡ 1 − min{ F̃j( gj(−√(β log n)) ), δn }. From Equation (12) and the fact that 1 − δn ≥ Φ(√(β log n)), we have that T1n = T2n = 1 − δn. [sent-1348, score-0.307]
86 Thus, by the mean value theorem, P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ P( (Φ^{−1})′( max{T1n, T2n} ) sup_{t∈Mn} | F̃j(t) − Fj(t) | > ε/(12 √(β log n)) ) = P( (Φ^{−1})′(1 − δn) sup_{t∈Mn} | F̃j(t) − Fj(t) | > ε/(12 √(β log n)) ). [sent-1349, score-0.666]
87 Finally, using the Dvoretzky-Kiefer-Wolfowitz inequality, P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ P( sup_{t∈Mn} | F̃j(t) − Fj(t) | > ε/( 12 (Φ^{−1})′(1 − δn) √(β log n) ) ) ≤ 2 exp( − 2nε² / ( 144 β log n [ (Φ^{−1})′(1 − δn) ]² ) ). [sent-1351, score-0.608]
88 Furthermore, by Lemma 11, (Φ^{−1})′(1 − δn) = 1/φ( Φ^{−1}(1 − δn) ) ≤ 1/φ( √(2 log(1/δn)) ) = √(2π)/δn = 8π n^{β/2} √(β log n). [sent-1352, score-0.284]
89 This implies that p² P( sup_{t∈Mn} |Φ^{−1}( F̃j(t) ) − Φ^{−1}( Fj(t) )| > ε/(12 √(β log n)) ) ≤ 2 exp( 2 log p − n^{1/2} ε² / (1232 π² log² n) ). [sent-1353, score-0.386]
90 In summary, we have P( max_{j,k} (1/n) ∑_{i: Xij ∈ Mn, Xik ∈ En} |∆i( j, k)| > ε/4 ) ≤ 2 exp( 2 log p − n^{1/2} ε² / (1232 π² log² n) ) + 2 exp( 2 log p − n^{1/2}/(8π log n) ). This finishes the proof. [sent-1354, score-0.328]
91 Proof of Theorem 8 First note that the population and sample risks are R( f, Ω) = (1/2){ tr( Ω E[ f(X) f(X)^T ] ) − log |Ω| − p log(2π) } and R̂( f, Ω) = (1/2){ tr[ Ω Sn( f ) ] − log |Ω| − p log(2π) }. [sent-1357, score-0.306]
92 Therefore, for all ( f, Ω) ∈ Mn^p ⊕ Cn, we have |R̂( f, Ω) − R( f, Ω)| = (1/2) | tr( Ω ( E[ f f^T ] − Sn( f ) ) ) | ≤ (1/2) ‖Ω‖₁ max_{j,k} sup_{fj, fk ∈ Mn} | E[ fj(Xj) fk(Xk) ] − Sn( f )jk | ≤ (Ln/2) max_{j,k} sup_{fj, fk ∈ Mn} | E[ fj(Xj) fk(Xk) ] − Sn( f )jk |. [sent-1358, score-0.685]
93 Now, if F is a class of functions, we have E[ sup_{g∈F} | µ̂(g) − µ(g) | ] ≤ C J_{[]}( ‖F‖_∞, F ) / √n (17) for some C > 0, where F(x) = sup_{g∈F} |g(x)|, µ(g) = E[g(X)] and µ̂(g) = n^{−1} ∑_{i=1}^n g(Xi) (see Corollary 19. [sent-1359, score-0.418]
94 Here the bracketing integral is defined to be J_{[]}(δ, F) = ∫_0^δ √( log N_{[]}(u, F) ) du, where log N_{[]}(ε, F) is the bracketing entropy. [sent-1361, score-0.342]
95 Then the bracketing entropy satisfies log N_{[]}( u, Pn,p ) ≤ 2 log p + K C √(log n)/u and the bracketing integral satisfies J_{[]}( C √(log n), Pn,p ) = O( √(log n log p) ). [sent-1367, score-0.91]
96 It follows from (17) and Markov's inequality that max_{j,k} sup_{fj, fk ∈ Mn} | Sn( f )jk − E[ fj(Xj) fk(Xk) ] | = OP( √( log n log p / n ) ); therefore, sup_{( f, Ω) ∈ Mn^p ⊕ Cn} |R̂( f, Ω) − R( f, Ω)| = OP( Ln √( log n log p / n ) ). [sent-1368, score-1.001]
97 = OP( Ln √(log n) / n^{(1−ξ)/2} ) = OP( Ln √( log n / n^{1−ξ} ) ), since log p ≤ n^ξ. [sent-1370, score-0.568]
98 The nonparanormal can be viewed as an extension of sparse additive models to the setting of graphical models. [sent-1373, score-0.738]
99 Computationally, fitting a high dimensional nonparanormal is no more difficult than estimating a multivariate Gaussian, and indeed one can exploit existing software for the graphical lasso. [sent-1379, score-0.731]
100 Our experimental results indicate that the sparse nonparanormal can give very different results than a sparse Gaussian graphical model. [sent-1380, score-0.731]
wordName wordTfidf (topN-words)
[('nonparanormal', 0.638), ('fj', 0.404), ('glasso', 0.37), ('en', 0.229), ('log', 0.142), ('cdf', 0.14), ('fk', 0.131), ('afferty', 0.117), ('mn', 0.11), ('fne', 0.11), ('onparanormal', 0.11), ('wasserman', 0.105), ('iu', 0.099), ('fpe', 0.093), ('sup', 0.08), ('sn', 0.079), ('fp', 0.069), ('ravikumar', 0.067), ('jk', 0.065), ('oracle', 0.06), ('copula', 0.058), ('fn', 0.055), ('estimator', 0.054), ('lasso', 0.054), ('covariance', 0.051), ('graphical', 0.049), ('op', 0.046), ('sd', 0.045), ('rothman', 0.041), ('winsorized', 0.041), ('semiparametric', 0.041), ('transformations', 0.039), ('transform', 0.039), ('paths', 0.038), ('path', 0.037), ('larry', 0.034), ('npn', 0.034), ('gaussian', 0.034), ('graph', 0.033), ('transformation', 0.032), ('power', 0.032), ('boxplots', 0.031), ('additive', 0.029), ('bracketing', 0.029), ('xik', 0.029), ('coefficient', 0.027), ('paranormal', 0.027), ('regularization', 0.027), ('multivariate', 0.027), ('nonparametric', 0.026), ('undirected', 0.025), ('persistency', 0.023), ('max', 0.023), ('sparse', 0.022), ('omparison', 0.022), ('consistency', 0.022), ('tr', 0.022), ('exp', 0.022), ('klaassen', 0.021), ('score', 0.02), ('ij', 0.02), ('bn', 0.02), ('lambda', 0.019), ('cm', 0.019), ('ik', 0.019), ('estimated', 0.019), ('irrelevant', 0.019), ('normal', 0.018), ('lemma', 0.018), ('isoprenoid', 0.017), ('wille', 0.017), ('pradeep', 0.017), ('lafferty', 0.017), ('han', 0.017), ('dimensional', 0.017), ('sparsity', 0.016), ('cn', 0.016), ('np', 0.016), ('parallels', 0.016), ('subtle', 0.016), ('symmetric', 0.015), ('microarray', 0.015), ('monotone', 0.015), ('density', 0.015), ('der', 0.015), ('curves', 0.015), ('ln', 0.015), ('event', 0.014), ('persistent', 0.014), ('vaart', 0.014), ('wellner', 0.014), ('edges', 0.014), ('graphs', 0.014), ('xi', 0.014), ('abramovich', 0.014), ('mill', 0.014), ('sparsistent', 0.014), ('stimated', 0.014), ('tsukahara', 0.014), ('winsorization', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
3 0.066515863 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
4 0.045343906 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
Author: Sylvain Arlot, Pascal Massart
Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram
5 0.04377462 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
Author: Nikolai Slobodianik, Dmitry Zaporozhets, Neal Madras
Abstract: In the machine learning community, the Bayesian scoring criterion is widely used for model selection problems. One of the fundamental theoretical properties justifying the usage of the Bayesian scoring criterion is its consistency. In this paper we refine this property for the case of binomial Bayesian network models. As a by-product of our derivations we establish strong consistency and obtain the law of iterated logarithm for the Bayesian scoring criterion. Keywords: Bayesian networks, consistency, scoring criterion, model selection, BIC
6 0.042388145 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
7 0.04155729 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
8 0.035395078 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
9 0.033063829 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks
10 0.031190328 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
11 0.030743968 46 jmlr-2009-Learning Halfspaces with Malicious Noise
12 0.030588953 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
13 0.029792065 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
14 0.028645994 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
15 0.028634639 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
16 0.027323795 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
17 0.026645236 16 jmlr-2009-Classification with Gaussians and Convex Loss
18 0.023814257 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
19 0.023258146 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
20 0.022642605 23 jmlr-2009-Discriminative Learning Under Covariate Shift
topicId topicWeight
[(0, 0.121), (1, -0.033), (2, 0.072), (3, 0.045), (4, -0.093), (5, 0.008), (6, -0.112), (7, -0.032), (8, -0.035), (9, 0.084), (10, 0.096), (11, 0.012), (12, -0.306), (13, 0.118), (14, 0.158), (15, -0.071), (16, -0.043), (17, 0.174), (18, -0.021), (19, -0.079), (20, -0.138), (21, 0.096), (22, -0.086), (23, -0.168), (24, 0.217), (25, 0.057), (26, 0.028), (27, -0.004), (28, -0.105), (29, 0.041), (30, 0.007), (31, 0.203), (32, 0.009), (33, 0.129), (34, -0.051), (35, -0.055), (36, -0.087), (37, -0.001), (38, 0.008), (39, -0.001), (40, 0.154), (41, -0.012), (42, -0.012), (43, 0.012), (44, 0.108), (45, -0.136), (46, -0.067), (47, 0.088), (48, 0.032), (49, -0.19)]
simIndex simValue paperId paperTitle
same-paper 1 0.93090165 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Author: Sébastien Bubeck, Ulrike von Luxburg
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal, and instead can lead to inconsistency. We construct examples which provably have this behavior. As in the case of supervised learning, the cure is to restrict the size of the function classes under consideration. For appropriate “small” function classes we can prove very general consistency theorems for clustering optimization schemes. As one particular algorithm for clustering with a restricted function space we introduce “nearest neighbor clustering”. Similar to the k-nearest neighbor classifier in supervised learning, this algorithm can be seen as a general baseline algorithm to minimize arbitrary clustering objective functions. We prove that it is statistically consistent for all commonly used clustering objective functions. Keywords: clustering, minimizing objective functions, consistency
3 0.3597292 18 jmlr-2009-Consistency and Localizability
Author: Alon Zakai, Ya'acov Ritov
Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y )—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification
4 0.27941266 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression
Author: Sylvain Arlot, Pascal Massart
Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram
5 0.20919675 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
Author: Ulrich Paquet, Ole Winther, Manfred Opper
Abstract: Bayesian inference is intractable for many interesting models, making deterministic algorithms for approximate inference highly desirable. Unlike stochastic methods, which are exact in the limit, the accuracy of these approaches cannot be reasonably judged. In this paper we show how low order perturbation corrections to an expectation-consistent (EC) approximation can provide the necessary tools to ameliorate inference accuracy, and to give an indication of the quality of approximation without having to resort to Monte Carlo methods. Further comparisons are given with variational Bayes and parallel tempering (PT) combined with thermodynamic integration on a Gaussian mixture model. To obtain practical results we further generalize PT to temper from arbitrary distributions rather than a prior in Bayesian inference. Keywords: Bayesian inference, mixture models, expectation propagation, expectation consistent, perturbation correction, variational Bayes, parallel tempering, thermodynamic integration
6 0.19719334 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
7 0.18423809 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
8 0.17639431 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
9 0.1696412 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
10 0.15997693 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
11 0.15577269 46 jmlr-2009-Learning Halfspaces with Malicious Noise
12 0.15339474 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
13 0.14997414 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification
15 0.1429297 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
16 0.1398547 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
17 0.13705696 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
18 0.13544345 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments
19 0.13117228 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
20 0.12691468 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
topicId topicWeight
[(6, 0.389), (8, 0.031), (11, 0.046), (19, 0.017), (21, 0.012), (26, 0.014), (38, 0.019), (45, 0.012), (47, 0.019), (52, 0.036), (55, 0.055), (58, 0.029), (66, 0.103), (68, 0.034), (90, 0.05), (96, 0.035)]
simIndex simValue paperId paperTitle
1 0.78208333 44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths
Author: Dana Angluin, James Aspnes, Jiang Chen, David Eisenstat, Lev Reyzin
Abstract: We define a model of learning probabilistic acyclic circuits using value injection queries, in which fixed values are assigned to an arbitrary subset of the wires and the value on the single output wire is observed. We adapt the approach of using test paths from the Circuit Builder algorithm (Angluin et al., 2009) to show that there is a polynomial time algorithm that uses value injection queries to learn acyclic Boolean probabilistic circuits of constant fan-in and log depth. We establish upper and lower bounds on the attenuation factor for general and transitively reduced Boolean probabilistic circuits of test paths versus general experiments. We give computational evidence that a polynomial time learning algorithm using general value injection experiments may not do much better than one using test paths. For probabilistic circuits with alphabets of size three or greater, we show that the test path lemmas (Angluin et al., 2009, 2008b) fail utterly. To overcome this obstacle, we introduce function injection queries, in which the values on a wire may be mapped to other values rather than just to themselves or constants, and prove a generalized test path lemma for this case. Keywords: nonadaptive learning algorithms, probabilistic circuits, causal Bayesian networks, value injection queries, test paths
same-paper 2 0.68710846 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
Author: Han Liu, John Lafferty, Larry Wasserman
Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult
Author: Junning Li, Z. Jane Wang
Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton
4 0.34457996 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
5 0.34415948 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
6 0.34157556 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
7 0.34130475 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
8 0.34114707 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
9 0.33492774 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
10 0.33486262 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
11 0.33360684 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
12 0.33288604 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
13 0.33281177 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks
14 0.33106431 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
15 0.32967886 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
16 0.32929888 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications
17 0.32793635 18 jmlr-2009-Consistency and Localizability
18 0.32663625 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
19 0.32630956 78 jmlr-2009-Refinement of Reproducing Kernels
20 0.32546458 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis