jmlr jmlr2012 jmlr2012-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi
Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Informational Statistics, Hoseo University, Chungnam 336-795, Korea. Editor: Xiaotong Shen. Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. [sent-7, score-0.207]
2 Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. [sent-9, score-0.295]
3 Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. [sent-11, score-0.256]
4 Wang et al. (2009) proposed a modified BIC which is consistent when the dimension of covariates diverges more slowly than the sample size. [sent-20, score-0.181]
5 Here, the consistency of a model selection criterion means that the probability of the selected model being equal to the true model converges to 1. [sent-21, score-0.194]
6 The extended BIC by Chen and Chen (2008) and the corrected RIC by Zhang and Shen (2010) are shown to be consistent even when the dimension of covariates is larger than the sample size. [sent-23, score-0.207]
7 In this paper, we study asymptotic properties of a large class of model selection criteria based on the generalized information criterion (GIC) considered by Shao (1997). [sent-28, score-0.177]
8 For the case of the Gaussian error distribution with a consistent estimator of the variance, our sufficient conditions cover most of the previously proposed consistent model selection criteria, such as the modified BIC (Wang et al., 2009). [sent-34, score-0.253]
9 For high-dimensional models, it is not practically feasible to find the best model among all possible submodels since the number of submodels is too large. [sent-36, score-0.355]
10 A simple remedy is to find a sequence of submodels with increasing complexities. [sent-37, score-0.168]
11 Examples of constructing a sequence of submodels are the forward selection procedure and solution paths of penalized regression approaches. [sent-40, score-0.287]
12 Our sufficient conditions are still valid as long as the sequence of submodels includes the true model with probability converging to 1. [sent-41, score-0.209]
13 Let {(y1, x1), . . . , (yn, xn)} be a given data set of independent pairs of response and covariates, where yi ∈ R and xi ∈ R^pn. [sent-53, score-0.568]
14 Suppose the true regression model for (y, x) is given as y = x′β∗ + ε, where β∗ ∈ R^pn, E(ε) = 0 and Var(ε) = σ². [sent-54, score-0.477]
15 Let Yn = (y1, . . . , yn)′ and Xn be the n × pn design matrix whose jth column is Xn^j = (x1j, . . . , xnj)′. [sent-61, score-0.585]
16 For given β ∈ R^pn, let Rn(β) = ‖Yn − Xnβ‖², where ‖·‖ is the Euclidean norm. [sent-65, score-0.458]
17 The AIC corresponds to λn = 2, the BIC to λn = log n, the RIC of Foster and George (1994) to λn = 2 log pn, and the corrected RIC of Zhang and Shen (2010) to λn = 2(log pn + log log pn). [sent-77, score-1.594]
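As a concrete illustration of the criterion Rn(β̂π) + λn|π|σ², the minimal sketch below (not the authors' code; the function and variable names are ours) evaluates the score for a candidate submodel under the classical λn choices listed above.

```python
import numpy as np

def gic_score(X, y, support, sigma2, lam):
    """GIC score R_n(beta_hat_pi) + lam * |pi| * sigma2 for a candidate submodel pi."""
    y = np.asarray(y, dtype=float)
    if len(support) == 0:
        rss = float(y @ y)                                     # empty model: R_n = ||y||^2
    else:
        X_pi = np.asarray(X, dtype=float)[:, list(support)]
        beta_hat, *_ = np.linalg.lstsq(X_pi, y, rcond=None)    # least-squares fit on pi
        rss = float(np.sum((y - X_pi @ beta_hat) ** 2))        # R_n(beta_hat_pi)
    return rss + lam * len(support) * sigma2

# Classical penalty levels (n = sample size, p = number of covariates)
def lam_aic(n, p):  return 2.0                                     # AIC
def lam_bic(n, p):  return np.log(n)                               # BIC
def lam_ric(n, p):  return 2.0 * np.log(p)                         # RIC (Foster and George, 1994)
def lam_cric(n, p): return 2.0 * (np.log(p) + np.log(np.log(p)))   # corrected RIC (Zhang and Shen, 2010)
```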
18 When pn is large, it would not be wise to search all possible subsets of {1, . . . , pn}. [sent-79, score-0.458]
19 Instead, we set an upper bound on the cardinality of π, say sn, and search for the optimal model among submodels whose cardinalities do not exceed sn. [sent-83, score-0.499]
20 We define the restricted GICλn as π̂λn = argmin_{π∈M_sn} { Rn(β̂π) + λn|π|σ² }. (1) [sent-89, score-0.156]
21 The restricted GIC in (1) is the same as the GIC if sn = pn. [sent-90, score-0.614]
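A brute-force version of the restricted criterion in (1) can be sketched as follows; this is only an illustration for small sn and pn (the search over all submodels of cardinality at most sn is combinatorial), and it reuses the hypothetical gic_score helper from the sketch above.

```python
import numpy as np
from itertools import combinations

def restricted_gic(X, y, sigma2, lam, s_n):
    """Minimize R_n(beta_hat_pi) + lam*|pi|*sigma2 over all submodels pi with |pi| <= s_n."""
    p = np.asarray(X).shape[1]
    best_support, best_score = [], np.inf
    for size in range(s_n + 1):
        for support in combinations(range(p), size):
            score = gic_score(X, y, list(support), sigma2, lam)
            if score < best_score:
                best_support, best_score = list(support), score
    return best_support, best_score
```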
22 For π ⊂ {1, . . . , pn}, let Xπ = (Xn^j, j ∈ π) be the n × |π| matrix whose columns consist of Xn^j, j ∈ π. [sent-99, score-0.458]
23 • A5 : qn = O(n^c3) for some 0 ≤ c3 < c2, and qn ≤ sn, where qn = |πn∗|. [sent-110, score-0.678]
24 If λn = o(n^(c2−c1)) and pn/(λnρn)^k → 0, then the GICλn is consistent. [sent-120, score-0.458]
25 In Theorem 2, pn can diverge only polynomially fast in n since pn = o(λn^k) = o(n^(k c2)). [sent-121, score-0.938]
26 Since k can be considered a degree of tail lightness of the error distribution, we can conclude that the lighter the tail of the error distribution is, the more covariates the GICλn can be consistent with. [sent-122, score-0.304]
27 When ε is Gaussian, the following theorem proves that the GICλn can be consistent when pn diverges exponentially fast. [sent-123, score-0.542]
28 If λn = o(n^(c2−c1)), sn log pn = o(n^(c2−c1)) and λn − 2 log pn − log log pn → ∞, then the GICλn is consistent. [sent-125, score-1.75]
29 In the following, we give three examples for (i) fixed pn , (ii) polynomially diverging pn and (iii) exponentially diverging pn . [sent-126, score-1.454]
30 Example 1 Consider a standard case where pn is fixed and n goes to infinity. [sent-135, score-0.458]
31 Any GIC with λn = n^c, 0 < c < 1, is consistent, which suggests that the class of consistent model selection criteria is quite large. [sent-141, score-0.2]
32 That is, for larger pn , we need larger λn for consistency, which is reasonable because we need to be more careful not to overfit when pn is large. [sent-145, score-0.916]
33 The modified BIC of Wang et al. (2009), which is a GIC with λn = (log log pn) log n, is consistent. [sent-151, score-0.623]
34 Chen and Chen (2008) proposed a model selection criterion called the extended BIC, given by π̂eBIC = argmin_{π ⊂ {1,...,pn}, |π| ≤ K} { Rn(β̂π) + |π|σ² log n + 2κσ² log (pn choose |π|) } [sent-152, score-0.563]
35 for some K > 0 and 0 ≤ κ ≤ 1, and proved that the extended BIC is consistent when κ > 1 − 1/(2γ). [sent-155, score-0.163]
36 Since log (pn choose |π|) ≍ |π| log pn for |π| ≤ K, we have |π|σ² log n + 2κσ² log (pn choose |π|) ≍ (log n + 2κ log pn)|π|σ². [sent-156, score-2.107]
37 Example 3 When the error distribution is Gaussian, the GIC can be consistent for exponentially increasing pn. [sent-158, score-0.539]
38 The GIC with λn = n^ξ, 0 < ξ < 1, is consistent when pn = O(exp(αn^γ)) for 0 < γ < ξ and α > 0. [sent-161, score-0.511]
39 Also, it can be shown by Theorem 3 that the extended BIC with γ = 1 is consistent with pn = O(exp(αn^γ)) for 0 < γ < 1/2. [sent-162, score-0.511]
40 The consistency of the corrected RIC of Zhang and Shen (2010) can be confirmed by Theorem 3, but the regularity conditions for Theorem 3 are more general than those of Zhang and Shen (2010). [sent-163, score-0.156]
41 A simple remedy is to construct a sequence of submodels and select the optimal model among the sequence of submodels. [sent-168, score-0.187]
42 Examples of constructing a sequence of submodels are the forward selection (Wang, 2009) and the solution path of a sparse penalized estimator obtained by, for example, the Lars algorithm (Efron et al., 2004). [sent-169, score-0.356]
43 • For a given sparse penalty Jη(t) indexed by η ≥ 0, find the solution path of a penalized estimator {β̂(η) : η > 0}, where β̂(η) = argmin_β { Rn(β) + Σ_{j=1}^{pn} Jη(|βj|) }. [sent-172, score-0.181]
44 • Let S(η) = { j : β̂(η)_j ≠ 0 } and ϒ = { η : S(η) ≠ S(η−), |S(η)| ≤ sn }. [sent-174, score-0.156]
45 It is easy to see that a consistent GIC is still consistent with a sequence of sub-models as long as the sequence of submodels includes the true model with probability converging to 1. [sent-177, score-0.315]
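The sub-model-sequence strategy can be sketched with simple forward selection, one of the two constructions mentioned in the text; this is a hedged illustration with hypothetical names, again assuming the gic_score helper above, not the SCAD path used in the experiments.

```python
import numpy as np

def forward_selection_path(X, y, s_n):
    """Greedy forward selection: a nested sequence of supports of sizes 0, ..., s_n."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    p = X.shape[1]
    active, path = [], [[]]
    for _ in range(min(s_n, p)):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ beta) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        active = active + [best_j]
        path.append(list(active))
    return path

def gic_on_path(X, y, sigma2, lam, s_n):
    """Pick the submodel along the nested sequence that minimizes the GIC."""
    path = forward_selection_path(X, y, s_n)
    scores = [gic_score(X, y, sup, sigma2, lam) for sup in path]
    return path[int(np.argmin(scores))]
```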
46 The consistency of the solution path of a nonconvex penalized estimator with either the SCAD penalty or minimax concave penalty is proved by Zhang (2010) and Kim and Kwon (2012). [sent-180, score-0.354]
47 By combining Theorem 4 of Kim and Kwon (2012) and Theorem 2 of the current paper, we can prove the consistency of the GIC with the solution path of the SCAD penalty or minimax concave penalty, which is formally stated in the following theorem. [sent-181, score-0.261]
48 Remark 6 Theorem 3 can be modified similarly for the GIC with the solution path of the SCAD or minimax concave penalty, since Theorem 4 of Kim and Kwon (2012) can be modified accordingly for the Gaussian error distribution. [sent-187, score-0.198]
49 When pn is fixed, we can estimate σ2 consistently by the mean squared error of the full model. [sent-192, score-0.486]
50 If λn = o(n^(c2−c1)) and λn − 2M1 log pn/(ρn r_inf) → ∞, then the GICλn with the estimated variance is consistent. [sent-202, score-0.578]
51 The corrected RIC, the GIC with λn = 2(log pn + log log pn ), does not satisfy the condition in Theorem 7, and hence may not be consistent with an estimated variance. [sent-203, score-1.169]
52 On the other hand, the GIC with λn = αn log pn is consistent as long as αn → ∞. [sent-204, score-0.566]
53 3 The Size of sn. For condition A5, sn should be large enough so that qn ≤ sn. [sent-206, score-0.666]
54 In many cases, sn can be sufficiently large for practical purposes. [sent-207, score-0.156]
55 For example, suppose {xi , i ≤ n} are independent and identically distributed pn dimensional random vectors such that E(x1 ) = 0 and Var(x1 ) = Σ = [σ jk ]. [sent-208, score-0.494]
56 By the inequality (2) in Greenshtein and Ritov (2004), we have sup_{j,k} | Σ_{i=1}^{n} x_ij x_ik / n − σ_jk | = O_p( (log n / n)^(1/2) ), and hence sup_{j,k} |a_jk| = O_p( (log n / n)^(1/2) ), where a_jk is the (j, k) entry of A. [sent-218, score-0.254]
57 We consider the six GICs whose corresponding λn's are given as follows: (1) GIC1 (= BIC): λn = log n; (2) GIC2: λn = pn^(1/3); (3) GIC3: λn = 2 log pn; (4) GIC4: λn = 2(log pn + log log pn); (5) GIC5: λn = (log log n) log pn; (6) GIC6: λn = (log n) log pn. [sent-223, score-3.243]
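For reference, the six penalty levels compared in the simulations can be written as simple functions of n and pn; a small sketch (the function name is ours):

```python
import numpy as np

def gic_lambdas(n, p_n):
    """The six lambda_n choices GIC1-GIC6 used in the simulation study."""
    return {
        "GIC1 (BIC)": np.log(n),
        "GIC2": p_n ** (1.0 / 3.0),
        "GIC3": 2.0 * np.log(p_n),
        "GIC4": 2.0 * (np.log(p_n) + np.log(np.log(p_n))),
        "GIC5": np.log(np.log(n)) * np.log(p_n),
        "GIC6": np.log(n) * np.log(p_n),
    }
```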
58 First, we compare performances of the GICs applied to all possible submodels with those applied to submodels constructed by the solution path of a sparse penalized approach. [sent-240, score-0.514]
59 From Table 1, we can see that the results based on the SCAD solution path are almost identical to those based on the all-possible-subset search, which suggests that model selection with the SCAD solution path is a promising alternative to exhaustive search. [sent-246, score-0.242]
60 However, the relative performances of the model selection criteria are similar. [sent-307, score-0.156]
61 However, when n = 100, the GIC1 is better in terms of prediction accuracy than some other GICs that are selection consistent, illustrating the conflict between selection consistency and prediction optimality. [sent-430, score-0.185]
62 We compare prediction accuracies of the 6 GICs with the submodels obtained from the SCAD solution path. [sent-548, score-0.207]
63 A practical strategy is to select the model on the solution path whose number of nonzero coefficients is equal to pmax, and to estimate the error variance by the mean squared error of the selected model. [sent-665, score-0.225]
64 Table 6 compares the 6 GICs with the number of pre-screened genes being p = 500 and p = 3000, when the error variance is estimated with pmax being 20, 30 and 40, respectively. [sent-671, score-0.194]
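The variance-estimation step described above can be sketched as follows (hypothetical names; a minimal illustration assuming a nested sequence of supports such as the forward-selection path above): fit the model with pmax nonzero coefficients, use its mean squared error as the variance estimate, and then run the GIC with that estimate. The normalization by n − pmax is our convention; the text only says "mean squared error of the selected model".

```python
import numpy as np

def estimate_sigma2(X, y, support_pmax):
    """Estimate the error variance from the model whose support has p_max covariates."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    X_s = X[:, list(support_pmax)]
    beta, *_ = np.linalg.lstsq(X_s, y, rcond=None)
    rss = float(np.sum((y - X_s @ beta) ** 2))
    return rss / (n - len(support_pmax))   # degrees-of-freedom-adjusted MSE (our convention)

# Example usage with the hypothetical helpers above:
#   path = forward_selection_path(X, y, s_n)
#   sigma2_hat = estimate_sigma2(X, y, path[p_max])
#   selected = gic_on_path(X, y, sigma2_hat, lam, s_n)
```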
65 For p = 500, the lowest prediction error is achieved by the GIC2, and the GIC3, GIC4 and GIC5 perform reasonably well with pmax = 20. [sent-676, score-0.157]
66 For p = 3000, the lowest prediction error is achieved by the GIC5 with pmax = 20. [sent-677, score-0.157]
67 Concluding Remarks. The range of consistent model selection criteria is rather large, and it is not clear which one is better with finite samples. [sent-802, score-0.172]
68 It would be interesting to rewrite the class of GICs as {λn = αn log pn : αn > 0}. [sent-803, score-0.513]
69 The GIC3 , GIC5 and GIC6 correspond to αn = 2, αn = log log n and αn = log n, respectively. [sent-804, score-0.165]
70 A GIC with a smaller αn (e.g., αn = 2 or αn = log log n) would be better when many signal covariates with small regression coefficients are expected to exist. [sent-809, score-0.23]
71 We let β∗ = (β^(1)∗′, β^(2)∗′)′, where β^(1)∗ ∈ R^qn and β^(2)∗ ∈ R^(pn − qn). [sent-964, score-0.458]
72 Since H^(1)′H^(1) = (Cn^(1,1))^(−1), A2 of the regularity conditions implies ‖h_j^(1)‖² ≤ 1/M2 for all j ≤ qn. [sent-981, score-0.213]
73 Hence, E(z_j)^(2k) < ∞ for all j ≤ qn since E(εi)^(2k) < ∞. [sent-982, score-0.174]
74 Pr( |z_j| > ηn^(c2/2) for some j = 1, . . . , qn ) ≤ Σ_{j=1}^{qn} Pr( |z_j| > ηn^(c2/2) ) ≤ Σ_{j=1}^{qn} (1/η) n^(−c2 k) = (1/η) qn n^(−c2 k) ≤ (1/η) n^(−(c2 − c3)k) → 0, which completes the proof. [sent-987, score-0.696]
75 Next, Xn^(2)′(Yn − Ŷn∗) = Xn^(2)′( Yn − Xn^(1)β̂∗(1) ) = Xn^(2)′( Yn − (1/n) Xn^(1)(Cn^(1,1))^(−1) Xn^(1)′ Yn ) = Xn^(2)′( Xn^(1)β∗(1) + εn − (1/n) Xn^(1)(Cn^(1,1))^(−1) Xn^(1)′( Xn^(1)β∗(1) + εn ) ) = Xn^(2)′( I − (1/n) Xn^(1)(Cn^(1,1))^(−1) Xn^(1)′ ) εn. [sent-993, score-0.458]
76 Hence <Yn − Ŷn∗, Xn^j>/√n = h_j^(2)′ εn for j = qn + 1, . . . , pn, (2) [sent-994, score-0.174]
77 where h_j^(2) is the (j − qn)th column vector of H^(2) and H^(2)′ = (1/√n) Xn^(2)′ − Cn^(2,1) (Cn^(1,1))^(−1) (1/√n) Xn^(1)′. [sent-997, score-0.458]
78 Note that all the eigenvalues of I − Xn^(1)(Xn^(1)′Xn^(1))^(−1)Xn^(1)′ are between 0 and 1, so ‖h_j^(2)‖² ≤ Xn^j′Xn^j / n and hence E(ξ_j)^(2k) < ∞ for all j = qn + 1, . . . , pn, where ξ_j = <Yn − Ŷn∗, Xn^j>/√n. (3) [sent-999, score-0.174]
79 Finally, for any η > 0, Pr( |<Yn − Ŷn∗, Xn^j>| > η (nλnρn)^(1/2) for some j = qn + 1, . . . , pn ) [sent-1004, score-0.174]
80 = Pr( |ξ_j| > η (λnρn)^(1/2) for some j = qn + 1, . . . , pn ) [sent-1007, score-0.632]
81 ≤ Σ_{j=qn+1}^{pn} Pr( |ξ_j| > η (λnρn)^(1/2) ) = (pn − qn) O( 1/(λnρn)^k ) = O( pn/(λnρn)^k ) → 0, which completes the proof. [sent-1010, score-1.548]
82 For any π, we can write Rn(β̂π) + λn|π|σ² − Rn(β̂∗) − λn|πn∗|σ² = −2 Σ_{j=qn+1}^{pn} β̂π,j <Yn − Ŷn∗, Xn^j> + (β̂π − β̂∗)′(Xn′Xn)(β̂π − β̂∗) + λn(|π| − |πn∗|)σ². [sent-1012, score-0.458]
83 By Condition A3, (β̂π − β̂∗)′(Xn′Xn)(β̂π − β̂∗) ≥ nρn Σ_j (β̂π,j − β̂∗,j)², and hence, for any π ∈ M_sn, Rn(β̂π) + λn|π|σ² − Rn(β̂∗) − λn|πn∗|σ² ≥ Σ_{j∈π∪πn∗} w_j, where w_j = −2β̂π,j <Yn − Ŷn∗, Xn^j> I(j ∉ πn∗) + nρn(β̂π,j − β̂∗,j)² + λn( I(j ∈ π − πn∗) − I(j ∈ πn∗ − π) )σ². [sent-1014, score-0.156]
84 For j ∉ πn∗, completing the square gives w_j ≥ −<Yn − Ŷn∗, Xn^j>²/(nρn) + λnσ²; let Bn = { −<Yn − Ŷn∗, Xn^j>²/(nρn) + λnσ² > 0, j = qn + 1, . . . , pn }. [sent-1022, score-0.174]
85 For any π ⊂ {1, . . . , pn}, let Mπ be the projection operator onto the space spanned by (Xn^j, j ∈ π). [sent-1034, score-0.458]
86 Lemma 10 There exists η > 0 such that for any π ∈ M_sn with πn∗ ⊄ π, µn′(I − Mπ)µn ≥ η|π−| n^(c2−c1), where π− = πn∗ − π. [sent-1038, score-0.156]
87 For given π ∈ M_sn with πn∗ ⊄ π, we have µn′(I − Mπ)µn = inf_{α∈R^|π|} ‖Xπ− βπ−∗ − Xπ α‖² = inf_{α∈R^|π|} (βπ−∗′, −α′)(Xπ−, Xπ)′(Xπ−, Xπ)(βπ−∗′, −α′)′ ≥ n ρn ‖βπ−∗‖² ≥ M3 M4 |π−| n^(c2−c1), where βπ−∗ = (βj∗, j ∈ π−) and the last inequality is due to Condition A4. [sent-1040, score-0.156]
88 For π ⊂ {1, . . . , pn}, let Zπ = µn′(I − Mπ)εn / ( µn′(I − Mπ)µn )^(1/2). [sent-1044, score-0.458]
89 Since Pr(|Zπ| > t) ≤ C exp(−t²/2) for some C > 0, we have Pr( max_{π∈M_sn} |Zπ| > t ) ≤ Σ_{π∈M_sn} C exp(−t²/2) ≤ C pn^(sn) exp(−t²/2). (6) [sent-1048, score-0.312]
90 Hence, if we let t = ( w sn log pn )^(1/2), then Pr( max_{π∈M_sn} |Zπ| > t ) ≤ C exp( (−w/2 + 1) sn log pn ) → 0 as w → ∞. [sent-1049, score-1.234]
91 By (7), εn′Mπεn follows a χ² distribution with r(π) ≤ |π| degrees of freedom, and hence Pr( max_{π∈M_sn} εn′Mπεn ≥ t ) ≤ Σ_{k=1}^{sn} (pn choose k) Pr(Wk ≥ t), where Wk ∼ χ²(k). [sent-1058, score-0.77]
92 Since Pr(Wk ≥ t) ≤ Pr(Wsn ≥ t), we have Pr( max_{π∈M_sn} εn′Mπεn ≥ t ) ≤ Pr(Wsn ≥ t) Σ_{k=1}^{sn} (pn choose k) ≤ Pr(Wsn ≥ t) pn^(sn). [sent-1059, score-0.77]
93 For π with πn∗ ⊄ π, Lemmas 10, 11 and 12 imply Rn(π) − Rn(πn∗) + λn(|π| − |πn∗|)σ² = µn′(I − Mπ)µn + 2µn′(I − Mπ)εn + εn′(Mπn∗ − Mπ)εn + λn(|π| − |πn∗|)σ² ≥ η|π−| n^(c2−c1) − 2( η|π−| n^(c2−c1) )^(1/2) O_p( (sn log pn)^(1/2) ) − O_p(sn log pn) − |π−|λn, where π− = πn∗ − π. [sent-1068, score-1.182]
94 Since sn log pn = o(n^(c2−c1)) and λn = o(n^(c2−c1)), the proof is done. [sent-1069, score-0.669]
95 By Theorem 1 of Zhang and Shen (2010), the probability of (9) is larger than ( 1 − (1 + e^(1/2)) exp( −(λn − log λn)/2 ) )^(pn − qn), which converges to 1 when 2 log pn − λn + log λn → −∞. [sent-1071, score-1.081]
96 The condition 2 log pn − λn + log λn → −∞ is equivalent to λn − 2 log pn − log log pn → ∞. [sent-1072, score-1.673]
97 Proof of Theorem 4 By Theorem 4 of Kim and Kwon (2012), the solution path of the SCAD or minimax concave penalty includes the true model with probability converging to 1. [sent-1074, score-0.251]
98 Since condition A3′ is stronger than condition A3, the GICλn with λn = o(n^(c2−c1)) is consistent, and so is the GICλn applied to the solution path of the SCAD or minimax concave penalty. [sent-1075, score-0.218]
99 By (6), we have Pr(B̃n^c) ≤ Pr( <Yn − Ŷn∗, Xn^j>² > nρnλnσ̂² for some j = qn + 1, . . . , pn ). [sent-1081, score-0.174]
100 Hence, as long as 2M1 log pn/(ρn r_inf) − λn → −∞, Pr(B̃n^c) → 0 and the proof is done. [sent-1085, score-0.557]
wordName wordTfidf (topN-words)
[('gic', 0.504), ('pn', 0.458), ('gics', 0.258), ('qn', 0.174), ('scad', 0.169), ('submodels', 0.168), ('bic', 0.163), ('sn', 0.156), ('imensions', 0.129), ('onsistent', 0.129), ('riteria', 0.129), ('xnj', 0.129), ('yn', 0.127), ('ric', 0.11), ('xn', 0.11), ('pmax', 0.109), ('pr', 0.107), ('kwon', 0.103), ('hoi', 0.092), ('won', 0.092), ('covariates', 0.088), ('igh', 0.086), ('aic', 0.08), ('election', 0.077), ('odel', 0.07), ('path', 0.069), ('kim', 0.068), ('ptm', 0.066), ('corrected', 0.066), ('shen', 0.066), ('foster', 0.065), ('im', 0.064), ('chen', 0.063), ('log', 0.055), ('cn', 0.054), ('consistent', 0.053), ('criteria', 0.053), ('penalized', 0.053), ('scheetz', 0.052), ('wsn', 0.052), ('consistency', 0.051), ('rn', 0.048), ('simulation', 0.047), ('selection', 0.047), ('rin', 0.044), ('concave', 0.041), ('minimax', 0.041), ('bn', 0.041), ('penalty', 0.04), ('diverging', 0.04), ('criterion', 0.039), ('nonzero', 0.039), ('hosik', 0.039), ('lightness', 0.039), ('shao', 0.039), ('regularity', 0.039), ('performances', 0.037), ('zou', 0.037), ('genes', 0.036), ('jk', 0.036), ('argmin', 0.035), ('tail', 0.034), ('zhang', 0.033), ('ation', 0.033), ('clipped', 0.033), ('signal', 0.032), ('theorem', 0.031), ('george', 0.031), ('korea', 0.031), ('remarks', 0.029), ('gene', 0.028), ('error', 0.028), ('sinica', 0.028), ('statistica', 0.028), ('choi', 0.028), ('nc', 0.028), ('lasso', 0.027), ('broman', 0.026), ('casavant', 0.026), ('craven', 0.026), ('selectivity', 0.026), ('sunghoon', 0.026), ('swiderski', 0.026), ('yongdai', 0.026), ('huang', 0.026), ('condition', 0.024), ('smoothly', 0.023), ('shef', 0.022), ('diverge', 0.022), ('greenshtein', 0.022), ('converging', 0.022), ('variance', 0.021), ('eigenvalue', 0.021), ('prediction', 0.02), ('eye', 0.02), ('seoul', 0.02), ('asymptotic', 0.019), ('model', 0.019), ('solution', 0.019), ('wang', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi
Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression
2 0.14924984 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
3 0.14063405 80 jmlr-2012-On Ranking and Generalization Bounds
Author: Wojciech Rejchel
Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are 1 faster than √n . We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
5 0.12875324 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
6 0.10306484 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
7 0.08437907 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
8 0.081507102 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
9 0.077379242 68 jmlr-2012-Minimax Manifold Estimation
10 0.070890814 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
11 0.060483392 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
12 0.058113161 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
13 0.049229171 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
14 0.047915645 91 jmlr-2012-Plug-in Approach to Active Learning
15 0.040734604 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
16 0.038548499 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
17 0.036868677 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
18 0.036164224 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
19 0.034261525 59 jmlr-2012-Linear Regression With Random Projections
20 0.032844275 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
topicId topicWeight
[(0, -0.19), (1, 0.18), (2, -0.195), (3, -0.093), (4, -0.022), (5, -0.036), (6, -0.106), (7, -0.03), (8, 0.033), (9, -0.08), (10, 0.232), (11, 0.157), (12, 0.258), (13, 0.052), (14, -0.123), (15, 0.253), (16, -0.102), (17, -0.009), (18, 0.08), (19, 0.044), (20, 0.012), (21, -0.095), (22, -0.019), (23, -0.086), (24, 0.018), (25, 0.024), (26, -0.151), (27, -0.113), (28, 0.058), (29, 0.054), (30, -0.085), (31, 0.071), (32, 0.003), (33, 0.115), (34, 0.071), (35, -0.121), (36, 0.07), (37, 0.021), (38, 0.098), (39, -0.021), (40, -0.002), (41, 0.017), (42, 0.074), (43, -0.002), (44, -0.027), (45, 0.079), (46, -0.093), (47, 0.161), (48, 0.034), (49, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.9497543 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi
Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
3 0.52185589 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
4 0.48782659 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
5 0.4635902 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
Author: Tim van Erven, Mark D. Reid, Robert C. Williamson
Abstract: Mixability of a loss characterizes fast rates in the online learning setting of prediction with expert advice. The determination of the mixability constant for binary losses is straightforward but opaque. In the binary case we make this transparent and simpler by characterising mixability in terms of the second derivative of the Bayes risk of proper losses. We then extend this result to multiclass proper losses where there are few existing results. We show that mixability is governed by the maximum eigenvalue of the Hessian of the Bayes risk, relative to the Hessian of the Bayes risk for log loss. We conclude by comparing our result to other work that bounds prediction performance in terms of the geometry of the Bayes risk. Although all calculations are for proper losses, we also show how to carry the results across to improper losses. Keywords: mixability, multiclass, prediction with expert advice, proper loss, learning rates
6 0.44029915 80 jmlr-2012-On Ranking and Generalization Bounds
7 0.34029603 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
8 0.31735262 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
9 0.30171892 68 jmlr-2012-Minimax Manifold Estimation
10 0.2652649 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
11 0.2651898 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
12 0.20692676 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
13 0.20628056 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
14 0.20242989 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
15 0.18505532 59 jmlr-2012-Linear Regression With Random Projections
16 0.18175946 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
17 0.17264716 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
18 0.16801068 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
19 0.16473788 91 jmlr-2012-Plug-in Approach to Active Learning
20 0.16250335 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
topicId topicWeight
[(0, 0.421), (7, 0.017), (21, 0.076), (26, 0.033), (29, 0.025), (35, 0.011), (49, 0.02), (75, 0.066), (77, 0.036), (79, 0.019), (92, 0.098), (96, 0.069)]
simIndex simValue paperId paperTitle
1 0.85248333 9 jmlr-2012-A Topic Modeling Toolbox Using Belief Propagation
Author: Jia Zeng
Abstract: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interests and touches on many important applications in text mining, computer vision and computational biology. This paper introduces a topic modeling toolbox (TMBP) based on the belief propagation (BP) algorithms. TMBP toolbox is implemented by MEX C++/Matlab/Octave for either Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this toolbox lies in the BP algorithms for learning LDA-based topic models. The current version includes BP algorithms for latent Dirichlet allocation (LDA), authortopic models (ATM), relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing project and more BP-based algorithms for various topic models will be added in the near future. Interested users may also extend BP algorithms for learning more complicated topic models. The source codes are freely available under the GNU General Public Licence, Version 1.0 at https://mloss.org/software/view/399/. Keywords: topic models, belief propagation, variational Bayes, Gibbs sampling
same-paper 2 0.64855552 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
Author: Yongdai Kim, Sunghoon Kwon, Hosik Choi
Abstract: Asymptotic properties of model selection criteria for high-dimensional regression models are studied where the dimension of covariates is much larger than the sample size. Several sufficient conditions for model selection consistency are provided. Non-Gaussian error distributions are considered and it is shown that the maximal number of covariates for model selection consistency depends on the tail behavior of the error distribution. Also, sufficient conditions for model selection consistency are given when the variance of the noise is neither known nor estimated consistently. Results of simulation studies as well as real data analysis are given to illustrate that finite sample performances of consistent model selection criteria can be quite different. Keywords: model selection consistency, general information criteria, high dimension, regression
3 0.34125453 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
Author: Lan Xue, Annie Qu
Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty
4 0.34095973 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
5 0.3408969 20 jmlr-2012-Analysis of a Random Forests Model
Author: Gérard Biau
Abstract: Random forests are a scheme proposed by Leo Breiman in the 2000’s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman (2004), which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. Keywords: random forests, randomization, sparsity, dimension reduction, consistency, rate of convergence
6 0.33999008 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
7 0.33943975 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
8 0.33441135 111 jmlr-2012-Structured Sparsity and Generalization
9 0.33308795 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
10 0.33280712 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
11 0.32917097 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
12 0.32790789 73 jmlr-2012-Multi-task Regression using Minimal Penalties
13 0.32723236 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
14 0.32576698 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
15 0.32548434 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
16 0.3239457 82 jmlr-2012-On the Necessity of Irrelevant Variables
17 0.32361135 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
18 0.32309884 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
19 0.32112634 13 jmlr-2012-Active Learning via Perfect Selective Classification
20 0.32076544 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints