jmlr jmlr2012 jmlr2012-35 knowledge-graph by maker-knowledge-mining

35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning


Source: pdf

Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan

Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). [sent-9, score-0.518]

2 EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. [sent-10, score-0.264]

3 Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. [sent-11, score-0.244]

4 We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. [sent-15, score-0.27]

5 Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods 1. [sent-17, score-0.606]

6 But in addition a number of studies have emphasized the advantages of nonconvex penalties—such as the bridge penalty and the log-penalty—for achieving sparsity (Fu, 1998; Fan and Li, 2001; Mazumder et al. [sent-26, score-0.309]

7 This is done by taking a Bayesian decision-theoretic approach in which the loss function L(b; X ) is based on the conditional likelihood of the response yi and the penalty Pλ (b) is associated with a prior distribution for b. [sent-29, score-0.18]
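
A one-line restatement of this correspondence, in our own notation rather than the paper's (a sketch of the standard MAP reading):

\hat{b}_{\mathrm{MAP}} = \operatorname*{argmax}_{b}\; p(b \mid y, X) = \operatorname*{argmin}_{b}\; \Big\{ \underbrace{-\log p(y \mid X, b)}_{L(b;\,X)} \;+\; \underbrace{-\log p(b)}_{P_{\lambda}(b)} \Big\}.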

8 This has led to Bayesian versions of the lasso, which are based on expressing the Laplace prior as a scale-mixture of a Gaussian distribution and an exponential density (Andrews and Mallows, 1974; West, 1987). [sent-32, score-0.163]

9 In recent work, Polson and Scott (2011) proposed using generalized hyperbolic distributions, variance-mean mixtures of Gaussians with generalized inverse Gaussian densities, devising EM algorithms via data augmentation methodology. [sent-35, score-0.274]

10 Further work by Griffin and Brown (2010a) involved the use of a family of normal-gamma priors as a generalization of the Bayesian lasso. [sent-40, score-0.157]

11 (2010), the authors proposed horseshoe priors which are a mixture of normal distributions and a half-Cauchy density on the positive reals with a scale parameter. [sent-43, score-0.367]

12 In particular, they showed that the bridge penalty can be obtained by mixing the Laplace distribution with a stable distribution. [sent-48, score-0.239]

13 Other authors have shown that the prior induced from the log-penalty has an interpretation as a scale mixture of Laplace distributions with an inverse gamma density (Cevher, 2009; Garrigues and Olshausen, 2010; Lee et al. [sent-49, score-0.395]

14 Additionally, Griffin and Brown (2010b) devised a family of normal-exponential-gamma priors for a Bayesian adaptive lasso (Zou, 2006). [sent-52, score-0.212]

15 Polson and Scott (2010, 2012) provided a unifying framework for the construction of sparsity priors using Lévy processes. [sent-53, score-0.185]

16 Generalized inverse Gaussian (GIG) distributions (Jørgensen, 1982) are conjugate with respect to an exponential power (EP) distribution (Box and Tiao, 1992)—an extension of Gaussian and Laplace distributions. [sent-55, score-0.246]
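
A brief sketch of why this conjugacy holds, written for a single coefficient with unit scale and using the EP and GIG kernels as parameterized later in this summary (our own derivation, hedged accordingly):

p(\eta \mid b) \;\propto\; \mathrm{EP}(b \mid 0, \eta, q)\,\mathrm{GIG}(\eta \mid \gamma, \beta, \alpha)
\;\propto\; \eta^{-1/q} e^{-|b|^{q}/(2\eta)} \cdot \eta^{\gamma-1} e^{-(\alpha\eta+\beta\eta^{-1})/2}
\;\propto\; \eta^{(\gamma-1/q)-1} \exp\!\big(-\tfrac{1}{2}\,(\alpha\eta + (\beta+|b|^{q})\,\eta^{-1})\big),

which is the kernel of \mathrm{GIG}(\eta \mid \gamma-1/q,\; \beta+|b|^{q},\; \alpha).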

17 In particular, we define EP-GIG distributions as scale mixtures of EP distributions with a GIG density, and derive their explicit densities. [sent-57, score-0.243]

18 EP-GIG distributions can be regarded as a variant of generalized hyperbolic distributions, and include Gaussian scale mixtures and Laplacian scale mixtures as special cases. [sent-58, score-0.424]

19 The Gaussian scale mixture is a class of generalized hyperbolic distributions (Polson and Scott, 2011) and its special cases include normal-gamma distributions (Griffin and Brown, 2010a) as well as the Laplacian distribution. [sent-59, score-0.337]

20 , 2010) and the bridge distribution inducing the ℓ1/2 bridge penalty (Zou and Li, 2008) are special cases of Laplacian scale mixtures. [sent-62, score-0.347]

21 Since GIG priors are conjugate with respect to EP distributions, it is feasible to apply EP-GIG to Bayesian sparse learning. [sent-64, score-0.194]

22 In particular, using the fact that EP-GIG distributions are scale mixtures of exponential power distributions, we devise EM algorithms for finding a sparse MAP estimate of b. [sent-67, score-0.309]

23 When we set the exponential power distribution to be the Gaussian distribution, the resulting EM algorithm is closely related to the iteratively reweighted ℓ2 minimization methods in Daubechies et al. [sent-68, score-0.389]

24 When we employ the Laplace distribution as a special exponential power distribution, we obtain an EM algorithm which is identical to the iteratively reweighted ℓ1 minimization method in Candès et al. [sent-70, score-0.389]

25 We apply the EP-GIG priors proposed in Appendix B to conduct an experimental analysis. [sent-73, score-0.157]

26 The experimental results validate that the proposed EP-GIG priors which induce nonconvex penalties are potentially feasible and effective in sparsity modeling. [sent-74, score-0.337]

27 Theorem 5 proves that an exponential power distribution of order q/2 (q > 0) can be represented as a scale mixture of exponential power distributions of order q with a gamma mixing density. [sent-78, score-0.494]

28 The first part of Theorem 6 shows that GIG is conjugate with respect to EP, while the second part then offers theoretical support for relating EM algorithms with iteratively reweighted minimization methods under our framework. [sent-80, score-0.27]

29 A brief review of exponential power distributions and generalized inverse Gaussian distributions is given in Section 2. [sent-86, score-0.329]

30 Section 3 presents EP-GIG distributions and some of their properties, Section 4 develops our EM algorithm for Bayesian sparse learning, and Section 5 discusses the relationship between the EM and iteratively reweighted minimization methods. [sent-87, score-0.401]

31 Finally, we conclude our work in Section 7, defer all proofs to Appendix A, and provide several new sparsity priors in Appendix B. [sent-89, score-0.185]

32 Preliminaries Before presenting EP-GIG priors for sparse modeling of regression vector b, we review the exponential power (EP) and generalized inverse Gaussian (GIG) distributions. [sent-91, score-0.413]

33 As for the case that q < 1, the corresponding density induces a bridge penalty for b. [sent-99, score-0.231]

34 2 Generalized Inverse Gaussian Distributions We first let G(η|τ, θ) denote the gamma distribution whose density is p(η) = (θ^τ/Γ(τ)) η^{τ−1} exp(−θη), τ, θ > 0, and IG(η|τ, θ) denote the inverse gamma distribution whose density is p(η) = (θ^τ/Γ(τ)) η^{−(1+τ)} exp(−θη^{−1}), τ, θ > 0. [sent-102, score-0.373]
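
A small numerical check of the two densities quoted above against scipy's gamma and inverse-gamma parameterizations; the test values are illustrative and this is only a sanity-check sketch:

import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn

def gamma_density(eta, tau, theta):
    # p(eta) = theta^tau / Gamma(tau) * eta^(tau - 1) * exp(-theta * eta)
    return theta ** tau / gamma_fn(tau) * eta ** (tau - 1) * np.exp(-theta * eta)

def inv_gamma_density(eta, tau, theta):
    # p(eta) = theta^tau / Gamma(tau) * eta^(-(1 + tau)) * exp(-theta / eta)
    return theta ** tau / gamma_fn(tau) * eta ** (-(1 + tau)) * np.exp(-theta / eta)

eta, tau, theta = 0.7, 2.5, 1.3
assert np.isclose(gamma_density(eta, tau, theta),
                  stats.gamma.pdf(eta, a=tau, scale=1.0 / theta))
assert np.isclose(inv_gamma_density(eta, tau, theta),
                  stats.invgamma.pdf(eta, a=tau, scale=theta))
print(gamma_density(eta, tau, theta), inv_gamma_density(eta, tau, theta))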

35 It is well known that its special cases include the gamma distribution G(η|γ, α/2) when β = 0 and γ > 0, the inverse gamma distribution IG(η| − γ, β/2) when α = 0 and γ < 0, the inverse Gaussian distribution when γ = −1/2, and the hyperbolic distribution when γ = 0. [sent-106, score-0.456]
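
For readers who want to evaluate or sample these GIG priors, here is a sketch of how the GIG(γ, β, α) density used here maps onto scipy.stats.geninvgauss (p = γ, b = √(αβ), scale = √(β/α)); the mapping is our own and the test values are illustrative:

import numpy as np
from scipy import stats
from scipy.special import kv  # modified Bessel function of the second kind

def gig_density(eta, gam, beta, alpha):
    # GIG(gam, beta, alpha): (alpha/beta)^(gam/2) / (2 K_gam(sqrt(alpha*beta)))
    #                        * eta^(gam - 1) * exp(-(alpha*eta + beta/eta) / 2)
    norm = (alpha / beta) ** (gam / 2) / (2.0 * kv(gam, np.sqrt(alpha * beta)))
    return norm * eta ** (gam - 1) * np.exp(-(alpha * eta + beta / eta) / 2.0)

eta, gam, beta, alpha = 0.8, -0.5, 1.2, 2.0
via_scipy = stats.geninvgauss.pdf(eta, p=gam, b=np.sqrt(alpha * beta),
                                  scale=np.sqrt(beta / alpha))
print(gig_density(eta, gam, beta, alpha), via_scipy)  # the two values should agree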

36 EP-GIG Distributions We now develop a family of distributions by mixing the exponential power EP(b|0, η, q) with the generalized inverse Gaussian GIG(η|γ, β, α). [sent-119, score-0.293]

37 EP-GIG distributions can be regarded as variants of generalized hyperbolic distributions (Jørgensen, 1982), because when q = 2 EP-GIG distributions are generalized hyperbolic distributions— a class of Gaussian scale mixtures. [sent-131, score-0.487]

38 Note that when 0 < q < 2 an EP distribution is a class of Gaussian scale mixtures (West, 1987; Lange and Sinsheimer, 1993), which implies that EP-GIG can also be represented as a class of Gaussian scale mixtures. [sent-133, score-0.181]

39 We now consider the two special cases in which the mixing density is either a gamma distribution or an inverse gamma distribution. [sent-136, score-0.333]

40 This yields two special EP-GIG distributions: exponential power-gamma distributions and exponential power-inverse gamma distributions. [sent-137, score-0.262]

41 1 Generalized t Distributions We first consider an important family of EP-GIG distributions which are scale mixtures of exponential power EP(b|u, η, q) with inverse gamma IG(η|τ/2, τ/(2λ)). [sent-139, score-0.419]

42 (2011) called the resulting distributions generalized double Pareto distributions (GDP). [sent-146, score-0.173]

43 (α/4) exp(−√(α|b|)) = ∫_0^∞ L(b|0, η) G(η|3/2, α/2) dη. (4) Obviously, the current exponential power-gamma distribution is identical to the exponential power distribution EP(b|0, α^{−1/2}/2, 1/2), a bridge distribution with q = 1/2. [sent-165, score-0.314]
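
A quadrature sanity check of the mixture identity in (4) as reconstructed above, taking the Laplace density to be the exponential power density with q = 1 in the parameterization assumed here, i.e. exp(−|b|/(2η))/(4η); the values of α and b are illustrative:

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn

def laplace_ep(b, eta):
    # Laplace as EP(b | 0, eta, q = 1) in the parameterization assumed here
    return np.exp(-abs(b) / (2.0 * eta)) / (4.0 * eta)

def gamma_density(eta, tau, theta):
    return theta ** tau / gamma_fn(tau) * eta ** (tau - 1) * np.exp(-theta * eta)

alpha, b = 2.0, 1.3
mixture, _ = quad(lambda eta: laplace_ep(b, eta) * gamma_density(eta, 1.5, alpha / 2.0),
                  0.0, np.inf)
closed_form = alpha / 4.0 * np.exp(-np.sqrt(alpha * abs(b)))
print(mixture, closed_form)  # should agree up to quadrature error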

44 When − log p(b) is used as a penalty in supervised sparse learning, iteratively reweighted ℓ1 or ℓ2 methods are generally used for solving the resulting optimization problem. [sent-182, score-0.422]

45 We will see that Theorem 6 implies a relationship between an iteratively reweighted method and an EM algorithm, which is presented in Section 4. [sent-183, score-0.298]

46 Furthermore, when 0 < q ≤ 1, − log p(b) is concave in |b| on (0, ∞); namely, − log p(b) defines a class of nonconvex penalties for b. [sent-192, score-0.186]

47 In other words, − log(p(b)) with 0 < q ≤ 1 induces a nonconvex penalty for b. [sent-198, score-0.189]

48 Figure 1 depicts several penalties graphically; these are obtained from the special priors in Appendix B. [sent-199, score-0.213]

49 In Figure 2, we also illustrate the penalties induced from the 1/2-bridge scale mixture priors (see Examples 7 and 8 in Appendix B), generalized t priors and EP-Gamma priors. [sent-201, score-0.509]

50 Again, we see that the two penalties induced from the 1/2-bridge mixture priors are concave in |b| on (0, ∞). [sent-202, score-0.299]

51 Quasi-Bayesian Sparse Learning Methods In this section we apply EP-GIG priors to quasi-Bayesian sparse learning. [sent-205, score-0.194]

52 Using the property that the EP-GIG distribution is a scale mixture of exponential power distributions, we devise an EM algorithm for the MAP estimation of b. [sent-225, score-0.217]
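
A compact sketch of such an EM loop for the q = 2 (Gaussian) case: the E-step weight is the GIG posterior mean of 1/η_j and the M-step reduces to a weighted ridge solve plus a σ update. The hyperparameters, data and update details are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_mean_inv(g, beta, alpha):
    # E[1/eta] under GIG(g, beta, alpha) with kernel eta^(g-1) exp(-(alpha*eta + beta/eta)/2)
    s = np.sqrt(alpha * beta)
    return np.sqrt(alpha / beta) * kv(g - 1.0, s) / kv(g, s)

def ep_gig_em(X, y, gam=1.0, beta=1.0, alpha=4.0, q=2, n_iter=30):
    n, p = X.shape
    b, sigma = np.zeros(p), 1.0
    for _ in range(n_iter):
        # E-step: eta_j | b_j has a GIG((gam*q - 1)/q, beta + |b_j|^q / sigma, alpha) posterior
        w = np.array([gig_mean_inv((gam * q - 1.0) / q, beta + abs(bj) ** q / sigma, alpha)
                      for bj in b])
        # M-step (q = 2): weighted ridge solve for b, then a simple sigma update
        b = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)
        rss = float((y - X @ b) @ (y - X @ b))
        sigma = (rss + float(w @ b ** 2)) * q / (q * n + 2 * p)
    return b, sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 15))
b_true = np.zeros(15)
b_true[:3] = [2.0, -1.5, 1.0]
y = X @ b_true + 0.3 * rng.standard_normal(60)
b_hat, sigma_hat = ep_gig_em(X, y)
print(np.round(b_hat, 2), round(sigma_hat, 3))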

53 [Figure 1 residue: axis ticks and repeated "penalty function" axis labels.] [sent-227, score-0.279]

54 [Figure 1 residue: penalty function curves for β = 0.1, 1, 5; axis tick values.] [sent-237, score-0.279]

55 Panels (d) γ = 0, q = 2; (e) γ = 1, q = 2; (f) γ = −1, q = 2, with horizontal axis b. Figure 1: Penalty functions induced from exponential power-generalized inverse gamma (EP-GIG) priors in which α = 1. [sent-241, score-0.359]

56 Let p_l be the cardinality of I_l, and b_l = {b_j : j ∈ I_l} denote the subvector of b, for l = 1, . . . , g. [sent-285, score-0.203]

57 By integrating out η_l, the marginal density of b_l conditional on σ is then proportional to (β + σ^{−1}‖b_l‖_q^q)^{(γ_l q − p_l)/(2q)} K_{(γ_l q − p_l)/q}(√(α(β + σ^{−1}‖b_l‖_q^q))), which implies that the density does not factorize over the entries of b_l (b_l is non-factorial). [sent-294, score-0.694]

58 The posterior distribution of η_l given b_l is then GIG(η_l | (γ_l q − p_l)/q, β + σ^{−1}‖b_l‖_q^q, α). [sent-295, score-0.291]

59 In this case, the iterative procedure for (b, σ) is given by b^{(t+1)} = argmin_b (y − Xb)^T(y − Xb) + ∑_{l=1}^g w_l^{(t+1)} ‖b_l‖_q^q and σ^{(t+1)} = q[(y − Xb^{(t+1)})^T(y − Xb^{(t+1)}) + ∑_{l=1}^g w_l^{(t+1)} ‖b_l^{(t+1)}‖_q^q]/(qn + 2p), where for l = 1, . . . , g, [sent-296, score-0.321]

60 w_l^{(t+1)} = α^{1/2} K_{(γ_l q − q − p_l)/q}(√(α[β + ‖b_l^{(t)}‖_q^q/σ^{(t)}])) / ([β + ‖b_l^{(t)}‖_q^q/σ^{(t)}]^{1/2} K_{(γ_l q − p_l)/q}(√(α[β + ‖b_l^{(t)}‖_q^q/σ^{(t)}]))). [sent-299, score-0.269]

61 As a result, we have w_l^{(t+1)} = √(σ^{(t)}α/(σ^{(t)}β + ‖b_l^{(t)}‖_q^q)) when (γ_l q − p_l)/q = 1/2, and w_l^{(t+1)} = σ^{(t)}/(σ^{(t)}β + ‖b_l^{(t)}‖_q^q) + √(σ^{(t)}α/(σ^{(t)}β + ‖b_l^{(t)}‖_q^q)) when (γ_l q − p_l)/q = −1/2. [sent-302, score-0.165]
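
A quick numerical cross-check of the two closed-form cases, as reconstructed above, against the general Bessel-ratio expression for E[1/η] under the GIG posterior; the parameter values are illustrative:

import numpy as np
from scipy.special import kv

def gig_mean_inv(g, beta, alpha):
    # E[1/eta] under GIG(g, beta, alpha): sqrt(alpha/beta) * K_{g-1}(sqrt(alpha*beta)) / K_g(sqrt(alpha*beta))
    s = np.sqrt(alpha * beta)
    return np.sqrt(alpha / beta) * kv(g - 1.0, s) / kv(g, s)

sigma, alpha, beta, bnorm = 0.7, 3.0, 0.5, 1.2   # bnorm stands in for ||b_l||_q^q
beta_post = beta + bnorm / sigma

# Case (gamma_l*q - p_l)/q = 1/2
print(gig_mean_inv(0.5, beta_post, alpha),
      np.sqrt(sigma * alpha / (sigma * beta + bnorm)))

# Case (gamma_l*q - p_l)/q = -1/2
print(gig_mean_inv(-0.5, beta_post, alpha),
      sigma / (sigma * beta + bnorm) + np.sqrt(sigma * alpha / (sigma * beta + bnorm)))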

62 Given the t-th estimate b^{(t)} of b, the E-step of EM calculates Q(b|b^{(t)}) = log p(y|b) + ∑_{j=1}^p ∫ log p(b_j|η_j) p(η_j|b_j^{(t)}, α, β, γ) dη_j ∝ ∑_{i=1}^n [y_i log π_i + (1 − y_i) log(1 − π_i)] − (1/2) ∑_{j=1}^p w_j^{(t+1)} |b_j|^q. [sent-313, score-0.179]
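
A small sketch of evaluating such a Q-function for the logistic-regression extension; the data, the weights w and the exponent q are illustrative assumptions:

import numpy as np

def q_function(b, X, y, w, q):
    # Q(b | b^(t)) up to constants: Bernoulli log-likelihood minus 0.5 * sum_j w_j |b_j|^q
    z = X @ b
    loglik = np.sum(y * z - np.log1p(np.exp(z)))  # = sum_i [y_i log(pi_i) + (1 - y_i) log(1 - pi_i)]
    return loglik - 0.5 * np.sum(w * np.abs(b) ** q)

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
y = (rng.random(40) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)
print(q_function(np.zeros(5), X, y, w=np.ones(5), q=1.0))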

63 Then the penalized regression problem is min_b F(b) = (1/2)‖y − Xb‖_2^2 + λ ∑_{j=1}^p R(|b_j|^q), which can be solved via an iteratively reweighted ℓq method. [sent-324, score-0.331]
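
A minimal iteratively reweighted ℓ2 sketch for a problem of this form, using the log penalty R(t) = log(t + ε) with q = 2 as a stand-in for R(|b_j|^q); λ, ε, the data and the fixed iteration count are illustrative assumptions:

import numpy as np

def irls_l2_log_penalty(X, y, lam=1.0, eps=1e-3, n_iter=50):
    # Majorize-minimize for 0.5*||y - Xb||^2 + 0.5*lam*sum_j log(b_j^2 + eps)
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        w = lam / (b ** 2 + eps)                             # w_j = lam * R'(b_j^2) with R(t) = log(t + eps)
        b = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)   # weighted ridge (reweighted l2) step
    return b

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 20))
b_true = np.zeros(20)
b_true[:4] = [3.0, -2.0, 1.5, 2.5]
y = X @ b_true + 0.1 * rng.standard_normal(80)
print(np.round(irls_l2_log_penalty(X, y), 2))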

64 This implies the iteratively reweighted minimization method is identical to the EM algorithm given in Section 4. [sent-338, score-0.27]

65 When q = 2, the EM algorithm is identical to the reweighted ℓ2 method and corresponds to a local quadratic approximation (Fan and Li, 2001; Hunter and Li, 2005). [sent-340, score-0.219]

66 When q = 1, the EM algorithm is reweighted ℓ1 minimization and corresponds to an LLA. [sent-341, score-0.219]

67 This implies that the reweighted ℓ2 method of Daubechies et al. [sent-344, score-0.219]

68 , γ = 1 and q = 2), we obtain the combination of the reweighted ℓ2 method of Daubechies et al. [sent-348, score-0.219]

69 (2010) and the reweighted ℓ2 method of Chartrand and Yin (2008). [sent-349, score-0.219]

70 When γ = 3/2 and q = 1, the EM algorithm (see Table 1) is equivalent to a reweighted ℓ1 method, which in turn has a close connection with the reweighted ℓ2 method of Daubechies et al. [sent-350, score-0.438]

71 Additionally, the EM algorithm based on γ = 1/2 and q = 1 (see Table 1) can be regarded as the combination of the above reweighted ℓ1 method and the reweighted ℓ1 of Candès et al. [sent-352, score-0.467]

72 Interestingly, the EM algorithm based on the EP-GIG priors given in Examples 7 and 8 of Appendix B (i.e. [sent-354, score-0.467]

73 , γ = 3/2 and q = 1/2 or γ = 5/2 and q = 1/2) corresponds to a reweighted ℓ1/2 method. [sent-356, score-0.219]

74 2 Convergence Analysis Owing to the equivalence between the iteratively reweighted ℓq method and the EM algorithm, we carry out the convergence analysis in terms of the iteratively reweighted ℓq method. [sent-361, score-0.54]

75 Let {b^{(t)}} be a sequence defined by the iteratively reweighted ℓq method. [sent-366, score-0.27]

76 In fact, the iteratively reweighted ℓq method enjoys the same convergence as the standard EM algorithm (Dempster et al. [sent-373, score-0.27]

77 Experimental Studies In this paper our principal purpose has been to provide a new hierarchical framework within which we can construct sparsity-inducing priors and EM algorithms. [sent-405, score-0.185]

78 We also studied two EM algorithms based on the generalized t priors, that is, the exponential power-inverse gamma priors (see Section 3. [sent-408, score-0.339]

79 For the lasso, the adLasso and the reweighted ℓ1 problems in the M-step, we solved the optimization problems by a coordinate descent algorithm (Mazumder et al. [sent-418, score-0.219]
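
A compact coordinate-descent sketch for a weighted ℓ1 problem of the kind solved in these M-steps (lasso, adaptive lasso, or one reweighted-ℓ1 step); the objective, weights and iteration budget are illustrative, and this is not the authors' implementation:

import numpy as np

def weighted_lasso_cd(X, y, w, n_iter=200):
    # Cyclic coordinate descent for 0.5*||y - Xb||^2 + sum_j w_j * |b_j|
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                           # running residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]            # add coordinate j back into the residual
            rho = X[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - w[j], 0.0) / col_sq[j]  # soft-thresholding
            r -= X[:, j] * b[j]
    return b

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 30))
b_true = np.zeros(30)
b_true[[0, 5, 9]] = [2.0, -3.0, 1.5]
y = X @ b_true + 0.1 * rng.standard_normal(100)
print(np.nonzero(weighted_lasso_cd(X, y, w=np.full(30, 5.0)))[0])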

80 In particular, Method 1, Method 2 and Method 3 are based on the Laplace scale mixture priors proposed in Appendix B. [sent-422, score-0.255]

81 Additionally, Method 5 and Method 6 are based on the Gaussian scale mixture priors given in Appendix B, and Method 7 is based on the Cauchy prior. [sent-427, score-0.255]

82 In addition, Methods 5, 6 and 7 are based on reweighted ℓ2 minimization, so they do not naturally produce sparse estimates. [sent-445, score-0.256]

83 Conclusions In this paper we have proposed a family of sparsity-inducing priors that we call exponential power-generalized inverse Gaussian (EP-GIG) distributions. [sent-917, score-0.273]

84 We have defined the EP-GIG family as a mixture of exponential power distributions with a generalized inverse Gaussian (GIG) density. [sent-918, score-0.315]

85 In Appendix B, we have presented five new EP-GIG priors which can induce nonconvex penalties. [sent-921, score-0.253]

86 Since GIG distributions are conjugate with respect to the exponential power distribution, EP-GIG priors are natural for Bayesian sparse learning. [sent-922, score-0.198]

87 Our experiments have shown that the proposed EP-GIG priors giving rise to nonconvex penalties are potentially feasible and effective in sparsity modeling. [sent-925, score-0.337]

88 Thus, we have lim_{ν→∞} K_ν(ν^{1/2} z) = lim_{ν→∞} π^{−1/2} 2^ν ν^{−(ν+1)/2} z^{−(ν+1)} Γ(ν + 1/2) ∫_0^∞ cos(t)/(1 + t^2/(νz^2))^{ν+1/2} dt = lim_{ν→∞} π^{−1/2} 2^ν ν^{−(ν+1)/2} z^{−(ν+1)} Γ(ν + 1/2) ∫_0^∞ cos(t) exp(−t^2/z^2) dt. [sent-939, score-0.22]

89 Since C = lim_{z→+0} f(z) = lim_{z→+0} ∫_0^∞ e^{−u^2} cos(uz) du = ∫_0^∞ e^{−u^2} du = √π/2, we have φ(z) = (√π/2) z exp(−z^2/4). [sent-945, score-0.22]

90 lim_{ν→∞} Γ(ν + 1/2)/[(2π)^{1/2} ν^ν exp(−ν)] = lim_{ν→∞} (2π)^{1/2} ν^ν [1 + 1/(2ν)]^ν exp(−ν) exp(−1/2)/[(2π)^{1/2} ν^ν exp(−ν)] = 1. Thus, lim_{ν→∞} K_ν(ν^{1/2} z)/[2^{ν−1} (2π)^{1/2} ν^{(ν−1)/2} z^{−ν} exp(−ν) exp(−z^2/4)] = 1. [sent-948, score-0.22]
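
A quick numerical look at this limit as reconstructed above: the ratio of K_ν(ν^{1/2}z) to 2^{ν−1}(2π)^{1/2}ν^{(ν−1)/2}z^{−ν}e^{−ν}e^{−z^2/4} should drift toward 1 as ν grows; the values of ν and z are illustrative:

import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def bessel_ratio(nu, z):
    exact = kv(nu, np.sqrt(nu) * z)
    approx = (2.0 ** (nu - 1) * np.sqrt(2.0 * np.pi) * nu ** ((nu - 1) / 2.0)
              * z ** (-nu) * np.exp(-nu) * np.exp(-z ** 2 / 4.0))
    return exact / approx

for nu in (10, 40, 160):
    print(nu, bessel_ratio(nu, z=1.5))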

91 2 Some Limiting Properties of GIG Distributions An interesting property of the gamma and inverse gamma distributions is given as follows. [sent-973, score-0.299]

92 Proof Note that lim_{τ→∞} G(η|τ, τλ) = lim_{τ→∞} [(τλ)^τ/Γ(τ)] η^{τ−1} exp(−τλη) = lim_{τ→∞} (τλ)^τ η^{τ−1} exp(−τλη)/[(2π)^{1/2} τ^{τ−1/2} exp(−τ)] (use the Stirling formula) = lim_{τ→∞} [τ^{1/2}/((2π)^{1/2} η)] (λη exp(1 − λη))^τ. [sent-978, score-0.22]

93 Proof Using Lemma 13, lim_{γ→+∞} GIG(η|γ, β, γα) = lim_{γ→+∞} [γ^{γ/2} (α/β)^{γ/2}/(2K_γ(√(γαβ)))] η^{γ−1} exp(−(γαη + βη^{−1})/2) = lim_{γ→+∞} [γ^{1/2} α^γ exp(γ) exp(αβ/4)/((2π)^{1/2} 2^γ)] η^{γ−1} exp(−γαη/2) exp(−βη^{−1}/2) = δ(η|2/α). [sent-986, score-0.44]

94 We have lim_{γ→−∞} GIG(η|γ, −γβ, α) = lim_{τ→+∞} GIG(η|−τ, τβ, α) = lim_{τ→+∞} [(α/(τβ))^{−τ/2}/(2K_τ(√(ταβ)))] η^{−τ−1} exp(−(αη + τβη^{−1})/2), due to the fact that K_{−τ}(√(ταβ)) = K_τ(√(ταβ)). [sent-989, score-0.22]

95 lim_{ψ→+∞} p(η) = lim_{ψ→+∞} √(ψ/(2πφη^3)) exp(−ψ(φη − 1)^2/(2φη)). [sent-992, score-0.22]

96 In other words, we consider the mixture of the Gaussian distribution N(b|0, η) with the hyperbolic distribution GIG(η|0, β, α). [sent-1033, score-0.166]

97 5 Example 5 In the fifth special case we set q = 2 and γ = 1; that is, we consider the mixture of the Gaussian distribution N(b|0, η) with the generalized inverse Gaussian GIG(η|1, β, α). [sent-1036, score-0.178]

98 The Hierarchy with the Bridge Prior Given in (4) Using the bridge prior in (4) yields the following hierarchical model: [y|b, σ] ∼ N(y|Xb, σ I_n), [b_j|η_j, σ] ∼ L(b_j|0, ση_j) (independently), [η_j] ∼ G(η_j|3/2, α/2) (independently), p(σ) = “Constant”. [sent-1063, score-0.268]
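
A tiny generative sketch of this hierarchy for simulation purposes, with the non-informative p(σ) replaced by a fixed σ and the Laplace density taken in the EP(q = 1) parameterization assumed earlier; all sizes and values are illustrative:

import numpy as np

rng = np.random.default_rng(4)
n, p, alpha, sigma = 30, 8, 4.0, 0.5
X = rng.standard_normal((n, p))
eta = rng.gamma(shape=1.5, scale=2.0 / alpha, size=p)   # eta_j ~ G(eta | 3/2, alpha/2)
b = rng.laplace(loc=0.0, scale=2.0 * sigma * eta)       # b_j | eta_j ~ L(b_j | 0, sigma*eta_j): exp(-|b|/(2*sigma*eta))/(4*sigma*eta)
y = X @ b + np.sqrt(sigma) * rng.standard_normal(n)     # y ~ N(Xb, sigma * I_n)
print(np.round(b, 3))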

99 Iteratively reweighted least squares minimization for sparse recovery. [sent-1138, score-0.222]

100 Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions. [sent-1309, score-0.256]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gig', 0.606), ('reweighted', 0.219), ('egig', 0.213), ('em', 0.212), ('ep', 0.207), ('xb', 0.202), ('ethod', 0.198), ('priors', 0.157), ('ordan', 0.136), ('pplications', 0.136), ('bl', 0.121), ('riors', 0.113), ('lim', 0.11), ('laplace', 0.106), ('iu', 0.106), ('hang', 0.1), ('bayesian', 0.099), ('nonconvex', 0.096), ('penalty', 0.093), ('bridge', 0.092), ('gamma', 0.086), ('pl', 0.082), ('exp', 0.08), ('parse', 0.077), ('polson', 0.074), ('ig', 0.071), ('distributions', 0.066), ('zou', 0.066), ('hyperbolic', 0.066), ('mixtures', 0.065), ('uz', 0.064), ('inverse', 0.061), ('nir', 0.056), ('cos', 0.056), ('penalties', 0.056), ('exponential', 0.055), ('lasso', 0.055), ('adlasso', 0.055), ('ind', 0.055), ('armagan', 0.053), ('bessel', 0.053), ('mixture', 0.052), ('earning', 0.052), ('iteratively', 0.051), ('ic', 0.05), ('asso', 0.049), ('mse', 0.048), ('density', 0.046), ('scale', 0.046), ('tth', 0.045), ('daubechies', 0.045), ('wj', 0.044), ('wang', 0.043), ('grif', 0.042), ('eg', 0.042), ('generalized', 0.041), ('power', 0.04), ('std', 0.04), ('gaussian', 0.039), ('penalized', 0.039), ('gt', 0.038), ('prior', 0.038), ('sparse', 0.037), ('rgensen', 0.036), ('pdf', 0.035), ('concave', 0.034), ('appendix', 0.032), ('caron', 0.032), ('gdp', 0.032), ('jeffreys', 0.032), ('pyrim', 0.032), ('triazines', 0.032), ('mixing', 0.03), ('regarded', 0.029), ('sparsity', 0.028), ('relationship', 0.028), ('hierarchical', 0.028), ('cevher', 0.027), ('chartrand', 0.027), ('pareto', 0.027), ('grouped', 0.027), ('wl', 0.027), ('laplacian', 0.025), ('zk', 0.025), ('jn', 0.025), ('penalization', 0.025), ('posterior', 0.025), ('response', 0.025), ('argmin', 0.025), ('scad', 0.025), ('distribution', 0.024), ('calculates', 0.024), ('bn', 0.024), ('proposition', 0.024), ('mazumder', 0.023), ('theorem', 0.022), ('log', 0.022), ('regression', 0.022), ('archambeau', 0.021), ('epgig', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning

Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan

Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods

2 0.18245368 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

Author: Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan

Abstract: Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs can not directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as “C -learning,” and we present efficient coordinate descent algorithms for the training of regularized C -learning models. Keywords: large-margin classifiers, hinge functions, logistic functions, coherence functions, C learning

3 0.093859904 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

Author: Jian Huang, Cun-Hui Zhang

Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity

4 0.072673351 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

Author: Lan Xue, Annie Qu

Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty

5 0.061100017 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

Author: Hannes Nickisch

Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class

6 0.060700893 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

7 0.055048026 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization

8 0.052285619 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

9 0.049548291 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

10 0.046052672 91 jmlr-2012-Plug-in Approach to Active Learning

11 0.044329558 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

12 0.041612454 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

13 0.040441934 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

14 0.038548499 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

15 0.036117978 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

16 0.035255168 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

17 0.032923635 3 jmlr-2012-A Geometric Approach to Sample Compression

18 0.031997923 4 jmlr-2012-A Kernel Two-Sample Test

19 0.031347867 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

20 0.030819992 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.174), (1, 0.105), (2, -0.056), (3, -0.01), (4, 0.033), (5, 0.079), (6, -0.028), (7, -0.081), (8, -0.156), (9, 0.039), (10, 0.1), (11, -0.067), (12, 0.026), (13, -0.037), (14, 0.286), (15, 0.055), (16, -0.002), (17, 0.212), (18, -0.128), (19, 0.001), (20, -0.072), (21, 0.052), (22, -0.212), (23, 0.247), (24, -0.239), (25, 0.098), (26, -0.116), (27, -0.067), (28, -0.119), (29, 0.042), (30, 0.139), (31, 0.097), (32, 0.073), (33, 0.144), (34, 0.121), (35, 0.033), (36, 0.129), (37, -0.079), (38, 0.004), (39, 0.072), (40, -0.047), (41, -0.116), (42, -0.017), (43, -0.006), (44, 0.057), (45, -0.0), (46, 0.115), (47, 0.027), (48, -0.045), (49, 0.003)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95183969 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning

Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan

Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods

2 0.70705295 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

Author: Zhihua Zhang, Dehua Liu, Guang Dai, Michael I. Jordan

Abstract: Support vector machines (SVMs) naturally embody sparseness due to their use of hinge loss functions. However, SVMs can not directly estimate conditional class probabilities. In this paper we propose and study a family of coherence functions, which are convex and differentiable, as surrogates of the hinge function. The coherence function is derived by using the maximum-entropy principle and is characterized by a temperature parameter. It bridges the hinge function and the logit function in logistic regression. The limit of the coherence function at zero temperature corresponds to the hinge function, and the limit of the minimizer of its expected error is the minimizer of the expected error of the hinge loss. We refer to the use of the coherence function in large-margin classification as “C -learning,” and we present efficient coordinate descent algorithms for the training of regularized C -learning models. Keywords: large-margin classifiers, hinge functions, logistic functions, coherence functions, C learning

3 0.33419138 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

Author: Jian Huang, Cun-Hui Zhang

Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity

4 0.33039144 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

Author: Lan Xue, Annie Qu

Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty

5 0.29051363 37 jmlr-2012-Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks

Author: Vikas C. Raykar, Shipeng Yu

Abstract: With the advent of crowdsourcing services it has become quite cheap and reasonably effective to get a data set labeled by multiple annotators in a short amount of time. Various methods have been proposed to estimate the consensus labels by correcting for the bias of annotators with different kinds of expertise. Since we do not have control over the quality of the annotators, very often the annotations can be dominated by spammers, defined as annotators who assign labels randomly without actually looking at the instance. Spammers can make the cost of acquiring labels very expensive and can potentially degrade the quality of the final consensus labels. In this paper we propose an empirical Bayesian algorithm called SpEM that iteratively eliminates the spammers and estimates the consensus labels based only on the good annotators. The algorithm is motivated by defining a spammer score that can be used to rank the annotators. Experiments on simulated and real data show that the proposed approach is better than (or as good as) the earlier approaches in terms of the accuracy and uses a significantly smaller number of annotators. Keywords: crowdsourcing, multiple annotators, ranking annotators, spammers

6 0.29051238 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

7 0.28120795 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

8 0.27517489 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization

9 0.25000921 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization

10 0.2283408 3 jmlr-2012-A Geometric Approach to Sample Compression

11 0.21697006 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

12 0.20882058 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

13 0.20115477 91 jmlr-2012-Plug-in Approach to Active Learning

14 0.20010033 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

15 0.19426876 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

16 0.17467454 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

17 0.16860873 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

18 0.16631047 12 jmlr-2012-Active Clustering of Biological Sequences

19 0.161293 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

20 0.15686832 4 jmlr-2012-A Kernel Two-Sample Test


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.01), (21, 0.054), (22, 0.055), (26, 0.036), (29, 0.034), (49, 0.452), (58, 0.013), (75, 0.046), (77, 0.019), (92, 0.08), (96, 0.084)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94264144 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression

Author: Chiwoo Park, Jianhua Z. Huang, Yu Ding

Abstract: This paper presents the Getting-started style documentation for the local and parallel computation toolbox for Gaussian process regression (GPLP), an open source software package written in Matlab (but also compatible with Octave). The working environment and the usage of the software package will be presented in this paper. Keywords: Gaussian process regression, domain decomposition method, partial independent conditional, bagging for Gaussian process, local probabilistic regression

same-paper 2 0.83358407 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning

Author: Zhihua Zhang, Shusen Wang, Dehua Liu, Michael I. Jordan

Abstract: In this paper we propose a novel framework for the construction of sparsity-inducing priors. In particular, we define such priors as a mixture of exponential power distributions with a generalized inverse Gaussian density (EP-GIG). EP-GIG is a variant of generalized hyperbolic distributions, and the special cases include Gaussian scale mixtures and Laplace scale mixtures. Furthermore, Laplace scale mixtures can subserve a Bayesian framework for sparse learning with nonconvex penalization. The densities of EP-GIG can be explicitly expressed. Moreover, the corresponding posterior distribution also follows a generalized inverse Gaussian distribution. We exploit these properties to develop EM algorithms for sparse empirical Bayesian learning. We also show that these algorithms bear an interesting resemblance to iteratively reweighted ℓ2 or ℓ1 methods. Finally, we present two extensions for grouped variable selection and logistic regression. Keywords: sparsity priors, scale mixtures of exponential power distributions, generalized inverse Gaussian distributions, expectation-maximization algorithms, iteratively reweighted minimization methods

3 0.73078537 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

Author: Marius Kloft, Gilles Blanchard

Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity

4 0.4052507 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun

Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification

5 0.38080987 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets

Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan

Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies

6 0.37184229 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

7 0.3694759 118 jmlr-2012-Variational Multinomial Logit Gaussian Process

8 0.36830115 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox

9 0.36521655 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

10 0.36229819 111 jmlr-2012-Structured Sparsity and Generalization

11 0.36032474 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

12 0.35469306 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

13 0.35282639 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models

14 0.35121107 72 jmlr-2012-Multi-Target Regression with Rule Ensembles

15 0.35002789 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

16 0.34705189 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

17 0.34674722 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

18 0.34625557 103 jmlr-2012-Sampling Methods for the Nyström Method

19 0.3455832 13 jmlr-2012-Active Learning via Perfect Selective Classification

20 0.34540385 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting