jmlr jmlr2013 jmlr2013-9 knowledge-graph by maker-knowledge-mining

9 jmlr-2013-A Widely Applicable Bayesian Information Criterion


Source: pdf

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if the true distribution is singular for or unrealizable by the statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. [sent-6, score-0.289]

2 Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. [sent-7, score-0.498]

3 Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. [sent-8, score-0.185]

4 In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. [sent-10, score-0.365]
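As a quick illustration of this definition, the following minimal Python sketch (function name is ours) turns draws from the tempered posterior into WBIC; it assumes the sampler and the evaluation of n·Ln(w) are supplied elsewhere.

import numpy as np

def wbic_from_samples(nLn_draws):
    """WBIC = E_beta[n * Ln(w)] at beta = 1/log(n).
    `nLn_draws` holds n*Ln(w) = -sum_i log p(X_i|w) evaluated at parameter
    draws w from the posterior tempered with beta = 1/log(n); WBIC is simply
    their average."""
    return float(np.mean(nLn_draws))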

5 We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if the true distribution is singular for or unrealizable by the statistical model. [sent-11, score-0.37]

6 Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. [sent-12, score-0.166]

7 Introduction A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. [sent-14, score-0.186]

8 Many statistical models and learning machines are not regular but singular, for example, artificial neural networks, normal mixtures, binomial mixtures, reduced rank regressions, Bayesian networks, and hidden Markov models. [sent-16, score-0.152]

9 The log loss function or the minus log likelihood function is defined by Ln(w) = −(1/n) ∑_{i=1}^n log p(Xi|w). (1) [sent-28, score-0.349]

10 Also the Bayes free energy F is defined by F = − log ∫ ∏_{i=1}^n p(Xi|w) ϕ(w) dw. [sent-29, score-0.213]
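A minimal sketch of Equation (1), assuming a user-supplied log-density log p(x|w); the Bayes free energy F itself has no closed form in general, which is the quantity WBIC is designed to approximate.

import numpy as np

def empirical_log_loss(log_p, X, w):
    """Ln(w) = -(1/n) sum_i log p(X_i|w), Equation (1).
    `log_p(x, w)` is a user-supplied log-density and X is an array of n
    training samples."""
    return -float(np.mean([log_p(x, w) for x in X]))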

11 If a statistical model is singular, then the posterior distribution is different from any normal distribution, hence the Bayes free energy cannot be approximated by BIC in general. [sent-34, score-0.237]

12 In algebraic geometry, it represents a relative property of singularities of a pair of algebraic varieties. [sent-37, score-0.313]

13 In statistical learning theory, it shows the asymptotic behaviors of the Bayes free energy and the generalization loss, which are determined by a pair of an optimal parameter set and a parameter set W . [sent-38, score-0.191]

14 If a triple of a true distribution, a statistical model, and a prior distribution is fixed, then there is an algebraic geometrical procedure which enables us to find an RLCT (Hironaka, 1964). [sent-39, score-0.229]

15 To study singular statistical models, new algebraic geometrical theory is constructed (Watanabe, 2009; Drton et al. [sent-42, score-0.28]

16 These results are very important because they indicate the quantitative difference of singular models from regular ones. [sent-46, score-0.148]

17 Secondly, in Theorem 4 we prove that, even if a statistical model is singular, WBIC = nLn(w0) + λ log n + Op(√(log n)). [sent-55, score-0.292]

18 In other words, WBIC has the same asymptotic behavior as the Bayes free energy even if a statistical model is singular. [sent-56, score-0.221]

19 And lastly, in Theorem 5 we prove that, if a statistical model is regular, then WBIC = nLn(ŵ) + (d/2) log n + Op(1), which shows that WBIC coincides with BIC in regular statistical models. [sent-57, score-0.289]

20 These results show that WBIC is a generalized version of BIC onto singular statistical models and that RLCTs can be estimated even if a true distribution is unknown. [sent-59, score-0.189]

21 The average log loss function L(w) and the entropy of the true distribution S are respectively defined by L(w) = − ∫ q(x) log p(x|w) dx and S = − ∫ q(x) log q(x) dx. [sent-70, score-0.362]

22 Note that such w0 is not unique in general, because the map w → p(x|w) is not one-to-one in general in singular statistical models. [sent-75, score-0.175]

23 In general, the set W0 = {w ∈ W ; p(x|w) = p0(x)} is not a set consisting of a single element but an analytic set or an algebraic set with singularities. [sent-78, score-0.152]

24 Let us define a log density ratio function, f(x, w) = log(p0(x)/p(x|w)), which is equivalent to p(x|w) = p0(x) exp(−f(x, w)). [sent-79, score-0.212]

25 The problem of statistical learning is characterized by the log density ratio function f (x, w). [sent-87, score-0.156]

26 The posterior expectation Eβ[ · ] is defined with the weight exp(−nβKn(w))ϕ(w)dw (Equations (9) and (10)). The main purpose of the present paper is to prove F = nLn(w0) + Eβ[nKn(w)] + Op(√(log n)) for β = 1/log n. [sent-89, score-0.212]

27 Note that the matrix J(w) is equal to the Hessian matrix of K(w) and that J(w0 ) is equal to the Fisher information matrix if the true distribution is realizable by a statistical model. [sent-95, score-0.198]

28 (2) The condition Equation (12) is satisfied if a true distribution is realizable by or regular for a statistical model (Watanabe, 2010a). [sent-118, score-0.281]

29 If a true distribution is unrealizable by and singular for a statistical model, there is an example which does not satisfy this condition. [sent-119, score-0.248]

30 To study integrals in the region Wε, we need an algebraic geometrical method, because the set {w; K(w) = 0} contains singularities in general. [sent-127, score-0.266]

31 Theorem 1 shows that any singularities can be made normal crossing by using an analytic function w = g(u). [sent-142, score-0.159]

32 In order to prove this theorem, we need the Hironaka resolution of singularities (Hironaka, 1964; Atiyah, 1970), which is the fundamental theorem in algebraic geometry. [sent-146, score-0.338]

33 For a given function K(w), there is an algebraic recursive algorithm which enables us to find a resolution map w = g(u). [sent-149, score-0.206]

34 M = ∪_{α∈A} Mα. The real log canonical threshold (RLCT) is defined by λ = min_{α∈A} min_{1≤j≤d} (hj + 1)/(2kj), (15) where we define 1/kj = ∞ if kj = 0. [sent-153, score-0.184]
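For concreteness, the following sketch computes λ (and, under our reading of the definition, the multiplicity m) from hypothetical normal-crossing exponent vectors (k, h) for each chart; the chart data themselves must come from a resolution of singularities. For instance, a single chart with k = (1, 1) and h = (0, 0) gives λ = 1/2 with multiplicity 2.

import math

def rlct(charts):
    """Real log canonical threshold from normal-crossing exponents, Eq. (15).
    `charts` is a list of (k, h) pairs, one per local chart alpha, where
    K(g(u)) = u^(2k) and the prior-Jacobian factor is |u^h|.  lambda is the
    minimum over alpha and j of (h_j + 1)/(2 k_j), with the ratio taken as
    +inf when k_j = 0."""
    ratios = [math.inf if kj == 0 else (hj + 1) / (2 * kj)
              for k, h in charts for kj, hj in zip(k, h)]
    return min(ratios)

def multiplicity(charts, lam, tol=1e-12):
    """Multiplicity m, read here as the largest number of indices j attaining
    lambda within a single chart (our reading; see Watanabe, 2009, for the
    precise definition)."""
    return max(sum(1 for kj, hj in zip(k, h)
                   if kj > 0 and abs((hj + 1) / (2 * kj) - lam) < tol)
               for k, h in charts)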

35 The concept of RLCT is well known in algebraic geometry and statistical learning theory. [sent-155, score-0.174]

36 In the following definition we introduce a parity of a statistical model. [sent-156, score-0.163]

37 If Q(K(w), ϕ(w)) = 1, then the parity of a statistical model is said to be odd, otherwise even. [sent-169, score-0.212]

38 Lemma 2 If a true distribution q(x) is realizable by a statistical model p(x|w), then the value Q(K(g(u)), ϕ(g(u))) is independent of a choice of a resolution map w = g(u). [sent-174, score-0.332]

39 Lemma 2 indicates that, if a true distribution is realizable by a statistical model, then Q(K(g(u)), ϕ(g(u))) is a birational invariant. [sent-176, score-0.257]

40 6 of a book (Watanabe, 2009), where a true distribution is realizable by a statistical model. [sent-181, score-0.241]

41 In a singular statistical model, compactness of the prior is necessary in general, because, if the parameter set is not compact, then it is not easy to mathematically treat the integration on a neighborhood of the point at infinity. [sent-183, score-0.145]

42 Lemma 3 Assume that the Fundamental Conditions (1)-(4) are satisfied and that a true distribution q(x) is regular for a statistical model p(x|w). [sent-196, score-0.177]

43 F = nLn (w0 ) + λ log n − (m − 1) log log n + Rn , where λ is a real log canonical threshold, m is its multiplicity, and {Rn } is a sequence of random variables which converges to a random variable in law, when n → ∞. [sent-201, score-0.462]

44 In the case when q(x) is realizable by and singular for p(x|w), the expectation value of F is given by algebraic analysis (Watanabe, 2001a). [sent-203, score-0.301]

45 In general, the optimal inverse temperature β∗ depends on a true distribution q(x), a statistical model p(x|w), a prior ϕ(w), and training samples. [sent-221, score-0.157]

46 Then there exists a random variable Un such that Eβ[nLn(w)] = nLn(w0) + (λ/β0) log n + Un √(λ log n/(2β0)) + Op(1), where λ is the real log canonical threshold and {Un} is a sequence of random variables which satisfies E[Un] = 0 and converges to a Gaussian random variable in law as n → ∞. [sent-226, score-0.375]

47 Moreover, if a true distribution q(x) is realizable by a statistical model p(x|w), then E[(Un )2 ] < 1. [sent-227, score-0.228]

48 Theorem 4 with β0 = 1 shows that WBIC = nLn(w0) + λ log n + Un √(λ log n/2) + Op(1), whose first two main terms are equal to those of F in Theorem 2. [sent-229, score-0.212]

49 Corollary 1 If the parity of a statistical model is odd, Q(K(w), ϕ(w)) = 1, then Un = 0. [sent-231, score-0.193]

50 Then β* = (1/log n) ( 1 + Un/√(2λ log n) + op(1/√(log n)) ). [sent-233, score-0.336]

51 Corollary 3 Let β1 = β01 / log n and β2 = β02 / log n, where β01 and β02 are positive constants. [sent-234, score-0.212]

52 WBIC can be understood as the generalized BIC onto singular statistical models, because it satisfies the following theorem. [sent-242, score-0.145]

53 Theorem 5 If a true distribution q(x) is regular for a statistical model p(x|w), then WBIC = nLn(ŵ) + (d/2) log n + op(1). [sent-243, score-0.283]

54 This theorem shows that the difference of WBIC and BIC is smaller than a constant order term, if a true distribution is regular for a statistical model. [sent-245, score-0.181]

55 If a true distribution is regular for and realizable by a statistical model, its average is asymptotically equal to d/2, where d is the dimension of the parameter space. [sent-250, score-0.27]

56 If a true distribution is singular for a statistical model, then it is sometimes much larger than d/2, because it is asymptotically equal to the maximum value of the Gaussian process. [sent-251, score-0.208]

57 Whether replacement of nLn(w0) by nLn(ŵ) is appropriate or not depends on the statistical model and its singularities (Drton, 2009). [sent-252, score-0.189]

58 …, u_{id}), wi = u_{ii}, wj = u_{ii} u_{ij} (j ≠ i), it follows that (ℓ1 u_{ii}^2/4)(1 + ∑_{j≠i} u_{ij}^2) ≤ (u, J(w*)u) ≤ (ℓ2 u_{ii}^2/2)(1 + ∑_{j≠i} u_{ij}^2), where u_{ij} = û_{ij} (j ≠ i) and u_{ii} = 1. [sent-278, score-0.171]

59 Consequently, ∫_{K(w)<ε} G(w)ϕ(w)dw = ∑_{α∈A} ∫_{[−1,1]^d} du G(g(u)) |u^h| bα(u). [sent-303, score-0.146]

60 We define a measure du* by du* = (1/(2^m (m − 1)!)) (∏_{j=1}^m δ(uj)) (∏_{j=m+1}^d (uj)^{μj}) du. [sent-320, score-0.146]

61 The following asymptotic expansion holds as t → +0: ∫_{[−1,1]^d} du δ(t − u^{2k}) |u|^h G(u^{2k}, u^k, u) = t^{λ−1} (−log t)^{m−1} ∑_{σ∈S(d)} ∫_{[0,1]^d} (…) du* + O(t^{λ−1} (−log t)^{m−2}), where du* is a measure defined by Equation (25). [sent-342, score-0.213]

62 6 of a book (Watanabe, 2009), if u ∈ [0, 1]^d, then δ(t − u^{2k}) |u|^h du = t^{λ−1} (−log t)^{m−1} du* + O(t^{λ−1} (−log t)^{m−2}). [sent-346, score-0.189]

63 By using a resolution map w = g(u), Y(t, Φ) = ∑_{α∈A} ∑_{σ∈S(d)} ∫_{[−1,1]^d} δ(t − u^{2k}) u^k |u|^h a(x, u) Φ(g(u)) bα(u) du. [sent-354, score-0.283]

64 By the assumption that a true distribution is realizable by a statistical model, Equation (24) shows that there exists x such that a(x, u) = 0 for u2k = 0. [sent-356, score-0.198]

65 On the support of du∗ , σu = (σa ua , σb ub ) = (0, σb ub ), consequently the main order term of Y (t, Φ) is determined by Φ(0, ub ). [sent-357, score-0.175]

66 An = ∑_{α∈A} ∫_{[−1,1]^d} du exp(−nβu^{2k} + β√n u^k ξn(u)) |u|^h bα(u) = ∑_{α∈A} ∫_{[−1,1]^d} du ∫_0^∞ dt δ(t − u^{2k}) |u|^h bα(u) exp(−nβu^{2k} + β√n u^k ξn(u)). [sent-372, score-0.44]

67 [Tail of the expansion of An, involving βΓ(λ + 1/2), (nβ)^λ, and ∫_M du* ξ*n(u).] Secondly, Bn can be calculated in the same way, Bn = ∑_{α∈A} ∫_{[−1,1]^d} du ∫_0^∞ dt δ(t − u^{2k}) |u|^h bα(u) (nu^{2k} − √n u^k ξn(u)) exp(−nβu^{2k} + β√n u^k ξn(u)). [sent-378, score-0.294]

68 By using Theorem 2 and Theorem 4, (λ log n)/T + Un √(λ log n/(2T)) − λ log n + Vn = 0, where Vn = Op(log log n). [sent-408, score-0.424]

69 Therefore, 1/√T = −Un/√(8λ log n) + √(1 + (Un)^2/(8λ log n) − Vn/(λ log n)). [sent-410, score-0.318]

70 Since Un = Op(1), 1/√T = 1 − Un/√(8λ log n) + op(1/√(λ log n)), resulting in β* log n = 1 + Un/√(2λ log n) + op(1/√(λ log n)), which completes Corollary 2. [sent-411, score-0.636]

71 Proof of Corollary 3 By using Theorem 4, Eβ1[nLn(w)] = nLn(w0) + λ/β1 + Op(√(log n)) and Eβ2[nLn(w)] = nLn(w0) + λ/β2 + Op(√(log n)). [sent-416, score-0.212]
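These two expansions suggest the estimator behind Corollary 3: subtracting them cancels nLn(w0), so λ can be estimated from tempered-posterior averages at two inverse temperatures. A hedged sketch (names ours):

import numpy as np

def rlct_estimate(nLn_beta1, nLn_beta2, beta1, beta2):
    """Estimate lambda from tempered-posterior averages at two inverse
    temperatures beta1 != beta2 (both of order 1/log n).  From the two
    expansions above, E_b1[nLn] - E_b2[nLn] = lambda*(1/beta1 - 1/beta2)
    up to O_p(sqrt(log n)), so the ratio below is a plug-in estimator.
    Inputs are arrays of n*Ln(w) over MCMC draws at each temperature."""
    diff = np.mean(nLn_beta1) - np.mean(nLn_beta2)
    return float(diff / (1.0 / beta1 - 1.0 / beta2))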

72 [Laplace-approximation display involving exp(−nβKn(ŵ)), (nβ)^{d/2}, and det(J(ŵ) + op(1))^{1/2}.] Here nKn(ŵ) = Op(1), because the true distribution is regular for a statistical model. [sent-476, score-0.216]

73 A reduced rank regression model is defined by p(x, y|w) = r(x) (2πσ^2)^{−N/2} exp(−‖y − BAx‖^2/(2σ^2)), where r(x) is a probability density function of x and σ^2 is the variance of an output. [sent-488, score-0.148]
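A small sketch for generating data from this model, assuming r(x) is standard normal and using M inputs, N outputs, and true rank H (notation borrowed from the experiment description below); all names are ours.

import numpy as np

def sample_reduced_rank(n, M, N, H, sigma=1.0, rng=None):
    """Draw n pairs (x, y) from p(x, y | A, B) = r(x) N(y; B A x, sigma^2 I),
    with A of size H x M and B of size N x H, so that B A has rank <= H.
    Taking r(x) to be standard normal is an assumption for this sketch."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.normal(size=(H, M))
    B = rng.normal(size=(N, H))
    x = rng.normal(size=(n, M))
    y = x @ A.T @ B.T + sigma * rng.normal(size=(n, N))
    return x, y, A, B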

74 The posterior distribution p(w|X1, …, Xn) ∝ exp(−βnLn(w) + log ϕ(w)), where β = 1/log n. [sent-501, score-0.212]
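A minimal random-walk Metropolis sketch of this tempered posterior and the resulting WBIC value; the proposal scale, chain length, and burn-in are illustrative choices, not the paper's settings, and the callables nLn and log_prior must be supplied by the user.

import numpy as np

def wbic_metropolis(nLn, log_prior, w0, n, n_iter=20000, step=0.05, rng=None):
    """Random-walk Metropolis on p_beta(w) proportional to
    exp(-beta * n * Ln(w) + log phi(w)) with beta = 1/log(n), returning
    WBIC = posterior average of n * Ln(w).  `nLn(w)` must return
    n*Ln(w) = -sum_i log p(X_i|w); `log_prior` is log phi(w)."""
    rng = np.random.default_rng() if rng is None else rng
    beta = 1.0 / np.log(n)
    w = np.atleast_1d(np.asarray(w0, dtype=float))
    cur = nLn(w)
    logpost = -beta * cur + log_prior(w)
    draws = []
    for t in range(n_iter):
        prop = w + step * rng.normal(size=w.shape)
        prop_nLn = nLn(prop)
        prop_logpost = -beta * prop_nLn + log_prior(prop)
        if np.log(rng.uniform()) < prop_logpost - logpost:
            w, cur, logpost = prop, prop_nLn, prop_logpost
        if t >= n_iter // 2:            # discard first half as burn-in
            draws.append(cur)
    return float(np.mean(draws))        # WBIC = E_beta[n Ln(w)]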

75 … (r = 1, …, R). The six statistical models H = 1, 2, 3, 4, 5, 6 were compared by the criterion WBIC = Eβ[nLn(w)], (β = 1/log n). [sent-512, score-0.156]

76 In the table, the average and the standard deviation of WBIC1, defined by WBIC1 = WBIC − nSn, for 100 independent sets of training samples are shown, where the empirical entropy of the true distribution, Sn = −(1/n) ∑_{i=1}^n log q(Xi), does not depend on a statistical model. [sent-514, score-0.2]

77 Also WBIC2 in Table 2 shows the average and the standard deviation of WBIC2 = nLn(ŵ) + λ log n − (m − 1) log log n − nSn, where ŵ is the maximum likelihood estimator, λ is the real log canonical threshold, and m is the multiplicity. [sent-518, score-0.493]

78 If we know the model that is equal to the true distribution, and if we have theoretical real log canonical threshold and multiplicity, then we can calculate WBIC2 . [sent-538, score-0.214]

79 The value BIC in Table 2 shows the average and the standard deviation of Schwarz BIC, BIC = nLn(ŵ) + (d/2) log n − nSn, where d is the essential dimension of the parameter space of the reduced rank regression, d = H(M + N − H). [sent-539, score-0.174]
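Putting the experimental quantities together, a hedged helper that assembles WBIC1, WBIC2, and BIC from their ingredients (function and argument names are ours); WBIC2 requires the theoretical λ and m, and Sn is the empirical entropy defined above.

import numpy as np

def table2_criteria(wbic, nLn_mle, n, d, Sn, lam=None, m=None):
    """Quantities reported in the experiments (names are ours):
       WBIC1 = WBIC - n*Sn,
       WBIC2 = n*Ln(w_hat) + lam*log n - (m-1)*log log n - n*Sn,
       BIC   = n*Ln(w_hat) + (d/2)*log n - n*Sn,
    with Sn = -(1/n) sum_i log q(X_i) and d = H*(M+N-H) for reduced rank
    regression."""
    out = {"WBIC1": wbic - n * Sn,
           "BIC": nLn_mle + 0.5 * d * np.log(n) - n * Sn}
    if lam is not None and m is not None:
        out["WBIC2"] = (nLn_mle + lam * np.log(n)
                        - (m - 1) * np.log(np.log(n)) - n * Sn)
    return out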

80 In the case where the multiplicity m = 2, the log log n term also affected the results. [sent-552, score-0.251]

81 In the present paper, we study the Bayes free energy F as the statistical model selection criterion. [sent-557, score-0.187]

82 Its expectation value is given by E[F] = nS + ∫ q(x^n) log(q(x^n)/p(x^n)) dx^n, where S is the entropy of the true distribution, q(x^n) = ∏_{i=1}^n q(xi), p(x^n) = ∫ ∏_{i=1}^n p(xi|w) ϕ(w) dw, and dx^n = dx1 dx2 ⋯ dxn. [sent-558, score-0.15]

83 There is a different model evaluation criterion, which is the generalization loss defined by G = − ∫ q(x) log p*(x) dx, (29) where p*(x) is the Bayes predictive distribution defined by p*(x) = Ew[p(x|w)], with β = 1. [sent-560, score-0.159]

84 …(30) If a true distribution is realizable by and regular for a statistical model, then E[AIC] = E[G] + o(1/n) and E[BIC] = E[F] + O(1). [sent-565, score-0.251]

85 We define WAIC and WBIC by WAIC = Tn + Vn/n and WBIC = Eβ[nLn(w)], β = 1/log n, where Tn = −(1/n) ∑_{i=1}^n log p*(Xi) and Vn = ∑_{i=1}^n { Ew[(log p(Xi|w))^2] − Ew[log p(Xi|w)]^2 }. [sent-567, score-0.212]
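A sketch of WAIC from these definitions, assuming an S × n matrix of pointwise log-likelihoods log p(Xi|ws) evaluated at draws ws from the ordinary (β = 1) posterior; the function name is ours.

import numpy as np

def waic(loglik):
    """WAIC = Tn + Vn/n from an (S x n) matrix of log p(X_i | w_s).
    Tn = -(1/n) sum_i log E_w[p(X_i|w)]  (log of the predictive density),
    Vn = sum_i Var_w[log p(X_i|w)]       (functional variance)."""
    loglik = np.asarray(loglik)
    S, n = loglik.shape
    # log predictive density per data point, computed stably via log-sum-exp
    lppd_i = np.logaddexp.reduce(loglik, axis=0) - np.log(S)
    Tn = -np.mean(lppd_i)
    Vn = np.sum(np.var(loglik, axis=0, ddof=0))
    return Tn + Vn / n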

86 Then, even if a true distribution is unrealizable by and singular for a statistical model, E[WAIC] = E[G] + O(1/n^2) (31) and E[WBIC] = E[F] + O(log log n) (32), where Equation (31) was proved in a book (Watanabe, 2009, 2010b), whereas Equation (32) has been proved in the present paper. [sent-568, score-0.483]

87 If a true distribution is realizable by and regular for a statistical model, WAIC and WBIC respectively coincide with AIC and BIC: WAIC = AIC + op(1/n), WBIC = BIC + op(1). [sent-571, score-0.287]

88 Then the Bayes free energy satisfies F = − ∑_{j=1}^J log E_w^{β_{j−1}}[exp(−n(βj − β_{j−1}) Ln(w))]. [sent-580, score-0.213]
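A sketch of this annealing identity, assuming a temperature ladder 0 = β0 < β1 < ⋯ < βJ = 1 (the ladder endpoints are our assumption, not stated in the extracted sentence) and draws of n·Ln(w) at each rung.

import numpy as np

def free_energy_annealing(nLn_draws_per_beta, betas):
    """Estimate F via
       F = - sum_j log E_w^{beta_{j-1}}[ exp(-n (beta_j - beta_{j-1}) Ln(w)) ].
    `betas` is the full ladder (beta_0, ..., beta_J); `nLn_draws_per_beta[j]`
    holds draws of n*Ln(w) from the posterior at inverse temperature beta_j,
    for j = 0, ..., J-1."""
    F = 0.0
    for j in range(1, len(betas)):
        delta = betas[j] - betas[j - 1]
        x = -delta * np.asarray(nLn_draws_per_beta[j - 1])
        # log-mean-exp over the draws for numerical stability
        F -= np.logaddexp.reduce(x) - np.log(len(x))
    return F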

89 (Table 4: Comparison of Several Methods; the Cost row reads Huge, Small, Small, Small.) Then F̂ = − log Ew[exp(−nLn(w) + H(w))] − log ∫ exp(−H(w)) ϕ(w) dw, where the last term is the free energy of H(w). [sent-589, score-0.319]

90 The effectiveness of a model selection method strongly depends on a statistical condition which is determined by a true distribution, a statistical model, a prior distribution, and a set of training samples. [sent-600, score-0.151]

91 In the present paper, we define the parity of a statistical model Q(K(w), ϕ(w)) and prove that it affects the asymptotic behavior of WBIC. [sent-610, score-0.222]

92 In this subsection we show three mathematical properties of the parity of a statistical model. [sent-611, score-0.163]

93 Firstly, the parity has a relation to the analytic continuation of K(w)^{1/2}. [sent-612, score-0.163]

94 Secondly, the parity has a relation to a statistical model with a restricted parameter set. [sent-616, score-0.193]

95 For example, a statistical model p(x|a) = (1/√(2π)) exp(−(x − a)^2/2) whose parameter set is given by {a ≥ 0} is equivalent to a statistical model p(x|b^2) with {b ∈ R}. [sent-617, score-0.16]

96 As is proven (Watanabe, 2001a), the relation − log ∫ exp(−nKn(w))ϕ(w)dw = − log ∫ exp(−nK(w))ϕ(w)dw + Op(1) holds independent of the parity of a statistical model. [sent-621, score-0.375]

97 On the other hand, if β = 1/log n, then Eβ[nKn(w)] = ∫ nK(w) exp(−nβK(w))ϕ(w)dw / ∫ exp(−nβK(w))ϕ(w)dw + Un √(log n) + Op(1). [sent-622, score-0.212]

98 Conclusion We proposed a widely applicable Bayesian information criterion (WBIC) which can be used even if a true distribution is unrealizable by and singular for a statistical model and proved that WBIC has the same asymptotic expansion as the Bayes free energy. [sent-626, score-0.428]

99 Also we developed a method of estimating real log canonical thresholds even if a true distribution is unknown. [sent-627, score-0.188]

100 Resolution of singularities of an algebraic variety over a field of characteristic zero. [sent-685, score-0.211]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wbic', 0.569), ('nln', 0.5), ('watanabe', 0.235), ('nkn', 0.189), ('du', 0.146), ('bic', 0.141), ('kn', 0.139), ('rlcts', 0.137), ('dw', 0.128), ('rlct', 0.127), ('parity', 0.113), ('singularities', 0.109), ('log', 0.106), ('realizable', 0.104), ('algebraic', 0.102), ('ew', 0.098), ('singular', 0.095), ('un', 0.087), ('waic', 0.084), ('bayes', 0.074), ('resolution', 0.074), ('exp', 0.069), ('uh', 0.067), ('jn', 0.065), ('birational', 0.059), ('nuk', 0.059), ('unrealizable', 0.059), ('energy', 0.055), ('regular', 0.053), ('free', 0.052), ('analytic', 0.05), ('statistical', 0.05), ('ub', 0.049), ('aoyagi', 0.049), ('nsn', 0.049), ('fn', 0.049), ('bn', 0.047), ('book', 0.043), ('aic', 0.042), ('equation', 0.04), ('quartet', 0.039), ('uii', 0.039), ('logt', 0.039), ('multiplicity', 0.039), ('canonical', 0.038), ('theorem', 0.034), ('asymptotic', 0.034), ('drton', 0.034), ('pw', 0.034), ('yamazaki', 0.034), ('temperature', 0.033), ('uk', 0.033), ('geometrical', 0.033), ('metropolis', 0.033), ('likelihood', 0.031), ('model', 0.03), ('map', 0.03), ('dt', 0.03), ('hironaka', 0.029), ('ln', 0.029), ('ua', 0.028), ('vn', 0.027), ('posterior', 0.027), ('rank', 0.027), ('schwarz', 0.026), ('odd', 0.025), ('proved', 0.025), ('rd', 0.024), ('firstly', 0.023), ('secondly', 0.023), ('dx', 0.023), ('distribution', 0.023), ('eigen', 0.023), ('ud', 0.023), ('coordinate', 0.022), ('integrals', 0.022), ('geometry', 0.022), ('reduced', 0.022), ('bayesian', 0.021), ('lemma', 0.021), ('hj', 0.021), ('true', 0.021), ('lastly', 0.02), ('criterion', 0.02), ('ewj', 0.02), ('foregoing', 0.02), ('kir', 0.02), ('koll', 0.02), ('mixtures', 0.02), ('rusakov', 0.02), ('sumio', 0.02), ('trails', 0.02), ('asymptotically', 0.019), ('said', 0.019), ('essential', 0.019), ('threshold', 0.019), ('fundamental', 0.019), ('applicable', 0.019), ('op', 0.018), ('ui', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if the true distribution is singular for or unrealizable by the statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion

2 0.098134018 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

Author: Robert Hable

Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN

3 0.0500631 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

4 0.049334012 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

Author: Manfred Opper, Ulrich Paquet, Ole Winther

Abstract: Expectation Propagation (EP) provides a framework for approximate inference. When the model under consideration is over a latent Gaussian field, with the approximation being Gaussian, we show how these approximations can systematically be corrected. A perturbative expansion is made of the exact but intractable correction, and can be applied to the model’s partition function and other moments of interest. The correction is expressed over the higher-order cumulants which are neglected by EP’s local matching of moments. Through the expansion, we see that EP is correct to first order. By considering higher orders, corrections of increasing polynomial complexity can be applied to the approximation. The second order provides a correction in quadratic time, which we apply to an array of Gaussian process and Ising models. The corrections generalize to arbitrarily complex approximating families, which we illustrate on tree-structured Ising model approximations. Furthermore, they provide a polynomial-time assessment of the approximation error. We also provide both theoretical and practical insights on the exactness of the EP solution. Keywords: expectation consistent inference, expectation propagation, perturbation correction, Wick expansions, Ising model, Gaussian process

5 0.04875996 90 jmlr-2013-Quasi-Newton Method: A New Direction

Author: Philipp Hennig, Martin Kiefel

Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes

6 0.048500795 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

7 0.048112474 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

8 0.041855957 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

9 0.041223943 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

10 0.040973727 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

11 0.040302623 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

12 0.039346881 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

13 0.038733743 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

14 0.033933051 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

15 0.031894438 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

16 0.030457526 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

17 0.030172719 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

18 0.030024201 114 jmlr-2013-The Rate of Convergence of AdaBoost

19 0.029422753 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

20 0.0291132 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.155), (1, -0.011), (2, 0.072), (3, 0.017), (4, 0.006), (5, -0.092), (6, 0.0), (7, 0.034), (8, 0.028), (9, 0.036), (10, -0.098), (11, -0.005), (12, 0.129), (13, 0.012), (14, -0.135), (15, 0.062), (16, -0.02), (17, 0.092), (18, 0.17), (19, -0.06), (20, 0.145), (21, 0.106), (22, -0.114), (23, -0.017), (24, -0.108), (25, -0.085), (26, 0.023), (27, -0.092), (28, 0.026), (29, -0.047), (30, -0.024), (31, 0.041), (32, -0.189), (33, -0.007), (34, -0.146), (35, 0.05), (36, 0.005), (37, 0.1), (38, -0.243), (39, -0.07), (40, 0.256), (41, 0.05), (42, 0.028), (43, -0.077), (44, 0.086), (45, 0.126), (46, -0.228), (47, 0.077), (48, -0.001), (49, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89481986 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion

2 0.53638864 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

Author: Robert Hable

Abstract: In supervised learning problems, global and local learning algorithms are used. In contrast to global learning algorithms, the prediction of a local learning algorithm in a testing point is only based on training data which are close to the testing point. Every global algorithm such as support vector machines (SVM) can be localized in the following way: in every testing point, the (global) learning algorithm is not applied to the whole training data but only to the k nearest neighbors (kNN) of the testing point. In case of support vector machines, the success of such mixtures of SVM and kNN (called SVM-KNN) has been shown in extensive simulation studies and also for real data sets but only little has been known on theoretical properties so far. In the present article, it is shown how a large class of regularized kernel methods (including SVM) can be localized in order to get a universally consistent learning algorithm. Keywords: machine learning, regularized kernel methods, localization, SVM, k-nearest neighbors, SVM-KNN

3 0.38554323 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

Author: Michael Chertkov, Adam B. Yedidia

Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows

4 0.32601726 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

5 0.31347731 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown

Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound

6 0.28866026 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

7 0.27191395 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

8 0.26305664 90 jmlr-2013-Quasi-Newton Method: A New Direction

9 0.26110265 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

10 0.25711292 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

11 0.25638855 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

12 0.2451387 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

13 0.23166882 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

14 0.23026182 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

15 0.21975857 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

16 0.20796321 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models

17 0.2072196 21 jmlr-2013-Classifier Selection using the Predicate Depth

18 0.20245086 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

19 0.19812825 65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

20 0.187425 15 jmlr-2013-Bayesian Canonical Correlation Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (5, 0.112), (6, 0.033), (10, 0.047), (20, 0.012), (23, 0.04), (24, 0.401), (68, 0.02), (70, 0.041), (75, 0.058), (85, 0.052), (87, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.63325453 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion

2 0.35488448 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.35033742 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

4 0.3488616 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari

Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all betweenclass posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation

5 0.34880263 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

6 0.34764647 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

7 0.34750041 21 jmlr-2013-Classifier Selection using the Predicate Depth

8 0.34535518 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

9 0.34516612 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

10 0.34448627 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

11 0.34434843 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

12 0.34350362 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

13 0.3424139 68 jmlr-2013-Machine Learning with Operational Costs

14 0.34188285 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

15 0.34158167 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

16 0.34069455 32 jmlr-2013-Differential Privacy for Functions and Functional Data

17 0.34042612 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

18 0.33990708 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

19 0.33989701 120 jmlr-2013-Variational Algorithms for Marginal MAP

20 0.33986178 73 jmlr-2013-Multicategory Large-Margin Unified Machines