jmlr jmlr2010 jmlr2010-16 knowledge-graph by maker-knowledge-mining

16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory


Source: pdf

Author: Sumio Watanabe

Abstract: In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion. Keywords: cross-validation, information criterion, singular learning machine, birational invariant

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. [sent-5, score-0.26]

2 In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. [sent-6, score-0.446]

3 In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. [sent-7, score-0.214]

4 First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. [sent-8, score-0.275]

5 Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. [sent-10, score-0.347]

6 Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. [sent-11, score-0.241]

7 We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion. [sent-12, score-0.263]

8 Keywords: cross-validation, information criterion, singular learning machine, birational invariant 1. [sent-13, score-0.304]

9 Therefore, singular learning theory is necessary in modern information science. [sent-18, score-0.173]

10 The statistical properties of singular models have remained unknown until recently, because analyzing a singular likelihood function had been difficult (Hartigan, 1985; Watanabe, 1995). [sent-19, score-0.415]

11 In singular statistical models, the maximum likelihood estimator does not satisfy asymptotic normality. [sent-20, score-0.303]

12 Consequently, AIC is not equal to the average generalization error (Hagiwara, 2002), and the Bayes information criterion (BIC) is not equal to the Bayes marginal likelihood (Watanabe, 2001a), even asymptotically. [sent-22, score-0.209]

13 In singular models, the maximum likelihood estimator often diverges, or even if it does not diverge, makes the generalization error very large. [sent-23, score-0.297]

14 Therefore, the maximum likelihood method is not appropriate for singular models. [sent-24, score-0.216]

15 In singular learning theory, a log likelihood function can be made into a common standard form, even if it contains singularities, by using the resolution theorem in algebraic geometry. [sent-29, score-0.374]

16 As a result, the asymptotic behavior of the posterior distribution is clarified, and the concepts of BIC and AIC can be generalized onto singular statistical models. [sent-30, score-0.337]

17 The asymptotic Bayes marginal likelihood was proven to be determined by the real log canonical threshold (Watanabe, 2001a), and the average Bayes generalization error was proven to be estimable by the widely applicable information criterion (Watanabe, 2009, 2010a,c). [sent-31, score-0.544]

18 By definition, the average of the cross-validation is equal to the average generalization error in both regular and singular models. [sent-33, score-0.367]

19 In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to AIC (Akaike, 1974) in the maximum likelihood method (Stone, 1977; Linhart, 1986; Browne, 2000). [sent-34, score-0.242]

20 However, the asymptotic behavior of the cross-validation in singular models has not been clarified. [sent-35, score-0.234]

21 In the present paper, in singular statistical models, we theoretically compare the Bayes cross-validation, the widely applicable information criterion, and the Bayes generalization error and prove two theorems. [sent-36, score-0.412]

22 First, we show that the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. [sent-37, score-0.275]

23 Second, we also show that the sum of the Bayes cross-validation error and the Bayes generalization error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. [sent-38, score-0.347]

24 Since the real log canonical threshold is a birational invariant of the statistical model, the relationship between the Bayes cross-validation and the Bayes generalization error is determined by the algebraic geometrical structure of the statistical model. [sent-40, score-0.534]

25 In Section 2, we introduce the framework of Bayes learning and explain singular learning theory. [sent-42, score-0.173]

26 In Section 5, we discuss the results of the present paper, and the differences among the cross-validation, the widely applicable information criterion, and the deviance information criterion are investigated theoretically and experimentally. [sent-45, score-0.294]

27 Bayes Learning Theory. In this section, we summarize Bayes learning theory for singular learning machines. [sent-48, score-0.173]

28 The Bayes generalization loss Bg L(n) and the Bayes training loss Bt L(n) are defined, respectively, as $B_g L(n) = -\mathbb{E}_X[\log p^*(X)]$ (3) and $B_t L(n) = -\frac{1}{n}\sum_{i=1}^{n}\log p^*(X_i)$ (4), where $\mathbb{E}_X[\,\cdot\,]$ gives the expectation value over X. [sent-66, score-0.195]
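
To make these definitions concrete, here is a minimal sketch (not from the paper) of how Bg L(n) and Bt L(n) can be approximated once posterior samples are available, assuming β = 1 and a user-supplied log-likelihood function log_lik(w, X); the true expectation over X in the generalization loss is replaced by an average over a held-out test set.

```python
# Minimal sketch, not from the paper: Monte Carlo approximation of the Bayes
# predictive density p*(x) = Ew[p(x|w)] and of the losses in Equations (3)-(4),
# assuming beta = 1 and that posterior samples w_1, ..., w_S are already
# available (e.g., from MCMC). log_lik(w, X) is an assumed user-supplied
# function returning the array (log p(x_1|w), ..., log p(x_n|w)).
import numpy as np

def log_predictive(log_lik, X, posterior_samples):
    # log p*(x_i) ~= log[(1/S) sum_s p(x_i|w_s)], computed stably via log-sum-exp.
    L = np.stack([log_lik(w, X) for w in posterior_samples])   # shape (S, n)
    return np.logaddexp.reduce(L, axis=0) - np.log(L.shape[0])

def bayes_training_loss(log_lik, X_train, posterior_samples):
    # Bt L(n) = -(1/n) sum_i log p*(X_i), Equation (4).
    return -np.mean(log_predictive(log_lik, X_train, posterior_samples))

def bayes_generalization_loss(log_lik, X_test, posterior_samples):
    # Bg L(n) = -E_X[log p*(X)], Equation (3), approximated on held-out data.
    return -np.mean(log_predictive(log_lik, X_test, posterior_samples))
```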

29 In regular statistical models, the asymptotic Bayes generalization error does not depend on 0 < β ≤ ∞, whereas in singular models it strongly depends on β. [sent-77, score-0.435]

30 Note that such w0 is not unique in general because the map w → p(x|w) is, in general, not a one-to-one map in singular learning machines. [sent-91, score-0.173]

31 Here, a set in $\mathbb{R}^d$ is said to be an analytic or algebraic set if and only if the set is equal to the set of all zero points of an analytic or algebraic function, respectively. [sent-95, score-0.296]

32 For simplicity of notation, the minimum log loss L0 and the empirical log loss Ln are defined, respectively, as $L_0 = -\mathbb{E}_X[\log p_0(X)]$ and $L_n = -\frac{1}{n}\sum_{i=1}^{n}\log p_0(X_i)$. [sent-96, score-0.208]

33 Otherwise, q(x) is said to be singular for p(x|w). [sent-105, score-0.198]

34 Bayes learning theory was investigated for a realizable and regular case (Schwarz, 1978; Levin et al.). [sent-106, score-0.199]

35 The WAIC was found for a realizable and singular case (Watanabe, 2001a, 2009, 2010a) and for an unrealizable and regular case (Watanabe, 2010b). [sent-108, score-0.498]

36 In addition, WAIC was generalized for an unrealizable and singular case (Watanabe, 2010d). [sent-109, score-0.299]

37 Remark 3 In ordinary learning problems, if the true distribution is regular for or realizable by a learning machine, then assumptions (1), (2), (3) and (4) are satisfied, and the results of the present paper hold. [sent-128, score-0.213]

38 If the true distribution is singular for and unrealizable by a learning machine, then assumption (4) is satisfied in some cases but not in other cases. [sent-129, score-0.313]

39 The investigation of cross-validation in singular learning machines requires singular learning theory. [sent-131, score-0.346]

40 (3) The expectation value of the Bayes generalization loss is asymptotically equal to the widely applicable information criterion, $E[B_g L(n)] = E[\mathrm{WAIC}(n)] + o(1/n)$ (15). [sent-139, score-0.308]

41 Proof. For the case in which q(x) is realizable by and singular for p(x|w), this lemma was proven in Watanabe (2010a) and Watanabe (2009). [sent-140, score-0.33]

42 For the case in which q(x) is regular for and unrealizable by p(x|w), this lemma was proven in Watanabe (2010b). [sent-145, score-0.272]

43 For the case in which q(x) is singular for and unrealizable by p(x|w), these results can be generalized under the condition that Equation (13) is satisfied (Watanabe, 2010d). [sent-146, score-0.299]

44 The log loss of p^{(i)}(x) when Xi is used as a testing sample is $-\log p^{(i)}(X_i) = -\log \mathbb{E}_w^{(i)}[p(X_i|w)]$. [sent-153, score-0.173]

45 On the other hand, both the cross-validation loss Cv L(n) and the widely applicable information criterion WAIC(n) can be calculated using only training samples. [sent-166, score-0.214]

46 Therefore, the cross-validation loss and the widely applicable information criterion can be used for model selection and hyperparameter optimization. [sent-167, score-0.218]
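
As a rough illustration of this point, the following sketch (assuming β = 1, the same hypothetical log_lik(w, X) as above, and posterior samples from any sampler) computes both quantities from the training set alone: the leave-one-out predictive is evaluated with the standard importance-sampling identity p^{(i)}(Xi) = 1/Ew[1/p(Xi|w)], and WAIC is the Bayes training loss plus the functional variance divided by n.

```python
# Minimal sketch, assuming beta = 1, the hypothetical log_lik(w, X) above, and
# posterior samples from any sampler. Both criteria use only the training data.
import numpy as np

def cv_and_waic(log_lik, X_train, posterior_samples):
    L = np.stack([log_lik(w, X_train) for w in posterior_samples])  # (S, n)
    S, n = L.shape

    # Bayes cross-validation loss: p^(i)(X_i) = 1 / Ew[1 / p(X_i|w)]
    # (importance-sampling identity for leave-one-out when beta = 1), hence
    # Cv L(n) = (1/n) sum_i log Ew[exp(-log p(X_i|w))].
    cv_loss = np.mean(np.logaddexp.reduce(-L, axis=0) - np.log(S))

    # WAIC(n) = Bt L(n) + V(n)/n with functional variance
    # V(n) = sum_i Var_w[log p(X_i|w)].
    bt_loss = -np.mean(np.logaddexp.reduce(L, axis=0) - np.log(S))
    functional_variance = np.sum(np.var(L, axis=0))
    waic = bt_loss + functional_variance / n
    return cv_loss, waic
```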

47 Second, we prove that both the cross-validation loss and the widely applicable information criterion can be represented by the functional cumulants. [sent-172, score-0.217]

48 Finally, we prove that the cross-validation loss and the widely applicable information criterion are related to the birational invariants. [sent-173, score-0.301]

49 Then, $g_i(0) = 1$, $g_i^{(k)}(0) \equiv \frac{d^k g_i}{d\alpha^k}(0) = \ell_k(X_i)$ for $k = 1, 2, 3, 4$, and $F(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \log g_i(\alpha)$. [sent-184, score-0.262]

50 For an arbitrary natural number k, $\Big(\frac{g_i^{(k)}(\alpha)}{g_i(\alpha)}\Big)' = \frac{g_i^{(k+1)}(\alpha)}{g_i(\alpha)} - \frac{g_i^{(k)}(\alpha)}{g_i(\alpha)}\cdot\frac{g_i'(\alpha)}{g_i(\alpha)}$. [sent-185, score-0.378]

51 Bayes Cross-validation and Widely Applicable Information Criterion. We show the asymptotic equivalence of the cross-validation loss Cv L(n) and the widely applicable information criterion WAIC(n). [sent-199, score-0.257]

52 Theorem 1 For arbitrary 0 < β < ∞, the cross-validation loss Cv L(n) and the widely applicable information criterion WAIC(n) are given, respectively, as $C_v L(n) = -Y_1(n) + \frac{2\beta-1}{2}Y_2(n) - \frac{3\beta^2-3\beta+1}{6}Y_3(n) + O_p(1/n^2)$ and $\mathrm{WAIC}(n) = -Y_1(n) + \frac{2\beta-1}{2}Y_2(n) - \frac{1}{6}Y_3(n) + O_p(1/n^2)$. [sent-200, score-0.196]

53 Corollary 1 For arbitrary 0 < β < ∞, the cross-validation loss Cv L(n) and the widely applicable information criterion WAIC(n) satisfy $C_v L(n) = \mathrm{WAIC}(n) + O_p(1/n^2)$. [sent-214, score-0.196]

54 More precisely, the difference between the cross-validation loss and the widely applicable information criterion is given by $C_v L(n) - \mathrm{WAIC}(n) \sim \frac{\beta - \beta^2}{2}\,Y_3(n)$. [sent-216, score-0.196]
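
Corollary 1 can be checked numerically on a toy conjugate model (an assumed setup, not an experiment from the paper): because the posterior of a normal-mean model with a normal prior can be sampled exactly, any residual gap between Cv L(n) and WAIC(n) reflects the O_p(1/n²) term plus Monte Carlo noise rather than posterior approximation error. The sketch reuses cv_and_waic() from the previous snippet.

```python
# Toy illustration of Corollary 1 (assumed setup, not from the paper):
# a normal model N(mu, 1) with a N(0, 10^2) prior and beta = 1.
import numpy as np

rng = np.random.default_rng(0)
n, S = 200, 20000
X = rng.normal(loc=0.3, scale=1.0, size=n)

# Exact conjugate posterior for mu: N(m_post, v_post).
prior_var, lik_var = 10.0**2, 1.0
v_post = 1.0 / (1.0 / prior_var + n / lik_var)
m_post = v_post * X.sum() / lik_var
mu_samples = rng.normal(m_post, np.sqrt(v_post), size=S)

def log_lik(mu, X):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (X - mu) ** 2

cv_loss, waic = cv_and_waic(log_lik, X, mu_samples)
# The gap should be small: O(1/n^2) plus Monte Carlo error.
print(cv_loss, waic, cv_loss - waic)
```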

55 Generalization Error and Cross-validation Error. In the previous subsection, we have shown that the cross-validation loss is asymptotically equivalent to the widely applicable information criterion. [sent-219, score-0.228]

56 We need two mathematical concepts: the real log canonical threshold and the singular fluctuation. [sent-221, score-0.277]

57 Note that the real log canonical threshold does not depend on β, whereas the singular fluctuation is a function of β. [sent-228, score-0.303]

58 Both the real log canonical threshold and the singular fluctuation ν = ν(β) are birational invariants. [sent-229, score-0.425]
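
For reference, the real log canonical threshold is usually introduced through the zeta function of the average log density ratio; the display below is the standard textbook definition (Watanabe, 2009) rather than a quotation from this summary.

```latex
% Standard definition of the real log canonical threshold lambda (Watanabe, 2009).
% K(w) is the average log density ratio and varphi(w) is the prior density.
\[
  K(w) \;=\; \int q(x)\,\log\frac{q(x)}{p(x\mid w)}\,dx,
  \qquad
  \zeta(z) \;=\; \int K(w)^{z}\,\varphi(w)\,dw .
\]
% \zeta(z) is holomorphic for Re(z) > 0 and has a meromorphic continuation
% whose poles are negative rational numbers; the largest pole is z = -lambda,
% and this lambda is the real log canonical threshold. In the regular case
% lambda = d/2, which recovers the BIC-type penalty.
```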

59 Proof. For the case in which q(x) is realizable by and singular for p(x|w), Equations (31) and (32) were proven in Corollary 3 of Watanabe (2010a). [sent-233, score-0.312]

60 For the case in which q(x) is singular for and unrealizable by p(x|w) they were generalized in Watanabe (2010d). [sent-236, score-0.299]

61 Examples. If q(x) is regular for and realizable by p(x|w), then λ = ν = d/2, where d is the dimension of the parameter space. [sent-239, score-0.199]

62 If q(x) is regular for and unrealizable by p(x|w), then λ and ν are given by Watanabe (2010b). [sent-240, score-0.22]

63 If q(x) is singular for and realizable by p(x|w), then λ for several models are obtained by resolution of singularities (Aoyagi and Watanabe, 2005; Rusakov and Geiger, 2005; Yamazaki and Watanabe, 2003; Lin, 2010; Zwiernik, 2010). [sent-241, score-0.352]

64 If q(x) is singular for and unrealizable by p(x|w), then λ and ν remain unknown constants. [sent-242, score-0.299]

65 This theorem indicates that both the cross-validation error and the Bayes generalization error are determined by the algebraic geometrical structure of the statistical model, which is extracted as the real log canonical threshold. [sent-249, score-0.387]

66 Note that a regular statistical model is a special example of singular models, hence both Theorems 1 and 2 also hold in regular statistical models. [sent-252, score-0.413]

67 If a true distribution is regular for and realizable by the statistical model, then the variance of Bt (n) is asymptotically equal to that of Bg (n). [sent-259, score-0.337]

68 From Regular to Singular. First, we summarize the regular and singular learning theories. [sent-264, score-0.267]

69 In regular statistical models, the generalization loss of the maximum likelihood method is asymptotically equal to that of the Bayes estimation. [sent-265, score-0.357]

70 On the other hand, in singular learning machines, the generalization loss of the maximum likelihood method is larger than the Bayes generalization loss. [sent-268, score-0.373]

71 Since the generalization loss of the maximum likelihood method is determined by the maximum value of the Gaussian process, the maximum likelihood method is not appropriate in singular models (Watanabe, 2009). [sent-269, score-0.355]

72 In Bayes estimation, we derived the asymptotic expansion of the generalization loss and proved that the average of the widely applicable information criterion is asymptotically equal to the Bayes generalization loss (Watanabe, 2010a). [sent-270, score-0.512]

73 It was proven (Watanabe, 2001a) that the Bayes marginal likelihood of a singular model is different from BIC of a regular model. [sent-272, score-0.344]

74 In the future, we intend to compare the cross-validation and Bayes marginal likelihood in model selection and hyperparameter optimization in singular statistical models. [sent-273, score-0.264]

75 In Theorem 1, we theoretically proved that the leave-one-out cross-validation is asymptotically equivalent to the widely applicable information criterion. [sent-276, score-0.211]

76 However, if the posterior distribution was not precisely approximated, then the cross-validation might not be equivalent to the widely applicable information criterion. [sent-279, score-0.191]

77 If the posterior distribution is completely realized, then CV1 and CV2 coincide with each other and are asymptotically equivalent to the widely applicable information criterion. [sent-289, score-0.27]

78 Moreover, in singular learning machines, the set of true parameters is not a single point but rather an analytic set, hence we must restrict the parameter space to be compact for well-defined average values. [sent-292, score-0.221]

79 In a singular learning machine, since the set of optimal parameters is an analytic set, the correlation between different true parameters does not vanish, even asymptotically. [sent-303, score-0.235]

80 If the true distribution is realizable by or regular for the statistical model and if β = 1, then the asymptotic behavior of DIC2 is given by $E[\mathrm{DIC}_2] = L_0 + \frac{1}{n}\,(3\lambda - 2\nu(1) + 2\nu'(1)) + o(1/n)$ (35), where ν′(1) = (dν/dβ)(1). [sent-308, score-0.3]

81 If the true distribution is regular for and realizable by the statistical model and if β = 1, then λ = ν = d/2, ν′ (1) = 0, where d is the number of parameters. [sent-311, score-0.239]

82 Thus, their averages are asymptotically equal to the Bayes generalization error: $E[\mathrm{DIC}_1] = L_0 + \frac{d}{2n} + o(1/n)$ and $E[\mathrm{DIC}_2] = L_0 + \frac{d}{2n} + o(1/n)$. [sent-312, score-0.185]

83 If the true distribution is regular for and unrealizable by the statistical model and if β = 1, then λ = d/2, ν = (1/2)$\,\mathrm{tr}(IJ^{-1})$, and ν′(1) = 0 (Watanabe, 2010b), where I is the Fisher information matrix at w0, and J is the Hessian matrix of L(w) at w = w0. [sent-314, score-0.26]

84 In this case, as shown in Lemma 3, the Bayes generalization error is given by L0 + d/(2n) asymptotically, and so the averages of the deviance information criteria are not equal to the average of the Bayes generalization error. [sent-316, score-0.321]

85 If the true distribution is singular for and realizable by the statistical model and if β = 1, then $E[\mathrm{DIC}_1] = C + o(1)$ (36) and $E[\mathrm{DIC}_2] = L_0 + \frac{1}{n}\,(3\lambda - 2\nu(1) + 2\nu'(1)) + o(1/n)$, where C (C ≠ L0) is, in general, a constant. [sent-317, score-0.318]

86 Equation (36) is obtained because the set of true parameters in a singular model is not a single point, but rather an analytic set, so that, in general, the average Ew [w] is not contained in the neighborhood of the set of the true parameters. [sent-318, score-0.221]

87 Hence the averages of the deviance information criteria are not equal to those of the Bayes generalization error. [sent-319, score-0.24]

88 The averages of the cross-validation loss and WAIC have the same asymptotic behavior as that of the Bayes generalization error, even if the true distribution is unrealizable by or singular for the statistical model. [sent-320, score-0.522]

89 Therefore, the deviance information criteria are different from the cross-validation and WAIC, if the true distribution is singular for or unrealizable by the statistical model. [sent-321, score-0.473]
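
To make the contrast concrete, here is a hedged sketch of two common DIC variants on the per-sample loss scale (the plug-in form of Spiegelhalter et al. and a posterior-variance form); the paper's DIC1 and DIC2 are of this type, but their exact definitions and constants may differ, so this is only an assumed approximation for comparison with Cv L(n) and WAIC(n).

```python
# Hedged sketch of two common DIC variants on the per-sample loss scale, for
# comparison with Cv L(n) and WAIC(n); these follow the usual Spiegelhalter
# et al. (2002) and posterior-variance conventions, and the paper's DIC1/DIC2
# may use different normalizations. Reuses the hypothetical log_lik(w, X).
import numpy as np

def dic_variants(log_lik, X_train, posterior_samples):
    W = np.stack(list(posterior_samples))                            # (S, d) or (S,)
    L = np.stack([log_lik(w, X_train) for w in posterior_samples])   # (S, n)
    n = L.shape[1]

    mean_loss = -np.mean(L)                                   # Ew[-(1/n) sum_i log p(X_i|w)]
    plugin_loss = -np.mean(log_lik(W.mean(axis=0), X_train))  # loss at the posterior mean Ew[w]

    # Plug-in variant: effective dimension measured by the gap between the
    # posterior-averaged loss and the plug-in loss.
    dic_plugin = 2.0 * mean_loss - plugin_loss
    # Variance variant: effective dimension measured by the posterior variance
    # of the total log likelihood.
    dic_variance = mean_loss + np.var(L.sum(axis=1)) / n
    return dic_plugin, dic_variance
```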

90 In this model, although the Bayes generalization error is not equal to the average square error $SE(n) = \frac{1}{2\sigma^2}\,\mathbb{E}\,\mathbb{E}_X\big[\big(R_H(X, w_0) - E_w[R_H(X, w)]\big)^2\big]$, asymptotically SE(n) and Bg(n) are equal to each other (Watanabe, 2009). [sent-340, score-0.218]

91 The real log canonical threshold, the singular fluctuation, and its derivative in this case were estimated as λ ≈ 5. [sent-357, score-0.277]

92 Note that, if the true distribution is regular for and realizable by the statistical model, λ = ν(1) = d/2 = 9 and ν′ (1) = 0. [sent-361, score-0.239]

93 The averages of the two deviance information criteria were not equal to that of the Bayes generalization error. [sent-362, score-0.24]

94 A constant defined for a set of statistical models and a prior is said to be a birational invariant if it is invariant under such a transform w = g(u). [sent-422, score-0.208]

95 The real log canonical threshold λ is a birational invariant (Atiyah, 1970; Hironaka, 1964; Kashiwara, 1976; Kollár et al.). [sent-423, score-0.261]

96 Although the singular fluctuation is also a birational invariant, its properties remain unknown. [sent-425, score-0.278]

97 Hence, clarification of the algebraic geometrical structure in statistical estimation is an important problem in statistical learning theory. [sent-432, score-0.192]

98 In addition, we clarified that cross-validation and the widely applicable information criterion are different from the deviance information criteria. [sent-435, score-0.276]

99 This result indicates that, even in singular statistical models, the cross-validation is asymptotically equivalent to the information criterion, and that the asymptotic properties of these models are determined by the algebraic geometrical structure of a statistical model. [sent-436, score-0.505]

100 Equations of states in statistical learning for an unrealizable and regular case. [sent-660, score-0.246]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('watanabe', 0.46), ('waic', 0.388), ('cv', 0.373), ('ew', 0.34), ('bg', 0.316), ('bayes', 0.188), ('singular', 0.173), ('bt', 0.128), ('unrealizable', 0.126), ('deviance', 0.115), ('ingular', 0.115), ('realizable', 0.105), ('birational', 0.105), ('regular', 0.094), ('asymptotically', 0.079), ('algebraic', 0.078), ('ross', 0.076), ('dic', 0.063), ('posterior', 0.063), ('geometrical', 0.062), ('generalization', 0.061), ('asymptotic', 0.061), ('canonical', 0.058), ('xi', 0.057), ('applicable', 0.057), ('widely', 0.057), ('rh', 0.056), ('singularities', 0.056), ('gi', 0.054), ('uctuation', 0.052), ('analytic', 0.048), ('criterion', 0.047), ('yk', 0.047), ('ex', 0.047), ('log', 0.046), ('yamazaki', 0.045), ('likelihood', 0.043), ('cumulants', 0.042), ('equation', 0.04), ('aoyagi', 0.036), ('loss', 0.035), ('proven', 0.034), ('dw', 0.034), ('ln', 0.033), ('pw', 0.032), ('gt', 0.032), ('nbg', 0.031), ('clari', 0.031), ('aic', 0.027), ('gelfand', 0.027), ('rusakov', 0.027), ('invariant', 0.026), ('threshold', 0.026), ('statistical', 0.026), ('averages', 0.026), ('said', 0.025), ('hyperparameter', 0.022), ('carlo', 0.021), ('cumulant', 0.021), ('drton', 0.021), ('koll', 0.021), ('limsupn', 0.021), ('linhart', 0.021), ('sumio', 0.021), ('zwiernik', 0.021), ('functional', 0.021), ('error', 0.02), ('monte', 0.02), ('bayesian', 0.02), ('criteria', 0.019), ('mk', 0.019), ('equal', 0.019), ('ck', 0.019), ('theoretically', 0.018), ('resolution', 0.018), ('oxford', 0.018), ('carlin', 0.018), ('levin', 0.018), ('nagata', 0.018), ('lemma', 0.018), ('training', 0.018), ('akaike', 0.018), ('crossvalidation', 0.018), ('lim', 0.017), ('geometry', 0.017), ('theorem', 0.016), ('bic', 0.016), ('bh', 0.016), ('spiegelhalter', 0.016), ('ah', 0.016), ('gelman', 0.016), ('metropolis', 0.016), ('leaving', 0.016), ('ij', 0.016), ('arxiv', 0.015), ('pole', 0.015), ('raftery', 0.015), ('clarify', 0.015), ('distribution', 0.014), ('correlation', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

Author: Sumio Watanabe

Abstract: In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion. Keywords: cross-validation, information criterion, singular learning machine, birational invariant

2 0.13544841 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

Author: Miki Aoyagi

Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function

3 0.055667888 40 jmlr-2010-Fast and Scalable Local Kernel Machines

Author: Nicola Segata, Enrico Blanzieri

Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning

4 0.045172658 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley

Abstract: The principle of parsimony also known as “Ockham’s razor” has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length

5 0.036577173 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

6 0.035574317 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

7 0.034204122 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

8 0.034197643 25 jmlr-2010-Composite Binary Losses

9 0.033772014 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

10 0.033159364 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations

11 0.032771077 72 jmlr-2010-Matrix Completion from Noisy Entries

12 0.030611757 109 jmlr-2010-Stochastic Composite Likelihood

13 0.029404737 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

14 0.028793445 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

15 0.02878073 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

16 0.028664764 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

17 0.026533145 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

18 0.026510354 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

19 0.024986317 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

20 0.024717551 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.127), (1, -0.035), (2, 0.022), (3, -0.019), (4, -0.034), (5, 0.027), (6, 0.008), (7, -0.002), (8, 0.002), (9, -0.012), (10, -0.054), (11, 0.037), (12, -0.08), (13, -0.143), (14, 0.023), (15, -0.017), (16, -0.012), (17, -0.078), (18, -0.062), (19, 0.095), (20, 0.314), (21, 0.2), (22, -0.064), (23, -0.092), (24, -0.038), (25, -0.263), (26, -0.117), (27, -0.013), (28, -0.069), (29, 0.058), (30, 0.167), (31, -0.056), (32, -0.029), (33, 0.062), (34, -0.139), (35, 0.105), (36, 0.257), (37, -0.156), (38, -0.1), (39, 0.005), (40, -0.112), (41, -0.186), (42, 0.005), (43, -0.061), (44, -0.042), (45, -0.038), (46, 0.082), (47, 0.068), (48, -0.226), (49, -0.096)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95188606 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

Author: Sumio Watanabe

Abstract: In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion. Keywords: cross-validation, information criterion, singular learning machine, birational invariant

2 0.73165524 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

Author: Miki Aoyagi

Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function

3 0.22222915 109 jmlr-2010-Stochastic Composite Likelihood

Author: Joshua V. Dillon, Guy Lebanon

Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation

4 0.21946165 25 jmlr-2010-Composite Binary Losses

Author: Mark D. Reid, Robert C. Williamson

Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise

5 0.21650046 40 jmlr-2010-Fast and Scalable Local Kernel Machines

Author: Nicola Segata, Enrico Blanzieri

Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning

6 0.21636881 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations

7 0.178481 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes

8 0.17620918 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

9 0.17354542 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

10 0.16742505 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

11 0.16267864 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

12 0.16258267 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

13 0.1443983 72 jmlr-2010-Matrix Completion from Noisy Entries

14 0.1427523 84 jmlr-2010-On Spectral Learning

15 0.13524799 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

16 0.13420057 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

17 0.13356087 63 jmlr-2010-Learning Instance-Specific Predictive Models

18 0.13265951 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

19 0.13221452 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

20 0.13040365 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.403), (3, 0.044), (8, 0.016), (21, 0.012), (32, 0.056), (36, 0.058), (37, 0.04), (57, 0.018), (75, 0.092), (81, 0.014), (85, 0.106), (97, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.64606488 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

Author: Sumio Watanabe

Abstract: In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion. Keywords: cross-validation, information criterion, singular learning machine, birational invariant

2 0.55341154 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

Author: Alexander Ilin, Tapani Raiko

Abstract: Principal component analysis (PCA) is a classical data analysis technique that Ä?Ĺš nds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overÄ?Ĺš tting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overÄ?Ĺš tting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artiÄ?Ĺš cial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the NetÄ?Ĺš‚ix problem. Keywords: principal component analysis, missing values, overÄ?Ĺš tting, regularization, variational Bayes

3 0.36242199 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization ∗. Also at Canberra Research Laboratory, NICTA. c 2010 Choon Hui Teo, S.V. N. Vishwanthan, Alex J. Smola and Quoc V. Le. T EO , V ISHWANATHAN , S MOLA AND L E

4 0.36060888 14 jmlr-2010-Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes

Author: Antti Honkela, Tapani Raiko, Mikael Kuusela, Matti Tornio, Juha Karhunen

Abstract: Variational Bayesian (VB) methods are typically only applied to models in the conjugate-exponential family using the variational Bayesian expectation maximisation (VB EM) algorithm or one of its variants. In this paper we present an efficient algorithm for applying VB to more general models. The method is based on specifying the functional form of the approximation, such as multivariate Gaussian. The parameters of the approximation are optimised using a conjugate gradient algorithm that utilises the Riemannian geometry of the space of the approximations. This leads to a very efficient algorithm for suitably structured approximations. It is shown empirically that the proposed method is comparable or superior in efficiency to the VB EM in a case where both are applicable. We also apply the algorithm to learning a nonlinear state-space model and a nonlinear factor analysis model for which the VB EM is not applicable. For these models, the proposed algorithm outperforms alternative gradient-based methods by a significant margin. Keywords: variational inference, approximate Riemannian conjugate gradient, fixed-form approximation, Gaussian approximation

5 0.35864395 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

Author: Miki Aoyagi

Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function

6 0.35244462 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation

7 0.34971112 102 jmlr-2010-Semi-Supervised Novelty Detection

8 0.34949362 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.34869704 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

10 0.34756339 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

11 0.34556592 22 jmlr-2010-Classification Using Geometric Level Sets

12 0.34411448 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

13 0.34144425 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

14 0.3405875 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

15 0.33920282 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

16 0.33911756 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

17 0.33800018 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

18 0.33609769 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

19 0.33430678 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

20 0.3325949 109 jmlr-2010-Stochastic Composite Likelihood