jmlr jmlr2012 jmlr2012-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
Reference: text
sentIndex sentText sentNum sentScore
1 One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. [sent-39, score-0.872]
2 Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. [sent-43, score-0.709]
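The problem the balanced accuracy addresses can be seen in a two-line example: on an imbalanced data set, a classifier that always predicts the majority class reaches high accuracy but only chance-level balanced accuracy. A minimal Python sketch with hypothetical counts:

```python
# Imbalanced data set: 10 positive trials, 90 negative trials (hypothetical counts).
tp, fn = 0, 10      # classifier misses every positive trial
tn, fp = 90, 0      # and gets every negative trial right (majority-class strategy)

accuracy = (tp + tn) / (tp + tn + fp + fn)    # 0.9: looks impressive
tpr = tp / (tp + fn)                          # sensitivity: 0.0
tnr = tn / (tn + fp)                          # specificity: 1.0
balanced_accuracy = 0.5 * (tpr + tnr)         # 0.5: exactly chance
```

The balanced accuracy is simply the mean of the two class-specific accuracies, so a degenerate classifier cannot inflate it by exploiting class imbalance.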
3 The typical question of interest for studies such as those described above is: What is the accuracy of the classifier in the general population from which the subjects were sampled? [sent-55, score-0.822]
4 Rather than treating classification outcomes obtained in different subjects as samples from the same distribution, a hierarchical setting requires us to account for the fact that each subject has itself been sampled from a heterogeneous population (Beckmann et al.). [sent-61, score-0.986]
5 , between-subjects variability) that results from the distribution of true accuracies in the population from which subjects were drawn. [sent-68, score-0.962]
6 This is addressed by inference on the mean classification accuracy in the population from which subjects were drawn. [sent-123, score-1.005]
7 In particular, we wish to predict how well a trial-wise classifier will perform ‘out of sample’, that is, on trials from an unseen subject drawn from the same population as the one underlying the presently studied group. [sent-128, score-0.739]
8 This is achieved by modelling subject-wise accuracies as drawn from a population distribution described by a Beta density, p(πj | α, β) = Beta(πj | α, β) = [Γ(α + β)/(Γ(α)Γ(β))] πj^(α−1) (1 − πj)^(β−1), (5) such that α and β characterize the population as a whole. [sent-197, score-1.201]
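As a minimal sketch (not the authors' implementation), the Beta population density of equation (5) and i.i.d. subject-wise draws from it can be written with the Python standard library; the parameter values are hypothetical:

```python
import math
import random

def beta_pdf(pi, alpha, beta):
    """Beta(alpha, beta) density at pi, via log-gamma for numerical stability."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1.0) * math.log(pi)
                    + (beta - 1.0) * math.log(1.0 - pi))

# Subject-specific accuracies pi_j drawn i.i.d. from the population distribution.
random.seed(0)
alpha, beta = 12.0, 8.0   # population mean alpha / (alpha + beta) = 0.6
subject_accuracies = [random.betavariate(alpha, beta) for _ in range(20)]
```

Here `alpha` and `beta` jointly encode the population mean accuracy and its dispersion, which is exactly what the hierarchical model will later infer from data.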
9 Formally, a particular subject’s πj is drawn from a population characterized by α and β: subject-specific accuracies are assumed to be i.i.d. [sent-200, score-0.724]
10 To describe our uncertainty about the population parameters, we use a diffuse prior on α and β which ensures that the posterior will be dominated by the data. [sent-204, score-0.864]
11 One option would be to assign uniform densities to both the prior expected accuracy α/(α + β) and the prior virtual sample size α + β, using logistic and logarithmic transformations to put each on a (−∞, ∞) scale; but this prior would lead to an improper posterior density (Gelman et al.). [sent-205, score-0.667]
12 Model Inversion: Inverting the beta-binomial model allows us to infer on (i) the posterior population mean accuracy, (ii) the subject-specific posterior accuracies, and (iii) the posterior predictive accuracy. [sent-219, score-1.535]
13 First, to obtain the posterior density over the population parameters α and β we need to evaluate p(α, β | k1:m) = p(k1:m | α, β) p(α, β) / ∫∫ p(k1:m | α, β) p(α, β) dα dβ (8) with k1:m := (k1, k2, …, km). [sent-222, score-0.835]
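Equation (8) has no closed form, but the posterior can be approximated with a random-walk Metropolis sampler. The sketch below is not the authors' implementation: the proper diffuse prior p(α, β) ∝ (α + β)^(−5/2), the synthetic data, and the tuning constants are all assumptions of this example.

```python
import math
import random

def log_post(a, b, ks, ns):
    """Unnormalized log-posterior of (alpha, beta): beta-binomial likelihood per
    subject plus the diffuse prior p(a, b) ∝ (a + b)^(-5/2) (an assumption here;
    binomial coefficients are constant in (a, b) and therefore dropped)."""
    lp = -2.5 * math.log(a + b)
    for k, n in zip(ks, ns):
        lp += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + math.lgamma(a + k) + math.lgamma(b + n - k) - math.lgamma(a + b + n))
    return lp

def metropolis(ks, ns, steps=4000, seed=1):
    """Random-walk Metropolis on (log alpha, log beta); returns post-burn-in samples."""
    rng = random.Random(seed)
    a, b = 1.0, 1.0
    samples = []
    for _ in range(steps):
        pa = math.exp(math.log(a) + rng.gauss(0.0, 0.2))
        pb = math.exp(math.log(b) + rng.gauss(0.0, 0.2))
        # log acceptance ratio; log(pa*pb/(a*b)) is the Jacobian of the log-scale walk
        log_r = (log_post(pa, pb, ks, ns) - log_post(a, b, ks, ns)
                 + math.log(pa * pb / (a * b)))
        if math.log(rng.random()) < log_r:
            a, b = pa, pb
        samples.append((a, b))
    return samples[steps // 2:]   # discard first half as burn-in

# Synthetic outcomes: 10 subjects, 40 trials each, ~70% correct on average.
ks = [30, 28, 27, 31, 26, 29, 30, 25, 32, 28]
ns = [40] * 10
post = metropolis(ks, ns)
mean_acc = sum(a / (a + b) for a, b in post) / len(post)
```

The resulting `post` list plays the role of the sample set referred to in the following sentences.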
14 This set allows us to obtain samples from the posterior population mean accuracy, p(α/(α+β) | k1:m). [sent-236, score-0.850]
15 We can use these samples in various ways, for example, to obtain a point estimate of the population mean accuracy using the posterior mean, (1/c) ∑τ=1..c α(τ)/(α(τ) + β(τ)). [sent-237, score-0.957]
16 We could also numerically evaluate the posterior probability that the mean classification accuracy in the population does not exceed chance, p = Pr(α/(α+β) ≤ 0.5 | k1:m). [sent-238, score-0.930]
17 Finally, we could compute the posterior probability that the mean accuracy in one population is greater than in another, p = Pr(α(2)/(α(2)+β(2)) > α(1)/(α(1)+β(1)) | k1:m(1), k1:m(2)). [sent-243, score-0.930]
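Given posterior samples of (α, β), these summaries reduce to a few lines; the sample list below is a hypothetical stand-in for an MCMC output:

```python
# Hypothetical posterior samples of (alpha, beta); in practice these come from
# an MCMC run over the beta-binomial model.
samples = [(11.2, 7.6), (12.8, 8.1), (10.5, 7.9), (13.1, 8.8), (11.9, 7.2)]

mean_accs = [a / (a + b) for a, b in samples]
point_estimate = sum(mean_accs) / len(mean_accs)   # posterior mean of alpha/(alpha+beta)
# Infraliminal probability: Pr(population mean accuracy <= 0.5 | data),
# approximated by the fraction of samples at or below chance.
p_infraliminal = sum(acc <= 0.5 for acc in mean_accs) / len(mean_accs)
```

Comparing two populations works the same way: pair up samples from two independent runs and count how often one ratio exceeds the other.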
18 Subjects with fewer trials will exert a smaller effect on the group and shrink more, while subjects with more trials will have a larger influence on the group and shrink less. [sent-253, score-0.716]
19 In this case, we are typically less interested in the average effect in the group and more in the effect that a new subject from the same population would display, as this estimate takes into account both the population mean and the population variance. [sent-256, score-1.619]
20 The expected performance is expressed by the posterior predictive density, p(π̃ | k1:m), in which π̃ denotes the classification accuracy in a new subject drawn from the same population as the existing group of subjects with latent accuracies π1, …, πm. [sent-257, score-1.570]
21 Samples for this density can easily be obtained using the samples α(τ) and β(τ) from the posterior population mean. [sent-262, score-0.739]
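A sketch of this composition step: for each posterior sample (α(τ), β(τ)), a single draw π̃ ∼ Beta(α(τ), β(τ)) is a sample from the posterior predictive density (the parameter samples below are hypothetical):

```python
import random

random.seed(2)
# Hypothetical posterior samples of (alpha, beta); each yields one predictive
# draw pi~ ~ Beta(alpha, beta): the accuracy expected in an unseen subject.
param_samples = [(11.2, 7.6), (12.8, 8.1), (10.5, 7.9), (13.1, 8.8)]
predictive = [random.betavariate(a, b) for a, b in param_samples]
```

Because each draw first varies the population parameters and then the subject, the predictive spread reflects both population-level and subject-level uncertainty.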
22 In the same way, we can obtain approximations to the posterior mean, the posterior mode, or a posterior probability interval. [sent-313, score-0.921]
23 For instance, we can obtain the posterior population parameters, p(α+, β+ | k+1:m) and p(α−, β− | k−1:m), using the same sampling procedure as summarized in Section 2. [sent-323, score-0.784]
24 The two sets of samples can then be averaged in a pairwise fashion to obtain samples from the posterior mean balanced accuracy in the population, p(φ | k+1:m, k−1:m), where we have defined φ := (1/2)(α+/(α+ + β+) + α−/(α− + β−)). [sent-325, score-0.662]
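The pairwise averaging can be sketched as follows; both sample lists are hypothetical stand-ins for the two class-specific posteriors:

```python
# Hypothetical posterior samples of (alpha+, beta+) and (alpha-, beta-) from the
# two class-specific chains; pairwise averaging yields samples of phi.
pos = [(14.0, 6.0), (13.2, 6.5), (15.1, 5.8)]
neg = [(5.5, 14.5), (6.1, 13.9), (5.0, 15.0)]

phi = [0.5 * (ap / (ap + bp) + an / (an + bn))
       for (ap, bp), (an, bn) in zip(pos, neg)]   # samples of the mean balanced accuracy
```

In this toy case the positive-class accuracy is high (~0.7) and the negative-class accuracy low (~0.3), so every φ sample lands near chance, illustrating how the balanced accuracy exposes a one-sided classifier.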
25 Similarly, we can average pairs of posterior samples from πj+ and πj− to obtain samples from the posterior densities of subject-specific balanced accuracies, p(φj | k+1:m, k−1:m). [sent-326, score-0.852]
26 Using the same idea, we can obtain samples from the posterior predictive density of the balanced accuracy that can be expected in a new subject from the same population, p(φ̃ | k+1:m, k−1:m). [sent-327, score-0.778]
27 In this case, an unbiased classifier yields high accuracies on either class in some subjects and lower accuracies in others, inducing a positive correlation between class-specific accuracies. [sent-334, score-0.732]
28 We therefore turn to an alternative model for mixed-effects inference on the balanced accuracy that embraces potential dependencies between class-specific accuracies (Figure 2b). [sent-339, score-0.68]
29 Instead, we use a bivariate population density whose covariance structure defines the form and extent of the dependency between π+ and π− . [sent-343, score-0.661]
30 Model Inversion: In contrast to the twofold beta-binomial model discussed in the previous section, the bivariate normal-binomial model makes it difficult to sample from the posterior densities over model parameters using a Metropolis implementation. [sent-381, score-0.674]
31 First, population parameter estimates can be obtained by sampling from the posterior density p(µ, Σ | k+1:m, k−1:m) using a Metropolis-Hastings approach. [sent-389, score-0.835]
32 Second, subject-specific accuracies are estimated by first sampling from p(ρj | k+1:m, k−1:m) and then applying a sigmoid transform to obtain samples from the posterior density over subject-specific balanced accuracies, p(φj | k+1:m, k−1:m). [sent-390, score-0.787]
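A sketch of the sigmoid step, using hypothetical logit-scale draws ρj = (ρ+, ρ−):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical posterior draws of rho_j = (rho+, rho-) on the logit scale; the
# sigmoid maps each component to a class-specific accuracy, and their average
# is a sample of the subject's balanced accuracy phi_j.
rho_draws = [(1.2, -0.3), (0.9, -0.1), (1.4, -0.5)]
phi_j = [0.5 * (sigmoid(rp) + sigmoid(rn)) for rp, rn in rho_draws]
```

Working on the logit scale lets the bivariate normal capture correlations between the two class-specific accuracies while the sigmoid keeps each accuracy in (0, 1).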
33 The best model can then be used for posterior inferences on the mean accuracy in the population or the predictive accuracy in a new subject from the same population. [sent-409, score-1.259]
34 Similarly, we can obtain the posterior predictive distribution of the balanced accuracy in a new subject from the same population, p(φ̃ | k+1:m, k−1:m) = ∑M p(φ̃ | k+1:m, k−1:m, M) p(M | k+1:m, k−1:m). [sent-414, score-0.700]
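Sampling from such a model-averaged predictive distribution works by composition: pick a model with probability p(M | k), then draw from that model's predictive samples. All numbers below are hypothetical:

```python
import random

random.seed(4)
# Hypothetical posterior model probabilities and per-model predictive samples
# for the twofold beta-binomial ("bb") and bivariate normal-binomial ("nb") models.
model_prob = {"bb": 0.7, "nb": 0.3}
predictive_samples = {"bb": [0.52, 0.55, 0.50], "nb": [0.48, 0.51, 0.53]}

# Composition sampling: pick a model with probability p(M | k), then a
# predictive sample from that model.
chosen = random.choices(list(model_prob), weights=list(model_prob.values()), k=1000)
averaged = [random.choice(predictive_samples[m]) for m in chosen]
```

The mixture weights come straight from Bayesian model selection, so no hard choice between the two models is ever forced.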
35 We then contrast inference on accuracies with inference on balanced accuracies (Section 3. [sent-436, score-0.937]
36 Their empirical sample accuracies are shown in Figure 4b, along with the ground-truth density of the population accuracy. [sent-452, score-0.801]
37 1 (Figure 4c), and examining the posterior distribution over the population mean accuracy showed that more than 99. [sent-454, score-0.93]
38 This is in contrast to the dispersion of the posterior over the population mean, which becomes more and more precise with an increasing amount of data. [sent-464, score-0.822]
39 [Figure 4 panel residue: FFX/RFX/MFX posterior and confidence intervals; (f) predictive inference; log(α/β); p(> 0.5); predictive accuracy.] [sent-487, score-0.959]
40 Figure 4: Inference on the population mean and the predictive accuracy. [sent-494, score-0.713]
41 (b) Empirical sample accuracies (blue) and their underlying population distribution (green). [sent-497, score-0.75]
42 (c) Inverting the beta-binomial model yields samples from the posterior distribution over the population parameters, visualized using a nonparametric (bivariate Gaussian kernel) density estimate (contour lines). [sent-498, score-0.889]
43 (d) The posterior about the population mean accuracy, plotted using a kernel density estimator (black), is sharply peaked around the true population mean (green). [sent-499, score-1.422]
44 (f) The posterior predictive distribution over ˜ π represents the posterior belief of the accuracy expected in a new subject (black). [sent-505, score-0.852]
45 [x-axis: true population mean.] Figure 5: Inference on the population mean under varying population heterogeneity. [sent-534, score-1.509]
46 The figure shows Bayesian estimates of the frequentist probability of above-chance classification performance, as a function of the true population mean, separately for three different levels of population heterogeneity (a,b,c). [sent-535, score-1.043]
47 By contrast, a fixed-effects approach (orange) offers invalid population inference as it disregards between-subjects variability; at a true population mean of 0. [sent-538, score-1.164]
48 Insets show the distribution of the true underlying population accuracy (green) for a population mean accuracy of 0. [sent-546, score-1.207]
49 We then varied the true population mean and plotted the fraction of decisions for an above-chance classifier as a function of population mean (Figure 5a). [sent-549, score-1.032]
50 Since the population variance was chosen to be very low in this initial simulation, the inferences afforded by a fixed-effects analysis (yellow) prove very similar as well; but this changes drastically when increasing the population variance to more realistic levels, as described below. [sent-556, score-1.078]
51 [Figure 6 panel residue: population mean accuracy; beta-binomial model vs. conventional FFX and RFX intervals.] Figure 6: Inadequate inferences provided by fixed-effects and random-effects models. [sent-567, score-0.818]
52 (b) The (mixed-effects) posterior density of the population mean (black) provides a good estimate of ground truth (green). [sent-570, score-0.94]
53 For example, given a fairly homogeneous population with a true population mean accuracy of 60% and a variance of 0. [sent-574, score-1.13]
54 The above simulations show that a fixed-effects analysis (yellow) becomes an invalid procedure to infer on the population mean when the population variance is non-negligible. [sent-581, score-1.049]
55 Classification outcomes were generated using the beta-binomial model with a population mean of 0. [sent-590, score-0.662]
56 The example shows that the proposed beta-binomial model yields a posterior density with the necessary asymmetry; it comfortably includes the true population mean (Figure 6b). [sent-594, score-0.901]
57 This simulation was based on 45 subjects overall; 40 subjects were characterized by a relatively moderate number of trials (n = 20) while 5 subjects had even fewer trials (n = 5). [sent-605, score-1.066]
58 The plot shows that, in each subject, the posterior mode (black) represents a compromise between the observed sample accuracy (blue) and the population mean (0. [sent-611, score-0.956]
59 Another way of demonstrating the shrinkage effect is by illustrating the transition from ground truth to sample accuracies (with its increase in dispersion) and from sample accuracies to posterior means (with its decrease in dispersion). [sent-615, score-0.963]
60 This shows how the high variability in sample accuracies is reduced, informed by what has been learned about the population (Figure 7b). [sent-616, score-0.75]
61 Here, subjects with only 5 trials were shrunk more than subjects with 20 trials. [sent-618, score-0.652]
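The direction of this effect follows from the conditional posterior mean of a subject's accuracy under a fixed Beta(α, β) population prior, (α + kj)/(α + β + nj): with identical 80% sample accuracies, a 5-trial subject is pulled further towards the population mean than a 20-trial subject. A sketch with hypothetical population parameters:

```python
# Conditional posterior mean of a subject's accuracy under a Beta(alpha, beta)
# population prior: (alpha + k_j) / (alpha + beta + n_j). The alpha, beta values
# are hypothetical; the trial counts mirror the simulation (20 vs. 5 trials).
alpha, beta = 12.0, 8.0            # implies a population mean of 0.6
subjects = [(16, 20), (4, 5)]      # (k_j correct, n_j total): both at 80% sample accuracy

shrunk = [(alpha + k) / (alpha + beta + n) for k, n in subjects]
# The 20-trial subject lands at 0.70, the 5-trial subject at 0.64:
# fewer trials means stronger shrinkage towards the population mean of 0.6.
```

The virtual sample size α + β acts like a pool of pseudo-trials at the population mean, so the fewer real trials a subject contributes, the more those pseudo-trials dominate.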
62 [Figure 7 panel residue: ground truth, sample accuracies, posterior interval (bb); subjects (sorted).] [sent-638, score-0.936]
63 [Panel residue: subjects with very few trials; (c) power curves; ground truth, sample accuracies, posterior mean (bb).] [sent-639, score-1.099]
64 (a) Classification outcomes were generated for a synthetic heterogeneous group of 45 subjects (40 subjects with 20 trials each, 5 subjects with 5 trials each). [sent-650, score-1.266]
65 (b) Another way of visualizing the shrinking effect is to contrast the increase in dispersion (as we move from ground truth to sample accuracies) with the decrease in dispersion (as we move from sample accuracies to posterior means) enforced by the hierarchical model. [sent-656, score-0.812]
66 Shrinking changes the order of subjects (when sorted by posterior mean as opposed to by sample accuracy) as the amount of shrinking depends on the subject-specific (first-level) posterior uncertainty. [sent-657, score-0.917]
67 (d) Across the same 1 000 simulations, a Bayes estimator, based on the posterior means of subject-specific accuracies (black), was superior to both a classical ML estimator (blue) and a James-Stein estimator (red). [sent-661, score-0.664]
68 An initial simulation specified a high population accuracy on the positive class and a low accuracy on the negative class, with equal variance in both (Figure 8a,b). [sent-675, score-0.721]
69 [Figure 8 panel residue: correct predictions on +ve and −ve trials per subject; (c) population-mean intervals.] [sent-708, score-0.637]
70 [Panel residue: TPR vs. TNR; accuracy (bb), balanced accuracy (nb), balanced accuracy (bb).] [sent-709, score-0.69]
71 [x-axis: population mean on positive trials.] Figure 8: Inference on the balanced accuracy. [sent-718, score-0.847]
72 The underlying true population distribution is represented by a bivariate Gaussian kernel density estimate (contour lines). [sent-722, score-0.661]
73 The plot shows that the population accuracy is high on positive trials and low on negative trials. [sent-723, score-0.76]
74 (c) Central 95% posterior probability intervals based on three models: the simple beta-binomial model for inference on the population accuracy; and the twofold beta-binomial model as well as the bivariate normal-binomial model for inference on the balanced accuracy. [sent-724, score-1.586]
75 The true mean balanced accuracy in the population is at chance (green). [sent-725, score-0.841]
76 To examine this dependence, we carried out a sensitivity analysis in which we considered the infraliminal probability of the posterior population mean as a function of prior moments (Figure 9). [sent-738, score-0.97]
77 We found that inferences were extremely robust, in the sense that the influence of the prior moments on the resulting posterior densities was negligible in relation to the variance resulting from the fact that we are using a (stochastic) approximate inference method for model inversion. [sent-739, score-0.65]
78 Similarly, varying µ0 , κ0 , or ν0 in the normal-binomial model had practically no influence on the infraliminal probability of the posterior balanced accuracy (Figure 9c,d,e). [sent-741, score-0.66]
79 Each graph shows the infraliminal probability of the population mean accuracy (i. [sent-769, score-0.687]
80 Inferences on the population balanced accuracy are based on the bivariate normal-binomial model. [sent-777, score-0.872]
81 To illustrate the generic applicability of our approach when its assumptions are not satisfied by construction, we applied models for mixed-effects inference to classification outcomes obtained on synthetic data features for a group of 20 subjects with 100 trials each (Figure 10). [sent-781, score-0.786]
82 The underlying population distribution is represented by a bivariate Gaussian kernel density estimate (contour lines). [sent-825, score-0.661]
83 This distribution was symmetric with regard to class-specific accuracies while these accuracies themselves were strongly positively correlated, as one would expect from a linear classifier tested on perfectly balanced data sets. [sent-833, score-0.649]
84 Central 95% posterior probability intervals about the population mean are shown in Figure 10c, along with a frequentist confidence interval of the population mean accuracy. [sent-838, score-1.527]
85 In stark contrast, using the single beta-binomial model or a conventional mean of sample accuracies to infer on the population accuracy (as opposed to balanced accuracy) resulted in estimates that were overly optimistic and therefore misleading. [sent-861, score-1.129]
86 Inverting the former model, which captures potential dependencies between class-specific accuracies, yields a posterior distribution over the population mean balanced accuracy (black) which shows that the classifier is performing above chance. [sent-884, score-1.085]
87 The plot contrasts sample accuracies (blue) with central 95% posterior probability intervals (black), which avoid overfitting by shrinking to the population mean. [sent-886, score-1.104]
88 Using the beta-binomial model for inference on the population mean balanced accuracy, we obtained very strong evidence (infraliminal probability p < 0. [sent-902, score-0.88]
89 The shrinkage effect in these subject-specific accuracies was rather small: the average absolute difference between sample accuracies and posterior means amounted to 1. [sent-905, score-0.871]
90 Specifically, the posterior distribution of the accuracy of one subject is partially influenced by the data from all other subjects, correctly weighted by their respective posterior precisions (see Section 3. [sent-923, score-0.833]
91 In this way, one would explicitly model task- or session-specific accuracies to be conditionally independent from one another given an overall subject-specific effect π j , and conditionally independent from other subjects given the population parameters. [sent-954, score-0.989]
92 4 we showed how alternative a priori assumptions about the population covariance of class-specific accuracies can be evaluated, relative to the priors of the models, using Bayesian model selection. [sent-969, score-0.784]
93 (2012), who carry out inference on the population mean accuracy by comparing two beta-binomial models: one with a population mean prior at 0. [sent-1037, score-1.332]
94 In order to assess whether the mean classification performance achieved in the population is above chance, we must evaluate our posterior knowledge about the population parameters α and β. [sent-1074, score-1.3]
95 E(α/(α+β) | k1:m) ≈ (1/c) ∑τ=1..c α(τ)/(α(τ) + β(τ)). Another informative measure is the posterior probability that the mean classification accuracy in the population does not exceed chance, p = Pr(α/(α+β) ≤ 0.5 | k1:m). [sent-1077, score-0.930]
96 In order to derive an expression for the posterior predictive distribution in closed form, one would need to integrate out the population parameters α and β, p(π̃ | k1:m) = ∫∫ p(π̃ | α, β) p(α, β | k1:m) dα dβ, which is analytically intractable. [sent-1095, score-0.829]
97 We then complete the first step by drawing µ(τ) ∼ N2(µm, Σ(τ)/κm), which we can use to obtain samples from the posterior mean balanced accuracy using φ(τ) := (1/2)(σ(µ1(τ)) + σ(µ2(τ))). [sent-1123, score-0.635]
98 Apart from using µ(τ) and Σ(τ) to obtain samples from the posterior distributions over ρj, we can further use the two vectors to draw samples from the posterior predictive distribution p(π̃ | k+1:m, k−1:m). [sent-1138, score-0.713]
99 For this we first draw ρ̃(τ) ∼ N2(µ(τ), Σ(τ)), and then obtain the desired sample using π̃(τ) = σ(ρ̃(τ)), from which we can obtain samples from the posterior predictive balanced accuracy using φ̃(τ) := (1/2)(π̃1(τ) + π̃2(τ)). [sent-1139, score-0.667]
100 Additionally, it is common practice to indicate the uncertainty about the population mean of the classification accuracy by reporting the 95% confidence interval π̄ ± t0.975 · σ̂m−1/√m. [sent-1160, score-0.706]
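For comparison, this conventional fixed-effects interval can be computed directly; the normal quantile below stands in for the t quantile (an approximation that is reasonable for larger m), and the sample accuracies are hypothetical:

```python
import math
import statistics

# Subject-wise sample accuracies (hypothetical). The normal quantile stands in
# for the t quantile, a reasonable approximation for larger m.
sample_accs = [0.58, 0.62, 0.55, 0.66, 0.60, 0.63, 0.57, 0.61]
m = len(sample_accs)
mean = statistics.fmean(sample_accs)
sd = statistics.stdev(sample_accs)              # sigma-hat, with m - 1 in the denominator
z = statistics.NormalDist().inv_cdf(0.975)      # ~1.96
ci = (mean - z * sd / math.sqrt(m), mean + z * sd / math.sqrt(m))
```

Unlike the hierarchical posterior intervals above, this interval ignores trial-level (within-subject) uncertainty, which is one of the paper's central criticisms of the conventional summary.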
simIndex simValue paperId paperTitle
same-paper 1 0.99999875 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
2 0.11824343 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efÄ?Ĺš cient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to conÄ?Ĺš gure neural networks and deep belief networks. Compared with neural networks conÄ?Ĺš gured by a pure grid search, we Ä?Ĺš nd that random search over the same domain is able to Ä?Ĺš nd models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search Ä?Ĺš nds better models by effectively searching a larger, less promising conÄ?Ĺš guration space. Compared with deep belief networks conÄ?Ĺš gured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional conÄ?Ĺš guration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for conÄ?Ĺš guring algorithms for new data sets. Our analysis casts some light on why recent “High Throughputâ€? methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
3 0.093944304 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
4 0.085465744 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
Author: Hannes Nickisch
Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class
5 0.064528838 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
6 0.064161249 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
7 0.061556138 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
8 0.052465536 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
9 0.050612941 4 jmlr-2012-A Kernel Two-Sample Test
10 0.04939758 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
11 0.048978709 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
12 0.047621015 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
13 0.047010776 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
14 0.045475379 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
15 0.044329558 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
16 0.042264633 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
17 0.042200316 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
18 0.041035861 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
19 0.040860213 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
20 0.039848045 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
topicId topicWeight
[(0, -0.181), (1, 0.083), (2, 0.13), (3, -0.056), (4, 0.117), (5, -0.014), (6, 0.104), (7, 0.164), (8, -0.113), (9, 0.095), (10, 0.026), (11, -0.05), (12, 0.099), (13, 0.096), (14, 0.07), (15, 0.01), (16, 0.17), (17, 0.185), (18, 0.125), (19, -0.077), (20, 0.083), (21, 0.091), (22, 0.017), (23, 0.121), (24, -0.161), (25, -0.088), (26, -0.171), (27, 0.06), (28, 0.043), (29, -0.087), (30, -0.065), (31, 0.014), (32, 0.058), (33, -0.093), (34, -0.087), (35, -0.115), (36, -0.185), (37, 0.076), (38, -0.054), (39, 0.016), (40, -0.03), (41, 0.177), (42, -0.013), (43, -0.058), (44, -0.074), (45, 0.03), (46, 0.078), (47, 0.02), (48, -0.042), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.9660092 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
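The balanced accuracy mentioned in the abstract is the mean of sensitivity and specificity. A minimal sketch (illustrative only; the paper's actual contribution is the hierarchical Bayesian treatment) shows how it avoids the inflated estimates that plain accuracy gives on imbalanced data:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# A majority-class classifier on a 90/10 imbalanced test set:
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.9
bal_acc = balanced_accuracy(y_true, y_pred)                           # 0.5
```

The trivial classifier scores 90% accuracy but only 50% balanced accuracy, i.e., chance level.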
2 0.57495844 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and η-prior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
3 0.56777579 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
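The paper's core argument, that a grid wastes trials on hyper-parameters that do not matter, can be sketched on a toy objective with low effective dimensionality (the objective and trial counts below are made up for illustration):

```python
import random

# Toy validation loss with low effective dimensionality: only x matters,
# y is irrelevant -- the situation the paper's Gaussian process analysis
# identifies on real data sets.
def objective(x, y):
    return (x - 0.7) ** 2

def grid_search(n_per_axis):
    pts = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return min(objective(x, y) for x in pts for y in pts)

def random_search(n_trials, rng):
    return min(objective(rng.random(), rng.random()) for _ in range(n_trials))

rng = random.Random(0)
best_grid = grid_search(3)         # 9 trials, but only 3 distinct x values
best_rand = random_search(9, rng)  # 9 trials, 9 distinct x values
```

With the same budget of 9 trials, the grid probes only 3 distinct values of the relevant coordinate, while random search probes 9, so it typically finds a lower loss.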
4 0.50183195 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
Author: Hannes Nickisch
Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class
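The "all computations as matrix-vector multiplications" design can be illustrated with a small NumPy analogue (glm-ie itself is Matlab/Octave code; the solver below is a generic conjugate-gradient sketch, not the toolbox's API): a penalised least squares estimate is computed while touching the design matrix only through matvecs.

```python
import numpy as np

# Penalised least squares  min_u ||X u - y||^2 + lam ||u||^2  solved by
# conjugate gradients that access X only through matrix-vector products,
# mirroring the matvec-based design described in the abstract.
def cg_solve(matvec, b, iters=100, tol=1e-10):
    """Conjugate gradients for A u = b, with A given only as a matvec."""
    u = np.zeros_like(b)
    r = b - matvec(u)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        u = u + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10)
lam = 1e-2
# Normal-equations operator A = X.T X + lam*I, never formed explicitly.
u_hat = cg_solve(lambda v: X.T @ (X @ v) + lam * v, X.T @ y)
```

Because the solver only ever calls the matvec, the same code works when `X` is replaced by a lazily evaluated or structured operator.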
5 0.39396575 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
6 0.37887076 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
7 0.36236772 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
8 0.35976017 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
9 0.33543101 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
10 0.31045777 4 jmlr-2012-A Kernel Two-Sample Test
11 0.30070493 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
12 0.29836637 82 jmlr-2012-On the Necessity of Irrelevant Variables
13 0.26535204 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
14 0.26487914 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
15 0.24981821 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
17 0.24645194 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
18 0.24094403 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
19 0.23085846 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
20 0.23044857 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
topicId topicWeight
[(7, 0.017), (21, 0.035), (23, 0.377), (26, 0.054), (29, 0.066), (35, 0.037), (49, 0.036), (56, 0.021), (57, 0.013), (69, 0.019), (75, 0.044), (77, 0.02), (79, 0.015), (81, 0.011), (92, 0.058), (96, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.74814165 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
2 0.36057317 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
3 0.35957974 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
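The key identity behind the distributed mini-batch method, that the gradient of the average loss over a batch equals the average of per-shard gradients, can be checked in a few lines (an illustrative sketch; the paper's algorithm additionally accounts for communication latency):

```python
import numpy as np

# The gradient of the mean loss over a batch equals the average of the
# gradients computed on disjoint equal-size shards, so workers can
# process shards in parallel and a single averaged update is applied.
def grad(w, X, y):
    # gradient of the mean squared error 0.5 * mean((X w - y)^2)
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 4))
y = rng.standard_normal(60)
w = rng.standard_normal(4)

g_full = grad(w, X, y)
shards = np.array_split(np.arange(60), 3)   # three equal "workers"
g_avg = np.mean([grad(w, X[s], y[s]) for s in shards], axis=0)
```

The two gradients agree exactly (up to floating-point rounding) because the shards are of equal size.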
4 0.35793114 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
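The observation B OOST M ETRIC builds on, that any positive semidefinite matrix is a nonnegative combination of trace-one rank-one matrices, is exhibited directly by the eigendecomposition (a sketch of the decomposition only, not of the boosting algorithm itself):

```python
import numpy as np

# Any PSD matrix is a nonnegative combination of trace-one rank-one
# matrices; the eigendecomposition exhibits one such combination:
# X = sum_i lam_i * v_i v_i^T with lam_i >= 0 and trace(v_i v_i^T) = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T                                  # a PSD matrix
lam, V = np.linalg.eigh(X)
parts = [l * np.outer(v, v) for l, v in zip(lam, V.T)]
X_rebuilt = sum(parts)
```

Each `np.outer(v, v)` has rank one and trace `v @ v = 1` since `eigh` returns unit-norm eigenvectors; the nonnegative eigenvalues are the combination weights.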
5 0.35384044 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nyström method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nyström method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nyström method, providing novel error bounds guaranteeing a better convergence rate of the ensemble Nyström method in comparison to the standard Nyström method. Keywords: low-rank approximation, nyström method, ensemble methods, large-scale learning
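The basic reconstruction the paper analyses can be sketched in a few lines (uniform sampling, one of the fixed schemes it compares; names here are illustrative): from m sampled columns C and their intersection block W, the Nyström method forms K ≈ C W⁺ Cᵀ, which is exact whenever the sample spans the range of a low-rank PSD matrix.

```python
import numpy as np

# Nystrom approximation: from m uniformly sampled columns C and the
# m x m intersection block W, reconstruct K ~= C W^+ C^T. When K is PSD
# with rank r and the sample spans its range, the reconstruction is exact.
def nystrom(K, idx):
    C = K[:, idx]              # n x m sampled columns
    W = K[np.ix_(idx, idx)]    # m x m intersection block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 5))
K = B @ B.T                    # a rank-5 PSD "kernel" matrix
idx = rng.choice(100, size=10, replace=False)
K_approx = nystrom(K, idx)     # exact here: 10 columns span the rank-5 range
```

On full-rank kernel matrices the reconstruction is only approximate, and its quality depends on the sampling scheme, which is exactly the question the paper studies.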
6 0.35364652 4 jmlr-2012-A Kernel Two-Sample Test
7 0.35296857 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
8 0.3527661 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
9 0.34935701 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
10 0.34806472 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
11 0.34802371 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
12 0.34751314 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
13 0.34745675 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
14 0.34700492 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
15 0.34637249 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
16 0.34623265 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
17 0.34622425 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
18 0.34609404 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
19 0.34564573 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
20 0.34546649 100 jmlr-2012-Robust Kernel Density Estimation