jmlr jmlr2012 jmlr2012-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
Reference: text
sentIndex sentText sentNum sentScore
1 One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. [sent-39, score-0.872]
2 Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. [sent-43, score-0.709]
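The problem the balanced accuracy addresses can be seen in a two-line example: on an imbalanced data set, a classifier that always predicts the majority class reaches high accuracy but only chance-level balanced accuracy. A minimal Python sketch with hypothetical counts:

```python
# Imbalanced data set: 10 positive trials, 90 negative trials (hypothetical counts).
tp, fn = 0, 10      # classifier misses every positive trial
tn, fp = 90, 0      # and gets every negative trial right (majority-class strategy)

accuracy = (tp + tn) / (tp + tn + fp + fn)    # 0.9: looks impressive
tpr = tp / (tp + fn)                          # sensitivity: 0.0
tnr = tn / (tn + fp)                          # specificity: 1.0
balanced_accuracy = 0.5 * (tpr + tnr)         # 0.5: exactly chance
```

The balanced accuracy is simply the mean of the two class-specific accuracies, so a degenerate classifier cannot inflate it by exploiting class imbalance.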
3 The typical question of interest for studies such as those described above is: What is the accuracy of the classifier in the general population from which the subjects were sampled? [sent-55, score-0.822]
4 Rather than treating classification outcomes obtained in different subjects as samples from the same distribution, a hierarchical setting requires us to account for the fact that each subject has itself been sampled from a heterogeneous population (Beckmann et al.). [sent-61, score-0.986]
5 , between-subjects variability) that results from the distribution of true accuracies in the population from which subjects were drawn. [sent-68, score-0.962]
6 This is addressed by inference on the mean classification accuracy in the population from which subjects were drawn. [sent-123, score-1.005]
7 In particular, we wish to predict how well a trial-wise classifier will perform ‘out of sample’, that is, on trials from an unseen subject drawn from the same population as the one underlying the presently studied group. [sent-128, score-0.739]
8 This is achieved by modelling subject-wise accuracies as drawn from a population distribution described by a Beta density, p(πj | α, β) = Beta(πj | α, β) = [Γ(α + β)/(Γ(α)Γ(β))] πj^(α−1) (1 − πj)^(β−1), (5) such that α and β characterize the population as a whole. [sent-197, score-1.201]
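As a minimal sketch (not the authors' implementation), the Beta population density of equation (5) and i.i.d. subject-wise draws from it can be written with the Python standard library; the parameter values are hypothetical:

```python
import math
import random

def beta_pdf(pi, alpha, beta):
    """Beta(alpha, beta) density at pi, via log-gamma for numerical stability."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm + (alpha - 1.0) * math.log(pi)
                    + (beta - 1.0) * math.log(1.0 - pi))

# Subject-specific accuracies pi_j drawn i.i.d. from the population distribution.
random.seed(0)
alpha, beta = 12.0, 8.0   # population mean alpha / (alpha + beta) = 0.6
subject_accuracies = [random.betavariate(alpha, beta) for _ in range(20)]
```

Here `alpha` and `beta` jointly encode the population mean accuracy and its dispersion, which is exactly what the hierarchical model will later infer from data.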
9 Formally, a particular subject’s πj is drawn from a population characterized by α and β: subject-specific accuracies are assumed to be i.i.d. [sent-200, score-0.724]
10 To describe our uncertainty about the population parameters, we use a diffuse prior on α and β which ensures that the posterior will be dominated by the data. [sent-204, score-0.864]
11 One option would be to assign uniform densities to both the prior expected accuracy α/(α + β) and the prior virtual sample size α + β, using logistic and logarithmic transformations to put each on a (−∞, ∞) scale; but this prior would lead to an improper posterior density (Gelman et al.). [sent-205, score-0.667]
12 Model Inversion: Inverting the beta-binomial model allows us to infer on (i) the posterior population mean accuracy, (ii) the subject-specific posterior accuracies, and (iii) the posterior predictive accuracy. [sent-219, score-1.535]
13 First, to obtain the posterior density over the population parameters α and β we need to evaluate p(α, β | k1:m) = p(k1:m | α, β) p(α, β) / ∫∫ p(k1:m | α, β) p(α, β) dα dβ (8) with k1:m := (k1, k2, …, km). [sent-222, score-0.835]
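Equation (8) has no closed form, but the posterior can be approximated with a random-walk Metropolis sampler. The sketch below is not the authors' implementation: the proper diffuse prior p(α, β) ∝ (α + β)^(−5/2), the synthetic data, and the tuning constants are all assumptions of this example.

```python
import math
import random

def log_post(a, b, ks, ns):
    """Unnormalized log-posterior of (alpha, beta): beta-binomial likelihood per
    subject plus the diffuse prior p(a, b) ∝ (a + b)^(-5/2) (an assumption here;
    binomial coefficients are constant in (a, b) and therefore dropped)."""
    lp = -2.5 * math.log(a + b)
    for k, n in zip(ks, ns):
        lp += (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
               + math.lgamma(a + k) + math.lgamma(b + n - k) - math.lgamma(a + b + n))
    return lp

def metropolis(ks, ns, steps=4000, seed=1):
    """Random-walk Metropolis on (log alpha, log beta); returns post-burn-in samples."""
    rng = random.Random(seed)
    a, b = 1.0, 1.0
    samples = []
    for _ in range(steps):
        pa = math.exp(math.log(a) + rng.gauss(0.0, 0.2))
        pb = math.exp(math.log(b) + rng.gauss(0.0, 0.2))
        # log acceptance ratio; log(pa*pb/(a*b)) is the Jacobian of the log-scale walk
        log_r = (log_post(pa, pb, ks, ns) - log_post(a, b, ks, ns)
                 + math.log(pa * pb / (a * b)))
        if math.log(rng.random()) < log_r:
            a, b = pa, pb
        samples.append((a, b))
    return samples[steps // 2:]   # discard first half as burn-in

# Synthetic outcomes: 10 subjects, 40 trials each, ~70% correct on average.
ks = [30, 28, 27, 31, 26, 29, 30, 25, 32, 28]
ns = [40] * 10
post = metropolis(ks, ns)
mean_acc = sum(a / (a + b) for a, b in post) / len(post)
```

The resulting `post` list plays the role of the sample set referred to in the following sentences.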
14 This set allows us to obtain samples from the posterior population mean accuracy, p(α/(α+β) | k1:m). [sent-236, score-0.850]
15 We can use these samples in various ways, for example, to obtain a point estimate of the population mean accuracy using the posterior mean, (1/c) ∑τ=1..c α(τ)/(α(τ) + β(τ)). [sent-237, score-0.957]
16 We could also numerically evaluate the posterior probability that the mean classification accuracy in the population does not exceed chance, p = Pr(α/(α+β) ≤ 0.5 | k1:m). [sent-238, score-0.930]
17 Finally, we could compute the posterior probability that the mean accuracy in one population is greater than in another, p = Pr(α(2)/(α(2)+β(2)) > α(1)/(α(1)+β(1)) | k1:m(1), k1:m(2)). [sent-243, score-0.930]
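Given posterior samples of (α, β), these summaries reduce to a few lines; the sample list below is a hypothetical stand-in for an MCMC output:

```python
# Hypothetical posterior samples of (alpha, beta); in practice these come from
# an MCMC run over the beta-binomial model.
samples = [(11.2, 7.6), (12.8, 8.1), (10.5, 7.9), (13.1, 8.8), (11.9, 7.2)]

mean_accs = [a / (a + b) for a, b in samples]
point_estimate = sum(mean_accs) / len(mean_accs)   # posterior mean of alpha/(alpha+beta)
# Infraliminal probability: Pr(population mean accuracy <= 0.5 | data),
# approximated by the fraction of samples at or below chance.
p_infraliminal = sum(acc <= 0.5 for acc in mean_accs) / len(mean_accs)
```

Comparing two populations works the same way: pair up samples from two independent runs and count how often one ratio exceeds the other.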
18 Subjects with fewer trials will exert a smaller effect on the group and shrink more, while subjects with more trials will have a larger influence on the group and shrink less. [sent-253, score-0.716]
19 In this case, we are typically less interested in the average effect in the group and more in the effect that a new subject from the same population would display, as this estimate takes into account both the population mean and the population variance. [sent-256, score-1.619]
20 The expected performance is expressed by the posterior predictive density, p(π̃ | k1:m), in which π̃ denotes the classification accuracy in a new subject drawn from the same population as the existing group of subjects with latent accuracies π1, …, πm. [sent-257, score-1.570]
21 Samples for this density can easily be obtained using the samples α(τ) and β(τ) from the posterior population mean. [sent-262, score-0.739]
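A sketch of this composition step: for each posterior sample (α(τ), β(τ)), a single draw π̃ ∼ Beta(α(τ), β(τ)) is a sample from the posterior predictive density (the parameter samples below are hypothetical):

```python
import random

random.seed(2)
# Hypothetical posterior samples of (alpha, beta); each yields one predictive
# draw pi~ ~ Beta(alpha, beta): the accuracy expected in an unseen subject.
param_samples = [(11.2, 7.6), (12.8, 8.1), (10.5, 7.9), (13.1, 8.8)]
predictive = [random.betavariate(a, b) for a, b in param_samples]
```

Because each draw first varies the population parameters and then the subject, the predictive spread reflects both population-level and subject-level uncertainty.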
22 In the same way, we can obtain approximations to the posterior mean, the posterior mode, or a posterior probability interval. [sent-313, score-0.921]
23 For instance, we can obtain the posterior population parameters, p(α+, β+ | k+1:m) and p(α−, β− | k−1:m), using the same sampling procedure as summarized in Section 2. [sent-323, score-0.784]
24 The two sets of samples can then be averaged in a pairwise fashion to obtain samples from the posterior mean balanced accuracy in the population, p(φ | k+1:m, k−1:m), where we have defined φ := (1/2)(α+/(α+ + β+) + α−/(α− + β−)). [sent-325, score-0.662]
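The pairwise averaging can be sketched as follows; both sample lists are hypothetical stand-ins for the two class-specific posteriors:

```python
# Hypothetical posterior samples of (alpha+, beta+) and (alpha-, beta-) from the
# two class-specific chains; pairwise averaging yields samples of phi.
pos = [(14.0, 6.0), (13.2, 6.5), (15.1, 5.8)]
neg = [(5.5, 14.5), (6.1, 13.9), (5.0, 15.0)]

phi = [0.5 * (ap / (ap + bp) + an / (an + bn))
       for (ap, bp), (an, bn) in zip(pos, neg)]   # samples of the mean balanced accuracy
```

In this toy case the positive-class accuracy is high (~0.7) and the negative-class accuracy low (~0.3), so every φ sample lands near chance, illustrating how the balanced accuracy exposes a one-sided classifier.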
25 Similarly, we can average pairs of posterior samples from πj+ and πj− to obtain samples from the posterior densities of subject-specific balanced accuracies, p(φj | k+1:m, k−1:m). [sent-326, score-0.852]
26 Using the same idea, we can obtain samples from the posterior predictive density of the balanced accuracy that can be expected in a new subject from the same population, p(φ̃ | k+1:m, k−1:m). [sent-327, score-0.778]
27 In this case, an unbiased classifier yields high accuracies on either class in some subjects and lower accuracies in others, inducing a positive correlation between class-specific accuracies. [sent-334, score-0.732]
28 We therefore turn to an alternative model for mixed-effects inference on the balanced accuracy that embraces potential dependencies between class-specific accuracies (Figure 2b). [sent-339, score-0.68]
29 Instead, we use a bivariate population density whose covariance structure defines the form and extent of the dependency between π+ and π− . [sent-343, score-0.661]
30 Model Inversion: In contrast to the twofold beta-binomial model discussed in the previous section, the bivariate normal-binomial model makes it difficult to sample from the posterior densities over model parameters using a Metropolis implementation. [sent-381, score-0.674]
31 First, population parameter estimates can be obtained by sampling from the posterior density p(µ, Σ | k+1:m, k−1:m) using a Metropolis-Hastings approach. [sent-389, score-0.835]
32 Second, subject-specific accuracies are estimated by first sampling from p(ρj | k+1:m, k−1:m) and then applying a sigmoid transform to obtain samples from the posterior density over subject-specific balanced accuracies, p(φj | k+1:m, k−1:m). [sent-390, score-0.787]
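A sketch of the sigmoid step, using hypothetical logit-scale draws ρj = (ρ+, ρ−):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical posterior draws of rho_j = (rho+, rho-) on the logit scale; the
# sigmoid maps each component to a class-specific accuracy, and their average
# is a sample of the subject's balanced accuracy phi_j.
rho_draws = [(1.2, -0.3), (0.9, -0.1), (1.4, -0.5)]
phi_j = [0.5 * (sigmoid(rp) + sigmoid(rn)) for rp, rn in rho_draws]
```

Working on the logit scale lets the bivariate normal capture correlations between the two class-specific accuracies while the sigmoid keeps each accuracy in (0, 1).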
33 The best model can then be used for posterior inferences on the mean accuracy in the population or the predictive accuracy in a new subject from the same population. [sent-409, score-1.259]
34 Similarly, we can obtain the posterior predictive distribution of the balanced accuracy in a new subject from the same population, p(φ̃ | k+1:m, k−1:m) = ∑M p(φ̃ | k+1:m, k−1:m, M) p(M | k+1:m, k−1:m). [sent-414, score-0.700]
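Sampling from such a model-averaged predictive distribution works by composition: pick a model with probability p(M | k), then draw from that model's predictive samples. All numbers below are hypothetical:

```python
import random

random.seed(4)
# Hypothetical posterior model probabilities and per-model predictive samples
# for the twofold beta-binomial ("bb") and bivariate normal-binomial ("nb") models.
model_prob = {"bb": 0.7, "nb": 0.3}
predictive_samples = {"bb": [0.52, 0.55, 0.50], "nb": [0.48, 0.51, 0.53]}

# Composition sampling: pick a model with probability p(M | k), then a
# predictive sample from that model.
chosen = random.choices(list(model_prob), weights=list(model_prob.values()), k=1000)
averaged = [random.choice(predictive_samples[m]) for m in chosen]
```

The mixture weights come straight from Bayesian model selection, so no hard choice between the two models is ever forced.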
35 We then contrast inference on accuracies with inference on balanced accuracies (Section 3. [sent-436, score-0.937]
36 Their empirical sample accuracies are shown in Figure 4b, along with the ground-truth density of the population accuracy. [sent-452, score-0.801]
37 1 (Figure 4c), and examining the posterior distribution over the population mean accuracy showed that more than 99. [sent-454, score-0.93]
38 This is in contrast to the dispersion of the posterior over the population mean, which becomes more and more precise with an increasing amount of data. [sent-464, score-0.822]
39 [Figure 4 panel residue: FFX/RFX/MFX posterior and confidence intervals; (f) predictive inference; log(α/β); p(> 0.5); predictive accuracy.] [sent-487, score-0.959]
40 Figure 4: Inference on the population mean and the predictive accuracy. [sent-494, score-0.713]
41 (b) Empirical sample accuracies (blue) and their underlying population distribution (green). [sent-497, score-0.75]
42 (c) Inverting the beta-binomial model yields samples from the posterior distribution over the population parameters, visualized using a nonparametric (bivariate Gaussian kernel) density estimate (contour lines). [sent-498, score-0.889]
43 (d) The posterior about the population mean accuracy, plotted using a kernel density estimator (black), is sharply peaked around the true population mean (green). [sent-499, score-1.422]
44 (f) The posterior predictive distribution over ˜ π represents the posterior belief of the accuracy expected in a new subject (black). [sent-505, score-0.852]
45 [x-axis: true population mean.] Figure 5: Inference on the population mean under varying population heterogeneity. [sent-534, score-1.509]
46 The figure shows Bayesian estimates of the frequentist probability of above-chance classification performance, as a function of the true population mean, separately for three different levels of population heterogeneity (a,b,c). [sent-535, score-1.043]
47 By contrast, a fixed-effects approach (orange) offers invalid population inference as it disregards between-subjects variability; at a true population mean of 0. [sent-538, score-1.164]
48 Insets show the distribution of the true underlying population accuracy (green) for a population mean accuracy of 0. [sent-546, score-1.207]
49 We then varied the true population mean and plotted the fraction of decisions for an above-chance classifier as a function of population mean (Figure 5a). [sent-549, score-1.032]
50 Since the population variance was chosen to be very low in this initial simulation, the inferences afforded by a fixed-effects analysis (yellow) prove very similar as well; but this changes drastically when increasing the population variance to more realistic levels, as described below. [sent-556, score-1.078]
51 [Figure 6 panel residue: population mean accuracy; beta-binomial model vs. conventional FFX and RFX intervals.] Figure 6: Inadequate inferences provided by fixed-effects and random-effects models. [sent-567, score-0.818]
52 (b) The (mixed-effects) posterior density of the population mean (black) provides a good estimate of ground truth (green). [sent-570, score-0.94]
53 For example, given a fairly homogeneous population with a true population mean accuracy of 60% and a variance of 0. [sent-574, score-1.13]
54 The above simulations show that a fixed-effects analysis (yellow) becomes an invalid procedure to infer on the population mean when the population variance is non-negligible. [sent-581, score-1.049]
55 Classification outcomes were generated using the beta-binomial model with a population mean of 0. [sent-590, score-0.662]
56 The example shows that the proposed beta-binomial model yields a posterior density with the necessary asymmetry; it comfortably includes the true population mean (Figure 6b). [sent-594, score-0.901]
57 This simulation was based on 45 subjects overall; 40 subjects were characterized by a relatively moderate number of trials (n = 20) while 5 subjects had even fewer trials (n = 5). [sent-605, score-1.066]
58 The plot shows that, in each subject, the posterior mode (black) represents a compromise between the observed sample accuracy (blue) and the population mean (0. [sent-611, score-0.956]
59 Another way of demonstrating the shrinkage effect is by illustrating the transition from ground truth to sample accuracies (with its increase in dispersion) and from sample accuracies to posterior means (with its decrease in dispersion). [sent-615, score-0.963]
60 This shows how the high variability in sample accuracies is reduced, informed by what has been learned about the population (Figure 7b). [sent-616, score-0.75]
61 Here, subjects with only 5 trials were shrunk more than subjects with 20 trials. [sent-618, score-0.652]
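The direction of this effect follows from the conditional posterior mean of a subject's accuracy under a fixed Beta(α, β) population prior, (α + kj)/(α + β + nj): with identical 80% sample accuracies, a 5-trial subject is pulled further towards the population mean than a 20-trial subject. A sketch with hypothetical population parameters:

```python
# Conditional posterior mean of a subject's accuracy under a Beta(alpha, beta)
# population prior: (alpha + k_j) / (alpha + beta + n_j). The alpha, beta values
# are hypothetical; the trial counts mirror the simulation (20 vs. 5 trials).
alpha, beta = 12.0, 8.0            # implies a population mean of 0.6
subjects = [(16, 20), (4, 5)]      # (k_j correct, n_j total): both at 80% sample accuracy

shrunk = [(alpha + k) / (alpha + beta + n) for k, n in subjects]
# The 20-trial subject lands at 0.70, the 5-trial subject at 0.64:
# fewer trials means stronger shrinkage towards the population mean of 0.6.
```

The virtual sample size α + β acts like a pool of pseudo-trials at the population mean, so the fewer real trials a subject contributes, the more those pseudo-trials dominate.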
62 [Figure 7 panel residue: ground truth, sample accuracies, posterior interval (bb); subjects (sorted).] [sent-638, score-0.936]
63 [Panel residue: subjects with very few trials; (c) power curves; ground truth, sample accuracies, posterior mean (bb).] [sent-639, score-1.099]
64 (a) Classification outcomes were generated for a synthetic heterogeneous group of 45 subjects (40 subjects with 20 trials each, 5 subjects with 5 trials each). [sent-650, score-1.266]
65 (b) Another way of visualizing the shrinking effect is to contrast the increase in dispersion (as we move from ground truth to sample accuracies) with the decrease in dispersion (as we move from sample accuracies to posterior means) enforced by the hierarchical model. [sent-656, score-0.812]
66 Shrinking changes the order of subjects (when sorted by posterior mean as opposed to by sample accuracy) as the amount of shrinking depends on the subject-specific (first-level) posterior uncertainty. [sent-657, score-0.917]
67 (d) Across the same 1 000 simulations, a Bayes estimator, based on the posterior means of subject-specific accuracies (black), was superior to both a classical ML estimator (blue) and a James-Stein estimator (red). [sent-661, score-0.664]
68 An initial simulation specified a high population accuracy on the positive class and a low accuracy on the negative class, with equal variance in both (Figure 8a,b). [sent-675, score-0.721]
69 [Figure 8 panel residue: correct predictions on +ve and −ve trials per subject; (c) population-mean intervals.] [sent-708, score-0.637]
70 [Panel residue: TPR vs. TNR; accuracy (bb), balanced accuracy (nb), balanced accuracy (bb).] [sent-709, score-0.69]
71 [x-axis: population mean on positive trials.] Figure 8: Inference on the balanced accuracy. [sent-718, score-0.847]
72 The underlying true population distribution is represented by a bivariate Gaussian kernel density estimate (contour lines). [sent-722, score-0.661]
73 The plot shows that the population accuracy is high on positive trials and low on negative trials. [sent-723, score-0.76]
74 (c) Central 95% posterior probability intervals based on three models: the simple beta-binomial model for inference on the population accuracy; and the twofold beta-binomial model as well as the bivariate normal-binomial model for inference on the balanced accuracy. [sent-724, score-1.586]
75 The true mean balanced accuracy in the population is at chance (green). [sent-725, score-0.841]
76 To examine this dependence, we carried out a sensitivity analysis in which we considered the infraliminal probability of the posterior population mean as a function of prior moments (Figure 9). [sent-738, score-0.97]
77 We found that inferences were extremely robust, in the sense that the influence of the prior moments on the resulting posterior densities was negligible in relation to the variance resulting from the fact that we are using a (stochastic) approximate inference method for model inversion. [sent-739, score-0.65]
78 Similarly, varying µ0 , κ0 , or ν0 in the normal-binomial model had practically no influence on the infraliminal probability of the posterior balanced accuracy (Figure 9c,d,e). [sent-741, score-0.66]
79 Each graph shows the infraliminal probability of the population mean accuracy (i. [sent-769, score-0.687]
80 Inferences on the population balanced accuracy are based on the bivariate normal-binomial model. [sent-777, score-0.872]
81 To illustrate the generic applicability of our approach when its assumptions are not satisfied by construction, we applied models for mixed-effects inference to classification outcomes obtained on synthetic data features for a group of 20 subjects with 100 trials each (Figure 10). [sent-781, score-0.786]
82 The underlying population distribution is represented by a bivariate Gaussian kernel density estimate (contour lines). [sent-825, score-0.661]
83 This distribution was symmetric with regard to class-specific accuracies while these accuracies themselves were strongly positively correlated, as one would expect from a linear classifier tested on perfectly balanced data sets. [sent-833, score-0.649]
84 Central 95% posterior probability intervals about the population mean are shown in Figure 10c, along with a frequentist confidence interval of the population mean accuracy. [sent-838, score-1.527]
85 In stark contrast, using the single beta-binomial model or a conventional mean of sample accuracies to infer on the population accuracy (as opposed to balanced accuracy) resulted in estimates that were overly optimistic and therefore misleading. [sent-861, score-1.129]
86 Inverting the former model, which captures potential dependencies between class-specific accuracies, yields a posterior distribution over the population mean balanced accuracy (black) which shows that the classifier is performing above chance. [sent-884, score-1.085]
87 The plot contrasts sample accuracies (blue) with central 95% posterior probability intervals (black), which avoid overfitting by shrinking to the population mean. [sent-886, score-1.104]
88 Using the beta-binomial model for inference on the population mean balanced accuracy, we obtained very strong evidence (infraliminal probability p < 0. [sent-902, score-0.88]
89 The shrinkage effect in these subject-specific accuracies was rather small: the average absolute difference between sample accuracies and posterior means amounted to 1. [sent-905, score-0.871]
90 Specifically, the posterior distribution of the accuracy of one subject is partially influenced by the data from all other subjects, correctly weighted by their respective posterior precisions (see Section 3. [sent-923, score-0.833]
91 In this way, one would explicitly model task- or session-specific accuracies to be conditionally independent from one another given an overall subject-specific effect π j , and conditionally independent from other subjects given the population parameters. [sent-954, score-0.989]
92 4 we showed how alternative a priori assumptions about the population covariance of class-specific accuracies can be evaluated, relative to the priors of the models, using Bayesian model selection. [sent-969, score-0.784]
93 (2012), who carry out inference on the population mean accuracy by comparing two beta-binomial models: one with a population mean prior at 0. [sent-1037, score-1.332]
94 In order to assess whether the mean classification performance achieved in the population is above chance, we must evaluate our posterior knowledge about the population parameters α and β. [sent-1074, score-1.3]
95 E(α/(α+β) | k1:m) ≈ (1/c) ∑τ=1..c α(τ)/(α(τ) + β(τ)). Another informative measure is the posterior probability that the mean classification accuracy in the population does not exceed chance, p = Pr(α/(α+β) ≤ 0.5 | k1:m). [sent-1077, score-0.930]
96 In order to derive an expression for the posterior predictive distribution in closed form, one would need to integrate out the population parameters α and β, p(π̃ | k1:m) = ∫∫ p(π̃ | α, β) p(α, β | k1:m) dα dβ, which is analytically intractable. [sent-1095, score-0.829]
97 We then complete the first step by drawing µ(τ) ∼ N2(µm, Σ(τ)/κm), which we can use to obtain samples from the posterior mean balanced accuracy using φ(τ) := (1/2)(σ(µ1(τ)) + σ(µ2(τ))). [sent-1123, score-0.635]
98 Apart from using µ(τ) and Σ(τ) to obtain samples from the posterior distributions over ρj, we can further use the two vectors to draw samples from the posterior predictive distribution p(π̃ | k+1:m, k−1:m). [sent-1138, score-0.713]
99 For this we first draw ρ̃(τ) ∼ N2(µ(τ), Σ(τ)), and then obtain the desired sample using π̃(τ) = σ(ρ̃(τ)), from which we can obtain samples from the posterior predictive balanced accuracy using φ̃(τ) := (1/2)(π̃1(τ) + π̃2(τ)). [sent-1139, score-0.667]
100 Additionally, it is common practice to indicate the uncertainty about the population mean of the classification accuracy by reporting the 95% confidence interval π̄ ± t0.975 · σ̂m−1/√m. [sent-1160, score-0.706]
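For comparison, this conventional fixed-effects interval can be computed directly; the normal quantile below stands in for the t quantile (an approximation that is reasonable for larger m), and the sample accuracies are hypothetical:

```python
import math
import statistics

# Subject-wise sample accuracies (hypothetical). The normal quantile stands in
# for the t quantile, a reasonable approximation for larger m.
sample_accs = [0.58, 0.62, 0.55, 0.66, 0.60, 0.63, 0.57, 0.61]
m = len(sample_accs)
mean = statistics.fmean(sample_accs)
sd = statistics.stdev(sample_accs)              # sigma-hat, with m - 1 in the denominator
z = statistics.NormalDist().inv_cdf(0.975)      # ~1.96
ci = (mean - z * sd / math.sqrt(m), mean + z * sd / math.sqrt(m))
```

Unlike the hierarchical posterior intervals above, this interval ignores trial-level (within-subject) uncertainty, which is one of the paper's central criticisms of the conventional summary.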
simIndex simValue paperId paperTitle
same-paper 1 0.99999875 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
2 0.11824343 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efÄ?Ĺš cient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to conÄ?Ĺš gure neural networks and deep belief networks. Compared with neural networks conÄ?Ĺš gured by a pure grid search, we Ä?Ĺš nd that random search over the same domain is able to Ä?Ĺš nd models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search Ä?Ĺš nds better models by effectively searching a larger, less promising conÄ?Ĺš guration space. Compared with deep belief networks conÄ?Ĺš gured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional conÄ?Ĺš guration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for conÄ?Ĺš guring algorithms for new data sets. Our analysis casts some light on why recent “High Throughputâ€? methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
3 0.093944304 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and ηprior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
4 0.085465744 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
Author: Hannes Nickisch
Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class
5 0.064528838 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
6 0.064161249 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
7 0.061556138 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
8 0.052465536 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
9 0.050612941 4 jmlr-2012-A Kernel Two-Sample Test
10 0.04939758 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
11 0.048978709 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
12 0.047621015 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
13 0.047010776 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
14 0.045475379 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
15 0.044329558 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
16 0.042264633 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
17 0.042200316 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
18 0.041035861 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
19 0.040860213 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
20 0.039848045 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
topicId topicWeight
[(0, -0.181), (1, 0.083), (2, 0.13), (3, -0.056), (4, 0.117), (5, -0.014), (6, 0.104), (7, 0.164), (8, -0.113), (9, 0.095), (10, 0.026), (11, -0.05), (12, 0.099), (13, 0.096), (14, 0.07), (15, 0.01), (16, 0.17), (17, 0.185), (18, 0.125), (19, -0.077), (20, 0.083), (21, 0.091), (22, 0.017), (23, 0.121), (24, -0.161), (25, -0.088), (26, -0.171), (27, 0.06), (28, 0.043), (29, -0.087), (30, -0.065), (31, 0.014), (32, 0.058), (33, -0.093), (34, -0.087), (35, -0.115), (36, -0.185), (37, 0.076), (38, -0.054), (39, 0.016), (40, -0.03), (41, 0.177), (42, -0.013), (43, -0.058), (44, -0.074), (45, 0.03), (46, 0.078), (47, 0.02), (48, -0.042), (49, -0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.9660092 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
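The balanced accuracy mentioned in the abstract is the mean of sensitivity and specificity. A minimal sketch (illustrative only; the paper's actual contribution is the hierarchical Bayesian treatment) shows how it avoids the inflated estimates that plain accuracy gives on imbalanced data:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

# A majority-class classifier on a 90/10 imbalanced test set:
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.9
bal_acc = balanced_accuracy(y_true, y_pred)                           # 0.5
```

The trivial classifier scores 90% accuracy but only 50% balanced accuracy, i.e., chance level.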
2 0.57495844 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
Author: Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, Shiliang Sun
Abstract: This paper presents the prior PAC-Bayes bound and explores its capabilities as a tool to provide tight predictions of SVMs’ generalization. The computation of the bound involves estimating a prior of the distribution of classifiers from the available data, and then manipulating this prior in the usual PAC-Bayes generalization bound. We explore two alternatives: to learn the prior from a separate data set, or to consider an expectation prior that does not need this separate data set. The prior PAC-Bayes bound motivates two SVM-like classification algorithms, prior SVM and η-prior SVM, whose regularization term pushes towards the minimization of the prior PAC-Bayes bound. The experimental work illustrates that the new bounds can be significantly tighter than the original PAC-Bayes bound when applied to SVMs, and among them the combination of the prior PAC-Bayes bound and the prior SVM algorithm gives the tightest bound. Keywords: PAC-Bayes bound, support vector machine, generalization capability prediction, classification
3 0.56777579 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success—they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
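The paper's core argument, that a grid wastes trials on hyper-parameters that do not matter, can be sketched on a toy objective with low effective dimensionality (the objective and trial counts below are made up for illustration):

```python
import random

# Toy validation loss with low effective dimensionality: only x matters,
# y is irrelevant -- the situation the paper's Gaussian process analysis
# identifies on real data sets.
def objective(x, y):
    return (x - 0.7) ** 2

def grid_search(n_per_axis):
    pts = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return min(objective(x, y) for x in pts for y in pts)

def random_search(n_trials, rng):
    return min(objective(rng.random(), rng.random()) for _ in range(n_trials))

rng = random.Random(0)
best_grid = grid_search(3)         # 9 trials, but only 3 distinct x values
best_rand = random_search(9, rng)  # 9 trials, 9 distinct x values
```

With the same budget of 9 trials, the grid probes only 3 distinct values of the relevant coordinate, while random search probes 9, so it typically finds a lower loss.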
4 0.50183195 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
Author: Hannes Nickisch
Abstract: The glm-ie toolbox contains functionality for estimation and inference in generalised linear models over continuous-valued variables. Besides a variety of penalised least squares solvers for estimation, it offers inference based on (convex) variational bounds, on expectation propagation and on factorial mean field. Scalable and efficient inference in fully-connected undirected graphical models or Markov random fields with Gaussian and non-Gaussian potentials is achieved by casting all the computations as matrix vector multiplications. We provide a wide choice of penalty functions for estimation, potential functions for inference and matrix classes with lazy evaluation for convenient modelling. We designed the glm-ie package to be simple, generic and easily expansible. Most of the code is written in Matlab including some MEX files to be fully compatible to both Matlab 7.x and GNU Octave 3.3.x. Large scale probabilistic classification as well as sparse linear modelling can be performed in a common algorithmical framework by the glm-ie toolkit. Keywords: sparse linear models, generalised linear models, Bayesian inference, approximate inference, probabilistic regression and classification, penalised least squares estimation, lazy evaluation matrix class
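The "all computations as matrix-vector multiplications" design can be illustrated with a small NumPy analogue (glm-ie itself is Matlab/Octave code; the solver below is a generic conjugate-gradient sketch, not the toolbox's API): a penalised least squares estimate is computed while touching the design matrix only through matvecs.

```python
import numpy as np

# Penalised least squares  min_u ||X u - y||^2 + lam ||u||^2  solved by
# conjugate gradients that access X only through matrix-vector products,
# mirroring the matvec-based design described in the abstract.
def cg_solve(matvec, b, iters=100, tol=1e-10):
    """Conjugate gradients for A u = b, with A given only as a matvec."""
    u = np.zeros_like(b)
    r = b - matvec(u)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        u = u + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ rng.standard_normal(10)
lam = 1e-2
# Normal-equations operator A = X.T X + lam*I, never formed explicitly.
u_hat = cg_solve(lambda v: X.T @ (X @ v) + lam * v, X.T @ y)
```

Because the solver only ever calls the matvec, the same code works when `X` is replaced by a lazily evaluated or structured operator.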
5 0.39396575 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
6 0.37887076 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
7 0.36236772 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
8 0.35976017 31 jmlr-2012-DEAP: Evolutionary Algorithms Made Easy
9 0.33543101 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
10 0.31045777 4 jmlr-2012-A Kernel Two-Sample Test
11 0.30070493 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
12 0.29836637 82 jmlr-2012-On the Necessity of Irrelevant Variables
13 0.26535204 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
14 0.26487914 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
15 0.24981821 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
17 0.24645194 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
18 0.24094403 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
19 0.23085846 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems
20 0.23044857 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
topicId topicWeight
[(7, 0.017), (21, 0.035), (23, 0.377), (26, 0.054), (29, 0.066), (35, 0.037), (49, 0.036), (56, 0.021), (57, 0.013), (69, 0.019), (75, 0.044), (77, 0.02), (79, 0.015), (81, 0.011), (92, 0.058), (96, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.74814165 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
Author: Kay H. Brodersen, Christoph Mathys, Justin R. Chumbley, Jean Daunizeau, Cheng Soon Ong, Joachim M. Buhmann, Klaas E. Stephan
Abstract: Classification algorithms are frequently used on data with a natural hierarchical structure. For instance, classifiers are often trained and tested on trial-wise measurements, separately for each subject within a group. One important question is how classification outcomes observed in individual subjects can be generalized to the population from which the group was sampled. To address this question, this paper introduces novel statistical models that are guided by three desiderata. First, all models explicitly respect the hierarchical nature of the data, that is, they are mixed-effects models that simultaneously account for within-subjects (fixed-effects) and across-subjects (random-effects) variance components. Second, maximum-likelihood estimation is replaced by full Bayesian inference in order to enable natural regularization of the estimation problem and to afford conclusions in terms of posterior probability statements. Third, inference on classification accuracy is complemented by inference on the balanced accuracy, which avoids inflated accuracy estimates for imbalanced data sets. We introduce hierarchical models that satisfy these criteria and demonstrate their advantages over conventional methods using MCMC implementations for model inversion and model selection on both synthetic and empirical data. We envisage that our approach will improve the sensitivity and validity of statistical inference in future hierarchical classification studies. Keywords: beta-binomial, normal-binomial, balanced accuracy, Bayesian inference, group studies
2 0.36057317 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
3 0.35957974 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
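The key identity behind the distributed mini-batch method, that the gradient of the average loss over a batch equals the average of per-shard gradients, can be checked in a few lines (an illustrative sketch; the paper's algorithm additionally accounts for communication latency):

```python
import numpy as np

# The gradient of the mean loss over a batch equals the average of the
# gradients computed on disjoint equal-size shards, so workers can
# process shards in parallel and a single averaged update is applied.
def grad(w, X, y):
    # gradient of the mean squared error 0.5 * mean((X w - y)^2)
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 4))
y = rng.standard_normal(60)
w = rng.standard_normal(4)

g_full = grad(w, X, y)
shards = np.array_split(np.arange(60), 3)   # three equal "workers"
g_avg = np.mean([grad(w, X[s], y[s]) for s in shards], axis=0)
```

The two gradients agree exactly (up to floating-point rounding) because the shards are of equal size.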
4 0.35793114 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
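The observation B OOST M ETRIC builds on, that any positive semidefinite matrix is a nonnegative combination of trace-one rank-one matrices, is exhibited directly by the eigendecomposition (a sketch of the decomposition only, not of the boosting algorithm itself):

```python
import numpy as np

# Any PSD matrix is a nonnegative combination of trace-one rank-one
# matrices; the eigendecomposition exhibits one such combination:
# X = sum_i lam_i * v_i v_i^T with lam_i >= 0 and trace(v_i v_i^T) = 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = A @ A.T                                  # a PSD matrix
lam, V = np.linalg.eigh(X)
parts = [l * np.outer(v, v) for l, v in zip(lam, V.T)]
X_rebuilt = sum(parts)
```

Each `np.outer(v, v)` has rank one and trace `v @ v = 1` since `eigh` returns unit-norm eigenvectors; the nonnegative eigenvalues are the combination weights.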
5 0.35384044 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nyström method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nyström method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nyström method, providing novel error bounds guaranteeing a better convergence rate of the ensemble Nyström method in comparison to the standard Nyström method. Keywords: low-rank approximation, nyström method, ensemble methods, large-scale learning
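The basic reconstruction the paper analyses can be sketched in a few lines (uniform sampling, one of the fixed schemes it compares; names here are illustrative): from m sampled columns C and their intersection block W, the Nyström method forms K ≈ C W⁺ Cᵀ, which is exact whenever the sample spans the range of a low-rank PSD matrix.

```python
import numpy as np

# Nystrom approximation: from m uniformly sampled columns C and the
# m x m intersection block W, reconstruct K ~= C W^+ C^T. When K is PSD
# with rank r and the sample spans its range, the reconstruction is exact.
def nystrom(K, idx):
    C = K[:, idx]              # n x m sampled columns
    W = K[np.ix_(idx, idx)]    # m x m intersection block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 5))
K = B @ B.T                    # a rank-5 PSD "kernel" matrix
idx = rng.choice(100, size=10, replace=False)
K_approx = nystrom(K, idx)     # exact here: 10 columns span the rank-5 range
```

On full-rank kernel matrices the reconstruction is only approximate, and its quality depends on the sampling scheme, which is exactly the question the paper studies.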
6 0.35364652 4 jmlr-2012-A Kernel Two-Sample Test
7 0.35296857 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
8 0.3527661 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
9 0.34935701 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
10 0.34806472 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
11 0.34802371 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
12 0.34751314 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
13 0.34745675 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
14 0.34700492 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
15 0.34637249 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
16 0.34623265 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
17 0.34622425 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
18 0.34609404 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
19 0.34564573 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
20 0.34546649 100 jmlr-2012-Robust Kernel Density Estimation