nips nips2005 nips2005-37 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mikaela Keller, Samy Bengio, Siew Y. Wong
Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, under various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical tests, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. [sent-4, score-0.846]
2 This paper is an attempt at empirically verifying how these tests compare with more classical tests, under various conditions. [sent-5, score-0.431]
3 More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical tests, varying the class unbalance, the compared models, the performance measure, and the sample size. [sent-6, score-0.29]
4 The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions. [sent-7, score-0.584]
5 1 Introduction Statistical tests are often used in machine learning in order to assess the performance of a new learning algorithm or model over a set of benchmark datasets, with respect to the state-of-the-art solutions. [sent-8, score-0.399]
6 On the other hand, various research domains prefer to measure the performance of their models using different indicators, such as the F1 measure, used in information retrieval [11], described in Section 2. [sent-10, score-0.094]
7 Most classical statistical tests cannot cope directly with such measures, as the usual necessary assumptions are no longer correct, and non-parametric bootstrap-based methods are then used [5]. [sent-12, score-0.536]
8 Since several papers already use these non-parametric tests [2, 1], we were interested in verifying empirically how reliable they were. [sent-13, score-0.456]
9 For this purpose, we used a very large text categorization database (the extended Reuters dataset [10]), composed of more than 800000 examples, and concerning more than 100 categories (each document was labelled with one or more of these categories). [sent-14, score-0.273]
10 We purposely set aside the largest part of the dataset and considered it as the whole population, while a much smaller part of it was used as a training set for the models. [sent-15, score-0.105]
11 Using the large set aside dataset part, we tested the statistical test in the same spirit as was done in [4], by sampling evaluation sets over which we observed the performance of the models and the behavior of the significance test. (∗ This work was supported in part by the Swiss NSF through the NCCR on IM2 and in part by the European PASCAL Network of Excellence, IST-2002-506778, through the Swiss OFES.) [sent-16, score-0.425] [sent-17, score-0.268]
13 Following the taxonomy of questions of interest defined by Dietterich in [4], we can differentiate between statistical tests that analyze learning algorithms and statistical tests that analyze classifiers. [sent-18, score-0.9]
14 In the first case, one intends to be robust to possible variations of the train and evaluation sets, while in the latter, one intends to only be robust to variations of the evaluation set. [sent-19, score-0.378]
15 In order to conduct a thorough analysis, we tried to vary the evaluation set size, the class unbalance, the error measure, the statistical test itself (with its associated assumptions), and even the closeness of the compared learning algorithms. [sent-21, score-0.639]
16 As will be seen empirically, the closeness of the compared learning algorithms seems to have an effect on the resulting quality of the statistical tests: comparing an MLP and an SVM yields less reliable statistical tests than comparing two SVMs with different kernels. [sent-23, score-0.823]
17 To the best of our knowledge, this has never been considered in the literature of statistical tests for machine learning. [sent-24, score-0.474]
18 2 A Statistical Significance Test for the Difference of F1 Let us first recall the basic classification framework in which statistical significance tests are used in machine learning. [sent-25, score-0.45]
19 We consider comparing two models A and B on a two-class classification task where the goal is to classify input examples xi into the corresponding class yi ∈ {−1, 1}, using already trained models fA (xi ) or fB (xi ). [sent-26, score-0.211]
20 One can estimate their respective performance on some test data by counting the number of occurrences of each possible outcome: either the obtained class corresponds to the desired class, or not. [sent-27, score-0.288]
21 Let Ne,A (resp. Ne,B) denote the number of errors made by model A (resp. B), and N the total number of test examples; the difference between models A and B can then be written as D = (Ne,A − Ne,B) / N. (1) [sent-30, score-0.253]
22 The usual starting point of most statistical tests is to define the so-called null hypothesis H0, which considers that the two models are equivalent, and then to verify how probable this hypothesis is. [sent-31, score-0.591]
23 Hence, assuming that the observed difference d is an instance of some random variable D which follows some distribution, we reject H0 whenever p(|D| ≥ |d|) < α, (2) where α represents the risk of selecting the alternate hypothesis (the two models are different) while the null hypothesis is in fact true. [sent-32, score-0.116]
24 In the simplest case, known as the proportion test, one assumes (reasonably) that the decision taken by each model on each example can be modeled by a Bernoulli, and further assumes that the errors of the models are independent. [sent-34, score-0.368]
25 This is in general wrong in machine learning since the evaluation sets are the same for both models. [sent-35, score-0.207]
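To make the discussion concrete, the following Python sketch implements one common form of such a proportion test, a pooled two-proportion z statistic; the exact expression used in the paper (its Eq. (3)) is not reproduced above, so this particular formula, the function name, and the inputs errors_a / errors_b (0/1 vectors marking which evaluation examples each model misclassifies) are assumptions of ours, not the authors' code.

    import numpy as np
    from scipy.stats import norm

    def proportion_test(errors_a, errors_b, alpha=0.05):
        """Proportion test on the difference of error rates of models A and B.

        Treats each model's per-example errors as independent Bernoulli draws
        and the two models' errors as independent of each other -- exactly the
        assumption criticised above, since both models share the evaluation set.
        """
        n = len(errors_a)
        p_a, p_b = np.mean(errors_a), np.mean(errors_b)
        p = (p_a + p_b) / 2.0                      # pooled error rate under H0
        se = np.sqrt(2.0 * p * (1.0 - p) / n)      # std. error of the difference
        z = (p_a - p_b) / se if se > 0 else 0.0
        p_value = 2.0 * norm.sf(abs(z))            # two-sided tail probability
        return p_value, p_value < alpha            # True means "reject H0"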
26 In order to get rid of the wrong independence assumption between the errors of the models, the McNemar test [6] concentrates on examples which were differently classified by the two compared models. [sent-37, score-0.32]
27 Denoting by N01 the number of examples misclassified by model A but not by model B, and by N10 the converse, the resulting statistic is z = (|N01 − N10| − 1) / √(N01 + N10). (4) More recently, several other statistical tests have been proposed, such as the 5x2cv method [4] or the variance estimate proposed in [9], which both claim to better estimate the distribution of the errors (and hence the confidence on the statistical significance of the results). [sent-40, score-0.628]
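A matching sketch of the McNemar test as reconstructed in Eq. (4); the continuity correction of −1 follows the textbook form and may differ in detail from the paper's expression, and pred_a, pred_b, y are hypothetical arrays of predicted and true labels on the shared evaluation set.

    import numpy as np
    from scipy.stats import norm

    def mcnemar_test(pred_a, pred_b, y, alpha=0.05):
        """McNemar test: only examples classified differently by A and B matter."""
        correct_a = np.asarray(pred_a) == np.asarray(y)
        correct_b = np.asarray(pred_b) == np.asarray(y)
        n01 = int(np.sum(~correct_a & correct_b))  # misclassified by A but not by B
        n10 = int(np.sum(correct_a & ~correct_b))  # misclassified by B but not by A
        if n01 + n10 == 0:
            return 1.0, False                      # the models never disagree
        z = (abs(n01 - n10) - 1.0) / np.sqrt(n01 + n10)
        p_value = 2.0 * norm.sf(z)                 # two-sided
        return p_value, p_value < alpha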
28 As explained in [11], text categorization is usually solved as K 2-class classification problems, in a one-against-the-others approach. [sent-45, score-0.1]
29 For each category k, Precisionk measures the proportion of documents truly belonging to the class among those assigned to it by the classifier, and Recallk the proportion of documents of the class that are correctly classified. [sent-50, score-0.978]
30 Formally, Precisionk = NTP,k / (NTP,k + NFP,k) and Recallk = NTP,k / (NTP,k + NFN,k), where NTP,k, NFP,k and NFN,k are the numbers of true positives, false positives and false negatives for category k. To summarize these two values, it is common to consider the so-called F1 measure [12], F1 = 2 · Precision · Recall / (Precision + Recall), often used in domains such as information retrieval, text categorization, or vision processing. [sent-51, score-0.074]
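The per-category quantities can be computed directly from the counts above; in the sketch below the argument names and the guards against empty denominators are our own additions.

    import numpy as np

    def precision_recall_f1(pred_k, true_k):
        """Precision, Recall and F1 for one category k (one-against-the-others).

        pred_k / true_k: boolean arrays saying whether each document is
        assigned to / actually belongs to category k.
        """
        pred_k = np.asarray(pred_k, dtype=bool)
        true_k = np.asarray(true_k, dtype=bool)
        n_tp = np.sum(pred_k & true_k)     # true positives
        n_fp = np.sum(pred_k & ~true_k)    # false positives
        n_fn = np.sum(~pred_k & true_k)    # false negatives
        precision = n_tp / (n_tp + n_fp) if n_tp + n_fp > 0 else 0.0
        recall = n_tp / (n_tp + n_fn) if n_tp + n_fn > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        return precision, recall, f1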
31 The difference dF1 = F1,A − F1,B does not fit the assumptions of the tests presented earlier. [sent-54, score-0.347]
32 Indeed, it cannot be decomposed into a sum over the documents of independent random variables, since the numerator and the denominator of dF1 are non constant sums over documents of independent random variables. [sent-55, score-0.194]
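One way to make this concrete is the standard identity F1 = 2 · Precision · Recall / (Precision + Recall) = 2 · NTP / (2 · NTP + NFP + NFN): both the numerator and the denominator are sums of per-document counts, so they vary together from one evaluation set to the next, and dF1 is not an average of independent per-document terms of the kind the proportion and McNemar tests rely on.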
33 An alternative solution to measure the statistical significance of dF1 is based on the Bootstrap Percentile Test proposed in [5]. [sent-57, score-0.139]
34 The idea of this test is to approximate the unknown distribution of dF1 by an estimate based on bootstrap replicates of the data. [sent-58, score-0.718]
35 2 Bootstrap Percentile Test Given an evaluation set of size N , one draws, with replacement, N samples from it. [sent-60, score-0.183]
36 This gives the first bootstrap replicate B1 , over which one can compute the statistics of interest, dF1,B1 . [sent-61, score-0.422]
37 Similarly, one can create as many bootstrap replicates Bn as needed, and for each, compute dF1,Bn . [sent-62, score-0.477]
38 The higher n is, the more precise should be the statistical test. [sent-63, score-0.103]
39 Literature [3] suggests to create at least 50/α replicates, where α is the level of the test; the number of replicates is therefore dictated by the smallest α we considered. [sent-64, score-0.079]
40 From the empirical distribution of the dF1,Bn one builds the interval containing the central 1 − α fraction of the replicated values. If 0 lies outside this interval, one can say that dF1 = 0 is not among the most probable results, and thus reject the null hypothesis. [sent-68, score-0.101]
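A self-contained sketch of the resulting bootstrap percentile test for dF1; the helper _f1 uses the 2·TP / (2·TP + FP + FN) identity noted earlier, and the number of replicates, the fixed seed, and the symmetric percentile interval are choices of ours rather than the authors' exact settings.

    import numpy as np

    def _f1(pred, true):
        """F1 for one category, via the identity 2*TP / (2*TP + FP + FN)."""
        tp = np.sum(pred & true)
        fp = np.sum(pred & ~true)
        fn = np.sum(~pred & true)
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom > 0 else 0.0

    def bootstrap_percentile_test(pred_a, pred_b, true_k, n_boot=1000, alpha=0.05, seed=0):
        """Reject H0 (equal F1) when 0 lies outside the central 1 - alpha
        percentile interval of the bootstrapped dF1 = F1_A - F1_B values."""
        rng = np.random.default_rng(seed)
        pred_a = np.asarray(pred_a, dtype=bool)
        pred_b = np.asarray(pred_b, dtype=bool)
        true_k = np.asarray(true_k, dtype=bool)
        n = len(true_k)
        d_boot = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)   # draw N samples with replacement
            d_boot[b] = _f1(pred_a[idx], true_k[idx]) - _f1(pred_b[idx], true_k[idx])
        lo, hi = np.percentile(d_boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return not (lo <= 0.0 <= hi)           # True means "reject H0"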
41 3 Analysis of Statistical Tests We report in this section an analysis of the bootstrap percentile test, as well as other more classical statistical tests, based on a real large database. [sent-69, score-0.624]
42 We divided it as follows: 798,809 documents were kept aside, and any statistic computed over this set Dtrue was considered as being the truth (i.e. a very good estimate of the actual value); the remaining 7,982 documents were used as a training set Dtr (to train models A and B). [sent-73, score-0.333]
43 There was a total of 101 categories and each document was labeled with one or more of these categories. [sent-74, score-0.119]
44 We first extracted the dictionary from the training set, removed stop-words and applied stemming to it, as normally done in text categorization. [sent-75, score-0.063]
45 There was one model for each category for the SVMs, and a single MLP for the 101 categories. [sent-78, score-0.073]
46 We further define the level of the test, α = p(Reject H0 | H0 true); several values of α were considered (the figures report results for α = 0.05). [sent-81, score-0.217]
47 Table 1 summarizes the possible outcomes of a statistical test. [sent-85, score-0.145]
48 Table 1: Various outcomes of a statistical test, with α = p(Type I error). [sent-87, score-0.124]
49 Truth H0: Reject H0 → Type I error; Accept H0 → OK. Truth H1: Reject H0 → OK; Accept H0 → Type II error. In order to assess the performance of the statistical tests on their Type I error, also called the Size of the test, and on their Power = 1 − Type II error, we used the following protocol. [sent-88, score-0.58]
50 For each category Ci, we sampled from Dtrue S = 500 evaluation sets Dte of N documents each, ran the significance test over each Dte, and computed the proportion of sets for which H0 was rejected given that H0 was true over Dtrue (resp. given that H0 was false), which we denote αtrue (resp. π). [sent-89, score-0.918]
51 We used αtrue as an estimate of the significance test’s probability of making a Type I error and π as an estimate of the significance test’s Power. [sent-92, score-0.087]
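A sketch of this protocol; sample_eval_set, significance_test and h0_true_over_population are placeholders for the ingredients described above (drawing one evaluation set from Dtrue, a test that returns True when it rejects H0, and the ground-truth status of H0 over Dtrue). The 500 sets echo the number quoted in the text; the default evaluation-set size is arbitrary.

    def rejection_rate(sample_eval_set, significance_test, n_sets=500, eval_size=1000):
        """Fraction of sampled evaluation sets on which the test rejects H0."""
        rejections = sum(bool(significance_test(sample_eval_set(eval_size)))
                         for _ in range(n_sets))
        return rejections / n_sets

    def size_or_power(sample_eval_set, significance_test, h0_true_over_population,
                      n_sets=500, eval_size=1000):
        """Estimates alpha_true (the Size / Type I error rate) when H0 actually
        holds over Dtrue, and the Power pi otherwise."""
        rate = rejection_rate(sample_eval_set, significance_test, n_sets, eval_size)
        return ('alpha_true', rate) if h0_true_over_population else ('power', rate)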
52 When αtrue is higher than the α fixed by the statistical test, the test underestimates Type I error, which means we should not rely on its decision regarding the superiority of one model over the other. [sent-93, score-0.32]
53 On the contrary, αtrue < α yields a pessimistic statistical test, which accepts H0 (when it is true) more often than the nominal level predicts. [sent-95, score-0.414]
54 Furthermore we would like to favor significance tests with a high π, since the Power of the test reflects its ability to reject H0 when H0 is false. [sent-96, score-0.631]
55 2 Summary of Conditions In order to verify the sensitivity of the analyzed statistical tests to several conditions, we varied the following parameters: the value of α; the evaluation set size; the class unbalance; the performance measure; and the closeness of the compared models (a linear SVM vs. an MLP, or a linear SVM vs. a Gaussian SVM). [sent-98, score-0.486]
56 3 Results Figure 1 summarizes the results for the Size of the test estimates. [sent-106, score-0.238]
57 All graphs show αtrue, the number of times the test rejected H0 while H0 was true, for a fixed α = 0.05, with respect to the sample size, for various statistical tests and tested measures. [sent-107, score-0.264] [sent-108, score-0.474]
59 Figure 2 shows the obtained results for the Power of the test estimates. [sent-109, score-0.217]
60 The proportion of evaluation sets over which the significance test (with α = 0.05) rejected H0 when indeed H0 was false, is plotted against the evaluation set size. [sent-110, score-0.708] [sent-111, score-0.194]
62 Figures 1(a) and 2(a) show the results for balanced data (where the positive and negative examples were approximately equally present in the evaluation set) when comparing two different models (an SVM and an MLP). [sent-112, score-0.411]
63 Figures 1(b) and 2(b) show the results for unbalanced data when comparing two different models. [sent-113, score-0.196]
64 Figures 1(c) and 2(c) show the results for balanced data when comparing two similar models (a linear SVM and a Gaussian SVM), and finally Figures 1(d) and 2(d) show the results for unbalanced data and two similar models. [sent-114, score-0.436]
65 Note that each point in the graphs was computed over a different number of samples, since e.g. over the (500 evaluation sets × 101 categories) experiments only those for which H0 was true in Dtrue were taken into account in the computation of αtrue. [sent-115, score-0.237]
66 When the proportion of H0 true in Dtrue equals 0 (resp. the proportion of H0 false in Dtrue equals 0), αtrue (resp. π) cannot be computed; such points are plotted as -1 in the figures. [sent-116, score-0.392] [sent-117, score-0.399]
68 In particular, the points for the smallest evaluation set sizes (up to 1000) of Figures 2(c) and 2(d) were computed over only 500 evaluation sets, on which respectively the same categorization task was performed. [sent-122, score-0.248]
69 When αtrue lies above the α = 0.05 line, we can state that the statistical test is optimistic, while when it is below the line, the statistical test is pessimistic. [sent-126, score-0.125]
70 As already explained, a pessimistic test should be favored whenever possible. [sent-127, score-0.333]
71 First of all, as expected, most of the statistical tests are positively influenced by the size of the evaluation set, in the sense that their αtrue value converges to α for large sample sizes. [sent-129, score-0.661]
72 On the available results, the McNemar test and the bootstrap test over dCerr have a similar performance. [sent-130, score-0.856]
73 They are always pessimistic even for small evaluation set sizes, and tend to the expected α values when the models compared on balanced tasks are dissimilar. [sent-131, score-0.438]
74 They also have a similar performance in Power over all the different conditions, higher in general when comparing very different models. [sent-132, score-0.089]
75 When the compared models are similar, the bootstrap test over dF1 has a pessimistic behavior even on quite small evaluation sets. [sent-133, score-0.97]
76 However, when the models are really different the bootstrap test over dF1 is on average always optimistic. [sent-134, score-0.696]
77 Another interesting point is that, in the available results for the Power, the dF1 bootstrap test has relatively high values with respect to the other tests. [sent-136, score-0.639]
78 The proportion test has in general, on the available results, a more conservative behavior than the McNemar test and the dCerr bootstrap test. [sent-137, score-1.241]
79 It is too often prone to “Accept H0”, i.e. to conclude that the compared models have an equivalent performance, whether this is true or not. [sent-139, score-0.171]
80 However, when comparing close models in a small unbalanced evaluation set (Figure 1(d)), this conservative behavior is not present. [sent-141, score-0.459]
81 To summarize the findings, the bootstrap-based statistical test over dCerr obtained a good performance in Size comparable to the one of the McNemar test in all conditions. [sent-142, score-0.559]
82 However, both tests' performance in Power is low even for big evaluation sets, in particular when the compared models are close. [sent-143, score-0.492]
83 The bootstrap-based statistical test over dF1 has higher Power than the other compared tests, however it must be emphasized that it is slightly over-optimistic in particular for small evaluation sets. [sent-144, score-0.497]
84 Finally, when applying the proportion test over unbalanced data for close models we obtained an optimistic behavior, untypical of this usually conservative test. [sent-145, score-0.817]
[Figure 1: Proportion of Type I error vs. evaluation set size (0–6000) for the four tests (Bootstrap test dF1, McNemar test, Proportion test, Bootstrap test dCerr), under the four conditions described above; e.g. panel (b): Linear SVM vs MLP - Unbalanced data.]
89 The proportion of Type I error equals -1 in Figure 1(b) when there was no data to compute the proportion (i.e. H0 was always false). [sent-180, score-0.685]
90 In particular, we were concerned with the quality of non-parametric tests, since in some cases (when using more complex performance measures such as F1) they are the only available statistical tests. [sent-182, score-0.499]
91 Fortunately, most statistical tests performed reasonably well (in the sense that they were more often pessimistic than optimistic in their decisions) and larger test sets always improved their performance. [sent-183, score-0.874]
92 Note however that for dF1 the only available statistical test was too optimistic, although consistent across the different levels. [sent-184, score-0.394]
93 An unexpected result was that the rather conservative proportion test used over unbalanced data for close models yielded an optimistic behavior. [sent-185, score-0.817]
94 It has to be noted that recently, a probabilistic interpretation of F1 was suggested in [7], and a comparison with bootstrap-based tests should be worthwhile. [sent-186, score-0.347]
[Figure 2: Power of the test vs. evaluation set size (0–6000) for the four tests (Bootstrap test dF1, McNemar test, Proportion test, Bootstrap test dCerr), under the same four conditions as Figure 1; e.g. panel (b): Linear SVM vs MLP - Unbalanced data.]
99 The power equals -1 in Figures 2(c) and 2(d) when there was no data to compute the proportion (i.e. H1 was never true). [sent-225, score-0.405]
100 [4] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1923, 1998. [sent-236, score-0.517]
wordName wordTfidf (topN-words)
[('bootstrap', 0.422), ('tests', 0.347), ('proportion', 0.305), ('dcerr', 0.276), ('mcnemar', 0.276), ('test', 0.217), ('mlp', 0.203), ('dtrue', 0.148), ('evaluation', 0.147), ('unbalanced', 0.129), ('nf', 0.127), ('cance', 0.126), ('vs', 0.125), ('ntp', 0.106), ('statistical', 0.103), ('balanced', 0.102), ('svm', 0.099), ('documents', 0.097), ('pessimistic', 0.094), ('categories', 0.091), ('closeness', 0.078), ('optimistic', 0.074), ('percentile', 0.074), ('category', 0.073), ('figures', 0.073), ('svms', 0.073), ('comparing', 0.067), ('reject', 0.067), ('power', 0.064), ('unbalance', 0.064), ('idiap', 0.063), ('categorization', 0.062), ('rbf', 0.061), ('false', 0.058), ('precision', 0.058), ('conservative', 0.056), ('aside', 0.055), ('replicates', 0.055), ('ie', 0.054), ('true', 0.051), ('reuters', 0.05), ('type', 0.049), ('classi', 0.048), ('martigny', 0.047), ('rejected', 0.047), ('recall', 0.046), ('dte', 0.042), ('intends', 0.042), ('keller', 0.042), ('sets', 0.039), ('error', 0.039), ('text', 0.038), ('benchmarking', 0.037), ('verifying', 0.037), ('equals', 0.036), ('analyzed', 0.036), ('size', 0.036), ('models', 0.036), ('measure', 0.036), ('switzerland', 0.034), ('null', 0.034), ('approximatively', 0.034), ('bengio', 0.032), ('std', 0.031), ('ok', 0.031), ('assess', 0.03), ('compared', 0.03), ('positives', 0.03), ('tasks', 0.029), ('reliable', 0.028), ('document', 0.028), ('accept', 0.028), ('database', 0.028), ('sizes', 0.028), ('signi', 0.027), ('errors', 0.027), ('measures', 0.027), ('dataset', 0.026), ('swiss', 0.026), ('class', 0.025), ('normally', 0.025), ('classical', 0.025), ('usual', 0.025), ('examples', 0.025), ('estimate', 0.024), ('considered', 0.024), ('tested', 0.024), ('behavior', 0.024), ('hypothesis', 0.023), ('protocol', 0.023), ('misclassi', 0.023), ('big', 0.023), ('cation', 0.022), ('empirically', 0.022), ('performance', 0.022), ('already', 0.022), ('outcomes', 0.021), ('wrong', 0.021), ('really', 0.021), ('summarizes', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
Author: Mikaela Keller, Samy Bengio, Siew Y. Wong
Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions. 1
2 0.20321718 148 nips-2005-Online Discovery and Learning of Predictive State Representations
Author: Peter Mccracken, Michael Bowling
Abstract: Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. 1
3 0.13153906 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
4 0.088631153 195 nips-2005-Transfer learning for text classification
Author: Chuong B. Do, Andrew Y. Ng
Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1
5 0.072418645 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
6 0.067099258 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
7 0.06008099 47 nips-2005-Consistency of one-class SVM and related algorithms
8 0.054847475 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
9 0.053170934 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
10 0.047278989 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
11 0.04658103 151 nips-2005-Pattern Recognition from One Example by Chopping
12 0.043307129 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
13 0.042337965 105 nips-2005-Large-Scale Multiclass Transduction
14 0.040001821 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
15 0.039643545 30 nips-2005-Assessing Approximations for Gaussian Process Classification
16 0.039444488 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface
17 0.03836922 114 nips-2005-Learning Rankings via Convex Hull Separation
18 0.036828339 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
19 0.035610516 160 nips-2005-Query by Committee Made Real
20 0.035033371 19 nips-2005-Active Learning for Misspecified Models
topicId topicWeight
[(0, 0.144), (1, 0.04), (2, -0.011), (3, -0.007), (4, 0.075), (5, 0.056), (6, 0.099), (7, 0.042), (8, -0.001), (9, 0.006), (10, -0.04), (11, -0.069), (12, 0.116), (13, -0.058), (14, 0.044), (15, -0.031), (16, 0.027), (17, 0.029), (18, 0.023), (19, 0.137), (20, -0.105), (21, -0.005), (22, 0.043), (23, -0.028), (24, -0.06), (25, 0.028), (26, 0.034), (27, -0.033), (28, 0.278), (29, 0.264), (30, 0.105), (31, -0.147), (32, 0.05), (33, -0.082), (34, -0.102), (35, -0.16), (36, -0.051), (37, 0.018), (38, 0.074), (39, -0.106), (40, -0.037), (41, 0.035), (42, -0.18), (43, -0.076), (44, -0.163), (45, -0.178), (46, 0.064), (47, -0.149), (48, 0.071), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.96504235 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
Author: Mikaela Keller, Samy Bengio, Siew Y. Wong
Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions. 1
2 0.75649536 148 nips-2005-Online Discovery and Learning of Predictive State Representations
Author: Peter Mccracken, Michael Bowling
Abstract: Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state of the art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics. 1
3 0.45969629 195 nips-2005-Transfer learning for text classification
Author: Chuong B. Do, Andrew Y. Ng
Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form, function of training set statistics; we call this mapping mapping from statistics to parameters, the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks. 1
4 0.44633946 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.37398019 62 nips-2005-Efficient Estimation of OOMs
Author: Herbert Jaeger, Mingjie Zhao, Andreas Kolling
Abstract: A standard method to obtain stochastic models for symbolic time series is to train state-emitting hidden Markov models (SE-HMMs) with the Baum-Welch algorithm. Based on observable operator models (OOMs), in the last few months a number of novel learning algorithms for similar purposes have been developed: (1,2) two versions of an ”efficiency sharpening” (ES) algorithm, which iteratively improves the statistical efficiency of a sequence of OOM estimators, (3) a constrained gradient descent ML estimator for transition-emitting HMMs (TE-HMMs). We give an overview on these algorithms and compare them with SE-HMM/EM learning on synthetic and real-life data. 1
6 0.36495242 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
7 0.32696319 35 nips-2005-Bayesian model learning in human visual perception
8 0.28320795 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
9 0.28191936 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
10 0.27998614 103 nips-2005-Kernels for gene regulatory regions
11 0.27954072 30 nips-2005-Assessing Approximations for Gaussian Process Classification
12 0.2690568 105 nips-2005-Large-Scale Multiclass Transduction
13 0.25247633 171 nips-2005-Searching for Character Models
14 0.25206298 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
15 0.25200698 160 nips-2005-Query by Committee Made Real
16 0.24577454 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
17 0.24552557 47 nips-2005-Consistency of one-class SVM and related algorithms
18 0.24286647 114 nips-2005-Learning Rankings via Convex Hull Separation
19 0.2381269 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.23529449 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation
topicId topicWeight
[(3, 0.055), (10, 0.042), (27, 0.044), (31, 0.017), (34, 0.074), (39, 0.055), (55, 0.011), (65, 0.015), (69, 0.08), (73, 0.02), (78, 0.3), (88, 0.132), (91, 0.036)]
simIndex simValue paperId paperTitle
1 0.80406177 158 nips-2005-Products of ``Edge-perts
Author: Max Welling, Peter V. Gehler
Abstract: Images represent an important and abundant source of data. Understanding their statistical structure has important applications such as image compression and restoration. In this paper we propose a particular kind of probabilistic model, dubbed the “products of edge-perts model” to describe the structure of wavelet transformed images. We develop a practical denoising algorithm based on a single edge-pert and show state-ofthe-art denoising performance on benchmark images. 1
same-paper 2 0.77900732 37 nips-2005-Benchmarking Non-Parametric Statistical Tests
Author: Mikaela Keller, Samy Bengio, Siew Y. Wong
Abstract: Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions. 1
3 0.51908886 151 nips-2005-Pattern Recognition from One Example by Chopping
Author: Francois Fleuret, Gilles Blanchard
Abstract: We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. Experiments with the COIL-100 database and with a database of 150 deA graded LTEX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. 1
4 0.51635516 136 nips-2005-Noise and the two-thirds power Law
Author: Uri Maoz, Elon Portugaly, Tamar Flash, Yair Weiss
Abstract: The two-thirds power law, an empirical law stating an inverse non-linear relationship between the tangential hand speed and the curvature of its trajectory during curved motion, is widely acknowledged to be an invariant of upper-limb movement. It has also been shown to exist in eyemotion, locomotion and was even demonstrated in motion perception and prediction. This ubiquity has fostered various attempts to uncover the origins of this empirical relationship. In these it was generally attributed either to smoothness in hand- or joint-space or to the result of mechanisms that damp noise inherent in the motor system to produce the smooth trajectories evident in healthy human motion. We show here that white Gaussian noise also obeys this power-law. Analysis of signal and noise combinations shows that trajectories that were synthetically created not to comply with the power-law are transformed to power-law compliant ones after combination with low levels of noise. Furthermore, there exist colored noise types that drive non-power-law trajectories to power-law compliance and are not affected by smoothing. These results suggest caution when running experiments aimed at verifying the power-law or assuming its underlying existence without proper analysis of the noise. Our results could also suggest that the power-law might be derived not from smoothness or smoothness-inducing mechanisms operating on the noise inherent in our motor system but rather from the correlated noise which is inherent in this motor system. 1
5 0.51449913 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. 
In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. 
The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. 
From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. 
Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively. MCMC samples Laplace p(f|D) EP p(f|D) 0.2 0.15 0.45 0.1 0.4 0.05 0.3 −16 −14 −12 −10 −8 −6 f −4 −2 0 2 4 p(xi) 0 0.35 (a) 0.06 0.25 0.2 0.15 MCMC samples Laplace p(f|D) EP p(f|D) 0.1 0.05 0.04 0 0 2 0.02 xi 4 6 (c) 0 −40 −35 −30 −25 −20 −15 −10 −5 0 5 10 15 f (b) Figure 3: Panel (a) and (b) show two marginal distributions p(fi |D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples. For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well1 , given that the marginal likelihood comes as a 767 respectively 200 dimensional integral. 
The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. 
For GPC we implement model selection by ML-II hyperparameter estimation, reporting results for the θ that maximised the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. (A small sketch of this evaluation loop is given after Table 1 below.)

Results are summarised in Table 1. Comparing Laplace's method to EP, the latter proves to be more accurate both in terms of error rate and information. While the error rates are relatively similar, the predictive distribution obtained by EP proves to be more informative about the test targets. Note that for GPC the error rate depends only on the sign of the mean µ∗ of the approximate posterior over latent functions, and not on the entire posterior predictive distribution. As expected, the length of the mean vector m is much larger for the EP approximations. Comparing EP and SVMs, the results are mixed. For the Crabs data set all methods show the same error rate, but the information content of the predictive distributions differs dramatically; for some test cases the SVM predicts the wrong class with large certainty.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace's method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small an amplitude, although the sign of the mean function will mostly be correct. As a consequence LA gives over-conservative predictive probabilities and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2: for this particular task, we have seen that the sign of the latent function tends to be correct (at least at the training locations).

                                Laplace                    EP                    SVM
Data Set          m     n     E%     I        m      E%     I        m       E%     I
Ionosphere      351    34    8.84  0.591    49.96   7.99  0.661   124.94    5.69  0.681
Wisconsin       683     9    3.21  0.804    62.62   3.21  0.805    84.95    3.21  0.795
Pima Indians    768     8   22.77  0.252    29.05  22.63  0.253    47.49   23.01  0.232
Crabs           200     7    2.0   0.682   112.34   2.0   0.908  2552.97    2.0   0.047
Sonar           208    60   15.36  0.439    26.86  13.85  0.537 15678.55   11.14  0.567
USPS 3 vs 5    1540   256    2.27  0.849   163.05   2.21  0.902 22011.70    2.01  0.918

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of the inputs n. For Laplace's method and EP the table reports the average error rate E%, the average information I (2) and the average length of the mean vector m of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as the class label.
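For concreteness, here is a rough sketch of the kind of evaluation loop described above, using scikit-learn as a stand-in. The paper used its own implementations; scikit-learn's GaussianProcessClassifier happens to use the Laplace approximation and fits hyperparameters by maximising the approximate log marginal likelihood (ML-II), and SVC with probability=True maps SVM outputs through a fitted sigmoid in the spirit of [8]. The data set, the fixed parameter grid and the log2-probability score are illustrative substitutes only; in particular the score is not the paper's information measure I (2), and the grid is not refined as in the experiments reported here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Stand-in data set; the benchmark sets used in the paper are not bundled here.
X, y = load_breast_cancer(return_X_y=True)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
gpc_err, gpc_info, svm_err, svm_info = [], [], [], []

for train, test in outer.split(X, y):
    # GPC with a squared-exponential covariance (signal variance times RBF);
    # fit() maximises the Laplace approximation to log p(D|theta), i.e. ML-II.
    gpc = GaussianProcessClassifier(kernel=ConstantKernel() * RBF(),
                                    n_restarts_optimizer=1).fit(X[train], y[train])
    p = gpc.predict_proba(X[test])[np.arange(len(test)), y[test]]
    gpc_err.append(np.mean(gpc.predict(X[test]) != y[test]))
    gpc_info.append(np.mean(np.log2(p)))      # rough proxy, not the paper's I (2)

    # SVM baseline: C and the kernel width chosen by an inner 5-fold grid search;
    # probability=True calibrates the SVM outputs to [0, 1] with a sigmoid.
    grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-4, 1, 6)}
    svm = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=5).fit(X[train], y[train])
    p = svm.predict_proba(X[test])[np.arange(len(test)), y[test]]
    svm_err.append(np.mean(svm.predict(X[test]) != y[test]))
    svm_info.append(np.mean(np.log2(p)))

print("GPC (Laplace): error %.2f%%, mean log2 p(true class) %.3f"
      % (100 * np.mean(gpc_err), np.mean(gpc_info)))
print("SVM:           error %.2f%%, mean log2 p(true class) %.3f"
      % (100 * np.mean(svm_err), np.mean(svm_info)))
```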
The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation; the discrepancies are similar across tasks. Using AIS we were able to demonstrate the accuracy of the marginal likelihood estimates, which to the best of our knowledge has not been done before.

In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].

Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. MIT Press, 2000.