jmlr jmlr2008 jmlr2008-7 knowledge-graph by maker-knowledge-mining

7 jmlr-2008-A Tutorial on Conformal Prediction


Source: pdf

Author: Glenn Shafer, Vladimir Vovk

Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Starting from the method for point prediction, we construct a nonconformity measure, which measures how unusual an example looks relative to previous examples, and the conformal algorithm turns this nonconformity measure into prediction regions. [sent-41, score-1.15]

2 Given a nonconformity measure, the conformal algorithm produces a prediction region Γε for every probability of error ε. [sent-42, score-0.814]
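
To make these two steps concrete, here is a minimal Python sketch of full conformal classification: for each candidate label, the new example is provisionally added to the bag, nonconformity scores are computed for every example, and the label is kept whenever the fraction of scores at least as large as the new example's exceeds ε. The helper names (nonconformity, conformal_region) and the 1-nearest-neighbor score are illustrative choices, not code from the paper.

    import numpy as np

    def nonconformity(bag_x, bag_y, x, y):
        """Illustrative 1-nearest-neighbor score: distance to the nearest old
        example with the same label over distance to the nearest old example
        with a different label (larger = more unusual)."""
        bag_x, bag_y = np.asarray(bag_x, dtype=float), np.asarray(bag_y)
        d = np.abs(bag_x - x)
        same, other = d[bag_y == y], d[bag_y != y]
        if len(same) == 0 or len(other) == 0:
            return np.inf
        return same.min() / max(other.min(), 1e-12)

    def conformal_region(old_x, old_y, new_x, labels, eps):
        """Return every label whose conformal p-value exceeds eps."""
        region = []
        for y in labels:
            xs, ys = list(old_x) + [new_x], list(old_y) + [y]  # provisional label y
            scores = [nonconformity(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:],
                                    xs[i], ys[i]) for i in range(len(xs))]
            p = sum(s >= scores[-1] for s in scores) / len(scores)
            if p > eps:
                region.append(y)
        return region

For instance, conformal_region(sepal_lengths, species, 6.8, ['s', 'v'], 0.05) would return the species not rejected at the 5% level (the value 6.8 and the label codes are made up for illustration).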

3 The confidence for a 95% conformal prediction region is valid under exchangeability, no matter what probability distribution the examples follow and no matter what nonconformity measure is used to construct the conformal prediction region. [sent-87, score-1.37]

4 The efficiency of conformal prediction will depend on the probability distribution and the nonconformity measure. [sent-89, score-0.773]

5 Fisher, who explained how to give a 95% prediction interval for zn based on z1 , . [sent-136, score-0.753]

6 But the simplicity of the set-up where we predict zn from z1 , . [sent-147, score-0.664]

7 The natural point predictor for zn is the average so far, z̄n−1 := (1/(n − 1)) ∑_{i=1}^{n−1} zi, but we want to give an interval that will contain zn 95% of the time. [sent-161, score-1.438]

8 Fisher based this procedure on the fact that (zn − z̄n−1) / (sn−1 √(n/(n − 1))) has the t-distribution with n − 2 degrees of freedom, which is symmetric about 0. [sent-174, score-0.66]
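
As a concrete illustration of interval (1), a small sketch assuming scipy is available; the function name fisher_interval is ours, not the paper's:

    import numpy as np
    from scipy.stats import t

    def fisher_interval(z, alpha=0.05):
        """95% (for alpha=0.05) prediction interval for the next observation,
        given the first n-1 observations z, following formula (1)."""
        z = np.asarray(z, dtype=float)
        m = len(z)                            # m = n - 1 in the paper's notation
        zbar, s = z.mean(), z.std(ddof=1)     # z-bar_{n-1} and s_{n-1}
        q = t.ppf(1 - alpha / 2, df=m - 1)    # t quantile with n - 2 d.o.f.
        half = q * s * np.sqrt((m + 1) / m)   # sqrt(n / (n - 1)) with n = m + 1
        return zbar - half, zbar + half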

9 This implies that (1) will contain zn with probability 95% regardless of the values of the mean and variance. [sent-175, score-0.642]

10 The events z̄n−1 − tn−2^{0.025} sn−1 √(n/(n − 1)) ≤ zn ≤ z̄n−1 + tn−2^{0.025} sn−1 √(n/(n − 1)) (2) for successive n are probabilistically independent in spite of the overlap. [sent-217, score-0.671]

11 Before any of the zi are observed, the event (2) involves multiple uncertainties: z̄n−1, sn−1, and zn are all uncertain. [sent-242, score-0.746]

12 It is only after we observe the first n − 1 examples that we calculate z̄n−1 and sn−1 and then calculate the interval (1), and we would like to be able to say at this point that there is still a 95% probability that zn will be in (1). [sent-246, score-0.762]

13 For any n, 1 ≤ n ≤ N, zn is independent of zn+1 , . [sent-526, score-0.642]

14 , zn and for any bag B of size n, Pr (zn = a | z1 , . [sent-532, score-0.833]

15 , zn = B) = k/n, (4) where k is the number of times a occurs in B. [sent-535, score-0.642]

16 , zn symbolize drawing an example zn out at random, leaving the smaller bag z1 , . [sent-589, score-1.475]

17 , zn , and for simplicity, we set the initial capital KN equal to $1. [sent-605, score-0.695]

18 • An event E is an n-event, where 1 ≤ n ≤ N, if its happening or failing is determined by the value of zn and the value of the bag z1 , . [sent-637, score-0.862]

19 Conformal Prediction under Exchangeability. We are now in a position to state the conformal algorithm under exchangeability and explain why it produces valid nested prediction regions. [sent-683, score-0.657]

20 For some readers, the simplicity of the conformal algorithm may be obscured by its generality and the scope of our preliminary discussion of nonconformity measures. [sent-723, score-0.699]

21 Nonconformity Measures. The starting point for conformal prediction is what we call a nonconformity measure, a real-valued function A(B, z) that measures how different an example z is from the examples in a bag B. [sent-733, score-0.995]

22 The conformal algorithm assumes that a nonconformity measure has been chosen. [sent-734, score-0.719]

23 A method ẑ(B) for obtaining a point prediction ẑ for a new example from a bag B of old examples usually leads naturally to a nonconformity measure A. [sent-738, score-0.717]
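
A hedged sketch of this recipe: any point predictor ẑ(B) can be wrapped into a nonconformity measure by scoring an example by its distance from the prediction. The helper name make_nonconformity is illustrative.

    def make_nonconformity(zhat):
        """Turn a point predictor zhat(bag) into A(bag, z) = |zhat(bag) - z|."""
        def A(bag, z):
            return abs(zhat(bag) - z)
        return A

    # e.g., using the bag average as the point predictor:
    A_mean = make_nonconformity(lambda bag: sum(bag) / len(bag))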

24 The prediction regions produced by the conformal algorithm do not change when the nonconformity measure A is transformed monotonically. [sent-742, score-0.84]

25 For simplicity in stating this algorithm, we provisionally use the symbol zn for z, as if we were assuming that zn is in fact equal to z. [sent-799, score-1.313]

26 The event that our nth prediction is an error, zn ∉ γε(z1 , . [sent-872, score-0.745]

27 This is an n-event, because the value of pzn is determined by zn and the bag z1 , . [sent-876, score-0.833]

28 Under the hypothesis that z is the actual value of zn , these 20 numbers are exchangeable. [sent-908, score-0.74]

29 , zn \ zi , zi ), (15) apparently signaling that we do not want to include zi in the bag to which it is compared. [sent-925, score-1.058]

30 , zn , zi ) (16) in the conformal algorithm and using A(B, z) := |z̄B − z|. [sent-930, score-1.081]
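
Using A(B, z) := |z̄B − z| as in (16), a prediction interval for the next number can be computed by scanning candidate values of z over a grid, which is one simple (if inefficient) way to realize the conformal algorithm; the grid choice and the function name are assumptions of this sketch.

    import numpy as np

    def conformal_interval(old_z, eps=0.05, grid=None):
        old_z = np.asarray(old_z, dtype=float)
        if grid is None:                       # crude default grid around the data
            span = old_z.max() - old_z.min()
            grid = np.linspace(old_z.min() - span, old_z.max() + span, 2001)
        accepted = []
        for z in grid:
            bag = np.append(old_z, z)          # provisionally treat z as z_n
            scores = np.abs(bag.mean() - bag)  # alpha_i = |z-bar_B - z_i|
            if np.mean(scores >= scores[-1]) > eps:
                accepted.append(z)
        return (min(accepted), max(accepted)) if accepted else None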

31 The two ways of defining nonconformity scores, (15) and (16), are equivalent, inasmuch as whatever we can get with one of them we can get from the other by changing the nonconformity measure. [sent-932, score-0.695]

32 For simplicity in stating this algorithm, we provisionally use the symbol zn for (xn , y), as if we were assuming that yn is in fact equal to y. [sent-957, score-0.758]

33 , zn−1 ) ⊆ Z, (17) contains the new example zn = (xn , yn ) at least 95% of the time. [sent-1000, score-0.729]

34 We calculate conformal prediction regions using three different nonconformity measures: one based on distance to the nearest neighbors, one based on distance to the species average, and one based on a support-vector machine. [sent-1029, score-0.909]

35 Table 2: Conformal prediction of iris species from sepal length, using three different nonconformity measures. [sent-1160, score-0.646]

36 For each nonconformity measure, we calculate nonconformity scores under each hypothesis, y25 = s and y25 = v. [sent-1166, score-0.721]

37 A close look at the nonconformity scores reveals that it is being perceived as unusual simply because 2/3 of the plants have other plants in the sample with exactly the same sepal length, whereas there is no other plant with the sepal length 6. [sent-1192, score-0.827]

38 . . . defined by A(B, (x, y)) = |x̄B∪(x,y),y − x|, (20) where x̄B,y denotes the average sepal length of all plants of species y in the bag B, and B ∪ z denotes the bag obtained by adding z to B. [sent-1232, score-0.665]
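
A small sketch of the measure in (20), with the bag represented as a list of (sepal_length, species) pairs; the representation and the function name are our assumptions.

    def species_average_nonconformity(bag, example):
        """A(B, (x, y)) = |x-bar_{B u (x,y), y} - x|: distance from x to the
        average sepal length of species y after adding the example to the bag."""
        x, y = example
        lengths = [xi for xi, yi in bag + [example] if yi == y]
        return abs(sum(lengths) / len(lengths) - x)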

39 The fourth column of Table 4 gives the nonconformity scores for our sample using this nonconformity measure. [sent-1563, score-0.695]

40 . . . finding conformal prediction intervals using a least-squares or ridge-regression nonconformity measure with an object space of any finite dimension. [sent-1631, score-0.817]

41 , zn are exchangeable, Pr{zn ∈ γε( z1 , . [sent-1649, score-0.642]

42 But as we now demonstrate, any γ satisfying them can be improved on by a conformal predictor: there always exists a nonconformity measure A such that the predictor γA constructed from A by the conformal algorithm satisfies . . . [sent-1658, score-1.125]

43 , an has an equal probability of being zn , and in this case, (22) is a mistake. [sent-1681, score-0.642]

44 Given the region predictor γ, what nonconformity measure will give us a conformal predictor that improves on it? [sent-1838, score-0.844]

45 The conformal prediction intervals using the least-squares nonconformity measure are quite close to the standard intervals based on least-squares with normal errors. [sent-1855, score-0.841]
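
A hedged sketch of a conformal interval with a least-squares nonconformity measure: the score of an example is its absolute residual under the least-squares fit to the bag that includes it, and candidate values of yn are scanned over a grid. The grid-based approach and the names are illustrative; the efficient ridge-regression version is worked out in Vovk et al. (2005).

    import numpy as np

    def conformal_ls_interval(X, y, x_new, eps=0.05, y_grid=None):
        """X: (n, p) array of objects, y: length-n labels, x_new: length-p object."""
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        if y_grid is None:
            span = y.max() - y.min()
            y_grid = np.linspace(y.min() - 2 * span, y.max() + 2 * span, 2001)
        Xa = np.column_stack([np.ones(len(X) + 1), np.vstack([X, x_new])])
        accepted = []
        for y_cand in y_grid:
            ya = np.append(y, y_cand)                  # provisional label for x_new
            beta, *_ = np.linalg.lstsq(Xa, ya, rcond=None)
            scores = np.abs(ya - Xa @ beta)            # absolute residuals
            if np.mean(scores >= scores[-1]) > eps:
                accepted.append(y_cand)
        return (min(accepted), max(accepted)) if accepted else None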

46 The conformal predictor γA obtained from this nonconformity measure, though it agrees with γ on how to rank different z with respect to their nonconformity with B, may produce tighter prediction regions if γ is too conservative in the levels of confidence . . . [sent-1858, score-1.197]

47 According to the conformal algorithm, (24) means that when we provisionally set zn equal to z and calculate the nonconformity scores αi = sup{1 − δ | zi ∉ γδ( z1 , . [sent-1867, score-1.496]

48 , 2005), it is used to illustrate conformal prediction with a number of different nonconformity measures. [sent-1905, score-0.773]

49 The figure records the performance of the 95% conformal predictor using the nearest-neighbor nonconformity measure (10), applied to the USPS data in two ways. [sent-1916, score-0.761]

50 On-Line Compression Models. In this section, we generalize conformal prediction from the exchangeability model to a whole class of models, which we call on-line compression models. [sent-1936, score-0.696]

51 , zn | σn ), where σn summarizes the examples z1 , . [sent-1954, score-0.673]

52 , zn by z̄ = (1/n) ∑_{i=1}^{n} zi and r² = ∑_{i=1}^{n} (zi − z̄)² and gives these summaries to Bill, who does not know z1 , . [sent-1967, score-0.717]

53 , zn that is uniform over the possibilities, which form the (n − 1)-dimensional sphere of radius r centered around (z̄, . [sent-1974, score-0.692]

54 , zn are exchangeable for all n, then the distribution of z1 , z2 , . [sent-1994, score-0.711]

55 For the exchangeability model, conformal prediction is optimal for obtaining prediction regions (§4. [sent-2008, score-0.75]

56 , zn ) to the bag containing these examples: Σn (z1 , . [sent-2040, score-0.833]

57 Given the bag σn , what are the probabilities for zn and σn−1 ? [sent-2105, score-0.858]

58 They are the same as if we drew zn out of σn at random. [sent-2106, score-0.642]

59 In other words, for each z that appears in σn , there is a probability k/n, where k is the number of times z appears in σn , that (1) zn = z and (2) σn−1 is the bag obtained by removing one instance of z from σn . [sent-2107, score-0.833]

60 , zn , which has the distribution Pn (· | σn ). [sent-2122, score-0.642]

61 In the case of the exchangeability model, we obtain the bag σi by adding the new example zi to the old bag σi−1 . [sent-2134, score-0.714]

62 [Figure: the examples z1 , z2 , . . . , zn feeding into the successive summaries σ1 , σ2 , . . . , σn .] Backward probabilities. [sent-2135, score-0.642]

63 First we draw zn and σn−1 from Rn (· | σn ), then we draw zn−1 and σn−2 from Rn−1 (· | σn−1 ), and so on. [sent-2180, score-0.642]

64 , zn obtained in this way has the distribution Pn (· | σn ). [sent-2184, score-0.642]

65 , zn with the distribution Pn (· | σn ). [sent-2213, score-0.642]

66 Another way to verify it, without necessarily exhibiting the one-step kernels, is to verify the conditional independence relations represented by Figure 8: zn (and hence also σn ) is probabilistically independent of z1 , . [sent-2214, score-0.642]

67 In order to state the conformal algorithm, we write σ̃n−1 and z̃n for random variables with a joint probability distribution given by the one-step kernel R(· | σn ). [sent-2221, score-1.006]

68 Set pz := Rn (A(σ̃n−1 , z̃n ) ≥ A(σ̃n−1 , zn ) | σn ). [sent-2233, score-1.311]

69 , zn \ z̃n in that model, so that A(σ̃n−1 , z̃n ) = A( z1 , . [sent-2246, score-1.926]

70 , zn \ z̃n , z̃n ), (25) and A(σ̃n−1 , zn ) = A( z1 , . [sent-2249, score-2.568]

71 , zn ), the random variable z̃n has equal chances of being any of the zi , so that the probability of (25) being greater than or equal to (26) is simply the fraction of the zi for which A( z1 , . [sent-2256, score-1.359]

72 , zn−1 , zn ), and this is how pz is defined. [sent-2262, score-0.669]

73 Set py := Rn (A(σ̃n−1 , z̃n ) ≥ A(σ̃n−1 , zn ) | σn ). [sent-2300, score-1.284]

74 The graph in the bottom panel shows the results of the exchangeability-within-label conformal predictor using the same nearest-neighbor nonconformity measure; here the error rate stays close to 5%. [sent-2374, score-0.741]

75 , zN , of the form zn = (xn , yn ), where yn is a number and xn is a row vector consisting of p numbers. [sent-2390, score-0.902]

76 , zn | σn ) distributes its probability uniformly over the set of vectors (y1 , . [sent-2422, score-0.642]

77 , zn gives Pi (· | σi ), we note that conditioning the uniform distribution on a sphere on values yi+1 = ai+1 , . [sent-2465, score-0.692]

78 Proposition 2 When (30) is used as the nonconformity measure, the 1 − ε conformal prediction region for yn is (29), the interval given by the t-distribution in the classical theory. [sent-2528, score-0.961]

79 Proof When (30) is used as the nonconformity measure, the test statistic A(σn−1 , zn ) used in the conformal algorithm becomes |yn − ŷn |. [sent-2529, score-1.428]

80 4 to establish the validity of conformal prediction in the exchangeability model. [sent-2565, score-0.663]

81 , zn , but the equation says that it is not really random, for it is always equal to ε. [sent-2601, score-0.659]

82 Just before he observes zn , he knows the bag z1 , . [sent-2665, score-0.833]

83 , zn and can bet on the value of zn at odds corresponding to the probabilities the bag determines: Pr (zn = a | z1 , . [sent-2668, score-1.552]

84 , zn = B) = k/n, (33) where k is the number of times a occurs in B. [sent-2671, score-0.642]

85 , zn , zn+1 are independent normal random variables with mean 0 and standard deviation 1, the distribution of the ratio zn+1 / √((1/n) ∑_{i=1}^{n} zi²) (41) is called the t-distribution with n degrees of freedom. [sent-2738, score-0.66]

86 Proof It is straightforward to verify that zn − z̄n−1 = (n/(n − 1))(zn − z̄n) (44) and s²n−1 = ((n − 1)s²n)/(n − 2) − (n(zn − z̄n)²)/((n − 1)(n − 2)). (45) [sent-2767, score-1.926]
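
A quick numerical check of the identities (44) and (45), assuming s²m denotes the usual sample variance of the first m values (divisor m − 1); the script is illustrative, not from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=10)
    n = len(z)
    zbar_n, zbar_nm1 = z.mean(), z[:-1].mean()
    s2_n, s2_nm1 = z.var(ddof=1), z[:-1].var(ddof=1)

    # (44): z_n - zbar_{n-1} = n (z_n - zbar_n) / (n - 1)
    assert np.isclose(z[-1] - zbar_nm1, n * (z[-1] - zbar_n) / (n - 1))
    # (45): s^2_{n-1} = (n-1) s^2_n / (n-2) - n (z_n - zbar_n)^2 / ((n-1)(n-2))
    assert np.isclose(s2_nm1, (n - 1) * s2_n / (n - 2)
                      - n * (z[-1] - zbar_n) ** 2 / ((n - 1) * (n - 2)))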

87 The value of (47) is unaffected if all the zi − z̄n are multiplied by a nonzero constant. [sent-2769, score-0.717]

88 Proof Given zn , we can calculate z̄n−1 and sn−1 from (44) and (45) and then calculate tn from (42). [sent-2774, score-0.846]

89 Given z̄n−1 and sn−1 , we can calculate zn from (44) or (45) and then tn from (42). [sent-2775, score-0.744]

90 . . . fixed, this equation expresses tn as a monotonically increasing function of zn ) and then calculate z̄n−1 and sn−1 from (44) and (45). [sent-2778, score-0.744]

91 , zn are independent and normal with a common mean and standard deviation, then conditional on z̄n = w and ∑_{i=1}^{n} (zi − z̄n)² = r², the vector (z1 , . [sent-2786, score-1.926]

92 , zn ) is distributed uniformly over the (n − 2)-dimensional sphere of vectors satisfying these conditions (the intersection of the hyperplane z̄n = w with the (n − 1)-dimensional sphere of radius r centered on the point (w, . [sent-2789, score-1.401]

93 , zn ) only through z̄n and ∑_{i=1}^{n} (zi − z̄n)², the distribution of (z1 , . [sent-2800, score-1.284]

94 , zn ) conditional on z̄n = w and ∑_{i=1}^{n} (zi − z̄n)² = r² is uniform over the set of vectors satisfying these conditions. [sent-2803, score-1.926]

95 , zn ) is distributed uniformly over the (n − 2)-dimensional sphere defined . . . [sent-2807, score-0.692]

96 . . . defined by the conditions z̄n = w and ∑_{i=1}^{n} (zi − z̄n)² = r², then tn has the t-distribution with n − 2 degrees of freedom. [sent-2808, score-1.378]

97 Lemma 8 says that conditional on z̄n = w and (n − 1)s²n = r², the vector (z1 , . [sent-2815, score-0.659]

98 , zn ) is distributed uniformly over the sphere of radius r centered on (w, . [sent-2818, score-0.692]

99 . . . defined by the conditions z̄n = w and ∑_{i=1}^{n} (zi − z̄n)² = r². [sent-2827, score-1.284]

100 , zn ) is distributed uniformly over an (n − 1)-dimensional sphere in Rn , and so tn has the t-distribution with n − 2 degrees of freedom by Lemma 9. [sent-2857, score-0.803]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('zn', 0.642), ('conformal', 0.364), ('nonconformity', 0.335), ('vovk', 0.2), ('exchangeability', 0.191), ('bag', 0.191), ('sepal', 0.155), ('bill', 0.118), ('hafer', 0.102), ('onformal', 0.102), ('utorial', 0.102), ('yn', 0.087), ('xn', 0.086), ('tn', 0.076), ('zi', 0.075), ('prediction', 0.074), ('rediction', 0.074), ('exchangeable', 0.069), ('compression', 0.067), ('old', 0.066), ('species', 0.063), ('en', 0.061), ('capital', 0.053), ('joe', 0.053), ('petal', 0.053), ('law', 0.052), ('sphere', 0.05), ('shafer', 0.049), ('hits', 0.048), ('regions', 0.047), ('bn', 0.047), ('plants', 0.045), ('fisher', 0.042), ('predictor', 0.042), ('versicolor', 0.042), ('region', 0.041), ('dence', 0.039), ('kn', 0.039), ('setosa', 0.038), ('interval', 0.037), ('betting', 0.037), ('validity', 0.034), ('sn', 0.033), ('bet', 0.033), ('glenn', 0.033), ('pr', 0.033), ('examples', 0.031), ('successive', 0.029), ('label', 0.029), ('summary', 0.029), ('event', 0.029), ('freqn', 0.029), ('neyman', 0.029), ('provisionally', 0.029), ('valid', 0.028), ('pn', 0.028), ('pz', 0.027), ('calculate', 0.026), ('scores', 0.025), ('probabilities', 0.025), ('rn', 0.025), ('book', 0.025), ('informal', 0.025), ('plant', 0.025), ('inasmuch', 0.025), ('intervals', 0.024), ('numbers', 0.023), ('uncertain', 0.023), ('classical', 0.023), ('events', 0.022), ('predict', 0.022), ('yi', 0.022), ('unusual', 0.022), ('backwards', 0.021), ('kernels', 0.021), ('mutually', 0.021), ('bettor', 0.02), ('credibility', 0.02), ('gammerman', 0.02), ('length', 0.02), ('measure', 0.02), ('picture', 0.019), ('summarizing', 0.019), ('predictions', 0.019), ('width', 0.019), ('proposition', 0.019), ('odds', 0.019), ('iris', 0.019), ('errors', 0.018), ('degrees', 0.018), ('vladimir', 0.018), ('normally', 0.018), ('hyperplane', 0.017), ('bl', 0.017), ('con', 0.017), ('says', 0.017), ('freedom', 0.017), ('announces', 0.016), ('bets', 0.016), ('cournot', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 7 jmlr-2008-A Tutorial on Conformal Prediction

Author: Glenn Shafer, Vladimir Vovk

Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions

2 0.15329947 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

Author: Bernadetta Tarigan, Sara A. van de Geer

Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy

3 0.12302469 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix

Author: Xing Sun, Andrew B. Nobel

Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance

4 0.062787607 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

Author: Bo Jiang, Xuegong Zhang, Tianxi Cai

Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine

5 0.051215932 52 jmlr-2008-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning

6 0.037978396 9 jmlr-2008-Active Learning by Spherical Subdivision

7 0.032847118 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters

8 0.032082174 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

9 0.03176643 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

10 0.030173432 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

11 0.028591441 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

12 0.028210035 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds

13 0.027991045 56 jmlr-2008-Magic Moments for Structured Output Prediction

14 0.027659804 57 jmlr-2008-Manifold Learning: The Price of Normalization

15 0.027446643 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

16 0.026499411 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

17 0.024597196 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

18 0.022333341 92 jmlr-2008-Universal Multi-Task Kernels

19 0.021164952 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data

20 0.019645872 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.131), (1, -0.051), (2, -0.025), (3, -0.071), (4, 0.047), (5, -0.008), (6, 0.112), (7, -0.177), (8, 0.05), (9, -0.173), (10, 0.12), (11, -0.005), (12, -0.175), (13, -0.012), (14, 0.272), (15, 0.009), (16, 0.006), (17, 0.202), (18, 0.349), (19, -0.269), (20, 0.06), (21, -0.015), (22, 0.183), (23, 0.088), (24, -0.0), (25, 0.012), (26, 0.035), (27, -0.04), (28, -0.089), (29, 0.088), (30, 0.008), (31, 0.04), (32, 0.002), (33, 0.074), (34, 0.099), (35, -0.022), (36, -0.039), (37, 0.111), (38, -0.091), (39, 0.075), (40, 0.03), (41, 0.071), (42, 0.002), (43, 0.036), (44, -0.051), (45, -0.028), (46, 0.061), (47, -0.019), (48, -0.004), (49, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97206438 7 jmlr-2008-A Tutorial on Conformal Prediction

Author: Glenn Shafer, Vladimir Vovk

Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions

2 0.70599455 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix

Author: Xing Sun, Andrew B. Nobel

Abstract: Binary matrices, and their associated submatrices of 1s, play a central role in the study of random bipartite graphs and in core data mining problems such as frequent itemset mining (FIM). Motivated by these connections, this paper addresses several statistical questions regarding submatrices of 1s in a random binary matrix with independent Bernoulli entries. We establish a three-point concentration result, and a related probability bound, for the size of the largest square submatrix of 1s in a square Bernoulli matrix, and extend these results to non-square matrices and submatrices with fixed aspect ratios. We then consider the noise sensitivity of frequent itemset mining under a simple binary additive noise model, and show that, even at small noise levels, large blocks of 1s leave behind fragments of only logarithmic size. As a result, standard FIM algorithms, which search only for submatrices of 1s, cannot directly recover such blocks when noise is present. On the positive side, we show that an error-tolerant frequent itemset criterion can recover a submatrix of 1s against a background of 0s plus noise, even when the size of the submatrix of 1s is very small. 1 Keywords: frequent itemset mining, bipartite graph, biclique, submatrix of 1s, statistical significance

3 0.55109555 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

Author: Bernadetta Tarigan, Sara A. van de Geer

Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy

4 0.23453447 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

Author: Bo Jiang, Xuegong Zhang, Tianxi Cai

Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine

5 0.18204342 52 jmlr-2008-Learning from Multiple Sources

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields general results for classification and regression. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. We discuss the related problem of learning parameters of a distribution from multiple data sources. Finally, we illustrate our theory through a series of synthetic simulations. Keywords: error bounds, multi-task learning

6 0.1757056 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)

7 0.16838801 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

8 0.15402521 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters

9 0.12549466 57 jmlr-2008-Manifold Learning: The Price of Normalization

10 0.12425321 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

11 0.12381221 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations

12 0.11851662 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

13 0.11475721 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding

14 0.11284033 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

15 0.11227731 56 jmlr-2008-Magic Moments for Structured Output Prediction

16 0.11189722 9 jmlr-2008-Active Learning by Spherical Subdivision

17 0.11158295 96 jmlr-2008-Visualizing Data using t-SNE

18 0.1113678 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

19 0.11106788 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

20 0.10977075 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (31, 0.017), (40, 0.034), (54, 0.03), (58, 0.058), (60, 0.017), (66, 0.055), (76, 0.028), (88, 0.05), (92, 0.034), (94, 0.028), (99, 0.504)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89175463 7 jmlr-2008-A Tutorial on Conformal Prediction

Author: Glenn Shafer, Vladimir Vovk

Abstract: Conformal prediction uses past experience to determine precise levels of confidence in new predictions. Given an error probability ε, together with a method that makes a prediction ŷ of a label y, it produces a set of labels, typically containing ŷ, that also contains y with probability 1 − ε. Conformal prediction can be applied to any method for producing ŷ: a nearest-neighbor method, a support-vector machine, ridge regression, etc. Conformal prediction is designed for an on-line setting in which labels are predicted successively, each one being revealed before the next is predicted. The most novel and valuable feature of conformal prediction is that if the successive examples are sampled independently from the same distribution, then the successive predictions will be right 1 − ε of the time, even though they are based on an accumulating data set rather than on independent data sets. In addition to the model under which successive examples are sampled independently, other on-line compression models can also use conformal prediction. The widely used Gaussian linear model is one of these. This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. A more comprehensive treatment of the topic is provided in Algorithmic Learning in a Random World, by Vladimir Vovk, Alex Gammerman, and Glenn Shafer (Springer, 2005). Keywords: confidence, on-line compression modeling, on-line learning, prediction regions

2 0.86795008 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

Author: Matthias W. Seeger

Abstract: We propose a highly efficient framework for penalized likelihood kernel methods applied to multiclass models with a large, structured set of classes. As opposed to many previous approaches which try to decompose the fitting problem into many smaller ones, we focus on a Newton optimization of the complete model, making use of model structure and linear conjugate gradients in order to approximate Newton search directions. Crucially, our learning method is based entirely on matrix-vector multiplication primitives with the kernel matrices and their derivatives, allowing straightforward specialization to new kernels, and focusing code optimization efforts to these primitives only. Kernel parameters are learned automatically, by maximizing the cross-validation log likelihood in a gradient-based way, and predictive probabilities are estimated. We demonstrate our approach on large scale text classification tasks with hierarchical structure on thousands of classes, achieving state-of-the-art results in an order of magnitude less time than previous work. Parts of this work appeared in the conference paper Seeger (2007). Keywords: multi-way classification, kernel logistic regression, hierarchical classification, cross validation optimization, Newton-Raphson optimization

3 0.49145404 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

4 0.37564236 16 jmlr-2008-Approximations for Binary Gaussian Process Classification

Author: Hannes Nickisch, Carl Edward Rasmussen

Abstract: We provide a comprehensive overview of many recent algorithms for approximate inference in Gaussian process models for probabilistic binary classification. The relationships between several approaches are elucidated theoretically, and the properties of the different algorithms are corroborated by experimental results. We examine both 1) the quality of the predictive distributions and 2) the suitability of the different marginal likelihood approximations for model selection (selecting hyperparameters) and compare to a gold standard based on MCMC. Interestingly, some methods produce good predictive distributions although their marginal likelihood approximations are poor. Strong conclusions are drawn about the methods: The Expectation Propagation algorithm is almost always the method of choice unless the computational budget is very tight. We also extend existing methods in various ways, and provide unifying code implementing all approaches. Keywords: Gaussian process priors, probabilistic classification, Laplaces’s approximation, expectation propagation, variational bounding, mean field methods, marginal likelihood evidence, MCMC

5 0.34408849 83 jmlr-2008-Robust Submodular Observation Selection

Author: Andreas Krause, H. Brendan McMahan, Carlos Guestrin, Anupam Gupta

Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. For example, in environmental monitoring, we want to select locations to measure in order to most effectively predict spatial phenomena. Often, we want to select observations which are robust against a number of possible objective functions. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for cases where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We show how our algorithm can be extended to handle complex cost functions (incorporating non-unit observation cost or communication and path costs). We also show how the algorithm can be used to near-optimally trade off expected-case (e.g., the Mean Square Prediction Error in Gaussian Process regression) and worst-case (e.g., maximum predictive variance) performance. We show that many important machine learning problems fit our robust submodular observation selection formalism, and provide extensive empirical evaluation on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms. Keywords: observation selection, experimental design, active learning, submodular functions, Gaussi

6 0.34056026 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

7 0.34019691 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

8 0.33726868 86 jmlr-2008-SimpleMKL

9 0.3324303 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2

10 0.33112776 56 jmlr-2008-Magic Moments for Structured Output Prediction

11 0.33079189 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

12 0.32824263 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

13 0.32764217 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

14 0.32721254 58 jmlr-2008-Max-margin Classification of Data with Absent Features

15 0.32441977 57 jmlr-2008-Manifold Learning: The Price of Normalization

16 0.32348388 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

17 0.32253236 72 jmlr-2008-On the Size and Recovery of Submatrices of Ones in a Random Binary Matrix

18 0.32187778 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

19 0.32140401 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

20 0.31647351 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function    (Special Topic on Model Selection)