nips nips2003 nips2003-170 knowledge-graph by maker-knowledge-mining

170 nips-2003-Self-calibrating Probability Forecasting


Source: pdf

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. [sent-9, score-0.512]

2 An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. [sent-10, score-0.469]

3 We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). [sent-11, score-0.055]

4 In this introductory section we will assume that yn ∈ {0, 1}; we can then take pn to be a real number from the interval [0, 1] (the probability that yn = 1 given xn ); our exposition here will be very informal. [sent-16, score-1.551]

5 One view (see pp. 213–216 of the cited reference) is that the quality of probability forecasting systems has two components: “reliability” and “resolution”. [sent-18, score-0.412]

6 At the crudest level, reliability requires that the forecasting system should not lie, and resolution requires that it should say something useful. [sent-19, score-0.41]

7 To be slightly more precise, consider the first n forecasts pi and the actual labels yi . [sent-20, score-0.259]

8 The most basic test is to compare the overall average forecast probability p̄n := (p1 + · · · + pn)/n with the overall relative frequency ȳn := (y1 + · · · + yn)/n of 1s among the yi. [sent-21, score-0.797]

9 If p̄n ≈ ȳn, the forecasts are “unbiased in the large”. [sent-22, score-0.529]

10 A more refined test would look at the subset of i for which pi is close to a given value p∗, and compare the relative frequency of yi = 1 in this subset, say ȳn(p∗), with p∗. [sent-23, score-0.12]

11 If ȳn(p∗) ≈ p∗ for all p∗ (1), the forecasts are “unbiased in the small”, “reliable”, “valid”, or “well-calibrated”; in later sections, we will use “well-calibrated”, or just “calibrated”, as a technical term. [sent-24, score-0.124]
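
As a concrete illustration of these two tests (our own sketch, not part of the paper: the function name, the equal-width binning, and the use of NumPy are our choices), the snippet below computes the overall averages p̄n and ȳn and, for each bin of forecast values, the empirical frequency ȳn(p∗) to be compared with p∗.

```python
import numpy as np

def calibration_check(p, y, num_bins=10):
    """Compare forecasts with outcomes: 'unbiased in the large' compares the overall
    average forecast with the overall frequency of 1s; 'unbiased in the small' groups
    the forecasts p_i that are close to a common value p* (here via equal-width bins)
    and compares each group's average forecast with its empirical frequency of 1s."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    overall = (p.mean(), y.mean())                       # these should be close
    bins = np.minimum((p * num_bins).astype(int), num_bins - 1)
    per_bin = []
    for b in range(num_bins):
        mask = bins == b
        if mask.any():                                   # (mean forecast p*, frequency, group size)
            per_bin.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    return overall, per_bin
```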

12 It is easy to see that there are reliable forecasting systems that are virtually useless. [sent-26, score-0.346]

13 For example, the definition of reliability does not require that the forecasting system pay any attention to the objects xi . [sent-27, score-0.414]

14 In another popular example, the labels follow the pattern yi = 1 if i is odd and yi = 0 otherwise. [sent-28, score-0.088]

15 For these labels, both the constant forecasts pi = 1/2 and the forecasts that track the alternating pattern are well-calibrated; the “resolution” (which we do not define here) of the latter forecasts is better. [sent-34, score-0.124]

16 In this paper we construct forecasting systems that are automatically reliable. [sent-35, score-0.312]

17 To achieve this, we allow our prediction algorithms to output sets of probability measures Pn instead of single measures pn ; typically the sets Pn will be small (see §5). [sent-36, score-0.65]

18 This paper develops the approach of [2–4], which show that it is possible to produce valid, asymptotically optimal, and practically useful p-values; the p-values can then be used for region prediction. [sent-37, score-0.052]

19 Disadvantages of p-values, however, are that their interpretation is less direct than that of probabilities and that they are easy to confuse with probabilities; some authors have even objected to any use of p-values (see, e.g., . . . ). [sent-38, score-0.057]

20 In this paper we use the methodology developed in the previous papers to produce valid probabilities rather than p-values. [sent-41, score-0.057]

21 2 Probability forecasting and calibration. From this section we start the rigorous exposition. [sent-43, score-0.427]

22 Let P(Y) be the set of all probability measures on a measurable space Y. [sent-44, score-0.187]

23 We use the following protocol in this paper: MULTIPROBABILITY FORECASTING. Players: Reality, Forecaster. Protocol: FOR n = 1, 2, . . . : Reality announces xn ∈ X; Forecaster announces Pn ⊆ P(Y); Reality announces yn ∈ Y. [sent-45, score-0.088]

24 In this protocol, Reality generates examples zn = (xn , yn ) ∈ Z := X × Y consisting of two parts, objects xn and labels yn . [sent-51, score-1.343]

25 After seeing the object xn Forecaster is required to output a prediction for the label yn . [sent-52, score-0.843]

26 The usual probability forecasting protocol requires that Forecaster output a probability measure; we relax this requirement by allowing him to output a family of probability measures (and we are interested in the case where the families Pn become smaller and smaller as n grows). [sent-53, score-0.86]

27 It can be shown (we omit the proof and even the precise statement) that it is impossible to achieve automatic well-calibratedness, in our finitary sense, in the probability forecasting protocol. [sent-54, score-0.439]

28 In this paper we make the simplifying assumption that the label space Y is finite; in many informal explanations it will be assumed binary, Y = {0, 1}. [sent-55, score-0.143]

29 A probability machine is a measurable strategy for Forecaster in our protocol, where at each step he is required to output a sequence of K probability measures. [sent-57, score-0.311]

30 The problem of calibration is usually treated in an asymptotic framework. [sent-58, score-0.115]

31 The finitary setting is obtained by replacing n = 1, 2, . . . in the multiprobability forecasting protocol by n = 1, . . . , N. [sent-64, score-0.524]

32 It is clear that, regardless of formalization, we cannot guarantee that miscalibration, in the sense of (1) being violated, will never happen: for typical probability measures, everything can happen, perhaps with a small probability. [sent-68, score-0.126]

33 The idea of our definition is: a prediction algorithm is well-calibrated if any evidence of miscalibration translates into evidence against the assumption of randomness. [sent-69, score-0.041]

34 Therefore, we first need to define ways of testing calibration and randomness; this will be done following [7]. [sent-70, score-0.115]

35 A game N-martingale is a function M on sequences of the form x1, p1, y1, . . . , xn, pn, yn, where n = 0, . . . , N, xi ∈ X, pi ∈ P(Y), and yi ∈ Y, that satisfies M(x1, p1, y1, . . . , xn−1, pn−1, yn−1) = ∫Y M(x1, p1, y1, . . . , xn−1, pn−1, yn−1, xn, pn, y) pn(dy) for all n = 1, . . . , N and all x1, p1, y1, . . . , xn, pn. [sent-71, score-0.089] [sent-77, score-0.093] [sent-83, score-0.768]

38 A calibration N-martingale is a nonnegative game N-martingale that is invariant under permutations: M(x1, p1, y1, . . . , xN, pN, yN) = M(xπ(1), pπ(1), yπ(1), . . . , xπ(N), pπ(N), yπ(N)) for any permutation π of {1, . . . , N}. [sent-90, score-0.204]

39 To cover the multiprobability forecasting protocol, we extend the domain of definition for a calibration N-martingale M from sequences of the form x1, p1, y1, . . . , xN, pN, yN, where p1, . . . , pN are single probability measures on Y, to sequences of the form x1, P1, y1, . . . , xN, PN, yN, where P1, . . . , PN are sets of probability measures on Y, by defining M(x1, P1, y1, . . . , xN, PN, yN) in terms of the values of M at sequences with pn ∈ Pn. [sent-106, score-0.582] [sent-112, score-0.587] [sent-118, score-0.151]

42 A QN-martingale, where Q is a probability measure on Z, is a function S on sequences of the form x1, y1, . . . , xn, yn, where n = 0, . . . , N, xi ∈ X, and yi ∈ Y, that satisfies S(x1, y1, . . . , xn−1, yn−1) = ∫Z S(x1, y1, . . . , xn−1, yn−1, x, y) Q(d(x, y)) for all n = 1, . . . , N and all x1, y1, . . . , xn−1, yn−1. [sent-128, score-0.131] [sent-134, score-0.046]

44 If a QN-martingale S starts with S( ) = 1 and ends with S(x1, y1, . . . , xN, yN) very large, then we may reject Q as the probability measure generating the individual examples (xn, yn). [sent-150, score-0.493]

45 Analogously, if a game N-martingale M starts with M( ) = 1 and ends with M(x1, P1, y1, . . . , xN, PN, yN) very large, then we may reject the hypothesis that each Pn contains the true probability measure for yn. [sent-152, score-0.099] [sent-155, score-0.445]

47 If M is a calibration N -martingale, this event is interpreted as evidence of miscalibration. [sent-156, score-0.115]

48 (The restriction to calibration N -martingales is motivated by the fact that (1) is invariant under permutations). [sent-157, score-0.115]

49 We call a probability machine F N-calibrated if for any probability measure Q on Z and any nonnegative calibration N-martingale M with M( ) = 1, there exists a QN-martingale S with S( ) = 1 such that M(x1, F(x1), y1, . . . , xN, F(x1, y1, . . . , xN), yN) ≤ S(x1, y1, . . . , xN, yN) for all x1, y1, . . . , xN, yN. [sent-158, score-0.419]

50 We say that F is finitarily calibrated if it is N -calibrated for each N . [sent-171, score-0.068]

51 3 Self-calibrating probability forecasting. Now we will describe a general algorithm for multiprobability forecasting. [sent-172, score-0.536]

52 , is called a taxonomy if, for any n ∈ N, any permutation π of {1, . [sent-177, score-0.246]

53 , αn ) (2) is a taxonomy if every αi is determined by the bag z1 , . [sent-205, score-0.265]

54 The Venn probability machine associated with (An) is the probability machine which outputs the following K = |Y| probability measures py, y ∈ Y, at the nth step: complement the new object xn by the postulated label y; consider the division of z1, . . . , zn, where zn is understood (only for the purpose of this definition) to be (xn, y), into groups (formally, bags) according to the values of An (i.e., zi and zj are assigned to the same group if and only if αi = αj, where the αs are defined by (2)); find the empirical distribution py ∈ P(Y) of the labels in the group G containing the nth example zn = (xn, y): py({y′}) := |{(x∗, y∗) ∈ G : y∗ = y′}| / |G|. [sent-210, score-1.007] [sent-213, score-0.447] [sent-215, score-0.439]

57 A Venn probability machine (VPM) is the Venn probability machine associated with some taxonomy. [sent-216, score-0.318]
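
The construction above translates directly into code. The sketch below is our own illustration (the function name and the taxonomy interface are our choices, not the authors'): a taxonomy is passed in as a function that assigns a category to the i-th example of a bag, and the machine returns one empirical distribution py per postulated label y.

```python
from collections import Counter

def venn_probability_machine(train, x_new, labels, taxonomy):
    """Venn probability machine sketch.  `train` is a list of (x, y) pairs, `labels`
    is the finite label set Y, and `taxonomy(examples, i)` returns the category of
    the i-th example given the whole bag `examples` (it must not depend on the order
    of the bag).  Returns one empirical distribution p_y per postulated label y."""
    forecasts = {}
    for y_post in labels:                                   # postulate a label for the new object
        examples = train + [(x_new, y_post)]
        n = len(examples)
        categories = [taxonomy(examples, i) for i in range(n)]
        group = [i for i in range(n) if categories[i] == categories[n - 1]]
        counts = Counter(examples[i][1] for i in group)     # label counts in the group of (x_new, y_post)
        forecasts[y_post] = {lab: counts[lab] / len(group) for lab in labels}
    return forecasts                                        # K = |Y| probability measures
```

Note that the group containing the new example is never empty, since it contains (x_new, y_post) itself; passing the trivial one-category taxonomy `lambda examples, i: 0` recovers the Bernoulli forecast {k/n, (k + 1)/n} discussed in §4 below.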

58 Theorem 1 Any Venn probability machine is finitarily calibrated. [sent-217, score-0.132]

59 It is clear that the VPM depends on the taxonomy only through the way it splits the examples z1, . . . , zn into groups; therefore, we may specify only the latter when constructing specific VPMs. [sent-220, score-0.255] [sent-223, score-0.187]

61 4 Discussion of the Venn probability machine. In this somewhat informal section we will discuss the intuitions behind the VPM, considering only the binary case Y = {0, 1} and treating the probability forecasts pi as elements of [0, 1] rather than P({0, 1}), as in §1. [sent-225, score-0.436]

62 We start with the almost trivial Bernoulli case, where the objects xi are absent (see footnote 2), and our goal is to predict, at each step n = 1, 2, . . . , the new label yn given the previous labels y1, . . . , yn−1. [sent-226, score-0.063] [sent-229, score-0.439]

64 The most naive probability forecast is pn = k/(n − 1), where k is the number of 1s among the first n − 1 labels. [sent-233, score-0.658]

65 In the Bernoulli case there is only one natural VPM: the multiprobability forecast for yn is {k/n, (k+1)/n}. [sent-235, score-0.57]

66 Indeed, since there are no objects xn , it is natural to take the one-element taxonomy An at each step, and this produces the VPM Pn = {k/n, (k + 1)/n}. [sent-236, score-0.633]

67 It is clear that the diameter 1/n of Pn for this VPM is the smallest achievable. [sent-237, score-0.041]
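
A minimal sketch of this Bernoulli-case forecast (our own illustration; the function name is ours, not the paper's):

```python
def bernoulli_vpm(labels_so_far):
    """With the one-element taxonomy every example falls into the same group, so
    postulating y_n = 0 gives the forecast k/n and postulating y_n = 1 gives
    (k + 1)/n, where k is the number of 1s among the first n - 1 labels."""
    k, n = sum(labels_so_far), len(labels_so_far) + 1
    return {k / n, (k + 1) / n}          # multiprobability forecast P_n, diameter 1/n

# Example: after observing 1, 0, 1 the forecast for y_4 is {0.5, 0.75}.
```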

68 Now let us consider the case where the xn are present. [sent-239, score-0.363]

69 The probability forecast k/(n − 1) for yn will usually be too crude, since the known population z1, . . . , zn−1 will typically be heterogeneous. [sent-240, score-0.546]

70 A reasonable statistical forecast would take into account only objects xi that are similar, in a suitable sense, to xn . [sent-244, score-0.552]

71 A simple modification of the Bernoulli forecast k/(n − 1) is as follows: 1. Split the objects x1, . . . , xn into groups of similar objects. 2. Output k′/n′ as the predicted probability that yn = 1, where n′ is the number of objects among x1, . . . , xn−1 in the same group as xn and k′ is the number of objects among those n′ that are labeled as 1. [sent-245, score-0.126] [sent-251, score-0.51] [sent-254, score-0.489]

74 At the first stage, a delicate balance has to be struck between two contradictory goals: the groups should be as large as possible (to have a reasonable sample size for estimating probabilities); the groups should be as homogeneous as possible. [sent-255, score-0.146]

75 The algorithm considers two completions of the data (x1, y1), . . . , (xn−1, yn−1), xn: in one (called the 0-completion) xn is assigned label 0, and in the other (called the 1-completion) xn is assigned label 1. [sent-264, score-1.295]

76 In each completion, the examples z1, . . . , zn−1, (xn, y) are split into a number of groups, so that the split does not depend on the order of examples (y = 0 for the 0-partition and y = 1 for the 1-partition). [sent-269, score-0.072]

77 (Footnote 2) Formally, this corresponds in our protocol to the situation where |X| = 1, so that the xn, although nominally present, do not carry any information. [sent-270, score-0.088]

78 In each completion, output k′/n′ as the predicted probability that yn = 1, where n′ is the number of examples among z1, . . . , zn−1, (xn, y) in the same group as (xn, y) and k′ is the number of examples among those n′ that are labeled as 1. [sent-272, score-0.538] [sent-275, score-0.111]

80 In this way, we will have not one but two predicted probabilities that yn = 1; but in practically interesting cases we can hope that these probabilities will be close to each other (see the next section). [sent-276, score-0.434]

81 A taxonomy with too many groups means overfitting; it is punished by the large diameter of the multiprobability forecast (importantly, this is visible, unlike the standard approaches). [sent-278, score-0.571]

82 Important advantages of our procedure over the naive procedure are: our procedure is self-calibrating; there exists an asymptotically optimal VPM (see §6); we can use labels in splitting examples into groups (this will be used in the next section). [sent-280, score-0.215]

83 5 Experiments. In this section, we will report the results for a natural taxonomy applied to the well-known USPS data set of hand-written digits; this taxonomy is inspired by the 1-Nearest Neighbor algorithm. [sent-281, score-0.414]

84 Since the data set is relatively small (9298 examples in total), we have to use a crude taxonomy: two examples are assigned to the same group if their nearest neighbors have the same label; therefore, the taxonomy consists of 10 groups. [sent-283, score-0.365]

85 The distance between two examples is defined as the Euclidean distance between their objects (which are 16 × 16 matrices of pixels, represented as points in R256). [sent-284, score-0.111]

86 The algorithm processes the nth object xn as follows. [sent-285, score-0.461]

87 The entry Ai,j of a 10 × 10 matrix, i, j = 0, . . . , 9, is computed by assigning i to xn as label and finding the fraction of examples labeled j among the examples in the bag z1, . . . , zn−1, (xn, i) that fall in the same group as (xn, i). [sent-289, score-0.621]

88 Choose a column (called the best column) with the highest quality; let the best column be jbest . [sent-294, score-0.11]

89 Output jbest as the prediction and output [ min i=0,...,9 Ai,jbest , max i=0,...,9 Ai,jbest ] as the interval for the probability that this prediction is correct. [sent-295, score-0.105] [sent-301, score-0.143]

91 If the latter interval is [a, b], the complementary interval [1 − b, 1 − a] is called the error probability interval. [sent-302, score-0.186]
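
Below is a sketch (our own code, not the authors' implementation) of the 1-Nearest Neighbor Venn predictor just described: it fills the 10 × 10 matrix A, picks the best column, and returns the prediction together with its probability interval. The extract does not define the "quality" of a column, so, as an assumption, quality is taken here to be the column's minimum entry; the brute-force distance computation is for clarity only, and x_new is assumed to be a flattened pixel vector.

```python
import numpy as np

def nn_venn_predict(train_x, train_y, x_new, num_classes=10):
    """1-NN Venn predictor sketch.  Taxonomy: two examples are in the same group
    when their nearest neighbours (within the current bag, self excluded) carry the
    same label.  A[i, j] is the fraction of examples labelled j in the group of
    (x_new, i) when x_new is postulated to have label i."""
    X = np.vstack([np.asarray(train_x, dtype=float), np.asarray(x_new, dtype=float)])
    y_train = np.asarray(train_y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)                                  # exclude self as neighbour
    nn = d.argmin(axis=1)                                        # index of each example's nearest neighbour
    A = np.zeros((num_classes, num_classes))
    for i in range(num_classes):                                 # postulate label i for x_new
        y = np.append(y_train, i)
        group_key = y[nn]                                        # group = label of nearest neighbour
        same = group_key == group_key[-1]                        # group containing (x_new, i)
        for j in range(num_classes):
            A[i, j] = np.mean(y[same] == j)
    quality = A.min(axis=0)                                      # assumed definition of column quality
    j_best = int(quality.argmax())
    interval = (float(A[:, j_best].min()), float(A[:, j_best].max()))
    return j_best, interval   # [a, b] for the prediction; error probability interval is [1 - b, 1 - a]
```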

92 The plot confirms that the error probability intervals are calibrated. [sent-305, score-0.1]

93 6 Universal Venn probability machine. The following result asserts the existence of a universal VPM. [sent-306, score-0.197]

94 Such a VPM can be constructed quite easily using the histogram approach to probability estimation [11]. [sent-307, score-0.1]
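
As one possible illustration of a histogram-style construction (our own sketch under the assumption that objects lie in [0, 1]^dim; this is not the construction of [11] itself, and the refinement rate is an arbitrary choice), the taxonomy below assigns each example to the grid cell containing its object, refining the grid slowly as the bag grows. It can be plugged into the venn_probability_machine sketch given after the VPM definition above.

```python
def histogram_taxonomy(examples, i, dim=1):
    """Histogram-style taxonomy sketch: the category of the i-th example is the cell
    of a regular grid over [0, 1]^dim containing its object.  The grid is refined
    slowly with the bag size so that cells keep accumulating examples."""
    n = len(examples)
    cells_per_axis = max(1, int(round(n ** (1.0 / (2 * dim)))))  # slow, arbitrary refinement rate
    x, _ = examples[i]
    coords = x if isinstance(x, (list, tuple)) else (x,)
    return tuple(min(int(c * cells_per_axis), cells_per_axis - 1) for c in coords)
```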

95 The dashed line shows the cumulative number of errors En and the solid ones the cumulative upper and lower error probability curves Un and Ln . [sent-309, score-0.2]

96 0425 and the mean probability interval (1/N )[LN , UN ] is [0. [sent-311, score-0.143]

97 This theorem shows that not only are all VPMs reliable but some of them also have asymptotically optimal resolution. [sent-317, score-0.086]

98 The two most important approaches to the analysis of machine-learning algorithms are Bayesian learning theory and PAC theory (their recent mixture, PAC-Bayesian theory, is part of PAC theory as far as its assumptions are concerned). [sent-320, score-0.081]

99 This paper is in a way intermediate between Bayesian learning (no empirical justification for probabilities is required) and PAC learning (the goal is to find or bound the true probability of error, not just to output calibrated probabilities). [sent-321, score-0.241]

100 An important difference of our approach from the PAC approach is that we are interested in the conditional probabilities for the label given the new object, whereas PAC theory (even in its “data-dependent” version, as in [12–14]) tries to estimate the unconditional probability of error. [sent-322, score-0.261]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pn', 0.405), ('xn', 0.363), ('yn', 0.32), ('forecasting', 0.312), ('vpm', 0.249), ('taxonomy', 0.207), ('venn', 0.197), ('zn', 0.187), ('forecast', 0.126), ('forecaster', 0.124), ('forecasts', 0.124), ('multiprobability', 0.124), ('calibration', 0.115), ('vladimir', 0.105), ('probability', 0.1), ('pac', 0.09), ('vovk', 0.09), ('protocol', 0.088), ('label', 0.077), ('groups', 0.073), ('objects', 0.063), ('erri', 0.062), ('gammerman', 0.062), ('ilia', 0.062), ('jbest', 0.062), ('un', 0.061), ('alex', 0.061), ('game', 0.058), ('bag', 0.058), ('nth', 0.058), ('probabilities', 0.057), ('glenn', 0.054), ('py', 0.054), ('asymptotically', 0.052), ('measures', 0.051), ('cumulative', 0.05), ('reality', 0.05), ('announces', 0.049), ('examples', 0.048), ('en', 0.048), ('pi', 0.047), ('yi', 0.046), ('sixteenth', 0.046), ('interval', 0.043), ('output', 0.043), ('labels', 0.042), ('martingale', 0.041), ('miscalibration', 0.041), ('newark', 0.041), ('nitarily', 0.041), ('nouretdinov', 0.041), ('surrey', 0.041), ('calibrated', 0.041), ('qn', 0.041), ('diameter', 0.041), ('object', 0.04), ('reliability', 0.039), ('permutation', 0.039), ('bernoulli', 0.038), ('measurable', 0.036), ('shafer', 0.036), ('group', 0.036), ('reliable', 0.034), ('ln', 0.033), ('informal', 0.033), ('philosophy', 0.033), ('asserts', 0.033), ('egham', 0.033), ('explanations', 0.033), ('universal', 0.032), ('machine', 0.032), ('resolution', 0.032), ('york', 0.031), ('nonnegative', 0.031), ('sequences', 0.031), ('neighbor', 0.029), ('formalization', 0.029), ('manfred', 0.029), ('completion', 0.029), ('dawid', 0.027), ('nn', 0.027), ('holloway', 0.027), ('dy', 0.027), ('among', 0.027), ('say', 0.027), ('precise', 0.027), ('theory', 0.027), ('sense', 0.026), ('assigned', 0.026), ('craig', 0.026), ('randomness', 0.026), ('permutations', 0.026), ('dence', 0.025), ('ui', 0.025), ('reject', 0.025), ('centre', 0.024), ('column', 0.024), ('split', 0.024), ('editors', 0.023), ('families', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 170 nips-2003-Self-calibrating Probability Forecasting

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

2 0.29025754 151 nips-2003-PAC-Bayesian Generic Chaining

Author: Jean-yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1

3 0.21339992 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

4 0.13113712 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

5 0.091807522 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1

6 0.084496319 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

7 0.077604033 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

8 0.071909063 112 nips-2003-Learning to Find Pre-Images

9 0.069709904 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring

10 0.059862576 41 nips-2003-Boosting versus Covering

11 0.056995474 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

12 0.056617934 88 nips-2003-Image Reconstruction by Linear Programming

13 0.056524292 113 nips-2003-Learning with Local and Global Consistency

14 0.054314077 55 nips-2003-Distributed Optimization in Adaptive Networks

15 0.053809792 30 nips-2003-Approximability of Probability Distributions

16 0.05289593 193 nips-2003-Variational Linear Response

17 0.047868084 124 nips-2003-Max-Margin Markov Networks

18 0.046702355 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

19 0.042213507 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

20 0.042072851 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.163), (1, -0.062), (2, -0.05), (3, -0.019), (4, 0.063), (5, 0.07), (6, -0.05), (7, 0.006), (8, -0.219), (9, 0.066), (10, -0.223), (11, -0.026), (12, 0.168), (13, 0.218), (14, 0.255), (15, -0.126), (16, 0.06), (17, -0.023), (18, -0.162), (19, 0.012), (20, 0.128), (21, 0.187), (22, -0.098), (23, 0.113), (24, 0.085), (25, -0.072), (26, -0.081), (27, -0.002), (28, 0.016), (29, 0.054), (30, -0.022), (31, 0.083), (32, -0.058), (33, -0.054), (34, -0.015), (35, -0.002), (36, 0.025), (37, 0.078), (38, -0.106), (39, 0.011), (40, 0.132), (41, 0.031), (42, 0.083), (43, -0.001), (44, 0.03), (45, 0.024), (46, 0.084), (47, 0.032), (48, 0.039), (49, 0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96918303 170 nips-2003-Self-calibrating Probability Forecasting

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

2 0.88592774 151 nips-2003-PAC-Bayesian Generic Chaining

Author: Jean-yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1

3 0.61123717 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method

Author: Susanne Still, William Bialek, Léon Bottou

Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1

4 0.43057457 51 nips-2003-Design of Experiments via Information Theory

Author: Liam Paninski

Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1

5 0.31349429 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks

Author: Anton Schwaighofer, Marian Grigoras, Volker Tresp, Clemens Hoffmann

Abstract: In this article, we present a novel approach to solving the localization problem in cellular networks. The goal is to estimate a mobile user’s position, based on measurements of the signal strengths received from network base stations. Our solution works by building Gaussian process models for the distribution of signal strengths, as obtained in a series of calibration measurements. In the localization stage, the user’s position can be estimated by maximizing the likelihood of received signal strengths with respect to the position. We investigate the accuracy of the proposed approach on data obtained within a large indoor cellular network. 1

6 0.30948374 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring

7 0.27688676 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

8 0.24835615 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

9 0.2456356 75 nips-2003-From Algorithmic to Subjective Randomness

10 0.24003418 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

11 0.2218595 30 nips-2003-Approximability of Probability Distributions

12 0.21945503 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

13 0.20258659 108 nips-2003-Learning a Distance Metric from Relative Comparisons

14 0.20100948 112 nips-2003-Learning to Find Pre-Images

15 0.19407897 146 nips-2003-Online Learning of Non-stationary Sequences

16 0.18925516 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

17 0.18212157 41 nips-2003-Boosting versus Covering

18 0.18078634 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

19 0.1802797 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process

20 0.1796086 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (11, 0.023), (30, 0.018), (35, 0.04), (48, 0.012), (53, 0.08), (58, 0.378), (66, 0.012), (69, 0.027), (71, 0.077), (76, 0.038), (85, 0.078), (91, 0.086), (99, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77095282 170 nips-2003-Self-calibrating Probability Forecasting

Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov

Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.

2 0.64227009 48 nips-2003-Convex Methods for Transduction

Author: Tijl D. Bie, Nello Cristianini

Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1

3 0.63384783 41 nips-2003-Boosting versus Covering

Author: Kohei Hatano, Manfred K. Warmuth

Abstract: We investigate improvements of AdaBoost that can exploit the fact that the weak hypotheses are one-sided, i.e. either all its positive (or negative) predictions are correct. In particular, for any set of m labeled examples consistent with a disjunction of k literals (which are one-sided in this case), AdaBoost constructs a consistent hypothesis by using O(k 2 log m) iterations. On the other hand, a greedy set covering algorithm finds a consistent hypothesis of size O(k log m). Our primary question is whether there is a simple boosting algorithm that performs as well as the greedy set covering. We first show that InfoBoost, a modification of AdaBoost proposed by Aslam for a different purpose, does perform as well as the greedy set covering algorithm. We then show that AdaBoost requires Ω(k 2 log m) iterations for learning k-literal disjunctions. We achieve this with an adversary construction and as well as in simple experiments based on artificial data. Further we give a variant called SemiBoost that can handle the degenerate case when the given examples all have the same label. We conclude by showing that SemiBoost can be used to produce small conjunctions as well. 1

4 0.41177684 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin

Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1

5 0.41060024 113 nips-2003-Learning with Local and Global Consistency

Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf

Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1

6 0.41046357 126 nips-2003-Measure Based Regularization

7 0.40871119 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

8 0.40799096 143 nips-2003-On the Dynamics of Boosting

9 0.40719363 3 nips-2003-AUC Optimization vs. Error Rate Minimization

10 0.40644154 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

11 0.40533295 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

12 0.4049696 78 nips-2003-Gaussian Processes in Reinforcement Learning

13 0.40484884 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

14 0.40428883 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

15 0.40372449 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

16 0.4031904 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

17 0.40270001 102 nips-2003-Large Scale Online Learning

18 0.40240589 172 nips-2003-Semi-Supervised Learning with Trees

19 0.40185717 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

20 0.40171114 145 nips-2003-Online Classification on a Budget