nips nips2003 nips2003-151 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract There exist many different generalization error bounds for classification. [sent-6, score-0.111]
2 Each of these bounds contains an improvement over the others for certain situations. [sent-7, score-0.078]
3 Our goal is to combine these different improvements into a single bound. [sent-8, score-0.053]
4 In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. [sent-9, score-0.491]
5 This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. [sent-10, score-0.274]
6 On the one hand, people developing empirical processes theory like Dudley and Talagrand (among others) obtained very interesting results concerning the behaviour of the suprema of empirical processes. [sent-13, score-0.122]
7 One crucial aspect of all the generalization error bounds is that they aim at controlling the behaviour of the function that is returned by the algorithm. [sent-15, score-0.139]
8 To control this quantity (i.e. the difference between its empirical error and true error), one has to be able to predict which function is likely to be chosen by the algorithm. [sent-19, score-0.066]
9 The way to perform this union bound optimally is now well mastered in the empirical processes community. [sent-22, score-0.318]
10 In the learning theory setting, one is interested in bounds that are as algorithm and data dependent as possible. [sent-23, score-0.057]
11 This particular focus has made concentration inequalities (see e. [sent-24, score-0.07]
12 Another aspect that is of interest for learning is the case where the classifiers are randomized or averaged. [sent-27, score-0.064]
13 McAllester [1, 4] has proposed a new type of bound that takes the randomization into account in a clever way. [sent-28, score-0.14]
14 The next section introduces the notation and reviews the previous improved bounds that have been proposed. [sent-31, score-0.077]
15 Finally we give the proof of the presented results. [sent-33, score-0.049]
16 2 Previous results We first introduce the notation and then give an overview of existing generalization error bounds. [sent-34, score-0.075]
17 We consider an input space X , an output space Y and a probability distribution P on the product space Z = X × Y. [sent-35, score-0.063]
18 Let Z = (X, Y ) denote a pair of random variables distributed according to P and, for a given integer n, let Z1 , . . . , Zn and Z'1 , . . . , Z'n be two independent samples of n i.i.d. copies of Z. [sent-36, score-0.076]
19 We denote by Pn , P'n and P2n the empirical measures associated respectively with the first sample, the second sample and the union of both samples. [sent-43, score-0.276]
20 For such functions, we denote their expectation under P by P f and their empirical expectation by Pn f (i.e. Pn f = (1/n) Σ_{i=1}^n f(Zi)). [sent-47, score-0.154]
21 En , E'n and E2n denote the expectation with respect to the first, the second and the union of both training samples. [sent-50, score-0.225]
22 We consider the pseudo-distance d defined by d²(f1 , f2) = P (f1 − f2)², and similarly dn , d'n and d2n for the corresponding empirical measures. [sent-51, score-0.102]
23 We denote by ρ and π two probability measures on the space F, so that ρP f will actually mean the expectation of P f when f is sampled according to the probability measure ρ. [sent-53, score-0.201]
24 For two such measures, K(ρ, π) will denote their Kullback-Leibler divergence (K(ρ, π) = ρ log(dρ/dπ) when ρ is absolutely continuous with respect to π and K(ρ, π) = +∞ otherwise). [sent-54, score-0.15]
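As a concrete illustration of the K(ρ, π) quantity used throughout, here is a minimal sketch (not from the paper; the function name and numpy-based setup are ours) of the Kullback-Leibler divergence between two discrete measures on a finite class, with the +∞ convention when ρ is not absolutely continuous with respect to π.

```python
import numpy as np

def kl_divergence(rho, pi):
    # K(rho, pi) = sum over the support of rho of rho(f) * log(rho(f) / pi(f)),
    # with K = +inf when rho is not absolutely continuous w.r.t. pi.
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    support = rho > 0
    if np.any(pi[support] == 0):
        return np.inf
    return float(np.sum(rho[support] * np.log(rho[support] / pi[support])))

# Example: uniform prior pi over 4 classifiers, posterior rho spread over 2 of them.
pi = np.full(4, 0.25)
rho = np.array([0.5, 0.5, 0.0, 0.0])
print(kl_divergence(rho, pi))  # log 2, about 0.693
```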
25 Also, β denotes some positive real number while C is some positive constant (whose value may differ from line to line) and M1 (F) is the set of probability measures on F. [sent-55, score-0.093]
26 Generalization error bounds give an upper bound on the difference between the true and empirical error of functions in a given class, which holds with high probability with respect to the sampling of the training set. [sent-57, score-0.368]
27 By Hoeffding’s inequality one easily gets that for each fixed f ∈ F, with probability at least 1 − β, P f − Pn f ≤ C √(log(1/β)/n). [sent-59, score-0.244]
28 The simplest form of the union bound gives that with probability at least 1 − β, ∀f ∈ F, P f − Pn f ≤ C √((log |F| + log(1/β))/n). [sent-62, score-0.618]
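For concreteness, a small sketch of how the single-function Hoeffding term and the finite union bound term behave numerically; the constant C is left unspecified in the text, so it is set to 1 here purely for illustration, and the function names are ours.

```python
import numpy as np

def hoeffding_deviation(n, beta, C=1.0):
    # Fixed f: P f - Pn f <= C * sqrt(log(1/beta) / n) with probability >= 1 - beta.
    return C * np.sqrt(np.log(1.0 / beta) / n)

def finite_union_deviation(n, beta, card_F, C=1.0):
    # Uniformly over a finite class F: C * sqrt((log|F| + log(1/beta)) / n).
    return C * np.sqrt((np.log(card_F) + np.log(1.0 / beta)) / n)

n, beta = 1000, 0.05
print(hoeffding_deviation(n, beta))            # about 0.055
print(finite_union_deviation(n, beta, 10**6))  # the log|F| term dominates here
```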
29 When the functions have values in {0, 1}, the set of their restrictions to the sample is finite and the above union bound applies. [sent-68, score-0.298]
30 This idea was first used by Vapnik and Chervonenkis [5] to obtain that with probability at least 1 − β, ∀f ∈ F, P f − Pn f ≤ C √((log E2n N(F, 1/n, d2n) + log(1/β))/n). (3) [sent-69, score-0.367]
31 The finite union bound can be directly extended to the countable case by introducing a probability distribution π over F which weights each function, and gives that with probability at least 1 − β, ∀f ∈ F, P f − Pn f ≤ C √((log(1/π(f)) + log(1/β))/n). (4) [sent-71, score-0.729]
32 It is interesting to notice that now the bound depends on the actual function f being considered and not just on the set F. [sent-72, score-0.127]
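A hedged sketch of the function-dependent form (4): the deviation now scales with log(1/π(f)) for the particular f under consideration. The geometric prior π(f_k) = 2^{-k} over an ordered countable class is a hypothetical choice of ours, and C = 1 is again an arbitrary placeholder.

```python
import numpy as np

def weighted_union_deviation(log_inv_prior, n, beta, C=1.0):
    # Form (4): C * sqrt((log(1/pi(f)) + log(1/beta)) / n), with log(1/pi(f)) supplied.
    return C * np.sqrt((log_inv_prior + np.log(1.0 / beta)) / n)

# Hypothetical geometric prior over an ordered countable class: pi(f_k) = 2^{-k}.
n, beta = 1000, 0.05
for k in (1, 10, 20):
    print(k, weighted_union_deviation(k * np.log(2.0), n, beta))
```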
33 Consider a probability distribution ρ defined on a countable F, take the expectation of (4) with respect to ρ and use Jensen’s inequality. [sent-78, score-0.138]
34 This gives, with probability at least 1 − β, ∀ρ, ρ(P f − Pn f) ≤ C √((K(ρ, π) + H(ρ) + log(1/β))/n), where H(ρ) is the Shannon entropy of ρ. [sent-79, score-0.216]
35 The left-hand side ρ(P f − Pn f) is the difference between true and empirical error of a randomized classifier which uses ρ as weights for choosing the decision function (independently of the data). [sent-83, score-0.147]
36 The PAC-Bayes bound [1] is a refined version of the above bound since it has the form (for possibly uncountable F) ∀ρ, ρ(P f − Pn f) ≤ C √((K(ρ, π) + log n + log(1/β))/n). (6) [sent-84, score-0.48]
37 To some extent, one can consider that the PAC-Bayes bound is a refined union bound where the gain happens when ρ is not concentrated on a single function (or more precisely when ρ has entropy larger than log n). [sent-85, score-0.551]
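The sketch below (our own, with C = 1 as a placeholder) evaluates the form (6) for discrete ρ and π and illustrates the remark above: a ρ concentrated on a single function pays the full KL term, much like a weighted union bound, while a spread-out ρ pays less.

```python
import numpy as np

def pac_bayes_deviation(rho, pi, n, beta, C=1.0):
    # Form (6): rho(P f - Pn f) <= C * sqrt((K(rho, pi) + log n + log(1/beta)) / n).
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    s = rho > 0
    kl = np.inf if np.any(pi[s] == 0) else float(np.sum(rho[s] * np.log(rho[s] / pi[s])))
    return C * np.sqrt((kl + np.log(n) + np.log(1.0 / beta)) / n)

# Uniform prior over 1000 classifiers; rho concentrated on one vs. spread uniformly.
pi = np.full(1000, 1e-3)
concentrated = np.zeros(1000); concentrated[0] = 1.0
spread = pi.copy()
n, beta = 10000, 0.05
print(pac_bayes_deviation(concentrated, pi, n, beta))  # K = log 1000: union-bound regime
print(pac_bayes_deviation(spread, pi, n, beta))        # K = 0: the gain from averaging
```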
38 The quantity En Eσ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Zi), where the σi are independent random signs (+1, −1 with probability 1/2), called the Rademacher average of F, is, up to a constant, equal to En sup_{f∈F} (P f − Pn f), which means that it best captures the complexity of F. [sent-87, score-0.213]
39 One has with probability 1 − β, ∀f ∈ F, P f − Pn f ≤ C ( En Eσ sup_{f∈F} (1/n) Σ_{i=1}^n σi f(Zi) + √(log(1/β)/n) ). (7) [sent-88, score-0.231]
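One possible way to approximate the conditional Rademacher average of a finite class on a fixed sample is plain Monte Carlo over random sign vectors. The sketch below assumes the class is given as a matrix of function values on the sample; this representation and the names are our own, not the paper's.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    # Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(Z_i),
    # where values[k, i] = f_k(Z_i) for a finite class evaluated on one fixed sample.
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    sups = np.max(values @ sigma.T, axis=0) / n   # sup over f for each sign draw
    return float(np.mean(sups))

# Hypothetical class: 50 random {0,1}-valued classifiers on 200 sample points.
values = np.random.default_rng(1).integers(0, 2, size=(50, 200)).astype(float)
print(empirical_rademacher(values))
```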
40 Another direction in which the union bound can be refined is by considering finite covers of the set of functions at different scales. [sent-90, score-0.311]
41 This is called the chaining technique, pioneered by Dudley. [sent-91, score-0.101]
42 The results involve the Koltchinskii-Pollard entropy integral: for example, as in [7], with probability 1 − β, ∀f ∈ F, P f − Pn f ≤ C ( (1/√n) En ∫_0^∞ √(log N(F, ε, dn)) dε + √(log(1/β)/n) ). (8) [sent-94, score-0.358]
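The entropy-integral term in (8) can be approximated numerically once covering numbers N(F, ε, dn) are available. The sketch below is a toy illustration assuming a polynomial covering-number model N(ε) = (1/ε)^d, which stands in for whatever covering numbers one actually has.

```python
import numpy as np

def entropy_integral_term(cover_fn, n, eps_max=1.0, num=500):
    # Approximates (1/sqrt(n)) * integral_0^{eps_max} sqrt(log N(F, eps, d_n)) d(eps)
    # by the trapezoidal rule, given a covering-number function cover_fn(eps) >= 1.
    eps = np.linspace(eps_max / num, eps_max, num)
    integrand = np.sqrt(np.log(np.maximum([cover_fn(e) for e in eps], 1.0)))
    return float(np.trapz(integrand, eps) / np.sqrt(n))

# Toy covering-number model N(eps) = (1/eps)^d for a class of "dimension" d (assumption).
d, n = 5, 1000
print(entropy_integral_term(lambda e: (1.0 / e) ** d, n))
```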
43 It has been noticed by Fernique and Talagrand that it is possible to capture the complexity in a better way than using minimal covers by considering majorizing measures (essentially optimal for Gaussian processes). [sent-96, score-0.205]
44 Let r > 0 and (Aj)_{j≥1} be partitions of F of diameter r^{−j} w.r.t. d2n . [sent-97, score-0.064]
45 Using (7) and techniques from [2] we obtain that with probability 1 − β, ∀f ∈ F, P f − Pn f ≤ C (1/√n) En [ inf_{π∈M1(F)} sup_{f∈F} Σ_{j≥1} r^{−j} √(log(1/π(Aj(f)))) ] + √(log(1/β)/n), (9) where Aj(f) denotes the element of the partition Aj containing f. [sent-101, score-0.196]
46 If one takes partitions induced by minimal covers of F at radii r^{−j}, one recovers (8) up to a constant. [sent-102, score-0.296]
47 Using concentration inequalities as in [3] for example, one can get rid of the expectation appearing in the r.h.s. [sent-104, score-0.114]
48 of (3), (8), (7) or (9) and thus obtain a bound that can be computed from the data. [sent-107, score-0.129]
49 Refining the bound (7) is possible as one can localize it. [sent-108, score-0.109]
50 Instead of using partitions as in (9) we use approximating sets (which also induce partitions but are easier to handle here). [sent-115, score-0.076]
51 Let pj : F → Sj be maps (which can be thought of as projections) satisfying pj (f ) = f for f ∈ Sj and pj−1 ◦ pj = pj−1 . [sent-117, score-1.767]
52 2n The quantities π, Sj and pj are allowed to depend on X1 in an exchangeable way (i. [sent-118, score-0.651]
53 For a probability distribution ρ on F, define its j-th projection as ρj = Σ_{g∈Sj} ρ{f : pj(f) = g} δg , where δg denotes the Dirac measure at g. [sent-121, score-0.652]
54 To shorten notations, we denote the ρ-average of the squared distance between two successive “projections” by ρd²_j = ρ d²_{2n}[pj(f), pj−1(f)]. [sent-122, score-0.082]
55 Finally, let ∆n,j(f) = P'n[f − pj(f)] − Pn[f − pj(f)]. [sent-123, score-1.21]
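To make the objects ρj and ρd²_j concrete, here is a toy sketch with entirely hypothetical data and naming: a discrete ρ is pushed through the maps pj, and the squared empirical distance between two successive projections is averaged under ρ.

```python
import numpy as np

def project_measure(rho, p_j):
    # rho_j{g} = rho{f : p_j(f) = g}: push the weights of rho (dict f -> weight)
    # through the projection map p_j (dict f -> element of S_j).
    rho_j = {}
    for f, w in rho.items():
        g = p_j[f]
        rho_j[g] = rho_j.get(g, 0.0) + w
    return rho_j

def rho_d2(rho, p_j, p_jm1, values):
    # rho-average of the squared empirical distance d2_{2n}[p_j(f), p_{j-1}(f)],
    # with values[f] the vector of f evaluated on the 2n points.
    return sum(w * float(np.mean((values[p_j[f]] - values[p_jm1[f]]) ** 2))
               for f, w in rho.items())

# Hypothetical toy setup: 4 classifiers on 2n = 8 points, S1 = {f0}, S2 = F.
rng = np.random.default_rng(0)
values = {f: rng.integers(0, 2, 8).astype(float) for f in range(4)}
p1 = {f: 0 for f in range(4)}   # coarsest level: everything projects to the reference f0
p2 = {f: f for f in range(4)}   # finest level: identity on F
rho = {2: 0.5, 3: 0.5}
print(project_measure(rho, p1))     # all mass on f0
print(rho_d2(rho, p2, p1, values))  # average squared distance between the two levels
```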
56 Theorem 1 If the following condition holds: lim_{j→+∞} sup_{f∈F} ∆n,j(f) = 0 a.s., (10) [sent-124, score-0.098]
57 then for any 0 < β < 1/2, with probability at least 1 − β, for any distribution ρ, we have ρP'n f − P'n f0 ≤ ρPn f − Pn f0 + 5 Σ_{j=1}^{+∞} √(ρd²_j K(ρj , πj)/n) + (1/√n) Σ_{j=1}^{+∞} χj(ρd²_j), where χj(x) = 4 √(x log(4j² β^{−1} log(e²/x))). [sent-126, score-0.234]
58 For instance, it is satisfied when F is finite, or when lim_{j→+∞} sup_{f∈F} |f − pj(f)| = 0 almost surely, or also when the empirical process f → P f − Pn f is uniformly continuous (which happens for classes with finite VC dimension in particular) and lim_{j→+∞} sup_{f∈F} d2n(f, pj(f)) = 0. [sent-128, score-0.85]
59 The previous theorem compares the risk on the second sample of any (randomized) estimator with the risk on the second sample of the reference function g̃. [sent-135, score-0.095]
60 Now let us give a version of the previous theorem in which the second sample does not appear. [sent-136, score-0.112]
61 Theorem 2 If the following condition holds: lim_{j→+∞} sup_{f∈F} En ∆n,j(f) = 0 a.s., (11) [sent-137, score-0.098]
62 then for any 0 < β < 1/2, with probability at least 1 − β, for any distribution ρ, we have ρP f − P f0 ≤ ρPn f − Pn f0 + 5 Σ_{j=1}^{+∞} √(En[ρd²_j] En[K(ρj , πj)]/n) + (1/√n) Σ_{j=1}^{+∞} χj(En[ρd²_j]). [sent-139, score-0.103]
63 Notice that our bound is localized in the sense that it depends on the function of interest (or rather on the averaging distribution ρ) and does not involve a supremum over the class. [sent-141, score-0.222]
64 Also, the union bound is performed in an optimal way since, if one plugs in a distribution ρ concentrated on a single function, takes a supremum over F in the r.h.s., [sent-142, score-0.337]
65 and upper bounds the squared distance by the diameter of the partition, one recovers a result similar to (9) up to logarithmic factors but which is localized. [sent-145, score-0.133]
66 Also, when two successive projections are identical, they do not enter the bound (which comes from the fact that the variance weights the complexity terms). [sent-146, score-0.201]
67 Moreover Theorem 1 also includes the PAC-Bayesian improvement for averaging classifiers since if one considers the set S1 = F one recovers a result similar to McAllester’s (6) which in addition contains the variance improvement such as in [9]. [sent-147, score-0.106]
68 Finally due to the power of the generic chaining, it is possible to upper bound our result by Rademacher averages, up to logarithmic factors (using the results of [10] and [11]). [sent-148, score-0.183]
69 As a remark, the choice of the sequence of sets Sj can generally be done by taking successive covers of the hypothesis space with geometrically decreasing radii. [sent-149, score-0.069]
70 However, the obtained bound is not completely empirical since it involves the expectation with respect to an extra sample. [sent-150, score-0.2]
71 Future work will focus on using concentration inequalities to give a fully empirical bound. [sent-153, score-0.138]
72 5 Proofs Proof of Theorem 1: The proof is inspired by previous works on PAC-bayesian bounds [12, 13] and on the generic chaining [2]. [sent-154, score-0.262]
73 Lemma 1 For any β > 0, λ > 0, j ∈ N∗ and any exchangeable function π : X^{2n} → M1(F), with probability at least 1 − β, for any probability distribution ρ ∈ M1(F), we have ρ( P'n[pj(f) − pj−1(f)] − Pn[pj(f) − pj−1(f)] ) ≤ (2λ/n) ρ d²_{2n}[pj(f), pj−1(f)] + (K(ρ, π) + log(β^{−1}))/λ. [sent-156, score-0.21]
74 Proof Let λ > 0 and let π : X^{2n} → M1(F) be an exchangeable function. [sent-157, score-0.094]
75 Introduce the quantity ∆i = pj(f)(Zn+i) − pj−1(f)(Zn+i) + pj−1(f)(Zi) − pj(f)(Zi) and h = λP'n[pj(f) − pj−1(f)] − λPn[pj(f) − pj−1(f)] − (2λ²/n) d²_{2n}[pj(f), pj−1(f)]. (12) [sent-158, score-2.945]
76 Now take the expectation wrt σ, where σ is an n-dimensional vector of Rademacher variables. [sent-160, score-0.061]
77 We obtain E2n π e^h = E2n π e^{−(2λ²/n) d²_{2n}[pj(f), pj−1(f)]} Π_{i=1}^n cosh((λ/n)∆i) ≤ E2n π e^{−(2λ²/n) d²_{2n}[pj(f), pj−1(f)]} e^{(λ²/(2n²)) Σ_{i=1}^n ∆i²}, where at the last step we use that cosh s ≤ e^{s²/2}. [sent-161, score-0.098]
78 Since ∆i² ≤ 2 (pj(f)(Zn+i) − pj−1(f)(Zn+i))² + 2 (pj(f)(Zi) − pj−1(f)(Zi))², we obtain that for any λ > 0, E2n π e^h ≤ 1. [sent-162, score-1.198]
79 Now let us apply this result to the projected measures πj and ρj . [sent-164, score-0.08]
80 Since, by definition, π, Sj and pj are exchangeable, πj is also exchangeable. [sent-165, score-0.589]
81 Since pj(f) = f for any f ∈ Sj , with probability at least 1 − β, uniformly in ρ, we have ρj( P'n[f − pj−1(f)] − Pn[f − pj−1(f)] ) ≤ (2λ/n) ρj d²_{2n}[f, pj−1(f)] + Kj/λ, where Kj = K(ρj , πj) + log(β^{−1}). [sent-166, score-0.699]
82 Therefore, we need to get a version of this inequality which holds uniformly in λ. [sent-170, score-0.077]
83 First let us note that when ρd²_j = 0, we have ρ∆j = 0. [sent-171, score-0.163]
84 When ρd²_j > 0, let m = √((n log 2)/(2 ρd²_j)), let λ∗_k = m e^{k/2} and let b be a function from R to (0, 1] such that Σ_{k≥1} b(λ∗_k) ≤ 1. [sent-172, score-0.064]
85 From the previous lemma and a union bound, we obtain that for any β > 0 and any integer j, with probability at least 1 − β, for any k ∈ N∗ and any distribution ρ, we have ρ∆j ≤ (2λ∗_k/n) ρd²_j + (K(ρj , πj) + log([b(λ∗_k)]^{−1} β^{−1}))/λ∗_k. [sent-173, score-0.461]
86 Let us take the function b such that λ → log([b(λ)]^{−1})/λ is continuous and decreasing. [sent-174, score-0.131]
87 Then there exists a parameter λ∗ > 0 such that (2λ∗/n) ρd²_j = (K(ρj , πj) + log([b(λ∗)]^{−1} β^{−1}))/λ∗. For any β < 1/2, we have (λ∗)² ρd²_j ≥ (n log 2)/2, hence λ∗ ≥ m. [sent-176, score-0.131]
88 Then we have ρ∆j ≤ (2λ∗√e/n) ρd²_j + (K(ρj , πj) + log([b(λ∗)]^{−1} β^{−1}))/λ∗ ≤ (1 + √e) √(2 ρd²_j (K(ρj , πj) + log([b(λ∗)]^{−1} β^{−1}))/n). (16) [sent-178, score-0.131]
89 By simply using a union bound with weights taken proportional to 1/j², we have that the previous inequality holds uniformly in j ∈ N∗ provided that β^{−1} is replaced with π²j²β^{−1}/6, since Σ_{j∈N∗} 1/j² = π²/6 ≈ 1.645. [sent-185, score-0.357]
90 Notice that ρP'n f − P'n f0 + Pn f0 − ρPn f = ρ∆n,J(f) + Σ_{j=1}^J ρj[ (P'n − Pn)f − (P'n − Pn)pj−1(f) ] because pj−1 = pj−1 ◦ pj . [sent-187, score-0.589]
91 So, with probability at least 1 − β, for any distribution ρ, we have ρP'n f − P'n f0 + Pn f0 − ρPn f ≤ sup_F ∆n,J [sent-188, score-0.103]
92 + 5 Σ_{j=1}^J √(ρd²_j K(ρj , πj)/n) + 3.75 Σ_{j=1}^J √((ρd²_j/n) log(. . .)). [sent-189, score-0.131]
93 Proof of Theorem 2: It suffices to modify slightly the proof of theorem 1. [sent-192, score-0.067]
94 Introduce U = sup_ρ { ρh + log β − K(ρ, π) }, where h is still defined as in equation (12). [sent-193, score-0.131]
95 So with probability at least 1 − β, we have sup_ρ { En ρh + log β − K(ρ, π) } ≤ En U ≤ 0. [sent-196, score-0.216]
96 6 Conclusion We have obtained a generalization error bound for randomized classifiers which combines several previous improvements. [sent-197, score-0.247]
97 It contains an optimal union bound, both in the sense of optimally taking into account the metric structure of the set of functions (via the majorizing measure approach) and in the sense of taking into account the averaging distribution. [sent-198, score-0.338]
98 We believe that this is a very natural way of combining these two aspects as the result relies on the comparison of a majorizing measure which can be thought of as a prior probability distribution and a randomization distribution which can be considered as a posterior distribution. [sent-199, score-0.229]
99 Future work will focus on giving a totally empirical bound (in the induction setting) and investigating possible constructions for the approximating sets Sj . [sent-200, score-0.176]
100 Data-dependent generalization error bounds for (noisy) classification: a PACbayesian approach. [sent-269, score-0.111]
wordName wordTfidf (topN-words)
[('pn', 0.601), ('pj', 0.589), ('en', 0.181), ('union', 0.162), ('log', 0.131), ('majorizing', 0.117), ('rademacher', 0.117), ('eh', 0.115), ('bound', 0.109), ('sj', 0.107), ('chaining', 0.101), ('supf', 0.084), ('zn', 0.077), ('zi', 0.065), ('randomized', 0.064), ('exchangeable', 0.062), ('talagrand', 0.058), ('bounds', 0.057), ('generic', 0.056), ('sup', 0.055), ('dn', 0.051), ('chervonenkis', 0.051), ('measures', 0.048), ('empirical', 0.047), ('aj', 0.045), ('probability', 0.045), ('expectation', 0.044), ('mcallester', 0.043), ('vapnik', 0.042), ('covers', 0.04), ('concentration', 0.04), ('least', 0.04), ('theorem', 0.039), ('cosh', 0.039), ('dudley', 0.039), ('fernique', 0.039), ('limj', 0.039), ('kj', 0.038), ('partitions', 0.038), ('generalization', 0.035), ('localized', 0.034), ('pacbayesian', 0.034), ('shorten', 0.034), ('remark', 0.033), ('let', 0.032), ('recovers', 0.032), ('averaging', 0.032), ('nements', 0.031), ('randomization', 0.031), ('laboratoire', 0.031), ('countable', 0.031), ('combine', 0.031), ('inequalities', 0.03), ('successive', 0.029), ('supremum', 0.029), ('behaviour', 0.028), ('proof', 0.028), ('inequality', 0.028), ('functions', 0.027), ('france', 0.027), ('dj', 0.027), ('paris', 0.027), ('re', 0.026), ('ers', 0.026), ('classi', 0.026), ('bousquet', 0.026), ('diameter', 0.026), ('uniformly', 0.025), ('integer', 0.025), ('comes', 0.025), ('holds', 0.024), ('notations', 0.024), ('verlag', 0.023), ('improvements', 0.022), ('projections', 0.021), ('jensen', 0.021), ('improvement', 0.021), ('give', 0.021), ('happens', 0.021), ('nite', 0.02), ('previous', 0.02), ('induction', 0.02), ('berlin', 0.02), ('obtain', 0.02), ('concentrated', 0.019), ('denote', 0.019), ('lim', 0.019), ('error', 0.019), ('logarithmic', 0.018), ('notice', 0.018), ('distribution', 0.018), ('risk', 0.018), ('weights', 0.017), ('legendre', 0.017), ('probabilit', 0.017), ('een', 0.017), ('bringing', 0.017), ('upperbound', 0.017), ('wrt', 0.017), ('preprint', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
2 0.29320011 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
Author: Susanne Still, William Bialek, Léon Bottou
Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1
3 0.29025754 170 nips-2003-Self-calibrating Probability Forecasting
Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov
Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.
4 0.16354325 51 nips-2003-Design of Experiments via Information Theory
Author: Liam Paninski
Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1
5 0.14342649 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
6 0.071277834 112 nips-2003-Learning to Find Pre-Images
7 0.067569345 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
8 0.060048889 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
9 0.05765247 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
10 0.05752394 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
11 0.057318382 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
12 0.056563687 193 nips-2003-Variational Linear Response
13 0.054256555 107 nips-2003-Learning Spectral Clustering
14 0.052732505 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
15 0.052287839 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
16 0.051363092 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
17 0.050911348 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
18 0.048978418 117 nips-2003-Linear Response for Approximate Inference
19 0.047074202 137 nips-2003-No Unbiased Estimator of the Variance of K-Fold Cross-Validation
20 0.045812506 129 nips-2003-Minimising Contrastive Divergence in Noisy, Mixed-mode VLSI Neurons
topicId topicWeight
[(0, -0.147), (1, -0.072), (2, -0.051), (3, 0.02), (4, 0.126), (5, 0.085), (6, -0.044), (7, 0.038), (8, -0.265), (9, 0.105), (10, -0.241), (11, -0.082), (12, 0.143), (13, 0.242), (14, 0.289), (15, -0.11), (16, 0.105), (17, -0.007), (18, -0.207), (19, -0.013), (20, 0.169), (21, 0.124), (22, -0.085), (23, 0.06), (24, 0.065), (25, 0.054), (26, -0.083), (27, 0.005), (28, -0.027), (29, 0.022), (30, -0.022), (31, 0.081), (32, -0.062), (33, -0.058), (34, -0.05), (35, -0.076), (36, -0.03), (37, 0.058), (38, -0.1), (39, 0.0), (40, 0.057), (41, 0.06), (42, 0.065), (43, -0.024), (44, -0.031), (45, 0.026), (46, 0.052), (47, 0.011), (48, -0.059), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.96632248 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
2 0.88806176 170 nips-2003-Self-calibrating Probability Forecasting
Author: Vladimir Vovk, Glenn Shafer, Ilia Nouretdinov
Abstract: In the problem of probability forecasting the learner’s goal is to output, given a training set and a new object, a suitable probability measure on the possible values of the new object’s label. An on-line algorithm for probability forecasting is said to be well-calibrated if the probabilities it outputs agree with the observed frequencies. We give a natural nonasymptotic formalization of the notion of well-calibratedness, which we then study under the assumption of randomness (the object/label pairs are independent and identically distributed). It turns out that, although no probability forecasting algorithm is automatically well-calibrated in our sense, there exists a wide class of algorithms for “multiprobability forecasting” (such algorithms are allowed to output a set, ideally very narrow, of probability measures) which satisfy this property; we call the algorithms in this class “Venn probability machines”. Our experimental results demonstrate that a 1-Nearest Neighbor Venn probability machine performs reasonably well on a standard benchmark data set, and one of our theoretical results asserts that a simple Venn probability machine asymptotically approaches the true conditional probabilities regardless, and without knowledge, of the true probability measure generating the examples.
3 0.61316621 82 nips-2003-Geometric Clustering Using the Information Bottleneck Method
Author: Susanne Still, William Bialek, Léon Bottou
Abstract: We argue that K–means and deterministic annealing algorithms for geometric clustering can be derived from the more general Information Bottleneck approach. If we cluster the identities of data points to preserve information about their location, the set of optimal solutions is massively degenerate. But if we treat the equations that define the optimal solution as an iterative algorithm, then a set of “smooth” initial conditions selects solutions with the desired geometrical properties. In addition to conceptual unification, we argue that this approach can be more efficient and robust than classic algorithms. 1
4 0.47584024 51 nips-2003-Design of Experiments via Information Theory
Author: Liam Paninski
Abstract: We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. For example, we are able to explicitly calculate the asymptotic relative efficiency of the “staircase method” widely employed in psychophysics research, and to demonstrate the dependence of this efficiency on the form of the “psychometric function” underlying the output responses. 1
5 0.39988667 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
Author: Ting-fan Wu, Chih-jen Lin, Ruby C. Weng
Abstract: Pairwise coupling is a popular multi-class classification method that combines together all pairwise comparisons for each pair of classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and [3]. 1
6 0.24137734 75 nips-2003-From Algorithmic to Subjective Randomness
7 0.24107175 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering
8 0.23536463 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
9 0.21018641 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers
10 0.20849869 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
11 0.20217845 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
12 0.20205869 193 nips-2003-Variational Linear Response
13 0.20162016 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
14 0.19997467 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
15 0.19552425 30 nips-2003-Approximability of Probability Distributions
16 0.1921061 112 nips-2003-Learning to Find Pre-Images
17 0.19000274 59 nips-2003-Efficient and Robust Feature Extraction by Maximum Margin Criterion
18 0.18671919 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
19 0.17792036 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
20 0.17587876 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach
topicId topicWeight
[(0, 0.036), (11, 0.017), (26, 0.278), (29, 0.01), (35, 0.054), (53, 0.088), (58, 0.013), (69, 0.016), (71, 0.06), (76, 0.041), (85, 0.101), (91, 0.118), (99, 0.051)]
simIndex simValue paperId paperTitle
1 0.90913117 99 nips-2003-Kernels for Structured Natural Language Data
Author: Jun Suzuki, Yutaka Sasaki, Eisaku Maeda
Abstract: This paper devises a novel kernel function for structured natural language data. In the field of Natural Language Processing, feature extraction consists of the following two steps: (1) syntactically and semantically analyzing raw data, i.e., character strings, then representing the results as discrete structures, such as parse trees and dependency graphs with part-of-speech tags; (2) creating (possibly high-dimensional) numerical feature vectors from the discrete structures. The new kernels, called Hierarchical Directed Acyclic Graph (HDAG) kernels, directly accept DAGs whose nodes can contain DAGs. HDAG data structures are needed to fully reflect the syntactic and semantic structures that natural language data inherently have. In this paper, we define the kernel function and show how it permits efficient calculation. Experiments demonstrate that the proposed kernels are superior to existing kernel functions, e.g., sequence kernels, tree kernels, and bag-of-words kernels. 1
same-paper 2 0.83404064 151 nips-2003-PAC-Bayesian Generic Chaining
Author: Jean-yves Audibert, Olivier Bousquet
Abstract: There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-bayesian setting. 1
3 0.74354237 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
Author: Sanjiv Kumar, Martial Hebert
Abstract: In this paper we present Discriminative Random Fields (DRF), a discriminative framework for the classification of natural image regions by incorporating neighborhood spatial dependencies in the labels as well as the observed data. The proposed model exploits local discriminative models and allows to relax the assumption of conditional independence of the observed data given the labels, commonly used in the Markov Random Field (MRF) framework. The parameters of the DRF model are learned using penalized maximum pseudo-likelihood method. Furthermore, the form of the DRF model allows the MAP inference for binary classification problems using the graph min-cut algorithms. The performance of the model was verified on the synthetic as well as the real-world images. The DRF model outperforms the MRF model in the experiments. 1
4 0.58022082 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
5 0.58018702 78 nips-2003-Gaussian Processes in Reinforcement Learning
Author: Malte Kuss, Carl E. Rasmussen
Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.
6 0.58001393 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
7 0.57595146 158 nips-2003-Policy Search by Dynamic Programming
8 0.570889 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
9 0.56647646 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
10 0.56536973 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
11 0.56420475 41 nips-2003-Boosting versus Covering
12 0.56319118 143 nips-2003-On the Dynamics of Boosting
13 0.56214273 107 nips-2003-Learning Spectral Clustering
14 0.56204945 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
15 0.56180114 3 nips-2003-AUC Optimization vs. Error Rate Minimization
16 0.56169122 30 nips-2003-Approximability of Probability Distributions
17 0.56167465 126 nips-2003-Measure Based Regularization
18 0.56111276 68 nips-2003-Eye Movements for Reward Maximization
19 0.56055623 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
20 0.56042647 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes