Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri
Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a data-dependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic.
[1] Wolpert, D.H.: On the connection between in-sample testing and generalization error. Complex Systems 6 (1992) 47–94
[2] Wolpert, D.H.: The lack of a priori distinctions between learning algorithms. Neural Computation 8 (1996) 1341–1390
[3] Wolpert, D.H.: The supervised learning no-free-lunch theorems. In: Proc. 6th Online World Conf. on Soft Computing in Industrial Applications (2001)
[4] Schaffer, C.: A conservation law for generalization performance. In: Proc. 11th Int. Conf. on Machine Learning (1994) 259–265
[5] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd Edition. Wiley (2001)
[6] Good, I.J.: The population frequencies of species and the estimation of population parameters. Biometrika 40 (1953) 237–264
[7] McAllester, D.A., Schapire, R.E.: On the convergence rate of Good-Turing estimators. In: Proc. 13th Ann. Conf. on Computational Learning Theory (2000) 1–6
[8] McAllester, D.A., Ortiz, L.: Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research 4 (2003) 895–911
[9] Blake, C., Merz, C.: UCI repository of machine learning databases. Univ. of California, Dept. of Information and Computer Science (1998)

A Postponed Proofs

We first state two propositions that are useful in the proof of Lemma 2.

Proposition 1. Let $\mathcal{X}_m$ be a domain of size $m$, and let $P^*_{\mathcal{X}_m}$ be an associated probability distribution. The probability of getting no repetitions when sampling $1 \le k \le m$ items with replacement from distribution $P^*_{\mathcal{X}_m}$ is upper-bounded by
$$\Pr[\text{``no repetitions''} \mid k] \;\le\; \frac{m!}{(m-k)!\,m^{k}}.$$

Proof Sketch of Proposition 1. By way of contradiction it is possible to show that the probability of obtaining no repetitions is maximized when $P^*_{\mathcal{X}_m}$ is uniform. After this, it is easily seen that the maximal probability equals the right-hand side of the inequality.

Proposition 2. Let $\mathcal{X}_m$ be a domain of size $m$, and let $P^*_{\mathcal{X}_m}$ be an associated probability distribution. The probability of getting at most $r \ge 0$ repeated values when sampling $1 \le k \le m$ items with replacement from distribution $P^*_{\mathcal{X}_m}$ is upper-bounded by
$$\Pr[\text{``at most } r \text{ repetitions''} \mid k] \;\le\; \begin{cases} 1 & \text{if } k < r, \\ \min\left\{1,\ \binom{k}{r}\,\frac{m!}{(m-k+r)!}\,m^{-(k-r)}\right\} & \text{if } k \ge r. \end{cases}$$

Proof of Proposition 2. The case $k < r$ is trivial. For $k \ge r$, the event "at most $r$ repetitions in $k$ draws" is equivalent to the event that there is at least one subset of size $k-r$ of the $X$-variables $\{X_1, \ldots, X_k\}$ such that all variables in the subset take distinct values. For a subset of size $k-r$, Proposition 1 implies that the probability that all values are distinct is at most $\frac{m!}{(m-k+r)!}\,m^{-(k-r)}$. Since there are $\binom{k}{r}$ subsets of the $X$-variables of size $k-r$, the union bound implies that multiplying this by $\binom{k}{r}$ gives the required result.

Proof of Lemma 2. The probability of getting at most $r$ repeated $X$-values can be upper-bounded by considering repetitions in the maximally probable set $\bar{\mathcal{X}}_n$ only. The probability of at most $r$ repetitions in $\bar{\mathcal{X}}_n$ can be broken into $n+1$ mutually exclusive cases depending on how many $X$-values fall into the set $\bar{\mathcal{X}}_n$. Thus we get
$$\Pr[\text{``at most } r \text{ repetitions in } \bar{\mathcal{X}}_n\text{''}] \;=\; \sum_{k=0}^{n} \Pr[\text{``at most } r \text{ repetitions in } \bar{\mathcal{X}}_n\text{''} \mid k]\,\Pr[k],$$
where $\Pr[\cdot \mid k]$ denotes probability under the condition that $k$ of the $n$ cases fall into $\bar{\mathcal{X}}_n$, and $\Pr[k]$ denotes the probability of the latter occurring. Proposition 2 gives an upper bound on the conditional probability. The probability $\Pr[k]$ is given by the binomial distribution with parameter $\bar{p}_n$: $\Pr[k] = \mathrm{Bin}(k\,;\,n, \bar{p}_n) = \binom{n}{k}\,\bar{p}_n^{k}(1-\bar{p}_n)^{n-k}$. Combining these gives the formula for $\Delta(n, r, \bar{p}_n)$. Showing that $\Delta(n, r, \bar{p}_n)$ is non-increasing in $\bar{p}_n$ is tedious but uninteresting and we only sketch the proof: it can be checked that the conditional probability bound given by Proposition 2 is non-increasing in $k$ (the min operator is essential for this). From this the claim follows, since for increasing $\bar{p}_n$ the binomial distribution puts more weight on terms with large $k$, thus not increasing the sum.

Proof of Thm. 2. The first three factors in the definition (1) of $\Delta(n, r, \bar{p}_n)$ are equal to a binomial probability $\mathrm{Bin}(k\,;\,n, \bar{p}_n)$, and the expectation of $k$ is thus $n\bar{p}_n$. By the Hoeffding bound, for all $\epsilon > 0$, the probability of $k < n(\bar{p}_n - \epsilon)$ is bounded by $\exp(-2n\epsilon^2)$. Applying this bound with $\epsilon = \bar{p}_n/3$, we get that the probability of $k < \frac{2}{3}n\bar{p}_n$ is bounded by $\exp\left(-\frac{2}{9}n\bar{p}_n^2\right)$. Combined with (1) this gives the following upper bound on $\Delta(n, r, \bar{p}_n)$:
$$\Delta(n, r, \bar{p}_n) \;\le\; \exp\left(-\tfrac{2}{9}n\bar{p}_n^2\right)\max_{k < \frac{2}{3}n\bar{p}_n} f(n, r, k) \;+\; \max_{k \ge \frac{2}{3}n\bar{p}_n} f(n, r, k) \;\le\; \exp\left(-\tfrac{2}{9}n\bar{p}_n^2\right) \;+\; \max_{k \ge \frac{2}{3}n\bar{p}_n} f(n, r, k), \qquad (3)$$
where $f(n, r, k)$ denotes the upper bound of Proposition 2 applied with $m = n$. For $k \ge r$ it can be further bounded as
$$f(n, r, k) \;\le\; n^{2r}\prod_{j=1}^{k}\frac{n-j}{n}; \qquad (4)$$
for $k < r$, $f(n, r, k) = 1$, so that (4) holds in fact for all $k$ with $1 \le k \le n$.
We bound the last factor $\prod_{j=1}^{k}\frac{n-j}{n}$ further as follows. The average of the $k$ factors of this product is less than or equal to $\frac{n-k/2}{n} = 1 - \frac{k}{2n}$. Since a product of $k$ factors is always less than or equal to the average of the factors raised to the power $k$, we get the upper bound
$$\prod_{j=1}^{k}\frac{n-j}{n} \;\le\; \left(1 - \frac{k}{2n}\right)^{k} \;\le\; \exp\left(-\frac{k \cdot k}{2n}\right) \;=\; \exp\left(-\frac{k^2}{2n}\right),$$
where the last inequality follows from $1 - x \le \exp(-x)$ for $x < 1$. Plugging this into (4) gives $f(n, r, k) \le n^{2r}\exp\left(-\frac{k^2}{2n}\right)$. Plugging this back into (3) gives
$$\Delta(n, r, \bar{p}_n) \;\le\; \exp\left(-\tfrac{2}{9}n\bar{p}_n^2\right) + \max_{k \ge \frac{2}{3}n\bar{p}_n} n^{2r}\exp\left(-\frac{k^2}{2n}\right) \;\le\; \exp\left(-\tfrac{2}{9}n\bar{p}_n^2\right) + n^{2r}\exp\left(-\tfrac{2}{9}n\bar{p}_n^2\right) \;\le\; 4n^{2r}\exp\left(-\tfrac{2}{9}n\bar{p}_n^2\right).$$
Recall that $B(\delta, D) := \arg\min_p\{p : \Delta(n, r, p) \le \delta\}$. Replacing $\Delta(n, r, p)$ by the above upper bound makes the set of $p$ satisfying the inequality smaller. Thus, the minimal member of the reduced set is greater than or equal to the minimal member of the set with $\Delta(n, r, p) \le \delta$, giving the following bound on $B(\delta, D)$:
$$B(\delta, D) \;\le\; \arg\min_p\left\{p : 4n^{2r}\exp\left(-\tfrac{2}{9}np^2\right) \le \delta\right\} \;=\; 3\sqrt{\frac{1}{2n}\left(\log\frac{4}{\delta} + 2r\log n\right)}.$$
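The construction behind Lemma 2, weighting the Proposition 2 bound by binomial probabilities, is easy to sanity-check numerically. The following Python sketch is not part of the paper; the function names prop2_bound and delta and the example values of n, r and p_bar are ours. It evaluates $\Delta(n, r, \bar{p}_n)$ as the sum over $k$ of $\mathrm{Bin}(k; n, \bar{p}_n)$ times the Proposition 2 bound with $m = n$, and prints it for increasing $\bar{p}_n$ to illustrate the non-increasing behaviour used in the proof.

import math

def prop2_bound(m, r, k):
    # Proposition 2: upper bound on Pr["at most r repetitions" | k]
    # when k items are drawn with replacement from a domain of size m.
    if k < r:
        return 1.0
    # log of C(k, r) * m! / (m - k + r)! * m^(-(k - r))
    log_val = (math.lgamma(k + 1) - math.lgamma(r + 1) - math.lgamma(k - r + 1)
               + math.lgamma(m + 1) - math.lgamma(m - k + r + 1)
               - (k - r) * math.log(m))
    return 1.0 if log_val >= 0.0 else math.exp(log_val)  # the min{1, ...}

def delta(n, r, p_bar):
    # Lemma 2: Delta(n, r, p_bar) = sum_k Bin(k; n, p_bar) * f(n, r, k), with m = n.
    total = 0.0
    for k in range(n + 1):
        log_binom = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                     + k * math.log(p_bar) + (n - k) * math.log(1.0 - p_bar))
        total += math.exp(log_binom) * prop2_bound(n, r, k)
    return total

if __name__ == "__main__":
    n, r = 200, 3
    # Delta should be non-increasing in p_bar, as sketched in the proof of Lemma 2.
    for p_bar in (0.2, 0.4, 0.6, 0.8):
        print(f"p_bar = {p_bar:.1f}  Delta = {delta(n, r, p_bar):.3e}")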
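As a small worked illustration of the closed-form expression at the end of the proof (again not from the paper; the helper name theorem2_bound and the sample values of n, r and delta are ours), the bound $3\sqrt{\frac{1}{2n}\left(\log\frac{4}{\delta} + 2r\log n\right)}$ can be evaluated directly; it is informative whenever it comes out below 1.

import math

def theorem2_bound(n, r, delta_conf):
    # Closed-form upper bound on B(delta, D) from the end of the proof of Theorem 2:
    # 3 * sqrt((1 / (2n)) * (log(4 / delta) + 2 r log n)).
    return 3.0 * math.sqrt((math.log(4.0 / delta_conf) + 2.0 * r * math.log(n)) / (2.0 * n))

if __name__ == "__main__":
    delta_conf = 0.05
    for n, r in [(1000, 0), (10000, 5), (100000, 20)]:
        print(f"n = {n:6d}, r = {r:2d}:  B(delta, D) <= {theorem2_bound(n, r, delta_conf):.3f}")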