nips nips2005 nips2005-85 knowledge-graph by maker-knowledge-mining

85 nips-2005-Generalization to Unseen Cases


Source: pdf

Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a data-dependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. [sent-19, score-0.193]

2 Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. [sent-22, score-0.516]

3 We derive a data-dependent bound on the difference between off-training-set and standard generalization error. [sent-23, score-0.405]

4 Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. [sent-24, score-0.558]

5 As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. [sent-25, score-0.499]

6 1 Introduction A large part of learning theory deals with methods that bound the generalization error of hypotheses in terms of their empirical errors. [sent-27, score-0.544]

7 The standard definition of generalization error allows overlap between the training sample and test cases. [sent-28, score-0.391]

8 When considering off-training-set error [1]–[5], defined in terms of only previously unseen cases, usual generalization bounds do not apply. [sent-31, score-0.422]

9 The off-training-set error and the empirical error sometimes differ significantly with high probability even for large sample sizes. [sent-32, score-0.411]

10 In this paper, we show that in many practical cases, one can nevertheless bound this difference. [sent-33, score-0.293]

11 In particular, we show that with high probability, in the realistic situation where the number of repeated cases, or duplicates, relative to the total sample size is small, the difference between the off-training-set error and the standard generalization error is also small. [sent-34, score-0.521]

12 In this case any standard generalization error bound, no matter how it is arrived at, transforms into a similar bound on the off-training-set error. [sent-35, score-0.533]

13 Our Contribution We show that with probability at least 1 − δ, if there are r repetitions in the training sample, then the difference between the off-training-set error and the standard generalization error is at most of order O(√((1/n)(log(4/δ) + r log n))) (Thm. 2). [sent-36, score-0.821]

14 1) gives a stronger non-asymptotic bound that can be evaluated numerically. [sent-39, score-0.371]

15 1 and 2 is based on Lemma 2, which is of independent interest, giving a new lower bound on the so-called missing mass, the total probability of as yet unseen cases. [sent-41, score-0.518]

16 For small samples and few repetitions, this bound is significantly stronger than existing bounds based on Good-Turing estimators [6]–[8]. [sent-42, score-0.523]

17 Properties of Our Bounds Our bounds hold (1) uniformly, are (2) distribution-free and (3) data-dependent, yet (4) relevant for data-sets encountered in practice. [sent-43, score-0.211]

18 Our bounds hold uniformly in that they hold for all hypotheses (functions from features to labels) at the same time. [sent-45, score-0.225]

19 Our bounds depend on the data: they are useful only if the number of repetitions in the training set is very small compared to the training set size. [sent-48, score-0.512]

20 4: (1) The use of off-training-set error is an essential ingredient of the No Free Lunch (NFL) theorems [1]–[5]. [sent-53, score-0.225]

21 (2) The off-training-set error is an intuitive measure of generalization performance. [sent-60, score-0.216]

22 Yet in practice it differs from standard generalization error (even with continuous feature spaces). [sent-61, score-0.289]

23 (3) Technically, we establish a surprising connection between off-training-set error (a concept from classification) and missing mass (a concept mostly applied in language modeling), and give a new lower bound on the missing mass. [sent-63, score-0.646]

24 The i.i.d. (standard), off-training-set, and empirical error of a hypothesis h are given by Eiid(h) := Pr[Y ≠ h(X)], Eots(h, D) := Pr[Y ≠ h(X) | X ∉ XD], and Eemp(h, D) := (1/n) Σ_{i=1}^{n} I{h(Xi) ≠ Yi}. [sent-84, score-0.182]
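
The following is a minimal, hypothetical sketch (my own toy illustration, not code from the paper) of these three quantities on a finite data set: Eemp is computed on the training sample, Eiid is only approximated by a large held-out i.i.d. test set, and Eots is approximated by the test points whose X-value does not occur in the training sample. The distribution, the hypothesis h, and all names are assumptions made for the example.

import random

random.seed(0)

def sample_point():
    # Toy distribution on a small discrete feature space, so that overlaps
    # between training and test X-values actually occur.
    x = random.randint(0, 199)
    y = int(x % 2 == 0) if random.random() < 0.9 else int(x % 2 != 0)
    return x, y

def h(x):
    # Some fixed hypothesis (a function from features to labels).
    return int(x % 2 == 0)

train = [sample_point() for _ in range(100)]
test = [sample_point() for _ in range(100000)]   # large i.i.d. test set

X_D = {x for x, _ in train}                      # observed X-values, i.e. XD

E_emp = sum(h(x) != y for x, y in train) / len(train)       # Eemp(h, D)
E_iid_hat = sum(h(x) != y for x, y in test) / len(test)     # Monte-Carlo estimate of Eiid(h)

unseen = [(x, y) for x, y in test if x not in X_D]           # test points with X not in XD
E_ots_hat = (sum(h(x) != y for x, y in unseen) / len(unseen)
             if unseen else float("nan"))                    # Monte-Carlo estimate of Eots(h, D)

print(E_emp, E_iid_hat, E_ots_hat)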

25 The first one of these is just the standard generalization error of learning theory. [sent-88, score-0.242]

26 First, if XD has measure one, the off-training-set error is undefined and we need not concern ourselves with it; the relevant error measure is Eiid(h) and standard results apply. [sent-95, score-0.271]

27 Given a training sample D, the sample coverage p(XD ) is the probability that a new X-value appears in D: p(XD ) := Pr[X ∈ XD ], where XD is as in Def. [sent-103, score-0.376]

28 a (an upper bound on Eots(h, D)); the other direction is obtained by using analogous inequalities for the quantity 1 − Eots(h, D), with Y ≠ h(X) replaced by Y = h(X), which gives the upper bound 1 − Eots(h, D) ≤ 1 − Eiid(h) + p(XD). [sent-111, score-0.793]

29 b follows from the first line by ignoring the upper bound 1, and subtracting Eiid (h) from both sides. [sent-113, score-0.357]

30 Given the value of (or an upper bound on) Eiid (h), the upper bound of Lemma 1. [sent-114, score-0.714]

31 The lemma would be of little use without a good enough upper bound on the sample coverage p(XD ), or equivalently, a lower bound on the missing mass. [sent-119, score-1.037]

32 The known small bias of such estimators, together with a rate of convergence, can be used to obtain lower and upper bounds for the missing mass [7, 8]. [sent-123, score-0.507]

33 Unfortunately, for the sample sizes we are interested in, the lower bounds are not quite tight enough (see Fig. [sent-124, score-0.234]

34 We compare this bound to the existing ones after Thm. [sent-127, score-0.301]

35 No assumptions are made regarding the value of p̄n; it may or may not be zero. [sent-133, score-0.421]

36 The reason for us being interested in p̄n is that (note, however, that a continuous feature space does not necessarily imply this; see Sec.) [sent-134, score-0.492]

37 it gives us an upper bound p(XD) ≤ p̄n on the sample coverage that holds for all D. [sent-137, score-1.071]

38 We prove that when p̄n is large it is likely that a sample of size n will have several repeated X-values so that the number of distinct X-values is less than n. [sent-138, score-0.613]

39 This implies that if a sample with a small number of repeated X-values is observed, it is safe to assume that p̄n is small and therefore the sample coverage p(XD) must also be small. [sent-139, score-0.347]

40 ∆(n, r, p̄n) is a non-increasing function of p̄n. [sent-143, score-0.842]

41 Given a fixed confidence level 1 − δ we can now define a data-dependent upper bound on the sample coverage: B(δ, D) := arg min_p {p : ∆(n, r, p) ≤ δ} (2), where r is the number of repeated X-values in D, and ∆(n, r, p) is given by Eq. (1). [sent-146, score-0.637]
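
Because ∆(n, r, p) is non-increasing in p (previous sentence), Eq. (2) can be evaluated numerically by a simple one-dimensional search. A minimal sketch (my own, not the authors' code), assuming a callable delta_fn(n, r, p) that implements Eq. (1); the exact formula is not reproduced verbatim in this summary, and a hedged reconstruction of such a callable is sketched after the proof of Lemma 2 further below.

def coverage_bound(n, r, delta, delta_fn, tol=1e-6):
    # Numerically evaluate Eq. (2): the smallest p with delta_fn(n, r, p) <= delta.
    # Relies on delta_fn being non-increasing in p (Lemma 2).
    if delta_fn(n, r, 1.0) > delta:
        return 1.0                      # no nontrivial guarantee at this confidence level
    lo, hi = 0.0, 1.0                   # typically delta_fn(lo) > delta >= delta_fn(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delta_fn(n, r, mid) <= delta:
            hi = mid                    # mid already feasible: search for a smaller p
        else:
            lo = mid
    return hi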

42 For any 0 ≤ δ ≤ 1, the upper bound B(δ, D) on the sample coverage given by Eq. [sent-149, score-0.561]

43 Consider fixed values of the confidence level 1 − δ, sample size n, and probability p̄n. [sent-152, score-0.588]

44 Let R be the largest integer for which ∆(n, R, p̄n) ≤ δ. [sent-153, score-0.421]

45 By Lemma 2 the probability of obtaining at most R repetitions is upper-bounded by δ. [sent-154, score-0.392]

46 Thus, it is sufficient that the bound holds whenever the number of repetitions is greater than R. [sent-155, score-0.628]

47 By Lemma 2 the function ∆(n, r, p̄n) is non-increasing in p̄n, and hence it must be that p̄n < arg min_p {p : ∆(n, r, p) ≤ δ} = B(δ, D). [sent-157, score-1.354]

48 Since p(XD) ≤ p̄n, the bound then holds for all r > R. [sent-158, score-0.724]

49 Rather than the sample coverage p(XD ), the real interest is often in off-training-set error. [sent-159, score-0.204]

50 The difference between the i.i.d. error and the off-training-set error is bounded by Pr[∀h : |Eots(h, D) − Eiid(h)| ≤ B(δ, D)] ≥ 1 − δ. [sent-167, score-0.246]

51 In this way the i.i.d. error and the off-training-set error are entangled, thus transforming all distribution-free bounds on the i.i.d. error into corresponding bounds on the off-training-set error. [sent-171, score-0.229]
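
To spell out that transformation (my own bookkeeping, not a statement from the paper): combine the corollary above with any standard uniform bound sup_h |Eiid(h) − Eemp(h, D)| ≤ ε(n, δ) that holds with probability at least 1 − δ, splitting the confidence budget with a union bound:

\begin{align*}
&\Pr\big[\forall h:\ |E_{\mathrm{ots}}(h,D) - E_{\mathrm{iid}}(h)| \le B(\delta,D)\big] \ \ge\ 1-\delta,\\
&\Pr\big[\forall h:\ |E_{\mathrm{iid}}(h) - E_{\mathrm{emp}}(h,D)| \le \varepsilon(n,\delta)\big] \ \ge\ 1-\delta\\
&\quad\Longrightarrow\quad
\Pr\big[\forall h:\ E_{\mathrm{ots}}(h,D) \le E_{\mathrm{emp}}(h,D) + \varepsilon(n,\delta) + B(\delta,D)\big] \ \ge\ 1-2\delta ,
\end{align*}

where the last step is a union bound over the two failure events.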

52 Figure 1 illustrates the behavior of the bound (2) as the sample size grows. [sent-176, score-0.393]

53 It can be seen that for a small number of repetitions the bound is nontrivial already at moderate sample sizes. [sent-177, score-0.731]

54 Moreover, the effect of repetitions is tolerable, and it diminishes as the number of repetitions grows. [sent-178, score-0.65]

55 Table 1 lists values of the bound for a number of data-sets from the UCI machine learning repository [9]. [sent-179, score-0.299]

56 Theorem 2 gives an upper bound on the rate with which the bound decreases as n grows. [sent-183, score-0.685]

57 Figure 1: Upper bound B(δ, D) given by Eq. (2), plotted against sample size from 1 to 10000. [sent-193, score-0.393]

58 Asymptotically, it exceeds our r = 0 bound by a factor O(log n). [sent-200, score-0.271]

59 For all n, all p̄n, and all r < n, the function B(δ, D) has the upper bound B(δ, D) ≤ 3 √((1/(2n))(log(4/δ) + 2r log n)). [sent-204, score-0.846]
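
A minimal sketch of this closed-form bound in code (the function name, the clipping to 1, and the example numbers are my own conveniences; only the formula is from Thm. 2):

import math

def coverage_bound_closed_form(n, r, delta):
    # Thm. 2: B(delta, D) <= 3 * sqrt((1/(2n)) * (log(4/delta) + 2*r*log(n))).
    b = 3.0 * math.sqrt((math.log(4.0 / delta) + 2.0 * r * math.log(n)) / (2.0 * n))
    return min(1.0, b)   # a sample coverage is a probability, so clip at 1

# Example (hypothetical numbers): n = 10000 training cases, r = 0 repetitions,
# confidence 95% (delta = 0.05) gives a coverage bound of roughly 0.044.
print(coverage_bound_closed_form(10000, 0, 0.05))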

60 2 to the existing bounds on B(δ, D) based on Good-Turing estimators [7, 8]. [sent-207, score-0.209]

61 3 in [7] gives an upper bound of O(r/n + log n/√n). The exact bound is drawn as the G-T curve in Fig. [sent-209, score-0.719]

62 our bound gives O(√((C + r log n)/n)), for a known constant C > 0. [sent-212, score-0.362]

63 For fixed r and increasing n, this gives an improvement over the G-T bound of order O(log n) if r = 0, and O(√(log n)) if r > 0. [sent-213, score-0.362]

64 For r growing faster than O(log n), asymptotically our bound becomes uncompetitive. [sent-214, score-0.338]

65 The real advantage of our bound is that, in contrast to G-T, it gives nontrivial bounds for sample sizes and number of repetitions that typically occur in classification problems. [sent-215, score-0.909]

66 For practical applications in language modeling (large samples, many repetitions), the existing G-T bound of [7] is probably preferable. [sent-216, score-0.347]

67 10 of that paper, it is shown that the probability that the missing mass is larger than its expected value by an amount ε is bounded by exp(−(e/2) n ε²). [sent-219, score-0.225]

68 4, some techniques are developed to bound the expected missing mass in terms of the number of repetitions in the sample. [sent-221, score-0.746]

69 10 of [8], these techniques can be extended to yield an upper bound on B(δ, D) of order O(r/n + 1/√n) that would be asymptotically stronger than the current bound. [sent-223, score-0.433]

70 In practice, our bound is still relevant because typical data-sets often have r very small compared to n (see Table 1). [sent-230, score-0.3]

71 Discussion – Implications of Our Results The use of off-training-set error is an essential ingredient of the influential No Free Lunch theorems [1]–[5]. [sent-259, score-0.225]

72 With the help of the little add-on we provide in this paper (Corollary 1), any bound on standard generalization error can be converted to a bound on off-training-set error. [sent-262, score-0.784]

73 Our empirical results on UCI data-sets show that the resulting bound is often not essentially weaker than the original one. [sent-263, score-0.322]

74 Secondly, contrary to what is sometimes suggested (see footnote 4), we show that one can relate performance on the training sample to performance on as yet unseen cases. [sent-265, score-0.265]

75 This may seem to imply that off-training-set error coincides with standard generalization error (see remark after Def. [sent-267, score-0.379]

76 However, in practice even when the feature space has continuous components, data-sets sometimes contain repetitions (e. [sent-270, score-0.401]

77 In practice repetitions occur in many data-sets, implying that off-training-set error can be different from the standard i. [sent-273, score-0.483]

78 This makes it all the more interesting that standard generalization bounds transfer to off-training-set error – and that is the central implication of this paper. [sent-280, score-0.363]

79 4 For instance, “if we are interested in the error for [unseen cases], the NFL theorems tell us that (in the absence of prior assumptions) [empirical error] is meaningless” [2]. [sent-281, score-0.182]

80 The probability of getting no repetitions when sampling 1 ≤ k ≤ m items with replacement from distribution P∗Xm is upper-bounded by m! / ((m − k)! m^k). [sent-335, score-0.463]

81 By way of contradiction it is possible to show that the probability of obtaining no repetitions is maximized when P∗Xm is uniform. [sent-339, score-0.347]

82 The probability of getting at most r ≥ 0 repeated values when sampling 1 ≤ k ≤ m items with replacement from distribution P∗Xm is upper-bounded by Pr[“at most r repetitions” | k] ≤ min{1, (k choose r) m! / ((m − (k − r))! m^(k−r))}. [sent-343, score-0.187]

83 For k ≥ r, the event “at most r repetitions in k draws” is equivalent to the event that there is at least one subset of size k − r of the X-variables {X1, . . . , Xk} containing no repetitions. [sent-348, score-0.353]

84 Since there are (k choose r) subsets of the X-variables of size k − r, the union bound implies that multiplying this probability by (k choose r) gives the required result. [sent-355, score-0.356]
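
A hedged sketch of the resulting Proposition-2 bound, min{1, (k choose r) m!/((m − (k − r))! m^(k−r))}, computed in log-space (my reading of the garbled formula above; the function name and the edge-case handling are assumptions):

import math

def at_most_r_repetitions_bound(m, k, r):
    # min{1, C(k, r) * m! / ((m - (k - r))! * m^(k - r))}, via lgamma to avoid overflow.
    if k - r <= 1:
        return 1.0                      # with at most one "forced distinct" draw the bound is trivially 1
    if k - r > m:
        return 0.0                      # cannot have more distinct values than m
    log_no_rep = (math.lgamma(m + 1) - math.lgamma(m - (k - r) + 1)
                  - (k - r) * math.log(m))                  # Prop. 1 term for k - r draws
    log_binom = (math.lgamma(k + 1) - math.lgamma(r + 1)
                 - math.lgamma(k - r + 1))                  # log C(k, r)
    return min(1.0, math.exp(log_binom + log_no_rep))

# Sanity check with r = 0: the classical birthday problem,
# Pr[no shared birthday among 23 people] is about 0.493.
print(at_most_r_repetitions_bound(365, 23, 0))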

85 The probability of getting at most r repeated X-values can be upper-bounded by considering repetitions in the maximally probable set X̄n only. [sent-357, score-0.595]

86 The probability of no repetitions in X̄n can be broken into n + 1 mutually exclusive cases depending on how many X-values fall into the set X̄n. [sent-358, score-0.396]

87 Thus we get Pr[“at most r repetitions in X̄n”] = Σ_{k=0}^{n} Pr[“at most r repetitions in X̄n” | k] Pr[k], where Pr[· | k] denotes probability under the condition that k of the n cases fall into X̄n, and Pr[k] denotes the probability of the latter occurring. [sent-359, score-0.766]

88 Proposition 2 gives an upper bound on the conditional probability. [sent-360, score-0.414]

89 The probability Pr[k] is given by the binomial distribution with parameter p̄n: Pr[k] = Bin(k; n, p̄n) = (n choose k) p̄n^k (1 − p̄n)^(n−k). [sent-361, score-1.35]

90 Combining these gives the formula for ∆(n, r, p̄n). [sent-362, score-0.478]
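
Putting the pieces of this proof sketch together in code: a hedged reconstruction of ∆(n, r, p) as a binomial mixture of the Proposition-2 bound, taking m = n for the maximally probable set X̄n. This is my reading of Eq. (1), not a verbatim transcription of the paper's formula, and it reuses the at_most_r_repetitions_bound and coverage_bound sketches given earlier.

import math

def binom_pmf(k, n, p):
    # Bin(k; n, p) computed via lgamma for numerical stability.
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def delta_fn(n, r, p):
    # Mixture over k (how many of the n cases fall into the maximally probable
    # set), each conditional term bounded via the Proposition-2 sketch with m = n.
    return sum(binom_pmf(k, n, p) * at_most_r_repetitions_bound(n, k, r)
               for k in range(n + 1))

# Plugged into the bisection sketch given after Eq. (2):
# print(coverage_bound(1000, 2, 0.05, delta_fn))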

91 Showing that ∆(n, r, p̄n) is non-increasing in p̄n is tedious but uninteresting and we only sketch the proof: It can be checked that the conditional probability given by Proposition 2 is non-increasing in k (the min operator is essential for this). [sent-363, score-0.933]

92 From this the claim follows since for increasing p̄n the binomial distribution puts more weight on terms with large k, thus not increasing the sum. [sent-364, score-0.463]

93 The first three factors in the definition (1) of ∆(n, r, p̄n) are equal to a binomial probability Bin(k; n, p̄n), and the expectation of k is thus n p̄n. [sent-367, score-0.956]

94 Applying this bound with ε = p̄n/3 we get that the probability of k < (2/3) n p̄n is bounded by exp(−(2/9) n p̄n²). [sent-369, score-1.188]

95 Combined with (1) this gives the following upper bound on ∆(n, r, p̄n): ∆(n, r, p̄n) ≤ exp(−(2/9) n p̄n²) + max_{k ≥ (2/3) n p̄n} f(n, r, k) (3). For k < r, f(n, r, k) = 1, so that (4) holds in fact for all k with 1 ≤ k ≤ n. [sent-370, score-1.811]

96 We bound the last factor ∏_{j=1}^{k} (n − j)/n further as follows. [sent-371, score-0.271]

97 Since a product of k factors is always less than or equal to the average of the factors to the power of k, we get the upper bound ∏_{j=1}^{k} (n − j)/n ≤ (1 − k/(2n))^k ≤ exp(−k²/(2n)), where the last inequality follows from 1 − x ≤ exp(−x) for x < 1. [sent-373, score-0.513]

98 Plugging this back into (3) gives ∆(n, r, p̄n) ≤ exp(−(2/9) n p̄n²) + max_{k ≥ (2/3) n p̄n} 3 n^(2r) exp(−k²/(2n)) ≤ exp(−(2/9) n p̄n²) + 3 n^(2r) exp(−(2/9) n p̄n²) ≤ 4 n^(2r) exp(−(2/9) n p̄n²). [sent-375, score-2.213]

99 Recall that B(δ, D) := arg min_p {p : ∆(n, r, p) ≤ δ}. [sent-376, score-0.512]

100 Thus, the minimal member of the reduced set is greater than or equal to the minimal member of the set with ∆(n, r, p) ≤ δ, giving the following bound on B(δ, D): B(δ, D) ≤ arg min_p {p : 4 n^(2r) exp(−(2/9) n p²) ≤ δ} = 3 √((1/(2n))(log(4/δ) + 2r log n)). [sent-378, score-0.541]
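
For completeness, the final equality follows by solving the relaxed condition for p (my own spelled-out algebra):

\begin{align*}
4 n^{2r}\exp\!\Big(-\tfrac{2}{9}\,n p^{2}\Big) \le \delta
\;&\Longleftrightarrow\; \tfrac{2}{9}\,n p^{2} \ \ge\ \log\tfrac{4}{\delta} + 2r\log n\\
\;&\Longleftrightarrow\; p \ \ge\ 3\sqrt{\tfrac{1}{2n}\Big(\log\tfrac{4}{\delta} + 2r\log n\Big)},
\end{align*}

so the smallest feasible p, and hence B(δ, D), is at most the right-hand side of Thm. 2.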


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xd', 0.457), ('pn', 0.421), ('repetitions', 0.325), ('eiid', 0.277), ('bound', 0.271), ('eots', 0.192), ('pr', 0.186), ('bounds', 0.121), ('lemma', 0.112), ('coverage', 0.11), ('error', 0.108), ('generalization', 0.108), ('pxm', 0.107), ('sample', 0.094), ('missing', 0.093), ('upper', 0.086), ('nfl', 0.085), ('unseen', 0.085), ('uci', 0.08), ('helsinki', 0.074), ('xn', 0.069), ('finland', 0.068), ('minp', 0.064), ('proposition', 0.061), ('estimators', 0.058), ('mass', 0.057), ('gives', 0.057), ('lunch', 0.056), ('theorems', 0.055), ('proof', 0.053), ('corollary', 0.052), ('exp', 0.051), ('repeated', 0.049), ('wolpert', 0.047), ('hypothesis', 0.047), ('probability', 0.045), ('stronger', 0.043), ('nokia', 0.043), ('postponed', 0.043), ('unde', 0.043), ('binomial', 0.042), ('nontrivial', 0.041), ('getting', 0.039), ('overly', 0.039), ('hold', 0.037), ('ingredient', 0.037), ('functionals', 0.034), ('breast', 0.034), ('log', 0.034), ('learner', 0.033), ('asymptotically', 0.033), ('training', 0.033), ('appendix', 0.032), ('holds', 0.032), ('dence', 0.032), ('uential', 0.032), ('mcallester', 0.032), ('pessimistic', 0.032), ('gilles', 0.032), ('implications', 0.031), ('hypotheses', 0.03), ('member', 0.03), ('cancer', 0.03), ('bounded', 0.03), ('existing', 0.03), ('imply', 0.029), ('sometimes', 0.029), ('relevant', 0.029), ('replacement', 0.028), ('richness', 0.028), ('repository', 0.028), ('blanchard', 0.028), ('size', 0.028), ('empirical', 0.027), ('adult', 0.027), ('factors', 0.027), ('arg', 0.027), ('standard', 0.026), ('plugging', 0.026), ('items', 0.026), ('cases', 0.026), ('vc', 0.025), ('essential', 0.025), ('language', 0.024), ('yet', 0.024), ('weaker', 0.024), ('letter', 0.024), ('practice', 0.024), ('bin', 0.023), ('continuous', 0.023), ('practical', 0.022), ('obtaining', 0.022), ('overlap', 0.022), ('inequalities', 0.022), ('distinct', 0.021), ('probable', 0.021), ('sketch', 0.021), ('xm', 0.02), ('matter', 0.02), ('interested', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 85 nips-2005-Generalization to Unseen Cases

Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a data-dependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1

2 0.16746563 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

Author: Yoshua Bengio, Olivier Delalleau, Nicolas L. Roux

Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge. 1

3 0.11696743 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

Author: John D. Lafferty, Pradeep K. Ravikumar

Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1

4 0.11043728 95 nips-2005-Improved risk tail bounds for on-line algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile

Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.

5 0.10236206 117 nips-2005-Learning from Data of Variable Quality

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1

6 0.092031553 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

7 0.085162506 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

8 0.080746785 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

9 0.078984857 112 nips-2005-Learning Minimum Volume Sets

10 0.074941039 41 nips-2005-Coarse sample complexity bounds for active learning

11 0.072463639 47 nips-2005-Consistency of one-class SVM and related algorithms

12 0.063815899 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

13 0.06140003 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

14 0.057250481 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

15 0.052648552 160 nips-2005-Query by Committee Made Real

16 0.050010558 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

17 0.048952837 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

18 0.048263464 76 nips-2005-From Batch to Transductive Online Learning

19 0.048188481 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares

20 0.047746647 153 nips-2005-Policy-Gradient Methods for Planning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.161), (1, 0.062), (2, 0.004), (3, -0.099), (4, 0.108), (5, 0.071), (6, -0.073), (7, 0.015), (8, -0.068), (9, -0.013), (10, -0.042), (11, 0.118), (12, 0.143), (13, -0.053), (14, -0.018), (15, 0.028), (16, -0.047), (17, -0.043), (18, 0.035), (19, 0.063), (20, 0.172), (21, -0.051), (22, 0.008), (23, 0.065), (24, 0.1), (25, -0.078), (26, -0.224), (27, 0.016), (28, -0.035), (29, 0.073), (30, -0.002), (31, -0.008), (32, 0.136), (33, 0.011), (34, -0.018), (35, 0.146), (36, 0.019), (37, 0.135), (38, 0.006), (39, 0.116), (40, 0.172), (41, 0.108), (42, -0.14), (43, -0.109), (44, -0.002), (45, -0.02), (46, -0.028), (47, 0.092), (48, 0.082), (49, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.963557 85 nips-2005-Generalization to Unseen Cases

Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a data-dependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1

2 0.69660962 117 nips-2005-Learning from Data of Variable Quality

Author: Koby Crammer, Michael Kearns, Jennifer Wortman

Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1

3 0.62450027 112 nips-2005-Learning Minimum Volume Sets

Author: Clayton Scott, Robert Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1

4 0.5391345 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

Author: John D. Lafferty, Pradeep K. Ravikumar

Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1

5 0.51499569 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari

Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1

6 0.5021885 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

7 0.49047565 95 nips-2005-Improved risk tail bounds for on-line algorithms

8 0.4842619 84 nips-2005-Generalization in Clustering with Unobserved Features

9 0.41096318 76 nips-2005-From Batch to Transductive Online Learning

10 0.38921258 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

11 0.38114768 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

12 0.3809498 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

13 0.3514303 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

14 0.34701335 59 nips-2005-Dual-Tree Fast Gauss Transforms

15 0.33164343 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

16 0.32430413 54 nips-2005-Data-Driven Online to Batch Conversions

17 0.30941895 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

18 0.30672222 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

19 0.30623752 41 nips-2005-Coarse sample complexity bounds for active learning

20 0.30085561 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.29), (3, 0.126), (10, 0.037), (11, 0.014), (27, 0.027), (31, 0.041), (34, 0.049), (39, 0.018), (41, 0.011), (50, 0.019), (55, 0.015), (69, 0.04), (73, 0.042), (88, 0.082), (91, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88905591 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning

Author: Naoki Masuda, Shun-ichi Amari

Abstract: There is a long-standing controversy on the site of the cerebellar motor learning. Different theories and experimental results suggest that either the cerebellar flocculus or the brainstem learns the task and stores the memory. With a dynamical system approach, we clarify the mechanism of transferring the memory generated in the flocculus to the brainstem and that of so-called savings phenomena. The brainstem learning must comply with a sort of Hebbian rule depending on Purkinje-cell activities. In contrast to earlier numerical models, our model is simple but it accommodates explanations and predictions of experimental situations as qualitative features of trajectories in the phase space of synaptic weights, without fine parameter tuning. 1

same-paper 2 0.78395933 85 nips-2005-Generalization to Unseen Cases

Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a data-dependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1

3 0.5391714 112 nips-2005-Learning Minimum Volume Sets

Author: Clayton Scott, Robert Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1

4 0.5226112 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.

5 0.51539296 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire

Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.

6 0.51242685 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

7 0.50669396 74 nips-2005-Faster Rates in Regression via Active Learning

8 0.50637984 117 nips-2005-Learning from Data of Variable Quality

9 0.50439334 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments

10 0.50103939 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

11 0.49677548 131 nips-2005-Multiple Instance Boosting for Object Detection

12 0.49653777 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

13 0.49433023 159 nips-2005-Q-Clustering

14 0.49387625 177 nips-2005-Size Regularized Cut for Data Clustering

15 0.48825371 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

16 0.48464972 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

17 0.483426 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

18 0.48228669 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

19 0.48210856 196 nips-2005-Two view learning: SVM-2K, Theory and Practice

20 0.48166105 151 nips-2005-Pattern Recognition from One Example by Chopping