nips nips2005 nips2005-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. [sent-3, score-0.171]
2 We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. [sent-4, score-0.189]
3 1 Introduction In many natural machine learning settings, one is not only faced with data that may be corrupted or deficient in some way (classification noise or other label errors, missing attributes, and so on), but with data that is not uniformly corrupted. [sent-6, score-0.238]
4 In other words, we might be presented with data of variable quality — perhaps some small amount of entirely “clean” data, another amount of slightly corrupted data, yet more that is significantly corrupted, and so on. [sent-7, score-0.227]
5 Furthermore, in such circumstances we may often know at least an upper bound on the rate and type of corruption in each pile of data. [sent-8, score-0.62]
6 An extreme example is the recent interest in settings where one has a very limited set of correctly labeled examples, and an effectively unlimited set of entirely unlabeled examples, as naturally arises in problems such as classifying web pages [1]. [sent-9, score-0.124]
7 Another general category of problems that falls within our interest is when multiple piles of data are drawn from processes that differ perhaps slightly and in varying amounts from the process we wish to estimate. [sent-10, score-0.799]
8 For example, we might wish to estimate a conditional distribution P(X|Y = y) given only a small number of observations in which Y = y, but a larger number of observations in which Y = y′ for values of y′ “near” to y. [sent-11, score-0.103]
9 While there is a large body of learning theory both for uncorrupted data and for data that is uniformly corrupted in some way [2, 3], there is no general framework and theory for learning from data of variable quality. [sent-13, score-0.231]
10 In this paper we introduce such a framework, and develop its theory, for two basic problems: estimating a bias from corrupted coins, and learning a classifier in the presence of varying amounts of label noise. [sent-14, score-0.354]
11 For the corrupted coins case we provide an upper bound on the error that is expressed as a trade-off between weighted approximation errors and larger amounts of data. [sent-15, score-0.657]
12 This bound provides a building block for the classification noise setting, in which we are able to give a bound on the generalization error of empirical risk minimization that specifies the optimal subset of the data to use. [sent-16, score-0.602]
13 2 Estimating the Bias from Corrupted Coins We begin by considering perhaps the simplest possible instance of the general class of problems in which we are interested — namely, the problem of estimating the unknown bias of a coin. [sent-19, score-0.181]
14 In this version of the variable quality model, we will have access to different amounts of data from “corrupted” coins whose bias differs from the one we wish to estimate. [sent-20, score-0.281]
15 2.1 Problem Description Suppose we wish to estimate the bias β of a coin given K piles of training observations N1, . . . , NK. [sent-23, score-1.016]
16 Each pile Ni contains ni outcomes of flips of a coin with bias βi, where the only information we are provided is that βi ∈ [β − ǫi, β + ǫi], and 0 ≤ ǫ1 ≤ ǫ2 ≤ · · · ≤ ǫK. [sent-27, score-0.821]
17 We refer to the ǫi as bounds on the approximation errors of the corrupted coins. [sent-31, score-0.288]
18 We denote by hi the number of heads observed in the ith pile. [sent-32, score-0.107]
19 Our immediate goal is to determine which piles should be considered in order to obtain the best estimate of the true bias β. [sent-33, score-0.835]
20 The natural estimates to consider are those that use all data from the piles indexed 1 to k for some k ≤ K, and possibly a subset of the data from pile k + 1. [sent-37, score-0.945]
21 In fact, it will be shown that only complete piles need to be considered. [sent-38, score-0.673]
22 Therefore, from this point on we restrict ourselves to estimates of this form, and identify them by the maximal index k of the piles used. [sent-39, score-0.673]
23 We denote this estimate by β̂k = (h1 + · · · + hk)/(n1 + · · · + nk), and its expectation by β̄k = (n1 β1 + · · · + nk βk)/(n1 + · · · + nk). [sent-47, score-0.169]
24 To simplify the presentation we denote by ni,j the number of outcomes in piles Ni, . . . , Nj, i.e. ni,j = ni + · · · + nj. [sent-54, score-0.857]
25 Using the Hoeffding inequality we can bound the second term and find that, with high probability, for an appropriate choice of δ we have |β − β̂k| ≤ (1/n1,k) Σ_{i=1}^{k} ni ǫi + √(log(2K/δ)/(2n1,k)). [sent-59, score-0.447]
26 Then for any δ > 0, with probability ≥ 1 − δ, we have |β − β̂k| ≤ (1/n1,k) Σ_{i=1}^{k} ni ǫi + √(log(2K/δ)/(2n1,k)) simultaneously for all k = 1, . . . , K. [sent-62, score-0.29]
27 Second, the two terms in the bound reflect the well-known trade-off between bias (approximation error) and variance (estimation error). [sent-71, score-0.223]
28 The first term bounds the approximation error of replacing the true coin bias β with the average β̄k. [sent-72, score-0.471]
29 The second term corresponds to the estimation error which arises as a result of our finite sample size. [sent-73, score-0.126]
30 This theorem implies a natural algorithm: choose the number of piles k∗ to be the minimizer of the bound over the number of piles used, k∗ = argmin_{k∈{1,...,K}} [ (1/n1,k) Σ_{i=1}^{k} ni ǫi + √(log(2K/δ)/(2n1,k)) ]. [sent-74, score-1.544]
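As a concrete illustration of this selection rule, the following is a minimal Python sketch that evaluates the reconstructed bound above for every prefix of piles and returns the minimizing k∗; the function and variable names (choose_prefix, heads, ns, eps) are our own, and the estimate β̂k is taken to be the pooled fraction of heads over the first k piles, which matches its stated expectation β̄k.

```python
import math

def choose_prefix(heads, ns, eps, delta=0.05):
    """Sketch of the prefix-selection rule: for each k, evaluate
    (sum_{i<=k} n_i eps_i)/n_{1,k} + sqrt(log(2K/delta)/(2 n_{1,k}))
    and return the minimizing k* with the corresponding pooled estimate."""
    K = len(ns)
    bounds, estimates = [], []
    n_cum = h_cum = 0
    weighted_eps = 0.0
    for k in range(K):
        n_cum += ns[k]                  # n_{1,k}
        h_cum += heads[k]               # heads observed in piles 1..k
        weighted_eps += ns[k] * eps[k]  # sum_{i<=k} n_i * eps_i
        approx = weighted_eps / n_cum                             # approximation term
        estim = math.sqrt(math.log(2 * K / delta) / (2 * n_cum))  # estimation term
        bounds.append(approx + estim)
        estimates.append(h_cum / n_cum)                           # beta_hat_k
    k_star = min(range(K), key=lambda j: bounds[j]) + 1           # 1-indexed k*
    return k_star, estimates[k_star - 1], bounds
```

With the pile sizes n = (10, 50, 100, 500, 1500, 2000, 3000, 10000) used in the simulations below, the returned bounds trace out a curve of the kind shown in the left panel of Figure 1.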
31 To conclude this section we argue that our choice of using a prefix of piles is optimal. [sent-78, score-0.673]
32 First, note that by adding a new pile with a corruption level ǫ smaller than the current corruption level, we can always reduce the bounds. [sent-79, score-0.504]
33 Thus it is optimal to use a prefix of the piles and not to ignore piles with low corruption levels. [sent-80, score-1.485]
34 Note that we can choose to view each coin toss as a separate pile with a single observation, thus yielding n1,K piles of size 1. [sent-82, score-1.192]
35 The following technical lemma states that under this view of singleton piles, once we decide to add a pile with some corruption level, it will be optimal to use all singleton piles with the same corruption level. [sent-83, score-1.364]
36 Lemma 1 Assume that all the piles are of size ni = 1 and that ǫk ≤ ǫk+p = ǫk+p+1. [sent-85, score-0.918]
37 Then the following two inequalities cannot hold simultaneously: (1/n1,k) Σ_{i=1}^{k} ni ǫi + √(log(2n1,K/δ)/(2n1,k)) > (1/n1,k+p) Σ_{i=1}^{k+p} ni ǫi + √(log(2n1,K/δ)/(2n1,k+p)) and (1/n1,k+p+1) Σ_{i=1}^{k+p+1} ni ǫi + √(log(2n1,K/δ)/(2n1,k+p+1)) ≥ (1/n1,k+p) Σ_{i=1}^{k+p} ni ǫi + √(log(2n1,K/δ)/(2n1,k+p)). [sent-86, score-0.98]
38 In other words, if the bound on |β − β̂k+p| is smaller than the bound on |β − β̂k|, then the bound on |β − β̂k+p+1| must be smaller than both, unless ǫk+p+1 > ǫk+p. [sent-87, score-0.302]
39 Thus if the pth and (p+1)th samples are from the same original pile (and ǫk+p+1 = ǫk+p), then once we decide to use samples through p, we will always want to include sample p + 1. [sent-88, score-0.467]
40 It follows that we must only consider using complete piles of data. [sent-89, score-0.673]
41 The approximation errors of the corrupted coins were ǫ = (. . .), and the numbers of outcomes in the corresponding piles were n = (10, 50, 100, 500, 1500, 2000, 3000, 10000). [sent-95, score-0.314] [sent-120, score-0.718]
42 Figure 1 (curves: Actual Error, Bound, Singletons Bound; x-axis: Number of Piles Used). Left: Illustration of the actual error and our error bounds for estimating the bias of a coin. [sent-108, score-0.478]
43 Right: Illustration of the actual error of a 20-dimensional classification problem and the error bounds found using our methods. [sent-111, score-0.365]
45 We set the probability of the ith coin to be βi = β + ǫi and sampled ni times from it. [sent-122, score-0.452]
46 For each k, we computed the bound for the estimate using piles 1, . . . , k. [sent-127, score-0.854]
47 To illustrate Lemma 1 we also computed the bound using partial piles. [sent-131, score-0.151]
48 This bound is slightly higher than the suggested bound since we use effectively more piles (n1,K instead of K). [sent-132, score-0.975]
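To make the comparison concrete, the sketch below computes the bound under the singleton view used in Lemma 1, in which every toss is its own pile of size 1 and the log factor becomes log(2 n1,K/δ); the function name singleton_bound and the exact constants are our assumptions, intended only to show why this curve sits slightly above the complete-piles bound.

```python
import math

def singleton_bound(ns, eps, m, delta=0.05):
    """Bound after using the first m individual observations, viewing each toss
    as a pile of size 1 (there are n_{1,K} such piles, so the union bound runs
    over all of them).  Assumed form:
    (sum of the first m eps values)/m + sqrt(log(2*n_{1,K}/delta)/(2*m))."""
    singles = [e for n, e in zip(ns, eps) for _ in range(n)]  # eps of each single toss
    n1K = len(singles)
    approx = sum(singles[:m]) / m
    estim = math.sqrt(math.log(2 * n1K / delta) / (2 * m))
    return approx + estim
```

Evaluating singleton_bound at m = n1,k for each k gives the higher curve: the only difference from the complete-piles bound is the larger log argument (n1,K in place of K).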
49 We note that a strength of the theory developed is its generality, as it provides bounds for any model parameters. [sent-135, score-0.133]
50 Empirically, the best estimate of the target coin is obtained using the first four piles, while our algorithm suggests using the first five piles. [sent-137, score-0.247]
51 We note that while our bounds have essentially the right shape (which is what matters for the computation of k ∗ ), numerically they are quite loose compared to the true behavior. [sent-139, score-0.14]
52 We assume there is a fixed and unknown binary function f : X → {0, 1} and a fixed and unknown distribution P on the inputs X to f . [sent-143, score-0.106]
53 We are presented again with K piles of data, N1, . . . , NK. [sent-144, score-0.673]
54 Now each pile Ni contains ni labeled examples (x, y) that are generated from the target function f with label noise at rate ηi, where 0 ≤ η1 < η2 < · · · < ηK. [sent-148, score-0.698]
55 In other words, for each example (x, y) in pile Ni , y = f (x) with probability 1 − ηi and y = ¬f (x) with probability ηi . [sent-152, score-0.312]
56 The goal is to decide which piles of data to use in order to choose a function h from a set of hypothesis functions H with minimal generalization (true) error e(h) with respect to f and P . [sent-153, score-0.98]
57 Given piles N1, . . . , Nk, we examine the most basic estimator based on this data, namely the hypothesis minimizing the observed or training error: ĥk = argminh∈H {êk(h)}, where êk(h) is the fraction of times h(x) ≠ y over all (x, y) ∈ N1 ∪ · · · ∪ Nk. [sent-157, score-0.358]
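A small sketch of this estimator, representing H for illustration as a finite collection of callables (an assumption of ours; the analysis allows any class of VC dimension d):

```python
def empirical_error(h, piles_prefix):
    """Observed error e_hat_k(h): the fraction of disagreements h(x) != y over
    the union of the given piles (each pile is a list of (x, y) pairs)."""
    pairs = [xy for pile in piles_prefix for xy in pile]
    return sum(h(x) != y for x, y in pairs) / len(pairs)

def erm_prefix(hypotheses, piles, k):
    """h_hat_k: the hypothesis with the lowest observed error on piles N_1..N_k."""
    return min(hypotheses, key=lambda h: empirical_error(h, piles[:k]))
```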
58 We begin by observing that for any fixed function h, the question of how êk(h) is related to e(h) bears great similarity to the biased coin setting. [sent-162, score-0.302]
59 More precisely, the expected classification error of h on pile Ni only is (1 − ηi )e(h) + ηi (1 − e(h)) = e(h) + ηi (1 − 2e(h)) . [sent-163, score-0.398]
60 The first difficulty is that even restricting attention to estimating e(h) for a fixed h, the values for ǫi above (and thus the bounds computed by the methods of Section 2) depend on e(h), which is exactly the unknown quantity we would like to estimate. [sent-166, score-0.164]
61 The second difficulty is that in order to bound the performance of empirical error minimization within H, we must say something about the probability of any h ∈ H being selected. [sent-167, score-0.341]
62 For convenience we assume that for all levels ei there exists a function h ∈ H such that e(h) = ei . [sent-175, score-0.576]
63 Each row i of B represents one possible value of e(h) = ei, while each column k represents the use of only piles N1, . . . , Nk. [sent-178, score-0.948]
64 The entry B(i, k) will contain a bound on |e(h) − êk(h)| that is valid simultaneously for all h ∈ H with e(h) = ei. [sent-182, score-0.533]
65 In other words, for any such h, with high probability êk(h) falls in the range [ei − B(i, k), ei + B(i, k)]. [sent-183, score-0.4]
66 It is crucial to note that we do not need to know which functions h ∈ H satisfy e(h) = ei in order to either compute or use the bound B(i, k), as we shall see shortly. [sent-184, score-0.483]
67 Rather, it is enough to know that for each h ∈ H, some row of B will provide estimation error bounds for each k. [sent-185, score-0.238]
68 Since Eq. (1) applies to the case of a single biased coin and here we have many (essentially one for each function at a given generalization error ei), we must modify it slightly. [sent-190, score-0.688]
69 We can (pessimistically) bound the VC dimension of all functions with error rate e(h) = ei by the VC dimension d of the entire class H, which yields the bound of Eq. (3). [sent-191, score-0.633]
70 We note that in cases where we have more information on the structure of the generalization errors in H, an accordingly modified equation can be used, which may yield considerably improved bounds. [sent-194, score-0.13]
71 For example, in the statistical physics theory of learning curves[4] it is common to posit knowledge of the density or number of functions in H at a given generalization error ei . [sent-195, score-0.523]
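Since the exact expression of Eq. (3) is not reproduced above, the following sketch fills the matrix B with an assumed bound of the same flavor: an approximation term (Σj≤k nj ηj |1 − 2ei|)/n1,k, the noise-rate analogue of the coin corruption, plus a generic VC-style estimation term; the constants are placeholders and may differ from those of Eq. (3).

```python
import math

def build_B(e_levels, ns, etas, d, delta=0.05):
    """Fill the M x K matrix B of deviation bounds: row i corresponds to the
    error level e_i, column k to using piles N_1..N_k.  The bound form below is
    an assumed stand-in for Eq. (3):
      (sum_{j<=k} n_j * eta_j * |1 - 2*e_i|) / n_{1,k}
      + sqrt((d * log(2 * n_{1,k}) + log(2 * M * K / delta)) / n_{1,k})."""
    M, K = len(e_levels), len(ns)
    B = [[0.0] * K for _ in range(M)]
    for k in range(1, K + 1):
        n1k = sum(ns[:k])
        noise_mass = sum(ns[j] * etas[j] for j in range(k))
        vc_term = math.sqrt((d * math.log(2 * n1k) + math.log(2 * M * K / delta)) / n1k)
        for i, e in enumerate(e_levels):
            B[i][k - 1] = noise_mass * abs(1 - 2 * e) / n1k + vc_term
    return B
```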
72 In a moment we describe how the matrix B can be used to choose the number k of piles to use, and to compute a bound on the generalization error of ĥk. [sent-197, score-1.241]
73 Lemma 2 With probability at least 1 − δ, for all i ∈ {1, . . . , M}, for all h ∈ H with e(h) = ei, and for all k ∈ {1, . . . , K}, we have |e(h) − êk(h)| ≤ B(i, k). [sent-204, score-0.275]
74 3.2 Putting It All Together By Lemma 2, the matrix B gives, for each possible generalization error ei and each k, an upper bound on the deviation between observed and true errors for functions of true error ei when using piles N1, . . . , Nk. [sent-210, score-1.966]
75 It is thus natural to try to use column k of B to bound the error of ĥk, the function minimizing the observed error on these piles. [sent-214, score-0.613]
76 The observed error of any function with true generalization error ei must, with high probability, lie in the interval Ii,k = [ei − B(i, k), ei + B(i, k)]. [sent-216, score-0.988]
77 By simultaneously considering these intervals for all values of ei , we can put a bound on the generalization error of the best function in the hypothesis class. [sent-217, score-0.695]
78 Consider a hypothesis space in which the generalization error of the available functions can take on only the discrete values e1 < e2 < · · · < e6, with e1 = 0. [sent-219, score-0.243]
79 We know, for example, that all functions with true generalization error e2 must have observed error within I2,k = [e2 − B(2, k), e2 + B(2, k)], [sent-232, score-0.278]
80 and that all functions with true generalization error e4 must have observed error within I4,k, which in this example lies entirely above I2,k, so a function with true error e4 could never produce the smallest observed error and hence could never be chosen. [sent-235, score-0.278]
81 Likewise, it would not be possible for a function with true error e5 or e6 to be chosen. [sent-241, score-0.186]
82 However, a function with true error e3 could produce a lower observed error than one with true error e1 or e2 (since e3 − B(3, k) < e2 + B(2, k) and e3 − B(3, k) < e1 + B(1, k)), and thus could be chosen as ĥk. [sent-242, score-0.708]
83 In general, we know that ĥk will have true error corresponding to the midpoint of an interval which overlaps with the interval with the least upper bound (I2,k in this example). [sent-245, score-1.005]
84 This leads to an intuitive procedure for calculating a bound on the true error of ĥk. [sent-246, score-0.522]
85 First, we determine the interval with the smallest upper bound, i∗k = argmini {ei + B(i, k)}. [sent-247, score-0.099]
86 Consider the set of intervals which overlap with this one, namely Jk = {i : ei − B(i, k) ≤ ei∗k + B(i∗k, k)}. [sent-248, score-0.32]
87 It is possible for the smallest observed error to come from a function corresponding to any of the intervals in Jk. [sent-249, score-0.2]
88 Thus, a bound on the true error of ĥk can be obtained by taking the maximum e(h) value for any function in Jk, i.e. the largest ei with i ∈ Jk. [sent-250, score-0.522]
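Putting these steps together, here is a minimal sketch of the interval-overlap procedure; the helper names and the final rule of choosing the k whose bound is smallest are our own, but the steps follow the description above.

```python
def bound_true_error(e_levels, B, k):
    """Bound on the true error of h_hat_k from column k of B (k is 1-indexed):
    find the interval with the smallest upper bound, collect the overlapping
    intervals J_k, and return the largest error level among them."""
    col = [B[i][k - 1] for i in range(len(e_levels))]
    i_star = min(range(len(e_levels)), key=lambda i: e_levels[i] + col[i])
    top = e_levels[i_star] + col[i_star]
    J_k = [i for i in range(len(e_levels)) if e_levels[i] - col[i] <= top]
    return max(e_levels[i] for i in J_k)

def choose_num_piles(e_levels, B):
    """Pick the number of piles k whose bound on the true error of h_hat_k is smallest."""
    K = len(B[0])
    return min(range(1, K + 1), key=lambda k: bound_true_error(e_levels, B, k))
```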
89 For each k = 1, . . . , K, let ĥk = argminh∈H {êk(h)} be the function in H with the lowest empirical error evaluated using the first k piles of data. [sent-266, score-1.005]
90 We computed ĥk for each k = 1, . . . , K by training on the first k piles of data using logistic regression with a fixed learning rate. [sent-273, score-0.673]
91 The generalization error for each model was determined by testing on a noise-free sample of 500 examples drawn from the same uniform distribution. [sent-275, score-0.219]
92 Bounds were calculated using the algorithm described above with functions binned into 101 evenly spaced error values e = (0, 0.01, . . . , 1). [sent-276, score-0.173]
93 The right panel of Figure 1 shows an example of the bounds found with K = 12 piles and increasing noise levels η. [sent-283, score-0.171]
94 The algorithm described above correctly predicts that the eighth pile should be chosen as the cutoff, yielding the optimal error value. [sent-296, score-0.461]
95 It is interesting to note that although the error bounds shown are significantly higher than the actual error, the shapes of the curves are similar. [sent-298, score-0.239]
96 Further experimentation has shown that the algorithm described here works well in general when there are small piles of low noise data and large piles of high noise data. [sent-300, score-1.426]
97 As before, we assume there is a fixed and unknown binary function f : X → {0, 1} and a fixed and unknown distribution P on the inputs X to f . [sent-303, score-0.106]
98 We are presented again with K piles of data, N1, . . . , NK. [sent-304, score-0.673]
99 Now each pile Ni contains ni labeled examples (x, y) that are generated from an unknown function hi such that e(hi) = e(hi, f) = PrP[hi(x) ≠ f(x)] ≤ ǫi, for given values ǫ1 ≤ · · · ≤ ǫK. [sent-308, score-0.674]
100 Thus we are provided piles of labeled examples of unknown functions “nearby” the unknown target f , where “nearby” is quantified by the sequence of ǫi . [sent-312, score-0.879]
wordName wordTfidf (topN-words)
[('piles', 0.673), ('ei', 0.275), ('pile', 0.272), ('ni', 0.245), ('coin', 0.187), ('hk', 0.185), ('corrupted', 0.152), ('bound', 0.151), ('nk', 0.139), ('error', 0.126), ('corruption', 0.116), ('coins', 0.106), ('vc', 0.092), ('ek', 0.082), ('bounds', 0.08), ('kearns', 0.074), ('bias', 0.072), ('generalization', 0.067), ('pre', 0.063), ('true', 0.06), ('agnostic', 0.058), ('lemma', 0.054), ('classi', 0.051), ('hi', 0.049), ('jk', 0.048), ('label', 0.046), ('outcomes', 0.045), ('argmink', 0.045), ('decide', 0.044), ('amounts', 0.043), ('unknown', 0.043), ('estimating', 0.041), ('noise', 0.04), ('labeled', 0.039), ('errors', 0.038), ('wish', 0.035), ('interval', 0.034), ('cation', 0.033), ('heads', 0.033), ('hoeffding', 0.033), ('singleton', 0.033), ('biased', 0.033), ('actual', 0.033), ('simulations', 0.032), ('log', 0.032), ('know', 0.032), ('inequality', 0.031), ('km', 0.031), ('theory', 0.03), ('estimate', 0.03), ('target', 0.03), ('dimension', 0.028), ('illustration', 0.028), ('suppose', 0.028), ('argmin', 0.027), ('crammer', 0.027), ('intervals', 0.026), ('circumstances', 0.026), ('levels', 0.026), ('examples', 0.026), ('perhaps', 0.025), ('panel', 0.025), ('functions', 0.025), ('quality', 0.025), ('simultaneously', 0.025), ('hypothesis', 0.025), ('entirely', 0.025), ('observed', 0.025), ('considerably', 0.025), ('uni', 0.024), ('minimization', 0.023), ('deviation', 0.023), ('optimal', 0.023), ('falls', 0.023), ('smallest', 0.023), ('developed', 0.023), ('upper', 0.023), ('examine', 0.022), ('settings', 0.022), ('setting', 0.022), ('words', 0.022), ('calculated', 0.022), ('nearby', 0.021), ('yielding', 0.021), ('empirical', 0.021), ('choose', 0.02), ('probability', 0.02), ('binary', 0.02), ('uncorrupted', 0.019), ('initiate', 0.019), ('argmini', 0.019), ('toss', 0.019), ('unlimited', 0.019), ('namely', 0.019), ('predicts', 0.019), ('unlabeled', 0.019), ('observations', 0.019), ('matrix', 0.019), ('removed', 0.019), ('approximation', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 117 nips-2005-Learning from Data of Variable Quality
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1
2 0.10944305 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
Author: Liam Paninski
Abstract: We discuss a method for obtaining a subject’s a priori beliefs from his/her behavior in a psychophysics context, under the assumption that the behavior is (nearly) optimal from a Bayesian perspective. The method is nonparametric in the sense that we do not assume that the prior belongs to any fixed class of distributions (e.g., Gaussian). Despite this increased generality, the method is relatively simple to implement, being based in the simplest case on a linear programming algorithm, and more generally on a straightforward maximum likelihood or maximum a posteriori formulation, which turns out to be a convex optimization problem (with no non-global local maxima) in many important cases. In addition, we develop methods for analyzing the uncertainty of these estimates. We demonstrate the accuracy of the method in a simple simulated coin-flipping setting; in particular, the method is able to precisely track the evolution of the subject’s posterior distribution as more and more data are observed. We close by briefly discussing an interesting connection to recent models of neural population coding.
3 0.10236206 85 nips-2005-Generalization to Unseen Cases
Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri
Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a datadependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1
4 0.071190439 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
5 0.070966691 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
6 0.069766536 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
7 0.06822861 104 nips-2005-Laplacian Score for Feature Selection
8 0.067271158 95 nips-2005-Improved risk tail bounds for on-line algorithms
9 0.065956526 76 nips-2005-From Batch to Transductive Online Learning
10 0.064469442 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
11 0.063996404 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
12 0.063573875 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
13 0.058138635 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
14 0.057822634 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
15 0.053062156 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
16 0.05302592 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
17 0.052261498 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
18 0.05206687 47 nips-2005-Consistency of one-class SVM and related algorithms
19 0.051163431 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
20 0.049412899 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
topicId topicWeight
[(0, 0.165), (1, 0.059), (2, -0.009), (3, -0.061), (4, 0.087), (5, 0.072), (6, -0.015), (7, 0.003), (8, -0.048), (9, -0.024), (10, -0.027), (11, 0.066), (12, 0.119), (13, -0.069), (14, -0.008), (15, -0.041), (16, -0.026), (17, -0.015), (18, 0.045), (19, 0.043), (20, 0.067), (21, 0.003), (22, 0.068), (23, 0.101), (24, 0.091), (25, 0.041), (26, -0.056), (27, 0.04), (28, 0.035), (29, 0.041), (30, 0.025), (31, 0.031), (32, 0.026), (33, 0.049), (34, -0.031), (35, 0.092), (36, -0.07), (37, -0.025), (38, -0.098), (39, 0.102), (40, 0.048), (41, 0.233), (42, -0.017), (43, -0.101), (44, 0.019), (45, -0.075), (46, 0.061), (47, 0.153), (48, -0.025), (49, -0.091)]
simIndex simValue paperId paperTitle
same-paper 1 0.93054813 117 nips-2005-Learning from Data of Variable Quality
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1
2 0.71305692 85 nips-2005-Generalization to Unseen Cases
Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri
Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a datadependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1
3 0.57138324 112 nips-2005-Learning Minimum Volume Sets
Author: Clayton Scott, Robert Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1
4 0.52052873 59 nips-2005-Dual-Tree Fast Gauss Transforms
Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray
Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1
5 0.47850227 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
Author: John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1
6 0.47771916 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
7 0.46728101 95 nips-2005-Improved risk tail bounds for on-line algorithms
8 0.45484856 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization
9 0.44899541 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
10 0.44156894 76 nips-2005-From Batch to Transductive Online Learning
11 0.43651158 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
12 0.42804721 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
13 0.40603676 104 nips-2005-Laplacian Score for Feature Selection
14 0.40358251 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
15 0.38129026 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
16 0.37976182 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.36145133 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
18 0.35356644 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
19 0.34551993 41 nips-2005-Coarse sample complexity bounds for active learning
20 0.3445988 84 nips-2005-Generalization in Clustering with Unobserved Features
topicId topicWeight
[(3, 0.503), (10, 0.037), (11, 0.01), (27, 0.023), (31, 0.017), (34, 0.071), (55, 0.022), (69, 0.038), (73, 0.022), (77, 0.013), (88, 0.096), (91, 0.061)]
simIndex simValue paperId paperTitle
1 0.98552763 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
same-paper 2 0.96762353 117 nips-2005-Learning from Data of Variable Quality
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1
3 0.95216393 159 nips-2005-Q-Clustering
Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes
Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
4 0.95116735 131 nips-2005-Multiple Instance Boosting for Object Detection
Author: Cha Zhang, John C. Platt, Paul A. Viola
Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier. 1
5 0.83845323 85 nips-2005-Generalization to Unseen Cases
Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri
Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a datadependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1
6 0.81436568 112 nips-2005-Learning Minimum Volume Sets
7 0.73617059 41 nips-2005-Coarse sample complexity bounds for active learning
8 0.72911996 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
9 0.710518 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
10 0.67355341 84 nips-2005-Generalization in Clustering with Unobserved Features
11 0.6637404 177 nips-2005-Size Regularized Cut for Data Clustering
12 0.66007727 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
13 0.65897948 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
14 0.65545833 58 nips-2005-Divergences, surrogate loss functions and experimental design
15 0.65243167 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
16 0.65192229 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
17 0.64810145 95 nips-2005-Improved risk tail bounds for on-line algorithms
18 0.6476391 74 nips-2005-Faster Rates in Regression via Active Learning
19 0.64278495 160 nips-2005-Query by Committee Made Real
20 0.63452548 47 nips-2005-Consistency of one-class SVM and related algorithms