nips nips2001 nips2001-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model $M_{\vec{n}}$ learned by the algorithm using $n_i$ examples in step $i$ ($\vec{n} = (n_1, \ldots, n_m)$), and the model $M_\infty$ that would be learned using infinite examples. Upper-bound the loss $L(M_{\vec{n}}, M_\infty)$ between them as a function of $\vec{n}$, and then minimize the algorithm's time complexity $f(\vec{n})$ subject to the constraint that $L(M_\infty, M_{\vec{n}})$ be at most $\epsilon$ with probability at most $\delta$. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.

1 An Approach to Large-Scale Learning

Large data sets make it possible to reliably learn complex models. On the other hand, they require large computational resources to learn from. While in the past the factor limiting the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is the time required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data.

Consider the simplest possible learning problem: estimating the mean of a random variable $x$. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most $\epsilon$ with probability at most $\delta$, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of $x$, only $n = \frac{1}{2}(R/\epsilon)^2 \ln(2/\delta)$ samples are needed, where $R$ is $x$'s range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps:

1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm.
2. Derive an upper bound on the time complexity of the learning algorithm, as a function of the number of samples used in each step.
3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss.

In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2] and k-means clustering [3]. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8]. Many approaches to speeding it up have been proposed (see Thiesson et al. [6] for a survey). Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5]: rather than minimize the total number of samples needed by the algorithm, we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves.
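To make the single-parameter case concrete, the Hoeffding sample size above can be computed directly. The sketch below is not code from the paper; the function name and the example numbers are illustrative only.

```python
import math

def hoeffding_sample_size(value_range: float, epsilon: float, delta: float) -> int:
    """Samples needed so the empirical mean of a variable with range R is within
    epsilon of the true mean with probability at least 1 - delta:
    n = (1/2) * (R / epsilon)**2 * ln(2 / delta)."""
    return math.ceil(0.5 * (value_range / epsilon) ** 2 * math.log(2.0 / delta))

# Example: range R = 1, error epsilon = 0.01, failure probability delta = 0.05
print(hoeffding_sample_size(1.0, 0.01, 0.05))  # about 18,445 samples, regardless of data set size
```

The point of the bound is visible in the example: the required sample size depends only on the range, the error tolerance, and the confidence, not on how many examples are actually available.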
2 A Loss Bound for EM

In a mixture of Gaussians model, each $D$-dimensional data point $x_j$ is assumed to have been independently generated by the following process: 1) randomly choose a mixture component $k$; 2) randomly generate a point from it according to a Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$. In this paper we will restrict ourselves to the case where the number $K$ of mixture components and the probability of selection $P(\mu_k)$ and covariance matrix for each component are known. Given a training set $S = \{x_1, \ldots, x_N\}$, the learning goal is then to find the maximum-likelihood estimates of the means $\mu_k$. The EM algorithm [1] accomplishes this by, starting from some set of initial means, alternating until convergence between estimating the probability $p(\mu_k|x_j)$ that each point was generated by each Gaussian (the E step), and computing the ML estimates of the means $\hat{\mu}_k = \sum_{j=1}^{N} w_{jk} x_j / \sum_{j=1}^{N} w_{jk}$ (the M step), where $w_{jk} = p(\mu_k|x_j)$ from the previous E step. In the basic EM algorithm, all $N$ examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only $n_i < N$ examples in the $i$th iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large $N$.

Let $M_{\vec{n}} = (\hat{\mu}_1, \ldots, \hat{\mu}_K)$ be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using $n_i$ examples in iteration $i$), and let $M_\infty = (\mu_1, \ldots, \mu_K)$ be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between $M_{\vec{n}}$ and $M_\infty$. A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones:

$L(M_{\vec{n}}, M_\infty) = \sum_{k=1}^{K} \|\hat{\mu}_k - \mu_k\|^2 = \sum_{k=1}^{K} \sum_{d=1}^{D} |\hat{\mu}_{kd} - \mu_{kd}|^2 \qquad (1)$

where $\hat{\mu}_{kd}$ is the $d$th coordinate of $\hat{\mu}_k$, and similarly for $\mu_{kd}$.

After any given iteration of EM, $|\hat{\mu}_{kd} - \mu_{kd}|$ has two components. One, which we call the sampling error, derives from the fact that $\hat{\mu}_{kd}$ is estimated from a finite sample, while $\mu_{kd}$ is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that, due to sampling errors in previous iterations, the weights $w_{jk}$ used to compute the two estimates may differ. Let $\mu_{kdi}$ be the infinite-data estimate of the $d$th coordinate of the $k$th mean produced in iteration $i$, $\hat{\mu}_{kdi}$ be the corresponding finite-data estimate, and $\tilde{\mu}_{kdi}$ be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration $i$ is $|\tilde{\mu}_{kdi} - \mu_{kdi}|$, the weighting error is $|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}|$, and the total error is $|\hat{\mu}_{kdi} - \mu_{kdi}| \leq |\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}| + |\tilde{\mu}_{kdi} - \mu_{kdi}|$.

Given bounds on the total error of each coordinate of each mean after iteration $i-1$, we can derive a bound on the weighting error after iteration $i$ as follows. Bounds on $\mu_{kd,i-1}$ for each $d$ imply bounds on $p(x_j|\mu_{ki})$ for each example $x_j$, obtained by substituting the maximum and minimum allowed distances between $x_{jd}$ and $\mu_{kd,i-1}$ into the expression of the Gaussian distribution. Let $p^+_{jki}$ be the upper bound on $p(x_j|\mu_{ki})$, and $p^-_{jki}$ be the lower bound. Then the weight of example $x_j$ in mean $\mu_{ki}$ can be bounded from below by $w^-_{jki} = p^-_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^+_{jk'i} P(\mu_{k'})$, and from above by $w^+_{jki} = \min\{p^+_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^-_{jk'i} P(\mu_{k'}),\, 1\}$. Let $w^{(+)}_{jki} = w^+_{jki}$ if $x_j \geq 0$ and $w^{(+)}_{jki} = w^-_{jki}$ otherwise, and let $w^{(-)}_{jki} = w^-_{jki}$ if $x_j \geq 0$ and $w^{(-)}_{jki} = w^+_{jki}$ otherwise. Then the weighting error $|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}|$ is bounded by the maximum amount by which the weighted mean $\sum_{j=1}^{n_i} w_{jki} x_{jd} / \sum_{j=1}^{n_i} w_{jki}$ can deviate from $\tilde{\mu}_{kdi}$ when each weight $w_{jki}$ is allowed to vary between $w^-_{jki}$ and $w^+_{jki}$ (Equation 2).
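As a rough illustration of the finite-data setting in this section, the sketch below runs one E step and one M step for the means on a random subsample of $n_i$ examples. It assumes, for simplicity and unlike the general treatment above, uniform selection probabilities $P(\mu_k)$ and a known spherical covariance $\sigma^2 I$ shared by all components, and it does not compute the error bounds derived in the paper; all names are illustrative.

```python
import numpy as np

def em_step_subsample(X, means, sigma, n_i, rng):
    """One EM iteration for a mixture of spherical Gaussians with known, shared
    covariance sigma^2 * I and uniform mixing proportions, run on a random
    subsample of n_i examples; only the means are re-estimated."""
    S = X[rng.choice(len(X), size=n_i, replace=False)]            # n_i x D subsample
    # E step: responsibilities w_jk = p(mu_k | x_j), from squared distances to the means
    sq_dist = ((S[:, None, :] - means[None, :, :]) ** 2).sum(-1)  # n_i x K
    log_p = -sq_dist / (2.0 * sigma ** 2)                         # log N(x_j; mu_k, sigma^2 I) up to a constant
    w = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # M step: mu_k = sum_j w_jk x_j / sum_j w_jk
    return (w.T @ S) / w.sum(axis=0)[:, None]

# Illustrative use: K = 3 means in D = 2, one iteration on 10,000 of 1,000,000 points
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 2))
means = rng.uniform(size=(3, 2))
means = em_step_subsample(X, means, sigma=0.1, n_i=10_000, rng=rng)
```

In the paper's VFEM algorithm the subsample size $n_i$ would itself be chosen from the run-time error bounds; here it is simply a parameter.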
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We propose the following general method for scaling learning algorithms to arbitrarily large data sets. [sent-6, score-0.09]
2 Consider the model Mii learned by the algorithm using ni examples in step i (ii = (nl , . [sent-7, score-0.404]
3 ,nm )) , and the model Moo that would be learned using infinite examples. [sent-10, score-0.119]
4 Upper-bound the loss L(Mii' M oo ) between them as a function of ii, and then minimize the algorithm's time complexity f(ii) subject to the constraint that L(Moo , M ii ) be at most f with probability at most 8. [sent-11, score-0.204]
5 We apply this method to the EM algorithm for mixtures of Gaussians. [sent-12, score-0.059]
6 Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. [sent-13, score-0.054]
7 While in the past the factor limit ing the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is t he t ime required to process it. [sent-16, score-0.086]
8 Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. [sent-17, score-0.056]
9 We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. [sent-18, score-0.087]
10 Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. [sent-19, score-0.162]
11 If we are willing to accept an error of at most f with probability at most 8, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = ~(R/f)2 1n (2/8) samples are needed, where R is x's range. [sent-22, score-0.139]
12 Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. [sent-25, score-0.283]
13 Derive an upper bound on the time complexity of the learning algorithm , as a function of the number of samples used in each step. [sent-27, score-0.216]
14 Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. [sent-29, score-0.219]
15 In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. [sent-30, score-0.059]
16 [5J: rather than minimize the total number of samples needed by the algorithm , we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. [sent-37, score-0.21]
17 , XN }, the learning goal is then to find the maximumlikelihood estimates of the means f-Lk. [sent-43, score-0.081]
18 In the basic EM algorithm, all N examples in the training set are used in each iteration. [sent-45, score-0.076]
19 The goal in this paper is to speed up EM by using only ni < N examples in the ith iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large N. [sent-46, score-0.519]
20 , using ni examples in iteration i), and let Moo = (f-L1, . [sent-52, score-0.484]
21 ,f-LK) be the vector obtained using infinite examples at each iteration. [sent-55, score-0.195]
22 After any given iteration of EM, lilkd - f-Lkdl has two components. [sent-58, score-0.154]
23 One, which we call the sampling error, derives from the fact that ilkd is estimated from a finite sample, while J-Lkd is estimated from an infinite one. [sent-59, score-0.34]
24 The other component, which we call the weighting error, derives from the fact that , due to sampling errors in previous iterations, the weights Wjk used to compute the two estimates may differ. [sent-60, score-0.187]
25 Let J-Lkdi be the infinite-data estimate of the dth coordinate of the kth mean produced in iteration i, flkdi be the corresponding finite-data estimate, and flkdi be the estimate that would be obtained if there were no weighting errors in that iteration. [sent-61, score-0.653]
26 Bounds on J-Lkd ,i-l for each d imply bounds on p(XjlJ-Lki ) for each example Xj, obtained by substituting the maximum and minimum allowed distances between Xjd and J-Lkd ,i-l into the expression of the Gaussian distribution. [sent-63, score-0.068]
27 Let P}ki be the upper bound on P(XjlJ-Lki) , and Pjki be the lower bound. [sent-64, score-0.152]
28 Then the weight of example Xj in mean J-Lki can be bounded from below by by W}ki W (+) jki = min{p}kiP(J-Lk)/ wjki = PjkiP(J-Lk)/ ~~= l P}k'iP(J-LU, and from above ~~=l Pjk'iP(J-LU, I}. [sent-65, score-0.448]
29 W jki) - W jki' f Xj _ 0 an d W jki) - W jki 0 th erWlse. [sent-67, score-0.242]
30 This bound is independent of the distribution of the data, which will ensure that our results are valid even if the data was not truly generated by a mixture of Gaussians, as is often the case in practice. [sent-69, score-0.246]
31 On the other hand, the bound is more conservative than distribution-dependent ones, requiring more samples to reach the same guarantees. [sent-70, score-0.19]
32 Therefore the weighting error in the first iteration is zero, and Equation 3 bounds the total error. [sent-72, score-0.329]
33 From this we can bound the weighting error in the second iteration according to Equation 2, and therefore bound the total error by the sum of Equations 2 and 3, and so on for each iteration until the algorithms converge. [sent-73, score-0.731]
34 If the finite- and infinite-data EM converge in the same number of iterations m, the loss due to finite data is L(Mii" Moo ) = ~f= l ~~= llflkdm - J-Lkdml 2 (see Equation 1). [sent-74, score-0.285]
35 In general 1 Although a normally distributed variable has infinite range, our experiments show that assuming a sufficiently wide finite range does not significantly affect the results. [sent-76, score-0.211]
36 (with probability specified below), infinite-data EM converges at one of the iterations for which the minimum possible change in mean positions is below ,,/, and is guaranteed to converge at the first iteration for which the maximum possible change is below "(. [sent-77, score-0.281]
37 More precisely, it converges at one of the iterations for which ~~= l ~~= l (max{ IPkd ,i- l - Pkdil-IPkd,i - l - ftkd,i - ll-IPkdi - ftkdil, O})2 ::; ,,/, and is guaranteed to converge at the first iteration for which ~~=l ~~=l (IPkd,i-l Pkdil + IPkd ,i-l - ftkd ,i-ll + IPkdi - ftkdil)2 ::; "/. [sent-78, score-0.253]
38 To obtain a bound for L(Mn, Moo), finite-data EM must be run until the latter condition holds. [sent-79, score-0.265]
39 Let I be the set of iterations at which infinite-data EM could have converged. [sent-80, score-0.085]
40 Then we finally obtain where m is the total number of iterations carried out. [sent-81, score-0.167]
41 This bound holds if all of the Hoeffding bounds (Equation 3) hold. [sent-82, score-0.22]
42 Since each of these bounds fails with probability at most 8, the bound above fails with probability at most 8* = K Dm8 (by the union bound). [sent-83, score-0.22]
43 As a result, the growth with K, D and m of the number of examples required to reach a given loss bound with a given probability is only O(v'lnKDm). [sent-84, score-0.38]
44 The bound we have just derived utilizes run-time information, namely the distance of each example to each mean along each coordinate in each iteration. [sent-85, score-0.231]
45 Notice also that it would be trivial to modify the treatment for any other loss criterion that depends only on the terms IPkdm - ftkdm I (e. [sent-87, score-0.121]
46 3 A Fast EM Algorithm We now apply the previous section's result to reduce the number of examples used by EM at each iteration while keeping the loss bounded. [sent-90, score-0.275]
47 The goal is to learn in minimum time a model whose loss relative to EM applied to infinite data is at most f* with probability at least 1 - 8*. [sent-92, score-0.267]
48 ) Using the notation of the previous section, if ni examples are used at each iteration then the running time of EM is O(KD ~::l ni) , and can be minimized by minimizing ~::l ni. [sent-94, score-0.536]
49 Assume for the moment that the number of iterations m is known. [sent-95, score-0.085]
50 We thus proceed by first minimizing ~::l ni subject to IIPkm - ftkmll ::; f* / K separately for each mean. [sent-99, score-0.428]
51 being equal to 1 - l/Gini(W~i)' where W~i is the vector of normalized weights wjki = wjkd 2:j,i=l Wjl ki. [sent-102, score-0.145]
52 when all the weights but one are zero, and a maximum of ni when all the weights are equal and non-zero. [sent-104, score-0.302]
53 However, we would like to have a measure whose maximum is independent of ni, so that it remains approximately constant whatever the value of ni chosen (for sufficiently large ni). [sent-105, score-0.326]
54 This captures the notion than the weighting error in one iteration should increase with the total error in the previous one. [sent-110, score-0.294]
55 Combining this with the bound for IliLki - ILkill, we obtain R 2 l n (2/8) 2f3kini (5) where CXki is the proportionality constant. [sent-111, score-0.152]
56 Given this equation and IIP-kO - ILkO II = 0, it can be shown by induction that m IIP-km - ILkmll :::; ~~ (6) where (7) The target bound will thus be satisfied by minimizing 2:: 1 ni subject to 2::1 (rkd,;niJ = J E* / K. [sent-112, score-0.616]
57 3 Finding the n/s by the method of Lagrange multipliers yields ni = ~ (f ~rkir%j) 2 (8) )=1 This equation will produce a required value of ni for each mean. [sent-113, score-0.684]
58 To guarantee the desired E*, it is sufficient to make ni equal to the maximum of these values. [sent-114, score-0.302]
59 The VFEM algorithm consists of a sequence of runs of EM, with each run using more examples than the last, until the bound L(Mii' Moo) :::; E* is satisfied, with L(Mii' Moo) bounded according to Equation 4. [sent-115, score-0.468]
60 In the first run, VFEM postulates a maximum number of iterations m, and uses it to set 8 = 8* / K Dm. [sent-116, score-0.085]
61 If m is exceeded, for the next run it is set to 50% more than the number needed in the current run. [sent-117, score-0.137]
62 (A new run will be carried out if either the 8* or E* target is not met. [sent-118, score-0.155]
63 ) The number of examples used in the first run of EM is the same for all iterations, and is set to 1. [sent-119, score-0.189]
64 This is 10% more than the number of examples that would theoretically be required in the best possible case (no weighting errors in the last 3This may lead to a suboptimal solution for the ni's, in the unlikely case that II increases with them. [sent-121, score-0.254]
65 Jtkm Ilflkm - iteration, leading to a pure Hoeffding bound, and a uniform distribution of examples among mixture components). [sent-122, score-0.111]
66 The numbers of examples for subsequent runs are set according to Equation 8. [sent-123, score-0.144]
67 For iterations beyond the last one in the previous run , the number of examples is set as for the first run. [sent-124, score-0.274]
68 A run of EM is terminated when L~= l L~= l (Iflkd ,i- l - flkdi 1+ Iflkd ,i-l - ILkd ,i-l l + Iflkdi - ILkdi 1)2 :s: "( (see discussion preceding Equation 4), or two iterations after L~=l IIILki - ILk,i-1 11 2 :s: "( 13, whichever comes first. [sent-125, score-0.369]
69 If the user target bound is E, E* is set to min{ E, "( 13}, to facilitate meeting the first criterion above. [sent-127, score-0.18]
70 When the convergence threshold for infinite-data EM was not reached even when using the whole training set, VFEM reports that it was unable to find a bound; otherwise the bound obtained is reported. [sent-128, score-0.152]
71 VFEM ensures that the total number of examples used in one run is always at least twice the number n used in the previous run. [sent-129, score-0.229]
72 This is done by, if L ni < 2n, setting the ni's instead to n~ = 2n(nil L ni). [sent-130, score-0.302]
73 If at any point L ni > mN, where m is the number of iterations carried out and N is the size of the full training set, Vi ni = N is used. [sent-131, score-0.731]
74 Thus, assuming that the number of iterations does not decrease with the number of examples, VFEM's total running time is always less than three times the time taken by the last run of EM. [sent-132, score-0.265]
75 (The worst case occurs when the one-but-last run is carried out on almost the full training set. [sent-133, score-0.155]
76 ) The run-time information gathered in one run is used to set the n/s for the next run. [sent-134, score-0.113]
77 The approximations made in the derivation will be good, and the resulting ni's accurate, if the means' paths in the current run are similar to those in the previous run. [sent-136, score-0.113]
78 This may not be true in the earlier runs , but their running time will be negligible compared to that of later runs , where the assumption of path similarity from one run to the next should hold. [sent-137, score-0.276]
79 4 Experiments We conducted a series of experiments on large synthetic data sets to compare VFEM with EM. [sent-138, score-0.054]
80 All data sets were generated by mixtures of spherical Gaussians with means ILk in the unit hypercube. [sent-139, score-0.147]
81 Each data set was generated according to three parameters: the dimensionality D , the number of mixture components K , and the standard deviation (Y of each coordinate in each component. [sent-140, score-0.145]
82 The means were generated one at a time by sampling each dimension uniformly from the range (2(Y,1 - 2(Y). [sent-141, score-0.157]
83 This ensured that most of the data points generated were within the unit hypercube. [sent-142, score-0.059]
84 Any ILk that was less than (vD1 K)(Y away from a previously generated mean was rejected and regenerated, since problems with very close means are unlikely to be solvable by either EM or VFEM. [sent-145, score-0.113]
85 Examples were generated by choosing one of the means ILk with uniform probability, and setting the value of each dimension of the example by randomly sampling from a Gaussian distribution with mean ILkd and standard deviation (Y. [sent-146, score-0.16]
86 We compared VFEM to EM on 64 data sets of 10 million examples each, generated by using every possible combination of the following parameters: D E {4, 8,12, 16}; K E {3, 4, 5, 6} ; (Y E {. [sent-147, score-0.163]
87 In each run the two algorithms were initialized with the same means, randomly selected with the constraint that no two be less than vD1(2K) apart. [sent-152, score-0.167]
88 VFEM was allowed to converge before EM's guaranteed convergence criterion was met (see discussion preceding Equation 4). [sent-153, score-0.116]
89 All experiments were run on a 1 GHz Pentium III machine under Linux, with "( = O. [sent-154, score-0.113]
90 Values are averages over the number of runs shown. [sent-159, score-0.068]
91 Runs Bound No bound All Algorithm VFEM EM VFEM EM VFEM EM #Runs 40 40 24 24 64 64 Time 217 3457 7820 4502 3068 3849 #EA 1. [sent-161, score-0.152]
92 Losses were computed relative to the true means, with the best match between true means and empirical ones found by greedy search. [sent-190, score-0.055]
93 Results for runs in which VFEM achieved and did not achieve the required E* and 8* bounds are reported separately. [sent-191, score-0.164]
94 VFEM achieved the required bounds and was able to stop early on 62. [sent-192, score-0.096]
95 The losses of the two algorithms were virtually identical in both situations. [sent-196, score-0.083]
96 When achieved , the average loss bound for VFEM was 0. [sent-198, score-0.245]
97 In other words, the means produced by both algorithms were virtually identical to those that would be obtained with infinite data. [sent-201, score-0.225]
98 5 Conclusion Learning algorithms can be sped up by minimizing the number of examples used in each step, under the constraint that the loss between the resulting model and the one that would be obtained with infinite data remain bounded. [sent-216, score-0.369]
99 In this paper we applied this method to the EM algorithm for mixtures of Gaussians, and observed the resulting speedups on a series of large data sets. [sent-217, score-0.113]
100 4The much higher loss values relative to the true means, however, indicate that infinitedata EM would often find only local optima (unless the greedy search itself only found a suboptimal match). [sent-218, score-0.137]
wordName wordTfidf (topN-words)
[('vfem', 0.507), ('ni', 0.302), ('em', 0.259), ('jki', 0.242), ('moo', 0.217), ('iflkdi', 0.193), ('mii', 0.193), ('bound', 0.152), ('flkdi', 0.145), ('wjki', 0.145), ('iipkm', 0.121), ('infinite', 0.119), ('ilk', 0.115), ('run', 0.113), ('iteration', 0.106), ('ilkd', 0.097), ('loss', 0.093), ('ki', 0.089), ('iterations', 0.085), ('domingos', 0.084), ('wjk', 0.084), ('weighting', 0.082), ('examples', 0.076), ('xj', 0.075), ('dth', 0.072), ('ftki', 0.072), ('thiesson', 0.072), ('hoeffding', 0.071), ('uj', 0.071), ('bounds', 0.068), ('runs', 0.068), ('meek', 0.063), ('means', 0.055), ('ii', 0.053), ('equation', 0.052), ('coordinate', 0.051), ('ftkdil', 0.048), ('ftkmll', 0.048), ('iflkd', 0.048), ('iipki', 0.048), ('ilftki', 0.048), ('ililki', 0.048), ('ilkill', 0.048), ('ipkd', 0.048), ('lilkd', 0.048), ('sampling', 0.047), ('suboptimal', 0.044), ('finite', 0.043), ('kd', 0.042), ('redmond', 0.042), ('wolman', 0.042), ('carried', 0.042), ('total', 0.04), ('samples', 0.038), ('rr', 0.036), ('converge', 0.035), ('mixture', 0.035), ('acm', 0.034), ('arbitrarily', 0.034), ('gaussians', 0.034), ('derives', 0.034), ('bounded', 0.033), ('mixtures', 0.033), ('error', 0.033), ('ea', 0.032), ('losses', 0.032), ('growth', 0.031), ('satisfied', 0.031), ('generated', 0.03), ('data', 0.029), ('subject', 0.029), ('minimize', 0.029), ('mean', 0.028), ('mn', 0.028), ('million', 0.028), ('dempster', 0.028), ('criterion', 0.028), ('required', 0.028), ('running', 0.027), ('guaranteed', 0.027), ('initialized', 0.027), ('algorithms', 0.027), ('washington', 0.026), ('goal', 0.026), ('preceding', 0.026), ('databases', 0.026), ('algorithm', 0.026), ('minimizing', 0.025), ('range', 0.025), ('min', 0.025), ('induction', 0.025), ('wa', 0.025), ('series', 0.025), ('errors', 0.024), ('proceed', 0.024), ('virtually', 0.024), ('mining', 0.024), ('needed', 0.024), ('microsoft', 0.024), ('sufficiently', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
2 0.11332601 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds. £ ¡ ¤¢
3 0.11061838 68 nips-2001-Entropy and Inference, Revisited
Author: Ilya Nemenman, F. Shafee, William Bialek
Abstract: We study properties of popular near–uniform (Dirichlet) priors for learning undersampled probability distributions on discrete nonmetric spaces and show that they lead to disastrous results. However, an Occam–style phase space argument expands the priors into their infinite mixture and resolves most of the observed problems. This leads to a surprisingly good estimator of entropies of discrete distributions. Learning a probability distribution from examples is one of the basic problems in data analysis. Common practical approaches introduce a family of parametric models, leading to questions about model selection. In Bayesian inference, computing the total probability of the data arising from a model involves an integration over parameter space, and the resulting “phase space volume” automatically discriminates against models with larger numbers of parameters—hence the description of these volume terms as Occam factors [1, 2]. As we move from finite parameterizations to models that are described by smooth functions, the integrals over parameter space become functional integrals and methods from quantum field theory allow us to do these integrals asymptotically; again the volume in model space consistent with the data is larger for models that are smoother and hence less complex [3]. Further, at least under some conditions the relevant degree of smoothness can be determined self–consistently from the data, so that we approach something like a model independent method for learning a distribution [4]. The results emphasizing the importance of phase space factors in learning prompt us to look back at a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here the probability distribution is just a list of numbers {q i }, i = 1, 2, · · · , K, where K is the number of bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to believe that any q i and qj should be similar. The task is to learn this distribution from a set of examples, which we can describe as the number of times ni each possibility is observed in a set of N = K ni i=1 samples. This problem arises in the context of language, where the index i might label words or phrases, so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index i might label n–mers of the the DNA or amino acid sequence, and although most work in the field is based on metrics for sequence comparison one might like an alternative approach that does not rest on such assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes a set of discrete “words,” and estimates of the information content in the response are de- termined by the probability distribution on this discrete space. What all of these examples have in common is that we often need to draw some conclusions with data sets that are not in the asymptotic limit N K. Thus, while we might use a large corpus to sample the distribution of words in English by brute force (reaching N K with K the size of the vocabulary), we can hardly do the same for three or four word phrases. In models described by continuous functions, the infinite number of “possibilities” can never be overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of this notion that we can apply in the discrete case? 
Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of “maximum entropy inference” is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime N K requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when N ≤ K, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime. At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural “uniform” prior is given by Pu ({qi }) = 1 δ Zu K 1− K qi i=1 , Zu = A dq1 dq2 · · · dqK δ 1− qi (1) i=1 where the delta function imposes the normalization, Zu is the total volume in the space of models, and the integration domain A is such that each qi varies in the range [0, 1]. Note that, because of the normalization constraint, an individual q i chosen from this distribution in fact is not uniformly distributed—this is also an example of phase space effects, since in choosing one qi we constrain all the other {qj=i }. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from {qi }, then we calculate the probability of the model {qi } with the usual Bayes rule: 1 K P ({qi }|{ni }) = P ({ni }|{qi })Pu ({qi }) , P ({ni }|{qi }) = (qi )ni . Pu ({ni }) i=1 (2) If we want the best estimate of the probability qi in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7] qi = ni + 1 . N +K (3) Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an “extra count” in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero. 1 If the data are unordered, extra combinatorial factors have to be included in P ({ni }|{qi }). However, these cancel immediately in later expressions. A natural generalization of these ideas is to consider priors that have a power–law dependence on the probabilities, the so called Dirichlet family of priors: K Pβ ({qi }) = 1 δ 1− qi Z(β) i=1 K β−1 qi , (4) i=1 It is interesting to see what typical distributions from these priors look like. Even though different qi ’s are not independent random variables due to the normalizing δ–function, generation of random distributions is still easy: one can show that if q i ’s are generated successively (starting from i = 1 and proceeding up to i = K) from the Beta–distribution j
4 0.10166986 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
5 0.10092302 119 nips-2001-Means, Correlations and Bounds
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1
6 0.080569789 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
7 0.079477899 143 nips-2001-PAC Generalization Bounds for Co-training
8 0.078145579 139 nips-2001-Online Learning with Kernels
9 0.076732062 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
10 0.073870368 8 nips-2001-A General Greedy Approximation Algorithm with Applications
11 0.07247933 61 nips-2001-Distribution of Mutual Information
12 0.072467312 164 nips-2001-Sampling Techniques for Kernel Methods
13 0.066899963 137 nips-2001-On the Convergence of Leveraging
14 0.060319811 31 nips-2001-Algorithmic Luckiness
15 0.058402054 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
16 0.057970777 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
17 0.056454886 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
18 0.055247787 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
19 0.054263521 193 nips-2001-Unsupervised Learning of Human Motion Models
20 0.052847818 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
topicId topicWeight
[(0, -0.185), (1, 0.05), (2, 0.043), (3, 0.024), (4, 0.047), (5, -0.121), (6, 0.022), (7, 0.044), (8, -0.057), (9, -0.065), (10, 0.044), (11, 0.095), (12, -0.002), (13, -0.025), (14, -0.045), (15, 0.084), (16, -0.03), (17, -0.092), (18, -0.042), (19, -0.039), (20, -0.098), (21, 0.09), (22, -0.059), (23, -0.142), (24, -0.058), (25, -0.064), (26, -0.053), (27, 0.012), (28, 0.09), (29, -0.053), (30, -0.072), (31, -0.032), (32, 0.019), (33, -0.025), (34, -0.072), (35, -0.022), (36, -0.09), (37, 0.084), (38, -0.117), (39, 0.074), (40, -0.084), (41, 0.097), (42, 0.02), (43, -0.104), (44, 0.003), (45, 0.107), (46, 0.04), (47, -0.041), (48, -0.013), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.93883258 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
2 0.69654399 68 nips-2001-Entropy and Inference, Revisited
Author: Ilya Nemenman, F. Shafee, William Bialek
Abstract: We study properties of popular near–uniform (Dirichlet) priors for learning undersampled probability distributions on discrete nonmetric spaces and show that they lead to disastrous results. However, an Occam–style phase space argument expands the priors into their infinite mixture and resolves most of the observed problems. This leads to a surprisingly good estimator of entropies of discrete distributions. Learning a probability distribution from examples is one of the basic problems in data analysis. Common practical approaches introduce a family of parametric models, leading to questions about model selection. In Bayesian inference, computing the total probability of the data arising from a model involves an integration over parameter space, and the resulting “phase space volume” automatically discriminates against models with larger numbers of parameters—hence the description of these volume terms as Occam factors [1, 2]. As we move from finite parameterizations to models that are described by smooth functions, the integrals over parameter space become functional integrals and methods from quantum field theory allow us to do these integrals asymptotically; again the volume in model space consistent with the data is larger for models that are smoother and hence less complex [3]. Further, at least under some conditions the relevant degree of smoothness can be determined self–consistently from the data, so that we approach something like a model independent method for learning a distribution [4]. The results emphasizing the importance of phase space factors in learning prompt us to look back at a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here the probability distribution is just a list of numbers {q i }, i = 1, 2, · · · , K, where K is the number of bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to believe that any q i and qj should be similar. The task is to learn this distribution from a set of examples, which we can describe as the number of times ni each possibility is observed in a set of N = K ni i=1 samples. This problem arises in the context of language, where the index i might label words or phrases, so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index i might label n–mers of the the DNA or amino acid sequence, and although most work in the field is based on metrics for sequence comparison one might like an alternative approach that does not rest on such assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes a set of discrete “words,” and estimates of the information content in the response are de- termined by the probability distribution on this discrete space. What all of these examples have in common is that we often need to draw some conclusions with data sets that are not in the asymptotic limit N K. Thus, while we might use a large corpus to sample the distribution of words in English by brute force (reaching N K with K the size of the vocabulary), we can hardly do the same for three or four word phrases. In models described by continuous functions, the infinite number of “possibilities” can never be overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of this notion that we can apply in the discrete case? 
Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of “maximum entropy inference” is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime N K requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when N ≤ K, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime. At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural “uniform” prior is given by Pu ({qi }) = 1 δ Zu K 1− K qi i=1 , Zu = A dq1 dq2 · · · dqK δ 1− qi (1) i=1 where the delta function imposes the normalization, Zu is the total volume in the space of models, and the integration domain A is such that each qi varies in the range [0, 1]. Note that, because of the normalization constraint, an individual q i chosen from this distribution in fact is not uniformly distributed—this is also an example of phase space effects, since in choosing one qi we constrain all the other {qj=i }. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from {qi }, then we calculate the probability of the model {qi } with the usual Bayes rule: 1 K P ({qi }|{ni }) = P ({ni }|{qi })Pu ({qi }) , P ({ni }|{qi }) = (qi )ni . Pu ({ni }) i=1 (2) If we want the best estimate of the probability qi in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7] qi = ni + 1 . N +K (3) Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an “extra count” in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero. 1 If the data are unordered, extra combinatorial factors have to be included in P ({ni }|{qi }). However, these cancel immediately in later expressions. A natural generalization of these ideas is to consider priors that have a power–law dependence on the probabilities, the so called Dirichlet family of priors: K Pβ ({qi }) = 1 δ 1− qi Z(β) i=1 K β−1 qi , (4) i=1 It is interesting to see what typical distributions from these priors look like. Even though different qi ’s are not independent random variables due to the normalizing δ–function, generation of random distributions is still easy: one can show that if q i ’s are generated successively (starting from i = 1 and proceeding up to i = K) from the Beta–distribution j
3 0.61949903 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous-valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of an order of magnitude improvement vs. the best deterministic neural net bounds.
4 0.59550285 31 nips-2001-Algorithmic Luckiness
Author: Ralf Herbrich, Robert C. Williamson
Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1
5 0.56147879 143 nips-2001-PAC Generalization Bounds for Co-training
Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester
Abstract: The rule-based bootstrapping introduced by Yarowsky, and its co-training variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences (partial rules and partial labeling of the unlabeled data) and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of k labels for k >= 2.
6 0.55142063 119 nips-2001-Means, Correlations and Bounds
7 0.5291329 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
8 0.49698457 61 nips-2001-Distribution of Mutual Information
9 0.49618769 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
10 0.47111759 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
11 0.45363191 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
12 0.42597947 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
13 0.41787827 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
14 0.3881506 36 nips-2001-Approximate Dynamic Programming via Linear Programming
15 0.38141015 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
16 0.37651753 42 nips-2001-Bayesian morphometry of hippocampal cells suggests same-cell somatodendritic repulsion
17 0.37566662 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
18 0.3713541 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
19 0.36519867 167 nips-2001-Semi-supervised MarginBoost
20 0.36172014 108 nips-2001-Learning Body Pose via Specialized Maps
topicId topicWeight
[(12, 0.293), (14, 0.035), (17, 0.038), (19, 0.046), (27, 0.162), (30, 0.045), (36, 0.014), (38, 0.015), (59, 0.041), (72, 0.082), (79, 0.04), (83, 0.019), (91, 0.096)]
simIndex simValue paperId paperTitle
1 0.85827518 147 nips-2001-Pranking with Ranking
Author: Koby Crammer, Yoram Singer
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. 1
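Since the abstract describes the algorithm only at a high level, the sketch below shows the general shape of such an online ranker: a single weight vector plus k-1 ordered thresholds, updated perceptron-style on mistakes. The class name, API, and this particular update rule are our own illustration and are not claimed to be the authors' exact algorithm.

```python
import numpy as np

class ThresholdRanker:
    """Online rank predictor: one weight vector w and k-1 thresholds b_1 <= ... <= b_{k-1}.
    The predicted rank is the smallest r whose threshold the score w.x fails to reach."""

    def __init__(self, n_features, k):
        self.w = np.zeros(n_features)
        self.b = np.zeros(k - 1)
        self.k = k

    def predict(self, x):
        score = self.w @ x
        below = np.nonzero(score - self.b < 0)[0]
        return int(below[0]) + 1 if below.size else self.k

    def fit_one(self, x, y):
        """One online step: on a mistake, move w and every threshold that sits on the
        wrong side of the true rank y (ranks are 1..k)."""
        y_hat = self.predict(x)
        if y_hat != y:
            score = self.w @ x
            y_r = np.where(np.arange(1, self.k) < y, 1.0, -1.0)    # +1 iff true rank is above threshold r
            tau = np.where((score - self.b) * y_r <= 0, y_r, 0.0)  # only violated thresholds move
            self.w += tau.sum() * x
            self.b -= tau
        return y_hat

# usage sketch on synthetic data with ranks 1..5
rng = np.random.default_rng(0)
ranker = ThresholdRanker(n_features=10, k=5)
for _ in range(1000):
    x = rng.standard_normal(10)
    y = int(np.clip(np.rint(x[:3].sum()) + 3, 1, 5))   # rank loosely correlated with x
    ranker.fit_one(x, y)
```

Each mistake moves only the thresholds that disagree with the true rank, together with the weight vector, so the predicted rank is pulled toward the target while the thresholds stay ordered.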
same-paper 2 0.78684145 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mii learned by the algorithm using ni examples in step i (ii = (nl , ... ,nm )) , and the model Moo that would be learned using infinite examples. Upper-bound the loss L(Mii' M oo ) between them as a function of ii, and then minimize the algorithm's time complexity f(ii) subject to the constraint that L(Moo , M ii ) be at most f with probability at most 8. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. 1 An Approach to Large-Scale Learning Large data sets make it possible to reliably learn complex models. On the other hand , they require large computational resources to learn from. While in the past the factor limit ing the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is t he t ime required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable x. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most f with probability at most 8, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = ~(R/f)2 1n (2/8) samples are needed, where R is x's range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: 1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. 2. Derive an upper bound on the time complexity of the learning algorithm , as a function of the number of samples used in each step. 3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2J and k-means clustering [3J. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8J. Many approaches to speeding it up have been proposed (see Thiesson et al. [6J for a survey) . Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5J: rather than minimize the total number of samples needed by the algorithm , we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. 
2 A Loss Bound for EM
In a mixture of Gaussians model, each $D$-dimensional data point $x_j$ is assumed to have been independently generated by the following process: 1) randomly choose a mixture component $k$; 2) randomly generate a point from it according to a Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$. In this paper we will restrict ourselves to the case where the number $K$ of mixture components and the probability of selection $P(\mu_k)$ and covariance matrix for each component are known. Given a training set $S = \{x_1, \ldots, x_N\}$, the learning goal is then to find the maximum-likelihood estimates of the means $\mu_k$. The EM algorithm [1] accomplishes this by, starting from some set of initial means, alternating until convergence between estimating the probability $p(\mu_k|x_j)$ that each point was generated by each Gaussian (the E step), and computing the ML estimates of the means $\hat{\mu}_k = \sum_{j=1}^{N} w_{jk} x_j / \sum_{j=1}^{N} w_{jk}$ (the M step), where $w_{jk} = p(\mu_k|x_j)$ from the previous E step. In the basic EM algorithm, all $N$ examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only $n_i < N$ examples in the $i$th iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large $N$. Let $M_{\vec{n}} = (\hat{\mu}_1, \ldots, \hat{\mu}_K)$ be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using $n_i$ examples in iteration $i$), and let $M_\infty = (\mu_1, \ldots, \mu_K)$ be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between $M_{\vec{n}}$ and $M_\infty$. A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones:

$$L(M_{\vec{n}}, M_\infty) = \sum_{k=1}^{K} \|\hat{\mu}_k - \mu_k\|^2 = \sum_{k=1}^{K} \sum_{d=1}^{D} |\hat{\mu}_{kd} - \mu_{kd}|^2 \qquad (1)$$

where $\hat{\mu}_{kd}$ is the $d$th coordinate of $\hat{\mu}_k$, and similarly for $\mu_{kd}$. After any given iteration of EM, $|\hat{\mu}_{kd} - \mu_{kd}|$ has two components. One, which we call the sampling error, derives from the fact that $\hat{\mu}_{kd}$ is estimated from a finite sample, while $\mu_{kd}$ is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that, due to sampling errors in previous iterations, the weights $w_{jk}$ used to compute the two estimates may differ. Let $\mu_{kdi}$ be the infinite-data estimate of the $d$th coordinate of the $k$th mean produced in iteration $i$, $\hat{\mu}_{kdi}$ be the corresponding finite-data estimate, and $\tilde{\mu}_{kdi}$ be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration $i$ is $|\tilde{\mu}_{kdi} - \mu_{kdi}|$, the weighting error is $|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}|$, and the total error is $|\hat{\mu}_{kdi} - \mu_{kdi}| \leq |\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}| + |\tilde{\mu}_{kdi} - \mu_{kdi}|$. Given bounds on the total error of each coordinate of each mean after iteration $i-1$, we can derive a bound on the weighting error after iteration $i$ as follows. Bounds on $\mu_{kd,i-1}$ for each $d$ imply bounds on $p(x_j|\mu_{ki})$ for each example $x_j$, obtained by substituting the maximum and minimum allowed distances between $x_{jd}$ and $\mu_{kd,i-1}$ into the expression of the Gaussian distribution. Let $p^{+}_{jki}$ be the upper bound on $p(x_j|\mu_{ki})$, and $p^{-}_{jki}$ be the lower bound. Then the weight of example $x_j$ in mean $\mu_{ki}$ can be bounded from below by $w^{-}_{jki} = p^{-}_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^{+}_{jk'i} P(\mu_{k'})$, and from above by $w^{+}_{jki} = \min\{p^{+}_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^{-}_{jk'i} P(\mu_{k'}),\ 1\}$. Let $\bar{w}^{(+)}_{jki} = w^{+}_{jki}$ if $x_j \geq 0$ and $w^{-}_{jki}$ otherwise, and let $\bar{w}^{(-)}_{jki} = w^{-}_{jki}$ if $x_j \geq 0$ and $w^{+}_{jki}$ otherwise. Then $|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}| \leq \max \ldots \sum_{j=1}^{n_i} w_{jki}\, x_j \ldots$
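To make the sub-sampled E and M steps above concrete, here is a small sketch of EM for a mixture of Gaussians in which only the means are re-estimated and iteration i uses only n_i randomly drawn examples. The shared spherical covariance `sigma2`, the function name, and the externally supplied `schedule` are our own simplifications; in the paper the per-iteration sample sizes would be chosen so that the loss bound holds, which is not reproduced here.

```python
import numpy as np

def subsampled_em_means(X, mu0, priors, sigma2, schedule, rng=None):
    """EM for a mixture of Gaussians where only the means are unknown (mixing
    proportions `priors` and a shared spherical variance `sigma2` are given),
    using schedule[i] randomly drawn examples in iteration i instead of all N."""
    rng = np.random.default_rng(rng)
    mu = np.array(mu0, dtype=float)                 # (K, D) current mean estimates
    priors = np.asarray(priors, dtype=float)        # (K,) known selection probabilities
    for n_i in schedule:
        S = X[rng.choice(len(X), size=min(n_i, len(X)), replace=False)]
        # E step: responsibilities w_jk = p(mu_k | x_j) under the current means.
        sq = ((S[:, None, :] - mu[None, :, :]) ** 2).sum(-1)    # squared distances, (n_i, K)
        logp = -0.5 * sq / sigma2 + np.log(priors)
        logp -= logp.max(axis=1, keepdims=True)                 # for numerical stability
        w = np.exp(logp)
        w /= w.sum(axis=1, keepdims=True)
        # M step: mu_k = sum_j w_jk x_j / sum_j w_jk, as in the text.
        mu = (w.T @ S) / w.sum(axis=0)[:, None]
    return mu
```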
3 0.69216764 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets.
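As a rough illustration of the idea, the sketch below fits a linear SVM on the query's unweighted neighborhood and uses the magnitude of its weight vector to reweight features in the nearest-neighbor distance. The function, its parameters, and this simple weighting are our own simplification of the scheme described in the abstract, not the authors' exact method.

```python
import numpy as np
from sklearn.svm import SVC

def locally_weighted_knn(query, X, y, n_local=50, n_neighbors=5):
    """Classify `query` by a nearest-neighbor vote whose per-feature weights come from a
    linear SVM fit on the query's neighborhood, so that features aligned with the locally
    most discriminant direction dominate the distance."""
    d = np.linalg.norm(X - query, axis=1)
    local = np.argsort(d)[:n_local]
    if np.unique(y[local]).size < 2:               # pure neighborhood: nothing to separate
        return y[local][0]
    svm = SVC(kernel="linear").fit(X[local], y[local])
    w = np.abs(svm.coef_).sum(axis=0)              # combine one-vs-one rows in the multiclass case
    w = w / (w.sum() + 1e-12)
    dw = np.sqrt(((X - query) ** 2 * w).sum(axis=1))
    nearest = np.argsort(dw)[:n_neighbors]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```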
4 0.59803796 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
Author: Michael Collins, S. Dasgupta, Robert E. Schapire
Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
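As one concrete instance of the generalization described, the sketch below handles the Bernoulli (logistic) case on a binary matrix: it fits low-rank natural parameters by plain gradient descent on the logistic loss. The function, its hyperparameters, and the use of gradient descent are our own illustration under that assumption, not the authors' algorithm; with a Gaussian model and squared loss the same approach recovers standard PCA's low-rank fit.

```python
import numpy as np

def logistic_pca(X, rank, iters=500, lr=0.1, rng=None):
    """Exponential-family PCA for binary data (Bernoulli case): find low-rank natural
    parameters Theta = A @ B by gradient descent on the Bernoulli negative log-likelihood,
    where P(X_ij = 1) = sigmoid(Theta_ij) and X takes values in {0, 1}."""
    rng = np.random.default_rng(rng)
    n, m = X.shape
    A = 0.01 * rng.standard_normal((n, rank))
    B = 0.01 * rng.standard_normal((rank, m))
    for _ in range(iters):
        theta = np.clip(A @ B, -30.0, 30.0)            # clip for numerical stability
        grad_theta = 1.0 / (1.0 + np.exp(-theta)) - X  # d(neg. log-likelihood)/d(theta)
        grad_A = grad_theta @ B.T
        grad_B = A.T @ grad_theta
        A -= lr * grad_A / m
        B -= lr * grad_B / n
    return A, B                                        # reconstruction: sigmoid(A @ B)
```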
5 0.59506005 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
Author: Marzia Polito, Pietro Perona
Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1
6 0.59403217 8 nips-2001-A General Greedy Approximation Algorithm with Applications
7 0.59303606 13 nips-2001-A Natural Policy Gradient
8 0.58780658 137 nips-2001-On the Convergence of Leveraging
9 0.58622825 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
11 0.58248079 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
12 0.58246619 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
13 0.58206618 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
14 0.58139676 155 nips-2001-Quantizing Density Estimators
15 0.58107215 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
16 0.57902992 188 nips-2001-The Unified Propagation and Scaling Algorithm
17 0.57825094 190 nips-2001-Thin Junction Trees
18 0.57807195 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
19 0.57787317 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
20 0.57785732 44 nips-2001-Blind Source Separation via Multinode Sparse Representation