nips nips2001 nips2001-119 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1, 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The partition function for a Boltzmann machine can be bounded from above and below. [sent-9, score-0.296]
2 We can use this to bound the means and the correlations. [sent-10, score-0.417]
3 For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1, 1]). [sent-11, score-0.043]
4 Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. [sent-14, score-0.441]
5 1 Introduction Over the last decade, bounding techniques have become a popular tool to deal with graphical models that are too complex for exact computation. [sent-15, score-0.306]
6 A nice property of bounds is that they give at least some information you can rely on. [sent-16, score-0.234]
7 An ordinary approximation might be more accurate, but in practical situations there is absolutely no guarantee of that. [sent-20, score-0.098]
8 The best known bound is probably the mean field bound, which has been described for Boltzmann machines in [1] and later for sigmoid belief networks in [2]. [sent-21, score-1.329]
9 Apart from its bounding properties, mean field theory is a commonly used approximation technique as well. [sent-22, score-0.459]
10 Recently this first order bound was extended to a third order approximation for Boltzmann machines and sigmoid belief networks in [3] and [4], where it was shown that this particular third order expansion is still a bound. [sent-23, score-1.072]
11 In 1996 an upper bound for Boltzmann machines was described in [5] and [6]. [sent-24, score-0.594]
12 In the same articles the authors derive an upper bound for a special case of sigmoid belief networks: the two-layered networks. [sent-25, score-0.776]
13 In this article we will focus solely on Boltzmann machines, but an extension to sigmoid belief networks is quite straightforward. [sent-26, score-0.304]
14 This article is organized as follows: In section 2 we start with the general theory about bounding techniques. [sent-27, score-0.227]
15 Later in that section the upper and lower bound are briefly described. [sent-28, score-0.617]
16 For a full explanation we refer to the articles mentioned before. [sent-29, score-0.088]
17 The section is concluded by explaining how these bounds on the partition function can be used to bound means and correlations. [sent-30, score-0.936]
18 In section 3 results are shown for fully connected Boltzmann machines, where the size of the weights and thresholds as well as the network size are varied. [sent-31, score-0.311]
19 2 Theory There exists a general method to create a class of polynomials of a certain order, which all bound a function of interest, f_0(x). [sent-33, score-0.389]
20 Such a class of order 2n can be found if the 2n-th order derivative of f_0(x), written as h_n(x), can be bounded by a constant. [sent-34, score-0.18]
21 When this constant is zero, the class is actually of order 2n-1. [sent-35, score-0.079]
22 It turns out that this class is parameterized by n free parameters. [sent-36, score-0.125]
23 Suppose we have a function b_{2k}, for some integer k, which bounds the function f_{2k} from below (an upper bound can be written as a lower bound by using the negative of both functions). [sent-37, score-1.125]
24 Thus ∀x: f_{2k}(x) ≥ b_{2k}(x)   (1). Now construct the primitive functions f_{2k-1} and b_{2k-1} such that f_{2k-1}(p) = b_{2k-1}(p) for a freely chosen value p. [sent-38, score-0.143]
25 This constraint can always be achieved by adding an appropriate constant to the primitive function b_{2k-1}. [sent-39, score-0.091]
26 If we repeat this procedure and construct the primitive functions f_{2k-2} and b_{2k-2} such that f_{2k-2}(p) = b_{2k-2}(p) for the same p, one can verify that ∀x: f_{2k-2}(x) ≥ b_{2k-2}(x)   (3). Thus, given a bound f_{2k}(x) ≥ b_{2k}(x), we can construct a class of bounding functions for f_{2k-2} parameterized by p. [sent-41, score-0.678]
27 Since we assumed h_n(x) can be bounded from below by a constant, we can apply the procedure n times and we finally find f_0(x) ≥ b_0(x), where b_0(x) is parameterized by n free parameters. [sent-42, score-0.247]
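The upgrading step itself is easy to check numerically: integrate the bounding function, shift the result so that it touches the corresponding primitive at the matching point p, and repeat. The Python sketch below is an illustration only (not code or notation from the paper); the toy function cos x and the matching point p = 0.7 are arbitrary choices. It applies two such steps to the constant lower bound on the second derivative of cos x and verifies that the resulting quadratic is a global lower bound on cos x itself.

    import numpy as np
    from scipy.integrate import cumulative_trapezoid

    x = np.linspace(-4.0, 4.0, 8001)
    p = 0.7                                    # the freely chosen matching point

    def upgrade(b, f_primitive, x, p):
        # One upgrading step: integrate the bound b and add a constant so that
        # the result equals the primitive of the bounded function at x = p.
        B = cumulative_trapezoid(b, x, initial=0.0)
        ip = np.argmin(np.abs(x - p))
        return B - B[ip] + f_primitive[ip]

    # f_0(x) = cos(x); its second derivative f_2(x) = -cos(x) is bounded below by -1.
    b2 = np.full_like(x, -1.0)
    b1 = upgrade(b2, -np.sin(x), x, p)   # matched to the primitive f_1(x) = -sin(x)
    b0 = upgrade(b1, np.cos(x), x, p)    # matched to the primitive f_0(x) = cos(x)

    # b_1 only bounds f_1 on one side of p, but b_0 bounds f_0 everywhere.
    assert np.all(np.cos(x) - b0 >= -1e-9)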
28 2.1 A third order lower bound for Boltzmann machines Boltzmann machines are stochastic neural networks with N binary valued neurons, s_i, which are connected by symmetric weights w_{ij}. [sent-45, score-0.969]
29 Due to this symmetry the probability distribution is a Boltzmann-Gibbs distribution which is given by (see also [7]): p(s | θ, w) = (1/Z) exp( (1/2) Σ_{ij} w_{ij} s_i s_j + Σ_i θ_i s_i ) = (1/Z) exp( -E(s, θ, w) )   (4) [sent-46, score-0.053]
30 where the θ_i are threshold values and Z(θ, w) = Σ_{all s} exp( -E(s, θ, w) )   (5) is the normalization known as the partition function. [sent-47, score-0.344]
31 This partition function is especially important, since statistical quantities such as means and correlations can be directly derived from it. [sent-48, score-0.535]
32 For instance, the means can be computed as ⟨s_n⟩ = Σ_{all s} p(s | θ, w) s_n = Σ_{all s\s_n} [ p(s, s_n = +1 | θ, w) - p(s, s_n = -1 | θ, w) ] = ( Z_+(θ, w) - Z_-(θ, w) ) / Z(θ, w)   (6), where Z_+ and Z_- are partition functions over a network with s_n clamped to +1 and -1, respectively. [sent-49, score-0.438]
33 This explains why the objective of almost any approximation method is the partition function given by equation 5. [sent-50, score-0.368]
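For a small network, equations 4 to 6 can be evaluated exactly by enumerating all 2^N states. The Python sketch below is not code from the paper (the function names are made up for illustration); it computes the partition function, the clamped partition functions Z_+ and Z_-, and the exact mean of a chosen neuron.

    import numpy as np
    from itertools import product

    def partition_function(w, theta, clamp=None):
        # Z(theta, w) = sum over all states s of exp(0.5 s.w.s + theta.s);
        # clamp is an optional dict {neuron index: +1 or -1}.
        N = len(theta)
        Z = 0.0
        for c in product([-1.0, 1.0], repeat=N):
            s = np.array(c)
            if clamp and any(s[i] != v for i, v in clamp.items()):
                continue
            Z += np.exp(0.5 * s @ w @ s + theta @ s)
        return Z

    def exact_mean(w, theta, n):
        # <s_n> = (Z_+ - Z_-) / Z, equation 6; note that Z = Z_+ + Z_-.
        Zp = partition_function(w, theta, clamp={n: +1.0})
        Zm = partition_function(w, theta, clamp={n: -1.0})
        return (Zp - Zm) / (Zp + Zm)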
34 In [3] and [4] it is shown that the standard mean field lower bound can be obtained by applying the linear bound (7) to all exponentially many terms in the sum. [sent-51, score-1.018]
35 Optimizing the free parameters of this bound leads to the standard mean field equations for the μ's. [sent-56, score-0.239]
36 Moreover, the authors show that one can apply the procedure of 'upgrading bounds' (which is described briefly at the beginning of this section) to equation 7, which leads to the class of third order bounds for e^x. [sent-58, score-0.46]
37 In principle, this third order bound could be maximized with respect to all the free parameters, but here we follow the suggestion made in [4] to use a mean field optimization, which is much faster and generally almost as good as a full optimization. [sent-69, score-0.839]
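As a concrete illustration, the sketch below constructs one member of a third order class for e^x by upgrading a linear bound twice. It assumes the linear bound has the standard form e^x ≥ e^ν (1 + x - ν); whether this matches equation 7 exactly, and how the paper parameterizes its own third order bound, is not visible in the extracted text, so treat this only as an instance of the general procedure, not as the authors' formula.

    import numpy as np

    def cubic_lower_bound(x, nu, mu):
        # Upgrade e^x >= e^nu (1 + x - nu) twice (it bounds the second derivative
        # of e^x), matching the primitives of e^x at the second free parameter mu.
        tail = ((1.0 - nu) * (x - mu) ** 2 / 2.0
                + (x ** 3 - mu ** 3) / 6.0 - mu ** 2 * (x - mu) / 2.0)
        return np.exp(mu) * (1.0 + x - mu) + np.exp(nu) * tail

    x = np.linspace(-5.0, 5.0, 2001)
    for nu, mu in [(0.0, 0.0), (0.5, -0.3), (-1.0, 1.2)]:
        assert np.all(np.exp(x) >= cubic_lower_bound(x, nu, mu) - 1e-9)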
38 2.2 An upper bound An upper bound for Boltzmann machines has been described in [5] and [6].¹ [sent-72, score-1.087]
39 Basically, this method uses a quadratic upper bound on log cosh x, which can easily be obtained in the following way: f_2(x) = 1 - tanh²(x) ≤ 1 = b_2(x), and the primitive f_1(x) = tanh(x) is matched at μ by b_1(x) = (x - μ) + tanh(μ), [sent-73, score-0.936]
40 so that f_0(x) = log cosh(x) ≤ (1/2)(x - μ)² + (x - μ) tanh(μ) + log cosh(μ) = b_0(x)   (9). [sent-75, score-0.256]
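A quick numerical check of this quadratic bound (an illustrative sketch, not code from the paper):

    import numpy as np

    x = np.linspace(-10.0, 10.0, 4001)
    for mu in (-2.0, 0.0, 1.5):
        b0 = np.log(np.cosh(mu)) + (x - mu) * np.tanh(mu) + 0.5 * (x - mu) ** 2
        assert np.all(b0 >= np.log(np.cosh(x)) - 1e-9)   # b_0 upper-bounds log cosh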
41 l = bo(x) Using this bound, one can derive Z (e , w) = Ls exp (~L WijSiSj + L eiSi) all ij i = ~ 2exp (lOg cosh (L WniSi + en)) exp all sisn , ::; L exp (~ L W~jSiSj + L e;Si + k) allsls n ij i' n INote: The articles referred to, use (~ . [sent-79, score-0.523]
42 Finally, after N steps, an upper bound on the partition function is found². [sent-83, score-0.731]
43 We did a crude minimization with respect to the free parameters μ. [sent-84, score-0.052]
44 A more sophisticated method can probably be found, but this is not the main objective of this article. [sent-85, score-0.045]
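The Python sketch below implements one crude version of this recursive upper bound, under the assumptions used here: s_i ∈ {-1, +1}, the quadratic log cosh bound of equation 9, and the free parameter simply fixed to μ = θ_n for the neuron being eliminated (the original articles minimize over μ instead, so this sketch is looser than their bound). Eliminating a neuron leaves a Boltzmann machine with modified weights, thresholds and an additive constant.

    import numpy as np

    def log_Z_upper(w, theta):
        # Recursive upper bound on log Z: sum out one neuron at a time and replace
        # log cosh(h) by log cosh(mu) + (h - mu) tanh(mu) + 0.5 (h - mu)^2.
        w = np.array(w, dtype=float)
        theta = np.array(theta, dtype=float)
        bound = 0.0
        while len(theta) > 1:
            n = len(theta) - 1              # eliminate the last neuron
            wn, tn = w[n, :n], theta[n]
            mu = tn                         # crude fixed choice of the free parameter
            w_new = w[:n, :n] + np.outer(wn, wn)
            np.fill_diagonal(w_new, 0.0)    # s_i^2 = 1, so diagonal terms are constants
            theta_new = theta[:n] + wn * (tn - mu + np.tanh(mu))
            bound += (np.log(2.0) + np.log(np.cosh(mu)) + (tn - mu) * np.tanh(mu)
                      + 0.5 * (tn - mu) ** 2 + 0.5 * np.sum(wn ** 2))
            w, theta = w_new, theta_new
        return bound + np.log(2.0 * np.cosh(theta[0]))

Comparing log_Z_upper against an exact enumeration on a network of ten or so neurons is a useful sanity check that the bound never falls below the true log partition function.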
45 2.3 Bounding means and correlations The previous subsections showed very briefly how we can obtain a lower bound, Z^L, and an upper bound, Z^U, for any partition function. [sent-87, score-1.43]
46 We can use this in combination with equation 6 to obtain a bound on the means: ⟨s_n⟩^L = ( Z_+^L - Z_-^U ) / X ≤ ⟨s_n⟩ ≤ ( Z_+^U - Z_-^L ) / Y = ⟨s_n⟩^U   (12), where X = Z^U if the numerator is positive and X = Z^L otherwise. [sent-88, score-0.395]
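Equation 12 translates directly into a few lines of code; the sketch below (illustrative, with made-up argument names) takes lower and upper bounds on Z_+, Z_- and Z and returns the interval for ⟨s_n⟩. The choice of Y mirrors that of X.

    def mean_bounds(Zp_L, Zp_U, Zm_L, Zm_U, Z_L, Z_U):
        # Bounds on <s_n> = (Z_+ - Z_-)/Z from bounds on the partition functions.
        num_L = Zp_L - Zm_U                # lower bound on the numerator Z_+ - Z_-
        num_U = Zp_U - Zm_L                # upper bound on the numerator
        lower = num_L / (Z_U if num_L > 0 else Z_L)   # X as in equation 12
        upper = num_U / (Z_L if num_U > 0 else Z_U)   # the mirror-image choice Y
        return lower, upper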
47 Naively, we can compute the correlations similarly to the means, using ⟨s_n s_m⟩ = ( Z_{++} - Z_{+-} - Z_{-+} + Z_{--} ) / Z   (13), where the partition functions are computed for all four combinations of clamped s_n and s_m. [sent-91, score-0.46]
48 Generally, however, this gives poor results, since we have to add four bounds together, which leads to a bandwidth about twice as large as for the means. [sent-92, score-0.404]
49 We can circumvent this by computing the correlations using ⟨s_n s_m⟩ = ( Z_= - Z_≠ ) / Z   (14), where we allow the sum in the partition functions to be taken over s_n, but fix s_m either to s_n (giving Z_=) or to its negative (giving Z_≠). [sent-93, score-0.424]
50 Finally, the computation of the bounds ⟨s_n s_m⟩^L and ⟨s_n s_m⟩^U is analogous to equation 12. [sent-94, score-0.248]
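On a small network, equation 14 can again be checked by brute force: sum over all states, but split the sum according to whether s_m equals s_n. A sketch (not from the paper):

    import numpy as np
    from itertools import product

    def exact_correlation(w, theta, n, m):
        # <s_n s_m> = (Z_eq - Z_neq)/Z, equation 14, with Z_eq summing states
        # where s_m = s_n and Z_neq those where s_m = -s_n (and Z = Z_eq + Z_neq).
        N = len(theta)
        Z_eq = Z_neq = 0.0
        for c in product([-1.0, 1.0], repeat=N):
            s = np.array(c)
            weight = np.exp(0.5 * s @ w @ s + theta @ s)
            if s[m] == s[n]:
                Z_eq += weight
            else:
                Z_neq += weight
        return (Z_eq - Z_neq) / (Z_eq + Z_neq)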
51 There exists an alternative way to bound the means and correlations. [sent-95, score-0.417]
52 ²The original articles show that it is not necessary to do all the N steps. [sent-99, score-0.088]
53 However, since this is based on mixing approximation techniques with exact calculations, it is not used here as it would hide the real error the approximation makes. [sent-100, score-0.242]
54 [Figure 1 legend: Exact; Mean field lower bound; Upper bound; Third order lower bound.] [sent-102, score-1.422]
55 Figure 1: Comparison of 1) the mean field lower bound, 2) the upper bound and 3) the third order lower bound with the exact log partition function. [sent-130, score-1.731]
56 The network was a fully connected Boltzmann machine with 14 neurons and σ_θ = 0. [sent-131, score-0.235]
57 3 Results In all experiments we used fully connected Boltzmann machines of which the thresholds and weights both were drawn from a Gaussian with zero mean and standard deviation σ_θ and σ_w/√N, respectively, where N is the network size. [sent-135, score-0.477]
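This setup is easy to reproduce; the sketch below is a hypothetical helper (not the authors' code) that draws such a random network.

    import numpy as np

    def random_boltzmann_machine(N, sigma_theta, sigma_w, rng=None):
        # theta_i ~ N(0, sigma_theta^2); w_ij ~ N(0, (sigma_w / sqrt(N))^2),
        # symmetric with zero diagonal, as in the experiments described above.
        rng = np.random.default_rng() if rng is None else rng
        theta = rng.normal(0.0, sigma_theta, size=N)
        upper = np.triu(rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N)), k=1)
        w = upper + upper.T
        return w, theta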
58 Generally speaking, the mean field approximation breaks down for σ_θ = 0 and σ_w > 0.5, [sent-137, score-0.298]
59 whereas it can be proven that any expansion-based approximation is inaccurate when σ_w > 1 (which is the radius of convergence as in [9]). [sent-138, score-0.087]
60 In figure 1 we show the logarithm of the exact partition function, the first order or mean field bound, the upper bound (which is roughly quadratic) and the third order lower bound. [sent-140, score-1.305]
61 The weight size is varied along the horizontal axis. [sent-141, score-0.05]
62 One can see clearly that the mean field bound is not able to capture the quadratic form of the exact partition function for small weights due to its linear behaviour. [sent-142, score-1.012]
63 The error made by the upper and third order lower bound is small enough to make non-trivial bounds on the means and correlations. [sent-143, score-0.961]
64 An example of this bound is shown in figure 2 for the specific choice σ_θ = σ_w = 0.4. [sent-144, score-0.387]
65 For both the means and the correlations a histogram is plotted for the upper and lower bounds computed with equation 12. [sent-146, score-0.683]
66 In figure 3 the average bandwidth is shown for several values of σ_θ and σ_w. For bandwidths of 0. [sent-149, score-0.268]
67 We conclude that almost everywhere the bandwidth is non-trivially reduced and reaches practically useful values for σ_w less than 0.5. [sent-152, score-0.227]
68 This is more or less equivalent to the region where the mean field approximation performs well. [sent-154, score-0.114]
69 That approximation, however, gives no information on how close it actually is to the exact value, whereas the bounding method limits it to a definite region. [sent-155, score-0.344]
70 [Figure 2 axis label: distance to exact.] Figure 2: For the specific choice σ_θ = σ_w = 0.4, [sent-162, score-0.222]
71 thirty fully connected Boltzmann machines with 14 neurons were initialized and the bounds were computed. [sent-163, score-0.59]
72 The two left panels show the distance between the lower bound and the exact means (left) and similarly for the upper bound (right). [sent-164, score-1.121]
73 The right two panels show the distances of both bounds for the correlations. [sent-165, score-0.25]
74 Figure 3: In the left panel the average bandwidth is colour coded for the means, where σ_θ and σ_w are varied in ten steps along the axes. [sent-174, score-0.343]
75 For each σ_θ, σ_w thirty fully connected Boltzmann machines were initialized and the bounds on all the means and correlations were computed. [sent-176, score-0.841]
76 The bandwidth for the correlations is shown versus the network size. [sent-198, score-0.417]
77 A similar graph for the means is not shown here, but it is roughly the same. [sent-202, score-0.064]
78 The solid line is the average bandwidth over all correlations , whereas the dashed lines indicate the minimum and maximum bandwidth found. [sent-203, score-0.554]
79 Unfortunately, the bounds have the unwanted property that the error scales badly with the size of the network. [sent-204, score-0.267]
80 Although this makes the bounds unsuitable for very large networks, there is still a wide range of networks small enough to take advantage of the proposed method and still much too large to be treated exactly. [sent-205, score-0.292]
81 The bandwidth versus network size is shown in figure 4 for three values of σ_w. Obviously, the threshold of practical usefulness is reached earlier for larger weights. [sent-206, score-0.315]
82 Finally, we remark that the computation time for the upper bound is O(N^4) and for the mean field and third order lower bounds O(N^3). [sent-207, score-0.93]
83 4 Conclusions In this article we combined two already existing bounds in such a way that not only the partition function of a Boltzmann machine is bounded from both sides, but also the means and correlations. [sent-209, score-0.826]
84 This may seem superfluous, since several powerful approximation methods already exist. [sent-210, score-0.095]
85 Our method, however, can be used apart from any approximation technique and gives at least some information you can rely on. [sent-211, score-0.125]
86 Although approximation techniques might do a good job on your data, you can never be sure about that. [sent-212, score-0.059]
87 The method outlined in this paper ensures that the quantities of interest, the means and correlations, are restricted to a certain region. [sent-213, score-0.064]
88 We have seen that, generally speaking, the results are useful for weight sizes where an ordinary mean field approximation performs well. [sent-214, score-0.378]
89 Moreover, since many architectures are not fully connected, one can take advantage of that structure. [sent-216, score-0.057]
90 At least for the upper bound it has already been shown that this can improve computation speed and tightness. [sent-217, score-0.559]
91 This would partially cancel the unwanted scaling with the network size. [sent-218, score-0.122]
92 First of all, an extension to sigmoid belief networks can easily be done, since both a lower and an upper bound are already described. [sent-220, score-0.84]
93 The upper bound, however, is only applicable to two-layered networks. [sent-221, score-0.14]
94 A more general upper bound can probably be found. [sent-222, score-0.538]
95 Secondly, one can obtain even better bounds (especially for larger weights) if the general constraint (17) is taken into account. [sent-223, score-0.236]
96 This might even be extended to similar constraints, where three or more neurons are involved. [sent-224, score-0.046]
97 A mean field theory learning algorithm for neural networks. [sent-229, score-0.239]
98 Computing upper and lower bounds on likelihoods in intractable networks. [sent-266, score-0.279]
99 In Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 340-348, San Francisco, CA, 1996. [sent-267, score-0.055]
100 Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. [sent-282, score-0.042]
wordName wordTfidf (topN-words)
[('bound', 0.353), ('boltzmann', 0.305), ('sn', 0.285), ('partition', 0.238), ('bounds', 0.206), ('bandwidth', 0.198), ('cosh', 0.174), ('field', 0.174), ('bounding', 0.161), ('correlations', 0.158), ('upper', 0.14), ('eans', 0.139), ('zu', 0.138), ('sigmoid', 0.127), ('zl', 0.11), ('ijo', 0.104), ('leisink', 0.104), ('wijsisj', 0.104), ('machines', 0.101), ('tanh', 0.097), ('exact', 0.094), ('primitive', 0.091), ('bo', 0.088), ('articles', 0.088), ('third', 0.082), ('fo', 0.073), ('lower', 0.073), ('connected', 0.071), ('bandwidths', 0.07), ('bdx', 0.07), ('eisi', 0.07), ('martijn', 0.07), ('sib', 0.07), ('thirty', 0.07), ('belief', 0.068), ('thresholds', 0.066), ('article', 0.066), ('mean', 0.065), ('means', 0.064), ('network', 0.061), ('ev', 0.061), ('hn', 0.061), ('unwanted', 0.061), ('approximation', 0.059), ('en', 0.059), ('bounded', 0.058), ('fully', 0.057), ('si', 0.057), ('weights', 0.056), ('ere', 0.055), ('nijmegen', 0.055), ('exp', 0.053), ('free', 0.052), ('tighter', 0.051), ('jaakkola', 0.051), ('briefly', 0.051), ('graphical', 0.051), ('ij', 0.051), ('varied', 0.05), ('neurons', 0.046), ('co', 0.045), ('probably', 0.045), ('panels', 0.044), ('log', 0.043), ('order', 0.043), ('networks', 0.043), ('equation', 0.042), ('wij', 0.042), ('generally', 0.041), ('ex', 0.039), ('fa', 0.039), ('initialized', 0.039), ('ordinary', 0.039), ('hilbert', 0.038), ('apart', 0.038), ('parameterized', 0.037), ('class', 0.036), ('speaking', 0.036), ('already', 0.036), ('ib', 0.034), ('recursive', 0.034), ('specific', 0.034), ('quadratic', 0.032), ('panel', 0.031), ('eview', 0.03), ('etter', 0.03), ('naively', 0.03), ('conf', 0.03), ('ethod', 0.03), ('acknowledgelllents', 0.03), ('ackley', 0.03), ('definitely', 0.03), ('echniques', 0.03), ('programme', 0.03), ('tightness', 0.03), ('almost', 0.029), ('expansion', 0.028), ('rely', 0.028), ('circumvent', 0.028), ('economic', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 119 nips-2001-Means, Correlations and Bounds
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1
2 0.24747364 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds.
3 0.14292824 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
4 0.13386045 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
Author: Shai Fine, Katya Scheinberg
Abstract: We propose a framework based on a parametric quadratic programming (QP) technique to solve the support vector machine (SVM) training problem. This framework, can be specialized to obtain two SVM optimization methods. The first solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. The later method can be applied in conjunction with any other existing technique which obtains a fixed bias solution. Moreover, the second method can also be used independently to solve the complete SVM training problem. A combination of these two methods is more flexible than each individual method and, among other things, produces an incremental algorithm which exactly solve the 1-Norm Soft Margin SVM optimization problem. Applying Selective Sampling techniques may further boost convergence. 1
5 0.13316442 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
Author: Mikio L. Braun, Joachim M. Buhmann
Abstract: We consider noisy Euclidean traveling salesman problems in the plane, which are random combinatorial problems with underlying structure. Gibbs sampling is used to compute average trajectories, which estimate the underlying structure common to all instances. This procedure requires identifying the exact relationship between permutations and tours. In a learning setting, the average trajectory is used as a model to construct solutions to new instances sampled from the same source. Experimental results show that the average trajectory can in fact estimate the underlying structure and that overfitting effects occur if the trajectory adapts too closely to a single instance. 1
6 0.10818411 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
7 0.10092302 114 nips-2001-Learning from Infinite Data in Finite Time
8 0.095265247 143 nips-2001-PAC Generalization Bounds for Co-training
9 0.088397883 21 nips-2001-A Variational Approach to Learning Curves
10 0.084597692 155 nips-2001-Quantizing Density Estimators
11 0.082885541 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
12 0.082881123 8 nips-2001-A General Greedy Approximation Algorithm with Applications
13 0.078676052 31 nips-2001-Algorithmic Luckiness
14 0.077565119 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
15 0.076422296 37 nips-2001-Associative memory in realistic neuronal networks
16 0.075199217 36 nips-2001-Approximate Dynamic Programming via Linear Programming
17 0.074779667 139 nips-2001-Online Learning with Kernels
18 0.070721805 105 nips-2001-Kernel Machines and Boolean Functions
19 0.068951704 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
20 0.068474531 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
topicId topicWeight
[(0, -0.211), (1, 0.004), (2, 0.026), (3, 0.03), (4, 0.116), (5, -0.174), (6, 0.06), (7, -0.018), (8, -0.045), (9, -0.04), (10, -0.035), (11, 0.134), (12, 0.057), (13, 0.051), (14, -0.104), (15, 0.238), (16, -0.007), (17, -0.106), (18, -0.039), (19, -0.252), (20, -0.201), (21, 0.059), (22, 0.002), (23, -0.198), (24, -0.084), (25, 0.032), (26, -0.051), (27, 0.036), (28, -0.017), (29, -0.065), (30, -0.016), (31, 0.016), (32, 0.063), (33, -0.053), (34, 0.068), (35, 0.028), (36, 0.071), (37, -0.021), (38, 0.011), (39, 0.134), (40, 0.052), (41, -0.136), (42, -0.001), (43, 0.042), (44, -0.093), (45, -0.03), (46, 0.027), (47, 0.084), (48, 0.208), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.9773919 119 nips-2001-Means, Correlations and Bounds
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1
2 0.80596149 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds.
3 0.66457784 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
Author: Mikio L. Braun, Joachim M. Buhmann
Abstract: We consider noisy Euclidean traveling salesman problems in the plane, which are random combinatorial problems with underlying structure. Gibbs sampling is used to compute average trajectories, which estimate the underlying structure common to all instances. This procedure requires identifying the exact relationship between permutations and tours. In a learning setting, the average trajectory is used as a model to construct solutions to new instances sampled from the same source. Experimental results show that the average trajectory can in fact estimate the underlying structure and that overfitting effects occur if the trajectory adapts too closely to a single instance. 1
4 0.53187251 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model M_n̄ learned by the algorithm using n_i examples in step i (n̄ = (n_1, ..., n_m)), and the model M_∞ that would be learned using infinite examples. Upper-bound the loss L(M_n̄, M_∞) between them as a function of n̄, and then minimize the algorithm's time complexity f(n̄) subject to the constraint that L(M_∞, M_n̄) be at most ε with probability at most δ. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.
5 0.509381 143 nips-2001-PAC Generalization Bounds for Co-training
Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester
Abstract: The rule-based bootstrapping introduced by Yarowsky, and its cotraining variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of labels for .
6 0.47828174 31 nips-2001-Algorithmic Luckiness
7 0.47168055 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
8 0.45971659 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
9 0.45319924 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
10 0.42859671 36 nips-2001-Approximate Dynamic Programming via Linear Programming
11 0.35782394 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
12 0.32011718 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
13 0.31480905 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
14 0.30382714 30 nips-2001-Agglomerative Multivariate Information Bottleneck
15 0.3035171 21 nips-2001-A Variational Approach to Learning Curves
16 0.29859746 155 nips-2001-Quantizing Density Estimators
17 0.29853627 57 nips-2001-Correlation Codes in Neuronal Populations
18 0.27354419 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
19 0.27298513 120 nips-2001-Minimax Probability Machine
20 0.27123782 8 nips-2001-A General Greedy Approximation Algorithm with Applications
topicId topicWeight
[(14, 0.035), (17, 0.024), (19, 0.386), (27, 0.111), (30, 0.066), (59, 0.015), (72, 0.087), (79, 0.044), (83, 0.017), (87, 0.013), (91, 0.11)]
simIndex simValue paperId paperTitle
1 0.99103528 93 nips-2001-Incremental A*
Author: S. Koenig, M. Likhachev
Abstract: Incremental search techniques find optimal solutions to series of similar search tasks much faster than is possible by solving each search task from scratch. While researchers have developed incremental versions of uninformed search methods, we develop an incremental version of A*. The first search of Lifelong Planning A* is the same as that of A* but all subsequent searches are much faster because it reuses those parts of the previous search tree that are identical to the new search tree. We then present experimental results that demonstrate the advantages of Lifelong Planning A* for simple route planning tasks.
2 0.96456921 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
Author: Shun-ichi Amari, Hyeyoung Park, Tomoko Ozeki
Abstract: Singularities are ubiquitous in the parameter space of hierarchical models such as multilayer perceptrons. At singularities, the Fisher information matrix degenerates, and the Cramer-Rao paradigm does no more hold, implying that the classical model selection theory such as AIC and MDL cannot be applied. It is important to study the relation between the generalization error and the training error at singularities. The present paper demonstrates a method of analyzing these errors both for the maximum likelihood estimator and the Bayesian predictive distribution in terms of Gaussian random fields, by using simple models. 1
3 0.96416259 124 nips-2001-Modeling the Modulatory Effect of Attention on Human Spatial Vision
Author: Laurent Itti, Jochen Braun, Christof Koch
Abstract: We present new simulation results , in which a computational model of interacting visual neurons simultaneously predicts the modulation of spatial vision thresholds by focal visual attention, for five dual-task human psychophysics experiments. This new study complements our previous findings that attention activates a winnertake-all competition among early visual neurons within one cortical hypercolumn. This
same-paper 4 0.92558122 119 nips-2001-Means, Correlations and Bounds
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1
5 0.89121836 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
Author: Kari Torkkola
Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as a Gaussian Mixture Models (GMM). This allows all of the computation to take place in the low dimensional target space, and it reduces computational complexity proportional to square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.
6 0.65340739 1 nips-2001-(Not) Bounding the True Error
7 0.61060899 13 nips-2001-A Natural Policy Gradient
8 0.59077591 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
9 0.58964145 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
10 0.5890494 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
11 0.57451218 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
12 0.56827855 155 nips-2001-Quantizing Density Estimators
13 0.56727165 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
14 0.56702214 114 nips-2001-Learning from Infinite Data in Finite Time
16 0.56317204 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
17 0.56159019 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
18 0.55675536 36 nips-2001-Approximate Dynamic Programming via Linear Programming
19 0.55467176 68 nips-2001-Entropy and Inference, Revisited
20 0.55261528 143 nips-2001-PAC Generalization Bounds for Co-training