nips nips2000 nips2000-120 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alex J. Smola, Peter L. Bartlett
Abstract: We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. In particular, computational requirements are O(n 2 m), storage is O(nm), the cost for prediction is 0 (n) and the cost to compute confidence bounds is O(nm), where n «: m. We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 au Abstract We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. [sent-8, score-0.785]
2 In particular, computational requirements are O(n 2 m), storage is O(nm), the cost for prediction is 0 (n) and the cost to compute confidence bounds is O(nm), where n «: m. [sent-9, score-0.422]
3 We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems. [sent-10, score-0.301]
4 1 Introduction Gaussian processes have become popular because they allow exact Bayesian analysis with simple matrix manipulations, yet provide good performance. [sent-11, score-0.152]
5 They share with Support Vector machines and Regularization Networks the concept of regularization via Reproducing Kernel Hilbert spaces [3], that is, they allow the direct specification of the smoothness properties of the class of functions under consideration. [sent-12, score-0.082]
6 However, Gaussian processes are not always the method of choice for large datasets, since they involve evaluations of the covariance function at m points (where m denotes the sample size) in order to carry out inference at a single additional point. [sent-13, score-0.187]
7 This may be rather costly to implement - practitioners prefer to use only a small number of basis functions (Le. [sent-14, score-0.102]
8 Furthermore, the Maximum a Posteriori (MAP) estimate requires computation, storage, and inversion of the full m x m covariance matrix Kii = k( Xi, Xi) where Xl! [sent-16, score-0.346]
9 While there exist techniques [2, 8] to reduce the computational cost of finding an estimate to O(km 2 ) rather than O(m 3 ) when the covariance matrix contains a significant number of small eigenvalues, all these methods still require computation and storage of the full covariance matrix. [sent-21, score-0.502]
10 None of these methods addresses the problem of speeding up the prediction stage (except for the rare case when the integral operator corresponding to the kernel can be diagonalized analytically [8]). [sent-22, score-0.099]
11 We devise a sparse greedy method, similar to those proposed in the context of wavelets [4], solutions of linear systems [5] or matrix approximation [7] that finds ·Supported by the DFG (Sm 62-1) and the Australian Research Council. [sent-23, score-0.65]
12 an approximation of the MAP estimate by expanding it in terms of a small subset of kernel functions k (Xi, . [sent-24, score-0.323]
13 Briefly, the technique works as follows: given a set of (already chosen) kernel functions, we seek the additional function that increases the posterior probability most. [sent-26, score-0.183]
14 We add it to the set of basis functions and repeat until the maximum is approximated sufficiently well. [sent-27, score-0.18]
15 A similar approach for a tight upper bound on the posterior probability gives a stopping criterion. [sent-28, score-0.247]
16 In Gaussian Process Regression, we assume that for any such set there is a covariance matrix K with elements Kij = k( Xi, Xj). [sent-31, score-0.172]
17 We assume that for each input X there is a corresponding output y(x), and that these outputs are generated by y(x) = t(x) + (1) are both normal random variables, with rv N(O, ( 2 ) and where t(x) and t = (t(Xl), . [sent-32, score-0.131]
18 We can use Bayes theorem to determine the distribution of the output y(x) at a (new) input x. [sent-36, score-0.051]
19 Conditioned on the data (X,y), the output y(x) is normally distributed. [sent-37, score-0.038]
20 It follows that the mean of this distribution is the maximum a posteriori probability (MAP) estimate of y. [sent-38, score-0.156]
21 We may assume that the vector y = (y(Xl)"" ,y(xm))T of outputs is generated by e y=Ka+e, (2) where a rv N(O, K- 1 ) and rv N(O, ( 2 1). [sent-41, score-0.262]
22 Consequently the posterior probability p(aly, X) over the latent variables a is proportional to exp(-2;21Iy - Ka11 2) exp(-! [sent-42, score-0.084]
23 , k (x m , x)) and aopt is the value of a that maximizes (3). [sent-47, score-0.152]
24 Thus, it suffices to compute aopt before any predictions are required. [sent-48, score-0.214]
25 The problem of choosing the MAP estimate of a is equivalent to the problem of minimizing the negative log-posterior, (4) minimize [-y T Ka + ! [sent-49, score-0.187]
26 It is easy to show that the mean of the conditional distribution of y(x) is k T (K +( 21)-ly, and its variance is k(x, x) + a 2 - k T (K + ( 21)-lk (see, for example, [2]). [sent-51, score-0.049]
27 The following theorem suggests a simple approach to estimating the accuracy of an approximate solution to (4). [sent-53, score-0.144]
28 Theorem 1 (Approximation Bounds for Quadratic Forms) Denote by K E lRmxm a positive semidefinite matrix, y, a E lRm and define the two quadratic forms Q(a) := -y T Ka 1 + _aT (a 2 K + KT K)a, 2 (5) Q*(a) := -y Ta 1 + _aT (a 2 1 + K)a. [sent-55, score-0.165]
29 Hence, by minimizing Q* in addition to Q we can bound Q's closeness to the optimum and vice versa. [sent-60, score-0.05]
30 Proof The minimum of Q(a) is obtained for aopt minimizes Q*), hence Qmin = 1 T -2"Y K(K +a 2 1) -1 y * and Qmin = = (K 1 + a 21)-1y T (which also 2-1 -2"Y (K +a 1) y. [sent-61, score-0.24]
31 Since by definition Q (a) ::::: Qmin for all a (and likewise Q* (a*) ::::: Q;',. [sent-65, score-0.043]
32 in for either Q or Q* to obtain lower bounds for each of the two quantities. [sent-67, score-0.127]
33 • Equation (7) is useful for computing an approximation to the MAP solution, whereas (8) can be used to obtain error bars on the estimate. [sent-69, score-0.318]
34 To see this, note that in calculating the variance, the expensive quantity to compute is -kT (K +a 21)-1k. [sent-70, score-0.102]
35 However, this can be found by solving (10) minimize [-k Ta + ~a T (a 2 1 + K) a] , aEIRm and the expression inside the parentheses is Q*(a) with y = k (see (6)). [sent-71, score-0.089]
36 Hence, an approximate minimizer of (10) gives an upper bound on the error bars, and lower bounds can be obtained from (8) . [sent-72, score-0.42]
37 - 2(Q(a)+u Q*(a*)+2liYli) • h n practice we W1 use t e quantly gap a, a . [sent-76, score-0.229]
38 t e relative size of the difference between upper and lower bound as stopping criterion. [sent-79, score-0.292]
39 Denote by P E IRffl xn with m ::::: nand m,n E N an extension matrix (Le. [sent-81, score-0.081]
40 We will make the ansatz ap := P[3 where [3 E IRn (11) and find solutions [3 such that Q(ap) (or Q*(ap)) is minimized. [sent-83, score-0.111]
41 The solution is [3opt = (pT (a 2 K + K T K) p) -1 p T K T y. [sent-84, score-0.035]
42 (12) Clearly if Pis ofrank m, this will also be the solution of (4) (the minimum negative log posterior for all a E IRffl ). [sent-85, score-0.207]
43 Computational Cost of Greedy Decompositions For a given P E IRffl xn let us analyze the computational cost involved in the estimation procedures. [sent-87, score-0.086]
44 To compute (12) we need to evaluate pT Ky which is O(nm), (KP)T(KP) which is O(n 2 m) and invert an n x n matrix, which is O(n 3 ). [sent-88, score-0.117]
45 Using P also to minimize Q*(P[3*) costs no more than O(n 3 ), which is needed to upper-bound the log posterior. [sent-91, score-0.284]
46 For error bars, we have to approximately minimize (10) which can done for a = P(3 at O(n 3 ) cost. [sent-92, score-0.149]
47 If we compute (PKpT)-l beforehand, this can be done by at O(n 2 ) and likewise for upper bounds. [sent-93, score-0.154]
48 (3T pT ((72 K + KT K)P(3 which costs O(n 2 m) (once the inverse matrices have been computed, one may, however, use them to compute error bars at different locations, too, thus costing only O(n 2 )). [sent-95, score-0.436]
49 The lower bounds on the error bars may not be so crucial, since a bad estimate will only lead to overly conservative confidence intervals and not have any other negative effect. [sent-96, score-0.451]
50 Finally note that all we ever have to compute and store is K P, i. [sent-97, score-0.062]
51 Table 1 summarizes the scaling behaviour of several optimization algorithms. [sent-100, score-0.046]
52 Note that n 0, find X E lRn with minimal number of nonzero entries such that IIAx - bl1 2 ::; E. [sent-105, score-0.034]
53 Thus the problem of sparse approximate minimization of Q(o) is a special case of Natarajan's problem (where the matrix A is square, symmetric, and positive definite). [sent-108, score-0.395]
54 In addition, the algorithm considered by Natarajan in [5J involves sequentially choosing columns of A to maximally decrease IIAx - bll. [sent-109, score-0.089]
55 This is clearly equivalent to the sparse greedy algorithm described above. [sent-110, score-0.434]
56 1 E non-zero components, where n*(E/4) is the minimal number of nonzero components in vectors a for which Q(o) ::; Q(oopt) + E/4, A = (a 2K + KTK)1/2, and). [sent-113, score-0.034]
57 l is the minimum of the magnitudes of the singular values of A, the matrix obtained by normalizing the columns of A. [sent-114, score-0.162]
58 Randomized Algorithms for Subset Selection Unfortunately, the approximation algorithm considered above is still too expensive for large m since each search operation involves O(m) indices. [sent-115, score-0.126]
59 Yet, if we are satisfied with finding a relatively good index rather than the best, we may resort to selecting a random subset of size K «: m. [sent-116, score-0.161]
60 In Algorithm 1, this corresponds to choosing M ~ I, M* ~ 1* as random subsets of size K. [sent-117, score-0.186]
61 To see why, we recall a simple lemma from [7J: the cumulative distribution function of the maximum of m i. [sent-119, score-0.037]
62 05 of all such columns, a random subsample of size ilogO. [sent-127, score-0.091]
63 Numerical Considerations The crucial part is to obtain the values of Q(P,Bopt} cheaply (with P = [Pold, eiJ), provided we solved the problem for Pold. [sent-145, score-0.035]
64 In the following we will show that this can be obtained in O(mn) operations, provided the inverse of the smaller subsystem is known. [sent-147, score-0.057]
65 Hence, inversion of pT (KT K + a 2 K) P costs only O(n 2 ). [sent-149, score-0.168]
66 Thus, to find P of size m x n takes O(ltn 2 m) time. [sent-150, score-0.091]
67 For the error bars, (p T KP)-1 will generally be a good starting value for the minimization of (10), so the typical cost for (10) will be O(Tmn) for some T < n, rather than O(mn 2 ). [sent-151, score-0.224]
68 Finally, for added numerical stability one may want to use an incremental Cholesky factorization in (13) instead of the inverse of a matrix. [sent-152, score-0.057]
69 5 Experiments and Discussion We used the Abalone dataset from the VCI Repository to investigate the properties of the algorithm. [sent-153, score-0.089]
70 The dataset is of size 4177, split into 4000 training and 177 testing split to analyze the numerical performance, and a (3000,1177) split to assess the generalization error (the latter was needed in order to be able to invert (and keep in memory) the full matrix K + a 2 1 for a comparison). [sent-154, score-0.724]
71 The data was rescaled to zero mean and unit variance coordinate-wise. [sent-155, score-0.049]
72 IIx-x'1I2 In all our experiments we used Gaussian kernels k(x, x') = exp( -~) as covariance kernels. [sent-157, score-0.134]
73 We plot the size of the gap between upper and lower bound of the log posterior (gap( a, a*)) for the first 4000 samples from the Abalone dataset (a 2 = 0. [sent-161, score-0.683]
74 From top to bottom: subsets of size 1, 2, 5, 10, 20, 50, 100, 200. [sent-163, score-0.143]
75 The relative variance of the gap size was less than 10%. [sent-165, score-0.369]
76 One can see that that subsets of size 50 and above ensure rapid convergence. [sent-166, score-0.143]
77 1 and 2w 2 = 10, chosen after [7]) the average test error of the sparse greedy approximation trained until gap(a, a*) < 0. [sent-168, score-0.58]
78 025 on a (3000,1177) split (the results were averaged over ten independent choices of training sets. [sent-169, score-0.162]
79 Hence for all practical purposes full inversion of the covariance matrix and the sparse greedy approximation have statistically indistinguishable generalization performance. [sent-182, score-0.811]
80 In a third experiment (Table 2) we analyzed the number of basis functions needed to minimize the log posterior to gap(a, a*) < 0. [sent-183, score-0.385]
81 025, depending on different choices of the kernel width a. [sent-184, score-0.208]
82 In all cases, less than 10% of the kernel functions suffice to find a good minimizer of the log posterior, for the error bars, even less than 2% are sufficient. [sent-185, score-0.369]
83 Also, number of basis functions required to approximate k T (K + 0-21)- l k which is needed to compute the error bars. [sent-188, score-0.339]
84 To ensure that our results were not dataset specific and that the algorithm scales well we tested it on a larger synthetic dataset of size 10000 in 20 dimensions distributed according to N(O, 1). [sent-190, score-0.269]
85 The data was generated by adding normal noise with variance 0- 2 = 0. [sent-191, score-0.049]
86 1 to a function consisting of 200 randomly chosen Gaussians of width 2w 2 = 40 and normally distributed coefficients and centers. [sent-192, score-0.107]
87 We purposely chose an inadequate Gaussian process prior (but correct noise level) of Gaussians with width 2w 2 = 10 in order to avoid trivial sparse expansions. [sent-193, score-0.318]
88 after using 5% of all basis functions) the size of the gap(cr, cr·) was less than 0. [sent-196, score-0.146]
89 We believe that sparse greedy approximation methods are a key technique to scale up Gaussian Process regression to sample sizes of 10. [sent-198, score-0.594]
90 Work on the solutions of dense quadratic programs and classification problems is in progress. [sent-201, score-0.109]
91 An equivalence between sparse approximation and support vector machines. [sent-216, score-0.298]
wordName wordTfidf (topN-words)
[('qmin', 0.343), ('gap', 0.229), ('greedy', 0.222), ('sparse', 0.212), ('pold', 0.19), ('pt', 0.178), ('bars', 0.172), ('nm', 0.171), ('abalone', 0.152), ('aopt', 0.152), ('irffl', 0.152), ('ktk', 0.152), ('kt', 0.133), ('rv', 0.131), ('kernel', 0.099), ('covariance', 0.091), ('size', 0.091), ('bounds', 0.089), ('dataset', 0.089), ('kp', 0.089), ('minimize', 0.089), ('cost', 0.086), ('approximation', 0.086), ('costs', 0.085), ('split', 0.085), ('posterior', 0.084), ('inversion', 0.083), ('ka', 0.083), ('australian', 0.083), ('matrix', 0.081), ('iiax', 0.076), ('langley', 0.076), ('minimizer', 0.076), ('natarajan', 0.076), ('oopt', 0.076), ('rsise', 0.076), ('regression', 0.074), ('map', 0.073), ('xm', 0.07), ('width', 0.069), ('stopping', 0.064), ('posteriori', 0.064), ('compute', 0.062), ('ap', 0.062), ('storage', 0.062), ('error', 0.06), ('quadratic', 0.06), ('semidefinite', 0.059), ('evaluations', 0.059), ('canberra', 0.059), ('approximate', 0.058), ('needed', 0.057), ('inverse', 0.057), ('gaussian', 0.056), ('basis', 0.055), ('estimate', 0.055), ('invert', 0.055), ('ta', 0.055), ('hence', 0.053), ('log', 0.053), ('subsets', 0.052), ('cr', 0.052), ('theorem', 0.051), ('bound', 0.05), ('upper', 0.049), ('variance', 0.049), ('mn', 0.049), ('solutions', 0.049), ('table', 0.048), ('functions', 0.047), ('behaviour', 0.046), ('columns', 0.046), ('forms', 0.046), ('lm', 0.044), ('minimization', 0.044), ('likewise', 0.043), ('kernels', 0.043), ('choosing', 0.043), ('repeat', 0.041), ('choices', 0.04), ('expensive', 0.04), ('francisco', 0.04), ('act', 0.038), ('normally', 0.038), ('lower', 0.038), ('processes', 0.037), ('averaged', 0.037), ('confidence', 0.037), ('maximum', 0.037), ('process', 0.037), ('full', 0.036), ('smola', 0.036), ('subset', 0.036), ('minimum', 0.035), ('solution', 0.035), ('editor', 0.035), ('crucial', 0.035), ('regularization', 0.035), ('good', 0.034), ('nonzero', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 120 nips-2000-Sparse Greedy Gaussian Process Regression
Author: Alex J. Smola, Peter L. Bartlett
Abstract: We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. In particular, computational requirements are O(n 2 m), storage is O(nm), the cost for prediction is 0 (n) and the cost to compute confidence bounds is O(nm), where n «: m. We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems. 1
2 0.18070056 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csatč´¸, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
3 0.13701099 121 nips-2000-Sparse Kernel Principal Component Analysis
Author: Michael E. Tipping
Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1
4 0.1267806 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models. 1
5 0.10837058 134 nips-2000-The Kernel Trick for Distances
Author: Bernhard Schölkopf
Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
6 0.10724527 58 nips-2000-From Margin to Sparsity
7 0.10105994 110 nips-2000-Regularization with Dot-Product Kernels
8 0.098462045 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
9 0.097293399 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
10 0.096876532 21 nips-2000-Algorithmic Stability and Generalization Performance
11 0.095521756 133 nips-2000-The Kernel Gibbs Sampler
12 0.092336878 75 nips-2000-Large Scale Bayes Point Machines
13 0.09111543 54 nips-2000-Feature Selection for SVMs
14 0.091097906 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
15 0.090993837 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
16 0.090580128 130 nips-2000-Text Classification using String Kernels
17 0.085664697 94 nips-2000-On Reversing Jensen's Inequality
18 0.08444225 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
19 0.082988337 13 nips-2000-A Tighter Bound for Graphical Models
20 0.082284227 4 nips-2000-A Linear Programming Approach to Novelty Detection
topicId topicWeight
[(0, 0.297), (1, 0.136), (2, 0.006), (3, 0.019), (4, 0.055), (5, 0.105), (6, -0.04), (7, 0.011), (8, -0.037), (9, -0.082), (10, -0.103), (11, 0.068), (12, -0.021), (13, 0.02), (14, -0.031), (15, 0.076), (16, 0.043), (17, 0.159), (18, 0.019), (19, -0.059), (20, 0.05), (21, -0.055), (22, 0.164), (23, 0.052), (24, -0.034), (25, 0.006), (26, -0.04), (27, 0.084), (28, -0.144), (29, -0.004), (30, 0.037), (31, 0.162), (32, 0.01), (33, -0.033), (34, 0.149), (35, 0.046), (36, -0.113), (37, 0.158), (38, 0.092), (39, 0.002), (40, -0.029), (41, 0.031), (42, 0.004), (43, 0.033), (44, -0.065), (45, 0.033), (46, 0.027), (47, -0.032), (48, -0.113), (49, -0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.95838517 120 nips-2000-Sparse Greedy Gaussian Process Regression
Author: Alex J. Smola, Peter L. Bartlett
Abstract: We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. In particular, computational requirements are O(n 2 m), storage is O(nm), the cost for prediction is 0 (n) and the cost to compute confidence bounds is O(nm), where n «: m. We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems. 1
2 0.66663247 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
Author: Dörthe Malzahn, Manfred Opper
Abstract: Based on a statistical mechanics approach, we develop a method for approximately computing average case learning curves for Gaussian process regression models. The approximation works well in the large sample size limit and for arbitrary dimensionality of the input space. We explain how the approximation can be systematically improved and argue that similar techniques can be applied to general likelihood models. 1
3 0.63735008 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csatč´¸, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
4 0.53660417 21 nips-2000-Algorithmic Stability and Generalization Performance
Author: Olivier Bousquet, Andr茅 Elisseeff
Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1
5 0.53398293 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles
Author: Martin J. Wainwright, Erik B. Sudderth, Alan S. Willsky
Abstract: We present the embedded trees algorithm, an iterative technique for estimation of Gaussian processes defined on arbitrary graphs. By exactly solving a series of modified problems on embedded spanning trees, it computes the conditional means with an efficiency comparable to or better than other techniques. Unlike other methods, the embedded trees algorithm also computes exact error covariances. The error covariance computation is most efficient for graphs in which removing a small number of edges reveals an embedded tree. In this context, we demonstrate that sparse loopy graphs can provide a significant increase in modeling power relative to trees, with only a minor increase in estimation complexity. 1
6 0.49975052 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers
8 0.45976618 121 nips-2000-Sparse Kernel Principal Component Analysis
9 0.44278935 110 nips-2000-Regularization with Dot-Product Kernels
10 0.4385412 54 nips-2000-Feature Selection for SVMs
11 0.43632266 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
12 0.42663816 52 nips-2000-Fast Training of Support Vector Classifiers
13 0.41843653 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
14 0.41725224 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
15 0.41628358 18 nips-2000-Active Support Vector Machine Classification
16 0.41232002 134 nips-2000-The Kernel Trick for Distances
17 0.40534282 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
18 0.39981088 85 nips-2000-Mixtures of Gaussian Processes
19 0.39881292 94 nips-2000-On Reversing Jensen's Inequality
20 0.39318159 22 nips-2000-Algorithms for Non-negative Matrix Factorization
topicId topicWeight
[(10, 0.026), (17, 0.151), (32, 0.016), (33, 0.037), (54, 0.014), (55, 0.018), (62, 0.02), (65, 0.013), (67, 0.06), (76, 0.083), (79, 0.019), (81, 0.012), (90, 0.045), (97, 0.412)]
simIndex simValue paperId paperTitle
Author: Sepp Hochreiter, Michael Mozer
Abstract: The goal of many unsupervised learning procedures is to bring two probability distributions into alignment. Generative models such as Gaussian mixtures and Boltzmann machines can be cast in this light, as can recoding models such as ICA and projection pursuit. We propose a novel sample-based error measure for these classes of models, which applies even in situations where maximum likelihood (ML) and probability density estimation-based formulations cannot be applied, e.g., models that are nonlinear or have intractable posteriors. Furthermore, our sample-based error measure avoids the difficulties of approximating a density function. We prove that with an unconstrained model, (1) our approach converges on the correct solution as the number of samples goes to infinity, and (2) the expected solution of our approach in the generative framework is the ML solution. Finally, we evaluate our approach via simulations of linear and nonlinear models on mixture of Gaussians and ICA problems. The experiments show the broad applicability and generality of our approach. 1
same-paper 2 0.88661206 120 nips-2000-Sparse Greedy Gaussian Process Regression
Author: Alex J. Smola, Peter L. Bartlett
Abstract: We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. In particular, computational requirements are O(n 2 m), storage is O(nm), the cost for prediction is 0 (n) and the cost to compute confidence bounds is O(nm), where n «: m. We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems. 1
3 0.60755861 94 nips-2000-On Reversing Jensen's Inequality
Author: Tony Jebara, Alex Pentland
Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
4 0.55315542 52 nips-2000-Fast Training of Support Vector Classifiers
Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez
Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVc. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets. 1 INTRODUCTION The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP) which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and in the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must j oin the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and the runtime complexity as a consequence. To aim for this issue, the computer program SV cradit have been developed so as to solve the SVC for large training data sets using IRWLS procedure and fixed-size working sets. The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for SVC and explain how it can be incorporated to a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation and they are compared with the fastest available SVC implementation, SV Mlight [6]. We end with some concluding remarks. 2 IRWLS-SVC In order to solve classification problems, the SVC has to minimize Lp = ~llwI12+CLei- LJliei- LQi(Yi(¢(xifw+b)-l+ei) (1) i i i with respectto w, band ei and maximize it with respectto Qi and Jli, subject to Qi, Jli ~ 0, where ¢(.) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work ofVapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on ei can be removed because, at the solution C - Qi - Jli = 0 Vi (one of the KKT conditions [2]) must hold. Lp = 1 Qi(l- Yi(¢T(Xi)W + b)) 211wl12 + L i = (2) where The weighted least square nature of (2) can be understood if ei is defined as the error on each sample and ai as its associated weight, where! IIwl1 2 is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because ai = ai(ei), and we need to apply an IRWLS procedure [4], summarized below in tree steps: 1. Considering the ai fixed, minimize (2). 2. Recalculate ai from the solution on step 1. 3. Repeat until convergence. In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Ei (JiYi¢(Xi) and in order to obtain a non-zero b, that Ei {JiYi = O. Substituting them into (2), its minimum with respect to {Ji and b for a fixed set of ai is found by solving the following linear equation system l (3) IThe detailed description of the steps needed to obtain (3) from (2) can be found in [10]. where y = [Yl, Y2, ... Yn]T (4) 'r/i,j = 1, ... ,n 'r/i,j = 1, ... ,n (H)ij = YiYj¢T(Xi)¢(Xj) = YiyjK(Xi,Xj) (Da)ij = aio[i - j] 13 = [,81, ,82, ... (5) (6) (7) , ,8n]T and 0[·] is the discrete impulse function. Finally, the dependency of ai upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining a, ai 2.1 ={~ ei Yi' eiYi < Yt.et. > - ° ° (8) IRWLS ALGORITHMIC IMPLEMENTATION The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, SI, contains the training samples verifying < ,8i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose,8i = 0. And the last one, S3, is made up of the training samples whose ,8i = C. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. ° 0. Initialization: SI will contain every training sample, S2 = 0 and S3 = 0. Compute H. e_a = y, f3_a = 0, b_a = 0, G 13 = Gin, a = 1 and G b3 = G bi n . 1 Solve [ (H)Sb S1 + D(al S1 . =° = e-lt a, 3. ai = { ~ (13) S2 2. e ° 1[ (Y)Sl (f3)Sl ] (y ) ~1 b and (13) Ss = C DyH(f3 - f3_a) - (b - b_a)1 =[1- G 13 ] G b3 ' °. eiYi < e- _ > O'r/Z E SI U S2 U S3 tYt 4. Sets reordering: a. Move every sample in S3 with eiYi < to S2. b. Move every sample in SI with ,8i = C to S3. c. Move every sample in SI with ai = to S2 . d. Move every sample in S2 with ai :I to SI. 5. e_a = e, f3_a = 13, G 13 = (H)Sl,SS (f3)ss + (G in )Sl' b-lt = band Gb3 = -y~s (f3)ss + Gbin · 6. Go to step 1 and repeat until convergence. ei Yi ' ° ° ° Table 1: IRWLS-SVC algorithm. The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunk:ing scheme as the one proposed in [8, 6], such that it can be directly applied in the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVC with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divide into a working set, Sw, which is solved as a full SVC problem, and an inactive set, Sin. If there are support vectors in the inactive set, as it might be, the inactive set modifies the IRWLSSVC procedure, adding a contribution to the independent term in the linear equation system (3) . Those support vectors in S in can be seen as anchored samples in S3, because their ,8i is not zero and can not be modified by the IRWLS procedure. Then, such contribution (Gin and G bin ) will be calculated as G 13 and G b3 are (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider Gin and G bin , which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing as minimization engine the IRWLS procedure, is summarized in the following steps: 1. Select the samples that will form the working set. 2. Construct Gin = (H)Sw,Sin (f3)s.n and G bin = -yIin (f3)Sin 3. Solve the IRWLS-SVC procedure, following the steps in Table 1. 4. Compute the error of every training sample. 5. If the stopping conditions Yiei < C eiYi> -c leiYil < C 'Vii 'Vii 'Vii (Ji = 0 (Ji = C 0 < (Ji < C (9) (10) (11) are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and C must be a small value around 10 - 3 , a full discussion concerning this topic can be found in [6]. 3 SAMPLE SELECTION STRATEGY The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because such decision is directly involved in the number of IRWLS-SVC (or QP-SVC) procedures to be called and in the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking schemes. In order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. The candidate set will be made up, as it could not be otherwise, with all the training samples that violate the stopping conditions (9)-(11); and we will also add all those training samples that satisfy condition (11) but a small variation on their error will make them violate such condition. The strategies to select the working set are as numerous as the number of problems to be solved, but one can think three different simple strategies: • Select those samples which do not fulfill the stopping criteria and present the largest Iei I values. • Select those samples which do not fulfill the stopping criteria and present the smallest Iei I values. • Select them randomly from the ones that do not fulfill the stopping conditions. The first strategy seems the more natural one and it was proposed in [6]. If the largest leil samples are selected we guanrantee that attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration and the (Ji values to oscillate around its optimal value. The magnitude of this effect is directly proportional to the value of C and q (size of the working set), so in the case ofsmall C (C < 10) and low q (q < 20) it would be less noticeable. The second one is the most conservative strategy because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples whose 0 < (3i < C increases above q there might be some iterations where we will make no progress (working set is only made up of the training samples that fulfill the stopping condition in (11)). This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than q the working set is completed with those samples that fulfil the stopping criteria conditions and present the least leil. In summary, the sample selection strategy proposed is 2 : 1. Construct the candidate set, Se with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose (3 obeys 0 < (3i < C. 2. IfISel < ngot05. 3. Choose a sample per class that violates each one of the stopping conditions and move them from Se to the working set, SW. 4. Choose randomly n - ISw I samples from Se and move then to SW. Go to Step 6. 5. Move every sample form Se to Sw and then-ISwl samples that fulfill the stopping conditions (9) and (10) and present the lowest leil values are used to complete SW . 6. Go on, obtaining Gin and Gbin. 4 BENCHMARK FOR THE IRWLS-SVC We have prepared two different experiments to test both the IRWLS and the sample selection strategy for solving the SVc. The first one compares the IRWLS against QP and the second one compares the samples selection strategy, together with the IRWLS, against a complete solving procedure for SVC, the SV Mlight. In the first trial, we have replaced the LOQO interior point optimizer used by SV M1ig ht version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with equal samples selection strategy. The comparison has been made over a Pentium ill-450MHz with 128Mb running on Window98 and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first q 20 40 70 Adult44781 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 21.25 20.70 0.61 0.39 20.60 19.22 1.01 0.17 21.15 18.72 2.30 0.46 Splice 2175 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 46.19 30.76 21.94 4.77 71.34 24.93 46.26 8.07 53.77 20.32 34.24 7.72 Table 2: CPU Time indicates the consume time in seconds for the whole procedure. The Optimize Time indicates the consume time in second for the LOQO or IRWLS procedure. one, containing 4781 training samples, needs most CPU resources to compute the RKHS and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each Sw, where q indicates the size of the working set. The value of C has 2In what follows, I . I represents absolute value for numbers and cardinality for sets been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter a has been set, respectively, to 10 and 70. As it can be seen, the SV M1ig ht with IRWLS is significantly faster than the LOQO procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for both procedures. The results in Table 2 validates the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3, we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from the SV Mlight [6] for our computer program. To test these two programs several data sets have been used. The Adult and Web data sets have been obtained from 1. Platt's web page http://research.microsoft.comr jplatt/smo.html/; the Gauss-M data set is a two dimensional classification problem proposed in [3] to test neural networks, which comprises a gaussian random variable for each class, which highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Ratsch web page http://svm.first.gmd.der raetschl. The selection of C and the RKHS has been done as indicated in [11] for Adult and Web data sets and in http://svm.first.gmd.derraetschl for Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of q has been elected as the one that reduces the runtime complexity. Database Dim Adult6 Adult9 Adult! Web 1 Web7 Gauss-M Gauss-M Banana Banana Diabetes Splice 123 123 123 300 300 2 2 2 2 8 69 N Sampl. 11221 32562 1605 2477 24693 4000 4000 400 4900 768 2175 C a SV 1 1 1000 5 5 1 100 316.2 316.2 10 1000 10 10 10 10 10 1 1 1 1 2 70 4477 12181 630 224 1444 1736 1516 80 1084 409 525 q CPU time radit light radit light 150 130 100 100 150 70 100 40 70 40 150 40 70 10 10 10 10 10 70 40 10 20 118.2 1093.29 25.98 2.42 158.13 12.69 61.68 0.33 22.46 2.41 14.06 124.46 1097.09 113.54 2.36 124.57 48.28 3053.20 0.77 1786.56 6.04 49.19 Table 3: Several data sets runtime complexity, when solved with the short, and SV Mlight, light for short. s v c radit , radit for One can appreciate that the svcradit is faster than the SV M1ig ht for most data sets. For the Web data set, which is the only data set the SV Mlight is sligthly faster, the value of C is low and most training samples end up as support vector with (3i < C. In such cases the best strategy is to take the largest step towards the solution in every iteration, as the SV Mlig ht does [6], because most training samples (3i will not be affected by the others training samples (3j value. But in those case the value of C increases the SV c radit samples selection strategy is a much more appropriate strategy than the one used in SV Mlight. 5 CONCLUSIONS In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster that the usual QP procedure, and simpler to implement, because the most difficult step is the linear equation system solution that can be easily obtained by LU decomposition means [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working is be large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which allows to work with large training data set. All these modifications have been concreted in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/. 6 ACKNOWLEDGEMENTS We are sincerely grateful to Thorsten Joachims who has allowed and encouraged us to use his SV Mlight to test our IRWLS procedure, comparisons which could not have been properly done otherwise. References [1] B. E. Boser, I. M . Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburg, U.S.A., 1992. [2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [3] S. Haykin. Neural Networks: A comprehensivefoundation. Prentice-Hall, 1994. [4] P. W. Holland and R. E. Welch. Robust regression using iterative re-weighted least squares. Communications of Statistics Theory Methods, A6(9):813-27, 1977. [5] T. Joachims. http://www-ai.infonnatik.uni-dortmund.de/forschung/verfahren Isvmlight Isvmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998. [6] T. Joachims. Making Large Scale SVM Learning Practical, In Advances in Kernel Methods- Support Vector Learning, Editors SchOlkopf, B., Burges, C. 1. C. and Smola, A. 1., pages 169-184. M.I.T. Press, 1999. [7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A, 1997. [8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998. [9] F. Perez-Cruz, A. Navia-Vazquez
5 0.51274621 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csatč´¸, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
6 0.48818734 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
7 0.48497432 21 nips-2000-Algorithmic Stability and Generalization Performance
8 0.48140562 60 nips-2000-Gaussianization
9 0.47751516 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
10 0.47637001 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
11 0.47318277 74 nips-2000-Kernel Expansions with Unlabeled Examples
12 0.47163722 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition
13 0.47046578 37 nips-2000-Convergence of Large Margin Separable Linear Classification
14 0.4661485 4 nips-2000-A Linear Programming Approach to Novelty Detection
15 0.46276218 75 nips-2000-Large Scale Bayes Point Machines
16 0.46147537 51 nips-2000-Factored Semi-Tied Covariance Matrices
17 0.46106359 133 nips-2000-The Kernel Gibbs Sampler
18 0.46028686 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
19 0.45923075 36 nips-2000-Constrained Independent Component Analysis
20 0.45726484 22 nips-2000-Algorithms for Non-negative Matrix Factorization