nips nips2007 nips2007-170 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew Naish-guzman, Sean Holden
Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. [sent-5, score-0.44]
2 This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. [sent-6, score-0.368]
3 An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. [sent-8, score-0.552]
4 The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. [sent-9, score-0.153]
5 We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. [sent-10, score-0.199]
6 1 Introduction Regression data are often modelled as noisy observations of an underlying process. [sent-13, score-0.077]
7 The simplest assumption is that all noise is independent and identically distributed (i. [sent-14, score-0.152]
8 Furthermore, the Gaussian noise model enjoys the theoretical justification of the central limit theorem, which states that the sum of sufficiently many i. [sent-19, score-0.152]
9 The random component in the signal may be caused by human or measurement error, or it may be the manifestation of systematic variation invisible to a simplified model. [sent-27, score-0.063]
10 Robust methods use a heavy-tailed likelihood to allow the interpolant effectively to favour smoothness and ignore such erroneous data. [sent-32, score-0.15]
11 Figure 1c shows how this can be achieved using a two-component noise model 2 2 p(yn |fn ) = (1 − ǫ)N yn ; fn , σR + ǫN yn ; fn , σO , 1 (1) (a) (b) (c) (d) Figure 1: Black dots show noisy samples from the sinc function. [sent-33, score-1.178]
12 In panels (a) and (b), the behaviour of a GP with a Gaussian noise assumption is illustrated; the shaded region shows 95% confidence intervals. [sent-34, score-0.224]
13 The presence of a single outlier is highly influential in this model, but the heavy-tailed likelihood (1) in panel (c) is more resilient. [sent-35, score-0.346]
14 Unfortunately, even this model fails for the cluster of outliers in panel (d). [sent-36, score-0.13]
15 Here, grey lines show ten repeated runs of the EP inference algorithm, while the black line and shaded region are their averaged mean and confidence intervals respectively—grossly at odds with those of the latent generative model. [sent-37, score-0.139]
16 in which observations yn are Gaussian corruptions of fn , being drawn with probability ǫ from a 2 2 large variance outlier distribution (σO ≫ σR ). [sent-38, score-0.996]
17 Our research is motivated by observing how the predictive distribution suffers for heavy-tailed models when outliers appear in bursts: figure 1d replicates figure 1c, but introduces an additional three outliers. [sent-44, score-0.161]
18 All parameters were taken from the optimal solution to (c), but even without the challenge of hyperparameter optimization there is now considerable uncertainty in the posterior since the competing interpretations of the cluster as signal or noise have similar posterior mass. [sent-45, score-0.391]
19 Viewed another way, the tails of the effective log likelihood of four clustered observations have approximately one-quarter the weight of a single outlier, so the magnitude of the posterior peak associated with the robust solution is comparably reduced. [sent-46, score-0.562]
20 One simple remedy is to make the tails of the likelihood heavier. [sent-47, score-0.244]
21 However, since the noise model is global, this has ramifications across the entire data space, potentially causing underfitting elsewhere when real data are relegated to the tails. [sent-48, score-0.152]
22 Our model is also a specialization of the GP mixtures proposed by Tresp [1] and Rasmussen and Ghahramani [2]; indeed, the latter automatically infers the correct number of components to use. [sent-51, score-0.079]
23 We argue that the two-component mixture is often a sensible distribution for modelling real data, with a natural interpretation and the heavy tails required for robustness; its weaknesses are exposed primarily when the noise distribution is not homoscedastic. [sent-54, score-0.517]
24 The TGP largely solves this problem, and allows inference by an efficient expectation propagation (EP) [5] procedure (rather than resorting to more heavy duty Monte Carlo methods). [sent-55, score-0.098]
25 Hence, provided a twocomponent mixture is likely to reflect adequately the noise on our data, the TGP will give similar results to the generalized mixtures mentioned above, but at a fraction of the cost. [sent-56, score-0.278]
26 [3] suggest an approach to input-dependent noise in the spirit of the TGP, in which the log variance on observations is itself modelled as a GP (the logarithm since noise variance is a non-negative property). [sent-58, score-0.541]
27 Inference is again analytically intractable, so Gibbs sampling is used to generate noise vectors from the posterior distribution by alternately fitting the signal process and fitting the noise process. [sent-59, score-0.515]
28 We apply Bayes’ rule to obtain the posterior distribution over the f , given the observed X and y, which with the assumption of i. [sent-63, score-0.133]
29 Robust GP regression is achieved by using a leptokurtic likelihood distribution, i. [sent-69, score-0.128]
30 one whose tails have more mass than the Gaussian. [sent-71, score-0.165]
31 Common choices are the Laplace (or double exponential) distribution, Student’s t distribution, and the mixture model (1). [sent-72, score-0.088]
32 In product with the prior, a heavy-tailed likelihood over an outlying observation does not exert the strong pull on the posterior witnessed with a light-tailed noise model. [sent-73, score-0.398]
33 Since it bears closest resemblance to the twinned GP, we are particularly interested in the mixture; however, in section 4, we include results for the Laplace model: it is the heaviest-tailed log concave distribution, which guarantees a unimodal posterior and allows more reliable EP convergence. [sent-75, score-0.329]
34 In any case, all such methods make a global assumption about the noise distribution, and it is where this is inappropriate that our model is most beneficial. [sent-76, score-0.213]
35 We augment the standard process over f with another GP over a set of variables u; this acts as a gating function, probabilistically dividing the domain between the real and outlier components of the noise model 2 2 p(yn |fn ) = σ(un )N yn ; fn , σR + σ(−un )N yn ; fn , σO , . [sent-78, score-1.532]
36 −∞ In the TGP likelihood, we therefore mix two forms of Gaussian corruption, one strongly peaked at the observation, the other a broader distribution which provides the heavy tails, in proportion determined by u(x). [sent-80, score-0.115]
37 The two priors may have quite different covariance structure, reflecting our different beliefs about correlations in the signal and in the noise domain. [sent-82, score-0.206]
38 In addition, we accommodate prior beliefs about the prevalence of outliers with a non-zero mean process on u, p(u|X) = N (u ; mu , Ku ) p(f |X) = N (f ; 0 , Kf ) . [sent-83, score-0.232]
39 Suppose we have an intractable distribution over f whose unnormalized form factorizes into a product of terms, such as a dense Gaussian prior t0 (f , u) and a series of independent likelihoods {tn (yn |fn , un )}N . [sent-86, score-0.353]
40 EP constructs n=1 ˜ the approximate posterior as a product of scaled site functions tn . [sent-87, score-0.295]
41 For computational tractability, these sites are usually chosen from an exponential family with natural parameters θ, since in this case their product retains the same functional form as its components. [sent-88, score-0.115]
42 The data ordinates are x, observations y, and the GP is over the latent f . [sent-91, score-0.08]
43 Panel (b) shows a graphical model for the twinned Gaussian process (TGP), in which an auxiliary set of hidden variables u describes the noisiness of the data. [sent-93, score-0.237]
44 where Z is the marginal likelihood and zn are the scale parameters. [sent-94, score-0.174]
45 A simpler optimization minθn KL q n (fn , un ; θ \n ) q(fn , un ; θ) then fits only the parameters θn : this is equivalent to moment matching between the two distributions, with scale zn chosen to match the zeroth-order moments. [sent-97, score-0.45]
46 After each site update, the moments at the remaining sites are liable to change, and several iterations may be required before convergence. [sent-98, score-0.216]
47 The priors over u and f are independent, but we expect correlations in the posterior after conditioning on observations. [sent-99, score-0.13]
48 To understand this, consider a single observation (xn , yn ); in principle, it admits two explanations corresponding to its classification as either “outlier” or as “real” data: in general terms, either un > 0 and fn ≈ yn , or un < 0 and fn respects the global structure of the signal. [sent-100, score-1.456]
49 A diagram to assist the visualization of the behaviour of the posterior is provided in figure 3. [sent-101, score-0.152]
50 Now, recall that the prior over u and f is p u f X =N u f ; mu 0 , Ku 0 0 Kf ˜ and the likelihood factorizes into a product of terms (2); our site approximations tn are therefore Gaussian in (fn , un ). [sent-102, score-0.586]
51 Of importance for EP are the moments of the tilted distribution which we seek to match. [sent-103, score-0.181]
52 These are most easily obtained by differentiation of the zeroth moments ZR and ZO of each component. [sent-104, score-0.101]
53 We find ∞ 1 0 u z 2 ZR = σ(u)N y ; f , σR N ; µ , Σ dudf = N ; µ, 2 + Σ dz; f y 0 σR f,u 0 writing the inner Gaussian as N zn yn ; where q = µu µf , µu + A C C BR C BR (y A− − µf ) C2 BR , Z R = N (y ; µf , BR ) σ(q), . [sent-105, score-0.296]
54 The integral for the outlier component is similar; ZO = N (y ; µf , BO ) σ(−q). [sent-106, score-0.265]
55 With partial deriva2 log tives ∂ log Z and ∂∂µµTZ we are equipped for EP; algorithmic details appear in Seeger’s note [8]. [sent-107, score-0.072]
56 For ∂µ efficiency, we make rank-two updates of the full approximate covariance on (f , u) during the EP loop, and refresh the posterior at the end of each cycle to avoid loss of precision. [sent-108, score-0.138]
57 The left-hand column illustrates the behaviour of a fixed heavy-tailed likelihood for one, two, four and five repeated observations at f = 5. [sent-110, score-0.17]
58 (Outliers in real data are not necessarily so tightly packed, but the symmetry of this approximation allows us to treat them as a single unit: by “posterior”, for example, we mean the a posteriori belief in all the observations’ (identical) latent f . [sent-111, score-0.067]
59 The top-left box illustrates how the influence of isolated outliers is mitigated by the standard mixture. [sent-113, score-0.129]
60 However, a repeated observation (box two on the left) causes the EP solution to collapse onto the spike at the data (the log scale is deceptive: the second peak contributes only about 8% of the posterior mass). [sent-114, score-0.141]
61 The thick bar in the central column marks the cross-section corresponding to the unnormalized posterior from column one. [sent-117, score-0.135]
62 f f f f ˆ f f f The noise process may itself be of interest, in which case we need to marginalize over both u⋆ and f⋆ in p(y⋆ |x⋆ , X, y) = ≈ p y⋆ x⋆ , p y⋆ x⋆ , u f u f p u⋆ f⋆ p X, y dudf u⋆ f⋆ u f N u f ˆ ˆ ; µ , Σ du⋆ df⋆ dudf . [sent-120, score-0.343]
63 This distribution is no longer Gaussian, but its moments may be recovered easily by the same method used to obtain moments of the tilted distribution. [sent-121, score-0.282]
64 EP provides in addition to the approximate moments of the posterior distribution an estimate of the marginal likelihood and its derivatives with respect to kernel hyperparameters. [sent-122, score-0.362]
65 Again, we refer the interested reader to the algorithm presented in [8], adding here only that our implementation uses 2 2 log noise values on (σR , σO ) to allow for their unconstrained optimization. [sent-123, score-0.188]
66 The posterior refresh is O(8N 3 ) since it requires the inverse of a 2N × 2N positive semi-definite matrix, most efficiently achieved through Cholesky factorization (this Cholesky factor can be retained for use in calculating the approximate log marginal likelihood). [sent-127, score-0.223]
67 The total number of loops required for convergence of EP is typically independent of N , and can be upper bounded by a small constant, say 10, making the entire inference process O(8N 3 ) = O(N 3 ). [sent-128, score-0.091]
68 robust regression by EP, which admittedly masks the larger coefficient that appears in approximating both u and f simultaneously. [sent-132, score-0.121]
69 4 Experiments We identify two general noise characteristics for which our model may be suitable. [sent-134, score-0.152]
70 The first is when the outlying observations can appear in clusters: we saw in figure 1d how these occurrences affect the standard mixture model. [sent-135, score-0.167]
71 In fact the problem is quite severe, since the multimodality of the posterior impedes the convergence of EP, while the possibility of conflicting gradient information at the optima hampers procedures for evidence maximization. [sent-136, score-0.105]
72 In figure 4 we illustrate how the TGP succeeds where the mixture and Laplace models fail; note how the mean process on u falls sharply in the contaminated regions. [sent-137, score-0.137]
73 A data set which exhibits the superior predictive modelling of the TGP in a domain where robust methods can also expect to perform well is provided by Kuss [7] in a variation on a set of Friedman [9]. [sent-139, score-0.138]
74 6 -10 0 (a) Mixture noise 10 -10 0 (b) Laplace noise 10 -10 0 (c) TGP 10 Figure 4: The corruptions are i. [sent-142, score-0.422]
75 We generated ten sets of 90 training examples and 10000 test examples by sampling x uniformly in [0, 1]10 , and adding to the training data noise N (0, 1). [sent-146, score-0.188]
76 error for the robust methods is similar, but the TGP is able to fit the variance far more accurately. [sent-152, score-0.134]
77 Friedman (1)), because once the outliers have been accounted for there are fewer corrupted regions; furthermore, estimates of where the data are corrupted can be recovered by considering the process on u. [sent-157, score-0.223]
78 In both experiments, the training data were renormalized to zero mean and unit variance, and throughout, we used the anisotropic squared exponential for the f process (implementing so-called relevance determination), and an isotropic version for u. [sent-158, score-0.074]
79 The approximate marginal likelihood was maximized on three to five randomly initialized models; we chose for testing the most favoured. [sent-159, score-0.128]
80 The second domain of application is when the noise on the data is believed a priori to be a function of the input (i. [sent-160, score-0.179]
81 The twinned GP can simulate this changing variance by modulating the u process, allocating varying weight to the two components. [sent-163, score-0.25]
82 By way of example, the behaviour for the one-dimensional motorcycle set [10] is shown in fig. [sent-164, score-0.109]
83 This is particularly problematic when variance on the data ranges over several orders of magnitude, such that the “outlier” width must be comparably broader than that of the “real” component. [sent-167, score-0.119]
84 In such cases, only with extreme values of u can the smallest errors be predicted, but in consequence the process tends to sweep precipitately through the region of sensitivity where variance predictions can be made accurately. [sent-168, score-0.157]
85 To circumvent these problems we might employ the warped GP [11] to rescale the process on u in a supervised manner, but we do not explore these ideas further here. [sent-169, score-0.086]
86 log probability (b) Friedman (2) (c) Motorcycle Figure 5: Results for the Friedman data, and the predictions of the TGP on the motorcycle set. [sent-178, score-0.144]
87 7 5 Extensions With prior knowledge of the nature of corruptions affecting the signal, we can seek to model the noise distribution more accurately, for example by introducing a compound likelihood for the outlier 2 component pO (yn |fn ) = j αj N yn ; µj (fn ) , σj , j αj = 1. [sent-179, score-0.89]
88 This constrains the relative weight of outlier corruptions to be constant across the entire domain. [sent-180, score-0.349]
89 A richer alternative is provided by extending the single u-process on noise to a series u(1) , u(2) , . [sent-181, score-0.152]
90 , u(ν) of noise processes, and broadening the likelihood function appropriately. [sent-184, score-0.278]
91 For example, with ν = 2, we may write 2 p(yn |fn , u(1) , u(2) ) = σ(u(1) )N yn ; fn , σR + n n n 2 σ(−u(1) )σ(u(2) )N yn ; fn , σO1 + n n 2 σ(−u(1) )σ(−u(2) )N yn ; f0 , σO2 . [sent-185, score-1.205]
92 (4) n n In the former case, the preceding analysis applies with small changes: each component of the outlier distribution contributes moments independently. [sent-186, score-0.394]
93 The second model introduces significant computational difficulty: firstly, we must maintain a posterior distribution over f and all ν us, yielding space requirements O(N (ν + 1)) and time complexity O(N 3 (ν + 1)3 ). [sent-187, score-0.133]
94 More importantly, the requisite moments needed in the EP loop are now intractable, although an inner EP loop can be used to approximate them, since the product of σs behaves in essence like the standard model for GP classification. [sent-188, score-0.232]
95 6 Conclusions We have presented a method for robust GP regression that improves upon classical approaches by allowing the noise variance to vary in the input space. [sent-190, score-0.335]
96 We found improved convergence on problems which upset the standard mixture model, and have shown how predictive certainty can be improved by adopting the TGP even for problems which do not. [sent-191, score-0.127]
97 The model also allows an arbitrary process on u, such that specialized prior knowledge could be used to drive the inference over f to respecting regions which may otherwise be considered erroneous. [sent-192, score-0.131]
98 A generalization of our ideas appears as the mixture of GPs [1], and the infinite mixture [2], but both involve a slow inference procedure. [sent-193, score-0.218]
99 When faster solutions are required for robust inference, and a two-component mixture is an adequate model for the task, we believe the TGP is a very attractive option. [sent-194, score-0.16]
100 Gaussian process models for robust regression, classification and reinforcement learning. [sent-216, score-0.121]
wordName wordTfidf (topN-words)
[('tgp', 0.377), ('gp', 0.352), ('fn', 0.334), ('ep', 0.25), ('outlier', 0.231), ('un', 0.202), ('twinned', 0.188), ('yn', 0.179), ('tails', 0.165), ('noise', 0.152), ('corruptions', 0.118), ('posterior', 0.105), ('kf', 0.105), ('moments', 0.101), ('mixtgp', 0.094), ('outliers', 0.094), ('friedman', 0.091), ('mixture', 0.088), ('site', 0.087), ('lap', 0.082), ('likelihood', 0.079), ('tn', 0.076), ('gaussian', 0.072), ('robust', 0.072), ('dudf', 0.071), ('interpolant', 0.071), ('edward', 0.07), ('variance', 0.062), ('motorcycle', 0.062), ('rasmussen', 0.061), ('br', 0.06), ('heavy', 0.056), ('carl', 0.052), ('tilted', 0.052), ('goldberg', 0.052), ('laplace', 0.052), ('loop', 0.052), ('gure', 0.05), ('mu', 0.049), ('zoubin', 0.049), ('process', 0.049), ('marginal', 0.049), ('regression', 0.049), ('broadening', 0.047), ('heteroscedastic', 0.047), ('zo', 0.047), ('gating', 0.047), ('behaviour', 0.047), ('predictions', 0.046), ('zn', 0.046), ('kt', 0.045), ('observations', 0.044), ('inference', 0.042), ('specialization', 0.041), ('snelson', 0.041), ('corrupted', 0.04), ('prior', 0.04), ('predictive', 0.039), ('christopher', 0.038), ('mixtures', 0.038), ('ku', 0.037), ('zr', 0.037), ('warped', 0.037), ('ff', 0.037), ('ten', 0.036), ('panel', 0.036), ('log', 0.036), ('latent', 0.036), ('box', 0.035), ('clustered', 0.035), ('inappropriate', 0.035), ('retains', 0.035), ('outlying', 0.035), ('kuss', 0.035), ('component', 0.034), ('modelled', 0.033), ('refresh', 0.033), ('dence', 0.033), ('broader', 0.031), ('tightly', 0.031), ('dent', 0.03), ('unnormalized', 0.03), ('cholesky', 0.03), ('affecting', 0.029), ('processes', 0.029), ('signal', 0.029), ('df', 0.028), ('sites', 0.028), ('tresp', 0.028), ('distribution', 0.028), ('product', 0.027), ('domain', 0.027), ('factorizes', 0.026), ('comparably', 0.026), ('con', 0.026), ('global', 0.026), ('correlations', 0.025), ('shaded', 0.025), ('tractability', 0.025), ('exponential', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 170 nips-2007-Robust Regression with Twinned Gaussian Processes
Author: Andrew Naish-guzman, Sean Holden
Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].
2 0.27408397 195 nips-2007-The Generalized FITC Approximation
Author: Andrew Naish-guzman, Sean Holden
Abstract: We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani [1], applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(N M 2 ) training complexity—asymptotically the same as related sparse methods such as the informative vector machine [2], but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following [1], we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.
3 0.23236774 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
4 0.2158999 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy
Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1
5 0.19921334 200 nips-2007-The Tradeoffs of Large Scale Learning
Author: Olivier Bousquet, Léon Bottou
Abstract: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways. 1 Motivation The computational complexity of learning algorithms has seldom been taken into account by the learning theory. Valiant [1] states that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind: • Data mining exists because competitive advantages can be achieved by analyzing the masses of data that describe the life of our computerized society. Since virtually every computer generates data, the data volume is proportional to the available computing power. Therefore one needs learning algorithms that scale roughly linearly with the total volume of data. • Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our biological brains can learn quite efficiently from the continuous streams of perceptual data generated by our six senses, using limited amounts of sugar as a source of power. This observation suggests that there are learning algorithms whose computing time requirements scale roughly linearly with the total volume of data. This contribution finds its source in the idea that approximate optimization algorithms might be sufficient for learning purposes. The first part proposes new decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. The second part explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error. 2 2.1 Approximate Optimization Setup Following [6, 2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a probability distribution P (x, y). The conditional distribution P (y|x) represents the unknown relationship between inputs and outputs. The discrepancy between the predicted output y and the real output ˆ y is measured with a loss function ℓ(ˆ, y). Our benchmark is the function f ∗ that minimizes the y expected risk E(f ) = that is, ℓ(f (x), y) dP (x, y) = E [ℓ(f (x), y)], f ∗ (x) = arg min E [ ℓ(ˆ, y)| x]. y y ˆ Although the distribution P (x, y) is unknown, we are given a sample S of n independently drawn training examples (xi , yi ), i = 1 . . . n. We define the empirical risk En (f ) = 1 n n ℓ(f (xi ), yi ) = En [ℓ(f (x), y)]. i=1 Our first learning principle consists in choosing a family F of candidate prediction functions and finding the function fn = arg minf ∈F En (f ) that minimizes the empirical risk. Well known combinatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently restrictive. Since the optimal function f ∗ is unlikely to belong to the family F, we also define ∗ ∗ fF = arg minf ∈F E(f ). For simplicity, we assume that f ∗ , fF and fn are well defined and unique. We can then decompose the excess error as ∗ ∗ E [E(fn ) − E(f ∗ )] = E [E(fF ) − E(f ∗ )] + E [E(fn ) − E(fF )] = Eapp + Eest , (1) where the expectation is taken with respect to the random choice of training set. The approximation error Eapp measures how closely functions in F can approximate the optimal solution f ∗ . The estimation error Eest measures the effect of minimizing the empirical risk En (f ) instead of the expected risk E(f ). The estimation error is determined by the number of training examples and by the capacity of the family of functions [2]. Large families1 of functions have smaller approximation errors but lead to higher estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and lead to excess error that scale between the inverse and the inverse square root of the number of examples [7, 8]. 2.2 Optimization Error Finding fn by minimizing the empirical risk En (f ) is often a computationally expensive operation. Since the empirical risk En (f ) is already an approximation of the expected risk E(f ), it should not be necessary to carry out this minimization with great accuracy. For instance, we could stop an iterative optimization algorithm long before its convergence. ˜ Let us assume that our minimization algorithm returns an approximate solution fn such that ˜ En (fn ) < En (fn ) + ρ ˜ where ρ ≥ 0 is a predefined tolerance. An additional term Eopt = E E(fn ) − E(fn ) then appears ˜ ) − E(f ∗ ) : in the decomposition of the excess error E = E E(fn E ∗ ∗ ˜ = E [E(fF ) − E(f ∗ )] + E [E(fn ) − E(fF )] + E E(fn ) − E(fn ) = Eapp + Eest + Eopt . (2) We call this additional term optimization error. It reflects the impact of the approximate optimization on the generalization performance. Its magnitude is comparable to ρ (see section 3.1.) 1 We often consider nested families of functions of the form Fc = {f ∈ H, Ω(f ) ≤ c}. Then, for each value of c, function fn is obtained by minimizing the regularized empirical risk En (f ) + λΩ(f ) for a suitable choice of the Lagrange coefficient λ. We can then control the estimation-approximation tradeoff by choosing λ instead of c. 2.3 The Approximation–Estimation–Optimization Tradeoff This decomposition leads to a more complicated compromise. It involves three variables and two constraints. The constraints are the maximal number of available training example and the maximal computation time. The variables are the size of the family of functions F, the optimization accuracy ρ, and the number of examples n. This is formalized by the following optimization problem. min E = Eapp + Eest + Eopt F ,ρ,n n ≤ nmax T (F, ρ, n) ≤ Tmax subject to (3) The number n of training examples is a variable because we could choose to use only a subset of the available training examples in order to complete the optimization within the alloted time. This happens often in practice. Table 1 summarizes the typical evolution of the quantities of interest with the three variables F, n, and ρ increase. Table 1: Typical variations when F, n, and ρ increase. F Eapp Eest Eopt T (approximation error) (estimation error) (optimization error) (computation time) n ρ ց ր ··· ր ց ··· ր ր ց The solution of the optimization program (3) depends critically of which budget constraint is active: constraint n < nmax on the number of examples, or constraint T < Tmax on the training time. • We speak of small-scale learning problem when (3) is constrained by the maximal number of examples nmax . Since the computing time is not limited, we can reduce the optimization error Eopt to insignificant levels by choosing ρ arbitrarily small. The excess error is then dominated by the approximation and estimation errors, Eapp and Eest . Taking n = nmax , we recover the approximation-estimation tradeoff that is the object of abundant literature. • We speak of large-scale learning problem when (3) is constrained by the maximal computing time Tmax . Approximate optimization, that is choosing ρ > 0, possibly can achieve better generalization because more training examples can be processed during the allowed time. The specifics depend on the computational properties of the chosen optimization algorithm through the expression of the computing time T (F, ρ, n). 3 The Asymptotics of Large-scale Learning In the previous section, we have extended the classical approximation-estimation tradeoff by taking into account the optimization error. We have given an objective criterion to distiguish small-scale and large-scale learning problems. In the small-scale case, we recover the classical tradeoff between approximation and estimation. The large-scale case is substantially different because it involves the computational complexity of the learning algorithm. In order to clarify the large-scale learning tradeoff with sufficient generality, this section makes several simplifications: • We are studying upper bounds of the approximation, estimation, and optimization errors (2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates [9, 10, 11, 12]. Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of the possibly pathological special cases. • We are studying the asymptotic properties of the tradeoff when the problem size increases. Instead of carefully balancing the three terms, we write E = O(Eapp ) + O(Eest ) + O(Eopt ) and only need to ensure that the three terms decrease with the same asymptotic rate. • We are considering a fixed family of functions F and therefore avoid taking into account the approximation error Eapp . This part of the tradeoff covers a wide spectrum of practical realities such as choosing models and choosing features. In the context of this work, we do not believe we can meaningfully address this without discussing, for instance, the thorny issue of feature selection. Instead we focus on the choice of optimization algorithm. • Finally, in order to keep this paper short, we consider that the family of functions F is linearly parametrized by a vector w ∈ Rd . We also assume that x, y and w are bounded, ensuring that there is a constant B such that 0 ≤ ℓ(fw (x), y) ≤ B and ℓ(·, y) is Lipschitz. We first explain how the uniform convergence bounds provide convergence rates that take the optimization error into account. Then we discuss and compare the asymptotic learning properties of several optimization algorithms. 3.1 Convergence of the Estimation and Optimization Errors The optimization error Eopt depends directly on the optimization accuracy ρ. However, the accuracy ˜ ρ involves the empirical quantity En (fn ) − En (fn ), whereas the optimization error Eopt involves ˜ its expected counterpart E(fn ) − E(fn ). This section discusses the impact on the optimization error Eopt and of the optimization accuracy ρ on generalization bounds that leverage the uniform convergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2].) In this discussion, we use the letter c to refer to any positive constant. Multiple occurences of the letter c do not necessarily imply that the constants have identical values. 3.1.1 Simple Uniform Convergence Bounds Recall that we assume that F is linearly parametrized by w ∈ Rd . Elementary uniform convergence results then state that r » – E sup |E(f ) − En (f )| ≤ c f ∈F d , n where the expectation is taken with respect to the random choice of the training set.2 This result immediately provides a bound on the estimation error: Eest = ≤ ´ ` ` ∗ ´ ∗ ∗ ´˜ E(fn ) − En (fn ) + En (fn ) − En (fF ) + En (fF ) − E(fF ) r » – d . 2 E sup |E(f ) − En (f )| ≤ c n f ∈F E ˆ` This same result also provides a combined bound for the estimation and optimization errors: Eest + Eopt = + ≤ ˆ ˜ ˆ ˜ ˜ ˜ ˜ E E(fn ) − En (fn ) + E En (fn ) − En (fn ) ∗ ∗ ∗ E [En (fn ) − En (fF )] + E [En (fF ) − E(fF )] r r r ! d d d c +ρ+0+c = c ρ+ . n n n Unfortunately, this convergence rate is known to be pessimistic in many important cases. More sophisticated bounds are required. 3.1.2 Faster Rates in the Realizable Case When the loss functions ℓ(ˆ, y) is positive, with probability 1 − e−τ for any τ > 0, relative uniform y convergence bounds state that r E(f ) − En (f ) d n τ p sup log + . ≤c n d n f ∈F E(f ) This result is very useful because it provides faster convergence rates O(log n/n) in the realizable case, that is when ℓ(fn (xi ), yi ) = 0 for all training examples (xi , yi ). We have then En (fn ) = 0, ˜ En (fn ) ≤ ρ, and we can write r q n d τ ˜ ˜ E(fn ) − ρ ≤ c E(fn ) log + . n d n q 2 d Although the original Vapnik-Chervonenkis bounds have the form c n log n , the logarithmic term can d be eliminated using the “chaining” technique (e.g., [10].) Viewing this as a second degree polynomial inequality in variable ˜ E(fn ), we obtain „ « d n τ ˜ E(fn ) ≤ c ρ + log + . n d n Integrating this inequality using a standard technique (see, e.g., [13]), we obtain a better convergence rate of the combined estimation and optimization error: „ « h i h i d n ∗ ˜ ˜ Eest + Eopt = E E(fn ) − E(fF ) ≤ E E(fn ) = c ρ + log . n d 3.1.3 Fast Rate Bounds Many authors (e.g., [10, 4, 12]) obtain fast statistical estimation rates in more general conditions. These bounds have the general form α n 1 d log for ≤ α ≤ 1. n d 2 This result holds when one can establish the following variance condition: Eapp + Eest ≤ c ∀f ∈ F Eapp + ∗ ℓ(f (X), Y ) − ℓ(fF (X), Y ) E 2 ≤ c ∗ E(f ) − E(fF ) (4) 1 2− α . (5) The convergence rate of (4) is described by the exponent α which is determined by the quality of the variance bound (5). Works on fast statistical estimation identify two main ways to establish such a variance condition. • Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee et al. [14] establish a O(log n/n) rate using the squared loss ℓ(ˆ, y) = (ˆ − y)2 . y y • Making assumptions on the data distribution. In the case of pattern recognition problems, for instance, the “Tsybakov condition” indicates how cleanly the posterior distributions P (y|x) cross near the optimal decision boundary [11, 12]. The realizable case discussed in section 3.1.2 can be viewed as an extreme case of this. Despite their much greater complexity, fast rate estimation results can accomodate the optimization accuracy ρ using essentially the methods illustrated in sections 3.1.1 and 3.1.2. We then obtain a bound of the form α d n ˜ E = Eapp + Eest + Eopt = E E(fn ) − E(f ∗ ) ≤ c Eapp + log +ρ . (6) n d For instance, a general result with α = 1 is provided by Massart [13, theorem 4.2]. Combining this result with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the following result: d n ˜ E = Eapp + Eest + Eopt = E E(fn ) − E(f ∗ ) ≤ c Eapp + log + ρ . (7) n d See also [15, 4] for more bounds taking into account the optimization accuracy. 3.2 Gradient Optimization Algorithms We now discuss and compare the asymptotic learning properties of four gradient optimization algo∗ rithms. Recall that the family of function F is linearly parametrized by w ∈ Rd . Let wF and wn ∗ correspond to the functions fF and fn defined in section 2.1. In this section, we assume that the functions w → ℓ(fw (x), y) are convex and twice differentiable with continuous second derivatives. Convexity ensures that the empirical const function C(w) = En (fw ) has a single minimum. Two matrices play an important role in the analysis: the Hessian matrix H and the gradient covariance matrix G, both measured at the empirical optimum wn . ∂ 2 ℓ(fwn (x), y) ∂2C (wn ) = En , 2 ∂w ∂w2 H = G = En ∂ℓ(fwn (x), y) ∂w ∂ℓ(fwn (x), y) ∂w (8) ′ . (9) The relation between these two matrices depends on the chosen loss function. In order to summarize them, we assume that there are constants λmax ≥ λmin > 0 and ν > 0 such that, for any η > 0, we can choose the number of examples n large enough to ensure that the following assertion is true with probability greater than 1 − η : tr(G H −1 ) ≤ ν EigenSpectrum(H) ⊂ [ λmin , λmax ] and (10) The condition number κ = λmax /λmin is a good indicator of the difficulty of the optimization [16]. The condition λmin > 0 avoids complications with stochastic gradient algorithms. Note that this condition only implies strict convexity around the optimum. For instance, consider the loss function ℓ is obtained by smoothing the well known hinge loss ℓ(z, y) = max{0, 1 − yz} in a small neighborhood of its non-differentiable points. Function C(w) is then piecewise linear with smoothed edges and vertices. It is not strictly convex. However its minimum is likely to be on a smoothed vertex with a non singular Hessian. When we have strict convexity, the argument of [12, theorem 12] yields fast estimation rates α ≈ 1 in (4) and (6). This is not necessarily the case here. The four algorithm considered in this paper use information about the gradient of the cost function to iteratively update their current estimate w(t) of the parameter vector. • Gradient Descent (GD) iterates w(t + 1) = w(t) − η ∂C 1 (w(t)) = w(t) − η ∂w n n i=1 ∂ ℓ fw(t) (xi ), yi ∂w where η > 0 is a small enough gain. GD is an algorithm with linear convergence [16]. When η = 1/λmax , this algorithm requires O(κ log(1/ρ)) iterations to reach accuracy ρ. The exact number of iterations depends on the choice of the initial parameter vector. • Second Order Gradient Descent (2GD) iterates n w(t + 1) = w(t) − H −1 1 ∂ ∂C (w(t)) = w(t) − H −1 ℓ fw(t) (xi ), yi ∂w n ∂w i=1 where matrix H −1 is the inverse of the Hessian matrix (8). This is more favorable than Newton’s algorithm because we do not evaluate the local Hessian at each iteration but simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a single iteration is sufficient. In the general case, O(log log(1/ρ)) iterations are required to reach accuracy ρ. • Stochastic Gradient Descent (SGD) picks a random training example (xt , yt ) at each iteration and updates the parameter w on the basis of this example only, w(t + 1) = w(t) − η ∂ ℓ fw(t) (xt ), yt . t ∂w Murata [17, section 2.2], characterizes the mean ES [w(t)] and variance VarS [w(t)] with respect to the distribution implied by the random examples drawn from the training set S at each iteration. Applying this result to the discrete training set distribution for η = 1/λmin , we have δw(t)2 = O(1/t) where δw(t) is a shorthand notation for w(t) − wn . We can then write ES [ C(w(t)) − inf C ] = = ≤ ˆ ` ´˜ ` ´ ES tr H δw(t) δw(t)′ + o 1 t ` ´ ` ´ tr H ES [δw(t)] ES [δw(t)]′ + H VarS [w(t)] + o 1 t `1´ `1´ 2 tr(GH) + o t ≤ νκ + o t . t t (11) Therefore the SGD algorithm reaches accuracy ρ after less than νκ2/ρ + o(1/ρ) iterations on average. The SGD convergence is essentially limited by the stochastic noise induced by the random choice of one example at each iteration. Neither the initial value of the parameter vector w nor the total number of examples n appear in the dominant term of this bound! When the training set is large, one could reach the desired accuracy ρ measured on the whole training set without even visiting all the training examples. This is in fact a kind of generalization bound. Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second last column (time to optimize) with the last column (time to reach the excess test error ǫ). Legend: n number of examples; d parameter dimension; κ, ν see equation (10). Algorithm Cost of one iteration GD O(nd) 2GD O d2 + nd SGD O(d) 2SGD O d2 Iterations to reach ρ O κ log 1 ρ 1 O log log ρ νκ2 ρ ν ρ O ndκ log O 1 ρ +o +o Time to reach accuracy ρ 1 ρ 1 ρ d2 + nd log log O O Time to reach E ≤ c (Eapp + ε) d2 κ ε1/α O 1 ρ O d2 ε1/α dνκ2 ρ d2 ν ρ log2 1 ε log 1 log log 1 ε ε O O d ν κ2 ε d2 ν ε • Second Order Stochastic Gradient Descent (2SGD) replaces the gain η by the inverse of the Hessian matrix H: w(t + 1) = w(t) − 1 −1 ∂ H ℓ fw(t) (xt ), yt . t ∂w Unlike standard gradient algorithms, using the second order information does not change the influence of ρ on the convergence rate but improves the constants. Using again [17, theorem 4], accuracy ρ is reached after ν/ρ + o(1/ρ) iterations. For each of the four gradient algorithms, the first three columns of table 2 report the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ. These asymptotic results are valid with probability 1, since the probability of their complement is smaller than η for any η > 0. The fourth column bounds the time necessary to reduce the excess error E below c (Eapp +ε) where c `d ´α is the constant from (6). This is computed by observing that choosing ρ ∼ n log n in (6) achieves d the fastest rate for ε, with minimal computation time. We can then use the asymptotic equivalences d ρ ∼ ε and n ∼ ε1/α log 1 . Setting the fourth column expressions to Tmax and solving for ǫ yields ε the best excess error achieved by each algorithm within the limited time Tmax . This provides the asymptotic solution of the Estimation–Optimization tradeoff (3) for large scale problems satisfying our assumptions. These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the estimation procedure and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences: • The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error. • Second order algorithms bring little asymptotical improvements in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all four algorithms are dominated by the polynomial term in (1/ε). However, there are important variations in the influence of the constants d, κ and ν. These constants are very important in practice. • Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite being the worst optimization algorithms. This had been described before [18] and observed in experiments. In contrast, since the optimization error Eopt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is solely determined by the statistical properties of their estimation procedure. 4 Conclusion Taking in account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the estimation procedure and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms. Considerable refinements of this framework can be expected. Extending the analysis to regularized risk formulations would make results on the complexity of primal and dual optimization algorithms [19, 20] directly exploitable. The choice of surrogate loss function [7, 12] could also have a non-trivial impact in the large-scale case. Acknowledgments Part of this work was funded by NSF grant CCR-0325463. References [1] Leslie G. Valiant. A theory of learnable. Proc. of the 1984 STOC, pages 436–445, 1984. [2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, 1982. [3] St´ phane Boucheron, Olivier Bousquet, and G´ bor Lugosi. Theory of classification: a survey of recent e a advances. ESAIM: Probability and Statistics, 9:323–375, 2005. [4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006. [5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177– 192, 1988. [6] Richard O. Duda and Peter E. Hart. Pattern Classification And Scene Analysis. Wiley and Son, 1973. [7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004. [8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir, editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture Notes in Computer Science, pages 279–294, Bertinoro, Italy, June 2005. Springer-Verlag. [9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994. [10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002. [11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statististics, 32(1), 2004. [12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006. [13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Facult´ des e Sciences de Toulouse, series 6, 9(2):245–303, 2000. [14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. [15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer Science, pages 1–40. Springer-Verlag, Berlin, 2003. [16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983. [17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. [18] L´ on Bottou and Yann Le Cun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul, e and Bernhard Sch¨ lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, o Cambridge, MA, 2004. [19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD’06, Philadelphia, PA, USA, August 20-23 2006. ACM. [20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7:733–769, 2006.
6 0.18463026 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
7 0.18071949 58 nips-2007-Consistent Minimization of Clustering Objective Functions
8 0.16811332 135 nips-2007-Multi-task Gaussian Process Prediction
9 0.15849088 32 nips-2007-Bayesian Co-Training
10 0.096974388 161 nips-2007-Random Projections for Manifold Learning
11 0.08655639 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
12 0.080685869 213 nips-2007-Variational Inference for Diffusion Processes
13 0.078502707 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior
14 0.078087844 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
15 0.073781453 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
16 0.070220053 145 nips-2007-On Sparsity and Overcompleteness in Image Models
17 0.067807183 43 nips-2007-Catching Change-points with Lasso
18 0.066751771 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
19 0.066302858 59 nips-2007-Continuous Time Particle Filtering for fMRI
20 0.065160669 97 nips-2007-Hidden Common Cause Relations in Relational Learning
topicId topicWeight
[(0, -0.232), (1, 0.102), (2, -0.006), (3, 0.076), (4, -0.07), (5, 0.009), (6, -0.352), (7, -0.242), (8, -0.113), (9, -0.074), (10, -0.229), (11, -0.02), (12, -0.076), (13, -0.05), (14, -0.04), (15, 0.062), (16, 0.093), (17, -0.002), (18, -0.127), (19, -0.027), (20, -0.089), (21, -0.088), (22, 0.033), (23, -0.074), (24, -0.167), (25, -0.145), (26, -0.023), (27, 0.167), (28, 0.049), (29, 0.008), (30, -0.062), (31, 0.083), (32, 0.019), (33, 0.016), (34, 0.011), (35, -0.037), (36, -0.026), (37, 0.062), (38, -0.016), (39, 0.038), (40, 0.002), (41, 0.01), (42, 0.002), (43, -0.018), (44, 0.045), (45, 0.072), (46, 0.026), (47, -0.015), (48, 0.044), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.93836969 170 nips-2007-Robust Regression with Twinned Gaussian Processes
Author: Andrew Naish-guzman, Sean Holden
Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].
2 0.80142677 195 nips-2007-The Generalized FITC Approximation
Author: Andrew Naish-guzman, Sean Holden
Abstract: We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani [1], applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(N M 2 ) training complexity—asymptotically the same as related sparse methods such as the informative vector machine [2], but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following [1], we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.
3 0.60714507 200 nips-2007-The Tradeoffs of Large Scale Learning
Author: Olivier Bousquet, Léon Bottou
Abstract: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways. 1 Motivation The computational complexity of learning algorithms has seldom been taken into account by the learning theory. Valiant [1] states that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind: • Data mining exists because competitive advantages can be achieved by analyzing the masses of data that describe the life of our computerized society. Since virtually every computer generates data, the data volume is proportional to the available computing power. Therefore one needs learning algorithms that scale roughly linearly with the total volume of data. • Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our biological brains can learn quite efficiently from the continuous streams of perceptual data generated by our six senses, using limited amounts of sugar as a source of power. This observation suggests that there are learning algorithms whose computing time requirements scale roughly linearly with the total volume of data. This contribution finds its source in the idea that approximate optimization algorithms might be sufficient for learning purposes. The first part proposes new decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. The second part explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error. 2 2.1 Approximate Optimization Setup Following [6, 2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a probability distribution P (x, y). The conditional distribution P (y|x) represents the unknown relationship between inputs and outputs. The discrepancy between the predicted output y and the real output ˆ y is measured with a loss function ℓ(ˆ, y). Our benchmark is the function f ∗ that minimizes the y expected risk E(f ) = that is, ℓ(f (x), y) dP (x, y) = E [ℓ(f (x), y)], f ∗ (x) = arg min E [ ℓ(ˆ, y)| x]. y y ˆ Although the distribution P (x, y) is unknown, we are given a sample S of n independently drawn training examples (xi , yi ), i = 1 . . . n. We define the empirical risk En (f ) = 1 n n ℓ(f (xi ), yi ) = En [ℓ(f (x), y)]. i=1 Our first learning principle consists in choosing a family F of candidate prediction functions and finding the function fn = arg minf ∈F En (f ) that minimizes the empirical risk. Well known combinatorial results (e.g., [2]) support this approach provided that the chosen family F is sufficiently restrictive. Since the optimal function f ∗ is unlikely to belong to the family F, we also define ∗ ∗ fF = arg minf ∈F E(f ). For simplicity, we assume that f ∗ , fF and fn are well defined and unique. We can then decompose the excess error as ∗ ∗ E [E(fn ) − E(f ∗ )] = E [E(fF ) − E(f ∗ )] + E [E(fn ) − E(fF )] = Eapp + Eest , (1) where the expectation is taken with respect to the random choice of training set. The approximation error Eapp measures how closely functions in F can approximate the optimal solution f ∗ . The estimation error Eest measures the effect of minimizing the empirical risk En (f ) instead of the expected risk E(f ). The estimation error is determined by the number of training examples and by the capacity of the family of functions [2]. Large families1 of functions have smaller approximation errors but lead to higher estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and lead to excess error that scale between the inverse and the inverse square root of the number of examples [7, 8]. 2.2 Optimization Error Finding fn by minimizing the empirical risk En (f ) is often a computationally expensive operation. Since the empirical risk En (f ) is already an approximation of the expected risk E(f ), it should not be necessary to carry out this minimization with great accuracy. For instance, we could stop an iterative optimization algorithm long before its convergence. ˜ Let us assume that our minimization algorithm returns an approximate solution fn such that ˜ En (fn ) < En (fn ) + ρ ˜ where ρ ≥ 0 is a predefined tolerance. An additional term Eopt = E E(fn ) − E(fn ) then appears ˜ ) − E(f ∗ ) : in the decomposition of the excess error E = E E(fn E ∗ ∗ ˜ = E [E(fF ) − E(f ∗ )] + E [E(fn ) − E(fF )] + E E(fn ) − E(fn ) = Eapp + Eest + Eopt . (2) We call this additional term optimization error. It reflects the impact of the approximate optimization on the generalization performance. Its magnitude is comparable to ρ (see section 3.1.) 1 We often consider nested families of functions of the form Fc = {f ∈ H, Ω(f ) ≤ c}. Then, for each value of c, function fn is obtained by minimizing the regularized empirical risk En (f ) + λΩ(f ) for a suitable choice of the Lagrange coefficient λ. We can then control the estimation-approximation tradeoff by choosing λ instead of c. 2.3 The Approximation–Estimation–Optimization Tradeoff This decomposition leads to a more complicated compromise. It involves three variables and two constraints. The constraints are the maximal number of available training example and the maximal computation time. The variables are the size of the family of functions F, the optimization accuracy ρ, and the number of examples n. This is formalized by the following optimization problem. min E = Eapp + Eest + Eopt F ,ρ,n n ≤ nmax T (F, ρ, n) ≤ Tmax subject to (3) The number n of training examples is a variable because we could choose to use only a subset of the available training examples in order to complete the optimization within the alloted time. This happens often in practice. Table 1 summarizes the typical evolution of the quantities of interest with the three variables F, n, and ρ increase. Table 1: Typical variations when F, n, and ρ increase. F Eapp Eest Eopt T (approximation error) (estimation error) (optimization error) (computation time) n ρ ց ր ··· ր ց ··· ր ր ց The solution of the optimization program (3) depends critically of which budget constraint is active: constraint n < nmax on the number of examples, or constraint T < Tmax on the training time. • We speak of small-scale learning problem when (3) is constrained by the maximal number of examples nmax . Since the computing time is not limited, we can reduce the optimization error Eopt to insignificant levels by choosing ρ arbitrarily small. The excess error is then dominated by the approximation and estimation errors, Eapp and Eest . Taking n = nmax , we recover the approximation-estimation tradeoff that is the object of abundant literature. • We speak of large-scale learning problem when (3) is constrained by the maximal computing time Tmax . Approximate optimization, that is choosing ρ > 0, possibly can achieve better generalization because more training examples can be processed during the allowed time. The specifics depend on the computational properties of the chosen optimization algorithm through the expression of the computing time T (F, ρ, n). 3 The Asymptotics of Large-scale Learning In the previous section, we have extended the classical approximation-estimation tradeoff by taking into account the optimization error. We have given an objective criterion to distiguish small-scale and large-scale learning problems. In the small-scale case, we recover the classical tradeoff between approximation and estimation. The large-scale case is substantially different because it involves the computational complexity of the learning algorithm. In order to clarify the large-scale learning tradeoff with sufficient generality, this section makes several simplifications: • We are studying upper bounds of the approximation, estimation, and optimization errors (2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates [9, 10, 11, 12]. Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of the possibly pathological special cases. • We are studying the asymptotic properties of the tradeoff when the problem size increases. Instead of carefully balancing the three terms, we write E = O(Eapp ) + O(Eest ) + O(Eopt ) and only need to ensure that the three terms decrease with the same asymptotic rate. • We are considering a fixed family of functions F and therefore avoid taking into account the approximation error Eapp . This part of the tradeoff covers a wide spectrum of practical realities such as choosing models and choosing features. In the context of this work, we do not believe we can meaningfully address this without discussing, for instance, the thorny issue of feature selection. Instead we focus on the choice of optimization algorithm. • Finally, in order to keep this paper short, we consider that the family of functions F is linearly parametrized by a vector w ∈ Rd . We also assume that x, y and w are bounded, ensuring that there is a constant B such that 0 ≤ ℓ(fw (x), y) ≤ B and ℓ(·, y) is Lipschitz. We first explain how the uniform convergence bounds provide convergence rates that take the optimization error into account. Then we discuss and compare the asymptotic learning properties of several optimization algorithms. 3.1 Convergence of the Estimation and Optimization Errors The optimization error Eopt depends directly on the optimization accuracy ρ. However, the accuracy ˜ ρ involves the empirical quantity En (fn ) − En (fn ), whereas the optimization error Eopt involves ˜ its expected counterpart E(fn ) − E(fn ). This section discusses the impact on the optimization error Eopt and of the optimization accuracy ρ on generalization bounds that leverage the uniform convergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2].) In this discussion, we use the letter c to refer to any positive constant. Multiple occurences of the letter c do not necessarily imply that the constants have identical values. 3.1.1 Simple Uniform Convergence Bounds Recall that we assume that F is linearly parametrized by w ∈ Rd . Elementary uniform convergence results then state that r » – E sup |E(f ) − En (f )| ≤ c f ∈F d , n where the expectation is taken with respect to the random choice of the training set.2 This result immediately provides a bound on the estimation error: Eest = ≤ ´ ` ` ∗ ´ ∗ ∗ ´˜ E(fn ) − En (fn ) + En (fn ) − En (fF ) + En (fF ) − E(fF ) r » – d . 2 E sup |E(f ) − En (f )| ≤ c n f ∈F E ˆ` This same result also provides a combined bound for the estimation and optimization errors: Eest + Eopt = + ≤ ˆ ˜ ˆ ˜ ˜ ˜ ˜ E E(fn ) − En (fn ) + E En (fn ) − En (fn ) ∗ ∗ ∗ E [En (fn ) − En (fF )] + E [En (fF ) − E(fF )] r r r ! d d d c +ρ+0+c = c ρ+ . n n n Unfortunately, this convergence rate is known to be pessimistic in many important cases. More sophisticated bounds are required. 3.1.2 Faster Rates in the Realizable Case When the loss functions ℓ(ˆ, y) is positive, with probability 1 − e−τ for any τ > 0, relative uniform y convergence bounds state that r E(f ) − En (f ) d n τ p sup log + . ≤c n d n f ∈F E(f ) This result is very useful because it provides faster convergence rates O(log n/n) in the realizable case, that is when ℓ(fn (xi ), yi ) = 0 for all training examples (xi , yi ). We have then En (fn ) = 0, ˜ En (fn ) ≤ ρ, and we can write r q n d τ ˜ ˜ E(fn ) − ρ ≤ c E(fn ) log + . n d n q 2 d Although the original Vapnik-Chervonenkis bounds have the form c n log n , the logarithmic term can d be eliminated using the “chaining” technique (e.g., [10].) Viewing this as a second degree polynomial inequality in variable ˜ E(fn ), we obtain „ « d n τ ˜ E(fn ) ≤ c ρ + log + . n d n Integrating this inequality using a standard technique (see, e.g., [13]), we obtain a better convergence rate of the combined estimation and optimization error: „ « h i h i d n ∗ ˜ ˜ Eest + Eopt = E E(fn ) − E(fF ) ≤ E E(fn ) = c ρ + log . n d 3.1.3 Fast Rate Bounds Many authors (e.g., [10, 4, 12]) obtain fast statistical estimation rates in more general conditions. These bounds have the general form α n 1 d log for ≤ α ≤ 1. n d 2 This result holds when one can establish the following variance condition: Eapp + Eest ≤ c ∀f ∈ F Eapp + ∗ ℓ(f (X), Y ) − ℓ(fF (X), Y ) E 2 ≤ c ∗ E(f ) − E(fF ) (4) 1 2− α . (5) The convergence rate of (4) is described by the exponent α which is determined by the quality of the variance bound (5). Works on fast statistical estimation identify two main ways to establish such a variance condition. • Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee et al. [14] establish a O(log n/n) rate using the squared loss ℓ(ˆ, y) = (ˆ − y)2 . y y • Making assumptions on the data distribution. In the case of pattern recognition problems, for instance, the “Tsybakov condition” indicates how cleanly the posterior distributions P (y|x) cross near the optimal decision boundary [11, 12]. The realizable case discussed in section 3.1.2 can be viewed as an extreme case of this. Despite their much greater complexity, fast rate estimation results can accomodate the optimization accuracy ρ using essentially the methods illustrated in sections 3.1.1 and 3.1.2. We then obtain a bound of the form α d n ˜ E = Eapp + Eest + Eopt = E E(fn ) − E(f ∗ ) ≤ c Eapp + log +ρ . (6) n d For instance, a general result with α = 1 is provided by Massart [13, theorem 4.2]. Combining this result with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the following result: d n ˜ E = Eapp + Eest + Eopt = E E(fn ) − E(f ∗ ) ≤ c Eapp + log + ρ . (7) n d See also [15, 4] for more bounds taking into account the optimization accuracy. 3.2 Gradient Optimization Algorithms We now discuss and compare the asymptotic learning properties of four gradient optimization algo∗ rithms. Recall that the family of function F is linearly parametrized by w ∈ Rd . Let wF and wn ∗ correspond to the functions fF and fn defined in section 2.1. In this section, we assume that the functions w → ℓ(fw (x), y) are convex and twice differentiable with continuous second derivatives. Convexity ensures that the empirical const function C(w) = En (fw ) has a single minimum. Two matrices play an important role in the analysis: the Hessian matrix H and the gradient covariance matrix G, both measured at the empirical optimum wn . ∂ 2 ℓ(fwn (x), y) ∂2C (wn ) = En , 2 ∂w ∂w2 H = G = En ∂ℓ(fwn (x), y) ∂w ∂ℓ(fwn (x), y) ∂w (8) ′ . (9) The relation between these two matrices depends on the chosen loss function. In order to summarize them, we assume that there are constants λmax ≥ λmin > 0 and ν > 0 such that, for any η > 0, we can choose the number of examples n large enough to ensure that the following assertion is true with probability greater than 1 − η : tr(G H −1 ) ≤ ν EigenSpectrum(H) ⊂ [ λmin , λmax ] and (10) The condition number κ = λmax /λmin is a good indicator of the difficulty of the optimization [16]. The condition λmin > 0 avoids complications with stochastic gradient algorithms. Note that this condition only implies strict convexity around the optimum. For instance, consider the loss function ℓ is obtained by smoothing the well known hinge loss ℓ(z, y) = max{0, 1 − yz} in a small neighborhood of its non-differentiable points. Function C(w) is then piecewise linear with smoothed edges and vertices. It is not strictly convex. However its minimum is likely to be on a smoothed vertex with a non singular Hessian. When we have strict convexity, the argument of [12, theorem 12] yields fast estimation rates α ≈ 1 in (4) and (6). This is not necessarily the case here. The four algorithm considered in this paper use information about the gradient of the cost function to iteratively update their current estimate w(t) of the parameter vector. • Gradient Descent (GD) iterates w(t + 1) = w(t) − η ∂C 1 (w(t)) = w(t) − η ∂w n n i=1 ∂ ℓ fw(t) (xi ), yi ∂w where η > 0 is a small enough gain. GD is an algorithm with linear convergence [16]. When η = 1/λmax , this algorithm requires O(κ log(1/ρ)) iterations to reach accuracy ρ. The exact number of iterations depends on the choice of the initial parameter vector. • Second Order Gradient Descent (2GD) iterates n w(t + 1) = w(t) − H −1 1 ∂ ∂C (w(t)) = w(t) − H −1 ℓ fw(t) (xi ), yi ∂w n ∂w i=1 where matrix H −1 is the inverse of the Hessian matrix (8). This is more favorable than Newton’s algorithm because we do not evaluate the local Hessian at each iteration but simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a single iteration is sufficient. In the general case, O(log log(1/ρ)) iterations are required to reach accuracy ρ. • Stochastic Gradient Descent (SGD) picks a random training example (xt , yt ) at each iteration and updates the parameter w on the basis of this example only, w(t + 1) = w(t) − η ∂ ℓ fw(t) (xt ), yt . t ∂w Murata [17, section 2.2], characterizes the mean ES [w(t)] and variance VarS [w(t)] with respect to the distribution implied by the random examples drawn from the training set S at each iteration. Applying this result to the discrete training set distribution for η = 1/λmin , we have δw(t)2 = O(1/t) where δw(t) is a shorthand notation for w(t) − wn . We can then write ES [ C(w(t)) − inf C ] = = ≤ ˆ ` ´˜ ` ´ ES tr H δw(t) δw(t)′ + o 1 t ` ´ ` ´ tr H ES [δw(t)] ES [δw(t)]′ + H VarS [w(t)] + o 1 t `1´ `1´ 2 tr(GH) + o t ≤ νκ + o t . t t (11) Therefore the SGD algorithm reaches accuracy ρ after less than νκ2/ρ + o(1/ρ) iterations on average. The SGD convergence is essentially limited by the stochastic noise induced by the random choice of one example at each iteration. Neither the initial value of the parameter vector w nor the total number of examples n appear in the dominant term of this bound! When the training set is large, one could reach the desired accuracy ρ measured on the whole training set without even visiting all the training examples. This is in fact a kind of generalization bound. Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second last column (time to optimize) with the last column (time to reach the excess test error ǫ). Legend: n number of examples; d parameter dimension; κ, ν see equation (10). Algorithm Cost of one iteration GD O(nd) 2GD O d2 + nd SGD O(d) 2SGD O d2 Iterations to reach ρ O κ log 1 ρ 1 O log log ρ νκ2 ρ ν ρ O ndκ log O 1 ρ +o +o Time to reach accuracy ρ 1 ρ 1 ρ d2 + nd log log O O Time to reach E ≤ c (Eapp + ε) d2 κ ε1/α O 1 ρ O d2 ε1/α dνκ2 ρ d2 ν ρ log2 1 ε log 1 log log 1 ε ε O O d ν κ2 ε d2 ν ε • Second Order Stochastic Gradient Descent (2SGD) replaces the gain η by the inverse of the Hessian matrix H: w(t + 1) = w(t) − 1 −1 ∂ H ℓ fw(t) (xt ), yt . t ∂w Unlike standard gradient algorithms, using the second order information does not change the influence of ρ on the convergence rate but improves the constants. Using again [17, theorem 4], accuracy ρ is reached after ν/ρ + o(1/ρ) iterations. For each of the four gradient algorithms, the first three columns of table 2 report the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ. These asymptotic results are valid with probability 1, since the probability of their complement is smaller than η for any η > 0. The fourth column bounds the time necessary to reduce the excess error E below c (Eapp +ε) where c `d ´α is the constant from (6). This is computed by observing that choosing ρ ∼ n log n in (6) achieves d the fastest rate for ε, with minimal computation time. We can then use the asymptotic equivalences d ρ ∼ ε and n ∼ ε1/α log 1 . Setting the fourth column expressions to Tmax and solving for ǫ yields ε the best excess error achieved by each algorithm within the limited time Tmax . This provides the asymptotic solution of the Estimation–Optimization tradeoff (3) for large scale problems satisfying our assumptions. These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the estimation procedure and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences: • The SGD and 2SGD results do not depend on the estimation rate α. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error. • Second order algorithms bring little asymptotical improvements in ε. Although the superlinear 2GD algorithm improves the logarithmic term, all four algorithms are dominated by the polynomial term in (1/ε). However, there are important variations in the influence of the constants d, κ and ν. These constants are very important in practice. • Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite being the worst optimization algorithms. This had been described before [18] and observed in experiments. In contrast, since the optimization error Eopt of small-scale learning systems can be reduced to insignificant levels, their generalization performance is solely determined by the statistical properties of their estimation procedure. 4 Conclusion Taking in account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the estimation procedure and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms. Considerable refinements of this framework can be expected. Extending the analysis to regularized risk formulations would make results on the complexity of primal and dual optimization algorithms [19, 20] directly exploitable. The choice of surrogate loss function [7, 12] could also have a non-trivial impact in the large-scale case. Acknowledgments Part of this work was funded by NSF grant CCR-0325463. References [1] Leslie G. Valiant. A theory of learnable. Proc. of the 1984 STOC, pages 436–445, 1984. [2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, 1982. [3] St´ phane Boucheron, Olivier Bousquet, and G´ bor Lugosi. Theory of classification: a survey of recent e a advances. ESAIM: Probability and Statistics, 9:323–375, 2005. [4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006. [5] J. Stephen Judd. On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177– 192, 1988. [6] Richard O. Duda and Peter E. Hart. Pattern Classification And Scene Analysis. Wiley and Son, 1973. [7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004. [8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir, editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture Notes in Computer Science, pages 279–294, Bertinoro, Italy, June 2005. Springer-Verlag. [9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994. [10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002. [11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statististics, 32(1), 2004. [12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006. [13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Facult´ des e Sciences de Toulouse, series 6, 9(2):245–303, 2000. [14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. [15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer Science, pages 1–40. Springer-Verlag, Berlin, 2003. [16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983. [17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. [18] L´ on Bottou and Yann Le Cun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul, e and Bernhard Sch¨ lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, o Cambridge, MA, 2004. [19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD’06, Philadelphia, PA, USA, August 20-23 2006. ACM. [20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7:733–769, 2006.
4 0.53836662 58 nips-2007-Consistent Minimization of Clustering Objective Functions
Author: Ulrike V. Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. 1
5 0.53742105 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
6 0.51962566 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
7 0.50431311 32 nips-2007-Bayesian Co-Training
8 0.44005501 135 nips-2007-Multi-task Gaussian Process Prediction
9 0.41598937 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes
10 0.40959144 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
11 0.40420631 174 nips-2007-Selecting Observations against Adversarial Objectives
12 0.34400347 97 nips-2007-Hidden Common Cause Relations in Relational Learning
13 0.33172423 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
14 0.32101813 213 nips-2007-Variational Inference for Diffusion Processes
15 0.31116551 145 nips-2007-On Sparsity and Overcompleteness in Image Models
16 0.31038004 214 nips-2007-Variational inference for Markov jump processes
17 0.2972419 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
18 0.28877369 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
19 0.28186029 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
20 0.28122371 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
topicId topicWeight
[(5, 0.028), (13, 0.02), (16, 0.46), (18, 0.014), (21, 0.084), (34, 0.015), (35, 0.018), (47, 0.072), (49, 0.014), (83, 0.094), (85, 0.035), (87, 0.017), (90, 0.057)]
simIndex simValue paperId paperTitle
1 0.91191661 60 nips-2007-Contraction Properties of VLSI Cooperative Competitive Neural Networks of Spiking Neurons
Author: Emre Neftci, Elisabetta Chicca, Giacomo Indiveri, Jean-jeacques Slotine, Rodney J. Douglas
Abstract: A non–linear dynamic system is called contracting if initial conditions are forgotten exponentially fast, so that all trajectories converge to a single trajectory. We use contraction theory to derive an upper bound for the strength of recurrent connections that guarantees contraction for complex neural networks. Specifically, we apply this theory to a special class of recurrent networks, often called Cooperative Competitive Networks (CCNs), which are an abstract representation of the cooperative-competitive connectivity observed in cortex. This specific type of network is believed to play a major role in shaping cortical responses and selecting the relevant signal among distractors and noise. In this paper, we analyze contraction of combined CCNs of linear threshold units and verify the results of our analysis in a hybrid analog/digital VLSI CCN comprising spiking neurons and dynamic synapses. 1
2 0.90822959 33 nips-2007-Bayesian Inference for Spiking Neuron Models with a Sparsity Prior
Author: Sebastian Gerwinn, Matthias Bethge, Jakob H. Macke, Matthias Seeger
Abstract: Generalized linear models are the most commonly used tools to describe the stimulus selectivity of sensory neurons. Here we present a Bayesian treatment of such models. Using the expectation propagation algorithm, we are able to approximate the full posterior distribution over all weights. In addition, we use a Laplacian prior to favor sparse solutions. Therefore, stimulus features that do not critically influence neural activity will be assigned zero weights and thus be effectively excluded by the model. This feature selection mechanism facilitates both the interpretation of the neuron model as well as its predictive abilities. The posterior distribution can be used to obtain confidence intervals which makes it possible to assess the statistical significance of the solution. In neural data analysis, the available amount of experimental measurements is often limited whereas the parameter space is large. In such a situation, both regularization by a sparsity prior and uncertainty estimates for the model parameters are essential. We apply our method to multi-electrode recordings of retinal ganglion cells and use our uncertainty estimate to test the statistical significance of functional couplings between neurons. Furthermore we used the sparsity of the Laplace prior to select those filters from a spike-triggered covariance analysis that are most informative about the neural response. 1
same-paper 3 0.87871861 170 nips-2007-Robust Regression with Twinned Gaussian Processes
Author: Andrew Naish-guzman, Sean Holden
Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].
4 0.87061691 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm
Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio
Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.
5 0.70499665 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
Author: Jonathan W. Pillow, Peter E. Latham
Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
6 0.69304734 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models
7 0.68176067 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes
8 0.62714791 195 nips-2007-The Generalized FITC Approximation
9 0.59894955 164 nips-2007-Receptive Fields without Spike-Triggering
10 0.58705819 205 nips-2007-Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity
11 0.55205685 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons
12 0.53168839 35 nips-2007-Bayesian binning beats approximate alternatives: estimating peri-stimulus time histograms
13 0.51286077 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
14 0.50907898 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
15 0.50588572 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
16 0.5007447 213 nips-2007-Variational Inference for Diffusion Processes
17 0.49174836 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
18 0.46731225 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
19 0.46433827 174 nips-2007-Selecting Observations against Adversarial Objectives
20 0.46113944 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding