nips nips2001 nips2001-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. [sent-4, score-0.72]
2 As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance. [sent-6, score-0.575]
3 1 Introduction Approximate expressions for generalization errors for finite dimensional statistical data models can often be obtained in the large data limit using asymptotic expansions. [sent-7, score-0.209]
4 Such methods can yield approximate relations between empirical and true errors which can be used to assess the quality of the trained model, see e.g. [sent-8, score-0.3]
5 Unfortunately, such an approximation scheme does not seem to be easily applicable to popular non-parametric models like Gaussian process (GP) models and Support Vector Machines (SVMs). [sent-11, score-0.078]
6 We apply the replica approach of statistical physics to assess the average-case learning performance of these kernel machines. [sent-12, score-0.386]
7 So far, the tools of statistical physics have been successfully applied to a variety of learning problems [2]. [sent-13, score-0.054]
8 However, this elegant method suffers from the drawback that data averages can be performed exactly only under very idealistic assumptions on the data distribution in the "thermodynamic" limit of infinite data space dimension. [sent-14, score-0.352]
9 We try to overcome these limitations by combining the replica method with a variational approximation. [sent-15, score-0.594]
10 For Bayesian models, our method allows us to express useful data-averaged a posteriori expectations by means of an approximate measure. [sent-16, score-0.227]
11 The derivation of this measure requires no assumptions about the data density and no assumptions about the input dimension. [sent-17, score-0.266]
12 The main focus of this article is Gaussian process regression where we demonstrate the various strengths of the presented method. [sent-18, score-0.194]
13 For Gaussian process models we show that our method not only gives explicit approximations for generalization errors but also for their sample fluctuations. [sent-20, score-0.218]
14 Furthermore, we show how to compute corrections to our theory and demonstrate the possibility of deriving approximate universal relations between average empirical and true errors which might be of practical interest. [sent-21, score-0.522]
15 An earlier version of our approach, which was still restricted to the assumption of idealized data distributions, appeared in [4]. [sent-22, score-0.056]
16 In a Bayesian formulation, we have a prior distribution over this class of functions. [sent-25, score-0.066]
17 Assuming that a set of observations is conditionally independent given the inputs, we assign a likelihood term to each observation. [sent-26, score-0.059]
18 Posterior expectations (denoted by angular brackets) of any functional $G[f]$ are expressed in the form $\langle G[f] \rangle = Z^{-1}\,\mathbb{E}\big[\,G[f]\,\prod_{i=1}^{m} P(y_i \mid f(x_i))\big]$, [sent-27, score-0.063]
19 where the partition function $Z = \mathbb{E}\big[\prod_{i=1}^{m} P(y_i \mid f(x_i))\big]$ normalizes the posterior and $\mathbb{E}[\cdot]$ denotes the expectation with respect to the prior. [sent-29, score-0.416]
20 We are interested in computing averages of posterior expectations over different drawings of training data sets where all data examples are independently generated from the same distribution. [sent-30, score-0.56]
21 In the next section we will show how to derive a measure which enables us to compute approximate combined data and posterior averages analytically. [sent-31, score-0.328]
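To make the Gaussian-process setting above concrete, here is a minimal numpy sketch of the Gaussian-likelihood posterior that the paper later specializes to; this is a standard textbook computation, not code from the paper, and the kernel, noise level and data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.5, variance=1.0):
    """Squared-exponential covariance between two sets of inputs."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.01):
    """Posterior mean and local posterior variance of GP regression (Gaussian likelihood)."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    k_star = rbf_kernel(X_train, X_test)                  # cross-covariances, shape (m, n_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = rbf_kernel(X_test, X_test).diagonal() - (v ** 2).sum(axis=0)
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))                  # training inputs
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(30)
X_star = np.linspace(-1.0, 1.0, 5)[:, None]               # a few test inputs
mu, sigma2 = gp_posterior(X, y, X_star)
print(mu)
print(sigma2)                                             # local posterior variance sigma^2(x)
```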
22 3 A Grand-Canonical Approach We utilize the statistical mechanics approach to the analysis of learning. [sent-32, score-0.078]
23 Our aim is to compute the so-called averaged "free energy", which serves as a generating function for suitable data averages of posterior expectations. [sent-33, score-0.506]
24 The partition function is defined in Eq. (2). [sent-34, score-0.301]
25 To perform the average we use the replica trick $\mathbb{E}[\ln Z] = \lim_{n \to 0} \frac{1}{n} \ln \mathbb{E}[Z^{n}]$, where $\mathbb{E}[Z^{n}]$ is computed for integer $n$ and the continuation $n \to 0$ is performed at the end [5]. [sent-35, score-0.245]
26 This leads to Eq. (3), where the expectation is taken over the replicated prior measure. [sent-37, score-0.166]
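As a numerical sanity check of the replica identity itself (standard, not specific to this paper), the toy sketch below uses a lognormal Z for which both sides are known exactly; the distribution and sample size are assumptions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy random "partition function": Z = exp(g), g ~ N(0, 1), so that
# E[ln Z] = 0 exactly and E[Z^n] = exp(n^2 / 2) for every n.
g = rng.standard_normal(1_000_000)
Z = np.exp(g)

print("direct quenched average E[ln Z]:", np.log(Z).mean())
for n in (0.5, 0.1, 0.01):
    replica = np.log((Z ** n).mean()) / n   # (1/n) ln E[Z^n]; exact value is n/2
    print(f"replica estimate at n = {n}: {replica:.4f}  (exact n/2 = {n / 2})")
```

As n shrinks, the replica estimate approaches the direct quenched average, which is the continuation performed analytically in the text.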
28 The result can be written in terms of the Hamiltonian of Eq. (5), which evaluates all replicas of the predictor at the same data point; the expectation is taken with respect to the true data density. [sent-41, score-0.309]
29 The "grand canonical" partition function represents a "poissonized" version of the original model with a fluctuating number of examples. [sent-42, score-0.249]
30 The "chemical potential" determines the expected number of examples. [sent-43, score-0.045]
31 We replace Eq. (4) by its dominating term, Eq. (6), thereby neglecting relative fluctuations. [sent-45, score-0.15]
32 We recover the original (canonical) free energy in this way. [sent-46, score-0.354]
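A small numerical illustration of the "poissonized" ensemble (not from the paper): replacing a fixed number of examples m by a Poisson-distributed one with the same mean changes a smooth data-averaged quantity only by terms controlled by the relative fluctuations, which is exactly what the dominating-term approximation above neglects. The toy learning-curve function f(m) below is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import poisson

# Any smooth "learning-curve-like" function of the number of examples m
# (purely illustrative; not a quantity computed in the paper).
f = lambda m: 1.0 / (1.0 + 0.2 * m)

for mean_m in (5, 50, 500):
    ks = np.arange(0, int(mean_m + 10.0 * np.sqrt(mean_m)) + 1)
    grand = np.sum(poisson.pmf(ks, mean_m) * f(ks))   # Poisson ("grand-canonical") average
    canonical = f(mean_m)                             # value at the fixed mean sample size
    print(mean_m, grand, canonical, abs(grand - canonical) / canonical)
```

The relative discrepancy between the two averages shrinks as the mean sample size grows, which is when the saddle-point (dominating-term) step becomes accurate.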
33 4 Variational Approximation For most interesting cases, the partition function cannot be computed in closed form. [sent-47, score-0.108]
34 Hence, we use a variational approach which approximates the Hamiltonian by a different, tractable Hamiltonian. [sent-48, score-0.411]
35 Although differentiating the bound with respect to the variational parameters will usually not preserve the inequality, we still [sent-52, score-0.06]
36 expect that an optimization with respect to these parameters is a sensible thing to do [7]. [sent-53, score-0.265]
37 Eq. (5) can be rewritten as an integral over a local quantity in the input variable. [sent-55, score-0.111]
38 We will now specialize to Gaussian priors over the functions, for which a local quadratic expression, Eq. (9), [sent-58, score-0.046]
39 is a suitable trial Hamiltonian, leading to Gaussian averages. [sent-59, score-0.266]
40 The functions entering this trial Hamiltonian are variational parameters to be optimized. [sent-60, score-0.349]
41 It is important to have an explicit dependence on the input variable in order to take a non-uniform input density into account. [sent-61, score-0.224]
42 Eq. (8) makes the "variational free energy" an explicit function of the first two local moments. [sent-65, score-0.26]
43 Hence, a straightforward variation yields the variational equations (11). [sent-66, score-0.121]
44 To extend the variational solutions to non-integer values of the replica number, we assume that all optimal parameters are replica symmetric, i.e. independent of the replica index. [sent-67, score-0.594]
45 4.2 Interpretation Note that our approach is not equivalent to a variational approximation of the original posterior. [sent-76, score-0.056]
46 We can use the distribution induced by the prior and the optimized trial Hamiltonian in order to compute approximate combined data and posterior averages. [sent-78, score-0.394]
47 As an example, we first consider the expected local . [sent-79, score-0.046]
48 The prior over functions has zero mean and covariance given by the kernel. [sent-82, score-0.066]
49 This yields the set of variational equations (11). [sent-86, score-0.394]
50 They become particularly easy when the regression model uses a translationally invariant kernel and the input distribution is homogeneous in a finite interval. [sent-87, score-0.362]
51 The variational equations (11) can then be solved in terms of the eigenvalues of the Gaussian process kernel. [sent-88, score-0.427]
52 [8, 9] studied learning curves for Gaussian process regression which are not only averaged over the data but also over the data generating process, using a Gaussian process prior on the target function. [sent-89, score-0.703]
53 Applying these averages to our theory and adapting the notation of [9] amounts to a simple substitution in our equations. [sent-90, score-0.181]
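The variational equations themselves do not survive the extraction. As a hedged sketch of what an eigenvalue-based solution looks like in this line of work, the code below iterates a self-consistent eigenvalue approximation of the Bayes-averaged learning curve in the spirit of [8, 9]; it is not necessarily identical to Eq. (11) of the present paper, and the kernel parameters, grid size and noise level are assumptions.

```python
import numpy as np

def periodic_rbf(x1, x2, length_scale=0.1):
    """Periodic RBF kernel on [0, 1); parameters are illustrative assumptions."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1[:, None] - x2[None, :])) ** 2
                  / length_scale ** 2)

def kernel_eigenvalues(kernel, n_grid=400):
    """Nystrom estimate of the kernel-operator eigenvalues under a uniform density on [0, 1]."""
    x = (np.arange(n_grid) + 0.5) / n_grid
    lam = np.linalg.eigvalsh(kernel(x, x))[::-1] / n_grid
    return np.clip(lam, 0.0, None)          # discard tiny negative round-off values

def learning_curve(lam, noise_var, ms, n_iter=200):
    """Self-consistent eigenvalue approximation of the averaged generalization error:
    eps = sum_k lam_k / (1 + m * lam_k / (noise_var + eps)), solved by fixed-point iteration."""
    out = []
    for m in ms:
        eps = lam.sum()                     # start from the total prior variance (m = 0 value)
        for _ in range(n_iter):
            eps = np.sum(lam / (1.0 + m * lam / (noise_var + eps)))
        out.append(eps)
    return np.array(out)

lam = kernel_eigenvalues(periodic_rbf)
ms = np.arange(0, 201, 20)
print(learning_curve(lam, noise_var=1e-4, ms=ms))
```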
54 The data generating process is unknown but assumed to be fixed. [sent-94, score-0.191]
55 Displayed are the mean square prediction error (circle and solid line) and its sample fluctuations (error bars) with respect to the data average (cross and broken line). [sent-98, score-0.222]
56 The target was a random but fixed realization from a Gaussian process prior with a periodic Radial Basis Function kernel. [sent-99, score-0.482]
57 The Gaussian process regression model used the same kernel and noise level. [sent-101, score-0.322]
58 The inputs are one-dimensional, independent and uniformly distributed. [sent-102, score-0.059]
59 A typical property of our theory (lines) is that it becomes very accurate for a sufficiently large number of example data. [sent-104, score-0.044]
60 Figure 1: Gaussian process regression using a periodic Radial Basis Function kernel, input dimension d=1, and homogeneous input density; both panels plot error measures against the number m of example data. [sent-109, score-0.669]
61 Left: Generalization error and fluctuations for the given data noise. [sent-110, score-0.157]
62 All y-data was set equal to zero. [sent-118, score-0.151]
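The plot itself is lost; as a sketch of the kind of simulation summarized by Figure 1, the code below draws one fixed target from a GP prior with a periodic RBF kernel and then averages the mean-square prediction error of GP regression over many random training sets, giving the data-averaged error and its sample fluctuations. All parameter values are assumptions, not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def k_per(x1, x2, ell=0.1):
    """Periodic RBF kernel on [0, 1); parameters are illustrative assumptions."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1[:, None] - x2[None, :])) ** 2 / ell ** 2)

# One fixed target function, drawn once from the GP prior on a dense grid.
grid = np.linspace(0.0, 1.0, 300)
K_grid = k_per(grid, grid)
target = np.linalg.cholesky(K_grid + 1e-8 * np.eye(len(grid))) @ rng.standard_normal(len(grid))

noise_var, m, n_sets = 1e-4, 60, 200                    # illustrative values
errors = []
for _ in range(n_sets):                                 # average over random training sets
    idx = rng.integers(0, len(grid), size=m)            # uniform one-dimensional inputs
    X, y = grid[idx], target[idx] + np.sqrt(noise_var) * rng.standard_normal(m)
    K = k_per(X, X) + noise_var * np.eye(m)
    post_mean = k_per(grid, X) @ np.linalg.solve(K, y)  # posterior mean on the grid
    errors.append(np.mean((post_mean - target) ** 2))   # MSE against the noise-free target

errors = np.array(errors)
print("data-averaged error:", errors.mean(), " sample fluctuations:", errors.std())
```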
63 5.2 Corrections to the Variational Approximation It is a strength of our method that the quality of the variational approximation can be checked by computing correction terms. [sent-120, score-0.349]
64 Since the posterior variance is independent of the output data, this is still an interesting model from which the posterior variance can be estimated. [sent-123, score-0.696]
65 We consider the third term in the expansion of the free energy. [sent-124, score-0.408]
66 It is a correction to the variational free energy and evaluates to the expression given in Eq. (16). [sent-126, score-0.852]
68 We considered a homogeneous input density; the input dimension is one and the regression model uses a periodic RBF kernel. [sent-133, score-0.591]
69 The symbols in Fig. 1 show the difference between the true value of the free energy, which is obtained by simulations, and the first two terms of the expansion. [sent-135, score-0.354]
70 The correction term is found to be qualitatively accurate and emphasizes a discrepancy between the free energy and the first two terms of the expansion. [sent-137, score-0.505]
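As a sketch of how the "true" averaged free energy in such a comparison can be estimated by simulation (an illustration under stated assumptions, not the paper's procedure): for a Gaussian likelihood with all y-data set to zero, -log Z of a single data set reduces to a log-determinant, so the data average can be taken by Monte Carlo over input draws. Kernel, noise level and sample sizes are assumed values.

```python
import numpy as np

rng = np.random.default_rng(3)

def k_per(x1, x2, ell=0.1):
    """Periodic RBF kernel on [0, 1); parameters are illustrative assumptions."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1[:, None] - x2[None, :])) ** 2 / ell ** 2)

noise_var, m, n_sets = 1e-2, 50, 500

# With all y-data equal to zero and a Gaussian likelihood, the data-dependent part of
# -log Z for one data set is 0.5 * log det(K + noise_var * I) (constants in m omitted).
vals = []
for _ in range(n_sets):
    X = rng.uniform(0.0, 1.0, m)              # homogeneous one-dimensional input density
    K = k_per(X, X) + noise_var * np.eye(m)
    _, logdet = np.linalg.slogdet(K)
    vals.append(0.5 * logdet)

vals = np.array(vals)
print("simulated averaged free energy (up to constants):",
      vals.mean(), "+/-", vals.std() / np.sqrt(n_sets))
```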
71 5.3 Universal Relations We can relate the training error (17) and the empirical posterior variance (18) [sent-141, score-0.447]
Figure 2 (axes and legend): theory and simulations for periodic kernels in d=1, 2, 3 and a non-periodic d=2+2 case; the plotted quantities are [βε(x,y)/(βσ²(x)+1)²] averaged over (x,y) and [βσ²(x)/(βσ²(x)+1)] averaged over x.
74 Symbols show simulation results for Radial Basis Function (RBF) regression and a homogeneous input distribution in dimensions (square, circle, diamond). [sent-161, score-0.318]
75 Additionally, the left figure shows an example were the inputs lie on a quasi two-dimensional manifold which is embedded in dimensions (cross). [sent-163, score-0.096]
76 to the free energy. [sent-165, score-0.43]
77 Eq. (18) then yields Eq. (19), which relates the empirical posterior variance to the local posterior variance at test inputs. [sent-169, score-0.857]
78 Similarly, we can derive an expression for the training error, Eq. (20), by using the preceding equations. [sent-170, score-0.06]
79 It is interesting to note that the relations (19) and (20) contain no assumptions about the data generating process. [sent-172, score-0.286]
80 They hold in general for Gaussian process models with a Gaussian likelihood. [sent-173, score-0.078]
81 This is illustrated in Fig. 2 for the example of Gaussian process regression with a Radial Basis Function kernel. [sent-176, score-0.194]
82 In Fig. 2, learning starts in the upper right corner: the rescaled empirical posterior variance is initially one and decreases with an increasing number of example data. [sent-178, score-0.445]
83 The rescaled training error on the noisy data set is initially zero and increases to one with increasing number of example data. [sent-181, score-0.174]
84 The theory (line) holds for a sufficiently large number of example data and its accuracy increases with the input dimension. [sent-182, score-0.165]
85 For common benchmark sets such as Abalone and Boston Housing data we find that Eqs. [sent-185, score-0.056]
86 (19,20) hold well even for small and medium sizes of the training data set. [sent-186, score-0.112]
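Relations (19) and (20) do not survive the extraction, but the quantities they connect are named in the text and on the axes of Fig. 2: the training error, the empirical posterior variance at the training inputs, and the rescaled local posterior variance βσ²(x)/(βσ²(x)+1) averaged over inputs, with β read here as the inverse noise variance of the Gaussian likelihood (an assumption based on the axis labels). The sketch below computes these ingredients for a GP regression model so that the relations can be checked against the original paper; the kernel, noise and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(x1, x2, ell=0.2):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

noise_var = 0.01
beta = 1.0 / noise_var            # assumed meaning of beta: inverse noise variance

m = 80
X = rng.uniform(-1.0, 1.0, m)
y = np.sin(4.0 * X) + np.sqrt(noise_var) * rng.standard_normal(m)
X_test = rng.uniform(-1.0, 1.0, 2000)

K = rbf(X, X) + noise_var * np.eye(m)

def posterior(xq):
    """Posterior mean and local posterior variance sigma^2(x) at query points."""
    ks = rbf(X, xq)
    mean = ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mean, var

mean_tr, var_tr = posterior(X)
mean_te, var_te = posterior(X_test)

train_error = np.mean((y - mean_tr) ** 2)                       # empirical training error
emp_post_var = np.mean(var_tr)                                  # empirical posterior variance at training inputs
resc_post_var = np.mean(beta * var_te / (beta * var_te + 1.0))  # rescaled local posterior variance, input average
print(train_error, emp_post_var, resc_post_var)
```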
87 6 Outlook One may question whether our approximate universal relations are of any practical use since, for example, the relation between training error and generalization error also involves the unknown posterior variance. [sent-187, score-0.857]
88 Nevertheless, this relation could be useful for cases where a large number of data inputs without output labels is available. [sent-188, score-0.172]
89 Since, for regression, the posterior variance is independent of the output labels, we could use these extra input points to estimate it. [sent-189, score-0.385]
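A short check of the property used here (standard for GP regression with Gaussian noise, not specific to this paper): the posterior variance depends only on the inputs and the kernel, so unlabeled inputs are enough to estimate it. The kernel and input values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(x1, x2, ell=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

X = rng.uniform(-1.0, 1.0, 40)        # training inputs; no output labels are needed below
xq = np.linspace(-1.0, 1.0, 5)        # query points
noise_var = 0.05

# Posterior variance: k(x,x) - k(x,X) (K + noise I)^{-1} k(X,x).  No y appears anywhere.
K = rbf(X, X) + noise_var * np.eye(len(X))
ks = rbf(X, xq)
post_var = 1.0 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
print(post_var)
```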
90 For example, replacing the kernel of the Gaussian process prior gives a model for hard-margin Support Vector Machine classification with the corresponding SVM kernel. [sent-192, score-0.318]
91 Of particular interest is the computation of empirical estimators that can be used in practice for model selection, as well as the calculation of fluctuations (error bars) for such estimators. [sent-194, score-0.067]
92 A prominent example is an efficient approximate leave-one-out estimator for SVMs. [sent-195, score-0.062]
93 Feynman and Hibbs, Quantum Mechanics and Path Integrals, McGraw-Hill Inc. [sent-233, score-0.078]
wordName wordTfidf (topN-words)
[('variational', 0.349), ('periodic', 0.251), ('replica', 0.245), ('posterior', 0.21), ('energy', 0.183), ('free', 0.171), ('averages', 0.137), ('sp', 0.129), ('relations', 0.126), ('fv', 0.123), ('uctuations', 0.123), ('ea', 0.117), ('regression', 0.116), ('grand', 0.115), ('hamiltonian', 0.115), ('malzahn', 0.115), ('sollich', 0.115), ('canonical', 0.112), ('variance', 0.11), ('partition', 0.108), ('correction', 0.097), ('homogeneous', 0.094), ('gaussian', 0.093), ('tf', 0.092), ('symbols', 0.09), ('rbf', 0.089), ('wu', 0.089), ('kernel', 0.087), ('opper', 0.08), ('radial', 0.078), ('panel', 0.078), ('process', 0.078), ('universal', 0.078), ('eb', 0.078), ('mechanics', 0.078), ('ca', 0.076), ('curves', 0.072), ('empirical', 0.067), ('prior', 0.066), ('input', 0.065), ('expectations', 0.063), ('approximate', 0.062), ('replicated', 0.062), ('error', 0.06), ('respect', 0.06), ('inputs', 0.059), ('dv', 0.058), ('corrections', 0.058), ('rescaled', 0.058), ('relation', 0.057), ('generating', 0.057), ('medium', 0.056), ('wx', 0.056), ('data', 0.056), ('expansion', 0.054), ('physics', 0.054), ('qp', 0.054), ('generalization', 0.052), ('evaluates', 0.052), ('density', 0.051), ('rst', 0.051), ('lines', 0.051), ('brackets', 0.05), ('predictor', 0.048), ('circle', 0.048), ('assumptions', 0.047), ('averaged', 0.046), ('square', 0.046), ('local', 0.046), ('errors', 0.045), ('yields', 0.045), ('theory', 0.044), ('explicit', 0.043), ('ib', 0.043), ('simulation', 0.043), ('illustration', 0.042), ('practical', 0.042), ('noise', 0.041), ('ciently', 0.041), ('cross', 0.039), ('vz', 0.038), ('yoshizawa', 0.038), ('abalone', 0.038), ('bischof', 0.038), ('chemical', 0.038), ('dominating', 0.038), ('dorffner', 0.038), ('drawings', 0.038), ('engel', 0.038), ('parisi', 0.038), ('replicas', 0.038), ('singapore', 0.038), ('thermodynamic', 0.038), ('wtu', 0.038), ('yxwu', 0.038), ('expectation', 0.038), ('suf', 0.038), ('ts', 0.037), ('left', 0.037), ('nite', 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
2 0.18647076 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
Author: Manfred Opper, Robert Urbanczik
Abstract: Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1
3 0.18560237 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
Author: Lehel Csató, Manfred Opper, Ole Winther
Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.
4 0.16784793 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
5 0.14092274 79 nips-2001-Gaussian Process Regression with Mismatched Models
Author: Peter Sollich
Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1
6 0.14062266 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
7 0.12131529 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
8 0.11060727 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
9 0.10567056 8 nips-2001-A General Greedy Approximation Algorithm with Applications
10 0.10117277 139 nips-2001-Online Learning with Kernels
11 0.095603764 61 nips-2001-Distribution of Mutual Information
12 0.09544763 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
13 0.09479022 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
14 0.092435107 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
15 0.091740787 84 nips-2001-Global Coordination of Local Linear Models
16 0.090704806 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
17 0.089862652 164 nips-2001-Sampling Techniques for Kernel Methods
18 0.089174375 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
19 0.088397883 119 nips-2001-Means, Correlations and Bounds
20 0.081579499 1 nips-2001-(Not) Bounding the True Error
topicId topicWeight
[(0, -0.288), (1, 0.066), (2, -0.04), (3, -0.069), (4, -0.007), (5, -0.084), (6, 0.122), (7, -0.015), (8, 0.051), (9, 0.037), (10, 0.002), (11, 0.073), (12, 0.024), (13, -0.24), (14, -0.054), (15, 0.177), (16, -0.033), (17, 0.074), (18, -0.082), (19, -0.086), (20, -0.056), (21, -0.126), (22, 0.013), (23, 0.025), (24, 0.093), (25, 0.045), (26, -0.012), (27, -0.053), (28, -0.068), (29, 0.136), (30, -0.079), (31, -0.008), (32, -0.121), (33, 0.131), (34, 0.056), (35, -0.128), (36, 0.042), (37, 0.043), (38, 0.059), (39, 0.016), (40, -0.108), (41, 0.083), (42, 0.003), (43, 0.052), (44, -0.061), (45, -0.165), (46, -0.067), (47, -0.023), (48, 0.019), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.95919538 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
2 0.74307495 79 nips-2001-Gaussian Process Regression with Mismatched Models
Author: Peter Sollich
Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1
3 0.71949875 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
Author: Lehel Csató, Manfred Opper, Ole Winther
Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbritary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.
4 0.61718237 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
Author: Manfred Opper, Robert Urbanczik
Abstract: Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1
5 0.61124218 76 nips-2001-Fast Parameter Estimation Using Green's Functions
Author: K. Wong, F. Li
Abstract: We propose a method for the fast estimation of hyperparameters in large networks, based on the linear response relation in the cavity method, and an empirical measurement of the Green's function. Simulation results show that it is efficient and precise, when compared with cross-validation and other techniques which require matrix inversion. 1
6 0.60410541 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
7 0.58334684 61 nips-2001-Distribution of Mutual Information
8 0.49517289 58 nips-2001-Covariance Kernels from Bayesian Generative Models
9 0.47660378 75 nips-2001-Fast, Large-Scale Transformation-Invariant Clustering
10 0.44243374 155 nips-2001-Quantizing Density Estimators
11 0.43856251 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
12 0.42661399 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
13 0.40802166 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
14 0.39899808 139 nips-2001-Online Learning with Kernels
15 0.39414775 35 nips-2001-Analysis of Sparse Bayesian Learning
16 0.39389777 84 nips-2001-Global Coordination of Local Linear Models
17 0.38860565 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
18 0.38509643 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
19 0.38473377 180 nips-2001-The Concave-Convex Procedure (CCCP)
20 0.38287956 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA
topicId topicWeight
[(14, 0.036), (17, 0.014), (19, 0.026), (27, 0.087), (30, 0.039), (36, 0.013), (59, 0.028), (72, 0.585), (79, 0.029), (91, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.97723979 21 nips-2001-A Variational Approach to Learning Curves
Author: Dörthe Malzahn, Manfred Opper
Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.
2 0.9356662 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
Author: Wheeler Ruml
Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.
3 0.92226845 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
Author: Jeff Bilmes, Gang Ji, Marina Meila
Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.
4 0.90714312 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space). wx u r i q ©
5 0.79138178 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.72992122 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
7 0.67583048 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.6650933 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
9 0.65304244 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
10 0.64676815 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
11 0.6438725 1 nips-2001-(Not) Bounding the True Error
12 0.62707859 79 nips-2001-Gaussian Process Regression with Mismatched Models
13 0.61399823 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
14 0.60790521 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
15 0.60718513 155 nips-2001-Quantizing Density Estimators
16 0.60315025 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
17 0.601973 61 nips-2001-Distribution of Mutual Information
18 0.59354579 25 nips-2001-Active Learning in the Drug Discovery Process
19 0.59350997 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
20 0.58610547 128 nips-2001-Multiagent Planning with Factored MDPs