nips nips2002 nips2002-17 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dörthe Malzahn, Manfred Opper
Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. [sent-6, score-1.44]
2 We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling. [sent-7, score-0.215]
3 1 Introduction The application of tools from Statistical Mechanics to analyzing the average case performance of learning algorithms has a long tradition in the Neural Computing and Machine Learning community [1, 2]. [sent-8, score-0.103]
4 Unfortunately, the specific power of this approach, namely its ability to give explicit, distribution-dependent results, also represents a major drawback for practical applications. [sent-10, score-0.087]
5 In general, data distributions are unknown and their replacement by simple model distributions might only reveal some qualitative behavior of the true learning performance. [sent-11, score-0.062]
6 In this paper we suggest a novel application of Statistical Mechanics techniques to a topic within Machine Learning for which the distribution over data is well known and controlled by the experimenter. [sent-12, score-0.03]
7 It is given by the resampling of an existing dataset in the so-called bootstrap approach [3]. [sent-13, score-0.802]
8 Creating bootstrap samples of the original dataset by random resampling with replacement and retraining the statistical model on the bootstrap sample is a widely applicable statistical technique. [sent-14, score-1.746]
9 By replacing averages over the true unknown distribution of data with suitable averages over the bootstrap samples, one can estimate various properties such as the bias, the variance and the generalization error of a statistical model. [sent-15, score-1.267]
10 While in general bootstrap averages can be approximated by Monte-Carlo sampling, it is useful to also have analytical approximations which avoid the time-consuming retraining of the model for each sample. [sent-16, score-1.096]
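As a point of reference for the Monte-Carlo route that the analytical approximation is meant to replace, here is a minimal sketch of bootstrap resampling with re-estimation; the `statistic` callable and all parameter values are illustrative placeholders, not anything taken from the paper.

```python
import numpy as np

def bootstrap_bias_variance(data, statistic, n_boot=200, rng=None):
    """Monte-Carlo bootstrap estimate of the bias and variance of a statistic.

    `statistic` maps a data array to a number; each bootstrap sample is drawn
    from the original data by resampling with replacement, and the statistic
    is re-evaluated (i.e., the "model" is retrained) on every sample.
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)                       # estimate on the original data
    reps = np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])
    return reps.mean() - theta_hat, reps.var(ddof=1)  # bootstrap bias, variance
```

For a trained model rather than a scalar statistic, the same loop applies with `statistic` replaced by a full fit-and-evaluate step, which is exactly the retraining cost that the analytical approach tries to avoid.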
11 Existing analytical approximations (based on asymptotic techniques) such as the delta method and the saddle point method (see e.g. [sent-17, score-0.152]
12 [5]) usually require explicit analytical formulas for the estimators of the parameters of a trained model. [sent-19, score-0.248]
13 In this paper, we discuss an application of the replica method of Statistical Physics [4] which, combined with a variational method [6], can produce approximate averages over the random drawings of bootstrap samples. [sent-21, score-1.285]
14 Explicit formulas for parameter estimates are avoided and replaced by the implicit condition that such estimates are expectations with respect to a certain Gibbs distribution to which the methods of Statistical Physics can be well applied. [sent-22, score-0.085]
15 We demonstrate the method for the case of regression with Gaussian processes (GP) (which is a kernel method that has gained high popularity in the Machine Learning community in recent years [7]) and compare our analytical results with results obtained by Monte-Carlo sampling. [sent-23, score-0.234]
16 2 Basic setup and Gibbs distribution We will keep the notation in this section fairly general, indicating that most of the theory can be developed for a broader class of models. [sent-24, score-0.069]
17 We will later specialize to supervised learning problems where each data point consists of an input (usually a finite-dimensional vector) and a real label. [sent-29, score-0.068]
18 We will later apply our approach to the mean square error, Eq. (2). [sent-31, score-0.13]
19 The first basic ingredient of our approach is the assumption that the estimator for the unknown "true" function can be represented as its mean with respect to a posterior distribution over all possible functions. [sent-35, score-0.164]
20 This avoids the problem of writing down explicit, complicated formulas for estimators. [sent-36, score-0.088]
21 To be precise, we assume that the statistical estimator (which is based on the training set) can be represented as the expectation of the function with respect to the measure (3), [sent-37, score-0.154]
22 which is constructed from a suitable prior distribution and the likelihood (1). [sent-47, score-0.053]
23 Equation (4) defines the normalizing partition function. [sent-48, score-0.031]
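The measure (3) and the partition function (4) themselves did not survive extraction. A schematic form consistent with the surrounding text (a prior $\mu[f]$, a per-example loss $h$ from the likelihood (1), and a temperature-like parameter $\beta$; all symbols here are our own reconstruction and need not match the paper's notation) is

```latex
P[f \mid D_0] \;=\; \frac{1}{Z(D_0)}\,\mu[f]\,
  \exp\!\Big(-\beta \sum_{i=1}^{N} h\big(f(x_i),\,y_i\big)\Big),
\qquad
Z(D_0) \;=\; \int d\mu[f]\;
  \exp\!\Big(-\beta \sum_{i=1}^{N} h\big(f(x_i),\,y_i\big)\Big).
```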
24 Our choice of (3) does not mean that we restrict ourselves to Bayesian estimators. [sent-49, score-0.028]
25 By introducing specific ("temperature"-like) parameters in the prior and the likelihood, the measure (3) can be strongly concentrated at its mean, such that maximum likelihood/MAP estimators can be included in our framework. [sent-50, score-0.106]
26 3 Bootstrap averages We will explain our analytical approximation to resampling averages for the case of supervised learning problems. [sent-51, score-0.546]
27 Since the bootstrap sample is drawn with replacement, some of the data points will appear several times in the sample and others not at all. [sent-53, score-0.759]
28 A proxy for the true average test error can be obtained by retraining the model on each bootstrap training set, calculating the test error only on those points which are not contained in it, and finally averaging over many such sets. [sent-54, score-0.994]
29 We will not discuss the statistical properties of such bootstrap estimates and their refinements (such as Efron's .632 estimate) here. [sent-56, score-0.776]
30 For any given data set, we represent a bootstrap sample by a vector of "occupation" numbers which record how often each example appears in the sample. [sent-58, score-0.759]
31 Denoting the expectation over random bootstrap samples by angular brackets, Efron's estimator for the bootstrap generalization error is given by Eq. (5), [sent-60, score-1.615]
32 where we specialized to the square error for testing. [sent-63, score-0.081]
33 A Kronecker symbol, equal to one when a data point has occupation number zero and zero otherwise, guarantees that only those realizations of bootstrap training sets contribute which do not contain the test point. [sent-66, score-0.85]
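A Monte-Carlo reading of Eq. (5): draw occupation numbers, retrain, and average the squared test error only over points whose occupation number is zero. The `fit`/`predict` interface below is a placeholder assumption, not the paper's code.

```python
import numpy as np

def efron_bootstrap_error(X, y, fit, predict, m=None, n_boot=100, rng=None):
    """Monte-Carlo estimate of Efron's bootstrap generalization error (cf. Eq. (5)).

    Each bootstrap sample of size m is summarized by occupation numbers s_i;
    the squared error is evaluated only where s_i == 0, which is what the
    Kronecker-delta factor in Eq. (5) enforces.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    m = n if m is None else m
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=m)
        s = np.bincount(idx, minlength=n)               # occupation numbers s_i
        model = fit(np.repeat(X, s, axis=0), np.repeat(y, s))
        held_out = s == 0                                # delta(s_i, 0) selector
        if held_out.any():
            errs.append(np.mean((predict(model, X[held_out]) - y[held_out]) ** 2))
    return float(np.mean(errs))
```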
34 Introducing the abbreviation defined in Eq. (6) [sent-67, score-0.022]
35 (which is linear in the estimated function), and using the definition of the estimator as an average over the Gibbs distribution (3), the bootstrap estimate (5) can be rewritten as Eq. (7). [sent-69, score-0.873]
36 More complicated types of test errors, which are polynomials in the estimated function or can be approximated by such polynomials, can be rewritten in a similar way, involving more replicas of the variable. [sent-81, score-0.28]
37 4 Analytical averages using the "replica trick" For a fixed sample size, the distribution of the occupation numbers is multinomial. [sent-87, score-0.182]
38 It is simpler (and does not make a big difference when the data set is sufficiently large) to work with a Poisson distribution for the size of the bootstrap set, with the mean set to the desired number of data points in the sample. [sent-88, score-0.058]
39 In this case we get a simpler, factorizing joint distribution. [sent-89, score-0.03]
40 The average is over the unknown distribution of training data sets. [sent-90, score-0.122]
41 Equation (8) gives this factorizing distribution for the occupation numbers. [sent-94, score-0.039]
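To see the Poisson simplification numerically: m draws with replacement from N points give multinomially coupled occupation numbers, which for large N are well approximated by independent Poisson variables with mean m/N. A small illustrative check (the numbers are arbitrary, chosen only for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, trials = 1000, 1000, 5000

# Exact occupation numbers: m draws with replacement from N equally likely points.
multinomial_s = rng.multinomial(m, np.full(N, 1.0 / N), size=trials)
# Poisson approximation with the same mean occupation m / N.
poisson_s = rng.poisson(m / N, size=(trials, N))

for k in range(4):
    # Empirical probability that a given point occurs exactly k times.
    p_multi = (multinomial_s == k).mean()
    p_poiss = (poisson_s == k).mean()
    print(f"s_i = {k}: multinomial {p_multi:.4f}  poisson {p_poiss:.4f}")
```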
42 To enable the analytical average over the vector of occupation numbers (which is the "quenched disorder" in the language of Statistical Physics), it is necessary to introduce the auxiliary quantity of Eq. (9). [sent-95, score-0.154]
43 The advantage of this definition is that, for integer exponents, this quantity can be represented in terms of replicas of the original variable, for which an explicit average over the occupation numbers is possible. [sent-104, score-0.224]
44 At the end of all calculations, an analytical continuation to arbitrary real values and the appropriate limit must be performed. [sent-105, score-0.16]
45 Using the definition of the partition function (4), we get, for integer values, the representation of Eq. (10). [sent-106, score-0.052]
46 Exchanging the expectation over datasets with the expectation over the replicated variables, and using the explicit form of the distribution (8), we obtain Eq. (11), [sent-113, score-0.143]
47 where the brackets denote an average with respect to a Gibbs measure for the replicas, [sent-119, score-0.189]
48 which is given by Eqs. (12) and (13), and where a partition function has been introduced for convenience to normalize the measure. [sent-121, score-0.07]
49 In most nontrivial cases, averages with respect to the measure (12) cannot be calculated exactly. [sent-122, score-0.229]
50 Our idea is to use techniques which have been frequently applied to probabilistic models [10] such as the variational approximation, the mean field approximation and the TAP approach. [sent-124, score-0.241]
51 In this paper, we restrict ourselves to a variational Gaussian approximation. [sent-125, score-0.176]
52 5 Variational approximation A method frequently used in Statistical Physics, which has also attracted considerable interest in the Machine Learning community, is the variational approximation [8]. [sent-127, score-0.25]
53 Its goal is to replace an intractable distribution like (12) by a different, sufficiently close distribution from a tractable class, which we will write in the form of Eq. (14). [sent-128, score-0.06]
54 This tractable distribution will be used in (11) instead of the intractable one to approximate the average. [sent-129, score-0.03]
55 The standard approach (see e.g. [10]) is to minimize the relative entropy between the two distributions, resulting in a minimization of the variational free energy of Eq. (15), [sent-132, score-0.22]
56 which is an upper bound to the true free energy for any integer number of replicas. [sent-133, score-0.065]
57 The brackets denote averages with respect to the variational distribution (14). [sent-134, score-0.397]
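For orientation, a generic form of the bound in use here is the standard Gibbs-Bogoliubov-Feynman inequality; the symbols $H$, $H_0$, $Q$, $P$, $Z$, $Z_0$ below are our own generic notation for the target and trial Hamiltonians, distributions and partition functions, not the paper's:

```latex
-\ln Z \;\le\; F_{\mathrm{var}}
       \;=\; \big\langle H - H_0 \big\rangle_{Q} \;-\; \ln Z_0 ,
\qquad
Q[f] = \frac{e^{-H_0[f]}}{Z_0}, \quad P[f] = \frac{e^{-H[f]}}{Z},
```

and minimizing $F_{\mathrm{var}}$ over the parameters of $H_0$ is equivalent to minimizing the relative entropy $\mathrm{KL}(Q\,\|\,P)$.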
58 For our application to Gaussian process models, we will now specialize to Gaussian priors. [sent-135, score-0.043]
59 We choose a trial Hamiltonian that is quadratic in the function values, leading to a Gaussian distribution (14). [sent-144, score-0.053]
60 The functions entering this trial Hamiltonian are the variational parameters to be optimized. [sent-145, score-0.176]
61 To continue the variational solutions to arbitrary real numbers of replicas, we assume that the optimal parameters are replica symmetric, i. [sent-146, score-0.348]
62 e., independent of the replica index. The variational free energy can then be expressed by local moments ("order parameters" in the language of Statistical Physics), arranged in matrices over the replica indices, which have the same replica symmetric structure. [sent-149, score-0.418]
63 Since each of these matrices is assumed to have only two types of entries, it is possible to obtain variational equations which contain the number of replicas as a simple parameter, so that the analytical continuation can be explicitly performed (see appendix). [sent-150, score-0.332]
64 6 Explicit results for regression with Gaussian processes We consider a GP model for regression, with a squared-error training energy. [sent-158, score-0.159]
65 In this case, the prior measure can be simply represented by a finite-dimensional Gaussian distribution for the vector of function values at the data points, having zero mean and a covariance matrix whose entries are given by the covariance kernel of the GP. [sent-160, score-0.102]
66 Using the limiting values of the order parameters, and approximating the exact replica average by its variational counterpart in Eq. [sent-164, score-0.026]
67 (11), the explicit result for the bootstrap mean square generalization error is found; it is given by Eq. (17). [sent-165, score-0.944]
68 The entire analysis can be repeated for testing (keeping the training energy fixed) with a general loss function. [sent-168, score-0.113]
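Equation (17) itself is not legible in this extraction, so the sketch below only shows the basic GP regression quantities that the result is built from: the posterior mean and variance under an RBF kernel with Gaussian noise. The kernel form, hyperparameter values and function names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential (RBF) covariance between two sets of input vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1, length_scale=1.0):
    """Posterior mean and variance of GP regression (squared-error likelihood)."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise**2 * np.eye(len(X_train))
    k_star = rbf_kernel(X_train, X_test, length_scale)
    k_ss = rbf_kernel(X_test, X_test, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ alpha                       # posterior mean at the test inputs
    var = np.diag(k_ss) - (v ** 2).sum(axis=0)    # posterior variance at the test inputs
    return mean, var
```

A Monte-Carlo bootstrap of this model, as used for the simulation curves reported below, simply calls `gp_posterior` on resampled training sets inside a loop like the one sketched after Eq. (5).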
69 Figure 1: Average bootstrapped generalization error on Abalone data as a function of the size m of the bootstrap sample, using square error loss (left) and epsilon-insensitive loss (right). [sent-179, score-0.302]
70 Simulation (circles) and theory (lines) based on the same data set with N = 1000 data points. [sent-180, score-0.039]
71 The GP model uses an RBF kernel on whitened inputs. [sent-181, score-0.022]
72 The bootstrap average from our theory is obtained from Eq. (17) and its analogue for the epsilon-insensitive loss. [sent-188, score-0.815]
73 Figure 1 shows the generalization error measured by the square error loss (Eq. [sent-190, score-0.21]
74 (17), left panel) as well as the one measured by the epsilon-insensitive loss (right panel). [sent-191, score-0.042]
75 Our theory (line) is compared with simulations (circles) which were based on Monte-Carlo sampling averages that were computed using the same data set. [sent-192, score-0.239]
76 The Monte-Carlo training sets of size m are obtained by sampling from the original data set with replacement. [sent-193, score-0.052]
77 We find a good agreement between theory and simulations in the region where the bootstrap sample does not oversample the data set. [sent-194, score-0.086]
78 When we oversample the data set, however, the agreement is not so good and corrections to our variational Gaussian approximation would be required. [sent-195, score-0.237]
79 Figure 2 shows the bootstrap average of the posterior variance over the whole data set, and compares our theory (line) with simulations (circles) which were based on Monte-Carlo sampling averages. [sent-196, score-0.916]
80 The overall approximation looks better than for the bootstrap generalization error. [sent-197, score-0.815]
81 Finally, it is important to note that all displayed theoretical learning curves have been obtained computationally much faster than their respective simulated learning curves. [sent-198, score-0.028]
82 7 Outlook The replica approach to bootstrap averages can be extended in a variety of different directions. [sent-199, score-1.08]
83 Besides the average generalization error, one can compute its bootstrap sample fluctuations by introducing more complicated replica expressions. [sent-200, score-1.083]
84 It is also straightforward to apply the approach to more complex problems in supervised learning which are related to Gaussian processes, such as GP classifiers or Support-vector Machines. [sent-201, score-0.046]
85 Figure 2: Bootstrap averaged posterior variance for Abalone data as a function of the size m of the bootstrap sample (N = 1000). [sent-202, score-0.053]
86 Simulation (circles) and theory (line) based on the same data set with N = 1000 data points. [sent-203, score-0.039]
87 Since our method requires the solution of a set of variational equations of the size of the original training set, we can expect that its computational complexity should be similar to the one needed for making the actual predictions with the basic model. [sent-204, score-0.235]
88 This will also apply to the problem of very large datasets, where one may use a variety of well-known sparse approximations. [sent-205, score-0.085]
89 It will also be important to assess the quality of the approximation introduced by the variational method and to compare it to alternative approximation techniques in the computation of the replica average (11), such as the mean field method and its more complex generalizations. [sent-208, score-0.234]
90 Appendix: Variational equations For reference, we give the explicit form of the equations for the variational and order parameters in the relevant limit. [sent-213, score-0.324]
91 We obtain the order parameter equations (20)-(22), into which the kernel matrix of the GP enters. [sent-215, score-0.022]
92 Equations (20)-(22) must be solved together with the variational equations, which are given by Eqs. (23)-(25). [sent-218, score-0.208]
93 Malzahn, D. and Opper, M., A variational approach to learning curves, NIPS 14. [sent-243, score-0.176]
94 Feynman, R. P. and Hibbs, A. R., Quantum Mechanics and Path Integrals (McGraw-Hill Inc.). [sent-254, score-0.11]
wordName wordTfidf (topN-words)
[('bootstrap', 0.733), ('dc', 0.347), ('udc', 0.198), ('variational', 0.176), ('replica', 0.172), ('averages', 0.152), ('replicas', 0.124), ('analytical', 0.111), ('mechanics', 0.11), ('physics', 0.085), ('pq', 0.082), ('opper', 0.069), ('resampling', 0.069), ('efron', 0.065), ('ri', 0.059), ('retraining', 0.059), ('explicit', 0.057), ('abalone', 0.055), ('formulas', 0.055), ('bd', 0.052), ('bootstrapped', 0.05), ('malzahn', 0.05), ('gp', 0.049), ('gibbs', 0.045), ('generalization', 0.045), ('circles', 0.044), ('energy', 0.044), ('dm', 0.043), ('specialize', 0.043), ('statistical', 0.043), ('average', 0.043), ('error', 0.042), ('loss', 0.042), ('simulation', 0.041), ('approximations', 0.041), ('replacement', 0.04), ('brackets', 0.039), ('occupation', 0.039), ('theory', 0.039), ('square', 0.039), ('processes', 0.038), ('community', 0.038), ('spin', 0.037), ('approximation', 0.037), ('denmark', 0.035), ('estimator', 0.034), ('gaussian', 0.034), ('polynomials', 0.033), ('nontrivial', 0.033), ('rewritten', 0.033), ('complicated', 0.033), ('equations', 0.032), ('introducing', 0.031), ('partition', 0.031), ('distribution', 0.03), ('approximate', 0.03), ('expectation', 0.028), ('curves', 0.028), ('mean', 0.028), ('posterior', 0.028), ('limit', 0.027), ('training', 0.027), ('sample', 0.026), ('limiting', 0.026), ('symmetric', 0.026), ('sampling', 0.025), ('regression', 0.025), ('supervised', 0.025), ('variance', 0.025), ('appendix', 0.025), ('estimators', 0.025), ('lecture', 0.025), ('test', 0.024), ('agreement', 0.024), ('panel', 0.024), ('advanced', 0.024), ('simulations', 0.023), ('variety', 0.023), ('springer', 0.023), ('suitable', 0.023), ('notes', 0.023), ('kernel', 0.022), ('unknown', 0.022), ('measure', 0.022), ('drawings', 0.022), ('abbreviation', 0.022), ('aston', 0.022), ('birmingham', 0.022), ('continuation', 0.022), ('disorder', 0.022), ('disordered', 0.022), ('exchanging', 0.022), ('ingredient', 0.022), ('lars', 0.022), ('quantum', 0.022), ('quenched', 0.022), ('tradition', 0.022), ('integer', 0.021), ('apply', 0.021), ('ba', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
Author: Dörthe Malzahn, Manfred Opper
Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.
2 0.18844821 139 nips-2002-Margin-Based Algorithms for Information Filtering
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this work, we study an information filtering model where the relevance labels associated to a sequence of feature vectors are realizations of an unknown probabilistic linear function. Building on the analysis of a restricted version of our model, we derive a general filtering rule based on the margin of a ridge regression estimator. While our rule may observe the label of a vector only by classfying the vector as relevant, experiments on a real-world document filtering problem show that the performance of our rule is close to that of the on-line classifier which is allowed to observe all labels. These empirical results are complemented by a theoretical analysis where we consider a randomized variant of our rule and prove that its expected number of mistakes is never much larger than that of the optimal filtering rule which knows the hidden linear model.
3 0.10845212 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
Author: Tatsuto Murayama, Masato Okada
Abstract: We applied statistical mechanics to an inverse problem of linear mapping to investigate the physics of optimal lossy compressions. We used the replica symmetry breaking technique with a toy model to demonstrate Shannon’s result. The rate distortion function, which is widely known as the theoretical limit of the compression with a fidelity criterion, is derived. Numerical study shows that sparse constructions of the model provide suboptimal compressions. 1
4 0.10832631 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
Author: Peter Sykacek, Stephen J. Roberts
Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
5 0.087770924 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
6 0.064383045 110 nips-2002-Incremental Gaussian Processes
7 0.063152991 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
8 0.062639296 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
9 0.056053862 181 nips-2002-Self Supervised Boosting
10 0.052945632 41 nips-2002-Bayesian Monte Carlo
11 0.046530697 74 nips-2002-Dynamic Structure Super-Resolution
12 0.044383768 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
13 0.043958806 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
14 0.043574531 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
15 0.042655081 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting
16 0.04235357 174 nips-2002-Regularized Greedy Importance Sampling
17 0.042080626 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
18 0.040181134 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
19 0.040139776 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
20 0.038134623 106 nips-2002-Hyperkernels
topicId topicWeight
[(0, -0.136), (1, -0.033), (2, -0.008), (3, 0.019), (4, -0.01), (5, 0.021), (6, -0.089), (7, 0.067), (8, -0.009), (9, -0.025), (10, 0.02), (11, -0.091), (12, 0.098), (13, -0.015), (14, 0.022), (15, -0.058), (16, -0.039), (17, -0.092), (18, -0.08), (19, -0.0), (20, 0.061), (21, -0.011), (22, -0.096), (23, -0.047), (24, -0.027), (25, 0.032), (26, 0.156), (27, 0.028), (28, -0.281), (29, -0.051), (30, 0.055), (31, 0.084), (32, -0.011), (33, -0.053), (34, -0.182), (35, 0.17), (36, -0.082), (37, 0.034), (38, -0.174), (39, -0.229), (40, 0.057), (41, -0.092), (42, -0.046), (43, -0.055), (44, 0.185), (45, -0.183), (46, 0.03), (47, -0.035), (48, -0.154), (49, -0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.94114667 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
Author: Dörthe Malzahn, Manfred Opper
Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.
2 0.84990299 139 nips-2002-Margin-Based Algorithms for Information Filtering
Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile
Abstract: In this work, we study an information filtering model where the relevance labels associated to a sequence of feature vectors are realizations of an unknown probabilistic linear function. Building on the analysis of a restricted version of our model, we derive a general filtering rule based on the margin of a ridge regression estimator. While our rule may observe the label of a vector only by classfying the vector as relevant, experiments on a real-world document filtering problem show that the performance of our rule is close to that of the on-line classifier which is allowed to observe all labels. These empirical results are complemented by a theoretical analysis where we consider a randomized variant of our rule and prove that its expected number of mistakes is never much larger than that of the optimal filtering rule which knows the hidden linear model.
3 0.39365843 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
Author: Peter Sykacek, Stephen J. Roberts
Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
4 0.35020897 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
Author: Tatsuto Murayama, Masato Okada
Abstract: We applied statistical mechanics to an inverse problem of linear mapping to investigate the physics of optimal lossy compressions. We used the replica symmetry breaking technique with a toy model to demonstrate Shannon’s result. The rate distortion function, which is widely known as the theoretical limit of the compression with a fidelity criterion, is derived. Numerical study shows that sparse constructions of the model provide suboptimal compressions. 1
5 0.33181784 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
6 0.28831741 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
7 0.27306408 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
8 0.27110088 110 nips-2002-Incremental Gaussian Processes
9 0.26934975 32 nips-2002-Approximate Inference and Protein-Folding
10 0.25418493 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
11 0.23348218 150 nips-2002-Multiple Cause Vector Quantization
12 0.23318024 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
13 0.22893886 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
14 0.22867869 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
15 0.22845638 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
16 0.22402798 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
17 0.22190171 201 nips-2002-Transductive and Inductive Methods for Approximate Gaussian Process Regression
18 0.21924055 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
19 0.21730027 95 nips-2002-Gaussian Process Priors with Uncertain Inputs Application to Multiple-Step Ahead Time Series Forecasting
20 0.20989588 174 nips-2002-Regularized Greedy Importance Sampling
topicId topicWeight
[(11, 0.018), (23, 0.025), (42, 0.059), (51, 0.015), (54, 0.09), (55, 0.032), (65, 0.021), (68, 0.025), (74, 0.05), (92, 0.462), (98, 0.102)]
simIndex simValue paperId paperTitle
Author: Sumio Watanabe, Shun-ichi Amari
Abstract: A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information. In this paper, under the condition that the true parameter is almost but not contained in singularities, we show two results. (1) If the dimension of the parameter from inputs to hidden units is not larger than three, then there exits a region of true parameters where the generalization error is larger than those of regular models, however, if otherwise, then for any true parameter, the generalization error is smaller than those of regular models. (2) The symmetry of the generalization error and the training error does not hold in singular models in general. 1
2 0.86896521 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
same-paper 3 0.86736184 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
Author: Dörthe Malzahn, Manfred Opper
Abstract: We apply the replica method of Statistical Physics combined with a variational method to the approximate analytical computation of bootstrap averages for estimating the generalization error. We demonstrate our approach on regression with Gaussian processes and compare our results with averages obtained by Monte-Carlo sampling.
4 0.86050057 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee
Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.
5 0.62921679 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of 2 const int n_points as ’nr. of data points’ a compiler. Users specify their model 3 with 0 < n_points; 4 const int n_classes := 3 as ’nr. classes’ of interest in a high-level specification 5 with 0 < n_classes language (as opposed to a program6 with n_classes << n_points; ming language). The figure shows the 7 double phi(1..n_classes) as ’weights’ specification of the mixture of Gaus8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); sians example used throughout this 9 double sigma(1..n_classes); paper.2 Note the constraint that the 10 int c(1..n_points) as ’class labels’; sum of the class probabilities must 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); equal one (line 8) along with others 12 data double x(1..n_points) as ’data’; (lines 3 and 5) that make optimization 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); of the model well-defined. Also note 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr with respect to the parameters , , and . Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. 1 model mog as ’Mixture of Gaussians’; ¡ £ £ £ §¤¢ £ © ¨ ¦ ¥ © ¡ ¡ £ £ £ ¨ Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nead-Melder simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code’s comments. Also, n classes has been set to three (line 4), while n points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.3 Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian netNclasses works. A first-order logic representation of Bayesian µ σ networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances φ x c of i.i.d. vectors) are represented as functor arguments. discrete gauss Nclasses Since unknown index values can be represented by Npoints implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node: § ¥ ¨¦¡ ¡ ¢© Lemma 1. Let be sets of variables over a Bayesian network with . Then descendents and parents hold 4 in the corresponding dependency graph iff the following probability statement holds: £ ¤ ¡ parents B % % 9 C0A@ ! 9 @8 § ¥ ¢ 2 ' % % 310 parents ©¢ £ ¡ ! ' % #!
6 0.58444029 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
7 0.5705837 166 nips-2002-Rate Distortion Function in the Spin Glass State: A Toy Model
8 0.55444026 161 nips-2002-PAC-Bayes & Margins
9 0.53216457 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
10 0.52179337 45 nips-2002-Boosted Dyadic Kernel Discriminants
11 0.51784205 27 nips-2002-An Impossibility Theorem for Clustering
12 0.5053727 6 nips-2002-A Formulation for Minimax Probability Machine Regression
13 0.50514615 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
14 0.50301963 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
15 0.50092733 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
16 0.49916375 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
17 0.49432501 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
18 0.49347204 140 nips-2002-Margin Analysis of the LVQ Algorithm
19 0.49249879 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
20 0.48421603 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction