nips nips2001 nips2001-35 knowledge-graph by maker-knowledge-mining

35 nips-2001-Analysis of Sparse Bayesian Learning


Source: pdf

Author: Anita C. Faul, Michael E. Tipping

Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. [sent-5, score-0.217]

2 Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. [sent-6, score-0.454]

3 This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. [sent-7, score-0.757]

4 It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. [sent-8, score-0.127]

5 Given a set of data $\{x_n, t_n\}_{n=1}^N$, the 'target' samples $t_n = f(x_n) + \epsilon_n$ are conventionally considered to be realisations of a deterministic function f, potentially corrupted by some additive noise process. [sent-10, score-0.131]

6 This function f will be modelled by a linearly-weighted sum of M fixed basis functions $\{\phi_m(x)\}_{m=1}^M$: $f(x) = \sum_{m=1}^{M} w_m \phi_m(x)$, (1) and the objective is to infer values of the parameters/weights $\{w_m\}_{m=1}^M$ such that $\hat{f}$ is a 'good' approximation of f. [sent-11, score-0.132]

7 While accuracy in function approximation is universally valued, there has been significant recent interest [2, 9, 3, 5] in the notion of sparsity, a consequence of learning algorithms which set significant numbers of the parameters $w_m$ to zero. [sent-12, score-0.056]

8 A methodology which effectively combines both these measures of merit is that of 'sparse Bayesian learning', briefly reviewed in Section 2, and which was the basis for the recent introduction of the relevance vector machine (RVM) and related models [6, 1, 7]. [sent-13, score-0.322]

9 This model exhibits some very compelling properties, in particular a dramatic degree of sparseness even in the case of highly overcomplete basis sets ($M \gg N$). [sent-14, score-0.149]

10 The sparse Bayesian learning algorithm essentially involves the maximisation of a marginalised likelihood function with respect to hyperparameters in the model prior. [sent-15, score-0.906]

11 In the RVM, this was achieved through re-estimation equations, the behaviour of which was not fully understood. [sent-16, score-0.057]

12 In this paper we present further relevant theoretical analysis of the properties of the marginal likelihood which gives a much fuller picture of the nature of the model and its associated learning procedure. [sent-17, score-0.509]

13 This is detailed in Section 3, and we close with a summary of our findings and discussion of their implications in Section 4 (and which, to avoid repetition here, the reader may wish to preview at this point). [sent-18, score-0.041]

14 2 Sparse Bayesian Learning We now very briefly review the methodology of sparse Bayesian learning, more comprehensively described elsewhere [6]. [sent-19, score-0.232]

15 To simplify and generalise the exposition, we omit to notate any functional dependence on the inputs x and combine quantities defined over the training set and basis set within N- and M-vectors respectively. [sent-20, score-0.23]

16 Using this representation, we first write the generative model as: $t = f + \epsilon$, (2) where $t = (t_1, \dots, t_N)^T$, $f = (f_1, \dots, f_N)^T$ and $\epsilon = (\epsilon_1, \dots, \epsilon_N)^T$. [sent-21, score-0.043]

17 The approximator is then written as: $\hat{f} = \Phi w$, (3) where $\Phi = [\phi_1, \dots, \phi_M]$ is a general N x M design matrix with column vectors $\phi_m$ and $w = (w_1, \dots, w_M)^T$. [sent-25, score-0.062]

18 Recall that in the context of (1), $\Phi_{nm} = \phi_m(x_n)$ and $f = (f(x_1), \dots, f(x_N))^T$. [sent-29, score-0.039]
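
To make the generative set-up of (1)-(3) concrete, here is a minimal NumPy sketch assuming Gaussian RBF basis functions centred on the training inputs (a common choice for models of this kind); the synthetic data, basis width and all variable names below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets t_n = f(x_n) + eps_n, as in equation (2).
N = 50
x = np.linspace(-10.0, 10.0, N)
t = np.sinc(x / np.pi) + 0.1 * rng.standard_normal(N)

def design_matrix(x_eval, centres, width=2.0):
    """Phi with entries Phi[n, m] = phi_m(x_n); here phi_m is a Gaussian RBF."""
    return np.exp(-((x_eval[:, None] - centres[None, :]) ** 2) / (2.0 * width ** 2))

# One basis function per training input (M = N), giving an N x M design matrix.
Phi = design_matrix(x, x)

# Given weights w, the approximator of equation (3) is simply:
w = rng.standard_normal(Phi.shape[1])
f_hat = Phi @ w
```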

19 The sparse Bayesian framework assumes an independent zero-mean Gaussian noise model with variance $\sigma^2$, giving a multivariate Gaussian likelihood of the target vector t: $p(t|w, \sigma^2) = (2\pi)^{-N/2}\sigma^{-N}\exp\left\{-\frac{\|t - \Phi w\|^2}{2\sigma^2}\right\}$. (4) [sent-33, score-0.296]

20 The prior over the parameters is zero-mean Gaussian: $p(w|\alpha) = (2\pi)^{-M/2}\prod_{m=1}^{M}\alpha_m^{1/2}\exp\left(-\frac{\alpha_m w_m^2}{2}\right)$, (5) where the key to the model sparsity is the use of M independent hyperparameters $\alpha = (\alpha_1, \dots, \alpha_M)^T$, one per weight (or basis vector), which moderate the strength of the prior. [sent-34, score-0.488]

21 Given $\alpha$, the posterior parameter distribution is Gaussian and given via Bayes' rule as $p(w|t, \alpha) = \mathcal{N}(w|\mu, \Sigma)$ with $\Sigma = (A + \sigma^{-2}\Phi^T\Phi)^{-1}$ and $\mu = \sigma^{-2}\Sigma\Phi^T t$, (6) and A defined as $\mathrm{diag}(\alpha_1, \dots, \alpha_M)$. [sent-35, score-0.058]
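
A direct, naive transcription of the posterior statistics in (6) might look as follows; the function name and the use of an explicit matrix inverse are my own choices, and Phi, t, alpha and sigma2 are assumed to be as in the sketch above.

```python
import numpy as np

def posterior(Phi, t, alpha, sigma2):
    """Equation (6): Sigma = (A + sigma^{-2} Phi^T Phi)^{-1},
    mu = sigma^{-2} Sigma Phi^T t, with A = diag(alpha_1, ..., alpha_M)."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + (Phi.T @ Phi) / sigma2)
    mu = Sigma @ (Phi.T @ t) / sigma2
    return mu, Sigma
```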

22 The hyperparameters are found by maximising the marginal likelihood $p(t|\alpha, \sigma^2) = \int p(t|w, \sigma^2)\,p(w|\alpha)\,dw$, (7) or equivalently its logarithm $\mathcal{L}(\alpha) = -\frac{1}{2}\left[N\log(2\pi) + \log|C| + t^T C^{-1} t\right]$, (8) where $C = \sigma^2 I + \Phi A^{-1}\Phi^T$. Once most-probable values $\alpha_{MP}$ have been found, in practice they can be plugged into (6) to give a posterior mean (most probable) point estimate for the parameters, $\mu_{MP}$. [sent-40, score-0.1]

23 Empirically, the local maximisation of the marginal likelihood (8) with respect to $\alpha$ has been seen to work highly effectively [6, 1, 7]. [sent-42, score-0.834]
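
The log marginal likelihood (8) can be evaluated naively by forming C explicitly; again a sketch under the same assumed variables rather than an optimised implementation.

```python
import numpy as np

def log_marginal_likelihood(Phi, t, alpha, sigma2):
    """Equation (8): L(alpha) = -0.5 * [N log(2 pi) + log|C| + t^T C^{-1} t],
    where C = sigma^2 I + Phi A^{-1} Phi^T."""
    N = len(t)
    C = sigma2 * np.eye(N) + (Phi / alpha) @ Phi.T   # Phi A^{-1} Phi^T via broadcasting
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```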

24 Accurate predictors may be realised, which are typically highly sparse as a result of the maximising values of many hyperparameters being infinite. [sent-43, score-0.605]

25 From (6) this leads to a parameter posterior infinitely peaked at zero for many weights $w_m$, with the consequence that the corresponding elements of $\mu_{MP}$ are zero. [sent-44, score-0.096]

26 However, the learning procedure in [6] relied upon heuristic re-estimation equations for the hyperparameters, the behaviour of which was not well characterised. [sent-46, score-0.057]

27 Also, little was known regarding the properties of (8), the validity of the local maximisation thereof and, importantly and perhaps most interestingly, the conditions under which $\alpha$-values would become infinite. [sent-47, score-0.335]

28 We now give, through a judicious re-writing of (8), a more detailed analysis of the sparse Bayesian learning procedure. [sent-48, score-0.18]

29 Properties of the Marginal Likelihood $\mathcal{L}(\alpha)$ (a convenient re-writing): We re-write C from (8) in a convenient form to analyse the dependence on a single hyperparameter $\alpha_i$: [sent-50, score-0.325]

30 $C = \sigma^2 I + \sum_{m \ne i}\alpha_m^{-1}\phi_m\phi_m^T + \alpha_i^{-1}\phi_i\phi_i^T = C_{-i} + \alpha_i^{-1}\phi_i\phi_i^T$, (9) where we have defined $C_{-i} = \sigma^2 I + \sum_{m \ne i}\alpha_m^{-1}\phi_m\phi_m^T$ as the covariance matrix with the influence of basis vector $\phi_i$ removed, equivalent also to setting $\alpha_i = \infty$. [sent-54, score-0.407]

31 Using established matrix determinant and inverse identities, (9) allows us to write the terms of interest in $\mathcal{L}(\alpha)$ as $|C| = |C_{-i}|\left(1 + \alpha_i^{-1}\phi_i^T C_{-i}^{-1}\phi_i\right)$ (10) and $C^{-1} = C_{-i}^{-1} - \frac{C_{-i}^{-1}\phi_i\phi_i^T C_{-i}^{-1}}{\alpha_i + \phi_i^T C_{-i}^{-1}\phi_i}$, (11) which gives $\mathcal{L}(\alpha) = -\frac{1}{2}\left[N\log(2\pi) + \log|C_{-i}| + t^T C_{-i}^{-1} t - \log\alpha_i + \log(\alpha_i + S_i) - \frac{Q_i^2}{\alpha_i + S_i}\right]$, (12) where $S_i = \phi_i^T C_{-i}^{-1}\phi_i$ and $Q_i = \phi_i^T C_{-i}^{-1} t$. [sent-55, score-0.183]
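
The quantities $S_i$ and $Q_i$ introduced above can be computed naively by forming $C_{-i}$ explicitly, as in this sketch; the function name is illustrative, and a practical implementation would maintain these quantities far more efficiently rather than re-solving a full linear system per basis function.

```python
import numpy as np

def sparsity_and_quality(Phi, t, alpha, sigma2, i):
    """S_i = phi_i^T C_{-i}^{-1} phi_i  and  Q_i = phi_i^T C_{-i}^{-1} t,
    where C_{-i} is C with the contribution of basis vector phi_i removed (equation (9))."""
    N = len(t)
    keep = np.arange(Phi.shape[1]) != i
    C_minus_i = sigma2 * np.eye(N) + (Phi[:, keep] / alpha[keep]) @ Phi[:, keep].T
    phi_i = Phi[:, i]
    v = np.linalg.solve(C_minus_i, phi_i)   # v = C_{-i}^{-1} phi_i
    return phi_i @ v, t @ v                 # S_i, Q_i (C_{-i} is symmetric)
```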

32 In [6], based on earlier results from [4], the gradient of the marginal likelihood was computed as (13), with $\mu_i$ the i-th element of $\mu$ and $\Sigma_{ii}$ the i-th diagonal element of $\Sigma$. [sent-61, score-0.484]

33 In fact, by instead differentiating (12) directly, (13) can be seen to be equivalent to (14), where, advantageously, $\alpha_i$ now occurs only explicitly, since $C_{-i}$ is independent of $\alpha_i$. [sent-64, score-0.133]

34 Stationary points of $\mathcal{L}(\alpha)$: equating (15) to zero indicates that stationary points of the marginal likelihood occur both at $\alpha_i = +\infty$ (note that, being an inverse variance, $\alpha_i$ must be positive) and for a finite value of $\alpha_i$: [sent-69, score-0.604]

35 $\alpha_i = \frac{S_i^2}{Q_i^2 - S_i}$, (17) subject to $Q_i^2 > S_i$ as a consequence again of $\alpha_i > 0$. [sent-70, score-0.056]

36 Since the right-hand side of (17) is independent of $\alpha_i$, we may find the stationary points of $\mathcal{L}(\alpha_i)$ analytically, without iterative re-estimation. [sent-71, score-0.137]
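
With $S_i$ and $Q_i$ in hand, the finite stationary point (17) and the pruning criterion reduce to a few lines; np.inf is used here to represent $\alpha_i = \infty$ (basis function pruned). This is a sketch built on the formulas above, not the authors' code.

```python
import numpy as np

def optimal_alpha_i(S_i, Q_i):
    """Closed-form maximiser of L(alpha) over a single alpha_i (equation (17)):
    alpha_i = S_i^2 / (Q_i^2 - S_i) if Q_i^2 > S_i, else alpha_i -> infinity (prune)."""
    if Q_i ** 2 > S_i:
        return S_i ** 2 / (Q_i ** 2 - S_i)
    return np.inf
```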

37 To find the nature of those stationary points, we consider the second derivatives. [sent-72, score-0.137]

38 Second derivatives of $\mathcal{L}(\alpha)$ with respect to $\alpha_i$: differentiating (15) a second time with respect to $\alpha_i$ gives $\frac{\partial^2\mathcal{L}}{\partial\alpha_i^2} = \frac{-\alpha_i^{-2}S_i^2(\alpha_i + S_i)^2 - 2(\alpha_i + S_i)\left[\alpha_i^{-1}S_i^2 - (Q_i^2 - S_i)\right]}{2(\alpha_i + S_i)^4}$, (18) and we now consider (18) for both finite- and infinite-$\alpha_i$ stationary points. [sent-76, score-0.329]

39 In this case, for stationary points given by (17), we note that the second term in the numerator in (18) is zero, giving $\frac{\partial^2\mathcal{L}}{\partial\alpha_i^2} = \frac{-S_i^2}{2\alpha_i^2(\alpha_i + S_i)^2}$. (19) We see that (19) is always negative, and therefore $\mathcal{L}(\alpha_i)$ has a maximum, for $Q_i^2 > S_i$, at the $\alpha_i$ given by (17). [sent-78, score-0.137]

40 For this case, (18), and indeed all further derivatives, are uninformatively zero at $\alpha_i = \infty$, but from (15) we can see that as $\alpha_i \to \infty$, the sign of the gradient is given by the sign of $-(Q_i^2 - S_i)$. [sent-80, score-0.091]

41 If $Q_i^2 - S_i > 0$, then the gradient at $\alpha_i = \infty$ is negative, so as $\alpha_i$ decreases $\mathcal{L}(\alpha_i)$ must increase to its unique maximum given by (17). [sent-81, score-0.33]

42 Conversely, if $Q_i^2 - S_i < 0$, then $\alpha_i = \infty$ is the unique maximum of $\mathcal{L}(\alpha_i)$. [sent-83, score-0.179]

43 If $Q_i^2 - S_i = 0$, then this maximum and that given by (17) coincide. [sent-84, score-0.086]

44 We now have a full characterisation of the marginal likelihood as a function of a single hyperparameter, which is illustrated in Figure 1. [sent-85, score-0.042]

45 Figure 1: Example plots of $\mathcal{L}(\alpha_i)$ against $\alpha_i$ (on a log scale) for $Q_i^2 > S_i$ (left), showing the single maximum at finite $\alpha_i$, and $Q_i^2 < S_i$ (right), showing the maximum at $\alpha_i = \infty$. [sent-87, score-0.381]
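
The two regimes in Figure 1 can be checked numerically from the $\alpha_i$-dependent part of (12), namely $\frac{1}{2}\left[\log\alpha_i - \log(\alpha_i + S_i) + Q_i^2/(\alpha_i + S_i)\right]$; the values of $S_i$ and $Q_i$ below are invented purely for illustration.

```python
import numpy as np

def ell_i(alpha_i, S_i, Q_i):
    """alpha_i-dependent part of the log marginal likelihood (12)."""
    return 0.5 * (np.log(alpha_i) - np.log(alpha_i + S_i) + Q_i ** 2 / (alpha_i + S_i))

alphas = np.logspace(-2, 6, 400)

# Q_i^2 > S_i: a single interior maximum, located approximately at S_i^2 / (Q_i^2 - S_i).
S, Q = 1.0, 2.0
print(alphas[np.argmax(ell_i(alphas, S, Q))], S ** 2 / (Q ** 2 - S))

# Q_i^2 < S_i: no interior maximum; the curve increases monotonically towards alpha_i = infinity.
S, Q = 1.0, 0.5
print(bool(np.all(np.diff(ell_i(alphas, S, Q)) > 0)))
```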

46 With respect to $\alpha_j$, $j \ne i$: to obtain the off-diagonal terms of the second derivative (Hessian) matrix, it is convenient to manipulate (15) to express it in terms of C. [sent-90, score-0.136]

47 From (11) we see that (20) holds; utilising these identities in (15) gives (21). We now write (22), where $\delta_{ij}$ is the Kronecker 'delta' function, allowing us to separate out the additional (diagonal) term that appears only when i = j. [sent-91, score-0.072]

48 Writing, similarly to (9) earlier, $C = C_{-j} + \alpha_j^{-1}\phi_j\phi_j^T$, substituting into (21) and differentiating with respect to $\alpha_j$ gives (23), while we also have (24). If all hyperparameters $\alpha_i$ are individually set to their maximising values, i.e. [sent-92, score-0.843]

49 $\alpha = \alpha_{MP}$ such that all $\partial\mathcal{L}(\alpha)/\partial\alpha_i = 0$, then even if all $\partial^2\mathcal{L}(\alpha)/\partial\alpha_i^2$ are negative, there may still be a non-axial direction in which the likelihood could be increasing. [sent-94, score-0.158]

50 We now rule out this possibility by showing that the Hessian is negative semi-definite. [sent-95, score-0.1]

51 If the gradient vanishes, then for all $i = 1, \dots, M$ either $\alpha_i = \infty$ or, from (21), $\phi_i^T C^{-1}\phi_i = (\phi_i^T C^{-1} t)^2$. [sent-98, score-0.051]

52 It follows directly from (25) that the Hessian is negative semi-definite, with (25) only zero where v is orthogonal to all finite $\alpha$ values. [sent-99, score-0.171]

53 4 Summary: Sparse Bayesian learning proposes the iterative maximisation of the marginal likelihood function $\mathcal{L}(\alpha)$ with respect to the hyperparameters $\alpha$. [sent-103, score-1.218]

54 I. As a function of an individual hyperparameter $\alpha_i$, $\mathcal{L}(\alpha)$ has a unique maximum computable in closed form. [sent-105, score-0.426]

55 (This maximum is, of course, dependent on the values of all other hyperparameters.) [sent-106, score-0.086]

56 II. If the sparsity criterion $Q_i^2 - S_i$ is negative, this maximum is equivalent to the removal of basis function i from the model. [sent-110, score-0.215]

57 III. The point where all individual marginal likelihood functions $\mathcal{L}(\alpha_i)$ are maximised is a joint maximum (not necessarily unique) over all $\alpha_i$. [sent-111, score-0.558]

58 • From I, we see that if we update, in any arbitrary order, the $\alpha_i$ parameters using (17), we are guaranteed to increase the marginal likelihood at each step, unless already at a maximum. [sent-113, score-0.427]

59 Furthermore, we would expect these updates to be more efficient than those given in [6], which individually only increase, not maximise, $\mathcal{L}(\alpha_i)$. [sent-114, score-0.059]
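
Putting the pieces together, the resulting update scheme can be sketched as below, re-using the illustrative helpers sparsity_and_quality and optimal_alpha_i defined earlier on this page; their naive recomputation of $C_{-i}$ keeps the sketch short, whereas an efficient implementation would maintain these quantities incrementally.

```python
import numpy as np

def sequential_updates(Phi, t, sigma2, n_sweeps=10):
    """Cycle over the hyperparameters in arbitrary order, setting each alpha_i to its
    closed-form maximiser (17); each single-alpha_i step cannot decrease L(alpha)."""
    alpha = np.ones(Phi.shape[1])           # arbitrary finite initialisation
    for _ in range(n_sweeps):
        for i in range(Phi.shape[1]):
            S_i, Q_i = sparsity_and_quality(Phi, t, alpha, sigma2, i)
            alpha[i] = optimal_alpha_i(S_i, Q_i)
    return alpha                            # alpha_i == inf marks a pruned basis function
```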

60 • Result III indicates that sequential optimisation of individual $\alpha_i$ cannot lead to a stationary point from which a joint maximisation over all $\alpha$ may have escaped. [sent-115, score-0.498]

61 • Result II confirms the qualitative argument and empirical observation that many $\alpha_i \to \infty$ as a result of the optimisation procedure in [6]. [sent-119, score-0.065]

62 The inevitable implication of finite numerical precision prevented the genuine sparsity of the model being verified in those earlier simulations. [sent-120, score-0.348]

63 • We conclude by noting that the maximising hyperparameter solution (17) remains valid if $\alpha_i$ is already infinite. [sent-121, score-0.259]

64 This means that basis functions not even in the model can be assessed and their corresponding hyperparameters updated if desired. [sent-122, score-0.378]

65 So, as well as the facility to increase $\mathcal{L}(\alpha)$ through the 'pruning' of basis functions if $Q_i^2 - S_i \le 0$, new basis functions can be introduced if $\alpha_i = \infty$ but $Q_i^2 - S_i > 0$. [sent-123, score-0.257]
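
Because (17) remains valid when $\alpha_i$ is currently infinite, the same comparison of $Q_i^2$ against $S_i$ doubles as a test for introducing basis functions that are not yet in the model. A sketch of that decision logic (with a hypothetical helper name, and not the 'constructive' algorithm of [8] itself):

```python
import numpy as np

def update_one_hyperparameter(alpha, i, S_i, Q_i):
    """Re-estimate, delete or add basis function i based on the sign of Q_i^2 - S_i."""
    in_model = np.isfinite(alpha[i])
    if Q_i ** 2 > S_i:
        alpha[i] = S_i ** 2 / (Q_i ** 2 - S_i)    # finite maximiser, equation (17)
        return "re-estimate" if in_model else "add"
    alpha[i] = np.inf                             # prune (or keep out of the model)
    return "delete" if in_model else "leave out"
```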

66 This has highly desirable computational consequences which we are exploiting to obtain a powerful 'constructive' approximation algorithm [8]. [sent-124, score-0.061]

67 Ziemske, editors, Proceedings of the Eighth International Conference on Artificial Neural Networks (ICANN98), pages 201-206. [sent-149, score-0.068]

68 In Proceedings of the Ninth International Conference on Artificial Neural Networks, pages 575-580, 1999. [sent-163, score-0.11]

69 Muller, editors, Advances in Neural Information Processing Systems 12, pages 652-658. [sent-174, score-0.068]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('si', 0.357), ('hyperparameters', 0.29), ('maximisation', 0.254), ('marginal', 0.23), ('ai', 0.22), ('qr', 0.178), ('likelihood', 0.158), ('wm', 0.155), ('hyperparameter', 0.143), ('sparse', 0.138), ('stationary', 0.137), ('bayesian', 0.128), ('maximising', 0.116), ('sparsity', 0.11), ('fji', 0.098), ('tlw', 0.098), ('unique', 0.093), ('tn', 0.092), ('differentiating', 0.092), ('basis', 0.088), ('maximum', 0.086), ('rvm', 0.085), ('amp', 0.085), ('wlo', 0.085), ('hessian', 0.08), ('tc', 0.077), ('pm', 0.077), ('relevance', 0.075), ('xn', 0.075), ('identities', 0.072), ('maximise', 0.072), ('convenient', 0.07), ('finite', 0.07), ('eural', 0.068), ('tipping', 0.068), ('respect', 0.066), ('effectively', 0.065), ('optimisation', 0.065), ('approximator', 0.062), ('computable', 0.062), ('negative', 0.061), ('highly', 0.061), ('log', 0.061), ('derivatives', 0.06), ('individually', 0.059), ('ls', 0.059), ('defined', 0.058), ('behaviour', 0.057), ('consequence', 0.056), ('en', 0.055), ('gradient', 0.051), ('artificial', 0.05), ('briefly', 0.048), ('methodology', 0.046), ('earlier', 0.045), ('objective', 0.044), ('removed', 0.043), ('write', 0.043), ('implication', 0.042), ('generalised', 0.042), ('donoho', 0.042), ('analyse', 0.042), ('boden', 0.042), ('characterisation', 0.042), ('ernational', 0.042), ('facility', 0.042), ('fn', 0.042), ('generalise', 0.042), ('house', 0.042), ('ici', 0.042), ('inevitable', 0.042), ('judicious', 0.042), ('maximised', 0.042), ('normalising', 0.042), ('notate', 0.042), ('individual', 0.042), ('properties', 0.042), ('summary', 0.041), ('equivalent', 0.041), ('zero', 0.04), ('gives', 0.04), ('st', 0.04), ('increase', 0.039), ('showing', 0.039), ('inverse', 0.039), ('helping', 0.039), ('forthcoming', 0.039), ('ratsch', 0.039), ('fuller', 0.039), ('nm', 0.039), ('thereof', 0.039), ('conventionally', 0.039), ('conversely', 0.039), ('george', 0.039), ('goldszmidt', 0.039), ('inclusion', 0.039), ('niklasson', 0.039), ('plugged', 0.039), ('prevented', 0.039), ('simplification', 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 35 nips-2001-Analysis of Sparse Bayesian Learning

Author: Anita C. Faul, Michael E. Tipping

Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1

2 0.13797837 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

Author: Mário Figueiredo

Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.

3 0.13544728 171 nips-2001-Spectral Relaxation for K-means Clustering

Author: Hongyuan Zha, Xiaofeng He, Chris Ding, Ming Gu, Horst D. Simon

Abstract: The popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descent method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vector can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function. 1

4 0.13525736 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss

Abstract: Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1

5 0.10973969 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

6 0.10610874 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

7 0.10163124 183 nips-2001-The Infinite Hidden Markov Model

8 0.097494736 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

9 0.092786171 76 nips-2001-Fast Parameter Estimation Using Green's Functions

10 0.088133924 8 nips-2001-A General Greedy Approximation Algorithm with Applications

11 0.083983906 13 nips-2001-A Natural Policy Gradient

12 0.081627823 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

13 0.077879094 172 nips-2001-Speech Recognition using SVMs

14 0.076937683 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

15 0.07657811 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

16 0.074513845 154 nips-2001-Products of Gaussians

17 0.071730442 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

18 0.070350066 43 nips-2001-Bayesian time series classification

19 0.069014668 58 nips-2001-Covariance Kernels from Bayesian Generative Models

20 0.067919776 21 nips-2001-A Variational Approach to Learning Curves


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.219), (1, 0.017), (2, 0.03), (3, -0.131), (4, 0.002), (5, -0.069), (6, 0.081), (7, -0.008), (8, 0.042), (9, -0.094), (10, 0.118), (11, 0.079), (12, -0.041), (13, -0.152), (14, 0.001), (15, -0.026), (16, 0.002), (17, -0.026), (18, 0.06), (19, -0.032), (20, 0.077), (21, -0.071), (22, 0.1), (23, 0.046), (24, 0.176), (25, -0.005), (26, -0.079), (27, -0.107), (28, -0.027), (29, 0.005), (30, 0.175), (31, 0.052), (32, 0.127), (33, -0.057), (34, 0.091), (35, 0.116), (36, 0.049), (37, 0.003), (38, -0.037), (39, -0.017), (40, -0.012), (41, 0.038), (42, -0.071), (43, 0.025), (44, -0.061), (45, 0.137), (46, -0.124), (47, -0.009), (48, 0.047), (49, -0.118)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96494895 35 nips-2001-Analysis of Sparse Bayesian Learning

Author: Anita C. Faul, Michael E. Tipping

Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1

2 0.61814737 154 nips-2001-Products of Gaussians

Author: Christopher Williams, Felix V. Agakov, Stephen N. Felderhof

Abstract: Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. Below we consider PoE models in which each expert is a Gaussian. Although the product of Gaussians is also a Gaussian, if each Gaussian has a simple structure the product can have a richer structure. We examine (1) Products of Gaussian pancakes which give rise to probabilistic Minor Components Analysis, (2) products of 1-factor PPCA models and (3) a products of experts construction for an AR(1) process. Recently Hinton (1999) has introduced the Products of Experts (PoE) model in which several individual probabilistic models for data are combined to provide an overall model of the data. In this paper we consider PoE models in which each expert is a Gaussian. It is easy to see that in this case the product model will also be Gaussian. However, if each Gaussian has a simple structure, the product can have a richer structure. Using Gaussian experts is attractive as it permits a thorough analysis of the product architecture, which can be difficult with other models, e.g. models defined over discrete random variables. Below we examine three cases of the products of Gaussians construction: (1) Products of Gaussian pancakes (PoGP) which give rise to probabilistic Minor Components Analysis (MCA), providing a complementary result to probabilistic Principal Components Analysis (PPCA) obtained by Tipping and Bishop (1999); (2) Products of 1-factor PPCA models; (3) A products of experts construction for an AR(1) process. Products of Gaussians: If each expert is a Gaussian $p_i(x|\theta_i)$ ...

3 0.57498533 79 nips-2001-Gaussian Process Regression with Mismatched Models

Author: Peter Sollich

Abstract: Learning curves for Gaussian process regression are well understood when the 'student' model happens to match the 'teacher' (true data generation process). I derive approximations to the learning curves for the more generic case of mismatched models, and find very rich behaviour: For large input space dimensionality, where the results become exact, there are universal (student-independent) plateaux in the learning curve, with transitions in between that can exhibit arbitrarily many over-fitting maxima; over-fitting can occur even if the student estimates the teacher noise level correctly. In lower dimensions, plateaux also appear, and the learning curve remains dependent on the mismatch between student and teacher even in the asymptotic limit of a large number of training examples. Learning with excessively strong smoothness assumptions can be particularly dangerous: For example, a student with a standard radial basis function covariance function will learn a rougher teacher function only logarithmically slowly. All predictions are confirmed by simulations. 1

4 0.56306851 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

Author: Lehel Csató, Manfred Opper, Ole Winther

Abstract: The adaptive TAP Gibbs free energy for a general densely connected probabilistic model with quadratic interactions and arbitrary single site constraints is derived. We show how a specific sequential minimization of the free energy leads to a generalization of Minka’s expectation propagation. Lastly, we derive a sparse representation version of the sequential algorithm. The usefulness of the approach is demonstrated on classification and density estimation with Gaussian processes and on an independent component analysis problem.

5 0.55897212 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

Author: Mário Figueiredo

Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.

6 0.51428574 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

7 0.49866343 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

8 0.48965216 76 nips-2001-Fast Parameter Estimation Using Green's Functions

9 0.46392223 171 nips-2001-Spectral Relaxation for K-means Clustering

10 0.45332682 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons

11 0.43726656 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

12 0.43376157 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

13 0.41867673 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

14 0.40372103 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

15 0.38348344 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

16 0.37878379 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

17 0.36752188 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

18 0.36484388 21 nips-2001-A Variational Approach to Learning Curves

19 0.36476859 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

20 0.35676116 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.027), (17, 0.026), (19, 0.011), (27, 0.108), (30, 0.024), (38, 0.013), (59, 0.019), (72, 0.034), (79, 0.588), (91, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93400019 35 nips-2001-Analysis of Sparse Bayesian Learning

Author: Anita C. Faul, Michael E. Tipping

Abstract: The recent introduction of the 'relevance vector machine' has effectively demonstrated how sparsity may be obtained in generalised linear models within a Bayesian framework. Using a particular form of Gaussian parameter prior, 'learning' is the maximisation, with respect to hyperparameters, of the marginal likelihood of the data. This paper studies the properties of that objective function, and demonstrates that conditioned on an individual hyperparameter, the marginal likelihood has a unique maximum which is computable in closed form. It is further shown that if a derived 'sparsity criterion' is satisfied, this maximum is exactly equivalent to 'pruning' the corresponding parameter from the model. 1

2 0.87757969 2 nips-2001-3 state neurons for contextual processing

Author: Ádám Kepecs, S. Raghavachari

Abstract: Neurons receive excitatory inputs via both fast AMPA and slow NMDA type receptors. We find that neurons receiving input via NMDA receptors can have two stable membrane states which are input dependent. Action potentials can only be initiated from the higher voltage state. Similar observations have been made in several brain areas which might be explained by our model. The interactions between the two kinds of inputs lead us to suggest that some neurons may operate in 3 states: disabled, enabled and firing. Such enabled, but non-firing modes can be used to introduce context-dependent processing in neural networks. We provide a simple example and discuss possible implications for neuronal processing and response variability. 1

3 0.87053818 115 nips-2001-Linear-time inference in Hierarchical HMMs

Author: Kevin P. Murphy, Mark A. Paskin

Abstract: The hierarchical hidden Markov model (HHMM) is a generalization of the hidden Markov model (HMM) that models sequences with structure at many length/time scales [FST98]. Unfortunately, the original inference algorithm is rather complicated, and takes time cubic in the length of the sequence, making it impractical for many domains. In this paper, we show how HHMMs are a special kind of dynamic Bayesian network (DBN), and thereby derive a much simpler inference algorithm, which only takes linear time. Furthermore, by drawing the connection between HHMMs and DBNs, we enable the application of many standard approximation techniques to further speed up inference.

4 0.85803676 180 nips-2001-The Concave-Convex Procedure (CCCP)

Author: Alan L. Yuille, Anand Rangarajan

Abstract: We introduce the Concave-Convex procedure (CCCP) which constructs discrete time iterative dynamical systems which are guaranteed to monotonically decrease global optimization/energy functions. It can be applied to (almost) any optimization problem and many existing algorithms can be interpreted in terms of CCCP. In particular, we prove relationships to some applications of Legendre transform techniques. We then illustrate CCCP by applications to Potts models, linear assignment, EM algorithms, and Generalized Iterative Scaling (GIS). CCCP can be used both as a new way to understand existing optimization algorithms and as a procedure for generating new algorithms. 1

5 0.83744836 78 nips-2001-Fragment Completion in Humans and Machines

Author: David Jacobs, Bas Rokers, Archisman Rudra, Zili Liu

Abstract: Partial information can trigger a complete memory. At the same time, human memory is not perfect. A cue can contain enough information to specify an item in memory, but fail to trigger that item. In the context of word memory, we present experiments that demonstrate some basic patterns in human memory errors. We use cues that consist of word fragments. We show that short and long cues are completed more accurately than medium length ones and study some of the factors that lead to this behavior. We then present a novel computational model that shows some of the flexibility and patterns of errors that occur in human memory. This model iterates between bottom-up and top-down computations. These are tied together using a Markov model of words that allows memory to be accessed with a simple feature set, and enables a bottom-up process to compute a probability distribution of possible completions of word fragments, in a manner similar to models of visual perceptual completion.

6 0.51159394 183 nips-2001-The Infinite Hidden Markov Model

7 0.46431831 118 nips-2001-Matching Free Trees with Replicator Equations

8 0.45908797 86 nips-2001-Grammatical Bigrams

9 0.44893971 184 nips-2001-The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank

10 0.44462332 172 nips-2001-Speech Recognition using SVMs

11 0.43212813 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

12 0.42536482 3 nips-2001-ACh, Uncertainty, and Cortical Inference

13 0.4245435 169 nips-2001-Small-World Phenomena and the Dynamics of Information

14 0.42382216 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs

15 0.41889575 171 nips-2001-Spectral Relaxation for K-means Clustering

16 0.41637957 188 nips-2001-The Unified Propagation and Scaling Algorithm

17 0.41145396 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

18 0.41009098 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

19 0.40841335 79 nips-2001-Gaussian Process Regression with Mismatched Models

20 0.40328652 27 nips-2001-Activity Driven Adaptive Stochastic Resonance