nips nips2000 nips2000-20 knowledge-graph by maker-knowledge-mining

20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities


Source: pdf

Author: Sumio Watanabe

Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities, and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than that of regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed, then the stochastic complexity has the same form as BIC. It is useful for model selection, but not for generalization.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Algebraic geometry is essential to learning theory. [sent-4, score-0.114]

2 In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. [sent-5, score-0.484]

3 In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities and two different problems are studied. [sent-6, score-0.855]

4 (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than that of regular statistical models, even when the true distribution is not contained in the parametric model. [sent-7, score-0.869]

5 (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed, then the stochastic complexity has the same form as BIC. [sent-8, score-0.244]

6 1 Introduction The Fisher information matrix determines a metric on the set of all parameters of a learning machine [2]. [sent-10, score-0.045]

7 If it is positive definite, then a learning machine can be understood as a Riemannian manifold. [sent-11, score-0.106]

8 However, almost all learning machines such as layered neural networks, gaussian mixtures, and Boltzmann machines have singular Fisher metrics. [sent-12, score-0.329]

9 For example, in a three-layer perceptron, the Fisher information matrix J(w) for a parameter w is singular (det J(w) = 0) if and only if w represents a small model which can be realized with fewer hidden units than the learning model. [sent-13, score-0.178]
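To make this degeneracy concrete, here is a small numerical sketch (not from the paper): for a one-hidden-unit tanh regression model with Gaussian output noise and a standard normal input, the Fisher information matrix loses rank exactly when the output weight a is zero, i.e., when the hidden unit is redundant. The tanh activation, the noise variance, and the input distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                           # assumed output noise variance

def f(x, w):
    # one hidden unit: f(x, w) = a * tanh(b*x + c)
    a, b, c = w
    return a * np.tanh(b * x + c)

def grad_f(x, w):
    # gradient of f with respect to (a, b, c)
    a, b, c = w
    t = np.tanh(b * x + c)
    s = 1.0 - t ** 2                   # sech^2(b*x + c)
    return np.array([t, a * x * s, a * s])

def fisher(w, n_mc=20000):
    # Monte Carlo estimate of I(w) = E_x[ grad_f grad_f^T ] / sigma^2
    # for the regression model v = f(x, w) + noise, with x ~ N(0, 1).
    xs = rng.standard_normal(n_mc)
    G = np.array([grad_f(x, w) for x in xs])
    return G.T @ G / (n_mc * sigma2)

w_regular  = np.array([1.0, 0.5, 0.1])   # generic parameter
w_singular = np.array([0.0, 0.5, 0.1])   # a = 0: the hidden unit is redundant

print("det I at a regular  parameter:", np.linalg.det(fisher(w_regular)))
print("det I at a singular parameter:", np.linalg.det(fisher(w_singular)))
```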

10 Therefore, when the learning machine is in an almost redundant state, any method in statistics and physics that uses a quadratic approximation of the loss function cannot be applied. [sent-14, score-0.141]

11 In fact, the maximum likelihood estimator does not obey an asymptotic normal distribution [4]. [sent-15, score-0.245]

12 The Bayesian posterior probability converges to a distribution which is quite different from the normal one [8]. [sent-16, score-0.042]

13 To construct a mathematical foundation for such learning machines, we clarified the essential relation between algebraic geometry and Bayesian statistics [9,10]. [sent-17, score-0.415]

14 In this paper, we show that the asymptotic form of the Bayesian stochastic complexity is rigorously obtained by resolution of singularities. [sent-18, score-0.488]

15 The Bayesian method gives powerful tools for both generalization and model selection; however, the appropriate prior for each purpose is quite different. [sent-19, score-0.174]

16 2 Stochastic Complexity Let p(x|w) be a learning machine, where x is a pair of an input and an output, and w ∈ R^d is a parameter. [sent-20, score-0.045]

17 X^n = (X_1, . . . , X_n) are independently taken from the true distribution q(x), which is not contained in p(x|w) in general. [sent-25, score-0.326]

18 The stochastic complexity F(X^n) and its average F(n) are defined by F(X^n) = −log ∫ ∏_{i=1}^{n} p(X_i|w) φ(w) dw and F(n) = E_{X^n}{F(X^n)}, respectively, where E_{X^n}{·} denotes the expectation over X^n. [sent-26, score-0.224]
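As a numerical sanity check of this definition (not part of the paper), F(X^n) can be estimated by plain Monte Carlo over the prior for a toy one-dimensional location model p(x|w) = N(x; w, 1) with prior φ(w) = N(0, 1) and true distribution q = N(0, 1); all of these choices, and the sample sizes below, are assumptions for illustration. For this regular (d = 1) model one expects F(X^n) − nS to grow roughly like (1/2) log n.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_lik(x, w):
    # toy model: p(x | w) = N(x; w, 1), a one-dimensional location parameter
    return -0.5 * (x - w) ** 2 - 0.5 * np.log(2 * np.pi)

def stochastic_complexity(x, n_prior=20000):
    # F(x^n) = -log integral of prod_i p(x_i | w) * phi(w) dw, estimated by
    # simple Monte Carlo with w drawn from the prior phi = N(0, 1).
    w = rng.standard_normal(n_prior)
    log_joint = np.array([log_lik(x, wj).sum() for wj in w])
    m = log_joint.max()                       # log-sum-exp for stability
    return -(m + np.log(np.mean(np.exp(log_joint - m))))

S = 0.5 * (1.0 + np.log(2 * np.pi))           # entropy of q = N(0, 1)
for n in (10, 100, 1000):
    x = rng.standard_normal(n)                # samples from the true q
    print(n, stochastic_complexity(x) - n * S, 0.5 * np.log(n))
```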

19 The stochastic complexity plays a central role in Bayesian statistics. [sent-28, score-0.151]

20 Firstly, F(n+1) − F(n) − S, where S = −∫ q(x) log q(x) dx, is equal to the average Kullback distance from q(x) to the Bayes predictive distribution p(x|X^n), which is called the generalization error and denoted by G(n). [sent-29, score-0.162]
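The identity behind this statement can be written out (a standard calculation, added here for readability): since F(n) = −E_{X^n} log ∫ ∏_{i=1}^{n} p(X_i|w) φ(w) dw,

F(n+1) − F(n) = −E_{X^{n+1}} log [ ∫ ∏_{i=1}^{n+1} p(X_i|w) φ(w) dw / ∫ ∏_{i=1}^{n} p(X_i|w) φ(w) dw ] = −E_{X^n} ∫ q(x) log p(x|X^n) dx,

and subtracting S = −∫ q(x) log q(x) dx gives

F(n+1) − F(n) − S = E_{X^n} ∫ q(x) log [ q(x) / p(x|X^n) ] dx = G(n).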

21 And lastly, if the prior distribution has a hyperparameter θ, that is to say, φ(w) = φ(w|θ), then it is optimized by minimization of F(X^n) [1]. [sent-31, score-0.146]

22 We define a function F_0(n) using the Kullback distance H(w): F_0(n) = −log ∫ exp(−nH(w)) φ(w) dw, H(w) = ∫ q(x) log [ q(x) / p(x|w) ] dx. [sent-32, score-0.196]

23 Moreover, we assume that L(x, w) ≡ log q(x) − log p(x|w) is an analytic function from w to the Hilbert space of all square integrable functions with the measure q(x)dx, and that the support of the prior, W = supp φ, is compact. [sent-34, score-0.423]

24 Then H(w) is an analytic function on W, and there exists a constant c_1 > 0 such that, for an arbitrary n, F_0(n/2) − c_1 ≤ F(n) − Sn ≤ F_0(n). (1) [sent-35, score-0.326]

25 3 General Learning Machines In this section, we study the case when the true distribution is contained in the parametric model, that is to say, there exists a parameter w_0 ∈ W such that q(x) = p(x|w_0). [sent-36, score-0.494]

26 Let us introduce a zeta function J(z) (z ∈ C) of H(w) and a state density function v(t) by J(z) = ∫ H(w)^z φ(w) dw, v(t) = ∫ δ(t − H(w)) φ(w) dw. [sent-37, score-0.073]

27 Then J(z) = ∫_0^h t^z v(t) dt and F_0(n) = −log ∫_0^h exp(−nt) v(t) dt, where h = max_{w∈W} H(w). [sent-39, score-0.179]

28 Moreover, by using the existence of Sato-Bernstein's b-function [6], it can be analytically continued to a meromorphic function on the entire complex plane, whose poles are real, negative, and rational numbers. [sent-42, score-0.095]

29 Let −λ_1 > −λ_2 > · · · be the poles of J(z) and m_k be the order of −λ_k. [sent-46, score-0.142]

30 Then, by using the inverse Mellin transform, it follows that v(t) has an asymptotic expansion with coefficients {c_{km}}: v(t) ≈ Σ_{k=1}^{∞} Σ_{m=1}^{m_k} c_{km} t^{λ_k − 1} (−log t)^{m−1} (t → +0). [sent-47, score-0.322]

31 Therefore F_0(n) also has an asymptotic expansion: putting λ = λ_1 and m = m_1, F_0(n) = λ log n − (m − 1) log log n + O(1), which ensures the asymptotic expansion of F(n) by eq. (1): [sent-48, score-0.697]

32 F(n) = Sn + λ log n − (m − 1) log log n + O(1). [sent-49, score-0.146]
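A worked example (not in the paper) may make the role of the poles concrete. Take the uniform prior φ ≡ 1 on W = [0, 1]^d.

Regular case, d = 1, H(w) = w²: J(z) = ∫_0^1 w^{2z} dw = 1/(2z + 1) has a simple pole at z = −1/2, so λ = 1/2 = d/2 and m = 1, which is the BIC-type term (d/2) log n.

Singular case, d = 2, H(w) = w_1² w_2²: J(z) = ( ∫_0^1 w^{2z} dw )² = 1/(2z + 1)² has a pole at z = −1/2 of order 2, so λ = 1/2 < d/2 = 1 and m = 2, giving F(n) = Sn + (1/2) log n − log log n + O(1).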

33 The Kullback distance H(w) depends on the analytic set W_0 = {w ∈ W ; H(w) = 0}, so that both λ and m depend on W_0. Note that, if the Bayes generalization error G(n) = F(n + 1) − F(n) − S has an asymptotic expansion, it should be λ/n − (m − 1)/(n log n). [sent-50, score-0.598]

34 The following lemma is proven using the definition of F_0(n) and its asymptotic expansion. [sent-51, score-0.3]

35 Lemma 1 (1) Let (λ_i, m_i) (i = 1, 2) be the constants corresponding to (H_i(w), φ_i(w)) (i = 1, 2). [sent-52, score-0.054]

36 If H_1(w) ≤ H_2(w) and φ_1(w) ≥ φ_2(w), then 'λ_1 < λ_2' or 'λ_1 = λ_2 and m_1 ≥ m_2'. [sent-53, score-0.055]

37 (2) Let (λ_i, m_i) (i = 1, 2) be the constants corresponding to (H_i(w_i), φ_i(w_i)) (i = 1, 2). [sent-54, score-0.054]

38 Then the constants of (H(w), φ(w)) are λ = λ_1 + λ_2 and m = m_1 + m_2 − 1. [sent-56, score-0.109]

39 Let W_1 be the open kernel of W (the maximal open set contained in W). [sent-58, score-0.191]

40 Theorem 1 (Resolution of Singularities, Hironaka [5]) Let H(w) ≥ 0 be a real analytic function on W_1. [sent-59, score-0.202]

41 Then there exist both a real d-dimensional manifold U and a real analytic function g : U → W_1 such that, in a neighborhood of an arbitrary u ∈ U, H(g(u)) = a(u) u_1^{2s_1} · · · u_d^{2s_d} (2), where a(u) > 0 is an analytic function and {s_i} are non-negative integers. [sent-60, score-0.495]

42 Moreover, for an arbitrary compact set K ⊂ W, g^{−1}(K) ⊂ U is a compact set. [sent-61, score-0.152]

43 By applying eq. (2) to the definition of J(z), one can see that the integral in J(z) is decomposed into a direct product of integrals over each variable [3]. [sent-65, score-0.088]

44 In general it is not so easy to find g(u) that gives the complete resolution of singularities; however, in this paper, we show that even a partial resolution mapping gives an upper bound of λ. [sent-67, score-0.268]

45 (1) The prior distribution φ(w) is called positive if φ(w) > 0 for an arbitrary w ∈ W_1 (W = supp φ). [sent-70, score-0.314]

46 If p(x|w) satisfies the condition of asymptotic normality, then λ = d/2 and m = 1. [sent-72, score-0.256]

47 (Outline of the Proof) (1) In order to examine the poles of J(z), we can divide the parameter space into a union of neighborhoods. [sent-77, score-0.133]

48 Since H(w) is an analytic function, in an arbitrary neighborhood of a point w_0 that satisfies H(w_0) = 0, we can find a positive definite quadratic form which is smaller than H(w). [sent-78, score-0.562]

49 (2) Because Jeffreys' prior is coordinate free, we can study the problem on the parameter space U instead of W_1 in eq. (2). [sent-82, score-0.192]

50 Hence, there exists an analytic function t(x, u) such that, in each local coordinate, L(x, u) = L(x, g(u)) = t(x, u) u_1^{s_1} · · · u_d^{s_d}. [sent-84, score-0.282]

51 By Lemma 1 (1), in order to prove the latter half of the theorem, it is sufficient to prove that J(z) has a pole at z = −d/2 with the order m = 1; this completes the theorem. [sent-105, score-0.327]

52 Direct calculation of the integrals in J(z) completes the proof. 4 Three-Layer Perceptron In this section, we study some cases where the learner is a three-layer perceptron and the true distribution is either contained in the model or not. [sent-110, score-0.579]

53 We define the three-layer perceptron p(x, v|w) with M input units, K hidden units, and N output units as follows. [sent-111, score-0.174]

54 p(x, v|w) = r(x) (2πσ²)^{−N/2} exp( −||v − f_K(x, w)||² / (2σ²) ), f_K(x, w) = Σ_{k=1}^{K} a_k σ(b_k·x + c_k), where w = {(a_k, b_k, c_k) ; a_k ∈ R^N, b_k ∈ R^M, c_k ∈ R^1}, r(x) is the probability density on the input, and σ² is the variance of the output (neither r(x) nor σ is estimated). [sent-113, score-0.173]
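For concreteness, the following sketch writes this density out in code; the tanh activation (the paper does not fix σ(·)), the tensor shapes, and the NumPy implementation are assumptions for illustration, and r(x) enters only through an additive log term because it is not estimated.

```python
import numpy as np

def f_K(x, a, b, c):
    # f_K(x, w) = sum_k a_k * sigma(b_k . x + c_k); tanh is an assumed sigma.
    # shapes: a (K, N), b (K, M), c (K,), x (M,)  ->  returns (N,)
    return np.tanh(b @ x + c) @ a

def log_p(x, v, a, b, c, sigma2=1.0, log_r_x=0.0):
    # log p(x, v | w) = log r(x) - (N/2) log(2 pi sigma^2)
    #                   - ||v - f_K(x, w)||^2 / (2 sigma^2)
    N = v.shape[0]
    resid = v - f_K(x, a, b, c)
    return log_r_x - 0.5 * N * np.log(2 * np.pi * sigma2) \
           - 0.5 * (resid @ resid) / sigma2

# smoke test with illustrative sizes K = 3, M = 5, N = 2
rng = np.random.default_rng(0)
K, M, N = 3, 5, 2
a = rng.normal(size=(K, N))
b = rng.normal(size=(K, M))
c = rng.normal(size=K)
x, v = rng.normal(size=M), rng.normal(size=N)
print(log_p(x, v, a, b, c))
```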

55 Theorem 3 If the true distribution is represented by the three-layer perceptron with K_0 ≤ K hidden units, and if a positive prior is employed, then λ ≤ (1/2){K_0(M + N + 1) + (K − K_0) min(M + 1, N)}. [sent-114, score-0.355]
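As a quick illustration of the bound (assuming the closing brace of the theorem reads as reconstructed above), the coefficient of log n can be compared with the regular/BIC value d/2 = K(M + N + 1)/2; the sizes below are arbitrary.

```python
def lambda_upper_bound(K0, K, M, N):
    # Theorem 3 bound as reconstructed above (an assumption, not verbatim):
    # lambda <= (1/2) * { K0*(M+N+1) + (K - K0) * min(M+1, N) }
    return 0.5 * (K0 * (M + N + 1) + (K - K0) * min(M + 1, N))

def bic_coefficient(K, M, N):
    # regular-model value d/2, with d = K*(M+N+1) parameters
    return 0.5 * K * (M + N + 1)

# illustrative sizes: learner with K = 10 hidden units, true model K0 = 3,
# M = 5 input units, N = 1 output unit
print(lambda_upper_bound(3, 10, 5, 1))   # 14.0 -> coefficient of log n in F(n)
print(bic_coefficient(10, 5, 1))         # 35.0 -> what BIC would charge
```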

56 Then da db dc = α^{KN−1} dα da' db' dc' and there exists an analytic function H_1(a', b', c') such that H(a, b, c) = α² H_1(a', b', c'). [sent-125, score-0.642]

57 Also, by using another blowing-up, da db dc = α^{(M+1)K−1} dα da'' db'' dc'' and there exists an analytic function H_2(a'', b'', c'') such that H(a, b, c) = α² H_2(a'', b'', c''), which shows that J(z) has a pole at z = −K(M + 1)/2. [sent-127, score-0.789]

58 By combining both results, we obtain λ ≤ (K/2) min(M + 1, N). [sent-128, score-0.041]
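The step from each blow-up to its pole can be spelled out (a sketch, under the assumption that H_1 and H_2 are bounded away from zero on the region that matters, so that they contribute no additional poles near z = 0): J(z) = ∫ H(a, b, c)^z φ da db dc = ∫ α^{2z} H_1(a', b', c')^z φ α^{KN−1} dα da' db' dc', and the α-integral ∫_0^{α_0} α^{2z+KN−1} dα = α_0^{2z+KN} / (2z + KN) has a simple pole at z = −KN/2, hence λ ≤ KN/2; the second blow-up gives in the same way a pole at z = −K(M + 1)/2, hence λ ≤ K(M + 1)/2, and keeping the smaller of the two bounds yields λ ≤ (K/2) min(M + 1, N).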

59 Secondly, we prove the general case 0 < K_0 ≤ K. [sent-129, score-0.051]

60 If the true regression function g(x) is not contained in the learning model, we assume that, for each 0 ≤ k ≤ K, there exists a parameter w^{(k)} ∈ W that minimizes the square error E(k), and we use the notation λ(k) = (1/2){k(M + N + 1) + (K − k) min(M + 1, N)}. [sent-136, score-0.49]

61 Theorem 4 If the true regression function is not contained in the learning model and a positive prior is applied, then F(n) ≤ min_{0 ≤ k ≤ K} [ 2nE(k) + λ(k) log n ] + O(1). [sent-137, score-0.596]

62 (Outline of Proof) This theorem can be shown by the same procedure as above. [sent-138, score-0.078]

63 It should be emphasized that the optimal k that minimizes G(n) is smaller than K, the size of the learning model, when n is not so large, and it becomes larger as n increases. [sent-144, score-0.102]
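The following toy computation (not from the paper) illustrates this behavior with the trade-off of Theorem 4 as reconstructed above; the approximation errors E(k) and all constants are made-up numbers, so only the qualitative trend, the optimal k growing with n, is meaningful.

```python
import math

def lam(k, K, M, N):
    # lambda(k) as reconstructed above (illustrative)
    return 0.5 * (k * (M + N + 1) + (K - k) * min(M + 1, N))

K, M, N = 10, 5, 1
# hypothetical approximation errors E(k): decreasing in k
E = [0.5 * 2.0 ** (-k) for k in range(K + 1)]

for n in (10, 100, 1000, 10**4, 10**5, 10**6):
    # schematic trade-off: a fit term growing like n*E(k) against a
    # penalty term growing like lambda(k)*log n
    scores = [n * E[k] + lam(k, K, M, N) * math.log(n) for k in range(K + 1)]
    k_star = min(range(K + 1), key=lambda k: scores[k])
    print(n, k_star)
```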

64 This fact shows that a positive prior is useful for generalization but not appropriate for model selection. [sent-145, score-0.235]

65 Under the condition that the true distribution is contained in the parametric model, Jeffreys' prior may enable us to find the true model with higher probability. [sent-146, score-0.573]

66 Theorem 5 If the true regression function is contained in the three-layer perceptron and Jeffreys' prior is applied, then λ = d/2 and m = 1, even if the Fisher metric is degenerate at the true parameter. [sent-147, score-0.579]

67 (Outline of Proof) For simplicity, we prove the theorem for the case g(x) = O. [sent-148, score-0.129]

68 The general cases can be proven by the same method. [sent-149, score-0.043]

69 By direct calculation of the Fisher information matrix, there exists an analytic function D(b, c) ≥ 0 such that det I(w) = ∏_{k=1}^{K} ( Σ_{p=1}^{N} a_{kp}² )^{M+1} D(b, c). By using a blowing-up we obtain H(w) = α² H_1(a', b', c'), the same as in eq. (5); [sent-150, score-0.282]

70 hence det I(w) ∝ α^{2(M+1)K}, and da db dc = α^{NK−1} dα da' db dc. [sent-151, score-0.508]

71 The integral J(z) = ∫ α^{2z} α^{(M+1)K+NK−1} dα (times the integrals over the remaining variables) has a pole at z = −(M + N + 1)K/2. [sent-152, score-0.191]
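A quick consistency check, assuming the parameter count is d = K(M + N + 1) (each of the K hidden units carries N output weights a_k, M input weights b_k, and one bias c_k): the exponent of α in the integrand is 2z + (M + 1)K + NK − 1, which reaches −1 at z = −(M + N + 1)K/2 = −d/2, so Jeffreys' prior restores the regular value λ = d/2, i.e., the BIC-type behavior stated in Theorem 5.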

72 By combining this result with Theorem 3, we obtain the theorem. [sent-153, score-0.041]

73 5 Discussion In many applications of neural networks, rather complex machines are employed compared with the number of training samples. [sent-159, score-0.171]

74 In such cases, the set of optimal parameters is not one point but an analytic set with singularities, and the set of almost optimal parameters {w ; H(w) < ε} is not an 'ellipsoid'. [sent-160, score-0.202]

75 Hence the Kullback distance cannot be approximated by any quadratic form, nor can the saddle point approximation be used for integration over the parameter space. [sent-161, score-0.134]

76 The zeta function of the Kullback distance clarifies the behavior of the stochastic complexity, and resolution of singularities enables us to calculate the learning efficiency. [sent-162, score-0.71]

77 6 Conclusion The relation between algebraic geometry and learning theory is clarified, and two different facts are proven. [sent-163, score-0.255]

78 (1) If the true distribution is not contained in a hierarchical learning model, then by using a positive prior, the generalization error is made smaller than that of regular statistical models. [sent-164, score-0.613]

79 (2) If the true distribution is contained in the learning model and if Jeffreys' prior is used, then the average Bayesian factor has the same form as BIC. [sent-165, score-0.475]

80 (1998) On the generalization error by a layered statistical model with Bayesian estimation. [sent-203, score-0.184]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fo', 0.259), ('singularities', 0.257), ('asymptotic', 0.203), ('analytic', 0.202), ('contained', 0.191), ('jeffreys', 0.183), ('kullback', 0.158), ('wi', 0.151), ('da', 0.15), ('pole', 0.147), ('watanabe', 0.147), ('algebraic', 0.141), ('resolution', 0.134), ('fisher', 0.128), ('fq', 0.126), ('layered', 0.114), ('ck', 0.114), ('ko', 0.114), ('clarified', 0.11), ('deti', 0.11), ('rpi', 0.11), ('xlw', 0.11), ('bk', 0.106), ('dc', 0.106), ('prior', 0.104), ('db', 0.104), ('percept', 0.095), ('poles', 0.095), ('rp', 0.094), ('true', 0.093), ('xn', 0.09), ('outline', 0.082), ('wo', 0.082), ('stochastic', 0.081), ('exists', 0.08), ('dw', 0.079), ('ron', 0.079), ('theorem', 0.078), ('sn', 0.075), ('ckm', 0.073), ('exn', 0.073), ('hironaka', 0.073), ('lw', 0.073), ('mellin', 0.073), ('uwi', 0.073), ('vlw', 0.073), ('zeta', 0.073), ('log', 0.073), ('expansion', 0.072), ('generalization', 0.07), ('complexity', 0.07), ('geometry', 0.069), ('bayesian', 0.067), ('ak', 0.067), ('machines', 0.065), ('re', 0.063), ('logq', 0.063), ('supp', 0.063), ('positive', 0.061), ('min', 0.059), ('normality', 0.057), ('firstly', 0.057), ('wl', 0.057), ('smaller', 0.057), ('perceptron', 0.055), ('ml', 0.055), ('units', 0.055), ('compact', 0.054), ('constants', 0.054), ('lemma', 0.054), ('regular', 0.054), ('satisfies', 0.053), ('fk', 0.053), ('lh', 0.053), ('sp', 0.053), ('definite', 0.052), ('prove', 0.051), ('coordinate', 0.05), ('parametric', 0.05), ('proof', 0.05), ('distance', 0.05), ('let', 0.05), ('statistics', 0.05), ('hi', 0.049), ('mk', 0.047), ('sd', 0.047), ('japan', 0.047), ('neighborhood', 0.047), ('quadratic', 0.046), ('learning', 0.045), ('secondly', 0.045), ('integral', 0.044), ('arbitrary', 0.044), ('regression', 0.043), ('employed', 0.043), ('proven', 0.043), ('distribution', 0.042), ('combining', 0.041), ('singular', 0.04), ('parameter', 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

Author: Sumio Watanabe

Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities, and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than that of regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed, then the stochastic complexity has the same form as BIC. It is useful for model selection, but not for generalization.

2 0.11135891 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

3 0.09403237 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

Author: Ilya Nemenman, William Bialek

Abstract: Learning of a smooth but nonparametric probability density can be regularized using methods of Quantum Field Theory. We implement a field theoretic prior numerically, test its efficacy, and show that the free parameter of the theory (,smoothness scale') can be determined self consistently by the data; this forms an infinite dimensional generalization of the MDL principle. Finally, we study the implications of one's choice of the prior and the parameterization and conclude that the smoothness scale determination makes density estimation very weakly sensitive to the choice of the prior, and that even wrong choices can be advantageous for small data sets. One of the central problems in learning is to balance 'goodness of fit' criteria against the complexity of models. An important development in the Bayesian approach was thus the realization that there does not need to be any extra penalty for model complexity: if we compute the total probability that data are generated by a model, there is a factor from the volume in parameter space-the 'Occam factor' -that discriminates against models with more parameters [1, 2]. This works remarkably welJ for systems with a finite number of parameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equivalent to the celebrated Minimal Description Length (MDL) principle [3]. In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using alJ of the information relevant for learning [II]? What happens if our learning problem is strongly atypical of the prior distribution? Following Ref. [5], if N i. i. d. samples {Xi}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by P[Q(x)l{x.}] P[Q(x)] rr~1 Q(Xi) • - J[dQ(x)]P[Q(x)] rr~1 Q(Xi) , (1) where P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFf, and the optimal least square estimator is then Q (I{ .}) - (Q(X)Q(Xl)Q(X2) ... Q(XN)}(O) est X X. (Q(Xl)Q(X2) ... Q(XN ))(0) , (2) where ( ... )(0) means averaging with respect to the prior. Since Q(x) ~ 0, it is convenient to define an unconstrained field ¢(x), Q(x) (l/io)exp[-¢(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. = The next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning. 
We want the prior P[¢] to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. For x in one dimension, a minimal choice is P[¢(x)] 1 = Z exp [£2 11 - 1 --2- f (8 dx [1 f 11 ¢)2] c5 io 8xll ] dxe-¢(x) -1 , (3) where'T/ > 1/2, Z is the normalization constant, and the c5-function enforces normalization of Q. We refer to i and 'T/ as the smoothness scale and the exponent, respectively. In [5] this theory was solved for large Nand 'T/ = 1: N (II Q(Xi))(O) ~ (4) = (5) + (6) i=1 Seff i8;¢c1 (x) where ¢cl is the 'classical' (maximum likelihood, saddle point) solution. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of ¢c1 converges to the negative logarithm of the target distribution P(x) very quickly, and that the variance of fluctuations 'Ij;(x) ¢(x) [- log ioP( x)] falls off as ....., 1/ iN P( x). Finally, it was speculated that if the actual i is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale i*. J = At the first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by a Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. In the semi--classical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely non-trivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. The algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box a ~ x ~ L with periodic boundary conditions. The boundary value problem Eq. (6) is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10- 5 ). The independent variable x E [0,1] is discretized in equal steps [10 4 for Figs. (l.a-2.b), and 105 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of ¢el (x) is [5] (7) c5x '

4 0.090057373 27 nips-2000-Automatic Choice of Dimensionality for PCA

Author: Thomas P. Minka

Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.

5 0.089443959 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

6 0.088556491 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

7 0.086332068 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

8 0.086036056 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

9 0.081909336 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

10 0.073598444 75 nips-2000-Large Scale Bayes Point Machines

11 0.0734929 35 nips-2000-Computing with Finite and Infinite Networks

12 0.071706131 110 nips-2000-Regularization with Dot-Product Kernels

13 0.070155039 94 nips-2000-On Reversing Jensen's Inequality

14 0.067625515 122 nips-2000-Sparse Representation for Gaussian Process Models

15 0.06355647 92 nips-2000-Occam's Razor

16 0.063227087 120 nips-2000-Sparse Greedy Gaussian Process Regression

17 0.061893173 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

18 0.060956992 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech

19 0.060338527 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

20 0.060072295 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.223), (1, 0.075), (2, -0.008), (3, -0.026), (4, 0.094), (5, 0.022), (6, -0.039), (7, -0.004), (8, -0.035), (9, -0.149), (10, -0.049), (11, -0.002), (12, 0.001), (13, -0.052), (14, 0.078), (15, 0.046), (16, 0.078), (17, -0.073), (18, -0.007), (19, 0.013), (20, 0.024), (21, 0.082), (22, -0.113), (23, 0.019), (24, -0.028), (25, 0.092), (26, 0.113), (27, 0.003), (28, -0.025), (29, 0.013), (30, 0.014), (31, -0.076), (32, 0.064), (33, -0.05), (34, -0.113), (35, -0.043), (36, 0.084), (37, -0.011), (38, 0.192), (39, -0.09), (40, 0.026), (41, 0.005), (42, 0.02), (43, 0.06), (44, 0.08), (45, 0.106), (46, 0.283), (47, -0.042), (48, -0.04), (49, 0.258)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94138682 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

Author: Sumio Watanabe

Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities, and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than that of regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed, then the stochastic complexity has the same form as BIC. It is useful for model selection, but not for generalization.

2 0.46740928 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

Author: Igor V. Cadez, Padhraic Smyth

Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of

3 0.43720227 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

Author: In Jae Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian

Abstract: How should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two areas of cognitive modeling. 1 Model Selection and Model Complexity The development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncracies in the particular data set (i.e., noise) over and above the underlying regularities of interest. Such models are considered complex, in that the inherent flexibility in the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures; each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well. The problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity. They are the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that two one-parameter models, y = ex and y = x 9, are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability. The trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. 
We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling. 1.1 Minimum Description Length The central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters e assigns the likelihood f(yle) to a given set of observed data y . The full form of the MDL measure for such a model family is given below. MDL = -In! (yISA) + ~ln( ; )+ In f dS.jdetl(S) where SA is the parameter that maximizes the likelihood, k is the number of parameters in the model, N is the sample size and I(e) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log likelihood term is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. In particular, the third term captures the effects of complexity due to functional form, reflected through I(e). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper. MDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data for a special form of the parameter prior density [2] . Hence in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection. 2 Differential Geometric Interpretation of MDL From a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which

4 0.416861 27 nips-2000-Automatic Choice of Dimensionality for PCA

Author: Thomas P. Minka

Abstract: A central issue in principal component analysis (PCA) is choosing the number of principal components to be retained. By interpreting PCA as density estimation, we show how to use Bayesian model selection to estimate the true dimensionality of the data. The resulting estimate is simple to compute yet guaranteed to pick the correct dimensionality, given enough data. The estimate involves an integral over the Stiefel manifold of k-frames, which is difficult to compute exactly. But after choosing an appropriate parameterization and applying Laplace's method, an accurate and practical estimator is obtained. In simulations, it is convincingly better than cross-validation and other proposed algorithms, plus it runs much faster.

5 0.41578895 116 nips-2000-Sex with Support Vector Machines

Author: Baback Moghaddam, Ming-Hsuan Yang

Abstract: unkown-abstract

6 0.39787415 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

7 0.37030315 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

8 0.35865515 22 nips-2000-Algorithms for Non-negative Matrix Factorization

9 0.33964443 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

10 0.32303262 35 nips-2000-Computing with Finite and Infinite Networks

11 0.32060552 94 nips-2000-On Reversing Jensen's Inequality

12 0.31986541 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

13 0.31494802 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

14 0.314257 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

15 0.31328249 110 nips-2000-Regularization with Dot-Product Kernels

16 0.30931193 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

17 0.29783252 58 nips-2000-From Margin to Sparsity

18 0.2936455 85 nips-2000-Mixtures of Gaussian Processes

19 0.28734246 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

20 0.28707951 124 nips-2000-Spike-Timing-Dependent Learning for Oscillatory Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.353), (10, 0.024), (17, 0.112), (32, 0.014), (33, 0.071), (55, 0.014), (62, 0.05), (65, 0.015), (67, 0.101), (76, 0.076), (79, 0.021), (90, 0.037), (91, 0.021), (97, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80704093 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

Author: Sumio Watanabe

Abstract: Algebraic geometry is essential to learning theory. In hierarchical learning machines such as layered neural networks and gaussian mixtures, the asymptotic normality does not hold, since Fisher information matrices are singular. In this paper, the rigorous asymptotic form of the stochastic complexity is clarified based on resolution of singularities, and two different problems are studied. (1) If the prior is positive, then the stochastic complexity is far smaller than BIC, resulting in a smaller generalization error than that of regular statistical models, even when the true distribution is not contained in the parametric model. (2) If Jeffreys' prior, which is coordinate free and equal to zero at singularities, is employed, then the stochastic complexity has the same form as BIC. It is useful for model selection, but not for generalization.

2 0.45682201 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.

3 0.44658199 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

4 0.4391703 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

5 0.4384239 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, Andr茅 Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

6 0.43603167 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

7 0.43482044 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

8 0.43237242 111 nips-2000-Regularized Winnow Methods

9 0.43086645 94 nips-2000-On Reversing Jensen's Inequality

10 0.43069309 22 nips-2000-Algorithms for Non-negative Matrix Factorization

11 0.43012324 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

12 0.42811078 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

13 0.42777658 146 nips-2000-What Can a Single Neuron Compute?

14 0.42735633 13 nips-2000-A Tighter Bound for Graphical Models

15 0.42673987 122 nips-2000-Sparse Representation for Gaussian Process Models

16 0.42665774 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

17 0.42499685 60 nips-2000-Gaussianization

18 0.42419684 79 nips-2000-Learning Segmentation by Random Walks

19 0.4236334 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

20 0.42344001 92 nips-2000-Occam's Razor