nips nips2004 nips2004-204 knowledge-graph by maker-knowledge-mining

204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss


Source: pdf

Author: Liam Paninski

Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Variational minimax estimation of discrete distributions under KL loss Liam Paninski Gatsby Computational Neuroscience Unit University College London liam@gatsby. [sent-1, score-0.235]

2 uk/∼liam Abstract We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i. [sent-8, score-0.562]

3 Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. [sent-12, score-1.031]

4 The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. [sent-13, score-1.172]

5 Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). [sent-14, score-0.981]

6 Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. [sent-15, score-0.717]

7 Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime. [sent-16, score-0.749]

8 Introduction The estimation of discrete distributions given finite data — “histogram smoothing” — is a canonical problem in statistics and is of fundamental importance in applications to language modeling and informatics (1–3). [sent-17, score-0.087]

9 In particular, estimation of discrete distributions under Kullback-Leibler (KL) loss is of basic interest in the coding community, in the context of two-step universal codes (4, 5). [sent-18, score-0.15]

10 , (6) and references therein); in this work, we will focus on the “minimax” approach, that is, on developing estimators which work well even in the worst case, with the performance of an estimator measured by the average KL loss. [sent-21, score-0.612]

11 Our goal here is to analyze further the opposite case, when N/m is bounded or even small (the sparse data case). [sent-23, score-0.053]

12 It will turn out that the estimators which are asymptotically optimal as N/m → ∞ are far from optimal in this sparse data case, which may be considered more important for applications to modeling of large dictionaries. [sent-24, score-0.548]

13 We will emphasize the similarities (and important differences) between these two problems throughout. [sent-26, score-0.026]

14 Upper bounds The basic idea is to find a simple upper bound on the worst-case expected loss, and then to minimize this upper bound over some tractable class of possible estimators; the resulting optimized estimator will then be guaranteed to possess good worst-case properties. [sent-27, score-1.274]

15 Clearly we want this upper bound to be as tight as possible, and the space of allowed estimators to be as large as possible, while still allowing easy minimization. [sent-28, score-0.748]

16 The approach taken here is to develop bounds which are convex in the estimator, and to allow the estimators to range over a large convex space; this implies that the minimization problem is tractable by descent methods, since no non-global local minima exist. [sent-29, score-0.878]

17 The “add-constant” estimators, g_j = (j + α)/(N + mα), α > 0, are an important special case (7). [sent-31, score-0.489]
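
As a concrete illustration of the add-constant family, here is a minimal NumPy sketch (not code from the paper; the (N + mα) normalization follows the P(unseen) = mα/(N + mα) expression quoted later in this summary):

```python
import numpy as np

def add_constant_estimator(counts, alpha):
    """Add-alpha estimate of a discrete distribution from raw counts.

    counts : length-m integer array (occurrences of each symbol)
    alpha  : the add-constant parameter (> 0); alpha = 1 gives Laplace's rule.
    """
    counts = np.asarray(counts, dtype=float)
    N, m = counts.sum(), counts.size
    return (counts + alpha) / (N + m * alpha)

# Example: m = 5 symbols, N = 6 samples, add-1/2 smoothing.
print(add_constant_estimator([3, 2, 1, 0, 0], alpha=0.5))
```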

18 The above maxima are always achieved, by the compactness of the intervals and the continuity of the binomial and entropy functions. [sent-37, score-0.236]

19 Again, the key point is that these bounds are uniform over all possible underlying p (that is, they bound the worst-case error). [sent-38, score-0.441]

20 The first is nearly tight for N >> m (it is actually asymptotically possible to replace m with m − 1 in this limit, due to the fact that pi must sum to one; see (7, 8)), but grows linearly with m and thus cannot be tight for m comparable to or larger than N . [sent-40, score-0.702]

21 In particular, the optimizer doesn’t depend on m, only N (and hence the bound can’t help but behave linearly in m). [sent-41, score-0.211]

22 The second bound is much more useful (and, as we show below, tight) in the data-sparse regime N << m. [sent-42, score-0.23]

23 The resulting minimization problems have a polynomial approximation flavor: we are trying to find an optimal set of weights gj such that the sum in the definition of f (t) (a polynomial in t) will be as close to H(t) + t as possible. [sent-43, score-0.641]
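
The polynomial-approximation picture can be made concrete in a few lines. The sketch below assumes f(t) has the form suggested by the integral relaxation quoted further down, f(t) = −H(t) − t + Σ_j (g_j − t log g_j) B_{N,j}(t), with B_{N,j} the binomial polynomials and H(t) = −t log t, and simply evaluates its worst value over a grid of t; the grid resolution and the rough scaling by m are illustrative assumptions, not the paper's exact bound.

```python
import numpy as np
from scipy.special import xlogy
from scipy.stats import binom

def f_values(g, N, t_grid):
    """Evaluate f(t) = -H(t) - t + sum_j (g_j - t*log g_j) * B_{N,j}(t)
    on a grid of t in [0, 1], where B_{N,j}(t) is the Binomial(N, t) pmf
    and H(t) = -t*log t."""
    t = np.asarray(t_grid, dtype=float)
    B = binom.pmf(np.arange(N + 1)[None, :], N, t[:, None])   # shape (T, N+1)
    return xlogy(t, t) - t + B @ g - t * (B @ np.log(g))

# Worst value over t for the Laplace weights g_j = (j+1)/(N+2).
N = 10
g = (np.arange(N + 1) + 1.0) / (N + 2.0)
t_grid = np.linspace(0.0, 1.0, 2001)
print(f_values(g, N, t_grid).max())   # the upper bound scales (roughly) as m times this
```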

24 In this sense our approach is nearly identical to that recently followed for bounding the bias in the entropy estimation case (11, 12). [sent-44, score-0.169]

25 Indeed, we will see below that the entropy estimation problem is qualitatively easier than the estimation of the full distribution, despite the entropic form of the KL loss. [sent-46, score-0.247]

26 Smooth minimization algorithm In the next subsections, we develop methods for minimizing these bounds as a function of gj (that is, for choosing estimators with good worst-case properties). [sent-47, score-1.067]

27 The first key point is that the bounds involve maxima over a collection of convex functions in gj , and hence the bounds are convex in gj ; since the coefficients gj take values in a convex set, no non-global local minima exist, and the global minimum can be found by simple descent procedures. [sent-48, score-2.323]

28 A more efficient solution is given by approximating this nondifferentiable objective function by smooth functions which retain the convexity of the original objective. [sent-50, score-0.081]

29 , the exponential), and sums of convex functions are convex. [sent-53, score-0.093]

30 (In fact, since Uq is strictly convex in g for any q, the minima are unique, which to our knowledge is not necessarily the case for the original minimax problem. [sent-54, score-0.217]

31 ) It is easy to show that any limit point of the sequence of minimizers of the above problems will minimize the original problem; applying conjugate gradient descent for each q, with the previous minimizer as the seed for the minimization in the next largest q, worked well in practice. [sent-55, score-0.28]
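
A rough sketch of the smoothing idea, with the caveat that the exact form of U_q is not reproduced in this summary: below, the maximum over the t-grid is replaced by a log-sum-exp surrogate with parameter q (a standard smooth, convex upper approximation to the max), minimized with a generic bound-constrained optimizer while q is annealed upward; the grid, bounds, q schedule, and starting point are all illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, xlogy
from scipy.stats import binom

def smoothed_bound(g, N, t_grid, q):
    """Log-sum-exp ("soft max") smoothing of max_t f(t); each f(t) is convex
    in g, so this surrogate is convex in g as well and is differentiable."""
    B = binom.pmf(np.arange(N + 1)[None, :], N, t_grid[:, None])
    f = xlogy(t_grid, t_grid) - t_grid + B @ g - t_grid * (B @ np.log(g))
    return logsumexp(q * f) / q              # -> max_t f(t) as q grows

N = 10
t_grid = np.linspace(1e-6, 1.0, 400)
g = (np.arange(N + 1) + 0.5) / (N + 1.0)     # add-1/2 start (alpha < 1, per the text)

for q in (10.0, 100.0, 1000.0):              # anneal the smoothing parameter
    res = minimize(smoothed_bound, g, args=(N, t_grid, q),
                   method="L-BFGS-B", bounds=[(1e-8, 1.0)] * (N + 1))
    g = res.x
print(g)                                     # optimized weights g_j
```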

32 Initialization; connection to Laplace estimator It is now useful to look for suitable starting points for the minimization. [sent-56, score-0.311]

33 For example, for the first bound, approximate the maximum by an integral, that is, find gj to minimize m ∫_0^1 dt [−H(t) − t + Σ_j (g_j − t log g_j) B_{N,j}(t)]. [sent-57, score-1.198]

34 (Note that this can be thought of as the limit of the above Uq minimization problem as q → 0, as can be seen by expanding the exponential. [sent-58, score-0.151]

35 ) The gj that minimizes this approximation to the upper bound is trivially derived as g_j = [∫_0^1 t B_{N,j}(t) dt] / [∫_0^1 B_{N,j}(t) dt] = β(j + 2, N − j + 1)/β(j + 1, N − j + 1) = (j + 1)/(N + 2), with β(a, b) = ∫_0^1 t^{a−1}(1 − t)^{b−1} dt defined as usual. [sent-59, score-1.446]

36 The resulting estimator p̂ agrees exactly with “Laplace’s estimator,” the add-α estimator with α = 1. [sent-60, score-0.622]
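
A quick numerical check of the β-function identity above, confirming that the minimizer of the integral relaxation is the Laplace weight (j + 1)/(N + 2); the grid-based quadrature is our own, purely illustrative:

```python
import numpy as np
from scipy.stats import binom

N = 10
t = np.linspace(0.0, 1.0, 100001)
for j in (0, 3, 7, N):
    B = binom.pmf(j, N, t)
    g_j = np.sum(t * B) / np.sum(B)      # ratio of the two integrals on a fine grid
    print(j, g_j, (j + 1) / (N + 2))     # the two columns agree up to quadrature error
```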

37 Note, though, that to derive this gj , we completely ignore the first two terms (−H(t) − t) in the upper bound, and the resulting estimator can therefore be expected to be suboptimal (in particular, the gj will be chosen too large, since −H(t) − t is strictly decreasing for t < 1). [sent-61, score-1.415]

38 Indeed, we find that add-α estimators with α < 1 provide a much better starting point for the optimization, as expected given (7,8). [sent-62, score-0.275]

39 (Of course, for N/m large enough an asymptotically optimal estimator is given by the perturbed add-constant estimator of (8), and none of this numerical optimization is necessary. [sent-63, score-0.895]

40 ) In the limit as c = N/m → 0, we will see below that a better initialization point is the add-α estimator with parameter α ≈ H(c) = −c log c. [sent-64, score-0.522]

41 Fixed-point algorithm On examining the gradient of the above problems with respect to gj , a fixed-point algorithm may be derived. [sent-65, score-0.489]

42 While this is an attractive strategy, conjugate gradient descent proved to be a more stable algorithm in our hands. [sent-67, score-0.076]

43 Lower bounds Once we have found an estimator with good worst-case error, we want to compare its performance to some well-defined optimum. [sent-68, score-0.541]

44 To do this, we obtain lower bounds on the worst-case performance of any estimator (not just the class of p̂ we minimized over in the last section). [sent-69, score-0.649]

45 Once again, we will derive a family of bounds indexed by some parameter α, and then optimize over α. [sent-70, score-0.29]

46 Our lower bounds are based on the well-known fact that, for any proper prior distribution, the average (Bayesian) loss is less than or equal to the maximum (worst-case) loss. [sent-71, score-0.404]

47 This approach therefore generalizes the asymptotic lower bound used in (7), who examined the KL loss under the special case of the uniform Dirichlet prior. [sent-74, score-0.52]

48 See also (4) for direct application of this idea to bound the average code length, and (14), who derived a lower bound on the average KL loss, again in the uniform Dirichlet case. [sent-75, score-0.535]
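
The Bayes-risk argument can be sketched directly by simulation. For a symmetric Dirichlet(α) prior, the Bayes estimator under KL loss is the posterior mean, i.e. the add-α rule, and its average loss lower-bounds the worst-case loss of any estimator; the Monte Carlo version below is only a stand-in for the closed-form evaluation the text develops next, and the particular (m, N, α) are arbitrary.

```python
import numpy as np
from scipy.special import rel_entr

def bayes_risk_mc(m, N, alpha, n_trials=5000, seed=0):
    """Monte Carlo estimate of E[ D(p || p_hat) ] with p ~ Dirichlet(alpha,...,alpha)
    and p_hat the posterior mean (the add-alpha rule).  By the argument in the
    text, this average lower-bounds the worst-case KL loss of any estimator."""
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.full(m, alpha), size=n_trials)        # true distributions
    counts = np.array([rng.multinomial(N, pi) for pi in p])    # N i.i.d. samples each
    p_hat = (counts + alpha) / (N + m * alpha)                 # Bayes / add-alpha estimate
    return rel_entr(p, p_hat).sum(axis=1).mean()               # average KL loss

print(bayes_risk_mc(m=1000, N=100, alpha=0.05))    # sparse regime, c = N/m = 0.1
```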

49 , multinomials whose parameter p is itself Dirichlet distributed); we have used linearity of the expectation, Σ_i n_i = N, and Dir(α|n) = Dir(α + n). [sent-83, score-0.154]

50 Evaluating the right-hand side of the above, in turn, requires the formula −E_{Dir(α)} Σ_i H(p_i) = Σ_i (α_i / Σ_{i′} α_{i′}) [ψ(α_i + 1) − ψ(1 + Σ_{i′} α_{i′})], with ψ(t) = (d/dt) log Γ(t); recall that ψ(t + 1) = ψ(t) + 1/t. [sent-84, score-0.192]

51 All of the above may thus be easily computed numerically for any N, m, and α; to simplify, however, we will restrict α to be constant, α = (α, α, . [sent-85, score-0.072]
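
A sanity check of the ψ-formula above (as reconstructed here), comparing the closed form against a Monte Carlo average over Dirichlet draws; the α vector is arbitrary.

```python
import numpy as np
from scipy.special import digamma, xlogy

alpha = np.array([0.3, 1.0, 2.5, 0.7])
a0 = alpha.sum()

# Closed form: -E_Dir(alpha) sum_i H(p_i) = sum_i (alpha_i/a0) * (psi(alpha_i+1) - psi(a0+1)).
closed_form = np.sum((alpha / a0) * (digamma(alpha + 1.0) - digamma(a0 + 1.0)))

rng = np.random.default_rng(1)
p = rng.dirichlet(alpha, size=200_000)
monte_carlo = np.mean(np.sum(xlogy(p, p), axis=1))   # sample average of sum_i p_i log p_i

print(closed_form, monte_carlo)   # should agree to a few decimal places
```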

52 This symmetrizes the above formulae; we can replace the outer sum with multiplication by m, and substitute Σ_i α_i = mα. [sent-89, score-0.085]

53 Finally, abbreviating K = N + mα, we have that the worst-case error is bounded below by: (m/K) Σ_{j=0}^{N} p_{α,m,N}(j) (j + α) [−log((j + α)/K) + ψ(j + α) + 1/(j + α) − ψ(K) − 1/K], (1) with p_{α,m,N}(j) the beta-binomial distribution p_{α,m,N}(j) = (N choose j) Γ(mα) Γ(j + α) Γ(K − (j + α)) / [Γ(K) Γ(α) Γ(mα − α)]. [sent-90, score-0.145]

54 This lower bound is valid for all N, m, and α, and can be optimized numerically in the (scalar) parameter α in a straightforward manner. [sent-91, score-0.428]
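
The bound (1) is easy to evaluate with log-Gamma functions. The sketch below is our own implementation of the formula as reconstructed above: it computes the bound on a grid of α values and reports the maximizing α, which can be compared with the H(c) rule of thumb discussed later; the example (N, m) and the α grid are arbitrary.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_lower_bound(N, m, alpha):
    """Evaluate the Bayes-risk lower bound (1) for a symmetric Dirichlet(alpha)
    prior, using the beta-binomial distribution of a single bin's count."""
    K = N + m * alpha
    j = np.arange(N + 1)
    log_pmf = (gammaln(N + 1) - gammaln(j + 1) - gammaln(N - j + 1)
               + gammaln(m * alpha) + gammaln(j + alpha) + gammaln(K - (j + alpha))
               - gammaln(K) - gammaln(alpha) - gammaln(m * alpha - alpha))
    bracket = (-np.log((j + alpha) / K) + digamma(j + alpha) + 1.0 / (j + alpha)
               - digamma(K) - 1.0 / K)
    return (m / K) * np.sum(np.exp(log_pmf) * (j + alpha) * bracket)

N, m = 100, 10_000                          # sparse regime, c = N/m = 0.01
alphas = np.logspace(-4, 0, 200)
values = [dirichlet_lower_bound(N, m, a) for a in alphas]
best_alpha = alphas[int(np.argmax(values))]
c = N / m
print(best_alpha, -c * np.log(c))           # least-favorable alpha vs. the H(c) rule of thumb
print(max(values))                          # best lower bound found on this grid
```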

55 Due to space constraints, we can only sketch the proof of each of the following statements. [sent-93, score-0.025]

56 Any add-α estimator, α > 0, is uniformly KL-consistent if N/m → ∞. [sent-95, score-0.03]

57 To obtain the O(1/N) bound, we plug in the add-constant g_j = (j + α)/N: f(t) = α/N + t log t − t Σ_j log((j + α)/N) B_{N,j}(t). [sent-98, score-0.489]

58 For t fixed, an application of the delta method implies that the sum looks like log(t + α/N) − (1 − t)/(2Nt); an expansion of the logarithm, in turn, implies that the right-hand side converges to (1/(2N))(1 − t), for any fixed α > 0. [sent-99, score-0.057]

59 On a 1/N scale, on the other hand, we have N f(t/N) = α + t log t − t Σ_j log(j + α) B_{N,j}(t/N), which can be uniformly bounded above. [sent-100, score-0.057]

60 In fact, as demonstrated by (7), the binomial sum on the right-hand side converges to the corresponding Poisson sum; interestingly, a similar Poisson sum plays a key role in the analysis of the entropy estimation case in (12). [sent-101, score-0.328]
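
The binomial-to-Poisson convergence invoked here is easy to check numerically under the t/N scaling used above:

```python
import numpy as np
from scipy.stats import binom, poisson

t = 2.5
j = np.arange(12)
for N in (10, 100, 10_000):
    b = binom.pmf(j, N, t / N)          # B_{N,j}(t/N)
    p = poisson.pmf(j, t)               # Poisson(t) pmf
    print(N, np.abs(b - p).max())       # the gap shrinks as N grows
```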

61 A converse follows easily from the lower bounds developed above: Proposition 2. [sent-102, score-0.342]

62 No estimator is uniformly KL-consistent if lim sup N/m < ∞. [sent-103, score-0.369]

63 Of course, it is intuitively clear that we need many more than m samples to estimate a distribution on m bins; our contribution here is a quantitative asymptotic lower bound on the error in the data-sparse regime. [sent-104, score-0.462]

64 (A simpler but slightly weaker asymptotic bound may be developed from the lower bound given in (14). [sent-105, score-0.606]

65 ) Once again, we contrast with the entropy estimation case, where consistent estimators do exist in this regime (12). [sent-106, score-0.487]

66 The beta-binomial distribution has mean N/m and converges to a non-degenerate limit, which we’ll denote pα,c , in this regime. [sent-108, score-0.026]

67 Using Fatou’s lemma and ψ(t) = log(t) − 1/(2t) + O(t^{−2}), t → ∞, we obtain the asymptotic lower bound (1/(c + α)) Σ_{j=0}^{∞} p_{α,c}(j) (α + j) [−log(α + j) + ψ(α + j) + 1/(α + j)] > 0. [sent-109, score-0.392]

68 Also interestingly, it is easy to see that our lower bound behaves as ((m − 1)/(2N))(1 + o(1)) as N/m → ∞ for any fixed positive α (since in this case Σ_{j=0}^{k} p_{α,m,N}(j) → 0 for any fixed finite k). [sent-110, score-0.297]

69 Thus, comparing to the upper bound on the minimax error in (8), we have the somewhat surprising fact that: [sent-111, score-0.439]

70 (Figure 1 legend/axis residue: optimal α; approx opt; (upper)/(lower); lower bound.) [sent-113, score-0.376]

71 (Figure 1 legend/axis residue: lower bound; j=0 approx; (m−1)/2N approx; least-favorable Bayes; Braess−Sauer; optimized; horizontal axis N/m.) Figure 1: Illustration of bounds and asymptotic results. [sent-114, score-0.816]

72 Numerically- and theoretically-obtained optimal (least-favorable) α, as a function of c = N/m; note close agreement. [sent-117, score-0.034]

73 Numerical lower bounds and theoretical approximations; note the log-linear growth as c → 0. [sent-119, score-0.315]

74 The j = 0 approximation is obtained by retaining only the j = 0 term of the sum in the lower bound (1); this approximation turns out to be sufficiently accurate in the c → 0 limit, while the (m − 1)/2N approximation is tight as c → ∞. [sent-120, score-0.558]

75 Dashed curve: the ratio obtained by plugging the asymptotically optimal estimator due to Braess-Sauer (8) into our upper bound; solid-dotted curve: the numerically least-favorable Dirichlet estimator; black solid curve: the optimized estimator. [sent-123, score-0.742]

76 Note that curves for optimized and Braess-Sauer estimators are in constant proportion, since bounds are independent of m for c large enough. [sent-124, score-0.559]

77 Most importantly, note that optimized bounds are everywhere tight within a factor of 2, and asymptotically tight as c → ∞ or c → 0. [sent-125, score-0.699]

78 Any fixed-α Dirichlet prior is asymptotically least-favorable as N/m → ∞. [sent-127, score-0.145]

79 This generalizes Theorem 2 of (7) (and in fact, an alternate proof can be constructed on close examination of Krichevskiy’s proof of that result). [sent-128, score-0.091]

80 Finally, we examine the optimizers of the bounds in the data-sparse limit, c = N/m → 0. [sent-129, score-0.23]

81 The least-favorable Dirichlet parameter is given by H(c) as c → 0; the corresponding Bayes estimator also asymptotically minimizes the upper bound (and hence the bounds are asymptotically tight in this limit). [sent-131, score-1.309]

82 The maximal and average errors grow as −log(c)(1 + o(1)), c → 0. [sent-132, score-0.026]

83 It suggests a simple and interesting rule of thumb for estimating distributions in this data-sparse limit: use the add-α estimator with α = H(c). [sent-134, score-0.334]

84 When the data are very sparse (c sufficiently small) this estimator is optimal; see Fig. [sent-135, score-0.337]
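
Packaged as code, the rule of thumb reads as follows (the helper name and the fallback for c ≥ 1 are ours; the rule is only claimed to be a good choice in the sparse limit c → 0):

```python
import numpy as np

def sparse_add_constant_estimate(counts):
    """Add-alpha estimate with the sparse-data rule of thumb alpha = H(c) = -c*log(c),
    where c = N/m.  (The fallback alpha = 1 outside 0 < c < 1 is our own choice.)"""
    counts = np.asarray(counts, dtype=float)
    N, m = counts.sum(), counts.size
    c = N / m
    alpha = -c * np.log(c) if 0.0 < c < 1.0 else 1.0
    return (counts + alpha) / (N + m * alpha), alpha

# Example: 20 samples spread over 1000 bins (c = 0.02).
rng = np.random.default_rng(0)
counts = np.bincount(rng.integers(0, 1000, size=20), minlength=1000)
p_hat, alpha = sparse_add_constant_estimate(counts)
print(alpha, p_hat.sum(), p_hat.min())
```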

85 Interestingly, in the heavily-sampled limit N >> m, the minimizing estimator provided by (8) again turns out to be a perturbed add-constant estimator. [sent-139, score-0.486]

86 We note an interesting connection to the results of (9), who find that 1/m scaling of the add-constant parameter α is empirically optimal for an entropy estimation application with large m. [sent-141, score-0.262]

87 This 1/m scaling bears some resemblance to the optimal H(c) scaling that we find here, at least on a logarithmic scale (Fig. [sent-142, score-0.092]

88 1a); however, it is easy to see that the extra − log(c) term included here is useful. [sent-143, score-0.025]

89 As argued in (3), it is a good idea, in the data-sparse limit N << m, to assign substantial probability mass to bins which have not seen any data samples. [sent-144, score-0.167]

90 Since the total probability assigned to these bins by any add-α estimator scales in this limit as P (unseen) = mα/(N + mα), it is clear that the choice α ∼ 1/m decays too quickly. [sent-145, score-0.478]
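
A worked comparison of this unseen-mass expression for the two scalings of α discussed here (the numbers are purely illustrative, and the formula assumes essentially all m bins are unseen when N ≪ m):

```python
import numpy as np

def p_unseen(N, m, alpha):
    """Total mass an add-alpha estimator gives to the (roughly m) unseen bins
    when N << m: m*alpha / (N + m*alpha)."""
    return m * alpha / (N + m * alpha)

N, m = 100, 100_000                      # c = N/m = 0.001
c = N / m
for label, alpha in [("alpha = 1/m ", 1.0 / m), ("alpha = H(c)", -c * np.log(c))]:
    print(label, p_unseen(N, m, alpha))  # ~0.01 vs ~0.87: 1/m leaves too little mass unseen
```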

91 Thus it would be quite worthwhile to explore upper bounds which are tight in this N ≈ m range. [sent-148, score-0.491]

92 Watson, Approximation theory and numerical methods (Wiley, Boston, 1980). [sent-200, score-0.048]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gj', 0.489), ('estimator', 0.311), ('estimators', 0.275), ('bounds', 0.23), ('pi', 0.21), ('dirichlet', 0.207), ('bound', 0.187), ('edir', 0.161), ('kl', 0.148), ('asymptotically', 0.145), ('tight', 0.135), ('upper', 0.126), ('asymptotic', 0.12), ('dt', 0.115), ('entropy', 0.114), ('braess', 0.107), ('limit', 0.104), ('ni', 0.101), ('convex', 0.093), ('dir', 0.085), ('minimax', 0.085), ('lower', 0.085), ('eqf', 0.081), ('log', 0.077), ('ep', 0.075), ('numerically', 0.072), ('perturbed', 0.071), ('binomial', 0.071), ('sauer', 0.07), ('approx', 0.07), ('uq', 0.07), ('liam', 0.07), ('loss', 0.063), ('bins', 0.063), ('paninski', 0.06), ('estimation', 0.055), ('krichevsky', 0.054), ('nondifferentiable', 0.054), ('optimized', 0.054), ('descent', 0.05), ('nk', 0.049), ('minimization', 0.047), ('laplace', 0.047), ('proposition', 0.047), ('qf', 0.047), ('interestingly', 0.046), ('regime', 0.043), ('generalizes', 0.041), ('error', 0.041), ('approximation', 0.04), ('minima', 0.039), ('mackay', 0.037), ('averages', 0.035), ('optimal', 0.034), ('turn', 0.034), ('discrete', 0.032), ('outer', 0.032), ('favorable', 0.032), ('sum', 0.031), ('parameter', 0.03), ('uniformly', 0.03), ('indexed', 0.03), ('scaling', 0.029), ('transactions', 0.029), ('samples', 0.029), ('minimize', 0.028), ('maxima', 0.028), ('lim', 0.028), ('developed', 0.027), ('optimum', 0.027), ('poisson', 0.027), ('smooth', 0.027), ('bounded', 0.027), ('conjugate', 0.026), ('converges', 0.026), ('develop', 0.026), ('sparse', 0.026), ('average', 0.026), ('similarities', 0.026), ('easy', 0.025), ('proof', 0.025), ('theory', 0.025), ('tractable', 0.025), ('uniform', 0.024), ('linearly', 0.024), ('vq', 0.023), ('thumb', 0.023), ('grassberger', 0.023), ('multinomials', 0.023), ('compactness', 0.023), ('answered', 0.023), ('orlitsky', 0.023), ('tro', 0.023), ('forster', 0.023), ('numerical', 0.023), ('importantly', 0.023), ('minimized', 0.023), ('easier', 0.023), ('replace', 0.022), ('conditional', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

Author: Liam Paninski

Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.

2 0.30460766 77 nips-2004-Hierarchical Clustering of a Mixture Model

Author: Jacob Goldberger, Sam T. Roweis

Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1

3 0.1423886 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

Author: Elizaveta Levina, Peter J. Bickel

Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1

4 0.13175412 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei

Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

5 0.13095026 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

Author: Nathan Srebro, Noga Alon, Tommi S. Jaakkola

Abstract: We prove generalization error bounds for predicting entries in a partially observed matrix by fitting the observed entries with a low-rank matrix. In justifying the analysis approach we take to obtain the bounds, we present an example of a class of functions of finite pseudodimension such that the sums of functions from this class have unbounded pseudodimension. 1

6 0.11976501 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

7 0.1000155 138 nips-2004-Online Bounds for Bayesian Algorithms

8 0.084735192 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

9 0.081541903 137 nips-2004-On the Adaptive Properties of Decision Trees

10 0.080852702 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

11 0.077852838 175 nips-2004-Stable adaptive control with online learning

12 0.077223383 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

13 0.077036463 163 nips-2004-Semi-parametric Exponential Family PCA

14 0.069211811 164 nips-2004-Semi-supervised Learning by Entropy Minimization

15 0.068143368 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

16 0.066900022 116 nips-2004-Message Errors in Belief Propagation

17 0.066622242 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

18 0.066450633 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

19 0.066327028 82 nips-2004-Incremental Algorithms for Hierarchical Classification

20 0.06101013 28 nips-2004-Bayesian inference in spiking neurons


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.205), (1, 0.042), (2, 0.047), (3, 0.061), (4, -0.009), (5, -0.033), (6, -0.129), (7, 0.195), (8, 0.098), (9, 0.082), (10, 0.023), (11, 0.031), (12, 0.221), (13, -0.125), (14, -0.093), (15, -0.237), (16, 0.007), (17, 0.085), (18, -0.158), (19, -0.01), (20, 0.138), (21, -0.146), (22, -0.117), (23, 0.057), (24, 0.003), (25, -0.1), (26, 0.064), (27, -0.09), (28, -0.011), (29, 0.013), (30, -0.13), (31, -0.311), (32, 0.069), (33, 0.097), (34, -0.164), (35, -0.055), (36, 0.082), (37, 0.035), (38, -0.021), (39, -0.034), (40, 0.012), (41, -0.071), (42, -0.008), (43, -0.037), (44, -0.024), (45, 0.06), (46, -0.027), (47, -0.055), (48, -0.03), (49, -0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96740204 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

Author: Liam Paninski

Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.

2 0.70885795 77 nips-2004-Hierarchical Clustering of a Mixture Model

Author: Jacob Goldberger, Sam T. Roweis

Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1

3 0.56833059 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

Author: Yee W. Teh, Michael I. Jordan, Matthew J. Beal, David M. Blei

Abstract: We propose the hierarchical Dirichlet process (HDP), a nonparametric Bayesian model for clustering problems involving multiple groups of data. Each group of data is modeled with a mixture, with the number of components being open-ended and inferred automatically by the model. Further, components can be shared across groups, allowing dependencies across groups to be modeled effectively as well as conferring generalization to new groups. Such grouped clustering problems occur often in practice, e.g. in the problem of topic discovery in document corpora. We report experimental results on three text corpora showing the effective and superior performance of the HDP over previous models.

4 0.43261012 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension

Author: Elizaveta Levina, Peter J. Bickel

Abstract: We propose a new method for estimating intrinsic dimension of a dataset derived by applying the principle of maximum likelihood to the distances between close neighbors. We derive the estimator by a Poisson process approximation, assess its bias and variance theoretically and by simulations, and apply it to a number of simulated and real datasets. We also show it has the best overall performance compared with two other intrinsic dimension estimators. 1

5 0.41878217 138 nips-2004-Online Bounds for Bayesian Algorithms

Author: Sham M. Kakade, Andrew Y. Ng

Abstract: We present a competitive analysis of Bayesian learning algorithms in the online learning setting and show that many simple Bayesian algorithms (such as Gaussian linear regression and Bayesian logistic regression) perform favorably when compared, in retrospect, to the single best model in the model class. The analysis does not assume that the Bayesian algorithms’ modeling assumptions are “correct,” and our bounds hold even if the data is adversarially chosen. For Gaussian linear regression (using logloss), our error bounds are comparable to the best bounds in the online learning literature, and we also provide a lower bound showing that Gaussian linear regression is optimal in a certain worst case sense. We also give bounds for some widely used maximum a posteriori (MAP) estimation algorithms, including regularized logistic regression. 1

6 0.41372612 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

7 0.40177646 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

8 0.3396219 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

9 0.33292007 126 nips-2004-Nearly Tight Bounds for the Continuum-Armed Bandit Problem

10 0.30816022 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees

11 0.2963255 163 nips-2004-Semi-parametric Exponential Family PCA

12 0.28820321 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification

13 0.28188688 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

14 0.28114787 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

15 0.27783659 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

16 0.26712757 116 nips-2004-Message Errors in Belief Propagation

17 0.2613672 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference

18 0.25498042 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

19 0.25391746 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

20 0.24784629 57 nips-2004-Economic Properties of Social Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.177), (15, 0.098), (26, 0.08), (30, 0.148), (31, 0.038), (33, 0.22), (35, 0.036), (39, 0.038), (50, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91641653 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

Author: Liam Paninski

Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximation-theoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.

2 0.88094783 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

Author: Jochen Triesch

Abstract: This paper explores the computational consequences of simultaneous intrinsic and synaptic plasticity in individual model neurons. It proposes a new intrinsic plasticity mechanism for a continuous activation model neuron based on low order moments of the neuron’s firing rate distribution. The goal of the intrinsic plasticity mechanism is to enforce a sparse distribution of the neuron’s activity level. In conjunction with Hebbian learning at the neuron’s synapses, the neuron is shown to discover sparse directions in the input. 1

3 0.87174982 163 nips-2004-Semi-parametric Exponential Family PCA

Author: Sajama Sajama, Alon Orlitsky

Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1

4 0.86578685 102 nips-2004-Learning first-order Markov models for control

Author: Pieter Abbeel, Andrew Y. Ng

Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1

5 0.86575025 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

6 0.86502618 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

7 0.86444736 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

8 0.8632499 69 nips-2004-Fast Rates to Bayes for Kernel Machines

9 0.86318636 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

10 0.86144733 116 nips-2004-Message Errors in Belief Propagation

11 0.8598659 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech

12 0.85983741 70 nips-2004-Following Curved Regularized Optimization Solution Paths

13 0.85893667 124 nips-2004-Multiple Alignment of Continuous Time Series

14 0.85731387 64 nips-2004-Experts in a Markov Decision Process

15 0.8560921 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices

16 0.8531757 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

17 0.85181218 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

18 0.85138702 28 nips-2004-Bayesian inference in spiking neurons

19 0.8513245 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

20 0.85107833 36 nips-2004-Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification