nips nips2000 nips2000-94 knowledge-graph by maker-knowledge-mining

94 nips-2000-On Reversing Jensen's Inequality


Source: pdf

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1
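For orientation, the lower bound the abstract refers to is the standard Jensen/EM bound on a latent log-likelihood; a generic variational upper bound on the logarithm is also shown purely to illustrate the reversed direction. This is a minimal sketch of the type of bounds involved, not the paper's reverse-Jensen inequality, whose exact form is derived in the paper itself.

```latex
% Jensen/EM lower bound on a latent log-likelihood:
% for any q(m) > 0 with \sum_m q(m) = 1,
\[
\log \sum_m p(X, m \mid \Theta)
  \;\ge\; \sum_m q(m) \log \frac{p(X, m \mid \Theta)}{q(m)},
\]
% with equality when q(m) = p(m \mid X, \Theta) (the E-step choice).
%
% A generic variational upper bound on the logarithm, from the concavity of log
% (shown only to illustrate the reversed direction; it is NOT the paper's
% reverse-Jensen bound):
\[
\log z \;\le\; \frac{z}{\lambda} + \log \lambda - 1,
  \qquad z, \lambda > 0, \text{ with equality at } \lambda = z.
\]
```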

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. [sent-5, score-0.259]

2 Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. [sent-6, score-0.24]

3 Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. [sent-7, score-0.769]

4 This simplification then permits operations like integration and maximization. [sent-8, score-0.194]

5 Quite often (i.e. in discriminative learning) upper bounds are needed as well. [sent-11, score-0.907]

6 We derive and prove an efficient analytic inequality that provides such variational upper bounds. [sent-12, score-0.426]

7 This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. [sent-13, score-0.811]

8 We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. [sent-14, score-1.195]

9 1 Introduction: Statistical model estimation and inference often require the maximization, evaluation, and integration of complicated mathematical expressions. [sent-16, score-0.26]

10 One approach for simplifying the computations is to find and manipulate variational upper and lower bounds instead of the expressions themselves. [sent-17, score-0.883]

11 A prominent tool for computing such bounds is Jensen's inequality which subsumes many information-theoretic bounds (cf. [sent-18, score-1.197]

12 In maximum likelihood (ML) estimation under incomplete data, Jensen is used to derive an iterative EM algorithm [2]. [sent-20, score-0.332]

13 For graphical models, intractable inference and estimation is performed via variational bounds [7]. [sent-21, score-0.766]

14 Bayesian integration also uses Jensen and EM-like bounds to compute integrals that are otherwise intractable [9]. [sent-22, score-0.694]

15 Recently, however, the learning community has seen the proliferation of conditional or discriminative criteria. [sent-23, score-0.5]

16 These include support vector machines, maximum entropy discrimination distributions [4], and discriminative HMMs [3]. [sent-24, score-0.526]

17 These criteria allocate resources with the given task (classification or regression) in mind, yielding improved performance. [sent-25, score-0.268]

18 In contrast, under canonical ML each density is trained separately to describe observations rather than optimize classification or regression. [sent-26, score-0.108]

19 Please download the long version with tighter bounds, detailed proofs, more results, important extensions and sample matlab code from: http://www. [sent-29, score-0.121]

20 edu/~jebara/bounds. Computationally, what differentiates these criteria from ML is that they not only require Jensen-type lower bounds but may also utilize the corresponding upper bounds. [sent-32, score-0.856]

21 The Jensen bounds only partially simplify their expressions and some intractabilities remain. [sent-33, score-0.519]

22 For instance, latent distributions need to be bounded above and below in a discriminative setting [4] [3]. [sent-34, score-0.513]

23 Metaphorically, discriminative learning requires lower bounds to cluster positive examples and upper bounds to repel away from negative ones. [sent-35, score-1.502]

24 We derive these complementary upper bounds² which are useful for discriminative classification and regression. [sent-36, score-1.038]

25 These bounds are structurally similar to Jensen bounds, allowing easy migration of ML techniques to discriminative settings. [sent-37, score-0.831]

26 This paper is organized as follows: We introduce the probabilistic models we will use: mixtures of the exponential family. [sent-38, score-0.279]

27 We then describe some estimation criteria on these models which are intractable. [sent-39, score-0.317]

28 One simplification is to lower bound via Jensen's inequality or EM (a numerical sketch of this bound follows this sentence list). [sent-40, score-0.37]

29 We show implementation and results of the bounds in applications (i. [sent-42, score-0.484]

30 Finally, a strict algebraic proof is given to validate the reverse-bound. [sent-45, score-0.144]

31 2 The Exponential Family: We restrict the reverse-Jensen bounds to mixtures of the exponential family (e-family). [sent-46, score-0.76]

32 In practice this class of densities covers a very large portion of contemporary statistical models. [sent-47, score-0.192]

33 Mixtures of the e-family include Gaussian Mixture Models, Multinomials, Poisson, Hidden Markov Models, Sigmoidal Belief Networks, Discrete Bayesian Networks, etc. [sent-48, score-0.056]

34 Typically the data vector X is constrained to live in the gradient space of K, i. [sent-51, score-0.054]

35 The table above lists example A and K functions for Gaussian and multinomial distributions. [sent-59, score-0.13]

36 More generally, though, we will deal with mixtures of the e-family (where m represents the incomplete data). [sent-60, score-0.241]

37 These latent probability distributions need to get maximized, integrated, marginalized, conditioned, etc. [sent-63, score-0.181]

38 to solve various inference, prediction, and parameter estimation tasks. [sent-64, score-0.102]

39 3 Conditional and Discriminative Criteria: The combination of ML with EM and Jensen has indeed produced straightforward and monotonically convergent estimation procedures for mixtures of the e-family [2] [1] [7]. [sent-66, score-0.338]

40 However, ML criteria are non-discriminative modeling techniques for estimating generative models. [sent-67, score-0.177]

41 2 A weaker bound for Gaussian mixture regression appears in [6]. [sent-69, score-0.191]

42 3 Note: we use Θ to denote an aggregate model encompassing all individual Θ_m, ∀m. [sent-71, score-0.041]
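To connect the extracted sentences above to something concrete, the following is a minimal, hedged Python sketch: it is not the authors' sample Matlab code and it does not implement the reverse-Jensen bound itself. It only checks numerically the standard Jensen/EM lower bound on the log-likelihood of a latent-variable mixture (a Gaussian mixture, i.e. a mixture of e-family members), which is the kind of lower bound the paper complements with an analytic upper bound. All function names (gauss_pdf, jensen_lower_bound) are illustrative choices, not identifiers from the paper.

```python
# Minimal sketch (assumption: not the paper's code) of the Jensen/EM lower bound
#   log sum_m pi_m p(x|theta_m) >= sum_m q_m log( pi_m p(x|theta_m) / q_m )
# for any responsibilities q_m > 0 summing to 1.
import numpy as np

def gauss_pdf(x, mu, var):
    """One-dimensional Gaussian density N(x; mu, var), vectorized over components."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def log_likelihood(x, pi, mu, var):
    """Exact latent log-likelihood log sum_m pi_m N(x; mu_m, var_m)."""
    return np.log(np.sum(pi * gauss_pdf(x, mu, var)))

def jensen_lower_bound(x, pi, mu, var, q):
    """Jensen/EM lower bound for an arbitrary responsibility vector q."""
    joint = pi * gauss_pdf(x, mu, var)          # pi_m p(x | theta_m)
    return np.sum(q * (np.log(joint) - np.log(q)))

rng = np.random.default_rng(0)
pi  = np.array([0.3, 0.7])      # mixing weights
mu  = np.array([-1.0, 2.0])     # component means
var = np.array([0.5, 1.5])      # component variances
x   = 0.4                       # a single observation

exact = log_likelihood(x, pi, mu, var)

# Any valid q gives a lower bound; the E-step (posterior) choice makes it tight.
q_random = rng.dirichlet(np.ones(2))
joint = pi * gauss_pdf(x, mu, var)
q_posterior = joint / joint.sum()

print("exact log-likelihood       :", exact)
print("bound with random q        :", jensen_lower_bound(x, pi, mu, var, q_random))
print("bound with posterior q (EM):", jensen_lower_bound(x, pi, mu, var, q_posterior))
```

Running this shows that an arbitrary responsibility vector already yields a valid lower bound, while the posterior responsibilities make the bound equal to the exact value (up to floating point). The reverse-Jensen inequality of the paper plays the analogous role from above, which is what discriminative criteria need in order to "repel" from negative examples.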


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('jensen', 0.457), ('bounds', 0.445), ('discriminative', 0.332), ('ml', 0.223), ('criteria', 0.177), ('inequality', 0.16), ('mixtures', 0.16), ('upper', 0.13), ('latent', 0.127), ('contemporary', 0.109), ('estimation', 0.102), ('media', 0.091), ('intractable', 0.09), ('conditional', 0.081), ('multinomial', 0.081), ('simplification', 0.081), ('incomplete', 0.081), ('exponential', 0.081), ('variational', 0.08), ('family', 0.074), ('expressions', 0.074), ('bayesian', 0.073), ('integration', 0.072), ('bound', 0.066), ('em', 0.065), ('lab', 0.064), ('lower', 0.063), ('tool', 0.062), ('derive', 0.056), ('include', 0.056), ('allocate', 0.054), ('proliferation', 0.054), ('multinomials', 0.054), ('strict', 0.054), ('repel', 0.054), ('manipulate', 0.054), ('live', 0.054), ('manipulations', 0.054), ('structurally', 0.054), ('tony', 0.054), ('distributions', 0.054), ('regression', 0.049), ('reversing', 0.049), ('pentland', 0.049), ('marginalized', 0.049), ('lists', 0.049), ('validate', 0.049), ('covers', 0.049), ('maximum', 0.049), ('inference', 0.049), ('spans', 0.046), ('please', 0.046), ('jebara', 0.046), ('matlab', 0.046), ('ek', 0.046), ('subsumes', 0.046), ('otherwise', 0.044), ('likelihood', 0.044), ('integrals', 0.043), ('therein', 0.043), ('linearity', 0.043), ('aggregate', 0.041), ('algebraic', 0.041), ('permits', 0.041), ('canonical', 0.041), ('tighter', 0.041), ('el', 0.041), ('utilize', 0.041), ('convexity', 0.041), ('complementary', 0.041), ('applications', 0.039), ('mind', 0.039), ('monotonically', 0.039), ('prominent', 0.039), ('weaker', 0.039), ('reverse', 0.039), ('models', 0.038), ('mathematical', 0.037), ('mixture', 0.037), ('exploits', 0.037), ('resources', 0.037), ('simplifying', 0.037), ('convergent', 0.037), ('hmms', 0.037), ('maximized', 0.035), ('alex', 0.035), ('sigmoidal', 0.035), ('proofs', 0.035), ('intrinsic', 0.035), ('discrimination', 0.035), ('classification', 0.034), ('portion', 0.034), ('extensions', 0.034), ('cover', 0.034), ('poisson', 0.034), ('prediction', 0.034), ('separately', 0.033), ('cluster', 0.033), ('suffer', 0.033), ('community', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 94 nips-2000-On Reversing Jensen's Inequality

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1

2 0.16755049 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

3 0.16087064 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

4 0.13124689 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

5 0.12096026 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of l1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes: Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {-1, 1} is a label. Let G be a set of functions from S into R. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X1, Y1), ..., (Xn, Yn) that consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into R. Let {Xi} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let Pn be the empirical measure based on the sample (X1, ..., Xn).

6 0.11363588 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

7 0.09670192 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

8 0.089939088 13 nips-2000-A Tighter Bound for Graphical Models

9 0.089807101 54 nips-2000-Feature Selection for SVMs

10 0.086309493 37 nips-2000-Convergence of Large Margin Separable Linear Classification

11 0.085664697 120 nips-2000-Sparse Greedy Gaussian Process Regression

12 0.081955902 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

13 0.077243946 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

14 0.076734945 58 nips-2000-From Margin to Sparsity

15 0.070155039 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

16 0.06429369 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

17 0.060270302 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

18 0.060225669 27 nips-2000-Automatic Choice of Dimensionality for PCA

19 0.059947595 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

20 0.057092819 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, 0.107), (2, 0.058), (3, -0.024), (4, 0.148), (5, -0.164), (6, 0.004), (7, -0.006), (8, 0.021), (9, -0.053), (10, -0.156), (11, 0.009), (12, 0.096), (13, -0.017), (14, 0.007), (15, -0.01), (16, 0.172), (17, 0.031), (18, -0.166), (19, 0.146), (20, -0.025), (21, 0.16), (22, 0.211), (23, 0.047), (24, -0.013), (25, -0.091), (26, 0.143), (27, -0.125), (28, 0.062), (29, 0.107), (30, -0.036), (31, 0.1), (32, 0.227), (33, -0.048), (34, 0.038), (35, -0.013), (36, -0.036), (37, 0.011), (38, 0.145), (39, 0.047), (40, -0.039), (41, -0.148), (42, 0.161), (43, -0.022), (44, 0.032), (45, 0.086), (46, 0.128), (47, -0.114), (48, -0.234), (49, -0.151)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9864397 94 nips-2000-On Reversing Jensen's Inequality

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1

2 0.5310697 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

3 0.46178037 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

Author: Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano

Abstract: In this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution which was introduced in the recent papers of Bartlett and Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin-type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of l1-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. 1 Introduction and margin type inequalities for general functional classes: Let (X, Y) be a random couple, where X is an instance in a space S and Y ∈ {-1, 1} is a label. Let G be a set of functions from S into R. For g ∈ G, sign(g(X)) will be used as a predictor (a classifier) of the unknown label Y. If the distribution of (X, Y) is unknown, then the choice of the predictor is based on the training data (X1, Y1), ..., (Xn, Yn) that consists of n i.i.d. copies of (X, Y). The goal of learning is to find a predictor g ∈ G (based on the training data) whose generalization (classification) error P{Y g(X) ≤ 0} is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to [5]. Let (S, A, P) be a probability space and let F be a class of measurable functions from (S, A) into R. Let {Xi} be a sequence of i.i.d. random variables taking values in (S, A) with common distribution P. Let Pn be the empirical measure based on the sample (X1, ..., Xn).

4 0.40326709 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

5 0.39324623 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

Author: Sepp Hochreiter, Michael Mozer

Abstract: The goal of many unsupervised learning procedures is to bring two probability distributions into alignment. Generative models such as Gaussian mixtures and Boltzmann machines can be cast in this light, as can recoding models such as ICA and projection pursuit. We propose a novel sample-based error measure for these classes of models, which applies even in situations where maximum likelihood (ML) and probability density estimation-based formulations cannot be applied, e.g., models that are nonlinear or have intractable posteriors. Furthermore, our sample-based error measure avoids the difficulties of approximating a density function. We prove that with an unconstrained model, (1) our approach converges on the correct solution as the number of samples goes to infinity, and (2) the expected solution of our approach in the generative framework is the ML solution. Finally, we evaluate our approach via simulations of linear and nonlinear models on mixture of Gaussians and ICA problems. The experiments show the broad applicability and generality of our approach. 1

6 0.3662746 74 nips-2000-Kernel Expansions with Unlabeled Examples

7 0.32560879 120 nips-2000-Sparse Greedy Gaussian Process Regression

8 0.31471863 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks

9 0.3114962 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

10 0.30103663 37 nips-2000-Convergence of Large Margin Separable Linear Classification

11 0.28811783 54 nips-2000-Feature Selection for SVMs

12 0.23560999 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

13 0.2323124 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

14 0.22312212 13 nips-2000-A Tighter Bound for Graphical Models

15 0.21967924 27 nips-2000-Automatic Choice of Dimensionality for PCA

16 0.20345503 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

17 0.20330516 30 nips-2000-Bayesian Video Shot Segmentation

18 0.19931403 60 nips-2000-Gaussianization

19 0.19377448 58 nips-2000-From Margin to Sparsity

20 0.19057018 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.028), (17, 0.128), (22, 0.24), (26, 0.017), (32, 0.026), (33, 0.071), (54, 0.013), (55, 0.012), (62, 0.063), (65, 0.025), (67, 0.076), (76, 0.043), (79, 0.046), (90, 0.019), (91, 0.019), (97, 0.076)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86133796 94 nips-2000-On Reversing Jensen's Inequality

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1

2 0.58740985 40 nips-2000-Dendritic Compartmentalization Could Underlie Competition and Attentional Biasing of Simultaneous Visual Stimuli

Author: Kevin A. Archie, Bartlett W. Mel

Abstract: Neurons in area V4 have relatively large receptive fields (RFs), so multiple visual features are simultaneously

3 0.57960594 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

4 0.57497954 120 nips-2000-Sparse Greedy Gaussian Process Regression

Author: Alex J. Smola, Peter L. Bartlett

Abstract: We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. In particular, computational requirements are O(n^2 m), storage is O(nm), the cost for prediction is O(n) and the cost to compute confidence bounds is O(nm), where n ≪ m. We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems. 1

5 0.56506127 37 nips-2000-Convergence of Large Margin Separable Linear Classification

Author: Tong Zhang

Abstract: Large margin linear classification methods have been successfully applied to many applications. For a linearly separable problem, it is known that under appropriate assumptions, the expected misclassification error of the computed

6 0.56325042 60 nips-2000-Gaussianization

7 0.56190521 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.55701572 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

9 0.55693477 122 nips-2000-Sparse Representation for Gaussian Process Models

10 0.55617267 133 nips-2000-The Kernel Gibbs Sampler

11 0.55186772 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

12 0.55012435 22 nips-2000-Algorithms for Non-negative Matrix Factorization

13 0.54830033 21 nips-2000-Algorithmic Stability and Generalization Performance

14 0.54752332 79 nips-2000-Learning Segmentation by Random Walks

15 0.54709035 92 nips-2000-Occam's Razor

16 0.54672903 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

17 0.54392827 127 nips-2000-Structure Learning in Human Causal Induction

18 0.54314166 4 nips-2000-A Linear Programming Approach to Novelty Detection

19 0.54158092 111 nips-2000-Regularized Winnow Methods

20 0.54102767 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks