nips nips2001 nips2001-129 knowledge-graph by maker-knowledge-mining

129 nips-2001-Multiplicative Updates for Classification by Mixture Models


Source: pdf

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Lee Department of Computer and Information Science, Department of Electrical Engineering, University of Pennsylvania, Philadelphia, PA 19104. Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. [sent-3, score-0.616]

2 Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. [sent-4, score-0.243]

3 Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. [sent-6, score-0.445]

4 The learning rate of the algorithm is controlled by the sparseness of the training data. [sent-8, score-0.219]

5 We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. [sent-9, score-0.63]

6 Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. [sent-11, score-0.908]

7 Mixture models are typically used to parameterize class-conditional distributions, $P(\mathbf{x}|y)$, and then to compute posterior probabilities, $P(y|\mathbf{x})$, from Bayes rule. [sent-14, score-0.228]
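
Written out explicitly (a standard statement in notation chosen for this summary, with $y$ a class label, $\mathbf{x}$ a feature vector, and $P(y)$ the class prior), Bayes rule gives

$$P(y\,|\,\mathbf{x}) \;=\; \frac{P(\mathbf{x}\,|\,y)\,P(y)}{\sum_{y'} P(\mathbf{x}\,|\,y')\,P(y')}.$$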

8 Parameter estimation in these models is handled by an Expectation-Maximization (EM) algorithm[3], a learning procedure that monotonically increases the joint log likelihood, $\sum_n \log P(\mathbf{x}_n, y_n)$, summed over training examples (indexed by $n$). [sent-15, score-0.426]

9 A weakness of the above approach is that the model parameters are optimized by maximum likelihood estimation, as opposed to a discriminative criterion more closely related to classification error[14]. [sent-18, score-0.312]

10 In this paper, we derive multiplicative update rules for the parameters of mixture models that directly maximize the discriminative objective function, $\sum_n \log P(y_n|\mathbf{x}_n)$. [sent-19, score-0.962]

11 This objective function measures the conditional log likelihood that the training examples are correctly classified. [sent-20, score-0.472]

12 Our update rules retain the main virtues of the EM algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative cost function. [sent-21, score-0.498]

13 The approach in this paper is limited to the classification of nonnegative data, since from the constraint of nonnegativity emerges an especially simple learning algorithm. [sent-25, score-0.416]

14 2 Mixture models as generative models. Mixture models are typically used as generative models to parameterize probability distributions over feature vectors $\mathbf{x}$. [sent-30, score-0.913]

15 Different mixture models are used to model different classes of data. [sent-31, score-0.395]

16 $P(\mathbf{x}|y) = \sum_c W_{yc}\, f_c(\mathbf{x})$, where the rows of the nonnegative weight matrix are constrained to sum to unity, $\sum_c W_{yc} = 1$, and the basis functions are properly normalized distributions, such that $\int \! d\mathbf{x}\, f_c(\mathbf{x}) = 1$ for all $c$. [sent-34, score-0.528]

17 The model can be interpreted as the latent variable model $P(\mathbf{x}|y) = \sum_c P(c|y)\, P(\mathbf{x}|c)$ (2), where the discrete latent variable $c$ indicates which mixture component is used to generate the observed variable $\mathbf{x}$. [sent-35, score-0.292]

18 The basis functions, usually chosen from the exponential family, define “bumps” of high probability in the feature space. [sent-37, score-0.326]

19 For sparse nonnegative data, a more natural choice is the exponential distribution: $f_c(\mathbf{x}) = \prod_i \lambda_{ci}\, e^{-\lambda_{ci} x_i}$ (4). [sent-40, score-0.496]
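
For illustration only, here is a minimal numpy sketch of evaluating exponential-distribution basis functions on nonnegative data, assuming the independent per-component exponential form given for eq. (4) above; the function and variable names are our own and do not come from the paper.

import numpy as np

def exponential_basis(X, rates):
    """Evaluate exponential basis functions f_c(x) = prod_i rate[c, i] * exp(-rate[c, i] * x[i]).

    X     : (num_examples, dim) array of nonnegative feature vectors
    rates : (num_components, dim) array of positive rate parameters
    returns a (num_components, num_examples) array of densities
    """
    X = np.asarray(X, dtype=float)
    rates = np.asarray(rates, dtype=float)
    log_f = np.log(rates).sum(axis=1, keepdims=True) - rates @ X.T
    return np.exp(log_f)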

20 Here, the value of $i$ indexes the elements of $\mathbf{x}$; the parameters of these basis functions must be estimated from data. [sent-42, score-0.244]

21 Generative models can be viewed as a prototype method for classification, with the parameters of each mixture component defining a particular basin of attraction in the feature space. [sent-44, score-0.589]

22 An Expectation-Maximization (EM) algorithm can be used to estimate the parameters of mixture models. [sent-48, score-0.368]

23 If basis functions are not shared across different classes, then the parameter estimation for can be done independently for each class label . [sent-51, score-0.246]

24 Moreover, for many types of basis functions, the EM updates have a simple closed form and are guaranteed to improve the joint log likelihood at each iteration. [sent-53, score-0.589]

25 These properties account for the widespread use of mixture models as generative models. [sent-54, score-0.482]

26 3 Mixture models as discriminative models. Mixture models can also be viewed as purely discriminative models. [sent-55, score-0.579]

27 $P(y|\mathbf{x}) = \dfrac{\sum_c W_{yc}\, f_c(\mathbf{x})}{\sum_{y'} \sum_c W_{y'c}\, f_c(\mathbf{x})}$ (7). The right hand side of this equation defines a valid posterior distribution provided that the mixture weights and basis functions are nonnegative. [sent-59, score-0.755]

28 Note that for this interpretation, the mixture weights and basis functions do not need to satisfy the more stringent normalization constraints of generative models. [sent-60, score-0.632]
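
As a reading aid (not code from the paper), here is a minimal numpy sketch of the eq. (7)-style posterior and of the conditional log likelihood objective used later for discriminative training; the array names, shapes, and the use of unnormalized nonnegative weights are our own assumptions.

import numpy as np

def class_posteriors(W, F):
    """Eq. (7)-style posterior: weighted sums of basis function values, normalized across classes.

    W : (num_classes, num_components) nonnegative mixture weights
    F : (num_components, num_examples) nonnegative basis function values f_c(x_n)
    returns a (num_classes, num_examples) array of posterior probabilities
    """
    scores = W @ F                                    # unnormalized class scores
    return scores / scores.sum(axis=0, keepdims=True)

def conditional_log_likelihood(W, F, labels):
    """Discriminative objective: sum over examples of the log posterior of the true label."""
    P = class_posteriors(W, F)
    n = np.arange(P.shape[1])
    return np.log(P[labels, n]).sum()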

29 We will deliberately exploit this freedom, an idea that distinguishes our approach from previous work on discriminatively trained mixture models[6] and hidden Markov models[5, 12]. [sent-61, score-0.567]

30 In particular, the unnormalized basis functions we use are able to parameterize “saddles” and “valleys” in the feature space, as well as the “bumps” of normalized basis functions. [sent-62, score-0.544]

31 Eq. (7) must be further specified by parameterizing the basis functions as a function of $\mathbf{x}$. [sent-66, score-0.264]

32 We study basis functions of the form $f_c(\mathbf{x}) = e^{\mathbf{v}_c \cdot \boldsymbol{\phi}(\mathbf{x})}$ (8), where $\mathbf{v}_c$ denotes a real-valued vector and $\boldsymbol{\phi}(\mathbf{x})$ denotes a nonnegative and possibly “expanded” representation[14] of the original feature vector. [sent-67, score-0.609]

33 By defining the “quadratically expanded” feature vector $\boldsymbol{\phi}(\mathbf{x}) = [\,x_1, \ldots, x_d,\; x_1 x_1, x_1 x_2, \ldots, x_d x_d\,]$ (9), we can equate the basis functions in the equations above. [sent-72, score-0.324]
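
A small sketch of one possible quadratic expansion of a nonnegative feature vector; whether a constant term is included and the exact ordering of the pairwise products are assumptions made here for illustration.

import numpy as np

def quadratic_expansion(x):
    """Return a nonnegative 'quadratically expanded' representation of x:
    a constant term, the original features, and all pairwise products x_i * x_j (i <= j)."""
    x = np.asarray(x, dtype=float)
    pairwise = np.outer(x, x)[np.triu_indices(len(x))]
    return np.concatenate(([1.0], x, pairwise))

# For a 3-dimensional input this yields 1 + 3 + 6 = 10 expanded features.
print(quadratic_expansion([0.5, 0.0, 2.0]).shape)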

34 Such generative models provide a cheap way to initialize discriminative models for further training. [sent-76, score-0.428]

35 4 Learning algorithm. Our learning algorithm directly optimizes the performance of the models in eq. (7). [sent-77, score-0.218]

36 The objective function we use for discriminative training is the conditional log likelihood, $\sum_n \log P(y_n|\mathbf{x}_n)$ (10). [sent-79, score-0.462]

37 It is the subtracted term, $\sum_n \log \sum_{y'} \sum_c W_{y'c}\, f_c(\mathbf{x}_n)$, which distinguishes the conditional log likelihood optimized by discriminative training from the joint log likelihood optimized by EM. [sent-87, score-0.854]

38 Our learning algorithm works by alternately updating the mixture weights and the basis function parameters. [sent-88, score-0.532]

39 Here we simply present the update rules for these parameters; a derivation and proof of convergence are given in the appendix. [sent-89, score-0.24]

40 It is easiest to write the basis function updates in terms of the nonnegative parameters. [sent-90, score-0.666]

41 The updates then take the simple multiplicative form of eq. (13), in which each parameter is multiplied by a ratio of opposing gradient terms. [sent-91, score-0.458]

42 (This is a consequence of the nonnegativity constraint on the feature vectors: $\phi_i(\mathbf{x}_n) \ge 0$ for all examples $n$ and feature components $i$.) [sent-93, score-0.411]

Thus, the nonnegativity constraints on the mixture weights and basis functions are enforced by these multiplicative updates. [sent-94, score-0.883]

44 The updates have a simple intuition[9] based on balancing opposing terms in the gradient of the conditional log likelihood. [sent-95, score-0.451]

45 In particular, note that the fixed points of this update rule occur at stationary points of the conditional log likelihood—that is, where the gradient vanishes, or equivalently, where its opposing positive and negative terms are equal. [sent-96, score-0.266]

46 The learning rate is controlled by the ratios of these gradients and—additionally, for the basis function updates—by the exponent, which measures the sparseness of the training data. [sent-97, score-0.372]

47 Thus, sparse feature vectors lead to faster learning, a crucial point to which we will return shortly. [sent-99, score-0.312]
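
To convey the flavor of such updates, here is a generic schematic of a gradient-balancing multiplicative step: the gradient of the objective is split into a positive and a negative part, and each nonnegative parameter is multiplied by their ratio, optionally raised to an exponent. This is a hedged sketch of the general idea, not the paper's exact eqs. (13–14).

import numpy as np

def multiplicative_step(theta, grad_pos, grad_neg, exponent=1.0, eps=1e-12):
    """One schematic multiplicative update for nonnegative parameters theta.

    grad_pos, grad_neg : elementwise positive and negative parts of the gradient
    exponent           : damping/acceleration exponent (e.g., tied to data sparseness)
    Fixed points occur where grad_pos == grad_neg, and nonnegativity is preserved
    automatically because the multiplier is a ratio of nonnegative quantities.
    """
    ratio = (grad_pos + eps) / (grad_neg + eps)
    return theta * ratio ** exponent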

48 % It is worth comparing these multiplicative updates to others in the literature. [sent-102, score-0.396]

49 Jebara and Pentland[6] derived similar updates for mixture weights, but without emphasizing the special form of eq. [sent-103, score-0.481]

50 Others have investigated multiplicative updates by the method of exponentiated gradients (EG)[7]. [sent-105, score-0.515]

51 Our updates do not have the same form as EG updates: in particular, note that the gradients in eqs. [sent-106, score-0.251]

52 If we use one basis function per class and an identity matrix for the mixture weights, then the updates reduce to the method of generalized iterative scaling[2] for logistic or multinomial regression (also known as maximum entropy modeling). [sent-108, score-0.772]
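
To make the reduction concrete, in the notation adopted above (our own choice of symbols): with one basis function per class, $f_y(\mathbf{x}) = e^{\mathbf{v}_y \cdot \boldsymbol{\phi}(\mathbf{x})}$, and identity mixture weights, the posterior of eq. (7) becomes

$$P(y\,|\,\mathbf{x}) \;=\; \frac{e^{\mathbf{v}_y \cdot \boldsymbol{\phi}(\mathbf{x})}}{\sum_{y'} e^{\mathbf{v}_{y'} \cdot \boldsymbol{\phi}(\mathbf{x})}},$$

which is exactly the multinomial logistic (maximum entropy) model fitted by generalized iterative scaling.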

53 More generally, though, our multiplicative updates can be used to train much more powerful classifiers based on mixture models. [sent-109, score-0.688]

54 5 Feature selection. As previously mentioned, the learning rate for the basis function parameters is controlled by the sparseness of the training data. [sent-110, score-0.372]

55 Eqs. (13–14) can be impractically slow (just as the method of iterative scaling). Figure 1: Left: nonnegative basis vectors for handwritten digits discovered by NMF; right: the NMF feature vector for one image. [sent-112, score-1.229]

56 The basis vectors are ordered by their contribution to this image. [sent-114, score-0.248]

57 In this case, it is important to discover sparse distributed representations of the data that encode the same information. [sent-116, score-0.233]

58 The search for sparse distributed representations can be viewed as a form of feature selection. [sent-118, score-0.278]

59 We have observed that suitably sparse representations can be discovered by the method of nonnegative matrix factorization (NMF)[8]. [sent-119, score-0.617]

60 Let the raw nonnegative (and possibly nonsparse) data be represented by the matrix $\mathbf{V}$, whose number of rows is the raw dimensionality and whose number of columns is the number of training examples. [sent-120, score-0.465]

61 Algorithms for NMF yield a factorization $\mathbf{V} \approx \mathbf{W}\mathbf{H}$, where $\mathbf{W}$ is a nonnegative matrix and $\mathbf{H}$ is a nonnegative matrix. [sent-121, score-0.646]

62 In this factorization, the columns of $\mathbf{W}$ are interpreted as basis vectors, and the columns of $\mathbf{H}$ as coefficients (or new feature vectors). [sent-122, score-0.272]

63 These coefficients are typically very sparse, because the nonnegative basis vectors can only be added in a constructive way to approximate the original data. [sent-123, score-0.533]
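
A hedged sketch of how this preprocessing step could be wired up with scikit-learn's NMF; the matrix orientation, the choice of 80 components, the random stand-in data, and the sparsity threshold are illustrative assumptions, not settings taken from the paper.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((784, 1000))        # stand-in for nonnegative pixel images (features x examples)

model = NMF(n_components=80, init="nndsvda", max_iter=200)
W = model.fit_transform(V)         # (784, 80): columns are nonnegative basis vectors
H = model.components_              # (80, 1000): columns are nonnegative coefficient (feature) vectors

# On real image data many coefficients come out near zero; measure that here.
print("fraction of near-zero coefficients:", float((H < 1e-3).mean()))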

64 We used the method to discover sparse distributed representations of the MNIST data set of handwritten digits[10]. [sent-125, score-0.34]

65 The data set has 60000 training and 10000 test examples that were deslanted and cropped to form grayscale pixel images. [sent-126, score-0.229]

66 Fig. 1 shows the basis vectors discovered by NMF, each plotted as an image. [sent-129, score-0.307]

67 Most of these basis vectors resemble strokes, only a fraction of which are needed to reconstruct any particular image in the training set. [sent-130, score-0.33]

68 For example, only about twenty basis vectors make an appreciable contribution to the handwritten “2” shown in the right plot of Fig. 1. [sent-131, score-0.392]

69 Baseline mixture models for classification were trained by EM algorithms. [sent-137, score-0.489]

70 Gaussian mixture models with diagonal covariance matrices were trained on the PCA features, while exponential mixture models (as in eq. (4)) were trained on the NMF features. [sent-138, score-0.938]

71 The mixture models were trained for up to 64 iterations of EM, which was sufficient to ensure a high degree of convergence. [sent-140, score-0.489]

72 Seven baseline classifiers were trained on each feature set, with different numbers of mixture components per digit. [sent-141, score-0.605]

73 Half as many PCA features were used as NMF features so as to equalize the number of fitted parameters in different basis functions. [sent-143, score-0.298]

74 Mixture models on the NMF features were also trained discriminatively by the multiplicative updates in eqs. (13–14). [sent-144, score-0.775]

75 Models with varying numbers of mixture components per digit were trained by 1000 iterations of these updates. [sent-146, score-0.421]

76 The models were initialized by setting the mixture weights and basis function parameters using randomly selected feature vectors. [sent-148, score-0.22]

77 The results show that the discriminatively trained models classify much better than comparably sized models trained by EM. [sent-150, score-0.616]

78 (Results are compared across different numbers of mixture components per digit and across models with the same number of fitted parameters.) [sent-191, score-0.394]

79 These results, however, either required storing large numbers of training examples or training significantly larger models. [sent-195, score-0.285]

80 For example, the nearest neighbor and support vector classifiers required storing tens of thousands of training examples (or support vectors), while the neural network had over 120,000 weights. [sent-196, score-0.218]

81 By contrast, the discriminatively trained mixture model has fewer than 6500 iteratively adjusted parameters, and most of its memory footprint is devoted to preprocessing by NMF. [sent-197, score-0.515]

82 We conclude by describing the problems best suited to the mixture models in this paper. [sent-198, score-0.292]

83 A Proof of convergence. In this appendix, we show that the multiplicative updates from section 4 lead to monotonic improvement in the conditional log likelihood. [sent-202, score-0.772]

84 This guarantee of convergence (to a stationary point) is proved by computing a lower bound on the conditional log likelihood for updated estimates of the mixture weights and basis function parameters. [sent-203, score-0.913]

85 We consider these updated estimates together with the resulting values of the conditional log likelihood and its component terms. [sent-204, score-0.296]

86 The first term in the conditional log likelihood can be lower bounded by Jensen’s inequality. [sent-206, score-0.296]

87 The same bound is used here as in the derivation of the EM algorithm[3, 13] for maximum likelihood estimation: $\log \sum_c W_{y_n c}\, f_c(\mathbf{x}_n) \;\ge\; \sum_c q_{nc} \log \dfrac{W_{y_n c}\, f_c(\mathbf{x}_n)}{q_{nc}}$ (15). [sent-207, score-0.227]

88 The right hand side of this inequality introduces an auxiliary probability distribution $q_{nc}$ for each example in the training set. [sent-209, score-0.349]

89 They are properly normalized: $\sum_c q_{nc} = 1$. The second term in the conditional log likelihood occurs with a minus sign, so for this term we require an upper bound. [sent-211, score-0.296]
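
One standard choice for such an upper bound, shown here as a sketch in the notation used above (not necessarily the exact inequality of the paper), is the linearization of the logarithm: for any $z, z_0 > 0$,

$$\log z \;\le\; \log z_0 \;+\; \frac{z - z_0}{z_0},$$

applied to $z = \sum_{y'} \sum_c W_{y'c}\, f_c(\mathbf{x}_n)$, with $z_0$ taken from the current parameter estimates.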

90 To further bound the right hand side of eq. [sent-217, score-0.278]

91 (The validity of this observation hinges on the nonnegativity of the feature vectors.) [sent-220, score-0.341]

92 Combining the above inequalities with a judicious choice for the auxiliary parameters, we obtain a proof of convergence for the multiplicative updates in eqs. (13–14). [sent-225, score-0.576]

93 Eq. (19) defines an analogous distribution for the opposing term in the conditional log likelihood. [sent-229, score-0.262]

94 We derive the update rules by maximizing the right hand side of this inequality. [sent-234, score-0.29]

95 Maximizing the right hand side with respect to the mixture weights, while holding the basis function parameters fixed, yields the mixture weight update. [sent-235, score-0.385]

96 Likewise, maximizing the right hand side with respect to the basis function parameters, while holding the mixture weights fixed, yields the basis function update. [sent-237, score-0.531]

97 Since these choices lead to positive values on the right hand side of the inequality (except at fixed points), it follows that the multiplicative updates in eqs. [sent-239, score-0.62]

98 (13–14) lead to monotonic improvement in the conditional log likelihood. [sent-240, score-0.325]

99 Maximum conditional likelihood via bound maximization and the CEM algorithm. [sent-282, score-0.225]

100 Learning the parts of objects with nonnegative matrix factorization. [sent-300, score-0.321]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nmf', 0.392), ('mixture', 0.292), ('nonnegative', 0.285), ('multiplicative', 0.207), ('updates', 0.189), ('classi', 0.156), ('basis', 0.155), ('discriminative', 0.135), ('nonnegativity', 0.131), ('discriminatively', 0.129), ('feature', 0.117), ('efd', 0.113), ('em', 0.11), ('log', 0.108), ('handwritten', 0.107), ('models', 0.103), ('sparse', 0.102), ('likelihood', 0.099), ('trained', 0.094), ('vectors', 0.093), ('pd', 0.091), ('monotonic', 0.091), ('conditional', 0.089), ('generative', 0.087), ('training', 0.082), ('factorization', 0.076), ('inequality', 0.074), ('discover', 0.072), ('rules', 0.071), ('update', 0.069), ('side', 0.068), ('ers', 0.068), ('digit', 0.067), ('opposing', 0.065), ('parameterize', 0.065), ('fed', 0.062), ('gradients', 0.062), ('sparseness', 0.062), ('pixel', 0.061), ('posterior', 0.06), ('discovered', 0.059), ('scaling', 0.059), ('representations', 0.059), ('iterative', 0.059), ('bumps', 0.057), ('pentland', 0.057), ('parameterizing', 0.057), ('virtues', 0.057), ('exponentiated', 0.057), ('jebara', 0.057), ('eg', 0.055), ('cation', 0.054), ('exponential', 0.054), ('features', 0.053), ('functions', 0.052), ('distinguishes', 0.052), ('distributions', 0.052), ('convergence', 0.051), ('neighbor', 0.05), ('proof', 0.049), ('raw', 0.049), ('objective', 0.048), ('sized', 0.048), ('della', 0.048), ('pietra', 0.048), ('summed', 0.048), ('weights', 0.046), ('examples', 0.046), ('digits', 0.045), ('comparably', 0.045), ('hand', 0.045), ('lee', 0.043), ('auxiliary', 0.043), ('holding', 0.043), ('tted', 0.041), ('mnist', 0.041), ('generalized', 0.041), ('de', 0.041), ('optimized', 0.041), ('pca', 0.041), ('sr', 0.04), ('storing', 0.04), ('prototype', 0.04), ('seung', 0.04), ('grayscale', 0.04), ('estimation', 0.039), ('tuning', 0.039), ('algorithm', 0.039), ('closed', 0.038), ('right', 0.037), ('parameters', 0.037), ('improvement', 0.037), ('ratios', 0.037), ('optimizes', 0.037), ('expanded', 0.037), ('bound', 0.037), ('matrix', 0.036), ('guarantee', 0.036), ('controlled', 0.036), ('numbers', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

2 0.14951542 84 nips-2001-Global Coordination of Local Linear Models

Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton

Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables (eq. 1), where the observed variable represents the high dimensional data, the discrete hidden variable indexes different neighborhoods on the manifold, and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.

3 0.1490345 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

4 0.14812267 190 nips-2001-Thin Junction Trees

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.

5 0.14570336 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

Author: Michael Collins, S. Dasgupta, Robert E. Schapire

Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.

6 0.14037369 43 nips-2001-Bayesian time series classification

7 0.13876902 60 nips-2001-Discriminative Direction for Kernel Classifiers

8 0.13062367 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

9 0.12822123 46 nips-2001-Categorization by Learning and Combining Object Parts

10 0.12752675 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

11 0.12060075 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

12 0.11405436 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

13 0.11313908 144 nips-2001-Partially labeled classification with Markov random walks

14 0.11291362 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

15 0.11230033 172 nips-2001-Speech Recognition using SVMs

16 0.11221624 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

17 0.11188325 139 nips-2001-Online Learning with Kernels

18 0.11092149 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

19 0.10973969 35 nips-2001-Analysis of Sparse Bayesian Learning

20 0.10078257 143 nips-2001-PAC Generalization Bounds for Co-training


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.33), (1, 0.128), (2, -0.049), (3, 0.108), (4, -0.11), (5, -0.051), (6, 0.012), (7, -0.065), (8, 0.033), (9, -0.093), (10, 0.089), (11, 0.012), (12, 0.021), (13, -0.173), (14, -0.076), (15, -0.073), (16, -0.064), (17, -0.019), (18, 0.02), (19, 0.027), (20, 0.057), (21, 0.034), (22, 0.013), (23, -0.015), (24, -0.017), (25, -0.019), (26, 0.06), (27, 0.059), (28, 0.011), (29, -0.111), (30, 0.006), (31, 0.021), (32, -0.014), (33, -0.047), (34, 0.039), (35, 0.021), (36, 0.086), (37, -0.0), (38, -0.12), (39, -0.022), (40, -0.025), (41, -0.078), (42, 0.013), (43, 0.024), (44, -0.147), (45, 0.086), (46, -0.023), (47, 0.025), (48, -0.062), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97535712 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

2 0.66949058 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

Author: Mário Figueiredo

Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.

3 0.66502535 84 nips-2001-Global Coordination of Local Linear Models

Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton

Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables (eq. 1), where the observed variable represents the high dimensional data, the discrete hidden variable indexes different neighborhoods on the manifold, and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.

4 0.66359901 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng

Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. 1

5 0.63038117 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

Author: Igor V. Cadez, Padhraic Smyth

Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1

6 0.62272573 43 nips-2001-Bayesian time series classification

7 0.61816674 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

8 0.60816902 190 nips-2001-Thin Junction Trees

9 0.59924042 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

10 0.59867221 8 nips-2001-A General Greedy Approximation Algorithm with Applications

11 0.58437049 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

12 0.5502317 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

13 0.54896945 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

14 0.54746085 60 nips-2001-Discriminative Direction for Kernel Classifiers

15 0.53995997 35 nips-2001-Analysis of Sparse Bayesian Learning

16 0.53433418 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

17 0.52496421 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

18 0.51134533 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

19 0.50841683 144 nips-2001-Partially labeled classification with Markov random walks

20 0.49808437 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.022), (19, 0.02), (27, 0.601), (30, 0.044), (36, 0.011), (38, 0.018), (59, 0.015), (72, 0.06), (79, 0.03), (83, 0.017), (88, 0.02), (91, 0.081)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99695057 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation

Author: Anand Rangarajan, Alan L. Yuille

Abstract: Bayesian belief propagation in graphical models has been recently shown to have very close ties to inference methods based in statistical physics. After Yedidia et al. demonstrated that belief propagation fixed points correspond to extrema of the so-called Bethe free energy, Yuille derived a double loop algorithm that is guaranteed to converge to a local minimum of the Bethe free energy. Yuille’s algorithm is based on a certain decomposition of the Bethe free energy and he mentions that other decompositions are possible and may even be fruitful. In the present work, we begin with the Bethe free energy and show that it has a principled interpretation as pairwise mutual information minimization and marginal entropy maximization (MIME). Next, we construct a family of free energy functions from a spectrum of decompositions of the original Bethe free energy. For each free energy in this family, we develop a new algorithm that is guaranteed to converge to a local minimum. Preliminary computer simulations are in agreement with this theoretical development. 1

same-paper 2 0.99213195 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

3 0.99128962 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA

Author: Magnus Rattray, Gleb Basalyga

Abstract: We study the dynamics of a Hebbian ICA algorithm extracting a single non-Gaussian component from a high-dimensional Gaussian background. For both on-line and batch learning we find that a surprisingly large number of examples are required to avoid trapping in a sub-optimal state close to the initial conditions. To extract a skewed signal at least examples are required for -dimensional data and examples are required to extract a symmetrical signal with non-zero kurtosis.

4 0.98819244 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

Author: Andrew Y. Ng, Michael I. Jordan

Abstract: We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widelyheld belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation- which is borne out in repeated experiments- that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster. 1

5 0.98674601 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

Author: Mikhail Belkin, Partha Niyogi

Abstract: Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low dimensional manifold embedded in a higher dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality preserving properties and a natural connection to clustering. Several applications are considered. In many areas of artificial intelligence, information retrieval and data mining, one is often confronted with intrinsically low dimensional data lying in a very high dimensional space. For example, gray scale n x n images of a fixed object taken with a moving camera yield data points in $\mathbb{R}^{n^2}$. However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera - in fact the space has the natural structure of a manifold embedded in $\mathbb{R}^{n^2}$. While there is a large body of work on dimensionality reduction in general, most existing approaches do not explicitly take into account the structure of the manifold on which the data may possibly reside. Recently, there has been some interest (Tenenbaum et al, 2000; Roweis and Saul, 2000) in the problem of developing low dimensional representations of data in this particular context. In this paper, we present a new algorithm and an accompanying framework of analysis for geometrically motivated dimensionality reduction. The core algorithm is very simple, has a few local computations and one sparse eigenvalue problem. The solution reflects the intrinsic geometric structure of the manifold. The justification comes from the role of the Laplacian operator in providing an optimal embedding. The Laplacian of the graph obtained from the data points may be viewed as an approximation to the Laplace-Beltrami operator defined on the manifold. The embedding maps for the data come from approximations to a natural map that is defined on the entire manifold. The framework of analysis presented here makes this connection explicit. While this connection is known to geometers and specialists in spectral graph theory (for example, see [1, 2]) to the best of our knowledge we do not know of any application to data representation yet. The connection of the Laplacian to the heat kernel enables us to choose the weights of the graph in a principled manner. The locality preserving character of the Laplacian Eigenmap algorithm makes it relatively insensitive to outliers and noise. A byproduct of this is that the algorithm implicitly emphasizes the natural clusters in the data. Connections to spectral clustering algorithms developed in learning and computer vision (see Shi and Malik, 1997) become very clear. Following the discussion of Roweis and Saul (2000), and Tenenbaum et al (2000), we note that the biological perceptual apparatus is confronted with high dimensional stimuli from which it must recover low dimensional structure. One might argue that if the approach to recovering such low-dimensional structure is inherently local, then a natural clustering will emerge and thus might serve as the basis for the development of categories in biological perception. 1 The Algorithm. Given k points $x_1, \ldots, x_k$ in $\mathbb{R}^l$, we construct a weighted graph with k nodes, one for each point, and the set of edges connecting neighboring points to each other. 1. Step 1.
[Constructing the Graph] We put an edge between nodes $i$ and $j$ if $x_i$ and $x_j$ are

6 0.98550141 23 nips-2001-A theory of neural integration in the head-direction system

7 0.97609466 47 nips-2001-Causal Categorization with Bayes Nets

8 0.90712857 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder

9 0.87475663 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

10 0.84134066 137 nips-2001-On the Convergence of Leveraging

11 0.79052758 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation

12 0.78737658 8 nips-2001-A General Greedy Approximation Algorithm with Applications

13 0.78426713 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

14 0.78255588 97 nips-2001-Information-Geometrical Significance of Sparsity in Gallager Codes

15 0.78230613 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

16 0.77271122 114 nips-2001-Learning from Infinite Data in Finite Time

17 0.77067566 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

18 0.7694968 190 nips-2001-Thin Junction Trees

19 0.76587629 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules

20 0.76144445 154 nips-2001-Products of Gaussians