nips nips2001 nips2001-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michael Collins, S. Dasgupta, Robert E. Schapire
Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
Reference: text
sentIndex sentText sentNum sentScore
1 PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. [sent-5, score-0.363]
2 This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. [sent-6, score-0.278]
3 We describe algorithms for minimizing the loss functions, and give examples on simulated data. [sent-7, score-0.21]
4 1 Introduction Principal component analysis (PCA) is a hugely popular dimensionality reduction technique that attempts to find a low-dimensional subspace passing close to a given set of points. [sent-8, score-0.19]
5 More specifically, in PCA, we find a lower dimensional subspace that minimizes the sum of the squared distances from the data points to their projections onto the subspace, i.e. [sent-9, score-0.621]
6 $\min \sum_i \|x_i - \Pi(x_i)\|^2$ (1) This turns out to be equivalent to choosing a subspace that maximizes the sum of the squared lengths of the projections $\Pi(x_i)$, which is the same as the (empirical) variance of these projections if the data happens to be centered at the origin (so that $\sum_i x_i = 0$). [sent-13, score-0.346]
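To make the equivalence in sentence 6 concrete, here is a minimal NumPy sketch of standard PCA as squared-loss minimization; the variable names and the rank-1 setup are illustrative, not taken from the paper.

```python
# Minimal sketch of standard PCA as squared-loss minimization (rank 1 for simplicity).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # 20 data points in R^3
X = X - X.mean(axis=0)                    # center at the origin

U, s, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]                                 # top principal direction (unit norm)
proj = np.outer(X @ v, v)                 # projections of the points onto span{v}

squared_dist = np.sum((X - proj) ** 2)    # Eq. (1): sum of squared residual distances
proj_lengths = np.sum((X @ v) ** 2)       # sum of squared projection lengths
print(np.isclose(squared_dist + proj_lengths, np.sum(X ** 2)))  # Pythagoras: criteria agree
```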
7 In this probabilistic interpretation, each point $x_i$ is thought of as a random draw from some unknown distribution $P_{\theta_i}$, where $P_\theta$ denotes a unit-variance Gaussian with mean $\theta$. [sent-16, score-0.116]
8 The purpose then of PCA is to find the set of parameters that maximizes the likelihood of the data, subject to the condition that these parameters all lie in a low-dimensional subspace. [sent-17, score-0.158]
9 In other words, the data points are considered to be noise-corrupted versions of some true points which lie in a subspace; the goal is to find these true points, and the main assumption is that the noise is Gaussian. [sent-18, score-0.156]
10 In fact, the Gaussian is only one of the canonical distributions that make up the exponential family, and it is a distribution tailored to real-valued data. [sent-29, score-0.436]
11 It seems natural to consider variants of PCA which are founded upon these other distributions in place of the Gaussian. [sent-31, score-0.184]
12 We extend PCA to the rest of the exponential family. [sent-33, score-0.244]
13 Let $P(x\mid\theta)$ be any parameterized set of distributions from the exponential family, where $\theta$ is the natural parameter of a distribution. [sent-34, score-0.404]
14 For instance, a one-dimensional Poisson distribution with mean $\lambda$ can be parameterized by the natural parameter $\theta = \log\lambda$, with $\theta$ corresponding to the distribution of mean $e^\theta$. Given data $x_1,\ldots,x_n$, the goal is now to find parameters $\theta_1,\ldots,\theta_n$ which lie in a low-dimensional subspace and for which the log-likelihood $\sum_i \log P(x_i\mid\theta_i)$ is maximized. [sent-35, score-0.337]
15 Our unified approach effortlessly permits hybrid dimensionality reduction schemes in which different types of distributions can be used for different attributes of the data. [sent-42, score-0.159]
16 If the data have a few binary attributes and a few integer-valued attributes, then some coordinates of the corresponding $\theta_i$ can be parameters of binomial distributions while others are parameters of Poisson distributions. [sent-43, score-0.227]
17 The dimensionality reduction schemes for non-Gaussian distributions are substantially different from PCA. [sent-46, score-0.092]
18 For instance, in PCA the parameters $\theta_i$, which are means of Gaussians, lie in a space which coincides with that of the data $x_i$. [sent-47, score-0.192]
19 This is not the case in general, and therefore, although the parameters lie in a linear subspace, they typically correspond to a nonlinear surface in the space of the data. [sent-48, score-0.176]
20 The discrepancy and interaction between the space of parameters and the space of the data is a central preoccupation in the study of exponential families, generalized linear models (GLM’s), and Bregman distances. [sent-52, score-0.399]
21 In particular, we show that the way in which we generalize PCA is exactly analogous to the manner in which regression is generalized by GLM’s. [sent-54, score-0.184]
22 We show that the optimization problem we derive can be solved quite naturally by an algorithm that alternately minimizes over the components of the analysis and their coefficients; thus, the algorithm is reminiscent of Csiszár and Tusnády's alternating minimization procedures [2]. [sent-56, score-0.245]
23 In our case, each side of the minimization is a simple convex program that can be interpreted as a projection with respect to a suitable Bregman distance; however, the overall program is not generally convex. [sent-57, score-0.193]
24 In the case of Gaussian distributions, our algorithm coincides exactly with the power method for computing eigenvectors; in this sense it is a generalization of one of the oldest algorithms for PCA. [sent-58, score-0.16]
25 Although omitted for lack of space, we can show that our procedure converges in that any limit point of the computed coefficients is a stationary point of the loss function. [sent-59, score-0.371]
26 1 The Exponential Family and Generalized Linear Models. In the exponential family of distributions the conditional probability of a value $x$ given a parameter value $\theta$ takes the following form: $\log P(x\mid\theta) = \log P_0(x) + x\theta - G(\theta)$ (2) [sent-64, score-0.536]
27 The main difference between members of the family is the form of $G(\theta)$; we will see that almost all of the concepts of the PCA algorithms in this paper stem directly from the definition of $G$. [sent-78, score-0.262]
28 A first example is a normal distribution, with mean $\theta$ and unit variance, which has a density that is usually written as $P(x\mid\theta) = \tfrac{1}{\sqrt{2\pi}}\exp(-(x-\theta)^2/2)$. [sent-79, score-0.094]
29 It can be verified that this is a member of the exponential family with $\log P_0(x) = -x^2/2 - \tfrac{1}{2}\log(2\pi)$ and $G(\theta) = \theta^2/2$. [sent-80, score-0.579]
30 The probability of $x \in \{0,1\}$ under a Bernoulli distribution is usually written $P(x\mid p) = p^x(1-p)^{1-x}$, where $p$ is a parameter in $[0,1]$. [sent-83, score-0.146]
31 This is a member of the exponential family with $\theta = \log\frac{p}{1-p}$, $\log P_0(x) = 0$, and $G(\theta) = \log(1 + e^\theta)$. [sent-84, score-0.488]
32 In the normal distribution, $g(\theta) = G'(\theta) = \theta$, and in the Bernoulli case $g(\theta) = e^\theta/(1+e^\theta)$. [sent-87, score-0.094]
33 In the general case, $g(\theta)$ is referred to as the “expectation parameter”, and $g$ defines a function from the natural parameter values to the expectation parameter values. [sent-88, score-0.205]
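As a concrete illustration of the cumulant function $G$ and its derivative $g$ for the members discussed above, the following hedged sketch defines them for the normal, Bernoulli, and Poisson cases; the function names are my own, and the $\log P_0(x)$ term is dropped since it does not depend on $\theta$.

```python
# Cumulant functions G(theta) and expectation parameters g(theta) = G'(theta)
# for three members of the exponential family (function names are my own).
import numpy as np

def G_normal(theta):    return 0.5 * theta ** 2
def g_normal(theta):    return theta                              # mean = theta

def G_bernoulli(theta): return np.logaddexp(0.0, theta)           # log(1 + e^theta)
def g_bernoulli(theta): return 1.0 / (1.0 + np.exp(-theta))       # mean = sigmoid(theta)

def G_poisson(theta):   return np.exp(theta)
def g_poisson(theta):   return np.exp(theta)                      # mean = e^theta

def loglik(x, theta, G):
    # log P(x|theta) = log P0(x) + x*theta - G(theta); the P0 term is omitted
    # because it does not depend on theta.
    return x * theta - G(theta)

print(loglik(1.0, 0.3, G_bernoulli), g_bernoulli(0.3))
```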
34 Our generalization of PCA is analogous to the manner in which generalized linear models (GLM’s) [8] provide a unified treatment of regression for the exponential family by generalizing least-squares regression to loss functions that are more appropriate for other members of this family. [sent-89, score-0.992]
35 In GLM's, $g(\beta\cdot x)$ is taken to approximate the expectation parameter of the exponential model, where $g$ is the inverse of the “link function” [8]. [sent-96, score-0.343]
36 A natural choice is to use the “canonical link”, where the link function is $g^{-1}$, $g$ being the derivative $G'(\theta)$. [sent-97, score-0.087]
37 In this case the natural parameters are directly approximated by the linear predictor $\beta\cdot x_i$, and the log-likelihood is simply $\sum_i \left(y_i\,\beta\cdot x_i - G(\beta\cdot x_i)\right)$, up to constants. [sent-98, score-0.088]
38 In the case of a normal distribution with fixed variance, it follows easily that the maximum-likelihood criterion is equivalent to the least squares criterion. [sent-99, score-0.132]
39 Another interesting case is logistic regression, where $G(\theta) = \log(1+e^\theta)$, and the negative log-likelihood for parameters $\beta$ is $\sum_i \log(1 + e^{-\tilde y_i\,\beta\cdot x_i})$, where $\tilde y_i = +1$ if $y_i = 1$ and $\tilde y_i = -1$ if $y_i = 0$. [sent-100, score-0.125]
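A small sketch of the GLM/logistic case just described, assuming the canonical link so that the natural parameter is the linear predictor $\beta\cdot x$; names and data are illustrative, not from the paper.

```python
# GLM view with the canonical link: theta_i = beta . x_i models the natural
# parameter, and the Bernoulli negative log-likelihood is the logistic loss.
import numpy as np

def logistic_nll(beta, X, y):
    """y in {0,1}; each row of X holds covariates for one example."""
    theta = X @ beta
    # -log P(y|theta) = G(theta) - y*theta with G(theta) = log(1 + e^theta)
    return np.sum(np.logaddexp(0.0, theta) - y * theta)

X = np.array([[1.0, 0.5], [1.0, -0.2], [1.0, 1.3]])
y = np.array([1.0, 0.0, 1.0])
print(logistic_nll(np.zeros(2), X, y))    # equals 3*log(2) at beta = 0
```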
40 2 Bregman Distances and the Exponential Family. Let $F$ be a differentiable and strictly convex function defined on a closed, convex set. [sent-104, score-0.158]
41 It can be shown that, in general, every Bregman distance is nonnegative and is equal to zero if and only if its two arguments are equal. [sent-107, score-0.169]
42 For the exponential family the log-likelihood is directly related to a Bregman distance: up to terms that do not depend on $\theta$, $-\log P(x\mid\theta)$ equals the Bregman distance between $x$ and the expectation parameter $g(\theta)$. [sent-108, score-0.427]
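To illustrate the Bregman distance $B_F(p\,\|\,q) = F(p) - F(q) - F'(q)(p-q)$ and its connection to familiar divergences, here is a hedged sketch; the generators and their derivatives are standard choices, not code from the paper.

```python
# Bregman distance B_F(p, q) = F(p) - F(q) - F'(q)(p - q) for two generators.
import numpy as np

def bregman(p, q, F, f):
    """F is the convex generator, f its derivative."""
    return F(p) - F(q) - f(q) * (p - q)

# Squared-loss generator: the Bregman distance is half the squared Euclidean distance.
F_sq = lambda z: 0.5 * z ** 2
f_sq = lambda z: z
print(bregman(3.0, 1.0, F_sq, f_sq))      # (3 - 1)^2 / 2 = 2.0

# Negative binary entropy: the Bregman distance is the binary KL divergence.
F_bin = lambda z: z * np.log(z) + (1 - z) * np.log(1 - z)
f_bin = lambda z: np.log(z / (1 - z))
p, q = 0.7, 0.4
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
print(np.isclose(bregman(p, q, F_bin, f_bin), kl))
```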
43 Table 1 summarizes various functions of interest for examples of the exponential family. [sent-112, score-0.244]
44 We will find it useful to extend the idea of Bregman distances to divergences between vectors and matrices. [sent-113, score-0.137]
45 (The notion of Bregman distance as well as our generalization of PCA can be extended to vectors in a more general manner; here, for simplicity, we restrict our attention to Bregman distances and PCA problems of this particular form.) [sent-115, score-0.232]
46 We now generalize PCA to other members of the exponential family. [sent-117, score-0.323]
47 We wish to find parameters $\theta_i$ that are “close” to the data points $x_i$ and which belong to a lower dimensional subspace of parameter space. [sent-118, score-0.203]
48 Let $X$ be the matrix whose $i$'th row is $x_i$. [sent-122, score-0.112]
49 Let $V$ be the matrix whose $k$'th row is a basis vector $v_k$ of the low-dimensional subspace, and let $\Theta = AV$ be the matrix with elements $\theta_{ij}$, where $A$ is the matrix of coefficients. [sent-123, score-0.175]
50 This is a matrix of natural parameter values which define the probability of each point in $X$. [sent-125, score-0.201]
51 Following the discussion in Section 2, we consider the loss function taking the form $L(A,V) = \sum_{i,j}\left(G(\theta_{ij}) - x_{ij}\theta_{ij}\right)$, up to constants, where $\Theta = AV$. [sent-127, score-0.21]
52 The loss function varies depending on which member of the exponential family is taken, which simply changes the form of $G$. [sent-130, score-0.21]
53 For example, if $X$ is a matrix of real values, and the normal distribution is appropriate for the data, then $G(\theta) = \theta^2/2$ and the loss criterion is the usual squared loss for PCA. [sent-131, score-0.653]
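A minimal sketch of the generalized loss just described, assuming the constant-free form $\sum_{i,j} G(\theta_{ij}) - x_{ij}\theta_{ij}$ with $\Theta = AV$; it also checks that the Gaussian choice of $G$ recovers squared loss up to an additive constant. All names are illustrative.

```python
# Generalized PCA loss with Theta = A @ V; plugging in the Gaussian G recovers
# squared loss up to a constant that does not depend on A or V.
import numpy as np

def exp_pca_loss(A, V, X, G):
    Theta = A @ V
    return np.sum(G(Theta) - X * Theta)

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
A = rng.normal(size=(10, 2))
V = rng.normal(size=(2, 4))

G_normal = lambda t: 0.5 * t ** 2
loss = exp_pca_loss(A, V, X, G_normal)
shifted_sq = 0.5 * np.sum((X - A @ V) ** 2) - 0.5 * np.sum(X ** 2)
print(np.isclose(loss, shifted_sq))
```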
54 From the relationship between log-likelihood and Bregman distances (see Eq. above), the loss can equivalently be written as a sum of Bregman distances between the data values and the expectation parameters $g(\theta_{ij})$. [sent-136, score-0.137]
55 Once $A$ and $V$ have been found for the data points, the $i$'th data point can be represented as the vector $a_i$ (the $i$'th row of $A$) in the lower dimensional space. [sent-139, score-0.136]
56 The optimal value for the coefficients then minimizes the sum of projection distances: $\sum_i B_F(x_i \,\|\, g(a_i V))$. [sent-142, score-0.1]
57 Note that for the normal distribution $g(\theta) = \theta$ and the Bregman distance is Euclidean distance, so that the projection operation in Eq. [sent-143, score-0.345]
58 The surface is also simplified in the normal case, simply being the hyperplane whose basis is $V$. [sent-145, score-0.094]
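For a non-normal member of the family, representing a point by its coefficient is a one-dimensional convex problem, as noted above. Here is a hedged sketch for the Bernoulli case using a plain Newton iteration; the names and data are illustrative, not the paper's code.

```python
# Projecting a single point onto a learned direction v in natural-parameter space,
# Bernoulli case: minimize over the scalar a the convex objective
# sum_j G(a v_j) - x_j a v_j, here by a plain Newton iteration.
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def project(x, v, iters=50):
    a = 0.0
    for _ in range(iters):
        m = sigmoid(a * v)                       # expectation parameters g(a v)
        grad = np.sum(v * (m - x))               # derivative of the 1-D objective
        hess = np.sum(v ** 2 * m * (1 - m)) + 1e-12
        a -= grad / hess                         # Newton step on a convex function
    return a

v = np.array([1.0, -2.0, 0.5])
x = np.array([1.0, 1.0, 0.0])
a = project(x, v)
print(a, sigmoid(a * v))                         # the projected point g(a v)
```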
59 $\Theta$ is taken to be a matrix of natural parameter values. [sent-147, score-0.169]
60 Applying $g$ elementwise, $g(\Theta)$ defines a matrix of expectation parameters. [sent-148, score-0.107]
61 PCA can also be thought of as a search for a matrix $V$ that defines a surface which is “close” to all the data points. [sent-149, score-0.154]
62 The normal distribution is a simple case because $g(\theta) = \theta$, and the divergence is Euclidean distance. [sent-153, score-0.187]
63 The projection operation is a linear operation, and the surface is the hyperplane which has $V$ as its basis. [sent-154, score-0.095]
64 4 Generic Algorithms for Minimizing the Loss Function. We now describe a generic algorithm for minimization of the loss function. [sent-155, score-0.317]
65 Thus the loss is alternately minimized with respect to its two arguments $A$ and $V$, each time optimizing one argument while keeping the other one fixed, reminiscent of Csiszár and Tusnády's alternating minimization procedures [2]. [sent-162, score-0.207]
66 We can then see that there is a collection of optimization problems, and that each one is essentially identical to a GLM regression problem (a very simple one, where there is a single parameter being optimized over). [sent-165, score-0.149]
67 These sub-problems are easily solved, as the functions are convex in the argument being optimized over, and the large literature on maximum-likelihood estimation in GLM's can be directly applied to the problem. [sent-166, score-0.117]
68 These updates take a simple form for the normal distribution: for a single component, $a_i \leftarrow (x_i\cdot v)/(v\cdot v)$ and $v \leftarrow (\sum_i a_i x_i)/(\sum_i a_i^2)$. [sent-167, score-0.094]
69 Thus the generic algorithm generalizes one of the oldest algorithms for solving the regular PCA problem. [sent-170, score-0.187]
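A sketch of the normal-distribution special case, where the alternating least-squares updates reduce to the power method; the convergence check against the SVD and all variable names are illustrative.

```python
# Normal-distribution special case: alternating least-squares updates for a single
# component, which reduce to the power method for the top principal direction.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
X -= X.mean(axis=0)

v = rng.normal(size=5)
for _ in range(200):
    a = X @ v / (v @ v)       # least-squares fit of coefficients with v fixed
    v = X.T @ a / (a @ a)     # least-squares fit of the direction with a fixed

v_hat = v / np.linalg.norm(v)
top = np.linalg.svd(X, full_matrices=False)[2][0]
print(np.isclose(abs(v_hat @ top), 1.0))         # agrees with the top singular direction
```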
70 The loss is convex in either of its arguments with the other fixed, but in general is not convex in the two arguments together. [sent-171, score-0.488]
71 The normal distribution is an interesting special case in this respect — the power method is known to converge to the optimal solution, in spite of the non-convex nature of the loss surface. [sent-173, score-0.379]
72 It can also be explained by analysis of the Hessian: for any stationary point which is not the global minimum, the Hessian is not positive semi-definite. [sent-175, score-0.096]
73 Thus these stationary points are saddle points rather than local minima. [sent-176, score-0.24]
74 The Hessian for the generalized loss function is more complex; it remains an open problem whether it is also not positive semidefinite at stationary points other than the global minimum. [sent-177, score-0.464]
75 Moreover, any limit point of the sequence will be a stationary point. [sent-180, score-0.129]
76 To avoid such degenerate choices of $\Theta$, we can use a modified loss which adds, for each element $\theta_{ij}$, a term $\epsilon\,B_F(x_0\,\|\,g(\theta_{ij}))$, where $\epsilon$ is a small positive constant, and $x_0$ is any value in the range of $g$ (and therefore for which the added divergence is finite). [sent-182, score-0.21]
77 It can be proved, for this modified loss, that the sequence remains in a bounded region and hence always has at least one limit point which must be a stationary point. [sent-184, score-0.129]
78 There are various ways to optimize the loss function when there is more than one component. [sent-186, score-0.21]
79 Our generalization of PCA behaves rather differently for different members of the exponential family. [sent-228, score-0.359]
80 One interesting example is that of the exponential distributions on nonnegative reals. [sent-229, score-0.348]
81 For one-dimensional data, these densities are usually written as $P(x\mid\lambda) = \tfrac{1}{\lambda}e^{-x/\lambda}$, where $\lambda$ is the mean. [sent-230, score-0.091]
82 In the uniform system of notation we have been using, we would instead index each distribution by a single natural parameter $\theta < 0$ (basically, $\theta = -1/\lambda$), and write the density as $P(x\mid\theta) = e^{x\theta - G(\theta)}$, where $G(\theta) = -\log(-\theta)$. [sent-231, score-0.144]
83 Once the direction $v$ is found, we can recover the coefficients $a_i$; the points $\theta_i = a_i v$ lie on a line through the origin. [sent-239, score-0.156]
84 Normally, we would not expect the points $g(\theta_i)$ to also lie on a straight line; however, in this case they do, because any point of the form $g(a\,v)$ can be written as $\tfrac{1}{a}\,g(v)$ and so must lie in the direction $g(v)$. [sent-240, score-0.317]
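A quick numeric check of the direction claim in sentence 84, assuming $g(\theta) = -1/\theta$ applied elementwise for the exponential distribution; the particular vector is arbitrary.

```python
# Numeric check: with g(theta) = -1/theta elementwise (exponential distribution),
# g(a*v) = (1/a) * g(v), so all projected points lie along the single direction g(v).
import numpy as np

g = lambda theta: -1.0 / theta
v = np.array([-0.5, -2.0, -1.5])     # natural parameters must be negative
for a in [0.5, 1.0, 3.0]:
    print(np.allclose(g(a * v), g(v) / a))
```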
85 Therefore, we can reasonably ask how the lines found under this exponential assumption differ from those found under a Gaussian assumption (that is, those found by regular PCA), provided all data is nonnegative. [sent-243, score-0.36]
86 As a very simple illustration, we conducted two toy experiments with twenty data points in the plane (Figure 1). [sent-244, score-0.104]
87 In the second experiment, a few of the points were moved farther afield, and these outliers had a larger effect upon regular PCA than upon its exponential variant. [sent-246, score-0.476]
88 For the Bernoulli distribution, a linear subspace of the space of parameters is typically a nonlinear surface in the space of the data. [sent-248, score-0.2]
89 In Figure 2 (left), three points in the three-dimensional hypercube are mapped via our PCA to a one-dimensional curve. [sent-249, score-0.125]
90 The curve passes through one of the points; the projections of the two others are indicated. [sent-250, score-0.156]
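A toy sketch of the Bernoulli case behind Figure 2: a one-component fit on binary data by plain gradient descent on the generalized loss. This is an illustration under my own choices (joint gradient steps, learning rate, initialization), not the paper's alternating algorithm.

```python
# One-component Bernoulli ("logistic") PCA on binary data, fit here by joint
# gradient descent on sum_ij G(theta_ij) - x_ij*theta_ij with theta = a v.
import numpy as np

rng = np.random.default_rng(3)
X = (rng.random((20, 3)) < 0.5).astype(float)    # binary points in the 3-cube

a = rng.normal(size=(20, 1)) * 0.1
v = rng.normal(size=(1, 3)) * 0.1
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

lr = 0.05
for _ in range(2000):
    Theta = a @ v
    R = sigmoid(Theta) - X                       # dLoss/dTheta = g(Theta) - X
    a -= lr * R @ v.T                            # chain rule through Theta = a v
    v -= lr * a.T @ R

# Each point is summarized by its scalar coefficient a_i; its reconstruction
# g(a_i v) lies on a one-dimensional curve inside the unit cube.
print(sigmoid(a[:3] @ v))
```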
91 6 Relationship to Previous Work. Lee and Seung [6, 7] and Hofmann [4] also describe probabilistic alternatives to PCA, tailored to data types that are not Gaussian. [sent-255, score-0.139]
92 In contrast to our method, [4, 6, 7] approximate mean parameters underlying the generation of the data points, with constraints on the matrices $A$ and $V$ ensuring that the elements of $AV$ are in the correct domain. [sent-256, score-0.12]
93 By instead choosing to approximate the natural parameters, in our method the matrices $A$ and $V$ do not usually need to be constrained; instead, we rely on the link function to give a transformed matrix $g(AV)$ which lies in the domain of the data points. [sent-257, score-0.297]
94 More specifically, Lee and Seung [6] use the loss function $\sum_{i,j}\left(\theta_{ij} - x_{ij}\log\theta_{ij}\right)$ (ignoring constant factors, and again defining $\Theta = AV$). [sent-258, score-0.21]
95 This method has a probabilistic interpretation, where each data point $x_{ij}$ is generated from a Poisson distribution with mean parameter $\theta_{ij}$. [sent-260, score-0.139]
96 For the Poisson distribution, our method uses the loss function $\sum_{i,j}\left(e^{\theta_{ij}} - x_{ij}\theta_{ij}\right)$, but without any constraints on the matrices $A$ and $V$. [sent-261, score-0.325]
97 The algorithm in Hofmann [4] uses a loss function of the form $-\sum_{i,j} x_{ij}\log\theta_{ij}$, where the matrices $A$ and $V$ are constrained such that all the $\theta_{ij}$'s are positive, and also such that $\sum_{i,j}\theta_{ij} = 1$. [sent-262, score-0.261]
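To make the comparison of loss functions concrete, here is a hedged side-by-side sketch (constants in $X$ dropped, all names illustrative): Lee and Seung approximate the mean matrix directly, while the Poisson variant of exponential-family PCA works with unconstrained natural parameters.

```python
# Loss functions compared above, with constant terms in X dropped. Lee-Seung
# approximates the Poisson means M = A V directly (so M must be positive), while
# the exponential-family PCA for Poisson data uses natural parameters Theta = A V.
import numpy as np

def lee_seung_loss(M, X):
    return np.sum(M - X * np.log(M))

def poisson_epca_loss(Theta, X):
    return np.sum(np.exp(Theta) - X * Theta)

X = np.array([[1.0, 3.0], [0.0, 2.0]])
Theta = np.array([[0.1, 1.0], [-0.5, 0.7]])
# The two losses coincide when the mean matrix is M = exp(Theta).
print(np.isclose(poisson_epca_loss(Theta, X), lee_seung_loss(np.exp(Theta), X)))
```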
98 This work builds upon intuitions about exponential families and Bregman distances obtained largely from interactions with Manfred Warmuth, and from his papers. [sent-266, score-0.459]
99 Relative loss bounds for on-line density estimation with the exponential family of distributions. [sent-272, score-0.637]
100 Learning the parts of objects by non-negative matrix factorization. [sent-295, score-0.113]
wordName wordTfidf (topN-words)
[('bregman', 0.447), ('pca', 0.419), ('exponential', 0.244), ('loss', 0.21), ('family', 0.183), ('bernoulli', 0.182), ('glm', 0.17), ('distances', 0.137), ('subspace', 0.108), ('poisson', 0.107), ('normal', 0.094), ('tusn', 0.092), ('generalized', 0.086), ('lie', 0.084), ('regular', 0.084), ('projections', 0.084), ('csisz', 0.08), ('convex', 0.079), ('members', 0.079), ('points', 0.072), ('th', 0.07), ('lee', 0.068), ('attributes', 0.067), ('nes', 0.065), ('stationary', 0.064), ('tipping', 0.064), ('matrix', 0.063), ('alternating', 0.063), ('projection', 0.062), ('member', 0.061), ('minization', 0.061), ('tailored', 0.061), ('cients', 0.06), ('arguments', 0.06), ('distance', 0.059), ('nd', 0.058), ('regression', 0.056), ('hofmann', 0.056), ('seung', 0.056), ('coef', 0.055), ('generic', 0.055), ('surface', 0.055), ('parameter', 0.055), ('link', 0.054), ('distributions', 0.054), ('de', 0.053), ('dasgupta', 0.053), ('hypercube', 0.053), ('jolliffe', 0.053), ('principal', 0.053), ('minimization', 0.052), ('natural', 0.051), ('matrices', 0.051), ('nonnegative', 0.05), ('gf', 0.05), ('row', 0.049), ('oldest', 0.048), ('usually', 0.046), ('probabilistic', 0.046), ('inappropriate', 0.045), ('written', 0.045), ('expectation', 0.044), ('component', 0.044), ('alternately', 0.043), ('dy', 0.043), ('manfred', 0.043), ('manner', 0.042), ('ed', 0.042), ('variants', 0.041), ('families', 0.04), ('reminiscent', 0.04), ('dimensional', 0.04), ('interpretation', 0.039), ('modi', 0.039), ('canonical', 0.039), ('coincides', 0.039), ('cally', 0.038), ('optimized', 0.038), ('distribution', 0.038), ('upon', 0.038), ('squared', 0.038), ('dimensionality', 0.038), ('minimizes', 0.038), ('xed', 0.037), ('veri', 0.037), ('ignoring', 0.037), ('parameters', 0.037), ('power', 0.037), ('generalization', 0.036), ('derivative', 0.036), ('hessian', 0.033), ('park', 0.033), ('operation', 0.033), ('limit', 0.033), ('ne', 0.033), ('point', 0.032), ('data', 0.032), ('global', 0.032), ('negative', 0.032), ('uni', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
Author: Michael Collins, S. Dasgupta, Robert E. Schapire
Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
2 0.17533915 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
3 0.17242527 74 nips-2001-Face Recognition Using Kernel Methods
Author: Ming-Hsuan Yang
Abstract: Principal Component Analysis and Fisher Linear Discriminant methods have demonstrated their success in face detection, recognition, and tracking. The representation in these subspace methods is based on second order statistics of the image set, and does not address higher order statistical dependencies such as the relationships among three or more pixels. Recently Higher Order Statistics and Independent Component Analysis (ICA) have been used as informative low dimensional representations for visual recognition. In this paper, we investigate the use of Kernel Principal Component Analysis and Kernel Fisher Linear Discriminant for learning low dimensional representations for face recognition, which we call Kernel Eigenface and Kernel Fisherface methods. While Eigenface and Fisherface methods aim to find projection directions based on the second order correlation of samples, Kernel Eigenface and Kernel Fisherface methods provide generalizations which take higher order correlations into account. We compare the performance of kernel methods with Eigenface, Fisherface and ICA-based methods for face recognition with variation in pose, scale, lighting and expression. Experimental results show that kernel methods provide better representations and achieve lower error rates for face recognition. 1 Motivation and Approach Subspace methods have been applied successfully in numerous visual recognition tasks such as face localization, face recognition, 3D object recognition, and tracking. In particular, Principal Component Analysis (PCA) [20] [13] ,and Fisher Linear Discriminant (FLD) methods [6] have been applied to face recognition with impressive results. While PCA aims to extract a subspace in which the variance is maximized (or the reconstruction error is minimized), some unwanted variations (due to lighting, facial expressions, viewing points, etc.) may be retained (See [8] for examples). It has been observed that in face recognition the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to the changes in face identity [1]. Therefore, while the PCA projections are optimal in a correlation sense (or for reconstruction
4 0.15337233 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
5 0.14977118 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
6 0.1483527 164 nips-2001-Sampling Techniques for Kernel Methods
7 0.14570336 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
8 0.13203888 139 nips-2001-Online Learning with Kernels
9 0.10549009 136 nips-2001-On the Concentration of Spectral Properties
10 0.099584378 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
11 0.089722671 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
12 0.087961905 155 nips-2001-Quantizing Density Estimators
13 0.087476373 22 nips-2001-A kernel method for multi-labelled classification
14 0.086126395 43 nips-2001-Bayesian time series classification
15 0.084082864 84 nips-2001-Global Coordination of Local Linear Models
16 0.080192253 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
17 0.07835833 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
18 0.076499715 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
19 0.074447677 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
20 0.074271135 13 nips-2001-A Natural Policy Gradient
topicId topicWeight
[(0, -0.262), (1, 0.091), (2, -0.004), (3, 0.022), (4, 0.106), (5, -0.087), (6, -0.016), (7, 0.044), (8, 0.074), (9, -0.096), (10, -0.019), (11, 0.086), (12, 0.016), (13, -0.083), (14, -0.041), (15, -0.132), (16, -0.088), (17, -0.023), (18, -0.05), (19, 0.243), (20, 0.183), (21, -0.041), (22, 0.208), (23, 0.059), (24, -0.088), (25, 0.065), (26, -0.036), (27, 0.058), (28, -0.06), (29, -0.029), (30, -0.106), (31, -0.003), (32, 0.033), (33, -0.036), (34, 0.004), (35, 0.019), (36, 0.016), (37, 0.043), (38, -0.0), (39, 0.061), (40, -0.054), (41, -0.112), (42, 0.003), (43, 0.054), (44, 0.145), (45, 0.038), (46, -0.013), (47, 0.054), (48, 0.04), (49, -0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.96277994 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
Author: Michael Collins, S. Dasgupta, Robert E. Schapire
Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
2 0.68798167 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
Author: Guy Lebanon, John D. Lafferty
Abstract: We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.
3 0.6795615 155 nips-2001-Quantizing Density Estimators
Author: Peter Meinicke, Helge Ritter
Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.
4 0.58435869 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
5 0.55476642 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
6 0.51281983 74 nips-2001-Face Recognition Using Kernel Methods
7 0.502949 164 nips-2001-Sampling Techniques for Kernel Methods
8 0.47206822 139 nips-2001-Online Learning with Kernels
9 0.45419118 154 nips-2001-Products of Gaussians
10 0.44383448 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
11 0.44013971 185 nips-2001-The Method of Quantum Clustering
12 0.42867839 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
13 0.4278298 136 nips-2001-On the Concentration of Spectral Properties
14 0.41619787 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
15 0.41226438 84 nips-2001-Global Coordination of Local Linear Models
16 0.41198489 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
17 0.3961859 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
18 0.38271493 35 nips-2001-Analysis of Sparse Bayesian Learning
19 0.37764713 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
20 0.37330681 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
topicId topicWeight
[(14, 0.05), (17, 0.03), (19, 0.039), (27, 0.222), (30, 0.071), (38, 0.026), (59, 0.061), (70, 0.172), (72, 0.082), (79, 0.057), (83, 0.026), (91, 0.101)]
simIndex simValue paperId paperTitle
1 0.95002568 31 nips-2001-Algorithmic Luckiness
Author: Ralf Herbrich, Robert C. Williamson
Abstract: In contrast to standard statistical learning theory which studies uniform bounds on the expected error we present a framework that exploits the specific learning algorithm used. Motivated by the luckiness framework [8] we are also able to exploit the serendipity of the training sample. The main difference to previous approaches lies in the complexity measure; rather than covering all hypotheses in a given hypothesis space it is only necessary to cover the functions which could have been learned using the fixed learning algorithm. We show how the resulting framework relates to the VC, luckiness and compression frameworks. Finally, we present an application of this framework to the maximum margin algorithm for linear classifiers which results in a bound that exploits both the margin and the distribution of the data in feature space. 1
2 0.93080521 155 nips-2001-Quantizing Density Estimators
Author: Peter Meinicke, Helge Ritter
Abstract: We suggest a nonparametric framework for unsupervised learning of projection models in terms of density estimation on quantized sample spaces. The objective is not to optimally reconstruct the data but instead the quantizer is chosen to optimally reconstruct the density of the data. For the resulting quantizing density estimator (QDE) we present a general method for parameter estimation and model selection. We show how projection sets which correspond to traditional unsupervised methods like vector quantization or PCA appear in the new framework. For a principal component quantizer we present results on synthetic and realworld data, which show that the QDE can improve the generalization of the kernel density estimator although its estimate is based on significantly lower-dimensional projection indices of the data.
same-paper 3 0.9192794 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
Author: Michael Collins, S. Dasgupta, Robert E. Schapire
Abstract: Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
4 0.82352483 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
5 0.81133336 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder
Author: Shiro Ikeda, Toshiyuki Tanaka, Shun-ichi Amari
Abstract: The mystery of belief propagation (BP) decoder, especially of the turbo decoding, is studied from information geometrical viewpoint. The loopy belief network (BN) of turbo codes makes it difficult to obtain the true “belief” by BP, and the characteristics of the algorithm and its equilibrium are not clearly understood. Our study gives an intuitive understanding of the mechanism, and a new framework for the analysis. Based on the framework, we reveal basic properties of the turbo decoding.
6 0.81117344 8 nips-2001-A General Greedy Approximation Algorithm with Applications
7 0.81067121 103 nips-2001-Kernel Feature Spaces and Nonlinear Blind Souce Separation
8 0.80944127 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
9 0.80830121 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
10 0.8050577 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
11 0.80213135 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
12 0.80125868 114 nips-2001-Learning from Infinite Data in Finite Time
13 0.79984403 190 nips-2001-Thin Junction Trees
14 0.79888624 13 nips-2001-A Natural Policy Gradient
15 0.79783046 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
16 0.79681945 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
17 0.79594898 188 nips-2001-The Unified Propagation and Scaling Algorithm
18 0.79316455 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
19 0.79073614 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
20 0.789051 84 nips-2001-Global Coordination of Local Linear Models