nips nips2003 nips2003-138 knowledge-graph by maker-knowledge-mining

138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models


Source: pdf

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Jakob J. Verbeek† , Sam T. Roweis‡ , and Nikos A. Vlassis† († Informatics Institute, University of Amsterdam; ‡ Department of Computer Science, University of Toronto). We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. [sent-3, score-0.16]

2 In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. [sent-4, score-0.416]

3 As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. [sent-6, score-0.337]

4 1 Introduction In this paper, we are interested in data that lies on or close to a low dimensional manifold embedded, possibly non-linearly, in a Euclidean space of much higher dimension. [sent-8, score-0.254]

5 We want to recover the structure of the data manifold, so that we can ‘unroll’ the data manifold and work with the data expressed in the underlying ‘latent coordinates’, i. [sent-15, score-0.194]

6 Learning low dimensional latent representations may be desirable for different reasons, such as compression for storage and communication, visualization of high dimensional data, or as preprocessing for further data analysis or prediction tasks. [sent-18, score-0.526]

7 In this paper, we consider a method to integrate several local feature extractors into a single global representation, similar to the approaches of [5, 6, 7, 8]. [sent-21, score-0.377]

8 These methods, as well as ours, deliver after training a functional mapping which can be used to convert previously unseen high dimensional observations into their low dimensional global coordinates. [sent-22, score-0.297]

9 Like most of the above algorithms, our method performs non-linear feature extraction by minimizing a convex objective function whose critical points can be characterized as eigenvectors of some matrix. [sent-23, score-0.22]

10 These algorithms are generally simple and efficient; one needs only to construct a matrix based on local feature analysis of the training data and then compute its largest or smallest eigenvectors using standard numerical methods. [sent-24, score-0.211]

11 In contrast, methods like generative topographic mapping[9] and self-organizing maps[10] are prone to local optima in the objective function. [sent-25, score-0.218]

12 Our method is based on the same intuitions as in earlier work: the idea is to learn a mixture of latent variable density models on the original training data so that each mixture component acts as a local feature extractor. [sent-26, score-0.875]

13 For example, we may use a mixture of factor analyzers or a mixture of principal component analyzers (PCA). [sent-27, score-0.548]

14 After this mixture has been learned, the local feature extractors are ‘coordinated’ by finding, for each model, a suitable linear mapping (and offset) from its latent variable space into a single ‘global’ low-dimensional coordinate system. [sent-28, score-0.97]

15 The local feature extractors together with the coordinating linear maps provide a global non-linear map from the data space to the latent space and back. [sent-29, score-0.934]

16 Learning the mixture is driven by a density signal – we want to place models near the training points, while the post-coordination is driven by the idea that when two different models place significant weight on the same point, they should agree on its mapping into the global space. [sent-30, score-0.296]

17 As in [6], we use a cross-entropy between a unimodal approximation and the true posterior over global coordinates to encourage agreement. [sent-32, score-0.195]

18 However we do not attempt to simultaneously learn the mixture model and coordinate since this causes severe problems with local minima. [sent-33, score-0.312]

19 Instead, as in [7, 8], we fix a specific mixture and then study the computations involved in coordinating its local representations. [sent-34, score-0.313]

20 We go on, in Section 4, to extend our algorithm to a setting in which multiple different observation spaces are available, each one related to the same underlying global space but through different nonlinear embeddings. [sent-37, score-0.233]

21 2 Non-linear PCA by aligning local feature extractors Consider a given data set X = {x1 , . . . , xN } and a collection of k local feature extractors. [sent-40, score-0.394]

22 fs (x) is a vector containing the (zero or more) features produced by model s. [sent-43, score-0.174]

23 We convert these activities into posterior responsibilities using a simple softmax: $p(s|x) = \exp(a_s(x)) / \sum_r \exp(a_r(x))$. [sent-45, score-0.161]
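
As a concrete illustration of the softmax above, the responsibilities q_ns can be computed from the expert activities with a few lines of NumPy. This is only a sketch; the array name `activities` (shape N x k) and the function name are illustrative, not taken from the paper.

```python
import numpy as np

def responsibilities(activities):
    """Convert expert activities a_s(x_n) into posterior responsibilities
    q_ns = exp(a_s(x_n)) / sum_r exp(a_r(x_n)) via a numerically stable softmax."""
    a = activities - activities.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Example: 5 points, 3 local models; each row of Q sums to one.
Q = responsibilities(np.random.randn(5, 3))
assert np.allclose(Q.sum(axis=1), 1.0)
```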

24 If the experts are actually components of a mixture, then setting the activities to the logarithm of the posteriors under the mixture will recover exactly the same posteriors above. [sent-46, score-0.339]

25 Next, we consider the relationship between the given representation of the data and the representation of the data in a global latent space, which we would like to find. [sent-47, score-0.42]

26 Throughout, we will use g to denote latent ’Global’ coordinates for data. [sent-48, score-0.392]

27 For the unobserved latent coordinate g corresponding to a data point xn and conditioned on s, we assume the density: $p(g|x_n, s) = \mathcal{N}(g;\, \kappa_s + A_s f_s(x_n),\, \sigma^2 I) = \mathcal{N}(g;\, g_{ns},\, \sigma^2 I)$, (1) where $\mathcal{N}(g; \mu, \Sigma)$ is a Gaussian distribution on g with mean µ and covariance Σ. [sent-49, score-1.006]

28 The mean, gns , of p(g|xn , s) is the sum of the component offset κs in the latent space and a linear transformation, implemented by As , of fs (xn ). [sent-50, score-0.814]

29 From now on we will use homogeneous coordinates and write $L_s = [A_s \;\; \kappa_s]$ and $z_{ns} = [f_s(x_n)^\top \;\; 1]^\top$, and thus $g_{ns} = L_s z_{ns}$. [sent-51, score-0.638]
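
The projections g_ns = L_s z_ns in homogeneous coordinates can be sketched as follows, assuming the local features f_s(x_n) are available as arrays; all names and shapes here are illustrative rather than prescribed by the paper.

```python
import numpy as np

def local_projections(F, L):
    """Compute g_ns = L_s z_ns with z_ns = [f_s(x_n); 1].

    F : list of k arrays, F[s] has shape (N, p_s) holding f_s(x_n) for all n.
    L : list of k arrays, L[s] has shape (d, p_s + 1), i.e. L_s = [A_s kappa_s].
    Returns G_ns with shape (N, k, d)."""
    N = F[0].shape[0]
    k = len(F)
    d = L[0].shape[0]
    G_ns = np.zeros((N, k, d))
    for s in range(k):
        Z = np.hstack([F[s], np.ones((N, 1))])   # homogeneous coordinates z_ns
        G_ns[:, s, :] = Z @ L[s].T               # rows are (L_s z_ns)^T
    return G_ns
```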

30 Consider the posterior distribution on latent coordinates given some data: $p(g|x) = \sum_s p(s, g|x) = \sum_s p(s|x)\, p(g|x, s)$. (2) [sent-52, score-0.431]

31 Given a fixed set of local feature extractors and corresponding activities, we are interested in finding linear maps Ls that give rise to ‘consistent’ projections of the data in the latent space. [sent-53, score-0.811]

32 If the predictions are in perfect agreement for a point xn , then all the gns are equal and the posterior p(g|x) is Gaussian; in general p(g|x) is a mixture of Gaussians. [sent-55, score-0.774]

33 $\Phi(\{Q_1, \ldots, Q_N\}) = \sum_{n,s} q_{ns}\, D(Q_n(g) \,\|\, p(g|x_n, s))$, (3) where we used $q_{ns}$ as a shorthand for $p(s|x_n)$ and $Q_n$ is a Gaussian with mean $g_n$ and covariance matrix $\Sigma_n$. [sent-62, score-0.652]

34 The objective sums for each data point xn and model s the Kullback-Leibler divergence D between a single Gaussian Qn (g) and the component densities p(g|x, s), weighted by the posterior p(s|xn ). [sent-63, score-0.382]

35 Minimizing over $g_n$ and $\Sigma_n$ we obtain: $g_n = \sum_s q_{ns}\, g_{ns}$ and $\Sigma_n = \sigma^2 I$, (4) where I denotes the identity matrix. [sent-67, score-1.322]

36 Skipping some additive and multiplicative constants with respect to the linear maps Ls , the objective Φ then simplifies to: $\Phi = \sum_{n,s} q_{ns} \|g_n - g_{ns}\|^2 = \tfrac{1}{2} \sum_{n,s,t} q_{ns} q_{nt} \|g_{nt} - g_{ns}\|^2 \ge 0$. (5) [sent-68, score-1.931]

37 The main attraction of this setup is that our objective is a quadratic function of the linear maps Ls , as in [7, 8]. [sent-69, score-0.173]
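
Given the responsibilities and the per-model projections, the mixture means (4) and the simplified objective (5) can be evaluated directly. A minimal sketch, reusing the illustrative array layout from the snippets above:

```python
import numpy as np

def alignment_objective(Q, G_ns):
    """Evaluate Phi = sum_{n,s} q_ns ||g_n - g_ns||^2 with g_n = sum_s q_ns g_ns (eqs. 4, 5).

    Q    : (N, k) responsibilities q_ns.
    G_ns : (N, k, d) per-model projections.
    Returns (Phi, G) where the rows of G are the expected coordinates g_n."""
    G = np.einsum('ns,nsd->nd', Q, G_ns)        # g_n = sum_s q_ns g_ns
    diff = G[:, None, :] - G_ns                 # g_n - g_ns for every model s
    phi = np.einsum('ns,nsd->', Q, diff ** 2)   # responsibility-weighted squared distances
    return phi, G
```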

38 Note that from (4) and (6) we have: $g_n = (u_n L)^\top$. [sent-81, score-0.333]

39 The expected projection coordinates can thus be computed as: $G = [g_1 \cdots g_N]^\top = U L$. [sent-82, score-0.142]

40 We define the block-diagonal matrix D with k blocks given by $D_s = \sum_n q_{ns}\, z_{ns} z_{ns}^\top$. [sent-86, score-0.474]
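
The quantities above can be assembled explicitly. The sketch below builds U, whose n-th row concatenates the responsibility-weighted homogeneous features q_ns z_ns^T over the k models (so that g_n = (u_n L)^T and G = U L), and the block-diagonal matrix D; the exact layout of U is our reading of (6), which is not fully reproduced in the extracted sentences.

```python
import numpy as np
from scipy.linalg import block_diag

def build_U_and_D(Q, F):
    """Assemble U (rows u_n = [q_n1 z_n1^T ... q_nk z_nk^T]) and the block-diagonal
    matrix D with blocks D_s = sum_n q_ns z_ns z_ns^T, so that G = U L.

    Q : (N, k) responsibilities; F : list of k arrays (N, p_s) of local features."""
    N, k = Q.shape
    Z = [np.hstack([F[s], np.ones((N, 1))]) for s in range(k)]   # homogeneous z_ns
    U = np.hstack([Q[:, [s]] * Z[s] for s in range(k)])
    D = block_diag(*[(Q[:, [s]] * Z[s]).T @ Z[s] for s in range(k)])
    return U, D
```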

41 The objective function is invariant to translation and rotation of the global latent space, and re-scaling the latent space changes the objective monotonically, cf. [sent-88, score-0.947]

42 Figure 1: Data in IR3 with local charts indicated by the axes (left). [sent-113, score-0.207]

43 This embedding is uninformative since it is constant; therefore we select the eigenvectors corresponding to the second up to the (d + 1)-st smallest eigenvalues to obtain the best embedding in d dimensions. [sent-123, score-0.14]

44 Note that, as mentioned in [7], this framework enables us to use feature extractors that provide different numbers of features. [sent-124, score-0.26]

45 The plots show the original data presented to the algorithm (left) and the 2-dimensional latent coordinates $g_n = \sum_s q_{ns}\, g_{ns}$ found by the algorithm (right). [sent-127, score-1.413]

46 The only information the mixture model provides is the posterior probabilities collected in the matrix Q with $[Q]_{ns} = q_{ns} = p(s|x_n)$. [sent-129, score-0.53]

47 In that case: $g_{ns} = \kappa_s$, $U = Q$, $L = [\kappa_1 \cdots \kappa_k]^\top$, and $\Phi = \mathrm{Tr}\{L^\top (D - A) L\} = \sum_{s,t} \|\kappa_s - \kappa_t\|^2 \sum_n q_{ns} q_{nt}$, (9, 10) [sent-130, score-0.362]

48 where $A = Q^\top Q$ is an adjacency matrix with $[A]_{st} = \sum_n q_{ns} q_{nt}$ and D is the diagonal degree matrix of A with $[D]_{ss} = \sum_t A_{st} = \sum_n q_{ns}$. [sent-133, score-1.047]

49 Optimization under the constraints of zero mean and identity covariance leads to the generalized eigenproblem: $(D - A)v = \lambda A v \;\Leftrightarrow\; (D - A)v = \frac{\lambda}{1+\lambda}\, D v$. (11) The optimization problem is exactly the Laplacian Eigenmaps algorithm [4], but applied to the mixture components instead of the data points. [sent-134, score-0.282]
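
For this degenerate case the whole computation fits in a few lines: build A = Q^T Q and the degree matrix D from the responsibility matrix Q, solve the generalized eigenproblem (D - A)v = λ' D v, and keep the eigenvectors for the second up to the (d+1)-st smallest eigenvalues as the offsets kappa_s. A sketch using SciPy's symmetric generalized eigensolver; variable names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def align_offsets(Q, d):
    """Special case with zero-dimensional feature extractors: find the offsets
    kappa_s by solving (D - A) v = lambda' D v, where A = Q^T Q and
    D = diag(sum_n q_ns), keeping the eigenvectors for the 2nd ... (d+1)-st
    smallest eigenvalues (the smallest one gives the uninformative constant vector)."""
    A = Q.T @ Q                          # adjacency between mixture components
    D = np.diag(Q.sum(axis=0))           # degree matrix, [D]_ss = sum_n q_ns
    w, V = eigh(D - A, D)                # generalized symmetric eigenproblem, ascending w
    kappa = V[:, 1:d + 1]                # k x d latent offsets kappa_s
    return kappa
```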

50 Since we do not use any feature extractors in this setting, it can be applied to mixture models that model data for which it is hard to design feature extractors, e. [sent-135, score-0.549]

51 Thus, we can use mixture densities without latent variables, e. [sent-138, score-0.493]

52 Notice that in this manner the mixture model not only provides a soft grouping of the data through the posteriors, but also an adjacency matrix between the groups. [sent-141, score-0.258]

53 4 Non-linear CCA by aligning local feature extractors Canonical Correlation Analysis (CCA) is a data analysis method that finds correspondences between two or more sets of measurements. [sent-142, score-0.508]

54 Our main interest in this paper is to develop a nonlinear extension of CCA which works when the different measurements come from separate nonlinear manifolds that share an underlying global coordinate system. [sent-145, score-0.313]

55 Non-linear CCA can be trained to find a shared low dimensional embedding for both manifolds, exploiting the pairwise correspondence provided by the data set. [sent-146, score-0.216]

56 This is easily shown to be equivalent to minimizing: $E = \frac{1}{2} \sum_n [a^\top x_n - b^\top y_n]^2$ (12) under the constraint that $a^\top [\sum_n x_n x_n^\top] a + b^\top [\sum_n y_n y_n^\top] b = 1$. [sent-159, score-0.414]
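
For reference, classical linear CCA itself (the problem defined by (12)) can be computed in closed form by whitening each set and taking an SVD of the cross-covariance. This sketch is the standard linear method, not the non-linear extension developed in the paper; the regularization term and variable names are our own additions.

```python
import numpy as np

def linear_cca(X, Y, d=1, reg=1e-8):
    """Standard linear CCA via whitening + SVD: returns projection matrices
    A (Dx x d) and B (Dy x d) maximizing the correlation of X @ A and Y @ B."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)        # correlations are the singular values
    return Wx @ U[:, :d], Wy @ Vt[:d].T
```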

57 We can also generalize by mapping to IRd instead of the real line, and then requiring the sum of the covariance matrices of the projections to be identity. [sent-161, score-0.156]

58 In the generalized CCA setting with multiple point-sets, allowing translations and linear mappings to IRd , the objective is to minimize the squared distance between all pairs of projections under the same constraint as above. [sent-163, score-0.287]

59 We denote the projection of the n-th point in the s-th point-set as $g_{ns}$ and let $g_n = \frac{1}{k} \sum_s g_{ns}$. [sent-164, score-1.132]

60 We then minimize the error function: $\Phi_{CCA} = \frac{1}{2k^2} \sum_{n,s,t} \|g_{ns} - g_{nt}\|^2 = \frac{1}{k} \sum_{n,s} \|g_{ns} - g_n\|^2$. (13) [sent-165, score-1.102]

61 The objective Φ in equation (5) coincides with $\Phi_{CCA}$ if $q_{ns} = 1/k$. [sent-166, score-0.393]
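
Both forms of Phi_CCA in (13) can be checked numerically; the sketch below evaluates the pairwise form and the form around the per-point mean and asserts that they agree (array names are illustrative).

```python
import numpy as np

def phi_cca(G_ns):
    """Evaluate (13) for k point-sets: G_ns has shape (N, k, d) holding the
    projection of point n by set s. Both forms of the objective coincide."""
    N, k, d = G_ns.shape
    G = G_ns.mean(axis=1)                                   # g_n = (1/k) sum_s g_ns
    pairwise = ((G_ns[:, :, None, :] - G_ns[:, None, :, :]) ** 2).sum() / (2 * k ** 2)
    to_mean = ((G_ns - G[:, None, :]) ** 2).sum() / k
    assert np.allclose(pairwise, to_mean)                   # the two forms in (13) agree
    return to_mean
```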

62 We consider different point sets, each having a mixture of locally valid linear projections into the ‘global’ latent space that is now shared by all mixture components and point sets. [sent-171, score-0.903]

63 we have pairs of projections due to the same point set and also pairs that combine projections from different point sets. [sent-174, score-0.242]

64 We use c as an index ranging over the C different observation spaces, and write $q^c_{ns}$ for the posterior on component s for observation n in observation space c. [sent-175, score-0.252]

65 Similarly, we use $g^c_{ns}$ to denote the projection due to component s from space c. [sent-176, score-0.479]

66 The average projection due to observation space c is then denoted by $g^c_n = \sum_s q^c_{ns}\, g^c_{ns}$. [sent-177, score-1.103]

67 We use index r to range over all mixture components and observation spaces, so that $q_{nr} = \frac{1}{C}\, p(s|x_n)$ if r corresponds to (c = 1, s) and $q_{nr} = \frac{1}{C}\, p(s|y_n)$ if r corresponds to (c = 2, s), i. [sent-178, score-0.41]

68 The overall average projection then becomes: $g_n = \frac{1}{C} \sum_c g^c_n = \sum_r q_{nr}\, g_{nr}$. [sent-181, score-0.847]

69 The objective (5) can now be rewritten as: $\Phi = \sum_{n,r} q_{nr} \|g_{nr} - g_n\|^2 = \frac{1}{C} \sum_{c,n} \|g^c_n - g_n\|^2 + \frac{1}{C} \sum_{c,n,s} q^c_{ns} \|g^c_n - g^c_{ns}\|^2$. (14) [sent-182, score-1.928]

70 Observe how in (14) the objective sums between-point-set consistency of the projections (first summand) and within-point-set consistency of the projections (second summand). [sent-183, score-0.434]
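
The decomposition in (14) can likewise be evaluated term by term; the sketch below computes the between-set and within-set consistency parts given per-space responsibilities and projections (the list-of-arrays layout is an illustrative choice, not the paper's notation).

```python
import numpy as np

def multi_set_objective(Qs, Gs):
    """Evaluate the two terms of (14): between-set consistency of the average
    projections and within-set consistency around each set's own average.

    Qs : list of C arrays (N, k_c) of responsibilities q^c_ns.
    Gs : list of C arrays (N, k_c, d) of projections g^c_ns."""
    C = len(Qs)
    Gc = [np.einsum('ns,nsd->nd', Q, G) for Q, G in zip(Qs, Gs)]   # g^c_n per space
    G = sum(Gc) / C                                                # overall average g_n
    between = sum(((g - G) ** 2).sum() for g in Gc) / C
    within = sum(np.einsum('ns,nsd->', Q, (g[:, None, :] - Gm) ** 2)
                 for Q, Gm, g in zip(Qs, Gs, Gc)) / C
    return between + within, between, within
```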

71 Robustness to variation in the mixture fitting can be improved by using several sets of charts fitted to the same manifold. [sent-214, score-0.394]

72 We can then align all these sets of charts by optimizing (14). [sent-215, score-0.197]

73 This aligns the charts within each set and at the same time makes sure the different sets of charts are mutually aligned, providing important regularization, since now every point is modeled by several local models. [sent-216, score-0.488]

74 Note that if the charts and responsibilities are obtained using a mixture of PCA or factor analyzers, the local linear mappings to the latent space induce a Gaussian mixture in the latent space. [sent-217, score-1.386]

75 This mixture can be used to compute responsibilities on components given latent coordinates. [sent-218, score-0.583]

76 Also, for each linear map from the data to the latent space we can compute a pseudo inverse projecting back. [sent-219, score-0.417]

77 By averaging the individual back projections with the responsibilities computed in latent space we obtain a projection from the latent space to the data space. [sent-220, score-0.922]
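
The latent-to-data mapping described here can be sketched for a mixture of PCAs: invert each chart's latent map with a pseudo-inverse, reconstruct in the data space through that component's PCA basis, and average with latent-space responsibilities. Every argument name below is illustrative, and the responsibilities p(s|g) are assumed to be supplied (e.g. from the induced latent Gaussian mixture).

```python
import numpy as np

def latent_to_data(g, R, mus, Vs, As, kappas):
    """Map a latent coordinate back to the data space by averaging per-component
    back-projections with latent-space responsibilities (sketch for a mixture of PCAs).

    g      : (d,) latent coordinate.
    R      : (k,) responsibilities p(s|g) from the induced latent mixture.
    mus    : list of k data-space component means.
    Vs     : list of k PCA bases, Vs[s] has shape (D, p_s).
    As     : list of k latent maps, As[s] has shape (d, p_s).
    kappas : list of k latent offsets kappa_s."""
    x_hat = np.zeros_like(mus[0], dtype=float)
    for s in range(len(mus)):
        f_s = np.linalg.pinv(As[s]) @ (g - kappas[s])   # invert g ~ kappa_s + A_s f_s
        x_hat += R[s] * (mus[s] + Vs[s] @ f_s)          # component's data-space reconstruction
    return x_hat
```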

78 When using linear CCA for data that is non-linearly embedded, reconstructions will be poor since linear CCA can only map into a low dimensional linear subspace. [sent-223, score-0.33]

79 To both point sets we added Gaussian noise and we learned a 10 component mixture model on both sets. [sent-227, score-0.309]

80 In Fig. 2 the (clearly successfully) discovered latent coordinates are plotted against the coordinate on the generating curve. [sent-229, score-0.45]

81 We cut these images in half to obtain our two sets of images. [sent-233, score-0.151]

82 Both image halves were modeled with a mixture of 40 components. [sent-235, score-0.289]

83 In Fig. 3 some generated right-half images based on the left half are shown. [sent-237, score-0.168]

84 The second experiment concerns appearance-based pose estimation of an object. [sent-238, score-0.171]

85 One point set consists of a pixel representation of images of an object and the other point set contains the corresponding pose of the camera w. [sent-239, score-0.37]

86 For the pose parameters we used the identity to ‘extract’ features (i. [sent-243, score-0.171]

87 A mixture of 40 PCA’s was trained on the image data and aligned with the pose parameters in a 2-dimensional latent space. [sent-247, score-0.838]

88 Fig. 3 shows reconstructions of the images conditioned on various pose inputs (the left image of each pair is the reconstruction based on the pose of the right image). [sent-249, score-0.572]

89 Figure 3: Right half of the images was generated given the left half using the trained model (left). [sent-253, score-0.196]

90 In the third experiment we use the same images as in the second experiment, but replace the direct (low dimensional) supervision signal of the pose parameters with (high dimensional) correspondences in the form of images of another object in corresponding poses. [sent-255, score-0.418]

91 We trained a mixture of 40 PCA’s on both image sets (2000 images of 64×64 pixels in each set) and aligned these in a 3-dimensional latent space. [sent-256, score-0.747]

92 Comparing the pose of an object to the pose of the nearest (in latent space) image from the other object the std. [sent-257, score-0.797]

93 5 Discussion In this paper, we have extended alignment methods for single manifold nonlinear dimensionality reduction to perform non-linear CCA using measurements from multiple manifolds. [sent-268, score-0.245]

94 We have also shown the close relationship with Laplacian Eigenmaps[4] in the degenerate case of a single manifold and feature extractors of zero dimensionality. [sent-269, score-0.33]

95 Computing these weights requires access to the original data directly, not just through the “interface” of the mixture model. [sent-272, score-0.254]

96 In [11] the authors consider how to find low dimensional representations for multiple point sets simultaneously, given few correspondences between the point sets. [sent-275, score-0.289]

97 The use of multiple sets of charts for one data set is similar in spirit to the self-correspondence technique of [11], where the data is split into several overlapping sets used to stabilize the generalized LLE. [sent-278, score-0.336]

98 Figure 4: I1 : image in first set (a), I2 : corresponding image in second set (b), closest image in second set (in latent space) to I1 (c), reconstruction of I2 given I1 (d). [sent-279, score-0.508]

99 Laplacian eigenmaps and spectral techniques for embedding and clustering. [sent-309, score-0.139]

100 How to measure the pose robustness of object views. [sent-362, score-0.221]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gns', 0.362), ('cca', 0.358), ('gn', 0.333), ('latent', 0.296), ('qns', 0.294), ('extractors', 0.2), ('mixture', 0.197), ('pose', 0.171), ('charts', 0.15), ('xn', 0.147), ('eigenmaps', 0.1), ('objective', 0.099), ('coordinates', 0.096), ('projections', 0.092), ('qnr', 0.09), ('responsibilities', 0.09), ('zns', 0.09), ('ls', 0.083), ('dimensional', 0.081), ('laplacian', 0.079), ('pca', 0.076), ('reconstructions', 0.071), ('manifold', 0.07), ('qnt', 0.068), ('correspondences', 0.067), ('images', 0.065), ('global', 0.06), ('feature', 0.06), ('yn', 0.06), ('qn', 0.06), ('image', 0.059), ('coordinating', 0.059), ('analyzers', 0.059), ('coordinate', 0.058), ('fs', 0.057), ('local', 0.057), ('posteriors', 0.055), ('aligned', 0.055), ('alignment', 0.052), ('coordination', 0.051), ('un', 0.05), ('nonlinear', 0.05), ('object', 0.05), ('sets', 0.047), ('projection', 0.046), ('maps', 0.046), ('gnr', 0.045), ('gnt', 0.045), ('ird', 0.045), ('lle', 0.045), ('aligning', 0.045), ('ns', 0.043), ('measurements', 0.041), ('mappings', 0.04), ('half', 0.039), ('summand', 0.039), ('peters', 0.039), ('lk', 0.039), ('posterior', 0.039), ('embedding', 0.039), ('mapping', 0.039), ('low', 0.036), ('component', 0.036), ('eigenvectors', 0.036), ('roweis', 0.036), ('latitude', 0.036), ('longitude', 0.036), ('reconstruction', 0.035), ('space', 0.035), ('halves', 0.033), ('observation', 0.033), ('activities', 0.032), ('canonical', 0.032), ('dimensionality', 0.032), ('data', 0.032), ('optima', 0.031), ('topographic', 0.031), ('extends', 0.031), ('point', 0.029), ('adjacency', 0.029), ('generalized', 0.028), ('trained', 0.028), ('mixtures', 0.028), ('linear', 0.028), ('volume', 0.028), ('underlying', 0.028), ('dv', 0.027), ('spaces', 0.027), ('translation', 0.027), ('camera', 0.026), ('manifolds', 0.026), ('correlation', 0.026), ('smallest', 0.026), ('map', 0.026), ('december', 0.025), ('covariance', 0.025), ('points', 0.025), ('consistency', 0.025), ('weights', 0.025), ('generated', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1

2 0.20283252 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

Author: Neil D. Lawrence

Abstract: In this paper we introduce a new underlying probabilistic model for principal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior’s covariance function constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance functions which allow non-linear mappings. This more general Gaussian process latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to ‘twin kernel PCA’ in which a mapping between feature spaces occurs.

3 0.15380467 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

Author: Denis V. Chigirev, William Bialek

Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). 
What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. 
They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity. 3 Dimensionality reduction as compression Suppose that we have a data set X in a high dimensional state space RD described by a density function ρ(x). We would like to find a “simplified” description of this data set. One may do so by visualizing a lower dimensional manifold M that “almost” describes the data. If we have a manifold M and a stochastic map PM : x → PM (µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map: M → RD ; it can be obtained by Bayes’ rule. The manifold description (M, PM ) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, PM , ρ): ρ(x)PM (µ|x) x − µ 2 dD xDµ. D(M, PM , ρ) = x∈RD (1) µ∈M Here we have assumed the Euclidean distance function for simplicity. The stochastic map, PM (µ|x), together with the density, ρ(x), define a joint probability function P (M, X) that allows us to calculate the mutual information between the data and its manifold representation: I(X, M) = P (x, µ) log x∈X µ∈M P (x, µ) dD xDµ. ρ(x)PM (µ) (2) This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data. Ideally, we would like to obtain a manifold description {M, PM (M|X)} of the data set X that provides both a low distortion D(M, PM , ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold that can be constructed. So there is a trade off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold. DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, PM (M|X)) that minimizes the distortion D(M, PM , X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X). Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at value D. The shape and the dimensionality of optimal manifold depends on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve. 
To find the optimal manifold (M(I), PM(I) ) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade off between information and distortion. Then optimal manifold M(I) minimizes the functional: F(M, PM ) = D + λI. (3) Let us parametrize the manifold M by t (presumably t ∈ Rd for some d ≤ D). The function γ(t) : t → M maps the points from the parameter space onto the manifold and therefore describes the manifold. Our equations become: D = dD x dd t ρ(x)P (t|x) x − γ(t) 2 , I = dD x dd t ρ(x)P (t|x) log P (t|x) , P (t) F(γ(t), P (t|x)) = D + λI. (4) (5) (6) Note that both information and distortion measures are properties of the manifold description doublet {M, PM (M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds δF/δγ(t) = 0 and δF/δP (t|x) = 0, to obtain the following set of self consistent equations: P (t) = γ(t) = P (t|x) = Π(x) = dD x ρ(x)P (t|x), 1 dD x xρ(x)P (t|x), P (t) P (t) − 1 x−γ (t) 2 e λ , Π(x) 2 1 dd t P (t)e− λ x−γ (t) . (7) (8) (9) (10) In practice we do not have the full density ρ(x), but only a discrete number of samples. 1 So we have to approximate ρ(x) = N δ(x − xi ), where N is the number of samples, i is the sample label, and xi is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t1 , t2 , ..., tK } of K points to model the manifold. Note that in (7 − 10) the variable t appears only as an argument for other functions, so we can replace the integral over t by a sum over k = 1..K. Then P (t|x) becomes Pk (xi ),γ(t) is now γ k , and P (t) is Pk . The solution to the resulting set of equations in discrete variables (11 − 14) can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in RD . The iteration scheme becomes: (n) Pk (n) γk,α = = N 1 N (n) Pk (xi ) = Π(n) (xi ) N 1 1 (n) N P k where α (11) i=1 = (n) xi,α Pk (xi ), (12) i=1 1, . . . , D, K (n) 1 (n) Pk e− λ xi −γ k 2 (13) k=1 (n) (n+1) Pk (xi ) = (n) 2 Pk 1 . e− λ xi −γ k (n) (x ) Π i (14) 0 0 One can initialize γk and Pk (xi ) by choosing K points at random from the data set and 0 letting γk = xi(k) and Pk = 1/K, then use equations (13) and (14) to initialize the 0 association map Pk (xi ). The iteration procedure (11 − 14) is terminated once n−1 n max |γk − γk | < , (15) k where determines the precision with which the manifold points are located. The above algorithm requires the information distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P (M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it’s monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I. 4 Evaluating the solution The result of our algorithm is a collection of K manifold points, γk ∈ M ⊂ RD , and a stochastic projection map, Pk (xi ), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on the manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as rd . 
On the other hand, this will not be necessarily true for the original data set, since it is more spread out and resembles locally the whole embedding space RD . The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral: 2 C(r) = N (N − 1) N N H(r − |xi − xj |), (16) i=1 j>i where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within the ball of radius r. Then define 0 original data manifold representation -2 ln C(r) -4 -6 -8 -10 -12 -14 -5 -4 -3 -2 -1 0 1 2 3 4 ln r Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1 and the final positions of 100 manifold points. (b) Log log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles). the correlation dimension at length scale r as the slope on the log log plot. dcorr (r) = d log C(r) . d log r (17) For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to finite span of the data and curvature effects and therefore does not provide a reliable estimator of intrinsic dimensionality. 5 5.1 Examples Semi-Circle We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have dcorr (r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional. A note about the parameters. Letting K → ∞ does not alter the solution. The information I and distortion D remain the same, and the additional points γk also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important. At high information rate the solution undergoes 3.5 3 3 3 2.5 2.5 2 2.5 2 2 1.5 1.5 1.5 1 1 1 0.5 0.5 0 0.5 -0.5 0 0 -1 5 -0.5 -0.5 4 1 3 0.5 2 -1 -1 0 1 -0.5 0 -1 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 -1.5 -1.5 -1 -0.5 0 0.5 1 1.5 Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on a surface of an S-shaped sheet in 3D. (b) Normal noise added. XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization. 
a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional). 5.2 S-surface Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provided its projection onto an XY plane for an illustrative purpose (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c). 6 Discussion The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid hgihly folded, Peano curve like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature) we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces “smoothness” of the manifold only implicitly, but nonetheless seems to work. Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9-10) and (13-14) provide the mapping onto the manifold. Eqn. (7) gives the probability distribution over the latent variable, known in the density modeling literature as “the prior.” The running time of the algorithm is linear in N . This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K usually is chosen as large as possible, given the computational constraints, to have a dense sampling of the manifold. However, a value of K << N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value of K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space). The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ/2 ∼ Lmax ). The dependence of the optimal manifold M(I) on information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming. References [1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press. 
[2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516. [3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. [4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290 , 2319–2323. [5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441,498-520. [6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation,10, 215–234. [7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press. [8] Scholkopf, B., Smola, A. & Muller K-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319. [9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243. [10] Belkin M. & Niyogi P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396. [11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473. [12] Grassberger, P., & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.

4 0.12986223 92 nips-2003-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another - relevance - variable. An interesting question addressed in the current work is the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation for the important case of multivariate Gaussian variables. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized correlation matrix Σx|y Σ−1 , which x is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff - the information curve - in terms of the eigenvalue spectrum. 1

5 0.11090966 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Author: Yoshua Bengio, Jean-françcois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1

6 0.099256225 120 nips-2003-Locality Preserving Projections

7 0.091807522 170 nips-2003-Self-calibrating Probability Forecasting

8 0.088629514 88 nips-2003-Image Reconstruction by Linear Programming

9 0.088238202 128 nips-2003-Minimax Embeddings

10 0.08461795 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

11 0.083801933 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

12 0.078522585 130 nips-2003-Model Uncertainty in Classical Conditioning

13 0.075832576 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence

14 0.07469596 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation

15 0.074393265 171 nips-2003-Semi-Definite Programming by Perceptron Learning

16 0.071100697 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

17 0.068508632 17 nips-2003-A Sampled Texture Prior for Image Super-Resolution

18 0.068328016 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

19 0.067059293 12 nips-2003-A Model for Learning the Semantics of Pictures

20 0.065991782 115 nips-2003-Linear Dependent Dimensionality Reduction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.203), (1, -0.096), (2, 0.006), (3, 0.006), (4, -0.137), (5, 0.144), (6, 0.07), (7, -0.052), (8, 0.059), (9, -0.232), (10, -0.083), (11, -0.076), (12, 0.067), (13, 0.024), (14, 0.033), (15, -0.074), (16, -0.008), (17, 0.155), (18, 0.125), (19, 0.071), (20, -0.061), (21, 0.134), (22, -0.071), (23, 0.021), (24, 0.114), (25, -0.03), (26, 0.036), (27, -0.133), (28, 0.033), (29, -0.013), (30, 0.039), (31, 0.053), (32, -0.049), (33, 0.065), (34, 0.12), (35, 0.026), (36, 0.014), (37, 0.051), (38, 0.032), (39, -0.11), (40, 0.026), (41, -0.066), (42, 0.09), (43, -0.096), (44, 0.147), (45, -0.015), (46, -0.024), (47, 0.135), (48, 0.144), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95151061 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1

2 0.79320186 77 nips-2003-Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data

Author: Neil D. Lawrence

Abstract: In this paper we introduce a new underlying probabilistic model for principal component analysis (PCA). Our formulation interprets PCA as a particular Gaussian process prior on a mapping from a latent space to the observed data-space. We show that if the prior’s covariance function constrains the mappings to be linear the model is equivalent to PCA, we then extend the model by considering less restrictive covariance functions which allow non-linear mappings. This more general Gaussian process latent variable model (GPLVM) is then evaluated as an approach to the visualisation of high dimensional data for three different data-sets. Additionally our non-linear algorithm can be further kernelised leading to ‘twin kernel PCA’ in which a mapping between feature spaces occurs.

3 0.64802682 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

Author: Denis V. Chigirev, William Bialek

Abstract: We introduce an information theoretic method for nonparametric, nonlinear dimensionality reduction, based on the infinite cluster limit of rate distortion theory. By constraining the information available to manifold coordinates, a natural probabilistic map emerges that assigns original data to corresponding points on a lower dimensional manifold. With only the information-distortion trade off as a parameter, our method determines the shape of the manifold, its dimensionality, the probabilistic map and the prior that provide optimal description of the data. 1 A simple example Some data sets may not be as complicated as they appear. Consider the set of points on a plane in Figure 1. As a two dimensional set, it requires a two dimensional density ρ(x, y) for its description. Since the data are sparse the density will be almost singular. We may use a smoothing kernel, but then the data set will be described by a complicated combination of troughs and peaks with no obvious pattern and hence no ability to generalize. We intuitively, however, see a strong one dimensional structure (a curve) underlying the data. In this paper we attempt to capture this intuition formally, through the use of the infinite cluster limit of rate distortion theory. Any set of points can be embedded in a hypersurface of any intrinsic dimensionality if we allow that hypersurface to be highly “folded.” For example, in Figure 1, any curve that goes through all the points gives a one dimensional representation. We would like to avoid such solutions, since they do not help us discover structure in the data. Looking for a simpler description one may choose to penalize the curvature term [1]. The problem with this approach is that it is not easily generalized to multiple dimensions, and requires the dimensionality of the solution as an input. An alternative approach is to allow curves of all shapes and sizes, but to send the reduced coordinates through an information bottleneck. With a fixed number of bits, position along a highly convoluted curve becomes uncertain. This will penalize curves that follow the data too closely (see Figure 1). There are several advantages to this approach. First, it removes the artificiality introduced by Hastie [2] of adding to the cost function only orthogonal errors. If we believe that data points fall out of the manifold due to noise, there is no reason to treat the projection onto the manifold as exact. Second, it does not require the dimension- 9 8 Figure 1: Rate distortion curve for a data set of 25 points (red). We used 1000 points to represent the curve which where initialized by scattering them uniformly on the plane. Note that the produced curve is well defined, one dimensional and smooth. 7 6 5 4 3 2 1 0 2 4 6 8 10 12 ality of the solution manifold as an input. By adding extra dimensions, one quickly looses the precision with which manifold points are specified (due to the fixed information bottleneck). Hence, the optimal dimension emerges naturally. This also means that the method works well in many dimensions with no adjustments. Third, the method handles sparse data well. This is important since in high dimensional spaces all data sets are sparse, i.e. they look like points in Figure 1, and the density estimation becomes impossible. Luckily, if the data are truly generated by a lower dimensional process, then density estimation in the data space is not important (from the viewpoint of prediction or any other). 
What is critical is the density of the data along the manifold (known in latent variable modeling as a prior), and our algorithm finds it naturally. 2 Latent variable models and dimensionality reduction Recently, the problem of reducing the dimensionality of a data set has received renewed attention [3,4]. The underlying idea, due to Hotelling [5], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables. Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables. Traditional tools of principal component analysis (PCA) and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. To accommodate nonlinearity in a data set, one has to relax the assumption that the data is modeled by a hyperplane, and allow a general low dimensional manifold of unknown shape and dimensionality. The same questions that we asked in the previous section apply here. What do we mean by requiring that “the manifold models the data well”? In the next section, we formalize this notion by defining the manifold description of data as a doublet (the shape of the manifold and the projection map). Note that we do not require the probability distribution over the manifold (known for generative models [6,7] as a prior distribution over the latent variables and postulated a priori). It is completely determined by the doublet. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linearized and PCA will apply. Kernel methods [8] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel. Alternatively, autoassociative neural networks [9] force the data through a bottleneck (with an internal layer of desired dimensionality) to produce a reduced description. One of the disadvantages of these methods is that the results are not easy to interpret. Recent attempts to describe a data set with a low dimensional representation generally follow into two categories: spectral methods and density modeling methods. Spectral methods (LLE [3], ISOMAP [4], Laplacian eigenmaps [10]) give reduced coordinates of an a priori dimensionality by introducing a quadratic cost function in reduced coordinates (hence eigenvectors are solutions) that mimics the relationships between points in the original data space (geodesic distance for ISOMAP, linear reconstruction for LLE). Density modeling methods (GTM [6], GMM [7]) are generative models that try to reproduce the data with fewer variables. They require a prior and a parametric generative model to be introduced a priori and then find optimal parameters via maximum likelihood. The approach that we will take is inspired by the work of Kramer [9] and others who tried to formulate dimensionality reduction as a compression problem. 
They tried to solve the problem by building an explicit neural network encoder-decoder system which restricted the information implicitly by limiting the number of nodes in the bottleneck layer. Extending their intuition with the tools of information theory, we recast dimensionality reduction as a compression problem where the bottleneck is the information available to manifold coordinates. This allows us to define the optimal manifold description as that which produces the best reconstruction of the original data set, given that the coordinates can only be transmitted through a channel of fixed capacity.

3 Dimensionality reduction as compression

Suppose that we have a data set X in a high dimensional state space R^D described by a density function ρ(x). We would like to find a "simplified" description of this data set. One may do so by visualizing a lower dimensional manifold M that "almost" describes the data. If we have a manifold M and a stochastic map P_M : x → P_M(µ|x) to points µ on the manifold, we will say that they provide a manifold description of the data set X. Note that the stochastic map here is well justified: if a data point does not lie exactly on the manifold, then we should expect some uncertainty in the estimation of the value of its latent variables. Also note that we do not need to specify the inverse (generative) map M → R^D; it can be obtained by Bayes' rule.

The manifold description (M, P_M) is a less than faithful representation of the data. To formalize this notion we will introduce the distortion measure D(M, P_M, ρ):

D(M, P_M, ρ) = ∫_{x∈R^D} ∫_{µ∈M} ρ(x) P_M(µ|x) ‖x − µ‖² d^D x Dµ.   (1)

Here we have assumed the Euclidean distance function for simplicity. The stochastic map P_M(µ|x), together with the density ρ(x), defines a joint probability function P(M, X) that allows us to calculate the mutual information between the data and its manifold representation:

I(X, M) = ∫_{x∈X} ∫_{µ∈M} P(x, µ) log [ P(x, µ) / (ρ(x) P_M(µ)) ] d^D x Dµ.   (2)

This quantity tells us how many bits (on average) are required to encode x into µ. If we view the manifold representation of X as a compression scheme, then I(X, M) tells us the necessary capacity of the channel needed to transmit the compressed data.

Ideally, we would like to obtain a manifold description {M, P_M(M|X)} of the data set X that provides both a low distortion D(M, P_M, ρ) and a good compression (i.e. small I(X, M)). The more bits we are willing to provide for the description of the data, the more detailed a manifold can be constructed. So there is a trade-off between how faithful a manifold representation can be and how much information is required for its description. To formalize this notion we introduce the concept of an optimal manifold.

DEFINITION. Given a data set X and a channel capacity I, a manifold description (M, P_M(M|X)) that minimizes the distortion D(M, P_M, X), and requires only information I for representing an element of X, will be called an optimal manifold M(I, X).

Note that another way to define an optimal manifold is to require that the information I(M, X) is minimized while the average distortion is fixed at a value D. The shape and the dimensionality of the optimal manifold depend on our information resolution (or the description length that we are willing to allow). This dependence captures our intuition that, for real world, multi-scale data, a proper manifold representation must reflect the compression level we are trying to achieve.
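Once the density is replaced by samples x_i and the manifold by the finite set of points γ_k introduced below, the distortion (1) and the mutual information (2) reduce to simple sums. The sketch below is our illustration of those discrete formulas, not code from the paper; the array names (X, gamma, P), the use of NumPy, and the bit units are our own choices.

```python
import numpy as np

def description_cost(X, gamma, P):
    """Distortion D and mutual information I (in bits) of a discrete manifold
    description: X is (N, D) data, gamma is (K, D) manifold points, and
    P[i, k] plays the role of P_k(x_i), with each row summing to one."""
    N, K = P.shape
    d2 = ((X[:, None, :] - gamma[None, :, :]) ** 2).sum(axis=2)        # ||x_i - gamma_k||^2
    D_dist = (P * d2).sum() / N                                        # empirical version of eq. (1)
    P_k = P.sum(axis=0) / N                                            # marginal weight of each manifold point
    ratio = np.divide(P, P_k[None, :], out=np.ones_like(P), where=(P > 0))
    I_bits = (P * np.log2(ratio)).sum() / N                            # empirical version of eq. (2)
    return D_dist, I_bits
```

Terms with P[i, k] = 0 contribute nothing to the information, which the `where` mask makes explicit.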
To find the optimal manifold (M(I), P_M(I)) for a given data set X, we must solve a constrained optimization problem. Let us introduce a Lagrange multiplier λ that represents the trade-off between information and distortion. Then the optimal manifold M(I) minimizes the functional:

F(M, P_M) = D + λI.   (3)

Let us parametrize the manifold M by t (presumably t ∈ R^d for some d ≤ D). The function γ(t) : t → M maps points from the parameter space onto the manifold and therefore describes the manifold. Our equations become:

D = ∫ d^D x ∫ d^d t ρ(x) P(t|x) ‖x − γ(t)‖²,   (4)
I = ∫ d^D x ∫ d^d t ρ(x) P(t|x) log [ P(t|x) / P(t) ],   (5)
F(γ(t), P(t|x)) = D + λI.   (6)

Note that both the information and the distortion measures are properties of the manifold description doublet {M, P_M(M|X)} and are invariant under reparametrization. We require the variations of the functional to vanish for optimal manifolds, δF/δγ(t) = 0 and δF/δP(t|x) = 0, to obtain the following set of self consistent equations:

P(t) = ∫ d^D x ρ(x) P(t|x),   (7)
γ(t) = (1 / P(t)) ∫ d^D x x ρ(x) P(t|x),   (8)
P(t|x) = [P(t) / Π(x)] exp(−‖x − γ(t)‖² / λ),   (9)
Π(x) = ∫ d^d t P(t) exp(−‖x − γ(t)‖² / λ).   (10)

In practice we do not have the full density ρ(x), but only a discrete number of samples, so we have to approximate ρ(x) = (1/N) Σ_i δ(x − x_i), where N is the number of samples, i is the sample label, and x_i is the multidimensional vector describing the ith sample. Similarly, instead of using a continuous variable t we use a discrete set t ∈ {t_1, t_2, ..., t_K} of K points to model the manifold. Note that in (7)-(10) the variable t appears only as an argument of other functions, so we can replace the integral over t by a sum over k = 1..K. Then P(t|x) becomes P_k(x_i), γ(t) is now γ_k, and P(t) is P_k. The solution to the resulting set of equations in discrete variables, (11)-(14), can be found by an iterative Blahut-Arimoto procedure [11] with an additional EM-like step. Here (n) denotes the iteration step, and α is a coordinate index in R^D. The iteration scheme becomes:

P_k^(n) = (1/N) Σ_{i=1}^{N} P_k^(n)(x_i),   (11)
γ_{k,α}^(n) = (1 / (N P_k^(n))) Σ_{i=1}^{N} x_{i,α} P_k^(n)(x_i),   where α = 1, . . . , D,   (12)
Π^(n)(x_i) = Σ_{k=1}^{K} P_k^(n) exp(−‖x_i − γ_k^(n)‖² / λ),   (13)
P_k^(n+1)(x_i) = [P_k^(n) / Π^(n)(x_i)] exp(−‖x_i − γ_k^(n)‖² / λ).   (14)

One can initialize γ_k^0 and P_k^0(x_i) by choosing K points at random from the data set, letting γ_k^0 = x_{i(k)} and P_k^0 = 1/K, and then using equations (13) and (14) to initialize the association map P_k^0(x_i). The iteration procedure (11)-(14) is terminated once

max_k ‖γ_k^n − γ_k^{n−1}‖ < ε,   (15)

where ε determines the precision with which the manifold points are located. The above algorithm requires the information-distortion cost λ = −δD/δI as a parameter. If we want to find the manifold description (M, P(M|X)) for a particular value of information I, we can plot the curve I(λ) and, because it is monotonic, we can easily find the solution iteratively, arbitrarily close to a given value of I.

4 Evaluating the solution

The result of our algorithm is a collection of K manifold points, γ_k ∈ M ⊂ R^D, and a stochastic projection map, P_k(x_i), which maps the points from the data space onto the manifold. Presumably, the manifold M has a well defined intrinsic dimensionality d. If we imagine a little ball of radius r centered at some point on a manifold of intrinsic dimensionality d, and then we begin to grow the ball, the number of points on the manifold that fall inside will scale as r^d.
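For concreteness, here is a minimal NumPy sketch of the iteration (11)-(14) with the stopping rule (15), initialized as described above. It is our reading of the procedure rather than the authors' implementation; the function names, the shift used for numerical stability, and the default tolerance are our assumptions.

```python
import numpy as np

def e_step(X, gamma, P_k, lam):
    """Equations (13)-(14): recompute the association map P_k(x_i) for fixed gamma and P_k."""
    d2 = ((X[:, None, :] - gamma[None, :, :]) ** 2).sum(axis=2)              # ||x_i - gamma_k||^2
    w = P_k[None, :] * np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / lam)   # shifted exponent for stability;
    return w / w.sum(axis=1, keepdims=True)                                  # the shift cancels in the normalization

def fit_manifold(X, K, lam, eps=0.1, seed=0):
    """Iterate (11)-(14) until the manifold points move less than eps, eq. (15)."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    gamma = X[rng.choice(N, size=K, replace=False)].copy()    # gamma_k^0 = x_{i(k)}
    P_k = np.full(K, 1.0 / K)                                 # P_k^0 = 1/K
    P = e_step(X, gamma, P_k, lam)                            # initialize the association map
    while True:
        P_k = P.mean(axis=0)                                  # eq. (11)
        gamma_new = (P.T @ X) / (N * P_k[:, None])            # eq. (12)
        if np.max(np.linalg.norm(gamma_new - gamma, axis=1)) < eps:   # eq. (15)
            return gamma_new, P_k, P
        gamma = gamma_new
        P = e_step(X, gamma, P_k, lam)                        # eqs. (13)-(14)
```

On the semicircle data of Section 5.1 this would be called as fit_manifold(X, K=100, lam=8, eps=0.1), matching the parameter values reported there. A practical implementation would also guard against manifold points whose weight P_k collapses toward zero, which the text does not discuss.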
On the other hand, this will not necessarily be true for the original data set, since it is more spread out and locally resembles the whole embedding space R^D. The Grassberger-Procaccia algorithm [12] captures this intuition by calculating the correlation dimension. First, calculate the correlation integral:

C(r) = [2 / (N(N − 1))] Σ_{i=1}^{N} Σ_{j>i} H(r − |x_i − x_j|),   (16)

where H(x) is a step function with H(x) = 1 for x > 0 and H(x) = 0 for x < 0. This measures the probability that any two points fall within a ball of radius r. Then define the correlation dimension at length scale r as the slope on the log-log plot:

d_corr(r) = d log C(r) / d log r.   (17)

Figure 2: The semicircle. (a) N = 3150 points randomly scattered around a semicircle of radius R = 20 by a normal process with σ = 1, and the final positions of 100 manifold points. (b) Log-log plot of C(r) vs r for both the manifold points (squares) and the original data set (circles).

For points lying on a manifold the slope remains constant and the dimensionality is fixed, while the correlation dimension of the original data set quickly approaches that of the embedding space as we decrease the length scale. Note that the slope at large length scales always tends to decrease due to the finite span of the data and curvature effects, and therefore does not provide a reliable estimator of intrinsic dimensionality.

5 Examples

5.1 Semi-circle

We have randomly generated N = 3150 data points scattered by a normal distribution with σ = 1 around a semi-circle of radius R = 20 (Figure 2a). Then we ran the algorithm with K = 100 and λ = 8, and terminated the iterative algorithm once the precision ε = 0.1 had been reached. The resulting manifold is depicted in red. To test the quality of our solution, we calculated the correlation dimension as a function of spatial scale for both the manifold points and the original data set (Figure 2b). As one can see, the manifold solution is of fixed dimensionality (the slope remains constant), while the original data set exhibits varying dimensionality. One should also note that the manifold points have d_corr(r) = 1 well into the territory where the original data set becomes two dimensional. This is what we should expect: at a given information level (in this case, I = 2.8 bits), the information about the second (local) degree of freedom is lost, and the resulting structure is one dimensional.

A note about the parameters. First, letting K → ∞ does not alter the solution: the information I and distortion D remain the same, and the additional points γ_k also fall on the semi-circle and are simple interpolations between the original manifold points. This allows us to claim that what we have found is a manifold, and not an agglomeration of clustering centers. Second, varying λ changes the information resolution I(λ): for small λ (high information rate) the local structure becomes important.

Figure 3: S-shaped sheet in 3D. (a) N = 2000 random points on the surface of an S-shaped sheet in 3D. (b) Normal noise added; XY-plane projection of the data. (c) Optimal manifold points in 3D, projected onto an XY plane for easy visualization.
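A short sketch of the Grassberger-Procaccia estimate, equations (16)-(17), as used for plots like Figure 2b. The choice of radii and the finite-difference slope are our own assumptions; the text does not specify them.

```python
import numpy as np

def correlation_dimension(X, radii):
    """Correlation integral C(r), eq. (16), and local slope d_corr(r), eq. (17)."""
    N = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    pair_d = dists[np.triu_indices(N, k=1)]                      # the N(N-1)/2 distinct pairs
    C = np.array([(pair_d < r).mean() for r in radii])           # eq. (16): fraction of pairs closer than r
    log_r, log_C = np.log(radii), np.log(np.maximum(C, 1e-12))   # floor avoids log(0) at tiny radii
    return C, np.gradient(log_C, log_r)                          # eq. (17): slope of the log-log curve
```

Evaluating this on both the raw semicircle data and the fitted manifold points reproduces the qualitative behaviour described above: the manifold points hold a slope near one over a wide range of r, while the slope for the data drifts toward two at small scales.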
At high information rate the solution undergoes a phase transition, and the resulting manifold becomes two dimensional to take into account the local structure. Alternatively, if we take λ → ∞, the cost of the information rate becomes very high and the whole manifold collapses to a single point (becomes zero dimensional).

5.2 S-surface

Here we took N = 2000 points covering an S-shaped sheet in three dimensions (Figure 3a), and then scattered the position of each point by adding Gaussian noise. The resulting manifold is difficult to visualize in three dimensions, so we provide its projection onto an XY plane for illustrative purposes (Figure 3b). After running our algorithm we have recovered the original structure of the manifold (Figure 3c).

6 Discussion

The problem of finding low dimensional manifolds in high dimensional data requires regularization to avoid highly folded, Peano-curve-like solutions which are low dimensional in the mathematical sense but fail to capture our geometric intuition. Rather than constraining geometrical features of the manifold (e.g., the curvature), we have constrained the mutual information between positions on the manifold and positions in the original data space, and this is invariant to all invertible coordinate transformations in either space. This approach enforces "smoothness" of the manifold only implicitly, but nonetheless seems to work.

Our information theoretic approach has considerable generality relative to methods based on specific smoothing criteria, but requires a separate algorithm, such as LLE, to give the manifold points curvilinear coordinates. For data points not in the original data set, equations (9)-(10) and (13)-(14) provide the mapping onto the manifold. Equation (7) gives the probability distribution over the latent variable, known in the density modeling literature as "the prior."

The running time of the algorithm is linear in N. This compares favorably with other methods and makes it particularly attractive for very large data sets. The number of manifold points K is usually chosen as large as possible, given the computational constraints, to obtain a dense sampling of the manifold. However, a value of K ≪ N is often sufficient, since D(λ, K) → D(λ) and I(λ, K) → I(λ) approach their limits rather quickly (the convergence improves for large λ and deteriorates for small λ). In the example of a semi-circle, the value K = 30 was sufficient at the compression level of I = 2.8 bits. In general, the threshold value for K scales exponentially with the latent dimensionality (rather than with the dimensionality of the embedding space).

The choice of λ depends on the desired information resolution, since I depends on λ. Ideally, one should plot the function I(λ) and then choose the region of interest. I(λ) is a monotonically decreasing function, with the kinks corresponding to phase transitions where the optimal manifold abruptly changes its dimensionality. In practice, we may want to run the algorithm only for a few choices of λ, and we would like to start with values that are most likely to correspond to a low dimensional latent variable representation. In this case, as a rule of thumb, we choose λ smaller, but on the order of the largest linear dimension (i.e. λ^{1/2} ∼ L_max). The dependence of the optimal manifold M(I) on the information resolution reflects the multi-scale nature of the data and should not be taken as a shortcoming.

References

[1] Bregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7. MIT Press.
[2] Hastie, T. & Stuetzle, W. (1989) Principal curves. Journal of the American Statistical Association, 84(406), 502-516.
[3] Roweis, S. & Saul, L. (2000) Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323-2326.
[4] Tenenbaum, J., de Silva, V., & Langford, J. (2000) A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.
[5] Hotelling, H. (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441, 498-520.
[6] Bishop, C., Svensen, M. & Williams, C. (1998) GTM: The generative topographic mapping. Neural Computation, 10, 215-234.
[7] Brand, M. (2003) Charting a manifold. Advances in Neural Information Processing Systems 15. MIT Press.
[8] Schölkopf, B., Smola, A. & Müller, K.-R. (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.
[9] Kramer, M. (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243.
[10] Belkin, M. & Niyogi, P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
[11] Blahut, R. (1972) Computation of channel capacity and rate distortion function. IEEE Trans. Inform. Theory, IT-18, 460-473.
[12] Grassberger, P. & Procaccia, I. (1983) Characterization of strange attractors. Physical Review Letters, 50, 346-349.

4 0.53467643 120 nips-2003-Locality Preserving Projections

Author: Xiaofei He, Partha Niyogi

Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.
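Because LPP ultimately reduces to a generalized eigenvalue problem, a small sketch makes the abstract concrete. The k-nearest-neighbour graph with heat-kernel weights is one common construction consistent with the description above; the function name, parameter defaults, and the regularization jitter are our assumptions, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, k=10, sigma=1.0):
    """Locality Preserving Projections sketch: X is (N, D); returns a (D, n_components)
    matrix of projection directions solving X L X^T a = lambda X Deg X^T a."""
    dist = cdist(X, X)
    W = np.exp(-dist ** 2 / (2 * sigma ** 2))              # heat-kernel affinities
    far = np.argsort(dist, axis=1)[:, k + 1:]              # everything beyond the k nearest (self counts as nearest)
    keep = np.ones_like(W, dtype=bool)
    np.put_along_axis(keep, far, False, axis=1)
    W = np.where(keep | keep.T, W, 0.0)                    # symmetrized k-NN graph
    np.fill_diagonal(W, 0.0)
    Deg = np.diag(W.sum(axis=1))
    L = Deg - W                                            # graph Laplacian
    A, B = X.T @ L @ X, X.T @ Deg @ X
    evals, evecs = eigh(A, B + 1e-9 * np.eye(B.shape[0]))  # generalized eigenproblem, jitter for stability
    return evecs[:, :n_components]                         # directions with the smallest eigenvalues
```

New points are then embedded by a plain matrix product, which is the "defined everywhere in ambient space" property the abstract emphasizes.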

5 0.50146258 130 nips-2003-Model Uncertainty in Classical Conditioning

Author: Aaron C. Courville, Geoffrey J. Gordon, David S. Touretzky, Nathaniel D. Daw

Abstract: We develop a framework based on Bayesian model averaging to explain how animals cope with uncertainty about contingencies in classical conditioning experiments. Traditional accounts of conditioning fit parameters within a fixed generative model of reinforcer delivery; uncertainty over the model structure is not considered. We apply the theory to explain the puzzling relationship between second-order conditioning and conditioned inhibition, two similar conditioning regimes that nonetheless result in strongly divergent behavioral outcomes. According to the theory, second-order conditioning results when limited experience leads animals to prefer a simpler world model that produces spurious correlations; conditioned inhibition results when a more complex model is justified by additional experience. 1

6 0.42572936 92 nips-2003-Information Bottleneck for Gaussian Variables

7 0.42561537 128 nips-2003-Minimax Embeddings

8 0.40047669 172 nips-2003-Semi-Supervised Learning with Trees

9 0.38480434 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence

10 0.3746385 83 nips-2003-Hierarchical Topic Models and the Nested Chinese Restaurant Process

11 0.35909134 88 nips-2003-Image Reconstruction by Linear Programming

12 0.34904358 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

13 0.34669247 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

14 0.34121606 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

15 0.33401012 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

16 0.32619169 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

17 0.31925091 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering

18 0.31211263 66 nips-2003-Extreme Components Analysis

19 0.306063 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

20 0.29967761 21 nips-2003-An Autonomous Robotic System for Mapping Abandoned Mines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.055), (11, 0.031), (29, 0.027), (35, 0.036), (49, 0.012), (51, 0.268), (53, 0.128), (69, 0.024), (71, 0.079), (76, 0.053), (78, 0.034), (85, 0.074), (91, 0.091)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8755942 85 nips-2003-Human and Ideal Observers for Detecting Image Curves

Author: Fang Fang, Daniel Kersten, Paul R. Schrater, Alan L. Yuille

Abstract: This paper compares the ability of human observers to detect target image curves with that of an ideal observer. The target curves are sampled from a generative model which specifies (probabilistically) the geometry and local intensity properties of the curve. The ideal observer performs Bayesian inference on the generative model using MAP estimation. Varying the probability model for the curve geometry enables us investigate whether human performance is best for target curves that obey specific shape statistics, in particular those observed on natural shapes. Experiments are performed with data on both rectangular and hexagonal lattices. Our results show that human observers’ performance approaches that of the ideal observer and are, in general, closest to the ideal for conditions where the target curve tends to be straight or similar to natural statistics on curves. This suggests a bias of human observers towards straight curves and natural statistics.

same-paper 2 0.77437067 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

Author: Jakob J. Verbeek, Sam T. Roweis, Nikos A. Vlassis

Abstract: We propose a non-linear Canonical Correlation Analysis (CCA) method which works by coordinating or aligning mixtures of linear models. In the same way that CCA extends the idea of PCA, our work extends recent methods for non-linear dimensionality reduction to the case where multiple embeddings of the same underlying low dimensional coordinates are observed, each lying on a different high dimensional manifold. We also show that a special case of our method, when applied to only a single manifold, reduces to the Laplacian Eigenmaps algorithm. As with previous alignment schemes, once the mixture models have been estimated, all of the parameters of our model can be estimated in closed form without local optima in the learning. Experimental results illustrate the viability of the approach as a non-linear extension of CCA. 1

3 0.61136866 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

Author: Gang Ji, Jeff A. Bilmes

Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolatedword speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results. 1

4 0.60101634 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification. 1 Motivation Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X1 , Y1 ), . . . , (Xn , Yn ), ˆ a large margin classification algorithm produces a decision function fn by empirically minimizing a loss function that is often a convex upper bound of the binary classification error ˆ ˆ function. Given fn , the binary decision rule is to predict y = 1 if fn (x) ≥ 0, and to predict ˆ y = −1 otherwise (the decision rule at fn (x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class Cn : 1 ˆ fn = arg min f ∈Cn n n φ(f (Xi )Yi ). (1) i=1 Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multicategory problems which generalizes the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).1 However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation which generalizes (1) can be used to yield consistent classifiers in the large sample limit. 
An important step toward this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations. 2 Multi-category classification We consider the following K-class classification problem: we would like to predict the label y ∈ {1, . . . , K} of an input vector x. In this paper, we only consider the simplest scenario with 0 − 1 classification loss: we have a loss of 0 for correct prediction, and loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to K class classification problem as follows: we consider K decision functions fc (x) where c = 1, . . . , K and we predict the label y of x as: T (f (x)) = arg max c∈{1,...,K} fc (x), (2) where we denote by f (x) the vector function f (x) = [f1 (x), . . . , fK (x)]. Note that if two or more components of f achieve the same maximum value, then we may choose any of them as T (f ). In this framework, fc (x) is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c (compared with the remaining k − 1 categories). The classification error is given by: (f ) = 1 − EX P (Y = T (X)|X). Note that only the relative strength of fc compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of f (x). This allows us to impose one constraint on the vector f (x) which decreases the degree of freedom K of the K-component vector f (x) to K − 1. 1 This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). 
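The invariance argument above is easy to check numerically: adding the same constant to every component of f(x) never changes the arg-max in (2), so imposing a sum-to-zero constraint only removes a redundant degree of freedom. The snippet below is a generic illustration with made-up scores, not code from the paper.

```python
import numpy as np

def predict(F):
    """Decision rule (2): F[i, c] holds f_c(x_i); predict the arg-max category."""
    return np.argmax(F, axis=1)

rng = np.random.default_rng(0)
F = rng.normal(size=(5, 3))                        # toy scores for 5 points and K = 3 classes
F0 = F - F.mean(axis=1, keepdims=True)             # enforce sum_c f_c(x) = 0 (K - 1 degrees of freedom)
assert np.array_equal(predict(F), predict(F0))     # the constrained and unconstrained rules agree
```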
However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. Taking a limit point of (cm , f m , q m ), and using the continuity of Ψc (·), we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function f (x) ∈ ΩB . Let q(x) = P (·|x). Given X, if P (Y = T (f (X))|X) ≥ P (Y = T (q(X))|X)+ , then equation (6) implies that L(q(X), f (X)) ≥ inf g∈Ω L(q(X), g) + δ. Therefore (f ) − inf (g) =EX [P (Y = T (q(X))|X) − P (Y = T (f (X))|X)] g∈B ≤ + EX I(P (Y = T (q(X))|X) − P (Y = T (f (X))|X) > ) LX (q(X), f (X)) − inf g∈BΩ LX (q(X), g) ≤ + EX δ R(f ) − inf g∈BΩ R(g) = + . δ In the above derivation we use I to denote the indicator function. 
Since and δ are arbitrary, we obtain the theorem by letting → 0. 2 Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with appropriately chosen function class Cn , approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this forms have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes Cn (C1 ⊂ C2 ⊂ · · · ) such that ∪n Cn is dense in the set of Borel measurable functions. The set Cn has the effect of regularization, which ensures that ˆ ˆ P R(fn ) ≈ inf f ∈Cn R(f ). It follows that as n → ∞, R(fn ) → inf f ∈B R(f ). Theorem 2.1 ˆ P then implies that (fn ) → inf f ∈B (f ). The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. We shall focus on the choices of Ψ that lead to admissible formulations. We pay special attention to the case that each Ψc (f ) is a convex function of f , so that the resulting formulation becomes computational more tractable. Instead of working with the general form of Ψc in (4), we focus on two specific choices listed in the next two sections. 3 Unconstrained formulations We consider unconstrained formulation with the following choice of Ψ: K Ψc (f ) = φ(fc ) + s t(fk ) , (7) k=1 where φ, s and t are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label c. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of f . This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing φ(fc ) by φc (fc )), which can be useful if we associate different classification loss to different kinds of errors. 3.1 Optimality equation and probability model Using (7), the conditional true risk (5) can be written as: K L(q, f ) = K qc φ(fc ) + s t(fc ) . c=1 c=1 In the following, we study the property of the optimal vector f ∗ that minimizes L(q, f ) for a fixed q. Given q, the optimal solution f ∗ of L(q, f ) satisfies the following first order condition: ∗ ∗ qc φ (fc ) + µf ∗ t (fc ) = 0 (c = 1, . . . , K). (8) where quantity µf ∗ = s ( K k=1 ∗ t(fk )) is independent of k. ∗ Clearly this equation relates qc to fc for each component c. The relationship of q and f ∗ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with Ψ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition. Proposition 3.1 Consider (7). Assume Φc (f ) is continuous on [−∞, ∞]K and bounded below. If s (u) ≥ 0 and ∀p > 0, pφ (f ) + t (f ) = 0 has a unique solution fp that is an increasing function of p, then the formulation is admissible. If s(u) = u, the condition ∀p > 0 in Proposition 3.1 can be replaced by ∀p ∈ (0, 1). 3.2 Decoupled formulations We let s(u) = u in (7). 
The optimality condition (8) becomes ∗ ∗ qc φ (fc ) + t (fc ) = 0 (c = 1, . . . , K). (9) This means that we have K decoupled equalities, one for each fc . This is the simplest and in the author’s opinion, the most interesting formulation. Since the estimation problem in ˆ (3) is also decoupled into K separate equations, one for each component of fn , this class of methods are computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for two-class problem (if we want to treat the two classes in a symmetric manner) since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint f1 + f2 = 0 can be used to reduce the number of equations. This variable elimination has little impact if there are many categories. In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal optimizer function f∗ (q) and the conditional probability q. For simplicity, we focus on the choice φ(u) = −u. 3.2.1 φ(u) = −u and t(u) = eu ∗ We obtain the following probability model: qc = efc . This formulation is closely related K to the maximum-likelihood estimate with conditional model qc = efc / k=1 efk (logistic regression). In particular, if we choose a function class such that the normalization condiK tion k=1 efk = 1 holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice of φ(u) = − ln u and t(u) = u. This is an extension of maximum-likelihood estimate with probability model qc = fc . The resulting method is identical to maximum-likelihood if we choose our function class such that k fk = 1. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint k fk = 1. Therefore this method is more flexible. 3.2.2 φ(u) = −u and t(u) = ln(1 + eu ) This version uses binary logistic regression loss, and we have the following probability ∗ model: qc = (1 + e−fc )−1 . Again this is an unnormalized model. 1 3.2.3 φ(u) = −u and t(u) = p |u|p (p > 1) ∗ ∗ We obtain the following probability model: qc = sign(fc )|fc |p−1 . This means that at the ∗ ∗ solution, fc ≥ 0. One may modify it such that we allow fc ≤ 0 to model the condition probability qc = 0. 3.2.4 φ(u) = −u and t(u) = 1 p max(u, 0)p (p > 1) ∗ In this probability model, we have the following relationship: qc = max(fc , 0)p−1 . The ∗ equation implies that we allow fc ≤ 0 to model the conditional probability qc = 0. Therefore, with a fixed function class, this model is more powerful than the previous one. How∗ ever, at the optimal solution, fc ≤ 1. This requirement can be further alleviated with the following modification. 3.2.5 φ(u) = −u and t(u) = 1 p min(max(u, 0)p , p(u − 1) + 1) (p > 1) In this probability model, we have the following relationship at the exact solution: qc = c min(max(f∗ , 0), 1)p−1 . Clearly this model is more powerful than the previous model since ∗ the function value fc ≥ 1 can be used to model qc = 1. 3.3 Coupled formulations In the coupled formulation with s(u) = u, the probability model can be normalized in a certain way. We list a few examples. 3.3.1 φ(u) = −u, and t(u) = eu , and s(u) = ln(u) This is the standard logistic regression model. 
The probability model is: K ∗ qc (x) = exp(fc (x))( ∗ exp(fc (x)))−1 . c=1 The right hand side is always normalized (sum up to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region Ω = {f : supk fk = 0}, and it is easy to check that this model is admissible in Ω. Ω Let fc = fc − supk fk ∈ Ω, then f Ω has the same decision rule as f and R(f ) = R(f Ω ). Therefore Theorem 2.1 implies that R(f ) → inf g∈B R(g) implies (f ) → inf g∈B (g). 1 3.3.2 φ(u) = −u, and t(u) = |u|p , and s(u) = p |u|p/p (p, p > 1) The probability model is: K ∗ ∗ ∗ |fk (x)|p )(p−p )/p sign(fc (x))|fc (x)|p −1 . qc (x) = ( k=1 We may replace t(u) by t(u) = max(0, u)p , and the probability model becomes: K qc (x) = ( ∗ ∗ max(fk (x), 0)p )(p−p )/p max(fc (x), 0)p −1 . k=1 These formulations do not seem to have advantages over the decoupled counterparts. Note that if we let p → 1, then the sum of the p p -th power of the right hand side → 1. In a −1 way, this means that the model is normalized in the limit of p → 1. 4 Constrained formulations As pointed out, one may impose constraints on possible choices of f . We may impose such a condition when we specify the function class Cn . However, for clarity, we shall directly impose a condition into our formulation. If we impose a constraint into (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of binary large-margin method (1) to multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider linear equality constraint only: K Ψc (f ) = φ(−fk ), s.t. f ∈ Ω, (10) k=1,k=c where we define Ω as: K Ω = {f : fk = 0} ∪ {f : sup fk = ∞}. k k=1 We may interpret the added constraint as a restriction on the function class Cn in (3) such that every f ∈ Cn satisfies the constraint. Note that with K = 2, this leads to the usually binary large margin method. Using (10), the conditional true risk (5) can be written as: K (1 − qc )φ(−fc ), L(q, f ) = s.t. f ∈ Ω. (11) c=1 The following result provides a simple way to check the admissibility of (10). Proposition 4.1 If φ is a convex function which is bounded below and φ (0) < 0, then (10) is admissible on Ω. Proof Sketch. The continuity condition is straight-forward to verify. We may also assume that φ(·) ≥ 0 without loss of generality. Now let f achieves the minimum of L(q, ·). If fc = ∞, then it is clear that qc = 1 and thus qk = 0 for k = c. This implies that for k = c, φ(−fk ) = inf f φ(−f ), and thus fk < 0. If fc = supk fk < ∞, then the constraint implies fc ≥ 0. It is easy to see that ∀k, qc ≥ qk since otherwise, we must have φ(−fk ) > φ(−fc ), and thus φ (−fk ) > 0 and φ (−fc ) < 0, implying that with sufficient small δ > 0, φ(−(fk + δ)) < φ(−fk ) and φ(−(fc − δ)) < φ(−fc ). A contradiction. 2 Using the above criterion, we can convert any admissible convex φ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of SVM (with loss function φ(u) = max(0, 1 − u)) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. 
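As a concrete instance of the constrained formulation (10), here is a hedged sketch of the empirical risk for the SVM choice φ(u) = max(0, 1 − u) studied in [5]; centering the score rows to satisfy the sum-to-zero constraint is our illustrative shortcut, not the optimization procedure used by those authors.

```python
import numpy as np

def constrained_multicategory_risk(F, y):
    """Empirical risk for loss (10) with phi(u) = max(0, 1 - u):
    for sample i with label y_i, Psi_{y_i}(f) = sum_{k != y_i} max(0, 1 + f_k(x_i)).
    F is (N, K); each row is projected onto the constraint sum_k f_k = 0."""
    F = F - F.mean(axis=1, keepdims=True)       # enforce the linear equality constraint
    hinge = np.maximum(0.0, 1.0 + F)            # phi(-f_k) for every class k
    hinge[np.arange(len(F)), y] = 0.0           # the k = y_i term is excluded in (10)
    return hinge.sum(axis=1).mean()
```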
Note that for the constrained formulation, it is more difficult to relate fc at the optimal solution to a probability model, since such a model will have a much more complicated form compared with the unconstrained counterpart. 5 Conclusion In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods. We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near zero density may cause serious robustness problems at least in theory. References [1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003. [2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003. [3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion. [4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. with discussion. [5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. accepted. [6] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002. [7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. with discussion. [8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003. [9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002. [11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statitics, 32, 2004. with discussion.

5 0.59866774 126 nips-2003-Measure Based Regularization

Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein

Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1

6 0.59681845 113 nips-2003-Learning with Local and Global Consistency

7 0.59602123 112 nips-2003-Learning to Find Pre-Images

8 0.59202021 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

9 0.59021068 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

10 0.58903581 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

11 0.58700556 120 nips-2003-Locality Preserving Projections

12 0.58535457 3 nips-2003-AUC Optimization vs. Error Rate Minimization

13 0.58531135 78 nips-2003-Gaussian Processes in Reinforcement Learning

14 0.58530378 107 nips-2003-Learning Spectral Clustering

15 0.5851779 172 nips-2003-Semi-Supervised Learning with Trees

16 0.58508652 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data

17 0.58505553 143 nips-2003-On the Dynamics of Boosting

18 0.5850327 66 nips-2003-Extreme Components Analysis

19 0.58311373 81 nips-2003-Geometric Analysis of Constrained Curves

20 0.581945 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications