nips nips2003 nips2003-150 knowledge-graph by maker-knowledge-mining

150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering


Source: pdf

Author: Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. [sent-3, score-0.769]

2 This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. [sent-4, score-0.101]

3 Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. [sent-6, score-0.519]

4 There are also many variants of Spectral Clustering (Weiss, 1999; Ng, Jordan and Weiss, 2002), in which such an embedding is an intermediate step before obtaining a clustering of the data that can capture flat, elongated and even curved clusters. [sent-8, score-0.577]

5 The two tasks (manifold learning and clustering) are linked because the clusters found by spectral clustering can be arbitrary curved manifolds (as long as there is enough data to locally capture their curvature). [sent-9, score-0.308]

6 2 Common Framework In this paper we consider five types of unsupervised learning algorithms that can be cast in the same framework, based on the computation of an embedding for the training points obtained from the principal eigenvectors of a symmetric matrix. [sent-10, score-0.789]

7 Let us denote KD (·, ·) (or K for shorthand) the data-dependent function which produces M by Mij = KD (xi , xj ). [sent-17, score-0.083]

8 Equivalently, this corresponds to obtaining M̃ from a K̃D by M̃ij = K̃D(xi, xj). [sent-20, score-0.083]

9 Compute the m largest positive eigenvalues λk and eigenvectors vk of M . [sent-22, score-0.306]

10 The embedding of each example xi is the vector yi with yik the i-th element of the k-th principal eigenvector vk of M̃. [sent-24, score-0.932]

11 Alternatively (MDS and Isomap), the embedding is ei, with eik = √λk yik. [sent-25, score-0.677]

12 If the first m eigenvalues are positive, then ei · ej is the best approximation of M̃ij using only m coordinates, in the squared error sense. [sent-26, score-0.063]
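The following NumPy sketch (our own illustration, not the authors' code) makes the shared structure of Algorithm 1 concrete: normalize a data-dependent Gram matrix, take its leading eigenvectors, and read the embedding off their components, optionally scaling by √λk as in MDS and Isomap.

```python
import numpy as np

def algorithm1_embedding(M_tilde, m, scale_by_sqrt_eigenvalues=False):
    """Generic spectral embedding of Algorithm 1.

    M_tilde : (n, n) symmetric normalized kernel / affinity matrix.
    m       : number of embedding coordinates to keep.
    Returns an (n, m) array whose i-th row is the embedding of x_i,
    together with the m retained eigenvalues.
    """
    eigvals, eigvecs = np.linalg.eigh(M_tilde)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]           # m largest (positive) ones
    lam, V = eigvals[order], eigvecs[:, order]      # columns are the v_k
    Y = V                                           # y_ik = v_ki
    if scale_by_sqrt_eigenvalues:                   # MDS / Isomap convention
        Y = V * np.sqrt(np.maximum(lam, 0.0))       # e_ik = sqrt(lambda_k) v_ki
    return Y, lam
```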

13 In the following, we consider the specializations of Algorithm 1 for different unsupervised learning algorithms. [sent-27, score-0.069]

14 Si = Σj Mij. (1) We say that two points (a, b) are k-nearest-neighbors of each other if a is among the k nearest neighbors of b in D ∪ {a} or vice-versa. [sent-29, score-0.092]

15 We denote by xij the j-th coordinate of the vector xi . [sent-30, score-0.152]

16 1 Multi-Dimensional Scaling Multi-Dimensional Scaling (MDS) starts from a notion of distance or affinity K that is computed between each pair of training examples. [sent-32, score-0.1]

17 For the normalization step 2 in Algorithm 1, these distances are converted to equivalent dot products using the “double-centering” formula: M̃ij = −(1/2)(Mij − (1/n) Si − (1/n) Sj + (1/n²) Σk Sk). (2) [sent-34, score-0.083]

18 The embedding eik of example xi is given by √λk vki. [sent-35, score-0.905]
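As a concrete reading of the double-centering step, here is a small NumPy sketch (ours, with illustrative names); it assumes M holds the pairwise squared distances between training points and Si denotes the row sums of M, as in eq. (1).

```python
import numpy as np

def double_center(M):
    """Convert a matrix M of squared pairwise distances into the
    equivalent dot-product matrix M_tilde of eq. (2)."""
    n = M.shape[0]
    S = M.sum(axis=1)                        # S_i = sum_j M_ij
    return -0.5 * (M
                   - S[:, None] / n          # (1/n) S_i
                   - S[None, :] / n          # (1/n) S_j
                   + S.sum() / n ** 2)       # (1/n^2) sum_k S_k
```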

19 2 Spectral Clustering: Spectral clustering (Weiss, 1999) can yield impressively good results where traditional clustering methods that look for “round blobs” in the data, such as K-means, would fail miserably. [sent-37, score-0.272]

20 It is based on two main steps: first embedding the data points in a space in which clusters are more “obvious” (using the eigenvectors of a Gram matrix), and then applying a classical clustering algorithm such as K-means, e. [sent-38, score-0.762]

21 The affinity matrix M is formed using a kernel such as the Gaussian kernel. [sent-41, score-0.165]

22 M̃ij = Mij / √(Si Sj). (3) To obtain m clusters, the first m principal eigenvectors of M̃ are computed and K-means is applied on the unit-norm coordinates, obtained from the embedding yik = vki. [sent-44, score-1.06]
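A minimal sketch of this spectral clustering variant (our own construction; the Gaussian bandwidth, the normalization read as M̃ij = Mij/√(Si Sj), and the use of scikit-learn's KMeans are our choices, and a constant factor in the normalization would only rescale eigenvalues, not eigenvectors):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, m, sigma=1.0):
    """Normalized Gaussian affinity, leading eigenvectors, then K-means."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    M = np.exp(-sq / (2.0 * sigma ** 2))            # Gaussian affinity
    S = M.sum(axis=1)
    M_tilde = M / np.sqrt(np.outer(S, S))           # normalization of eq. (3)
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    Y = eigvecs[:, np.argsort(eigvals)[::-1][:m]]   # y_ik = v_ki
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # unit-norm coordinates
    return KMeans(n_clusters=m, n_init=10).fit_predict(Y)
```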

23 The authors use an approximation of the Laplacian operator such as the Gaussian kernel or the matrix whose element (i, j) is 1 if xi and xj are k-nearest-neighbors and 0 otherwise. [sent-47, score-0.436]

24 Instead of solving an ordinary eigenproblem, the following generalized eigenproblem is solved: (S − M )vj = λj Svj (4) with eigenvalues λj , eigenvectors vj and S the diagonal matrix with entries given by eq. [sent-48, score-0.314]

25 The smallest eigenvalue is left out and the eigenvectors corresponding to the other small eigenvalues are used for the embedding. [sent-50, score-0.266]
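For illustration, a short SciPy sketch of this generalized eigenproblem (ours; it assumes a precomputed nonnegative affinity matrix M, so that the diagonal matrix S is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(M, m):
    """Solve (S - M) v = lambda S v (eq. 4) and keep the m eigenvectors
    after the smallest (roughly zero) eigenvalue as the embedding."""
    S = np.diag(M.sum(axis=1))
    eigvals, eigvecs = eigh(S - M, S)       # eigenvalues in ascending order
    return eigvecs[:, 1:m + 1], eigvals[1:m + 1]
```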

26 This is the same embedding that is computed with the spectral clustering algorithm from (Shi and Malik, 1997). [sent-51, score-0.687]

27 As noted in (Weiss, 1999) (Normalization Lemma 1), an equivalent result (up to a componentwise scaling of the embedding) can be obtained by considering the principal eigenvectors of the normalized matrix defined in eq. [sent-52, score-0.298]

28 It is based on replacing the Euclidean distance by an approximation of the geodesic distance on the manifold. [sent-56, score-0.182]

29 We define the geodesic distance with respect to a data set D, a distance d(u, v) and a neighborhood k as follows: D̃(a, b) = min_p Σi d(pi, pi+1), (5) where p is a sequence of points of length l ≥ 2 with p1 = a, pl = b, pi ∈ D ∀i ∈ {2, ..., l − 1}. [sent-57, score-0.306]

30 As in MDS, the embedding is eik = √λk vki rather than yik = vki. [sent-64, score-1.199]
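Putting the pieces together, here is a compact Isomap-style sketch (again our own construction, not the authors' implementation; the k-NN graph routine and the shortest-path solver are our choices, and a disconnected graph would produce infinite geodesic distances):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_embedding(X, m, k=10):
    """Geodesic distances of eq. (5) on a k-NN graph, then MDS on them."""
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    D = shortest_path(G, method='D', directed=False)   # D_tilde(a, b)
    M = D ** 2
    n = len(X)
    S = M.sum(axis=1)
    M_tilde = -0.5 * (M - S[:, None] / n - S[None, :] / n + S.sum() / n ** 2)
    eigvals, eigvecs = np.linalg.eigh(M_tilde)
    order = np.argsort(eigvals)[::-1][:m]
    # e_ik = sqrt(lambda_k) v_ki, the MDS / Isomap convention
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```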

31 5 LLE The Local Linear Embedding (LLE) algorithm (Roweis and Saul, 2000) looks for an embedding that preserves the local geometry in the neighborhood of each data point. [sent-66, score-0.442]

32 First, a sparse matrix of local predictive weights Wij is computed, such that Σj Wij = 1, Wij = 0 if xj is not a k-nearest-neighbor of xi, and (Σj Wij xj − xi)² is minimized. [sent-67, score-0.374]

33 The embedding is obtained from the lowest eigenvectors of M, except for the smallest eigenvector which is uninteresting because it is (1, 1, ..., 1). [sent-69, score-0.659]

34 Note that the lowest eigenvectors of M are the largest eigenvectors of M̃µ = µI − M, to fit Algorithm 1 (the use of µ > 0 will be discussed in section 4.4). [sent-73, score-0.351]

35 The embedding is given by yik = vki , and is constant with respect to µ. [sent-75, score-0.856]
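A short sketch of these two LLE steps (ours; the constrained least-squares solve for the weights and the construction M = (I − W)ᵀ(I − W), the standard LLE choice, are assumptions since the extracted text does not spell them out):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_weights(X, k=10, reg=1e-3):
    """Local predictive weights W with sum_j W_ij = 1, W_ij = 0 outside the
    k nearest neighbors, minimizing ||sum_j W_ij x_j - x_i||^2."""
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = idx[i, 1:]                          # exclude x_i itself
        Z = X[nbrs] - X[i]
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(k)         # regularize for stability
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                   # enforce sum_j W_ij = 1
    return W

def lle_embedding(W, m):
    """Lowest eigenvectors of M = (I - W)^T (I - W), dropping the constant one."""
    n = W.shape[0]
    IW = np.eye(n) - W
    eigvals, eigvecs = np.linalg.eigh(IW.T @ IW)
    return eigvecs[:, 1:m + 1]                     # y_ik = v_ki
```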

36 3 From Eigenvectors to Eigenfunctions: To obtain an embedding for a new data point, we propose to use the Nyström formula (eq. 9). [sent-76, score-0.489]

37 (Baker, 1977), which has been used successfully to speed up kernel-method computations by focusing the heavier computations (the eigendecomposition) on a subset of examples. [sent-77, score-0.109]

38 The use of this formula can be justified by considering the convergence of eigenvectors and eigenvalues, as the number of examples increases (Baker, 1977; Williams and Seeger, 2000; Koltchinskii and Giné, 2000; Shawe-Taylor and Williams, 2003). [sent-78, score-0.239]

39 Intuitively, the extensions to obtain the embedding for a new example require specifying a new column of the Gram matrix M̃, through a training-set dependent kernel function K̃D, in which one of the arguments may be required to be in the training set. [sent-79, score-0.654]

40 If we start from a data set D, obtain an embedding for its elements, and add more and more data, the embedding for the points in D converges (for eigenvalues that are unique). [sent-80, score-0.939]

41 (Shawe-Taylor and Williams, 2003) give bounds on the convergence error (in the case of kernel PCA). [sent-81, score-0.109]

42 In the limit, we expect each eigenvector to converge to an eigenfunction for the linear operator defined below, in the sense that the i-th element of the k-th eigenvector converges to the application of the k-th eigenfunction to xi (up to a normalization factor). [sent-82, score-0.42]

43 Associate with kernel K a linear operator Kp in Hp: (Kp f)(x) = ∫ K(x, y) f(y) p(y) dy. [sent-84, score-0.145]

44 Note that the proposition below can be applied even if the kernel is not positive semi-definite, although the embedding algorithms we have studied are restricted to using the principal coordinates associated with positive eigenvalues. [sent-87, score-0.689]

45 Proposition 1: Let K̃(a, b) be a kernel function, not necessarily positive semi-definite, that gives rise to a symmetric matrix M̃ with entries M̃ij = K̃(xi, xj) upon a dataset D = {x1, ..., xn}. [sent-90, score-0.248]

46 Let (vk, λk) be an (eigenvector, eigenvalue) pair that solves M̃ vk = λk vk. [sent-94, score-0.166]

47 Let (fk, λ̂k) be an (eigenfunction, eigenvalue) pair that solves (K̃p fk)(x) = λ̂k fk(x) for any x, with p the empirical distribution over D. [sent-95, score-0.124]

48 Let ek(x) = √λk yk(x) or yk(x) denote the embedding associated with a new point x. [sent-96, score-0.821]

49 Then λ̂k = λk/n (8); fk(x) = (√n/λk) Σi vki K̃(x, xi) (9); fk(xi) = √n vki (10); yk(x) = fk(x)/√n = (1/λk) Σi vki K̃(x, xi) (11); yk(xi) = yik and ek(xi) = eik (12). See (Bengio et al. [sent-97, score-1.876]

50 The generalized embedding for Isomap and MDS is ek(x) = √λk yk(x), whereas the one for spectral clustering, Laplacian eigenmaps and LLE is yk(x). [sent-99, score-1.112]
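To make eqs. (9) and (11) concrete, here is a small out-of-sample sketch (ours): given the training eigenvectors and eigenvalues of M̃ and any routine k_tilde that evaluates the data-dependent kernel between a new point and a training point, the new embedding is just a kernel-weighted combination of eigenvector entries.

```python
import numpy as np

def out_of_sample_embedding(x_new, X_train, V, lam, k_tilde, sqrt_scale=False):
    """Nystrom-style extension of Proposition 1.

    V       : (n, m) matrix whose columns are the eigenvectors v_k of M_tilde.
    lam     : (m,) corresponding eigenvalues lambda_k.
    k_tilde : callable k_tilde(x, x_i) for the data-dependent kernel.
    Returns y_k(x) = (1/lambda_k) sum_i v_ki K_tilde(x, x_i)   (eq. 11),
    or e_k(x) = sqrt(lambda_k) y_k(x) for MDS / Isomap.
    """
    kx = np.array([k_tilde(x_new, xi) for xi in X_train])   # K_tilde(x, x_i)
    y = (V.T @ kx) / lam                                     # eq. (11)
    return np.sqrt(lam) * y if sqrt_scale else y
```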

51 Proposition 2: In addition, if the data-dependent kernel K̃D is positive semi-definite, then fk(x) = √(n/λk) πk(x), where πk(x) is the k-th component of the kernel PCA projection of x obtained from the kernel K̃D (up to centering). [sent-100, score-0.451]

52 This relation with kernel PCA (Schölkopf, Smola and Müller, 1998), already pointed out in (Williams and Seeger, 2000), is further discussed in (Bengio et al. [sent-101, score-0.109]

53 4 Extending to new Points: Using Proposition 1, one obtains a natural extension of all the unsupervised learning algorithms mapped to Algorithm 1, provided we can write down a kernel function K̃ that gives rise to the matrix M̃ on D, and can be used in eq. [sent-103, score-0.286]

54 In addition to the convergence properties discussed in section 3, another justification for using equation (9) is given by the following proposition: Proposition 3 If we define the fk (xi ) by eq. [sent-106, score-0.124]

55 (10) and take a new point x, the value of fk(x) that minimizes Σi=1..n (K̃(x, xi) − Σt=1..m λ̂t ft(x) ft(xi))² (13) is given by eq. (9). [sent-107, score-0.333]

56 The proof is a direct consequence of the orthogonality of the eigenvectors vk. [sent-109, score-0.193]
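For completeness, a one-line sketch of that argument (our own rendering, using λ̂k = λk/n from eq. (8) and Σi ft(xi) fk(xi) = n δtk, which follows from eq. (10) and the orthonormality of the vk):

$$
0=\frac{\partial}{\partial f_k(x)}\sum_{i=1}^n\Big(\tilde K(x,x_i)-\sum_{t=1}^m \hat\lambda_t f_t(x)f_t(x_i)\Big)^2
\;\Rightarrow\;
n\,\hat\lambda_k\, f_k(x)=\sum_{i=1}^n f_k(x_i)\,\tilde K(x,x_i)
\;\Rightarrow\;
f_k(x)=\frac{\sqrt{n}}{\lambda_k}\sum_{i=1}^n v_{ki}\,\tilde K(x,x_i),
$$

which is exactly eq. (9).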

57 (10) when trying to approximate K̃ at the data points by minimizing the cost Σi,j=1..n (K̃(xi, xj) − Σt=1..m λ̂t ft(xi) ft(xj))² for m = 1, 2, ... [sent-112, score-0.196]

58 When we add a new point x, it is thus natural to use the same cost to approximate the K̃(x, xi), which yields (13). [sent-115, score-0.152]

59 Future work should investigate embeddings which minimize the empirical reconstruction error of K̃ but ignore the diagonal contributions. [sent-117, score-0.119]

60 1 Extending MDS For MDS, a normalized kernel can be defined as follows, using a continuous version of the double-centering eq. [sent-119, score-0.147]

61 An extension of metric MDS to new points has already been proposed in (Gower, 1968), solving exactly for the embedding of x to be consistent with its distances to training points, which in general requires adding a new dimension. [sent-121, score-0.611]
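The extraction cuts the formula here, so as an illustration we assume the usual continuous double-centering, K̃(a, b) = −(1/2)(d²(a, b) − Ex[d²(x, b)] − Ex'[d²(a, x')] + Ex,x'[d²(x, x')]), with the expectations replaced by training-set averages; the sketch below (ours) computes the resulting K̃(x, xi) values, which can then be fed to the out-of-sample formula of eq. (11).

```python
import numpy as np

def mds_extension_kernel(x, X_train, d2):
    """Continuous double-centering between a new point x and each training
    point x_i, with empirical averages over D standing in for expectations.
    d2(a, b) should return the squared distance between a and b."""
    d2_x = np.array([d2(x, xi) for xi in X_train])                  # d^2(x, x_i)
    D2 = np.array([[d2(a, b) for b in X_train] for a in X_train])   # d^2(x_i, x_j)
    return -0.5 * (d2_x - D2.mean(axis=0) - d2_x.mean() + D2.mean())
```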

62 2 Extending Spectral Clustering and Laplacian Eigenmaps Both the version of Spectral Clustering and Laplacian Eigenmaps described above are based on an initial kernel K, such as the Gaussian or nearest-neighbor kernel. [sent-123, score-0.109]

63 An equivalent normalized kernel is: K̃(a, b) = (1/n) K(a, b) / √(Ex[K(a, x)] Ex'[K(b, x')]), where the expectations are taken over the empirical data D. [sent-124, score-0.147]
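Reading the normalization above as K̃(a, b) = (1/n) K(a, b)/√(Ex[K(a, x)] Ex'[K(b, x')]), a small sketch of the corresponding out-of-sample kernel evaluation (ours; the names are illustrative):

```python
import numpy as np

def cluster_extension_kernel(x, X_train, kernel):
    """Normalized kernel values K_tilde(x, x_i) between a new point x and
    each training point, with expectations estimated on the training set D."""
    n = len(X_train)
    K_x = np.array([kernel(x, xi) for xi in X_train])                 # K(x, x_i)
    K_D = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    return K_x / (n * np.sqrt(K_x.mean() * K_D.mean(axis=1)))
```

These values, plugged into eq. (11), give yk(x) for spectral clustering and Laplacian eigenmaps.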

64 3 Extending Isomap To extend Isomap, the test point is not used in computing the geodesic distance between training points, otherwise we would have to recompute all the geodesic distances. [sent-126, score-0.297]

65 (5), which only uses the training points as the intermediate points on the path from a to b. [sent-128, score-0.162]

66 We obtain a normalized kernel by applying the continuous double-centering of eq. [sent-129, score-0.147]

67 A formula has already been proposed (de Silva and Tenenbaum, 2003) to approximate Isomap using only a subset of the examples (the “landmark” points) to compute the eigenvectors. [sent-131, score-0.079]

68 Using our notations, this formula is e'k(x) = (1/(2√λk)) Σi vki (Ex'[D̃²(x', xi)] − D̃²(xi, x)). (15) [sent-132, score-0.607]

69 The formula is applied to obtain an embedding for the non-landmark examples. [sent-134, score-0.489]

70 Corollary 1: The embedding proposed in Proposition 1 for Isomap (ek(x)) is equal to formula 15 (Landmark Isomap) when K̃(x, y) is defined as in eq. [sent-135, score-0.489]

71 (1, 1, ..., 1) is an eigenvector with eigenvalue 0, and all the other eigenvectors vk have the property Σi vki = 0 because of the orthogonality with (1, 1, ..., 1). [sent-141, score-0.638]

72 Writing Ex'[D̃²(x', xi)] − D̃²(x, xi) = 2K̃(x, xi) + Ex',x''[D̃²(x', x'')] − Ex'[D̃²(x, x')] yields e'k(x) = (1/√λk) Σi vki K̃(x, xi) + (1/(2√λk)) (Ex',x''[D̃²(x', x'')] − Ex'[D̃²(x, x')]) Σi vki = ek(x), since the last sum is 0. [sent-145, score-1.36]

73 4 Extending LLE: The extension of LLE is the most challenging one because it does not fit the framework of Algorithm 1 as well, since the M matrix for LLE does not have a clear interpretation in terms of distance or dot product. [sent-147, score-0.158]

74 Their embedding of a new point x is given by yk(x) = Σi=1..n yk(xi) w(x, xi), (16) where w(x, xi) is the weight of xi in the reconstruction of x by its k-nearest-neighbors in the training set (if x = xj ∈ D, w(x, xi) = δij). [sent-149, score-1.484]
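A compact sketch of eq. (16) (ours; the weight computation mirrors the constrained least-squares used for W above, which we assume is what w(x, xi) denotes):

```python
import numpy as np

def lle_out_of_sample(x, X_train, Y_train, k=10, reg=1e-3):
    """Embed a new point x as sum_i y_k(x_i) w(x, x_i), where the weights
    best reconstruct x from its k nearest neighbors in the training set."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    nbrs = np.argsort(d2)[:k]
    Z = X_train[nbrs] - x
    C = Z @ Z.T
    C += reg * np.trace(C) * np.eye(k)            # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                                  # sum_i w(x, x_i) = 1
    return w @ Y_train[nbrs]                      # rows of Y_train are y(x_i)
```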

75 However, we can see this embedding as a limit case of Proposition 1, as shown below. [sent-152, score-0.41]

76 We first need to define a kernel K̃µ such that K̃µ(xi, xj) = M̃µ,ij = (µ − 1)δij + Wij + Wji − Σk Wki Wkj (17) for xi, xj ∈ D. [sent-153, score-0.427]

77 Let us define a kernel K̃' by K̃'(xi, x) = K̃'(x, xi) = w(x, xi) and K̃'(x, y) = 0 when neither x nor y is in the training set D. [sent-154, score-0.463]

78 Let K̃'' be defined by K̃''(xi, xj) = Wij + Wji − Σk Wki Wkj and K̃''(x, y) = 0 when either x or y isn't in D. [sent-155, score-0.083]

79 Then, by construction, the kernel K̃µ = (µ − 1)K̃' + K̃'' verifies eq. (17). [sent-156, score-0.109]

80 Applying eq. (11) to obtain an embedding of a new point x yields yµ,k(x) = (1/λ̂k) Σi yik ((µ − 1)K̃'(x, xi) + K̃''(x, xi)), with λ̂k = (µ − λk), and λk being the k-th lowest eigenvalue of M. [sent-159, score-0.973]

81 This rewrites into yµ,k(x) = ((µ − 1)/(µ − λk)) Σi yik w(x, xi) + (1/(µ − λk)) Σi yik K̃''(x, xi). [sent-160, score-0.674]

82 Then when µ → ∞, yµ,k(x) → yk(x) defined by eq. (16). [sent-161, score-0.148]

83 (16) as approximating the use of the kernel K̃µ with a large µ in Proposition 1. [sent-164, score-0.109]

84 5 Experiments We want to evaluate whether the precision of the generalizations suggested in the previous section is comparable to the intrinsic perturbations of the embedding algorithms. [sent-169, score-0.469]

85 The perturbation analysis will be achieved by considering splits of the data in three sets, D = F ∪ R1 ∪ R2 and training either with F ∪ R1 or F ∪ R2 , comparing the embeddings on F . [sent-170, score-0.132]

86 When eigenvalues are close, the estimated eigenvectors are unstable and can rotate in the subspace they span. [sent-212, score-0.223]

87 Thus we estimate an affine alignment between the two embeddings using the points in F, and we calculate the Euclidean distance between the aligned embeddings obtained for each si ∈ F. [sent-213, score-0.27]

88 For each sample si ∈ F , we also train over {F ∪ R1 }/{si }. [sent-215, score-0.077]

89 We apply the extension to out-of-sample points to find the predicted embedding of si and calculate the Euclidean distance between this embedding and the one obtained when training with F ∪ R1 , i. [sent-216, score-1.105]

90 We calculate the mean difference (and its standard error, shown in the figure) between the distance obtained in step 1 and the one obtained in step 2 for each sample si ∈ F , and we repeat this experiment for various sizes of F . [sent-220, score-0.127]
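A schematic rendering of the two comparison steps (ours; the affine alignment is a plain least-squares fit, and Y_pred / Y_insample denote out-of-sample and in-sample embeddings produced by any of the algorithms above):

```python
import numpy as np

def affine_align(Y_src, Y_ref):
    """Least-squares affine map sending one embedding onto another, used
    because close eigenvalues let the eigenvectors rotate between runs."""
    A = np.hstack([Y_src, np.ones((len(Y_src), 1))])
    T, *_ = np.linalg.lstsq(A, Y_ref, rcond=None)
    return A @ T

def variability_vs_extension_error(Y_F_run1, Y_F_run2, Y_pred, Y_insample):
    """Step 1: per-point distance between the two aligned embeddings of F.
    Step 2: distance between the predicted embedding of each s_i (trained
    without s_i) and its in-sample embedding; the paper reports the mean
    difference between the two quantities."""
    step1 = np.linalg.norm(affine_align(Y_F_run2, Y_F_run1) - Y_F_run1, axis=1)
    step2 = np.linalg.norm(Y_pred - Y_insample, axis=1)
    return step1, step2
```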

91 The results obtained for MDS, Isomap, spectral clustering and LLE are shown in figure 1 for different values of m. [sent-221, score-0.277]

92 Each algorithm generates a two-dimensional embedding of the images, following the experiments reported for Isomap. [sent-233, score-0.41]

93 The number of neighbors is 10 for Isomap and LLE, and a Gaussian kernel with a standard deviation of 0. [sent-234, score-0.145]

94 In most cases, the out-of-sample error is less than or comparable to the training set embedding stability: it corresponds to substituting a fraction of between 1 and 4% of the training examples. [sent-238, score-0.541]

95 6 Conclusions In this paper we have presented an extension to five unsupervised learning algorithms based on a spectral embedding of the data: MDS, spectral clustering, Laplacian eigenmaps, Isomap and LLE. [sent-239, score-0.813]

96 This extension allows one to apply a trained model to out-of-sample points without having to recompute eigenvectors. [sent-240, score-0.141]

97 The experiments on real high-dimensional data show that the average distance between the out-of-sample and in-sample embeddings is comparable to or lower than the variation in the in-sample embedding due to replacing a few points in the training set. [sent-242, score-0.679]

98 Think globally, fit locally: unsupervised learning of low dimensional manifolds. [sent-305, score-0.069]

99 Nonlinear component analysis as a kernel eigenvalue problem. [sent-312, score-0.152]

100 The stability of kernel principal components analysis and its relation to the process eigenspectrum. [sent-317, score-0.153]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('embedding', 0.41), ('isomap', 0.266), ('vki', 0.261), ('mds', 0.225), ('mij', 0.206), ('lle', 0.188), ('yik', 0.185), ('eigenvectors', 0.16), ('xi', 0.152), ('eigenmaps', 0.15), ('yk', 0.148), ('ex', 0.142), ('laplacian', 0.142), ('spectral', 0.141), ('clustering', 0.136), ('proposition', 0.126), ('fk', 0.124), ('ek', 0.115), ('weiss', 0.11), ('kd', 0.11), ('kernel', 0.109), ('xj', 0.083), ('vk', 0.083), ('embeddings', 0.082), ('geodesic', 0.082), ('eik', 0.082), ('formula', 0.079), ('si', 0.077), ('silva', 0.072), ('williams', 0.071), ('kp', 0.071), ('unsupervised', 0.069), ('bengio', 0.067), ('cox', 0.066), ('tenenbaum', 0.065), ('eigenvalues', 0.063), ('wij', 0.063), ('roweis', 0.063), ('montr', 0.062), ('extending', 0.061), ('saul', 0.061), ('eigenvector', 0.058), ('ft', 0.057), ('hp', 0.056), ('eigenfunctions', 0.056), ('matrix', 0.056), ('points', 0.056), ('baker', 0.052), ('eigendecomposition', 0.052), ('seeger', 0.052), ('niyogi', 0.052), ('extension', 0.052), ('training', 0.05), ('distance', 0.05), ('af', 0.047), ('delalleau', 0.047), ('koltchinskii', 0.047), ('ouimet', 0.047), ('recherche', 0.047), ('roux', 0.047), ('wki', 0.047), ('wkj', 0.047), ('gram', 0.045), ('langford', 0.045), ('principal', 0.044), ('belkin', 0.043), ('distances', 0.043), ('eigenvalue', 0.043), ('informatique', 0.041), ('paiement', 0.041), ('gin', 0.041), ('gower', 0.041), ('dimensionality', 0.04), ('normalization', 0.04), ('ng', 0.039), ('normalized', 0.038), ('eigenfunction', 0.038), ('reconstruction', 0.037), ('operator', 0.036), ('neighbors', 0.036), ('pi', 0.036), ('pca', 0.036), ('nity', 0.036), ('eigenproblem', 0.035), ('justi', 0.034), ('recompute', 0.033), ('orthogonality', 0.033), ('wji', 0.033), ('neighborhood', 0.032), ('comparable', 0.031), ('curved', 0.031), ('op', 0.031), ('vincent', 0.031), ('malik', 0.031), ('lowest', 0.031), ('ller', 0.03), ('becker', 0.029), ('extensions', 0.029), ('universit', 0.029), ('generalizations', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Author: Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1

2 0.23767616 128 nips-2003-Minimax Embeddings

Author: Matthew Brand

Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1

3 0.20945694 71 nips-2003-Fast Embedding of Sparse Similarity Graphs

Author: John C. Platt

Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijsktra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijsktra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1

4 0.16854426 107 nips-2003-Learning Spectral Clustering

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors. 1

5 0.15377665 120 nips-2003-Locality Preserving Projections

Author: Xiaofei He, Partha Niyogi

Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.

6 0.14945552 46 nips-2003-Clustering with the Connectivity Kernel

7 0.1382353 117 nips-2003-Linear Response for Approximate Inference

8 0.13302279 126 nips-2003-Measure Based Regularization

9 0.11583064 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

10 0.11392259 152 nips-2003-Pairwise Clustering and Graphical Models

11 0.11090966 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

12 0.1103695 108 nips-2003-Learning a Distance Metric from Relative Comparisons

13 0.1032846 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

14 0.090125456 112 nips-2003-Learning to Find Pre-Images

15 0.087641239 113 nips-2003-Learning with Local and Global Consistency

16 0.087373182 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

17 0.085988753 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

18 0.083439618 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

19 0.08162304 73 nips-2003-Feature Selection in Clustering Problems

20 0.078858584 92 nips-2003-Information Bottleneck for Gaussian Variables


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.246), (1, -0.195), (2, -0.068), (3, 0.084), (4, -0.026), (5, 0.201), (6, -0.109), (7, 0.028), (8, 0.069), (9, -0.263), (10, 0.079), (11, -0.047), (12, 0.139), (13, -0.115), (14, -0.053), (15, -0.038), (16, 0.141), (17, 0.06), (18, -0.046), (19, 0.084), (20, 0.051), (21, -0.134), (22, 0.096), (23, -0.022), (24, -0.157), (25, -0.061), (26, -0.1), (27, -0.002), (28, 0.206), (29, 0.113), (30, 0.012), (31, 0.01), (32, -0.017), (33, -0.041), (34, -0.014), (35, 0.017), (36, 0.045), (37, -0.032), (38, 0.088), (39, 0.072), (40, 0.148), (41, -0.077), (42, -0.013), (43, 0.021), (44, -0.005), (45, 0.001), (46, 0.025), (47, 0.018), (48, -0.007), (49, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95037198 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Author: Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1

2 0.8039819 71 nips-2003-Fast Embedding of Sparse Similarity Graphs

Author: John C. Platt

Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijsktra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijsktra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1

3 0.74397475 128 nips-2003-Minimax Embeddings

Author: Matthew Brand

Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1 Nonlinear dimensionality reduction (NLDR) . Spectral NLDR methods are graph embedding problems where a set of N points X = [x1 , · · · , xN ] ∈ RD×N sampled from a low-dimensional manifold in a ambient space RD is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a “parameterization” space Rd , d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings RN . Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in Rd parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between RD and Rd . This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability. 2 Minimax theorem for graph embeddings We begin with neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ RN×N that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). 
Typically $M_{ij} \neq 0$ if points $i, j$ are nearby on the intrinsic manifold and $|M_{ij}|$ is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies $YM \approx Y$ with small residual $\|YM - Y\|_F$ will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus $M_{ij} = 1/k$ if vertex $i$ is connected to vertex $j$ (of degree $k$). We will also consider three optional constraints on the embedding:

1. A null-space restriction, where the solution must lie outside the column-space of $C \in \mathbb{R}^{N \times M}$, $M < N$. For example, it is common to stipulate that the solution Y be centered, i.e., $YC = 0$ for $C = \mathbf{1}$, the constant vector.

2. A basis restriction, where the solution must be a linear combination of the rows of a basis $Z \in \mathbb{R}^{K \times N}$, $K \leq N$. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks.

3. A metric $\Sigma \in \mathbb{R}^{N \times N}$ that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that $\Sigma$ is symmetric positive definite and has factorization $\Sigma = AA^\top$ (e.g., A could be a Cholesky factor of $\Sigma$).

In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector $y_i \in \mathrm{rows}(Y)$, $y_i \in \mathrm{range}(Z)$, to be

$$E_M(y_i) \doteq \max_{D \in \mathbb{R}^{M \times N}} \frac{\|(y_i(M + CD) - y_i)A\|}{\|y_i A\|} \qquad (1)$$

where D is a matrix constructed by an adversary to maximize the error. The optimizing $y_i$ is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual $y_i M - y_i$ has smallest norm w.r.t. the metric $\Sigma$. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, $\Sigma$:

Minimax solution: Let $Q \in S^{K \times P}$ be a column-orthonormal basis of the null-space of the rows of ZC, with $P = K - \mathrm{rank}(C)$. Let $B \in \mathbb{R}^{P \times P}$ be a square factor satisfying $B^\top B = Q^\top Z \Sigma Z^\top Q$, e.g., a Cholesky factor (or the "R" factor in a QR-decomposition of $(Q^\top Z A)^\top$). Compute the left singular vectors $U \in S^{N \times N}$ of $U \, \mathrm{diag}(\mathbf{s}) \, V^\top = B^{-\top} Q^\top Z (I - M) A$, with singular values $\mathbf{s} = [s_1, \cdots, s_P]$ ordered $s_1 \leq s_2 \leq \cdots \leq s_P$. Using the leading columns $U_{1:d}$ of U, set $Y = U_{1:d}^\top B^{-\top} Q^\top Z$.

Theorem 1. Y is the optimal (minimax) embedding in $\mathbb{R}^d$ with error $\|[s_1, \cdots, s_d]\|_2$:

$$Y = U_{1:d}^\top B^{-\top} Q^\top Z = \arg\min_{Y \in \mathbb{R}^{d \times N}} \sum_{y_i \in \mathrm{rows}(Y)} E_M(y_i)^2, \quad \text{with } E_M(y_i) = s_i. \qquad (2)$$

Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When $Z = A = I$, $C = [\,]$, and M reproduces X through linear combinations with $M\mathbf{1} = \mathbf{1}$, we recover LLE [5]. When $Z = I$, $C = [\,]$, $I - M$ is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further $Z = X$ we recover locality preserving projections [8]. (A brief numerical sketch of this recipe is given below, after the next paragraph.)

3 Analysis and generalization of charting

The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes $S_1 \in \mathbb{R}^{D \times d}, S_2, \cdots$ at offsets $\mu_1 \in \mathbb{R}^D, \mu_2, \cdots$ that best cover the data and vary smoothly over the manifold.
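The Theorem 1 recipe above is compact enough to sketch numerically. The following NumPy sketch is an illustration under assumed defaults — a barycentric k-nearest-neighbour M on a toy curve, the discrete case Z = I, metric Σ = I, and the centering constraint C = 1 — with made-up variable names and data; it is not code from the paper.

```python
import numpy as np
from scipy.linalg import cholesky, null_space, svd

rng = np.random.default_rng(0)

# Toy data: a noisy curve in 3-D standing in for samples from a manifold (D x N).
N, k, d = 200, 6, 1
t = np.sort(rng.uniform(0, 1, N))
X = np.vstack([t, np.cos(np.pi * t), 0.05 * rng.normal(size=N)])

# Barycentric connectivity: column j of M averages the k nearest neighbours of
# point j, so that X M ~= X (the data-reproducing property).
sqdist = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
M = np.zeros((N, N))
for j in range(N):
    M[np.argsort(sqdist[:, j])[1:k + 1], j] = 1.0 / k

# Optional matrices: discrete case Z = I, metric Sigma = A A^T = I, centering C = 1.
Z, A, C = np.eye(N), np.eye(N), np.ones((N, 1))

# Minimax solution of Theorem 1.
Q = null_space((Z @ C).T)                                  # orthonormal columns, K x P
B = cholesky(Q.T @ Z @ A @ A.T @ Z.T @ Q)                  # B^T B = Q^T Z Sigma Z^T Q
G = np.linalg.solve(B.T, Q.T @ Z @ (np.eye(N) - M) @ A)    # B^{-T} Q^T Z (I - M) A
U, s, Vt = svd(G, full_matrices=False)                     # singular values, descending

U_lead = U[:, ::-1][:, :d]                                 # d columns with smallest s
Y = U_lead.T @ np.linalg.solve(B.T, Q.T @ Z)               # Y = U_{1:d}^T B^{-T} Q^T Z
print("per-dimension embedding error:", np.sort(s)[:d])
```

With Z = A = I the factor B is simply the identity, so the sketch reduces to taking the small singular directions of Q^T(I − M); supplying a genuine RBF basis Z and a non-identity metric would exercise the full construction.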
Each subspace offers a chart—a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding. Under the assumption of isometry, the charting error is equivalent to the sum-squared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods). To construct the same error criterion in the minimax setting, let $x_{i-k}, \cdots, x_i, \cdots, x_{i+k}$ denote points in the $i$th neighborhood and let the columns of $V_i \in \mathbb{R}^{(2k+1) \times d}$ be an orthonormal basis of the rows of the local parameterization $S_i^\top [x_{i-k}, \cdots, x_i, \cdots, x_{i+k}]$. Then a nonzero reparameterization will satisfy $[y_{i-k}, \cdots, y_i, \cdots, y_{i+k}] V_i V_i^\top = [y_{i-k}, \cdots, y_i, \cdots, y_{i+k}]$ if and only if it preserves the relative position of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula $[y_{i-k}, \cdots, y_i, \cdots, y_{i+k}](I - V_i V_i^\top)$. Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix $M = I - \sum_i F_i (I - V_i V_i^\top) F_i^\top$, where $(F_i)_{kj} = 1$ iff the $j$th point of the $i$th neighborhood is the $k$th point of the dataset. Because $V_i V_i^\top$ and $(I - V_i V_i^\top)$ are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of $(I - V_i V_i^\top)$. To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation $Z = [z(x_1), z(x_2), \cdots, z(x_N)]$, where $z(x)$ is a vector that stacks $z_1(x), z_2(x)$, etc., with

$$z_m(x) \doteq \begin{bmatrix} K_m(x - \mu_m) \\ 1 \end{bmatrix} \frac{p_m(x)}{\sum_m p_m(x)}, \qquad (3)$$

$$p_m(x) = \mathcal{N}(x \mid \mu_m, \Sigma_m) \propto e^{-(x - \mu_m)^\top \Sigma_m^{-1} (x - \mu_m)/2}, \qquad (4)$$

and $K_m$ is any local linear dimensionality reducer, typically $S_m$ itself. Each column of Z contains many "views" of the same point that are combined to give its low-dimensional embedding. Finally, we set $C = \mathbf{1}$, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing matrix, $f(x) = U_{1:d}^\top B^{-\top} Q^\top z(x)$. Theorem 1 guarantees that the resulting embedding is least-squares optimal w.r.t. Z, M, C, A at the datapoints $f(x_i)$, and because $f(\cdot)$ is an affine transform of $z(\cdot)$ it smoothly interpolates the embedding between points.

[Figure 1: kernel embeddings of the twisted swiss roll — generalized EVD vs. minimax SVD, with upper-right and lower-left corner details. Caption: Minimax and generalized EVD solution for kernel eigenmap of a non-developable swiss roll. Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints.]

There are some interesting variants:
One-shot charting: If we set the local dimensionality reducers to the identity matrix (all $K_m = I$), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that $\mathrm{rows}(Z) \leq N$ for a fully determined solution.

Discrete isometric charting: If $Z = I$ then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap.

Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that $\mathrm{rows}(Z) \ll N$. Then the size of the SVD problem is substantially reduced.

4 Numerical advantage of minimax method

Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial $\mathbb{R}^0$ embedding where all points are mapped to a single point $y_i = N^{-1/2}$. The $\mathbb{R}^0$ embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as a null eigenvalue paired with a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as problem size grows and/or the data becomes increasingly amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with $C = \mathbf{1}$ to suppress the DOF.

A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continuous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took the best result.

[Figure 2: left, excess energy in the eigenspectrum compared to the minimax spectrum (vs. eigenvalue index); right, per-point error in the null embedding. Caption: Excess energy in the eigenspectrum indicates that the translational DOF has contaminated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its remaining eigenvalues should be identical to those derived from the minimax solution. The graph at left shows the difference in the eigenspectra. The graph at right shows the EVD solution's deviation from the translational vector $y_0 = \mathbf{1} \cdot N^{-1/2} \approx .03333$. If the numerics were perfect the line would be flat, but in practice the deviation is significant enough (roughly 1% of the diameter of the embedding) to noticeably perturb points in figure 1.]
Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors. 4.1 Numerical instability of eigen-methods The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can “filter” the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition. Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v0 can grow as O(N 3/2 ), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N 3/2 ), at which point the eigenvectors of interest—embedding and translational—are mutually contaminated and/or have an indeterminate eigenvalue ordering. Please see appendix B for the proof. This theorem essentially lower-bounds an upperbound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve xi = [a, cos πa] , a ∈ [0, 1] with K = 2 neighbors, instabilities cannot be bounded better than O(N 5/2 ); empirically we see eigenvector mixing with N < 100 points and we see it grow at the rate ≈ O(N 4 )—in many different eigensolvers. At very large scales, more pernicious instabilities set in. E.g., by N = 20000 points, the solution begins to fold over. Although algebraic multiplicity and instability of the eigenproblem is conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal. 5 Summary One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analysis rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning—notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization ( SVD ). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. 
Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques. References 1. W.T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743–768, 1963. 2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975. 3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997. 4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22 2000. 5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22 2000. 6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003. 7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. volume 14 of Advances in Neural Information Processing Systems, 2002. 8. Xiafei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002. 9. Matthew Brand. Charting a manifold. volume 15 of Advances in Neural Information Processing Systems, 2003. 10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990. A Proof of minimax embedding theorem (1) The burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow. Proof. Setting yi = li Z, we will solve for li ∈ columns(L). Writing the error in terms of li , EM (li ) = max K∈RM×N li Z(I − M − CK)A li Z(I − M)A − li ZCKA = max . M×N li ZA li ZA K∈R (5) The term li ZCKA produces infinite error unless li ZC = 0, so we accept this as a constraint and seek li Z(I − M)A min . (6) li ZA li ZC=0 By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal . to ZC; the basis for this space is given by columns of Q = null((ZC) ). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA Z ; when the problem is projected into the space orthogonal to ZC it becomes Q (ZAA Z )Q. Nesting the “orthogonally-constrained-SVD” construction of lemma 1 inside the “SVD-under-a-metric” lemma 2, we obtain a solution that uses the correct metric in the orthogonal space: B B = Q ZAA Z Q − Udiag(s)V = B (7) {Q(Z(I − M)A)} (8) L = QB−1 U (9) where braces indicate the nesting of lemmas. By the “best-projection” lemma (#3), if we order the singular values by ascending magnitude, L1:d = arg min J∈RN×d ∑ji ∈cols(J) ( j Z(I − M)A / j )2 ZΣZ The proof is completed by making the substitutions L Z → Y and x A → x Σ = AA ), and leaving off the final square root operation to obtain (Y )1:d = arg min ∑ji ∈cols(J) j (I − M) Σ / j J∈RN×d (10) Σ (for 2 Σ . (11) Lemma 1. Orthogonally constrained SVD: The left singular vectors L of matrix M under . SVD the constraint U C = 0 are calculated as Q = null(C ), Udiag(s)V ← Q M, L = QU. Proof. First observe that L is orthogonal to C: By definition, the null-space basis satisfies Q C = 0, thus L C = U Q C = 0. Let J be an orthonormal basis for C, with J J = I and Q J = 0. 
Then Ldiag(s)V = QQ M = (I − JJ )M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C. Lemma 2. SVD with respect to a metric: The vectors li ∈ L, vi ∈ V that diagonalize matrix M with respect to positive definite column-space metric Σ are calculated as B B ← Σ, SVD . Udiag(s)V ← B− M, L = B−1 U satisfy li M / li Σ = si and extremize this form for the extremal singular values smin , smax . Proof. By construction, L and V diagonalize M: L MV = (B−1 U) MV = U (B− M)V = diag(s) (12) B− and diag(s)V = M. Forming the gram matrices of both sides of the last line, we obtain the identity Vdiag(s)2 V = M B−1 B− M = M Σ−1 M, which demonstrates that si ∈ s are the singular values of M w.r.t. column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because L 2 = L ΣL = U B− B BB−1 U = I. Consequently, Σ l M / l Σ = l M /1 = si vi and by the Courant-Hilbert theorem, smax = max l M / l Σ ; = si . smin = min l M / l Σ . l l (13) (14) Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and elements of s be sorted so that s1 ≥ s2 ≥ · · · ≥ sN . Then for any dimensionality 1 ≤ d ≤ N, . L1:d = [l1 , · · · , ld ] = arg max J M (J ΣJ)−1 (15) J∈RN×d = arg max F (16) ∑ji ∈cols(J) ( j M / j Σ )2 (17) J∈RN×d |J ΣJ=I = arg max J∈RN×d J M with the optimum value of all right hand sides being (∑d s2 )1/2 . If the sort order is rei=1 i versed, the minimum of this form is obtained. Proof. By the Eckart-Young-Mirsky theorem, if U MV = diag(s) with singular values . sorted in descending order, then U1:d = [u1 , · · · , ud ] = arg maxU∈SN×d U M F . We first extend this to a non-orthonogonal basis J under a Mahalonobis norm: maxJ∈RN×d J M (J J)−1 = maxU∈SN×d U M F (18) because J M 2 (J J)−1 = trace(M J(J J)−1 J M) = trace(M JJ+ (JJ+ ) M) = (JJ+ )M 2 = UU M 2 = U M 2 since JJ+ is a (symmetric) orthogonal proF F F jector having binary eigenvalues λ ∈ {0, 1} and therefore it is the gram of an thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J M while minimizing the norm of J w.r.t. metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding, L1:d M 2 ΣL )−1 = trace((L1:d M) (L1:d ΣL1:d )−1 (L1:d M)) = (L 1:d 1:d trace((L1:d M) I(L1:d M)) = trace((diag(s1:d )V1:d ) (diag(s1:d )V1:d )) = s1:d 2 . Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 criterion merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that leading N − d singular values account for the maximal part of the variance. B Proof of instability theorem (2) Proof. When generated from a sparse graph with average degree K, weighted connectivity matrix W is sparse and has O(NK) entries. 
Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., $W \rightarrow W + E$ due to numerical noise E created by finite machine precision. By the weak law of large numbers, the Frobenius norm of the sparse perturbation grows as $\|E\|_F \sim O(\sqrt{N})$. However the $t$th-smallest nonzero eigenvalue $\lambda_t(W)$ scales as $\lambda_t(W) = v_t^\top W v_t \sim O(N^{-1})$, because elements of the corresponding eigenvector $v_t$ grow as $O(N^{-1/2})$ and only K of those elements are multiplied by nonzero values to form each element of $W v_t$. In sum, the perturbation $\|E\|_F$ grows while the eigenvalue $\lambda_t(W)$ shrinks. In linear embedding algorithms, the eigengap of interest is $\lambda_{\mathrm{gap}} \doteq \lambda_1 - \lambda_0$. The tail eigenvalue $\lambda_0 = 0$ by construction, but it is possible that $\tilde{\lambda}_0 > 0$ with numerical error, thus $\lambda_{\mathrm{gap}} \leq \lambda_1$. Combining these facts, the ratio between the perturbation and the eigengap grows as $\|E\|_F / \lambda_{\mathrm{gap}} \sim O(N^{3/2})$ or faster. Now consider the shifted eigenproblem $I - W$ with leading (maximal) eigenvalues $1 - \lambda_0 \geq 1 - \lambda_1 \geq \cdots$ and unchanged eigenvectors. From matrix perturbation theory [10, thm. V.2.8], when W is perturbed to $\tilde{W} = W + E$, the change in the leading eigenvalue from $1 - \lambda_0$ to $1 - \tilde{\lambda}_0$ is bounded as $|\tilde{\lambda}_0 - \lambda_0| \leq \sqrt{2}\,\|E\|_F$, and similarly $1 - \tilde{\lambda}_1 \leq 1 - \lambda_1 + \sqrt{2}\,\|E\|_F$. Thus $\tilde{\lambda}_{\mathrm{gap}} \geq \lambda_{\mathrm{gap}} - 2\sqrt{2}\,\|E\|_F$. Since $\|E\|_F / \lambda_{\mathrm{gap}} \sim O(N^{3/2})$, the right hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable, with the possibility of the first and second eigenvalue/vector pairs being swapped. Mutual contamination of the eigenvectors happens well before: under general (dense) conditions, the change in the eigenvector $v_0$ is bounded as $\|\tilde{v}_0 - v_0\| \leq \frac{4\,\|E\|_F}{|\lambda_0 - \lambda_1| - \sqrt{2}\,\|E\|_F}$ [10, thm. V.2.8]. (This bound is often tight enough to serve as a good approximation.) Specializing this to the sparse embedding matrix, we find that the bound weakens to $\|\tilde{v}_0 - \mathbf{1} \cdot N^{-1/2}\| \sim \frac{O(\sqrt{N})}{O(N^{-1}) - O(\sqrt{N})} > \frac{O(\sqrt{N})}{O(N^{-1})} = O(N^{3/2})$.
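The scaling argument above can be probed numerically. The sketch below is a rough, assumption-laden illustration (a barycentric k-NN graph on a noisy curve, with machine-epsilon relative noise on the nonzero entries standing in for E); the measured growth of the ratio depends on the graph construction and eigensolver, and this is not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)

def barycentric_M(N, k=6):
    """k-NN barycentric connectivity matrix for a noisy planar curve (toy manifold)."""
    t = np.sort(rng.uniform(0, 1, N))
    X = np.vstack([t, np.cos(np.pi * t) + 0.01 * rng.normal(size=N)])
    sqdist = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    M = np.zeros((N, N))
    for j in range(N):
        M[np.argsort(sqdist[:, j])[1:k + 1], j] = 1.0 / k
    return M

eps = np.finfo(float).eps
for N in (100, 200, 400, 800):
    M = barycentric_M(N)
    # Crude model of rounding noise: relative perturbation of size eps on each nonzero entry.
    E_norm = eps * np.linalg.norm(M[M != 0])
    # Quadratic form whose null eigenvector carries the translational DOF
    # (columns of M sum to one, so Phi @ 1 = 0); its second eigenvalue is the eigengap.
    Phi = (np.eye(N) - M) @ (np.eye(N) - M).T
    lam = np.linalg.eigvalsh(Phi)
    gap = lam[1] - lam[0]
    print(f"N={N:4d}  ||E||_F~{E_norm:.1e}  eigengap={gap:.1e}  ratio={E_norm / gap:.1e}")
```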

4 0.65035027 108 nips-2003-Learning a Distance Metric from Relative Comparisons

Author: Matthew Schultz, Thorsten Joachims

Abstract: This paper presents a method for learning a distance metric from relative comparisons such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents.
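As a rough illustration of the idea (not the quadratic program developed in the paper), the sketch below learns a diagonal metric with non-negative feature weights from such triplets by subgradient descent on an SVM-style hinge objective; the toy data, triplet set, and hyperparameters are made up for the example.

```python
import numpy as np

def learn_diag_metric(X, triplets, C=100.0, lr=0.005, epochs=300, seed=0):
    """Learn non-negative feature weights w so that, for each triplet (a, b, c) meaning
    'a is closer to b than a is to c', d_w(a, c)^2 - d_w(a, b)^2 >= 1 (softly, via hinge),
    where d_w(x, y)^2 = sum_k w_k (x_k - y_k)^2."""
    rng = np.random.default_rng(seed)
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        rng.shuffle(triplets)
        for a, b, c in triplets:
            dab = (X[a] - X[b]) ** 2            # per-feature squared differences
            dac = (X[a] - X[c]) ** 2
            grad = w.copy()                     # gradient of the 0.5 * ||w||^2 regularizer
            if w @ dac - w @ dab < 1:           # hinge constraint violated
                grad += C * (dab - dac)
            w = np.maximum(w - lr * grad, 0.0)  # project onto w >= 0
    return w

# Tiny illustration: feature 0 explains the comparisons, feature 1 is noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.linspace(0.0, 1.0, 30), 0.05 * rng.normal(size=30)])
triplets = np.array([(i, i + 1, i + 5) for i in range(24)])   # "i closer to i+1 than to i+5"
print("learned feature weights:", np.round(learn_diag_metric(X, triplets), 3))
```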

5 0.61832952 120 nips-2003-Locality Preserving Projections

Author: Xiaofei He, Partha Niyogi

Abstract: Many problems in information processing involve some form of dimensionality reduction. In this paper, we introduce Locality Preserving Projections (LPP). These are linear projective maps that arise by solving a variational problem that optimally preserves the neighborhood structure of the data set. LPP should be seen as an alternative to Principal Component Analysis (PCA) – a classical linear technique that projects the data along the directions of maximal variance. When the high dimensional data lies on a low dimensional manifold embedded in the ambient space, the Locality Preserving Projections are obtained by finding the optimal linear approximations to the eigenfunctions of the Laplace Beltrami operator on the manifold. As a result, LPP shares many of the data representation properties of nonlinear techniques such as Laplacian Eigenmaps or Locally Linear Embedding. Yet LPP is linear and more crucially is defined everywhere in ambient space rather than just on the training data points. This is borne out by illustrative examples on some high dimensional data sets.
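A minimal sketch of the LPP recipe described above — heat-kernel affinities on a k-NN graph and the smallest generalized eigenvectors of X L X^T a = λ X D X^T a; the toy data, neighbourhood size, and kernel width are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, d=2, k=5, t=1.0):
    """Locality Preserving Projections.  X is (D, N); returns a (D, d) projection matrix A,
    so the low-dimensional representation of the data is A.T @ X."""
    D_dim, N = X.shape
    # k-NN graph with heat-kernel weights W_ij = exp(-||x_i - x_j||^2 / t).
    sqd = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    W = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(sqd[i])[1:k + 1]:
            W[i, j] = W[j, i] = np.exp(-sqd[i, j] / t)
    Dg = np.diag(W.sum(axis=1))                  # degree matrix
    L = Dg - W                                   # graph Laplacian
    # Smallest generalized eigenvectors of  X L X^T a = lambda X D X^T a.
    XLXt = X @ L @ X.T
    XDXt = X @ Dg @ X.T + 1e-9 * np.eye(D_dim)   # tiny ridge for numerical stability
    _, vecs = eigh(XLXt, XDXt)                   # eigenvalues returned in ascending order
    return vecs[:, :d]

# Usage: recover a 2-D linear structure hidden in 10-D.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 100)) + 0.01 * rng.normal(size=(10, 100))
A = lpp(X)
Y = A.T @ X        # 2 x 100 embedding; A also maps new points, since LPP is linear
print(Y.shape)
```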

6 0.59717268 126 nips-2003-Measure Based Regularization

7 0.55041575 46 nips-2003-Clustering with the Connectivity Kernel

8 0.54633325 107 nips-2003-Learning Spectral Clustering

9 0.44566628 117 nips-2003-Linear Response for Approximate Inference

10 0.40467259 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

11 0.38028786 152 nips-2003-Pairwise Clustering and Graphical Models

12 0.34733701 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

13 0.34321108 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

14 0.34215191 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

15 0.33404398 112 nips-2003-Learning to Find Pre-Images

16 0.32299086 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

17 0.32105872 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA

18 0.30895191 48 nips-2003-Convex Methods for Transduction

19 0.3077741 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

20 0.30334479 60 nips-2003-Eigenvoice Speaker Adaptation via Composite Kernel Principal Component Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (11, 0.021), (29, 0.024), (30, 0.024), (35, 0.033), (53, 0.118), (71, 0.102), (74, 0.012), (76, 0.047), (78, 0.349), (85, 0.056), (91, 0.081), (99, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83927435 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Author: Yoshua Bengio, Jean-françcois Paiement, Pascal Vincent, Olivier Delalleau, Nicolas L. Roux, Marie Ouimet

Abstract: Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data. 1

2 0.69232172 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers

Author: Gang Ji, Jeff A. Bilmes

Abstract: In pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolated-word speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results.
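A hedged sketch of the general scheme the abstract describes — pairwise log-likelihood ratios adjusted by class-pair bias terms, with a simple tournament vote to resolve the possibly intransitive preferences. The bias heuristic and the voting rule below are illustrative placeholders, not the paper's derived "necessary" corrections.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 3-class problem with deliberately misspecified class models (shared identity
# covariance), so the raw likelihood ratios are biased and pairwise corrections can help.
rng = np.random.default_rng(0)
true_means = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 1.5])]
true_covs = [np.eye(2), 0.3 * np.eye(2), np.diag([0.2, 1.0])]
Xtr = np.vstack([rng.multivariate_normal(m, c, 200) for m, c in zip(true_means, true_covs)])
ytr = np.repeat([0, 1, 2], 200)

models = [multivariate_normal(Xtr[ytr == k].mean(axis=0), np.eye(2)) for k in range(3)]

def loglik(X):
    return np.column_stack([m.logpdf(X) for m in models])

# Class-pair bias b[i, j] added to the log likelihood ratio log p(x|i) - log p(x|j).
# This estimate (balance the pairwise ratio over the two classes' training data) is only
# a placeholder heuristic.
L = loglik(Xtr)
b = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            r = L[:, i] - L[:, j]
            b[i, j] = -0.5 * (r[ytr == i].mean() + r[ytr == j].mean())

def predict(X):
    L = loglik(X)
    wins = np.zeros((len(X), 3))
    for i in range(3):
        for j in range(3):
            if i != j:  # pairwise preference of i over j; the overall order may be intransitive
                wins[:, i] += (L[:, i] - L[:, j] + b[i, j] > 0)
    return wins.argmax(axis=1)  # simple tournament vote over pairwise preferences

print("training accuracy with pairwise bias terms:", (predict(Xtr) == ytr).mean())
```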

3 0.69210976 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification. 1 Motivation Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X1 , Y1 ), . . . , (Xn , Yn ), ˆ a large margin classification algorithm produces a decision function fn by empirically minimizing a loss function that is often a convex upper bound of the binary classification error ˆ ˆ function. Given fn , the binary decision rule is to predict y = 1 if fn (x) ≥ 0, and to predict ˆ y = −1 otherwise (the decision rule at fn (x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class Cn : 1 ˆ fn = arg min f ∈Cn n n φ(f (Xi )Yi ). (1) i=1 Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multicategory problems which generalizes the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).1 However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation which generalizes (1) can be used to yield consistent classifiers in the large sample limit. 
An important step toward this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations. 2 Multi-category classification We consider the following K-class classification problem: we would like to predict the label y ∈ {1, . . . , K} of an input vector x. In this paper, we only consider the simplest scenario with 0 − 1 classification loss: we have a loss of 0 for correct prediction, and loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to K class classification problem as follows: we consider K decision functions fc (x) where c = 1, . . . , K and we predict the label y of x as: T (f (x)) = arg max c∈{1,...,K} fc (x), (2) where we denote by f (x) the vector function f (x) = [f1 (x), . . . , fK (x)]. Note that if two or more components of f achieve the same maximum value, then we may choose any of them as T (f ). In this framework, fc (x) is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c (compared with the remaining k − 1 categories). The classification error is given by: (f ) = 1 − EX P (Y = T (X)|X). Note that only the relative strength of fc compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of f (x). This allows us to impose one constraint on the vector f (x) which decreases the degree of freedom K of the K-component vector f (x) to K − 1. 1 This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). 
However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. Taking a limit point of (cm , f m , q m ), and using the continuity of Ψc (·), we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function f (x) ∈ ΩB . Let q(x) = P (·|x). Given X, if P (Y = T (f (X))|X) ≥ P (Y = T (q(X))|X)+ , then equation (6) implies that L(q(X), f (X)) ≥ inf g∈Ω L(q(X), g) + δ. Therefore (f ) − inf (g) =EX [P (Y = T (q(X))|X) − P (Y = T (f (X))|X)] g∈B ≤ + EX I(P (Y = T (q(X))|X) − P (Y = T (f (X))|X) > ) LX (q(X), f (X)) − inf g∈BΩ LX (q(X), g) ≤ + EX δ R(f ) − inf g∈BΩ R(g) = + . δ In the above derivation we use I to denote the indicator function. 
Since and δ are arbitrary, we obtain the theorem by letting → 0. 2 Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with appropriately chosen function class Cn , approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this forms have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes Cn (C1 ⊂ C2 ⊂ · · · ) such that ∪n Cn is dense in the set of Borel measurable functions. The set Cn has the effect of regularization, which ensures that ˆ ˆ P R(fn ) ≈ inf f ∈Cn R(f ). It follows that as n → ∞, R(fn ) → inf f ∈B R(f ). Theorem 2.1 ˆ P then implies that (fn ) → inf f ∈B (f ). The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. We shall focus on the choices of Ψ that lead to admissible formulations. We pay special attention to the case that each Ψc (f ) is a convex function of f , so that the resulting formulation becomes computational more tractable. Instead of working with the general form of Ψc in (4), we focus on two specific choices listed in the next two sections. 3 Unconstrained formulations We consider unconstrained formulation with the following choice of Ψ: K Ψc (f ) = φ(fc ) + s t(fk ) , (7) k=1 where φ, s and t are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label c. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of f . This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing φ(fc ) by φc (fc )), which can be useful if we associate different classification loss to different kinds of errors. 3.1 Optimality equation and probability model Using (7), the conditional true risk (5) can be written as: K L(q, f ) = K qc φ(fc ) + s t(fc ) . c=1 c=1 In the following, we study the property of the optimal vector f ∗ that minimizes L(q, f ) for a fixed q. Given q, the optimal solution f ∗ of L(q, f ) satisfies the following first order condition: ∗ ∗ qc φ (fc ) + µf ∗ t (fc ) = 0 (c = 1, . . . , K). (8) where quantity µf ∗ = s ( K k=1 ∗ t(fk )) is independent of k. ∗ Clearly this equation relates qc to fc for each component c. The relationship of q and f ∗ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with Ψ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition. Proposition 3.1 Consider (7). Assume Φc (f ) is continuous on [−∞, ∞]K and bounded below. If s (u) ≥ 0 and ∀p > 0, pφ (f ) + t (f ) = 0 has a unique solution fp that is an increasing function of p, then the formulation is admissible. If s(u) = u, the condition ∀p > 0 in Proposition 3.1 can be replaced by ∀p ∈ (0, 1). 3.2 Decoupled formulations We let s(u) = u in (7). 
The optimality condition (8) becomes ∗ ∗ qc φ (fc ) + t (fc ) = 0 (c = 1, . . . , K). (9) This means that we have K decoupled equalities, one for each fc . This is the simplest and in the author’s opinion, the most interesting formulation. Since the estimation problem in ˆ (3) is also decoupled into K separate equations, one for each component of fn , this class of methods are computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for two-class problem (if we want to treat the two classes in a symmetric manner) since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint f1 + f2 = 0 can be used to reduce the number of equations. This variable elimination has little impact if there are many categories. In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal optimizer function f∗ (q) and the conditional probability q. For simplicity, we focus on the choice φ(u) = −u. 3.2.1 φ(u) = −u and t(u) = eu ∗ We obtain the following probability model: qc = efc . This formulation is closely related K to the maximum-likelihood estimate with conditional model qc = efc / k=1 efk (logistic regression). In particular, if we choose a function class such that the normalization condiK tion k=1 efk = 1 holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice of φ(u) = − ln u and t(u) = u. This is an extension of maximum-likelihood estimate with probability model qc = fc . The resulting method is identical to maximum-likelihood if we choose our function class such that k fk = 1. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint k fk = 1. Therefore this method is more flexible. 3.2.2 φ(u) = −u and t(u) = ln(1 + eu ) This version uses binary logistic regression loss, and we have the following probability ∗ model: qc = (1 + e−fc )−1 . Again this is an unnormalized model. 1 3.2.3 φ(u) = −u and t(u) = p |u|p (p > 1) ∗ ∗ We obtain the following probability model: qc = sign(fc )|fc |p−1 . This means that at the ∗ ∗ solution, fc ≥ 0. One may modify it such that we allow fc ≤ 0 to model the condition probability qc = 0. 3.2.4 φ(u) = −u and t(u) = 1 p max(u, 0)p (p > 1) ∗ In this probability model, we have the following relationship: qc = max(fc , 0)p−1 . The ∗ equation implies that we allow fc ≤ 0 to model the conditional probability qc = 0. Therefore, with a fixed function class, this model is more powerful than the previous one. How∗ ever, at the optimal solution, fc ≤ 1. This requirement can be further alleviated with the following modification. 3.2.5 φ(u) = −u and t(u) = 1 p min(max(u, 0)p , p(u − 1) + 1) (p > 1) In this probability model, we have the following relationship at the exact solution: qc = c min(max(f∗ , 0), 1)p−1 . Clearly this model is more powerful than the previous model since ∗ the function value fc ≥ 1 can be used to model qc = 1. 3.3 Coupled formulations In the coupled formulation with s(u) = u, the probability model can be normalized in a certain way. We list a few examples. 3.3.1 φ(u) = −u, and t(u) = eu , and s(u) = ln(u) This is the standard logistic regression model. 
The probability model is: K ∗ qc (x) = exp(fc (x))( ∗ exp(fc (x)))−1 . c=1 The right hand side is always normalized (sum up to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region Ω = {f : supk fk = 0}, and it is easy to check that this model is admissible in Ω. Ω Let fc = fc − supk fk ∈ Ω, then f Ω has the same decision rule as f and R(f ) = R(f Ω ). Therefore Theorem 2.1 implies that R(f ) → inf g∈B R(g) implies (f ) → inf g∈B (g). 1 3.3.2 φ(u) = −u, and t(u) = |u|p , and s(u) = p |u|p/p (p, p > 1) The probability model is: K ∗ ∗ ∗ |fk (x)|p )(p−p )/p sign(fc (x))|fc (x)|p −1 . qc (x) = ( k=1 We may replace t(u) by t(u) = max(0, u)p , and the probability model becomes: K qc (x) = ( ∗ ∗ max(fk (x), 0)p )(p−p )/p max(fc (x), 0)p −1 . k=1 These formulations do not seem to have advantages over the decoupled counterparts. Note that if we let p → 1, then the sum of the p p -th power of the right hand side → 1. In a −1 way, this means that the model is normalized in the limit of p → 1. 4 Constrained formulations As pointed out, one may impose constraints on possible choices of f . We may impose such a condition when we specify the function class Cn . However, for clarity, we shall directly impose a condition into our formulation. If we impose a constraint into (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of binary large-margin method (1) to multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider linear equality constraint only: K Ψc (f ) = φ(−fk ), s.t. f ∈ Ω, (10) k=1,k=c where we define Ω as: K Ω = {f : fk = 0} ∪ {f : sup fk = ∞}. k k=1 We may interpret the added constraint as a restriction on the function class Cn in (3) such that every f ∈ Cn satisfies the constraint. Note that with K = 2, this leads to the usually binary large margin method. Using (10), the conditional true risk (5) can be written as: K (1 − qc )φ(−fc ), L(q, f ) = s.t. f ∈ Ω. (11) c=1 The following result provides a simple way to check the admissibility of (10). Proposition 4.1 If φ is a convex function which is bounded below and φ (0) < 0, then (10) is admissible on Ω. Proof Sketch. The continuity condition is straight-forward to verify. We may also assume that φ(·) ≥ 0 without loss of generality. Now let f achieves the minimum of L(q, ·). If fc = ∞, then it is clear that qc = 1 and thus qk = 0 for k = c. This implies that for k = c, φ(−fk ) = inf f φ(−f ), and thus fk < 0. If fc = supk fk < ∞, then the constraint implies fc ≥ 0. It is easy to see that ∀k, qc ≥ qk since otherwise, we must have φ(−fk ) > φ(−fc ), and thus φ (−fk ) > 0 and φ (−fc ) < 0, implying that with sufficient small δ > 0, φ(−(fk + δ)) < φ(−fk ) and φ(−(fc − δ)) < φ(−fc ). A contradiction. 2 Using the above criterion, we can convert any admissible convex φ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of SVM (with loss function φ(u) = max(0, 1 − u)) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. 
Note that for the constrained formulation, it is more difficult to relate fc at the optimal solution to a probability model, since such a model will have a much more complicated form compared with the unconstrained counterpart. 5 Conclusion In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods. We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near zero density may cause serious robustness problems at least in theory. References [1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003. [2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003. [3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion. [4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. with discussion. [5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. accepted. [6] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002. [7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. with discussion. [8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003. [9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002. [11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statitics, 32, 2004. with discussion.
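To make the decoupled unconstrained formulation of Section 3.2 concrete, here is a minimal sketch with φ(u) = −u and t(u) = ln(1 + e^u), fit by plain gradient descent on linear scores; the synthetic data, model class, and optimizer are illustrative assumptions. At the optimum each component approximates the unnormalized model q_c = (1 + e^{−f_c})^{−1}, and prediction uses the rule T(f) = argmax_c f_c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-class problem in 2-D.
K, n = 3, 300
means = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
y = rng.integers(0, K, n)
X = means[y] + 0.7 * rng.normal(size=(n, 2))
Xb = np.c_[X, np.ones(n)]                      # add a bias feature

# Decoupled risk:  Psi_y(f) = -f_y + sum_k log(1 + exp(f_k)),  with f = W x.
def risk_and_grad(W):
    F = Xb @ W.T                               # n x K scores
    P = 1.0 / (1.0 + np.exp(-F))               # per-class unnormalized "probabilities"
    loss = (np.logaddexp(0.0, F).sum(axis=1) - F[np.arange(n), y]).mean()
    G = P.copy()
    G[np.arange(n), y] -= 1.0                  # derivative of the decoupled loss w.r.t. F
    return loss, (G.T @ Xb) / n

W = np.zeros((K, 3))
for _ in range(2000):                          # plain gradient descent
    loss, grad = risk_and_grad(W)
    W -= 0.5 * grad

F = Xb @ W.T
pred = F.argmax(axis=1)                        # decision rule T(f) = argmax_c f_c
probs = 1.0 / (1.0 + np.exp(-F))               # estimates of q_c (not constrained to sum to 1)
print("train accuracy:", (pred == y).mean(), " mean sum of q_c:", probs.sum(axis=1).mean())
```

Because the K components are decoupled, the per-class fits could equally well be run in parallel, which is the computational appeal noted in Section 3.2.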

4 0.53683805 120 nips-2003-Locality Preserving Projections

Author: Xiaofei He, Partha Niyogi


5 0.52758402 128 nips-2003-Minimax Embeddings

Author: Matthew Brand

Abstract: Spectral methods for nonlinear dimensionality reduction (NLDR) impose a neighborhood graph on point data and compute eigenfunctions of a quadratic form generated from the graph. We introduce a more general and more robust formulation of NLDR based on the singular value decomposition (SVD). In this framework, most spectral NLDR principles can be recovered by taking a subset of the constraints in a quadratic form built from local nullspaces on the manifold. The minimax formulation also opens up an interesting class of methods in which the graph is “decorated” with information at the vertices, offering discrete or continuous maps, reduced computational complexity, and immunity to some solution instabilities of eigenfunction approaches. Apropos, we show almost all NLDR methods based on eigenvalue decompositions (EVD) have a solution instability that increases faster than problem size. This pathology can be observed (and corrected via the minimax formulation) in problems as small as N < 100 points. 1 Nonlinear dimensionality reduction (NLDR) . Spectral NLDR methods are graph embedding problems where a set of N points X = [x1 , · · · , xN ] ∈ RD×N sampled from a low-dimensional manifold in a ambient space RD is reparameterized by imposing a neighborhood graph G on X and embedding the graph with minimal distortion in a “parameterization” space Rd , d < D. Typically the graph is sparse and local, with edges connecting points to their immediate neighbors. The embedding must keep these edges short or preserve their length (for isometry) or angles (for conformality). The graph-embedding problem was first introduced as a least-squares problem by Tutte [1], and as an eigenvalue problem by Fiedler [2]. The use of sparse graphs to generate metrics for least-squares problems has been studied intensely in the following three decades (see [3]). Modern NLDR methods use graph constraints to generate a metric in a space of embeddings RN . Eigenvalue decomposition (EVD) gives the directions of least or greatest variance under this metric. Typically a subset of d extremal eigenvectors gives the embedding of N points in Rd parameterization space. This includes the IsoMap family [4], the locally linear embedding (LLE) family [5,6], and Laplacian methods [7,8]. Using similar methods, the Automatic Alignment [6] and Charting [9] algorithms embed local subspaces instead of points, and by combining subspace projections thus obtain continuous maps between RD and Rd . This paper introduces a general algebraic framework for computing optimal embeddings directly from graph constraints. The aforementioned methods can can be recovered as special cases. The framework also suggests some new methods with very attractive properties, including continuous maps, reduced computational complexity, and control over the degree of conformality/isometry in the desired map. It also eliminates a solution instability that is intrinsic to EVD-based approaches. A perturbational analysis quantifies the instability. 2 Minimax theorem for graph embeddings We begin with neighborhood graph specified by a nondiagonal weighted adjacency matrix M ∈ RN×N that has the data-reproducing property XM = X (this can be relaxed to XM ≈ X in practice). The graph-embedding and NLDR literatures offer various constructions of M, each appropriate to different sets of assumptions about the original embedding and its sampling X (e.g., isometry, local linearity, noiseless samples, regular sampling, etc.). 
Typically M_{ij} ≠ 0 if points i, j are nearby on the intrinsic manifold and |M_{ij}| is small or zero otherwise. Each point is taken to be a linear or convex combination of its neighbors, and thus M specifies manifold connectivity in the sense that any nondegenerate embedding Y that satisfies YM ≈ Y with small residual ‖YM − Y‖_F will preserve this connectivity and the structure of local neighborhoods. For example, in barycentric embeddings, each point is the average of its neighbors and thus M_{ij} = 1/k if vertex i is connected to vertex j (of degree k). We will also consider three optional constraints on the embedding:

1. A null-space restriction, where the solution must lie outside the column-space of C ∈ R^{N×M}, M < N. For example, it is common to stipulate that the solution Y be centered, i.e., YC = 0 for C = 1, the constant vector.
2. A basis restriction, where the solution must be a linear combination of the rows of a basis Z ∈ R^{K×N}, K ≤ N. This can be thought of as information placed at the vertices of the graph that serves as example inputs for a target NLDR function. We will use this to construct dimension-reducing radial basis function networks.
3. A metric Σ ∈ R^{N×N} that determines how error is distributed over the points. For example, it might be important that boundary points have less error. We assume that Σ is symmetric positive definite and has factorization Σ = AA^T (e.g., A could be a Cholesky factor of Σ).

In most settings, the optional matrices will default to the identity matrix. In this context, we define the per-dimension embedding error of row-vector y_i ∈ rows(Y) to be

E_M(y_i) ≐ max_{D ∈ R^{M×N}} ‖(y_i(M + CD) − y_i)A‖ / ‖y_i A‖,  y_i ∈ range(Z),  (1)

where D is a matrix constructed by an adversary to maximize the error. The optimizing y_i is a vector inside the subspace spanned by the rows of Z and outside the subspace spanned by the columns of C, for which the reconstruction residual y_i M − y_i has smallest norm w.r.t. the metric Σ. The following theorem identifies the optimal embedding Y for any choice of M, Z, C, Σ.

Minimax solution: Let Q ∈ S^{K×P} be a column-orthonormal basis of the null-space of the rows of ZC, with P = K − rank(C). Let B ∈ R^{P×P} be a square factor satisfying B^T B = Q^T ZΣZ^T Q, e.g., a Cholesky factor (or the “R” factor in a QR-decomposition of (Q^T ZA)^T). Compute the left singular vectors U ∈ S^{P×P} of U diag(s) V^T = B^{−T} Q^T Z(I − M)A, with singular values s ≐ [s_1, · · ·, s_P] ordered s_1 ≤ s_2 ≤ · · · ≤ s_P. Using the leading columns U_{1:d} of U, set Y = U_{1:d}^T B^{−T} Q^T Z.

Theorem 1. Y is the optimal (minimax) embedding in R^d with error ‖[s_1, · · ·, s_d]‖_2:

Y = U_{1:d}^T B^{−T} Q^T Z = arg min_{Y ∈ R^{d×N}} Σ_{y_i ∈ rows(Y)} E_M(y_i)^2, with E_M(y_i) = s_i.  (2)

Appendix A develops the proof and other error measures that are minimized. Local NLDR techniques are easily expressed in this framework. When Z = A = I, C = [], and M reproduces X through linear combinations with M1 = 1, we recover LLE [5]. When Z = I, C = [], I − M is the normalized graph Laplacian, and A is a diagonal matrix of vertex degrees, we recover Laplacian eigenmaps [7]. When further Z = X we recover locality preserving projections [8].
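To make the recipe concrete, here is a minimal NumPy sketch of the minimax solution as reconstructed above. The transpose placements (lost in the extraction), the use of SciPy's null_space and cholesky, the tiny ridge added before the Cholesky factorization, and the returned mixing matrix T are my assumptions rather than the paper's specification.

```python
import numpy as np
from scipy.linalg import cholesky, null_space

def minimax_embedding(M, Z, C, A, d):
    """Minimax embedding of Theorem 1, as reconstructed from the text above.

    M : (N, N) constraint matrix with X M ~= X
    Z : (K, N) basis decorating the vertices (Z = identity gives the discrete case)
    C : (N, Mc) null-space restriction (C = ones((N, 1)) centres the embedding)
    A : (N, N) factor of the error metric, Sigma = A A^T
    d : target dimensionality
    Returns the embedding Y (d, N), the mixing matrix T (d, K) with Y = T Z,
    and the d smallest singular values (the per-dimension errors).
    """
    N = M.shape[0]
    Q = null_space((Z @ C).T)                    # columns orthogonal to Z C
    G = Q.T @ (Z @ A) @ (Z @ A).T @ Q            # Q^T Z Sigma Z^T Q
    G += 1e-12 * np.trace(G) / G.shape[0] * np.eye(G.shape[0])  # tiny ridge, numerical safeguard (my addition)
    B = cholesky(G)                              # upper triangular, B^T B = G
    core = np.linalg.solve(B.T, Q.T @ Z @ (np.eye(N) - M) @ A)  # B^{-T} Q^T Z (I - M) A
    U, s, _ = np.linalg.svd(core, full_matrices=False)
    U_d = U[:, ::-1][:, :d]                      # numpy sorts descending; we keep the smallest
    T = U_d.T @ np.linalg.solve(B.T, Q.T)        # mixing matrix: f(x) = T z(x)
    return T @ Z, T, s[::-1][:d]
```

Passing Z = np.eye(N) and C = np.ones((N, 1)) would give the discrete, centered case; the continuous-map variants discussed below can reuse the same routine with an RBF basis Z.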
3 Analysis and generalization of charting

The minimax construction of charting [9] takes some development, but offers an interesting insight into the above-mentioned methods. Recall that charting first solves for a set of local affine subspace axes S_1 ∈ R^{D×d}, S_2, · · · at offsets µ_1 ∈ R^D, µ_2, · · · that best cover the data and vary smoothly over the manifold. Each subspace offers a chart—a local parameterization of the data by projection onto the local axes. Charting then constructs a weighted mixture of affine projections that merges the charts into a global parameterization. If the data manifold is curved, each projection will assign a point a slightly different embedding, so the error is measured as the variance of these proposed embeddings about their mean. This maximizes consistency and tends to produce isometric embeddings; [9] discusses ways to explicitly optimize the isometry of the embedding. Under the assumption of isometry, the charting error is equivalent to the sum-squared displacements of an embedded point relative to its immediate neighbors (summed over all neighborhoods).

To construct the same error criterion in the minimax setting, let x_{i−k}, · · ·, x_i, · · ·, x_{i+k} denote the points in the ith neighborhood and let the columns of V_i ∈ R^{(2k+1)×d} be an orthonormal basis of the rows of the local parameterization S_i^T [x_{i−k}, · · ·, x_i, · · ·, x_{i+k}]. Then a nonzero reparameterization will satisfy [y_{i−k}, · · ·, y_i, · · ·, y_{i+k}] V_i V_i^T = [y_{i−k}, · · ·, y_i, · · ·, y_{i+k}] if and only if it preserves the relative position of the points in the local parameterization. Conversely, any relative displacements of the points are isolated by the formula [y_{i−k}, · · ·, y_i, · · ·, y_{i+k}](I − V_i V_i^T). Minimizing the Frobenius norm of this expression is thus equivalent to minimizing the local error in charting. We sum these constraints over all neighborhoods to obtain the constraint matrix M = I − Σ_i F_i (I − V_i V_i^T) F_i^T, where (F_i)_{kj} = 1 iff the jth point of the ith neighborhood is the kth point of the dataset. Because V_i V_i^T and (I − V_i V_i^T) are complementary, it follows that the error criterion of any local NLDR method (e.g., LLE, Laplacian eigenmaps, etc.) must measure the projection of the embedding onto some subspace of (I − V_i V_i^T).

To construct a continuous map, charting uses an overcomplete radial basis function (RBF) representation Z = [z(x_1), z(x_2), · · ·, z(x_N)], where z(x) is a vector that stacks z_1(x), z_2(x), etc., with

z_m(x) ≐ (p_m(x) / Σ_m p_m(x)) K_m(x − µ_m),  (3)
p_m(x) = N(x | µ_m, Σ_m) ∝ e^{−(x−µ_m)^T Σ_m^{−1} (x−µ_m)/2},  (4)

and K_m is any local linear dimensionality reducer, typically S_m itself. Each column of Z contains many “views” of the same point that are combined to give its low-dimensional embedding. Finally, we set C = 1, which forces the embedding of the full data to be centered. Applying the minimax solution to these constraints yields the RBF network mixing matrix, f(x) ≐ U_{1:d}^T B^{−T} Q^T z(x). Theorem 1 guarantees that the resulting embedding is least-squares optimal w.r.t. Z, M, C, A at the datapoints f(x_i), and because f(·) is an affine transform of z(·) it smoothly interpolates the embedding between points.

[Fig. 1 (kernel embeddings of the twisted swiss roll: generalized EVD vs. minimax SVD, with upper-right and lower-left corner details). Caption: Minimax and generalized EVD solution for a kernel eigenmap of a non-developable swiss roll. Points are connected into a grid which ideally should be regular. The EVD solution shows substantial degradation. Insets detail corners where the EVD solution crosses itself repeatedly. The border compression is characteristic of Laplacian constraints.]
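A small sketch of the featurization in eqs. (3)-(4), under my reading of the garbled display: each chart contributes its responsibility-weighted local projection of x − µ_m. The argument conventions (arrays of centres, covariances, and reducers) are hypothetical, not the paper's API.

```python
import numpy as np

def charting_features(x, mus, Sigmas, Ks):
    """z(x) stacking the z_m(x) of eqs. (3)-(4), as I read them.

    x      : (D,) query point
    mus    : list of (D,) kernel centres mu_m
    Sigmas : list of (D, D) kernel covariances Sigma_m
    Ks     : list of (d, D) local reducers K_m (identities give the one-shot variant)
    """
    p = np.empty(len(mus))
    for m, (mu, S) in enumerate(zip(mus, Sigmas)):
        diff = x - mu
        # Gaussian density N(x | mu_m, Sigma_m), up to a constant shared by all m
        p[m] = np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / np.sqrt(np.linalg.det(S))
    w = p / p.sum()                                   # responsibilities p_m / sum_m p_m
    return np.concatenate([w[m] * (K @ (x - mu)) for m, (K, mu) in enumerate(zip(Ks, mus))])
```

Stacking z(x_i) column-wise over the training set would yield the Z matrix that the minimax solution decomposes; setting every K_m to the identity gives the one-shot variant discussed next.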
There are some interesting variants:

One-shot charting: If we set the local dimensionality reducers to the identity matrix (all K_m = I), then the minimax method jointly optimizes the local dimensionality reduction to charts and the global coordination of the charts (under any choice of M). This requires that rows(Z) ≤ N for a fully determined solution.

Discrete isometric charting: If Z = I then we directly obtain a discrete isometric embedding of the data, rather than a continuous map, making this a local equivalent of IsoMap.

Reduced basis charting: Let Z be constructed using just a small number of kernels randomly placed on the data manifold, such that rows(Z) ≪ N. Then the size of the SVD problem is substantially reduced.

4 Numerical advantage of minimax method

Note that the minimax method projects the constraint matrix M into a subspace derived from C and Z and decomposes it there. This suppresses unwanted degrees of freedom (DOFs) admitted by the problem constraints, for example the trivial R^0 embedding where all points are mapped to a single point y_i = N^{−1/2}. The R^0 embedding serves as a translational DOF in the solution. LLE- and eigenmap-based methods construct M to have a constant null-space so that the translational DOF will be isolated in the EVD as a null eigenvalue paired with a constant eigenvector, which is then discarded. However, section 4.1 shows that this construction makes the EVD increasingly unstable as the problem size grows and/or the data becomes increasingly amenable to low-residual embeddings, ultimately causing solution collapse. As the next paragraph demonstrates, the problem is exacerbated when embedding w.r.t. a basis Z (via the equivalent generalized eigenproblem), partly because the eigenvector associated with the unwanted DOF can have arbitrary structure. In all cases the problem can be averted by using the minimax formulation with C = 1 to suppress the DOF.

A 2D plane was embedded in 3D with a curl, a twist, and 2.5% Gaussian noise, then regularly sampled at 900 points. We computed a kernelized Laplacian eigenmap using 70 random points as RBF centers, i.e., a continuous map using M derived from the graph Laplacian and Z constructed as above. The map was computed both via the minimax (SVD) method and via the equivalent generalized eigenproblem, where the translational degree of freedom must be removed by discarding an eigenvector from the solution. The two solutions are algebraically equivalent in every other regard. A variety of eigensolvers were tried; we took the best result.

[Fig. 2 (left panel: “Eigen spectrum compared to minimax spectrum”, excess energy ×10^{−5} vs. eigenvalue index; right panel: “Error in null embedding”, deviation ×10^{−5} vs. point index). Caption: Excess energy in the eigenspectrum indicates that the translational DOF has contaminated many eigenvectors. If the EVD had successfully isolated the unwanted DOF, then its remaining eigenvalues should be identical to those derived from the minimax solution. The graph at left shows the difference in the eigenspectra. The graph at right shows the EVD solution's deviation from the translational vector y_0 = 1 · N^{−1/2} ≈ 0.03333. If the numerics were perfect the line would be flat, but in practice the deviation is significant enough (roughly 1% of the diameter of the embedding) to noticeably perturb points in figure 1.]
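The following sketch imitates that experiment with the minimax_embedding routine from the earlier listing. The curl-and-twist map, the barycentric constraint matrix (used here in place of the normalized graph Laplacian of the text), the plain Gaussian RBF basis (in place of the charting features of eqs. (3)-(4)), and the kernel width are all assumptions made for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.linspace(0.0, 1.0, 30)
u, v = (w.ravel() for w in np.meshgrid(g, g))
X = np.stack([u * np.cos(3 * u), v, u * np.sin(3 * u)], axis=1)  # curled, twisted plane
X += 0.025 * rng.standard_normal(X.shape)                        # ~2.5% Gaussian noise
N = len(X)

# barycentric constraint matrix: column i averages the k nearest neighbours of
# point i, so that X M ~= X
k = 6
M = np.zeros((N, N))
for i in range(N):
    nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]
    M[nbrs, i] = 1.0 / k

# plain Gaussian RBF basis Z from 70 random centres (width 0.2 is a guess)
centres = X[rng.choice(N, 70, replace=False)]
Z = np.exp(-((X[None, :, :] - centres[:, None, :]) ** 2).sum(-1) / (2 * 0.2 ** 2))

Y, T, errs = minimax_embedding(M, Z, np.ones((N, 1)), np.eye(N), d=2)

# the same mixing matrix T gives a continuous map for out-of-sample points
x_new = np.array([0.5 * np.cos(1.5), 0.5, 0.5 * np.sin(1.5)])
z_new = np.exp(-((x_new - centres) ** 2).sum(-1) / (2 * 0.2 ** 2))
y_new = T @ z_new
```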
Figure 1 shows that the EVD solution exhibits many defects, particularly a folding-over of the manifold at the top and bottom edges and at the corners. Figure 2 shows that the noisiness of the EVD solution is due largely to mutual contamination of numerically unstable eigenvectors.

4.1 Numerical instability of eigen-methods

The following theorem uses tools of matrix perturbation theory to show that as the problem size increases, the desired and unwanted eigenvectors become increasingly wobbly and gradually contaminate each other, leading to degraded solutions. More precisely, the low-order eigenvalues are ill-conditioned and exhibit multiplicities that may be true (due to noiseless samples from low-curvature manifolds) or false (due to numerical noise). Although in many cases some post-hoc algebra can “filter” the unwanted components out of the contaminated eigensolution, it is not hard to construct cases where the eigenvectors cannot be cleanly separated. The minimax formulation is immune to this problem because it explicitly suppresses the gratuitous component(s) before matrix decomposition.

Theorem 2. For any finite numerical precision, as the number of points N increases, the Frobenius norm of numerical noise in the null eigenvector v_0 can grow as O(N^{3/2}), and the eigenvalue problem can approach a false multiplicity at a rate as fast as O(N^{3/2}), at which point the eigenvectors of interest—embedding and translational—are mutually contaminated and/or have an indeterminate eigenvalue ordering. Please see appendix B for the proof.

This theorem essentially lower-bounds an upper bound on error; examples can be constructed in which the problem is worse. For example, it can be shown analytically that when embedding points drawn from the simple curve x_i = [a, cos πa]^T, a ∈ [0, 1], with K = 2 neighbors, instabilities cannot be bounded better than O(N^{5/2}); empirically we see eigenvector mixing with N < 100 points and we see it grow at the rate ≈ O(N^4)—in many different eigensolvers. At very large scales, more pernicious instabilities set in. E.g., by N = 20000 points, the solution begins to fold over. Although the algebraic multiplicity and instability of the eigenproblem are conceptually a minor oversight in the algorithmic realizations of eigenfunction embeddings, as theorem 2 shows, the consequences are eventually fatal.
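A crude numerical probe of the ratio ‖E‖_F / λ_gap that drives the argument: build a K = 2 neighbor graph on the curve quoted above, inject finite-precision-sized noise on the nonzero pattern, and watch the ratio grow with N. The graph Laplacian construction, the noise model, and the constants are assumptions, intended only to illustrate the scaling, not to reproduce the paper's analysis.

```python
import numpy as np

def gap_stress(N, k=2, eps=1e-12, seed=0):
    """Ratio ||E||_F / lambda_gap for a k-NN graph Laplacian on N samples of
    the curve x = [a, cos(pi a)], a in [0, 1]; eps mimics machine-precision
    noise on the nonzero entries."""
    rng = np.random.default_rng(seed)
    a = np.linspace(0.0, 1.0, N)
    X = np.stack([a, np.cos(np.pi * a)], axis=1)
    W = np.zeros((N, N))
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        for j in np.argsort(d)[1:k + 1]:
            W[i, j] = W[j, i] = 1.0
    L = np.diag(W.sum(1)) - W                        # graph Laplacian
    E = eps * rng.standard_normal((N, N)) * (W + np.eye(N) != 0)  # sparse noise
    lam = np.sort(np.linalg.eigvalsh(L))
    gap = lam[1] - lam[0]                            # lambda_1 - lambda_0, lambda_0 ~ 0
    return np.linalg.norm(E) / gap

for N in (100, 200, 400, 800):
    print(N, gap_stress(N))
```

On this curve the printed ratio grows much faster than linearly in N, which is the qualitative behavior the theorem warns about.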
5 Summary

One of the most appealing aspects of the spectral NLDR literature is that algorithms are usually motivated from analyses of linear operators on smooth differentiable manifolds, e.g., [7]. Understandably, these analyses rely on assumptions (e.g., smoothness or isometry or noiseless sampling) that make it difficult to predict what algorithmic realizations will do when real, noisy data violates these assumptions. The minimax embedding theorem provides a complete algebraic characterization of this discrete NLDR problem, and provides a solution that recovers numerically robustified versions of almost all known algorithms. It offers a principled way of constructing new algorithms with clear optimality properties and good numerical conditioning—notably the construction of a continuous NLDR map (an RBF network) in a one-shot optimization (SVD). We have also shown how to cast several local NLDR principles in this framework, and upgrade these methods to give continuous maps. Working in the opposite direction, we sketched the minimax formulation of isometric charting and showed that its constraint matrix contains a superset of all the algebraic constraints used in local NLDR techniques.

References

1. W.T. Tutte. How to draw a graph. Proc. London Mathematical Society, 13:743–768, 1963.
2. Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czech. Math. Journal, 25:619–633, 1975.
3. Fan R.K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
4. Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, December 22, 2000.
5. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 22, 2000.
6. Yee Whye Teh and Sam T. Roweis. Automatic alignment of hidden representations. In Proc. NIPS-15, 2003.
7. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Volume 14 of Advances in Neural Information Processing Systems, 2002.
8. Xiaofei He and Partha Niyogi. Locality preserving projections. Technical Report TR-2002-09, University of Chicago Computer Science, October 2002.
9. Matthew Brand. Charting a manifold. Volume 15 of Advances in Neural Information Processing Systems, 2003.
10. G.W. Stewart and Ji-Guang Sun. Matrix perturbation theory. Academic Press, 1990.

A Proof of minimax embedding theorem (1)

The burden of this proof is carried by supporting lemmas, below. To emphasize the proof strategy, we give the proof first; supporting lemmas follow.

Proof. Setting y_i = l_i^T Z, we will solve for l_i ∈ columns(L). Writing the error in terms of l_i,

E_M(l_i) = max_{K ∈ R^{M×N}} ‖l_i^T Z(I − M − CK)A‖ / ‖l_i^T ZA‖ = max_{K ∈ R^{M×N}} ‖l_i^T Z(I − M)A − l_i^T ZCKA‖ / ‖l_i^T ZA‖.  (5)

The term l_i^T ZCKA produces infinite error unless l_i^T ZC = 0, so we accept this as a constraint and seek

min_{l_i^T ZC = 0} ‖l_i^T Z(I − M)A‖ / ‖l_i^T ZA‖.  (6)

By lemma 1, that orthogonality is satisfied by solving the problem in the space orthogonal to ZC; the basis for this space is given by the columns of Q ≐ null((ZC)^T). By lemma 2, the denominator of the error specifies the metric in solution space to be ZAA^T Z^T; when the problem is projected into the space orthogonal to ZC it becomes Q^T (ZAA^T Z^T) Q. Nesting the “orthogonally-constrained-SVD” construction of lemma 1 inside the “SVD-under-a-metric” lemma 2, we obtain a solution that uses the correct metric in the orthogonal space:

B^T B = Q^T ZAA^T Z^T Q  (7)
U diag(s) V^T = B^{−T} {Q^T (Z(I − M)A)}  (8)
L = Q B^{−1} U  (9)

where braces indicate the nesting of lemmas. By the “best-projection” lemma (#3), if we order the singular values by ascending magnitude,

L_{1:d} = arg min_{J ∈ R^{N×d}} Σ_{j_i ∈ cols(J)} ( ‖j_i^T Z(I − M)A‖ / ‖j_i^T‖_{ZΣZ^T} )^2.  (10)

The proof is completed by making the substitutions L^T Z → Y and ‖x^T A‖ → ‖x^T‖_Σ (for Σ = AA^T), and leaving off the final square-root operation, to obtain

(Y^T)_{1:d} = arg min_{J ∈ R^{N×d}} Σ_{j_i ∈ cols(J)} ( ‖j_i^T (I − M)‖_Σ / ‖j_i^T‖_Σ )^2.  (11)

Lemma 1. Orthogonally constrained SVD: The left singular vectors L of a matrix M under the constraint L^T C = 0 are calculated as Q ≐ null(C^T), [U, diag(s), V^T] ← SVD(Q^T M), L = QU.

Proof. First observe that L is orthogonal to C: by definition, the null-space basis satisfies Q^T C = 0, thus L^T C = U^T Q^T C = 0. Let J be an orthonormal basis for C, with J^T J = I and Q^T J = 0.
Then L diag(s) V^T = QQ^T M = (I − JJ^T)M, the orthogonal projector of C applied to M, proving that the SVD captures the component of M that is orthogonal to C.

Lemma 2. SVD with respect to a metric: The vectors l_i ∈ L, v_i ∈ V that diagonalize a matrix M with respect to the positive definite column-space metric Σ are calculated as B^T B ← Σ, [U, diag(s), V^T] ← SVD(B^{−T} M), L = B^{−1} U; they satisfy ‖l_i^T M‖ / ‖l_i‖_Σ = s_i and extremize this form for the extremal singular values s_min, s_max.

Proof. By construction, L and V diagonalize M:

L^T M V = (B^{−1} U)^T M V = U^T (B^{−T} M) V = diag(s)  (12)

and U diag(s) V^T = B^{−T} M. Forming the Gram matrices of both sides of the last line, we obtain the identity V diag(s)^2 V^T = M^T B^{−1} B^{−T} M = M^T Σ^{−1} M, which demonstrates that the s_i ∈ s are the singular values of M w.r.t. the column-space metric Σ. Finally, L is orthonormal w.r.t. the metric Σ, because ‖L‖_Σ^2 = L^T Σ L = U^T B^{−T} B^T B B^{−1} U = I. Consequently,

‖l_i^T M‖ / ‖l_i‖_Σ = ‖l_i^T M‖ / 1 = ‖s_i v_i^T‖ = s_i,  (13)

and by the Courant-Hilbert theorem,

s_max = max_l ‖l^T M‖ / ‖l‖_Σ;  s_min = min_l ‖l^T M‖ / ‖l‖_Σ.  (14)

Lemma 3. Best projection: Taking L and s from lemma 2, let the columns of L and the elements of s be sorted so that s_1 ≥ s_2 ≥ · · · ≥ s_N. Then for any dimensionality 1 ≤ d ≤ N,

L_{1:d} ≐ [l_1, · · ·, l_d] = arg max_{J ∈ R^{N×d}} ‖J^T M‖_{(J^T Σ J)^{−1}}  (15)
= arg max_{J ∈ R^{N×d} | J^T Σ J = I} ‖J^T M‖_F  (16)
= arg max_{J ∈ R^{N×d}} Σ_{j_i ∈ cols(J)} ( ‖j_i^T M‖ / ‖j_i‖_Σ )^2  (17)

with the optimum value of all right-hand sides being (Σ_{i=1}^d s_i^2)^{1/2}. If the sort order is reversed, the minimum of this form is obtained.

Proof. By the Eckart-Young-Mirsky theorem, if U^T M V = diag(s) with singular values sorted in descending order, then U_{1:d} ≐ [u_1, · · ·, u_d] = arg max_{U ∈ S^{N×d}} ‖U^T M‖_F. We first extend this to a non-orthogonal basis J under a Mahalanobis norm:

max_{J ∈ R^{N×d}} ‖J^T M‖_{(J^T J)^{−1}} = max_{U ∈ S^{N×d}} ‖U^T M‖_F,  (18)

because ‖J^T M‖^2_{(J^T J)^{−1}} = trace(M^T J (J^T J)^{−1} J^T M) = trace(M^T JJ^+ (JJ^+)^T M) = ‖(JJ^+) M‖_F^2 = ‖UU^T M‖_F^2 = ‖U^T M‖_F^2, since JJ^+ is a (symmetric) orthogonal projector having binary eigenvalues λ ∈ {0, 1} and is therefore the Gram matrix of a thin orthogonal matrix. We then impose a metric Σ on the column-space of J to obtain the first criterion (equation 15), which asks what maximizes variance in J^T M while minimizing the norm of J w.r.t. the metric Σ. Here it suffices to substitute in the leading (resp., trailing) columns of L and verify that the norm is maximized (resp., minimized). Expanding,

‖L_{1:d}^T M‖^2_{(L_{1:d}^T Σ L_{1:d})^{−1}} = trace((L_{1:d}^T M)^T (L_{1:d}^T Σ L_{1:d})^{−1} (L_{1:d}^T M)) = trace((L_{1:d}^T M)^T I (L_{1:d}^T M)) = trace((diag(s_{1:d}) V_{1:d}^T)^T (diag(s_{1:d}) V_{1:d}^T)) = ‖s_{1:d}‖^2.

Again, by the Eckart-Young-Mirsky theorem, these are the maximal variance-preserving projections, so the first criterion is indeed maximized by setting J to the columns in L corresponding to the largest values in s. Criterion #2 restates the first criterion with the set of candidates for J restricted to (the hyperelliptical manifold of) matrices that reduce the metric on the norm to the identity matrix (thereby recovering the Frobenius norm). Criterion #3 merely expands the above trace by individual singular values. Note that the numerator and denominator can have different metrics because they are norms in different spaces, possibly of different dimension. Finally, that the trailing d eigenvectors minimize these criteria follows directly from the fact that the leading N − d singular values account for the maximal part of the variance.
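The nested construction of lemmas 1-2 (eqs. (7)-(9)) is easy to mirror in code; the sketch below returns singular vectors that are orthogonal to C and orthonormal under the metric Σ. The orientation conventions are assumed, as elsewhere.

```python
import numpy as np
from scipy.linalg import cholesky, null_space

def constrained_metric_svd(M, C, Sigma):
    """Left singular vectors of M that (a) are orthogonal to the columns of C
    (lemma 1) and (b) diagonalize M under the column-space metric Sigma
    (lemma 2), following the nesting of eqs. (7)-(9)."""
    Q = null_space(C.T)                              # basis of the space orthogonal to C
    B = cholesky(Q.T @ Sigma @ Q)                    # B^T B = Q^T Sigma Q
    U, s, _ = np.linalg.svd(np.linalg.solve(B.T, Q.T @ M), full_matrices=False)
    L = Q @ np.linalg.solve(B, U)                    # L = Q B^{-1} U
    return L, s
```

A quick sanity check on random inputs is that L.T @ C vanishes and L.T @ Sigma @ L is close to the identity, which are exactly the two properties the proofs rely on.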
B Proof of instability theorem (2)

Proof. When generated from a sparse graph with average degree K, the weighted connectivity matrix W is sparse and has O(NK) entries. Since the graph vertices represent samples from a smooth manifold, increasing the sampling density N does not change the distribution of magnitudes in W. Consider a perturbation of the nonzero values in W, e.g., W → W + E, due to numerical noise E created by finite machine precision. By the weak law of large numbers, the Frobenius norm of the sparse perturbation grows as ‖E‖_F ∼ O(√N). However the t-th smallest nonzero eigenvalue λ_t(W) scales as λ_t(W) = v_t^T W v_t ∼ O(N^{−1}), because the elements of the corresponding eigenvector v_t grow as O(N^{−1/2}) and only K of those elements are multiplied by nonzero values to form each element of W v_t. In sum, the perturbation ‖E‖_F grows while the eigenvalue λ_t(W) shrinks. In linear embedding algorithms, the eigengap of interest is λ_gap ≐ λ_1 − λ_0. The tail eigenvalue λ_0 = 0 by construction, but it is possible that λ_0 > 0 with numerical error, thus λ_gap ≤ λ_1. Combining these facts, the ratio between the perturbation and the eigengap grows as ‖E‖_F / λ_gap ∼ O(N^{3/2}) or faster.

Now consider the shifted eigenproblem I − W with leading (maximal) eigenvalues 1 − λ_0 ≥ 1 − λ_1 ≥ · · · and unchanged eigenvectors. From matrix perturbation theory [10, thm. V.2.8], when W is perturbed to W + E, the change in the leading eigenvalue from 1 − λ_0 to its perturbed value 1 − λ'_0 is bounded as |λ_0 − λ'_0| ≤ √2 ‖E‖_F, and similarly 1 − λ'_1 ≤ 1 − λ_1 + √2 ‖E‖_F. Thus λ'_gap ≥ λ_gap − 2√2 ‖E‖_F. Since ‖E‖_F / λ_gap ∼ O(N^{3/2}), the right-hand side of the gap bound goes negative at a supralinear rate, implying that the eigenvalue ordering eventually becomes unstable, with the possibility of the first and second eigenvalue/vector pairs being swapped. Mutual contamination of the eigenvectors happens well before: under general (dense) conditions, the change in the eigenvector v_0 is bounded as ‖v'_0 − v_0‖ ≤ 4 ‖E‖_F / (|λ_0 − λ_1| − √2 ‖E‖_F) [10, thm. V.2.8]. (This bound is often tight enough to serve as a good approximation.) Specializing this to the sparse embedding matrix, we find that the bound weakens to ‖v'_0 − 1 · N^{−1/2}‖ ∼ O(√N) / (O(N^{−1}) − O(√N)) > O(√N) / O(N^{−1}) = O(N^{3/2}).

6 0.50721121 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models

7 0.498171 126 nips-2003-Measure Based Regularization

8 0.47304681 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

9 0.47272244 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

10 0.4697139 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

11 0.46905899 149 nips-2003-Optimal Manifold Representation of Data: An Information Theoretic Approach

12 0.4665767 30 nips-2003-Approximability of Probability Distributions

13 0.46521464 93 nips-2003-Information Dynamics and Emergent Computation in Recurrent Circuits of Spiking Neurons

14 0.46462637 107 nips-2003-Learning Spectral Clustering

15 0.46378922 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

16 0.46343616 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

17 0.46313828 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks

18 0.46264774 189 nips-2003-Tree-structured Approximations by Expectation Propagation

19 0.46210796 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

20 0.46150932 81 nips-2003-Geometric Analysis of Constrained Curves