jmlr jmlr2007 jmlr2007-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Masashi Sugiyama
Abstract: Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix
Reference: text
sentIndex sentText sentNum sentScore
1 Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. [sent-5, score-0.253]
2 An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality preserving property. [sent-6, score-0.314]
3 In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. [sent-8, score-0.277]
4 We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick. [sent-11, score-0.198]
5 Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix 1. [sent-12, score-0.248]
6 Introduction The goal of dimensionality reduction is to embed high-dimensional data samples in a low-dimensional space so that most of ‘intrinsic information’ contained in the data is preserved (e. [sent-13, score-0.22]
7 Once dimensionality reduction is carried out appropriately, the compact representation of the data can be used for various succeeding tasks such as visualization, classification, etc. [sent-17, score-0.182]
8 In this paper, we consider the supervised dimensionality reduction problem, that is, samples are accompanied with class labels. [sent-18, score-0.244]
9 Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990) is a popular method for linear supervised dimensionality reduction. [sent-19, score-0.2]
10 This Fisher criterion can be used for dimensionality reduction onto a space with dimension more than one in multi-class problems (Fukunaga, 1990). [sent-30, score-0.201]
11 With some abuse, we refer to the dimensionality reduction method based on the Fisher criterion as FDA (see Section 2. [sent-31, score-0.182]
12 FDA seeks a transformation such that the between-class scatter is maximized and the within-class scatter is minimized. [sent-34, score-0.216]
13 For example, in disease diagnosis, the distribution of medical checkup samples of sick patients could be multimodal since there may be several different causes even for a single disease. [sent-42, score-0.208]
14 For this reason, there is a universal need for reducing the dimensionality of multimodal data. [sent-45, score-0.252]
15 In order to reduce the dimensionality of multimodal data appropriately, it is important to preserve the local structure of the data. [sent-46, score-0.281]
16 Locality-preserving projection (LPP) (He and Niyogi, 2004) meets this requirement; LPP seeks an embedding transformation such that nearby data pairs in the original space are kept close in the embedding space. [sent-47, score-0.183]
17 Thus LPP can reduce the dimensionality of multimodal data without losing the local structure. [sent-48, score-0.281]
18 However, LPP is an unsupervised dimensionality reduction method and does not take the label information into account. [sent-49, score-0.196]
19 Therefore, it does not necessarily work appropriately in supervised dimensionality reduction scenarios. [sent-50, score-0.206]
20 In this paper, we propose a new dimensionality reduction method called local Fisher discriminant analysis (LFDA). [sent-51, score-0.253]
21 Thus LFDA is useful for dimensionality reduction of multimodal labeled data. [sent-53, score-0.3]
22 The original FDA provides a meaningful result only when the dimensionality of the embedding space is smaller than the number of classes because of the rank deficiency of the between-class scatter matrix (Fukunaga, 1990). [sent-54, score-0.371]
23 On the other hand, the proposed LFDA does not generally suffer from this problem and can be employed for dimensionality reduction into an arbitrary dimensional space. [sent-56, score-0.182]
24 Furthermore, LFDA inherits an excellent property from FDA—it has an analytic form of the embedding matrix and the solution can be easily computed just by solving a generalized eigenvalue problem. [sent-57, score-0.168]
25 This is an advantage over recently proposed supervised dimensionality reduction methods (e. [sent-58, score-0.206]
26 Furthermore, LFDA can be naturally extended to non-linear dimensionality reduction scenarios by applying the kernel trick (Schölkopf and Smola, 2002). [sent-62, score-0.198]
27 In Section 2, we formulate the linear dimensionality reduction problem, briefly review FDA and LPP, and illustrate how they typically behave. [sent-64, score-0.182]
28 Linear Dimensionality Reduction In this section, we formulate the problem of linear dimensionality reduction and review existing methods. [sent-70, score-0.182]
29 For the moment, we focus on linear dimensionality reduction, that is, using a d × r transformation matrix T, the embedded samples z_i are given by z_i = T⊤ x_i, where ⊤ denotes the transpose of a matrix or vector. [sent-85, score-0.302]
30 In Section 3.4, we extend our discussion to the non-linear dimensionality reduction scenarios where the mapping from x_i to z_i is non-linear. [sent-87, score-0.182]
31 2 Fisher Discriminant Analysis for Dimensionality Reduction One of the most popular dimensionality reduction techniques is Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990; Duda et al. [sent-89, score-0.224]
32 That is, FDA seeks a transformation matrix T such that the between-class scatter is ‘maximized’ while the within-class scatter is ‘minimized’. [sent-100, score-0.298]
33 This constraint makes the within-class scatter in the embedding space sphered. [sent-107, score-0.173]
34 The between-class scatter matrix S(b) has at most rank c − 1 (Fukunaga, 1990). [sent-108, score-0.172]
35 This is an essential limitation of FDA for dimensionality reduction and is very restrictive in practice. [sent-111, score-0.182]
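To make the FDA review above concrete, the following is a minimal sketch (not code from the paper) of how the Fisher criterion is typically maximized: build the between-class and within-class scatter matrices and take the leading generalized eigenvectors. The function name is illustrative, and a practical implementation would regularize S_w when it is singular.

```python
import numpy as np
from scipy.linalg import eigh

def fda_transform(X, y, r):
    """X: (n, d) samples, y: (n,) class labels, r: embedding dimension (at most c - 1 for FDA)."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_b = np.zeros((d, d))  # between-class scatter
    S_w = np.zeros((d, d))  # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean_all)[:, None]
        S_b += len(Xc) * diff @ diff.T
        S_w += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    # Generalized eigenvalue problem S_b v = lambda S_w v (assumes S_w is nonsingular).
    eigvals, eigvecs = eigh(S_b, S_w)
    T = eigvecs[:, np.argsort(eigvals)[::-1][:r]]  # d x r transformation matrix
    return T  # embed samples with Z = X @ T
```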
36 3 Locality-Preserving Projection Another dimensionality reduction technique that is relevant to the current setting is locality-preserving projection (LPP) (He and Niyogi, 2004). [sent-113, score-0.182]
37 The LPP transformation matrix T_LPP is defined as follows: T_LPP ≡ argmin_{T ∈ R^{d×r}} (1/2) ∑_{i,j=1}^{n} A_{i,j} ||T⊤ x_i − T⊤ x_j||^2 subject to T⊤ X D X⊤ T = I_r, (4) where D is the n-dimensional diagonal matrix whose i-th diagonal element is D_{i,i} ≡ ∑_{j=1}^{n} A_{i,j}. [sent-119, score-0.213]
38 Eq. (4) implies that LPP looks for a transformation matrix T such that nearby data pairs in the original space R^d are kept close in the embedding space. [sent-125, score-0.166]
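As a hedged sketch of how Eq. (4) is usually solved: with the graph Laplacian L = D − A, the objective reduces to the generalized eigenvalue problem X⊤ L X v = λ X⊤ D X v (samples stored as rows), and the transformation collects the eigenvectors with the smallest eigenvalues. This reduction follows He and Niyogi (2004); the code below is illustrative and assumes X⊤ D X is nonsingular.

```python
import numpy as np
from scipy.linalg import eigh

def lpp_transform(X, A, r):
    """X: (n, d) samples as rows, A: (n, n) affinity matrix, r: target dimension."""
    D = np.diag(A.sum(axis=1))
    L = D - A                      # graph Laplacian
    left = X.T @ L @ X             # d x d
    right = X.T @ D @ X            # d x d, assumed nonsingular here
    eigvals, eigvecs = eigh(left, right)
    T = eigvecs[:, np.argsort(eigvals)[:r]]  # eigenvectors with the smallest eigenvalues
    return T                        # embed samples with Z = X @ T
```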
39 The reason for the failure of FDA is that the ‘levels’ of the between-class scatter and the within-class scatter are not evaluated in an intuitively natural way because of the two separate clusters in ‘◦’-class (see also Fukunaga, 1990). [sent-139, score-0.216]
40 In other words, the undesired behavior of FDA is caused by the globality when evaluating the within-class scatter and the between-class scatter (e. [sent-144, score-0.231]
41 To overcome these problems, we propose combining the ideas of FDA and LPP; more specifically, we evaluate the levels of the between-class scatter and the within-class scatter in a local manner. [sent-150, score-0.245]
42 This implies that if the data pairs in the same class are made close, the within-class scatter matrix S (w) gets ‘small’ and the between-class scatter matrix S (b) gets ‘large’. [sent-164, score-0.331]
43 On the other hand, if the data pairs in different classes are separated from each other, the between-class scatter matrix S(b) gets ‘large’. [sent-165, score-0.175]
44 2 Definition and Typical Behavior of LFDA Based on the above pairwise expression, let us define the local within-class scatter matrix S^(w) and the local between-class scatter matrix S^(b) as follows. [sent-169, score-0.37]
45 Toy examples of dimensionality reduction by LFDA are illustrated in Figure 1. [sent-182, score-0.182]
46 S^(w) can be expressed as S^(w) = (1/2) ∑_{i=1}^{n} (1/n_{y_i}) P_i^(w), where n_{y_i} is the number of samples in the class to which the sample x_i belongs and P_i^(w) is the pointwise local within-class scatter matrix around x_i: P_i^(w) ≡ ∑_{j: y_j = y_i} A_{i,j} (x_j − x_i)(x_j − x_i)⊤. [sent-196, score-0.431]
47 Therefore, ‘minimizing’ S^(w) corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices over all samples. [sent-197, score-0.18]
48 S^(b) can also be expressed in a similar way as S^(b) = (1/2) ∑_{i=1}^{n} (1/n − 1/n_{y_i}) P_i^(w) + (1/(2n)) ∑_{i=1}^{n} P_i^(b), (13) where P_i^(b) is the pointwise between-class scatter matrix around x_i: P_i^(b) ≡ ∑_{j: y_j ≠ y_i} (x_j − x_i)(x_j − x_i)⊤. [sent-198, score-0.283]
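The pointwise expressions above can be turned into a direct (if slow) computation of the local scatter matrices. The sketch below reflects my reading of the definitions rather than code from the paper: same-class pairs are weighted by A_{i,j}/n_{y_i} in the within-class matrix, while the between-class matrix combines the weight A_{i,j}(1/n − 1/n_{y_i}) for same-class pairs with 1/n for pairs from different classes; these weights should be checked against the paper's definitions of W^(w) and W^(b).

```python
import numpy as np

def lfda_local_scatter(X, y, A):
    """X: (n, d) samples, y: (n,) labels, A: (n, n) affinity matrix.
    Returns the local within-class and between-class scatter matrices (O(n^2) loop, for clarity only)."""
    n, d = X.shape
    counts = {c: int(np.sum(y == c)) for c in np.unique(y)}
    S_lw = np.zeros((d, d))
    S_lb = np.zeros((d, d))
    for i in range(n):
        for j in range(n):
            diff = (X[i] - X[j])[:, None]
            outer = diff @ diff.T
            if y[i] == y[j]:
                w_lw = A[i, j] / counts[y[i]]
                w_lb = A[i, j] * (1.0 / n - 1.0 / counts[y[i]])  # non-positive for same-class pairs
            else:
                w_lw = 0.0
                w_lb = 1.0 / n
            S_lw += 0.5 * w_lw * outer
            S_lb += 0.5 * w_lb * outer
    return S_lw, S_lb
```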
49 Pseudo code of LFDA (input): labeled samples {(x_i, y_i) | x_i ∈ R^d, y_i ∈ {1, 2, ..., c}}, i = 1, ..., n. [sent-199, score-0.161]
50 For i = 1, ..., n, determine the local scaling: x_i^(7) ← 7th nearest neighbor of x_i among {x_j}_{j=1}^{n}; σ_i ← ||x_i − x_i^(7)||; end for. The affinity A_{i,j} is then computed for i, j = 1, 2, ..., n (see the sketch below). [sent-208, score-0.208]
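The local scaling step in the pseudo code above is commonly implemented as follows; the Gaussian form A_{i,j} = exp(−||x_i − x_j||^2 / (σ_i σ_j)) is the local-scaling affinity (Zelnik-Manor and Perona) that the paper adopts, with σ_i the distance to the 7th nearest neighbor. Treat this as an illustrative reading of the algorithm, not the paper's exact code.

```python
import numpy as np

def local_scaling_affinity(X, k=7):
    """X: (n, d) samples; k: neighbor index used for the local scale sigma_i."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))  # (n, n) pairwise distances
    sigma = np.sort(dist, axis=1)[:, k]  # column 0 is the point itself, so column k is the k-th neighbor
    A = np.exp(-dist ** 2 / (sigma[:, None] * sigma[None, :]))
    return A
```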
51 Eq. (13) implies that ‘maximizing’ S^(b) corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices and maximizing the sum of the pointwise between-class scatter matrices. [sent-230, score-0.205]
52 The original FDA allows us to extract at most c − 1 meaningful features since the between-class scatter matrix S^(b) has rank at most c − 1 (Fukunaga, 1990). [sent-232, score-0.172]
53 On the other hand, the local between-class scatter matrix S^(b) generally has a much higher rank with less eigenvalue multiplicity, thanks to the localization factor A_{i,j} included in W^(b) (see Eq. [sent-233, score-0.238]
54 Therefore, the proposed LFDA can be practically employed for dimensionality reduction into any dimensional spaces. [sent-236, score-0.182]
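Putting the pieces together, the analytic LFDA solution referred to here amounts to a single generalized eigenvalue problem between the two local scatter matrices; the sketch below takes the r generalized eigenvectors of S^(b) φ = λ S^(w) φ with the largest eigenvalues as the columns of T. The small ridge added to S^(w) is my own safeguard against singularity, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def lfda_transform(S_lb, S_lw, r, ridge=1e-6):
    """S_lb, S_lw: (d, d) local between-/within-class scatter matrices; r: embedding dimension."""
    d = S_lw.shape[0]
    eigvals, eigvecs = eigh(S_lb, S_lw + ridge * np.eye(d))  # generalized eigenproblem
    T = eigvecs[:, np.argsort(eigvals)[::-1][:r]]            # largest eigenvalues first
    return T  # embed samples with Z = X @ T
```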
55 4 Kernel LFDA for Non-Linear Dimensionality Reduction Here we show how LFDA can be extended to non-linear dimensionality reduction scenarios. [sent-250, score-0.182]
56 Since KLFDA uses the samples only via the kernel function K(x, x′), it allows us to reduce the dimensionality of such non-vectorial data. [sent-287, score-0.188]
57 Based on a similar idea, they also proposed a global supervised dimensionality reduction method using local discriminant information (LDI) in the same paper. [sent-292, score-0.277]
58 We refer to this supervised dimensionality reduction method as LDI. [sent-293, score-0.206]
59 In LDI, the data samples {x_i}_{i=1}^{n} are first sphered according to the within-class scatter matrix S^(w), that is, each x_i is transformed to (S^(w))^(−1/2) x_i for i = 1, 2, ..., n. [sent-296, score-0.223]
60 Let A_{i,j} be the weight of sample x_j around x_i, defined by A_{i,j} ≡ (1 − (||x_i − x_j|| / ||x_i − x_i^(K)||)^3)^3 if ||x_i − x_j|| < ||x_i − x_i^(K)||, and A_{i,j} ≡ 0 otherwise. [sent-300, score-0.306]
61 Here x_i^(K) is the K-th nearest neighbor of x_i in the sphered space. [sent-301, score-0.185]
62 The matrix obtained in the sphered space, denoted here T'_LDI, is a transformation matrix for sphered samples; the LDI transformation matrix T_LDI for non-sphered samples is given by T_LDI = (S^(w))^(−1/2) T'_LDI. [sent-308, score-0.231]
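A hedged sketch of the LDI weighting defined above: a tricube weight in the sphered space that decays to zero at the K-th nearest neighbor of x_i. Names are illustrative, and ties or duplicate points are ignored for simplicity.

```python
import numpy as np

def ldi_weights(X_sphered, K):
    """X_sphered: (n, d) samples after sphering with the within-class scatter matrix."""
    dist = np.sqrt(((X_sphered[:, None, :] - X_sphered[None, :, :]) ** 2).sum(axis=-1))
    d_K = np.sort(dist, axis=1)[:, K]              # distance to the K-th nearest neighbor
    ratio = dist / d_K[:, None]
    return np.where(ratio < 1.0, (1.0 - ratio ** 3) ** 3, 0.0)  # tricube weights A_ij
```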
63 The average between sum-of-squares matrix used in LDI is conceptually very similar to the local between-class scatter matrix S^(b) in LFDA, and it can be expressed in a similar pairwise manner. [sent-310, score-0.233]
64 This implies that far apart sample pairs in different classes could be made close in LDI, which is not desirable in supervised dimensionality reduction. [sent-316, score-0.177]
65 Another important difference between LDI and LFDA is that the within-class scatter matrix S (w) is not localized in LDI. [sent-318, score-0.156]
66 1, the within-class scatter matrix S (w) also accounts for collapsing the within-class multimodal structure (i. [sent-320, score-0.29]
67 (2005) proposed a supervised dimensionality reduction method called neighborhood component analysis (NCA). [sent-343, score-0.206]
68 In such a scenario, NCA requires optimizing the transformation matrix individually for each dimensionality r of the embedding space. [sent-366, score-0.281]
69 Once U_MCML is obtained, the MCML transformation matrix T_MCML is computed by T_MCML = (φ_1 | φ_2 | · · · | φ_r), (25) where {φ_k}_{k=1}^{r} are the eigenvectors associated with the positive eigenvalues η_1 ≥ η_2 ≥ · · · ≥ η_r > 0 of the eigenvalue problem U_MCML φ = ηφ. [sent-380, score-0.172]
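Eq. (25) corresponds to a plain eigen-decomposition of the learned matrix; a minimal sketch, assuming U is the positive semi-definite matrix learned by MCML, is:

```python
import numpy as np

def mcml_transform(U, r):
    """U: (d, d) matrix learned by MCML; returns up to r eigenvectors with positive eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(U)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    keep = [k for k in order[:r] if eigvals[k] > 0.0]  # keep only positive eigenvalues
    return eigvecs[:, keep]                            # columns phi_1, ..., phi_r
```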
70 However, MCML still has a weakness in optimization: the optimization problem (24) is convex only when r = d, that is, the dimensionality is not reduced but only the distance metric of the original space is changed. [sent-382, score-0.169]
71 Since MCML requires all the samples in the same class to collapse into a single point, it is not necessarily useful in dimensionality reduction of multimodal data samples. [sent-403, score-0.358]
72 In addition to between-class separability, LFDA clearly preserves the multimodal structure among sick patients (i. [sent-449, score-0.17]
73 2 Data Visualization Here we apply the proposed and existing dimensionality reduction methods to benchmark data sets and investigate how they behave in data visualization tasks. [sent-455, score-0.209]
74 Overall, LFDA is found to be more appropriate for embedding labeled multimodal data samples than FDA and LPP, implying that our primary goal has been successfully achieved. [sent-479, score-0.221]
75 Overall, LDI works fairly well, but the within-class multimodal structure is sometimes lost since LDI only partially takes within-class multimodality into account (see Section 4. [sent-488, score-0.165]
76 3 Classification Here we apply the proposed and existing dimensionality reduction techniques to classification tasks, and objectively evaluate the effectiveness of LFDA. [sent-544, score-0.182]
77 Table (computation time ratio): data sets banana, breast-cancer, diabetes, flare-solar, german, heart, image, ringnorm, splice, thyroid, titanic, twonorm, waveform, USPS1, USPS2; columns report the relative computation time of LFDA and LDI. [sent-564, score-0.185]
78 Table 3: Means and standard deviations of the misclassification rate when the embedding dimensionality is chosen by cross validation. [sent-734, score-0.199]
79 Note that LPP and PCA are unsupervised dimensionality reduction methods, while others are supervised methods. [sent-742, score-0.22]
80 Table 3 describes the mean and standard deviation of the misclassification rate by each method when the embedding dimensionality r is chosen by 5-fold cross validation (Stone, 1974; Wahba, 1990); for the USPS-eo and USPS-sl data sets, we used 20-fold cross validation since this was more accurate. [sent-748, score-0.199]
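The model selection procedure described here can be sketched as follows, with `embed` standing for any of the compared dimensionality reduction methods returning a d × r transformation matrix; the scikit-learn utilities for the folds and the one-nearest-neighbor classifier are my choice of tooling rather than the paper's.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def choose_dimension(X, y, embed, candidate_rs, n_folds=5):
    """Pick the embedding dimensionality r minimizing cross-validated 1-NN misclassification."""
    mean_errors = []
    for r in candidate_rs:
        fold_errors = []
        for tr, te in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
            T = embed(X[tr], y[tr], r)                                   # (d, r) transformation
            clf = KNeighborsClassifier(n_neighbors=1).fit(X[tr] @ T, y[tr])
            fold_errors.append(1.0 - clf.score(X[te] @ T, y[te]))
        mean_errors.append(np.mean(fold_errors))
    return candidate_rs[int(np.argmin(mean_errors))]
```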
81 Figure 7: Mean misclassification rates by a one-nearest-neighbor method as functions of the dimensionality of the embedding space (horizontal axes: reduced dimension r). [sent-863, score-0.199]
82 Table rows: banana, breast-cancer, diabetes, flare-solar, german, heart, image, ringnorm, splice, thyroid, titanic, twonorm, waveform, USPS-eo, USPS-sl; columns: misclassification rates for LFDA, EUCLID, FDA, and the other compared methods. [sent-865, score-0.185]
83 LPP and PCA perform well, despite the fact that they are unsupervised dimensionality reduction methods. [sent-963, score-0.196]
84 The misclassification rates by a naive one-nearest-neighbor classification without dimensionality reduction (‘EUCLID’) are described in Table 4. [sent-968, score-0.182]
85 Based on the above simulation results, we conclude that the proposed LFDA is a promising dimensionality reduction technique also in classification scenarios. [sent-975, score-0.182]
86 Conclusions We discussed the problem of supervised dimensionality reduction. [sent-977, score-0.158]
87 LPP (He and Niyogi, 2004) can work well in dimensionality reduction of multimodal data. [sent-981, score-0.3]
88 However, it is an unsupervised method and is not necessarily useful in supervised dimensionality reduction scenarios. [sent-982, score-0.22]
89 LFDA allows us to reduce the dimensionality of multimodal labeled data appropriately by maximizing between-class separability and preserving the within-class local structure at the same time. [sent-984, score-0.302]
90 The original FDA provides a meaningful result only when the dimensionality of the embedding space is smaller than the number of classes because of the rank deficiency of the between-class scatter matrix. [sent-987, score-0.323]
91 On the other hand, LFDA does not share this limitation and can be employed for dimensionality reduction into any dimensional spaces (see Section 3. [sent-988, score-0.182]
92 This means that the range of the transformation matrix can be uniquely determined, but the distance metric in the embedding space cannot be determined. [sent-993, score-0.166]
93 Therefore, how to optimally determine the kernel function for supervised dimensionality reduction needs to be explored. [sent-1004, score-0.222]
94 Although this standard choice appeared to be reasonable in experiments, it is important to find the optimal way to define the affinity matrix in the context of supervised dimensionality reduction. [sent-1008, score-0.206]
95 MDA (Hastie and Tibshirani, 1996b) provides a solid probabilistic framework for supervised dimensionality reduction with multimodality (see Section 4. [sent-1009, score-0.253]
96 Then we have S^(b) = ∑_{i=1}^{n} x_i x_i⊤ − (1/n) ∑_{i,j=1}^{n} x_i x_j⊤ − S^(w) = (1/2) ∑_{i,j=1}^{n} (1/n − W_{i,j}^(w)) (x_i x_i⊤ + x_j x_j⊤ − x_i x_j⊤ − x_j x_i⊤), which yields Eq. [sent-1024, score-0.459]
97 Based on Eq. (15), the LFDA transformation matrix T_LFDA can be computed analytically using the generalized eigenvectors and generalized eigenvalues of the following generalized eigenvalue problem. [sent-1054, score-0.226]
98 Let S^(m) be the local mixture scatter matrix defined by S^(m) ≡ S^(b) + S^(w). From Eqs. [sent-1056, score-0.185]
99 σ_i represents the local scaling of the data samples around x_i, which is determined by σ_i = ||x_i − x_i^(K)||, where x_i^(K) is the K-th nearest neighbor of x_i. [sent-1119, score-0.238]
100 A kernel view of the dimensionality reduction of manifolds. [sent-1281, score-0.198]
wordName wordTfidf (topN-words)
[('lfda', 0.682), ('fda', 0.425), ('lpp', 0.245), ('mcml', 0.202), ('nca', 0.192), ('ldi', 0.186), ('dimensionality', 0.134), ('multimodal', 0.118), ('scatter', 0.108), ('isher', 0.099), ('ugiyama', 0.084), ('nity', 0.083), ('af', 0.071), ('vi', 0.069), ('iscriminant', 0.069), ('misclassification', 0.066), ('embedding', 0.065), ('nalysis', 0.06), ('ocal', 0.06), ('globerson', 0.059), ('fukunaga', 0.057), ('roweis', 0.056), ('wi', 0.052), ('xi', 0.051), ('reduction', 0.048), ('matrix', 0.048), ('multimodality', 0.047), ('discriminant', 0.042), ('goldberger', 0.041), ('fisher', 0.04), ('ai', 0.039), ('samples', 0.038), ('eigenvalue', 0.037), ('yi', 0.036), ('thyroids', 0.035), ('eigenvectors', 0.034), ('transformation', 0.034), ('tr', 0.032), ('thyroid', 0.032), ('pi', 0.032), ('patients', 0.032), ('neighbor', 0.031), ('sphered', 0.029), ('local', 0.029), ('visualization', 0.027), ('pointwise', 0.025), ('rd', 0.025), ('banana', 0.025), ('ringnorm', 0.025), ('twonorm', 0.025), ('niyogi', 0.025), ('supervised', 0.024), ('classwise', 0.023), ('hyperthyroidism', 0.023), ('hypothyroidism', 0.023), ('kashima', 0.023), ('nearest', 0.023), ('saul', 0.023), ('waveform', 0.022), ('separability', 0.021), ('sick', 0.02), ('toy', 0.02), ('collapse', 0.02), ('titanic', 0.02), ('pairs', 0.019), ('dimension', 0.019), ('eigenvalues', 0.019), ('metric', 0.019), ('splice', 0.019), ('pca', 0.018), ('matrices', 0.018), ('generalized', 0.018), ('tenenbaum', 0.018), ('appendix', 0.018), ('euclid', 0.017), ('euthyroidism', 0.017), ('jmcml', 0.017), ('jnca', 0.017), ('mda', 0.017), ('rtner', 0.017), ('diabetes', 0.017), ('reduced', 0.016), ('rank', 0.016), ('collapsing', 0.016), ('kernel', 0.016), ('iris', 0.016), ('diagonal', 0.016), ('scaling', 0.015), ('usps', 0.015), ('nyi', 0.015), ('psd', 0.015), ('undesired', 0.015), ('weinberger', 0.015), ('unsupervised', 0.014), ('argmax', 0.014), ('letter', 0.014), ('multiplicity', 0.013), ('kernelized', 0.013), ('perona', 0.013), ('misclassi', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
2 0.064882211 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming
3 0.055124771 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
Author: Marco Loog
Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization 1. Classical and New Criteria Given N feature vectors of dimensionality n, a linear reduction of dimensionality, based on classical Fisher LDA, determines an n × d transformation matrix L that, for a given d < K, K the number of classes, maximizes the so-called Fisher criterion: F(A) = tr((A t SW A)−1 (At SB A)) or, equivalently, F0 (A) = tr((At ST A)−1 (At SB A)). Here, SB := ∑K pi (mi − m)(mi − m)t , SW := ∑K pi Si , and i=1 i=1 ST = SB + SW . The matrices SB , SW , and ST are the so-called between-class, pooled within-class, and total covariance matrices. In addition, mi is the mean vector of class i, pi is the prior of class i, and the overall mean m equals ∑k pi mi . Finally, Si is the covariance matrix of class i. i=1 A solution to these optimization problems can be obtained by means of a generalized eigenvalue decomposition, which Fukunaga (1990) relates to a simultaneous diagonalization of the two matrices involved (see also Campbell and Atchley, 1981). More common is it to apply a standard −1 eigenvalue decomposition to S−1 SB (or SW SB ), resulting in an equivalent set of eigenvectors. The d T columns of the optimal solution L are simply taken to equal the d eigenvectors corresponding to the d largest eigenvalues. It is known that this solution is not unique and the full class can be obtained by multiplying L to the right with nonsingular d × d matrices (see Fukunaga, 1990). Clearly, if the total covariance ST is singular, neither the generalized nor the standard eigenvalue decomposition can be readily employed. Directly or indirectly, the problem is that the matrix inverse S−1 does not exist, which is the typical situation when dealing with small samples. In an attempt to T overcome this problem, Ye (2005) introduced a different criterion that is defined as F1 (A) = tr((At ST A)+ (At SB A)) , ∗. Also at Nordic Bioscience Imaging, Hovegade 207, DK-2730 Herlev, Denmark. c 2007 Marco Loog. (1) L OOG where + denotes taking the Moore-Penrose generalized inverse of a matrix. Like for F0 , an optimal transform L is one that maximizes the objective F1 . Again, this solution is not unique. 2. Correct Characterization For the full characterization of the set of solutions to Equation (1), initially the problem is looked at from a geometrical point of view (cf., Campbell and Atchley, 1981). It is assumed that the number of samples N is smaller than or equal to the feature dimensionality n. In the undersampled case, it is clear that all data variation occurs in an N − 1-dimensional subspace of the original space. To start with, a PCA of the data is carried out and the first N − 1 principal components are rotated to the first N − 1 axes of the n-dimensional space by means of a rotation matrix R. This matrix consists of all (normalized) eigenvectors of ST taken as its columns. After this rotation, new total and between-class covariance matrices, ST = Rt ST R and SB = Rt SB R, are obtained. 
These 0 0 matrices can be partitioned as follows: ST = Σ0T 0 and SB = ΣB 0 , where ΣT and ΣB are N − 1 × 0 N − 1 covariance matrices and ΣT is nonsingular and diagonal by construction. The between-class variation is obviously restricted to the N − 1-dimensional subspace in which the total data variation takes place, therefore a same partitioning of SB is possible. However, the covariance submatrix ΣB is not necessarily diagonal, neither does it have to be nonsingular. Basically, the PCA-based rotation R converts the initial problem into a more convenient one, splitting up the original space in an N − 1-dimensional one in which “everything interesting” takes place and a remaining n − N + 1dimensional subspace in which “nothing happens at all”. Now consider F1 in this transformed space and take a general n × d transformation matrix A, which is partitioned in a way similar to the covariance matrices, that is, X . Y A= (2) Here, X is an N − 1 × d-matrix and Y is of size n − N + 1 × d. Taking this definition, the following holds (cf., Ye, 2005): t + t F1 (A) = tr((A ST A) (A SB A)) = tr =tr X t ΣT X 0 0 0 + X Y X t ΣB X 0 0 0 t ΣT 0 = tr 0 0 X Y + (Xt ΣT X)−1 0 0 0 X Y t ΣB 0 0 0 X Y X t ΣB X 0 0 0 = tr((Xt ΣT X)−1 (Xt ΣB X)) = F0 (X) . From this it is immediate that a matrix A maximizes F1 if and only if the submatrix X maximizes the original Fisher criterion in the lower-dimensional subspace. Moreover, if L is such a matrix maximizing F1 in the PCA-transformed space, it is easy to check that R−1 L = Rt L provides a solution to the original, general problem that has not been preprocessed by means of a PCA (see also Fukunaga, 1990). A characterization of the complete family of solutions can now be given. Let Λ ∈ RN−1×d be an optimal solution of F0 (X) = tr((Xt ΣT X)−1 (Xt ΣB X)). As already noted in Section 1, the full set of solutions is given by F = {ΛZ ∈ RN−1×d | Z ∈ GLd (R)}, where GLd (R) denotes the general linear group of d × d invertible matrices. The previous paragraph essentially demonstrates that if X ∈ F , A in Equation (2) maximizes F1 . The matrix Y can be chosen ad 2122 C OMPLETE C HARACTERIZATION OF A FAMILY OF S OLUTIONS libitum. Now, the latter provides the solution in the PCA-transformed space and to solve the general problem we need to take the rotation back to the original space into account. All in all, this leads to the following complete family of solutions L , maximizing F1 in the original space: L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R), Y ∈ Rn−N+1×d Y , (3) where Λ = argmaxX tr((Xt ΣT X)−1 (Xt ΣB X)) and Rt takes care of the rotation back. 3. Original Characterization Though not noted by Ye (2005), his attempt to characterize the full set of solutions of Equation (1) is based on a simultaneous diagonalization of the three covariance matrices S B , SW , and ST that is similar to the ideas described by Campbell and Atchley (1981) and Fukunaga (1990). Moreover, Golub and Van Loan (Theorem 8.7.1. 1996) can be readily applied to demonstrate that such simultaneous diagonalization is possible in the small sample setting. After the diagonalization step, partitioned between-class, pooled within-class, and total covariance matrices are considered. This partitioning is similar to the one employed in the previous section, which does not enforce matrices to be diagonal however. In the subsequent optimization step, the classical Fisher criterion is maximized basically in the appropriate subspace, comparable to the approach described above, but in a mildly more involved and concealed way. 
For this, matrices of the form Rt X are considered, consider Equations (2) and Y (3). However, Y is simply the null matrix and the family of solutions L provided is limited to L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R) . 0 Obviously, this is far from a complete characterization, especially when N − 1 n which is, for instance, typically the case for the data sets considered by Ye (2005). Generally, the utility of a dimensionality reduction criterion, without additional constrains, depends on the efficiency over the full set of solutions. As Ye (2005) only considers two very specific instances from the large class of possibilities, it is unclear to what extent the new criterion really provides an efficient way of performing a reduction of dimensionality. References N. A. Campbell and W. R. Atchley. The geometry of canonical variate analysis. Systematic Zoology, 30(3):268–280, 1981. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005. 2123
4 0.039323084 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
Author: Francesco Dinuzzo, Marta Neve, Giuseppe De Nicolao, Ugo Pietro Gianazza
Abstract: Support Vector Regression (SVR) for discrete data is considered. An alternative formulation of the representer theorem is derived. This result is based on the newly introduced notion of pseudoresidual and the use of subdifferential calculus. The representer theorem is exploited to analyze the sensitivity properties of ε-insensitive SVR and introduce the notion of approximate degrees of freedom. The degrees of freedom are shown to play a key role in the evaluation of the optimism, that is the difference between the expected in-sample error and the expected empirical risk. In this way, it is possible to define a C p -like statistic that can be used for tuning the parameters of SVR. The proposed tuning procedure is tested on a simulated benchmark problem and on a real world problem (Boston Housing data set). Keywords: statistical learning, reproducing kernel Hilbert spaces, support vector machines, representer theorem, regularization theory
5 0.038096741 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
6 0.032585699 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
7 0.03222191 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
8 0.029515108 15 jmlr-2007-Bilinear Discriminant Component Analysis
9 0.028196378 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.027351245 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
11 0.027119838 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
14 0.024286186 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
15 0.023049528 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
16 0.022220192 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
17 0.019656112 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
18 0.019594962 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
19 0.019334573 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
20 0.019147666 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
topicId topicWeight
[(0, 0.119), (1, 0.023), (2, 0.021), (3, 0.061), (4, -0.119), (5, -0.09), (6, -0.052), (7, -0.023), (8, 0.0), (9, 0.03), (10, -0.018), (11, 0.039), (12, -0.095), (13, -0.101), (14, 0.032), (15, -0.017), (16, 0.002), (17, 0.015), (18, 0.091), (19, -0.202), (20, 0.159), (21, 0.004), (22, 0.031), (23, 0.057), (24, -0.08), (25, -0.214), (26, -0.255), (27, -0.082), (28, 0.016), (29, 0.243), (30, -0.076), (31, -0.056), (32, -0.315), (33, -0.061), (34, 0.119), (35, 0.357), (36, -0.05), (37, -0.058), (38, 0.066), (39, 0.068), (40, 0.199), (41, -0.133), (42, 0.021), (43, -0.064), (44, -0.11), (45, 0.116), (46, -0.067), (47, 0.087), (48, -0.075), (49, 0.175)]
simIndex simValue paperId paperTitle
same-paper 1 0.94431174 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
2 0.47386488 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
3 0.27219751 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
4 0.25966051 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
5 0.19968785 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
Author: Chao-Chun Liu, Dao-Qing Dai, Hong Yan
Abstract: Face recognition is a challenging problem due to variations in pose, illumination, and expression. Techniques that can provide effective feature representation with enhanced discriminability are crucial. Wavelets have played an important role in image processing for its ability to capture localized spatial-frequency information of images. In this paper, we propose a novel local discriminant coordinates method based on wavelet packet for face recognition to compensate for these variations. Traditional wavelet-based methods for face recognition select or operate on the most discriminant subband, and neglect the scattered characteristic of discriminant features. The proposed method selects the most discriminant coordinates uniformly from all spatial frequency subbands to overcome the deficiency of traditional wavelet-based methods. To measure the discriminability of coordinates, a new dilation invariant entropy and a maximum a posterior logistic model are put forward. Moreover, a new triangle square ratio criterion is used to improve classification using the Euclidean distance and the cosine criterion. Experimental results show that the proposed method is robust for face recognition under variations in illumination, pose and expression. Keywords: local discriminant coordinates, invariant entropy, logistic model, wavelet packet, face recognition, illumination, pose and expression variations
6 0.18141118 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
7 0.16719204 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
9 0.13941257 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
10 0.11912434 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
11 0.10784463 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
12 0.10321571 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
13 0.099654973 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
14 0.097315818 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
15 0.097154312 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
16 0.096489273 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
17 0.093980476 15 jmlr-2007-Bilinear Discriminant Component Analysis
18 0.091321796 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
19 0.088809185 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
20 0.087980457 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
topicId topicWeight
[(0, 0.01), (3, 0.016), (4, 0.047), (8, 0.036), (10, 0.023), (12, 0.028), (15, 0.014), (17, 0.02), (22, 0.012), (28, 0.042), (40, 0.029), (45, 0.018), (48, 0.491), (60, 0.02), (80, 0.011), (85, 0.033), (98, 0.06)]
simIndex simValue paperId paperTitle
1 0.91965252 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models
Author: Margarita Osadchy, Yann Le Cun, Matthew L. Miller
Abstract: We describe a novel method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a lowdimensional manifold parametrized by pose, and images of non-faces to points far away from that manifold. Given an image, detecting a face and estimating its pose is viewed as minimizing an energy function with respect to the face/non-face binary variable and the continuous pose parameters. The system is trained to minimize a loss function that drives correct combinations of labels and pose to be associated with lower energy values than incorrect ones. The system is designed to handle very large range of poses without retraining. The performance of the system was tested on three standard data sets—for frontal views, rotated faces, and profiles— is comparable to previous systems that are designed to handle a single one of these data sets. We show that a system trained simuiltaneously for detection and pose estimation is more accurate on both tasks than similar systems trained for each task separately.1 Keywords: face detection, pose estimation, convolutional networks, energy based models, object recognition
Author: Carine Hue, Marc Boullé
Abstract: In this paper, we consider the supervised learning task which consists in predicting the normalized rank of a numerical variable. We introduce a novel probabilistic approach to estimate the posterior distribution of the target rank conditionally to the predictors. We turn this learning task into a model selection problem. For that, we define a 2D partitioning family obtained by discretizing numerical variables and grouping categorical ones and we derive an analytical criterion to select the partition with the highest posterior probability. We show how these partitions can be used to build univariate predictors and multivariate ones under a naive Bayes assumption. We also propose a new evaluation criterion for probabilistic rank estimators. Based on the logarithmic score, we show that such criterion presents the advantage to be minored, which is not the case of the logarithmic score computed for probabilistic value estimator. A first set of experimentations on synthetic data shows the good properties of the proposed criterion and of our partitioning approach. A second set of experimentations on real data shows competitive performance of the univariate and selective naive Bayes rank estimators projected on the value range compared to methods submitted to a recent challenge on probabilistic metric regression tasks. Our approach is applicable for all regression problems with categorical or numerical predictors. It is particularly interesting for those with a high number of predictors as it automatically detects the variables which contain predictive information. It builds pertinent predictors of the normalized rank of the numerical target from one or several predictors. As the criteria selection is regularized by the presence of a prior and a posterior term, it does not suffer from overfitting. Keywords: rank regression, probabilistic approach, 2D partitioning, non parametric estimation, Bayesian model selection
same-paper 3 0.81375092 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
4 0.4118844 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
Author: Marco Reisert, Hans Burkhardt
Abstract: This paper presents a new class of matrix valued kernels that are ideally suited to learn vector valued equivariant functions. Matrix valued kernels are a natural generalization of the common notion of a kernel. We set the theoretical foundations of so called equivariant matrix valued kernels. We work out several properties of equivariant kernels, we give an interpretation of their behavior and show relations to scalar kernels. The notion of (ir)reducibility of group representations is transferred into the framework of matrix valued kernels. At the end to two exemplary applications are demonstrated. We design a non-linear rotation and translation equivariant filter for 2D-images and propose an invariant object detector based on the generalized Hough transform. Keywords: kernel methods, matrix kernels, equivariance, group integration, representation theory, Hough transform, signal processing, Volterra series
5 0.39959359 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
6 0.39721677 39 jmlr-2007-Handling Missing Values when Applying Classification Models
7 0.38414487 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
8 0.37898868 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
9 0.3770574 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
10 0.35901892 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
11 0.35658541 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
12 0.35187447 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
13 0.34456423 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
14 0.3413071 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
15 0.33125365 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
16 0.32828698 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
17 0.3277784 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
18 0.32536912 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
19 0.32458434 86 jmlr-2007-Truncating the Loop Series Expansion for Belief Propagation
20 0.31934902 72 jmlr-2007-Relational Dependency Networks