jmlr jmlr2008 jmlr2008-71 knowledge-graph by maker-knowledge-mining

71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations


Source: pdf

Author: Marco Loog

Abstract: In this JMLR volume, Ye (2008) demonstrates the essential equivalence of two sets of solutions to a generalized Fisher criterion used for linear dimensionality reduction (see Ye, 2005; Loog, 2007). Here, I point out the basic flaw in this new contribution. Keywords: linear discriminant analysis, equivalence relation, linear subspaces, Bayes error

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Keywords: linear discriminant analysis, equivalence relation, linear subspaces, Bayes error [sent-5, score-0.248]

2 Introduction. Some time ago, Ye (2005) studied an optimization criterion for linear dimensionality reduction and tried to characterize the family of solutions to this objective function. [sent-6, score-0.45]

3 The description, however, merely covers a part of the full solution set and is therefore, in fact, not at all a characterization. [sent-7, score-0.091]

4 Loog (2007) has corrected this mistake, giving the proper, larger set of solutions. [sent-8, score-0.079]

5 In this volume, Ye (2008) now demonstrates that the two solution sets are essentially equivalent. [sent-9, score-0.183]

6 In principle, Ye (2008) is correct and the two sets of dimension-reducing transforms can indeed be considered equivalent. [sent-10, score-0.178]

7 At the base of this fact is that, mathematically speaking, anything can be equivalent to anything else. [sent-11, score-0.34]

8 The point I would like to convey, however, is that the equivalence considered is not essential and, as a result, the two sets are in fact essentially different. [sent-12, score-0.361]

9 The main question here is: what is ‘essential’ in the context of supervised linear dimensionality reduction? [sent-13, score-0.141]

10 Essential Equivalence. Concerned with classification tasks, the performance of every dimensionality reduction criterion should primarily be discussed in relation to the Bayes error (see Fukunaga, 1990, Chapter 10). [sent-15, score-0.287]

11 As such, transformations might be considered essentially equivalent if their Bayes errors in the reduced spaces are equal. [sent-16, score-0.488]

12 A closely related definition is to consider transformations A and B equivalent if there is a nonsingular transformation T such that A = T ◦ B (see Fukunaga, 1990). [sent-17, score-0.414]

13 The latter is more restrictive than the former as the existence of T implies an equal Bayes error for A and B, but the implication in the other direction does not necessarily hold. [sent-18, score-0.179]

14 When A and B are linear and there is such a transform T, both of them span the same subspace of the original feature space, obviously [sent-19, score-0.265]

15 resulting in the equality of the Bayes errors. [sent-22, score-0.021]

16 Based on the foregoing, two linear transformations are also considered essentially equivalent if they span the same subspace. [sent-23, score-0.591]
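For reference, the two notions of essential equivalence discussed above can be written compactly as follows; this restatement assumes that A and B are ℓ × d linear transformations of full row rank, an assumption only implicit in the text above:

\[
\exists\,\text{nonsingular } T:\; A = T \circ B
\;\Longleftrightarrow\;
\operatorname{rowspace}(A) = \operatorname{rowspace}(B)
\;\Longrightarrow\;
\varepsilon_{\mathrm{Bayes}}(A) = \varepsilon_{\mathrm{Bayes}}(B),
\]

where ε_Bayes(·) denotes the Bayes error in the reduced space. As noted above, the final implication cannot, in general, be reversed.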

17 Now, without providing any rationale, Ye (2008) declares two linear transformations A and B to be equivalent if there is a vector v such that A(xi − v) = B(xi − v) for all feature vectors xi in the training set. [sent-24, score-0.475]

18 The following very simple examples demonstrate, however, why the latter definition of equivalence is flawed. [sent-25, score-0.17]

19 Let x1 = (0, 0)t and x2 = (1, 0)t be two training samples, let A = (1, 0), B = (−1, 0), C = (1, 1), D = (0, 0), and E = (0, 1) be linear transformations, and let v = (v1, v2)t. [sent-26, score-0.022]

20 Now, firstly, one cannot have both −v1 = A(x1 − v) = B(x1 − v) = v1 and 1 − v1 = A(x2 − v) = B(x2 − v) = −1 + v1, and therefore A is not equivalent to B even though A = −B. [sent-27, score-0.065]

21 That is, two transforms that trivially define the same subspace are apparently not equivalent. [sent-28, score-0.288]

22 Secondly, D(xi − v) = 0 = E(xi − v) shows that transforms spanning subspaces of different dimensions can be equivalent. [sent-29, score-0.402]

23 Finally, a straightforward calculation shows that A and C are equivalent, that is, two transforms that obviously span different subspaces, and therefore most probably result in different Bayes errors, are considered equivalent. [sent-30, score-0.394]
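The three claims above can be verified mechanically. Below is a minimal sketch, assuming Ye's definition exactly as quoted (G1 and G2 are equivalent iff some vector v satisfies G1(xi − v) = G2(xi − v) for every training point xi); the helper names are illustrative only.

import sympy as sp

# Training samples and the candidate 1 x 2 transformations from the example above.
v1, v2 = sp.symbols('v1 v2', real=True)
v = sp.Matrix([v1, v2])
X = [sp.Matrix([0, 0]), sp.Matrix([1, 0])]          # x1, x2
A, B = sp.Matrix([[1, 0]]), sp.Matrix([[-1, 0]])
C, D, E = sp.Matrix([[1, 1]]), sp.Matrix([[0, 0]]), sp.Matrix([[0, 1]])

def ye_equivalent(G1, G2):
    # Equivalent in Ye's sense iff the linear system (G1 - G2)(x_i - v) = 0, i = 1, 2,
    # has a solution v.
    eqs = [((G1 - G2) * (x - v))[0] for x in X]
    return not sp.linsolve(eqs, v1, v2).is_empty

print(ye_equivalent(A, B))   # False: same subspace, yet not 'equivalent'
print(ye_equivalent(D, E))   # True:  subspaces of different dimension, yet 'equivalent'
print(ye_equivalent(A, C))   # True:  different subspaces, yet 'equivalent'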

24 In Conclusion. Maintaining that the equivalence relation in Ye (2008) is flawed, it follows directly that it cannot be concluded that the different sets of solutions given by Loog (2007) and Ye (2005) are essentially equivalent. [sent-32, score-0.424]

25 In fact, as should be obvious from Loog (2007), they are essentially different. [sent-33, score-0.128]

26 Given that x1 and x2 (as defined above) come from two different classes, one can easily check that the solution set given by Ye (2005) is {(a, 0) | a ∈ R \ {0}}, that is, nondegenerate multiples of A = (1, 0), while the true set also contains transformations like C = (1, 1). [sent-34, score-0.418]
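A quick numerical sketch of this last claim, under the reading that the criterion of Ye (2005) is trace((G St G^T)^+ (G Sb G^T)) for a 1 × 2 transformation G, with Sb the between-class and St the total scatter matrix of the two training points, each forming its own class; the setup and names below are illustrative assumptions, not taken from the original text.

import numpy as np

x1, x2 = np.array([0.0, 0.0]), np.array([1.0, 0.0])       # one sample per class
c = (x1 + x2) / 2                                          # global centroid
Sb = np.outer(x1 - c, x1 - c) + np.outer(x2 - c, x2 - c)   # between-class scatter
St = Sb                                                    # Sw = 0: a single sample per class

def fisher_value(G):
    # Generalized Fisher criterion trace((G St G^T)^+ (G Sb G^T)) for a row vector G.
    G = np.atleast_2d(G)
    return float(np.trace(np.linalg.pinv(G @ St @ G.T) @ (G @ Sb @ G.T)))

for name, G in [("A = (1, 0)", [1.0, 0.0]),
                ("C = (1, 1)", [1.0, 1.0]),
                ("E = (0, 1)", [0.0, 1.0])]:
    print(name, fisher_value(np.array(G)))
# A and C both attain the value 1, the largest possible here since Sw = 0, so C
# maximizes the criterion as well although it is not of the form (a, 0); E attains 0.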

27 Both define different subspaces and, generically, lead to different Bayes errors. [sent-35, score-0.182]

28 A complete characterization of a family of solutions to a generalized Fisher criterion. [sent-43, score-0.317]

29 Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. [sent-47, score-0.249]

30 Comments on the complete characterization of a family of solutions to a generalized Fisher criterion. [sent-51, score-0.317]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ye', 0.486), ('loog', 0.425), ('transformations', 0.303), ('copenhagen', 0.25), ('subspaces', 0.182), ('delft', 0.167), ('sher', 0.167), ('transforms', 0.158), ('equivalence', 0.145), ('netherlands', 0.136), ('fukunaga', 0.127), ('bayes', 0.126), ('marco', 0.108), ('essentially', 0.108), ('span', 0.106), ('anything', 0.101), ('dimensionality', 0.098), ('characterization', 0.092), ('essential', 0.088), ('solutions', 0.076), ('incentives', 0.071), ('nondegenerate', 0.071), ('rstly', 0.071), ('declares', 0.071), ('awed', 0.071), ('nwo', 0.071), ('pack', 0.071), ('family', 0.07), ('reduction', 0.068), ('generically', 0.063), ('undersampled', 0.063), ('ago', 0.063), ('leslie', 0.063), ('kaelbling', 0.063), ('discriminant', 0.059), ('ict', 0.058), ('implication', 0.058), ('corrected', 0.058), ('subspace', 0.057), ('generalized', 0.057), ('obviously', 0.056), ('mistake', 0.054), ('mathematically', 0.054), ('nonsingular', 0.054), ('demonstrates', 0.054), ('faculty', 0.051), ('aw', 0.051), ('concluded', 0.051), ('rationale', 0.048), ('cd', 0.048), ('restrictive', 0.048), ('jmlr', 0.045), ('organization', 0.045), ('relation', 0.044), ('criterion', 0.043), ('nl', 0.041), ('group', 0.04), ('apparently', 0.04), ('tried', 0.04), ('spanning', 0.04), ('covers', 0.037), ('secondly', 0.037), ('primarily', 0.034), ('trivially', 0.033), ('merely', 0.033), ('thought', 0.033), ('characterize', 0.033), ('fisher', 0.032), ('equivalent', 0.032), ('speaking', 0.031), ('chapter', 0.031), ('proper', 0.03), ('concerned', 0.03), ('maintaining', 0.03), ('probably', 0.029), ('former', 0.027), ('volume', 0.027), ('xi', 0.026), ('errors', 0.025), ('calculation', 0.025), ('program', 0.025), ('transformation', 0.025), ('latter', 0.025), ('transform', 0.024), ('academic', 0.023), ('come', 0.023), ('technology', 0.023), ('complete', 0.022), ('linear', 0.022), ('dimensions', 0.022), ('solution', 0.021), ('equality', 0.021), ('giving', 0.021), ('existence', 0.021), ('providing', 0.021), ('supervised', 0.021), ('base', 0.021), ('considered', 0.02), ('comments', 0.02), ('obvious', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations

Author: Marco Loog

Abstract: In this JMLR volume, Ye (2008) demonstrates the essential equivalence of two sets of solutions to a generalized Fisher criterion used for linear dimensionality reduction (see Ye, 2005; Loog, 2007). Here, I point out the basic flaw in this new contribution. Keywords: linear discriminant analysis, equivalence relation, linear subspaces, Bayes error

2 0.27541953 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Jieping Ye

Abstract: Loog (2007) provided a complete characterization of the family of solutions to a generalized Fisher criterion. We show that this characterization is essentially equivalent to the original characterization proposed in Ye (2005). The computational advantage of the original characterization over the new one is discussed, which justifies its practical use. Keywords: linear discriminant analysis, dimension reduction, linear transformation

3 0.050598327 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen

Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue

4 0.0452221 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

Author: Andreas Maurer

Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning

5 0.031239845 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

Author: Jieping Ye, Shuiwang Ji, Jianhui Chen

Abstract: Regularized kernel discriminant analysis (RKDA) performs linear discriminant analysis in the feature space via the kernel trick. Its performance depends on the selection of kernels. In this paper, we consider the problem of multiple kernel learning (MKL) for RKDA, in which the optimal kernel matrix is obtained as a linear combination of pre-specified kernel matrices. We show that the kernel learning problem in RKDA can be formulated as convex programs. First, we show that this problem can be formulated as a semidefinite program (SDP). Based on the equivalence relationship between RKDA and least square problems in the binary-class case, we propose a convex quadratically constrained quadratic programming (QCQP) formulation for kernel learning in RKDA. A semi-infinite linear programming (SILP) formulation is derived to further improve the efficiency. We extend these formulations to the multi-class case based on a key result established in this paper. That is, the multi-class RKDA kernel learning problem can be decomposed into a set of binary-class kernel learning problems which are constrained to share a common kernel. Based on this decomposition property, SDP formulations are proposed for the multi-class case. Furthermore, it leads naturally to QCQP and SILP formulations. As the performance of RKDA depends on the regularization parameter, we show that this parameter can also be optimized in a joint framework with the kernel. Extensive experiments have been conducted and analyzed, and connections to other algorithms are discussed. Keywords: model selection, kernel discriminant analysis, semidefinite programming, quadratically constrained quadratic programming, semi-infinite linear programming

6 0.028537022 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

7 0.027248712 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs    (Special Topic on Causality)

8 0.022918923 96 jmlr-2008-Visualizing Data using t-SNE

9 0.022470783 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

10 0.021387327 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

11 0.020445643 45 jmlr-2008-JNCC2: The Java Implementation Of Naive Credal Classifier 2    (Machine Learning Open Source Software Paper)

12 0.019612834 92 jmlr-2008-Universal Multi-Task Kernels

13 0.019154781 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

14 0.01886104 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding

15 0.018614272 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure

16 0.017642729 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

17 0.01755991 87 jmlr-2008-Stationary Features and Cat Detection

18 0.015171439 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

19 0.014785828 58 jmlr-2008-Max-margin Classification of Data with Absent Features

20 0.013786339 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.093), (1, -0.019), (2, -0.012), (3, -0.02), (4, -0.025), (5, -0.05), (6, 0.069), (7, 0.166), (8, -0.134), (9, 0.261), (10, 0.428), (11, 0.252), (12, -0.405), (13, -0.117), (14, -0.127), (15, 0.089), (16, -0.16), (17, 0.001), (18, -0.057), (19, 0.047), (20, 0.002), (21, 0.058), (22, -0.019), (23, 0.003), (24, 0.033), (25, -0.052), (26, 0.049), (27, -0.018), (28, -0.056), (29, -0.025), (30, 0.025), (31, -0.018), (32, -0.018), (33, -0.003), (34, 0.037), (35, 0.035), (36, -0.025), (37, -0.031), (38, -0.022), (39, -0.009), (40, -0.002), (41, -0.027), (42, 0.043), (43, 0.008), (44, -0.031), (45, 0.002), (46, 0.027), (47, 0.005), (48, 0.024), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98671412 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations

Author: Marco Loog

Abstract: In this JMLR volume, Ye (2008) demonstrates the essential equivalence of two sets of solutions to a generalized Fisher criterion used for linear dimensionality reduction (see Ye, 2005; Loog, 2007). Here, I point out the basic flaw in this new contribution. Keywords: linear discriminant analysis, equivalence relation, linear subspaces, Bayes error

2 0.92814499 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Jieping Ye

Abstract: Loog (2007) provided a complete characterization of the family of solutions to a generalized Fisher criterion. We show that this characterization is essentially equivalent to the original characterization proposed in Ye (2005). The computational advantage of the original characterization over the new one is discussed, which justifies its practical use. Keywords: linear discriminant analysis, dimension reduction, linear transformation

3 0.15394196 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

Author: Andreas Maurer

Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning

4 0.14837356 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen

Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue

5 0.14281675 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective

Author: Liviu Panait, Karl Tuyls, Sean Luke

Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction

6 0.1030946 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

7 0.088497296 96 jmlr-2008-Visualizing Data using t-SNE

8 0.088145748 50 jmlr-2008-Learning Reliable Classifiers From Small or Incomplete Data Sets: The Naive Credal Classifier 2

9 0.087825187 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

10 0.077864915 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

11 0.076777652 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

12 0.076626644 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs    (Special Topic on Causality)

13 0.074924842 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

14 0.073506385 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

15 0.072275728 57 jmlr-2008-Manifold Learning: The Price of Normalization

16 0.071341246 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

17 0.070170768 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding

18 0.068708502 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

19 0.065696493 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

20 0.065400511 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (40, 0.032), (55, 0.613), (58, 0.029), (66, 0.041), (76, 0.017), (88, 0.068), (92, 0.025), (94, 0.011), (99, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7966674 71 jmlr-2008-On the Equivalence of Linear Dimensionality-Reducing Transformations

Author: Marco Loog

Abstract: In this JMLR volume, Ye (2008) demonstrates the essential equivalence of two sets of solutions to a generalized Fisher criterion used for linear dimensionality reduction (see Ye, 2005; Loog, 2007). Here, I point out the basic flaw in this new contribution. Keywords: linear discriminant analysis, equivalence relation, linear subspaces, Bayes error

2 0.22286996 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Jieping Ye

Abstract: Loog (2007) provided a complete characterization of the family of solutions to a generalized Fisher criterion. We show that this characterization is essentially equivalent to the original characterization proposed in Ye (2005). The computational advantage of the original characterization over the new one is discussed, which justifies its practical use. Keywords: linear discriminant analysis, dimension reduction, linear transformation

3 0.14465117 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park

Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform

4 0.14313965 96 jmlr-2008-Visualizing Data using t-SNE

Author: Laurens van der Maaten, Geoffrey Hinton

Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling

5 0.14302644 57 jmlr-2008-Manifold Learning: The Price of Normalization

Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov

Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap

6 0.14207189 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

7 0.13964069 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

8 0.13933004 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss

9 0.13714179 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

10 0.13606355 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning

11 0.13602491 58 jmlr-2008-Max-margin Classification of Data with Absent Features

12 0.13459753 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

13 0.13443768 9 jmlr-2008-Active Learning by Spherical Subdivision

14 0.13380744 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks

15 0.13379124 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

16 0.1330393 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

17 0.13159142 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers

18 0.13117532 56 jmlr-2008-Magic Moments for Structured Output Prediction

19 0.13114339 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

20 0.13017198 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction