jmlr jmlr2007 jmlr2007-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA. Editor: Saharon Rosset. Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. [sent-9, score-0.092]
2 This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. [sent-10, score-0.117]
3 Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. [sent-12, score-0.119]
4 A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. [sent-13, score-0.186]
5 A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. [sent-14, score-0.27]
6 Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. [sent-15, score-0.286]
7 Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis 1. [sent-16, score-0.283]
Introduction. A standard approach to supervised classification problems is quadratic discriminant analysis (QDA), which models the likelihood of each class as a Gaussian distribution, then uses the posterior distributions to estimate the class for a given test point (Hastie et al. [sent-17, score-0.128]
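As a point of reference for what follows, here is a minimal sketch of plain maximum-likelihood QDA as described above; the function and variable names are illustrative, not from the paper, and the covariance inversion is exactly the step that becomes ill-posed when the class sample size is small relative to d.

```python
import numpy as np

def fit_ml_qda(X, y):
    """Per-class mean, ML covariance, and empirical class prior."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False, bias=True)   # ML estimate (divides by n_h)
        params[c] = (mu, Sigma, len(Xc) / len(X))
    return params

def predict_ml_qda(params, x):
    """Assign x to the class with the largest Gaussian log-posterior."""
    scores = {}
    for c, (mu, Sigma, prior) in params.items():
        sign, logdet = np.linalg.slogdet(Sigma)       # breaks down if Sigma is singular
        diff = x - mu
        scores[c] = -0.5 * (logdet + diff @ np.linalg.solve(Sigma, diff)) + np.log(prior)
    return max(scores, key=scores.get)
```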
9 Unfortunately, when the number of training samples n is small compared to the number of dimensions of each training sample d, the ML covariance estimation can be ill-posed. [sent-21, score-0.103]
One approach to resolve the ill-posed estimation is to regularize the covariance estimation; another approach is to use Bayesian estimation. [sent-24, score-0.103]
11 In preliminary experiments we found that the performance of Bayesian QDA classifiers is very sensitive to the choice of prior (Srivastava and Gupta, 2006), and that priors suggested by Geisser (1964) and Keehn (1965) produce error rates similar to those yielded by ML. [sent-28, score-0.066]
12 BDA7 differs from previous Bayesian QDA methods in that the prior is selected by crossvalidation from a set of data-dependent priors. [sent-31, score-0.077]
13 Each data-dependent prior captures some coarse information from the training data. [sent-32, score-0.068]
14 We show that the distribution-based Bayesian discriminant analysis classifier has roughly the same closed-form as the parameter-based Bayesian discriminant analysis classifier, but has a different degree of freedom. [sent-41, score-0.156]
15 In Section 5 we establish that the Bayesian minimum expected misclassification cost estimate is equivalent to a plug-in estimate using the Bayesian minimum expected Bregman divergence estimate for each class conditional. [sent-47, score-0.095]
16 Geisser’s work used a noninformative prior distribution to calculate the posterior odds that a test sample belongs to a particular class. [sent-56, score-0.074]
17 Keehn’s work assumed that the prior distribution of the covariance matrix is an inverse Wishart distribution. [sent-57, score-0.159]
Recent work by the authors showed that Bayesian QDA using the priors suggested by Geisser and Keehn performs similarly to ML QDA on six standard QDA simulations (Srivastava and Gupta, 2006). [sent-59, score-0.071]
19 The inverse Wishart prior is a conjugate prior for the covariance matrix, and it requires the specification of a “seed” positive definite matrix and a scalar degree of freedom. [sent-60, score-0.205]
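To make the two hyperparameters of such a prior concrete, the sketch below draws covariance matrices from an inverse Wishart distribution with a given seed matrix and degree of freedom using SciPy; mapping the paper's (q, B_h) onto SciPy's (df, scale) parameterization is our assumption, not something stated in the text.

```python
import numpy as np
from scipy.stats import invwishart

d = 5                # feature dimension
q = d + 2            # degree of freedom (SciPy requires df >= d)
B = np.eye(d)        # "seed" positive definite matrix

# Draw a few covariance matrices from the prior; larger q concentrates the
# draws more tightly around the scale set by the seed matrix.
for Sigma in invwishart.rvs(df=q, scale=B, size=3):
    print(np.round(np.diag(Sigma), 3))
```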
20 Friedman’s comparisons to ML QDA and ML linear discriminant analysis on six simulations showed that RDA could deal effectively with ill-posed covariance estimation when the true covariance matrix is diagonal. [sent-69, score-0.314]
21 RDA is perhaps the most popular approach to discriminant analysis when the covariance estimation is expected to be ill-posed (Hastie et al. [sent-70, score-0.179]
22 Another restricted model used to regularize covariance matrix estimation is a banded covariance matrix (Bickel and Li, 2006). [sent-74, score-0.223]
23 RDA and the Hoffbeck-Landgrebe classifiers linearly combine different covariance estimates. [sent-75, score-0.074]
24 A related approach is to select the best covariance model out of a set of models using crossvalidation. [sent-76, score-0.074]
25 Bensmail and Celeux (1996) propose eigenvalue decomposition discriminant analysis (EDDA), in which fourteen different models for the class covariances are considered, ranging from a scalar times the identity matrix to a full class covariance estimate for each class. [sent-77, score-0.22]
26 3 Other Approaches to Quadratic Discriminant Analysis Other approaches have been developed for ill-posed quadratic discriminant analysis. [sent-83, score-0.103]
27 Friedman (1989) notes that, beginning with work by James and Stein in 1961, researchers have attempted to improve the eigenvalues of the sample covariance matrix. [sent-84, score-0.074]
28 One of the most recent algorithms of this type is orthogonal linear discriminant analysis (Ye, 2005), which was shown by the author of that work to perform similarly to Friedman’s regularized linear discriminant analysis on six real data sets. [sent-86, score-0.2]
Straightforward integration yields an estimate of the class prior, $E_{\Theta}[\Theta_h] = \frac{n_h + 1}{n + G}$; this Bayesian estimate for the multinomial is also known as the Laplace correction (Jaynes and Bretthorst, 2003). [sent-125, score-0.8]
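A one-line illustration of the Laplace-corrected class-prior estimate $(n_h + 1)/(n + G)$; the helper name is ours.

```python
import numpy as np

def laplace_class_priors(y, num_classes):
    """Bayesian (Laplace-corrected) class-prior estimate: (n_h + 1) / (n + G)."""
    counts = np.bincount(y, minlength=num_classes)
    return (counts + 1) / (len(y) + num_classes)

# Three classes, six samples, class 2 unseen: it still receives nonzero mass.
print(laplace_class_priors(np.array([0, 0, 1, 1, 1, 0]), 3))   # [0.444 0.444 0.111]
```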
2), and $\ell(N_h, T_h)$ is the likelihood of the data $T_h$ given $N_h$, that is,
$$\ell(N_h(\mu_h, \Sigma_h), T_h) = \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1} S_h - \tfrac{n_h}{2}\operatorname{tr}\Sigma_h^{-1}(\mu_h - \bar{X}_h)(\mu_h - \bar{X}_h)^T\big]}{(2\pi)^{\frac{d n_h}{2}}\,|\Sigma_h|^{\frac{n_h}{2}}}.$$ [sent-136, score-1.66]
31 A common interpretation of Bayesian analysis is that the prior represents information that one has prior to seeing the data (Jaynes and Bretthorst, 2003). [sent-139, score-0.092]
To meet these goals, we use as a prior
$$p(N_h) = p(\mu_h)\,p(\Sigma_h) = \gamma_0\, \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1} B_h\big]}{|\Sigma_h|^{\frac{q}{2}}}, \qquad (10)$$
where $B_h$ is a positive definite matrix (further specified in (25)) and $\gamma_0$ is a normalization constant. [sent-142, score-0.112]
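For reference, the log of the prior in (10) is, up to the constant $\gamma_0$, $-\tfrac{1}{2}\operatorname{tr}(\Sigma_h^{-1}B_h) - \tfrac{q}{2}\log|\Sigma_h|$; a minimal sketch (the normalization constant is omitted because it does not affect the classifier):

```python
import numpy as np

def log_prior_unnormalized(Sigma, B, q):
    """log p(Sigma) up to an additive constant, for the prior in Eq. (10)."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * np.trace(np.linalg.solve(Sigma, B)) - 0.5 * q * logdet
```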
33 The prior (10) is equivalent to a noninformative prior for the mean µ, and an inverted Wishart prior with q degrees of freedom over Σ. [sent-143, score-0.188]
34 However, the tails of this prior are sufficiently heavy that the prior does not hinder convergence to the true generating distribution as the number of training samples increases if the true generating distribution is normal. [sent-147, score-0.137]
The positive definite matrix $B_h$ specifies the location of the maximum of the prior probability distribution, which occurs at $\frac{B_h}{q}$. (11) [sent-148, score-0.069]
Because this prior is unimodal, a rough interpretation of its action is that it regularizes the likelihood covariance estimate towards the maximum of the prior, given in (11). [sent-152, score-0.133]
37 , 1999), the prior seed matrix Bh = kI, where k is a scalar determined by crossvalidation. [sent-155, score-0.106]
Setting $B_h = kI$ is reminiscent of Friedman’s RDA (Friedman, 1989), where the covariance estimate is regularized by the trace: $\frac{\operatorname{tr}(\hat{\Sigma}_{ML})}{d}\, I$. [sent-156, score-0.117]
We have shown in earlier work that setting $B_h = \frac{\operatorname{tr}(\hat{\Sigma}_{ML})}{d}\, I$ for a distribution-based discriminant analysis outperforms Geisser’s or Keehn’s parameter-based Bayesian discriminant methods, and does not require crossvalidation (Srivastava and Gupta, 2006). [sent-157, score-0.187]
40 The trace of the ML covariance estimate is stable, and provides coarse information about the scale of the data samples. [sent-158, score-0.109]
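The trace-based seed $B_h = \frac{\operatorname{tr}(\hat{\Sigma}_{ML})}{d} I$ used in that earlier work is straightforward to compute; a minimal sketch:

```python
import numpy as np

def trace_seed(Sigma_ml):
    """Seed matrix (tr(Sigma_ML) / d) * I, capturing only the overall data scale."""
    d = Sigma_ml.shape[0]
    return (np.trace(Sigma_ml) / d) * np.eye(d)
```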
Theorem 1: The classifier (5) using the prior (10) can be written as
$$\hat{Y} = \operatorname*{argmin}_{g=1,\ldots,G} \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q+1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}{(n_h+1)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) |A_h|^{\frac{n_h+q+1}{2}}}\; \hat{P}(Y=h), \qquad (12)$$
where $\hat{P}(Y = h)$ is an estimate of the class prior probability for class h, $\Gamma(\cdot)$ is the standard gamma function, and
$$A_h = S_h + \frac{n_h}{n_h+1}\,(x - \bar{x}_h)(x - \bar{x}_h)^T + B_h. \qquad (13)$$ [sent-166, score-0.067] [sent-169, score-4.12]
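For numerical stability, the class-h term in (12) is best evaluated in log space with log-gamma functions and log-determinants. The sketch below follows the reconstruction of (12)–(13) given above and should be read as an approximation of the rule, not a verified reimplementation; all names are ours.

```python
import numpy as np
from scipy.special import gammaln

def bda_log_score(x, xbar, S, B, n, q, log_class_prior):
    """Log of the class-h factor in Eq. (12); larger means the class is more likely."""
    d = len(x)
    diff = (x - xbar).reshape(-1, 1)
    A = S + (n / (n + 1.0)) * (diff @ diff.T) + B                  # Eq. (13)
    _, logdet_SB = np.linalg.slogdet(0.5 * (S + B))
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * d * (np.log(n) - np.log(n + 1.0))
            + gammaln(0.5 * (n + q + 1)) - gammaln(0.5 * (n + q - d + 1))
            + 0.5 * (n + q) * logdet_SB
            - 0.5 * (n + q + 1) * logdet_A
            + log_class_prior)
```

With a 0–1 misclassification cost, minimizing (12) over g reduces to assigning x to the class h with the largest value of this log score.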
The parameter-based Bayesian QDA class label using the prior given in (10) is
$$\hat{Y} = \operatorname*{argmin}_{g=1,\ldots,G} \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d-1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q-d-2}{2}}}{(n_h+1)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-2d-1}{2}\right) |A_h|^{\frac{n_h+q-d-1}{2}}}\; \hat{P}(Y=h). \qquad (14)$$ [sent-171, score-0.067] [sent-174, score-3.148]
The parameter-based Bayesian discriminant result (14) will not hold if $n_h \le 2d - q + 1$, while the distribution-based result (12) holds for any $n_h > 0$ and any d. [sent-176, score-1.668]
We compared the noninformative prior originally proposed by Geisser (1964), and the modified inverse Wishart prior given in Equation (10) with $B_h = \frac{\operatorname{tr}(\hat{\Sigma}_{ML,h})}{d}\, I$. [sent-178, score-0.167]
Results on six simulations (two-class versions of Friedman’s original simulations) showed that, given the same prior, the distribution-based classifier performed better than the parameter-based classifier when the ratio of feature dimensions to the number of training samples is high. [sent-180, score-0.066]
48 1), chooses the degree of freedom for the modified inverse Wishart prior by cross-validation. [sent-186, score-0.08]
49 In Section 6, we propose a new Bayesian QDA classifier called BDA7 that cross-validates the degree of freedom and a data-dependent seed matrix Bh for the prior. [sent-188, score-0.078]
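The cross-validation implied here can be organized as a simple grid search over the degree of freedom and the candidate seed constructions; everything below (names, the grid) is illustrative scaffolding rather than the paper's procedure.

```python
def cross_validate_bda(folds, q_grid, seed_builders, error_fn):
    """Pick the (q, seed builder) pair with the lowest average validation error.

    folds: list of (train, validation) splits.
    seed_builders: callables mapping training data to a seed matrix B_h per class.
    error_fn: callable(train, val, q, seed_builder) -> validation error rate.
    """
    best = None
    for q in q_grid:
        for build_seed in seed_builders:
            err = sum(error_fn(tr, va, q, build_seed) for tr, va in folds) / len(folds)
            if best is None or err < best[0]:
                best = (err, q, build_seed)
    return best   # (error, q, seed builder)
```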
Relationship Between Regularized QDA and Bayesian QDA. In this section we show that Friedman’s regularized form for the covariance matrix (Friedman, 1989) emerges from the Bayesian QDA formula. [sent-191, score-0.127]
Then (15) becomes
$$E_{N_h}[N_h] \approx \frac{\Gamma\!\left(\frac{n_h+q+1}{2}\right)}{\left(\frac{n_h+1}{n_h}\pi\right)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) |D_h|^{\frac{1}{2}}}\, \exp\!\left[-\frac{n_h+q+1}{2}\,\frac{n_h}{n_h+1}\, Z_h^T D_h^{-1} Z_h\right] \qquad (16)$$
$$= \frac{\Gamma\!\left(\frac{n_h+q+1}{2}\right)}{\pi^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) \left|\frac{n_h+1}{n_h} D_h\right|^{\frac{1}{2}}}\, \exp\!\left[-\frac{1}{2}\, Z_h^T \left(\frac{n_h+1}{n_h+q+1}\,\frac{D_h}{n_h}\right)^{-1} Z_h\right]. \qquad (17)$$ [sent-201, score-10.231]
The approximation (16) resembles a Gaussian distribution, where $\tilde{\Sigma}_h$ plays the role of the covariance matrix. [sent-203, score-1.648]
Rewrite (17):
$$\tilde{\Sigma}_h = \frac{S_h + B_h}{n_h}\,\frac{n_h+1}{n_h+q+1} = \frac{n_h+1}{n_h+q+1}\,\frac{S_h}{n_h} + \frac{n_h+1}{n_h}\,\frac{B_h}{n_h+q+1}. \qquad (18)$$ [sent-204, score-4.722]
In (18), make the approximation $\frac{n_h+1}{n_h} \approx 1$, then multiply and divide the second term of (18) by q:
$$\tilde{\Sigma}_h \approx \frac{n_h+1}{n_h+q+1}\,\frac{S_h}{n_h} + \frac{q}{n_h+q+1}\,\frac{B_h}{q}. \qquad (19)$$ [sent-205, score-10.231]
The right-hand side of (19) is a convex combination of the sample covariance and the positive definite matrix $B_h$. [sent-206, score-0.097]
Here, the fraction $\frac{q}{n_h+q+1}$ controls the shrinkage of the sample covariance matrix toward the positive definite matrix $B_h$; recall from (11) that $\frac{B_h}{q}$ is the maximum of the prior probability distribution. [sent-208, score-0.953]
57 Equation (19) also gives information about how the Bayesian shrinkage depends on the number of sample points from each class: fewer training samples nh results in greater shrinkage towards the positive definite matrix Bh . [sent-209, score-0.857]
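Written out, the shrinkage estimate in (19) is a two-line function; the exact weights below come from our reconstruction of (19), and they sum to one, so the result is a convex combination of the sample covariance $S_h/n_h$ and the prior maximum $B_h/q$.

```python
import numpy as np

def shrunk_covariance(S, B, n, q):
    """Approximate Bayesian covariance of Eq. (19)."""
    w = (n + 1.0) / (n + q + 1.0)        # weight on the sample covariance S / n
    return w * (S / n) + (1.0 - w) * (B / q)
```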
The diagonal of $\hat{\Sigma}_{ML}$ has been used for regularization in regularized discriminant analysis (Hoffbeck and Landgrebe, 1996) and in model-based discriminant analysis (Bensmail and Celeux, 1996). [sent-261, score-0.215]
Note that setting $B_h = \operatorname{diag}\hat{\Sigma}_{ML}$ places the maximum of the prior at $\frac{1}{q}\operatorname{diag}\hat{\Sigma}_{ML}$. [sent-262, score-0.112]
We hypothesize that in some cases it may be more effective to place the maximum of the prior at $\operatorname{diag}\hat{\Sigma}_{ML}$; that requires setting $B_h = q\operatorname{diag}\hat{\Sigma}_{ML}$. [sent-263, score-0.079]
We also consider moving the maximum of the prior closer to the zero matrix, effectively turning the prior into an exponential prior rather than a unimodal one. [sent-264, score-0.171]
To this end, we also consider setting the prior seed matrix to be $B_h = \frac{1}{q}\operatorname{diag}\hat{\Sigma}_{ML}$. [sent-268, score-0.139]
Thus, for BDA7 the $B_h$ is selected by crossvalidation from seven options:
$$B_h \in \left\{\, q\operatorname{diag}\hat{\Sigma}_{\text{(pooled ML)}},\; q\operatorname{diag}\hat{\Sigma}_{\text{(class ML,}h\text{)}},\; \tfrac{1}{q}\operatorname{diag}\hat{\Sigma}_{\text{(pooled ML)}},\; \tfrac{1}{q}\operatorname{diag}\hat{\Sigma}_{\text{(class ML,}h\text{)}},\; \operatorname{diag}\hat{\Sigma}_{\text{(pooled ML)}},\; \operatorname{diag}\hat{\Sigma}_{\text{(class ML,}h\text{)}},\; \tfrac{1}{q}\operatorname{tr}\big(\hat{\Sigma}_{\text{(pooled ML)}}\big)\, I \,\right\}. \qquad (25)$$ [sent-270, score-0.229]
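The seven candidates in (25) can be generated mechanically from the pooled and per-class ML covariance estimates. The sketch follows the reconstruction above; in particular the 1/q scaling of the trace option is our reading of the garbled source, so treat the exact scalings as provisional.

```python
import numpy as np

def candidate_seeds(Sigma_pooled, Sigma_class, q):
    """Seven candidate seed matrices B_h for BDA7 cross-validation, Eq. (25)."""
    d = Sigma_pooled.shape[0]
    dp = np.diag(np.diag(Sigma_pooled))   # diag of pooled ML covariance
    dc = np.diag(np.diag(Sigma_class))    # diag of class-h ML covariance
    return [q * dp, q * dc, dp / q, dc / q, dp, dc,
            (np.trace(Sigma_pooled) / q) * np.eye(d)]
```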
64 , 1999), RDA (Friedman, 1989), eigenvalue decomposition discriminant analysis (EDDA) (Bensmail and Celeux, 1996), and maximum-likelihood estimated QDA, LDA, and the nearest-means classifier (NM). [sent-282, score-0.092]
65 QB uses a normal prior for each mean vector, and an inverse Wishart distribution prior for each covariance matrix. [sent-295, score-0.198]
66 There are two free parameters to the inverse Wishart distribution: the scale parameter q ≥ d and the seed matrix Bh . [sent-296, score-0.076]
This causes numerical difficulties in computing the maximum likelihood covariance estimates. [sent-317, score-0.074]
BDA7 chooses the identity prior seed matrix $B_h = \frac{1}{q}\operatorname{tr}\big(\hat{\Sigma}_{\text{(pooled ML)}}\big)\, I$ every time for this data set. [sent-318, score-0.124]
Given that this is BDA7’s choice, BDA7 is at a disadvantage compared to QB, which cross-validates a scaled identity seed matrix kI. [sent-319, score-0.078]
70 Six of the simulations were originally created by Friedman and used to show that RDA performs favorably compared to ML linear discriminant analysis (LDA) and ML QDA (Friedman, 1989). [sent-469, score-0.115]
71 Friedman’s six simulations all have diagonal generating class covariance matrices, corresponding to independent classification features. [sent-470, score-0.154]
72 When the true generating distributions are drawn from full covariance matrices (Cases 7–10), EDDA performs well when there are only 10 features, but the maximum likelihood estimates used in EDDA fail for 50 and 100 feature dimensions. [sent-481, score-0.103]
1 Case 1: Equal Spherical Covariance Matrices. Each class conditional distribution is normal with identity covariance matrix I. [sent-488, score-0.115]
2 Case 2: Unequal Spherical Covariance Matrices. The class one conditional distribution is normal with identity covariance matrix I and mean at the origin. [sent-497, score-0.131]
75 The class two conditional distribution is normal with covariance matrix 2I and has zero mean except the first component of its mean is 3. [sent-498, score-0.129]
76 The class three conditional distribution is normal with covariance matrix 3I and has zero mean except the last component of its mean is 4. [sent-499, score-0.129]
The eigenvalues of the common covariance matrix are given by
$$e_i = \left(\frac{9(i-1)}{d-1} + 1\right)^2, \quad 1 \le i \le d, \qquad (26)$$
so the ratio of the largest to smallest eigenvalue is 100. [sent-504, score-0.111]
4 Cases 5 and 6: Unequal Highly Ellipsoidal Covariance Matrices. For these cases, the covariance matrices are highly ellipsoidal and different for each class. [sent-727, score-0.088]
The eigenvalues of the class one covariance are given by Equation (26), and those of class two are given by
$$e_{2i} = \left(\frac{9(d-i)}{d-1} + 1\right)^2, \quad 1 \le i \le d.$$ [sent-728, score-1.66]
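The diagonal covariances for these simulation cases follow directly from the eigenvalue formulas above; a minimal sketch (the d − 1 denominator for class two is assumed by symmetry with (26)):

```python
import numpy as np

def ellipsoidal_eigenvalues(d, reverse=False):
    """Eigenvalues (9*(i-1)/(d-1) + 1)^2; the reversed ordering mirrors class two."""
    i = np.arange(1, d + 1)
    idx = (d - i) if reverse else (i - 1)
    return (9.0 * idx / (d - 1) + 1.0) ** 2

d = 10
e1 = ellipsoidal_eigenvalues(d)                # class one (and the common Case 3 matrix)
e2 = ellipsoidal_eigenvalues(d, reverse=True)  # class two
Sigma1, Sigma2 = np.diag(e1), np.diag(e2)
print(e1.max() / e1.min())                     # 100.0, the stated eigenvalue ratio
```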
80 Case 5 and Case 6 present more information to the classifiers than the previous cases because the covariance matrices are substantially different. [sent-734, score-0.088]
Then let the class one covariance matrix be $R_1^T R_1$. [sent-739, score-0.097]
Similarly, let the class two and class three covariance matrices be $R_2^T R_2$ and $R_3^T R_3$, where $R_2$ and $R_3$ are constructed in the same manner as $R_1$. [sent-740, score-0.088]
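These full-covariance classes can be generated as stated: fill a square matrix $R_i$ with random entries and take $R_i^T R_i$ as the class covariance. A sketch, assuming uniform(0, 1) entries since the entry distribution is not specified in the lines shown here:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_full_covariance(d, rng):
    """Class covariance R^T R from a random square matrix R (entry law assumed uniform)."""
    R = rng.uniform(size=(d, d))
    return R.T @ R

d = 10
Sigma1, Sigma2, Sigma3 = (random_full_covariance(d, rng) for _ in range(3))
```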
For most runs of this simulation, the best EDDA model is the full covariance model, but because EDDA uses ML estimation, its estimation of the full covariance model is ill-conditioned. [sent-745, score-0.162]
Then the Cases 9 and 10 covariance matrices are $R_i^T R_i\, R_i^T R_i$ for i = 1, 2, 3. [sent-752, score-0.088]
These covariance matrices are highly ellipsoidal, often with one strong eigenvalue and many relatively small eigenvalues. [sent-753, score-0.102]
A key aspect of BDA7 is that the seed matrix in the inverse Wishart prior uses a coarse estimate of the covariance matrix to peg the maximum of the prior to a relevant part of the distribution space. [sent-763, score-0.3]
Comparisons on simulations show that RDA and EDDA perform well when the true Gaussian distribution matches one of their regularization covariance models (e.g., diagonal, identity), but they can fail when the generating distribution has a full covariance matrix, particularly when features are correlated. [sent-765, score-0.111] [sent-767, score-0.089]
89 In contrast, the Bayesian methods BDA7 and QB can learn from the rich differentiating information offered by full covariance matrices. [sent-768, score-0.074]
90 We hypothesize that better priors exist, and that such priors will also be data-dependent and make use of a coarse estimate of the covariance matrix for the prior. [sent-769, score-0.172]
Substitute $\ell(N_h, T_h)$ from (9) and $p(N_h)$ from (10):
$$\alpha_h = \int_{\Sigma_h}\!\int_{\mu_h} \frac{\exp\!\big[-\tfrac{n_h}{2}\operatorname{tr}\Sigma_h^{-1}(\mu_h - \bar{x}_h)(\mu_h - \bar{x}_h)^T\big]}{(2\pi)^{\frac{n_h d}{2}}\, |\Sigma_h|^{\frac{n_h+q}{2}}}\; \exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(S_h + B_h)\big]\; \frac{d\Sigma_h\, d\mu_h}{|\Sigma_h|^{\frac{d+2}{2}}}.$$ [sent-780, score-2.573]
Integrate with respect to $\mu_h$ using identity (27):
$$\alpha_h = \left(\frac{2\pi}{n_h}\right)^{\frac{d}{2}} \frac{1}{(2\pi)^{\frac{n_h d}{2}}} \int_{\Sigma_h} \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(S_h + B_h)\big]}{|\Sigma_h|^{\frac{n_h+q+d+1}{2}}}\, d\Sigma_h.$$ [sent-781, score-2.422]
Next, integrate with respect to $\Sigma_h$ using identity (28):
$$\alpha_h = \frac{1}{(2\pi)^{\frac{n_h d}{2}}} \left(\frac{2\pi}{n_h}\right)^{\frac{d}{2}} \frac{\Gamma_d\!\left(\frac{n_h+q}{2}\right)}{\left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}.$$ [sent-782, score-3.179]
$$\int_M N_h(x)\, f(N_h)\, dM_h = \frac{1}{\alpha_h\, (2\pi)^{\frac{n_h d}{2}}} \int_{\Sigma_h}\!\int_{\mu_h} \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(x-\mu_h)(x-\mu_h)^T\big]}{(2\pi)^{\frac{d}{2}}\, |\Sigma_h|^{\frac{1}{2}}}\; \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(S_h+B_h)\big]}{|\Sigma_h|^{\frac{n_h+q}{2}}}\; \exp\!\Big[-\tfrac{n_h}{2}\operatorname{tr}\Sigma_h^{-1}(\mu_h-\bar{x}_h)(\mu_h-\bar{x}_h)^T\Big]\; \frac{d\mu_h\, d\Sigma_h}{|\Sigma_h|^{\frac{d+2}{2}}}.$$ [sent-784, score-2.616]
Integrate with respect to $\mu_h$ and $\Sigma_h$ using identities (27) and (28), and equation (13), to yield
$$E_{N_h}[N_h(x)] = \frac{1}{\alpha_h}\, \frac{\Gamma_d\!\left(\frac{n_h+q+1}{2}\right)}{(2\pi)^{\frac{d n_h}{2}}\, (n_h+1)^{\frac{d}{2}}\, |A_h|^{\frac{n_h+q+1}{2}}},$$
where $A_h$ is given by Equation (13). [sent-785, score-1.574]
After substituting the value of $\alpha_h$ from (30),
$$E_{N_h}[N_h(x)] = \frac{n_h^{\frac{d}{2}}\, \Gamma_d\!\left(\frac{n_h+q+1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}{(2\pi)^{\frac{d}{2}}\, (n_h+1)^{\frac{d}{2}}\, \Gamma_d\!\left(\frac{n_h+q}{2}\right) |A_h|^{\frac{n_h+q+1}{2}}}.$$ [sent-786, score-3.148]
Simplify the multivariate gamma function $\Gamma_d$ using (29):
$$E_{N_h}[N_h(x)] = \frac{n_h^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q+1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}{(2\pi)^{\frac{d}{2}}\, (n_h+1)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) |A_h|^{\frac{n_h+q+1}{2}}}.$$ [sent-787, score-3.935]
Set $a = \int_M (N_h - \hat{f}\,)\, r(N_h)\, dM$ in (39) and use the assumption that the functional $\delta^2\phi[f;\cdot,\cdot]$ is strongly positive, which implies that the above functional can be zero if and only if $a = 0$, that is,
$$0 = \int_M (N_h - \hat{f}\,)\, r(N_h)\, dM, \qquad (40)$$
and hence $\hat{f} = E_{N_h}[N_h]$. [sent-802, score-0.078]
99 Empirical Bayes estimation of the multivariate normal covariance matrix. [sent-938, score-0.088]
Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. [sent-1011, score-0.078]
wordName wordTfidf (topN-words)
[('nh', 0.787), ('qda', 0.386), ('bh', 0.185), ('rda', 0.141), ('qb', 0.125), ('edda', 0.119), ('bregman', 0.097), ('ml', 0.097), ('zh', 0.084), ('wishart', 0.079), ('discriminant', 0.078), ('covariance', 0.074), ('enh', 0.073), ('rigyik', 0.073), ('rivastava', 0.073), ('uadratic', 0.073), ('upta', 0.073), ('bayesian', 0.073), ('dm', 0.067), ('xh', 0.063), ('srivastava', 0.057), ('sh', 0.057), ('gupta', 0.051), ('iscriminant', 0.051), ('geisser', 0.047), ('prior', 0.046), ('nalysis', 0.044), ('dh', 0.043), ('tr', 0.043), ('seed', 0.037), ('simulations', 0.037), ('bensmail', 0.036), ('pooled', 0.036), ('friedman', 0.035), ('functional', 0.033), ('diag', 0.033), ('crossvalidation', 0.031), ('keehn', 0.031), ('regularized', 0.03), ('divergence', 0.03), ('lda', 0.03), ('celeux', 0.026), ('ovariance', 0.026), ('er', 0.026), ('ah', 0.025), ('quadratic', 0.025), ('matrix', 0.023), ('coarse', 0.022), ('argmin', 0.021), ('ases', 0.021), ('hoffbeck', 0.021), ('nequal', 0.021), ('nm', 0.02), ('priors', 0.02), ('gaussian', 0.019), ('brown', 0.018), ('ionosphere', 0.018), ('identity', 0.018), ('atrices', 0.018), ('freedom', 0.018), ('pen', 0.016), ('simulation', 0.016), ('shrinkage', 0.016), ('crossvalidated', 0.016), ('extremum', 0.016), ('landgrebe', 0.016), ('llipsoidal', 0.016), ('noninformative', 0.016), ('parameterbased', 0.016), ('inverse', 0.016), ('mean', 0.016), ('misclassi', 0.015), ('differential', 0.015), ('generating', 0.015), ('regularize', 0.015), ('samples', 0.015), ('csisz', 0.014), ('pima', 0.014), ('thyroid', 0.014), ('rd', 0.014), ('diagonal', 0.014), ('th', 0.014), ('classi', 0.014), ('matrices', 0.014), ('eigenvalue', 0.014), ('six', 0.014), ('estimation', 0.014), ('benchmark', 0.013), ('ighly', 0.013), ('numbered', 0.013), ('expected', 0.013), ('estimate', 0.013), ('banerjee', 0.013), ('integrate', 0.013), ('sonar', 0.013), ('posterior', 0.012), ('strongly', 0.012), ('wine', 0.012), ('santosh', 0.012), ('termed', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
2 0.068685189 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation—every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a varic 2007 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha. BANERJEE , D HILLON , G HOSH , M ERUGU AND M ODHA ety of problem domains and also describe novel co-clustering applications such as missing value prediction and
3 0.039001867 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
Author: Marco Loog
Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization 1. Classical and New Criteria Given N feature vectors of dimensionality n, a linear reduction of dimensionality, based on classical Fisher LDA, determines an n × d transformation matrix L that, for a given d < K, K the number of classes, maximizes the so-called Fisher criterion: F(A) = tr((A t SW A)−1 (At SB A)) or, equivalently, F0 (A) = tr((At ST A)−1 (At SB A)). Here, SB := ∑K pi (mi − m)(mi − m)t , SW := ∑K pi Si , and i=1 i=1 ST = SB + SW . The matrices SB , SW , and ST are the so-called between-class, pooled within-class, and total covariance matrices. In addition, mi is the mean vector of class i, pi is the prior of class i, and the overall mean m equals ∑k pi mi . Finally, Si is the covariance matrix of class i. i=1 A solution to these optimization problems can be obtained by means of a generalized eigenvalue decomposition, which Fukunaga (1990) relates to a simultaneous diagonalization of the two matrices involved (see also Campbell and Atchley, 1981). More common is it to apply a standard −1 eigenvalue decomposition to S−1 SB (or SW SB ), resulting in an equivalent set of eigenvectors. The d T columns of the optimal solution L are simply taken to equal the d eigenvectors corresponding to the d largest eigenvalues. It is known that this solution is not unique and the full class can be obtained by multiplying L to the right with nonsingular d × d matrices (see Fukunaga, 1990). Clearly, if the total covariance ST is singular, neither the generalized nor the standard eigenvalue decomposition can be readily employed. Directly or indirectly, the problem is that the matrix inverse S−1 does not exist, which is the typical situation when dealing with small samples. In an attempt to T overcome this problem, Ye (2005) introduced a different criterion that is defined as F1 (A) = tr((At ST A)+ (At SB A)) , ∗. Also at Nordic Bioscience Imaging, Hovegade 207, DK-2730 Herlev, Denmark. c 2007 Marco Loog. (1) L OOG where + denotes taking the Moore-Penrose generalized inverse of a matrix. Like for F0 , an optimal transform L is one that maximizes the objective F1 . Again, this solution is not unique. 2. Correct Characterization For the full characterization of the set of solutions to Equation (1), initially the problem is looked at from a geometrical point of view (cf., Campbell and Atchley, 1981). It is assumed that the number of samples N is smaller than or equal to the feature dimensionality n. In the undersampled case, it is clear that all data variation occurs in an N − 1-dimensional subspace of the original space. To start with, a PCA of the data is carried out and the first N − 1 principal components are rotated to the first N − 1 axes of the n-dimensional space by means of a rotation matrix R. This matrix consists of all (normalized) eigenvectors of ST taken as its columns. After this rotation, new total and between-class covariance matrices, ST = Rt ST R and SB = Rt SB R, are obtained. 
These 0 0 matrices can be partitioned as follows: ST = Σ0T 0 and SB = ΣB 0 , where ΣT and ΣB are N − 1 × 0 N − 1 covariance matrices and ΣT is nonsingular and diagonal by construction. The between-class variation is obviously restricted to the N − 1-dimensional subspace in which the total data variation takes place, therefore a same partitioning of SB is possible. However, the covariance submatrix ΣB is not necessarily diagonal, neither does it have to be nonsingular. Basically, the PCA-based rotation R converts the initial problem into a more convenient one, splitting up the original space in an N − 1-dimensional one in which “everything interesting” takes place and a remaining n − N + 1dimensional subspace in which “nothing happens at all”. Now consider F1 in this transformed space and take a general n × d transformation matrix A, which is partitioned in a way similar to the covariance matrices, that is, X . Y A= (2) Here, X is an N − 1 × d-matrix and Y is of size n − N + 1 × d. Taking this definition, the following holds (cf., Ye, 2005): t + t F1 (A) = tr((A ST A) (A SB A)) = tr =tr X t ΣT X 0 0 0 + X Y X t ΣB X 0 0 0 t ΣT 0 = tr 0 0 X Y + (Xt ΣT X)−1 0 0 0 X Y t ΣB 0 0 0 X Y X t ΣB X 0 0 0 = tr((Xt ΣT X)−1 (Xt ΣB X)) = F0 (X) . From this it is immediate that a matrix A maximizes F1 if and only if the submatrix X maximizes the original Fisher criterion in the lower-dimensional subspace. Moreover, if L is such a matrix maximizing F1 in the PCA-transformed space, it is easy to check that R−1 L = Rt L provides a solution to the original, general problem that has not been preprocessed by means of a PCA (see also Fukunaga, 1990). A characterization of the complete family of solutions can now be given. Let Λ ∈ RN−1×d be an optimal solution of F0 (X) = tr((Xt ΣT X)−1 (Xt ΣB X)). As already noted in Section 1, the full set of solutions is given by F = {ΛZ ∈ RN−1×d | Z ∈ GLd (R)}, where GLd (R) denotes the general linear group of d × d invertible matrices. The previous paragraph essentially demonstrates that if X ∈ F , A in Equation (2) maximizes F1 . The matrix Y can be chosen ad 2122 C OMPLETE C HARACTERIZATION OF A FAMILY OF S OLUTIONS libitum. Now, the latter provides the solution in the PCA-transformed space and to solve the general problem we need to take the rotation back to the original space into account. All in all, this leads to the following complete family of solutions L , maximizing F1 in the original space: L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R), Y ∈ Rn−N+1×d Y , (3) where Λ = argmaxX tr((Xt ΣT X)−1 (Xt ΣB X)) and Rt takes care of the rotation back. 3. Original Characterization Though not noted by Ye (2005), his attempt to characterize the full set of solutions of Equation (1) is based on a simultaneous diagonalization of the three covariance matrices S B , SW , and ST that is similar to the ideas described by Campbell and Atchley (1981) and Fukunaga (1990). Moreover, Golub and Van Loan (Theorem 8.7.1. 1996) can be readily applied to demonstrate that such simultaneous diagonalization is possible in the small sample setting. After the diagonalization step, partitioned between-class, pooled within-class, and total covariance matrices are considered. This partitioning is similar to the one employed in the previous section, which does not enforce matrices to be diagonal however. In the subsequent optimization step, the classical Fisher criterion is maximized basically in the appropriate subspace, comparable to the approach described above, but in a mildly more involved and concealed way. 
For this, matrices of the form Rt X are considered, consider Equations (2) and Y (3). However, Y is simply the null matrix and the family of solutions L provided is limited to L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R) . 0 Obviously, this is far from a complete characterization, especially when N − 1 n which is, for instance, typically the case for the data sets considered by Ye (2005). Generally, the utility of a dimensionality reduction criterion, without additional constrains, depends on the efficiency over the full set of solutions. As Ye (2005) only considers two very specific instances from the large class of possibilities, it is unclear to what extent the new criterion really provides an efficient way of performing a reduction of dimensionality. References N. A. Campbell and W. R. Atchley. The geometry of canonical variate analysis. Systematic Zoology, 30(3):268–280, 1981. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005. 2123
4 0.033830751 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
5 0.02760922 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
6 0.025207607 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
7 0.023231709 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
9 0.022771614 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
10 0.022678372 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
11 0.02083494 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
12 0.019751834 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.019647054 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
14 0.019594962 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
15 0.018399864 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
17 0.016170645 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
18 0.01574737 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
19 0.015334575 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
20 0.014488925 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
topicId topicWeight
[(0, 0.093), (1, 0.016), (2, 0.007), (3, 0.055), (4, -0.057), (5, -0.122), (6, -0.041), (7, -0.005), (8, -0.015), (9, -0.019), (10, -0.06), (11, -0.013), (12, -0.021), (13, -0.138), (14, -0.115), (15, 0.063), (16, -0.001), (17, 0.007), (18, 0.033), (19, -0.025), (20, 0.184), (21, -0.152), (22, -0.071), (23, 0.144), (24, -0.329), (25, 0.049), (26, 0.064), (27, -0.042), (28, -0.125), (29, -0.073), (30, 0.19), (31, 0.058), (32, -0.003), (33, 0.073), (34, -0.28), (35, -0.121), (36, -0.195), (37, -0.204), (38, -0.202), (39, -0.101), (40, 0.127), (41, -0.255), (42, 0.12), (43, -0.17), (44, -0.075), (45, 0.05), (46, 0.198), (47, -0.103), (48, 0.007), (49, -0.212)]
simIndex simValue paperId paperTitle
same-paper 1 0.97118211 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
2 0.38805541 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
3 0.17936112 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
Author: Kristen Grauman, Trevor Darrell
Abstract: In numerous domains it is useful to represent a single example by the set of the local features or parts that comprise it. However, this representation poses a challenge to many conventional machine learning techniques, since sets may vary in cardinality and elements lack a meaningful ordering. Kernel methods can learn complex functions, but a kernel over unordered set inputs must somehow solve for correspondences—generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function called the pyramid match that measures partial match similarity in time linear in the number of features. The pyramid match maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in order to find implicit correspondences based on the finest resolution histogram cell where a matched pair first appears. We show the pyramid match yields a Mercer kernel, and we prove bounds on its error relative to the optimal partial matching cost. We demonstrate our algorithm on both classification and regression tasks, including object recognition, 3-D human pose inference, and time of publication estimation for documents, and we show that the proposed method is accurate and significantly more efficient than current approaches. Keywords: kernel, sets of features, histogram intersection, multi-resolution histogram pyramid, approximate matching, object recognition
4 0.17095834 15 jmlr-2007-Bilinear Discriminant Component Analysis
5 0.161414 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
6 0.15008329 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
7 0.13677008 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
8 0.13331664 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
9 0.10887522 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
10 0.10417764 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
11 0.097251199 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
12 0.096219376 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
13 0.095010504 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
14 0.092966713 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
16 0.089008719 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
17 0.084886923 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
18 0.080308415 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
19 0.076524884 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
20 0.074598983 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
topicId topicWeight
[(1, 0.013), (4, 0.049), (8, 0.043), (12, 0.025), (26, 0.024), (28, 0.053), (40, 0.026), (48, 0.031), (60, 0.024), (80, 0.016), (85, 0.053), (86, 0.437), (98, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.7010166 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
Author: Onur C. Hamsici, Aleix M. Martinez
Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and this pre-processing step map the original data from $\mathbb{R}^p$ to the surface of a hypersphere $S^{p-1}$. Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead, as if the data were represented in $\mathbb{R}^p$ rather than $S^{p-1}$. This opens the question of whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case, however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision
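As a small illustration of the normalization step described in the abstract above, here is a minimal sketch that maps feature vectors from $\mathbb{R}^p$ onto the unit hypersphere $S^{p-1}$; it is not the authors' spherical-homoscedastic machinery, and the helper name to_sphere is illustrative.

```python
import numpy as np

def to_sphere(X, eps=1e-12):
    """Map each row of X from R^p onto the unit hypersphere S^{p-1}."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)   # guard against zero-norm rows

# After normalization, Euclidean and cosine comparisons coincide up to a
# monotone transformation, since ||u - v||^2 = 2 - 2 u.v for unit vectors.
X = np.random.default_rng(0).normal(size=(5, 10))
U = to_sphere(X)
print(np.allclose(np.linalg.norm(U, axis=1), 1.0))  # True
```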
3 0.23686679 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/~jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
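To make the mode-ascent idea in the abstract above concrete, here is a minimal sketch of an EM-style fixed-point iteration that climbs a Gaussian kernel density estimate from each sample point and groups points whose iterations converge to the same local maximum. It follows the general Modal EM recipe the abstract describes but is not the authors' C package; the bandwidth handling, tolerances, and function names are illustrative.

```python
import numpy as np

def ascend_to_mode(x, data, h, tol=1e-6, max_iter=500):
    """EM-style ascent of a Gaussian KDE with bandwidth h, started at x."""
    for _ in range(max_iter):
        # E-step: posterior weight of each kernel component at the current point.
        d2 = np.sum((data - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / h ** 2)
        w /= w.sum()
        # M-step: move to the weighted average of the data.
        x_new = w @ data
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clustering(data, h, merge_tol=1e-3):
    """Cluster points by the mode their ascent paths converge to."""
    modes, labels = [], []
    for x in data:
        m = ascend_to_mode(x.copy(), data, h)
        for k, existing in enumerate(modes):
            if np.linalg.norm(m - existing) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return np.array(labels), np.array(modes)
```

Repeating the procedure with increasing bandwidths would give the kind of hierarchy the abstract mentions, since larger bandwidths produce fewer modes.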
4 0.2363475 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation: every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as k-means and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and
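As a toy illustration of the partitional view described above, the following is a minimal sketch of alternating row/column reassignment under squared Euclidean loss, where each matrix entry is approximated by its co-cluster (block) mean. This is only one special case of the Bregman framework in the abstract, not the authors' general meta-algorithm; the initialization, stopping rule, and function name are illustrative.

```python
import numpy as np

def coclus_sq(X, k_rows, k_cols, n_iter=50, seed=0):
    """Alternating co-clustering that approximates X[i, j] by its block mean
    and reassigns rows/columns to reduce squared error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    r = rng.integers(k_rows, size=m)   # row-cluster labels
    c = rng.integers(k_cols, size=n)   # column-cluster labels
    for _ in range(n_iter):
        # Block means given the current labels.
        mu = np.zeros((k_rows, k_cols))
        for g in range(k_rows):
            for h in range(k_cols):
                block = X[np.ix_(r == g, c == h)]
                mu[g, h] = block.mean() if block.size else X.mean()
        # Reassign each row to the row-cluster whose block means fit it best.
        row_cost = np.stack([((X - mu[g][c]) ** 2).sum(axis=1) for g in range(k_rows)])
        r = row_cost.argmin(axis=0)
        # Reassign each column analogously.
        col_cost = np.stack([((X - mu[:, h][r][:, None]) ** 2).sum(axis=0) for h in range(k_cols)])
        c = col_cost.argmin(axis=0)
    return r, c, mu
```

Swapping the squared-error block update for another Bregman divergence and another set of preserved statistics is what the abstract's general formulation covers.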
5 0.23062143 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
Author: François Laviolette, Mario Marchand
Abstract: We propose a PAC-Bayes theorem for the sample-compression setting where each classifier is described by a compression subset of the training data and a message string of additional information. This setting, which is the appropriate one to describe many learning algorithms, strictly generalizes the usual data-independent setting where classifiers are represented only by data-independent message strings (or parameters taken from a continuous set). The proposed PAC-Bayes theorem for the sample-compression setting reduces to the PAC-Bayes theorem of Seeger (2002) and Langford (2005) when the compression subset of each classifier vanishes. For posteriors having all their weights on a single sample-compressed classifier, the general risk bound reduces to a bound similar to the tight sample-compression bound proposed in Laviolette et al. (2005). Finally, we extend our results to the case where each sample-compressed classifier of a data-dependent ensemble may abstain from predicting a class label. Keywords: PAC-Bayes, risk bounds, sample-compression, set covering machines, decision list machines
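For orientation, the data-independent bound that the abstract says its theorem reduces to can be evaluated numerically. The sketch below inverts a PAC-Bayes-kl inequality of the form kl(empirical risk || true risk) <= (KL(Q||P) + ln((n+1)/delta)) / n by bisection; the exact logarithmic term varies across statements of the Seeger/Langford result, so treat that constant as an assumption, and this is not the sample-compression bound proposed in the paper above.

```python
import math

def kl_bernoulli(q, p, eps=1e-12):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_kl_bound(emp_risk, kl_qp, n, delta=0.05, tol=1e-9):
    """Largest true risk consistent with kl(emp_risk || risk) <= rhs, by bisection."""
    rhs = (kl_qp + math.log((n + 1) / delta)) / n   # assumed ln((n+1)/delta) form
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

# Example: 10% empirical Gibbs risk, KL(Q||P) = 5 nats, 1000 examples.
print(round(pac_bayes_kl_bound(0.10, 5.0, 1000), 4))
```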
6 0.2285617 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
7 0.22846732 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
8 0.22786455 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
9 0.22734934 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
10 0.22607106 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
11 0.22498035 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
13 0.22384042 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
14 0.22187373 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
15 0.22148338 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
16 0.22134072 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
18 0.22113685 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
19 0.2203761 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
20 0.21912101 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections