jmlr jmlr2007 jmlr2007-13 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA. Editor: Saharon Rosset. Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. [sent-9, score-0.092]
2 This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. [sent-10, score-0.117]
3 Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. [sent-12, score-0.119]
4 A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. [sent-13, score-0.186]
5 A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. [sent-14, score-0.27]
6 Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. [sent-15, score-0.286]
7 Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis 1. [sent-16, score-0.283]
Introduction. A standard approach to supervised classification problems is quadratic discriminant analysis (QDA), which models the likelihood of each class as a Gaussian distribution, then uses the posterior distributions to estimate the class for a given test point (Hastie et al. [sent-17, score-0.128]
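As a point of reference for what follows, here is a minimal sketch of plain maximum-likelihood QDA as described above; the function and variable names are illustrative, not from the paper, and the covariance inversion is exactly the step that becomes ill-posed when the class sample size is small relative to d.

```python
import numpy as np

def fit_ml_qda(X, y):
    """Per-class mean, ML covariance, and empirical class prior."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = np.cov(Xc, rowvar=False, bias=True)   # ML estimate (divides by n_h)
        params[c] = (mu, Sigma, len(Xc) / len(X))
    return params

def predict_ml_qda(params, x):
    """Assign x to the class with the largest Gaussian log-posterior."""
    scores = {}
    for c, (mu, Sigma, prior) in params.items():
        sign, logdet = np.linalg.slogdet(Sigma)       # breaks down if Sigma is singular
        diff = x - mu
        scores[c] = -0.5 * (logdet + diff @ np.linalg.solve(Sigma, diff)) + np.log(prior)
    return max(scores, key=scores.get)
```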
9 Unfortunately, when the number of training samples n is small compared to the number of dimensions of each training sample d, the ML covariance estimation can be ill-posed. [sent-21, score-0.103]
One approach to resolve the ill-posed estimation is to regularize the covariance estimation; another approach is to use Bayesian estimation. [sent-24, score-0.103]
11 In preliminary experiments we found that the performance of Bayesian QDA classifiers is very sensitive to the choice of prior (Srivastava and Gupta, 2006), and that priors suggested by Geisser (1964) and Keehn (1965) produce error rates similar to those yielded by ML. [sent-28, score-0.066]
12 BDA7 differs from previous Bayesian QDA methods in that the prior is selected by crossvalidation from a set of data-dependent priors. [sent-31, score-0.077]
13 Each data-dependent prior captures some coarse information from the training data. [sent-32, score-0.068]
14 We show that the distribution-based Bayesian discriminant analysis classifier has roughly the same closed-form as the parameter-based Bayesian discriminant analysis classifier, but has a different degree of freedom. [sent-41, score-0.156]
15 In Section 5 we establish that the Bayesian minimum expected misclassification cost estimate is equivalent to a plug-in estimate using the Bayesian minimum expected Bregman divergence estimate for each class conditional. [sent-47, score-0.095]
16 Geisser’s work used a noninformative prior distribution to calculate the posterior odds that a test sample belongs to a particular class. [sent-56, score-0.074]
17 Keehn’s work assumed that the prior distribution of the covariance matrix is an inverse Wishart distribution. [sent-57, score-0.159]
Recent work by the authors showed that Bayesian QDA using the priors suggested by Geisser and Keehn performs similarly to ML QDA on six standard QDA simulations (Srivastava and Gupta, 2006). [sent-59, score-0.071]
19 The inverse Wishart prior is a conjugate prior for the covariance matrix, and it requires the specification of a “seed” positive definite matrix and a scalar degree of freedom. [sent-60, score-0.205]
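To make the two hyperparameters of such a prior concrete, the sketch below draws covariance matrices from an inverse Wishart distribution with a given seed matrix and degree of freedom using SciPy; mapping the paper's (q, B_h) onto SciPy's (df, scale) parameterization is our assumption, not something stated in the text.

```python
import numpy as np
from scipy.stats import invwishart

d = 5                # feature dimension
q = d + 2            # degree of freedom (SciPy requires df >= d)
B = np.eye(d)        # "seed" positive definite matrix

# Draw a few covariance matrices from the prior; larger q concentrates the
# draws more tightly around the scale set by the seed matrix.
for Sigma in invwishart.rvs(df=q, scale=B, size=3):
    print(np.round(np.diag(Sigma), 3))
```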
20 Friedman’s comparisons to ML QDA and ML linear discriminant analysis on six simulations showed that RDA could deal effectively with ill-posed covariance estimation when the true covariance matrix is diagonal. [sent-69, score-0.314]
21 RDA is perhaps the most popular approach to discriminant analysis when the covariance estimation is expected to be ill-posed (Hastie et al. [sent-70, score-0.179]
22 Another restricted model used to regularize covariance matrix estimation is a banded covariance matrix (Bickel and Li, 2006). [sent-74, score-0.223]
23 RDA and the Hoffbeck-Landgrebe classifiers linearly combine different covariance estimates. [sent-75, score-0.074]
24 A related approach is to select the best covariance model out of a set of models using crossvalidation. [sent-76, score-0.074]
25 Bensmail and Celeux (1996) propose eigenvalue decomposition discriminant analysis (EDDA), in which fourteen different models for the class covariances are considered, ranging from a scalar times the identity matrix to a full class covariance estimate for each class. [sent-77, score-0.22]
26 3 Other Approaches to Quadratic Discriminant Analysis Other approaches have been developed for ill-posed quadratic discriminant analysis. [sent-83, score-0.103]
27 Friedman (1989) notes that, beginning with work by James and Stein in 1961, researchers have attempted to improve the eigenvalues of the sample covariance matrix. [sent-84, score-0.074]
28 One of the most recent algorithms of this type is orthogonal linear discriminant analysis (Ye, 2005), which was shown by the author of that work to perform similarly to Friedman’s regularized linear discriminant analysis on six real data sets. [sent-86, score-0.2]
Straightforward integration yields an estimate of the class prior, $E_{\Theta}[\Theta_h] = \frac{n_h + 1}{n + G}$; this Bayesian estimate for the multinomial is also known as the Laplace correction (Jaynes and Bretthorst, 2003). [sent-125, score-0.8]
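A one-line illustration of the Laplace-corrected class-prior estimate $(n_h + 1)/(n + G)$; the helper name is ours.

```python
import numpy as np

def laplace_class_priors(y, num_classes):
    """Bayesian (Laplace-corrected) class-prior estimate: (n_h + 1) / (n + G)."""
    counts = np.bincount(y, minlength=num_classes)
    return (counts + 1) / (len(y) + num_classes)

# Three classes, six samples, class 2 unseen: it still receives nonzero mass.
print(laplace_class_priors(np.array([0, 0, 1, 1, 1, 0]), 3))   # [0.444 0.444 0.111]
```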
2), and $\ell(N_h, T_h)$ is the likelihood of the data $T_h$ given $N_h$, that is,
$$\ell(N_h(\mu_h, \Sigma_h), T_h) = \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1} S_h - \tfrac{n_h}{2}\operatorname{tr}\Sigma_h^{-1}(\mu_h - \bar{X}_h)(\mu_h - \bar{X}_h)^T\big]}{(2\pi)^{\frac{d n_h}{2}}\,|\Sigma_h|^{\frac{n_h}{2}}}.$$ [sent-136, score-1.66]
31 A common interpretation of Bayesian analysis is that the prior represents information that one has prior to seeing the data (Jaynes and Bretthorst, 2003). [sent-139, score-0.092]
To meet these goals, we use as a prior
$$p(N_h) = p(\mu_h)\,p(\Sigma_h) = \gamma_0\, \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1} B_h\big]}{|\Sigma_h|^{\frac{q}{2}}}, \qquad (10)$$
where $B_h$ is a positive definite matrix (further specified in (25)) and $\gamma_0$ is a normalization constant. [sent-142, score-0.112]
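For reference, the log of the prior in (10) is, up to the constant $\gamma_0$, $-\tfrac{1}{2}\operatorname{tr}(\Sigma_h^{-1}B_h) - \tfrac{q}{2}\log|\Sigma_h|$; a minimal sketch (the normalization constant is omitted because it does not affect the classifier):

```python
import numpy as np

def log_prior_unnormalized(Sigma, B, q):
    """log p(Sigma) up to an additive constant, for the prior in Eq. (10)."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * np.trace(np.linalg.solve(Sigma, B)) - 0.5 * q * logdet
```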
33 The prior (10) is equivalent to a noninformative prior for the mean µ, and an inverted Wishart prior with q degrees of freedom over Σ. [sent-143, score-0.188]
34 However, the tails of this prior are sufficiently heavy that the prior does not hinder convergence to the true generating distribution as the number of training samples increases if the true generating distribution is normal. [sent-147, score-0.137]
The positive definite matrix $B_h$ specifies the location of the maximum of the prior probability distribution, which occurs at $\frac{B_h}{q}$. (11) [sent-148, score-0.069]
Because this prior is unimodal, a rough interpretation of its action is that it regularizes the likelihood covariance estimate towards the maximum of the prior, given in (11). [sent-152, score-0.133]
37 , 1999), the prior seed matrix Bh = kI, where k is a scalar determined by crossvalidation. [sent-155, score-0.106]
Setting $B_h = kI$ is reminiscent of Friedman’s RDA (Friedman, 1989), where the covariance estimate is regularized by the trace: $\frac{\operatorname{tr}(\hat{\Sigma}_{ML})}{d}\, I$. [sent-156, score-0.117]
We have shown in earlier work that setting $B_h = \frac{\operatorname{tr}(\hat{\Sigma}_{ML})}{d}\, I$ for a distribution-based discriminant analysis outperforms Geisser’s or Keehn’s parameter-based Bayesian discriminant methods, and does not require crossvalidation (Srivastava and Gupta, 2006). [sent-157, score-0.187]
40 The trace of the ML covariance estimate is stable, and provides coarse information about the scale of the data samples. [sent-158, score-0.109]
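The trace-based seed $B_h = \frac{\operatorname{tr}(\hat{\Sigma}_{ML})}{d} I$ used in that earlier work is straightforward to compute; a minimal sketch:

```python
import numpy as np

def trace_seed(Sigma_ml):
    """Seed matrix (tr(Sigma_ML) / d) * I, capturing only the overall data scale."""
    d = Sigma_ml.shape[0]
    return (np.trace(Sigma_ml) / d) * np.eye(d)
```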
Theorem 1: The classifier (5) using the prior (10) can be written as
$$\hat{Y} = \operatorname*{argmin}_{g=1,\ldots,G} \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q+1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}{(n_h+1)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) |A_h|^{\frac{n_h+q+1}{2}}}\; \hat{P}(Y=h), \qquad (12)$$
where $\hat{P}(Y = h)$ is an estimate of the class prior probability for class h, $\Gamma(\cdot)$ is the standard gamma function, and
$$A_h = S_h + \frac{n_h}{n_h+1}\,(x - \bar{x}_h)(x - \bar{x}_h)^T + B_h. \qquad (13)$$ [sent-166, score-0.067] [sent-169, score-4.12]
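For numerical stability, the class-h term in (12) is best evaluated in log space with log-gamma functions and log-determinants. The sketch below follows the reconstruction of (12)–(13) given above and should be read as an approximation of the rule, not a verified reimplementation; all names are ours.

```python
import numpy as np
from scipy.special import gammaln

def bda_log_score(x, xbar, S, B, n, q, log_class_prior):
    """Log of the class-h factor in Eq. (12); larger means the class is more likely."""
    d = len(x)
    diff = (x - xbar).reshape(-1, 1)
    A = S + (n / (n + 1.0)) * (diff @ diff.T) + B                  # Eq. (13)
    _, logdet_SB = np.linalg.slogdet(0.5 * (S + B))
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * d * (np.log(n) - np.log(n + 1.0))
            + gammaln(0.5 * (n + q + 1)) - gammaln(0.5 * (n + q - d + 1))
            + 0.5 * (n + q) * logdet_SB
            - 0.5 * (n + q + 1) * logdet_A
            + log_class_prior)
```

With a 0–1 misclassification cost, minimizing (12) over g reduces to assigning x to the class h with the largest value of this log score.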
The parameter-based Bayesian QDA class label using the prior given in (10) is
$$\hat{Y} = \operatorname*{argmin}_{g=1,\ldots,G} \sum_{h=1}^{G} C(g,h)\, \frac{n_h^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d-1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q-d-2}{2}}}{(n_h+1)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-2d-1}{2}\right) |A_h|^{\frac{n_h+q-d-1}{2}}}\; \hat{P}(Y=h). \qquad (14)$$ [sent-171, score-0.067] [sent-174, score-3.148]
The parameter-based Bayesian discriminant result (14) will not hold if $n_h \le 2d - q + 1$, while the distribution-based result (12) holds for any $n_h > 0$ and any d. [sent-176, score-1.668]
We compared the noninformative prior originally proposed by Geisser (1964), and the modified inverse Wishart prior given in Equation (10) with $B_h = \frac{\operatorname{tr}(\hat{\Sigma}_{ML,h})}{d}\, I$. [sent-178, score-0.167]
Results on six simulations (two-class versions of Friedman’s original simulations) showed that, given the same prior, the distribution-based classifier performed better than the parameter-based classifier when the ratio of feature dimensions to the number of training samples is high. [sent-180, score-0.066]
48 1), chooses the degree of freedom for the modified inverse Wishart prior by cross-validation. [sent-186, score-0.08]
49 In Section 6, we propose a new Bayesian QDA classifier called BDA7 that cross-validates the degree of freedom and a data-dependent seed matrix Bh for the prior. [sent-188, score-0.078]
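The cross-validation implied here can be organized as a simple grid search over the degree of freedom and the candidate seed constructions; everything below (names, the grid) is illustrative scaffolding rather than the paper's procedure.

```python
def cross_validate_bda(folds, q_grid, seed_builders, error_fn):
    """Pick the (q, seed builder) pair with the lowest average validation error.

    folds: list of (train, validation) splits.
    seed_builders: callables mapping training data to a seed matrix B_h per class.
    error_fn: callable(train, val, q, seed_builder) -> validation error rate.
    """
    best = None
    for q in q_grid:
        for build_seed in seed_builders:
            err = sum(error_fn(tr, va, q, build_seed) for tr, va in folds) / len(folds)
            if best is None or err < best[0]:
                best = (err, q, build_seed)
    return best   # (error, q, seed builder)
```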
Relationship Between Regularized QDA and Bayesian QDA. In this section we show that Friedman’s regularized form for the covariance matrix (Friedman, 1989) emerges from the Bayesian QDA formula. [sent-191, score-0.127]
Then (15) becomes
$$E_{N_h}[N_h] \approx \frac{\Gamma\!\left(\frac{n_h+q+1}{2}\right)}{\left(\frac{n_h+1}{n_h}\pi\right)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) |D_h|^{\frac{1}{2}}}\, \exp\!\left[-\frac{n_h+q+1}{2}\,\frac{n_h}{n_h+1}\, Z_h^T D_h^{-1} Z_h\right] \qquad (16)$$
$$= \frac{\Gamma\!\left(\frac{n_h+q+1}{2}\right)}{\pi^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) \left|\frac{n_h+1}{n_h} D_h\right|^{\frac{1}{2}}}\, \exp\!\left[-\frac{1}{2}\, Z_h^T \left(\frac{n_h+1}{n_h+q+1}\,\frac{D_h}{n_h}\right)^{-1} Z_h\right]. \qquad (17)$$ [sent-201, score-10.231]
The approximation (16) resembles a Gaussian distribution, where $\tilde{\Sigma}_h$ plays the role of the covariance matrix. [sent-203, score-1.648]
Rewrite (17):
$$\tilde{\Sigma}_h = \frac{S_h + B_h}{n_h}\,\frac{n_h+1}{n_h+q+1} = \frac{n_h+1}{n_h+q+1}\,\frac{S_h}{n_h} + \frac{n_h+1}{n_h}\,\frac{B_h}{n_h+q+1}. \qquad (18)$$ [sent-204, score-4.722]
In (18), make the approximation $\frac{n_h+1}{n_h} \approx 1$, then multiply and divide the second term of (18) by q:
$$\tilde{\Sigma}_h \approx \frac{n_h+1}{n_h+q+1}\,\frac{S_h}{n_h} + \frac{q}{n_h+q+1}\,\frac{B_h}{q}. \qquad (19)$$ [sent-205, score-10.231]
The right-hand side of (19) is a convex combination of the sample covariance and the positive definite matrix $B_h$. [sent-206, score-0.097]
Here, the fraction $\frac{q}{n_h+q+1}$ controls the shrinkage of the sample covariance matrix toward the positive definite matrix $B_h$; recall from (11) that $\frac{B_h}{q}$ is the maximum of the prior probability distribution. [sent-208, score-0.953]
57 Equation (19) also gives information about how the Bayesian shrinkage depends on the number of sample points from each class: fewer training samples nh results in greater shrinkage towards the positive definite matrix Bh . [sent-209, score-0.857]
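Written out, the shrinkage estimate in (19) is a two-line function; the exact weights below come from our reconstruction of (19), and they sum to one, so the result is a convex combination of the sample covariance $S_h/n_h$ and the prior maximum $B_h/q$.

```python
import numpy as np

def shrunk_covariance(S, B, n, q):
    """Approximate Bayesian covariance of Eq. (19)."""
    w = (n + 1.0) / (n + q + 1.0)        # weight on the sample covariance S / n
    return w * (S / n) + (1.0 - w) * (B / q)
```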
The diagonal of $\hat{\Sigma}_{ML}$ has been used for regularization in regularized discriminant analysis (Hoffbeck and Landgrebe, 1996) and in model-based discriminant analysis (Bensmail and Celeux, 1996). [sent-261, score-0.215]
Note that setting $B_h = \operatorname{diag}\hat{\Sigma}_{ML}$ places the maximum of the prior at $\frac{1}{q}\operatorname{diag}\hat{\Sigma}_{ML}$. [sent-262, score-0.112]
We hypothesize that in some cases it may be more effective to place the maximum of the prior at $\operatorname{diag}\hat{\Sigma}_{ML}$; that requires setting $B_h = q\operatorname{diag}\hat{\Sigma}_{ML}$. [sent-263, score-0.079]
We also consider moving the maximum of the prior closer to the zero matrix, effectively turning the prior into an exponential prior rather than a unimodal one. [sent-264, score-0.171]
To this end, we also consider setting the prior seed matrix to be $B_h = \frac{1}{q}\operatorname{diag}\hat{\Sigma}_{ML}$. [sent-268, score-0.139]
Thus, for BDA7 the $B_h$ is selected by crossvalidation from seven options:
$$B_h \in \left\{\, q\operatorname{diag}\hat{\Sigma}_{\text{(pooled ML)}},\; q\operatorname{diag}\hat{\Sigma}_{\text{(class ML,}h\text{)}},\; \tfrac{1}{q}\operatorname{diag}\hat{\Sigma}_{\text{(pooled ML)}},\; \tfrac{1}{q}\operatorname{diag}\hat{\Sigma}_{\text{(class ML,}h\text{)}},\; \operatorname{diag}\hat{\Sigma}_{\text{(pooled ML)}},\; \operatorname{diag}\hat{\Sigma}_{\text{(class ML,}h\text{)}},\; \tfrac{1}{q}\operatorname{tr}\big(\hat{\Sigma}_{\text{(pooled ML)}}\big)\, I \,\right\}. \qquad (25)$$ [sent-270, score-0.229]
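The seven candidates in (25) can be generated mechanically from the pooled and per-class ML covariance estimates. The sketch follows the reconstruction above; in particular the 1/q scaling of the trace option is our reading of the garbled source, so treat the exact scalings as provisional.

```python
import numpy as np

def candidate_seeds(Sigma_pooled, Sigma_class, q):
    """Seven candidate seed matrices B_h for BDA7 cross-validation, Eq. (25)."""
    d = Sigma_pooled.shape[0]
    dp = np.diag(np.diag(Sigma_pooled))   # diag of pooled ML covariance
    dc = np.diag(np.diag(Sigma_class))    # diag of class-h ML covariance
    return [q * dp, q * dc, dp / q, dc / q, dp, dc,
            (np.trace(Sigma_pooled) / q) * np.eye(d)]
```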
64 , 1999), RDA (Friedman, 1989), eigenvalue decomposition discriminant analysis (EDDA) (Bensmail and Celeux, 1996), and maximum-likelihood estimated QDA, LDA, and the nearest-means classifier (NM). [sent-282, score-0.092]
65 QB uses a normal prior for each mean vector, and an inverse Wishart distribution prior for each covariance matrix. [sent-295, score-0.198]
66 There are two free parameters to the inverse Wishart distribution: the scale parameter q ≥ d and the seed matrix Bh . [sent-296, score-0.076]
This causes numerical difficulties in computing the maximum likelihood covariance estimates. [sent-317, score-0.074]
BDA7 chooses the identity prior seed matrix $B_h = \frac{1}{q}\operatorname{tr}\big(\hat{\Sigma}_{\text{(pooled ML)}}\big)\, I$ every time for this data set. [sent-318, score-0.124]
Given that this is BDA7’s choice, BDA7 is at a disadvantage compared to QB, which cross-validates a scaled identity seed matrix kI. [sent-319, score-0.078]
70 Six of the simulations were originally created by Friedman and used to show that RDA performs favorably compared to ML linear discriminant analysis (LDA) and ML QDA (Friedman, 1989). [sent-469, score-0.115]
71 Friedman’s six simulations all have diagonal generating class covariance matrices, corresponding to independent classification features. [sent-470, score-0.154]
72 When the true generating distributions are drawn from full covariance matrices (Cases 7–10), EDDA performs well when there are only 10 features, but the maximum likelihood estimates used in EDDA fail for 50 and 100 feature dimensions. [sent-481, score-0.103]
1 Case 1: Equal Spherical Covariance Matrices. Each class conditional distribution is normal with identity covariance matrix I. [sent-488, score-0.115]
2 Case 2: Unequal Spherical Covariance Matrices. The class one conditional distribution is normal with identity covariance matrix I and mean at the origin. [sent-497, score-0.131]
75 The class two conditional distribution is normal with covariance matrix 2I and has zero mean except the first component of its mean is 3. [sent-498, score-0.129]
76 The class three conditional distribution is normal with covariance matrix 3I and has zero mean except the last component of its mean is 4. [sent-499, score-0.129]
The eigenvalues of the common covariance matrix are given by
$$e_i = \left(\frac{9(i-1)}{d-1} + 1\right)^2, \quad 1 \le i \le d, \qquad (26)$$
so the ratio of the largest to smallest eigenvalue is 100. [sent-504, score-0.111]
4 Cases 5 and 6: Unequal Highly Ellipsoidal Covariance Matrices. For these cases, the covariance matrices are highly ellipsoidal and different for each class. [sent-727, score-0.088]
The eigenvalues of the class one covariance are given by Equation (26), and those of class two are given by
$$e_{2i} = \left(\frac{9(d-i)}{d-1} + 1\right)^2, \quad 1 \le i \le d.$$ [sent-728, score-1.66]
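The diagonal covariances for these simulation cases follow directly from the eigenvalue formulas above; a minimal sketch (the d − 1 denominator for class two is assumed by symmetry with (26)):

```python
import numpy as np

def ellipsoidal_eigenvalues(d, reverse=False):
    """Eigenvalues (9*(i-1)/(d-1) + 1)^2; the reversed ordering mirrors class two."""
    i = np.arange(1, d + 1)
    idx = (d - i) if reverse else (i - 1)
    return (9.0 * idx / (d - 1) + 1.0) ** 2

d = 10
e1 = ellipsoidal_eigenvalues(d)                # class one (and the common Case 3 matrix)
e2 = ellipsoidal_eigenvalues(d, reverse=True)  # class two
Sigma1, Sigma2 = np.diag(e1), np.diag(e2)
print(e1.max() / e1.min())                     # 100.0, the stated eigenvalue ratio
```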
80 Case 5 and Case 6 present more information to the classifiers than the previous cases because the covariance matrices are substantially different. [sent-734, score-0.088]
Then let the class one covariance matrix be $R_1^T R_1$. [sent-739, score-0.097]
Similarly, let the class two and class three covariance matrices be $R_2^T R_2$ and $R_3^T R_3$, where $R_2$ and $R_3$ are constructed in the same manner as $R_1$. [sent-740, score-0.088]
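These full-covariance classes can be generated as stated: fill a square matrix $R_i$ with random entries and take $R_i^T R_i$ as the class covariance. A sketch, assuming uniform(0, 1) entries since the entry distribution is not specified in the lines shown here:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_full_covariance(d, rng):
    """Class covariance R^T R from a random square matrix R (entry law assumed uniform)."""
    R = rng.uniform(size=(d, d))
    return R.T @ R

d = 10
Sigma1, Sigma2, Sigma3 = (random_full_covariance(d, rng) for _ in range(3))
```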
For most runs of this simulation, the best EDDA model is the full covariance model, but because EDDA uses ML estimation, its estimation of the full covariance model is ill-conditioned. [sent-745, score-0.162]
Then the Cases 9 and 10 covariance matrices are $R_i^T R_i\, R_i^T R_i$ for i = 1, 2, 3. [sent-752, score-0.088]
These covariance matrices are highly ellipsoidal, often with one strong eigenvalue and many relatively small eigenvalues. [sent-753, score-0.102]
A key aspect of BDA7 is that the seed matrix in the inverse Wishart prior uses a coarse estimate of the covariance matrix to peg the maximum of the prior to a relevant part of the distribution space. [sent-763, score-0.3]
Comparisons on simulations show that RDA and EDDA perform well when the true Gaussian distribution matches one of their regularization covariance models (e.g., diagonal, identity), but they can fail when the generating distribution has a full covariance matrix, particularly when features are correlated. [sent-765, score-0.111] [sent-767, score-0.089]
89 In contrast, the Bayesian methods BDA7 and QB can learn from the rich differentiating information offered by full covariance matrices. [sent-768, score-0.074]
90 We hypothesize that better priors exist, and that such priors will also be data-dependent and make use of a coarse estimate of the covariance matrix for the prior. [sent-769, score-0.172]
Substitute $\ell(N_h, T_h)$ from (9) and $p(N_h)$ from (10):
$$\alpha_h = \int_{\Sigma_h}\!\int_{\mu_h} \frac{\exp\!\big[-\tfrac{n_h}{2}\operatorname{tr}\Sigma_h^{-1}(\mu_h - \bar{x}_h)(\mu_h - \bar{x}_h)^T\big]}{(2\pi)^{\frac{n_h d}{2}}\, |\Sigma_h|^{\frac{n_h+q}{2}}}\; \exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(S_h + B_h)\big]\; \frac{d\Sigma_h\, d\mu_h}{|\Sigma_h|^{\frac{d+2}{2}}}.$$ [sent-780, score-2.573]
Integrate with respect to $\mu_h$ using identity (27):
$$\alpha_h = \left(\frac{2\pi}{n_h}\right)^{\frac{d}{2}} \frac{1}{(2\pi)^{\frac{n_h d}{2}}} \int_{\Sigma_h} \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(S_h + B_h)\big]}{|\Sigma_h|^{\frac{n_h+q+d+1}{2}}}\, d\Sigma_h.$$ [sent-781, score-2.422]
Next, integrate with respect to $\Sigma_h$ using identity (28):
$$\alpha_h = \frac{1}{(2\pi)^{\frac{n_h d}{2}}} \left(\frac{2\pi}{n_h}\right)^{\frac{d}{2}} \frac{\Gamma_d\!\left(\frac{n_h+q}{2}\right)}{\left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}.$$ [sent-782, score-3.179]
$$\int_M N_h(x)\, f(N_h)\, dM_h = \frac{1}{\alpha_h\, (2\pi)^{\frac{n_h d}{2}}} \int_{\Sigma_h}\!\int_{\mu_h} \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(x-\mu_h)(x-\mu_h)^T\big]}{(2\pi)^{\frac{d}{2}}\, |\Sigma_h|^{\frac{1}{2}}}\; \frac{\exp\!\big[-\tfrac{1}{2}\operatorname{tr}\Sigma_h^{-1}(S_h+B_h)\big]}{|\Sigma_h|^{\frac{n_h+q}{2}}}\; \exp\!\Big[-\tfrac{n_h}{2}\operatorname{tr}\Sigma_h^{-1}(\mu_h-\bar{x}_h)(\mu_h-\bar{x}_h)^T\Big]\; \frac{d\mu_h\, d\Sigma_h}{|\Sigma_h|^{\frac{d+2}{2}}}.$$ [sent-784, score-2.616]
Integrate with respect to $\mu_h$ and $\Sigma_h$ using identities (27) and (28), and equation (13), to yield
$$E_{N_h}[N_h(x)] = \frac{1}{\alpha_h}\, \frac{\Gamma_d\!\left(\frac{n_h+q+1}{2}\right)}{(2\pi)^{\frac{d n_h}{2}}\, (n_h+1)^{\frac{d}{2}}\, |A_h|^{\frac{n_h+q+1}{2}}},$$
where $A_h$ is given by Equation (13). [sent-785, score-1.574]
After substituting the value of $\alpha_h$ from (30),
$$E_{N_h}[N_h(x)] = \frac{n_h^{\frac{d}{2}}\, \Gamma_d\!\left(\frac{n_h+q+1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}{(2\pi)^{\frac{d}{2}}\, (n_h+1)^{\frac{d}{2}}\, \Gamma_d\!\left(\frac{n_h+q}{2}\right) |A_h|^{\frac{n_h+q+1}{2}}}.$$ [sent-786, score-3.148]
Simplify the multivariate gamma function $\Gamma_d$ using (29):
$$E_{N_h}[N_h(x)] = \frac{n_h^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q+1}{2}\right) \left|\frac{S_h+B_h}{2}\right|^{\frac{n_h+q}{2}}}{(2\pi)^{\frac{d}{2}}\, (n_h+1)^{\frac{d}{2}}\, \Gamma\!\left(\frac{n_h+q-d+1}{2}\right) |A_h|^{\frac{n_h+q+1}{2}}}.$$ [sent-787, score-3.935]
Set $a = \int_M (N_h - \hat{f}\,)\, r(N_h)\, dM$ in (39) and use the assumption that the functional $\delta^2\phi[f;\cdot,\cdot]$ is strongly positive, which implies that the above functional can be zero if and only if $a = 0$, that is,
$$0 = \int_M (N_h - \hat{f}\,)\, r(N_h)\, dM, \qquad (40)$$
and hence $\hat{f} = E_{N_h}[N_h]$. [sent-802, score-0.078]
99 Empirical Bayes estimation of the multivariate normal covariance matrix. [sent-938, score-0.088]
Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. [sent-1011, score-0.078]
wordName wordTfidf (topN-words)
[('nh', 0.787), ('qda', 0.386), ('bh', 0.185), ('rda', 0.141), ('qb', 0.125), ('edda', 0.119), ('bregman', 0.097), ('ml', 0.097), ('zh', 0.084), ('wishart', 0.079), ('discriminant', 0.078), ('covariance', 0.074), ('enh', 0.073), ('rigyik', 0.073), ('rivastava', 0.073), ('uadratic', 0.073), ('upta', 0.073), ('bayesian', 0.073), ('dm', 0.067), ('xh', 0.063), ('srivastava', 0.057), ('sh', 0.057), ('gupta', 0.051), ('iscriminant', 0.051), ('geisser', 0.047), ('prior', 0.046), ('nalysis', 0.044), ('dh', 0.043), ('tr', 0.043), ('seed', 0.037), ('simulations', 0.037), ('bensmail', 0.036), ('pooled', 0.036), ('friedman', 0.035), ('functional', 0.033), ('diag', 0.033), ('crossvalidation', 0.031), ('keehn', 0.031), ('regularized', 0.03), ('divergence', 0.03), ('lda', 0.03), ('celeux', 0.026), ('ovariance', 0.026), ('er', 0.026), ('ah', 0.025), ('quadratic', 0.025), ('matrix', 0.023), ('coarse', 0.022), ('argmin', 0.021), ('ases', 0.021), ('hoffbeck', 0.021), ('nequal', 0.021), ('nm', 0.02), ('priors', 0.02), ('gaussian', 0.019), ('brown', 0.018), ('ionosphere', 0.018), ('identity', 0.018), ('atrices', 0.018), ('freedom', 0.018), ('pen', 0.016), ('simulation', 0.016), ('shrinkage', 0.016), ('crossvalidated', 0.016), ('extremum', 0.016), ('landgrebe', 0.016), ('llipsoidal', 0.016), ('noninformative', 0.016), ('parameterbased', 0.016), ('inverse', 0.016), ('mean', 0.016), ('misclassi', 0.015), ('differential', 0.015), ('generating', 0.015), ('regularize', 0.015), ('samples', 0.015), ('csisz', 0.014), ('pima', 0.014), ('thyroid', 0.014), ('rd', 0.014), ('diagonal', 0.014), ('th', 0.014), ('classi', 0.014), ('matrices', 0.014), ('eigenvalue', 0.014), ('six', 0.014), ('estimation', 0.014), ('benchmark', 0.013), ('ighly', 0.013), ('numbered', 0.013), ('expected', 0.013), ('estimate', 0.013), ('banerjee', 0.013), ('integrate', 0.013), ('sonar', 0.013), ('posterior', 0.012), ('strongly', 0.012), ('wine', 0.012), ('santosh', 0.012), ('termed', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
2 0.068685189 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation—every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as kmeans and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a varic 2007 Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, Srujana Merugu and Dharmendra Modha. BANERJEE , D HILLON , G HOSH , M ERUGU AND M ODHA ety of problem domains and also describe novel co-clustering applications such as missing value prediction and
3 0.039001867 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
Author: Marco Loog
Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization 1. Classical and New Criteria Given N feature vectors of dimensionality n, a linear reduction of dimensionality, based on classical Fisher LDA, determines an n × d transformation matrix L that, for a given d < K, K the number of classes, maximizes the so-called Fisher criterion: F(A) = tr((A t SW A)−1 (At SB A)) or, equivalently, F0 (A) = tr((At ST A)−1 (At SB A)). Here, SB := ∑K pi (mi − m)(mi − m)t , SW := ∑K pi Si , and i=1 i=1 ST = SB + SW . The matrices SB , SW , and ST are the so-called between-class, pooled within-class, and total covariance matrices. In addition, mi is the mean vector of class i, pi is the prior of class i, and the overall mean m equals ∑k pi mi . Finally, Si is the covariance matrix of class i. i=1 A solution to these optimization problems can be obtained by means of a generalized eigenvalue decomposition, which Fukunaga (1990) relates to a simultaneous diagonalization of the two matrices involved (see also Campbell and Atchley, 1981). More common is it to apply a standard −1 eigenvalue decomposition to S−1 SB (or SW SB ), resulting in an equivalent set of eigenvectors. The d T columns of the optimal solution L are simply taken to equal the d eigenvectors corresponding to the d largest eigenvalues. It is known that this solution is not unique and the full class can be obtained by multiplying L to the right with nonsingular d × d matrices (see Fukunaga, 1990). Clearly, if the total covariance ST is singular, neither the generalized nor the standard eigenvalue decomposition can be readily employed. Directly or indirectly, the problem is that the matrix inverse S−1 does not exist, which is the typical situation when dealing with small samples. In an attempt to T overcome this problem, Ye (2005) introduced a different criterion that is defined as F1 (A) = tr((At ST A)+ (At SB A)) , ∗. Also at Nordic Bioscience Imaging, Hovegade 207, DK-2730 Herlev, Denmark. c 2007 Marco Loog. (1) L OOG where + denotes taking the Moore-Penrose generalized inverse of a matrix. Like for F0 , an optimal transform L is one that maximizes the objective F1 . Again, this solution is not unique. 2. Correct Characterization For the full characterization of the set of solutions to Equation (1), initially the problem is looked at from a geometrical point of view (cf., Campbell and Atchley, 1981). It is assumed that the number of samples N is smaller than or equal to the feature dimensionality n. In the undersampled case, it is clear that all data variation occurs in an N − 1-dimensional subspace of the original space. To start with, a PCA of the data is carried out and the first N − 1 principal components are rotated to the first N − 1 axes of the n-dimensional space by means of a rotation matrix R. This matrix consists of all (normalized) eigenvectors of ST taken as its columns. After this rotation, new total and between-class covariance matrices, ST = Rt ST R and SB = Rt SB R, are obtained. 
These 0 0 matrices can be partitioned as follows: ST = Σ0T 0 and SB = ΣB 0 , where ΣT and ΣB are N − 1 × 0 N − 1 covariance matrices and ΣT is nonsingular and diagonal by construction. The between-class variation is obviously restricted to the N − 1-dimensional subspace in which the total data variation takes place, therefore a same partitioning of SB is possible. However, the covariance submatrix ΣB is not necessarily diagonal, neither does it have to be nonsingular. Basically, the PCA-based rotation R converts the initial problem into a more convenient one, splitting up the original space in an N − 1-dimensional one in which “everything interesting” takes place and a remaining n − N + 1dimensional subspace in which “nothing happens at all”. Now consider F1 in this transformed space and take a general n × d transformation matrix A, which is partitioned in a way similar to the covariance matrices, that is, X . Y A= (2) Here, X is an N − 1 × d-matrix and Y is of size n − N + 1 × d. Taking this definition, the following holds (cf., Ye, 2005): t + t F1 (A) = tr((A ST A) (A SB A)) = tr =tr X t ΣT X 0 0 0 + X Y X t ΣB X 0 0 0 t ΣT 0 = tr 0 0 X Y + (Xt ΣT X)−1 0 0 0 X Y t ΣB 0 0 0 X Y X t ΣB X 0 0 0 = tr((Xt ΣT X)−1 (Xt ΣB X)) = F0 (X) . From this it is immediate that a matrix A maximizes F1 if and only if the submatrix X maximizes the original Fisher criterion in the lower-dimensional subspace. Moreover, if L is such a matrix maximizing F1 in the PCA-transformed space, it is easy to check that R−1 L = Rt L provides a solution to the original, general problem that has not been preprocessed by means of a PCA (see also Fukunaga, 1990). A characterization of the complete family of solutions can now be given. Let Λ ∈ RN−1×d be an optimal solution of F0 (X) = tr((Xt ΣT X)−1 (Xt ΣB X)). As already noted in Section 1, the full set of solutions is given by F = {ΛZ ∈ RN−1×d | Z ∈ GLd (R)}, where GLd (R) denotes the general linear group of d × d invertible matrices. The previous paragraph essentially demonstrates that if X ∈ F , A in Equation (2) maximizes F1 . The matrix Y can be chosen ad 2122 C OMPLETE C HARACTERIZATION OF A FAMILY OF S OLUTIONS libitum. Now, the latter provides the solution in the PCA-transformed space and to solve the general problem we need to take the rotation back to the original space into account. All in all, this leads to the following complete family of solutions L , maximizing F1 in the original space: L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R), Y ∈ Rn−N+1×d Y , (3) where Λ = argmaxX tr((Xt ΣT X)−1 (Xt ΣB X)) and Rt takes care of the rotation back. 3. Original Characterization Though not noted by Ye (2005), his attempt to characterize the full set of solutions of Equation (1) is based on a simultaneous diagonalization of the three covariance matrices S B , SW , and ST that is similar to the ideas described by Campbell and Atchley (1981) and Fukunaga (1990). Moreover, Golub and Van Loan (Theorem 8.7.1. 1996) can be readily applied to demonstrate that such simultaneous diagonalization is possible in the small sample setting. After the diagonalization step, partitioned between-class, pooled within-class, and total covariance matrices are considered. This partitioning is similar to the one employed in the previous section, which does not enforce matrices to be diagonal however. In the subsequent optimization step, the classical Fisher criterion is maximized basically in the appropriate subspace, comparable to the approach described above, but in a mildly more involved and concealed way. 
For this, matrices of the form Rt X are considered, consider Equations (2) and Y (3). However, Y is simply the null matrix and the family of solutions L provided is limited to L = Rt ΛZ ∈ Rn×d Z ∈ GLd (R) . 0 Obviously, this is far from a complete characterization, especially when N − 1 n which is, for instance, typically the case for the data sets considered by Ye (2005). Generally, the utility of a dimensionality reduction criterion, without additional constrains, depends on the efficiency over the full set of solutions. As Ye (2005) only considers two very specific instances from the large class of possibilities, it is unclear to what extent the new criterion really provides an efficient way of performing a reduction of dimensionality. References N. A. Campbell and W. R. Atchley. The geometry of canonical variate analysis. Systematic Zoology, 30(3):268–280, 1981. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1990. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483–502, 2005. 2123
4 0.033830751 15 jmlr-2007-Bilinear Discriminant Component Analysis
Author: Mads Dyrholm, Christoforos Christoforou, Lucas C. Parra
Abstract: Factor analysis and discriminant analysis are often used as complementary approaches to identify linear components in two dimensional data arrays. For three dimensional arrays, which may organize data in dimensions such as space, time, and trials, the opportunity arises to combine these two approaches. A new method, Bilinear Discriminant Component Analysis (BDCA), is derived and demonstrated in the context of functional brain imaging data for which it seems ideally suited. The work suggests to identify a subspace projection which optimally separates classes while ensuring that each dimension in this space captures an independent contribution to the discrimination. Keywords: bilinear, decomposition, component, classification, regularization
5 0.02760922 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen
Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation
6 0.025207607 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
7 0.023231709 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
9 0.022771614 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
10 0.022678372 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
11 0.02083494 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
12 0.019751834 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.019647054 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
14 0.019594962 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
15 0.018399864 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
17 0.016170645 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
18 0.01574737 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
19 0.015334575 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
20 0.014488925 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
topicId topicWeight
[(0, 0.093), (1, 0.016), (2, 0.007), (3, 0.055), (4, -0.057), (5, -0.122), (6, -0.041), (7, -0.005), (8, -0.015), (9, -0.019), (10, -0.06), (11, -0.013), (12, -0.021), (13, -0.138), (14, -0.115), (15, 0.063), (16, -0.001), (17, 0.007), (18, 0.033), (19, -0.025), (20, 0.184), (21, -0.152), (22, -0.071), (23, 0.144), (24, -0.329), (25, 0.049), (26, 0.064), (27, -0.042), (28, -0.125), (29, -0.073), (30, 0.19), (31, 0.058), (32, -0.003), (33, 0.073), (34, -0.28), (35, -0.121), (36, -0.195), (37, -0.204), (38, -0.202), (39, -0.101), (40, 0.127), (41, -0.255), (42, 0.12), (43, -0.17), (44, -0.075), (45, 0.05), (46, 0.198), (47, -0.103), (48, 0.007), (49, -0.212)]
simIndex simValue paperId paperTitle
same-paper 1 0.97118211 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
2 0.38805541 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
3 0.17936112 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
Author: Kristen Grauman, Trevor Darrell
Abstract: In numerous domains it is useful to represent a single example by the set of the local features or parts that comprise it. However, this representation poses a challenge to many conventional machine learning techniques, since sets may vary in cardinality and elements lack a meaningful ordering. Kernel methods can learn complex functions, but a kernel over unordered set inputs must somehow solve for correspondences—generally a computationally expensive task that becomes impractical for large set sizes. We present a new fast kernel function called the pyramid match that measures partial match similarity in time linear in the number of features. The pyramid match maps unordered feature sets to multi-resolution histograms and computes a weighted histogram intersection in order to find implicit correspondences based on the finest resolution histogram cell where a matched pair first appears. We show the pyramid match yields a Mercer kernel, and we prove bounds on its error relative to the optimal partial matching cost. We demonstrate our algorithm on both classification and regression tasks, including object recognition, 3-D human pose inference, and time of publication estimation for documents, and we show that the proposed method is accurate and significantly more efficient than current approaches. Keywords: kernel, sets of features, histogram intersection, multi-resolution histogram pyramid, approximate matching, object recognition
4 0.17095834 15 jmlr-2007-Bilinear Discriminant Component Analysis
5 0.161414 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion
6 0.15008329 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
7 0.13677008 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
8 0.13331664 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
9 0.10887522 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
10 0.10417764 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
11 0.097251199 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
12 0.096219376 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
13 0.095010504 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
14 0.092966713 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
16 0.089008719 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks
17 0.084886923 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
18 0.080308415 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
19 0.076524884 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
20 0.074598983 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
topicId topicWeight
[(1, 0.013), (4, 0.049), (8, 0.043), (12, 0.025), (26, 0.024), (28, 0.053), (40, 0.026), (48, 0.031), (60, 0.024), (80, 0.016), (85, 0.053), (86, 0.437), (98, 0.06)]
simIndex simValue paperId paperTitle
same-paper 1 0.7010166 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
Author: Santosh Srivastava, Maya R. Gupta, Béla A. Frigyik
Abstract: Quadratic discriminant analysis is a common tool for classification, but estimation of the Gaussian parameters can be ill-posed. This paper contains theoretical and algorithmic contributions to Bayesian estimation for quadratic discriminant analysis. A distribution-based Bayesian classifier is derived using information geometry. Using a calculus of variations approach to define a functional Bregman divergence for distributions, it is shown that the Bayesian distribution-based classifier that minimizes the expected Bregman divergence of each class conditional distribution also minimizes the expected misclassification cost. A series approximation is used to relate regularized discriminant analysis to Bayesian discriminant analysis. A new Bayesian quadratic discriminant analysis classifier is proposed where the prior is defined using a coarse estimate of the covariance based on the training data; this classifier is termed BDA7. Results on benchmark data sets and simulations show that BDA7 performance is competitive with, and in some cases significantly better than, regularized quadratic discriminant analysis and the cross-validated Bayesian quadratic discriminant analysis classifier Quadratic Bayes. Keywords: quadratic discriminant analysis, regularized quadratic discriminant analysis, Bregman divergence, data-dependent prior, eigenvalue decomposition, Wishart, functional analysis
Author: Onur C. Hamsici, Aleix M. Martinez
Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and this pre-processing step map the original data from $\mathbb{R}^p$ to the surface of a hypersphere $S^{p-1}$. Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead, as if the data were represented in $\mathbb{R}^p$ rather than $S^{p-1}$. This opens the question of whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case, however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision
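As a small illustration of the normalization step described in the abstract above, here is a minimal sketch that maps feature vectors from $\mathbb{R}^p$ onto the unit hypersphere $S^{p-1}$; it is not the authors' spherical-homoscedastic machinery, and the helper name to_sphere is illustrative.

```python
import numpy as np

def to_sphere(X, eps=1e-12):
    """Map each row of X from R^p onto the unit hypersphere S^{p-1}."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)   # guard against zero-norm rows

# After normalization, Euclidean and cosine comparisons coincide up to a
# monotone transformation, since ||u - v||^2 = 2 - 2 u.v for unit vectors.
X = np.random.default_rng(0).normal(size=(5, 10))
U = to_sphere(X)
print(np.allclose(np.linalg.norm(U, axis=1), 1.0))  # True
```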
3 0.23686679 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/~jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
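To make the mode-ascent idea in the abstract above concrete, here is a minimal sketch of an EM-style fixed-point iteration that climbs a Gaussian kernel density estimate from each sample point and groups points whose iterations converge to the same local maximum. It follows the general Modal EM recipe the abstract describes but is not the authors' C package; the bandwidth handling, tolerances, and function names are illustrative.

```python
import numpy as np

def ascend_to_mode(x, data, h, tol=1e-6, max_iter=500):
    """EM-style ascent of a Gaussian KDE with bandwidth h, started at x."""
    for _ in range(max_iter):
        # E-step: posterior weight of each kernel component at the current point.
        d2 = np.sum((data - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / h ** 2)
        w /= w.sum()
        # M-step: move to the weighted average of the data.
        x_new = w @ data
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clustering(data, h, merge_tol=1e-3):
    """Cluster points by the mode their ascent paths converge to."""
    modes, labels = [], []
    for x in data:
        m = ascend_to_mode(x.copy(), data, h)
        for k, existing in enumerate(modes):
            if np.linalg.norm(m - existing) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return np.array(labels), np.array(modes)
```

Repeating the procedure with increasing bandwidths would give the kind of hierarchy the abstract mentions, since larger bandwidths produce fewer modes.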
4 0.2363475 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
Author: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, Dharmendra S. Modha
Abstract: Co-clustering, or simultaneous clustering of rows and columns of a two-dimensional data matrix, is rapidly becoming a powerful data analysis technique. Co-clustering has enjoyed wide success in varied application domains such as text clustering, gene-microarray analysis, natural language processing and image, speech and video analysis. In this paper, we introduce a partitional co-clustering formulation that is driven by the search for a good matrix approximation: every co-clustering is associated with an approximation of the original data matrix and the quality of co-clustering is determined by the approximation error. We allow the approximation error to be measured using a large class of loss functions called Bregman divergences that include squared Euclidean distance and KL-divergence as special cases. In addition, we permit multiple structurally different co-clustering schemes that preserve various linear statistics of the original data matrix. To accomplish the above tasks, we introduce a new minimum Bregman information (MBI) principle that simultaneously generalizes the maximum entropy and standard least squares principles, and leads to a matrix approximation that is optimal among all generalized additive models in a certain natural parameter space. Analysis based on this principle yields an elegant meta algorithm, special cases of which include most previously known alternate minimization based clustering algorithms such as k-means and co-clustering algorithms such as information theoretic (Dhillon et al., 2003b) and minimum sum-squared residue co-clustering (Cho et al., 2004). To demonstrate the generality and flexibility of our co-clustering framework, we provide examples and empirical evidence on a variety of problem domains and also describe novel co-clustering applications such as missing value prediction and
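As a toy illustration of the partitional view described above, the following is a minimal sketch of alternating row/column reassignment under squared Euclidean loss, where each matrix entry is approximated by its co-cluster (block) mean. This is only one special case of the Bregman framework in the abstract, not the authors' general meta-algorithm; the initialization, stopping rule, and function name are illustrative.

```python
import numpy as np

def coclus_sq(X, k_rows, k_cols, n_iter=50, seed=0):
    """Alternating co-clustering that approximates X[i, j] by its block mean
    and reassigns rows/columns to reduce squared error."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    r = rng.integers(k_rows, size=m)   # row-cluster labels
    c = rng.integers(k_cols, size=n)   # column-cluster labels
    for _ in range(n_iter):
        # Block means given the current labels.
        mu = np.zeros((k_rows, k_cols))
        for g in range(k_rows):
            for h in range(k_cols):
                block = X[np.ix_(r == g, c == h)]
                mu[g, h] = block.mean() if block.size else X.mean()
        # Reassign each row to the row-cluster whose block means fit it best.
        row_cost = np.stack([((X - mu[g][c]) ** 2).sum(axis=1) for g in range(k_rows)])
        r = row_cost.argmin(axis=0)
        # Reassign each column analogously.
        col_cost = np.stack([((X - mu[:, h][r][:, None]) ** 2).sum(axis=0) for h in range(k_cols)])
        c = col_cost.argmin(axis=0)
    return r, c, mu
```

Swapping the squared-error block update for another Bregman divergence and another set of preserved statistics is what the abstract's general formulation covers.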
5 0.23062143 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
Author: François Laviolette, Mario Marchand
Abstract: We propose a PAC-Bayes theorem for the sample-compression setting where each classifier is described by a compression subset of the training data and a message string of additional information. This setting, which is the appropriate one to describe many learning algorithms, strictly generalizes the usual data-independent setting where classifiers are represented only by data-independent message strings (or parameters taken from a continuous set). The proposed PAC-Bayes theorem for the sample-compression setting reduces to the PAC-Bayes theorem of Seeger (2002) and Langford (2005) when the compression subset of each classifier vanishes. For posteriors having all their weights on a single sample-compressed classifier, the general risk bound reduces to a bound similar to the tight sample-compression bound proposed in Laviolette et al. (2005). Finally, we extend our results to the case where each sample-compressed classifier of a data-dependent ensemble may abstain from predicting a class label. Keywords: PAC-Bayes, risk bounds, sample-compression, set covering machines, decision list machines
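For orientation, the data-independent bound that the abstract says its theorem reduces to can be evaluated numerically. The sketch below inverts a PAC-Bayes-kl inequality of the form kl(empirical risk || true risk) <= (KL(Q||P) + ln((n+1)/delta)) / n by bisection; the exact logarithmic term varies across statements of the Seeger/Langford result, so treat that constant as an assumption, and this is not the sample-compression bound proposed in the paper above.

```python
import math

def kl_bernoulli(q, p, eps=1e-12):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_kl_bound(emp_risk, kl_qp, n, delta=0.05, tol=1e-9):
    """Largest true risk consistent with kl(emp_risk || risk) <= rhs, by bisection."""
    rhs = (kl_qp + math.log((n + 1) / delta)) / n   # assumed ln((n+1)/delta) form
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

# Example: 10% empirical Gibbs risk, KL(Q||P) = 5 nats, 1000 examples.
print(round(pac_bayes_kl_bound(0.10, 5.0, 1000), 4))
```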
6 0.2285617 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
7 0.22846732 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
8 0.22786455 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
9 0.22734934 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
10 0.22607106 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
11 0.22498035 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
13 0.22384042 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
14 0.22187373 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
15 0.22148338 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
16 0.22134072 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
18 0.22113685 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
19 0.2203761 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
20 0.21912101 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections