jmlr jmlr2006 jmlr2006-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. [sent-10, score-0.195]
2 We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. [sent-11, score-0.275]
3 Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. [sent-12, score-0.232]
4 Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. [sent-14, score-0.27]
5 A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. [sent-15, score-0.334]
6 An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error. [sent-16, score-0.192]
7 In contrast, Fisher’s linear discriminant (Fisher, 1936) separates classes of data by selecting the features1 that maximise the ratio of projected class means to projected intraclass variances. [sent-23, score-0.363]
8 KFD shares many of the virtues of other kernel based algorithms: the appealing interpretation of a kernel as a mapping of an input to a high dimensional space and good performance in real life applications, among the most important. [sent-37, score-0.192]
9 However, it also suffers from some of the deficiencies of kernelised algorithms: the solution will typically include a regularisation coefficient to limit model complexity and parameter estimation will rely on some form of cross validation. [sent-38, score-0.174]
10 We build on the model in Section 4 by first applying priors over the direction of discrimination to develop a Bayesian Fisher discriminant and later we use a Gaussian process prior to reformulate the problem. [sent-42, score-0.294]
11 Section 7 details an EM-based algorithm for estimating the parameters of the model (kernel and regularisation coefficients) by optimising the marginal log likelihood. [sent-45, score-0.253]
12 Fisher’s Discriminant Analysis As mentioned above, discriminant analysis involves finding a vector of compounds w ∈ Rd×1 for which class separation will be maximised according to some defined statistic. [sent-49, score-0.235]
13 Hence, with some manipulation and the introduction of a couple of matrices we arrive at $J(\mathbf{w}) = \frac{\mathbf{w}^T \Sigma_B \mathbf{w}}{\mathbf{w}^T \Sigma_w \mathbf{w}}$ (4), where $\Sigma_B = (\mathbf{m}_1 - \mathbf{m}_0)(\mathbf{m}_1 - \mathbf{m}_0)^T$ and $\Sigma_w = \sum_{q\in\{0,1\}} \sum_{n=1}^{N_q} \big(\mathbf{x}_q^{(n)} - \mathbf{m}_q\big)\big(\mathbf{x}_q^{(n)} - \mathbf{m}_q\big)^T$ are the between- and within-class covariance matrices, respectively. [sent-58, score-0.139]
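The following is a minimal NumPy sketch of Rayleigh's coefficient and the classical maximising direction $\mathbf{w} \propto \Sigma_w^{-1}(\mathbf{m}_1 - \mathbf{m}_0)$; the function names, data arrays and toy values are illustrative only and do not come from the paper.

```python
import numpy as np

def rayleigh_coefficient(w, X0, X1):
    """J(w) = (w^T S_B w) / (w^T S_w w) for two classes."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_B = np.outer(m1 - m0, m1 - m0)                              # between-class covariance
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)       # within-class covariance
    return float(w @ S_B @ w) / float(w @ S_w @ w)

def fisher_direction(X0, X1):
    """Classical maximiser of J(w): w proportional to S_w^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    return np.linalg.solve(S_w, m1 - m0)

# Toy usage with made-up Gaussian clusters.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X1 = rng.normal(loc=[2.0, 1.0], size=(50, 2))
w = fisher_direction(X0, X1)
print(rayleigh_coefficient(w, X0, X1))
```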
14 A solution for the discriminant w can also be derived from geometric arguments. [sent-65, score-0.195]
15 Given a test point $\mathbf{x}_*$, the discriminant is a hyperplane $D(\mathbf{x}_*) = \mathbf{w}^T\mathbf{x}_* + b$ that outputs a number according to the class membership of the test point, where $b$ is a bias term. [sent-66, score-0.195]
16 The subsequent discussion will make use of this ‘average distance’ 0 1 constraint to reformulate the discriminant problem. [sent-73, score-0.195]
17 The set of variables $\mathcal{D} = (\mathbf{X}, \mathbf{y}) \in \mathbb{R}^{N\times(d+1)}$ is observed or instantiated, $\mathbf{f} \in \mathbb{R}^{N\times 1}$ is a dependent or latent variable and $\mathbf{t} \in \mathbb{R}^{N\times 1}$ is a vector of targets that have been observed as well. [sent-76, score-0.133]
18 From Section 2, we know that every observation in $\mathcal{D}$ is projected onto a single variate that ideally can take only two values, namely the projected class centres, while the variance around the projections is minimised. [sent-78, score-0.168]
19 We define the parameters c0 and c1 as the true class centres in the projected space. [sent-79, score-0.295]
20 The Noise Model. Figure 1 models the causal relationship between the observations $\mathcal{D}$ and the variables $\mathbf{f}$ and $\mathbf{t}$, such that the distribution $p(\mathbf{f}, \mathbf{t}|\mathcal{D})$ can be decomposed into the noise model $p(\mathbf{t}|\mathbf{y}, \mathbf{f})$ and the prior $p(\mathbf{f}|\mathbf{X})$, disregarding the parameter $\beta$. [sent-84, score-0.159]
21 Both variables $\mathbf{t}_0$ and $\mathbf{t}_1$ are discrete, with each of their elements being given by the class centres $c_0$ and $c_1$; nevertheless, we will make a Gaussian approximation such that every element $t_q^{(n)} \sim \mathcal{N}\big(f^{(n)}, \beta^{-1}\big)$. [sent-89, score-0.226]
22 Figure 1: The proposed graphical model for discriminant analysis (nodes: X, y0, y1, f, β, t0, t1; Equation 5, the factorised noise model, appears alongside the figure in the original). [sent-91, score-0.231]
23 The graph models the joint distribution over the latent variables f and the targets t = t0 ∪ t1 , which have been decomposed into their two possible types. [sent-92, score-0.133]
24 Disregarding the parameter β, the joint probability is factorised as p ( f, t| D ) = p ( t| y, f) p ( f| X), where the noise model is given by p ( t| y, f) and the prior by p ( f| X). [sent-93, score-0.187]
25 As it can be observed from both the figure and Equation 5, there is a conditional independence assumption on the observed targets given y and f; in other words, the noise model can be further 2. [sent-97, score-0.166]
26 We can substitute every element $t_q^{(n)}$ by its class centre $c_q$ and take the log of (5) to obtain $L(\mathbf{f}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y^{(n)}\big(c_1 - f^{(n)}\big)^2 + \big(1 - y^{(n)}\big)\big(c_0 - f^{(n)}\big)^2 \Big] + C$, (6) where $C = \frac{N}{2}\log\frac{\beta}{2\pi}$. [sent-100, score-0.145]
27 Note that the class centres can be made to coincide with the labels. [sent-101, score-0.179]
28 However, the values of the class centres $c_0$ and $c_1$ are not known, so $L$ can also be maximised w.r.t. [sent-107, score-0.179]
29 them to obtain $\hat{c}_q = \frac{1}{N_q}\sum_{n=1}^{N_q} y_q^{(n)} f^{(n)}$ for $q \in \{0, 1\}$. (9) [sent-110, score-0.173]
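As a small illustration of Equation 9, the most probable class centres are simply the per-class means of the projections; this sketch assumes NumPy and 0/1 labels, and the names are ours, not the paper's.

```python
import numpy as np

def class_centre_estimates(f, y):
    """Most probable class centres: the mean projection of each class.

    f : (N,) array of projections f_n = w^T x_n
    y : (N,) array of binary labels in {0, 1}
    Returns (c0_hat, c1_hat).
    """
    c1_hat = f[y == 1].mean()   # (1/N1) * sum of f_n over class 1
    c0_hat = f[y == 0].mean()   # (1/N0) * sum of f_n over class 0
    return c0_hat, c1_hat
```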
30 First, the vector of targets $\mathbf{t}$ is replaced by a new vector filled with the estimates $\hat{c}_q$, such that $\hat{\mathbf{t}} = \hat{\mathbf{t}}_0 \cup \hat{\mathbf{t}}_1$ is generated. [sent-128, score-0.168]
31 Figure 2: Partially modified graphical model for discriminant analysis (nodes: X, y0, y1, w, β, t̂0, t̂1). [sent-133, score-0.231]
32 The log of the noise model $p(\hat{\mathbf{t}}|\mathcal{D}, \mathbf{w})$ is expressed in Equation 12 while the prior $p(\mathbf{w})$ is specified in Section 4. [sent-136, score-0.159]
33 Furthermore, we look not only to parameterise the latent variables, but the class centres as well. [sent-137, score-0.242]
34 Equation 9 can be used for this purpose: substituting every $f^{(n)}$ in it with its parametric version $\mathbf{w}^T\mathbf{x}^{(n)}$ leads to $c_q = \frac{1}{N_q}\sum_{n=1}^{N_q} y_q^{(n)} \mathbf{w}^T\mathbf{x}^{(n)}$. [sent-138, score-0.199]
35 As will be seen in Section 5, most models make the assumption that class centres and class labels coincide, that is $c_q = y_q$, including the least squares support vector machine of Suykens and Vandewalle (1999). [sent-142, score-0.413]
36 However this approach is suboptimal because there is no guarantee that class centres should map perfectly with the labels. [sent-143, score-0.179]
37 As we saw above, by taking this step, the class centres can be parameterised as well. [sent-145, score-0.179]
38 Previously, the class centres were parameters which we knew beforehand were separated by some given distance. [sent-151, score-0.211]
39 We therefore introduce a Lagrange multiplier to force the projected class centres to lie at a distance $d$, leading to the following function $\Lambda(\mathbf{w}, \lambda) = -\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y^{(n)}\big(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 + \big(1 - y^{(n)}\big)\big(\mathbf{w}^T\mathbf{m}_0 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 \Big] + \lambda\big(\mathbf{w}^T(\mathbf{m}_0 - \mathbf{m}_1) - d\big) + C$. [sent-153, score-0.263]
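For readers who want the intermediate algebraic step, the stationarity condition of Λ can be sketched as below; this derivation is a plausible reconstruction of the standard argument and is not spelled out in this extract.

```latex
% Stationarity of the constrained objective (sketch):
% each squared term contributes (m_q - x^{(n)})(m_q - x^{(n)})^T w,
% so the data-dependent part collapses onto the within-class covariance.
\frac{\partial \Lambda}{\partial \mathbf{w}}
  = -\beta\,\Sigma_w \mathbf{w} + \lambda\,(\mathbf{m}_0 - \mathbf{m}_1) = \mathbf{0}
  \quad\Longrightarrow\quad
  \mathbf{w} = \frac{\lambda}{\beta}\,\Sigma_w^{-1}(\mathbf{m}_0 - \mathbf{m}_1).
```

The multiplier λ is then fixed by the constraint $\mathbf{w}^T(\mathbf{m}_0 - \mathbf{m}_1) = d$, which recovers a direction proportional to Fisher's discriminant, consistent with the equivalence claimed in the abstract.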
40 Furthermore, we were interested in parameterising not only the latent variables in the model but also the centres themselves. [sent-158, score-0.278]
41 Bayesian Formulation One of the aims of discriminant analysis is to determine the group membership of an input x outside the training set. [sent-164, score-0.195]
42 From a probabilistic perspective this process is only possible if a noise model and a prior distribution have been identified. [sent-165, score-0.159]
43 This section will show that the introduction of a separable Gaussian prior over w leads to a posterior distribution that is not enough to recover FLD’s solution. [sent-169, score-0.136]
44 Now what we seek is a distribution over this vector, which is obtained by combining the noise model with a prior distribution through Bayes’ rule, $p(\mathbf{w}|\hat{\mathbf{t}}, \mathcal{D}) = \frac{p(\hat{\mathbf{t}}|\mathcal{D}, \mathbf{w})\, p(\mathbf{w})}{p(\hat{\mathbf{t}}|\mathcal{D})}$, where we have used $\mathcal{D}$ to indicate the training set $(\mathbf{X}, \mathbf{y})$ and have omitted the dependence on $\beta$. [sent-174, score-0.159]
45 The combination of this prior with the parametric noise model of Equation 13 gives a posterior of the form $p(\mathbf{w}|\hat{\mathbf{t}}, \mathcal{D}) \propto \exp\Big\{-\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y^{(n)}\big(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 + \big(1 - y^{(n)}\big)\big(\mathbf{w}^T\mathbf{m}_0 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 \Big] + \dots\Big\}$. [sent-176, score-0.265]
46 Nonetheless, this formulation yields a consistent result if we consider that standard discriminant analysis exhibits a sign symmetry for the vector w, hence the average is zero. [sent-189, score-0.195]
47 We propose to use the posterior p ( w| D ) as the prior for a new model that is depicted in Figure 3. [sent-195, score-0.197]
48 As a partial result, the conditional distribution $p(\mathbf{w}|\mathcal{D}, d, \gamma)$ will be $\mathcal{N}(\mathbf{w}|\bar{\mathbf{w}}, \Sigma)$ with mean $\bar{\mathbf{w}} = \lim_{\gamma\to\infty} \gamma d\, \Sigma\, \Delta\mathbf{m}$ and covariance $\Sigma = \lim_{\gamma\to\infty}\big(\mathbf{B} + \gamma\, \Delta\mathbf{m}\, \Delta\mathbf{m}^T\big)^{-1}$. [sent-210, score-0.159]
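A numerical sketch of this limiting posterior can be obtained by simply plugging a large γ into the un-limited expressions; the function below is illustrative (B, Δm and d must be supplied by the caller) and is not the authors' implementation.

```python
import numpy as np

def constrained_posterior(B, delta_m, d, gamma=1e8):
    """Approximate the limiting posterior N(w_bar, Sigma) by using a large gamma.

    B       : (d, d) precision-like matrix of the unconstrained posterior
    delta_m : (d,) difference of class means, m0 - m1
    d       : scalar target separation between projected class centres
    """
    Sigma = np.linalg.inv(B + gamma * np.outer(delta_m, delta_m))
    w_bar = gamma * d * Sigma @ delta_m
    return w_bar, Sigma

# Illustrative usage on a toy 2-D problem (all values made up).
B = np.eye(2)
delta_m = np.array([-2.0, -1.0])
w_bar, Sigma = constrained_posterior(B, delta_m, d=2.0)
print(w_bar, np.diag(Sigma))
```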
49 This approach will lead to a regularised version of kernel Fisher’s discriminant and ultimately to an alternative strategy to select model parameters. [sent-227, score-0.402]
50 In GP’s (O’Hagan, 1978) a prior is placed directly over the latent variables such that a posterior distribution over them can be inferred. [sent-235, score-0.199]
51 1 P REDICTION OVER A T EST P OINT In order to adopt GP’s we need to go back to the formulation of the discriminant presented in Figure 1. [sent-243, score-0.195]
52 In this figure the graph models the joint distribution p ( f, t| D ) with the product of noise model p ( t| y, f) and prior p ( f| X). [sent-244, score-0.159]
53 Secondly, we will continue to work with the maximum likelihood estimates of the class centres, which were denoted cq . [sent-247, score-0.135]
54 Firstly, the modified noise model is defined in terms of f by applying the values of cq and rearranging, (see Appendix B). [sent-257, score-0.194]
55 After conditioning $f_*$ on both $\hat{\mathbf{t}}$ and $d$, the solution is a Gaussian of the form $p(f_*|\mathcal{D}, d, \gamma) \propto \exp\big\{-\frac{1}{2(\sigma_*)^2}\big(f_* - \bar{f}_*\big)^2\big\}$ with mean $\bar{f}_* = \lim_{\gamma\to\infty} -\gamma d\, (\sigma_*)^2\, \mathbf{c}^T\mathbf{Q}^{-1}\Delta\hat{\mathbf{y}}$. [sent-268, score-0.133]
56 The scheme proposed above will be termed Bayesian Fisher’s discriminant (BFD) to facilitate its referencing. [sent-277, score-0.195]
57 Relationship with Other Models There are several well known connections between discriminant analysis and other techniques. [sent-279, score-0.195]
58 In this section, however, we prefer to explore the connections of our approach to some algorithms that have been applied to machine learning problems, namely kernel Fisher’s discriminant and the least-squares and proximal support vector machines. [sent-282, score-0.32]
59 1 Kernel Fisher’s Discriminant The algorithm known as kernel Fisher’s discriminant consists of a two stage procedure. [sent-284, score-0.291]
60 The first consists of embedding the data space X into a possibly infinite dimensional reproducing kernel Hilbert space F via a kernel function k. [sent-285, score-0.192]
61 Nonetheless, the discriminant function can still be written as the rule $D(\mathbf{x}_*) = \sum_{i=1}^{N} \alpha^{(i)} k\big(\mathbf{x}_*, \mathbf{x}^{(i)}\big) + b$, with the coefficients $\alpha^{(i)}$ being obtained as the solution of maximizing a new form of the statistic $J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T \mathbf{M} \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}}$. [sent-295, score-0.223]
62 This implies that some form of regularisation will need to be applied when inverting N and this will generally be done by applying Nδ = N + δC, with C being the identity or the kernel matrices. [sent-298, score-0.209]
63 From the definition of $\mathbf{m}_q^{F}$, the difference $\mathbf{m}_0^{F} - \mathbf{m}_1^{F}$ can be written as $\mathbf{K}\Delta\hat{\mathbf{y}}$, with $\Delta\hat{\mathbf{y}} = N_0^{-1}\mathbf{y}_0 - N_1^{-1}\mathbf{y}_1$, and by regularising $\mathbf{N}$ with a multiple of the kernel matrix we obtain $\boldsymbol{\alpha}_{\mathrm{KFD}} \propto \big(\mathbf{K}\mathbf{L}\mathbf{K} + \beta^{-1}\mathbf{K}\big)^{-1}\mathbf{K}\Delta\hat{\mathbf{y}}$, where $\beta^{-1}$ is the regularisation coefficient. [sent-307, score-0.209]
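A compact NumPy sketch of these regularised coefficients and the resulting kernel discriminant is given below; the RBF kernel, its lengthscale, and the particular construction of L (as I − y1y1ᵀ/N1 − y0y0ᵀ/N0, consistent with the quadratic form fᵀLf used later in the appendix) are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Simple RBF kernel matrix between row-vector collections X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def kfd_alpha(X, y, beta=10.0, lengthscale=1.0):
    """alpha proportional to (K L K + beta^{-1} K)^{-1} K (y0/N0 - y1/N1)."""
    N = len(y)
    y1 = (y == 1).astype(float)
    y0 = (y == 0).astype(float)
    N1, N0 = y1.sum(), y0.sum()
    K = rbf_kernel(X, X, lengthscale)
    L = np.eye(N) - np.outer(y1, y1) / N1 - np.outer(y0, y0) / N0   # assumed form of L
    delta_y = y0 / N0 - y1 / N1
    return np.linalg.solve(K @ L @ K + K / beta, K @ delta_y)

def discriminant(x_star, X, alpha, b=0.0, lengthscale=1.0):
    """D(x*) = sum_i alpha_i k(x*, x_i) + b."""
    return rbf_kernel(np.atleast_2d(x_star), X, lengthscale) @ alpha + b
```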
64 As an additional insight, we observe that the coefficients αBFD have an equivalent αKFD if and only if KFD uses a regularisation based on a multiple of the kernel matrix. [sent-309, score-0.209]
65 LS-SVM’s have been related to ridge regression with modified targets, discriminant analysis in the feature space (KFD) and, as many other kernelised algorithms, to GP’s. [sent-315, score-0.233]
66 In this figure, the joint distribution over labels and parameters factorises as $p(\mathbf{y}, \mathbf{w}|\mathbf{X}) = p(\mathbf{y}|\mathbf{X}, \mathbf{w}) \times p(\mathbf{w})$, with noise model $p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \mathcal{N}\big(\mathbf{y}|\mathbf{X}\mathbf{w}, \zeta^{-1}\mathbf{I}\big)$ and prior $p(\mathbf{w}) = \mathcal{N}\big(\mathbf{w}|\mathbf{0}, \mu^{-1}\mathbf{I}\big)$. [sent-323, score-0.246]
67 Figure 4: LS-SVM noise model assumes a regularised least squares cost function (nodes: µ, X, w, ζ, y). [sent-324, score-0.205]
68 The model depicted can be interpreted as the joint distribution $p(\mathbf{y}, \mathbf{w}) = p(\mathbf{y}|\mathbf{X}, \mathbf{w})\, p(\mathbf{w})$, whereby the noise is Gaussian, $p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \mathcal{N}\big(\mathbf{y}|\mathbf{X}\mathbf{w}, \zeta^{-1}\mathbf{I}\big)$, as is the prior $p(\mathbf{w}) = \mathcal{N}\big(\mathbf{w}|\mathbf{0}, \mu^{-1}\mathbf{I}\big)$. [sent-325, score-0.184]
69 In this model the targets and the labels are the same t ≡ y. [sent-326, score-0.133]
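As a sketch of this 'regression to the labels' view, the MAP weights of the Gaussian model just described reduce to a ridge-regression solution; the code below is illustrative (bias term omitted) and is not taken verbatim from the LS-SVM literature.

```python
import numpy as np

def lssvm_style_map_weights(X, y, zeta=1.0, mu=0.1):
    """MAP weights for the 'regression to the labels' model:
    noise N(y | Xw, zeta^{-1} I) with prior N(w | 0, mu^{-1} I),
    giving w = (zeta X^T X + mu I)^{-1} zeta X^T y (a ridge solution)."""
    d = X.shape[1]
    return np.linalg.solve(zeta * X.T @ X + mu * np.eye(d), zeta * X.T @ y)
```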
70 However, it suffers from the fundamental misconception that the class labels ±1 have to coincide with the projected class centres $c_q$. [sent-329, score-0.388]
71 Ignoring the bias term, in P-SVM’s the joint distribution $p(\mathbf{y}, \mathbf{w}|\mathbf{X})$ is factorised according to the noise model $p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \mathcal{N}\big(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}\big)$ and the prior distribution $p(\mathbf{w}) = \mathcal{N}\big(\mathbf{w}|\mathbf{0}, \nu\sigma^2\mathbf{I}\big)$. [sent-349, score-0.187]
72 1 we saw that optimisation of the proposed noise model and that of Rayleigh’s coefficient give equivalent results. [sent-363, score-0.155]
73 In both cases the solution to the discriminant problem was given by adjusting the level of β. [sent-364, score-0.195]
74 From the point of view of projected data, the problem has shifted from computing the direction of discrimination to that of minimising the generalisation error through the adjustment of the variable β. [sent-371, score-0.163]
75 Hence placing a prior over it not only places a prior over the generalisation error but on 5. [sent-373, score-0.205]
76 The class centres have been denoted by cq with q ∈ {1, 0}. [sent-377, score-0.277]
77 The prior could also be used to bias β towards low or high generalisation errors if this is thought appropriate. [sent-384, score-0.142]
78 In the Bayesian formalism it is quite common to make use of the marginal likelihood for this purpose, therefore we look to optimise $L(\Theta_t) = \log p(\mathbf{t}|\mathcal{D}, \Theta_t)$ with respect to the model parameters $\Theta_t$. [sent-394, score-0.139]
79 Recall that we optimised the likelihood with respect to the parameters $c_0$ and $c_1$, leading to a new encoding of the targets $\hat{\mathbf{t}}_q = \frac{\mathbf{f}^T\mathbf{y}_q}{N_q}\,\mathbf{y}_q$. [sent-396, score-0.289]
80 As a consequence, the targets will shift whenever the kernel parameters are changed. [sent-399, score-0.198]
81 1 EM Algorithm We denote the parameters of the prior as Θk and the complete set of model parameters as Θt = {Θk , β}. [sent-407, score-0.163]
82 Thus our algorithm will be composed of the alternation of the following steps: E-step, given the current parameters $\Theta_t^{it}$, approximate the posterior with $q^{it}(\mathbf{f}) \propto \exp\big(-\frac{1}{2}\mathbf{f}^T \Sigma_p^{-1} \mathbf{f}\big)$, where $\Sigma_p = \big(\mathbf{K}^{-1} + \beta\mathbf{L}\big)^{-1}$. [sent-412, score-0.138]
83 Maximisation with respect to $\Theta_k$, the kernel parameters, cannot be done in closed form and has to rely on some optimisation routine, for example gradient descent; therefore it is necessary to specify the gradients of Equation 29 w.r.t. $\Theta_k$. [sent-415, score-0.155]
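The alternation can be sketched schematically as follows; the kernel function and the M-step update are left as placeholders because the exact gradients of Equation 29 are not reproduced in this extract, and none of the names below come from the paper.

```python
import numpy as np

def e_step(K, L, beta):
    """Posterior covariance of f for the current parameters:
    Sigma_p = (K^{-1} + beta * L)^{-1} (small jitter added for stability)."""
    N = K.shape[0]
    return np.linalg.inv(np.linalg.inv(K + 1e-8 * np.eye(N)) + beta * L)

def em_loop(X, y, kernel_fn, L, theta_k, beta, m_step_fn, n_iter=20):
    """Alternate the two steps described in the text.

    kernel_fn(X, theta_k) -> kernel matrix K
    m_step_fn(Sigma_p, X, y, theta_k, beta) -> (theta_k, beta)
        placeholder for the gradient-based maximisation of the bound
        (Equation 29 in the paper), which is not reproduced here.
    """
    for _ in range(n_iter):
        K = kernel_fn(X, theta_k)
        Sigma_p = e_step(K, L, beta)                              # E-step
        theta_k, beta = m_step_fn(Sigma_p, X, y, theta_k, beta)   # M-step
    return theta_k, beta
```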
84 Kernel and regularisation parameters for KFD and LS-SVM were obtained by 10-fold cross validation, whereas BFD related parameters were obtained by evidence maximisation. [sent-489, score-0.177]
85 Given that discriminant methods operate independently of the method of bias choice, we felt it more appropriate to use a bias independent measure like the area under the ROC curve (AUC). [sent-609, score-0.195]
86 The compared algorithms are: linear discriminant (LDA), quadratic discriminant (QDA), kernel Fisher’s discriminant (KFD), least squares support vector machine (LS-SVM) and Bayesian Fisher’s discriminant (BFD). [sent-782, score-0.91]
87 Incorporation of this type of prior performs feature selection by assigning very high weights to some of the posterior values of the hyperparameters and hence pruning out features (see MacKay, 1995). [sent-786, score-0.136]
88 The EM algorithm we proposed is slower to converge than direct optimisation of the marginal likelihood as can be applied to the LS-SVM. [sent-790, score-0.13]
89 Conclusions and Future Work We have presented a Bayesian probabilistic approach to discriminant analysis that can correspond to kernel Fisher’s discriminant. [sent-914, score-0.291]
90 Regularisation of the discriminant arises naturally in the proposed framework and through maximisation of the marginal likelihood we were able to determine kernel and regularisation parameters. (Figure panels: Twonorm train set, Waveform train set.) [sent-915, score-0.447]
91 Therefore, after conditioning on $d$, the resulting distribution will be Gaussian with the form $p(\mathbf{w}|\mathcal{D}, d, \gamma) = \lim_{\gamma\to\infty}\mathcal{N}(\bar{\mathbf{w}}, \Sigma)$ and parameters $\bar{\mathbf{w}} = \lim_{\gamma\to\infty}\gamma d\, \Sigma\, \Delta\mathbf{m}$ (34) and $\Sigma = \lim_{\gamma\to\infty}\big(\mathbf{B} + \gamma\, \Delta\mathbf{m}\, \Delta\mathbf{m}^T\big)^{-1}$. [sent-962, score-0.264]
92 Expressing $p(\hat{\mathbf{t}}|\mathbf{f}, \mathbf{y})$ in terms of $\mathbf{f}$ and $\mathbf{L}$: disregarding an additive constant, the log of the modified noise model $p(\hat{\mathbf{t}}|\mathbf{f}, \mathbf{y})$ is $L(\hat{c}_0, \hat{c}_1) = -\frac{\beta}{2}\sum_{n=1}^{N}\big[ y_n(\hat{c}_1 - f_n)^2 + (1 - y_n)(\hat{c}_0 - f_n)^2 \big]$, where we have used Equation 6 as base. [sent-967, score-0.212]
93 From Equation 9, we substitute the values of each estimate $\hat{c}_q$ so that $L = -\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y_n\big(\tfrac{1}{N_1}\mathbf{y}_1^T\mathbf{f} - f_n\big)^2 + (1 - y_n)\big(\tfrac{1}{N_0}\mathbf{y}_0^T\mathbf{f} - f_n\big)^2 \Big] = -\frac{\beta}{2}\Big[\mathbf{f}^T\mathbf{f} - \tfrac{1}{N_1}\mathbf{y}_1^T\mathbf{f}\mathbf{f}^T\mathbf{y}_1 - \tfrac{1}{N_0}\mathbf{y}_0^T\mathbf{f}\mathbf{f}^T\mathbf{y}_0\Big] = -\frac{\beta}{2}\mathbf{f}^T\mathbf{L}\mathbf{f}$. [sent-968, score-0.214]
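The last equality holds for $\mathbf{L} = \mathbf{I} - N_1^{-1}\mathbf{y}_1\mathbf{y}_1^T - N_0^{-1}\mathbf{y}_0\mathbf{y}_0^T$; this form of L is our assumption, and the short NumPy check below verifies the identity numerically on a toy example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
y = np.array([1, 1, 1, 0, 0, 0])
f = rng.normal(size=N)

y1 = (y == 1).astype(float); N1 = y1.sum()
y0 = (y == 0).astype(float); N0 = y0.sum()

# Left-hand side: squared deviations of f from its per-class means.
lhs = sum(y[n] * (y1 @ f / N1 - f[n])**2 + (1 - y[n]) * (y0 @ f / N0 - f[n])**2
          for n in range(N))

# Right-hand side: the quadratic form f^T L f with the assumed L.
L = np.eye(N) - np.outer(y1, y1) / N1 - np.outer(y0, y0) / N0
rhs = f @ L @ f

print(np.isclose(lhs, rhs))  # expected: True
```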
94 In the model we have three types of distributions: $p(\hat{\mathbf{t}}|\mathbf{y}, \mathbf{f})$, the noise model; $p(\mathbf{f}_+)$, an extended GP prior that includes the point $f_*$; and the constraint $p(d|\mathbf{y}, \mathbf{f}, \gamma)$. [sent-971, score-0.159]
95 Completing the squares on $f_*$ gives a Gaussian $p(f_*|\mathcal{D}, d, \gamma) \propto \exp\big\{-\frac{1}{2(\sigma_*)^2}\big(f_* - \bar{f}_*\big)^2\big\}$, where the variance $(\sigma_*)^2$ is given by Equation 40 and the mean is $\bar{f}_* = -\lim_{\gamma\to\infty}\gamma d\,(\sigma_*)^2\,\mathbf{c}^T\mathbf{Q}^{-1}\Delta\hat{\mathbf{y}}$. [sent-990, score-0.133]
96 Then combining it with a gamma prior $\mathcal{G}(\beta|a, b)$ gives a posterior of the form $\mathcal{G}\big(\beta\,\big|\,N/2 + a,\; V/2 + b\big)$, that is $p(\beta|\hat{\mathbf{t}}, \mathbf{f}) \propto \beta^{N/2 + a - 1}\exp\big\{-\beta\big(\tfrac{V}{2} + b\big)\big\}$. [sent-1012, score-0.169]
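A small sketch of this conjugate update using SciPy follows; taking V to be the squared-error statistic of the Gaussian noise model is our assumption, since its exact definition is not visible in this extract.

```python
import numpy as np
from scipy.stats import gamma

def beta_posterior(f, t_hat, a=1e-3, b=1e-3):
    """Conjugate gamma posterior over the noise precision beta.

    Assumes V = sum_n (t_hat_n - f_n)^2; shape = N/2 + a, rate = V/2 + b.
    """
    N = len(f)
    V = np.sum((t_hat - f) ** 2)
    shape, rate = N / 2 + a, V / 2 + b
    return gamma(a=shape, scale=1.0 / rate)   # SciPy parameterises by scale = 1/rate

# The posterior mean of beta equals shape / rate.
```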
97 Use Equation 6 and substitute the class centres cq by their estimates cq . [sent-1017, score-0.375]
98 The initialisations selected ranged from 1 × 10−4 to 1 × 104 ; initialisations that produced numerical errors were ignored. [sent-1039, score-0.132]
99 Estimating a kernel Fisher discriminant in the presence of label noise. [sent-1148, score-0.291]
100 Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel discriminant analysis. [sent-1242, score-0.325]
wordName wordTfidf (topN-words)
[('bfd', 0.583), ('kfd', 0.319), ('discriminant', 0.195), ('centres', 0.179), ('awrence', 0.169), ('enteno', 0.169), ('ptimizing', 0.169), ('wt', 0.152), ('ard', 0.152), ('regularisation', 0.113), ('fld', 0.103), ('fisher', 0.1), ('cq', 0.098), ('kernel', 0.096), ('nq', 0.096), ('mf', 0.094), ('spiral', 0.094), ('twonorm', 0.086), ('bumpy', 0.085), ('maximisation', 0.085), ('projected', 0.084), ('rayleigh', 0.08), ('generalisation', 0.079), ('ft', 0.077), ('gp', 0.075), ('regularised', 0.075), ('yq', 0.075), ('mt', 0.073), ('posterior', 0.073), ('kt', 0.072), ('targets', 0.07), ('lim', 0.066), ('initialisations', 0.066), ('prior', 0.063), ('latent', 0.063), ('noise', 0.06), ('optimisation', 0.059), ('ka', 0.059), ('ringnorm', 0.057), ('nx', 0.057), ('mq', 0.056), ('xw', 0.056), ('gestel', 0.056), ('klk', 0.056), ('suykens', 0.055), ('mika', 0.049), ('rbf', 0.048), ('qda', 0.047), ('tq', 0.047), ('waveform', 0.046), ('overlap', 0.045), ('equation', 0.042), ('compounds', 0.04), ('splice', 0.04), ('neil', 0.04), ('lda', 0.039), ('ridge', 0.038), ('bart', 0.038), ('johan', 0.038), ('joos', 0.038), ('maximising', 0.038), ('optimising', 0.038), ('pelckmans', 0.038), ('likelihood', 0.037), ('model', 0.036), ('bayesian', 0.036), ('conditioning', 0.034), ('marginal', 0.034), ('squares', 0.034), ('exp', 0.033), ('parameters', 0.032), ('coef', 0.032), ('gaussian', 0.032), ('relevance', 0.031), ('roc', 0.031), ('fn', 0.031), ('toy', 0.03), ('tsch', 0.029), ('auc', 0.029), ('proximal', 0.029), ('baudat', 0.028), ('dkt', 0.028), ('factorised', 0.028), ('factorises', 0.028), ('lf', 0.028), ('realisations', 0.028), ('tony', 0.028), ('statistic', 0.028), ('ct', 0.028), ('yn', 0.027), ('labels', 0.027), ('covariance', 0.027), ('array', 0.027), ('substituting', 0.026), ('banana', 0.026), ('meanwhile', 0.026), ('shaded', 0.026), ('limit', 0.025), ('van', 0.025), ('depicted', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
2 0.11412739 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
3 0.10660606 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
4 0.068224043 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
5 0.05755328 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
Author: Paul W. Goldberg
Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification
6 0.054133546 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
7 0.051101644 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
9 0.049682438 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
10 0.04841391 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
11 0.047865406 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
12 0.043165553 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
13 0.042191599 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
14 0.041791268 54 jmlr-2006-Learning the Structure of Linear Latent Variable Models
15 0.039795503 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
18 0.037100941 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
19 0.037001852 49 jmlr-2006-Learning Parts-Based Representations of Data
20 0.036416922 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
topicId topicWeight
[(0, 0.206), (1, 0.072), (2, 0.116), (3, 0.012), (4, -0.12), (5, -0.066), (6, -0.024), (7, -0.073), (8, 0.048), (9, -0.027), (10, 0.168), (11, -0.035), (12, -0.036), (13, -0.033), (14, -0.067), (15, 0.017), (16, -0.047), (17, 0.118), (18, -0.131), (19, -0.156), (20, 0.276), (21, -0.06), (22, -0.131), (23, 0.051), (24, 0.125), (25, -0.088), (26, -0.099), (27, -0.149), (28, 0.089), (29, -0.087), (30, 0.125), (31, 0.054), (32, -0.13), (33, -0.225), (34, -0.039), (35, 0.011), (36, 0.084), (37, 0.176), (38, 0.129), (39, -0.048), (40, -0.128), (41, -0.06), (42, 0.068), (43, 0.164), (44, -0.022), (45, 0.062), (46, -0.069), (47, 0.148), (48, -0.061), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.90749651 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
2 0.48285419 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
3 0.41581994 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
4 0.35243499 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
Author: Jieping Ye, Tao Xiong
Abstract: Dimensionality reduction is an important pre-processing step in many applications. Linear discriminant analysis (LDA) is a classical statistical approach for supervised dimensionality reduction. It aims to maximize the ratio of the between-class distance to the within-class distance, thus maximizing the class discrimination. It has been used widely in many applications. However, the classical LDA formulation requires the nonsingularity of the scatter matrices involved. For undersampled problems, where the data dimensionality is much larger than the sample size, all scatter matrices are singular and classical LDA fails. Many extensions, including null space LDA (NLDA) and orthogonal LDA (OLDA), have been proposed in the past to overcome this problem. NLDA aims to maximize the between-class distance in the null space of the within-class scatter matrix, while OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. They have been applied successfully in various applications. In this paper, we present a computational and theoretical analysis of NLDA and OLDA. Our main result shows that under a mild condition which holds in many applications involving highdimensional data, NLDA is equivalent to OLDA. We have performed extensive experiments on various types of data and results are consistent with our theoretical analysis. We further apply the regularization to OLDA. The algorithm is called regularized OLDA (or ROLDA for short). An efficient algorithm is presented to estimate the regularization value in ROLDA. A comparative study on classification shows that ROLDA is very competitive with OLDA. This confirms the effectiveness of the regularization in ROLDA. Keywords: linear discriminant analysis, dimensionality reduction, null space, orthogonal matrix, regularization
5 0.33359584 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
7 0.27854863 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
8 0.27558362 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
9 0.24617881 49 jmlr-2006-Learning Parts-Based Representations of Data
10 0.217337 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
11 0.21678168 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
12 0.21261932 54 jmlr-2006-Learning the Structure of Linear Latent Variable Models
14 0.21052051 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
15 0.20700341 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting (Special Topic on Inductive Programming)
16 0.19427465 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
18 0.18357351 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
19 0.18122438 93 jmlr-2006-Universal Kernels
topicId topicWeight
[(8, 0.496), (36, 0.069), (45, 0.013), (50, 0.035), (63, 0.038), (68, 0.015), (76, 0.012), (78, 0.015), (79, 0.016), (81, 0.035), (84, 0.028), (90, 0.044), (91, 0.029), (96, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.86134183 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
2 0.8186087 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “nonGaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low dimensional nonGaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimaton error tends to zero at a parametric rate, uniformly over the family, Numerical examples demonstrate the usefulness of our method.
3 0.39641955 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
4 0.36832476 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.
5 0.34928674 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis
6 0.34696937 49 jmlr-2006-Learning Parts-Based Representations of Data
7 0.34537637 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
8 0.3396126 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
9 0.33835244 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
10 0.33633995 47 jmlr-2006-Learning Image Components for Object Recognition
11 0.33454475 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.32995582 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.31862336 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
15 0.31142378 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
17 0.30706322 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
18 0.30701381 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
19 0.30398324 83 jmlr-2006-Sparse Boosting
20 0.3037858 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann