jmlr jmlr2006 jmlr2006-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. [sent-10, score-0.195]
2 We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. [sent-11, score-0.275]
3 Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. [sent-12, score-0.232]
4 Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. [sent-14, score-0.27]
5 A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. [sent-15, score-0.334]
6 An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error. [sent-16, score-0.192]
7 In contrast, Fisher’s linear discriminant (Fisher, 1936) separates classes of data by selecting the features1 that maximise the ratio of projected class means to projected intraclass variances. [sent-23, score-0.363]
8 KFD shares many of the virtues of other kernel based algorithms: the appealing interpretation of a kernel as a mapping of an input to a high dimensional space and good performance in real life applications, among the most important. [sent-37, score-0.192]
9 However, it also suffers from some of the deficiencies of kernelised algorithms: the solution will typically include a regularisation coefficient to limit model complexity and parameter estimation will rely on some form of cross validation. [sent-38, score-0.174]
10 We build on the model in Section 4 by first applying priors over the direction of discrimination to develop a Bayesian Fisher discriminant and later we use a Gaussian process prior to reformulate the problem. [sent-42, score-0.294]
11 Section 7 details an EM-based algorithm for estimating the parameters of the model (kernel and regularisation coefficients) by optimising the marginal log likelihood. [sent-45, score-0.253]
12 Fisher’s Discriminant Analysis As mentioned above, discriminant analysis involves finding a vector of compounds w ∈ Rd×1 for which class separation will be maximised according to some defined statistic. [sent-49, score-0.235]
13 Hence, with some manipulation and the introduction of a couple of matrices we arrive at $J(\mathbf{w}) = \frac{\mathbf{w}^T \Sigma_B \mathbf{w}}{\mathbf{w}^T \Sigma_w \mathbf{w}}$ (4), where $\Sigma_B = (\mathbf{m}_1 - \mathbf{m}_0)(\mathbf{m}_1 - \mathbf{m}_0)^T$ and $\Sigma_w = \sum_{q\in\{0,1\}} \sum_{n=1}^{N_q} \big(\mathbf{x}_q^{(n)} - \mathbf{m}_q\big)\big(\mathbf{x}_q^{(n)} - \mathbf{m}_q\big)^T$ are the between- and within-class covariance matrices, respectively. [sent-58, score-0.139]
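The following is a minimal NumPy sketch of Rayleigh's coefficient and the classical maximising direction $\mathbf{w} \propto \Sigma_w^{-1}(\mathbf{m}_1 - \mathbf{m}_0)$; the function names, data arrays and toy values are illustrative only and do not come from the paper.

```python
import numpy as np

def rayleigh_coefficient(w, X0, X1):
    """J(w) = (w^T S_B w) / (w^T S_w w) for two classes."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_B = np.outer(m1 - m0, m1 - m0)                              # between-class covariance
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)       # within-class covariance
    return float(w @ S_B @ w) / float(w @ S_w @ w)

def fisher_direction(X0, X1):
    """Classical maximiser of J(w): w proportional to S_w^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    return np.linalg.solve(S_w, m1 - m0)

# Toy usage with made-up Gaussian clusters.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X1 = rng.normal(loc=[2.0, 1.0], size=(50, 2))
w = fisher_direction(X0, X1)
print(rayleigh_coefficient(w, X0, X1))
```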
14 A solution for the discriminant w can also be derived from geometric arguments. [sent-65, score-0.195]
15 Given a test point $\mathbf{x}_*$, the discriminant is a hyperplane $D(\mathbf{x}_*) = \mathbf{w}^T\mathbf{x}_* + b$ that outputs a number according to the class membership of the test point, where $b$ is a bias term. [sent-66, score-0.195]
16 The subsequent discussion will make use of this ‘average distance’ 0 1 constraint to reformulate the discriminant problem. [sent-73, score-0.195]
17 The set of variables $\mathcal{D} = (\mathbf{X}, \mathbf{y}) \in \mathbb{R}^{N\times(d+1)}$ is observed or instantiated, $\mathbf{f} \in \mathbb{R}^{N\times 1}$ is a dependent or latent variable and $\mathbf{t} \in \mathbb{R}^{N\times 1}$ is a vector of targets that have been observed as well. [sent-76, score-0.133]
18 From Section 2, we know that every observation in $\mathcal{D}$ is projected onto a single variate that ideally can take only two values, namely the projected class centres, while the variance around the projections is minimised. [sent-78, score-0.168]
19 We define the parameters c0 and c1 as the true class centres in the projected space. [sent-79, score-0.295]
20 The Noise Model. Figure 1 models the causal relationship between the observations $\mathcal{D}$ and the variables $\mathbf{f}$ and $\mathbf{t}$, such that the distribution $p(\mathbf{f}, \mathbf{t}|\mathcal{D})$ can be decomposed into the noise model $p(\mathbf{t}|\mathbf{y}, \mathbf{f})$ and the prior $p(\mathbf{f}|\mathbf{X})$, disregarding the parameter $\beta$. [sent-84, score-0.159]
21 Both variables $\mathbf{t}_0$ and $\mathbf{t}_1$ are discrete, with each of their elements being given by the class centres $c_0$ and $c_1$; nevertheless, we will make a Gaussian approximation such that every element $t_q^{(n)} \sim \mathcal{N}\big(f^{(n)}, \beta^{-1}\big)$. [sent-89, score-0.226]
22 Figure 1: The proposed graphical model for discriminant analysis (nodes: X, y0, y1, f, β, t0, t1; Equation 5, the factorised noise model, appears alongside the figure in the original). [sent-91, score-0.231]
23 The graph models the joint distribution over the latent variables f and the targets t = t0 ∪ t1 , which have been decomposed into their two possible types. [sent-92, score-0.133]
24 Disregarding the parameter β, the joint probability is factorised as p ( f, t| D ) = p ( t| y, f) p ( f| X), where the noise model is given by p ( t| y, f) and the prior by p ( f| X). [sent-93, score-0.187]
25 As it can be observed from both the figure and Equation 5, there is a conditional independence assumption on the observed targets given y and f; in other words, the noise model can be further 2. [sent-97, score-0.166]
26 We can substitute every element $t_q^{(n)}$ by its class centre $c_q$ and take the log of (5) to obtain $L(\mathbf{f}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y^{(n)}\big(c_1 - f^{(n)}\big)^2 + \big(1 - y^{(n)}\big)\big(c_0 - f^{(n)}\big)^2 \Big] + C$, (6) where $C = \frac{N}{2}\log\frac{\beta}{2\pi}$. [sent-100, score-0.145]
27 Note that the class centres can be made to coincide with the labels. [sent-101, score-0.179]
28 However, the values of the class centres $c_0$ and $c_1$ are not known, so $L$ can also be maximised w.r.t. [sent-107, score-0.179]
29 them to obtain $\hat{c}_q = \frac{1}{N_q}\sum_{n=1}^{N_q} y_q^{(n)} f^{(n)}$ for $q \in \{0, 1\}$. (9) [sent-110, score-0.173]
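As a small illustration of Equation 9, the most probable class centres are simply the per-class means of the projections; this sketch assumes NumPy and 0/1 labels, and the names are ours, not the paper's.

```python
import numpy as np

def class_centre_estimates(f, y):
    """Most probable class centres: the mean projection of each class.

    f : (N,) array of projections f_n = w^T x_n
    y : (N,) array of binary labels in {0, 1}
    Returns (c0_hat, c1_hat).
    """
    c1_hat = f[y == 1].mean()   # (1/N1) * sum of f_n over class 1
    c0_hat = f[y == 0].mean()   # (1/N0) * sum of f_n over class 0
    return c0_hat, c1_hat
```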
30 First, the vector of targets $\mathbf{t}$ is replaced by a new vector filled with the estimates $\hat{c}_q$, such that $\hat{\mathbf{t}} = \hat{\mathbf{t}}_0 \cup \hat{\mathbf{t}}_1$ is generated. [sent-128, score-0.168]
31 Figure 2: Partially modified graphical model for discriminant analysis (nodes: X, y0, y1, w, β, t̂0, t̂1). [sent-133, score-0.231]
32 The log of the noise model $p(\hat{\mathbf{t}}|\mathcal{D}, \mathbf{w})$ is expressed in Equation 12 while the prior $p(\mathbf{w})$ is specified in Section 4. [sent-136, score-0.159]
33 Furthermore, we look not only to parameterise the latent variables, but the class centres as well. [sent-137, score-0.242]
34 Equation 9 can be used for this purpose: substituting every $f^{(n)}$ in it with its parametric version $\mathbf{w}^T\mathbf{x}^{(n)}$ leads to $c_q = \frac{1}{N_q}\sum_{n=1}^{N_q} y_q^{(n)} \mathbf{w}^T\mathbf{x}^{(n)}$. [sent-138, score-0.199]
35 As will be seen in Section 5, most models make the assumption that class centres and class labels coincide, that is $c_q = y_q$, including the least squares support vector machine of Suykens and Vandewalle (1999). [sent-142, score-0.413]
36 However this approach is suboptimal because there is no guarantee that class centres should map perfectly with the labels. [sent-143, score-0.179]
37 As we saw above, by taking this step, the class centres can be parameterised as well. [sent-145, score-0.179]
38 Previously, the class centres were parameters which we knew beforehand were separated by some given distance. [sent-151, score-0.211]
39 We therefore introduce a Lagrange multiplier to force the projected class centres to lie at a distance $d$, leading to the following function $\Lambda(\mathbf{w}, \lambda) = -\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y^{(n)}\big(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 + \big(1 - y^{(n)}\big)\big(\mathbf{w}^T\mathbf{m}_0 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 \Big] + \lambda\big(\mathbf{w}^T(\mathbf{m}_0 - \mathbf{m}_1) - d\big) + C$. [sent-153, score-0.263]
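For readers who want the intermediate algebraic step, the stationarity condition of Λ can be sketched as below; this derivation is a plausible reconstruction of the standard argument and is not spelled out in this extract.

```latex
% Stationarity of the constrained objective (sketch):
% each squared term contributes (m_q - x^{(n)})(m_q - x^{(n)})^T w,
% so the data-dependent part collapses onto the within-class covariance.
\frac{\partial \Lambda}{\partial \mathbf{w}}
  = -\beta\,\Sigma_w \mathbf{w} + \lambda\,(\mathbf{m}_0 - \mathbf{m}_1) = \mathbf{0}
  \quad\Longrightarrow\quad
  \mathbf{w} = \frac{\lambda}{\beta}\,\Sigma_w^{-1}(\mathbf{m}_0 - \mathbf{m}_1).
```

The multiplier λ is then fixed by the constraint $\mathbf{w}^T(\mathbf{m}_0 - \mathbf{m}_1) = d$, which recovers a direction proportional to Fisher's discriminant, consistent with the equivalence claimed in the abstract.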
40 Furthermore, we were interested in parameterising not only the latent variables in the model but also the centres themselves. [sent-158, score-0.278]
41 Bayesian Formulation One of the aims of discriminant analysis is to determine the group membership of an input x outside the training set. [sent-164, score-0.195]
42 From a probabilistic perspective this process is only possible if a noise model and a prior distribution have been identified. [sent-165, score-0.159]
43 This section will show that the introduction of a separable Gaussian prior over w leads to a posterior distribution that is not enough to recover FLD’s solution. [sent-169, score-0.136]
44 Now what we seek is a distribution over this vector, which is obtained by combining the noise model with a prior distribution through Bayes’ rule, $p(\mathbf{w}|\hat{\mathbf{t}}, \mathcal{D}) = \frac{p(\hat{\mathbf{t}}|\mathcal{D}, \mathbf{w})\, p(\mathbf{w})}{p(\hat{\mathbf{t}}|\mathcal{D})}$, where we have used $\mathcal{D}$ to indicate the training set $(\mathbf{X}, \mathbf{y})$ and have omitted the dependence on $\beta$. [sent-174, score-0.159]
45 The combination of this prior with the parametric noise model of Equation 13 gives a posterior of the form $p(\mathbf{w}|\hat{\mathbf{t}}, \mathcal{D}) \propto \exp\Big\{-\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y^{(n)}\big(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 + \big(1 - y^{(n)}\big)\big(\mathbf{w}^T\mathbf{m}_0 - \mathbf{w}^T\mathbf{x}^{(n)}\big)^2 \Big] + \dots\Big\}$. [sent-176, score-0.265]
46 Nonetheless, this formulation yields a consistent result if we consider that standard discriminant analysis exhibits a sign symmetry for the vector w, hence the average is zero. [sent-189, score-0.195]
47 We propose to use the posterior p ( w| D ) as the prior for a new model that is depicted in Figure 3. [sent-195, score-0.197]
48 As a partial result, the conditional distribution $p(\mathbf{w}|\mathcal{D}, d, \gamma)$ will be $\mathcal{N}(\mathbf{w}|\bar{\mathbf{w}}, \Sigma)$ with mean $\bar{\mathbf{w}} = \lim_{\gamma\to\infty} \gamma d\, \Sigma\, \Delta\mathbf{m}$ and covariance $\Sigma = \lim_{\gamma\to\infty}\big(\mathbf{B} + \gamma\, \Delta\mathbf{m}\, \Delta\mathbf{m}^T\big)^{-1}$. [sent-210, score-0.159]
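A numerical sketch of this limiting posterior can be obtained by simply plugging a large γ into the un-limited expressions; the function below is illustrative (B, Δm and d must be supplied by the caller) and is not the authors' implementation.

```python
import numpy as np

def constrained_posterior(B, delta_m, d, gamma=1e8):
    """Approximate the limiting posterior N(w_bar, Sigma) by using a large gamma.

    B       : (d, d) precision-like matrix of the unconstrained posterior
    delta_m : (d,) difference of class means, m0 - m1
    d       : scalar target separation between projected class centres
    """
    Sigma = np.linalg.inv(B + gamma * np.outer(delta_m, delta_m))
    w_bar = gamma * d * Sigma @ delta_m
    return w_bar, Sigma

# Illustrative usage on a toy 2-D problem (all values made up).
B = np.eye(2)
delta_m = np.array([-2.0, -1.0])
w_bar, Sigma = constrained_posterior(B, delta_m, d=2.0)
print(w_bar, np.diag(Sigma))
```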
49 This approach will lead to a regularised version of kernel Fisher’s discriminant and ultimately to an alternative strategy to select model parameters. [sent-227, score-0.402]
50 In GP’s (O’Hagan, 1978) a prior is placed directly over the latent variables such that a posterior distribution over them can be inferred. [sent-235, score-0.199]
51 1 P REDICTION OVER A T EST P OINT In order to adopt GP’s we need to go back to the formulation of the discriminant presented in Figure 1. [sent-243, score-0.195]
52 In this figure the graph models the joint distribution p ( f, t| D ) with the product of noise model p ( t| y, f) and prior p ( f| X). [sent-244, score-0.159]
53 Secondly, we will continue to work with the maximum likelihood estimates of the class centres, which were denoted cq . [sent-247, score-0.135]
54 Firstly, the modified noise model is defined in terms of f by applying the values of cq and rearranging, (see Appendix B). [sent-257, score-0.194]
55 After conditioning $f_*$ on both $\hat{\mathbf{t}}$ and $d$, the solution is a Gaussian of the form $p(f_*|\mathcal{D}, d, \gamma) \propto \exp\big\{-\frac{1}{2(\sigma_*)^2}\big(f_* - \bar{f}_*\big)^2\big\}$ with mean $\bar{f}_* = \lim_{\gamma\to\infty} -\gamma d\, (\sigma_*)^2\, \mathbf{c}^T\mathbf{Q}^{-1}\Delta\hat{\mathbf{y}}$. [sent-268, score-0.133]
56 The scheme proposed above will be termed Bayesian Fisher’s discriminant (BFD) to facilitate its referencing. [sent-277, score-0.195]
57 Relationship with Other Models There are several well known connections between discriminant analysis and other techniques. [sent-279, score-0.195]
58 In this section, however, we prefer to explore the connections of our approach to some algorithms that have been applied to machine learning problems, namely kernel Fisher’s discriminant and the least-squares and proximal support vector machines. [sent-282, score-0.32]
59 1 Kernel Fisher’s Discriminant The algorithm known as kernel Fisher’s discriminant consists of a two stage procedure. [sent-284, score-0.291]
60 The first consists of embedding the data space X into a possibly infinite dimensional reproducing kernel Hilbert space F via a kernel function k. [sent-285, score-0.192]
61 Nonetheless, the discriminant function can still be written as the rule $D(\mathbf{x}_*) = \sum_{i=1}^{N} \alpha^{(i)} k\big(\mathbf{x}_*, \mathbf{x}^{(i)}\big) + b$, with the coefficients $\alpha^{(i)}$ being obtained as the solution of maximizing a new form of the statistic $J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T \mathbf{M} \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T \mathbf{N} \boldsymbol{\alpha}}$. [sent-295, score-0.223]
62 This implies that some form of regularisation will need to be applied when inverting N and this will generally be done by applying Nδ = N + δC, with C being the identity or the kernel matrices. [sent-298, score-0.209]
63 From the definition of $\mathbf{m}_q^{F}$, the difference $\mathbf{m}_0^{F} - \mathbf{m}_1^{F}$ can be written as $\mathbf{K}\Delta\hat{\mathbf{y}}$, with $\Delta\hat{\mathbf{y}} = N_0^{-1}\mathbf{y}_0 - N_1^{-1}\mathbf{y}_1$, and by regularising $\mathbf{N}$ with a multiple of the kernel matrix we obtain $\boldsymbol{\alpha}_{\mathrm{KFD}} \propto \big(\mathbf{K}\mathbf{L}\mathbf{K} + \beta^{-1}\mathbf{K}\big)^{-1}\mathbf{K}\Delta\hat{\mathbf{y}}$, where $\beta^{-1}$ is the regularisation coefficient. [sent-307, score-0.209]
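A compact NumPy sketch of these regularised coefficients and the resulting kernel discriminant is given below; the RBF kernel, its lengthscale, and the particular construction of L (as I − y1y1ᵀ/N1 − y0y0ᵀ/N0, consistent with the quadratic form fᵀLf used later in the appendix) are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Simple RBF kernel matrix between row-vector collections X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def kfd_alpha(X, y, beta=10.0, lengthscale=1.0):
    """alpha proportional to (K L K + beta^{-1} K)^{-1} K (y0/N0 - y1/N1)."""
    N = len(y)
    y1 = (y == 1).astype(float)
    y0 = (y == 0).astype(float)
    N1, N0 = y1.sum(), y0.sum()
    K = rbf_kernel(X, X, lengthscale)
    L = np.eye(N) - np.outer(y1, y1) / N1 - np.outer(y0, y0) / N0   # assumed form of L
    delta_y = y0 / N0 - y1 / N1
    return np.linalg.solve(K @ L @ K + K / beta, K @ delta_y)

def discriminant(x_star, X, alpha, b=0.0, lengthscale=1.0):
    """D(x*) = sum_i alpha_i k(x*, x_i) + b."""
    return rbf_kernel(np.atleast_2d(x_star), X, lengthscale) @ alpha + b
```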
64 As an additional insight, we observe that the coefficients αBFD have an equivalent αKFD if and only if KFD uses a regularisation based on a multiple of the kernel matrix. [sent-309, score-0.209]
65 LS-SVM’s have been related to ridge regression with modified targets, discriminant analysis in the feature space (KFD) and, as many other kernelised algorithms, to GP’s. [sent-315, score-0.233]
66 In this figure, the joint distribution over labels and parameters factorises as $p(\mathbf{y}, \mathbf{w}|\mathbf{X}) = p(\mathbf{y}|\mathbf{X}, \mathbf{w}) \times p(\mathbf{w})$, with noise model $p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \mathcal{N}\big(\mathbf{y}|\mathbf{X}\mathbf{w}, \zeta^{-1}\mathbf{I}\big)$ and prior $p(\mathbf{w}) = \mathcal{N}\big(\mathbf{w}|\mathbf{0}, \mu^{-1}\mathbf{I}\big)$. [sent-323, score-0.246]
67 Figure 4: LS-SVM noise model assumes a regularised least squares cost function (nodes: µ, X, w, ζ, y). [sent-324, score-0.205]
68 The model depicted can be interpreted as the joint distribution $p(\mathbf{y}, \mathbf{w}) = p(\mathbf{y}|\mathbf{X}, \mathbf{w})\, p(\mathbf{w})$, whereby the noise is Gaussian, $p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \mathcal{N}\big(\mathbf{y}|\mathbf{X}\mathbf{w}, \zeta^{-1}\mathbf{I}\big)$, as is the prior $p(\mathbf{w}) = \mathcal{N}\big(\mathbf{w}|\mathbf{0}, \mu^{-1}\mathbf{I}\big)$. [sent-325, score-0.184]
69 In this model the targets and the labels are the same t ≡ y. [sent-326, score-0.133]
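As a sketch of this 'regression to the labels' view, the MAP weights of the Gaussian model just described reduce to a ridge-regression solution; the code below is illustrative (bias term omitted) and is not taken verbatim from the LS-SVM literature.

```python
import numpy as np

def lssvm_style_map_weights(X, y, zeta=1.0, mu=0.1):
    """MAP weights for the 'regression to the labels' model:
    noise N(y | Xw, zeta^{-1} I) with prior N(w | 0, mu^{-1} I),
    giving w = (zeta X^T X + mu I)^{-1} zeta X^T y (a ridge solution)."""
    d = X.shape[1]
    return np.linalg.solve(zeta * X.T @ X + mu * np.eye(d), zeta * X.T @ y)
```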
70 However, it suffers from the fundamental misconception that the class labels ±1 have to coincide with the projected class centres $c_q$. [sent-329, score-0.388]
71 Ignoring the bias term, in P-SVM’s the joint distribution $p(\mathbf{y}, \mathbf{w}|\mathbf{X})$ is factorised according to the noise model $p(\mathbf{y}|\mathbf{X}, \mathbf{w}) = \mathcal{N}\big(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}\big)$ and the prior distribution $p(\mathbf{w}) = \mathcal{N}\big(\mathbf{w}|\mathbf{0}, \nu\sigma^2\mathbf{I}\big)$. [sent-349, score-0.187]
72 1 we saw that optimisation of the proposed noise model and that of Rayleigh’s coefficient give equivalent results. [sent-363, score-0.155]
73 In both cases the solution to the discriminant problem was given by adjusting the level of β. [sent-364, score-0.195]
74 From the point of view of projected data, the problem has shifted from computing the direction of discrimination to that of minimising the generalisation error through the adjustment of the variable β. [sent-371, score-0.163]
75 Hence placing a prior over it not only places a prior over the generalisation error but on 5. [sent-373, score-0.205]
76 The class centres have been denoted by cq with q ∈ {1, 0}. [sent-377, score-0.277]
77 The prior could also be used to bias β towards low or high generalisation errors if this is thought appropriate. [sent-384, score-0.142]
78 In the Bayesian formalism it is quite common to make use of the marginal likelihood for this purpose, therefore we look to optimise $L(\Theta_t) = \log p(\mathbf{t}|\mathcal{D}, \Theta_t)$ with respect to the model parameters $\Theta_t$. [sent-394, score-0.139]
79 Recall that we optimised the likelihood with respect to the parameters $c_0$ and $c_1$, leading to a new encoding of the targets $\hat{\mathbf{t}}_q = \frac{\mathbf{f}^T\mathbf{y}_q}{N_q}\,\mathbf{y}_q$. [sent-396, score-0.289]
80 As a consequence, the targets will shift whenever the kernel parameters are changed. [sent-399, score-0.198]
81 1 EM Algorithm We denote the parameters of the prior as Θk and the complete set of model parameters as Θt = {Θk , β}. [sent-407, score-0.163]
82 Thus our algorithm will be composed of the alternation of the following steps: E-step, given the current parameters $\Theta_t^{it}$, approximate the posterior with $q^{it}(\mathbf{f}) \propto \exp\big(-\frac{1}{2}\mathbf{f}^T \Sigma_p^{-1} \mathbf{f}\big)$, where $\Sigma_p = \big(\mathbf{K}^{-1} + \beta\mathbf{L}\big)^{-1}$. [sent-412, score-0.138]
83 Maximisation with respect to $\Theta_k$, the kernel parameters, cannot be done in closed form and has to rely on some optimisation routine, for example gradient descent; therefore it is necessary to specify the gradients of Equation 29 w.r.t. $\Theta_k$. [sent-415, score-0.155]
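The alternation can be sketched schematically as follows; the kernel function and the M-step update are left as placeholders because the exact gradients of Equation 29 are not reproduced in this extract, and none of the names below come from the paper.

```python
import numpy as np

def e_step(K, L, beta):
    """Posterior covariance of f for the current parameters:
    Sigma_p = (K^{-1} + beta * L)^{-1} (small jitter added for stability)."""
    N = K.shape[0]
    return np.linalg.inv(np.linalg.inv(K + 1e-8 * np.eye(N)) + beta * L)

def em_loop(X, y, kernel_fn, L, theta_k, beta, m_step_fn, n_iter=20):
    """Alternate the two steps described in the text.

    kernel_fn(X, theta_k) -> kernel matrix K
    m_step_fn(Sigma_p, X, y, theta_k, beta) -> (theta_k, beta)
        placeholder for the gradient-based maximisation of the bound
        (Equation 29 in the paper), which is not reproduced here.
    """
    for _ in range(n_iter):
        K = kernel_fn(X, theta_k)
        Sigma_p = e_step(K, L, beta)                              # E-step
        theta_k, beta = m_step_fn(Sigma_p, X, y, theta_k, beta)   # M-step
    return theta_k, beta
```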
84 Kernel and regularisation parameters for KFD and LS-SVM were obtained by 10-fold cross validation, whereas BFD related parameters were obtained by evidence maximisation. [sent-489, score-0.177]
85 Given that discriminant methods operate independently of the method of bias choice, we felt it more appropriate to use a bias independent measure like the area under the ROC curve (AUC). [sent-609, score-0.195]
86 The compared algorithms are: linear discriminant (LDA), quadratic discriminant (QDA), kernel Fisher’s discriminant (KFD), least squares support vector machine (LS-SVM) and Bayesian Fisher’s discriminant (BFD). [sent-782, score-0.91]
87 Incorporation of this type of prior performs feature selection by assigning very high weights to some of the posterior values of the hyperparameters and hence pruning out features (see MacKay, 1995). [sent-786, score-0.136]
88 The EM algorithm we proposed is slower to converge than direct optimisation of the marginal likelihood as can be applied to the LS-SVM. [sent-790, score-0.13]
89 Conclusions and Future Work We have presented a Bayesian probabilistic approach to discriminant analysis that can correspond to kernel Fisher’s discriminant. [sent-914, score-0.291]
90 Regularisation of the discriminant arises naturally in the proposed framework and through maximisation of the marginal likelihood we were able to determine kernel and regularisation parameters. (Figure panels: Twonorm train set, Waveform train set.) [sent-915, score-0.447]
91 Therefore, after conditioning on $d$, the resulting distribution will be Gaussian with the form $p(\mathbf{w}|\mathcal{D}, d, \gamma) = \lim_{\gamma\to\infty}\mathcal{N}(\bar{\mathbf{w}}, \Sigma)$ and parameters $\bar{\mathbf{w}} = \lim_{\gamma\to\infty}\gamma d\, \Sigma\, \Delta\mathbf{m}$ (34) and $\Sigma = \lim_{\gamma\to\infty}\big(\mathbf{B} + \gamma\, \Delta\mathbf{m}\, \Delta\mathbf{m}^T\big)^{-1}$. [sent-962, score-0.264]
92 Expressing $p(\hat{\mathbf{t}}|\mathbf{f}, \mathbf{y})$ in terms of $\mathbf{f}$ and $\mathbf{L}$: disregarding an additive constant, the log of the modified noise model $p(\hat{\mathbf{t}}|\mathbf{f}, \mathbf{y})$ is $L(\hat{c}_0, \hat{c}_1) = -\frac{\beta}{2}\sum_{n=1}^{N}\big[ y_n(\hat{c}_1 - f_n)^2 + (1 - y_n)(\hat{c}_0 - f_n)^2 \big]$, where we have used Equation 6 as base. [sent-967, score-0.212]
93 From Equation 9, we substitute the values of each estimate $\hat{c}_q$ so that $L = -\frac{\beta}{2}\sum_{n=1}^{N}\Big[ y_n\big(\tfrac{1}{N_1}\mathbf{y}_1^T\mathbf{f} - f_n\big)^2 + (1 - y_n)\big(\tfrac{1}{N_0}\mathbf{y}_0^T\mathbf{f} - f_n\big)^2 \Big] = -\frac{\beta}{2}\Big[\mathbf{f}^T\mathbf{f} - \tfrac{1}{N_1}\mathbf{y}_1^T\mathbf{f}\mathbf{f}^T\mathbf{y}_1 - \tfrac{1}{N_0}\mathbf{y}_0^T\mathbf{f}\mathbf{f}^T\mathbf{y}_0\Big] = -\frac{\beta}{2}\mathbf{f}^T\mathbf{L}\mathbf{f}$. [sent-968, score-0.214]
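The last equality holds for $\mathbf{L} = \mathbf{I} - N_1^{-1}\mathbf{y}_1\mathbf{y}_1^T - N_0^{-1}\mathbf{y}_0\mathbf{y}_0^T$; this form of L is our assumption, and the short NumPy check below verifies the identity numerically on a toy example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
y = np.array([1, 1, 1, 0, 0, 0])
f = rng.normal(size=N)

y1 = (y == 1).astype(float); N1 = y1.sum()
y0 = (y == 0).astype(float); N0 = y0.sum()

# Left-hand side: squared deviations of f from its per-class means.
lhs = sum(y[n] * (y1 @ f / N1 - f[n])**2 + (1 - y[n]) * (y0 @ f / N0 - f[n])**2
          for n in range(N))

# Right-hand side: the quadratic form f^T L f with the assumed L.
L = np.eye(N) - np.outer(y1, y1) / N1 - np.outer(y0, y0) / N0
rhs = f @ L @ f

print(np.isclose(lhs, rhs))  # expected: True
```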
94 In the model we have three types of distributions: $p(\hat{\mathbf{t}}|\mathbf{y}, \mathbf{f})$, the noise model; $p(\mathbf{f}_+)$, an extended GP prior that includes the point $f_*$; and the constraint $p(d|\mathbf{y}, \mathbf{f}, \gamma)$. [sent-971, score-0.159]
95 Completing the squares on $f_*$ gives a Gaussian $p(f_*|\mathcal{D}, d, \gamma) \propto \exp\big\{-\frac{1}{2(\sigma_*)^2}\big(f_* - \bar{f}_*\big)^2\big\}$, where the variance $(\sigma_*)^2$ is given by Equation 40 and the mean is $\bar{f}_* = -\lim_{\gamma\to\infty}\gamma d\,(\sigma_*)^2\,\mathbf{c}^T\mathbf{Q}^{-1}\Delta\hat{\mathbf{y}}$. [sent-990, score-0.133]
96 Then combining it with a gamma prior $\mathcal{G}(\beta|a, b)$ gives a posterior of the form $\mathcal{G}\big(\beta\,\big|\,N/2 + a,\; V/2 + b\big)$, that is $p(\beta|\hat{\mathbf{t}}, \mathbf{f}) \propto \beta^{N/2 + a - 1}\exp\big\{-\beta\big(\tfrac{V}{2} + b\big)\big\}$. [sent-1012, score-0.169]
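A small sketch of this conjugate update using SciPy follows; taking V to be the squared-error statistic of the Gaussian noise model is our assumption, since its exact definition is not visible in this extract.

```python
import numpy as np
from scipy.stats import gamma

def beta_posterior(f, t_hat, a=1e-3, b=1e-3):
    """Conjugate gamma posterior over the noise precision beta.

    Assumes V = sum_n (t_hat_n - f_n)^2; shape = N/2 + a, rate = V/2 + b.
    """
    N = len(f)
    V = np.sum((t_hat - f) ** 2)
    shape, rate = N / 2 + a, V / 2 + b
    return gamma(a=shape, scale=1.0 / rate)   # SciPy parameterises by scale = 1/rate

# The posterior mean of beta equals shape / rate.
```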
97 Use Equation 6 and substitute the class centres cq by their estimates cq . [sent-1017, score-0.375]
98 The initialisations selected ranged from 1 × 10−4 to 1 × 104 ; initialisations that produced numerical errors were ignored. [sent-1039, score-0.132]
99 Estimating a kernel Fisher discriminant in the presence of label noise. [sent-1148, score-0.291]
100 Bayesian framework for least squares support vector machine classifiers, Gaussian processes and kernel discriminant analysis. [sent-1242, score-0.325]
wordName wordTfidf (topN-words)
[('bfd', 0.583), ('kfd', 0.319), ('discriminant', 0.195), ('centres', 0.179), ('awrence', 0.169), ('enteno', 0.169), ('ptimizing', 0.169), ('wt', 0.152), ('ard', 0.152), ('regularisation', 0.113), ('fld', 0.103), ('fisher', 0.1), ('cq', 0.098), ('kernel', 0.096), ('nq', 0.096), ('mf', 0.094), ('spiral', 0.094), ('twonorm', 0.086), ('bumpy', 0.085), ('maximisation', 0.085), ('projected', 0.084), ('rayleigh', 0.08), ('generalisation', 0.079), ('ft', 0.077), ('gp', 0.075), ('regularised', 0.075), ('yq', 0.075), ('mt', 0.073), ('posterior', 0.073), ('kt', 0.072), ('targets', 0.07), ('lim', 0.066), ('initialisations', 0.066), ('prior', 0.063), ('latent', 0.063), ('noise', 0.06), ('optimisation', 0.059), ('ka', 0.059), ('ringnorm', 0.057), ('nx', 0.057), ('mq', 0.056), ('xw', 0.056), ('gestel', 0.056), ('klk', 0.056), ('suykens', 0.055), ('mika', 0.049), ('rbf', 0.048), ('qda', 0.047), ('tq', 0.047), ('waveform', 0.046), ('overlap', 0.045), ('equation', 0.042), ('compounds', 0.04), ('splice', 0.04), ('neil', 0.04), ('lda', 0.039), ('ridge', 0.038), ('bart', 0.038), ('johan', 0.038), ('joos', 0.038), ('maximising', 0.038), ('optimising', 0.038), ('pelckmans', 0.038), ('likelihood', 0.037), ('model', 0.036), ('bayesian', 0.036), ('conditioning', 0.034), ('marginal', 0.034), ('squares', 0.034), ('exp', 0.033), ('parameters', 0.032), ('coef', 0.032), ('gaussian', 0.032), ('relevance', 0.031), ('roc', 0.031), ('fn', 0.031), ('toy', 0.03), ('tsch', 0.029), ('auc', 0.029), ('proximal', 0.029), ('baudat', 0.028), ('dkt', 0.028), ('factorised', 0.028), ('factorises', 0.028), ('lf', 0.028), ('realisations', 0.028), ('tony', 0.028), ('statistic', 0.028), ('ct', 0.028), ('yn', 0.027), ('labels', 0.027), ('covariance', 0.027), ('array', 0.027), ('substituting', 0.026), ('banana', 0.026), ('meanwhile', 0.026), ('shaded', 0.026), ('limit', 0.025), ('van', 0.025), ('depicted', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
2 0.11412739 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
3 0.10660606 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
4 0.068224043 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
5 0.05755328 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
Author: Paul W. Goldberg
Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification
6 0.054133546 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
7 0.051101644 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
9 0.049682438 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
10 0.04841391 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
11 0.047865406 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
12 0.043165553 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
13 0.042191599 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
14 0.041791268 54 jmlr-2006-Learning the Structure of Linear Latent Variable Models
15 0.039795503 57 jmlr-2006-Linear State-Space Models for Blind Source Separation
18 0.037100941 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
19 0.037001852 49 jmlr-2006-Learning Parts-Based Representations of Data
20 0.036416922 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
topicId topicWeight
[(0, 0.206), (1, 0.072), (2, 0.116), (3, 0.012), (4, -0.12), (5, -0.066), (6, -0.024), (7, -0.073), (8, 0.048), (9, -0.027), (10, 0.168), (11, -0.035), (12, -0.036), (13, -0.033), (14, -0.067), (15, 0.017), (16, -0.047), (17, 0.118), (18, -0.131), (19, -0.156), (20, 0.276), (21, -0.06), (22, -0.131), (23, 0.051), (24, 0.125), (25, -0.088), (26, -0.099), (27, -0.149), (28, 0.089), (29, -0.087), (30, 0.125), (31, 0.054), (32, -0.13), (33, -0.225), (34, -0.039), (35, 0.011), (36, 0.084), (37, 0.176), (38, 0.129), (39, -0.048), (40, -0.128), (41, -0.06), (42, 0.068), (43, 0.164), (44, -0.022), (45, 0.062), (46, -0.069), (47, 0.148), (48, -0.061), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.90749651 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
2 0.48285419 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vectom machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
3 0.41581994 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni
Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers
4 0.35243499 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
Author: Jieping Ye, Tao Xiong
Abstract: Dimensionality reduction is an important pre-processing step in many applications. Linear discriminant analysis (LDA) is a classical statistical approach for supervised dimensionality reduction. It aims to maximize the ratio of the between-class distance to the within-class distance, thus maximizing the class discrimination. It has been used widely in many applications. However, the classical LDA formulation requires the nonsingularity of the scatter matrices involved. For undersampled problems, where the data dimensionality is much larger than the sample size, all scatter matrices are singular and classical LDA fails. Many extensions, including null space LDA (NLDA) and orthogonal LDA (OLDA), have been proposed in the past to overcome this problem. NLDA aims to maximize the between-class distance in the null space of the within-class scatter matrix, while OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of the scatter matrices. They have been applied successfully in various applications. In this paper, we present a computational and theoretical analysis of NLDA and OLDA. Our main result shows that under a mild condition which holds in many applications involving highdimensional data, NLDA is equivalent to OLDA. We have performed extensive experiments on various types of data and results are consistent with our theoretical analysis. We further apply the regularization to OLDA. The algorithm is called regularized OLDA (or ROLDA for short). An efficient algorithm is presented to estimate the regularization value in ROLDA. A comparative study on classification shows that ROLDA is very competitive with OLDA. This confirms the effectiveness of the regularization in ROLDA. Keywords: linear discriminant analysis, dimensionality reduction, null space, orthogonal matrix, regularization
5 0.33359584 70 jmlr-2006-Online Passive-Aggressive Algorithms
Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer
Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.
7 0.27854863 43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
8 0.27558362 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
9 0.24617881 49 jmlr-2006-Learning Parts-Based Representations of Data
10 0.217337 81 jmlr-2006-Some Discriminant-Based PAC Algorithms
11 0.21678168 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
12 0.21261932 54 jmlr-2006-Learning the Structure of Linear Latent Variable Models
14 0.21052051 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
15 0.20700341 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting (Special Topic on Inductive Programming)
16 0.19427465 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
18 0.18357351 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
19 0.18122438 93 jmlr-2006-Universal Kernels
topicId topicWeight
[(8, 0.496), (36, 0.069), (45, 0.013), (50, 0.035), (63, 0.038), (68, 0.015), (76, 0.012), (78, 0.015), (79, 0.016), (81, 0.035), (84, 0.028), (90, 0.044), (91, 0.029), (96, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.86134183 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
Author: Tonatiuh Peña Centeno, Neil D. Lawrence
Abstract: In this paper we consider a novel Bayesian interpretation of Fisher’s discriminant analysis. We relate Rayleigh’s coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the ‘regression to the labels’ assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher’s discriminant, and with the incorporation of a prior we can apply Bayes’ rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher’s discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
2 0.8186087 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
Author: Gilles Blanchard, Motoaki Kawanabe, Masashi Sugiyama, Vladimir Spokoiny, Klaus-Robert Müller
Abstract: Finding non-Gaussian components of high-dimensional data is an important preprocessing step for efficient information processing. This article proposes a new linear method to identify the “nonGaussian subspace” within a very general semi-parametric framework. Our proposed method, called NGCA (non-Gaussian component analysis), is based on a linear operator which, to any arbitrary nonlinear (smooth) function, associates a vector belonging to the low dimensional nonGaussian target subspace, up to an estimation error. By applying this operator to a family of different nonlinear functions, one obtains a family of different vectors lying in a vicinity of the target space. As a final step, the target space itself is estimated by applying PCA to this family of vectors. We show that this procedure is consistent in the sense that the estimaton error tends to zero at a parametric rate, uniformly over the family, Numerical examples demonstrate the usefulness of our method.
3 0.39641955 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
Author: Kazuho Watanabe, Sumio Watanabe
Abstract: Bayesian learning has been widely used and proved to be effective in many data modeling problems. However, computations involved in it require huge costs and generally cannot be performed exactly. The variational Bayesian approach, proposed as an approximation of Bayesian learning, has provided computational tractability and good generalization performance in many applications. The properties and capabilities of variational Bayesian learning itself have not been clarified yet. It is still unknown how good approximation the variational Bayesian approach can achieve. In this paper, we discuss variational Bayesian learning of Gaussian mixture models and derive upper and lower bounds of variational stochastic complexities. The variational stochastic complexity, which corresponds to the minimum variational free energy and a lower bound of the Bayesian evidence, not only becomes important in addressing the model selection problem, but also enables us to discuss the accuracy of the variational Bayesian approach as an approximation of true Bayesian learning. Keywords: Gaussian mixture model, variational Bayesian learning, stochastic complexity
4 0.36832476 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
Author: Daniil Ryabko
Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.
5 0.34928674 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
Author: Francis R. Bach, Michael I. Jordan
Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis
6 0.34696937 49 jmlr-2006-Learning Parts-Based Representations of Data
7 0.34537637 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
8 0.3396126 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
9 0.33835244 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
10 0.33633995 47 jmlr-2006-Learning Image Components for Object Recognition
11 0.33454475 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
12 0.32995582 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples
13 0.31862336 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
15 0.31142378 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
17 0.30706322 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error
18 0.30701381 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
19 0.30398324 83 jmlr-2006-Sparse Boosting
20 0.3037858 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann