jmlr jmlr2006 jmlr2006-22 knowledge-graph by maker-knowledge-mining

22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers


Source: pdf

Author: Francis R. Bach, David Heckerman, Eric Horvitz

Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. [sent-6, score-0.259]

2 For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. [sent-7, score-0.305]

3 We consider the alternative of varying the asymmetry of the cost function used for training. [sent-8, score-0.93]

4 We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. [sent-9, score-0.426]

5 In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. [sent-12, score-0.239]

6 The ROC curve for such a classifier reveals the ratio of false negatives and positives at different probability thresholds for classifying an email message as unsolicited or normal email. [sent-17, score-0.247]

7 To date, ROC curves have usually been constructed by training once, holding the estimated slope constant and varying the intercept to obtain the curve. [sent-22, score-0.36]
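
For concreteness, here is a minimal sketch (not taken from the paper) of this intercept-only construction, assuming scikit-learn, numpy, and a binary data set with labels in {0, 1}; the function name and arguments are placeholders.

    import numpy as np
    from sklearn.svm import LinearSVC

    def roc_by_intercept(X_train, y_train, X_test, y_test):
        # Train once, then sweep a threshold on the linear scores,
        # which amounts to varying only the intercept of the classifier.
        clf = LinearSVC(C=1.0).fit(X_train, y_train)
        scores = clf.decision_function(X_test)
        points = []
        for t in np.sort(np.unique(scores)):
            pred = (scores >= t).astype(int)
            tpr = np.mean(pred[y_test == 1] == 1)
            fpr = np.mean(pred[y_test == 0] == 1)
            points.append((fpr, tpr))
        return np.array(points)  # (false positive rate, true positive rate) pairs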

8 The crux of our approach is that we allow the asymmetry of the cost function to vary, that is, we vary the ratio of the cost of a false positive and the cost of a false negative. [sent-24, score-1.122]

9 In a naive implementation, varying the asymmetry would require a retraining of the classifier for each point of the ROC curve, which would be computationally expensive. [sent-26, score-0.867]
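
A minimal sketch of this naive construction (one retraining per asymmetry), again assuming scikit-learn; the class_weight dictionary plays the role of the cost pair (C+, C−), using the C+ = 2γ, C− = 2(1−γ) convention mentioned later in this summary, and the paper's efficient path-following algorithm is not reproduced here.

    import numpy as np
    from sklearn.svm import LinearSVC

    def roc_by_asymmetry(X_train, y_train, X_test, y_test, n_points=25):
        points = []
        for gamma in np.linspace(0.02, 0.98, n_points):          # training asymmetry in (0, 1)
            weights = {1: 2.0 * gamma, 0: 2.0 * (1.0 - gamma)}   # C+ = 2*gamma, C- = 2*(1-gamma)
            clf = LinearSVC(C=1.0, class_weight=weights).fit(X_train, y_train)
            pred = clf.predict(X_test)
            tpr = np.mean(pred[y_test == 1] == 1)
            fpr = np.mean(pred[y_test == 0] == 1)
            points.append((fpr, tpr))
        return np.array(points)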

10 In particular, by allowing both the asymmetry and the intercept to vary, we can obtain better ROC curves than by methods that simply vary the intercept. [sent-33, score-1.018]

11 In Section 4, we provide a theoretical analysis of the relationship between the asymmetry of costs assumed in training a classifier and the asymmetry desired in its application. [sent-34, score-1.694]

12 , infinite sample) case, the use of a training loss function which is a convex upper bound on the true or testing loss function (a step function) creates classifiers with sub-optimal accuracy. [sent-37, score-0.283]

13 The main result of this analysis is that given an extreme user-defined testing asymmetry, the training asymmetry should almost always be chosen to be less extreme. [sent-40, score-1.005]

14 As we shall see, the consequences of the potential mismatch between the cost functions assumed in testing versus training underscore the value of using the algorithm that we introduce in Section 4. [sent-41, score-0.237]

15 We let C+ denote the cost of a false negative and C− the cost of a false positive. [sent-64, score-0.232]

16 The expected cost defined using the 0–1 loss is the cost that end users are usually interested in during the use of the classifier, while the other cost functions that we define below are used solely for training purposes. [sent-68, score-0.283]

17 We shall use training asymmetry to refer to the asymmetry used for training a classifier using a φ-cost, and the testing asymmetry to refer to the asymmetric cost characterizing the testing situation (reflecting end user preferences) with the actual cost based on the 0–1 loss. [sent-73, score-2.877]

18 Figure 1: Loss functions: (left) 0–1, hinge, square, and erf losses; (right) 0–1, probit, and logistic losses. [sent-78, score-0.254]

19 The erf loss can be used to provide a good approximation of the logistic loss φ_log(u) = log(1 + e^{−u}) as well as its derivative, and is amenable to closed-form computations for Gaussians and mixture of Gaussians (see Section 4 for more details). [sent-82, score-0.311]

20 Note that the erf loss is different from the probit loss − log ψ(u), which leads to probit regression (Hastie et al. [sent-83, score-0.332]

21 Whether γ is the intercept or the training asymmetry, the ROC curve always passes through the points (0, 0) and (1, 1), which correspond to classifying all points as negative (resp. [sent-90, score-0.286]

22 Figure 2: Left: ROC curve: (plain) regular ROC curve; (dashed) convex envelope; axes are false positives (u) and true positives (v). [sent-94, score-0.427]

23 However, if the user is only interested in the testing cost, which is a weighted linear combination of the true positive and false positive rates, the lowest testing cost is always achieved by both of these two classifiers (see Section 4. [sent-97, score-0.324]

24 The optimal testing asymmetry β(γ) is defined as the ratio β(γ) = C+(γ)/(C+(γ) + C−(γ)); it is thus determined by the derivatives du/dγ and dv/dγ of the ROC coordinates. [sent-102, score-0.917]
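
The formula above was garbled in extraction; the following is the standard ROC cost-minimization argument, written out here for clarity and not quoted from the paper. Here u and v denote the false and true positive rates and p+, p− the class proportions, as in the figures referenced below.

    % Expected 0-1 testing cost of the classifier indexed by gamma:
    R(\gamma) = C_+ \, p_+ \, \bigl(1 - v(\gamma)\bigr) + C_- \, p_- \, u(\gamma)
    % Setting dR/dgamma = 0 gives the cost pair for which that classifier is optimal:
    C_+ \, p_+ \, \frac{dv}{d\gamma} = C_- \, p_- \, \frac{du}{d\gamma}
    \quad\Longrightarrow\quad
    \beta(\gamma) = \frac{C_+(\gamma)}{C_+(\gamma) + C_-(\gamma)}
                  = \frac{1}{\,1 + \dfrac{p_+}{p_-}\,\dfrac{dv/d\gamma}{du/d\gamma}\,}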

25 Handling ROC surfaces: In this paper, we will also consider varying both the asymmetry of the cost function and the intercept, leading to a set of points in the ROC plane parameterized by two real values. [sent-110, score-0.983]

26 Thus, two end-user-defined parameters are needed to train a linear classifier: the total amount of regularization C+ + C− ∈ R+, and the asymmetry C+/(C+ + C−) ∈ [0, 1]. [sent-121, score-0.813]
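
As a small illustration (not from the paper), the two end-user parameters map to the cost pair as follows; the helper name is hypothetical.

    def costs_from_parameters(total, asymmetry):
        # total = C+ + C- (overall regularization), asymmetry = C+ / (C+ + C-)
        c_plus = total * asymmetry
        c_minus = total * (1.0 - asymmetry)
        return c_plus, c_minus

    # Example: total = 2 and asymmetry = 1/2 give C+ = C- = 1,
    # i.e. the usual symmetric misclassification cost.
    c_plus, c_minus = costs_from_parameters(2.0, 0.5)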

27 For all classifiers along that line, we compute a misclassification cost on the testing set, with given asymmetry (C+^0, C−^0) (as given by the user, and usually, but not necessarily, close to a point of interest in the ROC space). [sent-197, score-0.98]

28 • Varying intercept and asymmetry: For each of the points on the R lines in the (C+ ,C− )-plane, we discard the intercept b and keep the slope w obtained from the optimization problem; we then use all possible intercept values; this leads to R two-dimensional surfaces in the ROC plane. [sent-215, score-0.532]
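
A minimal sketch of this step (not the paper's code), assuming scikit-learn: for each training asymmetry the fitted slope w is kept, and ROC points are generated by sweeping every possible intercept, i.e. every threshold on the linear scores. The asymmetries passed in are assumed to lie strictly inside (0, 1).

    import numpy as np
    from sklearn.svm import LinearSVC

    def roc_surface_points(X_train, y_train, X_test, y_test, asymmetries):
        # For each training asymmetry, keep the slope w and sweep the intercept.
        points = []
        for gamma in asymmetries:
            clf = LinearSVC(C=1.0, class_weight={1: 2 * gamma, 0: 2 * (1 - gamma)})
            clf.fit(X_train, y_train)
            scores = X_test @ clf.coef_.ravel()   # w.x only; the intercept is swept below
            for t in np.unique(scores):
                pred = (scores >= t).astype(int)
                tpr = np.mean(pred[y_test == 1] == 1)
                fpr = np.mean(pred[y_test == 0] == 1)
                points.append((gamma, fpr, tpr))
        return np.array(points)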

29 the asymmetry) are included in the set used for varying both the intercept and the asymmetry, the third ROC curve always outperforms the first two curves (i. [sent-218, score-0.336]

30 Intuitively, the ROC curve obtained by varying the asymmetry should be better than the ROC generated by varying the intercept because, for each point, the slope of the classifier is optimized. [sent-226, score-1.221]

31 In addition, even for ROC-consistent points, the training asymmetry and the testing asymmetry differ. [sent-231, score-1.772]

32 2 that the training cost asymmetry can differ from the testing asymmetry. [sent-235, score-1.022]

33 In many cases, the Bayes optimal classifier does not belong … [sent-243, score-1.021]


35 Figure 4: Two examples of ROC curves for bimodal class conditional densities, varying intercept (dotted), varying asymmetry (plain) and varying both (dashed). [sent-259, score-1.291]

36 Since we are using population densities, we can get rid of the regularization term and thus only the asymmetry will have an influence on the results, that is, we can restrict ourselves to C+ +C− = 1. [sent-263, score-0.826]

37 For a given training asymmetry γ and a loss function φ, in Section 2. [sent-265, score-0.907]

38 2, we defined the optimal testing asymmetry β(γ) for the training asymmetry γ as the testing asymmetry for which the classifier obtained by training with asymmetry γ is optimal. [sent-266, score-3.544]

39 Although a difference might be seen empirically for all possible asymmetries, we analyze the relationship between the testing cost asymmetry and training asymmetry in cases of extreme asymmetries, that is, in the ROC framework, close to the corner points (0, 0) and (1, 1). [sent-268, score-1.898]

40 We prove that, depending on the class conditional densities, there are three possible different regimes for extreme asymmetries and that under two of these regimes, the training asymmetry should be chosen less extreme. [sent-269, score-1.112]

41 , 2001), and (b) for the square loss and the erf loss, they enable closed-form calculations that lead to Taylor expansions. [sent-272, score-0.245]

42 In the following sections, we provide Taylor expansions of various quantities around the null training asymmetry γ = 0. [sent-288, score-0.943]

43 The quantities trivially extend around the reverse asymmetry γ = 1. [sent-289, score-0.841]

44 We focus on two losses, the square loss and the erf loss. [sent-290, score-0.245]

45 In order to study if using such losses might alter the qualitative behavior observed and analyzed for the square loss around extreme testing asymmetries, we use another loss with similar behavior as u tends to +∞, the erf loss. [sent-295, score-0.549]

46 The erf loss is defined as φ_erf(u) = uψ(u) − u + ψ'(u) (with ψ the Gaussian cumulative distribution function) and leads to simpler computations than the hinge loss or the logistic loss. [sent-296, score-0.296]

47 Note that the erf loss φ_erf is a tight approximation of a rescaled logistic loss, (1/2) log(1 + e^{−2u}), with similar derivatives. [sent-298, score-0.277]
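
A small numeric check of this approximation, assuming the reconstructed form φ_erf(u) = uψ(u) − u + ψ'(u) given above (the paper's exact constants may differ) and that numpy and scipy are available. Both losses have derivative −1/2 at u = 0 and vanish as u tends to +∞.

    import numpy as np
    from scipy.stats import norm

    def erf_loss(u):
        # Reconstructed erf loss: u * psi(u) - u + psi'(u), with psi the standard normal cdf.
        return u * norm.cdf(u) - u + norm.pdf(u)

    def rescaled_logistic_loss(u):
        return 0.5 * np.log1p(np.exp(-2.0 * u))

    u = np.linspace(-3.0, 3.0, 13)
    print(np.c_[u, erf_loss(u), rescaled_logistic_loss(u)])  # the two losses stay close on this range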

48 We start with an expansion of the unique global minimum (w, b) of the φ-cost with asymmetry γ. [sent-299, score-0.846]

49 The previous propositions provide a closed-form expansion of the solutions of the convex optimization problems defined by the square loss and the erf loss. [sent-306, score-0.311]

50 Expansion of Testing Asymmetries: Using the expansions of Propositions 1 and 2, we can readily derive an expansion of the ROC coordinates for small γ, as well as the testing asymmetry β(γ). [sent-309, score-1.01]

51 Figure 5: Case A < 0: (left) plot of the optimal testing asymmetry β(γ) as a function of the training asymmetry γ, (right) ROC curve around (0,0) as γ varies close to zero. [sent-311, score-2.088]

52 (3), we see that log((p−/p+)(β(γ)^{−1} − 1)) is asymptotically equivalent to a negative constant times a term that diverges as γ tends to zero, implying that the testing asymmetry β(γ) tends to one exponentially fast. [sent-316, score-0.994]

53 In this situation, better performance for a given testing asymmetry can be obtained by using a less extreme training asymmetry, simply because too extreme training asymmetries lead to classifiers that perform worse than trivial classifiers. [sent-321, score-1.274]

54 (3), we see that the testing asymmetry tends to 0 exponentially fast; in particular, the derivative dβ/dγ is zero at γ = 0, meaning that the testing asymmetry is significantly smaller than the training asymmetry, that is, more extreme. [sent-323, score-1.934]

55 See Figure 6 for a plot of the ROC curve. Figure 6: Case A > 0. [sent-325, score-0.288]

56 Left: Plot of the optimal testing asymmetry β(γ) as a function of the training asymmetry γ; Right: ROC curve around (0,0) as γ varies close to zero. [sent-326, score-1.891]

57 Classifiers corresponding to γ close to zero are ROC-consistent, but since β(γ) < γ, best performance for a given testing asymmetry is obtained for less extreme training asymmetries. [sent-327, score-1.005]

58 In this situation, better performance for a given testing asymmetry can be obtained by using a less extreme training asymmetry. [sent-329, score-1.005]

59 Overall, in the two more likely regimes where A ≠ 0, we have shown that, given an extreme testing asymmetry, the training asymmetry should be chosen less extreme. [sent-335, score-1.035]

60 3, where we show in Table 2 examples of the mismatch between testing asymmetry and training asymmetry. [sent-338, score-0.987]

61 For the erf loss, there are also three regimes, but with weaker effects since the testing asymmetry β(γ) tends to zero or one polynomially fast, instead of exponentially fast. [sent-340, score-1.132]

62 In Figure 7 and Figure 8, we provide several examples for the square loss and the erf loss, with the two regimes A > 0 and A < 0 and different strengths. [sent-342, score-0.275]

63 , extreme asymmetries), the effects also often remain valid far from … [sent-346, score-0.963]


65 Figure 7: Training asymmetry vs. [sent-378, score-1.668]

66 testing asymmetry, square loss: (Left) Gaussian class conditional densities, (right) testing asymmetry vs. [sent-379, score-1.057]



69 Figure 8: Training asymmetry vs. [sent-415, score-1.668]

70 Left: Gaussian class conditional densities; Right: Testing asymmetry vs. [sent-417, score-0.813]

71 However, in general, for non-extreme asymmetries, the mismatch might be inverted, that is, a more extreme training asymmetry might lead to better performance. [sent-422, score-0.929]

72 However, this approach would only be applicable to extreme asymmetries (where the expansions are valid). [sent-424, score-0.287]

73 2 and chose the best classifier as follows: a given testing asymmetry leads to a unique slope in the ROC space, and the optimal point for this asymmetry is the point on the ROC curve whose tangent has the corresponding slope and which is closest to the upper-left corner. [sent-429, score-1.967]
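
A minimal sketch of this selection step, written as direct minimization of the expected testing cost, which is equivalent to the tangent-slope criterion up to ties; the inputs are assumptions, not taken from the paper's code.

    import numpy as np

    def best_roc_point(roc_points, p_plus, p_minus, c_plus, c_minus):
        # roc_points: array of (false positive rate u, true positive rate v) pairs.
        # Pick the point minimizing the expected 0-1 testing cost
        # C+ p+ (1 - v) + C- p- u for the given testing asymmetry.
        u, v = roc_points[:, 0], roc_points[:, 1]
        cost = c_plus * p_plus * (1.0 - v) + c_minus * p_minus * u
        return roc_points[np.argmin(cost)]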

74 In our simulations, we compare two types of ROC curves, one which is obtained by varying the intercept, and one which is obtained by varying both the training asymmetry and the intercept. [sent-430, score-0.963]

75 More precisely, 1000 values of testing and training asymmetries were tried, uniformly spread in [0, 1]. [sent-437, score-0.327]

76 , training asymmetry and/or intercept) by 10-fold cross validation on the training data. [sent-441, score-0.934]

77 In cases where statistical significance is not found, the original classifier obtained with the training asymmetry equal to the testing asymmetry is used (in this case, the classifier is identical to the classifier “one”). [sent-443, score-1.772]

78 For each testing asymmetry, we performed one-tailed Wilcoxon signed rank tests on the testing costs of the outer folds, with 5% confidence levels (Hollander and Wolfe, 1999). [sent-447, score-0.271]
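
A sketch of one such comparison, assuming scipy is available; the per-fold costs below are made-up placeholder numbers, and the one-sided alternative asks whether the first classifier has significantly lower cost.

    import numpy as np
    from scipy.stats import wilcoxon

    costs_all = np.array([0.11, 0.09, 0.12, 0.10, 0.08, 0.13, 0.09, 0.10, 0.11, 0.12])  # outer-fold costs, classifier "all"
    costs_one = np.array([0.12, 0.10, 0.13, 0.11, 0.10, 0.14, 0.09, 0.12, 0.11, 0.13])  # outer-fold costs, classifier "one"

    # One-tailed Wilcoxon signed rank test: are the "all" costs significantly smaller?
    stat, p_value = wilcoxon(costs_all, costs_one, alternative='less')
    print(p_value < 0.05)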

79 , the number of acceptances divided by the number of testing asymmetries that are considered) for all one-sided tests are reported in Table 1. [sent-450, score-0.314]

80 Moreover, the column “one>int” shows that there is a significant gain in simply optimizing the intercept using the nonconvex non-differentiable risk based on the 0-1 loss, and columns “one>all” and “int>all” show that there is even more gain in considering all training asymmetries. [sent-452, score-0.359]

81 Since the training asymmetries corresponding to a given testing asymmetry may differ for each outer fold, we use the training asymmetry that was selected by the first (i. [sent-455, score-2.016]

82 , a random) outer fold of cross validation, and compute the cross-validation scores corresponding to this single training asymmetry for all nine other outer folds. [sent-457, score-0.957]

83 For each data set, testing asymmetries were chosen so that the difference in testing performance between the classifier “one” and “all” on the first fold is greatest. [sent-458, score-0.411]

84 Moreover, when the testing asymmetry is extreme, such as for the GAUSSIAN and PIMA data sets, the training asymmetry is less extreme and leads to better performance, illustrating the theoretical results of Section 4. [sent-461, score-1.818]

85 Note also that for other data sets, such as TWONORM or IONOSPHERE, the optimal training asymmetry is very different from the testing asymmetry: we find that using all asymmetries … [sent-462, score-1.14]

86 , the number of acceptances divided by the number of testing asymmetries that are considered) of the six one-sided tests between the three classifiers are given in the last six columns (Wilcoxon signed rank tests with 5% confidence level comparing cross-validation scores). [sent-532, score-0.33]

87 Table 2: Training with the testing asymmetry γ versus training with all cost asymmetries: we report cross-validation testing costs (multiplied by 100). [sent-623, score-1.152]

88 Only the asymmetry with the largest difference in testing performance between the classifier “one” and “all” is reported. [sent-624, score-0.917]

89 We also report the training asymmetry which led to the best performance. [sent-625, score-0.855]

90 Given an asymmetry γ we use the cost settings C+ = 2γ, C− = 2(1−γ) (which leads to the misclassification error if γ = 1/2). [sent-626, score-0.876]

91 Conclusion: We have presented an efficient algorithm to build ROC curves by varying the training cost asymmetries for SVMs. [sent-631, score-0.395]

92 The algorithm is based on the piecewise linearity of the path of solutions as the costs of false positives and false negatives vary. [sent-632, score-0.293]

93 We have also provided a theoretical analysis of the relationship between the potentially different cost asymmetries assumed in training and testing classifiers, showing that they differ under certain circumstances. [sent-633, score-0.39]

94 In particular, in case of extreme asymmetries, our theoretical analysis suggests that training asymmetries should be chosen less extreme than the testing asymmetry. [sent-634, score-0.419]

95 Our theoretical analysis implies that because we use a convex surrogate, using the testing asymmetry for training leads to non-optimal classifiers. [sent-637, score-0.992]

96 Proof of Expansion of Optimal Solutions: In this appendix, we give the proof of the expansions of optimal solutions for extreme asymmetries, for the square and erf loss. [sent-647, score-0.299]

97 In order to derive optimality conditions for the erf loss, we first need to compute expectations of the erf loss and its derivatives for Gaussian densities. [sent-660, score-0.381]

98 (6) and the fact that ψ(t+) tends to one, we obtain ψ(t−) ∼ ρ. A classical asymptotic result on the inverse erf function shows that t− ∼ −√(2 log(1/ρ)). [sent-683, score-0.237]

99 Proof of Expansion of Testing Asymmetries: For the two losses we considered (square and erf), the expansions of w and b around γ = 0 lead to w(γ)/b(γ) ≈ −c(γ)a, where c(γ) = 2(p+/p−)γ and a = Σ^{−1}(µ+ − µ−) for the square loss, and c(γ) = (2 log(1/γ))^{−1} and a = Σ^{−1}(µ̃+ − µ−) for the erf loss. [sent-746, score-0.349]

100 On the path to an ideal ROC curve: considering cost asymmetry in learning classifiers. [sent-763, score-0.896]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('asymmetry', 0.813), ('roc', 0.314), ('asymmetries', 0.181), ('erf', 0.157), ('intercept', 0.136), ('testing', 0.104), ('eckerman', 0.093), ('orvitz', 0.093), ('ost', 0.093), ('curve', 0.091), ('lassifiers', 0.079), ('onsidering', 0.079), ('bach', 0.076), ('int', 0.073), ('slope', 0.073), ('positives', 0.072), ('cost', 0.063), ('expansions', 0.06), ('classi', 0.058), ('tends', 0.058), ('curves', 0.055), ('er', 0.054), ('varying', 0.054), ('false', 0.053), ('loss', 0.052), ('gaussians', 0.052), ('ers', 0.047), ('extreme', 0.046), ('symmetry', 0.046), ('training', 0.042), ('densities', 0.04), ('square', 0.036), ('expansion', 0.033), ('convex', 0.033), ('bw', 0.03), ('dv', 0.03), ('surrogate', 0.03), ('regimes', 0.03), ('mismatch', 0.028), ('around', 0.028), ('hastie', 0.028), ('earning', 0.027), ('probit', 0.026), ('costs', 0.026), ('scores', 0.025), ('yi', 0.025), ('validation', 0.024), ('mixtures', 0.023), ('ci', 0.023), ('fold', 0.022), ('asymptotic', 0.022), ('envelope', 0.021), ('outer', 0.021), ('path', 0.02), ('proposition', 0.02), ('asymmetric', 0.02), ('plain', 0.02), ('dc', 0.02), ('hinge', 0.019), ('log', 0.019), ('negatives', 0.019), ('du', 0.019), ('plane', 0.019), ('receiver', 0.018), ('operating', 0.018), ('rc', 0.018), ('lines', 0.017), ('surfaces', 0.017), ('active', 0.017), ('points', 0.017), ('hollander', 0.017), ('francis', 0.017), ('logistic', 0.016), ('losses', 0.016), ('tests', 0.016), ('mixture', 0.015), ('optimality', 0.015), ('covariance', 0.015), ('heckerman', 0.015), ('tv', 0.015), ('folds', 0.014), ('vary', 0.014), ('boyd', 0.014), ('cross', 0.013), ('acceptances', 0.013), ('avis', 0.013), ('bleistein', 0.013), ('byi', 0.013), ('flach', 0.013), ('regime', 0.013), ('unseen', 0.013), ('population', 0.013), ('misclassi', 0.013), ('piecewise', 0.013), ('tu', 0.013), ('accumulation', 0.013), ('af', 0.012), ('polytope', 0.012), ('line', 0.012), ('normal', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

Author: Francis R. Bach, David Heckerman, Eric Horvitz

Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification

2 0.060170949 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

Author: Janez Demšar

Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests

3 0.053680643 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

Author: Pavel Laskov, Christian Gehl, Stefan Krüger, Klaus-Robert Müller

Abstract: Incremental Support Vector Machines (SVM) are instrumental in practical applications of online learning. This work focuses on the design and analysis of efficient incremental SVM learning, with the aim of providing a fast, numerically stable and robust implementation. A detailed analysis of convergence and of algorithmic complexity of incremental SVM learning is carried out. Based on this analysis, a new design of storage and numerical operations is proposed, which speeds up the training of an incremental SVM by a factor of 5 to 20. The performance of the new algorithm is demonstrated in two scenarios: learning with limited resources and active learning. Various applications of the algorithm, such as in drug discovery, online monitoring of industrial devices and and surveillance of network traffic, can be foreseen. Keywords: incremental SVM, online learning, drug discovery, intrusion detection

4 0.043443866 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf

Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming

5 0.042498682 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

6 0.038013272 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

7 0.032387998 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling

8 0.028375987 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

9 0.028308302 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

10 0.02813099 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

11 0.026862385 12 jmlr-2006-Active Learning with Feedback on Features and Instances

12 0.026829397 18 jmlr-2006-Building Support Vector Machines with Reduced Classifier Complexity     (Special Topic on Machine Learning and Optimization)

13 0.02642522 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

14 0.025100676 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

15 0.024527693 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

16 0.024110325 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

17 0.023431428 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

18 0.022045549 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

19 0.021947304 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

20 0.021736296 31 jmlr-2006-Exact 1-Norm Support Vector Machines Via Unconstrained Convex Differentiable Minimization     (Special Topic on Machine Learning and Optimization)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, -0.06), (2, 0.034), (3, 0.018), (4, -0.015), (5, -0.006), (6, -0.067), (7, -0.066), (8, -0.016), (9, -0.031), (10, -0.015), (11, -0.052), (12, -0.021), (13, -0.034), (14, 0.074), (15, -0.018), (16, -0.025), (17, 0.096), (18, 0.027), (19, 0.041), (20, 0.083), (21, -0.273), (22, -0.155), (23, -0.063), (24, -0.045), (25, 0.095), (26, -0.102), (27, -0.105), (28, -0.045), (29, 0.008), (30, 0.101), (31, -0.301), (32, 0.019), (33, 0.182), (34, 0.052), (35, 0.109), (36, -0.063), (37, 0.152), (38, -0.29), (39, -0.192), (40, 0.203), (41, 0.196), (42, 0.237), (43, 0.222), (44, -0.034), (45, 0.288), (46, -0.031), (47, 0.096), (48, 0.04), (49, -0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93950194 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

Author: Francis R. Bach, David Heckerman, Eric Horvitz

Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification

2 0.40795201 85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets

Author: Janez Demšar

Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams. Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests

3 0.263127 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, Bernhard Schölkopf

Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling the standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm works for hundred thousands of examples or hundreds of kernels to be combined, and helps for automatic model selection, improving the interpretability of the learning result. In a second part we discuss general speed up mechanism for SVMs, especially when used with sparse feature maps as appear for string kernels, allowing us to train a string kernel SVM on a 10 million real-world splice data set from computational biology. We integrated multiple kernel learning in our machine learning toolbox SHOGUN for which the source code is publicly available at http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun. Keywords: multiple kernel learning, string kernels, large scale optimization, support vector machines, support vector regression, column generation, semi-infinite linear programming

4 0.2582688 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

Author: Pavel Laskov, Christian Gehl, Stefan Krüger, Klaus-Robert Müller

Abstract: Incremental Support Vector Machines (SVM) are instrumental in practical applications of online learning. This work focuses on the design and analysis of efficient incremental SVM learning, with the aim of providing a fast, numerically stable and robust implementation. A detailed analysis of convergence and of algorithmic complexity of incremental SVM learning is carried out. Based on this analysis, a new design of storage and numerical operations is proposed, which speeds up the training of an incremental SVM by a factor of 5 to 20. The performance of the new algorithm is demonstrated in two scenarios: learning with limited resources and active learning. Various applications of the algorithm, such as in drug discovery, online monitoring of industrial devices and and surveillance of network traffic, can be foreseen. Keywords: incremental SVM, online learning, drug discovery, intrusion detection

5 0.20176476 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

Author: Hichem Sahbi, Donald Geman

Abstract: We introduce a computational design for pattern detection based on a tree-structured network of support vector machines (SVMs). An SVM is associated with each cell in a recursive partitioning of the space of patterns (hypotheses) into increasingly finer subsets. The hierarchy is traversed coarse-to-fine and each chain of positive responses from the root to a leaf constitutes a detection. Our objective is to design and build a network which balances overall error and computation. Initially, SVMs are constructed for each cell with no constraints. This “free network” is then perturbed, cell by cell, into another network, which is “graded” in two ways: first, the number of support vectors of each SVM is reduced (by clustering) in order to adjust to a pre-determined, increasing function of cell depth; second, the decision boundaries are shifted to preserve all positive responses from the original set of training data. The limits on the numbers of clusters (virtual support vectors) result from minimizing the mean computational cost of collecting all detections subject to a bound on the expected number of false positives. When applied to detecting faces in cluttered scenes, the patterns correspond to poses and the free network is already faster and more accurate than applying a single pose-specific SVM many times. The graded network promotes very rapid processing of background regions while maintaining the discriminatory power of the free network. Keywords: statistical learning, hierarchy of classifiers, coarse-to-fine computation, support vector machines, face detection

6 0.17771843 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

7 0.15881011 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

8 0.14796087 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling

9 0.14227015 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

10 0.13299046 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

11 0.12803525 31 jmlr-2006-Exact 1-Norm Support Vector Machines Via Unconstrained Convex Differentiable Minimization     (Special Topic on Machine Learning and Optimization)

12 0.12750542 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

13 0.11943547 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

14 0.11520894 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

15 0.10806828 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

16 0.10606661 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

17 0.10593075 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

18 0.10551259 44 jmlr-2006-Large Scale Transductive SVMs

19 0.10390374 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

20 0.10372332 12 jmlr-2006-Active Learning with Feedback on Features and Instances


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.018), (35, 0.019), (36, 0.083), (50, 0.047), (63, 0.04), (76, 0.013), (78, 0.022), (81, 0.477), (84, 0.019), (90, 0.031), (91, 0.028), (96, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97122574 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling

Author: Seyoung Kim, Padhraic Smyth

Abstract: This paper proposes a general probabilistic framework for shape-based modeling and classification of waveform data. A segmental hidden Markov model (HMM) is used to characterize waveform shape and shape variation is captured by adding random effects to the segmental model. The resulting probabilistic framework provides a basis for learning of waveform models from data as well as parsing and recognition of new waveforms. Expectation-maximization (EM) algorithms are derived and investigated for fitting such models to data. In particular, the “expectation conditional maximization either” (ECME) algorithm is shown to provide significantly faster convergence than a standard EM procedure. Experimental results on two real-world data sets demonstrate that the proposed approach leads to improved accuracy in classification and segmentation when compared to alternatives such as Euclidean distance matching, dynamic time warping, and segmental HMMs without random effects. Keywords: waveform recognition, random effects, segmental hidden Markov models, EM algorithm, ECME

2 0.94818789 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

same-paper 3 0.88616478 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

Author: Francis R. Bach, David Heckerman, Eric Horvitz

Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification

4 0.68808073 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

5 0.62893492 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

6 0.51808959 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

7 0.51332378 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

8 0.48601392 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

9 0.48555809 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

10 0.47080967 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

11 0.46754196 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

12 0.46097153 49 jmlr-2006-Learning Parts-Based Representations of Data

13 0.46054393 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

14 0.45752496 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

15 0.45301309 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

16 0.45148957 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

17 0.44729909 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

18 0.4448424 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

19 0.44358617 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

20 0.43781996 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis