jmlr2010-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark D. Reid, Robert C. Williamson
Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
Reference: text
sentIndex sentText sentNum sentScore
1 We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. [sent-12, score-0.899]
2 We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. [sent-13, score-2.102]
3 Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. [sent-15, score-1.647]
4 A composite loss is the composition of a proper loss (defined below) and a link function (also defined below). [sent-19, score-1.094]
5 In this paper we study composite binary losses and develop a number of new characterisation results. [sent-20, score-1.035]
6 These build on results of Buja et al. (2005) applied to an analysis of composite losses by Masnadi-Shirazi and Vasconcelos (2009). [sent-22, score-0.877]
7 Informally, proper losses are well-calibrated losses for class probability estimation, that is, for the problem of not only predicting a binary classification label but also providing an estimate of the probability that an example will have a positive label. [sent-23, score-1.402]
8 A key goal of the present work is to consider composite losses in the general (non-symmetric) situation. [sent-34, score-0.877]
9 Specifically, Theorem 9 shows that such a loss is completely determined by the behaviour of one of its partial losses on half the unit interval. [sent-46, score-1.217]
10 In §4 we define composite losses as the composition of a CPE loss and a link. [sent-48, score-1.046]
11 The new contributions of this section are Theorem 10, which generalises Theorem 1 to composite losses, and Corollaries 12 and 14, which show how requiring properness completely determines the link function for composite and margin losses. [sent-49, score-1.024]
12 We also introduce a natural and intrinsic parametrisation of proper composite losses that is a generalisation of the weight function and show how it can be used to easily derive gradients for stochastic gradient descent algorithms. [sent-50, score-1.28]
13 In §5 we generalise the notion of classification calibration (Bartlett et al., 2006) so that it applies to non-symmetric composite losses. [sent-52, score-0.877]
14 We also describe how this new notion of classification calibration relates to proper CPE and composite losses via its connection with the weight function. [sent-55, score-1.288]
15 The main results of this paper are found in §6: Theorems 24 and 29 characterise when proper composite losses are convex. [sent-56, score-1.164]
16 To do so we define a well-founded notion of “best” surrogate loss and show that some convex surrogate losses are incommensurable on some problems. [sent-63, score-1.287]
17 In particular, we argue that the weight and link function parametrisation of losses provides a convenient way to work with an entire class of losses that are central to probability estimation and may provide new ways of approaching the problem of “surrogate tuning” (Nock and Nielsen, 2009b). [sent-66, score-1.51]
18 In it, we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. [sent-68, score-1.647]
19 We can always write an arbitrary loss in terms of its partial losses ℓ1 := ℓ(1, ·) and ℓ−1 := ℓ(−1, ·) using ℓ(y, v) = ⟦y = 1⟧ ℓ1(v) + ⟦y = −1⟧ ℓ−1(v). [sent-75, score-0.81]
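As a concrete illustration (my own sketch, not code from the paper), the snippet below encodes the square loss for class probability estimation through its partial losses ℓ1(η̂) = (1 − η̂)² and ℓ−1(η̂) = η̂², combined exactly as in the Iverson-bracket identity above.

```python
import numpy as np

# Partial losses of the CPE square loss: l1 is the penalty when y = +1,
# l_m1 the penalty when y = -1, as functions of the predicted probability eta_hat.
def l1(eta_hat):
    return (1.0 - eta_hat) ** 2

def l_m1(eta_hat):
    return eta_hat ** 2

def loss(y, eta_hat):
    # l(y, eta_hat) = [[y = 1]] l1(eta_hat) + [[y = -1]] l_m1(eta_hat)
    return np.where(y == 1, l1(eta_hat), l_m1(eta_hat))

print(loss(np.array([1, -1]), np.array([0.9, 0.9])))  # [0.01 0.81]
```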
20 Our definition of a loss function covers all commonly used margin losses (i.e., losses of the form φ(yv) for some function φ). [sent-76, score-0.835]
21 We will use η̂ instead of v as an argument to indicate losses for class probability estimation and use the shorthand CPE losses to distinguish them from general losses. [sent-80, score-1.152]
22 Losses for Class Probability Estimation. We begin by considering CPE losses, that is, functions ℓ : {−1, 1} × [0, 1] → [0, ∞), and briefly summarise a number of important existing structural results for proper losses—a large, natural class of losses for class probability estimation. [sent-103, score-0.826]
23 Losses that satisfy this constraint are said to be Fisher consistent and are known as proper losses (Buja et al. [sent-109, score-0.826]
24 A strictly proper loss is a proper loss for which the minimiser of L(η, η̂) over η̂ is unique. [sent-112, score-0.957]
25 In order to explicitly construct losses from their associated “weight functions” as shown in Theorem 7, we will require that the loss be definite, that is, its point-wise Bayes risk for deterministic events (i.e., η = 0 or η = 1) must be finite. [sent-116, score-0.838]
26 The Structure of Proper Losses. A key result in the study of proper losses is originally due to Shuford et al. (1966). [sent-125, score-0.826]
27 The following theorem characterises proper losses for probability estimation via a constraint on the relationship between their partial losses. [sent-128, score-0.931]
28 Suppose ℓ : {−1, 1} × [0, 1] → R is a loss and that its partial losses ℓ1 and ℓ−1 are both differentiable. [sent-130, score-0.81]
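To make the constraint concrete, here is a small numerical sketch (mine, not the paper's): for a proper CPE loss the two partial losses are tied together through a single weight function, w(η̂) = −ℓ′1(η̂)/(1 − η̂) = ℓ′−1(η̂)/η̂ (the identity-link case of (9) below). The check recovers the familiar log-loss weight w(η̂) = 1/(η̂(1 − η̂)) by finite differences.

```python
import numpy as np

def l1(p):        # log-loss partial loss for y = +1
    return -np.log(p)

def l_m1(p):      # log-loss partial loss for y = -1
    return -np.log(1.0 - p)

def num_deriv(f, p, h=1e-6):
    return (f(p + h) - f(p - h)) / (2.0 * h)

p = np.linspace(0.05, 0.95, 19)
w_from_l1 = -num_deriv(l1, p) / (1.0 - p)
w_from_lm1 = num_deriv(l_m1, p) / p

# Both partial losses yield the same weight function, here w(p) = 1/(p(1-p)).
print(np.allclose(w_from_l1, w_from_lm1, rtol=1e-5))             # True
print(np.allclose(w_from_l1, 1.0 / (p * (1.0 - p)), rtol=1e-5))  # True
```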
29 This simple characterisation of the structure of proper losses has a number of interesting implications. [sent-133, score-0.984]
30 Along with an extra constraint, this gives another characterisation of proper losses (Savage, 1971; Reid and Williamson, 2009a). [sent-146, score-0.984]
31 This allows one to make the connection, as Buja et al. (2005) do, between the regret ∆L(η, η̂) := L(η, η̂) − L(η) for proper losses and Bregman divergences. [sent-150, score-0.92]
32 For each fixed η the properness of ℓ requires that these convex combinations of the partial losses (each slice parallel to the left and right faces) are minimised when η̂ = η. [sent-158, score-0.856]
33 The coupling of the partial losses via the tangents to the conditional Bayes risk curve demonstrates why much of the structure of proper losses is determined by the curvature of L—that is, by the weight function w. [sent-161, score-1.698]
34 The relationship between a proper loss and its associated weight function is captured succinctly by Schervish (1989) via the following representation of proper losses as a weighted integral of the cost-weighted misclassification losses ℓc defined in (1). [sent-162, score-1.925]
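A minimal numerical sketch of this integral representation, under the convention ℓc(1, η̂) = (1 − c)⟦η̂ ≤ c⟧ and ℓc(−1, η̂) = c⟦η̂ > c⟧ for the cost-weighted misclassification loss (the paper's equation (1) may normalise ℓc slightly differently): with the log-loss weight w(c) = 1/(c(1 − c)), the weighted integral of ℓc(1, ·) reproduces ℓ1(η̂) = −ln η̂.

```python
import numpy as np

def w(c):                       # weight function of log loss
    return 1.0 / (c * (1.0 - c))

def lc_pos(c, eta_hat):         # cost-weighted loss for y = +1 (assumed convention)
    return (1.0 - c) * (eta_hat <= c)

def l1_from_weight(eta_hat, n=200_000):
    # l1(eta_hat) = int_0^1 lc(1, eta_hat) w(c) dc, evaluated by the midpoint rule
    c = (np.arange(n) + 0.5) / n
    return float(np.mean(lc_pos(c, eta_hat) * w(c)))

for p in (0.2, 0.5, 0.8):
    print(round(l1_from_weight(p), 4), round(-np.log(p), 4))  # the two columns agree
```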
35 Requiring a loss to be proper and symmetric constrains the partial losses significantly. [sent-201, score-1.1]
36 Thus in order to specify a symmetric proper loss, one needs to only specify one of the partial losses on one half of the interval [0, 1]. [sent-206, score-0.955]
37 We have thus shown: Theorem 9 If a loss is proper and symmetric, then it is completely determined by specifying one of the partial losses on half the unit interval (either [0, 1/2] or [1/2, 1]) and using (6) and (7). [sent-208, score-1.084]
38 In this paper, our focus will be composite losses for binary class probability estimation. [sent-221, score-0.877]
39 A property Γ is said to be elicitable whenever there exists a strictly proper loss ℓ for it so that the composite loss ℓΓ satisfies, for all η̂ ≠ η, LΓ(η, η̂) := EY∼η[ℓΓ(Y, η̂)] > LΓ(η, η). [sent-241, score-0.92]
40 Proper Composite Losses. We will call a composite loss ℓψ (8) a proper composite loss if ℓ in (8) is a proper loss for class probability estimation. [sent-249, score-1.609]
41 As in the case for losses for probability estimation, the requirement that a composite loss be proper imposes some constraints on its partial losses. [sent-250, score-1.361]
42 Many of the results for proper losses carry over to composite losses with some extra factors to account for the link function. [sent-251, score-1.908]
43 Theorem 10 Let λ = ℓψ be a composite loss with differentiable and strictly monotone link ψ and suppose the partial losses λ−1(v) and λ1(v) are both differentiable. [sent-254, score-1.399]
44 Then λ is a proper composite loss if and only if there exists a weight function w : (0, 1) → R+ such that for all η̂ ∈ (0, 1), −λ′1(ψ(η̂)) / (1 − η̂) = λ′−1(ψ(η̂)) / η̂ = w(η̂) / ψ′(η̂) =: ρ(η̂), (9) where equality is interpreted in the distributional sense. [sent-255, score-0.806]
45 Proof This is a direct consequence of Theorem 1 for proper losses for probability estimation and the chain rule applied to ℓy(η̂) = λy(ψ(η̂)). [sent-257, score-0.826]
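To see (9) in action, the sketch below (my construction) composes log loss with the logit link ψ(η̂) = ln(η̂/(1 − η̂)). The composite partial losses are then the logistic margin losses ln(1 + e^∓v), and since ψ′(η̂) = 1/(η̂(1 − η̂)) coincides with the log-loss weight, ρ(η̂) = w(η̂)/ψ′(η̂) should be identically 1; the finite-difference check confirms both equalities in (9).

```python
import numpy as np

lam1 = lambda v: np.log1p(np.exp(-v))     # composite partial loss for y = +1 (logistic)
lam_m1 = lambda v: np.log1p(np.exp(v))    # composite partial loss for y = -1

def num_deriv(f, v, h=1e-6):
    return (f(v + h) - f(v - h)) / (2.0 * h)

eta_hat = np.linspace(0.1, 0.9, 9)
v = np.log(eta_hat / (1.0 - eta_hat))     # v = psi(eta_hat), the logit link

lhs = -num_deriv(lam1, v) / (1.0 - eta_hat)   # -lambda_1'(psi(eta)) / (1 - eta)
mid = num_deriv(lam_m1, v) / eta_hat          # lambda_{-1}'(psi(eta)) / eta
# w(eta) / psi'(eta) = 1 here, so both ratios should equal rho(eta) = 1.
print(np.allclose(lhs, 1.0, atol=1e-5), np.allclose(mid, 1.0, atol=1e-5))  # True True
```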
46 Corollary 11 Suppose ℓψ is a proper composite loss with conditional risk denoted Lψ . [sent-262, score-0.865]
47 Then ∂/∂v Lψ(η, v) = (ψ−1(v) − η) ρ(ψ−1(v)) (10). Loosely speaking then, ρ is a “co-ordinate free” weight function for composite losses where the link function ψ is interpreted as a mapping from arbitrary v ∈ V to values which can be interpreted as probabilities. [sent-264, score-1.168]
48 Corollary 12 Let λ := ℓψ be a composite loss with differentiable partial losses λ1 and λ−1 . [sent-266, score-1.136]
49 Corollary 12 shows that for a given link ψ one can specify one of the partial losses λy but then properness fixes the other partial loss λ−y . [sent-271, score-1.185]
50 We see then that Corollary 12 provides us with a way of constructing a reference link for arbitrary composite losses specified by their partial losses. [sent-273, score-1.147]
51 Derivatives of Composite Losses. We now briefly consider an application of the parametrisation of proper losses as a weight function and link. [sent-281, score-0.979]
52 Given an arbitrary weight function w (which defines a proper loss via Corollary 2 and Theorem 4) and link ψ, the above equations show that one could implement SGD directly parametrised in terms of ρ without needing to explicitly compute the partial losses themselves. [sent-284, score-1.391]
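A sketch of the kind of SGD update this makes possible (my illustration; the specific w and ψ below are arbitrary choices, not the paper's prescription): from (9), the derivative of the composite loss in the score v is ∂λ(y, v)/∂v = ρ(η̂)(η̂ − ⟦y = 1⟧) with η̂ = ψ−1(v), so for a linear scorer v = θ·x the gradient needs only ρ and the inverse link, never the partial losses themselves.

```python
import numpy as np

def psi_inv(v):                 # inverse link (logit link, an illustrative choice)
    return 1.0 / (1.0 + np.exp(-v))

def rho(p):                     # rho = w / psi'; both chosen as 1/(p(1-p)) here, so rho = 1
    w = 1.0 / (p * (1.0 - p))
    psi_deriv = 1.0 / (p * (1.0 - p))
    return w / psi_deriv

def sgd_step(theta, x, y, lr=0.1):
    """One SGD step for the composite loss of a linear scorer v = theta . x:
    d loss / d v = rho(eta_hat) * (eta_hat - [[y == 1]]), then the chain rule."""
    v = theta @ x
    eta_hat = psi_inv(v)
    grad_v = rho(eta_hat) * (eta_hat - float(y == 1))
    return theta - lr * grad_v * x

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    x = rng.normal(size=2)
    y = 1 if x[0] + 0.5 * x[1] > 0 else -1   # toy labels from the direction (1, 0.5)
    theta = sgd_step(theta, x, y)
print(theta)   # learns a direction positively correlated with (1, 0.5)
```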
53 In this section we explore the relationship between these margin losses and the more general class of composite losses and, in particular, symmetric composite losses. [sent-291, score-1.902]
54 Recall that a general composite loss is of the form ℓψ(y, v) = ℓ(y, ψ−1(v)) for a loss ℓ : Y × [0, 1] → [0, ∞) and an invertible link ψ : [0, 1] → R. [sent-292, score-0.889]
55 As discussed above, proper losses are a natural class of losses over [0, 1] for probability estimation so a natural question in this vein is the following: given a margin loss φ can we choose a link ψ so that there exists a proper loss ℓ such that φ(yv) = ℓψ (y, v)? [sent-294, score-2.285]
56 The following corollary of Theorem 10 gives necessary and sufficient conditions on the choice of link ψ to guarantee when a margin loss φ can be expressed as a proper composite loss. [sent-296, score-1.096]
57 Then, φ(yv) can be expressed as a proper composite loss ℓψ(y, v) if and only if the link ψ satisfies ψ−1(v) = φ′(−v) / (φ′(−v) + φ′(v)). [sent-298, score-0.925]
58 We now consider a couple of specific margin losses and show how they can be associated with a proper loss through the choice of link given in Corollary 14. [sent-305, score-1.29]
59 The exponential loss φ(v) = e^−v gives rise to a proper loss ℓ(y, η̂) = φ(y ψ(η̂)) via the link ψ−1(v) = −e^v / (−e^v − e^−v) = 1 / (1 + e^−2v), which has non-zero denominator. [sent-306, score-0.793]
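A quick check of the link formula above (my sketch): plugging the exponential loss into ψ−1(v) = φ′(−v)/(φ′(−v) + φ′(v)) yields 1/(1 + e^−2v) as claimed, and the logistic margin loss φ(v) = ln(1 + e^−v) yields the usual sigmoid.

```python
import numpy as np

def link_inverse(phi_prime, v):
    # psi^{-1}(v) = phi'(-v) / (phi'(-v) + phi'(v))   (Corollary 14, as reconstructed)
    return phi_prime(-v) / (phi_prime(-v) + phi_prime(v))

v = np.linspace(-3.0, 3.0, 13)
exp_prime = lambda t: -np.exp(-t)                        # phi(t) = e^{-t}
log_prime = lambda t: -np.exp(-t) / (1.0 + np.exp(-t))   # phi(t) = ln(1 + e^{-t})

print(np.allclose(link_inverse(exp_prime, v), 1.0 / (1.0 + np.exp(-2.0 * v))))  # True
print(np.allclose(link_inverse(log_prime, v), 1.0 / (1.0 + np.exp(-v))))        # True
```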
60 This family of differentiable convex losses approximates the hinge loss as α → ∞ and was studied in the multiclass case by Zhang et al. [sent-309, score-0.853]
61 We first define classification calibration for CPE losses (that is, over the unit interval [0, 1]) and relate it to composite losses via a link. [sent-327, score-0.901]
62 The following theorem is a generalisation of the characterisation of CC1/2 for margin losses due to Bartlett et al. (2006). [sent-337, score-0.876]
63 Calibration for Composite Losses. The translation of the above results to general proper composite losses with invertible differentiable link ψ is straightforward. [sent-350, score-1.402]
64 Theorem 16 then immediately gives: Corollary 20 A composite loss ℓψ (·, ·) = ℓ(·, ψ−1 (·)) with invertible and differentiable link ψ is CCc for all c ∈ (0, 1) if and only if the associated proper loss ℓ is strictly proper. [sent-352, score-1.195]
65 Theorem 17 immediately gives: Corollary 21 Suppose ℓψ is as in Corollary 20 and that the partial losses ℓ1 and ℓ−1 of the associated proper loss ℓ are differentiable. [sent-353, score-1.06]
66 It can be shown that in the special case of margin losses Lφ , which satisfy the conditions of Corollary 14 such that they are proper composite losses, Corollary 21 leads to the condition φ′ (0) < 0 which is the same as obtained by Bartlett et al. [sent-355, score-1.217]
67 Convexity of Composite Losses. We have seen that composite losses are defined by the proper loss ℓ and the link ψ. [sent-358, score-1.501]
68 We have further seen from (14) that it is natural to parametrise composite losses in terms of w and ψ′ , and combine them as ρ. [sent-359, score-0.903]
69 One may wish to choose a weight function w and determine which links ψ lead to a convex loss; or choose a link ψ and determine which weight functions w (and hence proper losses) lead to a convex composite loss. [sent-360, score-1.129]
70 The main result of this section is Theorem 29, which answers these questions by characterising the convexity of composite losses in terms of (w, ψ′) or ρ. [sent-361, score-0.933]
71 The derivative ∂/∂v Lψ(η, v) only changes sign at η = ψ−1(v). The following theorem characterises convexity of composite losses with invertible links. [sent-381, score-1.052]
72 Theorem 24 Let ℓψ (y, v) be a composite loss comprising an invertible link ψ with inverse q := ψ−1 and strictly proper loss with weight function w. [sent-382, score-1.256]
73 This theorem suggests that a very natural parametrisation of composite losses is via (w, ψ′). [sent-385, score-0.996]
74 Corollary 26 Composite losses with a linear link (including as a special case the identity link) are convex if and only if −1/x ≤ w′(x)/w(x) ≤ 1/(1 − x), ∀x ∈ (0, 1). [sent-416, score-0.864]
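A numerical reading of this condition as reconstructed above (my sketch, so treat the exact constants as an assumption): the square-loss weight w ≡ 1 and the log-loss weight w(x) = 1/(x(1 − x)) satisfy −1/x ≤ w′(x)/w(x) ≤ 1/(1 − x) on all of (0, 1), while the heavier-tailed weight w(x) = (x(1 − x))^−3/2 violates it near the endpoints, so its proper loss is not convex under the identity link.

```python
import numpy as np

def convex_with_identity_link(w, x, h=1e-6):
    """Check -1/x <= w'(x)/w(x) <= 1/(1-x) on a grid (the reconstructed Corollary 26)."""
    ratio = (w(x + h) - w(x - h)) / (2.0 * h) / w(x)
    return bool(np.all(ratio >= -1.0 / x) and np.all(ratio <= 1.0 / (1.0 - x)))

x = np.linspace(0.01, 0.99, 197)
print(convex_with_identity_link(lambda t: np.ones_like(t), x))           # True  (square loss)
print(convex_with_identity_link(lambda t: 1.0 / (t * (1.0 - t)), x))     # True  (log loss)
print(convex_with_identity_link(lambda t: (t * (1.0 - t)) ** -1.5, x))   # False (fails near 0 and 1)
```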
75 Theorem 28 A composite loss comprising a proper loss with weight function w combined with its canonical link is always convex. [sent-427, score-1.221]
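A numerical illustration of Theorem 28 (a sketch under the assumption that the canonical link is the one with ψ′ = w, so that ρ ≡ 1): take the weight (x(1 − x))^−3/2 from the previous snippet, whose identity-link loss is not convex, build its canonical link by numerical integration, and observe that both composite partial-loss derivatives, −(1 − ψ−1(v)) and ψ−1(v), are non-decreasing in v, so the composite loss is convex.

```python
import numpy as np

# Theorem 28 demo (a sketch; assumes the canonical link satisfies psi'(p) = w(p), i.e. rho = 1).
w = lambda p: (p * (1.0 - p)) ** -1.5       # identity-link loss for this weight is NOT convex
p = np.linspace(1e-3, 1.0 - 1e-3, 500_001)
psi = np.cumsum(w(p)) * (p[1] - p[0])       # canonical link, up to an irrelevant constant

v = np.linspace(psi[0] + 1.0, psi[-1] - 1.0, 2001)
q = np.interp(v, psi, p)                    # q = psi^{-1}(v), inverting the monotone table

# With the canonical link, (9) gives lambda_1'(v) = -(1 - q(v)) and lambda_{-1}'(v) = q(v).
# Both derivatives are non-decreasing in v, so both composite partial losses are convex,
# even though this weight violates the identity-link condition from the previous snippet.
print(bool(np.all(np.diff(-(1.0 - q)) >= 0)), bool(np.all(np.diff(q) >= 0)))   # True True
```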
76 Theorem 29 Consider a proper composite loss ℓψ with invertible link ψ and (strictly proper) weight w normalised such that w(1/2) = 1. [sent-437, score-1.108]
77 Figure 2: Allowable normalised weight functions to ensure convexity of composite loss functions with identity link (left) and logistic link (right). [sent-453, score-1.074]
78 If one wants inverse links defined on the whole real line (such as the logistic link) then one cannot obtain a convex composite loss with the associated proper loss having a weight function satisfying (32). [sent-465, score-1.129]
79 Thus one cannot choose an effectively usable link to ensure convexity of a proper loss that is arbitrarily “close to” 0-1 loss in the sense of the corresponding weight functions. [sent-466, score-0.935]
80 Convex surrogate losses are often used in place of the 0-1 loss, which is not convex. [sent-487, score-0.964]
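For reference, a few of the standard convex margin surrogates that are typically substituted for the 0-1 loss (these are textbook definitions, not specific to this paper):

```python
import numpy as np

def zero_one(margin):      # 0-1 loss of the margin yv (not convex)
    return (margin <= 0).astype(float)

def hinge(margin):         # max(0, 1 - yv)
    return np.maximum(0.0, 1.0 - margin)

def logistic(margin):      # ln(1 + e^{-yv})
    return np.log1p(np.exp(-margin))

def exponential(margin):   # e^{-yv}
    return np.exp(-margin)

m = np.linspace(-2.0, 2.0, 5)
for name, f in [("0-1", zero_one), ("hinge", hinge), ("logistic", logistic), ("exponential", exponential)]:
    print(name, np.round(f(m), 3))
```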
81 To date, work on surrogate losses has focussed on margin losses, which are necessarily symmetric with respect to false positives and false negatives (Buja et al., 2005). [sent-492, score-1.527]
82 Given a fixed experiment P, if L is a class of losses then the best surrogate losses in L for the reference loss ℓref are those that minimise the ℓref-surrogate penalty. [sent-508, score-1.54]
83 This definition is motivated by the manner in which surrogate losses are used—one minimizes L(h) over h to obtain the minimiser h∗ and one hopes that Lref (h∗ ) is small. [sent-509, score-0.883]
84 They explicitly considered proper losses and said that “minimizing any [lower-bounded, symmetric proper] loss amounts to the same ultimate goal” and concluded that “the crux of the choice of the [loss] relies on data-dependent considerations”. [sent-523, score-1.035]
85 One can construct experiments (η1 , M) and (η2 , M) and proper losses ℓ1 and ℓ2 such that Sℓ0−1 (ℓ1 , (η1 , M)) > Sℓ0−1 (ℓ2 , (η1 , M)) but Sℓ0−1 (ℓ1 , (η2 , M)) < Sℓ0−1 (ℓ2 , (η2 , M)). [sent-525, score-0.826]
86 However, this does not imply that there cannot exist a particular convex ℓ∗ that minorizes all proper losses in this sense. [sent-527, score-0.909]
87 To prove the above conjecture it would suffice to show that for a fixed hypothesis class and any pair of losses one can construct two experiments such that one loss minorises the other loss on one experiment and vice versa on the other experiment. [sent-530, score-0.934]
88 Considering all symmetric proper losses normalised such that w(1/2) = 1, the right side of (37) is maximised and thus the bound on ∆L1/2 in terms of ∆L is minimised when L(1/2 + α) is maximised (over all losses normalised as mentioned). [sent-554, score-1.573]
89 We conjecture that ℓminimal is somehow special in the class of proper convex losses in some way other than being the pointwise minimiser of weights (and the normalised loss with smallest regret bound with respect to ℓ0−1 ), but the exact nature of the specialness still eludes us. [sent-566, score-1.332]
90 Furthermore the example in the appendix below seems to require loss functions whose corresponding weight functions cross over each other and there is no weight function corresponding to a convex proper loss that crosses over wminimal . [sent-569, score-0.843]
91 We have characterised a number of aspects of composite losses: their relationship to margin losses, the connection between properness and classification calibration, the constraints symmetry imposes, when composite losses are convex, and natural ways to parametrise them. [sent-573, score-1.116]
92 We use a simple linear hypothesis class H := {hα(x) := αx : α ∈ [0, 1]} with identity link function and consider the two surrogate proper losses ℓ1 and ℓ2 with weight functions w1(c) and w2(c). [sent-594, score-1.336]
93 Thus for problem η1 the surrogate loss L2 has a constrained Bayes optimal hypothesis with a lower 0-1 risk than the constrained Bayes optimal hypothesis for the surrogate loss L1. [sent-610, score-0.869]
94 Convexity and Robustness. In this appendix we show how the characterisation of the convexity of proper losses (Theorem 29) allows one to make general algorithm-independent statements about the robustness of convex proper losses to random misclassification noise. [sent-647, score-2.019]
95 This has led to the recent proposal of boosting algorithms that use non-convex margin losses and experimental evidence suggests that these are more robust to class noise than their convex counterparts. [sent-653, score-0.821]
96 Freund (2009) recently described RobustBoost, which uses a parameterised family of non-convex surrogate losses that approximates the 0-1 loss as the number of boosting iterations increases. [sent-654, score-0.993]
97 This suggests that strictly proper losses are not robust to any class noise. [sent-681, score-0.882]
98 This theorem allows us to characterise the robustness of arbitrary proper losses by appealing to the integral representation in (4). [sent-704, score-0.985]
99 By Corollary 30 we see that convex proper losses are strictly proper and thus have weight functions which are non-zero for all c ∈ [0, 1] and so by Theorem 40 we have the following corollary. [sent-715, score-1.276]
100 Also, our focus on proper losses excludes some convex losses (such as the hinge loss) that are covered by Long and Servedio's results. [sent-726, score-1.485]
wordName wordTfidf (topN-words)
[('losses', 0.576), ('composite', 0.301), ('proper', 0.25), ('surrogate', 0.219), ('link', 0.205), ('loss', 0.169), ('eid', 0.166), ('illiamson', 0.166), ('characterisation', 0.158), ('osses', 0.149), ('inary', 0.127), ('omposite', 0.115), ('buja', 0.106), ('properness', 0.105), ('ref', 0.105), ('cpe', 0.096), ('regret', 0.094), ('risk', 0.093), ('margin', 0.09), ('minimiser', 0.088), ('reid', 0.088), ('weight', 0.086), ('convex', 0.083), ('corollary', 0.081), ('dw', 0.08), ('ccc', 0.075), ('robustness', 0.07), ('vasconcelos', 0.067), ('parametrisation', 0.067), ('bregman', 0.067), ('partial', 0.065), ('shuford', 0.061), ('lc', 0.061), ('convexity', 0.056), ('yv', 0.055), ('calibrated', 0.054), ('williamson', 0.054), ('nock', 0.053), ('normalised', 0.052), ('conditional', 0.052), ('theorem', 0.052), ('invertible', 0.045), ('gneiting', 0.044), ('minimisers', 0.044), ('servedio', 0.043), ('canonical', 0.041), ('bayes', 0.041), ('parametrised', 0.04), ('symmetric', 0.04), ('inf', 0.04), ('characterise', 0.037), ('calibration', 0.037), ('bartlett', 0.037), ('links', 0.035), ('dx', 0.035), ('lambert', 0.035), ('savage', 0.035), ('lw', 0.035), ('divergences', 0.033), ('strictly', 0.031), ('dc', 0.03), ('misclassi', 0.03), ('surrogates', 0.03), ('nielsen', 0.03), ('boosting', 0.029), ('steinwart', 0.029), ('ln', 0.028), ('minimised', 0.027), ('suppose', 0.027), ('confer', 0.026), ('focussed', 0.026), ('kalnishkan', 0.026), ('parametrise', 0.026), ('schervish', 0.026), ('generalised', 0.026), ('classi', 0.025), ('differentiable', 0.025), ('robust', 0.025), ('iff', 0.024), ('interval', 0.024), ('sta', 0.022), ('characterises', 0.022), ('generalises', 0.022), ('granger', 0.022), ('mccullagh', 0.022), ('normalise', 0.022), ('permissible', 0.022), ('notion', 0.021), ('endpoints', 0.02), ('vx', 0.02), ('conjecture', 0.02), ('log', 0.019), ('raftery', 0.019), ('lf', 0.019), ('noise', 0.018), ('proof', 0.018), ('relationship', 0.018), ('immediate', 0.018), ('fair', 0.018), ('aczel', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 25 jmlr-2010-Composite Binary Losses
2 0.12044732 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
3 0.068909384 109 jmlr-2010-Stochastic Composite Likelihood
Author: Joshua V. Dillon, Guy Lebanon
Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation
4 0.063092522 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
Author: Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene, Karel Crombecq
Abstract: An exceedingly large number of scientific and engineering fields are confronted with the need for computer simulations to study complex, real world phenomena or solve challenging design problems. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques have become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, prototyping, and sensitivity analysis. Consequently, in many fields there is great interest in tools and techniques that facilitate the construction of such regression models, while minimizing the computational cost and maximizing model accuracy. This paper presents a mature, flexible, and adaptive machine learning toolkit for regression modeling and active learning to tackle these issues. The toolkit brings together algorithms for data fitting, model selection, sample selection (active learning), hyperparameter optimization, and distributed computing in order to empower a domain expert to efficiently generate an accurate model for the problem or data at hand. Keywords: surrogate modeling, metamodeling, function approximation, model selection, adaptive sampling, active learning, distributed computing 1. Background and Motivation In many science and engineering problems researchers make heavy use of computer simulation codes in order to replace expensive physical experiments and improve the quality and performance of engineered products and devices. Such simulation activities are collectively referred to as computational science/engineering. Unfortunately, while allowing scientists more flexibility to study phenomena under controlled conditions, computer simulations require a substantial investment of c 2010 Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene and Karel Crombecq. G ORISSEN , C OUCKUYT, D EMEESTER , D HAENE AND C ROMBECQ computation time. One simulation may take many minutes, hours, days or even weeks, quickly rendering parameter studies impractical (Forrester et al., 2008; Simpson et al., 2008). Of the different ways to deal with this problem, this paper is concerned with the construction of simpler approximation models to predict the system performance and develop a relationship between the system inputs and outputs. When properly constructed, these approximation models mimic the behavior of the simulation accurately while being computationally cheap(er) to evaluate. Different approximation methods exist, each with their relative merits. This work concentrates on the use of data-driven, global approximations using compact surrogate models (also known as metamodels, replacement models, or response surface models). Examples include: rational functions, Kriging models, Artificial Neural Networks (ANN), splines, and Support Vector Machines (SVM). Once such a global approximation is available it is of great use for gaining insight into the behavior of the underlying system. The surrogate may be easily queried, optimized, visualized, and seamlessly integrated into CAD/CAE software packages. The challenge is thus how to generate an approximation model that is as accurate as possible over the complete domain of interest while minimizing the simulation cost. 
Solving this challenge involves multiple sub-problems that must be addressed: how to interface with the simulation code, how to run simulations (locally, or on a cluster or cloud), which model type to approximate the data with and how to set the model complexity (e.g., topology of a neural network), how to estimate the model quality and ensure the domain expert trusts the model, how to decide which simulations to run (data collection), etc. The data collection aspect is worth emphasizing. Since data is computationally expensive to obtain and the optimal data distribution is not known up front, data points should be selected iteratively, there where the information gain will be the greatest. A sampling function is needed that minimizes the number of sample points selected in each iteration, yet maximizes the information gain of each iteration step. This process is called adaptive sampling but is also known as active learning, or sequential design. There is a complex dependency web between these different options and dealing with these dependencies is non-trivial, particularly for a domain expert for whom the surrogate model is just an intermediate step towards solving a larger, more important problem. Few domain experts will be experts in the intricacies of efficient sampling and modeling strategies. Their primary concern is obtaining an accurate replacement metamodel for their problem as fast as possible and with minimal overhead (Gorissen et al., 2009d). As a result these choices are often made in a pragmatic, sometimes even ad-hoc, manner. This paper discusses an advanced, and integrated software framework that provides a flexible and rigorous means to tackle such problems. This work lies at the intersection of Machine Learning/AI, Modeling and Simulation, and Distributed Computing. The methods developed are applicable to any domain where a cheap, accurate, approximation is needed to replace some expensive reference model. Our experience has been that the availability of such a framework can facilitate the transfer of knowledge from surrogate modeling researchers and lower the barrier of entry for domain experts. 2. SUMO Toolbox The platform in question is the Matlab SUrrogate MOdeling (SUMO) Toolbox, illustrated in Figure 1. Given a simulation engine (Fluent, Cadence, Abaqus, HFSS, etc.) or other data source (data 2052 A S URROGATE M ODELING AND A DAPTIVE S AMPLING TOOLBOX FOR C OMPUTER BASED D ESIGN Figure 1: The SUMO Toolbox is a flexible framework for accurate global surrogate modeling and adaptive sampling (active learning). It features a rich set of plugins, is applicable to a wide range of domains, and can be applied in an autonomous, black-box fashion, or under full manual control. Written in Matlab and Java it is fully cross platform and comes with a large (60+) number of example problems. set, Matlab script, Java class, etc.), the toolbox drives the data source to produce a surrogate model within the time and accuracy constraints set by the user. 
The SUMO Toolbox adopts a microkernel design philosophy with many different plugins available for each of the different sub-problems:1 model types (rational functions, Kriging, splines, SVM, ANN, etc.), hyperparameter optimization algorithms (Particle Swarm Optimization, Efficient Global Optimization, simulated annealing, Genetic Algorithm, etc.), model selection algorithms (cross validation, AIC, Leave-out set, etc.), sample selection (random, error based, density based, hybrid, etc.), Design of Experiments (Latin hypercube, Box-Bhenken, etc.), and sample evaluation methods (local, on a cluster or grid). The behavior of each software component is configurable through a central XML file and components can easily be added, removed or replaced by custom implementations. In addition the toolbox provides ‘meta’ plugins. For example to automatically select the best model type for a given problem (Gorissen et al., 2009d) or to use multiple model selection or sample selection criteria in concert (Gorissen et al., 2010). Furthermore, there is built-in support for high performance computing. On the modeling side, the model generation process can take full advantage of multi-core CPUs and even of a complete cluster or grid. This can result in significant speedups for model types where the fitting process can be expensive (e.g., neural networks). Likewise, sample evaluation (simulation) can occur locally (with the option to take advantage of multi-core architectures) or on a separate compute cluster or grid (possibly accessed through a remote head-node). All interfacing with the grid middleware 1. The full list of plugins and features can be found at http://www.sumowiki.intec.ugent.be. 2053 G ORISSEN , C OUCKUYT, D EMEESTER , D HAENE AND C ROMBECQ (submission, job monitoring, rescheduling of failed/lost simulation points, etc.) is handled transparently and automatically (see Gorissen et al., 2009c for more details). Also, the sample evaluation component runs in parallel with the other components (non-blocking) and not sequentially. This allows for an optimal use of computational resources. In addition the SUMO Toolbox contains extensive logging and profiling capabilities so that the modeling process can easily be tracked and the modeling decisions understood. Once a final model has been generated, a GUI tool is available to visually explore the model (including derivatives and prediction uncertainty), assess its quality, and export it for use in other software tools. 3. Applications The SUMO Toolbox has already been applied successfully to a very wide range of applications, including RF circuit block modeling (Gorissen et al., 2009b), hydrological modeling (Couckuyt et al., 2009), Electronic Packaging (Zhu and Franzon, 2009), aerodynamic modeling (Gorissen et al., 2009a), process engineering (Stephens et al., 2009), and automotive data modeling (Gorissen et al., 2010). Besides global modeling capabilities, the SUMO Toolbox also includes a powerful optimization framework based on the Efficient Global Optimization framework developed by Jones et al. (1998). As of version 6.1, the toolbox also contains an example of how the framework can also be applied to solve classification problems. In sum, the goal of the toolbox is to fill the void in machine learning software when it comes to the challenging, costly, real-valued, problems faced in computational engineering. 
The toolbox is in use successfully at various institutions and we are continuously refining and extending the set of available plugins as the number of applications increase. Usage instructions, design documentation, and stable releases for all major platforms can be found at http://www.sumo.intec.ugent.be. References I. Couckuyt, D. Gorissen, H. Rouhani, E. Laermans, and T. Dhaene. Evolutionary regression modeling with active learning: An application to rainfall runoff modeling. In International Conference on Adaptive and Natural Computing Algorithms, volume LNCS 5495, pages 548–558, Sep. 2009. A. Forrester, A. Sobester, and A. Keane. Engineering Design Via Surrogate Modelling: A Practical Guide. Wiley, 2008. D. Gorissen, K. Crombecq, I. Couckuyt, and T. Dhaene. Foundations of Computational Intelligence, Volume 1: Learning and Approximation: Theoretical Foundations and Applications, volume 201, chapter Automatic approximation of expensive functions with active learning, pages 35–62. Springer Verlag, Series Studies in Computational Intelligence, 2009a. D. Gorissen, L. De Tommasi, K. Crombecq, and T. Dhaene. Sequential modeling of a low noise amplifier with neural networks and active learning. Neural Computing and Applications, 18(5): 485–494, Jun. 2009b. D. Gorissen, T. Dhaene, P. Demeester, and J. Broeckhove. Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications, chapter Grid enabled surrogate modeling, pages 249–258. IGI Global, May 2009c. 2054 A S URROGATE M ODELING AND A DAPTIVE S AMPLING TOOLBOX FOR C OMPUTER BASED D ESIGN D. Gorissen, T. Dhaene, and F. DeTurck. Evolutionary model type selection for global surrogate modeling. Journal of Machine Learning Research, 10:2039–2078, 2009d. D. Gorissen, I. Couckuyt, E. Laermans, and T. Dhaene. Multiobjective global surrogate modeling,dealing with the 5-percent problem. Engineering with Computers, 26(1):81–89, Jan. 2010. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, Nov. 1998. ISSN 0925-5001. T. W. Simpson, V. Toropov, V. Balabanov, and F. A. C. Viana. Design and analysis of computer experiments in multidisciplinary design optimization: a review of how far we have come or not. In Proceedings of the 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2008 MAO, Victoria, Canada, 2008. D.W. Stephens, D. Gorissen, and T. Dhaene. Surrogate based sensitivity analysis of process equipment. In Proc. of 7th International Conference on CFD in the Minerals and Process Industries, CSIRO, Melbourne, Australia, Dec. 2009. T. Zhu and P. D. Franzon. Application of surrogate modeling to generate compact and PVT-sensitive IBIS models. In Proceedings of the 18th Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), Oct. 2009. 2055
5 0.057860665 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
6 0.054190557 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
7 0.051303484 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
8 0.049479648 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
9 0.045379598 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
10 0.043554988 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
11 0.04275541 22 jmlr-2010-Classification Using Geometric Level Sets
12 0.042527419 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
13 0.041160367 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
14 0.040804692 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
15 0.040326715 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
16 0.037881609 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
17 0.037611969 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
18 0.036679387 102 jmlr-2010-Semi-Supervised Novelty Detection
19 0.03527436 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
20 0.034970786 84 jmlr-2010-On Spectral Learning
topicId topicWeight
[(0, -0.165), (1, -0.076), (2, 0.092), (3, 0.018), (4, -0.005), (5, 0.031), (6, 0.051), (7, 0.13), (8, -0.032), (9, 0.038), (10, 0.025), (11, -0.159), (12, -0.037), (13, -0.128), (14, -0.08), (15, 0.023), (16, -0.076), (17, -0.042), (18, -0.101), (19, 0.064), (20, -0.063), (21, -0.034), (22, -0.02), (23, -0.024), (24, 0.065), (25, -0.129), (26, -0.06), (27, 0.027), (28, 0.079), (29, 0.293), (30, 0.061), (31, 0.027), (32, 0.263), (33, -0.002), (34, -0.205), (35, 0.117), (36, -0.055), (37, 0.065), (38, -0.051), (39, -0.111), (40, 0.107), (41, 0.061), (42, -0.019), (43, -0.108), (44, -0.238), (45, -0.059), (46, 0.017), (47, 0.028), (48, -0.025), (49, -0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.97205013 25 jmlr-2010-Composite Binary Losses
2 0.62286907 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
3 0.61610013 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design
4 0.30603802 109 jmlr-2010-Stochastic Composite Likelihood
Author: Joshua V. Dillon, Guy Lebanon
Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolves the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness, thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation
5 0.28843328 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
6 0.28788322 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
8 0.22516617 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
9 0.21913396 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
10 0.21804595 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
11 0.21702377 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
12 0.21323012 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for the ℓq Loss in ℓr Balls
13 0.21195592 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
14 0.20939812 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
15 0.20668404 80 jmlr-2010-On-Line Sequential Bin Packing
16 0.20342611 22 jmlr-2010-Classification Using Geometric Level Sets
17 0.20110819 84 jmlr-2010-On Spectral Learning
18 0.19931155 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
19 0.19114393 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
20 0.18778513 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
topicId topicWeight
[(1, 0.023), (3, 0.043), (4, 0.016), (8, 0.02), (15, 0.028), (21, 0.026), (32, 0.073), (36, 0.035), (37, 0.053), (52, 0.397), (75, 0.104), (85, 0.067)]
simIndex simValue paperId paperTitle
1 0.77017123 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for the ℓq Loss in ℓr Balls
Author: Fei Ye, Cun-Hui Zhang
Abstract: We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in ℓr balls, we provide lower bounds for the minimax ℓq risk and minimax quantiles of the ℓq loss for all design matrices. Under an ℓ0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the ℓq risk and loss in ℓr balls, 0 ≤ r ≤ 1 ≤ q ≤ ∞. By allowing q = ∞, our risk bounds imply the variable selection consistency of threshold Lasso and Dantzig selectors. Keywords: variable selection, estimation, oracle inequality, minimax, linear regression, penalized least squares, linear programming
same-paper 2 0.74296361 25 jmlr-2010-Composite Binary Losses
Author: Mark D. Reid, Robert C. Williamson
Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise. Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
3 0.35995328 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
4 0.35987186 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian
Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models
5 0.35810506 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
6 0.3579686 102 jmlr-2010-Semi-Supervised Novelty Detection
7 0.35766515 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
8 0.35740066 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
9 0.35584787 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
10 0.35186088 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
11 0.35077316 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
12 0.35047573 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
13 0.349751 60 jmlr-2010-Learnability, Stability and Uniform Convergence
14 0.34941411 109 jmlr-2010-Stochastic Composite Likelihood
16 0.34812045 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
17 0.34753272 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.34570271 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
19 0.34529197 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
20 0.34481987 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values