jmlr jmlr2011 jmlr2011-43 knowledge-graph by maker-knowledge-mining

43 jmlr-2011-Information, Divergence and Risk for Binary Experiments


Source: pdf

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. [sent-9, score-0.322]

2 We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. [sent-10, score-0.31]

3 As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. [sent-11, score-0.477]

4 These include information, loss, risk, regret, ROC (Receiver Operating Characteristic) curves and the area under them, Bregman divergences and distance or divergence between probability distributions. [sent-21, score-0.358]

5 A link between the weighted integral representations for proper scoring rules and those for f -divergences which allows the transformation from one to the other (Theorem 10); 2. [sent-53, score-0.294]

6 Explicit formulae relating Bayes risk to the Neyman-Pearson function, which allows the transformation of risk curves to ROC curves and vice versa (Theorem 22). [sent-65, score-0.432]

7 We show that integral representations of f -divergences and proper losses and statistical information are all essentially the same (Theorem 18). [sent-95, score-0.427]

8 We unify several graphical representations for binary experiments and present new explicit formulae relating Bayes risk to the Neyman-Pearson function, which allows the transformation of risk curves to ROC curves and vice versa (Theorem 22). [sent-98, score-0.526]

9 These are tight bounds on the conditional risk with respect to an arbitrary cost-sensitive misclassification loss when all that is known is the value of the conditional risk with respect to an arbitrary proper loss. [sent-100, score-0.416]

10 We also generalise the classical Pinsker inequality by deriving tight bounds on an arbitrary f -divergence when the values of several generalised variational divergences between the same distributions are known (Theorem 30). [sent-102, score-0.31]
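
For reference, the classical Pinsker inequality that these results generalise can be stated, in the normalisation V(P, Q) = ∫ |dP − dQ| ∈ [0, 2] that is consistent with the bounds quoted later in this summary, as:

\mathrm{KL}(P, Q) \;\ge\; \tfrac{1}{2}\, V(P, Q)^2 .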

11 We systematically explore the relationship between Bayes risk and variational divergence, building upon classical results. [sent-105, score-0.249]

12 φ♦ Csiszár dual of φ (2); φ⋆ Legendre-Fenchel dual of φ (3); Bφ Bregman divergence and regret §4. [sent-139, score-0.255]

13 Convex Functions and Their Representations Many of the properties of divergences and losses are best understood through properties of the convex functions that define them. [sent-154, score-0.388]

14 One aim of this paper is to explain and relate various divergences and losses by understanding the relationships between their primitive functions. [sent-155, score-0.459]

15 It is used in Section 5 below to obtain an integral representation of losses for binary class probability estimation. [sent-206, score-0.302]

16 Both these integral representations state that the non-linear part of φ can be expressed as a weighted integral of piecewise linear terms φs0 or ψ. [sent-208, score-0.313]

17 Since the measures of risk, information and divergence we examine below do not depend on the linear part of these expansions we are able to identify convex functions with the weights w(t) = φ′′ (t) that define their non-linear part. [sent-210, score-0.245]

18 As much of this paper deals with functions that fall into this category—namely general convex functions—being able to generalise these results is essential in order to understand the weight functions corresponding to the primitive f -divergences and loss functions. [sent-215, score-0.35]

19 As Liese and Vajda (2006) carefully show, it is possible to derive generalised versions of the integral representations using the interpretations above. [sent-230, score-0.255]

20 When S = R and φ is twice differentiable, comparing the definition of a Bregman divergence in (7) to the integral representation in (4) reveals that Bregman divergences between real numbers can be defined as the non-linear part of the Taylor expansion of φ. [sent-239, score-0.396]

21 Many measures of divergence and information studied in the subsequent sections can be expressed as the Jensen gap of some convex function. [sent-270, score-0.277]
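
For concreteness, the Jensen gap of a convex function φ with respect to a random variable X is the quantity below, non-negative by Jensen's inequality (the symbol J_φ is illustrative, not the paper's notation):

\mathbb{J}_\phi(X) \;:=\; \mathbb{E}\,\phi(X) \;-\; \phi\bigl(\mathbb{E}\,X\bigr) \;\ge\; 0 .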

22 One divergence central to this paper is the variational divergence V (P, Q) which is obtained by setting f (t) = |t − 1| in Equation 14. [sent-366, score-0.399]
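
As a worked instance, assuming (14) is the usual integral form I_f(P, Q) = ∫ q f(p/q) dμ for densities p, q with respect to a dominating measure μ, the choice f(t) = |t − 1| gives

I_{|\cdot - 1|}(P, Q) \;=\; \int q\,\Bigl|\tfrac{p}{q} - 1\Bigr|\, d\mu \;=\; \int |p - q|\, d\mu \;=\; V(P, Q).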

23 ) This form of the variational divergence is discussed further in Section 8. [sent-370, score-0.238]

24 Furthermore, the variational divergence is one of a family of “primitive” f divergences discussed in Section 5. [sent-371, score-0.374]

25 Given a convex function φ : R+ → R the generative Bregman divergence between the distributions P and Q is (confer (14)) Bφ (P, Q) := EM Bφ (p, q) = EX∼M Bφ (p(X), q(X)) . [sent-378, score-0.28]
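
A standard instance of this generative Bregman divergence: with φ(t) = t ln t and p, q the densities of P and Q with respect to M,

B_\phi(p, q) \;=\; p\ln p - q\ln q - (\ln q + 1)(p - q) \;=\; p\ln\tfrac{p}{q} - p + q,
\qquad
\mathbb{E}_M\, B_\phi(p, q) \;=\; \int p\ln\tfrac{p}{q}\, dM \;=\; \mathrm{KL}(P, Q),

using ∫ p dM = ∫ q dM = 1.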

26 We call this Bregman divergence “generative” to distinguish it from the “discriminative” Bregman divergence introduced in Section 4 below, where the adjectives “generative” and “discriminative” are explained further. [sent-379, score-0.322]

27 Below we will introduce risk, regret, and proper losses and show how these relate to discriminative Bregman divergence. [sent-394, score-0.311]

28 When η̂ : X → [0, 1] is an observation-conditional density, taking the M-average of the point-wise risk gives the (full) risk of the estimator η̂: L(η, η̂, M) := EM [L(η, η̂)] = EX∼M [L(η(X), η̂(X))] = ∫X L(η(x), η̂(x)) dM(x) =: L(π, η̂, P, Q). [sent-423, score-0.27]

29 To use common machine learning terminology we will refer to Fisher consistent losses as proper losses. [sent-433, score-0.269]

30 In order to explicitly construct a proper loss from its associated “weight function” as shown in Theorem 17 we will require that the loss be definite, that is, its point-wise Bayes risk at 0 and 1 must be bounded from below: L(0) > −∞ , L(1) > −∞. [sent-439, score-0.326]

31 Since properness and fairness imply definiteness and regularity, most of the situations we consider in the remainder of this paper will involve losses which are both proper and fair. [sent-445, score-0.269]

32 Proper losses for probability estimation and surrogate margin losses (confer Bartlett et al. [sent-446, score-0.428]

33 (2005) note that “the surrogate criteria of classification are exactly the primary criteria of class probability estimation” and that most commonly used surrogate margin losses are just proper losses mapped from [0, 1] to R via a link function. [sent-450, score-0.621]

34 However, commonly used margin losses of the form φ(yF(x)) are a more restrictive class than proper losses since, as Buja et al. [sent-454, score-0.437]

35 The relation between link functions, proper losses and margin losses is considered in more detail by Reid and Williamson (2010). [sent-456, score-0.437]

36 The following important property of proper losses seems to be originally due to Savage (1971). [sent-457, score-0.269]

37 It shows that a proper loss is completely characterised by a concave function defining its point-wise Bayes risk along with a simple structural relationship between its point-wise risk and Bayes risk. [sent-458, score-0.563]

38 Theorem 7 A loss function ℓ is proper if and only if its point-wise Bayes risk L(η) is concave and for each η, η̂ ∈ (0, 1), L(η, η̂) = L(η̂) + (η − η̂)L′(η̂). [sent-459, score-0.391]
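
A quick check of this identity for squared loss, with partial losses ℓ(1, η̂) = (1 − η̂)² and ℓ(−1, η̂) = η̂², so that L(η, η̂) = η(1 − η̂)² + (1 − η)η̂² and the point-wise Bayes risk is L(η) = η(1 − η):

L(\hat\eta) + (\eta - \hat\eta)\,L'(\hat\eta)
= \hat\eta(1 - \hat\eta) + (\eta - \hat\eta)(1 - 2\hat\eta)
= \eta - 2\eta\hat\eta + \hat\eta^2
= \eta(1 - \hat\eta)^2 + (1 - \eta)\hat\eta^2
= L(\eta, \hat\eta).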

39 This characterisation of the concavity of L means proper losses have a natural connection to Bregman divergences. [sent-478, score-0.269]

40 5 that if S ⊆ Rd is a convex set, then a convex function φ : S → R defines a Bregman divergence Bφ (s, s0 ) := φ(s) − φ(s0 ) − ⟨s − s0 , ∇φ(s0 )⟩. [sent-481, score-0.329]
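
A standard worked example: taking φ(s) = ‖s‖² recovers the squared Euclidean distance,

B_\phi(s, s_0) \;=\; \|s\|^2 - \|s_0\|^2 - \langle s - s_0,\, 2 s_0 \rangle \;=\; \|s - s_0\|^2 .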

41 Thus, we know that there is a proper loss ℓ with Bayes risk equal to −φ. [sent-485, score-0.281]

42 The theorem leads to some correspondences between well known losses and divergence: log-loss with KL(P, Q); square loss with triangular discrimination; and 0-1 loss with V (P, Q). [sent-526, score-0.306]

43 However, the family of margin losses given in their work can be recovered by combining the proper losses with link functions. [sent-533, score-0.437]

44 Working with proper losses also addresses a limitation pointed out by Nguyen et al. [sent-534, score-0.269]

45 3, integral representations allow these convex functions to be expressed as weighted combinations of simple, convex, piecewise linear functions. [sent-553, score-0.298]

46 In Section 6, risk curves are used to graphically summarise the values of all the primitive risks for a given binary experiment. [sent-562, score-0.44]

47 In Section 7, surrogate regret bounds for proper losses and a tight generalisation of Pinsker’s inequality are derived by considering the relationship between general regrets or divergences and the primitive ones comprising them. [sent-563, score-0.848]

48 The specific choice of fπ in the above theorem from all of the affine equivalents was made to make simpler the connection between integral representations for losses and f -divergences, discussed in Section 5. [sent-601, score-0.374]

49 2 Proper Losses and Cost-Weighted Risk We now consider a representation of proper losses in terms of primitive losses that originates with Shuford et al. [sent-619, score-0.592]

50 The cost-weighted losses are a family of losses parameterised by a false positive cost c ∈ [0, 1] that defines a loss for y ∈ {±1} and η̂ ∈ [0, 1] by ℓc (y, η̂) = c ⟦y = −1⟧ ⟦η̂ ≥ c⟧ + (1 − c) ⟦y = 1⟧ ⟦η̂ < c⟧. [sent-623, score-0.425]

51 The first shows that the point-wise Bayes risk is a simple, concave “tent” function. [sent-631, score-0.245]

52 Theorem 14 For all η, c ∈ [0, 1] the point-wise Bayes risk Lc (η) = (1 − η)c ∧ (1 − c)η and is therefore concave in both c and η. [sent-633, score-0.245]
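
A minimal numerical sketch of this tent-shaped Bayes risk, using the cost-weighted loss defined above (the function names are illustrative, not from the paper):

    import numpy as np

    def cost_weighted_loss(y, eta_hat, c):
        """l_c(y, eta_hat): charge c for a false positive, 1 - c for a false negative."""
        if y == -1:
            return c if eta_hat >= c else 0.0
        return (1.0 - c) if eta_hat < c else 0.0

    def pointwise_risk(eta, eta_hat, c):
        """L_c(eta, eta_hat) = eta * l_c(+1, eta_hat) + (1 - eta) * l_c(-1, eta_hat)."""
        return eta * cost_weighted_loss(+1, eta_hat, c) + (1 - eta) * cost_weighted_loss(-1, eta_hat, c)

    def pointwise_bayes_risk(eta, c, grid=np.linspace(0.0, 1.0, 1001)):
        """Minimise the point-wise risk over predictions eta_hat on a grid."""
        return min(pointwise_risk(eta, eta_hat, c) for eta_hat in grid)

    # Check the tent formula L_c(eta) = min((1 - eta) * c, (1 - c) * eta) of Theorem 14.
    for c in (0.2, 0.5, 0.8):
        for eta in (0.0, 0.3, 0.7, 1.0):
            tent = min((1 - eta) * c, (1 - c) * eta)
            assert abs(pointwise_bayes_risk(eta, c) - tent) < 1e-9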

53 3 Integral Representations of Proper Losses The cost-weighted losses are primitive in the sense that they form the basis for a Choquet integral representation of proper losses. [sent-644, score-0.523]

54 Substituting these into the Savage representation of Theorem 7 for proper losses we see that L(η, η̂) = L(η̂) + (η − η̂)L′(η̂) = −W̄(η̂) + aη̂ + b + (η − η̂)[−W(η̂) + a] = −W̄(η̂) − (η − η̂)W(η̂) + aη + b. [sent-679, score-0.269]

55 As an example of how this theorem lets us explicitly construct proper losses from weight functions, consider the weight function w(c) = 1. [sent-684, score-0.449]
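
Carrying this out as a short worked computation, assuming the weighted-integral representation ℓ(y, η̂) = ∫₀¹ ℓc(y, η̂) w(c) dc suggested by the surrounding statements: with w(c) = 1,

\ell(1, \hat\eta) = \int_{\hat\eta}^{1} (1 - c)\, dc = \tfrac{1}{2}(1 - \hat\eta)^2,
\qquad
\ell(-1, \hat\eta) = \int_{0}^{\hat\eta} c\, dc = \tfrac{1}{2}\hat\eta^2,

that is, half of the squared loss.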

56 Graphical Representations The last section described representations of risks and f -divergences in terms of weighted integrals of primitive functions. [sent-737, score-0.268]

57 In particular, a diagram called a risk curve is introduced. [sent-739, score-0.242]

58 Risk curves are a useful aid to intuition when reasoning about risks, divergences and information and they are used in Section 7 to derive bounds between various divergences and risks. [sent-740, score-0.333]

59 Despite the close ties between f -divergences and risks, and between risk curves and ROC curves, we show in Proposition 19 that the area under an ROC curve cannot be interpreted as an f -divergence. [sent-744, score-0.27]

60 The weight function w(c) associated with a loss ℓ can be interpreted as a weighting on the horizontal axis of a risk curve diagram. [sent-790, score-0.32]

61 When the area under a risk curve is computed with respect to this weighting the result is the full risk L, since L(η, η̂) = ∫₀¹ Lc (η, η̂) w(c) dc. [sent-791, score-0.344]

62 Furthermore, the weighted area between the risk curves for an estimate η̂ and the true posterior η is the regret L(η, η̂) − L(η), and the statistical information ∆L(η, M) = L(π, M) − L(η, M) is the weighted area between the “tent” risk curve for π and the risk curve for η. [sent-792, score-0.708]

63 Proposition 20 For a given point (FP, TP) on an ROC diagram the corresponding line in a risk diagram is Lc = (1 − π) c FP + π (1 − c) (1 − TP), c ∈ [0, 1]. Conversely, the line in ROC space corresponding to a point (c, Lc ) in risk space is TP = ((1 − π) c FP + π (1 − c) − Lc ) / (π (1 − c)), FP ∈ [0, 1]. [sent-802, score-0.336]
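
The converse line can be checked by solving the first display for TP:

L_c = (1 - \pi)c\,\mathrm{FP} + \pi(1 - c)(1 - \mathrm{TP})
\quad\Longleftrightarrow\quad
\mathrm{TP} = \frac{(1 - \pi)c\,\mathrm{FP} + \pi(1 - c) - L_c}{\pi(1 - c)} .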

64 The grey Bayes risk curve on the left corresponds to the dominating grey ROC curve on the right for the likelihood statistic. [sent-811, score-0.283]

65 Given mild conditions on the space of instances, this gives a corollary which guarantees that all concave curves on a risk diagram can be realised by some pair of distributions. [sent-836, score-0.339]

66 Each experiment can be associated with a concave curve and vice versa so that the existence of an experiment becomes equivalent to the existence of a concave curve with certain properties. [sent-847, score-0.368]

67 In this section we consider how we can (tightly) bound the value of a general object (I f or Bw ) in terms of primitive objects (Vπ —the generalised variational divergence defined below—or Bc , the regret with respect to the cost-weighted loss (29)). [sent-865, score-0.695]

68 The main result of this subsection, Theorem 25, presents a general surrogate bound for proper losses implicitly as Bw ≥ F −1 (Bc0 ). [sent-876, score-0.361]

69 Consider the point-wise regret Bc0 (η, η̂) with respect to the cost-weighted proper loss ℓc0 . [sent-882, score-0.24]

70 Suppose it is known that Bc0 (η, η̂) = α. Then the point-wise regret B(η, η̂) of any proper surrogate loss ℓ with point-wise risk L and Bayes risk L satisfies B(η, η̂) ≥ ψ(c0 , α) ∨ ψ(c0 , −α), (46) where ψ(c0 , α) := B(c0 , c0 + α) = L(c0 ) − L(c0 + α) + αL′ (c0 ). [sent-883, score-0.407]
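
To make the bound concrete, here is a minimal sketch for log loss, whose point-wise Bayes risk is L(η) = −η ln η − (1 − η) ln(1 − η); the ψ below follows the formula stated above, and the function names are illustrative:

    import numpy as np

    def bayes_risk_log_loss(eta):
        """Point-wise Bayes risk of log loss: -eta*ln(eta) - (1-eta)*ln(1-eta)."""
        eta = np.clip(eta, 1e-12, 1.0 - 1e-12)
        return -eta * np.log(eta) - (1.0 - eta) * np.log(1.0 - eta)

    def psi(c0, alpha, L=bayes_risk_log_loss, h=1e-6):
        """psi(c0, alpha) = L(c0) - L(c0 + alpha) + alpha * L'(c0), with L' by central differences."""
        dL_c0 = (L(c0 + h) - L(c0 - h)) / (2.0 * h)
        return L(c0) - L(c0 + alpha) + alpha * dL_c0

    # If the cost-weighted regret at c0 is alpha, the log-loss regret is at least:
    c0, alpha = 0.5, 0.2
    print(max(psi(c0, alpha), psi(c0, -alpha)))  # about 0.082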

71 The proof of this bound is almost a direct consequence of the fact that regrets for proper losses are Bregman divergences (see Section 4. [sent-885, score-0.405]

72 Proof (Theorem 25) Let B be the conditional regret associated with some arbitrary proper loss ℓ and suppose that we know the cost-weighted regret Bc0 (η, η̂) = α. [sent-895, score-0.334]

73 (2006) for surrogate margin losses since B1/2 is easily shown to be half the 0-1 regret. [sent-903, score-0.26]

74 Finally, Theorem 25 can be used to immediately establish a loose, second-order bound in α for symmetric losses in terms of their weight function, similar to a result due to Buja et al. [sent-917, score-0.267]

75 Corollary 29 Suppose Bw is the regret for a symmetric proper loss ℓ with associated weight function w. [sent-919, score-0.339]

76 We will now show how viewing f -divergences in terms of their weighted integral representation simplifies the problem of understanding the relationship between different divergences and leads, amongst other things, to an explicit formula for (47). [sent-933, score-0.272]

77 We make use of a generalised notion of variational divergence: Vπ (P, Q) := 2 sup_{r∈[−1,1]^X} |πEP r − (1 − π)EQ r|, (49) where π ∈ (0, 1) and the supremum is over all measurable functions from X to [−1, 1]. [sent-934, score-0.25]

78 Since π → Vπ (P, Q) is concave, the piecewise linear concave function passing through points {(πi ,Vπi (P, Q))}n i=1 is guaranteed to be an upper bound on the variational curve (π,Vπ (P, Q))π∈(0,1) . [sent-938, score-0.317]

79 h²(P, Q) ≥ 2 − √(4 − V²), J(P, Q) ≥ V ln((2 + V)/(2 − V)), Ψ(P, Q) ≥ 8V²/(4 − V²), I(P, Q) ≥ (1/2 − V/4) ln(2 − V) + (1/2 + V/4) ln(2 + V) − ln(2), T(P, Q) ≥ ln(4/√(4 − V²)) − ln(2). [sent-987, score-0.242]

80 The following special cases hold (γ is not symmetric): χ²(P, Q) ≥ V² for V < 1 and χ²(P, Q) ≥ V/(2 − V) for V ≥ 1, and KL(P, Q) ≥ min_{β∈[V−2, 2−V]} [ ((V + 2 − β)/4) ln((β − 2 − V)/(β − 2 + V)) + ((β + 2 − V)/4) ln((β + 2 − V)/(β + 2 + V)) ]. (58) [sent-988, score-0.242]

81 Consider the (constrained) Bayes risk for 0-1 loss minimised over this set: L0−1_C (π, P, Q) = inf_{r∈C} E(X,Y)∼P [ℓ0−1 (r(X), Y)]. (61) [sent-1013, score-0.252]

82 The variational divergence is so called because it can be written V (P, Q) = 2 sup_{A⊆X} |P(A) − Q(A)|, (62) where the supremum is over all measurable subsets of X. [sent-1014, score-0.314]
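
A minimal numerical sketch of this identity for finite sample spaces, brute-forcing the supremum over subsets (variable names are illustrative):

    from itertools import chain, combinations

    def variational_divergence_l1(p, q):
        """V(P, Q) as the L1 distance between probability vectors."""
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    def variational_divergence_sup(p, q):
        """V(P, Q) = 2 * sup over subsets A of |P(A) - Q(A)|, by brute force."""
        n = len(p)
        subsets = chain.from_iterable(combinations(range(n), r) for r in range(n + 1))
        return 2 * max(abs(sum(p[i] for i in A) - sum(q[i] for i in A)) for A in subsets)

    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.2, 0.6]
    assert abs(variational_divergence_l1(p, q) - variational_divergence_sup(p, q)) < 1e-12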

83 The idea of generalising variational divergence by restricting the set the supremum is taken over is also used by Ben-David et al. [sent-1034, score-0.274]

84 Along the way we have drawn connections to a diverse set of concepts related to binary experiments: risk curves, cost curves, ROC curves and the area under them; variational representations of f -divergences, risks and regrets. [sent-1067, score-0.421]

85 The convexity of fπ is guaranteed by Theorem 7, which shows that L is concave, and by the fact that the perspective transform of a convex function is always convex (see Section 2. [sent-1116, score-0.322]

86 Then L0−1_C (π, P, Q) = inf_{r∈C} E(X,Y)∼P [ℓ0−1 (r(X), Y)] = inf_{r∈C} (πEX∼P [ℓ0−1 (r(X), 0)] + (1 − π)EX∼Q [ℓ0−1 (r(X), 1)]) = inf_{r∈C} (πEX∼P ⟦r(X) = 1⟧ + (1 − π)EX∼Q ⟦r(X) = 0⟧) = inf_{r∈C} (πEP r + (1 − π)EQ (1 − r)), since Ran r = {0, 1} ⇒ EX∼P ⟦r(X) = 1⟧ = EX∼P r(X) and EX∼Q ⟦r(X) = 0⟧ = EX∼Q (1 − r(X)). [sent-1202, score-0.288]

87 Hence L0−1_C (π, P, Q) = inf_{ρ∈2C−1} (πEP [(ρ + 1)/2] + (1 − π)EQ [1 − (ρ + 1)/2]) = (1/2) inf_{ρ∈2C−1} (πEP (ρ + 1) + (1 − π)EQ (1 − ρ)) = (1/2) inf_{ρ∈2C−1} (πEP ρ + (1 − π)EQ (−ρ) + π + (1 − π)) = 1/2 + (1/2) inf_{ρ∈2C−1} (πEP ρ − (1 − π)EQ ρ) = 1/2 − (1/2) sup_{ρ∈2C−1} (πEP (−ρ) − (1 − π)EQ (−ρ)). [sent-1205, score-0.328]

88 , n) pi (πi−1 ) ≥ g(πi−1 ) ≥ ψi−1 ⇒ ai πi−1 + bi ≥ ψi−1 ⇒ ai πi−1 + ψi − ai πi ≥ ψi−1 ⇒ ai (πi−1 − πi ) ≥ ψi−1 − ψi ⇒ ai ≤ (ψi−1 − ψi )/(πi−1 − πi ), since πi−1 − πi < 0. [sent-1270, score-0.24]

89 , n) pi (πi+1 ) ≥ g(πi+1 ) ≥ ψi+1 ⇒ ai πi+1 + bi ≥ ψi+1 ⇒ ai πi+1 + ψi − ai πi ≥ ψi+1 ⇒ ai (πi+1 − πi ) ≥ ψi+1 − ψi ⇒ ai ≥ (ψi+1 − ψi )/(πi+1 − πi ), since πi+1 − πi > 0. [sent-1274, score-0.24]

90 That occurs at the points π when pi (π) = pi+1 (π) ⇒ ai π + ψi − ai πi = ai+1 π + ψi+1 − ai+1 πi+1 ⇒ (ai+1 − ai )π = ψi − ψi+1 + ai+1 πi+1 − ai πi ⇒ π = (ψi − ψi+1 + ai+1 πi+1 − ai πi )/(ai+1 − ai ) =: π̃i for i = 0, . [sent-1276, score-0.24]

91 The optimisation problem involves finding the piecewise linear concave risk curve ψ ∈ Ψ and the corresponding φ = π ∧ (1 − π) that maximises I f . [sent-1301, score-0.375]

92 From the bound on T (P, Q) in terms of π and ψ1 , substituting ψ1 = 1/2 − V/4 gives T (P, Q) ≥ −(1/2) ln(1/2 − V/4) − (1/2) ln(1/2 + V/4) − ln(2). [sent-1329, score-0.242]

93 From (79) we obtain KL(P, Q) ≥ min over a ∈ [−2ψ1 , 2ψ1 ] of a weighted sum of two logarithmic terms in a and ψ1 . [sent-1346, score-0.242]

94 (2005) discuss, such margin losses cannot capture the richness of all possible proper losses. [sent-1454, score-0.269]

95 Gilardoni (2006c) improved Vajda's bound slightly to KL(P, Q) ≥ ln(2/(2 − V)) − ((2 − V)/2) ln((2 + V)/2). [sent-1499, score-0.242]

96 Variational Representation of I f and its Generalizations The variational representation of the Variational divergence (62) suggests the question of whether there is a variational representation for a general f -divergence. [sent-1513, score-0.315]

97 Hiriart-Urruty and Lemaréchal (1993a, page 69) show that for f convex on R+ and g convex and increasing on R+, (g ◦ f )⋆ (s) = inf_α (α f⋆ (s/α) + g⋆ (α)) = f ⋆ g⋆ . [sent-1613, score-0.276]

98 2 regarding the relationship between divergence and risk when R = BH , a unit ball in a reproducing kernel Hilbert space H. [sent-1618, score-0.333]

99 Unifying divergence minimization and statistical inference via convex duality. [sent-1752, score-0.245]

100 Bounds on non-symmetric divergence measures in terms of symmetric divergence measures. [sent-3029, score-0.355]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eid', 0.262), ('illiamson', 0.262), ('ivergence', 0.256), ('nformation', 0.256), ('isk', 0.218), ('lc', 0.208), ('bregman', 0.172), ('losses', 0.168), ('roc', 0.163), ('divergence', 0.161), ('primitive', 0.155), ('eq', 0.138), ('divergences', 0.136), ('risk', 0.135), ('ln', 0.121), ('buja', 0.119), ('dq', 0.116), ('ep', 0.112), ('concave', 0.11), ('vajda', 0.101), ('proper', 0.101), ('integral', 0.099), ('generalised', 0.097), ('csisz', 0.096), ('pinsker', 0.095), ('regret', 0.094), ('surrogate', 0.092), ('bw', 0.086), ('convex', 0.084), ('liese', 0.077), ('variational', 0.077), ('jensen', 0.075), ('curve', 0.074), ('inf', 0.072), ('kl', 0.07), ('weight', 0.066), ('generalisation', 0.065), ('osterreicher', 0.065), ('curves', 0.061), ('representations', 0.059), ('fp', 0.058), ('ex', 0.057), ('auc', 0.056), ('piecewise', 0.056), ('risks', 0.054), ('summarised', 0.054), ('torgersen', 0.054), ('bayes', 0.051), ('lemar', 0.05), ('theorem', 0.048), ('monistic', 0.048), ('tops', 0.048), ('ai', 0.048), ('af', 0.046), ('jenssen', 0.046), ('tp', 0.045), ('loss', 0.045), ('perspective', 0.044), ('false', 0.044), ('dp', 0.043), ('langford', 0.043), ('discriminative', 0.042), ('lf', 0.042), ('substituting', 0.042), ('dm', 0.042), ('nguyen', 0.041), ('reductions', 0.041), ('primitives', 0.04), ('confer', 0.04), ('reid', 0.04), ('relating', 0.04), ('sgn', 0.04), ('vr', 0.04), ('sup', 0.04), ('dt', 0.039), ('steinwart', 0.038), ('relationship', 0.037), ('chal', 0.036), ('supremum', 0.036), ('degroot', 0.036), ('drummond', 0.036), ('lecam', 0.036), ('mmd', 0.036), ('pluralistic', 0.036), ('binary', 0.035), ('bc', 0.035), ('statistic', 0.035), ('scoring', 0.035), ('generative', 0.035), ('parzen', 0.033), ('symmetric', 0.033), ('discrimination', 0.033), ('diagram', 0.033), ('theoretic', 0.033), ('gap', 0.032), ('dc', 0.032), ('rockafellar', 0.031), ('beygelzimer', 0.031), ('minimising', 0.03), ('informations', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

2 0.098069333 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

3 0.08921285 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

4 0.079262808 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

Author: Vladimir V. V'yugin

Abstract: In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero. Keywords: prediction with expert advice, follow the perturbed leader, unbounded losses, adaptive learning rate, expected bounds, Hannan consistency, online sequential prediction

5 0.078383453 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates

6 0.074008726 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

7 0.07270167 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

8 0.069258414 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

9 0.064646348 91 jmlr-2011-The Sample Complexity of Dictionary Learning

10 0.062666252 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

11 0.059461422 58 jmlr-2011-Learning from Partial Labels

12 0.058227692 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

13 0.054816917 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

14 0.053035606 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

15 0.052406542 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

16 0.052394513 97 jmlr-2011-Union Support Recovery in Multi-task Learning

17 0.050372284 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

18 0.050112717 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

19 0.048013378 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

20 0.04656012 105 jmlr-2011-lp-Norm Multiple Kernel Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.258), (1, -0.015), (2, -0.067), (3, 0.073), (4, 0.127), (5, -0.077), (6, -0.004), (7, -0.079), (8, -0.06), (9, -0.085), (10, -0.072), (11, -0.106), (12, -0.057), (13, 0.12), (14, -0.068), (15, -0.071), (16, -0.166), (17, 0.028), (18, -0.03), (19, -0.193), (20, 0.009), (21, -0.012), (22, 0.052), (23, -0.128), (24, 0.123), (25, 0.054), (26, 0.119), (27, 0.011), (28, -0.028), (29, 0.01), (30, -0.089), (31, 0.024), (32, -0.042), (33, -0.043), (34, -0.088), (35, -0.094), (36, -0.073), (37, 0.068), (38, -0.022), (39, -0.097), (40, 0.046), (41, 0.062), (42, -0.16), (43, 0.03), (44, 0.028), (45, -0.008), (46, -0.145), (47, -0.066), (48, 0.072), (49, -0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.929717 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

2 0.55636376 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

Author: Vladimir V. V'yugin

Abstract: In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero. Keywords: prediction with expert advice, follow the perturbed leader, unbounded losses, adaptive learning rate, expected bounds, Hannan consistency, online sequential prediction

3 0.46879509 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

4 0.46048442 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

Author: Daniil Ryabko

Abstract: A sequence x1 , . . . , xn , . . . of discrete-valued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure µ belongs to an arbitrary but known class C of process measures. The non-realizable case is when µ is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists it can be obtained as a Bayes mixture over a countable subset of C . We also obtain several characterization of those sets C for which solutions to the considered problems exist. As an illustration to the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the processes measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family. Keywords: sequence prediction, time series, online prediction, realizable sequence prediction, non-realizable sequence prediction

5 0.38959506 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

Author: Zeeshan Syed, John Guttag

Abstract: In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in longterm signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients with known outcomes for risk stratification. This is achieved by using an anomaly detection framework to identify patients who are unlike other patients in a population and may potentially be at an elevated risk. We demonstrate the potential utility of our approach by showing how symbolic mismatch-based algorithms can be used to classify patients as being at high or low risk of major adverse cardiac events by comparing their long-term electrocardiograms to that of a large population. We describe how symbolic mismatch can be used in three different existing methods: one-class support vector machines, nearest neighbor analysis, and hierarchical clustering. When evaluated on a population of 686 patients with available long-term electrocardiographic data, symbolic mismatch-based comparative approaches were able to identify patients at roughly a two-fold increased risk of major adverse cardiac events in the 90 days following acute coronary syndrome. These results were consistent even after adjusting for other clinical risk variables. Keywords: risk stratification, cardiovascular disease, time-series comparison, symbolic analysis, anomaly detection

6 0.38550001 58 jmlr-2011-Learning from Partial Labels

7 0.37728763 91 jmlr-2011-The Sample Complexity of Dictionary Learning

8 0.34641999 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

9 0.3447119 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

10 0.33923316 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

11 0.33901149 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

12 0.3317247 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

13 0.32905984 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

14 0.3251164 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

15 0.32195097 104 jmlr-2011-X-Armed Bandits

16 0.31097913 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

17 0.29561758 60 jmlr-2011-Locally Defined Principal Curves and Surfaces

18 0.29464349 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

19 0.28515717 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

20 0.28099003 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.018), (4, 0.053), (6, 0.013), (9, 0.033), (10, 0.026), (24, 0.035), (31, 0.079), (32, 0.417), (40, 0.014), (41, 0.026), (60, 0.014), (65, 0.011), (67, 0.01), (73, 0.05), (78, 0.063), (86, 0.013), (90, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97001928 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning

Author: Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, Ioannis Vlahavas

Abstract: M ULAN is a Java library for learning from multi-label data. It offers a variety of classification, ranking, thresholding and dimensionality reduction algorithms, as well as algorithms for learning from hierarchically structured labels. In addition, it contains an evaluation framework that calculates a rich variety of performance measures. Keywords: multi-label data, classification, ranking, thresholding, dimensionality reduction, hierarchical classification, evaluation 1. Multi-Label Learning A multi-label data set consists of training examples that are associated with a subset of a finite set of labels. Nowadays, multi-label data are becoming ubiquitous. They arise in an increasing number and diversity of applications, such as semantic annotation of images and video, web page categorization, direct marketing, functional genomics and music categorization into genres and emotions. There exist two major multi-label learning tasks (Tsoumakas et al., 2010): multi-label classification and label ranking. The former is concerned with learning a model that outputs a bipartition of the set of labels into relevant and irrelevant with respect to a query instance. The latter is concerned with learning a model that outputs a ranking of the labels according to their relevance to a query instance. Some algorithms learn models that serve both tasks. Several algorithms learn models that primarily output a vector of numerical scores, one for each label. This vector is then converted to a ranking after solving ties, or to a bipartition, after thresholding (Ioannou et al., 2010). Multi-label learning methods addressing these tasks can be grouped into two categories (Tsoumakas et al., 2010): problem transformation and algorithm adaptation. The first group of methods are algorithm independent. They transform the learning task into one or more singlelabel classification tasks, for which a large body of learning algorithms exists. The second group of methods extend specific learning algorithms in order to handle multi-label data directly. There exist extensions of decision tree learners, nearest neighbor classifiers, neural networks, ensemble methods, support vector machines, kernel methods, genetic algorithms and others. Multi-label learning stretches across several other tasks. When labels are structured as a treeshaped hierarchy or a directed acyclic graph, then we have the interesting task of hierarchical multilabel learning. Dimensionality reduction is another important task for multi-label data, as it is for c 2011 Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek and Ioannis Vlahavas. T SOUMAKAS , S PYROMITROS -X IOUFIS , V ILCEK AND V LAHAVAS any kind of data. When bags of instances are used to represent a training object, then multi-instance multi-label learning algorithms are required. There also exist semi-supervised learning and active learning algorithms for multi-label data. 2. The M ULAN Library The main goal of M ULAN is to bring the benefits of machine learning open source software (MLOSS) (Sonnenburg et al., 2007) to people working with multi-label data. The availability of MLOSS is especially important in emerging areas like multi-label learning, because it removes the burden of implementing related work and speeds up the scientific progress. In multi-label learning, an extra burden is implementing appropriate evaluation measures, since these are different compared to traditional supervised learning tasks. 
Evaluating multi-label algorithms with a variety of measures, is considered important by the community, due to the different types of output (bipartition, ranking) and diverse applications. Towards this goal, M ULAN offers a plethora of state-of-the-art algorithms for multi-label classification and label ranking and an evaluation framework that computes a large variety of multi-label evaluation measures through hold-out evaluation and cross-validation. In addition, the library offers a number of thresholding strategies that produce bipartitions from score vectors, simple baseline methods for multi-label dimensionality reduction and support for hierarchical multi-label classification, including an implemented algorithm. M ULAN is a library. As such, it offers only programmatic API to the library users. There is no graphical user interface (GUI) available. The possibility to use the library via command line, is also currently not supported. Another drawback of M ULAN is that it runs everything in main memory so there exist limitations with very large data sets. M ULAN is written in Java and is built on top of Weka (Witten and Frank, 2005). This choice was made in order to take advantage of the vast resources of Weka on supervised learning algorithms, since many state-of-the-art multi-label learning algorithms are based on problem transformation. The fact that several machine learning researchers and practitioners are familiar with Weka was another reason for this choice. However, many aspects of the library are independent of Weka and there are interfaces for most of the core classes. M ULAN is an advocate of open science in general. One of the unique features of the library is a recently introduced experiments package, whose goal is to host code that reproduces experimental results reported on published papers on multi-label learning. To the best of our knowledge, most of the general learning platforms, like Weka, don’t support multi-label data. There are currently only a number of implementations of specific multi-label learning algorithms, but not a general library like M ULAN. 3. Using M ULAN This section presents an example of how to setup an experiment for empirically evaluating two multi-label algorithms on a multi-label data set using cross-validation. We create a new Java class for this experiment, which we call MulanExp1.java. The first thing to do is load the multi-label data set that will be used for the empirical evaluation. M ULAN requires two text files for the specification of a data set. The first one is in the ARFF format of Weka. The labels should be specified as nominal attributes with values “0” and “1” indicating 2412 M ULAN : A JAVA L IBRARY FOR M ULTI -L ABEL L EARNING absence and presence of the label respectively. The second file is in XML format. It specifies the labels and any hierarchical relationships among them. Hierarchies of labels can be expressed in the XML file by nesting the label tag. In our example, the two filenames are given to the experiment class through command-line parameters. String arffFile = Utils.getOption(

2 0.86043823 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood

Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki

Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (f̂CLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that f̂CLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation

3 0.85851347 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng

Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier

same-paper 4 0.81741256 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

5 0.43212867 102 jmlr-2011-Waffles: A Machine Learning Toolkit

Author: Michael Gashler

Abstract: We present a breadth-oriented collection of cross-platform command-line tools for researchers in machine learning called Waffles. The Waffles tools are designed to offer a broad spectrum of functionality in a manner that is friendly for scripted automation. All functionality is also available in a C++ class library. Waffles is available under the GNU Lesser General Public License. Keywords: machine learning, toolkits, data mining, C++, open source

6 0.42747453 48 jmlr-2011-Kernel Analysis of Deep Networks

7 0.39189798 91 jmlr-2011-The Sample Complexity of Dictionary Learning

8 0.39157504 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

9 0.38949823 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

10 0.38664317 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package

11 0.38638926 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

12 0.38003761 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

13 0.37098727 66 jmlr-2011-Multiple Kernel Learning Algorithms

14 0.36905563 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

15 0.36866882 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

16 0.36805609 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

17 0.36802092 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

18 0.36676598 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

19 0.36658031 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

20 0.36477005 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models