nips nips2005 nips2005-58 knowledge-graph by maker-knowledge-mining

58 nips-2005-Divergences, surrogate loss functions and experimental design


Source: pdf

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs. [sent-1, score-0.973]

2 Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. [sent-9, score-1.125]

3 Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. [sent-10, score-1.57]

4 Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. [sent-11, score-0.872]

5 These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. [sent-12, score-0.216]

6 1 Introduction A unifying theme in the recent literature on classification is the notion of a surrogate loss function—a convex upper bound on the 0-1 loss. [sent-13, score-0.999]

7 Many practical classification algorithms can be formulated in terms of the minimization of surrogate loss functions; well-known examples include the support vector machine (hinge loss) and Adaboost (exponential loss). [sent-14, score-0.854]

8 Significant progress has been made on the theoretical front by analyzing the general statistical consequences of using surrogate loss functions [e. [sent-15, score-0.881]

9 Working in the context of experimental design, researchers in the 1960’s recast the (intractable) problem of minimizing the probability of classification error in terms of the maximization of various surrogate functions [e. [sent-19, score-0.613]

10 Examples of experimental design include the choice of a quantizer as a preprocessor for a classifier [12], or the choice of a “signal set” for a radar system [5]. [sent-22, score-0.24]

11 The surrogate functions that were used included the Hellinger distance and various forms of KL divergence; maximization of these functions was proposed as a criterion for the choice of a design. [sent-23, score-0.73]

12 An important outcome of this line of work was the definition of a general family of “f -divergences” (also known as “Ali-Silvey distances”), which includes Hellinger distance and KL divergence as special cases [1, 4]. [sent-25, score-0.196]

13 In broad terms, the goal of the current paper is to bring together these two literatures, in particular by establishing a correspondence between the family of surrogate loss functions and the family of f -divergences. [sent-26, score-1.054]

14 In particular, one natural extension—and one which we explore towards the end of this paper—is in requiring consistency not only in the choice of an optimal discriminant function but also in the choice of an optimal experiment design. [sent-28, score-0.364]

15 The main technical contribution of this paper is to state and prove a general theorem relating surrogate loss functions and f -divergences. [sent-29, score-0.964]

16 We show that the correspondence is quite strong: any surrogate loss induces a corresponding f -divergence, and any f -divergence satisfying certain conditions corresponds to a family of surrogate loss functions. [sent-30, score-1.84]

17 Moreover, exploiting tools from convex analysis, we provide a constructive procedure for finding loss functions from f -divergences. [sent-31, score-0.64]

18 We also introduce and analyze a notion of universal equivalence among loss functions (and corresponding f -divergences). [sent-32, score-0.613]

19 Finally, we present an application of these ideas to the problem of proving consistency of classification algorithms with an additional decentralization requirement. [sent-33, score-0.14]

20 The stochastic map Q is referred to as an experiment in statistics; in the signal processing literature, where Z is generally taken to be discrete, it is referred to as a quantizer. [sent-37, score-0.121]

21 Given a fixed experiment Q, we can formulate a standard binary classification problem as one of finding a measurable function γ ∈ Γ := {Z → R} that minimizes the Bayes risk P (Y ≠ sign(γ(Z))). [sent-39, score-0.215]

22 Our focus is the broader question of determining both the classifier γ ∈ Γ and the experiment choice Q ∈ Q so as to minimize the Bayes risk. [sent-40, score-0.153]

23 The Bayes risk corresponds to the expectation of the 0-1 loss. [sent-41, score-0.166]

24 Given the non-convexity of this loss function, it is natural to consider a surrogate loss function φ that we optimize in place of the 0-1 loss. [sent-42, score-1.14]

25 For each fixed quantization rule Q, the optimal φ risk (as a function of Q) is defined as follows: Rφ (Q) := infγ∈Γ Rφ (γ, Q). [sent-44, score-0.485]

26 As a consequence of Lyapunov’s theorem, the space of {(µ, π)} obtained by varying Q ∈ Q (or Q0 ) is both compact and convex (see [12] for details). [sent-47, score-0.2]

27 This representation allows us to compute the optimal value of γ(z) for all z ∈ Z, as well as the optimal φ risk for a fixed Q. [sent-54, score-0.266]

28 Thus the optimal Bayes risk given a fixed Q takes the form: Rbayes (Q) = Σz∈Z min{µ(z), π(z)} = 1/2 − (1/2) Σz∈Z |µ(z) − π(z)| =: (1/2)(1 − V (µ, π)), where V (µ, π) denotes the variational distance between the two measures µ and π. [sent-57, score-0.434]
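
To make the identity concrete, here is a minimal numerical sketch (not from the paper): for an invented four-valued X, equal class priors, and a hypothetical deterministic quantizer with two cells, the quantity Σz min{µ(z), π(z)} coincides with (1/2)(1 − V(µ, π)). All numbers and variable names below are made up for illustration.

```python
import numpy as np

# Toy setup (all numbers hypothetical): X takes 4 values, Y in {+1, -1},
# and the quantizer deterministically maps X into |Z| = 2 cells.
p_x_pos = np.array([0.4, 0.3, 0.2, 0.1])   # P(X = x | Y = +1)
p_x_neg = np.array([0.1, 0.2, 0.3, 0.4])   # P(X = x | Y = -1)
q = 0.5                                     # prior P(Y = +1)
cell = np.array([0, 0, 1, 1])               # cell index assigned to each x

# Induced measures on Z: mu(z) = P(Y = +1, Z = z), pi(z) = P(Y = -1, Z = z)
mu = np.array([q * p_x_pos[cell == z].sum() for z in range(2)])
pi = np.array([(1 - q) * p_x_neg[cell == z].sum() for z in range(2)])

r_bayes = np.minimum(mu, pi).sum()          # sum_z min{mu(z), pi(z)}
V = np.abs(mu - pi).sum()                   # variational distance V(mu, pi)
print(r_bayes, 0.5 * (1 - V))               # the two values coincide
```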

29 In this case γ(z) = sign(µ(z) − π(z)) and the optimal risk takes the form: Rhinge (Q) = Σz∈Z 2 min{µ(z), π(z)} = 1 − Σz∈Z |µ(z) − π(z)| = 1 − V (µ, π) = 2Rbayes (Q). [sent-60, score-0.243]

30 Letting φsqr (yγ(z)) = (1 − yγ(z))², we have γ(z) = (µ(z) − π(z))/(µ(z) + π(z)), and the optimal risk takes the form: Rsqr (Q) = Σz∈Z 4µ(z)π(z)/(µ(z) + π(z)) = 1 − ∆(µ, π), where ∆(µ, π) denotes the triangular discrimination distance. [sent-63, score-0.312]

31 Letting φlog (yγ(z)) := log(1 + exp(−yγ(z))), we have γ(z) = log(µ(z)/π(z)), and the optimal risk for logistic loss takes the form: Rlog (Q) = Σz∈Z [µ(z) log((µ(z) + π(z))/µ(z)) + π(z) log((µ(z) + π(z))/π(z))] = log 2 − KL(µ||(µ + π)/2) − KL(π||(µ + π)/2) = log 2 − C(µ, π), where C(U, V ) denotes the capacitory discrimination distance. [sent-66, score-0.712]

32 The optimal risk for exponential loss takes the form: Rexp (Q) = Σz∈Z 2√(µ(z)π(z)) = 1 − Σz∈Z (√µ(z) − √π(z))² = 1 − 2h²(µ, π), where h(µ, π) denotes the Hellinger distance between measures µ and π. [sent-69, score-0.713]
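
The four closed-form expressions above can be sanity-checked numerically. The sketch below (not from the paper; the toy measures µ, π and the grid over α are assumptions) approximates Rφ(Q) = Σz infα [φ(α)µ(z) + φ(−α)π(z)] by grid search and compares it with the corresponding divergence formula for each loss.

```python
import numpy as np

mu = np.array([0.30, 0.15, 0.05])        # hypothetical mu(z) = P(Y = +1, Z = z)
pi = np.array([0.05, 0.15, 0.30])        # hypothetical pi(z) = P(Y = -1, Z = z)
alphas = np.linspace(-20, 20, 200001)    # grid for the inner minimization over alpha

def optimal_risk(phi):
    # R_phi(Q) = sum_z inf_alpha [ phi(alpha) mu(z) + phi(-alpha) pi(z) ]
    return sum((phi(alphas) * m + phi(-alphas) * p).min() for m, p in zip(mu, pi))

hinge = lambda a: np.maximum(1 - a, 0.0)
square = lambda a: (1 - a) ** 2
logistic = lambda a: np.log1p(np.exp(-a))
exponential = lambda a: np.exp(-a)

V = np.abs(mu - pi).sum()                                   # variational distance
Delta = ((mu - pi) ** 2 / (mu + pi)).sum()                  # triangular discrimination
C = (mu * np.log(2 * mu / (mu + pi))).sum() + \
    (pi * np.log(2 * pi / (mu + pi))).sum()                 # capacitory discrimination
h2 = 0.5 * ((np.sqrt(mu) - np.sqrt(pi)) ** 2).sum()         # squared Hellinger distance

print(optimal_risk(hinge), 1 - V)              # ~ equal
print(optimal_risk(square), 1 - Delta)         # ~ equal
print(optimal_risk(logistic), np.log(2) - C)   # ~ equal
print(optimal_risk(exponential), 1 - 2 * h2)   # ~ equal
```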

33 3 The correspondence between loss functions and f -divergences In order to resolve this question, we begin with precise definitions of f -divergences, and surrogate loss functions. [sent-75, score-1.326]

34 Given any continuous convex function f : [0, +∞) → R ∪ {+∞}, the f -divergence between measures µ and π is given by If (µ, π) := Σz π(z) f (µ(z)/π(z)). [sent-77, score-0.228]

35 For instance, the variational distance is given by f (u) = |u − 1|, KL divergence by f (u) = u log u, triangular discrimination by f (u) = (u − 1)²/(u + 1), and Hellinger distance by f (u) = (1/2)(√u − 1)². [sent-78, score-0.39]
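
The definition translates directly into code. The sketch below (with invented toy measures) evaluates If (µ, π) = Σz π(z) f(µ(z)/π(z)) for three of the generators just listed and compares the results with the explicit distance formulas.

```python
import numpy as np

def f_divergence(f, mu, pi):
    # I_f(mu, pi) = sum_z pi(z) * f(mu(z) / pi(z)); assumes pi(z) > 0 everywhere
    u = mu / pi
    return (pi * f(u)).sum()

mu = np.array([0.30, 0.15, 0.05])   # hypothetical measures, as in the sketch above
pi = np.array([0.05, 0.15, 0.30])

f_var = lambda u: np.abs(u - 1)                # variational distance
f_tri = lambda u: (u - 1) ** 2 / (u + 1)       # triangular discrimination
f_hel = lambda u: 0.5 * (np.sqrt(u) - 1) ** 2  # (squared) Hellinger distance

print(f_divergence(f_var, mu, pi), np.abs(mu - pi).sum())
print(f_divergence(f_tri, mu, pi), ((mu - pi) ** 2 / (mu + pi)).sum())
print(f_divergence(f_hel, mu, pi), 0.5 * ((np.sqrt(mu) - np.sqrt(pi)) ** 2).sum())
```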

36 First, we require that any surrogate loss function φ is continuous and convex. [sent-80, score-0.846]

37 Second, the function φ must be classification-calibrated [2], meaning that for any a, b ≥ 0 with a ≠ b, infα: α(a−b)<0 [φ(α)a + φ(−α)b] > infα∈R [φ(α)a + φ(−α)b]. [sent-81, score-0.428]

38 It can be shown [2] that in the convex case φ is classification-calibrated if and only if it is differentiable at 0 and φ′(0) < 0. [sent-82, score-0.192]
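
A small numerical check of the calibration condition (a sketch; the weights a and b and the grid are arbitrary choices): for a classification-calibrated loss such as the hinge, the infimum restricted to "wrong-sign" α stays strictly above the unconstrained infimum whenever a ≠ b.

```python
import numpy as np

def calibration_gap(phi, a, b, grid=np.linspace(-10, 10, 200001)):
    # Gap between the infimum over alpha with alpha * (a - b) < 0 and the
    # unconstrained infimum of phi(alpha) * a + phi(-alpha) * b.
    # Classification-calibration requires this gap to be > 0 whenever a != b.
    values = phi(grid) * a + phi(-grid) * b
    wrong_sign = grid * (a - b) < 0
    return values[wrong_sign].min() - values.min()

hinge = lambda alpha: np.maximum(1 - alpha, 0.0)
print(calibration_gap(hinge, a=0.7, b=0.3))   # strictly positive, as expected
```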

39 More precisely, the following lemma proves that the optimal φ risk for a fixed Q can be written as the negative of an f -divergence. [sent-88, score-0.301]

40 The φ risk for (Q, γQ ) is the negative of an f -divergence between µ and π for some convex function f : Rφ (Q) = −If (µ, π). [sent-91, score-0.313]

41 The optimal φ risk takes the form: Rφ (Q) = Σz∈Z infα (φ(α)µ(z) + φ(−α)π(z)) = Σz π(z) infα (φ(−α) + φ(α) µ(z)/π(z)). [sent-93, score-0.671]

42 For each z, let u = µ(z)/π(z); then infα (φ(−α) + φ(α)u) is a concave function of u (since the pointwise minimum of a set of linear functions is concave). [sent-94, score-0.3]

43 Thus, the claim follows by defining (for u ∈ R) f (u) := − infα (φ(−α) + φ(α)u). [sent-95, score-0.214]
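
The lemma suggests a direct numerical recipe, sketched below with hypothetical measures and a grid over α: build f(u) = −infα (φ(−α) + φ(α)u) from a given loss (here the exponential loss) and confirm that Rφ(Q) ≈ −If (µ, π).

```python
import numpy as np

alphas = np.linspace(-30.0, 30.0, 120001)   # grid for the infimum over alpha

def f_from_phi(phi, u):
    # f(u) = -inf_alpha [ phi(-alpha) + phi(alpha) * u ], as in the lemma above
    return -np.array([(phi(-alphas) + phi(alphas) * ui).min()
                      for ui in np.atleast_1d(u)])

exponential = lambda a: np.exp(-a)

mu = np.array([0.30, 0.15, 0.05])   # hypothetical measures
pi = np.array([0.05, 0.15, 0.30])

If_val = (pi * f_from_phi(exponential, mu / pi)).sum()        # I_f(mu, pi)
R_phi = sum((exponential(alphas) * m + exponential(-alphas) * p).min()
            for m, p in zip(mu, pi))                          # optimal phi-risk
print(R_phi, -If_val)   # approximately equal (both are 2 * sum_z sqrt(mu(z) pi(z)))
```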

44 Given a divergence If (µ, π) for some convex function f , does there exist a loss function φ for which Rφ (Q) = −If (µ, π)? [sent-98, score-0.56]

45 In the following, we provide a precise characterization of the set of f -divergences that can be realized in this way, as well as a constructive procedure for determining all φ that realize a given f -divergence. [sent-99, score-0.222]

46 First, let us define, for each β, the inverse mapping φ−1 (β) := inf{α : φ(α) ≤ β}, where inf ∅ := +∞. [sent-101, score-0.214]

47 Define β1 := inf{β : Ψ(β) < +∞} and β2 := inf{β : Ψ(β) = inf Ψ}. [sent-104, score-0.214]

48 It is simple to check that inf φ = inf Ψ = φ(α∗ ), and β1 = φ(α∗ ), β2 = φ(−α∗ ). [sent-105, score-0.428]

49 If φ is decreasing, then Ψ is convex in (−∞, +∞). [sent-112, score-0.147]

50 Hence, if Ψ is a lower semicontinuous convex function, it is possible to recover Ψ from f by means of convex duality [9]: Ψ(β) = f ∗ (−β). [sent-118, score-0.37]

51 Thus, equation (5) provides a means for recovering a loss function φ from Ψ. [sent-119, score-0.32]

52 Indeed, the following theorem provides a constructive procedure for finding all such φ when Ψ satisfies necessary conditions specified in Lemma 3: Theorem 4. [sent-120, score-0.198]

53 (a) Given a lower semicontinuous convex function f : R → R, define: Ψ(β) = f ∗ (−β). [sent-121, score-0.196]

54 If Ψ is a decreasing function satisfying the properties specified in parts (c), (d) and (e) of Lemma 3, then there exist convex continuous loss functions φ for which (3) and (4) hold. [sent-122, score-0.538]

55 (b) More precisely, all such functions φ are of the form: for any α ≥ 0, φ(α) = Ψ(g(α + u∗ )) and φ(−α) = g(α + u∗ ), where u∗ satisfies Ψ(u∗ ) = u∗ for some u∗ ∈ (β1 , β2 ) and g : [u∗ , +∞) → R is any increasing continuous convex function such that g(u∗ ) = u∗ . [sent-123, score-0.234]

56 One interesting consequence of Theorem 4 is that any realizable f -divergence can in fact be obtained from a fairly large set of φ loss functions. [sent-125, score-0.402]

57 We describe below how the Hellinger distance, for instance, is realized not only by the exponential loss (as described earlier), but also by many other surrogate loss functions. [sent-127, score-1.202]

58 Now if we choose g(u) = exp(u − 1), then we obtain the exponential loss φ(α) = exp(−α). [sent-134, score-0.352]
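
A sketch of this construction in code. The generator f(u) = −2√u (the one induced by the exponential loss via the earlier lemma), its conjugate Ψ(β) = 1/β for β > 0, and the fixed point u∗ = 1 are worked out analytically and should be read as assumptions of the sketch; the code computes Ψ numerically as f∗(−β), picks g(u) = exp(u − 1), and assembles φ via Theorem 4(b), recovering exp(−α).

```python
import numpy as np

us = np.linspace(0.0, 100.0, 1_000_001)       # grid over the domain of f

def f_hellinger(u):
    # Generator induced by the exponential loss: I_f(mu, pi) = -2 sum_z sqrt(mu pi)
    return -2.0 * np.sqrt(u)

def Psi(beta):
    # Psi(beta) = f*(-beta) = sup_u [ -beta * u - f(u) ]; the grid suffices here
    # because we only evaluate Psi at beta >= 1, where the maximizer is u = 1/beta^2.
    return np.max(-beta * us - f_hellinger(us))

u_star = 1.0                                  # fixed point: Psi(1) = 1
g = lambda u: np.exp(u - 1.0)                 # increasing, convex, g(u_star) = u_star

def phi(alpha):
    # Theorem 4(b): phi(alpha) = Psi(g(alpha + u*)) for alpha >= 0,
    #               phi(alpha) = g(-alpha + u*)     for alpha <  0.
    if alpha >= 0:
        return Psi(g(alpha + u_star))
    return g(-alpha + u_star)

for a in (-2.0, -0.5, 0.0, 0.7, 2.0):
    print(a, phi(a), np.exp(-a))              # matches the exponential loss
```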

59 Recall that we have shown previously that the 0-1 loss induces the variational distance, which can be expressed as an f -divergence with fvar (u) = −2 min(u, 1) for u ≥ 0. [sent-136, score-0.575]

60 It is thus of particular interest to determine other loss functions that also lead to variational distance. [sent-137, score-0.454]

61 If we augment the function fvar by defining fvar (u) = +∞ for u < 0, then we can recover Ψ from fvar as follows: Ψ(β) = fvar∗ (−β) = supu∈R (−βu − fvar (u)), which equals (2 − β)+ when β ≥ 0 and +∞ when β < 0. [sent-138, score-0.787]

62 We consider f -divergences for two convex functions f1 and f2 to be equivalent if f1 and f2 are related by a linear term, i.e., if f1 (u) = f2 (u) + au + b for some constants a, b ∈ R. [sent-139, score-0.237]

63 Choosing g(u) = u leads to the hinge loss φ(α) = (1 − α)+ , which is consistent with our earlier findings. [sent-143, score-0.414]

64 Making the alternative choice g(u) = exp(u − 1) leads to a rather different loss—namely, φ(α) = (2 − exp(α))+ for α ≥ 0 and φ(α) = exp(−α) for α < 0—that also realizes the variational distance. [sent-144, score-0.156]
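
Both constructions can be checked numerically. In the sketch below (hypothetical measures, grid search over α), the hinge loss and the alternative loss just described yield the same optimal risk 1 − V(µ, π) for a fixed quantizer.

```python
import numpy as np

mu = np.array([0.30, 0.15, 0.05])        # hypothetical measures
pi = np.array([0.05, 0.15, 0.30])
alphas = np.linspace(-20, 20, 200001)

def optimal_risk(phi):
    return sum((phi(alphas) * m + phi(-alphas) * p).min() for m, p in zip(mu, pi))

hinge = lambda a: np.maximum(1.0 - a, 0.0)

def phi_alt(a):
    # The alternative loss: (2 - e^alpha)_+ for alpha >= 0 and e^{-alpha} for alpha < 0
    return np.where(a >= 0, np.maximum(2.0 - np.exp(a), 0.0), np.exp(-a))

V = np.abs(mu - pi).sum()
print(optimal_risk(hinge), optimal_risk(phi_alt), 1.0 - V)   # all approximately equal
```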

65 Using Theorem 4 it can be shown that an f -divergence is realizable by a margin-based surrogate loss if and only if it is symmetric [7]. [sent-145, score-0.874]

66 The symmetric KL divergence KL(µ||π) + KL(π||µ) is a realizable f -divergence. [sent-147, score-0.147]

67 4 On comparison of loss functions and quantization schemes The previous section was devoted to the study of the correspondence between f -divergences and the optimal φ-risk Rφ (Q) for a fixed experiment Q. [sent-150, score-0.628]

68 Our ultimate goal, however, is that of choosing an optimal Q, a problem known as experimental design in the statistics literature [3]. [sent-151, score-0.169]

69 One concrete application is the design of quantizers for performing decentralized detection [12, 6] in a sensor network. [sent-152, score-0.168]

70 In this section, we address the experiment design problem via the joint optimization of the φ-risk (or more precisely, its empirical version) over both the decision γ and the choice of experiment Q (hereafter referred to as a quantizer). [sent-153, score-0.275]

71 This procedure raises the natural theoretical question: for what loss functions φ does such joint optimization lead to minimum Bayes risk? [sent-154, score-0.408]

72 Note that the minimum here is taken over both the decision rule γ and the space of experiments Q, so that this question is not covered by standard consistency results [13, 10, 2]. [sent-155, score-0.151]

73 4.1 Universal equivalence The connection between f -divergences and 0-1 loss can be traced back to seminal work on the comparison of experiments [3]. [sent-158, score-0.449]

74 Formally, we say that the quantization scheme Q1 dominates Q2 if Rbayes (Q1 ) ≤ Rbayes (Q2 ) for any prior probabilities q ∈ (0, 1). [sent-159, score-0.11]

75 Q1 dominates Q2 iff If (µQ1 , πQ1 ) ≥ If (µQ2 , πQ2 ), for all convex functions f . [sent-161, score-0.291]

76 Q1 dominates Q2 iff Rφ (Q1 ) ≤ Rφ (Q2 ) for any surrogate loss φ. [sent-164, score-0.903]

77 One implication of Corollary 6 is that if Rφ (Q1 ) ≤ Rφ (Q2 ) for some loss function φ, then Rbayes (Q1 ) ≤ Rbayes (Q2 ) for some set of prior probabilities on the labels Y . [sent-165, score-0.348]

78 This fact justifies the use of a surrogate φ-loss as a proxy for the 0-1 loss, at least for a certain subset of prior probabilities. [sent-166, score-0.5]

79 Typically, however, the goal is to select the optimal experiment Q for a pre-specified set of priors, in which context this implication is of limited use. [sent-167, score-0.127]

80 We are thus motivated to consider a different method of determining which loss functions (or equivalently, f -divergences) lead to the same optimal experimental design as the 0-1 loss (respectively the variational distance). [sent-168, score-0.95]

81 More generally, we are interested in comparing two arbitrary loss functions φ1 and φ2 , with corresponding divergences induced by f1 and f2 respectively: Definition 7. [sent-169, score-0.392]

82 The surrogate loss functions φ1 and φ2 are universally equivalent, denoted by φ1 ≈u φ2 (and f1 ≈u f2 ), if for any P (X, Y ) and quantization rules Q1 , Q2 , there holds: Rφ1 (Q1 ) ≤ Rφ1 (Q2 ) ⇔ Rφ2 (Q1 ) ≤ Rφ2 (Q2 ). (10) [sent-170, score-1.063]

83 The following result provides necessary and sufficient conditions for universal equivalence: Theorem 8. [sent-171, score-0.145]

84 If we restrict our attention to convex and differentiable a.e. functions f , then it follows that all f -divergences universally equivalent to the variational distance must have the form f (u) = −c min(u, 1) + au + b with c > 0. (11) [sent-176, score-0.192] [sent-178, score-0.294]

86 As a consequence, the only φ-loss functions universally equivalent to the 0-1 loss are those that induce an f -divergence of this form (11). [sent-179, score-0.512]
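
The contrast can be illustrated empirically. In the sketch below (invented class-conditional distributions, randomly generated stochastic quantizers), the hinge loss, which is universally equivalent to the 0-1 loss (indeed Rhinge = 2Rbayes), never reverses the Bayes-risk ordering of two quantizers, whereas the exponential loss may disagree on some pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
p_pos = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical P(X = x | Y = +1)
p_neg = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical P(X = x | Y = -1)
q = 0.5                                  # prior P(Y = +1)

def induced_measures(Q):
    # Q is a 4 x 2 row-stochastic matrix with Q[x, z] = Q(z | x)
    return q * p_pos @ Q, (1 - q) * p_neg @ Q

def random_quantizer():
    Q = rng.random((4, 2))
    return Q / Q.sum(axis=1, keepdims=True)

r_bayes = lambda mu, pi: np.minimum(mu, pi).sum()
r_hinge = lambda mu, pi: 2 * np.minimum(mu, pi).sum()   # closed form from Section 2
r_exp = lambda mu, pi: 2 * np.sqrt(mu * pi).sum()       # closed form from Section 2

disagree_hinge = disagree_exp = 0
for _ in range(10000):
    m1, p1 = induced_measures(random_quantizer())
    m2, p2 = induced_measures(random_quantizer())
    bayes_order = r_bayes(m1, p1) <= r_bayes(m2, p2)
    disagree_hinge += bayes_order != (r_hinge(m1, p1) <= r_hinge(m2, p2))
    disagree_exp += bayes_order != (r_exp(m1, p1) <= r_exp(m2, p2))

print(disagree_hinge, disagree_exp)   # hinge: always 0; exponential: possibly > 0
```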

87 4.2 Consistency in experimental design The notion of universal equivalence might appear quite restrictive because condition (10) must hold for any underlying probability measure P (X, Y ). [sent-182, score-0.324]

88 Let (γ̂n , Q∗n ) be an optimal solution to the minimization problem min(γ,Q)∈(Cn ,Dn ) R̂φ (γ, Q), and let R∗bayes denote the minimum Bayes risk achieved over the space of decision rules (γ, Q) ∈ (Γ, Q). [sent-194, score-0.303]

89 We say that such a procedure is universally consistent if the excess Bayes error tends to 0 as n → ∞, i.e., limn→∞ Rbayes (γ̂n , Q∗n ) − R∗bayes = 0. [sent-196, score-0.157]

90 When the surrogate loss φ is universally equivalent to the 0-1 loss, we can prove that suitable learning procedures are indeed universally consistent. [sent-199, score-1.077]

91 Assume that the loss function φ is universally equivalent to the 0-1 loss. [sent-202, score-0.451]

92 These technical conditions are needed so that the approximation error due to varying Q dominates the approximation error due to varying γ. [sent-209, score-0.142]

93 The following lemma plays a key role in our proof: it links the excess φ-risk to the Bayes error when performing joint minimization: Lemma 9. [sent-211, score-0.113]

94 Finally, we can relate the Bayes error to the approximation error and estimation error, and provide general conditions for universal consistency: Theorem 10. [sent-213, score-0.229]

95 5 Conclusions We have presented a general theoretical connection between surrogate loss functions and f -divergences. [sent-216, score-0.924]

96 As illustrated by our application to decentralized detection, this connection can provide new domains of application for statistical learning theory. [sent-217, score-0.133]

97 The divergence and Bhattacharyya distance measures in signal selection. [sent-252, score-0.211]

98 Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems. [sent-276, score-0.252]

99 Some inequalities for information divergence and related measures of discrimination. [sent-291, score-0.185]

100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-300, score-0.411]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('surrogate', 0.5), ('rbayes', 0.388), ('loss', 0.32), ('inf', 0.214), ('hellinger', 0.194), ('cn', 0.176), ('risk', 0.166), ('convex', 0.147), ('fvar', 0.146), ('dn', 0.12), ('kl', 0.119), ('universal', 0.114), ('universally', 0.102), ('consistency', 0.098), ('divergence', 0.093), ('correspondence', 0.093), ('bayes', 0.086), ('equivalence', 0.086), ('lemma', 0.085), ('theorem', 0.083), ('nguyen', 0.077), ('variational', 0.073), ('divergences', 0.072), ('hinge', 0.071), ('design', 0.068), ('au', 0.068), ('distance', 0.063), ('decentralized', 0.062), ('functions', 0.061), ('limn', 0.059), ('classi', 0.059), ('borel', 0.058), ('quantizer', 0.058), ('constructive', 0.057), ('dominates', 0.055), ('quantization', 0.055), ('measures', 0.055), ('sup', 0.054), ('realizable', 0.054), ('optimal', 0.05), ('experiment', 0.049), ('semicontinuous', 0.049), ('xuanlong', 0.049), ('wainwright', 0.046), ('realize', 0.046), ('differentiable', 0.045), ('choice', 0.045), ('decreasing', 0.045), ('berkeley', 0.044), ('connection', 0.043), ('decentralization', 0.042), ('california', 0.042), ('family', 0.04), ('realizes', 0.038), ('quantizers', 0.038), ('precisely', 0.038), ('inequalities', 0.037), ('triangular', 0.036), ('induces', 0.036), ('referred', 0.036), ('letting', 0.034), ('minimization', 0.034), ('determining', 0.034), ('discrimination', 0.033), ('losses', 0.032), ('mn', 0.032), ('resolve', 0.032), ('exponential', 0.032), ('notion', 0.032), ('conditions', 0.031), ('cation', 0.031), ('augment', 0.03), ('realized', 0.03), ('sign', 0.029), ('equivalent', 0.029), ('log', 0.029), ('implication', 0.028), ('provide', 0.028), ('decision', 0.028), ('error', 0.028), ('consequence', 0.028), ('iff', 0.028), ('dp', 0.028), ('procedure', 0.027), ('min', 0.027), ('recover', 0.027), ('choosing', 0.027), ('takes', 0.027), ('discriminant', 0.027), ('continuous', 0.026), ('concave', 0.026), ('rules', 0.025), ('deviations', 0.025), ('question', 0.025), ('compact', 0.025), ('procedures', 0.024), ('holds', 0.024), ('dense', 0.024), ('experimental', 0.024), ('earlier', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 58 nips-2005-Divergences, surrogate loss functions and experimental design

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

2 0.26329279 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1

3 0.19145255 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

4 0.14605609 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

Author: Jason Palmer, Kenneth Kreutz-Delgado, Bhaskar D. Rao, David P. Wipf

Abstract: We consider criteria for variational representations of non-Gaussian latent variables, and derive variational EM algorithms in general form. We establish a general equivalence among convex bounding methods, evidence based methods, and ensemble learning/Variational Bayes methods, which has previously been demonstrated only for particular cases.

5 0.13858396 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis

Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1

6 0.1281738 95 nips-2005-Improved risk tail bounds for on-line algorithms

7 0.12658048 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

8 0.12647416 50 nips-2005-Convex Neural Networks

9 0.12370966 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

10 0.098023102 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

11 0.090779923 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

12 0.086694039 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

13 0.078183688 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

14 0.07765688 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

15 0.077293962 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

16 0.076788329 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

17 0.076183863 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

18 0.074081533 167 nips-2005-Robust design of biological experiments

19 0.073750295 112 nips-2005-Learning Minimum Volume Sets

20 0.073651679 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.214), (1, 0.127), (2, -0.045), (3, -0.139), (4, 0.217), (5, 0.095), (6, -0.172), (7, -0.046), (8, -0.152), (9, -0.227), (10, 0.001), (11, -0.055), (12, -0.139), (13, 0.017), (14, 0.103), (15, -0.093), (16, 0.054), (17, -0.037), (18, -0.051), (19, -0.007), (20, -0.01), (21, -0.025), (22, -0.012), (23, -0.039), (24, -0.175), (25, 0.077), (26, 0.011), (27, -0.006), (28, 0.015), (29, -0.062), (30, -0.124), (31, 0.032), (32, 0.026), (33, -0.018), (34, -0.019), (35, -0.002), (36, 0.017), (37, 0.032), (38, -0.063), (39, -0.016), (40, 0.019), (41, -0.14), (42, 0.087), (43, -0.017), (44, -0.007), (45, -0.157), (46, -0.036), (47, 0.061), (48, -0.096), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95915788 58 nips-2005-Divergences, surrogate loss functions and experimental design

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

2 0.7115103 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1

3 0.6747545 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

4 0.64618015 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

Author: Anatoli Juditsky, Alexander Nazin, Alexandre Tsybakov, Nicolas Vayatis

Abstract: We consider the problem of constructing an aggregated estimator from a finite class of base functions which approximately minimizes a convex risk functional under the ℓ1 constraint. For this purpose, we propose a stochastic procedure, the mirror descent, which performs gradient descent in the dual space. The generated estimates are additionally averaged in a recursive fashion with specific weights. Mirror descent algorithms have been developed in different contexts and they are known to be particularly efficient in high dimensional problems. Moreover their implementation is adapted to the online setting. The main result of the paper is the upper bound on the convergence rate for the generalization error. 1

5 0.56719005 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

Author: Aurelie C. Lozano, Sanjeev R. Kulkarni, Robert E. Schapire

Abstract: We study the statistical convergence and consistency of regularized Boosting methods, where the samples are not independent and identically distributed (i.i.d.) but come from empirical processes of stationary β-mixing sequences. Utilizing a technique that constructs a sequence of independent blocks close in distribution to the original samples, we prove the consistency of the composite classifiers resulting from a regularization achieved by restricting the 1-norm of the base classifiers’ weights. When compared to the i.i.d. case, the nature of sampling manifests in the consistency result only through generalization of the original condition on the growth of the regularization parameter.

6 0.51886344 112 nips-2005-Learning Minimum Volume Sets

7 0.48723671 95 nips-2005-Improved risk tail bounds for on-line algorithms

8 0.45816389 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

9 0.44936368 50 nips-2005-Convex Neural Networks

10 0.44876084 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

11 0.42143548 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

12 0.39142397 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

13 0.37122843 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

14 0.36409965 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

15 0.34723389 126 nips-2005-Metric Learning by Collapsing Classes

16 0.34142742 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

17 0.3388854 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

18 0.33609462 167 nips-2005-Robust design of biological experiments

19 0.32850695 166 nips-2005-Robust Fisher Discriminant Analysis

20 0.32557482 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.089), (10, 0.04), (27, 0.014), (31, 0.05), (34, 0.143), (39, 0.023), (50, 0.018), (55, 0.045), (69, 0.06), (72, 0.23), (73, 0.034), (88, 0.052), (91, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88365573 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

Author: Ross Lippert, Ryan Rifkin

Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial 1 whose order is controlled by the rielative rates of decay of σ2 and λ: if λ = σ −(2k+1) , then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1

same-paper 2 0.8380155 58 nips-2005-Divergences, surrogate loss functions and experimental design

Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan

Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1

3 0.74356711 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard

Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1

4 0.65644795 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

Author: John D. Lafferty, Pradeep K. Ravikumar

Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1

5 0.65245253 47 nips-2005-Consistency of one-class SVM and related algorithms

Author: Régis Vert, Jean-philippe Vert

Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1

6 0.64450914 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

7 0.64323211 50 nips-2005-Convex Neural Networks

8 0.64056289 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

9 0.64016926 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

10 0.6359424 184 nips-2005-Structured Prediction via the Extragradient Method

11 0.63445181 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

12 0.63426626 177 nips-2005-Size Regularized Cut for Data Clustering

13 0.63302904 105 nips-2005-Large-Scale Multiclass Transduction

14 0.63045353 98 nips-2005-Infinite latent feature models and the Indian buffet process

15 0.62912387 112 nips-2005-Learning Minimum Volume Sets

16 0.62811691 114 nips-2005-Learning Rankings via Convex Hull Separation

17 0.62593329 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

18 0.62504309 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

19 0.62320805 78 nips-2005-From Weighted Classification to Policy Search

20 0.62270987 54 nips-2005-Data-Driven Online to Batch Conversions