jmlr jmlr2005 jmlr2005-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
Reference: text
sentIndex sentText sentNum sentScore
1 Bennett and Nicolò Cesa-Bianchi Abstract We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. [sent-11, score-0.358]
2 These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. [sent-12, score-0.918]
3 The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. [sent-13, score-0.722]
4 The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. [sent-15, score-0.375]
5 Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. [sent-19, score-0.548]
6 Such a function is often referred to as a regression function or a regressor for short. [sent-28, score-0.382]
7 This setting is also suitable for learning a linear combination of base regressors of the form f(x) = ∑_{j=1}^{l} λ_j h_j(x), where each base regressor h_j is a mapping from an instance domain X into R. [sent-31, score-0.754]
8 Figure 1: Constructing regression losses (left; legend: abs-loss, log-loss, exp-loss) by symmetrization of margin losses (right; legend: hinge-loss, log-loss, exp-loss). [sent-36, score-0.413]
9 As we discuss shortly, the loss functions we consider in this paper depend only on the discrepancy between the predicted target and the true target δ = f (x) − y, hence L can be viewed as a function from R into R+ . [sent-38, score-0.392]
10 Given a loss function L, the goal of a regression algorithm is to find a regressor f which attains a small total loss on the training set S, Loss(λ, S) = ∑_{i=1}^m L(f(x_i) − y_i) = ∑_{i=1}^m L(λ · x_i − y_i). [sent-40, score-1.082]
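For concreteness, the losses and the total loss above can be evaluated directly. The snippet below is a minimal sketch rather than the authors' code: it assumes the symmetrized forms Llog(δ; ε) = log(1 + e^(δ−ε)) + log(1 + e^(−δ−ε)) and Lexp(δ; ε) = e^(δ−ε) + e^(−δ−ε), obtained by applying the margin-based losses to both δ − ε and −δ − ε, and it sums them over the sample exactly as in Loss(λ, S).

    import numpy as np

    def sym_log_loss(delta, eps=0.0):
        # Assumed symmetrized logistic loss: log(1 + e^(delta-eps)) + log(1 + e^(-delta-eps)).
        # np.logaddexp(0, z) computes log(1 + e^z) in a numerically stable way.
        return np.logaddexp(0.0, delta - eps) + np.logaddexp(0.0, -delta - eps)

    def sym_exp_loss(delta, eps=0.0):
        # Assumed symmetrized exponential loss: e^(delta-eps) + e^(-delta-eps).
        return np.exp(delta - eps) + np.exp(-delta - eps)

    def empirical_loss(lam, X, y, eps=0.0, loss=sym_log_loss):
        # Loss(lambda, S) = sum_i L(lambda . x_i - y_i) for a linear regressor lambda.
        delta = X @ lam - y
        return np.sum(loss(delta, eps))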
11 It has been argued that the squared loss is sensitive to outliers, hence robust regression algorithms often employ the absolute loss (Huber, 1981). [sent-42, score-0.617]
12 The ε-insensitive hinge loss is not smooth as its derivative is discontinuous at δ = ±ε. [sent-45, score-0.466]
13 Several batch learning algorithms have been proposed for minimizing the ε-insensitive hinge loss (see for example Vapnik, 1998; Smola and Schölkopf, 1998). [sent-46, score-0.448]
14 The first loss function presented in this paper is a smooth approximation to the ε-insensitive hinge loss. [sent-48, score-0.423]
15 Learning algorithms for margin-based classifiers typically employ a margin-based loss function Lc (yf (x)) and attempt to minimize the total loss over all instances in a given sample. [sent-65, score-0.577]
16 We discuss a general technique, called loss symmetrization, for reducing a regression problem to a margin-based classification problem. [sent-67, score-0.358]
17 The technique of loss symmetrization was previously discussed in Bi and Bennett (2003) in the context of support vector regression. [sent-71, score-0.387]
18 In Section 3 we describe another family of additive update regression algorithms based on modified gradient descent. [sent-102, score-0.442]
19 Specifically, we show how both the additive update algorithm of Section 3 and the regularization scheme of Section 4 extend to the setting of classification. [sent-107, score-0.422]
20 Our implicit goal is to obtain the (global) minimizer of the empirical loss function ∑_{i=1}^m L(λ · x_i − y_i), where L is either the log-loss, the exp-loss or the comb-loss. [sent-117, score-0.36]
21 The j’th element of λ, namely λj , should be regarded as the weight associated with the base regressor hj . [sent-131, score-0.447]
22 Thus, this form of sequential update can be used for feature selection as well as loss minimization. [sent-134, score-0.444]
23 This algorithm includes both the sequential update and the parallel update paradigms as special cases. Input: Training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R}_{i=1}^m; Insensitivity ε ∈ R+; Update templates A ⊆ R^n s. [sent-142, score-0.518]
24 In the sequential version of the algorithm, selecting an update template is equivalent to selecting a single base regressor and updating its weight. [sent-172, score-0.615]
25 Proof Define ∆t (i) to be the difference between the loss attained by λt and that attained by λt+1 on an instance-target pair (xi , yi ) in the training set, namely ∆t (i) = Llog (δt,i ) − Llog (δt+1,i ). [sent-182, score-0.465]
26 The weighted loss is defined as Loss(λ, ν, S) = ∑_{i=1}^m ν_i L(λ · x_i − y_i), where L(·) is any of the loss functions discussed above. [sent-214, score-0.596]
27 This family includes a parallel update which modifies all the elements of λ simultaneously and a sequential update which updates a single element of λ on each iteration. [sent-223, score-0.41]
28 The sequential update applied to an element. Input: Training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R}_{i=1}^m; Insensitivity ε ∈ R+; Update templates A ⊆ R^n s. [sent-225, score-0.354]
29 The additive update family of algorithms can accommodate a weighted loss just as the log-additive update algorithms do. [sent-276, score-0.736]
30 As noted in Section 2 and Section 3, both the log-additive and additive update batch algorithms easily accommodate a weighted loss. [sent-293, score-0.384]
31 In practice, we do not need to explicitly add pseudo-examples to our sample in order to incorporate a regularization term into the loss function. [sent-297, score-0.371]
32 Furthermore, since the regularization term tends to infinity at least as fast as ‖λ‖_1, the regularized loss has an attainable global minimum. [sent-309, score-0.425]
33 In other words, this form of regularization enforces uniqueness of the solution in our loss minimization problem. [sent-310, score-0.371]
34 be the sequence of vectors generated by the log-additive (Figure 3) or the additive (Figure 4) updates, using either of the regularized loss functions discussed in this paper. [sent-325, score-0.465]
35 Proof Due to the introduction of the regularization term, the loss function is strictly convex and attains its unique minimum at the point denoted λ , as argued in the previous section. [sent-327, score-0.428]
36 Since the loss attained by the algorithm on every iteration is non-increasing, the contribution of the regularization term to the total loss certainly does not exceed L0 /ν. [sent-337, score-0.699]
37 Next, note that the lower bound on the decrease in loss given in Theorem 1 and Theorem 2 can be thought of as a function of the current regressor λt and the chosen template at . [sent-343, score-0.642]
38 Since the regularized loss function is strictly convex, a zero gradient vector is attained only at the optimal point λ . [sent-348, score-0.394]
39 The additive update results in a new boosting procedure for classification accompanied with a matching criterion for selecting a base hypothesis. [sent-377, score-0.533]
40 Given a training set S = {x_i, y_i}_{i=1}^m with y_i ∈ {−1, +1}, define the pseudo-sample S̄ = {x_i, −y_i} and use log-loss boosting to train a classifier whose task is to minimize the loss (1 − ν)Loss(λ, S) + ν Loss(λ, S̄), where, as before, ν is a regularization parameter. [sent-398, score-0.371]
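As a sketch of the construction just described (not the paper's boosting code), the combined objective over the sample S and the label-flipped pseudo-sample S̄ can be written with the standard margin-based log-loss log(1 + e^(−y f(x))); the use of that particular margin loss for "log-loss boosting" is the only assumption here.

    import numpy as np

    def margin_log_loss(margins):
        # Classification log-loss log(1 + e^(-y f(x))) evaluated at margins y_i * (lambda . x_i).
        return np.logaddexp(0.0, -margins)

    def regularized_boosting_objective(lam, X, y, nu):
        # (1 - nu) * Loss(lambda, S) + nu * Loss(lambda, S_bar), where S_bar flips every label.
        margins = y * (X @ lam)
        loss_S = np.sum(margin_log_loss(margins))       # original sample S
        loss_S_bar = np.sum(margin_log_loss(-margins))  # pseudo-sample {(x_i, -y_i)}
        return (1.0 - nu) * loss_S + nu * loss_S_bar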
41 We then receive the true target yt and suffer an instantaneous loss equal to Llog (λt · xt − yt ). [sent-415, score-0.578]
42 The learning algorithm employs an update rule which modifies its current regressor after each round. [sent-427, score-0.42]
43 The first is additive in the gradient of the loss and is thus called Gradient Descent (GD) while the second is exponential in the gradient of the loss and is analogously called Exponentiated Gradient (EG). [sent-429, score-0.757]
44 Note that the GD algorithm updates its current regressor, λt , by subtracting the gradient of the loss function from it. [sent-431, score-0.353]
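A single GD round can be sketched as follows. Two assumptions are made explicit here: the derivative of the symmetric log-loss is taken to be L'(δ; ε) = σ(δ − ε) − σ(−δ − ε) with σ the logistic sigmoid, and a step size eta is introduced because the text only says that the gradient is subtracted; eta = 1 is a placeholder, not the paper's tuned constant.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sym_log_loss_deriv(delta, eps=0.0):
        # Assumed derivative of log(1 + e^(delta-eps)) + log(1 + e^(-delta-eps)) w.r.t. delta.
        return sigmoid(delta - eps) - sigmoid(-delta - eps)

    def gd_round(lam, x_t, y_t, eps=0.0, eta=1.0):
        # Predict, observe y_t, then subtract the gradient of the instantaneous loss
        # Llog(lambda_t . x_t - y_t) from the current regressor lambda_t.
        delta = np.dot(lam, x_t) - y_t
        return lam - eta * sym_log_loss_deriv(delta, eps) * x_t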
45 In the following analysis we give a bound on the cumulative loss attained on any number of rounds. [sent-433, score-0.398]
46 However, rather than bounding the loss per se we bound the cumulative loss relative to the cumulative loss suffered by a fixed regressor µ. [sent-434, score-1.2]
47 The bound holds for any linear regressor µ and any number of rounds, hence we get that the GD algorithm is competitive with the optimal (fixed) linear regressor for any number of rounds. [sent-435, score-0.566]
48 Formally, the following theorem states that the cumulative loss attained by the GD algorithm is at most twice the cumulative loss of any fixed linear regressor plus an additive constant. [sent-436, score-1.21]
49 Then for any fixed linear regressor µ ∈ R^n we have ∑_{t=1}^T Llog(λ_t · x_t − y_t) ≤ 2 ∑_{t=1}^T Llog(µ · x_t − y_t) + R‖µ‖_2^2. [sent-444, score-0.647]
50 Lemma 5 Consider the setting of Theorem 4; then for each round t we have Llog(λ_t · x_t − y_t) − 2 Llog(µ · x_t − y_t) ≤ R(‖λ_t − µ‖_2^2 − ‖λ_{t+1} − µ‖_2^2). [sent-448, score-0.395]
51 Intuitively, the lemma states that if the loss attained by λt on round t is greater than the loss of a fixed regressor µ, then the algorithm will update λt such that it gets closer to µ. [sent-450, score-1.059]
52 In contrast, if the loss of µ is greater than the loss of GD, the algorithm may move its regressor away from µ. [sent-451, score-0.801]
53 Since EG maintains a regressor from the probability simplex, we measure the cumulative loss of the EG algorithm relative to the cumulative loss achieved by any fixed regressor from the probability simplex. [sent-461, score-1.224]
54 Then, for any fixed regressor µ ∈ P_n we have ∑_{t=1}^T Llog(λ_t · x_t − y_t) ≤ (4/3) ∑_{t=1}^T Llog(µ · x_t − y_t) + (4R/3) D_RE(µ, λ_1), where D_RE(p, q) = ∑_j p_j log(p_j/q_j) is the relative entropy function. [sent-469, score-0.647]
55 Llog(λ_t · x_t − y_t) − (4/3) Llog(µ · x_t − y_t) ≤ (4R/3) (D_RE(µ, λ_t) − D_RE(µ, λ_{t+1})). (16) The proof of the lemma is given in Appendix A. [sent-472, score-0.385]
56 3 we use the sequential form of our update as a boosting procedure which uses regression stumps as base regressors. [sent-482, score-0.568]
57 More specifically, we learn a kernel-based regressor and compare our log-loss regularization to Support Vector Regression with l1 regularization. [sent-488, score-0.395]
58 In these experiments, we compare the cumulative loss of the online GD algorithm with its theoretical bound given in Theorem 4. [sent-491, score-0.382]
59 The end result is that the solution obtained by minimizing the log-loss shares the same asymptotic behavior as the ℓ_1 regression loss (∑_i |δ_i|). [sent-495, score-0.385]
60 On the other hand, the solution found by minimizing the exp-loss approximately minimizes the ℓ_∞ regression loss on the sample. [sent-496, score-0.385]
61 The regressor attained by minimizing the exp-loss, however, approximately minimizes the maximal discrepancy over the entire data set and therefore lies significantly below. [sent-509, score-0.442]
62 Here, the regressor obtained by minimizing the exp-loss lies between the two groups of points and as such approximately minimizes the ℓ_∞ regression loss on the sample. [sent-513, score-0.668]
63 The regressor found by minimizing the log-loss practically ignores the samples that were shifted by 1 and as such approximately minimizes the ℓ_1 regression loss on the sample. [sent-514, score-0.668]
64 2 A Comparison of the Log-Additive and Additive Updates In this section we compare the performance of the log-additive update from Figure 3 to that of the additive update from Figure 4. [sent-516, score-0.447]
65 For the log-additive update the constraint on a becomes a · max_i |x_i| ≤ 1, while for the additive update the constraint is a ∑_i x_i^2 ≤ 2. [sent-521, score-0.478]
66 Therefore, for the first data set, the constraint on a reduces to a ≤ 2 when using the log-additive update and to a ≤ 4 when the additive update is used. [sent-544, score-0.447]
67 Therefore, in the settings discussed in this section, the additive update should converge faster on the first data set while the log-additive update should converge faster on the second data set. [sent-552, score-0.447]
68 It is clear from the graphs that our expectations are met and that the additive update converges faster than the log-additive update on the first data set and slower on the second data set. [sent-554, score-0.447]
69 Recall that in the case of the log-additive update, the weights of the examples are q_i^+ = e^{δ_i} and q_i^− = e^{−δ_i}, while for the additive update we further divide these weights by Z. [sent-556, score-0.42]
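Computed literally, these example weights are a one-liner; the snippet below is a sketch in which the normalizer Z is assumed to be the total weight mass, which may differ from the exact definition used in Figure 4.

    import numpy as np

    def example_weights(delta, normalize=False):
        # Log-additive update: q_i^+ = e^(delta_i), q_i^- = e^(-delta_i).
        q_plus, q_minus = np.exp(delta), np.exp(-delta)
        if normalize:  # additive update: divide by Z (assumed here to be the total mass)
            Z = np.sum(q_plus) + np.sum(q_minus)
            q_plus, q_minus = q_plus / Z, q_minus / Z
        return q_plus, q_minus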
70 3 Boosting Regression Stumps The next experiment demonstrates the effectiveness of the log-additive and additive updates in their sequential form, when they are applied as boosting procedures. [sent-563, score-0.421]
71 The goal of the boosting algorithm is to construct a highly accurate regressor by combining base regressors obtained from consecutive calls to the base learner. [sent-565, score-0.754]
72 On every boosting iteration the base learner receives the training set along with the weights q_{t,i}^+ and q_{t,i}^− generated by the boosting algorithm. [sent-566, score-0.409]
73 The goal of the base learner is to construct a regressor which maximizes the decrease in loss. [sent-567, score-0.428]
74 We denote by ht : Rn → R, the regressor returned by the base learner on round t. [sent-568, score-0.456]
75 We use either the bound in Theorem 1 or the bound in Theorem 2 as the criterion for selecting a base regressor using the log-additive and additive updates respectively. [sent-569, score-0.601]
76 That is, the base learner attempts to maximize the lower bound on the decrease in loss given in Theorem 1 or Theorem 2. [sent-570, score-0.404]
77 The pseudocode of the regression learning algorithm using stumps for both the additive and the log-additive updates is given in Figure 8. [sent-605, score-0.394]
78 The LAD algorithm appears to be able to decrease the hinge loss and the MSE much faster than our algorithm on both the training data (not shown) and the test data. [sent-616, score-0.384]
79 However, despite its initial performance, LAD seems to “get stuck” rather quickly and the final regressor it obtains has substantially higher loss on both data sets compared to our algorithm, whether it is trained with the log-additive update or the additive one. [sent-620, score-0.852]
80 The improved generalization performance may be attributed to the following behavior that is common to boosting algorithms: as more base regressors are added, the regression error obtained on most of the examples is rather small. [sent-621, score-0.486]
81 [Figure residue: ℓ_1-loss curves for the additive, log-additive, and LAD algorithms.] [sent-624, score-0.691]
82 Thus, even simple regressors such as decision stumps can further reduce the loss on the remaining examples for which the loss is still high. [sent-633, score-0.743]
83 Comparing the performance of the additive and the log-additive updates in this experiment, it is apparent that the former seems more effective in reducing the loss but is also more susceptible to over-fitting. [sent-635, score-0.493]
84 4 Examining the Effect of Regularization So far we have focused on experiments which illustrate the different facets of empirical loss minimization using different update schemes. [sent-639, score-0.396]
85 regularization can ensure that the generalization loss (i. [sent-643, score-0.371]
86 the regression loss suffered on test examples) would not greatly exceed the loss obtained on the training set. [sent-645, score-0.617]
87 We used the log-additive and additive update algorithms to minimize the following regularized loss, Loss(λ, ν, S) = ∑_{i=1}^m Llog(f_λ(x_i) − y_i; ε) + ν ∑_j Llog(λ_j). (20) [sent-651, score-0.416]
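Read literally, the regularized loss above can be evaluated as in the sketch below; it assumes the symmetrized form of Llog and reads the regularization term as Llog applied coordinate-wise to λ with ε = 0, since no insensitivity argument appears in that term.

    import numpy as np

    def sym_log_loss(delta, eps=0.0):
        # Assumed form: Llog(delta; eps) = log(1 + e^(delta-eps)) + log(1 + e^(-delta-eps)).
        return np.logaddexp(0.0, delta - eps) + np.logaddexp(0.0, -delta - eps)

    def regularized_loss(lam, X, y, eps, nu):
        # sum_i Llog(f_lambda(x_i) - y_i; eps) + nu * sum_j Llog(lambda_j), as in Eq. (20).
        data_term = np.sum(sym_log_loss(X @ lam - y, eps))
        reg_term = np.sum(sym_log_loss(lam, 0.0))
        return data_term + nu * reg_term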
88 As illustrated in Figure 1, the log-loss can be interpreted as a smooth approximation to the ε-insensitive hinge loss used by Support Vector Regression (SVR). [sent-652, score-0.423]
89 As anticipated, the training loss for both algorithms is monotonically increasing in the regularization parameter ν while the difference between the test and training loss is monotonically decreasing in ν. [sent-668, score-0.63]
90 As can be seen from the figure, the lowest test loss attained by SVR is very close to the value attained by the log-loss (with a slight advantage to the latter). [sent-673, score-0.397]
91 Indeed, for ν in [10^−5, 1], the discrepancy between the losses of the regressors found by the log-loss is less than 1, while in the same range the losses of SVR can be as much as 3 units apart. [sent-675, score-0.413]
92 Theorem 4 states that the GD online algorithm attains a cumulative log-loss which is at most twice the loss of any fixed regressor µ, up to a constant additive factor. [sent-679, score-0.895]
93 For any finite number of online rounds T , the theorem in particular holds for µ = λT , the regressor which attains the minimal log-loss on the first T examples in the sequence. [sent-680, score-0.454]
94 The cumulative loss of the GD algorithm is depicted on the left hand side of Figure 12, along with the cumulative loss of λT and the worst-case guarantee attained from Theorem 4 with µ = λT . [sent-686, score-0.748]
95 Discussion We described a framework for solving regression problems by a symmetrization of margin-based loss functions commonly used in boosting techniques. [sent-698, score-0.625]
96 Our approach naturally lent itself to a shifted and symmetric loss function which is approximately zero in a pre-specified interval and can thus be used as a smooth alternative to the ε-insensitive hinge loss. [sent-699, score-0.423]
97 Another interesting direction is the marriage of the loss symmetrization technique with other boosting related techniques such as drifting games (Schapire, 1999; Freund and Opper, 2002). [sent-707, score-0.551]
98 Technical Proofs Proof of Lemma 5: Recall that we denote the discrepancy between the target predicted by the online algorithm and the true target by δ_t = λ_t · x_t − y_t. [sent-712, score-0.368]
99 Since (λt − µ) · xt = (λt · xt − yt ) + (yt − µ · xt ) = δt − δt , we can rewrite Eq. [sent-715, score-0.384]
100 Proof of Lemma 7: Recall that the EG update rule is λ_{t+1,j} = λ_{t,j} e^{−β_t x_{t,j}} / ∑_k λ_{t,k} e^{−β_t x_{t,k}} (23), where β_t = L′(λ_t · x_t − y_t)/R and R ≥ (max_j x_{t,j} − min_j x_{t,j})^2. [sent-731, score-0.379]
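Since the EG rule of Eq. (23) is stated explicitly, one online round can be sketched directly; the only assumption below is the derivative L′ of the symmetric log-loss, taken to be σ(δ − ε) − σ(−δ − ε).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def eg_round(lam, x_t, y_t, R, eps=0.0):
        # Eq. (23): lambda_{t+1,j} proportional to lambda_{t,j} * exp(-beta_t * x_{t,j}),
        # with beta_t = L'(lambda_t . x_t - y_t) / R and R >= (max_j x_{t,j} - min_j x_{t,j})^2.
        delta = np.dot(lam, x_t) - y_t
        loss_deriv = sigmoid(delta - eps) - sigmoid(-delta - eps)  # assumed L'
        beta = loss_deriv / R
        w = lam * np.exp(-beta * x_t)
        return w / np.sum(w)  # renormalize onto the probability simplex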
wordName wordTfidf (topN-words)
[('llog', 0.571), ('regressor', 0.283), ('loss', 0.259), ('gd', 0.235), ('additive', 0.173), ('dekel', 0.17), ('eg', 0.168), ('regressors', 0.164), ('lad', 0.14), ('boosting', 0.139), ('update', 0.137), ('erent', 0.132), ('symmetrization', 0.128), ('regularization', 0.112), ('xt', 0.101), ('regression', 0.099), ('di', 0.094), ('losses', 0.093), ('hinge', 0.088), ('base', 0.084), ('yt', 0.081), ('dre', 0.08), ('lexp', 0.08), ('svr', 0.076), ('smooth', 0.076), ('insensitivity', 0.075), ('batch', 0.074), ('singer', 0.072), ('cumulative', 0.07), ('attained', 0.069), ('discrepancy', 0.063), ('template', 0.063), ('stumps', 0.061), ('updates', 0.061), ('minj', 0.06), ('sreg', 0.06), ('hj', 0.059), ('attains', 0.057), ('aj', 0.056), ('erence', 0.054), ('log', 0.054), ('online', 0.053), ('su', 0.049), ('sequential', 0.048), ('collins', 0.048), ('yi', 0.047), ('templates', 0.044), ('derivative', 0.043), ('rn', 0.043), ('mse', 0.041), ('hreg', 0.04), ('yf', 0.04), ('logistic', 0.04), ('decrease', 0.037), ('th', 0.035), ('target', 0.035), ('rounds', 0.034), ('ht', 0.034), ('lc', 0.034), ('regularized', 0.033), ('instances', 0.033), ('gradient', 0.033), ('qi', 0.032), ('xi', 0.031), ('discrepancies', 0.031), ('targets', 0.031), ('maxi', 0.031), ('round', 0.031), ('logadditive', 0.03), ('maxj', 0.03), ('underscores', 0.03), ('concretely', 0.03), ('parallel', 0.027), ('minimizing', 0.027), ('theorem', 0.027), ('ect', 0.026), ('minimize', 0.026), ('coordinates', 0.025), ('drifting', 0.025), ('stump', 0.025), ('schapire', 0.025), ('learner', 0.024), ('formally', 0.023), ('uni', 0.023), ('weights', 0.023), ('minimizer', 0.023), ('iterations', 0.023), ('housing', 0.022), ('devise', 0.022), ('zt', 0.022), ('namely', 0.021), ('depicted', 0.021), ('receive', 0.021), ('global', 0.021), ('iterate', 0.021), ('instance', 0.021), ('lemma', 0.021), ('friedman', 0.02), ('summing', 0.02), ('demonstrate', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
2 0.10042158 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
3 0.076274075 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
4 0.075920485 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
5 0.06775818 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate √ must be adaptively tuned. The natural choice of complexity/current loss renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
6 0.05899772 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
7 0.057381496 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
8 0.052110225 29 jmlr-2005-Efficient Margin Maximizing with Boosting
9 0.05084575 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.049416211 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
11 0.043765981 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
12 0.04367606 47 jmlr-2005-Learning from Examples as an Inverse Problem
13 0.041753244 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
14 0.040095273 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
15 0.038983405 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
16 0.034410641 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
17 0.033042766 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.031975586 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
19 0.031944238 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
20 0.031326812 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
topicId topicWeight
[(0, 0.21), (1, 0.055), (2, 0.179), (3, 0.014), (4, -0.239), (5, -0.075), (6, 0.058), (7, -0.068), (8, 0.032), (9, 0.285), (10, -0.016), (11, 0.041), (12, 0.083), (13, -0.04), (14, 0.115), (15, 0.116), (16, 0.121), (17, -0.095), (18, -0.005), (19, 0.149), (20, 0.105), (21, 0.193), (22, -0.092), (23, -0.063), (24, -0.087), (25, -0.026), (26, 0.0), (27, -0.018), (28, -0.002), (29, -0.117), (30, -0.039), (31, 0.065), (32, -0.056), (33, 0.033), (34, 0.021), (35, 0.149), (36, -0.113), (37, 0.052), (38, 0.259), (39, 0.049), (40, -0.03), (41, 0.027), (42, 0.194), (43, -0.042), (44, 0.072), (45, -0.127), (46, 0.088), (47, -0.212), (48, 0.04), (49, 0.307)]
simIndex simValue paperId paperTitle
same-paper 1 0.95956761 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
2 0.38656715 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
3 0.37743282 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
4 0.36121246 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
5 0.30911276 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
Author: Aapo Hyvärinen
Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
6 0.28337923 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
7 0.26068756 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
8 0.22155336 47 jmlr-2005-Learning from Examples as an Inverse Problem
9 0.21122709 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.20273794 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
11 0.19611099 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
12 0.19127014 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
13 0.17777225 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
14 0.17356507 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
15 0.1698744 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
16 0.16670007 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
17 0.14964761 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
18 0.14955057 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
19 0.14550042 20 jmlr-2005-Clustering with Bregman Divergences
20 0.1450277 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
topicId topicWeight
[(13, 0.01), (17, 0.019), (19, 0.047), (36, 0.023), (37, 0.022), (42, 0.021), (43, 0.029), (47, 0.018), (52, 0.072), (59, 0.021), (70, 0.024), (88, 0.077), (90, 0.504), (94, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.8762176 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
2 0.81541729 41 jmlr-2005-Kernel Methods for Measuring Independence
Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf
Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis c 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf . G RETTON , H ERBRICH , S MOLA , B OUSQUET AND S CHÖLKOPF
3 0.37652856 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
4 0.37555614 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate √ must be adaptively tuned. The natural choice of complexity/current loss renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
5 0.36986822 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
Author: Aapo Hyvärinen
Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
6 0.34627599 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
7 0.33383214 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
8 0.32865903 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
9 0.32716534 48 jmlr-2005-Learning the Kernel Function via Regularization
10 0.32273722 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.32260323 20 jmlr-2005-Clustering with Bregman Divergences
12 0.32128304 36 jmlr-2005-Gaussian Processes for Ordinal Regression
13 0.32071206 54 jmlr-2005-Managing Diversity in Regression Ensembles
14 0.31933704 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
15 0.31922653 11 jmlr-2005-Algorithmic Stability and Meta-Learning
16 0.31917885 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
17 0.31792566 39 jmlr-2005-Information Bottleneck for Gaussian Variables
18 0.31565446 67 jmlr-2005-Stability of Randomized Learning Algorithms
19 0.30186084 47 jmlr-2005-Learning from Examples as an Inverse Problem
20 0.29981443 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods