jmlr jmlr2005 jmlr2005-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
Reference: text
sentIndex sentText sentNum sentScore
1 Bennett and Nicolò Cesa-Bianchi Abstract We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. [sent-11, score-0.358]
2 These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. [sent-12, score-0.918]
3 The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. [sent-13, score-0.722]
4 The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. [sent-15, score-0.375]
5 Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. [sent-19, score-0.548]
6 Such a function is often referred to as a regression function or a regressor for short. [sent-28, score-0.382]
7 This setting is also suitable for learning a linear combination of base regressors of the form f(x) = ∑_{j=1}^{l} λ_j h_j(x), where each base regressor h_j is a mapping from an instance domain X into R. [sent-31, score-0.754]
8 Figure 1: Constructing regression losses (left; legend: abs-loss, log-loss, exp-loss) by symmetrization of margin losses (right; legend: hinge-loss, log-loss, exp-loss). [sent-36, score-0.413]
9 As we discuss shortly, the loss functions we consider in this paper depend only on the discrepancy between the predicted target and the true target δ = f (x) − y, hence L can be viewed as a function from R into R+ . [sent-38, score-0.392]
10 Given a loss function L, the goal of a regression algorithm is to find a regressor f which attains a small total loss on the training set S, Loss(λ, S) = ∑_{i=1}^m L(f(x_i) − y_i) = ∑_{i=1}^m L(λ · x_i − y_i). [sent-40, score-1.082]
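For concreteness, the losses and the total loss above can be evaluated directly. The snippet below is a minimal sketch rather than the authors' code: it assumes the symmetrized forms Llog(δ; ε) = log(1 + e^(δ−ε)) + log(1 + e^(−δ−ε)) and Lexp(δ; ε) = e^(δ−ε) + e^(−δ−ε), obtained by applying the margin-based losses to both δ − ε and −δ − ε, and it sums them over the sample exactly as in Loss(λ, S).

    import numpy as np

    def sym_log_loss(delta, eps=0.0):
        # Assumed symmetrized logistic loss: log(1 + e^(delta-eps)) + log(1 + e^(-delta-eps)).
        # np.logaddexp(0, z) computes log(1 + e^z) in a numerically stable way.
        return np.logaddexp(0.0, delta - eps) + np.logaddexp(0.0, -delta - eps)

    def sym_exp_loss(delta, eps=0.0):
        # Assumed symmetrized exponential loss: e^(delta-eps) + e^(-delta-eps).
        return np.exp(delta - eps) + np.exp(-delta - eps)

    def empirical_loss(lam, X, y, eps=0.0, loss=sym_log_loss):
        # Loss(lambda, S) = sum_i L(lambda . x_i - y_i) for a linear regressor lambda.
        delta = X @ lam - y
        return np.sum(loss(delta, eps))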
11 It has been argued that the squared loss is sensitive to outliers, hence robust regression algorithms often employ the absolute loss (Huber, 1981). [sent-42, score-0.617]
12 The ε-insensitive hinge loss is not smooth as its derivative is discontinuous at δ = ±ε. [sent-45, score-0.466]
13 Several batch learning algorithms have been proposed for minimizing the ε-insensitive hinge loss (see for example Vapnik, 1998; Smola and Schölkopf, 1998). [sent-46, score-0.448]
14 The first loss function presented in this paper is a smooth approximation to the ε-insensitive hinge loss. [sent-48, score-0.423]
15 Learning algorithms for margin-based classifiers typically employ a margin-based loss function Lc (yf (x)) and attempt to minimize the total loss over all instances in a given sample. [sent-65, score-0.577]
16 We discuss a general technique, called loss symmetrization, for reducing a regression problem to a margin-based classification problem. [sent-67, score-0.358]
17 The technique of loss symmetrization was previously discussed in Bi and Bennett (2003) in the context of support vector regression. [sent-71, score-0.387]
18 In Section 3 we describe another family of additive update regression algorithms based on modified gradient descent. [sent-102, score-0.442]
19 Specifically, we show how both the additive update algorithm of Section 3 and the regularization scheme of Section 4 extend to the setting of classification. [sent-107, score-0.422]
20 Our implicit goal is to obtain the (global) minimizer of the empirical loss function ∑_{i=1}^m L(λ · x_i − y_i), where L is either the log-loss, the exp-loss or the comb-loss. [sent-117, score-0.36]
21 The j’th element of λ, namely λj , should be regarded as the weight associated with the base regressor hj . [sent-131, score-0.447]
22 Thus, this form of sequential update can be used for feature selection as well as loss minimization. [sent-134, score-0.444]
23 This algorithm includes both the sequential update and the parallel update paradigms as special cases. Input: Training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R}_{i=1}^m; Insensitivity ε ∈ R+; Update templates A ⊆ R^n s. [sent-142, score-0.518]
24 In the sequential version of the algorithm, selecting an update template is equivalent to selecting a single base regressor and updating its weight. [sent-172, score-0.615]
25 Proof Define ∆t (i) to be the difference between the loss attained by λt and that attained by λt+1 on an instance-target pair (xi , yi ) in the training set, namely ∆t (i) = Llog (δt,i ) − Llog (δt+1,i ). [sent-182, score-0.465]
26 The weighted loss is defined as Loss(λ, ν, S) = ∑_{i=1}^m ν_i L(λ · x_i − y_i), where L(·) is any of the loss functions discussed above. [sent-214, score-0.596]
27 This family includes a parallel update which modifies all the elements of λ simultaneously and a sequential update which updates a single element of λ on each iteration. [sent-223, score-0.41]
28 The sequential update applied to an element. Input: Training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R}_{i=1}^m; Insensitivity ε ∈ R+; Update templates A ⊆ R^n s. [sent-225, score-0.354]
29 The additive update family of algorithms can accommodate a weighted loss just as the log-additive update algorithms do. [sent-276, score-0.736]
30 As noted in Section 2 and Section 3, both the log-additive and additive update batch algorithms easily accommodate a weighted loss. [sent-293, score-0.384]
31 In practice, we do not need to explicitly add pseudo-examples to our sample in order to incorporate a regularization term into the loss function. [sent-297, score-0.371]
32 Furthermore, since the regularization term tends to infinity at least as fast as ‖λ‖_1, the regularized loss has an attainable global minimum. [sent-309, score-0.425]
33 In other words, this form of regularization enforces uniqueness of the solution in our loss minimization problem. [sent-310, score-0.371]
34 be the sequence of vectors generated by the log-additive (Figure 3) or the additive (Figure 4) updates, using either of the regularized loss functions discussed in this paper. [sent-325, score-0.465]
35 Proof Due to the introduction of the regularization term, the loss function is strictly convex and attains its unique minimum at the point denoted λ , as argued in the previous section. [sent-327, score-0.428]
36 Since the loss attained by the algorithm on every iteration is non-increasing, the contribution of the regularization term to the total loss certainly does not exceed L0 /ν. [sent-337, score-0.699]
37 Next, note that the lower bound on the decrease in loss given in Theorem 1 and Theorem 2 can be thought of as a function of the current regressor λt and the chosen template at . [sent-343, score-0.642]
38 Since the regularized loss function is strictly convex, a zero gradient vector is attained only at the optimal point λ . [sent-348, score-0.394]
39 The additive update results in a new boosting procedure for classification accompanied with a matching criterion for selecting a base hypothesis. [sent-377, score-0.533]
40 Given a training set S = {x_i, y_i}_{i=1}^m with y_i ∈ {−1, +1}, define the pseudo-sample S̄ = {x_i, −y_i} and use log-loss boosting to train a classifier whose task is to minimize the loss (1 − ν)Loss(λ, S) + ν Loss(λ, S̄), where, as before, ν is a regularization parameter. [sent-398, score-0.371]
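As a sketch of the construction just described (not the paper's boosting code), the combined objective over the sample S and the label-flipped pseudo-sample S̄ can be written with the standard margin-based log-loss log(1 + e^(−y f(x))); the use of that particular margin loss for "log-loss boosting" is the only assumption here.

    import numpy as np

    def margin_log_loss(margins):
        # Classification log-loss log(1 + e^(-y f(x))) evaluated at margins y_i * (lambda . x_i).
        return np.logaddexp(0.0, -margins)

    def regularized_boosting_objective(lam, X, y, nu):
        # (1 - nu) * Loss(lambda, S) + nu * Loss(lambda, S_bar), where S_bar flips every label.
        margins = y * (X @ lam)
        loss_S = np.sum(margin_log_loss(margins))       # original sample S
        loss_S_bar = np.sum(margin_log_loss(-margins))  # pseudo-sample {(x_i, -y_i)}
        return (1.0 - nu) * loss_S + nu * loss_S_bar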
41 We then receive the true target yt and suffer an instantaneous loss equal to Llog (λt · xt − yt ). [sent-415, score-0.578]
42 The learning algorithm employs an update rule which modifies its current regressor after each round. [sent-427, score-0.42]
43 The first is additive in the gradient of the loss and is thus called Gradient Descent (GD) while the second is exponential in the gradient of the loss and is analogously called Exponentiated Gradient (EG). [sent-429, score-0.757]
44 Note that the GD algorithm updates its current regressor, λt , by subtracting the gradient of the loss function from it. [sent-431, score-0.353]
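A single GD round can be sketched as follows. Two assumptions are made explicit here: the derivative of the symmetric log-loss is taken to be L'(δ; ε) = σ(δ − ε) − σ(−δ − ε) with σ the logistic sigmoid, and a step size eta is introduced because the text only says that the gradient is subtracted; eta = 1 is a placeholder, not the paper's tuned constant.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sym_log_loss_deriv(delta, eps=0.0):
        # Assumed derivative of log(1 + e^(delta-eps)) + log(1 + e^(-delta-eps)) w.r.t. delta.
        return sigmoid(delta - eps) - sigmoid(-delta - eps)

    def gd_round(lam, x_t, y_t, eps=0.0, eta=1.0):
        # Predict, observe y_t, then subtract the gradient of the instantaneous loss
        # Llog(lambda_t . x_t - y_t) from the current regressor lambda_t.
        delta = np.dot(lam, x_t) - y_t
        return lam - eta * sym_log_loss_deriv(delta, eps) * x_t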
45 In the following analysis we give a bound on the cumulative loss attained on any number of rounds. [sent-433, score-0.398]
46 However, rather than bounding the loss per se we bound the cumulative loss relative to the cumulative loss suffered by a fixed regressor µ. [sent-434, score-1.2]
47 The bound holds for any linear regressor µ and any number of rounds, hence we get that the GD algorithm is competitive with the optimal (fixed) linear regressor for any number of rounds. [sent-435, score-0.566]
48 Formally, the following theorem states that the cumulative loss attained by the GD algorithm is at most twice the cumulative loss of any fixed linear regressor plus an additive constant. [sent-436, score-1.21]
49 Then for any fixed linear regressor µ ∈ R^n we have ∑_{t=1}^T Llog(λ_t · x_t − y_t) ≤ 2 ∑_{t=1}^T Llog(µ · x_t − y_t) + R‖µ‖_2^2. [sent-444, score-0.647]
50 Lemma 5 Consider the setting of Theorem 4; then for each round t we have Llog(λ_t · x_t − y_t) − 2 Llog(µ · x_t − y_t) ≤ R(‖λ_t − µ‖_2^2 − ‖λ_{t+1} − µ‖_2^2). [sent-448, score-0.395]
51 Intuitively, the lemma states that if the loss attained by λt on round t is greater than the loss of a fixed regressor µ, then the algorithm will update λt such that it gets closer to µ. [sent-450, score-1.059]
52 In contrast, if the loss of µ is greater than the loss of GD, the algorithm may move its regressor away from µ. [sent-451, score-0.801]
53 Since EG maintains a regressor from the probability simplex, we measure the cumulative loss of the EG algorithm relative to the cumulative loss achieved by any fixed regressor from the probability simplex. [sent-461, score-1.224]
54 Then, for any fixed regressor µ ∈ P_n we have ∑_{t=1}^T Llog(λ_t · x_t − y_t) ≤ (4/3) ∑_{t=1}^T Llog(µ · x_t − y_t) + (4R/3) D_RE(µ, λ_1), where D_RE(p, q) = ∑_j p_j log(p_j/q_j) is the relative entropy function. [sent-469, score-0.647]
55 Llog(λ_t · x_t − y_t) − (4/3) Llog(µ · x_t − y_t) ≤ (4R/3) (D_RE(µ, λ_t) − D_RE(µ, λ_{t+1})). (16) The proof of the lemma is given in Appendix A. [sent-472, score-0.385]
56 3 we use the sequential form of our update as a boosting procedure which uses regression stumps as base regressors. [sent-482, score-0.568]
57 More specifically, we learn a kernel-based regressor and compare our log-loss regularization to Support Vector Regression with l1 regularization. [sent-488, score-0.395]
58 In these experiments, we compare the cumulative loss of the online GD algorithm with its theoretical bound given in Theorem 4. [sent-491, score-0.382]
59 The end result is that the solution obtained by minimizing the log-loss shares the same asymptotic behavior as the ℓ_1 regression loss (∑_i |δ_i|). [sent-495, score-0.385]
60 On the other hand, the solution found by minimizing the exp-loss approximately minimizes the ℓ_∞ regression loss on the sample. [sent-496, score-0.385]
61 The regressor attained by minimizing the exp-loss, however, approximately minimizes the maximal discrepancy over the entire data set and therefore lies significantly below. [sent-509, score-0.442]
62 Here, the regressor obtained by minimizing the exp-loss lies between the two groups of points and as such approximately minimizes the ℓ_∞ regression loss on the sample. [sent-513, score-0.668]
63 The regressor found by minimizing the log-loss practically ignores the samples that were shifted by 1 and as such approximately minimizes the ℓ_1 regression loss on the sample. [sent-514, score-0.668]
64 2 A Comparison of the Log-Additive and Additive Updates In this section we compare the performance of the log-additive update from Figure 3 to that of the additive update from Figure 4. [sent-516, score-0.447]
65 For the log-additive update the constraint on a becomes a · max_i |x_i| ≤ 1, while for the additive update the constraint is a ∑_i x_i^2 ≤ 2. [sent-521, score-0.478]
66 Therefore, for the first data set, the constraint on a reduces to a ≤ 2 when using the log-additive update and to a ≤ 4 when the additive update is used. [sent-544, score-0.447]
67 Therefore, in the settings discussed in this section, the additive update should converge faster on the first data set while the log-additive update should converge faster on the second data set. [sent-552, score-0.447]
68 It is clear from the graphs that our expectations are met and that the additive update converges faster than the log-additive update on the first data set and slower on the second data set. [sent-554, score-0.447]
69 Recall that in the case of the log-additive update, the weights of the examples are q_i^+ = e^{δ_i} and q_i^− = e^{−δ_i}, while for the additive update we further divide these weights by Z. [sent-556, score-0.42]
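Computed literally, these example weights are a one-liner; the snippet below is a sketch in which the normalizer Z is assumed to be the total weight mass, which may differ from the exact definition used in Figure 4.

    import numpy as np

    def example_weights(delta, normalize=False):
        # Log-additive update: q_i^+ = e^(delta_i), q_i^- = e^(-delta_i).
        q_plus, q_minus = np.exp(delta), np.exp(-delta)
        if normalize:  # additive update: divide by Z (assumed here to be the total mass)
            Z = np.sum(q_plus) + np.sum(q_minus)
            q_plus, q_minus = q_plus / Z, q_minus / Z
        return q_plus, q_minus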
70 3 Boosting Regression Stumps The next experiment demonstrates the effectiveness of the log-additive and additive updates in their sequential form, when they are applied as boosting procedures. [sent-563, score-0.421]
71 The goal of the boosting algorithm is to construct a highly accurate regressor by combining base regressors obtained from consecutive calls to the base learner. [sent-565, score-0.754]
72 On every boosting iteration the base learner receives the training set along with the weights q_{t,i}^+ and q_{t,i}^− generated by the boosting algorithm. [sent-566, score-0.409]
73 The goal of the base learner is to construct a regressor which maximizes the decrease in loss. [sent-567, score-0.428]
74 We denote by ht : Rn → R, the regressor returned by the base learner on round t. [sent-568, score-0.456]
75 We use either the bound in Theorem 1 or the bound in Theorem 2 as the criterion for selecting a base regressor using the log-additive and additive updates respectively. [sent-569, score-0.601]
76 That is, the base learner attempts to maximize the lower bound on the decrease in loss given in Theorem 1 or Theorem 2. [sent-570, score-0.404]
77 The pseudocode of the regression learning algorithm using stumps for both the additive and the log-additive updates is given in Figure 8. [sent-605, score-0.394]
78 The LAD algorithm appears to be able to decrease the hinge loss and the MSE much faster than our algorithm on both the training data (not shown) and the test data. [sent-616, score-0.384]
79 However, despite its initial performance, LAD seems to “get stuck” rather quickly and the final regressor it obtains has substantially higher loss on both data sets compared to our algorithm, whether it is trained with the log-additive update or the additive one. [sent-620, score-0.852]
80 The improved generalization performance may be attributed to the following behavior that is common to boosting algorithms: as more base regressors are added, the regression error obtained on most of the examples is rather small. [sent-621, score-0.486]
81 [Figure residue: ℓ_1-loss curves for the additive, log-additive, and LAD algorithms.] [sent-624, score-0.691]
82 Thus, even simple regressors such as decision stumps can further reduce the loss on the remaining examples for which the loss is still high. [sent-633, score-0.743]
83 Comparing the performance of the additive and the log-additive updates in this experiment, it is apparent that the former seems more effective in reducing the loss but is also more susceptible to over-fitting. [sent-635, score-0.493]
84 4 Examining the Effect of Regularization So far we have focused on experiments which illustrate the different facets of empirical loss minimization using different update schemes. [sent-639, score-0.396]
85 regularization can ensure that the generalization loss (i. [sent-643, score-0.371]
86 the regression loss suffered on test examples) would not greatly exceed the loss obtained on the training set. [sent-645, score-0.617]
87 We used the log-additive and additive update algorithms to minimize the following regularized loss, Loss(λ, ν, S) = ∑_{i=1}^m Llog(f_λ(x_i) − y_i; ε) + ν ∑_j Llog(λ_j). (20) [sent-651, score-0.416]
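Read literally, the regularized loss above can be evaluated as in the sketch below; it assumes the symmetrized form of Llog and reads the regularization term as Llog applied coordinate-wise to λ with ε = 0, since no insensitivity argument appears in that term.

    import numpy as np

    def sym_log_loss(delta, eps=0.0):
        # Assumed form: Llog(delta; eps) = log(1 + e^(delta-eps)) + log(1 + e^(-delta-eps)).
        return np.logaddexp(0.0, delta - eps) + np.logaddexp(0.0, -delta - eps)

    def regularized_loss(lam, X, y, eps, nu):
        # sum_i Llog(f_lambda(x_i) - y_i; eps) + nu * sum_j Llog(lambda_j), as in Eq. (20).
        data_term = np.sum(sym_log_loss(X @ lam - y, eps))
        reg_term = np.sum(sym_log_loss(lam, 0.0))
        return data_term + nu * reg_term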
88 As illustrated in Figure 1, the log-loss can be interpreted as a smooth approximation to the ε-insensitive hinge loss used by Support Vector Regression (SVR). [sent-652, score-0.423]
89 As anticipated, the training loss for both algorithms is monotonically increasing in the regularization parameter ν while the difference between the test and training loss is monotonically decreasing in ν. [sent-668, score-0.63]
90 As can be seen from the figure, the lowest test loss attained by SVR is very close to the value attained by the log-loss (with a slight advantage to the latter). [sent-673, score-0.397]
91 Indeed, for ν in [10^−5, 1], the discrepancy between the losses of the regressors found by the log-loss is less than 1, while in the same range the losses of SVR can be as much as 3 units apart. [sent-675, score-0.413]
92 Theorem 4 states that the GD online algorithm attains a cumulative log-loss which is at most twice the loss of any fixed regressor µ, up to a constant additive factor. [sent-679, score-0.895]
93 For any finite number of online rounds T , the theorem in particular holds for µ = λT , the regressor which attains the minimal log-loss on the first T examples in the sequence. [sent-680, score-0.454]
94 The cumulative loss of the GD algorithm is depicted on the left hand side of Figure 12, along with the cumulative loss of λT and the worst-case guarantee attained from Theorem 4 with µ = λT . [sent-686, score-0.748]
95 Discussion We described a framework for solving regression problems by a symmetrization of margin-based loss functions commonly used in boosting techniques. [sent-698, score-0.625]
96 Our approach naturally lent itself to a shifted and symmetric loss function which is approximately zero in a pre-specified interval and can thus be used as a smooth alternative to the ε-insensitive hinge loss. [sent-699, score-0.423]
97 Another interesting direction is the marriage of the loss symmetrization technique with other boosting related techniques such as drifting games (Schapire, 1999; Freund and Opper, 2002). [sent-707, score-0.551]
98 Technical Proofs Proof of Lemma 5: Recall that we denote the discrepancy between the target predicted by the online algorithm and the true target by δ_t = λ_t · x_t − y_t. [sent-712, score-0.368]
99 Since (λt − µ) · xt = (λt · xt − yt ) + (yt − µ · xt ) = δt − δt , we can rewrite Eq. [sent-715, score-0.384]
100 Proof of Lemma 7: Recall that the EG update rule is λ_{t+1,j} = λ_{t,j} e^{−β_t x_{t,j}} / ∑_k λ_{t,k} e^{−β_t x_{t,k}} (23), where β_t = L′(λ_t · x_t − y_t)/R and R ≥ (max_j x_{t,j} − min_j x_{t,j})^2. [sent-731, score-0.379]
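Since the EG rule of Eq. (23) is stated explicitly, one online round can be sketched directly; the only assumption below is the derivative L′ of the symmetric log-loss, taken to be σ(δ − ε) − σ(−δ − ε).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def eg_round(lam, x_t, y_t, R, eps=0.0):
        # Eq. (23): lambda_{t+1,j} proportional to lambda_{t,j} * exp(-beta_t * x_{t,j}),
        # with beta_t = L'(lambda_t . x_t - y_t) / R and R >= (max_j x_{t,j} - min_j x_{t,j})^2.
        delta = np.dot(lam, x_t) - y_t
        loss_deriv = sigmoid(delta - eps) - sigmoid(-delta - eps)  # assumed L'
        beta = loss_deriv / R
        w = lam * np.exp(-beta * x_t)
        return w / np.sum(w)  # renormalize onto the probability simplex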
wordName wordTfidf (topN-words)
[('llog', 0.571), ('regressor', 0.283), ('loss', 0.259), ('gd', 0.235), ('additive', 0.173), ('dekel', 0.17), ('eg', 0.168), ('regressors', 0.164), ('lad', 0.14), ('boosting', 0.139), ('update', 0.137), ('erent', 0.132), ('symmetrization', 0.128), ('regularization', 0.112), ('xt', 0.101), ('regression', 0.099), ('di', 0.094), ('losses', 0.093), ('hinge', 0.088), ('base', 0.084), ('yt', 0.081), ('dre', 0.08), ('lexp', 0.08), ('svr', 0.076), ('smooth', 0.076), ('insensitivity', 0.075), ('batch', 0.074), ('singer', 0.072), ('cumulative', 0.07), ('attained', 0.069), ('discrepancy', 0.063), ('template', 0.063), ('stumps', 0.061), ('updates', 0.061), ('minj', 0.06), ('sreg', 0.06), ('hj', 0.059), ('attains', 0.057), ('aj', 0.056), ('erence', 0.054), ('log', 0.054), ('online', 0.053), ('su', 0.049), ('sequential', 0.048), ('collins', 0.048), ('yi', 0.047), ('templates', 0.044), ('derivative', 0.043), ('rn', 0.043), ('mse', 0.041), ('hreg', 0.04), ('yf', 0.04), ('logistic', 0.04), ('decrease', 0.037), ('th', 0.035), ('target', 0.035), ('rounds', 0.034), ('ht', 0.034), ('lc', 0.034), ('regularized', 0.033), ('instances', 0.033), ('gradient', 0.033), ('qi', 0.032), ('xi', 0.031), ('discrepancies', 0.031), ('targets', 0.031), ('maxi', 0.031), ('round', 0.031), ('logadditive', 0.03), ('maxj', 0.03), ('underscores', 0.03), ('concretely', 0.03), ('parallel', 0.027), ('minimizing', 0.027), ('theorem', 0.027), ('ect', 0.026), ('minimize', 0.026), ('coordinates', 0.025), ('drifting', 0.025), ('stump', 0.025), ('schapire', 0.025), ('learner', 0.024), ('formally', 0.023), ('uni', 0.023), ('weights', 0.023), ('minimizer', 0.023), ('iterations', 0.023), ('housing', 0.022), ('devise', 0.022), ('zt', 0.022), ('namely', 0.021), ('depicted', 0.021), ('receive', 0.021), ('global', 0.021), ('iterate', 0.021), ('instance', 0.021), ('lemma', 0.021), ('friedman', 0.02), ('summing', 0.02), ('demonstrate', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
2 0.10042158 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
3 0.076274075 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
4 0.075920485 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
5 0.06775818 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate √ must be adaptively tuned. The natural choice of complexity/current loss renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
6 0.05899772 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
7 0.057381496 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
8 0.052110225 29 jmlr-2005-Efficient Margin Maximizing with Boosting
9 0.05084575 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.049416211 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
11 0.043765981 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
12 0.04367606 47 jmlr-2005-Learning from Examples as an Inverse Problem
13 0.041753244 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
14 0.040095273 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
15 0.038983405 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
16 0.034410641 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
17 0.033042766 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
18 0.031975586 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
19 0.031944238 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
20 0.031326812 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
topicId topicWeight
[(0, 0.21), (1, 0.055), (2, 0.179), (3, 0.014), (4, -0.239), (5, -0.075), (6, 0.058), (7, -0.068), (8, 0.032), (9, 0.285), (10, -0.016), (11, 0.041), (12, 0.083), (13, -0.04), (14, 0.115), (15, 0.116), (16, 0.121), (17, -0.095), (18, -0.005), (19, 0.149), (20, 0.105), (21, 0.193), (22, -0.092), (23, -0.063), (24, -0.087), (25, -0.026), (26, 0.0), (27, -0.018), (28, -0.002), (29, -0.117), (30, -0.039), (31, 0.065), (32, -0.056), (33, 0.033), (34, 0.021), (35, 0.149), (36, -0.113), (37, 0.052), (38, 0.259), (39, 0.049), (40, -0.03), (41, 0.027), (42, 0.194), (43, -0.042), (44, 0.072), (45, -0.127), (46, 0.088), (47, -0.212), (48, 0.04), (49, 0.307)]
simIndex simValue paperId paperTitle
same-paper 1 0.95956761 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
2 0.38656715 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth
Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: on-line learning with a simple square loss, and finding a symmetric positive definite matrix subject to linear constraints. The updates generalize the exponentiated gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the derivation and the analyses of the original EG update and AdaBoost generalize to the non-diagonal case. We apply the resulting matrix exponentiated gradient (MEG) update and DefiniteBoost to the problem of learning a kernel matrix from distance measurements.
3 0.37743282 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
Author: Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun
Abstract: Learning general functional dependencies between arbitrary input and output spaces is one of the key challenges in computational intelligence. While recent progress in machine learning has mainly focused on designing flexible and powerful input representations, this paper addresses the complementary issue of designing classification algorithms that can deal with more complex outputs, such as trees, sequences, or sets. More generally, we consider problems involving multiple dependent output variables, structured output spaces, and classification problems with class attributes. In order to accomplish this, we propose to appropriately generalize the well-known notion of a separation margin and derive a corresponding maximum-margin formulation. While this leads to a quadratic program with a potentially prohibitive, i.e. exponential, number of constraints, we present a cutting plane algorithm that solves the optimization problem in polynomial time for a large class of problems. The proposed method has important applications in areas such as computational biology, natural language processing, information retrieval/extraction, and optical character recognition. Experiments from various domains involving different types of output spaces emphasize the breadth and generality of our approach.
4 0.36121246 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
5 0.30911276 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
Author: Aapo Hyvärinen
Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
6 0.28337923 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
7 0.26068756 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
8 0.22155336 47 jmlr-2005-Learning from Examples as an Inverse Problem
9 0.21122709 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.20273794 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
11 0.19611099 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
12 0.19127014 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
13 0.17777225 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
14 0.17356507 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
15 0.1698744 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
16 0.16670007 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
17 0.14964761 68 jmlr-2005-Tree-Based Batch Mode Reinforcement Learning
18 0.14955057 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
19 0.14550042 20 jmlr-2005-Clustering with Bregman Divergences
20 0.1450277 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
topicId topicWeight
[(13, 0.01), (17, 0.019), (19, 0.047), (36, 0.023), (37, 0.022), (42, 0.021), (43, 0.029), (47, 0.018), (52, 0.072), (59, 0.021), (70, 0.024), (88, 0.077), (90, 0.504), (94, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.8762176 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
Author: Ofer Dekel, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe new loss functions for regression problems along with an accompanying algorithmic framework which utilizes these functions. These loss functions are derived by symmetrization of margin-based losses commonly used in boosting algorithms, namely, the logistic loss and the exponential loss. The resulting symmetric logistic loss can be viewed as a smooth approximation to the ε-insensitive hinge loss used in support vector regression. We describe and analyze two parametric families of batch learning algorithms for minimizing these symmetric losses. The first family employs an iterative log-additive update which can be viewed as a regression counterpart to recent boosting algorithms. The second family utilizes an iterative additive update step. We also describe and analyze online gradient descent (GD) and exponentiated gradient (EG) algorithms for the symmetric logistic loss. A byproduct of our work is a new simple form of regularization for boosting-based classification and regression algorithms. Our regression framework also has implications on classification algorithms, namely, a new additive update boosting algorithm for classification. We demonstrate the merits of our algorithms in a series of experiments.
2 0.81541729 41 jmlr-2005-Kernel Methods for Measuring Independence
Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf
Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis c 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf . G RETTON , H ERBRICH , S MOLA , B OUSQUET AND S CHÖLKOPF
3 0.37652856 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
4 0.37555614 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
Author: Marcus Hutter, Jan Poland
Abstract: When applying aggregating strategies to Prediction with Expert Advice (PEA), the learning rate √ must be adaptively tuned. The natural choice of complexity/current loss renders the analysis of Weighted Majority (WM) derivatives quite complicated. In particular, for arbitrary weights there have been no results proven so far. The analysis of the alternative Follow the Perturbed Leader (FPL) algorithm from Kalai and Vempala (2003) based on Hannan’s algorithm is easier. We derive loss bounds for adaptive learning rate and both finite expert classes with uniform weights and countable expert classes with arbitrary weights. For the former setup, our loss bounds match the best known results so far, while for the latter our results are new. Keywords: prediction with expert advice, follow the perturbed leader, general weights, adaptive learning rate, adaptive adversary, hierarchy of experts, expected and high probability bounds, general alphabet and loss, online sequential prediction
5 0.36986822 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
Author: Aapo Hyvärinen
Abstract: One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data. Keywords: statistical estimation, non-normalized densities, pseudo-likelihood, Markov chain Monte Carlo, contrastive divergence
6 0.34627599 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
7 0.33383214 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
8 0.32865903 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
9 0.32716534 48 jmlr-2005-Learning the Kernel Function via Regularization
10 0.32273722 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.32260323 20 jmlr-2005-Clustering with Bregman Divergences
12 0.32128304 36 jmlr-2005-Gaussian Processes for Ordinal Regression
13 0.32071206 54 jmlr-2005-Managing Diversity in Regression Ensembles
14 0.31933704 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
15 0.31922653 11 jmlr-2005-Algorithmic Stability and Meta-Learning
16 0.31917885 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
17 0.31792566 39 jmlr-2005-Information Bottleneck for Gaussian Variables
18 0.31565446 67 jmlr-2005-Stability of Randomized Learning Algorithms
19 0.30186084 47 jmlr-2005-Learning from Examples as an Inverse Problem
20 0.29981443 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods