jmlr jmlr2009 jmlr2009-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find that for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, Rutgers University, Piscataway, NJ, USA. Editor: Manfred Warmuth. Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. [sent-7, score-0.712]
2 The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. [sent-11, score-0.259]
3 Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso [sent-16, score-0.59]
4 This paper addresses the problem of inducing sparsity in learned weights while using an online-learning algorithm. [sent-42, score-0.267]
5 Simply adding L1 -regularization to the gradient of an online weight update doesn’t work because gradients don’t induce sparsity. [sent-45, score-0.421]
6 Simply rounding weights to 0 is problematic because a weight may be small due to being useless or small because it has been updated only once (either at the beginning of training or because the set of features appearing is also sparse). [sent-49, score-0.601]
7 Consider a loss function L(w, zi ) which is convex in w, where zi = (xi , yi ) is an input/output pair. [sent-58, score-0.306]
8 The convex constraint formulation has a simple online version using the projection idea of Zinkevich (2003), which requires the projection of weight w into an L1 -ball at every online step. [sent-62, score-0.38]
9 This operation is difficult to implement efficiently for large-scale data with many features even if all examples have sparse features although recent progress was made (Duchi et al. [sent-63, score-0.32]
10 This family of algorithms allows general convex regularization functions and reproduces a special case of the truncated gradient algorithm we will introduce in Section 3. [sent-71, score-0.429]
11 It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. [sent-75, score-0.492]
12 At a high level, our approach works with the soft-regularization formulation (2) and decays the weight to a default value after every online stochastic gradient step. [sent-85, score-0.438]
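For reference, here is a hedged reconstruction of the two batch formulations the text refers to as (1) and (2): an L1-constrained problem with radius s and its soft-regularized counterpart with parameter g > 0. The exact normalization is our assumption, inferred from the surrounding discussion of L1-ball projection and L1 penalties.

```latex
% Constrained formulation (1), as suggested by the projection-onto-an-L1-ball discussion:
\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} L(w, z_i)
\quad \text{subject to} \quad \|w\|_1 \le s .

% Soft-regularization formulation (2), the L1-penalized counterpart:
\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} L(w, z_i) + g \, \|w\|_1 .
```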
13 As mentioned in the introduction, we are mainly interested in sparse online methods for large scale problems with sparse features. [sent-91, score-0.257]
14 For such problems, our algorithm should satisfy the following requirements: • The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features. [sent-92, score-0.248]
15 Our solution, referred to as truncated gradient, is a simple modification of the standard stochastic gradient rule. [sent-94, score-0.433]
16 It is defined in (6) as an improvement over simpler ideas such as rounding and the sub-gradient method with L1-regularization. [sent-95, score-0.387]
17 We compare our approach to a few others, including L1 -regularization on small data, as well as online rounding of coefficients to zero. [sent-99, score-0.544]
18 We make a prediction based on existing weights wi ∈ Rd . [sent-107, score-0.29]
19 We observe yi , let zi = (xi , yi ), and incur some known loss L(wi , zi ) that is convex in parameter wi . [sent-109, score-0.557]
20 For this purpose, we start with the standard stochastic gradient descent (SGD) rule, which is of the form: f (wi ) = wi − η∇1 L(wi , zi ), (3) where ∇1 L(a, b) is a sub-gradient of L(a, b) with respect to the first variable a. [sent-113, score-0.648]
21 In the literature, the stochastic-gradient descent rule is often referred to as gradient descent (GD). [sent-128, score-0.338]
22 There are other variants, such as exponentiated gradient descent (EG). [sent-129, score-0.247]
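As a concrete illustration of rule (3), here is a minimal Python/NumPy sketch of one SGD step. The squared-loss sub-gradient used inside is only an example of a convex loss, and the function name is ours.

```python
import numpy as np

def sgd_step(w, x, y, eta):
    """One application of rule (3): f(w_i) = w_i - eta * grad_1 L(w_i, z_i),
    illustrated here with the squared loss L(w, (x, y)) = (w.x - y)^2."""
    grad = 2.0 * (np.dot(w, x) - y) * x   # sub-gradient of the squared loss w.r.t. w
    return w - eta * grad
```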
23 Sparse Online Learning: In this section, we examine several methods for achieving sparsity in online learning. [sent-132, score-0.333]
24 That is, we first apply the standard stochastic gradient descent rule, and then round small coefficients to zero. [sent-147, score-0.302]
25 In general, we should not take K = 1, especially when η is small, since each step modifies wi by only a small amount. [sent-148, score-0.251]
26 If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. [sent-149, score-0.544]
27 Consequently, rounding can be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. [sent-150, score-0.502]
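A minimal sketch of this rounding scheme (rule (4) with a rounding period K), assuming a caller-supplied sub-gradient callback; the function and argument names are illustrative, not the paper's implementation.

```python
import numpy as np

def rounding_sgd(examples, d, eta, theta, K, grad):
    """Coefficient rounding (4): SGD steps with truncation to zero every K steps.
    `grad(w, x, y)` returns a sub-gradient of the loss at example (x, y)."""
    w = np.zeros(d)
    for i, (x, y) in enumerate(examples, start=1):
        w = w - eta * grad(w, x, y)          # standard SGD step (3)
        if i % K == 0:                       # round only every K steps
            w[np.abs(w) <= theta] = 0.0      # zero out small coefficients
    return w
```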
28 A Sub-gradient Algorithm for L1-Regularization: In our experiments, we combine rounding at the end of training with a simple online sub-gradient method for L1-regularization with a regularization parameter g > 0: f(w_i) = w_i − η∇_1 L(w_i, z_i) − ηg · sgn(w_i), (5) where for a vector v = [v_1, . . . , v_d], sgn(v) = [sgn(v_1), . . . , sgn(v_d)]. [sent-155, score-0.528]
29 In the experiments, the online method (5) plus rounding at the end is used as a simple baseline. [sent-162, score-0.544]
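A sketch of the baseline just described: the sub-gradient update (5) run over the data, followed by a single rounding pass at the end of training. The threshold θ and the gradient callback are assumed arguments; names are ours.

```python
import numpy as np

def subgradient_l1(examples, d, eta, g, theta, grad):
    """Baseline (5): w <- w - eta*grad - eta*g*sgn(w), with rounding at the end."""
    w = np.zeros(d)
    for x, y in examples:
        w = w - eta * grad(w, x, y) - eta * g * np.sign(w)
    w[np.abs(w) <= theta] = 0.0              # rounding at the end of training
    return w
```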
30 Truncated Gradient: In order to obtain an online version of the simple rounding rule in (4), we observe that the direct rounding to zero is too aggressive. [sent-166, score-0.957]
31 The amount of shrinkage is measured by a gravity parameter g_i > 0: f(w_i) = T_1(w_i − η∇_1 L(w_i, z_i), ηg_i, θ), (6) where for a vector v = [v_1, . . . , v_d], T_1(v, α, θ) = [T_1(v_1, α, θ), . . . , T_1(v_d, α, θ)], with T_1(v_j, α, θ) = max{0, v_j − α} if v_j ∈ [0, θ], min{0, v_j + α} if v_j ∈ [−θ, 0], and v_j otherwise. [sent-169, score-0.483]
32 That is, if i/K is not an integer, we let gi = 0; if i/K is an integer, we let gi = Kg for a gravity parameter g > 0. [sent-178, score-0.562]
33 Since ηKg ≪ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). [sent-187, score-0.488]
34 Setting θ = ∞ yields another important special case of (6), which becomes f(w_i) = T(w_i − η∇_1 L(w_i, z_i), g_i η), (7) where for a vector v = [v_1, . . . , v_d], T(v, α) = [T(v_1, α), . . . , T(v_d, α)], with T(v_j, α) = max{0, v_j − α} if v_j > 0 and min{0, v_j + α} otherwise. [sent-190, score-0.342]
35 The parameter gi ≥ 0 controls the sparsity that can be achieved with the algorithm. [sent-198, score-0.398]
36 Note when gi = 0, the update rule is identical to the standard stochastic gradient descent rule. [sent-199, score-0.567]
37 That is, if i/K is not an integer, we let gi = 0; if i/K is an integer, we let gi = Kg for a gravity parameter g > 0. [sent-201, score-0.562]
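Putting the pieces together, here is a sketch of the truncated gradient update (6) with the gravity schedule above (g_i = Kg on every K-th step, 0 otherwise); the T_1 helper follows the component-wise definition given after (6). Function names are ours, and passing theta = np.inf recovers the special case (7).

```python
import numpy as np

def T1(v, alpha, theta):
    """Component-wise truncation: shrink entries in [-theta, theta] toward zero
    by alpha (never crossing zero); leave entries with |v_j| > theta untouched."""
    shrunk_pos = np.maximum(v - alpha, 0.0)          # for 0 <= v_j <= theta
    shrunk_neg = np.minimum(v + alpha, 0.0)          # for -theta <= v_j < 0
    out = np.where(v >= 0, shrunk_pos, shrunk_neg)
    return np.where(np.abs(v) > theta, v, out)

def truncated_gradient(examples, d, eta, g, theta, K, grad):
    """Update rule (6) with gravity g_i = K*g on every K-th step, 0 otherwise."""
    w = np.zeros(d)
    for i, (x, y) in enumerate(examples, start=1):
        gi = K * g if i % K == 0 else 0.0            # gravity schedule
        w = T1(w - eta * grad(w, x, y), eta * gi, theta)
    return w
```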
38 Truncated gradient for L1-regularization is different from (5), which is a naïve application of the stochastic gradient descent rule with an added L1-regularization term. [sent-211, score-0.485]
39 Theorem 3.1 (Sparse Online Regret) Consider the sparse online update rule (6) with w_1 = 0 and η > 0. [sent-234, score-0.275]
40 Suppose there exist A, B ≥ 0 such that ‖∇_1 L(w, z)‖² ≤ A·L(w, z) + B for all w and z. Then for all w̄ ∈ R^d,
(1 − 0.5Aη)/T · ∑_{i=1}^T [ L(w_i, z_i) + g_i/(1 − 0.5Aη) · ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ]
≤ (η/2)·B + ‖w̄‖²/(2ηT) + (1/T) ∑_{i=1}^T [ L(w̄, z_i) + g_{i+1} ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ],
where for vectors v = [v_1, . . . , v_d] and v′ = [v′_1, . . . , v′_d], v · I(|v′| ≤ θ) denotes the vector [v_1·I(|v′_1| ≤ θ), . . . , v_d·I(|v′_d| ≤ θ)]. [sent-238, score-0.342]
41 Without a good characterization of the local minimum, it is not possible for us to replace g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side by g_i ‖w̄ · I(|w̄| ≤ θ)‖_1, because such a formulation implies we could efficiently solve a non-convex problem with a simple online update rule. [sent-255, score-0.643]
42 Still, when θ < ∞, one naturally expects the right-hand-side penalty g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 to be much smaller than the corresponding L1 penalty g_i ‖w̄‖_1, especially when w̄ has many components close to 0. [sent-256, score-0.444]
43 Theorem 3.1 also implies a trade-off between sparsity and regret performance. [sent-260, score-0.292]
44 When g is small, we have less sparsity, but the regret term g‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ≤ g‖w̄‖_1 on the right-hand side is also small. [sent-262, score-0.292]
45 When g is large, we are able to achieve more sparsity, but the regret term g‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side also becomes large. [sent-263, score-0.292]
46 Theorem 3.1 implies (1/T) ∑_{i=1}^T [ L(w_i, z_i) + g‖w_i‖_1 ] ≤ inf_{w̄ ∈ R^d} [ (1/T) ∑_{i=1}^T L(w̄, z_i) + g‖w̄‖_1 ] + o(1). [sent-268, score-0.491]
47 In the stochastic setting where the examples are drawn iid from some underlying distribution, the sparse online gradient method proposed in this paper solves the L1 -regularization problem. [sent-272, score-0.419]
48 For simplicity, we only consider the case with θ = ∞ and constant gravity gi = g. [sent-283, score-0.34]
49 ŵ_{t+1} = T(ŵ_t − η∇_1 L(ŵ_t, z_{i_t}), gη),  w̄_{t+1} = w̄_t + (ŵ_{t+1} − w̄_t)/(t + 1), where each i_t is drawn from {1, . . . , T}. [sent-292, score-0.339]
50 Proof: Note the recursion of w̄_t implies w̄_T = (1/T) ∑_{t=1}^T ŵ_t, from telescoping the update rule. [sent-308, score-0.343]
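A sketch of this randomized procedure for the θ = ∞, constant-gravity case: draw an index uniformly at each step, apply the shrinkage operator T(·, gη), and keep the running average of the iterates. Variable names and the use of NumPy's default random generator are our choices.

```python
import numpy as np

def averaged_truncated_gradient(examples, d, eta, g, num_steps, grad, rng=None):
    """theta = infinity, constant gravity g: update w_hat with T(., g*eta) and
    return the running average w_bar of the iterates.
    `examples` is a sequence of (x, y) pairs; `grad(w, x, y)` is a sub-gradient."""
    if rng is None:
        rng = np.random.default_rng(0)
    w_hat = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, num_steps + 1):
        x, y = examples[rng.integers(len(examples))]   # draw i_t uniformly
        v = w_hat - eta * grad(w_hat, x, y)
        w_hat = np.sign(v) * np.maximum(np.abs(v) - g * eta, 0.0)  # T(v, g*eta)
        w_bar = w_bar + (w_hat - w_bar) / (t + 1)      # running average of iterates
    return w_bar
```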
51 Since L1 -regularization is frequently used to achieve sparsity in the batch learning setting, the connection to L1 -regularization can be regarded as an alternative justification for the sparse-online algorithm developed in this paper. [sent-331, score-0.258]
52 Algorithm 1 Truncated Gradient for Least Squares. Inputs: threshold θ ≥ 0; gravity sequence g_i ≥ 0; learning rate η ∈ (0, 1); example oracle O. Initialize weights w^j ← 0 (j = 1, . . . , d). [sent-332, score-0.426]
For each weight w^j: (a) if w^j > 0 and w^j ≤ θ, then w^j ← max{w^j − g_i η, 0}; (b) else if w^j < 0 and w^j ≥ −θ, then w^j ← min{w^j + g_i η, 0}. [sent-346, score-0.444]
54 In the description, we use superscripted symbol w j to denote the j-th component of vector w (in order to differentiate from wi , which we have used to denote the i-th weight vector). [sent-352, score-0.291]
55 Although we keep the choice of gravity parameters g_i open in the algorithm description, in practice we only consider the following choice: g_i = Kg if i/K is an integer, and g_i = 0 otherwise. [sent-354, score-0.587]
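A sketch of Algorithm 1 for the square loss with the practical gravity choice above, modeling the example oracle as an iterable of (x, y) pairs. This is a dense-vector illustration with names of our choosing, not the VOWPAL WABBIT implementation.

```python
import numpy as np

def truncated_gradient_least_squares(oracle, d, eta, g, theta, K):
    """Algorithm 1: truncated gradient for least squares.
    `oracle` yields (x, y) pairs with x a length-d NumPy array."""
    w = np.zeros(d)
    for i, (x, y) in enumerate(oracle, start=1):
        y_hat = np.dot(w, x)                         # prediction with current weights
        w = w + 2.0 * eta * (y - y_hat) * x          # gradient step on the square loss
        if i % K == 0:                               # apply gravity every K steps
            alpha = eta * K * g
            small = np.abs(w) <= theta               # only coefficients within [-theta, theta]
            w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - alpha, 0.0)
    return w
```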
56 It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. [sent-358, score-0.398]
57 Moreover, it is important to store only features with non-zero coefficients (if the number of features is too large to be stored in memory, this approach allows us to use a hash table to track only the nonzero coefficients). [sent-359, score-0.361]
58 Corollary 4.1 (Sparse Online Square Loss Regret) If there exists C > 0 such that ‖x‖ ≤ C for all x, then for all w̄ ∈ R^d we have
(1 − 2C²η)/T · ∑_{i=1}^T [ (w_i^T x_i − y_i)² + g_i/(1 − 2C²η) · ‖w_i · I(|w_i| ≤ θ)‖_1 ]
≤ ‖w̄‖²/(2ηT) + (1/T) ∑_{i=1}^T [ (w̄^T x_i − y_i)² + g_{i+1} ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ],
where w_i = [w_i^1, . . . , w_i^d] is the weight vector used for prediction on example i. [sent-364, score-0.724]
59 VOWPAL WABBIT optimizes square loss on a linear representation w^T x via gradient descent (3) with a couple of caveats. [sent-372, score-0.33]
60 Instead the update is a constant value, equivalent to the gradient of a linear loss function. [sent-377, score-0.264]
61 During online learning, we simply go through all nonzero features j of example i, and can “simulate” the shrinkage of w^j after τ_j in a batch mode. [sent-383, score-0.463]
62 This lazy-update idea of delaying the shrinkage calculation until needed is the key to efficient implementation of truncated gradient. [sent-385, score-0.244]
63 Specifically, instead of using update rule (6) for weight w^j, we shrink the weights of all nonzero features j differently by the following: f(w^j) = T_1(w^j + 2η(y − ŷ)x^j, ⌊(i − τ_j)/K⌋ Kηg, θ), and τ_j is updated by τ_j ← τ_j + ⌊(i − τ_j)/K⌋ K. [sent-386, score-0.26]
64 In the coefficient rounding algorithm (4), for instance, for each nonzero feature j of example i, we can first perform a regular gradient descent on the square loss, and then do the following: if |w j | is below the threshold θ and i ≥ τ j + K, we round w j to 0 and set τ j to i. [sent-388, score-0.839]
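A sketch of the lazy-update idea for sparse data: weights live in a dictionary keyed by feature index, τ_j records the step up to which gravity has been charged to feature j, and the accumulated shrinkage is applied only when the feature reappears. The dictionary representation and helper names are our assumptions.

```python
import numpy as np

def shrink(v, alpha, theta):
    """T_1 applied to a scalar: shrink toward zero by alpha if |v| <= theta."""
    if abs(v) > theta:
        return v
    return np.sign(v) * max(abs(v) - alpha, 0.0)

def lazy_truncated_gradient_sparse(examples, eta, g, theta, K):
    """Sparse square-loss truncated gradient with lazily applied gravity.
    Each example is (features, y) where `features` maps feature index -> value."""
    w = {}    # nonzero weights only (a production version would drop exact zeros)
    tau = {}  # tau[j]: step index up to which gravity has been charged to w[j]
    for i, (features, y) in enumerate(examples, start=1):
        y_hat = sum(w.get(j, 0.0) * v for j, v in features.items())
        for j, v in features.items():
            if j not in w:
                w[j], tau[j] = 0.0, i                  # first time feature j is seen
            skipped = (i - tau[j]) // K                # missed gravity applications
            updated = w[j] + 2.0 * eta * (y - y_hat) * v   # square-loss gradient step
            w[j] = shrink(updated, skipped * K * eta * g, theta)
            tau[j] += skipped * K
    return w
```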
65 This implementation shows the truncated gradient method satisfies the following requirements needed for solving large scale problems with sparse features. [sent-389, score-0.453]
66 • The algorithm is computationally efficient: the number of operations per online step is linear in the number of nonzero features, and independent of the total number of features. [sent-390, score-0.248]
67 If we directly apply the online projection idea of Zinkevich (2003) to solve (1), then in the update rule (7), one has to pick the smallest g_i ≥ 0 such that ‖w_{i+1}‖_1 ≤ s. [sent-392, score-0.447]
68 The theoretical analysis presented in this paper shows we can obtain a meaningful regret bound by picking an arbitrary gi . [sent-400, score-0.338]
69 Moreover, our method allows non-convex updates closely related to the simple coefficient rounding idea. [sent-402, score-0.387]
70 However, we believe the sparsity achieved with their approach should be comparable to the sparsity achieved with our method. [sent-405, score-0.352]
71 The UCI data sets used do not have many features so we expect that a large fraction of these features are useful for making predictions. [sent-411, score-0.301]
72 For UCI data sets, with randomly added features, VOWPAL WABBIT is able to reduce the number of features by a fraction of more than 90%, except for the ad data set in which only 71% reduction is observed. [sent-432, score-0.273]
73 The Effects of K: As we argued before, using a K value larger than 1 may be desired in truncated gradient and the rounding algorithms. [sent-461, score-0.79]
74 As before, cross validation is used to select parameters in the rounding algorithm, including learning rate η, number of passes of data during training, and learning rate decay over training passes. [sent-464, score-0.505]
75 number-of-feature plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) or θ (for the rounding algorithm). [sent-466, score-0.608]
76 The effect of K is large in the rounding algorithm. [sent-468, score-0.387]
Figure 2: Plots showing the amount of features left after sparsification using truncated gradient for each data set, when the performance is changed by at most 1% due to sparsification. [sent-484, score-0.632]
79 Plot on left: fraction left with respect to the total number of features (original with 1000 artificial features for the dashed bar). [sent-486, score-0.301]
80 Plot on right: fraction left with respect to the original features (not counting the 1000 artificial features in the denominator for the dashed bar). [sent-487, score-0.301]
Figure 3: A plot showing the ratio of the AUC when sparsification is used over the AUC when no sparsification is used. [sent-493, score-0.589]
82 The rounding algorithm is also included for comparison due to its similarity to truncated gradient when θ = g. [sent-498, score-0.79]
83 3 3 10 0 1 2 10 10 Number of Features Figure 4: Effect of K on AUC in the rounding algorithm. [sent-557, score-0.387]
84 number-of-feature plots, where each data point is generated by running the respective algorithms using a different value of g (for truncated gradient) or θ (for the rounding algorithm). [sent-616, score-0.608]
85 First, the results verify the observation that the behavior of truncated gradient with θ = g is similar to the rounding algorithm. [sent-618, score-0.79]
86 Second, these results suggest that, in practice, it may be desirable to use θ = ∞ in truncated gradient because it avoids the local-minimum problem. [sent-619, score-0.403]
87 Comparison to Other Algorithms: The next set of experiments compares truncated gradient to other algorithms regarding their abilities to balance feature sparsification and performance. [sent-621, score-0.425]
88 The algorithms for comparison include: • The truncated gradient algorithm: We fixed K = 10 and θ = ∞, used cross-validated parameters, and altered the gravity parameter g. [sent-623, score-0.56]
89 • The rounding algorithm: We fixed K = 10, used cross-validated parameters, and altered the rounding threshold θ. [sent-625, score-0.45]
90 Truncated gradient is consistently competitive with the other two online algorithms and significantly outperformed them in some problems. [sent-631, score-0.339]
91 Second, it is interesting to observe that the qualitative behavior of truncated gradient is often similar to LASSO, especially when very sparse weight vectors are allowed (the left side in the graphs). [sent-633, score-0.493]
92 In this case, LASSO seems to overfit, while truncated gradient is more robust to overfitting. [sent-637, score-0.403]
93 As we have shown and argued, the rounding algorithm is quite ad hoc and may not work robustly in some problems, and the sub-gradient algorithm does not lead to sparsity in general during training. [sent-643, score-0.649]
94 The algorithm, truncated gradient, is the natural extension of Lasso-style regularization to the online-learning setting. [sent-646, score-0.401]
95 Theorem 3.1 proves the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. [sent-810, score-0.277]
96 Lemma 1 Suppose update rule (6) is applied to weight vector w on example z = (x, y) with gravity parameter gi = g, and results in a weight vector w′ . [sent-820, score-0.488]
97 (1 − 0.5Aη)·L(w_i, z_i) + g_i ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ≤ L(w̄, z_i) + (‖w̄ − w_i‖² − ‖w̄ − w_{i+1}‖²)/(2η) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 + (η/2)·B. [sent-836, score-0.935]
98 Summing over i = 1, . . . , T gives
∑_{i=1}^T [ (1 − 0.5Aη)·L(w_i, z_i) + g_i ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ]
≤ ∑_{i=1}^T [ (‖w̄ − w_i‖² − ‖w̄ − w_{i+1}‖²)/(2η) + L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 + (η/2)·B ]
= (‖w̄ − w_1‖² − ‖w̄ − w_{T+1}‖²)/(2η) + (η/2)·T·B + ∑_{i=1}^T [ L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ]
≤ ‖w̄‖²/(2η) + (η/2)·T·B + ∑_{i=1}^T [ L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ]. [sent-841, score-1.619]
99 Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. [sent-856, score-0.327]
100 Solving large scale linear prediction problems using stochastic gradient descent algorithms. [sent-918, score-0.277]
wordName wordTfidf (topN-words)
[('rounding', 0.387), ('auc', 0.325), ('wi', 0.251), ('sparsi', 0.243), ('gi', 0.222), ('truncated', 0.221), ('gradient', 0.182), ('sparsity', 0.176), ('radient', 0.166), ('runcated', 0.166), ('online', 0.157), ('angford', 0.141), ('wt', 0.14), ('features', 0.135), ('vowpal', 0.129), ('lasso', 0.123), ('zi', 0.12), ('gravity', 0.118), ('regret', 0.116), ('krvskp', 0.111), ('nline', 0.108), ('hang', 0.108), ('wabbit', 0.106), ('parse', 0.096), ('vd', 0.095), ('crx', 0.094), ('wbc', 0.094), ('nonzero', 0.091), ('ad', 0.086), ('wdbc', 0.084), ('spambase', 0.071), ('descent', 0.065), ('duchi', 0.063), ('zit', 0.059), ('mushroom', 0.059), ('batch', 0.057), ('truncation', 0.056), ('wpbc', 0.055), ('rd', 0.054), ('kg', 0.054), ('sparse', 0.05), ('lihong', 0.048), ('uci', 0.048), ('sub', 0.047), ('aggressive', 0.045), ('square', 0.043), ('earning', 0.042), ('forgetron', 0.042), ('update', 0.042), ('shroom', 0.042), ('sgn', 0.041), ('loss', 0.04), ('weight', 0.04), ('altered', 0.039), ('weights', 0.039), ('bar', 0.034), ('littlestone', 0.034), ('rutgers', 0.032), ('zoo', 0.032), ('fraction', 0.031), ('coef', 0.03), ('stochastic', 0.03), ('housing', 0.029), ('decays', 0.029), ('zinkevich', 0.029), ('langford', 0.029), ('onlinelearning', 0.028), ('gd', 0.027), ('decaying', 0.027), ('cross', 0.027), ('tong', 0.027), ('yoram', 0.027), ('rule', 0.026), ('convex', 0.026), ('regarded', 0.025), ('round', 0.025), ('integer', 0.025), ('manfred', 0.025), ('inducing', 0.024), ('threshold', 0.024), ('ranging', 0.024), ('manages', 0.024), ('ows', 0.024), ('acquire', 0.024), ('py', 0.024), ('vj', 0.024), ('validation', 0.023), ('rate', 0.023), ('shrinkage', 0.023), ('spam', 0.023), ('feature', 0.022), ('passes', 0.022), ('ads', 0.021), ('sj', 0.021), ('telescoping', 0.021), ('reduction', 0.021), ('yahoo', 0.021), ('theorem', 0.021), ('counterpart', 0.02), ('shai', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
2 0.16705456 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.13288967 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
4 0.10390235 23 jmlr-2009-Discriminative Learning Under Covariate Shift
Author: Steffen Bickel, Michael Brückner, Tobias Scheffer
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning
5 0.097629301 49 jmlr-2009-Learning Permutations with Exponential Weights
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
6 0.079493798 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
7 0.074663304 38 jmlr-2009-Hash Kernels for Structured Data
8 0.063121043 50 jmlr-2009-Learning When Concepts Abound
9 0.055134222 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
10 0.054673325 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.053201687 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
12 0.049773313 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
13 0.048974618 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
14 0.048317313 67 jmlr-2009-Online Learning with Sample Path Constraints
15 0.047096957 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
16 0.045975886 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
17 0.044987787 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
18 0.044932354 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
19 0.042841006 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
20 0.040118594 13 jmlr-2009-Bounded Kernel-Based Online Learning
topicId topicWeight
[(0, 0.247), (1, 0.059), (2, 0.068), (3, -0.143), (4, 0.221), (5, 0.176), (6, -0.028), (7, 0.003), (8, -0.024), (9, 0.06), (10, 0.029), (11, -0.109), (12, 0.016), (13, 0.044), (14, 0.061), (15, -0.024), (16, -0.115), (17, -0.248), (18, -0.021), (19, -0.117), (20, -0.086), (21, 0.007), (22, 0.189), (23, -0.227), (24, 0.008), (25, -0.142), (26, 0.104), (27, -0.049), (28, -0.068), (29, -0.064), (30, -0.016), (31, -0.019), (32, 0.062), (33, 0.024), (34, 0.101), (35, -0.123), (36, -0.016), (37, -0.07), (38, -0.021), (39, 0.112), (40, -0.001), (41, -0.008), (42, 0.014), (43, -0.001), (44, 0.052), (45, -0.139), (46, -0.076), (47, -0.209), (48, 0.057), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.96634072 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
2 0.57259226 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.53606677 49 jmlr-2009-Learning Permutations with Exponential Weights
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
4 0.43987322 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
5 0.42575106 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
6 0.3954556 50 jmlr-2009-Learning When Concepts Abound
7 0.38060901 38 jmlr-2009-Hash Kernels for Structured Data
8 0.29868773 67 jmlr-2009-Online Learning with Sample Path Constraints
9 0.29824048 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
10 0.28870481 23 jmlr-2009-Discriminative Learning Under Covariate Shift
11 0.2703495 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
12 0.26645324 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
13 0.26563749 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
14 0.26558495 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
15 0.26405451 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
16 0.24546582 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
17 0.19991592 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
18 0.19580436 78 jmlr-2009-Refinement of Reproducing Kernels
20 0.18696809 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
topicId topicWeight
[(8, 0.023), (11, 0.018), (19, 0.017), (26, 0.021), (38, 0.033), (42, 0.293), (47, 0.014), (52, 0.099), (55, 0.056), (58, 0.02), (66, 0.158), (68, 0.068), (90, 0.048), (96, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.76786369 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
2 0.57454252 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.55433798 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
4 0.5488292 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
5 0.54841655 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
6 0.54512095 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
7 0.54057312 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
8 0.53715903 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
10 0.53464913 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.5338158 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
12 0.53235519 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
13 0.53112763 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
14 0.53071117 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
15 0.5281654 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
16 0.52765977 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
17 0.52424306 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
18 0.52383542 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
19 0.52334338 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
20 0.52154797 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions