jmlr jmlr2009 jmlr2009-87 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1-regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find that for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, Rutgers University, Piscataway, NJ, USA. Editor: Manfred Warmuth. Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss functions. [sent-7, score-0.712]
2 The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. [sent-11, score-0.259]
3 Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso [sent-16, score-0.59]
4 This paper addresses the problem of inducing sparsity in learned weights while using an online-learning algorithm. [sent-42, score-0.267]
5 Simply adding L1 -regularization to the gradient of an online weight update doesn’t work because gradients don’t induce sparsity. [sent-45, score-0.421]
6 Simply rounding weights to 0 is problematic because a weight may be small due to being useless or small because it has been updated only once (either at the beginning of training or because the set of features appearing is also sparse). [sent-49, score-0.601]
7 Consider a loss function L(w, zi ) which is convex in w, where zi = (xi , yi ) is an input/output pair. [sent-58, score-0.306]
8 The convex constraint formulation has a simple online version using the projection idea of Zinkevich (2003), which requires the projection of weight w into an L1 -ball at every online step. [sent-62, score-0.38]
9 This operation is difficult to implement efficiently for large-scale data with many features even if all examples have sparse features although recent progress was made (Duchi et al. [sent-63, score-0.32]
10 This family of algorithms allows general convex regularization functions and reproduces a special case of the truncated gradient algorithm we will introduce in Section 3. [sent-71, score-0.429]
11 It operates by decaying the weights on previous examples and then rounding these weights to zero when they become small. [sent-75, score-0.492]
12 At a high level, our approach works with the soft-regularization formulation (2) and decays the weight to a default value after every online stochastic gradient step. [sent-85, score-0.438]
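For reference, here is a hedged reconstruction of the two batch formulations the text refers to as (1) and (2): an L1-constrained problem with radius s and its soft-regularized counterpart with parameter g > 0. The exact normalization is our assumption, inferred from the surrounding discussion of L1-ball projection and L1 penalties.

```latex
% Constrained formulation (1), as suggested by the projection-onto-an-L1-ball discussion:
\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} L(w, z_i)
\quad \text{subject to} \quad \|w\|_1 \le s .

% Soft-regularization formulation (2), the L1-penalized counterpart:
\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} L(w, z_i) + g \, \|w\|_1 .
```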
13 As mentioned in the introduction, we are mainly interested in sparse online methods for large scale problems with sparse features. [sent-91, score-0.257]
14 For such problems, our algorithm should satisfy the following requirements: • The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features. [sent-92, score-0.248]
15 Our solution, referred to as truncated gradient, is a simple modification of the standard stochastic gradient rule. [sent-94, score-0.433]
16 It is defined in (6) as an improvement over simpler ideas such as rounding and the sub-gradient method with L1-regularization. [sent-95, score-0.387]
17 We compare our approach to a few others, including L1 -regularization on small data, as well as online rounding of coefficients to zero. [sent-99, score-0.544]
18 We make a prediction based on existing weights wi ∈ Rd . [sent-107, score-0.29]
19 We observe yi , let zi = (xi , yi ), and incur some known loss L(wi , zi ) that is convex in parameter wi . [sent-109, score-0.557]
20 For this purpose, we start with the standard stochastic gradient descent (SGD) rule, which is of the form: f (wi ) = wi − η∇1 L(wi , zi ), (3) where ∇1 L(a, b) is a sub-gradient of L(a, b) with respect to the first variable a. [sent-113, score-0.648]
21 In the literature, the stochastic-gradient descent rule is often referred to as gradient descent (GD). [sent-128, score-0.338]
22 There are other variants, such as exponentiated gradient descent (EG). [sent-129, score-0.247]
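As a concrete illustration of rule (3), here is a minimal Python/NumPy sketch of one SGD step. The squared-loss sub-gradient used inside is only an example of a convex loss, and the function name is ours.

```python
import numpy as np

def sgd_step(w, x, y, eta):
    """One application of rule (3): f(w_i) = w_i - eta * grad_1 L(w_i, z_i),
    illustrated here with the squared loss L(w, (x, y)) = (w.x - y)^2."""
    grad = 2.0 * (np.dot(w, x) - y) * x   # sub-gradient of the squared loss w.r.t. w
    return w - eta * grad
```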
23 Sparse Online Learning: In this section, we examine several methods for achieving sparsity in online learning. [sent-132, score-0.333]
24 That is, we first apply the standard stochastic gradient descent rule, and then round small coefficients to zero. [sent-147, score-0.302]
25 In general, we should not take K = 1, especially when η is small, since each step modifies wi by only a small amount. [sent-148, score-0.251]
26 If a coefficient is zero, it remains small after one online update, and the rounding operation pulls it back to zero. [sent-149, score-0.544]
27 Consequently, rounding can be done only after every K steps (with a reasonably large K); in this case, nonzero coefficients have sufficient time to go above the threshold θ. [sent-150, score-0.502]
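A minimal sketch of this rounding scheme (rule (4) with a rounding period K), assuming a caller-supplied sub-gradient callback; the function and argument names are illustrative, not the paper's implementation.

```python
import numpy as np

def rounding_sgd(examples, d, eta, theta, K, grad):
    """Coefficient rounding (4): SGD steps with truncation to zero every K steps.
    `grad(w, x, y)` returns a sub-gradient of the loss at example (x, y)."""
    w = np.zeros(d)
    for i, (x, y) in enumerate(examples, start=1):
        w = w - eta * grad(w, x, y)          # standard SGD step (3)
        if i % K == 0:                       # round only every K steps
            w[np.abs(w) <= theta] = 0.0      # zero out small coefficients
    return w
```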
28 A Sub-gradient Algorithm for L1-Regularization: In our experiments, we combine rounding at the end of training with a simple online sub-gradient method for L1-regularization with a regularization parameter g > 0: f(w_i) = w_i − η∇_1 L(w_i, z_i) − ηg · sgn(w_i), (5) where for a vector v = [v_1, . . . , v_d], sgn(v) = [sgn(v_1), . . . , sgn(v_d)]. [sent-155, score-0.528]
29 In the experiments, the online method (5) plus rounding at the end is used as a simple baseline. [sent-162, score-0.544]
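A sketch of the baseline just described: the sub-gradient update (5) run over the data, followed by a single rounding pass at the end of training. The threshold θ and the gradient callback are assumed arguments; names are ours.

```python
import numpy as np

def subgradient_l1(examples, d, eta, g, theta, grad):
    """Baseline (5): w <- w - eta*grad - eta*g*sgn(w), with rounding at the end."""
    w = np.zeros(d)
    for x, y in examples:
        w = w - eta * grad(w, x, y) - eta * g * np.sign(w)
    w[np.abs(w) <= theta] = 0.0              # rounding at the end of training
    return w
```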
30 Truncated Gradient: In order to obtain an online version of the simple rounding rule in (4), we observe that the direct rounding to zero is too aggressive. [sent-166, score-0.957]
31 The amount of shrinkage is measured by a gravity parameter g_i > 0: f(w_i) = T_1(w_i − η∇_1 L(w_i, z_i), ηg_i, θ), (6) where for a vector v = [v_1, . . . , v_d], T_1(v, α, θ) = [T_1(v_1, α, θ), . . . , T_1(v_d, α, θ)], with T_1(v_j, α, θ) = max{0, v_j − α} if v_j ∈ [0, θ], min{0, v_j + α} if v_j ∈ [−θ, 0], and v_j otherwise. [sent-169, score-0.483]
32 That is, if i/K is not an integer, we let gi = 0; if i/K is an integer, we let gi = Kg for a gravity parameter g > 0. [sent-178, score-0.562]
33 Since ηKg ≪ θ when ηK is small, the truncation operation is less aggressive than the rounding in (4). [sent-187, score-0.488]
34 Setting θ = ∞ yields another important special case of (6), which becomes f(w_i) = T(w_i − η∇_1 L(w_i, z_i), g_i η), (7) where for a vector v = [v_1, . . . , v_d], T(v, α) = [T(v_1, α), . . . , T(v_d, α)], with T(v_j, α) = max{0, v_j − α} if v_j > 0 and min{0, v_j + α} otherwise. [sent-190, score-0.342]
35 The parameter gi ≥ 0 controls the sparsity that can be achieved with the algorithm. [sent-198, score-0.398]
36 Note when gi = 0, the update rule is identical to the standard stochastic gradient descent rule. [sent-199, score-0.567]
37 That is, if i/K is not an integer, we let gi = 0; if i/K is an integer, we let gi = Kg for a gravity parameter g > 0. [sent-201, score-0.562]
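Putting the pieces together, here is a sketch of the truncated gradient update (6) with the gravity schedule above (g_i = Kg on every K-th step, 0 otherwise); the T_1 helper follows the component-wise definition given after (6). Function names are ours, and passing theta = np.inf recovers the special case (7).

```python
import numpy as np

def T1(v, alpha, theta):
    """Component-wise truncation: shrink entries in [-theta, theta] toward zero
    by alpha (never crossing zero); leave entries with |v_j| > theta untouched."""
    shrunk_pos = np.maximum(v - alpha, 0.0)          # for 0 <= v_j <= theta
    shrunk_neg = np.minimum(v + alpha, 0.0)          # for -theta <= v_j < 0
    out = np.where(v >= 0, shrunk_pos, shrunk_neg)
    return np.where(np.abs(v) > theta, v, out)

def truncated_gradient(examples, d, eta, g, theta, K, grad):
    """Update rule (6) with gravity g_i = K*g on every K-th step, 0 otherwise."""
    w = np.zeros(d)
    for i, (x, y) in enumerate(examples, start=1):
        gi = K * g if i % K == 0 else 0.0            # gravity schedule
        w = T1(w - eta * grad(w, x, y), eta * gi, theta)
    return w
```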
38 Truncated gradient for L1-regularization is different from (5), which is a naïve application of the stochastic gradient descent rule with an added L1-regularization term. [sent-211, score-0.485]
39 Theorem 3.1 (Sparse Online Regret) Consider the sparse online update rule (6) with w_1 = 0 and η > 0. [sent-234, score-0.275]
40 Suppose there exist A, B ≥ 0 such that ‖∇_1 L(w, z)‖² ≤ A·L(w, z) + B for all w and z. Then for all w̄ ∈ R^d,
(1 − 0.5Aη)/T · ∑_{i=1}^T [ L(w_i, z_i) + g_i/(1 − 0.5Aη) · ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ]
≤ (η/2)·B + ‖w̄‖²/(2ηT) + (1/T) ∑_{i=1}^T [ L(w̄, z_i) + g_{i+1} ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ],
where for vectors v = [v_1, . . . , v_d] and v′ = [v′_1, . . . , v′_d], v · I(|v′| ≤ θ) denotes the vector [v_1·I(|v′_1| ≤ θ), . . . , v_d·I(|v′_d| ≤ θ)]. [sent-238, score-0.342]
41 Without a good characterization of the local minimum, it is not possible for us to replace g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side by g_i ‖w̄ · I(|w̄| ≤ θ)‖_1, because such a formulation implies we could efficiently solve a non-convex problem with a simple online update rule. [sent-255, score-0.643]
42 Still, when θ < ∞, one naturally expects the right-hand-side penalty g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 to be much smaller than the corresponding L1 penalty g_i ‖w̄‖_1, especially when w̄ has many components close to 0. [sent-256, score-0.444]
43 Theorem 3.1 also implies a trade-off between sparsity and regret performance. [sent-260, score-0.292]
44 When g is small, we have less sparsity, but the regret term g‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ≤ g‖w̄‖_1 on the right-hand side is also small. [sent-262, score-0.292]
45 When g is large, we are able to achieve more sparsity, but the regret term g‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 on the right-hand side also becomes large. [sent-263, score-0.292]
46 Theorem 3.1 implies (1/T) ∑_{i=1}^T [ L(w_i, z_i) + g‖w_i‖_1 ] ≤ inf_{w̄ ∈ R^d} [ (1/T) ∑_{i=1}^T L(w̄, z_i) + g‖w̄‖_1 ] + o(1). [sent-268, score-0.491]
47 In the stochastic setting where the examples are drawn iid from some underlying distribution, the sparse online gradient method proposed in this paper solves the L1 -regularization problem. [sent-272, score-0.419]
48 For simplicity, we only consider the case with θ = ∞ and constant gravity gi = g. [sent-283, score-0.34]
49 ŵ_{t+1} = T(ŵ_t − η∇_1 L(ŵ_t, z_{i_t}), gη),  w̄_{t+1} = w̄_t + (ŵ_{t+1} − w̄_t)/(t + 1), where each i_t is drawn from {1, . . . , T}. [sent-292, score-0.339]
50 Proof: Note the recursion of w̄_t implies w̄_T = (1/T) ∑_{t=1}^T ŵ_t, from telescoping the update rule. [sent-308, score-0.343]
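A sketch of this randomized procedure for the θ = ∞, constant-gravity case: draw an index uniformly at each step, apply the shrinkage operator T(·, gη), and keep the running average of the iterates. Variable names and the use of NumPy's default random generator are our choices.

```python
import numpy as np

def averaged_truncated_gradient(examples, d, eta, g, num_steps, grad, rng=None):
    """theta = infinity, constant gravity g: update w_hat with T(., g*eta) and
    return the running average w_bar of the iterates.
    `examples` is a sequence of (x, y) pairs; `grad(w, x, y)` is a sub-gradient."""
    if rng is None:
        rng = np.random.default_rng(0)
    w_hat = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, num_steps + 1):
        x, y = examples[rng.integers(len(examples))]   # draw i_t uniformly
        v = w_hat - eta * grad(w_hat, x, y)
        w_hat = np.sign(v) * np.maximum(np.abs(v) - g * eta, 0.0)  # T(v, g*eta)
        w_bar = w_bar + (w_hat - w_bar) / (t + 1)      # running average of iterates
    return w_bar
```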
51 Since L1 -regularization is frequently used to achieve sparsity in the batch learning setting, the connection to L1 -regularization can be regarded as an alternative justification for the sparse-online algorithm developed in this paper. [sent-331, score-0.258]
52 Algorithm 1 Truncated Gradient for Least Squares. Inputs: threshold θ ≥ 0; gravity sequence g_i ≥ 0; learning rate η ∈ (0, 1); example oracle O. Initialize weights w^j ← 0 (j = 1, . . . , d). [sent-332, score-0.426]
For each weight w^j: (a) if w^j > 0 and w^j ≤ θ, then w^j ← max{w^j − g_i η, 0}; (b) else if w^j < 0 and w^j ≥ −θ, then w^j ← min{w^j + g_i η, 0}. [sent-346, score-0.444]
54 In the description, we use superscripted symbol w j to denote the j-th component of vector w (in order to differentiate from wi , which we have used to denote the i-th weight vector). [sent-352, score-0.291]
55 Although we keep the choice of gravity parameters g_i open in the algorithm description, in practice we only consider the following choice: g_i = Kg if i/K is an integer, and g_i = 0 otherwise. [sent-354, score-0.587]
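A sketch of Algorithm 1 for the square loss with the practical gravity choice above, modeling the example oracle as an iterable of (x, y) pairs. This is a dense-vector illustration with names of our choosing, not the VOWPAL WABBIT implementation.

```python
import numpy as np

def truncated_gradient_least_squares(oracle, d, eta, g, theta, K):
    """Algorithm 1: truncated gradient for least squares.
    `oracle` yields (x, y) pairs with x a length-d NumPy array."""
    w = np.zeros(d)
    for i, (x, y) in enumerate(oracle, start=1):
        y_hat = np.dot(w, x)                         # prediction with current weights
        w = w + 2.0 * eta * (y - y_hat) * x          # gradient step on the square loss
        if i % K == 0:                               # apply gravity every K steps
            alpha = eta * K * g
            small = np.abs(w) <= theta               # only coefficients within [-theta, theta]
            w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - alpha, 0.0)
    return w
```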
56 It is thus desirable to deal with sparsity only in this small subset rather than in all features, while simultaneously inducing sparsity on all feature weights. [sent-358, score-0.398]
57 Moreover, it is important to store only features with non-zero coefficients (if the number of features is too large to be stored in memory, this approach allows us to use a hash table to track only the nonzero coefficients). [sent-359, score-0.361]
58 Corollary 4.1 (Sparse Online Square Loss Regret) If there exists C > 0 such that ‖x‖ ≤ C for all x, then for all w̄ ∈ R^d we have
(1 − 2C²η)/T · ∑_{i=1}^T [ (w_i^T x_i − y_i)² + g_i/(1 − 2C²η) · ‖w_i · I(|w_i| ≤ θ)‖_1 ]
≤ ‖w̄‖²/(2ηT) + (1/T) ∑_{i=1}^T [ (w̄^T x_i − y_i)² + g_{i+1} ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ],
where w_i = [w_i^1, . . . , w_i^d] is the weight vector used for prediction on example i. [sent-364, score-0.724]
59 VOWPAL WABBIT optimizes square loss on a linear representation w^T x via gradient descent (3) with a couple of caveats. [sent-372, score-0.33]
60 Instead the update is a constant value, equivalent to the gradient of a linear loss function. [sent-377, score-0.264]
61 During online learning, we simply go through all nonzero features j of example i, and can “simulate” the shrinkage of w^j after τ_j in a batch mode. [sent-383, score-0.463]
62 This lazy-update idea of delaying the shrinkage calculation until needed is the key to efficient implementation of truncated gradient. [sent-385, score-0.244]
63 Specifically, instead of using update rule (6) for weight w^j, we shrink the weights of all nonzero features j differently by the following: f(w^j) = T_1(w^j + 2η(y − ŷ)x^j, ⌊(i − τ_j)/K⌋ Kηg, θ), and τ_j is updated by τ_j ← τ_j + ⌊(i − τ_j)/K⌋ K. [sent-386, score-0.26]
64 In the coefficient rounding algorithm (4), for instance, for each nonzero feature j of example i, we can first perform a regular gradient descent on the square loss, and then do the following: if |w j | is below the threshold θ and i ≥ τ j + K, we round w j to 0 and set τ j to i. [sent-388, score-0.839]
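A sketch of the lazy-update idea for sparse data: weights live in a dictionary keyed by feature index, τ_j records the step up to which gravity has been charged to feature j, and the accumulated shrinkage is applied only when the feature reappears. The dictionary representation and helper names are our assumptions.

```python
import numpy as np

def shrink(v, alpha, theta):
    """T_1 applied to a scalar: shrink toward zero by alpha if |v| <= theta."""
    if abs(v) > theta:
        return v
    return np.sign(v) * max(abs(v) - alpha, 0.0)

def lazy_truncated_gradient_sparse(examples, eta, g, theta, K):
    """Sparse square-loss truncated gradient with lazily applied gravity.
    Each example is (features, y) where `features` maps feature index -> value."""
    w = {}    # nonzero weights only (a production version would drop exact zeros)
    tau = {}  # tau[j]: step index up to which gravity has been charged to w[j]
    for i, (features, y) in enumerate(examples, start=1):
        y_hat = sum(w.get(j, 0.0) * v for j, v in features.items())
        for j, v in features.items():
            if j not in w:
                w[j], tau[j] = 0.0, i                  # first time feature j is seen
            skipped = (i - tau[j]) // K                # missed gravity applications
            updated = w[j] + 2.0 * eta * (y - y_hat) * v   # square-loss gradient step
            w[j] = shrink(updated, skipped * K * eta * g, theta)
            tau[j] += skipped * K
    return w
```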
65 This implementation shows the truncated gradient method satisfies the following requirements needed for solving large scale problems with sparse features. [sent-389, score-0.453]
66 • The algorithm is computationally efficient: the number of operations per online step is linear in the number of nonzero features, and independent of the total number of features. [sent-390, score-0.248]
67 If we directly apply the online projection idea of Zinkevich (2003) to solve (1), then in the update rule (7), one has to pick the smallest g_i ≥ 0 such that ‖w_{i+1}‖_1 ≤ s. [sent-392, score-0.447]
68 The theoretical analysis presented in this paper shows we can obtain a meaningful regret bound by picking an arbitrary gi . [sent-400, score-0.338]
69 Moreover, our method allows non-convex updates closely related to the simple coefficient rounding idea. [sent-402, score-0.387]
70 However, we believe the sparsity achieved with their approach should be comparable to the sparsity achieved with our method. [sent-405, score-0.352]
71 The UCI data sets used do not have many features so we expect that a large fraction of these features are useful for making predictions. [sent-411, score-0.301]
72 For UCI data sets, with randomly added features, VOWPAL WABBIT is able to reduce the number of features by a fraction of more than 90%, except for the ad data set in which only 71% reduction is observed. [sent-432, score-0.273]
73 The Effects of K: As we argued before, using a K value larger than 1 may be desired in truncated gradient and the rounding algorithms. [sent-461, score-0.79]
74 As before, cross validation is used to select parameters in the rounding algorithm, including learning rate η, number of passes of data during training, and learning rate decay over training passes. [sent-464, score-0.505]
75 number-of-feature plots, where each data point is generated by running the respective algorithm using a different value of g (for truncated gradient) or θ (for the rounding algorithm). [sent-466, score-0.608]
76 The effect of K is large in the rounding algorithm. [sent-468, score-0.387]
Figure 2: Plots showing the amount of features left after sparsification using truncated gradient for each data set, when the performance is changed by at most 1% due to sparsification. [sent-484, score-0.632]
79 Plot on left: fraction left with respect to the total number of features (original with 1000 artificial features for the dashed bar). [sent-486, score-0.301]
80 Plot on right: fraction left with respect to the original features (not counting the 1000 artificial features in the denominator for the dashed bar). [sent-487, score-0.301]
Figure 3: A plot showing the ratio of the AUC when sparsification is used over the AUC when no sparsification is used. [sent-493, score-0.589]
82 The rounding algorithm is also included for comparison due to its similarity to truncated gradient when θ = g. [sent-498, score-0.79]
83 3 3 10 0 1 2 10 10 Number of Features Figure 4: Effect of K on AUC in the rounding algorithm. [sent-557, score-0.387]
84 number-of-feature plots, where each data point is generated by running the respective algorithms using a different value of g (for truncated gradient) or θ (for the rounding algorithm). [sent-616, score-0.608]
85 First, the results verify the observation that the behavior of truncated gradient with θ = g is similar to the rounding algorithm. [sent-618, score-0.79]
86 Second, these results suggest that, in practice, it may be desirable to use θ = ∞ in truncated gradient because it avoids the local-minimum problem. [sent-619, score-0.403]
87 Comparison to Other Algorithms: The next set of experiments compares truncated gradient to other algorithms regarding their abilities to balance feature sparsification and performance. [sent-621, score-0.425]
88 The algorithms for comparison include: • The truncated gradient algorithm: We fixed K = 10 and θ = ∞, used cross-validated parameters, and altered the gravity parameter g. [sent-623, score-0.56]
89 • The rounding algorithm: We fixed K = 10, used cross-validated parameters, and altered the rounding threshold θ. [sent-625, score-0.45]
90 Truncated gradient is consistently competitive with the other two online algorithms and significantly outperformed them in some problems. [sent-631, score-0.339]
91 Second, it is interesting to observe that the qualitative behavior of truncated gradient is often similar to LASSO, especially when very sparse weight vectors are allowed (the left side in the graphs). [sent-633, score-0.493]
92 In this case, LASSO seems to overfit, while truncated gradient is more robust to overfitting. [sent-637, score-0.403]
93 As we have shown and argued, the rounding algorithm is quite ad hoc and may not work robustly in some problems, and the sub-gradient algorithm does not lead to sparsity in general during training. [sent-643, score-0.649]
94 The algorithm, truncated gradient, is the natural extension of Lasso-style regularization to the online-learning setting. [sent-646, score-0.401]
95 Theorem 3.1 proves the technique is sound: it never harms performance much compared to standard stochastic gradient descent in adversarial situations. [sent-810, score-0.277]
96 Lemma 1 Suppose update rule (6) is applied to weight vector w on example z = (x, y) with gravity parameter gi = g, and results in a weight vector w′ . [sent-820, score-0.488]
97 (1 − 0.5Aη)·L(w_i, z_i) + g_i ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ≤ L(w̄, z_i) + (‖w̄ − w_i‖² − ‖w̄ − w_{i+1}‖²)/(2η) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 + (η/2)·B. [sent-836, score-0.935]
98 Summing over i = 1, . . . , T gives
∑_{i=1}^T [ (1 − 0.5Aη)·L(w_i, z_i) + g_i ‖w_{i+1} · I(|w_{i+1}| ≤ θ)‖_1 ]
≤ ∑_{i=1}^T [ (‖w̄ − w_i‖² − ‖w̄ − w_{i+1}‖²)/(2η) + L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 + (η/2)·B ]
= (‖w̄ − w_1‖² − ‖w̄ − w_{T+1}‖²)/(2η) + (η/2)·T·B + ∑_{i=1}^T [ L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ]
≤ ‖w̄‖²/(2η) + (η/2)·T·B + ∑_{i=1}^T [ L(w̄, z_i) + g_i ‖w̄ · I(|w_{i+1}| ≤ θ)‖_1 ]. [sent-841, score-1.619]
99 Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. [sent-856, score-0.327]
100 Solving large scale linear prediction problems using stochastic gradient descent algorithms. [sent-918, score-0.277]
wordName wordTfidf (topN-words)
[('rounding', 0.387), ('auc', 0.325), ('wi', 0.251), ('sparsi', 0.243), ('gi', 0.222), ('truncated', 0.221), ('gradient', 0.182), ('sparsity', 0.176), ('radient', 0.166), ('runcated', 0.166), ('online', 0.157), ('angford', 0.141), ('wt', 0.14), ('features', 0.135), ('vowpal', 0.129), ('lasso', 0.123), ('zi', 0.12), ('gravity', 0.118), ('regret', 0.116), ('krvskp', 0.111), ('nline', 0.108), ('hang', 0.108), ('wabbit', 0.106), ('parse', 0.096), ('vd', 0.095), ('crx', 0.094), ('wbc', 0.094), ('nonzero', 0.091), ('ad', 0.086), ('wdbc', 0.084), ('spambase', 0.071), ('descent', 0.065), ('duchi', 0.063), ('zit', 0.059), ('mushroom', 0.059), ('batch', 0.057), ('truncation', 0.056), ('wpbc', 0.055), ('rd', 0.054), ('kg', 0.054), ('sparse', 0.05), ('lihong', 0.048), ('uci', 0.048), ('sub', 0.047), ('aggressive', 0.045), ('square', 0.043), ('earning', 0.042), ('forgetron', 0.042), ('update', 0.042), ('shroom', 0.042), ('sgn', 0.041), ('loss', 0.04), ('weight', 0.04), ('altered', 0.039), ('weights', 0.039), ('bar', 0.034), ('littlestone', 0.034), ('rutgers', 0.032), ('zoo', 0.032), ('fraction', 0.031), ('coef', 0.03), ('stochastic', 0.03), ('housing', 0.029), ('decays', 0.029), ('zinkevich', 0.029), ('langford', 0.029), ('onlinelearning', 0.028), ('gd', 0.027), ('decaying', 0.027), ('cross', 0.027), ('tong', 0.027), ('yoram', 0.027), ('rule', 0.026), ('convex', 0.026), ('regarded', 0.025), ('round', 0.025), ('integer', 0.025), ('manfred', 0.025), ('inducing', 0.024), ('threshold', 0.024), ('ranging', 0.024), ('manages', 0.024), ('ows', 0.024), ('acquire', 0.024), ('py', 0.024), ('vj', 0.024), ('validation', 0.023), ('rate', 0.023), ('shrinkage', 0.023), ('spam', 0.023), ('feature', 0.022), ('passes', 0.022), ('ads', 0.021), ('sj', 0.021), ('telescoping', 0.021), ('reduction', 0.021), ('yahoo', 0.021), ('theorem', 0.021), ('counterpart', 0.02), ('shai', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
2 0.16705456 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.13288967 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
4 0.10390235 23 jmlr-2009-Discriminative Learning Under Covariate Shift
Author: Steffen Bickel, Michael Brückner, Tobias Scheffer
Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning
5 0.097629301 49 jmlr-2009-Learning Permutations with Exponential Weights
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
6 0.079493798 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
7 0.074663304 38 jmlr-2009-Hash Kernels for Structured Data
8 0.063121043 50 jmlr-2009-Learning When Concepts Abound
9 0.055134222 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
10 0.054673325 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.053201687 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
12 0.049773313 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
13 0.048974618 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
14 0.048317313 67 jmlr-2009-Online Learning with Sample Path Constraints
15 0.047096957 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List
16 0.045975886 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
17 0.044987787 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
18 0.044932354 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
19 0.042841006 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
20 0.040118594 13 jmlr-2009-Bounded Kernel-Based Online Learning
topicId topicWeight
[(0, 0.247), (1, 0.059), (2, 0.068), (3, -0.143), (4, 0.221), (5, 0.176), (6, -0.028), (7, 0.003), (8, -0.024), (9, 0.06), (10, 0.029), (11, -0.109), (12, 0.016), (13, 0.044), (14, 0.061), (15, -0.024), (16, -0.115), (17, -0.248), (18, -0.021), (19, -0.117), (20, -0.086), (21, 0.007), (22, 0.189), (23, -0.227), (24, 0.008), (25, -0.142), (26, 0.104), (27, -0.049), (28, -0.068), (29, -0.064), (30, -0.016), (31, -0.019), (32, 0.062), (33, 0.024), (34, 0.101), (35, -0.123), (36, -0.016), (37, -0.07), (38, -0.021), (39, 0.112), (40, -0.001), (41, -0.008), (42, 0.014), (43, -0.001), (44, 0.052), (45, -0.139), (46, -0.076), (47, -0.209), (48, 0.057), (49, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.96634072 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
2 0.57259226 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.53606677 49 jmlr-2009-Learning Permutations with Exponential Weights
Author: David P. Helmbold, Manfred K. Warmuth
Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing
4 0.43987322 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
5 0.42575106 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
Author: Tong Zhang
Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity
6 0.3954556 50 jmlr-2009-Learning When Concepts Abound
7 0.38060901 38 jmlr-2009-Hash Kernels for Structured Data
8 0.29868773 67 jmlr-2009-Online Learning with Sample Path Constraints
9 0.29824048 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
10 0.28870481 23 jmlr-2009-Discriminative Learning Under Covariate Shift
11 0.2703495 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
12 0.26645324 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory
13 0.26563749 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
14 0.26558495 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models (Machine Learning Open Source Software Paper)
15 0.26405451 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
16 0.24546582 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems
17 0.19991592 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions
18 0.19580436 78 jmlr-2009-Refinement of Reproducing Kernels
20 0.18696809 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
topicId topicWeight
[(8, 0.023), (11, 0.018), (19, 0.017), (26, 0.021), (38, 0.033), (42, 0.293), (47, 0.014), (52, 0.099), (55, 0.056), (58, 0.02), (66, 0.158), (68, 0.068), (90, 0.048), (96, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.76786369 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
2 0.57454252 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.55433798 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine
4 0.5488292 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
5 0.54841655 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
6 0.54512095 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
7 0.54057312 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
8 0.53715903 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
10 0.53464913 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.5338158 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
12 0.53235519 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
13 0.53112763 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
14 0.53071117 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
15 0.5281654 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability
16 0.52765977 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
17 0.52424306 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs
18 0.52383542 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
19 0.52334338 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning
20 0.52154797 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions