jmlr jmlr2012 jmlr2012-97 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is a growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
Reference: text
sentIndex sentText sentNum sentScore
1 Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. [sent-13, score-0.659]
2 We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. [sent-15, score-0.399]
3 Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning 1. [sent-16, score-0.323]
4 This work examines regularization based on group norms and spectral norms of matrices. [sent-26, score-0.385]
5 In particular, we use a recently developed methodology, based on the notion of strong convexity, for designing and analyzing the regret or generalization ability of a wide range of learning algorithms (see, e.g., Shalev-Shwartz, 2007; Kakade et al., 2009). [sent-28, score-0.286]
6 In fact, most of our efficient algorithms (both in the batch and online settings) impose some complexity control via the use of some strictly convex penalty function either explicitly via a regularizer or implicitly in the design of an online update rule. [sent-32, score-0.455]
7 Central to understanding these algorithms is the manner in which these penalty functions are strictly convex, that is, the behavior of the “gap” by which these convex functions lie above their tangent planes, which is strictly positive for strictly convex functions. [sent-33, score-0.242]
8 Here, the notion of strong convexity provides one means to characterize this gap in terms of some general norm rather than just the standard Euclidean norm. [sent-34, score-0.293]
9 The importance of strong convexity can be understood using the duality between strong convexity and strong smoothness. [sent-35, score-0.515]
10 We characterize a number of matrix-based regularization functions of recent interest as being strongly convex functions. [sent-44, score-0.302]
11 Specifying the general performance bounds for the specific matrix-based regularization method, we are able to systematically decide which regularization function is more appropriate based on the underlying statistical properties of a given problem. [sent-46, score-0.238]
12 First, we want to provide a unified framework for obtaining regret and generalization error bounds for algorithms based on strongly convex regularizers. [sent-49, score-0.529]
13 • We show how the framework based on strong convexity/strong smoothness duality (see Shalev-Shwartz, 2007; Kakade et al. [sent-54, score-0.24]
14 • We provide template algorithms (both in the online and batch settings) for a number of machine learning problems of recent interest, which use matrix parameters. [sent-60, score-0.25]
15 In particular, we provide a simple derivation of generalization/mistake bounds for: (i) online and batch multitask learning using group or spectral norms, (ii) online multi-class categorization using group or spectral norms, and (iii) multiple kernel learning. [sent-61, score-0.661]
16 • We generalize recent results of Juditsky and Nemirovski (2008) to obtain a general result on the strong convexity/strong smoothness of certain group norms (Theorem 13). [sent-63, score-0.352]
17 For example, the generality of this framework allows us to simplify the proofs of previously proposed regret bounds for online multi-task learning (e.g., those of Cavallanti et al.). [sent-66, score-0.409]
18 The use of interesting matrix norms like group norms and Schatten norms in the context of multi-task learning appears in the work of Argyriou et al. [sent-96, score-0.519]
19 Relatively recently, its use in machine learning has been twofold: in deriving regret bounds for online algorithms and generalization bounds in batch settings. [sent-104, score-0.672]
20 The duality of strong convexity and strong smoothness was first used by Shalev-Shwartz and Singer (2006) and Shalev-Shwartz (2007) in the context of deriving low regret online algorithms. [sent-105, score-0.765]
21 Here, once we choose a particular strongly convex penalty function, we immediately have a family of algorithms along with a regret bound for these algorithms that is in terms of a certain strong convexity parameter. [sent-106, score-0.61]
22 Again, under this characterization, a number of generalization and margin bounds (for methods which use linear prediction) are immediate corollaries, as one only needs to specify the strong convexity parameter from which these bounds easily follow (see Kakade et al. [sent-125, score-0.417]
23 The concept of strong smoothness (essentially a second order upper bound on a function) has also been in play in a different literature: the analysis of the concentration of martingales in smooth Banach spaces (Pinelis, 1994; Pisier, 1975). [sent-127, score-0.236]
24 Recently, Juditsky and Nemirovski (2008) used the fact that a norm is strongly convex if and only if its conjugate is strongly smooth. [sent-129, score-0.487]
25 The norms considered here were the (Schatten) ℓp-matrix norms and certain “block” composed norms (such as the ∥·∥_{2,q} norm). [sent-131, score-0.405]
26 We further highlight the importance of strong convexity to matrix learning applications by drawing attention to families of strongly convex functions over matrices. [sent-136, score-0.431]
27 Given a norm ∥·∥, its dual norm is denoted by ∥·∥_⋆. [sent-156, score-0.317]
28 The following theorem states that strong convexity and strong smoothness are dual properties. [sent-178, score-0.4]
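For reference, the standard form of these two properties and of the duality they satisfy can be written out as follows (a paraphrase added for readability, using the same 1/(2β) convention that appears in the arguments below):

```latex
% f is \beta-strongly convex w.r.t. a norm \|\cdot\| if, for all u, w and any
% z in the subdifferential \partial f(w),
f(u) \;\ge\; f(w) + \langle z,\, u-w\rangle + \tfrac{\beta}{2}\,\|u-w\|^{2} .
% g is \beta-strongly smooth w.r.t. \|\cdot\| if g is differentiable and
g(u) \;\le\; g(w) + \langle \nabla g(w),\, u-w\rangle + \tfrac{\beta}{2}\,\|u-w\|^{2} .
% Duality: f is \beta-strongly convex w.r.t. \|\cdot\| if and only if its
% Fenchel conjugate f^{\star} is (1/\beta)-strongly smooth w.r.t. \|\cdot\|_{\star}.
```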
29 Subtly, note that while the domain of a strongly convex function f may be a proper subset of X (important for a number of settings), its conjugate f ⋆ always has a domain which is X (since if f ⋆ is strongly smooth then it is finite and everywhere differentiable). [sent-188, score-0.368]
30 The following direct corollary of Theorem 3 is central in proving both regret and generalization bounds. [sent-199, score-0.292]
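The statement of Corollary 4, in the form suggested by the way it is invoked in the regret and Rademacher arguments below (a restatement added as an aid; the constant 1/(2β) is inferred from those uses):

```latex
% If f is \beta-strongly convex w.r.t. \|\cdot\| and f^{\star} is its Fenchel
% conjugate, then for any v_1,\dots,v_n and any u,
\sum_{i=1}^{n}\langle v_i, u\rangle - f(u)
  \;\le\; \sum_{i=1}^{n}\bigl\langle v_i, \nabla f^{\star}(v_{1:i-1})\bigr\rangle
  + \frac{1}{2\beta}\sum_{i=1}^{n}\|v_i\|_{\star}^{2},
\qquad\text{where } v_{1:i-1} := \textstyle\sum_{s<i} v_s .
```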
31 • Online convex optimization: Let W be a convex set. [sent-210, score-0.242]
32 On round t of the game, the learner (first player) should choose wt ∈ W and the environment (second player) responds with a convex function over W , that is, lt : W → R. [sent-212, score-0.662]
33 The goal of the learner is to minimize its regret, defined as (1/n) ∑_{t=1}^n l_t(w_t) − min_{w∈W} (1/n) ∑_{t=1}^n l_t(w). [sent-213, score-0.833]
34 Regret Bound for Online Convex Optimization: Algorithm 1 provides one common algorithm (online mirror descent) which achieves the following regret bound. [sent-230, score-0.245]
35 It is one of a family of algorithms that enjoy the same regret bound (see Shalev-Shwartz, 2007). [sent-231, score-0.223]
36 Suppose the loss functions l_t are convex and V-Lipschitz w.r.t. ∥·∥. [sent-237, score-0.449]
37 Then, the algorithm run with any positive η enjoys the regret bound ∑_{t=1}^T l_t(w_t) − min_{u∈W} ∑_{t=1}^T l_t(u) ≤ max_{u∈W} f(u)/η + ηV²T/(2β). [sent-241, score-0.833]
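For intuition (a standard calculation added here, not quoted from the paper), the step size that balances the two terms of this bound is:

```latex
\min_{\eta>0}\left[\frac{\max_{u\in W} f(u)}{\eta}+\frac{\eta V^{2}T}{2\beta}\right]
  \;=\; V\,\sqrt{\frac{2\,T\,\max_{u\in W} f(u)}{\beta}},
\qquad\text{attained at }\;
\eta^{*} \;=\; \sqrt{\frac{2\beta\,\max_{u\in W} f(u)}{V^{2}\,T}} .
```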
38 Apply Corollary 4 to −ηv_1, . . . , −ηv_T to get, for all u, −η ∑_{t=1}^T ⟨v_t, u⟩ − f(u) ≤ −η ∑_{t=1}^T ⟨v_t, w_t⟩ + (1/(2β)) ∑_{t=1}^T ∥ηv_t∥_⋆². [sent-245, score-0.505]
39 Using the fact that l_t is V-Lipschitz, we get ∥v_t∥_⋆ ≤ V. [sent-246, score-0.474]
40 Plugging this into the inequality above and rearranging gives ∑_{t=1}^T ⟨v_t, w_t − u⟩ ≤ f(u)/η + ηV²T/(2β). [sent-247, score-0.359]
41 By convexity of l_t, l_t(w_t) − l_t(u) ≤ ⟨v_t, w_t − u⟩. [sent-248, score-1.443]
42 Therefore, ∑_{t=1}^T l_t(w_t) − ∑_{t=1}^T l_t(u) ≤ f(u)/η + ηV²T/(2β), and the theorem follows. [sent-249, score-0.656]
43 In particular, when the “weight vector” comes from an infinite dimensional normed space with norm ∥·∥ and the functions l_t are 1-Lipschitz w.r.t. [sent-253, score-0.482]
44 the dual norm ∥·∥_⋆, a necessary and sufficient condition for an O(√T) regret rate is that the Martingale type of the dual space is 2. [sent-256, score-0.454]
45 Moreover, this last condition is equivalent to the existence of a strongly convex function in the original normed space. [sent-257, score-0.248]
46 Algorithm 1 (Online Mirror Descent): w_1 ← ∇f⋆(0); for t = 1 to T: play w_t ∈ W; receive l_t and pick v_t ∈ ∂l_t(w_t); w_{t+1} ← ∇f⋆(−η ∑_{s=1}^t v_s); end for. [sent-259, score-0.833]
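A minimal Python sketch of this update for the specific choice f(w) = (1/2)∥w∥_q², assuming W is the whole space so that ∇f⋆ has the closed form of the classical p-norm link function (the function names and the unconstrained assumption are illustrative choices, not taken from the paper):

```python
import numpy as np

def pnorm_link(theta, p):
    """Gradient of f*(theta) = 0.5 * ||theta||_p^2, the Fenchel conjugate of
    f(w) = 0.5 * ||w||_q^2 with 1/p + 1/q = 1 (unconstrained case)."""
    if not np.any(theta):
        return np.zeros_like(theta)
    norm_p = np.linalg.norm(theta, ord=p)
    return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

def online_mirror_descent(subgrad_fns, dim, q, eta):
    """Lazy (dual-averaging) form of Algorithm 1 for f(w) = 0.5 * ||w||_q^2.

    subgrad_fns: list of callables; subgrad_fns[t](w) returns a subgradient
    v_t of the t-th loss at w.  Returns the list of played iterates.
    """
    p = q / (q - 1.0)                      # dual exponent
    theta = np.zeros(dim)                  # -eta * (sum of past subgradients)
    iterates = []
    for subgrad in subgrad_fns:
        w_t = pnorm_link(theta, p)         # w_t = grad f*(-eta * sum_{s<t} v_s)
        iterates.append(w_t)
        theta -= eta * subgrad(w_t)
    return iterates
```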
47 It is well known that bounds on Rademacher complexity of a class immediately yield generalization bounds for classifiers picked from that class (assuming the loss function is Lipschitz). [sent-273, score-0.243]
48 Kakade et al. (2008) proved Rademacher complexity bounds for classes consisting of linear predictors using strong convexity arguments. [sent-275, score-0.278]
49 Apply Corollary 4 with u = w and v_i = λε_i x_i to get sup_{w∈W} ∑_{i=1}^n ⟨w, λε_i x_i⟩ ≤ sup_{w∈W} f(w) + (λ²/(2β)) ∑_{i=1}^n ∥ε_i x_i∥_⋆² + ∑_{i=1}^n ⟨∇f⋆(v_{1:i−1}), ε_i x_i⟩ ≤ f_max + (λ²X²n)/(2β) + ∑_{i=1}^n ⟨∇f⋆(v_{1:i−1}), ε_i x_i⟩. [sent-286, score-0.437]
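As a side illustration (added, not from the paper), the empirical Rademacher complexity of an ℓ2-bounded linear class can be estimated directly by Monte Carlo, since the supremum over the ℓ2 ball has a closed form; the sketch below assumes the data points are the rows of a matrix X.

```python
import numpy as np

def empirical_rademacher_l2(X, B, num_samples=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    class {x -> <w, x> : ||w||_2 <= B} on the n rows of X.

    For the l2 ball, sup_{||w||_2 <= B} (1/n) sum_i eps_i <w, x_i> equals
    (B/n) * ||sum_i eps_i x_i||_2, so each Rademacher draw is evaluated exactly.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X) * B / n
            for _ in range(num_samples)]
    return float(np.mean(vals))
```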
50 Strongly Convex Matrix Functions: Before we consider strongly convex matrix functions, let us recall the following result about the strong convexity of the vector ℓq norm. [sent-310, score-0.475]
51 Obtaining results with respect to ∥·∥_1 is slightly more tricky since for q = 1 the strong convexity parameter is 0 (meaning that the function is not strongly convex). [sent-316, score-0.266]
52 For this choice of q, the strong convexity parameter becomes q − 1 = 1/(ln(d) − 1) ≥ 1/ln(d) and the value of p corresponding to the dual norm is p = (1 − 1/q)^{-1} = ln(d). [sent-318, score-0.372]
53 With q = ln(d)/(ln(d) − 1), the resulting function is therefore 1/(3 ln(d))-strongly convex with respect to ∥·∥_1. We now consider two families of strongly convex matrix functions. [sent-323, score-0.257]
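A tiny numerical illustration of this choice of q (an added sketch; it assumes d > e so that ln(d) > 1):

```python
import numpy as np

def entropic_exponents(d):
    """The exponent q = ln(d)/(ln(d)-1) discussed above, its dual p = ln(d),
    and the strong convexity parameter q - 1 = 1/(ln(d)-1) >= 1/ln(d)."""
    q = np.log(d) / (np.log(d) - 1.0)
    p = 1.0 / (1.0 - 1.0 / q)            # equals ln(d)
    beta = q - 1.0
    return q, p, beta

# For d = 10_000: q ~ 1.122, p = ln(d) ~ 9.210, beta ~ 0.122 >= 1/ln(d) ~ 0.109.
print(entropic_exponents(10_000))
```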
54 As above, choosing q to be ln(m′)/(ln(m′) − 1) for m′ = min{m, n} gives the following corollary. [sent-337, score-0.224]
55 Given two (vector) norms ∥·∥_r and ∥·∥_p, we can define a new norm ∥X∥_{r,p} as ∥X∥_{r,p} := ∥(∥X_1∥_r, . . . , ∥X_n∥_r)∥_p, where X_1, . . . , X_n are the columns of X. [sent-345, score-0.254]
56 The following theorem, which appears in a slightly weaker form in the work of Juditsky and Nemirovski (2008), provides us with an easy way to construct strongly convex group norms. [sent-352, score-0.283]
57 Now using strong convexity/strong smoothness duality and the discussion preceding Corollary 10, we get the following corollary. [sent-372, score-0.24]
58 Let l_1, . . . , l_n be a sequence of convex functions that are X-Lipschitz w.r.t. ∥·∥_1. [sent-379, score-0.345]
59 Then, online mirror descent run with f(w) = (1/2)∥w∥_q² for q = ln(d)/(ln(d) − 1) achieves the regret bound (1/n) ∑_{t=1}^n l_t(w_t) − min_{w∈W} (1/n) ∑_{t=1}^n l_t(w) ≤ X W √(3 ln(d)/n). [sent-383, score-1.062]
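A hypothetical usage of the mirror descent sketch given after Algorithm 1, instantiated for this ℓ1-geometry corollary (it reuses online_mirror_descent from that sketch; the data, losses, and step size below are illustrative assumptions, not the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500
X = rng.choice([-1.0, 1.0], size=(n, d))        # ||x_t||_infty = 1 for every round
y = X[:, 0] + 0.1 * rng.standard_normal(n)      # target driven by a single feature

# Absolute losses l_t(w) = |<w, x_t> - y_t| are 1-Lipschitz w.r.t. ||.||_1 here,
# because their subgradients sign(<w, x_t> - y_t) * x_t satisfy ||v_t||_infty <= 1.
subgrads = [
    (lambda w, x_t=x_t, y_t=y_t: np.sign(w @ x_t - y_t) * x_t)
    for x_t, y_t in zip(X, y)
]
q = np.log(d) / (np.log(d) - 1.0)
iterates = online_mirror_descent(subgrads, dim=d, q=q, eta=0.1)
```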
60 Let l_1, . . . , l_n be a sequence of functions that are X-Lipschitz w.r.t. ∥·∥_{2,1}. [sent-392, score-0.224]
61 Then, online mirror descent run with f(W) = (1/2)∥W∥_{2,q}² for q = ln(d)/(ln(d) − 1) achieves the regret bound (1/n) ∑_{t=1}^n l_t(W_t) − min_{W∈W} (1/n) ∑_{t=1}^n l_t(W) ≤ X W √(3 ln(d)/n). [sent-396, score-1.062]
62 Let l_1, . . . , l_n be a sequence of functions that are X-Lipschitz w.r.t. ∥·∥_{S(1)}. [sent-400, score-0.224]
63 Then, online mirror descent run with f(W) = (1/2)∥W∥_{S(q)}² for q = ln(k′)/(ln(k′) − 1) achieves the regret bound (1/n) ∑_{t=1}^n l_t(W_t) − min_{W∈W} (1/n) ∑_{t=1}^n l_t(W) ≤ X W √(3 ln(k′)/n), where k′ = min{k, d}. [sent-404, score-1.062]
64 Remark 19 The gradient mapping W → ∇f⋆(W) that is needed to run the online mirror descent algorithm can be easily computed both for the group norm (f(W) = (1/2)∥W∥_{r,p}²) and Schatten norm (f(W) = (1/2)∥W∥_{S(p)}²) cases. [sent-405, score-0.537]
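To make this remark concrete for the group norm case, here is a sketch of the gradient mapping, assuming (as the group norm duality discussed above suggests) that the conjugate of (1/2)∥·∥_{2,q}² is (1/2)∥·∥_{2,p}² with 1/p + 1/q = 1; it is written for 1 < q ≤ 2, which covers the choices of q used in the corollaries above.

```python
import numpy as np

def grad_conjugate_group(Theta, q):
    """Sketch of Theta -> grad f*(Theta) for f(W) = 0.5 * ||W||_{2,q}^2.

    Assumes f*(Theta) = 0.5 * ||Theta||_{2,p}^2 with 1/p + 1/q = 1, so each
    column Theta_j is rescaled by ||Theta_j||_2^(p-2) / ||a||_p^(p-2), where a
    is the vector of column l2 norms.  Intended for 1 < q <= 2 (so p >= 2).
    """
    p = q / (q - 1.0)
    a = np.linalg.norm(Theta, axis=0)              # l2 norm of each column
    if not np.any(a):
        return np.zeros_like(Theta)
    scale = a ** (p - 2) / np.linalg.norm(a, ord=p) ** (p - 2)
    return Theta * scale                           # column-wise rescaling
```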
65 For the sake of concreteness, let us focus on the batch learning setting, but we note that the discussion below is relevant to the online learning model as well. [sent-412, score-0.25]
66 For the former, recall that we first apply the ℓ2 norm to each column of W and then apply the ℓ1 norm to the resulting vector of norm values. [sent-441, score-0.357]
67 Similarly, to calculate ∥X∥_{2,∞} we first apply the ℓ2 norm to the columns of X and then apply the ℓ∞ norm to the resulting vector of norm values. [sent-442, score-0.397]
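The two computations just described amount to one line of NumPy each (a small added illustration; W and X below are arbitrary example matrices):

```python
import numpy as np

W = np.arange(12, dtype=float).reshape(3, 4)   # example parameter matrix
X = np.ones((3, 4))                            # example data matrix

norm_W_21 = np.linalg.norm(W, axis=0).sum()    # l2 of each column, then l1
norm_X_2inf = np.linalg.norm(X, axis=0).max()  # l2 of each column, then l_infty
```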
68 In the next sections we demonstrate how to apply the general methodology described above in order to derive a few generalization and regret bounds for problems of recent interest. [sent-468, score-0.316]
69 Considerations of common sparsity patterns (same features relevant across different tasks) lead to the use of group norm regularizers (i.e., group norms applied to the matrix of task parameters). [sent-480, score-0.221]
70 We now describe online and batch multi-task learning using various matrix norms. [sent-485, score-0.25]
71 Algorithm 2 (Online Multi-task Mirror Descent): W_1 ← ∇f⋆(0); for t = 1 to T: predict using W_t ∈ W ⊆ R^{k×d}; receive X_t ∈ R^{k×d} and y_t ∈ R^k; pick V_t whose row j is v_t^j = (l^j)′(⟨w_t^j, x_t^j⟩, y_t^j) x_t^j; W_{t+1} ← ∇f⋆(−η ∑_{s=1}^t V_s); end for. [sent-486, score-0.747]
72 4.1 Online Multi-task Learning: In the online model, on round t the learner first uses W_t to predict the vector of responses and then it pays the cost l_t(W_t) = l(W_t, X_t, y_t) = ∑_{j=1}^k l^j(⟨w_t^j, x_t^j⟩, y_t^j). [sent-487, score-0.932]
73 Let V_t ∈ R^{k×d} be a sub-gradient of l_t at W_t. [sent-488, score-0.328]
74 It is easy to verify that the j'th row of V_t, denoted v_t^j, is a sub-gradient of l^j(⟨w_t^j, x_t^j⟩, y_t^j) at w_t^j. [sent-489, score-0.766]
75 Assuming that l^j is ρ-Lipschitz with respect to its first argument, we obtain that v_t^j = τ_t^j x_t^j for some τ_t^j ∈ [−ρ, ρ]. [sent-490, score-0.271]
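A sketch of this subgradient construction in code (an added illustration with assumed helper names; loss_grad stands for the derivative (l^j)′ of the per-task loss in its first argument):

```python
import numpy as np

def multitask_subgradient(W, X, y, loss_grad):
    """Row j of the returned matrix is l_j'(<w_j, x_j>, y_j) * x_j, matching
    the V_t used in Algorithm 2.  W and X are k x d (one row per task), y has
    length k, and loss_grad(pred, target) is the per-task loss derivative."""
    preds = np.einsum('jd,jd->j', W, X)            # <w_j, x_j> for each task j
    taus = np.array([loss_grad(p, t) for p, t in zip(preds, y)])
    return taus[:, None] * X                       # scale each row by tau_j

# Example per-task loss: absolute loss, whose derivative lies in [-1, 1],
# so each row of the subgradient is tau_j * x_j with |tau_j| <= 1.
abs_loss_grad = lambda pred, target: float(np.sign(pred - target))
```

Combined with a gradient mapping such as grad_conjugate_group above, this gives one way to realize the update W_{t+1} = ∇f⋆(−η ∑_{s≤t} V_s) of Algorithm 2.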
76 Then, Algorithm 2 achieves the regret bounds given in Table 1 (ignoring constants). [sent-500, score-0.281]
77 Then, the bound for W_{1,1} is of order s_w ln(dk)/n while the bound for W_{2,2} is of order s_w s_x / n. [sent-507, score-0.332]
78 This bound will be better than the bound of W_{2,2} if s_g ln(d) < d and will be better than the bound of W_{1,1} if s_g s_x / d < s_w. [sent-518, score-0.418]
79 We see that there are scenarios in which the group norm is better than the non-grouped norms and that the most adequate class depends on properties of the problem and our prior beliefs on a good predictor W. [sent-519, score-0.356]
80 To obtain excess risk bounds for W, we need to consider the k-task Rademacher complexity R_T^k(W) := E[ sup_{W∈W} (1/n) ∑_{i=1}^n ∑_{j=1}^k ε_i^j ⟨w^j, X_i^j⟩ ], because, assuming each l^j is ρ-Lipschitz, we have the bound E[L(W)] − min_{W′∈W} L(W′) ≤ ρ E[R_T^k(W)]. [sent-535, score-0.23]
81 Theorem 21 (Multi-task Generalization) Suppose F(W) ≤ f_max for all W ∈ W, for a function F that is β-strongly convex w.r.t. some norm ∥·∥. [sent-538, score-0.219]
82 On round t, the online algorithm receives an instance x_t ∈ R^d and is required to predict its label as a number in {1, . . . , k}. [sent-557, score-0.253]
83 That is, l_t(W_t) = max_r (1[r ≠ y_t] − (⟨w_t^{y_t}, x_t⟩ − ⟨w_t^r, x_t⟩)) = max_r (1[r ≠ y_t] − ⟨W_t, X_t^{r,y_t}⟩), where X_t^{r,y_t} is a matrix with x_t on the y_t'th row, −x_t on the r'th row, and zeros in all other elements. [sent-563, score-0.747]
84 It is easy to verify that l_t(W_t) upper bounds the zero-one loss, that is, if the prediction of W_t is r then l_t(W_t) ≥ 1[r ≠ y_t]. [sent-564, score-0.76]
85 A sub-gradient of l_t(W_t) is either a matrix of the form −X_t^{r,y_t} or the all-zeros matrix. [sent-565, score-0.372]
86 Therefore, ∥X_t^{r,y_t}∥_{2,2} = √2 ∥x_t∥_2, ∥X_t^{r,y_t}∥_{∞,∞} = ∥x_t∥_∞, and ∥X_t^{r,y_t}∥_{S(∞)} = √2 ∥x_t∥_2. [sent-567, score-0.375]
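A small added sketch that builds X^{r,y} and numerically checks these three norm identities (for r ≠ y):

```python
import numpy as np

def multiclass_update_matrix(x, r, y, k):
    """The matrix X^{r,y} described above: x on row y, -x on row r (r != y),
    and zeros elsewhere; k is the number of classes and x lives in R^d."""
    M = np.zeros((k, x.shape[0]))
    M[y] = x
    M[r] = -x
    return M

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
M = multiclass_update_matrix(x, r=2, y=0, k=4)
assert np.isclose(np.linalg.norm(M, 'fro'), np.sqrt(2) * np.linalg.norm(x))   # (2,2) norm
assert np.isclose(np.abs(M).max(), np.abs(x).max())                           # (infty,infty) norm
assert np.isclose(np.linalg.norm(M, 2), np.sqrt(2) * np.linalg.norm(x))       # spectral / S(infty)
```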
87 Corollary 23 Let W_{1,1}, W_{2,2}, W_{2,1}, W_{S(1)} be the classes defined in Section 3 and let X_2 = max_t ∥x_t∥_2 and X_∞ = max_t ∥x_t∥_∞. [sent-569, score-0.294]
88 Then, there exist online multi-class learning algorithms with regret bounds given by the following table. [sent-570, score-0.409]
89 They only gave empirical results and our analysis given in Corollary 23 provides a first rigorous regret bound for their algorithm. [sent-576, score-0.223]
90 The algorithm is a specification of the general online mirror descent procedure (Algorithm 1) with f(W) = (1/2)∥W∥_{2,r}², r = ln(d)/(ln(d) − 1), and with a conservative update (i.e., the weights are updated only on rounds where a prediction mistake is made). [sent-592, score-0.229]
91 Therefore, applying our general online regret bound (Corollary 17) to the sequence of examples in I, we obtain that for any W, ∑_{t∈I} l_t(W_t) − ∑_{t∈I} l_t(W) ≤ X_∞ ∥W∥_{2,1} √(6 ln(d) |I|). [sent-599, score-1.007]
92 Corollary 24 The number of mistakes Algorithm 3 will make on any sequence of examples for which ∥x_t∥_∞ ≤ X_∞ is upper bounded by min_W [ ∑_t l_t(W) + X_∞ ∥W∥_{2,1} √(6 ln(d) ∑_t l_t(W)) + 3 X_∞² ∥W∥_{2,1}² ln(d) ]. [sent-605, score-0.781]
93 Recall that given a norm ∥·∥ on X, its dual norm is defined as ∥y∥_⋆ := sup{⟨x, y⟩ : ∥x∥ ≤ 1}. [sent-648, score-0.317]
94 An important property of the dual norm is that the Fenchel conjugate of the function (1/2)∥x∥² is (1/2)∥y∥_⋆². [sent-649, score-0.261]
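A short verification of this fact (standard, added for completeness):

```latex
\Bigl(\tfrac12\|\cdot\|^{2}\Bigr)^{\!\star}(y)
  \;=\; \sup_{x}\Bigl[\langle x, y\rangle - \tfrac12\|x\|^{2}\Bigr]
  \;=\; \sup_{t\ge 0}\Bigl[\,t\,\|y\|_{\star} - \tfrac{t^{2}}{2}\Bigr]
  \;=\; \tfrac12\,\|y\|_{\star}^{2},
% where the middle step maximizes \langle x, y\rangle over \{x : \|x\| = t\},
% which equals t\,\|y\|_{\star} by the definition of the dual norm.
```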
95 Then if f = ∥·∥ is a norm on R^l, then f ∘ σ = ∥σ(·)∥ is a norm on R^{m×n}. [sent-687, score-0.238]
96 Then if g = ∥·∥ is a norm on R^n, then g ∘ λ = ∥λ(·)∥ is a norm on S^n. [sent-690, score-0.238]
97 Proof of Theorem 3: First, Lemma 15 of Shalev-Shwartz (2007) yields one half of the claim (f strongly convex ⇒ f⋆ strongly smooth). [sent-702, score-0.305]
98 It is left to prove that f is strongly convex assuming that f ⋆ is strongly smooth. [sent-703, score-0.305]
99 This implies that g⋆(a) ≥ (1/2)∥a∥_⋆², because of (Shalev-Shwartz and Singer, 2008, Lemma 19) and the fact that the conjugate of the half squared norm is the half squared dual norm. [sent-707, score-0.261]
100 But a function is defined to be essentially strictly convex if it is strictly convex on any subset of {u : ∂f(u) ≠ ∅}. [sent-727, score-0.242]
wordName wordTfidf (topN-words)
[('lt', 0.328), ('ln', 0.224), ('ws', 0.224), ('wt', 0.213), ('akade', 0.212), ('ewari', 0.212), ('hwartz', 0.212), ('regret', 0.177), ('echniques', 0.167), ('egularization', 0.167), ('udiag', 0.163), ('vt', 0.146), ('atrices', 0.139), ('norms', 0.135), ('online', 0.128), ('xt', 0.125), ('convex', 0.121), ('norm', 0.119), ('kakade', 0.11), ('bounds', 0.104), ('convexity', 0.1), ('fmax', 0.098), ('duality', 0.093), ('strongly', 0.092), ('rk', 0.091), ('sx', 0.088), ('fenchel', 0.086), ('schatten', 0.084), ('rtk', 0.081), ('rademacher', 0.081), ('corollary', 0.08), ('dual', 0.079), ('batch', 0.078), ('sw', 0.076), ('juditsky', 0.075), ('strong', 0.074), ('smoothness', 0.073), ('group', 0.07), ('yt', 0.069), ('earning', 0.069), ('mirror', 0.068), ('fkc', 0.065), ('lewis', 0.063), ('conjugate', 0.063), ('rm', 0.06), ('kd', 0.058), ('absolutely', 0.058), ('sg', 0.058), ('srebro', 0.057), ('seamlessly', 0.056), ('nemirovski', 0.051), ('argyriou', 0.051), ('perceptron', 0.051), ('xi', 0.049), ('singular', 0.049), ('sn', 0.048), ('sup', 0.047), ('deriving', 0.046), ('multitask', 0.046), ('bound', 0.046), ('regularization', 0.045), ('singer', 0.045), ('let', 0.044), ('matrix', 0.044), ('subdifferential', 0.043), ('martingales', 0.043), ('sridharan', 0.043), ('equipped', 0.042), ('grove', 0.042), ('rl', 0.041), ('columns', 0.04), ('lanckriet', 0.038), ('borwein', 0.038), ('cavallanti', 0.038), ('kernel', 0.037), ('warmuth', 0.035), ('generalization', 0.035), ('kivinen', 0.035), ('matrices', 0.035), ('normed', 0.035), ('jerusalem', 0.035), ('gentile', 0.035), ('ambuj', 0.035), ('obozinski', 0.035), ('lasso', 0.034), ('risk', 0.033), ('tasks', 0.033), ('interior', 0.033), ('descent', 0.033), ('distill', 0.033), ('rotational', 0.033), ('xij', 0.033), ('zalinescu', 0.033), ('advice', 0.032), ('sparsity', 0.032), ('yi', 0.032), ('prior', 0.032), ('crammer', 0.031), ('rt', 0.031), ('player', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
2 0.24888772 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
3 0.14594792 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
4 0.143802 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
5 0.14288811 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
Author: Francesco Orabona, Luo Jie, Barbara Caputo
Abstract: In recent years there has been a lot of interest in designing principled classification algorithms over multiple cues, based on the intuitive notion that using more features should lead to better performance. In the domain of kernel methods, a principled way to use multiple features is the Multi Kernel Learning (MKL) approach. Here we present a MKL optimization algorithm based on stochastic gradient descent that has a guaranteed convergence rate. We directly solve the MKL problem in the primal formulation. By having a p-norm formulation of MKL, we introduce a parameter that controls the level of sparsity of the solution, while leading to an easier optimization problem. We prove theoretically and experimentally that 1) our algorithm has a faster convergence rate as the number of kernels grows; 2) the training complexity is linear in the number of training examples; 3) very few iterations are sufficient to reach good solutions. Experiments on standard benchmark databases support our claims. Keywords: multiple kernel learning, learning kernels, online optimization, stochastic subgradient descent, convergence bounds, large scale
6 0.13642204 80 jmlr-2012-On Ranking and Generalization Bounds
8 0.1287519 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
9 0.12392063 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
10 0.12315395 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
11 0.120288 111 jmlr-2012-Structured Sparsity and Generalization
12 0.11688011 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
13 0.10748456 84 jmlr-2012-Online Submodular Minimization
14 0.095652819 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
15 0.093763955 73 jmlr-2012-Multi-task Regression using Minimal Penalties
16 0.091402829 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
17 0.084868848 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
18 0.082669064 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
19 0.081303738 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
20 0.081002191 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
topicId topicWeight
[(0, -0.387), (1, -0.283), (2, -0.224), (3, 0.055), (4, -0.021), (5, 0.044), (6, 0.085), (7, 0.031), (8, -0.017), (9, -0.123), (10, 0.002), (11, 0.063), (12, 0.08), (13, 0.045), (14, -0.016), (15, -0.184), (16, 0.196), (17, 0.03), (18, -0.02), (19, 0.028), (20, 0.067), (21, 0.063), (22, 0.051), (23, -0.058), (24, 0.064), (25, -0.053), (26, 0.072), (27, 0.04), (28, -0.091), (29, 0.088), (30, 0.041), (31, 0.045), (32, 0.009), (33, 0.013), (34, -0.016), (35, 0.087), (36, -0.008), (37, -0.019), (38, -0.012), (39, 0.0), (40, -0.004), (41, 0.015), (42, -0.004), (43, -0.048), (44, 0.001), (45, 0.008), (46, -0.012), (47, -0.061), (48, -0.066), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.95025218 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
2 0.6411503 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
3 0.6333043 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
4 0.63147199 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
5 0.62879235 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
6 0.59276414 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
7 0.56695664 80 jmlr-2012-On Ranking and Generalization Bounds
8 0.54742908 82 jmlr-2012-On the Necessity of Irrelevant Variables
9 0.51599604 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
10 0.50867462 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
11 0.49893573 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
12 0.47126806 84 jmlr-2012-Online Submodular Minimization
13 0.46146378 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
14 0.44702569 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
15 0.44640931 73 jmlr-2012-Multi-task Regression using Minimal Penalties
16 0.44377246 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
17 0.40634948 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
18 0.40295264 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
19 0.39689615 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
20 0.39656451 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
topicId topicWeight
[(7, 0.404), (21, 0.054), (26, 0.056), (27, 0.012), (29, 0.02), (49, 0.014), (56, 0.014), (64, 0.012), (69, 0.014), (75, 0.08), (77, 0.015), (79, 0.013), (92, 0.106), (96, 0.11)]
simIndex simValue paperId paperTitle
1 0.98221534 101 jmlr-2012-SVDFeature: A Toolkit for Feature-based Collaborative Filtering
Author: Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, Yong Yu
Abstract: In this paper we introduce SVDFeature, a machine learning toolkit for feature-based collaborative filtering. SVDFeature is designed to efficiently solve the feature-based matrix factorization. The feature-based setting allows us to build factorization models incorporating side information such as temporal dynamics, neighborhood relationship, and hierarchical information. The toolkit is capable of both rate prediction and collaborative ranking, and is carefully designed for efficient training on large-scale data set. Using this toolkit, we built solutions to win KDD Cup for two consecutive years. Keywords: large-scale collaborative filtering, context-aware recommendation, ranking
same-paper 2 0.79811925 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
3 0.56152707 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
4 0.51389027 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
5 0.51342356 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
6 0.50609332 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
7 0.46016866 111 jmlr-2012-Structured Sparsity and Generalization
8 0.45945638 103 jmlr-2012-Sampling Methods for the Nyström Method
9 0.45718753 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
10 0.45715123 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
11 0.45263329 84 jmlr-2012-Online Submodular Minimization
12 0.44858992 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
13 0.44599134 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.44459391 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
15 0.44414419 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
16 0.44183159 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
17 0.4413133 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
18 0.440772 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
19 0.43927759 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
20 0.43789196 73 jmlr-2012-Multi-task Regression using Minimal Penalties