jmlr jmlr2012 jmlr2012-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
Reference: text
sentIndex sentText sentNum sentScore
1 In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. [sent-8, score-0.731]
2 We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. [sent-9, score-0.646]
3 Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization. [sent-13, score-0.688]
4 After making the prediction wi , we observe zi and suffer the loss f (wi , zi ), where f is a predefined loss function. [sent-26, score-0.702]
5 Since regret relies on the stochastic inputs zi , it is a random variable. [sent-33, score-0.7]
6 It is well-known that the optimal regret bound that can be achieved by a gradient-based serial algorithm on an arbitrary convex loss is O(√m). [sent-57, score-0.911]
7 At the other extreme, a trivial solution to our problem is to have each node operate in isolation of the other k−1 nodes, running an independent copy of a serial algorithm, without any communication over the network. [sent-61, score-0.554]
8 More specifically, assuming that each node processes m/k inputs, the expected regret per node is O(√(m/k)). [sent-64, score-0.515]
9 In this paper, we present the distributed mini-batch (DMB) algorithm, a method of converting any serial gradient-based online prediction algorithm into a parallel or distributed algorithm. [sent-67, score-0.78]
10 This method has two important properties: • It can use any gradient-based update rule for serial online prediction as a black box, and convert it into a parallel or distributed online prediction algorithm. [sent-68, score-0.965]
11 Moreover, the coefficient of the dominant term √m is the same as in the serial bound, and independent of k and of the network topology. [sent-70, score-0.531]
12 The idea of using mini-batches in stochastic and online learning is not new, and has been previously explored in both the serial and parallel settings (see, e. [sent-71, score-0.718]
13 Our results build on the fact that the optimal regret bound for serial stochastic gradient-based prediction algorithms can be refined if the loss function is smooth. [sent-77, score-1.047]
14 In particular, it can be shown that the hidden coefficient in the O(√m) notation is proportional to the standard deviation of the stochastic gradients evaluated at each predictor wi (Juditsky et al.). [sent-78, score-0.557]
15 However, the non-negligible communication latencies prevent a straightforward parallel implementation from obtaining the optimal serial regret bound. [sent-81, score-0.896]
16 In Section 2, we present a template for stochastic gradient-based serial prediction algorithms, and state refined variance-based regret bounds for smooth loss functions. [sent-84, score-1.08]
17 In Section 3, we analyze the effect of using mini-batches in the serial setting, and show that it does not significantly affect the regret bounds. [sent-85, score-0.765]
18 In Section 4, we present the DMB algorithm, and show that it achieves an asymptotically optimal serial regret bound for smooth loss functions. [sent-86, score-0.895]
19 For example, if the network communication operates over a minimum-depth spanning tree and the diameter of the network scales as log(k), then we can show that a straightforward implementation of the idea of parallel variance reduction leads to an O(√(m log(k))) regret bound. [sent-90, score-0.637]
20 Algorithm 1: Template for a serial first-order stochastic online prediction algorithm. [sent-92, score-0.719]
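The body of Algorithm 1 is not reproduced in this extraction, so the following is only an illustrative Python sketch of what such a serial first-order loop looks like, assuming a Euclidean ball for the feasible set W, a 1/√i step size, and caller-supplied gradient and input-stream callables; none of these names come from the paper.

```python
import numpy as np

def project_ball(w, radius=1.0):
    """Euclidean projection onto a ball of the given radius (a stand-in for the set W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def serial_online_prediction(sample_stream, grad, dim, num_rounds, step=0.1):
    """Sketch of a serial first-order stochastic online prediction loop.

    sample_stream: iterator yielding inputs z_i
    grad:          callable (w, z) -> stochastic gradient of f(w, z) with respect to w
    """
    w = np.zeros(dim)                 # initial predictor w_1
    predictions = []
    for i in range(1, num_rounds + 1):
        predictions.append(w.copy())              # predict w_i before seeing z_i
        z = next(sample_stream)                   # observe z_i, suffer the loss f(w_i, z_i)
        g = grad(w, z)                            # stochastic gradient at (w_i, z_i)
        w = project_ball(w - (step / np.sqrt(i)) * g)   # projected-gradient style update
    return predictions
```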
21 (Section: Variance Bounds for Serial Algorithms.) Before discussing distributed algorithms, we must fully understand the serial algorithms on which they are based. [sent-106, score-0.509]
22 For stochastic online prediction problems with convex loss functions, both of these update rules have an expected regret bound of O(√m). [sent-125, score-0.879]
23 In particular, for the projected stochastic gradient method defined in Equation (2), we have the following result: Theorem 1 Let f (w, z) be an L-smooth convex loss function in w for each z ∈ Z and assume that the stochastic gradient ∇w f (w, z) has σ2 -bounded variance for all w ∈ W . [sent-140, score-0.52]
24 Although we focus on expected regret bounds here, our results can equally be stated as high-probability bounds on the actual regret (see Appendix B for details). [sent-149, score-0.747]
25 For each of these update rules, we get an expected regret bound that closely resembles the bound in Equation (6). [sent-157, score-0.531]
26 Therefore, each node suffers an expected regret of at most ψ(σ2, ⌈m/k⌉) on its portion of the input stream, and the total regret bound is obtained by simply summing over the k nodes, that is, E[R(m)] ≤ k ψ(σ2, ⌈m/k⌉). [sent-185, score-0.819]
27 Comparing this bound to 2D2 L + 2Dσ√m in the ideal serial solution, we see that it is approximately √k times worse in its leading term. [sent-187, score-0.506]
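For concreteness, the arithmetic behind the "√k times worse" comparison can be written out as follows (a routine calculation, ignoring the ceiling in ⌈m/k⌉; it is not quoted from the paper):

```latex
k\,\psi\!\left(\sigma^2,\tfrac{m}{k}\right)
  = k\left(2D^2L + 2D\sigma\sqrt{\tfrac{m}{k}}\right)
  = 2kD^2L + 2D\sigma\sqrt{k}\,\sqrt{m}
\qquad\text{versus}\qquad
2D^2L + 2D\sigma\sqrt{m},
```

so the coefficient of the dominant √m term grows from 2Dσ to 2Dσ√k.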
28 (Section: Serial Online Prediction using Mini-Batches.) The expected regret bounds presented in the previous section depend on the variance of the stochastic gradients. [sent-191, score-0.572]
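The variance reduction that mini-batching buys is the standard identity for an average of b stochastic gradients, stated here for convenience under the assumption that the inputs zs are i.i.d. and the predictor w is held fixed over the batch:

```latex
\bar{g} \;=\; \frac{1}{b}\sum_{s=1}^{b}\nabla_w f(w,z_s),
\qquad
\mathbb{E}\,\bigl\|\bar{g}-\nabla F(w)\bigr\|^2
 \;=\; \frac{1}{b^{2}}\sum_{s=1}^{b}\mathbb{E}\,\bigl\|\nabla_w f(w,z_s)-\nabla F(w)\bigr\|^2
 \;\le\; \frac{\sigma^2}{b}.
```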
29 Before we present the distributed mini-batch algorithm in the next section, we first analyze a serial mini-batch algorithm. [sent-193, score-0.509]
30 We define the serial mini-batch algorithm as follows: Our prediction remains constant for the duration of each batch, and is updated only when a batch ends. [sent-197, score-0.634]
31 Once a batch ends, the serial mini-batch algorithm feeds the average gradient ḡj to the update rule φ as the jth gradient and obtains the new prediction for the next batch and the new state. [sent-200, score-0.947]
32 The appeal of the serial mini-batch setting is that the update rule is used less frequently, which may have computational benefits. [sent-202, score-0.546]
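A hypothetical Python sketch of this serial mini-batch wrapper follows; the update-rule interface (an object with an update(avg_grad) method that returns the next predictor) is an assumption made purely for illustration, not an API from the paper.

```python
def serial_mini_batch(sample_stream, grad, update_rule, w_init, batch_size, num_batches):
    """Serial mini-batching sketch: the prediction stays fixed within each batch,
    and the averaged gradient is fed to the update rule once per batch."""
    w = w_init
    for j in range(num_batches):
        g_sum = 0.0 * w                       # zero accumulator with the same shape as w
        for _ in range(batch_size):
            z = next(sample_stream)           # predict w on z, then observe z
            g_sum = g_sum + grad(w, z)        # gradient at the fixed predictor w_j
        g_bar = g_sum / batch_size            # averaged gradient for this batch
        w = update_rule.update(g_bar)         # one call to the update rule per batch
    return w
```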
33 If the update rule φ has the serial regret bound ψ(σ2, m), then the expected regret of Algorithm 2 over m inputs is at most b ψ(σ2/b, m/b). [sent-204, score-1.399]
34 Proof Assume without loss of generality that b divides m, and that the serial mini-batch algorithm processes exactly m/b complete batches. [sent-206, score-0.526]
35 Therefore the serial mini-batch algorithm is equivalent to using the update rule φ with the loss function f̄. [sent-226, score-0.59]
36 If the update rule φ has a regret bound ψ(σ2, m) for the loss function f over m inputs, then its regret for f̄ over m/b batches is bounded as E[ ∑_{j=1}^{m/b} ( f̄(wj, z̄j) − f̄(w⋆, z̄j) ) ] ≤ ψ(σ2/b, m/b). [sent-237, score-0.905]
37 The bound in Theorem 3 is asymptotically equivalent to the 2D2 L + 2Dσ√m regret bound for the basic serial algorithms presented in Section 2. [sent-241, score-0.903]
38 In other words, performing the mini-batch update in the serial setting does not significantly hurt the performance of the update rule. [sent-242, score-0.571]
39 On the other hand, it is also not surprising that using mini-batches in the serial setting does not improve the regret bound. [sent-243, score-0.765]
40 Nevertheless, our experiments demonstrate that in real-world scenarios, mini-batching can in fact have a very substantial positive effect on the transient performance of the online prediction algorithm, even in the serial setting (see Section 6 for details). [sent-246, score-0.583]
41 This technique resembles the serial mini-batch technique described earlier, and is therefore called the distributed mini-batch algorithm, or DMB for short. [sent-271, score-0.509]
42 Once all of the nodes have the overall average ḡj, each node updates the predictor using the same deterministic serial algorithm. [sent-306, score-0.63]
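A rough per-node sketch of this scheme is given below, assuming some vector all-reduce primitive (called allreduce_sum here, a purely hypothetical helper) that sums a vector across the k nodes; because every node ends up with the same global average ḡj and applies the same deterministic update, the nodes stay synchronized. The additional µ inputs that arrive while the sum is in flight are not modeled in this sketch.

```python
def dmb_node(local_stream, grad, update_rule, w_init, batch_size, num_batches,
             num_nodes, allreduce_sum):
    """One node's loop in a distributed mini-batch style scheme (illustrative only)."""
    w = w_init
    per_node = batch_size // num_nodes            # each node accumulates b/k gradients
    for j in range(num_batches):
        g_local = 0.0 * w
        for _ in range(per_node):
            z = next(local_stream)                # inputs that arrive at this node
            g_local = g_local + grad(w, z)        # gradients at the shared predictor w_j
        g_bar = allreduce_sum(g_local) / batch_size   # network sum -> global average
        w = update_rule.update(g_bar)             # identical deterministic update everywhere
    return w
```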
43 If the update rule φ has the serial regret bound ψ(σ2, m), then the expected regret of Algorithm 3 over m samples is at most (b + µ) ψ(σ2/b, m/(b + µ)). [sent-316, score-1.287]
44 Specifically, if ψ(σ2, m) = 2D2 L + 2Dσ√m, then setting the batch size b = m^{1/3} gives the expected regret bound 2Dσ√m + 2Dm^{1/3}(LD + σ√µ) + 2Dσm^{1/6} + 2Dσµm^{−1/6} + 2µD2 L. [sent-317, score-0.56]
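To see roughly where b = m^{1/3} comes from, here is a back-of-the-envelope simplification of the (b + µ) ψ(σ2/b, m/(b + µ)) bound; it ignores the rounding of the number of batches, so the lower-order constants differ slightly from Equation (8):

```latex
(b+\mu)\,\psi\!\left(\tfrac{\sigma^2}{b},\tfrac{m}{b+\mu}\right)
 \;=\; 2D^2L\,(b+\mu) \;+\; 2D\sigma\sqrt{\tfrac{m(b+\mu)}{b}}
 \;\le\; 2D\sigma\sqrt{m} \;+\; 2D\sigma\sqrt{\tfrac{\mu m}{b}} \;+\; 2D^2L\,b \;+\; 2D^2L\,\mu .
```

The term 2D2Lb grows with b while 2Dσ√(µm/b) shrinks with b; minimizing their sum gives b proportional to m^{1/3}, and any b with b → ∞ and b = o(√m) keeps every extra term at o(√m).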
45 To appreciate the power of this result, we compare the specific bound in Equation (8) with the ideal serial solution and the naïve no-communication solution discussed in the introduction. [sent-319, score-0.506]
46 It is clear that our bound is asymptotically equivalent to the ideal serial bound ψ(σ2, m); even the constants in the dominant terms are identical. [sent-320, score-0.615]
47 (Section: Improving Performance on Short Input Streams.) Theorem 4 presents an optimal way of choosing the batch size b, which results in an asymptotically optimal regret bound. [sent-352, score-0.513]
48 The batch size should indeed be set such that the generated traffic does not exceed the network bandwidth limit, but the latency of each sum operation should not be affected by the fact that multiple sum operations take place at once. [sent-367, score-0.499]
49 Each instance of the algorithm handles m/c inputs and suffers a regret of at most b ψ(σ2/b, 1 + m/(bc)), and, using Jensen's inequality, the overall regret using the average prediction is upper bounded by bc ψ(σ2/b, 1 + m/(bc)). [sent-372, score-0.822]
50 Therefore, a bound on the expected optimality gap can be readily obtained from a bound on the expected regret of the same algorithm. [sent-403, score-0.565]
51 In particular, if f is an L-smooth convex loss function and ∇w f(w, z) has σ2-bounded variance, and our algorithm has a regret bound of ψ(σ2, m), then it also has an expected optimality gap of at most E[G(m)] ≤ ψ(σ2, m)/m. [sent-404, score-0.578]
52 For the specific regret bound ψ(σ2, m) = 2D2 L + 2Dσ√m, which holds for the serial algorithms presented in Section 2, we have E[G(m)] ≤ ψ̄(σ2, m) = ψ(σ2, m)/m = 2D2 L/m + 2Dσ/√m. [sent-405, score-0.817]
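The step from a regret bound to an optimality-gap bound is the usual online-to-batch argument; a one-line version, assuming the algorithm returns the average w̄ = (1/m) ∑ wi and F is convex, reads:

```latex
\mathbb{E}\bigl[F(\bar{w})-F(w^\star)\bigr]
 \;\le\; \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\bigl[F(w_i)-F(w^\star)\bigr]
 \;=\; \frac{\mathbb{E}[R(m)]}{m}
 \;\le\; \frac{\psi(\sigma^2,m)}{m},
```

where the first inequality is Jensen's inequality and the middle equality uses the fact that wi does not depend on zi.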
53 If the update rule φ used in a serial setting has an expected optimality gap bounded by ψ̄(σ2, m), then the expected optimality gap of Algorithm 4 after processing m samples is at most ψ̄(σ2/b, m/b). [sent-422, score-0.75]
54 Since we can run the workload associated with a single batch in parallel, this theorem shows that the mini-batch technique is capable of turning many serial optimization algorithms into parallel ones. [sent-433, score-0.67]
55 Suppose the update rule φ used in the serial setting has an expected optimality gap bounded by ψ̄(σ2, m) = 2D2 L/m + 2Dσ/√m. [sent-458, score-0.648]
56 Proof By solving the equation 2D2 L/m + 2Dσ/√m = ε, we see that the following number of samples is sufficient for the serial algorithm to reach ε-optimality: m_srl(ε) = (D2 σ2/ε2)(1 + √(1 + 2Lε/σ2))^2. [sent-462, score-0.511]
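For completeness, the algebra behind that expression is a quadratic-formula computation (reconstructed here, not quoted from the paper): writing u = 1/√m,

```latex
2D^2L\,u^2 + 2D\sigma\,u - \varepsilon = 0
\;\Longrightarrow\;
u = \frac{-\sigma+\sqrt{\sigma^2+2L\varepsilon}}{2DL},
\qquad
\sqrt{m} = \frac{1}{u}
 = \frac{D\bigl(\sigma+\sqrt{\sigma^2+2L\varepsilon}\bigr)}{\varepsilon},
```

and squaring recovers m_srl(ε) = (D2 σ2/ε2)(1 + √(1 + 2Lε/σ2))^2.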
57 Figure 2: The effect of the batch size on the average loss when using serial mini-batching (average loss versus number of inputs, 10^5 to 10^9). [sent-518, score-0.696]
58 (Section: Serial Mini-Batching.) As a warm-up, we investigated the effects of modifying the mini-batch size b in a standard serial Euclidean dual averaging algorithm. [sent-533, score-0.592]
59 (Figure 3 legend: no-comm batch, no-comm serial, DMB; vertical axis: average loss.) [sent-545, score-0.672]
60 Figure 3: Comparing DMB with the serial algorithm and the no-communication distributed algorithm (average loss versus number of inputs, 10^5 to 10^9). [sent-549, score-0.621]
61 (Section: Evaluating DMB.) Next, we compared the average loss of the DMB algorithm with the average loss of the serial algorithm and the no-communication algorithm (where each cluster node works independently). [sent-560, score-0.635]
62 We have already seen that larger batch sizes accelerate the initial learning phase, even in a serial setting. [sent-630, score-0.584]
63 As noted in the serial case, larger batch sizes (b = 512) are beneficial at first (m = 10^7), while smaller batch sizes (b = 128) are better in the end (m = 10^9). [sent-637, score-0.733]
64 (Section: Discussion.) We presented an empirical evaluation of the serial mini-batch algorithm and its distributed version, the DMB algorithm, on a realistic web-scale online prediction problem. [sent-639, score-0.657]
65 Instead, our novel use of the variance-based regret bounds can exploit parallel/distributed computing to obtain the asymptotically optimal regret bound. [sent-678, score-0.689]
66 As far as we know, we are the first to propose a general principled framework for distributing many gradient-based update rules, with a concrete regret analysis for the large family of mirror descent and dual averaging update rules. [sent-680, score-0.711]
67 Additionally, our work is the first to explicitly include network latency in our regret analysis, and to theoretically guarantee that a large latency can be overcome by setting parameters appropriately. [sent-681, score-0.813]
68 In this work we studied the problems of distributed stochastic online prediction and distributed stochastic optimization. [sent-685, score-0.568]
69 We presented a family of distributed online algorithms with asymptotically optimal regret and optimality gap guarantees. [sent-686, score-0.609]
70 Our analysis shows that asymptotically, a distributed computing system can perform as well as a hypothetical fast serial computer. [sent-688, score-0.509]
71 Our formal analysis hinges on the fact that the regret bounds of many stochastic online update rules scale with the variance of the stochastic gradients when the loss function is smooth. [sent-693, score-0.989]
72 For any serial update rule φ with a regret bound of ψ(σ2, m) = Cσ√m + o(√m), the DMB algorithm and its variants have the optimal regret bound of Cσ√m + o(√m), provided that the bound ψ(σ2, m) applies equally to the function f and to the averaged function f̄(w, (z1, . . . , zb)) = (1/b) ∑_{s=1}^{b} f(w, zs). [sent-696, score-1.362]
73 (Appendix: Smooth Stochastic Online Prediction in the Serial Setting.) In this appendix, we prove expected regret bounds for stochastic dual averaging and stochastic mirror descent applied to smooth loss functions. [sent-726, score-0.949]
74 Before observing each zi we predict wi ∈ W , and suffer a loss f (wi , zi ). [sent-733, score-0.608]
75 Under the above assumptions, we are concerned with bounding the expected regret E[R(m)], where regret is defined as R(m) = ∑_{i=1}^{m} ( f(wi, zi) − f(w⋆, zi) ). [sent-743, score-0.933]
76 In the stochastic dual averaging method, we predict each wi by wi+1 = arg min_{w∈W} { ⟨∑_{j=1}^{i} gj, w⟩ + (L + βi+1) h(w) } (Equation (13)), where gj denotes the stochastic gradient ∇w f(wj, zj), and (βi)_{i≥1} is a sequence of positive and nondecreasing parameters. [sent-760, score-0.806]
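A hypothetical Python sketch of this dual averaging update in the simplest setting, where h(w) = ½‖w‖2 and W is a Euclidean ball, is given below; in that case the arg min in (13) has the closed form of a scaled negative gradient sum projected onto the ball. This is only one instance of the general rule, and the βi = (σ/D)√i choice follows the discussion in the surrounding text.

```python
import numpy as np

def project_ball(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def dual_averaging(sample_stream, grad, dim, num_rounds, L, sigma, D, radius=1.0):
    """Stochastic dual averaging sketch for h(w) = 0.5*||w||^2 and W a Euclidean ball."""
    g_sum = np.zeros(dim)               # running sum of stochastic gradients g_1 + ... + g_i
    w = np.zeros(dim)                   # w_1, the minimizer of h over W
    iterates = [w.copy()]
    for i in range(1, num_rounds + 1):
        z = next(sample_stream)
        g_sum += grad(w, z)             # g_i evaluated at the current predictor w_i
        beta_next = (sigma / D) * np.sqrt(i + 1)
        w = project_ball(-g_sum / (L + beta_next), radius)   # closed-form arg min of (13)
        iterates.append(w.copy())
    return iterates
```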
77 We are now ready to state a bound on the expected regret of the dual averaging method, in the smooth stochastic case. [sent-764, score-0.704]
78 Theorem 7 The expected regret of the stochastic dual averaging method is bounded as: for all m, E[R(m)] ≤ (F(w1) − F(w⋆)) + (L + βm) h(w⋆) + (σ2/2) ∑_{i=1}^{m−1} 1/βi. The optimal choice of βi is exactly of order √i, scaled by a positive parameter. [sent-765, score-0.652]
79 Proof First, we define the linear functions ℓi(w) = F(wi) + ⟨∇F(wi), w − wi⟩ and (using the notation gi = ∇f(wi, zi)), for all i ≥ 1, ℓ̂i(w) = F(wi) + ⟨gi, w − wi⟩ = ℓi(w) + ⟨qi, w − wi⟩, where qi = gi − ∇F(wi). [sent-776, score-1.447]
80 Therefore, the stochastic dual averaging method specified in Equation (13) is equivalent to wi = arg min_{w∈W} { ∑_{j=1}^{i−1} ℓ̂j(w) + (L + βi) h(w) }. [sent-777, score-0.617]
81 Therefore, ∑_{i=2}^{m} (F(wi) − F(w⋆)) ≤ (L + βm) h(w⋆) + ∑_{i=1}^{m−1} ‖qi‖∗^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-785, score-0.626]
82 Theorem 7 is proved by further noticing that E[f(wi, zi)] = E[F(wi)] and E[f(w⋆, zi)] = F(w⋆) for all i ≥ 1, which are due to the fact that wi is a deterministic function of z0, . . . , zi−1. [sent-797, score-0.544]
83 In the stochastic mirror descent method, we use the same initialization as in the dual averaging method (see Equation (14)) and then we set wi+1 = arg min_{w∈W} { ⟨gi, w⟩ + (L + βi) d(w, wi) }, for i ≥ 1. [sent-807, score-0.738]
84 Then the expected regret of the stochastic mirror descent method is bounded as E[R(m)] ≤ (F(w1) − F(w⋆)) + (L + βm) D2 + (σ2/2) ∑_{i=1}^{m−1} 1/βi, where D2 bounds the Bregman divergence d(u, v) over all u, v ∈ W. [sent-811, score-0.583]
85 Similar to the dual averaging case, using the sequence of parameters βi = (σ/D)√i gives the expected regret bound E[R(m)] ≤ (F(w1) − F(w⋆)) + LD2 + 2σD√m. [sent-812, score-0.568]
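A quick check of why this choice of βi yields that bound (a standard calculation stated here for convenience, using βm = (σ/D)√m and ∑_{i=1}^{m−1} i^{−1/2} ≤ 2√m):

```latex
(L+\beta_m)D^2 + \frac{\sigma^2}{2}\sum_{i=1}^{m-1}\frac{1}{\beta_i}
 \;=\; LD^2 + \sigma D\sqrt{m} + \frac{\sigma D}{2}\sum_{i=1}^{m-1} i^{-1/2}
 \;\le\; LD^2 + \sigma D\sqrt{m} + \sigma D\sqrt{m}
 \;=\; LD^2 + 2\sigma D\sqrt{m}.
```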
86 Now using Lemma 11 with ϕ(w) = ℓ̂i(w) yields ℓ̂i(wi+1) + (L + βi) d(wi+1, wi) ≤ ℓ̂i(w⋆) + (L + βi) d(w⋆, wi) − (L + βi) d(w⋆, wi+1). [sent-830, score-0.6]
87 Summing the above inequality from i = 1 to i = m − 1, we have ∑_{i=2}^{m} F(wi) ≤ (m − 1) F(w⋆) + (L + β1) d(w⋆, w1) − (L + βm) d(w⋆, wm) + (βm − β1) D2 + ∑_{i=1}^{m−1} ‖qi‖∗^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-832, score-0.657]
88 Notice that d(w⋆, wi) ≥ 0 and d(w⋆, w1) ≤ D2, so we have ∑_{i=2}^{m} F(wi) ≤ (m − 1) F(w⋆) + (L + βm) D2 + ∑_{i=1}^{m−1} ‖qi‖∗^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-833, score-0.926]
89 First, we need to justify that the expected regret bounds for the online prediction rules discussed in Appendix A have high-probability versions. [sent-847, score-0.558]
90 For simplicity, we will focus on a high-probability version of the regret bound for dual averaging (Theorem 7), but exactly the same technique will work for stochastic mirror descent (Theorem 9 and Theorem 10). [sent-848, score-0.763]
91 Theorem 12 For any m and any δ ∈ (0, 1], the regret of the stochastic dual averaging method is bounded with probability at least 1 − δ over the sampling of z1, . . . , zm. [sent-856, score-0.623]
92 Proof The proof of the theorem is identical to the one of Theorem 7, up to Equation (16): ∑_{i=2}^{m} (F(wi) − F(w⋆)) ≤ (L + βm) h(w⋆) + ∑_{i=1}^{m−1} ‖qi‖^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-860, score-0.663]
93 We will first use this result for the sequence xi = (‖qi‖^2 − σi^2)/(2βi) + ⟨qi, w⋆ − wi⟩. [sent-882, score-0.626]
94 Moreover, |⟨qi, w⋆ − wi⟩| ≤ D ‖qi‖ ≤ 2DG, and ‖qi‖^2 ≤ 4G2. [sent-893, score-0.789]
95 Then Varzi(xi) ≤ 2 Varzi( (‖qi‖^2 − σi^2)/(2βi) ) + 2 Varzi( ⟨qi, w⋆ − wi⟩ ) ≤ (1/2) Ezi[‖qi‖^4]/βi^2 + 2 Ezi[⟨qi, w⋆ − wi⟩^2] ≤ 2G2 Ezi[‖qi‖^2]/βi^2 + 2 ‖w⋆ − wi‖^2 Ezi[‖qi‖^2] ≤ 2G2 σi^2/βi^2 + 2D2 σ2 ≤ 2G2 σ2/βi^2 + 2D2 σ2. [sent-899, score-1.878]
96 Combining these observations with Equation (19), we get that with probability at least 1 − δ, ∑_{i=1}^{m−1} [ (‖qi‖^2 − σ2)/(2βi) + ⟨qi, w⋆ − wi⟩ ] ≤ (2DG + 4G2/β1) log(1/δ) + √( 36 (G2 σ2 ∑_{i=1}^{m} 1/βi^2 + D2 σ2 m) log(1/δ) ). [sent-900, score-0.626]
97 Also, |( f(wi, zi) − f(w⋆, zi)) − (F(wi) − F(w⋆))| ≤ 4B, and Varzi( f(wi, zi) − f(w⋆, zi) − (F(wi) − F(w⋆)) ) = Varzi( f(wi, zi) − f(w⋆, zi) ) ≤ σ̂2. [sent-912, score-0.732]
98 Now the regret bound also depends on the function variance σ̂2. Theorem 13 Let f be an L-smooth convex loss function. [sent-920, score-0.524]
99 Assume that the stochastic gradient ∇w f(w, zi) is bounded by a constant and has σ2-bounded variance for all i and all w, and that f(w, zi) is bounded by a constant and has σ̂2-bounded variance for all i and for all w. [sent-921, score-0.529]
100 If the update rule φ has a serial high-probability regret bound ψ(σ2, σ̂2, δ, m). [sent-922, score-0.928]
wordName wordTfidf (topN-words)
[('serial', 0.435), ('dmb', 0.352), ('regret', 0.33), ('wi', 0.3), ('latency', 0.205), ('qi', 0.163), ('hamir', 0.156), ('ilad', 0.156), ('batch', 0.149), ('istributed', 0.147), ('stochastic', 0.136), ('iao', 0.133), ('ekel', 0.133), ('zi', 0.122), ('rediction', 0.114), ('inputs', 0.112), ('ptimal', 0.105), ('online', 0.098), ('averaging', 0.092), ('xiao', 0.091), ('zs', 0.087), ('nline', 0.083), ('node', 0.078), ('gradients', 0.078), ('distributed', 0.074), ('nodes', 0.074), ('mdmb', 0.074), ('network', 0.073), ('lan', 0.071), ('update', 0.068), ('dual', 0.065), ('mirror', 0.059), ('template', 0.056), ('gradient', 0.053), ('bound', 0.052), ('ez', 0.051), ('convex', 0.05), ('prediction', 0.05), ('operation', 0.05), ('monetizable', 0.049), ('msrl', 0.049), ('parallel', 0.049), ('variance', 0.048), ('divides', 0.047), ('loss', 0.044), ('queries', 0.043), ('predictor', 0.043), ('rule', 0.043), ('ezi', 0.042), ('communication', 0.041), ('latencies', 0.041), ('accelerated', 0.041), ('asynchronous', 0.041), ('gap', 0.039), ('minw', 0.039), ('batches', 0.038), ('theorem', 0.037), ('nemirovski', 0.036), ('dbm', 0.035), ('lh', 0.035), ('zm', 0.034), ('cluster', 0.034), ('asymptotically', 0.034), ('optimality', 0.034), ('gi', 0.033), ('comm', 0.033), ('nedi', 0.033), ('varzi', 0.033), ('waste', 0.033), ('juditsky', 0.032), ('wm', 0.031), ('query', 0.03), ('dekel', 0.03), ('descent', 0.029), ('issued', 0.029), ('expected', 0.029), ('bounds', 0.029), ('equation', 0.027), ('nesterov', 0.025), ('bregman', 0.025), ('ghadimi', 0.025), ('synchronized', 0.025), ('arg', 0.024), ('lim', 0.024), ('diameter', 0.023), ('engine', 0.023), ('dominant', 0.023), ('operations', 0.022), ('rules', 0.022), ('duchi', 0.022), ('smoothness', 0.021), ('ethernet', 0.021), ('incoming', 0.02), ('zb', 0.02), ('tseng', 0.02), ('stream', 0.02), ('suffer', 0.02), ('ideal', 0.019), ('receive', 0.019), ('mimics', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
2 0.14594792 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
3 0.13269724 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.1105826 84 jmlr-2012-Online Submodular Minimization
Author: Elad Hazan, Satyen Kale
Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization
5 0.09916544 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
6 0.08461637 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
7 0.074457951 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
8 0.072077416 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
9 0.057316627 13 jmlr-2012-Active Learning via Perfect Selective Classification
10 0.054467443 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
11 0.053751748 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
12 0.052508023 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
13 0.046656035 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
14 0.044406738 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
15 0.043699265 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
16 0.04329244 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.040640805 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
18 0.039126609 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
19 0.038706567 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
20 0.037261341 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
topicId topicWeight
[(0, -0.222), (1, -0.201), (2, -0.079), (3, -0.084), (4, 0.001), (5, 0.055), (6, 0.072), (7, 0.053), (8, -0.03), (9, -0.077), (10, 0.024), (11, 0.0), (12, -0.015), (13, -0.004), (14, -0.049), (15, -0.002), (16, 0.103), (17, -0.194), (18, -0.161), (19, 0.024), (20, 0.063), (21, -0.013), (22, 0.112), (23, 0.125), (24, 0.101), (25, -0.074), (26, -0.113), (27, 0.071), (28, -0.172), (29, -0.055), (30, 0.004), (31, 0.164), (32, -0.03), (33, 0.04), (34, 0.022), (35, 0.137), (36, -0.007), (37, 0.006), (38, 0.028), (39, 0.005), (40, -0.062), (41, 0.028), (42, -0.115), (43, -0.122), (44, -0.133), (45, -0.094), (46, -0.017), (47, 0.058), (48, 0.035), (49, -0.123)]
simIndex simValue paperId paperTitle
same-paper 1 0.95041084 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
2 0.49527061 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
3 0.49314752 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
4 0.46606815 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.41030648 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
6 0.40772405 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
7 0.38863859 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
9 0.3829647 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
10 0.36584619 84 jmlr-2012-Online Submodular Minimization
11 0.31716752 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
12 0.31336483 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
13 0.28395286 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
14 0.27325276 13 jmlr-2012-Active Learning via Perfect Selective Classification
15 0.26739821 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
16 0.26711446 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
17 0.26543558 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
18 0.26469323 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
19 0.24812652 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
20 0.22982682 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
topicId topicWeight
[(7, 0.062), (12, 0.254), (21, 0.04), (26, 0.069), (27, 0.013), (29, 0.032), (35, 0.019), (49, 0.016), (56, 0.029), (57, 0.014), (64, 0.016), (69, 0.015), (75, 0.089), (77, 0.016), (79, 0.019), (92, 0.103), (96, 0.124)]
simIndex simValue paperId paperTitle
same-paper 1 0.76288736 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
2 0.61640728 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
3 0.61580735 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.61300701 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
5 0.60909748 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
6 0.59993047 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
7 0.59966314 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
8 0.59681046 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
9 0.59204888 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
10 0.58921182 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
11 0.58764374 103 jmlr-2012-Sampling Methods for the Nyström Method
12 0.58625621 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
13 0.5842495 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.58135152 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
15 0.58119845 13 jmlr-2012-Active Learning via Perfect Selective Classification
16 0.5784713 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
17 0.57741857 73 jmlr-2012-Multi-task Regression using Minimal Penalties
18 0.57636601 84 jmlr-2012-Online Submodular Minimization
19 0.57635933 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
20 0.57630467 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection