jmlr jmlr2012 jmlr2012-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
Reference: text
sentIndex sentText sentNum sentScore
1 In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. [sent-8, score-0.731]
2 We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. [sent-9, score-0.646]
3 Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization. [sent-13, score-0.688]
4 After making the prediction wi , we observe zi and suffer the loss f (wi , zi ), where f is a predefined loss function. [sent-26, score-0.702]
5 Since regret relies on the stochastic inputs zi , it is a random variable. [sent-33, score-0.7]
6 It is well-known that the optimal regret bound that can be achieved by a gradient-based serial algorithm on an arbitrary convex loss is O(√m). [sent-57, score-0.911]
7 At the other extreme, a trivial solution to our problem is to have each node operate in isolation of the other k−1 nodes, running an independent copy of a serial algorithm, without any communication over the network. [sent-61, score-0.554]
8 More specifically, assuming that each node processes m/k inputs, the expected regret per node is O(√(m/k)). [sent-64, score-0.515]
9 In this paper, we present the distributed mini-batch (DMB) algorithm, a method of converting any serial gradient-based online prediction algorithm into a parallel or distributed algorithm. [sent-67, score-0.78]
10 This method has two important properties: • It can use any gradient-based update rule for serial online prediction as a black box, and convert it into a parallel or distributed online prediction algorithm. [sent-68, score-0.965]
11 Moreover, the coefficient of the dominant term √m is the same as in the serial bound, and independent of k and of the network topology. [sent-70, score-0.531]
12 The idea of using mini-batches in stochastic and online learning is not new, and has been previously explored in both the serial and parallel settings (see, e. [sent-71, score-0.718]
13 Our results build on the fact that the optimal regret bound for serial stochastic gradient-based prediction algorithms can be refined if the loss function is smooth. [sent-77, score-1.047]
14 In particular, it can be shown that the hidden coefficient in the O(√m) notation is proportional to the standard deviation of the stochastic gradients evaluated at each predictor wi (Juditsky et al.). [sent-78, score-0.557]
15 However, the non-negligible communication latencies prevent a straightforward parallel implementation from obtaining the optimal serial regret bound. [sent-81, score-0.896]
16 In Section 2, we present a template for stochastic gradient-based serial prediction algorithms, and state refined variance-based regret bounds for smooth loss functions. [sent-84, score-1.08]
17 In Section 3, we analyze the effect of using mini-batches in the serial setting, and show that it does not significantly affect the regret bounds. [sent-85, score-0.765]
18 In Section 4, we present the DMB algorithm, and show that it achieves an asymptotically optimal serial regret bound for smooth loss functions. [sent-86, score-0.895]
19 For example, if the network communication operates over a minimum-depth spanning tree and the diameter of the network scales as log(k), then we can show that a straightforward implementation of the idea of parallel variance reduction leads to an O(√(m log(k))) regret bound. [sent-90, score-0.637]
20 Algorithm 1: Template for a serial first-order stochastic online prediction algorithm. [sent-92, score-0.719]
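The body of Algorithm 1 is not reproduced in this extraction, so the following is only an illustrative Python sketch of what such a serial first-order loop looks like, assuming a Euclidean ball for the feasible set W, a 1/√i step size, and caller-supplied gradient and input-stream callables; none of these names come from the paper.

```python
import numpy as np

def project_ball(w, radius=1.0):
    """Euclidean projection onto a ball of the given radius (a stand-in for the set W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def serial_online_prediction(sample_stream, grad, dim, num_rounds, step=0.1):
    """Sketch of a serial first-order stochastic online prediction loop.

    sample_stream: iterator yielding inputs z_i
    grad:          callable (w, z) -> stochastic gradient of f(w, z) with respect to w
    """
    w = np.zeros(dim)                 # initial predictor w_1
    predictions = []
    for i in range(1, num_rounds + 1):
        predictions.append(w.copy())              # predict w_i before seeing z_i
        z = next(sample_stream)                   # observe z_i, suffer the loss f(w_i, z_i)
        g = grad(w, z)                            # stochastic gradient at (w_i, z_i)
        w = project_ball(w - (step / np.sqrt(i)) * g)   # projected-gradient style update
    return predictions
```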
21 (Section: Variance Bounds for Serial Algorithms.) Before discussing distributed algorithms, we must fully understand the serial algorithms on which they are based. [sent-106, score-0.509]
22 For stochastic online prediction problems with convex loss functions, both of these update rules have an expected regret bound of O(√m). [sent-125, score-0.879]
23 In particular, for the projected stochastic gradient method defined in Equation (2), we have the following result: Theorem 1 Let f (w, z) be an L-smooth convex loss function in w for each z ∈ Z and assume that the stochastic gradient ∇w f (w, z) has σ2 -bounded variance for all w ∈ W . [sent-140, score-0.52]
24 Although we focus on expected regret bounds here, our results can equally be stated as high-probability bounds on the actual regret (see Appendix B for details). [sent-149, score-0.747]
25 For each of these update rules, we get an expected regret bound that closely resembles the bound in Equation (6). [sent-157, score-0.531]
26 Therefore, each node suffers an expected regret of at most ψ(σ2, ⌈m/k⌉) on its portion of the input stream, and the total regret bound is obtained by simply summing over the k nodes, that is, E[R(m)] ≤ k ψ(σ2, ⌈m/k⌉). [sent-185, score-0.819]
27 Comparing this bound to 2D2 L + 2Dσ√m in the ideal serial solution, we see that it is approximately √k times worse in its leading term. [sent-187, score-0.506]
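For concreteness, the arithmetic behind the "√k times worse" comparison can be written out as follows (a routine calculation, ignoring the ceiling in ⌈m/k⌉; it is not quoted from the paper):

```latex
k\,\psi\!\left(\sigma^2,\tfrac{m}{k}\right)
  = k\left(2D^2L + 2D\sigma\sqrt{\tfrac{m}{k}}\right)
  = 2kD^2L + 2D\sigma\sqrt{k}\,\sqrt{m}
\qquad\text{versus}\qquad
2D^2L + 2D\sigma\sqrt{m},
```

so the coefficient of the dominant √m term grows from 2Dσ to 2Dσ√k.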
28 (Section: Serial Online Prediction using Mini-Batches.) The expected regret bounds presented in the previous section depend on the variance of the stochastic gradients. [sent-191, score-0.572]
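The variance reduction that mini-batching buys is the standard identity for an average of b stochastic gradients, stated here for convenience under the assumption that the inputs zs are i.i.d. and the predictor w is held fixed over the batch:

```latex
\bar{g} \;=\; \frac{1}{b}\sum_{s=1}^{b}\nabla_w f(w,z_s),
\qquad
\mathbb{E}\,\bigl\|\bar{g}-\nabla F(w)\bigr\|^2
 \;=\; \frac{1}{b^{2}}\sum_{s=1}^{b}\mathbb{E}\,\bigl\|\nabla_w f(w,z_s)-\nabla F(w)\bigr\|^2
 \;\le\; \frac{\sigma^2}{b}.
```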
29 Before we present the distributed mini-batch algorithm in the next section, we first analyze a serial mini-batch algorithm. [sent-193, score-0.509]
30 We define the serial mini-batch algorithm as follows: Our prediction remains constant for the duration of each batch, and is updated only when a batch ends. [sent-197, score-0.634]
31 Once a batch ends, the serial mini-batch algorithm feeds the average gradient ḡj to the update rule φ as the jth gradient and obtains the new prediction for the next batch and the new state. [sent-200, score-0.947]
32 The appeal of the serial mini-batch setting is that the update rule is used less frequently, which may have computational benefits. [sent-202, score-0.546]
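A hypothetical Python sketch of this serial mini-batch wrapper follows; the update-rule interface (an object with an update(avg_grad) method that returns the next predictor) is an assumption made purely for illustration, not an API from the paper.

```python
def serial_mini_batch(sample_stream, grad, update_rule, w_init, batch_size, num_batches):
    """Serial mini-batching sketch: the prediction stays fixed within each batch,
    and the averaged gradient is fed to the update rule once per batch."""
    w = w_init
    for j in range(num_batches):
        g_sum = 0.0 * w                       # zero accumulator with the same shape as w
        for _ in range(batch_size):
            z = next(sample_stream)           # predict w on z, then observe z
            g_sum = g_sum + grad(w, z)        # gradient at the fixed predictor w_j
        g_bar = g_sum / batch_size            # averaged gradient for this batch
        w = update_rule.update(g_bar)         # one call to the update rule per batch
    return w
```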
33 If the update rule φ has the serial regret bound ψ(σ2, m), then the expected regret of Algorithm 2 over m inputs is at most b ψ(σ2/b, m/b). [sent-204, score-1.399]
34 Proof Assume without loss of generality that b divides m, and that the serial mini-batch algorithm processes exactly m/b complete batches. [sent-206, score-0.526]
35 Therefore the serial mini-batch algorithm is equivalent to using the update rule φ with the loss function f̄. [sent-226, score-0.59]
36 If the update rule φ has a regret bound ψ(σ2, m) for the loss function f over m inputs, then its regret for f̄ over m/b batches is bounded as E[ ∑_{j=1}^{m/b} ( f̄(wj, z̄j) − f̄(w⋆, z̄j) ) ] ≤ ψ(σ2/b, m/b). [sent-237, score-0.905]
37 The bound in Theorem 3 is asymptotically equivalent to the 2D2 L + 2Dσ√m regret bound for the basic serial algorithms presented in Section 2. [sent-241, score-0.903]
38 In other words, performing the mini-batch update in the serial setting does not significantly hurt the performance of the update rule. [sent-242, score-0.571]
39 On the other hand, it is also not surprising that using mini-batches in the serial setting does not improve the regret bound. [sent-243, score-0.765]
40 Nevertheless, our experiments demonstrate that in real-world scenarios, mini-batching can in fact have a very substantial positive effect on the transient performance of the online prediction algorithm, even in the serial setting (see Section 6 for details). [sent-246, score-0.583]
41 This technique resembles the serial mini-batch technique described earlier, and is therefore called the distributed mini-batch algorithm, or DMB for short. [sent-271, score-0.509]
42 Once all of the nodes have the overall average ḡj, each node updates the predictor using the same deterministic serial algorithm. [sent-306, score-0.63]
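A rough per-node sketch of this scheme is given below, assuming some vector all-reduce primitive (called allreduce_sum here, a purely hypothetical helper) that sums a vector across the k nodes; because every node ends up with the same global average ḡj and applies the same deterministic update, the nodes stay synchronized. The additional µ inputs that arrive while the sum is in flight are not modeled in this sketch.

```python
def dmb_node(local_stream, grad, update_rule, w_init, batch_size, num_batches,
             num_nodes, allreduce_sum):
    """One node's loop in a distributed mini-batch style scheme (illustrative only)."""
    w = w_init
    per_node = batch_size // num_nodes            # each node accumulates b/k gradients
    for j in range(num_batches):
        g_local = 0.0 * w
        for _ in range(per_node):
            z = next(local_stream)                # inputs that arrive at this node
            g_local = g_local + grad(w, z)        # gradients at the shared predictor w_j
        g_bar = allreduce_sum(g_local) / batch_size   # network sum -> global average
        w = update_rule.update(g_bar)             # identical deterministic update everywhere
    return w
```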
43 If the update rule φ has the serial regret bound ψ(σ2, m), then the expected regret of Algorithm 3 over m samples is at most (b + µ) ψ(σ2/b, m/(b + µ)). [sent-316, score-1.287]
44 Specifically, if ψ(σ2, m) = 2D2 L + 2Dσ√m, then setting the batch size b = m^{1/3} gives the expected regret bound 2Dσ√m + 2Dm^{1/3}(LD + σ√µ) + 2Dσm^{1/6} + 2Dσµm^{−1/6} + 2µD2 L. [sent-317, score-0.56]
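To see roughly where b = m^{1/3} comes from, here is a back-of-the-envelope simplification of the (b + µ) ψ(σ2/b, m/(b + µ)) bound; it ignores the rounding of the number of batches, so the lower-order constants differ slightly from Equation (8):

```latex
(b+\mu)\,\psi\!\left(\tfrac{\sigma^2}{b},\tfrac{m}{b+\mu}\right)
 \;=\; 2D^2L\,(b+\mu) \;+\; 2D\sigma\sqrt{\tfrac{m(b+\mu)}{b}}
 \;\le\; 2D\sigma\sqrt{m} \;+\; 2D\sigma\sqrt{\tfrac{\mu m}{b}} \;+\; 2D^2L\,b \;+\; 2D^2L\,\mu .
```

The term 2D2Lb grows with b while 2Dσ√(µm/b) shrinks with b; minimizing their sum gives b proportional to m^{1/3}, and any b with b → ∞ and b = o(√m) keeps every extra term at o(√m).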
45 To appreciate the power of this result, we compare the specific bound in Equation (8) with the ideal serial solution and the naïve no-communication solution discussed in the introduction. [sent-319, score-0.506]
46 It is clear that our bound is asymptotically equivalent to the ideal serial bound ψ(σ2, m); even the constants in the dominant terms are identical. [sent-320, score-0.615]
47 (Section: Improving Performance on Short Input Streams.) Theorem 4 presents an optimal way of choosing the batch size b, which results in an asymptotically optimal regret bound. [sent-352, score-0.513]
48 The batch size should indeed be set such that the generated traffic does not exceed the network bandwidth limit, but the latency of each sum operation should not be affected by the fact that multiple sum operations take place at once. [sent-367, score-0.499]
49 Each instance of the algorithm handles m/c inputs and suffers a regret of at most b ψ(σ2/b, 1 + m/(bc)), and, using Jensen's inequality, the overall regret using the average prediction is upper bounded by bc ψ(σ2/b, 1 + m/(bc)). [sent-372, score-0.822]
50 Therefore, a bound on the expected optimality gap can be readily obtained from a bound on the expected regret of the same algorithm. [sent-403, score-0.565]
51 In particular, if f is an L-smooth convex loss function and ∇w f(w, z) has σ2-bounded variance, and our algorithm has a regret bound of ψ(σ2, m), then it also has an expected optimality gap of at most E[G(m)] ≤ ψ(σ2, m)/m. [sent-404, score-0.578]
52 For the specific regret bound ψ(σ2, m) = 2D2 L + 2Dσ√m, which holds for the serial algorithms presented in Section 2, we have E[G(m)] ≤ ψ̄(σ2, m) = ψ(σ2, m)/m = 2D2 L/m + 2Dσ/√m. [sent-405, score-0.817]
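The step from a regret bound to an optimality-gap bound is the usual online-to-batch argument; a one-line version, assuming the algorithm returns the average w̄ = (1/m) ∑ wi and F is convex, reads:

```latex
\mathbb{E}\bigl[F(\bar{w})-F(w^\star)\bigr]
 \;\le\; \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}\bigl[F(w_i)-F(w^\star)\bigr]
 \;=\; \frac{\mathbb{E}[R(m)]}{m}
 \;\le\; \frac{\psi(\sigma^2,m)}{m},
```

where the first inequality is Jensen's inequality and the middle equality uses the fact that wi does not depend on zi.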
53 If the update rule φ used in a serial setting has an expected optimality gap bounded by ψ̄(σ2, m), then the expected optimality gap of Algorithm 4 after processing m samples is at most ψ̄(σ2/b, m/b). [sent-422, score-0.75]
54 Since we can run the workload associated with a single batch in parallel, this theorem shows that the mini-batch technique is capable of turning many serial optimization algorithms into parallel ones. [sent-433, score-0.67]
55 Suppose the update rule φ used in the serial setting has an expected optimality gap bounded by ψ̄(σ2, m) = 2D2 L/m + 2Dσ/√m. [sent-458, score-0.648]
56 Proof By solving the equation 2D2 L/m + 2Dσ/√m = ε, we see that the following number of samples is sufficient for the serial algorithm to reach ε-optimality: m_srl(ε) = (D2 σ2/ε2)(1 + √(1 + 2Lε/σ2))^2. [sent-462, score-0.511]
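For completeness, the algebra behind that expression is a quadratic-formula computation (reconstructed here, not quoted from the paper): writing u = 1/√m,

```latex
2D^2L\,u^2 + 2D\sigma\,u - \varepsilon = 0
\;\Longrightarrow\;
u = \frac{-\sigma+\sqrt{\sigma^2+2L\varepsilon}}{2DL},
\qquad
\sqrt{m} = \frac{1}{u}
 = \frac{D\bigl(\sigma+\sqrt{\sigma^2+2L\varepsilon}\bigr)}{\varepsilon},
```

and squaring recovers m_srl(ε) = (D2 σ2/ε2)(1 + √(1 + 2Lε/σ2))^2.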
57 Figure 2: The effect of the batch size on the average loss when using serial mini-batching (average loss versus number of inputs, 10^5 to 10^9). [sent-518, score-0.696]
58 (Section: Serial Mini-Batching.) As a warm-up, we investigated the effects of modifying the mini-batch size b in a standard serial Euclidean dual averaging algorithm. [sent-533, score-0.592]
59 (Figure 3 legend: no-comm batch, no-comm serial, DMB; vertical axis: average loss.) [sent-545, score-0.672]
60 Figure 3: Comparing DMB with the serial algorithm and the no-communication distributed algorithm (average loss versus number of inputs, 10^5 to 10^9). [sent-549, score-0.621]
61 (Section: Evaluating DMB.) Next, we compared the average loss of the DMB algorithm with the average loss of the serial algorithm and the no-communication algorithm (where each cluster node works independently). [sent-560, score-0.635]
62 We have already seen that larger batch sizes accelerate the initial learning phase, even in a serial setting. [sent-630, score-0.584]
63 As noted in the serial case, larger batch sizes (b = 512) are beneficial at first (m = 10^7), while smaller batch sizes (b = 128) are better in the end (m = 10^9). [sent-637, score-0.733]
64 (Section: Discussion.) We presented an empirical evaluation of the serial mini-batch algorithm and its distributed version, the DMB algorithm, on a realistic web-scale online prediction problem. [sent-639, score-0.657]
65 Instead, our novel use of the variance-based regret bounds can exploit parallel/distributed computing to obtain the asymptotically optimal regret bound. [sent-678, score-0.689]
66 As far as we know, we are the first to propose a general principled framework for distributing many gradient-based update rules, with a concrete regret analysis for the large family of mirror descent and dual averaging update rules. [sent-680, score-0.711]
67 Additionally, our work is the first to explicitly include network latency in our regret analysis, and to theoretically guarantee that a large latency can be overcome by setting parameters appropriately. [sent-681, score-0.813]
68 In this work we studied the problems of distributed stochastic online prediction and distributed stochastic optimization. [sent-685, score-0.568]
69 We presented a family of distributed online algorithms with asymptotically optimal regret and optimality gap guarantees. [sent-686, score-0.609]
70 Our analysis shows that asymptotically, a distributed computing system can perform as well as a hypothetical fast serial computer. [sent-688, score-0.509]
71 Our formal analysis hinges on the fact that the regret bounds of many stochastic online update rules scale with the variance of the stochastic gradients when the loss function is smooth. [sent-693, score-0.989]
72 For any serial update rule φ with a regret bound of ψ(σ2, m) = Cσ√m + o(√m), the DMB algorithm and its variants have the optimal regret bound of Cσ√m + o(√m), provided that the bound ψ(σ2, m) applies equally to the function f and to the averaged function f̄(w, (z1, . . . , zb)) = (1/b) ∑_{s=1}^{b} f(w, zs). [sent-696, score-1.362]
73 (Appendix: Smooth Stochastic Online Prediction in the Serial Setting.) In this appendix, we prove expected regret bounds for stochastic dual averaging and stochastic mirror descent applied to smooth loss functions. [sent-726, score-0.949]
74 Before observing each zi we predict wi ∈ W , and suffer a loss f (wi , zi ). [sent-733, score-0.608]
75 Under the above assumptions, we are concerned with bounding the expected regret E[R(m)], where regret is defined as R(m) = ∑_{i=1}^{m} ( f(wi, zi) − f(w⋆, zi) ). [sent-743, score-0.933]
76 In the stochastic dual averaging method, we predict each wi by wi+1 = arg min_{w∈W} { ⟨∑_{j=1}^{i} gj, w⟩ + (L + βi+1) h(w) } (Equation (13)), where gj denotes the stochastic gradient ∇w f(wj, zj), and (βi)_{i≥1} is a sequence of positive and nondecreasing parameters. [sent-760, score-0.806]
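A hypothetical Python sketch of this dual averaging update in the simplest setting, where h(w) = ½‖w‖2 and W is a Euclidean ball, is given below; in that case the arg min in (13) has the closed form of a scaled negative gradient sum projected onto the ball. This is only one instance of the general rule, and the βi = (σ/D)√i choice follows the discussion in the surrounding text.

```python
import numpy as np

def project_ball(w, radius):
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def dual_averaging(sample_stream, grad, dim, num_rounds, L, sigma, D, radius=1.0):
    """Stochastic dual averaging sketch for h(w) = 0.5*||w||^2 and W a Euclidean ball."""
    g_sum = np.zeros(dim)               # running sum of stochastic gradients g_1 + ... + g_i
    w = np.zeros(dim)                   # w_1, the minimizer of h over W
    iterates = [w.copy()]
    for i in range(1, num_rounds + 1):
        z = next(sample_stream)
        g_sum += grad(w, z)             # g_i evaluated at the current predictor w_i
        beta_next = (sigma / D) * np.sqrt(i + 1)
        w = project_ball(-g_sum / (L + beta_next), radius)   # closed-form arg min of (13)
        iterates.append(w.copy())
    return iterates
```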
77 We are now ready to state a bound on the expected regret of the dual averaging method, in the smooth stochastic case. [sent-764, score-0.704]
78 Theorem 7 The expected regret of the stochastic dual averaging method is bounded as: for all m, E[R(m)] ≤ (F(w1) − F(w⋆)) + (L + βm) h(w⋆) + (σ2/2) ∑_{i=1}^{m−1} 1/βi. The optimal choice of βi is exactly of order √i, scaled by a positive parameter. [sent-765, score-0.652]
79 Proof First, we define the linear functions ℓi(w) = F(wi) + ⟨∇F(wi), w − wi⟩ and (using the notation gi = ∇f(wi, zi)), for all i ≥ 1, ℓ̂i(w) = F(wi) + ⟨gi, w − wi⟩ = ℓi(w) + ⟨qi, w − wi⟩, where qi = gi − ∇F(wi). [sent-776, score-1.447]
80 Therefore, the stochastic dual averaging method specified in Equation (13) is equivalent to wi = arg min_{w∈W} { ∑_{j=1}^{i−1} ℓ̂j(w) + (L + βi) h(w) }. [sent-777, score-0.617]
81 Therefore, ∑_{i=2}^{m} (F(wi) − F(w⋆)) ≤ (L + βm) h(w⋆) + ∑_{i=1}^{m−1} ‖qi‖∗^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-785, score-0.626]
82 Theorem 7 is proved by further noticing that E[f(wi, zi)] = E[F(wi)] and E[f(w⋆, zi)] = F(w⋆) for all i ≥ 1, which are due to the fact that wi is a deterministic function of z0, . . . , zi−1. [sent-797, score-0.544]
83 In the stochastic mirror descent method, we use the same initialization as in the dual averaging method (see Equation (14)) and then we set wi+1 = arg min_{w∈W} { ⟨gi, w⟩ + (L + βi) d(w, wi) }, for i ≥ 1. [sent-807, score-0.738]
84 Then the expected regret of the stochastic mirror descent method is bounded as E[R(m)] ≤ (F(w1) − F(w⋆)) + (L + βm) D2 + (σ2/2) ∑_{i=1}^{m−1} 1/βi, where D2 bounds the Bregman divergence d(u, v) over all u, v ∈ W. [sent-811, score-0.583]
85 Similar to the dual averaging case, using the sequence of parameters βi = (σ/D)√i gives the expected regret bound E[R(m)] ≤ (F(w1) − F(w⋆)) + LD2 + 2σD√m. [sent-812, score-0.568]
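A quick check of why this choice of βi yields that bound (a standard calculation stated here for convenience, using βm = (σ/D)√m and ∑_{i=1}^{m−1} i^{−1/2} ≤ 2√m):

```latex
(L+\beta_m)D^2 + \frac{\sigma^2}{2}\sum_{i=1}^{m-1}\frac{1}{\beta_i}
 \;=\; LD^2 + \sigma D\sqrt{m} + \frac{\sigma D}{2}\sum_{i=1}^{m-1} i^{-1/2}
 \;\le\; LD^2 + \sigma D\sqrt{m} + \sigma D\sqrt{m}
 \;=\; LD^2 + 2\sigma D\sqrt{m}.
```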
86 Now using Lemma 11 with ϕ(w) = ℓ̂i(w) yields ℓ̂i(wi+1) + (L + βi) d(wi+1, wi) ≤ ℓ̂i(w⋆) + (L + βi) d(w⋆, wi) − (L + βi) d(w⋆, wi+1). [sent-830, score-0.6]
87 Summing the above inequality from i = 1 to i = m − 1, we have ∑_{i=2}^{m} F(wi) ≤ (m − 1) F(w⋆) + (L + β1) d(w⋆, w1) − (L + βm) d(w⋆, wm) + (βm − β1) D2 + ∑_{i=1}^{m−1} ‖qi‖∗^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-832, score-0.657]
88 Notice that d(w⋆, wi) ≥ 0 and d(w⋆, w1) ≤ D2, so we have ∑_{i=2}^{m} F(wi) ≤ (m − 1) F(w⋆) + (L + βm) D2 + ∑_{i=1}^{m−1} ‖qi‖∗^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-833, score-0.926]
89 First, we need to justify that the expected regret bounds for the online prediction rules discussed in Appendix A have high-probability versions. [sent-847, score-0.558]
90 For simplicity, we will focus on a high-probability version of the regret bound for dual averaging (Theorem 7), but exactly the same technique will work for stochastic mirror descent (Theorem 9 and Theorem 10). [sent-848, score-0.763]
91 Theorem 12 For any m and any δ ∈ (0, 1], the regret of the stochastic dual averaging method is bounded with probability at least 1 − δ over the sampling of z1, . . . , zm. [sent-856, score-0.623]
92 Proof The proof of the theorem is identical to the one of Theorem 7, up to Equation (16): ∑_{i=2}^{m} (F(wi) − F(w⋆)) ≤ (L + βm) h(w⋆) + ∑_{i=1}^{m−1} ‖qi‖^2/(2βi) + ∑_{i=1}^{m−1} ⟨qi, w⋆ − wi⟩. [sent-860, score-0.663]
93 We will first use this result for the sequence xi = (‖qi‖^2 − σi^2)/(2βi) + ⟨qi, w⋆ − wi⟩. [sent-882, score-0.626]
94 Moreover, |⟨qi, w⋆ − wi⟩| ≤ D ‖qi‖ ≤ 2DG, and ‖qi‖^2 ≤ 4G2. [sent-893, score-0.789]
95 Then Varzi(xi) ≤ 2 Varzi( (‖qi‖^2 − σi^2)/(2βi) ) + 2 Varzi( ⟨qi, w⋆ − wi⟩ ) ≤ (1/2) Ezi[‖qi‖^4]/βi^2 + 2 Ezi[⟨qi, w⋆ − wi⟩^2] ≤ 2G2 Ezi[‖qi‖^2]/βi^2 + 2 ‖w⋆ − wi‖^2 Ezi[‖qi‖^2] ≤ 2G2 σi^2/βi^2 + 2D2 σ2 ≤ 2G2 σ2/βi^2 + 2D2 σ2. [sent-899, score-1.878]
96 Combining these observations with Equation (19), we get that with probability at least 1 − δ, ∑_{i=1}^{m−1} [ (‖qi‖^2 − σ2)/(2βi) + ⟨qi, w⋆ − wi⟩ ] ≤ (2DG + 4G2/β1) log(1/δ) + √( 36 (G2 σ2 ∑_{i=1}^{m} 1/βi^2 + D2 σ2 m) log(1/δ) ). [sent-900, score-0.626]
97 Also, |( f(wi, zi) − f(w⋆, zi)) − (F(wi) − F(w⋆))| ≤ 4B, and Varzi( f(wi, zi) − f(w⋆, zi) − (F(wi) − F(w⋆)) ) = Varzi( f(wi, zi) − f(w⋆, zi) ) ≤ σ̂2. [sent-912, score-0.732]
98 Now the regret bound also depends on the function variance σ̂2. Theorem 13 Let f be an L-smooth convex loss function. [sent-920, score-0.524]
99 Assume that the stochastic gradient ∇w f(w, zi) is bounded by a constant and has σ2-bounded variance for all i and all w, and that f(w, zi) is bounded by a constant and has σ̂2-bounded variance for all i and for all w. [sent-921, score-0.529]
100 If the update rule φ has a serial high-probability regret bound ψ(σ2, σ̂2, δ, m). [sent-922, score-0.928]
wordName wordTfidf (topN-words)
[('serial', 0.435), ('dmb', 0.352), ('regret', 0.33), ('wi', 0.3), ('latency', 0.205), ('qi', 0.163), ('hamir', 0.156), ('ilad', 0.156), ('batch', 0.149), ('istributed', 0.147), ('stochastic', 0.136), ('iao', 0.133), ('ekel', 0.133), ('zi', 0.122), ('rediction', 0.114), ('inputs', 0.112), ('ptimal', 0.105), ('online', 0.098), ('averaging', 0.092), ('xiao', 0.091), ('zs', 0.087), ('nline', 0.083), ('node', 0.078), ('gradients', 0.078), ('distributed', 0.074), ('nodes', 0.074), ('mdmb', 0.074), ('network', 0.073), ('lan', 0.071), ('update', 0.068), ('dual', 0.065), ('mirror', 0.059), ('template', 0.056), ('gradient', 0.053), ('bound', 0.052), ('ez', 0.051), ('convex', 0.05), ('prediction', 0.05), ('operation', 0.05), ('monetizable', 0.049), ('msrl', 0.049), ('parallel', 0.049), ('variance', 0.048), ('divides', 0.047), ('loss', 0.044), ('queries', 0.043), ('predictor', 0.043), ('rule', 0.043), ('ezi', 0.042), ('communication', 0.041), ('latencies', 0.041), ('accelerated', 0.041), ('asynchronous', 0.041), ('gap', 0.039), ('minw', 0.039), ('batches', 0.038), ('theorem', 0.037), ('nemirovski', 0.036), ('dbm', 0.035), ('lh', 0.035), ('zm', 0.034), ('cluster', 0.034), ('asymptotically', 0.034), ('optimality', 0.034), ('gi', 0.033), ('comm', 0.033), ('nedi', 0.033), ('varzi', 0.033), ('waste', 0.033), ('juditsky', 0.032), ('wm', 0.031), ('query', 0.03), ('dekel', 0.03), ('descent', 0.029), ('issued', 0.029), ('expected', 0.029), ('bounds', 0.029), ('equation', 0.027), ('nesterov', 0.025), ('bregman', 0.025), ('ghadimi', 0.025), ('synchronized', 0.025), ('arg', 0.024), ('lim', 0.024), ('diameter', 0.023), ('engine', 0.023), ('dominant', 0.023), ('operations', 0.022), ('rules', 0.022), ('duchi', 0.022), ('smoothness', 0.021), ('ethernet', 0.021), ('incoming', 0.02), ('zb', 0.02), ('tseng', 0.02), ('stream', 0.02), ('suffer', 0.02), ('ideal', 0.019), ('receive', 0.019), ('mimics', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
2 0.14594792 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
3 0.13269724 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.1105826 84 jmlr-2012-Online Submodular Minimization
Author: Elad Hazan, Satyen Kale
Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and partial feedback settings. Keywords: submodular optimization, online learning, regret minimization
5 0.09916544 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
6 0.08461637 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
7 0.074457951 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
8 0.072077416 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
9 0.057316627 13 jmlr-2012-Active Learning via Perfect Selective Classification
10 0.054467443 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
11 0.053751748 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
12 0.052508023 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
13 0.046656035 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
14 0.044406738 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
15 0.043699265 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
16 0.04329244 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.040640805 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
18 0.039126609 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
19 0.038706567 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
20 0.037261341 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
topicId topicWeight
[(0, -0.222), (1, -0.201), (2, -0.079), (3, -0.084), (4, 0.001), (5, 0.055), (6, 0.072), (7, 0.053), (8, -0.03), (9, -0.077), (10, 0.024), (11, 0.0), (12, -0.015), (13, -0.004), (14, -0.049), (15, -0.002), (16, 0.103), (17, -0.194), (18, -0.161), (19, 0.024), (20, 0.063), (21, -0.013), (22, 0.112), (23, 0.125), (24, 0.101), (25, -0.074), (26, -0.113), (27, 0.071), (28, -0.172), (29, -0.055), (30, 0.004), (31, 0.164), (32, -0.03), (33, 0.04), (34, 0.022), (35, 0.137), (36, -0.007), (37, 0.006), (38, 0.028), (39, 0.005), (40, -0.062), (41, 0.028), (42, -0.115), (43, -0.122), (44, -0.133), (45, -0.094), (46, -0.017), (47, 0.058), (48, 0.035), (49, -0.123)]
simIndex simValue paperId paperTitle
same-paper 1 0.95041084 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
2 0.49527061 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
3 0.49314752 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
4 0.46606815 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
5 0.41030648 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
Author: Philipp Hennig, Christian J. Schuler
Abstract: Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly addresses the decision problem of maximizing information gain from each evaluation. Keywords: optimization, probability, information, Gaussian processes, expectation propagation
6 0.40772405 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
7 0.38863859 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
9 0.3829647 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
10 0.36584619 84 jmlr-2012-Online Submodular Minimization
11 0.31716752 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
12 0.31336483 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
13 0.28395286 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
14 0.27325276 13 jmlr-2012-Active Learning via Perfect Selective Classification
15 0.26739821 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
16 0.26711446 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
17 0.26543558 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
18 0.26469323 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
19 0.24812652 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
20 0.22982682 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
topicId topicWeight
[(7, 0.062), (12, 0.254), (21, 0.04), (26, 0.069), (27, 0.013), (29, 0.032), (35, 0.019), (49, 0.016), (56, 0.029), (57, 0.014), (64, 0.016), (69, 0.015), (75, 0.089), (77, 0.016), (79, 0.019), (92, 0.103), (96, 0.124)]
simIndex simValue paperId paperTitle
same-paper 1 0.76288736 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
2 0.61640728 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
3 0.61580735 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, √ we propose an efficient algorithm which achieves O( T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
4 0.61300701 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
5 0.60909748 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
6 0.59993047 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
7 0.59966314 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
8 0.59681046 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
9 0.59204888 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
10 0.58921182 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
11 0.58764374 103 jmlr-2012-Sampling Methods for the Nyström Method
12 0.58625621 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
13 0.5842495 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.58135152 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
15 0.58119845 13 jmlr-2012-Active Learning via Perfect Selective Classification
16 0.5784713 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
17 0.57741857 73 jmlr-2012-Multi-task Regression using Minimal Penalties
18 0.57636601 84 jmlr-2012-Online Submodular Minimization
19 0.57635933 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
20 0.57630467 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection