nips nips2013 nips2013-19 knowledge-graph by maker-knowledge-mining

19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent


Source: pdf

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. [sent-3, score-0.097]

2 We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. [sent-4, score-0.623]

3 , φ_n be a sequence of vector convex functions from R^d to R, and let g : R^d → R be a strongly convex regularization function. [sent-9, score-0.059]

4 For example, given a sequence of n training examples (v_1, y_1), . [sent-11, score-0.044]

5 , (v_n, y_n), where v_i ∈ R^d and y_i ∈ R, ridge regression is obtained by setting g(x) = (λ/2)||x||^2 and φ_i(x) = (x^T v_i − y_i)^2. [sent-14, score-0.25]

6 Regularized logistic regression is obtained by setting φ_i(x) = log(1 + exp(−y_i x^T v_i)). [sent-15, score-0.078]
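
As a concrete illustration of these two instantiations, the short Python sketch below evaluates the regularized primal objective P(x) = (1/n) Σ_i φ_i(x) + g(x) (the form assumed here for problem (1), whose equation is not reproduced verbatim in this summary) with either the squared or the logistic loss; the function name and the toy data are illustrative additions, not part of the paper.

    import numpy as np

    def primal_objective(x, V, y, lam, loss="ridge"):
        """Evaluate P(x) = (1/n) sum_i phi_i(x) + (lam/2) ||x||^2."""
        margins = V @ x                      # x^T v_i for every example
        if loss == "ridge":
            phi = (margins - y) ** 2         # phi_i(x) = (x^T v_i - y_i)^2
        else:                                # regularized logistic regression
            phi = np.log1p(np.exp(-y * margins))
        return phi.mean() + 0.5 * lam * np.dot(x, x)

    # toy usage
    rng = np.random.default_rng(0)
    V = rng.standard_normal((100, 5))
    y = rng.standard_normal(100)
    print(primal_objective(np.zeros(5), V, y, lam=0.1))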

7 The dual problem of (1) is defined as follows: For each i, let φ_i^* : R^d → R be the convex conjugate of φ_i, namely, φ_i^*(u) = max_{z∈R^d} (z^T u − φ_i(z)). [sent-16, score-0.147]

8 The dual problem is: max_{α∈R^{d×n}} D(α), where D(α) = (1/n) Σ_{i=1}^n −φ_i^*(−α_i) − g^*((1/n) Σ_{i=1}^n α_i), (2) and where for each i, α_i is the i'th column of the matrix α. [sent-18, score-0.128]
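
To make the conjugate appearing in the dual (2) concrete, the snippet below checks numerically, for the scalar squared loss φ(z) = (z − y)^2, that the definition φ*(u) = max_z (z u − φ(z)) gives φ*(u) = u y + u^2/4; the brute-force grid check is only an illustration of the definition, not part of the paper.

    import numpy as np

    y, u = 1.5, 0.7
    z = np.linspace(-20, 20, 400001)              # dense grid over z
    phi = (z - y) ** 2                            # scalar squared loss
    conj_numeric = np.max(z * u - phi)            # phi*(u) by brute force
    conj_closed = u * y + u ** 2 / 4.0            # closed form (maximizer z = y + u/2)
    print(conj_numeric, conj_closed)              # the two values agree to grid accuracy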

9 The dual objective has a different dual vector associated with each primal function. [sent-19, score-0.322]

10 Dual Coordinate Ascent (DCA) methods solve the dual problem iteratively, where at each iteration of DCA, the dual objective is optimized with respect to a single dual vector, while the rest of the dual vectors are kept intact. [sent-20, score-0.579]

11 Recently, Shalev-Shwartz and Zhang [2013a] analyzed a stochastic version of dual coordinate ascent, abbreviated by SDCA, in which at each round we choose which dual vector to optimize uniformly at random (see also Richtárik and Takáč [2012a]). [sent-21, score-0.478]
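
A minimal sketch of such a vanilla SDCA loop for the ridge-regression instance above is given below, in the familiar scalar-dual form with one dual variable per example; the closed-form coordinate update is derived here for φ_i(a) = (a − y_i)^2 and g(x) = (λ/2)||x||^2 and is meant only as an illustration of the per-round update pattern, not as the paper's pseudocode.

    import numpy as np

    def sdca_ridge(V, y, lam, epochs=10, seed=0):
        """Vanilla SDCA sketch for (1/n) sum_i (v_i^T x - y_i)^2 + (lam/2)||x||^2."""
        rng = np.random.default_rng(seed)
        n, d = V.shape
        alpha = np.zeros(n)                 # one scalar dual variable per example
        x = np.zeros(d)                     # maintained as x = (1/(lam*n)) sum_i alpha_i v_i
        sq_norms = (V ** 2).sum(axis=1)
        for _ in range(epochs * n):
            i = rng.integers(n)             # choose a dual coordinate uniformly at random
            # exact maximizer of the dual over alpha_i with the other coordinates fixed
            delta = (y[i] - V[i] @ x - 0.5 * alpha[i]) / (0.5 + sq_norms[i] / (lam * n))
            alpha[i] += delta
            x += (delta / (lam * n)) * V[i]
        return x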

12 This convergence rate is significantly better than the more commonly studied stochastic gradient descent (SGD) methods that are related to SDCA.1 [sent-25, score-0.172]

13 Another approach to solving (1) is deterministic gradient descent methods. [sent-26, score-0.135]

14 In particular, Nesterov [2007] proposed an accelerated gradient descent (AGD) method for solving (1). [sent-27, score-0.212]

15 The advantage of SDCA over AGD is that each iteration involves only a single dual vector and usually costs O(d). [sent-29, score-0.195]

16 In contrast, each iteration of AGD requires Ω(nd) operations. [sent-30, score-0.067]

17 On the other hand, AGD has a better dependence on the condition number of the problem — the iteration bound of AGD scales with 1/√(λγ) while the iteration bound of SDCA scales with 1/(λγ). [sent-31, score-0.166]
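
To get a feel for this trade-off, the back-of-the-envelope comparison below plugs in example values, ignoring constants and logarithmic factors; the per-iteration costs O(d) for SDCA and Ω(nd) for AGD are taken from the surrounding text, and the specific numbers are purely illustrative.

    import math

    n, d = 10**5, 10**3
    lam_gamma = 1e-4                         # product of regularization and smoothness parameters
    sdca_iters = 1.0 / lam_gamma             # iteration bound scales with 1/(lam*gamma)
    agd_iters = 1.0 / math.sqrt(lam_gamma)   # iteration bound scales with 1/sqrt(lam*gamma)
    print("SDCA ~ iterations x per-iteration cost:", sdca_iters, "x", d)
    print("AGD  ~ iterations x per-iteration cost:", agd_iters, "x", n * d)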

18 In this paper we describe and analyze a new algorithm that interpolates between SDCA and AGD. [sent-32, score-0.09]

19 At each iteration of the algorithm, we randomly pick a subset of m indices from {1, . [sent-33, score-0.067]

20 , n} and update the dual vectors corresponding to this subset. [sent-36, score-0.128]

21 Another typical use of mini-batches is for parallel computing, which was studied by various authors for stochastic gradient descent (e. [sent-41, score-0.246]

22 They have shown that the naive mini-batching method, in which m dual variables are optimized in parallel, might actually increase the number of iterations required. [sent-48, score-0.149]
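
For intuition about why naive mini-batching needs care, the sketch below modifies the single-coordinate update pattern so that the m coordinate moves computed from the same iterate are averaged before being applied; averaging keeps the dual from decreasing but also shrinks the progress by a factor of m, which is exactly the kind of inefficiency an accelerated scheme is designed to avoid. This is an illustrative conservative variant for the ridge instance above, not the ASDCA update analyzed in the paper.

    import numpy as np

    def minibatch_sdca_ridge(V, y, lam, m, iters, seed=0):
        """Conservative (averaged) mini-batch dual coordinate ascent sketch for ridge."""
        rng = np.random.default_rng(seed)
        n, d = V.shape
        alpha, x = np.zeros(n), np.zeros(d)
        sq_norms = (V ** 2).sum(axis=1)
        for _ in range(iters):
            batch = rng.choice(n, size=m, replace=False)   # random subset of m indices
            # each delta is the single-coordinate maximizer computed at the current x
            deltas = (y[batch] - V[batch] @ x - 0.5 * alpha[batch]) / (
                0.5 + sq_norms[batch] / (lam * n))
            alpha[batch] += deltas / m                      # averaged (safe but conservative) step
            x += (V[batch].T @ deltas) / (m * lam * n)
        return x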

23 Analyzing more general strongly convex regularization functions is left for future work. [sent-63, score-0.04]

24 (Footnote 1: An exception is the recent analysis given in Le Roux et al.) [sent-64, score-0.041]

25 In Section 3 we discuss parallel implementations of ASDCA and compare them to parallel implementations of AGD and SDCA. [sent-66, score-0.204]

26 In Section 4 we present some experimental results, demonstrating how ASDCA interpolates between AGD and SDCA. [sent-68, score-0.09]

27 For example, if φ_i(x) = (x^T v_i − y_i)^2, then it is ||v_i||^2-smooth. [sent-76, score-0.194]

28 Define the dual sub-optimality by ∆D(α) = D(α*) − D(α), where α* is the optimal dual solution, and the primal sub-optimality by ∆P(x) = P(x) − D(α*). [sent-85, score-0.322]

29 As can be seen in the table, the ASDCA algorithm interpolates between SDCA and AGD. [sent-92, score-0.09]

30 Recall that the cost of each iteration of AGD scales with n while the cost of each iteration of SDCA does not scale with n. [sent-94, score-0.194]

31 The cost of each iteration of ASDCA scales with m. [sent-95, score-0.105]

32 To compensate for the different cost per iteration of the different algorithms, we may also compare the complexity in terms of the number of examples processed (see Table 2). [sent-96, score-0.229]

33 It should be mentioned that this comparison is meaningful in a single processor environment, but not in a parallel computing environment when multiple examples can be processed simultaneously in a minibatch. [sent-98, score-0.233]

34 In the next section we discuss under what conditions the overall runtime of ASDCA is better than both AGD and SDCA. [sent-99, score-0.044]

35 3 Parallel Implementation In recent years, there has been a lot of interest in implementing optimization algorithms using a parallel computing architecture (see Section 5). [sent-100, score-0.074]

36 We now discuss how to implement AGD, SDCA, and ASDCA on a computing machine with s parallel computing nodes. [sent-101, score-0.091]

37 In the calculations below, we use the following facts: • If each node holds a d-dimensional vector, we can compute the sum of these vectors in time O(d log(s)) by applying a “tree-structure” summation (see for example the All-Reduce architecture in Agarwal et al. [sent-102, score-0.084]
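
The pairwise reduction below is a minimal sketch of such a tree-structured summation: it halves the number of partial sums at every level, so with s nodes there are ceil(log2(s)) levels, and if the additions on one level run in parallel the wall-clock time for d-dimensional vectors is O(d log(s)). The sequential simulation here only illustrates the structure.

    import numpy as np

    def tree_sum(vectors):
        """Sum a list of equal-length vectors with a balanced pairwise (tree) reduction."""
        level = list(vectors)
        while len(level) > 1:                       # each pass is one level of the tree
            nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:                      # odd element is carried to the next level
                nxt.append(level[-1])
            level = nxt
        return level[0]

    parts = [np.ones(4) * k for k in range(8)]      # one partial vector per "node"
    print(tree_sum(parts))                          # equals sum(range(8)) in every coordinate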

38 • A node can broadcast a message with c bits to all other nodes in time O(c log^2(s)). [sent-104, score-0.204]

39 To see this, order nodes on the corners of the log_2(s)-dimensional hypercube. [sent-105, score-0.061]

40 Then, at each iteration, each node sends the message to its log(s) neighbors (namely, the nodes whose code word is at a Hamming distance of 1 from the node). [sent-106, score-0.147]

41 The message between the farthest-apart nodes will arrive after log(s) iterations. [sent-107, score-0.087]

42 Overall, we perform log(s) iterations and each iteration requires transmitting c log(s) bits. [sent-108, score-0.088]
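
A tiny simulation of this hypercube broadcast is sketched below: with s = 2^k nodes indexed by k-bit code words, the message starts at node 0 and, in every round, each informed node forwards it to all neighbors at Hamming distance 1; the simulation confirms that every node is reached after log_2(s) rounds. It only checks the counting argument, not the bit-level transmission costs.

    def hypercube_broadcast_rounds(k):
        """Rounds needed to inform all 2**k nodes when neighbors differ in one bit."""
        s = 1 << k
        informed = {0}                              # the message starts at node 0
        rounds = 0
        while len(informed) < s:
            newly = set()
            for node in informed:
                for bit in range(k):                # send to each of the log2(s) neighbors
                    newly.add(node ^ (1 << bit))
            informed |= newly
            rounds += 1
        return rounds

    for k in range(1, 6):
        print(2 ** k, "nodes ->", hypercube_broadcast_rounds(k), "rounds")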

43 • All nodes can broadcast a message with c bits to all other nodes in time O(cs log^2(s)). [sent-109, score-0.205]

44 To see this, simply apply the broadcasting of the different nodes mentioned above in parallel. [sent-110, score-0.08]

45 The number of iterations will still be the same, but now, at each iteration, each node should transmit cs bits to its log(s) neighbors. [sent-111, score-0.14]

46 For concreteness of the discussion, we consider problems in which φ_i(x) takes the form of a loss of (x^T v_i, y_i), where y_i is a scalar and v_i ∈ R^d. [sent-113, score-0.232]

47 We further assume that the average number of non-zero elements of v_i is d̄. [sent-117, score-0.078]

48 However, we assume that a single node can hold a fraction of 1/s of the data in its memory. [sent-119, score-0.06]

49 Let us now discuss parallel implementations of the different algorithms, starting with deterministic gradient algorithms (such as AGD). [sent-120, score-0.188]

50 The bottleneck operation of deterministic gradient algorithms is the calculation of the gradient. [sent-121, score-0.069]

51 If the data is distributed over s computing nodes, where each node holds n/s examples, we can calculate the gradient in time O(n d̄/s + d log(s)) as follows. [sent-123, score-0.138]

52 First, each node calculates the gradient over its own n/s examples (which takes time O(n d̄/s)). [sent-124, score-0.153]

53 On a single computing node, it was observed that SDCA is much more efficient than deterministic gradient descent methods, since each iteration of SDCA costs only Θ(d̄) while each iteration of AGD costs Θ(n d̄). [sent-127, score-0.269]

54 When we have s nodes, for the SDCA algorithm, dividing the examples into s computing nodes does not yield any speed-up. [sent-128, score-0.128]

55 However, we can divide the features into the s nodes (that is, each node will hold d/s of the features for all of the examples). [sent-129, score-0.161]

56 This enables the computation of x^T v_i in (expected) time of O(d̄/s + s log^2(s)). [sent-130, score-0.078]

57 Indeed, node t will calculate Σ_{j∈J_t} x_j v_{i,j}, where J_t is the set of features stored in node t (namely, |J_t| = d/s). [sent-131, score-0.14]

58 Then, each node broadcasts the resulting scalar to all the other nodes. [sent-132, score-0.06]
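
The sketch below illustrates this feature-partitioned computation of x^T v_i: each simulated node t holds its own block J_t of coordinates, computes the partial sum over that block, and the partial scalars are then combined (in a real system, via the summation and broadcast primitives described above). The partitioning helper and the sequential loop are purely illustrative.

    import numpy as np

    def partitioned_inner_product(x, v, s):
        """Compute x^T v by splitting the d coordinates into s feature blocks."""
        d = len(x)
        blocks = np.array_split(np.arange(d), s)        # J_1, ..., J_s with |J_t| ~ d/s
        partials = [x[J] @ v[J] for J in blocks]        # each "node" computes its local sum
        return sum(partials)                            # scalars are then exchanged and added

    rng = np.random.default_rng(0)
    x, v = rng.standard_normal(12), rng.standard_normal(12)
    print(partitioned_inner_product(x, v, s=4), x @ v)  # the two values match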

59 For the ASDCA algorithm, each iteration involves the computation of the gradient over m examples. [sent-134, score-0.116]

60 We can choose to implement it by dividing the examples among the s nodes (as we did for AGD) or by dividing the features among the s nodes (as we did for SDCA). [sent-135, score-0.232]

61 In the first case, the cost of each iteration is O(m d̄/s + d log(s)), while in the latter case the cost of each iteration is O(m d̄/s + ms log^2(s)). [sent-136, score-0.178]

62 We will choose between these two implementations based on the relation between d, m, and s. [sent-137, score-0.028]

63 The runtime and communication time of each iteration are summarized in the table below. [sent-138, score-0.165]

64 Algorithm | partition type | runtime | communication time
SDCA  | features | d̄/s  | s log^2(s)
ASDCA | features | d̄m/s | ms log^2(s)
ASDCA | examples | d̄m/s | d log(s)
AGD   | examples | d̄n/s | d log(s)
We again see that ASDCA nicely interpolates between SDCA and AGD. [sent-139, score-0.3]
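
A small cost-model helper along the lines of this table is sketched below; it returns the per-iteration runtime and communication terms for each algorithm/partitioning, up to constants, so one can plug in d, d̄ (average non-zeros), n, m, and s and see which scheme is cheapest in a given regime. The function and the example numbers are illustrative, not from the paper.

    import math

    def per_iteration_costs(d, d_bar, n, m, s):
        """Per-iteration (runtime, communication) terms from the table, up to constants."""
        log_s = math.log2(s)
        return {
            "SDCA (features)":  (d_bar / s,     s * log_s ** 2),
            "ASDCA (features)": (d_bar * m / s, m * s * log_s ** 2),
            "ASDCA (examples)": (d_bar * m / s, d * log_s),
            "AGD (examples)":   (d_bar * n / s, d * log_s),
        }

    for name, (run, comm) in per_iteration_costs(
            d=10**6, d_bar=10**3, n=10**6, m=10**3, s=64).items():
        print(f"{name:18s} runtime ~ {run:.1e}  communication ~ {comm:.1e}")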

65 In practice, it is usually the case that there is a non-negligible cost of opening communication channels between nodes. [sent-140, score-0.179]

66 In that case, it will be better to apply the ASDCA with a value of m that reflects an adequate tradeoff between the runtime of each node and the communication time. [sent-141, score-0.142]

67 With the appropriate value of m (which depends on constants like the cost of opening communication channels and sending packets of bits between nodes), ASDCA may outperform both SDCA and AGD. [sent-142, score-0.214]

68 4 Experimental Results In this section we demonstrate how ASDCA interpolates between SDCA and AGD. [sent-143, score-0.09]

69 , (v_m, y_m) be a set of labeled examples, where for every i, v_i ∈ R^d and y_i ∈ {±1}. [sent-148, score-0.116]

70 Define φ_i(x) to be: φ_i(x) = 0 if y_i x^T v_i > 1; φ_i(x) = 1/2 − y_i x^T v_i if y_i x^T v_i < 0; and φ_i(x) = (1/2)(1 − y_i x^T v_i)^2 otherwise. [sent-149, score-0.464]
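
A direct Python transcription of this piecewise loss, written as a function of the margin a = y_i x^T v_i, is given below for reference; the vectorized variant is an added convenience, not part of the paper.

    import numpy as np

    def smoothed_hinge(a):
        """phi(a) = 0 if a > 1, 1/2 - a if a < 0, and (1 - a)^2 / 2 otherwise."""
        if a > 1:
            return 0.0
        if a < 0:
            return 0.5 - a
        return 0.5 * (1.0 - a) ** 2

    def smoothed_hinge_vec(a):
        a = np.asarray(a, dtype=float)
        return np.where(a > 1, 0.0, np.where(a < 0, 0.5 - a, 0.5 * (1.0 - a) ** 2))

    print([smoothed_hinge(t) for t in (-1.0, 0.5, 2.0)])   # 1.5, 0.125, 0.0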

75 Figure 1: The figures present the performance of AGD, SDCA, and ASDCA with different values of the mini-batch size m (legend values such as m=3, m=30, m=299, m=52, m=523, m=5229, m=78, m=781, m=7813); the panels plot primal suboptimality and test loss against the number of processed examples. [sent-220, score-0.132]

76 In all figures, the x axis is the number of processed examples. [sent-221, score-0.096]

77 In Figure 1 we depict the primal sub-optimality of the different algorithms as a function of the number of examples processed. [sent-233, score-0.11]

78 Note that each iteration of SDCA processes a single example, each iteration of ASDCA processes m examples, and each iteration of AGD processes n examples. [sent-234, score-0.201]

79 As can be seen from the graphs, ASDCA indeed interpolates between SDCA and AGD. [sent-235, score-0.09]

80 6 5 Discussion and Related Work We have introduced an accelerated version of stochastic dual coordinate ascent with mini-batches. [sent-239, score-0.369]

81 We have shown, both theoretically and empirically, that the resulting algorithm interpolates between the vanilla stochastic coordinate descent algorithm and the accelerated gradient descent algorithm. [sent-240, score-0.495]

82 Using mini-batches in stochastic learning has received a lot of attention in recent years. [sent-241, score-0.034]

83 [2012] and Agarwal and Duchi [2012] gave an analysis of SGD with mini-batches for smooth loss functions. [sent-247, score-0.043]

84 [2011] studied SGD and accelerated versions of SGD with mini-batches and Takáč et al. [sent-249, score-0.144]

85 [2010] studied dual averaging in distributed networks as a function of spectral properties of the underlying graph. [sent-252, score-0.18]

86 However, all of these methods have a polynomial dependence on 1/ε, while we consider the strongly convex and smooth case in which a log(1/ε) rate is achievable. [sent-253, score-0.041]

87 (Footnote 2: Parallel coordinate descent has also been recently studied in Fercoq and Richtárik [2013], Richtárik and Takáč [2013].) [sent-254, score-0.348]

88 A possible reason is the cost of opening communication sockets as discussed in Section 3. [sent-256, score-0.138]

89 There are various practical considerations that one should take into account when designing a practical system for distributed optimization. [sent-257, score-0.029]

90 The more general problem of distributed PAC learning has been studied recently in Daume III et al. [sent-262, score-0.076]

91 In particular, they obtain algorithms with O(log(1/ε)) communication complexity. [sent-266, score-0.055]

92 By doing so, we can interpolate between the 1/ε^2 rate of non-accelerated methods and the 1/ε rate of accelerated gradient. [sent-300, score-0.097]

93 (Footnote 3: There are few exceptions in the context of stochastic coordinate descent in the primal.) [sent-301, score-0.165]

94 [2011], Richtárik and Takáč [2012b]. John Duchi, Alekh Agarwal, and Martin J. Wainwright. [sent-303, score-0.097]

95 Smooth minimization of nonsmooth functions with parallel coordinate descent methods. [sent-307, score-0.205]

96 Algorithms and hardness results for parallel large margin learning. [sent-315, score-0.074]

97 : A lock-free approach to parallelizing stochastic gradient descent. [sent-331, score-0.083]

98 Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. [sent-335, score-0.081]

99 Distributed coordinate descent method for learning with big data. [sent-342, score-0.131]

100 Stochastic dual coordinate ascent methods for regularized loss minimization. [sent-346, score-0.259]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sdca', 0.572), ('agd', 0.488), ('asdca', 0.453), ('dual', 0.128), ('tak', 0.123), ('richt', 0.115), ('rik', 0.097), ('accelerated', 0.097), ('processed', 0.096), ('interpolates', 0.09), ('vi', 0.078), ('sgd', 0.077), ('parallel', 0.074), ('iteration', 0.067), ('primal', 0.066), ('descent', 0.066), ('coordinate', 0.065), ('arxiv', 0.062), ('opening', 0.061), ('nodes', 0.061), ('node', 0.06), ('shai', 0.058), ('communication', 0.055), ('ccat', 0.05), ('gradient', 0.049), ('nesterov', 0.048), ('ascent', 0.045), ('danny', 0.044), ('shalevshwartz', 0.044), ('examples', 0.044), ('joseph', 0.043), ('tong', 0.042), ('nm', 0.042), ('preprint', 0.042), ('channels', 0.041), ('agarwal', 0.04), ('aapo', 0.038), ('yi', 0.038), ('zhang', 0.038), ('suboptimality', 0.037), ('martin', 0.035), ('alekh', 0.035), ('kyrola', 0.035), ('dekel', 0.035), ('bits', 0.035), ('bickson', 0.034), ('jt', 0.034), ('rd', 0.034), ('stochastic', 0.034), ('fercoq', 0.034), ('peter', 0.03), ('yucheng', 0.03), ('yurii', 0.03), ('distributed', 0.029), ('nathan', 0.029), ('vanilla', 0.028), ('carlos', 0.028), ('implementations', 0.028), ('daume', 0.027), ('dca', 0.027), ('runtime', 0.027), ('message', 0.026), ('abbreviated', 0.026), ('euclidean', 0.025), ('graphlab', 0.024), ('cotter', 0.024), ('duchi', 0.024), ('cs', 0.024), ('et', 0.024), ('dividing', 0.023), ('studied', 0.023), ('smooth', 0.022), ('broadcast', 0.022), ('ohad', 0.022), ('cost', 0.022), ('ofer', 0.022), ('niu', 0.022), ('iterations', 0.021), ('log', 0.021), ('loss', 0.021), ('gonzalez', 0.021), ('acceleration', 0.021), ('balcan', 0.021), ('regularization', 0.021), ('bradley', 0.02), ('features', 0.02), ('deterministic', 0.02), ('convex', 0.019), ('olivier', 0.019), ('mentioned', 0.019), ('namely', 0.019), ('guestrin', 0.018), ('ridge', 0.018), ('roux', 0.017), ('discuss', 0.017), ('ran', 0.017), ('shamir', 0.017), ('table', 0.016), ('scales', 0.016), ('composite', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1

2 0.25602305 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Author: Rie Johnson, Tong Zhang

Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1

3 0.12885548 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1

4 0.12002781 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

Author: Lijun Zhang, Mehrdad Mahdavi, Rong Jin

Abstract: For smooth and strongly convex optimizations, the optimal iteration complexity of √ the gradient-based algorithm is O( κ log 1/ǫ), where κ is the condition number. In the case that the optimization problem is ill-conditioned, we need to evaluate a large number of full gradients, which could be computationally expensive. In this paper, we propose to remove the dependence on the condition number by allowing the algorithm to access stochastic gradients of the objective function. To this end, we present a novel algorithm named Epoch Mixed Gradient Descent (EMGD) that is able to utilize two kinds of gradients. A distinctive step in EMGD is the mixed gradient descent, where we use a combination of the full and stochastic gradients to update the intermediate solution. Theoretical analysis shows that EMGD is able to find an ǫ-optimal solution by computing O(log 1/ǫ) full gradients and O(κ2 log 1/ǫ) stochastic gradients. 1

5 0.09201695 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

Author: Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, Andreas Krause

Abstract: Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable yet representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical. In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol G REE D I, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show, that under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering, on tens of millions of data points using Hadoop. 1

6 0.071169659 268 nips-2013-Reflection methods for user-friendly submodular optimization

7 0.057671815 193 nips-2013-Mixed Optimization for Smooth Functions

8 0.051317681 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

9 0.046567123 318 nips-2013-Structured Learning via Logistic Regression

10 0.038418338 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

11 0.038022563 289 nips-2013-Scalable kernels for graphs with continuous attributes

12 0.037411541 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

13 0.035602033 196 nips-2013-Modeling Overlapping Communities with Node Popularities

14 0.035093263 211 nips-2013-Non-Linear Domain Adaptation with Boosting

15 0.034897678 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

16 0.034888111 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

17 0.034480423 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

18 0.034203872 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

19 0.033881027 171 nips-2013-Learning with Noisy Labels

20 0.033763345 314 nips-2013-Stochastic Optimization of PCA with Capped MSG


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.101), (1, 0.012), (2, 0.038), (3, -0.013), (4, 0.037), (5, 0.052), (6, -0.109), (7, -0.0), (8, 0.102), (9, 0.02), (10, 0.022), (11, 0.065), (12, 0.069), (13, -0.112), (14, -0.025), (15, 0.05), (16, -0.03), (17, 0.063), (18, 0.025), (19, 0.063), (20, 0.057), (21, 0.004), (22, 0.012), (23, -0.033), (24, -0.016), (25, 0.005), (26, 0.014), (27, 0.034), (28, -0.028), (29, -0.064), (30, 0.014), (31, 0.012), (32, -0.048), (33, 0.018), (34, 0.116), (35, 0.073), (36, -0.105), (37, 0.026), (38, -0.031), (39, -0.108), (40, -0.031), (41, -0.115), (42, -0.035), (43, 0.117), (44, -0.066), (45, 0.026), (46, 0.031), (47, -0.035), (48, -0.07), (49, -0.126)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92696375 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1

2 0.78143525 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1

3 0.71603358 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Author: Rie Johnson, Tong Zhang

Abstract: Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning. 1

4 0.62422967 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

Author: John Duchi, Michael Jordan, Brendan McMahan

Abstract: We study stochastic optimization problems when the data is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks. 1

5 0.52423847 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

Author: Srikrishna Sridhar, Stephen Wright, Christopher Re, Ji Liu, Victor Bittorf, Ce Zhang

Abstract: Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality. 1
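As a rough illustration of the idea in this abstract only, and not the paper's algorithm, rounding scheme, or guarantees, the Python sketch below approximately solves the vertex-cover LP by running random-coordinate projected gradient steps on a quadratic-penalty surrogate, rounds at 1/2, and greedily repairs any edges the approximate solution leaves uncovered. The penalty weight beta, the stepsize, the iteration count, and the helper name approx_vertex_cover are arbitrary choices made for this sketch.

import random

def approx_vertex_cover(n_vertices, edges, beta=10.0, step=0.05, iters=5000, seed=0):
    # Quadratic-penalty surrogate of the vertex-cover LP:
    #   f(x) = sum_i x_i + (beta/2) * sum_{(u,w) in E} max(0, 1 - x_u - x_w)^2,  x in [0,1]^n
    rng = random.Random(seed)
    x = [0.5] * n_vertices
    nbrs = [[] for _ in range(n_vertices)]
    for u, w in edges:
        nbrs[u].append(w)
        nbrs[w].append(u)
    for _ in range(iters):
        v = rng.randrange(n_vertices)
        grad = 1.0                          # derivative of the linear objective term
        for u in nbrs[v]:
            slack = 1.0 - x[v] - x[u]
            if slack > 0.0:                 # penalty is active only when the edge is uncovered
                grad -= beta * slack
        x[v] = min(1.0, max(0.0, x[v] - step * grad))   # projected coordinate step
    cover = {v for v in range(n_vertices) if x[v] >= 0.5}
    for u, w in edges:                      # the penalty only approximately enforces x_u + x_w >= 1,
        if u not in cover and w not in cover:  # so repair greedily to guarantee a valid cover
            cover.add(u)
    return cover

# Star graph: vertex 0 connected to 1, 2, 3; a minimum vertex cover is {0}.
print(approx_vertex_cover(4, [(0, 1), (0, 2), (0, 3)]))

Because the penalty solution can sit slightly on the infeasible side of the constraints, the greedy repair pass is just the simplest way to keep this toy example feasible; it is not taken from the paper, which controls the quality of rounding an approximate solution through its own perturbation analysis.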

6 0.52214116 193 nips-2013-Mixed Optimization for Smooth Functions

7 0.50520182 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

8 0.49828884 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients

9 0.4911865 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation

10 0.48642227 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

11 0.44451484 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

12 0.43278393 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

13 0.41988754 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

14 0.41601291 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

15 0.41398877 268 nips-2013-Reflection methods for user-friendly submodular optimization

16 0.38667834 245 nips-2013-Pass-efficient unsupervised feature selection

17 0.36528069 318 nips-2013-Structured Learning via Logistic Regression

18 0.36101577 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification

19 0.36014149 65 nips-2013-Compressive Feature Learning

20 0.35729349 146 nips-2013-Large Scale Distributed Sparse Precision Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.013), (2, 0.04), (16, 0.05), (30, 0.273), (33, 0.102), (34, 0.061), (41, 0.045), (49, 0.012), (56, 0.084), (70, 0.043), (85, 0.047), (89, 0.017), (93, 0.064), (95, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72783476 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1

2 0.63182127 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

Author: Paul Valiant, Gregory Valiant

Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/ log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 1
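For contrast with the shape-recovery approach this abstract describes, below is a tiny Python sketch of the classical Good-Turing idea it mentions: estimating only the total probability mass of the unseen portion of the distribution as the fraction of samples that appeared exactly once. The sample data is made up.

from collections import Counter

def good_turing_missing_mass(sample):
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)   # estimated probability of drawing an unseen element

sample = ["a", "b", "a", "c", "d", "a", "e", "b", "f", "g"]
print(good_turing_missing_mass(sample))   # 5 singletons / 10 draws = 0.5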

3 0.58870769 224 nips-2013-On the Sample Complexity of Subspace Learning

Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco

Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1

4 0.58561492 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

Author: Shouyuan Chen, Michael R. Lyu, Irwin King, Zenglin Xu

Abstract: Tensor completion from incomplete observations is a problem of significant practical interest. However, it is unlikely that there exists an efficient algorithm with provable guarantee to recover a general tensor from a limited number of observations. In this paper, we study the recovery algorithm for pairwise interaction tensors, which has recently gained considerable attention for modeling multiple attribute data due to its simplicity and effectiveness. Specifically, in the absence of noise, we show that one can exactly recover a pairwise interaction tensor by solving a constrained convex program which minimizes the weighted sum of nuclear norms of matrices from O(nr log2 (n)) observations. For the noisy cases, we also prove error bounds for a constrained convex program for recovering the tensors. Our experiments on the synthetic dataset demonstrate that the recovery performance of our algorithm agrees well with the theory. In addition, we apply our algorithm on a temporal collaborative filtering task and obtain state-of-the-art results. 1

5 0.52866262 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

Author: Tianbao Yang

Abstract: We present and study a distributed optimization algorithm that employs a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often perform better than stochastic gradient descent methods when optimizing regularized loss minimization problems, yet little effort has been devoted to studying them in a distributed framework. We make progress along this line by presenting a distributed stochastic dual coordinate ascent algorithm for a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performance. 1
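Since several entries on this page revolve around SDCA, a minimal sketch of the serial, single-machine SDCA update may be a useful reference point; this is the vanilla method, not the distributed star-network algorithm of this abstract. The squared loss is scaled as (a - y_i)^2 / 2 so that the coordinate step has the simple closed form used below, and the dataset, λ, and epoch count are invented for the demo.

import numpy as np

def sdca_ridge(X, y, lam=0.1, epochs=50, seed=0):
    # Primal: (1/n) sum_i (x_i.w - y_i)^2 / 2 + (lam/2) ||w||^2
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)                  # one dual variable per example
    w = X.T @ alpha / (lam * n)          # primal iterate kept in sync with alpha
    for _ in range(epochs * n):
        i = rng.integers(n)
        # Exact maximizer of the dual over coordinate i for the (1/2)-scaled squared loss:
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)
print(sdca_ridge(X, y))                  # approximates the ridge-regularized solution

Each step touches a single example and updates w in place, so the per-iteration cost of this sketch is proportional to the dimension d.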

6 0.52508098 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

7 0.52304989 215 nips-2013-On Decomposing the Proximal Map

8 0.52030283 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

9 0.52016658 99 nips-2013-Dropout Training as Adaptive Regularization

10 0.51890439 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs

11 0.51845425 201 nips-2013-Multi-Task Bayesian Optimization

12 0.51687616 350 nips-2013-Wavelets on Graphs via Deep Learning

13 0.5164966 149 nips-2013-Latent Structured Active Learning

14 0.51638001 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

15 0.51634461 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

16 0.51624191 318 nips-2013-Structured Learning via Logistic Regression

17 0.51558006 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

18 0.51501924 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

19 0.51469499 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

20 0.51340401 69 nips-2013-Context-sensitive active sensing in humans