jmlr jmlr2013 jmlr2013-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: distributed learning, stochastic optimization, averaging, subsampling. [sent-19, score-0.131]
2 In a centralized setting, there are many procedures for solving empirical risk minimization problems, among them standard convex programming approaches (e. [sent-22, score-0.118]
3 Accordingly, the focus of this paper is the study of some distributed and communication-efficient procedures for empirical risk minimization. [sent-29, score-0.081]
4 Whenever the number of machines m is less than the number of samples n per machine, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N = nm samples. [sent-59, score-0.131]
5 We also show how the result extends to stochastic programming approaches, exhibiting a stochastic gradient-descent based procedure that also attains convergence rates scaling as O ((nm)−1 ), but with slightly worse dependence on different problem-specific parameters. [sent-61, score-0.106]
6 As long as m < n2 , the subsampled method again matches the centralized gold standard in the first-order term, and has a second-order term smaller than the standard averaging approach. [sent-67, score-0.139]
7 Our experiments on this problem show that S AVGM—with the resampling and correction it provides—gives substantial performance benefits over naive solutions as well as the averaging algorithm AVGM. [sent-76, score-0.094]
8 Background and Problem Set-up We begin by setting up our decision-theoretic framework for empirical risk minimization, after which we describe our algorithms and the assumptions we require for our main theoretical results. [sent-78, score-0.08]
9 Assuming that each function x → f (θ; x) is P-integrable, the population risk F0 : Θ → R is given by F0 (θ) := EP [ f (θ; X)] = ∫X f (θ; x) dP(x). [sent-82, score-0.113]
10 Our goal is to estimate the parameter vector minimizing the population risk, namely the quantity θ∗ := argminθ∈Θ F0 (θ) = argminθ∈Θ ∫X f (θ; x) dP(x), which we assume to be unique. [sent-83, score-0.103]
11 In practice, the population distribution P is unknown to us, but we have access to a collection S of samples from the distribution P. [sent-84, score-0.075]
12 Empirical risk minimization is based on estimating θ∗ by solving the optimization problem θ ∈ argminθ∈Θ { (1/|S|) ∑x∈S f (θ; x) }. [sent-85, score-0.082]
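For concreteness, here is a minimal sketch (not taken from the paper) of empirical risk minimization by gradient descent on the sample-average objective; the squared loss, the step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

def empirical_risk_minimizer(X, y, steps=500, lr=0.1):
    """Approximately minimize (1/|S|) * sum_{x in S} f(theta; x) for the squared
    loss f(theta; (x, y)) = 0.5 * (x @ theta - y) ** 2 via plain gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n  # gradient of the empirical risk
        theta -= lr * grad
    return theta

# Toy usage: data drawn from a linear model with Gaussian noise.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=5)
X = rng.normal(size=(1000, 5))
y = X @ theta_star + 0.1 * rng.normal(size=1000)
theta_hat = empirical_risk_minimizer(X, y)
```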
13 We formalize this notion in terms of the Hessian of F0 : Assumption 2 (Local strong convexity) The population risk is twice differentiable, and there exists a parameter λ > 0 such that ∇2 F0 (θ∗ ) ⪰ λ Id×d . [sent-93, score-0.113]
14 ) This local condition is milder than a global strong convexity condition and is required to hold only for the population risk F0 evaluated at θ∗ . [sent-97, score-0.146]
15 , m}, processor i uses its local data set S1,i to compute the local empirical minimizer θ1,i ∈ argminθ∈Θ { (1/|S1,i |) ∑x∈S1,i f (θ; x) }. [sent-120, score-0.108]
16 These local estimates are then averaged as θ1 = (1/m) ∑i=1,...,m θ1,i (Equation (2)). The subsampled average mixture (S AVGM) algorithm is based on an additional level of sampling on top of the first, involving a fixed subsampling rate r ∈ [0, 1]. [sent-122, score-0.12]
17 (2) Each processor i computes both the local empirical minimizer θ1,i from Equation (1) and the empirical minimizer θ2,i ∈ argminθ∈Θ { (1/|S2,i |) ∑x∈S2,i f (θ; x) } on its subsampled data set S2,i . [sent-126, score-0.132]
18 (3) The subsampled estimates are likewise averaged as θ2 = (1/m) ∑i=1,...,m θ2,i , and the final S AVGM estimate is the weighted combination (θ1 − r θ2 )/(1 − r), Equation (3). The intuition for the weighted estimator (3) is similar to that for standard bias correction procedures using the bootstrap or subsampling (Efron and Tibshirani, 1993; Hall, 1992; Politis et al. [sent-128, score-0.155]
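The following sketch illustrates the AVGM average and the S AVGM combination just described, under the illustrative assumption of a least-squares loss so that each local minimizer has a closed form; the function names and the use of np.linalg.lstsq are choices made here, not the paper's implementation.

```python
import numpy as np

def local_minimizer(X, y):
    # Closed-form least-squares fit on one machine's local data.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def avgm_and_savgm(X_parts, y_parts, r, rng):
    """X_parts, y_parts: the m local data sets; r in (0, 1) is the subsampling rate."""
    theta1, theta2 = [], []
    for X, y in zip(X_parts, y_parts):
        theta1.append(local_minimizer(X, y))                 # minimizer on the full local set
        idx = rng.choice(len(y), size=max(1, int(r * len(y))), replace=False)
        theta2.append(local_minimizer(X[idx], y[idx]))       # minimizer on the r-subsample
    theta1_bar = np.mean(theta1, axis=0)                     # AVGM estimate (Equation (2))
    theta2_bar = np.mean(theta2, axis=0)
    theta_savgm = (theta1_bar - r * theta2_bar) / (1.0 - r)  # S AVGM estimate (Equation (3))
    return theta1_bar, theta_savgm
```

Note that only the d-dimensional local estimates need to be communicated, which is what keeps the scheme to a single communication round.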
19 Roughly speaking, if b0 = θ∗ − θ1 is the bias of the first estimator, then we may approximate b0 by the subsampled estimate of bias b1 = θ∗ − θ2 . [sent-130, score-0.085]
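A short worked version of this cancellation, assuming the heuristic expansions E[θ1] ≈ θ∗ + b/n and E[θ2] ≈ θ∗ + b/(rn) (the b/n form of the leading bias is an assumption of this sketch, not a statement from the text):

```latex
% Heuristic bias cancellation behind the S AVGM weighting:
\mathbb{E}\!\left[\frac{\bar\theta_1 - r\,\bar\theta_2}{1-r}\right]
  \approx \frac{\bigl(\theta^* + \tfrac{b}{n}\bigr) - r\bigl(\theta^* + \tfrac{b}{rn}\bigr)}{1-r}
  = \frac{(1-r)\,\theta^*}{1-r}
  = \theta^*,
% so the leading O(1/n) bias term cancels, leaving only higher-order terms.
```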
20 The goal of this paper is to understand under what conditions—and in what sense—the estimators (2) and (3) approach the oracle performance, by which we mean the error of a centralized risk minimization procedure that is given access to all N = nm samples. [sent-133, score-0.189]
21 1 Smoothness Conditions In addition to our previously stated assumptions on the population risk, we require regularity conditions on the empirical risk functions. [sent-147, score-0.113]
22 It is an important insight of our analysis that some type of smoothness condition on the Hessian matrix, as in the Lipschitz condition (4), is essential in order for simple averaging methods to work. [sent-152, score-0.084]
23 The associated population risk is F0 (θ) = (1/2)(θ2 + θ2 1(θ≤0) ). [sent-154, score-0.113]
24 Since |F0′ (w) − F0′ (v)| ≤ 2|w − v|, the population risk is strongly convex and smooth, but it has a discontinuous second derivative. [sent-155, score-0.132]
25 The unique minimizer of the population risk is θ∗ = 0, and by an asymptotic expansion given in Appendix A, it can be shown that E[θ1,i ] = Ω(n−1/2 ). [sent-156, score-0.207]
26 We recall that θ∗ denotes the minimizer of the population objective function F0 , and that for each i ∈ {1, . [sent-164, score-0.108]
27 For each i, we use θi ∈ argminθ∈Θ { (1/n) ∑x∈Si f (θ; x)} to denote a minimizer of the empirical risk for the data set Si , and we define the averaged vector θ = (1/m) ∑i=1,...,m θi . [sent-168, score-0.129]
28 The following result bounds the mean-squared error between this averaged estimate and the minimizer θ∗ of the population risk. [sent-169, score-0.143]
29 (8) This upper bound shows that the leading term decays proportionally to (nm)−1 , with the pre-factor depending inversely on the strong convexity constant λ and growing proportionally with the bound G on the loss gradient. [sent-175, score-0.119]
30 This issue is exacerbated when the local strong convexity parameter λ of the risk F0 is close to zero or the Lipschitz continuity constant H of ∇ f is large. [sent-201, score-0.091]
31 Due to the additional randomness introduced by the subsampling in S AVGM, its analysis requires an additional smoothness condition. [sent-203, score-0.088]
32 The reason for this elimination is that subsampling at a rate r reduces the bias of the S AVGM algorithm to O (n−3 ), whereas in contrast, the bias of the AVGM algorithm induces terms of order n−2 . [sent-213, score-0.093]
33 Theorem 4 suggests that the performance of the S AVGM algorithm is affected by the subsampling rate r; in order to minimize the upper bound (9) in the regime m < N 2/3 , the optimal choice is of the form r ∝ C √m/n = Cm3/2 /N where C ≈ (G2 /λ2 ) max{MG/λ, L √(d log d)}. [sent-214, score-0.101]
34 Roughly, as the number of machines m becomes larger, we may increase r, since we enjoy averaging effects from the S AVGM algorithm. [sent-215, score-0.098]
35 On one hand, this asymptotic growth is faster in the subsampled case (11); on the other hand, the dependence on the dimension d of the problem is more stringent than the standard averaging case (10). [sent-223, score-0.098]
36 As the local strong convexity constant λ of the population risk shrinks, both methods allow less splitting of the data, meaning that the sample size per machine must be larger. [sent-224, score-0.146]
37 This limitation is intuitive, since lower curvature for the population risk means that the local empirical risks associated with each machine will inherit lower curvature as well, and this effect will be exacerbated with a small local sample size per machine. [sent-225, score-0.113]
38 More precisely, suppose that each processor runs a finite number of iterations of some optimization algorithm, thereby obtaining the vector θ′1,i as an approximate minimizer of the objective function F1,i . [sent-234, score-0.084]
39 Consequently, suppose that processor i runs enough iterations to obtain an approximate minimizer θ′1,i such that E[‖θ′1,i − θ1,i ‖₂²] = O ((mn)−2 ). [sent-237, score-0.084]
40 Consequently, performing stochastic gradient descent on F1 for O (log2 (mn)/ρ2 ) iterations yields an approximate minimizer that falls within Bρ (θ1 ) with high probability (e. [sent-242, score-0.184]
41 Note that the radius ρ for local strong convexity is a property of the population risk F0 that we use as prior knowledge. [sent-247, score-0.146]
42 Under local strong convexity of the objective function, gradient descent is known to converge at a geometric rate (see, e. [sent-249, score-0.111]
43 In our case, we have ε = (mn)−2 , and since each iteration of standard gradient descent requires O (n) units of time, a total of O (n log(mn)) time units are sufficient to obtain a final estimate θ′1 satisfying condition (13). [sent-252, score-0.078]
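A rough sketch of the two-stage local solver described in this paragraph (an SGD warm start followed by full gradient descent); the least-squares loss, the step sizes, and the iteration counts are illustrative assumptions rather than the paper's exact constants.

```python
import numpy as np

def two_stage_local_solver(X, y, m, lam=1.0, c=2.0, gd_step=0.5, seed=0):
    """Warm start with SGD, then refine with full gradient descent so that the
    final accuracy is of order (m*n)^(-2), in the spirit of condition (13)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    # Stage 1: stochastic gradient descent with step size c / (lam * t).
    T_sgd = int(np.ceil(np.log(m * n) ** 2))       # illustrative iteration count
    for t in range(1, T_sgd + 1):
        i = rng.integers(n)
        g = (X[i] @ theta - y[i]) * X[i]           # gradient of one sample's loss
        theta -= (c / (lam * t)) * g
    # Stage 2: full gradient descent, which converges geometrically near the optimum;
    # each iteration costs O(n), so O(log(mn)) iterations cost O(n log(mn)) in total.
    T_gd = int(np.ceil(4 * np.log(m * n)))
    for _ in range(T_gd):
        grad = X.T @ (X @ theta - y) / n
        theta -= gd_step * grad
    return theta
```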
44 5 Stochastic Gradient Descent with Averaging The previous strategy involved a combination of stochastic gradient descent and standard gradient descent. [sent-255, score-0.178]
45 In many settings, it may be appealing to use only stochastic gradient algorithms, due to the ease of their implementation and their limited computational requirements. [sent-256, score-0.1]
46 In this section, we describe an extension of Theorem 1 to the case in which each machine computes an approximate minimizer using only stochastic gradient descent. [sent-257, score-0.153]
47 Let us begin by briefly reviewing the basic form of stochastic gradient descent (SGD). [sent-261, score-0.153]
48 Stochastic gradient descent algorithms iteratively update a parameter vector θt over time based on randomly sampled gradient information. [sent-262, score-0.125]
49 (14) Here ηt > 0 is a stepsize, and the first update in (14) is a gradient descent step with respect to the random gradient ∇ f (θt ; Xt ). [sent-268, score-0.125]
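The displayed update (14) is not reproduced in this extract; the sketch below assumes the projected form suggested by the later proof, θt+1 = Π(θt − ηt ∇f(θt; Xt)), and uses projection onto a Euclidean ball as an illustrative choice of the constraint set Θ.

```python
import numpy as np

def project_onto_ball(theta, radius):
    # Euclidean projection onto Theta = {theta : ||theta||_2 <= radius}.
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def sgd_update(theta, grad_sample, t, c, lam, radius):
    """One step of the form assumed above: a gradient step with step size
    eta_t = c / (lam * t), followed by projection back onto Theta."""
    eta_t = c / (lam * t)
    return project_onto_ball(theta - eta_t * grad_sample, radius)
```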
50 To prove convergence of our stochastic gradient-based averaging algorithms, we require the following smoothness and strong convexity condition, which is an alternative to the Assumptions 2 and 3 used previously. [sent-273, score-0.17]
51 The averaged stochastic gradient algorithm (SGDAVGM) is based on the following two steps: (1) Given some constant c > 1, each machine performs n iterations of stochastic gradient descent (14) on its local data set of n samples using the stepsize ηt = c/(λt), then outputs the resulting parameter θ′i . [sent-280, score-0.256]
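A minimal sketch of the two SGDAVGM steps for a least-squares loss; the loss choice and the omission of the projection step are simplifications made here for illustration only.

```python
import numpy as np

def sgdavgm(X_parts, y_parts, c=2.0, lam=1.0, seed=0):
    """Step (1): each machine runs n stochastic gradient iterations with step size
    eta_t = c / (lam * t) on its n local samples.  Step (2): average the outputs."""
    rng = np.random.default_rng(seed)
    outputs = []
    for X, y in zip(X_parts, y_parts):
        n, d = X.shape
        theta = np.zeros(d)
        for t in range(1, n + 1):
            i = rng.integers(n)
            g = (X[i] @ theta - y[i]) * X[i]   # stochastic gradient of one sample's loss
            theta -= (c / (lam * t)) * g
        outputs.append(theta)
    return np.mean(outputs, axis=0)            # the averaged SGD estimate
```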
52 E[‖θ − θ∗‖₂²] ≤ αG²/(λ²mn) + β²/n^{3/2}. Theorem 5 shows that the averaged stochastic gradient descent procedure attains the optimal convergence rate O (N −1 ) as a function of the total number of observations N = mn. [sent-284, score-0.188]
53 It is not clear whether a bootstrap correction is possible for the stochastic-gradient based estimator; such a correction could be significant, because the term β2 /n3/2 arising from the bias in the stochastic gradient estimator may be non-trivial. [sent-287, score-0.227]
54 The oracle least-squares estimate using all N samples is given by the line “All,” while the line “Single” gives the performance of the naive estimator using only n = N/m samples. [sent-307, score-0.079]
55 In addition to (i)–(iii), we also estimate θ∗ with (iv) the empirical minimizer of a single split of the data of size n = N/m, and (v) the empirical minimizer on the full data set (the oracle solution). [sent-309, score-0.138]
56 1 Averaging Methods For our first set of experiments, we study the performance of the averaging methods (AVGM and S AVGM), showing their scaling as the number of splits of data—the number of machines m—grows for fixed N and dimensions d = 20 and d = 200. [sent-311, score-0.132]
57 In Figure 1, we plot the error ‖θ − θ∗‖₂² of the inferred parameter vector θ for the true parameters θ∗ versus the number of splits m, or equivalently, the number of separate machines available for use. [sent-319, score-0.116]
58 As a baseline in each plot, we plot as a red line the squared error ‖θN − θ∗‖₂² of the centralized “gold standard,” obtained by applying a batch method to all N samples. [sent-321, score-0.076]
59 Before describing the results, we remark that for the standard regression model (16), the least-squares solution is unbiased for θ∗ , so we expect subsampled averaging to yield little (if any) improvement. [sent-336, score-0.133]
60 The S AVGM method is essentially aimed at correcting the bias of the estimator θ1 , and de-biasing an unbiased estimator only increases its variance. [sent-337, score-0.089]
61 Figure 3: The error ‖θ − θ∗‖₂² plotted against the number of machines m for the AVGM and S AVGM methods, with standard errors across twenty simulations, using the normal regression model (16); panels (a) d = 20 and (b) d = 200. [sent-355, score-0.159]
62 Figure 4: The error ‖θ − θ∗‖₂² plotted against the number of machines m for the AVGM and S AVGM methods, with standard errors across twenty simulations, using the non-normal regression model (18); panel (b): d = 200. [sent-375, score-0.159]
63 To understand settings in which subsampling for bias correction helps, in Figure 4, we plot mean-square error curves for the least-squares regression problem when the vector y is sampled according to the non-normal regression model (18). [sent-411, score-0.174]
64 We use a simpler data generating mechanism, specifically, we draw x ∼ N(0, Id×d ) from a standard d-dimensional normal, and v is chosen uniformly in [0, 1]; in this case, the population minimizer has the closed form θ∗ = u + 3v. [sent-422, score-0.108]
65 Figure 6: The negative log-likelihood of the output of the AVGM, S AVGM, and stochastic methods on the held-out data set for the click-through prediction task; panel (a) is plotted against the number of machines m, and panel (b) against the number of passes. [sent-476, score-0.1]
66 In all our experiments, we assume that the population negative log-likelihood risk has local strong convexity as suggested by Assumption 2. [sent-490, score-0.146]
67 Note that although the SDCA enjoys a faster convergence rate on the regularized empirical risk (Shalev-Shwartz and Zhang, 2012), the plot shows that the SGD has better generalization performance. [sent-501, score-0.076]
68 In Figure 6(a), we show the average hold-out set log-loss (with standard errors) of the estimator θ1 provided by the AVGM method versus number of splits of the data m, and we also plot the log-loss of the S AVGM method using subsampling ratios of r ∈ {. [sent-502, score-0.134]
69 As the number of machines m grows, however, the de-biasing provided by the subsampled bootstrap method yields substantial improvements over the standard AVGM method. [sent-506, score-0.121]
70 This is striking, since doing even one pass through the data with stochastic gradient descent gives minimax optimal convergence rates (Polyak and Juditsky, 1992; Agarwal et al. [sent-508, score-0.131]
71 We explore this question in Figure 8 using m = 128 splits, where we plot the log-loss of the S AVGM estimator on the held-out data set versus the subsampling ratio r. [sent-514, score-0.1]
72 Our methods provide strategies for efficiently solving such large-scale risk minimization problems, enjoying performance comparable to an oracle method that is able to access the entire large data set. [sent-544, score-0.09]
73 More generally, an understanding of the interplay between statistical efficiency and communication could provide an avenue for further research, and it may also be interesting to study the effects of subsampled or bootstrap-based estimators in other distributed environments. [sent-549, score-0.088]
74 The Necessity of Smoothness Here we show that some version of the smoothness conditions presented in Assumption 3 is necessary for averaging methods to attain better mean-squared error than using only the n samples on a single processor. [sent-555, score-0.121]
75 i=1 Using θ1 as shorthand for θ1,i , we see by inspection that the empirical minimizer θ1 is θ1 = n0 n −1 2 n 1 − 2n0 when n0 ≤ n/2 otherwise. [sent-557, score-0.084]
76 Before beginning the proof of Theorem 1 proper, we begin with a simple inequality that relates the error term θ − θ∗ to an average of the errors θi − θ∗ , each of which we can bound in turn. [sent-568, score-0.122]
77 In rough terms, when these events hold, the function F1 behaves similarly to the population risk F0 around the point θ∗ ; since F0 is locally strongly convex, the minimizer θ1 of F1 will be close to θ∗ . [sent-574, score-0.19]
78 Inspecting the Taylor expansion (23), we see that there are several terms of a form similar to (∇2 F0 (θ∗ ) − ∇2 F1 (θ∗ ))(θ1 − θ∗ ); using the smoothness Assumption 3, we can convert these terms into higher order terms involving only θ1 − θ∗ . [sent-591, score-0.092]
79 Proof of Theorem 4 Our proof of Theorem 4 begins with a simple inequality that mimics our first inequality (19) in the proof of Theorem 1. [sent-636, score-0.116]
80 Recall the definitions of the averaged vector θ1 and subsampled averaged vector θ2 . [sent-637, score-0.083]
81 Let θ1 denote the minimizer of the (an arbitrary) empirical risk F1 , and θ2 denote the minimizer of the resampled empirical risk F2 (from the same samples as θ1 ). [sent-638, score-0.242]
82 The lemma requires careful moment control over the expansion θ1 − θ∗ , leading to some technical difficulty, but is similar in spirit to the results leading to Theorem 1. [sent-665, score-0.099]
83 Proof of Theorem 5 We begin by recalling that if θn denotes the output of performing stochastic gradient on one machine, then from the inequality (19) we have the upper bound E[‖θ − θ∗‖₂²] ≤ (1/m) E[‖θn − θ∗‖₂²] + ‖E[θn − θ∗]‖₂². [sent-708, score-0.208]
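For reference, a short sketch of the standard decomposition behind this step, assuming (as throughout) that the m machine outputs are i.i.d. copies θ(1), ..., θ(m) of θn and that θ is their average:

```latex
\mathbb{E}\bigl\|\bar\theta-\theta^*\bigr\|_2^2
  \;=\; \bigl\|\mathbb{E}[\theta^n]-\theta^*\bigr\|_2^2
        \;+\; \frac{1}{m}\,\mathbb{E}\bigl\|\theta^n-\mathbb{E}[\theta^n]\bigr\|_2^2
  \;\le\; \bigl\|\mathbb{E}[\theta^n-\theta^*]\bigr\|_2^2
        \;+\; \frac{1}{m}\,\mathbb{E}\bigl\|\theta^n-\theta^*\bigr\|_2^2 ,
% using independence to kill the cross terms and the fact that the mean
% minimizes the expected squared distance.
```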
84 Let gt = ∇ f (θt ; Xt ) be the gradient of the t th sample in stochastic gradient descent, where we consider running SGD on a single machine. [sent-711, score-0.256]
85 We now state a known result, which gives sharp rates on the convergence of the iterates {θt } in stochastic gradient descent. [sent-713, score-0.1]
86 (Rakhlin et al., 2012) Assume that E[‖gt ‖₂²] ≤ G² for all t. Then with stepsize ηt = c/(λt) for some c ≥ 1, for any t ∈ N we have E[‖θt − θ∗‖₂²] ≤ αG²/(λ²t). [sent-715, score-0.109]
87 To that end, recall the neighborhood Uρ ⊂ Θ in Assumption 5, and note that θt+1 − θ∗ = Π(θt − ηt gt ) − θ∗ = θt − ηt gt − θ∗ + 1(θt+1 ∉ Uρ ) (Π(θt − ηt gt ) − (θt − ηt gt )), since when θ ∈ Uρ , we have Π(θ) = θ. [sent-719, score-0.436]
88 Consequently, an application of the triangle inequality gives ‖E[θt+1 − θ∗ ]‖₂ ≤ ‖E[θt − ηt gt − θ∗ ]‖₂ + E[‖(Π(θt − ηt gt ) − (θt − ηt gt )) 1(θt+1 ∉ Uρ )‖₂ ]. [sent-720, score-0.366]
89 By the definition of the projection and the fact that θt ∈ Θ, we additionally have ‖Π(θt − ηt gt ) − (θt − ηt gt )‖₂ ≤ ‖θt − (θt − ηt gt )‖₂ ≤ ηt ‖gt ‖₂ . [sent-721, score-0.436]
90 By applying Lemma 14, we finally obtain ‖E[θt+1 − θ∗ ]‖₂ ≤ ‖E[θt − ηt gt − θ∗ ]‖₂ + ηt G (αG²/(λ²ρ²t))^{3/4} = ‖E[θt − ηt gt − θ∗ ]‖₂ + (cα^{3/4}G^{5/2}/(λ^{5/2}ρ^{3/2})) · 1/t^{7/4}.   (49) [sent-723, score-0.218]
91 Now we turn to controlling the rate at which θt − ηt gt goes to zero. [sent-724, score-0.109]
92 By defining rt = gt − ∇ ft (θ∗ ) − ∇2 ft (θ∗ )(θt − θ∗ ), a bit of algebra yields gt = ∇ ft (θ∗ ) + ∇2 ft (θ∗ )(θt − θ∗ ) + rt . [sent-726, score-0.438]
93 If θt ∈ Uρ , then Taylor’s theorem implies that rt is the Lagrange remainder rt = (∇2 ft (θ′ ) − ∇2 ft (θ∗ ))(θ′ − θ∗ ), where θ′ = κθt + (1 − κ)θ∗ for some κ ∈ [0, 1]. [sent-731, score-0.187]
94 E[‖rt ‖₂ ] ≤ αLG²/(λ²t) + 3α^{3/4}G^{3/2}(G + HR)/(λ^{3/2}ρ^{3/2}t^{3/4}).   (51) By combining the expansion (50) with the bound (51), we find that ‖E[θt − ηt gt − θ∗ ]‖₂ = ‖E[(I − ηt ∇2 F0 (θ∗ ))(θt − θ∗ ) + ηt rt ]‖₂ ≤ ‖E[(I − ηt ∇2 F0 (θ∗ ))(θt − θ∗ )]‖₂ + cαLG²/(λ³t²) + 3cα^{3/4}G^{3/2}(G + HR)/(λ^{5/2}ρ^{3/2}) · 1/t^{7/4}. [sent-736, score-0.251]
95 In order to prove the conclusion of the lemma, we argue that since F1 is (locally) strongly convex, if the function F1 has small gradient at the point θ∗ , it must be the case that the minimizer θ1 of F1 is near θ∗ . [sent-759, score-0.1]
96 Applying Jensen’s inequality yields ((1/n) ∑i=1,...,n E[‖∇ f (θ∗ ; Xi )‖₂²])^{k/2} ≤ (1/n) ∑i=1,...,n E[‖∇ f (θ∗ ; Xi )‖₂²]^{k/2} ≤ G^k, which completes the proof of the inequality (24). [sent-785, score-0.097]
97 Now, we recall the definition Σ = ∇2 F0 (θ∗ ), the Hessian of the risk at the optimal point, and solve for the error ∆ to see that ∆ = −Σ−1 ∇F1 (θ∗ ) − Σ−1 (∇2 F1 (θ∗ ) − Σ)∆ − Σ−1 ∇3 F1 (θ∗ )(∆ ⊗ ∆) + Σ−1 (∇3 F0 (θ∗ ) − ∇3 F1 (θ′ ))(∆ ⊗ ∆) (53) on the event E . [sent-808, score-0.094]
98 As we did in the proof of Theorem 1, specifically in deriving the recursive equality (28), we may apply the expansion (23) of ∆ = θ1 − θ∗ to obtain a clean asymptotic expansion of ∆ using (53). [sent-809, score-0.118]
99 Making gradient descent optimal for strongly convex stochastic optimization. [sent-1007, score-0.15]
100 Hogwild: a lock-free approach to parallelizing stochastic gradient descent. [sent-1022, score-0.1]
wordName wordTfidf (topN-words)
[('avgm', 0.871), ('ommunication', 0.144), ('uchi', 0.144), ('tatistical', 0.123), ('savgm', 0.117), ('ptimization', 0.111), ('gt', 0.109), ('fficient', 0.096), ('lgorithms', 0.096), ('hang', 0.069), ('sgdavgm', 0.069), ('wainwright', 0.066), ('risk', 0.058), ('population', 0.055), ('subsampling', 0.055), ('minimizer', 0.053), ('stochastic', 0.053), ('advertisement', 0.053), ('averaging', 0.051), ('sgd', 0.048), ('gradient', 0.047), ('machines', 0.047), ('subsampled', 0.047), ('duchi', 0.046), ('hr', 0.044), ('centralized', 0.041), ('expansion', 0.041), ('inequality', 0.039), ('mn', 0.039), ('rt', 0.038), ('ft', 0.036), ('lemma', 0.034), ('splits', 0.034), ('convexity', 0.033), ('smoothness', 0.033), ('oracle', 0.032), ('rd', 0.031), ('descent', 0.031), ('shorthand', 0.031), ('processor', 0.031), ('lder', 0.03), ('twenty', 0.029), ('estimator', 0.027), ('bootstrap', 0.027), ('correction', 0.027), ('lemmas', 0.026), ('hessian', 0.026), ('bound', 0.025), ('minimizers', 0.024), ('passes', 0.024), ('rakhlin', 0.024), ('nemirovski', 0.024), ('argmin', 0.024), ('events', 0.024), ('moment', 0.024), ('lehmann', 0.023), ('manning', 0.023), ('juditsky', 0.023), ('polyak', 0.023), ('sdca', 0.023), ('nm', 0.023), ('distributed', 0.023), ('recalling', 0.022), ('begin', 0.022), ('jensen', 0.022), ('agarwal', 0.021), ('log', 0.021), ('theorem', 0.021), ('xi', 0.021), ('adid', 0.021), ('tokens', 0.021), ('samples', 0.02), ('advertiser', 0.02), ('keyword', 0.02), ('bias', 0.019), ('event', 0.019), ('id', 0.019), ('moments', 0.019), ('convex', 0.019), ('regression', 0.019), ('proof', 0.019), ('plot', 0.018), ('averaged', 0.018), ('necessity', 0.018), ('gender', 0.018), ('remainder', 0.018), ('xt', 0.018), ('involving', 0.018), ('stepsize', 0.018), ('dekel', 0.018), ('nedi', 0.018), ('politis', 0.018), ('proportionally', 0.018), ('title', 0.018), ('estimators', 0.018), ('error', 0.017), ('equality', 0.017), ('unbiased', 0.016), ('rk', 0.016), ('resampling', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
2 0.056948088 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Author: Shai Shalev-Shwartz, Tong Zhang
Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoy strong theoretical guarantees that are comparable or better than SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression
3 0.046192206 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
4 0.043437172 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
5 0.040718451 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.038313463 114 jmlr-2013-The Rate of Convergence of AdaBoost
7 0.03569679 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
8 0.035516046 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
9 0.035367373 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
10 0.034271467 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
11 0.033957757 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
12 0.032740071 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
13 0.032187626 108 jmlr-2013-Stochastic Variational Inference
14 0.031221816 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
15 0.031194948 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
16 0.030159026 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
17 0.030052785 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
18 0.029171096 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
19 0.029106887 76 jmlr-2013-Nonparametric Sparsity and Regularization
20 0.029056743 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
topicId topicWeight
[(0, -0.149), (1, 0.029), (2, 0.04), (3, 0.046), (4, -0.001), (5, -0.043), (6, 0.057), (7, -0.007), (8, 0.039), (9, -0.023), (10, -0.003), (11, 0.002), (12, 0.044), (13, 0.002), (14, -0.003), (15, -0.01), (16, -0.087), (17, -0.103), (18, 0.066), (19, 0.118), (20, 0.035), (21, -0.029), (22, 0.046), (23, 0.031), (24, -0.001), (25, 0.042), (26, -0.182), (27, 0.258), (28, 0.126), (29, -0.066), (30, 0.051), (31, 0.08), (32, 0.017), (33, 0.038), (34, 0.127), (35, 0.066), (36, -0.247), (37, 0.019), (38, 0.211), (39, -0.152), (40, 0.091), (41, -0.211), (42, -0.005), (43, 0.002), (44, 0.047), (45, 0.066), (46, 0.029), (47, -0.008), (48, 0.089), (49, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.86401796 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
2 0.70299715 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Author: Shai Shalev-Shwartz, Tong Zhang
Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoy strong theoretical guarantees that are comparable or better than SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression
3 0.36909059 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos
Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm
4 0.32787973 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels
Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule
5 0.32508942 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
6 0.32234746 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
7 0.30999935 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
8 0.28533354 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
9 0.27102026 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
10 0.25767976 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
11 0.25074011 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
12 0.24472554 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
13 0.2288214 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
14 0.21203949 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
15 0.2041254 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
16 0.19664443 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
17 0.18623531 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
18 0.18544976 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
19 0.18508399 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
20 0.17876746 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
topicId topicWeight
[(0, 0.035), (5, 0.142), (6, 0.061), (10, 0.07), (20, 0.025), (23, 0.034), (44, 0.01), (53, 0.014), (68, 0.03), (70, 0.061), (75, 0.038), (85, 0.042), (87, 0.03), (89, 0.011), (92, 0.262), (93, 0.018)]
simIndex simValue paperId paperTitle
1 0.78554696 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng
Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)
same-paper 2 0.72115904 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
3 0.56700629 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
5 0.56056285 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
6 0.56015778 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
7 0.55593652 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
8 0.55319518 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
9 0.55315495 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
10 0.55288655 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
12 0.55135173 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
13 0.55024207 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
14 0.54938179 76 jmlr-2013-Nonparametric Sparsity and Regularization
15 0.54913419 68 jmlr-2013-Machine Learning with Operational Costs
16 0.54709369 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
17 0.54701352 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
18 0.54483843 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
19 0.54387391 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
20 0.54360604 120 jmlr-2013-Variational Algorithms for Marginal MAP