nips nips2013 nips2013-313 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 -logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. [sent-3, score-0.472]
2 We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. [sent-6, score-0.195]
3 When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. [sent-7, score-0.42]
4 Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. [sent-8, score-0.217]
5 First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. [sent-10, score-0.351]
6 Second, we develop an online DC programming algorithm for non-convex sparse estimation. [sent-11, score-0.188]
7 1 Introduction Majorization-minimization [15] is a simple optimization principle for minimizing an objective function. [sent-13, score-0.17]
8 It consists of iteratively minimizing a surrogate that upper-bounds the objective, thus monotonically driving the objective function value downhill. [sent-14, score-0.396]
9 For instance, the expectation-maximization (EM) algorithm (see [5, 21]) builds a surrogate for a likelihood model by using Jensen’s inequality. [sent-16, score-0.28]
10 Other approaches can also be interpreted under the majorization-minimization point of view, such as DC programming [8], where “DC” stands for difference of convex functions, variational Bayes techniques [28], or proximal algorithms [1, 23, 29]. [sent-17, score-0.28]
11 For such objectives, online techniques based on stochastic approximations have proven to be particularly efficient, and have drawn a lot of attention in machine learning, statistics, and optimization [3–6, 9–12, 14, 16, 17, 19, 22, 24–26, 30]. [sent-20, score-0.245]
12 It consists of iteratively building a surrogate of the expected cost when only a single data point is observed at each iteration; this data point is used to update the surrogate, which in turn is minimized to obtain a new estimate. [sent-22, score-0.355]
13 Some previous works are closely related to this scheme: the online EM algorithm for latent data models [5, 21] and the online matrix factorization technique of [19] involve for instance surrogate functions updated in a similar fashion. [sent-23, score-0.599]
14 Another related work is the incremental majorization-minimization algorithm of [18] for finite training sets; it was indeed shown to be efficient for solving machine learning problems where storing dense information about the past iterates can be afforded. [sent-25, score-0.17]
15 Concretely, this incremental scheme requires storing O(pn) values, where p is the variable size, and n is the size of the training set. [sent-26, score-0.176]
16 This issue was our main motivation for proposing a stochastic scheme with a memory load independent of n, thus allowing us to possibly deal with infinite data sets, or a huge variable size p. [sent-27, score-0.222]
17 We study the convergence properties of our algorithm when the surrogates are strongly convex and chosen among the class of first-order surrogate functions introduced in [18], which consist of approximating the possibly non-smooth objective up to a smooth error. [sent-28, score-0.796]
18 When the objective is convex, we obtain expected convergence rates that are asymptotically optimal, or close to optimal [14, 22]. [sent-29, score-0.178]
19 More precisely, the convergence rate is of order O(1/√n) in a finite horizon setting, and O(1/n) for a strongly convex objective in an infinite horizon setting. [sent-30, score-0.44]
20 Our second analysis shows that for nonconvex problems, our method almost surely converges to a set of stationary points under suitable assumptions. [sent-31, score-0.158]
21 We believe that this result is as valuable as convergence rates for convex optimization. [sent-32, score-0.188]
22 The first one is a new stochastic proximal gradient method for composite or constrained optimization. [sent-35, score-0.352]
23 This algorithm is related to a long series of work in the convex optimization literature [6, 10, 12, 14, 16, 22, 25, 30], and we demonstrate that it performs as well as state-of-the-art solvers for large-scale ℓ1 -logistic regression [7]. [sent-36, score-0.174]
24 The second one is an online DC programming technique, which we demonstrate to be better than batch alternatives for large-scale non-convex sparse estimation [8]. [sent-37, score-0.313]
25 Finally, we show that our scheme can efficiently address structured sparse matrix factorization problems in an online fashion, and offers new possibilities over [13, 19], such as the use of various loss or regularization functions. [sent-38, score-0.394]
26 This paper is organized as follows: Section 2 introduces first-order surrogate functions for batch optimization; Section 3 is devoted to our stochastic approach and its convergence analysis; Section 4 presents several applications and numerical experiments, and Section 5 concludes the paper. [sent-39, score-0.669]
27 The majorization-minimization principle consists of computing a majorizing surrogate gn of f at iteration n and updating the current estimate by θn ∈ arg minθ∈Θ gn (θ). [sent-41, score-0.943]
28 The success of such a scheme depends on how well the surrogates approximate f . [sent-42, score-0.212]
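As a hedged illustration of this batch principle, the sketch below runs the majorization-minimization loop with the Lipschitz-gradient surrogate introduced a few sentences further on, in which case minimizing the surrogate reduces to a gradient step. All function names and the least-squares example are illustrative and not taken from the paper.

```python
import numpy as np

def batch_mm(grad_f, theta0, lipschitz, n_iters=100):
    """Batch majorization-minimization with the Lipschitz-gradient surrogate.

    At each iteration the surrogate g(theta) = f(kappa) + grad_f(kappa)^T (theta - kappa)
    + (L/2) * ||theta - kappa||^2 (kappa being the current estimate) is minimized in
    closed form, which is exactly a gradient step of size 1/L.
    """
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - grad_f(theta) / lipschitz  # argmin of the majorizing surrogate
    return theta

# Illustrative use on f(theta) = 0.5 * ||A theta - b||^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient of f
theta_hat = batch_mm(lambda th: A.T @ (A @ th - b), np.zeros(5), L)
```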
29 In this paper, we consider a particular class of surrogate functions introduced in [18] and defined as follows: Definition 2. [sent-43, score-0.334]
30 We denote by SL,ρ (f, κ) the set of ρ-strongly convex functions g such that g ≥ f , g(κ) = f (κ), the approximation error g − f is differentiable, and the gradient ∇(g − f ) is L-Lipschitz continuous. [sent-46, score-0.247]
31 We call the functions g in SL,ρ (f, κ) “first-order surrogate functions”. [sent-47, score-0.334]
32 Among the first-order surrogate functions presented in [18], we should mention the following ones: • Lipschitz Gradient Surrogates. [sent-48, score-0.334]
33 When f is differentiable and ∇f is L-Lipschitz, f admits the following surrogate g in S2L,L (f, κ): g : θ → f (κ) + ∇f (κ)⊤ (θ − κ) + (L/2) ‖θ − κ‖₂² . [sent-49, score-0.334]
34 Minimizing g amounts to performing a classical gradient descent step θ ← κ − (1/L) ∇f (κ). [sent-51, score-0.201]
35 Minimizing g amounts to a proximal gradient step [1, 23, 29]: θ ← arg minθ (1/2) ‖κ − (1/L) ∇f1 (κ) − θ‖₂² + (1/L) f2 (θ). [sent-58, score-0.241]
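For the ℓ1 case, the proximal gradient step above has the familiar soft-thresholding closed form; the following minimal sketch spells it out. The choice f2 = λ‖·‖1 and all names are assumptions made only for this illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, i.e. coordinate-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(theta, grad_f1, L, lam):
    """One proximal gradient step for f1 + lam * ||.||_1:
    theta <- argmin_t 0.5 * ||theta - grad_f1(theta)/L - t||^2 + (lam/L) * ||t||_1."""
    forward = theta - grad_f1(theta) / L
    return soft_threshold(forward, lam / L)
```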
36 Assume that f = f1 + f2 , where f2 is concave and differentiable, ∇f2 is L2 -Lipschitz, and g1 is in SL1 ,ρ1 (f1 , κ). Then, the following function g is a surrogate in SL1 +L2 ,ρ1 (f, κ): g : θ → g1 (θ) + f2 (κ) + ∇f2 (κ)⊤ (θ − κ). [sent-60, score-0.28]
37 With the definition of first-order surrogates and a basic “batch” algorithm in hand, we now introduce our main contribution: a stochastic scheme for solving large-scale problems. [sent-62, score-0.345]
38 Then, we choose a surrogate gn for the function θ → ℓ(xn , θ), and we use it to update a function ḡn that behaves as an approximate surrogate for the expected cost f . [sent-75, score-1.197]
39 The function ḡn is in fact a weighted average of previously computed surrogates, and involves a sequence of weights (wn )n≥1 that will be discussed later. [sent-76, score-0.325]
40 Then, we minimize ḡn , and obtain a new estimate θn . [sent-77, score-0.281]
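A minimal sketch of this stochastic scheme in the smooth, unconstrained case: when every gn is the Lipschitz-gradient surrogate of ℓ(xn , ·), the weighted average ḡn stays quadratic with fixed curvature, so it can be stored through a single running linear term. The routine below illustrates the recursion ḡn = (1 − wn )ḡn−1 + wn gn with illustrative names; it is not a transcription of Algorithm 1.

```python
import numpy as np

def smm_smooth(grad_loss, sample_stream, theta0, L, weights):
    """Stochastic MM with Lipschitz-gradient surrogates (smooth, unconstrained case).

    Each surrogate has the form g_n(theta) = (L/2)*||theta||^2 + b_n^T theta + const with
    b_n = grad_loss(x_n, theta_{n-1}) - L * theta_{n-1}; averaging the surrogates only
    averages the linear terms, so minimizing g_bar_n is a closed-form step.
    """
    theta = theta0.copy()
    b_bar = -L * theta0          # linear term of g_bar_0(theta) = (L/2)*||theta - theta0||^2
    for w_n, x_n in zip(weights, sample_stream):
        b_n = grad_loss(x_n, theta) - L * theta
        b_bar = (1.0 - w_n) * b_bar + w_n * b_n   # g_bar_n = (1 - w_n) g_bar_{n-1} + w_n g_n
        theta = -b_bar / L                        # theta_n = argmin g_bar_n
    return theta
```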
41 For convex problems, we also propose to use averaging schemes, denoted by “option 2” and “option 3” in Alg. [sent-78, score-0.188]
42 Averaging is a classical technique for improving convergence rates in convex optimization [10, 22] for reasons that are clear in the convergence proofs. [sent-80, score-0.337]
43 output (option 3): θ̂. We remark that Algorithm 1 is only practical when the functions ḡn can be parameterized with a small number of variables, and when they can be easily minimized over Θ. [sent-85, score-0.365]
44 3.1 Convergence Analysis - Convex Case. First, we study the case of convex functions fn : θ → ℓ(θ, xn ), and make the following assumption: (A) for all θ in Θ, the functions fn are R-Lipschitz continuous. [sent-89, score-0.817]
45 Note that for convex functions, this is equivalent to saying that subgradients of fn are uniformly bounded by R. [sent-90, score-0.406]
46 Assumption (A) is classical in the stochastic optimization literature [22]. [sent-91, score-0.183]
47 Our first result shows that with the averaging scheme corresponding to “option 2” in Alg. [sent-92, score-0.161]
48 When the functions fn are convex, under assumption (A), and when ρ = L, we have E[f (θ̄n−1 ) − f ⋆ ] ≤ ( (L/2) ‖θ⋆ − θ0 ‖₂² + (R²/L) Σ_{k=1}^{n} wk² ) / ( Σ_{k=1}^{n} wk ) for all n ≥ 1, (3) where θ̄n−1 is defined in Algorithm 1, θ⋆ is a minimizer of f on Θ, and f ⋆ := f (θ⋆ ). [sent-96, score-0.424]
49 Such a rate is similar to the one of stochastic gradient descent with averaging, see [22] for example. [sent-97, score-0.251]
50 Note that the constraint ρ = L here is compatible with the proximal gradient surrogate. [sent-98, score-0.211]
51 The next corollary shows that in an infinite horizon setting and with decreasing weights, we lose a logarithmic factor compared to an optimal convergence rate [14, 22] of O(1/√n). [sent-104, score-0.156]
52 In practice, we have found that choosing wn = √(n0 + 1)/√(n0 + n) performs well, where n0 is tuned on a subsample of the training set. [sent-111, score-0.469]
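For concreteness, this decreasing weight sequence and the one used later in the experiments can be generated as follows; treating n0 as a plain scalar supplied by the caller is an assumption of this sketch.

```python
import numpy as np

def weights_sqrt(n_max, n0):
    """w_n = sqrt(n0 + 1) / sqrt(n0 + n), the decreasing weights discussed above."""
    n = np.arange(1, n_max + 1)
    return np.sqrt(n0 + 1.0) / np.sqrt(n0 + n)

def weights_harmonic(n_max, n0):
    """w_n = (n0 + 1) / (n + n0), the schedule used later in the experiments."""
    n = np.arange(1, n_max + 1)
    return (n0 + 1.0) / (n + n0)
```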
53 3.2 Convergence Analysis - Strongly Convex Case. In this section, we introduce an additional assumption: (B) the functions fn are µ-strongly convex. [sent-113, score-0.325]
54 We show that our method achieves a rate O(1/n), which is optimal up to a multiplicative constant for strongly convex functions (see [14, 22]). [sent-114, score-0.234]
55 Then, E[f (θ̂n−1 ) − f ⋆ ] + (ρ/2) E[‖θ⋆ − θn ‖₂²] ≤ max( 2R²/µ , (ρ/2) ‖θ⋆ − θ0 ‖₂² ) / (βn + 1) for all n ≥ 1, where θ̂n is defined in Algorithm 1, when choosing the averaging scheme called "option 3". [sent-119, score-0.161]
56 The averaging scheme is slightly different from the one in the previous section, and the weights decrease at a different speed. [sent-120, score-0.205]
57 Again, this rate applies to the proximal gradient surrogates, which satisfy the constraint ρ = L + µ. [sent-121, score-0.24]
58 In such a context, proving convergence to a global (or local) minimum is out of reach, and classical analyses study instead asymptotic stationary point conditions, which involve directional derivatives (see [2, 18]). [sent-125, score-0.198]
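For completeness, the directional derivative entering these stationarity conditions is the standard one-sided limit; the following definition (standard material, not specific to this paper) can be kept in mind when reading the next results.

```latex
\nabla f(\theta, d) \;\triangleq\; \lim_{t \to 0^{+}} \frac{f(\theta + t d) - f(\theta)}{t},
\qquad\text{and } \theta \text{ is called a stationary point of } f \text{ over } \Theta
\text{ when } \nabla f(\theta, \theta' - \theta) \ge 0 \text{ for all } \theta' \in \Theta.
```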
59 The next proposition is a generalization of a convergence result obtained in [19] in the context of sparse matrix factorization. [sent-133, score-0.201]
60 Under assumption (F), we also have that lim inf_{n→+∞} inf_{θ∈Θ} ∇f̄n (θn , θ − θn ) / ‖θ − θn ‖₂ ≥ 0, where the function f̄n is a weighted empirical risk recursively defined as f̄n = (1 − wn )f̄n−1 + wn fn . [sent-137, score-0.834]
61 It can be shown that f̄n uniformly converges to f . [sent-138, score-0.302]
62 Even though f̄n converges uniformly to the expected cost f , Proposition 3. [sent-139, score-0.347]
63 We obtain such a guarantee when the surrogates are parameterized, an assumption always satisfied when Algorithm 1 is used in practice. [sent-141, score-0.147]
64 3, and let us assume that the functions ḡn are parameterized by some variables κn living in a compact set K of Rd . [sent-145, score-0.365]
65 In other words, ḡn can be written as gκn , with κn in K. [sent-146, score-0.281]
66 Finally, we show that our non-convex convergence analysis can be extended beyond first-order surrogate functions—that is, when gn does not satisfy exactly Definition 2. [sent-149, score-0.636]
67 Assume that the functions fn split into fn (θ) = f0,n (θ) + Σ_{k=1}^{K} fk,n (γk (θ)), where the functions γk : Rp → R are convex and R-Lipschitz, and the fk,n are non-decreasing for k ≥ 1. [sent-155, score-0.763]
68 Instead of choosing gn in SL,ρ (fn , θn−1 ) in Alg. 1, replace it by gn : θ → g0,n (θ) + Σ_{k=1}^{K} gk,n (γk (θ)). [sent-157, score-0.562]
69 We can thus use the proximal gradient surrogate presented in Section 2. [sent-170, score-0.491]
70 By defining some other weights wn^i recursively as wn^i := (1 − wn ) wn−1^i for i < n and wn^n := wn , our scheme yields the update rule: θn ← arg min_{θ∈Θ} Σ_{i=1}^{n} wn^i [ ∇fi (θi−1 )⊤ θ + (L/2) ‖θ − θi−1 ‖₂² ] + ψ(θ). [sent-172, score-2.714]
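As a hedged sketch, this update rule can be carried out with two running averages, one over past gradients and one over past iterates, because the quadratic part of the averaged surrogate depends only on them; with ψ the ℓ1 -norm the minimization is a soft-thresholding step. The code below assumes w1 = 1 (so the weights sum to one) and uses illustrative names.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def smm_prox_l1(grad_loss, sample_stream, theta0, L, lam, weights):
    """Stochastic proximal gradient MM sketch for min E[f(theta)] + lam * ||theta||_1.

    z_bar is the weighted average of the gradients grad f_i(theta_{i-1}) and c_bar the
    weighted average of the past iterates theta_{i-1}; the update rule in the text then
    reads theta_n = prox_{(lam/L) ||.||_1}( c_bar - z_bar / L ).
    """
    theta = theta0.copy()
    z_bar = np.zeros_like(theta0)
    c_bar = theta0.copy()
    for w_n, x_n in zip(weights, sample_stream):
        z_bar = (1.0 - w_n) * z_bar + w_n * grad_loss(x_n, theta)
        c_bar = (1.0 - w_n) * c_bar + w_n * theta
        theta = soft_threshold(c_bar - z_bar / L, lam / L)
    return theta
```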
71 These three algorithms indeed use the following update rule: (FOBOS) θn ← arg min_{θ∈Θ} ∇fn (θn−1 )⊤ θ + (1/(2ηn )) ‖θ − θn−1 ‖₂² + ψ(θ). Another related scheme is the regularized dual averaging (RDA) of [30], which can be written as θn ← arg min_{θ∈Θ} (1/n) Σ_{i=1}^{n} ∇fi (θi−1 )⊤ θ + (1/(2ηn )) ‖θ‖₂² + ψ(θ). [sent-174, score-0.354]
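For comparison, here are minimal sketches of the FOBOS and RDA updates written for the same ℓ1 -regularized setting; the stepsize η is left as a caller-chosen parameter, and the usual schedules (e.g. ηn proportional to 1/√n) are not fixed here.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fobos_step(theta, grad, eta, lam):
    """theta <- argmin_t grad^T t + (1/(2*eta)) * ||t - theta||^2 + lam * ||t||_1."""
    return soft_threshold(theta - eta * grad, eta * lam)

def rda_step(grad_sum, n, eta, lam):
    """theta <- argmin_t (1/n) * grad_sum^T t + (1/(2*eta)) * ||t||^2 + lam * ||t||_1."""
    return soft_threshold(-(eta / n) * grad_sum, eta * lam)
```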
72 The function ψ in (4) is the ℓ1 -norm: ψ(θ) := λ ‖θ‖1 , and λ is a regularization parameter; the functions fi are logistic losses: fi (θ) := log(1 + e^{−yi xi⊤ θ} ). [sent-179, score-0.201]
73 We used weights of the form wn := (n0 + 1)/(n + n0 ), where n0 is automatically adjusted at the beginning of each experiment by performing one pass on 5% of the training data. [sent-181, score-0.513]
74 rcv1 and webspam are obtained from the 2008 Pascal large-scale learning challenge. [sent-184, score-0.146]
75 Datasets used in the experiments:
    name      Ntr (train)    Nte (test)    p
    rcv1      781 265        23 149        47 152
    webspam   250 000        100 000       16 091 143
    kdd2010   10 000 000     9 264 097     28 875 157
[sent-186, score-0.146]
76 We compare our implementation with state-of-the-art publicly available solvers: the batch algorithm FISTA of [1] implemented in the C++ SPAMS toolbox and LIBLINEAR v1. [sent-191, score-0.153]
77 We run every method on 1, 2, 3, 4, 5, 10 and 25 epochs (passes over the training set), for three regularization regimes, respectively yielding a solution with approximately 100, 1 000 and 10 000 non-zero coefficients. [sent-195, score-0.243]
78 FISTA is not represented in this figure since it required more than 25 epochs to achieve reasonable values. [sent-197, score-0.153]
79 Such a conclusion is often obtained when comparing batch and stochastic algorithms [4], but matching the performance of LIBLINEAR is very challenging. [sent-200, score-0.234]
80 [Figure 1 axis labels: Epochs and Computation Time (sec), for the datasets rcv1, webspam, and kddb.]
81 Figure 1: Comparison between LIBLINEAR and SMM for the medium regularization regime. [sent-243, score-0.221]
82 A classical way for minimizing the regularized empirical cost (1/N) Σ_{i=1}^{N} fi (θ) + ψ(θ) is to resort to DC programming. [sent-246, score-0.182]
83 At iteration n of Algorithm 1, we define the function gn according to Proposition 3. [sent-250, score-0.281]
84 5: gn : θ → fn (θn−1 ) + ∇fn (θn−1 )⊤ (θ − θn−1 ) + (L/2) ‖θ − θn−1 ‖₂² + λ Σ_{j=1}^{p} |θ[j]| / (|θn−1 [j]| + ε). [sent-251, score-0.552]
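A hedged sketch of the resulting online DC programming iteration: the surrogate above linearizes a concave sparsity-inducing penalty (assumed here to be λ Σj log(|θ[j]| + ε)), and averaging such surrogates as in Algorithm 1 keeps a quadratic-plus-weighted-ℓ1 form whose minimizer is a coordinate-wise soft-thresholding. The names and the w1 = 1 convention are illustrative assumptions.

```python
import numpy as np

def online_dc_sparse(grad_loss, sample_stream, theta0, L, lam, eps, weights):
    """Online DC programming sketch for min E[f(theta)] + lam * sum_j log(|theta_j| + eps)."""
    theta = theta0.copy()
    z_bar = np.zeros_like(theta0)                 # averaged gradients of f
    c_bar = theta0.copy()                         # averaged past iterates
    eta_bar = lam / (np.abs(theta0) + eps)        # averaged reweighted-l1 coefficients
    for w_n, x_n in zip(weights, sample_stream):
        z_bar = (1.0 - w_n) * z_bar + w_n * grad_loss(x_n, theta)
        c_bar = (1.0 - w_n) * c_bar + w_n * theta
        eta_bar = (1.0 - w_n) * eta_bar + w_n * lam / (np.abs(theta) + eps)
        u = c_bar - z_bar / L
        theta = np.sign(u) * np.maximum(np.abs(u) - eta_bar / L, 0.0)  # weighted soft-thresholding
    return theta
```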
85 [Figure 2 legend and axis labels: Online DC vs. Batch DC; Objective on Train Set and Objective on Test Set.] We compare our online DC programming algorithm against the batch one, and report the results in Figure 2, with ε set to 0. [sent-266, score-0.106]
86 We conclude that the batch reweighted-ℓ1 algorithm always converges after 2 or 3 weight updates, but suffers from local minima issues. [sent-268, score-0.156]
87 [Figure 2 axes: Iterations - Epochs, for the datasets rcv1 and webspam.] Figure 2: Comparison between batch and online DC programming, with medium regularization for the datasets rcv1 and webspam. [sent-277, score-0.602]
88 Note that each iteration in the batch setting can perform several epochs (passes over training data). [sent-279, score-0.331]
89 We are given a large collection of signals (xi )_{i=1}^{N} in Rm , and we want to find a dictionary D in Rm×K that can represent these signals in a sparse way. [sent-282, score-0.161]
90 The quality of D is measured through the loss ℓ(x, D) := minα∈RK (1/2) ‖x − Dα‖₂² + λ1 ‖α‖1 + (λ2 /2) ‖α‖₂² , where the ℓ1 -norm can be replaced by any convex regularizer, and the squared loss by any convex smooth loss. [sent-283, score-0.226]
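The loss ℓ(x, D) itself requires solving an elastic-net sparse coding problem; the following sketch evaluates it with a plain ISTA loop (an illustrative solver with illustrative names, not the one used in the paper's experiments).

```python
import numpy as np

def sparse_code(x, D, lam1, lam2, n_iters=200):
    """Solve min_a 0.5*||x - D a||^2 + lam1*||a||_1 + 0.5*lam2*||a||^2 with ISTA."""
    alpha = np.zeros(D.shape[1])
    step_L = np.linalg.norm(D, 2) ** 2 + lam2      # Lipschitz constant of the smooth part
    for _ in range(n_iters):
        grad = D.T @ (D @ alpha - x) + lam2 * alpha
        u = alpha - grad / step_L
        alpha = np.sign(u) * np.maximum(np.abs(u) - lam1 / step_L, 0.0)
    return alpha

def coding_loss(x, D, lam1, lam2):
    a = sparse_code(x, D, lam1, lam2)
    return 0.5 * np.sum((x - D @ a) ** 2) + lam1 * np.sum(np.abs(a)) + 0.5 * lam2 * np.sum(a ** 2)
```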
91 In the matrix factorization framework of [13], it is argued that some applications can benefit from a structured penalty ϕ, but the approach of [13] is not easily amenable to stochastic optimization. [sent-286, score-0.203]
92 Our approach makes it possible by using the proximal gradient surrogate gn : D → ℓ(xn , Dn−1 ) + Tr( ∇D ℓ(xn , Dn−1 )⊤ (D − Dn−1 ) ) + (L/2) ‖D − Dn−1 ‖F² + ϕ(D). [sent-287, score-0.772]
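Putting the pieces together, here is a hedged sketch of one pass of online structured matrix factorization built from this surrogate (it reuses the sparse_code routine from the previous sketch). For simplicity the penalty ϕ is replaced by a unit-norm constraint on the columns of D, so its proximal operator is a projection; the structured penalties of [13, 20] would substitute their own proximal operator, and the constant L is assumed to be supplied by the caller.

```python
import numpy as np

def project_columns(D, radius=1.0):
    """Projection onto {D : ||d_j||_2 <= radius}, a simple stand-in for the prox of phi."""
    norms = np.maximum(np.linalg.norm(D, axis=0), radius)
    return D * (radius / norms)

def online_dictionary_learning(signal_stream, D0, lam1, lam2, L, weights):
    """Online matrix factorization sketch based on the proximal gradient surrogate above."""
    D = D0.copy()
    G_bar = np.zeros_like(D0)   # weighted average of gradients of the coding loss w.r.t. D
    C_bar = D0.copy()           # weighted average of past dictionaries
    for w_n, x_n in zip(weights, signal_stream):
        a = sparse_code(x_n, D, lam1, lam2)        # sparse code of x_n on the current D
        grad = np.outer(D @ a - x_n, a)            # gradient of the coding loss w.r.t. D
        G_bar = (1.0 - w_n) * G_bar + w_n * grad
        C_bar = (1.0 - w_n) * C_bar + w_n * D
        D = project_columns(C_bar - G_bar / L)     # minimize the averaged surrogate
    return D
```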
93 Its proximal operator can be computed efficiently [20], and it is thus easy to use the surrogates (5). [sent-299, score-0.257]
94 5 Conclusion In this paper, we have introduced a stochastic majorization-minimization algorithm that gracefully scales to millions of training samples. [sent-309, score-0.162]
95 We have derived from our framework several new algorithms, which have shown to match or outperform the state of the art for solving large-scale convex problems, and to open up new possibilities for non-convex ones. [sent-311, score-0.162]
96 In the future, we would like to study surrogate functions that can exploit the curvature of the objective function, which we believe is a crucial issue to deal with badly conditioned datasets. [sent-312, score-0.415]
97 Efficient online and batch learning using forward backward splitting. [sent-351, score-0.231]
98 Accelerated gradient methods for stochastic optimization and online learning. [sent-405, score-0.325]
99 A stochastic gradient method with an exponential convergence rate for finite training sets. [sent-446, score-0.346]
100 Dual averaging methods for regularized stochastic learning and online optimization. [sent-540, score-0.315]
wordName wordTfidf (topN-words)
[('wn', 0.416), ('smm', 0.33), ('gn', 0.281), ('surrogate', 0.28), ('liblinear', 0.274), ('fn', 0.271), ('dc', 0.247), ('epochs', 0.153), ('webspam', 0.146), ('proximal', 0.131), ('surrogates', 0.126), ('batch', 0.125), ('convex', 0.113), ('stochastic', 0.109), ('online', 0.106), ('option', 0.095), ('sl', 0.086), ('scheme', 0.086), ('spams', 0.083), ('objective', 0.081), ('proposition', 0.08), ('gradient', 0.08), ('averaging', 0.075), ('convergence', 0.075), ('kddb', 0.071), ('dictionary', 0.067), ('fi', 0.055), ('functions', 0.054), ('differentiable', 0.054), ('xn', 0.054), ('dn', 0.053), ('factorization', 0.053), ('training', 0.053), ('horizon', 0.052), ('stationary', 0.051), ('fobos', 0.047), ('majorizing', 0.047), ('rda', 0.047), ('sparse', 0.046), ('classical', 0.044), ('weights', 0.044), ('dataset', 0.044), ('medium', 0.042), ('structured', 0.041), ('wk', 0.039), ('mairal', 0.039), ('strongly', 0.038), ('sec', 0.038), ('regularization', 0.037), ('incremental', 0.037), ('fista', 0.036), ('programming', 0.036), ('minimizing', 0.035), ('iterates', 0.034), ('assumptions', 0.033), ('descent', 0.033), ('composite', 0.032), ('converges', 0.031), ('solvers', 0.031), ('arg', 0.03), ('ex', 0.03), ('iterations', 0.03), ('optimization', 0.03), ('separable', 0.03), ('update', 0.03), ('parameterized', 0.03), ('consist', 0.029), ('rate', 0.029), ('min', 0.028), ('directional', 0.028), ('toolbox', 0.028), ('em', 0.028), ('nonconvex', 0.027), ('jenatton', 0.027), ('points', 0.027), ('dj', 0.027), ('load', 0.027), ('obozinski', 0.027), ('devoted', 0.026), ('regularizer', 0.026), ('possibilities', 0.025), ('regularized', 0.025), ('rp', 0.024), ('signals', 0.024), ('solving', 0.024), ('principle', 0.024), ('concretely', 0.024), ('nite', 0.023), ('cost', 0.023), ('passes', 0.023), ('regimes', 0.023), ('indeed', 0.022), ('bounded', 0.022), ('surely', 0.022), ('supplemental', 0.022), ('expected', 0.022), ('assumption', 0.021), ('llipschitz', 0.021), ('rakotomamonjy', 0.021), ('ghadimi', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 -logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
2 0.17076066 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O (log n) at ˜ any time step n while achieving a nearly-optimal regression rate of O n−2/(2+d) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting. 1
3 0.13457921 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari
Abstract: The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement. 1
4 0.13150673 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
5 0.1178127 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
6 0.11336499 193 nips-2013-Mixed Optimization for Smooth Functions
7 0.11257221 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
8 0.11080999 233 nips-2013-Online Robust PCA via Stochastic Optimization
9 0.10949855 324 nips-2013-The Fast Convergence of Incremental PCA
10 0.10284644 98 nips-2013-Documents as multiple overlapping windows into grids of counts
11 0.09349066 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
12 0.091435276 171 nips-2013-Learning with Noisy Labels
13 0.091277465 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
14 0.090936579 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
15 0.075860463 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
16 0.074346587 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
17 0.067935035 318 nips-2013-Structured Learning via Logistic Regression
18 0.067700006 6 nips-2013-A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data
19 0.067640655 75 nips-2013-Convex Two-Layer Modeling
20 0.067601949 188 nips-2013-Memory Limited, Streaming PCA
topicId topicWeight
[(0, 0.195), (1, 0.026), (2, 0.086), (3, -0.013), (4, 0.025), (5, 0.061), (6, -0.126), (7, 0.06), (8, 0.056), (9, -0.005), (10, 0.005), (11, 0.119), (12, 0.017), (13, -0.05), (14, 0.009), (15, -0.023), (16, 0.026), (17, 0.07), (18, 0.055), (19, 0.061), (20, -0.013), (21, 0.066), (22, 0.078), (23, 0.074), (24, -0.124), (25, -0.034), (26, -0.077), (27, 0.051), (28, 0.049), (29, 0.068), (30, -0.163), (31, 0.044), (32, -0.117), (33, 0.021), (34, -0.029), (35, -0.022), (36, 0.045), (37, 0.071), (38, -0.022), (39, 0.028), (40, -0.037), (41, -0.091), (42, 0.138), (43, 0.012), (44, 0.053), (45, 0.048), (46, -0.05), (47, -0.06), (48, 0.038), (49, -0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.92385888 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with large-scale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence rate of O(1/√n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 -logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
2 0.70224786 324 nips-2013-The Fast Convergence of Incremental PCA
Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund
Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
3 0.66575909 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
4 0.61810631 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
Author: Harish G. Ramaswamy, Shivani Agarwal, Ambuj Tewari
Abstract: The design of convex, calibrated surrogate losses, whose minimization entails consistency with respect to a desired target loss, is an important concept to have emerged in the theory of machine learning in recent years. We give an explicit construction of a convex least-squares type surrogate loss that can be designed to be calibrated for any multiclass learning problem for which the target loss matrix has a low-rank structure; the surrogate loss operates on a surrogate target space of dimension at most the rank of the target loss. We use this result to design convex calibrated surrogates for a variety of subset ranking problems, with target losses including the precision@q, expected rank utility, mean average precision, and pairwise disagreement. 1
5 0.5851323 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
Author: John Duchi, Michael Jordan, Brendan McMahan
Abstract: We study stochastic optimization problems when the data is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks. 1 Introduction and problem setting In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let {F (·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains the convex set X ⊂ Rd . For a probability distribution P on Ξ, we consider the following optimization problem: minimize f (x) := E[F (x; ξ)] = x∈X F (x; ξ)dP (ξ). (1) Ξ By data sparsity, we mean the samples ξ are sparse: assuming that samples ξ lie in Rd , and defining the support supp(x) of a vector x to the set of indices of its non-zero components, we assume supp F (x; ξ) ⊂ supp ξ. (2) The sparsity condition (2) means that F (x; ξ) does not “depend” on the values of xj for indices j such that ξj = 0.1 This type of data sparsity is prevalent in statistical optimization problems and machine learning applications; in spite of its prevalence, study of such problems has been limited. As a motivating example, consider a text classification problem: data ξ ∈ Rd represents words appearing in a document, and we wish to minimize a logistic loss F (x; ξ) = log(1 + exp( ξ, x )) on the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance, very few entries of ξ are non-zero [8]. From a modelling perspective, it thus makes sense to allow a dense predictor x: any non-zero entry of ξ is potentially relevant and important. In a sense, this is dual to the standard approaches to high-dimensional problems; one usually assumes that the data ξ may be dense, but there are only a few relevant features, and thus a parsimonious model x is desirous [2]. So 1 Formally, if πξ denotes the coordinate projection zeroing all indices j of its argument where ξj = 0, then F (πξ (x); ξ) = F (x; ξ) for all x, ξ. This follows from the first-order conditions for convexity [6]. 1 while such sparse data problems are prevalent—natural language processing, information retrieval, and other large data settings all have significant data sparsity—they do not appear to have attracted as much study as their high-dimensional “duals” of dense data and sparse predictors. In this paper, we investigate algorithms and their inherent limitations for solving problem (1) under natural conditions on the data generating distribution. Recent work in the optimization and machine learning communities has shown that data sparsity can be leveraged to develop parallel (and even asynchronous [12]) optimization algorithms [13, 14], but this work does not consider the statistical effects of data sparsity. In another line of research, Duchi et al. 
[4] and McMahan and Streeter [9] develop “adaptive” stochastic gradient algorithms to address problems in sparse data regimes (2). These algorithms exhibit excellent practical performance and have theoretical guarantees on their convergence, but it is not clear if they are optimal—in that no algorithm can attain better statistical performance—or whether they can leverage parallel computing as in the papers [12, 14]. In this paper, we take a two-pronged approach. First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition (2). These results have two main implications. They show that in some scenarios, learning with sparse data is quite difficult, as essentially each coordinate j ∈ [d] can be relevant and must be optimized for. In spite of this seemingly negative result, we are also able to show that the A DAG RAD algorithms of [4, 9] are optimal, and we show examples in which their dependence on the dimension d can be made exponentially better than standard gradient methods. As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity in terms of the number of samples ξ used. We develop two new algorithms, asynchronous dual averaging (A SYNC DA) and asynchronous A DAG RAD (A SYNC A DAG RAD), which allow asynchronous parallel solution of the problem (1) for general convex f and X . Combining insights of Niu et al.’s H OGWILD ! [12] with a new analysis, we prove our algorithms achieve linear speedup in the number of processors while maintaining optimal statistical guarantees. We also give experiments on text-classification and web-advertising tasks to illustrate the benefits of the new algorithms. 2 Minimax rates for sparse optimization We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let x denote any estimator for a minimizer of the objective (1). We define the optimality gap N for the estimator x based on N samples ξ 1 , . . . , ξ N from the distribution P as N (x, F, X , P ) := f (x) − inf f (x) = EP [F (x; ξ)] − inf EP [F (x; ξ)] . x∈X x∈X This quantity is a random variable, since x is a random variable (it is a function of ξ 1 , . . . , ξ N ). To define the minimax error, we thus take expectations of the quantity N , though we require a bit more than simply E[ N ]. We let P denote a collection of probability distributions, and we consider a collection of loss functions F specified by a collection F of convex losses F : X × ξ → R. We can then define the minimax error for the family of losses F and distributions P as ∗ N (X , P, F) := inf sup sup EP [ x P ∈P F ∈F N (x(ξ 1:N ), F, X , P )], (3) where the infimum is taken over all possible estimators x (an estimator is an optimization scheme, or a measurable mapping x : ΞN → X ) . 2.1 Minimax lower bounds Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider to provide the lower bound. For the next proposition, we let P consist of distributions supported on Ξ = {−1, 0, 1}d , and we let pj := P (ξj = 0) be the marginal probability of appearance of feature j ∈ {1, . . . , d}. 
For our class of functions, we set F to consist of functions F satisfying the sparsity condition (2) and with the additional constraint that for g ∈ ∂x F (x; ξ), we have that the jth coordinate |gj | ≤ Mj for a constant Mj < ∞. We obtain 2 Proposition 1. Let the conditions of the preceding paragraph hold. Let R be a constant such that X ⊃ [−R, R]d . Then √ d pj 1 ∗ . Mj min pj , √ N (X , P, F) ≥ R 8 j=1 N log 3 We provide the proof of Proposition 1 in the supplement A.1 in the full version of the paper, providing a few remarks here. We begin by giving a corollary to Proposition 1 that follows when the data ξ obeys a type of power law: let p0 ∈ [0, 1], and assume that P (ξj = 0) = p0 j −α . We have Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with Mj ≡ M for all j, and assume the power law condition P (ξj = 0) = p0 j −α on coordinate appearance probabilities. Then (1) If d > (p0 N )1/α , ∗ N (X , P, F) ≥ 2−α 1−α p0 p0 (p0 N ) 2α − 1 + d1−α − (p0 N ) α N 1−α 2 MR 8 2−α (2) If d ≤ (p0 N )1/α , ∗ N (X , P, F) ≥ MR 8 p0 N α 1 1 d1− 2 − 1 − α/2 1 − α/2 . . Expanding Corollary 2 slightly, for simplicity assume the number of samples is large enough that d ≤ (p0 N )1/α . Then we find that the lower bound on optimization error is of order p0 1− α p0 p0 d 2 when α < 2, M R log d when α → 2, and M R when α > 2. (4) N N N These results beg the question of tightness: are they improvable? As we see presently, they are not. MR 2.2 Algorithms for attaining the minimax rate To show that the lower bounds of Proposition 1 and its subsequent specializations are sharp, we review a few stochastic gradient algorithms. We begin with stochastic gradient descent (SGD): SGD repeatedly samples ξ ∼ P , computes g ∈ ∂x F (x; ξ), then performs the update x ← ΠX (x − ηg), where η is a stepsize parameter and ΠX denotes Euclidean projection onto X . Standard analyses of stochastic gradient descent [10] show that after N samples ξ i , the SGD estimator x(N ) satisfies R2 M ( d j=1 1 pj ) 2 √ , (5) N where R2 denotes the 2 -radius of X . Dual averaging, due to Nesterov [11] (sometimes called “follow the regularized leader” [5]) is a more recent algorithm. In dual averaging, one again samples g ∈ ∂x F (x; ξ), but instead of updating the parameter vector x one updates a dual vector z by z ← z + g, then computes 1 x ← argmin z, x + ψ(x) , η x∈X E[f (x(N ))] − inf f (x) ≤ O(1) x∈X 2 1 where ψ(x) is a strongly convex function defined over X (often one takes ψ(x) = 2 x 2 ). As we discuss presently, the dual averaging algorithm is somewhat more natural in asynchronous and parallel computing environments, and it enjoys the same type of convergence guarantees (5) as SGD. The A DAG RAD algorithm [4, 9] is an extension of the preceding stochastic gradient methods. It maintains a diagonal matrix S, where upon receiving a new sample ξ, A DAG RAD performs the following: it computes g ∈ ∂x F (x; ξ), then updates 2 Sj ← Sj + gj for j ∈ [d]. The dual averaging variant of A DAG RAD updates the usual dual vector z ← z + g; the update to x is based on S and a stepsize η and computes x ← argmin z, x + x ∈X 3 1 1 x ,S2x 2η . After N samples ξ, the averaged parameter x(N ) returned by A DAG RAD satisfies R∞ M E[f (x(N ))] − inf f (x) ≤ O(1) √ x∈X N d √ pj , (6) j=1 where R∞ denotes the ∞ -radius of X (cf. [4, Section 1.3 and Theorem 5]). By inspection, the A DAG RAD rate (6) matches the lower bound in Proposition 1 and is thus optimal. 
It is interesting to note, though, that in the power law setting of Corollary 2 (recall the error order (4)), a calculation √ shows that the multiplier for the SGD guarantee (5) becomes R∞ d max{d(1−α)/2 , 1}, while A DA G RAD attains rate at worst R∞ max{d1−α/2 , log d}. For α > 1, the A DAG RAD rate is no worse, √ and for α ≥ 2, is more than d/ log d better—an exponential improvement in the dimension. 3 Parallel and asynchronous optimization with sparsity As we note in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of A DAG RADtype algorithms, it is natural to focus on their parallelization in the hope that we can leverage their ability to “adapt” to sparsity in the data. To provide the setting for our further algorithms, we first revisit Niu et al.’s H OGWILD ! [12]. H OGWILD ! is an asynchronous (parallelized) stochastic gradient algorithm for optimization over product-space domains, meaning that X in problem (1) decomposes as X = X1 × · · · × Xd , where Xj ⊂ R. Fix a stepsize η > 0. A pool of independently running processors then performs the following updates asynchronously to a centralized vector x: 1. Sample ξ ∼ P 2. Read x and compute g ∈ ∂x F (x; ξ) 3. For each j s.t. gj = 0, update xj ← ΠXj (xj − ηgj ). Here ΠXj denotes projection onto the jth coordinate of the domain X . The key of H OGWILD ! is that in step 2, the parameter x is allowed to be inconsistent—it may have received partial gradient updates from many processors—and for appropriate problems, this inconsistency is negligible. Indeed, Niu et al. [12] show linear speedup in optimization time as the number of processors grow; they show this empirically in many scenarios, providing a proof under the somewhat restrictive assumptions that there is at most one non-zero entry in any gradient g and that f has Lipschitz gradients. 3.1 Asynchronous dual averaging A weakness of H OGWILD ! is that it appears only applicable to problems for which the domain X is a product space, and its analysis assumes g 0 = 1 for all gradients g. In effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, A SYNC DA. A SYNC DA maintains and upates a centralized dual vector z instead of a parameter x, and a pool of processors perform asynchronous updates to z, where each processor independently iterates: 1. Read z and compute x := argminx∈X 1 z, x + η ψ(x) // Implicitly increment “time” counter t and let x(t) = x 2. Sample ξ ∼ P and let g ∈ ∂x F (x; ξ) // Let g(t) = g. 3. For j ∈ [d] such that gj = 0, update zj ← zj + gj . Because the actual computation of the vector x in A SYNC DA is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function ψ and domain X . The only communication point between any of the processors is the addition operation in step 3. Since addition is commutative and associative, forcing all asynchrony to this point of the algorithm is a natural strategy for avoiding synchronization problems. In our analysis of A SYNC DA, and in our subsequent analysis of the adaptive methods, we require a measurement of time elapsed. With that in mind, we let t denote a time index that exists (roughly) behind-the-scenes. We let x(t) denote the vector x ∈ X computed in the tth step 1 of the A SYNC DA 4 algorithm, that is, whichever is the tth x actually computed by any of the processors. 
This quantity t exists and is recoverable from the algorithm, and it is possible to track the running sum τ =1 x(τ ). Additionally, we state two assumptions encapsulating the conditions underlying our analysis. Assumption A. There is an upper bound m on the delay of any processor. In addition, for each j ∈ [d] there is a constant pj ∈ [0, 1] such that P (ξj = 0) ≤ pj . We also require certain continuity (Lipschitzian) properties of the loss functions; these amount to a second moment constraint on the instantaneous ∂F and a rough measure of gradient sparsity. Assumption B. There exist constants M and (Mj )d such that the following bounds hold for all j=1 2 x ∈ X : E[ ∂x F (x; ξ) 2 ] ≤ M2 and for each j ∈ [d] we have E[|∂xj F (x; ξ)|] ≤ pj Mj . With these definitions, we have the following theorem, which captures the convergence behavior of A SYNC DA under the assumption that X is a Cartesian product, meaning that X = X1 × · · · × Xd , 2 where Xj ⊂ R, and that ψ(x) = 1 x 2 . Note the algorithm itself can still be efficiently parallelized 2 for more general convex X , even if the theorem does not apply. Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then T E t=1 F (x(t); ξ t ) − F (x∗ ; ξ t ) ≤ 1 x∗ 2η d 2 2 η 2 p2 Mj . + T M2 + ηT m j 2 j=1 We now provide a few remarks to explain and simplify the result. Under the more stringent condition 2 d 2 that |∂xj F (x; ξ)| ≤ Mj , Assumption A implies E[ ∂x F (x; ξ) 2 ] ≤ j=1 pj Mj . Thus, for the d 2 remainder of this section we take M2 = j=1 pj Mj , which upper bounds the Lipschitz continuity constant of the objective function f . We then obtain the following corollary. √ T 1 Corollary 4. Define x(T ) = T t=1 x(t), and set η = x∗ 2 /M T . Then E[f (x(T )) − f (x∗ )] ≤ M x∗ √ T 2 +m x∗ 2 √ 2M T d 2 p2 M j . j j=1 Corollary 4 is nearly immediate: since ξ t is independent of x(t), we have E[F (x(t); ξ t ) | x(t)] = f (x(t)); applying Jensen’s inequality to f (x(T )) and performing an algebraic manipulation give the result. If the data is suitably sparse, meaning that pj ≤ 1/m, the bound in Corollary 4 simplifies to 3 M x∗ √ E[f (x(T )) − f (x )] ≤ 2 T ∗ 2 3 = 2 d j=1 2 pj M j x ∗ √ T 2 , (7) which is the convergence rate of stochastic gradient descent even in centralized settings (5). The √ convergence guarantee (7) shows that after T timesteps, the error scales as 1/ T ; however, if we have k processors, updates occur roughly k times as quickly, as they are asynchronous, and in time scaling as N/k, we can evaluate N gradient samples: a linear speedup. 3.2 Asynchronous AdaGrad We now turn to extending A DAG RAD to asynchronous settings, developing A SYNC A DAG RAD (asynchronous A DAG RAD). As in the A SYNC DA algorithm, A SYNC A DAG RAD maintains a shared dual vector z (the sum of gradients) and the shared matrix S, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The matrix S is initialized as diag(δ 2 ), where δj ≥ 0 is an initial value. Each processor asynchronously performs the following iterations: 1 1 1. Read S and z and set G = S 2 . Compute x := argminx∈X { z, x + 2η x, Gx } increment “time” counter t and let x(t) = x, S(t) = S 2. Sample ξ ∼ P and let g ∈ ∂F (x; ξ) 2 3. For j ∈ [d] such that gj = 0, update Sj ← Sj + gj and zj ← zj + gj . 5 // Implicitly As in the description of A SYNC DA, we note that x(t) is the vector x ∈ X computed in the tth “step” of the algorithm (step 1), and similarly associate ξ t with x(t). 
To analyze A SYNC A DAG RAD, we make a somewhat stronger assumption on the sparsity properties of the losses F than Assumption B. 2 Assumption C. There exist constants (Mj )d such that E[(∂xj F (x; ξ))2 | ξj = 0] ≤ Mj for all j=1 x ∈ X. 2 Indeed, taking M2 = j pj Mj shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result. Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that for all 2 j we have δ 2 ≥ Mj m and X ⊂ [−R∞ , R∞ ]d . Then T t=1 E F (x(t); ξ t ) − F (x∗ ; ξ t ) d ≤ min j=1 T 1 2 R E η ∞ 2 δ + gj (t) 2 1 2 T + ηE gj (t) t=1 2 1 2 (1 + pj m), Mj R∞ pj T . t=1 It is possible to relax the condition on the initial constant diagonal term; we defer this to the full version of the paper. It is natural to ask in which situations the bound provided by Theorem 5 is optimal. We note that, as in the case with Theorem 3, we may obtain a convergence rate for f (x(T )) − f (x∗ ) using convexity, T 1 where x(T ) = T t=1 x(t). By Jensen’s inequality, we have for any δ that T E 2 δ + gj (t) 2 1 2 t=1 T ≤ 2 2 E[gj (t) ] δ + t=1 1 2 ≤ 2 δ 2 + T pj Mj . For interpretation, let us now make a few assumptions on the probabilities pj . If we assume that pj ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that d log(T )/T + pj √ (8) , pj , T j=1 √ which is the convergence rate of A DAG RAD except for a small factor of min{ log T /T, pj } in addition to the usual pj /T rate. In particular, optimizing by choosing η = R∞ , and assuming 1 pj T log T , we have convergence guarantee √ d pj E[f (x(T )) − f (x∗ )] ≤ O(1)R∞ Mj min √ , pj , T j=1 E[f (x(T )) − f (x∗ )] ≤ O(1) 1 2 R +η η ∞ Mj min which is minimax optimal by Proposition 1. In fact, however, the bounds of Theorem 5 are somewhat stronger: they provide bounds using the expectation of the squared gradients gj (t) rather than the maximal value Mj , though the bounds are perhaps clearer in the form (8). We note also that our analysis applies to more adversarial settings than stochastic optimization (e.g., to online convex optimization [5]). Specifically, an adversary may choose an arbitrary sequence of functions subject to the random data sparsity constraint (2), and our results provide an expected regret bound, which is strictly stronger than the stochastic convergence guarantees provided (and guarantees high-probability convergence in stochastic settings [3]). Moreover, our comments in Section 2 about the relative optimality of A DAG RAD versus standard gradient methods apply. When the data is sparse, we indeed should use asynchronous algorithms, but using adaptive methods yields even more improvement than simple gradient-based methods. 4 Experiments In this section, we give experimental validation of our theoretical results on A SYNC A DAG RAD and A SYNC DA, giving results on two datasets selected for their high-dimensional sparsity.2 2 In our experiments, A SYNC DA and H OGWILD ! had effectively identical performance. 6 8 0.07 6 5 4 0.024 Test error Training loss Speedup 0.025 0.065 7 0.06 0.055 0.05 0.045 0.04 0.023 0.022 0.021 0.02 0.035 3 0.019 2 1 2 4 0.03 A-A DAG RAD A SYNC DA Number of workers 6 8 10 12 14 0.018 0.025 0.02 16 2 4 6 8 10 12 14 Number of workers 0.017 16 2 4 6 8 10 12 14 Number of workers 16 Figure 1. Experiments with URL data. Left: speedup relative to one processor. Middle: training dataset loss versus number of processors. Right: test set error rate versus number of processors. 
AA DAG RAD abbreviates A SYNC A DAG RAD. 1.03 1.02 1.01 1.00 1.0 1 2 4 8 16 64 256 number of passes A-AdaGrad, η = 0.008 L2 = 0 A-AdaGrad, η = 0.008 L2 = 80 A-DA, η = 0.8 L2 = 0 A-DA, η = 0.8 L2 = 80 1.00 1.01 1.4 1.02 1.03 1.04 Impact of L2 regularizaton on test error 1.04 Fixed stepsizes, test data, L2=0 1.2 relative log-loss 1.6 1.8 Fixed stepsizes, training data, L2=0 A-AdaGrad η = 0.002 A-AdaGrad η = 0.004 A-AdaGrad η = 0.008 A-AdaGrad η = 0.016 A-DA η = 0.800 A-DA η = 1.600 A-DA η = 3.200 1 2 4 8 16 32 number of passes 64 128 256 1 2 4 8 16 32 64 128 256 number of passes Figure 2: Relative accuracy for various stepsize choices on an click-through rate prediction dataset. 4.1 Malicious URL detection For our first set of experiments, we consider the speedup attainable by applying A SYNC A DAG RAD and A SYNC DA, investigating the performance of each algorithm on a malicious URL prediction task [7]. The dataset in this case consists of an anonymized collection of URLs labeled as malicious (e.g., spam, phishing, etc.) or benign over a span of 120 days. The data in this case consists of 2.4 · 106 examples with dimension d = 3.2 · 106 (sparse) features. We perform several experiments, randomly dividing the dataset into 1.2 · 106 training and test samples for each experiment. In Figure 1 we compare the performance of A SYNC A DAG RAD and A SYNC DA after doing after single pass through the training dataset. (For each algorithm, we choose the stepsize η for optimal training set performance.) We perform the experiments on a single machine running Ubuntu Linux with six cores (with two-way hyperthreading) and 32Gb of RAM. From the left-most plot in Fig. 1, we see that up to six processors, both A SYNC DA and A SYNC A DAG RAD enjoy the expected linear speedup, and from 6 to 12, they continue to enjoy a speedup that is linear in the number of processors though at a lesser slope (this is the effect of hyperthreading). For more than 12 processors, there is no further benefit to parallelism on this machine. The two right plots in Figure 1 plot performance of the different methods (with standard errors) versus the number of worker threads used. Both are essentially flat; increasing the amount of parallelism does nothing to the average training loss or the test error rate for either method. It is clear, however, that for this dataset, the adaptive A SYNC A DAG RAD algorithm provides substantial performance benefits over A SYNC DA. 4.2 Click-through-rate prediction experiments We also experiment on a proprietary datasets consisting of search ad impressions. Each example corresponds to showing a search-engine user a particular text ad in response to a query string. From this, we construct a very sparse feature vector based on the text of the ad displayed and the query string (no user-specific data is used). The target label is 1 if the user clicked the ad and -1 otherwise. 7 (B) A-AdaGrad speedup (D) Impact of training data ordering 1.004 1.005 1.006 1.007 1.008 1 2 4 8 16 32 number of passes 64 128 256 1.000 1 2 A-DA base η = 1.600 A-AdaGrad base η = 0.023 0 1.005 relative stepsize (C) Optimal stepsize scaling relative log-loss 1.003 target relative log-loss 1.005 1.010 1.002 1.010 1.015 8 4 0 speedup A-DA η = 1.600 A-AdaGrad η = 0.016 1.001 1.000 relative log-loss 1.015 A-DA, L2=80 A-AdaGrad, L2=80 12 (A) Optimized stepsize for each number of passes 1 2 4 8 16 32 number of passes 64 128 256 1 2 4 8 16 32 64 128 256 number of passes Figure 3. 
(A) Relative test-set log-loss for A SYNC DA and A SYNC A DAG RAD, choosing the best stepsize (within a factor of about 1.4×) individually for each number of passes. (B) Effective speedup for A SYNC A DAG RAD. (C) The best stepsize η, expressed as a scaling factor on the stepsize used for one pass. (D) Five runs with different random seeds for each algorithm (with 2 penalty 80). We fit logistic regression models using both A SYNC DA and A SYNC A DAG RAD. We run extensive experiments on a moderate-sized dataset (about 107 examples, split between training and testing), which allows thorough investigation of the impact of the stepsize η, the number of training passes,3 and 2 -regularization on accuracy. For these experiments we used 32 threads on 16 core machines for each run, as A SYNC A DAG RAD and A SYNC DA achieve similar speedups from parallelization. On this dataset, A SYNC A DAG RAD typically achieves an effective additional speedup over A SYNC DA of 4× or more. That is, to reach a given level of accuracy, A SYNC DA generally needs four times as many effective passes over the dataset. We measure accuracy with log-loss (the logistic loss) averaged over five runs using different random seeds (which control the order in which the algorithms sample examples during training). We report relative values in Figures 2 and 3, that is, the ratio of the mean loss for the given datapoint to the lowest (best) mean loss obtained. Our results are not particularly sensitive to the choice of relative log-loss as the metric of interest; we also considered AUC (the area under the ROC curve) and observed similar results. Figure 2 shows relative log-loss as a function of the number of training passes for various stepsizes. Without regularization, A SYNC A DAG RAD is prone to overfitting: it achieves significantly higher accuracy on the training data (Fig. 2 (left)), but unless the stepsize is tuned carefully to the number of passes, it will overfit (Fig. 2 (middle)). Fortunately, the addition of 2 regularization largely solves this problem. Indeed, Figure 2 (right) shows that while adding an 2 penalty of 80 has very little impact on A SYNC DA, it effectively prevents the overfitting of A SYNC A DAG RAD.4 Fixing √ regularization multiplier to 80, we varied the stepsize η over a multiplicative grid with res2 olution 2 for each number of passes and for each algorithm. Figure 3 reports the results obtained by selecting the best stepsize in terms of test set log-loss for each number of passes. Figure 3(A) shows relative log-loss of the best stepsize for each algorithm; 3(B) shows the relative time A SYNC DA requires with respect to A SYNC A DAG RAD to achieve a given loss. Specifically, Fig. 3(B) shows the ratio of the number of passes the algorithms require to achieve a fixed loss, which gives a broader estimate of the speedup obtained by using A SYNC A DAG RAD; speedups range from 3.6× to 12×. Figure 3(C) shows the optimal stepsizes as a function of the best setting for one pass. The optimal stepsize decreases moderately for A SYNC A DAG RAD, but are somewhat noisy for A SYNC DA. It is interesting to note that A SYNC A DAG RAD’s accuracy is largely independent of the ordering of the training data, while A SYNC DA shows significant variability. This can be seen both in the error bars on Figure 3(A), and explicitly in Figure 3(D), where we plot one line for each of the five random seeds used. 
Footnote 3: Here “number of passes” more precisely means the expected number of times each example in the dataset is trained on. That is, each worker thread randomly selects a training example from the dataset for each update, and updates continue until (dataset size) × (number of passes) updates have been processed.

Footnote 4: For both algorithms, this is accomplished by adding the term η · 80 · ‖x‖₂² to the ψ function. We can achieve slightly better results for AsyncAdaGrad by varying the ℓ2 penalty with the number of passes.

References

[1] P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.
[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.
[4] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[5] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.
[6] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1996.
[7] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying malicious URLs: an application of large-scale online learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[8] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[9] B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory, 2010.
[10] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[11] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.
[12] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.
[13] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 [math.OC], 2012. URL http://arxiv.org/abs/1212.0873.
[14] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
6 0.58060527 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
7 0.57919592 215 nips-2013-On Decomposing the Proximal Map
8 0.57561064 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
9 0.56457281 193 nips-2013-Mixed Optimization for Smooth Functions
10 0.55304551 249 nips-2013-Polar Operators for Structured Sparse Estimation
11 0.52001709 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
12 0.50600147 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
13 0.50404787 171 nips-2013-Learning with Noisy Labels
14 0.50110739 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
15 0.50046885 202 nips-2013-Multiclass Total Variation Clustering
16 0.50022119 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
17 0.49895036 269 nips-2013-Regression-tree Tuning in a Streaming Setting
18 0.49852973 252 nips-2013-Predictive PAC Learning and Process Decompositions
19 0.48752606 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
20 0.48094425 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
topicId topicWeight
[(16, 0.035), (33, 0.169), (34, 0.088), (36, 0.02), (41, 0.037), (49, 0.024), (56, 0.096), (70, 0.05), (85, 0.04), (89, 0.025), (93, 0.065), (95, 0.022), (99, 0.258)]
simIndex simValue paperId paperTitle
1 0.88332522 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
Author: Jukka Corander, Tomi Janhunen, Jussi Rintanen, Henrik Nyman, Johan Pensar
Abstract: We investigate the problem of learning the structure of a Markov network from data. It is shown that the structure of such networks can be described in terms of constraints, which enables the use of existing solver technology with optimization capabilities to compute optimal networks starting from initial scores computed from the data. To achieve efficient encodings, we develop a novel characterization of Markov network structure using a balancing condition on the separators between cliques forming the network. The resulting translations into propositional satisfiability and its extensions such as maximum satisfiability, satisfiability modulo theories, and answer set programming enable us to prove the optimality of certain networks which have previously been found by stochastic search. 1
2 0.87187594 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
Author: Hossein Azari Soufiani, William Chen, David C. Parkes, Lirong Xia
Abstract: In this paper we propose a class of efficient Generalized Method-of-Moments (GMM) algorithms for computing parameters of the Plackett-Luce model, where the data consists of full rankings over alternatives. Our technique is based on breaking the full rankings into pairwise comparisons, and then computing parameters that satisfy a set of generalized moment conditions. We identify conditions for the output of GMM to be unique, and identify a general class of consistent and inconsistent breakings. We then show by theory and experiments that our algorithms run significantly faster than the classical Minorize-Maximization (MM) algorithm, while achieving competitive statistical efficiency. 1
3 0.83090621 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
Author: Leonid Boytsov, Bilegsaikhan Naidan
Abstract: Our focus is on approximate nearest neighbor retrieval in metric and non-metric spaces. We employ a VP-tree and explore two simple yet effective learning-to-prune approaches: density estimation through sampling and “stretching” of the triangle inequality. Both methods are evaluated using data sets with metric (Euclidean) and non-metric (KL-divergence and Itakura-Saito) distance functions. Conditions on spaces where the VP-tree is applicable are discussed. The VP-tree with a learned pruner is compared against the recently proposed state-of-the-art approaches: the bbtree, the multi-probe locality sensitive hashing (LSH), and permutation methods. Our method was competitive against state-of-the-art methods and, in most cases, was more efficient for the same rank approximation quality. 1
4 0.80123043 283 nips-2013-Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model
Author: Fang Han, Han Liu
Abstract: In this paper we focus on principal component regression and its application to high-dimensional non-Gaussian data. The major contributions are twofold. First, in low dimensions and under the Gaussian model, by borrowing strength from recent developments in minimax-optimal principal component estimation, we sharply characterize for the first time the potential advantage of classical principal component regression over least-squares estimation. Second, we propose and analyze a new robust sparse principal component regression for high-dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well-known distributions such as the multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy-tailed and to have tail dependence. These extra flexibilities make it well suited to modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients at the optimal parametric rate and is therefore a good alternative to Gaussian-based methods. Experiments on synthetic and real-world data are conducted to illustrate the empirical usefulness of the proposed method. 1
same-paper 5 0.79977447 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with largescale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence √ rate of O(1/ n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
6 0.7880457 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
7 0.67786729 175 nips-2013-Linear Convergence with Condition Number Independent Access of Full Gradients
8 0.66722149 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search
9 0.66552365 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
10 0.66540909 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
11 0.66395265 251 nips-2013-Predicting Parameters in Deep Learning
12 0.66342604 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
13 0.66233152 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
14 0.66218609 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
15 0.66160202 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
16 0.66101027 331 nips-2013-Top-Down Regularization of Deep Belief Networks
17 0.66099131 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
18 0.66062289 30 nips-2013-Adaptive dropout for training deep neural networks
19 0.66034508 201 nips-2013-Multi-Task Bayesian Optimization
20 0.65824175 300 nips-2013-Solving the multi-way matching problem by permutation synchronization