jmlr jmlr2013 jmlr2013-107 knowledge-graph by maker-knowledge-mining

107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization


Source: pdf

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked a good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees that are comparable to, or better than, those of SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression 1. [sent-10, score-0.254]

2 An alternative approach is dual coordinate ascent (DCA), which solves a dual problem of (1). [sent-38, score-0.331]

3 The dual problem is max_{α∈Rⁿ} D(α), where D(α) = (1/n) ∑_{i=1}^n −φi∗(−αi) − (λ/2) ‖(1/(λn)) ∑_{i=1}^n αi xi‖². [sent-40, score-0.176]

4 (2) The dual objective in (2) has a different dual variable associated with each example in the training set. [sent-41, score-0.271]

5 At each iteration of DCA, the dual objective is optimized with respect to a single dual variable, while the rest of the dual variables are kept intact. [sent-42, score-0.398]

6 It is also known that P(w∗ ) = D(α∗ ) which immediately implies that for all w and α, we have P(w) ≥ D(α), and hence the duality gap defined as P(w(α)) − D(α) can be regarded as an upper bound of the primal sub-optimality P(w(α)) − P(w∗ ). [sent-44, score-0.227]
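This weak-duality certificate is easy to check numerically. The sketch below (toy data; the helper names `primal`, `dual`, and `w_of` are ours, not the paper's) builds a random feasible dual point for the hinge loss and verifies P(w(α)) ≥ D(α):

```python
import random

random.seed(1)
n, d, lam = 30, 4, 0.05
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]

def w_of(alpha):
    # primal point associated with a dual point: w(alpha) = (1/(lam*n)) sum_i alpha_i x_i
    return [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]

def primal(w):
    # P(w) = (1/n) sum_i max(0, 1 - y_i w.x_i) + (lam/2) ||w||^2
    hinge = sum(max(0.0, 1.0 - yi * sum(wj * xj for wj, xj in zip(w, xi)))
                for xi, yi in zip(X, y)) / n
    return hinge + 0.5 * lam * sum(wj * wj for wj in w)

def dual(alpha):
    # for the hinge loss, -phi_i^*(-alpha_i) = alpha_i * y_i, feasible when alpha_i*y_i in [0, 1]
    w = w_of(alpha)
    return sum(ai * yi for ai, yi in zip(alpha, y)) / n \
        - 0.5 * lam * sum(wj * wj for wj in w)

alpha = [random.random() * yi for yi in y]  # feasible: alpha_i * y_i uniform in [0, 1)
gap = primal(w_of(alpha)) - dual(alpha)
```

Any feasible α therefore yields a computable certificate, P(w(α)) − D(α), bounding how far w(α) is from the primal optimum.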

7 We focus on a stochastic version of DCA, abbreviated by SDCA, in which at each round we choose which dual coordinate to optimize uniformly at random. [sent-45, score-0.182]
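To make the round structure concrete, here is a minimal SDCA sketch instantiated for ridge regression, φi(u) = ½(u − yi)², where maximizing D over a single αi has a closed form. The toy data, names, and step formula below are an illustrative instantiation, not code from the paper:

```python
import random

random.seed(0)
n, d, lam = 20, 3, 0.1
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

alpha = [0.0] * n
w = [0.0] * d  # invariant: w = (1/(lam*n)) * sum_i alpha_i * x_i

def dual(a):
    # D(alpha) for phi_i(u) = 0.5*(u - y_i)^2, i.e., -phi_i^*(-a_i) = -a_i^2/2 + a_i*y_i
    wv = [sum(a[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]
    return sum(-0.5 * ai * ai + ai * yi for ai, yi in zip(a, y)) / n \
        - 0.5 * lam * sum(wj * wj for wj in wv)

d_before = dual(alpha)
for t in range(200):
    i = random.randrange(n)  # choose the dual coordinate uniformly at random
    xi = X[i]
    q = sum(v * v for v in xi) / (lam * n)
    # exact maximizer of D over alpha_i alone, all other coordinates held fixed
    delta = (y[i] - sum(w[j] * xi[j] for j in range(d)) - alpha[i]) / (1.0 + q)
    alpha[i] += delta
    for j in range(d):  # keep the primal iterate in sync with alpha
        w[j] += delta * xi[j] / (lam * n)
d_after = dual(alpha)
```

Because each step exactly maximizes the dual over one coordinate, D(α) is non-decreasing along the entire trajectory.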

8 The purpose of this paper is to develop theoretical understanding of the convergence of the duality gap for SDCA. [sent-46, score-0.17]

9 Our main findings are: in order to achieve a duality gap of ε, for L-Lipschitz loss functions we obtain the rate of Õ(n + L²/(λε)). [sent-53, score-0.207]

10 • For loss functions which are almost everywhere smooth (such as the hinge-loss), we can obtain a rate better than the above rate for Lipschitz loss. [sent-55, score-0.132]

11 The basic technique is to adapt the linear convergence of coordinate ascent that was established by Luo and Tseng (1992). [sent-68, score-0.108]

12 Second, the analysis only deals with the sub-optimality of the dual objective, while our real goal is to bound the sub-optimality of the primal objective. [sent-76, score-0.215]

13 Given a dual solution α ∈ Rn its corresponding primal solution is w(α) (see (3)). [sent-77, score-0.211]

14 , 2006, Theorem 2) showed that in order to obtain a primal εP-sub-optimal solution, we need a dual εD-sub-optimal solution with εD = O(λεP²); therefore a convergence result for the dual solution can only translate into a primal convergence result with a worse convergence rate. [sent-80, score-0.511]

15 Some analyses of stochastic coordinate ascent provide solutions to the first problem mentioned above. [sent-82, score-0.105]

16 (2008) analyzed an exponentiated gradient dual coordinate ascent algorithm. [sent-84, score-0.232]

17 (2008, Theorem 4) applied these results to the dual SVM formulation. [sent-90, score-0.127]

18 Furthermore, neither of these analyses can be applied to logistic regression due to their reliance on the smoothness of the dual objective function, which does not hold for the dual formulation of logistic regression. [sent-92, score-0.311]

19 We shall also point out again that all of these bounds are for the dual sub-optimality, while as mentioned before, we are interested in the primal sub-optimality. [sent-93, score-0.191]

20 In this paper we derive new bounds on the duality gap (hence, they also imply bounds on the primal sub-optimality) of SDCA. [sent-94, score-0.203]

21 These bounds are superior to earlier results, but our analysis only holds for randomized (stochastic) dual coordinate ascent. [sent-95, score-0.154]

22 In fact, the practical convergence behavior of (non-stochastic) cyclic dual coordinate ascent (even with a random ordering of the data) can be slower than our theoretical bounds for SDCA, and thus cyclic DCA is inferior to SDCA. [sent-97, score-0.437]

23 In this regard, we note that some of the earlier analysis such as Luo and Tseng (1992) can be applied both to stochastic and to cyclic dual coordinate ascent methods with similar results. [sent-98, score-0.318]

24 This means that their analysis, which can be no better than the behavior of cyclic dual coordinate ascent, is inferior to our analysis. [sent-99, score-0.26]

25 (2012) derived a stochastic coordinate ascent for structural SVM based on the Frank-Wolfe algorithm. [sent-101, score-0.105]

26 There, a convergence rate of O(n log(1/ε)) is shown for the case of smooth losses, assuming that n ≥ 8/(λγ). [sent-107, score-0.113]

27 Lipschitz loss:

Algorithm | type of convergence | rate
SGD | primal | Õ(1/(λε))
online EG (Collins et al., 2008) (for SVM) | dual | Õ((n + 1/λ) log(1/ε))
Stochastic Frank-Wolfe (Lacoste-Julien et al., 2012) (assuming n ≥ 8/(λγ)) | primal | Õ((n + 1/λ) log(1/ε))
SDCA | primal-dual | Õ((n + 1/λ) log(1/ε))

[sent-111, score-0.129] [sent-112, score-0.127] [sent-115, score-0.382]

30 In practice, however, the parameters T and T0 are not required as one can evaluate the duality gap and terminate when it is sufficiently small. [sent-119, score-0.139]

31 To obtain a duality gap of E[P(w̄) − D(ᾱ)] ≤ εP, it suffices to have a total number of iterations of T ≥ T0 + n + 4L²/(λεP) ≥ max(0, ⌈n log(0.5λnL⁻²)⌉) + n + 20L²/(λεP). [sent-132, score-0.177]

32 Moreover, when t ≥ T0, we have a dual sub-optimality bound of E[D(α∗) − D(α(t))] ≤ εP/2. [sent-134, score-0.151]

33 This means that calculating the duality gap at a few random points would lead to the same type of guarantee with high probability. [sent-140, score-0.139]

34 This approach has an advantage over averaging, since it is easier to implement the stopping condition (we simply check the duality gap at some random stopping points). [sent-141, score-0.159]

35 However, for the hinge-loss, the constant 4 in the first inequality can be replaced by 1 (this is because the domain of the dual variables is positive, hence the constant 4 in Lemma 22 can be replaced by 1). [sent-144, score-0.127]

36 To obtain an expected duality gap of E[P(w(T)) − D(α(T))] ≤ εP, it suffices to have a total number of iterations of T ≥ (n + 1/(λγ)) log((n + 1/(λγ)) · 1/εP). [sent-149, score-0.177]

37 Moreover, to obtain an expected duality gap of E[P(w̄) − D(ᾱ)] ≤ εP, it suffices to have a total number of iterations of T > T0, where T0 ≥ (n + 1/(λγ)) log((n + 1/(λγ)) · 1/((T − T0)εP)). [sent-150, score-0.177]
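Plugging numbers into the smooth-loss bound T ≥ (n + 1/(λγ)) log((n + 1/(λγ)) · 1/εP) shows the near-linear dependence on n. The helper below is our own back-of-the-envelope calculator and ignores the constants hidden in the theorem:

```python
import math

def sdca_smooth_iters(n, lam, gamma, eps_p):
    # (n + 1/(lam*gamma)) * log((n + 1/(lam*gamma)) / eps_p): the iteration
    # bound for (1/gamma)-smooth losses, up to constants.
    c = n + 1.0 / (lam * gamma)
    return c * math.log(c / eps_p)

t_small = sdca_smooth_iters(n=10**5, lam=1e-4, gamma=1.0, eps_p=1e-3)
t_large = sdca_smooth_iters(n=10**6, lam=1e-4, gamma=1.0, eps_p=1e-3)
```

Growing n by a factor of 10 grows the bound by roughly 10 (times a slowly growing log factor), i.e., a near-constant number of passes over the data.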

38 Using SGD At The First Epoch. From the convergence analysis, SDCA may not perform as well as SGD for the first few epochs (each epoch means one pass over the data). [sent-166, score-0.132]

39 It is thus natural to combine SGD and SDCA, where the first epoch is performed using a modified stochastic gradient descent rule. [sent-168, score-0.104]

40 We show that ˜ the expected dual sub-optimality at the end of the first epoch is O(1/(λn)). [sent-169, score-0.172]

41 Let Pt denote the primal objective for the first t examples in the training set, Pt(w) = (1/t) ∑_{i=1}^t φi(w⊤xi) + (λ/2)‖w‖². [sent-172, score-0.118]

42 The corresponding dual objective is Dt(α) = (1/t) ∑_{i=1}^t −φi∗(−αi) − (λ/2) ‖(1/(λt)) ∑_{i=1}^t αi xi‖². [sent-173, score-0.181]

43 Note that Pn (w) is the primal objective given in (1) and that Dn (α) is the dual objective given in (2). [sent-174, score-0.225]

44 The idea is to greedily decrease the dual sub-optimality for problem Dt (·) at each step t. [sent-176, score-0.127]

45 We have the following result for the convergence of dual objective: Theorem 9 Assume that φi is L-Lipschitz for all i. [sent-182, score-0.158]

46 To obtain a duality gap of E[P(w̄) − D(ᾱ)] ≤ εP at Stage 2, it suffices to have a total number of SDCA iterations of T ≥ T0 + n + 20L²/(λεP) ≥ ⌈n log(log(en))⌉ + n + 4L²/(λεP). [sent-201, score-0.177]

47 Moreover, when t ≥ T0, we have a dual sub-optimality bound of E[D(α∗) − D(α(t))] ≤ εP/2. [sent-202, score-0.116]

48 That is, the vanilla SDCA tends to have a slower convergence rate than SGD in the first few iterations when λ is relatively large. [sent-207, score-0.099]

49 For example, the hinge loss max(0, 1 − uyi ) is smooth at any point u such that uyi is not close to 1. [sent-219, score-0.222]

50 Definition 14 For each i, we define γi(·) ≥ 0 so that for all dual variables a and b, and u ∈ ∂φi∗(−b), we have φi∗(−a) − φi∗(−b) + u(a − b) ≥ γi(u)|a − b|² (4). [sent-221, score-0.127]

51 For the SVM loss, we have φi(u) = max(0, 1 − u yi), and φi∗(−a) = −a yi, with a yi ∈ [0, 1] and yi ∈ {±1}. [sent-222, score-0.13]

52 Let γi = γi(w∗⊤xi); we have the following dual strong convexity inequality: D(α∗) − D(α) ≥ (1/n) ∑_{i=1}^n γi |αi − αi∗|² + (λ/2)(w − w∗)⊤(w − w∗). [sent-228, score-0.164]

53 For SVM, we can take γi = |w∗⊤xi yi − 1|, and for the absolute deviation loss, we may take γi = |w∗⊤xi − yi|. [sent-230, score-0.118]

54 Under this assumption, we may establish a convergence result for the dual sub-optimality. [sent-232, score-0.158]

55 Remark 17 If N(s/λn)/n is small, then Theorem 16 is superior to Theorem 2 for the convergence of the dual objective function. [sent-237, score-0.175]

56 This result is again superior to Theorem 2 for dual convergence. [sent-244, score-0.127]

57 The following result shows fast convergence of the duality gap using Theorem 16. [sent-245, score-0.17]

58 Assume that at time T0 ≥ n we have a dual suboptimality of E[D(α∗) − D(α(T0))] ≤ εD, and define ε̃P = inf_{γ>0} [4L² N(γ)/n + 2εD/min(γ, λγ²/(2ρ))]; then at time T = 2T0, the expected duality gap is at most ε̃P. [sent-250, score-0.137]

59 This means that the convergence rate for the duality gap in Theorem 18 is linear, as implied by the linear convergence of εD in Theorem 16. [sent-254, score-0.219]

60 For the hinge loss, step (*) in Procedure SDCA-Perm has a closed form solution as ∆αi = yi max(0, min(1, (1 − yi xi⊤w(t−1))/(‖xi‖²/(λn)) + αi(t−1) yi)) − αi(t−1). [sent-280, score-0.187]
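A runnable sketch of this inner step on synthetic linearly separable data (the data generator, seed, and bookkeeping are illustrative; only the update line follows the closed form — here with uniform random coordinate choice rather than a per-epoch permutation):

```python
import random

random.seed(2)
n, d, lam = 40, 3, 0.1
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
w_true = [1.0, -0.5, 0.25]  # hypothetical separator, used only to label the toy data
y = [1 if sum(a * b for a, b in zip(xi, w_true)) > 0 else -1 for xi in X]

alpha = [0.0] * n
w = [0.0] * d  # invariant: w = (1/(lam*n)) * sum_i alpha_i * x_i

for t in range(400):
    i = random.randrange(n)
    xi, yi = X[i], y[i]
    margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
    q = sum(xj * xj for xj in xi) / (lam * n)
    # closed-form hinge step: move beta_i = alpha_i*y_i by (1 - margin)/q, clipped to [0, 1]
    delta = yi * max(0.0, min(1.0, (1.0 - margin) / q + alpha[i] * yi)) - alpha[i]
    alpha[i] += delta
    for j in range(d):
        w[j] += delta * xi[j] / (lam * n)

train_acc = sum(1 for xi, yi in zip(X, y)
                if yi * sum(wj * xj for wj, xj in zip(w, xi)) > 0) / n
```

The clipping keeps the box constraint αi yi ∈ [0, 1] satisfied at every step.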

61 Both hinge loss and absolute deviation loss are 1-Lipschitz. [sent-282, score-0.13]

62 We have φi(u) = log(1 + exp(−yi u)), and φi∗(−a) = a yi log(a yi) + (1 − a yi) log(1 − a yi) with a yi ∈ [0, 1]. [sent-289, score-0.272]

63 Recall that the hinge loss function (for positive labels) is φ(u) = max{0, 1 − u} and we have φ∗ (−a) = −a with a ∈ [0, 1]. [sent-297, score-0.096]

64 (6) φ 1 2 a∈[−1,0]  2 otherwise 2γ (1 − x) 577 S HALEV-S HWARTZ AND Z HANG For the smoothed hinge loss, step (*) in Procedure SDCA-Perm has a closed form solution as (t−1) ∆αi = yi max 0, min 1, 1 − xi⊤ w(t−1) yi − γ αi xi 2 /(λn) + γ yi (t−1) + αi (t−1) − αi yi . [sent-301, score-0.234]

65 For convenience, we list the following simple facts about primal and dual formulations, which will be used in the proofs. [sent-309, score-0.064]

66 For each i, we have −αi∗ ∈ ∂φi(w∗⊤xi) and w∗⊤xi ∈ ∂φi∗(−αi∗), and w∗ = (1/(λn)) ∑_{i=1}^n αi∗ xi. [sent-310, score-0.111]

67 The proof of our basic results stated in Theorem 5 and Theorem 2 relies on the fact that for SDCA, it is possible to lower bound the expected increase in dual objective by the duality gap. [sent-311, score-0.274]

68 Note that the duality gap can be further lower bounded using dual suboptimality. [sent-313, score-0.266]

69 Therefore Lemma 19 implies a recursion for dual suboptimality which can be solved to obtain the convergence of dual objective. [sent-314, score-0.311]

70 We can then apply Lemma 19 again, and the convergence of dual objective implies an upper bound of the duality gap, which leads to the basic theorems. [sent-315, score-0.291]

71 Proof Of Theorem 5. The key lemma, which estimates the expected increase in dual objective in terms of the duality gap, can be stated as follows. [sent-318, score-0.236]

72 Next note that P(w) − D(α) = (1/n) ∑_{i=1}^n φi(w⊤xi) + (λ/2) w⊤w − ((1/n) ∑_{i=1}^n −φi∗(−αi) − (λ/2) w⊤w) = (1/n) ∑_{i=1}^n (φi(w⊤xi) + φi∗(−αi) + αi w⊤xi). [sent-327, score-0.111]
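The two λ/2 w⊤w terms combine into λ w⊤w, and since w = (1/(λn)) ∑ αi xi we get λ w⊤w = (1/n) ∑ αi w⊤xi, which gives the per-example form. A numeric sanity check for the hinge loss (toy data; variable names are our own):

```python
import random

random.seed(3)
n, d, lam = 10, 2, 0.2
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [random.choice([-1, 1]) for _ in range(n)]
alpha = [random.random() * yi for yi in y]  # feasible hinge dual point: alpha_i*y_i in [0, 1)

w = [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]
dots = [sum(wj * xj for wj, xj in zip(w, xi)) for xi in X]

primal_val = sum(max(0.0, 1.0 - yi * u) for yi, u in zip(y, dots)) / n \
    + 0.5 * lam * sum(wj * wj for wj in w)
dual_val = sum(ai * yi for ai, yi in zip(alpha, y)) / n \
    - 0.5 * lam * sum(wj * wj for wj in w)
# per-example decomposition: phi_i(w.x_i) + phi_i^*(-alpha_i) + alpha_i * w.x_i,
# with phi_i^*(-alpha_i) = -alpha_i*y_i for the hinge loss
per_example = sum(max(0.0, 1.0 - yi * u) - ai * yi + ai * u
                  for yi, u, ai in zip(y, dots, alpha)) / n
```

Each summand is non-negative by the Fenchel-Young inequality, which is another way of seeing weak duality.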

73 So, requiring εD(t) ≤ (s/n) εP, we obtain by (9) a duality gap of at most εP. [sent-347, score-0.155]

74 ≤ 1 − 1/(2n − t0 + t − 1). This provides a bound on the dual sub-optimality. [sent-383, score-0.151]

75 The inequality above can be obtained by noticing that the choice of −αt+1(t+1) maximizes the dual objective. [sent-410, score-0.127]

76 Proof Of Proposition 15. Consider any feasible dual variable α and the corresponding w = w(α). [sent-422, score-0.127]

77 Since w = (1/(λn)) ∑_{i=1}^n αi xi and w∗ = (1/(λn)) ∑_{i=1}^n αi∗ xi, we have λ(w − w∗)⊤w∗ = (1/n) ∑_{i=1}^n (αi − αi∗) w∗⊤xi. [sent-423, score-0.111]

78 Proof Of Theorem 16. The following lemma is very similar to Lemma 19 with a nearly identical proof, but it focuses only on the convergence of the dual objective function using (5). [sent-432, score-0.195]

79 In the experiments, we use εD to denote the dual sub-optimality, and εP to denote the primal sub-optimality (note that this is different from the notation in our analysis, which uses εP to denote the duality gap). [sent-473, score-0.283]

80 We indeed observe linear convergence for the duality gap. [sent-489, score-0.123]

81 Random Permutation In Figure 5 we compare choosing dual variables at random with repetitions (as done in SDCA) vs. [sent-508, score-0.139]

82 choosing dual variables using a random permutation at each epoch (as done in SDCA-Perm) vs. [sent-509, score-0.182]

83 choosing dual variables in a fixed cyclic order (that was chosen once at random). [sent-510, score-0.213]

84 As can be seen, a cyclic order does not lead to linear convergence and yields an actual convergence rate much slower than that of the other methods, and even worse than our bound. [sent-511, score-0.176]

85 As mentioned before, some of the earlier analyses such as Luo and Tseng (1992) can be applied both to stochastic and to cyclic dual coordinate ascent methods with similar results. [sent-512, score-0.318]

86 This means that their analysis, which can be no better than the behavior of cyclic dual coordinate ascent, is inferior to our analysis. [sent-513, score-0.26]

87 One clear advantage of SDCA is the availability of a clear stopping condition (by calculating the duality gap). [sent-518, score-0.102]

88 The primal and dual sub-optimality, the duality gap, and our bound are depicted as a function of the number of epochs, on the astro-ph (left), CCAT (center) and cov1 (right) data sets. [sent-527, score-0.317]

89 In all plots the horizontal axis is the number of iterations divided by training set size (corresponding to the number of epochs through the data). [sent-528, score-0.124]

90 The primal and dual sub-optimality, the duality gap, and our bound are depicted as a function of the number of epochs, on the astro-ph (left), CCAT (center) and cov1 (right) data sets. [sent-530, score-0.317]

91 In all plots the horizontal axis is the number of iterations divided by training set size (corresponding to the number of epochs through the data). [sent-531, score-0.124]

92 In all plots the vertical axis is the zero-one error on the test set and the horizontal axis is the number of iterations divided by training set size (corresponding to the number of epochs through the data). [sent-688, score-0.143]

93 We terminated each method when the duality gap was smaller than 10−5 . [sent-689, score-0.139]

94 In all cases, the duality gap is depicted as a function of the number of epochs for different values of λ. [sent-691, score-0.205]

95 The loss function is the smooth hinge loss with γ = 1. [sent-692, score-0.176]

96 In all plots the horizontal axis is the number of iterations divided by training set size (corresponding to the number of epochs through the data). [sent-694, score-0.124]

97 In all plots the horizontal axis is the number of iterations divided by training set size (corresponding to the number of epochs through the data). [sent-696, score-0.124]

98 In all plots the vertical axis is the zero-one error on the test set and the horizontal axis is the number of iterations divided by training set size (corresponding to the number of epochs through the data). [sent-794, score-0.143]

99 We terminated SDCA when the duality gap was smaller than 10−5 . [sent-795, score-0.139]

100 A dual coordinate descent method for large-scale linear SVM. [sent-827, score-0.172]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sdca', 0.794), ('sgd', 0.371), ('perm', 0.203), ('dca', 0.138), ('dual', 0.127), ('oss', 0.105), ('ual', 0.105), ('duality', 0.092), ('hwartz', 0.09), ('inimization', 0.09), ('oordinate', 0.09), ('egularized', 0.09), ('cyclic', 0.086), ('tochastic', 0.081), ('scent', 0.081), ('ccat', 0.079), ('ayi', 0.068), ('primal', 0.064), ('hinge', 0.062), ('epochs', 0.056), ('ethods', 0.055), ('hang', 0.051), ('ascent', 0.05), ('gap', 0.047), ('smooth', 0.046), ('epoch', 0.045), ('uyi', 0.04), ('xi', 0.037), ('luo', 0.036), ('loss', 0.034), ('ui', 0.033), ('convergence', 0.031), ('tw', 0.031), ('xt', 0.03), ('stochastic', 0.028), ('svm', 0.027), ('coordinate', 0.027), ('log', 0.026), ('smoothed', 0.025), ('ai', 0.025), ('lipschitz', 0.024), ('bound', 0.024), ('en', 0.023), ('tseng', 0.022), ('iterations', 0.022), ('yi', 0.022), ('lemma', 0.02), ('tdt', 0.02), ('uxt', 0.02), ('inferior', 0.02), ('axis', 0.019), ('rate', 0.018), ('descent', 0.018), ('theorem', 0.018), ('vanilla', 0.018), ('bottou', 0.017), ('objective', 0.017), ('collins', 0.017), ('hsieh', 0.017), ('pd', 0.016), ('obtain', 0.016), ('horizontal', 0.016), ('dt', 0.016), ('remark', 0.016), ('exponentiated', 0.015), ('roux', 0.015), ('logistic', 0.014), ('regularized', 0.014), ('proof', 0.014), ('gradient', 0.013), ('cun', 0.013), ('hingeloss', 0.013), ('maxz', 0.013), ('sag', 0.013), ('runtime', 0.013), ('repetitions', 0.012), ('max', 0.012), ('smoothness', 0.012), ('passes', 0.012), ('averaging', 0.012), ('divided', 0.011), ('mangasarian', 0.011), ('iterate', 0.011), ('option', 0.011), ('bousquet', 0.011), ('faster', 0.011), ('slower', 0.01), ('sridharan', 0.01), ('hush', 0.01), ('acknowledges', 0.01), ('rutgers', 0.01), ('suboptimality', 0.01), ('permutation', 0.01), ('stopping', 0.01), ('ridge', 0.01), ('depicted', 0.01), ('find', 0.01), ('solution', 0.01), ('summing', 0.01), ('procedure', 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked a good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees that are comparable to, or better than, those of SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression

2 0.056948088 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.052144684 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

Author: Michael Chertkov, Adam B. Yedidia

Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows

4 0.042397588 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

5 0.030230733 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classiÄ?Ĺš cation problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is deÄ?Ĺš ned for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufÄ?Ĺš ciently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

6 0.028267717 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

7 0.026668116 114 jmlr-2013-The Rate of Convergence of AdaBoost

8 0.026660658 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

9 0.025197476 108 jmlr-2013-Stochastic Variational Inference

10 0.024148108 90 jmlr-2013-Quasi-Newton Method: A New Direction

11 0.023715237 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

12 0.022409847 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

13 0.022001259 120 jmlr-2013-Variational Algorithms for Marginal MAP

14 0.02118236 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

15 0.02093932 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

16 0.020450948 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

17 0.020084472 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

18 0.019995976 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

19 0.019773973 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes

20 0.019509869 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.094), (1, 0.017), (2, 0.029), (3, 0.03), (4, 0.006), (5, -0.009), (6, 0.067), (7, 0.014), (8, 0.007), (9, -0.066), (10, 0.025), (11, 0.03), (12, 0.115), (13, 0.052), (14, 0.036), (15, -0.059), (16, -0.094), (17, -0.032), (18, 0.05), (19, 0.08), (20, 0.027), (21, 0.037), (22, -0.004), (23, 0.114), (24, 0.009), (25, -0.059), (26, -0.156), (27, 0.269), (28, 0.077), (29, -0.074), (30, -0.019), (31, 0.181), (32, -0.127), (33, 0.086), (34, 0.136), (35, 0.282), (36, -0.258), (37, 0.026), (38, 0.044), (39, -0.035), (40, 0.133), (41, -0.072), (42, 0.087), (43, -0.021), (44, 0.129), (45, -0.002), (46, 0.153), (47, -0.184), (48, 0.168), (49, -0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92976272 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked a good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees that are comparable to, or better than, those of SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression

2 0.58145607 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.40077662 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

4 0.31712496 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

Author: Michael Chertkov, Adam B. Yedidia

Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and conjectures are verified in experiments, and some new theoretical relations, bounds and conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the mean-field (MF) limit that respects the exclusion principle (but ignores perfect matching constraints). FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval correspond to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering practical empirical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows
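For intuition about the quantity being approximated, a brute-force reference computation of the permanent (the sum over all permutations) is feasible only for small matrices and might look as follows:

```python
from itertools import permutations
import numpy as np

def permanent(A):
    """Exact permanent by summing over all n! permutations: O(n! * n).

    This is the quantity the BP/FBP free-energy schemes approximate;
    the exponential cost is what motivates those approximations."""
    n = A.shape[0]
    return sum(np.prod([A[i, p[i]] for i in range(n)])
               for p in permutations(range(n)))
```

For example, the permanent of the all-ones 3×3 matrix is 3! = 6, and of the identity it is 1.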

5 0.20322216 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels

Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is as important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule
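A minimal sketch of the empirical-derivative idea: the paper builds weighted qth-order versions and then smooths the resulting pseudo-data with a nonparametric estimator; the version below keeps only the simplest first-order, single-difference case.

```python
import numpy as np

def empirical_derivative(x, y):
    """First-order empirical derivative: symmetric difference quotients.

    The paper forms qth-order analogues from weighted combinations of
    such quotients and treats the result as a new (correlated) data set
    to be smoothed; this sketch stops at the raw first-order quotients."""
    d = (y[2:] - y[:-2]) / (x[2:] - x[:-2])
    return x[1:-1], d   # derivative estimates at the interior points
```

On a fine noise-free grid this already tracks the true derivative closely; with noise, the smoothing step (and the paper's correlation-aware bandwidth selection) becomes essential.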

6 0.19530731 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

7 0.18750694 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

8 0.1605794 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

9 0.1524471 104 jmlr-2013-Sparse Single-Index Model

10 0.15150365 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

11 0.14808728 120 jmlr-2013-Variational Algorithms for Marginal MAP

12 0.14662734 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

13 0.14374962 114 jmlr-2013-The Rate of Convergence of AdaBoost

14 0.1406326 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

15 0.13742627 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

16 0.12795308 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

17 0.12544708 86 jmlr-2013-Parallel Vector Field Embedding

18 0.12336914 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

19 0.12012839 73 jmlr-2013-Multicategory Large-Margin Unified Machines

20 0.11976784 90 jmlr-2013-Quasi-Newton Method: A New Direction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.044), (5, 0.098), (6, 0.03), (10, 0.071), (20, 0.01), (23, 0.024), (38, 0.432), (68, 0.016), (70, 0.025), (75, 0.026), (85, 0.033), (87, 0.023), (93, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.69511986 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

Author: Edward McFowland III, Skyler Speakman, Daniel B. Neill

Abstract: We propose Fast Generalized Subset Scan (FGSS), a new method for detecting anomalous patterns in general categorical data sets. We frame the pattern detection problem as a search over subsets of data records and attributes, maximizing a nonparametric scan statistic over all such subsets. We prove that the nonparametric scan statistics possess a novel property that allows for efficient optimization over the exponentially many subsets of the data without an exhaustive search, enabling FGSS to scale to massive and high-dimensional data sets. We evaluate the performance of FGSS in three real-world application domains (customs monitoring, disease surveillance, and network intrusion detection), and demonstrate that FGSS can successfully detect and characterize relevant patterns in each domain. As compared to three other recently proposed detection algorithms, FGSS substantially decreased run time and improved detection power for massive multivariate data sets. Keywords: pattern detection, anomaly detection, knowledge discovery, Bayesian networks, scan statistics
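To make the search space concrete, here is a deliberately naive exhaustive scan over all record and attribute subsets of a p-value matrix, scoring each block by its excess of significant p-values. The score is a simplified stand-in for the paper's nonparametric scan statistics; FGSS's contribution is precisely avoiding this exhaustive search.

```python
import numpy as np
from itertools import chain, combinations

def all_subsets(items):
    """All non-empty subsets, smallest first."""
    return chain.from_iterable(combinations(items, r)
                               for r in range(1, len(items) + 1))

def brute_force_scan(P, alpha=0.05):
    """Exhaustive subset scan over records x attributes of a p-value matrix.

    Scores a block by the count of significant p-values minus its expected
    count under the null (alpha * block size) and returns the best block."""
    best, best_sub = -np.inf, None
    rows, cols = range(P.shape[0]), range(P.shape[1])
    for R in all_subsets(rows):
        for C in all_subsets(cols):
            block = P[np.ix_(R, C)]
            score = np.sum(block <= alpha) - alpha * block.size
            if score > best:
                best, best_sub = score, (R, C)
    return best_sub, best
```

Even for a 4×4 matrix this already visits 225 blocks; the doubly exponential growth in records and attributes is what the FGSS optimization property eliminates.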

same-paper 2 0.61286414 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to its strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked a good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoys strong theoretical guarantees that are comparable to or better than those of SGD. This analysis justifies the effectiveness of SDCA for practical applications. Keywords: stochastic dual coordinate ascent, optimization, computational complexity, regularized loss minimization, support vector machines, ridge regression, logistic regression
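The coordinate update at the heart of SDCA has a closed form for the hinge loss. The sketch below follows the standard derivation for the L2-regularized linear SVM; the variable names and fixed epoch count are our own choices, not the paper's pseudocode.

```python
import numpy as np

def sdca_svm(X, y, lam=0.1, epochs=50, seed=0):
    """Stochastic Dual Coordinate Ascent for the L2-regularized hinge loss.

    Primal: P(w) = (1/n) sum_i max(0, 1 - y_i w.x_i) + (lam/2)||w||^2.
    Each step maximizes the dual over a single alpha_i in [0, 1] in closed
    form and updates w = (1/(lam*n)) sum_i alpha_i y_i x_i incrementally."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w)
            # unconstrained maximizer of the dual along coordinate i
            delta = lam * n * (1.0 - margin) / (X[i] @ X[i])
            delta = np.clip(alpha[i] + delta, 0.0, 1.0) - alpha[i]  # keep alpha_i in [0,1]
            alpha[i] += delta
            w += (delta * y[i] / (lam * n)) * X[i]
    return w, alpha
```

Since P(w(α)) ≥ D(α) always holds, the duality gap P(w(α)) − D(α) computed from the returned pair certifies sub-optimality, which is the quantity the paper's analysis bounds.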

3 0.30054557 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
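The first algorithm (the average mixture) is simple enough to sketch directly; the version below assumes plain least squares as the local minimizer, which is only one instance of the general setting the paper analyzes.

```python
import numpy as np

def average_mixture_ols(X, y, m):
    """One-shot averaging for distributed least squares.

    Splits the N samples into m roughly equal subsets, solves each local
    problem independently, and averages the m estimates -- the 'average
    mixture' scheme, instantiated here for ordinary least squares."""
    thetas = [np.linalg.lstsq(Xc, yc, rcond=None)[0]
              for Xc, yc in zip(np.array_split(X, m), np.array_split(y, m))]
    return np.mean(thetas, axis=0)
```

Each machine only ever sees N/m samples and communicates a single d-dimensional vector, which is where the communication savings come from.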

4 0.29531497 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer

Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation space should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation © 2013 Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial and Klaus Obermayer.
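A minimal sketch of linear SFA, the basis-construction method the paper builds on: whiten the signal, then order directions by the variance of their time derivative, slowest first. This assumes a purely linear setting; the paper works with expanded feature spaces and an importance-sampling modification.

```python
import numpy as np

def sfa(X):
    """Linear Slow Feature Analysis on a (T, d) signal.

    Centers and whitens the signal, then diagonalizes the covariance of
    its time differences; the eigenvectors with the smallest eigenvalues
    give the slowest-varying unit-variance features."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs / np.sqrt(evals)          # whitening transform (columns scaled)
    Z = Xc @ W                           # whitened signal, identity covariance
    dZ = np.diff(Z, axis=0)              # discrete time derivative
    dcov = dZ.T @ dZ / len(dZ)
    _, dvecs = np.linalg.eigh(dcov)      # eigh sorts ascending: slowest first
    return Z @ dvecs                     # slow features, slowest in column 0
```

On a linear mixture of a slow and a fast oscillation, the first slow feature recovers the slow source up to sign, which is the sense in which SFA "encodes" slowly varying structure.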

5 0.29384831 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

Author: Daniil Ryabko, Jérémie Mary

Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
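The reduction can be sketched as follows: windows from the two series are labeled 0 and 1, a classifier is trained on half of them, and held-out accuracy above chance is converted to a distance. A nearest-centroid classifier stands in here for the generic binary classifier of the paper; the window length and the accuracy-to-distance mapping are our own illustrative choices.

```python
import numpy as np

def classifier_distance(a, b, k=5, seed=0):
    """Binary-classification-based distance between two time series.

    Sliding windows of length k from series `a` get label 0 and from `b`
    label 1; held-out accuracy above chance (rescaled to [0, 1]) is the
    distance.  Indistinguishable processes give accuracy near 0.5."""
    def windows(s):
        return np.array([s[i:i + k] for i in range(len(s) - k + 1)])
    Wa, Wb = windows(a), windows(b)
    rng = np.random.default_rng(seed)
    ia, ib = rng.permutation(len(Wa)), rng.permutation(len(Wb))
    tr_a, te_a = Wa[ia[:len(Wa) // 2]], Wa[ia[len(Wa) // 2:]]
    tr_b, te_b = Wb[ib[:len(Wb) // 2]], Wb[ib[len(Wb) // 2:]]
    ca, cb = tr_a.mean(axis=0), tr_b.mean(axis=0)   # class centroids
    def acc(test, label):
        pred = (np.linalg.norm(test - cb, axis=1)
                < np.linalg.norm(test - ca, axis=1)).astype(int)
        return np.mean(pred == label)
    accuracy = 0.5 * (acc(te_a, 0) + acc(te_b, 1))
    return max(0.0, 2.0 * (accuracy - 0.5))         # 0 = indistinguishable
```

Two copies of the same process yield a distance near 0, while a mean-shifted copy is separated almost perfectly; the paper's point is that any consistent i.i.d. classifier can play this role even for highly dependent series.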

6 0.29335803 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

7 0.29301414 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

8 0.29291734 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

9 0.2922858 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

10 0.29148275 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

11 0.29087171 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

12 0.29055071 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

13 0.2897737 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

14 0.28973114 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

15 0.28912488 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

16 0.28855363 76 jmlr-2013-Nonparametric Sparsity and Regularization

17 0.28854111 68 jmlr-2013-Machine Learning with Operational Costs

18 0.28834319 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

19 0.28833294 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

20 0.2881119 59 jmlr-2013-Large-scale SVD and Manifold Learning