jmlr jmlr2010 jmlr2010-34 knowledge-graph by maker-knowledge-mining

34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected


Source: pdf

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith

Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al. [sent-13, score-0.012]

2 This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. [sent-15, score-0.289]

3 Keywords: stochastic gradient descent, support vector machine, conditional random fields 1. [sent-16, score-0.021]

4 Bordes et al. (2009) propose to improve the practical speed of stochastic gradient descent by efficiently estimating a diagonal matrix for rescaling the gradient estimates. [sent-18, score-0.048]

5 The proposed algorithm, SGD-QN, works well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al. [sent-19, score-0.012]

6 Then we present a corrected algorithm and evaluate its performance for training both linear Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). [sent-26, score-0.289]

7 Given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$, we obtain a linear SVM classifier by minimizing the cost $P_n(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} \ell(y_i w^\top x_i) = \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{\lambda}{2}\|w\|^2 + \ell(y_i w^\top x_i)\Big)$. [sent-35, score-0.009]
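For concreteness, here is a minimal sketch (not code from the paper) of this primal cost for the hinge loss $\ell(z) = \max(0, 1 - z)$; the array names X, y and the function name svm_cost are illustrative assumptions.

```python
import numpy as np

def svm_cost(w, X, y, lam):
    """Primal SVM cost P_n(w) = lambda/2 * ||w||^2 + (1/n) * sum_i hinge(y_i w.x_i)."""
    margins = y * (X @ w)                    # y_i w^T x_i for every example
    hinge = np.maximum(0.0, 1.0 - margins)   # hinge loss per example
    return 0.5 * lam * (w @ w) + hinge.mean()
```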

8 In the following, expectations and probabilities refer to the discrete distribution describing the training examples randomly picked from the finite training set at each iteration. [sent-37, score-0.03]

9 Here $\mathcal{F}_t$ denotes the examples $\{(x_1, y_1), \dots, (x_{t-1}, y_{t-1})\}$ picked before reaching the t-th iteration. [sent-41, score-0.01]

10 We would like to find $B$ such that $w_{t+1} - w_t = B\,[P_n'(w_{t+1}) - P_n'(w_t)] + \xi_t$ (2), with an error term $\xi_t$ verifying $\mathbb{E}[\,\xi_t \mid \mathcal{F}_t\,] = 0$. [sent-42, score-0.154]

11 The SGD-QN algorithm updates the diagonal elements of matrix B on-the-fly on the basis of the term-by-term ratios of the observed differences wt+1 − wt and gτ (wt+1 ) − gτ (wt ). [sent-46, score-0.172]

12 The obvious choices τ = t and τ = t + 1 only require one additional gradient evaluation because the parameter update formula (1) demands the computation of all the gradients gt (wt ) anyway. [sent-47, score-0.092]

13 Since $g_{t+1}(w_{t+1})$ is a function of $(x_{t+1}, y_{t+1}, x_t, y_t, \mathcal{F}_t)$,
$\mathbb{E}[\,g_\tau(w_{t+1}) \mid \mathcal{F}_t\,] = \int g_{t+1}(w_{t+1})\, dP(x_{t+1}, y_{t+1}, x_t, y_t \mid \mathcal{F}_t) = \int\!\!\int g_{t+1}(w_{t+1})\, dP(x_{t+1}, y_{t+1})\, dP(x_t, y_t \mid \mathcal{F}_t)\,.$ [sent-51, score-0.369]

14 Since the variables $w_{t+1}$ and $(x_{t+1}, y_{t+1})$ are independent, the inner integral above is simply the average of $g_{t+1}(w_{t+1})$ over all possible $(x_{t+1}, y_{t+1})$ picked from the training set. [sent-52, score-0.02]
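Spelling out that average (a restatement of the step above, assuming the usual regularized gradient $g_t(w) = \lambda w + \ell'(y_t w^\top x_t)\, y_t x_t$, which is not written explicitly in this excerpt):

```latex
% Inner integral written as a finite average over the n training examples:
\[
\int g_{t+1}(w_{t+1})\, dP(x_{t+1}, y_{t+1})
  = \frac{1}{n}\sum_{i=1}^{n}\Big(\lambda w_{t+1} + \ell'(y_i w_{t+1}^\top x_i)\, y_i x_i\Big)
  = P_n'(w_{t+1}).
\]
```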

15 Therefore $\mathbb{E}[\,g_\tau(w_{t+1}) \mid \mathcal{F}_t\,] = \int P_n'(w_{t+1})\, dP(x_t, y_t \mid \mathcal{F}_t) = \mathbb{E}[\,P_n'(w_{t+1}) \mid \mathcal{F}_t\,]\,.$ [sent-53, score-0.087]

16 Such a derivation is impossible when $\tau = t$ because $(x_t, y_t)$ and $w_{t+1}$ are not independent. [sent-56, score-0.087]

17 The Consequences In order to take maximal advantage of sparse data sets, the Flawed SGD-QN algorithm (see Figure 2 in the original paper) splits the stochastic parameter update (1) into two halves in order to schedule them separately. [sent-62, score-0.039]

18 The first half involves only the gradient of the loss, $w \leftarrow w - (t + t_0)^{-1}\, B\, \ell'(y_t w^\top x_t)\, y_t x_t$. [sent-63, score-0.214]

19 The second half involves only the gradient of the regularization term,
$$w \leftarrow \begin{cases} w & \text{most of the time,} \\ w - \mathrm{skip}\,\lambda\,(t + t_0)^{-1}\, B\, w & \text{once every skip iterations.} \end{cases}$$ [sent-64, score-0.061]

20 The Flawed SGD-QN algorithm measures the differences $w_{t+1} - w_t$ and $g_t(w_{t+1}) - g_t(w_t)$ during iterations for which the second half does nothing. [sent-65, score-0.342]
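As a rough sketch of how these two halves can be scheduled (a simplified reconstruction under stated assumptions, not the authors' implementation: hinge loss, dense NumPy inputs, and a diagonal matrix B stored as a vector):

```python
import numpy as np

def flawed_sgdqn_style_pass(w, B, X, y, lam, t0, skip=16):
    """One pass splitting the loss update (every example) from the
    regularization update (applied once every `skip` iterations)."""
    w = np.asarray(w, dtype=float).copy()
    count = skip
    for t in range(len(y)):
        eta = 1.0 / (t + t0)
        margin = y[t] * (X[t] @ w)
        if margin < 1.0:                    # hinge loss: l'(m) = -1 when m < 1, else 0
            w += eta * B * y[t] * X[t]      # first half: loss gradient only
        count -= 1
        if count <= 0:                      # second half: delayed regularization step
            w -= skip * lam * eta * B * w
            count = skip
    return w
```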

21 Therefore, using the notation $[x]_i$ for the i-th coefficient of vector $x$, and $B_{ii}$ for the terms of the diagonal matrix $B$, we always have
$$\frac{[g_t(w_{t+1}) - g_t(w_t)]_i}{[w_{t+1} - w_t]_i} \;=\; \lambda - \frac{\big(\ell'(y_t w_{t+1}^\top x_t) - \ell'(y_t w_t^\top x_t)\big)\, y_t [x_t]_i}{B_{ii}\,(t + t_0)^{-1}\, \ell'(y_t w_t^\top x_t)\, y_t [x_t]_i}\,.$$ [sent-66, score-0.219]

22 • When $[x_t]_i$ is nonzero, we can simplify this expression as
$$\frac{[g_t(w_{t+1}) - g_t(w_t)]_i}{[w_{t+1} - w_t]_i} \;=\; \lambda - \frac{\ell'(y_t w_{t+1}^\top x_t) - \ell'(y_t w_t^\top x_t)}{B_{ii}\,(t + t_0)^{-1}\, \ell'(y_t w_t^\top x_t)}\,. \quad (4)$$ [sent-67, score-0.363]

23 This ratio is always greater than $\lambda$ because the loss function $\ell$ is convex. [sent-68, score-0.154]

24 • When $[x_t]_i$ is zero, the original paper uses a continuity argument to justify the equality
$$\frac{[g_t(w_{t+1}) - g_t(w_t)]_i}{[w_{t+1} - w_t]_i} \;=\; \lambda\,. \quad (5)$$ [sent-70, score-0.068]

26 Figure 1: Plots of the training cost and test misclassification percentage versus the number of epochs for SVMSGD2 (red) and Flawed SGD-QN (green) for various values of t0 on the dense Delta data set. [sent-79, score-0.342]

27 1 Impact on Dense Data Sets The coefficients [xt ]i for dense data sets are rarely zero. [sent-82, score-0.018]

28 Since the Bii coefficients are initially equal, they remain equal all the time, except maybe when encountering an occasional zero in the patterns xt . [sent-86, score-0.054]

29 Since the scaling matrix reduces to a scalar gain, similar results could in principle be obtained using the ordinary stochastic gradient descent with a better gain schedule. [sent-88, score-0.029]

30 Figure 1 compares the evolutions of the training cost and the test misclassification error for the SVMSGD2 and the Flawed SGD-QN algorithms for selected values of the parameter t0 instead of the usual heuristic defaults. [sent-90, score-0.047]

31 In some cases, Flawed SGD-QN can even slightly outperform SVMSGD2 because, despite the flaw, it can still adapt its learning rate over the course of learning. [sent-94, score-0.008]

32 2 Impact on Sparse Data Sets The situation is more complex in the case of sparse data sets because there is a special case for updating the Bii coefficients when dealing with zero coefficients (5). [sent-97, score-0.009]

33 As a result, the Flawed SGDQN algorithm gives higher values to the scaling coefficients Bii when the i-th feature is more likely to be zero. [sent-98, score-0.008]

34 Figure 2: Plots of the training cost and test misclassification error versus the number of epochs for both SVMSGD2 (red) and Flawed SGD-QN (green) running on the RCV1 data set. [sent-117, score-0.128]

35 Both algorithms reach optimal performance after seeing half the training set. [sent-118, score-0.02]

36 Since this is a sensible scaling for such data sets, the Flawed SGD-QN algorithm works relatively well in the presence of sparse features. [sent-120, score-0.017]

37 Figure 2 compares the SVMSGD2 and Flawed SGD-QN algorithms for many choices of the t0 parameter on the Reuters RCV1 data set. [sent-121, score-0.017]

38 Both algorithms reach optimal performance after processing only one half of the training set. [sent-123, score-0.02]

39 In order to find a more challenging sparse data set, we have adapted both the SVMSGD2 and the Flawed SGD-QN algorithms for the optimization of Conditional Random Fields (Lafferty et al. [sent-124, score-0.009]

40 This is an interesting case where preconditioning is difficult because the features are generated on the fly on the basis of position-independent templates. [sent-126, score-0.033]

41 Figure 3 compares the algorithms on the CoNLL 2000 “chunking” task (Sang and Buchholz, 2000) using the template setup provided as an example with the CRF++ code (Kudo, 2007). [sent-127, score-0.017]

42 The Flawed SGD-QN algorithm reaches the best test performance after fewer epochs than the SVMSGD2 algorithm, but this does not translate into a large improvement in terms of training time. [sent-128, score-0.119]

43 Correcting SGD-QN At first glance, correcting SGD-QN simply involves computing the difference gτ (wt+1 ) − gτ (wt ) with τ = t + 1 instead of τ = t. [sent-130, score-0.008]

44 In fact, during the weeks preceding the Pascal challenge deadline, we tried both versions and found that picking τ = t + 1 performs significantly worse! [sent-131, score-0.007]

47 Figure 3: Plots of the training cost, test loss, and test F1 score for a CRF trained on the CoNLL 2000 chunking task using both SVMSGD2 (red) and Flawed SGD-QN (green) for various initial rates $\eta_0 = \frac{1}{\lambda t_0}$. [sent-141, score-0.058]

48 In order to form an intuition about the learning rate, we must pay attention to its influence on the stochastic noise. [sent-142, score-0.012]

49 Stochastic gradient descent with a constant learning rate generates a cloud of parameter estimates wt covering a zone whose extent is defined by the learning rate and by the curvature of the cost function. [sent-143, score-0.182]

50 When the rates decrease with an appropriate speed, this zone shrinks around the solution. [sent-144, score-0.025]

51 The straightforward fix keeps learning rates of the form $\eta^{\mathrm{flawQN}}_{i,t} = \frac{B_{ii}}{t_0 + t}$ (6). Since the algorithm periodically adapts $B_{ii}$ on the basis of the observed differences $w_{t+1} - w_t$ and $g_{t+1}(w_{t+1}) - g_{t+1}(w_t)$, the sequence of learning rates $\eta^{\mathrm{flawQN}}_{i,t}$ can occasionally increase. [sent-146, score-0.169]

52 This is confirmed by the middle plot of Figure 5, which displays the evolution of the learning rates $\frac{B_{ii}}{t_0 + t}$ for SGD-QN implementing this straightforward fix. [sent-147, score-0.024]

53 2 Managing the Speed of the Learning Rate Decrease The schedule with which the learning rate decreases during training appears to be a key factor, so we propose to fix SGD-QN by using the second-order information to manage this diminution. [sent-152, score-0.02]

54 Hence, we use learning rates of the form $\eta^{\mathrm{corQN}}_{i,t} = \Big(\lambda t_0 + \sum_{k=1}^{t-1} r_{i,k}\Big)^{-1}$ (7), where $r_{i,t} = \frac{[g_{t+1}(w_{t+1}) - g_{t+1}(w_t)]_i}{[w_{t+1} - w_t]_i}$. [sent-153, score-0.015]

55 It is also interesting to compare the formula $\eta^{\mathrm{corQN}}_{i,t}$ with the first-order version $\eta^{\mathrm{SGD}}_{i,t} = \Big(\lambda t_0 + \sum_{k=1}^{t} \lambda\Big)^{-1}$, which decreases the learning rate after each iteration by adding $\lambda$ to the denominator. [sent-155, score-0.007]

56 Instead of adding a lower bound on the curvature, the proposed learning rate formula adds a stochastic estimate of the curvature. [sent-156, score-0.019]

57 Figure 5: Plots of the learning rates corresponding to each feature over the course of learning on the Delta data set (panels: Flawed SGD-QN, Straightforward Fix, Corrected SGD-QN; x-axis: number of epochs, log scale). [sent-174, score-0.309]

58 All rates are equal for Flawed SGD-QN (left plot). [sent-175, score-0.015]

59 Implementing the straightforward fix (middle plot) causes rates to alternately increase or decrease very fast. [sent-177, score-0.024]

60 The Corrected SGD-QN (right plot) proposes learning rates nicely decreasing at different speeds for each feature. [sent-178, score-0.025]

61 Interestingly, Equation (7) leads to a convenient recursive formula
$$\eta^{\mathrm{corQN}}_{i,t} = \Big(\frac{1}{\eta^{\mathrm{corQN}}_{i,t-1}} + r_{i,t-1}\Big)^{-1} = \frac{\eta^{\mathrm{corQN}}_{i,t-1}}{1 + r_{i,t-1}\,\eta^{\mathrm{corQN}}_{i,t-1}}\,. \quad (8)$$ [sent-179, score-0.007]

62 Figure 4 describes the Corrected SGD-QN algorithm and compares it with a slightly reorganized version of the Flawed SGD-QN algorithm. [sent-180, score-0.017]
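A quick numeric check that the accumulated form (7) and the recursion (8) produce the same rates; the values of r below are made up for illustration:

```python
lam, t0 = 1e-4, 1e4
r = [0.3, 0.5, 0.2, 0.4]                       # hypothetical curvature estimates r_{i,1..4}

# Direct form (7): eta_t = (lambda*t0 + sum_{k=1}^{t-1} r_k)^(-1)
direct = [1.0 / (lam * t0 + sum(r[:t - 1])) for t in range(1, len(r) + 2)]

# Recursive form (8): eta_t = eta_{t-1} / (1 + r_{t-1} * eta_{t-1}), with eta_1 = 1/(lambda*t0)
recursive = [1.0 / (lam * t0)]
for rk in r:
    prev = recursive[-1]
    recursive.append(prev / (1.0 + rk * prev))

print(direct)       # the two lists agree up to floating point error
print(recursive)
```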

63 The diagonal matrix B is used to store the gains (7). [sent-181, score-0.01]

64 The algorithm schedules a gain update (line 14) whenever it performs a regularization update (line 16). [sent-182, score-0.016]

65 During the next iteration, the algorithm computes ri,t−1 (line 6) and implements the learning rate update (8) with an additional multiplicative factor (line 8) because this only happens every skip iterations. [sent-183, score-0.05]
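For intuition, here is a schematic reconstruction (under stated assumptions, not the published Figure 4 pseudo-code: hinge loss, dense data, gains refreshed at every iteration instead of every skip iterations, unmoved coordinates kept at $\lambda$, and the ratio clipped at zero so the rates never increase):

```python
import numpy as np

def corrected_sgdqn_sketch(X, y, lam, t0, epochs=1):
    """Per-feature rate adaptation in the spirit of Corrected SGD-QN:
    r_i = [g_{t+1}(w_new) - g_{t+1}(w_old)]_i / [w_new - w_old]_i,
    then eta_i <- eta_i / (1 + r_i * eta_i)."""

    def dloss(margin):                           # hinge loss derivative wrt the margin
        return -1.0 if margin < 1.0 else 0.0

    n, d = X.shape
    w = np.zeros(d)
    eta = np.full(d, 1.0 / (lam * t0))           # eta_{i,1} = 1 / (lambda * t0)

    for _ in range(epochs):
        for t in range(n - 1):
            g_old = lam * w + dloss(y[t] * (X[t] @ w)) * y[t] * X[t]      # g_t(w_t)
            w_new = w - eta * g_old                                       # per-coordinate rates
            # curvature estimate uses the *next* example, i.e. tau = t + 1
            g1_old = lam * w + dloss(y[t + 1] * (X[t + 1] @ w)) * y[t + 1] * X[t + 1]
            g1_new = lam * w_new + dloss(y[t + 1] * (X[t + 1] @ w_new)) * y[t + 1] * X[t + 1]
            dw = w_new - w
            moved = np.abs(dw) > 1e-12
            r = np.full(d, lam)                  # unmoved coordinates: fall back to lambda
            r[moved] = (g1_new[moved] - g1_old[moved]) / dw[moved]
            eta = eta / (1.0 + np.maximum(r, 0.0) * eta)                  # update (8), clipped
            w = w_new
    return w, eta
```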

66 The effect on the learning rates of using Corrected SGD-QN instead of Flawed SGD-QN is illustrated by Figure 5 if we compare the left and the right plots. [sent-184, score-0.015]

67 3 Performances on Dense Data Sets Figure 6 (top row) compares the performances of Corrected SGD-QN with the best results of Flawed SGD-QN and SVMSGD2 on the Delta data set. [sent-186, score-0.026]

68 Before running the SGD algorithms, we always precondition the dense data sets by centering all the features, normalizing their variances, and rescaling every example to ensure that $\|x_k\| = 1$. [sent-188, score-0.037]
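A small NumPy sketch of this preconditioning recipe (the function name and epsilon guard are illustrative assumptions):

```python
import numpy as np

def precondition_dense(X, eps=1e-12):
    """Center features, normalize their variances, then rescale every example to unit norm."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)     # per-feature centering and unit variance
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)                             # per-example ||x_k|| = 1
```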

69 Yet, implementing a strategy involving a single learning rate for all the features already appears very rewarding and, for such cases, the Flawed SGD-QN algorithm is a strong choice because of its capacity to adapt its learning rate. [sent-191, score-0.009]

70 Figure 6 (bottom row) compares the performances of SVMSGD2, Flawed SGD-QN and Corrected SGD-QN on this deconditioned data. [sent-194, score-0.052]

71 Figure 6: Plots versus the number of epochs on the Delta data set, normalized case (top row). [sent-204, score-0.098]

72 Both normalized (top) and deconditioned (bottom) cases are considered; see the text for details. [sent-214, score-0.026]

73 The Flawed SGD-QN algorithm clearly suffers from the deconditioning operation because it cannot assign a different learning rate per feature. [sent-216, score-0.013]

74 We also verified that the estimated learning rates replicate the deconditioning pattern. [sent-218, score-0.028]

75 In conclusion, on dense data sets, the Corrected SGD-QN brings little improvement over the gains already obtained with a good preconditioning technique. [sent-219, score-0.051]

76 Preconditioning was probably the main reason for the good SGD-QN results on dense data sets in the Pascal Large Scale Challenge. [sent-220, score-0.018]

77 Figure 7: Plots of the training cost and test error versus the number of epochs for SVMSGD2 (red) and Flawed SGD-QN (green) with their optimal t0 parameter, and Corrected SGD-QN (brown) running on the RCV1 data set. [sent-240, score-0.128]

78 4 Performances on Sparse Data Sets Preconditioning sparse data sets is much more difficult because it is impossible to center sparse features and keep them sparse. [sent-245, score-0.018]

79 In addition, normalizing the variance of very rare features generates a small number of coefficients with high values. [sent-246, score-0.011]

80 This fat-tailed distribution usually has a very negative impact on the test performance. [sent-247, score-0.011]

81 Figure 7 compares the SVMSGD2, Flawed SGD-QN and Corrected SGD-QN algorithms on the Reuters RCV1 data set, but, as we explained for Figure 2, this task is too easy to draw any conclusions. [sent-248, score-0.017]

82 Figure 8 then compares the adaptations of SVMSGD2, Flawed SGD-QN (with their best parameters) and Corrected SGD-QN for Conditional Random Fields on the CoNLL 2000 “chunking” task with the setup described in Section 4. [sent-249, score-0.017]

83 The Corrected SGD-QN algorithm achieves its optimal test performance after only 75 seconds while SVMSGD2 and Flawed SGD-QN need around twice this time. [sent-251, score-0.011]

84 Conclusion Despite its flaw, the original SGD-QN algorithm works well enough to be a winner of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al. [sent-254, score-0.012]

85 , 2008) because it benefits from our careful preconditioning and handles sparse examples efficiently. [sent-255, score-0.042]

86 However, as explained in this document, this original version often does not achieve the full benefits of a diagonal scaling approach. [sent-256, score-0.018]

89 Figure 8: Plots of the training cost, test loss, and test F1 score for a CRF trained on the CoNLL 2000 chunking task using the best setups of SVMSGD2 (red) and Flawed SGD-QN (green), and Corrected SGD-QN (brown) for various initial rates $\eta_0 = \frac{1}{\lambda t_0}$. [sent-262, score-0.058]

90 Unlike the original SGD-QN algorithm, the Corrected SGDQN algorithm discovers sensible diagonal scaling coefficients. [sent-265, score-0.018]

91 However, experiments on dense data sets of intermediate dimensionality show that similar speed improvements can be achieved by simple preconditioning techniques such as normalizing the means and the variances of each feature and normalizing the length of each example. [sent-266, score-0.073]

92 A stochastic quasi-Newton method for online convex optimization. [sent-316, score-0.012]

93 Towards optimal one pass large scale learning with averaged stochastic gradient descent. [sent-336, score-0.021]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sgdqn', 0.717), ('flawed', 0.554), ('corrected', 0.279), ('wt', 0.154), ('bii', 0.111), ('epochs', 0.098), ('yt', 0.087), ('gt', 0.068), ('xt', 0.054), ('updateb', 0.052), ('ft', 0.047), ('corqn', 0.046), ('olbfgs', 0.046), ('skip', 0.042), ('allinari', 0.039), ('ordes', 0.039), ('ottou', 0.039), ('crf', 0.035), ('mith', 0.033), ('preconditioning', 0.033), ('areful', 0.033), ('ess', 0.033), ('rratum', 0.033), ('bordes', 0.03), ('xpected', 0.028), ('deconditioned', 0.026), ('count', 0.022), ('ri', 0.022), ('aw', 0.022), ('pascal', 0.02), ('awqn', 0.02), ('elta', 0.02), ('facebook', 0.02), ('logscale', 0.02), ('hang', 0.019), ('pn', 0.019), ('dense', 0.018), ('conll', 0.017), ('sang', 0.017), ('compares', 0.017), ('sgd', 0.016), ('delta', 0.016), ('green', 0.015), ('rates', 0.015), ('sonnenburg', 0.015), ('dp', 0.014), ('antoine', 0.014), ('jonathan', 0.014), ('claire', 0.013), ('deconditioning', 0.013), ('hunking', 0.013), ('gallinari', 0.013), ('primal', 0.012), ('winner', 0.012), ('stochastic', 0.012), ('schraudolph', 0.012), ('score', 0.011), ('jussieu', 0.011), ('lip', 0.011), ('tjong', 0.011), ('test', 0.011), ('normalizing', 0.011), ('patrick', 0.011), ('cients', 0.011), ('plots', 0.01), ('coef', 0.01), ('picked', 0.01), ('diagonal', 0.01), ('awed', 0.01), ('schedule', 0.01), ('zone', 0.01), ('red', 0.01), ('half', 0.01), ('proposes', 0.01), ('training', 0.01), ('cost', 0.009), ('alex', 0.009), ('fields', 0.009), ('bottou', 0.009), ('reuters', 0.009), ('curie', 0.009), ('gradient', 0.009), ('misclassi', 0.009), ('sparse', 0.009), ('implementing', 0.009), ('performances', 0.009), ('marie', 0.009), ('scaling', 0.008), ('ratios', 0.008), ('brown', 0.008), ('polyak', 0.008), ('franc', 0.008), ('chunking', 0.008), ('false', 0.008), ('rescaling', 0.008), ('correcting', 0.008), ('update', 0.008), ('paris', 0.007), ('challenge', 0.007), ('formula', 0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith

Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields

2 0.1025421 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods

3 0.087866172 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization ∗. Also at Canberra Research Laboratory, NICTA. c 2010 Choon Hui Teo, S.V. N. Vishwanthan, Alex J. Smola and Quoc V. Le. T EO , V ISHWANATHAN , S MOLA AND L E

4 0.064932592 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph

Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN

5 0.032617796 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

Author: Joshua W. Robinson, Alexander J. Hartemink

Abstract: Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data. An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical model called a nonstationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. Some examples of evolving networks are transcriptional regulatory networks during an organism’s development, neural pathways during learning, and traffic patterns during the day. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. Keywords: Bayesian networks, graphical models, model selection, structure learning, Monte Carlo methods

6 0.025143243 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

7 0.023045674 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

8 0.02234173 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

9 0.020973381 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

10 0.019634131 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

11 0.018521484 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

12 0.018474139 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

13 0.013099832 80 jmlr-2010-On-Line Sequential Bin Packing

14 0.0123512 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

15 0.012240659 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes

16 0.011946491 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

17 0.011253837 63 jmlr-2010-Learning Instance-Specific Predictive Models

18 0.010596981 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

19 0.010355277 36 jmlr-2010-Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity

20 0.0095396489 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.063), (1, -0.053), (2, 0.082), (3, -0.026), (4, 0.23), (5, 0.017), (6, -0.009), (7, -0.022), (8, -0.019), (9, 0.009), (10, -0.069), (11, 0.031), (12, -0.056), (13, 0.012), (14, 0.007), (15, 0.005), (16, -0.063), (17, 0.089), (18, 0.019), (19, 0.036), (20, 0.031), (21, 0.059), (22, -0.023), (23, -0.074), (24, -0.037), (25, 0.004), (26, -0.017), (27, -0.027), (28, -0.001), (29, -0.066), (30, -0.073), (31, 0.009), (32, -0.066), (33, 0.004), (34, 0.013), (35, -0.04), (36, -0.026), (37, 0.036), (38, 0.125), (39, 0.023), (40, 0.098), (41, -0.076), (42, -0.049), (43, -0.199), (44, 0.04), (45, 0.062), (46, -0.089), (47, -0.054), (48, 0.464), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96515048 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith

Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields

2 0.42112654 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Author: Lin Xiao

Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods

3 0.31791642 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le

Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization ∗. Also at Canberra Research Laboratory, NICTA. c 2010 Choon Hui Teo, S.V. N. Vishwanthan, Alex J. Smola and Quoc V. Le. T EO , V ISHWANATHAN , S MOLA AND L E

4 0.27156997 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

Author: Jin Yu, S.V.N. Vishwanathan, Simon Günter, Nicol N. Schraudolph

Abstract: We extend the well-known BFGS quasi-Newton method and its memory-limited variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its memory-limited variant (subLBFGS) to L2 -regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1 -regularized risk minimization with logistic loss. In all these contexts our methods perform comparable to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open source implementation of our algorithms is freely available. Keywords: BFGS, variable metric methods, Wolfe conditions, subgradient, risk minimization, hinge loss, multiclass, multilabel, bundle methods, BMRM, OCAS, OWL-QN

5 0.24192986 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

Author: Joshua W. Robinson, Alexander J. Hartemink

Abstract: Learning dynamic Bayesian network structures provides a principled mechanism for identifying conditional dependencies in time-series data. An important assumption of traditional DBN structure learning is that the data are generated by a stationary process, an assumption that is not true in many important settings. In this paper, we introduce a new class of graphical model called a nonstationary dynamic Bayesian network, in which the conditional dependence structure of the underlying data-generation process is permitted to change over time. Non-stationary dynamic Bayesian networks represent a new framework for studying problems in which the structure of a network is evolving over time. Some examples of evolving networks are transcriptional regulatory networks during an organism’s development, neural pathways during learning, and traffic patterns during the day. We define the non-stationary DBN model, present an MCMC sampling algorithm for learning the structure of the model from time-series data under different assumptions, and demonstrate the effectiveness of the algorithm on both simulated and biological data. Keywords: Bayesian networks, graphical models, model selection, structure learning, Monte Carlo methods

6 0.22637473 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

7 0.20137188 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

8 0.15140659 63 jmlr-2010-Learning Instance-Specific Predictive Models

9 0.14938578 4 jmlr-2010-A Generalized Path Integral Control Approach to Reinforcement Learning

10 0.14280705 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

11 0.12168524 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases

12 0.12111112 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

13 0.12073279 80 jmlr-2010-On-Line Sequential Bin Packing

14 0.11712565 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

15 0.11672068 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

16 0.11531023 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors

17 0.11521488 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing

18 0.10868459 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

19 0.10818648 10 jmlr-2010-An Exponential Model for Infinite Rankings

20 0.10420889 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.013), (4, 0.016), (12, 0.45), (15, 0.024), (21, 0.027), (22, 0.013), (24, 0.011), (32, 0.027), (36, 0.034), (37, 0.049), (75, 0.093), (81, 0.012), (85, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69811803 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected

Author: Antoine Bordes, Léon Bottou, Patrick Gallinari, Jonathan Chang, S. Alex Smith

Abstract: The SGD-QN algorithm described in Bordes et al. (2009) contains a subtle flaw that prevents it from reaching its design goals. Yet the flawed SGD-QN algorithm has worked well enough to be a winner of the first Pascal Large Scale Learning Challenge (Sonnenburg et al., 2008). This document clarifies the situation, proposes a corrected algorithm, and evaluates its performance. Keywords: stochastic gradient descent, support vector machine, conditional random fields

2 0.62291372 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

Author: Kaname Kojima, Eric Perrier, Seiya Imoto, Satoru Miyano

Abstract: We study the problem of learning an optimal Bayesian network in a constrained search space; skeletons are compelled to be subgraphs of a given undirected graph called the super-structure. The previously derived constrained optimal search (COS) remains limited even for sparse superstructures. To extend its feasibility, we propose to divide the super-structure into several clusters and perform an optimal search on each of them. Further, to ensure acyclicity, we introduce the concept of ancestral constraints (ACs) and derive an optimal algorithm satisfying a given set of ACs. Finally, we theoretically derive the necessary and sufficient sets of ACs to be considered for finding an optimal constrained graph. Empirical evaluations demonstrate that our algorithm can learn optimal Bayesian networks for some graphs containing several hundreds of vertices, and even for super-structures having a high average degree (up to four), which is a drastic improvement in feasibility over the previous optimal algorithm. Learnt networks are shown to largely outperform state-of-the-art heuristic algorithms both in terms of score and structural hamming distance. Keywords: Bayesian networks, structure learning, constrained optimal search

3 0.26881364 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

4 0.26459208 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

5 0.26380208 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

6 0.26369599 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

7 0.26293814 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

8 0.26255754 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

9 0.26241669 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

10 0.26139414 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

11 0.26138923 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

12 0.26130027 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

13 0.26093882 63 jmlr-2010-Learning Instance-Specific Predictive Models

14 0.260456 9 jmlr-2010-An Efficient Explanation of Individual Classifications using Game Theory

15 0.25942454 102 jmlr-2010-Semi-Supervised Novelty Detection

16 0.25879729 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

17 0.25842008 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression

18 0.25840229 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

19 0.25730556 109 jmlr-2010-Stochastic Composite Likelihood

20 0.2566953 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning