jmlr jmlr2009 jmlr2009-83 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
Reference: text
sentIndex sentText sentNum sentScore
1 Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires fewer iterations to achieve the same accuracy. [sent-8, score-0.175]
2 Keywords: support vector machine, stochastic gradient descent 1. [sent-11, score-0.15]
3 In various domains such as biology, networking, or information retrieval, fast classification methods able to scale to millions of training instances are needed. [sent-13, score-0.07]
4 Evaluations were carried out on the basis of various performance curves such as training time versus test error, data set size versus test error, and data set size versus training time. [sent-18, score-0.122]
5 Unfortunately, even in the Wild Track case, the evaluation criteria for the competition reward good scaling properties and short training durations more than they punish suboptimal test errors. [sent-21, score-0.107]
6 However, in the large scale setup, when the bottleneck is the computing time rather than the number of training examples, Bottou and Bousquet (2008) have shown that stochastic algorithms often yield the best generalization performance in spite of being the worst optimizers. [sent-33, score-0.125]
7 SGD algorithms were therefore a natural choice for the “Wild Track” of the competition which focuses on the relation between training time and test performance. [sent-34, score-0.085]
8 This quasi-Newton stochastic method is statistically efficient but is penalized in practice by the cost of storing and manipulating the metric tensor. [sent-43, score-0.089]
9 The limited storage version of this algorithm is a quasi-Newton stochastic method whose cost by iteration is a small multiple of the cost of a standard SGD iteration. [sent-46, score-0.16]
10 Stochastic gradient works well in models with nonlinear parametrization. [sent-64, score-0.064]
11 Analysis. This section describes our notation and summarizes theoretical results that are relevant to the design of a fast variant of stochastic gradient algorithms. [sent-70, score-0.119]
12 The linear SVM classifier is obtained by minimizing the primal cost function
P_n(w) = (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n ℓ(y_i w⊤x_i) = (1/n) ∑_{i=1}^n [ (λ/2) ‖w‖² + ℓ(y_i w⊤x_i) ] ,   (1)
where the hyper-parameter λ > 0 controls the strength of the regularization term. [sent-73, score-0.127]
13 Each iteration of the SGD algorithm consists of drawing a random training example (x_t, y_t) and computing a new value of the parameter w_t as
w_{t+1} = w_t − (1/(t + t_0)) B g_t(w_t)   where   g_t(w_t) = λ w_t + ℓ′(y_t w_t⊤x_t) y_t x_t ,   (2)
where the rescaling matrix B is positive definite. [sent-76, score-2.229]
14 • The traditional first-order SGD algorithm, with decreasing learning rate, is obtained by setting B = λ^{-1} I in the generic update (2):
w_{t+1} = w_t − (1/(λ(t + t_0))) g_t(w_t) .   (3) [sent-79, score-0.868]
15 • The second-order SGD algorithm is obtained by setting B to the inverse of the Hessian matrix H = P_n″(w*_n) computed at the optimum w*_n of the primal cost P_n(w):
w_{t+1} = w_t − (1/(t + t_0)) H^{-1} g_t(w_t) .   (4) [sent-80, score-0.93]
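For illustration, here is a minimal Python sketch of the first-order special case (3) with the hinge loss ℓ(z) = max(0, 1 − z). It is not the authors' implementation: the function names, the shuffling strategy, and the default arguments are assumptions made for this example.

```python
import numpy as np

def hinge_dloss(z):
    # Derivative of the hinge loss l(z) = max(0, 1 - z) with respect to z.
    return -1.0 if z < 1.0 else 0.0

def first_order_sgd(X, y, lam, t0, epochs=5):
    """First-order SGD (3): w <- w - g_t(w) / (lam * (t + t0)), with g_t from (2)."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):                         # sequential passes over a shuffled set
        for i in np.random.permutation(n):
            z = y[i] * np.dot(w, X[i])
            g = lam * w + hinge_dloss(z) * y[i] * X[i]   # g_t(w) = lam*w + l'(z) y x
            w -= g / (lam * (t + t0))                    # B = (1/lam) I
            t += 1
    return w
```

Replacing the scalar step 1/(λ(t + t_0)) in the line that updates w by (1/(t + t_0)) H^{-1} would give the second-order update (4).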
16 In practice, one simply performs sequential passes over the randomly shuffled training set. [sent-82, score-0.069]
17 2 What Matters Are the Constant Factors Bottou and Bousquet (2008) characterize the asymptotic learning properties of stochastic gradient algorithms in the large scale regime, that is, when the bottleneck is the computing time rather than the number of training examples. [sent-84, score-0.189]
18 The table reports the time for a single iteration, the number of iterations needed to reach a predefined accuracy ρ, and their product, the time needed to reach accuracy ρ. [sent-86, score-0.101]
19 Compare the second last column (time to optimize) with the last column (time to reach the excess test error ε). [sent-88, score-0.092]
20 Legend: n number of examples; d parameter dimension; c positive constant that appears in the generalization bounds; κ condition number of the Hessian matrix H; ν = tr GH−1 with G the Fisher matrix (see Theorem 1 for more details). [sent-89, score-0.182]
21 The excess test error E measures how much the test error is worse than the best possible error for this problem. [sent-91, score-0.079]
22 Both the first-order and the second-order SGD require a time inversely proportional to ε to reach the target test error. [sent-97, score-0.083]
23 To reach an excess error ε, the most favorable generalization bounds suggest that one needs a number of examples proportional to 1/ε. [sent-103, score-0.087]
24 In fact, Bottou and Bousquet (2008) obtain slightly worse scaling laws for typical non-stochastic gradient algorithms. [sent-105, score-0.086]
25 1740 SGD-QN Theorem 1 Let Eσ denote the expectation with respect to the random selection of the examples (xt , yt ) drawn independently from the training set at each iteration. [sent-109, score-0.137]
26 Let w*_n = arg min_w P_n(w) be an optimum of the primal cost. [sent-110, score-0.085]
27 Define the Hessian matrix H = ∂²P_n(w*_n)/∂w² and the Fisher matrix G = G_t = E_σ[ g_t(w*_n) g_t(w*_n)⊤ ]. [sent-111, score-0.064]
28 If the eigenvalues of HB are in the range λ_max ≥ λ_min > 1/2, and if the SGD algorithm (2) converges to w*_n, the following inequality holds:
tr(HBGB)/(2λ_max − 1) · t^{-1} + o(t^{-1})  ≤  E_σ[P_n(w_t) − P_n(w*_n)]  ≤  tr(HBGB)/(2λ_min − 1) · t^{-1} + o(t^{-1}) . [sent-112, score-0.236]
29 Convergence also holds under slightly stronger assumptions when the rescaling matrix B changes over time. [sent-119, score-0.224]
30 Under the assumptions of Theorem 1, we have E_σ[P_n(w_t) − P_n(w*_n)] = tr(GH^{-1}) t^{-1} + o(t^{-1}) = ν t^{-1} + o(t^{-1}) . [sent-127, score-0.118]
31 Under the assumptions of Theorem 1, we have E_σ[P_n(w_t) − P_n(w*_n)] ≤ λ^{-2} tr(H²GH^{-1}) t^{-1} + o(t^{-1}) ≤ κ²ν t^{-1} + o(t^{-1}) . [sent-129, score-0.118]
32 Given a sample of n independent examples (x_i, y_i), define w*_n = arg min_w P_n(w) and compute w_n by applying the second-order SGD update (4) to each of the n examples. [sent-131, score-0.085]
33 This result means that, asymptotically and on average, the parameter w_n obtained after one pass of second-order SGD is as close to the infinite training set solution w* as the true optimum of the primal, w*_n. [sent-133, score-0.157]
34 Therefore, when the training set is large enough, we can expect that a single pass of second-order SGD (n iterations of (4)) optimizes the primal accurately enough to replicate the test error of the actual SVM solution. [sent-134, score-0.182]
35 When we replace the full second-order rescaling matrix B = H−1 by a more computationally acceptable approximation, Theorem 1 indicates that we lose a constant factor k on the required number of iterations to reach that accuracy. [sent-135, score-0.309]
36 In other words, we can expect to replicate the SVM test error after k passes over the randomly reshuffled training set. [sent-136, score-0.094]
37 On the other hand, a well chosen approximation of the rescaling matrix can save a large constant factor on the computation of the generic SGD update (2). [sent-137, score-0.31]
38 The best training times are therefore obtained by carefully trading the quality of the approximation for sparse representations. [sent-138, score-0.065]
39 Table 2: The regularization term in the primal cost can be viewed as an additional training example with an arbitrarily chosen frequency and a specific loss function.
                      Frequency    Loss
  Special example:    n/skip       skip · (λ/2) ‖w‖²
  Examples 1 to n:    1            ℓ(y_i w⊤x_i) [sent-139, score-0.427]
40 There are of course many other ways to save constant factors: • Exploiting the sparsity of the patterns (see Section 3) can save a constant factor in the cost of each first-order iteration. [sent-142, score-0.137]
41 Scheduling Stochastic Updates to Exploit Sparsity. First-order SGD iterations can be made substantially faster when the patterns x_t are sparse. [sent-150, score-0.168]
42 The first-order SGD update has the form w_{t+1} = w_t − α_t w_t − β_t x_t ,  (5)  where α_t and β_t are scalar coefficients. [sent-151, score-1.27]
43 Subtracting βt xt from the parameter vector involves solely the nonzero coefficients of the pattern xt . [sent-152, score-0.267]
44 On the other hand, subtracting αt wt involves all d coefficients. [sent-153, score-0.554]
45 (2007) circumvent this problem by representing the parameter wt as the product st vt of a scalar and a vector. [sent-156, score-0.611]
46 The update (5) can then be computed as s_{t+1} = (1 − α_t) s_t and v_{t+1} = v_t − β_t x_t / s_{t+1}, in time proportional to the number of nonzero coefficients in x_t. [sent-157, score-0.262]
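A minimal sketch of this scalar-times-vector bookkeeping, assuming the coefficients α_t and β_t of update (5) have already been computed and that the pattern is given as a list of (index, value) pairs; the names are illustrative.

```python
def sparse_step(s, v, alpha_t, beta_t, x_sparse):
    """Apply w <- (1 - alpha_t) w - beta_t x while storing w as the product s * v."""
    s = (1.0 - alpha_t) * s            # handles the dense shrinkage -alpha_t * w
    for i, xi in x_sparse:             # touches only the nonzero coordinates of x
        v[i] -= beta_t * xi / s        # keeps s * v equal to the updated w
    return s, v
```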
47 A more general method consists of treating the regularization term in the primal cost (1) as an additional training example occurring with an arbitrarily chosen frequency with a specific loss function. [sent-159, score-0.163]
48 Consider examples with the frequencies and losses listed in Table 2 and write the average loss:
(1 / (n + n/skip)) [ (n/skip) · (skip λ/2) ‖w‖² + ∑_{i=1}^n ℓ(y_i w⊤x_i) ] = (skip / (1 + skip)) [ (λ/2) ‖w‖² + (1/n) ∑_{i=1}^n ℓ(y_i w⊤x_i) ] . [sent-160, score-0.681]
49 Minimizing this loss is of course equivalent to minimizing the primal cost (1) with its regularization term. [sent-162, score-0.127]
50 The parameter skip regulates the relative frequencies of these updates. [sent-164, score-0.132]
51 The SVMSGD2 algorithm (Bottou, 2007) measures the average pattern sparsity and picks a frequency that ensures that the amortized cost of the regularization update is proportional to the number of nonzero coefficients. [sent-165, score-0.183]
52 Both algorithms handle the real examples at each iteration (line 3) but SVMSGD2 only performs a regularization update every skip iterations (line 6). [sent-167, score-0.262]
53 Assume s is the average proportion of nonzero coefficients in the patterns xi and set skip to c/s where c is a predefined constant (we use c = 16 in our experiments). [sent-168, score-0.176]
54 Each regularization update (line 6) requires d operations but occurs s/c times less often. [sent-170, score-0.068]
55 The average cost per iteration is therefore O(sd) instead of O(d). [sent-171, score-0.091]
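The skip mechanism described above can be sketched as follows; the constant c = 16 follows the text, while the function name, the hinge loss, and the data layout (lists of (index, value) pairs) are assumptions of this example, not the authors' SVMSGD2 code.

```python
import numpy as np

def sgd_with_skip(X_sparse, y, lam, t0, skip, d, epochs=5):
    """Sparse pattern updates plus one regularization update every `skip` iterations."""
    w = np.zeros(d)
    t, count = 0, skip
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            eta = 1.0 / (lam * (t + t0))
            z = y[i] * sum(w[j] * xj for j, xj in X_sparse[i])
            if z < 1.0:                              # hinge loss: touch nonzeros only
                for j, xj in X_sparse[i]:
                    w[j] += eta * y[i] * xj
            count -= 1
            if count <= 0:                           # amortized regularization update
                w *= max(0.0, 1.0 - skip * eta * lam)
                count = skip
            t += 1
    return w

# A plausible choice of skip, following the text: skip = c / s with c = 16,
# where s is the average proportion of nonzero coefficients in the patterns.
```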
56 This is no longer true in the case of large scale machine learning because we care about the test error instead of the training error. [sent-175, score-0.095]
57 Our implementation uses a full vector for representing the parameter w, but handles the patterns x using either a full vector representation or a sparse representation as an ordered list of index/value pairs. [sent-181, score-0.072]
58 For example, on a dense data set with 500 attributes, using sparse vectors increases the training time by 50%; on the sparse RCV1 data set (see Table 4), using a sparse vector to represent the parameter w increases the training time by more than 900%. [sent-184, score-0.159]
59 SGD-QN: A Careful Diagonal Quasi-Newton SGD. As explained in Section 2, designing an efficient quasi-Newton SGD algorithm involves a careful trade-off between the sparsity of the scaling matrix representation B and the quality of its approximation of the inverse Hessian H^{-1}. [sent-192, score-0.145]
60 In the case of a linear model, such preconditioning is similar to using a constant diagonal scaling matrix. [sent-198, score-0.068]
61 ∀i = 1 . . . d :   [W_{t+1}]_i = [W_t]_i − η_t ( λ[W_t]_i + ℓ′(y_t W_t⊤X_t) y_t [X_t]_i ) = [W_t]_i − η_t ( λ[W_t]_i + ℓ′(y_t w_t⊤x_t) y_t b_i [x_t]_i ) . [sent-204, score-0.202]
62 Multiplying by b_i shows how the original parameter vector w_t is affected: [sent-205, score-0.554]
63 ∀i = 1 . . . d :   [w_{t+1}]_i = [w_t]_i − η_t ( λ[w_t]_i + ℓ′(y_t w_t⊤x_t) y_t b_i² [x_t]_i ) . [sent-208, score-0.101]
64 We observe that rescaling the input is equivalent to multiplying the gradient by a fixed diagonal matrix B whose elements are the squares of the coefficients b_i. [sent-209, score-0.334]
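This equivalence can be checked numerically. The following script, which uses an arbitrary smooth loss and random data purely for illustration, compares one step taken in the rescaled parametrization (mapped back to the original coordinates) with one step using the fixed diagonal matrix with entries b_i².

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, eta = 5, 0.1, 0.05
b = rng.uniform(0.5, 2.0, size=d)          # per-feature rescaling coefficients b_i
x, y = rng.normal(size=d), 1.0
w = rng.normal(size=d)

def dloss(z):                               # derivative of the log loss log(1 + exp(-z))
    return -1.0 / (1.0 + np.exp(z))

# One step in the rescaled parametrization W = w / b, X = b * x, mapped back to w.
W, X = w / b, b * x
z = y * W.dot(X)                            # equals y * w.dot(x)
w_rescaled = b * (W - eta * (lam * W + dloss(z) * y * X))

# One step where the loss gradient is multiplied by the diagonal matrix diag(b_i^2).
w_diag = w - eta * (lam * w + dloss(z) * y * (b ** 2) * x)

assert np.allclose(w_rescaled, w_diag)      # both updates produce the same parameter
```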
65 Instead we could consider the current value of the Hessian H_{w_t} = P_n″(w_t) and compute the diagonal rescaling matrix B that makes B H_{w_t} closest to the identity. [sent-212, score-0.27]
66 Becker and Le Cun (1989) approximate the optimal diagonal rescaling matrix by inverting the diagonal coefficients of the Hessian. [sent-214, score-0.316]
67 Low Rank Rescaling Matrices. The popular LBFGS optimization algorithm (Nocedal, 1980) maintains a low-rank approximation of the inverse Hessian by storing the k most recent rank-one BFGS updates instead of the full inverse Hessian matrix. [sent-219, score-0.1]
68 When the successive full gradients P_n′(w_{t−1}) and P_n′(w_t) are available, standard rank-one updates can be used to directly estimate the inverse Hessian matrix H^{-1}. [sent-220, score-0.125]
69 Using this method with stochastic gradient is tricky because the full gradients P_n′(w_{t−1}) and P_n′(w_t) are not readily available. [sent-221, score-0.161]
70 Instead we only have access to the stochastic estimates gt−1 (wt−1 ) and gt (wt ) which are too noisy to compute good rescaling matrices. [sent-222, score-0.497]
71 This reduces the noise to an acceptable level at the expense of the computation of the additional gradient vector gt−1 (wt ). [sent-225, score-0.064]
72 The most expensive part however remains the multiplication of the gradient gt (wt ) by the low-rank estimate of the inverse Hessian. [sent-227, score-0.341]
73 3 SGD-QN The SGD-QN algorithm estimates a diagonal rescaling matrix using a technique inspired by oLBFGS. [sent-230, score-0.27]
74 For any pair of parameters w_{t−1} and w_t, a Taylor series of the gradient of the primal cost P_n provides the secant equation:
w_t − w_{t−1} ≈ H^{-1} ( P_n′(w_t) − P_n′(w_{t−1}) ) .   (6) [sent-231, score-1.271]
75 We would then like to replace the inverse Hessian matrix H^{-1} by a diagonal estimate B:
w_t − w_{t−1} ≈ B ( P_n′(w_t) − P_n′(w_{t−1}) ) . [sent-232, score-1.767]
76 ′ Since we are designing a stochastic algorithm, we do not have access to the full gradient Pn . [sent-233, score-0.141]
77 Following oLBFGS, we replace them by the local gradients gt−1 (wt ) and gt−1 (wt−1 ) and obtain wt − wt−1 ≈ B gt−1 (wt ) − gt−1 (wt−1 ) . [sent-234, score-0.574]
78 1745 B ORDES , B OTTOU AND G ALLINARI Since we chose to use a diagonal rescaling matrix B, we can write the term-by-term equality [wt − wt−1 ]i ≈ Bii [gt−1 (wt ) − gt−1 (wt−1 )]i , where the notation [v]i still represents the i-th coefficient of vector v. [sent-235, score-0.27]
79 d , and where the integer r is incremented whenever we update the matrix B. [sent-240, score-0.072]
80 Since the curvature of the primal cost (1) is always larger than λ, the ratio [gt−1 (wt ) − gt−1 (wt−1 )]i /[wt − wt−1 ]i is always larger than λ. [sent-242, score-0.099]
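A sketch of how the diagonal coefficients could be reestimated from these term-by-term ratios is given below. The clipping at 1/λ reflects the curvature bound just mentioned; the running-average weight 2/r and the default value used for 0/0 coordinates are assumptions of this sketch rather than the paper's exact formulas.

```python
import numpy as np

def reestimate_diagonal(B, r, w_prev, w_new, g_prev_at_prev, g_prev_at_new, lam):
    """Update B_ii from the ratios [w_t - w_{t-1}]_i / [g_{t-1}(w_t) - g_{t-1}(w_{t-1})]_i."""
    dw = w_new - w_prev
    dg = g_prev_at_new - g_prev_at_prev
    ratio = np.full_like(B, 1.0 / lam)       # default for 0/0 coordinates (zero inputs)
    nz = dg != 0.0
    ratio[nz] = np.clip(dw[nz] / dg[nz], 0.0, 1.0 / lam)   # curvature >= lam => ratio <= 1/lam
    r += 1                                   # r counts how many reestimations happened
    B += (2.0 / r) * (ratio - B)             # damped running average of the observed ratios
    return B, r
```

As described below, this reestimation would only be scheduled on the iterations flagged by updateB, so its O(d) cost is amortized in the same way as the regularization update.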
81 Performing the weight update (2) with a diagonal rescaling matrix B consists in performing term-by-term operations with a time complexity that is marginally greater than the complexity of the first-order SGD (3) update. [sent-246, score-0.31]
82 Fortunately this higher computational cost per iteration can be nearly avoided by scheduling the reestimation of the rescaling matrix with the same frequency as the regularization updates. [sent-248, score-0.402]
83 1 has shown that a diagonal rescaling matrix does little more than rescaling the input variables. [sent-250, score-0.462]
84 Since a fixed diagonal rescaling matrix already works quite well, there is little need to update its coefficients very often. [sent-251, score-0.31]
85 Whenever SVMSGD2 performs a regularization update, we set the flag updateB to schedule a reestimation of the rescaling coefficients during the next iteration. [sent-253, score-0.304]
86 Therefore the rescaling matrix reestimation schedule can be regulated with the same skip parameter as the regularization updates. [sent-255, score-0.468]
87 Because SGD-QN reestimates the rescaling matrix after a pattern update, special care must be taken when the ratio [wt − wt−1 ]i /[gt−1 (wt ) − gt−1 (wt−1 )]i has the form 0/0 because the corresponding input coefficient [xt−1 ]i is zero. [sent-257, score-0.224]
88 Since the secant Equation (6) is valid for any two values of the parameter vector, one can compute the ratios with parameter vectors wt−1 and wt + ε and derive the correct value by continuity when ε → 0. [sent-258, score-0.554]
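Working this limit out from the definitions above: for a coordinate i with [x_{t−1}]_i = 0, the loss term of g_{t−1} does not act on coordinate i, so [g_{t−1}(w) − g_{t−1}(w′)]_i = λ[w − w′]_i for any w, w′, and

$$\lim_{\varepsilon \to 0}\frac{[\,w_t + \varepsilon - w_{t-1}\,]_i}{[\,g_{t-1}(w_t+\varepsilon) - g_{t-1}(w_{t-1})\,]_i} = \lim_{\varepsilon \to 0}\frac{[\,w_t + \varepsilon - w_{t-1}\,]_i}{\lambda\,[\,w_t + \varepsilon - w_{t-1}\,]_i} = \frac{1}{\lambda},$$

so by continuity the ratio for such coordinates is simply λ^{-1}.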
89 Data set characteristics:
                       ALPHA      DELTA      RCV1
  Training examples    100,000    100,000    781,265
  Test examples        50,000     50,000     23,149
  Features             500        500        47,152
  s                    1          1          0. [sent-265, score-0.132]
90 Experiments We demonstrate the good scaling properties of SGD-QN in two ways: we present a detailed comparison with other stochastic gradient methods, and we summarize the results obtained on the PASCAL Large Scale Challenge. [sent-268, score-0.141]
91 Our implementation of oLBFGS maintains a rank 10 rescaling matrix. [sent-296, score-0.192]
92 2 Quasi-Newton Figure 3 shows how the primal cost Pn (w) of the Alpha data set evolves with the number of passes (left) and the training time (right). [sent-311, score-0.168]
93 5 2 Figure 3: Primal costs according to the number of epochs (left) and the training duration (right) on the Alpha data set. [sent-346, score-0.077]
94 3 Training Speed Figure 4 displays the test errors achieved on the Alpha, Delta and RCV1 data sets as a function of the number of passes (left) and the training time (right). [sent-349, score-0.094]
95 According to Theorem 4, given a large enough training set, a perfect second-order SGD algorithm would reach the batch test error after a single pass. [sent-353, score-0.099]
96 ) RCV1 DATA SET Figure 4: Test errors (in %) according to the number of epochs (left) and training duration (right). [sent-421, score-0.077]
97 Conclusion The SGD-QN algorithm strikes a good compromise for large scale application because it has low time and memory requirements per iteration and because it reaches competitive test errors after a small number of iterations. [sent-427, score-0.096]
98 Proof of Theorem 1. Define v_t = w_t − w*_n and observe that a second-order Taylor expansion of the primal gives
P_n(w_t) − P_n(w*_n) = v_t⊤H v_t + o(t^{-2}) = tr( H v_t v_t⊤ ) + o(t^{-2}) . [sent-434, score-0.794]
99 Expanding the update recursion expresses H v_t v_t⊤ in terms of H v_{t−1} v_{t−1}⊤, with correction terms of order (t + t_0)^{-1} and (t + t_0)^{-2}; taking the conditional expectation E_{t−1} and then the expectation E_σ of tr( H v_t v_t⊤ ) yields a recurrence in t. Let λ_max ≥ λ_min > 1/2 be the extreme eigenvalues of HB. [sent-438, score-0.354]
100 Then E_σ[ tr( H v_t v_t⊤ ) ] ≤ tr(HBGB)/(2λ_min − 1) · t^{-1} + o(t^{-1}), and
tr(HBGB)/(2λ_max − 1) · t^{-1} + o(t^{-1}) ≤ E_σ[P_n(w_t) − P_n(w*_n)] ≤ tr(HBGB)/(2λ_min − 1) · t^{-1} + o(t^{-1}) . [sent-441, score-0.354]
wordName wordTfidf (topN-words)
[('wt', 0.554), ('sgd', 0.459), ('gt', 0.25), ('olbfgs', 0.239), ('rescaling', 0.192), ('hvt', 0.191), ('pn', 0.148), ('skip', 0.132), ('xt', 0.122), ('tr', 0.118), ('bii', 0.109), ('allinari', 0.107), ('ordes', 0.107), ('ottou', 0.107), ('yt', 0.101), ('hessian', 0.095), ('bgb', 0.083), ('hbgb', 0.072), ('bottou', 0.071), ('liblinear', 0.07), ('primal', 0.065), ('gradient', 0.064), ('count', 0.061), ('schraudolph', 0.06), ('updateb', 0.06), ('vt', 0.057), ('stochastic', 0.055), ('bordes', 0.054), ('alpha', 0.052), ('pascal', 0.051), ('wild', 0.05), ('reestimation', 0.048), ('diagonal', 0.046), ('epochs', 0.041), ('cun', 0.041), ('update', 0.04), ('reach', 0.038), ('sparsity', 0.038), ('sonnenburg', 0.037), ('iteration', 0.037), ('antoine', 0.036), ('schedule', 0.036), ('training', 0.036), ('delta', 0.036), ('eapp', 0.036), ('lpha', 0.036), ('nec', 0.036), ('coef', 0.035), ('cost', 0.034), ('scale', 0.034), ('gh', 0.033), ('passes', 0.033), ('matrix', 0.032), ('descent', 0.031), ('pass', 0.031), ('svms', 0.031), ('scheduling', 0.031), ('secondorder', 0.03), ('sparse', 0.029), ('excess', 0.029), ('track', 0.028), ('regularization', 0.028), ('inverse', 0.027), ('bfgs', 0.026), ('careful', 0.026), ('test', 0.025), ('wn', 0.025), ('tricks', 0.025), ('iterations', 0.025), ('generic', 0.024), ('updates', 0.024), ('competition', 0.024), ('eest', 0.024), ('elta', 0.024), ('eopt', 0.024), ('sident', 0.024), ('online', 0.024), ('cients', 0.023), ('amari', 0.023), ('hsieh', 0.023), ('becker', 0.023), ('nonzero', 0.023), ('bousquet', 0.023), ('scaling', 0.022), ('full', 0.022), ('save', 0.022), ('patrick', 0.022), ('listed', 0.021), ('patterns', 0.021), ('france', 0.021), ('universit', 0.021), ('solvers', 0.021), ('murata', 0.02), ('minw', 0.02), ('table', 0.02), ('proportional', 0.02), ('paris', 0.02), ('gradients', 0.02), ('speedup', 0.02), ('dual', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
2 0.4057714 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.20633705 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
4 0.15611006 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
Author: Wenxin Jiang
Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation
5 0.13288967 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
6 0.12727655 23 jmlr-2009-Discriminative Learning Under Covariate Shift
8 0.091364376 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
9 0.091345057 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
10 0.090252928 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
11 0.088987313 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
12 0.077168152 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
13 0.070302457 13 jmlr-2009-Bounded Kernel-Based Online Learning
14 0.069171138 90 jmlr-2009-Structure Spaces
15 0.066818111 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
16 0.063587941 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
17 0.059848774 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
18 0.05807589 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
19 0.045502663 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
20 0.041615374 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
topicId topicWeight
[(0, 0.316), (1, 0.256), (2, -0.016), (3, -0.193), (4, 0.517), (5, 0.244), (6, -0.084), (7, -0.152), (8, -0.0), (9, 0.003), (10, -0.053), (11, -0.015), (12, -0.07), (13, 0.093), (14, -0.026), (15, 0.044), (16, -0.016), (17, 0.083), (18, -0.021), (19, 0.093), (20, -0.014), (21, -0.031), (22, -0.016), (23, 0.077), (24, -0.023), (25, 0.045), (26, -0.007), (27, -0.023), (28, 0.009), (29, -0.006), (30, 0.011), (31, 0.0), (32, 0.007), (33, 0.001), (34, -0.013), (35, 0.039), (36, 0.016), (37, 0.004), (38, 0.02), (39, -0.004), (40, 0.029), (41, -0.005), (42, -0.04), (43, -0.003), (44, 0.004), (45, 0.003), (46, -0.016), (47, 0.03), (48, -0.016), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.97762883 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
2 0.92479008 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1 . We derive concrete and very simple algorithms for minimization of loss functions with ℓ1 , ℓ2 , ℓ2 , and ℓ∞ regularization. We also show how to construct ef2 ficient algorithms for mixed-norm ℓ1 /ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
3 0.7496962 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
Author: Vojtěch Franc, Sören Sonnenburg
Abstract: We have developed an optimized cutting plane algorithm (OCA) for solving large-scale risk minimization problems. We prove that the number of iterations OCA requires to converge to a ε precise solution is approximately linear in the sample size. We also derive OCAS, an OCA-based linear binary Support Vector Machine (SVM) solver, and OCAM, a linear multi-class SVM solver. In an extensive empirical evaluation we show that OCAS outperforms current state-of-the-art SVM solvers like SVMlight , SVMperf and BMRM, achieving speedup factor more than 1,200 over SVMlight on some data sets and speedup factor of 29 over SVMperf , while obtaining the same precise support vector solution. OCAS, even in the early optimization steps, often shows faster convergence than the currently prevailing approximative methods in this domain, SGD and Pegasos. In addition, our proposed linear multi-class SVM solver, OCAM, achieves speedups of factor of up to 10 compared to SVMmulti−class . Finally, we use OCAS and OCAM in two real-world applications, the problem of human acceptor splice site detection and malware detection. Effectively parallelizing OCAS, we achieve state-of-the-art results on an acceptor splice site recognition problem only by being able to learn from all the available 50 million examples in a 12-million-dimensional feature space. Source code, data sets and scripts to reproduce the experiments are available at http://cmp.felk.cvut.cz/˜xfrancv/ocas/html/. Keywords: risk minimization, linear support vector machine, multi-class classification, binary classification, large-scale learning, parallelization
Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér
Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is considered to be at most O(n2 (e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics
5 0.40704679 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso
6 0.34744579 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality
7 0.32360804 23 jmlr-2009-Discriminative Learning Under Covariate Shift
8 0.31610188 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers
9 0.27238914 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
10 0.2492002 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
11 0.24001713 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
12 0.22269237 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
13 0.21738975 90 jmlr-2009-Structure Spaces
14 0.20957977 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
15 0.19940953 13 jmlr-2009-Bounded Kernel-Based Online Learning
16 0.19042511 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning
17 0.17835793 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
18 0.17414783 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
19 0.16825357 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.16776946 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
topicId topicWeight
[(8, 0.016), (11, 0.012), (19, 0.424), (24, 0.033), (26, 0.011), (38, 0.051), (47, 0.012), (52, 0.053), (55, 0.055), (58, 0.038), (66, 0.094), (68, 0.056), (90, 0.04), (96, 0.023)]
simIndex simValue paperId paperTitle
Author: Leonid (Aryeh) Kontorovich, Boaz Nadler
Abstract: We propose a novel framework for supervised learning of discrete concepts. Since the 1970’s, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning. Keywords: grammar induction, regular language, finite state automaton, maximum margin hyperplane, kernel approximation
same-paper 2 0.79298556 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of secondorder information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
3 0.38476264 38 jmlr-2009-Hash Kernels for Structured Data
Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan
Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification
4 0.38049468 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations
Author: Jonathan Huang, Carlos Guestrin, Leonidas Guibas
Abstract: Permutations are ubiquitous in many real-world problems, such as voting, ranking, and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact and factorized probability distribution representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations. In this paper, we use the “low-frequency” terms of a Fourier decomposition to represent distributions over permutations compactly. We present Kronecker conditioning, a novel approach for maintaining and updating these distributions directly in the Fourier domain, allowing for polynomial time bandlimited approximations. Low order Fourier-based approximations, however, may lead to functions that do not correspond to valid distributions. To address this problem, we present a quadratic program defined directly in the Fourier domain for projecting the approximation onto a relaxation of the polytope of legal marginal distributions. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking scenario. Keywords: identity management, permutations, approximate inference, group theoretical methods, sensor networks
5 0.37918645 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training
Author: Kristian Woodsend, Jacek Gondzio
Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method
6 0.37083715 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
7 0.36850962 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning
8 0.36526346 18 jmlr-2009-Consistency and Localizability
9 0.36492747 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
10 0.36356992 48 jmlr-2009-Learning Nondeterministic Classifiers
11 0.36170375 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
12 0.35982379 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures
13 0.35902429 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions
14 0.3582617 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
15 0.35626549 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
16 0.35565948 82 jmlr-2009-Robustness and Regularization of Support Vector Machines
17 0.35516563 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
18 0.35511291 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence
19 0.35359079 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods
20 0.35084522 25 jmlr-2009-Distributed Algorithms for Topic Models