jmlr jmlr2012 jmlr2012-54 knowledge-graph by maker-knowledge-mining

54 jmlr-2012-Large-scale Linear Support Vector Regression


Source: pdf

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Keywords: support vector regression, Newton methods, coordinate descent methods [sent-13, score-0.128]

2 , Joachims, 1998; Platt, 1998; Chang and Lin, 2011), it is well known that training/testing large-scale nonlinear SVC and SVR is time consuming. [sent-21, score-0.058]

3 Recently, for some applications such as document classification, linear SVC without using kernels has been shown to give competitive performances, but training and testing are much faster. [sent-22, score-0.063]

4 In this paper, we develop efficient training methods to demonstrate that, similar to SVC, linear SVR can sometimes achieve comparable performance to nonlinear SVR, but enjoys much faster training/testing. [sent-30, score-0.071]

5 , 2008), while the second is a coordinate descent approach for the dual form (Hsieh et al. [sent-38, score-0.177]

6 We show that it is straightforward to extend the Newton method for linear SVR, but some careful redesign is essential for applying coordinate descent methods. [sent-40, score-0.128]

7 A coordinate descent method quickly gives an approximate solution, but may converge slowly in the end. [sent-42, score-0.128]

8 In particular, we propose a condensed implementation of coordinate descent methods. [sent-48, score-0.128]

9 Linear Support Vector Regression Given a set of training instance-target pairs {(xi , yi )}, xi ∈ Rn , yi ∈ R, i = 1, . [sent-53, score-0.139]

10 , l, linear SVR finds a model w such that wT xi is close to the target value yi . [sent-56, score-0.071]

11 min_w f(w), where f(w) ≡ (1/2) wT w + C ∑_{i=1}^{l} ξε(w; xi, yi). (1) [sent-58, score-0.071]

12 In Equation (1), C > 0 is the regularization parameter, and ξε(w; xi, yi) = max(|wT xi − yi| − ε, 0) (2) or max(|wT xi − yi| − ε, 0)² (3) is the ε-insensitive loss function associated with (xi, yi). [sent-59, score-0.253]

13 The parameter ε is given so that the loss is zero if |wT xi − yi | ≤ ε. [sent-60, score-0.071]
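
To make the loss definitions above concrete, here is a minimal sketch (not code from the paper; it assumes NumPy arrays X of shape (l, n), y of shape (l,), and user-chosen C and epsilon) that evaluates the L1/L2 ε-insensitive losses and the primal objective (1).

```python
import numpy as np

def epsilon_insensitive_loss(w, X, y, epsilon, squared=False):
    # xi_eps(w; x_i, y_i) = max(|w^T x_i - y_i| - epsilon, 0); squared=True gives the L2 loss (3)
    r = np.maximum(np.abs(X @ w - y) - epsilon, 0.0)
    return r ** 2 if squared else r

def primal_objective(w, X, y, C, epsilon, squared=False):
    # f(w) = 0.5 * w^T w + C * sum_i xi_eps(w; x_i, y_i)   -- Equation (1)
    return 0.5 * w @ w + C * np.sum(epsilon_insensitive_loss(w, X, y, epsilon, squared))
```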

14 Figure 1: L1-loss and L2-loss functions (the horizontal axis is wT xi − yi; both losses are zero on [−ε, ε]). [sent-73, score-0.071]

15 In this paper, we refer to (1) as the primal SVR problem, while (4) as the dual SVR problem. [sent-76, score-0.088]

16 The primal-dual relationship indicates that primal optimal solution w∗ and dual optimal solution (α+)∗ and (α−)∗ satisfy w∗ = ∑_{i=1}^{l} ((α+_i)∗ − (α−_i)∗) xi. [sent-77, score-0.088]
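
In code, the relation above is a single matrix-vector product; the snippet below (an illustration under the same NumPy conventions as the earlier sketch, with alpha_plus and alpha_minus holding the dual optimum) recovers the primal w.

```python
import numpy as np

def primal_from_dual(alpha_plus, alpha_minus, X):
    # w* = sum_i ((alpha_i^+)* - (alpha_i^-)*) x_i  =  X^T (alpha^+ - alpha^-)
    return X.T @ (alpha_plus - alpha_minus)
```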

17 The first is a Newton method for L2-loss SVR, while the second is a coordinate descent method for L1-/L2-loss SVR. [sent-83, score-0.128]

18 At the k-th outer-layer iteration, given the current position wk, TRON sets a trust-region size ∆k and constructs a quadratic model qk(s) ≡ ∇f(wk)T s + (1/2) sT ∇²f(wk) s as the approximation to f(wk + s) − f(wk). [sent-109, score-0.062]

19 , xl ]T , I1 ≡ {i | wT xi − yi > ε}, and I2 ≡ {i | wT xi − yi < −ε}. [sent-118, score-0.142]

20 For the stopping condition, we follow the setting of TRON in LIBLINEAR for classification. [sent-132, score-0.062]

21 ||∇f(wk)||_2 ≤ εs ||∇f(w0)||_2, (7) where w0 is the initial iterate and εs is the stopping tolerance given by users. [sent-134, score-0.108]
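
A hypothetical sketch of the two ingredients mentioned in the preceding lines, the trust-region quadratic model and the relative stopping rule (7); hess_vec is assumed to be a Hessian-vector-product callback, since TRON never forms the Hessian explicitly.

```python
import numpy as np

def quadratic_model(grad, hess_vec, s):
    # q_k(s) = grad f(w_k)^T s + 0.5 * s^T (Hessian f(w_k)) s
    return grad @ s + 0.5 * s @ hess_vec(s)

def tron_converged(grad_wk, grad_w0, eps_s):
    # Stopping rule (7): ||grad f(w_k)||_2 <= eps_s * ||grad f(w_0)||_2
    return np.linalg.norm(grad_wk) <= eps_s * np.linalg.norm(grad_w0)
```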

22 2 Dual Coordinate Descent Methods (DCD) In this section, we introduce DCD, a coordinate descent method for the dual form of SVC/SVR. [sent-139, score-0.177]

23 (A Direct Extension from Classification to Regression) A coordinate descent method sequentially updates one variable by solving the following subproblem. [sent-146, score-0.141]

24 (2008) check the projected gradient ∇P fA (α) for the stopping condition because α is optimal if and only if ∇P fA (α) is zero. [sent-185, score-0.133]

25 (2008) apply two techniques to make a coordinate descent method faster. [sent-190, score-0.128]

26 Their shrinking strategy can be directly applied here, so we omit details. [sent-199, score-0.129]

27 While we have directly applied a coordinate descent method to solve (4), the procedure does not take SVR’s special structure into account. [sent-200, score-0.142]

28 Shrinking can partially solve this problem, but alternatively we may explicitly use the property (12) in designing the coordinate descent algorithm. [sent-213, score-0.142]

29 If the optimality condition at α−_i is not satisfied yet, then ∇P_{i+l} fA(α) = ∇_{i+l} fA(α) = −(Q(α+ − α−))_i + ε + yi + λα−_i < 0. [sent-217, score-0.061]

30 We then have 0 < −∇_{i+l} fA(α) = (Q(α+ − α−))_i − ε − yi − λα−_i < (Q(α+ − α−))_i + ε − yi + λα+_i = ∇_i fA(α), (13) so a larger violation of the optimality condition occurs at α+_i. [sent-218, score-0.158]

31 However, the straightforward coordinate descent implementation discussed in this section still possesses some advantages. [sent-223, score-0.128]

32 We design a coordinate descent method to solve (15). [sent-234, score-0.142]

33 Interestingly, (15) is in a form similar to the primal optimization problem of L1-regularized regression and classification. [sent-235, score-0.057]

34 In LIBLINEAR, a coordinate descent solver is provided for L1-regularized L2-loss SVC (Yuan et al. [sent-236, score-0.128]

35 Second, the discussion will help to explain our stopping condition and shrinking procedure. [sent-246, score-0.212]

36 1, we maintain a vector u and calculate (Q̄β)i by (Q̄β)i = uT xi + λβi, where u = ∑_{i=1}^{l} βi xi. [sent-298, score-0.062]
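
The sketch below illustrates one coordinate step of this kind of method on the reformulated dual problem (a hand-written illustration, not LIBLINEAR's implementation; it assumes a dense NumPy matrix X of shape (l, n), arrays beta and u updated in place, and loss-dependent constants lam and U, e.g., lam = 0 with U = C for the L1 loss and a positive lam with U = ∞ for the L2 loss).

```python
import numpy as np

def dcd_coordinate_step(i, beta, u, X, y, eps, lam, U):
    """One coordinate update for min_beta 0.5*beta^T Qbar beta - y^T beta + eps*||beta||_1
    subject to -U <= beta_j <= U, where Qbar = X X^T + lam*I and u = sum_j beta_j x_j."""
    x_i = X[i]
    Qbar_ii = x_i @ x_i + lam                 # assumes Qbar_ii > 0; the x_i = 0, L1-loss
                                              # corner case is handled separately in the paper
    g = u @ x_i + lam * beta[i] - y[i]        # (Qbar beta)_i - y_i
    gp, gn = g + eps, g - eps                 # g'_p and g'_n as in the text
    if gp < Qbar_ii * beta[i]:
        d = -gp / Qbar_ii                     # new beta_i stays positive
    elif gn > Qbar_ii * beta[i]:
        d = -gn / Qbar_ii                     # new beta_i stays negative
    else:
        d = -beta[i]                          # optimal new beta_i is 0
    new_beta_i = min(max(beta[i] + d, -U), U) # project onto the box [-U, U]
    u += (new_beta_i - beta[i]) * x_i         # keep u = sum_j beta_j x_j up to date
    beta[i] = new_beta_i
```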

37 Besides Algorithm 3, other types of coordinate descent methods may be applicable here. [sent-304, score-0.128]

38 Studies of such random coordinate descent methods with run time analysis include, for example, Shalev-Shwartz and Tewari (2011), Nesterov (2010), and Richtárik and Takáč (2011). [sent-306, score-0.143]

39 For the stopping condition and the shrinking procedure, we will mainly follow the setting in LIBLINEAR for L1-regularized classification. [sent-307, score-0.212]

40 To begin, we study how to measure the violation of the optimality condition of (16) during the optimization procedure. [sent-308, score-0.078]

41 Thus, if 0 < βi < U, |g′p(βi)| can be considered as the violation of the optimality. [sent-310, score-0.057]

42 Thus, g′n(βi) if βi = 0 and g′n(βi) > 0, or −g′p(βi) if βi = 0 and g′p(βi) < 0, gives the violation of the optimality. [sent-312, score-0.057]

43 We follow them to use a similar stopping condition. [sent-319, score-0.062]

44 ||v^k||_1 < εs ||v^0||_1, (25) where v^0 and v^k are the initial violation and the violation in the k-th iteration, respectively. [sent-320, score-0.186]

45 Note that vk ’s components are sequentially obtained via (24) in l coordinate descent steps of the k-th iteration. [sent-321, score-0.177]

46 βi = 0 and g′n(βi) < −M < 0 < M < g′p(βi), (26) βi = U and g′p(βi) < −M, (27) or βi = −U and g′n(βi) > M, (28) where M ≡ max_i v_i^{k−1} (29) is the maximal violation of the previous iteration. [sent-328, score-0.057]

47 The condition (26) is equivalent to βi = 0 and −ε + M < (Q̄β)i − yi < ε − M. [sent-329, score-0.061]
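
A small sketch of how such a shrinking test might look in code (a hypothetical helper, not the paper's implementation; gp and gn are the g′p(βi) and g′n(βi) values from the preceding lines, M is the maximal violation of the previous iteration, and U is the box bound).

```python
def should_shrink(beta_i, gp, gn, U, M):
    # Conditions (26)-(28): shrink variable i when it sits at zero or at a bound
    # and its gradient values suggest it will not move, with safety margin M.
    if beta_i == 0 and gn < -M and gp > M:      # (26)
        return True
    if beta_i == U and gp < -M:                 # (27)
        return True
    if beta_i == -U and gn > M:                 # (28)
        return True
    return False
```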

48 Note that M used in conditions (26), (27), and (28) controls how aggressive our shrinking scheme is. [sent-335, score-0.161]

49 For L2-loss SVR, αi is not upper-bounded in the dual problem, so (26) becomes the only condition to shrink variables. [sent-338, score-0.092]

50 In this situation, Equation (13) indicates that for finding the maximal violation of the optimality condition, we only need to check ∇P_i fA(α) rather than ∇P_{i+l} fA(α). [sent-345, score-0.075]

51 Algorithm 4: Details of Algorithm 3 with a stopping condition and a shrinking implementation. [sent-349, score-0.212]

52 g′p ← −yi + uT xi + λβi + ε, g′n ← −yi + uT xi + λβi − ε. [sent-378, score-0.062]

53 During the iterations, the stopping condition of a smaller problem of T is checked. [sent-412, score-0.083]

54 This setting ensures that the algorithm stops only after the stopping condition for problem (15) is satisfied. [sent-418, score-0.083]

55 They consider SVR with a bias term, so the dual problem contains an additional linear constraint. [sent-429, score-0.081]

56 Because of this constraint, their coordinate descent implementation (called decomposition methods in the SVM community) must select at least two variables at a time. [sent-431, score-0.128]

57 For nonlinear SVM, we can afford to use gradient information for selecting the working variables; see reasons explained in Section 4. [sent-452, score-0.056]

58 Therefore, their coordinate descent algorithm implicitly has a shrinking implementation, so the first concern discussed in Section 3. [sent-460, score-0.257]

59 The implementation of coordinate descent methods for nonlinear SVM is more complicated than that for linear because of steps such as gradient-based variable selection and kernel-cache maintenance, etc. [sent-469, score-0.171]

60 This is the approach taken by the nonlinear SVM package LIBSVM (Chang and Lin, 2011), in which SVC and SVR share the same optimization solver. [sent-471, score-0.056]

61 Given the target values y and the predicted values y′, R² is defined as R² = (∑_i (y′_i − E[y′])(y_i − E[y]))² / (σ²_{y′} σ²_y) = (l ∑_i y′_i y_i − (∑_i y′_i)(∑_i y_i))² / ((l ∑_i y′_i² − (∑_i y′_i)²)(l ∑_i y_i² − (∑_i y_i)²)). [sent-480, score-0.08]
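
For reference, a small sketch (hypothetical, with y_pred and y_true as 1-D NumPy arrays) that computes R² as the squared correlation coefficient in the quoted form.

```python
import numpy as np

def r_squared(y_pred, y_true):
    # R^2 as the squared correlation between predicted and true target values.
    l = len(y_true)
    num = (l * np.sum(y_pred * y_true) - np.sum(y_pred) * np.sum(y_true)) ** 2
    den = ((l * np.sum(y_pred ** 2) - np.sum(y_pred) ** 2)
           * (l * np.sum(y_true ** 2) - np.sum(y_true) ** 2))
    return num / den
```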

62 The x-axis is the training time, and the y-axis is the relative difference to the dual optimal function value. [sent-521, score-0.077]

63 We present training time and relative difference to the dual optimal function values. [sent-541, score-0.092]

64 If shrinking is not applied, we simply plot the value (31) once every eight iterations. [sent-551, score-0.129]

65 With shrinking, the setting is more complicated because the stopping tolerance εs affects the shrinking implementation; see step 6. [sent-552, score-0.221]

66 We mentioned that shrinking can reduce the overhead and this is supported by the result that DCD-1-sh becomes closer to DCD-2-sh. [sent-559, score-0.129]

67 This experiment also reveals how useful the shrinking technique is. [sent-561, score-0.129]

68 For both Algorithms 2 and 4, shrinking very effectively reduces the training time. [sent-562, score-0.157]

69 Table 2: Testing MSE and training time using DCD for linear SVR and LIBSVM for nonlinear SVR with RBF kernel. [sent-585, score-0.086]

70 3 A Comparison Between Linear and Nonlinear SVR We wrote in Section 1 that the motivation of this research work is to check if for some applications linear SVR can give competitive MSE/R2 with nonlinear SVR, but enjoy faster training. [sent-590, score-0.061]

71 Because LIBSVM’s training time is very long, we only use 1% training data for MSD, and 0. [sent-594, score-0.071]

72 1 for both methods although their stopping conditions are slightly different. [sent-608, score-0.062]

73 In Table 2, we observe that for all data sets except MSD, nonlinear SVR gives only marginally better MSE than linear SVR, but the training time is prohibitively long. [sent-610, score-0.086]

74 The dotted and solid horizontal lines respectively indicate the function values of TRON using stopping tolerance εs = 0. [sent-625, score-0.092]

75 Although DCD solves the dual problem, for calculating (32), we can obtain a corresponding primal value using w = X T β. [sent-634, score-0.101]

76 Because in practice users apply TRON or DCD under a fixed stopping tolerance, we draw two horizontal lines in Figure 4 to indicate the result using a typical tolerance value. [sent-636, score-0.092]

77 For MSD, which has only 90 features, DCD’s primal function value is so unstable that it does not reach the stopping condition for drawing the horizontal line. [sent-641, score-0.122]

78 By comparing Figures 4 and 5, we observe that both methods have shorter training time for normalized data. [sent-648, score-0.067]

79 We apply L1-loss SVR on normalized data sets to compare MSE values with and without the bias term. [sent-665, score-0.056]

80 2, we introduced DCD’s shrinking scheme with a parameter M defined as the maximal violation of the optimality condition. [sent-713, score-0.186]

81 We pointed out that the smaller M is, the more aggressive the shrinking method is. [sent-714, score-0.161]

82 Using L1-loss SVR, Figure 6(a) shows the relationship between the relative difference to the optimal dual function value and the training time of MSD and TFIDF-2006. [sent-725, score-0.092]

83 Results indicate that these data sets need a more aggressive shrinking strategy. [sent-726, score-0.161]

84 Figure 6: A comparison of three shrinking settings on MSD and TFIDF-2006; panels: (a) L1-loss SVR, (b) L2-loss SVR. [sent-729, score-0.129]

85 We show the relative difference to the dual optimal function values and training time (in seconds). [sent-732, score-0.092]

86 An investigation shows that the stopping condition of DCD used to generate Table 5 is too loose for this problem. [sent-793, score-0.083]

87 If a strict stopping condition is used, the two MAE values become close to each other. [sent-794, score-0.083]

88 Further, the shrinking strategies make both DCD methods faster. [sent-812, score-0.129]

89 We propose an efficient DCD method to solve a reformulation of the dual problem. [sent-827, score-0.063]

90 An interesting future research direction is to apply coordinate descent methods for L1-regularized least-square regression, which has been shown to be related to problem (15). [sent-829, score-0.128]

91 However, we expect some differences because the former is a primal problem, while the latter is a dual problem. [sent-830, score-0.088]

92 In summary, we have successfully demonstrated that for some document data, the proposed methods can efficiently train linear SVR while achieving testing errors comparable to nonlinear SVR. [sent-833, score-0.078]

93 Tseng and Yun (2009) propose a general coordinate descent method. [sent-844, score-0.128]

94 If each time only one variable is updated, the subproblem of their coordinate descent method is min_s ∇i F(β)(s − βi) + (1/2) H̄ (s − βi)² + ε|s| if −U ≤ s ≤ U, and ∞ otherwise, (37) where H̄ is any positive value. [sent-847, score-0.183]

95 The only situation that (38) fails is when xi = 0 and L1-loss SVR is applied. [sent-854, score-0.059]

96 In this situation, βT Q̄β is not related to βi and the minimization of −yi βi + ε|βi| shows that the optimal β∗_i is β∗_i = U if −yi + ε < 0, −U if −yi − ε > 0, and 0 otherwise. (39) [sent-855, score-0.08]

97 A dual coordinate descent method for large-scale linear SVM. [sent-915, score-0.177]

98 Efficiency of coordinate descent methods on huge-scale optimization problems. [sent-967, score-0.128]

99 Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. [sent-982, score-0.06]

100 A coordinate gradient descent method for nonsmooth separable minimization. [sent-991, score-0.141]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('svr', 0.655), ('dcd', 0.401), ('tron', 0.237), ('svc', 0.21), ('fa', 0.21), ('msd', 0.172), ('mse', 0.167), ('qii', 0.131), ('shrinking', 0.129), ('ctr', 0.123), ('ector', 0.115), ('upport', 0.115), ('mae', 0.095), ('liblinear', 0.09), ('hsieh', 0.082), ('arge', 0.081), ('fb', 0.074), ('coordinate', 0.068), ('yun', 0.066), ('stopping', 0.062), ('inear', 0.062), ('tseng', 0.06), ('descent', 0.06), ('egression', 0.06), ('violation', 0.057), ('libsvm', 0.056), ('newton', 0.051), ('dual', 0.049), ('wk', 0.047), ('wt', 0.046), ('nonlinear', 0.043), ('yi', 0.04), ('subproblem', 0.04), ('primal', 0.039), ('chang', 0.039), ('lin', 0.038), ('vk', 0.036), ('aggressive', 0.032), ('hessian', 0.032), ('bias', 0.032), ('xi', 0.031), ('tolerance', 0.03), ('training', 0.028), ('situation', 0.028), ('liao', 0.027), ('sathiya', 0.027), ('trust', 0.025), ('ut', 0.025), ('normalized', 0.024), ('keerthi', 0.023), ('shrink', 0.022), ('condition', 0.021), ('mor', 0.02), ('taiwan', 0.02), ('kogan', 0.019), ('oordinate', 0.019), ('unigrams', 0.019), ('zei', 0.019), ('projected', 0.019), ('testing', 0.019), ('regression', 0.018), ('check', 0.018), ('yuan', 0.018), ('scale', 0.018), ('normalization', 0.017), ('qi', 0.017), ('losses', 0.017), ('xw', 0.016), ('csie', 0.016), ('ntu', 0.016), ('escent', 0.016), ('nancial', 0.016), ('hoerl', 0.016), ('richt', 0.016), ('tak', 0.016), ('year', 0.016), ('song', 0.016), ('document', 0.016), ('iterate', 0.016), ('qk', 0.015), ('conduct', 0.015), ('url', 0.015), ('optimum', 0.015), ('classi', 0.015), ('time', 0.015), ('tf', 0.015), ('table', 0.014), ('find', 0.014), ('vladimir', 0.014), ('differentiable', 0.014), ('solve', 0.014), ('updated', 0.014), ('shrunk', 0.014), ('suspect', 0.014), ('tw', 0.014), ('louvain', 0.014), ('package', 0.013), ('solves', 0.013), ('gradient', 0.013), ('sequentially', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 54 jmlr-2012-Large-scale Linear Support Vector Regression

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

2 0.091684684 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

3 0.06493222 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin

Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines

4 0.055263147 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

Author: Aviv Tamar, Dotan Di Castro, Ron Meir

Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms

5 0.051913396 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie

Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling

6 0.044753063 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

7 0.037038632 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models

8 0.033050556 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

9 0.032828815 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks

10 0.032353453 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training

11 0.031479929 97 jmlr-2012-Regularization Techniques for Learning with Matrices

12 0.031145129 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization

13 0.029388433 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

14 0.028546613 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

15 0.027193438 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems

16 0.026432747 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

17 0.025120093 20 jmlr-2012-Analysis of a Random Forests Model

18 0.025103042 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

19 0.022537103 59 jmlr-2012-Linear Regression With Random Projections

20 0.022393284 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.114), (1, -0.032), (2, 0.037), (3, -0.014), (4, 0.05), (5, 0.015), (6, 0.001), (7, -0.014), (8, -0.041), (9, 0.019), (10, -0.005), (11, 0.174), (12, 0.082), (13, 0.015), (14, 0.194), (15, -0.028), (16, -0.044), (17, -0.122), (18, -0.002), (19, 0.11), (20, -0.077), (21, 0.058), (22, 0.023), (23, 0.093), (24, 0.02), (25, 0.064), (26, 0.074), (27, 0.023), (28, 0.246), (29, -0.139), (30, 0.217), (31, 0.145), (32, -0.143), (33, -0.107), (34, -0.256), (35, 0.082), (36, 0.055), (37, -0.205), (38, 0.105), (39, 0.153), (40, 0.228), (41, -0.027), (42, 0.082), (43, 0.092), (44, -0.063), (45, 0.048), (46, -0.171), (47, -0.035), (48, -0.011), (49, 0.067)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94214332 54 jmlr-2012-Large-scale Linear Support Vector Regression

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

2 0.6965636 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin

Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines

3 0.31820029 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

4 0.24064448 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems

Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie

Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling

5 0.21839422 23 jmlr-2012-Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training

Author: Zhuang Wang, Koby Crammer, Slobodan Vucetic

Abstract: Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction. Keywords: SVM, large-scale learning, online learning, stochastic gradient descent, kernel methods

6 0.21316898 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning

7 0.21296939 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

8 0.18791802 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems

9 0.17641193 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models

10 0.17432588 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

11 0.1730091 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing

12 0.14906885 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

13 0.13783357 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization

14 0.1345765 44 jmlr-2012-Feature Selection via Dependence Maximization

15 0.13372344 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems

16 0.12871791 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

17 0.1285242 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

18 0.1217564 20 jmlr-2012-Analysis of a Random Forests Model

19 0.12078772 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

20 0.12062131 97 jmlr-2012-Regularization Techniques for Learning with Matrices


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.019), (21, 0.028), (26, 0.038), (29, 0.026), (35, 0.011), (56, 0.015), (57, 0.012), (64, 0.012), (69, 0.011), (75, 0.062), (77, 0.486), (92, 0.061), (96, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83593833 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition

Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller

Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation.1 Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision 1. This work is an expanded and revised version of Kae et al. (2010). Supported by NSF Grant IIS-0916555. c 2012 Gary B. Huang, Andrew Kae, Carl Doersch and Erik Learned-Miller. H UANG , K AE , D OERSCH AND L EARNED -M ILLER

same-paper 2 0.73856741 54 jmlr-2012-Large-scale Linear Support Vector Regression

Author: Chia-Hua Ho, Chih-Jen Lin

Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-the-art training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods

3 0.67709714 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao

Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization

4 0.41828203 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression

Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin

Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines

5 0.34058723 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models

Author: Jun Zhu, Amr Ahmed, Eric P. Xing

Abstract: A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification. Keywords: supervised topic models, max-margin learning, maximum entropy discrimination, latent Dirichlet allocation, support vector machines

6 0.32422325 82 jmlr-2012-On the Necessity of Irrelevant Variables

7 0.32331681 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

8 0.31469834 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

9 0.31328166 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

10 0.31158158 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

11 0.31017432 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

12 0.30790436 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

13 0.30671841 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies

14 0.30620247 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

15 0.30561101 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms

16 0.30429173 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis

17 0.30230725 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach

18 0.29944146 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection

19 0.29861099 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

20 0.29831016 89 jmlr-2012-Pairwise Support Vector Machines and their Application to Large Scale Problems