jmlr jmlr2012 jmlr2012-18 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. [sent-15, score-0.16]
2 For logistic regression and L2-loss SVM, we have the following loss functions. [sent-33, score-0.085]
3 An example is L1-regularized logistic regression, where exp/log operations are more expensive than other basic operations. [sent-60, score-0.127]
4 While logistic regression is an example of problems with expensive loss functions, to check the situation of cheap loss functions, in Section 7, we extend newGLMNET to solve L2-loss SVM. [sent-77, score-0.139]
5 w^{k,j} ≡ [w_1^{k+1}, ..., w_{j-1}^{k+1}, w_j^k, ..., w_n^k]^T (5) and the indicator vector e_j ≡ [0, ..., 0, 1, 0, ..., 0]^T. [sent-113, score-0.28]
6 Let w^{k,1} = w^k = w^{k-1,n+1} at the beginning of each iteration. [sent-120, score-0.29]
7 where τ(s) is the derivative of the logistic loss function log(1 + e^s): τ(s) = 1/(1 + e^{-s}), D ∈ R^{l×l} is a diagonal matrix with D_ii = τ(y_i w^T x_i)(1 − τ(y_i w^T x_i)), (8) and X ≡ [x_1, ..., x_l]^T. [sent-140, score-0.093]
8 Yuan et al. (2010) store e^{w^T x_i} instead and update the values by e^{w^T x_i} ← e^{w^T x_i} · e^{λ d̄ x_{ij}}, ∀i. (11) [sent-162, score-0.176]
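The point of (11) is that exp values never have to be recomputed from scratch: only one coordinate of w changes per CD step, so the cached e^{w^T x_i} can be refreshed multiplicatively. A minimal numpy sketch of this bookkeeping; the function and variable names here are illustrative, not taken from the LIBLINEAR implementation:

```python
import numpy as np

def update_exp_cache(X, exp_wTx, j, lam, d_bar):
    """Refresh the cached values exp(w^T x_i) after coordinate j of w is
    changed by lam * d_bar, following the multiplicative update (11).

    X       : (l, n) dense data matrix (a sparse column works the same way)
    exp_wTx : length-l array caching exp(w^T x_i)
    j       : index of the updated feature
    lam     : accepted line-search step size
    d_bar   : one-variable Newton step for coordinate j
    """
    # Only instances with X[i, j] != 0 change, so the cost is
    # O(number of nonzeros in column j); exp(0) = 1 leaves the rest untouched.
    exp_wTx *= np.exp(lam * d_bar * X[:, j])
    return exp_wTx
```

With this cache, τ(y_i w^T x_i) and the logistic loss terms can be formed from e^{w^T x_i} (or its reciprocal) without additional exp calls during the CD step.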
9 As shown in Yuan et al. (2010), for the same data set, CDN’s training time for logistic regression is longer than that for L2-loss SVM. [sent-184, score-0.11]
10 Motivated by this observation, we point out that CDN suffers from the expensive exp/log operations of logistic regression. [sent-185, score-0.136]
11 In Table 1, we conduct an experiment on two data sets, epsilon and webspam. [sent-186, score-0.088]
12 We check the proportion of time for exp/log operations in the first CD cycle of CDN. [sent-187, score-0.148]
13 For logistic regression, exp/log operations occur in computing the function, gradient, and Hessian of the loss. [sent-197, score-0.115]
14 At the kth iteration, consider the following sub-problem to find a direction d: min_d q_k(d), (13) where q_k(d) ≡ ∇L(w^k)^T d + (1/2) d^T H^k d + ||w^k + d||_1 − ||w^k||_1, and H^k is either ∇²L(w^k) or its approximation. [sent-210, score-0.782]
15 If we use an approximate Hessian as H k , the analysis of O(l) exp/log operations versus at least O(nl) total operations per iteration still holds. [sent-228, score-0.126]
16 However, because minimizing q_k(d) in (13) becomes easier, exp/log operations may play a more important role in the whole procedure. [sent-229, score-0.158]
17 Earlier work (2008) pointed out that Newton methods conduct exp/log operations less frequently than CD methods, but did not give a detailed analysis and comparison. [sent-234, score-0.09]
18 Each CD cycle is now considered as an inner iteration of GLMNET. [sent-260, score-0.126]
19 q_k(d^{p,j} + z e_j) − q_k(d^{p,j}) = |w_j^k + d_j^p + z| − |w_j^k + d_j^p| + ∇_j q̄_k(d^{p,j}) z + (1/2) ∇²_{jj} q̄_k(d^{p,j}) z², (15) where the definition of d^{p,j} is similar to w^{k,j} of CDN in (5): d^{p,j} ≡ [d_1^p, d_2^p, ...]^T. [sent-262, score-0.439]
20 Further, q̄_k(d) ≡ ∇L(w^k)^T d + (1/2) d^T ∇²L(w^k) d represents the smooth terms of q_k(d) and plays a similar role to L(w) for (1). [sent-269, score-0.085]
21 We have ∇_j q̄_k(d^{p,j}) = ∇_j L(w^k) + (∇²L(w^k) d^{p,j})_j and ∇²_{jj} q̄_k(d^{p,j}) = ∇²_{jj} L(w^k). [sent-270, score-0.085]
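Given g = ∇_j q̄_k(d^{p,j}) and h = ∇²_{jj} q̄_k(d^{p,j}) from these formulas, the one-variable problem (15) has a closed-form soft-thresholded Newton step, the standard update in coordinate descent methods for ℓ1-regularized problems. A short sketch under our own naming:

```python
def solve_one_variable(g, h, c):
    """Minimize  g*z + 0.5*h*z**2 + |c + z|  over z, where c = w_j^k + d_j^p
    is the current coordinate value and h > 0.  The constant -|c| in (15)
    does not affect the minimizer and is dropped."""
    if g + 1.0 <= h * c:
        return -(g + 1.0) / h
    if g - 1.0 >= h * c:
        return -(g - 1.0) / h
    return -c
```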
22 A suitable stopping condition for the inner level is very important. [sent-275, score-0.097]
23 At each CD step, most operations are spent on calculating ∇_j q̄_k(d^{p,j}) and ∇²_{jj} q̄_k(d^{p,j}) in (15); note that ∇²_{jj} q̄_k(d^{p,j}) = ∇²_{jj} L(w^k), ∀j, can be pre-calculated before the CD procedure. [sent-279, score-0.148]
24 For ∇_j q̄_k(d^{p,j}), the first term ∇_j L(w^k) can also be pre-calculated. [sent-281, score-0.106]
25 Note that in GLMNET, exp/log operations are needed in the beginning/end of an outer iteration. [sent-287, score-0.126]
26 We make each CD cycle share the cost in this table even though a CD cycle in GLMNET involves no exp/log operations. [sent-289, score-0.15]
27 If Xd^{p,j} (i.e., x_i^T d^{p,j}, ∀i) is maintained and updated by (Xd^{p,j+1})_i ← (Xd^{p,j})_i + X_{ij} z, ∀i, (17) then calculating ∇_j q̄_k(d) costs O(l) operations. [sent-292, score-0.14]
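Putting (16) and (17) together, one inner CD cycle can be sketched as below, using ∇²L(w^k) = C XᵀDX for logistic regression. All function and variable names are illustrative rather than taken from the reference code:

```python
import numpy as np

def inner_cd_cycle(X, grad_L, D, C, w, d, Xd):
    """One cycle of n CD steps on sub-problem (13), maintaining Xd as in (17).

    X      : (l, n) data matrix
    grad_L : precomputed gradient of L at w^k (length n)
    D      : length-l diagonal of D in (8), fixed during the outer iteration
    C      : regularization parameter
    w, d   : current iterate w^k and current direction d (length n, updated in place)
    Xd     : maintained vector X @ d (length l, updated in place)
    """
    l, n = X.shape
    # Diagonal of the Hessian, ∇²_{jj} L(w^k) = C * sum_i D_i * X_{ij}^2 (pre-computable).
    H_diag = C * (D[:, None] * X**2).sum(axis=0)
    for j in range(n):
        g = grad_L[j] + C * X[:, j].dot(D * Xd)   # ∇_j q̄_k(d^{p,j}), O(l) per step
        h = max(H_diag[j], 1e-12)                 # guard against an all-zero column
        c = w[j] + d[j]
        # Closed-form minimizer of g*z + 0.5*h*z^2 + |c + z| (see the sketch above).
        if g + 1.0 <= h * c:
            z = -(g + 1.0) / h
        elif g - 1.0 >= h * c:
            z = -(g - 1.0) / h
        else:
            z = -c
        d[j] += z
        Xd += z * X[:, j]                          # update (17), O(l)
    return d, Xd
```

One such cycle touches every feature once, which gives the O(nl) cost per cycle stated next.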
28 Therefore, the CD method for (13) requires O(nl) operations for one inner iteration (cycle) of n CD steps. [sent-293, score-0.108]
29 Because both methods use CD, Table 2 checks how many exp/log operations are needed relative to the total number of operations in one CD cycle. [sent-296, score-0.13]
30 f(w^k + λd) − f(w^k) ≤ σλ (∇L(w^k)^T d + γ d^T H^k d + ||w^k + d||_1 − ||w^k||_1), (20) where 0 < β < 1, 0 < σ < 1, and 0 ≤ γ < 1. [sent-329, score-0.126]
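A backtracking sketch of the step-size search under condition (20); σ, β, γ are the constants above, and the particular default values used here are our own illustrative choices, not necessarily those of the paper’s implementation:

```python
def backtracking_step(f, w, d, gTd, dHd, norm1_w, norm1_w_plus_d,
                      sigma=0.01, beta=0.5, gamma=0.0, max_trials=100):
    """Return the largest lambda in {beta**i : i = 0, 1, ...} satisfying (20):
    f(w + lam*d) - f(w) <= sigma*lam*(gTd + gamma*dHd + ||w+d||_1 - ||w||_1).

    f    : callable evaluating the full objective f(w) = ||w||_1 + L(w)
    gTd  : precomputed scalar ∇L(w)^T d
    dHd  : precomputed scalar d^T H d
    """
    f_w = f(w)
    decrease = gTd + gamma * dHd + norm1_w_plus_d - norm1_w
    lam = 1.0
    for _ in range(max_trials):
        if f(w + lam * d) - f_w <= sigma * lam * decrease:
            break
        lam *= beta
    return lam
```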
31 Thus, the cost for finding f(w^k + λd) is reduced to O(n + l), where O(n) comes from calculating ||w^k + λd||_1. [sent-350, score-0.305]
32 In contrast, in newGLMNET, we conduct line search only at the end of an outer iteration. [sent-359, score-0.145]
33 If we distribute the line search cost of newGLMNET to its inner CD cycles, we have that, for the same λ in one CD cycle, CDN costs O(nl) and newGLMNET costs no more than O(n + l). [sent-362, score-0.116]
34 An “outer iteration” corresponds to the process from wk to wk+1 , while the “inner” level consists of CD iterations for solving (13). [sent-389, score-0.29]
35 For an algorithm involving two levels of iterations, the stopping condition of the inner iterations must be carefully designed. [sent-390, score-0.116]
36 A strict inner stopping condition may cause the algorithm to take a prohibitive amount of time at the first outer iteration. [sent-391, score-0.185]
37 Alternatively, a loose inner condition leads to an inaccurate solution of (13) and possibly lengthy outer iterations. [sent-392, score-0.133]
38 GLMNET’s inner stopping condition checks whether the CD updates z_1, ..., z_n satisfy max_j(∇²_{jj} L(w^k) · z_j²) ≤ ε_in, (24) where ε_in is the inner stopping tolerance. [sent-402, score-0.085]
39 For the outer stopping condition, similarly, GLMNET checks if w is still significantly changed. [sent-403, score-0.125]
40 GLMNET stops if the following condition holds: max_j(∇²_{jj} L(w^{k+1}) · (d_j^k)²) ≤ ε_out, (25) where ε_out is the outer stopping tolerance. [sent-405, score-0.137]
41 GLMNET uses the same value for inner and outer tolerances; that is, εin = εout . [sent-406, score-0.108]
42 We find that if users specify a small εout , a huge amount of time may be needed for the first outer iteration. [sent-407, score-0.088]
43 For newGLMNET, we propose an adaptive inner stopping condition. [sent-409, score-0.085]
44 ∑_{j=1}^n |∇_j^S q_k(d^{p,j})| ≤ ε_in, (26) where ∇^S q(d) is the minimum-norm subgradient at d. [sent-416, score-0.106]
45 Note that we do not need to calculate the whole vector ∇^S q_k(d^{p,j}) defined in (27). [sent-419, score-0.106]
46 Instead, ∇_j^S q_k(d^{p,j}) is easily available via ∇_j q̄_k(d^{p,j}) in (16). [sent-420, score-0.212]
47 If, at one outer iteration, the condition (26) holds after only one cycle of n CD steps, then we reduce ε_in by ε_in ← ε_in/4. (28) [sent-421, score-0.156]
48 That is, the program automatically adjusts ε_in if it finds that too few CD steps are conducted for minimizing q_k(d). [sent-422, score-0.116]
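A sketch of the adaptive inner stopping test (26) together with the adjustment (28). The coordinate-wise minimum-norm subgradient below is the standard one for an ℓ1-regularized objective and is assumed here to match (27); all names are illustrative:

```python
import numpy as np

def min_norm_subgrad_j(grad_qbar_j, v_j):
    """Minimum-norm subgradient of the j-th coordinate of q_k at d, where
    v_j = w_j^k + d_j and grad_qbar_j = ∇_j q̄_k(d)."""
    if v_j > 0:
        return grad_qbar_j + 1.0
    if v_j < 0:
        return grad_qbar_j - 1.0
    return np.sign(grad_qbar_j) * max(abs(grad_qbar_j) - 1.0, 0.0)

def check_inner_stop(subgrad_sum, eps_in, cd_cycles_done):
    """Condition (26): stop the inner CD loop once the accumulated sum of
    |∇_j^S q_k| over one cycle falls below eps_in.  If that already happens
    after a single cycle, tighten the tolerance as in (28)."""
    stop = subgrad_sum <= eps_in
    if stop and cd_cycles_done == 1:
        eps_in /= 4.0
    return stop, eps_in
```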
49 We use an outer stopping condition similar to (26). [sent-424, score-0.137]
50 Sub-problem (13) becomes min_d q_k(d) subject to d_j = 0, ∀ j ∈ J. [sent-442, score-0.106]
51 The way to choose J in the above procedure is by checking whether z = 0 is optimal for min_z q_k(d + z e_j) − q_k(d). [sent-448, score-0.238]
52 For newGLMNET, we propose a heuristic shrinking scheme following its two-level structure: the outer level removes some w’s elements so that a smaller sub-problem (13) similar to (30) is solved; the inner level is applied to remove elements in d so that (13) becomes an even smaller sub-problem. [sent-449, score-0.196]
53 For each level, our setting is related to the shrinking implementation of CDN; see Yuan et al. [sent-450, score-0.088]
54 At the beginning of each outer iteration, we remove w_j if w_j^k = 0 and −1 + M^out/l < ∇_j L(w^k) < 1 − M^out/l, (31) where M^out ≡ max(|∇_1^S f(w^{k-1})|, ...). [sent-454, score-0.28]
55 M out in (31) is used to adjust the shrinking scheme from a conservative setting in the beginning to an aggressive setting in the end. [sent-460, score-0.109]
56 Our shrinking implementation differs from GLMNET’s in several aspects. [sent-461, score-0.088]
57 First, by using ∇f(w^k), which is available at the beginning of the kth iteration, we do not conduct the special cycle of n CD steps that GLMNET uses for selecting variables. [sent-462, score-0.119]
58 If w_j^k = 0 and |∇_j L(w^k)| < 1 − M^out/l, then J ← J\{j}. [sent-486, score-0.28]
59 For the inner shrinking scheme of newGLMNET, assume the previous CD cycle contains points d^{p-1,1}, ... [sent-504, score-0.192]
60 An element j is removed if w_j^k + d_j^{p,t} = 0 and −1 + M^in/l < ∇_j q̄_k(d^{p,t}) < 1 − M^in/l, (32) where t is the iteration index of the current cycle and M^in ≡ max(|∇_1^S q_k(d^{p-1,1})|, ...). [sent-512, score-0.584]
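Both rules (31) and (32) are instances of the same interval test: a variable that is currently zero is removed when its smooth-part gradient lies strictly inside an interval shrunken by M/l. A generic sketch (names are ours):

```python
def can_shrink(grad_j, v_j, M, l):
    """Interval test shared by (31) and (32).

    grad_j : ∇_j L(w^k) for the outer level, or ∇_j q̄_k(d^{p,t}) for the inner level
    v_j    : w_j^k (outer) or w_j^k + d_j^{p,t} (inner); must be exactly zero to shrink
    M      : M^out or M^in, the maximum violation recorded in the previous pass
    l      : number of training instances
    """
    return v_j == 0.0 and -1.0 + M / l < grad_j < 1.0 - M / l
```

As M decreases over the iterations, the interval widens toward (−1, 1) and more variables qualify, which is exactly the move from a conservative to an aggressive setting described above.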
61 The overall procedure of newGLMNET with two-level shrinking is shown in Algorithms 3 and 4. [sent-518, score-0.088]
62 Regarding theoretical properties, if the sub-problem (13) is exactly solved, then for any given outer stopping tolerance, newGLMNET terminates in finitely many iterations. [sent-519, score-0.136]
63 Steps 1–4 are for outer-level shrinking and the stopping condition. [sent-524, score-0.139]
64 Algorithm 4 Inner iterations of newGLMNET with shrinking • Given working set J, initial solution d, inner stopping condition ε_in, and a small positive number ν from the outer problem. [sent-539, score-0.269]
65 // inner-level shrinking – If w_j^k + d_j = 0 and |∇_j q̄_k(d)| < 1 − M^in/l, then T ← T\{j}. [sent-549, score-0.192]
66 Else update the running maximum by max(·, |∇_j^S q_k(d)|) and the running sum by m ← m + |∇_j^S q_k(d)|. [sent-550, score-0.212]
67 d_j ← d_j + arg min_z q_k(d + z e_j) − q_k(d). [sent-551, score-0.212]
68 The running maximum is used in (31) for the outer-level shrinking, while the running sum gives ∑_{j=1}^n |∇_j^S f(w)| for the outer stopping condition (29). [sent-557, score-0.152]
69 Because each CD cycle goes through only elements in T , the inner stopping condition (26) is also calculated using T . [sent-564, score-0.167]
70 In (28), we reduce the inner stopping tolerance εin if the stopping condition (26) holds after one CD cycle. [sent-569, score-0.168]
71 The number of outer iterations should be small because of using a more accurate initial point. [sent-661, score-0.084]
72 For GLMNET and newGLMNET, the tolerance means the outer tolerance εout in (25). [sent-678, score-0.114]
73 Figure 2 (panels (g) webspam, (h) gisette): L1-regularized logistic regression, testing accuracy versus training time (log-scaled). [sent-751, score-0.224]
74 The horizontal dotted line in Figure 1 indicates the relative function difference by running CDN using LIBLINEAR’s default stopping tolerance with ε = 0. [sent-753, score-0.13]
75 A better inner stopping condition should be adaptive like ours so that the sub-problem (13) can be solved properly at each outer iteration. [sent-775, score-0.171]
76 In summary, because of the proposed adaptive inner stopping condition, newGLMNET enjoys both the fast approximation of CDN in the early stage and the fast local convergence of GLMNET in the final stage. [sent-783, score-0.115]
77 To check if line search costs much in newGLMNET, we report the average number of line search steps per outer iteration in Table 4. [sent-786, score-0.232]
78 For comparison, the first and second columns of the same table show, respectively, the average number of updated variables and of line search steps per CDN cycle. [sent-788, score-0.141]
79 Note that because a shrinking technique is applied to CDN, the number of updated variables in one cycle is much smaller than the number of features. [sent-789, score-0.167]
80 In Table 4, we further report the average number of line search steps in a CD cycle where this check is successfully applied. [sent-799, score-0.144]
81 We further extend Table 1 to compare the running time of the first CD cycle in Table 5. [sent-806, score-0.095]
82 We may encounter this problem when calculating the function value difference needed by the sufficient decrease condition. [sent-815, score-0.086]
83 First, we investigate the effect of newGLMNET’s two levels of shrinking by presenting results of only inner or outer level. [sent-873, score-0.196]
84 Secondly, we compare the shrinking strategies of GLMNET and newGLMNET. [sent-874, score-0.088]
85 Because these two implementations differ in many places, for a fair comparison, we modify newGLMNET to apply GLMNET’s shrinking strategy. [sent-875, score-0.088]
86 We can clearly see that all shrinking implementations are better than the one without shrinking. [sent-877, score-0.088]
87 Results in Figure 3 show that the outer-level shrinking is more useful than the inner-level shrinking. [sent-878, score-0.088]
88 We suspect that the difference arises because, in the CD procedure for sub-problem (13), the (inner-level) shrinking is done in a sequential manner. [sent-879, score-0.101]
89 Therefore, the outer-level shrinking is a more integrated setting for checking and removing variables. [sent-882, score-0.099]
90 The same explanation may also apply to the result that shrinking is slightly more effective for newGLMNET than CDN; see Yuan et al. [sent-883, score-0.088]
91 Regarding GLMNET’s shrinking strategy, it performs slightly better than the inner-level shrinking of newGLMNET, but is worse than both the outer-level and the two-level settings. [sent-885, score-0.176]
92 “Inner only” (“Outer only”) indicates that only inner-level (outer-level) shrinking is conducted. [sent-889, score-0.088]
93 At the kth outer iteration, newGLMNET solves a quadratic sub-problem min_d q_k(d), (37) where q_k(d) ≡ ||w^k + d||_1 − ||w^k||_1 + q̄_k(d), q̄_k(d) ≡ ∇L(w^k)^T d + (1/2) d^T H^k d, and H^k ≡ ∇²L(w^k) + νI. [sent-902, score-0.106]
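Since ∇²L(w) = C XᵀDX for logistic regression, H^k in (37) never needs to be formed explicitly; only products with vectors (or single columns) are required. A sketch of the damped Hessian-vector product follows; for L2-loss SVM a generalized Hessian with a different diagonal would replace D, and that variant is not shown:

```python
import numpy as np

def damped_hessian_vec(X, D, C, nu, v):
    """Compute H^k v with H^k = C * X^T D X + nu * I, without forming H^k.

    X  : (l, n) data matrix
    D  : length-l diagonal of D
    C  : regularization parameter
    nu : small positive damping constant making H^k positive definite
    v  : length-n vector
    """
    return C * (X.T @ (D * (X @ v))) + nu * v
```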
94 At every outer iteration, after d is obtained by solving sub-problem (37), we need a line search procedure to find the maximal λ ∈ {β^i | i = 0, 1, ...}. [sent-910, score-0.126]
95 The last b_i(w^k + β^t d) is passed to the next outer iteration as b_i(w^{k+1}). [sent-918, score-0.126]
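Taking b_i(w) to be the usual margin quantity 1 − y_i wᵀx_i for the L2-loss SVM (an assumption about the notation here), b_i is affine in w, so every line-search trial can reuse the cached values of x_iᵀd. A sketch under that assumption:

```python
import numpy as np

def l2svm_linesearch_trial(b, y, Xd, lam):
    """Trial values b_i(w + lam*d) = b_i(w) - lam * y_i * (Xd)_i and the
    corresponding sum of squared hinge terms sum(max(0, b_i)^2), both in O(l).

    b  : length-l array of b_i(w^k)
    y  : length-l array of labels in {-1, +1}
    Xd : length-l array of x_i^T d (maintained during the CD procedure)
    """
    b_trial = b - lam * y * Xd
    loss_sum = np.sum(np.maximum(0.0, b_trial) ** 2)
    return b_trial, loss_sum
```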
96 Figure 5 (panels (g) webspam, (h) gisette): L1-regularized L2-loss SVM, testing accuracy versus training time (log-scaled). [sent-1019, score-0.161]
97 One difference is that the trick of using an upper-bound function in CDN is slightly more effective for L2-loss SVM than logistic regression. [sent-1023, score-0.091]
98 Using the property that CD involves simple and cheap updates, we carefully adjust the stopping condition for sub-problems. [sent-1026, score-0.093]
99 In this work, we point out that a state-of-the-art algorithm CDN for L1-regularized logistic regression suffers from frequent exp/log operations. [sent-1042, score-0.084]
100 Linear Convergence of newGLMNET for L1-regularized Logistic Regression: To apply the linear convergence result in Tseng and Yun (2009), we show that L1-regularized logistic regression satisfies the conditions in their Theorem 3 if the loss term L(w) is strictly convex (and therefore w* is unique). [sent-1071, score-0.095]
wordName wordTfidf (topN-words)
[('cdn', 0.558), ('newglmnet', 0.549), ('glmnet', 0.401), ('wk', 0.28), ('cd', 0.157), ('qk', 0.106), ('glmnetpath', 0.098), ('sec', 0.096), ('shrinking', 0.088), ('outer', 0.074), ('uan', 0.071), ('cycle', 0.07), ('epsilon', 0.069), ('mproved', 0.067), ('ogistic', 0.067), ('yun', 0.065), ('nl', 0.064), ('logistic', 0.063), ('tseng', 0.056), ('operations', 0.052), ('yuan', 0.052), ('stopping', 0.051), ('gisette', 0.047), ('webspam', 0.046), ('nnz', 0.045), ('inner', 0.034), ('egression', 0.032), ('search', 0.031), ('di', 0.029), ('dxi', 0.027), ('testing', 0.024), ('newton', 0.024), ('hsieh', 0.024), ('liblinear', 0.023), ('iteration', 0.022), ('tried', 0.021), ('training', 0.021), ('xd', 0.021), ('line', 0.021), ('tolerance', 0.02), ('chang', 0.02), ('conduct', 0.019), ('wt', 0.018), ('regularized', 0.018), ('ew', 0.017), ('friedman', 0.016), ('failed', 0.016), ('svm', 0.016), ('ze', 0.015), ('bi', 0.015), ('hessian', 0.015), ('calculating', 0.015), ('trick', 0.015), ('sequentially', 0.014), ('relative', 0.014), ('time', 0.014), ('dense', 0.014), ('conducts', 0.013), ('genkin', 0.013), ('coordinate', 0.013), ('difference', 0.013), ('lengthy', 0.013), ('termination', 0.013), ('dii', 0.013), ('check', 0.012), ('expensive', 0.012), ('regression', 0.012), ('condition', 0.012), ('csie', 0.011), ('ntu', 0.011), ('loosely', 0.011), ('supplementary', 0.011), ('terminates', 0.011), ('adjust', 0.011), ('descent', 0.011), ('checking', 0.011), ('running', 0.011), ('iterations', 0.01), ('costs', 0.01), ('document', 0.01), ('steps', 0.01), ('kth', 0.01), ('cheap', 0.01), ('cost', 0.01), ('xi', 0.01), ('convergence', 0.01), ('stage', 0.01), ('situation', 0.01), ('beginning', 0.01), ('loss', 0.01), ('sathiya', 0.01), ('taiwan', 0.01), ('tw', 0.01), ('suffers', 0.009), ('fan', 0.009), ('carefully', 0.009), ('updated', 0.009), ('tolerances', 0.009), ('spent', 0.009), ('accuracy', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
2 0.06493222 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
3 0.041117031 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
4 0.021408139 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
5 0.020776516 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
6 0.020493284 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
7 0.019870391 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
8 0.019610513 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
9 0.019547099 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
10 0.018878935 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
11 0.018661993 13 jmlr-2012-Active Learning via Perfect Selective Classification
12 0.018426986 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
13 0.017945332 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
14 0.015436136 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
15 0.015386055 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
16 0.015274842 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
17 0.015057598 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
18 0.014920655 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
19 0.014725212 97 jmlr-2012-Regularization Techniques for Learning with Matrices
20 0.013922336 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
topicId topicWeight
[(0, -0.071), (1, 0.009), (2, 0.016), (3, -0.01), (4, 0.019), (5, 0.012), (6, 0.015), (7, -0.035), (8, -0.003), (9, 0.019), (10, -0.04), (11, 0.097), (12, 0.013), (13, -0.024), (14, 0.057), (15, 0.031), (16, -0.005), (17, -0.041), (18, -0.013), (19, 0.109), (20, -0.089), (21, 0.086), (22, 0.029), (23, 0.109), (24, 0.021), (25, -0.032), (26, 0.101), (27, -0.018), (28, 0.239), (29, -0.21), (30, 0.168), (31, 0.139), (32, -0.035), (33, -0.157), (34, -0.142), (35, 0.091), (36, 0.059), (37, -0.151), (38, 0.222), (39, 0.025), (40, 0.238), (41, 0.14), (42, -0.297), (43, 0.145), (44, -0.017), (45, 0.062), (46, -0.012), (47, 0.069), (48, 0.09), (49, -0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.96630788 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
2 0.65649831 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
3 0.23439392 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1 /l2 -norm and the l1 /l∞ -norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions 1 of these algorithms require O( √ε ) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
4 0.19519179 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the FIRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression
5 0.16622257 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
6 0.15862256 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
7 0.15449272 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
9 0.13075908 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
10 0.13029946 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
11 0.12482213 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
12 0.12372928 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
13 0.12259784 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
14 0.12179626 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
15 0.11643431 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
16 0.11628377 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
17 0.11252331 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
18 0.1113093 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
19 0.10907402 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
20 0.10751653 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
topicId topicWeight
[(21, 0.034), (26, 0.057), (27, 0.01), (29, 0.025), (35, 0.017), (49, 0.012), (56, 0.016), (57, 0.012), (63, 0.356), (69, 0.019), (75, 0.056), (77, 0.093), (92, 0.056), (96, 0.097)]
simIndex simValue paperId paperTitle
1 0.77866906 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
Author: Tamer Salman, Yoram Baram
Abstract: We describe a quantum algorithm for computing the intersection of two sets and its application to associative memory. The algorithm is based on a modification of Grover’s quantum search algorithm (Grover, 1996). We present algorithms for pattern retrieval, pattern completion, and pattern correction. We show that the quantum associative memory can store an exponential number of memories and retrieve them in sub-exponential time. We prove that this model has advantages over known classical associative memories as well as previously proposed quantum models. Keywords: associative memory, pattern completion, pattern correction, quantum computation, quantum search
same-paper 2 0.69787133 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
3 0.43267369 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
4 0.42452645 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation. Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision
5 0.402652 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
6 0.36420655 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
7 0.35877368 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.35868579 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
10 0.35459334 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
11 0.35232747 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
12 0.34980261 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
13 0.34809649 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
14 0.34628367 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
15 0.34598404 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
16 0.34529611 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
17 0.34374809 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
18 0.34334308 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
19 0.34168458 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
20 0.34084731 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO