jmlr jmlr2012 jmlr2012-18 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
Reference: text
sentIndex sentText sentNum sentScore
1 In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. [sent-15, score-0.16]
2 For logistic regression and L2-loss SVM, we have the following loss functions. [sent-33, score-0.085]
3 An example is L1-regularized logistic regression, where exp/log operations are more expensive than other basic operations. [sent-60, score-0.127]
4 While logistic regression is an example of problems with expensive loss functions, to check the situation of cheap loss functions, in Section 7, we extend newGLMNET to solve L2-loss SVM. [sent-77, score-0.139]
5 w^{k,j} ≡ [w_1^{k+1}, ..., w_{j-1}^{k+1}, w_j^k, ..., w_n^k]^T (5) and the indicator vector e_j ≡ [0, ..., 0, 1, 0, ..., 0]^T. [sent-113, score-0.28]
6 Let w^{k,1} = w^k = w^{k-1,n+1} at the beginning of each iteration. [sent-120, score-0.29]
7 where τ(s) is the derivative of the logistic loss function log(1 + e^s): τ(s) = 1/(1 + e^{-s}), D ∈ R^{l×l} is a diagonal matrix with D_ii = τ(y_i w^T x_i)(1 − τ(y_i w^T x_i)), (8) and X ≡ [x_1, ..., x_l]^T. [sent-140, score-0.093]
8 Yuan et al. (2010) store e^{w^T x_i} instead and update the values by e^{w^T x_i} ← e^{w^T x_i} · e^{λ d̄ x_{ij}}, ∀i. (11) [sent-162, score-0.176]
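The point of (11) is that exp values never have to be recomputed from scratch: only one coordinate of w changes per CD step, so the cached e^{w^T x_i} can be refreshed multiplicatively. A minimal numpy sketch of this bookkeeping; the function and variable names here are illustrative, not taken from the LIBLINEAR implementation:

```python
import numpy as np

def update_exp_cache(X, exp_wTx, j, lam, d_bar):
    """Refresh the cached values exp(w^T x_i) after coordinate j of w is
    changed by lam * d_bar, following the multiplicative update (11).

    X       : (l, n) dense data matrix (a sparse column works the same way)
    exp_wTx : length-l array caching exp(w^T x_i)
    j       : index of the updated feature
    lam     : accepted line-search step size
    d_bar   : one-variable Newton step for coordinate j
    """
    # Only instances with X[i, j] != 0 change, so the cost is
    # O(number of nonzeros in column j); exp(0) = 1 leaves the rest untouched.
    exp_wTx *= np.exp(lam * d_bar * X[:, j])
    return exp_wTx
```

With this cache, τ(y_i w^T x_i) and the logistic loss terms can be formed from e^{w^T x_i} (or its reciprocal) without additional exp calls during the CD step.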
9 As shown in Yuan et al. (2010), for the same data set, CDN’s training time for logistic regression is longer than that for L2-loss SVM. [sent-184, score-0.11]
10 Motivated by this observation, we point out that CDN suffers from the expensive exp/log operations of logistic regression. [sent-185, score-0.136]
11 In Table 1, we conduct an experiment on two data sets, epsilon and webspam. [sent-186, score-0.088]
12 We check the proportion of time for exp/log operations in the first CD cycle of CDN. [sent-187, score-0.148]
13 For logistic regression, exp/log operations occur in computing the function, gradient, and Hessian of the loss. [sent-197, score-0.115]
14 At the kth iteration, consider the following sub-problem to find a direction d: min_d q_k(d), (13) where q_k(d) ≡ ∇L(w^k)^T d + (1/2) d^T H^k d + ||w^k + d||_1 − ||w^k||_1, and H^k is either ∇²L(w^k) or its approximation. [sent-210, score-0.782]
15 If we use an approximate Hessian as H k , the analysis of O(l) exp/log operations versus at least O(nl) total operations per iteration still holds. [sent-228, score-0.126]
16 However, because minimizing q_k(d) in (13) becomes easier, exp/log operations may play a more important role in the whole procedure. [sent-229, score-0.158]
17 Earlier work (2008) pointed out that Newton methods conduct exp/log operations less frequently than CD methods, but did not give a detailed analysis and comparison. [sent-234, score-0.09]
18 Each CD cycle is now considered as an inner iteration of GLMNET. [sent-260, score-0.126]
19 q_k(d^{p,j} + z e_j) − q_k(d^{p,j}) = |w_j^k + d_j^p + z| − |w_j^k + d_j^p| + ∇_j q̄_k(d^{p,j}) z + (1/2) ∇²_{jj} q̄_k(d^{p,j}) z², (15) where the definition of d^{p,j} is similar to w^{k,j} of CDN in (5): d^{p,j} ≡ [d_1^p, d_2^p, ...]^T. [sent-262, score-0.439]
20 Further, q̄_k(d) ≡ ∇L(w^k)^T d + (1/2) d^T ∇²L(w^k) d represents the smooth terms of q_k(d) and plays a similar role to L(w) for (1). [sent-269, score-0.085]
21 We have ∇_j q̄_k(d^{p,j}) = ∇_j L(w^k) + (∇²L(w^k) d^{p,j})_j and ∇²_{jj} q̄_k(d^{p,j}) = ∇²_{jj} L(w^k). [sent-270, score-0.085]
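Given g = ∇_j q̄_k(d^{p,j}) and h = ∇²_{jj} q̄_k(d^{p,j}) from these formulas, the one-variable problem (15) has a closed-form soft-thresholded Newton step, the standard update in coordinate descent methods for ℓ1-regularized problems. A short sketch under our own naming:

```python
def solve_one_variable(g, h, c):
    """Minimize  g*z + 0.5*h*z**2 + |c + z|  over z, where c = w_j^k + d_j^p
    is the current coordinate value and h > 0.  The constant -|c| in (15)
    does not affect the minimizer and is dropped."""
    if g + 1.0 <= h * c:
        return -(g + 1.0) / h
    if g - 1.0 >= h * c:
        return -(g - 1.0) / h
    return -c
```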
22 A suitable stopping condition for the inner level is very important. [sent-275, score-0.097]
23 At each CD step, most operations are spent on calculating ∇_j q̄_k(d^{p,j}) and ∇²_{jj} q̄_k(d^{p,j}) in (15); note that ∇²_{jj} q̄_k(d^{p,j}) = ∇²_{jj} L(w^k), ∀j, can be pre-calculated before the CD procedure. [sent-279, score-0.148]
24 For ∇_j q̄_k(d^{p,j}), the first term ∇_j L(w^k) can also be pre-calculated. [sent-281, score-0.106]
25 Note that in GLMNET, exp/log operations are needed in the beginning/end of an outer iteration. [sent-287, score-0.126]
26 We make each CD cycle share the cost in this table even though a CD cycle in GLMNET involves no exp/log operations. [sent-289, score-0.15]
27 If Xd^{p,j} (i.e., x_i^T d^{p,j}, ∀i) is maintained and updated by (Xd^{p,j+1})_i ← (Xd^{p,j})_i + X_{ij} z, ∀i, (17) then calculating ∇_j q̄_k(d) costs O(l) operations. [sent-292, score-0.14]
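Putting (16) and (17) together, one inner CD cycle can be sketched as below, using ∇²L(w^k) = C XᵀDX for logistic regression. All function and variable names are illustrative rather than taken from the reference code:

```python
import numpy as np

def inner_cd_cycle(X, grad_L, D, C, w, d, Xd):
    """One cycle of n CD steps on sub-problem (13), maintaining Xd as in (17).

    X      : (l, n) data matrix
    grad_L : precomputed gradient of L at w^k (length n)
    D      : length-l diagonal of D in (8), fixed during the outer iteration
    C      : regularization parameter
    w, d   : current iterate w^k and current direction d (length n, updated in place)
    Xd     : maintained vector X @ d (length l, updated in place)
    """
    l, n = X.shape
    # Diagonal of the Hessian, ∇²_{jj} L(w^k) = C * sum_i D_i * X_{ij}^2 (pre-computable).
    H_diag = C * (D[:, None] * X**2).sum(axis=0)
    for j in range(n):
        g = grad_L[j] + C * X[:, j].dot(D * Xd)   # ∇_j q̄_k(d^{p,j}), O(l) per step
        h = max(H_diag[j], 1e-12)                 # guard against an all-zero column
        c = w[j] + d[j]
        # Closed-form minimizer of g*z + 0.5*h*z^2 + |c + z| (see the sketch above).
        if g + 1.0 <= h * c:
            z = -(g + 1.0) / h
        elif g - 1.0 >= h * c:
            z = -(g - 1.0) / h
        else:
            z = -c
        d[j] += z
        Xd += z * X[:, j]                          # update (17), O(l)
    return d, Xd
```

One such cycle touches every feature once, which gives the O(nl) cost per cycle stated next.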
28 Therefore, the CD method for (13) requires O(nl) operations for one inner iteration (cycle) of n CD steps. [sent-293, score-0.108]
29 Because both methods use CD, Table 2 checks how many exp/log operations are needed relative to the total number of operations in one CD cycle. [sent-296, score-0.13]
30 f(w^k + λd) − f(w^k) ≤ σλ (∇L(w^k)^T d + γ d^T H^k d + ||w^k + d||_1 − ||w^k||_1), (20) where 0 < β < 1, 0 < σ < 1, and 0 ≤ γ < 1. [sent-329, score-0.126]
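A backtracking sketch of the step-size search under condition (20); σ, β, γ are the constants above, and the particular default values used here are our own illustrative choices, not necessarily those of the paper’s implementation:

```python
def backtracking_step(f, w, d, gTd, dHd, norm1_w, norm1_w_plus_d,
                      sigma=0.01, beta=0.5, gamma=0.0, max_trials=100):
    """Return the largest lambda in {beta**i : i = 0, 1, ...} satisfying (20):
    f(w + lam*d) - f(w) <= sigma*lam*(gTd + gamma*dHd + ||w+d||_1 - ||w||_1).

    f    : callable evaluating the full objective f(w) = ||w||_1 + L(w)
    gTd  : precomputed scalar ∇L(w)^T d
    dHd  : precomputed scalar d^T H d
    """
    f_w = f(w)
    decrease = gTd + gamma * dHd + norm1_w_plus_d - norm1_w
    lam = 1.0
    for _ in range(max_trials):
        if f(w + lam * d) - f_w <= sigma * lam * decrease:
            break
        lam *= beta
    return lam
```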
31 Thus, the cost for finding f(w^k + λd) is reduced to O(n + l), where O(n) comes from calculating ||w^k + λd||_1. [sent-350, score-0.305]
32 In contrast, in newGLMNET, we conduct line search only at the end of an outer iteration. [sent-359, score-0.145]
33 If we distribute the line search cost of newGLMNET to its inner CD cycles, we have that, for the same λ in one CD cycle, CDN costs O(nl) and newGLMNET costs no more than O(n + l). [sent-362, score-0.116]
34 An “outer iteration” corresponds to the process from wk to wk+1 , while the “inner” level consists of CD iterations for solving (13). [sent-389, score-0.29]
35 For an algorithm involving two levels of iterations, the stopping condition of the inner iterations must be carefully designed. [sent-390, score-0.116]
36 A strict inner stopping condition may cause the algorithm to take a prohibitive amount of time at the first outer iteration. [sent-391, score-0.185]
37 Alternatively, a loose inner condition leads to an inaccurate solution of (13) and possibly lengthy outer iterations. [sent-392, score-0.133]
38 GLMNET’s inner stopping condition checks whether the CD updates z_1, ..., z_n satisfy max_j(∇²_{jj} L(w^k) · z_j²) ≤ ε_in, (24) where ε_in is the inner stopping tolerance. [sent-402, score-0.085]
39 For the outer stopping condition, similarly, GLMNET checks if w is still significantly changed. [sent-403, score-0.125]
40 GLMNET stops if the following condition holds: max_j(∇²_{jj} L(w^{k+1}) · (d_j^k)²) ≤ ε_out, (25) where ε_out is the outer stopping tolerance. [sent-405, score-0.137]
41 GLMNET uses the same value for inner and outer tolerances; that is, εin = εout . [sent-406, score-0.108]
42 We find that if users specify a small εout , a huge amount of time may be needed for the first outer iteration. [sent-407, score-0.088]
43 For newGLMNET, we propose an adaptive inner stopping condition. [sent-409, score-0.085]
44 ∑_{j=1}^n |∇_j^S q_k(d^{p,j})| ≤ ε_in, (26) where ∇^S q(d) is the minimum-norm subgradient at d. [sent-416, score-0.106]
45 Note that we do not need to calculate the whole vector ∇^S q_k(d^{p,j}) defined in (27). [sent-419, score-0.106]
46 Instead, ∇_j^S q_k(d^{p,j}) is easily available via ∇_j q̄_k(d^{p,j}) in (16). [sent-420, score-0.212]
47 If, at one outer iteration, the condition (26) holds after only one cycle of n CD steps, then we reduce ε_in by ε_in ← ε_in/4. (28) [sent-421, score-0.156]
48 That is, the program automatically adjusts ε_in if it finds that too few CD steps are conducted for minimizing q_k(d). [sent-422, score-0.116]
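A sketch of the adaptive inner stopping test (26) together with the adjustment (28). The coordinate-wise minimum-norm subgradient below is the standard one for an ℓ1-regularized objective and is assumed here to match (27); all names are illustrative:

```python
import numpy as np

def min_norm_subgrad_j(grad_qbar_j, v_j):
    """Minimum-norm subgradient of the j-th coordinate of q_k at d, where
    v_j = w_j^k + d_j and grad_qbar_j = ∇_j q̄_k(d)."""
    if v_j > 0:
        return grad_qbar_j + 1.0
    if v_j < 0:
        return grad_qbar_j - 1.0
    return np.sign(grad_qbar_j) * max(abs(grad_qbar_j) - 1.0, 0.0)

def check_inner_stop(subgrad_sum, eps_in, cd_cycles_done):
    """Condition (26): stop the inner CD loop once the accumulated sum of
    |∇_j^S q_k| over one cycle falls below eps_in.  If that already happens
    after a single cycle, tighten the tolerance as in (28)."""
    stop = subgrad_sum <= eps_in
    if stop and cd_cycles_done == 1:
        eps_in /= 4.0
    return stop, eps_in
```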
49 We use an outer stopping condition similar to (26). [sent-424, score-0.137]
50 Sub-problem (13) becomes min_d q_k(d) subject to d_j = 0, ∀ j ∈ J. [sent-442, score-0.106]
51 The way to choose J in the above procedure is by checking whether z = 0 is optimal for min_z q_k(d + z e_j) − q_k(d). [sent-448, score-0.238]
52 For newGLMNET, we propose a heuristic shrinking scheme following its two-level structure: the outer level removes some w’s elements so that a smaller sub-problem (13) similar to (30) is solved; the inner level is applied to remove elements in d so that (13) becomes an even smaller sub-problem. [sent-449, score-0.196]
53 For each level, our setting is related to the shrinking implementation of CDN; see Yuan et al. [sent-450, score-0.088]
54 At the beginning of each outer iteration, we remove w_j if w_j^k = 0 and −1 + M^out/l < ∇_j L(w^k) < 1 − M^out/l, (31) where M^out ≡ max(|∇_1^S f(w^{k-1})|, ...). [sent-454, score-0.28]
55 M out in (31) is used to adjust the shrinking scheme from a conservative setting in the beginning to an aggressive setting in the end. [sent-460, score-0.109]
56 Our shrinking implementation differs from GLMNET’s in several aspects. [sent-461, score-0.088]
57 First, by using ∇f(w^k), which is available at the beginning of the kth iteration, we do not conduct the special cycle of n CD steps that GLMNET uses for selecting variables. [sent-462, score-0.119]
58 If w_j^k = 0 and |∇_j L(w^k)| < 1 − M^out/l, then J ← J\{j}. [sent-486, score-0.28]
59 For the inner shrinking scheme of newGLMNET, assume the previous CD cycle contains points d^{p-1,1}, ... [sent-504, score-0.192]
60 An element j is removed if w_j^k + d_j^{p,t} = 0 and −1 + M^in/l < ∇_j q̄_k(d^{p,t}) < 1 − M^in/l, (32) where t is the iteration index of the current cycle and M^in ≡ max(|∇_1^S q_k(d^{p-1,1})|, ...). [sent-512, score-0.584]
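Both rules (31) and (32) are instances of the same interval test: a variable that is currently zero is removed when its smooth-part gradient lies strictly inside an interval shrunken by M/l. A generic sketch (names are ours):

```python
def can_shrink(grad_j, v_j, M, l):
    """Interval test shared by (31) and (32).

    grad_j : ∇_j L(w^k) for the outer level, or ∇_j q̄_k(d^{p,t}) for the inner level
    v_j    : w_j^k (outer) or w_j^k + d_j^{p,t} (inner); must be exactly zero to shrink
    M      : M^out or M^in, the maximum violation recorded in the previous pass
    l      : number of training instances
    """
    return v_j == 0.0 and -1.0 + M / l < grad_j < 1.0 - M / l
```

As M decreases over the iterations, the interval widens toward (−1, 1) and more variables qualify, which is exactly the move from a conservative to an aggressive setting described above.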
61 The overall procedure of newGLMNET with two-level shrinking is shown in Algorithms 3 and 4. [sent-518, score-0.088]
62 Regarding theoretical properties, if the sub-problem (13) is exactly solved, then for any given outer stopping tolerance, newGLMNET terminates in finitely many iterations. [sent-519, score-0.136]
63 Steps 1–4 are for outer-level shrinking and the stopping condition. [sent-524, score-0.139]
64 Algorithm 4 Inner iterations of newGLMNET with shrinking • Given working set J, initial solution d, inner stopping condition ε_in, and a small positive number ν from the outer problem. [sent-539, score-0.269]
65 // inner-level shrinking – If w_j^k + d_j = 0 and |∇_j q̄_k(d)| < 1 − M^in/l, then T ← T\{j}. [sent-549, score-0.192]
66 Else update the running maximum by max(·, |∇_j^S q_k(d)|) and the running sum by m ← m + |∇_j^S q_k(d)|. [sent-550, score-0.212]
67 d_j ← d_j + arg min_z q_k(d + z e_j) − q_k(d). [sent-551, score-0.212]
68 The running maximum is used in (31) for the outer-level shrinking, while the running sum gives ∑_{j=1}^n |∇_j^S f(w)| for the outer stopping condition (29). [sent-557, score-0.152]
69 Because each CD cycle goes through only elements in T , the inner stopping condition (26) is also calculated using T . [sent-564, score-0.167]
70 In (28), we reduce the inner stopping tolerance εin if the stopping condition (26) holds after one CD cycle. [sent-569, score-0.168]
71 The number of outer iterations should be small because of using a more accurate initial point. [sent-661, score-0.084]
72 For GLMNET and newGLMNET, the tolerance means the outer tolerance εout in (25). [sent-678, score-0.114]
73 Figure 2 (panels (g) webspam, (h) gisette): L1-regularized logistic regression, testing accuracy versus training time (log-scaled). [sent-751, score-0.224]
74 The horizontal dotted line in Figure 1 indicates the relative function difference by running CDN using LIBLINEAR’s default stopping tolerance with ε = 0. [sent-753, score-0.13]
75 A better inner stopping condition should be adaptive like ours so that the sub-problem (13) can be solved properly at each outer iteration. [sent-775, score-0.171]
76 In summary, because of the proposed adaptive inner stopping condition, newGLMNET enjoys both the fast approximation of CDN in the early stage and the fast local convergence of GLMNET in the final stage. [sent-783, score-0.115]
77 To check if line search costs much in newGLMNET, we report the average number of line search steps per outer iteration in Table 4. [sent-786, score-0.232]
78 For comparison, the first and second columns of the same table show, respectively, the average number of updated variables and of line search steps per CDN cycle. [sent-788, score-0.141]
79 Note that because a shrinking technique is applied to CDN, the number of updated variables in one cycle is much smaller than the number of features. [sent-789, score-0.167]
80 In Table 4, we further report the average number of line search steps in a CD cycle where this check is successfully applied. [sent-799, score-0.144]
81 We further extend Table 1 to compare the running time of the first CD cycle in Table 5. [sent-806, score-0.095]
82 We may encounter this problem when calculating the function value difference needed by the sufficient decrease condition. [sent-815, score-0.086]
83 First, we investigate the effect of newGLMNET’s two levels of shrinking by presenting results of only inner or outer level. [sent-873, score-0.196]
84 Secondly, we compare the shrinking strategies of GLMNET and newGLMNET. [sent-874, score-0.088]
85 Because these two implementations differ in many places, for a fair comparison, we modify newGLMNET to apply GLMNET’s shrinking strategy. [sent-875, score-0.088]
86 We can clearly see that all shrinking implementations are better than the one without shrinking. [sent-877, score-0.088]
87 Results in Figure 3 show that the outer-level shrinking is more useful than the inner-level shrinking. [sent-878, score-0.088]
88 We suspect that the difference arises because, in the CD procedure for sub-problem (13), the (inner-level) shrinking is done in a sequential manner. [sent-879, score-0.101]
89 Therefore, the outer-level shrinking is a more integrated setting for checking and removing variables. [sent-882, score-0.099]
90 The same explanation may also apply to the result that shrinking is slightly more effective for newGLMNET than CDN; see Yuan et al. [sent-883, score-0.088]
91 Regarding GLMNET’s shrinking strategy, it performs slightly better than the inner-level shrinking of newGLMNET, but is worse than both the outer-level and the two-level settings. [sent-885, score-0.176]
92 “Inner only” (“Outer only”) indicates that only inner-level (outer-level) shrinking is conducted. [sent-889, score-0.088]
93 At the kth outer iteration, newGLMNET solves a quadratic sub-problem min_d q_k(d), (37) where q_k(d) ≡ ||w^k + d||_1 − ||w^k||_1 + q̄_k(d), q̄_k(d) ≡ ∇L(w^k)^T d + (1/2) d^T H^k d, and H^k ≡ ∇²L(w^k) + νI. [sent-902, score-0.106]
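Since ∇²L(w) = C XᵀDX for logistic regression, H^k in (37) never needs to be formed explicitly; only products with vectors (or single columns) are required. A sketch of the damped Hessian-vector product follows; for L2-loss SVM a generalized Hessian with a different diagonal would replace D, and that variant is not shown:

```python
import numpy as np

def damped_hessian_vec(X, D, C, nu, v):
    """Compute H^k v with H^k = C * X^T D X + nu * I, without forming H^k.

    X  : (l, n) data matrix
    D  : length-l diagonal of D
    C  : regularization parameter
    nu : small positive damping constant making H^k positive definite
    v  : length-n vector
    """
    return C * (X.T @ (D * (X @ v))) + nu * v
```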
94 At every outer iteration, after d is obtained by solving sub-problem (37), we need a line search procedure to find the maximal λ ∈ {β^i | i = 0, 1, ...}. [sent-910, score-0.126]
95 The last b_i(w^k + β^t d) is passed to the next outer iteration as b_i(w^{k+1}). [sent-918, score-0.126]
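Taking b_i(w) to be the usual margin quantity 1 − y_i wᵀx_i for the L2-loss SVM (an assumption about the notation here), b_i is affine in w, so every line-search trial can reuse the cached values of x_iᵀd. A sketch under that assumption:

```python
import numpy as np

def l2svm_linesearch_trial(b, y, Xd, lam):
    """Trial values b_i(w + lam*d) = b_i(w) - lam * y_i * (Xd)_i and the
    corresponding sum of squared hinge terms sum(max(0, b_i)^2), both in O(l).

    b  : length-l array of b_i(w^k)
    y  : length-l array of labels in {-1, +1}
    Xd : length-l array of x_i^T d (maintained during the CD procedure)
    """
    b_trial = b - lam * y * Xd
    loss_sum = np.sum(np.maximum(0.0, b_trial) ** 2)
    return b_trial, loss_sum
```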
96 Figure 5 (panels (g) webspam, (h) gisette): L1-regularized L2-loss SVM, testing accuracy versus training time (log-scaled). [sent-1019, score-0.161]
97 One difference is that the trick of using an upper-bound function in CDN is slightly more effective for L2-loss SVM than logistic regression. [sent-1023, score-0.091]
98 Using the property that CD involves simple and cheap updates, we carefully adjust the stopping condition for sub-problems. [sent-1026, score-0.093]
99 In this work, we point out that a state-of-the-art algorithm CDN for L1-regularized logistic regression suffers from frequent exp/log operations. [sent-1042, score-0.084]
100 Linear Convergence of newGLMNET for L1-regularized Logistic Regression: To apply the linear convergence result in Tseng and Yun (2009), we show that L1-regularized logistic regression satisfies the conditions in their Theorem 3 if the loss term L(w) is strictly convex (and therefore w* is unique). [sent-1071, score-0.095]
wordName wordTfidf (topN-words)
[('cdn', 0.558), ('newglmnet', 0.549), ('glmnet', 0.401), ('wk', 0.28), ('cd', 0.157), ('qk', 0.106), ('glmnetpath', 0.098), ('sec', 0.096), ('shrinking', 0.088), ('outer', 0.074), ('uan', 0.071), ('cycle', 0.07), ('epsilon', 0.069), ('mproved', 0.067), ('ogistic', 0.067), ('yun', 0.065), ('nl', 0.064), ('logistic', 0.063), ('tseng', 0.056), ('operations', 0.052), ('yuan', 0.052), ('stopping', 0.051), ('gisette', 0.047), ('webspam', 0.046), ('nnz', 0.045), ('inner', 0.034), ('egression', 0.032), ('search', 0.031), ('di', 0.029), ('dxi', 0.027), ('testing', 0.024), ('newton', 0.024), ('hsieh', 0.024), ('liblinear', 0.023), ('iteration', 0.022), ('tried', 0.021), ('training', 0.021), ('xd', 0.021), ('line', 0.021), ('tolerance', 0.02), ('chang', 0.02), ('conduct', 0.019), ('wt', 0.018), ('regularized', 0.018), ('ew', 0.017), ('friedman', 0.016), ('failed', 0.016), ('svm', 0.016), ('ze', 0.015), ('bi', 0.015), ('hessian', 0.015), ('calculating', 0.015), ('trick', 0.015), ('sequentially', 0.014), ('relative', 0.014), ('time', 0.014), ('dense', 0.014), ('conducts', 0.013), ('genkin', 0.013), ('coordinate', 0.013), ('difference', 0.013), ('lengthy', 0.013), ('termination', 0.013), ('dii', 0.013), ('check', 0.012), ('expensive', 0.012), ('regression', 0.012), ('condition', 0.012), ('csie', 0.011), ('ntu', 0.011), ('loosely', 0.011), ('supplementary', 0.011), ('terminates', 0.011), ('adjust', 0.011), ('descent', 0.011), ('checking', 0.011), ('running', 0.011), ('iterations', 0.01), ('costs', 0.01), ('document', 0.01), ('steps', 0.01), ('kth', 0.01), ('cheap', 0.01), ('cost', 0.01), ('xi', 0.01), ('convergence', 0.01), ('stage', 0.01), ('situation', 0.01), ('beginning', 0.01), ('loss', 0.01), ('sathiya', 0.01), ('taiwan', 0.01), ('tw', 0.01), ('suffers', 0.009), ('fan', 0.009), ('carefully', 0.009), ('updated', 0.009), ('tolerances', 0.009), ('spent', 0.009), ('accuracy', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
2 0.06493222 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
3 0.041117031 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
4 0.021408139 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
5 0.020776516 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
Author: Franz J. Király, Paul von Bünau, Frank C. Meinecke, Duncan A.J. Blythe, Klaus-Robert Müller
Abstract: We propose a novel algebraic algorithmic framework for dealing with probability distributions represented by their cumulants such as the mean and covariance matrix. As an example, we consider the unsupervised learning problem of finding the subspace on which several probability distributions agree. Instead of minimizing an objective function involving the estimated cumulants, we show that by treating the cumulants as elements of the polynomial ring we can directly solve the problem, at a lower computational cost and with higher accuracy. Moreover, the algebraic viewpoint on probability distributions allows us to invoke the theory of algebraic geometry, which we demonstrate in a compact proof for an identifiability criterion. Keywords: computational algebraic geometry, approximate algebra, unsupervised Learning
6 0.020493284 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
7 0.019870391 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
8 0.019610513 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
9 0.019547099 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
10 0.018878935 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
11 0.018661993 13 jmlr-2012-Active Learning via Perfect Selective Classification
12 0.018426986 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
13 0.017945332 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
14 0.015436136 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
15 0.015386055 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
16 0.015274842 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
17 0.015057598 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
18 0.014920655 74 jmlr-2012-Multi Kernel Learning with Online-Batch Optimization
19 0.014725212 97 jmlr-2012-Regularization Techniques for Learning with Matrices
20 0.013922336 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
topicId topicWeight
[(0, -0.071), (1, 0.009), (2, 0.016), (3, -0.01), (4, 0.019), (5, 0.012), (6, 0.015), (7, -0.035), (8, -0.003), (9, 0.019), (10, -0.04), (11, 0.097), (12, 0.013), (13, -0.024), (14, 0.057), (15, 0.031), (16, -0.005), (17, -0.041), (18, -0.013), (19, 0.109), (20, -0.089), (21, 0.086), (22, 0.029), (23, 0.109), (24, 0.021), (25, -0.032), (26, 0.101), (27, -0.018), (28, 0.239), (29, -0.21), (30, 0.168), (31, 0.139), (32, -0.035), (33, -0.157), (34, -0.142), (35, 0.091), (36, 0.059), (37, -0.151), (38, 0.222), (39, 0.025), (40, 0.238), (41, 0.14), (42, -0.297), (43, 0.145), (44, -0.017), (45, 0.062), (46, -0.012), (47, 0.069), (48, 0.09), (49, -0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.96630788 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
2 0.65649831 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
3 0.23439392 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1 /l2 -norm and the l1 /l∞ -norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions 1 of these algorithms require O( √ε ) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
4 0.19519179 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
Author: Timo Aho, Bernard Ženko, Sašo Džeroski, Tapio Elomaa
Abstract: Methods for learning decision rules are being successfully applied to many problem domains, in particular when understanding and interpretation of the learned model is necessary. In many real life problems, we would like to predict multiple related (nominal or numeric) target attributes simultaneously. While several methods for learning rules that predict multiple targets at once exist, they are all based on the covering algorithm, which does not work well for regression problems. A better solution for regression is the rule ensemble approach that transcribes an ensemble of decision trees into a large collection of rules. An optimization procedure is then used to select the best (and much smaller) subset of these rules and to determine their respective weights. We introduce the FIRE algorithm for solving multi-target regression problems, which employs the rule ensembles approach. We improve the accuracy of the algorithm by adding simple linear functions to the ensemble. We also extensively evaluate the algorithm with and without linear functions. The results show that the accuracy of multi-target regression rule ensembles is high. They are more accurate than, for instance, multi-target regression trees, but not quite as accurate as multi-target random forests. The rule ensembles are significantly more concise than random forests, and it is also possible to create compact rule sets that are smaller than a single regression tree but still comparable in accuracy. Keywords: multi-target prediction, rule learning, rule ensembles, regression
5 0.16622257 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
6 0.15862256 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
7 0.15449272 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
9 0.13075908 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
10 0.13029946 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
11 0.12482213 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
12 0.12372928 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
13 0.12259784 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
14 0.12179626 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox
15 0.11643431 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
16 0.11628377 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
17 0.11252331 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
18 0.1113093 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
19 0.10907402 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
20 0.10751653 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
topicId topicWeight
[(21, 0.034), (26, 0.057), (27, 0.01), (29, 0.025), (35, 0.017), (49, 0.012), (56, 0.016), (57, 0.012), (63, 0.356), (69, 0.019), (75, 0.056), (77, 0.093), (92, 0.056), (96, 0.097)]
simIndex simValue paperId paperTitle
1 0.77866906 93 jmlr-2012-Quantum Set Intersection and its Application to Associative Memory
Author: Tamer Salman, Yoram Baram
Abstract: We describe a quantum algorithm for computing the intersection of two sets and its application to associative memory. The algorithm is based on a modification of Grover’s quantum search algorithm (Grover, 1996). We present algorithms for pattern retrieval, pattern completion, and pattern correction. We show that the quantum associative memory can store an exponential number of memories and retrieve them in sub-exponential time. We prove that this model has advantages over known classical associative memories as well as previously proposed quantum models. Keywords: associative memory, pattern completion, pattern correction, quantum computation, quantum search
same-paper 2 0.69787133 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
Author: Guo-Xun Yuan, Chia-Hua Ho, Chih-Jen Lin
Abstract: Recently, Yuan et al. (2010) conducted a comprehensive comparison on software for L1-regularized classification. They concluded that a carefully designed coordinate descent implementation CDN is the fastest among state-of-the-art solvers. In this paper, we point out that CDN is less competitive on loss functions that are expensive to compute. In particular, CDN for logistic regression is much slower than CDN for SVM because the logistic loss involves expensive exp/log operations. In optimization, Newton methods are known to have fewer iterations although each iteration costs more. Because solving the Newton sub-problem is independent of the loss calculation, this type of methods may surpass CDN under some circumstances. In L1-regularized classification, GLMNET by Friedman et al. is already a Newton-type method, but experiments in Yuan et al. (2010) indicated that the existing GLMNET implementation may face difficulties for some largescale problems. In this paper, we propose an improved GLMNET to address some theoretical and implementation issues. In particular, as a Newton-type method, GLMNET achieves fast local convergence, but may fail to quickly obtain a useful solution. By a careful design to adjust the effort for each iteration, our method is efficient for both loosely or strictly solving the optimization problem. Experiments demonstrate that our improved GLMNET is more efficient than CDN for L1-regularized logistic regression. Keywords: L1 regularization, linear classification, optimization methods, logistic regression, support vector machines
3 0.43267369 54 jmlr-2012-Large-scale Linear Support Vector Regression
Author: Chia-Hua Ho, Chih-Jen Lin
Abstract: Support vector regression (SVR) and support vector classification (SVC) are popular learning techniques, but their use with kernels is often time consuming. Recently, linear SVC without kernels has been shown to give competitive accuracy for some applications, but enjoys much faster training/testing. However, few studies have focused on linear SVR. In this paper, we extend state-of-theart training methods for linear SVC to linear SVR. We show that the extension is straightforward for some methods, but is not trivial for some others. Our experiments demonstrate that for some problems, the proposed linear-SVR training methods can very efficiently produce models that are as good as kernel SVR. Keywords: support vector regression, Newton methods, coordinate descent methods
4 0.42452645 22 jmlr-2012-Bounding the Probability of Error for High Precision Optical Character Recognition
Author: Gary B. Huang, Andrew Kae, Carl Doersch, Erik Learned-Miller
Abstract: We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system’s translation. Keywords: optical character recognition, probability bounding, document-specific modeling, computer vision
5 0.402652 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
Author: Christopher R. Genovese, Jiashun Jin, Larry Wasserman, Zhigang Yao
Abstract: The lasso is an important method for sparse, high-dimensional regression problems, with efficient algorithms available, a long history of practical success, and a large body of theoretical results supporting and explaining its performance. But even with the best available algorithms, finding the lasso solutions remains a computationally challenging task in cases where the number of covariates vastly exceeds the number of data points. Marginal regression, where each dependent variable is regressed separately on each covariate, offers a promising alternative in this case because the estimates can be computed roughly two orders faster than the lasso solutions. The question that remains is how the statistical performance of the method compares to that of the lasso in these cases. In this paper, we study the relative statistical performance of the lasso and marginal regression for sparse, high-dimensional regression problems. We consider the problem of learning which coefficients are non-zero. Our main results are as follows: (i) we compare the conditions under which the lasso and marginal regression guarantee exact recovery in the fixed design, noise free case; (ii) we establish conditions under which marginal regression provides exact recovery with high probability in the fixed design, noise free, random coefficients case; and (iii) we derive rates of convergence for both procedures, where performance is measured by the number of coefficients with incorrect sign, and characterize the regions in the parameter space recovery is and is not possible under this metric. In light of the computational advantages of marginal regression in very high dimensional problems, our theoretical and simulations results suggest that the procedure merits further study. Keywords: high-dimensional regression, lasso, phase diagram, regularization
6 0.36420655 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
7 0.35877368 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
8 0.35868579 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
10 0.35459334 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
11 0.35232747 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
12 0.34980261 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
13 0.34809649 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
14 0.34628367 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
15 0.34598404 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
16 0.34529611 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
17 0.34374809 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
18 0.34334308 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
19 0.34168458 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
20 0.34084731 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO