jmlr jmlr2013 jmlr2013-51 knowledge-graph by maker-knowledge-mining

51 jmlr-2013-Greedy Sparsity-Constrained Optimization


Source: pdf

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. [sent-9, score-0.129]

2 Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. [sent-10, score-0.152]

3 Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. [sent-11, score-0.103]

4 We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. [sent-12, score-0.151]

5 As a special case, logistic regression with ℓ1 and elastic net regularization is studied by Bunea (2008). [sent-30, score-0.227]

6 (2010) proposed a number of greedy algorithms that sparsify a given estimate at the cost of a relatively small increase of the objective function. [sent-51, score-0.098]

7 For instance, the formulation does not apply to objectives such as the logistic loss. [sent-56, score-0.102]

8 (2011) imposes conditions on the function as restricted to sparse inputs whose nonzeros are fewer than a multiple of the target sparsity level. [sent-61, score-0.168]

9 The analysis and the guarantees for smooth and non-smooth cost functions are similar, except for less stringent conditions derived for smooth cost functions due to properties of symmetric Hessian matrices. [sent-71, score-0.13]

10 We also prove that the SRH holds for the case of the ℓ2 -penalized logistic loss function. [sent-72, score-0.135]

11 As an example where our algorithm can be applied, ℓ2 -regularized logistic regression is studied in Section 4. [sent-80, score-0.102]

12 Some experimental results for logistic regression with sparsity constraints are presented in Section 5. [sent-81, score-0.132]

13 In standard CS problems the aim is to estimate a sparse vector x⋆ from noisy linear measurements y = Ax⋆ + e, where A is a known n × p measurement matrix with n ≪ p and e is the additive measurement noise. [sent-87, score-0.113]
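
The measurement model just described is easy to set up numerically. A minimal numpy sketch follows; the problem sizes, sparsity level, and noise scale are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n measurements, p variables, s nonzeros, n << p.
n, p, s = 50, 200, 5

# s-sparse ground truth x_star.
x_star = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
x_star[support] = rng.standard_normal(s)

# Known n x p measurement matrix A and additive measurement noise e.
A = rng.standard_normal((n, p)) / np.sqrt(n)
e = 0.01 * rng.standard_normal(n)

# Noisy linear measurements y = A x_star + e.
y = A @ x_star + e
print(y.shape)   # (50,)
```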

14 a vector that equals v except for coordinates in I^c where it is zero; v_r : the best r-term approximation of vector v; supp(v) : the support set (i. [sent-107, score-0.125]

15 For example, in statistics and machine learning the logistic loss function is also commonly used in regression and classification problems (see Liu et al. [sent-137, score-0.135]

16 The majority of these studies consider the cost function to be convex everywhere and rely on the ℓ1 regularization as the means to induce sparsity in the solution. [sent-141, score-0.137]

17 To establish this error bound they introduced the notion of Restricted Strong Convexity (RSC), which basically requires a lower bound on the curvature of the cost function around the true parameter in a restricted set of directions. [sent-148, score-0.204]

18 For instance, the error bound for ℓ1 -regularized logistic regression is recognized by Bunea (2008) to be dependent on the true parameter (Bunea, 2008, Assumption A, Theorem 2. [sent-153, score-0.102]

19 This reasoning applies in many cases including the studies mentioned above, where it would be impossible to achieve the desired restricted strong convexity properties—with high probability—if the true parameter is allowed to be unbounded. [sent-168, score-0.108]

20 These two properties bound the curvature of the function from below and above in a restricted set of directions around the true optimum. [sent-172, score-0.129]

21 For quadratic cost functions, such as squared error, these curvature bounds are absolute constants. [sent-173, score-0.112]

22 For smooth cost functions we introduce the notion of a Stable Restricted Hessian (SRH) and for non-smooth cost functions we introduce the Stable Restricted Linearization (SRL). [sent-193, score-0.108]

23 Both of these properties basically bound the Bregman divergence of the cost function restricted to sparse canonical subspaces. [sent-194, score-0.231]

24 1 Algorithm Description. Algorithm 1: The GraSP algorithm. Input: f(·) and s. Output: x̂. Initialize: x̂ = 0. Repeat: compute local gradient z = ∇f(x̂); identify directions Z = supp(z_2s); merge supports T = Z ∪ supp(x̂); minimize over support b = arg min_x f(x) s.t. [sent-197, score-0.268]

25 x|_T^c = 0; prune estimate x̂ = b_s; until the halting condition holds. GraSP is an iterative algorithm, summarized in Algorithm 1, that maintains and updates an estimate x̂ of the sparse optimum at every iteration. [sent-199, score-0.127]
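
To make the loop above concrete, here is a minimal Python/numpy sketch of the GraSP iteration. The exact arg-min over the merged support T is approximated by a few restricted gradient-descent steps (one of the relaxations discussed below); the cost function, step size, and iteration counts are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def grasp(grad_f, s, p, n_iter=20, inner_steps=50, step=0.5):
    """Sketch of Gradient Support Pursuit (Algorithm 1).

    grad_f : callable returning the gradient of the cost f at a point
    s      : target sparsity level
    p      : ambient dimension
    The exact minimization over the merged support T is replaced here by
    restricted gradient descent, which is an assumption of this sketch.
    """
    x = np.zeros(p)
    for _ in range(n_iter):
        z = grad_f(x)                               # compute local gradient
        Z = np.argsort(np.abs(z))[-2 * s:]          # supp(z_2s): 2s largest entries
        T = np.union1d(Z, np.flatnonzero(x))        # merge supports
        b = x.copy()                                # b is already zero outside T
        for _ in range(inner_steps):                # approximate min of f over T
            b[T] -= step * grad_f(b)[T]
        x = np.zeros(p)                             # prune: keep s largest entries
        keep = np.argsort(np.abs(b))[-s:]
        x[keep] = b[keep]
    return x

# Usage on the quadratic cost f(x) = 0.5 * ||A x - y||^2 (synthetic data).
rng = np.random.default_rng(0)
n, p, s = 60, 200, 5
A = rng.standard_normal((n, p)) / np.sqrt(n)
x_true = np.zeros(p); x_true[rng.choice(p, s, replace=False)] = 1.0
y = A @ x_true
x_hat = grasp(lambda x: A.T @ (A @ x - y), s, p)
print(sorted(np.flatnonzero(x_hat)))
```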

26 For nonsmooth functions, instead of the gradient we use a restricted subgradient z = ∇ f (x) defined in Section 3. [sent-201, score-0.175]

27 Their indices, denoted by Z = supp (z2s ), are then merged with the support of the current estimate to obtain T = Z ∪ supp (x). [sent-204, score-0.25]

28 Specifically, the gradient step reduces to the proxy step z = A^T (y − A x̂), and minimization over the restricted support reduces to the constrained pseudoinverse step b|_T = A_T^† y, b|_T^c = 0 in CoSaMP. [sent-211, score-0.115]
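
A sketch of this specialization, assuming the squared-error cost f(x) = 0.5‖Ax − y‖²: the proxy is formed from the residual and the restricted minimization becomes a least-squares (pseudoinverse) solve on the merged support. The problem sizes and the halting rule (a fixed iteration count) are assumptions of the sketch.

```python
import numpy as np

def cosamp_like(A, y, s, n_iter=20):
    """GraSP specialized to f(x) = 0.5 * ||A x - y||^2 (a CoSaMP-style loop)."""
    n, p = A.shape
    x = np.zeros(p)
    for _ in range(n_iter):
        z = A.T @ (y - A @ x)                     # proxy step z = A^T (y - A x)
        Z = np.argsort(np.abs(z))[-2 * s:]
        T = np.union1d(Z, np.flatnonzero(x))
        b = np.zeros(p)
        b[T] = np.linalg.lstsq(A[:, T], y, rcond=None)[0]   # b|_T = A_T^+ y
        x = np.zeros(p)                           # prune to the s largest entries
        keep = np.argsort(np.abs(b))[-s:]
        x[keep] = b[keep]
    return x

# Example: recover a 5-sparse vector from 60 noiseless measurements.
rng = np.random.default_rng(1)
n, p, s = 60, 200, 5
A = rng.standard_normal((n, p)) / np.sqrt(n)
x_true = np.zeros(p); x_true[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
x_hat = cosamp_like(A, A @ x_true, s)
print(np.allclose(x_hat, x_true, atol=1e-6))
```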

29 Debiasing: In this variant, instead of performing a hard thresholding on the vector b, the objective is minimized restricted to the support set of b_s to obtain the new iterate: x̂ = arg min_x f(x) s.t. x|_{supp(b_s)^c} = 0. [sent-215, score-0.203]
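
A minimal sketch of this debiasing variant, again approximating the restricted minimization by gradient steps on the fixed support; the step size and step count are assumptions of the sketch.

```python
import numpy as np

def debias(grad_f, b, s, steps=50, step=0.5):
    """Re-minimize f over supp(b_s) instead of hard-thresholding b directly."""
    keep = np.argsort(np.abs(b))[-s:]      # support of b_s
    x = np.zeros_like(b)
    x[keep] = b[keep]                      # warm start at the pruned vector b_s
    for _ in range(steps):
        x[keep] -= step * grad_f(x)[keep]  # descend only on the fixed support
    return x
```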

30 Restricted Newton Step: To reduce the computations in each iteration, in the minimization that yields b, we can set b|_T^c = 0 and take a restricted Newton step as b|_T = x|_T − κ ( P_T^T H_f(x) P_T )^{-1} x|_T , where κ > 0 is a step-size. [sent-219, score-0.107]

31 Restricted Gradient Descent: The minimization step can be relaxed even further by applying a restricted gradient descent. [sent-222, score-0.115]

32 These properties, which are analogous to the RIP in the standard CS framework, basically require that the curvature of the cost function over the sparse subspaces can be bounded locally from above and below such that the corresponding bounds have the same order. [sent-229, score-0.164]

33 Furthermore, let A_k(x) = sup { ∆^T H_f(x) ∆ : |supp(x) ∪ supp(∆)| ≤ k, ‖∆‖_2 = 1 } (5) and B_k(x) = inf { ∆^T H_f(x) ∆ : |supp(x) ∪ supp(∆)| ≤ k, ‖∆‖_2 = 1 } (6) for all k-sparse vectors x. [sent-233, score-0.25]
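
For a fixed index set J with supp(x) ⊆ J and |J| = k, the sup and inf in (5)-(6) over unit-norm ∆ supported on J are the extreme eigenvalues of the principal submatrix of the Hessian indexed by J, so A_k and B_k can be evaluated by enumerating supports. The brute-force sketch below, only feasible for tiny p, illustrates this for the quadratic cost, whose Hessian A^T A is constant; the sizes are assumptions.

```python
import numpy as np
from itertools import combinations

def restricted_curvature_bounds(H, x, k):
    """A_k(x) and B_k(x) of (5)-(6) for a given Hessian matrix H, by exhaustively
    enumerating index sets J of size k that contain supp(x)."""
    p = H.shape[0]
    S = set(np.flatnonzero(x).tolist())
    others = [j for j in range(p) if j not in S]
    A_k, B_k = -np.inf, np.inf
    for extra in combinations(others, k - len(S)):
        J = sorted(S | set(extra))
        eig = np.linalg.eigvalsh(H[np.ix_(J, J)])   # extreme eigenvalues on J
        A_k, B_k = max(A_k, eig[-1]), min(B_k, eig[0])
    return A_k, B_k

# Quadratic cost 0.5 * ||A x - y||^2 has the constant Hessian A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8)) / np.sqrt(20)
x = np.zeros(8); x[0] = 1.0
print(restricted_curvature_bounds(A.T @ A, x, k=3))
```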

34 Note that the functions that satisfy the SRH are convex over canonical sparse subspaces, but they are not necessarily convex everywhere. [sent-241, score-0.117]

35 To generalize the notion of SRH to the case of nonsmooth functions, first we define the restricted subgradient of a function. [sent-265, score-0.149]

36 We say vector ∇f(x) is a restricted subgradient of f : R^p → R at point x if f(x + ∆) − f(x) ≥ ⟨∇f(x), ∆⟩ holds for all k-sparse vectors ∆. [sent-267, score-0.13]

37 We introduced the notion of restricted subgradient so that the restrictions imposed on f are as minimal as we need. [sent-269, score-0.13]

38 We acknowledge that the existence of restricted subgradients implies convexity in sparse directions, but it does not imply convexity everywhere. [sent-270, score-0.176]

39 Obviously, if the function f is convex everywhere, then any subgradient of f determines a restricted subgradient of f as well. [sent-272, score-0.205]

40 With a slight abuse of terminology we call B_f(x′ ∥ x) = f(x′) − f(x) − ⟨∇f(x), x′ − x⟩ the restricted Bregman divergence of f : R^p → R between points x and x′, where ∇f(·) gives a restricted subgradient of f(·). [sent-276, score-0.237]

41 For a function f : R^p → R we define the functions α_k(x) = sup { (1/‖∆‖_2^2) B_f(x + ∆ ∥ x) : ∆ ≠ 0 and |supp(x) ∪ supp(∆)| ≤ k } and β_k(x) = inf { (1/‖∆‖_2^2) B_f(x + ∆ ∥ x) : ∆ ≠ 0 and |supp(x) ∪ supp(∆)| ≤ k }. [sent-279, score-0.25]
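
The quantity inside the sup and inf is easy to evaluate numerically. The sketch below computes (1/‖∆‖_2^2) B_f(x + ∆ ∥ x) for the quadratic cost, where the restricted Bregman divergence equals 0.5‖A∆‖_2^2 exactly and the ordinary gradient serves as the restricted subgradient (f is convex, as noted above); the data are random assumptions.

```python
import numpy as np

def bregman_ratio(f, grad_f, x, delta):
    """(1 / ||delta||_2^2) * B_f(x + delta || x), using the ordinary gradient as
    the restricted subgradient (valid here because f is convex)."""
    B = f(x + delta) - f(x) - grad_f(x) @ delta
    return B / (delta @ delta)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 12))
y = rng.standard_normal(30)
f = lambda x: 0.5 * np.sum((A @ x - y) ** 2)
grad = lambda x: A.T @ (A @ x - y)

x = rng.standard_normal(12)
d = np.zeros(12); d[[2, 5]] = rng.standard_normal(2)       # a 2-sparse direction
# The two printed numbers agree, since B_f(x + d || x) = 0.5 * ||A d||^2 for this f.
print(bregman_ratio(f, grad, x, d), 0.5 * np.sum((A @ d) ** 2) / (d @ d))
```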

42 , 2011; Zhang, 2011) in the sense that they all bound the curvature of the objective function over a restricted set. [sent-285, score-0.147]

43 We will also show this for the logistic loss as an example in Section 4. [sent-318, score-0.135]

44 Should the SRH or SRL conditions hold for the objective function, it is straightforward to convert the point accuracy guarantees of Theorems 1 and 2 into accuracy guarantees in terms of the objective value. [sent-326, score-0.116]

45 Example: Sparse Minimization of ℓ2 -regularized Logistic Regression. One of the models widely used in machine learning and statistics is the logistic model. [sent-331, score-0.102]

46 To find the maximum likelihood estimate one should maximize this likelihood function, or equivalently minimize the negative log-likelihood, the logistic loss, g(x) = (1/n) ∑_{i=1}^{n} [ log(1 + exp(⟨a_i, x⟩)) − y_i ⟨a_i, x⟩ ]. [sent-334, score-0.201]
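
A small numpy sketch of this loss, computing log(1 + exp(t)) stably via logaddexp; the data below are random placeholders, not values from the paper.

```python
import numpy as np

def logistic_loss(x, A, y):
    """g(x) = (1/n) * sum_i [ log(1 + exp(<a_i, x>)) - y_i * <a_i, x> ]."""
    t = A @ x
    return np.mean(np.logaddexp(0.0, t) - y * t)

rng = np.random.default_rng(0)
n, p = 100, 20
A = rng.standard_normal((n, p))
x_true = np.zeros(p); x_true[:3] = 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-A @ x_true))).astype(float)
print(logistic_loss(np.zeros(p), A, y), logistic_loss(x_true, A, y))
```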

47 In these scenarios the logistic loss is merely convex and it does not have a unique minimum. [sent-344, score-0.191]

48 To compensate for these drawbacks the logistic loss is usually regularized by some penalty term (Hastie et al. [sent-347, score-0.154]

49 Verifying SRH for ℓ2 -regularized Logistic Loss. It is easy to show that the Hessian of the logistic loss at any point x is given by H_g(x) = (1/(4n)) A^T Λ A, where Λ is an n × n diagonal matrix whose diagonal entries are Λ_ii = sech^2(⟨a_i, x⟩/2), with sech(·) denoting the hyperbolic secant function. [sent-358, score-0.23]
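
A sketch of this Hessian, together with a numerical check that (1/4) sech^2(t/2) equals the familiar sigmoid form σ(t)(1 − σ(t)); the data are random assumptions.

```python
import numpy as np

def logistic_hessian(x, A):
    """H_g(x) = (1/(4n)) * A^T Lambda A with Lambda_ii = sech^2(<a_i, x>/2)."""
    n = A.shape[0]
    lam = 1.0 / np.cosh(A @ x / 2.0) ** 2          # sech^2(<a_i, x>/2)
    return (A.T * lam) @ A / (4.0 * n)             # A^T diag(lam) A / (4n)

# Check against the sigmoid form (1/n) * A^T diag(sigma * (1 - sigma)) A.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 6))
x = rng.standard_normal(6)
sig = 1.0 / (1.0 + np.exp(-A @ x))
H_alt = (A.T * (sig * (1.0 - sig))) @ A / A.shape[0]
print(np.allclose(logistic_hessian(x, A), H_alt))   # True
```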

50 Therefore, if H_η(x) denotes the Hessian of the ℓ2 -regularized logistic loss, we have, for all x and ∆, η ‖∆‖_2^2 ≤ ∆^T H_η(x) ∆ ≤ (1/(4n)) ‖A∆‖_2^2 + η ‖∆‖_2^2. [sent-360, score-0.102]

51 As stated before, in a standard logistic model the data samples {a_i} are assumed to be independent instances of a random vector a. [sent-368, score-0.125]

52 With the above assumptions, if n ≥ (R/(θ h(τ))) ( log(p choose k) + k(1 + log k) − log ε ) for some τ > 0 and ε ∈ (0, 1), then with probability at least 1 − ε the ℓ2 -regularized logistic loss has µ_k-SRH with µ_k ≤ 1 + (1+τ)θ/(4η). [sent-374, score-0.135]

53 Proof: For any set of k indices J, let M_J^i = a_i|_J (a_i|_J)^T = P_J^T a_i a_i^T P_J. [sent-375, score-0.114]

54 The independence of the vectors a_i implies that the matrix A_J^T A_J = ∑_{i=1}^{n} a_i|_J (a_i|_J)^T = ∑_{i=1}^{n} M_J^i is a sum of n independent, random, self-adjoint matrices. [sent-376, score-0.114]

55 Hence, for any fixed index set J with |J| = k we may apply Theorem 3 for M_i = M_J^i, θ_max = n θ_J^max, and τ > 0 to obtain Pr[ λ_max( ∑_{i=1}^{n} M_J^i ) ≥ (1 + τ) n θ_J^max ] ≤ k exp( − n θ_J^max h(τ) / R ). [sent-379, score-0.135]

56 Furthermore, we can write Pr[ λ_max( A_J^T A_J ) ≥ (1 + τ) n θ ] = Pr[ λ_max( ∑_{i=1}^{n} M_J^i ) ≥ (1 + τ) n θ ] ≤ Pr[ λ_max( ∑_{i=1}^{n} M_J^i ) ≥ (1 + τ) n θ_J^max ] ≤ k exp( − n θ_J^max h(τ) / R ) ≤ k exp( − n θ h(τ) / R ). [sent-380, score-0.102]

57 Thus, the ℓ2 -regularized logistic loss has an SRH constant µ_k ≤ 1 + (1+τ)θ/(4η) with probability 1 − ε. [sent-384, score-0.135]

58 The analysis provided here is not specific to the ℓ2 -regularized logistic loss and can be readily extended to any other ℓ2 -regularized GLM loss whose log-partition function has a Lipschitz-continuous derivative. [sent-394, score-0.168]

59 In the case of the ℓ2 -regularized logistic loss considered in this section we have ∇f(x) = (1/n) ∑_{i=1}^{n} ( 1/(1 + exp(−⟨a_i, x⟩)) − y_i ) a_i + η x. [sent-397, score-0.173]

60 Denoting 1/(1 + exp(−⟨a_i, x⋆⟩)) − y_i by v_i for i = 1, 2, . [sent-398, score-0.099]
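
A sketch of this gradient with a finite-difference sanity check; the 1/n factor follows the loss definition quoted earlier (whether the paper keeps it in this displayed gradient is an assumption here), and the data and η below are illustrative.

```python
import numpy as np

def reg_logistic_grad(x, A, y, eta):
    """(1/n) * sum_i (1/(1 + exp(-<a_i, x>)) - y_i) * a_i + eta * x."""
    n = A.shape[0]
    v = 1.0 / (1.0 + np.exp(-(A @ x))) - y
    return A.T @ v / n + eta * x

# Finite-difference check of the gradient (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 8)); y = rng.integers(0, 2, 40).astype(float)
eta, x = 0.1, rng.standard_normal(8)
f = lambda x: np.mean(np.logaddexp(0.0, A @ x) - y * (A @ x)) + 0.5 * eta * x @ x
g_num = np.array([(f(x + 1e-6 * np.eye(8)[i]) - f(x - 1e-6 * np.eye(8)[i])) / 2e-6
                  for i in range(8)])
print(np.allclose(reg_logistic_grad(x, A, y, eta), g_num, atol=1e-5))   # True
```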

61 Nevertheless, we evaluated the performance of GraSP for variable selection in the logistic model both on synthetic and real data. [sent-418, score-0.102]

62 To analyze the effect of an additional ℓ2 regularization we also evaluated the performance of GraSP with ℓ2 -regularized logistic loss, as well as the logistic regression with elastic net (i. [sent-433, score-0.329]

63 For the elastic net penalty (1 − ω) ‖x‖_2^2 / 2 + ω ‖x‖_1 we considered the “mixing parameter” ω to be 0. [sent-437, score-0.125]

64 For the ℓ2 -regularized logistic loss we considered η = (1 − ω) log p / n. [sent-439, score-0.135]

65 Figure 1 compares the average value of the empirical logistic loss achieved by each of the considered algorithms for a wide range of “sampling ratio” n/p. [sent-442, score-0.135]

66 For GraSP, the curves labelled by GraSP and GraSP + ℓ2 correspond to the cases where the algorithm is applied to the unregularized and ℓ2 -regularized logistic loss, respectively. [sent-443, score-0.102]

67 Furthermore, the results of GLMnet for the LASSO and the elastic net regularization are labelled by GLMnet (ℓ1 ) and GLMnet (elastic net), respectively. [sent-444, score-0.125]

68 To contrast the obtained results we also provided the average of empirical logistic loss evaluated at the true parameter and one standard deviation above and below this average on the plots. [sent-446, score-0.135]

69 Furthermore, the difference between the loss obtained by the LASSO and the loss at the true parameter never drops below a certain threshold, although the convex method exhibits a more stable behaviour at low sampling ratios. [sent-457, score-0.159]

70 Interestingly, GraSP becomes more stable at low sampling ratios when the logistic loss is regularized with the ℓ2 -norm. [sent-458, score-0.229]

71 However, this stability comes at the cost of a bias in the loss value at high sampling ratios that is particularly pronounced in Figure 1d. [sent-459, score-0.144]

72 Nevertheless, for all of the tested values of ρ, at low sampling ratios GraSP+ℓ2 and at high sampling ratios GraSP are consistently closer to the true loss value compared to the other methods. [sent-460, score-0.147]

73 Figure 1: Comparison of the average (empirical) logistic loss at solutions obtained via GraSP, GraSP with ℓ2 -penalty, LASSO, the elastic-net regularization, and Logit-OMP. [sent-497, score-0.135]

74 Applying the debiasing procedure has improved the performance of both GraSP methods except at very low sampling ratios. [sent-511, score-0.117]

75 The logistic loss values at the obtained estimates are reported in Tables 2 and 3. [sent-536, score-0.135]

76 For each data set we applied the sparse logistic regression for a range of sparsity levels s. [sent-537, score-0.181]

77 The results for DEXTER data set show that GraSP variants without debiasing and the convex methods achieve comparable loss values in most cases, whereas the convex methods show significantly better performance on the ARCENE data set. [sent-542, score-0.196]

78 Table 3: DEXTER. Although Logit-OMP has the best performance in some cases, the smallest loss values in both data sets are attained by the GraSP methods with the debiasing step. [sent-683, score-0.128]

79 We provide theoretical convergence guarantees based on the notions of a Stable Restricted Hessian (SRH) for smooth cost functions and a Stable Restricted Linearization (SRL) for non-smooth cost functions, both of which are introduced in this paper. [sent-689, score-0.13]

80 Our algorithm generalizes the well-established sparse recovery algorithm CoSaMP, which applies only to linear models with squared error loss. [sent-690, score-0.116]

81 The SRH and SRL also generalize the well-known Restricted Isometry Property for 826 G REEDY S PARSITY-C ONSTRAINED O PTIMIZATION sparse recovery to the case of cost functions other than the squared error. [sent-691, score-0.148]

82 To provide a concrete example we studied the requirements of GraSP for ℓ2 -regularized logistic loss. [sent-692, score-0.102]

83 Therefore, Jensen’s inequality yields λ_min( ∫_0^1 M(t) dt ) ≥ ∫_0^1 λ_min(M(t)) dt ≥ ∫_0^1 B(t) dt and λ_max( ∫_0^1 M(t) dt ) ≤ ∫_0^1 λ_max(M(t)) dt ≤ ∫_0^1 A(t) dt, which imply the desired result. [sent-708, score-0.204]
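
A quick numerical illustration of these eigenvalue inequalities, with the integral over t replaced by an average of randomly generated symmetric matrices (an assumption purely for the demo): the smallest eigenvalue of the average dominates the average of the smallest eigenvalues, and symmetrically for the largest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize the integral over t in [0, 1] by an average of T symmetric matrices.
T, d = 50, 6
mats = []
for _ in range(T):
    B = rng.standard_normal((d, d))
    mats.append((B + B.T) / 2.0)                  # symmetric M(t)

avg = sum(mats) / T                               # stand-in for the integral of M(t) dt
lmin_avg = np.linalg.eigvalsh(avg)[0]
lmax_avg = np.linalg.eigvalsh(avg)[-1]
mean_lmin = np.mean([np.linalg.eigvalsh(M)[0] for M in mats])
mean_lmax = np.mean([np.linalg.eigvalsh(M)[-1] for M in mats])

print(lmin_avg >= mean_lmin, lmax_avg <= mean_lmax)   # True True
```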

84 If Γ is a subset of row/column indices of M(·), then for any vector v we have ‖( ∫_0^1 P_Γ^T M(t) P_{Γ^c} dt ) v‖_2 ≤ ( ∫_0^1 (A(t) − B(t))/2 dt ) ‖v‖_2. [sent-711, score-0.186]

85 Finally, it follows from the convexity of the operator norm, Jensen’s inequality, and (11) that ‖ ∫_0^1 M(t) dt ‖ ≤ ∫_0^1 ‖M(t)‖ dt ≤ ∫_0^1 (A(t) − B(t))/2 dt. [sent-717, score-0.205]

86 To simplify notation we introduce the functions α_k(p, q) = ∫_0^1 A_k(tq + (1 − t)p) dt, β_k(p, q) = ∫_0^1 B_k(tq + (1 − t)p) dt, and γ_k(p, q) = α_k(p, q) − β_k(p, q), where A_k(·) and B_k(·) are defined by (5) and (6), respectively. [sent-718, score-0.186]

87 The current estimate x then satisfies ‖(x − x⋆)|_{Z^c}‖_2 ≤ [ (γ_4s(x, x⋆) + γ_2s(x, x⋆)) / (2 β_2s(x, x⋆)) ] ‖x − x⋆‖_2 + [ ‖∇f(x⋆)|_{R\Z}‖_2 + ‖∇f(x⋆)|_{Z\R}‖_2 ] / β_2s(x, x⋆). Proof: Since Z = supp(z_2s) and |R| ≤ 2s we have ‖z|_R‖_2 ≤ ‖z|_Z‖_2 and hence ‖z|_{R\Z}‖_2 ≤ ‖z|_{Z\R}‖_2. [sent-721, score-0.125]

88 Since R = supp(x − x⋆), we have ‖(x − x⋆)|_{R\Z}‖_2 = ‖(x − x⋆)|_{Z^c}‖_2. [sent-729, score-0.125]

89 Therefore, ∇f(x⋆)|_T = ( ∫_0^1 P_T^T H_f(t x⋆ + (1 − t) b) dt ) (x⋆ − b) = ( ∫_0^1 P_T^T H_f(t x⋆ + (1 − t) b) P_T dt ) (x⋆ − b)|_T + ( ∫_0^1 P_T^T H_f(t x⋆ + (1 − t) b) P_{T^c} dt ) (x⋆ − b)|_{T^c}. (16) [sent-737, score-0.279]

90 Since f has µ_4s-SRH and |T ∪ supp(t x⋆ + (1 − t) b)| ≤ 4s for all t ∈ [0, 1], functions A_4s(·) and B_4s(·), defined using (5) and (6), exist such that B_4s(t x⋆ + (1 − t) b) ≤ λ_min( P_T^T H_f(t x⋆ + (1 − t) b) P_T ) and A_4s(t x⋆ + (1 − t) b) ≥ λ_max( P_T^T H_f(t x⋆ + (1 − t) b) P_T ). [sent-738, score-0.125]

91 Thus, from Proposition 1 we obtain β_4s(b, x⋆) ≤ λ_min( ∫_0^1 P_T^T H_f(t x⋆ + (1 − t) b) P_T dt ) and α_4s(b, x⋆) ≥ λ_max( ∫_0^1 P_T^T H_f(t x⋆ + (1 − t) b) P_T dt ). [sent-739, score-0.186]

92 With S⋆ = supp(x⋆), using the triangle inequality, (17), and Proposition 2 we then obtain ‖x⋆|_T − b‖_2 = ‖(x⋆ − b)|_T‖_2 ≤ ‖W^{-1} ( ∫_0^1 P_T^T H_f(t x⋆ + (1 − t) b) P_{T^c ∩ S⋆} dt ) x⋆|_{T^c ∩ S⋆}‖_2 + ‖W^{-1} ∇f(x⋆)|_T‖_2 ≤ ‖∇f(x⋆)|_T‖_2 / β_4s(b, x⋆) + [ γ_4s(b, x⋆) / (2 β_4s(b, x⋆)) ] ‖x⋆|_{T^c}‖_2, as desired. [sent-742, score-0.218]

93 Furthermore, for any function that satisfies µ_k-SRH we can write α_k(p, q)/β_k(p, q) = ( ∫_0^1 A_k(tq + (1 − t)p) dt ) / ( ∫_0^1 B_k(tq + (1 − t)p) dt ) ≤ ( ∫_0^1 µ_k B_k(tq + (1 − t)p) dt ) / ( ∫_0^1 B_k(tq + (1 − t)p) dt ) = µ_k, and thereby γ_k(p, q)/β_k(p, q) ≤ µ_k − 1. [sent-755, score-0.372]

94 Propositions 3, 4, and 5 establish some basic inequalities regarding the restricted Bregman divergence under the SRL assumption. [sent-763, score-0.107]

95 Note In Propositions 3, 4, and 5 we assume x1 and x2 are two vectors in Rn such that |supp (x1 ) ∪ supp (x2 )| ≤ r. [sent-766, score-0.125]

96 Furthermore, we use the shorthand ∆ = x1 − x2 and denote supp (∆) by R . [sent-767, score-0.125]

97 Then we have ‖(x − x⋆)|_{Z^c}‖_2 ≤ [ (γ_2s(x, x⋆) + γ_4s(x, x⋆)) / β_2s(x, x⋆) ] ‖x − x⋆‖_2 + [ ‖∇f(x⋆)|_{R\Z}‖_2 + ‖∇f(x⋆)|_{Z\R}‖_2 ] / β_2s(x, x⋆). Proof: Given that Z = supp(z_2s) and |R| ≤ 2s we have ‖z|_R‖_2 ≤ ‖z|_Z‖_2 and hence ‖z|_{R\Z}‖_2 ≤ ‖z|_{Z\R}‖_2. [sent-803, score-0.125]

98 Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. [sent-886, score-0.102]

99 The restricted isometry property and its implications for compressed sensing. [sent-891, score-0.149]

100 Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants. [sent-947, score-0.136]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('grasp', 0.731), ('srh', 0.241), ('glmnet', 0.215), ('bahmani', 0.189), ('oufounos', 0.146), ('pt', 0.144), ('debias', 0.138), ('srl', 0.138), ('aj', 0.126), ('supp', 0.125), ('onstrained', 0.125), ('ptimization', 0.113), ('reedy', 0.113), ('logistic', 0.102), ('debiasing', 0.095), ('dt', 0.093), ('restricted', 0.089), ('rsc', 0.086), ('tx', 0.084), ('bs', 0.078), ('negahban', 0.073), ('elastic', 0.063), ('hessian', 0.061), ('cost', 0.054), ('sparse', 0.049), ('mj', 0.049), ('pursuit', 0.048), ('net', 0.043), ('kakade', 0.043), ('linearization', 0.043), ('rip', 0.043), ('subgradient', 0.041), ('curvature', 0.04), ('bregman', 0.04), ('compressed', 0.04), ('rss', 0.04), ('ai', 0.038), ('blumensath', 0.038), ('cand', 0.038), ('stable', 0.037), ('jalali', 0.037), ('pr', 0.037), ('tq', 0.037), ('bunea', 0.037), ('cs', 0.037), ('bk', 0.035), ('ratios', 0.035), ('cosamp', 0.034), ('convex', 0.034), ('loss', 0.033), ('tropp', 0.031), ('sparsity', 0.03), ('ak', 0.028), ('max', 0.028), ('ax', 0.028), ('recovery', 0.027), ('agarwal', 0.027), ('greedy', 0.026), ('gradient', 0.026), ('measurements', 0.026), ('arcene', 0.026), ('bhiksha', 0.026), ('glm', 0.026), ('needell', 0.026), ('sensing', 0.025), ('underdetermined', 0.024), ('remark', 0.024), ('furthermore', 0.024), ('exp', 0.023), ('pj', 0.023), ('iteration', 0.023), ('entries', 0.023), ('lasso', 0.023), ('guarantees', 0.022), ('cpj', 0.022), ('sampling', 0.022), ('merely', 0.022), ('basically', 0.021), ('isometry', 0.02), ('mi', 0.02), ('dexter', 0.02), ('qu', 0.02), ('propositions', 0.019), ('measurement', 0.019), ('convexity', 0.019), ('penalty', 0.019), ('nonsmooth', 0.019), ('regularization', 0.019), ('accuracy', 0.018), ('divergence', 0.018), ('restriction', 0.018), ('arg', 0.018), ('objective', 0.018), ('squared', 0.018), ('yields', 0.018), ('aat', 0.017), ('dobson', 0.017), ('hg', 0.017), ('iht', 0.017), ('sech', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

2 0.053521845 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

3 0.050421976 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

4 0.041683171 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

Author: Julien Mairal, Bin Yu

Abstract: We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called “path coding” penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs. Keywords: convex and non-convex optimization, network flow optimization, graph sparsity

5 0.041057754 90 jmlr-2013-Quasi-Newton Method: A New Direction

Author: Philipp Hennig, Martin Kiefel

Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes

6 0.039327078 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

7 0.037188787 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

8 0.036247484 114 jmlr-2013-The Rate of Convergence of AdaBoost

9 0.034835435 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

10 0.034603182 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

11 0.034271467 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

12 0.032293864 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

13 0.032286 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

14 0.031630274 76 jmlr-2013-Nonparametric Sparsity and Regularization

15 0.031575631 106 jmlr-2013-Stationary-Sparse Causality Network Learning

16 0.031467732 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

17 0.03052731 86 jmlr-2013-Parallel Vector Field Embedding

18 0.029141808 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

19 0.029083755 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

20 0.028728113 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.144), (1, 0.03), (2, 0.034), (3, 0.028), (4, 0.026), (5, -0.03), (6, 0.032), (7, -0.045), (8, 0.093), (9, -0.004), (10, 0.027), (11, -0.012), (12, -0.076), (13, 0.048), (14, 0.052), (15, -0.028), (16, -0.066), (17, -0.023), (18, -0.069), (19, -0.007), (20, 0.07), (21, -0.064), (22, -0.068), (23, 0.087), (24, 0.01), (25, -0.027), (26, -0.099), (27, 0.109), (28, 0.02), (29, 0.05), (30, 0.036), (31, 0.039), (32, 0.064), (33, -0.022), (34, -0.142), (35, 0.064), (36, -0.175), (37, -0.138), (38, 0.141), (39, 0.026), (40, 0.114), (41, -0.116), (42, -0.102), (43, -0.073), (44, -0.0), (45, 0.093), (46, -0.134), (47, 0.192), (48, -0.328), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88694572 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

2 0.40137261 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

3 0.36015007 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

4 0.35690117 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

5 0.3493675 106 jmlr-2013-Stationary-Sparse Causality Network Learning

Author: Yuejia He, Yiyuan She, Dapeng Wu

Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap

6 0.3291077 90 jmlr-2013-Quasi-Newton Method: A New Direction

7 0.29080999 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

8 0.27690664 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

9 0.2576721 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

10 0.24560912 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

11 0.24352583 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

12 0.24215031 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

13 0.22891712 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model

14 0.21292299 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion

15 0.21232888 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

16 0.21178403 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

17 0.20870422 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

18 0.20332436 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

19 0.20130242 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

20 0.19177885 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.034), (5, 0.114), (6, 0.046), (10, 0.064), (14, 0.018), (20, 0.429), (23, 0.03), (68, 0.027), (70, 0.029), (75, 0.035), (85, 0.022), (87, 0.027), (89, 0.017), (93, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98928225 89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules

Author: Ansaf Salleb-Aouissi, Christel Vrain, Cyril Nortet, Xiangrong Kong, Vivek Rathod, Daniel Cassard

Abstract: In this paper, we propose Q UANT M INER, a mining quantitative association rules system. This system is based on a genetic algorithm that dynamically discovers “good” intervals in association rules by optimizing both the support and the confidence. The experiments on real and artificial databases have shown the usefulness of Q UANT M INER as an interactive, exploratory data mining tool. Keywords: association rules, numerical and categorical attributes, unsupervised discretization, genetic algorithm, simulated annealing

2 0.8836413 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju

Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models

same-paper 3 0.72996289 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

4 0.44382331 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction

Author: Cynthia Rudin, Benjamin Letham, David Madigan

Abstract: We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer’s online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis. Keywords: statistical learning theory, algorithmic stability, association rules, sequence prediction, associative classification c 2013 Cynthia Rudin, Benjamin Letham and David Madigan. RUDIN , L E

5 0.44343254 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng

Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)

6 0.43986765 42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

7 0.41756675 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

8 0.41304594 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

9 0.39813331 82 jmlr-2013-Optimally Fuzzy Temporal Memory

10 0.39453477 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

11 0.39233789 95 jmlr-2013-Ranking Forests

12 0.39171228 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

13 0.39080292 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

14 0.38960397 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

15 0.38885468 59 jmlr-2013-Large-scale SVD and Manifold Learning

16 0.38881651 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

17 0.38781595 22 jmlr-2013-Classifying With Confidence From Incomplete Information

18 0.38705975 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

19 0.38572636 32 jmlr-2013-Differential Privacy for Functions and Functional Data

20 0.38509098 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning