jmlr jmlr2013 jmlr2013-76 knowledge-graph by maker-knowledge-mining

76 jmlr-2013-Nonparametric Sparsity and Regularization


Source: pdf

Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri

Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. [sent-13, score-0.136]

2 Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. [sent-14, score-0.254]

3 The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. [sent-15, score-0.136]

4 In particular, the idea behind so-called sparsity is that the quantity of interest depends only on a few relevant variables (dimensions). [sent-25, score-0.153]

5 Only a few works, which we discuss in detail in Section 2, have considered notions of sparsity beyond additive models. [sent-48, score-0.135]

6 This observation suggests a way to define a new notion of nonparametric sparsity and a corresponding regularizer which favors functions where most partial derivatives are essentially zero. [sent-50, score-0.204]
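One way to make this concrete, as a hedged sketch (the normalization and the exact form of the penalty in the paper's Equation (6) are assumptions here, not taken verbatim from the text above), is to penalize the empirical norms of the partial derivatives:

```latex
% Sketch of a derivative-based sparsity penalty; the 1/n normalization is an assumption.
\Omega_1^D(f) \;=\; \sum_{a=1}^{d} \big\| \widehat{D}_a f \big\|_n
\;=\; \sum_{a=1}^{d} \sqrt{\frac{1}{n}\sum_{i=1}^{n}
      \left(\frac{\partial f}{\partial x^a}(x_i)\right)^{2}}
```

A variable $x^a$ is then regarded as irrelevant when the empirical norm of $\partial f/\partial x^a$ is essentially zero.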

7 The theory of reproducing kernel Hilbert spaces (RKHSs) provides us with tools to answer both questions. [sent-55, score-0.147]

8 In fact, partial derivatives in a RKHS are bounded linear functionals and hence have a suitable representation that allows efficient computations. [sent-56, score-0.122]
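As an illustration of how such representations translate into computations, the sketch below builds the kernel matrix and the matrices of kernel partial derivatives for a Gaussian kernel; the function name and the choice of the Gaussian kernel are assumptions made for this example, not the paper's code.

```python
import numpy as np

def gaussian_kernel_and_derivatives(X, gamma):
    """Return the Gaussian kernel matrix K and the derivative arrays
    Z[a, i, j] = d/ds^a k(s, x_j) evaluated at s = x_i (cf. Equation (19))."""
    n, d = X.shape
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq_dists / (2.0 * gamma ** 2))
    diffs = X[None, :, :] - X[:, None, :]      # diffs[i, j, a] = x_j^a - x_i^a
    Z = np.transpose(K[:, :, None] * diffs / gamma ** 2, (2, 0, 1))
    return K, Z
```

With these arrays, evaluating all partial derivatives of a kernel expansion at the training points reduces to matrix-vector products.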

9 First, we propose a new notion of sparsity and discuss a corresponding regularization scheme using concepts from the theory of reproducing kernel Hilbert spaces. [sent-59, score-0.283]

10 Linear and Additive Models: the sparsity requirement can be made precise considering linear functions $f(x) = \sum_{a=1}^{d} \beta_a x^a$ with $x = (x^1, \ldots, x^d)$. [sent-80, score-0.169]
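For comparison, the classical way to enforce this requirement in the linear case is $\ell_1$ (lasso) regularization; the scaling of the penalty below is a convention chosen for the example, not taken from the paper:

```latex
% Lasso-type estimator enforcing sparsity of the linear coefficients.
\hat{\beta} \;=\; \operatorname*{argmin}_{\beta \in \mathbb{R}^{d}}
\;\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \sum_{a=1}^{d}\beta_a x_i^{a}\Big)^{2}
\;+\; \tau \,\|\beta\|_{1}
```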

11 More precisely, we can add to the simplest additive model functions of couples $f_{a,b}(x^a, x^b)$, triplets $f_{a,b,c}(x^a, x^b, x^c)$, etc. [sent-105, score-0.14]

12 Moreover, we study the selection properties of the algorithm and prove that, if $R_\rho$ is the set of relevant variables and $\hat{R}_{\tau_n}$ the set estimated by our algorithm, then the following consistency result holds: $\lim_{n \to \infty} \mathbb{P}\big(\hat{R}_{\tau_n} \subseteq R_\rho\big) = 1$. [sent-109, score-0.222]

13 Left: Multiple kernel learning based on additive models using kernels. [sent-125, score-0.161]

14 We consider a reproducing kernel $k : X \times X \to \mathbb{R}$ (Aronszajn, 1950) and the associated reproducing kernel Hilbert space $\mathcal{H}$. [sent-132, score-0.294]

15 For $a = 1, \ldots, d$, $k_x$ is the function $t \mapsto k(x, t)$, and $(\partial_a k)_x$ denotes the partial derivative of the kernel, see (19). [sent-149, score-0.185]

16 for a = 1, . . . , d do: $v^q_a = \pi_{\frac{\tau}{\sigma} B_n}\big( v^{q-1}_a - \frac{1}{\eta}\big[ \tilde{L}_a v^{q-1} - Z_a^T \alpha^t + \big(1 - \frac{\tau\nu}{\sigma}\big) \tilde{L}_a \beta^t \big] \big)$; end for; end while; set $\bar{v}^t = v^q$ and $\beta^t = \big(1 - \frac{\tau\nu}{\sigma}\big) \tilde{\beta}^t - \bar{v}^t$. [sent-204, score-0.635]

17 Second, the computation of the solution for different regularization parameters can be highly accelerated by a simple warm-starting procedure, such as the one in Hale et al. [sent-210, score-0.135]
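A minimal sketch of such a warm-starting scheme over a decreasing grid of regularization parameters is given below; `fit` stands for a hypothetical solver that accepts an initialization, it is not the paper's implementation.

```python
import numpy as np

def regularization_path(fit, X, y, taus, **opts):
    """Warm-started regularization path: solve for the largest tau first and
    reuse each solution to initialize the next (smaller) value of tau."""
    taus = np.sort(np.asarray(taus, dtype=float))[::-1]   # decreasing order
    path, init = [], None
    for tau in taus:
        init = fit(X, y, tau, init=init, **opts)
        path.append((tau, init))
    return path
```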

18 Kernels, Partial Derivatives and Regularization: we start by discussing how partial derivatives can be efficiently computed in RKHSs induced by smooth kernels, and hence derive a new representer theorem. [sent-215, score-0.154]

19 Recall that the RKHS associated to a symmetric positive definite function $k : X \times X \to \mathbb{R}$ is the unique Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ such that $k_x = k(x, \cdot) \in \mathcal{H}$ for all $x \in X$, and $f(x) = \langle f, k_x \rangle_{\mathcal{H}}$ (17), for all $f \in \mathcal{H}$, $x \in X$. [sent-219, score-0.186]

20 Property (17) is called the reproducing property and k is called the reproducing kernel (Aronszajn, 1950). [sent-220, score-0.201]

21 One of the most important results for kernel methods, namely the representer theorem (Wahba, 1990), shows that a large class of regularized kernel methods induce estimators that can be written as finite linear combinations of kernels centered at the training set points. [sent-223, score-0.276]

22 The above operator is linear, and it is bounded if the kernel is bounded (see Appendix A), which holds thanks to Assumption (A1). [sent-231, score-0.148]

23 Let $(\partial_a k)_x := \left.\frac{\partial k(s, \cdot)}{\partial s^a}\right|_{s=x}$ (19) be the partial derivative of the kernel with respect to the first variable. [sent-233, score-0.16]

24 It is useful to define the analogue of the sampling operator for derivatives, which returns the values of the partial derivative of a function $f \in \mathcal{H}$ at a set of input points $x = (x_1, \ldots, x_n)$. [sent-238, score-0.122]

25 Proposition 3: The minimizer of (6) can be written as $\hat{f}^\tau = \frac{1}{n}\sum_{i=1}^{n} \alpha_i k_{x_i} + \frac{1}{n}\sum_{a=1}^{d}\sum_{i=1}^{n} \beta_{a,i} (\partial_a k)_{x_i}$ with $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^{nd}$. [sent-252, score-0.12]
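A sketch of how such an expansion is evaluated at a new point is shown below; `kernel` and `dkernel` are hypothetical callables returning $k(x_i, x)$ and $(\partial_a k)_{x_i}(x)$, and `beta` is assumed to be stored as a d-by-n array.

```python
import numpy as np

def evaluate_expansion(alpha, beta, X_train, x, kernel, dkernel):
    """Evaluate f(x) = (1/n) sum_i alpha_i k(x_i, x)
                     + (1/n) sum_{a,i} beta[a, i] (d_a k)(x_i, x)."""
    n, d = X_train.shape
    value = sum(alpha[i] * kernel(X_train[i], x) for i in range(n)) / n
    value += sum(beta[a, i] * dkernel(X_train[i], x, a)
                 for a in range(d) for i in range(n)) / n
    return value
```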

26 Unlike other simpler penalties used in additive models, such as the $\ell_1$ norm in the lasso, in our setting the computation of the proximity operator of $\Omega_1^D$ is not trivial and will be discussed in the next paragraph. [sent-281, score-0.208]

27 The algorithm is analyzed in Beck and Teboulle (2009) and in Tseng (2010), where it is proved that the objective values generated by such a procedure converge with rate $O(1/t^2)$, if the step-size satisfies $\sigma \geq L$. [sent-287, score-0.14]
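For reference, a generic accelerated forward-backward (FISTA-type) iteration of this kind can be sketched as follows; the interface is an assumption, and the step size 1/sigma requires sigma >= L for the O(1/t^2) rate mentioned above.

```python
import numpy as np

def fista(grad_F, prox, x0, sigma, n_iter=500):
    """Accelerated forward-backward splitting (FISTA) sketch: extrapolate with
    the usual momentum coefficients, then take a proximal gradient step."""
    x_prev, x = x0.copy(), x0.copy()
    s_prev = 1.0
    for _ in range(n_iter):
        s = (1.0 + np.sqrt(1.0 + 4.0 * s_prev ** 2)) / 2.0
        z = x + ((s_prev - 1.0) / s) * (x - x_prev)       # extrapolation step
        x_prev, x = x, prox(z - grad_F(z) / sigma)        # forward-backward step
        s_prev = s
    return x
```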

28 Finally, it is well known that, if the functional is strongly convex with a positive modulus, the convergence rate of both the basic and accelerated scheme is indeed linear for both the function values and the minimizers (Nesterov, 1983; Mosci et al. [sent-292, score-0.124]

29 In fact we can fix an arbitrary initialization $v^0 \in \mathbb{R}^{nd}$ and consider $v^{q+1} = \pi_{\frac{\tau}{\sigma} B_n^d}\big( v^{q} - \frac{1}{\eta} \hat{\nabla}\big(\hat{\nabla}^* v^{q} - f\big) \big)$ (26), for a suitable choice of $\eta$. [sent-313, score-0.394]
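The projection $\pi_{\frac{\tau}{\sigma} B_n^d}$ appearing in this iteration acts blockwise; a sketch with plain Euclidean norms is given below (the paper's empirical norm would rescale the radius by a factor depending on n, which is left out here as an assumption).

```python
import numpy as np

def project_product_of_balls(V, radius):
    """Project each row v_a of V (shape d x n) onto the Euclidean ball
    {v : ||v|| <= radius}, i.e., onto the product-of-balls constraint set."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return V * scale
```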

30 , 2010) that $\|\hat{\nabla}^* v^q - \hat{\nabla}^* \bar{v}\|_{\mathcal{H}} \to 0$, or, equivalently, $\|\hat{\nabla}^* v^q - \pi_{\frac{\tau}{\sigma} C_n}(f)\|_{\mathcal{H}} \to 0$. [sent-318, score-0.394]

31 Overall Procedure and Convergence Analysis: to compute the minimizer of $\mathcal{E}^\tau$ we consider the combination of the accelerated FB-splitting algorithm (outer iteration) and the basic FB-splitting algorithm for computing the proximity operator (inner iteration). [sent-322, score-0.212]

32 The overall procedure is given by $s_t = \frac{1}{2}\big(1 + \sqrt{1 + 4 s_{t-1}^2}\big)$, $\tilde{f}^t = \big(1 + \frac{s_{t-1} - 1}{s_t}\big) f^{t-1} + \frac{1 - s_{t-1}}{s_t} f^{t-2}$, and $f^t = \big(1 - \frac{\tau\nu}{\sigma}\big) \tilde{f}^t - \frac{1}{\sigma} \hat{S}^*\big(\hat{S} \tilde{f}^t - y\big) - \hat{\nabla}^* \bar{v}^t$, for $t = 2, 3, \ldots$ [sent-323, score-0.28]

33 , where $\bar{v}^t$ is computed through the iteration $v^q = \pi_{\frac{\tau}{\sigma} B_n^d}\Big( v^{q-1} - \frac{1}{\eta} \hat{\nabla}\big(\hat{\nabla}^* v^{q-1} - \big[\big(1 - \frac{\tau\nu}{\sigma}\big) \tilde{f}^t - \frac{1}{\sigma} \hat{S}^*\big(\hat{S} \tilde{f}^t - y\big)\big]\big) \Big)$ (27), for given initializations. [sent-326, score-0.267]

34 The above algorithm is an inexact accelerated FB-splitting algorithm, in the sense that the proximal or backward step is computed only approximately. [sent-327, score-0.244]

35 For the basic—not accelerated—FB-splitting algorithm, convergence in the inexact case is still guaranteed (without a rate) (Combettes and Wajs, 2005), if the computation of the proximity operator is sufficiently accurate. [sent-330, score-0.183]

36 The convergence of the inexact accelerated FB-splitting algorithm is studied in Villa et al. [sent-331, score-0.137]

37 There exists $\bar{q}$ such that, if $\bar{v}^t = v^q$ for $q > \bar{q}$, the following condition is satisfied: $\frac{2\tau}{\sigma} \Omega_1^D(f^t) - 2 \langle \hat{\nabla}^* \bar{v}^t, f^t \rangle_{\mathcal{H}} \leq \varepsilon_t^2$. [sent-335, score-0.337]

38 Third, we remark that for proving convergence of the inexact procedure, it is essential that the specific algorithm proposed to compute the proximal step generates a sequence belonging to Cn . [sent-345, score-0.178]

39 Second, since in practice we have to define suitable stopping rules, Equations (29) and (28) suggest the following choices: $\|f^t - f^{t-1}\|_{\mathcal{H}} \leq \varepsilon^{(ext)}$ and $\frac{2\tau}{\sigma} \Omega_1^D(f^t) - 2 \langle \hat{\nabla}^* v^q, f^t \rangle \leq \varepsilon^{(int)}$. [sent-358, score-0.197]

40 As a direct consequence of (30) and using the definition of the matrices K, Z, L, these quantities can be easily computed as $\|f^t - f^{t-1}\|^2_{\mathcal{H}} = \langle \delta\alpha, K \delta\alpha \rangle_n + 2 \langle \delta\alpha, Z \delta\beta \rangle_n + \langle \delta\beta, L \delta\beta \rangle_n$ and $\frac{2\tau}{\sigma} \Omega_1^D(f^t) - 2 \langle \hat{\nabla}^* v^q, f^t \rangle = \sum_{a=1}^{d} \Big( \frac{2\tau}{\sigma} \| Z_a \alpha^t + L_a \beta^t \|_n - 2 \langle v^q_a, Z_a \alpha^t + L_a \beta^t \rangle_n \Big)$. [sent-359, score-0.394]
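A sketch of how the external stopping quantity could be computed from the coefficient increments via the Gram matrices is shown below; the empirical inner product $\langle u, v\rangle_n = u^\top v / n$ and the shapes of K, Z, L are assumptions.

```python
import numpy as np

def external_stopping_quantity(K, Z, L, d_alpha, d_beta):
    """||f^t - f^{t-1}||_H^2 from the coefficient increments d_alpha (length n)
    and d_beta (length n*d), using <u, v>_n = (u @ v) / n."""
    n = K.shape[0]
    def dot_n(u, v):
        return float(u @ v) / n
    return (dot_n(d_alpha, K @ d_alpha)
            + 2.0 * dot_n(d_alpha, Z @ d_beta)
            + dot_n(d_beta, L @ d_beta))
```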

41 The coefficients $\beta_1, \ldots, \beta_d$ coincide with the partial derivatives, and the coefficient vector β given by ℓ1 regularization is sparse (in the sense that it has zero entries), so that it is easy to detect which variables are to be considered relevant. [sent-368, score-0.145]

42 In practice we often use a stopping rule where the tolerance is scaled with the current iterate: $\|f^t - f^{t-1}\|_{\mathcal{H}} \leq \varepsilon^{(ext)} \|f^t\|_{\mathcal{H}}$ and $\frac{2\tau}{\sigma} \Omega_1^D(f^t) - 2 \langle \hat{\nabla}^* v^q, f^t \rangle \leq \varepsilon^{(int)} \|\hat{\nabla}^* v^q\|_{\mathcal{H}}$. [sent-374, score-0.394]

43 Given the pair $(f^t, \bar{v}^t)$ evaluated via Algorithm 1, we can thus consider irrelevant the variables such that $\|\bar{v}^t_a\|_n < \tau/\sigma - (\tilde{\varepsilon}^t)^2/(2m)$. [sent-383, score-0.148]

44 The restriction to functions such that $\|f\|_{\mathcal{H}} \leq r$ is natural and is required since the penalty $\Omega_1^D$ forces the partial derivatives to be zero only on the training set points. [sent-392, score-0.137]

45 The above result on the consistency of the derivative based regularizer is at the basis of the following consistency result. [sent-396, score-0.138]

46 Note that, if the kernel is universal (Steinwart and Christmann, 2008), then $\inf_{f \in \mathcal{H}} \mathcal{E}(f) = \mathcal{E}(f_\rho)$ and Theorem 8 gives the universal consistency of the estimator $\hat{f}^{\tau_n}$. [sent-402, score-0.155]

47 Towards this end, it is interesting to contrast the form of our regularizer to that of structured sparsity penalties for which sparsity results can be derived. [sent-440, score-0.207]

48 The operators $(V_j)_j$ are positive and self-adjoint, and so are the operators $(\hat{V}_j)_j$. [sent-445, score-0.126]

49 In conclusion, rewriting our derivative-based regularizer as in (33) highlights similarities and differences with respect to previously studied sparsity methods: indeed many of these methods are induced by families of operators. [sent-473, score-0.141]

50 Choice of Tuning Parameters: when using Algorithm 1, once the parameters n and ν are fixed, we evaluate the optimal value of the regularization parameter τ via hold-out validation on an independent validation set of $n_{val} = n$ samples. [sent-485, score-0.197]
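A hedged sketch of this hold-out selection over a grid of values of τ is given below; `fit` and `predict` are hypothetical training and prediction routines, and squared error is assumed as the validation criterion.

```python
import numpy as np

def holdout_select_tau(fit, predict, X_tr, y_tr, X_val, y_val, taus):
    """Pick the tau with the smallest error on the held-out validation set."""
    best_tau, best_err = None, np.inf
    for tau in taus:
        model = fit(X_tr, y_tr, tau)
        err = float(np.mean((predict(model, X_val) - y_val) ** 2))
        if err < best_err:
            best_tau, best_err = tau, err
    return best_tau, best_err
```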

51 drawn from the above distribution (Figure 3 top-left), we evaluate the optimal value of the regularization parameter τ via hold out validation on an independent validation set of nval = n = 20 samples. [sent-514, score-0.197]

52 We then apply our method with a polynomial kernel of degree p = 4, letting ν vary in {0. [sent-529, score-0.185]

53 For fixed n and ν we evaluate the optimal value of the regularization parameter τ via hold out validation on an independent validation set of nval = n samples. [sent-531, score-0.197]

54 We measure the selection error as the mean of the false negative rate (fraction of relevant variables that were not selected) and false positive rate (fraction of irrelevant variables that were selected). [sent-532, score-0.192]
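The selection error described here can be computed as follows (variables are indexed 0, ..., d-1 in this sketch):

```python
def selection_error(selected, relevant, d):
    """Mean of the false negative rate (relevant variables missed) and the
    false positive rate (irrelevant variables selected)."""
    selected, relevant = set(selected), set(relevant)
    irrelevant = set(range(d)) - relevant
    fnr = len(relevant - selected) / max(len(relevant), 1)
    fpr = len(selected & irrelevant) / max(len(irrelevant), 1)
    return 0.5 * (fnr + fpr)
```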

55 The parameters we take into account are the following • n, training set size • d, input space dimensionality • |Rρ |, number of relevant variables • p, size of the hypotheses space, measured as the degree of the polynomial kernel. [sent-560, score-0.206]

56 For fixed n, d, and |Rρ | we evaluate the optimal value of the regularization parameter τ via hold out validation on an independent validation set of nval = n samples. [sent-566, score-0.197]

57 Figure 5: Prediction error (top) and selection error (bottom) versus n for different values of d and number of relevant variables (|Rρ|). [sent-587, score-0.154]

58 For each value of $|R_\rho|$ we use a different regression function, $f(x) = \lambda \sum_{a=1}^{|R_\rho|} \sum_{b=a+1}^{|R_\rho|} c_{ab}\, x^a x^b$, so that for fixed $|R_\rho|$ all 2-way interaction terms are present, and the polynomial degree of the regression function is always 2. [sent-592, score-0.292]
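A sketch of a data generator of this form is given below; the input distribution, the coefficients c_ab and the noise level are assumptions made for the example, not the paper's exact setting.

```python
import numpy as np

def make_interaction_data(n, d, relevant, lam=1.0, noise=0.1, seed=0):
    """Regression with all 2-way interactions among the relevant variables:
    f(x) = lam * sum_{a<b, both relevant} c_ab * x^a * x^b, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, d))
    R = sorted(relevant)
    C = rng.uniform(0.5, 1.5, size=(len(R), len(R)))
    y = np.zeros(n)
    for i in range(len(R)):
        for j in range(i + 1, len(R)):
            y += lam * C[i, j] * X[:, R[i]] * X[:, R[j]]
    return X, y + noise * rng.standard_normal(n)
```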

59 We then apply our method with polynomial kernel of degree p = 2. [sent-595, score-0.185]

60 2, we display the selection error, and the prediction error, respectively, versus n for different values of d and number of relevant variables |Rρ |. [sent-599, score-0.19]

61 We therefore leave the data generation setting unchanged, except for the number of training samples, and vary the polynomial kernel degree as p = 1, 2, 3, 4, 5, 6. [sent-606, score-0.185]

62 Figure 6: Minimum number of input points (n) necessary to achieve 10% of average selection error versus the number of relevant variables |Rρ| for different values of d (left), and versus d for different values of |Rρ| (right). [sent-618, score-0.154]

63 Figure 7: Prediction error (left) and selection error (right) versus n for different values of the polynomial kernel degree (p). [sent-627, score-0.253]

64 We set d = 10, Rρ = {1, 2, 3}, n = 30, and then use a polynomial kernel of degree 2. [sent-640, score-0.185]

65 2, we display the prediction and selection error, versus the number of relevant features. [sent-647, score-0.152]

66 While the performance of ℓ1 -features fades when the number of relevant features increases, our method presents stable performance both in terms of selection and prediction error. [sent-648, score-0.152]

67 In particular, since our method is a regularization method, we focus on alternative regularization approaches to the problem of nonlinear variable selection. [sent-652, score-0.138]

68 In the first 3 models, for MKL, HKL, RLS and our method we employed the polynomial kernel of degree p, where p is the polynomial degree of the regression function f . [sent-685, score-0.308]

69 For ℓ1 -features we used the polynomial kernel with degree chosen as the minimum between the polynomial degree of f and 3. [sent-686, score-0.277]

70 polynomial kernel of degree 4 for MKL, ℓ1-features and HKL, and the Gaussian kernel with kernel parameter σ = 2 for RLS and our method. [sent-693, score-0.371]

71 Results in terms of prediction and selection errors are reported in Figure 6. [sent-695, score-0.134]

72 Note that here we are interested in evaluating the ability of our method to deal with a general kernel like the Gaussian kernel, not in the choice of the kernel parameter itself. [sent-707, score-0.186]

73 For our method and RLS we used the Gaussian kernel with the kernel parameter σ chosen as the mean over the samples of the Euclidean distance from the 20th nearest neighbor. [sent-722, score-0.186]
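This kernel-width heuristic can be sketched as follows (a brute-force pairwise-distance computation; k = 20 as in the text):

```python
import numpy as np

def sigma_from_kth_neighbor(X, k=20):
    """Mean over samples of the Euclidean distance to the k-th nearest neighbor
    (column 0 of the sorted distances is the zero distance to the point itself)."""
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    D_sorted = np.sort(D, axis=1)
    return float(np.mean(D_sorted[:, k]))
```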

74 Since the other methods cannot be run with the Gaussian kernel we used a polynomial kernel of degree p = 3 for MKL and ℓ1 -features. [sent-723, score-0.278]

75 For HKL we used both the polynomial kernel and the Hermite kernel, both with p = 3. [sent-724, score-0.146]

76 We consider the polynomial kernel with degree p = 2, 3 and 4, and the Gaussian kernel. [sent-731, score-0.185]

77 We propose to use partial derivatives of functions in a RKHS to design a new sparsity penalty and a corresponding regularization scheme. [sent-738, score-0.273]

78 To make the prediction errors comparable among experiments, root mean squared errors (RMSE) were divided by the outputs' standard deviation, which corresponds to the dotted line. [sent-744, score-0.124]
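In code, this normalization amounts to:

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    """Root mean squared error divided by the standard deviation of the outputs;
    a value of 1 corresponds to the dotted baseline (predicting the mean)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.std(y_true))
```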

Figure 11: Prediction error (top) and fraction of selected variables (bottom) on real data for our method with different kernels: polynomial kernel of degree p = 1, 2 and 3 (DENOVAS pol-p), and Gaussian kernel (DENOVAS gauss). [sent-750, score-0.316]

80 In order to make the prediction errors comparable among experiments, root mean square errors were divided by the outputs' standard deviation, which corresponds to the dotted line. [sent-752, score-0.124]

81 Moreover, we study selection properties and show that the proposed regularization scheme represents a safe filter for variable selection, as it does not discard relevant variables. [sent-756, score-0.185]

82 Finally, comparisons to state-of-the-art algorithms for nonlinear variable selection on toy data as well as on a cohort of benchmark data sets, show that our approach often leads to better prediction and selection performance. [sent-758, score-0.172]

83 Our work can be considered as a first step towards understanding the role of sparsity beyond additive models. [sent-759, score-0.135]

84 It substantially differs from previous approaches based on local polynomial regression (Lafferty and Wasserman, 2008; Bertin and Lecué, 2008; Miller and Hall, 2010), since it is a regularization scheme directly performing global variable selection. [sent-760, score-0.153]

85 • From a theoretical point of view it would be interesting to further analyze the sparsity property of the obtained estimator in terms of finite sample estimates for the prediction and the selection error. [sent-763, score-0.171]

86 • More generally, our study raises the question of whether there are alternative/better ways to perform learning and variable selection beyond additive models and using nonparametric models. [sent-769, score-0.136]

87 The operator $I_k : \mathcal{H} \to L^2(X, \rho_X)$ defined by $(I_k f)(x) = \langle f, k_x \rangle_{\mathcal{H}}$, for almost all $x \in X$, is well-defined and bounded thanks to assumption (A1). [sent-777, score-0.148]

88 Hence it satisfies the following representer theorem: $\hat{f}^\tau = \hat{S}^* \alpha + \hat{\nabla}^* \beta = \frac{1}{n}\sum_{i=1}^{n} \alpha_i k_{x_i} + \frac{1}{n}\sum_{i=1}^{n}\sum_{a=1}^{d} \beta_{a,i} (\partial_a k)_{x_i}$ (34) with $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^{nd}$. [sent-796, score-0.148]

89 Given $x \in X$ it holds $\frac{\partial k(s, x)}{\partial s^a} = e^{-\frac{\|x - s\|^2}{2\gamma^2}} \cdot \Big(-\frac{s^a - x^a}{\gamma^2}\Big) \;\Longrightarrow\; (\partial_a k)_{x_i}(x) = e^{-\frac{\|x - x_i\|^2}{2\gamma^2}} \cdot \Big(-\frac{x_i^a - x^a}{\gamma^2}\Big)$. [sent-804, score-0.159]

90 Theorem 4 is a consequence of the general results about convergence of accelerated and inexact FB-splitting algorithms in Villa et al. [sent-813, score-0.137]

91 In that paper it is shown that inexact schemes converge only when the errors in the computation of the proximity operator are of a suitable type and satisfy a sufficiently fast decay condition. [sent-815, score-0.213]

92 Theorem 15: Consider the following inexact version of the accelerated FB-algorithm in (21), with $c_{1,t}$ and $c_{2,t}$ as in (23): $f^t \approx_{\varepsilon_t} \mathrm{prox}_{\frac{\tau}{\sigma} \Omega_1^D}\Big( \big(I - \tfrac{1}{\sigma} \nabla F\big)(c_{1,t} f^{t-1} + c_{2,t} f^{t-2}) \Big)$. [sent-821, score-0.207]

93 Proposition 17: Under the same assumptions of Proposition 16, for any $f \in \mathcal{H}$, $\varepsilon > 0$, and sequence $\{v^q\}$ minimizing the dual problem $\|f - \lambda B^* v\|_{\mathcal{H}} + i_S(v)$, where $i_S$ is the indicator function of S, there exists $\bar{q}$ such that for any $q > \bar{q}$, $f - \lambda B^* v^q$ satisfies condition (36). [sent-825, score-0.197]

94 Proof [Proof of Theorem 4] Since the regularizer $\Omega_1^D$ can be written as the composition $\omega \circ B$, with $B = \hat{\nabla}$ and $\omega : (\mathbb{R}^n)^d \to [0, +\infty)$, $\omega(v) = \sum_{a=1}^{d} \|v_a\|_n$, and $\{v^q\}$ is a minimizing sequence for problem (25), Proposition 17 ensures that $v^q$ satisfies condition (28) if q is big enough. [sent-826, score-0.343]

95 Therefore, the sequence $\hat{\nabla}^* v^q$ obtained from $v^q$ generates, via (27), admissible approximations of $\mathrm{prox}_{\frac{\tau}{\sigma} \Omega_1^D}$. [sent-827, score-0.464]

96 where, for an arbitrary λ > 0, the subdifferential of $\lambda \Omega_1^D$ at f is given by $\partial \lambda \Omega_1^D(f) = \big\{ \hat{\nabla}^* v,\ v = (v_a)_{a=1}^{d} \in (\mathbb{R}^n)^d \ \big|\ v_a = \lambda \hat{D}_a f / \|\hat{D}_a f\|_n \ \text{if}\ \|\hat{D}_a f\|_n > 0,\ \|v_a\|_n \leq \lambda \ \text{otherwise},\ \forall a = 1, \ldots, d \big\}$. [sent-851, score-0.245]

97 to the previous inequality we obtain $\Omega_1^D(f) - \Omega_1^D(\hat{f}^\tau) \geq \frac{\sigma}{\tau} \langle \hat{\nabla}^* \bar{v}^t, f - \hat{f}^\tau \rangle_{\mathcal{H}} - \frac{\sigma}{2\tau} (\tilde{\varepsilon}^t)^2$, with $(\tilde{\varepsilon}^t)^2 = \frac{2\tau}{\sigma} \big(\Omega_1^D(\hat{f}^\tau) - \Omega_1^D(f^t)\big) - 2 \langle \hat{\nabla}^* \bar{v}^t, \hat{f}^\tau - f^t \rangle_{\mathcal{H}}$. [sent-866, score-0.14]

98 Now, relying on the structure of $\Omega_1^D$, it is easy to see that $\partial_\varepsilon \Omega_1^D(f) \subseteq \big\{ \hat{\nabla}^* v,\ v = (v_a)_{a=1}^{d} \in (\mathbb{R}^n)^d \ \big|\ \|v_a\|_n \geq 1 - \varepsilon / \|\hat{D}_a f\|_n \ \text{if}\ \|\hat{D}_a f\|_n > 0 \big\}$. Thus, if $\|\hat{D}_a \hat{f}^\tau\|_n > 0$ we have $\|\bar{v}^t_a\|_n \geq \frac{\tau}{\sigma} \Big(1 - \frac{(\tilde{\varepsilon}^t)^2}{2 \|\hat{D}_a \hat{f}^\tau\|_n}\Big)$. [sent-870, score-0.171]

99 The proof is a direct application of the above inequalities to the random variables: (1) $\xi = y^2$, $\xi \in \mathbb{R}$, with $\sup_{z_n} \|\xi\| \leq M^2$; (2) $\xi = k_x y$, $\xi \in \mathcal{H} \otimes \mathbb{R}$, with $\sup_{z_n} \|\xi\| \leq \kappa_1 M$; (3) $\xi = \langle \cdot, k_x \rangle_{\mathcal{H}}\, k_x$, $\xi \in HS(\mathcal{H})$, with $\sup_{z_n} \|\xi\|_{HS(\mathcal{H})} \leq \kappa_1^2$; (4) $\xi = \langle \cdot, (\partial_a k)_x \rangle_{\mathcal{H}}\, (\partial_a k)_x$, $\xi \in HS(\mathcal{H})$, with $\sup_{z_n} \|\xi\|_{HS(\mathcal{H})} \leq \kappa_2^2$. [sent-882, score-0.346]

100 Derivative reproducing properties for kernel methods in learning theory. [sent-1347, score-0.147]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('da', 0.421), ('rosasco', 0.232), ('erri', 0.231), ('illa', 0.231), ('osci', 0.231), ('antoro', 0.197), ('vq', 0.197), ('parsity', 0.189), ('onparametric', 0.17), ('egularization', 0.157), ('hkl', 0.13), ('villa', 0.129), ('rls', 0.12), ('mosci', 0.11), ('proximal', 0.107), ('ik', 0.105), ('xa', 0.102), ('va', 0.101), ('rnd', 0.094), ('mkl', 0.093), ('kernel', 0.093), ('kx', 0.093), ('cosso', 0.09), ('za', 0.087), ('kxi', 0.086), ('combettes', 0.086), ('denovas', 0.08), ('rkhs', 0.076), ('inexact', 0.071), ('wajs', 0.07), ('genova', 0.07), ('prox', 0.07), ('st', 0.07), ('vt', 0.07), ('regularization', 0.069), ('additive', 0.068), ('selection', 0.068), ('sparsity', 0.067), ('accelerated', 0.066), ('rmse', 0.064), ('beck', 0.064), ('rn', 0.063), ('operators', 0.063), ('representer', 0.062), ('la', 0.061), ('nval', 0.06), ('bd', 0.057), ('proximity', 0.057), ('sa', 0.057), ('te', 0.057), ('operator', 0.055), ('reproducing', 0.054), ('derivatives', 0.054), ('polynomial', 0.053), ('teboulle', 0.053), ('proposition', 0.049), ('relevant', 0.048), ('regularizer', 0.045), ('penalty', 0.045), ('santoro', 0.043), ('subdifferential', 0.043), ('white', 0.042), ('hilbert', 0.041), ('attouch', 0.04), ('dontchev', 0.04), ('epi', 0.04), ('liacc', 0.04), ('supzn', 0.04), ('vta', 0.04), ('degree', 0.039), ('variables', 0.038), ('partial', 0.038), ('italy', 0.038), ('prediction', 0.036), ('lim', 0.036), ('bach', 0.036), ('bn', 0.036), ('xb', 0.036), ('validation', 0.034), ('minimizer', 0.034), ('cn', 0.034), ('consistency', 0.032), ('regression', 0.031), ('verri', 0.031), ('convex', 0.031), ('argmin', 0.031), ('ekeland', 0.03), ('loris', 0.03), ('salzo', 0.03), ('silvia', 0.03), ('zolezzi', 0.03), ('functionals', 0.03), ('inf', 0.03), ('errors', 0.03), ('derivative', 0.029), ('penalties', 0.028), ('hypotheses', 0.028), ('dotted', 0.028), ('regularized', 0.028), ('functional', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 76 jmlr-2013-Nonparametric Sparsity and Regularization

Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri

Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS

2 0.10206784 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

Author: Julien Mairal, Bin Yu

Abstract: We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called “path coding” penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs. Keywords: convex and non-convex optimization, network flow optimization, graph sparsity

3 0.093503654 120 jmlr-2013-Variational Algorithms for Marginal MAP

Author: Qiang Liu, Alexander Ihler

Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models

4 0.084333539 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

Author: Manfred Opper, Ulrich Paquet, Ole Winther

Abstract: Expectation Propagation (EP) provides a framework for approximate inference. When the model under consideration is over a latent Gaussian field, with the approximation being Gaussian, we show how these approximations can systematically be corrected. A perturbative expansion is made of the exact but intractable correction, and can be applied to the model’s partition function and other moments of interest. The correction is expressed over the higher-order cumulants which are neglected by EP’s local matching of moments. Through the expansion, we see that EP is correct to first order. By considering higher orders, corrections of increasing polynomial complexity can be applied to the approximation. The second order provides a correction in quadratic time, which we apply to an array of Gaussian process and Ising models. The corrections generalize to arbitrarily complex approximating families, which we illustrate on tree-structured Ising model approximations. Furthermore, they provide a polynomial-time assessment of the approximation error. We also provide both theoretical and practical insights on the exactness of the EP solution. Keywords: expectation consistent inference, expectation propagation, perturbation correction, Wick expansions, Ising model, Gaussian process

5 0.071636237 114 jmlr-2013-The Rate of Convergence of AdaBoost

Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire

Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate

6 0.066890836 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

7 0.063111492 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

8 0.057605281 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

9 0.057174608 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

10 0.056785401 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

11 0.055036329 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

12 0.054488618 96 jmlr-2013-Regularization-Free Principal Curve Estimation

13 0.053506568 104 jmlr-2013-Sparse Single-Index Model

14 0.051859852 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

15 0.051370848 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

16 0.049590979 32 jmlr-2013-Differential Privacy for Functions and Functional Data

17 0.048605785 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

18 0.046427511 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

19 0.046087846 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

20 0.045770388 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.244), (1, 0.017), (2, 0.022), (3, 0.024), (4, 0.083), (5, -0.057), (6, -0.002), (7, 0.02), (8, 0.046), (9, -0.089), (10, -0.019), (11, 0.019), (12, 0.027), (13, 0.035), (14, 0.043), (15, -0.152), (16, -0.035), (17, 0.126), (18, 0.149), (19, 0.109), (20, -0.116), (21, -0.147), (22, 0.06), (23, 0.179), (24, 0.022), (25, 0.119), (26, 0.142), (27, -0.186), (28, -0.004), (29, 0.31), (30, -0.016), (31, -0.118), (32, 0.031), (33, -0.119), (34, 0.09), (35, 0.133), (36, 0.032), (37, 0.049), (38, 0.006), (39, -0.07), (40, -0.033), (41, -0.165), (42, -0.047), (43, 0.1), (44, 0.041), (45, -0.046), (46, 0.004), (47, 0.027), (48, -0.07), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90089703 76 jmlr-2013-Nonparametric Sparsity and Regularization

Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri

Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS ∗. Also at Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy and Massachusetts Institute of Technology, Bldg. 46-5155, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. c 2013 Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro and Alessandro Verri. ROSASCO , V ILLA , M OSCI , S ANTORO AND V ERRI

2 0.64252472 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco

Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra 1. Introduction and Design Supervised learning has become a fundamental tool for the design of intelligent systems and the analysis of high dimensional data. Key to this success has been the availability of efficient, easy-touse software packages. New data collection technologies make it easy to gather high dimensional, multi-output data sets of increasing size. This trend calls for new software solutions for the automatic training, tuning and testing of supervised learning methods. These observations motivated the design of GURLS (Grand Unified Regularized Least Squares). The package was developed to pursue the following goals: Speed: Fast training/testing procedures for learning problems with potentially large/huge number of points, features and especially outputs (e.g., classes). Memory: Flexible data management to work with large data sets by means of memory-mapped storage. Performance: ∗. Also in the Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology c 2013 Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro and Lorenzo Rosasco. TACCHETTI , M ALLAPRAGADA , S ANTORO AND ROSASCO State of the art results in high-dimensional multi-output problems. Usability and modularity: Easy to use and to expand. GURLS is based on Regularized Least Squares (RLS) and takes advantage of all the favorable properties of these methods (Rifkin et al., 2003). Since the algorithm reduces to solving a linear system, GURLS is set up to exploit the powerful tools, and recent advances, of linear algebra (including randomized solver, first order methods, etc.). Second, it makes use of RLS properties which are particularly suited for high dimensional learning. For example: (1) RLS has natural primal and dual formulation (hence having complexity which is the smallest between number of examples and features); (2) efficient parameter selection (closed form expression of the leave one out error and efficient computations of regularization path); (3) natural and efficient extension to multiple outputs. Specific attention has been devoted to handle large high dimensional data sets. We rely on data structures that can be serialized using memory-mapped files, and on a distributed task manager to perform a number of key steps (such as matrix multiplication) without loading the whole data set in memory. Efforts were devoted to to provide a lean API and an exhaustive documentation. GURLS has been deployed and tested successfully on Linux, MacOS and Windows. 
The library is distributed under the simplified BSD license, and can be downloaded from https://github.com/LCSL/GURLS. 2. Description of the Library The library comprises four main modules. GURLS and bGURLS—both implemented in Matlab— are aimed at solving learning problems with small/medium and large-scale data sets respectively. GURLS++ and bGURLS++ are their C++ counterparts. The Matlab and C++ versions share the same design, but the C++ modules have significant improvements, which make them faster and more flexible. The specification of the desired machine learning experiment in the library is straightforward. Basically, it is a formal description of a pipeline, that is, an ordered sequence of steps. Each step identifies an actual learning task, and belongs to a predefined category. The core of the library is a method (a class in the C++ implementation) called GURLScore, which is responsible for processing the sequence of tasks in the proper order and for linking the output of the former task to the input of the subsequent one. A key role is played by the additional “options” structure, referred to as OPT. OPT is used to store all configuration parameters required to customize the behavior of individual tasks in the pipeline. Tasks receive configuration parameters from OPT in read-only mode and—upon termination—the results are appended to the structure by GURLScore in order to make them available to subsequent tasks. This allows the user to skip the execution of some tasks in a pipeline, by simply inserting the desired results directly into the options structure. Currently, we identify six different task categories: data set splitting, kernel computation, model selection, training, evaluation and testing and performance assessment and analysis. Tasks belonging to the same category may be interchanged with each other. 2.1 Learning From Large Data Sets Two modules in GURLS have been specifically designed to deal with big data scenarios. The approach we adopted is mainly based on a memory-mapped abstraction of matrix and vector data structures, and on a distributed computation of a number of standard problems in linear algebra. For learning on big data, we decided to focus specifically on those situations where one seeks a linear model on a large set of (possibly non linear) features. A more accurate specification of what “large” means in GURLS is related to the number of features d and the number of training 3202 GURLS: A L EAST S QUARES L IBRARY FOR S UPERVISED L EARNING data set optdigit landast pendigit letter isolet # of samples 3800 4400 7400 10000 6200 # of classes 10 6 10 26 26 # of variables 64 36 16 16 600 Table 1: Data sets description. examples n: we require it must be possible to store a min(d, n) × min(d, n) matrix in memory. In practice, this roughly means we can train models with up-to 25k features on machines with 8Gb of RAM, and up-to 50k features on machines with 36Gb of RAM. We do not require the data matrix itself to be stored in memory: within GURLS it is possible to manage an arbitrarily large set of training examples. We distinguish two different scenarios. Data sets that can fully reside in RAM without any memory mapping techniques—such as swapping—are considered to be small/medium. Larger data sets are considered to be “big” and learning must be performed using either bGURLS or bGURLS++ . These two modules include all the design patterns described above, and have been complemented with additional big data and distributed computation capabilities. 
Big data support is obtained using a data structure called bigarray, which allows to handle data matrices as large as the space available on the hard drive: we store the entire data set on disk and load only small chunks in memory when required. There are some differences between the Matlab and C++ implementations. bGURLS relies on a simple, ad hoc interface, called GURLS Distributed Manager (GDM), to distribute matrix-matrix multiplications, thus allowing users to perform the important task of kernel matrix computation on a distributed network of computing nodes. After this step, the subsequent tasks behave as in GURLS. bGURLS++ (currently in active development) offers more interesting features because it is based on the MPI libraries. Therefore, it allows for a full distribution within every single task of the pipeline. All the processes read the input data from a shared filesystem over the network and then start executing the same pipeline. During execution, each process’ task communicates with the corresponding ones. Every process maintains its local copy of the options. Once the same task is completed by all processes, the local copies of the options are synchronized. This architecture allows for the creation of hybrid pipelines comprising serial one-process-based tasks from GURLS++ . 3. Experiments We decided to focus the experimental analysis in the paper to the assessment of GURLS’ performance both in terms of accuracy and time. In our experiments we considered 5 popular data sets, briefly described in Table 1. Experiments were run on a Intel Xeon 5140 @ 2.33GHz processor with 8GB of RAM, and running Ubuntu 8.10 Server (64 bit). optdigit accuracy (%) GURLS (linear primal) GURLS (linear dual) LS-SVM linear GURLS (500 random features) GURLS (1000 random features) GURLS (Gaussian kernel) LS-SVM (Gaussian kernel) time (s) landsat accuracy (%) time (s) pendigit accuracy (%) time (s) 92.3 92.3 92.3 96.8 97.5 98.3 98.3 0.49 726 7190 25.6 207 13500 26100 63.68 66.3 64.6 63.5 63.5 90.4 90.51 0.22 1148 6526 28.0 187 20796 18430 82.24 82.46 82.3 96.7 95.8 98.4 98.36 0.23 5590 46240 31.6 199 100600 120170 Table 2: Comparison between GURLS and LS-SVM. 3203 TACCHETTI , M ALLAPRAGADA , S ANTORO AND ROSASCO Performance (%) 1 0.95 0.9 0.85 isolet(∇) letter(×) 0.8 pendigit(∆) 0.75 landsat(♦) optdigit(◦) 0.7 LIBSVM:rbf 0.65 GURLS++:rbf GURLS:randomfeatures-1000 0.6 GURLS:randomfeatures-500 0.55 0.5 0 10 GURLS:rbf 1 10 2 10 3 10 4 Time (s) 10 Figure 1: Prediction accuracy vs. computing time. The color represents the training method and the library used. In blue: the Matlab implementation of RLS with RBF kernel, in red: its C++ counterpart. In dark red: results of LIBSVM with RBF kernel. In yellow and green: results obtained using a linear kernel on 500 and 1000 random features respectively. We set up different pipelines and compared the performance to SVM, for which we used the python modular interface to LIBSVM (Chang and Lin, 2011). Automatic selection of the optimal regularization parameter is implemented identically in all experiments: (i) split the data; (ii) define a set of regularization parameter on a regular grid; (iii) perform hold-out validation. The variance of the Gaussian kernel has been fixed by looking at the statistics of the pairwise distances among training examples. The prediction accuracy of GURLS and GURLS++ is identical—as expected—but the implementation in C++ is significantly faster. The prediction accuracy of standard RLS-based methods is in many cases higher than SVM. 
Exploiting the primal formulation of RLS, we further ran experiments with the random features approximation (Rahimi and Recht, 2008). As show in Figure 1, the performance of this method is comparable to that of SVM at a much lower computational cost in the majority of the tested data sets. We further compared GURLS with another available least squares based toolbox: the LS-SVM toolbox (Suykens et al., 2001), which includes routines for parameter selection such as coupled simulated annealing and line/grid search. The goal of this experiment is to benchmark the performance of the parameter selection with random data splitting included in GURLS. For a fair comparison, we considered only the Matlab implementation of GURLS. Results are reported in Table 2. As expected, using the linear kernel with the primal formulation—not available in LS-SVM—is the fastest approach since it leverages the lower dimensionality of the input space. When the Gaussian kernel is used, GURLS and LS-SVM have comparable computing time and classification performance. Note, however, that in GURLS the number of parameter in the grid search is fixed to 400, while in LS-SVM it may vary and is limited to 70. The interesting results obtained with the random features implementation in GURLS, make it an interesting choice in many applications. Finally, all GURLS pipelines, in their Matlab implementation, are faster than LS-SVM and further improvements can be achieved with GURLS++ . Acknowledgments We thank Tomaso Poggio, Zak Stone, Nicolas Pinto, Hristo S. Paskov and CBCL for comments and insights. 3204 GURLS: A L EAST S QUARES L IBRARY FOR S UPERVISED L EARNING References C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www. csie.ntu.edu.tw/˜cjlin/libsvm. A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, volume 21, pages 1313–1320, 2008. R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003. J. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2001. ISBN 981-238-151-1. 3205

3 0.47608489 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

Author: Kenji Fukumizu, Le Song, Arthur Gretton

Abstract: A kernel method for realizing Bayes’ rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes’ rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model. Keywords: kernel method, Bayes’ rule, reproducing kernel Hilbert space

4 0.39820594 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels

Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule

5 0.38940194 120 jmlr-2013-Variational Algorithms for Marginal MAP

Author: Qiang Liu, Alexander Ihler

Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models

6 0.38707635 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

7 0.34972575 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

8 0.28297564 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

9 0.28263125 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

10 0.27853996 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

11 0.27317819 96 jmlr-2013-Regularization-Free Principal Curve Estimation

12 0.2687926 106 jmlr-2013-Stationary-Sparse Causality Network Learning

13 0.26784974 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

14 0.26685697 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

15 0.25520149 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

16 0.25469843 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

17 0.25144383 104 jmlr-2013-Sparse Single-Index Model

18 0.25098357 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

19 0.25059876 68 jmlr-2013-Machine Learning with Operational Costs

20 0.24400035 114 jmlr-2013-The Rate of Convergence of AdaBoost


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.021), (5, 0.123), (6, 0.031), (10, 0.074), (20, 0.016), (23, 0.033), (35, 0.01), (68, 0.053), (70, 0.026), (75, 0.038), (85, 0.023), (87, 0.032), (89, 0.014), (93, 0.411)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76784468 121 jmlr-2013-Variational Inference in Nonconjugate Models

Author: Chong Wang, David M. Blei

Abstract: Mean-field variational methods are widely used for approximate posterior inference in many probabilistic models. In a typical application, mean-field methods approximately compute the posterior with a coordinate-ascent optimization algorithm. When the model is conditionally conjugate, the coordinate updates are easily derived and in closed form. However, many models of interest—like the correlated topic model and Bayesian logistic regression—are nonconjugate. In these models, mean-field methods cannot be directly applied and practitioners have had to develop variational algorithms on a case-by-case basis. In this paper, we develop two generic methods for nonconjugate models, Laplace variational inference and delta method variational inference. Our methods have several advantages: they allow for easily derived variational algorithms with a wide class of nonconjugate models; they extend and unify some of the existing algorithms that have been derived for specific models; and they work well on real-world data sets. We studied our methods on the correlated topic model, Bayesian logistic regression, and hierarchical Bayesian logistic regression. Keywords: variational inference, nonconjugate models, Laplace approximations, the multivariate delta method

same-paper 2 0.71101427 76 jmlr-2013-Nonparametric Sparsity and Regularization

Author: Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, Alessandro Verri

Abstract: In this work we are interested in the problems of supervised learning and variable selection when the input-output dependence is described by a nonlinear function depending on a few variables. Our goal is to consider a sparse nonparametric model, hence avoiding linear or additive models. The key idea is to measure the importance of each variable in the model by making use of partial derivatives. Based on this intuition we propose a new notion of nonparametric sparsity and a corresponding least squares regularization scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces and proximal methods, we show that the proposed learning algorithm corresponds to a minimization problem which can be provably solved by an iterative procedure. The consistency properties of the obtained estimator are studied both in terms of prediction and selection performance. An extensive empirical analysis shows that the proposed method performs favorably with respect to the state-of-the-art methods. Keywords: sparsity, nonparametric, variable selection, regularization, proximal methods, RKHS ∗. Also at Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genova, Italy and Massachusetts Institute of Technology, Bldg. 46-5155, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. c 2013 Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro and Alessandro Verri. ROSASCO , V ILLA , M OSCI , S ANTORO AND V ERRI

3 0.43543512 108 jmlr-2013-Stochastic Variational Inference

Author: Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley

Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets. Keywords: Bayesian inference, variational inference, stochastic optimization, topic models, Bayesian nonparametrics
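For intuition about the stochastic natural-gradient updates described above, here is a deliberately tiny example on a conjugate model (a Gaussian mean with known noise variance), where the exact posterior is available for comparison. The model, step-size constants, and function name are illustrative assumptions, not the paper's topic-model derivation.

```python
import numpy as np

def svi_gaussian_mean(x, sigma2=1.0, prior_var=100.0, n_steps=2000,
                      tau=1.0, kappa=0.7, seed=0):
    """Toy stochastic variational inference for the mean of a Gaussian.

    Model: mu ~ N(0, prior_var), x_i | mu ~ N(mu, sigma2).  The variational
    posterior q(mu) = N(m, s2) is tracked through its natural parameters
    (eta1, eta2) = (m / s2, -1 / (2 s2)).
    """
    rng = np.random.default_rng(seed)
    N = len(x)
    eta1_prior, eta2_prior = 0.0, -0.5 / prior_var   # prior natural parameters
    eta1, eta2 = eta1_prior, eta2_prior
    for t in range(1, n_steps + 1):
        i = rng.integers(N)                          # sample one data point
        # intermediate parameters: pretend the whole data set looks like x_i
        eta1_hat = eta1_prior + N * (x[i] / sigma2)
        eta2_hat = eta2_prior + N * (-0.5 / sigma2)
        rho = (t + tau) ** (-kappa)                  # Robbins-Monro step size
        eta1 = (1.0 - rho) * eta1 + rho * eta1_hat
        eta2 = (1.0 - rho) * eta2 + rho * eta2_hat
    s2 = -0.5 / eta2
    return eta1 * s2, s2                             # variational mean and variance
```

Each step forms an "intermediate" global parameter as if the sampled point were replicated over the whole data set, then blends it into the current estimate with a decreasing step size; averaged over random draws, the iterates move toward the exact posterior's natural parameters.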

4 0.428307 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco

Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted at machine learning practitioners as well as non-specialists. It offers a number of state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of the regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra

1. Introduction and Design

Supervised learning has become a fundamental tool for the design of intelligent systems and the analysis of high-dimensional data. Key to this success has been the availability of efficient, easy-to-use software packages. New data collection technologies make it easy to gather high-dimensional, multi-output data sets of increasing size. This trend calls for new software solutions for the automatic training, tuning and testing of supervised learning methods. These observations motivated the design of GURLS (Grand Unified Regularized Least Squares). The package was developed to pursue the following goals. Speed: fast training/testing procedures for learning problems with a potentially large number of points, features and especially outputs (e.g., classes). Memory: flexible data management to work with large data sets by means of memory-mapped storage. Performance: state-of-the-art results in high-dimensional multi-output problems. Usability and modularity: easy to use and to extend.

GURLS is based on Regularized Least Squares (RLS) and takes advantage of the favorable properties of these methods (Rifkin et al., 2003). Since the algorithm reduces to solving a linear system, GURLS is set up to exploit powerful tools, and recent advances, in linear algebra (including randomized solvers, first-order methods, etc.). Second, it makes use of RLS properties that are particularly suited for high-dimensional learning, for example: (1) RLS has a natural primal and dual formulation (hence its complexity is governed by the smaller of the number of examples and the number of features); (2) efficient parameter selection (closed-form expression of the leave-one-out error and efficient computation of the regularization path); (3) a natural and efficient extension to multiple outputs. Specific attention has been devoted to handling large, high-dimensional data sets. We rely on data structures that can be serialized using memory-mapped files, and on a distributed task manager that performs a number of key steps (such as matrix multiplication) without loading the whole data set in memory. Efforts were devoted to providing a lean API and exhaustive documentation. GURLS has been deployed and tested successfully on Linux, MacOS and Windows.
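As a rough illustration of point (1) above—the primal/dual formulation of RLS whose cost is governed by min(n, d)—a minimal sketch follows. It is not the GURLS implementation; the function name and the n*lam scaling of the regularizer are one common convention, assumed here for concreteness.

```python
import numpy as np

def rls_fit(X, y, lam):
    """Regularized least squares with automatic primal/dual choice.

    Solves min_w ||X w - y||^2 / n + lam ||w||^2 via the d x d (primal)
    or n x n (dual) linear system, whichever is smaller.
    """
    n, d = X.shape
    if d <= n:
        # primal: w = (X^T X + n*lam*I_d)^{-1} X^T y
        A = X.T @ X + n * lam * np.eye(d)
        w = np.linalg.solve(A, X.T @ y)
    else:
        # dual: alpha = (X X^T + n*lam*I_n)^{-1} y, then w = X^T alpha
        K = X @ X.T + n * lam * np.eye(n)
        w = X.T @ np.linalg.solve(K, y)
    return w
```

With many more features than examples, the n x n dual system is solved instead of the d x d primal one, which is what keeps kernelized and very high-dimensional problems tractable.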
The library is distributed under the simplified BSD license, and can be downloaded from https://github.com/LCSL/GURLS.

2. Description of the Library

The library comprises four main modules. GURLS and bGURLS—both implemented in Matlab—are aimed at solving learning problems with small/medium and large-scale data sets respectively. GURLS++ and bGURLS++ are their C++ counterparts. The Matlab and C++ versions share the same design, but the C++ modules have significant improvements that make them faster and more flexible. The specification of the desired machine learning experiment in the library is straightforward: it is a formal description of a pipeline, that is, an ordered sequence of steps. Each step identifies an actual learning task and belongs to a predefined category. The core of the library is a method (a class in the C++ implementation) called GURLScore, which is responsible for processing the sequence of tasks in the proper order and for linking the output of each task to the input of the subsequent one. A key role is played by the additional "options" structure, referred to as OPT. OPT is used to store all configuration parameters required to customize the behavior of individual tasks in the pipeline. Tasks receive configuration parameters from OPT in read-only mode and—upon termination—the results are appended to the structure by GURLScore in order to make them available to subsequent tasks. This allows the user to skip the execution of some tasks in a pipeline by simply inserting the desired results directly into the options structure. Currently, we identify six different task categories: data set splitting, kernel computation, model selection, training, evaluation and testing, and performance assessment and analysis. Tasks belonging to the same category may be interchanged with each other.

2.1 Learning From Large Data Sets

Two modules in GURLS have been specifically designed to deal with big data scenarios. The approach we adopted is mainly based on a memory-mapped abstraction of matrix and vector data structures, and on distributed computation of a number of standard problems in linear algebra. For learning on big data, we decided to focus specifically on those situations where one seeks a linear model on a large set of (possibly nonlinear) features. A more accurate specification of what "large" means in GURLS is related to the number of features d and the number of training examples n: we require that it be possible to store a min(d, n) × min(d, n) matrix in memory. In practice, this roughly means we can train models with up to 25k features on machines with 8GB of RAM, and up to 50k features on machines with 36GB of RAM. We do not require the data matrix itself to be stored in memory: within GURLS it is possible to manage an arbitrarily large set of training examples. We distinguish two different scenarios. Data sets that can fully reside in RAM without any memory mapping techniques—such as swapping—are considered to be small/medium. Larger data sets are considered to be "big" and learning must be performed using either bGURLS or bGURLS++. These two modules include all the design patterns described above, and have been complemented with additional big data and distributed computation capabilities.

Table 1: Data sets description.

  data set   # of samples   # of classes   # of variables
  optdigit   3800           10             64
  landsat    4400           6              36
  pendigit   7400           10             16
  letter     10000          26             16
  isolet     6200           26             600
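The memory-mapped abstraction described in Section 2.1 above can be illustrated with a toy chunked computation: accumulating X^T X from a matrix that lives on disk, touching only a block of rows at a time. This is only a rough analogue of the bigarray/GDM machinery, not GURLS code; the raw-binary file layout, dtype and chunk size are assumptions.

```python
import numpy as np

def chunked_gram(path, n, d, chunk=10000, dtype=np.float32):
    """Accumulate X^T X from a memory-mapped (n, d) design matrix.

    Only 'chunk' rows are resident in memory at any time, so the full data
    matrix never has to fit in RAM.
    """
    X = np.memmap(path, dtype=dtype, mode="r", shape=(n, d))
    G = np.zeros((d, d), dtype=np.float64)
    for start in range(0, n, chunk):
        block = np.asarray(X[start:start + chunk], dtype=np.float64)
        G += block.T @ block          # this block's contribution to the Gram matrix
    return G
```

The same pattern extends to distributing the per-block products over several processes and summing the partial results, which is in the spirit of the distributed matrix multiplications mentioned above.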
Big data support is obtained using a data structure called bigarray, which makes it possible to handle data matrices as large as the space available on the hard drive: we store the entire data set on disk and load only small chunks into memory when required. There are some differences between the Matlab and C++ implementations. bGURLS relies on a simple, ad hoc interface, called GURLS Distributed Manager (GDM), to distribute matrix-matrix multiplications, thus allowing users to perform the important task of kernel matrix computation on a distributed network of computing nodes. After this step, the subsequent tasks behave as in GURLS. bGURLS++ (currently in active development) offers more interesting features because it is based on the MPI libraries. Therefore, it allows for full distribution within every single task of the pipeline. All the processes read the input data from a shared filesystem over the network and then start executing the same pipeline. During execution, each process' task communicates with the corresponding ones. Every process maintains its local copy of the options. Once the same task is completed by all processes, the local copies of the options are synchronized. This architecture allows for the creation of hybrid pipelines comprising serial one-process-based tasks from GURLS++.

3. Experiments

We decided to focus the experimental analysis in the paper on the assessment of GURLS' performance both in terms of accuracy and time. In our experiments we considered 5 popular data sets, briefly described in Table 1. Experiments were run on an Intel Xeon 5140 @ 2.33GHz processor with 8GB of RAM, running Ubuntu 8.10 Server (64 bit).

Table 2: Comparison between GURLS and LS-SVM.

                                 optdigit               landsat                pendigit
                                 accuracy (%)  time (s) accuracy (%)  time (s) accuracy (%)  time (s)
  GURLS (linear primal)          92.3          0.49     63.68         0.22     82.24         0.23
  GURLS (linear dual)            92.3          726      66.3          1148     82.46         5590
  LS-SVM linear                  92.3          7190     64.6          6526     82.3          46240
  GURLS (500 random features)    96.8          25.6     63.5          28.0     96.7          31.6
  GURLS (1000 random features)   97.5          207      63.5          187      95.8          199
  GURLS (Gaussian kernel)        98.3          13500    90.4          20796    98.4          100600
  LS-SVM (Gaussian kernel)       98.3          26100    90.51         18430    98.36         120170

[Figure 1: Prediction accuracy vs. computing time, comparing LIBSVM:rbf, GURLS:rbf, GURLS++:rbf, GURLS:randomfeatures-500 and GURLS:randomfeatures-1000 on the isolet, letter, pendigit, landsat and optdigit data sets. The color represents the training method and the library used. In blue: the Matlab implementation of RLS with RBF kernel; in red: its C++ counterpart. In dark red: results of LIBSVM with RBF kernel. In yellow and green: results obtained using a linear kernel on 500 and 1000 random features respectively.]

We set up different pipelines and compared the performance to SVM, for which we used the Python modular interface to LIBSVM (Chang and Lin, 2011). Automatic selection of the optimal regularization parameter is implemented identically in all experiments: (i) split the data; (ii) define a set of regularization parameters on a regular grid; (iii) perform hold-out validation. The variance of the Gaussian kernel has been fixed by looking at the statistics of the pairwise distances among training examples. The prediction accuracy of GURLS and GURLS++ is identical—as expected—but the C++ implementation is significantly faster. The prediction accuracy of standard RLS-based methods is in many cases higher than that of SVM.
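A minimal sketch of the three-step parameter selection just described (split the data, sweep a grid, keep the value with the lowest hold-out error). The split ratio, error metric, primal-only solver and grid are illustrative assumptions, not the GURLS defaults.

```python
import numpy as np

def rls_primal(X, y, lam):
    """Primal regularized least squares (same objective as the earlier sketch)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def select_lambda(X, y, lambdas, holdout=0.2, seed=0):
    """Hold-out selection of the RLS regularization parameter."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_val = int(holdout * X.shape[0])
    val, tr = perm[:n_val], perm[n_val:]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = rls_primal(X[tr], y[tr], lam)
        err = np.mean((X[val] @ w - y[val]) ** 2)   # validation mean squared error
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

A typical call would sweep a log-spaced grid, for example select_lambda(X, y, np.logspace(-6, 0, 25)).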
Exploiting the primal formulation of RLS, we further ran experiments with the random features approximation (Rahimi and Recht, 2008). As shown in Figure 1, the performance of this method is comparable to that of SVM at a much lower computational cost on the majority of the tested data sets. We further compared GURLS with another available least-squares-based toolbox: the LS-SVM toolbox (Suykens et al., 2001), which includes routines for parameter selection such as coupled simulated annealing and line/grid search. The goal of this experiment is to benchmark the performance of the parameter selection with random data splitting included in GURLS. For a fair comparison, we considered only the Matlab implementation of GURLS. Results are reported in Table 2. As expected, using the linear kernel with the primal formulation—not available in LS-SVM—is the fastest approach, since it leverages the lower dimensionality of the input space. When the Gaussian kernel is used, GURLS and LS-SVM have comparable computing time and classification performance. Note, however, that in GURLS the number of parameters in the grid search is fixed at 400, while in LS-SVM it may vary and is limited to 70. The results obtained with the random features implementation make GURLS an interesting choice in many applications. Finally, all GURLS pipelines, in their Matlab implementation, are faster than LS-SVM, and further improvements can be achieved with GURLS++.

Acknowledgments

We thank Tomaso Poggio, Zak Stone, Nicolas Pinto, Hristo S. Paskov and CBCL for comments and insights.

References

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, volume 21, pages 1313–1320, 2008.

R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003.

J. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2001. ISBN 981-238-151-1.
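To make the random features approximation used in the GURLS entry above concrete, here is a hedged sketch of random Fourier features in the sense of Rahimi and Recht (2008). The kernel-width parameter gamma, the feature count and the function name are illustrative, not the values used in the experiments.

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
    """Map X to random Fourier features approximating a Gaussian kernel.

    z(x) = sqrt(2/D) * cos(W^T x + b), with the rows of W drawn from
    N(0, 2*gamma*I) and b ~ Uniform[0, 2*pi], so that z(x)^T z(x')
    approximates exp(-gamma * ||x - x'||^2).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

For example, Z = random_fourier_features(X, 500, gamma) followed by the primal rls_fit(Z, y, lam) from the earlier sketch approximates Gaussian-kernel RLS at roughly the cost of a 500-dimensional linear problem, which is the effect visible in the timing columns of Table 2.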

5 0.4166255 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

Author: Edward Challis, David Barber

Abstract: We investigate Gaussian Kullback-Leibler (G-KL) variational approximate inference techniques for Bayesian generalised linear models and various extensions. In particular we make the following novel contributions: sufficient conditions for which the G-KL objective is differentiable and convex are described; constrained parameterisations of Gaussian covariance that make G-KL methods fast and scalable are provided; the lower bound to the normalisation constant provided by G-KL methods is proven to dominate those provided by local lower bounding methods; complexity and model applicability issues of G-KL versus other Gaussian approximate inference methods are discussed. Numerical results comparing G-KL and other deterministic Gaussian approximate inference methods are presented for: robust Gaussian process regression models with either Student-t or Laplace likelihoods, large scale Bayesian binary logistic regression models, and Bayesian sparse linear models for sequential experimental design. Keywords: generalised linear models, latent linear models, variational approximate inference, large scale inference, sparse learning, experimental design, active learning, Gaussian processes
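For models with a Gaussian prior, a G-KL-style bound can be written as an expected log-likelihood term minus a KL divergence between the Gaussian approximation and the Gaussian prior. As a small generic building block (not code from the paper), the closed-form KL divergence between two multivariate Gaussians:

```python
import numpy as np

def kl_gaussians(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians.

    Standard closed-form expression:
    0.5 * [ tr(S1^{-1} S0) + (mu1-mu0)^T S1^{-1} (mu1-mu0) - k + ln(|S1|/|S0|) ].
    """
    k = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + logdet1 - logdet0)
```

The constrained covariance parameterisations mentioned in the abstract (for scalability) would enter through structured forms of S0, for example diagonal or low-rank-plus-diagonal, which make the trace and log-determinant terms cheap to evaluate.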

6 0.39675874 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

7 0.39148802 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

8 0.38772231 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

9 0.37842962 32 jmlr-2013-Differential Privacy for Functions and Functional Data

10 0.37751281 73 jmlr-2013-Multicategory Large-Margin Unified Machines

11 0.37733248 120 jmlr-2013-Variational Algorithms for Marginal MAP

12 0.37645847 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

13 0.37352669 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

14 0.37298888 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

15 0.36917263 15 jmlr-2013-Bayesian Canonical Correlation Analysis

16 0.36878848 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

17 0.3659429 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

18 0.36569816 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

19 0.36445847 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

20 0.36431491 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems