jmlr jmlr2013 jmlr2013-74 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. [sent-6, score-0.144]
2 We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. [sent-8, score-0.425]
3 We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. [sent-10, score-0.425]
4 Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model. [sent-11, score-0.298]
5 Convex regression is easily extended to concave regression since a concave function is the negative of a convex function. [sent-19, score-0.299]
6 Although convex regression has been well explored in the univariate setting, the literature remains underdeveloped in the multivariate setting. [sent-37, score-0.139]
7 Methods where an objective function is constrained to the set of convex functions through supporting hyperplane constraints for each pair of observations (Hildreth, 1954; Holloway, 1979; Kuosmanen, 2008; Seijo and Sen, 2011; Lim and Glynn, 2012; Allon et al. [sent-38, score-0.127]
8 Refitting a series of hyperplanes can be done in a frequentist (Magnani and Boyd, 2009) or Bayesian (Hannah and Dunson, 2011) manner. [sent-44, score-0.18]
9 In this paper, we introduce the first computationally efficient and theoretically sound multivariate convex regression method: convex adaptive partitioning (CAP). [sent-47, score-0.365]
10 It fits a series of hyperplanes to the data through adaptive partitioning. [sent-48, score-0.192]
11 The CAP estimator is formed by adaptively partitioning a set of observations in a method similar to trees with linear leaves (Chaudhuri et al. [sent-52, score-0.178]
12 Given a partition with K subsets and linear models, $(\alpha_k, \beta_k)_{k=1}^{K}$, a continuous, convex (concave) function is then generated by taking the maximum (minimum) over the hyperplanes, $f_n(x) = \max_{k=1,\dots,K} \alpha_k + \beta_k^T x$. [sent-55, score-0.342]
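The max-of-hyperplanes form above is simple to evaluate. Below is a minimal illustrative sketch (not the authors' implementation; the array names are assumptions) showing how a fitted set of intercepts and slopes defines a globally convex, or concave, predictor:

    import numpy as np

    def cap_predict(X, alpha, beta, concave=False):
        # X: (n, d) covariates; alpha: (K,) intercepts; beta: (K, d) slopes.
        # Each column k of `planes` holds alpha_k + beta_k^T x for every row of X.
        planes = alpha[None, :] + X @ beta.T            # shape (n, K)
        # Convex fit: pointwise maximum over hyperplanes; concave fit: minimum.
        return planes.min(axis=1) if concave else planes.max(axis=1)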
13 Then, the hyperplanes themselves are used to refit the subsets. [sent-61, score-0.149]
14 The refitting step places the hyperplanes in closer alignment with the observations that generated them. [sent-63, score-0.175]
15 Literature Review. The literature for convex regression is scattered throughout a variety of fields, including statistics, operations research, economics, numerical analysis, and electrical engineering. [sent-70, score-0.139]
16 The oldest and simplest solution method is the least squares estimator (LSE), which produces a piecewise linear estimator by solving a quadratic program with a least squares objective function subject to the constraints in Equation (2) (Hildreth, 1954; Dent, 1973). [sent-81, score-0.213]
17 Due to the constraint on the derivative of f0 , univariate convex regression is quite similar to univariate isotonic regression; see Brunk (1955), Hall and Huang (2001), Neelon and Dunson (2004) and Shively et al. [sent-94, score-0.17]
18 The multivariate least squares estimator (Hildreth, 1954; Holloway, 1979) solves the quadratic program $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (3) subject to $\hat{y}_j \ge \hat{y}_i + g_i^T (x_j - x_i)$, $i, j = 1, \dots, n$. [sent-98, score-0.114]
19 The LSE estimator $\hat{f}_n^{\mathrm{LSE}}$ is piecewise linear, $\hat{f}_n^{\mathrm{LSE}}(x) = \max_{i=1,\dots,n} \hat{y}_i + g_i^T (x - x_i)$. [sent-103, score-0.22]
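As a rough illustration of the size of this problem, the quadratic program in Equation (3) can be written in a generic convex solver. The sketch below uses the Python package cvxpy rather than the MATLAB cvx package used in the paper's experiments, and the function name is an assumption:

    import numpy as np
    import cvxpy as cp

    def convex_lse(X, y):
        # Least squares estimator: fitted values yhat and subgradients g at each point,
        # subject to supporting-hyperplane constraints yhat_j >= yhat_i + g_i^T (x_j - x_i).
        n, d = X.shape
        yhat = cp.Variable(n)
        g = cp.Variable((n, d))
        constraints = [
            yhat[j] >= yhat[i] + cp.sum(cp.multiply(g[i], X[j] - X[i]))
            for i in range(n) for j in range(n) if i != j
        ]
        prob = cp.Problem(cp.Minimize(cp.sum_squares(y - yhat)), constraints)
        prob.solve()  # O(n^2) constraints, which is why the LSE does not scale past a few thousand points
        return yhat.value, g.value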
20 Recently, multivariate convex regression methods have been proposed with different approaches. [sent-118, score-0.139]
21 Then the convex hull of the smoothed estimate was used as a convex estimator. [sent-122, score-0.15]
22 This new function induced a partition over the covariate space, which generated a new collection of K subsets. [sent-130, score-0.126]
23 We do this by partitioning the covariate space and approximating the gradients within each region by hyperplanes generated by the least squares estimator. [sent-136, score-0.358]
24 The covariate space partition and K are chosen through adaptive partitioning. [sent-137, score-0.169]
25 Adaptive partitioning models with linear leaves have been proposed before; see Chaudhuri et al. [sent-146, score-0.108]
26 In most of these cases, the partition is created by adaptively refining an existing partition by dyadic splitting of one subset along one dimension. [sent-150, score-0.167]
27 We can use this new partition to refit the hyperplanes and produce a significantly better estimate. [sent-174, score-0.215]
28 Refitting hyperplanes in this manner can be viewed as a Gauss-Newton method for the nonlinear least squares problem (Magnani and Boyd, 2009), $\text{minimize } \sum_{i=1}^{n} \left( y_i - \max_{k \in \{1,\dots,K\}} (\alpha_k + \beta_k^T x_i) \right)^2$. [sent-176, score-0.19]
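A single refitting pass of this kind is easy to sketch; the snippet below is an illustration of the general idea rather than the authors' code (helper and variable names are assumptions): assign each observation to the hyperplane that attains the maximum at that point, then re-estimate each hyperplane by ordinary least squares on its own subset.

    import numpy as np

    def refit_hyperplanes(X, y, alpha, beta):
        # Assign each observation to its active hyperplane (the argmax in the current fit).
        planes = alpha[None, :] + X @ beta.T              # (n, K)
        assign = planes.argmax(axis=1)
        new_alpha, new_beta = alpha.copy(), beta.copy()
        for k in range(len(alpha)):
            idx = np.where(assign == k)[0]
            if idx.size < X.shape[1] + 1:                 # too few points to refit this plane
                continue
            A = np.column_stack([np.ones(idx.size), X[idx]])
            coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
            new_alpha[k], new_beta[k] = coef[0], coef[1:]
        return new_alpha, new_beta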
29 Similar methods for refitting hyperplanes have been proposed in Breiman (1993) and Magnani and Boyd (2009). [sent-180, score-0.149]
30 Convex adaptive partitioning uses adaptive partitioning with linear leaves to fit a convex function that is defined as the maximum over the set of leaves. [sent-182, score-0.377]
31 Although simple, these rules, and refitting in particular, produce large gains over naive adaptive partitioning methods; empirical results are discussed in Section 6. [sent-189, score-0.151]
32 Most other adaptive partitioning methods use backfitting or pruning to select the tree or partition size. [sent-190, score-0.217]
33 The Algorithm. We now introduce some notation required for convex adaptive partitioning. [sent-195, score-0.118]
34 When presented with data, a partition can be defined over the covariate space (denoted by $\{A_1, \dots, A_K\}$). [sent-196, score-0.126]
35 The observation partition is defined from the covariate partition, $C_k = \{i : x_i \in A_k\}$, $k = 1, \dots, K$. [sent-206, score-0.155]
36 A model $M_K$ is defined by: 1) the covariate partition $\{A_1, \dots, A_K\}$, [sent-215, score-0.126]
37 2) the observation partition $\{C_1, \dots, C_K\}$, and 3) the hyperplanes $(\alpha_j, \beta_j)_{j=1}^{K}$ fit to those partitions. [sent-221, score-0.149]
38 The CAP algorithm progressively refines the partition until each subset cannot be split without one subset having fewer than a minimal number of observations, nmin . [sent-222, score-0.159]
39 Therefore, we choose a conservative value for nmin, one that admits logarithmic partition growth. [sent-227, score-0.19]
40 Use the partition induced by the hyperplanes to generate model MK . [sent-251, score-0.215]
41 We create models for every subset and search along every cardinal direction by splitting the data along that direction. [sent-266, score-0.282]
42 For a fixed dimension j and subset k, let $x_{\min}^{jk}$ be the minimum value and $x_{\max}^{jk}$ be the maximum value of the covariates in this subset and dimension. [sent-267, score-0.121]
43 Let $0 < a_1 < \cdots < a_L < 1$ be a set of evenly spaced knots that represent the proportion between $x_{\min}^{jk}$ and $x_{\max}^{jk}$. [sent-268, score-0.363]
44 Set $x_{\min}^{jk} = \min\{x_{ij} : i \in C_k\}$ and $x_{\max}^{jk} = \max\{x_{ij} : i \in C_k\}$; use the weighted average $b_{jk\ell} = a_\ell\, x_{\min}^{jk} + (1 - a_\ell)\, x_{\max}^{jk}$ to split $C_k$ and $A_k$ in dimension j. [sent-277, score-0.341]
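A hedged sketch of this split-candidate generation (illustrative only; the names are assumptions) makes the knot construction concrete:

    import numpy as np

    def candidate_splits(X_k, j, L=10):
        # Split points for subset k along dimension j, at evenly spaced proportions
        # a_1 < ... < a_L strictly inside (0, 1) between the subset's min and max.
        x_min, x_max = X_k[:, j].min(), X_k[:, j].max()
        a = np.linspace(0.0, 1.0, L + 2)[1:-1]
        return a * x_min + (1.0 - a) * x_max              # b_{jkl} = a_l x_min + (1 - a_l) x_max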
45 Figure 2: (A) The original observation partition C for $M_2$, (B) new splits generated from the subset $C_1$, and (C) new splits generated from the subset $C_2$. [sent-279, score-0.122]
46 Fit hyperplanes $(\alpha_k, \beta_k)_{k=1}^{K+1}$ in each of the subsets. [sent-283, score-0.149]
47 The triplet of observation partition $C'_{1:K+1}$, covariate partition $A'_{1:K+1}$, and set of hyperplanes $(\alpha'_k, \beta'_k)_{k=1}^{K+1}$ defines the candidate model $M'_{jk\ell}$. [sent-284, score-0.275]
48 Let $(\alpha_i^{jk\ell}, \beta_i^{jk\ell})$ be the hyperplanes associated with $M'_{jk\ell}$ and let $\hat{f}_{jk\ell}(x) = \max_i\, \alpha_i^{jk\ell} + (\beta_i^{jk\ell})^{T} x$. [sent-300, score-0.149]
49 Let (α1:K , β1:K ) be the hyperplanes associated with MK . [sent-309, score-0.149]
50 Fit hyperplanes in each of those subsets. [sent-315, score-0.149]
51 Consistency. Consistency for CAP can be shown in a manner similar to consistency for other adaptive partitioning models, like CART (Breiman et al. [sent-346, score-0.151]
52 We take a two-step approach: first we show consistency for the mean function and first derivatives of a more traditional treed linear model based on CAP under the $\ell_\infty$ metric, and then we use that result to show consistency for the CAP estimator itself. [sent-350, score-0.129]
53 Letting $M_n^*$ be the model for the CAP estimate after n observations, define the discontinuous piecewise linear estimate based on $M_n^*$, $f_n^*(x) = \sum_{k=1}^{K_n} (\alpha_k + \beta_k^T x)\, 1_{\{x \in A_k\}}$, where $K_n$ is the partition size and $A_1, \dots, A_{K_n}$ are the subsets of the covariate partition. [sent-351, score-0.161]
54 Let $f_n(x)$ be the CAP estimator based on $M_n^*$, $f_n(x) = \max_{k \in \{1,\dots,K_n\}} \alpha_k + \beta_k^T x$. [sent-355, score-0.148]
55 Each subset $A_k$ has an associated diameter, $d_{nk} = \sup_{x_1, x_2 \in A_k} \|x_1 - x_2\|_2$. [sent-359, score-0.101]
56 The diameter of the partition, $\max_k d_{nk}$, converges to 0 in probability as $n \to \infty$. [sent-381, score-0.167]
57 To show consistency of $f_n$ under the $\ell_\infty$ metric, we first show consistency of $f_n^*$ and its derivatives under the $\ell_\infty$ metric in Theorem 1. [sent-397, score-0.104]
58 $C_1, \dots, C_K$, the partition, and the hyperplanes $(\alpha_k, \beta_k)_{k=1}^{K}$, which were generated by the partition. [sent-436, score-0.144]
59 Let $(\alpha_k^{(-i)}, \beta_k^{(-i)})_{k=1}^{K}$ be the collection of hyperplanes generated when observation i is removed; notice that if $i \in C_k$, only $(\alpha_k, \beta_k)$ changes. [sent-437, score-0.149]
60 Empirical Analysis. We compare shape-constrained and unconstrained regression methods across a set of convex regression problems: two synthetic regression problems, predicting mean weekly wages, and value function approximation for pricing basket options. [sent-471, score-0.696]
61 Synthetic Regression Problems. We apply CAP to two synthetic regression problems to demonstrate predictive performance and analyze sensitivity to tunable parameters. [sent-473, score-0.127]
62 We implemented the following shape-constrained algorithms: the least squares estimator (LSE) using cvx (Grant and Boyd, 2012, 2008), and the linear refitting algorithm of Magnani and Boyd (2009). [sent-506, score-0.173]
63 Non-convex regression methods performed poorly compared to shape restricted methods, particularly in the higher noise setting. [sent-514, score-0.106]
64 They can be easily modified to produce a convex regression estimator by taking the maximum over the linear leaves. [sent-527, score-0.183]
65 CAP differs from existing treed linear models in how the partition is refined. [sent-528, score-0.151]
66 Predicting Weekly Wages. We use shape-restricted methods to predict mean weekly wages based on years of education and experience. [sent-547, score-0.418]
67 The data set contains 25,361 records of weekly wages for full-time, adult, male workers for 1987, along with years of experience, years of education, race (either black or white; no others were included in the sample), region, and whether the last job held was part time. [sent-549, score-0.37]
68 A reasonable assumption for wages is that they are concave in years of experience. [sent-550, score-0.202]
69 Indeed, this pattern is seen in Figure 7 when we compare average weekly wages against experience. [sent-552, score-0.248]
70 Wages can also be expected to increase based on education level, but not in a concave or convex fashion. [sent-553, score-0.19]
71 Mean squared error (log scale) plus/minus one standard error for CAP and treed linear models with: local split selection, no refitting; local split selection, with refitting; and global split selection, no refitting. [sent-569, score-0.178]
72 In the SVM surface, someone with 0 years of experience and 0 years of education is predicted to have about a 150% larger weekly wage than a high school graduate with 0 years of experience and about the same weekly wage as someone with a 4-year college degree and 0 years of experience. [sent-617, score-0.803]
73 Averaging randomized convex estimators produces a new convex estimator; these methods have been explored for approximating objective functions in Hannah and Dunson (2012). [sent-621, score-0.15]
74 Often value functions are known to be convex or concave in the state variable; this is common in options pricing, portfolio optimization and logistics problems. [sent-628, score-0.168]
75 To give a simple example for value function approximation, we consider pricing American basket options on the average of M underlying assets. [sent-632, score-0.158]
76 A popular method for pricing American options uses approximate dynamic programming where continuation values are approximated via regression (Carriere, 1996; Tsitsiklis and Van Roy, 1999, 2001; Longstaff and Schwartz, 2001). [sent-636, score-0.213]
77 The regression values are used to create a policy that is implemented on a test set: exercise when the current exercise value is greater than or equal to the estimated continuation value. [sent-668, score-0.102]
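The exercise rule itself is a one-line comparison once a continuation-value regressor is available. The sketch below is a simplified illustration of that step, not the paper's pricing code; `continuation_model` stands for any fitted regressor with a `.predict` method (for example a CAP fit or ridge regression on basis functions), and `payoff` is an assumed payoff function.

    import numpy as np

    def exercise_now(X_t, payoff, continuation_model):
        # X_t: (n_paths, M) asset prices at the current time step.
        immediate = payoff(X_t)                         # value of exercising now
        continuation = continuation_model.predict(X_t)  # estimated continuation value
        return immediate >= continuation                # boolean exercise decision per path

    # Example payoff: a basket put on the average of M assets (strike chosen arbitrarily here).
    basket_put = lambda X: np.maximum(100.0 - X.mean(axis=1), 0.0)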
78 In previous literature, $\{C_t(X_{tj})\}_{j=1}^{N}$ has been estimated by regression splines for a single underlying asset (Carriere, 1996), or by least squares linear regression on a set of basis functions (Tsitsiklis and Van Roy, 1999; Longstaff and Schwartz, 2001; Glasserman, 2004). [sent-670, score-0.217]
79 Since the expected continuation values are convex in the asset price for basket options, CAP is a simple, nonparametric alternative to these methods. [sent-673, score-0.237]
80 …, $M$, $j \neq i$; ridge regression on the same basis functions with the ridge parameter chosen by 10-fold cross-validation at each time period from values between $10^{-3}$ and $10^{5}$. [sent-677, score-0.118]
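A hedged sketch of that ridge baseline (an assumed implementation using scikit-learn, not the authors' code) is:

    import numpy as np
    from sklearn.linear_model import RidgeCV

    # Stand-ins for one time period's regression: rows are simulated paths,
    # columns are basis functions of the asset prices (illustrative random data).
    rng = np.random.default_rng(0)
    basis = rng.standard_normal((1000, 20))
    continuation = rng.standard_normal(1000)

    # Ridge parameter chosen by 10-fold cross-validation from values between 1e-3 and 1e5.
    ridge = RidgeCV(alphas=np.logspace(-3, 5, 25), cv=10).fit(basis, continuation)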
81 We observed a decline in the performance of least squares, due to overfitting, as the number of assets grew. [sent-690, score-0.107]
82 Ridge regularization greatly improved the least squares performance as the number of assets grew. [sent-691, score-0.107]
83 This is likely because the number of covariates in the least squares regression grew like $M^2$, while all linear regressions in CAP only had M covariates. [sent-696, score-0.132]
84 Conclusions. In this article, we presented convex adaptive partitioning, a computationally efficient, theoretically sound, and empirically robust method for regression subject to a convexity constraint. [sent-698, score-0.209]
85 CAP is the first convex regression method to scale to large problems, both in terms of dimensions and number of observations. [sent-699, score-0.139]
86 Second, CAP can be extended to other shape-constrained settings, such as monotone, concave, and semi-convex functions. [sent-706, score-0.116]
87 Variants of CAP could be combined with other nonparametric methods like kernels to produce efficient inference methods for partially convex functions. [sent-710, score-0.128]
88 Proof [Proof of Theorem 1] It is sufficient to show that $\max_{k=1,\dots,K_n} \big\| d_{nk}^{-1}[\alpha_k, \beta_k]^t - \Delta(\bar{x}_k) \big\| \to 0$ [sent-730, score-0.101]
89 in probability, where $\Delta(\bar{x}_k)$ is the vector of elements $\left[ f_0(\bar{x}_k),\ d_{nk}^{-1}\tfrac{\partial}{\partial x_1} f_0(\bar{x}_k),\ \dots \right]$. [sent-733, score-0.101]
90 $\max_{k=1,\dots,K_n} d_{nk}\, D^{-1} \sum_{i \in C_k} \Gamma_i\, r(x_i - \bar{x}_k) \to 0$ in probability as $n \to \infty$. [sent-751, score-0.104]
91 (1984) to each component of $d_{nk}^{-1}|C_k|^{-1} \sum_{i \in C_k} \Gamma_i \varepsilon_i$, there exist constants $h_1 > 0$, $h_2 > 0$ and $\gamma_0 > 0$ such that $P\left( \big| d_{nk}^{-1}|C_k|^{-1} \sum_{i \in C_k} \Gamma_i \varepsilon_i \big| > \gamma \right) \le h_1 e^{-h_2 d_{nk}^2 |C_k| \gamma^2}$ (7), whenever $\gamma \le \gamma_0$. [sent-755, score-0.303]
92 For $m_n$ such that $m_n/\log(n) \to \infty$, we use Equation (7) to show $P\left( \big\| [\alpha_k, \beta_k]^t - \Delta(\bar{x}_k) \big\| \ge \gamma \,\big|\, x_{1:n} \right) \le h_1 e^{-h_2 \gamma^2 n\, d_{nk}^2 |C_k| / n} \le h_1 e^{-h_2 \gamma^2 m_n \log(n)} = h_1 n^{-h_2 \gamma^2 m_n}$ on the event $d_{nk}^2 |C_k| / n \ge m_n \log(n)/n$. [sent-761, score-0.261]
93 Using the VC dimension of the partition, $P\left( \big\| [\alpha_k, \beta_k]^t - \Delta(\bar{x}_k) \big\| \ge \gamma \text{ for each } A_k \text{ and } d_{nk}^2 |C_k| \ge m_n D \log(n) \right) \le h_1\, 2^{1+K_n(p+2)}\, n^{K_n(p+2) - h_2 \gamma^2 m_n}$. [sent-762, score-0.133]
94 Then new hyperplanes are fit to each of the new subsets. [sent-819, score-0.149]
95 Since D controls the maximum partition size, Kn = D log(n), and a linear regression is fit K log(K) times, the expected increase in the runtime should only be O (D log(D)). [sent-851, score-0.19]
96 Rates of convergence for multivariate convex regression have only been studied in two articles of which we are aware. [sent-881, score-0.139]
97 (2011) studied rates of convergence for an estimator that is created by first smoothing the data, then evaluating the smoothed data over an ε-net, and finally convexifying the net of smoothed data by taking the convex hull. [sent-883, score-0.143]
98 Hannah and Dunson (2011) showed adaptive rates for a Bayesian model that places a prior over the set of all piecewise linear functions. [sent-886, score-0.11]
99 Ensemble methods for convex regression with applications to geometric programming based circuit design. [sent-1044, score-0.139]
100 Nonparametric least squares estimation of a multivariate convex regression function. [sent-1156, score-0.18]
wordName wordTfidf (topN-words)
[('cap', 0.707), ('magnani', 0.179), ('ck', 0.174), ('weekly', 0.155), ('hyperplanes', 0.149), ('annah', 0.132), ('unson', 0.132), ('ultivariate', 0.124), ('daptive', 0.113), ('partitioning', 0.108), ('jk', 0.108), ('gcv', 0.108), ('onvex', 0.102), ('dnk', 0.101), ('cart', 0.1), ('mk', 0.094), ('mars', 0.093), ('wages', 0.093), ('boyd', 0.089), ('treed', 0.085), ('egression', 0.083), ('convex', 0.075), ('hannah', 0.073), ('ak', 0.071), ('aguilera', 0.07), ('chaudhuri', 0.068), ('education', 0.067), ('assets', 0.066), ('lse', 0.066), ('pricing', 0.066), ('partition', 0.066), ('regression', 0.064), ('nmin', 0.062), ('years', 0.061), ('covariate', 0.06), ('runtime', 0.06), ('runtimes', 0.06), ('tting', 0.056), ('nonparametric', 0.053), ('knots', 0.053), ('fn', 0.052), ('concave', 0.048), ('wage', 0.047), ('xmax', 0.047), ('xmin', 0.047), ('basket', 0.047), ('dunson', 0.046), ('options', 0.045), ('experience', 0.044), ('estimator', 0.044), ('adaptive', 0.043), ('piecewise', 0.043), ('shape', 0.042), ('breiman', 0.042), ('squares', 0.041), ('fit', 0.04), ('kn', 0.04), ('continuation', 0.038), ('fast', 0.036), ('slopes', 0.036), ('stephen', 0.035), ('gt', 0.035), ('xt', 0.035), ('splitting', 0.035), ('lauren', 0.033), ('tunable', 0.033), ('roy', 0.033), ('mn', 0.032), ('allon', 0.031), ('cardinal', 0.031), ('cule', 0.031), ('fastcap', 0.031), ('frequentist', 0.031), ('isotonic', 0.031), ('kuosmanen', 0.031), ('salary', 0.031), ('shively', 0.031), ('econometrics', 0.031), ('split', 0.031), ('predictive', 0.03), ('tsitsiklis', 0.03), ('re', 0.029), ('thousand', 0.029), ('xi', 0.029), ('splits', 0.028), ('seconds', 0.028), ('rmse', 0.028), ('convexity', 0.027), ('covariates', 0.027), ('ridge', 0.027), ('constrained', 0.026), ('observations', 0.026), ('log', 0.026), ('rates', 0.024), ('splines', 0.024), ('duke', 0.024), ('asset', 0.024), ('svm', 0.024), ('stopping', 0.023), ('glasserman', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000018 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
2 0.078606822 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
3 0.061953545 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
Author: Wei Pan, Xiaotong Shen, Binghui Liu
Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)
4 0.043342084 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels
Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule
5 0.041721266 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.04071492 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
7 0.040124603 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
8 0.040017933 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
9 0.037764251 106 jmlr-2013-Stationary-Sparse Causality Network Learning
10 0.034875993 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
11 0.034856148 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
12 0.034449559 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
13 0.034180012 90 jmlr-2013-Quasi-Newton Method: A New Direction
14 0.033737238 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
15 0.031387229 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
16 0.031280592 120 jmlr-2013-Variational Algorithms for Marginal MAP
17 0.030438798 68 jmlr-2013-Machine Learning with Operational Costs
18 0.030070126 104 jmlr-2013-Sparse Single-Index Model
19 0.029646292 76 jmlr-2013-Nonparametric Sparsity and Regularization
20 0.028786786 108 jmlr-2013-Stochastic Variational Inference
topicId topicWeight
[(0, -0.157), (1, 0.006), (2, 0.026), (3, 0.034), (4, 0.025), (5, -0.03), (6, -0.024), (7, 0.05), (8, 0.116), (9, -0.043), (10, -0.024), (11, -0.097), (12, 0.088), (13, -0.008), (14, -0.109), (15, 0.01), (16, -0.011), (17, 0.169), (18, 0.048), (19, 0.157), (20, 0.046), (21, -0.034), (22, -0.064), (23, -0.185), (24, 0.068), (25, -0.06), (26, -0.126), (27, 0.187), (28, 0.03), (29, 0.059), (30, 0.066), (31, 0.118), (32, 0.173), (33, -0.074), (34, -0.001), (35, -0.082), (36, 0.059), (37, -0.037), (38, 0.031), (39, -0.031), (40, -0.082), (41, 0.032), (42, -0.058), (43, 0.021), (44, -0.25), (45, -0.133), (46, -0.119), (47, -0.027), (48, 0.05), (49, -0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.92989957 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
2 0.48675036 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
Author: Arnaud Guyader, Nick Hengartner
Abstract: Motivated by promising experimental results, this paper investigates the theoretical properties of a recently proposed nonparametric estimator, called the Mutual Nearest Neighbors rule, which estimates the regression function m(x) = E[Y |X = x] as follows: first identify the k nearest neighbors of x in the sample Dn , then keep only those for which x is itself one of the k nearest neighbors, and finally take the average over the corresponding response variables. We prove that this estimator is consistent and that its rate of convergence is optimal. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, we also present adaptation results by data-splitting. Keywords: nonparametric estimation, nearest neighbor methods, mathematical statistics
3 0.46132341 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
Author: Wei Pan, Xiaotong Shen, Binghui Liu
Abstract: Clustering analysis is widely used in many fields. Traditionally clustering is regarded as unsupervised learning for its lack of a class label or a quantitative response variable, which in contrast is present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. In addition to the novel use of a non-convex group penalty and its associated unique operating characteristics in the proposed clustering method, a main advantage of this formulation is its allowing borrowing some well established results in classification and regression, such as model selection criteria to select the number of clusters, a difficult problem in clustering analysis. In particular, we propose using the generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare our proposed method with some existing approaches, demonstrating our method’s promising performance. Keywords: generalized degrees of freedom, grouping, K-means clustering, Lasso, penalized regression, truncated Lasso penalty (TLP)
4 0.35051286 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
Author: Kris De Brabanter, Jos De Brabanter, Bart De Moor, Irène Gijbels
Abstract: We present a fully automated framework to estimate derivatives nonparametrically without estimating the regression function. Derivative estimation plays an important role in the exploration of structures in curves (jump detection and discontinuities), comparison of regression curves, analysis of human growth data, etc. Hence, the study of estimating derivatives is equally important as regression estimation itself. Via empirical derivatives we approximate the qth order derivative and create a new data set which can be smoothed by any nonparametric regression estimator. We derive L1 and L2 rates and establish consistency of the estimator. The new data sets created by this technique are no longer independent and identically distributed (i.i.d.) random variables anymore. As a consequence, automated model selection criteria (data-driven procedures) break down. Therefore, we propose a simple factor method, based on bimodal kernels, to effectively deal with correlated data in the local polynomial regression framework. Keywords: nonparametric derivative estimation, model selection, empirical derivative, factor rule
5 0.32129031 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
6 0.31715253 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
7 0.27968177 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
8 0.26361492 117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
9 0.25470495 90 jmlr-2013-Quasi-Newton Method: A New Direction
10 0.24070486 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
11 0.23578176 40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming
12 0.23381676 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
14 0.22761193 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
15 0.22634959 68 jmlr-2013-Machine Learning with Operational Costs
16 0.2112402 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
17 0.20272198 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
18 0.20029972 120 jmlr-2013-Variational Algorithms for Marginal MAP
19 0.19929549 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
20 0.19305134 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
topicId topicWeight
[(0, 0.024), (5, 0.088), (6, 0.065), (10, 0.07), (20, 0.014), (23, 0.034), (28, 0.399), (41, 0.016), (44, 0.011), (68, 0.027), (70, 0.025), (75, 0.055), (85, 0.014), (87, 0.039), (93, 0.011), (95, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.72295201 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
Author: Lauren A. Hannah, David B. Dunson
Abstract: We propose a new, nonparametric method for multivariate regression subject to convexity or concavity constraints on the response function. Convexity constraints are common in economics, statistics, operations research, financial engineering and optimization, but there is currently no multivariate method that is stable and computationally feasible for more than a few thousand observations. We introduce convex adaptive partitioning (CAP), which creates a globally convex regression model from locally linear estimates fit on adaptively selected covariate partitions. CAP is a computationally efficient, consistent method for convex regression. We demonstrate empirical performance by comparing the performance of CAP to other shape-constrained and unconstrained regression methods for predicting weekly wages and value function approximation for pricing American basket options. Keywords: adaptive partitioning, convex regression, nonparametric regression, shape constraint, treed linear model
2 0.32922691 120 jmlr-2013-Variational Algorithms for Marginal MAP
Author: Qiang Liu, Alexander Ihler
Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models
Author: Alberto N. Escalante-B., Laurenz Wiskott
Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the final label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classification, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufficient as well as when feature selection is difficult. Keywords: slow feature analysis, feature extraction, classification, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis
4 0.32475787 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer
Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER
5 0.32426345 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
6 0.32350749 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
7 0.32336327 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
9 0.32195836 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
10 0.32067636 59 jmlr-2013-Large-scale SVD and Manifold Learning
11 0.32011694 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
12 0.31990129 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
13 0.31959555 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
14 0.31929997 106 jmlr-2013-Stationary-Sparse Causality Network Learning
15 0.31925935 22 jmlr-2013-Classifying With Confidence From Incomplete Information
16 0.31923145 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
17 0.31810805 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
18 0.31743935 86 jmlr-2013-Parallel Vector Field Embedding
19 0.31710434 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.31652552 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees