nips nips2009 nips2009-108 knowledge-graph by maker-knowledge-mining

108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints


Source: pdf

Author: Xiaolin Yang, Seyoung Kim, Eric P. Xing

Abstract: Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their input-output mapping functions. Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a high-dimensional feature space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as the L1/L∞ or L1/L2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their input-output mapping functions. [sent-8, score-0.354]

2 Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. [sent-9, score-0.484]

3 In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a high-dimensional feature space. [sent-10, score-0.491]

4 All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. [sent-11, score-0.601]

5 We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as L1 /L∞ or L1 /L2 norm of the model parameters. [sent-12, score-0.277]

6 Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. [sent-13, score-0.444]

7 In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks. [sent-14, score-0.433]

8 1 Introduction In multitask learning, one is interested in learning a set of related models for predicting multiple (possibly) related outputs. [sent-15, score-0.331]

9 In many applications, the multiple tasks share a common input space, but have different functional mappings to different output variables corresponding to different tasks. [sent-18, score-0.515]

10 When the tasks and their corresponding models are believed to be related, it is desirable to learn all of the models jointly rather than treating each task as independent of each other and fitting each model separately. [sent-19, score-0.395]

11 Such a learning strategy that allows us to borrow information across tasks can potentially increase the predictive power of the learned models. [sent-20, score-0.374]

12 For example, hierarchical Bayesian models have been applied when the parameter values themselves are thought to be similar across tasks [2, 14]. [sent-22, score-0.374]

13 A probabilistic method for modeling the latent structure shared across multiple tasks has been proposed [16]. [sent-23, score-0.409]

14 For problems in which the input lies in a high-dimensional space and the goal is to recover the shared sparsity structure across tasks, a regularized regression method has been proposed [10]. [sent-24, score-0.347]

15 In this paper, we consider an interesting and not uncommon scenario of multitask learning, where the tasks are heterogeneous and bear a union support. [sent-25, score-0.821]

16 That is, each task can be either a regression or classification problem, with the inputs lying in a very high-dimensional feature space, but only a small number of the input variables (i.e., predictors) are relevant to each of the output variables. [sent-26, score-0.416]

17 Only this small subset of predictors is relevant to each of the output variables. [sent-28, score-0.184]

18 Furthermore, we assume that all of the related tasks possibly share common relevant predictors, but with varying amount of influence on each task. [sent-31, score-0.429]

19 Previous approaches for multitask learning usually consider a set of homogeneous tasks, such as regressions only, or classifications only. [sent-32, score-0.417]

20 One of the successful extensions of the standard lasso is the group lasso that uses an L1 /L2 penalty defined over predictor groups [15], instead of just the L1 penalty ubiquitously over all predictors. [sent-34, score-0.337]
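
The following is a minimal sketch (not code from the paper) of the group-lasso penalty just described: it sums, over predictor groups, the L2 norm of each group's coefficients. The function name, group encoding, and example values are illustrative assumptions.

```python
import numpy as np

def group_lasso_penalty(beta, groups):
    """Sum of L2 norms of coefficient sub-vectors, one per predictor group.

    beta   : 1-D array of regression coefficients.
    groups : list of index arrays, one per group of predictors.
    """
    return sum(np.linalg.norm(beta[idx]) for idx in groups)

# Example: 6 coefficients split into two groups of three.
beta = np.array([0.5, -1.2, 0.0, 0.0, 0.0, 0.0])
groups = [np.arange(0, 3), np.arange(3, 6)]
print(group_lasso_penalty(beta, groups))  # the all-zero group contributes nothing
```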

21 Recently, a more general L1 /Lq -regularized regression scheme with q > 0 has been thoroughly investigated [17]. [sent-35, score-0.178]

22 When the L1 /Lq penalty is used in estimating the regression function for a single predictive task, it makes use of information about the grouping of input variables, and applies the L1 penalty over the Lq norm of the regression coefficients for each group of inputs. [sent-36, score-0.689]

23 This type of regularization scheme can also be applied to the output variables in a single classification task with multi-way (rather than binary) prediction, where the output is expanded from univariate to multivariate with dummy variables for each prediction category. [sent-38, score-0.468]

24 In this situation the group lasso can promote selecting the same set of relevant predictors across all of the dummy variables (which is desirable since these dummy variables indeed correspond to only a single multi-way output). [sent-39, score-0.554]

25 For example, an interior point method and a preconditioned conjugate gradient algorithm have been used to solve a large-scale L1 -regularized linear regression and logistic regression [8]. [sent-42, score-0.439]

26 In [6, 13], a coordinate-descent method was used in solving an L1 -regularized linear regression and generalized linear models, where the soft thresholding operator gives a closed-form solution for each coordinate in each iteration. [sent-43, score-0.178]

27 This means that the tasks in question consist of both regression and classification problems. [sent-47, score-0.505]

28 Assuming a linear regression for continuous-valued output and a logistic regression for discrete-valued output with dummy variables for multiple categories, an L1 /Lq penalty can be used to learn both types of tasks jointly for a sparse union support recovery. [sent-48, score-1.195]

29 Since the L1 /Lq penalty selects the same relevant inputs for all dummy outputs for each classification task, the desired consistency in chosen relevant inputs across the dummy variables corresponding to the same multi-way response is automatically maintained. [sent-49, score-0.846]

30 Our work is primarily motivated by the problem of genetic association mapping based on genomewide genotype data of single nucleotide polymorphisms (SNPs), and phenotype data such as disease status, clinical traits, and microarray data collected over a large number of individuals. [sent-51, score-0.293]

31 However, statistically significant patterns can emerge when the joint associations to multiple related traits are estimated properly. [sent-54, score-0.255]

32 Over the recent years, researchers started recognizing the importance of the joint analysis of multiple correlated phenotypes [5, 18], but there has been a lack of statistical tools to systematically perform such analysis. [sent-55, score-0.15]

33 In our previous work [7], we developed a regularized regression method, called a graph-guided fused lasso, for multitask regression problem that takes advantage of the graph structure over tasks to encourage a selection of common inputs across highly correlated traits in the graph. [sent-56, score-1.393]

34 In reality, the set of clinical traits related to a disease often contains both continuous- and discrete-valued traits. [sent-58, score-0.296]

35 As we demonstrate in our experiments, the L1/Lq regularization for the joint regression and classification can successfully handle this situation. [sent-59, score-0.268]

36 In Section 2, we introduce the notation and the basic formulation for joint regression-classification problem, and describe the L1 /L∞ and L1 /L2 regularized regressions for heterogeneous multitask learning in this setting. [sent-61, score-0.607]

37 Section 4 presents experimental results on simulated and asthma datasets. [sent-63, score-0.18]

38 2 Joint Multitask Learning of Linear Regressions and Multinomial Logistic Regressions Suppose that we have K tasks of learning a predictive model for the output variable, given a common set of P input variables. [sent-65, score-0.444]

39 In our joint regression-classification setting, we assume that the K tasks consist of Kr tasks with continuous-valued outputs and Kc tasks with discrete-valued outputs of an arbitrary number of categories. [sent-66, score-1.162]

40 For each of the Kr regression problems, we assume a linear relationship between the input vector X of size P and the kth output Yk as follows: $Y_k = \beta_{k0}^{(r)} + X\beta_k^{(r)} + \epsilon$, for $k = 1, \ldots, K_r$. [sent-67, score-0.366]

41 Here $\beta_k^{(r)} = (\beta_{k1}^{(r)}, \ldots, \beta_{kP}^{(r)})$ represents a vector of P regression coefficients for the kth regression task, with the superscript (r) indicating that this is a parameter for regression; $\beta_{k0}^{(r)}$ represents the intercept; and $\epsilon$ denotes the residual. [sent-73, score-0.427]

42 Let $\mathbf{y}_k = (y_{k1}, \ldots, y_{kN})$ represent the vector of observations for the kth output over N samples, and let X represent the N × P input matrix $X = (\mathbf{x}_1, \ldots, \mathbf{x}_N)^T$. [sent-77, score-0.136]

43 For the tasks with discrete-valued output, we set up a multinomial (i.e., softmax) logistic regression for each of the Kc tasks. [sent-85, score-0.327]

44 Assuming that the kth classification task has Mk categories, $P(Y_k = m \mid X = x) = \frac{\exp(\beta_{k0}^{(c)} + x\beta_{km}^{(c)})}{1 + \sum_{l=1}^{M_k-1} \exp(\beta_{k0}^{(c)} + x\beta_{kl}^{(c)})}$ for $m = 1, \ldots, M_k - 1$, and $P(Y_k = M_k \mid X = x) = \frac{1}{1 + \sum_{l=1}^{M_k-1} \exp(\beta_{k0}^{(c)} + x\beta_{kl}^{(c)})}$. [sent-87, score-0.406]

45 Assuming that the measurements for the Kc output variables are collected for the same set of N samples as in the regression tasks, we expand the output data yki for the kth task of the ith sample into a set of Mk binary variables $\mathbf{y}_{ki} = (y_{k1i}, \ldots, y_{kM_k i})$. [sent-97, score-0.505]

46 Each $y_{kmi}$, $m = 1, \ldots, M_k$, takes value 1 if the ith sample for the kth classification task belongs to the mth category and value 0 otherwise, so that $\sum_m y_{kmi} = 1$. [sent-103, score-0.168]
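
A minimal sketch of the softmax parameterization and the dummy-variable expansion described above. The function names, the use of a shared intercept vector, and the choice of the last category as the baseline follow the formulas as reconstructed here and are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax_probs(x, intercepts, betas):
    """Class probabilities for one sample: categories 1..M_k-1 carry linear
    scores, category M_k is the baseline with score 0.

    x          : (P,) input vector
    intercepts : (M_k-1,) intercept terms
    betas      : (M_k-1, P) coefficient rows beta_km
    """
    scores = intercepts + betas @ x          # scores for the first M_k-1 categories
    expo = np.exp(scores)
    denom = 1.0 + expo.sum()
    return np.append(expo / denom, 1.0 / denom)

def dummy_encode(y, n_categories):
    """Expand a categorical label in {0,...,M_k-1} into M_k binary indicators."""
    out = np.zeros(n_categories)
    out[y] = 1.0
    return out

x = np.random.randn(10)
p = softmax_probs(x, np.zeros(4), np.random.randn(4, 10))
print(p.sum())                # probabilities sum to 1
print(dummy_encode(2, 5))     # one-hot vector with a 1 in position 2
```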

47 This is equivalent to estimating the $\beta_k^{(r)}$'s and $\beta_{km}^{(c)}$'s independently for each of the K tasks, assuming that there are no shared patterns in the way that each of the K output variables is dependent on the input variables. [sent-106, score-0.267]

48 Our goal is to increase the performance of variable selection and prediction power by allowing the sharing of information among the heterogeneous tasks. [sent-107, score-0.288]

49 For example, in a genetic association mapping, often millions of genetic markers over a population of individuals are examined to find associations with the given phenotype such as clinical traits, disease status, or molecular phenotypes. [sent-109, score-0.389]

50 We consider the case where the related tasks share the same sparsity pattern such that they have a common set of relevant input variables for both the regression and classification tasks and the amount of influence of the relevant input variables on the output may vary across the tasks. [sent-111, score-1.348]

51 We consider two extreme cases of the L1/Lq penalty for group variable selection in our problem, namely the L∞ norm and the L2 norm taken across the different tasks for each input dimension. [sent-113, score-0.633]

52 The two penalties are $P_\infty = \sum_{j=1}^{P} \max_{k,m}\left(|\beta_{kj}^{(r)}|, |\beta_{kmj}^{(c)}|\right)$ and $P_2 = \sum_{j=1}^{P} \|(\beta_j^{(r)}, \beta_j^{(c)})\|_{2}$, where $\beta_j^{(r)}$ and $\beta_j^{(c)}$ are the vectors of parameters over all regression and classification tasks, respectively, for the jth dimension. [sent-114, score-0.203]
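
The sketch below computes the two joint-sparsity penalties just defined for a stacked coefficient matrix. The stacking convention (rows are tasks, columns are shared input dimensions) and all names are illustrative assumptions.

```python
import numpy as np

def joint_penalties(B_reg, B_cls):
    """L1/Linf and L1/L2 penalties over the columns (input dimensions) of the
    stacked coefficient matrix.

    B_reg : (K_r, P) regression coefficients
    B_cls : (sum_k (M_k - 1), P) expanded classification coefficients
    """
    B = np.vstack([B_reg, B_cls])
    p_inf = np.abs(B).max(axis=0).sum()        # sum_j max over tasks of |beta_.j|
    p_2 = np.linalg.norm(B, axis=0).sum()      # sum_j ||beta_.j||_2
    return p_inf, p_2

B_reg = np.random.randn(6, 50)   # six regression tasks, 50 SNPs
B_cls = np.random.randn(4, 50)   # one 5-category classification task, 4 dummy rows
print(joint_penalties(B_reg, B_cls))
```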

53 Here, the L∞ and L2 norms over the parameters across different tasks can regulate the joint sparsity among tasks. [sent-115, score-0.507]

54 The L1/L∞ and L1/L2 norms encourage group sparsity in a similar way, in that the $\beta_{kj}^{(r)}$'s and $\beta_{kmj}^{(c)}$'s are set to 0 simultaneously for all of the tasks for dimension j if the L∞ or L2 norm for that dimension is set to 0. [sent-116, score-0.492]

55 The L1/L∞ penalty tends to encourage the parameter values to be the same across all tasks for a given input [17], whereas under the L1/L2 penalty the parameter values across tasks tend to differ more for a given input than under the L1/L∞ penalty. [sent-118, score-1.118]

56 However, since it is a first-order method, the speed of convergence becomes slow as the number of tasks and dimension increase. [sent-122, score-0.327]

57 The linear regression loss and logistic regression loss have different forms. [sent-124, score-0.456]

58 The interior-point method optimizes the original loss functions without any transformation, so it is more intuitive to see how the two heterogeneous tasks affect each other. [sent-125, score-0.541]

59 First, we rewrite the problem of minimizing Equation (5) with the nondifferentiable L1/L∞ penalty as: minimize $L_r + L_c + \lambda \sum_{j=1}^{P} u_j$ subject to $\max_{k,m}\left(|\beta_{kj}^{(r)}|, |\beta_{kmj}^{(c)}|\right) < u_j$ for $j = 1, \ldots, P$. [sent-127, score-0.377]

60 Here R and L are second-derivative blocks with respect to the regression and classification parameters, of the form $R = \nabla^2 L_r + \nabla^2 P_g|_{\partial\beta^{(r)}\partial\beta^{(r)}}$, $L = \nabla^2 L_c + \nabla^2 P_g|_{\partial\beta^{(c)}\partial\beta^{(c)}}$, $D = \nabla^2 P_g|_{\partial\beta\,\partial u}$, and $F = D^{(r)} + D^{(c)}$. [sent-166, score-0.530]
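
A minimal sketch of a log-barrier surrogate for the constrained reformulation above, of the kind an interior-point approach would optimize. The exact barrier form, the stacking of coefficients, and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def barrier_objective(loss_value, B, u, lam, t):
    """Log-barrier surrogate for: minimize L_r + L_c + lam * sum_j u_j
    subject to |beta_kj| < u_j for all tasks k and dimensions j.
    The barrier -(1/t) * sum log(u_j - |beta_kj|) replaces the constraints;
    t is increased over the outer iterations.

    loss_value : L_r + L_c evaluated at the current parameters
    B          : (K, P) matrix of all task coefficients (regression rows and
                 expanded classification rows stacked together)
    u          : (P,) auxiliary bound variables
    """
    slack = u[np.newaxis, :] - np.abs(B)   # must stay strictly positive
    if np.any(slack <= 0):
        return np.inf                      # outside the feasible region
    return loss_value + lam * u.sum() - (1.0 / t) * np.log(slack).sum()
```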

61 5 Experiments We demonstrate our methods for heterogeneous multitask learning with L1 /L∞ and L1 /L2 regularizations on simulated and asthma datasets, and compare their performances with those from solving two types of multitask-learning problems for regressions and classifications separately. [sent-169, score-0.796]

62 5.1 Simulation Study In the context of genetic association analysis, we simulate the input and output data with known model parameters as follows. [sent-171, score-0.246]

63 We start from the 120 haplotypes of chromosome 7 from the population of European ancestry in HapMap data [12], and randomly mate the haplotypes to generate genotype data for 500 individuals. [sent-172, score-0.184]

64 In order to simulate the parameters $\beta_k^{(r)}$'s and $\beta_{km}^{(c)}$'s, we assume six regression tasks and a single classification task with five categories, and choose five common SNPs from the total of 50 SNPs as relevant covariates across all of the tasks. [sent-174, score-0.786]

65 We fill the non-zero entries in the regression coefficients $\beta_k^{(r)}$'s with values uniformly distributed in the interval [a, b] with 5 ≤ a, b ≤ 10, and the non-zero entries in the logistic-regression parameters $\beta_{km}^{(c)}$'s such that the five categories are separated in the output space. [sent-175, score-0.370]

66 Given these inputs and the model parameters, we generate the output values, with the noise for the regression tasks distributed as $N(0, \sigma_{\mathrm{sim}}^2)$. [sent-176, score-0.732]
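
An illustrative sketch of this simulation setup. Random 0/1/2 genotypes stand in for the HapMap-based mating procedure, and the logistic effect sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 500, 50                 # individuals, SNPs
K_r, M = 6, 5                  # regression tasks, classification categories

X = rng.integers(0, 3, size=(N, P)).astype(float)    # genotypes coded 0/1/2
relevant = rng.choice(P, size=5, replace=False)       # five shared causal SNPs

# Regression tasks: nonzero effects uniform in [5, 10], Gaussian noise.
B_reg = np.zeros((K_r, P))
B_reg[:, relevant] = rng.uniform(5, 10, size=(K_r, 5))
sigma_sim = 1.0
Y_reg = X @ B_reg.T + rng.normal(0, sigma_sim, size=(N, K_r))

# One classification task: softmax over M categories with effects only on the
# shared relevant SNPs (last category used as the baseline).
B_cls = np.zeros((M - 1, P))
B_cls[:, relevant] = rng.normal(0, 1, size=(M - 1, 5))
scores = X @ B_cls.T
probs = np.column_stack([np.exp(scores), np.ones(N)])
probs /= probs.sum(axis=1, keepdims=True)
y_cls = np.array([rng.choice(M, p=p) for p in probs])
Y_dummy = np.eye(M)[y_cls]     # expand into five 0/1 dummy variables
```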

67 In the classification task, we expand the single output into five dummy variables representing different categories that take values of 0 or 1 depending on which category each sample belongs to. [sent-177, score-0.252]

68 We repeat this whole process of simulating inputs and outputs to obtain 50 datasets, and report the results averaged over these datasets. [sent-178, score-0.172]

69 The results from learning all of the tasks jointly are shown in Figures 1(a) and 1(c) for regression and classification tasks, respectively, whereas the results from learning the two sets of regression and classification tasks separately are shown in Figures 1(b) and 1(d). [sent-180, score-1.04]

70 The red curves indicate the parameters for true relevant inputs, and the blue curves for true irrelevant inputs. [sent-181, score-0.265]

71 We find that when learning both types of tasks jointly, the parameters of the irrelevant inputs are more reliably set to zero along the regularization path than learning the two types of tasks separately. [sent-182, score-0.869]

72 To obtain ROC curves, we estimate the parameters, sort the input-output pairs according to the magnitude of the estimated $\beta_{kj}^{(r)}$'s and $\beta_{kmj}^{(c)}$'s, and compare the sorted list with the list of input-output pairs with true non-zero $\beta_{kj}^{(r)}$'s and $\beta_{kmj}^{(c)}$'s. [sent-184, score-0.714]
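
A small sketch of this ROC construction. Using scikit-learn is a convenience assumption; the function name and array layout are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_from_coefficients(B_hat, B_true):
    """ROC for recovering truly relevant (input, output) pairs: each pair is
    scored by the magnitude of its estimated coefficient and labeled 1 if the
    corresponding true coefficient is nonzero.

    B_hat, B_true : arrays of the same shape holding estimated and true
                    coefficients over all tasks and input dimensions.
    """
    scores = np.abs(B_hat).ravel()
    labels = (np.abs(B_true).ravel() > 0).astype(int)
    fpr, tpr, _ = roc_curve(labels, scores)
    return fpr, tpr, auc(fpr, tpr)
```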

73 Figure 2: ROC curves for detecting true relevant input variables when the sample size N varies. [sent-217, score-0.233]

74 (a) Regression tasks with N = 100, (b) classification tasks with N = 100, (c) regression tasks with N = 200, and (d) classification tasks with N = 200. [sent-218, score-1.486]

75 ‘M’ indicates homogeneous multitask learning, and ‘HM’ heterogeneous multitask learning (this notation is used in the following figures as well). [sent-221, score-0.609]

76 (a) Regression tasks with N =100, (b) classification tasks with N = 100, (c) regression tasks with N = 200, and (d) classification tasks with N = 200. [sent-239, score-1.486]

77 We vary the sample size to N = 100 and 200, and show the ROC curves for detecting true relevant inputs using different methods in Figure 2. [sent-241, score-0.241]

78 We use σsim = 1 to generate noise in the regression tasks. [sent-242, score-0.236]

79 Results for the regression and classification tasks with N = 100 are shown in Figure 2(a) and (b) respectively, and similarly, the results with N = 200 in Figure 2(c) and (d). [sent-243, score-0.505]

80 The results with the L1/L∞ penalty are shown in blue and green to compare the homogeneous and heterogeneous methods. [sent-244, score-0.336]

81 Although the performance of learning the two types of tasks separately improves with a larger sample size, the joint estimation performs significantly better for both sample sizes. [sent-246, score-0.372]

82 In order to see how different signal-to-noise ratios affect the performance, we vary the noise level to $\sigma_{\mathrm{sim}}^2 = 5$ and $\sigma_{\mathrm{sim}}^2 = 8$, and plot the ROC curves averaged over 50 datasets with a sample size N = 300 in Figure 4. [sent-248, score-0.163]

83 Our results show that for both of the signal-to-noise ratios, learning regression and classification tasks jointly improves the performance significantly. [sent-249, score-0.535]

84 We can see that the L1 /L2 method tends to improve the variable selection, but the tradeoff is that the prediction error will be high when the noise level is low. [sent-251, score-0.156]

85 While L1 /L∞ has a good balance between the variable selection accuracy and prediction error at a lower noise level, as the noise increases, the L1 /L2 outperforms L1 /L∞ in both variable selection and prediction accuracy. [sent-252, score-0.316]

86 Figure 4: ROC curves for detecting true relevant input variables when the noise level varies. [sent-285, score-0.334]

87 (a) Regression tasks with noise level N (0, 5), (b) classification tasks with noise level N (0, 5), (c) regression tasks with noise level N (0, 8), and (d) classification tasks with noise level N (0, 8). [sent-286, score-1.89]

88 (a) Regression tasks with noise level N (0, 52 ), (b) classification tasks with noise level N (0, 52 ), (c) regression tasks with noise level N (0, 82 ), and (d) classification tasks with noise level N (0, 82 ). [sent-305, score-1.89]

89 Figure 6: Parameters estimated from the asthma dataset for discovery of causal SNPs for the correlated phenotypes. [sent-315, score-0.23]

90 (a) Heterogeneous task learning method, and (b) separate analysis of multitask regressions and multitask classifications. [sent-316, score-0.675]

91 2 Analysis of Asthma Dataset We apply our method to the asthma dataset with 34 SNPs in the IL4R gene of chromosome 11 and five asthma-related clinical traits collected over 613 patients. [sent-319, score-0.455]

92 The set of traits includes four continuous-valued traits related to lung physiology such as baseline predrug FEV1, maximum FEV1, baseline predrug FVC, and maximum FVC as well as a single discrete-valued trait with five categories. [sent-320, score-0.46]

93 The goal of this analysis is to discover whether any of the SNPs (inputs) are influencing each of the asthma-related traits (outputs). [sent-321, score-0.199]

94 We fit the joint regression-classification method with L1 /L∞ and L1 /L2 regularizations, and compare the results from fitting L1 /L∞ and L1 /L2 regularized methods only for the regression tasks or only for the classification task. [sent-322, score-0.55]

95 We can see that the heterogeneous multitask-learning method encourages finding common causal SNPs for the multiclass classification task and the regression tasks. [sent-324, score-0.45]

96 6 Conclusions In this paper, we proposed a method for recovery of the union support in heterogeneous multitask learning, where the set of tasks consists of both regressions and classifications. [sent-325, score-0.932]

97 In our experiments with simulated and asthma datasets, we demonstrated that using L1 /L2 or L1 /L∞ regularizations in the joint regression-classification problem improves the performance for identifying the input variables that are commonly relevant to multiple tasks. [sent-326, score-0.45]

98 The sparse union support recovery as was presented in this paper is concerned with finding inputs that influence at least one task. [sent-327, score-0.147]

99 In the real-world problem of association mapping, there is a clustering structure such as co-regulated genes, and it would be interesting to discover SNPs that are causal to at least one of the outputs within the subgroup rather than all of the outputs. [sent-328, score-0.176]

100 Model selection and estimation in regression with grouped variables. [sent-422, score-0.223]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hm', 0.382), ('tasks', 0.327), ('snps', 0.296), ('multitask', 0.263), ('kmj', 0.237), ('heterogeneous', 0.188), ('regression', 0.178), ('traits', 0.174), ('sencitivity', 0.158), ('specificity', 0.158), ('asthma', 0.138), ('uj', 0.136), ('mk', 0.123), ('kc', 0.121), ('kr', 0.121), ('kj', 0.12), ('dummy', 0.112), ('regressions', 0.111), ('penalty', 0.105), ('inputs', 0.104), ('lc', 0.089), ('lr', 0.089), ('yk', 0.087), ('relevant', 0.075), ('clinical', 0.074), ('kth', 0.071), ('km', 0.071), ('roc', 0.069), ('classi', 0.069), ('chromosome', 0.069), ('outputs', 0.068), ('genetic', 0.067), ('output', 0.065), ('curves', 0.062), ('sim', 0.059), ('lbarrier', 0.059), ('phenotypes', 0.059), ('ykmi', 0.059), ('noise', 0.058), ('prediction', 0.055), ('regularizations', 0.054), ('input', 0.052), ('classification', 0.05), ('logistic', 0.048), ('disease', 0.048), ('across', 0.047), ('lasso', 0.047), ('correlated', 0.046), ('causal', 0.046), ('joint', 0.045), ('regularization', 0.045), ('selection', 0.045), ('variables', 0.044), ('cation', 0.044), ('union', 0.043), ('level', 0.043), ('homogeneous', 0.043), ('simulated', 0.042), ('irrelevant', 0.041), ('newton', 0.041), ('predictors', 0.04), ('fvc', 0.04), ('haplotypes', 0.04), ('hapmap', 0.04), ('heterogenous', 0.04), ('preconditionor', 0.04), ('predrug', 0.04), ('hessian', 0.039), ('pg', 0.039), ('task', 0.038), ('norm', 0.038), ('uence', 0.038), ('association', 0.037), ('inverting', 0.036), ('associations', 0.036), ('sparsity', 0.035), ('shared', 0.035), ('preconditioned', 0.035), ('genotype', 0.035), ('group', 0.033), ('pittsburgh', 0.032), ('trait', 0.032), ('phenotype', 0.032), ('encourage', 0.031), ('categories', 0.031), ('jointly', 0.03), ('norms', 0.028), ('status', 0.028), ('barrier', 0.028), ('markers', 0.028), ('obozinski', 0.028), ('share', 0.027), ('kim', 0.027), ('pq', 0.027), ('loss', 0.026), ('ve', 0.025), ('discover', 0.025), ('covariates', 0.025), ('parameters', 0.025), ('cients', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

Author: Xiaolin Yang, Seyoung Kim, Eric P. Xing

Abstract: Multitask learning addresses the problem of learning related tasks that presumably share some commonalities on their input-output mapping functions. Previous approaches to multitask learning usually deal with homogeneous tasks, such as purely regression tasks, or entirely classification tasks. In this paper, we consider the problem of learning multiple related tasks of predicting both continuous and discrete outputs from a common set of input variables that lie in a highdimensional feature space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regressions and logistic regressions, and model the joint sparsity as L1 /L∞ or L1 /L2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where the goal is to discover genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in this setting, using simulated and clinical asthma datasets, and we show that our method can effectively recover the relevant inputs with respect to all of the tasks. 1

2 0.088923708 238 nips-2009-Submanifold density estimation

Author: Arkadas Ozakin, Alexander G. Gray

Abstract: Kernel density estimation is the most widely-used practical method for accurate nonparametric density estimation. However, long-standing worst-case theoretical results showing that its performance worsens exponentially with the dimension of the data have quashed its application to modern high-dimensional datasets for decades. In practice, it has been recognized that often such data have a much lower-dimensional intrinsic structure. We propose a small modification to kernel density estimation for estimating probability density functions on Riemannian submanifolds of Euclidean space. Using ideas from Riemannian geometry, we prove the consistency of this modified estimator and show that the convergence rate is determined by the intrinsic dimension of the submanifold. We conclude with empirical results demonstrating the behavior predicted by our theory.

3 0.08884839 157 nips-2009-Multi-Label Prediction via Compressed Sensing

Author: John Langford, Tong Zhang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider multi-label prediction problems with large output spaces under the assumption of output sparsity – that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. 1

4 0.082787752 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

Author: Mladen Kolar, Le Song, Eric P. Xing

Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1

5 0.079038568 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. 1

6 0.076343477 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

7 0.073665679 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

8 0.07299269 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

9 0.072413899 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

10 0.070143141 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

11 0.066990875 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

12 0.065924928 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

13 0.065719903 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

14 0.058785401 246 nips-2009-Time-Varying Dynamic Bayesian Networks

15 0.05733842 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

16 0.056772716 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

17 0.053600527 3 nips-2009-AUC optimization and the two-sample problem

18 0.053170804 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

19 0.05297906 128 nips-2009-Learning Non-Linear Combinations of Kernels

20 0.052188262 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.172), (1, 0.045), (2, -0.015), (3, 0.083), (4, -0.012), (5, -0.056), (6, 0.042), (7, -0.084), (8, 0.042), (9, 0.039), (10, 0.079), (11, -0.056), (12, -0.029), (13, -0.058), (14, 0.013), (15, 0.046), (16, -0.006), (17, -0.001), (18, -0.015), (19, -0.001), (20, -0.013), (21, 0.011), (22, 0.039), (23, -0.035), (24, 0.082), (25, 0.094), (26, -0.039), (27, 0.093), (28, 0.043), (29, -0.066), (30, 0.006), (31, 0.053), (32, 0.016), (33, 0.099), (34, -0.069), (35, -0.001), (36, -0.023), (37, 0.064), (38, 0.078), (39, -0.004), (40, -0.102), (41, 0.035), (42, 0.049), (43, -0.036), (44, 0.025), (45, -0.087), (46, 0.01), (47, 0.092), (48, 0.044), (49, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9443658 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

2 0.64445329 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

Author: Matthias Hein

Abstract: Motivated by recent developments in manifold-valued regression we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output including several robust versions. Depending on the choice of the output space and the metric the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output and illustrate the robustness properties of the estimators with experiments. 1
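A minimal sketch of the kind of kernel-smoothing estimator with metric-space valued output this abstract describes: a kernel-weighted Fréchet-mean style estimate, with the argmin restricted to the observed training outputs as a practical simplification (that restriction, the Gaussian kernel, and all names are assumptions, not the paper's exact estimator).

```python
import numpy as np

def metric_kernel_regression(X_train, Y_train, x_query, dist_out, h=1.0):
    """Kernel-weighted Frechet-mean style estimate with metric-space output.

    Returns the training output y minimizing sum_i w_i(x) * dist_out(y, y_i)^2,
    where w_i are Gaussian kernel weights in the input space.  Restricting the
    argmin to observed outputs is a simplification for illustration only.
    """
    w = np.exp(-np.sum((X_train - x_query) ** 2, axis=1) / (2 * h ** 2))
    costs = [
        np.sum(w * np.array([dist_out(y, yi) ** 2 for yi in Y_train]))
        for y in Y_train
    ]
    return Y_train[int(np.argmin(costs))]

# example: multi-class output with the discrete metric -> weighted majority vote
X_train = np.array([[0.0], [0.1], [1.0], [1.1]])
Y_train = ["a", "a", "b", "b"]
discrete = lambda y1, y2: 0.0 if y1 == y2 else 1.0
print(metric_kernel_regression(X_train, Y_train, np.array([0.05]), discrete, h=0.3))
```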

3 0.60449034 124 nips-2009-Lattice Regression

Author: Eric Garcia, Maya Gupta

Abstract: We present a new empirical risk minimization framework for approximating functions from training samples for low-dimensional regression applications where a lattice (look-up table) is stored and interpolated at run-time for an efficient implementation. Rather than evaluating a fitted function at the lattice nodes without regard to the fact that samples will be interpolated, the proposed lattice regression approach estimates the lattice to minimize the interpolation error on the given training samples. Experiments show that lattice regression can reduce mean test error by as much as 25% compared to Gaussian process regression (GPR) for digital color management of printers, an application for which linearly interpolating a look-up table is standard. Simulations confirm that lattice regression performs consistently better than the naive approach to learning the lattice. Surprisingly, in some cases the proposed method — although motivated by computational efficiency — performs better than directly applying GPR with no lattice at all. 1
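A toy sketch of the lattice-regression idea for a 1-D look-up table: the node values are estimated so that linear interpolation of the nodes fits the training samples, with a second-difference smoothness penalty added only to keep the toy problem well-posed (the penalty, grid, and function names are assumptions, not the paper's formulation).

```python
import numpy as np

def fit_lattice_1d(x, y, grid, lam=1e-3):
    """Fit 1-D lattice (look-up table) node values so that linear interpolation
    of the nodes reproduces the training samples as well as possible."""
    n, m = len(x), len(grid)
    W = np.zeros((n, m))
    for i, xi in enumerate(x):
        j = np.clip(np.searchsorted(grid, xi) - 1, 0, m - 2)  # left node of the cell containing xi
        t = (xi - grid[j]) / (grid[j + 1] - grid[j])           # interpolation weight
        W[i, j], W[i, j + 1] = 1 - t, t
    D = np.diff(np.eye(m), n=2, axis=0)                        # second-difference operator
    A = W.T @ W + lam * D.T @ D
    return np.linalg.solve(A, W.T @ y)                         # lattice node values

grid = np.linspace(0, 1, 6)
x = np.random.default_rng(1).uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.05 * np.random.default_rng(2).standard_normal(200)
nodes = fit_lattice_1d(x, y, grid)
print(np.round(nodes, 2))  # values to store in the look-up table
```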

4 0.60409373 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

Author: Mladen Kolar, Le Song, Eric P. Xing

Abstract: To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model — piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models. 1

5 0.5620001 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ C_L penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1
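A hedged sketch of the plug-in step only: given some variance estimate, choose the kernel-ridge regularization parameter minimizing Mallows' C_L. The variance is estimated here naively from an under-penalized fit as a stand-in for the paper's minimal-penalty estimator; the data, kernel, and names are illustrative assumptions.

```python
import numpy as np

def mallows_cl_ridge(K, y, lambdas, sigma2):
    """Select a kernel-ridge regularization parameter by Mallows' C_L:
    empirical risk + 2 * sigma2 * df(lambda) / n, with df = trace of the smoother."""
    n = len(y)
    scores = []
    for lam in lambdas:
        H = K @ np.linalg.inv(K + n * lam * np.eye(n))   # smoothing matrix of kernel ridge
        resid = y - H @ y
        scores.append(np.sum(resid ** 2) / n + 2 * sigma2 * np.trace(H) / n)
    return lambdas[int(np.argmin(scores))]

# toy data and Gaussian kernel
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(4 * x) + 0.2 * rng.standard_normal(80)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.02)

# naive variance estimate from a very under-penalized fit (a stand-in for the
# minimal-penalty estimator described in the abstract)
H_small = K @ np.linalg.inv(K + len(y) * 1e-6 * np.eye(len(y)))
sigma2_hat = np.sum((y - H_small @ y) ** 2) / max(len(y) - np.trace(H_small), 1.0)

lambdas = np.logspace(-6, 0, 25)
print("selected lambda:", mallows_cl_ridge(K, y, lambdas, sigma2_hat))
```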

6 0.55188477 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

7 0.5366475 157 nips-2009-Multi-Label Prediction via Compressed Sensing

8 0.53282803 238 nips-2009-Submanifold density estimation

9 0.52857912 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

10 0.52341914 81 nips-2009-Ensemble Nystrom Method

11 0.48870617 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

12 0.48805597 97 nips-2009-Free energy score space

13 0.47478634 71 nips-2009-Distribution-Calibrated Hierarchical Classification

14 0.45661756 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

15 0.45530123 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

16 0.45275006 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

17 0.45093444 237 nips-2009-Subject independent EEG-based BCI decoding

18 0.44814199 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

19 0.4424955 246 nips-2009-Time-Varying Dynamic Bayesian Networks

20 0.43879887 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.015), (21, 0.023), (24, 0.048), (25, 0.052), (35, 0.039), (36, 0.123), (39, 0.064), (55, 0.284), (58, 0.126), (71, 0.041), (81, 0.012), (86, 0.088)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88129741 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

Author: Matthew Wilder, Matt Jones, Michael C. Mozer

Abstract: Across a wide range of cognitive tasks, recent experience influences behavior. For example, when individuals repeatedly perform a simple two-alternative forced-choice task (2AFC), response latencies vary dramatically based on the immediately preceding trial sequence. These sequential effects have been interpreted as adaptation to the statistical structure of an uncertain, changing environment (e.g., Jones and Sieck, 2003; Mozer, Kinoshita, and Shettel, 2007; Yu and Cohen, 2008). The Dynamic Belief Model (DBM) (Yu and Cohen, 2008) explains sequential effects in 2AFC tasks as a rational consequence of a dynamic internal representation that tracks second-order statistics of the trial sequence (repetition rates) and predicts whether the upcoming trial will be a repetition or an alternation of the previous trial. Experimental results suggest that first-order statistics (base rates) also influence sequential effects. We propose a model that learns both first- and second-order sequence properties, each according to the basic principles of the DBM but under a unified inferential framework. This model, the Dynamic Belief Mixture Model (DBM2), obtains precise, parsimonious fits to data. Furthermore, the model predicts dissociations in behavioral (Maloney, Martello, Sahm, and Spillmann, 2005) and electrophysiological studies (Jentzsch and Sommer, 2002), supporting the psychological and neurobiological reality of its two components. 1
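A simplified grid-based sketch of the idea of mixing a first-order (base-rate) and a second-order (repetition-rate) dynamic belief tracker; the change probability, mixture weight, and discretization are invented for illustration and are not the authors' exact model.

```python
import numpy as np

def dynamic_belief_tracker(alpha, grid_size=101):
    """Grid-based dynamic belief over a Bernoulli rate: at each trial the rate
    may have been resampled from the prior with probability alpha."""
    grid = np.linspace(0.0, 1.0, grid_size)
    prior = np.ones(grid_size) / grid_size
    belief = prior.copy()

    def step(observation):
        nonlocal belief
        pred = float(np.sum(grid * belief))              # predictive P(observation = 1)
        like = grid if observation else (1.0 - grid)     # Bernoulli likelihood on the grid
        belief = like * belief
        belief /= belief.sum()
        belief = alpha * prior + (1.0 - alpha) * belief  # possible change of the environment
        return pred
    return step

# mix a base-rate tracker (is the next stimulus a 1?) with a repetition-rate
# tracker (is the next trial a repeat of the last?)
rng = np.random.default_rng(4)
trials = (rng.random(500) < 0.7).astype(int)             # biased stimulus sequence
base, rep = dynamic_belief_tracker(0.1), dynamic_belief_tracker(0.1)
w = 0.5                                                  # mixture weight (illustrative)
preds, prev = [], None
for s in trials:
    p_base = base(s)                                     # prediction from the base-rate model
    if prev is None:
        p_mix = p_base
    else:
        p_repeat = rep(int(s == prev))                   # prediction from the repetition model
        p_mix = w * p_base + (1 - w) * (p_repeat if prev == 1 else 1 - p_repeat)
    preds.append(p_mix)
    prev = s
print("mixed prediction P(stimulus = 1) on the last trial ~", round(preds[-1], 2))
```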

2 0.87520224 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

Author: Andrew McCallum, Karl Schultz, Sameer Singh

Abstract: Discriminatively trained undirected graphical models have had wide empirical success, and there has been increasing interest in toolkits that ease their application to complex relational data. The power in relational models is in their repeated structure and tied parameters; at issue is how to define these structures in a powerful and flexible way. Rather than using a declarative language, such as SQL or first-order logic, we advocate using an imperative language to express various aspects of model structure, inference, and learning. By combining the traditional, declarative, statistical semantics of factor graphs with imperative definitions of their construction and operation, we allow the user to mix declarative and procedural domain knowledge, and also gain significant efficiencies. We have implemented such imperatively defined factor graphs in a system we call FACTORIE, a software library for an object-oriented, strongly-typed, functional language. In experimental comparisons to Markov Logic Networks on joint segmentation and coreference, we find our approach to be 3-15 times faster while reducing error by 20-25%—achieving a new state of the art. 1
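FACTORIE itself is a Scala library; the following is only a Python analogue of the idea of imperatively defined factors plus MCMC over a mutable variable state, with all class and factor names invented (and, unlike FACTORIE, it rescores all factors rather than only the changed variable's neighborhood).

```python
import math
import random

class Variable:
    """A discrete variable with a current value (the mutable, 'imperative' state)."""
    def __init__(self, name, domain, value=None):
        self.name, self.domain = name, domain
        self.value = value if value is not None else domain[0]

def score(factors, variables):
    """Total log-score = sum of imperatively defined factor functions."""
    return sum(f(variables) for f in factors)

def mh_step(factors, variables, temperature=1.0):
    """One Metropolis-Hastings step: propose a new value for one variable."""
    v = random.choice(variables)
    old_value, old_score = v.value, score(factors, variables)
    v.value = random.choice(v.domain)
    accept_prob = math.exp(min(0.0, (score(factors, variables) - old_score) / temperature))
    if random.random() >= accept_prob:
        v.value = old_value                              # reject: roll back the change

# toy model: two tokens whose labels should agree, plus unary evidence on one token
a, b = Variable("a", ["PER", "ORG"]), Variable("b", ["PER", "ORG"])
factors = [
    lambda vs: 1.5 if vs[0].value == vs[1].value else 0.0,  # agreement factor
    lambda vs: 1.0 if vs[0].value == "PER" else 0.0,        # unary evidence on token a
]
variables = [a, b]
random.seed(0)
for _ in range(200):
    mh_step(factors, variables)
print(a.value, b.value)   # likely PER PER
```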

same-paper 3 0.80899304 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

4 0.80801773 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

Author: Sanja Fidler, Marko Boben, Ales Leonardis

Abstract: Multi-class object learning and detection is a challenging problem due to the large number of object classes and their high visual variability. Specialized detectors usually excel in performance, while joint representations optimize sharing and reduce inference time — but are complex to train. Conveniently, sequential class learning cuts down training time by transferring existing knowledge to novel classes, but cannot fully exploit the shareability of features among object classes and might depend on ordering of classes during learning. In hierarchical frameworks these issues have been little explored. In this paper, we provide a rigorous experimental analysis of various multiple object class learning strategies within a generative hierarchical framework. Specifically, we propose, evaluate and compare three important types of multi-class learning: 1.) independent training of individual categories, 2.) joint training of classes, and 3.) sequential learning of classes. We explore and compare their computational behavior (space and time) and detection performance as a function of the number of learned object classes on several recognition datasets. We show that sequential training achieves the best trade-off between inference and training times at a comparable detection performance and could thus be used to learn the classes on a larger scale. 1

5 0.61483836 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these sampling-based methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

6 0.60823584 46 nips-2009-Bilinear classifiers for visual recognition

7 0.6047895 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

8 0.6045047 104 nips-2009-Group Sparse Coding

9 0.60304642 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

10 0.60052544 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

11 0.59765053 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

12 0.59718764 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

13 0.59707201 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

14 0.59466308 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

15 0.59465116 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors

16 0.59426826 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

17 0.59283912 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

18 0.59076452 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

19 0.59074712 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

20 0.59059095 169 nips-2009-Nonlinear Learning using Local Coordinate Coding