nips nips2012 nips2012-312 knowledge-graph by maker-knowledge-mining

312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression


Source: pdf

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Multiple-output regression models require estimating multiple parameters, one for each output. [sent-10, score-0.235]

2 In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. [sent-12, score-1.147]

3 Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. [sent-15, score-0.268]

4 Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). [sent-16, score-0.224]

5 1 Introduction Multivariate response prediction, also known as multiple-output regression [3] when the responses are real-valued vectors, is an important problem in machine learning and statistics. [sent-18, score-0.341]

6 The goal in multiple-output regression is to learn a model for predicting K > 1 real-valued responses (the output) from D predictors or features (the input), given a training dataset consisting of N input-output pairs. [sent-19, score-0.522]

7 Multiple-output prediction is also an instance of the problem of multitask learning [5, 10] where predicting each output is a task and all the tasks share the same input data. [sent-20, score-0.398]

8 Multiple-output regression problems are encountered frequently in various application domains. [sent-21, score-0.235]

9 One distinguishing aspect of multiple-output regression is that the outputs are often related to each other via some underlying (and often a priori unknown) structure. [sent-23, score-0.434]

10 A part of this can be captured by imposing a relatedness structure among the regression coefficients (e. [sent-24, score-0.44]

11 , the weight vectors in a linear regression model) of all the outputs. [sent-26, score-0.422]

12 We refer to the relatedness structure among the regression coefficients as task structure. [sent-27, score-0.522]

13 However, there can still be some structure left in the outputs that is not explained by the regression coefficients alone. [sent-28, score-0.488]

14 The residual structure that is left out when conditioned on inputs will be referred to as output structure here. [sent-32, score-0.377]

15 This can also be seen as the covariance structure in the output noise. [sent-33, score-0.505]

16 It is therefore desirable to simultaneously learn and leverage both the output structure and the task structure in multiple-output regression models for improved parameter estimation and prediction accuracy. [sent-34, score-0.738]

17 For example, Multivariate Regression with Covariance Estimation [17] (MRCE) is a recently proposed method which learns the output structure (in the form of the covariance matrix for correlated noise across multiple outputs) along with the regression coefficients (i. [sent-36, score-0.881]

18 However, MRCE does not explicitly model the relationships among the regression coefficients of the multiple tasks and therefore fails to account for the task structure. [sent-39, score-0.56]

19 More recently, [14] proposed an extension of the MRCE model that allows weighting the individual entries of the regression coefficients and the entries of the output (inverse) covariance matrix, but otherwise this model has essentially the same properties as MRCE. [sent-40, score-0.702]

20 Among other works, Graph-guided Fused Lasso [11] (GFlasso) incorporates task structure to some degree by assuming that the regression coefficients of all the outputs have similar sparsity patterns. [sent-41, score-0.737]

21 This amounts to assuming that all the outputs share almost the same set of relevant features. [sent-42, score-0.188]

22 However, GFlasso assumes that the output graph structure is known, which is rarely true in practice. [sent-43, score-0.218]

23 Some other methods such as [13] take into account the task structure by imposing structural sparsity on the regression coefficients of the multiple tasks but again assume that the output structure is known a priori and/or is of a specific form. [sent-44, score-0.963]

24 In [22], the authors proposed a multitask learning model that explicitly models the task structure as a task covariance matrix, but this model does not take into account the output structure, which is important in multiple-output regression problems. [sent-45, score-1.28]

25 In this paper, we present a multiple-output regression model that allows leveraging both output structure and task structure without assuming a priori knowledge of either. [sent-46, score-0.773]

26 In our model, both output structure and task structure are learned from the data, along with the regression coefficients for each task. [sent-47, score-0.666]

27 Specifically, we model the output structure using the (inverse) covariance matrix of the correlated noise across the multiple outputs, and the task structure using the (inverse) covariance matrix of the regression coefficients of the multiple tasks being learned in the model. [sent-48, score-1.523]

28 By explicitly modeling and learning the output structure and task structure, our model also addresses the limitations of the existing models that typically assume a certain specific type of output structure (e. [sent-49, score-0.551]

29 In particular, a model with task relatedness structure based on shared sparsity of the task weight vectors may not be appropriate in many real applications where all the features are important for prediction and the true task structure is at a higher level (e. [sent-54, score-0.936]

30 , weight vectors for some tasks are closer to each other compared to others). [sent-56, score-0.225]

31 Given the output matrix Y = [y1, . . . , yN]⊤, the goal in multiple-output regression is to learn the functional relationship between the inputs xn ∈ RD and the outputs yn ∈ RK. [sent-65, score-0.579]

32 For a linear regression model, we write: yn = W⊤xn + b + ǫn, ∀n = 1, . . . , N. [sent-66, score-0.235]

33 Here W = [w1, . . . , wK] denotes the D × K matrix where wk denotes the regression coefficient vector of the k-th output, and b = [b1, . . . , bK]⊤ the bias vector. [sent-72, score-0.402]

34 tr(·) denotes matrix trace, 1 an N × 1 vector of all 1s, and R(W) the regularizer on the weight matrix W consisting of the regression weight vectors of all the outputs. [sent-81, score-0.767]

35 3 Multiple-Output Regression with Output and Task Structures To take into account both conditional output covariance and the covariance among the weight vectors W = [w1 , . [sent-83, score-0.998]

36 , wK ], we assume a full covariance matrix Ω of size K × K on the output noise distribution to capture conditional output covariance, and a structured prior distribution on the weight vector matrix W that induces structural regularization of W. [sent-86, score-0.906]

37 In this prior distribution, the Nor(wk | 0, ID) factors regularize the weight vectors wk individually, and the MND×K(W | 0D×K, ID ⊗ Σ) term couples the K weight vectors, allowing them to share statistical strength. [sent-89, score-0.407]

38 observations: ∏_{n=1}^{N} p(yn | xn, W, b) = ∏_{n=1}^{N} Nor(yn | W⊤xn + b, Ω) (4) In the above, a diagonal Ω would imply that the K outputs are all conditionally independent of each other. [sent-93, score-0.193]
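For the record, taking the negative log of the likelihood in Equation 4 (dropping constants) produces exactly the data-fit and log-determinant terms that reappear in the objective below; the factor of 1/2 shown here is absorbed into the trade-off parameters there. A sketch of that step:

```latex
-\log \prod_{n=1}^{N} \mathrm{Nor}\!\left(y_n \mid W^{\top} x_n + b,\ \Omega\right)
  = \frac{1}{2} \sum_{n=1}^{N} (y_n - W^{\top} x_n - b)^{\top} \Omega^{-1} (y_n - W^{\top} x_n - b)
    - \frac{N}{2} \log \lvert \Omega^{-1} \rvert + \mathrm{const}
```

The sum over n is the trace term tr((Y − XW − 1b⊤)Ω−1(Y − XW − 1b⊤)⊤) of the minimization problem below.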

39 Note that in the term tr(WΣ−1 W⊤ ), the inverse covariance matrix Σ−1 plays the role of coupling pairs of weight vectors, and therefore controls the amount of sharing between any pair of tasks. [sent-96, score-0.631]

40 The task covariance matrix Σ as well as the conditional output covariance matrix Ω will be learned from the data. [sent-97, score-0.997]

41 For reasons that will become apparent later, we parameterize our model in terms of the inverse covariance matrices Ω−1 and Σ−1 instead of covariance matrices. [sent-98, score-0.747]

42 The minimization problem is given as arg min_{W, b, Σ−1, Ω−1} tr((Y − XW − 1b⊤)Ω−1(Y − XW − 1b⊤)⊤) − N log |Ω−1| + λ tr(WW⊤) + λ1 tr(WΣ−1W⊤) − D log |Σ−1| + λ2 ||Ω−1||1 + λ3 ||Σ−1||1 (6) where ||A||1 denotes the sum of absolute values of the matrix A. [sent-104, score-0.177]
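To make Equation 6 concrete, here is a minimal NumPy sketch (our own, not the authors' code) that evaluates this objective; the names lam, lam1, lam2, lam3 stand for λ, λ1, λ2, λ3, and we assume the current Ω−1 and Σ−1 iterates are positive definite so the log-determinants are defined.

```python
import numpy as np

def mrots_objective(Y, X, W, b, Omega_inv, Sigma_inv, lam, lam1, lam2, lam3):
    """Evaluate the objective of Equation 6 (a sketch; variable names are ours)."""
    N, D = X.shape
    R = Y - X @ W - np.outer(np.ones(N), b)       # residual matrix Y - XW - 1 b^T
    obj = np.trace(R.T @ R @ Omega_inv)           # tr(R Omega^{-1} R^T) by cyclicity
    obj -= N * np.linalg.slogdet(Omega_inv)[1]    # - N log |Omega^{-1}|
    obj += lam * np.trace(W @ W.T)                # l2 (ridge) penalty on W
    obj += lam1 * np.trace(W @ Sigma_inv @ W.T)   # task-structure coupling term
    obj -= D * np.linalg.slogdet(Sigma_inv)[1]    # - D log |Sigma^{-1}|
    obj += lam2 * np.abs(Omega_inv).sum()         # l1 penalty on Omega^{-1}
    obj += lam3 * np.abs(Sigma_inv).sum()         # l1 penalty on Sigma^{-1}
    return obj
```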

43 Note that by replacing the regularizer tr(WW⊤) with a sparsity-inducing regularizer on the individual weight vectors w1, . [sent-105, score-0.36]

44 , wK , one can also learn Lasso-like sparsity [19] in the regression weights. [sent-108, score-0.336]

45 In this exposition, however, we consider ℓ2 regularization on the regression weights and let the tr(WΣ−1 W⊤ ) term capture the similarity between the weights of two tasks by learning the task inverse covariance matrix Σ−1 . [sent-109, score-0.892]

46 Optimization w.r.t. W: (Ω−1 ⊗ X′X + (λ1Σ−1 + λIK) ⊗ ID) vec(W) = vec(X′(Y − 1b⊤)Ω−1) (8) It is easy to see that with Ω and Σ set to identity, the model becomes equivalent to solving K regularized independent linear regression problems. [sent-140, score-0.318]
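A naive sketch of this W-update: it materializes the KD × KD Kronecker system of Equation 8 and solves it directly, which is workable only for small D and K; the grouping (λ1Σ−1 + λIK) ⊗ ID follows our reading of the equation, and vec here stacks columns (Fortran order).

```python
import numpy as np

def update_W(Y, X, b, Omega_inv, Sigma_inv, lam, lam1):
    """Solve Equation 8 for W given b, Omega^{-1}, Sigma^{-1} (dense sketch)."""
    N, D = X.shape
    K = Y.shape[1]
    XtX = X.T @ X
    # Left-hand operator: Omega^{-1} (x) X'X + (lam1 Sigma^{-1} + lam I_K) (x) I_D
    A = np.kron(Omega_inv, XtX) + np.kron(lam1 * Sigma_inv + lam * np.eye(K), np.eye(D))
    rhs = X.T @ (Y - np.outer(np.ones(N), b)) @ Omega_inv
    w = np.linalg.solve(A, rhs.reshape(-1, order="F"))  # solves for vec(W)
    return w.reshape(D, K, order="F")
```

With Omega_inv and Sigma_inv set to identity matrices, this system decouples column by column into K independent ridge regressions, matching the remark above.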

47 Optimization w.r.t. b when Ω−1, Σ−1 and W are fixed: Given Ω−1, Σ−1, W, the bias vector b for all the K outputs can be obtained by solving the following optimization problem: arg min_b tr((Y − XW − 1b⊤)Ω−1(Y − XW − 1b⊤)⊤) (9) The estimate is b̂ = (1/N)(Y − XW)⊤1. [sent-144, score-0.259]

48 This can be seen as imposing a matrix-variate Gaussian prior on Σ−1/2 with both row and column covariance matrices equal to I to make the solution well defined. [sent-151, score-0.394]

49 The optimization problem for Σ−1 is then given as arg min_{Σ−1} λ1 tr(WΣ−1W⊤) − D log |Σ−1| + λ tr(Σ−1). [sent-153, score-0.327]

50 We use the graphical Lasso procedure [8] to solve Equation 10 to estimate Σ−1; the analogous update for the output structure is Ω̂−1 = gLasso((1/N)(Y − XW − 1b⊤)⊤(Y − XW − 1b⊤), λ2) (14) 4 Experiments In this section, we evaluate our model by comparing it with several relevant baselines on both synthetic and real-world datasets. [sent-160, score-0.229]
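Assembling the updates into the alternating scheme, a minimal fitting loop might look as follows. It reuses update_W from the sketch above and stands in for the paper's gLasso subroutine with scikit-learn's graphical_lasso (one implementation of [8]); the exact way λ, λ1, λ3 enter the Σ−1 update, the alpha scalings, and the fixed iteration count are our assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def fit_mrots(Y, X, lam, lam1, lam2, lam3, n_iters=20):
    """Alternating minimization sketch (uses update_W from the earlier sketch)."""
    N, D = X.shape
    K = Y.shape[1]
    ones = np.ones(N)
    W = np.zeros((D, K))
    b = Y.mean(axis=0)
    Omega_inv = np.eye(K)
    Sigma_inv = np.eye(K)
    for _ in range(n_iters):
        W = update_W(Y, X, b, Omega_inv, Sigma_inv, lam, lam1)  # Equation 8
        b = (Y - X @ W).T @ ones / N                            # closed form, Equation 9
        # Task structure: graphical lasso on the weight scatter (our reading of Eq. 10).
        S_task = (lam1 * (W.T @ W) + lam * np.eye(K)) / D
        _, Sigma_inv = graphical_lasso(S_task, alpha=lam3 / D)
        # Output structure: graphical lasso on the residual covariance (Equation 14).
        R = Y - X @ W - np.outer(ones, b)
        _, Omega_inv = graphical_lasso(R.T @ R / N, alpha=lam2)
    return W, b, Omega_inv, Sigma_inv
```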

51 Our main set of results are on multiple-output regression problems on which we report mean-squared errors averaged across all the outputs. [sent-161, score-0.235]

52 However, since our model also provides an estimate of the conditional inverse covariance structure Ω−1 of the outputs, in Section 4. [sent-162, score-0.598]

53 3 we provide experimental results on the structure recovery task as well. [sent-163, score-0.213]

54 We compare our method with the following baselines: • Independent regressions (RLS): This baseline learns a regularized least squares (RLS) regression model for each output, without assuming any structure among the weight vectors or among the outputs. [sent-164, score-0.694]

55 The weight vector of each individual problem is ℓ2 regularized with a hyperparameter λ. [sent-166, score-0.219]

56 • Multi-task Relationship Learning (MTRL): This method leverages task relationships by assuming a matrix-variate prior on the weight matrix W [22]. [sent-169, score-0.456]

57 We chose this baseline because of its flexibility in modeling the task relationships by “discovering” how the weight vectors are related (via Σ−1 ), rather than assuming a specific structure on them such as shared sparsity [16], low-rank assumption [2], etc. [sent-170, score-0.599]

58 However, MTRL in the multiple-output regression setting cannot take into account the output structure. [sent-171, score-0.403]

59 It is therefore a special case of our model if we assume the output inverse covariance matrix Ω−1 = I. [sent-172, score-0.654]

60 MRCE leverages output structure by assuming a full noise covariance in multiple-output regression and learning it along with the weight matrix W from the data. [sent-176, score-1.047]

61 MRCE, however, cannot take into account the task structure because it cannot capture the relationships among the columns of W. [sent-177, score-0.355]

62 It is therefore a special case of our model if we assume the task inverse covariance matrix Σ−1 = I. [sent-178, score-0.649]

63 We do not compare with the original ℓ1-regularized MRCE [17] to ensure a fair comparison by keeping all the models non-sparse in the weight vectors. [sent-179, score-0.18]

64 We experiment with two variants of our proposed model, one without a sparsity-inducing penalty on the task coupling matrix Σ−1 (called MROTS-I), and the other with the sparse penalty on Σ−1 (called MROTS-II). [sent-181, score-0.428]

65 This is used to ensure that the task inverse covariance matrix estimate Σ̂−1 exists and is robust when the number of response variables K is of the same order as or larger than the input dimension D. [sent-186, score-0.662]

66 First, we generate a random positive definite matrix Σ−1, which will act as the task inverse covariance matrix. [sent-194, score-0.619]

67 We compute the square root S of Σ (Σ = SS, where S is also a symmetric positive definite matrix), and S is used to generate the final weight matrix W as W = VS. [sent-196, score-0.201]

68 This process generates W such that its columns (and therefore the weight vectors for different outputs) are correlated. [sent-198, score-0.187]

69 Then we generate a sparse random positive definite matrix Ω−1 that acts as the conditional inverse covariance matrix on the output noise, making the outputs correlated (given the inputs). [sent-200, score-1.01]

70 The inputs are drawn from a normal distribution and the corresponding multivariate output variables are generated as yi = Wxi + b + ǫi, ∀i = 1, 2, . [sent-204, score-0.202]

71 , N , where ǫi is the correlated noise vector randomly sampled from a zero mean normal distribution with covariance matrix Ω. [sent-207, score-0.428]
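A sketch of this generation process; the dimensions, the i.i.d. standard normal entries of V, and the simple banded construction of the sparse Ω−1 are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
D, K, N = 20, 10, 100                        # illustrative sizes

# Task structure: random positive definite Sigma^{-1}, then Sigma and its root S.
A = rng.standard_normal((K, K))
Sigma_inv = A @ A.T + K * np.eye(K)
Sigma = np.linalg.inv(Sigma_inv)
S = np.real(sqrtm(Sigma))                    # symmetric p.d. root with Sigma = S S

# Correlated weight vectors: W = V S, assuming V has i.i.d. standard normal entries.
V = rng.standard_normal((D, K))
W = V @ S

# Output structure: a sparse positive definite Omega^{-1} (banded, hence p.d. here).
Omega_inv = np.eye(K) + 0.4 * (np.eye(K, k=1) + np.eye(K, k=-1))
Omega = np.linalg.inv(Omega_inv)

# Inputs and outputs: y_i = W^T x_i + b + eps_i with eps_i ~ N(0, Omega).
b = rng.standard_normal(K)
X = rng.standard_normal((N, D))
E = rng.multivariate_normal(np.zeros(K), Omega, size=N)
Y = X @ W + b + E
```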

72 RLS: Independent regression, C&W: Curds and Whey model [3], MTRL: Multi-task relationship learning [22], MRCE-ℓ2: The ℓ2-regularized version of MRCE [17], MROTS-I: our model without sparse penalty on Σ−1, MROTS-II: our model with sparse penalty on Σ−1. [sent-245, score-0.311]

73 2 Real data We also evaluate our model on the following real-world multiple-output regression datasets: • Paper datasets: These are two multivariate multiple-response regression datasets from the paper industry [1]. [sent-261, score-0.62]

74 • Genotype dataset: This dataset has genotypes as input variables and phenotypes or observed traits as output variables [12]. [sent-265, score-0.233]

75 Independent linear regression performs the worst on all synthetic datasets. [sent-270, score-0.359]

76 This mixed behavior of MRCE-ℓ2 and MTRL supports our motivation that both task structure (i. [sent-272, score-0.213]

77 , relationships among weight vectors) and output structure are important in multiple-output regression. [sent-274, score-0.439]

78 Both MTRL and MRCE-ℓ2 are special cases of our model, with the former ignoring the output structure (captured by Ω−1) and the latter ignoring the weight vector relationships (captured by Σ−1). [sent-275, score-0.544]

79 C&W uses CCA to project the response matrix Y to a lower min(D, K)-dimensional space, learning min(D, K) predictors there and then projecting them back to the original K-dimensional space. [sent-284, score-0.179]

80 Both MROTS-I and MROTS-II perform significantly better than the other baselines on the first Paper dataset (9 features and 32 outputs per sample). [sent-288, score-0.229]

81 All models perform comparably on the second Paper dataset (9 features and 13 outputs per sample), which could be due to the absence of a strong task or output structure in this data. [sent-289, score-0.523]

82 On the first synthetic data, the performance gain of our model is more pronounced when the number of training examples is small. [sent-296, score-0.184]

83 The MSE numbers for the first synthetic data are higher than the ones obtained for the second synthetic data because of a difference in the magnitude of error covariances used in the generation of datasets. [sent-298, score-0.248]

84 3 Covariance structure recovery Although not the main goal of the paper, we experiment with the learned inverse covariance matrix of the outputs (given the inputs) as a sanity check on the proposed model. [sent-303, score-0.757]

85 The figure on the right shows the true conditional inverse covariance matrix Ω−1 (top), the matrix Ω̂−1 learned with MROTS (middle), and the precision matrix learned with graphical lasso ignoring the predictors (bottom). [sent-307, score-0.884]

86 Taking into account the regression weights results in a better estimate of the true covariance matrix. [sent-308, score-0.57]

87 We got similar results for MRCE-ℓ2, which also takes into account the predictors while learning the inverse covariance, although MROTS estimates were closer to the ground truth in terms of the Frobenius norm. [sent-309, score-0.253]
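As a simple version of this sanity check, the snippet below (our own construction; alpha is a placeholder regularization level) contrasts a precision matrix estimated from the regression residuals against one estimated from the raw outputs, ignoring the predictors.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def precision_recovery_errors(Y, X, W, b, Omega_inv_true, alpha=0.1):
    """Frobenius errors of two precision estimates: conditioned on inputs vs. not."""
    N = Y.shape[0]
    resid = Y - X @ W - b                       # residuals after the fitted regression
    _, prec_cond = graphical_lasso(resid.T @ resid / N, alpha=alpha)
    Yc = Y - Y.mean(axis=0)                     # centered outputs, predictors ignored
    _, prec_marg = graphical_lasso(Yc.T @ Yc / N, alpha=alpha)
    frob = lambda P: np.linalg.norm(P - Omega_inv_true, ord="fro")
    return frob(prec_cond), frob(prec_marg)
```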

88 Recently, Sohn & Kim [18] proposed a model for jointly estimating the weight vector for each output and the covariance structure of the outputs. [sent-311, score-0.662]

89 However, they assume a shared sparsity structure on the weight vectors. [sent-312, score-0.326]

90 Some other works on conditional graphical model estimation [20, 4] are based on similar structural sparsity assumptions. [sent-314, score-0.28]

91 In contrast, our model does not assume any specific structure on the weight vectors, and by explicitly modeling the covariance structure of the weight vectors, learns the appropriate underlying structure from the data. [sent-315, score-0.865]

92 6 Future Work and Conclusion We have presented a flexible model for multiple-output regression taking into account the covariance structure of the outputs and the covariance structure of the underlying prediction tasks. [sent-316, score-1.27]

93 Our model leads to improved accuracies on multiple-output regression tasks. [sent-318, score-0.265]

94 Learning higher-order graph structure with features by structure penalty. [sent-358, score-0.196]

95 A multivariate regression approach to association analysis of a quantitative trait network. [sent-379, score-0.356]

96 Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. [sent-390, score-0.277]

97 Simultaneous multiple response regression and inverse covariance matrix estimation via penalized gaussian maximum likelihood. [sent-395, score-0.822]

98 Joint estimation of structured sparsity and output structure in multipleoutput regression via inverse-covariance regularization. [sent-423, score-0.642]

99 A sparse conditional gaussian graphical model for analysis of genetical genomics data. [sent-432, score-0.204]

100 A convex formulation for learning task relationships in multi-task learning. [sent-442, score-0.18]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xw', 0.4), ('mtrl', 0.335), ('covariance', 0.287), ('mrce', 0.263), ('regression', 0.235), ('tr', 0.223), ('mrots', 0.168), ('outputs', 0.155), ('ww', 0.146), ('inverse', 0.143), ('rls', 0.128), ('weight', 0.127), ('synthetic', 0.124), ('output', 0.12), ('task', 0.115), ('mse', 0.11), ('sparsity', 0.101), ('structure', 0.098), ('curds', 0.096), ('whey', 0.096), ('wk', 0.093), ('id', 0.085), ('multivariate', 0.082), ('matrix', 0.074), ('synth', 0.072), ('structures', 0.068), ('relationships', 0.065), ('responses', 0.063), ('predictors', 0.062), ('inputs', 0.061), ('cients', 0.061), ('multitask', 0.06), ('vectors', 0.06), ('sohn', 0.058), ('yn', 0.057), ('ik', 0.055), ('regularized', 0.053), ('cca', 0.052), ('ignoring', 0.052), ('coef', 0.05), ('sparse', 0.05), ('account', 0.048), ('genetical', 0.048), ('gflasso', 0.048), ('multipleoutput', 0.048), ('wrls', 0.048), ('arg', 0.045), ('relatedness', 0.045), ('penalty', 0.044), ('priori', 0.044), ('response', 0.043), ('equation', 0.042), ('mn', 0.042), ('geostatistics', 0.042), ('abhishek', 0.042), ('piyush', 0.042), ('lasso', 0.042), ('leverages', 0.042), ('conditional', 0.04), ('estimation', 0.04), ('baselines', 0.039), ('kim', 0.039), ('genotypes', 0.039), ('phenotypes', 0.039), ('trait', 0.039), ('hyperparameter', 0.039), ('xn', 0.038), ('datasets', 0.038), ('tasks', 0.038), ('ii', 0.038), ('regularizer', 0.036), ('correlated', 0.036), ('graphical', 0.036), ('dataset', 0.035), ('variant', 0.035), ('maryland', 0.035), ('levina', 0.035), ('rk', 0.035), ('alternating', 0.034), ('consisting', 0.034), ('structural', 0.033), ('predicting', 0.033), ('imposing', 0.033), ('assuming', 0.033), ('relationship', 0.033), ('prediction', 0.032), ('glasso', 0.032), ('correlations', 0.031), ('noise', 0.031), ('cb', 0.031), ('prices', 0.031), ('training', 0.03), ('vec', 0.03), ('optimization', 0.03), ('model', 0.03), ('hal', 0.029), ('among', 0.029), ('log', 0.029), ('bias', 0.029), ('stock', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1

2 0.20021464 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form intertask covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T . We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks. 1 Introduction and motivation Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9]. Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size. For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling. 1 We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. 
Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions. The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from “collective learning”, and a much slower final part where tasks are learned almost independently. 2 GP regression and Bayes error We consider GP regression for T functions fτ (x), τ = 1, 2, . . . , T . These functions have to be learned from n training examples (x , τ , y ), = 1, . . . , n. Here x is the training input, τ ∈ {1, . . . , T } denotes which task the example relates to, and y is the corresponding training output. We assume that the latter is given by the target function value fτ (x ) corrupted by i.i.d. additive 2 2 Gaussian noise with zero mean and variance στ . This setup allows the noise level στ to depend on the task. In GP regression the prior over the functions fτ (x) is a Gaussian process. This means that for any set of inputs x and task labels τ , the function values {fτ (x )} have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances fτ (x)fτ (x ) = C(τ, x, τ , x ). For this covariance we take the flexible form from [5], fτ (x)fτ (x ) = Dτ τ C(x, x ). Here C(x, x ) determines the covariance between function values at different input points, encoding “spatial” behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix D is a free-form inter-task covariance matrix. One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point x for task τ , one would predict as output the mean of fτ (x) over the (Gaussian) posterior, which is y T K −1 kτ (x). Here K is the n × n Gram matrix with entries 2 K m = Dτ τm C(x , xm ) + στ δ m , while kτ (x) is a vector with the n entries kτ, = Dτ τ C(x , x). The error bar would be taken as the square root of the posterior variance of fτ (x), which is T Vτ (x) = Dτ τ C(x, x) − kτ (x)K −1 kτ (x) (1) The learning curve for task τ is defined as the mean-squared prediction error, averaged over the location of test input x and over all data sets with a specified number of examples for each task, say n1 for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs y are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error ˆτ is the Bayes error, and is given by the average posterior variance [1], i.e. ˆτ = Vτ (x) x . 
To obtain the learning curve this is averaged over the location of the training inputs x : τ = ˆτ . This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in Vτ (x). Note that the training outputs, on the other hand, do not appear in the posterior variance Vτ (x) and so do not need to be averaged over. We now want to write the Bayes error ˆτ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs x , and also the test input x, are drawn from the same distribution P (x). One can decompose the input-dependent part of the covariance function into eigenfunctions relative to P (x), according to C(x, x ) = i λi φi (x)φi (x ). The eigenfunctions are defined by the condition C(x, x )φi (x ) x = λi φi (x) and can be chosen to be orthonormal with respect to P (x), φi (x)φj (x) x = δij . The sum over i here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel C(x, x ) = x · x ). To make the algebra below as simple as possible, we let the eigenvalues λi be arranged in decreasing order and truncate the sum to the finite range i = 1, . . . , M ; M is then some large effective feature space dimension and can be taken to infinity at the end. 2 In terms of the above eigenfunction decomposition, the Gram matrix has elements K m = Dτ 2 λi φi (x )φi (xm )+στ δ τm m δτ = i ,τ φi (x )λi δij Dτ τ φj (xm )δτ 2 ,τm +στ δ m i,τ,j,τ or in matrix form K = ΨLΨT + Σ where Σ is the diagonal matrix from the noise variances and Ψ = δτ ,iτ ,τ φi (x ), Liτ,jτ = λi δij Dτ τ (2) Here Ψ has its second index ranging over M (number of kernel eigenvalues) times T (number of tasks) values; L is a square matrix of this size. In Kronecker (tensor) product notation, L = D ⊗ Λ if we define Λ as the diagonal matrix with entries λi δij . The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, (A ⊗ B)(A ⊗ B ) = (AA ) ⊗ (BB ), (A ⊗ B)−1 = A−1 ⊗ B −1 , and tr (A ⊗ B) = (tr A)(tr B). In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, L can then be viewed as consisting of T × T blocks, each of which is proportional to Λ. To calculate the Bayes error, we need to average the posterior variance Vτ (x) over the test input x. The first term in (1) then becomes Dτ τ C(x, x) = Dτ τ tr Λ. In the second one, we need to average kτ, (x)kτ,m = Dτ τ C(x , x)C(x, xm ) x Dτm τ = x Dτ τ λi λj φi (x ) φi (x)φj (x) x φj (xm )Dτm τ ij = Dτ τ Ψl,iτ λi λj δij Ψm,jτ Dτ τ i,τ ,j,τ T In matrix form this is kτ (x)kτ (x) x = Ψ[(Deτ eT D) ⊗ Λ2 ]ΨT = ΨMτ ΨT Here the last τ equality defines Mτ , and we have denoted by eτ the T -dimensional vector with τ -th component equal to one and all others zero. 
Multiplying by the inverse Gram matrix K −1 and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task τ ˆτ = Vτ (x) x = Dτ τ tr Λ − tr ΨMτ ΨT (ΨLΨT + Σ)−1 Applying the Woodbury identity and re-arranging yields = Dτ τ tr Λ − tr Mτ ΨT Σ−1 Ψ(I + LΨT Σ−1 Ψ)−1 = ˆτ Dτ τ tr Λ − tr Mτ L−1 [I − (I + LΨT Σ−1 Ψ)−1 ] But tr Mτ L−1 = tr {[(Deτ eT D) ⊗ Λ2 ][D ⊗ Λ]−1 } τ = tr {[Deτ eT ] ⊗ Λ} = eT Deτ tr Λ = Dτ τ tr Λ τ τ so the first and second terms in the expression for ˆτ cancel and one has = tr Mτ L−1 (I + LΨT Σ−1 Ψ)−1 = tr L−1 Mτ L−1 (L−1 + ΨT Σ−1 Ψ)−1 = tr [D ⊗ Λ]−1 [(Deτ eT D) ⊗ Λ2 ][D ⊗ Λ]−1 (L−1 + ΨT Σ−1 Ψ)−1 τ = ˆτ tr [eτ eT ⊗ I](L−1 + ΨT Σ−1 Ψ)−1 τ The matrix in square brackets in the last line is just a projector Pτ onto task τ ; thought of as a matrix of T × T blocks (each of size M × M ), this has an identity matrix in the (τ, τ ) block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task τ , ˆτ = tr Pτ (L−1 + ΨT Σ−1 Ψ)−1 (3) Because Σ is diagonal and given the definition (2) of Ψ, the matrix ΨT Σ−1 Ψ is a sum of contributions from the individual training examples = 1, . . . , n. This will be important for deriving the learning curve approximation below. We note in passing that, because τ Pτ = I, the sum of the Bayes errors on all tasks is τ ˆτ = tr (L−1 +ΨT Σ−1 Ψ)−1 , in close analogy to the corresponding expression for the single-task case [13]. 3 3 Learning curve prediction To obtain the learning curve τ = ˆτ , we now need to carry out the average . . . over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix G = (L−1 + ΨT Σ−1 Ψ + τ vτ Pτ )−1 with auxiliary parameters vτ that will be set back to zero at the end. One can then ask how G = G and hence τ = tr Pτ G changes with the number nτ of training points for task τ . Adding an example at position x for task −2 τ increases ΨT Σ−1 Ψ by στ φτ φT , where φτ has elements (φτ )iτ = φi (x)δτ τ . Evaluating the τ −1 −2 difference (G + στ φτ φT )−1 − G with the help of the Woodbury identity and approximating it τ with a derivative gives Gφτ φT G ∂G τ =− 2 ∂nτ στ + φT Gφτ τ This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get 1 ∂G ∂G = 2 ∂nτ στ + tr Pτ G ∂vτ (4) Here we have exploited for the average over x that the matrix φτ φT x has (i, τ ), (j, τ )-entry τ φi (x)φj (x) x δτ τ δτ τ = δij δτ τ δτ τ , hence simply φτ φT x = Pτ . We have also used the τ auxiliary parameters to rewrite − GPτ G = ∂ G /∂vτ = ∂G/∂vτ . Finally, multiplying (4) by Pτ and taking the trace gives the set of quasi-linear partial differential equations ∂ τ 1 = 2 ∂nτ στ + τ ∂ τ ∂vτ (5) The remaining task is now to find the functions τ (n1 , . . . , nT , v1 , . . . , vT ) by solving these differential equations. We initially attempted to do this by tracking the τ as examples are added one task at a time, but the derivation is laborious already for T = 2 and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a 2T -dimensional surface in the 3T -dimensional space (n1 , . . . , nT , v1 , . . . , vT , 1 , . . . , T ), which is specified by the T functions τ (. . .). A small change (δn1 , . . . , δnT , δv1 , . . . , δvT , δ 1 , . . . 
, δ T ) in all 3T coordinates is tangential to this surface if it obeys the T constraints (one for each τ ) δ τ ∂ τ ∂ τ δnτ + δvτ ∂nτ ∂vτ = τ 2 From (5), one sees that this condition is satisfied whenever δ τ = 0 and δnτ = −δvτ (στ + τ ) It follows that all the characteristic curves given by τ (t) = τ,0 = const., vτ (t) = vτ,0 (1 − t), 2 nτ (t) = vτ,0 (στ + τ,0 ) t for t ∈ [0, 1] are tangential to the solution surface for all t, so lie within this surface if the initial point at t = 0 does. Because at t = 0 there are no training examples (nτ (0) = 0), this initial condition is satisfied by setting −1 τ,0 = tr Pτ −1 L + vτ ,0 Pτ τ Because t=1 τ (t) is constant along the characteristic curve, we get by equating the values at t = 0 and −1 τ,0 = tr Pτ L −1 + vτ ,0 Pτ = τ ({nτ = vτ 2 ,0 (στ + τ ,0 )}, {vτ = 0}) τ Expressing vτ ,0 in terms of nτ gives then τ = tr Pτ L−1 + τ nτ 2 στ + −1 Pτ (6) τ This is our main result: a closed set of T self-consistency equations for the average Bayes errors 2 τ . Given L as defined by the eigenvalues λi of the covariance function, the noise levels στ and the 4 number of examples nτ for each task, it is straightforward to solve these equations numerically to find the average Bayes error τ for each task. The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of M × M blocks of size T × T (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues λi . Explicitly, because L−1 = D −1 ⊗ Λ−1 , one has τ λ−1 D −1 + diag({ i = i 4 2 στ nτ + −1 }) τ (7) ττ Results and discussion We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where D is diagonal. Here one recovers T separate equations for the individual tasks as expected, which have the same form as for single-task learning [13]. 4.1 Pure transfer learning Consider now the case of pure transfer learning, where one is learning a task of interest (say τ = 1) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first T0 tasks, n1 = . . . = nT0 = 0, but a large number of examples for the remaining T1 = T − T0 tasks. Denote E = D −1 and write this in block form as E00 E01 E= T E01 E11 2 Now multiply by λ−1 and add in the lower right block a diagonal matrix N = diag({nτ /(στ + i −1 −1 τ )}τ =T0 +1,...,T ). The matrix inverse in (7) then has top left block λi [E00 + E00 E01 (λi N + −1 −1 T T E11 − E01 E00 E01 )−1 E01 E00 ]. As the number of examples for the last T1 tasks grows, so do all −1 (diagonal) elements of N . In the limit only the term λi E00 survives, and summing over i gives −1 −1 1 = tr Λ(E00 )11 = C(x, x) (E00 )11 . The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last T1 tasks fτ (x) (τ = T0 + 1, . . . T ) have been learn perfectly, the posterior over the first T0 functions is, by standard Gaussian conditioning, a GP with covariance C(x, x )(E00 )−1 . Averaging the posterior variance of −1 f1 (x) then gives the Bayes error on task 1 as 1 = C(x, x) (E00 )11 , as found earlier. 
This analysis can be extended to the case where there are some examples available also for the first T0 tasks. One finds for the generalization errors on these tasks the prediction (7) with D −1 replaced by E00 . This is again in line with the above form of the GP posterior after perfect learning of the remaining T1 tasks. 4.2 Two tasks We next analyse how well the approxiation (7) does in predicting multi-task learning curves for T = 2 tasks. Here we have the work of Chai [21] as a baseline, and as there we choose D= 1 ρ ρ 1 The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions f1 (x) and f2 (x) to unit variance; the degree of correlation of the tasks is controlled by ρ. We fix π2 = n2 /n and plot learning curves against n. In numerical simulations we ensure integer values of n1 and n2 by setting n2 = nπ2 , n1 = n − n2 ; for evaluation of (7) we use 2 2 directly n2 = nπ2 , n1 = n(1 − π2 ). For simplicity we consider equal noise levels σ1 = σ2 = σ 2 . As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel C(x, x ) = exp[−(x − x )2 /(2l2 )] with lengthscale l, and one-dimensional inputs x with a Gaussian distribution N (0, 1/12). The kernel eigenvalues λi 5 1 1 1 1 ε1 ε1 0.8 1 1 ε1 ε1 0.8 0.001 1 ε1 0.8 0.001 n 10000 ε1 1 0.01 1 n 10000 0.6 0.6 0.4 0.4 0.4 0.2 0.2 n 1000 0.6 0.2 0 0 100 200 n 300 400 0 500 0 100 200 n 300 400 500 0 0 100 200 n 300 400 500 Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale l = 0.01, noise level σ 2 = 0.05 and a fraction π2 = 0.75 of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation ρ2 = 0, 0.25, 0.5, 0.75, 1 from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show tendency of asymptotic uselessness, i.e. bunching of the ρ < 1 curves towards the one for ρ = 0; this effect is strongest for learning of smooth functions (left and middle). are known explicitly from [22] and decay exponentially with i. Figure 1(left) compares numerically simulated learning curves with the predictions for 1 , the average Bayes error on task 1, from (7). Five pairs of curves are shown, for ρ2 = 0, 0.25, 0.5, 0.75, 1. Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored (ρ = 0) or effectively treated as being from task 1 (ρ = 1). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with ρ2 . The curves for the different ρ2 values are fairly evenly spaced vertically for small number of examples, n, corresponding to a linear dependence on ρ2 . As n increases, however, the learning curves for ρ < 1 start to bunch together and separate from the one for the fully correlated case (ρ = 1). The approximation (7) correctly captures this behaviour, which is discussed in more detail below. Figure 1(middle) has analogous results for the case of inputs x uniformly distributed on the interval [0, 1]; the λi here decay exponentially with i2 [17]. Quantitative agreement between simulations and predictions is better for this case. 
The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance Vτ (x) ; but for a uniform input distribution this variation will be weak except near the ends of the input range [0, 1]. Figure 1(right) displays similar results for an OU kernel C(x, x ) = exp(−|x − x |/l), showing that our predictions also work well when learning rough (nowhere differentiable) functions. 4.3 Asymptotic uselessness The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples n is large, the learning curves seem to bunch towards the curve for ρ = 0, where task 2 examples are ignored, except when the two tasks are fully correlated (ρ = 1). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors τ will become small 2 and eventually be negligible compared to the noise variances στ in (7). One then has an explicit prediction for each τ , without solving T self-consistency equations. If we write, for T tasks, 2 nτ = nπτ with πτ the fraction of examples for task τ , and set γτ = πτ /στ , then for large n τ = i λ−1 D −1 + nΓ i −1 ττ = −1/2 −1 [λi (Γ1/2 DΓ1/2 )−1 i (Γ + nI]−1 Γ−1/2 )τ τ 1/2 where Γ = diag(γ1 , . . . , γT ). Using an eigendecomposition of the symmetric matrix Γ T T a=1 δa va va , one then shows in a few lines that (8) can be written as τ −1 ≈ γτ 2 a (va,τ ) δa g(nδa ) 6 (8) 1/2 DΓ = (9) 1 1 1 50000 ε 5000 r 0.1 ε 0.5 n=500 10 100 1000 n 0.1 0 0 0.2 0.4 ρ 2 0.6 0.8 1 1 10 100 1000 n Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with n = 500) vs ρ2 . To focus on the error reduction with ρ, r = [ 1 (ρ) − 1 (1)]/[ 1 (0) − 1 (1)] is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger n, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: Analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning (T = 200, parameters otherwise as in Fig. 1(left) except ρ2 = 0.8). Notice the bend around 1 = 1 − ρ = 0.106. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order τ = 1, . . . , T rather than randomly); dashed line: predictions from (7). Inset: Predictions for T = 1000, with asymptotic forms = 1 − ρ + ρ˜ and = (1 − ρ)¯ for the two learning stages shown as solid lines. −1 where g(h) = tr (Λ−1 + h)−1 = + h)−1 and va,τ is the τ -th component of the a-th i (λi eigenvector va . This is the general asymptotic form of our prediction for the average Bayes error for task τ . To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order r. The kernel eigenvalues λi then decay as1 i−(2r+2) for large i, and using arguments from [17] one deduces that g(h) ∼ h−α for large h, with α = (2r +1)/(2r + 2). In (9) we can then write, for large n, g(nδa ) ≈ (δa /γτ )−α g(nγτ ) and hence τ ≈ g(nγτ ){ 2 1−α } a (va,τ ) (δa /γτ ) (10) 2 When there is only a single task, δ1 = γ1 and this expression reduces to 1 = g(nγ1 ) = g(n1 /σ1 ). 2 Thus g(nγτ ) = g(nτ /στ ) is the error we would get by ignoring all examples from tasks other than τ , and the term in {. . .} in (10) gives the “multi-task gain”, i.e. the factor by which the error is reduced because of examples from other tasks. 
(The absolute error reduction always vanishes trivially for n → ∞, along with the errors themselves.) One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to r → ∞ and hence α → 1, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues δa of Γ1/2 DΓ1/2 will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against ρ2 for the two-task setting of Fig. 1(left) with n = 500. Our predictions capture the nonlinear dependence on ρ2 quite well, though the effect is somewhat weaker in the simulations. For larger n the predictions approach a curve that is constant for ρ < 1, signifying negligible improvement from multi-task learning except at ρ = 1. It is worth contrasting this with the lower bound from [21], which is linear in ρ2 . While this provides a very good approximation to the learning curves for moderate n [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in ρ2 : see Fig. 2(left, inset) for the OU case, which has r = 1). A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next. 1 See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized. 7 4.4 Many tasks We assume as for the two-task case that all inter-task correlations, Dτ,τ with τ = τ , are equal to ρ, while Dτ,τ = 1. This setup was used e.g. in [23], and can be interpreted as each task having a √ component proportional to ρ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number nτ = n/T of examples for 2 each task, and that all noise levels are the same, στ = σ 2 . Then also all Bayes errors τ = will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as = gT (n/(σ 2 + ), ρ) (11) where gT (h, ρ) is related to the single-task function g(h) from above by gT (h, ρ) = 1−ρ T −1 (1 − ρ)g(h(1 − ρ)/T ) + ρ + T T g(h[ρ + (1 − ρ)/T ]) (12) Now consider the limit T → ∞ of many tasks. If n and hence h = n/(σ 2 + ) is kept fixed, gT (h, ρ) → (1 − ρ) + ρg(hρ); here we have taken g(0) = 1 which corresponds to tr Λ = C(x, x) x = 1 as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form = (1 − ρ) + ρ˜, where ˜ decays from one to zero with increasing n as for a single task, but with an effective noise level σ 2 = (1 − ρ + σ 2 )/ρ. Remarkably, then, ˜ even though here n/T → 0 so that for most tasks no examples have been seen, the Bayes error for each task decreases by “collective learning” to a plateau of height 1 − ρ. The remaining decay of to zero happens only once n becomes of order T . Here one can show, by taking T → ∞ at fixed h/T in (12) and inserting into (11), that = (1 − ρ)¯ where ¯ again decays as for a single task but with an effective number of examples n = n/T and effective noise level σ 2 /(1 − ρ). This final stage of ¯ ¯ learning therefore happens only when each task has seen a considerable number of exampes n/T . Fig. 
2(right) validates these predictions against simulations, for a number of tasks (T = 200) that is in the same ballpark as in the many-tasks application example of [24]. The inset for T = 1000 shows clearly how the two learning curve stages separate as T becomes larger. Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order r as before, the function ¯ from above will decay as (¯ /¯ 2 )−α ; since = (1 − ρ)¯ and σ 2 = σ 2 /(1 − ρ), the Bayes error is then proportional n σ ¯ to (1 − ρ)1−α . This multi-task gain again approaches unity for ρ < 1 for smooth functions (α = (2r + 1)/(2r + 2) → 1). Interestingly, for rough functions (α < 1), the multi-task gain decreases for small ρ2 as 1 − (1 − α) ρ2 and so always lies below a linear dependence on ρ2 initially. This shows that a linear-in-ρ2 lower error bound cannot generally apply to T > 2 tasks, and indeed one can verify that the derivation in [21] does not extend to this case. 5 Conclusion We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices D. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated. For the limit of many tasks we found that, remarkably, some initial “collective learning” is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in ρ2 , the square of the inter-task correlation, is applicable only to the two-task setting T = 2. In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations D, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations D would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations to not factorize. 8 References [1] C K I Williams and C Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006. [2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000. [3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008. [4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005. [5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007. 
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press. [7] M Alvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press. [8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310– 325. Springer Berlin, Heidelberg, 2011. ´ [9] M A Alvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012. [10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006. [11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press. [12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press. [13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press. [14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press. [15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press. [16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002. [17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002. [18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press. [19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005. [20] M Urry and P Sollich. Exact larning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press. [21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009. [22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998. [23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. 
In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241. Morgan Kaufmann, 1998.

3 0.18450566 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

Author: Po-ling Loh, Martin J. Wainwright

Abstract: We investigate a curious relationship between the structure of a discrete graphical model and the support of the inverse of a generalized covariance matrix. We show that for certain graph structures, the support of the inverse covariance matrix of indicator variables on the vertices of a graph reflects the conditional independence structure of the graph. Our work extends results that have previously been established only in the context of multivariate Gaussian graphical models, thereby addressing an open question about the significance of the inverse covariance matrix of a non-Gaussian distribution. Based on our population-level results, we show how the graphical Lasso may be used to recover the edge structure of certain classes of discrete graphical models, and present simulations to verify our theoretical results. 1
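A rough schematic of the estimation strategy this abstract describes (run the graphical Lasso on indicator encodings of discrete variables and read edges off the support of the estimated precision matrix) can be written with scikit-learn. The toy chain-structured data, the regularization value, and the thresholding are all invented for illustration; this is not the authors' experimental setup.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# toy binary chain X1 -> X2 -> X3; each step flips with probability 0.2
n = 5000
x1 = rng.integers(0, 2, n)
x2 = (x1 + (rng.random(n) < 0.2)) % 2
x3 = (x2 + (rng.random(n) < 0.2)) % 2
X = np.column_stack([x1, x2, x3]).astype(float)  # vertex indicator variables

model = GraphicalLasso(alpha=0.01).fit(X)
# support of the inverse covariance should mirror the chain: no (1,3) edge
print(np.abs(model.precision_) > 1e-3)
```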

4 0.17783482 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an $O(\log(1/\varepsilon))$ iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned. 1
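The proximal gradient iteration the abstract refers to is short to write down. Below is a simplified sketch of ℓ1-penalized inverse covariance estimation: it uses a fixed step size with a crude positive-definiteness safeguard in place of the paper's eigenvalue-based step-size analysis and backtracking, so it illustrates the iteration rather than reproducing G-ISTA itself.

```python
import numpy as np

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def ista_inverse_cov(S, lam, step=0.1, iters=500):
    """Minimize -logdet(Theta) + tr(S Theta) + lam * ||Theta||_1 by
    proximal gradient; the smooth part has gradient S - inv(Theta)."""
    Theta = np.diag(1.0 / (np.diag(S) + lam))     # common PD initialization
    for _ in range(iters):
        G = S - np.linalg.inv(Theta)              # gradient of smooth part
        Theta_new = soft_threshold(Theta - step * G, step * lam)
        if np.min(np.linalg.eigvalsh(Theta_new)) <= 0:
            step *= 0.5                           # retry with a smaller step
            continue
        Theta = Theta_new
    return Theta
```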

5 0.15363544 319 nips-2012-Sparse Prediction with the $k$-Support Norm

Author: Andreas Argyriou, Rina Foygel, Nathan Srebro

Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an ℓ2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1
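For concreteness, the k-support norm can be computed directly from the sorted absolute entries of a vector. The sketch below follows the sorted-coordinate characterization as we recall it from the paper (find the unique r in {0, ..., k-1} splitting the sorted entries into a quadratic head and an averaged tail); it is our own unverified implementation, not the authors' code.

```python
import numpy as np

def k_support_norm(w, k):
    """||w||_k^sp via the sorted-entry characterization: unique r with
    |w|_(k-r-1) > tail/(r+1) >= |w|_(k-r), tail = sum_{i>=k-r} |w|_(i)."""
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]
    for r in range(k):
        tail = a[k - r - 1:].sum()                       # 1-indexed i = k-r .. d
        head_val = np.inf if k - r - 2 < 0 else a[k - r - 2]
        if head_val > tail / (r + 1) >= a[k - r - 1]:
            return np.sqrt((a[:k - r - 1] ** 2).sum() + tail ** 2 / (r + 1))
    raise RuntimeError("no valid r found; require 1 <= k <= len(w)")

w = np.array([3.0, -2.0, 0.5, 0.1])
print(k_support_norm(w, 1), np.abs(w).sum())     # k = 1 recovers the l1 norm
print(k_support_norm(w, 4), np.linalg.norm(w))   # k = d recovers the l2 norm
```

The two printed checks (ℓ1 at k = 1, ℓ2 at k = d) match the interpolation property claimed for the norm.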

6 0.12950288 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

7 0.12274063 254 nips-2012-On the Sample Complexity of Robust PCA

8 0.11229987 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

9 0.095324017 16 nips-2012-A Polynomial-time Form of Robust Regression

10 0.095096998 247 nips-2012-Nonparametric Reduced Rank Regression

11 0.094763666 330 nips-2012-Supervised Learning with Similarity Functions

12 0.091383375 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

13 0.090917125 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

14 0.089131422 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

15 0.089082569 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

16 0.088070013 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

17 0.084103256 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

18 0.08095821 304 nips-2012-Selecting Diverse Features via Spectral Regularization

19 0.080937035 86 nips-2012-Convex Multi-view Subspace Learning

20 0.076851398 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.224), (1, 0.072), (2, 0.064), (3, -0.082), (4, 0.001), (5, 0.082), (6, 0.014), (7, -0.004), (8, -0.064), (9, -0.152), (10, -0.106), (11, -0.062), (12, -0.054), (13, 0.051), (14, 0.083), (15, 0.045), (16, 0.135), (17, 0.101), (18, -0.007), (19, -0.133), (20, 0.053), (21, 0.102), (22, 0.01), (23, 0.049), (24, 0.012), (25, -0.15), (26, 0.133), (27, -0.105), (28, -0.068), (29, -0.045), (30, -0.035), (31, 0.007), (32, 0.086), (33, -0.095), (34, 0.057), (35, 0.051), (36, -0.039), (37, 0.022), (38, -0.061), (39, -0.164), (40, -0.001), (41, 0.018), (42, 0.01), (43, -0.023), (44, -0.037), (45, 0.094), (46, -0.044), (47, 0.045), (48, -0.052), (49, -0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96905726 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1

2 0.70627666 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form inter-task covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T. We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by "collective learning" even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks.

1 Introduction and motivation

Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9]. Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size. For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling.

We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks.
Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions. The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from "collective learning", and a much slower final part where tasks are learned almost independently.

2 GP regression and Bayes error

We consider GP regression for T functions $f_\tau(x)$, $\tau = 1, 2, \dots, T$. These functions have to be learned from n training examples $(x_\ell, \tau_\ell, y_\ell)$, $\ell = 1, \dots, n$. Here $x_\ell$ is the training input, $\tau_\ell \in \{1, \dots, T\}$ denotes which task the example relates to, and $y_\ell$ is the corresponding training output. We assume that the latter is given by the target function value $f_{\tau_\ell}(x_\ell)$ corrupted by i.i.d. additive Gaussian noise with zero mean and variance $\sigma_{\tau_\ell}^2$. This setup allows the noise level $\sigma_\tau^2$ to depend on the task. In GP regression the prior over the functions $f_\tau(x)$ is a Gaussian process. This means that for any set of inputs $x_\ell$ and task labels $\tau_\ell$, the function values $\{f_{\tau_\ell}(x_\ell)\}$ have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances $\langle f_\tau(x) f_{\tau'}(x') \rangle = C(\tau, x, \tau', x')$. For this covariance we take the flexible form from [5], $\langle f_\tau(x) f_{\tau'}(x') \rangle = D_{\tau\tau'} C(x, x')$. Here $C(x, x')$ determines the covariance between function values at different input points, encoding "spatial" behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix D is a free-form inter-task covariance matrix.

One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point x for task τ, one would predict as output the mean of $f_\tau(x)$ over the (Gaussian) posterior, which is $\mathbf{y}^T K^{-1} k_\tau(x)$. Here K is the n × n Gram matrix with entries $K_{\ell m} = D_{\tau_\ell \tau_m} C(x_\ell, x_m) + \sigma_{\tau_\ell}^2 \delta_{\ell m}$, while $k_\tau(x)$ is a vector with the n entries $k_{\tau,\ell} = D_{\tau_\ell \tau} C(x_\ell, x)$. The error bar would be taken as the square root of the posterior variance of $f_\tau(x)$, which is

$V_\tau(x) = D_{\tau\tau} C(x, x) - k_\tau^T(x) K^{-1} k_\tau(x)$   (1)

The learning curve for task τ is defined as the mean-squared prediction error, averaged over the location of test input x and over all data sets with a specified number of examples for each task, say $n_1$ for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs $y_\ell$ are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error $\hat\epsilon_\tau$ is the Bayes error, and is given by the average posterior variance [1], i.e. $\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x$.
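Eq. (1) is direct to evaluate numerically for the factorized covariance $D_{\tau\tau'} C(x, x')$. The helper below is a minimal sketch of our own (a squared exponential input kernel and all names and parameter values are assumptions made for the example, not taken from the paper).

```python
import numpy as np

def se_kernel(X, Y, l=0.1):
    """Squared exponential input kernel C(x, x') for 1-D inputs."""
    return np.exp(-((X[:, None] - Y[None, :]) ** 2) / (2 * l ** 2))

def posterior_variance(x_star, tau_star, X, tasks, D, sigma2, l=0.1):
    """Eq. (1): V_tau(x) = D_tautau C(x,x) - k^T K^{-1} k."""
    K = D[np.ix_(tasks, tasks)] * se_kernel(X, X, l) + np.diag(sigma2[tasks])
    k = D[tasks, tau_star] * se_kernel(X, np.array([x_star]), l).ravel()
    return D[tau_star, tau_star] - k @ np.linalg.solve(K, k)  # C(x,x) = 1 here
```

Averaging this quantity over test points x drawn from P(x) gives a Monte Carlo estimate of the Bayes error $\hat\epsilon_\tau$ for one training set, which is what the learning-curve average is taken over.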
To obtain the learning curve this is averaged over the location of the training inputs $x_\ell$: $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$. This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in $V_\tau(x)$. Note that the training outputs, on the other hand, do not appear in the posterior variance $V_\tau(x)$ and so do not need to be averaged over.

We now want to write the Bayes error $\hat\epsilon_\tau$ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs $x_\ell$, and also the test input x, are drawn from the same distribution P(x). One can decompose the input-dependent part of the covariance function into eigenfunctions relative to P(x), according to $C(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x')$. The eigenfunctions are defined by the condition $\langle C(x, x') \phi_i(x') \rangle_{x'} = \lambda_i \phi_i(x)$ and can be chosen to be orthonormal with respect to P(x), $\langle \phi_i(x) \phi_j(x) \rangle_x = \delta_{ij}$. The sum over i here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel $C(x, x') = x \cdot x'$). To make the algebra below as simple as possible, we let the eigenvalues $\lambda_i$ be arranged in decreasing order and truncate the sum to the finite range $i = 1, \dots, M$; M is then some large effective feature space dimension and can be taken to infinity at the end.

In terms of the above eigenfunction decomposition, the Gram matrix has elements

$K_{\ell m} = D_{\tau_\ell \tau_m} \sum_i \lambda_i \phi_i(x_\ell) \phi_i(x_m) + \sigma_{\tau_\ell}^2 \delta_{\ell m} = \sum_{i,\tau,j,\tau'} \phi_i(x_\ell)\, \delta_{\tau_\ell \tau}\, \lambda_i \delta_{ij} D_{\tau\tau'}\, \phi_j(x_m)\, \delta_{\tau' \tau_m} + \sigma_{\tau_\ell}^2 \delta_{\ell m}$

or in matrix form $K = \Psi L \Psi^T + \Sigma$, where Σ is the diagonal matrix from the noise variances and

$\Psi_{\ell, i\tau} = \delta_{\tau_\ell \tau}\, \phi_i(x_\ell), \qquad L_{i\tau, j\tau'} = \lambda_i \delta_{ij} D_{\tau\tau'}$   (2)

Here Ψ has its second index ranging over M (number of kernel eigenvalues) times T (number of tasks) values; L is a square matrix of this size. In Kronecker (tensor) product notation, $L = D \otimes \Lambda$ if we define Λ as the diagonal matrix with entries $\lambda_i \delta_{ij}$. The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, $(A \otimes B)(A' \otimes B') = (AA') \otimes (BB')$, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, and $\mathrm{tr}\,(A \otimes B) = (\mathrm{tr}\, A)(\mathrm{tr}\, B)$. In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, L can then be viewed as consisting of T × T blocks, each of which is proportional to Λ.

To calculate the Bayes error, we need to average the posterior variance $V_\tau(x)$ over the test input x. The first term in (1) then becomes $D_{\tau\tau} \langle C(x, x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\, \Lambda$. In the second one, we need to average

$\langle k_{\tau,\ell}(x)\, k_{\tau,m}(x) \rangle_x = D_{\tau_\ell \tau} \langle C(x_\ell, x)\, C(x, x_m) \rangle_x D_{\tau_m \tau} = D_{\tau_\ell \tau} \sum_{ij} \lambda_i \lambda_j\, \phi_i(x_\ell) \langle \phi_i(x) \phi_j(x) \rangle_x \phi_j(x_m)\, D_{\tau_m \tau} = \sum_{i,\tau',j,\tau''} \Psi_{\ell, i\tau'} D_{\tau'\tau}\, \lambda_i \lambda_j \delta_{ij}\, \Psi_{m, j\tau''} D_{\tau''\tau}$

In matrix form this is $\langle k_\tau(x)\, k_\tau^T(x) \rangle_x = \Psi [(D e_\tau e_\tau^T D) \otimes \Lambda^2] \Psi^T = \Psi M_\tau \Psi^T$. Here the last equality defines $M_\tau$, and we have denoted by $e_\tau$ the T-dimensional vector with τ-th component equal to one and all others zero.
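The Kronecker structure of Eq. (2) can be sanity-checked in a few lines of numpy. The random features standing in for the eigenfunction values $\phi_i(x_\ell)$ below are purely illustrative; the point of the sketch is that $\Psi L \Psi^T + \Sigma$ reproduces the entrywise Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, n = 5, 3, 40                         # eigenvalues, tasks, examples
lams = 1.0 / np.arange(1, M + 1) ** 2      # assumed eigenvalue spectrum
D = 0.5 * np.eye(T) + 0.5                  # inter-task covariance, rho = 0.5
tasks = rng.integers(0, T, size=n)
Phi = rng.normal(size=(n, M))              # stand-in for phi_i(x_l)
sigma2 = 0.1 * np.ones(T)

# Psi_{l, (tau, i)} = delta_{tau_l, tau} phi_i(x_l), task-major flattening
Psi = np.zeros((n, T * M))
for l in range(n):
    Psi[l, tasks[l] * M:(tasks[l] + 1) * M] = Phi[l]

L = np.kron(D, np.diag(lams))              # L = D (x) Lambda
K = Psi @ L @ Psi.T + np.diag(sigma2[tasks])

# direct construction: K_lm = D_{tau_l tau_m} sum_i lam_i phi_i(x_l) phi_i(x_m)
K_direct = D[np.ix_(tasks, tasks)] * ((Phi * lams) @ Phi.T) \
    + np.diag(sigma2[tasks])
assert np.allclose(K, K_direct)
```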
Multiplying by the inverse Gram matrix $K^{-1}$ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task τ

$\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\, \Lambda - \mathrm{tr}\, \Psi M_\tau \Psi^T (\Psi L \Psi^T + \Sigma)^{-1}$

Applying the Woodbury identity and re-arranging yields

$\hat\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\, \Lambda - \mathrm{tr}\, M_\tau \Psi^T \Sigma^{-1} \Psi (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = D_{\tau\tau}\, \mathrm{tr}\, \Lambda - \mathrm{tr}\, M_\tau L^{-1} [I - (I + L \Psi^T \Sigma^{-1} \Psi)^{-1}]$

But

$\mathrm{tr}\, M_\tau L^{-1} = \mathrm{tr}\, \{[(D e_\tau e_\tau^T D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}\} = \mathrm{tr}\, \{[D e_\tau e_\tau^T] \otimes \Lambda\} = e_\tau^T D e_\tau\, \mathrm{tr}\, \Lambda = D_{\tau\tau}\, \mathrm{tr}\, \Lambda$

so the first and second terms in the expression for $\hat\epsilon_\tau$ cancel and one has

$\hat\epsilon_\tau = \mathrm{tr}\, M_\tau L^{-1} (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, L^{-1} M_\tau L^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, [D \otimes \Lambda]^{-1} [(D e_\tau e_\tau^T D) \otimes \Lambda^2] [D \otimes \Lambda]^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, [e_\tau e_\tau^T \otimes I] (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$

The matrix in square brackets in the last line is just a projector $P_\tau$ onto task τ; thought of as a matrix of T × T blocks (each of size M × M), this has an identity matrix in the (τ, τ) block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task τ,

$\hat\epsilon_\tau = \mathrm{tr}\, P_\tau (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$   (3)

Because Σ is diagonal and given the definition (2) of Ψ, the matrix $\Psi^T \Sigma^{-1} \Psi$ is a sum of contributions from the individual training examples $\ell = 1, \dots, n$. This will be important for deriving the learning curve approximation below. We note in passing that, because $\sum_\tau P_\tau = I$, the sum of the Bayes errors on all tasks is $\sum_\tau \hat\epsilon_\tau = \mathrm{tr}\, (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$, we now need to carry out the average $\langle \dots \rangle$ over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix $G = (L^{-1} + \Psi^T \Sigma^{-1} \Psi + \sum_\tau v_\tau P_\tau)^{-1}$ with auxiliary parameters $v_\tau$ that will be set back to zero at the end. One can then ask how $\langle G \rangle$ and hence $\epsilon_\tau = \mathrm{tr}\, P_\tau \langle G \rangle$ changes with the number $n_\tau$ of training points for task τ. Adding an example at position x for task τ' increases $\Psi^T \Sigma^{-1} \Psi$ by $\sigma_{\tau'}^{-2} \phi_{\tau'} \phi_{\tau'}^T$, where $\phi_{\tau'}$ has elements $(\phi_{\tau'})_{i\tau} = \phi_i(x)\, \delta_{\tau \tau'}$. Evaluating the difference $(G^{-1} + \sigma_{\tau'}^{-2} \phi_{\tau'} \phi_{\tau'}^T)^{-1} - G$ with the help of the Woodbury identity and approximating it with a derivative gives

$\frac{\partial G}{\partial n_{\tau'}} = - \frac{G \phi_{\tau'} \phi_{\tau'}^T G}{\sigma_{\tau'}^2 + \phi_{\tau'}^T G \phi_{\tau'}}$

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

$\frac{\partial \langle G \rangle}{\partial n_{\tau'}} = \frac{1}{\sigma_{\tau'}^2 + \mathrm{tr}\, P_{\tau'} \langle G \rangle}\, \frac{\partial \langle G \rangle}{\partial v_{\tau'}}$   (4)

Here we have exploited for the average over x that the matrix $\langle \phi_{\tau'} \phi_{\tau'}^T \rangle_x$ has (i, τ), (j, τ'')-entry $\langle \phi_i(x) \phi_j(x) \rangle_x \delta_{\tau \tau'} \delta_{\tau'' \tau'} = \delta_{ij} \delta_{\tau \tau'} \delta_{\tau'' \tau'}$, hence simply $\langle \phi_{\tau'} \phi_{\tau'}^T \rangle_x = P_{\tau'}$. We have also used the auxiliary parameters to rewrite $-\langle G P_{\tau'} G \rangle = \partial \langle G \rangle / \partial v_{\tau'}$. Finally, multiplying (4) by $P_\tau$ and taking the trace gives the set of quasi-linear partial differential equations

$\frac{\partial \epsilon_\tau}{\partial n_{\tau'}} = \frac{1}{\sigma_{\tau'}^2 + \epsilon_{\tau'}}\, \frac{\partial \epsilon_\tau}{\partial v_{\tau'}}$   (5)

The remaining task is now to find the functions $\epsilon_\tau(n_1, \dots, n_T, v_1, \dots, v_T)$ by solving these differential equations. We initially attempted to do this by tracking the $\epsilon_\tau$ as examples are added one task at a time, but the derivation is laborious already for T = 2 and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a 2T-dimensional surface in the 3T-dimensional space $(n_1, \dots, n_T, v_1, \dots, v_T, \epsilon_1, \dots, \epsilon_T)$, which is specified by the T functions $\epsilon_\tau(\dots)$. A small change $(\delta n_1, \dots, \delta n_T, \delta v_1, \dots, \delta v_T, \delta\epsilon_1, \dots, \delta\epsilon_T)$ in all 3T coordinates is tangential to this surface if it obeys the T constraints (one for each τ)
$\delta\epsilon_\tau = \sum_{\tau'} \left( \frac{\partial \epsilon_\tau}{\partial n_{\tau'}}\, \delta n_{\tau'} + \frac{\partial \epsilon_\tau}{\partial v_{\tau'}}\, \delta v_{\tau'} \right)$

From (5), one sees that this condition is satisfied whenever $\delta\epsilon_\tau = 0$ and $\delta n_{\tau'} = -\delta v_{\tau'} (\sigma_{\tau'}^2 + \epsilon_{\tau'})$. It follows that all the characteristic curves given by $\epsilon_\tau(t) = \epsilon_{\tau,0} = \text{const.}$, $v_\tau(t) = v_{\tau,0}(1 - t)$, $n_\tau(t) = v_{\tau,0} (\sigma_\tau^2 + \epsilon_{\tau,0})\, t$ for $t \in [0, 1]$ are tangential to the solution surface for all t, so lie within this surface if the initial point at t = 0 does. Because at t = 0 there are no training examples ($n_\tau(0) = 0$), this initial condition is satisfied by setting

$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \right)^{-1}$

Because $\epsilon_\tau(t)$ is constant along the characteristic curve, we get by equating the values at t = 0 and t = 1

$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \right)^{-1} = \epsilon_\tau(\{n_{\tau'} = v_{\tau',0}(\sigma_{\tau'}^2 + \epsilon_{\tau',0})\}, \{v_{\tau'} = 0\})$

Expressing $v_{\tau',0}$ in terms of $n_{\tau'}$ gives then

$\epsilon_\tau = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}}\, P_{\tau'} \right)^{-1}$   (6)

This is our main result: a closed set of T self-consistency equations for the average Bayes errors $\epsilon_\tau$. Given L as defined by the eigenvalues $\lambda_i$ of the covariance function, the noise levels $\sigma_\tau^2$ and the number of examples $n_\tau$ for each task, it is straightforward to solve these equations numerically to find the average Bayes error $\epsilon_\tau$ for each task.

The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of M × M blocks of size T × T (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues $\lambda_i$. Explicitly, because $L^{-1} = D^{-1} \otimes \Lambda^{-1}$, one has

$\epsilon_\tau = \sum_i \left[ \left( \lambda_i^{-1} D^{-1} + \mathrm{diag}\left(\left\{ \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} \right\}\right) \right)^{-1} \right]_{\tau\tau}$   (7)

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where D is diagonal. Here one recovers T separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].

4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say τ = 1) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first $T_0$ tasks, $n_1 = \dots = n_{T_0} = 0$, but a large number of examples for the remaining $T_1 = T - T_0$ tasks. Denote $E = D^{-1}$ and write this in block form as

$E = \begin{pmatrix} E_{00} & E_{01} \\ E_{01}^T & E_{11} \end{pmatrix}$

Now multiply by $\lambda_i^{-1}$ and add in the lower right block a diagonal matrix $N = \mathrm{diag}(\{n_\tau/(\sigma_\tau^2 + \epsilon_\tau)\}_{\tau = T_0+1, \dots, T})$. The matrix inverse in (7) then has top left block $\lambda_i [E_{00}^{-1} + E_{00}^{-1} E_{01} (\lambda_i N + E_{11} - E_{01}^T E_{00}^{-1} E_{01})^{-1} E_{01}^T E_{00}^{-1}]$. As the number of examples for the last $T_1$ tasks grows, so do all (diagonal) elements of N. In the limit only the term $\lambda_i E_{00}^{-1}$ survives, and summing over i gives $\epsilon_1 = \mathrm{tr}\,\Lambda\, (E_{00}^{-1})_{11} = \langle C(x, x) \rangle\, (E_{00}^{-1})_{11}$. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last $T_1$ tasks $f_\tau(x)$ ($\tau = T_0 + 1, \dots, T$) have been learnt perfectly, the posterior over the first $T_0$ functions is, by standard Gaussian conditioning, a GP with covariance $C(x, x')\, E_{00}^{-1}$. Averaging the posterior variance of $f_1(x)$ then gives the Bayes error on task 1 as $\epsilon_1 = \langle C(x, x) \rangle\, (E_{00}^{-1})_{11}$, as found earlier.
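The self-consistency equations (7) are straightforward to solve by fixed-point iteration starting from the n = 0 value $\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$. The sketch below is our own illustration; the eigenvalue spectrum, noise levels and example counts are made-up placeholders.

```python
import numpy as np

def bayes_errors(lams, D, sigma2, n_per_task, iters=500, tol=1e-12):
    """Fixed-point solution of the self-consistency equations (7)."""
    Dinv = np.linalg.inv(D)
    eps = np.diag(D) * lams.sum()          # n = 0 value: D_tautau * tr(Lambda)
    for _ in range(iters):
        N = np.diag(n_per_task / (sigma2 + eps))
        new = sum(np.diag(np.linalg.inv(Dinv / lam + N)) for lam in lams)
        if np.max(np.abs(new - eps)) < tol:
            return new
        eps = new
    return eps

# example: two tasks with correlation 0.7 and an assumed power-law spectrum
lams = 1.0 / np.arange(1, 1001) ** 2
D = np.array([[1.0, 0.7], [0.7, 1.0]])
print(bayes_errors(lams, D, sigma2=np.array([0.05, 0.05]),
                   n_per_task=np.array([25.0, 75.0])))
```

Setting one task's example count to zero reproduces the pure-transfer lower limit discussed above as the other counts grow.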
This analysis can be extended to the case where there are some examples available also for the first $T_0$ tasks. One finds for the generalization errors on these tasks the prediction (7) with $D^{-1}$ replaced by $E_{00}$. This is again in line with the above form of the GP posterior after perfect learning of the remaining $T_1$ tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for T = 2 tasks. Here we have the work of Chai [21] as a baseline, and as there we choose

$D = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions $f_1(x)$ and $f_2(x)$ to unit variance; the degree of correlation of the tasks is controlled by ρ. We fix $\pi_2 = n_2/n$ and plot learning curves against n. In numerical simulations we ensure integer values of $n_1$ and $n_2$ by setting $n_2 = \lfloor n\pi_2 \rfloor$, $n_1 = n - n_2$; for evaluation of (7) we use directly $n_2 = n\pi_2$, $n_1 = n(1 - \pi_2)$. For simplicity we consider equal noise levels $\sigma_1^2 = \sigma_2^2 = \sigma^2$. As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel $C(x, x') = \exp[-(x - x')^2/(2l^2)]$ with lengthscale l, and one-dimensional inputs x with a Gaussian distribution N(0, 1/12).
The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance Vτ (x) ; but for a uniform input distribution this variation will be weak except near the ends of the input range [0, 1]. Figure 1(right) displays similar results for an OU kernel C(x, x ) = exp(−|x − x |/l), showing that our predictions also work well when learning rough (nowhere differentiable) functions. 4.3 Asymptotic uselessness The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples n is large, the learning curves seem to bunch towards the curve for ρ = 0, where task 2 examples are ignored, except when the two tasks are fully correlated (ρ = 1). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors τ will become small 2 and eventually be negligible compared to the noise variances στ in (7). One then has an explicit prediction for each τ , without solving T self-consistency equations. If we write, for T tasks, 2 nτ = nπτ with πτ the fraction of examples for task τ , and set γτ = πτ /στ , then for large n τ = i λ−1 D −1 + nΓ i −1 ττ = −1/2 −1 [λi (Γ1/2 DΓ1/2 )−1 i (Γ + nI]−1 Γ−1/2 )τ τ 1/2 where Γ = diag(γ1 , . . . , γT ). Using an eigendecomposition of the symmetric matrix Γ T T a=1 δa va va , one then shows in a few lines that (8) can be written as τ −1 ≈ γτ 2 a (va,τ ) δa g(nδa ) 6 (8) 1/2 DΓ = (9) 1 1 1 50000 ε 5000 r 0.1 ε 0.5 n=500 10 100 1000 n 0.1 0 0 0.2 0.4 ρ 2 0.6 0.8 1 1 10 100 1000 n Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with n = 500) vs ρ2 . To focus on the error reduction with ρ, r = [ 1 (ρ) − 1 (1)]/[ 1 (0) − 1 (1)] is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger n, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: Analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning (T = 200, parameters otherwise as in Fig. 1(left) except ρ2 = 0.8). Notice the bend around 1 = 1 − ρ = 0.106. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order τ = 1, . . . , T rather than randomly); dashed line: predictions from (7). Inset: Predictions for T = 1000, with asymptotic forms = 1 − ρ + ρ˜ and = (1 − ρ)¯ for the two learning stages shown as solid lines. −1 where g(h) = tr (Λ−1 + h)−1 = + h)−1 and va,τ is the τ -th component of the a-th i (λi eigenvector va . This is the general asymptotic form of our prediction for the average Bayes error for task τ . To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order r. The kernel eigenvalues λi then decay as1 i−(2r+2) for large i, and using arguments from [17] one deduces that g(h) ∼ h−α for large h, with α = (2r +1)/(2r + 2). In (9) we can then write, for large n, g(nδa ) ≈ (δa /γτ )−α g(nγτ ) and hence τ ≈ g(nγτ ){ 2 1−α } a (va,τ ) (δa /γτ ) (10) 2 When there is only a single task, δ1 = γ1 and this expression reduces to 1 = g(nγ1 ) = g(n1 /σ1 ). 2 Thus g(nγτ ) = g(nτ /στ ) is the error we would get by ignoring all examples from tasks other than τ , and the term in {. . .} in (10) gives the “multi-task gain”, i.e. the factor by which the error is reduced because of examples from other tasks. 
(The absolute error reduction always vanishes trivially for n → ∞, along with the errors themselves.) One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to r → ∞ and hence α → 1, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues δa of Γ1/2 DΓ1/2 will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against ρ2 for the two-task setting of Fig. 1(left) with n = 500. Our predictions capture the nonlinear dependence on ρ2 quite well, though the effect is somewhat weaker in the simulations. For larger n the predictions approach a curve that is constant for ρ < 1, signifying negligible improvement from multi-task learning except at ρ = 1. It is worth contrasting this with the lower bound from [21], which is linear in ρ2 . While this provides a very good approximation to the learning curves for moderate n [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in ρ2 : see Fig. 2(left, inset) for the OU case, which has r = 1). A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next. 1 See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized. 7 4.4 Many tasks We assume as for the two-task case that all inter-task correlations, Dτ,τ with τ = τ , are equal to ρ, while Dτ,τ = 1. This setup was used e.g. in [23], and can be interpreted as each task having a √ component proportional to ρ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number nτ = n/T of examples for 2 each task, and that all noise levels are the same, στ = σ 2 . Then also all Bayes errors τ = will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as = gT (n/(σ 2 + ), ρ) (11) where gT (h, ρ) is related to the single-task function g(h) from above by gT (h, ρ) = 1−ρ T −1 (1 − ρ)g(h(1 − ρ)/T ) + ρ + T T g(h[ρ + (1 − ρ)/T ]) (12) Now consider the limit T → ∞ of many tasks. If n and hence h = n/(σ 2 + ) is kept fixed, gT (h, ρ) → (1 − ρ) + ρg(hρ); here we have taken g(0) = 1 which corresponds to tr Λ = C(x, x) x = 1 as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form = (1 − ρ) + ρ˜, where ˜ decays from one to zero with increasing n as for a single task, but with an effective noise level σ 2 = (1 − ρ + σ 2 )/ρ. Remarkably, then, ˜ even though here n/T → 0 so that for most tasks no examples have been seen, the Bayes error for each task decreases by “collective learning” to a plateau of height 1 − ρ. The remaining decay of to zero happens only once n becomes of order T . Here one can show, by taking T → ∞ at fixed h/T in (12) and inserting into (11), that = (1 − ρ)¯ where ¯ again decays as for a single task but with an effective number of examples n = n/T and effective noise level σ 2 /(1 − ρ). This final stage of ¯ ¯ learning therefore happens only when each task has seen a considerable number of exampes n/T . Fig. 

3 0.6951974 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an $O(\log(1/\varepsilon))$ iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned. 1

4 0.66603112 254 nips-2012-On the Sample Complexity of Robust PCA

Author: Matthew Coudron, Gilad Lerman

Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order $O(N^{-0.5+\epsilon})$ for arbitrarily small $\epsilon > 0$ (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., $O(N^{-0.5})$. Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is $O(D^{2+\delta})$ for arbitrarily small $\delta > 0$ (whereas the sample complexity of direct covariance estimation with Frobenius norm is $O(D^2)$). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm. 1

5 0.65038061 327 nips-2012-Structured Learning of Gaussian Graphical Models

Author: Karthik Mohan, Mike Chung, Seungyeop Han, Daniela Witten, Su-in Lee, Maryam Fazel

Abstract: We consider estimation of multiple high-dimensional Gaussian graphical models corresponding to a single set of nodes under several distinct conditions. We assume that most aspects of the networks are shared, but that there are some structured differences between them. Specifically, the network differences are generated from node perturbations: a few nodes are perturbed across networks, and most or all edges stemming from such nodes differ between networks. This corresponds to a simple model for the mechanism underlying many cancers, in which the gene regulatory network is disrupted due to the aberrant activity of a few specific genes. We propose to solve this problem using the perturbed-node joint graphical lasso, a convex optimization problem that is based upon the use of a row-column overlap norm penalty. We then solve the convex problem using an alternating directions method of multipliers algorithm. Our proposal is illustrated on synthetic data and on an application to brain cancer gene expression data. 1

6 0.62466586 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

7 0.59822828 319 nips-2012-Sparse Prediction with the $k$-Support Norm

8 0.59718031 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

9 0.59217733 221 nips-2012-Multi-Stage Multi-Task Feature Learning

10 0.57817101 222 nips-2012-Multi-Task Averaging

11 0.56185722 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

12 0.55954659 225 nips-2012-Multi-task Vector Field Learning

13 0.55670905 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

14 0.5546723 192 nips-2012-Learning the Dependency Structure of Latent Factors

15 0.54778117 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

16 0.54089743 86 nips-2012-Convex Multi-view Subspace Learning

17 0.53136742 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

18 0.52691972 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

19 0.5179863 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

20 0.51789647 247 nips-2012-Nonparametric Reduced Rank Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.088), (12, 0.182), (21, 0.054), (38, 0.138), (42, 0.018), (54, 0.017), (55, 0.03), (74, 0.035), (76, 0.216), (80, 0.102), (92, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9286586 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

Author: Dmitri B. Chklovskii, Daniel Soudry

Abstract: We test the hypothesis that the neuronal spike generation mechanism is an analog-to-digital (AD) converter encoding rectified low-pass filtered summed synaptic currents into a spike train linearly decodable in postsynaptic neurons. Faithful encoding of an analog waveform by a binary signal requires that the spike generation mechanism has a sampling rate exceeding the Nyquist rate of the analog signal. Such oversampling is consistent with the experimental observation that the precision of the spike-generation mechanism is an order of magnitude greater than the cut-off frequency of low-pass filtering in dendrites. Additional improvement in the coding accuracy may be achieved by noise-shaping, a technique used in signal processing. If noise-shaping were used in neurons, it would reduce coding error relative to a Poisson spike generator for frequencies below Nyquist by introducing correlations into spike times. By using experimental data from three different classes of neurons, we demonstrate that biological neurons utilize noise-shaping. Therefore, the spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter.

The nature of the neural spike code remains a central problem in neuroscience [1-3]. In particular, no consensus exists on whether information is encoded in firing rates [4, 5] or individual spike timing [6, 7]. On the single-neuron level, evidence exists to support both points of view. On the one hand, post-synaptic currents are low-pass-filtered by dendrites with the cut-off frequency of approximately 30Hz [8], Figure 1B, providing ammunition for the firing rate camp: if the signal reaching the soma is slowly varying, why would precise spike timing be necessary? On the other hand, the ability of the spike-generation mechanism to encode harmonics of the injected current up to about 300Hz [9, 10], Figure 1B, points at its exquisite temporal precision [11]. Yet, in view of the slow variation of the somatic current, such precision may seem gratuitous and puzzling. The timescale mismatch between gradual variation of the somatic current and high precision of spike generation has been addressed previously. Existing explanations often rely on the population nature of the neural code [10, 12]. Although this is a distinct possibility, the question remains whether invoking population coding is necessary. Other possible explanations for the timescale mismatch include the possibility that some synaptic currents (for example, GABAergic) may be generated by synapses proximal to the soma and therefore not subject to low-pass filtering, or that the high frequency harmonics are so strong in the pre-synaptic spike that despite attenuation, their trace is still present. Although in some cases these explanations could apply, for the majority of synaptic inputs to typical neurons there is a glaring mismatch. The perceived mismatch between the time scales of somatic currents and the spike-generation mechanism can be resolved naturally if one views spike trains as digitally encoding analog somatic currents [13-15], Figure 1A. Although somatic currents vary slowly, the information that could be communicated by their analog amplitude far exceeds that of binary signals, such as all-or-none spikes, of the same sampling rate. Therefore, faithful digital encoding requires the sampling rate of the digital signal to be much higher than the cut-off frequency of the analog signal, so-called oversampling.
Although the spike generation mechanism operates in continuous time, the high temporal precision of the spike-generation mechanism may be viewed as a manifestation of oversampling, which is needed for the digital encoding of the analog signal. Therefore, the extra order of magnitude in temporal precision available to the spike-generation mechanism relative to somatic current, Figure 1B, is necessary to faithfully encode the amplitude of the analog signal, thus potentially reconciling the firing rate and the spike timing points of view [13-15].

Figure 1. Hybrid digital-analog operation of neuronal circuits. A. Post-synaptic currents are low-pass filtered and summed in dendrites (black) to produce a somatic current (blue). This analog signal is converted by the spike generation mechanism into a sequence of all-or-none spikes (green), a digital signal. Spikes propagate along an axon and are chemically transduced across synapses (gray) into post-synaptic currents (black), whose amplitude reflects synaptic weights, thus converting the digital signal back to analog. B. Frequency response function for dendrites (blue, adapted from [8]) and for the spike generation mechanism (green, adapted from [9]). Note one order of magnitude gap between the cut-off frequencies. C. Amplitude of the summed postsynaptic currents depends strongly on spike timing. If the blue spike arrives just 5ms later, as shown in red, the EPSCs sum to a value already 20% less. Therefore, the extra precision of the digital signal may be used to communicate the amplitude of the analog signal.

In signal processing, efficient AD conversion combines the principle of oversampling with that of noise-shaping, which utilizes correlations in the digital signal to allow more accurate encoding of the analog amplitude. This is exemplified by a family of AD converters called ΣΔ modulators [16], of which the basic one is analogous to an integrate-and-fire (IF) neuron [13-15]. The analogy between the basic ΣΔ modulator and the IF neuron led to the suggestion that neurons also use noise-shaping to encode the incoming analog current waveform in the digital spike train [13]. However, the hypothesis of noise-shaping AD conversion has never been tested experimentally in biological neurons. In this paper, by analyzing existing experimental datasets, we demonstrate that noise-shaping is present in three different classes of neurons from vertebrates and invertebrates. This lends support to the view that neurons act as oversampling and noise-shaping AD converters and accounts for the mismatch between the slowly varying somatic currents and precise spike timing. Moreover, we show that the degree of noise-shaping in biological neurons exceeds that used by basic ΣΔ modulators or IF neurons and propose viewing more complicated models in the noise-shaping framework. This paper is organized as follows: We review the principles of oversampling and noise-shaping in Section 2. In Section 3, we present experimental evidence for noise-shaping AD conversion in neurons. In Section 4 we argue that rectification of somatic currents may improve energy efficiency and/or implement de-noising.

2. Oversampling and noise-shaping in AD converters

To understand how oversampling can lead to more accurate encoding of the analog signal amplitude in a digital form, we first consider a Poisson spike encoder, whose rate of spiking is modulated by the signal amplitude, Figure 2A.
Such an AD converter samples an analog signal at discrete time points and generates a spike with a probability given by the (normalized) signal amplitude. Because of the binary nature of spike trains, the resulting spike train encodes the signal with a large error even when the sampling is done at Nyquist rate, i.e. the lowest rate for alias-free sampling. To reduce the encoding error a Poisson encoder can sample at frequencies, $f_s$, higher than Nyquist, $f_N$ – hence, the term oversampling, Figure 2B. When combined with decoding by low-pass filtering (down to Nyquist) on the receiving end, this leads to a reduction of the error, which can be estimated as follows. The number of samples over a Nyquist half-period ($1/2f_N$) is given by the oversampling ratio $R = f_s/(2f_N)$. As the normalized signal amplitude, x, stays roughly constant over the Nyquist half-period, it can be encoded by spikes generated with a fixed probability, x. For a Poisson process the variance in the number of spikes is equal to the mean, $xR$. Therefore, the mean relative error of the signal decoded by averaging over the Nyquist half-period is

$\epsilon \approx \sqrt{xR}/(xR) = (xR)^{-1/2}$   (1)

indicating that oversampling reduces transmission error. However, the weak dependence of the error on the oversampling frequency indicates diminishing returns on the investment in oversampling and motivates one to search for other ways to lower the error.

Figure 2. Oversampling and noise-shaping in AD conversion. A. Analog somatic current (blue) and its digital code (green). The difference between the green and the blue curves is encoding error. B. Digital output of oversampling Poisson encoder over one Nyquist half-period. C. Error power spectrum of a Nyquist (dark green) and oversampled (light green) Poisson encoder. Although the total error power is the same, the fraction surviving low-pass filtering during decoding (solid green) is smaller in the oversampled case. D. Basic ΣΔ modulator. E. Signal at the output of the integrator. F. Digital output of the ΣΔ modulator over one Nyquist period. G. Error power spectrum of the ΣΔ modulator (brown) is shifted to higher frequencies and low-pass filtered during decoding. The remaining error power (solid brown) is smaller than for the Poisson encoder.

To reduce encoding error beyond the ½ power of the oversampling ratio, the principle of noise-shaping was put forward [17]. To illustrate noise-shaping consider a basic AD converter called the ΣΔ modulator [18], Figure 2D. In the basic ΣΔ modulator, the previous quantized signal is fed back and subtracted from the incoming signal and then the difference is integrated in time. Rather than quantizing the input signal, as would be done in the Poisson encoder, the ΣΔ modulator quantizes the integral of the difference between the incoming analog signal and the previous quantized signal, Figure 2F. One can see that, in the oversampling regime, the quantization error of the basic ΣΔ modulator is significantly less than that of the Poisson encoder. As the variance in the number of spikes over the Nyquist period is less than one, the mean relative error of the signal is at most of order $(xR)^{-1}$, which is better than the Poisson encoder.

To gain additional insight and understand the origin of the term noise-shaping, we repeat the above analysis in the Fourier domain. First, the Poisson encoder has a flat power spectrum up to the sampling frequency, Figure 2C. Oversampling preserves the total error power but extends the frequency range, resulting in the lower error power below Nyquist. Second, a more detailed analysis of the basic ΣΔ modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise shaping, Figure 2G. Under the additional assumption of white quantization noise, such analysis yields

$\epsilon \propto R^{-3/2}$   (2)

which for R >> 1 is significantly better performance than for the Poisson encoder, Eq. (1). As mentioned previously, the basic ΣΔ modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation. In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the ΣΔ-IF analogy, we look for the signs of noise-shaping AD conversion in real neurons.
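The error comparison sketched above is easy to reproduce in simulation. The toy script below is our own construction (signal shape, oversampling ratio and thresholds are arbitrary choices): it encodes a slow signal with a Bernoulli/Poisson-style encoder and with a first-order ΣΔ (IF-with-subtraction) encoder, then decodes both by block averaging over each Nyquist half-period.

```python
import numpy as np

rng = np.random.default_rng(1)
R, n_nyq = 64, 200                         # oversampling ratio, slow samples
x = 0.5 + 0.3 * np.sin(2 * np.pi * np.arange(n_nyq) / 40)  # signal in (0, 1)
x_up = np.repeat(x, R)                     # piecewise-constant fast signal

# Poisson-style encoder: spike with probability x at every fast sample
poisson_spikes = (rng.random(x_up.size) < x_up).astype(float)

# first-order sigma-delta / IF encoder: integrate the input, emit a spike and
# subtract the threshold (not reset) whenever the integral crosses 1
u, sd_spikes = 0.0, np.zeros_like(x_up)
for t, xt in enumerate(x_up):
    u += xt
    if u >= 1.0:
        sd_spikes[t], u = 1.0, u - 1.0

def decode(spikes):                        # block average over R fast samples
    return spikes.reshape(n_nyq, R).mean(axis=1)

for name, s in [("poisson", poisson_spikes), ("sigma-delta", sd_spikes)]:
    print(name, round(np.sqrt(np.mean((decode(s) - x) ** 2)), 4))
```

The printed RMS errors show the roughly $(xR)^{-1/2}$ versus $(xR)^{-1}$ gap discussed above: the ΣΔ/IF code is far more accurate at the same spike budget.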
Second, a more detailed analysis of the basic ΣΔ modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise-shaping, Figure 2G. Under the additional assumption of white quantization noise, such analysis yields

$$\delta \propto R^{-3/2}, \qquad (2)$$

which for $R \gg 1$ is significantly better performance than for the Poisson encoder, Eq. (1). As mentioned previously, the basic ΣΔ modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation. In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the ΣΔ-IF analogy, we look for signs of noise-shaping AD conversion in real neurons. 3. Experimental evidence of noise-shaping AD conversion in real neurons. In order to determine whether noise-shaping AD conversion takes place in biological neurons, we analyzed three experimental datasets, where spike trains were generated by time-varying somatic currents: 1) rat somatosensory cortex L5 pyramidal neurons [9], 2) mouse olfactory mitral cells [22, 23], and 3) fruit fly olfactory receptor neurons [24]. In the first two datasets, the current was injected through an electrode in whole-cell patch clamp mode, while in the third, the recording was extracellular and the intrinsic somatic current could be measured because the glial compartment included only one active neuron. Testing the noise-shaping AD conversion hypothesis is complicated by the fact that the encoded and decoded signals are hard to measure accurately. First, as the somatic current is rectified by the spike-generation mechanism, only its supra-threshold component can be encoded faithfully, making it hard to know exactly what is being encoded. Second, decoding in the dendrites is not accessible in these single-neuron recordings. In view of these difficulties, we start by simply computing the power spectrum of the reconstruction error obtained by subtracting a scaled and shifted, but otherwise unaltered, spike train from the somatic current. The scaling factor was determined by the total weight of the decoding linear filter and the shift was optimized to maximize information capacity, see below. At frequencies below 20 Hz the error contains significantly lower power than the input signal, Figure 3, indicating that the spike generation mechanism may be viewed as an AD converter. Furthermore, the error power spectrum of the biological neuron is below that of the Poisson encoder, thus indicating the presence of noise-shaping. For dataset 3 we also plot the error power spectrum of the IF neuron, the threshold of which is chosen to generate the same number of spikes as the biological neuron.
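Since the text identifies the basic ΣΔ modulator with a discrete-time IF neuron whose reset is replaced by threshold subtraction, the noise-shaping advantage over a Poisson encoder can be checked in a few lines. Everything below (the sub-Nyquist sinusoidal input, the unit threshold, and the moving-average decoder) is an illustrative assumption, not a model of the recorded neurons.

```python
import numpy as np

# Sketch: first-order sigma-delta (IF neuron with threshold subtraction)
# versus a Poisson encoder on the same slowly varying input.

rng = np.random.default_rng(1)
n, R = 200_000, 100                             # samples, oversampling ratio
t = np.arange(n)
x = 0.5 + 0.3 * np.sin(2 * np.pi * t / (20 * R))  # slow sinusoid in [0.2, 0.8]

v, spikes_sd = 0.0, np.zeros(n)
for i in range(n):
    v += x[i]                                   # integrate the input
    if v >= 1.0:
        spikes_sd[i] = 1.0
        v -= 1.0                                # subtract, don't reset (see text)

spikes_poisson = (rng.random(n) < x).astype(float)

def lowpass(s, width):                          # crude moving-average decoder
    return np.convolve(s, np.ones(width) / width, mode="same")

for name, s in (("sigma-delta", spikes_sd), ("Poisson", spikes_poisson)):
    err = lowpass(s, R)[R:-R] - x[R:-R]
    print(f"{name:12s} decoded RMS error: {np.sqrt(np.mean(err ** 2)):.5f}")
```

The decoded error of the ΣΔ/IF encoder comes out several-fold smaller than the Poisson encoder's at the same spike rate, which is the noise-shaping effect the experimental analysis below looks for.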
Figure 3. Evidence of noise-shaping. Power spectra of the somatic current (blue) and of the difference between the somatic current and the digital spike train for the biological neuron (black), the Poisson encoder (green), and the IF neuron (red). Left: dataset 1; right: dataset 3. Although the simple analysis presented above indicates noise-shaping, subtracting the spike train from the input signal, Figure 3, does not accurately quantify the error when decoding involves additional filtering. An example of such additional encoding/decoding is predictive coding, which will be discussed below [25]. To take such a decoding filter into account, we computed a decoded waveform by convolving the spike train with the optimal linear filter, which predicts the somatic current from the spike train with the least mean squared error. Our linear decoding analysis lends additional support to the noise-shaping AD conversion hypothesis [13-15]. First, the optimal linear filter shape is similar to unitary post-synaptic currents, Figure 4B, thus supporting the view that dendrites reconstruct the somatic current of the presynaptic neuron by low-pass filtering the spike train in accordance with the noise-shaping principle [13]. Second, we found that linear decoding using an optimal filter accounts for 60-80% of the somatic current variance. Naturally, such prediction works better for neurons in the suprathreshold regime, i.e. with high firing rates, an issue to which we return in Section 4. To avoid complications associated with rectification, for now we focused on neurons which were in the suprathreshold regime, by monitoring that the relationship between the predicted and actual current is close to linear. Figure 4. Linear decoding of experimentally recorded spike trains. A. Waveform of the somatic current (blue), the resulting spike train (black), and the linearly decoded waveform (red) from dataset 1. B. Top: Optimal linear filter for the trace in A; it is representative of the other datasets as well. Bottom: Typical EPSPs have a shape similar to the decoding filter (adapted from [26]). C-D. Power spectra of the somatic current (blue), the decoding error of the biological neuron (black), the Poisson encoder (green), and the IF neuron (red) for dataset 1 (C) and dataset 3 (D). Next, we analyzed the spectral distribution of the reconstruction error calculated by subtracting the decoded spike train, i.e. the spike train convolved with the computed optimal linear filter, from the somatic current. We found that at low frequencies the error power is significantly lower than in the input signal, Figure 4C,D. This observation confirms that signals below the dendritic cut-off frequency of 20-30 Hz can be efficiently communicated using spike trains. To quantify the effect of noise-shaping we computed the information capacity of different encoders,

$$C = \sum_{f:\, S(f) > N(f)} \log_2 \frac{S(f)}{N(f)},$$

where $S(f)$ and $N(f)$ are the power spectra of the somatic current and the encoding error, respectively, and the sum is computed only over the frequencies for which $S(f) > N(f)$.
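To make the decoding analysis concrete, the sketch below encodes a correlated synthetic "somatic current" with the same first-order ΣΔ loop as above, fits the optimal linear decoding filter over both past and future lags by least squares, and evaluates the capacity-like sum just defined. The filter length (±100 samples), the signal statistics, and the units are illustrative assumptions; this is not the paper's estimation pipeline.

```python
import numpy as np

# Sketch: optimal linear decoding filter and the capacity-like measure
# C = sum over {f : S(f) > N(f)} of log2(S(f)/N(f)).

rng = np.random.default_rng(2)
n = 50_000
x = np.convolve(rng.standard_normal(n), np.ones(200) / 200, mode="same")
x = 0.5 + 0.2 * x / x.std()                 # positive, temporally correlated

v, s = 0.0, np.zeros(n)                     # encode: IF with subtraction
for i in range(n):
    v += x[i]
    if v >= 1.0:
        s[i], v = 1.0, v - 1.0

L = 100                                     # lags -L..L; acausal lags allow
X = np.stack([np.roll(s, k) for k in range(-L, L + 1)], axis=1)[L:-L]
h, *_ = np.linalg.lstsq(X, x[L:-L], rcond=None)   # least-squares filter
resid = x[L:-L] - X @ h
print(f"variance explained: {1 - resid.var() / x[L:-L].var():.1%}")

S = np.abs(np.fft.rfft(x[L:-L] - x[L:-L].mean())) ** 2   # signal spectrum
N = np.maximum(np.abs(np.fft.rfft(resid)) ** 2, 1e-12)   # error spectrum
band = S > N
print(f"capacity-like sum: {np.log2(S[band] / N[band]).sum():.0f} bits (arb. units)")
```

Because the lag range includes times after each spike, the fitted filter is free to place weight following a spike, which is the "predictive" feature reported for the experimental filters below.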
Because the plots in Figure 4C,D use a semi-logarithmic scale, the information capacity can be estimated from the area between the somatic current (blue) power spectrum and an error power spectrum. We find that the biological spike generation mechanism has a higher information capacity than the Poisson encoder and the IF neuron. Therefore, neurons act as AD converters with stronger noise-shaping than IF neurons. We now return to the predictive nature of the spike generation mechanism. Given the causal nature of the spike generation mechanism, it is surprising that the optimal filters for all three datasets carry most of their weight following a spike, Figure 4B. This indicates that the spike generation mechanism is capable of making predictions, which are possible in these experiments because the somatic currents are temporally correlated. We note that these observations make delay-free reconstruction of the signal possible, thus allowing fast operation of neural circuits [27]. The predictive nature of the encoder can be captured by a ΣΔ modulator embedded in a predictive coding feedback loop [28], Figure 5A. We verified by simulation that such a nested architecture generates a similar optimal linear filter, with most of its weight in the time following a spike, Figure 5A right. Of course, such prediction is only possible for correlated inputs, implying that the shape of the optimal linear filter depends on the statistics of the inputs. The role of predictive coding is to reduce the dynamic range of the signal that enters the ΣΔ modulator, thus avoiding overloading. A possible biological implementation for such integrating feedback could be the Ca2+ concentration and Ca2+-dependent potassium channels [25, 29]. Figure 5. Enhanced ΣΔ modulators. A. ΣΔ modulator combined with a predictive coder. In such a device, the optimal decoding filter computed for correlated inputs has most of its weight following a spike, similar to the experimental measurements, Figure 4B. B. Second-order ΣΔ modulator possesses stronger noise-shaping properties. Because such a circuit contains an internal state variable, it generates a non-periodic spike train in response to a constant input. The bottom trace shows a typical result of a simulation. Black: spikes; blue: input current. 4. Possible reasons for current rectification: energy efficiency and de-noising. We have shown that at high firing rates biological neurons encode the somatic current into a linearly decodable spike train. However, at low firing rates linear decoding cannot faithfully reproduce the somatic current because of rectification in the spike generation mechanism. If the objective of spike generation is faithful AD conversion, why would such rectification exist? We see two potential reasons: energy efficiency and de-noising. It is widely believed that minimizing metabolic costs is an important consideration in brain design and operation [30, 31]. Moreover, spikes are known to consume a significant fraction of the metabolic budget [30, 32], placing a premium on their total number. Thus, we can postulate that neuronal spike trains find a trade-off between the mean squared error of the decoded spike train relative to the input signal and the total number of spikes, as expressed by the following cost function over a time interval T:

$$E = \sum_{t=1}^{T} \big( x_t - w\,(h * s)_t \big)^2 + \lambda \sum_{t=1}^{T} s_t, \qquad (3)$$

where $x$ is the analog input signal, $s$ is the binary spike sequence composed of zeros and ones, $h$ is the (unit-weight) linear filter with overall weight $w$, and $\lambda$ sets the cost per spike.
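The trade-off in Eq. (3) can be probed directly in the simplified constant-input setting analyzed next (Eq. (4)): enumerate candidate spike counts N for one Nyquist period and pick the cheapest, and, for the continuous relaxation discussed below, apply the one-sided soft-thresholding (shrinkage) step. The values of R and the spike cost are illustrative assumptions.

```python
import numpy as np

# Two sketches of Section 4's argument. (1) Eq. (4)-style selection: pick the
# spike count N minimizing (x - N/R)^2 + lam*N; the optimum jumps from 0 to 1
# only once x clears a threshold. (2) The continuous non-negative relaxation
# is solved by one-sided soft-thresholding. R and lam are assumed values.

R, lam = 10, 0.01

def best_spike_count(x, n_max=20):
    n = np.arange(n_max + 1)
    return int(n[np.argmin((x - n / R) ** 2 + lam * n)])

for x in (0.0, 0.05, 0.10, 0.15, 0.30):
    print(f"x = {x:.2f} -> optimal N = {best_spike_count(x)}")
# -> N stays 0 until x exceeds roughly 0.1, then jumps to 1, then grows.

def nonneg_soft_threshold(s, lam):
    # prox step of the gradient-descent-plus-shrinkage iteration:
    # exactly zero below lam ("rectification"), shrunk linearly above it
    return np.maximum(s - lam, 0.0)

print(np.round(nonneg_soft_threshold(np.linspace(-0.5, 0.5, 5), lam=0.1), 2))
```

The discrete selection reproduces the jump from zero to one spike described for Figure 6A, and the prox step is the one-sided thresholding invoked in the LASSO analogy that follows.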
To demonstrate how solving Eq. (3) would lead to thresholding, let us consider a simplified version taken over a Nyquist period, during which the input signal stays constant:

$$E(N) = \left( \tilde{x} - \frac{N}{R} \right)^2 + \tilde{\lambda} N, \qquad (4)$$

where $N$ is the number of spikes in the period and $\tilde{x}$ and $\tilde{\lambda}$ are the input signal and the spike cost normalized by the filter weight $w$. Minimizing such a cost function reduces to choosing the lowest-lying parabola for a given input, Figure 6A. Therefore, thresholding is a natural outcome of minimizing a cost function combining the decoding error and the energy cost, Eq. (3). In addition to energy efficiency, there may be a computational reason for thresholding the somatic current in neurons. To illustrate this point, we note that the cost function in Eq. (3) for continuous variables, $s_t$, may be viewed as a non-negative version of the L1-norm regularized linear regression called LASSO [33], which is commonly used for de-noising of sparse and Laplacian signals [34]. Such a cost function can be minimized by iteratively applying gradient descent and shrinkage steps [35], which is equivalent to thresholding (one-sided in the case of non-negative variables), Figure 6B,C. Therefore, neurons may be encoding a de-noised input signal. Figure 6. Possible reasons for rectification in neurons. A. Cost function combining the squared encoding error with the metabolic expense vs. the input signal for different values of the spike number N, Eq. (4). Note that the optimal number of spikes jumps from zero to one as a function of the input. B. Estimating the most probable “clean” signal value for a continuous non-negative Laplacian signal and Gaussian noise, Eq. (3) (while setting w = 1). The parabolas (red) illustrate the quadratic log-likelihood term in (3) for different values of the measurement, s, while the linear function (blue) reflects the linear log-prior term in (3). C. The minimum of the combined cost function in B is at zero if s is below a threshold, and grows linearly with s above the threshold. 5. Discussion. In this paper, we demonstrated that the neuronal spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter, which encodes a rectified, low-pass filtered somatic current as a digital spike train. Rectification by the spike generation mechanism may subserve both energy efficiency and de-noising. As the degree of noise-shaping in biological neurons exceeds that in IF neurons, or basic ΣΔ modulators, we suggest that neurons should be modeled by more advanced ΣΔ modulators, e.g. Figure 5B. Interestingly, ΣΔ modulators can also be viewed as coders with error prediction feedback [19]. Many publications have studied various aspects of spike generation in neurons; we believe that the framework [13-15] we adopt is different, and we discuss its relationship to some of these studies. Our framework is different from previous proposals to cast neurons as predictors [36, 37] because a different quantity is being predicted. The possibility of perfect decoding from a spike train with infinite temporal precision has been proven in [38]. Here, we are concerned with the more practical issue of how the reconstruction error scales with the oversampling ratio. Also, we consider linear decoding, which sets our work apart from [39]. Finally, previous experiments addressing noise-shaping [40] studied the power spectrum of the spike train rather than that of the encoding error. Our work is aimed at understanding biological and computational principles of spike generation and decoding and is not meant as a substitute for existing phenomenological spike-generation models [41], which allow efficient fitting of parameters and prediction of spike trains [42].
Yet, the theoretical framework [13-15] we adopt may assist in building better models of spike generation for a given somatic current waveform. First, having interpreted spike generation as AD conversion, we can draw on the rich experience in signal processing to attack the problem. Second, this framework suggests a natural metric to compare the performance of different spike generation models in the high firing rate regime: the mean squared error between the injected current waveform and the filtered version of the spike train produced by a model, provided the total number of spikes is the same as in the experimental data. The AD conversion framework adds justification to the previously proposed spike distance obtained by subtracting low-pass filtered spike trains [43]. As the framework [13-15] we adopt relies on viewing neuronal computation as an analog-digital hybrid, which requires AD and DA conversion at every step, one may wonder about the reason for such a hybrid scheme. Starting with the early days of computers, the analog mode has been known to be advantageous for computation. For example, performing addition of many variables in one step is possible in the analog mode simply by Kirchhoff's law, but would require hundreds of logical gates in the digital mode [44]. However, the analog mode is vulnerable to noise build-up over many stages of computation and is inferior in precisely communicating information over long distances under a limited energy budget [30, 31]. While early analog computers were displaced by their digital counterparts, evolution combined analog and digital modes into a computational hybrid [44], thus necessitating efficient AD and DA conversion, which was the focus of the present study. We are grateful to L. Abbott, S. Druckmann, D. Golomb, T. Hu, J. Magee, N. Spruston, B. Theilman for helpful discussions and comments on the manuscript, and to X.-J. Wang, D. McCormick, K. Nagel, R. Wilson, K. Padmanabhan, N. Urban, S. Tripathy, H. Koendgen, and M. Giugliano for sharing their data. The work of D.S. was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI). References 1. Ferster, D. and N. Spruston, Cracking the neural code. Science, 1995. 270: p. 756-7. 2. Panzeri, S., et al., Sensory neural codes using multiplexed temporal scales. Trends Neurosci, 2010. 33(3): p. 111-20. 3. Stevens, C.F. and A. Zador, Neural coding: The enigma of the brain. Curr Biol, 1995. 5(12): p. 1370-1. 4. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96. 5. Shadlen, M.N. and W.T. Newsome, Noise, neural codes and cortical organization. Curr Opin Neurobiol, 1994. 4(4): p. 569-79. 6. Singer, W. and C.M. Gray, Visual feature integration and the temporal correlation hypothesis. Annu Rev Neurosci, 1995. 18: p. 555-86. 7. Meister, M., Multineuronal codes in retinal signaling. Proc Natl Acad Sci U S A, 1996. 93(2): p. 609-14. 8. Cook, E.P., et al., Dendrite-to-soma input/output function of continuous time-varying signals in hippocampal CA1 pyramidal neurons. J Neurophysiol, 2007. 98(5): p. 2943-55. 9. Kondgen, H., et al., The dynamical response properties of neocortical neurons to temporally modulated noisy inputs in vitro. Cereb Cortex, 2008. 18(9): p. 2086-97. 10. Tchumatchenko, T., et al., Ultrafast population encoding by cortical neurons. J Neurosci, 2011. 31(34): p. 12171-9. 11. Mainen, Z.F. and T.J.
Sejnowski, Reliability of spike timing in neocortical neurons. Science, 1995. 268(5216): p. 1503-6. 12. Mar, D.J., et al., Noise shaping in populations of coupled model neurons. Proc Natl Acad Sci U S A, 1999. 96(18): p. 10450-5. 13. Shin, J., Adaptive noise shaping neural spike encoding and decoding. Neurocomputing, 2001. 38-40: p. 369-381. 14. Shin, J., The noise shaping neural coding hypothesis: a brief history and physiological implications. Neurocomputing, 2002. 44: p. 167-175. 15. Shin, J.H., Adaptation in spiking neurons based on the noise shaping neural coding hypothesis. Neural Networks, 2001. 14(6-7): p. 907-919. 16. Schreier, R. and G.C. Temes, Understanding Delta-Sigma Data Converters. 2005, Piscataway, NJ: IEEE Press, Wiley. xii, 446 p. 17. Candy, J.C., A use of limit cycle oscillations to obtain robust analog-to-digital converters. IEEE Trans. Commun, 1974. COM-22: p. 298-305. 18. Inose, H., Y. Yasuda, and J. Murakami, A telemetering system by code modulation - ΔΣ modulation. IRE Trans. Space Elect. Telemetry, 1962. SET-8: p. 204-209. 19. Spang, H.A. and P.M. Schultheiss, Reduction of quantizing noise by use of feedback. IRE Trans. Commun. Sys., 1962: p. 373-380. 20. Hovin, M., et al., Delta-Sigma modulation in single neurons, in IEEE International Symposium on Circuits and Systems, 2002. 21. Cheung, K.F. and P.Y.H. Tang, Sigma-Delta Modulation Neural Networks. Proc. IEEE Int Conf Neural Networks, 1993: p. 489-493. 22. Padmanabhan, K. and N. Urban, Intrinsic biophysical diversity decorrelates neuronal firing while increasing information content. Nat Neurosci, 2010. 13: p. 1276-82. 23. Urban, N. and S. Tripathy, Neuroscience: Circuits drive cell diversity. Nature, 2012. 488(7411): p. 289-90. 24. Nagel, K.I. and R.I. Wilson, personal communication. 25. Shin, J., C. Koch, and R. Douglas, Adaptive neural coding dependent on the time-varying statistics of the somatic input current. Neural Comp, 1999. 11: p. 1893-913. 26. Magee, J.C. and E.P. Cook, Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nat Neurosci, 2000. 3(9): p. 895-903. 27. Thorpe, S., D. Fize, and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381(6582): p. 520-2. 28. Tewksbury, S.K. and R.W. Hallock, Oversampled, linear predictive and noise-shaping coders of order N>1. IEEE Trans Circuits & Sys, 1978. CAS-25: p. 436-47. 29. Wang, X.J., et al., Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J Neurophysiol, 2003. 89(6): p. 3279-93. 30. Attwell, D. and S.B. Laughlin, An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab, 2001. 21(10): p. 1133-45. 31. Laughlin, S.B. and T.J. Sejnowski, Communication in neuronal networks. Science, 2003. 301(5641): p. 1870-4. 32. Lennie, P., The cost of cortical computation. Curr Biol, 2003. 13(6): p. 493-7. 33. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 1996. 58(1): p. 267-288. 34. Chen, S.S.B., D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 1998. 20(1): p. 33-61. 35. Elad, M., et al., Wide-angle view at iterated shrinkage algorithms. Proc. SPIE, 2007. 6701: p. 70102. 36. Deneve, S., Bayesian spiking neurons I: inference. Neural Comp, 2008. 20: p. 91. 37. Yu, A.J., Optimal Change-Detection and Spiking Neurons, in NIPS, B. Scholkopf, J. Platt, and T. Hofmann, Editors. 2006.
38. Lazar, A. and L. Toth, Perfect Recovery and Sensitivity Analysis of Time Encoded Bandlimited Signals. IEEE Trans. Circuits and Systems, 2004. 51(10). 39. Pfister, J.P., P. Dayan, and M. Lengyel, Synapses with short-term plasticity are optimal estimators of presynaptic membrane potentials. Nat Neurosci, 2010. 13(10): p. 1271-5. 40. Chacron, M.J., et al., Experimental and theoretical demonstration of noise shaping by interspike interval correlations. Fluctuations and Noise in Biological, Biophysical, and Biomedical Systems III, 2005. 5841: p. 150-163. 41. Pillow, J., Likelihood-based approaches to modeling the neural code, in Bayesian Brain: Probabilistic Approaches to Neural Coding, K. Doya, et al., Editors. 2007, MIT Press. 42. Jolivet, R., et al., A benchmark test for a quantitative assessment of simple neuron models. J Neurosci Methods, 2008. 169(2): p. 417-24. 43. van Rossum, M.C., A novel spike distance. Neural Comput, 2001. 13(4): p. 751-63. 44. Sarpeshkar, R., Analog versus digital: extrapolating from electronics to neurobiology. Neural Computation, 1998. 10(7): p. 1601-38.

same-paper 2 0.88192099 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

Author: Piyush Rai, Abhishek Kumar, Hal Daume

Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1

3 0.85393238 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

Abstract: We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective. 1

4 0.84196079 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

5 0.83574134 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

6 0.8307212 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points

7 0.83047837 294 nips-2012-Repulsive Mixtures

8 0.82916689 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

9 0.82785344 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

10 0.82648885 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

11 0.82617819 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.82585979 126 nips-2012-FastEx: Hash Clustering with Exponential Families

13 0.82560426 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

14 0.82556343 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms

15 0.82521325 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

16 0.82516187 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models

17 0.82481498 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

18 0.82379645 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

19 0.82337135 26 nips-2012-A nonparametric variable clustering model

20 0.82294977 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models