nips nips2012 nips2012-268 knowledge-graph by maker-knowledge-mining

268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA


Source: pdf

Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan

Abstract: The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. [sent-15, score-0.14]

2 More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. [sent-17, score-0.441]

3 In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA. [sent-18, score-0.282]

4 The key idea is to force the posterior to be factorized, so that the integration—a typical intractable operation in Bayesian methods—can be analytically performed over each parameter with the other parameters fixed. [sent-20, score-0.053]

5 An important exceptional case is the matrix factorization (MF) model [11, 6, 19] with no missing entry in the observed matrix. [sent-23, score-0.073]

6 Recently, the global analytic solution of VBMF has been derived and theoretical properties such as the mechanism of sparsity induction have been revealed [15, 16]. [sent-24, score-0.137]

7 These works also posed thought-provoking relations between VB and rigorous Bayesian estimation: The VB posterior is actually quite different from the true Bayes posterior (compare the left and the middle graphs in Fig. 1), [sent-25, score-0.202]

8 and VB induces sparsity in its solution but such sparsity is hardly observed in rigorous Bayesian estimation (see the right graph in Fig. 1). [sent-26, score-0.229]

9 [Figure 1, panel titles: Bayes posterior (V = 1); VB posterior (V = 1)] [sent-30, score-0.062]

10 [Figure 1, annotations: (A, B) ≈ (±1, ±1); VB estimator: (A, B) = (0, 0)] [sent-35, score-0.078]

11 (Left and Center) the Bayes posterior and the VB posterior of a 1 × 1 MF model, V = BA + E, when V = 1 is observed (E is a Gaussian noise). [sent-72, score-0.106]

12 VB approximates the Bayes posterior having two modes by an origin-centered Gaussian, which induces sparsity. [sent-73, score-0.071]

13 The VB estimator (the magenta solid curve) is zero when V ≤ 1, which means exact sparsity. [sent-75, score-0.078]

14 Since the probabilistic PCA [21, 18, 3] is an instance of MF, the global analytic solution derived in [16] for MF can be utilized for analyzing the probabilistic PCA. [sent-80, score-0.093]

15 Indeed, automatic dimensionality selection of VB-PCA, which is an important practical advantage of VB-PCA, was theoretically investigated in [17]. [sent-81, score-0.135]

16 However, the noise variance, which is usually unknown in many realistic applications of PCA, was treated as a given constant in that analysis. [sent-82, score-0.099]

17 In this paper, we consider a more practical and challenging situation where the noise variance is unknown, and theoretically analyze VB-PCA. [sent-83, score-0.219]

18 It was reported that noise variance estimation fails in some Bayesian approximation methods, if the formal rank is set to be full [17]. [sent-84, score-0.21]

19 With such methods, an additional model selection procedure is required for dimensionality selection [14, 5]. [sent-85, score-0.088]

20 On the other hand, we theoretically show in this paper that VB-PCA can estimate the noise variance accurately, and therefore automatic dimensionality selection works well. [sent-86, score-0.273]

21 More specifically, we establish a sufficient condition that VB-PCA can perfectly recover the true dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. [sent-87, score-0.158]

22 An interesting finding is that, although the objective function minimized for noise variance estimation is multimodal in general, only a local search algorithm is required for perfect recovery. [sent-88, score-0.241]

23 Our results are based on the random matrix theory [2, 5, 13, 22], which elucidates the distribution of singular values in the large-scale limit. [sent-89, score-0.093]

24 In the development of the above theoretical analysis, we obtain bounds for the noise variance estimator and simple closed-form solutions for other parameters. [sent-90, score-0.235]

25 [Section 2: Formulation] In this section, we introduce the variational Bayesian matrix factorization (VBMF). [sent-92, score-0.124]

26 [Section 2.1: Bayesian Matrix Factorization] Assume that we have an observation matrix $V \in \mathbb{R}^{L \times M}$, which is the sum of a target matrix $U \in \mathbb{R}^{L \times M}$ and a noise matrix $E \in \mathbb{R}^{L \times M}$: $V = U + E$. [sent-94, score-0.19]

27 In the matrix factorization model, the target matrix is assumed to be low rank, which can be expressed as the following factorizability: $U = BA^\top$. [Footnote 2] If the noise variance is known, we can actually show that dimensionality selection by VB-PCA is outperformed by a naive strategy (see Section 3. [sent-95, score-0.367]

28 In this paper, we consider the probabilistic matrix factorization (MF) model [19], where the observation noise E and the priors of A and B are assumed to be Gaussian: $p(V \mid A, B) \propto \exp\left( -\frac{1}{2\sigma^2} \| V - BA^\top \|_{\mathrm{Fro}}^2 \right)$ (1), $p(A) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( A C_A^{-1} A^\top \right) \right)$, $p(B) \propto \exp\left( -\frac{1}{2} \mathrm{tr}\left( B C_B^{-1} B^\top \right) \right)$. [sent-101, score-0.155]
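The generative model in Eq. (1) and the Gaussian priors can be illustrated with a small sampler. This is a minimal sketch, not the authors' code: it assumes isotropic prior covariances $C_A = c_a^2 I$ and $C_B = c_b^2 I$, takes $A$ as $M \times H$ and $B$ as $L \times H$ so that $BA^\top$ has the same shape as $V$, and the function name `sample_mf_observation` and all default values are hypothetical.

```python
import random


def sample_mf_observation(L, M, H, c_a=1.0, c_b=1.0, sigma=1.0, seed=0):
    """Draw V = B A^T + E under Gaussian priors and Gaussian noise.

    Assumption (not from the source): C_A = c_a^2 I and C_B = c_b^2 I.
    """
    rng = random.Random(seed)
    # A is M x H and B is L x H, so that B A^T is L x M like V.
    A = [[rng.gauss(0.0, c_a) for _ in range(H)] for _ in range(M)]
    B = [[rng.gauss(0.0, c_b) for _ in range(H)] for _ in range(L)]
    # Target matrix U = B A^T, which has rank at most H.
    U = [[sum(B[l][h] * A[m][h] for h in range(H)) for m in range(M)]
         for l in range(L)]
    # Observation: add i.i.d. Gaussian noise with standard deviation sigma.
    V = [[U[l][m] + rng.gauss(0.0, sigma) for m in range(M)]
         for l in range(L)]
    return V, U


if __name__ == "__main__":
    V, U = sample_mf_observation(L=3, M=5, H=2, sigma=0.1, seed=1)
    print(len(V), "x", len(V[0]))
```

Setting `sigma=0.0` returns `V` equal to `U`, which is a convenient sanity check on the shapes.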

29 Throughout the paper, we denote a column vector of a matrix by a bold lowercase letter, and a row vector by a bold lowercase letter with a tilde, namely, A = (a1 , . [sent-121, score-0.095]

30 [Section heading: Variational Bayesian Approximation] The Bayes posterior is given by $p(A, B \mid V) = \frac{p(V \mid A, B)\, p(A)\, p(B)}{p(V)}$ (3), where $p(V) = \langle p(V \mid A, B) \rangle_{p(A) p(B)}$. [sent-135, score-0.053]

31 Therefore, minimizing the free energy (4) amounts to finding a distribution closest to the Bayes posterior in the sense of the KL divergence. [sent-141, score-0.129]
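The link between the free energy and the KL divergence follows in one line if Eq. (4) has the standard variational form. That form is an assumption here, since Eq. (4) itself does not appear in this extract; the second equality uses only Bayes' rule, Eq. (3).

```latex
F(r) = \left\langle \log \frac{r(A,B)}{p(V \mid A,B)\, p(A)\, p(B)} \right\rangle_{r(A,B)}
     = \left\langle \log \frac{r(A,B)}{p(A,B \mid V)} \right\rangle_{r(A,B)} - \log p(V)
     = \mathrm{KL}\left( r(A,B) \,\|\, p(A,B \mid V) \right) - \log p(V).
```

Because $\log p(V)$ does not depend on $r$, minimizing $F$ over $r$ is the same as minimizing the KL divergence to the Bayes posterior.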

32 A general approach to Bayesian approximate inference is to find the minimizer of the free energy (4) with respect to r in some restricted function space. [sent-142, score-0.104]

33 (5) Under this constraint, an iterative algorithm for minimizing the free energy (4) was derived [3, 11]. [sent-144, score-0.092]

34 Let $\hat{r}$ be such a minimizer, and we define the MF solution by the mean of the target matrix U: $\hat{U} = \langle BA^\top \rangle_{\hat{r}(A,B)}$. [sent-145, score-0.063]

35 By manually choosing them, we can control regularization and sparsity of the solution (e. [sent-148, score-0.071]

36 A popular way to set the hyperparameter in the Bayesian framework is again based on the minimization of the free energy (4): $(\hat{C}_A, \hat{C}_B) = \mathrm{argmin}_{C_A, C_B} \left( \min_r F(r; C_A, C_B \mid V) \right)$. [sent-151, score-0.076]

37 When the noise variance $\sigma^2$ is unknown, it can also be estimated as $\hat{\sigma}^2 = \mathrm{argmin}_{\sigma^2} \left( \min_{r, C_A, C_B} F(r; C_A, C_B, \sigma^2 \mid V) \right)$. [sent-153, score-0.157]

38 [Footnote 3] When the number of samples is larger (smaller) than the data dimensionality in the PCA setting, the observation matrix V should consist of the columns (rows), each of which corresponds to each sample. [sent-154, score-0.092]

39 [Section 3: Simple Closed-Form Solutions of VBMF] Recently, the global analytic solution of VBMF has been derived [16]. [sent-155, score-0.093]

40 However, it is given as a solution of a quartic equation (Corollary 1 in [16]), and it is not easy to use for further analysis due to its complicated expression. [sent-156, score-0.078]

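41 Then, the VB solution can be written as a truncated shrinkage SVD as follows: $\hat{U}^{\mathrm{VB}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{VB}} \omega_{b_h} \omega_{a_h}^\top$, where $\hat{\gamma}_h^{\mathrm{VB}} = \breve{\gamma}_h^{\mathrm{VB}}$ if $\gamma_h \ge \underline{\gamma}_h^{\mathrm{VB}}$, and $\hat{\gamma}_h^{\mathrm{VB}} = 0$ otherwise. (8) [sent-160, score-0.168]

42 Here, the truncation threshold and the shrinkage estimator are, respectively, given by $\underline{\gamma}_h^{\mathrm{VB}} = \sigma \sqrt{ \frac{L+M}{2} + \frac{\sigma^2}{2 c_{a_h}^2 c_{b_h}^2} + \sqrt{ \left( \frac{L+M}{2} + \frac{\sigma^2}{2 c_{a_h}^2 c_{b_h}^2} \right)^2 - LM } }$ (9), $\breve{\gamma}_h^{\mathrm{VB}} = \gamma_h \left( 1 - \frac{\sigma^2}{2\gamma_h^2} \left( M + L + \sqrt{ (M-L)^2 + \frac{4\gamma_h^2}{c_{a_h}^2 c_{b_h}^2} } \right) \right)$ (10). [sent-161, score-0.163]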

43 h We can also derive a simple closed-form expression of the VB posterior (omitted). [sent-162, score-0.053]

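44 Then, the EVB solution can be written as a truncated shrinkage SVD as follows: $\hat{U}^{\mathrm{EVB}} = \sum_{h=1}^{H} \hat{\gamma}_h^{\mathrm{EVB}} \omega_{b_h} \omega_{a_h}^\top$, where $\hat{\gamma}_h^{\mathrm{EVB}} = \breve{\gamma}_h^{\mathrm{EVB}}$ if $\gamma_h \ge \underline{\gamma}^{\mathrm{EVB}}$, and $\hat{\gamma}_h^{\mathrm{EVB}} = 0$ otherwise. (13) [sent-165, score-0.168]

45 Here, the truncation threshold and the shrinkage estimator are, respectively, given by $\underline{\gamma}^{\mathrm{EVB}} = \sigma \sqrt{ M + L + \sqrt{LM} \left( \kappa + \frac{1}{\kappa} \right) }$ (14), $\breve{\gamma}_h^{\mathrm{EVB}} = \frac{\gamma_h}{2} \left( 1 - \frac{(M+L)\sigma^2}{\gamma_h^2} + \sqrt{ \left( 1 - \frac{(M+L)\sigma^2}{\gamma_h^2} \right)^2 - \frac{4LM\sigma^4}{\gamma_h^4} } \right)$ (15). [sent-166, score-0.163]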
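Eqs. (13)-(15) can be sketched in the same scalar style. This is a hedged illustration, not the authors' implementation: the constant $\kappa$ is treated as a given input because its defining equation is not part of this extract, and the function names and the value `kappa=2.0` in the usage lines are hypothetical.

```python
import math


def evb_threshold(L, M, sigma, kappa):
    """Truncation threshold of Eq. (14); kappa > 0 is supplied by the caller."""
    return sigma * math.sqrt(M + L + math.sqrt(L * M) * (kappa + 1.0 / kappa))


def evb_shrinkage(gamma, L, M, sigma):
    """Shrinkage estimator of Eq. (15) for one observed singular value gamma."""
    s = 1.0 - (M + L) * sigma ** 2 / gamma ** 2
    disc = s * s - 4.0 * L * M * sigma ** 4 / gamma ** 4
    # disc is nonnegative for gamma >= (sqrt(L) + sqrt(M)) * sigma;
    # the clamp only guards against floating-point round-off at the boundary.
    return 0.5 * gamma * (s + math.sqrt(max(disc, 0.0)))


def evb_singular_values(gammas, L, M, sigma, kappa):
    """Truncated shrinkage of Eq. (13) applied to a list of singular values."""
    thr = evb_threshold(L, M, sigma, kappa)
    return [evb_shrinkage(g, L, M, sigma) if g >= thr else 0.0 for g in gammas]


if __name__ == "__main__":
    # kappa = 2.0 is a placeholder value, not the constant defined in the paper.
    print(evb_threshold(L=4, M=9, sigma=1.0, kappa=2.0))
    print(evb_singular_values([10.0, 1.0], L=4, M=9, sigma=1.0, kappa=2.0))
```

Since $\kappa + 1/\kappa \ge 2$ for any $\kappa > 0$, the threshold is never below $\sigma \sqrt{M + L + 2\sqrt{LM}} = (\sqrt{L} + \sqrt{M})\sigma$, with equality at $\kappa = 1$.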

46 Large-Scale Limiting Behavior of EVB When Noise Variance Is Known Here, we first introduce a result from random matrix theory [13, 22], and then discuss the behavior of EVB when the noise variance is known. [sent-184, score-0.211]

47 Assume that E ∈ RL×M is a random matrix such that each element is independently drawn from a distribution with mean zero and variance σ 2∗ (not necessarily Gaussian). [sent-185, score-0.111]

48 Proposition 1 states that all singular values of the random matrix E are almost surely upper-bounded by $\bar{\gamma}^{\mathrm{MPUL}} = \sqrt{M \sigma^{2*} \bar{u}} = (\sqrt{L} + \sqrt{M}) \sigma^*$ (17), which we call the Marčenko-Pastur upper-limit (MPUL). [sent-198, score-0.117]

49 This property can be used for designing estimators robust against noise [10, 9]. [sent-199, score-0.082]

50 When the noise variance is known, we can set the parameter to σ = σ ∗ in Eq. [sent-201, score-0.157]

51 We can see that MPUL lower-bounds the EVB threshold (14) (actually this is true regardless of the value of κ > 0). [sent-205, score-0.051]

52 , EVB eliminates all noise components in the large-scale limit. [sent-208, score-0.099]

53 However, a simple optimal strategy— discarding the components with singular values smaller than γ MPUL —outperforms EVB, because signals lying between the gap [γ MPUL , γ EVB ) are discarded by EVB. [sent-209, score-0.118]
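The naive strategy described above, which requires the true noise level, amounts to counting singular values at or above the MPUL of Eq. (17). A minimal sketch follows; the function names and the example singular values are hypothetical.

```python
import math


def mpul(L, M, sigma_true):
    """Marcenko-Pastur upper-limit of Eq. (17): (sqrt(L) + sqrt(M)) * sigma*."""
    return (math.sqrt(L) + math.sqrt(M)) * sigma_true


def rank_by_mpul(singular_values, L, M, sigma_true):
    """Keep the components whose singular value is not smaller than the MPUL."""
    bound = mpul(L, M, sigma_true)
    return sum(1 for g in singular_values if g >= bound)


if __name__ == "__main__":
    # Illustrative singular values for a 4 x 9 matrix with unit noise variance.
    gammas = [10.0, 6.0, 4.9, 1.0]
    print(mpul(4, 9, 1.0), rank_by_mpul(gammas, 4, 9, 1.0))
```

The sketch only covers the known-variance case; it says nothing about how the noise level would be estimated when it is unknown.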

54 In Section 4, we investigate the behavior of EVB in a more practical and challenging situation where σ 2∗ is unknown and is also estimated from observation. [sent-211, score-0.077]

55 2, we also depicted the VB threshold (9) with almost flat prior $c_{a_h}, c_{b_h} \to \infty$ (labeled as ‘VBFL’) for comparison. [sent-213, score-0.198]

56 This implies that VBFL retains a lot of noise components, and does not perform well even when $\sigma^{2*}$ is known. [sent-217, score-0.082]

57 [Section 4: Analysis of EVB When Noise Variance Is Unknown] In this section, we derive bounds of the VB-based noise variance estimator, and obtain a sufficient condition for perfect dimensionality recovery in the large-scale limit. [sent-218, score-0.348]

58 [Section 4.1: Bounds of Noise Variance Estimator] The simple closed-form solution obtained in Section 3 is the global minimizer of the free energy (4), given $\sigma^2$. [sent-220, score-0.155]

59 Using the solution, we can explicitly describe the free energy as a function of σ 2 . [sent-221, score-0.076]

60 2 Note that x and κ(γh /(σ 2 M )) are rescaled versions of the squared EVB threshold (14) and the 2 EVB shrinkage estimator (15), respectively, i. [sent-223, score-0.146]

61 2 2 (Sketch of proof) First, we show that a global minimizer w. [sent-240, score-0.052]

62 For example, the half components are always discarded when the matrix is square (i. [sent-252, score-0.079]

63 The smallest singular value γL is always discarded, and 2 2 σ 2 EVB > γL /(M (L − (L − 1)(1 + α)) > γL /M always holds. [sent-255, score-0.057]

64 Figure 5: Success rate of dimensionality recovery in numerical simulation for M = 200. [sent-279, score-0.111]

65 The threshold for the guaranteed recovery (the second inequality in Eq. [sent-280, score-0.09]

66 (28)) is depicted with a vertical bar with the same color and line style. [sent-281, score-0.076]

67 [Section 4.2: Perfect Recovery Condition] Here, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit. [sent-283, score-0.207]

68 Assume that the observation matrix V is generated as $V = U^* + E$ (27), where $U^*$ is a true signal matrix with rank $H^*$ and the singular values $\{\gamma_h^*\}$, and each element of the noise matrix E is subject to a distribution with mean zero and variance $\sigma^{2*}$. [sent-284, score-0.372]

69 , we set the rank of the MF model sufficiently large), and denote the relevant rank (dimensionality) ratio by $\xi = H^*/L$. [sent-288, score-0.068]

70 In the large-scale limit with finite $\alpha$ and $H^*$, EVB implemented with a local search algorithm for the noise variance $\sigma^2$ estimation almost surely recovers the true rank, i. [sent-289, score-0.251]

71 This means the perfect recovery in the no-signal case. [sent-295, score-0.103]

72 It is important to note that, although the objective function (18) is non-convex and possibly multimodal in general, perfect recovery does not require global search, but only a local search, of the objective function for noise variance estimation. [sent-305, score-0.284]

73 E was drawn from the Gaussian distribution with variance σ 2∗ = 1, and signal singular values were drawn from the uniform distribution on [yM σ 2∗ , 10M ] for different y (the horizontal axis of the graphs indicates y). [sent-310, score-0.132]

74 The vertical axis indicates the success rate of dimensionality recovery, i. [sent-311, score-0.077]

75 Otherwise, the threshold of y for the guaranteed recovery (the second inequality in Eq. [sent-316, score-0.09]

76 [Section 5: Discussion] Here, we discuss implementation of VB-PCA, and the origin of sparsity of VB. [sent-319, score-0.062]

77 , Ferrari’s method, of a quartic equation, or rely on a numerical quartic solver, which is computationally less efficient. [sent-324, score-0.102]

78 Given an observed matrix V , we perform SVD and obtain the singular values. [sent-327, score-0.093]

79 After that, in our new implementation, we first directly estimate the noise variance based on Theorem 3, using any 1-D local search algorithm— Theorem 4 helps restrict the search range. [sent-328, score-0.191]

80 Then we obtain the noise variance estimator $\hat{\sigma}^{2\,\mathrm{EVB}}$. [sent-329, score-0.235]
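The 1-D local search step can be sketched with a golden-section search, which is one possible choice of local search and not necessarily the one the authors used. The objective below is a toy placeholder, not the free-energy objective of Eq. (18), and the bracket `[lo, hi]` merely stands in for a search range such as the one Theorem 4 would provide; all names are hypothetical.

```python
import math


def golden_section_minimize(f, lo, hi, tol=1e-8, max_iter=200):
    """Return a local minimizer of f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if b - a < tol:
            break
        if fc < fd:
            # A minimum lies in [a, d]; reuse c as the new right probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # A minimum lies in [c, b]; reuse d as the new left probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def toy_objective(sigma2):
    """Placeholder objective with its minimum at sigma2 = 2 (NOT Eq. (18))."""
    return (sigma2 - 2.0) ** 2


if __name__ == "__main__":
    print(golden_section_minimize(toy_objective, lo=0.5, hi=5.0))
```

On a multimodal objective this routine returns only one local minimizer inside the bracket, which matches the setting of a local rather than global search.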

81 For a dimensionality reduction purpose, simply discarding all the components such that $\sigma_h^2 < \hat{\sigma}^{2\,\mathrm{EVB}}$ gives the result (here $\sigma_h^2$ is defined by Eq. [sent-330, score-0.091]

82 When the estimator $\hat{U}^{\mathrm{EVB}}$ is needed, Theorem 2 gives the result for $\sigma^2 = \hat{\sigma}^{2\,\mathrm{EVB}}$. [sent-332, score-0.078]

83 The VB posterior is also easily computed (its description is omitted). [sent-333, score-0.053]

84 Actually, at least in MF, the sparsity is induced by the independence assumption between A and B. [sent-338, score-0.068]

85 1, where the Bayes posterior for V = 1 is shown in the left graph. [sent-340, score-0.053]

86 In this scalar factorization model, the probability mass in the first and the third quadrants pulls the estimator U = BA toward the positive direction, and the mass in the second and the fourth quadrants toward the negative direction. [sent-341, score-0.199]

87 Since the Bayes posterior is skewed and more mass is put in the first and the third quadrants, it is natural that the Bayesian estimator $\hat{\gamma} = \langle BA \rangle_{p(A,B \mid V)}$ is positive. [sent-342, score-0.15]
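The quadrant argument can be checked numerically for the scalar model $V = BA + E$ by integrating the unnormalized Bayes posterior on a grid. This is a rough sketch under assumed settings: $\sigma^2 = 1$ and prior standard deviations $c_a = c_b = 1$ are illustrative choices not taken from the source, and the function name and grid parameters are hypothetical.

```python
import math


def bayes_mean_of_product(V, sigma2=1.0, c_a=1.0, c_b=1.0,
                          half_width=6.0, n=120):
    """Grid approximation of the posterior mean <BA> in the 1 x 1 MF model."""
    step = half_width / n
    num = 0.0
    den = 0.0
    for i in range(-n, n + 1):
        a = i * step
        for j in range(-n, n + 1):
            b = j * step
            # Unnormalized log posterior: Gaussian likelihood times priors.
            log_w = (-(V - b * a) ** 2 / (2.0 * sigma2)
                     - a * a / (2.0 * c_a ** 2)
                     - b * b / (2.0 * c_b ** 2))
            w = math.exp(log_w)
            num += a * b * w
            den += w
    return num / den


if __name__ == "__main__":
    for V in (-1.0, 0.0, 1.0):
        print(V, bayes_mean_of_product(V))
```

For $V > 0$ each grid point with $ab > 0$ carries more weight than its mirror image with $b$ negated, so the estimate is positive; flipping the sign of $V$ flips the sign of the estimate.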

88 On the other hand, the VB posterior (the middle graph) is prohibited from being skewed because of the independence assumption, and thus it has to wait until V grows so that one of the modes has enough probability mass. [sent-344, score-0.072]

89 The Bayes posterior (left graph) implies that, if we restrict the posterior to be Gaussian, but allow it to have correlation between A and B, exact sparsity will not be observed. [sent-346, score-0.15]

90 This estimator is also depicted as blue diamonds labeled as EFB (empirical fully-Bayesian) in the right graph of Fig. [sent-348, score-0.115]

91 Further investigation on the relation between the independence constraint and the sparsity induction is our future work. [sent-351, score-0.068]

92 [Section 6: Conclusion] In this paper, we considered the variational Bayesian PCA (VB-PCA) when the noise variance is unknown. [sent-352, score-0.208]

93 Analyzing the behavior of the noise variance estimator, we derived a sufficient condition for VB-PCA to perfectly recover the true dimensionality. [sent-353, score-0.239]

94 In our theoretical analysis, we obtained bounds for a noise variance estimator and simple closed-form solutions for other parameters, which were shown to be useful for better implementation of VB-PCA. [sent-355, score-0.253]

95 Eigenvalues of large sample covariance matrices of spiked population models. [sent-367, score-0.064]

96 Spectrum estimation for large dimensional covariance matrices using random matrix theory. [sent-406, score-0.088]

97 Local minima, symmetry-breaking, and model pruning in variational free energy minimization. [sent-424, score-0.144]

98 Global solution of fully-observed variational Bayesian matrix factorization is column-wise independent. [sent-456, score-0.151]

99 On Bayesian PCA: Automatic dimensionality selection and analytic solution. [sent-463, score-0.098]

100 The strong limits of random matrix spectra for sample matrices of independent elements. [sent-503, score-0.052]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('evb', 0.834), ('vb', 0.327), ('mpul', 0.11), ('mf', 0.105), ('cb', 0.102), ('noise', 0.082), ('estimator', 0.078), ('variance', 0.075), ('pca', 0.064), ('bayesian', 0.063), ('cah', 0.063), ('cbh', 0.063), ('vbmf', 0.063), ('bayes', 0.058), ('ba', 0.057), ('singular', 0.057), ('dimensionality', 0.056), ('recovery', 0.055), ('bh', 0.055), ('posterior', 0.053), ('ah', 0.053), ('ca', 0.052), ('variational', 0.051), ('nakajima', 0.051), ('quartic', 0.051), ('rigorous', 0.051), ('tokyo', 0.048), ('perfect', 0.048), ('succes', 0.047), ('vbfl', 0.047), ('sparsity', 0.044), ('mp', 0.042), ('quadrants', 0.042), ('rl', 0.041), ('lm', 0.041), ('energy', 0.04), ('mar', 0.04), ('depicted', 0.037), ('factorization', 0.037), ('free', 0.036), ('matrix', 0.036), ('threshold', 0.035), ('rank', 0.034), ('shrinkage', 0.033), ('theorem', 0.033), ('sugiyama', 0.033), ('condition', 0.032), ('efb', 0.031), ('fro', 0.031), ('osterior', 0.031), ('spiked', 0.031), ('tomioka', 0.031), ('omitted', 0.029), ('actually', 0.029), ('minimizer', 0.028), ('solution', 0.027), ('svd', 0.027), ('analytic', 0.026), ('mext', 0.026), ('hardly', 0.026), ('japan', 0.026), ('discarded', 0.026), ('surely', 0.024), ('automatic', 0.024), ('ul', 0.024), ('global', 0.024), ('independence', 0.024), ('situation', 0.023), ('kakenhi', 0.022), ('fb', 0.022), ('eigenvalues', 0.022), ('vertical', 0.021), ('lowercase', 0.021), ('principal', 0.021), ('theoretically', 0.02), ('estimation', 0.019), ('skewed', 0.019), ('practical', 0.019), ('ee', 0.019), ('transpose', 0.019), ('limit', 0.018), ('implementation', 0.018), ('bar', 0.018), ('discarding', 0.018), ('induces', 0.018), ('behavior', 0.018), ('arranged', 0.017), ('pruning', 0.017), ('unknown', 0.017), ('search', 0.017), ('covariance', 0.017), ('letter', 0.017), ('truncation', 0.017), ('proposition', 0.017), ('components', 0.017), ('matrices', 0.016), ('selection', 0.016), ('true', 0.016), ('derived', 0.016), ('roweis', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan

Abstract: The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA.

2 0.11790158 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into low-dimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers.

3 0.06784749 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity

Author: Chen Chen, Junzhou Huang

Abstract: In Compressive Sensing Magnetic Resonance Imaging (CS-MRI), one can reconstruct a MR image with good quality from only a small number of measurements. This can significantly reduce MR scanning time. According to structured sparsity theory, the measurements can be further reduced to O(K + log n) for tree-sparse data instead of O(K + K log n) for standard K-sparse data with length n. However, few of existing algorithms have utilized this for CS-MRI, while most of them model the problem with total variation and wavelet sparse regularization. On the other side, some algorithms have been proposed for tree sparse regularization, but few of them have validated the benefit of wavelet tree structure in CS-MRI. In this paper, we propose a fast convex optimization algorithm to improve CS-MRI. Wavelet sparsity, gradient sparsity and tree sparsity are all considered in our model for real MR images. The original complex problem is decomposed into three simpler subproblems then each of the subproblems can be efficiently solved with an iterative scheme. Numerous experiments have been conducted and show that the proposed algorithm outperforms the state-of-the-art CS-MRI algorithms, and gain better reconstructions results on real MR images than general tree based solvers or algorithms.

4 0.057306025 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form intertask covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T . We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks. 1 Introduction and motivation Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9]. 
Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size. For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling. 1 We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions. 
The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from “collective learning”, and a much slower final part where tasks are learned almost independently. 2 GP regression and Bayes error We consider GP regression for T functions fτ (x), τ = 1, 2, . . . , T . These functions have to be learned from n training examples (x , τ , y ), = 1, . . . , n. Here x is the training input, τ ∈ {1, . . . , T } denotes which task the example relates to, and y is the corresponding training output. We assume that the latter is given by the target function value fτ (x ) corrupted by i.i.d. additive 2 2 Gaussian noise with zero mean and variance στ . This setup allows the noise level στ to depend on the task. In GP regression the prior over the functions fτ (x) is a Gaussian process. This means that for any set of inputs x and task labels τ , the function values {fτ (x )} have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances fτ (x)fτ (x ) = C(τ, x, τ , x ). For this covariance we take the flexible form from [5], fτ (x)fτ (x ) = Dτ τ C(x, x ). Here C(x, x ) determines the covariance between function values at different input points, encoding “spatial” behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix D is a free-form inter-task covariance matrix. 
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point x for task τ , one would predict as output the mean of fτ (x) over the (Gaussian) posterior, which is y T K −1 kτ (x). Here K is the n × n Gram matrix with entries 2 K m = Dτ τm C(x , xm ) + στ δ m , while kτ (x) is a vector with the n entries kτ, = Dτ τ C(x , x). The error bar would be taken as the square root of the posterior variance of fτ (x), which is T Vτ (x) = Dτ τ C(x, x) − kτ (x)K −1 kτ (x) (1) The learning curve for task τ is defined as the mean-squared prediction error, averaged over the location of test input x and over all data sets with a specified number of examples for each task, say n1 for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs y are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error ˆτ is the Bayes error, and is given by the average posterior variance [1], i.e. ˆτ = Vτ (x) x . To obtain the learning curve this is averaged over the location of the training inputs x : τ = ˆτ . This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in Vτ (x). Note that the training outputs, on the other hand, do not appear in the posterior variance Vτ (x) and so do not need to be averaged over. We now want to write the Bayes error ˆτ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs x , and also the test input x, are drawn from the same distribution P (x). One can decompose the input-dependent part of the covariance function into eigenfunctions relative to P (x), according to C(x, x ) = i λi φi (x)φi (x ). 
The eigenfunctions are defined by the condition C(x, x )φi (x ) x = λi φi (x) and can be chosen to be orthonormal with respect to P (x), φi (x)φj (x) x = δij . The sum over i here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel C(x, x ) = x · x ). To make the algebra below as simple as possible, we let the eigenvalues λi be arranged in decreasing order and truncate the sum to the finite range i = 1, . . . , M ; M is then some large effective feature space dimension and can be taken to infinity at the end. 2 In terms of the above eigenfunction decomposition, the Gram matrix has elements K m = Dτ 2 λi φi (x )φi (xm )+στ δ τm m δτ = i ,τ φi (x )λi δij Dτ τ φj (xm )δτ 2 ,τm +στ δ m i,τ,j,τ or in matrix form K = ΨLΨT + Σ where Σ is the diagonal matrix from the noise variances and Ψ = δτ ,iτ ,τ φi (x ), Liτ,jτ = λi δij Dτ τ (2) Here Ψ has its second index ranging over M (number of kernel eigenvalues) times T (number of tasks) values; L is a square matrix of this size. In Kronecker (tensor) product notation, L = D ⊗ Λ if we define Λ as the diagonal matrix with entries λi δij . The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, (A ⊗ B)(A ⊗ B ) = (AA ) ⊗ (BB ), (A ⊗ B)−1 = A−1 ⊗ B −1 , and tr (A ⊗ B) = (tr A)(tr B). In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, L can then be viewed as consisting of T × T blocks, each of which is proportional to Λ. To calculate the Bayes error, we need to average the posterior variance Vτ (x) over the test input x. The first term in (1) then becomes Dτ τ C(x, x) = Dτ τ tr Λ. 
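In the second one, we need to average

$$\langle k_{\tau,\ell}(x)\, k_{\tau,m}(x) \rangle_x = D_{\tau_\ell \tau} \langle C(x_\ell, x) C(x, x_m) \rangle_x D_{\tau_m \tau} = D_{\tau_\ell \tau} \sum_{ij} \lambda_i \lambda_j \phi_i(x_\ell) \langle \phi_i(x) \phi_j(x) \rangle_x \phi_j(x_m) D_{\tau_m \tau} = \sum_{i,\tau',j,\tau''} D_{\tau\tau'} \Psi_{\ell,i\tau'} \lambda_i \lambda_j \delta_{ij} \Psi_{m,j\tau''} D_{\tau''\tau}$$

In matrix form this is

$$\langle k_\tau(x) k_\tau^T(x) \rangle_x = \Psi [(D e_\tau e_\tau^T D) \otimes \Lambda^2] \Psi^T = \Psi M_\tau \Psi^T$$

Here the last equality defines $M_\tau$, and we have denoted by $e_\tau$ the $T$-dimensional vector with $\tau$-th component equal to one and all others zero. Multiplying by the inverse Gram matrix $K^{-1}$ and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task $\tau$

$$\hat\epsilon_\tau = \langle V_\tau(x) \rangle_x = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, \Psi M_\tau \Psi^T (\Psi L \Psi^T + \Sigma)^{-1}$$

Applying the Woodbury identity and re-arranging yields

$$\hat\epsilon_\tau = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau \Psi^T \Sigma^{-1} \Psi (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = D_{\tau\tau}\, \mathrm{tr}\,\Lambda - \mathrm{tr}\, M_\tau L^{-1} [I - (I + L \Psi^T \Sigma^{-1} \Psi)^{-1}]$$

But

$$\mathrm{tr}\, M_\tau L^{-1} = \mathrm{tr}\, \{[(D e_\tau e_\tau^T D) \otimes \Lambda^2][D \otimes \Lambda]^{-1}\} = \mathrm{tr}\, \{[D e_\tau e_\tau^T] \otimes \Lambda\} = e_\tau^T D e_\tau\, \mathrm{tr}\,\Lambda = D_{\tau\tau}\, \mathrm{tr}\,\Lambda$$

so the first and second terms in the expression for $\hat\epsilon_\tau$ cancel and one has

$$\hat\epsilon_\tau = \mathrm{tr}\, M_\tau L^{-1} (I + L \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, L^{-1} M_\tau L^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$$
$$= \mathrm{tr}\, [D \otimes \Lambda]^{-1} [(D e_\tau e_\tau^T D) \otimes \Lambda^2] [D \otimes \Lambda]^{-1} (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} = \mathrm{tr}\, [e_\tau e_\tau^T \otimes I] (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$$

The matrix in square brackets in the last line is just a projector $P_\tau$ onto task $\tau$; thought of as a matrix of $T \times T$ blocks (each of size $M \times M$), this has an identity matrix in the $(\tau, \tau)$ block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task $\tau$,

$$\hat\epsilon_\tau = \mathrm{tr}\, P_\tau (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1} \tag{3}$$

Because $\Sigma$ is diagonal and given the definition (2) of $\Psi$, the matrix $\Psi^T \Sigma^{-1} \Psi$ is a sum of contributions from the individual training examples $\ell = 1, \ldots, n$. This will be important for deriving the learning curve approximation below. We note in passing that, because $\sum_\tau P_\tau = I$, the sum of the Bayes errors on all tasks is $\sum_\tau \hat\epsilon_\tau = \mathrm{tr}\, (L^{-1} + \Psi^T \Sigma^{-1} \Psi)^{-1}$, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve $\epsilon_\tau = \langle \hat\epsilon_\tau \rangle$, we now need to carry out the average $\langle \ldots \rangle$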
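over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix $\mathcal{G} = (L^{-1} + \Psi^T \Sigma^{-1} \Psi + \sum_\tau v_\tau P_\tau)^{-1}$ with auxiliary parameters $v_\tau$ that will be set back to zero at the end. One can then ask how $G = \langle \mathcal{G} \rangle$ and hence $\epsilon_{\tau'} = \mathrm{tr}\, P_{\tau'} G$ changes with the number $n_\tau$ of training points for task $\tau$. Adding an example at position $x$ for task $\tau$ increases $\Psi^T \Sigma^{-1} \Psi$ by $\sigma_\tau^{-2} \phi_\tau \phi_\tau^T$, where $\phi_\tau$ has elements $(\phi_\tau)_{i\tau'} = \phi_i(x) \delta_{\tau\tau'}$. Evaluating the difference $(\mathcal{G}^{-1} + \sigma_\tau^{-2} \phi_\tau \phi_\tau^T)^{-1} - \mathcal{G}$ with the help of the Woodbury identity and approximating it with a derivative gives

$$\frac{\partial \mathcal{G}}{\partial n_\tau} = - \frac{\mathcal{G} \phi_\tau \phi_\tau^T \mathcal{G}}{\sigma_\tau^2 + \phi_\tau^T \mathcal{G} \phi_\tau}$$

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

$$\frac{\partial G}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \mathrm{tr}\, P_\tau G} \frac{\partial G}{\partial v_\tau} \tag{4}$$

Here we have exploited for the average over $x$ that the matrix $\langle \phi_\tau \phi_\tau^T \rangle_x$ has $(i\tau', j\tau'')$-entry $\langle \phi_i(x) \phi_j(x) \rangle_x \delta_{\tau\tau'} \delta_{\tau\tau''} = \delta_{ij} \delta_{\tau\tau'} \delta_{\tau\tau''}$, hence simply $\langle \phi_\tau \phi_\tau^T \rangle_x = P_\tau$. We have also used the auxiliary parameters to rewrite $-\langle \mathcal{G} P_\tau \mathcal{G} \rangle = \partial \langle \mathcal{G} \rangle / \partial v_\tau = \partial G / \partial v_\tau$. Finally, multiplying (4) by $P_{\tau'}$ and taking the trace gives the set of quasi-linear partial differential equations

$$\frac{\partial \epsilon_{\tau'}}{\partial n_\tau} = \frac{1}{\sigma_\tau^2 + \epsilon_\tau} \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \tag{5}$$

The remaining task is now to find the functions $\epsilon_\tau(n_1, \ldots, n_T, v_1, \ldots, v_T)$ by solving these differential equations. We initially attempted to do this by tracking the $\epsilon_\tau$ as examples are added one task at a time, but the derivation is laborious already for $T = 2$ and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a $2T$-dimensional surface in the $3T$-dimensional space $(n_1, \ldots, n_T, v_1, \ldots, v_T, \epsilon_1, \ldots, \epsilon_T)$, which is specified by the $T$ functions $\epsilon_\tau(\ldots)$. A small change $(\delta n_1, \ldots, \delta n_T, \delta v_1, \ldots, \delta v_T, \delta\epsilon_1, \ldots, \delta\epsilon_T)$ in all $3T$ coordinates is tangential to this surface if it obeys the $T$ constraints (one for each $\tau'$)

$$\delta\epsilon_{\tau'} = \sum_\tau \left( \frac{\partial \epsilon_{\tau'}}{\partial n_\tau} \delta n_\tau + \frac{\partial \epsilon_{\tau'}}{\partial v_\tau} \delta v_\tau \right)$$

From (5), one sees that this condition is satisfied whenever $\delta\epsilon_\tau = 0$ and $\delta n_\tau = -\delta v_\tau (\sigma_\tau^2 + \epsilon_\tau)$. It follows that all the characteristic curves given by $\epsilon_\tau(t) = \epsilon_{\tau,0} = \mathrm{const.}$, $v_\tau(t) = v_{\tau,0}(1 - t)$, $n_\tau(t) = v_{\tau,0}(\sigma_\tau^2 + \epsilon_{\tau,0})\, t$ for $t \in [0, 1]$ are tangential to the solution surface for all $t$, so lie within this surface if the initial point at $t = 0$ does. Because at $t = 0$ there are no training examples ($n_\tau(0) = 0$), this initial condition is satisfied by setting

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \right)^{-1}$$

Because $\epsilon_\tau(t)$ is constant along the characteristic curve, we get by equating the values at $t = 0$ and $t = 1$

$$\epsilon_{\tau,0} = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} v_{\tau',0} P_{\tau'} \right)^{-1} = \epsilon_\tau(\{n_{\tau'} = v_{\tau',0}(\sigma_{\tau'}^2 + \epsilon_{\tau',0})\}, \{v_{\tau'} = 0\})$$

Expressing $v_{\tau',0}$ in terms of $n_{\tau'}$ gives then

$$\epsilon_\tau = \mathrm{tr}\, P_\tau \left( L^{-1} + \sum_{\tau'} \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} P_{\tau'} \right)^{-1} \tag{6}$$

This is our main result: a closed set of $T$ self-consistency equations for the average Bayes errors $\epsilon_\tau$. Given $L$ as defined by the eigenvalues $\lambda_i$ of the covariance function, the noise levels $\sigma_\tau^2$ and the number of examples $n_\tau$ for each task, it is straightforward to solve these equations numerically to find the average Bayes error $\epsilon_\tau$ for each task.

The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of $M \times M$ blocks of size $T \times T$ (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues $\lambda_i$. Explicitly, because $L^{-1} = D^{-1} \otimes \Lambda^{-1}$, one has

$$\epsilon_\tau = \sum_i \left[ \left( \lambda_i^{-1} D^{-1} + \mathrm{diag}\left( \left\{ \frac{n_{\tau'}}{\sigma_{\tau'}^2 + \epsilon_{\tau'}} \right\} \right) \right)^{-1} \right]_{\tau\tau} \tag{7}$$

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where $D$ is diagonal. Here one recovers $T$ separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].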
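The text states that the self-consistency equations are straightforward to solve numerically but does not say how. One simple possibility, sketched here as an assumption rather than as the authors' procedure, is plain fixed-point iteration on (7), started from the prior variances. The function name `predicted_bayes_errors`, the tolerance and the power-law test spectrum are all illustrative choices.

```python
import numpy as np

def predicted_bayes_errors(lam, D, n, sigma2, tol=1e-10, max_iter=10000):
    """Solve the self-consistency equations (7) by fixed-point iteration.

    lam    : kernel eigenvalues (length M, truncated spectrum)
    D      : T x T inter-task matrix (must be invertible)
    n      : number of examples per task (length T)
    sigma2 : noise variance per task (length T)
    Returns the predicted average Bayes error for each task.
    """
    lam = np.asarray(lam, dtype=float)
    D = np.asarray(D, dtype=float)
    n = np.asarray(n, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    D_inv = np.linalg.inv(D)

    eps = np.diag(D) * lam.sum()              # prior variance: value for n = 0
    for _ in range(max_iter):
        N = np.diag(n / (sigma2 + eps))
        # One T x T block per eigenvalue: (D^{-1} / lam_i + N)^{-1}
        blocks = np.linalg.inv(D_inv[None, :, :] / lam[:, None, None] + N[None, :, :])
        new_eps = np.einsum("itt->t", blocks)  # sum over i of the diagonal entries
        if np.max(np.abs(new_eps - eps)) < tol:
            return new_eps
        eps = new_eps
    return eps

# Illustrative spectrum (assumption): power-law decay, normalised so tr(Lambda) = 1.
lam = 1.0 / np.arange(1, 201) ** 2
lam /= lam.sum()
sigma2 = [0.05, 0.05]
rho = 0.9
eps_uncorrelated = predicted_bayes_errors(lam, np.eye(2), [10, 100], sigma2)
eps_correlated = predicted_bayes_errors(lam, [[1.0, rho], [rho, 1.0]], [10, 100], sigma2)
```

For diagonal $D$ the blocks are diagonal and each task decouples, so the first component of `eps_uncorrelated` satisfies the single-task equation $\epsilon = \sum_i \lambda_i / (1 + \lambda_i n_1 / (\sigma^2 + \epsilon))$. Fully correlated tasks ($\rho = 1$) make $D$ singular and would need a different formulation than this sketch.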
4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say $\tau = 1$) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first $T_0$ tasks, $n_1 = \ldots = n_{T_0} = 0$, but a large number of examples for the remaining $T_1 = T - T_0$ tasks. Denote $E = D^{-1}$ and write this in block form as

$$E = \begin{pmatrix} E_{00} & E_{01} \\ E_{01}^T & E_{11} \end{pmatrix}$$

Now multiply by $\lambda_i^{-1}$ and add in the lower right block a diagonal matrix $N = \mathrm{diag}(\{n_\tau / (\sigma_\tau^2 + \epsilon_\tau)\}_{\tau = T_0+1, \ldots, T})$. The matrix inverse in (7) then has top left block $\lambda_i [E_{00}^{-1} + E_{00}^{-1} E_{01} (\lambda_i N + E_{11} - E_{01}^T E_{00}^{-1} E_{01})^{-1} E_{01}^T E_{00}^{-1}]$. As the number of examples for the last $T_1$ tasks grows, so do all (diagonal) elements of $N$. In the limit only the term $\lambda_i E_{00}^{-1}$ survives, and summing over $i$ gives $\epsilon_1 = \mathrm{tr}\,\Lambda\, (E_{00}^{-1})_{11} = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last $T_1$ tasks $f_\tau(x)$ ($\tau = T_0 + 1, \ldots, T$) have been learned perfectly, the posterior over the first $T_0$ functions is, by standard Gaussian conditioning, a GP with covariance $C(x, x') (E_{00})^{-1}$. Averaging the posterior variance of $f_1(x)$ then gives the Bayes error on task 1 as $\epsilon_1 = \langle C(x, x) \rangle (E_{00}^{-1})_{11}$, as found earlier.

This analysis can be extended to the case where there are some examples available also for the first $T_0$ tasks. One finds for the generalization errors on these tasks the prediction (7) with $D^{-1}$ replaced by $E_{00}$. This is again in line with the above form of the GP posterior after perfect learning of the remaining $T_1$ tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for $T = 2$ tasks.
Here we have the work of Chai [21] as a baseline, and as there we choose

$$D = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions $f_1(x)$ and $f_2(x)$ to unit variance; the degree of correlation of the tasks is controlled by $\rho$. We fix $\pi_2 = n_2/n$ and plot learning curves against $n$. In numerical simulations we ensure integer values of $n_1$ and $n_2$ by setting $n_2 = \lfloor n\pi_2 \rfloor$, $n_1 = n - n_2$; for evaluation of (7) we use directly $n_2 = n\pi_2$, $n_1 = n(1 - \pi_2)$. For simplicity we consider equal noise levels $\sigma_1^2 = \sigma_2^2 = \sigma^2$. As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel $C(x, x') = \exp[-(x - x')^2 / (2l^2)]$ with lengthscale $l$, and one-dimensional inputs $x$ with a Gaussian distribution $N(0, 1/12)$. The kernel eigenvalues $\lambda_i$ are known explicitly from [22] and decay exponentially with $i$.

[Figure 1: three panels plotting $\epsilon_1$ against $n$, each with a log-log inset.]

Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale $l = 0.01$, noise level $\sigma^2 = 0.05$ and a fraction $\pi_2 = 0.75$ of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation $\rho^2 = 0, 0.25, 0.5, 0.75, 1$ from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show tendency of asymptotic uselessness, i.e. bunching of the $\rho < 1$ curves towards the one for $\rho = 0$; this effect is strongest for learning of smooth functions (left and middle).

Figure 1(left) compares numerically simulated learning curves with the predictions for $\epsilon_1$, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for $\rho^2 = 0, 0.25, 0.5, 0.75, 1$.
Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored ($\rho = 0$) or effectively treated as being from task 1 ($\rho = 1$). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with $\rho^2$. The curves for the different $\rho^2$ values are fairly evenly spaced vertically for small number of examples, $n$, corresponding to a linear dependence on $\rho^2$. As $n$ increases, however, the learning curves for $\rho < 1$ start to bunch together and separate from the one for the fully correlated case ($\rho = 1$). The approximation (7) correctly captures this behaviour, which is discussed in more detail below.

Figure 1(middle) has analogous results for the case of inputs $x$ uniformly distributed on the interval $[0, 1]$; the $\lambda_i$ here decay exponentially with $i^2$ [17]. Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance $\langle V_\tau(x) \rangle$; but for a uniform input distribution this variation will be weak except near the ends of the input range $[0, 1]$. Figure 1(right) displays similar results for an OU kernel $C(x, x') = \exp(-|x - x'|/l)$, showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples $n$ is large, the learning curves seem to bunch towards the curve for $\rho = 0$, where task 2 examples are ignored, except when the two tasks are fully correlated ($\rho = 1$). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors $\epsilon_\tau$ will become small and eventually be negligible compared to the noise variances $\sigma_\tau^2$ in (7).
One then has an explicit prediction for each $\epsilon_\tau$, without solving $T$ self-consistency equations. If we write, for $T$ tasks, $n_\tau = n\pi_\tau$ with $\pi_\tau$ the fraction of examples for task $\tau$, and set $\gamma_\tau = \pi_\tau / \sigma_\tau^2$, then for large $n$

$$\epsilon_\tau = \sum_i \left[ (\lambda_i^{-1} D^{-1} + n\Gamma)^{-1} \right]_{\tau\tau} = \sum_i \left( \Gamma^{-1/2} \left[ \lambda_i^{-1} (\Gamma^{1/2} D \Gamma^{1/2})^{-1} + nI \right]^{-1} \Gamma^{-1/2} \right)_{\tau\tau} \tag{8}$$

where $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_T)$. Using an eigendecomposition of the symmetric matrix $\Gamma^{1/2} D \Gamma^{1/2} = \sum_{a=1}^T \delta_a v_a v_a^T$, one then shows in a few lines that (8) can be written as

$$\epsilon_\tau \approx \gamma_\tau^{-1} \sum_a (v_{a,\tau})^2\, \delta_a\, g(n\delta_a) \tag{9}$$

where $g(h) = \mathrm{tr}\,(\Lambda^{-1} + h)^{-1} = \sum_i (\lambda_i^{-1} + h)^{-1}$ and $v_{a,\tau}$ is the $\tau$-th component of the $a$-th eigenvector $v_a$. This is the general asymptotic form of our prediction for the average Bayes error for task $\tau$.

[Figure 2: left panel plotting $r$ against $\rho^2$ for $n = 500$, $5000$ and $50000$, with inset; right panel plotting $\epsilon$ against $n$ on log-log axes, with inset.]

Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with $n = 500$) vs $\rho^2$. To focus on the error reduction with $\rho$, $r = [\epsilon_1(\rho) - \epsilon_1(1)] / [\epsilon_1(0) - \epsilon_1(1)]$ is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger $n$, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: Analogous results for rough functions (parameters as in Fig. 1(right)). Right: Learning curve for many-task learning ($T = 200$, parameters otherwise as in Fig. 1(left) except $\rho^2 = 0.8$). Notice the bend around $\epsilon_1 = 1 - \rho = 0.106$. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order $\tau = 1, \ldots, T$ rather than randomly); dashed line: predictions from (7). Inset: Predictions for $T = 1000$, with asymptotic forms $\epsilon = 1 - \rho + \rho\tilde\epsilon$ and $\epsilon = (1 - \rho)\bar\epsilon$ for the two learning stages shown as solid lines.

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order $r$.
The kernel eigenvalues $\lambda_i$ then decay as $i^{-(2r+2)}$ for large $i$ (footnote 1), and using arguments from [17] one deduces that $g(h) \sim h^{-\alpha}$ for large $h$, with $\alpha = (2r + 1)/(2r + 2)$. In (9) we can then write, for large $n$, $g(n\delta_a) \approx (\delta_a/\gamma_\tau)^{-\alpha} g(n\gamma_\tau)$ and hence

$$\epsilon_\tau \approx g(n\gamma_\tau) \left\{ \sum_a (v_{a,\tau})^2 (\delta_a/\gamma_\tau)^{1-\alpha} \right\} \tag{10}$$

When there is only a single task, $\delta_1 = \gamma_1$ and this expression reduces to $\epsilon_1 = g(n\gamma_1) = g(n_1/\sigma_1^2)$. Thus $g(n\gamma_\tau) = g(n_\tau/\sigma_\tau^2)$ is the error we would get by ignoring all examples from tasks other than $\tau$, and the term in $\{\ldots\}$ in (10) gives the "multi-task gain", i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for $n \to \infty$, along with the errors themselves.)

One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to $r \to \infty$ and hence $\alpha \to 1$, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues $\delta_a$ of $\Gamma^{1/2} D \Gamma^{1/2}$ will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against $\rho^2$ for the two-task setting of Fig. 1(left) with $n = 500$. Our predictions capture the nonlinear dependence on $\rho^2$ quite well, though the effect is somewhat weaker in the simulations. For larger $n$ the predictions approach a curve that is constant for $\rho < 1$, signifying negligible improvement from multi-task learning except at $\rho = 1$. It is worth contrasting this with the lower bound from [21], which is linear in $\rho^2$. While this provides a very good approximation to the learning curves for moderate $n$ [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in $\rho^2$: see Fig. 2(left, inset) for the OU case, which has $r = 1$.
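The multi-task gain, i.e. the factor in curly brackets in (10), depends only on $D$, the example fractions $\pi_\tau$, the noise levels and the exponent $\alpha$. A minimal sketch of how one might evaluate it is given below; the function name `multitask_gain` and the zero-based task index are assumptions of this illustration.

```python
import numpy as np

def multitask_gain(D, pi, sigma2, alpha, tau):
    """Asymptotic multi-task gain: sum_a v_{a,tau}^2 (delta_a / gamma_tau)^(1 - alpha).

    D      : T x T inter-task matrix
    pi     : fraction of examples per task
    sigma2 : noise variance per task
    alpha  : decay exponent of g(h) ~ h^(-alpha)
    tau    : task index (0-based in this sketch)
    """
    D = np.asarray(D, dtype=float)
    gamma = np.asarray(pi, dtype=float) / np.asarray(sigma2, dtype=float)
    root = np.sqrt(gamma)
    S = root[:, None] * D * root[None, :]     # Gamma^{1/2} D Gamma^{1/2}
    delta, V = np.linalg.eigh(S)              # columns of V are the eigenvectors v_a
    delta = np.clip(delta, 0.0, None)         # guard against tiny negative round-off
    return float(np.sum(V[tau, :] ** 2 * (delta / gamma[tau]) ** (1.0 - alpha)))

rho = 0.6
D2 = np.array([[1.0, rho], [rho, 1.0]])
gain_rough = multitask_gain(D2, [0.5, 0.5], [0.05, 0.05], alpha=0.5, tau=0)
gain_smooth = multitask_gain(D2, [0.5, 0.5], [0.05, 0.05], alpha=1.0, tau=0)
```

For two tasks with equal $\gamma_\tau$ the eigenvalues are $\gamma(1 \pm \rho)$ with eigenvectors $(1, \pm 1)/\sqrt{2}$, so the gain is $\frac{1}{2}[(1+\rho)^{1-\alpha} + (1-\rho)^{1-\alpha}]$. This lies below one for $\alpha < 1$ and equals one at $\alpha = 1$, which is the asymptotic uselessness discussed above.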
A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

Footnote 1: See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations, $D_{\tau,\tau'}$ with $\tau \neq \tau'$, are equal to $\rho$, while $D_{\tau,\tau} = 1$. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to $\sqrt{\rho}$ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number $n_\tau = n/T$ of examples for each task, and that all noise levels are the same, $\sigma_\tau^2 = \sigma^2$. Then also all Bayes errors $\epsilon_\tau = \epsilon$ will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

$$\epsilon = g_T(n/(\sigma^2 + \epsilon), \rho) \tag{11}$$

where $g_T(h, \rho)$ is related to the single-task function $g(h)$ from above by

$$g_T(h, \rho) = \frac{T - 1}{T} (1 - \rho)\, g(h(1 - \rho)/T) + \left( \rho + \frac{1 - \rho}{T} \right) g(h[\rho + (1 - \rho)/T]) \tag{12}$$

Now consider the limit $T \to \infty$ of many tasks. If $n$ and hence $h = n/(\sigma^2 + \epsilon)$ is kept fixed, $g_T(h, \rho) \to (1 - \rho) + \rho g(h\rho)$; here we have taken $g(0) = 1$ which corresponds to $\mathrm{tr}\,\Lambda = \langle C(x, x) \rangle_x = 1$ as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form $\epsilon = (1 - \rho) + \rho\tilde\epsilon$, where $\tilde\epsilon$ decays from one to zero with increasing $n$ as for a single task, but with an effective noise level $\tilde\sigma^2 = (1 - \rho + \sigma^2)/\rho$. Remarkably, then, even though here $n/T \to 0$ so that for most tasks no examples have been seen, the Bayes error for each task decreases by "collective learning" to a plateau of height $1 - \rho$. The remaining decay of $\epsilon$ to zero happens only once $n$ becomes of order $T$. Here one can show, by taking $T \to \infty$ at fixed $h/T$ in (12) and inserting into (11), that $\epsilon = (1 - \rho)\bar\epsilon$ where $\bar\epsilon$ again decays as for a single task but with an effective number of examples $\bar n = n/T$ and effective noise level $\bar\sigma^2 = \sigma^2/(1 - \rho)$.
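Equation (12) can be checked against a brute-force evaluation of the matrix inverses in (7) for the equicorrelated $D$. The sketch below does this for a small $T$ and also illustrates the limit $g_T(h, \rho) \to (1 - \rho) + \rho\, g(h\rho)$ for fixed $h$; the function names and the power-law spectrum are assumptions chosen for illustration.

```python
import numpy as np

def g(h, lam):
    """Single-task function g(h) = sum_i (1 / lam_i + h)^(-1)."""
    return float(np.sum(1.0 / (1.0 / lam + h)))

def g_T(h, rho, T, lam):
    """Closed form (12) for T equicorrelated tasks with equal n and noise."""
    low = (T - 1) / T * (1 - rho) * g(h * (1 - rho) / T, lam)
    high = (rho + (1 - rho) / T) * g(h * (rho + (1 - rho) / T), lam)
    return low + high

def g_T_direct(h, rho, T, lam):
    """Brute force: sum over i of the (1,1) entry of (D^{-1} / lam_i + (h / T) I)^{-1}."""
    D = (1 - rho) * np.eye(T) + rho * np.ones((T, T))
    D_inv = np.linalg.inv(D)
    total = 0.0
    for l in lam:
        total += np.linalg.inv(D_inv / l + (h / T) * np.eye(T))[0, 0]
    return total

# Illustrative spectrum (assumption): power-law decay with tr(Lambda) = 1, so g(0) = 1.
lam = 1.0 / np.arange(1, 51) ** 2
lam /= lam.sum()
rho = np.sqrt(0.8)
closed_form = g_T(40.0, rho, 5, lam)
brute_force = g_T_direct(40.0, rho, 5, lam)
many_task_value = g_T(40.0, rho, 10 ** 7, lam)
plateau_form = (1 - rho) + rho * g(40.0 * rho, lam)
```

The two terms of (12) correspond to the two distinct eigenvalues of $D$: $1 - \rho$ (with multiplicity $T - 1$) and $1 - \rho + T\rho$ (the shared direction).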
This final stage of learning therefore happens only when each task has seen a considerable number of examples $n/T$. Fig. 2(right) validates these predictions against simulations, for a number of tasks ($T = 200$) that is in the same ballpark as in the many-tasks application example of [24]. The inset for $T = 1000$ shows clearly how the two learning curve stages separate as $T$ becomes larger.

Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order $r$ as before, the function $\bar\epsilon$ from above will decay as $(\bar n/\bar\sigma^2)^{-\alpha}$; since $\epsilon = (1 - \rho)\bar\epsilon$ and $\bar\sigma^2 = \sigma^2/(1 - \rho)$, the Bayes error is then proportional to $(1 - \rho)^{1-\alpha}$. This multi-task gain again approaches unity for $\rho < 1$ for smooth functions ($\alpha = (2r + 1)/(2r + 2) \to 1$). Interestingly, for rough functions ($\alpha < 1$), the multi-task gain decreases for small $\rho^2$ as $1 - (1 - \alpha)\sqrt{\rho^2}$ and so always lies below a linear dependence on $\rho^2$ initially. This shows that a linear-in-$\rho^2$ lower error bound cannot generally apply to $T > 2$ tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices $D$. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated.
For the limit of many tasks we found that, remarkably, some initial "collective learning" is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this also showed explicitly that a lower error bound that is linear in $\rho^2$, the square of the inter-task correlation, is applicable only to the two-task setting $T = 2$.

In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations $D$, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations $D$ would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C K I Williams and C Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.
[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.
[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press.
[7] M Alvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.
[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.
[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.
[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.
[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002.
[17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.
[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.
[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.
[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.
[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241. Morgan Kaufmann, 1998.

5 0.055016145 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm.

6 0.053159617 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

7 0.049684666 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation

8 0.048464887 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

9 0.048336256 254 nips-2012-On the Sample Complexity of Robust PCA

10 0.04518896 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

11 0.045154884 26 nips-2012-A nonparametric variable clustering model

12 0.042436615 34 nips-2012-Active Learning of Multi-Index Function Models

13 0.042214599 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

14 0.040299669 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression

15 0.039987419 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

16 0.039909333 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

17 0.039533138 16 nips-2012-A Polynomial-time Form of Robust Regression

18 0.038853846 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses

19 0.038671698 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature

20 0.038666569 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.114), (1, 0.03), (2, 0.03), (3, -0.025), (4, -0.044), (5, 0.04), (6, -0.014), (7, 0.023), (8, 0.001), (9, -0.039), (10, -0.023), (11, -0.03), (12, -0.024), (13, -0.037), (14, 0.009), (15, 0.034), (16, 0.089), (17, -0.02), (18, -0.016), (19, -0.055), (20, -0.04), (21, 0.042), (22, -0.027), (23, 0.043), (24, -0.046), (25, -0.014), (26, 0.019), (27, 0.021), (28, -0.005), (29, 0.068), (30, -0.043), (31, 0.022), (32, 0.003), (33, -0.077), (34, -0.01), (35, 0.001), (36, 0.025), (37, 0.034), (38, 0.038), (39, 0.012), (40, 0.059), (41, -0.045), (42, 0.028), (43, -0.002), (44, 0.047), (45, -0.002), (46, 0.018), (47, -0.036), (48, 0.018), (49, -0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89847565 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

Author: Shinichi Nakajima, Ryota Tomioka, Masashi Sugiyama, S. D. Babacan

Abstract: The variational Bayesian (VB) approach is one of the best tractable approximations to the Bayesian estimation, and it was demonstrated to perform well in many applications. However, its good performance was not fully understood theoretically. For example, VB sometimes produces a sparse solution, which is regarded as a practical advantage of VB, but such sparsity is hardly observed in the rigorous Bayesian estimation. In this paper, we focus on probabilistic PCA and give more theoretical insight into the empirical success of VB. More specifically, for the situation where the noise variance is unknown, we derive a sufficient condition for perfect recovery of the true PCA dimensionality in the large-scale limit when the size of an observed matrix goes to infinity. In our analysis, we obtain bounds for a noise variance estimator and simple closed-form solutions for other parameters, which themselves are actually very useful for better implementation of VB-PCA.

2 0.67399949 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into low-dimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers.

3 0.65868396 254 nips-2012-On the Sample Complexity of Robust PCA

Author: Matthew Coudron, Gilad Lerman

Abstract: We estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size $N$ is of order $O(N^{-0.5+\epsilon})$ for arbitrarily small $\epsilon > 0$ (affecting the probabilistic estimate); this rate of convergence is close to the one of direct covariance estimation, i.e., $O(N^{-0.5})$. Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is $O(D^{2+\delta})$ for arbitrarily small $\delta > 0$ (whereas the sample complexity of direct covariance estimation with Frobenius norm is $O(D^2)$). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm.

4 0.59047198 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

Author: Arnak Dalalyan, Yin Chen

Abstract: In this paper, we develop a novel approach to the problem of learning sparse representations in the context of fused sparsity and unknown noise level. We propose an algorithm, termed Scaled Fused Dantzig Selector (SFDS), that accomplishes the aforementioned learning task by means of a second-order cone program. Special emphasis is put on the particular instance of fused sparsity corresponding to learning in the presence of outliers. We establish finite sample risk bounds and carry out an experimental evaluation on both synthetic and real data.

5 0.58566922 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. Compared with the existing relative-error CUR algorithms, the proposed algorithm has a tighter theoretical bound and lower time complexity, and it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.

6 0.56192911 34 nips-2012-Active Learning of Multi-Index Function Models

7 0.55235618 86 nips-2012-Convex Multi-view Subspace Learning

8 0.53500891 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

9 0.5334174 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

10 0.52170926 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

11 0.51936805 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

12 0.51261288 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

13 0.50861275 37 nips-2012-Affine Independent Variational Inference

14 0.49610364 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

15 0.49456415 234 nips-2012-Multiresolution analysis on the symmetric group

16 0.49141085 125 nips-2012-Factoring nonnegative matrices with linear programs

17 0.49105734 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

18 0.4896268 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

19 0.48933321 211 nips-2012-Meta-Gaussian Information Bottleneck

20 0.48814988 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (21, 0.012), (38, 0.518), (39, 0.017), (42, 0.024), (54, 0.019), (55, 0.019), (61, 0.016), (74, 0.033), (76, 0.118), (77, 0.015), (80, 0.064), (92, 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99593276 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L_p$ Loss

Author: Zhuo Wang, Alan Stocker, Daniel Lee

Abstract: In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general $L_p$ norm. We generalize the Cramér-Rao lower bound and show how the $L_p$ loss can be written as a functional of the Fisher information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing $L_p$ loss in the limit as p goes to zero.

2 0.9892987 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees $\epsilon$ accuracy within $O(1/\epsilon)$ iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization, which exploits a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle.

3 0.98816752 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state-of-the-art approaches that yield global optima.

4 0.98341262 165 nips-2012-Iterative ranking from pair-wise comparisons

Author: Sahand Negahban, Sewoong Oh, Devavrat Shah

Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g., MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding ‘scores’ for each object (e.g., a player’s rating) is of interest for understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects, with an edge present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1].

5 0.96358407 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, the number of shared hypotheses and the VC dimension of the hypothesis class. We conducted experiments on both synthetic problems and sentiment analysis of reviews, which strongly support our approach.

same-paper 6 0.96327692 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

7 0.96222532 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

8 0.9515512 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

9 0.94544196 208 nips-2012-Matrix reconstruction with the local max norm

10 0.94420862 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

11 0.90640372 285 nips-2012-Query Complexity of Derivative-Free Optimization

12 0.89775741 86 nips-2012-Convex Multi-view Subspace Learning

13 0.89445806 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

14 0.88975257 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

15 0.88490027 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

16 0.88105452 30 nips-2012-Accuracy at the Top

17 0.88042456 324 nips-2012-Stochastic Gradient Descent with Only One Projection

18 0.87707019 187 nips-2012-Learning curves for multi-task Gaussian process regression

19 0.8758679 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

20 0.87332088 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization