nips nips2012 nips2012-86 knowledge-graph by maker-knowledge-mining

86 nips-2012-Convex Multi-view Subspace Learning


Source: pdf

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. [sent-7, score-0.135]

2 In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. [sent-8, score-0.258]

3 For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. [sent-9, score-0.298]

4 Re-expressing high dimensional data in a low dimensional representation has been used to discover important latent information about individual data items, visualize entire data sets to uncover their global organization, and even improve subsequent clustering or supervised learning [1]. [sent-12, score-0.105]

5 For example, given multiple camera views of a single object, the particular idiosyncrasies of each camera are generally independent, hence the images they capture will be conditionally independent given the scene. [sent-18, score-0.149]

6 In this paper we focus on the problem of multi-view subspace learning: reducing dimensionality when data consists of multiple, conditionally independent sources. [sent-21, score-0.146]

7 Classically, multi-view subspace learning has been achieved by an application of canonical correlation analysis (CCA) [2, 3]. [sent-22, score-0.165]

8 In particular, many successes have been achieved in using CCA to recover meaningful latent representations in a multi-view setting [4–6]. [sent-23, score-0.099]

9 By contrast, in the single-view setting, recent work has developed new generalizations of subspace learning that can accommodate arbitrary convex losses [10–12]. [sent-28, score-0.205]

10 These papers replace the hard bound on the dimension of the latent representation with a structured convex regularizer that still reduces rank, but in a relaxed manner that admits greater flexibility while retaining tractable formulations. [sent-29, score-0.173]

11 1 Subspace learning can be achieved in this case by first recovering an optimal reduced rank response matrix and then extracting the latent representation and reconstruction model. [sent-31, score-0.299]

12 Unfortunately, the multi-view formulation of subspace learning does not have an obvious convex form, and current work has resorted to local training methods based on alternating descent minimization (or approximating intractable integrals). [sent-33, score-0.244]

13 Consequently, there is no guarantee of recovering a globally optimal subspace. [sent-34, score-0.101]

14 In this paper we provide a formulation of multi-view subspace learning that can be solved optimally and efficiently. [sent-35, score-0.158]

15 After deriving a new formulation of multi-view subspace learning that allows a global solution, we also derive efficient new algorithms. [sent-37, score-0.178]

16 The outcome is an efficient approach to multi-view subspace discovery that can produce high quality repeatable results. [sent-38, score-0.14]

17 Notation: We use $I_k$ for the $k \times k$ identity matrix, $A'$ for the transpose of matrix $A$, $\|\cdot\|_2$ for the Euclidean norm, $\|X\|_F = \sqrt{\mathrm{tr}(X'X)}$ for the Frobenius norm and $\|X\|_{tr} = \sum_i \sigma_i(X)$ for the trace norm, where $\sigma_i(X)$ is the $i$th singular value of $X$. [sent-39, score-0.291]
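As a quick sanity check of this notation, the Frobenius and trace norms can be computed directly. This is a minimal numpy sketch; the matrix is illustrative, not from the paper:

```python
import numpy as np

# Illustrative matrix: singular values are 3 and 4 by construction.
X = np.array([[3.0, 0.0],
              [0.0, 4.0]])

# Frobenius norm: ||X||_F = sqrt(tr(X' X)).
fro = np.sqrt(np.trace(X.T @ X))

# Trace norm: ||X||_tr = sum of singular values sigma_i(X).
trace_norm = np.linalg.svd(X, compute_uv=False).sum()

print(fro)         # 5.0
print(trace_norm)  # 7.0
```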

18 The goal of multi-view subspace learning is to infer, for each pair, a shared latent representation, $h_j$, of dimension $k < \min(n, m)$, such that the original data can be accurately modeled. [sent-41, score-0.245]

19 Given paired observations the goal is to infer a set of latent representations, $h_j$, and reconstruction models, $A$ and $B$, such that $Ah_j \approx x_j$ and $Bh_j \approx y_j$ for all $j$. [sent-43, score-0.193]

20 The problem can then be expressed as recovering an $(n+m) \times k$ matrix of model parameters, $C = \begin{bmatrix} A \\ B \end{bmatrix}$, and a $k \times t$ matrix of latent representations, $H$, such that $Z \approx CH$. [sent-45, score-0.112]

21 The key assumption of multi-view learning is that each of the two views, $x_j$ and $y_j$, is conditionally independent given the shared latent representation, $h_j$. [sent-46, score-0.203]

22 Although multi-view data can always be concatenated and treated as a single view, if the conditional independence assumption holds, explicitly representing multiple views enables more accurate identification of the latent representation (as we will see). [sent-47, score-0.243]

23 The classical formulation of multi-view subspace learning is given by canonical correlation analysis (CCA), which is typically expressed as the problem of projecting two views so that the correlation between them is maximized [2]. [sent-48, score-0.317]
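The projection view of CCA described above can be sketched numerically: whiten each view, then take an SVD of the cross-covariance, whose singular values are the canonical correlations. A minimal illustration on synthetic data (sizes, noise level, and variable names are assumptions for illustration, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, m, k = 200, 5, 4, 2

# Two views driven by a shared latent signal (synthetic, for illustration only).
H = rng.standard_normal((k, t))
X = rng.standard_normal((n, k)) @ H + 0.1 * rng.standard_normal((n, t))
Y = rng.standard_normal((m, k)) @ H + 0.1 * rng.standard_normal((m, t))

# Classical CCA: whiten each view, then SVD the whitened cross-covariance.
Cxx = X @ X.T / t
Cyy = Y @ Y.T / t
Cxy = X @ Y.T / t
Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening transforms
Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)

# Singular values s are the canonical correlations, in [0, 1], descending.
print(s[:k])
```

With a strong shared signal, the first canonical correlations are close to 1, reflecting the shared latent representation.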

24 In fact, one can show that the classical CCA problem (1) is equivalent to the following multi-view subspace learning problem. [sent-56, score-0.118]

25 Such sharing can be detrimental if the two views really are conditionally independent given $h_j$. [sent-68, score-0.174]

26 Recently, subspace learning algorithms have been greatly generalized in the single view case by relaxing the rank(H) = k constraint while imposing a structured regularizer that is a convex relaxation of rank [10–12]. [sent-70, score-0.264]

27 Such a relaxation allows one to incorporate arbitrary convex losses, including robust losses [10], while maintaining tractability. [sent-71, score-0.087]

28 As mentioned, these relaxations of single-view subspace learning have only recently been proposed for the multi-view setting [13, 14]. [sent-72, score-0.118]

29 An extension of these proposals can be achieved by reformulating (2) to first incorporate an arbitrary loss function L that is convex in its first argument (for examples, see [15]), then relaxing the rank constraint by replacing it with a rank-reducing regularizer on H. [sent-73, score-0.146]

30 The significance of using the (2, 1)-block norm as a regularizer is that it encourages rows of H to become sparse, hence reducing the dimensionality of the learned representation [16]. [sent-77, score-0.117]
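To see the row-sparsity effect concretely, compare the (2,1)-block norm of a row-sparse matrix with a dense matrix of equal Frobenius norm. This is a hedged toy example; the matrices are made up for illustration:

```python
import numpy as np

def block_norm_21(H):
    # ||H||_{2,1}: sum over rows of the Euclidean norm of each row.
    return np.linalg.norm(H, axis=1).sum()

# A hypothetical latent matrix with only 2 of 5 active rows: the (2,1)-norm
# penalizes the number of "energetic" rows, which is how it reduces the
# effective dimensionality of the learned representation.
H_sparse = np.zeros((5, 10))
H_sparse[0] = 1.0
H_sparse[1] = 1.0
H_dense = np.full((5, 10), np.sqrt(2.0 / 5.0))  # same Frobenius norm, all rows active

print(block_norm_21(H_sparse), block_norm_21(H_dense))
```

Both matrices have Frobenius norm $\sqrt{20}$, but the row-sparse one has a strictly smaller (2,1)-norm, so the regularizer prefers it.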

31 Note that (3) can in principle be tackled by a boosting strategy; however, one would have to formulate a difficult weak learning oracle that considers both views simultaneously [17]. [sent-83, score-0.218]

32 To derive our tractable reformulation, we first introduce the change of variable $\hat Z = CH$, which allows us to rewrite (3) equivalently as $\min_{\hat Z} L(\hat Z; Z) + \alpha \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{\{H : CH = \hat Z\}} \|H\|_{2,1}$. [sent-85, score-0.174]

33 $\min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{\{H : CH = \hat Z\}} \|H\|_{2,1}$ defines a norm $|||\cdot|||_*$ (on $\hat Z$) whose dual norm is $|||\Gamma||| := \max_{c \in \mathcal{C},\, \|h\|_2 \le 1} c'\Gamma h$. [sent-88, score-0.331]

34 Therefore $\min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{\{H : CH = \hat Z\}} \|H\|_{2,1} = \min_{\{C, \lambda_i :\, C_{:,i} \in \mathcal{C},\, \lambda_i \ge 0,\, \hat Z = \sum_i \lambda_i C_{:,i} \tilde H_{i,:}\}} \sum_i \lambda_i = \min_{\{t \ge 0 : \hat Z \in t\mathcal{K}\}} t$, (5) where $\mathcal{K}$ is the convex hull of the set $\mathcal{G} := \{c h' : c \in \mathcal{C},\, \|h\|_2 = 1\}$. [sent-92, score-0.321]

35 Since the set $\mathcal{K}$ is convex and symmetric, (5) is known as a gauge function and defines a norm on $\hat Z$. [sent-94, score-0.113]

36 This norm has a dual given by $|||\Gamma||| := \max_{Z \in \mathcal{K}} \mathrm{tr}(\Gamma' Z) = \max_{c \in \mathcal{C},\, \|h\|_2 \le 1} c'\Gamma h$, (6) where the last equality follows because maximizing any linear function over the convex hull $\mathcal{K}$ of a set $\mathcal{G}$ achieves the same value as maximizing over the set $\mathcal{G}$ itself. [sent-100, score-0.298]
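The "last equality" here rests on a general fact: a linear function attains the same maximum over a set as over its convex hull. A small numerical illustration, with a finite point set standing in for $\mathcal{G}$ (all sizes and names below are made up):

```python
import numpy as np

rng = np.random.default_rng(6)
G = rng.standard_normal((20, 3))   # a finite set of points standing in for G
g = rng.standard_normal(3)         # coefficients of a linear objective

# Maximum of the linear function over the set itself.
best_on_set = (G @ g).max()

# Sampled points of the convex hull: random convex combinations of G.
lam = rng.random((500, 20))
lam /= lam.sum(axis=1, keepdims=True)
best_on_hull_samples = ((lam @ G) @ g).max()

# No convex combination can exceed the best vertex.
print(best_on_set, best_on_hull_samples)
```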

37 (3) $= \min_{\hat Z} L(\hat Z; Z) + \alpha \max_{\rho \ge 0} \|D_\rho^{-1} \hat Z\|_{tr}$, where $D_\rho = \mathrm{diag}(d_\rho I_n,\ d'_\rho I_m)$ is a block-diagonal rescaling whose scalars are built from $\beta$, $\gamma$ and $\sqrt{\beta^2 + \gamma^2 \rho}$. [sent-103, score-0.296]

38 The lemma is proved by first deriving an explicit form of the norm ||| · ||| in (6), then deriving its dual norm. [sent-105, score-0.143]

39 Expand $h(\eta)$ into $\left\| \begin{bmatrix} \sqrt{\eta/\beta^2}\, \hat Z^X \\ \sqrt{(1-\eta)/\gamma^2}\, \hat Z^Y \end{bmatrix} \right\|_{tr} = \mathrm{tr}\sqrt{\tfrac{\eta}{\beta^2} (\hat Z^X)'\hat Z^X + \tfrac{1-\eta}{\gamma^2} (\hat Z^Y)'\hat Z^Y}$, where $\mathrm{tr}(\sqrt{\cdot})$ means summing the square roots of the eigenvalues (i.e. the singular values). [sent-115, score-0.38]

40 (3) $= \min_{\hat Z} L(\hat Z; Z) + \alpha \max_{0 \le \eta \le 1} \|E_\eta^{-1} \hat Z\|_{tr} = \max_{0 \le \eta \le 1} \min_{\hat Z} L(\hat Z; Z) + \alpha \|E_\eta^{-1} \hat Z\|_{tr}$. [sent-122, score-0.592]

41 Thus we have achieved a new formulation for multi-view subspace learning that respects conditional independence of the separate views (see discussion in Section 2) while allowing a globally solvable formulation. [sent-124, score-0.314]

42 4 Efficient Training Procedure This new formulation for multi-view subspace learning also allows for an efficient algorithmic approach. [sent-126, score-0.158]

43 To do so we introduce a further transformation $Q = E_\eta^{-1} \hat Z$ in (7), which leads to an equivalent but computationally more convenient formulation of (3): (3) $= \max_{0 \le \eta \le 1} \min_Q L(E_\eta Q; Z) + \alpha \|Q\|_{tr}$. (8) [sent-128, score-0.336]

44 Denote $g(\eta) := \min_Q L(E_\eta Q; Z) + \alpha \|Q\|_{tr}$. [sent-129, score-0.19]

45 The training procedure then consists of two stages: first, solve (8) to recover $\hat\eta$ and $\hat Q$, which allows $\hat Z = E_{\hat\eta} \hat Q$ to be computed; then, recover the optimal factors $\hat H$ and $\hat C$. [sent-131, score-0.132]
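The inner minimization over $Q$ in the first stage is a trace-norm-regularized problem. As a hedged illustration (assuming a squared loss and $E_\eta = I$, which is *not* the paper's general setting), its proximal building block is singular-value soft-thresholding:

```python
import numpy as np

def svt(M, alpha):
    # Singular-value soft-thresholding: the proximal operator of alpha*||.||_tr,
    # i.e. the exact minimizer of 0.5*||Q - M||_F^2 + alpha*||Q||_tr.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - alpha, 0.0)) @ Vt

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 4))   # illustrative data, not the paper's
alpha = 1.0
Q = svt(M, alpha)

def objective(Q):
    return (0.5 * np.linalg.norm(Q - M, 'fro')**2
            + alpha * np.linalg.svd(Q, compute_uv=False).sum())

# Q should be no worse than any other candidate for this objective.
print(objective(Q))
```

Soft-thresholding the singular values is what drives the rank of the solution down as $\alpha$ grows, matching the runtime observations about large $\alpha$ reported later.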

46 Since we already have $|||\hat Z|||_* = \|E_{\hat\eta}^{-1} \hat Z\|_{tr}$ from the first stage, we can rescale the problem so that $|||\hat Z|||_* = 1$ without loss of generality. [sent-141, score-0.19]

47 So now $\hat Z$ lies in the convex hull of the set $\mathcal{G} := \{c h' : c \in \mathcal{C},\, \|h\|_2 \le 1\}$, and any expansion of $\hat Z$ as a convex combination of the elements in $\mathcal{G}$ is a valid recovery. [sent-143, score-0.148]

48 In particular, the recovery just needs to solve $\min_K f(K)$, where $f(K) := \|\hat Z - K\|^2$. [sent-145, score-0.13]

49 Recall $\hat Z$ is penalized by the dual of the norm in (6). [sent-156, score-0.113]

50 Given $\hat Z$, it is not hard to recover its dual variable $\Gamma$ by exploiting the dual norm relationship: $\Gamma = \mathrm{argmax}_{\Gamma : |||\Gamma||| \le 1}\, \mathrm{tr}(\Gamma' \hat Z)$. [sent-157, score-0.22]

51 The accelerated boosting generates $c_t$ in the weak learning step, giving the recovery $C = [c_1, \ldots]$. [sent-163, score-0.198]

52 To solve (12), we used a variant of the boosting algorithm [22] when α is large, due to its effectiveness when the solution has low rank. [sent-177, score-0.088]

53 Once an optimal $\hat Z$ is achieved, the corresponding $C$ and $H$ can be recovered by an SVD: for $\hat Z = U \Sigma V'$, set $C = (\beta^2 + \gamma^2)^{1/2}\, U$ and $H = (\beta^2 + \gamma^2)^{-1/2}\, \Sigma V'$, which satisfies $CH = \hat Z$ and $\|H\|_{2,1} = \|\hat Z\|_{tr}$, and so is an optimal solution to (11). [sent-180, score-0.304]
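This SVD recovery can be checked numerically. A minimal sketch with illustrative sizes; $\beta$ and $\gamma$ are chosen here so that $\beta^2 + \gamma^2 = 1$, making the scaling factor unity so both claimed identities are easy to verify:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, gamma = 0.6, 0.8               # illustrative values with beta^2 + gamma^2 = 1
Zhat = rng.standard_normal((7, 5))

# Recovery recipe quoted above: for Zhat = U diag(s) V',
# set C = (beta^2+gamma^2)^{1/2} U and H = (beta^2+gamma^2)^{-1/2} diag(s) V'.
U, s, Vt = np.linalg.svd(Zhat, full_matrices=False)
scale = np.sqrt(beta**2 + gamma**2)
C = scale * U
H = (np.diag(s) @ Vt) / scale

print(np.allclose(C @ H, Zhat))                    # the factorization CH = Zhat holds
print(np.linalg.norm(H, axis=1).sum(), s.sum())    # ||H||_{2,1} vs ||Zhat||_tr
```

Because the rows of $\Sigma V'$ have norms equal to the singular values, the (2,1)-norm of $H$ matches the trace norm of $\hat Z$ here.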

54 To construct the dataset, we set the x-view to a fixed lighting (+000E+00) and the y-view to a different fixed lighting (+000E+20). [sent-190, score-0.11]

55 We obtain a pair of views by randomly drawing a subject and a pose (under the two fixed lightings). [sent-191, score-0.091]

56 The underlying assumption is that each lighting has its own set of bases (A and B) and each (person, pose) pair has the same latent representation for the two lighting conditions. [sent-192, score-0.195]

57 The goal is to enable appropriate reconstruction of a noisy image using other views. [sent-198, score-0.104]

58 This result suggests that more local minima occur in the higher rank case (a large α increases regularization and decreases the rank of the solution). [sent-205, score-0.138]

59 We will see that the lower optimization quality of LSL and the fact that SSL optimizes a less constrained objective both lead to significantly worse denoising performance. [sent-207, score-0.113]

60 Again, the runtime of LSL is significantly worse for smaller α, as much as 4000x slower; as α increases, the runtimes become similar. [sent-210, score-0.108]

61 In (a), we used tL = 100 pairs of views for training A and B and tested on 100 hold-out pairs, with 30 repeated random draws of training and test data. [sent-225, score-0.091]

62 Interestingly, the recovery runtime seems independent of dataset size, and is instead likely proportional to the rank of the data. [sent-234, score-0.175]

63 Comparing denoising quality: Next we compare the denoising capabilities of the algorithms on synthetic and face image datasets. [sent-237, score-0.242]

64 This approach requires first recovering the latent representation, $\hat H_{te} = (\tilde h_1, \ldots)$. [sent-242, score-0.092]

65 We present the recovery approach on synthetic data and the direct reconstruction approach on the face dataset. [sent-249, score-0.189]

66 Using Recovered Models for Denoising: Figure 2 presents the signal-to-noise ratio for recovery on synthetic data. [sent-258, score-0.121]

67 SSL had larger reconstruction error on the clean x-view (10x higher), lower reconstruction error on the noisy y-view (3x lower) and a higher representation norm (3x higher). [sent-265, score-0.263]

68 Comparing synthesis of views: In image synthesis, the latent representation is computed from only one view: $\mathrm{argmin}_H\, L_1(AH, X_{te}) + \alpha \|H\|_{2,1}$. [sent-268, score-0.224]

69 Conclusion: We provided a convex reformulation of multi-view subspace learning that enables global learning, as opposed to previous local formulations. [sent-271, score-0.231]

70 Experimental results on synthetic data and image data confirm the effectiveness of our method, which consistently outperformed other approaches in denoising quality. [sent-273, score-0.12]

71 It should also be possible to extend our approach to more than two views and incorporate kernels. [sent-275, score-0.091]

72 Hence $\min_{C,H} \|\tilde Z - CH\|_F^2 = \min_C \|(I - CC^\dagger)\tilde Z\|_F^2 = \mathrm{tr}(\tilde Z \tilde Z') - \max_{\{C : C'C = I\}} \mathrm{tr}(C' \tilde Z \tilde Z' C)$. [sent-439, score-0.164]
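This chain of equalities is the classical reduced-rank/PCA identity, and it can be verified numerically via the SVD (Eckart-Young). A small sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal((6, 9))
k = 2

# Best rank-k approximation: truncate the SVD. The residual should equal
# tr(Z Z') minus the sum of the top-k eigenvalues of Z Z' (= squared
# top-k singular values of Z), matching the identity above.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
Zk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

residual = np.linalg.norm(Z - Zk, 'fro')**2
identity_rhs = np.trace(Z @ Z.T) - (s[:k]**2).sum()

print(residual, identity_rhs)
```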

73 B Proof for Lemma 3: First, observe that (3) $= \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_H L(CH; Z) + \alpha \|H\|_{2,1} = \min_{\hat Z} L(\hat Z; Z) + \alpha \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{\{H : CH = \hat Z\}} \|H\|_{2,1} = \min_{\hat Z} L(\hat Z; Z) + \alpha |||\hat Z|||_*$, where the last step follows from Proposition 2. [sent-441, score-0.348]

74 The first stage is to prove that the dual norm is characterized by $|||\Gamma||| = \min_{\rho \ge 0} \|D_\rho \Gamma\|_{sp}$. (16) [sent-446, score-0.228]

75 where the spectral norm $\|X\|_{sp} = \sigma_{max}(X)$ is the dual of the trace norm $\|X\|_{tr}$; recall that $|||\Gamma||| = \max_{c \in \mathcal{C},\, \|h\|_2 \le 1} c'\Gamma h = \max_{c \in \mathcal{C}} \|\Gamma' c\|_2 = \max_{\{c : \|c^X\|_2 = \beta,\, \|c^Y\|_2 = \gamma\}} \ldots$ [sent-448, score-0.531]

76 Furthermore, since the extreme points have rank at most one in this case [31], the rank constraint rank(Φ) = 1 can be dropped. [sent-450, score-0.116]

77 Therefore, we achieve the equivalent dual formulation (17) $= \min_{\lambda \ge 0,\, \nu \ge 0} \beta^2 \lambda + \gamma^2 \nu$. [sent-453, score-0.157]

78 Finally, for the second stage, we characterize the target norm by observing that $|||\hat Z|||_* = \max_{\Gamma : |||\Gamma||| \le 1} \mathrm{tr}(\Gamma' \hat Z) = \max_{\rho \ge 0} \max_{\Gamma : \|D_\rho \Gamma\|_{sp} \le 1} \mathrm{tr}(\Gamma' \hat Z) = \max_{\rho \ge 0} \max_{\tilde\Gamma : \|\tilde\Gamma\|_{sp} \le 1} \mathrm{tr}(\tilde\Gamma' D_\rho^{-1} \hat Z) = \max_{\rho \ge 0} \|D_\rho^{-1} \hat Z\|_{tr}$. (20) [sent-457, score-0.76]

79 C Proof for Theorem 6 and Details of Recovery: Once an optimal reconstruction $\hat Z$ is obtained, we need to recover the optimal factors $C$ and $H$ that satisfy $CH = \hat Z$, $\|H\|_{2,1} = |||\hat Z|||_*$, and $C_{:,i} \in \mathcal{C}$ for all $i$. (22) [sent-460, score-0.179]

80 Note that by Proposition 2 and Lemma 3, the recovery problem (22) can be re-expressed as $\min_{\{C,H : C_{:,i} \in \mathcal{C}\ \forall i,\ CH = \hat Z\}} \|H\|_{2,1} = \max_{\{\Gamma : |||\Gamma||| \le 1\}} \mathrm{tr}(\Gamma' \hat Z)$. (23) [sent-461, score-0.092]

81 Our strategy will be to first recover the optimal dual solution $\Gamma$ given $\hat Z$, then use $\Gamma$ to recover $H$ and $C$. [sent-462, score-0.167]

82 Then $\tilde\Gamma = UV'$ and $\Gamma = D_\rho^{-1} UV'$ automatically satisfy $|||\Gamma||| = 1$ while achieving the optimal trace in (23), because $\mathrm{tr}(\tilde\Gamma' D_\rho^{-1} \hat Z) = \mathrm{tr}(\Sigma) = \|D_\rho^{-1} \hat Z\|_{tr}$. [sent-465, score-0.253]

83 Given such an optimal $\Gamma$, we are then able to characterize an optimal solution $(C, H)$. [sent-466, score-0.093]

84 For a dual optimal $\Gamma$, $(C, H)$ solves recovery problem (22) if and only if $C_{:,i} \in \mathcal{C}(\Gamma)$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}' \Gamma$, such that $CH = \hat Z$. [sent-469, score-0.167]

85 Therefore, given $\Gamma$, the recovery problem (22) has been reduced to finding a vector $\mu$ and matrix $C$ such that $\mu \ge 0$, $C_{:,i} \in \mathcal{C}(\Gamma)$ for all $i$, and $C\, \mathrm{diag}(\mu)\, C' \Gamma = \hat Z$. [sent-475, score-0.092]

86 Denote the range of $C\, \mathrm{diag}(\mu)\, C'$ by the set $\mathcal{S} := \{\sum_i \mu_i c_i c_i' : c_i \in \mathcal{C}(\Gamma),\ \mu \ge 0\}$. [sent-477, score-0.111]

87 Note that $\mathcal{S}$ is the conic hull of (possibly infinitely many) rank-one matrices $\{c c' : c \in \mathcal{C}(\Gamma)\}$. [sent-478, score-0.088]

88 To find these weights, we use a totally corrective “boosting” procedure [22] that is guaranteed to converge to a feasible solution. [sent-481, score-0.105]

89 The idea behind totally corrective boosting [22] is to find a minimizer of f (hence optimal solution of (22)) incrementally. [sent-485, score-0.229]

90 Weak learning step: find $c_t \in \mathrm{argmin}_{c \in \mathcal{C}(\Gamma)} \langle \nabla f(K_{t-1}),\, c c' \rangle = \mathrm{argmax}_{c \in \mathcal{C}(\Gamma)}\, c' Q c$, (26) where $Q = -\nabla f(K_{t-1}) = 2(\hat Z - K_{t-1} \Gamma) \Gamma'$. [sent-487, score-0.091]

91 “Totally corrective” step: $\mu^{(t)} = \mathrm{argmin}_{\mu : \mu_i \ge 0} f\big(\sum_{i=1}^t \mu_i c_i c_i'\big)$, (27) then set $K_t = \sum_{i=1}^t \mu_i^{(t)} c_i c_i'$. [sent-489, score-0.169]

92 Three key facts can be established about this boosting procedure: (i) each weak learning step can be solved efficiently; (ii) each totally corrective weight update can be solved efficiently; and (iii) $f(K_t) \to 0$, hence a feasible solution can be arbitrarily well approximated. [sent-490, score-0.231]
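The two steps above can be sketched generically. The following toy version fits a PSD target by rank-one atoms: the weak step takes the leading eigenvector of the residual (a generic stand-in for the paper's oracle (26)), and the totally corrective step re-optimizes all weights $\mu \ge 0$ by projected gradient (a simple stand-in for the solver in [22]); every detail here is an assumption for illustration, not the paper's implementation:

```python
import numpy as np

def tc_boost(T, iters=8):
    # Totally corrective boosting sketch: greedily add rank-one atoms c c'
    # and then re-fit ALL accumulated weights mu >= 0 at every iteration.
    atoms, mu = [], np.zeros(0)
    K = np.zeros_like(T)
    for _ in range(iters):
        # Weak learning step: leading eigenvector of the residual T - K
        # (maximizes c' (-grad f) c for the squared-error objective).
        _, V = np.linalg.eigh(T - K)
        atoms.append(V[:, -1])
        A = np.stack([np.outer(c, c).ravel() for c in atoms], axis=1)
        # Totally corrective step: projected gradient on the nonnegative
        # least-squares problem min_{mu >= 0} ||A mu - vec(T)||^2.
        mu = np.append(mu, 0.0)
        step = 1.0 / max(np.linalg.norm(A.T @ A, 2), 1e-12)
        for _ in range(300):
            mu = np.maximum(mu - step * (A.T @ (A @ mu - T.ravel())), 0.0)
        K = (A @ mu).reshape(T.shape)
    return K

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 2))
T = B @ B.T                  # PSD target that truly is a sum of rank-one atoms
K = tc_boost(T)
print(np.linalg.norm(T - K, 'fro'))  # residual shrinks toward 0
```

Because the weights are re-fit jointly at every round (rather than frozen, as in plain greedy boosting), the fit improves monotonically, which is what fact (iii) relies on.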

93 Simplification of $\mathcal{C}(\Gamma)$: Since $\mathcal{C}(\Gamma)$ is the set of optimal solutions to $\max_{c \in \mathcal{C}} \|\Gamma' c\|$, (28) our idea is to first obtain an optimal solution to its dual problem, and then use it to recover the optimal $c$ via the KKT conditions. [sent-495, score-0.284]

94 Once we obtain the optimal ρ in (21) by solving (8), it is straightforward to backtrack and recover the optimal λ and ν in (18). [sent-497, score-0.12]

95 Then by the KKT conditions [32, §28], $c$ is an optimal solution to (28) if and only if $\|c^Y\|_2 = \gamma$, $\|c^X\|_2 = \beta$, $\langle R,\, c c' \rangle = 0$, (29) where $R = \lambda I^X + \nu I^Y - \Gamma \Gamma' \succeq 0$. [sent-498, score-0.106]

96 Solving the weak oracle problem (26): The weak oracle needs to solve $\max_{c \in \mathcal{C}(\Gamma)} c' Q c$, where $Q = -\nabla f(K_{t-1}) = 2(\hat Z - K_{t-1} \Gamma) \Gamma'$. [sent-521, score-0.168]

97 Using the same technique as in the proof of Lemma 3, we have $\max_{v : v'v = 1,\, v'\Sigma v = 0} v' T v$ (let $H = v v'$) $= \max_{H \succeq 0,\, \mathrm{tr}(H) = 1,\, \mathrm{tr}(\Sigma H) = 0} \mathrm{tr}(T H)$ (Lagrange dual) $= \min_{\tau, \omega : \tau \Sigma + \omega I - T \succeq 0} \omega = \min_{\tau \in \mathbb{R}} \lambda_{max}(T - \tau \Sigma)$, where $\lambda_{max}$ stands for the maximum eigenvalue. [sent-523, score-0.212]

98 Since $\lambda_{max}$ is a convex function over real symmetric matrices, the last line-search problem is convex in $\tau$, hence can be solved globally and efficiently. [sent-524, score-0.142]
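This line search can be sketched directly: evaluate $\phi(\tau) = \lambda_{max}(T - \tau\Sigma)$ on a grid and take the minimum. The matrices below are random symmetric stand-ins for illustration, not the paper's $T$ and $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4)); T = (A + A.T) / 2
B = rng.standard_normal((4, 4)); Sigma = (B + B.T) / 2

def phi(tau):
    # lambda_max of a symmetric matrix is convex in its (linear) argument,
    # so phi is convex in tau and can be minimized by a 1-D search.
    return np.linalg.eigvalsh(T - tau * Sigma)[-1]

# Crude grid search; a golden-section or bisection scheme would exploit
# convexity more efficiently.
taus = np.linspace(-5, 5, 2001)
vals = np.array([phi(t) for t in taus])
tau_star = taus[vals.argmin()]
print(tau_star, vals.min())
```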

99 Given the optimal $\tau$ and the optimal objective value $\omega$, the optimal $v$ can be recovered using a trick similar to that in Appendix C. [sent-525, score-0.129]

100 On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. [sent-551, score-0.094]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('msl', 0.588), ('lsl', 0.407), ('ssl', 0.263), ('yte', 0.241), ('xte', 0.196), ('tr', 0.19), ('ch', 0.18), ('hi', 0.169), ('subspace', 0.118), ('views', 0.091), ('kt', 0.086), ('cca', 0.077), ('hte', 0.075), ('recovery', 0.072), ('denoising', 0.07), ('boosting', 0.067), ('tl', 0.063), ('runtimes', 0.063), ('corrective', 0.061), ('reconstruction', 0.059), ('dual', 0.059), ('convex', 0.059), ('min', 0.058), ('rank', 0.058), ('sp', 0.057), ('lighting', 0.055), ('hj', 0.055), ('norm', 0.054), ('xx', 0.054), ('latent', 0.051), ('cc', 0.049), ('cy', 0.049), ('recover', 0.048), ('max', 0.048), ('xy', 0.046), ('runtime', 0.045), ('cx', 0.045), ('totally', 0.044), ('recovering', 0.041), ('independence', 0.041), ('formulation', 0.04), ('weak', 0.038), ('ci', 0.037), ('proposition', 0.036), ('optimal', 0.036), ('snr', 0.035), ('concave', 0.035), ('clean', 0.034), ('reformulation', 0.034), ('representation', 0.034), ('im', 0.032), ('lemma', 0.03), ('ah', 0.03), ('hull', 0.03), ('face', 0.03), ('diag', 0.03), ('adal', 0.03), ('ahj', 0.03), ('ahte', 0.03), ('bhj', 0.03), ('idiosyncrasies', 0.03), ('xtr', 0.03), ('ytr', 0.03), ('yx', 0.03), ('regularizer', 0.029), ('synthetic', 0.028), ('losses', 0.028), ('conditionally', 0.028), ('yj', 0.028), ('noise', 0.027), ('alternating', 0.027), ('trace', 0.027), ('xinhua', 0.027), ('concatenated', 0.026), ('canonical', 0.026), ('synthesis', 0.026), ('appendix', 0.025), ('te', 0.025), ('qc', 0.025), ('globally', 0.024), ('kkt', 0.023), ('noisy', 0.023), ('image', 0.022), ('quality', 0.022), ('solver', 0.022), ('goldfarb', 0.022), ('oracle', 0.022), ('minima', 0.022), ('ratio', 0.021), ('null', 0.021), ('recovered', 0.021), ('argmin', 0.021), ('constrained', 0.021), ('bh', 0.021), ('correlation', 0.021), ('shared', 0.021), ('solution', 0.021), ('ct', 0.021), ('global', 0.02), ('matrix', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 86 nips-2012-Convex Multi-view Subspace Learning

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1

2 0.12465386 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1

3 0.10065141 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into lowdimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that proposed methods are very effective in subspace clustering and identifying outliers. 1

4 0.093989976 208 nips-2012-Matrix reconstruction with the local max norm

Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov

Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1

5 0.087661222 247 nips-2012-Nonparametric Reduced Rank Regression

Author: Rina Foygel, Michael Horrell, Mathias Drton, John D. Lafferty

Abstract: We propose an approach to multivariate nonparametric regression that generalizes reduced rank regression for linear models. An additive model is estimated for each dimension of a q-dimensional response, with a shared p-dimensional predictor variable. To control the complexity of the model, we employ a functional form of the Ky-Fan or nuclear norm, resulting in a set of function estimates that have low rank. Backfitting algorithms are derived and justified using a nonparametric form of the nuclear norm subdifferential. Oracle inequalities on excess risk are derived that exhibit the scaling behavior of the procedure in the high dimensional setting. The methods are illustrated on gene expression data. 1

6 0.083511926 187 nips-2012-Learning curves for multi-task Gaussian process regression

7 0.081045091 168 nips-2012-Kernel Latent SVM for Visual Recognition

8 0.080937035 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

9 0.073594071 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

10 0.07304088 16 nips-2012-A Polynomial-time Form of Robust Regression

11 0.072086751 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

12 0.070440397 319 nips-2012-Sparse Prediction with the $k$-Support Norm

13 0.06966152 185 nips-2012-Learning about Canonical Views from Internet Image Collections

14 0.069582604 254 nips-2012-On the Sample Complexity of Robust PCA

15 0.068089508 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

16 0.066817701 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

17 0.063363403 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

18 0.062992744 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

19 0.06166077 227 nips-2012-Multiclass Learning with Simplex Coding

20 0.06157805 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, 0.053), (2, 0.042), (3, -0.08), (4, 0.066), (5, 0.046), (6, -0.016), (7, -0.0), (8, 0.008), (9, -0.047), (10, 0.003), (11, 0.018), (12, -0.03), (13, 0.02), (14, 0.085), (15, 0.073), (16, 0.041), (17, 0.009), (18, 0.058), (19, -0.059), (20, -0.003), (21, 0.063), (22, -0.066), (23, -0.021), (24, -0.077), (25, 0.001), (26, -0.013), (27, -0.055), (28, -0.014), (29, 0.082), (30, -0.018), (31, 0.043), (32, 0.037), (33, -0.008), (34, -0.044), (35, 0.085), (36, -0.062), (37, -0.04), (38, 0.046), (39, -0.078), (40, -0.025), (41, -0.046), (42, 0.102), (43, 0.015), (44, 0.112), (45, -0.014), (46, 0.062), (47, -0.011), (48, -0.029), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92592752 86 nips-2012-Convex Multi-view Subspace Learning

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1

2 0.74089688 208 nips-2012-Matrix reconstruction with the local max norm

Author: Rina Foygel, Nathan Srebro, Ruslan Salakhutdinov

Abstract: We introduce a new family of matrix norms, the “local max” norms, generalizing existing methods such as the max norm, the trace norm (nuclear norm), and the weighted or smoothed weighted trace norms, which have been extensively used in the literature as regularizers for matrix reconstruction problems. We show that this new family can be used to interpolate between the (weighted or unweighted) trace norm and the more conservative max norm. We test this interpolation on simulated data and on the large-scale Netflix and MovieLens ratings data, and find improved accuracy relative to the existing matrix norms. We also provide theoretical results showing learning guarantees for some of the new norms. 1

3 0.73725408 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1

4 0.68803602 319 nips-2012-Sparse Prediction with the $k$-Support Norm

Author: Andreas Argyriou, Rina Foygel, Nathan Srebro

Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an 2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1

5 0.64697826 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses a tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.
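The CUR idea can be sketched in a few lines. The sampling scheme below (squared-norm column/row sampling, not the paper's exact algorithm) picks columns C and rows R, then sets U = C⁺AR⁺; when the sampled columns and rows span the column and row spaces of a low-rank A, the product CUR recovers A exactly.

```python
import numpy as np

def randomized_cur(A, c, r, rng):
    """Simple norm-based randomized CUR sketch (illustrative, not the
    paper's algorithm): sample c columns and r rows with probability
    proportional to squared Euclidean norms, then set U = C^+ A R^+."""
    col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
    row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(0)
# Rank-3 matrix: with enough sampled columns/rows, CUR is (near) exact.
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))
C, U, R = randomized_cur(A, c=6, r=6, rng=rng)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
assert err < 1e-6
```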

6 0.64023542 247 nips-2012-Nonparametric Reduced Rank Regression

7 0.61684722 125 nips-2012-Factoring nonnegative matrices with linear programs

8 0.59123695 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

9 0.58640862 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

10 0.57731539 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

11 0.56440288 221 nips-2012-Multi-Stage Multi-Task Feature Learning

12 0.56090271 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

13 0.56004667 225 nips-2012-Multi-task Vector Field Learning

14 0.55428028 254 nips-2012-On the Sample Complexity of Robust PCA

15 0.52481347 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

16 0.52025807 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

17 0.51667053 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

18 0.50977802 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

19 0.50962305 187 nips-2012-Learning curves for multi-task Gaussian process regression

20 0.50899005 176 nips-2012-Learning Image Descriptors with the Boosting-Trick


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.048), (21, 0.02), (38, 0.222), (42, 0.034), (54, 0.017), (55, 0.026), (58, 0.191), (74, 0.046), (76, 0.126), (80, 0.11), (92, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88503993 86 nips-2012-Convex Multi-view Subspace Learning

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results.
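To illustrate the setting (not the paper's convex algorithm), here is a toy baseline: two views generated from a shared low-dimensional latent representation, recovered by a rank-k SVD of the stacked data matrix. The view dimensions and data are invented for the sketch.

```python
import numpy as np

# Two views generated from a shared latent representation H; a rank-k SVD
# of the stacked matrix [X1; X2] recovers the shared subspace exactly here.
rng = np.random.default_rng(0)
k, n = 3, 100                           # latent dimension, number of samples
H = rng.standard_normal((k, n))         # shared latent representation
X1 = rng.standard_normal((8, k)) @ H    # view 1 (e.g. image features)
X2 = rng.standard_normal((6, k)) @ H    # view 2 (e.g. text features)

Z = np.vstack([X1, X2])
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
Z_hat = U[:, :k] * s[:k] @ Vt[:k]       # rank-k reconstruction
rel_err = np.linalg.norm(Z - Z_hat) / np.linalg.norm(Z)
# The stacked matrix has rank k, so the reconstruction is (near) exact.
assert rel_err < 1e-8
```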

2 0.82433975 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-Jacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent of the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper.
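The simplex code assigns each of T classes a maximally separated unit vector with pairwise inner product -1/(T-1). One standard construction (centering and rescaling the standard basis; assumed here as an illustration of the coding, not taken from the paper's text):

```python
import numpy as np

def simplex_code(T):
    """T maximally separated unit vectors with pairwise inner product
    -1/(T-1) (simplex coding). Built by centering and rescaling the
    standard basis; the vectors lie in the (T-1)-dimensional subspace
    orthogonal to the all-ones vector."""
    E = np.eye(T)
    return np.sqrt(T / (T - 1)) * (E - 1.0 / T)  # row i codes class i

C = simplex_code(4)
G = C @ C.T                                  # Gram matrix of code vectors
assert np.allclose(np.diag(G), 1.0)          # unit norm
off = G[~np.eye(4, dtype=bool)]
assert np.allclose(off, -1.0 / 3)            # pairwise inner product -1/(T-1)
```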

3 0.82277429 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

Author: Elad Hazan, Zohar Karnin

Abstract: We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known.

4 0.81987423 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting.

5 0.81925434 187 nips-2012-Learning curves for multi-task Gaussian process regression

Author: Peter Sollich, Simon Ashton

Abstract: We study the average case performance of multi-task Gaussian process (GP) regression as captured in the learning curve, i.e. the average Bayes error for a chosen task versus the total number of examples n for all tasks. For GP covariances that are the product of an input-dependent covariance function and a free-form inter-task covariance matrix, we show that accurate approximations for the learning curve can be obtained for an arbitrary number of tasks T. We use these to study the asymptotic learning behaviour for large n. Surprisingly, multi-task learning can be asymptotically essentially useless, in the sense that examples from other tasks help only when the degree of inter-task correlation, ρ, is near its maximal value ρ = 1. This effect is most extreme for learning of smooth target functions as described by e.g. squared exponential kernels. We also demonstrate that when learning many tasks, the learning curves separate into an initial phase, where the Bayes error on each task is reduced down to a plateau value by “collective learning” even though most tasks have not seen examples, and a final decay that occurs once the number of examples is proportional to the number of tasks.

1 Introduction and motivation

Gaussian processes (GPs) [1] have been popular in the NIPS community for a number of years now, as one of the key non-parametric Bayesian inference approaches. In the simplest case one can use a GP prior when learning a function from data. In line with growing interest in multi-task or transfer learning, where relatedness between tasks is used to aid learning of the individual tasks (see e.g. [2, 3]), GPs have increasingly also been used in a multi-task setting. A number of different choices of covariance functions have been proposed [4, 5, 6, 7, 8]. These differ e.g. in assumptions on whether the functions to be learned are related to a smaller number of latent functions or have free-form inter-task correlations; for a recent review see [9].
Given this interest in multi-task GPs, one would like to quantify the benefits that they bring compared to single-task learning. PAC-style bounds for classification [2, 3, 10] in more general multi-task scenarios exist, but there has been little work on average case analysis. The basic question in this setting is: how does the Bayes error on a given task depend on the number of training examples for all tasks, when averaged over all data sets of the given size. For a single regression task, this learning curve has become relatively well understood since the late 1990s, with a number of bounds and approximations available [11, 12, 13, 14, 15, 16, 17, 18, 19] as well as some exact predictions [20]. Already two-task GP regression is much more difficult to analyse, and progress was made only very recently at NIPS 2009 [21], where upper and lower bounds for learning curves were derived. The tightest of these bounds, however, either required evaluation by Monte Carlo sampling, or assumed knowledge of the corresponding single-task learning curves. Here our aim is to obtain accurate learning curve approximations that apply to an arbitrary number T of tasks, and that can be evaluated explicitly without recourse to sampling. We begin (Sec. 2) by expressing the Bayes error for any single task in a multi-task GP regression problem in a convenient feature space form, where individual training examples enter additively. This requires the introduction of a non-trivial tensor structure combining feature space components and tasks. Considering the change in error when adding an example for some task leads to partial differential equations linking the Bayes errors for all tasks. Solving these using the method of characteristics then gives, as our primary result, the desired learning curve approximation (Sec. 3). In Sec. 4 we discuss some of its predictions.
The approximation correctly delineates the limits of pure transfer learning, when all examples are from tasks other than the one of interest. Next we compare with numerical simulations for some two-task scenarios, finding good qualitative agreement. These results also highlight a surprising feature, namely that asymptotically the relatedness between tasks can become much less useful. We analyse this effect in some detail, showing that it is most extreme for learning of smooth functions. Finally we discuss the case of many tasks, where there is an unexpected separation of the learning curves into a fast initial error decay arising from “collective learning”, and a much slower final part where tasks are learned almost independently.

2 GP regression and Bayes error

We consider GP regression for T functions f_τ(x), τ = 1, 2, …, T. These functions have to be learned from n training examples (x_ℓ, τ_ℓ, y_ℓ), ℓ = 1, …, n. Here x_ℓ is the training input, τ_ℓ ∈ {1, …, T} denotes which task the example relates to, and y_ℓ is the corresponding training output. We assume that the latter is given by the target function value f_{τ_ℓ}(x_ℓ) corrupted by i.i.d. additive Gaussian noise with zero mean and variance σ_{τ_ℓ}^2. This setup allows the noise level σ_τ^2 to depend on the task. In GP regression the prior over the functions f_τ(x) is a Gaussian process. This means that for any set of inputs x_ℓ and task labels τ_ℓ, the function values {f_{τ_ℓ}(x_ℓ)} have a joint Gaussian distribution. As is common we assume this to have zero mean, so the multi-task GP is fully specified by the covariances ⟨f_τ(x) f_{τ′}(x′)⟩ = C(τ, x, τ′, x′). For this covariance we take the flexible form from [5], ⟨f_τ(x) f_{τ′}(x′)⟩ = D_{ττ′} C(x, x′). Here C(x, x′) determines the covariance between function values at different input points, encoding “spatial” behaviour such as smoothness and the lengthscale(s) over which the functions vary, while the matrix D is a free-form inter-task covariance matrix.
One of the attractions of GPs for regression is that, even though they are non-parametric models with (in general) an infinite number of degrees of freedom, predictions can be made in closed form, see e.g. [1]. For a test point x for task τ, one would predict as output the mean of f_τ(x) over the (Gaussian) posterior, which is y^T K^{-1} k_τ(x). Here K is the n × n Gram matrix with entries K_{ℓm} = D_{τ_ℓ τ_m} C(x_ℓ, x_m) + σ_{τ_ℓ}^2 δ_{ℓm}, while k_τ(x) is a vector with the n entries k_{τ,ℓ} = D_{τ_ℓ τ} C(x_ℓ, x). The error bar would be taken as the square root of the posterior variance of f_τ(x), which is

V_τ(x) = D_{ττ} C(x, x) − k_τ^T(x) K^{-1} k_τ(x)    (1)

The learning curve for task τ is defined as the mean-squared prediction error, averaged over the location of test input x and over all data sets with a specified number of examples for each task, say n_1 for task 1 and so on. As is standard in learning curve analysis we consider a matched scenario where the training outputs y_ℓ are generated from the same prior and noise model that we use for inference. In this case the mean-squared prediction error ε̂_τ is the Bayes error, and is given by the average posterior variance [1], i.e. ε̂_τ = ⟨V_τ(x)⟩_x. To obtain the learning curve this is averaged over the location of the training inputs x_ℓ: ε_τ = ⟨ε̂_τ⟩. This average presents the main challenge for learning curve prediction because the training inputs feature in a highly nonlinear way in V_τ(x). Note that the training outputs, on the other hand, do not appear in the posterior variance V_τ(x) and so do not need to be averaged over. We now want to write the Bayes error ε̂_τ in a form convenient for performing, at least approximately, the averages required for the learning curve. Assume that all training inputs x_ℓ, and also the test input x, are drawn from the same distribution P(x). One can decompose the input-dependent part of the covariance function into eigenfunctions relative to P(x), according to C(x, x′) = Σ_i λ_i φ_i(x) φ_i(x′).
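The closed-form prediction and posterior variance (1) above can be written out directly in NumPy. The kernel, task covariance and data below are toy assumptions chosen only to exercise the formulas:

```python
import numpy as np

# Multi-task GP prediction for <f_t(x) f_t'(x')> = D[t,t'] C(x,x'):
# mean y^T K^{-1} k_t(x) and posterior variance as in eq. (1).
rng = np.random.default_rng(1)

def C(x, xp, l=0.3):                     # SE input covariance, lengthscale l
    return np.exp(-(x - xp) ** 2 / (2 * l ** 2))

D = np.array([[1.0, 0.8], [0.8, 1.0]])   # inter-task covariance
sigma2 = np.array([0.05, 0.05])          # per-task noise variances
X = rng.uniform(0, 1, size=8)            # training inputs
tasks = rng.integers(0, 2, size=8)       # task label of each example
y = rng.standard_normal(8)               # training outputs (placeholder)

K = D[np.ix_(tasks, tasks)] * C(X[:, None], X[None, :]) \
    + np.diag(sigma2[tasks])             # K_lm = D C + noise on the diagonal
x_star, t_star = 0.5, 0
k = D[tasks, t_star] * C(X, x_star)      # k_{t,l}(x) = D_{t_l t} C(x_l, x)
mean = y @ np.linalg.solve(K, k)
var = D[t_star, t_star] * C(x_star, x_star) - k @ np.linalg.solve(K, k)
assert var > 0                           # noisy observations leave variance > 0
```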
The eigenfunctions are defined by the condition ⟨C(x, x′) φ_i(x′)⟩_{x′} = λ_i φ_i(x) and can be chosen to be orthonormal with respect to P(x), ⟨φ_i(x) φ_j(x)⟩_x = δ_{ij}. The sum over i here is in general infinite (unless the covariance function is degenerate, as e.g. for the dot product kernel C(x, x′) = x · x′). To make the algebra below as simple as possible, we let the eigenvalues λ_i be arranged in decreasing order and truncate the sum to the finite range i = 1, …, M; M is then some large effective feature space dimension and can be taken to infinity at the end. In terms of the above eigenfunction decomposition, the Gram matrix has elements

K_{ℓm} = D_{τ_ℓ τ_m} Σ_i λ_i φ_i(x_ℓ) φ_i(x_m) + σ_{τ_ℓ}^2 δ_{ℓm} = Σ_{i,τ,j,τ′} Ψ_{ℓ,iτ} λ_i δ_{ij} D_{ττ′} Ψ_{m,jτ′} + σ_{τ_ℓ}^2 δ_{ℓm}

or in matrix form K = Ψ L Ψ^T + Σ, where Σ is the diagonal matrix of the noise variances and

Ψ_{ℓ,iτ} = δ_{τ_ℓ,τ} φ_i(x_ℓ),   L_{iτ,jτ′} = λ_i δ_{ij} D_{ττ′}    (2)

Here Ψ has its second index ranging over M (number of kernel eigenvalues) times T (number of tasks) values; L is a square matrix of this size. In Kronecker (tensor) product notation, L = D ⊗ Λ if we define Λ as the diagonal matrix with entries λ_i δ_{ij}. The Kronecker product is convenient for the simplifications below; we will use that for generic square matrices, (A ⊗ B)(A′ ⊗ B′) = (AA′) ⊗ (BB′), (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}, and tr(A ⊗ B) = (tr A)(tr B). In thinking about the mathematical expressions, it is often easier to picture Kronecker products over feature spaces and tasks as block matrices. For example, L can then be viewed as consisting of T × T blocks, each of which is proportional to Λ. To calculate the Bayes error, we need to average the posterior variance V_τ(x) over the test input x. The first term in (1) then becomes ⟨D_{ττ} C(x, x)⟩_x = D_{ττ} tr Λ.
In the second one, we need to average

⟨k_{τ,ℓ}(x) k_{τ,m}(x)⟩_x = Σ_{ij} D_{τ_ℓ τ} λ_i λ_j φ_i(x_ℓ) ⟨φ_i(x) φ_j(x)⟩_x φ_j(x_m) D_{τ_m τ} = Σ_{i,τ′,j,τ′′} Ψ_{ℓ,iτ′} D_{τ′τ} λ_i λ_j δ_{ij} D_{ττ′′} Ψ_{m,jτ′′}

In matrix form this is ⟨k_τ(x) k_τ^T(x)⟩_x = Ψ[(D e_τ e_τ^T D) ⊗ Λ^2] Ψ^T = Ψ M_τ Ψ^T. Here the last equality defines M_τ, and we have denoted by e_τ the T-dimensional vector with τ-th component equal to one and all others zero. Multiplying by the inverse Gram matrix K^{-1} and taking the trace gives the average of the second term in (1); combining with the first gives the Bayes error on task τ

ε̂_τ = ⟨V_τ(x)⟩_x = D_{ττ} tr Λ − tr Ψ M_τ Ψ^T (Ψ L Ψ^T + Σ)^{-1}

Applying the Woodbury identity and re-arranging yields

ε̂_τ = D_{ττ} tr Λ − tr M_τ Ψ^T Σ^{-1} Ψ (I + L Ψ^T Σ^{-1} Ψ)^{-1} = D_{ττ} tr Λ − tr M_τ L^{-1} [I − (I + L Ψ^T Σ^{-1} Ψ)^{-1}]

But tr M_τ L^{-1} = tr {[(D e_τ e_τ^T D) ⊗ Λ^2][D ⊗ Λ]^{-1}} = tr {[D e_τ e_τ^T] ⊗ Λ} = e_τ^T D e_τ tr Λ = D_{ττ} tr Λ, so the first and second terms in the expression for ε̂_τ cancel and one has

ε̂_τ = tr M_τ L^{-1} (I + L Ψ^T Σ^{-1} Ψ)^{-1} = tr L^{-1} M_τ L^{-1} (L^{-1} + Ψ^T Σ^{-1} Ψ)^{-1} = tr [D ⊗ Λ]^{-1} [(D e_τ e_τ^T D) ⊗ Λ^2] [D ⊗ Λ]^{-1} (L^{-1} + Ψ^T Σ^{-1} Ψ)^{-1} = tr [e_τ e_τ^T ⊗ I] (L^{-1} + Ψ^T Σ^{-1} Ψ)^{-1}

The matrix in square brackets in the last line is just a projector P_τ onto task τ; thought of as a matrix of T × T blocks (each of size M × M), this has an identity matrix in the (τ, τ) block while all other blocks are zero. We can therefore write, finally, for the Bayes error on task τ,

ε̂_τ = tr P_τ (L^{-1} + Ψ^T Σ^{-1} Ψ)^{-1}    (3)

Because Σ is diagonal and given the definition (2) of Ψ, the matrix Ψ^T Σ^{-1} Ψ is a sum of contributions from the individual training examples ℓ = 1, …, n. This will be important for deriving the learning curve approximation below. We note in passing that, because Σ_τ P_τ = I, the sum of the Bayes errors on all tasks is Σ_τ ε̂_τ = tr (L^{-1} + Ψ^T Σ^{-1} Ψ)^{-1}, in close analogy to the corresponding expression for the single-task case [13].

3 Learning curve prediction

To obtain the learning curve ε_τ = ⟨ε̂_τ⟩, we now need to carry out the average
over the training inputs. To help with this, we can extend an approach for the single-task scenario [13] and define a response or resolvent matrix

G = (L^{-1} + Ψ^T Σ^{-1} Ψ + Σ_τ v_τ P_τ)^{-1}

with auxiliary parameters v_τ that will be set back to zero at the end. One can then ask how ⟨G⟩ and hence ε_τ = tr P_τ ⟨G⟩ changes with the number n_τ of training points for task τ. Adding an example at position x for task τ increases Ψ^T Σ^{-1} Ψ by σ_τ^{-2} φ_τ φ_τ^T, where φ_τ has elements (φ_τ)_{iτ′} = φ_i(x) δ_{ττ′}. Evaluating the difference (G^{-1} + σ_τ^{-2} φ_τ φ_τ^T)^{-1} − G with the help of the Woodbury identity and approximating it with a derivative gives

∂G/∂n_τ = − G φ_τ φ_τ^T G / (σ_τ^2 + φ_τ^T G φ_τ)

This needs to be averaged over the new example and all previous ones. If we approximate by averaging numerator and denominator separately we get

∂⟨G⟩/∂n_τ = [1/(σ_τ^2 + tr P_τ ⟨G⟩)] ∂⟨G⟩/∂v_τ    (4)

Here we have exploited for the average over x that the matrix ⟨φ_τ φ_τ^T⟩_x has (i, τ′), (j, τ′′)-entry ⟨φ_i(x) φ_j(x)⟩_x δ_{ττ′} δ_{ττ′′} = δ_{ij} δ_{ττ′} δ_{ττ′′}, hence simply ⟨φ_τ φ_τ^T⟩_x = P_τ. We have also used the auxiliary parameters to rewrite −⟨G P_τ G⟩ = ⟨∂G/∂v_τ⟩ = ∂⟨G⟩/∂v_τ. Finally, multiplying (4) by P_τ and taking the trace gives the set of quasi-linear partial differential equations

∂ε_τ/∂n_τ = [1/(σ_τ^2 + ε_τ)] ∂ε_τ/∂v_τ    (5)

The remaining task is now to find the functions ε_τ(n_1, …, n_T, v_1, …, v_T) by solving these differential equations. We initially attempted to do this by tracking the ε_τ as examples are added one task at a time, but the derivation is laborious already for T = 2 and becomes prohibitive beyond. Far more elegant is to adapt the method of characteristics to the present case. We need to find a 2T-dimensional surface in the 3T-dimensional space (n_1, …, n_T, v_1, …, v_T, ε_1, …, ε_T), which is specified by the T functions ε_τ(…). A small change (δn_1, …, δn_T, δv_1, …, δv_T, δε_1, …
, δε_T) in all 3T coordinates is tangential to this surface if it obeys the T constraints (one for each τ)

δε_τ = (∂ε_τ/∂n_τ) δn_τ + (∂ε_τ/∂v_τ) δv_τ

From (5), one sees that this condition is satisfied whenever δε_τ = 0 and δn_τ = −δv_τ (σ_τ^2 + ε_τ). It follows that all the characteristic curves given by ε_τ(t) = ε_{τ,0} = const., v_τ(t) = v_{τ,0}(1 − t), n_τ(t) = v_{τ,0}(σ_τ^2 + ε_{τ,0}) t for t ∈ [0, 1] are tangential to the solution surface for all t, so lie within this surface if the initial point at t = 0 does. Because at t = 0 there are no training examples (n_τ(0) = 0), this initial condition is satisfied by setting

ε_{τ,0} = tr P_τ (L^{-1} + Σ_{τ′} v_{τ′,0} P_{τ′})^{-1}

Because ε_τ(t) is constant along the characteristic curve, we get by equating the values at t = 0 and t = 1

ε_{τ,0} = tr P_τ (L^{-1} + Σ_{τ′} v_{τ′,0} P_{τ′})^{-1} = ε_τ({n_{τ′} = v_{τ′,0}(σ_{τ′}^2 + ε_{τ′,0})}, {v_{τ′} = 0})

Expressing v_{τ′,0} in terms of n_{τ′} then gives

ε_τ = tr P_τ (L^{-1} + Σ_{τ′} [n_{τ′}/(σ_{τ′}^2 + ε_{τ′})] P_{τ′})^{-1}    (6)

This is our main result: a closed set of T self-consistency equations for the average Bayes errors ε_τ. Given L as defined by the eigenvalues λ_i of the covariance function, the noise levels σ_τ^2 and the number of examples n_τ for each task, it is straightforward to solve these equations numerically to find the average Bayes error ε_τ for each task. The r.h.s. of (6) is easiest to evaluate if we view the matrix inside the brackets as consisting of M × M blocks of size T × T (which is the reverse of the picture we have used so far). The matrix is then block diagonal, with the blocks corresponding to different eigenvalues λ_i. Explicitly, because L^{-1} = D^{-1} ⊗ Λ^{-1}, one has

ε_τ = Σ_i [ (λ_i^{-1} D^{-1} + diag({n_{τ′}/(σ_{τ′}^2 + ε_{τ′})}))^{-1} ]_{ττ}    (7)

4 Results and discussion

We now consider the consequences of the approximate prediction (7) for multi-task learning curves in GP regression. A trivial special case is the one of uncorrelated tasks, where D is diagonal. Here one recovers T separate equations for the individual tasks as expected, which have the same form as for single-task learning [13].
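The self-consistency equations (7) can be solved by a simple fixed-point iteration. The sketch below (a minimal implementation under assumed toy eigenvalues and parameters, not the paper's code) starts from the prior variance and iterates the right-hand side of (7):

```python
import numpy as np

def learning_curve(lam, D, s2, nn, iters=200):
    """Fixed-point iteration for eq. (7): given kernel eigenvalues lam,
    task covariance D, noise levels s2 and example counts nn, iterate
    eps_t <- sum_i [(lam_i^{-1} D^{-1} + diag(n/(s2+eps)))^{-1}]_tt."""
    T = len(nn)
    Dinv = np.linalg.inv(D)
    eps = np.full(T, np.sum(lam))        # start from the prior variance
    for _ in range(iters):
        N = np.diag(nn / (s2 + eps))
        new = np.zeros(T)
        for l in lam:                    # block-diagonal structure: one
            new += np.diag(np.linalg.inv(Dinv / l + N))  # block per lam_i
        eps = new
    return eps

lam = 0.7 ** np.arange(30)               # toy exponentially decaying spectrum
D = np.array([[1.0, 0.9], [0.9, 1.0]])
s2 = np.array([0.05, 0.05])
eps0 = learning_curve(lam, D, s2, nn=np.array([0.0, 0.0]))
eps100 = learning_curve(lam, D, s2, nn=np.array([50.0, 50.0]))
assert np.allclose(eps0, np.sum(lam))    # no data: Bayes error = prior variance
assert np.all(eps100 < eps0)             # errors decrease with data
```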
4.1 Pure transfer learning

Consider now the case of pure transfer learning, where one is learning a task of interest (say τ = 1) purely from examples for other tasks. What is the lowest average Bayes error that can be obtained? Somewhat more generally, suppose we have no examples for the first T_0 tasks, n_1 = … = n_{T_0} = 0, but a large number of examples for the remaining T_1 = T − T_0 tasks. Denote E = D^{-1} and write this in block form as

E = [ E_00  E_01 ; E_01^T  E_11 ]

Now multiply by λ_i^{-1} and add in the lower right block a diagonal matrix N = diag({n_τ/(σ_τ^2 + ε_τ)}_{τ=T_0+1,…,T}). The matrix inverse in (7) then has top left block λ_i [E_00^{-1} + E_00^{-1} E_01 (λ_i N + E_11 − E_01^T E_00^{-1} E_01)^{-1} E_01^T E_00^{-1}]. As the number of examples for the last T_1 tasks grows, so do all (diagonal) elements of N. In the limit only the term λ_i E_00^{-1} survives, and summing over i gives ε_1 = tr Λ (E_00^{-1})_{11} = ⟨C(x, x)⟩ (E_00^{-1})_{11}. The Bayes error on task 1 cannot become lower than this, placing a limit on the benefits of pure transfer learning. That this prediction of the approximation (7) for such a lower limit is correct can also be checked directly: once the last T_1 tasks f_τ(x) (τ = T_0 + 1, …, T) have been learnt perfectly, the posterior over the first T_0 functions is, by standard Gaussian conditioning, a GP with covariance C(x, x′)(E_00)^{-1}. Averaging the posterior variance of f_1(x) then gives the Bayes error on task 1 as ε_1 = ⟨C(x, x)⟩ (E_00^{-1})_{11}, as found earlier. This analysis can be extended to the case where there are some examples available also for the first T_0 tasks. One finds for the generalization errors on these tasks the prediction (7) with D^{-1} replaced by E_00. This is again in line with the above form of the GP posterior after perfect learning of the remaining T_1 tasks.

4.2 Two tasks

We next analyse how well the approximation (7) does in predicting multi-task learning curves for T = 2 tasks.
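Before turning to specific cases, the exact Bayes error expression (3) can be sanity-checked numerically. On a small discrete input space with a uniform P(x), the eigenfunctions are exactly orthonormal vectors, so (3) must agree to machine precision with the directly averaged posterior variance (1). The construction below is a toy setup invented for this check:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, n = 5, 2, 6                          # features, tasks, examples
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
phi = np.sqrt(M) * Q                       # phi[x, i]: orthonormal under uniform P(x)
lam = np.array([1.0, 0.5, 0.25, 0.12, 0.06])
D = np.array([[1.0, 0.7], [0.7, 1.0]])
s2 = np.array([0.1, 0.2])                  # per-task noise variances
Cmat = phi @ np.diag(lam) @ phi.T          # C(x, x') on the M-point input space

Xi = rng.integers(0, M, size=n)            # training input indices
ti = rng.integers(0, T, size=n)            # task labels

# Eq. (3): eps_t = tr P_t (L^{-1} + Psi^T Sigma^{-1} Psi)^{-1}.
L = np.kron(D, np.diag(lam))               # index order (task, feature)
Psi = np.zeros((n, T * M))
for l in range(n):
    Psi[l, ti[l] * M:(ti[l] + 1) * M] = phi[Xi[l]]
A = np.linalg.inv(np.linalg.inv(L) + Psi.T @ np.diag(1 / s2[ti]) @ Psi)
eps3 = [np.trace(A[t * M:(t + 1) * M, t * M:(t + 1) * M]) for t in range(T)]

# Direct route: average the posterior variance (1) over all test inputs.
K = D[np.ix_(ti, ti)] * Cmat[np.ix_(Xi, Xi)] + np.diag(s2[ti])
eps_direct = []
for t in range(T):
    V = [D[t, t] * Cmat[x, x]
         - (D[ti, t] * Cmat[Xi, x]) @ np.linalg.solve(K, D[ti, t] * Cmat[Xi, x])
         for x in range(M)]
    eps_direct.append(np.mean(V))
assert np.allclose(eps_direct, eps3)       # (3) matches <V_t(x)>_x exactly
```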
Here we have the work of Chai [21] as a baseline, and as there we choose

D = [ 1  ρ ; ρ  1 ]

The diagonal elements are fixed to unity, as in a practical application where one would scale both task functions f_1(x) and f_2(x) to unit variance; the degree of correlation of the tasks is controlled by ρ. We fix π_2 = n_2/n and plot learning curves against n. In numerical simulations we ensure integer values of n_1 and n_2 by setting n_2 = ⌊nπ_2⌋, n_1 = n − n_2; for evaluation of (7) we use directly n_2 = nπ_2, n_1 = n(1 − π_2). For simplicity we consider equal noise levels σ_1^2 = σ_2^2 = σ^2. As regards the covariance function and input distribution, we analyse first the scenario studied in [21]: a squared exponential (SE) kernel C(x, x′) = exp[−(x − x′)^2/(2l^2)] with lengthscale l, and one-dimensional inputs x with a Gaussian distribution N(0, 1/12). The kernel eigenvalues λ_i are known explicitly from [22] and decay exponentially with i.

Figure 1: Average Bayes error for task 1 for two-task GP regression with kernel lengthscale l = 0.01, noise level σ^2 = 0.05 and a fraction π_2 = 0.75 of examples for task 2. Solid lines: numerical simulations; dashed lines: approximation (7). Task correlation ρ^2 = 0, 0.25, 0.5, 0.75, 1 from top to bottom. Left: SE covariance function, Gaussian input distribution. Middle: SE covariance, uniform inputs. Right: OU covariance, uniform inputs. Log-log plots (insets) show tendency of asymptotic uselessness, i.e. bunching of the ρ < 1 curves towards the one for ρ = 0; this effect is strongest for learning of smooth functions (left and middle).

Figure 1(left) compares numerically simulated learning curves with the predictions for ε_1, the average Bayes error on task 1, from (7). Five pairs of curves are shown, for ρ^2 = 0, 0.25, 0.5, 0.75, 1.
Note that the two extreme values represent single-task limits, where examples from task 2 are either ignored (ρ = 0) or effectively treated as being from task 1 (ρ = 1). Our predictions lie generally below the true learning curves, but qualitatively represent the trends well, in particular the variation with ρ^2. The curves for the different ρ^2 values are fairly evenly spaced vertically for small numbers of examples n, corresponding to a linear dependence on ρ^2. As n increases, however, the learning curves for ρ < 1 start to bunch together and separate from the one for the fully correlated case (ρ = 1). The approximation (7) correctly captures this behaviour, which is discussed in more detail below. Figure 1(middle) has analogous results for the case of inputs x uniformly distributed on the interval [0, 1]; the λ_i here decay exponentially with i^2 [17]. Quantitative agreement between simulations and predictions is better for this case. The discussion in [17] suggests that this is because the approximation method we have used implicitly neglects spatial variation of the dataset-averaged posterior variance ⟨V_τ(x)⟩; but for a uniform input distribution this variation will be weak except near the ends of the input range [0, 1]. Figure 1(right) displays similar results for an OU kernel C(x, x′) = exp(−|x − x′|/l), showing that our predictions also work well when learning rough (nowhere differentiable) functions.

4.3 Asymptotic uselessness

The two-task results above suggest that multi-task learning is less useful asymptotically: when the number of training examples n is large, the learning curves seem to bunch towards the curve for ρ = 0, where task 2 examples are ignored, except when the two tasks are fully correlated (ρ = 1). We now study this effect. When the number of examples for all tasks becomes large, the Bayes errors ε_τ will become small and eventually be negligible compared to the noise variances σ_τ^2 in (7).
One then has an explicit prediction for each ε_τ, without solving T self-consistency equations. If we write, for T tasks, n_τ = nπ_τ with π_τ the fraction of examples for task τ, and set γ_τ = π_τ/σ_τ^2, then for large n

ε_τ = Σ_i [ (λ_i^{-1} D^{-1} + nΓ)^{-1} ]_{ττ} = Σ_i [ Γ^{-1/2} (λ_i^{-1} (Γ^{1/2} D Γ^{1/2})^{-1} + nI)^{-1} Γ^{-1/2} ]_{ττ}    (8)

where Γ = diag(γ_1, …, γ_T). Using an eigendecomposition of the symmetric matrix Γ^{1/2} D Γ^{1/2} = Σ_{a=1}^T δ_a v_a v_a^T, one then shows in a few lines that (8) can be written as

ε_τ ≈ γ_τ^{-1} Σ_a (v_{a,τ})^2 δ_a g(nδ_a)    (9)

where g(h) = tr (Λ^{-1} + h)^{-1} = Σ_i (λ_i^{-1} + h)^{-1} and v_{a,τ} is the τ-th component of the a-th eigenvector v_a. This is the general asymptotic form of our prediction for the average Bayes error for task τ.

Figure 2: Left: Bayes error (parameters as in Fig. 1(left), with n = 500) vs ρ^2. To focus on the error reduction with ρ, r = [ε_1(ρ) − ε_1(1)]/[ε_1(0) − ε_1(1)] is shown. Circles: simulations; solid line: predictions from (7). Other lines: predictions for larger n, showing the approach to asymptotic uselessness in multi-task learning of smooth functions. Inset: analogous results for rough functions (parameters as in Fig. 1(right)). Right: learning curve for many-task learning (T = 200, parameters otherwise as in Fig. 1(left) except ρ^2 = 0.8). Notice the bend around ε_1 = 1 − ρ = 0.106. Solid line: simulations (steps arise because we chose to allocate examples to tasks in order τ = 1, …, T rather than randomly); dashed line: predictions from (7). Inset: predictions for T = 1000, with asymptotic forms ε = 1 − ρ + ρε̃ and ε = (1 − ρ)ε̄ for the two learning stages shown as solid lines.

To get a more explicit result, consider the case where sample functions from the GP prior have (mean-square) derivatives up to order r.
The kernel eigenvalues λ_i then decay as¹ i^{−(2r+2)} for large i, and using arguments from [17] one deduces that g(h) ~ h^{−α} for large h, with α = (2r + 1)/(2r + 2). In (9) we can then write, for large n, g(nδ_a) ≈ (δ_a/γ_τ)^{−α} g(nγ_τ) and hence

ε_τ ≈ g(nγ_τ) { Σ_a (v_{a,τ})^2 (δ_a/γ_τ)^{1−α} }    (10)

When there is only a single task, δ_1 = γ_1 and this expression reduces to ε_1 = g(nγ_1) = g(n_1/σ_1^2). Thus g(nγ_τ) = g(n_τ/σ_τ^2) is the error we would get by ignoring all examples from tasks other than τ, and the term in {…} in (10) gives the “multi-task gain”, i.e. the factor by which the error is reduced because of examples from other tasks. (The absolute error reduction always vanishes trivially for n → ∞, along with the errors themselves.) One observation can be made directly. Learning of very smooth functions, as defined e.g. by the SE kernel, corresponds to r → ∞ and hence α → 1, so the multi-task gain tends to unity: multi-task learning is asymptotically useless. The only exception occurs when some of the tasks are fully correlated, because one or more of the eigenvalues δ_a of Γ^{1/2} D Γ^{1/2} will then be zero. Fig. 2(left) shows this effect in action, plotting Bayes error against ρ^2 for the two-task setting of Fig. 1(left) with n = 500. Our predictions capture the nonlinear dependence on ρ^2 quite well, though the effect is somewhat weaker in the simulations. For larger n the predictions approach a curve that is constant for ρ < 1, signifying negligible improvement from multi-task learning except at ρ = 1. It is worth contrasting this with the lower bound from [21], which is linear in ρ^2. While this provides a very good approximation to the learning curves for moderate n [21], our results here show that asymptotically this bound can become very loose. When predicting rough functions, there is some asymptotic improvement to be had from multi-task learning, though again the multi-task gain is nonlinear in ρ^2: see Fig. 2(left, inset) for the OU case, which has r = 0.
A simple expression for the gain can be obtained in the limit of many tasks, to which we turn next.

¹ See the discussion of Sacks-Ylvisaker conditions in e.g. [1]; we consider one-dimensional inputs here though the discussion can be generalized.

4.4 Many tasks

We assume as for the two-task case that all inter-task correlations, D_{τ,τ′} with τ ≠ τ′, are equal to ρ, while D_{τ,τ} = 1. This setup was used e.g. in [23], and can be interpreted as each task having a component proportional to √ρ of a shared latent function, with an independent task-specific signal in addition. We assume for simplicity that we have the same number n_τ = n/T of examples for each task, and that all noise levels are the same, σ_τ^2 = σ^2. Then also all Bayes errors ε_τ = ε will be the same. Carrying out the matrix inverses in (7) explicitly, one can then write this equation as

ε = g_T(n/(σ^2 + ε), ρ)    (11)

where g_T(h, ρ) is related to the single-task function g(h) from above by

g_T(h, ρ) = [(T − 1)/T] (1 − ρ) g(h(1 − ρ)/T) + [ρ + (1 − ρ)/T] g(h[ρ + (1 − ρ)/T])    (12)

Now consider the limit T → ∞ of many tasks. If n and hence h = n/(σ^2 + ε) is kept fixed, g_T(h, ρ) → (1 − ρ) + ρg(hρ); here we have taken g(0) = 1, which corresponds to tr Λ = ⟨C(x, x)⟩_x = 1 as in the examples above. One can then deduce from (11) that the Bayes error for any task will have the form ε = (1 − ρ) + ρε̃, where ε̃ decays from one to zero with increasing n as for a single task, but with an effective noise level σ̃^2 = (1 − ρ + σ^2)/ρ. Remarkably, then, even though here n/T → 0 so that for most tasks no examples have been seen, the Bayes error for each task decreases by “collective learning” to a plateau of height 1 − ρ. The remaining decay of ε to zero happens only once n becomes of order T. Here one can show, by taking T → ∞ at fixed h/T in (12) and inserting into (11), that ε = (1 − ρ)ε̄, where ε̄ again decays as for a single task but with an effective number of examples n̄ = n/T and effective noise level σ̄^2 = σ^2/(1 − ρ).
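The plateau prediction can be reproduced by iterating (11)-(12) numerically. The eigenvalue spectrum below is a toy assumption (normalized so that tr Λ = 1, as in the limit above):

```python
import numpy as np

# Fixed point of eqs. (11)-(12) for T equally correlated tasks, illustrating
# the "collective learning" plateau near 1 - rho when n/T is small.
lam = 0.7 ** np.arange(200); lam /= lam.sum()    # toy eigenvalues, sum to 1

def g(h):                                        # g(h) = sum_i (1/lam_i + h)^{-1}
    return np.sum(1.0 / (1.0 / lam + h))

def g_T(h, rho, T):                              # eq. (12)
    return ((T - 1) / T * (1 - rho) * g(h * (1 - rho) / T)
            + (rho + (1 - rho) / T) * g(h * (rho + (1 - rho) / T)))

def solve_eps(n, rho, T, s2=0.05, iters=500):    # iterate eq. (11)
    eps = 1.0
    for _ in range(iters):
        eps = g_T(n / (s2 + eps), rho, T)
    return eps

rho, T = 0.8, 1000
assert np.isclose(solve_eps(0, rho, T), 1.0)     # no data: prior variance
plateau = solve_eps(200, rho, T)                 # n/T small but n*rho large
assert 0.15 < plateau < 0.35                     # near the 1 - rho = 0.2 plateau
assert solve_eps(10**6, rho, T) < 0.1            # n ~ many per task: plateau left
```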
This final stage of learning therefore happens only when each task has seen a considerable number of examples n̄ = n/T. Fig. 2(right) validates these predictions against simulations, for a number of tasks (T = 200) that is in the same ballpark as in the many-tasks application example of [24]. The inset for T = 1000 shows clearly how the two learning-curve stages separate as T becomes larger. Finally we can come back to the multi-task gain in the asymptotic stage of learning. For GP priors with sample functions with derivatives up to order r as before, the function ε̄ from above will decay as (n̄/σ̄²)^{−α}; since ε = (1−ρ)ε̄ and σ̄² = σ²/(1−ρ), the Bayes error is then proportional to (1−ρ)^{1−α}. This multi-task gain again approaches unity for ρ < 1 for smooth functions (α = (2r+1)/(2r+2) → 1). Interestingly, for rough functions (α < 1), the multi-task gain decreases for small ρ as 1 − (1−α)ρ, and so always lies below a linear dependence on ρ² initially. This shows that a linear-in-ρ² lower error bound cannot generally apply to T > 2 tasks, and indeed one can verify that the derivation in [21] does not extend to this case.

5 Conclusion

We have derived an approximate prediction (7) for learning curves in multi-task GP regression, valid for arbitrary inter-task correlation matrices D. This can be evaluated explicitly knowing only the kernel eigenvalues, without sampling or recourse to single-task learning curves. The approximation shows that pure transfer learning has a simple lower error bound, and provides a good qualitative account of numerically simulated learning curves. Because it can be used to study the asymptotic behaviour for large training sets, it allowed us to show that multi-task learning can become asymptotically useless: when learning smooth functions it reduces the asymptotic Bayes error only if tasks are fully correlated.
For the limit of many tasks we found that, remarkably, some initial "collective learning" is possible even when most tasks have not seen examples. A much slower second learning stage then requires many examples per task. The asymptotic regime of this stage also showed explicitly that a lower error bound that is linear in ρ², the square of the inter-task correlation, is applicable only to the two-task setting T = 2. In future work it would be interesting to use our general result to investigate in more detail the consequences of specific choices for the inter-task correlations D, e.g. to represent a lower-dimensional latent factor structure. One could also try to deploy similar approximation methods to study the case of model mismatch, where the inter-task correlations D would have to be learned from data. More challenging, but worthwhile, would be an extension to multi-task covariance functions where task and input-space correlations do not factorize.

References

[1] C K I Williams and C Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[2] J Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149–198, 2000.
[3] S Ben-David and R S Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Mach. Learn., 73(3):273–287, December 2008.
[4] Y W Teh, M Seeger, and M I Jordan. Semiparametric latent factor models. In Workshop on Artificial Intelligence and Statistics 10, pages 333–340. Society for Artificial Intelligence and Statistics, 2005.
[5] E V Bonilla, F V Agakov, and C K I Williams. Kernel multi-task learning using task-specific features. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS). Omni Press, 2007.
[6] E V Bonilla, K M A Chai, and C K I Williams. Multi-task Gaussian process prediction. In J C Platt, D Koller, Y Singer, and S Roweis, editors, NIPS 20, pages 153–160, Cambridge, MA, 2008. MIT Press.
[7] M A Álvarez and N D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In D Koller, D Schuurmans, Y Bengio, and L Bottou, editors, NIPS 21, pages 57–64, Cambridge, MA, 2009. MIT Press.
[8] G Leen, J Peltonen, and S Kaski. Focused multi-task learning using Gaussian processes. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 310–325. Springer Berlin, Heidelberg, 2011.
[9] M A Álvarez, L Rosasco, and N D Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012.
[10] A Maurer. Bounds for linear multi-task learning. J. Mach. Learn. Res., 7:117–139, 2006.
[11] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[12] G F Trecate, C K I Williams, and M Opper. Finite-dimensional approximation of Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, NIPS 11, pages 218–224, Cambridge, MA, 1999. MIT Press.
[13] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, NIPS 11, pages 344–350, Cambridge, MA, 1999. MIT Press.
[14] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, NIPS 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[15] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[16] D Malzahn and M Opper. Statistical mechanics of learning: a variational approach for real data. Phys. Rev. Lett., 89:108302, 2002.
[17] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[18] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, NIPS 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[19] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. Springer Berlin, Heidelberg, 2005.
[20] M Urry and P Sollich. Exact learning curves for Gaussian process regression on large random graphs. In J Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, NIPS 23, pages 2316–2324, Cambridge, MA, 2010. MIT Press.
[21] K M A Chai. Generalization errors and learning curves for regression with multi-task Gaussian processes. In Y Bengio, D Schuurmans, J Lafferty, C K I Williams, and A Culotta, editors, NIPS 22, pages 279–287, 2009.
[22] H Zhu, C K I Williams, R J Rohwer, and M Morciniec. Gaussian regression and optimal finite dimensional linear models. In C M Bishop, editor, Neural Networks and Machine Learning. Springer, 1998.
[23] E Rodner and J Denzler. One-shot learning of object categories using dependent Gaussian processes. In Michael Goesele, Stefan Roth, Arjan Kuijper, Bernt Schiele, and Konrad Schindler, editors, Pattern Recognition, volume 6376 of Lecture Notes in Computer Science, pages 232–241. Springer Berlin, Heidelberg, 2010.
[24] T Heskes. Solving a huge number of similar tasks: a combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML'98), pages 233–241. Morgan Kaufmann, 1998.
