jmlr jmlr2010 jmlr2010-84 knowledge-graph by maker-knowledge-mining

84 jmlr-2010-On Spectral Learning


Source: pdf

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p−norms, trace norm and spectral norm, which should prove useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. [sent-15, score-0.756]

2 Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p−norms, trace norm and spectral norm, which should prove useful in applications. [sent-16, score-1.078]

3 Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization. [sent-17, score-0.79]

4 To obtain insights into this issue, we shall investigate in this paper the form of matrices which solve the variational problem (1) when the penalty term is an orthogonally invariant norm (OI-norm). [sent-26, score-0.796]

5 A recent trend in regularization methods in machine learning is to use matrix regularizers which are orthogonally invariant (Argyriou et al. [sent-39, score-0.553]

6 In particular, an important case is the Schatten one norm of W , which is often referred to as the trace norm. [sent-43, score-0.352]

7 The general idea behind this methodology is that a small trace norm favors low-rank solution matrices to (1). [sent-44, score-0.443]

8 However, the trace norm provides a convex relaxation of this problem, which has been justified in various ways (see, for example, Fazel et al. [sent-47, score-0.39]

9 Specifically, we provide what in machine learning is known as a representer theorem. [sent-50, score-0.381]

10 A representer theorem is appealing from a practical point of view, because it ensures that the cost of solving the optimization problem (1) depends on the size m of the training sample, which can be much smaller than the number of elements of the matrix W . [sent-52, score-0.51]

11 Our point of view in developing these theorems is through the study of the minimal norm interpolation problem min{ ‖W‖ : I(W) = y, W ∈ Md,n }. [sent-55, score-0.551]
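
As a concrete illustration (not code from the paper), the special case of this problem with the Frobenius norm has a closed-form solution lying in the range of the adjoint I*, i.e., W = Σᵢ cᵢ Xᵢ with c solving a Gram system; this is the standard representer theorem referred to later in the text. The sketch below assumes a hypothetical measurement operator I(W) = (⟨Xᵢ, W⟩)ᵢ built from random matrices.

```python
import numpy as np

# Minimal-norm interpolation for the Frobenius norm (a sketch, not the paper's code):
# minimize ||W||_F subject to <X_i, W> = y_i, i = 1..m.
# The solution has the form W = sum_i c_i X_i (standard representer theorem),
# where c solves the Gram system G c = y with G_ij = <X_i, X_j>.

rng = np.random.default_rng(0)
d, n, m = 5, 4, 3                                     # hypothetical sizes
X = [rng.standard_normal((d, n)) for _ in range(m)]   # defines I(W) = (<X_i, W>)_i
y = rng.standard_normal(m)

G = np.array([[np.sum(Xi * Xj) for Xj in X] for Xi in X])   # Gram matrix of the X_i
c = np.linalg.solve(G, y)                                   # dual coefficients
W = sum(ci * Xi for ci, Xi in zip(c, X))                    # W lies in the range of I*

# Check the interpolation constraints I(W) = y.
print(np.allclose([np.sum(Xi * W) for Xi in X], y))         # True
```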

12 In Section 2 we introduce our notation and describe the connection between minimal norm interpolation and regularization. [sent-62, score-0.495]

13 In particular, we describe a special case of such norms, which contains the Schatten p−norms, and derive a linear representer theorem for this case. [sent-65, score-0.454]

14 Regularization enables one to trade off interpolation of the data against smoothness or simplicity of the model, whereas interpolation frequently suffers from overfitting. [sent-88, score-0.42]

15 Regularization with the trace norm learns the tasks as one joint optimization problem, by favoring matrices with low rank (Argyriou et al. [sent-107, score-0.471]

16 Such facts are known in machine learning as representer theorems, see Argyriou et al. [sent-117, score-0.381]

17 That is, our main concern will be to obtain representer theorems which hold for problems like (3). [sent-120, score-0.437]

18 This in turn will imply representer theorems for the associated regularization problems. [sent-121, score-0.511]

19 Then for every y ∈ Rm there exists ŷ ∈ Rm such that any solution of the interpolation problem (3) with y = ŷ is a solution of the regularization problem (2). [sent-124, score-0.411]

20 We shall return to this issue in Section 4, where we study representer theorems of a particular type for regularizers which are OI-norms. [sent-129, score-0.579]

21 Duality and Minimal Norm Interpolation. In this section, we turn our attention to the study of the interpolation problem (3) when the function Ω is a norm on Md,n. [sent-131, score-0.466]

22 That is, we prescribe a linear operator I : Md,n → Rm, a vector y ∈ R(I) \ {0} and study the minimal norm interpolation problem φ := min{ ‖W‖ : I(W) = y, W ∈ Md,n }. [sent-132, score-0.568]

23 (6) Associated with this inequality is the notion of the peak set of the norm ‖·‖ at X, namely D_X = {W : ⟨X, W⟩ = ‖X‖_D, W ∈ Md,n, ‖W‖ = 1}. [sent-136, score-0.397]

24 As we shall see in the theorem below, the dual norm leads to the following dual problem θ := min{ ‖I*(c)‖_D : c ∈ R(I), ⟨c, y⟩ = 1 }. [sent-138, score-0.705]

25 In the primal problem we minimize a norm, which is a function that grows at infinity, and so the existence of a solution is assured. [sent-140, score-0.399]

26 Similarly, the quantity ‖I*(c)‖_D which is minimized in the dual problem is also a norm on c ∈ R(I). [sent-141, score-0.397]

27 More importantly, any solution of the dual problem will provide us with a solution of the primal problem, and conditions on the latter are obtained from a study of the peak set D_{I*(c)}. [sent-151, score-0.416]

28 Since the dual norm is a maximum of linear functions over a compact set, we may apply Theorem 22 in the case that X = {X : X ∈ Md,n, ‖X‖ ≤ 1}, W = Md,n, f(W, X) = ⟨W, X⟩, and evaluate Equation (16) for W = I*(ĉ) and ∆ = I*(b) to obtain the inequality max{ ⟨I*(b), T⟩ : T ∈ D_{I*(ĉ)} } ≥ 0. [sent-158, score-0.452]

29 This latter property, combined with Lemma 1, may be viewed as a general representer theorem, that is, the theorem implies that the solutions of the regularization problem (2) are matrices in the set D_{I*(c)}, for some c ∈ Rm. [sent-172, score-0.6]

30 We refer to this condition throughout the paper as the standard representer theorem, see Argyriou et al. [sent-175, score-0.381]

31 In other words, the standard representer theorem for Ŵ means that Ŵ ∈ R(I*). [sent-177, score-0.454]

32 Alternatively, we can approach the minimal norm interpolation problem by use of the Lagrangian, defined, for W ∈ Md,n and λ ∈ Rm, as L(W, λ) = ‖W‖ + ⟨W, I*(λ)⟩ − ⟨y, λ⟩. [sent-180, score-0.495]

33 As we shall see, for such norms, the representer theorem can be written in terms of the singular value decomposition. [sent-183, score-0.681]

34 In Section 4.3, we shall describe a subclass of OI-norms for which representer theorems can be phrased in terms of matrix multiples of the adjoint operator value I*(c). [sent-185, score-0.725]

35 This type of representer theorem arises in multi-task learning as described in Argyriou et al. [sent-186, score-0.454]

36 We shall call functions of the singular values of a matrix spectral functions. [sent-197, score-0.382]

37 2 Orthogonally Invariant Norms. A norm ‖·‖ on Md,n is called orthogonally invariant whenever, for every U ∈ Od, V ∈ On and W ∈ Md,n, we have that ‖UWV⊤‖ = ‖W‖. [sent-202, score-0.666]

38 Conversely, if f : Rr → R is an SG-function then the norm defined at W ∈ Md,n as ‖W‖ = f(σ(W)) is orthogonally invariant. [sent-218, score-0.532]
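
A quick numerical sanity check of this invariance (an illustration, not from the paper): for the Schatten p-norms, multiplying on either side by orthogonal matrices leaves the singular values, and hence the norm, unchanged. The matrix sizes and random data below are arbitrary choices.

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten p-norm: the l_p norm of the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.linalg.norm(s, ord=p)

rng = np.random.default_rng(1)
d, n = 6, 4
W = rng.standard_normal((d, n))
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal U in O_d
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal V in O_n

for p in (1, 2, np.inf):
    print(np.isclose(schatten_norm(U @ W @ V.T, p), schatten_norm(W, p)))  # True
```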

39 The Schatten 1−norm is sometimes called the trace norm or nuclear norm. [sent-220, score-0.352]

40 Other common values of p give rise to the Frobenius norm (p = 2) and the spectral norm (p = ∞). [sent-221, score-0.611]

41 The Frobenius norm can also be written as √tr(W⊤W) and the spectral norm is alternatively expressed as max{ ‖Wx‖_2 : ‖x‖_2 = 1 }, where the subscript on the vector norm indicates the Euclidean norm of that vector. [sent-222, score-1.123]

42 Another well-known family of OI-norms is the Ky Fan norms, defined, for every W ∈ Md,n, as ‖W‖_(k) = Σ_{i∈Nk} σi(W), where 1 ≤ k ≤ r (the cases k = 1 and k = r are the spectral and trace norms, respectively). [sent-223, score-0.349]
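
The short sketch below (illustration only; sizes and data are arbitrary) computes the trace, Frobenius, spectral and Ky Fan norms from the singular values and checks the two alternative expressions mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 3))
s = np.linalg.svd(W, compute_uv=False)      # singular values, largest first

trace_norm = s.sum()                        # Schatten 1-norm (trace / nuclear norm)
frobenius  = np.sqrt((s ** 2).sum())        # Schatten 2-norm
spectral   = s[0]                           # Schatten infinity-norm
ky_fan     = lambda k: s[:k].sum()          # Ky Fan k-norm (k=1: spectral, k=r: trace)

# Frobenius norm equals sqrt(tr(W^T W)); spectral norm equals max_{||x||_2 = 1} ||W x||_2.
print(np.isclose(frobenius, np.sqrt(np.trace(W.T @ W))))      # True
print(np.isclose(spectral, np.linalg.norm(W, ord=2)))         # True
print(ky_fan(1) == spectral, np.isclose(ky_fan(len(s)), trace_norm))
```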

43 The function (W, D) ↦ ⟨WW⊤, D^(−(2−p)/p)⟩, defined on Md,n × S^d_{++}, is jointly convex in W and D and, so, the infimum in (10) is convex in W, in agreement with the convexity of the norm ‖W‖_p. [sent-232, score-0.332]

44 In machine learning practice, regularization with the trace norm has been proposed for collaborative filtering and multi-task learning (Abernethy et al. [sent-234, score-0.452]

45 However, a common technique that overcomes this issue is to replace the rank by the trace norm (Fazel et al. [sent-240, score-0.397]

46 The trace norm is the ℓ1 norm on the singular values and hence there is an analogy to regularization of a vector variable with the ℓ1 norm, which is often used to obtain sparse solutions; see Candès and Recht (2008) and references therein. [sent-242, score-0.815]

47 In analogy to ℓ1 regularization, it has recently been shown that for certain configurations of the input data the low-rank solution can be recovered using the trace norm approach (Candès and Recht, 2008; Recht et al. [sent-243, score-0.443]
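
To illustrate this analogy (a standard construction, not taken from the paper): the proximal operator of the trace norm soft-thresholds the singular values exactly as the proximal operator of the ℓ1 norm soft-thresholds vector entries, which is how trace-norm regularization drives small singular values to zero and favors low rank. The function names and test data below are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def svt(W, t):
    """Proximal operator of t * trace norm: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(soft_threshold(s, t)) @ Vt

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 6))
Z = svt(W, t=1.0)
print(np.linalg.matrix_rank(Z) <= np.linalg.matrix_rank(W))   # rank never increases
print(np.linalg.svd(Z, compute_uv=False))   # singular values below the threshold are zeroed
```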

48 Another is that fitting multiple learning tasks simultaneously, so that they share a small set of orthogonal features, leads to a trace norm problem (Argyriou et al. [sent-250, score-0.381]

49 This inequality also originates from von Neumann (1962) and is sometimes called von Neumann’s trace theorem or Ky Fan inequality. [sent-261, score-0.344]
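
A numerical check of von Neumann's trace inequality, ⟨X, W⟩ ≤ ⟨σ(X), σ(W)⟩, on random matrices (illustration only; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 6, 4
for _ in range(1000):
    X = rng.standard_normal((d, n))
    W = rng.standard_normal((d, n))
    lhs = np.sum(X * W)                                   # <X, W> = tr(X^T W)
    rhs = np.dot(np.linalg.svd(X, compute_uv=False),
                 np.linalg.svd(W, compute_uv=False))      # <sigma(X), sigma(W)>
    assert lhs <= rhs + 1e-10
print("von Neumann trace inequality held on all random trials")
```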

50 Let us also mention that apart from norm duality, Lemma 5 underlies many analytical properties of spectral functions, such as convexity, Fenchel conjugacy, subgradients and differentiability (see, for example, Lewis, 1995 for a review). [sent-267, score-0.355]

51 For our purposes, inequality (11) can be used to compute the dual of an OI-norm in terms of the dual of the corresponding SG-function. [sent-268, score-0.337]

52 Lemma 6 If the norm ‖·‖ on Md,n is orthogonally invariant and f is the corresponding SG-function, then the dual norm is given, for W ∈ Md,n, by ‖W‖_D = f_D(σ(W)), where f_D : Rr → R+ is the dual norm of f. [sent-273, score-1.425]

53 If the norm ‖·‖ is orthogonally invariant and f is the corresponding SG-function, then D_W = {Z : Z = UΣ(Z)V⊤, σ(Z) ∈ D_{f,σ(W)}}, where D_{f,σ(W)} denotes the peak set of f at σ(W). [sent-278, score-0.631]

54 If the norm ‖·‖ is orthogonally invariant and the corresponding SG-function f is differentiable at σ(W), then ‖·‖ is differentiable at W and ∇‖W‖ = U diag(∇f(σ(W))) V⊤. [sent-280, score-0.631]
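
A finite-difference check of this gradient formula for the Schatten p-norm, where f is the ℓp norm and ∇f(σ)ᵢ = (σᵢ/‖σ‖_p)^(p−1) (a sketch with arbitrary sizes and p, assuming distinct nonzero singular values so that the norm is differentiable):

```python
import numpy as np

def schatten_norm(W, p):
    return np.linalg.norm(np.linalg.svd(W, compute_uv=False), ord=p)

def schatten_grad(W, p):
    """Gradient of the Schatten p-norm at W (1 < p < inf, nonzero distinct singular values):
       grad ||W||_p = U diag(g) V^T with g_i = (sigma_i / ||sigma||_p)**(p - 1)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    g = (s / np.linalg.norm(s, ord=p)) ** (p - 1)
    return U @ np.diag(g) @ Vt

rng = np.random.default_rng(5)
W = rng.standard_normal((5, 4))
p = 3.0
G = schatten_grad(W, p)

# Compare the analytic gradient with a central finite difference in a random direction E.
eps = 1e-6
E = rng.standard_normal((5, 4))
fd = (schatten_norm(W + eps * E, p) - schatten_norm(W - eps * E, p)) / (2 * eps)
print(np.isclose(fd, np.sum(G * E), atol=1e-5))   # True
```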

55 And dual pairs of matrices with respect to OI-norms directly relate to dual pairs of vectors with respect to SG-functions. [sent-285, score-0.327]

56 As an example, the dual of a Schatten p-norm ‖·‖_p is the norm ‖·‖_q, where 1/p + 1/q = 1. [sent-288, score-0.397]
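
A numerical illustration (not from the paper) of this duality: the Hölder-type bound |⟨W, Z⟩| ≤ ‖W‖_p ‖Z‖_q holds for conjugate exponents, and it is attained by a matrix sharing the singular vectors of W.

```python
import numpy as np

def schatten(W, p):
    return np.linalg.norm(np.linalg.svd(W, compute_uv=False), ord=p)

rng = np.random.default_rng(6)
p = 3.0
q = p / (p - 1)          # conjugate exponent: 1/p + 1/q = 1

for _ in range(500):
    W = rng.standard_normal((5, 4))
    Z = rng.standard_normal((5, 4))
    assert abs(np.sum(W * Z)) <= schatten(W, p) * schatten(Z, q) + 1e-10

# The bound is attained: with W = U diag(s) V^T, take Z = U diag(s**(p-1)) V^T / ||s||_p**(p-1),
# so that <W, Z> = ||W||_p while ||Z||_q = 1.
W = rng.standard_normal((5, 4))
U, s, Vt = np.linalg.svd(W, full_matrices=False)
Z = U @ np.diag(s ** (p - 1)) @ Vt / np.linalg.norm(s, ord=p) ** (p - 1)
print(np.isclose(np.sum(W * Z), schatten(W, p)), np.isclose(schatten(Z, q), 1.0))
```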

57 We are interested in obtaining representer theorems and optimality conditions, in general, for regularization problems of the form (2). [sent-293, score-0.511]

58 We shall focus, however, on representer theorems for interpolation problems of the form (3). [sent-294, score-0.741]

59 Let Ω : Md,n → R be a given regularizer and assume that, for every y ∈ Rm and linear operator I : Md,n → Rm , there exists some solution of (3) satisfying a prescribed representer theorem. [sent-295, score-0.601]

60 Then, by Lemma 1, for every y ∈ Rm , I : Md,n → Rm and E : Rm × Rm → R, the same representer theorem holds for some solution of problem (2). [sent-296, score-0.535]

61 In the remainder of the paper we shall prove optimality conditions for interpolation problems, which thus equally apply to regularization problems. [sent-297, score-0.378]

62 Conversely, the representer theorem for the regularization problem (2), associated with certain choices of the function Ω and E, will also hold for the corresponding interpolation problems (3). [sent-298, score-0.738]

63 (2009), which concerns the standard representer theorem, Ŵ ∈ R(I*). [sent-300, score-0.381]

64 This theorem implies that, in order to solve the minimal norm interpolation problem (5), we may first solve the dual problem (7) and then find a matrix in the peak set of I*(ĉ), scaled by 1/‖I*(ĉ)‖_D, which interpolates the data. [sent-312, score-0.851]

65 For example, if ‖·‖ is the Frobenius norm, Theorem 12 readily yields the standard representer theorem. [sent-315, score-0.381]

66 3 Admissible Orthogonally Invariant Norms. In this section, we define a subclass of OI-norms which obey an improved version of the representer theorem presented above. [sent-317, score-0.48]

67 Definition 13 A norm ‖·‖ on Rr is said to be admissible if, for any x ∈ Rr and any k ∈ Nr such that x_k ≠ 0, we have that ‖x̃_k‖ < ‖x‖, where x̃_k is the vector all of whose components agree with x, except the k-th component, which is zero. [sent-319, score-0.599]

68 The simplest examples of admissible norms are the ℓp norms on Rd, ‖·‖_p, for p ∈ [1, ∞). [sent-320, score-0.523]

69 From these norms we can form other admissible norms in various ways. [sent-321, score-0.523]

70 Specifically, for any p1, p2 ∈ [1, ∞), we see that the norms ‖·‖_{p1} + ‖·‖_{p2} and max{ ‖·‖_{p1}, ‖·‖_{p2} } are both admissible. [sent-322, score-0.512]
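
A small numerical illustration of Definition 13 (the test values below are arbitrary): zeroing a nonzero component strictly decreases any ℓp norm with p finite, whereas the ℓ∞ norm, which is not admissible, can remain unchanged.

```python
import numpy as np

def zero_component(x, k):
    """The vector whose components agree with x except the k-th, which is set to zero."""
    z = x.copy()
    z[k] = 0.0
    return z

x = np.array([3.0, -1.0, 2.0])
k = 1                                            # x[k] != 0

for p in (1, 2, 4):                              # admissible: strict decrease
    print(np.linalg.norm(zero_component(x, k), p) < np.linalg.norm(x, p))   # True

# The l_infinity norm is not admissible: zeroing a nonzero, non-maximal component
# leaves the norm unchanged.
print(np.linalg.norm(zero_component(x, k), np.inf) == np.linalg.norm(x, np.inf))  # True
```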

71 Lemma 14 If ‖·‖ is an admissible norm on Rr, x ∈ Rr \ {0} and w ∈ D_x, then for any k ∈ Nr with x_k = 0 it holds that w_k = 0. [sent-328, score-0.575]

72 Since ‖·‖ is admissible it follows that ‖w̃_k‖ < ‖w‖, and so we get that ‖w̃_k‖ < 1, because ‖w‖ = 1. [sent-333, score-0.36]

73 Consequently, it follows that ‖x̃_k‖_D = ⟨y, x̃_k⟩ = ⟨y, x⟩ ≤ ‖y‖ ‖x‖_D = 1, from which we conclude that ⟨w̃_k, x⟩ = ⟨w, x̃_k⟩ ≤ ‖w‖. [sent-340, score-0.407]

74 Definition 15 A norm ‖·‖ on Md,n is said to be admissible orthogonally invariant if there exists an admissible vector norm ‖·‖ on Rr such that, for every W ∈ Md,n, we have that ‖W‖ = ‖σ(W)‖. [sent-345, score-1.218]

75 Examples of non-admissible OI-norms are the spectral norm, the Ky Fan norms ‖·‖_(k) for 1 ≤ k < r and the norm max{ ‖·‖_1, α‖·‖_∞ } for α ∈ (1, ∞). [sent-346, score-0.474]

76 We have now accumulated sufficient information on admissible OI-norms to present an improved representer theorem for problem (5). [sent-347, score-0.602]

77 In other words, Ŵ is obtained by first applying the standard representer theorem and then multiplying it from the right by the matrix R. [sent-349, score-0.51]

78 Theorem 16 If ‖·‖ is admissible orthogonally invariant, the matrix Ŵ ∈ Md,n \ {0} is a solution of (5) and the vector ĉ ∈ Rm is a solution of (7), then there exists a matrix R ∈ S^n_+ such that Ŵ = I*(ĉ)R (13) and the eigenvectors of R are right singular vectors of I*(ĉ). [sent-351, score-0.788]

79 Theorem 17 If ‖·‖ is orthogonally invariant and condition (13) holds (without any conditions on R ∈ Mn,n), for every linear operator I : Md,n → Rm, y ∈ R(I) \ {0}, every solution Ŵ of (5) and every solution ĉ of (7), then the norm ‖·‖ is admissible orthogonally invariant. [sent-358, score-1.325]

80 We remark that there exist norms on Md,n which are not orthogonally invariant and satisfy condition (13). [sent-370, score-0.494]

81 In fact, given two non-singular matrices Q ∈ Md,d and M ∈ Mn,n, the norm W ↦ ‖QWM‖_2 is not orthogonally invariant, but the representer theorem is easily seen to be Ŵ = (Q⊤Q)^(−1) I*(ĉ) (MM⊤)^(−1). [sent-371, score-1.031]

82 Furthermore, concerning the converse of Theorem 16, it can be shown that if we restrict the eigenvectors of R to be right singular vectors of I*(ĉ), then ‖·‖ has to be orthogonally invariant. [sent-372, score-0.436]

83 Moreover, if the norm ‖·‖ is not admissible, then it can be shown that for every solution ĉ of the dual problem there exists a solution of the primal satisfying (13). [sent-373, score-0.621]

84 For a characterization of functions yielding such representer theorems, see Argyriou et al. [sent-375, score-0.381]

85 The above corollary also confirms that representation (13) does not apply to the spectral norm (which is not admissible orthogonally invariant); in that case the set of solutions can be a superset of the range of I*. [sent-395, score-0.82]

86 To recapitulate the results presented in this section, Theorem 12 allows one to obtain the solutions of the primal minimum norm interpolation problem (5) from those of its dual problem (7), which involves m variables. [sent-397, score-0.731]

87 This is true for all OI-norms, even though the representer theorem in the form (13) applies only to admissible OI-norms. [sent-398, score-0.602]

88 Part of the appeal of OI-norms is that computing primal solutions from dual ones reduces to a vector norm optimization problem. [sent-399, score-0.521]

89 Indeed, given a solution of the dual problem, one just needs to compute the singular value decomposition of the matrix I*(c) and the peak set of the SG-function f at the singular values. [sent-400, score-0.621]

90 For example, if f_D is differentiable (except at zero), each dual solution is associated with a single primal one, which equals a multiple of the gradient of f_D at the dual solution. [sent-403, score-0.425]

91 4 Related Work. The results of Section 4 are related to other prior work, besides the already mentioned literature on representer theorems for the case of the vector L2 norm (that is, for n = 1). [sent-405, score-0.693]

92 In particular, the representer theorem for the trace norm (Corollary 19) has been stated in Srebro et al. [sent-406, score-0.806]

93 Also, the representation (13) in Theorem 16 relates to the representer theorems proven in Argyriou et al. [sent-408, score-0.437]

94 (2009) apply to the case of the trace norm and when the Xi are rank one matrices. [sent-412, score-0.397]

95 (2009) give representer theorems for a broad class of functions, of which differentiable OI-norms are members. [sent-414, score-0.437]

96 Conclusion and Future Work. We have characterized the form of the solution of regularization with an orthogonally invariant penalty term. [sent-418, score-0.521]

97 Our result depends upon a detailed analysis of the corresponding minimal norm interpolation problem. [sent-419, score-0.495]

98 In particular, we have derived a dual problem of the minimal norm interpolation problem and established the relationship between these two problems. [sent-420, score-0.636]

99 In practical circumstances, this number may be smaller than the dimension of the matrix we seek, thus our result should prove useful in the development of optimization algorithms for orthogonally invariant norm regularization. [sent-422, score-0.687]

100 Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. [sent-581, score-0.384]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('representer', 0.381), ('orthogonally', 0.276), ('rr', 0.27), ('argyriou', 0.269), ('norm', 0.256), ('interpolation', 0.21), ('nr', 0.18), ('rm', 0.153), ('icchelli', 0.152), ('ontil', 0.152), ('rgyriou', 0.152), ('admissible', 0.148), ('schatten', 0.144), ('dual', 0.141), ('singular', 0.133), ('norms', 0.119), ('neumann', 0.118), ('pectral', 0.117), ('wk', 0.106), ('invariant', 0.099), ('spectral', 0.099), ('primal', 0.097), ('trace', 0.096), ('shall', 0.094), ('peak', 0.086), ('diag', 0.081), ('srebro', 0.079), ('micchelli', 0.078), ('regularization', 0.074), ('theorem', 0.073), ('operator', 0.073), ('gauge', 0.072), ('abernethy', 0.072), ('nrmax', 0.068), ('ww', 0.067), ('recht', 0.067), ('regularizer', 0.066), ('xk', 0.065), ('sd', 0.063), ('von', 0.06), ('horn', 0.06), ('od', 0.058), ('cand', 0.056), ('theorems', 0.056), ('matrix', 0.056), ('inequality', 0.055), ('frobenius', 0.053), ('fd', 0.053), ('earning', 0.052), ('evgeniou', 0.052), ('fazel', 0.048), ('lemma', 0.048), ('regularizers', 0.048), ('solution', 0.046), ('matrices', 0.045), ('rank', 0.045), ('cavallanti', 0.043), ('corollary', 0.041), ('maurer', 0.039), ('adjoint', 0.039), ('convex', 0.038), ('xti', 0.036), ('ky', 0.036), ('every', 0.035), ('rd', 0.034), ('bhatia', 0.034), ('duals', 0.034), ('hardy', 0.034), ('nmt', 0.034), ('semicontinuous', 0.034), ('uwv', 0.034), ('duality', 0.032), ('symmetric', 0.031), ('directional', 0.03), ('pontil', 0.03), ('tasks', 0.029), ('wt', 0.029), ('massimiliano', 0.029), ('nonempty', 0.029), ('amit', 0.029), ('albany', 0.029), ('rmax', 0.029), ('ucl', 0.029), ('minimal', 0.029), ('eigenvectors', 0.027), ('yuan', 0.027), ('dw', 0.027), ('solutions', 0.027), ('rk', 0.027), ('ui', 0.027), ('nn', 0.027), ('lewis', 0.026), ('subclass', 0.026), ('decomposition', 0.026), ('gower', 0.026), ('collaborative', 0.026), ('penalty', 0.026), ('permutation', 0.025), ('author', 0.024), ('nm', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p−norms, trace norm and spectral norm, which should prove useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

2 0.13139808 72 jmlr-2010-Matrix Completion from Noisy Entries

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

3 0.10616499 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

Author: Tong Zhang

Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation

4 0.10165539 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani

Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization

5 0.08852721 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile

Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization

6 0.08762233 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

7 0.077159248 82 jmlr-2010-On Learning with Integral Operators

8 0.073388547 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

9 0.068215355 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

10 0.066649981 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

11 0.064377226 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

12 0.059848472 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

13 0.059087697 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

14 0.058819693 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

15 0.05622068 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

16 0.052113257 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

17 0.051619433 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

18 0.050798103 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

19 0.04736349 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

20 0.046402685 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, -0.157), (2, 0.155), (3, -0.019), (4, -0.078), (5, -0.065), (6, -0.061), (7, -0.105), (8, 0.043), (9, -0.136), (10, -0.116), (11, 0.02), (12, 0.064), (13, -0.01), (14, 0.152), (15, -0.117), (16, 0.073), (17, -0.124), (18, -0.099), (19, 0.051), (20, -0.127), (21, -0.006), (22, 0.199), (23, 0.042), (24, 0.173), (25, -0.106), (26, -0.025), (27, -0.043), (28, 0.052), (29, 0.008), (30, 0.037), (31, -0.051), (32, 0.046), (33, -0.023), (34, -0.008), (35, -0.153), (36, 0.093), (37, -0.006), (38, 0.105), (39, -0.128), (40, 0.05), (41, -0.073), (42, 0.053), (43, -0.013), (44, -0.018), (45, -0.018), (46, -0.104), (47, 0.009), (48, -0.024), (49, -0.084)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95978665 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p−norms, trace norm and spectral norm, which should prove useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

2 0.69594914 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani

Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization

3 0.62878442 72 jmlr-2010-Matrix Completion from Noisy Entries

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

4 0.58912379 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile

Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization

5 0.51633435 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

Author: Tong Zhang

Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation

6 0.35976169 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

7 0.35856524 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

8 0.33798042 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

9 0.33711386 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

10 0.33600014 82 jmlr-2010-On Learning with Integral Operators

11 0.33230621 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

12 0.30998185 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

13 0.28830674 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

14 0.28167349 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

15 0.28053132 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

16 0.2790629 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

17 0.26699394 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

18 0.25898531 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

19 0.24549569 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

20 0.24529324 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.027), (3, 0.086), (4, 0.015), (8, 0.016), (21, 0.036), (32, 0.038), (36, 0.026), (37, 0.041), (51, 0.349), (75, 0.172), (81, 0.012), (85, 0.067), (96, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71507382 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W. Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful examples, such as Schatten p−norms, trace norm and spectral norm, which should prove useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

2 0.52584332 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani

Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization

3 0.51007736 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.50146455 72 jmlr-2010-Matrix Completion from Noisy Entries

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

5 0.49552581 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

6 0.48836112 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

7 0.48811445 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

8 0.48432738 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

9 0.48387274 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

10 0.48377246 82 jmlr-2010-On Learning with Integral Operators

11 0.48162892 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

12 0.4802995 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

13 0.47974199 69 jmlr-2010-Lp-Nested Symmetric Distributions

14 0.47940403 102 jmlr-2010-Semi-Supervised Novelty Detection

15 0.47852179 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

16 0.47770369 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

17 0.47730729 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

18 0.47713476 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis

19 0.47698534 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

20 0.47502303 63 jmlr-2010-Learning Instance-Specific Predictive Models