nips nips2006 nips2006-138 knowledge-graph by maker-knowledge-mining

138 nips-2006-Multi-Task Feature Learning


Source: pdf

Author: Andreas Argyriou, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We present a method for learning a low-dimensional representation which is shared across a set of multiple related tasks. The method builds upon the wellknown 1-norm regularization problem using a new regularizer which controls the number of learned features common for all the tasks. We show that this problem is equivalent to a convex optimization problem and develop an iterative algorithm for solving it. The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the latter step we learn commonacross-tasks representations and in the former step we learn task-specific functions using these representations. We report experiments on a simulated and a real data set which demonstrate that the proposed method dramatically improves the performance relative to learning each task independently. Our algorithm can also be used, as a special case, to simply select – not learn – a few common features across the tasks. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 uk Abstract We present a method for learning a low-dimensional representation which is shared across a set of multiple related tasks. [sent-11, score-0.15]

2 The method builds upon the wellknown 1-norm regularization problem using a new regularizer which controls the number of learned features common for all the tasks. [sent-12, score-0.321]

3 We show that this problem is equivalent to a convex optimization problem and develop an iterative algorithm for solving it. [sent-13, score-0.195]

4 The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the latter step we learn commonacross-tasks representations and in the former step we learn task-specific functions using these representations. [sent-14, score-0.209]

5 We report experiments on a simulated and a real data set which demonstrate that the proposed method dramatically improves the performance relative to learning each task independently. [sent-15, score-0.123]

6 Our algorithm can also be used, as a special case, to simply select – not learn – a few common features across the tasks. [sent-16, score-0.275]

7 1 Introduction Learning multiple related tasks simultaneously has been empirically [2, 3, 8, 9, 12, 18, 19, 20] as well as theoretically [2, 4, 5] shown to often significantly improve performance relative to learning each task independently. [sent-17, score-0.254]

8 This is the case, for example, when only a few data per task are available, so that there is an advantage in “pooling” together data across many related tasks. [sent-18, score-0.157]

9 For example, task relatedness has been modeled through assuming that all functions learned are close to each other in some norm [3, 8, 15, 19]. [sent-20, score-0.267]

10 For example, in object recognition, it is well known that the human visual system is organized in a way that all objects 1 are represented – at the earlier stages of the visual system – using a common set of features learned, e. [sent-23, score-0.174]

11 ) using a common set of features describing these products. [sent-29, score-0.143]

12 In this paper, we explore the latter type of task relatedness, that is, we wish to learn a lowdimensional representation which is shared across multiple related tasks. [sent-30, score-0.292]

13 Inspired by the fact that the well known 1−norm regularization problem provides such a sparse representation for the single 1 We consider each object recognition problem within each object category, e. [sent-31, score-0.204]

14 task case, in Section 2 we generalize this formulation to the multiple task case. [sent-34, score-0.205]

15 Our method learns a few features common across the tasks by regularizing within the tasks while keeping them coupled to each other. [sent-35, score-0.509]

16 Moreover, the method can be used, as a special case, to select (not learn) a few features from a prescribed set. [sent-36, score-0.158]

17 Since the extended problem is nonconvex, we develop an equivalent convex optimization problem in Section 3 and present an algorithm for solving it in Section 4. [sent-37, score-0.195]

18 The learning algorithm simultaneously learns both the features and the task functions through two alternating steps. [sent-40, score-0.262]

19 The second step consists of learning, in an unsupervised way, a low-dimensional representation for these task parameters, which we show to be equivalent to learning common features across the tasks. [sent-42, score-0.299]

20 The number of common features learned is controlled, as we empirically show, by the regularization parameter, much like sparsity is controlled in the case of single-task 1-norm regularization. [sent-43, score-0.322]

21 In Section 5, we report experiments on a simulated and a real data set which demonstrate that the proposed method learns a few common features across the tasks while also improving the performance relative to learning each task independently. [sent-44, score-0.473]

22 Let T be the number of tasks and define IN T := {1, . [sent-48, score-0.131]

23 Based on this data, we wish to estimate T functions ft : IRd → IR, t ∈ INT , which approximate well the data and are statistically predictive, see e. [sent-56, score-0.124]

24 d If w, u ∈ IRd , we define w, u := i=1 wi ui , the standard inner product in IRd . [sent-59, score-0.457]

25 If A is a d × T matrix we denote by ai ∈ IRT and aj ∈ IRd the i-th row and the j-th column of A respectively. [sent-61, score-0.118]

26 1 Problem formulation The underlying assumption in this paper is that the functions ft are related so that they all share a small set of features. [sent-69, score-0.18]

27 Formally, our hypothesis is that the functions ft can be represented as d ft (x) = ait hi (x), t ∈ INT , (2. [sent-70, score-0.325]

28 1) i=1 where hi : IRd → IR are the features and ait ∈ IR are the regression parameters. [sent-71, score-0.259]

29 Our main assumption is that all the features but a few have zero coefficients across all the tasks. [sent-72, score-0.164]

30 For simplicity, we focus on linear features, that is, hi (x) = ui , x , where ui ∈ IRd . [sent-73, score-0.9]

31 In addition, we assume that the vectors ui are orthonormal. [sent-74, score-0.431]

32 Thus, if U denotes the d × d matrix with columns the vectors ui , then U ∈ Od . [sent-75, score-0.504]

33 The functions ft are linear as well, that is ft (x) = wt , x , where wt = i ait ui . [sent-76, score-1.134]

34 Let us denote by W the d × T matrix whose columns are the vectors wt and by A the d × T matrix with entries ait . [sent-79, score-0.45]

35 Our assumption that the tasks share a “small” set of features means that the matrix A has “many” rows which are identically equal to zero and, so, the corresponding features (columns of matrix U ) will not be used to represent the task parameters (columns of matrix W ). [sent-81, score-0.59]

36 We note that the problem of learning a low-rank matrix factorization which approximates a given partially observed target matrix has been considered in [1], [17] and references therein. [sent-83, score-0.124]

37 In the following, we describe our approach to computing the feature vectors u i and the parameters ait . [sent-85, score-0.16]

38 We first consider the case that there is only one task (say task t) and the features u i are fixed. [sent-86, score-0.238]

39 That is, we consider the problem min i=1 L(yti , at , U xti ) : at 1 ≤ α , or equivalently the unconstrained problem m min L(yti , at , U xti ) + γ at 2 1 : at ∈ IRd , (2. [sent-89, score-0.57]

40 For this purpose, we introduce the regularization error function T m L(yti , at , U xti ) + γ A E(A, U ) = 2 2,1 . [sent-96, score-0.318]

41 3) is the average of the empirical error across the tasks while the second one is a regularization term which penalizes the (2, 1)-norm of the matrix A. [sent-99, score-0.329]

42 It is obtained by first computing the 2-norm of the (across the tasks) rows ai (corresponding to feature i) of matrix A and then the 1-norm of the vector b(A) = ( a1 2 , . [sent-100, score-0.18]

43 This norm combines the tasks and ensures that common features will be selected across them. [sent-104, score-0.392]

44 ˆ Indeed, if the features U are prescribed and A minimizes the function E over A, the number of ˆ will typically be non-increasing with γ like in the case nonzero components of the vector b(A) ˆ of 1-norm single-task regularization. [sent-105, score-0.211]

45 Moreover, the components of the vector b( A) indicate how important each feature is and favor uniformity across the tasks for each feature. [sent-106, score-0.234]

46 Since we do not simply want to select the features but also learn them, we further minimize the function E over U , that is, we consider the optimization problem min E(A, U ) : U ∈ Od , A ∈ IRd×T . [sent-107, score-0.252]

47 4) This method learns a low-dimensional representation which is shared across the tasks. [sent-109, score-0.136]

48 As in the single-task case, the number of features will be typically non-increasing with the regularization parameter – we shall present experimental evidence of this fact in Section 5 (see Figure 1 therein). [sent-110, score-0.186]

49 We note that when the matrix U is not learned and we set U = Id×d , problem (2. [sent-111, score-0.14]

50 That is, we have the following convex optimization problem T m min L(yti , at , xti ) + γ A 2 2,1 : A ∈ IRd×T . [sent-113, score-0.401]

51 3 Equivalent convex optimization formulation Solving problem (2. [sent-117, score-0.181]

52 To this end, for every W ∈ IRd×T and D ∈ Sd , we define the function + T m R(W, D) = T w t , D + wt . [sent-123, score-0.208]

53 Then ai wt , D+ wt = trace(W D+ W ) = A t=1 2,1 2 2,1 and hence d ( W u i 2 )+ W u i u i W = A trace 2 trace(W U Diag( W ui 2 )+ U W ) = d A = W ui i=1 W ui 2,1 2 = A 2 2,1 . [sent-136, score-2.047]

54 Then T wt , D+ wt = trace(W U Diag(λ+ )U W ) = trace(Diag(λ+ )AA ) ≥ A i i 2 2,1 , t=1 by Lemma 4. [sent-139, score-0.416]

55 Note that the range constraint ensures that W is a multiple of the submatrix of U which corresponds to the nonzero eigenvalues of D, and hence if λi = 0 then ai = 0 as well. [sent-141, score-0.185]

56 2) we have constrained the trace of D, otherwise the optimal solution would be to simply set D = ∞ and only minimize the empirical error term in (3. [sent-144, score-0.291]

57 Indeed, without this constraint, it may be possible that DW = 0 when W does not have full rank, in which T case there is a matrix D for which t=1 wt , D+ wt = trace(W D+ W ) = 0. [sent-147, score-0.464]

58 We note that the rank of matrix D indicates how many common relevant features the tasks share. [sent-148, score-0.391]

59 3) that the rank of matrix D equals the number of nonzero rows of matrix A. [sent-150, score-0.241]

60 2) by alternately minimizing the function R with respect to D and the w t (recall that wt is the t-th column of matrix W ). [sent-157, score-0.324]

61 When we keep D fixed, the minimization over wt simply consists of learning the parameters wt independently by a regularization method, for example by an SVM or ridge regression type method 2 . [sent-158, score-0.59]

62 For a fixed value of the vectors wt , we learn D by simply solving the minimization problem T wt , D+ wt : D ∈ Sd , trace(D) ≤ 1, range(W ) ⊆ range(D) . [sent-159, score-0.744]

63 For example, we can also penalize the variance of the wt ’s – “forcing” them to be close to each other – as in [8]. [sent-164, score-0.208]

64 Algorithm 1 (Multi-Task Feature Learning) Input: training sets {(xti , yti )}m , t ∈ INT i=1 Parameters: regularization parameter γ Output: d × d matrix D, d × T regression matrix W = [w1 , . [sent-166, score-0.449]

65 , T do m d + compute wt = argmin i=1 L(yti , w, xti ) + γ w, D w : w ∈ IR , w ∈ range(D) end for 1 2 set D = (W W ) 1 trace(W W ) 2 end while Theorem 4. [sent-172, score-0.44]

66 2) 1 trace C 2 1 and the optimal value equals (trace C 2 )2 . [sent-177, score-0.268]

67 , bd ) ∈ IRd , we have that d inf i=1 b2 i : λi > 0, λi ˆ and any minimizing sequence converges to λi = d λi ≤ 1 1 2 bi =0 λi ) d i=1 λi → ( −1 2 1 2 bi =0 λi bi ) |bj | |bi | and λi − λj ≤( d i=1 2 1 (4. [sent-184, score-0.44]

68 From the Cauchy-Schwarz inequality we have that ( = b b 1 1 = bi =0 −1 λi2 λi 2 |bi | ≤ 1 λ−1 b2 ) 2 . [sent-187, score-0.125]

69 Convergence to the infimum is obtained when i i 1 → 0 for all i, j ∈ INd such that bi , bj = 0. [sent-188, score-0.148]

70 Hence λi → infimum is attained when bi = 0 for all i ∈ INd . [sent-189, score-0.125]

71 2 to obtain that d inf{trace(W U Diag(λ)−1 U W ) : λ ∈ IRd , ++ d λi ≤ 1} = U W 2 2,1 i=1 = W ui 2 2 . [sent-196, score-0.431]

72 To see this, note that 1 1 trace(W W ui ui ) = trace(C 2 ui ui ui ui C 2 ) trace(ui ui ui ui ) 1 1 1 ≥ (trace(C 2 ui ui ui ui ))2 = trace(C 2 ui ui ) = ui C 2 ui 1 since ui ui ui ui = ui ui . [sent-198, score-9.913]

73 The equality is verified if and only if C 2 ui ui = aui ui which implies 1 1 that C 2 ui = aui , that is, if ui is an eigenvector of C. [sent-199, score-2.253]

74 2) is simply the sum of the singular values of W and is sometimes called the trace norm. [sent-202, score-0.291]

75 As shown in [10], the trace norm is the convex envelope of rank(W ) in the unit ball, which gives another interpretation of the relationship between the rank and γ in our experiments. [sent-203, score-0.476]

76 18 15 16 14 10 12 10 8 5 6 4 2 −4 10 −2 10 0 0 0 10 10 1 10 Figure 1: Number of features learned versus the regularization parameter γ (see text for description). [sent-206, score-0.25]

77 However, since the trace norm is nonsmooth, we have opted for the above alternating minimization strategy which is simple to implement and has a natural interpretation. [sent-207, score-0.373]

78 Indeed, Algorithm 1 alternately performs a supervised and an unsupervised step, where in the latter step we learn common representations across the tasks and in the former step we learn task-specific functions using these representations. [sent-208, score-0.447]

79 We created synthetic data sets by generating T = 200 task parameters wt from a 5-dimensional Gaussian distribution with zero mean and covariance equal to Diag(1, 0. [sent-229, score-0.301]

80 The outputs yti were computed from the wt and xti as yti = wt , xti + ν, where ν is zero-mean Gaussian noise with standard deviation equal to 0. [sent-237, score-1.414]

81 We first present, in Figure 1, the number of features learned by our algorithm, as measured by rank(D). [sent-239, score-0.164]

82 The plot on the left corresponds to a data set of 200 tasks with 25 input dimensions and that on the right to a real data set of 180 tasks described in the next subsection. [sent-240, score-0.314]

83 Figure 2 depicts the performance of our algorithm for T = 10, 25, 100 and 200 tasks along with the performance of 200 independent standard ridge regressions on the data. [sent-242, score-0.208]

84 [4]), learning multiple tasks together significantly improves on learning the tasks independently. [sent-246, score-0.32]

85 Moreover, the performance of the algorithm improves when more tasks are available. [sent-247, score-0.159]

86 04 5 10 15 20 25 5 10 15 20 25 Figure 2: Test error (left) and residual of learned features (right) vs. [sent-261, score-0.189]

87 number of tasks (left) for the computer survey data set. [sent-289, score-0.155]

88 Significance of features (middle) and attributes learned by the most important feature (right). [sent-290, score-0.203]

89 On the right, we have plotted a residual measure of how well the learned features approximate the actual ones used to generate the data. [sent-291, score-0.189]

90 We observe that adding more tasks leads to better estimates of the underlying features. [sent-293, score-0.131]

91 Here the persons correspond to tasks and the PC models to examples. [sent-297, score-0.167]

92 Following [13], we used 4 examples per task as the test data and 8 examples per task as the training data. [sent-301, score-0.138]

93 6 Conclusion We have presented an algorithm which learns common sparse function representations across a pool of related tasks. [sent-316, score-0.196]

94 To our knowledge, our approach provides the first convex optimization formulation for multi-task feature learning. [sent-317, score-0.192]

95 Our algorithm shares some similarities with recent work in [2] where they also alternately update the task parameters and the features. [sent-319, score-0.137]

96 Two main differences are that their formulation is not convex and that, in our formulation, the number of learned features is not a parameter but it is controlled by the regularization parameter. [sent-320, score-0.401]

97 For example, it would be interesting to explore whether our formulation can be extended to more general models for the structure across the tasks, like in [20] where ICA type features are learned, or to hierarchical feature models like in [18]. [sent-322, score-0.24]

98 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-337, score-0.161]

99 A convex optimization approach to modeling consumer heterogeneity in conjoint estimation. [sent-378, score-0.255]

100 Hierarchical Bayes conjoint analysis: recovery of partworth heterogeneity from reduced experimental designs. [sent-407, score-0.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ird', 0.461), ('ui', 0.431), ('trace', 0.268), ('yti', 0.267), ('xti', 0.232), ('wt', 0.208), ('ir', 0.161), ('tasks', 0.131), ('bi', 0.125), ('ait', 0.121), ('sd', 0.113), ('od', 0.105), ('features', 0.1), ('diag', 0.1), ('conjoint', 0.097), ('ind', 0.097), ('micchelli', 0.097), ('regularization', 0.086), ('convex', 0.085), ('evgeniou', 0.077), ('ai', 0.07), ('ft', 0.07), ('task', 0.069), ('rank', 0.069), ('alternately', 0.068), ('across', 0.064), ('learned', 0.064), ('prescribed', 0.058), ('int', 0.057), ('norm', 0.054), ('ram', 0.054), ('relatedness', 0.054), ('nonzero', 0.053), ('cpu', 0.049), ('aui', 0.049), ('insead', 0.049), ('matrix', 0.048), ('london', 0.046), ('learn', 0.045), ('common', 0.043), ('bd', 0.042), ('nonsmooth', 0.042), ('heterogeneity', 0.042), ('ridge', 0.041), ('learns', 0.04), ('feature', 0.039), ('av', 0.038), ('te', 0.038), ('gower', 0.038), ('hd', 0.038), ('hi', 0.038), ('formulation', 0.037), ('regressions', 0.036), ('persons', 0.036), ('pontil', 0.036), ('sc', 0.034), ('cd', 0.034), ('mum', 0.032), ('co', 0.032), ('shared', 0.032), ('range', 0.032), ('optimization', 0.031), ('object', 0.031), ('gu', 0.031), ('multi', 0.031), ('multiple', 0.03), ('sw', 0.03), ('wa', 0.03), ('people', 0.029), ('controlled', 0.029), ('id', 0.028), ('users', 0.028), ('problem', 0.028), ('lemma', 0.028), ('improves', 0.028), ('wish', 0.028), ('alternating', 0.027), ('street', 0.027), ('functions', 0.026), ('wi', 0.026), ('real', 0.026), ('college', 0.026), ('purpose', 0.026), ('dimensions', 0.026), ('columns', 0.025), ('min', 0.025), ('residual', 0.025), ('representations', 0.025), ('survey', 0.024), ('minimization', 0.024), ('synthetic', 0.024), ('related', 0.024), ('irrelevant', 0.024), ('share', 0.023), ('equivalent', 0.023), ('simply', 0.023), ('bj', 0.023), ('corollary', 0.023), ('inf', 0.023), ('rows', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 138 nips-2006-Multi-Task Feature Learning

Author: Andreas Argyriou, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We present a method for learning a low-dimensional representation which is shared across a set of multiple related tasks. The method builds upon the wellknown 1-norm regularization problem using a new regularizer which controls the number of learned features common for all the tasks. We show that this problem is equivalent to a convex optimization problem and develop an iterative algorithm for solving it. The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the latter step we learn commonacross-tasks representations and in the former step we learn task-specific functions using these representations. We report experiments on a simulated and a real data set which demonstrate that the proposed method dramatically improves the performance relative to learning each task independently. Our algorithm can also be used, as a special case, to simply select – not learn – a few common features across the tasks. 1

2 0.15670447 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

3 0.12824599 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

Author: Manfred K. Warmuth, Dima Kuzmin

Abstract: We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances. 1

4 0.10766357 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

5 0.099290125 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

Author: Steffen Bickel, Tobias Scheffer

Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user’s personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single (“one size fits all”) filter as well as independent filters for all users. 1

6 0.099010713 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction

7 0.092189915 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

8 0.081765071 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

9 0.08130949 163 nips-2006-Prediction on a Graph with a Perceptron

10 0.064272992 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

11 0.061949302 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

12 0.058867991 186 nips-2006-Support Vector Machines on a Budget

13 0.057698146 130 nips-2006-Max-margin classification of incomplete data

14 0.056840554 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

15 0.049633611 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

16 0.049493894 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

17 0.049071774 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons

18 0.047817506 75 nips-2006-Efficient sparse coding algorithms

19 0.047442406 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

20 0.047336582 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.192), (1, 0.038), (2, -0.035), (3, 0.019), (4, -0.05), (5, 0.029), (6, -0.017), (7, -0.039), (8, -0.028), (9, 0.006), (10, -0.081), (11, 0.022), (12, 0.113), (13, -0.102), (14, 0.004), (15, -0.064), (16, 0.026), (17, 0.042), (18, -0.011), (19, 0.02), (20, 0.017), (21, -0.034), (22, 0.045), (23, 0.084), (24, -0.086), (25, 0.034), (26, -0.002), (27, -0.023), (28, 0.261), (29, 0.049), (30, -0.107), (31, 0.089), (32, 0.006), (33, 0.092), (34, -0.062), (35, -0.007), (36, -0.203), (37, -0.106), (38, -0.021), (39, -0.058), (40, 0.126), (41, 0.097), (42, -0.033), (43, -0.008), (44, -0.003), (45, 0.198), (46, -0.051), (47, -0.031), (48, 0.15), (49, 0.188)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93857729 138 nips-2006-Multi-Task Feature Learning

Author: Andreas Argyriou, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We present a method for learning a low-dimensional representation which is shared across a set of multiple related tasks. The method builds upon the wellknown 1-norm regularization problem using a new regularizer which controls the number of learned features common for all the tasks. We show that this problem is equivalent to a convex optimization problem and develop an iterative algorithm for solving it. The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the latter step we learn commonacross-tasks representations and in the former step we learn task-specific functions using these representations. We report experiments on a simulated and a real data set which demonstrate that the proposed method dramatically improves the performance relative to learning each task independently. Our algorithm can also be used, as a special case, to simply select – not learn – a few common features across the tasks. 1

2 0.62103194 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples

Author: Steffen Bickel, Tobias Scheffer

Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user’s personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single (“one size fits all”) filter as well as independent filters for all users. 1

3 0.55934638 183 nips-2006-Stochastic Relational Models for Discriminative Link Prediction

Author: Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, Zhao Xu

Abstract: We introduce a Gaussian process (GP) framework, stochastic relational models (SRM), for learning social, physical, and other relational phenomena where interactions between entities are observed. The key idea is to model the stochastic structure of entity relationships (i.e., links) via a tensor interaction of multiple GPs, each defined on one type of entities. These models in fact define a set of nonparametric priors on infinite dimensional tensor matrices, where each element represents a relationship between a tuple of entities. By maximizing the marginalized likelihood, information is exchanged between the participating GPs through the entire relational network, so that the dependency structure of links is messaged to the dependency of entities, reflected by the adapted GP kernels. The framework offers a discriminative approach to link prediction, namely, predicting the existences, strengths, or types of relationships based on the partially observed linkage network as well as the attributes of entities (if given). We discuss properties and variants of SRM and derive an efficient learning algorithm. Very encouraging experimental results are achieved on a toy problem and a user-movie preference link prediction task. In the end we discuss extensions of SRM to general relational learning tasks. 1

4 0.52345854 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

Author: Manfred K. Warmuth, Dima Kuzmin

Abstract: We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in dimension of the subspace but logarithmically in the dimension of the instances. 1

5 0.39036822 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

6 0.3867887 61 nips-2006-Convex Repeated Games and Fenchel Duality

7 0.38184613 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

8 0.37297449 65 nips-2006-Denoising and Dimension Reduction in Feature Space

9 0.37085387 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

10 0.35809869 149 nips-2006-Nonnegative Sparse PCA

11 0.3319377 124 nips-2006-Linearly-solvable Markov decision problems

12 0.32760227 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

13 0.32471654 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods

14 0.31275937 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis

15 0.30991283 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors

16 0.30327338 33 nips-2006-Analysis of Representations for Domain Adaptation

17 0.29606095 130 nips-2006-Max-margin classification of incomplete data

18 0.29022723 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

19 0.28598824 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering

20 0.28362808 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.145), (3, 0.022), (7, 0.069), (9, 0.043), (12, 0.012), (20, 0.019), (22, 0.09), (42, 0.273), (44, 0.055), (57, 0.079), (65, 0.038), (69, 0.03), (71, 0.022), (90, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77077955 138 nips-2006-Multi-Task Feature Learning

Author: Andreas Argyriou, Theodoros Evgeniou, Massimiliano Pontil

Abstract: We present a method for learning a low-dimensional representation which is shared across a set of multiple related tasks. The method builds upon the wellknown 1-norm regularization problem using a new regularizer which controls the number of learned features common for all the tasks. We show that this problem is equivalent to a convex optimization problem and develop an iterative algorithm for solving it. The algorithm has a simple interpretation: it alternately performs a supervised and an unsupervised step, where in the latter step we learn commonacross-tasks representations and in the former step we learn task-specific functions using these representations. We report experiments on a simulated and a real data set which demonstrate that the proposed method dramatically improves the performance relative to learning each task independently. Our algorithm can also be used, as a special case, to simply select – not learn – a few common features across the tasks. 1

2 0.66438305 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization

Author: Frank Wood, Thomas L. Griffiths

Abstract: Many unsupervised learning problems can be expressed as a form of matrix factorization, reconstructing an observed data matrix as the product of two matrices of latent variables. A standard challenge in solving these problems is determining the dimensionality of the latent matrices. Nonparametric Bayesian matrix factorization is one way of dealing with this challenge, yielding a posterior distribution over possible factorizations of unbounded dimensionality. A drawback to this approach is that posterior estimation is typically done using Gibbs sampling, which can be slow for large problems and when conjugate priors cannot be used. As an alternative, we present a particle filter for posterior estimation in nonparametric Bayesian matrix factorization models. We illustrate this approach with two matrix factorization models and show favorable performance relative to Gibbs sampling.

3 0.61380023 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo

Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1

4 0.61138922 65 nips-2006-Denoising and Dimension Reduction in Feature Space

Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann

Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1

5 0.61079103 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1

6 0.60900706 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

7 0.60805082 20 nips-2006-Active learning for misspecified generalized linear models

8 0.60797 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

9 0.60686266 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

10 0.60190552 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

11 0.60138148 61 nips-2006-Convex Repeated Games and Fenchel Duality

12 0.60128564 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

13 0.60105968 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

14 0.60096443 131 nips-2006-Mixture Regression for Covariate Shift

15 0.60006768 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

16 0.59944934 175 nips-2006-Simplifying Mixture Models through Function Approximation

17 0.59721082 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming

18 0.59696895 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

19 0.59401786 169 nips-2006-Relational Learning with Gaussian Processes

20 0.5928086 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model