jmlr jmlr2007 jmlr2007-64 knowledge-graph by maker-knowledge-mining

64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss


Source: pdf

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. [sent-11, score-0.348]

2 Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. [sent-12, score-0.433]

3 We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. [sent-13, score-0.454]

4 Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron. [sent-15, score-0.83]

5 In this paper, we discuss the multitask learning problem in the online learning context, and focus on the possibility that the learning tasks contribute toward a common goal. [sent-17, score-0.634]

6 In the online multitask classification setting, we are faced with k separate online binary classification problems, which are presented to us in parallel. [sent-21, score-0.781]

7 loss values into a single number, and define the global loss attained on round t to be L(ℓ_t). [sent-31, score-0.446]

8 A norm is said to be absolute if ‖v‖ = ‖ |v| ‖ for all v, where |v| is obtained by replacing each component of v with its absolute value. [sent-42, score-0.396]

9 Note that both the L1 norm and the L∞ norm are special cases of the r-max norm, as well as being p-norms. [sent-51, score-0.512]

10 Actually, the r-max norm can be viewed as a smooth interpolation between the L1 norm and the L∞ norm, using Peetre’s K-method of norm interpolation (see Appendix A for details). [sent-52, score-0.858]
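
As a concrete illustration (a minimal sketch written for this summary, not code from the paper), the r-max norm is the sum of the r largest absolute values of a vector's components, which is exactly the definition that makes r = 1 the L∞ norm and r = k the L1 norm:

    import numpy as np

    def r_max_norm(v, r):
        # Sum of the r largest absolute values of v's components.
        a = np.sort(np.abs(v))[::-1]
        return float(a[:r].sum())

    v = np.array([0.5, -2.0, 1.0, 0.25])
    print(r_max_norm(v, 1))        # 2.0  == max_j |v_j|  (L-infinity norm)
    print(r_max_norm(v, v.size))   # 3.75 == sum_j |v_j|  (L1 norm)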

11 Since the global loss functions we consider in this paper are norms, the global loss equals zero only if ℓ_t is itself the zero vector. [sent-53, score-0.364]

12 Therefore, the simplest solution to our multitask problem is to learn each task individually, and minimize the global loss function implicitly. [sent-55, score-0.526]

13 In this sense, a single binary mistake on round t is as bad as many binary mistakes on round t. [sent-69, score-0.433]

14 Therefore, we should only care about the worst binary prediction on round t, and we can do so by choosing the global loss to be ‖ℓ_t‖_∞. [sent-70, score-0.388]

15 Another example where the L∞ norm comes in handy is the case where we are faced with a multiclass problem where the number of labels is huge. [sent-71, score-0.438]

16 As before, a single binary mistake is devastating to the multiclass classifier, and the L∞ norm is the most appropriate means of combining the k individual losses into a global loss. [sent-76, score-0.609]

17 The reduction is represented by a code matrix M ∈ {−1, +1}s×k , where s is the number of multiclass labels and k is the number of binary problems used to encode the original multiclass problem. [sent-84, score-0.377]

18 If the binary classifiers are trained in the online multitask setting, we should only be interested in whether the r’th largest loss is less than 1, which would imply that a correct multiclass prediction can be guaranteed. [sent-99, score-0.843]
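
To make the reduction concrete, here is a minimal decoding sketch (Hamming-style decoding is assumed purely for illustration; the margin-based discussion of the r-th largest loss above suggests the paper's own analysis may rely on a different, loss-based decoder): the predicted multiclass label is the row of the code matrix that agrees with the most binary predictions.

    import numpy as np

    def ecoc_predict(M, margins):
        # M: s x k code matrix over {-1,+1}; margins: real-valued outputs of the
        # k binary classifiers. Return the label whose code row agrees most often.
        signs = np.sign(margins)
        agreements = (M * signs > 0).sum(axis=1)
        return int(np.argmax(agreements))

    M = np.array([[+1, +1, -1],
                  [-1, +1, +1],
                  [+1, -1, +1]])
    print(ecoc_predict(M, np.array([0.9, -0.2, 0.4])))   # -> 2 (third code row)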

19 In this paper, we present three families of online multitask algorithms. [sent-102, score-0.555]

20 The first two families are multitask extensions of the Perceptron algorithm (Rosenblatt, 1958; Novikoff, 1962), while the third family is closely related to the Passive-Aggressive classification algorithm (Crammer et al. [sent-106, score-0.426]

21 For each algorithm, we prove a relative loss bound, namely, we show that the cumulative global loss attained by the algorithm is comparable to the cumulative loss attained by any fixed set of k linear hypotheses, even defined in hindsight. [sent-109, score-0.495]

22 Much previous work on theoretical and applied multitask learning has focused on how to take advantage of similarities between the various tasks (Caruana, 1997; Heskes, 1998; Evgeniou et al. [sent-110, score-0.462]

23 An analysis of the L∞ norm of prediction errors is implicit in some past work of Crammer and Singer (2001, 2003). [sent-118, score-0.412]

24 Our setting generalizes such approaches for ranking learning by employing a shared loss which is defined through a norm over the individual pair-based losses. [sent-127, score-0.371]

25 The third family of algorithms requires solving a small optimization problem on each online round, and is therefore called the implicit update family of algorithms. [sent-133, score-0.533]

26 • update w_{t+1,j} = w_{t,j} + τ_{t,j} y_{t,j} x_{t,j} [1 ≤ j ≤ k]. Figure 1: A general skeleton for an online multitask classification algorithm. [sent-152, score-0.714]
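
A minimal Python sketch of this skeleton (not the authors' code; the hinge loss for the individual tasks and the interface of the set_tau rule are assumptions made here for illustration):

    import numpy as np

    def multitask_round(W, X, y, set_tau):
        # W: list of k weight vectors; X[j], y[j]: instance and label of task j.
        k = len(W)
        margins = np.array([y[j] * (W[j] @ X[j]) for j in range(k)])
        losses = np.maximum(0.0, 1.0 - margins)   # individual (hinge) losses
        tau = set_tau(losses, X)                  # the rule that defines the algorithm
        for j in range(k):
            W[j] = W[j] + tau[j] * y[j] * X[j]    # the update step from Figure 1
        return W, losses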

27 efficient algorithms for solving the implicit update in the case where the global loss is defined by the L2 norm or the r-max norm. [sent-154, score-0.7]

28 Online Multitask Learning with Additive Updates. We begin by presenting the online multitask classification setting more formally. [sent-157, score-0.529]

29 Any norm ‖·‖ defined on R^n has a dual norm, also defined on R^n, denoted by ‖·‖_* and given by ‖u‖_* = max_{v ∈ R^n} (u·v)/‖v‖ = max_{v ∈ R^n : ‖v‖ = 1} u·v. [sent-194, score-0.392]

30 Two additional properties which we rely on are that the dual of the dual norm is the original norm (see for instance Horn and Johnson, 1985), and that the dual of an absolute norm is also an absolute norm. [sent-200, score-1.124]
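
A small numerical illustration of this definition (a Monte-Carlo lower estimate written for this summary, not part of the paper): the dual of the L1 norm behaves like the L∞ norm and vice versa, which is the standard duality pair used below.

    import numpy as np

    def dual_norm_estimate(norm, u, trials=50000, seed=0):
        # Lower estimate of ||u||_* = max_{||v|| <= 1} u.v via random directions.
        rng = np.random.default_rng(seed)
        best = 0.0
        for _ in range(trials):
            v = rng.standard_normal(u.shape)
            v /= norm(v)                 # rescale onto the unit sphere of ||.||
            best = max(best, abs(float(u @ v)))
        return best

    u = np.array([3.0, -1.0, 2.0])
    print(dual_norm_estimate(lambda v: np.abs(v).sum(), u))   # ~3.0 = max_j |u_j|
    print(dual_norm_estimate(lambda v: np.abs(v).max(), u))   # ~6.0 = sum_j |u_j|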

31 As previously mentioned, to obtain concrete online algorithms, all that remains is to define the update weights τt, j for each task on each round. [sent-201, score-0.35]

32 The Finite-Horizon Multitask Perceptron. In this section, we present our first family of online multitask classification algorithms, and prove a relative loss bound for the members of this family. [sent-240, score-0.711]

33 These algorithms are finite-horizon online algorithms, meaning that the number of online rounds, T , is known in advance and is given as a parameter to the algorithm. [sent-242, score-0.344]

34 Given an absolute norm ‖·‖ and its dual ‖·‖_*, the multitask Perceptron sets τ_{t,j} in Figure 1 to τ_t = argmax_{τ : ‖τ‖_* ≤ C} τ·ℓ_t (Equation 9), where C > 0 is a constant which is specified later in this section. [sent-245, score-0.796]

35 When the global loss function is an r-max norm and π is a permutation such that ℓ_{t,π(1)} ≥ ⋯ ≥ ℓ_{t,π(k)}, the following definition of τ_t is a solution to Equation (9): τ_{t,j} = C if ℓ_{t,j} > 0 and j ∈ {π(1), . . . , π(r)}, and τ_{t,j} = 0 otherwise.

36 Figure 2 (the panels depict the unit balls of the L2, L3, L1, and L∞ norms): The remoteness of a norm is the longest Euclidean length of any vector contained in the norm’s unit ball. [sent-266, score-0.143]

37 Note that when r = k, the r-max norm reduces to the L1 norm and the above becomes the well-known update rule of the Perceptron algorithm (Rosenblatt, 1958; Novikoff, 1962). [sent-268, score-0.662]
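
For the r-max global loss this rule is easy to compute directly; a minimal sketch (the helper name is ours), which with r and C fixed can serve as the set_tau rule of the skeleton sketched after sentence 26:

    import numpy as np

    def rmax_perceptron_tau(losses, r, C):
        # Weight C for tasks whose loss is positive and among the r largest; else 0.
        tau = np.zeros_like(losses)
        for j in np.argsort(losses)[::-1][:r]:
            if losses[j] > 0:
                tau[j] = C
        return tau

    losses = np.array([0.0, 1.4, 0.3, 2.0])
    print(rmax_perceptron_tau(losses, r=1, C=0.1))   # only the single worst task moves
    print(rmax_perceptron_tau(losses, r=4, C=0.1))   # every task with positive loss moves
    # e.g. plug into the earlier skeleton: set_tau = lambda l, X: rmax_perceptron_tau(l, r=1, C=0.1)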

38 Before proving a loss bound for the multitask Perceptron, we must introduce another important quantity. [sent-270, score-0.496]

39 This quantity is the remoteness of a norm ‖·‖ defined on R^k, and is defined to be ρ(‖·‖, k) = max_{u ∈ R^k} ‖u‖_2/‖u‖ = max_{u ∈ R^k : ‖u‖ ≤ 1} ‖u‖_2. [sent-271, score-0.463]

40 As we show below, the remoteness of the dual norm, ρ(‖·‖_*, k), plays an important role in determining the difficulty of using ‖·‖ as the global loss function. [sent-274, score-0.384]
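
For p-norms the remoteness has a simple closed form (added here for illustration, and consistent with the two-dimensional unit balls of Figure 2):

    ρ(‖·‖_p, k) = max_{‖u‖_p ≤ 1} ‖u‖_2 = k^{1/2 − 1/p}  for p ≥ 2,    and    ρ(‖·‖_p, k) = 1  for 1 ≤ p ≤ 2.

For k = 2 this gives 1 for the L1 and L2 norms, 2^{1/6} for the L3 norm, and √2 for the L∞ norm.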

41 Moreover, since both the L1 norm and the L∞ norm are absolute norms, we can also assume that u resides in the non-negative orthant. [sent-308, score-0.582]

42 If we present this sequence to the finite-horizon multitask Perceptron with the norm ‖·‖ and the aggressiveness parameter C, then ∑_{t=1}^{T} ‖ℓ_t‖ ≤ (1/(2C)) ∑_{j=1}^{k} ‖w_j‖_2^2 + ∑_{t=1}^{T} ‖ℓ_t^⋆‖ + T R^2 C ρ^2(‖·‖_*, k), where w_1, . . . , w_k are any fixed hypotheses and ℓ_t^⋆ is the vector of their individual losses on round t. [sent-327, score-0.704]

43 This corollary bounds the global loss cumulated by our algorithm with the global loss obtained by any fixed set of hypotheses, plus a term which grows sub-linearly in T . [sent-342, score-0.338]

44 Dividing both sides of the inequality in Corollary 5 by T , we see that the average global loss suffered by the multitask Perceptron is upper bounded by the average global loss of the best fixed hypothesis ensemble plus a term that diminishes with T . [sent-346, score-0.8]

45 Using game-theoretic terminology, we can now say that the multitask Perceptron exhibits no-regret with respect to any global loss function defined by an absolute norm. [sent-347, score-0.596]
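
A sketch of the calculation behind this remark, using the bound of Corollary 5 as reconstructed above (the choice of C below is an illustrative assumption, not a prescription from the paper): dividing by T gives

    (1/T) ∑_{t=1}^{T} ‖ℓ_t‖ ≤ (1/T) ∑_{t=1}^{T} ‖ℓ_t^⋆‖ + (1/(2CT)) ∑_{j=1}^{k} ‖w_j‖_2^2 + R^2 C ρ^2(‖·‖_*, k),

so choosing C proportional to 1/√T makes both trailing terms O(1/√T), which is the vanishing average regret described here.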

46 Moreover, when an update is performed, the algorithm defines w_{t+1} = w_t + C y_t x_t, where C is a predefined constant. [sent-360, score-0.807]

47 If we were to use the simplest version of the Perceptron, which updates its hypothesis only when a prediction mistake occurs, then finding a counter-example that achieves Equation (15) would be trivial, without even using the distinction between single-task and multitask Perceptron learning. [sent-362, score-0.449]

48 This concludes the presentation of the counter-example, thus showing that a set of independent single-task Perceptrons does not attain no-regret with respect to the L∞ norm global loss. [sent-404, score-0.336]

49 The exception is the L1 norm, which naturally reduces the multitask Perceptron to k independent single-task Perceptrons. [sent-406, score-0.357]

50 The first term in this lower bound is proportional (with factor 2C) to the global loss ‖ℓ_t‖ suffered on round t, and the second term is a constant. [sent-414, score-0.391]

51 When the global loss function is an r-max norm and π is a permutation such that ℓ_{t,π(1)} ≥ ⋯ ≥ ℓ_{t,π(k)}. [sent-427, score-0.425]

52 If we present this sequence to the explicit multitask algorithm with the norm ‖·‖ and the aggressiveness parameter C, then for every T, (1/(R^2 ρ^2)) ∑_{t ≤ T : ‖ℓ_t‖ ≤ R^2 C ρ^2} ‖ℓ_t‖^2 + C ∑_{t ≤ T : ‖ℓ_t‖ > R^2 C ρ^2} ‖ℓ_t‖ ≤ 2C ∑_{t=1}^{T} ‖ℓ_t^⋆‖ + ∑_{j=1}^{k} ‖w_j‖_2^2. [sent-448, score-0.704]

53 The Implicit Online Multitask Update. We now discuss a third family of online multitask algorithms, which leads to the strongest loss bounds of the three families of algorithms presented in this paper. [sent-473, score-0.687]

54 In contrast to the closed form updates of the previous algorithms, the algorithms in this family require solving an optimization problem on every round, and are therefore called implicit update algorithms. [sent-474, score-0.353]

55 Although the implementation of specific members of this family may be more involved than the implementation of the multitask Perceptron, we recommend using this family of algorithms in practice. [sent-475, score-0.443]

56 Then the online update defined in Equation (23) is equivalent to setting w_{t+1,j} = w_{t,j} + τ_{t,j} y_{t,j} x_{t,j} for all 1 ≤ j ≤ k, where τ_t = argmax_τ ∑_{j=1}^{k} ( 2 τ_j ℓ_{t,j} − τ_j^2 ‖x_{t,j}‖_2^2 ) s.t. [sent-516, score-0.363]
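
The constraint on τ is truncated in this extract. Purely as an illustration, the small per-round problem can be solved numerically once a constraint is fixed; the sketch below assumes a non-negative τ inside the dual-norm ball ‖τ‖_1 ≤ C (the ball matching an L∞ global loss), and the helper name and the use of SciPy are choices made here, not the authors'.

    import numpy as np
    from scipy.optimize import minimize

    def implicit_tau(losses, xs, C):
        # Maximize sum_j (2*tau_j*loss_j - tau_j**2 * ||x_j||^2) over the assumed
        # feasible set {tau >= 0, sum_j tau_j <= C}.
        sq = np.array([x @ x for x in xs])
        objective = lambda tau: -(2.0 * (tau @ losses) - (tau ** 2) @ sq)
        constraint = {"type": "ineq", "fun": lambda tau: C - np.sum(tau)}
        bounds = [(0.0, None)] * len(losses)
        res = minimize(objective, x0=np.zeros_like(losses),
                       bounds=bounds, constraints=[constraint])
        return res.x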

57 Lemma 5 proves that the implicit update essentially finds the value of τ_t that maximizes the left-hand side of the bound in Lemma 1. [sent-598, score-0.347]

58 An immediate consequence of this observation is that the loss bounds of the multitask Perceptron also hold for the implicit algorithm. [sent-601, score-0.571]

59 More precisely, the bound in Theorem 4 (and Corollary 5) holds not only for the multitask Perceptron, but also for the implicit update algorithm. [sent-602, score-0.704]

60 Equivalently, it can be shown that the bound in Theorem 6 (and Corollary 7) also holds for the implicit update algorithm. [sent-603, score-0.347]

61 Theorem 9 The bound in Theorem 4 also holds for the implicit update algorithm. [sent-605, score-0.347]

62 Proof Let τ_{t,j} denote the weights defined by the multitask Perceptron in Equation (9) and let τ′_{t,j} denote the weights assigned by the implicit update algorithm. [sent-606, score-0.632]

63 Solving the Implicit Update for r-max Norms. We now present an efficient procedure for calculating the update in Equation (23), in the case where the norm being used is the r-max norm. [sent-634, score-0.406]

64 If the loss is moderate then the size of the update step is proportional to the loss attained, and inversely proportional to the squared norm of the respective instance. [sent-662, score-0.639]
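
A single-task sketch of this behaviour (passive-aggressive style; the cap at C for large losses is an assumption based on the aggressiveness parameter used throughout, and in the r-max case the per-task caps come from the shared constraint rather than one fixed constant):

    import numpy as np

    def capped_step(w, x, y, C):
        # Step proportional to the loss, inversely proportional to ||x||^2, capped by C.
        loss = max(0.0, 1.0 - y * (w @ x))
        tau = min(C, loss / (x @ x))
        return w + tau * y * x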

65 Experiments with Text Classification. In this section, we demonstrate the effectiveness of the implicit multitask algorithm on large-scale text categorization problems. [sent-714, score-0.509]

66 The third experiment demonstrates the superiority of the implicit update algorithm, presented in Section 5, over the multitask Perceptron, presented in Sections 3 and 4. [sent-717, score-0.632]

67 We ran our algorithm using both the L1 norm and the L∞ norm as the global loss function. [sent-740, score-0.681]

68 Figure 4: The ∞-error (left) and 1-error (right) error-rates attained by the implicit multitask algorithm using the L∞ norm (solid) and the L1 norm (dashed) global loss functions. [sent-757, score-1.443]

69 Therefore, the L∞ norm update seems to be a more appropriate choice for minimizing the ∞-error, while the L1 norm update is the more appropriate choice for minimizing the 1-error. [sent-769, score-0.812]

70 The left-hand plot in the figure shows the ∞-error-rate of the L∞ norm and L1 norm multitask updates, as the number of examples grows from zero to 800K. [sent-772, score-0.869]

71 The figure clearly shows that the L∞ norm algorithm does a better job throughout the entire online learning process. [sent-773, score-0.428]

72 In this case, the L∞ norm update initially takes the lead, but is quickly surpassed by the L1 norm update. [sent-776, score-0.662]

73 The fact that the L1 norm update ultimately gains the advantage coincides with our initial intuition. [sent-777, score-0.406]

74 The reason why the L∞ norm update outperforms the L1 norm update at first can also be easily explained. [sent-778, score-0.812]

75 The L1 norm update is quite aggressive, as it modifies every binary classifier that suffers a positive individual loss on every round. [sent-779, score-0.567]

76 Moreover, the L1 norm update enforces the constraint ‖τ_t‖_∞ ≤ C. [sent-780, score-0.406]

77 On the other hand, the L∞ norm update is more cautious, since it enforces the stricter constraint ‖τ_t‖_1 ≤ C. [sent-781, score-0.406]

78 The aggression of the L1 norm update causes its initial behavior to be somewhat erratic. [sent-782, score-0.406]

79 Figure 5: The multiclass error rate of the online ECOC-based classifier, using a 15-column code matrix, with various r-max norms, after observing 10K, 100K, and 600K examples. [sent-796, score-0.382]

80 We used the r-max norm as the algorithm’s global loss function, with r set to every integer value between 1 and 15. [sent-811, score-0.425]

81 Figure 6: The ∞-error (left) and 1-error (right) attained by the multitask Perceptron (dashed) and the implicit update algorithm (solid) when using the L∞ norm as a global loss function. [sent-825, score-1.337]

82 using either the L1 norm (r = 15) or the L∞ norm (r = 1) is suboptimal, and the best performance is consistently reached by setting r to be slightly smaller than half the code distance. [sent-826, score-0.547]

83 From a loss minimization standpoint, Theorem 9 proves that the implicit update, presented in Section 5, is at least as good as the multitask Perceptron variants, presented in Sections 3 and 4. [sent-833, score-0.571]

84 We repeated the multitask multi-label experiment described in Section 8.1. [sent-836, score-0.357]

85 using the multitask Perceptron in place of the implicit update algorithm. [sent-837, score-0.632]

86 The infinite horizon extension discussed in Section 4 does not have a significant effect on empirical performance, so we consider only the finite horizon version of the multitask Perceptron, described in Section 3. [sent-838, score-0.489]

87 When the global loss function is defined using the L1 norm, both the implicit update and the multitask Perceptron update decouple to independent updates for each individual task. [sent-839, score-1.012]

88 A comparison between the performance of the implicit update and the multitask Perceptron update, both using the L∞ -norm loss, is given in Figure 6. [sent-842, score-0.632]

89 The implicit algorithm holds a clear lead over the multitask Perceptron with respect to both error measures, throughout the learning process. [sent-844, score-0.504]

90 We showed that, in the worst case, the finite horizon multitask Perceptron of Section 3 and the implicit update algorithm of Section 5 both perform asymptotically as well as the best fixed hypothesis ensemble. [sent-855, score-0.698]

91 The same cannot be said for the naive alternative, where we use multiple independent single-task learning algorithms to solve the multitask problem. [sent-857, score-0.386]

92 We also demonstrated the benefit of the multitask approach over the naive alternative on two large-scale text categorization problems. [sent-858, score-0.384]

93 This setting induces a natural online multitask learning problem. [sent-868, score-0.529]

94 We consider the tasks of those customers that are not online to be dormant or inactive tasks. [sent-870, score-0.329]

95 We simply need to define ℓ_{t,j} = 0 for every inactive task and apply the multitask update verbatim. [sent-873, score-0.537]

96 Therefore, we can imagine that the length of the vector ℓ_t changes from round to round, and that the update on each round is applied as if the tasks that are sleeping on that round never existed in the first place. [sent-876, score-0.681]
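
A one-line illustration of this convention (sketch): zero out the individual losses of the dormant tasks before computing the shared update weights, and leave the rest of the round unchanged.

    import numpy as np

    losses = np.array([0.7, 0.5, 1.3, 0.4])
    active = np.array([True, False, True, False])   # the second and fourth tasks are asleep
    masked = np.where(active, losses, 0.0)          # -> [0.7, 0.0, 1.3, 0.0]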

97 Specifically, a multitask implicit update can be derived for regression and uniclass problems using ideas from Crammer et al. [sent-878, score-0.632]

98 Another interesting research direction would be to return to the roots of statistical multitask learning, and to try to model generative similarities between the multiple tasks within the online framework. [sent-883, score-0.663]

99 We do not present the K-method in all its generality, but rather focus only on topics which are relevant to the online multitask learning setting. [sent-891, score-0.529]

100 Solving a huge number of similar tasks: A combination of multitask learning and a hierarchical Bayesian approach. [sent-1022, score-0.357]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xt', 0.461), ('multitask', 0.357), ('norm', 0.256), ('wt', 0.196), ('ekel', 0.176), ('yt', 0.175), ('online', 0.172), ('hared', 0.165), ('ultiple', 0.165), ('perceptron', 0.153), ('update', 0.15), ('inger', 0.149), ('multiclass', 0.148), ('remoteness', 0.143), ('round', 0.142), ('nline', 0.139), ('ong', 0.133), ('implicit', 0.125), ('norms', 0.124), ('oss', 0.114), ('tasks', 0.105), ('loss', 0.089), ('rk', 0.08), ('global', 0.08), ('dual', 0.072), ('absolute', 0.07), ('horizon', 0.066), ('corpus', 0.064), ('rounds', 0.062), ('equation', 0.06), ('wk', 0.059), ('wj', 0.058), ('rc', 0.056), ('respective', 0.055), ('crammer', 0.054), ('earning', 0.052), ('lemma', 0.051), ('qi', 0.05), ('perceptrons', 0.05), ('bound', 0.05), ('attained', 0.046), ('binary', 0.046), ('predictions', 0.045), ('interpolation', 0.045), ('family', 0.043), ('argmax', 0.041), ('article', 0.039), ('hypotheses', 0.037), ('skeleton', 0.035), ('conservative', 0.035), ('updates', 0.035), ('code', 0.035), ('faced', 0.034), ('ecoc', 0.033), ('horn', 0.033), ('aggressiveness', 0.033), ('concreteness', 0.033), ('conservativeness', 0.033), ('peetre', 0.033), ('min', 0.033), ('max', 0.032), ('mistakes', 0.031), ('prediction', 0.031), ('rn', 0.031), ('inactive', 0.03), ('suffered', 0.03), ('multiple', 0.029), ('concrete', 0.028), ('reuters', 0.028), ('pre', 0.028), ('kivinen', 0.028), ('cumulative', 0.028), ('plugging', 0.027), ('categorization', 0.027), ('suffer', 0.027), ('losses', 0.027), ('observing', 0.027), ('turning', 0.026), ('mistake', 0.026), ('families', 0.026), ('equals', 0.026), ('individual', 0.026), ('sides', 0.026), ('complementarity', 0.025), ('inequality', 0.025), ('ensemble', 0.024), ('categories', 0.024), ('euclidean', 0.023), ('conclude', 0.023), ('enumerate', 0.023), ('bakiri', 0.023), ('abbreviate', 0.023), ('google', 0.023), ('dekel', 0.023), ('longest', 0.023), ('ers', 0.023), ('side', 0.022), ('holds', 0.022), ('th', 0.022), ('dormant', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

2 0.15130723 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

Author: Onur C. Hamsici, Aleix M. Martinez

Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and pre-processing step map the original data from R p to the surface of a hypersphere S p−1 . Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead—as if the data were represented in R p rather than S p−1 . This opens the question to whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision

3 0.090592936 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa

Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks

4 0.087928861 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

Author: Roni Khardon, Gabriel Wachman

Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods

5 0.086612113 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Marco Loog

Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization

6 0.077949516 90 jmlr-2007-Value Regularization and Fenchel Duality

7 0.076166466 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

8 0.073577709 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

9 0.067978963 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

10 0.061952159 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

11 0.058250204 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

12 0.057851799 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

13 0.053091668 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

14 0.048091758 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

15 0.04702834 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

16 0.046976984 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring

17 0.044042788 34 jmlr-2007-From External to Internal Regret     (Special Topic on the Conference on Learning Theory 2005)

18 0.043897118 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

19 0.039566617 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

20 0.036583573 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.243), (1, -0.006), (2, -0.048), (3, -0.005), (4, -0.162), (5, -0.004), (6, -0.325), (7, -0.043), (8, -0.286), (9, 0.059), (10, 0.084), (11, 0.123), (12, 0.184), (13, 0.133), (14, 0.106), (15, 0.02), (16, 0.013), (17, 0.21), (18, -0.02), (19, 0.162), (20, -0.138), (21, 0.043), (22, -0.152), (23, -0.029), (24, 0.033), (25, -0.003), (26, 0.096), (27, 0.074), (28, 0.073), (29, -0.038), (30, -0.002), (31, -0.031), (32, -0.057), (33, 0.072), (34, -0.008), (35, -0.013), (36, 0.067), (37, -0.006), (38, -0.011), (39, 0.049), (40, 0.018), (41, -0.008), (42, -0.032), (43, -0.037), (44, -0.039), (45, 0.009), (46, -0.082), (47, -0.02), (48, 0.0), (49, -0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97668791 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

2 0.65053868 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

Author: Onur C. Hamsici, Aleix M. Martinez

Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and pre-processing step map the original data from R p to the surface of a hypersphere S p−1 . Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead—as if the data were represented in R p rather than S p−1 . This opens the question to whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision

3 0.416145 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

Author: Marco Loog

Abstract: Recently, Ye (2005) suggested yet another optimization criterion for discriminant analysis and proposed a characterization of the family of solutions to this objective. The characterization, however, merely describes a part of the full solution set, that is, it is not complete and therefore not at all a characterization. This correspondence first gives the correct characterization and afterwards compares it to Ye’s. Keywords: linear discriminant analysis, Fisher criterion, small sample, characterization

4 0.3982287 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

Author: Roni Khardon, Gabriel Wachman

Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods

5 0.39001784 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

Author: Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh

Abstract: In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data. In addition to maximum conditional likelihood, we present two alternative approaches for training DCRFs: marginal likelihood training, for when we are primarily interested in predicting only a subset of the variables, and cascaded training, for when we have a distinct data set for each state variable, as in transfer learning. We evaluate marginal training and cascaded training on both synthetic data and real-world text data, finding that marginal training can improve accuracy when uncertainty exists over the latent variables, and that for transfer learning, a DCRF trained in a cascaded fashion performs better than a linear-chain CRF that predicts the final task directly. Keywords: conditional random fields, graphical models, sequence labeling

6 0.38302803 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

7 0.3800956 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

8 0.37727368 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

9 0.31557143 90 jmlr-2007-Value Regularization and Fenchel Duality

10 0.3074317 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

11 0.29953817 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

12 0.27176064 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

13 0.21451323 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

14 0.20769219 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

15 0.19823575 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

16 0.18877259 34 jmlr-2007-From External to Internal Regret     (Special Topic on the Conference on Learning Theory 2005)

17 0.18335854 23 jmlr-2007-Concave Learners for Rankboost

18 0.17122816 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

19 0.16991314 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

20 0.16598798 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.023), (8, 0.013), (10, 0.03), (12, 0.029), (15, 0.019), (20, 0.355), (22, 0.013), (28, 0.064), (40, 0.062), (48, 0.024), (60, 0.05), (80, 0.02), (85, 0.089), (98, 0.121)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74510646 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

Author: Ofer Dekel, Philip M. Long, Yoram Singer

Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a large-scale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classification, perceptron

2 0.44527996 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

Author: Roni Khardon, Gabriel Wachman

Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods

3 0.43378711 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire

Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling

4 0.43243688 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

Author: Ambuj Tewari, Peter L. Bartlett

Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk

5 0.43196025 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

6 0.43036681 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

7 0.42915434 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

8 0.42893153 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

9 0.42859519 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

10 0.42768312 77 jmlr-2007-Stagewise Lasso

11 0.42663532 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

12 0.42626947 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

13 0.42565191 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

14 0.42442724 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

15 0.42432097 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

16 0.42379633 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

17 0.42339048 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

18 0.42079616 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

19 0.41991824 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

20 0.41809565 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR