nips nips2003 nips2003-122 knowledge-graph by maker-knowledge-mining

122 nips-2003-Margin Maximizing Loss Functions


Source: pdf

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. [sent-6, score-0.444]

2 Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. [sent-7, score-0.345]

3 We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. [sent-8, score-1.415]

4 This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. [sent-9, score-1.457]

5 We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. [sent-10, score-0.895]

6 1 Introduction Assume we have a classi£cation “learning” sample {x i , yi }n with yi ∈ {−1, +1}. [sent-11, score-0.528]

7 We i=1 wish to build a model F (x) for this data by minimizing (exactly or approximately) a loss criterion i C(yi , F (xi )) = i C(yi F (xi )) which is a function of the margins yi F (xi ) of this model on this data. [sent-12, score-0.902]

8 Most common classi£cation modeling approaches can be cast in this framework: logistic regression, support vector machines, boosting and more. [sent-13, score-0.529]

9 The model F (x) which these methods actually build is a linear combination of dictionary functions coming from a dictionary H which can be large or even in£nite: F (x) = βj hj (x) hj ∈H and our prediction at point x based on this model is sgnF (x). [sent-14, score-0.317]

10 When |H| is large, as is the case in most boosting or kernel SVM applications, some regularization is needed to control the “complexity” of the model F (x) and the resulting over£tting. [sent-15, score-0.312]

11 The 1- and 2-norm support vector machine training problems with slack can be cast in this form ([6], chapter 12). [sent-17, score-0.153]

12 In [8] we have shown that boosting approximately follows the “path” of regularized solutions traced by (1) as the regularization parameter λ varies, with the appropriate loss and an l1 penalty. [sent-18, score-0.998]

13 ˆ The main question that we answer in this paper is: for what loss functions does β(λ) converge to an “optimal” separator as λ → 0? [sent-19, score-0.543]

14 The de£nition of “optimal” which we will use depends on the lp norm used for regularization, and we will term it the “lp -margin maximizing separating hyper-plane”. [sent-20, score-0.475]

15 More concisely, we will investigate for which loss functions and under which conditions we have: ˆ β(λ) lim (2) = arg max min yi β h(xi ) ˆ λ→0 β(λ) β p =1 i This margin maximizing property is interesting for three distinct reasons. [sent-21, score-1.38]

16 First, it gives us a geometric interpretation of the “limiting” model as we relax the regularization. [sent-22, score-0.069]

17 It tells us that this loss seeks to optimally separate the data by maximizing a distance between a separating hyper-plane and the “closest” points. [sent-23, score-0.721]

18 A theorem by Mangasarian [7] allows us to interpret lp margin maximization as lq distance maximization, with 1/p + 1/q = 1, and hence make a clear geometric interpretation. [sent-24, score-0.697]

19 Second, from a learning theory perspective large margins are an important quantity — generalization error bounds that depend on the margins have been generated for support vector machines ([10] — using l2 margins) and boosting ( [9] — using l1 margins). [sent-25, score-0.787]

20 Thus, showing that a loss function is “margin maximizing” in this sense is useful and promising information regarding this loss function’s potential for generating good prediction models. [sent-26, score-0.973]

21 Third, practical experience shows that exact or approximate margin maximizaion (such as non-regularized kernel SVM solutions, or “in£nite” boosting) may actually lead to good classi£cation prediction models. [sent-27, score-0.317]

22 Our main result is a suf£cient condition on the loss function, which guarantees that (2) holds, if the data is separable, i. [sent-29, score-0.532]

23 This condition is presented and proven in section 2. [sent-32, score-0.111]

24 It covers the hinge loss of support vector machines, the logistic log-likelihood loss of logistic regression, and the exponential loss, most notably used in boosting. [sent-33, score-1.545]

25 Our result generalizes elegantly to multi-class models and loss functions. [sent-35, score-0.521]

26 We present the resulting margin-maximizing versions of SVMs and logistic regression in section 4. [sent-36, score-0.387]

27 2 Suf£cient condition for margin maximization The following theorem shows that if the loss function vanishes “quickly” enough, then it will be margin-maximizing as the regularization vanishes. [sent-37, score-1.159]

28 It provides us with a uni£ed margin-maximization theory, covering SVMs, logistic regression and boosting. [sent-38, score-0.323]

29 i=1 Let C(y, f ) = C(yf ) be a monotone non-increasing loss function depending on the margin only. [sent-45, score-0.72]

30 Consequently, if this margin-maximizing hyper-plane is unique, then the solutions converge to it: ˆ β(λ) (4) = arg max min yi β h(xi ) lim ˆ λ→0 β(λ) p β p =1 i Proof We prove the result separately for T = ∞ and T < ∞. [sent-47, score-0.579]

31 Therefore, for ˆ loss+penalty to vanish as λ → 0, β(λ) p must diverge, to allow the margins to diverge. [sent-51, score-0.189]

32 3 Assume β1 , β2 are two separating models, with β1 p = β2 p = 1, and β1 separates the data better, i. [sent-53, score-0.175]

33 : 0 < m2 = mini yi h(xi ) β2 < m1 = mini yi h(xi ) β1 . [sent-55, score-0.89]

34 Then ∃U = U (m1 , m2 ) such that ∀t > U, C(yi h(xi ) (tβ1 )) < C(yi h(xi ) (tβ2 )) i i In words, if β1 separates better than β2 then scaled-up versions of β1 will incur smaller loss than scaled-up versions of β2 , if the scaling factor is large enough. [sent-56, score-0.629]

35 Proof Since condition (3) holds with T = ∞, there exists U such that ∀t > U, n. [sent-57, score-0.117]

36 : Assume β ∗ is a convergence point of β(λ) as λ → 0, with β ∗ p = 1. [sent-59, score-0.058]

37 ˆ p ˜ ˜ Now assume by contradiction β has β p = 1 and bigger minimal lp margin. [sent-60, score-0.403]

38 Denote the minimal margins for the two models by m∗ and m, respectively, with m∗ < m. [sent-61, score-0.275]

39 ˜ ˜ By continuity of the minimal margin in β, there exists some open neighborhood of β ∗ on the lp sphere: Nβ ∗ = {β : β p = 1, β − β ∗ 2 < δ} and an > 0, such that: min yi β h(xi ) < m − , ∀β ∈ Nβ ∗ ˜ i ˜ Now by lemma 2. [sent-62, score-0.861]

40 3 we get that exists U = U (m, m − ) such that tβ incurs smaller loss ˜ ˜ than tβ for any t > U, β ∈ Nβ ∗ . [sent-63, score-0.479]

41 Therefore β ∗ cannot be a convergence point of ˆ β(λ) . [sent-64, score-0.058]

42 5 limλ→0 mini yi β(λ) h(xi ) = T Proof Assume by contradiction that there is a sequence λ1 , λ2 , . [sent-71, score-0.566]

43 ∀j, mini yi β(λ ˜ ˜ ˜ Pick any separating normalized model β i. [sent-77, score-0.593]

44 ˜ p C(T − ) Then for any λ < m T p we get: ˜ C(yi i T ˜ T ˜ β h(xi )) + λ β m ˜ m ˜ p p < C(T − ) since the £rst term (loss) is 0 and the penalty is smaller than C(T − ) by condition on λ. [sent-80, score-0.164]

45 λj0 < mp C(T p ) and so we get a contradiction to optimality of β(λj0 ), since T ˆ j ) h(xi ) ≤ T − and thus: we assumed mini yi β(λ 0 ˆ C(yi β(λj0 ) h(xi )) ≥ C(T − ) i ˆ We have thus proven that lim inf λ→0 mini yi β(λ) h(xi ) ≥ T . [sent-83, score-1.186]

46 ˆ Assume by contradiction that for some value of λ we have m := mini yi β(λ) h(xi ) > T . [sent-85, score-0.566]

47 T ˆ ˆ Then the re-scaled model m β(λ) has the same zero loss as β(λ), but a smaller penalty, T ˆ T ˆ ˆ ˆ since m β(λ) = m β(λ) < β(λ) . [sent-86, score-0.449]

48 : Assume β ∗ is a convergence point of β(λ) as λ → 0, with β ∗ p = 1. [sent-89, score-0.058]

49 ˆ p ˜ ˜ Now assume by contradiction β has β p = 1 and bigger minimal margin. [sent-90, score-0.244]

50 Denote the minimal margins for the two models by m∗ and m, respectively, with m∗ < m. [sent-91, score-0.275]

51 Thus, ∃j0 such that ∀j > j0 , β(λj ) p > T and ˆ assumption, β(λ m m ˜ m ˜ consequently: ˆ ˆ C(yi β(λj ) h(xi )) + λ β(λj ) p p > λ( i T p ) = m ˜ C(yi i T ˜ T ˜ βh(xi )) + λ β m ˜ m ˜ p p ˆ So we get a contradiction to optimality of β(λj ). [sent-98, score-0.202]

52 that any convergence point of ˆ β(λ) ˆ β(λ) p must ˆ β(λ) ˆ β(λ) p p maximize the lp margin. [sent-101, score-0.217]

53 We conjecture that if we also require that the loss is convex and vanishing (i. [sent-104, score-0.449]

54 limm→∞ C(m) = 0) then condition (3) is suf£cient and necessary. [sent-106, score-0.083]

55 3 Examples Support vector machines Support vector machines (linear or kernel) can be described as a regularized problem: (5) [1 − yi β h(xi )]+ + λ β min β p p i where p = 2 for the standard (“2-norm”) SVM and p = 1 for the 1-norm SVM. [sent-108, score-0.656]

56 This formulation is equivalent to the better known “norm minimization” SVM formulation in the sense that they have the same set of solutions as the regularization parameter λ varies in (5) or the slack bound varies in the norm minimization formulation. [sent-109, score-0.468]

57 The loss in (5) is termed “hinge loss” since it’s linear for margins less than 1, then £xed at 0 (see £gure 1). [sent-110, score-0.638]

58 The theorem obviously holds for T = 1, and it veri£es our knowledge that the non-regularized SVM solution, which is the limit of the regularized solutions, maximizes the appropriate margin (Euclidean for standard SVM, l1 for 1-norm SVM). [sent-111, score-0.554]

59 Note that our theorem indicates that the squared hinge loss (AKA truncated squared loss): C(yi , F (xi )) = [1 − yi F (xi )]2 + is also a margin-maximizing loss. [sent-112, score-0.895]

60 In [8] we showed that boosting approximately follows the regularized path of solutions ˆ β(λ) using these loss functions and l1 regularization. [sent-114, score-0.971]

61 We also proved that the two loss functions are very similar for positive margins, and that their regularized solutions converge to margin-maximizing separators. [sent-115, score-0.78]

62 1 provides a new proof of this result, since the theorem’s condition holds with T = ∞ for both loss functions. [sent-117, score-0.637]

63 Some interesting non-examples Commonly used classi£cation loss functions which are not margin-maximizing include any 1 polynomial loss function: C(m) = m , C(m) = m2 , etc. [sent-118, score-0.98]

64 do not guarantee convergence of regularized solutions to margin maximizing solutions. [sent-119, score-0.715]

65 Another interesting method in this context is linear discriminant analysis. [sent-120, score-0.059]

66 Although it does not correspond to the loss+penalty formulation we have described, it does £nd a “decision hyper-plane” in the predictor space. [sent-121, score-0.089]

67 For both polynomial loss functions and linear discriminant analysis it is easy to £nd examples which show that they are not necessarily margin maximizing on separable data. [sent-122, score-1.014]

68 4 A multi-class generalization Our main result can be elegantly extended to versions of multi-class logistic regression and support vector machines, as follows. [sent-123, score-0.571]

69 Our model consists of a “prediction” for each class: (k) Fk (x) = βj hj (x) hj ∈H with the obvious prediction rule at x being arg maxk Fk (x). [sent-130, score-0.262]

70 For y = ck , de£ne the margin vector as: (8) m(ck , f1 , . [sent-132, score-0.421]

71 , fk − fK ) And our loss is a function of this K − 1 dimensional margin: I{y = ck }C(m(ck , f1 , . [sent-141, score-0.843]

72 5 2 3 2 3 Figure 1: Margin maximizing loss functions for 2-class problems (left) and the SVM 3-class loss function of section 4. [sent-154, score-1.098]

73 1 (right) The lp -regularized problem is now: (9) ˆ β(λ) = arg C(yi , h(xi ) β (1) , . [sent-155, score-0.207]

74 In this formulation, the concept of margin maximization corresponds to maximizing the minimal of all n · (K − 1) normalized lp -margins generated by the data: (10) β (1) max p p +. [sent-165, score-0.776]

75 Here is a generalization of the optimal separation theorem 2. [sent-169, score-0.141]

76 The condition (11) implies that as the regularization vanishes the model is determined by the minimal margin, and so an optimal model puts the emphasis on maximizing that margin. [sent-184, score-0.446]

77 Proof The loss depends on β (1) − β (2) , the penalty on β (1) p + β (2) p . [sent-189, score-0.53]

78 1 Margin maximization in multi-class SVM and logistic regression Here we apply theorem 4. [sent-192, score-0.526]

79 For logistic regression, we use a slightly different formulation than the “standard” logistic regression models, which uses class K as a “reference” class, i. [sent-194, score-0.581]

80 However, using regularization as in (9) guarantees that the solution will be unique and consequently we can “symmetrize” the model — which allows us to apply theorem 4. [sent-198, score-0.259]

81 So the loss function we use is (assume y = ck belongs to class k): efk = ef1 + . [sent-200, score-0.69]

82 It is not dif£cult to verify that condition (11) holds for this loss function with T = ∞, using the fact that log(1 + ) = + O( 2 ). [sent-213, score-0.566]

83 For support vector machines, consider a multi-class loss which is a natural generalization of the two-class loss: K−1 [1 − mj ]+ C(m) = (13) j=1 Where mj is the j’th component of the multi-margin m as in (8). [sent-215, score-0.666]

84 Figure 1 shows this loss for K = 3 classes as a function of the two margins. [sent-216, score-0.449]

85 The loss+penalty formulation using 13 is equivalent to a standard optimization formulation of multi-class SVM (e. [sent-217, score-0.11]

86 , K}, ck = yi ξik ≥ 0 , ξik ≤ B , i,k β (k) p p =1 k As both theorem 4. [sent-227, score-0.479]

87 1 (using T = 1) and the optimization formulation indicate, the regularized solutions to this problem converge to the lp margin maximizing multi-class solution. [sent-228, score-0.914]

88 p p 5 Discussion What are the properties we would like to have in a classi£cation loss function? [sent-229, score-0.449]

89 Recently there has been a lot of interest in Bayes-consistency of loss functions and algorithms ([1] and references therein), as the data size increases. [sent-230, score-0.5]

90 It turns out that practically all “reasonable” loss functions are consistent in that sense, although convergence rates and other measures of “degree of consistency” may vary. [sent-231, score-0.587]

91 Margin maximization, on the other hand, is a £nite sample optimality property of loss functions, which is potentially of decreasing interest as sample size grows, since the training data-set is less likely to be separable. [sent-232, score-0.5]

92 Note, however, that in very high dimensional predictor spaces, such as those typically used by boosting or kernel SVM, separability of any £nite-size data-set is a mild assumption, which is violated only in pathological cases. [sent-233, score-0.241]

93 We have shown that the margin maximizing property is shared by some popular loss functions used in logistic regression, support vector machines and boosting. [sent-234, score-1.276]

94 Knowing that these algorithms “converge”, as regularization vanishes, to the same model (provided they use the same regularization) is an interesting insight. [sent-235, score-0.136]

95 So, for example, we can conclude that 1-norm support vector machines, exponential boosting and l1 -regularized logistic regression all facilitate the same non-regularized solution, which is an l1 -margin maximizing separating hyper-plane. [sent-236, score-0.962]

96 From Mangasarian’s theorem [7] we know that this hyper-plane maximizes the l∞ distance from the closest points on either side. [sent-237, score-0.092]

97 The most interesting statistical question which arises is: are these “optimal” separating models really good for prediction, or should we expect regularized models to always do better in practice? [sent-238, score-0.361]

98 We have had similar experience with boosting and kernel SVM. [sent-241, score-0.207]

99 Boosting in the limit: Maximizing the margin of learned ensembles. [sent-274, score-0.271]

100 Boosting as a regularized path to a maximum margin classi£er. [sent-293, score-0.455]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('loss', 0.449), ('margin', 0.271), ('fk', 0.271), ('yi', 0.264), ('boosting', 0.207), ('logistic', 0.203), ('margins', 0.189), ('mini', 0.181), ('lp', 0.159), ('regularized', 0.157), ('maximizing', 0.149), ('xi', 0.143), ('ck', 0.123), ('separating', 0.123), ('contradiction', 0.121), ('regression', 0.12), ('efk', 0.118), ('maximization', 0.111), ('regularization', 0.105), ('svm', 0.103), ('theorem', 0.092), ('hinge', 0.09), ('condition', 0.083), ('penalty', 0.081), ('solutions', 0.08), ('hastie', 0.074), ('hj', 0.072), ('proof', 0.071), ('lim', 0.066), ('separable', 0.066), ('machines', 0.065), ('versions', 0.064), ('minimal', 0.061), ('support', 0.061), ('convergence', 0.058), ('mangasarian', 0.057), ('lemma', 0.055), ('formulation', 0.055), ('limm', 0.055), ('scahpire', 0.055), ('separates', 0.052), ('functions', 0.051), ('min', 0.051), ('optimality', 0.051), ('generalization', 0.049), ('arg', 0.048), ('ik', 0.048), ('vanishes', 0.048), ('elegantly', 0.047), ('prediction', 0.046), ('norm', 0.044), ('rosset', 0.043), ('breiman', 0.043), ('converge', 0.043), ('geometric', 0.04), ('mj', 0.04), ('grove', 0.04), ('classi', 0.04), ('dictionary', 0.038), ('convexity', 0.038), ('stanford', 0.037), ('consequently', 0.037), ('conclude', 0.037), ('tibshirani', 0.036), ('schuurmans', 0.036), ('exponential', 0.035), ('bigger', 0.034), ('predictor', 0.034), ('necessity', 0.034), ('slack', 0.034), ('holds', 0.034), ('varies', 0.033), ('zhu', 0.032), ('interesting', 0.031), ('cast', 0.031), ('adaboost', 0.031), ('get', 0.03), ('statistics', 0.03), ('sense', 0.029), ('ce', 0.029), ('practically', 0.029), ('interpretation', 0.029), ('assume', 0.028), ('covers', 0.028), ('discriminant', 0.028), ('proven', 0.028), ('freund', 0.028), ('path', 0.027), ('prove', 0.027), ('vector', 0.027), ('bartlett', 0.027), ('models', 0.025), ('friedman', 0.025), ('normalized', 0.025), ('cation', 0.025), ('unique', 0.025), ('debated', 0.024), ('maxk', 0.024), ('lq', 0.024), ('watkins', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

2 0.2729837 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in twoclass classi£cation. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an ef£cient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1

3 0.27126944 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

4 0.2161942 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

5 0.17646678 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification. 1 Motivation Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X1 , Y1 ), . . . , (Xn , Yn ), ˆ a large margin classification algorithm produces a decision function fn by empirically minimizing a loss function that is often a convex upper bound of the binary classification error ˆ ˆ function. Given fn , the binary decision rule is to predict y = 1 if fn (x) ≥ 0, and to predict ˆ y = −1 otherwise (the decision rule at fn (x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class Cn : 1 ˆ fn = arg min f ∈Cn n n φ(f (Xi )Yi ). (1) i=1 Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multicategory problems which generalizes the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).1 However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation which generalizes (1) can be used to yield consistent classifiers in the large sample limit. An important step toward this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations. 2 Multi-category classification We consider the following K-class classification problem: we would like to predict the label y ∈ {1, . . . , K} of an input vector x. In this paper, we only consider the simplest scenario with 0 − 1 classification loss: we have a loss of 0 for correct prediction, and loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to K class classification problem as follows: we consider K decision functions fc (x) where c = 1, . . . , K and we predict the label y of x as: T (f (x)) = arg max c∈{1,...,K} fc (x), (2) where we denote by f (x) the vector function f (x) = [f1 (x), . . . , fK (x)]. Note that if two or more components of f achieve the same maximum value, then we may choose any of them as T (f ). In this framework, fc (x) is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c (compared with the remaining k − 1 categories). The classification error is given by: (f ) = 1 − EX P (Y = T (X)|X). Note that only the relative strength of fc compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of f (x). This allows us to impose one constraint on the vector f (x) which decreases the degree of freedom K of the K-component vector f (x) to K − 1. 1 This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. Taking a limit point of (cm , f m , q m ), and using the continuity of Ψc (·), we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function f (x) ∈ ΩB . Let q(x) = P (·|x). Given X, if P (Y = T (f (X))|X) ≥ P (Y = T (q(X))|X)+ , then equation (6) implies that L(q(X), f (X)) ≥ inf g∈Ω L(q(X), g) + δ. Therefore (f ) − inf (g) =EX [P (Y = T (q(X))|X) − P (Y = T (f (X))|X)] g∈B ≤ + EX I(P (Y = T (q(X))|X) − P (Y = T (f (X))|X) > ) LX (q(X), f (X)) − inf g∈BΩ LX (q(X), g) ≤ + EX δ R(f ) − inf g∈BΩ R(g) = + . δ In the above derivation we use I to denote the indicator function. Since and δ are arbitrary, we obtain the theorem by letting → 0. 2 Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with appropriately chosen function class Cn , approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this forms have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes Cn (C1 ⊂ C2 ⊂ · · · ) such that ∪n Cn is dense in the set of Borel measurable functions. The set Cn has the effect of regularization, which ensures that ˆ ˆ P R(fn ) ≈ inf f ∈Cn R(f ). It follows that as n → ∞, R(fn ) → inf f ∈B R(f ). Theorem 2.1 ˆ P then implies that (fn ) → inf f ∈B (f ). The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. We shall focus on the choices of Ψ that lead to admissible formulations. We pay special attention to the case that each Ψc (f ) is a convex function of f , so that the resulting formulation becomes computational more tractable. Instead of working with the general form of Ψc in (4), we focus on two specific choices listed in the next two sections. 3 Unconstrained formulations We consider unconstrained formulation with the following choice of Ψ: K Ψc (f ) = φ(fc ) + s t(fk ) , (7) k=1 where φ, s and t are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label c. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of f . This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing φ(fc ) by φc (fc )), which can be useful if we associate different classification loss to different kinds of errors. 3.1 Optimality equation and probability model Using (7), the conditional true risk (5) can be written as: K L(q, f ) = K qc φ(fc ) + s t(fc ) . c=1 c=1 In the following, we study the property of the optimal vector f ∗ that minimizes L(q, f ) for a fixed q. Given q, the optimal solution f ∗ of L(q, f ) satisfies the following first order condition: ∗ ∗ qc φ (fc ) + µf ∗ t (fc ) = 0 (c = 1, . . . , K). (8) where quantity µf ∗ = s ( K k=1 ∗ t(fk )) is independent of k. ∗ Clearly this equation relates qc to fc for each component c. The relationship of q and f ∗ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with Ψ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition. Proposition 3.1 Consider (7). Assume Φc (f ) is continuous on [−∞, ∞]K and bounded below. If s (u) ≥ 0 and ∀p > 0, pφ (f ) + t (f ) = 0 has a unique solution fp that is an increasing function of p, then the formulation is admissible. If s(u) = u, the condition ∀p > 0 in Proposition 3.1 can be replaced by ∀p ∈ (0, 1). 3.2 Decoupled formulations We let s(u) = u in (7). The optimality condition (8) becomes ∗ ∗ qc φ (fc ) + t (fc ) = 0 (c = 1, . . . , K). (9) This means that we have K decoupled equalities, one for each fc . This is the simplest and in the author’s opinion, the most interesting formulation. Since the estimation problem in ˆ (3) is also decoupled into K separate equations, one for each component of fn , this class of methods are computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for two-class problem (if we want to treat the two classes in a symmetric manner) since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint f1 + f2 = 0 can be used to reduce the number of equations. This variable elimination has little impact if there are many categories. In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal optimizer function f∗ (q) and the conditional probability q. For simplicity, we focus on the choice φ(u) = −u. 3.2.1 φ(u) = −u and t(u) = eu ∗ We obtain the following probability model: qc = efc . This formulation is closely related K to the maximum-likelihood estimate with conditional model qc = efc / k=1 efk (logistic regression). In particular, if we choose a function class such that the normalization condiK tion k=1 efk = 1 holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice of φ(u) = − ln u and t(u) = u. This is an extension of maximum-likelihood estimate with probability model qc = fc . The resulting method is identical to maximum-likelihood if we choose our function class such that k fk = 1. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint k fk = 1. Therefore this method is more flexible. 3.2.2 φ(u) = −u and t(u) = ln(1 + eu ) This version uses binary logistic regression loss, and we have the following probability ∗ model: qc = (1 + e−fc )−1 . Again this is an unnormalized model. 1 3.2.3 φ(u) = −u and t(u) = p |u|p (p > 1) ∗ ∗ We obtain the following probability model: qc = sign(fc )|fc |p−1 . This means that at the ∗ ∗ solution, fc ≥ 0. One may modify it such that we allow fc ≤ 0 to model the condition probability qc = 0. 3.2.4 φ(u) = −u and t(u) = 1 p max(u, 0)p (p > 1) ∗ In this probability model, we have the following relationship: qc = max(fc , 0)p−1 . The ∗ equation implies that we allow fc ≤ 0 to model the conditional probability qc = 0. Therefore, with a fixed function class, this model is more powerful than the previous one. How∗ ever, at the optimal solution, fc ≤ 1. This requirement can be further alleviated with the following modification. 3.2.5 φ(u) = −u and t(u) = 1 p min(max(u, 0)p , p(u − 1) + 1) (p > 1) In this probability model, we have the following relationship at the exact solution: qc = c min(max(f∗ , 0), 1)p−1 . Clearly this model is more powerful than the previous model since ∗ the function value fc ≥ 1 can be used to model qc = 1. 3.3 Coupled formulations In the coupled formulation with s(u) = u, the probability model can be normalized in a certain way. We list a few examples. 3.3.1 φ(u) = −u, and t(u) = eu , and s(u) = ln(u) This is the standard logistic regression model. The probability model is: K ∗ qc (x) = exp(fc (x))( ∗ exp(fc (x)))−1 . c=1 The right hand side is always normalized (sum up to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region Ω = {f : supk fk = 0}, and it is easy to check that this model is admissible in Ω. Ω Let fc = fc − supk fk ∈ Ω, then f Ω has the same decision rule as f and R(f ) = R(f Ω ). Therefore Theorem 2.1 implies that R(f ) → inf g∈B R(g) implies (f ) → inf g∈B (g). 1 3.3.2 φ(u) = −u, and t(u) = |u|p , and s(u) = p |u|p/p (p, p > 1) The probability model is: K ∗ ∗ ∗ |fk (x)|p )(p−p )/p sign(fc (x))|fc (x)|p −1 . qc (x) = ( k=1 We may replace t(u) by t(u) = max(0, u)p , and the probability model becomes: K qc (x) = ( ∗ ∗ max(fk (x), 0)p )(p−p )/p max(fc (x), 0)p −1 . k=1 These formulations do not seem to have advantages over the decoupled counterparts. Note that if we let p → 1, then the sum of the p p -th power of the right hand side → 1. In a −1 way, this means that the model is normalized in the limit of p → 1. 4 Constrained formulations As pointed out, one may impose constraints on possible choices of f . We may impose such a condition when we specify the function class Cn . However, for clarity, we shall directly impose a condition into our formulation. If we impose a constraint into (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of binary large-margin method (1) to multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider linear equality constraint only: K Ψc (f ) = φ(−fk ), s.t. f ∈ Ω, (10) k=1,k=c where we define Ω as: K Ω = {f : fk = 0} ∪ {f : sup fk = ∞}. k k=1 We may interpret the added constraint as a restriction on the function class Cn in (3) such that every f ∈ Cn satisfies the constraint. Note that with K = 2, this leads to the usually binary large margin method. Using (10), the conditional true risk (5) can be written as: K (1 − qc )φ(−fc ), L(q, f ) = s.t. f ∈ Ω. (11) c=1 The following result provides a simple way to check the admissibility of (10). Proposition 4.1 If φ is a convex function which is bounded below and φ (0) < 0, then (10) is admissible on Ω. Proof Sketch. The continuity condition is straight-forward to verify. We may also assume that φ(·) ≥ 0 without loss of generality. Now let f achieves the minimum of L(q, ·). If fc = ∞, then it is clear that qc = 1 and thus qk = 0 for k = c. This implies that for k = c, φ(−fk ) = inf f φ(−f ), and thus fk < 0. If fc = supk fk < ∞, then the constraint implies fc ≥ 0. It is easy to see that ∀k, qc ≥ qk since otherwise, we must have φ(−fk ) > φ(−fc ), and thus φ (−fk ) > 0 and φ (−fc ) < 0, implying that with sufficient small δ > 0, φ(−(fk + δ)) < φ(−fk ) and φ(−(fc − δ)) < φ(−fc ). A contradiction. 2 Using the above criterion, we can convert any admissible convex φ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of SVM (with loss function φ(u) = max(0, 1 − u)) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. Note that for the constrained formulation, it is more difficult to relate fc at the optimal solution to a probability model, since such a model will have a much more complicated form compared with the unconstrained counterpart. 5 Conclusion In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods. We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near zero density may cause serious robustness problems at least in theory. References [1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003. [2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003. [3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion. [4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. with discussion. [5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. accepted. [6] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002. [7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. with discussion. [8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003. [9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002. [11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statitics, 32, 2004. with discussion.

6 0.16237429 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

7 0.15341055 143 nips-2003-On the Dynamics of Boosting

8 0.15126863 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

9 0.14903213 146 nips-2003-Online Learning of Non-stationary Sequences

10 0.12415653 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

11 0.10885353 88 nips-2003-Image Reconstruction by Linear Programming

12 0.10823005 126 nips-2003-Measure Based Regularization

13 0.10621244 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

14 0.10207935 121 nips-2003-Log-Linear Models for Label Ranking

15 0.10088013 148 nips-2003-Online Passive-Aggressive Algorithms

16 0.095288947 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

17 0.095038086 112 nips-2003-Learning to Find Pre-Images

18 0.094519958 133 nips-2003-Mutual Boosting for Contextual Inference

19 0.092322297 100 nips-2003-Laplace Propagation

20 0.091166295 41 nips-2003-Boosting versus Covering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.261), (1, -0.156), (2, -0.114), (3, -0.278), (4, 0.249), (5, -0.072), (6, -0.166), (7, -0.137), (8, -0.053), (9, 0.046), (10, 0.033), (11, 0.166), (12, -0.03), (13, 0.132), (14, -0.171), (15, 0.123), (16, 0.049), (17, 0.036), (18, 0.006), (19, 0.017), (20, -0.122), (21, -0.016), (22, 0.151), (23, 0.122), (24, 0.043), (25, -0.095), (26, -0.083), (27, -0.123), (28, -0.002), (29, 0.054), (30, 0.127), (31, 0.101), (32, 0.027), (33, -0.032), (34, -0.047), (35, -0.013), (36, -0.081), (37, -0.039), (38, -0.015), (39, 0.001), (40, -0.034), (41, 0.007), (42, 0.013), (43, -0.036), (44, -0.06), (45, -0.084), (46, 0.066), (47, 0.087), (48, -0.004), (49, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98570412 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

2 0.76461196 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

3 0.71471637 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in twoclass classi£cation. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an ef£cient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1

4 0.64335638 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

5 0.62999773 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification. 1 Motivation Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X1 , Y1 ), . . . , (Xn , Yn ), ˆ a large margin classification algorithm produces a decision function fn by empirically minimizing a loss function that is often a convex upper bound of the binary classification error ˆ ˆ function. Given fn , the binary decision rule is to predict y = 1 if fn (x) ≥ 0, and to predict ˆ y = −1 otherwise (the decision rule at fn (x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class Cn : 1 ˆ fn = arg min f ∈Cn n n φ(f (Xi )Yi ). (1) i=1 Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multicategory problems which generalizes the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).1 However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation which generalizes (1) can be used to yield consistent classifiers in the large sample limit. An important step toward this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations. 2 Multi-category classification We consider the following K-class classification problem: we would like to predict the label y ∈ {1, . . . , K} of an input vector x. In this paper, we only consider the simplest scenario with 0 − 1 classification loss: we have a loss of 0 for correct prediction, and loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to K class classification problem as follows: we consider K decision functions fc (x) where c = 1, . . . , K and we predict the label y of x as: T (f (x)) = arg max c∈{1,...,K} fc (x), (2) where we denote by f (x) the vector function f (x) = [f1 (x), . . . , fK (x)]. Note that if two or more components of f achieve the same maximum value, then we may choose any of them as T (f ). In this framework, fc (x) is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c (compared with the remaining k − 1 categories). The classification error is given by: (f ) = 1 − EX P (Y = T (X)|X). Note that only the relative strength of fc compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of f (x). This allows us to impose one constraint on the vector f (x) which decreases the degree of freedom K of the K-component vector f (x) to K − 1. 1 This approach is often called one-versus-all or ranking in machine learning. Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. Taking a limit point of (cm , f m , q m ), and using the continuity of Ψc (·), we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function f (x) ∈ ΩB . Let q(x) = P (·|x). Given X, if P (Y = T (f (X))|X) ≥ P (Y = T (q(X))|X)+ , then equation (6) implies that L(q(X), f (X)) ≥ inf g∈Ω L(q(X), g) + δ. Therefore (f ) − inf (g) =EX [P (Y = T (q(X))|X) − P (Y = T (f (X))|X)] g∈B ≤ + EX I(P (Y = T (q(X))|X) − P (Y = T (f (X))|X) > ) LX (q(X), f (X)) − inf g∈BΩ LX (q(X), g) ≤ + EX δ R(f ) − inf g∈BΩ R(g) = + . δ In the above derivation we use I to denote the indicator function. Since and δ are arbitrary, we obtain the theorem by letting → 0. 2 Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with appropriately chosen function class Cn , approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this forms have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes Cn (C1 ⊂ C2 ⊂ · · · ) such that ∪n Cn is dense in the set of Borel measurable functions. The set Cn has the effect of regularization, which ensures that ˆ ˆ P R(fn ) ≈ inf f ∈Cn R(f ). It follows that as n → ∞, R(fn ) → inf f ∈B R(f ). Theorem 2.1 ˆ P then implies that (fn ) → inf f ∈B (f ). The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. We shall focus on the choices of Ψ that lead to admissible formulations. We pay special attention to the case that each Ψc (f ) is a convex function of f , so that the resulting formulation becomes computational more tractable. Instead of working with the general form of Ψc in (4), we focus on two specific choices listed in the next two sections. 3 Unconstrained formulations We consider unconstrained formulation with the following choice of Ψ: K Ψc (f ) = φ(fc ) + s t(fk ) , (7) k=1 where φ, s and t are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label c. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of f . This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing φ(fc ) by φc (fc )), which can be useful if we associate different classification loss to different kinds of errors. 3.1 Optimality equation and probability model Using (7), the conditional true risk (5) can be written as: K L(q, f ) = K qc φ(fc ) + s t(fc ) . c=1 c=1 In the following, we study the property of the optimal vector f ∗ that minimizes L(q, f ) for a fixed q. Given q, the optimal solution f ∗ of L(q, f ) satisfies the following first order condition: ∗ ∗ qc φ (fc ) + µf ∗ t (fc ) = 0 (c = 1, . . . , K). (8) where quantity µf ∗ = s ( K k=1 ∗ t(fk )) is independent of k. ∗ Clearly this equation relates qc to fc for each component c. The relationship of q and f ∗ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with Ψ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition. Proposition 3.1 Consider (7). Assume Φc (f ) is continuous on [−∞, ∞]K and bounded below. If s (u) ≥ 0 and ∀p > 0, pφ (f ) + t (f ) = 0 has a unique solution fp that is an increasing function of p, then the formulation is admissible. If s(u) = u, the condition ∀p > 0 in Proposition 3.1 can be replaced by ∀p ∈ (0, 1). 3.2 Decoupled formulations We let s(u) = u in (7). The optimality condition (8) becomes ∗ ∗ qc φ (fc ) + t (fc ) = 0 (c = 1, . . . , K). (9) This means that we have K decoupled equalities, one for each fc . This is the simplest and in the author’s opinion, the most interesting formulation. Since the estimation problem in ˆ (3) is also decoupled into K separate equations, one for each component of fn , this class of methods are computationally relatively simple and easy to parallelize. Although this method seems to be preferable for multi-category problems, it is not the most efficient way for two-class problem (if we want to treat the two classes in a symmetric manner) since we have to solve two separate equations. We only need to deal with one equation in (1) due to the fact that an effective constraint f1 + f2 = 0 can be used to reduce the number of equations. This variable elimination has little impact if there are many categories. In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal optimizer function f∗ (q) and the conditional probability q. For simplicity, we focus on the choice φ(u) = −u. 3.2.1 φ(u) = −u and t(u) = eu ∗ We obtain the following probability model: qc = efc . This formulation is closely related K to the maximum-likelihood estimate with conditional model qc = efc / k=1 efk (logistic regression). In particular, if we choose a function class such that the normalization condiK tion k=1 efk = 1 holds, then the two formulations are identical. However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice of φ(u) = − ln u and t(u) = u. This is an extension of maximum-likelihood estimate with probability model qc = fc . The resulting method is identical to maximum-likelihood if we choose our function class such that k fk = 1. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint k fk = 1. Therefore this method is more flexible. 3.2.2 φ(u) = −u and t(u) = ln(1 + eu ) This version uses binary logistic regression loss, and we have the following probability ∗ model: qc = (1 + e−fc )−1 . Again this is an unnormalized model. 1 3.2.3 φ(u) = −u and t(u) = p |u|p (p > 1) ∗ ∗ We obtain the following probability model: qc = sign(fc )|fc |p−1 . This means that at the ∗ ∗ solution, fc ≥ 0. One may modify it such that we allow fc ≤ 0 to model the condition probability qc = 0. 3.2.4 φ(u) = −u and t(u) = 1 p max(u, 0)p (p > 1) ∗ In this probability model, we have the following relationship: qc = max(fc , 0)p−1 . The ∗ equation implies that we allow fc ≤ 0 to model the conditional probability qc = 0. Therefore, with a fixed function class, this model is more powerful than the previous one. How∗ ever, at the optimal solution, fc ≤ 1. This requirement can be further alleviated with the following modification. 3.2.5 φ(u) = −u and t(u) = 1 p min(max(u, 0)p , p(u − 1) + 1) (p > 1) In this probability model, we have the following relationship at the exact solution: qc = c min(max(f∗ , 0), 1)p−1 . Clearly this model is more powerful than the previous model since ∗ the function value fc ≥ 1 can be used to model qc = 1. 3.3 Coupled formulations In the coupled formulation with s(u) = u, the probability model can be normalized in a certain way. We list a few examples. 3.3.1 φ(u) = −u, and t(u) = eu , and s(u) = ln(u) This is the standard logistic regression model. The probability model is: K ∗ qc (x) = exp(fc (x))( ∗ exp(fc (x)))−1 . c=1 The right hand side is always normalized (sum up to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region Ω = {f : supk fk = 0}, and it is easy to check that this model is admissible in Ω. Ω Let fc = fc − supk fk ∈ Ω, then f Ω has the same decision rule as f and R(f ) = R(f Ω ). Therefore Theorem 2.1 implies that R(f ) → inf g∈B R(g) implies (f ) → inf g∈B (g). 1 3.3.2 φ(u) = −u, and t(u) = |u|p , and s(u) = p |u|p/p (p, p > 1) The probability model is: K ∗ ∗ ∗ |fk (x)|p )(p−p )/p sign(fc (x))|fc (x)|p −1 . qc (x) = ( k=1 We may replace t(u) by t(u) = max(0, u)p , and the probability model becomes: K qc (x) = ( ∗ ∗ max(fk (x), 0)p )(p−p )/p max(fc (x), 0)p −1 . k=1 These formulations do not seem to have advantages over the decoupled counterparts. Note that if we let p → 1, then the sum of the p p -th power of the right hand side → 1. In a −1 way, this means that the model is normalized in the limit of p → 1. 4 Constrained formulations As pointed out, one may impose constraints on possible choices of f . We may impose such a condition when we specify the function class Cn . However, for clarity, we shall directly impose a condition into our formulation. If we impose a constraint into (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of binary large-margin method (1) to multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we will consider linear equality constraint only: K Ψc (f ) = φ(−fk ), s.t. f ∈ Ω, (10) k=1,k=c where we define Ω as: K Ω = {f : fk = 0} ∪ {f : sup fk = ∞}. k k=1 We may interpret the added constraint as a restriction on the function class Cn in (3) such that every f ∈ Cn satisfies the constraint. Note that with K = 2, this leads to the usually binary large margin method. Using (10), the conditional true risk (5) can be written as: K (1 − qc )φ(−fc ), L(q, f ) = s.t. f ∈ Ω. (11) c=1 The following result provides a simple way to check the admissibility of (10). Proposition 4.1 If φ is a convex function which is bounded below and φ (0) < 0, then (10) is admissible on Ω. Proof Sketch. The continuity condition is straight-forward to verify. We may also assume that φ(·) ≥ 0 without loss of generality. Now let f achieves the minimum of L(q, ·). If fc = ∞, then it is clear that qc = 1 and thus qk = 0 for k = c. This implies that for k = c, φ(−fk ) = inf f φ(−f ), and thus fk < 0. If fc = supk fk < ∞, then the constraint implies fc ≥ 0. It is easy to see that ∀k, qc ≥ qk since otherwise, we must have φ(−fk ) > φ(−fc ), and thus φ (−fk ) > 0 and φ (−fc ) < 0, implying that with sufficient small δ > 0, φ(−(fk + δ)) < φ(−fk ) and φ(−(fc − δ)) < φ(−fc ). A contradiction. 2 Using the above criterion, we can convert any admissible convex φ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of SVM (with loss function φ(u) = max(0, 1 − u)) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. Note that for the constrained formulation, it is more difficult to relate fc at the optimal solution to a probability model, since such a model will have a much more complicated form compared with the unconstrained counterpart. 5 Conclusion In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods. We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near zero density may cause serious robustness problems at least in theory. References [1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003. [2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003. [3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion. [4] W. Jiang. Process consistency for adaboost. The Annals of Statistics, 32, 2004. with discussion. [5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of American Statistical Association, 2002. accepted. [6] Yi Lin. Support vector machines and the bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002. [7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. with discussion. [8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003. [9] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999. [10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002. [11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statitics, 32, 2004. with discussion.

6 0.62854701 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

7 0.55762887 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

8 0.53439748 146 nips-2003-Online Learning of Non-stationary Sequences

9 0.48348132 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

10 0.41559467 143 nips-2003-On the Dynamics of Boosting

11 0.40141073 126 nips-2003-Measure Based Regularization

12 0.3605085 41 nips-2003-Boosting versus Covering

13 0.35526434 100 nips-2003-Laplace Propagation

14 0.35172457 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

15 0.34147438 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

16 0.34108892 88 nips-2003-Image Reconstruction by Linear Programming

17 0.32462749 148 nips-2003-Online Passive-Aggressive Algorithms

18 0.32263848 181 nips-2003-Statistical Debugging of Sampled Programs

19 0.31902373 121 nips-2003-Log-Linear Models for Label Ranking

20 0.3161943 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.05), (11, 0.044), (29, 0.016), (30, 0.015), (35, 0.15), (45, 0.159), (53, 0.113), (69, 0.017), (71, 0.1), (76, 0.088), (85, 0.075), (91, 0.07), (99, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91940361 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

2 0.87616837 184 nips-2003-The Diffusion-Limited Biochemical Signal-Relay Channel

Author: Peter J. Thomas, Donald J. Spencer, Sierra K. Hampton, Peter Park, Joseph P. Zurkus

Abstract: Biochemical signal-transduction networks are the biological information-processing systems by which individual cells, from neurons to amoebae, perceive and respond to their chemical environments. We introduce a simplified model of a single biochemical relay and analyse its capacity as a communications channel. A diffusible ligand is released by a sending cell and received by binding to a transmembrane receptor protein on a receiving cell. This receptor-ligand interaction creates a nonlinear communications channel with non-Gaussian noise. We model this channel numerically and study its response to input signals of different frequencies in order to estimate its channel capacity. Stochastic effects introduced in both the diffusion process and the receptor-ligand interaction give the channel low-pass characteristics. We estimate the channel capacity using a water-filling formula adapted from the additive white-noise Gaussian channel. 1 Introduction: The Diffusion-Limited Biochemical Signal-Relay Channel The term signal-transduction network refers to the web of biochemical interactions by which single cells process sensory information about their environment. Just as neural networks underly the interaction of many multicellular organisms with their environments, these biochemical networks allow cells to perceive, evaluate and react to chemical stimuli [1]. Examples include chemical signaling across the synaptic cleft, calcium signaling within the postsynaptic dendritic spine, pathogen localization by the immune system, ∗ † Corresponding author: pjthomas@salk.edu dspencer@salk.edu growth-cone guidance during neuronal development, phototransduction in the retina, rhythmic chemotactic signaling in social amoebae, and many others. The introduction of quantitative measurements of the distribution and activation of chemical reactants within living cells [2] has prepared the way for detailed quantitative analysis of their properties, aided by numerical simulations. One of the key questions that can now be addressed is the fundamental limits to cell-to-cell communication using chemical signaling. To communicate via chemical signaling cells must contend with the unreliability inherent in chemical diffusion and in the interactions of limited numbers of signaling molecules and receptors [3]. We study a simplified situation in which one cell secretes a signaling molecule, or ligand, which can be detected by a receptor on another cell. Limiting ourselves to one ligand-receptor interaction allows a treatment of this communications system using elementary concepts from information theory. The information capacity of this fundamental signaling system is the maximum of the mutual information between the ensemble of input signals, the time-varying rate of ligand secretion s(t), and the output signal r(t), a piecewise continuous function taking the values one or zero as the receptor is bound to ligand or unbound. Using numerical simulation we can estimate the channel capacity via a standard ”water-filling” information measure [4], as described below. 2 Methods: Numerical Simulation of the Biochemical Relay We simulate a biochemical relay system as follows: in a two-dimensional rectangular volume V measuring 5 micrometers by 10 micrometers, we locate two cells spaced 5 micrometers apart. Cell A emits ligand molecules from location xs = [2.5µ, 2.5µ] with rate s(t) ≥ 0; they diffuse with a given diffusion constant D and decay at a rate α. Both secretion and decay occur as random Poisson processes, and diffusion is realized as a discrete random walk with Gaussian-distributed displacements. The boundaries of V are taken to be reflecting. We track the positions of each of N particles {xi , i = 1, · · · , N } at intervals of ∆t = 1msec. The local concentration in a neighborhood of size σ around a location x is given by the convolution N δ(x − xi )g(x − x , σ) dx c(x, t) = ˆ (1) V i=1 where g(·, σ) is a normalized Gaussian distribution in the plane, with mean 0 and variance σ 2 . The motions of the individual particles cause c(x, t) to fluctuate about the mean conˆ centration, causing the local concentration at cell B, c(xr , t) to be a noisy, low-pass filtered ˆ version of the original signal s(t) (see Figure 1). Cell B, located at xr = [7.5µ, 2.5µ], registers the presence of ligand through binding and unbinding transitions, which form a two-state Markov process with time-varying transition rates. Given an unbound receptor, the binding transition happens at a rate that depends on the ligand concentration around the receptor: k+ c(xr , t). The size of the neighborhood σ ˆ reflects the range of the receptor, with binding most likely in a small region close to xr . Once the receptor is bound to a ligand molecule, no more binding events occur until the receptor releases the ligand. The receiver is insensitive to fluctuations in c(xr , t) while it is ˆ in the bound state (see Figure 1). The unbinding transition occurs with a fixed rate k− . For concreteness, we take values for D, α, k− , k+ , and σ appropriate for cyclic AMP signaling between Dictyostelium amoebae, a model organism for chemical communica1 tion: D = 0.25µ2 msec−1 , α = 1 sec−1 , σ = 0.1µ, k− = 1 sec−1 , k+ = 2πσ2 sec−1 . Kd = k− /k+ is the dissociation constant, the concentration at which the receptor on average is bound half the time. For the chosen values of the reaction constants k± , we have Figure 1: Biochemical Signaling Simulation. Top: Cell A secretes a signaling molecule (red dots) with a time-varying rate r(t). Molecules diffuse throughout the two-dimensional volume, leading to locally fluctuating concentrations that carry a corrupted version of the signal. Molecules within a neighborhood of cell B can bind to a receptor molecule, giving a received signal s(t) ∈ {0, 1}. Bottom Left: Input signal. Mean instantaneous rate of molecule release (thousands of molecules per second). Molecule release is a Poisson process with time-varying rate. Bottom Center: Local concentration fluctuations, as seen by cell B, indicated by the number of molecules within 0.2 microns of the receptor. The receptor is sensitive to fluctuations in local concentrations only while it is unbound. While the receptor is bound, it does not register changes in the local concentration (indicated by constant plateaus corresponding to intervals when r(t) = 1 in bottom right panel. Bottom Right: Output signal r(t). At each moment the receptor is either bound (1) or unbound (0). The receiver output is a piecewise constant function with a finite number of transitions. Kd ≈ 15.9 molecules ≈ 26.4nMol, comparable to the most sensitive values reported for µ2 the cyclic AMP receptor [2]. At this concentration the volume V = 50µ2 contains about 800 signaling molecules, assuming a nominal depth of 1µ. 3 Results: Estimating Information Capacity via Frequency Response Communications channels mediated by diffusion and ligand receptor interaction are nonlinear with non-Gaussian noise. The expected value of the output signal, 0 ≤ E[r] < 1, is a sigmoidal function of the log concentration for a constant concentration c: E[r] = 1 c = c + Kd 1 + e−(y−y0 ) (2) where y = ln(c), y0 = ln(Kd ). The mean response saturates for high concentrations, c Kd , and the noise statistics become pronouncedly Poissonian (rather than Gaussian) for low concentrations. Several different kinds of stimuli can be used to characterize such a channel. The steadystate response to constant input reflects the static (equilibrium) transfer function. Concentrations ranging from 100Kd to 0.01Kd occupy 98% of the steady-state operating range, 0.99 > E[r] > 0.01 [5]. For a finite observation time T the actual fraction of time spent bound, rT , is distributed about E[r] with a variance that depends on T . The biochemi¯ cal relay may be used as a binary symmetric channel randomly selecting a ‘high’ or ‘low’ secretion rate, and ‘decoding’ by setting a suitable threshold for rT . As T increases, the ¯ variance of rT and the probability of error decrease. ¯ The binary symmetric channel makes only crude use of this signaling mechanism. Other possible communication schemes include sending all-or-none bursts of signaling molecule, as in synaptic transmission, or detecting discrete stepped responses. Here we use the frequency response of the channel as a way of estimating the information capacity of the biochemical channel. For an idealized linear channel with additive white Gaussian noise (AWNG channel) the channel capacity under a mean input power constraint P is given by the so-called “waterfilling formula” [4], C= 1 2 ωmax log2 1 + ω=ωmin (ν − N (ω))+ N (ω) dω (3) given the constraining condition ωmax (ν − N (ω))+ dω ≤ P (4) ω=ωmin where the constant ν is the sum of the noise and the signal power in the usable frequency range, N (ω) is the power of the additive noise at frequency ω and (X)+ indicates the positive part of X. The formula applies when each frequency band (ω, ω +dω) is subject to noise of power N (ω) independently of all other frequency bands, and reflects the optimal allocation of signal power S(ω) = (ν − N (ω))+ , with greater signal power invested in frequencies at which the noise power is smallest. The capacity C is in bits/second. For an input signal of finite duration T = 100 sec, we can independently specify the amplitudes and phases of its frequency components at ω = [0.01 Hz, 0.02 Hz, · · · , 500 Hz], where 500 Hz is the Nyquist frequency given a 1 msec simulation timestep. Because the population of secreted signaling molecules decays exponentially with a time constant of 1/α = 1 sec, the concentration signal is unable to pass frequencies ω ≥ 1Hz (see Figure 2) providing a natural high-frequency cutoff. For the AWGN channel the input and Figure 2: Frequency Response of Biochemical Relay Channel. The sending cell secreted signaling molecules at a mean rate of 1000 + 1000 sin(2πωt) molecules per second. From top to bottom, the input frequencies were 1.0, 0.5, 0.2, 0.1, 0.05, 0.02 and 0.01 Hz. The total signal duration was T = 100 seconds. Left Column: Total number of molecules in the volume. Attenuation of the original signal results from exponential decay of the signaling molecule population. Right Column: A one-second moving average of the output signal r(t), which takes the value one when the receptor molecule is bound to ligand, and zero when the receptor is unbound. Figure 3: Frequency Transmission Spectrum Noise power N (ω), calculated as the total power in r(t)−¯ in all frequency components save the input frequency ω. Frequencies were r binned in intervals of 0.01 Hz = 1/T . The maximum possible power in r(t) over all frequencies is 0.25; the power successfully transmitted by the channel is given by 0.25/N (ω). The lower curve is N (ω) for input signals of the form s(t) = 1000 + 1000 sin 2πωt, which uses the full dynamic range of the receptor. Decreasing the dynamic range used reduces the amount of power transmitted at the sending frequency: the upper curve is N (ω) for signals of the form s(t) = 1000 + 500 sin 2πωt. output signals share the same units (e.g. rms voltage); for the biological relay the input s(t) is in molecules/second while the output r(t) is a function with binary range {r = 0, r = 1}. The maximum of the mean output power for a binary function r(t) T 2 1 is T t=0 |r(t) − r| dt ≤ 1 . This total possible output power will be distributed be¯ 4 tween different frequencies depending on the frequency of the input. We wish to estimate the channel capacity by comparing the portion of the output power present in the sending frequency ω to the limiting output power 0.25. Therefore we set the total output power constant to ν = 0.25. Given a pure sinusoidal input signal s(t) = a0 + a1 sin(2πωt), we consider the power in the output spectrum at ω Hz to be the residual power from the input and the rest of the power in the spectrum of r(t) to be analogous to the additive noise power spectrum N (ω) in the AWNG channel. We calculate N (ω) to be the total power of r(t) − r ¯ in all frequency bands except ω. For signals of length T = 100 sec, the possible frequencies are discretized at intervals ∆ω = 0.01 Hz. Because the noise power N (ω) ≤ 0.25, the water-filling formula (3) for the capacity reduces to 1 Cest = 2 1Hz log2 0.01Hz 0.25 N (ω) dω. (5) As mentioned above frequencies ω ≥ 1 Hz do not transmit any information about the signal (see Figure 2) and do not contribute to the capacity. We approximate this integral using linear interpolation of log2 (N (ω)) between the measured values at ω = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0] Hz. (See Figure 3.) This procedure gives an estimate of the channel capacity, Cest = 0.087 bits/second. 4 Discussion & Conclusions Diffusion and the Markov switching between bound and unbound states create a low-pass filter that removes high-frequency information in the biochemical relay channel. A general Poisson-type communications channel, such as commonly encountered in optical communications engineering, can achieve an arbitrarily large capacity by transmitting high frequencies and high amplitudes, unless bounded by a max or mean amplitude constraint [6]. In the biochemical channel, the effective input amplitude is naturally constrained by the saturation of the receptor at concentrations above the Kd . And the high frequency transmission is limited by the inherent dynamics of the Markov process. Therefore this channel has a finite capacity. The channel capacity estimate we derived, Cest = 0.087 bits/second, seems quite low compared to signaling rates in the nervous system, requiring long signaling times to transfer information successfully. However temporal dynamics in cellular systems can be quite deliberate; cell-cell communication in the social amoeba Dictyostelium, for example, is achieved by means of a carrier wave with a period of seven minutes. In addition, cells typically possess thousands of copies of the receptors for important signaling molecules, allowing for more complex detection schemes than those investigated here. Our simplified treatment suggests several avenues for further work. For example, signal transducing receptors often form Markov chains with more complicated dynamics reflecting many more than two states [7]. Also, the nonlinear nature of the channel is probably not well served by our additive noise approximation, and might be better suited to a treatment via multiplicative noise [8]. Whether cells engage in complicated temporal coding/decoding schemes, as has been proposed for neural information processing, or whether instead they achieve efficient communication by evolutionary matching of the noise characteristics of sender and receiver, remain to be investigated. We note that the dependence of the channel capacity C on such parameters as the system geometry, the diffusion and decay constants, the binding constants and the range of the receptor may shed light on evolutionary mechanisms and constraints on communication within cellular biological systems. Acknowledgments This work would not have been possible without the generous support of the Howard Hughes Medical Institute and the resources of the Computational Neurobiology Laboratory, Terrence J. Sejnowski, Director. References [1] Rappel, W.M., Thomas, P.J., Levine, H. & Loomis, W.F. (2002) Establishing Direction during Chemotaxis in Eukaryotic Cells. Biophysical Journal 83:1361-1367. [2] Ueda, M., Sako, Y., Tanaka, T., Devreotes, P. & Yanagida, T. (2001) Single Molecule Analysis of Chemotactic Signaling in Dictyostelium Cells. Science 294:864-867. [3] Detwiler, P.B., Ramanathan, S., Sengupta, A. & Shraiman, B.I. (2000) Engineering Aspects of Enzymatic Signal Transduction: Photoreceptors in the Retina. Biophysical Journal79:2801-2817. [4] Cover, T.M. & Thomas, J.A. (1991) Elements of Information Theory, New York: Wiley. [5] Getz, W.M. & Lansky, P. (2001) Receptor Dissociation Constants and the Information Entropy of Membranes Coding Ligand Concentration. Chem. Senses 26:95-104. [6] Frey, R.M. (1991) Information Capacity of the Poisson Channel. IEEE Transactions on Information Theory 37(2):244-256. [7] Uteshev, V.V. & Pennefather, P.S. (1997) Analytical Description of the Activation of Multi-State Receptors by Continuous Neurotransmitter Signals at Brain Synapses. Biophysical Journal72:11271134. [8] Mitra, P.P. & Stark, J.B. (2001) Nonlinear limits to the information capacity of optical fibre communications. Nature411:1027-1030.

3 0.79502851 189 nips-2003-Tree-structured Approximations by Expectation Propagation

Author: Yuan Qi, Tom Minka

Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1

4 0.77311522 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

Author: Quaid D. Morris, Brendan J. Frey

Abstract: This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem. 1 Introduction and motivation The inference of hidden graphs from noisy edge appearance data is an important problem with obvious practical application. For example, biologists are currently building networks of all the physical protein-protein interactions (PPI) that occur in particular organisms. The importance of this enterprise is commensurate with its scale: a completed network would be as valuable as a completed genome sequence, and because each organism contains thousands of different types of proteins, there are millions of possible types of interactions. However, scalable experimental methods for detecting interactions are noisy, generating many false detections. Motivated by this application, we formulate the general problem of inferring hidden graphs as probabilistic inference in a graphical model, and we introduce an efficient algorithm that approximates the posterior probability that an edge is present. In our model, a set of hidden, constituent graphs are combined to generate the observed graph. Each hidden graph is independently sampled from a prior on graph structure. The combination mechanism acts independently on each edge but can be either stochastic or deterministic. Figure 1 shows an example of our generative model. Typically one of the hidden graphs represents the graph of interest (the true graph), the others represent different types of observation noise. Independent edge noise may also be added by the combination mechanism. We use probabilistic inference to compute a likely decomposition of the observed graph into its constituent parts. This process is deemed “untangling”. We use the term “denoising” to refer to the special case where the edge noise is independent. In denoising there is a single hidden graph, the true graph, and all edge noise in the observed graph is due E1 1 eij E2 i 2 eij j xij i j i j X Figure 1: Illustrative generative model example. Figure shows an example where an observed graph, X, is a noisy composition of two constituent graphs, E 1 and E 2 . All graphs share the same vertex set, so each can be represented by a symmetric matrix of random binary variables (i.e., an adjacency matrix). This generative model is designed to solve a toy counter-espionage problem. The vertices represent suspects and each edge in X represents an observed call between two suspects. The graph X reflects zero or more spy rings (represented by E 1 ), telemarketing calls (represented by E 2 ), social calls (independent edge noise), and lost call records (more independent edge noise). The task is to locate any spy rings hidden in X. We model the distribution of spy ring graphs using a prior, P (E 1 ), that has support only on graphs where all vertices have degree of either 2 (i.e., are in the ring) or 0 (i.e., are not). Graphs of telemarketing call patterns are represented using a prior, P (E 2 ), under which all nodes have degrees of > 3 (i.e., are telemarketers), 1 (i.e., are telemarketees), or 0 (i.e., are neither). The displayed hidden graphs are one likely untangling of X. to the combination mechanism. Prior distributions over graphs can be specified in various ways, but our choice is motivated by problems we want to solve, and by a view to deriving an efficient inference algorithm. One compact representation of a distribution over graphs consists of specifying a distribution over vertex degrees, and assuming that graphs that have the same vertex degrees are equiprobable. Such a prior can model quite rich distributions over graphs. These degree-based structure priors are natural representions of graph structure; many classes of real-world networks have a characteristic functional form associated with their degree distributions [1], and sometimes this form can be predicted using knowledge about the domain (see, e.g., [2]) or detected empirically (see, e.g., [3, 4]). As such, our model incorporates degree-based structure priors. Though exact inference in our model is intractable in general, we present an efficient algorithm for approximate inference for arbitrary degree distributions. We evaluate our model and algorithm using the real-world example of untangling yeast proteinprotein interaction networks. 2 A model of noisy and tangled graphs For degree-based structure priors, inference consists of searching over vertex degrees and edge instantiations, while comparing each edge with its noisy observation and enforcing the constraint that the number of edges connected to every vertex must equal the degree of the vertex. Our formulation of the problem in this way is inspired by the success of the sum-product algorithm (loopy belief propagation) for solving similar formulations of problems in error-correcting decoding [6, 7], phase unwrapping [8], and random satisfiability [9]. For example, in error-correcting decoding, inference consists of searching over configurations of codeword bits, while comparing each bit with its noisy observation and enforcing parity-check constraints on subsets of bits [10]. For a graph on a set of N vertices, eij is a variable that indicates the presence of an edge connecting vertices i and j: eij = 1 if there is an edge, and eij = 0 otherwise. We assume the vertex set is fixed, so each graph is specified by an adjacency matrix, E = {eij }N . The degree of vertex i is denoted by di and the i,j=1 degree set by D = {di }N . The observations are given by a noisy adjacency matrix, i=1 X = {xij }N . Generally, edges can be directed, but in this paper we focus on i,j=1 undirected graphs, so eij = eji and xij = xji . Assuming the observation noise is independent for different edges, the joint distribution is P (X, E, D) = P (X|E)P (E, D) = P (xij |eij ) P (E, D). j≥i P (xij |eij ) models the edge observation noise. We use an undirected model for the joint distribution over edges and degrees, P (E, D), where the prior distribution over di is determined by a non-negative potential fi (di ). Assuming graphs that have the same vertex degrees are equiprobable, we have N eij ) , fi (di )I(di , P (E, D) ∝ i j=1 where I(a, b) = 1 if a = b, and I(a, b) = 0 if a = b. The term I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . It is straightforward to show that the marginal distribution over di is P (di ) ∝ fi (di ) D\di nD j=i fj (dj ) , where nD is the number of graphs with degrees D and the sum is over all degree variables except di . The potentials, fi , can be estimated from a given degree prior using Markov chain Monte Carlo; or, as an approximation, they can be set to an empirical degree distribution obtained from noise-free graphs. Fig 2a shows the factor graph [11] for the above model. Each filled square corresponds to a term in the factorization of the joint distribution and the square is connected to all variables on which the term depends. Factor graphs are graphical models that unify the properties of Bayesian networks and Markov random fields [12]. Many inference algorithms, including the sum-product algorithm (a.k.a. loopy belief propagation), are more easily derived using factor graphs than Bayesian networks or Markov random fields. We describe the sum-product algorithm for our model in section 3. (a) I(d ,e + e +e +e 4 14 24 34 44 d1 e11 e12 e14 4 d3 d2 e13 f 4(d ) e22 e23 e24 (b) ) d1 d4 e33 e34 e1 e44 11 x11 x11 x12 x13 x14 x22 x23 x24 x33 d1 1 x34 x44 e2 11 e1 12 x12 e2 12 d1 2 e1 13 e1 e2 13 e1 14 x13 e1 22 x14 e2 14 d1 3 23 x22 e2 22 x23 e2 23 4 e1 e1 24 e2 24 e1 33 x24 34 x33 e2 33 x34 e2 34 e1 44 x44 e2 44 P( x44 | e44 ) (c) d4 s41 e14 e24 d2 1 d4 e34 e44 e14 s42 e24 s43 e34 d2 2 d2 3 d2 4 s44 P( x e44 44 1 ,e 2 ) 44 44 |e Figure 2: (a) A factor graph that describes a distribution over graphs with vertex degrees di , binary edge indicator variables eij , and noisy edge observations xij . The indicator function I(di , j eij ) enforces the constraint that the sum of the binary edge indicator variables for vertex i must equal the degree of vertex i. (b) A factor graph that explains noisy observed edges as a combination of two constituent graphs, with edge indicator variables e 1 and e2 . ij ij (c) The constraint I(di , j eij ) can be implemented using a chain with state variables, which leads to an exponentially faster message-passing algorithm. 2.1 Combining multiple graphs The above model is suitable when we want to infer a graph that matches a degree prior, assuming the edge observation noise is independent. A more challenging goal, with practical application, is to infer multiple hidden graphs that combine to explain the observed edge data. In section 4, we show how priors over multiple hidden graphs can be be used to infer protein-protein interactions. When there are H hidden graphs, each constituent graph is specified by a set of edges on the same set of N common vertices. For the degree variables and edge variables, we use a superscript to indicate which hidden graph the variable is used to describe. Assuming the graphs are independent, the joint distribution over the observed edge data X, and the edge variables and degree variables for the hidden graphs, E 1 , D1 , . . . , E H , DH , is H P (X, E 1 , D1 , . . . , E H , DH ) = P (xij |e1 , . . . , eH ) ij ij j≥i P (E h , Dh ), (1) h=1 where for each hidden graph, P (E h , Dh ) is modeled as described above. Here, the likelihood P (xij |e1 , . . . , eH ) describes how the edges in the hidden graphs combine ij ij to model the observed edge. Figure 2b shows the factor graph for this model. 3 Probabilistic inference of constituent graphs Exact probabilistic inference in the above models is intractable, here we introduce an approximate inference algorithm that consists of applying the sum-product algorithm, while ignoring cycles in the factor graph. Although the sum-product algorithm has been used to obtain excellent results on several problems [6, 7, 13, 14, 8, 9], we have found that the algorithm works best when the model consists of uncertain observations of variables that are subject to a large number of hard constraints. Thus the formulation of the model described above. Conceptually, our inference algorithm is a straight-forward application of the sumproduct algorithm, c.f. [15], where messages are passed along edges in the factor graph iteratively, and then combined at variables to obtain estimates of posterior probabilities. However, direct implementation of the message-passing updates will lead to an intractable algorithm. In particular, direct implementation of the update for the message sent from function I(di , j eij ) to edge variable eik takes a number of scalar operations that is exponential in the number of vertices. Fortunately there exists a more efficient way to compute these messages. 3.1 Efficiently summing over edge configurations The function I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . Passing messages through this function requires summing over all edge configurations that correspond to each possible degree, di , and summing over di . Specifically, the message, µIi →eik (eik ), sent from function I(di , j eij ) to edge variable eik is given by I(di , di {eij | j=1,...,N, j=k} eij ) j µeij →Ii (eij ) , j=k where µeij →Ii (eij ) is the message sent from eij to function I(di , j eij ). The sum over {eij | j = 1, . . . , N, j = k} contains 2N −1 terms, so direct computation is intractable. However, for a maximum degree of dmax , all messages departing from the function I(di , j eij ) can be computed using order dmax N binary scalar operations, by introducing integer state variables sij . We define sij = n≤j ein and note that, by recursion, sij = sij−1 + eij , where si0 = 0 and 0 ≤ sij ≤ dmax . This recursive expression enables us to write the high-complexity constraint as the sum of a product of low-complexity constraints, N I(di , eij ) = j I(si1 , ei1 ) {sij | j=1,...,N } I(sij , sij−1 + eij ) I(di , siN ). j=2 This summation can be performed using the forward-backward algorithm. In the factor graph, the summation can be implemented by replacing the function I(di , j eij ) with a chain of lower-complexity functions, connected as shown in Fig. 2c. The function vertex (filled square) on the far left corresponds to I(si1 , ei1 ) and the function vertex in the upper right corresponds to I(di , siN ). So, messages can be passed through each constraint function I(di , j eij ) in an efficient manner, by performing a single forward-backward pass in the corresponding chain. 4 Results We evaluate our model using yeast protein-protein interaction (PPI) data compiled by [16]. These data include eight sets of putative, but noisy, interactions derived from various sources, and one gold-standard set of interactions detected by reliable experiments. Using the ∼ 6300 yeast proteins as vertices, we represent the eight sets of putative m interactions using adjacency matrices {Y m }8 m=1 where yij = 1 if and only if putative interaction dataset m contains an interaction between proteins i and j. We similarly use Y gold to represent the gold-standard interactions. m We construct an observed graph, X, by setting xij = maxm yij for all i and j, thus the observed edge set is the union of all the putative edge sets. We test our model (a) (b) 40 0 untangling baseline random empirical potential posterior −2 30 log Pr true positives (%) 50 20 10 −4 −6 −8 0 0 5 10 −10 0 false positives (%) 10 20 30 degree (# of nodes) Figure 3: Protein-protein interaction network untangling results. (a) ROC curves measuring performance of predicting e1 when xij = 1. (b) Degree distributions. Compares the empirical ij degree distribution of the test set subgraph of E 1 to the degree potential f 1 estimated on the ˆ ij training set subgraph of E 1 and to the distribution of di = j pij where pij = P (e1 = 1|X) is estimated by untangling. on the task of discerning which of the edges in X are also in Y gold . We formalize this problem as that of decomposing X into two constituent graphs E 1 and E 2 , the gold true and the noise graphs respectively, such that e1 = xij yij and e2 = xij − e1 . ij ij ij We use a training set to fit our model parameters and then measure task performance on a test set. The training set contains a randomly selected half of the ∼ 6300 yeast proteins, and the subgraphs of E 1 , E 2 , and X restricted to those vertices. The test contains the other half of the proteins and the corresponding subgraphs. Note that interactions connecting test set proteins to training set proteins (and vice versa) are ignored. We fit three sets of parameters: a set of Naive Bayes parameters that define a set of edge-specific likelihood functions, Pij (xij |e1 , e2 ), one degree potential, f 1 , which ij ij is the same for every vertex in E1 and defines the prior P (E 1 ), and a second, f 2 , that similarly defines the prior P (E 2 ). The likelihood functions, Pij , are used to both assign likelihoods and enforce problem constraints. Given our problem definition, if xij = 0 then e1 = e2 = 0, ij ij otherwise xij = 1 and e1 = 1 − e2 . We enforce the former constraint by setij ij ting Pij (xij = 0|e1 , e2 ) = (1 − e1 )(1 − e2 ), and the latter by setting Pij (xij = ij ij ij ij 1|e1 , e2 ) = 0 whenever e1 = e2 . This construction of Pij simplifies the calculation ij ij ij ij of the µPij →eh messages and improves the computational efficiency of inference beij cause when xij = 0, we need never update messages to and from variables e1 and ij e2 . We complete the specification of Pij (xij = 1|e1 , e2 ) as follows: ij ij ij ym Pij (xij = 1|e1 , e2 ) ij ij = m ij θm (1 − θm )1−yij , if e1 = 1 and e2 = 0, ij ij ym m ij ψm (1 − ψm )1−yij , if e1 = 0 and e2 = 1. ij ij where {θm } and {ψm } are naive Bayes parameters, θm = i,j m yij e1 / ij i,j e1 and ij ψm = i,j m yij e2 / ij i,j e2 , respectively. ij The degree potentials f 1 (d) and f 2 (d) are kernel density estimates fit to the degree distribution of the training set subgraphs of E 1 and E 2 , respectively. We use Gaussian kernels and set the width parameter (standard deviation) σ using leaveone-out cross-validation to maximize the total log density of the held-out datapoints. Each datapoint is the degree of a single vertex. Both degree potentials closely followed the training set empirical degree distributions. Untangling was done on the test set subgraph of X. We initially set the µ Pij →e1 ij messages equal to the likelihood function Pij and we randomly initialized the 1 µIj →e1 messages with samples from a normal distribution with mean 0 and variij ance 0.01. We then performed 40 iterations of the following message update order: 2 2 1 1 µe1 →Ij , µIj →e1 , µe1 →Pij , µPij →e2 , µe2 →Ij , µIj →e2 , µe2 →Pij , µPij →e1 . ij ij ij ij ij ij ij ij We evaluated our untangling algorithm using an ROC curve by comparing the actual ˆ test set subgraph of E 1 to posterior marginal probabilities,P (e1 = 1|X), estimated ij by our sum-product algorithm. Note that because the true interaction network is sparse (less than 0.2% of the 1.8 × 107 possible interactions are likely present [16]) and, in this case, true positive predictions are of greater biological interest than true negative predictions, we focus on low false positive rate portions of the ROC curve. Figure 3a compares the performance of a classifier for e1 based on thresholding ij ˆ P (eij = 1|X) to a baseline method based on thresholding the likelihood functions, Pij (xij = 1|e1 = 1, e2 = 0). Note because e1 = 0 whenever xij = 0, we exclude ij ij ij the xij = 0 cases from our performance evaluation. The ROC curve shows that for the same low false positive rate, untangling produces 50% − 100% more true positives than the baseline method. Figure 3b shows that the degree potential, the true degree distribution, and the predicted degree distribution are all comparable. The slight overprediction of the true degree distribution may result because the degree potential f 1 that defines P (E 1 ) is not equal to the expected degree distribution of graphs sampled from the distribution P (E 1 ). 5 Summary and Related Work Related work includes other algorithms for structure-based graph denoising [17, 18]. These algorithms use structural properties of the observed graph to score edges and rely on the true graph having a surprisingly large number of three (or four) edge cycles compared to the noise graph. In contrast, we place graph generation in a probabilistic framework; our algorithm computes structural fit in the hidden graph, where this computation is not affected by the noise graph(s); and we allow for multiple sources of observation noise, each with its own structural properties. After submitting this paper to the NIPS conference, we discovered [19], in which a degree-based graph structure prior is used to denoise (but not untangle) observed graphs. This paper addresses denoising in directed graphs as well as undirected graphs, however, the prior that they use is not amenable to deriving an efficient sumproduct algorithm. Instead, they use Markov Chain Monte Carlo to do approximate inference in a hidden graph containing 40 vertices. It is not clear how well this approach scales to the ∼ 3000 vertex graphs that we are using. In summary, the contributions of the work described in this paper include: a general formulation of the problem of graph untangling as inference in a factor graph; an efficient approximate inference algorithm for a rich class of degree-based structure priors; and a set of reliability scores (i.e., edge posteriors) for interactions from a current version of the yeast protein-protein interaction network. References [1] A L Barabasi and R Albert. Emergence of scaling in random networks. Science, 286(5439), October 1999. [2] A Rzhetsky and S M Gomez. Birth of scale-free molecular networks and the number of distinct dna and protein domains per genome. Bioinformatics, pages 988–96, 2001. [3] M Faloutsos, P Faloutsos, and C Faloutsos. On power-law relationships of the Internet topology. Computer Communications Review, 29, 1999. [4] Hawoong Jeong, B Tombor, R´ka Albert, Z N Oltvai, and Albert-L´szl´ Barab´si. e a o a The large-scale organization of metabolic networks. Nature, 407, October 2000. [5] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo CA., 1988. [6] D. J. C. MacKay and R. M. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645–1646, August 1996. Reprinted in Electronics Letters, vol. 33, March 1997, 457–458. [7] B. J. Frey and F. R. Kschischang. Probability propagation and iterative decoding. In Proceedings of the 1996 Allerton Conference on Communication, Control and Computing, 1996. [8] B. J. Frey, R. Koetter, and N. Petrovic. Very loopy belief propagation for unwrapping phase images. In 2001 Conference on Advances in Neural Information Processing Systems, Volume 14. MIT Press, 2002. [9] M. M´zard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random e satisfiability problems. Science, 297:812–815, 2002. [10] B. J. Frey and D. J. C. MacKay. Trellis-constrained codes. In Proceedings of the 35 th Allerton Conference on Communication, Control and Computing 1997, 1998. [11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, 47(2):498–519, February 2001. [12] B. J. Frey. Factor graphs: A unification of directed and undirected graphical models. University of Toronto Technical Report PSI-2003-02, 2003. [13] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence 1999. Stockholm, Sweden, 1999. [14] W. Freeman and E. Pasztor. Learning low-level vision. In Proceedings of the International Conference on Computer Vision, pages 1182–1189, 1999. [15] M. I. Jordan. An Inroduction to Learning in Graphical Models. 2004. In preparation. [16] C von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002. [17] R Saito, H Suzuki, and Y Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, pages 756–63, 2003. [18] D S Goldberg and F P Roth. Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Science, 2003. [19] S M Gomez and A Rzhetsky. Towards the prediction of complete protein–protein interaction networks. In Pacific Symposium on Biocomputing, pages 413–24, 2002.

5 0.77297145 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

Author: Claudio Gentile

Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1

6 0.77250469 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

7 0.77161026 100 nips-2003-Laplace Propagation

8 0.77060795 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

9 0.76830375 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

10 0.76612514 158 nips-2003-Policy Search by Dynamic Programming

11 0.76199639 146 nips-2003-Online Learning of Non-stationary Sequences

12 0.7603218 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

13 0.76030535 117 nips-2003-Linear Response for Approximate Inference

14 0.76030326 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

15 0.75844091 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

16 0.75569808 78 nips-2003-Gaussian Processes in Reinforcement Learning

17 0.7542811 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

18 0.75399178 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

19 0.75392234 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

20 0.75272387 126 nips-2003-Measure Based Regularization