nips nips2003 nips2003-101 knowledge-graph by maker-knowledge-mining

101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates


Source: pdf

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Large margin classifiers: convex loss, low noise, and convergence rates Peter L. [sent-1, score-0.272]

2 Abstract Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. [sent-6, score-0.704]

3 We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. [sent-7, score-1.752]

4 We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. [sent-8, score-0.752]

5 The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. [sent-9, score-0.449]

6 Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class. [sent-11, score-0.168]

7 In particular, a wide variety of two-class classification methods choose a real-valued classifier f based on the minimization of a convex surrogate φ(y f(x)) in the place of an intractable loss function 1(sign(f(x)) ≠ y). [sent-13, score-0.729]

8 Examples of this tactic include the support vector machine, AdaBoost, and logistic regression, which are based on the exponential loss, the hinge loss and the logistic loss, respectively. [sent-14, score-0.464]
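
The three surrogates named above are easy to write down. A minimal sketch (plain Python; the function names are ours) comparing them with the 0-1 loss; note we use the ln(1 + e^(−m)) form of the logistic loss here, whereas Table 1 below uses the ln(1 + e^(−2α)) parametrization:

```python
import math

# Margin-based losses phi(m), where m = y * f(x) is the margin.
def zero_one(m):    return 1.0 if m <= 0 else 0.0       # the intractable 0-1 target
def exponential(m): return math.exp(-m)                  # AdaBoost
def hinge(m):       return max(0.0, 1.0 - m)             # support vector machine
def logistic(m):    return math.log(1.0 + math.exp(-m))  # logistic regression (natural log)

# Each convex surrogate upper-bounds the 0-1 loss
# (the logistic loss after rescaling by 1/ln 2).
for m in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert exponential(m) >= zero_one(m)
    assert hinge(m) >= zero_one(m)
    assert logistic(m) / math.log(2.0) >= zero_one(m)
```

The rescaling needed for the logistic loss is exactly the "positive scaling factor" mentioned later in the discussion of upper bounds on 0-1 loss.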

9 What are the statistical consequences of choosing models and estimation procedures so as to exploit the computational advantages of convexity? [sent-15, score-0.099]

10 In particular, it is possible to demonstrate the Bayes-risk consistency of methods based on minimizing convex surrogates for 0-1 loss, with appropriate regularization. [sent-17, score-0.305]

11 Lugosi and Vayatis (2003) have provided such a result for any differentiable, monotone, strictly convex loss function φ that satisfies φ(0) = 1. [sent-18, score-0.585]

12 Steinwart (2002) has demonstrated consistency for the SVM as well, where F is a reproducing kernel Hilbert space and φ is continuous. [sent-20, score-0.134]

13 Other results on Bayes-risk consistency have been presented by Jiang (2003), Zhang (2003), and Mannor et al. [sent-21, score-0.134]

14 To carry this agenda further, it is necessary to find general quantitative relationships between the approximation and estimation errors associated with φ, and those associated with 0-1 loss. [sent-23, score-0.117]

15 We simplify and extend Zhang’s results, developing a general methodology for finding quantitative relationships between the risk associated with φ and the risk associated with 0-1 loss. [sent-25, score-0.645]

16 In particular, let R(f) denote the risk based on 0-1 loss and let R∗ = inf_f R(f) denote the Bayes risk. [sent-26, score-0.807]

17 Similarly, let us refer to Rφ(f) = Eφ(Y f(X)) as the “φ-risk,” and let Rφ∗ = inf_f Rφ(f) denote the “optimal φ-risk. [sent-27, score-0.19]

” We show that, for all measurable f, ψ(R(f) − R∗) ≤ Rφ(f) − Rφ∗, (1) for a nondecreasing function ψ : [0, 1] → [0, ∞), and that no better bound is possible. [sent-28, score-0.209]
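
Inequality (1) can be spot-checked numerically. A toy sketch (ours, not the paper's): a one-point X with η = P(Y = 1 | X) = 0.75, constant classifiers f ≡ α, and the quadratic loss, for which the ψ-transform is ψ(θ) = θ²:

```python
def phi(a):                 # quadratic loss
    return (1.0 - a) ** 2

def C(eta, a):              # conditional phi-risk C_eta(alpha)
    return eta * phi(a) + (1 - eta) * phi(-a)

def psi(theta):             # psi-transform of the quadratic loss
    return theta ** 2

eta = 0.75
# optimal phi-risk H(eta) by grid search over alpha in [-3, 3]
H = min(C(eta, -3 + 6 * i / 60000) for i in range(60001))

for alpha in (-1.0, -0.2, 0.1, 0.8):
    # 0-1 excess risk of the constant classifier sign(alpha): 2*eta - 1 if it
    # disagrees with the Bayes sign (here positive, since eta > 1/2), else 0.
    excess_risk = (2 * eta - 1) if alpha <= 0 else 0.0
    excess_phi_risk = C(eta, alpha) - H
    assert psi(excess_risk) <= excess_phi_risk + 1e-9   # inequality (1)
```

For the quadratic loss the minimizer is α∗ = 2η − 1, so H(η) = 4η(1 − η) = 0.75 here, and the grid search recovers it.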

19 This result suggests that if ψ is well-behaved then minimization of Rφ (f ) may provide a reasonable surrogate for minimization of R(f ). [sent-30, score-0.252]

20 Moreover, the result provides a quantitative way to transfer assessments of statistical error in terms of “excess φ-risk” Rφ(f) − Rφ∗ into assessments of error in terms of “excess risk” R(f) − R∗. [sent-31, score-0.249]

21 Although our principal goal is to understand the implications of convexity in classification, we do not impose a convexity assumption on φ at the outset. [sent-32, score-0.652]

22 Indeed, while conditions such as convexity, continuity, and differentiability of φ are easy to verify and have natural relationships to optimization procedures, it is not immediately obvious how to relate such conditions to their statistical consequences. [sent-33, score-0.226]

23 Thus, in Section 2 we consider the weakest possible condition on φ—that it is “classification-calibrated,” which is essentially a pointwise form of Fisher consistency for classification. [sent-34, score-0.318]

24 We show that minimizing φ-risk leads to minimal risk precisely when φ is classification-calibrated. [sent-35, score-0.283]

25 We show that in this setting we are able to obtain an improvement in the relationship between excess φ-risk and excess risk. [sent-37, score-0.573]

26 Section 4 turns to the estimation of convergence rates for empirical φ-risk minimization in the low noise setting. [sent-38, score-0.285]

27 We find that for convex φ satisfying a certain uniform convexity condition, empirical φ-risk minimization yields convergence of misclassification risk to that of the best-performing classifier in F, and the rate of convergence can be strictly faster than the classical parametric rate of n^(−1/2). [sent-39, score-1.277]

28 We give estimates for this error that are valid for any measurable function. [sent-42, score-0.187]

29 Define the Bayes risk R∗ = inf f R(f ), where the infimum is over all measurable f . [sent-57, score-0.625]

30 For η ∈ [0, 1], define the optimal conditional φ-risk H(η) = inf_{α∈ℝ} Cη(α) = inf_{α∈ℝ} (ηφ(α) + (1 − η)φ(−α)). [sent-67, score-0.38]

Then the optimal φ-risk satisfies Rφ∗ := inf_f Rφ(f) = EH(η(X)), where the infimum is over measurable functions. [sent-68, score-0.342]

32 For η ∈ [0, 1], define H−(η) = inf_{α : α(2η−1)≤0} Cη(α) = inf_{α : α(2η−1)≤0} (ηφ(α) + (1 − η)φ(−α)). [sent-69, score-0.38]
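
Both H(η) and H−(η) can be approximated by a grid search over α. A small sketch (our own illustration) using the hinge loss mentioned earlier, for which the standard closed forms are H(η) = 2 min(η, 1 − η) and H−(η) = Cη(0) = 1:

```python
def hinge(a):
    return max(0.0, 1.0 - a)

def C(eta, a):  # conditional phi-risk C_eta(alpha)
    return eta * hinge(a) + (1.0 - eta) * hinge(-a)

def H(eta, lo=-5.0, hi=5.0, steps=10001):
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(C(eta, a) for a in grid)

def H_minus(eta, lo=-5.0, hi=5.0, steps=10001):
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    # restrict to alpha whose sign disagrees with 2*eta - 1
    return min(C(eta, a) for a in grid if a * (2.0 * eta - 1.0) <= 0.0)

eta = 0.8
assert abs(H(eta) - 2 * min(eta, 1 - eta)) < 1e-3   # hinge: H(eta) = 2 min(eta, 1-eta)
assert abs(H_minus(eta) - 1.0) < 1e-3               # hinge: H^-(eta) = C_eta(0) = 1
```

The gap H−(η) − H(η), evaluated at η = (1 + θ)/2, is exactly the quantity convexified to form the ψ-transform below.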

33 This is the optimal value of the conditional φ-risk, under the constraint that the sign of the argument α disagrees with that of 2η − 1. [sent-70, score-0.241]

34 We now turn to the basic condition we impose on φ. [sent-71, score-0.085]

35 This condition generalizes the requirement that the minimizer of Cη (α) (if it exists) has the correct sign. [sent-72, score-0.171]

36 This is a minimal condition that can be viewed as a form of Fisher consistency for classification (Lin, 2001). [sent-73, score-0.219]

37 The following functional transform of the loss function will be useful in our main result. [sent-76, score-0.334]

38 We define the ψ-transform of a loss function as follows. [sent-78, score-0.334]

39 Equivalently, the epigraph of g∗∗ is the closure of the convex hull of the epigraph of g. [sent-80, score-0.327]

40 We calculate the ψ-transform for exponential loss, logistic loss, quadratic loss and truncated quadratic loss, tabulating the results in Table 1. [sent-83, score-0.634]

41 All of these loss functions can be verified to be classification-calibrated. [sent-84, score-0.367]

42 Table 1: Four convex loss functions and the corresponding ψ-transform. [sent-87, score-0.838]

    loss                  φ(α)                ψ(θ)             L_B        δ(ε)
    exponential           e^(−α)              1 − √(1 − θ²)    e^B        e^(−B) ε²/8
    logistic              ln(1 + e^(−2α))     θ²/2             2          e^(−2B) ε²/4
    quadratic             (1 − α)²            θ²               2(B + 1)   ε²/4
    truncated quadratic   (max{0, 1 − α})²    θ²               2(B + 1)   ε²/4

43 On the interval [−B, B], each loss function has the indicated Lipschitz constant L_B and modulus of convexity δ(ε) with respect to dφ. [sent-88, score-0.894]
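
The ψ(θ) entries can be sanity-checked from the definitions, since ψ̃(θ) = H−((1+θ)/2) − H((1+θ)/2) and, for the exponential loss, ψ̃ is already convex with the standard closed forms H(η) = 2√(η(1−η)) and H−(η) = φ(0) = 1, giving ψ(θ) = 1 − √(1 − θ²). A numerical sketch (our own check, not the paper's code):

```python
import math

def phi(a):                      # exponential loss
    return math.exp(-a)

def C(eta, a):                   # conditional phi-risk
    return eta * phi(a) + (1 - eta) * phi(-a)

def H(eta):
    # closed-form minimizer is alpha* = (1/2) ln(eta/(1-eta));
    # we use a grid to stay generic
    grid = [-8 + 16 * i / 20000 for i in range(20001)]
    return min(C(eta, a) for a in grid)

def psi_tilde(theta):            # valid for theta >= 0
    eta = (1 + theta) / 2
    H_minus = C(eta, 0.0)        # constrained optimum sits at alpha = 0 here
    return H_minus - H(eta)

for theta in (0.1, 0.5, 0.9):
    expected = 1 - math.sqrt(1 - theta ** 2)   # Table 1 entry for exponential loss
    assert abs(psi_tilde(theta) - expected) < 1e-3
```

The quadratic-loss entry ψ(θ) = θ² can be verified the same way after swapping in φ(α) = (1 − α)².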

44 For any nonnegative loss function φ, any measurable f : X → ℝ and any probability distribution on X × {±1}, ψ(R(f) − R∗) ≤ Rφ(f) − Rφ∗. [sent-92, score-0.564]

45 For any nonnegative loss function φ, any ε > 0 and any θ ∈ [0, 1], there is a probability distribution on X × {±1} and a function f : X → ℝ such that R(f) − R∗ = θ and ψ(θ) ≤ Rφ(f) − Rφ∗ ≤ ψ(θ) + ε. [sent-95, score-0.412]

46 (c) For every sequence of measurable functions fi : X → ℝ and every probability distribution on X × {±1}, Rφ(fi) → Rφ∗ implies R(fi) → R∗. [sent-100, score-0.269]

47 Remark: It can be shown that classification-calibration implies ψ is invertible on [0, 1], in which case it is meaningful to write the upper bound on excess risk as ψ⁻¹(Rφ(f) − Rφ∗). [sent-101, score-0.695]

48 Remark: Zhang (2003) has given a comparison theorem like Part 1, for convex φ that satisfy certain conditions. [sent-102, score-0.252]

49 Then it is easy to verify that Rφ(f) − Rφ∗ ≤ γψ̃(α₁) + (1 − γ)ψ̃(α₂) + ε/2 ≤ ψ(θ) + ε. [sent-130, score-0.1]

50 For Part 3, first note that, for any φ, ψ is continuous on [0, 1] and ψ(0) = 0 by Lemma 4, part 6, and hence θi → 0 implies ψ(θi ) → 0. [sent-132, score-0.092]

51 Thus, we can replace condition (3b) by (3b’): for any sequence (θi) in [0, 1], ψ(θi) → 0 implies θi → 0. [sent-133, score-0.137]

52 Define c = lim sup θi > 0, and pass to a subsequence with lim θi = c. [sent-135, score-0.102]

53 Then lim ψ(θi ) = ψ(c) by continuity, and ψ(c) > 0 by classification-calibration (Lemma 4, part 7). [sent-136, score-0.091]

54 It shows that if φ is convex, the classification-calibration condition is easy to verify and the ψ-transform is a little easier to compute. [sent-142, score-0.185]

55 All of the classification procedures mentioned in earlier sections utilize surrogate loss functions which are either upper bounds on 0-1 loss or can be transformed into upper bounds via a positive scaling factor. [sent-149, score-0.935]

56 3 Tighter bounds under low noise conditions In a study of the convergence rate of empirical risk minimization, Tsybakov (2001) provided a useful condition on the behavior of the posterior probability near the optimal decision boundary {x : η(x) = 1/2}. [sent-153, score-0.583]

57 Tsybakov’s condition is useful in our setting as well; as we show in this section, it allows us to obtain a refinement of Theorem 3. [sent-154, score-0.085]

58 We say that P has noise exponent α ≥ 0 if there is a c > 0 such that every measurable f : X → ℝ has PX(sign(f(X)) ≠ sign(η(X) − 1/2)) ≤ c (R(f) − R∗)^α. [sent-156, score-0.291]

59 On the other hand, it is easy to verify that α = 1 if and only if |2η(X) − 1| ≥ 1/c a.s. [sent-159, score-0.1]
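
The noise-exponent definition can be made concrete on a toy distribution (our own illustration): X uniform on [0, 1] with η(x) = x, so the Bayes boundary is x = 1/2. For the thresholded classifier f_t(x) = x − t with threshold t, the disagreement mass is |t − 1/2| while the excess risk is (t − 1/2)², so this P has noise exponent α = 1/2 with c = 1:

```python
# Disagreement mass P_X(sign(f_t(X)) != sign(eta(X) - 1/2)) for f_t(x) = x - t.
def mismatch_mass(t):
    return abs(t - 0.5)

# Excess risk R(f_t) - R*: integral of |2*eta(x) - 1| over the disagreement
# region, computed here by the midpoint rule.
def excess_risk(t, n=100000):
    lo, hi = min(t, 0.5), max(t, 0.5)
    xs = [lo + (hi - lo) * (i + 0.5) / n for i in range(n)]
    return sum(abs(2 * x - 1) for x in xs) * (hi - lo) / n

# mismatch mass = (excess risk)^(1/2), i.e. alpha = 1/2 in the definition above
for t in (0.6, 0.7, 0.9):
    assert abs(mismatch_mass(t) - excess_risk(t) ** 0.5) < 1e-3
```

Distributions with η bounded away from 1/2 would instead give α = 1, the case used in the rate result of Section 4.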

60 Suppose P has noise exponent 0 < α ≤ 1, and φ is classification-calibrated. [sent-163, score-0.105]

61 4 Estimation rates Large margin algorithms choose f̂ from a class F to minimize the empirical φ-risk, R̂φ(f) = Êφ(Y f(X)) = (1/n) Σ_{i=1}^{n} φ(Yi f(Xi)). [sent-167, score-0.151]

We have seen how the excess risk depends on the excess φ-risk. [sent-168, score-0.819]

63 In this section, we examine the convergence of f̂’s excess φ-risk, Rφ(f̂) − Rφ∗. [sent-169, score-0.336]

64 We can split this excess risk into an estimation error term and an approximation error term: Rφ(f̂) − Rφ∗ = (Rφ(f̂) − inf_{f∈F} Rφ(f)) + (inf_{f∈F} Rφ(f) − Rφ∗). [sent-170, score-1.039]

65 For simplicity, we assume throughout that some f∗ ∈ F achieves the infimum, Rφ(f∗) = inf_{f∈F} Rφ(f). [sent-172, score-0.19]

66 For example, for a nontrivial class F, the resulting estimation error bound can decrease no faster than 1/√n. [sent-175, score-0.242]

67 (1996) showed that fast rates are also possible for the quadratic loss φ(α) = (1 − α)2 if F is convex, even if Rφ (f ∗ ) > 0. [sent-178, score-0.452]

68 In particular, because the quadratic loss function is strictly convex, it is possible to bound the variance of the excess loss (difference between the loss of a function f and that of the optimal f ∗ ) in terms of its expectation. [sent-179, score-1.492]

69 Since the variance decreases as we approach the optimal f ∗ , the risk of the empirical minimizer converges more quickly to the optimal risk than the simple uniform convergence results would suggest. [sent-180, score-0.76]

70 The proof used the idea of the modulus of convexity of a norm. [sent-182, score-0.591]

71 This idea can be used to give a simpler proof of a more general bound when the loss function satisfies a strict convexity condition, and we obtain risk bounds. [sent-183, score-1.031]

72 The modulus of convexity of an arbitrary strictly convex function (rather than a norm) is a key notion in formulating our results. [sent-184, score-0.811]

73 We consider loss functions φ that also satisfy a Lipschitz condition with respect to a pseudometric d on ℝ: we say that φ : ℝ → ℝ is Lipschitz with respect to d, with constant L, if for all a, b ∈ ℝ, |φ(a) − φ(b)| ≤ L · d(a, b). [sent-188, score-0.624]

74 ) We consider four loss functions that satisfy these conditions: the exponential loss function used in AdaBoost, the deviance function for logistic regression, the quadratic loss function, and the truncated quadratic loss function; see Table 1. [sent-190, score-1.703]

75 We use the pseudometric dφ (a, b) = inf {|a − α| + |β − b| : φ constant on (min{α, β}, max{α, β})} . [sent-191, score-0.294]

76 For all except the truncated quadratic loss function, this corresponds to the standard metric on ℝ, dφ(a, b) = |a − b|. [sent-192, score-0.484]

77 It is easy to calculate the Lipschitz constant and modulus of convexity for each of these loss functions. [sent-194, score-0.941]
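
For instance, for the quadratic loss φ(α) = (1 − α)² on [−B, B], the Lipschitz constant is L_B = 2(B + 1) (attained at α = −B) and the modulus of convexity is δ(ε) = ε²/4, matching Table 1. A quick numerical check (our own sketch; the grids are coarse to keep it fast):

```python
def phi(a):
    return (1.0 - a) ** 2

B = 3.0
grid = [-B + 2 * B * i / 4000 for i in range(4001)]

# Lipschitz constant on [-B, B]: sup |phi(a) - phi(b)| / |a - b|.
# For this differentiable phi it equals max |phi'(a)| = 2(B + 1).
L = max(abs(phi(a) - phi(b)) / abs(a - b)
        for a in grid for b in grid[::100] if a != b)
assert abs(L - 2 * (B + 1)) < 1e-2

# Modulus of convexity: delta(eps) = inf over |a - b| >= eps of
#   (phi(a) + phi(b))/2 - phi((a + b)/2); for quadratic loss this is eps^2 / 4.
eps = 0.5
delta = min((phi(a) + phi(b)) / 2 - phi((a + b) / 2)
            for a in grid for b in grid[::100] if abs(a - b) >= eps)
assert abs(delta - eps ** 2 / 4) < 1e-3
```

The same two-line computation, with φ swapped out, reproduces the L_B and δ(ε) columns for the other losses in Table 1.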

78 In the following result, we consider the function class used by algorithms such as AdaBoost: the class of linear combinations of classifiers from a fixed base class. [sent-196, score-0.123]

79 We assume that this base class has finite Vapnik-Chervonenkis dimension, and we constrain the size of the class by restricting the 1 norm of the linear parameters. [sent-197, score-0.123]

80 Suppose that, on the interval [−B, B], φ is Lipschitz with constant L_B and has modulus of convexity δ(ε) = a_B ε² (both with respect to the pseudometric d). [sent-201, score-0.664]

81 For any probability distribution P on X × Y that has noise exponent α = 1, there is a constant c for which the following is true. [sent-202, score-0.105]

, (Xn, Yn), let f̂ ∈ F be the minimizer of the empirical φ-risk, R̂φ(f) = Êφ(Y f(X)). [sent-209, score-0.126]

83 Suppose that F = B absconv(G), where G ⊆ {±1}^X has dVC(G) = d, and

    ε∗ ≥ B L_B max{ (L_B/(a_B B))^(1/(d+1)), 1 } n^(−(d+2)/(2d+2)).

Then with probability at least 1 − e^(−x),

    R(f̂) ≤ R∗ + c ( ε∗ + L_B (L_B/a_B + B) x / n + inf_{f∈F} Rφ(f) − Rφ∗ ).
[sent-210, score-0.19]

84 Notice that the rate obtained here is strictly faster than the classical n^(−1/2) parametric rate, even though the class is infinite-dimensional and the optimal element of F can have risk larger than the Bayes risk. [sent-211, score-0.52]

85 Let f ∗ be the minimizer of φ-risk in a function class F. [sent-214, score-0.133]

86 If the class F is convex and the loss function φ is strictly convex and Lipschitz, then the variance of the excess loss, gf (x, y) = φ(yf (x)) − φ(yf ∗ (x)), decreases with its expectation. [sent-215, score-1.071]

87 More formally, suppose that φ is L-Lipschitz and has modulus of convexity δ(ε) ≥ c ε^r with r ≤ 2. [sent-218, score-0.606]

88 5 Conclusions We have studied the relationship between properties of a nonnegative margin-based loss function φ and the statistical performance of the classifier which, based on an i.i.d. [sent-222, score-0.48]

89 training set, minimizes empirical φ-risk over a class of functions. [sent-225, score-0.087]

90 We first derived a universal upper bound on the population misclassification risk of any thresholded measurable classifier in terms of its corresponding population φ-risk. [sent-226, score-0.623]

91 The bound is governed by the ψ-transform, a convexified variational transform of φ. [sent-227, score-0.088]

92 It is the tightest possible upper bound uniform over all probability distributions and measurable functions in this setting. [sent-228, score-0.277]

93 Using this upper bound, we characterized the class of loss functions which guarantee that every φ-risk consistent classifier sequence is also Bayes-risk consistent, under any population distribution. [sent-229, score-0.497]

94 Here φ-risk consistency denotes sequential convergence of population φ-risks to the smallest possible φ-risk of any measurable classifier. [sent-230, score-0.402]

95 The characteristic property of such a φ, which we term classification-calibration, is a kind of pointwise Fisher consistency for the conditional φ-risk at each x ∈ X . [sent-231, score-0.192]

96 Under the low noise assumption of Tsybakov (2001), we sharpened our original upper bound and studied the Bayes-risk consistency of f̂, the minimizer of empirical φ-risk over a convex, bounded class of functions F which is not too complex. [sent-233, score-0.479]

97 We found that, for convex φ satisfying a certain uniform strict convexity condition, empirical φ-risk minimization yields convergence of misclassification risk to that of the best-performing classifier in F, as the sample size grows. [sent-234, score-0.989]

98 Furthermore, the rate of convergence can be strictly faster than the classical n^(−1/2), depending on the strictness of convexity of φ and the complexity of F. [sent-235, score-0.584]

99 On the Bayes risk consistency of regularized boosting methods. [sent-278, score-0.417]

100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-300, score-0.588]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('loss', 0.334), ('convexity', 0.326), ('risk', 0.283), ('excess', 0.268), ('sign', 0.241), ('modulus', 0.234), ('inf', 0.19), ('convex', 0.171), ('classi', 0.157), ('lb', 0.156), ('measurable', 0.152), ('px', 0.145), ('consistency', 0.134), ('surrogate', 0.134), ('lipschitz', 0.134), ('mendelson', 0.104), ('pseudometric', 0.104), ('tsybakov', 0.09), ('minimizer', 0.086), ('condition', 0.085), ('quadratic', 0.085), ('zhang', 0.08), ('strictly', 0.08), ('nonnegative', 0.078), ('absconv', 0.078), ('epigraph', 0.078), ('convergence', 0.068), ('vayatis', 0.068), ('boyd', 0.068), ('mum', 0.068), ('steinwart', 0.068), ('truncated', 0.065), ('logistic', 0.065), ('bartlett', 0.063), ('lemma', 0.062), ('minimization', 0.059), ('exponent', 0.058), ('pointwise', 0.058), ('bound', 0.057), ('fisher', 0.055), ('lugosi', 0.054), ('verify', 0.053), ('assessments', 0.052), ('egf', 0.052), ('implies', 0.052), ('yf', 0.052), ('lim', 0.051), ('satis', 0.05), ('population', 0.048), ('class', 0.047), ('noise', 0.047), ('easy', 0.047), ('theorem', 0.047), ('suppose', 0.046), ('mannor', 0.045), ('mcauliffe', 0.045), ('quantitative', 0.044), ('adaboost', 0.044), ('classical', 0.044), ('satisfying', 0.042), ('weakest', 0.041), ('empirical', 0.04), ('part', 0.04), ('jiang', 0.038), ('meir', 0.038), ('vandenberghe', 0.038), ('estimation', 0.038), ('relationship', 0.037), ('misclassi', 0.037), ('er', 0.037), ('faster', 0.036), ('upper', 0.035), ('error', 0.035), ('nition', 0.035), ('yn', 0.035), ('relationships', 0.035), ('lin', 0.034), ('bayes', 0.034), ('satisfy', 0.034), ('say', 0.034), ('annals', 0.034), ('functions', 0.033), ('rates', 0.033), ('fi', 0.032), ('lee', 0.032), ('cation', 0.032), ('statistical', 0.031), ('variational', 0.031), ('proof', 0.031), ('choose', 0.031), ('assessed', 0.03), ('fix', 0.03), ('conditions', 0.03), ('es', 0.03), ('procedures', 0.03), ('rate', 0.03), ('continuity', 0.029), ('nontrivial', 0.029), ('remark', 0.029), ('base', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

2 0.27126944 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

3 0.20131931 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification. 1 Motivation Consider a binary classification problem where we want to predict label y ∈ {±1} based on observation x. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations (X1 , Y1 ), . . . , (Xn , Yn ), ˆ a large margin classification algorithm produces a decision function fn by empirically minimizing a loss function that is often a convex upper bound of the binary classification error ˆ ˆ function. Given fn , the binary decision rule is to predict y = 1 if fn (x) ≥ 0, and to predict ˆ y = −1 otherwise (the decision rule at fn (x) = 0 is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function φ in a pre-chosen function class Cn : 1 ˆ fn = arg min f ∈Cn n n φ(f (Xi )Yi ). (1) i=1 Originally such a scheme was regarded as a compromise to avoid computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem. 
The current view in the statistical literature interprets such methods as algorithms to obtain conditional probability estimates. For example, see [3, 6, 9, 11] for some related studies. This point of view allows people to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate. For example, see [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit. Although statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well-understood due to many recent works such as those mentioned above, there are much fewer studies on risk minimization based multicategory problems which generalizes the binary large margin method (1). The complexity of possible generalizations may be one reason. Another reason may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).1 However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation which generalizes (1) can be used to yield consistent classifiers in the large sample limit. An important step toward this direction has recently been taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that there were a number of earlier proposals that were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit. 
We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently. For example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations. 2 Multi-category classification We consider the following K-class classification problem: we would like to predict the label y ∈ {1, . . . , K} of an input vector x. In this paper, we only consider the simplest scenario with 0 − 1 classification loss: we have a loss of 0 for correct prediction, and loss of 1 for incorrect prediction. In binary classification, the class label can be determined using the sign of a decision function. This can be generalized to K class classification problem as follows: we consider K decision functions fc (x) where c = 1, . . . , K and we predict the label y of x as: T (f (x)) = arg max c∈{1,...,K} fc (x), (2) where we denote by f (x) the vector function f (x) = [f1 (x), . . . , fK (x)]. Note that if two or more components of f achieve the same maximum value, then we may choose any of them as T (f ). In this framework, fc (x) is often regarded as a scoring function for category c that is correlated with how likely x belongs to category c (compared with the remaining k − 1 categories). The classification error is given by: (f ) = 1 − EX P (Y = T (X)|X). Note that only the relative strength of fc compared with the alternatives is important. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of f (x). This allows us to impose one constraint on the vector f (x) which decreases the degree of freedom K of the K-component vector f (x) to K − 1. 1 This approach is often called one-versus-all or ranking in machine learning. 
Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. 
From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. 
Taking a limit point of $(c^m, f^m, q^m)$ and using the continuity of $\Psi_c(\cdot)$, we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid.

Now consider a vector function $f(x) \in B_\Omega$, and let $q(x) = P(\cdot|x)$. Given $X$, if $P(Y = T(q(X))|X) \geq P(Y = T(f(X))|X) + \epsilon$, then (6) implies that $L(q(X), f(X)) \geq \inf_{g \in \Omega} L(q(X), g) + \delta$. Therefore
$$\ell(f) - \inf_{g \in B} \ell(g) = E_X \left[ P(Y = T(q(X))|X) - P(Y = T(f(X))|X) \right]$$
$$\leq \epsilon + E_X\, I\big( P(Y = T(q(X))|X) - P(Y = T(f(X))|X) > \epsilon \big)$$
$$\leq \epsilon + E_X\, \frac{L(q(X), f(X)) - \inf_{g \in \Omega} L(q(X), g)}{\delta} \leq \epsilon + \frac{R(f) - \inf_{g \in B_\Omega} R(g)}{\delta},$$
where $I$ denotes the indicator function. Since $\epsilon$ (with its corresponding $\delta$) is arbitrary, we obtain the theorem by letting $\epsilon \to 0$. □

Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that, with an appropriately chosen function class $C_n$, approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this form have been well studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [1, 4, 7, 8, 10, 11], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes $C_n$ ($C_1 \subset C_2 \subset \cdots$) such that $\cup_n C_n$ is dense in the set of Borel measurable functions. The set $C_n$ has the effect of regularization, which ensures that $R(\hat{f}_n) \approx \inf_{f \in C_n} R(f)$. It follows that as $n \to \infty$, $R(\hat{f}_n) \to \inf_{f \in B} R(f)$ in probability. Theorem 2.1 then implies that $\ell(\hat{f}_n) \to \inf_{f \in B} \ell(f)$ in probability. The purpose of this paper is not to study such learning bounds, which relate approximate minimization of (3) to approximate minimization of (4); see [2] for a recent investigation.
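The mechanism of Theorem 2.1 can be illustrated numerically. The sketch below is our own toy construction, not the paper's: a three-point input space with a made-up conditional probability table, using the exponential choice $\phi(u) = -u$, $t(u) = e^u$, $s(u) = u$ examined in Section 3.2.1, whose pointwise minimizer works out to $f_c^*(x) = \ln q_c(x)$. Driving the $\Psi$-risk $R(f)$ toward its infimum drives the 0-1 error toward the Bayes error.

```python
import numpy as np

# Made-up conditional probabilities q(x) = P(.|x) on a 3-point input space.
Q = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

def L(q, f):
    """Conditional Psi-risk (5) for phi(u) = -u, t(u) = e^u, s(u) = u."""
    return -np.dot(q, f) + np.sum(np.exp(f))

def R(F):
    """True risk (4), averaging over a uniform distribution on X."""
    return np.mean([L(Q[x], F[x]) for x in range(len(Q))])

def err(F):
    """Classification error: 1 - E_X P(Y = T(f(X)) | X)."""
    return 1.0 - np.mean([Q[x, np.argmax(F[x])] for x in range(len(Q))])

F_star = np.log(Q)                       # pointwise minimizer f*_c(x) = ln q_c(x)
bayes_err = 1.0 - np.mean(Q.max(axis=1))

# Moving f along a path toward the Psi-risk minimizer decreases R(f);
# at the minimizer, the 0-1 error equals the Bayes error.
risks = [R(t * F_star + (1 - t) * np.ones_like(Q)) for t in (0.0, 0.5, 1.0)]
print(risks)
print(err(F_star), bayes_err)
```

This is only a finite illustration of the direction of the implication; the theorem itself concerns arbitrary Borel measurable distributions.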
We shall focus on choices of $\Psi$ that lead to admissible formulations. We pay special attention to the case where each $\Psi_c(f)$ is a convex function of $f$, so that the resulting formulation becomes computationally more tractable. Instead of working with the general form of $\Psi_c$ in (4), we focus on two specific choices, listed in the next two sections.

3 Unconstrained formulations

We consider the unconstrained formulation with the following choice of $\Psi$:
$$\Psi_c(f) = \phi(f_c) + s\left( \sum_{k=1}^{K} t(f_k) \right), \quad (7)$$
where $\phi$, $s$ and $t$ are appropriately chosen continuously differentiable functions. The first term, which has a relatively simple form, depends on the label $c$. The second term is independent of the label and can be regarded as a normalization term. Note that this function is symmetric with respect to the components of $f$, so all potential classes are treated equally. It is also possible to treat different classes differently (e.g. by replacing $\phi(f_c)$ with $\phi_c(f_c)$), which can be useful if we associate different classification losses with different kinds of errors.

3.1 Optimality equation and probability model

Using (7), the conditional true risk (5) can be written as:
$$L(q, f) = \sum_{c=1}^{K} q_c \phi(f_c) + s\left( \sum_{c=1}^{K} t(f_c) \right).$$
In the following, we study the properties of the optimal vector $f^*$ that minimizes $L(q, f)$ for a fixed $q$. Given $q$, the optimal solution $f^*$ of $L(q, f)$ satisfies the following first-order condition:
$$q_c \phi'(f_c^*) + \mu_{f^*} t'(f_c^*) = 0 \quad (c = 1, \ldots, K), \quad (8)$$
where the quantity $\mu_{f^*} = s'\left( \sum_{k=1}^{K} t(f_k^*) \right)$ is independent of $c$. Clearly this equation relates $q_c$ to $f_c^*$ for each component $c$. The relationship between $q$ and $f^*$ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with $\Psi$ given by (7). The following result presents a simple criterion to check admissibility; we skip the proof for simplicity. Most of our examples satisfy the condition.

Proposition 3.1. Consider (7).
Assume $\Psi_c(f)$ is continuous on $[-\infty, \infty]^K$ and bounded below. If $s'(u) \geq 0$ and, for all $p > 0$, $p\,\phi'(f) + t'(f) = 0$ has a unique solution $f_p$ that is an increasing function of $p$, then the formulation is admissible. If $s(u) = u$, the condition $\forall p > 0$ in Proposition 3.1 can be replaced by $\forall p \in (0, 1)$.

3.2 Decoupled formulations

We let $s(u) = u$ in (7). The optimality condition (8) becomes
$$q_c \phi'(f_c^*) + t'(f_c^*) = 0 \quad (c = 1, \ldots, K). \quad (9)$$
This means that we have $K$ decoupled equalities, one for each $f_c$. This is the simplest and, in the author's opinion, the most interesting formulation. Since the estimation problem in (3) is also decoupled into $K$ separate problems, one for each component of $\hat{f}_n$, this class of methods is computationally relatively simple and easy to parallelize. Although this method seems preferable for multi-category problems, it is not the most efficient approach for two-class problems (if we want to treat the two classes in a symmetric manner), since we have to solve two separate equations; we only need to deal with one equation in (1), because the effective constraint $f_1 + f_2 = 0$ can be used to reduce the number of equations. This variable elimination has little impact when there are many categories.

In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship between the optimal function $f^*(q)$ and the conditional probability $q$. For simplicity, we focus on the choice $\phi(u) = -u$.

3.2.1 $\phi(u) = -u$ and $t(u) = e^u$

We obtain the following probability model: $q_c = e^{f_c^*}$. This formulation is closely related to the maximum-likelihood estimate with the conditional model $q_c = e^{f_c} / \sum_{k=1}^{K} e^{f_k}$ (logistic regression). In particular, if we choose a function class such that the normalization condition $\sum_{k=1}^{K} e^{f_k} = 1$ holds, then the two formulations are identical.
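As a numerical sanity check (ours, not the paper's), the decoupled condition (9) for the choice above can be solved one component at a time, for instance with Newton's method; the probability vector below is arbitrary. Recovering $q_c = e^{f_c^*}$ confirms the stated probability model.

```python
import numpy as np

q = np.array([0.5, 0.3, 0.2])   # an arbitrary conditional probability vector

def solve_component(qc, iters=50):
    """Newton's method on the decoupled condition (9) for phi(u) = -u,
    t(u) = e^u:  -qc + exp(f) = 0, whose root is f = ln(qc)."""
    f = 0.0
    for _ in range(iters):
        f -= (np.exp(f) - qc) / np.exp(f)
    return f

# Each component of f* is obtained independently of the others.
f_star = np.array([solve_component(qc) for qc in q])
print(np.exp(f_star))           # recovers q: the model q_c = exp(f*_c)
```

The componentwise solve is exactly the decoupling advantage discussed above: the $K$ one-dimensional problems could be run in parallel.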
However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice $\phi(u) = -\ln u$ and $t(u) = u$. This is an extension of the maximum-likelihood estimate with probability model $q_c = f_c$. The resulting method is identical to maximum likelihood if we choose our function class such that $\sum_k f_k = 1$. However, the formulation also allows us to use function classes that do not satisfy this normalization constraint, and is therefore more flexible.

3.2.2 $\phi(u) = -u$ and $t(u) = \ln(1 + e^u)$

This version uses the binary logistic regression loss, and we have the following probability model: $q_c = (1 + e^{-f_c^*})^{-1}$. Again, this is an unnormalized model.

3.2.3 $\phi(u) = -u$ and $t(u) = \frac{1}{p}|u|^p$ ($p > 1$)

We obtain the following probability model: $q_c = \mathrm{sign}(f_c^*)\,|f_c^*|^{p-1}$. This means that at the solution, $f_c^* \geq 0$. One may modify the formulation so that $f_c^* \leq 0$ is allowed, in order to model the conditional probability $q_c = 0$.

3.2.4 $\phi(u) = -u$ and $t(u) = \frac{1}{p}\max(u, 0)^p$ ($p > 1$)

In this probability model, we have the relationship $q_c = \max(f_c^*, 0)^{p-1}$. The equation implies that $f_c^* \leq 0$ is allowed, which models the conditional probability $q_c = 0$. Therefore, with a fixed function class, this model is more powerful than the previous one. However, at the optimal solution, $f_c^* \leq 1$. This requirement can be further alleviated with the following modification.

3.2.5 $\phi(u) = -u$ and $t(u) = \frac{1}{p}\min\left(\max(u, 0)^p,\ p(u - 1) + 1\right)$ ($p > 1$)

In this probability model, we have the following relationship at the exact solution: $q_c = \min(\max(f_c^*, 0), 1)^{p-1}$. Clearly this model is more powerful than the previous one, since the function value $f_c^* \geq 1$ can be used to model $q_c = 1$.

3.3 Coupled formulations

In the coupled formulations, with $s(u) \neq u$, the probability model can be normalized in a certain way. We list a few examples.

3.3.1 $\phi(u) = -u$, $t(u) = e^u$, and $s(u) = \ln(u)$

This is the standard logistic regression model.
The probability model is:
$$q_c(x) = \exp(f_c^*(x)) \left( \sum_{k=1}^{K} \exp(f_k^*(x)) \right)^{-1}.$$
The right-hand side is always normalized (sums to 1). Note that the model is not continuous at infinities, and thus not admissible under our definition. However, we may consider the region $\Omega = \{f : \sup_k f_k = 0\}$, and it is easy to check that this model is admissible on $\Omega$. Let $f_c^\Omega = f_c - \sup_k f_k \in \Omega$; then $f^\Omega$ induces the same decision rule as $f$ and $R(f) = R(f^\Omega)$. Therefore Theorem 2.1 implies that $R(f) \to \inf_{g \in B} R(g)$ implies $\ell(f) \to \inf_{g \in B} \ell(g)$.

3.3.2 $\phi(u) = -u$, $t(u) = |u|^p$, and $s(u) = \frac{1}{p'}|u|^{p'/p}$ ($p, p' > 1$)

The probability model is:
$$q_c(x) = \left( \sum_{k=1}^{K} |f_k^*(x)|^p \right)^{(p'-p)/p} \mathrm{sign}(f_c^*(x))\,|f_c^*(x)|^{p-1}.$$
We may replace $t(u)$ by $t(u) = \max(0, u)^p$, and the probability model becomes:
$$q_c(x) = \left( \sum_{k=1}^{K} \max(f_k^*(x), 0)^p \right)^{(p'-p)/p} \max(f_c^*(x), 0)^{p-1}.$$
These formulations do not seem to have advantages over their decoupled counterparts. Note that if we let $p' \to 1$, then the sum of the $\frac{p}{p-1}$-th powers of the right-hand side tends to 1. In this sense, the model is normalized in the limit $p' \to 1$.

4 Constrained formulations

As pointed out earlier, one may impose constraints on the possible choices of $f$. We may impose such a condition when we specify the function class $C_n$; however, for clarity, we shall directly impose the condition in our formulation. If we impose a constraint in (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of the binary large-margin method (1) to the multi-category case. The choice given below is motivated by [5], where an extension of the SVM was proposed. We use a risk formulation that is different from (7), and for simplicity, we consider a linear equality constraint only:
$$\Psi_c(f) = \sum_{k=1, k \neq c}^{K} \phi(-f_k), \quad \text{s.t. } f \in \Omega, \quad (10)$$
where we define $\Omega$ as:
$$\Omega = \left\{ f : \sum_{k=1}^{K} f_k = 0 \right\} \cup \left\{ f : \sup_k f_k = \infty \right\}.$$
We may interpret the added constraint as a restriction on the function class $C_n$ in (3) such that every $f \in C_n$ satisfies the constraint.
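For the constrained formulation (10), the behavior of the minimizer can be probed numerically. The following brute-force sketch is ours, not the paper's (the probability vector and the grid are arbitrary choices): it minimizes the conditional risk $\sum_c (1 - q_c)\phi(-f_c)$ induced by (5) and (10), over the sum-to-zero slice of $\Omega$ for $K = 3$, using the SVM hinge loss $\phi(u) = \max(0, 1 - u)$ of [5], and checks that the largest coordinate of the minimizer corresponds to the most likely class.

```python
import numpy as np
from itertools import product

phi = lambda u: max(0.0, 1.0 - u)          # SVM hinge loss, as in [5]

def cond_risk(q, f):
    """Conditional risk sum_c (1 - q_c) * phi(-f_c) induced by (5) and (10)."""
    return sum((1.0 - qc) * phi(-fc) for qc, fc in zip(q, f))

q = np.array([0.2, 0.5, 0.3])              # arbitrary conditional probabilities
grid = np.arange(-1.0, 2.01, 0.05)

# Brute force over the sum-to-zero slice of Omega: f_1 + f_2 + f_3 = 0.
best_f, best_val = None, float("inf")
for f1, f2 in product(grid, grid):
    f = (f1, f2, -f1 - f2)
    v = cond_risk(q, f)
    if v < best_val:
        best_f, best_val = f, v

print(best_f, best_val)    # minimizer concentrates on the most likely class
print(int(np.argmax(best_f)) == int(np.argmax(q)))   # -> True
```

On this example the search settles at $f \approx (-1, 2, -1)$: the loss of all non-maximal coordinates is driven to zero, consistent with the calibration property that admissibility requires.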
Note that with $K = 2$, this leads to the usual binary large margin method. Using (10), the conditional true risk (5) can be written as:
$$L(q, f) = \sum_{c=1}^{K} (1 - q_c)\,\phi(-f_c), \quad \text{s.t. } f \in \Omega. \quad (11)$$
The following result provides a simple way to check the admissibility of (10).

Proposition 4.1. If $\phi$ is a convex function which is bounded below and $\phi'(0) < 0$, then (10) is admissible on $\Omega$.

Proof Sketch. The continuity condition is straightforward to verify. We may also assume that $\phi(\cdot) \geq 0$ without loss of generality. Now let $f$ achieve the minimum of $L(q, \cdot)$. If $f_c = \infty$, then it is clear that $q_c = 1$ and thus $q_k = 0$ for $k \neq c$. This implies that for $k \neq c$, $\phi(-f_k) = \inf_f \phi(-f)$, and thus $f_k < 0$. If $f_c = \sup_k f_k < \infty$, then the constraint implies $f_c \geq 0$. It is easy to see that $q_c \geq q_k$ for all $k$, since otherwise we must have $\phi(-f_k) > \phi(-f_c)$, and thus $\phi'(-f_k) > 0$ and $\phi'(-f_c) < 0$, implying that for sufficiently small $\delta > 0$, $\phi(-(f_k + \delta)) < \phi(-f_k)$ and $\phi(-(f_c - \delta)) < \phi(-f_c)$, a contradiction. □

Using the above criterion, we can convert any admissible convex $\phi$ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5], the special case of the SVM (with loss function $\phi(u) = \max(0, 1 - u)$) was studied. The authors demonstrated admissibility by direct calculation, although no result similar to Theorem 2.1 was established; such a result is needed to prove consistency. The treatment presented here generalizes their study. Note that for the constrained formulation, it is more difficult to relate $f_c$ at the optimal solution to a probability model, since such a model has a much more complicated form than in the unconstrained case.

5 Conclusion

In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods.
We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed, and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum-likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near-zero density may cause serious robustness problems, at least in theory.

References

[1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003.

[2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003.

[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion.

[4] W. Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32, 2004. With discussion.

[5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 2002. Accepted.

[6] Yi Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002.

[7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. With discussion.

[8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification: consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003.

[9] Robert E. Schapire and Yoram Singer.
Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.

[10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002.

[11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32, 2004. With discussion.

23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

Author: Tong Zhang

Abstract: The purpose of this paper is to investigate infinity-sample properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions of binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inference tasks beyond classification.

1 Motivation

Consider a binary classification problem where we want to predict the label $y \in \{\pm 1\}$ based on an observation $x$. One of the most significant achievements for binary classification in machine learning is the invention of large margin methods, which include support vector machines and boosting algorithms. Based on a set of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, a large margin classification algorithm produces a decision function $\hat{f}_n$ by empirically minimizing a loss function that is often a convex upper bound of the binary classification error function. Given $\hat{f}_n$, the binary decision rule is to predict $y = 1$ if $\hat{f}_n(x) \geq 0$, and to predict $y = -1$ otherwise (the decision rule at $\hat{f}_n(x) = 0$ is not important). In the literature, the following form of large margin binary classification is often encountered: we minimize the empirical risk associated with a convex function $\phi$ in a pre-chosen function class $C_n$:
$$\hat{f}_n = \arg\min_{f \in C_n} \frac{1}{n} \sum_{i=1}^{n} \phi(f(X_i)\,Y_i). \quad (1)$$
Originally, such a scheme was regarded as a compromise to avoid the computational difficulties associated with direct classification error minimization, which often leads to an NP-hard problem.
The current view in the statistical literature interprets such methods as algorithms for obtaining conditional probability estimates; see [3, 6, 9, 11] for some related studies. This point of view allows one to show the consistency of various large margin methods: that is, in the large sample limit, the obtained classifiers achieve the optimal Bayes error rate; see, for example, [1, 4, 7, 8, 10, 11]. The consistency of a learning method is certainly a very desirable property, and one may argue that a good classification method should be consistent in the large sample limit.

Although the statistical properties of binary classification algorithms based on the risk minimization formulation (1) are quite well understood thanks to many recent works such as those mentioned above, there are far fewer studies of risk minimization based multi-category problems that generalize the binary large margin method (1). The complexity of possible generalizations may be one reason. Another may be that one can always estimate the conditional probability for a multi-category problem using the binary classification formulation (1) for each category, and then pick the category with the highest estimated conditional probability (or score).¹ However, it is still useful to understand whether there are more natural alternatives, and what kind of risk minimization formulation generalizing (1) can be used to yield consistent classifiers in the infinity-sample limit. An important step in this direction was recently taken in [5], where the authors proposed a multi-category extension of the support vector machine that is Bayes consistent (note that a number of earlier proposals were not consistent). The purpose of this paper is to generalize their investigation so as to include a much wider class of risk minimization formulations that can lead to consistent classifiers in the infinity-sample limit.
We shall see that there is a rich structure in risk minimization based multi-category classification formulations. Multi-category large margin methods have started to draw more attention recently; for example, in [2], learning bounds for some multi-category convex risk minimization methods were obtained, although the authors did not study possible choices of Bayes consistent formulations.

2 Multi-category classification

We consider the following K-class classification problem: we would like to predict the label $y \in \{1, \ldots, K\}$ of an input vector $x$. In this paper, we consider only the simplest scenario, with 0-1 classification loss: a loss of 0 for a correct prediction and a loss of 1 for an incorrect prediction. In binary classification, the class label can be determined from the sign of a decision function. This can be generalized to the K-class classification problem as follows: we consider $K$ decision functions $f_c(x)$, $c = 1, \ldots, K$, and predict the label $y$ of $x$ as:
$$T(f(x)) = \arg\max_{c \in \{1, \ldots, K\}} f_c(x), \quad (2)$$
where we denote by $f(x)$ the vector function $f(x) = [f_1(x), \ldots, f_K(x)]$. Note that if two or more components of $f$ achieve the same maximum value, then we may choose any of them as $T(f)$. In this framework, $f_c(x)$ is often regarded as a scoring function for category $c$ that is correlated with how likely $x$ belongs to category $c$ (compared with the remaining $K - 1$ categories). The classification error is given by:
$$\ell(f) = 1 - E_X\, P(Y = T(f(X)) \mid X).$$
Note that only the relative strength of $f_c$ compared with the alternatives matters. In particular, the decision rule given in (2) does not change when we add the same numerical quantity to each component of $f(x)$. This allows us to impose one constraint on the vector $f(x)$, which decreases its degrees of freedom from $K$ to $K - 1$.

¹ This approach is often called one-versus-all or ranking in machine learning.
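The conditional-probability interpretation of (1) mentioned above can be illustrated for the binary logistic loss $\phi(v) = \ln(1 + e^{-v})$: at a fixed $x$ with $\eta = P(Y = 1|x)$, the population minimizer of $\eta\,\phi(f) + (1 - \eta)\,\phi(-f)$ is the log-odds $\ln(\eta / (1 - \eta))$, so the sign of the minimizer recovers the Bayes rule. The sketch below is our own illustration (the value of $\eta$ is made up), using Newton's method on this one-dimensional problem.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def minimize_logistic_risk(eta, iters=100):
    """Minimize eta*phi(f) + (1-eta)*phi(-f) with phi(v) = ln(1 + e^{-v})
    by Newton's method; the first-order condition is sigmoid(f) = eta."""
    f = 0.0
    for _ in range(iters):
        grad = sigmoid(f) - eta
        hess = sigmoid(f) * (1.0 - sigmoid(f))
        f -= grad / hess
    return f

eta = 0.8                        # made-up value of P(Y = 1 | x)
f_star = minimize_logistic_risk(eta)
print(f_star)                    # log-odds ln(0.8 / 0.2) = ln 4
print(f_star >= 0)               # decision rule predicts y = 1, the Bayes choice
```

This is the binary special case of the probability models studied in Section 3 of the paper, where the same first-order-condition reasoning is carried out componentwise for $K$ classes.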
Another main approach is to encode a multi-category classification problem into binary classification sub-problems. The consistency of such encoding schemes can be difficult to analyze, and we shall not discuss them. For example, in the binary classification case, we can enforce f1 (x)+f2 (x) = 0, and hence f (x) can be represented as [f1 (x), −f1 (x)]. The decision rule in (2), which compares f1 (x) ≥ f2 (x), is equivalent to f1 (x) ≥ 0. This leads to the binary classification rule mentioned in the introduction. In the multi-category case, one may also interpret the possible constraint on the vector function f , which reduces its degree of freedom from K to K − 1 based on the following reasoning. In many cases, we seek fc (x) as a function of p(Y = c|x). Since we have a K constraint c=1 p(Y = c|x) = 1 (implying that the degree of freedom for p(Y = c|x) is K − 1), the degree of freedom for f is also K − 1 (instead of K). However, we shall point out that in the algorithms we formulate below, we may either enforce such a constraint that reduces the degree of freedom of f , or we do not impose any constraint, which keeps the degree of freedom of f to be K. The advantage of the latter is that it allows the computation of each fc to be decoupled. It is thus much simpler both conceptually and numerically. Moreover, it directly handles multiple-label problems where we may assign each x to multiple labels of y ∈ {1, . . . , K}. In this scenario, we do not have a constraint. In this paper, we consider an empirical risk minimization method to solve a multi-category problem, which is of the following general form: 1 ˆ fn = arg min f ∈Cn n n ΨYi (f (Xi )). (3) i=1 As we shall see later, this method is a natural generalization of the binary classification method (1). Note that one may consider an even more general form with ΨY (f (X)) replaced by ΨY (f (X), X), which we don’t study in this paper. 
From the standard learning theory, one can expect that with appropriately chosen Cn , the ˆ ˆ solution fn of (3) approximately minimizes the true risk R(f ) with respect to the unknown underlying distribution within the function class Cn , R(f ) = EX,Y ΨY (f (X)) = EX L(P (·|X), f (X)), (4) where P (·|X) = [P (Y = 1|X), . . . , P (Y = K|X)] is the conditional probability, and K L(q, f ) = qc Ψc (f ). (5) c=1 In order to understand the large sample behavior of the algorithm based on solving (3), we first need to understand the behavior of a function f that approximately minimizes R(f ). We introduce the following definition (also referred to as classification calibrated in [1]): Definition 2.1 Consider Ψc (f ) in (4). We say that the formulation is admissible (classification calibrated) on a closed set Ω ⊆ [−∞, ∞]K if the following conditions hold: ∀c, Ψc (·) : Ω → (−∞, ∞] is bounded below and continuous; ∩c {f : Ψc (f ) < ∞} is ∗ ∗ non-empty and dense in Ω; ∀q, if L(q, f ∗ ) = inf f L(q, f ), then fc = supk fk implies qc = supk qk . Since we allow Ψc (f ) = ∞, we use the convention that qc Ψc (f ) = 0 when qc = 0 and Ψc (f ) = ∞. The following result relates the approximate minimization of the Ψ risk to the approximate minimization of classification error: Theorem 2.1 Let B be the set of all Borel measurable functions. For a closed set Ω ⊂ [−∞, ∞]K , let BΩ = {f ∈ B : ∀x, f (x) ∈ Ω}. If Ψc (·) is admissible on Ω, then for a Borel measurable distribution, R(f ) → inf g∈BΩ R(g) implies (f ) → inf g∈B (g). Proof Sketch. First we show that the admissibility implies that ∀ > 0, ∃δ > 0 such that ∀q and x: inf {L(q, f ) : fc = sup fk } ≥ inf L(q, g) + δ. (6) qc ≤supk qk − g∈Ω k m If (6) does not hold, then ∃ > 0, and a sequence of (c , f m , q m ) with f m ∈ Ω such that m m m m fcm = supk fk , qcm ≤ supk qk − , and L(q m , f m ) − inf g∈Ω L(q m , g) → 0. 
Taking a limit point of (cm , f m , q m ), and using the continuity of Ψc (·), we obtain a contradiction (technical details handling the infinity case are skipped). Therefore (6) must be valid. Now we consider a vector function f (x) ∈ ΩB . Let q(x) = P (·|x). Given X, if P (Y = T (f (X))|X) ≥ P (Y = T (q(X))|X)+ , then equation (6) implies that L(q(X), f (X)) ≥ inf g∈Ω L(q(X), g) + δ. Therefore (f ) − inf (g) =EX [P (Y = T (q(X))|X) − P (Y = T (f (X))|X)] g∈B ≤ + EX I(P (Y = T (q(X))|X) − P (Y = T (f (X))|X) > ) LX (q(X), f (X)) − inf g∈BΩ LX (q(X), g) ≤ + EX δ R(f ) − inf g∈BΩ R(g) = + . δ In the above derivation we use I to denote the indicator function. Since and δ are arbitrary, we obtain the theorem by letting → 0. 2 Clearly, based on the above theorem, an admissible risk minimization formulation is suitable for multi-category classification problems. The classifier obtained from minimizing (3) can approach the Bayes error rate if we can show that with appropriately chosen function class Cn , approximate minimization of (3) implies approximate minimization of (4). Learning bounds of this forms have been very well-studied in statistics and machine learning. For example, for large margin binary classification, such bounds can be found in [4, 7, 8, 10, 11, 1], where they were used to prove the consistency of various large margin methods. In order to achieve consistency, it is also necessary to take a sequence of function classes Cn (C1 ⊂ C2 ⊂ · · · ) such that ∪n Cn is dense in the set of Borel measurable functions. The set Cn has the effect of regularization, which ensures that ˆ ˆ P R(fn ) ≈ inf f ∈Cn R(f ). It follows that as n → ∞, R(fn ) → inf f ∈B R(f ). Theorem 2.1 ˆ P then implies that (fn ) → inf f ∈B (f ). The purpose of this paper is not to study similar learning bounds that relate approximate minimization of (3) to the approximate minimization of (4). See [2] for a recent investigation. 
We shall focus on the choices of Ψ that lead to admissible formulations. We pay special attention to the case that each Ψc (f ) is a convex function of f , so that the resulting formulation becomes computational more tractable. Instead of working with the general form of Ψc in (4), we focus on two specific choices listed in the next two sections. 3 Unconstrained formulations We consider unconstrained formulation with the following choice of Ψ: K Ψc (f ) = φ(fc ) + s t(fk ) , (7) k=1 where φ, s and t are appropriately chosen functions that are continuously differentiable. The first term, which has a relatively simple form, depends on the label c. The second term is independent of the label, and can be regarded as a normalization term. Note that this function is symmetric with respect to components of f . This choice treats all potential classes equally. It is also possible to treat different classes differently (e.g. replacing φ(fc ) by φc (fc )), which can be useful if we associate different classification loss to different kinds of errors. 3.1 Optimality equation and probability model Using (7), the conditional true risk (5) can be written as: K L(q, f ) = K qc φ(fc ) + s t(fc ) . c=1 c=1 In the following, we study the property of the optimal vector f ∗ that minimizes L(q, f ) for a fixed q. Given q, the optimal solution f ∗ of L(q, f ) satisfies the following first order condition: ∗ ∗ qc φ (fc ) + µf ∗ t (fc ) = 0 (c = 1, . . . , K). (8) where quantity µf ∗ = s ( K k=1 ∗ t(fk )) is independent of k. ∗ Clearly this equation relates qc to fc for each component c. The relationship of q and f ∗ defined by (8) can be regarded as the (infinite sample-size) probability model associated with the learning method (3) with Ψ given by (7). The following result presents a simple criterion to check admissibility. We skip the proof for simplicity. Most of our examples satisfy the condition. Proposition 3.1 Consider (7). 
Assume Ψ_c(f) is continuous on [−∞, ∞]^K and bounded below. If s'(u) ≥ 0 and, for all p > 0, p φ'(f) + t'(f) = 0 has a unique solution f_p that is an increasing function of p, then the formulation is admissible. If s(u) = u, the condition ∀p > 0 in Proposition 3.1 can be replaced by ∀p ∈ (0, 1).

3.2 Decoupled formulations

We let s(u) = u in (7). The optimality condition (8) becomes

q_c φ'(f_c*) + t'(f_c*) = 0   (c = 1, . . . , K).   (9)

This means that we have K decoupled equalities, one for each f_c. This is the simplest and, in the author's opinion, the most interesting formulation. Since the estimation problem in (3) is also decoupled into K separate equations, one for each component of f̂_n, this class of methods is computationally relatively simple and easy to parallelize. Although this method seems preferable for multi-category problems, it is not the most efficient way for two-class problems (if we want to treat the two classes in a symmetric manner), since we have to solve two separate equations. We only need to deal with one equation in (1) because the effective constraint f_1 + f_2 = 0 can be used to reduce the number of equations. This variable elimination has little impact if there are many categories.

In the following, we list some examples of multi-category risk minimization formulations. They all satisfy the admissibility condition in Proposition 3.1. We focus on the relationship of the optimal function f*(q) and the conditional probability q. For simplicity, we focus on the choice φ(u) = −u.

3.2.1 φ(u) = −u and t(u) = e^u

We obtain the following probability model: q_c = e^{f_c*}. This formulation is closely related to the maximum-likelihood estimate with the conditional model q_c = e^{f_c} / Σ_{k=1}^K e^{f_k} (logistic regression). In particular, if we choose a function class such that the normalization condition Σ_{k=1}^K e^{f_k} = 1 holds, then the two formulations are identical.
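Each equality in (9) is a one-dimensional root-finding problem, which is what makes the decoupled case easy to parallelize. A minimal sketch (our own, with an arbitrary q) for the choice of Section 3.2.1, φ(u) = −u and t(u) = e^u: the closed form f_c* = ln q_c exists here, but each component is solved by bisection to mimic the general case where no closed form is available.

```python
import math

# The decoupled condition (9) splits into K independent scalar
# equations. For phi(u) = -u, t(u) = e^u each reads
#   -q_c + exp(f_c) = 0.
def solve_component(qc, lo=-20.0, hi=20.0):
    # find f with g(f) = -qc + exp(f) = 0; g is increasing in f
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if -qc + math.exp(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q = [0.5, 0.3, 0.2]
f_star = [solve_component(qc) for qc in q]   # one equation per class

# Recovered (unnormalized) probability model: q_c = exp(f_c*)
recovered = [math.exp(fc) for fc in f_star]
print([round(r, 6) for r in recovered])      # ~ [0.5, 0.3, 0.2]
```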
However, they become different when we do not impose such a normalization condition. Another very important and closely related formulation is the choice of φ(u) = −ln u and t(u) = u. This is an extension of the maximum-likelihood estimate with probability model q_c = f_c. The resulting method is identical to maximum likelihood if we choose our function class such that Σ_k f_k = 1. However, the formulation also allows us to use function classes that do not satisfy the normalization constraint Σ_k f_k = 1. Therefore this method is more flexible.

3.2.2 φ(u) = −u and t(u) = ln(1 + e^u)

This version uses the binary logistic regression loss, and we have the following probability model: q_c = (1 + e^{−f_c*})^{−1}. Again this is an unnormalized model.

3.2.3 φ(u) = −u and t(u) = (1/p)|u|^p (p > 1)

We obtain the following probability model: q_c = sign(f_c*)|f_c*|^{p−1}. This means that at the solution, f_c* ≥ 0. One may modify it such that we allow f_c* ≤ 0 to model the conditional probability q_c = 0.

3.2.4 φ(u) = −u and t(u) = (1/p) max(u, 0)^p (p > 1)

In this probability model, we have the following relationship: q_c = max(f_c*, 0)^{p−1}. The equation implies that we allow f_c* ≤ 0 to model the conditional probability q_c = 0. Therefore, with a fixed function class, this model is more powerful than the previous one. However, at the optimal solution, f_c* ≤ 1. This requirement can be further alleviated with the following modification.

3.2.5 φ(u) = −u and t(u) = (1/p) min(max(u, 0)^p, p(u − 1) + 1) (p > 1)

In this probability model, we have the following relationship at the exact solution: q_c = min(max(f_c*, 0), 1)^{p−1}. Clearly this model is more powerful than the previous model since the function value f_c* ≥ 1 can be used to model q_c = 1.

3.3 Coupled formulations

In coupled formulations with s(u) ≠ u, the probability model can be normalized in a certain way. We list a few examples.

3.3.1 φ(u) = −u, t(u) = e^u, and s(u) = ln(u)

This is the standard logistic regression model.
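With φ(u) = −u, each decoupled condition (9) reduces to t'(f) = q, so the probability models of Sections 3.2.2–3.2.4 can be checked by a round trip through t'. A sketch (the helper names are ours, not from the paper; q and p are arbitrary test inputs):

```python
import math

# Closed-form probability models implied by the decoupled examples,
# all with phi(u) = -u. Each function solves t'(f) = q for f.
def f_logistic(q):        # t(u) = ln(1 + e^u)  ->  q = 1/(1 + e^{-f})
    return math.log(q / (1.0 - q))

def f_power(q, p):        # t(u) = |u|^p / p    ->  q = sign(f)|f|^{p-1}
    return q ** (1.0 / (p - 1))

def f_trunc_power(q, p):  # t(u) = max(u,0)^p/p ->  q = max(f,0)^{p-1}
    return q ** (1.0 / (p - 1)) if q > 0 else -1.0  # any f <= 0 gives q = 0

# Round-trip: plug f* back through t' and recover q.
q = 0.3
p = 2.0
checks = [
    1.0 / (1.0 + math.exp(-f_logistic(q))),        # sigmoid(f*) = q
    math.copysign(abs(f_power(q, p)) ** (p - 1), f_power(q, p)),
    max(f_trunc_power(q, p), 0.0) ** (p - 1),
    max(f_trunc_power(0.0, p), 0.0) ** (p - 1),    # zero probability reachable
]
print([round(c, 6) for c in checks])   # [0.3, 0.3, 0.3, 0.0]
```

The last line shows the point made in the text: the truncated-power model can represent q_c = 0 exactly, which the plain power model cannot.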
The probability model is:

q_c(x) = exp(f_c*(x)) (Σ_{k=1}^K exp(f_k*(x)))^{−1}.

The right-hand side is always normalized (sums to 1). Note that the model is not continuous at infinities, and thus not admissible in our definition. However, we may consider the region Ω = {f : sup_k f_k = 0}, and it is easy to check that this model is admissible in Ω. Let f_c^Ω = f_c − sup_k f_k ∈ Ω; then f^Ω has the same decision rule as f and ℓ(f) = ℓ(f^Ω). Therefore Theorem 2.1 implies that R(f) → inf_{g∈B} R(g) implies ℓ(f) → inf_{g∈B} ℓ(g).

3.3.2 φ(u) = −u, t(u) = |u|^p, and s(u) = (1/p')|u|^{p'/p} (p, p' > 1)

The probability model is:

q_c(x) = (Σ_{k=1}^K |f_k*(x)|^p)^{(p'−p)/p} sign(f_c*(x)) |f_c*(x)|^{p−1}.

We may replace t(u) by t(u) = max(0, u)^p, and the probability model becomes:

q_c(x) = (Σ_{k=1}^K max(f_k*(x), 0)^p)^{(p'−p)/p} max(f_c*(x), 0)^{p−1}.

These formulations do not seem to have advantages over their decoupled counterparts. Note that if we let p' → 1, then the sum of the p/(p−1)-th powers of the right-hand side → 1. In a way, this means that the model is normalized in the limit p' → 1.

4 Constrained formulations

As pointed out, one may impose constraints on the possible choices of f. We may impose such a condition when we specify the function class C_n. However, for clarity, we shall directly impose a condition in our formulation. If we impose a constraint in (7), then its effect is rather similar to that of the second term in (7). In this section, we consider a direct extension of the binary large-margin method (1) to the multi-category case. The choice given below is motivated by [5], where an extension of SVM was proposed. We use a risk formulation that is different from (7), and for simplicity we will consider a linear equality constraint only:

Ψ_c(f) = Σ_{k=1, k≠c}^K φ(−f_k),   s.t. f ∈ Ω,   (10)

where we define Ω as:

Ω = {f : Σ_{k=1}^K f_k = 0} ∪ {f : sup_k f_k = ∞}.

We may interpret the added constraint as a restriction on the function class C_n in (3) such that every f ∈ C_n satisfies the constraint.
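The limiting normalization claim at the end of Section 3.3.2 can be checked numerically. In the sketch below (our own; the f values are arbitrary test inputs), we set p' = 1, so the model reads q_c = (Σ_k |f_k|^p)^{(1−p)/p} sign(f_c) |f_c|^{p−1}, and the p/(p−1)-th powers of the q_c then sum to exactly 1:

```python
# Check that with p' = 1 the coupled power model of Section 3.3.2
# is normalized: sum_c |q_c|^{p/(p-1)} = 1 for any nonzero f.
p = 3.0
f = [1.5, -0.7, 0.2, 2.1]

s = sum(abs(fk) ** p for fk in f)
q = [s ** ((1 - p) / p) * (1 if fc >= 0 else -1) * abs(fc) ** (p - 1)
     for fc in f]

total = sum(abs(qc) ** (p / (p - 1)) for qc in q)
print(round(total, 10))   # 1.0
```

Algebraically, |q_c|^{p/(p−1)} = |f_c|^p / Σ_k |f_k|^p, so the sum telescopes to 1 regardless of f.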
Note that with K = 2, this leads to the usual binary large margin method. Using (10), the conditional true risk (5) can be written as:

L(q, f) = Σ_{c=1}^K (1 − q_c) φ(−f_c),   s.t. f ∈ Ω.   (11)

The following result provides a simple way to check the admissibility of (10).

Proposition 4.1 If φ is a convex function which is bounded below and φ'(0) < 0, then (10) is admissible on Ω.

Proof Sketch. The continuity condition is straightforward to verify. We may also assume that φ(·) ≥ 0 without loss of generality. Now let f achieve the minimum of L(q, ·). If f_c = ∞, then it is clear that q_c = 1 and thus q_k = 0 for k ≠ c. This implies that for k ≠ c, φ(−f_k) = inf_f φ(−f), and thus f_k < 0. If f_c = sup_k f_k < ∞, then the constraint implies f_c ≥ 0. It is easy to see that q_c ≥ q_k for all k, since otherwise we must have φ(−f_k) > φ(−f_c), and thus φ'(−f_k) > 0 and φ'(−f_c) < 0, implying that with sufficiently small δ > 0, φ(−(f_k + δ)) < φ(−f_k) and φ(−(f_c − δ)) < φ(−f_c). A contradiction. □

Using the above criterion, we can convert any admissible convex φ for the binary formulation (1) into an admissible multi-category classification formulation (10). In [5] the special case of the SVM (with loss function φ(u) = max(0, 1 − u)) was studied. The authors demonstrated the admissibility by direct calculation, although no results similar to Theorem 2.1 were established. Such a result is needed to prove consistency. The treatment presented here generalizes their study. Note that for the constrained formulation, it is more difficult to relate f_c at the optimal solution to a probability model, since such a model has a much more complicated form than its unconstrained counterpart.

5 Conclusion

In this paper we proposed a family of risk minimization methods for multi-category classification problems, which are natural extensions of binary large margin classification methods.
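To see the constrained formulation (10)–(11) in action, here is a projected-subgradient sketch (our own toy illustration; step size, iteration count, and the chosen q are ad-hoc) with the SVM hinge φ(u) = max(0, 1 − u), so φ(−f_c) = max(0, 1 + f_c), minimized under the sum-zero constraint. For this q, direct calculation gives the minimizer (K − 1, −1, −1) = (2, −1, −1), whose argmax agrees with the Bayes class argmax_c q_c:

```python
# Minimize L(q, f) = sum_c (1 - q_c) * max(0, 1 + f_c)
# subject to sum(f) = 0, by projected subgradient descent.
q = [0.6, 0.3, 0.1]          # conditional class probabilities at a fixed x
f = [0.0, 0.0, 0.0]          # sum(f) = 0 holds at the start
lr = 0.005

for _ in range(20000):
    # subgradient of each hinge term (zero where the hinge is inactive)
    g = [(1.0 - qc) * (1.0 if fc > -1.0 else 0.0) for qc, fc in zip(q, f)]
    m = sum(g) / len(g)      # project the step onto {sum(f) = 0}
    f = [fc - lr * (gc - m) for fc, gc in zip(f, g)]

print([round(fc, 2) for fc in f])
# near the minimizer (2, -1, -1); the argmax of f is the Bayes class
```

Note how, as the text says, no simple probability model falls out of f*: the minimizer carries only the identity of the top class, not the values of q.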
We established admissibility conditions that ensure the consistency of the obtained classifiers in the large sample limit. Two specific forms of risk minimization were proposed, and examples were given to study the induced probability models. As an implication of this work, we see that it is possible to obtain consistent (conditional) density estimation using various non-maximum-likelihood estimation methods. One advantage of some of the newly proposed methods is that they allow us to model zero density directly. Note that for the maximum-likelihood method, near-zero density may cause serious robustness problems, at least in theory.

References

[1] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Statistics Department, University of California, Berkeley, 2003.
[2] Ilya Desyatnikov and Ron Meir. Data-dependent bounds for multi-category classification based on convex losses. In COLT, 2003.
[3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000. With discussion.
[4] W. Jiang. Process consistency for AdaBoost. The Annals of Statistics, 32, 2004. With discussion.
[5] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 2002. Accepted.
[6] Yi Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, pages 259–275, 2002.
[7] G. Lugosi and N. Vayatis. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics, 32, 2004. With discussion.
[8] Shie Mannor, Ron Meir, and Tong Zhang. Greedy algorithms for classification - consistency, convergence rates, and adaptivity. Journal of Machine Learning Research, 4:713–741, 2003.
[9] Robert E. Schapire and Yoram Singer.
Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.
[10] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18:768–791, 2002.
[11] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32, 2004. With discussion.

3 0.7564463 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

4 0.62282854 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

Author: Ingo Steinwart

Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1

5 0.55892044 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees

Author: Clayton Scott, Robert Nowak

Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1

6 0.53596127 146 nips-2003-Online Learning of Non-stationary Sequences

7 0.47914892 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

8 0.47832653 51 nips-2003-Design of Experiments via Information Theory

9 0.4666962 1 nips-2003-1-norm Support Vector Machines

10 0.4051441 124 nips-2003-Max-Margin Markov Networks

11 0.38233811 181 nips-2003-Statistical Debugging of Sampled Programs

12 0.36125562 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

13 0.34761244 126 nips-2003-Measure Based Regularization

14 0.34707442 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

15 0.34410793 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

16 0.33283728 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

17 0.33189675 3 nips-2003-AUC Optimization vs. Error Rate Minimization

18 0.3211391 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images

19 0.31380472 113 nips-2003-Learning with Local and Global Consistency

20 0.31059706 151 nips-2003-PAC-Bayesian Generic Chaining


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (11, 0.021), (19, 0.019), (30, 0.015), (35, 0.104), (53, 0.099), (63, 0.171), (66, 0.011), (69, 0.059), (71, 0.068), (73, 0.045), (76, 0.048), (78, 0.01), (85, 0.073), (91, 0.107), (99, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90940219 13 nips-2003-A Neuromorphic Multi-chip Model of a Disparity Selective Complex Cell

Author: Bertram E. Shi, Eric K. Tsang

Abstract: The relative depth of objects causes small shifts in the left and right retinal positions of these objects, called binocular disparity. Here, we describe a neuromorphic implementation of a disparity selective complex cell using the binocular energy model, which has been proposed to model the response of disparity selective cells in the visual cortex. Our system consists of two silicon chips containing spiking neurons with monocular Gabor-type spatial receptive fields (RF) and circuits that combine the spike outputs to compute a disparity selective complex cell response. The disparity selectivity of the cell can be adjusted by both position and phase shifts between the monocular RF profiles, which are both used in biology. Our neuromorphic system performs better with phase encoding, because the relative responses of neurons tuned to different disparities by phase shifts are better matched than the responses of neurons tuned by position shifts.

2 0.90623003 177 nips-2003-Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles

Author: Mark Girolami, Ata Kabán

Abstract: To provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. This paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be ‘explained’ by a relatively small number of common structurally simple behavioral patterns which may interleave randomly in a user-specific proportion. The results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing lowcomplexity and intuitively interpretable representations.

3 0.89410686 183 nips-2003-Synchrony Detection by Analogue VLSI Neurons with Bimodal STDP Synapses

Author: Adria Bofill-i-petit, Alan F. Murray

Abstract: We present test results from spike-timing correlation learning experiments carried out with silicon neurons with STDP (Spike Timing Dependent Plasticity) synapses. The weight change scheme of the STDP synapses can be set to either weight-independent or weight-dependent mode. We present results that characterise the learning window implemented for both modes of operation. When presented with spike trains with different types of synchronisation the neurons develop bimodal weight distributions. We also show that a 2-layered network of silicon spiking neurons with STDP synapses can perform hierarchical synchrony detection. 1

same-paper 4 0.86849105 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

5 0.73157549 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis

Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers. 1

6 0.7248584 78 nips-2003-Gaussian Processes in Reinforcement Learning

7 0.72429502 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model

8 0.72191674 158 nips-2003-Policy Search by Dynamic Programming

9 0.72145855 39 nips-2003-Bayesian Color Constancy with Non-Gaussian Models

10 0.72025734 126 nips-2003-Measure Based Regularization

11 0.71742976 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions

12 0.71730983 113 nips-2003-Learning with Local and Global Consistency

13 0.71691251 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

14 0.71609944 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

15 0.71551621 146 nips-2003-Online Learning of Non-stationary Sequences

16 0.71471113 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

17 0.71356857 196 nips-2003-Wormholes Improve Contrastive Divergence

18 0.71355391 189 nips-2003-Tree-structured Approximations by Expectation Propagation

19 0.71180511 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

20 0.71094191 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes