jmlr jmlr2007 jmlr2007-75 knowledge-graph by maker-knowledge-mining

75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results


Source: pdf

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. Keywords: kernel methods, support vector machines, sparseness, estimating conditional probabilities

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Journal of Machine Learning Research 8 (2007) 775-790 Submitted 6/06; Published 4/07 Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results Peter L. [sent-1, score-0.075]

2 EDU Division of Computer Science University of California Berkeley, CA 94720-1776, USA Editor: G´ bor Lugosi a Abstract One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. [sent-6, score-0.068]

3 However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. [sent-7, score-0.098]

4 We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. [sent-8, score-0.323]

5 We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. [sent-9, score-0.375]

6 This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. [sent-10, score-0.437]

7 Keywords: kernel methods, support vector machines, sparseness, estimating conditional probabilities 1. [sent-11, score-0.161]

8 n i=1 (1) Here, H is a reproducing kernel Hilbert space (RKHS) of some kernel k, λ > 0 is a regularization parameter and φ : R → [0, ∞) is a convex loss function. [sent-23, score-0.237]

9 Since optimization problems based on the non-convex function 0-1 loss t → I(t≤0) (where I(·) is the indicator function) are computationally intractable, use of convex loss functions is often seen as using upper bounds on the 0-1 loss to make the problem computationally easier. [sent-24, score-0.364]

10 We would like to compare different convex loss functions based on their statistical and other useful properties. [sent-28, score-0.18]

11 Conditions ensuring Bayes-risk consistency of classifiers using convex loss functions have already been established (Bartlett et al. [sent-29, score-0.2]

12 It has been observed that different cost functions have different properties and it is important to choose a loss function judiciously (e. [sent-31, score-0.101]

13 In order to understand the relative merits of different loss functions, it is important to consider these properties and investigate the extent to which different loss functions exhibit them. [sent-34, score-0.184]

14 In that case, knowing the trade-off allows one to make an informed choice while choosing a loss function for the classification task at hand. [sent-36, score-0.083]

15 One of the properties we focus on is the ability to estimate the conditional probability of the class label η(x) = P(Y = +1|X = x). [sent-37, score-0.096]

16 Under some conditions on the loss function and the sequence of regularization parameters λn , the solutions of (1) converge (in probability) to a function Fφ∗ (η(x)) which is set valued in general (Steinwart, 2003). [sent-38, score-0.1]

17 As long as we can uniquely identify η(x) based on a value in Fφ∗ (η(x)), we can hope to estimate conditional probabilities using f T,λn (x), at least asymptotically. [sent-39, score-0.141]

18 Choice of the loss function is crucial to this property. [sent-40, score-0.083]

19 For example, the L2-SVM (which uses the loss function t → (max{0, 1 − t})2 ) is much better than L1-SVM (which uses t → max{0, 1 − t}) in terms of asymptotically estimating conditional probabilities. [sent-41, score-0.132]

20 Another criterion is the sparseness of solutions of (1). [sent-42, score-0.197]

21 Steinwart’s recent work (Steinwart, 2004) has shown that for the L1-SVM and a suitable kernel, the asymptotic fraction of support vectors is twice the Bayes-risk. [sent-48, score-0.183]

22 We are interested in how sparseness relates to the ability to estimate conditional probabilities. [sent-51, score-0.293]

23 Do we always lose sparseness by being able to estimate conditional probabilities? [sent-53, score-0.277]

24 Is it possible to characterize the exact trade-off between the asymptotic fraction of support vectors and the ability to estimate conditional probabilities? [sent-54, score-0.279]

25 If sparseness is indeed lost when we are able to fully estimate conditional probabilities, we may want to estimate conditional probabilities only in an interval, say (0. [sent-55, score-0.441]

26 How can we design loss functions which enable us to estimate probabilities in sub-intervals of [0, 1] while preserving as much sparseness as possible? [sent-60, score-0.39]

27 Moreover, one cannot recover sparseness by giving up the ability to estimate conditional probabilities in some sub-interval of (1 − γ0 , γ0 ). [sent-63, score-0.371]

28 The only way to do that is to decrease γ0 thereby shortening the interval (1 − γ0 , γ0 ). [sent-64, score-0.076]

29 Dashed lines represent the corresponding plots for the original loss functions. [sent-66, score-0.083]

30 of the form: φ(t) = h((t0 − t)+ ), t0 > 0 where t+ denotes max{0,t} and h is a continuously differentiable convex function such that h (0) ≥ 0. [sent-67, score-0.155]

31 Each loss function in the family allows one to estimate probabilities in the interval (1 − γ 0 , γ0 ) for some value of γ0 . [sent-68, score-0.251]

32 The asymptotic fraction of support vectors is then Ex G(η(x)), where η → G(η) is a function that increases linearly from 0 to 1 as η goes from 0 to 1 − γ 0 . [sent-69, score-0.183]

33 For example, if φ(t) = 1 2 2 3 ((1 −t)+ ) + 3 (1 −t)+ then conditional probabilities can be estimated in (1/4, 3/4) and G(η) = 1 for η ∈ (1/4, 3/4) (see Fig. [sent-70, score-0.11]

34 Given a loss function φ, define the φ-risk of f by Rφ,P ( f ) = EP φ(y f (x)) . [sent-79, score-0.083]

35 A convex loss function is called classification calibrated if the following two conditions hold: η< 1 1 ⇒ Fφ∗ (η) ⊂ [−∞, 0) and η > ⇒ Fφ∗ (η) ⊂ (0, +∞] . [sent-92, score-0.255]

36 2 2 A necessary and sufficient condition for a convex φ to be classification calibrated is that φ (0) exists and is negative (Bartlett et al. [sent-93, score-0.172]

37 If φ is classification calibrated then it is guaranteed that for any sequence f n such that Rφ,P ( fn ) → Rφ,P , we have RP ( fn ) → RP . [sent-95, score-0.127]

38 Thus, classification calibrated loss functions are good in the sense that minimizing the φ-risk leads to classifiers that have risks approaching the Bayes-risk. [sent-96, score-0.194]

39 1 suggest that the L2-SVM decision function can be used to estimate conditional probabilities in the whole range [0, 1] while it not possible to use the L1-SVM decision function to estimate conditional probabilities in any interval. [sent-105, score-0.282]

40 However, the L1-SVM is better if one considers the asymptotic fraction of support vectors. [sent-106, score-0.156]

41 Under some conditions on the kernel and the regularization sequence, Steinwart proved that the fraction is Ex [2 min(η(x), 1 − η(x))], which also happens to be the optimal φ-risk for the hinge loss function. [sent-107, score-0.231]

42 For L2-SVM, he showed that the asymptotic fraction is Px ({x ∈ X : 0 < η(x) < 1}), which is the probability of the set where noise occurs. [sent-108, score-0.134]

43 Observe that we can write the fraction of support vectors as Ex [G(η(x))] where G(η) = 2 min{η, 1 − η} for the hinge loss and G(η) = I(η∈{0,1}) for / the squared hinge loss. [sent-109, score-0.295]

44 In general, there are loss functions which allow one to estimate probabilities in an interval centered at 1/2 and for which G(η) = 1 only on that interval. [sent-111, score-0.269]

45 778 S PARSENESS VS E STIMATING C ONDITIONAL P ROBABILITIES Steinwart (2003) also derived a general lower bound on the asymptotic number of support vectors in terms of the probability of the set S = {(x, y) ∈ Xcont × Y : 0 ∈ ∂φ(yFφ∗ (η(x)))} . [sent-112, score-0.142]

46 In the simple case of a function of one variable ∂φ(x) = [φ− (x), φ+ (x)], where φ− and φ+ are the left and right hand derivatives of φ (which always exist for convex functions). [sent-114, score-0.108]

47 Preliminary Results We will consider only classification calibrated convex loss functions. [sent-118, score-0.255]

48 Since φ is classification calibrated we know that φ (0) < 0. [sent-119, score-0.093]

49 Because φ (0) < 0 and subdifferentials of a convex function are monotonically decreasing, we must have t0 > 0. [sent-121, score-0.102]

50 / Thus, for losses like the exponential loss t → exp(−t), the bound (4) says that the fraction of support vectors approaches 1 asymptotically. [sent-129, score-0.254]

51 Since we are interested in loss functions that lead to sparse solutions, let us assume that (A1) t0 < ∞ . [sent-130, score-0.12]

52 Before we begin, we make another assumption about the number of points in the interval (−t 0 ,t0 ) where the function φ fails to be differentiable. [sent-133, score-0.076]

53 2 Since tl ∈ D for 1 ≤ l ≤ L, at least one of the inequalities, φ− (tl ) < φ+ (tl ) or φ− (−tl ) < φ+ (−tl ), is strict. [sent-149, score-0.64]

54 2 Using the observation that Fφ∗ (1 − η) = −Fφ∗ (η), we restrict ourselves to examining the behavior of Fφ∗ (η) on an interval containing [1/2, 1]. [sent-154, score-0.076]

55 Theorem 2 Let φ be a classification calibrated convex loss function satisfying assumptions (A1) and (A2). [sent-155, score-0.255]

56 The above also holds for the (possibly degenerate) interval I = [1 − γ L , γL ] but here the function gI maps [−tL ,tL ] onto I. [sent-162, score-0.076]

57 For γl < η < βl , we have ηφ− (tl ) − (1 − η)φ+ (−tl ) < 0 < ηφ+ (tl ) − (1 − η)φ− (−tl ) , 780 S PARSENESS VS E STIMATING C ONDITIONAL P ROBABILITIES and so tl is the unique minimizer of C(η, ·). [sent-166, score-0.64]

58 Being derivatives of a convex function, they are continuous. [sent-169, score-0.108]

59 If t ∈ (tl ,tl−1 ) then, for any η, t ∈ Fφ∗ (η) iff ηφ (t) + (1 − η)φ (−t) = 0 1 iff =η (t) 1 + φφ(−t) iff gI (t) = η iff t ∈ g−1 (η) . [sent-181, score-0.228]

60 Since gI (tl ) = βl , we also have tl ∈ g−1 (η) iff η = βl . [sent-184, score-0.697]

61 Arguing similarly for tl−1 , for η ∈ I, tl ∈ Fφ∗ (η) iff η = γl−1 . [sent-185, score-0.697]

62 The t’s which minimize C(1, ·) = φ(t) are then precisely those in the interval [t 0 ,t ]. [sent-191, score-0.076]

63 The above theorem says a couple of interesting things about the relationship between φ and Fφ∗ . [sent-192, score-0.061]

64 On an interval I where Fφ∗ is not constant, there exists a function gI such that given t ∈ Fφ∗ (η), one can recover η by the relation η = gI (t). [sent-194, score-0.093]

65 We can express this by saying that Fφ∗ (η) is invertible on intervals that correspond to differentiable portions of φ(t) and φ(−t) (see Fig. [sent-195, score-0.122]

66 Theorem 3 Let φ be an classification calibrated convex loss function satisfying assumptions (A1) and (A2). [sent-197, score-0.255]

67 781 (6) BARTLETT AND T EWARI 12 III t0 D C 10 t1 B 8 A 6 4 1 2 II Γ1 Β1 Γ0 Β0 t1 2 t1 I t1 t0 t → φ(t) t0 η → Fφ∗ (η) Figure 2: The loss function (left) is composed of two linear parts (I & III) and a quadratic part (II). [sent-199, score-0.083]

68 Region A arises due to the quadratic part of the loss function. [sent-203, score-0.083]

69 First, we can hope to have sparseness only for values of η ∈ [0, 1 − γ0 ] ∪ [γ0 , 1]. [sent-215, score-0.197]

70 Second, we cannot estimate conditional probabilities in these two intervals because Fφ∗ (·) is not invertible there. [sent-216, score-0.211]

71 Third, any loss function for which Fφ∗ (·) is invertible, say at η1 < 1/2, will necessarily not have sparseness on the interval [η 1 , 1 − η1 ]. [sent-217, score-0.356]

72 In the next section we will show, by deriving a sharp asymptotic result, that the bound is indeed loose for a family of loss functions. [sent-222, score-0.236]

73 Asymptotic Fraction of Support Vectors We will consider convex loss functions of the form φ(t) = h((t0 − t)+ ) . [sent-224, score-0.18]

74 782 (7) S PARSENESS VS E STIMATING C ONDITIONAL P ROBABILITIES The function h is assumed to be continuously differentiable and convex. [sent-225, score-0.076]

75 So, in this section we are not interested in loss functions that are differentiable everywhere. [sent-229, score-0.153]

76 In other words, our assumption is that the loss function is constant for all t ≥ t0 and is continuously differentiable to the left of t0 . [sent-230, score-0.159]

77 Without loss of generality, we may assume that h(0) = 0 because the solutions to (1) do not change if we add or subtract a constant from φ. [sent-232, score-0.099]

78 Note that we obtain the hinge loss if we set h(t) = t. [sent-233, score-0.121]

79 We now derive the dual of (1) for our choice of the loss function. [sent-234, score-0.083]

80 1 Dual Formulation For a convex loss function φ(t) = h((t0 − t)+ ), consider the optimization problem: arg min λ w 2 w + 1 n ∑ φ(yi wT xi ) . [sent-236, score-0.227]

81 n i=1 Make the substitution ξi = t0 − yi wT xi to get arg min λ w 2 w + 1 n ∑ φ(t0 − ξi ) n i=1 (8) subject to ξi = t0 − yi wT xi for all i . [sent-237, score-0.168]

82 Introducing Lagrange multipliers, we get the Lagrangian: L (w, ξ, α) = λ w 2 + n 1 n φ(t0 − ξi ) + ∑ αi (t0 − yi wT xi − ξi ) . [sent-238, score-0.078]

83 Then we have i λ w∗ 2 = λ(w∗ )T 1 n ∗ ∑ α yi xi 2λ i=1 i 1 n 1 n = ∑ α∗ yi (w∗ )T xi = ∑ α∗ (t0 − ξ∗ ) . [sent-242, score-0.102]

84 2 Result About Asymptotic Fraction of Support Vectors and Its Proof Recall that a kernel is called universal if its RKHS is dense in the space of continuous functions over X . [sent-244, score-0.077]

85 Since φ is differentiable on (−t0 ,t0 ), Theorem 2, part 2 implies that Fφ∗ is invertible on (1 − γ0 , γ0 ). [sent-253, score-0.087]

86 Thus, one can estimate conditional probabilities in the interval (1 − γ 0 , γ0 ). [sent-254, score-0.217]

87 i The next theorem says that the fraction of support vectors converges to the expectation E x G(η(x)) in probability. [sent-256, score-0.174]

88 Then for a classifier based on (1), which uses a loss function of the form (7), and a regularization sequence which tends to 0 sufficiently slowly, we have #SV ( fT,λn ) → Ex G(η(x)) n in probability. [sent-260, score-0.1]

89 In the second step, we relate the fraction of support vectors to an empirical average. [sent-284, score-0.113]

90 30) and the fact that 1-norm covering numbers are bounded above by 2-norm covering numbers, we get Pn T ∈ (X × Y )n : sup |ET g(x, y) − EP g(x, y)| > ε ˜ ˜ g∈Gλn ˜ ≤ 8N2 (Gλn , ε/8, n)e−nε 2λ 2 2 n /512Lg (λn )K h(t0 ) . [sent-328, score-0.129]

91 It is not possible to preserve sparseness on intervals where Fφ∗ (·) is invertible. [sent-358, score-0.232]

92 For the regions outside that interval, sparseness is maintained to some extent. [sent-359, score-0.197]

93 For many convex loss functions, the general lower bounds known previously turned out to be quite loose. [sent-360, score-0.18]

94 But that leaves open the possibility that the previously known lower bounds are actually achievable by some loss function lying outside the class of loss functions we considered. [sent-361, score-0.225]

95 Note that the right hand side of Theorem 5 only depends on the left derivative of the loss function at t0 and the right derivative at −t0 . [sent-363, score-0.119]

96 The derivatives at other points do not affect the asymptotic number of support vectors. [sent-364, score-0.121]

97 It may be that results on the continuity of solution sets of convex optimization problems can be applied here (e. [sent-366, score-0.1]

98 Sparseness of support vector machines – some asymptotically sharp bounds. [sent-399, score-0.061]

99 Consistency of support vector machines and other regularized kernel classifiers. [sent-403, score-0.078]

100 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-412, score-0.099]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tl', 0.64), ('ft', 0.373), ('px', 0.223), ('fp', 0.212), ('sparseness', 0.197), ('steinwart', 0.144), ('parseness', 0.135), ('robabilities', 0.135), ('stimating', 0.135), ('ep', 0.131), ('ewari', 0.13), ('gi', 0.117), ('ex', 0.111), ('lg', 0.098), ('bartlett', 0.097), ('calibrated', 0.093), ('rp', 0.087), ('onditional', 0.087), ('loss', 0.083), ('convex', 0.079), ('interval', 0.076), ('vs', 0.075), ('en', 0.072), ('nc', 0.072), ('asymptotic', 0.07), ('limt', 0.065), ('fraction', 0.064), ('sv', 0.062), ('probabilities', 0.061), ('iff', 0.057), ('differentiable', 0.052), ('conditional', 0.049), ('ambuj', 0.049), ('inf', 0.045), ('rkhs', 0.041), ('sharp', 0.039), ('xcont', 0.038), ('yf', 0.038), ('hinge', 0.038), ('ingo', 0.037), ('anthony', 0.035), ('invertible', 0.035), ('says', 0.035), ('intervals', 0.035), ('covering', 0.035), ('nb', 0.033), ('sup', 0.032), ('slowly', 0.032), ('estimate', 0.031), ('universal', 0.03), ('tells', 0.03), ('derivatives', 0.029), ('dpx', 0.029), ('kernel', 0.029), ('wt', 0.029), ('get', 0.027), ('regularized', 0.027), ('pollard', 0.027), ('attains', 0.027), ('vectors', 0.027), ('lugosi', 0.026), ('theorem', 0.026), ('xi', 0.026), ('yi', 0.025), ('subdifferential', 0.025), ('grace', 0.025), ('continuously', 0.024), ('achievable', 0.023), ('lost', 0.023), ('write', 0.023), ('measurable', 0.023), ('monotonically', 0.023), ('bound', 0.023), ('peter', 0.022), ('tong', 0.022), ('support', 0.022), ('ers', 0.021), ('continuity', 0.021), ('loose', 0.021), ('expectations', 0.021), ('bor', 0.02), ('degenerate', 0.02), ('arg', 0.02), ('consistency', 0.02), ('min', 0.019), ('sparse', 0.019), ('bounds', 0.018), ('functions', 0.018), ('proof', 0.018), ('classi', 0.018), ('derivative', 0.018), ('division', 0.017), ('proves', 0.017), ('recover', 0.017), ('regularization', 0.017), ('fn', 0.017), ('decreasing', 0.017), ('ability', 0.016), ('intimately', 0.016), ('subtract', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. Keywords: kernel methods, support vector machines, sparseness, estimating conditional probabilities

2 0.090011455 9 jmlr-2007-AdaBoost is Consistent

Author: Peter L. Bartlett, Mikhail Traskin

Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency

3 0.088706017 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

Author: Ambuj Tewari, Peter L. Bartlett

Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk

4 0.056138489 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number

5 0.054225191 90 jmlr-2007-Value Regularization and Fenchel Duality

Author: Ryan M. Rifkin, Ross A. Lippert

Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning

6 0.052277077 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

7 0.049805805 35 jmlr-2007-General Polynomial Time Decomposition Algorithms     (Special Topic on the Conference on Learning Theory 2005)

8 0.0484557 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

9 0.04572523 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis

10 0.045501985 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

11 0.04179981 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

12 0.038610928 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

13 0.038147151 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

14 0.03287011 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods

15 0.032156743 42 jmlr-2007-Infinitely Imbalanced Logistic Regression

16 0.031850066 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

17 0.030050125 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

18 0.029487196 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

19 0.025756218 44 jmlr-2007-Large Margin Semi-supervised Learning

20 0.025325889 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, -0.159), (2, 0.042), (3, -0.015), (4, 0.083), (5, 0.05), (6, -0.135), (7, 0.086), (8, 0.073), (9, 0.013), (10, 0.165), (11, -0.004), (12, -0.093), (13, -0.029), (14, 0.02), (15, 0.024), (16, 0.052), (17, 0.066), (18, 0.062), (19, 0.215), (20, 0.131), (21, 0.096), (22, 0.195), (23, -0.014), (24, 0.017), (25, -0.133), (26, 0.002), (27, 0.18), (28, 0.062), (29, -0.029), (30, 0.14), (31, 0.063), (32, 0.034), (33, 0.02), (34, -0.046), (35, 0.022), (36, -0.146), (37, -0.0), (38, 0.068), (39, 0.198), (40, -0.076), (41, -0.051), (42, -0.012), (43, -0.032), (44, -0.065), (45, 0.041), (46, 0.091), (47, -0.266), (48, 0.039), (49, 0.214)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95178902 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. Keywords: kernel methods, support vector machines, sparseness, estimating conditional probabilities

2 0.59868604 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

Author: Ambuj Tewari, Peter L. Bartlett

Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk

3 0.41170707 35 jmlr-2007-General Polynomial Time Decomposition Algorithms     (Special Topic on the Conference on Learning Theory 2005)

Author: Nikolas List, Hans Ulrich Simon

Abstract: We present a general decomposition algorithm that is uniformly applicable to every (suitably normalized) instance of Convex Quadratic Optimization and efficiently approaches an optimal solution. The number of iterations required to be within ε of optimality grows linearly with 1/ε and quadratically with the number m of variables. The working set selection can be performed in polynomial time. If we restrict our considerations to instances of Convex Quadratic Optimization with at most k0 equality constraints for some fixed constant k0 plus some so-called box-constraints (conditions that hold for most variants of SVM-optimization), the working set is found in linear time. Our analysis builds on a generalization of the concept of rate certifying pairs that was introduced by Hush and Scovel. In order to extend their results to arbitrary instances of Convex Quadratic Optimization, we introduce the general notion of a rate certifying q-set. We improve on the results by Hush and Scovel (2003) in several ways. First our result holds for Convex Quadratic Optimization whereas the results by Hush and Scovel are specialized to SVM-optimization. Second, we achieve a higher rate of convergence even for the special case of SVM-optimization (despite the generality of our approach). Third, our analysis is technically simpler. We prove furthermore that the strategy for working set selection which is based on rate certifying sets coincides with a strategy which is based on a so-called “sparse witness of sub-optimality”. Viewed from this perspective, our main result improves on convergence results by List and Simon (2004) and Simon (2004) by providing convergence rates (and by holding under more general conditions). Keywords: convex quadratic optimization, decomposition algorithms, support vector machines

4 0.33977804 9 jmlr-2007-AdaBoost is Consistent

Author: Peter L. Bartlett, Mikhail Traskin

Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency

5 0.29409072 90 jmlr-2007-Value Regularization and Fenchel Duality

Author: Ryan M. Rifkin, Ross A. Lippert

Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning

6 0.26755303 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

7 0.24823354 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

8 0.2273255 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

9 0.22297689 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

10 0.22125961 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

11 0.21461433 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

12 0.17652491 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

13 0.17443496 44 jmlr-2007-Large Margin Semi-supervised Learning

14 0.1734955 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation

15 0.16412024 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

16 0.16167474 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

17 0.15856215 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation

18 0.15833046 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

19 0.14858556 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

20 0.14606419 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.011), (12, 0.014), (28, 0.064), (40, 0.039), (48, 0.013), (60, 0.06), (85, 0.017), (98, 0.667)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99588901 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

Author: Gaëlle Loosli, Stéphane Canu

Abstract: In a recently published paper in JMLR, Tsang et al. (2005) present an algorithm for SVM called Core Vector Machines (CVM) and illustrate its performances through comparisons with other SVM solvers. After reading the CVM paper we were surprised by some of the reported results. In order to clarify the matter, we decided to reproduce some of the experiments. It turns out that to some extent, our results contradict those reported. Reasons of these different behaviors are given through the analysis of the stopping criterion. Keywords: SVM, CVM, large scale, KKT gap, stopping condition, stopping criteria

2 0.99349701 23 jmlr-2007-Concave Learners for Rankboost

Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang

Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. While the inputs and outputs of this problem are ranks, the outputs can be simplified to only return the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to c 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang. M ELNIK , VARDI AND Z HANG a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. 2. Rankboost The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as f j (x) which is the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as f 1 . . . fn , where f j (x ) < f j (x ) implies that instance x has a better rank than instance x under ranking f j . For example, in some of the experiments we present, each f j is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, H(x), which defines a linear ordering on the instances, that is, H (x ) < H (x ) iff x has a better rank than x . In rankboost T H(x) = ∑ wt ht (x), (1) t=1 a weighted sum of weak ranking learners, where the ht (x)’s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, Φ : X × X → R, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here Φ (x , x ) > 0 means that x should be better ranked than x for a given query. It is an anti-symmetric function, that is, Φ (x , x ) = −Φ (x , x ) and Φ (x, x) = 0, which avoids loops where two instances should both be ranked better than the other. Also Φ (x , x ) = 0 when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings: D x ,x = c · max 0, Φ x , x −1 , (2) where c = ∑x ,x max (0, Φ (x , x )) . The algorithm itself is very similar to adaboost (Freund and Schapire, 1996). At each iteration the algorithm selects the weak learner that best maximizes a 792 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 1 rankboost algorithm for generating a ranking function. Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm. Output: ranking function H(x) (Eq. 1) Initialize D(1) = D (Eq. 2) for t = 1 . . . T do Find weak learner, ht , that maximizes r(h) = ∑x ,x D(t) (x , x )(h(x ) − h(x )) (Eq. 4). Choose wt = 0.5 ln ((1 + r(ht )) / (1 − r(ht ))) . (Eq. 5) Update: D(t+1) (x , x ) = D(t) (x , x ) exp (wt (ht (x ) − ht (x ))) Zt where Zt is chosen to make ∑x ,x D(t+1) (x , x ) = 1 end for rank function’s utility, r(h), (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). After which the weight function D is adjusted to reflect the impact of the weak learner. Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs: ∑ D x ,x I H x ≤ H x x ,x where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form:  i f f j (x) > θ,  1 0 i f f j (x) ≤ θ, (3) h(x) =  qde f i f f j (x) unde f ined. Each binary weak learner, h, operates on a single f j (ranking feature), giving a binary output of 0 or 1 depending on whether the instance’s rank is smaller than a learned parameter, θ. Note that function h(x) in (3), which is dependent on only one f j (x) with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified q de f value. 3. Rankboost, the Weak Learner and Regularization While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error: N ∑ d (t) (i)I [yi = h (zi )] , i=1 where yi is the class label, I is the indicator function and d (t) is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to 793 M ELNIK , VARDI AND Z HANG choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function: r = r(h) = ∑ D(t) (x , x )(h(x ) − h(x )), (4) x ,x where D(t) is the weight function over pairs of instances. As Eq. 4 contains the term h(x ) − h(x ), short of linear learners this equation can not be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm’s ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al. (2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001). 4. Regularizing Ranking Functions with Consistency in Preference and Confidence In this paper, as Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y j = f j (x), for that instance, so that H(x) = G ( f1 (x), . . . , fn (x)), for certain functions G (y1 , . . . , yn ) = G(y). Here, we assume that the f j (x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. (2003) make a strong case for conversion of raw measures to relative orderings (rankings) over combining measures directly, arguing that it is more general and avoids the semantics of particular measures. As the y j ’s are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning case our motivations are different. Fairness is not the goal; the goal is to improve the 794 C ONCAVE L EARNERS FOR R ANKBOOST accuracy or performance of the ranking function. Thus, regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy. Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if a j ≤ b j , j = 1, . . . , n then G(a) ≤ G(b). Monotonicity was enforced by requiring that the coefficients, wt in Eq. 1, combining the binary weak learners (Eq. 3) were cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits. In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts. 4.1 Grocer Example A grocer needs to make 2 decisions, to decide between stocking oat bran vs. granola and to decide between stocking turnips vs. radishes. The grocer asks her consultants to express their opinion about stocking each item, and based on their responses makes her 2 decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, that is, they all prefer granola over oat bran in the first problem. Lets assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision. Lets say that the same consultants that liked granola more also liked radishes more (and the same ones that like oat bran more like turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants relative confidence. The above properties of preference and confidence imply that preference should be applied consistently across different problems and confidence should reflect the relative confidences of the individual consultants. 4.2 The General Principle of Consistency in Preference and Confidence To generalize the above grocer’s problem consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference. The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and that the following condition holds for any pair of decision problems. Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants in the sense that A is the (non empty) set of all consultants whose preference is identical to that of the combiner. 795 M ELNIK , VARDI AND Z HANG Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then, the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem. Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, B = Ac . If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below. Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering). In particular, the level of confidence between different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable. 4.3 Confidence for Ranks and Ranking Functions In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence for low ranks than high ranks. For example, we would expect the difference between ranks of 1 and 3 to represent a more significant distinction on the part of a ranker than would for example the difference between ranks 34 and 36 which may be almost arbitrarily assigned. Specifically we make the following definition: Definition 2. Preference and confidence for univariate rank values For any pair of rank values {r, r } ⊂ R with r < r , by monotonicity r is preferred. For any two pairs of rank values {r, r } and {r , r } with r < r and r ≤ r , the confidence level for {r, r }is higher than that of {r , r } if r ≤ r , r −r ≤ r −r with at least one inequality. The pair{r, r }provides no preference if r = r . Likewise, if either r − r = r − r = 0 or r − r = r − r = 0, the two pairs {r, r } and {r , r }are defined to have the same level of confidence. In this definition confidence between pairs of ranks can only be compared if the pair with the lowest rank has a gap at least as large as the other pair. Thus, we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and applies only a partial ordering on rank pairs. As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility. 796 C ONCAVE L EARNERS FOR R ANKBOOST As the rank function, G, is a univariate function, we apply the same definitions of preference and confidence to it. That is, for a ranking function G and for any fixed rank vectors, y and y , the preferred rank vector is the one with smaller G, that is, y is preferred iff G(y) < G(y ). For confidence again we need to consider pairs of rank vectors, {y, y } and {y , y }, with G(y) ≤ G(y ) and G(y ) ≤ G(y ). If G(y) ≤ G(y ) and G(y ) − G(y) ≥ G(y ) − G(y ) then we say that the first decision, between yand y , shows equal or more confidence than the second decision between y and y . This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if in the second decision the confidence of the agreeing rank values are comparable and increase and the confidence of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides. 4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks As described, the consistency property is over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors y, y and y . We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.. Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in individual components of y and is jointly concave in y. Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If y ≤ y component wise, then the individual components of y all agree in their preference to y, so that G(y) ≤ G(y ) by the consistency of G in preference. This implies the monotonicity of G in y. We pick three points y, y − a and y + a, such that all points are rank vectors. Consider the pair of binary comparison problems, y vs. y + a and y − a vs. y. Assume that G(y) < G(y + a). These three points are comparable by Definition 2 (r − r = r − r for all components in the rank vectors). Since the agreeing rankers, A = j y j < y j + a j , in the y vs. y + a comparison are greater than in the y − a vs. y comparison, and the disagreeing ranks, outside A, are smaller, then by the consistency of the combiner in preference and confidence (definition 1), G(y − a) ≤ G(y) and G(y) − G(y − a) ≥ G(y + a) − G(y). A function G is concave in a convex domain iff 2G(y) ≥ G(y − a) + G(y + a), for every y and a with y ± a in its domain. To verify this inequality for a particular y and a, we must have G(y + a) > G(y), G(y − a) > G(y) or the third case of G(y) ≥ max{G(y + a), G(y − a)}. We have already proven that the consistency properties of preference and confidence imply G(y) − G(y − a) ≥ G(y + a) − G(y) in the first case. By symmetry this requirement also must hold for −a, so that G(y) − G(y + a) ≥ G(y − a) − G(y) in the second case. This completes the necessity part of the proof since 2G(y) ≥ G(y − a) + G(y + a) automatically holds in the third case. Proof of Sufficiency: Assume the ranking function G is nondecreasing in individual components of y and jointly concave in y. We need to prove consistency in preference and confidence for a pair 797 M ELNIK , VARDI AND Z HANG 1) G(y) < G(y’) y ≤ y’’ A A y’’ A 4 A A 10 y’ 8 y’ 6 4) G(y) > G(y’) y ≥ y’’ A 10 8 A 2) G(y) < G(y’) y ≥ y’’ A 10 8 6 y A y 6 4 4 2 2 2 0 0 y y’ y’’ y’’ 0 2 4 6 8 10 0 2 4 B 6 8 10 0 0 B 2 4 6 8 10 B Figure 1: Consider two decision problems in two dimensions involving three points, y, y and y , where the first decision problem is to choose between y and y and the second problem is to choose between y and y . The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and relative location of y . The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides. of decision problems involving three rank vectors y, y and y . Without loss of generality, suppose that the first problem is to choose between y and y and the second problem is to choose between y and y . We break the proof up by the results of the comparison between y vs. y and the relative location of the third ranking vector y that satisfies the preference and confidence requirements. If G(y) = G(y ) = G(y ), the combiner has equal confidence in the two decision problems by definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where G(y) = G(y ). Figure 1 illustrates three of these cases two dimensionally, where there is a single agreeing component A, the y-axis, and a single disagreeing component B, the x-axis. Let A be the set of agreeing indices, A = j : sgn(y j − y j ) = sgn (G(y ) − G(y)) = 0 , and B = c the set of disagreeing indices. We use the notation y = (y , j ∈ A) to describe corresponding A j A subvectors. Also, when used, vector inequalities are component wise. Case 1: G(y) < G(y ) and yA ≤ yA In the first decision problem, as G agrees with A and disagrees with B yA < y A , yB ≥ y B . / We also know that A = 0 since otherwise y ≥ y and therefore by monotonicity G(y) ≥ G(y ), which is a contradiction. For the second decision problem we consider the values of y which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence or switching sides along the B indices. As y A ≤ yA , either by the confidence relationship or by switching sides (see Figure 1) these y values satisfy yA − y A ≥ y A − y A , 798 yB − y B ≤ y B − y B C ONCAVE L EARNERS FOR R ANKBOOST which implies y ≤ y , and therefore G(y ) ≤ G(y ) by monotonicity. Thus, G(y) < G(y ) ≤ G(y ), which implies that in the second decision problem of choosing between y and y , the preference of G is the same as that of A (i.e., y since yA ≤ yA ) and the confidence of G is at least as high as the first decision problem. Case 2: G(y) < G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B . This implies that y + y ≤ 2y, which means that G(y ) + G(y ) ≤ 2G y +y /2 ≤ 2G(y) by the concavity and monotonicity of G. Thus, G(y) − G(y ) ≥ G(y ) − G(y) and since G(y ) > G(y), we have that G(y) − G(y ) > 0 and thus G(y) > G(y ). Therefore we see that the preference in G for the two decision problems is the same as A (with y in the first problem and with y in the second problem) and the confidence is no smaller for the second comparison. Case 3: G(y) > G(y ) and yA ≤ yA Since y is preferred in the first decision problem we have yA > y A , yB ≤ y B . For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks in the second problem have to be smaller than the smaller ranks in the first problem. But with yA ≥ yA and yA ≤ yA that can only be if yA = yA , which leads to y ≤ y , and by monotonicity implies G(y) ≤ G(y ). However that is a contradiction to G(y) > G(y ), and therefore this is not a viable case for comparing confidence. Case 4: G(y) > G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B , which means that y ≤ y . Thus, by the monotonicity of G, G(y ) ≤ G(y ). Since G(y ) < G(y) the preference in the second decision problem is also with A (i.e., y ) and the confidence is at least as high as the first decision problem. 4.5 Applying Regularization to Rankboost It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost H(x) = ∑ wt ht (x) (Eq. 1). To make H an increasing and concave function of constituent rankings y j = f j (x) we need to constrain the weak ranking learners. If the learners themselves are monotonically increasing and concave functions of y j , then linearly combining them with positive weights will give an H that is also an increasing and concave function of y j . 799 M ELNIK , VARDI AND Z HANG In this paper, we apply the “third method” (Freund et al., 2003) to setting a wt weight value. That is, weak learners are selected on their ability to maximize r from Eq. 4 and then wt = 0.5 ln ((1 + rmax ) / (1 − rmax )) . (5) Therefore, using monotonic and concave weak learners we select only ones that rankboost gives a positive r value to, which renders a positive wt weight. If no r values are positive the rankboost algorithm stops. We mention that a ranking function can be construed as the negative of a score function, that is, G(y) = −S(y). For score functions these regularization properties become monotonically decreasing, and convex (Melnik et al., 2004). 5. Minimum Weighted Group Ranks The functional structure of the new learner we propose is h(y) = min {γ1 y1 , . . . , γn yn , 1} , (6) where y = (y1 , . . . , yn ) = ( f1 (x), . . . , fn (x)) the vector of ranking features, and the γ j are learned positive coefficients. Note that the function’s range is in [0, 1] due to the 1 term in the min function. Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function in y. Thus if these learners are linearly combined with positive weights, the resulting ranking function, H, will have three-point consistency in confidence and preference. To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992) that assigns combined ranks as the best rank each class receives from any ranker. This is a confidence based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, it becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy. 5.1 Learning At each boosting iteration, rather than selecting from all possible weak learners of form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let F = { f1 (x), . . . , fn (x)} be the set of ranking features. Recall that in rankboost H(x) = (t) (t) (t) ∑t wt ht (x), where ht (x) = min γ1 f1 (x), . . . , γn fn (x), 1 and γ j are learned coefficients. We set (s) (s) Ht = h(x) h(x) = min γ1 f1 (x), . . . , γn fn (x) , s ≤ t and select ht+1 (x) from weak learners of the form hnew (x) = min αh(x), β f (x), 1 (7) (t+1) with h(x) ∈ Ht and f (x) ∈ F . This learner can be rewritten in the form of Eq. 6 with the γ j derived from the learned α, β, h(x) and f (x). Thus, at each iteration we look at combinations of 800 C ONCAVE L EARNERS FOR R ANKBOOST the features and existing learners. As discussed in the next section, we can either consider all such combinations, or use a heuristic selection strategy. We propose to optimize α and β separately, in stages. That is, given a value for one of the variables we optimize the other. As α and β are symmetric we show how to optimize β given α. We need to find a value of β that maximizes r in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as: r = ∑ D(t) (x , x )(h(x ) − h(x )) x ,x = ∑ π(x)h(x) x where π(x) = ∑x D(t) (x , x) − D(t) (x, x ) . Given the form of Eq. 7 we can write r as a function of β, ∑ (β f (x) − 1) π(x) + ∑ r (β) = αh(x) − 1 π(x). β f j (x)≤min(αh(x),1) (8) αh(x) < 1 , then O(βl+1 ) = O(βl ) − π(xl ), P(βl+1 ) = P(βl ) − f (xl )π(xl ), Q(βl+1 ) = Q(βl ) +W π(xl ), R(βl+1 ) = R(βl ) +W αh(xl )π(xl ). Combining these formulas gives algorithm 2 for optimizing β. Algorithm 2 Algorithm for optimizing β Given α, h(x) ∈ Ht , f (x) ∈ F and the training instances. For all x’s generate and sort the set of candidate βs, B, such that β 1 ≤ β2 ≤ · · · ≤ β|B| and βl = min(αh(xl ),1) . f (xl ) O = ∑xl π(xl ) P = ∑xl f (xl )π(xl ) Q=0 R=0 rbest = 0 for j = 1 . . . |B| do r = βl P − O + R − Q if r > rbest then rbest = r end if O = O − π(xl ), P = P − f (xl )π(xl ) if αh(xl ) < 1 then Q = Q + π(xl ), R = R + αh(xl )π(xl ) end if end for 5.2 Heuristics for Learner Selection If at each boosting iteration we select from all combinations of h(x) and f (x) we end up with an O(T 2 ) algorithm, where T is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed sized pool of previously trained learners h(x) and features f (x), where at each iteration the pool is randomly chosen from the full Ht and F . To improve performance, instead of using a uniform sampling distribution we can apply a heuristic to focus the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for r to be large the terms f (x)π(x) and h(x)π(x) need to be large. We can consider s f = ∑x f (x)π(x) and sh = ∑x h(x)π(x) as indicators of how well these components work with π(x). Thus, we might expect larger r values to occur when these two score values are larger. Of course, we are discounting interactions, which is the reason for the combination. Using these score values, we can order all h(x) and f (x) separately, and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting 802 C ONCAVE L EARNERS FOR R ANKBOOST Training error on Dup II 9 P=1 P=1/2 P=1/4 Avg rank of correct class 8.8 8.6 8.4 8.2 8 7.8 7.6 7.4 7.2 1 2 3 4 5 6 7 8 9 Iteration number Figure 2: These plots show convergence of training error for 3 values of the heuristic selection pressure value p. These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen the heuristic improves the convergence rate of the training error. method. Thus, for example, all f ∈ F are sorted by their score and are assigned a number based on the rank of these scores,(maxrank − rank)/maxrank, that gives each f an equally sized bin in the range 0 and 1. Given a random number ξ ∼ U(0, 1), we calculate ξ p and select the f that corresponds to the bin this number falls into. Here p < 1 can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the p parameter in one of our experiments. 6. Face Recognition Experiments We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner. Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order or ranking to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy. 6.1 Algorithm Methods We consider a data set I of face images (queries) to train on. For each query image, i in I, we need to rank all u ∈ U faces in the gallery. In rankboost the favor function, Φ, that specifies output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, i, and the item to be ranked, u, as an instance, x ≡ (i, u). As such, f j (x) = f j ((i, u)) is the 803 M ELNIK , VARDI AND Z HANG Error convergence 9 Train error on dup II Test error on dup I Avg rank of correct class 8.5 8 7.5 7 6.5 6 0 10 20 30 40 50 60 70 80 90 100 Iteration number Figure 3: This plot is typical of the convergence behavior of rankboost with MWGR on the FERET data. Both training and test errors tended to converge within 10-30 iterations of boosting with no significant post-convergence divergence. rank assigned to identity u for query image i by recognition algorithm j. As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let u ∗ be the the correct identity for training image i, then Φ ((i, u) , (i, u∗ )) = +1 and Φ ((i, u∗ ) , (i, u)) = −1 for all u = u∗ , setting all remaining elements of Φ (x , x ) = 0. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor. Note that since there is no favor interaction between queries (different i’s); Φ (x , x ) is effectively a function of 3 variables, (i, u , u ). Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of f (x) and h(x), with a selection pressure of p = 0.5. For each candidate, first β was optimized with α = 1, then α was optimized using the optimized β. The candidate with the most positive r value was always selected. This is summarized in algorithm 3. 6.2 Experimental Setup FERET (Phillips et al., 2000) was a government sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images which consists of pictures taken the same day with different facial 804 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 3 Algorithm for selecting learner from pool. Given poolsize and selection pressure. Calculate s f j and sh for all rank features and existing learners. for p = 1 . . . poolsize do Generate d f ∼ U(0, 1) and dh ∼ U(0, 1) Select which h = min αh(x), β f (x), 1 to try, using s f j , sh , u f , uh . Setting α = 1, optimize β (algorithm 2) Keeping the optimized β, optimize α (algorithm 2) Get r of learner with this α, β if r > 0 and r > rbest then rbest = r, hbest = h, hbest = min αh(x), β f (x) end if end for ht = hbest, Ht+1 = Ht ∪ hbest expressions; the fafc data set of 194 images that contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images that has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study where dup II was also a subset of dup I. The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principle components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose to use the ANM algorithm which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm’s performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland’s 1997 test submission (UMD) and the University of Southern California’s 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997). The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets. We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all 4 FERET data sets and randomly shuffled them together. We call this the homogeneous data set as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes. 805 M ELNIK , VARDI AND Z HANG For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, where we get combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets. To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks. All ranks returned by a boosted combiner for the correct class above 30 were truncated to 30. In evaluating the performance of the combiners not all the test data are equally useful. We consider the following two cases as non informative. When the two best face recognizers, UMD and USC both give the correct class a rank of 1 there is very little reason for the combined rank to be different. Also when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30) it makes little sense to compare between the combiners. The testing data was filtered to remove these cases, and the results are presented without them. Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992). We present two sets of experiments—the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers. For all experiments we measure the average rank of the correct class for both learners: A= 1 N ∑ min {Rank (xi∗ ) , 30} , N i=1 where Rank (xi∗ ) is the rank of the correct class for query i, and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference. 6.3 Results In the experiments with the homogeneous data sets, combining all classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and p-value of 0.039, where combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with standard deviation of 3.6, a paired t-test statistic of 0.68 and p-value of 0.24. Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in columns bin mean and mwgr mean with the aver806 C ONCAVE L EARNERS FOR R ANKBOOST ANM ARL EFAVG EFML1 EFML2 EXCA MSU RUT UMD USC dup i 9.52 16.85 15.02 18.23 15.85 11.9 17.64 17.87 13.21 12.44 dup ii 18.14 16.96 20.61 22.21 20.33 16.28 23.44 16.48 13.74 6.85 fafb 10.88 8.94 14.43 16.23 12.21 11.67 6.81 12.93 4.57 5.84 fafc 19.71 28.02 26.17 17.39 19.78 20.30 14.27 22.79 8.96 5.17 Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems. age rank of the correct class for all constituent classifiers in Table 1). The diff mean column contains the improvement of MWGR over the binary learner in terms of average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values. The two no improvement experiments do not have significant p-values. Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight or nine of those 11 have significant p-values. The one no improvement experiment does not have a significant p-value. It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some case we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents. All sets of experiments, homogeneous data, heterogeneous data sets, combining 3 select recognizers and combining all 10 recognizers at once yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests. 7. Information Retrieval Experiments The annual Text REtrieval Conference (TREC) generates high-quality retrieval results of different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task that uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003) we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwal, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc, ricMM, jscbtawtl4, apl10wd. Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents, which hopefully has greater precision than the individual systems. For the 807 M ELNIK , VARDI AND Z HANG test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 7.19 7.36 8.31 5.25 5.15 5.19 2.62 3.38 2.60 2.85 2.40 2.56 mwgr mean 5.96 6.21 6.27 5.38 4.4 4.95 2.22 2.60 2.28 2.78 2.43 2.03 diff mean 1.22 1.14 2.04 -.12 0.74 0.23 0.39 0.78 0.32 0.06 -.03 0.52 diff std 5.22 5.15 5.51 4.0 4.61 5.1 3.09 3.79 2.81 3.33 2.27 2.57 pval .7e-4 .001 .1e-6 .658 .018 .275 .115 .029 .142 .424 .537 .028 Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets. test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 6.73 8.06 6.68 5.75 6.24 5.86 2.67 3.56 2.68 3.36 3.17 2.22 mwgr mean 5.89 7.09 4.87 5.67 5.47 5.31 2.07 2.45 2.17 2.51 2.95 2.23 diff mean 0.84 0.96 1.81 0.08 0.76 0.55 0.59 1.11 0.51 0.85 0.22 -0.01 diff std 4.79 6.01 5.24 4.61 4.22 4.87 2.97 4.51 2.03 3.23 3.95 2.08 pval .009 .014 .5e-6 .408 .009 .074 .033 .012 .01 .007 .3 .52 Table 3: Results of combining all 10 classifiers using individual FERET data sets. 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of system results range between 662 and 2664. 7.1 Methods In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents. Thus, we can set the favor function similarly to the way it was set in 808 C ONCAVE L EARNERS FOR R ANKBOOST JuruFull 0.759 hum01tdlx 0.760 UniNEn7d 0.763 iit01m 0.762 apl10wd 0.735 jscbtawt14 0.761 csiro0mwa1 0.721 kuadhoc2001 0.727 flabxtd 0.741 ok10wtnd1 0.745 fub01be2 0.734 ricMM 0.765 Table 4: Normalized mean average precision for each constituent. the face recognition classification task. Consider a data set with I queries to train on. For each query, i in I, we need to rank all u ∈ U documents. An instance in this case is a pair, x ≡ (i, u). For each u∗ a relevant document and u an irrelevant document for query i, we set Φ ((i, u) , (i, u ∗ )) = +1 and Φ ((i, u∗ ) , (i, u)) = −1, setting all remaining elements of Φ (x0 , x1 ) = 0. Thus, all relevant documents are given a positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents are given zero favor. As the task has only 50 queries, rather than separating the data into train and test, we opted to do a cross validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank position in the list. It is described by the following equation: AveP = ∑N (Prec(r) × rel(r)) r=1 number o f relevant documents where r is the rank, N is the number of documents retrieved, rel() is a binary function of the relevance of the document at rank r,and Prec is the precision ( [number of relevant documents retrieved] / [number of documents retrieved]) at a given cut-off rank. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank. As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4) MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using algorithm 3, with a selection pressure of p = 0.5. 7.2 Results As seen in Figure 4, the MWGR learner converges in significantly less iterations than the binary learner. This could possibly be attributed to the fact that the MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so the MWGR can better accommodate the variety in the 1000 ranks being considered. The normalized mean average precision for the MWGR after 100 iterations was 0.8537 and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4. Both weak learners had a performance rate of change of approximately 3 ∗ 10−5 on their final iteration (better for MWGR). A paired t-test on the cross validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR. 809 M ELNIK , VARDI AND Z HANG Cross Validation Performance 0.88 MWGR Binary 0.87 0.86 0.85 0.84 Mean Avg Precision 0.83 0.82 0.81 0.8 0.79 0.78 0.77 0.76 0.75 0.74 0.73 0.72 0.71 0.7 0 50 100 150 200 Iteration Number 250 300 Figure 4: The cross validation mean average precision score of the two weak learners, MWGR and binary, as a function of boosting iteration number. 8. Discussion The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how can the structure of ranks be correctly exploited to maximize performance. The semi parametric nature of rankboost offers the possibility to generate arbitrarily flexible ranking functions. But as observed this flexibility comes at a cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and as such, how they can be combined. We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show a consistency in how they treat the preference and relative confidence exhibited by constituents. We prove that under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function. We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR) is that its parameters can be individually optimized readily with respect to the rankboost criteria, allowing it to be tested on real-world data. In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR proce810 C ONCAVE L EARNERS FOR R ANKBOOST dure. We also conducted experiments on heterogeneous data, testing the robustness or adaptivity of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner. Our second experiment was on an Information Retrieval task taken from the TREC conference. In this task we see MWGR converges in significantly less iterations and generates statistically significant improved performance. Final Words Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition. Acknowledgments This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and, DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions which improved the presentation of the results. Appendix A. In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions. For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function   G(y) < G(y )   yB − y B ≤ 0 < y A − y A  zA ≤ yA , yA − yA ≤ zA − zA ⇒ G(z ) − G(z) > G(y ) − G(y)   yB ≤ z B , zB − z B ≤ y B − y B (11) where y and y are the rank vectors from the first decision problem and z and z are the rank vectors from the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say about it though is that as the constraints are linear, in the same way that concavity and monotonicity in the weak learner conferred the same properties to the ranking function, a weak learner that satisfies Eq. 11 will also confer those properties to the ranking function that uses it with positive weights. References K. Arrow. Social Choice and Individual Values. Wiley, 1951. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001. 811 M ELNIK , VARDI AND Z HANG Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996. T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992. T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto (Eds.), editors, Structured Document Image Analysis, pages 188–202. Springer-Verlag, Heidelberg, 1992. J. Kittler and F. Roli, editors. Multiple Classifier Systems, Lecture Notes in Computer Science 1857, 2000. Springer. R. Meir and G. Ratsch. Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, chapter An introduction to boosting and leveraging, pages 119–184. Springer, 2003. O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004. H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001. P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for facerecognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:1090– 1104, 2000. M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM. G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983. E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov. L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775– 779, 1997. W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng. Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. SpringerVerlag, Berlin, 1998. 812

same-paper 3 0.9924463 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. Keywords: kernel methods, support vector machines, sparseness, estimating conditional probabilities

4 0.99079192 70 jmlr-2007-Ranking the Best Instances

Author: Stéphan Clémençon, Nicolas Vayatis

Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates

5 0.98749882 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Author: Matthew E. Taylor, Peter Stone, Yaxin Liu

Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

6 0.85270435 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

7 0.85166097 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

8 0.83868635 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

9 0.82255876 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

10 0.81645644 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

11 0.81207806 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

12 0.80388039 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

13 0.79878736 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

14 0.79786462 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

15 0.79525882 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

16 0.79320079 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

17 0.79181969 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

18 0.78872472 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

19 0.78545702 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

20 0.77660173 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation