jmlr jmlr2007 jmlr2007-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
Reference: text
sentIndex sentText sentNum sentScore
1 Journal of Machine Learning Research 8 (2007) 2671-2699 Submitted 11/06; Revised 2/07; Published 12/07 Ranking the Best Instances St´ phan Cl´ mencon e e ¸ CLEMENCO @ ENST. [sent-1, score-0.206]
2 We propose a methodology based on the construction of real-valued scoring functions. [sent-5, score-0.279]
3 Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. [sent-8, score-0.317]
4 We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. [sent-9, score-0.457]
5 Eventually, we state preliminary statistical results for the local ranking problem. [sent-10, score-0.269]
6 In applications where ranking is at stake, people often focus on the best instances. [sent-13, score-0.253]
7 In the different context of credit risk screening, credit establishments elaborate scoring rules as reliability indicators and their main concern is to identify risky prospects especially among the top scores. [sent-15, score-0.344]
8 These various situations can be formulated in the setup of bipartite ranking where one observes i. [sent-21, score-0.252]
9 There is a growing literature on the ranking problem in the field of Machine Learning but most of it considers the Area under the ROC Curve (also known as the AUC) criterion as a measure of performance of the ©2007 St´ phan Cl´ mencon and Nicolas Vayatis. [sent-29, score-0.483]
10 e e ¸ ´ C L E MENC ON AND VAYATIS ¸ ranking rule (Cortes and Mohri, 2004; Freund et al. [sent-30, score-0.229]
11 In a previous work, we have mentioned that the bipartite ranking problem under the AUC criterion could be interpreted as a classification problem with pairs of observations (Cl´ mencon et al. [sent-34, score-0.462]
12 Therefore it does not permit to distinguish between ranking rules making the same number of mistakes but in very different parts of the ROC curve. [sent-37, score-0.229]
13 (2007) reveal that there are several possibilities when designing ranking algorithms with focus on the top-rated instances. [sent-41, score-0.229]
14 The methodology we propose is based on the selection of a real-valued scoring function for which we formulate appropriate performance measures generalizing the AUC criterion. [sent-45, score-0.299]
15 We point out that ranking the best instances is an involved task as it is a two-fold problem: (i) find the best instances and (ii) provide a good ranking on these instances. [sent-46, score-0.638]
16 Another contribution of the paper lies in our study of the optimality issue for the local ranking problem. [sent-55, score-0.269]
17 We propose a family of possible performance measures for the problem of ranking the best instances. [sent-57, score-0.253]
18 We also point out that the empirical risks for local ranking are closely related to generalized Wilcoxon statistics. [sent-59, score-0.301]
19 We first state the problem of finding the best instances and study the performance of empirical risk minimization in this setup (Section 2). [sent-61, score-0.24]
20 In Section 3 we formulate performance measures for local ranking and provide extensions of the AUC criterion. [sent-63, score-0.289]
21 We define the rate of best instances as the proportion of best instances to be considered and denote it by u ∈ (, ). [sent-71, score-0.234]
22 Since we have the ranking problem in mind, our methodology will consist in building the candidate sets from a class S of real-valued scoring functions s : X → R. [sent-115, score-0.508]
23 L(s, u ) L(s) A scoring function minimizing the quantity Ln (s) = n n i= I{Yi · (s(Xi ) − Q(s, − u )) < } . [sent-119, score-0.279]
24 n i= ^ Note that Ln (s) is a complicated statistic since the empirical quantile involves all the instances ^ X , . [sent-123, score-0.272]
25 i= Without loss of generality, we will assume that all scoring functions in S take their values in ^ (, λ) for some λ > . [sent-133, score-0.279]
26 We now turn to study the performance of minimizers of Ln (s) over a class S of scoring functions defined by ^ sn = arg min Ln (s). [sent-134, score-0.318]
27 ^ s∈S Our first main result is an excess risk bound for the empirical risk minimizer s n over a class S ^ of uniformly bounded scoring functions. [sent-135, score-0.441]
28 In the following theorem, we consider that the level sets of scoring functions from the class S form a Vapnik-Chervonenkis (VC) class of sets. [sent-136, score-0.279]
29 Remark 7 ( ON THE CHOICE OF THE CLASS S OF SCORING FUNCTIONS ) In order to grasp the meaning of condition (ii) of the theorem, we consider the one-dimensional case with real-valued scoring functions. [sent-144, score-0.31]
30 Assume also that scoring functions s are differentiable except, possibly, at a finite number of points, and derivatives are denoted by s . [sent-146, score-0.279]
31 We can express the density of s(X) thanks to the change-of-variable formula (see, for instance, Papoulis, 1965): fs (t) = f (x p ) f (x ) +. [sent-152, score-0.485]
32 s (x ) s (x p ) 2676 R ANKING THE B EST I NSTANCES s(x) Scoring functions x Figure 1: Typical example of a scoring function. [sent-156, score-0.279]
33 This shows that the scoring functions should not present neither flat nor steep parts. [sent-157, score-0.279]
34 Note that any subinterval of [, λ] has to be in the range of scoring functions s (if not, some elements of K will present a plateau). [sent-159, score-0.279]
35 , 1996), we have: ^ L(^n ) − inf L(s) ≤ sup Ln (s) − L(s) s s∈S s∈S ^ ≤ sup Ln (s) − Ln (s) + sup |Ln (s) − L(s)| . [sent-164, score-0.225]
36 In our case, assumption (i) implies that the class of sets {x : s(x) ≥ Q(s, v )} indexed by scoring functions s has a VC dimension smaller than V . [sent-166, score-0.279]
37 For this reason and to avoid technical issues, we will consider a quite restrictive setup where it is assumed that: • the class S of scoring functions is a finite class with N elements, • an optimal scoring rule s∗ is contained in S . [sent-187, score-0.581]
38 We will show below how to obtain a similar rate when the true quantile ^ Q(s, − u ) is replaced by the empirical quantile Q(s, − u ) in the criterion to be minimized. [sent-204, score-0.258]
39 Theorem 10 We assume that the class S of scoring functions is a finite class with N elements, and that it contains an optimal scoring rule s∗ . [sent-216, score-0.558]
40 This statistic can be interpreted as a linear signed rank statistic and the key decomposition has been used in the context of nonparametric hypotheses testing and R-estimation. [sent-230, score-0.325]
41 i= ^ We note that the statistic Ln (s) is related to linear signed rank statistics. [sent-237, score-0.229]
42 Then the statistic n Φ i= R+ i n+ sgn(Zi ) is a linear signed rank statistic. [sent-249, score-0.229]
43 ^ Proposition 14 For fixed s and v, the statistic Kn (s, v) is a linear signed rank statistic. [sent-250, score-0.229]
44 It is easy to see that the statistic Kn (s, v) is a linear signed rank statistic with score generating function Φ(x) = I[x≤v] . [sent-254, score-0.325]
45 Then, we have, for all s ∈ S : Var hs (Xi ,Yi ) − hs∗ (Xi ,Yi ) ≤ c L(s) − L(s∗ ) α , for some constant c. [sent-273, score-0.256]
46 It is clear that, for any s, we have L(s, v) = L(s∗ , v) implies that L (s, v) = L (s∗ , v) otherwise s∗ would not be an optimal scoring function at some level v in the vicinity of v. [sent-278, score-0.279]
47 Denote by C = sups,x,y |hs (x, y)| and by σ (s) = Var hs (Xi ,Yi ) − hs∗ (Xi ,Yi ) . [sent-291, score-0.256]
48 Performance Measures for Local Ranking Our main interest here is to develop a setup describing the problem of not only finding but also ranking the best instances. [sent-299, score-0.276]
49 In the sequel, we build on the results from Section 2 and also on our previous work on the (global) ranking problem (Cl´ mencon et al. [sent-300, score-0.414]
50 , To appear) in order to capture e ¸ some of the features of the local ranking problem. [sent-301, score-0.269]
51 The present section is devoted to the construction of performance measures reflecting the quality of ranking rules on a restricted set of instances. [sent-302, score-0.229]
52 1 ROC Curves and Optimality in the Local Ranking Problem We consider the same statistical model as before with (X,Y ) being a pair of random variables over X × {−, +} and we examine ranking rules resulting from real-valued scoring functions s : X → (, λ). [sent-304, score-0.508]
53 The reference tool for assessing the performance of a scoring function s in separating the two populations (positive vs. [sent-305, score-0.279]
54 The optimal scoring function is the one whose ROC curve dominates all the others for all z ∈ (, λ) (or t ∈ (, )) and such a function actually exists. [sent-308, score-0.335]
55 We point out that the ROC curve of a scoring function s is invariant to strictly increasing transformations of s. [sent-310, score-0.335]
56 In our approach, for a given scoring function s, we focus on thresholds z corresponding to the cut-off separating a proportion u ∈ (, ) of top scored instances according to s from the rest. [sent-311, score-0.377]
57 We propose to re-parameterize the ROC curve with the proportion u ∈ (, ) and then describe it as the plot of the function: u → (α(s, u), β(s, u)) , for each scoring function s. [sent-314, score-0.367]
58 Definition 17 (Partial AUC) We define the partial AUC for a scoring function s and a rate u of best instances as: α(s,u ) PARTAUC(s, u ) = β(s, α) dα . [sent-323, score-0.416]
59 In order to represent the partial AUC of a scoring function s, we need to locate the cut-off point given the constraint on the rate u of best instances. [sent-327, score-0.374]
60 Hence, the part of the ROC curve of a scoring function s corresponding to the best instances at rate u is the part going from the origin (, ) to the intersection with the control line D(u , p). [sent-330, score-0.471]
61 Indeed, the closer to η the scoring function s is, the higher the ROC curve is, but at the same time the integration domain shrinks (right display of Figure 2) so that the overall impact on the integral is not clear. [sent-370, score-0.335]
62 Lemma 18 For any scoring function s, we have for all u ∈ (, ), β(s, u) ≤ β(η, u), α(s, u) ≥ α(η, u) . [sent-372, score-0.279]
63 Observe that, for any scoring function s, p( − Gs (Q(s, − u)) = P {Y = , s(X) > Q(s, − u)} = E (η(X)I{X ∈ Cs,u }) . [sent-377, score-0.279]
64 The previous lemma will be important when describing the optimal rules for local ranking criteria. [sent-380, score-0.269]
65 But, at this point, we still do not know any nice criterion for the problem of ranking the best instances. [sent-381, score-0.301]
66 Before considering different heuristics for extending the AUC criterion in the next subsections, we will proceed backwards and define our target, that is to say, the optimal scoring functions for our problem. [sent-382, score-0.327]
67 / ∗ ∗ z∈Cu Such scoring functions fulfill the two properties of locating the best instances (indeed Cs∗ ,u = and ranking them as well as the regression function. [sent-384, score-0.598]
68 (To appear), we have considered the ranking error of a scoring function s as e ¸ defined by: R(s) = P{(Y −Y )(s(X) − s(X )) < } , where (X ,Y ) is an i. [sent-389, score-0.508]
69 Interestingly, it can be proved that minimizing the ranking error R(s) is equivalent to maximizing the well-known AUC criterion. [sent-393, score-0.229]
70 p( − p) We now propose a local version of the ranking error on a measurable set C ⊂ X : R(s,C) = P (s(X) − s(X ))(Y −Y ) < , (X, X ) ∈ C . [sent-395, score-0.269]
71 2687 ´ C L E MENC ON AND VAYATIS ¸ On sets of the form C = Cs,u = {x ∈ X | s(x) ≥ Q(s, − u)} with mass equal to u, the local ranking error will be denoted by R(s, u) R(s,Cs,u ). [sent-396, score-0.307]
72 However, in the case where u < , we will see that there is no equivalence between maximizing the L OC AUC criterion and minimizing the local ranking error s → R(s, u). [sent-399, score-0.317]
73 Indeed, the local ranking error is not a relevant performance measure for finding the best instances. [sent-400, score-0.293]
74 The following theorem states that optimal scoring functions s ∗ in the set S ∗ maximize the L O C AUC criterion and that the latter may be decomposed as a sum of a ’power’ term and (the opposite of) a local ranking error term. [sent-402, score-0.618]
75 We have, for any scoring function s: ∀s∗ ∈ S ∗ , L OC AUC(s, u ) ≤ L OC AUC(s∗ , u ) . [sent-404, score-0.279]
76 2689 ´ C L E MENC ON AND VAYATIS ¸ Remark 21 (T RUNCATING THE AUC) In the theorem, we obviously recover the relation between the standard AUC criterion and the (global) ranking error when u = . [sent-427, score-0.277]
77 The values α(s, u ) and β(s, u ) are the coordinates of the intersecting point between the ROC curve of the scoring function s and the control line D(u , p). [sent-430, score-0.359]
78 The theorem reveals that evaluating the local performance of a scoring statistic s(X) by the truncated AUC as proposed in Dodd and Pepe (2003) is highly arguable since the maximizer of the functional s → PARTAUC(s, u ) is usually not in S ∗ . [sent-431, score-0.437]
79 In particular, we recall that, for fixed s, the Wilcoxon statistic Tn (s) is a linear rank statistic for the sample s(X ), . [sent-457, score-0.263]
80 2691 ´ C L E MENC ON AND VAYATIS ¸ Remark 25 (E VIDENCE AGAINST ’ TWO - STEP ’ STRATEGIES ) It is noteworthy that not all combinations of β(s, u ) (or α(s, u )) and R(s, u ) lead to a criterion with S ∗ being the set of optimal scoring functions. [sent-478, score-0.327]
81 But, in general, this remark should prevent from considering ’naive’ two-step strategies for solving the local ranking problem. [sent-480, score-0.315]
82 By ’naive’ two-step strategies, we refer here to stagewise ^ strategies which would, first, compute an estimate C of the set containing the best instances, and ^ then, solve the ranking problem over C as described in Cl´ mencon et al. [sent-481, score-0.438]
83 In any case, we stress here the importance of making use of a global criterion, synthesizing our double goal: finding and ranking the best instances. [sent-484, score-0.253]
84 Remark 26 (OTHER RANKING PERFORMANCE MEASURES ) The ideas expressed above suggest that several ranking criteria can be proposed. [sent-485, score-0.229]
85 Empirical Risk Minimization of the Local AUC Criterion In the previous section, we have seen that there are various performance measures which can be considered for the problem of ranking the best instances. [sent-494, score-0.253]
86 In order to perform the statistical analysis, we will favor the representations of L OC AUC and W which involve the classification error L(s, u ) and the local ranking error R(s, u ). [sent-495, score-0.269]
87 We exploit the first expression and choose to study the minimization of the following criterion for ranking the best instances: M(s) M(s, u ) = R(s, u ) + ( − p)L(s, u ) . [sent-497, score-0.331]
88 It is obvious that the elements of S ∗ are the optimal elements of the functional M( · , u ) and we will now consider scoring functions obtained through empirical risk minimization of this criterion. [sent-498, score-0.406]
89 The statistic Rn (s) corresponds to an unbiased estimate of the local ranking error R(s, u ). [sent-507, score-0.365]
90 The next result provides a standard error bound for the excess risk of the empirical risk minimizer over a class S of scoring functions: ^ sn = arg min Mn (s) . [sent-508, score-0.48]
91 e ¸ ^ M(^n ) − inf M(s) ≤ sup Rn (s) − Rn (s) + sup |R(s) − Rn (s)| s s∈S s∈S s∈S ^ + ( − p) sup Ln (s) − Ln (s) + sup |L(s) − Ln (s)| + s∈S s∈S n+ −p . [sent-514, score-0.293]
92 A natural question which arises here is whether the consistency results on local ranking can be extended in this spirit. [sent-534, score-0.269]
93 Note that, if we do not focus on best instances and consider the whole AUC as a performance criterion, it is straightforward to obtain consistency and universal rates of convergence for convex risk minimization (as explained in Cl´ mencon et al. [sent-535, score-0.396]
94 In the case of local ranking as we e ¸ introduced it, this extension is less straightforward since the decision rule represented here by the ^ scoring function s appears under the empirical quantile Q(s, v) in the criterion. [sent-537, score-0.658]
95 (2007) where convex risk minimization strategies in the context of ranking are discussed. [sent-539, score-0.324]
96 This constraint leads to empirical risk functionals which involve an empirical quantile indexed by the class of candidate scoring functions and can be seen as linear signed rank statistics. [sent-545, score-0.643]
97 The second part of the paper (Section 3) is dedicated to the introduction of new performance measures for local ranking related to the ROC curve and the AUC criterion. [sent-548, score-0.325]
98 We show that the AUC can be extended in several ways (partial AUC and local AUC) but not all these extensions are tailored for the local ranking problem. [sent-549, score-0.309]
99 We also introduce the optimal scoring functions which should be considered as the target of any local ranking method. [sent-551, score-0.548]
100 In the last section of the paper (Section 4), the problem of ranking the best instances is studied from a statistical perspective. [sent-553, score-0.319]
wordName wordTfidf (topN-words)
[('fs', 0.485), ('auc', 0.396), ('scoring', 0.279), ('hs', 0.256), ('ranking', 0.229), ('mencon', 0.185), ('oc', 0.164), ('cu', 0.159), ('roc', 0.146), ('menc', 0.144), ('gs', 0.124), ('nstances', 0.122), ('anking', 0.122), ('ln', 0.114), ('vn', 0.111), ('est', 0.109), ('vayatis', 0.106), ('cl', 0.105), ('statistic', 0.096), ('quantile', 0.078), ('un', 0.076), ('koul', 0.072), ('rank', 0.071), ('zn', 0.071), ('sup', 0.068), ('instances', 0.066), ('risk', 0.065), ('signed', 0.062), ('dgs', 0.062), ('pgs', 0.062), ('iii', 0.059), ('curve', 0.056), ('partauc', 0.051), ('criterion', 0.048), ('scott', 0.047), ('remark', 0.046), ('vc', 0.044), ('proposition', 0.044), ('boucheron', 0.043), ('kn', 0.042), ('cnt', 0.041), ('cossock', 0.041), ('dhs', 0.041), ('nowak', 0.041), ('op', 0.04), ('local', 0.04), ('rudin', 0.039), ('sn', 0.039), ('mass', 0.038), ('tn', 0.035), ('jek', 0.035), ('dodd', 0.035), ('yi', 0.035), ('ii', 0.034), ('lu', 0.033), ('constants', 0.033), ('cdf', 0.033), ('xi', 0.033), ('empirical', 0.032), ('proportion', 0.032), ('grasp', 0.031), ('ics', 0.031), ('icu', 0.031), ('onnection', 0.031), ('pepe', 0.031), ('sid', 0.031), ('minimization', 0.03), ('wilcoxon', 0.029), ('tsybakov', 0.028), ('lugosi', 0.028), ('rates', 0.026), ('rn', 0.026), ('cachan', 0.026), ('massart', 0.026), ('partial', 0.025), ('constraint', 0.024), ('control', 0.024), ('best', 0.024), ('iv', 0.024), ('counterpart', 0.023), ('bernstein', 0.023), ('setup', 0.023), ('zi', 0.023), ('theorem', 0.022), ('rate', 0.022), ('inf', 0.021), ('van', 0.021), ('cdfs', 0.021), ('dupac', 0.021), ('hanley', 0.021), ('kek', 0.021), ('mins', 0.021), ('phan', 0.021), ('rice', 0.021), ('rvelin', 0.021), ('umr', 0.021), ('copies', 0.02), ('formulate', 0.02), ('argument', 0.02), ('quantiles', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 70 jmlr-2007-Ranking the Best Instances
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
2 0.091908447 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
3 0.091867313 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
4 0.08998125 9 jmlr-2007-AdaBoost is Consistent
Author: Peter L. Bartlett, Mikhail Traskin
Abstract: The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n1−ε iterations—for sample size n and ε ∈ (0, 1)—the sequence of risks of the classifiers it produces approaches the Bayes risk. Keywords: boosting, adaboost, consistency
5 0.083508782 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
Author: Marc Boullé
Abstract: The naive Bayes classifier has proved to be very effective on many real data applications. Its performance usually benefits from an accurate estimation of univariate conditional probabilities and from variable selection. However, although variable selection is a desirable feature, it is prone to overfitting. In this paper, we introduce a Bayesian regularization technique to select the most probable subset of variables compliant with the naive Bayes assumption. We also study the limits of Bayesian model averaging in the case of the naive Bayes assumption and introduce a new weighting scheme based on the ability of the models to conditionally compress the class labels. The weighting scheme on the models reduces to a weighting scheme on the variables, and finally results in a naive Bayes classifier with “soft variable selection”. Extensive experiments show that the compressionbased averaged classifier outperforms the Bayesian model averaging scheme. Keywords: naive Bayes, Bayesian, model selection, model averaging
6 0.082934029 23 jmlr-2007-Concave Learners for Rankboost
7 0.081143647 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
8 0.058480434 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.056719687 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
10 0.055178989 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
12 0.053272147 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
13 0.052117657 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
14 0.047962654 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
15 0.043148499 15 jmlr-2007-Bilinear Discriminant Component Analysis
16 0.041327573 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
17 0.040250544 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
18 0.04015315 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
19 0.039943647 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
20 0.038862407 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
topicId topicWeight
[(0, 0.236), (1, -0.085), (2, -0.072), (3, -0.053), (4, 0.091), (5, 0.09), (6, -0.075), (7, 0.025), (8, 0.193), (9, -0.114), (10, -0.146), (11, -0.043), (12, 0.022), (13, -0.067), (14, -0.197), (15, -0.017), (16, -0.342), (17, 0.105), (18, 0.075), (19, -0.096), (20, 0.015), (21, 0.102), (22, -0.11), (23, -0.138), (24, 0.133), (25, -0.004), (26, 0.041), (27, 0.053), (28, -0.173), (29, -0.094), (30, -0.003), (31, -0.131), (32, -0.052), (33, 0.051), (34, -0.004), (35, 0.065), (36, 0.154), (37, 0.11), (38, 0.058), (39, 0.019), (40, -0.017), (41, -0.013), (42, -0.066), (43, -0.069), (44, 0.033), (45, 0.002), (46, 0.017), (47, 0.127), (48, -0.135), (49, -0.105)]
simIndex simValue paperId paperTitle
same-paper 1 0.96231616 70 jmlr-2007-Ranking the Best Instances
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
2 0.43575996 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
3 0.43012518 23 jmlr-2007-Concave Learners for Rankboost
Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang
Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. While the inputs and outputs of this problem are ranks, the outputs can be simplified to only return the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to c 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang. M ELNIK , VARDI AND Z HANG a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. 2. Rankboost The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as f j (x) which is the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as f 1 . . . fn , where f j (x ) < f j (x ) implies that instance x has a better rank than instance x under ranking f j . For example, in some of the experiments we present, each f j is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, H(x), which defines a linear ordering on the instances, that is, H (x ) < H (x ) iff x has a better rank than x . In rankboost T H(x) = ∑ wt ht (x), (1) t=1 a weighted sum of weak ranking learners, where the ht (x)’s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, Φ : X × X → R, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here Φ (x , x ) > 0 means that x should be better ranked than x for a given query. It is an anti-symmetric function, that is, Φ (x , x ) = −Φ (x , x ) and Φ (x, x) = 0, which avoids loops where two instances should both be ranked better than the other. Also Φ (x , x ) = 0 when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings: D x ,x = c · max 0, Φ x , x −1 , (2) where c = ∑x ,x max (0, Φ (x , x )) . The algorithm itself is very similar to adaboost (Freund and Schapire, 1996). At each iteration the algorithm selects the weak learner that best maximizes a 792 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 1 rankboost algorithm for generating a ranking function. Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm. Output: ranking function H(x) (Eq. 1) Initialize D(1) = D (Eq. 2) for t = 1 . . . T do Find weak learner, ht , that maximizes r(h) = ∑x ,x D(t) (x , x )(h(x ) − h(x )) (Eq. 4). Choose wt = 0.5 ln ((1 + r(ht )) / (1 − r(ht ))) . (Eq. 5) Update: D(t+1) (x , x ) = D(t) (x , x ) exp (wt (ht (x ) − ht (x ))) Zt where Zt is chosen to make ∑x ,x D(t+1) (x , x ) = 1 end for rank function’s utility, r(h), (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). After which the weight function D is adjusted to reflect the impact of the weak learner. Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs: ∑ D x ,x I H x ≤ H x x ,x where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form: i f f j (x) > θ, 1 0 i f f j (x) ≤ θ, (3) h(x) = qde f i f f j (x) unde f ined. Each binary weak learner, h, operates on a single f j (ranking feature), giving a binary output of 0 or 1 depending on whether the instance’s rank is smaller than a learned parameter, θ. Note that function h(x) in (3), which is dependent on only one f j (x) with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified q de f value. 3. Rankboost, the Weak Learner and Regularization While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error: N ∑ d (t) (i)I [yi = h (zi )] , i=1 where yi is the class label, I is the indicator function and d (t) is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to 793 M ELNIK , VARDI AND Z HANG choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function: r = r(h) = ∑ D(t) (x , x )(h(x ) − h(x )), (4) x ,x where D(t) is the weight function over pairs of instances. As Eq. 4 contains the term h(x ) − h(x ), short of linear learners this equation can not be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm’s ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al. (2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001). 4. Regularizing Ranking Functions with Consistency in Preference and Confidence In this paper, as Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y j = f j (x), for that instance, so that H(x) = G ( f1 (x), . . . , fn (x)), for certain functions G (y1 , . . . , yn ) = G(y). Here, we assume that the f j (x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. (2003) make a strong case for conversion of raw measures to relative orderings (rankings) over combining measures directly, arguing that it is more general and avoids the semantics of particular measures. As the y j ’s are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning case our motivations are different. Fairness is not the goal; the goal is to improve the 794 C ONCAVE L EARNERS FOR R ANKBOOST accuracy or performance of the ranking function. Thus, regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy. Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if a j ≤ b j , j = 1, . . . , n then G(a) ≤ G(b). Monotonicity was enforced by requiring that the coefficients, wt in Eq. 1, combining the binary weak learners (Eq. 3) were cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits. In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts. 4.1 Grocer Example A grocer needs to make 2 decisions, to decide between stocking oat bran vs. granola and to decide between stocking turnips vs. radishes. The grocer asks her consultants to express their opinion about stocking each item, and based on their responses makes her 2 decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, that is, they all prefer granola over oat bran in the first problem. Lets assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision. Lets say that the same consultants that liked granola more also liked radishes more (and the same ones that like oat bran more like turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants relative confidence. The above properties of preference and confidence imply that preference should be applied consistently across different problems and confidence should reflect the relative confidences of the individual consultants. 4.2 The General Principle of Consistency in Preference and Confidence To generalize the above grocer’s problem consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference. The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and that the following condition holds for any pair of decision problems. Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants in the sense that A is the (non empty) set of all consultants whose preference is identical to that of the combiner. 795 M ELNIK , VARDI AND Z HANG Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then, the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem. Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, B = Ac . If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below. Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering). In particular, the level of confidence between different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable. 4.3 Confidence for Ranks and Ranking Functions In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence for low ranks than high ranks. For example, we would expect the difference between ranks of 1 and 3 to represent a more significant distinction on the part of a ranker than would for example the difference between ranks 34 and 36 which may be almost arbitrarily assigned. Specifically we make the following definition: Definition 2. Preference and confidence for univariate rank values For any pair of rank values {r, r } ⊂ R with r < r , by monotonicity r is preferred. For any two pairs of rank values {r, r } and {r , r } with r < r and r ≤ r , the confidence level for {r, r }is higher than that of {r , r } if r ≤ r , r −r ≤ r −r with at least one inequality. The pair{r, r }provides no preference if r = r . Likewise, if either r − r = r − r = 0 or r − r = r − r = 0, the two pairs {r, r } and {r , r }are defined to have the same level of confidence. In this definition confidence between pairs of ranks can only be compared if the pair with the lowest rank has a gap at least as large as the other pair. Thus, we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and applies only a partial ordering on rank pairs. As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility. 796 C ONCAVE L EARNERS FOR R ANKBOOST As the rank function, G, is a univariate function, we apply the same definitions of preference and confidence to it. That is, for a ranking function G and for any fixed rank vectors, y and y , the preferred rank vector is the one with smaller G, that is, y is preferred iff G(y) < G(y ). For confidence again we need to consider pairs of rank vectors, {y, y } and {y , y }, with G(y) ≤ G(y ) and G(y ) ≤ G(y ). If G(y) ≤ G(y ) and G(y ) − G(y) ≥ G(y ) − G(y ) then we say that the first decision, between yand y , shows equal or more confidence than the second decision between y and y . This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if in the second decision the confidence of the agreeing rank values are comparable and increase and the confidence of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides. 4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks As described, the consistency property is over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors y, y and y . We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.. Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in individual components of y and is jointly concave in y. Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If y ≤ y component wise, then the individual components of y all agree in their preference to y, so that G(y) ≤ G(y ) by the consistency of G in preference. This implies the monotonicity of G in y. We pick three points y, y − a and y + a, such that all points are rank vectors. Consider the pair of binary comparison problems, y vs. y + a and y − a vs. y. Assume that G(y) < G(y + a). These three points are comparable by Definition 2 (r − r = r − r for all components in the rank vectors). Since the agreeing rankers, A = j y j < y j + a j , in the y vs. y + a comparison are greater than in the y − a vs. y comparison, and the disagreeing ranks, outside A, are smaller, then by the consistency of the combiner in preference and confidence (definition 1), G(y − a) ≤ G(y) and G(y) − G(y − a) ≥ G(y + a) − G(y). A function G is concave in a convex domain iff 2G(y) ≥ G(y − a) + G(y + a), for every y and a with y ± a in its domain. To verify this inequality for a particular y and a, we must have G(y + a) > G(y), G(y − a) > G(y) or the third case of G(y) ≥ max{G(y + a), G(y − a)}. We have already proven that the consistency properties of preference and confidence imply G(y) − G(y − a) ≥ G(y + a) − G(y) in the first case. By symmetry this requirement also must hold for −a, so that G(y) − G(y + a) ≥ G(y − a) − G(y) in the second case. This completes the necessity part of the proof since 2G(y) ≥ G(y − a) + G(y + a) automatically holds in the third case. Proof of Sufficiency: Assume the ranking function G is nondecreasing in individual components of y and jointly concave in y. We need to prove consistency in preference and confidence for a pair 797 M ELNIK , VARDI AND Z HANG 1) G(y) < G(y’) y ≤ y’’ A A y’’ A 4 A A 10 y’ 8 y’ 6 4) G(y) > G(y’) y ≥ y’’ A 10 8 A 2) G(y) < G(y’) y ≥ y’’ A 10 8 6 y A y 6 4 4 2 2 2 0 0 y y’ y’’ y’’ 0 2 4 6 8 10 0 2 4 B 6 8 10 0 0 B 2 4 6 8 10 B Figure 1: Consider two decision problems in two dimensions involving three points, y, y and y , where the first decision problem is to choose between y and y and the second problem is to choose between y and y . The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and relative location of y . The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides. of decision problems involving three rank vectors y, y and y . Without loss of generality, suppose that the first problem is to choose between y and y and the second problem is to choose between y and y . We break the proof up by the results of the comparison between y vs. y and the relative location of the third ranking vector y that satisfies the preference and confidence requirements. If G(y) = G(y ) = G(y ), the combiner has equal confidence in the two decision problems by definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where G(y) = G(y ). Figure 1 illustrates three of these cases two dimensionally, where there is a single agreeing component A, the y-axis, and a single disagreeing component B, the x-axis. Let A be the set of agreeing indices, A = j : sgn(y j − y j ) = sgn (G(y ) − G(y)) = 0 , and B = c the set of disagreeing indices. We use the notation y = (y , j ∈ A) to describe corresponding A j A subvectors. Also, when used, vector inequalities are component wise. Case 1: G(y) < G(y ) and yA ≤ yA In the first decision problem, as G agrees with A and disagrees with B yA < y A , yB ≥ y B . / We also know that A = 0 since otherwise y ≥ y and therefore by monotonicity G(y) ≥ G(y ), which is a contradiction. For the second decision problem we consider the values of y which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence or switching sides along the B indices. As y A ≤ yA , either by the confidence relationship or by switching sides (see Figure 1) these y values satisfy yA − y A ≥ y A − y A , 798 yB − y B ≤ y B − y B C ONCAVE L EARNERS FOR R ANKBOOST which implies y ≤ y , and therefore G(y ) ≤ G(y ) by monotonicity. Thus, G(y) < G(y ) ≤ G(y ), which implies that in the second decision problem of choosing between y and y , the preference of G is the same as that of A (i.e., y since yA ≤ yA ) and the confidence of G is at least as high as the first decision problem. Case 2: G(y) < G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B . This implies that y + y ≤ 2y, which means that G(y ) + G(y ) ≤ 2G y +y /2 ≤ 2G(y) by the concavity and monotonicity of G. Thus, G(y) − G(y ) ≥ G(y ) − G(y) and since G(y ) > G(y), we have that G(y) − G(y ) > 0 and thus G(y) > G(y ). Therefore we see that the preference in G for the two decision problems is the same as A (with y in the first problem and with y in the second problem) and the confidence is no smaller for the second comparison. Case 3: G(y) > G(y ) and yA ≤ yA Since y is preferred in the first decision problem we have yA > y A , yB ≤ y B . For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks in the second problem have to be smaller than the smaller ranks in the first problem. But with yA ≥ yA and yA ≤ yA that can only be if yA = yA , which leads to y ≤ y , and by monotonicity implies G(y) ≤ G(y ). However that is a contradiction to G(y) > G(y ), and therefore this is not a viable case for comparing confidence. Case 4: G(y) > G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B , which means that y ≤ y . Thus, by the monotonicity of G, G(y ) ≤ G(y ). Since G(y ) < G(y) the preference in the second decision problem is also with A (i.e., y ) and the confidence is at least as high as the first decision problem. 4.5 Applying Regularization to Rankboost It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost H(x) = ∑ wt ht (x) (Eq. 1). To make H an increasing and concave function of constituent rankings y j = f j (x) we need to constrain the weak ranking learners. If the learners themselves are monotonically increasing and concave functions of y j , then linearly combining them with positive weights will give an H that is also an increasing and concave function of y j . 799 M ELNIK , VARDI AND Z HANG In this paper, we apply the “third method” (Freund et al., 2003) to setting a wt weight value. That is, weak learners are selected on their ability to maximize r from Eq. 4 and then wt = 0.5 ln ((1 + rmax ) / (1 − rmax )) . (5) Therefore, using monotonic and concave weak learners we select only ones that rankboost gives a positive r value to, which renders a positive wt weight. If no r values are positive the rankboost algorithm stops. We mention that a ranking function can be construed as the negative of a score function, that is, G(y) = −S(y). For score functions these regularization properties become monotonically decreasing, and convex (Melnik et al., 2004). 5. Minimum Weighted Group Ranks The functional structure of the new learner we propose is h(y) = min {γ1 y1 , . . . , γn yn , 1} , (6) where y = (y1 , . . . , yn ) = ( f1 (x), . . . , fn (x)) the vector of ranking features, and the γ j are learned positive coefficients. Note that the function’s range is in [0, 1] due to the 1 term in the min function. Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function in y. Thus if these learners are linearly combined with positive weights, the resulting ranking function, H, will have three-point consistency in confidence and preference. To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992) that assigns combined ranks as the best rank each class receives from any ranker. This is a confidence based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, it becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy. 5.1 Learning At each boosting iteration, rather than selecting from all possible weak learners of form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let F = { f1 (x), . . . , fn (x)} be the set of ranking features. Recall that in rankboost H(x) = (t) (t) (t) ∑t wt ht (x), where ht (x) = min γ1 f1 (x), . . . , γn fn (x), 1 and γ j are learned coefficients. We set (s) (s) Ht = h(x) h(x) = min γ1 f1 (x), . . . , γn fn (x) , s ≤ t and select ht+1 (x) from weak learners of the form hnew (x) = min αh(x), β f (x), 1 (7) (t+1) with h(x) ∈ Ht and f (x) ∈ F . This learner can be rewritten in the form of Eq. 6 with the γ j derived from the learned α, β, h(x) and f (x). Thus, at each iteration we look at combinations of 800 C ONCAVE L EARNERS FOR R ANKBOOST the features and existing learners. As discussed in the next section, we can either consider all such combinations, or use a heuristic selection strategy. We propose to optimize α and β separately, in stages. That is, given a value for one of the variables we optimize the other. As α and β are symmetric we show how to optimize β given α. We need to find a value of β that maximizes r in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as: r = ∑ D(t) (x , x )(h(x ) − h(x )) x ,x = ∑ π(x)h(x) x where π(x) = ∑x D(t) (x , x) − D(t) (x, x ) . Given the form of Eq. 7 we can write r as a function of β, ∑ (β f (x) − 1) π(x) + ∑ r (β) = αh(x) − 1 π(x). β f j (x)≤min(αh(x),1) (8) αh(x) < 1 , then O(βl+1 ) = O(βl ) − π(xl ), P(βl+1 ) = P(βl ) − f (xl )π(xl ), Q(βl+1 ) = Q(βl ) +W π(xl ), R(βl+1 ) = R(βl ) +W αh(xl )π(xl ). Combining these formulas gives algorithm 2 for optimizing β. Algorithm 2 Algorithm for optimizing β Given α, h(x) ∈ Ht , f (x) ∈ F and the training instances. For all x’s generate and sort the set of candidate βs, B, such that β 1 ≤ β2 ≤ · · · ≤ β|B| and βl = min(αh(xl ),1) . f (xl ) O = ∑xl π(xl ) P = ∑xl f (xl )π(xl ) Q=0 R=0 rbest = 0 for j = 1 . . . |B| do r = βl P − O + R − Q if r > rbest then rbest = r end if O = O − π(xl ), P = P − f (xl )π(xl ) if αh(xl ) < 1 then Q = Q + π(xl ), R = R + αh(xl )π(xl ) end if end for 5.2 Heuristics for Learner Selection If at each boosting iteration we select from all combinations of h(x) and f (x) we end up with an O(T 2 ) algorithm, where T is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed sized pool of previously trained learners h(x) and features f (x), where at each iteration the pool is randomly chosen from the full Ht and F . To improve performance, instead of using a uniform sampling distribution we can apply a heuristic to focus the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for r to be large the terms f (x)π(x) and h(x)π(x) need to be large. We can consider s f = ∑x f (x)π(x) and sh = ∑x h(x)π(x) as indicators of how well these components work with π(x). Thus, we might expect larger r values to occur when these two score values are larger. Of course, we are discounting interactions, which is the reason for the combination. Using these score values, we can order all h(x) and f (x) separately, and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting 802 C ONCAVE L EARNERS FOR R ANKBOOST Training error on Dup II 9 P=1 P=1/2 P=1/4 Avg rank of correct class 8.8 8.6 8.4 8.2 8 7.8 7.6 7.4 7.2 1 2 3 4 5 6 7 8 9 Iteration number Figure 2: These plots show convergence of training error for 3 values of the heuristic selection pressure value p. These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen the heuristic improves the convergence rate of the training error. method. Thus, for example, all f ∈ F are sorted by their score and are assigned a number based on the rank of these scores,(maxrank − rank)/maxrank, that gives each f an equally sized bin in the range 0 and 1. Given a random number ξ ∼ U(0, 1), we calculate ξ p and select the f that corresponds to the bin this number falls into. Here p < 1 can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the p parameter in one of our experiments. 6. Face Recognition Experiments We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner. Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order or ranking to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy. 6.1 Algorithm Methods We consider a data set I of face images (queries) to train on. For each query image, i in I, we need to rank all u ∈ U faces in the gallery. In rankboost the favor function, Φ, that specifies output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, i, and the item to be ranked, u, as an instance, x ≡ (i, u). As such, f j (x) = f j ((i, u)) is the 803 M ELNIK , VARDI AND Z HANG Error convergence 9 Train error on dup II Test error on dup I Avg rank of correct class 8.5 8 7.5 7 6.5 6 0 10 20 30 40 50 60 70 80 90 100 Iteration number Figure 3: This plot is typical of the convergence behavior of rankboost with MWGR on the FERET data. Both training and test errors tended to converge within 10-30 iterations of boosting with no significant post-convergence divergence. rank assigned to identity u for query image i by recognition algorithm j. As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let u ∗ be the the correct identity for training image i, then Φ ((i, u) , (i, u∗ )) = +1 and Φ ((i, u∗ ) , (i, u)) = −1 for all u = u∗ , setting all remaining elements of Φ (x , x ) = 0. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor. Note that since there is no favor interaction between queries (different i’s); Φ (x , x ) is effectively a function of 3 variables, (i, u , u ). Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of f (x) and h(x), with a selection pressure of p = 0.5. For each candidate, first β was optimized with α = 1, then α was optimized using the optimized β. The candidate with the most positive r value was always selected. This is summarized in algorithm 3. 6.2 Experimental Setup FERET (Phillips et al., 2000) was a government sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images which consists of pictures taken the same day with different facial 804 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 3 Algorithm for selecting learner from pool. Given poolsize and selection pressure. Calculate s f j and sh for all rank features and existing learners. for p = 1 . . . poolsize do Generate d f ∼ U(0, 1) and dh ∼ U(0, 1) Select which h = min αh(x), β f (x), 1 to try, using s f j , sh , u f , uh . Setting α = 1, optimize β (algorithm 2) Keeping the optimized β, optimize α (algorithm 2) Get r of learner with this α, β if r > 0 and r > rbest then rbest = r, hbest = h, hbest = min αh(x), β f (x) end if end for ht = hbest, Ht+1 = Ht ∪ hbest expressions; the fafc data set of 194 images that contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images that has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study where dup II was also a subset of dup I. The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principle components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose to use the ANM algorithm which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm’s performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland’s 1997 test submission (UMD) and the University of Southern California’s 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997). The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets. We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all 4 FERET data sets and randomly shuffled them together. We call this the homogeneous data set as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes. 805 M ELNIK , VARDI AND Z HANG For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, where we get combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets. To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks. All ranks returned by a boosted combiner for the correct class above 30 were truncated to 30. In evaluating the performance of the combiners not all the test data are equally useful. We consider the following two cases as non informative. When the two best face recognizers, UMD and USC both give the correct class a rank of 1 there is very little reason for the combined rank to be different. Also when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30) it makes little sense to compare between the combiners. The testing data was filtered to remove these cases, and the results are presented without them. Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992). We present two sets of experiments—the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers. For all experiments we measure the average rank of the correct class for both learners: A= 1 N ∑ min {Rank (xi∗ ) , 30} , N i=1 where Rank (xi∗ ) is the rank of the correct class for query i, and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference. 6.3 Results In the experiments with the homogeneous data sets, combining all classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and p-value of 0.039, where combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with standard deviation of 3.6, a paired t-test statistic of 0.68 and p-value of 0.24. Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in columns bin mean and mwgr mean with the aver806 C ONCAVE L EARNERS FOR R ANKBOOST ANM ARL EFAVG EFML1 EFML2 EXCA MSU RUT UMD USC dup i 9.52 16.85 15.02 18.23 15.85 11.9 17.64 17.87 13.21 12.44 dup ii 18.14 16.96 20.61 22.21 20.33 16.28 23.44 16.48 13.74 6.85 fafb 10.88 8.94 14.43 16.23 12.21 11.67 6.81 12.93 4.57 5.84 fafc 19.71 28.02 26.17 17.39 19.78 20.30 14.27 22.79 8.96 5.17 Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems. age rank of the correct class for all constituent classifiers in Table 1). The diff mean column contains the improvement of MWGR over the binary learner in terms of average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values. The two no improvement experiments do not have significant p-values. Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight or nine of those 11 have significant p-values. The one no improvement experiment does not have a significant p-value. It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some case we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents. All sets of experiments, homogeneous data, heterogeneous data sets, combining 3 select recognizers and combining all 10 recognizers at once yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests. 7. Information Retrieval Experiments The annual Text REtrieval Conference (TREC) generates high-quality retrieval results of different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task that uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003) we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwal, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc, ricMM, jscbtawtl4, apl10wd. Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents, which hopefully has greater precision than the individual systems. For the 807 M ELNIK , VARDI AND Z HANG test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 7.19 7.36 8.31 5.25 5.15 5.19 2.62 3.38 2.60 2.85 2.40 2.56 mwgr mean 5.96 6.21 6.27 5.38 4.4 4.95 2.22 2.60 2.28 2.78 2.43 2.03 diff mean 1.22 1.14 2.04 -.12 0.74 0.23 0.39 0.78 0.32 0.06 -.03 0.52 diff std 5.22 5.15 5.51 4.0 4.61 5.1 3.09 3.79 2.81 3.33 2.27 2.57 pval .7e-4 .001 .1e-6 .658 .018 .275 .115 .029 .142 .424 .537 .028 Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets. test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 6.73 8.06 6.68 5.75 6.24 5.86 2.67 3.56 2.68 3.36 3.17 2.22 mwgr mean 5.89 7.09 4.87 5.67 5.47 5.31 2.07 2.45 2.17 2.51 2.95 2.23 diff mean 0.84 0.96 1.81 0.08 0.76 0.55 0.59 1.11 0.51 0.85 0.22 -0.01 diff std 4.79 6.01 5.24 4.61 4.22 4.87 2.97 4.51 2.03 3.23 3.95 2.08 pval .009 .014 .5e-6 .408 .009 .074 .033 .012 .01 .007 .3 .52 Table 3: Results of combining all 10 classifiers using individual FERET data sets. 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of system results range between 662 and 2664. 7.1 Methods In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents. Thus, we can set the favor function similarly to the way it was set in 808 C ONCAVE L EARNERS FOR R ANKBOOST JuruFull 0.759 hum01tdlx 0.760 UniNEn7d 0.763 iit01m 0.762 apl10wd 0.735 jscbtawt14 0.761 csiro0mwa1 0.721 kuadhoc2001 0.727 flabxtd 0.741 ok10wtnd1 0.745 fub01be2 0.734 ricMM 0.765 Table 4: Normalized mean average precision for each constituent. the face recognition classification task. Consider a data set with I queries to train on. For each query, i in I, we need to rank all u ∈ U documents. An instance in this case is a pair, x ≡ (i, u). For each u∗ a relevant document and u an irrelevant document for query i, we set Φ ((i, u) , (i, u ∗ )) = +1 and Φ ((i, u∗ ) , (i, u)) = −1, setting all remaining elements of Φ (x0 , x1 ) = 0. Thus, all relevant documents are given a positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents are given zero favor. As the task has only 50 queries, rather than separating the data into train and test, we opted to do a cross validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank position in the list. It is described by the following equation: AveP = ∑N (Prec(r) × rel(r)) r=1 number o f relevant documents where r is the rank, N is the number of documents retrieved, rel() is a binary function of the relevance of the document at rank r,and Prec is the precision ( [number of relevant documents retrieved] / [number of documents retrieved]) at a given cut-off rank. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank. As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4) MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using algorithm 3, with a selection pressure of p = 0.5. 7.2 Results As seen in Figure 4, the MWGR learner converges in significantly less iterations than the binary learner. This could possibly be attributed to the fact that the MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so the MWGR can better accommodate the variety in the 1000 ranks being considered. The normalized mean average precision for the MWGR after 100 iterations was 0.8537 and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4. Both weak learners had a performance rate of change of approximately 3 ∗ 10−5 on their final iteration (better for MWGR). A paired t-test on the cross validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR. 809 M ELNIK , VARDI AND Z HANG Cross Validation Performance 0.88 MWGR Binary 0.87 0.86 0.85 0.84 Mean Avg Precision 0.83 0.82 0.81 0.8 0.79 0.78 0.77 0.76 0.75 0.74 0.73 0.72 0.71 0.7 0 50 100 150 200 Iteration Number 250 300 Figure 4: The cross validation mean average precision score of the two weak learners, MWGR and binary, as a function of boosting iteration number. 8. Discussion The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how can the structure of ranks be correctly exploited to maximize performance. The semi parametric nature of rankboost offers the possibility to generate arbitrarily flexible ranking functions. But as observed this flexibility comes at a cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and as such, how they can be combined. We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show a consistency in how they treat the preference and relative confidence exhibited by constituents. We prove that under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function. We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR) is that its parameters can be individually optimized readily with respect to the rankboost criteria, allowing it to be tested on real-world data. In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR proce810 C ONCAVE L EARNERS FOR R ANKBOOST dure. We also conducted experiments on heterogeneous data, testing the robustness or adaptivity of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner. Our second experiment was on an Information Retrieval task taken from the TREC conference. In this task we see MWGR converges in significantly less iterations and generates statistically significant improved performance. Final Words Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition. Acknowledgments This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and, DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions which improved the presentation of the results. Appendix A. In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions. For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function G(y) < G(y ) yB − y B ≤ 0 < y A − y A zA ≤ yA , yA − yA ≤ zA − zA ⇒ G(z ) − G(z) > G(y ) − G(y) yB ≤ z B , zB − z B ≤ y B − y B (11) where y and y are the rank vectors from the first decision problem and z and z are the rank vectors from the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say about it though is that as the constraints are linear, in the same way that concavity and monotonicity in the weak learner conferred the same properties to the ranking function, a weak learner that satisfies Eq. 11 will also confer those properties to the ranking function that uses it with positive weights. References K. Arrow. Social Choice and Individual Values. Wiley, 1951. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001. 811 M ELNIK , VARDI AND Z HANG Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996. T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992. T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto (Eds.), editors, Structured Document Image Analysis, pages 188–202. Springer-Verlag, Heidelberg, 1992. J. Kittler and F. Roli, editors. Multiple Classifier Systems, Lecture Notes in Computer Science 1857, 2000. Springer. R. Meir and G. Ratsch. Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, chapter An introduction to boosting and leveraging, pages 119–184. Springer, 2003. O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004. H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001. P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for facerecognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:1090– 1104, 2000. M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM. G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983. E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov. L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775– 779, 1997. W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng. Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. SpringerVerlag, Berlin, 1998. 812
4 0.37151861 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
Author: Marc Boullé
Abstract: The naive Bayes classifier has proved to be very effective on many real data applications. Its performance usually benefits from an accurate estimation of univariate conditional probabilities and from variable selection. However, although variable selection is a desirable feature, it is prone to overfitting. In this paper, we introduce a Bayesian regularization technique to select the most probable subset of variables compliant with the naive Bayes assumption. We also study the limits of Bayesian model averaging in the case of the naive Bayes assumption and introduce a new weighting scheme based on the ability of the models to conditionally compress the class labels. The weighting scheme on the models reduces to a weighting scheme on the variables, and finally results in a naive Bayes classifier with “soft variable selection”. Extensive experiments show that the compressionbased averaged classifier outperforms the Bayesian model averaging scheme. Keywords: naive Bayes, Bayesian, model selection, model averaging
5 0.34242818 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
Author: Marc Teboulle
Abstract: Center-based partitioning clustering algorithms rely on minimizing an appropriately formulated objective function, and different formulations suggest different possible algorithms. In this paper, we start with the standard nonconvex and nonsmooth formulation of the partitioning clustering problem. We demonstrate that within this elementary formulation, convex analysis tools and optimization theory provide a unifying language and framework to design, analyze and extend hard and soft center-based clustering algorithms, through a generic algorithm which retains the computational simplicity of the popular k-means scheme. We show that several well known and more recent center-based clustering algorithms, which have been derived either heuristically, or/and have emerged from intuitive analogies in physics, statistical techniques and information theoretic perspectives can be recovered as special cases of the proposed analysis and we streamline their relationships. Keywords: clustering, k-means algorithm, convex analysis, support and asymptotic functions, distance-like functions, Bregman and Csiszar divergences, nonlinear means, nonsmooth optimization, smoothing algorithms, fixed point methods, deterministic annealing, expectation maximization, information theory and entropy methods
6 0.3175326 9 jmlr-2007-AdaBoost is Consistent
7 0.29009742 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
8 0.28302929 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
9 0.26261121 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
10 0.23729753 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
12 0.20377651 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
13 0.1955803 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
14 0.18681583 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
15 0.18047965 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
17 0.17814577 43 jmlr-2007-Integrating Naïve Bayes and FOIL
18 0.17461216 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
20 0.16484672 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
topicId topicWeight
[(8, 0.02), (10, 0.013), (12, 0.019), (15, 0.015), (28, 0.063), (40, 0.032), (45, 0.014), (48, 0.018), (60, 0.052), (85, 0.034), (98, 0.626)]
simIndex simValue paperId paperTitle
1 0.99675238 23 jmlr-2007-Concave Learners for Rankboost
Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang
Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. While the inputs and outputs of this problem are ranks, the outputs can be simplified to only return the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to c 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang. M ELNIK , VARDI AND Z HANG a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. 2. Rankboost The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as f j (x) which is the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as f 1 . . . fn , where f j (x ) < f j (x ) implies that instance x has a better rank than instance x under ranking f j . For example, in some of the experiments we present, each f j is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, H(x), which defines a linear ordering on the instances, that is, H (x ) < H (x ) iff x has a better rank than x . In rankboost T H(x) = ∑ wt ht (x), (1) t=1 a weighted sum of weak ranking learners, where the ht (x)’s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, Φ : X × X → R, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here Φ (x , x ) > 0 means that x should be better ranked than x for a given query. It is an anti-symmetric function, that is, Φ (x , x ) = −Φ (x , x ) and Φ (x, x) = 0, which avoids loops where two instances should both be ranked better than the other. Also Φ (x , x ) = 0 when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings: D x ,x = c · max 0, Φ x , x −1 , (2) where c = ∑x ,x max (0, Φ (x , x )) . The algorithm itself is very similar to adaboost (Freund and Schapire, 1996). At each iteration the algorithm selects the weak learner that best maximizes a 792 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 1 rankboost algorithm for generating a ranking function. Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm. Output: ranking function H(x) (Eq. 1) Initialize D(1) = D (Eq. 2) for t = 1 . . . T do Find weak learner, ht , that maximizes r(h) = ∑x ,x D(t) (x , x )(h(x ) − h(x )) (Eq. 4). Choose wt = 0.5 ln ((1 + r(ht )) / (1 − r(ht ))) . (Eq. 5) Update: D(t+1) (x , x ) = D(t) (x , x ) exp (wt (ht (x ) − ht (x ))) Zt where Zt is chosen to make ∑x ,x D(t+1) (x , x ) = 1 end for rank function’s utility, r(h), (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). After which the weight function D is adjusted to reflect the impact of the weak learner. Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs: ∑ D x ,x I H x ≤ H x x ,x where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form: i f f j (x) > θ, 1 0 i f f j (x) ≤ θ, (3) h(x) = qde f i f f j (x) unde f ined. Each binary weak learner, h, operates on a single f j (ranking feature), giving a binary output of 0 or 1 depending on whether the instance’s rank is smaller than a learned parameter, θ. Note that function h(x) in (3), which is dependent on only one f j (x) with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified q de f value. 3. Rankboost, the Weak Learner and Regularization While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error: N ∑ d (t) (i)I [yi = h (zi )] , i=1 where yi is the class label, I is the indicator function and d (t) is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to 793 M ELNIK , VARDI AND Z HANG choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function: r = r(h) = ∑ D(t) (x , x )(h(x ) − h(x )), (4) x ,x where D(t) is the weight function over pairs of instances. As Eq. 4 contains the term h(x ) − h(x ), short of linear learners this equation can not be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm’s ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al. (2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001). 4. Regularizing Ranking Functions with Consistency in Preference and Confidence In this paper, as Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y j = f j (x), for that instance, so that H(x) = G ( f1 (x), . . . , fn (x)), for certain functions G (y1 , . . . , yn ) = G(y). Here, we assume that the f j (x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. (2003) make a strong case for conversion of raw measures to relative orderings (rankings) over combining measures directly, arguing that it is more general and avoids the semantics of particular measures. As the y j ’s are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning case our motivations are different. Fairness is not the goal; the goal is to improve the 794 C ONCAVE L EARNERS FOR R ANKBOOST accuracy or performance of the ranking function. Thus, regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy. Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if a j ≤ b j , j = 1, . . . , n then G(a) ≤ G(b). Monotonicity was enforced by requiring that the coefficients, wt in Eq. 1, combining the binary weak learners (Eq. 3) were cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits. In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts. 4.1 Grocer Example A grocer needs to make 2 decisions, to decide between stocking oat bran vs. granola and to decide between stocking turnips vs. radishes. The grocer asks her consultants to express their opinion about stocking each item, and based on their responses makes her 2 decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, that is, they all prefer granola over oat bran in the first problem. Lets assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision. Lets say that the same consultants that liked granola more also liked radishes more (and the same ones that like oat bran more like turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants relative confidence. The above properties of preference and confidence imply that preference should be applied consistently across different problems and confidence should reflect the relative confidences of the individual consultants. 4.2 The General Principle of Consistency in Preference and Confidence To generalize the above grocer’s problem consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference. The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and that the following condition holds for any pair of decision problems. Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants in the sense that A is the (non empty) set of all consultants whose preference is identical to that of the combiner. 795 M ELNIK , VARDI AND Z HANG Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then, the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem. Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, B = Ac . If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below. Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering). In particular, the level of confidence between different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable. 4.3 Confidence for Ranks and Ranking Functions In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence for low ranks than high ranks. For example, we would expect the difference between ranks of 1 and 3 to represent a more significant distinction on the part of a ranker than would for example the difference between ranks 34 and 36 which may be almost arbitrarily assigned. Specifically we make the following definition: Definition 2. Preference and confidence for univariate rank values For any pair of rank values {r, r } ⊂ R with r < r , by monotonicity r is preferred. For any two pairs of rank values {r, r } and {r , r } with r < r and r ≤ r , the confidence level for {r, r }is higher than that of {r , r } if r ≤ r , r −r ≤ r −r with at least one inequality. The pair{r, r }provides no preference if r = r . Likewise, if either r − r = r − r = 0 or r − r = r − r = 0, the two pairs {r, r } and {r , r }are defined to have the same level of confidence. In this definition confidence between pairs of ranks can only be compared if the pair with the lowest rank has a gap at least as large as the other pair. Thus, we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and applies only a partial ordering on rank pairs. As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility. 796 C ONCAVE L EARNERS FOR R ANKBOOST As the rank function, G, is a univariate function, we apply the same definitions of preference and confidence to it. That is, for a ranking function G and for any fixed rank vectors, y and y , the preferred rank vector is the one with smaller G, that is, y is preferred iff G(y) < G(y ). For confidence again we need to consider pairs of rank vectors, {y, y } and {y , y }, with G(y) ≤ G(y ) and G(y ) ≤ G(y ). If G(y) ≤ G(y ) and G(y ) − G(y) ≥ G(y ) − G(y ) then we say that the first decision, between yand y , shows equal or more confidence than the second decision between y and y . This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if in the second decision the confidence of the agreeing rank values are comparable and increase and the confidence of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides. 4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks As described, the consistency property is over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors y, y and y . We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.. Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in individual components of y and is jointly concave in y. Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If y ≤ y component wise, then the individual components of y all agree in their preference to y, so that G(y) ≤ G(y ) by the consistency of G in preference. This implies the monotonicity of G in y. We pick three points y, y − a and y + a, such that all points are rank vectors. Consider the pair of binary comparison problems, y vs. y + a and y − a vs. y. Assume that G(y) < G(y + a). These three points are comparable by Definition 2 (r − r = r − r for all components in the rank vectors). Since the agreeing rankers, A = j y j < y j + a j , in the y vs. y + a comparison are greater than in the y − a vs. y comparison, and the disagreeing ranks, outside A, are smaller, then by the consistency of the combiner in preference and confidence (definition 1), G(y − a) ≤ G(y) and G(y) − G(y − a) ≥ G(y + a) − G(y). A function G is concave in a convex domain iff 2G(y) ≥ G(y − a) + G(y + a), for every y and a with y ± a in its domain. To verify this inequality for a particular y and a, we must have G(y + a) > G(y), G(y − a) > G(y) or the third case of G(y) ≥ max{G(y + a), G(y − a)}. We have already proven that the consistency properties of preference and confidence imply G(y) − G(y − a) ≥ G(y + a) − G(y) in the first case. By symmetry this requirement also must hold for −a, so that G(y) − G(y + a) ≥ G(y − a) − G(y) in the second case. This completes the necessity part of the proof since 2G(y) ≥ G(y − a) + G(y + a) automatically holds in the third case. Proof of Sufficiency: Assume the ranking function G is nondecreasing in individual components of y and jointly concave in y. We need to prove consistency in preference and confidence for a pair 797 M ELNIK , VARDI AND Z HANG 1) G(y) < G(y’) y ≤ y’’ A A y’’ A 4 A A 10 y’ 8 y’ 6 4) G(y) > G(y’) y ≥ y’’ A 10 8 A 2) G(y) < G(y’) y ≥ y’’ A 10 8 6 y A y 6 4 4 2 2 2 0 0 y y’ y’’ y’’ 0 2 4 6 8 10 0 2 4 B 6 8 10 0 0 B 2 4 6 8 10 B Figure 1: Consider two decision problems in two dimensions involving three points, y, y and y , where the first decision problem is to choose between y and y and the second problem is to choose between y and y . The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and relative location of y . The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides. of decision problems involving three rank vectors y, y and y . Without loss of generality, suppose that the first problem is to choose between y and y and the second problem is to choose between y and y . We break the proof up by the results of the comparison between y vs. y and the relative location of the third ranking vector y that satisfies the preference and confidence requirements. If G(y) = G(y ) = G(y ), the combiner has equal confidence in the two decision problems by definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where G(y) = G(y ). Figure 1 illustrates three of these cases two dimensionally, where there is a single agreeing component A, the y-axis, and a single disagreeing component B, the x-axis. Let A be the set of agreeing indices, A = j : sgn(y j − y j ) = sgn (G(y ) − G(y)) = 0 , and B = c the set of disagreeing indices. We use the notation y = (y , j ∈ A) to describe corresponding A j A subvectors. Also, when used, vector inequalities are component wise. Case 1: G(y) < G(y ) and yA ≤ yA In the first decision problem, as G agrees with A and disagrees with B yA < y A , yB ≥ y B . / We also know that A = 0 since otherwise y ≥ y and therefore by monotonicity G(y) ≥ G(y ), which is a contradiction. For the second decision problem we consider the values of y which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence or switching sides along the B indices. As y A ≤ yA , either by the confidence relationship or by switching sides (see Figure 1) these y values satisfy yA − y A ≥ y A − y A , 798 yB − y B ≤ y B − y B C ONCAVE L EARNERS FOR R ANKBOOST which implies y ≤ y , and therefore G(y ) ≤ G(y ) by monotonicity. Thus, G(y) < G(y ) ≤ G(y ), which implies that in the second decision problem of choosing between y and y , the preference of G is the same as that of A (i.e., y since yA ≤ yA ) and the confidence of G is at least as high as the first decision problem. Case 2: G(y) < G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B . This implies that y + y ≤ 2y, which means that G(y ) + G(y ) ≤ 2G y +y /2 ≤ 2G(y) by the concavity and monotonicity of G. Thus, G(y) − G(y ) ≥ G(y ) − G(y) and since G(y ) > G(y), we have that G(y) − G(y ) > 0 and thus G(y) > G(y ). Therefore we see that the preference in G for the two decision problems is the same as A (with y in the first problem and with y in the second problem) and the confidence is no smaller for the second comparison. Case 3: G(y) > G(y ) and yA ≤ yA Since y is preferred in the first decision problem we have yA > y A , yB ≤ y B . For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks in the second problem have to be smaller than the smaller ranks in the first problem. But with yA ≥ yA and yA ≤ yA that can only be if yA = yA , which leads to y ≤ y , and by monotonicity implies G(y) ≤ G(y ). However that is a contradiction to G(y) > G(y ), and therefore this is not a viable case for comparing confidence. Case 4: G(y) > G(y ) and yA ≥ yA Since in this case yA ≥ yA , then either by the confidence relationship or by switching sides y satisfies yA − y A ≥ y A − y A , yB − y B ≤ y B − y B , which means that y ≤ y . Thus, by the monotonicity of G, G(y ) ≤ G(y ). Since G(y ) < G(y) the preference in the second decision problem is also with A (i.e., y ) and the confidence is at least as high as the first decision problem. 4.5 Applying Regularization to Rankboost It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost H(x) = ∑ wt ht (x) (Eq. 1). To make H an increasing and concave function of constituent rankings y j = f j (x) we need to constrain the weak ranking learners. If the learners themselves are monotonically increasing and concave functions of y j , then linearly combining them with positive weights will give an H that is also an increasing and concave function of y j . 799 M ELNIK , VARDI AND Z HANG In this paper, we apply the “third method” (Freund et al., 2003) to setting a wt weight value. That is, weak learners are selected on their ability to maximize r from Eq. 4 and then wt = 0.5 ln ((1 + rmax ) / (1 − rmax )) . (5) Therefore, using monotonic and concave weak learners we select only ones that rankboost gives a positive r value to, which renders a positive wt weight. If no r values are positive the rankboost algorithm stops. We mention that a ranking function can be construed as the negative of a score function, that is, G(y) = −S(y). For score functions these regularization properties become monotonically decreasing, and convex (Melnik et al., 2004). 5. Minimum Weighted Group Ranks The functional structure of the new learner we propose is h(y) = min {γ1 y1 , . . . , γn yn , 1} , (6) where y = (y1 , . . . , yn ) = ( f1 (x), . . . , fn (x)) the vector of ranking features, and the γ j are learned positive coefficients. Note that the function’s range is in [0, 1] due to the 1 term in the min function. Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function in y. Thus if these learners are linearly combined with positive weights, the resulting ranking function, H, will have three-point consistency in confidence and preference. To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992) that assigns combined ranks as the best rank each class receives from any ranker. This is a confidence based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, it becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy. 5.1 Learning At each boosting iteration, rather than selecting from all possible weak learners of form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let F = { f1 (x), . . . , fn (x)} be the set of ranking features. Recall that in rankboost H(x) = (t) (t) (t) ∑t wt ht (x), where ht (x) = min γ1 f1 (x), . . . , γn fn (x), 1 and γ j are learned coefficients. We set (s) (s) Ht = h(x) h(x) = min γ1 f1 (x), . . . , γn fn (x) , s ≤ t and select ht+1 (x) from weak learners of the form hnew (x) = min αh(x), β f (x), 1 (7) (t+1) with h(x) ∈ Ht and f (x) ∈ F . This learner can be rewritten in the form of Eq. 6 with the γ j derived from the learned α, β, h(x) and f (x). Thus, at each iteration we look at combinations of 800 C ONCAVE L EARNERS FOR R ANKBOOST the features and existing learners. As discussed in the next section, we can either consider all such combinations, or use a heuristic selection strategy. We propose to optimize α and β separately, in stages. That is, given a value for one of the variables we optimize the other. As α and β are symmetric we show how to optimize β given α. We need to find a value of β that maximizes r in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as: r = ∑ D(t) (x , x )(h(x ) − h(x )) x ,x = ∑ π(x)h(x) x where π(x) = ∑x D(t) (x , x) − D(t) (x, x ) . Given the form of Eq. 7 we can write r as a function of β, ∑ (β f (x) − 1) π(x) + ∑ r (β) = αh(x) − 1 π(x). β f j (x)≤min(αh(x),1) (8) αh(x) < 1 , then O(βl+1 ) = O(βl ) − π(xl ), P(βl+1 ) = P(βl ) − f (xl )π(xl ), Q(βl+1 ) = Q(βl ) +W π(xl ), R(βl+1 ) = R(βl ) +W αh(xl )π(xl ). Combining these formulas gives algorithm 2 for optimizing β. Algorithm 2 Algorithm for optimizing β Given α, h(x) ∈ Ht , f (x) ∈ F and the training instances. For all x’s generate and sort the set of candidate βs, B, such that β 1 ≤ β2 ≤ · · · ≤ β|B| and βl = min(αh(xl ),1) . f (xl ) O = ∑xl π(xl ) P = ∑xl f (xl )π(xl ) Q=0 R=0 rbest = 0 for j = 1 . . . |B| do r = βl P − O + R − Q if r > rbest then rbest = r end if O = O − π(xl ), P = P − f (xl )π(xl ) if αh(xl ) < 1 then Q = Q + π(xl ), R = R + αh(xl )π(xl ) end if end for 5.2 Heuristics for Learner Selection If at each boosting iteration we select from all combinations of h(x) and f (x) we end up with an O(T 2 ) algorithm, where T is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed sized pool of previously trained learners h(x) and features f (x), where at each iteration the pool is randomly chosen from the full Ht and F . To improve performance, instead of using a uniform sampling distribution we can apply a heuristic to focus the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for r to be large the terms f (x)π(x) and h(x)π(x) need to be large. We can consider s f = ∑x f (x)π(x) and sh = ∑x h(x)π(x) as indicators of how well these components work with π(x). Thus, we might expect larger r values to occur when these two score values are larger. Of course, we are discounting interactions, which is the reason for the combination. Using these score values, we can order all h(x) and f (x) separately, and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting 802 C ONCAVE L EARNERS FOR R ANKBOOST Training error on Dup II 9 P=1 P=1/2 P=1/4 Avg rank of correct class 8.8 8.6 8.4 8.2 8 7.8 7.6 7.4 7.2 1 2 3 4 5 6 7 8 9 Iteration number Figure 2: These plots show convergence of training error for 3 values of the heuristic selection pressure value p. These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen the heuristic improves the convergence rate of the training error. method. Thus, for example, all f ∈ F are sorted by their score and are assigned a number based on the rank of these scores,(maxrank − rank)/maxrank, that gives each f an equally sized bin in the range 0 and 1. Given a random number ξ ∼ U(0, 1), we calculate ξ p and select the f that corresponds to the bin this number falls into. Here p < 1 can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the p parameter in one of our experiments. 6. Face Recognition Experiments We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner. Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order or ranking to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy. 6.1 Algorithm Methods We consider a data set I of face images (queries) to train on. For each query image, i in I, we need to rank all u ∈ U faces in the gallery. In rankboost the favor function, Φ, that specifies output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, i, and the item to be ranked, u, as an instance, x ≡ (i, u). As such, f j (x) = f j ((i, u)) is the 803 M ELNIK , VARDI AND Z HANG Error convergence 9 Train error on dup II Test error on dup I Avg rank of correct class 8.5 8 7.5 7 6.5 6 0 10 20 30 40 50 60 70 80 90 100 Iteration number Figure 3: This plot is typical of the convergence behavior of rankboost with MWGR on the FERET data. Both training and test errors tended to converge within 10-30 iterations of boosting with no significant post-convergence divergence. rank assigned to identity u for query image i by recognition algorithm j. As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let u ∗ be the the correct identity for training image i, then Φ ((i, u) , (i, u∗ )) = +1 and Φ ((i, u∗ ) , (i, u)) = −1 for all u = u∗ , setting all remaining elements of Φ (x , x ) = 0. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor. Note that since there is no favor interaction between queries (different i’s); Φ (x , x ) is effectively a function of 3 variables, (i, u , u ). Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of f (x) and h(x), with a selection pressure of p = 0.5. For each candidate, first β was optimized with α = 1, then α was optimized using the optimized β. The candidate with the most positive r value was always selected. This is summarized in algorithm 3. 6.2 Experimental Setup FERET (Phillips et al., 2000) was a government sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images which consists of pictures taken the same day with different facial 804 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 3 Algorithm for selecting learner from pool. Given poolsize and selection pressure. Calculate s f j and sh for all rank features and existing learners. for p = 1 . . . poolsize do Generate d f ∼ U(0, 1) and dh ∼ U(0, 1) Select which h = min αh(x), β f (x), 1 to try, using s f j , sh , u f , uh . Setting α = 1, optimize β (algorithm 2) Keeping the optimized β, optimize α (algorithm 2) Get r of learner with this α, β if r > 0 and r > rbest then rbest = r, hbest = h, hbest = min αh(x), β f (x) end if end for ht = hbest, Ht+1 = Ht ∪ hbest expressions; the fafc data set of 194 images that contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images that has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study where dup II was also a subset of dup I. The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principle components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose to use the ANM algorithm which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm’s performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland’s 1997 test submission (UMD) and the University of Southern California’s 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997). The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets. We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all 4 FERET data sets and randomly shuffled them together. We call this the homogeneous data set as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes. 805 M ELNIK , VARDI AND Z HANG For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, where we get combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets. To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks. All ranks returned by a boosted combiner for the correct class above 30 were truncated to 30. In evaluating the performance of the combiners not all the test data are equally useful. We consider the following two cases as non informative. When the two best face recognizers, UMD and USC both give the correct class a rank of 1 there is very little reason for the combined rank to be different. Also when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30) it makes little sense to compare between the combiners. The testing data was filtered to remove these cases, and the results are presented without them. Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992). We present two sets of experiments—the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers. For all experiments we measure the average rank of the correct class for both learners: A= 1 N ∑ min {Rank (xi∗ ) , 30} , N i=1 where Rank (xi∗ ) is the rank of the correct class for query i, and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference. 6.3 Results In the experiments with the homogeneous data sets, combining all classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and p-value of 0.039, where combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with standard deviation of 3.6, a paired t-test statistic of 0.68 and p-value of 0.24. Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in columns bin mean and mwgr mean with the aver806 C ONCAVE L EARNERS FOR R ANKBOOST ANM ARL EFAVG EFML1 EFML2 EXCA MSU RUT UMD USC dup i 9.52 16.85 15.02 18.23 15.85 11.9 17.64 17.87 13.21 12.44 dup ii 18.14 16.96 20.61 22.21 20.33 16.28 23.44 16.48 13.74 6.85 fafb 10.88 8.94 14.43 16.23 12.21 11.67 6.81 12.93 4.57 5.84 fafc 19.71 28.02 26.17 17.39 19.78 20.30 14.27 22.79 8.96 5.17 Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems. age rank of the correct class for all constituent classifiers in Table 1). The diff mean column contains the improvement of MWGR over the binary learner in terms of average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values. The two no improvement experiments do not have significant p-values. Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight or nine of those 11 have significant p-values. The one no improvement experiment does not have a significant p-value. It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some case we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents. All sets of experiments, homogeneous data, heterogeneous data sets, combining 3 select recognizers and combining all 10 recognizers at once yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests. 7. Information Retrieval Experiments The annual Text REtrieval Conference (TREC) generates high-quality retrieval results of different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task that uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003) we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwal, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc, ricMM, jscbtawtl4, apl10wd. Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents, which hopefully has greater precision than the individual systems. For the 807 M ELNIK , VARDI AND Z HANG test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 7.19 7.36 8.31 5.25 5.15 5.19 2.62 3.38 2.60 2.85 2.40 2.56 mwgr mean 5.96 6.21 6.27 5.38 4.4 4.95 2.22 2.60 2.28 2.78 2.43 2.03 diff mean 1.22 1.14 2.04 -.12 0.74 0.23 0.39 0.78 0.32 0.06 -.03 0.52 diff std 5.22 5.15 5.51 4.0 4.61 5.1 3.09 3.79 2.81 3.33 2.27 2.57 pval .7e-4 .001 .1e-6 .658 .018 .275 .115 .029 .142 .424 .537 .028 Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets. test set dup i dup i dup i dup ii dup ii dup ii fafb fafb fafb fafc fafc fafc train set dup ii fafb fafc dup i fafb fafc dup i dup ii fafc dup i dup ii fafb bin mean 6.73 8.06 6.68 5.75 6.24 5.86 2.67 3.56 2.68 3.36 3.17 2.22 mwgr mean 5.89 7.09 4.87 5.67 5.47 5.31 2.07 2.45 2.17 2.51 2.95 2.23 diff mean 0.84 0.96 1.81 0.08 0.76 0.55 0.59 1.11 0.51 0.85 0.22 -0.01 diff std 4.79 6.01 5.24 4.61 4.22 4.87 2.97 4.51 2.03 3.23 3.95 2.08 pval .009 .014 .5e-6 .408 .009 .074 .033 .012 .01 .007 .3 .52 Table 3: Results of combining all 10 classifiers using individual FERET data sets. 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of system results range between 662 and 2664. 7.1 Methods In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents. Thus, we can set the favor function similarly to the way it was set in 808 C ONCAVE L EARNERS FOR R ANKBOOST JuruFull 0.759 hum01tdlx 0.760 UniNEn7d 0.763 iit01m 0.762 apl10wd 0.735 jscbtawt14 0.761 csiro0mwa1 0.721 kuadhoc2001 0.727 flabxtd 0.741 ok10wtnd1 0.745 fub01be2 0.734 ricMM 0.765 Table 4: Normalized mean average precision for each constituent. the face recognition classification task. Consider a data set with I queries to train on. For each query, i in I, we need to rank all u ∈ U documents. An instance in this case is a pair, x ≡ (i, u). For each u∗ a relevant document and u an irrelevant document for query i, we set Φ ((i, u) , (i, u ∗ )) = +1 and Φ ((i, u∗ ) , (i, u)) = −1, setting all remaining elements of Φ (x0 , x1 ) = 0. Thus, all relevant documents are given a positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents are given zero favor. As the task has only 50 queries, rather than separating the data into train and test, we opted to do a cross validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank position in the list. It is described by the following equation: AveP = ∑N (Prec(r) × rel(r)) r=1 number o f relevant documents where r is the rank, N is the number of documents retrieved, rel() is a binary function of the relevance of the document at rank r,and Prec is the precision ( [number of relevant documents retrieved] / [number of documents retrieved]) at a given cut-off rank. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank. As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4) MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using algorithm 3, with a selection pressure of p = 0.5. 7.2 Results As seen in Figure 4, the MWGR learner converges in significantly less iterations than the binary learner. This could possibly be attributed to the fact that the MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so the MWGR can better accommodate the variety in the 1000 ranks being considered. The normalized mean average precision for the MWGR after 100 iterations was 0.8537 and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4. Both weak learners had a performance rate of change of approximately 3 ∗ 10−5 on their final iteration (better for MWGR). A paired t-test on the cross validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR. 809 M ELNIK , VARDI AND Z HANG Cross Validation Performance 0.88 MWGR Binary 0.87 0.86 0.85 0.84 Mean Avg Precision 0.83 0.82 0.81 0.8 0.79 0.78 0.77 0.76 0.75 0.74 0.73 0.72 0.71 0.7 0 50 100 150 200 Iteration Number 250 300 Figure 4: The cross validation mean average precision score of the two weak learners, MWGR and binary, as a function of boosting iteration number. 8. Discussion The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how can the structure of ranks be correctly exploited to maximize performance. The semi parametric nature of rankboost offers the possibility to generate arbitrarily flexible ranking functions. But as observed this flexibility comes at a cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and as such, how they can be combined. We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show a consistency in how they treat the preference and relative confidence exhibited by constituents. We prove that under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function. We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR) is that its parameters can be individually optimized readily with respect to the rankboost criteria, allowing it to be tested on real-world data. In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR proce810 C ONCAVE L EARNERS FOR R ANKBOOST dure. We also conducted experiments on heterogeneous data, testing the robustness or adaptivity of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner. Our second experiment was on an Information Retrieval task taken from the TREC conference. In this task we see MWGR converges in significantly less iterations and generates statistically significant improved performance. Final Words Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition. Acknowledgments This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and, DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions which improved the presentation of the results. Appendix A. In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions. For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function G(y) < G(y ) yB − y B ≤ 0 < y A − y A zA ≤ yA , yA − yA ≤ zA − zA ⇒ G(z ) − G(z) > G(y ) − G(y) yB ≤ z B , zB − z B ≤ y B − y B (11) where y and y are the rank vectors from the first decision problem and z and z are the rank vectors from the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say about it though is that as the constraints are linear, in the same way that concavity and monotonicity in the weak learner conferred the same properties to the ranking function, a weak learner that satisfies Eq. 11 will also confer those properties to the ranking function that uses it with positive weights. References K. Arrow. Social Choice and Individual Values. Wiley, 1951. C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001. 811 M ELNIK , VARDI AND Z HANG Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003. Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996. T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992. T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto (Eds.), editors, Structured Document Image Analysis, pages 188–202. Springer-Verlag, Heidelberg, 1992. J. Kittler and F. Roli, editors. Multiple Classifier Systems, Lecture Notes in Computer Science 1857, 2000. Springer. R. Meir and G. Ratsch. Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, chapter An introduction to boosting and leveraging, pages 119–184. Springer, 2003. O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004. H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001. P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for facerecognition algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22:1090– 1104, 2000. M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM. G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983. E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov. L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775– 779, 1997. W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng. Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. SpringerVerlag, Berlin, 1998. 812
2 0.9964627 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
Author: Gaëlle Loosli, Stéphane Canu
Abstract: In a recently published paper in JMLR, Tsang et al. (2005) present an algorithm for SVM called Core Vector Machines (CVM) and illustrate its performances through comparisons with other SVM solvers. After reading the CVM paper we were surprised by some of the reported results. In order to clarify the matter, we decided to reproduce some of the experiments. It turns out that to some extent, our results contradict those reported. Reasons of these different behaviors are given through the analysis of the stopping criterion. Keywords: SVM, CVM, large scale, KKT gap, stopping condition, stopping criteria
same-paper 3 0.99332392 70 jmlr-2007-Ranking the Best Instances
Author: Stéphan Clémençon, Nicolas Vayatis
Abstract: We formulate a local form of the bipartite ranking problem where the goal is to focus on the best instances. We propose a methodology based on the construction of real-valued scoring functions. We study empirical risk minimization of dedicated statistics which involve empirical quantiles of the scores. We first state the problem of finding the best instances which can be cast as a classification problem with mass constraint. Next, we develop special performance measures for the local ranking problem which extend the Area Under an ROC Curve (AUC) criterion and describe the optimal elements of these new criteria. We also highlight the fact that the goal of ranking the best instances cannot be achieved in a stage-wise manner where first, the best instances would be tentatively identified and then a standard AUC criterion could be applied. Eventually, we state preliminary statistical results for the local ranking problem. Keywords: ranking, ROC curve and AUC, empirical risk minimization, fast rates
4 0.99242854 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. Keywords: kernel methods, support vector machines, sparseness, estimating conditional probabilities
5 0.99242324 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
6 0.86182398 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
7 0.86027753 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
9 0.83378828 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
10 0.8321687 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
11 0.83099264 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
13 0.81398904 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
14 0.81298763 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
15 0.80927634 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
16 0.80870509 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
17 0.80588633 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
18 0.80463624 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
19 0.80371058 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
20 0.79035944 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation