jmlr jmlr2007 jmlr2007-55 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rocío Alaiz-Rodríguez, Alicia Guerrero-Curieses, Jesús Cid-Sueiro
Abstract: The design of a minimum risk classifier based on data usually stems from the stationarity assumption that the conditions during training and test are the same: the misclassification costs assumed during training must be in agreement with real costs, and the same statistical process must have generated both training and test data. Unfortunately, in real world applications, these assumptions may not hold. This paper deals with the problem of training a classifier when prior probabilities cannot be reliably induced from training data. Some strategies based on optimizing the worst possible case (conventional minimax) have been proposed previously in the literature, but they may achieve a robust classification at the expense of a severe performance degradation. In this paper we propose a minimax regret (minimax deviation) approach, that seeks to minimize the maximum deviation from the performance of the optimal risk classifier. A neural-based minimax regret classifier for general multi-class decision problems is presented. Experimental results show its robustness and the advantages in relation to other approaches. Keywords: classification, imprecise class distribution, minimax regret, minimax deviation, neural networks 1. Introduction - Problem Motivation In the general framework of learning from examples and specifically when dealing with uncertainty, the robustness of the decision machine becomes a key issue. Most machine learning algorithms are based on the assumption that the classifier will use data drawn from the same distribution as the training data set. Unfortunately, for most practical applications (such as remote sensing, direct marketing, fraud detection, information filtering, medical diagnosis or intrusion detection) the target class distribution may not be accurately known during learning: for example, because the cost of labelling data may be class-dependent or the prior probabilities are non-stationary. Therefore, the data used to design the classifier (within the Bayesian context (see VanTrees, 1968), the c 2007 Roc´o Alaiz-Rodr´guez, Alicia Guerrero-Curieses and Jesus Cid-Sueiro. ´ ı ı A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I prior probabilities and the misclassification costs) may be non representative of the underlying real distributions. If the ratio of training data corresponding to each class is not in agreement with real class distributions, designing Bayes decision rules based on prior probabilities estimated from these data will be suboptimal and can seriously affect the reliability and performance of the classifier. A similar problem may arise if real misclassification costs are unknown during training. However, they are usually known by the end user, who can adapt the classifier decision rules to cost changes without re-training the classifier. For this reason, our attention in this paper is mainly focused on the problem of uncertainty in prior probabilities. Furthermore, being aware that class distribution is seldom known (at least totally) in real world applications, a robust approach (as opposite to adaptive) that prevents severe performance degradation appears to be convenient for these situations. Besides other adaptive and robust approaches that address this problem (discussed in more detail in Section 2.2) it is important to highlight those that handle the problem of uncertainty in priors by following a robust minimax principle: minimize the maximum possible risk. Analytic foundations of minimax classification are widely considered in the literature (see VanTrees, 1968; Moon and Stirling, 2000; Duda et al., 2001, for instance) and a few algorithms to carry out minimax decisions have been proposed. From computationally expensive ones such as estimating probability density functions (Takimoto and Warmuth, 2000; Kim, 1996) or using methods from optimization (Polak, 1997) to simpler ones like neural network training algorithms (Guerrero-Curieses et al., 2004; AlaizRodriguez et al., 2005). Minimax classifiers may, however, be seen as over-conservative since its goal is to optimize the performance under the least favorable conditions. Consider, for instance, a direct marketing campaign application carried out in order to maximize profits. Since optimal decisions rely on the proportion of potential buyers and it is usually unknown in advance, our classification system should take into account this uncertainty. Nevertheless, following a pure minimax strategy can lead to solutions where minimizing the maximum loss implies considering there are no potential clients. If it is the case, this minimax approach does not seem to be suitable for this kind of situation. In this imprecise class distribution scenario, it can be noticed that the classifier performance may be highly deviated from the optimal, that is, that of the classifier knowing actual priors. Minimizing this gap (that is, the maximum possible deviation with respect to the optimal classifier) is the focus of this paper. We seek for a system as robust as the conventional minimax approach but less pessimistic at the same time. We will refer to it as a minimax deviation (or minimax regret) classifier. In contrast to other robust and adaptive approaches, it can be used in general multiclass problems. Furthermore, as shown in Guerrero-Curieses et al. (2004), minimax approaches can be used in combination with the adaptive proposal by Saerens et al. (2002) to exploit its advantages. This minimax regret approach has recently been applied in the context of parameter estimation (Eldar et al., 2004; Eldar and Merhav, 2004) and a similar competitive strategy has been used in the context of hypothesis testing (Feder and Merhav, 2002). Under prior uncertainty, our solution provides an upper bound of the performance divergence from the optimal classifier. We propose a simple learning rate scaling algorithm in order to train a neural-based minimax deviation classifier. Although training can be based on minimizing any objective function, we have chosen objective functions that provide estimates of the posterior probabilities (see Cid-Sueiro and Figueiras-Vidal, 2001, for more details). 104 M INIMAX R EGRET C LASSIFIER This paper is organized as follows: the next section provides an overview of the problem as well as some previous approaches to cope with it. Next, Section 3 states the fundamentals of minimax classification together with a deeper analysis of the minimax regret approach proposed in this paper. Section 4 presents a neural training algorithm to get a neural-based minimax regret classifier under complete uncertainty. Moreover, practical situations with partial uncertainty in priors are also discussed. A learning algorithm to solve them is provided in Section 5. In Section 6, some experimental results show that minimax regret classifiers outperform (in terms of maximum risk deviation) classifiers trained on re-balanced data sets and those with the originally assumed priors. Finally, the main conclusions are summarized in Section 7. 2. Problem Overview Traditionally, supervised learning lies in the fact that training data and real data come from the same (although unknown) statistical model. In order to carefully analyze to what extend classifier performance depends on conditions such as class distribution or decision costs, learning and decision theory principles are briefly revisited. Next, some previous approaches to deal with environment imprecision are reviewed. 2.1 Learning and Making Optimal Decisions Let S = {(xk , dk ), k = 1, . . . , K} denote a set of labelled samples where xk ∈ RN is an observation feature vector and dk ∈ UL = {u0 , . . . , uL−1 } is the label vector. Class-i label ui is a unit L-dimensional vector with components ui, j = δi j , with every component equal to 0, except the i-th component which is equal to 1. We assume a learning process that estimates parameters w of a non-linear mapping f w : RN → P from the input space into probability space P = {p ∈ [0, 1]L | ∑L−1 pi = 1}. The soft decision is given i=0 by yk = fw (xk ) ∈ P and the hard output of the classifier is denoted by d. Note that d and d will be used to distinguish the actual class from the predicted one, respectively. Several costs (or benefits) associated with each possible decision are also defined: c i j denotes the cost of deciding in favor of class i when the true class is j. Negative values represent benefits (for instance, cii , which is the cost of correctly classifying a sample from class i could be negative in some practical cases). In general cost-sensitive classification problems, either misclassification costs c i j or cii costs can take different values for each class. Thus, there are many applications where classification errors lead to very different consequences (medical diagnosis, fault detection, credit risk analysis), what implies misclassification costs ci j that may largely vary between them. In the same way, there are also many domains where correct decision costs (or benefits) c ii do not take the same value. For instance, in targeted marketing applications (Zadrozny and Elkan, 2001), correctly identifying a buyer implies some benefit while correctly classifying a non buyer means no income. The same ¨ applies to medical diagnosis domains such as the gastric carcinoma problem studied in G uvenir et al. (2004). In this case, the benefit of correct classification also depends on the class: the benefit of correctly classifying an early stage tumor is higher than that of a later stage. The expected risk (or loss) R is given by L−1 L−1 R = ∑ ∑ ci j P{d = ui |d = u j }Pj j=0 i=0 105 , (1) A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I where P{d = ui |d = u j } with i = j represent conditional error probabilities, and P j = P{d = u j } is the prior probability of class u j . Defining the conditional risk of misclassifying samples from class u j as L−1 Rj = ∑ ci j P{d = ui |d = u j } , i=0 we can express risk (1) as L−1 R= ∑ Ri Pi . (2) i=0 It is well-known that the Bayes decision rule for the minimum risk is given by L−1 d = arg min { ∑ ci j P{d = u j |x}} , ui (3) j=0 where P{d = ui |x} is the a posteriori probability of class i given sample x. The optimal decision rule depends on posterior probabilities and therefore, on the prior probabilities and the likelihood. In theory, as long as posterior probabilities (or likelihood and prior probabilities) are known, the optimal decision in Eq. (3) can be expressed after a trivial manipulation as a function of the cost differences between the costs (ci j − c j j ) (Duda et al., 2001). This is the reason why c j j is usually assumed to be zero and the value of the cost difference is directly assigned to c i j . When dealing ¨ with practical applications, however, some authors (Zadrozny and Elkan, 2001; G uvenir et al., 2004) have urged to use meaningful decision costs measured over a common baseline (and not necessarily taking c j j = 0) in order to avoid mistakes that otherwise could be overlooked. For this reason and, what is more important, the uncertainty class distribution problem addressed in this paper, decision costs measured over a common baseline are considered. Furthermore, absolute values of decision costs are relevant to the design of classifiers under the minimax regret approach. 2.2 Related Work: Dealing with Cost and Prior Uncertainty Most proposals to address uncertainty in priors fall into the categories of adaptive and robust solutions. While the aim of a robust solution is to avoid a classifier with very poor performance under any conditions, an adaptive system pursues to fit the classifier parameters using more incoming data or more precise information. With an adaptive-oriented principle, Provost (2000) states that, once the classifier is trained under specific class distributions and cost assumptions (not necessarily the operating conditions), the selection of the optimal classifier for specific conditions is carried out by a correct placement of the decision thresholds. In the same way, the approaches in Kelly et al. (1999) and Kubat et al. (1998) consider that tuning the classifier parameters should be left to the end user, expecting that class distributions and misclassification costs will be precisely known then. Some graphical methods based on the ROC curve have been proposed in Adams and Hand (1998) and Provost and Fawcett (2001) in order to compare the classifier performance under imprecise class distributions and/or misclassification costs. The ROC convex hull method presented in Provost and Fawcett (2001) (or the alternative representation proposed in Drummond and Holte (2000)) allows the user to select potentially optimal classifiers, providing a flexible way to select 106 M INIMAX R EGRET C LASSIFIER them when precise information about priors or costs is available. Under imprecision, some classifiers can be discarded but this does not necessarily provide a method to select the optimal classifier between the possible ones and fit its parameters. Furthermore, due to its graphical character, these methods are limited to binary classification problems. Changes in prior probabilities have also been discussed by Saerens et al. (2002), who proposes a method based on re-estimating the prior probabilities of real data in an unsupervised way and subsequently adjusting the outputs of the classifier according to the new a priori probabilities. Obviously, the method requires enough unlabelled data being available for re-estimation. As an alternative to adaptive schemes, several robust solutions have been proposed, as the resampling methods, especially in domains where imbalanced classes come out (Kubat and Matwin, 1997; Lawrence et al., 1998; Chawla et al., 2002; Barandela et al., 2003). Either by undersampling or oversampling, the common purpose is to balance artificially the training data set in order to get a uniform class distribution, which is supposed to be the least biased towards any class and, thus, the most robust against changes in class distributions. The same approach is followed in cost sensitive domains, but with some subtle differences in practice. It is well known that class priors and decision costs are intrinsically related. For instance, different decision costs can be simulated by altering the priors and vice versa (see Ting, 2002, for instance). Thus, when a uniform distribution is desired in a cost sensitive domain, but working with cost insensitive decision machines, class priors are altered according to decision costs, what is commonly referred as rebalancing. The manipulation of the training data distribution has been applied to cost-sensitive learning in two-class problems (Breiman et al., 1984) in the following way: basically, the class with higher misclassification cost (suppose n times the lowest misclassification cost) is represented with n times more examples than the other class. Besides random sampling strategies, other sampling-based rebalancing schemes have been proposed to accomplish this task, like those considering closeness to the boundaries between classes (Japkowicz and Stephen, 2002; Zhou and LiuJ, 2006) or the costproportionate rejection sampling presented in Zadrozny et al. (2003). Extending the formulation of this type of procedures to general multiclass problems with multiple (and possibly asymmetric) inter-class misclassification costs appears to be a nontrivial task (Zadrozny et al., 2003; Zhou and LiuJ, 2006), but some progress has been made recently with regard to this latter point (Abe et al., 2004). Note, also, that many (although not all) of these rebalancing strategies are usually implemented by oversampling and/or subsampling, that is, replicating examples (without adding any extra information) and/or deleting them (which implies information loss). 3. Robust Classifiers Under Prior Uncertainty: Minimax Classifiers Prior probability uncertainty can be coped from a robust point of view following a minimax derived strategy. Minimax regret criterion is discussed in this section after presenting the conventional minimax criterion. Although our approach extends to general multi-class problems and the discussion is carried out in that way, we will first illustrate, for the sake of clarity and simplicity, a binary situation. 3.1 Minimax Classifiers As Eq. (3) shows, the minimum risk decisions depend on the misclassification costs, c i j , and the posterior class probabilities and, thus, they depend on the prior probabilities, Pi . Different prior 107 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I PSfrag replacements distributions (frequency for each class) give rise to different Bayes classifiers. Fig. 1 shows the Bayes risk curve, RB (P1 ) versus class-1 prior probability for a binary classification problem. Standard RF (Q1 , P1 ) R0 Minimax RF (Q1mM , P1 ) Risk c00 Minimax Deviation RF (Q1mMd , P1 ) Rbasis R1 RB (P1 ) c11 0 Q1 Q1mM 1 Q1mMd P1 Figure 1: Risk vs. P1 . Minimum risk curve and performance under prior changes for the standard, minimax and minimax deviation classifier. RB (P1 ) stands for the optimal Bayes Risk against P1 . RF (Q1 , P1 ) denotes the Risk of a standard classifier (Fixed decision rule optimized for prior probabilities Q1 estimated in the training phase) against P1 . RF (Q1mM , P1 ) denotes the Risk of a minimax classifier (Fixed decision rule optimized for the minimax probabilities Q1mM ) against P1 . RF (Q1mMd , P1 ) denotes the Risk of a minimax deviation classifier (Fixed decision rule optimized for the minimax deviation probabilities Q 1mMd ) against P1 . If the prior probability distribution is unknown when the classifier is designed, or this distribution changes with time or from one environment to other, the mismatch between training and test conditions can degrade significantly the classifier performance. For instance, assume that Q = (Q0 , Q1 ) is the vector with class-0 and class-1 prior probabilities estimated in the training phase, respectively, and let RB (Q1 ) represent the minimum (Bayes) risk attainable by any decision rule for these priors. Note, that, according to Eq. (2), for a given classifier, the risk is a linear function of priors. Thus, risk RF (Q1 , P1 ) associated to the (fixed) classifier optimized for Q changes linearly with actual prior probabilities P1 and P0 = 1 − P1 , going from (0, R0 ) to (1, R1 ) (the continuous line in Fig. 1), where R0 and R1 refer to the class conditional risks for classes 0 and 1, respectively. Fig. 1 shows the impact of this change in priors and how performance deviates from optimal. Also, it can be shown (see VanTrees, 1968, for instance) that the minimum risk curve obtained for each prior is convex and the risk function of a given classifier verifies R F (Q1 , P1 ) ≥ RB (P1 ) with a tangent point at P1 = Q1 . 108 M INIMAX R EGRET C LASSIFIER The dashed line in Fig. 1 shows the performance of the minimax classifier, which minimizes the maximum possible risk under the least favorable priors, thus providing the most robust solution, in the sense that performance becomes independent from priors. From Fig. 1, it becomes clear that the minimax classifier is optimal for prior probabilities P = QmM = (Q0mM , Q1mM ) maximizing RB . Thus, this strategy is equivalent to maximizing the minimum risk (Moon and Stirling, 2000; Duda et al., 2001). We will refer to them as the minimax probabilities. Fig. 1 also makes clear that although a minimax classifier is a robust solution to address the imprecision in priors, it may become a somewhat pessimistic approach. 3.2 Minimax Deviation Classifiers We propose an alternative classifier that, instead of minimizing the maximum risk, minimizes the maximum deviation (regret) from the optimal Bayes classifier. In the following, we will refer to it as the minimax deviation or minimax regret classifier. A comparison between minimax and minimax deviation approaches is also shown in Fig. 1. This latter case corresponds to a classifier trained on prior probabilities P = Q mMd with performance as a function of priors given by a line (a plane or hyperplane for three or more classes, respectively) parallel to what we name, in the following, basis risk (Rbasis = c00 (1 − P1 ) + c11 P1 ). Note that the maximum deviation (with respect to priors) of the classifier optimized for Q is given by D(Q) = max {RF (Q1 , P1 ) − RB (P1 )} = max {R0 − c00 , R1 − c11 } . P1 The inspection of Fig. 1 shows that the minimum of D (with respect to Q) is achieved when R0 − c00 = R1 − c11 , which means that line RF (Q1 , P1 ) is parallel to arc named Rbasis in the figure and tangent to RB at Q1mMd . Therefore, the minimax regret classifier is also the Bayes solution with respect to the least favorable priors (Q0mMd , Q1mMd ) (see Berger, 1985, for instance), which will be denoted as minimax deviation probabilities. Now, we extend the formulation to a general L-class problem. Definition 1 Consider a L-class decision problem with costs ci j , 0 ≤ i, j < L and c j j ≤ ci j , and let Rw (P) be the risk of a decision machine with parameter vector w when prior class probabilities are given by P = (P0 , . . . , PL−1 ). The deviation function is defined as Dw (P) = Rw (P) − RB (P) and the minimax deviation is defined as DmMd = inf max{Dw (P)} . w P (4) Note that the above definition assumes that the maximum exists. This is actually the case, since Dw (P) is a linear function over a compact set, P . Note, also, that our definition includes the natural assumption that c j j is never higher than ci j , meaning that making a decision error is always less costly than taking the correct decision. This assumption is used in part of our theoretical analysis. The algorithms proposed in this paper are based on the fact that the minimax deviation can be computed without knowing RB 109 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Theorem 2 The minimax deviation is given by DmMd = inf max{Dw (P)} , w P where Dw (P) = Rw (P) − Rbasis (P) and (5) L−1 ∑ c j j Pj Rbasis (P) = . (6) j=0 Proof Note that, according to Eqs. (1) and (2), for any decision machine and any u i ∈ UL , L−1 R(u j ) = R j = ∑ ci j P{d = ui |d = u j } ≥ c j j . i=0 Since the bound is reached by the classifier deciding d = u j for any observation x, we have RB (u j ) = c j j . Therefore, using Eq. (6), we find that, for any u ∈ UL , RB (u) = Rbasis (u) and, thus, Dw (u) = Dw (u) . Since Bayes minimum risk RB (P) is a convex function of priors and Rw (P) is linear, Dw (P) is concave and, thus, it is maximum at some of the vertices in P (i.e., at some P = u ∈ U L ). Thus, max{Dw (P)} = max {Dw (u)} . u∈UL P (7) Since the maximum difference between two hyperplanes defined over P is always at some vertex, we can conclude that max{Dw (P)} = max {Dw (u)} = max {Dw (u)} . P u∈UL u∈UL (8) Combining Eqs. (4), (7) and (8), we get DmMd = inf max{Dw (P)} . w P Note that Rbasis represents the risk baseline of the ideal classifier with zero errors. Th. 2 shows that the minimax regret can be computed as the minimax deviation to this ideal classifier. Note, also, that if costs cii do not depend on i, Eq. (5) becomes equivalent (up to a constant) to the Bayes risk and the minimax regret classifier becomes equivalent to the minimax classifier . Another important result for the algorithms proposed in this paper is that, under some conditions on the minimum risk, the minimum and maximum operators can be permuted. Although general results on the permutability of minimum and maximum operators can be found in the literature (see Polak, 1997, for instance), we provide here the proof for the specific case interesting to this paper. 110 M INIMAX R EGRET C LASSIFIER Theorem 3 Consider the minimum deviation function given by Dmin (P) = inf{Dw (P)} , (9) w where Dw (P) is the normalized deviation function given by Eq. (5), and let P ∗ be the prior probability vector providing the maximum deviation, P∗ = arg max Dmin (P) P . (10) If Dmin (P) is continuously differentiable at P = P∗ , then the minimax deviation, DmMd , defined by Eq. (4), is DmMd = Dmin (P∗ ) = max inf Dw (P) . (11) P w Proof For any classifier with parameter vector w, we can write, max Dw (P) ≥ Dw (P∗ ) ≥ Dmin (P∗ ) P and, thus, inf max Dw (P) ≥ Dmin (P∗ ) . w P (12) Therefore, Dmin (P∗ ) is a lower bound of the minimax regret. Now we prove that Dmin (P∗ ) is also an upper bound. According to Eq. (9), for any ε > 0, there exists a parameter vector wε such that Dwε (P∗ ) ≤ Dmin (P∗ ) + ε . (13) By definition, for any P, Dmin (P) ≤ Dwε (P). Therefore, using Eq. (13), we can write Dwε (P∗ ) − Dwε (P) ≤ Dmin (P∗ ) − Dmin (P) + ε . (14) Since Dmin (P) is continuously differentiable and (according to Eq. (10)) maximum at P ∗ , for any ε > 0 there exists δ > 0 such that, for any P ∈ P with P∗ − P ≤ δ we have Dmin (P∗ ) − Dmin (P) ≤ ε P∗ − P ≤ ε δ . (15) Let Pδ a prior such that P∗ − Pδ = δ. Taking ε = ε δ and combining Eqs. (14) and (15) we can write Dwε (P∗ ) − Dwε (Pδ ) ≤ 2ε δ . Since the above condition is verified for any ε > 0 and any prior Pδ at distance δ from P, and taking into account that Dwε (P) is a linear function of P, we conclude that the maximum slope of D wε (P) is bounded by 2ε and, thus, for any P ∈ P , we have √ Dwε (P) − Dwε (P∗ ) ≤ 2ε P − P∗ ≤ 2 2ε , √ (where we have used the fact that the maximum distance between two probability vectors is 2). Therefore, we can write √ max Dwε (P) ≤ Dwε (P∗ ) + 2 2ε P 111 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I and, thus, √ inf max Dw (P) ≤ Dwε (P∗ ) + 2 2ε . w P √ Finally, using Eq. (13) and taking into account that ε = ε δ ≤ 2ε we get √ inf max Dw (P) ≤ Dmin (P∗ ) + 3 2ε . w P (16) Since the above is true for any ε > 0 we conclude that Dmin (P∗ ) is also an upper bound of Dw . Therefore, combining Eqs. (12) and (16), we conclude that inf max Dw (P) = Dmin (P∗ ) , w P which completes the proof. Note that the deviation function needs to be neither differentiable nor a continuous function of w parameters. If the minimum deviation function is not continuously differentiable at the minimax deviation probability, P∗ , the theorem cannot be applied. The reason is that, although there should exist at least one classifier providing the minimum deviation at P = P∗ , it or they could not provide a constant deviation with respect to the prior probability. The situation can be illustrated with an example. Let x ∈ R be given by p(x|d = 0) = 0.8N(x, σ) + 0.2N(x − 2, σ) and p(x|d = 1) = 0.2N(x − 1, σ) + 0.8N(x − 3, σ), where σ = 0.5 and N(x, σ) = (2πσ)−1/2 exp(−x2 /(2σ2 )), and consider the set Φλ of classifiers given by a single threshold over x and decision dˆ = 1 if x ≥ λ 0 if x < λ. Fig. 2 shows the distribution of both classes over x, and Fig. 3 shows, as a function of priors, the minimum error probability (continuous line) that can be obtained using classifiers in Φ λ . Note that decision costs c00 = c11 = 0 and c01 = c10 = 1 have been considered for this illustrative problem. An abrupt slope change is observed at the minimax deviation probability, for P{d = 1} = 1/2. For this prior, there are two single threshold classifiers providing the minimum error probability, which are given by thresholds λ1 and λ2 in Fig. 2. However, as shown in Fig. 3 neither of them provides a risk that is constant in the prior. The minimax deviation classifier in Φ λ , which has a threshold λ0 , does not attain minimum risk at the minimax deviation probability and, thus, cannot be obtained by using Eq. (11). For this example, the desired robust classifier should have a deviation function given by the horizontal dotted line in Fig. 3. Fortunately, it can be obtained by combining the outputs of several classifiers. For instance, let dˆ1 and dˆ2 the decisions of classifiers given by thresholds λ1 and λ2 , respectively. It is not difficult to see that the classifier selecting dˆ1 and dˆ2 at random (for each input sample x) provides a robust classifier. This procedure can be extended to the multiclass-case: consider a set of L classifiers with parameters wk , k = 0, . . . , L − 1, and consider the classifier such that, for any input sample x, makes a decision equal to dk (i.e., the decision of classifier with parameters wk ), with probability qk . It is not difficult to show that the deviation function of this classifier is given by L−1 D(P) = L−1 j=0 k=0 ∑ Pj ∑ qk D j (wk ) 112 , M INIMAX R EGRET C LASSIFIER 0.7 0.6 Likelihoods 0.5 0.4 0.3 0.2 0.1 λ 0 −2 λ −1 0 λ 0 1 1 2 2 3 4 5 x Figure 2: The conditional data distributions for the one-dimensional example discussed in the text. λ1 and λ2 are the thresholds providing the minimum risk at the minimax deviation probability. λ0 provides the minimax deviation classifier. where D j (wk ) = R j (wk ) − c j j . In order to get a constant deviation function, probabilities q k should be chosen in such a way that L−1 ∑ qk D j (wk ) = D , k=0 where D is a constant. Solving these linear equations for q k , k = 0, . . . , L − 1 (with the constraint ∑k qk = 1), the required probabilities can be found. Note that, in order to build the non-deterministic classifier providing a constant deviation, a set of L independent classifiers that are optimal at the minimax deviation prior should be found. However, we go no further on the investigation of this special case for two main reasons: • The situation does not seem to be common in practice. In our simulations, we have found that the maximum of the minimum risk deviation always provided a response which is approximately parallel to Rbasis . • In general, the abrupt change in the derivative may be a symptom that the classifier structure is not optimal for the data distribution. Instead of building a nondeterministic classifier, increasing the classifier complexity should be more efficient. Although the least favorable prior providing the minimax deviation can be computed in closed form for some simple distributions, in general, it must be computed numerically. Moreover, we assume here that the data distribution is not known, and must be learned from examples. Thus, 113 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 0.25 0.2 λ Error probability 0 0.15 λ λ 1 2 0.1 0.05 0 0 0.2 0.4 0.6 0.8 1 P{ d=1} Figure 3: Error probabilities as a function of prior probability of class 1 for the example in Fig. 2. Thresholds λ1 and λ2 do not provide the minimax deviation classifier, which is obtained for threshold λ0 . However, the random combination of classifiers with thresholds λ 1 and λ2 (dotted line) provides a robust classifier with deviation lower than that of λ 0 . we must incorporate the estimation of the least favorable prior in the learning process. Next, we propose a training algorithm in order to get a minimax regret classifier based on neural networks. 4. Neural Robust Classifiers Under Complete Uncertainty Note that, if QmMd is the probability vector providing the maximum in Eq. (11), that is, QmMd = arg max inf{Dw (P)} w P , then we can write DmMd = inf{Dw (QmMd )} . w Therefore, the minimax deviation classifier can be estimated by training a classifier using prior in QmMd . For this reason, QmMd will be called the minimax deviation prior (or least favorable prior). Our proposed algorithms are based on an iterative process of estimating parameters w based on an estimate of the minimax deviation prior, and re-estimating prior based on an estimate of network weights. This is shown in the following. 114 M INIMAX R EGRET C LASSIFIER 4.1 Updating Network Weights Learning is based on minimizing some empirical estimate of the overall error function L−1 L−1 i=0 E{C(y, d)} = i=0 ∑ P{d = ui }E{C(y, d)|d = ui } = ∑ PiCi , where C(y, d) may be any error function and Ci is the expected conditional error for class-i. Selecting the appropriate error function (see Cid-Sueiro and Figueiras-Vidal, 2001, for instance), learning rules can be designed providing a posteriori probability estimates (y i ≈ P{d = ui |x}, where yi is the soft decision) and, thus, according to Eq. (3), the hard decision minimizing the risk can be approximated by L−1 d = arg min { ∑ ci j y j } . i j=0 The overall empirical error function (cost function) used in learning for priors P = (P0 , . . . , PL−1 ) may be written as L−1 C = ∑ PiCi = L−1 i=0 = = 1 K L−1 i=0 1 K k ∑ d C(yk , dk ), Ki k=1 i Pi K k ∑ d C(yk , dk ) Ki /K k=1 i ∑ i=0 1 K ∑ K k=1 ∑ Pi L−1 ∑ Pi d kC(yk , dk ) (0) i i=0 Pi , , (17) (0) where Pi = Ki /K is an initial estimate of class-i prior based on class frequencies in the training set and Pi is the current prior estimate. Minimizing error function (17) by means of a stochastic gradient descent learning rule leads to update the network weights at k-th iteration as w (k+1) = w (k) (n) L−1 −µ = w(k) − Pi i=0 Pi ∑ L−1 d k ∇ C(yk , dk ) (0) i w ∑ µi (n) k di , ∇wC(yk , dk ) , (18) i=0 where (n) (n) µi = µ Pi (19) (0) Pi (n) is a learning step scaled by the prior ratio. Note that di selects the appropriate µi according to the pattern class membership. The classifier is trained without altering the original training data set (0) class distribution Pi and therefore, without missing or duplicating information. 115 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 4.2 Updating Prior Probabilities Eq. (11) shows that the learning process should maximize (5) with respect to the prior probabilities. The estimate of (5) can be computed as ¯ Dw (P) = Rw (P) − Rbasis (P) , where (20) L−1 ∑ R j Pj (21) 1 L−1 ∑ ci j Ni j N j i=0 (22) Rw (P) = j=0 is the overall Bayes risk estimate and Rj = is the class- j conditional risk estimate where N j is the number of class u j patterns in the training phase and Ni j is the number of samples from class u j assigned to ui . L−1 In order to derive a learning rule to find an estimate Pi satisfying constraints ∑i=0 Pi = 1 and 0 ≤ Pi ≤ 1, we will use auxiliary variables Bi such that Pi = exp(Bi ) L−1 ∑ j=0 exp(B j ) . (23) ¯ We maximize Dw with respect to Bi . Applying the chain rule, ¯ ¯ ∂Dw L−1 ∂Dw ∂Pj =∑ , ∂Bi j=0 ∂Pj ∂Bi and using Eqs. (20), (21) and (23), we get ¯ ∂D w ∂Bi L−1 = ∑ (R j − c j j )Pi (δi j − Pj ), j=0 L−1 L−1 j=0 j=0 = Pi Ri − cii − ∑ (R j Pj ) + ∑ (c j j Pj ) , = Pi Ri − cii − Rw − Rbasis , = Pi Rdi , where Rdi = (Ri − cii ) − (Rw − Rbasis ) . The learning rule for auxiliary variable Bi is (n) Bi (n+1) = Bi + ρ (n) ∂D w , ∂Bi (n) (n) = Bi + ρPi Rdi , 116 (24) M INIMAX R EGRET C LASSIFIER where parameter ρ > 0 controls the rate of convergence. Using Eq. (23) and Eq. (24), the updated learning rule for Pi is (n) (n+1) Pi = (n) (n) (n) exp ρPj Rd j ∑L−1 exp B j j=0 (n) = (n) (n) exp(Bi ) exp ρPi Rdi , (n) (n) Pi exp ρPi Rdi (n) (n) (n) ∑L−1 Pj exp ρPj Rd j j=0 . (25) 4.3 Training Algorithm for a Minimax Deviation Classifier In the previous section, both the network weights updating rule (18) and the prior probability update rule (25) have been derived. The algorithm resulting from the combination is shown as follows: for n = 0 to Niterations − 1 do for k = 1 to K do w(k+1) = w(k) − L−1 ∑ µi (n) k di ∇wC(yk , dk ) i=0 end for (n) Estimate R(n) , Ri , i = 0, . . . , L − 1, according to (21) and (22) (n+1) (n+1) Update minimax probability Pi , i = 0, . . . , L − 1 according to (25) and compute µi with (19) end for 5. Robust Classifiers Under Partial Uncertainty Although in many practical situations prior probabilities may not be specified with precision, they can be partially known. In this section we discuss how partial information about priors can be used to improve the classifier performance in relation to a complete uncertainty situation. From now on, let us consider that lower (or upper) bounds of the priors are known based on previous experience. We will denote the lower and upper bounds of class-i prior probability as Pil and Piu , respectively. In order to illustrate this situation consider a binary classification problem where probability lower bounds P0l and P1l are known. That is, P1 ∈ [P1l , 1 − P0l ] where this interval represents the uncertainty region. Let us denote by Γ = {P : 0 ≤ Pi ≤ 1, ∑L−1 Pi = 1, Pi ≥ Pil } the probability region i=0 satisfying the imposed constraints. In the following, we will refer to Γ as the uncertainty region. Now, the aim is to design a classifier that minimizes the maximum regret from the minimum risk only inside the uncertainty region. This is depicted in Fig. 4(a), which shows that reducing the uncertainty in priors allows to reduce deviation from the optimal classifier. This minimax regret approach for the uncertainty region Γ is often called Γ-minimax regret. As discussed before, the minimax deviation solution gives a Bayes solution with respect some priors denoted in the partial uncertainty case as QΓ mMd in Fig. 4(a), which is the least favorable distribution according to the regret criterion. 117 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I cΓ 00 RΓ basis RΓ 1 cΓ 11 PSfrag replacements Risk 0 0.5 1 1.5 2 2.5 3 3.5 0P1l RB (P) + ψ(P) Minimax Deviation with Restriction Risk RΓ 0 QΓ 1mMd 1 − P0l 1 P1 0P1l 1 − P0l 1 P1 (b) (a) Figure 4: Minimax deviation classifier under partial uncertainty of prior probabilities: (a)Γ-minMaxDev Classifier. (b) Modified cost function defined as R B (P) + ψ(P). In contrast to the minimax regret criterion, note that a classical minimax classifier for the considered uncertainty region would minimize the worst-case risk. It would be a Bayes solution for the prior where the minimum risk reaches its maximum and it could be denoted as Q Γ . mM Notice, also, that these solutions will be the same if the risk for the vertex of Γ take the same value (cΓ = k). ii 5.1 Neural Robust Classifiers Under Partial Uncertainty Minimax search can be formulated as maximizing (with respect to priors) the minimum (with respect to network parameters) of deviation function (5), as described in previous section, but subject to some constraints arg max inf {DΓ (P)} , w w P Pi ≥ Pil , i = 0, . . . , L − 1 s.t. where DΓ = RΓ − RΓ . When uncertainty is global, this hyperplane is defined by the risk in the L w w basis extreme cases with Pi = δik , that is, by the corresponding cii . However, with partial knowledge of the prior probabilities, this hyperplane becomes defined by the risk in L points which are the vertex given by the restrictions and with associated risk denoted by c Γj . j Defining 1 l(Pi ) = , (26) 1 + exp−τ(Pi −Pil ) where τ controls the hardness of this restriction, the minimax problem can be re-formulated as arg max inf {DΓ (P)} w P s.t. w l(Pi ) ≥ 1/2, i = 0, . . . , L − 1. Thus, this constrained optimization problem can be solved as a non-constrained problem by considering an auxiliary function that incorporates the restriction as a barrier function 118 M INIMAX R EGRET C LASSIFIER arg max inf {DΓ (P) + Aψ(P)} , w w P where ψ(Pi ) = log(l(Pi )) and the constant A determines the contribution of the barrier function. Fig. 4(b) shows the new risk function corresponding to the binary case previously depicted in Fig. 4(a). Note that, it is the sum of the original RB (P) and the barrier function ψ(P). As in Section 4.1, in order to derive the network weight learning rule, we need to compute ∂ψ ∂Bi L−1 = ∂ψ ∂P j , j ∂Bi ∑ ∂P j=0 = τPi L−1 ∑ 1 − l(Pk ) (δik − Pk ), k=0 = τPi ψdi , where ψdi = ∑L−1 (1 − l(Pk ))(δik − Pk ) k=0 As τ increases, the constraints become harder around the specified bound. The update learning rule for the auxiliary variable Bi at cycle n is (n+1) Bi (n) Γ(n) (n) (n) = Bi (n) + ρPi Rdi + ρAτPi ψdi . And therefore, using (23), the update learning rule for Pi is (n) (n+1) Pi = (n) Γ(n) Pi exp ρPi Rdi L−1 ∑ (n) Pj exp (n) (n) exp ρAτPi ψdi (n) Γ(n) ρ P j Rd j . (n) (n) ρAτPj ψd j exp j=0 Note that if the upper bound is known instead of the lower bound, l(Pi ) defined by (26) should be replaced by u(Pi ) = (1 + exp(τ(Pi − Piu )))−1 at the previous formulation. The minimax constrained optimization problem has been tackled by considering a new objective function defined by the sum of the original cost function and a barrier function. Studying the convexity of this new function becomes important from the fact that a stationary point of this risk curve is a global maximum. Since the minimum risk curve (RB (P)) is a convex function of the priors (see VanTrees, 1968, for details), if we verify the convexity of the barrier function, we can conclude that the function defined by the sum of both of them is also convex. This barrier function is convex in P if the Hessian matrix HR verifies PT HR P ≤ 0 The Hessian matrix of the barrier function equals to a diagonal matrix D r = diag(r) with all negative diagonal entries ri = Aτ2 (−l(Pi )(1 − l(Pi ))). As l(Pi ) ∈ [0, 1] and therefore, ri ≤ 0, it is straightforward to see that PT HR P = PT Dr P, L−1 = ∑ Pi2 ri ≤ 0 . i=0 Since the barrier function is convex, the new objective function (defined by the sum of two convex functions) is also convex. 119 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 5.2 Extension to Other Learning Algorithms The learning algorithm proposed in this paper is intended to train a minimax deviation classifier based on neural networks with feedforward architecture. Actually, the learning algorithm we propose becomes a feasible solution for any learning process based on minimizing some empirical estimate of an overall cost (error) function. However, it is also applicable to a general classifier provided it is trained (in an iterative process) for the estimated minimax deviation probabilities and the assumed decision costs. Specifically, in this paper, scaling the learning rate allows to simulate different class distributions and the hard decisions are made based on posterior probability estimates and decision costs. Furthermore, the neural learning phase carried out in one iteration can be re-used for the next one, what allows to reduce computational cost with respect to a complete optimization process on each iteration. Apart from the general approach of completely training a classifier on each iteration and in order to reduce its computational cost, specific solutions may be studied for different learning machines. Nonetheless, it seems not feasible to readily achieve this improvement for classifiers like SVMs, where support vectors for one solution may have nothing in common with the ones obtained in next iteration and thus, making necessary to re-train the classifier in each iteration. Another possible solution for any classifier that provides a posteriori probabilities estimates or any score that can be converted into probabilities (for details on calibration methods see Wei et al., 1999; Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005) is outlined here. In this case, an iterative procedure able to estimate the minimax deviation probabilities and consequently to adjust (without re-training) the outputs of the classifier could be studied. The general idea for this approach is as follows: first, the new minimax deviation prior probabilities are estimated according to (25) and then, posterior probabilities provided by the model are adjusted as follows (see Saerens et al., 2002, for more details) (k) Pi P(k) {d = ui |x} = P(k−1) {d = ui |x} (k−1) Pi L−1 P(k) j P(k−1) {d = u j |x} (k−1) j=0 Pj . (27) ∑ The algorithm’s main structure is summarized as for k = 1 to K do (k) Estimate R(k) , Ri , i = 0, . . . , L − 1, according to (21), (22) and decision costs c i j (k+1) Update minimax probability Pi according to (25) Adjust classifier outputs according to (27) end for The effectiveness of this method relies on the accuracy of the initial a posteriori probability estimates. Studying in depth this approach and comparing different minimax deviation classifiers (decision trees, SVMs, RBF networks, feedforward networks and committee machines) together with different probability calibration methods appears as a challenging issue to be explored in future work. 120 M INIMAX R EGRET C LASSIFIER 6. Experimental Results In this section, we first present the neural network architecture used in the experiments and illustrate the proposed minimax deviation strategy on an artificial data set. Then, we apply it to several realworld classification problems. Moreover, a comparison with other proposals such as the traditional minimax and the common re-balancing approach is carried out. 6.1 Softmax-based Network Although our algorithms can be applied to any classifier architecture, we have chosen a neural network based on the softmax non-linearity with soft decisions given by Mi yi = ∑ yi j , j=1 with yi j = exp(wTj x + wi j0 ) i , Mk ∑L−1 ∑l=1 exp(wT x + wkl0 ) k=0 kl where L stands for the number of classes, M j the number of softmax outputs used to compute y j and wi j are weight vectors. We will refer to this network as a Generalized Softmax Perceptron(GSP). 1 A simple network with M j = 2 is used in the experiments. x1 wj,k y1,1 y1,... x2 x3 y1 y1,M1 Class i ... SOFTMAX ... HARD DECISION n inputs / outputs ... yL,1 xd yL,ML yL,... yL Figure 5: GSP(Generalized Softmax Perceptron) Network Fig. 5 corresponds to the neural network architecture used to classify the samples represented by feature vector x. Learning consists of estimating network parameters w by means of the stochastic gradient minimization of certain objective functions. In the experiments, we have considered the Cross Entropy objective function given by L CE(y, d) = − ∑ di log yi . i=1 The stochastic gradient learning rule for the GSP network is given by Eq. (18). Learning step µ(0) decreases according to µ(k) = 1+k/η , where k is the iteration number, µ(0) the initial learning rate and η a decay factor. µ(k) 1. Note that the GSP is similar to a two layer MLP with a single layer of weights and with coupled saturation function (softmax), instead of sigmoidal units. 121 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I The reason to illustrate this approach with a feedforward architecture is that, as mentioned in Section 5.2, it allows to exploit (in the iterative learning process) the partially optimized solution in current iteration for the next one. On the other hand, posterior probability estimation makes it possible to apply the adaptive strategy based on prior re-estimation proposed by Saerens to the minimax deviation classifier, as long as a data set representative of the operation conditions is available. Finally, the fact that intermediate outputs yi j of the GSP can be interpreted as subclass probabilities may provide quite a natural way to cope with the unexplored problem of uncertainty in subclass distributions as already pointed out by Webb and Ting (2005). Nonetheless, both architecture and cost function issues are not the goal of this paper, but merely illustrative tools. 6.2 Artificial Data Set To illustrate the minimax regret approach proposed in this paper both under complete and partial uncertainty, an artificial data set with two classes (class u0 and class u1 ) has been created. Data examples are drawn from the normal distribution p(x|d = ui ) = N(mi , σ2 ) with mean mi and standard i √ deviation σi . Mean values were set to m0 = 0, m1 = 2 and standard deviation to σ0 = σ1 = 2. A total of 4000 instances were generated with prior probabilities of class membership P{d = u 0 } = 0.93 c00 c01 2 5 and P{d = u1 } = 0.07. The cost-benefit matrix is given by . c10 c11 4 0 Initial learning rate was set to µ(0) = 0.3, decay factor to η = 2000 and training was ended after 80 cycles. Classifier assessment was carried out by following 10-fold cross-validation. Two classifiers were trained, to be called a standard classifier and a minMaxDev classifier. The former is built by considering that the estimated class prior information is precise and stationary and the latter is the approach proposed in this paper to cope with uncertainty in priors. Thus, for the standard classifier, its performance may deviate from the optimal risk in 3.39 when priors change from training to test conditions. However, a minimax deviation classifier reduces this worst-case difference from the optimal classifier to 0.77. Now, we suppose that some information about priors is available (partial uncertainty). For instance, we consider that the lower bound for prior probabilities P0 and P1 are known and set to P0l = 0.55 and P1l = 0.05, respectively, so that the uncertainty region is Γ = {(P0 , P1 )|P0 ∈ [0.55, 0.95], P1 ∈ [0.05, 0.45]}. A minimax deviation classifier can be derived for Γ (it will be called Γ-minMaxDev classifier).The narrower Γ is, the closer the minimax deviation classifier performance is to the optimal. For this particular case, under partially imprecise priors, the standard classifier may differ from optimal (in Γ) in 0.83, while the use of the simple minMaxDev classifier designed under total prior uncertainty conditions attains a maximum deviation of 0.53. However, the Γ-minMaxDev classifier only differs from optimal in 0.24. These data are reported in Table 1 where both, experimental and also theoretical results, are shown. 6.3 Real Databases In this section we report experimental results obtained with several publicly available data sets. From the UCI repository (Blake and Merz, 1998) the following benchmarks: German Credits, Australian Credits, Insurance Company, DNA slice-junction, Page-blocks, Dermatology and Pen-digits. 122 M INIMAX R EGRET C LASSIFIER Standard Th/Exp Maximum deviation from optimal (complete uncertainty) Maximum deviation from optimal in Γ (partial uncertainty) Classifier minMaxDev Γ-minMaxDev Th/Exp Th/Exp 3.41/3.39 0.72/0.77 – 0.85/0.83 0.50/0.53 0.19/0.24 Table 1: A comparison between the standard classifier (build under stationary prior assumptions), the minimax deviation classifier (minMaxDev) and the minimax deviation classifier under partial uncertainty (Γ-minMaxDev) for an artificial data set Database German Credits (GCRE) Australian Credits (AUS) Munich Credits (MCRE) Insurance Company (COIL) DNA Slice-junction (DNA) Page-blocks (PAG) Dermatology (DER) Pen-digits (PEN) # Classes 2 2 2 2 3 5 6 10 Class distribution [0.70 0.30] [0.32 0.68] [0.30 0.70] [0.94 0.06] [0.24 0.24 0.52] [0.90 0.06 0.01 0.01 0.02] [0.31 0.16 0.20 0.13 0.14 0.06] [0.104 0.104 0.104 0.096 0.104 0.096 0.096 0.104 0.096 0.096] # Attributes 8 14 20 85 180 10 34 16 # Instances 1000 690 1000 9822 3186 5473 366 10992 Table 2: Experimental Data sets Other public data set used is Munich Credits from the Dept. of Statistics at the University of Munich.2 Data set description is summarized in Table 2, and cost-benefit matrices are shown in Table 3. We have used the cost values that appear in Ikizler (2002) for those data sets in common. Otherwise, for lack of an expert analyst, the cost values have been chosen by hand. 2. Data sets available at http://www.stat.uni-muenchen.de/service/datenarchiv/welcome e.html. Insurance Company 0 1 German, Australian, Munich Credits −1 0 0 −17 Page-Blocks −1 2 2 2 2 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 −4 2 3 4 3 4 3 −3 3 5 1 5 5 0 Dermatology 3 2 3 2 −8 4 5 −10 4 3 5 4 2 1 4 5 −6 5 2 3 5 2 3 −10 −1 2 2 DNA 2 −1 2 Pendigits ci j = 0 1 Table 3: Cost-Benefit matrices for the experimental Data sets 123 3 3 0 if i = j Otherwise A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Standard Maximum Risk Deviation from the optimal classifier Re-balanced Minimax Deviation Minimax minMaxDev minMax GCRE 0.70 0.80 (0.55 0.60) 0.99 ACRE 1.00 1.00 (0.76 0.86) 1.00 MCRE 0.91 0.77 (0.54 0.59) 0.99 COIL 2.78 0.99 (0.87 0.92) 16.32 DNA 0.34 0.53 (0.30 0.27 0.25) PAG 0.62 0.26 (0.13 0.13 0.20 0.16 0.16) DER 1.03 1.28 (0.67 0.78 0.51 0.48 0.54 PEN 0.061 0.059 (0.024 0.028 0.025 0.019 1.14 0.023 0.021 0.026 0.022 0.86 0.60) 0.023 0.029) 7.62 0.029 Table 4: Classifier Performance evaluated as Maximum Risk Deviation from the optimal classifier for several real-world applications. Class-conditional risk deviations (R i − cii ) reported for the minMaxDev classifier. Experimental results for these data sets are shown in the following sections. The robustness of different decision machines under complete uncertainty of prior probabilities is analyzed in Section 6.3.1. If uncertainty is only partial, a similar study and comparison with the previous approach (complete uncertainty) is carried out in Section 6.3.2. 6.3.1 C LASSIFIER ROBUSTNESS U NDER C OMPLETE U NCERTAINTY We now study how different neural-based classifiers cope with worst-case situations in prior probabilities. The maximum deviation from the optimal classifier (see Table 4) is reported for the proposed minMaxDev strategy as well as for other alternative approaches: the one based on the assumption of stationary priors (standard) and the common alternative of deriving the classifier from an equally distributed data set (re-balanced). A comparison with the traditional minimax strategy is also provided. Together with the previously mentioned value (maximum deviation or regret), deviation for the L class-conditional extreme cases (Ri − cii ) is also reported for the minMaxDev classifier in Table 4. Results allow to verify that this solution is fairly close to the optimal one where deviation is not dependent on priors and thus, class-conditional deviations take the same value. Although the balanced class distribution to train the classifier can be obtained by means of undersampling and/or oversampling, it is simulated by altering the learning rate used in the training 1/L phase according to (19) as µi = µ (0) , where 1/L represents the simulated probability, equal for Pi all classes. Results evidence that the assumption of stationary priors may lead to significant deviations from the optimal decision rule under “unexpected”, but rather realistic, prior changes. This deviation may reach up to three times more than the robust minimax deviation strategy. Thus, for classification problems like Page-blocks the maximum deviation from the optimal classifier is 0.62 for the 124 M INIMAX R EGRET C LASSIFIER Standard Maximum Risk Re-balanced Minimax Deviation minMaxDev Minimax minMax GCRE 0.70 0.15 0.60 0.00 ACRE 0.01 0.02 0.86 -0.00 MCRE 0.05 0.20 0.59 0.00 COIL 0.76 0.99 0.86 0.02 DNA 0.34 0.53 0.25 0.13 PAG 0.62 0.26 0.20 0.10 DER -2.10 -1.68 -2.21 -2.38 PEN 0.061 0.059 0.029 0.029 Table 5: Classifier Performance measured as Maximum Risk for several real-world applications. standard classifier while this reduces to 0.20 for the minMaxDev one. Likewise, for the Insurance company(COIL) application the maximum deviation for the standard classifier is 2.78 compared with 0.92 for the minMaxDev model. The remaining databases also show the same behavior as it is presented in Table 4. On the other hand, the use of a classifier inferred from a re-balanced data set does not necessarily involve a decrease in the maximum deviation with respect to the standard classifier. In the same way, the traditional minimax classifier does not protect against prior changes in terms of maximum relative deviation from the minimum risk classifier. However, if our criterion is more conservative and our aim is the minimization of the maximum possible risk (not the minimization of the deviation), the traditional minimax classifier represents the best option. It is shown in Table 5 where the maximum risk for the different classifiers is reported. Positive values in this table indicate a cost while negative values represent a benefit. For instance, for the Page-blocks application the minimax classifier assures a maximum risk of 0.10 while the standard, re-balanced and minMaxDev classifiers reach values of 0.62, 0.26 and 0.20, respectively. It can be noticed that for the Pen-digits data set, the minimax deviation and minimax approaches attain the same results. The reason is that, for this problem, the R basis plane takes the same value (in this case, zero) in the probability space. 6.3.2 C LASSIFIER ROBUSTNESS UNDER PARTIAL U NCERTAINTY Unlike the previous section, we consider now that partial information about the class priors is available. The aim is to find a classifier that behaves well for a delimited and realistic range of priors what constitutes an aid in reducing the maximum deviation from the optimal classifier. This situation can be treated as a constrained minimax regret strategy where the constraints represent any extra information about prior probability value. Experimental results for several situations of partial prior uncertainty are presented in this section. We consider that lower bounds for the prior probabilities are available (see Table 6). In order to get the Γ-minMaxDev classifier, the risk for the different vertex of the uncertainty domain needs to be calculated. With them, the basis risk RΓ over which deviations are measured is derived. basis 125 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Lower bound for prior probabilities Data Set P0l P1l GCRE 0.40 0.25 ACRE 0.20 0.25 MCRE 0.20 0.25 COIL 0.15 P2l P3l P4l P5l 0.03 DNA 0.10 0.10 0.22 0.02 0.00 0.01 0.1 0.20 0.10 0.10 0.10 0.10 0.06 0.06 0.10 0.10 0.06 P9l 0.06 0.10 0.05 0.05 0.02 PEN P8l 0.02 DER P7l 0.25 PAG P6l Table 6: Lower bounds for prior probabilities defining the uncertainty region, Γ region for the experimental data sets. Maximum Risk Deviation in the uncertainty region Standard Minimax Deviation Minimax Deviation with restriction minMaxDev Γ-minMaxDev GCRE 0.24 0.19 (0.10 0.09) ACRE 0.03 0.64 (0.03 0.03) MCRE 0.22 0.38 (0.13 0.10) COIL 2.33 0.77 (0.17 0.11) DNA 0.14 0.08 (0.07 0.07 0.06) PAG 0.37 0.15 (0.10 0.08 0.08 0.05 0.04) DER 0.08 0.05 (0.03 0.03 0.04 0.02 0.05 PEN 0.013 0.007 (0.003 0.001 0.003 0.000 0.001 0.001 0.000 0.003 0.05) 0.001 0.001) Table 7: Classifier Performance under partial knowledge of prior probabilities measured as Maximum Risk Deviation for several real-world applications. Class-conditional risk deviations (RΓ − cΓ ) are reported for the Γ-minMaxDev classifier. i ii Maximum deviation from the optimal in Γ is reported for the Γ-minMaxDev classifier together with the standard and the minMaxDev ones. For instance, the standard classifier for the Pageblocks data set deviates from the optimal classifier, in the defined uncertainty region, up to 0.37, while when complete uncertainty is assumed the maximum deviation is equal to 0.62. In the same way, reducing the uncertainty also means a reduction in the maximum deviation for minMaxDev classifier (trained without considering this partial knowledge). Thus, for Γ, this classifier assures a deviation bound of 0.15. However, taking into account this partial information to train a Γ-minMaxDev classifier allows to reduce the deviation for the worst-case conditions to 0.10. It can be seen the same behavior for the other databases in Table 7. 126 M INIMAX R EGRET C LASSIFIER 7. Conclusions This work concerns the design of robust neural-based classifiers when the prior probabilities of the classes are partially or completely unknown, even by the end user. This problem of uncertainty in the class priors is often ignored in supervised classification, even though it is a widespread situation in real world applications. As a result, the reliability of the inducted classifier can be greatly affected as previously shown by the experiments. To tackle this problem, we have proposed a novel minimax deviation strategy with the goal to minimize the maximum deviation with respect to the optimal classifier. A neural network training algorithm based on learning rate scaling has been developed. The experimental results show that this minimax deviation (minMaxDev) classifier protects against prior changes while other approaches like ignoring this uncertainty or use a balanced learning data set may result in large differences in performance with respect to the minimum risk classifier. Also, it has been shown that the conventional minimax classifier reduces the maximum possible risk following a conservative attitude but at the expense of large worst-case differences from the optimal classifier. Furthermore, a constrained minimax deviation approach (Γ-minMaxDev) has been derived for those situations where uncertainty is only partial. This may be seen as a general approach with some particular cases: a) precise knowledge of prior probabilities and b) complete uncertainty about the priors. In a) the region of uncertainty collapses to a point and we have the Bayes’ rule of minimum risk and in b) the pure minimax deviation strategy comes up. While the first one may be criticized for being quite unrealistic, the other may be seen rather pessimistic. The experimental results for this proposed intermediate situation show that the Γ-minMaxDev classifier allows to reduce the maximum deviation from the optimal and performs well over a range of prior probabilities. Acknowledgments The authors thank the four referees and the associate editor for their helpful comments. This work was partially supported by the project TEC2005-06766-C03-02 from the Spanish Ministry of Education and Science. References N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11, 2004. N. M. Adams and D. J. Hand. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32(7):1139–1147, March 1998. R. Alaiz-Rodriguez, A. Guerrero-Curieses, and J. Cid-Sueiro. Minimax classifiers based on neural networks. Pattern Recognition, 38(1):29–39, January 2005. R. Barandela, J. S. Sanchez, V. Garc´a, and E. Rangel. Strategies for learning in class imbalance ı problems. Pattern Recognition, 36(3):849–851, March 2003. J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, second edition, 1985. 127 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/ mlearn/MLRepository.html. URL L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, NY, 1984. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. J. Cid-Sueiro and A. R. Figueiras-Vidal. On the structure of strict sense Bayesian cost functions and its applications. IEEE Transactions on Neural Networks, 12(3):445–455, May 2001. C. Drummond and R. C. Holte. Explicitly representing expected cost: An alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 198–207. ACM Press, 2000. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2001. Y. C. Eldar and N. Merhav. Minimax approach to robust estimation of random parameters. IEEE Trans. on Signal Processing, 52(7):1931–1946, July 2004. Y. C. Eldar, A. Ben-Tal, and A. Nemirovski. Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. IEEE Trans. on Signal Processing, 52(8):2177– 2188, August 2004. M. Feder and N. Merhav. Universal composite hypothesis testing: A competitive minimax approach. IEEE Trans. on Information Theory, 48(6):1504–1517, June 2002. A. Guerrero-Curieses, R. Alaiz-Rodriguez, and J. Cid-Sueiro. A fixed-point algorithm to minimax learning with neural networks. IEEE Transactions on Systems, Man and Cybernetics Part C, 34 (4):383–392, November 2004. ¨ H. A. G¨ venir, N. Emeksiz, N. Ikizler, and N. Ormeci. Diagnosis of gastric carcinoma by classifiu cation on feature projections. Artificial Intelligence in Medicine, 31(3), 2004. N. Ikizler. Benefit maximizing classification using feature intervals. Technical Report BU-CE-0208, Bilkent University, Ankara, Turkey, 2002. N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis Journal, 6(5):429–450, November 2002. M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of Fifth International Conference on SIG Knowledge Discovery and Data Mining (SIGKDD), pages 367–371, San Diego, CA, 1999. H. J. Kim. On a constrained optimal rule for classification with unknown prior individual group membership. Journal of Multivariate Analysis, 59(2):166–186, November 1996. M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings 14th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997. 128 M INIMAX R EGRET C LASSIFIER M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2/3):195–215, 1998. S. Lawrence, I. Burns, A. D. Back, A. C. Tsoi, and C. L. Giles. Neural network classification and ¨ unequal prior class probabilities. In G. Orr, K.-R. Muller, and R. Caruana, editors, Tricks of the Trade, Lecture Notes in Computer Science State-of-the-Art Surveys, pages 299–314. Springer Verlag, 1998. T. K. Moon and W. C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, 2000. A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In ICML ’05: Proceedings of the 22nd International Conference on Machine learning, pages 625– 632, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-180-5. E. Polak. Optimization: Algorithms and Consistent Approximations. Springer, 1997. F. Provost. Learning with imbalanced data sets 101. In Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets. AAAI Press. Technical Report WS-00-05, 2000. F. Provost and T. Fawcett. Robust classification systems for imprecise environments. Machine Learning, 42(3):203–231, March 2001. M. Saerens, P. Latinne, and C. Decaestecker. Adjusting a classifier for new a priori probabilities: A simple procedure. Neural Computation, 14:21–41, January 2002. E. Takimoto and M. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings 13th Annual Conference on Computational Learning Theory, pages 100–106. Morgan Kaufmann, San Francisco, 2000. K. M. Ting. A study of the effect of class distribution using cost-sensitive learning. In Proceedings of the Fifth International Conference on Discovery Science, pages 98–112. Berlin: Springer-Verlag, 2002. H. L. VanTrees. Detection, Estimation and Modulation Theory. John Wiley and Sons, 1968. G. I. Webb and K. M. Ting. On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58(1):25–32, 2005. W. Wei, T. K. Leen, and E. Barnard. A fast histogram-based postprocessor that improves posterior probability estimates. Neural Computation, 11(5):1235 – 1248, July 1999. B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204–213. ACM Press, 2001. B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Eighth International Conference on Knowledge Discovery and Data Mining, 2002. 129 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the third IEEE International Conference on Data Mining, pages 435–442, 2003. Z. H. Zhou and X. Y. LiuJ. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, January 2006. 130
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we propose a minimax regret (minimax deviation) approach, that seeks to minimize the maximum deviation from the performance of the optimal risk classifier. [sent-15, score-1.116]
2 A neural-based minimax regret classifier for general multi-class decision problems is presented. [sent-16, score-0.806]
3 Keywords: classification, imprecise class distribution, minimax regret, minimax deviation, neural networks 1. [sent-18, score-1.284]
4 2) it is important to highlight those that handle the problem of uncertainty in priors by following a robust minimax principle: minimize the maximum possible risk. [sent-30, score-0.889]
5 Analytic foundations of minimax classification are widely considered in the literature (see VanTrees, 1968; Moon and Stirling, 2000; Duda et al. [sent-31, score-0.614]
6 , 2001, for instance) and a few algorithms to carry out minimax decisions have been proposed. [sent-32, score-0.642]
7 Nevertheless, following a pure minimax strategy can lead to solutions where minimizing the maximum loss implies considering there are no potential clients. [sent-39, score-0.664]
8 If it is the case, this minimax approach does not seem to be suitable for this kind of situation. [sent-40, score-0.614]
9 We seek for a system as robust as the conventional minimax approach but less pessimistic at the same time. [sent-43, score-0.65]
10 We will refer to it as a minimax deviation (or minimax regret) classifier. [sent-44, score-1.416]
11 (2004), minimax approaches can be used in combination with the adaptive proposal by Saerens et al. [sent-47, score-0.614]
12 This minimax regret approach has recently been applied in the context of parameter estimation (Eldar et al. [sent-49, score-0.765]
13 We propose a simple learning rate scaling algorithm in order to train a neural-based minimax deviation classifier. [sent-52, score-0.802]
14 Next, Section 3 states the fundamentals of minimax classification together with a deeper analysis of the minimax regret approach proposed in this paper. [sent-55, score-1.379]
15 Section 4 presents a neural training algorithm to get a neural-based minimax regret classifier under complete uncertainty. [sent-56, score-0.765]
16 In Section 6, some experimental results show that minimax regret classifiers outperform (in terms of maximum risk deviation) classifiers trained on re-balanced data sets and those with the originally assumed priors. [sent-59, score-0.928]
17 Defining the conditional risk of misclassifying samples from class u j as L−1 Rj = ∑ ci j P{d = ui |d = u j } , i=0 we can express risk (1) as L−1 R= ∑ Ri Pi . [sent-87, score-0.347]
18 (2) i=0 It is well-known that the Bayes decision rule for the minimum risk is given by L−1 d = arg min { ∑ ci j P{d = u j |x}} , ui (3) j=0 where P{d = ui |x} is the a posteriori probability of class i given sample x. [sent-88, score-0.353]
19 The optimal decision rule depends on posterior probabilities and therefore, on the prior probabilities and the likelihood. [sent-89, score-0.33]
20 Furthermore, absolute values of decision costs are relevant to the design of classifiers under the minimax regret approach. [sent-97, score-0.894]
21 Robust Classifiers Under Prior Uncertainty: Minimax Classifiers Prior probability uncertainty can be coped from a robust point of view following a minimax derived strategy. [sent-130, score-0.759]
22 Minimax regret criterion is discussed in this section after presenting the conventional minimax criterion. [sent-131, score-0.765]
23 (3) shows, the minimum risk decisions depend on the misclassification costs, c i j , and the posterior class probabilities and, thus, they depend on the prior probabilities, Pi . [sent-135, score-0.378]
24 Minimum risk curve and performance under prior changes for the standard, minimax and minimax deviation classifier. [sent-141, score-1.642]
25 RF (Q1mM , P1 ) denotes the Risk of a minimax classifier (Fixed decision rule optimized for the minimax probabilities Q1mM ) against P1 . [sent-144, score-1.367]
26 RF (Q1mMd , P1 ) denotes the Risk of a minimax deviation classifier (Fixed decision rule optimized for the minimax deviation probabilities Q 1mMd ) against P1 . [sent-145, score-1.743]
27 For instance, assume that Q = (Q0 , Q1 ) is the vector with class-0 and class-1 prior probabilities estimated in the training phase, respectively, and let RB (Q1 ) represent the minimum (Bayes) risk attainable by any decision rule for these priors. [sent-147, score-0.39]
28 Also, it can be shown (see VanTrees, 1968, for instance) that the minimum risk curve obtained for each prior is convex and the risk function of a given classifier verifies R F (Q1 , P1 ) ≥ RB (P1 ) with a tangent point at P1 = Q1 . [sent-154, score-0.385]
29 1 shows the performance of the minimax classifier, which minimizes the maximum possible risk under the least favorable priors, thus providing the most robust solution, in the sense that performance becomes independent from priors. [sent-156, score-0.872]
30 1, it becomes clear that the minimax classifier is optimal for prior probabilities P = QmM = (Q0mM , Q1mM ) maximizing RB . [sent-158, score-0.781]
31 We will refer to them as the minimax probabilities. [sent-161, score-0.614]
32 1 also makes clear that although a minimax classifier is a robust solution to address the imprecision in priors, it may become a somewhat pessimistic approach. [sent-163, score-0.674]
33 In the following, we will refer to it as the minimax deviation or minimax regret classifier. [sent-166, score-1.567]
34 A comparison between minimax and minimax deviation approaches is also shown in Fig. [sent-167, score-1.416]
35 This latter case corresponds to a classifier trained on prior probabilities P = Q mMd with performance as a function of priors given by a line (a plane or hyperplane for three or more classes, respectively) parallel to what we name, in the following, basis risk (Rbasis = c00 (1 − P1 ) + c11 P1 ). [sent-169, score-0.402]
36 Therefore, the minimax regret classifier is also the Bayes solution with respect to the least favorable priors (Q0mMd , Q1mMd ) (see Berger, 1985, for instance), which will be denoted as minimax deviation probabilities. [sent-173, score-1.705]
37 Definition 1 Consider a L-class decision problem with costs ci j , 0 ≤ i, j < L and c j j ≤ ci j , and let Rw (P) be the risk of a decision machine with parameter vector w when prior class probabilities are given by P = (P0 , . [sent-175, score-0.527]
38 The deviation function is defined as Dw (P) = Rw (P) − RB (P) and the minimax deviation is defined as DmMd = inf max{Dw (P)} . [sent-179, score-1.034]
39 2 shows that the minimax regret can be computed as the minimax deviation to this ideal classifier. [sent-199, score-1.567]
40 (5) becomes equivalent (up to a constant) to the Bayes risk and the minimax regret classifier becomes equivalent to the minimax classifier . [sent-201, score-1.513]
41 110 M INIMAX R EGRET C LASSIFIER Theorem 3 Consider the minimum deviation function given by Dmin (P) = inf{Dw (P)} , (9) w where Dw (P) is the normalized deviation function given by Eq. [sent-204, score-0.401]
42 (10) If Dmin (P) is continuously differentiable at P = P∗ , then the minimax deviation, DmMd , defined by Eq. [sent-206, score-0.614]
43 w P (12) Therefore, Dmin (P∗ ) is a lower bound of the minimax regret. [sent-209, score-0.614]
44 If the minimum deviation function is not continuously differentiable at the minimax deviation probability, P∗ , the theorem cannot be applied. [sent-229, score-1.015]
45 The reason is that, although there should exist at least one classifier providing the minimum deviation at P = P∗ , it or they could not provide a constant deviation with respect to the prior probability. [sent-230, score-0.515]
46 An abrupt slope change is observed at the minimax deviation probability, for P{d = 1} = 1/2. [sent-242, score-0.802]
47 The minimax deviation classifier in Φ λ , which has a threshold λ0 , does not attain minimum risk at the minimax deviation probability and, thus, cannot be obtained by using Eq. [sent-247, score-1.763]
48 λ1 and λ2 are the thresholds providing the minimum risk at the minimax deviation probability. [sent-268, score-0.983]
49 Note that, in order to build the non-deterministic classifier providing a constant deviation, a set of L independent classifiers that are optimal at the minimax deviation prior should be found. [sent-276, score-0.916]
50 In our simulations, we have found that the maximum of the minimum risk deviation always provided a response which is approximately parallel to Rbasis . [sent-278, score-0.376]
51 Although the least favorable prior providing the minimax deviation can be computed in closed form for some simple distributions, in general, it must be computed numerically. [sent-281, score-0.953]
52 Thresholds λ1 and λ2 do not provide the minimax deviation classifier, which is obtained for threshold λ0 . [sent-294, score-0.802]
53 Next, we propose a training algorithm in order to get a minimax regret classifier based on neural networks. [sent-297, score-0.765]
54 w Therefore, the minimax deviation classifier can be estimated by training a classifier using prior in QmMd . [sent-301, score-0.894]
55 For this reason, QmMd will be called the minimax deviation prior (or least favorable prior). [sent-302, score-0.931]
56 Our proposed algorithms are based on an iterative process of estimating parameters w based on an estimate of the minimax deviation prior, and re-estimating prior based on an estimate of network weights. [sent-303, score-0.921]
57 , L − 1, according to (21) and (22) (n+1) (n+1) Update minimax probability Pi , i = 0, . [sent-333, score-0.614]
58 Now, the aim is to design a classifier that minimizes the maximum regret from the minimum risk only inside the uncertainty region. [sent-345, score-0.448]
59 4(a), which shows that reducing the uncertainty in priors allows to reduce deviation from the optimal classifier. [sent-347, score-0.398]
60 This minimax regret approach for the uncertainty region Γ is often called Γ-minimax regret. [sent-348, score-0.894]
61 As discussed before, the minimax deviation solution gives a Bayes solution with respect some priors denoted in the partial uncertainty case as QΓ mMd in Fig. [sent-349, score-1.044]
62 5 0P1l RB (P) + ψ(P) Minimax Deviation with Restriction Risk RΓ 0 QΓ 1mMd 1 − P0l 1 P1 0P1l 1 − P0l 1 P1 (b) (a) Figure 4: Minimax deviation classifier under partial uncertainty of prior probabilities: (a)Γ-minMaxDev Classifier. [sent-355, score-0.421]
63 In contrast to the minimax regret criterion, note that a classical minimax classifier for the considered uncertainty region would minimize the worst-case risk. [sent-357, score-1.508]
64 However, with partial knowledge of the prior probabilities, this hyperplane becomes defined by the risk in L points which are the vertex given by the restrictions and with associated risk denoted by c Γj . [sent-368, score-0.392]
65 j Defining 1 l(Pi ) = , (26) 1 + exp−τ(Pi −Pil ) where τ controls the hardness of this restriction, the minimax problem can be re-formulated as arg max inf {DΓ (P)} w P s. [sent-369, score-0.679]
66 The minimax constrained optimization problem has been tackled by considering a new objective function defined by the sum of the original cost function and a barrier function. [sent-385, score-0.662]
67 2 Extension to Other Learning Algorithms The learning algorithm proposed in this paper is intended to train a minimax deviation classifier based on neural networks with feedforward architecture. [sent-392, score-0.802]
68 However, it is also applicable to a general classifier provided it is trained (in an iterative process) for the estimated minimax deviation probabilities and the assumed decision costs. [sent-394, score-0.918]
69 In this case, an iterative procedure able to estimate the minimax deviation probabilities and consequently to adjust (without re-training) the outputs of the classifier could be studied. [sent-401, score-0.877]
70 The general idea for this approach is as follows: first, the new minimax deviation prior probabilities are estimated according to (25) and then, posterior probabilities provided by the model are adjusted as follows (see Saerens et al. [sent-402, score-1.068]
71 , L − 1, according to (21), (22) and decision costs c i j (k+1) Update minimax probability Pi according to (25) Adjust classifier outputs according to (27) end for The effectiveness of this method relies on the accuracy of the initial a posteriori probability estimates. [sent-407, score-0.743]
72 Studying in depth this approach and comparing different minimax deviation classifiers (decision trees, SVMs, RBF networks, feedforward networks and committee machines) together with different probability calibration methods appears as a challenging issue to be explored in future work. [sent-408, score-0.802]
73 Experimental Results In this section, we first present the neural network architecture used in the experiments and illustrate the proposed minimax deviation strategy on an artificial data set. [sent-410, score-0.87]
74 Moreover, a comparison with other proposals such as the traditional minimax and the common re-balancing approach is carried out. [sent-412, score-0.633]
75 On the other hand, posterior probability estimation makes it possible to apply the adaptive strategy based on prior re-estimation proposed by Saerens to the minimax deviation classifier, as long as a data set representative of the operation conditions is available. [sent-443, score-0.939]
76 2 Artificial Data Set To illustrate the minimax regret approach proposed in this paper both under complete and partial uncertainty, an artificial data set with two classes (class u0 and class u1 ) has been created. [sent-447, score-0.797]
77 However, a minimax deviation classifier reduces this worst-case difference from the optimal classifier to 0. [sent-461, score-0.802]
78 A minimax deviation classifier can be derived for Γ (it will be called Γ-minMaxDev classifier). [sent-471, score-0.802]
79 The narrower Γ is, the closer the minimax deviation classifier performance is to the optimal. [sent-472, score-0.802]
80 83, while the use of the simple minMaxDev classifier designed under total prior uncertainty conditions attains a maximum deviation of 0. [sent-474, score-0.418]
81 122 M INIMAX R EGRET C LASSIFIER Standard Th/Exp Maximum deviation from optimal (complete uncertainty) Maximum deviation from optimal in Γ (partial uncertainty) Classifier minMaxDev Γ-minMaxDev Th/Exp Th/Exp 3. [sent-482, score-0.376]
82 The robustness of different decision machines under complete uncertainty of prior probabilities is analyzed in Section 6. [sent-594, score-0.317]
83 A comparison with the traditional minimax strategy is also provided. [sent-604, score-0.635]
84 Together with the previously mentioned value (maximum deviation or regret), deviation for the L class-conditional extreme cases (Ri − cii ) is also reported for the minMaxDev classifier in Table 4. [sent-605, score-0.44]
85 This deviation may reach up to three times more than the robust minimax deviation strategy. [sent-609, score-1.026]
86 In the same way, the traditional minimax classifier does not protect against prior changes in terms of maximum relative deviation from the minimum risk classifier. [sent-651, score-1.082]
87 However, if our criterion is more conservative and our aim is the minimization of the maximum possible risk (not the minimization of the deviation), the traditional minimax classifier represents the best option. [sent-652, score-0.777]
88 For instance, for the Page-blocks application the minimax classifier assures a maximum risk of 0. [sent-655, score-0.777]
89 It can be noticed that for the Pen-digits data set, the minimax deviation and minimax approaches attain the same results. [sent-660, score-1.416]
90 The aim is to find a classifier that behaves well for a delimited and realistic range of priors what constitutes an aid in reducing the maximum deviation from the optimal classifier. [sent-665, score-0.318]
91 This situation can be treated as a constrained minimax regret strategy where the constraints represent any extra information about prior probability value. [sent-666, score-0.878]
92 37, while when complete uncertainty is assumed the maximum deviation is equal to 0. [sent-756, score-0.326]
93 In the same way, reducing the uncertainty also means a reduction in the maximum deviation for minMaxDev classifier (trained without considering this partial knowledge). [sent-758, score-0.358]
94 To tackle this problem, we have proposed a novel minimax deviation strategy with the goal to minimize the maximum deviation with respect to the optimal classifier. [sent-768, score-1.04]
95 The experimental results show that this minimax deviation (minMaxDev) classifier protects against prior changes while other approaches like ignoring this uncertainty or use a balanced learning data set may result in large differences in performance with respect to the minimum risk classifier. [sent-770, score-1.162]
96 Also, it has been shown that the conventional minimax classifier reduces the maximum possible risk following a conservative attitude but at the expense of large worst-case differences from the optimal classifier. [sent-771, score-0.777]
97 Furthermore, a constrained minimax deviation approach (Γ-minMaxDev) has been derived for those situations where uncertainty is only partial. [sent-772, score-0.911]
98 In a) the region of uncertainty collapses to a point and we have the Bayes’ rule of minimum risk and in b) the pure minimax deviation strategy comes up. [sent-774, score-1.134]
99 Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. [sent-877, score-0.765]
100 A fixed-point algorithm to minimax learning with neural networks. [sent-890, score-0.614]
wordName wordTfidf (topN-words)
[('minimax', 0.614), ('dw', 0.274), ('deviation', 0.188), ('pi', 0.177), ('dmin', 0.169), ('er', 0.152), ('regret', 0.151), ('minmaxdev', 0.151), ('risk', 0.134), ('laiz', 0.132), ('ueiro', 0.132), ('uerrero', 0.132), ('urieses', 0.132), ('rb', 0.127), ('guez', 0.127), ('inimax', 0.122), ('pj', 0.119), ('lassifier', 0.119), ('uncertainty', 0.109), ('egret', 0.104), ('rbasis', 0.104), ('priors', 0.101), ('prior', 0.092), ('costs', 0.088), ('probabilities', 0.075), ('classi', 0.073), ('credits', 0.066), ('rdi', 0.066), ('rf', 0.064), ('cii', 0.064), ('dk', 0.059), ('zadrozny', 0.059), ('id', 0.057), ('coil', 0.056), ('dmmd', 0.056), ('imprecise', 0.056), ('bi', 0.056), ('rw', 0.052), ('ui', 0.051), ('barrier', 0.048), ('gcre', 0.047), ('gsp', 0.047), ('mcre', 0.047), ('pag', 0.047), ('qmmd', 0.047), ('saerens', 0.047), ('ul', 0.046), ('dna', 0.045), ('inf', 0.044), ('softmax', 0.043), ('decision', 0.041), ('acre', 0.038), ('eldar', 0.038), ('pil', 0.038), ('vantrees', 0.038), ('favorable', 0.037), ('ers', 0.037), ('robust', 0.036), ('pen', 0.036), ('yk', 0.036), ('universidad', 0.032), ('partial', 0.032), ('ri', 0.032), ('maximum', 0.029), ('oversampling', 0.028), ('kubat', 0.028), ('alicia', 0.028), ('munich', 0.028), ('decisions', 0.028), ('wk', 0.028), ('roc', 0.028), ('ci', 0.028), ('network', 0.027), ('der', 0.027), ('minimum', 0.025), ('posterior', 0.024), ('di', 0.024), ('insurance', 0.024), ('imbalanced', 0.024), ('qk', 0.024), ('imprecision', 0.024), ('dermatology', 0.024), ('rule', 0.023), ('misclassi', 0.023), ('bayes', 0.023), ('providing', 0.022), ('provost', 0.021), ('moon', 0.021), ('max', 0.021), ('strategy', 0.021), ('region', 0.02), ('imbalance', 0.02), ('architecture', 0.02), ('carried', 0.019), ('company', 0.019), ('barandela', 0.019), ('buyer', 0.019), ('carcinoma', 0.019), ('comunicaciones', 0.019), ('feder', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
Author: Rocío Alaiz-Rodríguez, Alicia Guerrero-Curieses, Jesús Cid-Sueiro
Abstract: The design of a minimum risk classifier based on data usually stems from the stationarity assumption that the conditions during training and test are the same: the misclassification costs assumed during training must be in agreement with real costs, and the same statistical process must have generated both training and test data. Unfortunately, in real world applications, these assumptions may not hold. This paper deals with the problem of training a classifier when prior probabilities cannot be reliably induced from training data. Some strategies based on optimizing the worst possible case (conventional minimax) have been proposed previously in the literature, but they may achieve a robust classification at the expense of a severe performance degradation. In this paper we propose a minimax regret (minimax deviation) approach, that seeks to minimize the maximum deviation from the performance of the optimal risk classifier. A neural-based minimax regret classifier for general multi-class decision problems is presented. Experimental results show its robustness and the advantages in relation to other approaches. Keywords: classification, imprecise class distribution, minimax regret, minimax deviation, neural networks 1. Introduction - Problem Motivation In the general framework of learning from examples and specifically when dealing with uncertainty, the robustness of the decision machine becomes a key issue. Most machine learning algorithms are based on the assumption that the classifier will use data drawn from the same distribution as the training data set. Unfortunately, for most practical applications (such as remote sensing, direct marketing, fraud detection, information filtering, medical diagnosis or intrusion detection) the target class distribution may not be accurately known during learning: for example, because the cost of labelling data may be class-dependent or the prior probabilities are non-stationary. Therefore, the data used to design the classifier (within the Bayesian context (see VanTrees, 1968), the c 2007 Roc´o Alaiz-Rodr´guez, Alicia Guerrero-Curieses and Jesus Cid-Sueiro. ´ ı ı A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I prior probabilities and the misclassification costs) may be non representative of the underlying real distributions. If the ratio of training data corresponding to each class is not in agreement with real class distributions, designing Bayes decision rules based on prior probabilities estimated from these data will be suboptimal and can seriously affect the reliability and performance of the classifier. A similar problem may arise if real misclassification costs are unknown during training. However, they are usually known by the end user, who can adapt the classifier decision rules to cost changes without re-training the classifier. For this reason, our attention in this paper is mainly focused on the problem of uncertainty in prior probabilities. Furthermore, being aware that class distribution is seldom known (at least totally) in real world applications, a robust approach (as opposite to adaptive) that prevents severe performance degradation appears to be convenient for these situations. Besides other adaptive and robust approaches that address this problem (discussed in more detail in Section 2.2) it is important to highlight those that handle the problem of uncertainty in priors by following a robust minimax principle: minimize the maximum possible risk. Analytic foundations of minimax classification are widely considered in the literature (see VanTrees, 1968; Moon and Stirling, 2000; Duda et al., 2001, for instance) and a few algorithms to carry out minimax decisions have been proposed. From computationally expensive ones such as estimating probability density functions (Takimoto and Warmuth, 2000; Kim, 1996) or using methods from optimization (Polak, 1997) to simpler ones like neural network training algorithms (Guerrero-Curieses et al., 2004; AlaizRodriguez et al., 2005). Minimax classifiers may, however, be seen as over-conservative since its goal is to optimize the performance under the least favorable conditions. Consider, for instance, a direct marketing campaign application carried out in order to maximize profits. Since optimal decisions rely on the proportion of potential buyers and it is usually unknown in advance, our classification system should take into account this uncertainty. Nevertheless, following a pure minimax strategy can lead to solutions where minimizing the maximum loss implies considering there are no potential clients. If it is the case, this minimax approach does not seem to be suitable for this kind of situation. In this imprecise class distribution scenario, it can be noticed that the classifier performance may be highly deviated from the optimal, that is, that of the classifier knowing actual priors. Minimizing this gap (that is, the maximum possible deviation with respect to the optimal classifier) is the focus of this paper. We seek for a system as robust as the conventional minimax approach but less pessimistic at the same time. We will refer to it as a minimax deviation (or minimax regret) classifier. In contrast to other robust and adaptive approaches, it can be used in general multiclass problems. Furthermore, as shown in Guerrero-Curieses et al. (2004), minimax approaches can be used in combination with the adaptive proposal by Saerens et al. (2002) to exploit its advantages. This minimax regret approach has recently been applied in the context of parameter estimation (Eldar et al., 2004; Eldar and Merhav, 2004) and a similar competitive strategy has been used in the context of hypothesis testing (Feder and Merhav, 2002). Under prior uncertainty, our solution provides an upper bound of the performance divergence from the optimal classifier. We propose a simple learning rate scaling algorithm in order to train a neural-based minimax deviation classifier. Although training can be based on minimizing any objective function, we have chosen objective functions that provide estimates of the posterior probabilities (see Cid-Sueiro and Figueiras-Vidal, 2001, for more details). 104 M INIMAX R EGRET C LASSIFIER This paper is organized as follows: the next section provides an overview of the problem as well as some previous approaches to cope with it. Next, Section 3 states the fundamentals of minimax classification together with a deeper analysis of the minimax regret approach proposed in this paper. Section 4 presents a neural training algorithm to get a neural-based minimax regret classifier under complete uncertainty. Moreover, practical situations with partial uncertainty in priors are also discussed. A learning algorithm to solve them is provided in Section 5. In Section 6, some experimental results show that minimax regret classifiers outperform (in terms of maximum risk deviation) classifiers trained on re-balanced data sets and those with the originally assumed priors. Finally, the main conclusions are summarized in Section 7. 2. Problem Overview Traditionally, supervised learning lies in the fact that training data and real data come from the same (although unknown) statistical model. In order to carefully analyze to what extend classifier performance depends on conditions such as class distribution or decision costs, learning and decision theory principles are briefly revisited. Next, some previous approaches to deal with environment imprecision are reviewed. 2.1 Learning and Making Optimal Decisions Let S = {(xk , dk ), k = 1, . . . , K} denote a set of labelled samples where xk ∈ RN is an observation feature vector and dk ∈ UL = {u0 , . . . , uL−1 } is the label vector. Class-i label ui is a unit L-dimensional vector with components ui, j = δi j , with every component equal to 0, except the i-th component which is equal to 1. We assume a learning process that estimates parameters w of a non-linear mapping f w : RN → P from the input space into probability space P = {p ∈ [0, 1]L | ∑L−1 pi = 1}. The soft decision is given i=0 by yk = fw (xk ) ∈ P and the hard output of the classifier is denoted by d. Note that d and d will be used to distinguish the actual class from the predicted one, respectively. Several costs (or benefits) associated with each possible decision are also defined: c i j denotes the cost of deciding in favor of class i when the true class is j. Negative values represent benefits (for instance, cii , which is the cost of correctly classifying a sample from class i could be negative in some practical cases). In general cost-sensitive classification problems, either misclassification costs c i j or cii costs can take different values for each class. Thus, there are many applications where classification errors lead to very different consequences (medical diagnosis, fault detection, credit risk analysis), what implies misclassification costs ci j that may largely vary between them. In the same way, there are also many domains where correct decision costs (or benefits) c ii do not take the same value. For instance, in targeted marketing applications (Zadrozny and Elkan, 2001), correctly identifying a buyer implies some benefit while correctly classifying a non buyer means no income. The same ¨ applies to medical diagnosis domains such as the gastric carcinoma problem studied in G uvenir et al. (2004). In this case, the benefit of correct classification also depends on the class: the benefit of correctly classifying an early stage tumor is higher than that of a later stage. The expected risk (or loss) R is given by L−1 L−1 R = ∑ ∑ ci j P{d = ui |d = u j }Pj j=0 i=0 105 , (1) A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I where P{d = ui |d = u j } with i = j represent conditional error probabilities, and P j = P{d = u j } is the prior probability of class u j . Defining the conditional risk of misclassifying samples from class u j as L−1 Rj = ∑ ci j P{d = ui |d = u j } , i=0 we can express risk (1) as L−1 R= ∑ Ri Pi . (2) i=0 It is well-known that the Bayes decision rule for the minimum risk is given by L−1 d = arg min { ∑ ci j P{d = u j |x}} , ui (3) j=0 where P{d = ui |x} is the a posteriori probability of class i given sample x. The optimal decision rule depends on posterior probabilities and therefore, on the prior probabilities and the likelihood. In theory, as long as posterior probabilities (or likelihood and prior probabilities) are known, the optimal decision in Eq. (3) can be expressed after a trivial manipulation as a function of the cost differences between the costs (ci j − c j j ) (Duda et al., 2001). This is the reason why c j j is usually assumed to be zero and the value of the cost difference is directly assigned to c i j . When dealing ¨ with practical applications, however, some authors (Zadrozny and Elkan, 2001; G uvenir et al., 2004) have urged to use meaningful decision costs measured over a common baseline (and not necessarily taking c j j = 0) in order to avoid mistakes that otherwise could be overlooked. For this reason and, what is more important, the uncertainty class distribution problem addressed in this paper, decision costs measured over a common baseline are considered. Furthermore, absolute values of decision costs are relevant to the design of classifiers under the minimax regret approach. 2.2 Related Work: Dealing with Cost and Prior Uncertainty Most proposals to address uncertainty in priors fall into the categories of adaptive and robust solutions. While the aim of a robust solution is to avoid a classifier with very poor performance under any conditions, an adaptive system pursues to fit the classifier parameters using more incoming data or more precise information. With an adaptive-oriented principle, Provost (2000) states that, once the classifier is trained under specific class distributions and cost assumptions (not necessarily the operating conditions), the selection of the optimal classifier for specific conditions is carried out by a correct placement of the decision thresholds. In the same way, the approaches in Kelly et al. (1999) and Kubat et al. (1998) consider that tuning the classifier parameters should be left to the end user, expecting that class distributions and misclassification costs will be precisely known then. Some graphical methods based on the ROC curve have been proposed in Adams and Hand (1998) and Provost and Fawcett (2001) in order to compare the classifier performance under imprecise class distributions and/or misclassification costs. The ROC convex hull method presented in Provost and Fawcett (2001) (or the alternative representation proposed in Drummond and Holte (2000)) allows the user to select potentially optimal classifiers, providing a flexible way to select 106 M INIMAX R EGRET C LASSIFIER them when precise information about priors or costs is available. Under imprecision, some classifiers can be discarded but this does not necessarily provide a method to select the optimal classifier between the possible ones and fit its parameters. Furthermore, due to its graphical character, these methods are limited to binary classification problems. Changes in prior probabilities have also been discussed by Saerens et al. (2002), who proposes a method based on re-estimating the prior probabilities of real data in an unsupervised way and subsequently adjusting the outputs of the classifier according to the new a priori probabilities. Obviously, the method requires enough unlabelled data being available for re-estimation. As an alternative to adaptive schemes, several robust solutions have been proposed, as the resampling methods, especially in domains where imbalanced classes come out (Kubat and Matwin, 1997; Lawrence et al., 1998; Chawla et al., 2002; Barandela et al., 2003). Either by undersampling or oversampling, the common purpose is to balance artificially the training data set in order to get a uniform class distribution, which is supposed to be the least biased towards any class and, thus, the most robust against changes in class distributions. The same approach is followed in cost sensitive domains, but with some subtle differences in practice. It is well known that class priors and decision costs are intrinsically related. For instance, different decision costs can be simulated by altering the priors and vice versa (see Ting, 2002, for instance). Thus, when a uniform distribution is desired in a cost sensitive domain, but working with cost insensitive decision machines, class priors are altered according to decision costs, what is commonly referred as rebalancing. The manipulation of the training data distribution has been applied to cost-sensitive learning in two-class problems (Breiman et al., 1984) in the following way: basically, the class with higher misclassification cost (suppose n times the lowest misclassification cost) is represented with n times more examples than the other class. Besides random sampling strategies, other sampling-based rebalancing schemes have been proposed to accomplish this task, like those considering closeness to the boundaries between classes (Japkowicz and Stephen, 2002; Zhou and LiuJ, 2006) or the costproportionate rejection sampling presented in Zadrozny et al. (2003). Extending the formulation of this type of procedures to general multiclass problems with multiple (and possibly asymmetric) inter-class misclassification costs appears to be a nontrivial task (Zadrozny et al., 2003; Zhou and LiuJ, 2006), but some progress has been made recently with regard to this latter point (Abe et al., 2004). Note, also, that many (although not all) of these rebalancing strategies are usually implemented by oversampling and/or subsampling, that is, replicating examples (without adding any extra information) and/or deleting them (which implies information loss). 3. Robust Classifiers Under Prior Uncertainty: Minimax Classifiers Prior probability uncertainty can be coped from a robust point of view following a minimax derived strategy. Minimax regret criterion is discussed in this section after presenting the conventional minimax criterion. Although our approach extends to general multi-class problems and the discussion is carried out in that way, we will first illustrate, for the sake of clarity and simplicity, a binary situation. 3.1 Minimax Classifiers As Eq. (3) shows, the minimum risk decisions depend on the misclassification costs, c i j , and the posterior class probabilities and, thus, they depend on the prior probabilities, Pi . Different prior 107 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I PSfrag replacements distributions (frequency for each class) give rise to different Bayes classifiers. Fig. 1 shows the Bayes risk curve, RB (P1 ) versus class-1 prior probability for a binary classification problem. Standard RF (Q1 , P1 ) R0 Minimax RF (Q1mM , P1 ) Risk c00 Minimax Deviation RF (Q1mMd , P1 ) Rbasis R1 RB (P1 ) c11 0 Q1 Q1mM 1 Q1mMd P1 Figure 1: Risk vs. P1 . Minimum risk curve and performance under prior changes for the standard, minimax and minimax deviation classifier. RB (P1 ) stands for the optimal Bayes Risk against P1 . RF (Q1 , P1 ) denotes the Risk of a standard classifier (Fixed decision rule optimized for prior probabilities Q1 estimated in the training phase) against P1 . RF (Q1mM , P1 ) denotes the Risk of a minimax classifier (Fixed decision rule optimized for the minimax probabilities Q1mM ) against P1 . RF (Q1mMd , P1 ) denotes the Risk of a minimax deviation classifier (Fixed decision rule optimized for the minimax deviation probabilities Q 1mMd ) against P1 . If the prior probability distribution is unknown when the classifier is designed, or this distribution changes with time or from one environment to other, the mismatch between training and test conditions can degrade significantly the classifier performance. For instance, assume that Q = (Q0 , Q1 ) is the vector with class-0 and class-1 prior probabilities estimated in the training phase, respectively, and let RB (Q1 ) represent the minimum (Bayes) risk attainable by any decision rule for these priors. Note, that, according to Eq. (2), for a given classifier, the risk is a linear function of priors. Thus, risk RF (Q1 , P1 ) associated to the (fixed) classifier optimized for Q changes linearly with actual prior probabilities P1 and P0 = 1 − P1 , going from (0, R0 ) to (1, R1 ) (the continuous line in Fig. 1), where R0 and R1 refer to the class conditional risks for classes 0 and 1, respectively. Fig. 1 shows the impact of this change in priors and how performance deviates from optimal. Also, it can be shown (see VanTrees, 1968, for instance) that the minimum risk curve obtained for each prior is convex and the risk function of a given classifier verifies R F (Q1 , P1 ) ≥ RB (P1 ) with a tangent point at P1 = Q1 . 108 M INIMAX R EGRET C LASSIFIER The dashed line in Fig. 1 shows the performance of the minimax classifier, which minimizes the maximum possible risk under the least favorable priors, thus providing the most robust solution, in the sense that performance becomes independent from priors. From Fig. 1, it becomes clear that the minimax classifier is optimal for prior probabilities P = QmM = (Q0mM , Q1mM ) maximizing RB . Thus, this strategy is equivalent to maximizing the minimum risk (Moon and Stirling, 2000; Duda et al., 2001). We will refer to them as the minimax probabilities. Fig. 1 also makes clear that although a minimax classifier is a robust solution to address the imprecision in priors, it may become a somewhat pessimistic approach. 3.2 Minimax Deviation Classifiers We propose an alternative classifier that, instead of minimizing the maximum risk, minimizes the maximum deviation (regret) from the optimal Bayes classifier. In the following, we will refer to it as the minimax deviation or minimax regret classifier. A comparison between minimax and minimax deviation approaches is also shown in Fig. 1. This latter case corresponds to a classifier trained on prior probabilities P = Q mMd with performance as a function of priors given by a line (a plane or hyperplane for three or more classes, respectively) parallel to what we name, in the following, basis risk (Rbasis = c00 (1 − P1 ) + c11 P1 ). Note that the maximum deviation (with respect to priors) of the classifier optimized for Q is given by D(Q) = max {RF (Q1 , P1 ) − RB (P1 )} = max {R0 − c00 , R1 − c11 } . P1 The inspection of Fig. 1 shows that the minimum of D (with respect to Q) is achieved when R0 − c00 = R1 − c11 , which means that line RF (Q1 , P1 ) is parallel to arc named Rbasis in the figure and tangent to RB at Q1mMd . Therefore, the minimax regret classifier is also the Bayes solution with respect to the least favorable priors (Q0mMd , Q1mMd ) (see Berger, 1985, for instance), which will be denoted as minimax deviation probabilities. Now, we extend the formulation to a general L-class problem. Definition 1 Consider a L-class decision problem with costs ci j , 0 ≤ i, j < L and c j j ≤ ci j , and let Rw (P) be the risk of a decision machine with parameter vector w when prior class probabilities are given by P = (P0 , . . . , PL−1 ). The deviation function is defined as Dw (P) = Rw (P) − RB (P) and the minimax deviation is defined as DmMd = inf max{Dw (P)} . w P (4) Note that the above definition assumes that the maximum exists. This is actually the case, since Dw (P) is a linear function over a compact set, P . Note, also, that our definition includes the natural assumption that c j j is never higher than ci j , meaning that making a decision error is always less costly than taking the correct decision. This assumption is used in part of our theoretical analysis. The algorithms proposed in this paper are based on the fact that the minimax deviation can be computed without knowing RB 109 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Theorem 2 The minimax deviation is given by DmMd = inf max{Dw (P)} , w P where Dw (P) = Rw (P) − Rbasis (P) and (5) L−1 ∑ c j j Pj Rbasis (P) = . (6) j=0 Proof Note that, according to Eqs. (1) and (2), for any decision machine and any u i ∈ UL , L−1 R(u j ) = R j = ∑ ci j P{d = ui |d = u j } ≥ c j j . i=0 Since the bound is reached by the classifier deciding d = u j for any observation x, we have RB (u j ) = c j j . Therefore, using Eq. (6), we find that, for any u ∈ UL , RB (u) = Rbasis (u) and, thus, Dw (u) = Dw (u) . Since Bayes minimum risk RB (P) is a convex function of priors and Rw (P) is linear, Dw (P) is concave and, thus, it is maximum at some of the vertices in P (i.e., at some P = u ∈ U L ). Thus, max{Dw (P)} = max {Dw (u)} . u∈UL P (7) Since the maximum difference between two hyperplanes defined over P is always at some vertex, we can conclude that max{Dw (P)} = max {Dw (u)} = max {Dw (u)} . P u∈UL u∈UL (8) Combining Eqs. (4), (7) and (8), we get DmMd = inf max{Dw (P)} . w P Note that Rbasis represents the risk baseline of the ideal classifier with zero errors. Th. 2 shows that the minimax regret can be computed as the minimax deviation to this ideal classifier. Note, also, that if costs cii do not depend on i, Eq. (5) becomes equivalent (up to a constant) to the Bayes risk and the minimax regret classifier becomes equivalent to the minimax classifier . Another important result for the algorithms proposed in this paper is that, under some conditions on the minimum risk, the minimum and maximum operators can be permuted. Although general results on the permutability of minimum and maximum operators can be found in the literature (see Polak, 1997, for instance), we provide here the proof for the specific case interesting to this paper. 110 M INIMAX R EGRET C LASSIFIER Theorem 3 Consider the minimum deviation function given by Dmin (P) = inf{Dw (P)} , (9) w where Dw (P) is the normalized deviation function given by Eq. (5), and let P ∗ be the prior probability vector providing the maximum deviation, P∗ = arg max Dmin (P) P . (10) If Dmin (P) is continuously differentiable at P = P∗ , then the minimax deviation, DmMd , defined by Eq. (4), is DmMd = Dmin (P∗ ) = max inf Dw (P) . (11) P w Proof For any classifier with parameter vector w, we can write, max Dw (P) ≥ Dw (P∗ ) ≥ Dmin (P∗ ) P and, thus, inf max Dw (P) ≥ Dmin (P∗ ) . w P (12) Therefore, Dmin (P∗ ) is a lower bound of the minimax regret. Now we prove that Dmin (P∗ ) is also an upper bound. According to Eq. (9), for any ε > 0, there exists a parameter vector wε such that Dwε (P∗ ) ≤ Dmin (P∗ ) + ε . (13) By definition, for any P, Dmin (P) ≤ Dwε (P). Therefore, using Eq. (13), we can write Dwε (P∗ ) − Dwε (P) ≤ Dmin (P∗ ) − Dmin (P) + ε . (14) Since Dmin (P) is continuously differentiable and (according to Eq. (10)) maximum at P ∗ , for any ε > 0 there exists δ > 0 such that, for any P ∈ P with P∗ − P ≤ δ we have Dmin (P∗ ) − Dmin (P) ≤ ε P∗ − P ≤ ε δ . (15) Let Pδ a prior such that P∗ − Pδ = δ. Taking ε = ε δ and combining Eqs. (14) and (15) we can write Dwε (P∗ ) − Dwε (Pδ ) ≤ 2ε δ . Since the above condition is verified for any ε > 0 and any prior Pδ at distance δ from P, and taking into account that Dwε (P) is a linear function of P, we conclude that the maximum slope of D wε (P) is bounded by 2ε and, thus, for any P ∈ P , we have √ Dwε (P) − Dwε (P∗ ) ≤ 2ε P − P∗ ≤ 2 2ε , √ (where we have used the fact that the maximum distance between two probability vectors is 2). Therefore, we can write √ max Dwε (P) ≤ Dwε (P∗ ) + 2 2ε P 111 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I and, thus, √ inf max Dw (P) ≤ Dwε (P∗ ) + 2 2ε . w P √ Finally, using Eq. (13) and taking into account that ε = ε δ ≤ 2ε we get √ inf max Dw (P) ≤ Dmin (P∗ ) + 3 2ε . w P (16) Since the above is true for any ε > 0 we conclude that Dmin (P∗ ) is also an upper bound of Dw . Therefore, combining Eqs. (12) and (16), we conclude that inf max Dw (P) = Dmin (P∗ ) , w P which completes the proof. Note that the deviation function needs to be neither differentiable nor a continuous function of w parameters. If the minimum deviation function is not continuously differentiable at the minimax deviation probability, P∗ , the theorem cannot be applied. The reason is that, although there should exist at least one classifier providing the minimum deviation at P = P∗ , it or they could not provide a constant deviation with respect to the prior probability. The situation can be illustrated with an example. Let x ∈ R be given by p(x|d = 0) = 0.8N(x, σ) + 0.2N(x − 2, σ) and p(x|d = 1) = 0.2N(x − 1, σ) + 0.8N(x − 3, σ), where σ = 0.5 and N(x, σ) = (2πσ)−1/2 exp(−x2 /(2σ2 )), and consider the set Φλ of classifiers given by a single threshold over x and decision dˆ = 1 if x ≥ λ 0 if x < λ. Fig. 2 shows the distribution of both classes over x, and Fig. 3 shows, as a function of priors, the minimum error probability (continuous line) that can be obtained using classifiers in Φ λ . Note that decision costs c00 = c11 = 0 and c01 = c10 = 1 have been considered for this illustrative problem. An abrupt slope change is observed at the minimax deviation probability, for P{d = 1} = 1/2. For this prior, there are two single threshold classifiers providing the minimum error probability, which are given by thresholds λ1 and λ2 in Fig. 2. However, as shown in Fig. 3 neither of them provides a risk that is constant in the prior. The minimax deviation classifier in Φ λ , which has a threshold λ0 , does not attain minimum risk at the minimax deviation probability and, thus, cannot be obtained by using Eq. (11). For this example, the desired robust classifier should have a deviation function given by the horizontal dotted line in Fig. 3. Fortunately, it can be obtained by combining the outputs of several classifiers. For instance, let dˆ1 and dˆ2 the decisions of classifiers given by thresholds λ1 and λ2 , respectively. It is not difficult to see that the classifier selecting dˆ1 and dˆ2 at random (for each input sample x) provides a robust classifier. This procedure can be extended to the multiclass-case: consider a set of L classifiers with parameters wk , k = 0, . . . , L − 1, and consider the classifier such that, for any input sample x, makes a decision equal to dk (i.e., the decision of classifier with parameters wk ), with probability qk . It is not difficult to show that the deviation function of this classifier is given by L−1 D(P) = L−1 j=0 k=0 ∑ Pj ∑ qk D j (wk ) 112 , M INIMAX R EGRET C LASSIFIER 0.7 0.6 Likelihoods 0.5 0.4 0.3 0.2 0.1 λ 0 −2 λ −1 0 λ 0 1 1 2 2 3 4 5 x Figure 2: The conditional data distributions for the one-dimensional example discussed in the text. λ1 and λ2 are the thresholds providing the minimum risk at the minimax deviation probability. λ0 provides the minimax deviation classifier. where D j (wk ) = R j (wk ) − c j j . In order to get a constant deviation function, probabilities q k should be chosen in such a way that L−1 ∑ qk D j (wk ) = D , k=0 where D is a constant. Solving these linear equations for q k , k = 0, . . . , L − 1 (with the constraint ∑k qk = 1), the required probabilities can be found. Note that, in order to build the non-deterministic classifier providing a constant deviation, a set of L independent classifiers that are optimal at the minimax deviation prior should be found. However, we go no further on the investigation of this special case for two main reasons: • The situation does not seem to be common in practice. In our simulations, we have found that the maximum of the minimum risk deviation always provided a response which is approximately parallel to Rbasis . • In general, the abrupt change in the derivative may be a symptom that the classifier structure is not optimal for the data distribution. Instead of building a nondeterministic classifier, increasing the classifier complexity should be more efficient. Although the least favorable prior providing the minimax deviation can be computed in closed form for some simple distributions, in general, it must be computed numerically. Moreover, we assume here that the data distribution is not known, and must be learned from examples. Thus, 113 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 0.25 0.2 λ Error probability 0 0.15 λ λ 1 2 0.1 0.05 0 0 0.2 0.4 0.6 0.8 1 P{ d=1} Figure 3: Error probabilities as a function of prior probability of class 1 for the example in Fig. 2. Thresholds λ1 and λ2 do not provide the minimax deviation classifier, which is obtained for threshold λ0 . However, the random combination of classifiers with thresholds λ 1 and λ2 (dotted line) provides a robust classifier with deviation lower than that of λ 0 . we must incorporate the estimation of the least favorable prior in the learning process. Next, we propose a training algorithm in order to get a minimax regret classifier based on neural networks. 4. Neural Robust Classifiers Under Complete Uncertainty Note that, if QmMd is the probability vector providing the maximum in Eq. (11), that is, QmMd = arg max inf{Dw (P)} w P , then we can write DmMd = inf{Dw (QmMd )} . w Therefore, the minimax deviation classifier can be estimated by training a classifier using prior in QmMd . For this reason, QmMd will be called the minimax deviation prior (or least favorable prior). Our proposed algorithms are based on an iterative process of estimating parameters w based on an estimate of the minimax deviation prior, and re-estimating prior based on an estimate of network weights. This is shown in the following. 114 M INIMAX R EGRET C LASSIFIER 4.1 Updating Network Weights Learning is based on minimizing some empirical estimate of the overall error function L−1 L−1 i=0 E{C(y, d)} = i=0 ∑ P{d = ui }E{C(y, d)|d = ui } = ∑ PiCi , where C(y, d) may be any error function and Ci is the expected conditional error for class-i. Selecting the appropriate error function (see Cid-Sueiro and Figueiras-Vidal, 2001, for instance), learning rules can be designed providing a posteriori probability estimates (y i ≈ P{d = ui |x}, where yi is the soft decision) and, thus, according to Eq. (3), the hard decision minimizing the risk can be approximated by L−1 d = arg min { ∑ ci j y j } . i j=0 The overall empirical error function (cost function) used in learning for priors P = (P0 , . . . , PL−1 ) may be written as L−1 C = ∑ PiCi = L−1 i=0 = = 1 K L−1 i=0 1 K k ∑ d C(yk , dk ), Ki k=1 i Pi K k ∑ d C(yk , dk ) Ki /K k=1 i ∑ i=0 1 K ∑ K k=1 ∑ Pi L−1 ∑ Pi d kC(yk , dk ) (0) i i=0 Pi , , (17) (0) where Pi = Ki /K is an initial estimate of class-i prior based on class frequencies in the training set and Pi is the current prior estimate. Minimizing error function (17) by means of a stochastic gradient descent learning rule leads to update the network weights at k-th iteration as w (k+1) = w (k) (n) L−1 −µ = w(k) − Pi i=0 Pi ∑ L−1 d k ∇ C(yk , dk ) (0) i w ∑ µi (n) k di , ∇wC(yk , dk ) , (18) i=0 where (n) (n) µi = µ Pi (19) (0) Pi (n) is a learning step scaled by the prior ratio. Note that di selects the appropriate µi according to the pattern class membership. The classifier is trained without altering the original training data set (0) class distribution Pi and therefore, without missing or duplicating information. 115 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 4.2 Updating Prior Probabilities Eq. (11) shows that the learning process should maximize (5) with respect to the prior probabilities. The estimate of (5) can be computed as ¯ Dw (P) = Rw (P) − Rbasis (P) , where (20) L−1 ∑ R j Pj (21) 1 L−1 ∑ ci j Ni j N j i=0 (22) Rw (P) = j=0 is the overall Bayes risk estimate and Rj = is the class- j conditional risk estimate where N j is the number of class u j patterns in the training phase and Ni j is the number of samples from class u j assigned to ui . L−1 In order to derive a learning rule to find an estimate Pi satisfying constraints ∑i=0 Pi = 1 and 0 ≤ Pi ≤ 1, we will use auxiliary variables Bi such that Pi = exp(Bi ) L−1 ∑ j=0 exp(B j ) . (23) ¯ We maximize Dw with respect to Bi . Applying the chain rule, ¯ ¯ ∂Dw L−1 ∂Dw ∂Pj =∑ , ∂Bi j=0 ∂Pj ∂Bi and using Eqs. (20), (21) and (23), we get ¯ ∂D w ∂Bi L−1 = ∑ (R j − c j j )Pi (δi j − Pj ), j=0 L−1 L−1 j=0 j=0 = Pi Ri − cii − ∑ (R j Pj ) + ∑ (c j j Pj ) , = Pi Ri − cii − Rw − Rbasis , = Pi Rdi , where Rdi = (Ri − cii ) − (Rw − Rbasis ) . The learning rule for auxiliary variable Bi is (n) Bi (n+1) = Bi + ρ (n) ∂D w , ∂Bi (n) (n) = Bi + ρPi Rdi , 116 (24) M INIMAX R EGRET C LASSIFIER where parameter ρ > 0 controls the rate of convergence. Using Eq. (23) and Eq. (24), the updated learning rule for Pi is (n) (n+1) Pi = (n) (n) (n) exp ρPj Rd j ∑L−1 exp B j j=0 (n) = (n) (n) exp(Bi ) exp ρPi Rdi , (n) (n) Pi exp ρPi Rdi (n) (n) (n) ∑L−1 Pj exp ρPj Rd j j=0 . (25) 4.3 Training Algorithm for a Minimax Deviation Classifier In the previous section, both the network weights updating rule (18) and the prior probability update rule (25) have been derived. The algorithm resulting from the combination is shown as follows: for n = 0 to Niterations − 1 do for k = 1 to K do w(k+1) = w(k) − L−1 ∑ µi (n) k di ∇wC(yk , dk ) i=0 end for (n) Estimate R(n) , Ri , i = 0, . . . , L − 1, according to (21) and (22) (n+1) (n+1) Update minimax probability Pi , i = 0, . . . , L − 1 according to (25) and compute µi with (19) end for 5. Robust Classifiers Under Partial Uncertainty Although in many practical situations prior probabilities may not be specified with precision, they can be partially known. In this section we discuss how partial information about priors can be used to improve the classifier performance in relation to a complete uncertainty situation. From now on, let us consider that lower (or upper) bounds of the priors are known based on previous experience. We will denote the lower and upper bounds of class-i prior probability as Pil and Piu , respectively. In order to illustrate this situation consider a binary classification problem where probability lower bounds P0l and P1l are known. That is, P1 ∈ [P1l , 1 − P0l ] where this interval represents the uncertainty region. Let us denote by Γ = {P : 0 ≤ Pi ≤ 1, ∑L−1 Pi = 1, Pi ≥ Pil } the probability region i=0 satisfying the imposed constraints. In the following, we will refer to Γ as the uncertainty region. Now, the aim is to design a classifier that minimizes the maximum regret from the minimum risk only inside the uncertainty region. This is depicted in Fig. 4(a), which shows that reducing the uncertainty in priors allows to reduce deviation from the optimal classifier. This minimax regret approach for the uncertainty region Γ is often called Γ-minimax regret. As discussed before, the minimax deviation solution gives a Bayes solution with respect some priors denoted in the partial uncertainty case as QΓ mMd in Fig. 4(a), which is the least favorable distribution according to the regret criterion. 117 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I cΓ 00 RΓ basis RΓ 1 cΓ 11 PSfrag replacements Risk 0 0.5 1 1.5 2 2.5 3 3.5 0P1l RB (P) + ψ(P) Minimax Deviation with Restriction Risk RΓ 0 QΓ 1mMd 1 − P0l 1 P1 0P1l 1 − P0l 1 P1 (b) (a) Figure 4: Minimax deviation classifier under partial uncertainty of prior probabilities: (a)Γ-minMaxDev Classifier. (b) Modified cost function defined as R B (P) + ψ(P). In contrast to the minimax regret criterion, note that a classical minimax classifier for the considered uncertainty region would minimize the worst-case risk. It would be a Bayes solution for the prior where the minimum risk reaches its maximum and it could be denoted as Q Γ . mM Notice, also, that these solutions will be the same if the risk for the vertex of Γ take the same value (cΓ = k). ii 5.1 Neural Robust Classifiers Under Partial Uncertainty Minimax search can be formulated as maximizing (with respect to priors) the minimum (with respect to network parameters) of deviation function (5), as described in previous section, but subject to some constraints arg max inf {DΓ (P)} , w w P Pi ≥ Pil , i = 0, . . . , L − 1 s.t. where DΓ = RΓ − RΓ . When uncertainty is global, this hyperplane is defined by the risk in the L w w basis extreme cases with Pi = δik , that is, by the corresponding cii . However, with partial knowledge of the prior probabilities, this hyperplane becomes defined by the risk in L points which are the vertex given by the restrictions and with associated risk denoted by c Γj . j Defining 1 l(Pi ) = , (26) 1 + exp−τ(Pi −Pil ) where τ controls the hardness of this restriction, the minimax problem can be re-formulated as arg max inf {DΓ (P)} w P s.t. w l(Pi ) ≥ 1/2, i = 0, . . . , L − 1. Thus, this constrained optimization problem can be solved as a non-constrained problem by considering an auxiliary function that incorporates the restriction as a barrier function 118 M INIMAX R EGRET C LASSIFIER arg max inf {DΓ (P) + Aψ(P)} , w w P where ψ(Pi ) = log(l(Pi )) and the constant A determines the contribution of the barrier function. Fig. 4(b) shows the new risk function corresponding to the binary case previously depicted in Fig. 4(a). Note that, it is the sum of the original RB (P) and the barrier function ψ(P). As in Section 4.1, in order to derive the network weight learning rule, we need to compute ∂ψ ∂Bi L−1 = ∂ψ ∂P j , j ∂Bi ∑ ∂P j=0 = τPi L−1 ∑ 1 − l(Pk ) (δik − Pk ), k=0 = τPi ψdi , where ψdi = ∑L−1 (1 − l(Pk ))(δik − Pk ) k=0 As τ increases, the constraints become harder around the specified bound. The update learning rule for the auxiliary variable Bi at cycle n is (n+1) Bi (n) Γ(n) (n) (n) = Bi (n) + ρPi Rdi + ρAτPi ψdi . And therefore, using (23), the update learning rule for Pi is (n) (n+1) Pi = (n) Γ(n) Pi exp ρPi Rdi L−1 ∑ (n) Pj exp (n) (n) exp ρAτPi ψdi (n) Γ(n) ρ P j Rd j . (n) (n) ρAτPj ψd j exp j=0 Note that if the upper bound is known instead of the lower bound, l(Pi ) defined by (26) should be replaced by u(Pi ) = (1 + exp(τ(Pi − Piu )))−1 at the previous formulation. The minimax constrained optimization problem has been tackled by considering a new objective function defined by the sum of the original cost function and a barrier function. Studying the convexity of this new function becomes important from the fact that a stationary point of this risk curve is a global maximum. Since the minimum risk curve (RB (P)) is a convex function of the priors (see VanTrees, 1968, for details), if we verify the convexity of the barrier function, we can conclude that the function defined by the sum of both of them is also convex. This barrier function is convex in P if the Hessian matrix HR verifies PT HR P ≤ 0 The Hessian matrix of the barrier function equals to a diagonal matrix D r = diag(r) with all negative diagonal entries ri = Aτ2 (−l(Pi )(1 − l(Pi ))). As l(Pi ) ∈ [0, 1] and therefore, ri ≤ 0, it is straightforward to see that PT HR P = PT Dr P, L−1 = ∑ Pi2 ri ≤ 0 . i=0 Since the barrier function is convex, the new objective function (defined by the sum of two convex functions) is also convex. 119 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 5.2 Extension to Other Learning Algorithms The learning algorithm proposed in this paper is intended to train a minimax deviation classifier based on neural networks with feedforward architecture. Actually, the learning algorithm we propose becomes a feasible solution for any learning process based on minimizing some empirical estimate of an overall cost (error) function. However, it is also applicable to a general classifier provided it is trained (in an iterative process) for the estimated minimax deviation probabilities and the assumed decision costs. Specifically, in this paper, scaling the learning rate allows to simulate different class distributions and the hard decisions are made based on posterior probability estimates and decision costs. Furthermore, the neural learning phase carried out in one iteration can be re-used for the next one, what allows to reduce computational cost with respect to a complete optimization process on each iteration. Apart from the general approach of completely training a classifier on each iteration and in order to reduce its computational cost, specific solutions may be studied for different learning machines. Nonetheless, it seems not feasible to readily achieve this improvement for classifiers like SVMs, where support vectors for one solution may have nothing in common with the ones obtained in next iteration and thus, making necessary to re-train the classifier in each iteration. Another possible solution for any classifier that provides a posteriori probabilities estimates or any score that can be converted into probabilities (for details on calibration methods see Wei et al., 1999; Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005) is outlined here. In this case, an iterative procedure able to estimate the minimax deviation probabilities and consequently to adjust (without re-training) the outputs of the classifier could be studied. The general idea for this approach is as follows: first, the new minimax deviation prior probabilities are estimated according to (25) and then, posterior probabilities provided by the model are adjusted as follows (see Saerens et al., 2002, for more details) (k) Pi P(k) {d = ui |x} = P(k−1) {d = ui |x} (k−1) Pi L−1 P(k) j P(k−1) {d = u j |x} (k−1) j=0 Pj . (27) ∑ The algorithm’s main structure is summarized as for k = 1 to K do (k) Estimate R(k) , Ri , i = 0, . . . , L − 1, according to (21), (22) and decision costs c i j (k+1) Update minimax probability Pi according to (25) Adjust classifier outputs according to (27) end for The effectiveness of this method relies on the accuracy of the initial a posteriori probability estimates. Studying in depth this approach and comparing different minimax deviation classifiers (decision trees, SVMs, RBF networks, feedforward networks and committee machines) together with different probability calibration methods appears as a challenging issue to be explored in future work. 120 M INIMAX R EGRET C LASSIFIER 6. Experimental Results In this section, we first present the neural network architecture used in the experiments and illustrate the proposed minimax deviation strategy on an artificial data set. Then, we apply it to several realworld classification problems. Moreover, a comparison with other proposals such as the traditional minimax and the common re-balancing approach is carried out. 6.1 Softmax-based Network Although our algorithms can be applied to any classifier architecture, we have chosen a neural network based on the softmax non-linearity with soft decisions given by Mi yi = ∑ yi j , j=1 with yi j = exp(wTj x + wi j0 ) i , Mk ∑L−1 ∑l=1 exp(wT x + wkl0 ) k=0 kl where L stands for the number of classes, M j the number of softmax outputs used to compute y j and wi j are weight vectors. We will refer to this network as a Generalized Softmax Perceptron(GSP). 1 A simple network with M j = 2 is used in the experiments. x1 wj,k y1,1 y1,... x2 x3 y1 y1,M1 Class i ... SOFTMAX ... HARD DECISION n inputs / outputs ... yL,1 xd yL,ML yL,... yL Figure 5: GSP(Generalized Softmax Perceptron) Network Fig. 5 corresponds to the neural network architecture used to classify the samples represented by feature vector x. Learning consists of estimating network parameters w by means of the stochastic gradient minimization of certain objective functions. In the experiments, we have considered the Cross Entropy objective function given by L CE(y, d) = − ∑ di log yi . i=1 The stochastic gradient learning rule for the GSP network is given by Eq. (18). Learning step µ(0) decreases according to µ(k) = 1+k/η , where k is the iteration number, µ(0) the initial learning rate and η a decay factor. µ(k) 1. Note that the GSP is similar to a two layer MLP with a single layer of weights and with coupled saturation function (softmax), instead of sigmoidal units. 121 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I The reason to illustrate this approach with a feedforward architecture is that, as mentioned in Section 5.2, it allows to exploit (in the iterative learning process) the partially optimized solution in current iteration for the next one. On the other hand, posterior probability estimation makes it possible to apply the adaptive strategy based on prior re-estimation proposed by Saerens to the minimax deviation classifier, as long as a data set representative of the operation conditions is available. Finally, the fact that intermediate outputs yi j of the GSP can be interpreted as subclass probabilities may provide quite a natural way to cope with the unexplored problem of uncertainty in subclass distributions as already pointed out by Webb and Ting (2005). Nonetheless, both architecture and cost function issues are not the goal of this paper, but merely illustrative tools. 6.2 Artificial Data Set To illustrate the minimax regret approach proposed in this paper both under complete and partial uncertainty, an artificial data set with two classes (class u0 and class u1 ) has been created. Data examples are drawn from the normal distribution p(x|d = ui ) = N(mi , σ2 ) with mean mi and standard i √ deviation σi . Mean values were set to m0 = 0, m1 = 2 and standard deviation to σ0 = σ1 = 2. A total of 4000 instances were generated with prior probabilities of class membership P{d = u 0 } = 0.93 c00 c01 2 5 and P{d = u1 } = 0.07. The cost-benefit matrix is given by . c10 c11 4 0 Initial learning rate was set to µ(0) = 0.3, decay factor to η = 2000 and training was ended after 80 cycles. Classifier assessment was carried out by following 10-fold cross-validation. Two classifiers were trained, to be called a standard classifier and a minMaxDev classifier. The former is built by considering that the estimated class prior information is precise and stationary and the latter is the approach proposed in this paper to cope with uncertainty in priors. Thus, for the standard classifier, its performance may deviate from the optimal risk in 3.39 when priors change from training to test conditions. However, a minimax deviation classifier reduces this worst-case difference from the optimal classifier to 0.77. Now, we suppose that some information about priors is available (partial uncertainty). For instance, we consider that the lower bound for prior probabilities P0 and P1 are known and set to P0l = 0.55 and P1l = 0.05, respectively, so that the uncertainty region is Γ = {(P0 , P1 )|P0 ∈ [0.55, 0.95], P1 ∈ [0.05, 0.45]}. A minimax deviation classifier can be derived for Γ (it will be called Γ-minMaxDev classifier).The narrower Γ is, the closer the minimax deviation classifier performance is to the optimal. For this particular case, under partially imprecise priors, the standard classifier may differ from optimal (in Γ) in 0.83, while the use of the simple minMaxDev classifier designed under total prior uncertainty conditions attains a maximum deviation of 0.53. However, the Γ-minMaxDev classifier only differs from optimal in 0.24. These data are reported in Table 1 where both, experimental and also theoretical results, are shown. 6.3 Real Databases In this section we report experimental results obtained with several publicly available data sets. From the UCI repository (Blake and Merz, 1998) the following benchmarks: German Credits, Australian Credits, Insurance Company, DNA slice-junction, Page-blocks, Dermatology and Pen-digits. 122 M INIMAX R EGRET C LASSIFIER Standard Th/Exp Maximum deviation from optimal (complete uncertainty) Maximum deviation from optimal in Γ (partial uncertainty) Classifier minMaxDev Γ-minMaxDev Th/Exp Th/Exp 3.41/3.39 0.72/0.77 – 0.85/0.83 0.50/0.53 0.19/0.24 Table 1: A comparison between the standard classifier (build under stationary prior assumptions), the minimax deviation classifier (minMaxDev) and the minimax deviation classifier under partial uncertainty (Γ-minMaxDev) for an artificial data set Database German Credits (GCRE) Australian Credits (AUS) Munich Credits (MCRE) Insurance Company (COIL) DNA Slice-junction (DNA) Page-blocks (PAG) Dermatology (DER) Pen-digits (PEN) # Classes 2 2 2 2 3 5 6 10 Class distribution [0.70 0.30] [0.32 0.68] [0.30 0.70] [0.94 0.06] [0.24 0.24 0.52] [0.90 0.06 0.01 0.01 0.02] [0.31 0.16 0.20 0.13 0.14 0.06] [0.104 0.104 0.104 0.096 0.104 0.096 0.096 0.104 0.096 0.096] # Attributes 8 14 20 85 180 10 34 16 # Instances 1000 690 1000 9822 3186 5473 366 10992 Table 2: Experimental Data sets Other public data set used is Munich Credits from the Dept. of Statistics at the University of Munich.2 Data set description is summarized in Table 2, and cost-benefit matrices are shown in Table 3. We have used the cost values that appear in Ikizler (2002) for those data sets in common. Otherwise, for lack of an expert analyst, the cost values have been chosen by hand. 2. Data sets available at http://www.stat.uni-muenchen.de/service/datenarchiv/welcome e.html. Insurance Company 0 1 German, Australian, Munich Credits −1 0 0 −17 Page-Blocks −1 2 2 2 2 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 −4 2 3 4 3 4 3 −3 3 5 1 5 5 0 Dermatology 3 2 3 2 −8 4 5 −10 4 3 5 4 2 1 4 5 −6 5 2 3 5 2 3 −10 −1 2 2 DNA 2 −1 2 Pendigits ci j = 0 1 Table 3: Cost-Benefit matrices for the experimental Data sets 123 3 3 0 if i = j Otherwise A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Standard Maximum Risk Deviation from the optimal classifier Re-balanced Minimax Deviation Minimax minMaxDev minMax GCRE 0.70 0.80 (0.55 0.60) 0.99 ACRE 1.00 1.00 (0.76 0.86) 1.00 MCRE 0.91 0.77 (0.54 0.59) 0.99 COIL 2.78 0.99 (0.87 0.92) 16.32 DNA 0.34 0.53 (0.30 0.27 0.25) PAG 0.62 0.26 (0.13 0.13 0.20 0.16 0.16) DER 1.03 1.28 (0.67 0.78 0.51 0.48 0.54 PEN 0.061 0.059 (0.024 0.028 0.025 0.019 1.14 0.023 0.021 0.026 0.022 0.86 0.60) 0.023 0.029) 7.62 0.029 Table 4: Classifier Performance evaluated as Maximum Risk Deviation from the optimal classifier for several real-world applications. Class-conditional risk deviations (R i − cii ) reported for the minMaxDev classifier. Experimental results for these data sets are shown in the following sections. The robustness of different decision machines under complete uncertainty of prior probabilities is analyzed in Section 6.3.1. If uncertainty is only partial, a similar study and comparison with the previous approach (complete uncertainty) is carried out in Section 6.3.2. 6.3.1 C LASSIFIER ROBUSTNESS U NDER C OMPLETE U NCERTAINTY We now study how different neural-based classifiers cope with worst-case situations in prior probabilities. The maximum deviation from the optimal classifier (see Table 4) is reported for the proposed minMaxDev strategy as well as for other alternative approaches: the one based on the assumption of stationary priors (standard) and the common alternative of deriving the classifier from an equally distributed data set (re-balanced). A comparison with the traditional minimax strategy is also provided. Together with the previously mentioned value (maximum deviation or regret), deviation for the L class-conditional extreme cases (Ri − cii ) is also reported for the minMaxDev classifier in Table 4. Results allow to verify that this solution is fairly close to the optimal one where deviation is not dependent on priors and thus, class-conditional deviations take the same value. Although the balanced class distribution to train the classifier can be obtained by means of undersampling and/or oversampling, it is simulated by altering the learning rate used in the training 1/L phase according to (19) as µi = µ (0) , where 1/L represents the simulated probability, equal for Pi all classes. Results evidence that the assumption of stationary priors may lead to significant deviations from the optimal decision rule under “unexpected”, but rather realistic, prior changes. This deviation may reach up to three times more than the robust minimax deviation strategy. Thus, for classification problems like Page-blocks the maximum deviation from the optimal classifier is 0.62 for the 124 M INIMAX R EGRET C LASSIFIER Standard Maximum Risk Re-balanced Minimax Deviation minMaxDev Minimax minMax GCRE 0.70 0.15 0.60 0.00 ACRE 0.01 0.02 0.86 -0.00 MCRE 0.05 0.20 0.59 0.00 COIL 0.76 0.99 0.86 0.02 DNA 0.34 0.53 0.25 0.13 PAG 0.62 0.26 0.20 0.10 DER -2.10 -1.68 -2.21 -2.38 PEN 0.061 0.059 0.029 0.029 Table 5: Classifier Performance measured as Maximum Risk for several real-world applications. standard classifier while this reduces to 0.20 for the minMaxDev one. Likewise, for the Insurance company(COIL) application the maximum deviation for the standard classifier is 2.78 compared with 0.92 for the minMaxDev model. The remaining databases also show the same behavior as it is presented in Table 4. On the other hand, the use of a classifier inferred from a re-balanced data set does not necessarily involve a decrease in the maximum deviation with respect to the standard classifier. In the same way, the traditional minimax classifier does not protect against prior changes in terms of maximum relative deviation from the minimum risk classifier. However, if our criterion is more conservative and our aim is the minimization of the maximum possible risk (not the minimization of the deviation), the traditional minimax classifier represents the best option. It is shown in Table 5 where the maximum risk for the different classifiers is reported. Positive values in this table indicate a cost while negative values represent a benefit. For instance, for the Page-blocks application the minimax classifier assures a maximum risk of 0.10 while the standard, re-balanced and minMaxDev classifiers reach values of 0.62, 0.26 and 0.20, respectively. It can be noticed that for the Pen-digits data set, the minimax deviation and minimax approaches attain the same results. The reason is that, for this problem, the R basis plane takes the same value (in this case, zero) in the probability space. 6.3.2 C LASSIFIER ROBUSTNESS UNDER PARTIAL U NCERTAINTY Unlike the previous section, we consider now that partial information about the class priors is available. The aim is to find a classifier that behaves well for a delimited and realistic range of priors what constitutes an aid in reducing the maximum deviation from the optimal classifier. This situation can be treated as a constrained minimax regret strategy where the constraints represent any extra information about prior probability value. Experimental results for several situations of partial prior uncertainty are presented in this section. We consider that lower bounds for the prior probabilities are available (see Table 6). In order to get the Γ-minMaxDev classifier, the risk for the different vertex of the uncertainty domain needs to be calculated. With them, the basis risk RΓ over which deviations are measured is derived. basis 125 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Lower bound for prior probabilities Data Set P0l P1l GCRE 0.40 0.25 ACRE 0.20 0.25 MCRE 0.20 0.25 COIL 0.15 P2l P3l P4l P5l 0.03 DNA 0.10 0.10 0.22 0.02 0.00 0.01 0.1 0.20 0.10 0.10 0.10 0.10 0.06 0.06 0.10 0.10 0.06 P9l 0.06 0.10 0.05 0.05 0.02 PEN P8l 0.02 DER P7l 0.25 PAG P6l Table 6: Lower bounds for prior probabilities defining the uncertainty region, Γ region for the experimental data sets. Maximum Risk Deviation in the uncertainty region Standard Minimax Deviation Minimax Deviation with restriction minMaxDev Γ-minMaxDev GCRE 0.24 0.19 (0.10 0.09) ACRE 0.03 0.64 (0.03 0.03) MCRE 0.22 0.38 (0.13 0.10) COIL 2.33 0.77 (0.17 0.11) DNA 0.14 0.08 (0.07 0.07 0.06) PAG 0.37 0.15 (0.10 0.08 0.08 0.05 0.04) DER 0.08 0.05 (0.03 0.03 0.04 0.02 0.05 PEN 0.013 0.007 (0.003 0.001 0.003 0.000 0.001 0.001 0.000 0.003 0.05) 0.001 0.001) Table 7: Classifier Performance under partial knowledge of prior probabilities measured as Maximum Risk Deviation for several real-world applications. Class-conditional risk deviations (RΓ − cΓ ) are reported for the Γ-minMaxDev classifier. i ii Maximum deviation from the optimal in Γ is reported for the Γ-minMaxDev classifier together with the standard and the minMaxDev ones. For instance, the standard classifier for the Pageblocks data set deviates from the optimal classifier, in the defined uncertainty region, up to 0.37, while when complete uncertainty is assumed the maximum deviation is equal to 0.62. In the same way, reducing the uncertainty also means a reduction in the maximum deviation for minMaxDev classifier (trained without considering this partial knowledge). Thus, for Γ, this classifier assures a deviation bound of 0.15. However, taking into account this partial information to train a Γ-minMaxDev classifier allows to reduce the deviation for the worst-case conditions to 0.10. It can be seen the same behavior for the other databases in Table 7. 126 M INIMAX R EGRET C LASSIFIER 7. Conclusions This work concerns the design of robust neural-based classifiers when the prior probabilities of the classes are partially or completely unknown, even by the end user. This problem of uncertainty in the class priors is often ignored in supervised classification, even though it is a widespread situation in real world applications. As a result, the reliability of the inducted classifier can be greatly affected as previously shown by the experiments. To tackle this problem, we have proposed a novel minimax deviation strategy with the goal to minimize the maximum deviation with respect to the optimal classifier. A neural network training algorithm based on learning rate scaling has been developed. The experimental results show that this minimax deviation (minMaxDev) classifier protects against prior changes while other approaches like ignoring this uncertainty or use a balanced learning data set may result in large differences in performance with respect to the minimum risk classifier. Also, it has been shown that the conventional minimax classifier reduces the maximum possible risk following a conservative attitude but at the expense of large worst-case differences from the optimal classifier. Furthermore, a constrained minimax deviation approach (Γ-minMaxDev) has been derived for those situations where uncertainty is only partial. This may be seen as a general approach with some particular cases: a) precise knowledge of prior probabilities and b) complete uncertainty about the priors. In a) the region of uncertainty collapses to a point and we have the Bayes’ rule of minimum risk and in b) the pure minimax deviation strategy comes up. While the first one may be criticized for being quite unrealistic, the other may be seen rather pessimistic. The experimental results for this proposed intermediate situation show that the Γ-minMaxDev classifier allows to reduce the maximum deviation from the optimal and performs well over a range of prior probabilities. Acknowledgments The authors thank the four referees and the associate editor for their helpful comments. This work was partially supported by the project TEC2005-06766-C03-02 from the Spanish Ministry of Education and Science. References N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11, 2004. N. M. Adams and D. J. Hand. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32(7):1139–1147, March 1998. R. Alaiz-Rodriguez, A. Guerrero-Curieses, and J. Cid-Sueiro. Minimax classifiers based on neural networks. Pattern Recognition, 38(1):29–39, January 2005. R. Barandela, J. S. Sanchez, V. Garc´a, and E. Rangel. Strategies for learning in class imbalance ı problems. Pattern Recognition, 36(3):849–851, March 2003. J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, second edition, 1985. 127 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/ mlearn/MLRepository.html. URL L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, NY, 1984. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. J. Cid-Sueiro and A. R. Figueiras-Vidal. On the structure of strict sense Bayesian cost functions and its applications. IEEE Transactions on Neural Networks, 12(3):445–455, May 2001. C. Drummond and R. C. Holte. Explicitly representing expected cost: An alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 198–207. ACM Press, 2000. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2001. Y. C. Eldar and N. Merhav. Minimax approach to robust estimation of random parameters. IEEE Trans. on Signal Processing, 52(7):1931–1946, July 2004. Y. C. Eldar, A. Ben-Tal, and A. Nemirovski. Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. IEEE Trans. on Signal Processing, 52(8):2177– 2188, August 2004. M. Feder and N. Merhav. Universal composite hypothesis testing: A competitive minimax approach. IEEE Trans. on Information Theory, 48(6):1504–1517, June 2002. A. Guerrero-Curieses, R. Alaiz-Rodriguez, and J. Cid-Sueiro. A fixed-point algorithm to minimax learning with neural networks. IEEE Transactions on Systems, Man and Cybernetics Part C, 34 (4):383–392, November 2004. ¨ H. A. G¨ venir, N. Emeksiz, N. Ikizler, and N. Ormeci. Diagnosis of gastric carcinoma by classifiu cation on feature projections. Artificial Intelligence in Medicine, 31(3), 2004. N. Ikizler. Benefit maximizing classification using feature intervals. Technical Report BU-CE-0208, Bilkent University, Ankara, Turkey, 2002. N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis Journal, 6(5):429–450, November 2002. M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of Fifth International Conference on SIG Knowledge Discovery and Data Mining (SIGKDD), pages 367–371, San Diego, CA, 1999. H. J. Kim. On a constrained optimal rule for classification with unknown prior individual group membership. Journal of Multivariate Analysis, 59(2):166–186, November 1996. M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings 14th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997. 128 M INIMAX R EGRET C LASSIFIER M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2/3):195–215, 1998. S. Lawrence, I. Burns, A. D. Back, A. C. Tsoi, and C. L. Giles. Neural network classification and ¨ unequal prior class probabilities. In G. Orr, K.-R. Muller, and R. Caruana, editors, Tricks of the Trade, Lecture Notes in Computer Science State-of-the-Art Surveys, pages 299–314. Springer Verlag, 1998. T. K. Moon and W. C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, 2000. A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In ICML ’05: Proceedings of the 22nd International Conference on Machine learning, pages 625– 632, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-180-5. E. Polak. Optimization: Algorithms and Consistent Approximations. Springer, 1997. F. Provost. Learning with imbalanced data sets 101. In Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets. AAAI Press. Technical Report WS-00-05, 2000. F. Provost and T. Fawcett. Robust classification systems for imprecise environments. Machine Learning, 42(3):203–231, March 2001. M. Saerens, P. Latinne, and C. Decaestecker. Adjusting a classifier for new a priori probabilities: A simple procedure. Neural Computation, 14:21–41, January 2002. E. Takimoto and M. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings 13th Annual Conference on Computational Learning Theory, pages 100–106. Morgan Kaufmann, San Francisco, 2000. K. M. Ting. A study of the effect of class distribution using cost-sensitive learning. In Proceedings of the Fifth International Conference on Discovery Science, pages 98–112. Berlin: Springer-Verlag, 2002. H. L. VanTrees. Detection, Estimation and Modulation Theory. John Wiley and Sons, 1968. G. I. Webb and K. M. Ting. On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58(1):25–32, 2005. W. Wei, T. K. Leen, and E. Barnard. A fast histogram-based postprocessor that improves posterior probability estimates. Neural Computation, 11(5):1235 – 1248, July 1999. B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204–213. ACM Press, 2001. B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Eighth International Conference on Knowledge Discovery and Data Mining, 2002. 129 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the third IEEE International Conference on Data Mining, pages 435–442, 2003. Z. H. Zhou and X. Y. LiuJ. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, January 2006. 130
2 0.11345409 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
Author: Avrim Blum, Yishay Mansour
Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions
3 0.077299789 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
Author: Sanjoy Dasgupta, Leonard Schulman
Abstract: We show that, given data from a mixture of k well-separated spherical Gaussians in R d , a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to nearoptimal precision, if the dimension is high (d ln k). We relate this to previous theoretical and empirical work on the EM algorithm. Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis
4 0.058862325 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
Author: Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, Robert L. Wolpert
Abstract: Kernel methods have been very popular in the machine learning literature in the last ten years, mainly in the context of Tikhonov regularization algorithms. In this paper we study a coherent Bayesian kernel model based on an integral operator defined as the convolution of a kernel with a signed measure. Priors on the random signed measures correspond to prior distributions on the functions mapped by the integral operator. We study several classes of signed measures and their image mapped by the integral operator. In particular, we identify a general class of measures whose image is dense in the reproducing kernel Hilbert space (RKHS) induced by the kernel. A consequence of this result is a function theoretic foundation for using non-parametric prior specifications in Bayesian modeling, such as Gaussian process and Dirichlet process prior distributions. We discuss the construction of priors on spaces of signed measures using Gaussian and L´ vy processes, e with the Dirichlet processes being a special case the latter. Computational issues involved with sampling from the posterior distribution are outlined for a univariate regression and a high dimensional classification problem. Keywords: reproducing kernel Hilbert space, non-parametric Bayesian methods, L´ vy processes, e Dirichlet processes, integral operator, Gaussian processes c 2007 Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee and Robert L. Wolpert. P ILLAI , W U , L IANG , M UKHERJEE AND W OLPERT
5 0.054927208 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
Author: François Laviolette, Mario Marchand
Abstract: We propose a PAC-Bayes theorem for the sample-compression setting where each classifier is described by a compression subset of the training data and a message string of additional information. This setting, which is the appropriate one to describe many learning algorithms, strictly generalizes the usual data-independent setting where classifiers are represented only by data-independent message strings (or parameters taken from a continuous set). The proposed PAC-Bayes theorem for the sample-compression setting reduces to the PAC-Bayes theorem of Seeger (2002) and Langford (2005) when the compression subset of each classifier vanishes. For posteriors having all their weights on a single sample-compressed classifier, the general risk bound reduces to a bound similar to the tight sample-compression bound proposed in Laviolette et al. (2005). Finally, we extend our results to the case where each sample-compressed classifier of a data-dependent ensemble may abstain of predicting a class label. Keywords: PAC-Bayes, risk bounds, sample-compression, set covering machines, decision list machines
6 0.043160569 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
7 0.041329287 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
8 0.040061872 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
9 0.039308771 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
10 0.038817342 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
11 0.037334081 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
12 0.034821019 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
13 0.033629969 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
14 0.03278726 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
15 0.032184158 70 jmlr-2007-Ranking the Best Instances
16 0.031719252 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
17 0.031105909 9 jmlr-2007-AdaBoost is Consistent
18 0.030999633 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
19 0.03076221 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
20 0.02895554 90 jmlr-2007-Value Regularization and Fenchel Duality
topicId topicWeight
[(0, 0.185), (1, -0.019), (2, -0.091), (3, -0.059), (4, 0.049), (5, -0.003), (6, 0.011), (7, -0.127), (8, -0.061), (9, 0.081), (10, 0.018), (11, -0.14), (12, 0.11), (13, 0.068), (14, -0.065), (15, 0.053), (16, 0.007), (17, 0.054), (18, 0.09), (19, -0.148), (20, 0.126), (21, 0.191), (22, 0.014), (23, 0.509), (24, 0.022), (25, -0.057), (26, 0.146), (27, -0.038), (28, -0.036), (29, 0.063), (30, 0.081), (31, -0.06), (32, 0.09), (33, 0.101), (34, -0.04), (35, 0.023), (36, 0.057), (37, 0.031), (38, -0.051), (39, -0.051), (40, -0.005), (41, -0.017), (42, -0.135), (43, 0.13), (44, -0.061), (45, -0.224), (46, -0.138), (47, -0.109), (48, -0.016), (49, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.95610619 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
Author: Rocío Alaiz-Rodríguez, Alicia Guerrero-Curieses, Jesús Cid-Sueiro
Abstract: The design of a minimum risk classifier based on data usually stems from the stationarity assumption that the conditions during training and test are the same: the misclassification costs assumed during training must be in agreement with real costs, and the same statistical process must have generated both training and test data. Unfortunately, in real world applications, these assumptions may not hold. This paper deals with the problem of training a classifier when prior probabilities cannot be reliably induced from training data. Some strategies based on optimizing the worst possible case (conventional minimax) have been proposed previously in the literature, but they may achieve a robust classification at the expense of a severe performance degradation. In this paper we propose a minimax regret (minimax deviation) approach, that seeks to minimize the maximum deviation from the performance of the optimal risk classifier. A neural-based minimax regret classifier for general multi-class decision problems is presented. Experimental results show its robustness and the advantages in relation to other approaches. Keywords: classification, imprecise class distribution, minimax regret, minimax deviation, neural networks 1. Introduction - Problem Motivation In the general framework of learning from examples and specifically when dealing with uncertainty, the robustness of the decision machine becomes a key issue. Most machine learning algorithms are based on the assumption that the classifier will use data drawn from the same distribution as the training data set. Unfortunately, for most practical applications (such as remote sensing, direct marketing, fraud detection, information filtering, medical diagnosis or intrusion detection) the target class distribution may not be accurately known during learning: for example, because the cost of labelling data may be class-dependent or the prior probabilities are non-stationary. Therefore, the data used to design the classifier (within the Bayesian context (see VanTrees, 1968), the c 2007 Roc´o Alaiz-Rodr´guez, Alicia Guerrero-Curieses and Jesus Cid-Sueiro. ´ ı ı A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I prior probabilities and the misclassification costs) may be non representative of the underlying real distributions. If the ratio of training data corresponding to each class is not in agreement with real class distributions, designing Bayes decision rules based on prior probabilities estimated from these data will be suboptimal and can seriously affect the reliability and performance of the classifier. A similar problem may arise if real misclassification costs are unknown during training. However, they are usually known by the end user, who can adapt the classifier decision rules to cost changes without re-training the classifier. For this reason, our attention in this paper is mainly focused on the problem of uncertainty in prior probabilities. Furthermore, being aware that class distribution is seldom known (at least totally) in real world applications, a robust approach (as opposite to adaptive) that prevents severe performance degradation appears to be convenient for these situations. Besides other adaptive and robust approaches that address this problem (discussed in more detail in Section 2.2) it is important to highlight those that handle the problem of uncertainty in priors by following a robust minimax principle: minimize the maximum possible risk. Analytic foundations of minimax classification are widely considered in the literature (see VanTrees, 1968; Moon and Stirling, 2000; Duda et al., 2001, for instance) and a few algorithms to carry out minimax decisions have been proposed. From computationally expensive ones such as estimating probability density functions (Takimoto and Warmuth, 2000; Kim, 1996) or using methods from optimization (Polak, 1997) to simpler ones like neural network training algorithms (Guerrero-Curieses et al., 2004; AlaizRodriguez et al., 2005). Minimax classifiers may, however, be seen as over-conservative since its goal is to optimize the performance under the least favorable conditions. Consider, for instance, a direct marketing campaign application carried out in order to maximize profits. Since optimal decisions rely on the proportion of potential buyers and it is usually unknown in advance, our classification system should take into account this uncertainty. Nevertheless, following a pure minimax strategy can lead to solutions where minimizing the maximum loss implies considering there are no potential clients. If it is the case, this minimax approach does not seem to be suitable for this kind of situation. In this imprecise class distribution scenario, it can be noticed that the classifier performance may be highly deviated from the optimal, that is, that of the classifier knowing actual priors. Minimizing this gap (that is, the maximum possible deviation with respect to the optimal classifier) is the focus of this paper. We seek for a system as robust as the conventional minimax approach but less pessimistic at the same time. We will refer to it as a minimax deviation (or minimax regret) classifier. In contrast to other robust and adaptive approaches, it can be used in general multiclass problems. Furthermore, as shown in Guerrero-Curieses et al. (2004), minimax approaches can be used in combination with the adaptive proposal by Saerens et al. (2002) to exploit its advantages. This minimax regret approach has recently been applied in the context of parameter estimation (Eldar et al., 2004; Eldar and Merhav, 2004) and a similar competitive strategy has been used in the context of hypothesis testing (Feder and Merhav, 2002). Under prior uncertainty, our solution provides an upper bound of the performance divergence from the optimal classifier. We propose a simple learning rate scaling algorithm in order to train a neural-based minimax deviation classifier. Although training can be based on minimizing any objective function, we have chosen objective functions that provide estimates of the posterior probabilities (see Cid-Sueiro and Figueiras-Vidal, 2001, for more details). 104 M INIMAX R EGRET C LASSIFIER This paper is organized as follows: the next section provides an overview of the problem as well as some previous approaches to cope with it. Next, Section 3 states the fundamentals of minimax classification together with a deeper analysis of the minimax regret approach proposed in this paper. Section 4 presents a neural training algorithm to get a neural-based minimax regret classifier under complete uncertainty. Moreover, practical situations with partial uncertainty in priors are also discussed. A learning algorithm to solve them is provided in Section 5. In Section 6, some experimental results show that minimax regret classifiers outperform (in terms of maximum risk deviation) classifiers trained on re-balanced data sets and those with the originally assumed priors. Finally, the main conclusions are summarized in Section 7. 2. Problem Overview Traditionally, supervised learning lies in the fact that training data and real data come from the same (although unknown) statistical model. In order to carefully analyze to what extend classifier performance depends on conditions such as class distribution or decision costs, learning and decision theory principles are briefly revisited. Next, some previous approaches to deal with environment imprecision are reviewed. 2.1 Learning and Making Optimal Decisions Let S = {(xk , dk ), k = 1, . . . , K} denote a set of labelled samples where xk ∈ RN is an observation feature vector and dk ∈ UL = {u0 , . . . , uL−1 } is the label vector. Class-i label ui is a unit L-dimensional vector with components ui, j = δi j , with every component equal to 0, except the i-th component which is equal to 1. We assume a learning process that estimates parameters w of a non-linear mapping f w : RN → P from the input space into probability space P = {p ∈ [0, 1]L | ∑L−1 pi = 1}. The soft decision is given i=0 by yk = fw (xk ) ∈ P and the hard output of the classifier is denoted by d. Note that d and d will be used to distinguish the actual class from the predicted one, respectively. Several costs (or benefits) associated with each possible decision are also defined: c i j denotes the cost of deciding in favor of class i when the true class is j. Negative values represent benefits (for instance, cii , which is the cost of correctly classifying a sample from class i could be negative in some practical cases). In general cost-sensitive classification problems, either misclassification costs c i j or cii costs can take different values for each class. Thus, there are many applications where classification errors lead to very different consequences (medical diagnosis, fault detection, credit risk analysis), what implies misclassification costs ci j that may largely vary between them. In the same way, there are also many domains where correct decision costs (or benefits) c ii do not take the same value. For instance, in targeted marketing applications (Zadrozny and Elkan, 2001), correctly identifying a buyer implies some benefit while correctly classifying a non buyer means no income. The same ¨ applies to medical diagnosis domains such as the gastric carcinoma problem studied in G uvenir et al. (2004). In this case, the benefit of correct classification also depends on the class: the benefit of correctly classifying an early stage tumor is higher than that of a later stage. The expected risk (or loss) R is given by L−1 L−1 R = ∑ ∑ ci j P{d = ui |d = u j }Pj j=0 i=0 105 , (1) A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I where P{d = ui |d = u j } with i = j represent conditional error probabilities, and P j = P{d = u j } is the prior probability of class u j . Defining the conditional risk of misclassifying samples from class u j as L−1 Rj = ∑ ci j P{d = ui |d = u j } , i=0 we can express risk (1) as L−1 R= ∑ Ri Pi . (2) i=0 It is well-known that the Bayes decision rule for the minimum risk is given by L−1 d = arg min { ∑ ci j P{d = u j |x}} , ui (3) j=0 where P{d = ui |x} is the a posteriori probability of class i given sample x. The optimal decision rule depends on posterior probabilities and therefore, on the prior probabilities and the likelihood. In theory, as long as posterior probabilities (or likelihood and prior probabilities) are known, the optimal decision in Eq. (3) can be expressed after a trivial manipulation as a function of the cost differences between the costs (ci j − c j j ) (Duda et al., 2001). This is the reason why c j j is usually assumed to be zero and the value of the cost difference is directly assigned to c i j . When dealing ¨ with practical applications, however, some authors (Zadrozny and Elkan, 2001; G uvenir et al., 2004) have urged to use meaningful decision costs measured over a common baseline (and not necessarily taking c j j = 0) in order to avoid mistakes that otherwise could be overlooked. For this reason and, what is more important, the uncertainty class distribution problem addressed in this paper, decision costs measured over a common baseline are considered. Furthermore, absolute values of decision costs are relevant to the design of classifiers under the minimax regret approach. 2.2 Related Work: Dealing with Cost and Prior Uncertainty Most proposals to address uncertainty in priors fall into the categories of adaptive and robust solutions. While the aim of a robust solution is to avoid a classifier with very poor performance under any conditions, an adaptive system pursues to fit the classifier parameters using more incoming data or more precise information. With an adaptive-oriented principle, Provost (2000) states that, once the classifier is trained under specific class distributions and cost assumptions (not necessarily the operating conditions), the selection of the optimal classifier for specific conditions is carried out by a correct placement of the decision thresholds. In the same way, the approaches in Kelly et al. (1999) and Kubat et al. (1998) consider that tuning the classifier parameters should be left to the end user, expecting that class distributions and misclassification costs will be precisely known then. Some graphical methods based on the ROC curve have been proposed in Adams and Hand (1998) and Provost and Fawcett (2001) in order to compare the classifier performance under imprecise class distributions and/or misclassification costs. The ROC convex hull method presented in Provost and Fawcett (2001) (or the alternative representation proposed in Drummond and Holte (2000)) allows the user to select potentially optimal classifiers, providing a flexible way to select 106 M INIMAX R EGRET C LASSIFIER them when precise information about priors or costs is available. Under imprecision, some classifiers can be discarded but this does not necessarily provide a method to select the optimal classifier between the possible ones and fit its parameters. Furthermore, due to its graphical character, these methods are limited to binary classification problems. Changes in prior probabilities have also been discussed by Saerens et al. (2002), who proposes a method based on re-estimating the prior probabilities of real data in an unsupervised way and subsequently adjusting the outputs of the classifier according to the new a priori probabilities. Obviously, the method requires enough unlabelled data being available for re-estimation. As an alternative to adaptive schemes, several robust solutions have been proposed, as the resampling methods, especially in domains where imbalanced classes come out (Kubat and Matwin, 1997; Lawrence et al., 1998; Chawla et al., 2002; Barandela et al., 2003). Either by undersampling or oversampling, the common purpose is to balance artificially the training data set in order to get a uniform class distribution, which is supposed to be the least biased towards any class and, thus, the most robust against changes in class distributions. The same approach is followed in cost sensitive domains, but with some subtle differences in practice. It is well known that class priors and decision costs are intrinsically related. For instance, different decision costs can be simulated by altering the priors and vice versa (see Ting, 2002, for instance). Thus, when a uniform distribution is desired in a cost sensitive domain, but working with cost insensitive decision machines, class priors are altered according to decision costs, what is commonly referred as rebalancing. The manipulation of the training data distribution has been applied to cost-sensitive learning in two-class problems (Breiman et al., 1984) in the following way: basically, the class with higher misclassification cost (suppose n times the lowest misclassification cost) is represented with n times more examples than the other class. Besides random sampling strategies, other sampling-based rebalancing schemes have been proposed to accomplish this task, like those considering closeness to the boundaries between classes (Japkowicz and Stephen, 2002; Zhou and LiuJ, 2006) or the costproportionate rejection sampling presented in Zadrozny et al. (2003). Extending the formulation of this type of procedures to general multiclass problems with multiple (and possibly asymmetric) inter-class misclassification costs appears to be a nontrivial task (Zadrozny et al., 2003; Zhou and LiuJ, 2006), but some progress has been made recently with regard to this latter point (Abe et al., 2004). Note, also, that many (although not all) of these rebalancing strategies are usually implemented by oversampling and/or subsampling, that is, replicating examples (without adding any extra information) and/or deleting them (which implies information loss). 3. Robust Classifiers Under Prior Uncertainty: Minimax Classifiers Prior probability uncertainty can be coped from a robust point of view following a minimax derived strategy. Minimax regret criterion is discussed in this section after presenting the conventional minimax criterion. Although our approach extends to general multi-class problems and the discussion is carried out in that way, we will first illustrate, for the sake of clarity and simplicity, a binary situation. 3.1 Minimax Classifiers As Eq. (3) shows, the minimum risk decisions depend on the misclassification costs, c i j , and the posterior class probabilities and, thus, they depend on the prior probabilities, Pi . Different prior 107 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I PSfrag replacements distributions (frequency for each class) give rise to different Bayes classifiers. Fig. 1 shows the Bayes risk curve, RB (P1 ) versus class-1 prior probability for a binary classification problem. Standard RF (Q1 , P1 ) R0 Minimax RF (Q1mM , P1 ) Risk c00 Minimax Deviation RF (Q1mMd , P1 ) Rbasis R1 RB (P1 ) c11 0 Q1 Q1mM 1 Q1mMd P1 Figure 1: Risk vs. P1 . Minimum risk curve and performance under prior changes for the standard, minimax and minimax deviation classifier. RB (P1 ) stands for the optimal Bayes Risk against P1 . RF (Q1 , P1 ) denotes the Risk of a standard classifier (Fixed decision rule optimized for prior probabilities Q1 estimated in the training phase) against P1 . RF (Q1mM , P1 ) denotes the Risk of a minimax classifier (Fixed decision rule optimized for the minimax probabilities Q1mM ) against P1 . RF (Q1mMd , P1 ) denotes the Risk of a minimax deviation classifier (Fixed decision rule optimized for the minimax deviation probabilities Q 1mMd ) against P1 . If the prior probability distribution is unknown when the classifier is designed, or this distribution changes with time or from one environment to other, the mismatch between training and test conditions can degrade significantly the classifier performance. For instance, assume that Q = (Q0 , Q1 ) is the vector with class-0 and class-1 prior probabilities estimated in the training phase, respectively, and let RB (Q1 ) represent the minimum (Bayes) risk attainable by any decision rule for these priors. Note, that, according to Eq. (2), for a given classifier, the risk is a linear function of priors. Thus, risk RF (Q1 , P1 ) associated to the (fixed) classifier optimized for Q changes linearly with actual prior probabilities P1 and P0 = 1 − P1 , going from (0, R0 ) to (1, R1 ) (the continuous line in Fig. 1), where R0 and R1 refer to the class conditional risks for classes 0 and 1, respectively. Fig. 1 shows the impact of this change in priors and how performance deviates from optimal. Also, it can be shown (see VanTrees, 1968, for instance) that the minimum risk curve obtained for each prior is convex and the risk function of a given classifier verifies R F (Q1 , P1 ) ≥ RB (P1 ) with a tangent point at P1 = Q1 . 108 M INIMAX R EGRET C LASSIFIER The dashed line in Fig. 1 shows the performance of the minimax classifier, which minimizes the maximum possible risk under the least favorable priors, thus providing the most robust solution, in the sense that performance becomes independent from priors. From Fig. 1, it becomes clear that the minimax classifier is optimal for prior probabilities P = QmM = (Q0mM , Q1mM ) maximizing RB . Thus, this strategy is equivalent to maximizing the minimum risk (Moon and Stirling, 2000; Duda et al., 2001). We will refer to them as the minimax probabilities. Fig. 1 also makes clear that although a minimax classifier is a robust solution to address the imprecision in priors, it may become a somewhat pessimistic approach. 3.2 Minimax Deviation Classifiers We propose an alternative classifier that, instead of minimizing the maximum risk, minimizes the maximum deviation (regret) from the optimal Bayes classifier. In the following, we will refer to it as the minimax deviation or minimax regret classifier. A comparison between minimax and minimax deviation approaches is also shown in Fig. 1. This latter case corresponds to a classifier trained on prior probabilities P = Q mMd with performance as a function of priors given by a line (a plane or hyperplane for three or more classes, respectively) parallel to what we name, in the following, basis risk (Rbasis = c00 (1 − P1 ) + c11 P1 ). Note that the maximum deviation (with respect to priors) of the classifier optimized for Q is given by D(Q) = max {RF (Q1 , P1 ) − RB (P1 )} = max {R0 − c00 , R1 − c11 } . P1 The inspection of Fig. 1 shows that the minimum of D (with respect to Q) is achieved when R0 − c00 = R1 − c11 , which means that line RF (Q1 , P1 ) is parallel to arc named Rbasis in the figure and tangent to RB at Q1mMd . Therefore, the minimax regret classifier is also the Bayes solution with respect to the least favorable priors (Q0mMd , Q1mMd ) (see Berger, 1985, for instance), which will be denoted as minimax deviation probabilities. Now, we extend the formulation to a general L-class problem. Definition 1 Consider a L-class decision problem with costs ci j , 0 ≤ i, j < L and c j j ≤ ci j , and let Rw (P) be the risk of a decision machine with parameter vector w when prior class probabilities are given by P = (P0 , . . . , PL−1 ). The deviation function is defined as Dw (P) = Rw (P) − RB (P) and the minimax deviation is defined as DmMd = inf max{Dw (P)} . w P (4) Note that the above definition assumes that the maximum exists. This is actually the case, since Dw (P) is a linear function over a compact set, P . Note, also, that our definition includes the natural assumption that c j j is never higher than ci j , meaning that making a decision error is always less costly than taking the correct decision. This assumption is used in part of our theoretical analysis. The algorithms proposed in this paper are based on the fact that the minimax deviation can be computed without knowing RB 109 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Theorem 2 The minimax deviation is given by DmMd = inf max{Dw (P)} , w P where Dw (P) = Rw (P) − Rbasis (P) and (5) L−1 ∑ c j j Pj Rbasis (P) = . (6) j=0 Proof Note that, according to Eqs. (1) and (2), for any decision machine and any u i ∈ UL , L−1 R(u j ) = R j = ∑ ci j P{d = ui |d = u j } ≥ c j j . i=0 Since the bound is reached by the classifier deciding d = u j for any observation x, we have RB (u j ) = c j j . Therefore, using Eq. (6), we find that, for any u ∈ UL , RB (u) = Rbasis (u) and, thus, Dw (u) = Dw (u) . Since Bayes minimum risk RB (P) is a convex function of priors and Rw (P) is linear, Dw (P) is concave and, thus, it is maximum at some of the vertices in P (i.e., at some P = u ∈ U L ). Thus, max{Dw (P)} = max {Dw (u)} . u∈UL P (7) Since the maximum difference between two hyperplanes defined over P is always at some vertex, we can conclude that max{Dw (P)} = max {Dw (u)} = max {Dw (u)} . P u∈UL u∈UL (8) Combining Eqs. (4), (7) and (8), we get DmMd = inf max{Dw (P)} . w P Note that Rbasis represents the risk baseline of the ideal classifier with zero errors. Th. 2 shows that the minimax regret can be computed as the minimax deviation to this ideal classifier. Note, also, that if costs cii do not depend on i, Eq. (5) becomes equivalent (up to a constant) to the Bayes risk and the minimax regret classifier becomes equivalent to the minimax classifier . Another important result for the algorithms proposed in this paper is that, under some conditions on the minimum risk, the minimum and maximum operators can be permuted. Although general results on the permutability of minimum and maximum operators can be found in the literature (see Polak, 1997, for instance), we provide here the proof for the specific case interesting to this paper. 110 M INIMAX R EGRET C LASSIFIER Theorem 3 Consider the minimum deviation function given by Dmin (P) = inf{Dw (P)} , (9) w where Dw (P) is the normalized deviation function given by Eq. (5), and let P ∗ be the prior probability vector providing the maximum deviation, P∗ = arg max Dmin (P) P . (10) If Dmin (P) is continuously differentiable at P = P∗ , then the minimax deviation, DmMd , defined by Eq. (4), is DmMd = Dmin (P∗ ) = max inf Dw (P) . (11) P w Proof For any classifier with parameter vector w, we can write, max Dw (P) ≥ Dw (P∗ ) ≥ Dmin (P∗ ) P and, thus, inf max Dw (P) ≥ Dmin (P∗ ) . w P (12) Therefore, Dmin (P∗ ) is a lower bound of the minimax regret. Now we prove that Dmin (P∗ ) is also an upper bound. According to Eq. (9), for any ε > 0, there exists a parameter vector wε such that Dwε (P∗ ) ≤ Dmin (P∗ ) + ε . (13) By definition, for any P, Dmin (P) ≤ Dwε (P). Therefore, using Eq. (13), we can write Dwε (P∗ ) − Dwε (P) ≤ Dmin (P∗ ) − Dmin (P) + ε . (14) Since Dmin (P) is continuously differentiable and (according to Eq. (10)) maximum at P ∗ , for any ε > 0 there exists δ > 0 such that, for any P ∈ P with P∗ − P ≤ δ we have Dmin (P∗ ) − Dmin (P) ≤ ε P∗ − P ≤ ε δ . (15) Let Pδ a prior such that P∗ − Pδ = δ. Taking ε = ε δ and combining Eqs. (14) and (15) we can write Dwε (P∗ ) − Dwε (Pδ ) ≤ 2ε δ . Since the above condition is verified for any ε > 0 and any prior Pδ at distance δ from P, and taking into account that Dwε (P) is a linear function of P, we conclude that the maximum slope of D wε (P) is bounded by 2ε and, thus, for any P ∈ P , we have √ Dwε (P) − Dwε (P∗ ) ≤ 2ε P − P∗ ≤ 2 2ε , √ (where we have used the fact that the maximum distance between two probability vectors is 2). Therefore, we can write √ max Dwε (P) ≤ Dwε (P∗ ) + 2 2ε P 111 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I and, thus, √ inf max Dw (P) ≤ Dwε (P∗ ) + 2 2ε . w P √ Finally, using Eq. (13) and taking into account that ε = ε δ ≤ 2ε we get √ inf max Dw (P) ≤ Dmin (P∗ ) + 3 2ε . w P (16) Since the above is true for any ε > 0 we conclude that Dmin (P∗ ) is also an upper bound of Dw . Therefore, combining Eqs. (12) and (16), we conclude that inf max Dw (P) = Dmin (P∗ ) , w P which completes the proof. Note that the deviation function needs to be neither differentiable nor a continuous function of w parameters. If the minimum deviation function is not continuously differentiable at the minimax deviation probability, P∗ , the theorem cannot be applied. The reason is that, although there should exist at least one classifier providing the minimum deviation at P = P∗ , it or they could not provide a constant deviation with respect to the prior probability. The situation can be illustrated with an example. Let x ∈ R be given by p(x|d = 0) = 0.8N(x, σ) + 0.2N(x − 2, σ) and p(x|d = 1) = 0.2N(x − 1, σ) + 0.8N(x − 3, σ), where σ = 0.5 and N(x, σ) = (2πσ)−1/2 exp(−x2 /(2σ2 )), and consider the set Φλ of classifiers given by a single threshold over x and decision dˆ = 1 if x ≥ λ 0 if x < λ. Fig. 2 shows the distribution of both classes over x, and Fig. 3 shows, as a function of priors, the minimum error probability (continuous line) that can be obtained using classifiers in Φ λ . Note that decision costs c00 = c11 = 0 and c01 = c10 = 1 have been considered for this illustrative problem. An abrupt slope change is observed at the minimax deviation probability, for P{d = 1} = 1/2. For this prior, there are two single threshold classifiers providing the minimum error probability, which are given by thresholds λ1 and λ2 in Fig. 2. However, as shown in Fig. 3 neither of them provides a risk that is constant in the prior. The minimax deviation classifier in Φ λ , which has a threshold λ0 , does not attain minimum risk at the minimax deviation probability and, thus, cannot be obtained by using Eq. (11). For this example, the desired robust classifier should have a deviation function given by the horizontal dotted line in Fig. 3. Fortunately, it can be obtained by combining the outputs of several classifiers. For instance, let dˆ1 and dˆ2 the decisions of classifiers given by thresholds λ1 and λ2 , respectively. It is not difficult to see that the classifier selecting dˆ1 and dˆ2 at random (for each input sample x) provides a robust classifier. This procedure can be extended to the multiclass-case: consider a set of L classifiers with parameters wk , k = 0, . . . , L − 1, and consider the classifier such that, for any input sample x, makes a decision equal to dk (i.e., the decision of classifier with parameters wk ), with probability qk . It is not difficult to show that the deviation function of this classifier is given by L−1 D(P) = L−1 j=0 k=0 ∑ Pj ∑ qk D j (wk ) 112 , M INIMAX R EGRET C LASSIFIER 0.7 0.6 Likelihoods 0.5 0.4 0.3 0.2 0.1 λ 0 −2 λ −1 0 λ 0 1 1 2 2 3 4 5 x Figure 2: The conditional data distributions for the one-dimensional example discussed in the text. λ1 and λ2 are the thresholds providing the minimum risk at the minimax deviation probability. λ0 provides the minimax deviation classifier. where D j (wk ) = R j (wk ) − c j j . In order to get a constant deviation function, probabilities q k should be chosen in such a way that L−1 ∑ qk D j (wk ) = D , k=0 where D is a constant. Solving these linear equations for q k , k = 0, . . . , L − 1 (with the constraint ∑k qk = 1), the required probabilities can be found. Note that, in order to build the non-deterministic classifier providing a constant deviation, a set of L independent classifiers that are optimal at the minimax deviation prior should be found. However, we go no further on the investigation of this special case for two main reasons: • The situation does not seem to be common in practice. In our simulations, we have found that the maximum of the minimum risk deviation always provided a response which is approximately parallel to Rbasis . • In general, the abrupt change in the derivative may be a symptom that the classifier structure is not optimal for the data distribution. Instead of building a nondeterministic classifier, increasing the classifier complexity should be more efficient. Although the least favorable prior providing the minimax deviation can be computed in closed form for some simple distributions, in general, it must be computed numerically. Moreover, we assume here that the data distribution is not known, and must be learned from examples. Thus, 113 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 0.25 0.2 λ Error probability 0 0.15 λ λ 1 2 0.1 0.05 0 0 0.2 0.4 0.6 0.8 1 P{ d=1} Figure 3: Error probabilities as a function of prior probability of class 1 for the example in Fig. 2. Thresholds λ1 and λ2 do not provide the minimax deviation classifier, which is obtained for threshold λ0 . However, the random combination of classifiers with thresholds λ 1 and λ2 (dotted line) provides a robust classifier with deviation lower than that of λ 0 . we must incorporate the estimation of the least favorable prior in the learning process. Next, we propose a training algorithm in order to get a minimax regret classifier based on neural networks. 4. Neural Robust Classifiers Under Complete Uncertainty Note that, if QmMd is the probability vector providing the maximum in Eq. (11), that is, QmMd = arg max inf{Dw (P)} w P , then we can write DmMd = inf{Dw (QmMd )} . w Therefore, the minimax deviation classifier can be estimated by training a classifier using prior in QmMd . For this reason, QmMd will be called the minimax deviation prior (or least favorable prior). Our proposed algorithms are based on an iterative process of estimating parameters w based on an estimate of the minimax deviation prior, and re-estimating prior based on an estimate of network weights. This is shown in the following. 114 M INIMAX R EGRET C LASSIFIER 4.1 Updating Network Weights Learning is based on minimizing some empirical estimate of the overall error function L−1 L−1 i=0 E{C(y, d)} = i=0 ∑ P{d = ui }E{C(y, d)|d = ui } = ∑ PiCi , where C(y, d) may be any error function and Ci is the expected conditional error for class-i. Selecting the appropriate error function (see Cid-Sueiro and Figueiras-Vidal, 2001, for instance), learning rules can be designed providing a posteriori probability estimates (y i ≈ P{d = ui |x}, where yi is the soft decision) and, thus, according to Eq. (3), the hard decision minimizing the risk can be approximated by L−1 d = arg min { ∑ ci j y j } . i j=0 The overall empirical error function (cost function) used in learning for priors P = (P0 , . . . , PL−1 ) may be written as L−1 C = ∑ PiCi = L−1 i=0 = = 1 K L−1 i=0 1 K k ∑ d C(yk , dk ), Ki k=1 i Pi K k ∑ d C(yk , dk ) Ki /K k=1 i ∑ i=0 1 K ∑ K k=1 ∑ Pi L−1 ∑ Pi d kC(yk , dk ) (0) i i=0 Pi , , (17) (0) where Pi = Ki /K is an initial estimate of class-i prior based on class frequencies in the training set and Pi is the current prior estimate. Minimizing error function (17) by means of a stochastic gradient descent learning rule leads to update the network weights at k-th iteration as w (k+1) = w (k) (n) L−1 −µ = w(k) − Pi i=0 Pi ∑ L−1 d k ∇ C(yk , dk ) (0) i w ∑ µi (n) k di , ∇wC(yk , dk ) , (18) i=0 where (n) (n) µi = µ Pi (19) (0) Pi (n) is a learning step scaled by the prior ratio. Note that di selects the appropriate µi according to the pattern class membership. The classifier is trained without altering the original training data set (0) class distribution Pi and therefore, without missing or duplicating information. 115 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 4.2 Updating Prior Probabilities Eq. (11) shows that the learning process should maximize (5) with respect to the prior probabilities. The estimate of (5) can be computed as ¯ Dw (P) = Rw (P) − Rbasis (P) , where (20) L−1 ∑ R j Pj (21) 1 L−1 ∑ ci j Ni j N j i=0 (22) Rw (P) = j=0 is the overall Bayes risk estimate and Rj = is the class- j conditional risk estimate where N j is the number of class u j patterns in the training phase and Ni j is the number of samples from class u j assigned to ui . L−1 In order to derive a learning rule to find an estimate Pi satisfying constraints ∑i=0 Pi = 1 and 0 ≤ Pi ≤ 1, we will use auxiliary variables Bi such that Pi = exp(Bi ) L−1 ∑ j=0 exp(B j ) . (23) ¯ We maximize Dw with respect to Bi . Applying the chain rule, ¯ ¯ ∂Dw L−1 ∂Dw ∂Pj =∑ , ∂Bi j=0 ∂Pj ∂Bi and using Eqs. (20), (21) and (23), we get ¯ ∂D w ∂Bi L−1 = ∑ (R j − c j j )Pi (δi j − Pj ), j=0 L−1 L−1 j=0 j=0 = Pi Ri − cii − ∑ (R j Pj ) + ∑ (c j j Pj ) , = Pi Ri − cii − Rw − Rbasis , = Pi Rdi , where Rdi = (Ri − cii ) − (Rw − Rbasis ) . The learning rule for auxiliary variable Bi is (n) Bi (n+1) = Bi + ρ (n) ∂D w , ∂Bi (n) (n) = Bi + ρPi Rdi , 116 (24) M INIMAX R EGRET C LASSIFIER where parameter ρ > 0 controls the rate of convergence. Using Eq. (23) and Eq. (24), the updated learning rule for Pi is (n) (n+1) Pi = (n) (n) (n) exp ρPj Rd j ∑L−1 exp B j j=0 (n) = (n) (n) exp(Bi ) exp ρPi Rdi , (n) (n) Pi exp ρPi Rdi (n) (n) (n) ∑L−1 Pj exp ρPj Rd j j=0 . (25) 4.3 Training Algorithm for a Minimax Deviation Classifier In the previous section, both the network weights updating rule (18) and the prior probability update rule (25) have been derived. The algorithm resulting from the combination is shown as follows: for n = 0 to Niterations − 1 do for k = 1 to K do w(k+1) = w(k) − L−1 ∑ µi (n) k di ∇wC(yk , dk ) i=0 end for (n) Estimate R(n) , Ri , i = 0, . . . , L − 1, according to (21) and (22) (n+1) (n+1) Update minimax probability Pi , i = 0, . . . , L − 1 according to (25) and compute µi with (19) end for 5. Robust Classifiers Under Partial Uncertainty Although in many practical situations prior probabilities may not be specified with precision, they can be partially known. In this section we discuss how partial information about priors can be used to improve the classifier performance in relation to a complete uncertainty situation. From now on, let us consider that lower (or upper) bounds of the priors are known based on previous experience. We will denote the lower and upper bounds of class-i prior probability as Pil and Piu , respectively. In order to illustrate this situation consider a binary classification problem where probability lower bounds P0l and P1l are known. That is, P1 ∈ [P1l , 1 − P0l ] where this interval represents the uncertainty region. Let us denote by Γ = {P : 0 ≤ Pi ≤ 1, ∑L−1 Pi = 1, Pi ≥ Pil } the probability region i=0 satisfying the imposed constraints. In the following, we will refer to Γ as the uncertainty region. Now, the aim is to design a classifier that minimizes the maximum regret from the minimum risk only inside the uncertainty region. This is depicted in Fig. 4(a), which shows that reducing the uncertainty in priors allows to reduce deviation from the optimal classifier. This minimax regret approach for the uncertainty region Γ is often called Γ-minimax regret. As discussed before, the minimax deviation solution gives a Bayes solution with respect some priors denoted in the partial uncertainty case as QΓ mMd in Fig. 4(a), which is the least favorable distribution according to the regret criterion. 117 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I cΓ 00 RΓ basis RΓ 1 cΓ 11 PSfrag replacements Risk 0 0.5 1 1.5 2 2.5 3 3.5 0P1l RB (P) + ψ(P) Minimax Deviation with Restriction Risk RΓ 0 QΓ 1mMd 1 − P0l 1 P1 0P1l 1 − P0l 1 P1 (b) (a) Figure 4: Minimax deviation classifier under partial uncertainty of prior probabilities: (a)Γ-minMaxDev Classifier. (b) Modified cost function defined as R B (P) + ψ(P). In contrast to the minimax regret criterion, note that a classical minimax classifier for the considered uncertainty region would minimize the worst-case risk. It would be a Bayes solution for the prior where the minimum risk reaches its maximum and it could be denoted as Q Γ . mM Notice, also, that these solutions will be the same if the risk for the vertex of Γ take the same value (cΓ = k). ii 5.1 Neural Robust Classifiers Under Partial Uncertainty Minimax search can be formulated as maximizing (with respect to priors) the minimum (with respect to network parameters) of deviation function (5), as described in previous section, but subject to some constraints arg max inf {DΓ (P)} , w w P Pi ≥ Pil , i = 0, . . . , L − 1 s.t. where DΓ = RΓ − RΓ . When uncertainty is global, this hyperplane is defined by the risk in the L w w basis extreme cases with Pi = δik , that is, by the corresponding cii . However, with partial knowledge of the prior probabilities, this hyperplane becomes defined by the risk in L points which are the vertex given by the restrictions and with associated risk denoted by c Γj . j Defining 1 l(Pi ) = , (26) 1 + exp−τ(Pi −Pil ) where τ controls the hardness of this restriction, the minimax problem can be re-formulated as arg max inf {DΓ (P)} w P s.t. w l(Pi ) ≥ 1/2, i = 0, . . . , L − 1. Thus, this constrained optimization problem can be solved as a non-constrained problem by considering an auxiliary function that incorporates the restriction as a barrier function 118 M INIMAX R EGRET C LASSIFIER arg max inf {DΓ (P) + Aψ(P)} , w w P where ψ(Pi ) = log(l(Pi )) and the constant A determines the contribution of the barrier function. Fig. 4(b) shows the new risk function corresponding to the binary case previously depicted in Fig. 4(a). Note that, it is the sum of the original RB (P) and the barrier function ψ(P). As in Section 4.1, in order to derive the network weight learning rule, we need to compute ∂ψ ∂Bi L−1 = ∂ψ ∂P j , j ∂Bi ∑ ∂P j=0 = τPi L−1 ∑ 1 − l(Pk ) (δik − Pk ), k=0 = τPi ψdi , where ψdi = ∑L−1 (1 − l(Pk ))(δik − Pk ) k=0 As τ increases, the constraints become harder around the specified bound. The update learning rule for the auxiliary variable Bi at cycle n is (n+1) Bi (n) Γ(n) (n) (n) = Bi (n) + ρPi Rdi + ρAτPi ψdi . And therefore, using (23), the update learning rule for Pi is (n) (n+1) Pi = (n) Γ(n) Pi exp ρPi Rdi L−1 ∑ (n) Pj exp (n) (n) exp ρAτPi ψdi (n) Γ(n) ρ P j Rd j . (n) (n) ρAτPj ψd j exp j=0 Note that if the upper bound is known instead of the lower bound, l(Pi ) defined by (26) should be replaced by u(Pi ) = (1 + exp(τ(Pi − Piu )))−1 at the previous formulation. The minimax constrained optimization problem has been tackled by considering a new objective function defined by the sum of the original cost function and a barrier function. Studying the convexity of this new function becomes important from the fact that a stationary point of this risk curve is a global maximum. Since the minimum risk curve (RB (P)) is a convex function of the priors (see VanTrees, 1968, for details), if we verify the convexity of the barrier function, we can conclude that the function defined by the sum of both of them is also convex. This barrier function is convex in P if the Hessian matrix HR verifies PT HR P ≤ 0 The Hessian matrix of the barrier function equals to a diagonal matrix D r = diag(r) with all negative diagonal entries ri = Aτ2 (−l(Pi )(1 − l(Pi ))). As l(Pi ) ∈ [0, 1] and therefore, ri ≤ 0, it is straightforward to see that PT HR P = PT Dr P, L−1 = ∑ Pi2 ri ≤ 0 . i=0 Since the barrier function is convex, the new objective function (defined by the sum of two convex functions) is also convex. 119 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 5.2 Extension to Other Learning Algorithms The learning algorithm proposed in this paper is intended to train a minimax deviation classifier based on neural networks with feedforward architecture. Actually, the learning algorithm we propose becomes a feasible solution for any learning process based on minimizing some empirical estimate of an overall cost (error) function. However, it is also applicable to a general classifier provided it is trained (in an iterative process) for the estimated minimax deviation probabilities and the assumed decision costs. Specifically, in this paper, scaling the learning rate allows to simulate different class distributions and the hard decisions are made based on posterior probability estimates and decision costs. Furthermore, the neural learning phase carried out in one iteration can be re-used for the next one, what allows to reduce computational cost with respect to a complete optimization process on each iteration. Apart from the general approach of completely training a classifier on each iteration and in order to reduce its computational cost, specific solutions may be studied for different learning machines. Nonetheless, it seems not feasible to readily achieve this improvement for classifiers like SVMs, where support vectors for one solution may have nothing in common with the ones obtained in next iteration and thus, making necessary to re-train the classifier in each iteration. Another possible solution for any classifier that provides a posteriori probabilities estimates or any score that can be converted into probabilities (for details on calibration methods see Wei et al., 1999; Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005) is outlined here. In this case, an iterative procedure able to estimate the minimax deviation probabilities and consequently to adjust (without re-training) the outputs of the classifier could be studied. The general idea for this approach is as follows: first, the new minimax deviation prior probabilities are estimated according to (25) and then, posterior probabilities provided by the model are adjusted as follows (see Saerens et al., 2002, for more details) (k) Pi P(k) {d = ui |x} = P(k−1) {d = ui |x} (k−1) Pi L−1 P(k) j P(k−1) {d = u j |x} (k−1) j=0 Pj . (27) ∑ The algorithm’s main structure is summarized as for k = 1 to K do (k) Estimate R(k) , Ri , i = 0, . . . , L − 1, according to (21), (22) and decision costs c i j (k+1) Update minimax probability Pi according to (25) Adjust classifier outputs according to (27) end for The effectiveness of this method relies on the accuracy of the initial a posteriori probability estimates. Studying in depth this approach and comparing different minimax deviation classifiers (decision trees, SVMs, RBF networks, feedforward networks and committee machines) together with different probability calibration methods appears as a challenging issue to be explored in future work. 120 M INIMAX R EGRET C LASSIFIER 6. Experimental Results In this section, we first present the neural network architecture used in the experiments and illustrate the proposed minimax deviation strategy on an artificial data set. Then, we apply it to several realworld classification problems. Moreover, a comparison with other proposals such as the traditional minimax and the common re-balancing approach is carried out. 6.1 Softmax-based Network Although our algorithms can be applied to any classifier architecture, we have chosen a neural network based on the softmax non-linearity with soft decisions given by Mi yi = ∑ yi j , j=1 with yi j = exp(wTj x + wi j0 ) i , Mk ∑L−1 ∑l=1 exp(wT x + wkl0 ) k=0 kl where L stands for the number of classes, M j the number of softmax outputs used to compute y j and wi j are weight vectors. We will refer to this network as a Generalized Softmax Perceptron(GSP). 1 A simple network with M j = 2 is used in the experiments. x1 wj,k y1,1 y1,... x2 x3 y1 y1,M1 Class i ... SOFTMAX ... HARD DECISION n inputs / outputs ... yL,1 xd yL,ML yL,... yL Figure 5: GSP(Generalized Softmax Perceptron) Network Fig. 5 corresponds to the neural network architecture used to classify the samples represented by feature vector x. Learning consists of estimating network parameters w by means of the stochastic gradient minimization of certain objective functions. In the experiments, we have considered the Cross Entropy objective function given by L CE(y, d) = − ∑ di log yi . i=1 The stochastic gradient learning rule for the GSP network is given by Eq. (18). Learning step µ(0) decreases according to µ(k) = 1+k/η , where k is the iteration number, µ(0) the initial learning rate and η a decay factor. µ(k) 1. Note that the GSP is similar to a two layer MLP with a single layer of weights and with coupled saturation function (softmax), instead of sigmoidal units. 121 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I The reason to illustrate this approach with a feedforward architecture is that, as mentioned in Section 5.2, it allows to exploit (in the iterative learning process) the partially optimized solution in current iteration for the next one. On the other hand, posterior probability estimation makes it possible to apply the adaptive strategy based on prior re-estimation proposed by Saerens to the minimax deviation classifier, as long as a data set representative of the operation conditions is available. Finally, the fact that intermediate outputs yi j of the GSP can be interpreted as subclass probabilities may provide quite a natural way to cope with the unexplored problem of uncertainty in subclass distributions as already pointed out by Webb and Ting (2005). Nonetheless, both architecture and cost function issues are not the goal of this paper, but merely illustrative tools. 6.2 Artificial Data Set To illustrate the minimax regret approach proposed in this paper both under complete and partial uncertainty, an artificial data set with two classes (class u0 and class u1 ) has been created. Data examples are drawn from the normal distribution p(x|d = ui ) = N(mi , σ2 ) with mean mi and standard i √ deviation σi . Mean values were set to m0 = 0, m1 = 2 and standard deviation to σ0 = σ1 = 2. A total of 4000 instances were generated with prior probabilities of class membership P{d = u 0 } = 0.93 c00 c01 2 5 and P{d = u1 } = 0.07. The cost-benefit matrix is given by . c10 c11 4 0 Initial learning rate was set to µ(0) = 0.3, decay factor to η = 2000 and training was ended after 80 cycles. Classifier assessment was carried out by following 10-fold cross-validation. Two classifiers were trained, to be called a standard classifier and a minMaxDev classifier. The former is built by considering that the estimated class prior information is precise and stationary and the latter is the approach proposed in this paper to cope with uncertainty in priors. Thus, for the standard classifier, its performance may deviate from the optimal risk in 3.39 when priors change from training to test conditions. However, a minimax deviation classifier reduces this worst-case difference from the optimal classifier to 0.77. Now, we suppose that some information about priors is available (partial uncertainty). For instance, we consider that the lower bound for prior probabilities P0 and P1 are known and set to P0l = 0.55 and P1l = 0.05, respectively, so that the uncertainty region is Γ = {(P0 , P1 )|P0 ∈ [0.55, 0.95], P1 ∈ [0.05, 0.45]}. A minimax deviation classifier can be derived for Γ (it will be called Γ-minMaxDev classifier).The narrower Γ is, the closer the minimax deviation classifier performance is to the optimal. For this particular case, under partially imprecise priors, the standard classifier may differ from optimal (in Γ) in 0.83, while the use of the simple minMaxDev classifier designed under total prior uncertainty conditions attains a maximum deviation of 0.53. However, the Γ-minMaxDev classifier only differs from optimal in 0.24. These data are reported in Table 1 where both, experimental and also theoretical results, are shown. 6.3 Real Databases In this section we report experimental results obtained with several publicly available data sets. From the UCI repository (Blake and Merz, 1998) the following benchmarks: German Credits, Australian Credits, Insurance Company, DNA slice-junction, Page-blocks, Dermatology and Pen-digits. 122 M INIMAX R EGRET C LASSIFIER Standard Th/Exp Maximum deviation from optimal (complete uncertainty) Maximum deviation from optimal in Γ (partial uncertainty) Classifier minMaxDev Γ-minMaxDev Th/Exp Th/Exp 3.41/3.39 0.72/0.77 – 0.85/0.83 0.50/0.53 0.19/0.24 Table 1: A comparison between the standard classifier (build under stationary prior assumptions), the minimax deviation classifier (minMaxDev) and the minimax deviation classifier under partial uncertainty (Γ-minMaxDev) for an artificial data set Database German Credits (GCRE) Australian Credits (AUS) Munich Credits (MCRE) Insurance Company (COIL) DNA Slice-junction (DNA) Page-blocks (PAG) Dermatology (DER) Pen-digits (PEN) # Classes 2 2 2 2 3 5 6 10 Class distribution [0.70 0.30] [0.32 0.68] [0.30 0.70] [0.94 0.06] [0.24 0.24 0.52] [0.90 0.06 0.01 0.01 0.02] [0.31 0.16 0.20 0.13 0.14 0.06] [0.104 0.104 0.104 0.096 0.104 0.096 0.096 0.104 0.096 0.096] # Attributes 8 14 20 85 180 10 34 16 # Instances 1000 690 1000 9822 3186 5473 366 10992 Table 2: Experimental Data sets Other public data set used is Munich Credits from the Dept. of Statistics at the University of Munich.2 Data set description is summarized in Table 2, and cost-benefit matrices are shown in Table 3. We have used the cost values that appear in Ikizler (2002) for those data sets in common. Otherwise, for lack of an expert analyst, the cost values have been chosen by hand. 2. Data sets available at http://www.stat.uni-muenchen.de/service/datenarchiv/welcome e.html. Insurance Company 0 1 German, Australian, Munich Credits −1 0 0 −17 Page-Blocks −1 2 2 2 2 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 −4 2 3 4 3 4 3 −3 3 5 1 5 5 0 Dermatology 3 2 3 2 −8 4 5 −10 4 3 5 4 2 1 4 5 −6 5 2 3 5 2 3 −10 −1 2 2 DNA 2 −1 2 Pendigits ci j = 0 1 Table 3: Cost-Benefit matrices for the experimental Data sets 123 3 3 0 if i = j Otherwise A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Standard Maximum Risk Deviation from the optimal classifier Re-balanced Minimax Deviation Minimax minMaxDev minMax GCRE 0.70 0.80 (0.55 0.60) 0.99 ACRE 1.00 1.00 (0.76 0.86) 1.00 MCRE 0.91 0.77 (0.54 0.59) 0.99 COIL 2.78 0.99 (0.87 0.92) 16.32 DNA 0.34 0.53 (0.30 0.27 0.25) PAG 0.62 0.26 (0.13 0.13 0.20 0.16 0.16) DER 1.03 1.28 (0.67 0.78 0.51 0.48 0.54 PEN 0.061 0.059 (0.024 0.028 0.025 0.019 1.14 0.023 0.021 0.026 0.022 0.86 0.60) 0.023 0.029) 7.62 0.029 Table 4: Classifier Performance evaluated as Maximum Risk Deviation from the optimal classifier for several real-world applications. Class-conditional risk deviations (R i − cii ) reported for the minMaxDev classifier. Experimental results for these data sets are shown in the following sections. The robustness of different decision machines under complete uncertainty of prior probabilities is analyzed in Section 6.3.1. If uncertainty is only partial, a similar study and comparison with the previous approach (complete uncertainty) is carried out in Section 6.3.2. 6.3.1 C LASSIFIER ROBUSTNESS U NDER C OMPLETE U NCERTAINTY We now study how different neural-based classifiers cope with worst-case situations in prior probabilities. The maximum deviation from the optimal classifier (see Table 4) is reported for the proposed minMaxDev strategy as well as for other alternative approaches: the one based on the assumption of stationary priors (standard) and the common alternative of deriving the classifier from an equally distributed data set (re-balanced). A comparison with the traditional minimax strategy is also provided. Together with the previously mentioned value (maximum deviation or regret), deviation for the L class-conditional extreme cases (Ri − cii ) is also reported for the minMaxDev classifier in Table 4. Results allow to verify that this solution is fairly close to the optimal one where deviation is not dependent on priors and thus, class-conditional deviations take the same value. Although the balanced class distribution to train the classifier can be obtained by means of undersampling and/or oversampling, it is simulated by altering the learning rate used in the training 1/L phase according to (19) as µi = µ (0) , where 1/L represents the simulated probability, equal for Pi all classes. Results evidence that the assumption of stationary priors may lead to significant deviations from the optimal decision rule under “unexpected”, but rather realistic, prior changes. This deviation may reach up to three times more than the robust minimax deviation strategy. Thus, for classification problems like Page-blocks the maximum deviation from the optimal classifier is 0.62 for the 124 M INIMAX R EGRET C LASSIFIER Standard Maximum Risk Re-balanced Minimax Deviation minMaxDev Minimax minMax GCRE 0.70 0.15 0.60 0.00 ACRE 0.01 0.02 0.86 -0.00 MCRE 0.05 0.20 0.59 0.00 COIL 0.76 0.99 0.86 0.02 DNA 0.34 0.53 0.25 0.13 PAG 0.62 0.26 0.20 0.10 DER -2.10 -1.68 -2.21 -2.38 PEN 0.061 0.059 0.029 0.029 Table 5: Classifier Performance measured as Maximum Risk for several real-world applications. standard classifier while this reduces to 0.20 for the minMaxDev one. Likewise, for the Insurance company(COIL) application the maximum deviation for the standard classifier is 2.78 compared with 0.92 for the minMaxDev model. The remaining databases also show the same behavior as it is presented in Table 4. On the other hand, the use of a classifier inferred from a re-balanced data set does not necessarily involve a decrease in the maximum deviation with respect to the standard classifier. In the same way, the traditional minimax classifier does not protect against prior changes in terms of maximum relative deviation from the minimum risk classifier. However, if our criterion is more conservative and our aim is the minimization of the maximum possible risk (not the minimization of the deviation), the traditional minimax classifier represents the best option. It is shown in Table 5 where the maximum risk for the different classifiers is reported. Positive values in this table indicate a cost while negative values represent a benefit. For instance, for the Page-blocks application the minimax classifier assures a maximum risk of 0.10 while the standard, re-balanced and minMaxDev classifiers reach values of 0.62, 0.26 and 0.20, respectively. It can be noticed that for the Pen-digits data set, the minimax deviation and minimax approaches attain the same results. The reason is that, for this problem, the R basis plane takes the same value (in this case, zero) in the probability space. 6.3.2 C LASSIFIER ROBUSTNESS UNDER PARTIAL U NCERTAINTY Unlike the previous section, we consider now that partial information about the class priors is available. The aim is to find a classifier that behaves well for a delimited and realistic range of priors what constitutes an aid in reducing the maximum deviation from the optimal classifier. This situation can be treated as a constrained minimax regret strategy where the constraints represent any extra information about prior probability value. Experimental results for several situations of partial prior uncertainty are presented in this section. We consider that lower bounds for the prior probabilities are available (see Table 6). In order to get the Γ-minMaxDev classifier, the risk for the different vertex of the uncertainty domain needs to be calculated. With them, the basis risk RΓ over which deviations are measured is derived. basis 125 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Lower bound for prior probabilities Data Set P0l P1l GCRE 0.40 0.25 ACRE 0.20 0.25 MCRE 0.20 0.25 COIL 0.15 P2l P3l P4l P5l 0.03 DNA 0.10 0.10 0.22 0.02 0.00 0.01 0.1 0.20 0.10 0.10 0.10 0.10 0.06 0.06 0.10 0.10 0.06 P9l 0.06 0.10 0.05 0.05 0.02 PEN P8l 0.02 DER P7l 0.25 PAG P6l Table 6: Lower bounds for prior probabilities defining the uncertainty region, Γ region for the experimental data sets. Maximum Risk Deviation in the uncertainty region Standard Minimax Deviation Minimax Deviation with restriction minMaxDev Γ-minMaxDev GCRE 0.24 0.19 (0.10 0.09) ACRE 0.03 0.64 (0.03 0.03) MCRE 0.22 0.38 (0.13 0.10) COIL 2.33 0.77 (0.17 0.11) DNA 0.14 0.08 (0.07 0.07 0.06) PAG 0.37 0.15 (0.10 0.08 0.08 0.05 0.04) DER 0.08 0.05 (0.03 0.03 0.04 0.02 0.05 PEN 0.013 0.007 (0.003 0.001 0.003 0.000 0.001 0.001 0.000 0.003 0.05) 0.001 0.001) Table 7: Classifier Performance under partial knowledge of prior probabilities measured as Maximum Risk Deviation for several real-world applications. Class-conditional risk deviations (RΓ − cΓ ) are reported for the Γ-minMaxDev classifier. i ii Maximum deviation from the optimal in Γ is reported for the Γ-minMaxDev classifier together with the standard and the minMaxDev ones. For instance, the standard classifier for the Pageblocks data set deviates from the optimal classifier, in the defined uncertainty region, up to 0.37, while when complete uncertainty is assumed the maximum deviation is equal to 0.62. In the same way, reducing the uncertainty also means a reduction in the maximum deviation for minMaxDev classifier (trained without considering this partial knowledge). Thus, for Γ, this classifier assures a deviation bound of 0.15. However, taking into account this partial information to train a Γ-minMaxDev classifier allows to reduce the deviation for the worst-case conditions to 0.10. It can be seen the same behavior for the other databases in Table 7. 126 M INIMAX R EGRET C LASSIFIER 7. Conclusions This work concerns the design of robust neural-based classifiers when the prior probabilities of the classes are partially or completely unknown, even by the end user. This problem of uncertainty in the class priors is often ignored in supervised classification, even though it is a widespread situation in real world applications. As a result, the reliability of the inducted classifier can be greatly affected as previously shown by the experiments. To tackle this problem, we have proposed a novel minimax deviation strategy with the goal to minimize the maximum deviation with respect to the optimal classifier. A neural network training algorithm based on learning rate scaling has been developed. The experimental results show that this minimax deviation (minMaxDev) classifier protects against prior changes while other approaches like ignoring this uncertainty or use a balanced learning data set may result in large differences in performance with respect to the minimum risk classifier. Also, it has been shown that the conventional minimax classifier reduces the maximum possible risk following a conservative attitude but at the expense of large worst-case differences from the optimal classifier. Furthermore, a constrained minimax deviation approach (Γ-minMaxDev) has been derived for those situations where uncertainty is only partial. This may be seen as a general approach with some particular cases: a) precise knowledge of prior probabilities and b) complete uncertainty about the priors. In a) the region of uncertainty collapses to a point and we have the Bayes’ rule of minimum risk and in b) the pure minimax deviation strategy comes up. While the first one may be criticized for being quite unrealistic, the other may be seen rather pessimistic. The experimental results for this proposed intermediate situation show that the Γ-minMaxDev classifier allows to reduce the maximum deviation from the optimal and performs well over a range of prior probabilities. Acknowledgments The authors thank the four referees and the associate editor for their helpful comments. This work was partially supported by the project TEC2005-06766-C03-02 from the Spanish Ministry of Education and Science. References N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11, 2004. N. M. Adams and D. J. Hand. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32(7):1139–1147, March 1998. R. Alaiz-Rodriguez, A. Guerrero-Curieses, and J. Cid-Sueiro. Minimax classifiers based on neural networks. Pattern Recognition, 38(1):29–39, January 2005. R. Barandela, J. S. Sanchez, V. Garc´a, and E. Rangel. Strategies for learning in class imbalance ı problems. Pattern Recognition, 36(3):849–851, March 2003. J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, second edition, 1985. 127 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/ mlearn/MLRepository.html. URL L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, NY, 1984. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. J. Cid-Sueiro and A. R. Figueiras-Vidal. On the structure of strict sense Bayesian cost functions and its applications. IEEE Transactions on Neural Networks, 12(3):445–455, May 2001. C. Drummond and R. C. Holte. Explicitly representing expected cost: An alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 198–207. ACM Press, 2000. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2001. Y. C. Eldar and N. Merhav. Minimax approach to robust estimation of random parameters. IEEE Trans. on Signal Processing, 52(7):1931–1946, July 2004. Y. C. Eldar, A. Ben-Tal, and A. Nemirovski. Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. IEEE Trans. on Signal Processing, 52(8):2177– 2188, August 2004. M. Feder and N. Merhav. Universal composite hypothesis testing: A competitive minimax approach. IEEE Trans. on Information Theory, 48(6):1504–1517, June 2002. A. Guerrero-Curieses, R. Alaiz-Rodriguez, and J. Cid-Sueiro. A fixed-point algorithm to minimax learning with neural networks. IEEE Transactions on Systems, Man and Cybernetics Part C, 34 (4):383–392, November 2004. ¨ H. A. G¨ venir, N. Emeksiz, N. Ikizler, and N. Ormeci. Diagnosis of gastric carcinoma by classifiu cation on feature projections. Artificial Intelligence in Medicine, 31(3), 2004. N. Ikizler. Benefit maximizing classification using feature intervals. Technical Report BU-CE-0208, Bilkent University, Ankara, Turkey, 2002. N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis Journal, 6(5):429–450, November 2002. M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of Fifth International Conference on SIG Knowledge Discovery and Data Mining (SIGKDD), pages 367–371, San Diego, CA, 1999. H. J. Kim. On a constrained optimal rule for classification with unknown prior individual group membership. Journal of Multivariate Analysis, 59(2):166–186, November 1996. M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings 14th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997. 128 M INIMAX R EGRET C LASSIFIER M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2/3):195–215, 1998. S. Lawrence, I. Burns, A. D. Back, A. C. Tsoi, and C. L. Giles. Neural network classification and ¨ unequal prior class probabilities. In G. Orr, K.-R. Muller, and R. Caruana, editors, Tricks of the Trade, Lecture Notes in Computer Science State-of-the-Art Surveys, pages 299–314. Springer Verlag, 1998. T. K. Moon and W. C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, 2000. A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In ICML ’05: Proceedings of the 22nd International Conference on Machine learning, pages 625– 632, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-180-5. E. Polak. Optimization: Algorithms and Consistent Approximations. Springer, 1997. F. Provost. Learning with imbalanced data sets 101. In Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets. AAAI Press. Technical Report WS-00-05, 2000. F. Provost and T. Fawcett. Robust classification systems for imprecise environments. Machine Learning, 42(3):203–231, March 2001. M. Saerens, P. Latinne, and C. Decaestecker. Adjusting a classifier for new a priori probabilities: A simple procedure. Neural Computation, 14:21–41, January 2002. E. Takimoto and M. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings 13th Annual Conference on Computational Learning Theory, pages 100–106. Morgan Kaufmann, San Francisco, 2000. K. M. Ting. A study of the effect of class distribution using cost-sensitive learning. In Proceedings of the Fifth International Conference on Discovery Science, pages 98–112. Berlin: Springer-Verlag, 2002. H. L. VanTrees. Detection, Estimation and Modulation Theory. John Wiley and Sons, 1968. G. I. Webb and K. M. Ting. On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58(1):25–32, 2005. W. Wei, T. K. Leen, and E. Barnard. A fast histogram-based postprocessor that improves posterior probability estimates. Neural Computation, 11(5):1235 – 1248, July 1999. B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204–213. ACM Press, 2001. B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Eighth International Conference on Knowledge Discovery and Data Mining, 2002. 129 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the third IEEE International Conference on Data Mining, pages 435–442, 2003. Z. H. Zhou and X. Y. LiuJ. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, January 2006. 130
2 0.60268629 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
Author: Avrim Blum, Yishay Mansour
Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions
3 0.30476147 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
4 0.28241545 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
Author: Sanjoy Dasgupta, Leonard Schulman
Abstract: We show that, given data from a mixture of k well-separated spherical Gaussians in R d , a simple two-round variant of EM will, with high probability, learn the parameters of the Gaussians to nearoptimal precision, if the dimension is high (d ln k). We relate this to previous theoretical and empirical work on the EM algorithm. Keywords: expectation maximization, mixtures of Gaussians, clustering, unsupervised learning, probabilistic analysis
5 0.27794191 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
Author: François Laviolette, Mario Marchand
Abstract: We propose a PAC-Bayes theorem for the sample-compression setting where each classifier is described by a compression subset of the training data and a message string of additional information. This setting, which is the appropriate one to describe many learning algorithms, strictly generalizes the usual data-independent setting where classifiers are represented only by data-independent message strings (or parameters taken from a continuous set). The proposed PAC-Bayes theorem for the sample-compression setting reduces to the PAC-Bayes theorem of Seeger (2002) and Langford (2005) when the compression subset of each classifier vanishes. For posteriors having all their weights on a single sample-compressed classifier, the general risk bound reduces to a bound similar to the tight sample-compression bound proposed in Laviolette et al. (2005). Finally, we extend our results to the case where each sample-compressed classifier of a data-dependent ensemble may abstain of predicting a class label. Keywords: PAC-Bayes, risk bounds, sample-compression, set covering machines, decision list machines
6 0.24312544 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
7 0.23542055 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
8 0.22005911 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
9 0.20783827 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
10 0.19045691 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
11 0.17631021 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
12 0.17457175 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction
14 0.16096631 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
15 0.15538277 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
16 0.14951439 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
17 0.14858176 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
18 0.14590544 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
19 0.14380828 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
20 0.14283861 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
topicId topicWeight
[(4, 0.013), (8, 0.04), (10, 0.032), (12, 0.031), (15, 0.014), (28, 0.061), (40, 0.034), (45, 0.014), (48, 0.03), (51, 0.431), (60, 0.042), (77, 0.011), (80, 0.014), (85, 0.056), (98, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.68353552 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
Author: Rocío Alaiz-Rodríguez, Alicia Guerrero-Curieses, Jesús Cid-Sueiro
Abstract: The design of a minimum risk classifier based on data usually stems from the stationarity assumption that the conditions during training and test are the same: the misclassification costs assumed during training must be in agreement with real costs, and the same statistical process must have generated both training and test data. Unfortunately, in real world applications, these assumptions may not hold. This paper deals with the problem of training a classifier when prior probabilities cannot be reliably induced from training data. Some strategies based on optimizing the worst possible case (conventional minimax) have been proposed previously in the literature, but they may achieve a robust classification at the expense of a severe performance degradation. In this paper we propose a minimax regret (minimax deviation) approach, that seeks to minimize the maximum deviation from the performance of the optimal risk classifier. A neural-based minimax regret classifier for general multi-class decision problems is presented. Experimental results show its robustness and the advantages in relation to other approaches. Keywords: classification, imprecise class distribution, minimax regret, minimax deviation, neural networks 1. Introduction - Problem Motivation In the general framework of learning from examples and specifically when dealing with uncertainty, the robustness of the decision machine becomes a key issue. Most machine learning algorithms are based on the assumption that the classifier will use data drawn from the same distribution as the training data set. Unfortunately, for most practical applications (such as remote sensing, direct marketing, fraud detection, information filtering, medical diagnosis or intrusion detection) the target class distribution may not be accurately known during learning: for example, because the cost of labelling data may be class-dependent or the prior probabilities are non-stationary. Therefore, the data used to design the classifier (within the Bayesian context (see VanTrees, 1968), the c 2007 Roc´o Alaiz-Rodr´guez, Alicia Guerrero-Curieses and Jesus Cid-Sueiro. ´ ı ı A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I prior probabilities and the misclassification costs) may be non representative of the underlying real distributions. If the ratio of training data corresponding to each class is not in agreement with real class distributions, designing Bayes decision rules based on prior probabilities estimated from these data will be suboptimal and can seriously affect the reliability and performance of the classifier. A similar problem may arise if real misclassification costs are unknown during training. However, they are usually known by the end user, who can adapt the classifier decision rules to cost changes without re-training the classifier. For this reason, our attention in this paper is mainly focused on the problem of uncertainty in prior probabilities. Furthermore, being aware that class distribution is seldom known (at least totally) in real world applications, a robust approach (as opposite to adaptive) that prevents severe performance degradation appears to be convenient for these situations. Besides other adaptive and robust approaches that address this problem (discussed in more detail in Section 2.2) it is important to highlight those that handle the problem of uncertainty in priors by following a robust minimax principle: minimize the maximum possible risk. Analytic foundations of minimax classification are widely considered in the literature (see VanTrees, 1968; Moon and Stirling, 2000; Duda et al., 2001, for instance) and a few algorithms to carry out minimax decisions have been proposed. From computationally expensive ones such as estimating probability density functions (Takimoto and Warmuth, 2000; Kim, 1996) or using methods from optimization (Polak, 1997) to simpler ones like neural network training algorithms (Guerrero-Curieses et al., 2004; AlaizRodriguez et al., 2005). Minimax classifiers may, however, be seen as over-conservative since its goal is to optimize the performance under the least favorable conditions. Consider, for instance, a direct marketing campaign application carried out in order to maximize profits. Since optimal decisions rely on the proportion of potential buyers and it is usually unknown in advance, our classification system should take into account this uncertainty. Nevertheless, following a pure minimax strategy can lead to solutions where minimizing the maximum loss implies considering there are no potential clients. If it is the case, this minimax approach does not seem to be suitable for this kind of situation. In this imprecise class distribution scenario, it can be noticed that the classifier performance may be highly deviated from the optimal, that is, that of the classifier knowing actual priors. Minimizing this gap (that is, the maximum possible deviation with respect to the optimal classifier) is the focus of this paper. We seek for a system as robust as the conventional minimax approach but less pessimistic at the same time. We will refer to it as a minimax deviation (or minimax regret) classifier. In contrast to other robust and adaptive approaches, it can be used in general multiclass problems. Furthermore, as shown in Guerrero-Curieses et al. (2004), minimax approaches can be used in combination with the adaptive proposal by Saerens et al. (2002) to exploit its advantages. This minimax regret approach has recently been applied in the context of parameter estimation (Eldar et al., 2004; Eldar and Merhav, 2004) and a similar competitive strategy has been used in the context of hypothesis testing (Feder and Merhav, 2002). Under prior uncertainty, our solution provides an upper bound of the performance divergence from the optimal classifier. We propose a simple learning rate scaling algorithm in order to train a neural-based minimax deviation classifier. Although training can be based on minimizing any objective function, we have chosen objective functions that provide estimates of the posterior probabilities (see Cid-Sueiro and Figueiras-Vidal, 2001, for more details). 104 M INIMAX R EGRET C LASSIFIER This paper is organized as follows: the next section provides an overview of the problem as well as some previous approaches to cope with it. Next, Section 3 states the fundamentals of minimax classification together with a deeper analysis of the minimax regret approach proposed in this paper. Section 4 presents a neural training algorithm to get a neural-based minimax regret classifier under complete uncertainty. Moreover, practical situations with partial uncertainty in priors are also discussed. A learning algorithm to solve them is provided in Section 5. In Section 6, some experimental results show that minimax regret classifiers outperform (in terms of maximum risk deviation) classifiers trained on re-balanced data sets and those with the originally assumed priors. Finally, the main conclusions are summarized in Section 7. 2. Problem Overview Traditionally, supervised learning lies in the fact that training data and real data come from the same (although unknown) statistical model. In order to carefully analyze to what extend classifier performance depends on conditions such as class distribution or decision costs, learning and decision theory principles are briefly revisited. Next, some previous approaches to deal with environment imprecision are reviewed. 2.1 Learning and Making Optimal Decisions Let S = {(xk , dk ), k = 1, . . . , K} denote a set of labelled samples where xk ∈ RN is an observation feature vector and dk ∈ UL = {u0 , . . . , uL−1 } is the label vector. Class-i label ui is a unit L-dimensional vector with components ui, j = δi j , with every component equal to 0, except the i-th component which is equal to 1. We assume a learning process that estimates parameters w of a non-linear mapping f w : RN → P from the input space into probability space P = {p ∈ [0, 1]L | ∑L−1 pi = 1}. The soft decision is given i=0 by yk = fw (xk ) ∈ P and the hard output of the classifier is denoted by d. Note that d and d will be used to distinguish the actual class from the predicted one, respectively. Several costs (or benefits) associated with each possible decision are also defined: c i j denotes the cost of deciding in favor of class i when the true class is j. Negative values represent benefits (for instance, cii , which is the cost of correctly classifying a sample from class i could be negative in some practical cases). In general cost-sensitive classification problems, either misclassification costs c i j or cii costs can take different values for each class. Thus, there are many applications where classification errors lead to very different consequences (medical diagnosis, fault detection, credit risk analysis), what implies misclassification costs ci j that may largely vary between them. In the same way, there are also many domains where correct decision costs (or benefits) c ii do not take the same value. For instance, in targeted marketing applications (Zadrozny and Elkan, 2001), correctly identifying a buyer implies some benefit while correctly classifying a non buyer means no income. The same ¨ applies to medical diagnosis domains such as the gastric carcinoma problem studied in G uvenir et al. (2004). In this case, the benefit of correct classification also depends on the class: the benefit of correctly classifying an early stage tumor is higher than that of a later stage. The expected risk (or loss) R is given by L−1 L−1 R = ∑ ∑ ci j P{d = ui |d = u j }Pj j=0 i=0 105 , (1) A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I where P{d = ui |d = u j } with i = j represent conditional error probabilities, and P j = P{d = u j } is the prior probability of class u j . Defining the conditional risk of misclassifying samples from class u j as L−1 Rj = ∑ ci j P{d = ui |d = u j } , i=0 we can express risk (1) as L−1 R= ∑ Ri Pi . (2) i=0 It is well-known that the Bayes decision rule for the minimum risk is given by L−1 d = arg min { ∑ ci j P{d = u j |x}} , ui (3) j=0 where P{d = ui |x} is the a posteriori probability of class i given sample x. The optimal decision rule depends on posterior probabilities and therefore, on the prior probabilities and the likelihood. In theory, as long as posterior probabilities (or likelihood and prior probabilities) are known, the optimal decision in Eq. (3) can be expressed after a trivial manipulation as a function of the cost differences between the costs (ci j − c j j ) (Duda et al., 2001). This is the reason why c j j is usually assumed to be zero and the value of the cost difference is directly assigned to c i j . When dealing ¨ with practical applications, however, some authors (Zadrozny and Elkan, 2001; G uvenir et al., 2004) have urged to use meaningful decision costs measured over a common baseline (and not necessarily taking c j j = 0) in order to avoid mistakes that otherwise could be overlooked. For this reason and, what is more important, the uncertainty class distribution problem addressed in this paper, decision costs measured over a common baseline are considered. Furthermore, absolute values of decision costs are relevant to the design of classifiers under the minimax regret approach. 2.2 Related Work: Dealing with Cost and Prior Uncertainty Most proposals to address uncertainty in priors fall into the categories of adaptive and robust solutions. While the aim of a robust solution is to avoid a classifier with very poor performance under any conditions, an adaptive system pursues to fit the classifier parameters using more incoming data or more precise information. With an adaptive-oriented principle, Provost (2000) states that, once the classifier is trained under specific class distributions and cost assumptions (not necessarily the operating conditions), the selection of the optimal classifier for specific conditions is carried out by a correct placement of the decision thresholds. In the same way, the approaches in Kelly et al. (1999) and Kubat et al. (1998) consider that tuning the classifier parameters should be left to the end user, expecting that class distributions and misclassification costs will be precisely known then. Some graphical methods based on the ROC curve have been proposed in Adams and Hand (1998) and Provost and Fawcett (2001) in order to compare the classifier performance under imprecise class distributions and/or misclassification costs. The ROC convex hull method presented in Provost and Fawcett (2001) (or the alternative representation proposed in Drummond and Holte (2000)) allows the user to select potentially optimal classifiers, providing a flexible way to select 106 M INIMAX R EGRET C LASSIFIER them when precise information about priors or costs is available. Under imprecision, some classifiers can be discarded but this does not necessarily provide a method to select the optimal classifier between the possible ones and fit its parameters. Furthermore, due to its graphical character, these methods are limited to binary classification problems. Changes in prior probabilities have also been discussed by Saerens et al. (2002), who proposes a method based on re-estimating the prior probabilities of real data in an unsupervised way and subsequently adjusting the outputs of the classifier according to the new a priori probabilities. Obviously, the method requires enough unlabelled data being available for re-estimation. As an alternative to adaptive schemes, several robust solutions have been proposed, as the resampling methods, especially in domains where imbalanced classes come out (Kubat and Matwin, 1997; Lawrence et al., 1998; Chawla et al., 2002; Barandela et al., 2003). Either by undersampling or oversampling, the common purpose is to balance artificially the training data set in order to get a uniform class distribution, which is supposed to be the least biased towards any class and, thus, the most robust against changes in class distributions. The same approach is followed in cost sensitive domains, but with some subtle differences in practice. It is well known that class priors and decision costs are intrinsically related. For instance, different decision costs can be simulated by altering the priors and vice versa (see Ting, 2002, for instance). Thus, when a uniform distribution is desired in a cost sensitive domain, but working with cost insensitive decision machines, class priors are altered according to decision costs, what is commonly referred as rebalancing. The manipulation of the training data distribution has been applied to cost-sensitive learning in two-class problems (Breiman et al., 1984) in the following way: basically, the class with higher misclassification cost (suppose n times the lowest misclassification cost) is represented with n times more examples than the other class. Besides random sampling strategies, other sampling-based rebalancing schemes have been proposed to accomplish this task, like those considering closeness to the boundaries between classes (Japkowicz and Stephen, 2002; Zhou and LiuJ, 2006) or the costproportionate rejection sampling presented in Zadrozny et al. (2003). Extending the formulation of this type of procedures to general multiclass problems with multiple (and possibly asymmetric) inter-class misclassification costs appears to be a nontrivial task (Zadrozny et al., 2003; Zhou and LiuJ, 2006), but some progress has been made recently with regard to this latter point (Abe et al., 2004). Note, also, that many (although not all) of these rebalancing strategies are usually implemented by oversampling and/or subsampling, that is, replicating examples (without adding any extra information) and/or deleting them (which implies information loss). 3. Robust Classifiers Under Prior Uncertainty: Minimax Classifiers Prior probability uncertainty can be coped from a robust point of view following a minimax derived strategy. Minimax regret criterion is discussed in this section after presenting the conventional minimax criterion. Although our approach extends to general multi-class problems and the discussion is carried out in that way, we will first illustrate, for the sake of clarity and simplicity, a binary situation. 3.1 Minimax Classifiers As Eq. (3) shows, the minimum risk decisions depend on the misclassification costs, c i j , and the posterior class probabilities and, thus, they depend on the prior probabilities, Pi . Different prior 107 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I PSfrag replacements distributions (frequency for each class) give rise to different Bayes classifiers. Fig. 1 shows the Bayes risk curve, RB (P1 ) versus class-1 prior probability for a binary classification problem. Standard RF (Q1 , P1 ) R0 Minimax RF (Q1mM , P1 ) Risk c00 Minimax Deviation RF (Q1mMd , P1 ) Rbasis R1 RB (P1 ) c11 0 Q1 Q1mM 1 Q1mMd P1 Figure 1: Risk vs. P1 . Minimum risk curve and performance under prior changes for the standard, minimax and minimax deviation classifier. RB (P1 ) stands for the optimal Bayes Risk against P1 . RF (Q1 , P1 ) denotes the Risk of a standard classifier (Fixed decision rule optimized for prior probabilities Q1 estimated in the training phase) against P1 . RF (Q1mM , P1 ) denotes the Risk of a minimax classifier (Fixed decision rule optimized for the minimax probabilities Q1mM ) against P1 . RF (Q1mMd , P1 ) denotes the Risk of a minimax deviation classifier (Fixed decision rule optimized for the minimax deviation probabilities Q 1mMd ) against P1 . If the prior probability distribution is unknown when the classifier is designed, or this distribution changes with time or from one environment to other, the mismatch between training and test conditions can degrade significantly the classifier performance. For instance, assume that Q = (Q0 , Q1 ) is the vector with class-0 and class-1 prior probabilities estimated in the training phase, respectively, and let RB (Q1 ) represent the minimum (Bayes) risk attainable by any decision rule for these priors. Note, that, according to Eq. (2), for a given classifier, the risk is a linear function of priors. Thus, risk RF (Q1 , P1 ) associated to the (fixed) classifier optimized for Q changes linearly with actual prior probabilities P1 and P0 = 1 − P1 , going from (0, R0 ) to (1, R1 ) (the continuous line in Fig. 1), where R0 and R1 refer to the class conditional risks for classes 0 and 1, respectively. Fig. 1 shows the impact of this change in priors and how performance deviates from optimal. Also, it can be shown (see VanTrees, 1968, for instance) that the minimum risk curve obtained for each prior is convex and the risk function of a given classifier verifies R F (Q1 , P1 ) ≥ RB (P1 ) with a tangent point at P1 = Q1 . 108 M INIMAX R EGRET C LASSIFIER The dashed line in Fig. 1 shows the performance of the minimax classifier, which minimizes the maximum possible risk under the least favorable priors, thus providing the most robust solution, in the sense that performance becomes independent from priors. From Fig. 1, it becomes clear that the minimax classifier is optimal for prior probabilities P = QmM = (Q0mM , Q1mM ) maximizing RB . Thus, this strategy is equivalent to maximizing the minimum risk (Moon and Stirling, 2000; Duda et al., 2001). We will refer to them as the minimax probabilities. Fig. 1 also makes clear that although a minimax classifier is a robust solution to address the imprecision in priors, it may become a somewhat pessimistic approach. 3.2 Minimax Deviation Classifiers We propose an alternative classifier that, instead of minimizing the maximum risk, minimizes the maximum deviation (regret) from the optimal Bayes classifier. In the following, we will refer to it as the minimax deviation or minimax regret classifier. A comparison between minimax and minimax deviation approaches is also shown in Fig. 1. This latter case corresponds to a classifier trained on prior probabilities P = Q mMd with performance as a function of priors given by a line (a plane or hyperplane for three or more classes, respectively) parallel to what we name, in the following, basis risk (Rbasis = c00 (1 − P1 ) + c11 P1 ). Note that the maximum deviation (with respect to priors) of the classifier optimized for Q is given by D(Q) = max {RF (Q1 , P1 ) − RB (P1 )} = max {R0 − c00 , R1 − c11 } . P1 The inspection of Fig. 1 shows that the minimum of D (with respect to Q) is achieved when R0 − c00 = R1 − c11 , which means that line RF (Q1 , P1 ) is parallel to arc named Rbasis in the figure and tangent to RB at Q1mMd . Therefore, the minimax regret classifier is also the Bayes solution with respect to the least favorable priors (Q0mMd , Q1mMd ) (see Berger, 1985, for instance), which will be denoted as minimax deviation probabilities. Now, we extend the formulation to a general L-class problem. Definition 1 Consider a L-class decision problem with costs ci j , 0 ≤ i, j < L and c j j ≤ ci j , and let Rw (P) be the risk of a decision machine with parameter vector w when prior class probabilities are given by P = (P0 , . . . , PL−1 ). The deviation function is defined as Dw (P) = Rw (P) − RB (P) and the minimax deviation is defined as DmMd = inf max{Dw (P)} . w P (4) Note that the above definition assumes that the maximum exists. This is actually the case, since Dw (P) is a linear function over a compact set, P . Note, also, that our definition includes the natural assumption that c j j is never higher than ci j , meaning that making a decision error is always less costly than taking the correct decision. This assumption is used in part of our theoretical analysis. The algorithms proposed in this paper are based on the fact that the minimax deviation can be computed without knowing RB 109 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Theorem 2 The minimax deviation is given by DmMd = inf max{Dw (P)} , w P where Dw (P) = Rw (P) − Rbasis (P) and (5) L−1 ∑ c j j Pj Rbasis (P) = . (6) j=0 Proof Note that, according to Eqs. (1) and (2), for any decision machine and any u i ∈ UL , L−1 R(u j ) = R j = ∑ ci j P{d = ui |d = u j } ≥ c j j . i=0 Since the bound is reached by the classifier deciding d = u j for any observation x, we have RB (u j ) = c j j . Therefore, using Eq. (6), we find that, for any u ∈ UL , RB (u) = Rbasis (u) and, thus, Dw (u) = Dw (u) . Since Bayes minimum risk RB (P) is a convex function of priors and Rw (P) is linear, Dw (P) is concave and, thus, it is maximum at some of the vertices in P (i.e., at some P = u ∈ U L ). Thus, max{Dw (P)} = max {Dw (u)} . u∈UL P (7) Since the maximum difference between two hyperplanes defined over P is always at some vertex, we can conclude that max{Dw (P)} = max {Dw (u)} = max {Dw (u)} . P u∈UL u∈UL (8) Combining Eqs. (4), (7) and (8), we get DmMd = inf max{Dw (P)} . w P Note that Rbasis represents the risk baseline of the ideal classifier with zero errors. Th. 2 shows that the minimax regret can be computed as the minimax deviation to this ideal classifier. Note, also, that if costs cii do not depend on i, Eq. (5) becomes equivalent (up to a constant) to the Bayes risk and the minimax regret classifier becomes equivalent to the minimax classifier . Another important result for the algorithms proposed in this paper is that, under some conditions on the minimum risk, the minimum and maximum operators can be permuted. Although general results on the permutability of minimum and maximum operators can be found in the literature (see Polak, 1997, for instance), we provide here the proof for the specific case interesting to this paper. 110 M INIMAX R EGRET C LASSIFIER Theorem 3 Consider the minimum deviation function given by Dmin (P) = inf{Dw (P)} , (9) w where Dw (P) is the normalized deviation function given by Eq. (5), and let P ∗ be the prior probability vector providing the maximum deviation, P∗ = arg max Dmin (P) P . (10) If Dmin (P) is continuously differentiable at P = P∗ , then the minimax deviation, DmMd , defined by Eq. (4), is DmMd = Dmin (P∗ ) = max inf Dw (P) . (11) P w Proof For any classifier with parameter vector w, we can write, max Dw (P) ≥ Dw (P∗ ) ≥ Dmin (P∗ ) P and, thus, inf max Dw (P) ≥ Dmin (P∗ ) . w P (12) Therefore, Dmin (P∗ ) is a lower bound of the minimax regret. Now we prove that Dmin (P∗ ) is also an upper bound. According to Eq. (9), for any ε > 0, there exists a parameter vector wε such that Dwε (P∗ ) ≤ Dmin (P∗ ) + ε . (13) By definition, for any P, Dmin (P) ≤ Dwε (P). Therefore, using Eq. (13), we can write Dwε (P∗ ) − Dwε (P) ≤ Dmin (P∗ ) − Dmin (P) + ε . (14) Since Dmin (P) is continuously differentiable and (according to Eq. (10)) maximum at P ∗ , for any ε > 0 there exists δ > 0 such that, for any P ∈ P with P∗ − P ≤ δ we have Dmin (P∗ ) − Dmin (P) ≤ ε P∗ − P ≤ ε δ . (15) Let Pδ a prior such that P∗ − Pδ = δ. Taking ε = ε δ and combining Eqs. (14) and (15) we can write Dwε (P∗ ) − Dwε (Pδ ) ≤ 2ε δ . Since the above condition is verified for any ε > 0 and any prior Pδ at distance δ from P, and taking into account that Dwε (P) is a linear function of P, we conclude that the maximum slope of D wε (P) is bounded by 2ε and, thus, for any P ∈ P , we have √ Dwε (P) − Dwε (P∗ ) ≤ 2ε P − P∗ ≤ 2 2ε , √ (where we have used the fact that the maximum distance between two probability vectors is 2). Therefore, we can write √ max Dwε (P) ≤ Dwε (P∗ ) + 2 2ε P 111 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I and, thus, √ inf max Dw (P) ≤ Dwε (P∗ ) + 2 2ε . w P √ Finally, using Eq. (13) and taking into account that ε = ε δ ≤ 2ε we get √ inf max Dw (P) ≤ Dmin (P∗ ) + 3 2ε . w P (16) Since the above is true for any ε > 0 we conclude that Dmin (P∗ ) is also an upper bound of Dw . Therefore, combining Eqs. (12) and (16), we conclude that inf max Dw (P) = Dmin (P∗ ) , w P which completes the proof. Note that the deviation function needs to be neither differentiable nor a continuous function of w parameters. If the minimum deviation function is not continuously differentiable at the minimax deviation probability, P∗ , the theorem cannot be applied. The reason is that, although there should exist at least one classifier providing the minimum deviation at P = P∗ , it or they could not provide a constant deviation with respect to the prior probability. The situation can be illustrated with an example. Let x ∈ R be given by p(x|d = 0) = 0.8N(x, σ) + 0.2N(x − 2, σ) and p(x|d = 1) = 0.2N(x − 1, σ) + 0.8N(x − 3, σ), where σ = 0.5 and N(x, σ) = (2πσ)−1/2 exp(−x2 /(2σ2 )), and consider the set Φλ of classifiers given by a single threshold over x and decision dˆ = 1 if x ≥ λ 0 if x < λ. Fig. 2 shows the distribution of both classes over x, and Fig. 3 shows, as a function of priors, the minimum error probability (continuous line) that can be obtained using classifiers in Φ λ . Note that decision costs c00 = c11 = 0 and c01 = c10 = 1 have been considered for this illustrative problem. An abrupt slope change is observed at the minimax deviation probability, for P{d = 1} = 1/2. For this prior, there are two single threshold classifiers providing the minimum error probability, which are given by thresholds λ1 and λ2 in Fig. 2. However, as shown in Fig. 3 neither of them provides a risk that is constant in the prior. The minimax deviation classifier in Φ λ , which has a threshold λ0 , does not attain minimum risk at the minimax deviation probability and, thus, cannot be obtained by using Eq. (11). For this example, the desired robust classifier should have a deviation function given by the horizontal dotted line in Fig. 3. Fortunately, it can be obtained by combining the outputs of several classifiers. For instance, let dˆ1 and dˆ2 the decisions of classifiers given by thresholds λ1 and λ2 , respectively. It is not difficult to see that the classifier selecting dˆ1 and dˆ2 at random (for each input sample x) provides a robust classifier. This procedure can be extended to the multiclass-case: consider a set of L classifiers with parameters wk , k = 0, . . . , L − 1, and consider the classifier such that, for any input sample x, makes a decision equal to dk (i.e., the decision of classifier with parameters wk ), with probability qk . It is not difficult to show that the deviation function of this classifier is given by L−1 D(P) = L−1 j=0 k=0 ∑ Pj ∑ qk D j (wk ) 112 , M INIMAX R EGRET C LASSIFIER 0.7 0.6 Likelihoods 0.5 0.4 0.3 0.2 0.1 λ 0 −2 λ −1 0 λ 0 1 1 2 2 3 4 5 x Figure 2: The conditional data distributions for the one-dimensional example discussed in the text. λ1 and λ2 are the thresholds providing the minimum risk at the minimax deviation probability. λ0 provides the minimax deviation classifier. where D j (wk ) = R j (wk ) − c j j . In order to get a constant deviation function, probabilities q k should be chosen in such a way that L−1 ∑ qk D j (wk ) = D , k=0 where D is a constant. Solving these linear equations for q k , k = 0, . . . , L − 1 (with the constraint ∑k qk = 1), the required probabilities can be found. Note that, in order to build the non-deterministic classifier providing a constant deviation, a set of L independent classifiers that are optimal at the minimax deviation prior should be found. However, we go no further on the investigation of this special case for two main reasons: • The situation does not seem to be common in practice. In our simulations, we have found that the maximum of the minimum risk deviation always provided a response which is approximately parallel to Rbasis . • In general, the abrupt change in the derivative may be a symptom that the classifier structure is not optimal for the data distribution. Instead of building a nondeterministic classifier, increasing the classifier complexity should be more efficient. Although the least favorable prior providing the minimax deviation can be computed in closed form for some simple distributions, in general, it must be computed numerically. Moreover, we assume here that the data distribution is not known, and must be learned from examples. Thus, 113 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 0.25 0.2 λ Error probability 0 0.15 λ λ 1 2 0.1 0.05 0 0 0.2 0.4 0.6 0.8 1 P{ d=1} Figure 3: Error probabilities as a function of prior probability of class 1 for the example in Fig. 2. Thresholds λ1 and λ2 do not provide the minimax deviation classifier, which is obtained for threshold λ0 . However, the random combination of classifiers with thresholds λ 1 and λ2 (dotted line) provides a robust classifier with deviation lower than that of λ 0 . we must incorporate the estimation of the least favorable prior in the learning process. Next, we propose a training algorithm in order to get a minimax regret classifier based on neural networks. 4. Neural Robust Classifiers Under Complete Uncertainty Note that, if QmMd is the probability vector providing the maximum in Eq. (11), that is, QmMd = arg max inf{Dw (P)} w P , then we can write DmMd = inf{Dw (QmMd )} . w Therefore, the minimax deviation classifier can be estimated by training a classifier using prior in QmMd . For this reason, QmMd will be called the minimax deviation prior (or least favorable prior). Our proposed algorithms are based on an iterative process of estimating parameters w based on an estimate of the minimax deviation prior, and re-estimating prior based on an estimate of network weights. This is shown in the following. 114 M INIMAX R EGRET C LASSIFIER 4.1 Updating Network Weights Learning is based on minimizing some empirical estimate of the overall error function L−1 L−1 i=0 E{C(y, d)} = i=0 ∑ P{d = ui }E{C(y, d)|d = ui } = ∑ PiCi , where C(y, d) may be any error function and Ci is the expected conditional error for class-i. Selecting the appropriate error function (see Cid-Sueiro and Figueiras-Vidal, 2001, for instance), learning rules can be designed providing a posteriori probability estimates (y i ≈ P{d = ui |x}, where yi is the soft decision) and, thus, according to Eq. (3), the hard decision minimizing the risk can be approximated by L−1 d = arg min { ∑ ci j y j } . i j=0 The overall empirical error function (cost function) used in learning for priors P = (P0 , . . . , PL−1 ) may be written as L−1 C = ∑ PiCi = L−1 i=0 = = 1 K L−1 i=0 1 K k ∑ d C(yk , dk ), Ki k=1 i Pi K k ∑ d C(yk , dk ) Ki /K k=1 i ∑ i=0 1 K ∑ K k=1 ∑ Pi L−1 ∑ Pi d kC(yk , dk ) (0) i i=0 Pi , , (17) (0) where Pi = Ki /K is an initial estimate of class-i prior based on class frequencies in the training set and Pi is the current prior estimate. Minimizing error function (17) by means of a stochastic gradient descent learning rule leads to update the network weights at k-th iteration as w (k+1) = w (k) (n) L−1 −µ = w(k) − Pi i=0 Pi ∑ L−1 d k ∇ C(yk , dk ) (0) i w ∑ µi (n) k di , ∇wC(yk , dk ) , (18) i=0 where (n) (n) µi = µ Pi (19) (0) Pi (n) is a learning step scaled by the prior ratio. Note that di selects the appropriate µi according to the pattern class membership. The classifier is trained without altering the original training data set (0) class distribution Pi and therefore, without missing or duplicating information. 115 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 4.2 Updating Prior Probabilities Eq. (11) shows that the learning process should maximize (5) with respect to the prior probabilities. The estimate of (5) can be computed as ¯ Dw (P) = Rw (P) − Rbasis (P) , where (20) L−1 ∑ R j Pj (21) 1 L−1 ∑ ci j Ni j N j i=0 (22) Rw (P) = j=0 is the overall Bayes risk estimate and Rj = is the class- j conditional risk estimate where N j is the number of class u j patterns in the training phase and Ni j is the number of samples from class u j assigned to ui . L−1 In order to derive a learning rule to find an estimate Pi satisfying constraints ∑i=0 Pi = 1 and 0 ≤ Pi ≤ 1, we will use auxiliary variables Bi such that Pi = exp(Bi ) L−1 ∑ j=0 exp(B j ) . (23) ¯ We maximize Dw with respect to Bi . Applying the chain rule, ¯ ¯ ∂Dw L−1 ∂Dw ∂Pj =∑ , ∂Bi j=0 ∂Pj ∂Bi and using Eqs. (20), (21) and (23), we get ¯ ∂D w ∂Bi L−1 = ∑ (R j − c j j )Pi (δi j − Pj ), j=0 L−1 L−1 j=0 j=0 = Pi Ri − cii − ∑ (R j Pj ) + ∑ (c j j Pj ) , = Pi Ri − cii − Rw − Rbasis , = Pi Rdi , where Rdi = (Ri − cii ) − (Rw − Rbasis ) . The learning rule for auxiliary variable Bi is (n) Bi (n+1) = Bi + ρ (n) ∂D w , ∂Bi (n) (n) = Bi + ρPi Rdi , 116 (24) M INIMAX R EGRET C LASSIFIER where parameter ρ > 0 controls the rate of convergence. Using Eq. (23) and Eq. (24), the updated learning rule for Pi is (n) (n+1) Pi = (n) (n) (n) exp ρPj Rd j ∑L−1 exp B j j=0 (n) = (n) (n) exp(Bi ) exp ρPi Rdi , (n) (n) Pi exp ρPi Rdi (n) (n) (n) ∑L−1 Pj exp ρPj Rd j j=0 . (25) 4.3 Training Algorithm for a Minimax Deviation Classifier In the previous section, both the network weights updating rule (18) and the prior probability update rule (25) have been derived. The algorithm resulting from the combination is shown as follows: for n = 0 to Niterations − 1 do for k = 1 to K do w(k+1) = w(k) − L−1 ∑ µi (n) k di ∇wC(yk , dk ) i=0 end for (n) Estimate R(n) , Ri , i = 0, . . . , L − 1, according to (21) and (22) (n+1) (n+1) Update minimax probability Pi , i = 0, . . . , L − 1 according to (25) and compute µi with (19) end for 5. Robust Classifiers Under Partial Uncertainty Although in many practical situations prior probabilities may not be specified with precision, they can be partially known. In this section we discuss how partial information about priors can be used to improve the classifier performance in relation to a complete uncertainty situation. From now on, let us consider that lower (or upper) bounds of the priors are known based on previous experience. We will denote the lower and upper bounds of class-i prior probability as Pil and Piu , respectively. In order to illustrate this situation consider a binary classification problem where probability lower bounds P0l and P1l are known. That is, P1 ∈ [P1l , 1 − P0l ] where this interval represents the uncertainty region. Let us denote by Γ = {P : 0 ≤ Pi ≤ 1, ∑L−1 Pi = 1, Pi ≥ Pil } the probability region i=0 satisfying the imposed constraints. In the following, we will refer to Γ as the uncertainty region. Now, the aim is to design a classifier that minimizes the maximum regret from the minimum risk only inside the uncertainty region. This is depicted in Fig. 4(a), which shows that reducing the uncertainty in priors allows to reduce deviation from the optimal classifier. This minimax regret approach for the uncertainty region Γ is often called Γ-minimax regret. As discussed before, the minimax deviation solution gives a Bayes solution with respect some priors denoted in the partial uncertainty case as QΓ mMd in Fig. 4(a), which is the least favorable distribution according to the regret criterion. 117 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I cΓ 00 RΓ basis RΓ 1 cΓ 11 PSfrag replacements Risk 0 0.5 1 1.5 2 2.5 3 3.5 0P1l RB (P) + ψ(P) Minimax Deviation with Restriction Risk RΓ 0 QΓ 1mMd 1 − P0l 1 P1 0P1l 1 − P0l 1 P1 (b) (a) Figure 4: Minimax deviation classifier under partial uncertainty of prior probabilities: (a)Γ-minMaxDev Classifier. (b) Modified cost function defined as R B (P) + ψ(P). In contrast to the minimax regret criterion, note that a classical minimax classifier for the considered uncertainty region would minimize the worst-case risk. It would be a Bayes solution for the prior where the minimum risk reaches its maximum and it could be denoted as Q Γ . mM Notice, also, that these solutions will be the same if the risk for the vertex of Γ take the same value (cΓ = k). ii 5.1 Neural Robust Classifiers Under Partial Uncertainty Minimax search can be formulated as maximizing (with respect to priors) the minimum (with respect to network parameters) of deviation function (5), as described in previous section, but subject to some constraints arg max inf {DΓ (P)} , w w P Pi ≥ Pil , i = 0, . . . , L − 1 s.t. where DΓ = RΓ − RΓ . When uncertainty is global, this hyperplane is defined by the risk in the L w w basis extreme cases with Pi = δik , that is, by the corresponding cii . However, with partial knowledge of the prior probabilities, this hyperplane becomes defined by the risk in L points which are the vertex given by the restrictions and with associated risk denoted by c Γj . j Defining 1 l(Pi ) = , (26) 1 + exp−τ(Pi −Pil ) where τ controls the hardness of this restriction, the minimax problem can be re-formulated as arg max inf {DΓ (P)} w P s.t. w l(Pi ) ≥ 1/2, i = 0, . . . , L − 1. Thus, this constrained optimization problem can be solved as a non-constrained problem by considering an auxiliary function that incorporates the restriction as a barrier function 118 M INIMAX R EGRET C LASSIFIER arg max inf {DΓ (P) + Aψ(P)} , w w P where ψ(Pi ) = log(l(Pi )) and the constant A determines the contribution of the barrier function. Fig. 4(b) shows the new risk function corresponding to the binary case previously depicted in Fig. 4(a). Note that, it is the sum of the original RB (P) and the barrier function ψ(P). As in Section 4.1, in order to derive the network weight learning rule, we need to compute ∂ψ ∂Bi L−1 = ∂ψ ∂P j , j ∂Bi ∑ ∂P j=0 = τPi L−1 ∑ 1 − l(Pk ) (δik − Pk ), k=0 = τPi ψdi , where ψdi = ∑L−1 (1 − l(Pk ))(δik − Pk ) k=0 As τ increases, the constraints become harder around the specified bound. The update learning rule for the auxiliary variable Bi at cycle n is (n+1) Bi (n) Γ(n) (n) (n) = Bi (n) + ρPi Rdi + ρAτPi ψdi . And therefore, using (23), the update learning rule for Pi is (n) (n+1) Pi = (n) Γ(n) Pi exp ρPi Rdi L−1 ∑ (n) Pj exp (n) (n) exp ρAτPi ψdi (n) Γ(n) ρ P j Rd j . (n) (n) ρAτPj ψd j exp j=0 Note that if the upper bound is known instead of the lower bound, l(Pi ) defined by (26) should be replaced by u(Pi ) = (1 + exp(τ(Pi − Piu )))−1 at the previous formulation. The minimax constrained optimization problem has been tackled by considering a new objective function defined by the sum of the original cost function and a barrier function. Studying the convexity of this new function becomes important from the fact that a stationary point of this risk curve is a global maximum. Since the minimum risk curve (RB (P)) is a convex function of the priors (see VanTrees, 1968, for details), if we verify the convexity of the barrier function, we can conclude that the function defined by the sum of both of them is also convex. This barrier function is convex in P if the Hessian matrix HR verifies PT HR P ≤ 0 The Hessian matrix of the barrier function equals to a diagonal matrix D r = diag(r) with all negative diagonal entries ri = Aτ2 (−l(Pi )(1 − l(Pi ))). As l(Pi ) ∈ [0, 1] and therefore, ri ≤ 0, it is straightforward to see that PT HR P = PT Dr P, L−1 = ∑ Pi2 ri ≤ 0 . i=0 Since the barrier function is convex, the new objective function (defined by the sum of two convex functions) is also convex. 119 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I 5.2 Extension to Other Learning Algorithms The learning algorithm proposed in this paper is intended to train a minimax deviation classifier based on neural networks with feedforward architecture. Actually, the learning algorithm we propose becomes a feasible solution for any learning process based on minimizing some empirical estimate of an overall cost (error) function. However, it is also applicable to a general classifier provided it is trained (in an iterative process) for the estimated minimax deviation probabilities and the assumed decision costs. Specifically, in this paper, scaling the learning rate allows to simulate different class distributions and the hard decisions are made based on posterior probability estimates and decision costs. Furthermore, the neural learning phase carried out in one iteration can be re-used for the next one, what allows to reduce computational cost with respect to a complete optimization process on each iteration. Apart from the general approach of completely training a classifier on each iteration and in order to reduce its computational cost, specific solutions may be studied for different learning machines. Nonetheless, it seems not feasible to readily achieve this improvement for classifiers like SVMs, where support vectors for one solution may have nothing in common with the ones obtained in next iteration and thus, making necessary to re-train the classifier in each iteration. Another possible solution for any classifier that provides a posteriori probabilities estimates or any score that can be converted into probabilities (for details on calibration methods see Wei et al., 1999; Zadrozny and Elkan, 2002; Niculescu-Mizil and Caruana, 2005) is outlined here. In this case, an iterative procedure able to estimate the minimax deviation probabilities and consequently to adjust (without re-training) the outputs of the classifier could be studied. The general idea for this approach is as follows: first, the new minimax deviation prior probabilities are estimated according to (25) and then, posterior probabilities provided by the model are adjusted as follows (see Saerens et al., 2002, for more details) (k) Pi P(k) {d = ui |x} = P(k−1) {d = ui |x} (k−1) Pi L−1 P(k) j P(k−1) {d = u j |x} (k−1) j=0 Pj . (27) ∑ The algorithm’s main structure is summarized as for k = 1 to K do (k) Estimate R(k) , Ri , i = 0, . . . , L − 1, according to (21), (22) and decision costs c i j (k+1) Update minimax probability Pi according to (25) Adjust classifier outputs according to (27) end for The effectiveness of this method relies on the accuracy of the initial a posteriori probability estimates. Studying in depth this approach and comparing different minimax deviation classifiers (decision trees, SVMs, RBF networks, feedforward networks and committee machines) together with different probability calibration methods appears as a challenging issue to be explored in future work. 120 M INIMAX R EGRET C LASSIFIER 6. Experimental Results In this section, we first present the neural network architecture used in the experiments and illustrate the proposed minimax deviation strategy on an artificial data set. Then, we apply it to several realworld classification problems. Moreover, a comparison with other proposals such as the traditional minimax and the common re-balancing approach is carried out. 6.1 Softmax-based Network Although our algorithms can be applied to any classifier architecture, we have chosen a neural network based on the softmax non-linearity with soft decisions given by Mi yi = ∑ yi j , j=1 with yi j = exp(wTj x + wi j0 ) i , Mk ∑L−1 ∑l=1 exp(wT x + wkl0 ) k=0 kl where L stands for the number of classes, M j the number of softmax outputs used to compute y j and wi j are weight vectors. We will refer to this network as a Generalized Softmax Perceptron(GSP). 1 A simple network with M j = 2 is used in the experiments. x1 wj,k y1,1 y1,... x2 x3 y1 y1,M1 Class i ... SOFTMAX ... HARD DECISION n inputs / outputs ... yL,1 xd yL,ML yL,... yL Figure 5: GSP(Generalized Softmax Perceptron) Network Fig. 5 corresponds to the neural network architecture used to classify the samples represented by feature vector x. Learning consists of estimating network parameters w by means of the stochastic gradient minimization of certain objective functions. In the experiments, we have considered the Cross Entropy objective function given by L CE(y, d) = − ∑ di log yi . i=1 The stochastic gradient learning rule for the GSP network is given by Eq. (18). Learning step µ(0) decreases according to µ(k) = 1+k/η , where k is the iteration number, µ(0) the initial learning rate and η a decay factor. µ(k) 1. Note that the GSP is similar to a two layer MLP with a single layer of weights and with coupled saturation function (softmax), instead of sigmoidal units. 121 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I The reason to illustrate this approach with a feedforward architecture is that, as mentioned in Section 5.2, it allows to exploit (in the iterative learning process) the partially optimized solution in current iteration for the next one. On the other hand, posterior probability estimation makes it possible to apply the adaptive strategy based on prior re-estimation proposed by Saerens to the minimax deviation classifier, as long as a data set representative of the operation conditions is available. Finally, the fact that intermediate outputs yi j of the GSP can be interpreted as subclass probabilities may provide quite a natural way to cope with the unexplored problem of uncertainty in subclass distributions as already pointed out by Webb and Ting (2005). Nonetheless, both architecture and cost function issues are not the goal of this paper, but merely illustrative tools. 6.2 Artificial Data Set To illustrate the minimax regret approach proposed in this paper both under complete and partial uncertainty, an artificial data set with two classes (class u0 and class u1 ) has been created. Data examples are drawn from the normal distribution p(x|d = ui ) = N(mi , σ2 ) with mean mi and standard i √ deviation σi . Mean values were set to m0 = 0, m1 = 2 and standard deviation to σ0 = σ1 = 2. A total of 4000 instances were generated with prior probabilities of class membership P{d = u 0 } = 0.93 c00 c01 2 5 and P{d = u1 } = 0.07. The cost-benefit matrix is given by . c10 c11 4 0 Initial learning rate was set to µ(0) = 0.3, decay factor to η = 2000 and training was ended after 80 cycles. Classifier assessment was carried out by following 10-fold cross-validation. Two classifiers were trained, to be called a standard classifier and a minMaxDev classifier. The former is built by considering that the estimated class prior information is precise and stationary and the latter is the approach proposed in this paper to cope with uncertainty in priors. Thus, for the standard classifier, its performance may deviate from the optimal risk in 3.39 when priors change from training to test conditions. However, a minimax deviation classifier reduces this worst-case difference from the optimal classifier to 0.77. Now, we suppose that some information about priors is available (partial uncertainty). For instance, we consider that the lower bound for prior probabilities P0 and P1 are known and set to P0l = 0.55 and P1l = 0.05, respectively, so that the uncertainty region is Γ = {(P0 , P1 )|P0 ∈ [0.55, 0.95], P1 ∈ [0.05, 0.45]}. A minimax deviation classifier can be derived for Γ (it will be called Γ-minMaxDev classifier).The narrower Γ is, the closer the minimax deviation classifier performance is to the optimal. For this particular case, under partially imprecise priors, the standard classifier may differ from optimal (in Γ) in 0.83, while the use of the simple minMaxDev classifier designed under total prior uncertainty conditions attains a maximum deviation of 0.53. However, the Γ-minMaxDev classifier only differs from optimal in 0.24. These data are reported in Table 1 where both, experimental and also theoretical results, are shown. 6.3 Real Databases In this section we report experimental results obtained with several publicly available data sets. From the UCI repository (Blake and Merz, 1998) the following benchmarks: German Credits, Australian Credits, Insurance Company, DNA slice-junction, Page-blocks, Dermatology and Pen-digits. 122 M INIMAX R EGRET C LASSIFIER Standard Th/Exp Maximum deviation from optimal (complete uncertainty) Maximum deviation from optimal in Γ (partial uncertainty) Classifier minMaxDev Γ-minMaxDev Th/Exp Th/Exp 3.41/3.39 0.72/0.77 – 0.85/0.83 0.50/0.53 0.19/0.24 Table 1: A comparison between the standard classifier (build under stationary prior assumptions), the minimax deviation classifier (minMaxDev) and the minimax deviation classifier under partial uncertainty (Γ-minMaxDev) for an artificial data set Database German Credits (GCRE) Australian Credits (AUS) Munich Credits (MCRE) Insurance Company (COIL) DNA Slice-junction (DNA) Page-blocks (PAG) Dermatology (DER) Pen-digits (PEN) # Classes 2 2 2 2 3 5 6 10 Class distribution [0.70 0.30] [0.32 0.68] [0.30 0.70] [0.94 0.06] [0.24 0.24 0.52] [0.90 0.06 0.01 0.01 0.02] [0.31 0.16 0.20 0.13 0.14 0.06] [0.104 0.104 0.104 0.096 0.104 0.096 0.096 0.104 0.096 0.096] # Attributes 8 14 20 85 180 10 34 16 # Instances 1000 690 1000 9822 3186 5473 366 10992 Table 2: Experimental Data sets Other public data set used is Munich Credits from the Dept. of Statistics at the University of Munich.2 Data set description is summarized in Table 2, and cost-benefit matrices are shown in Table 3. We have used the cost values that appear in Ikizler (2002) for those data sets in common. Otherwise, for lack of an expert analyst, the cost values have been chosen by hand. 2. Data sets available at http://www.stat.uni-muenchen.de/service/datenarchiv/welcome e.html. Insurance Company 0 1 German, Australian, Munich Credits −1 0 0 −17 Page-Blocks −1 2 2 2 2 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 −4 2 3 4 3 4 3 −3 3 5 1 5 5 0 Dermatology 3 2 3 2 −8 4 5 −10 4 3 5 4 2 1 4 5 −6 5 2 3 5 2 3 −10 −1 2 2 DNA 2 −1 2 Pendigits ci j = 0 1 Table 3: Cost-Benefit matrices for the experimental Data sets 123 3 3 0 if i = j Otherwise A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Standard Maximum Risk Deviation from the optimal classifier Re-balanced Minimax Deviation Minimax minMaxDev minMax GCRE 0.70 0.80 (0.55 0.60) 0.99 ACRE 1.00 1.00 (0.76 0.86) 1.00 MCRE 0.91 0.77 (0.54 0.59) 0.99 COIL 2.78 0.99 (0.87 0.92) 16.32 DNA 0.34 0.53 (0.30 0.27 0.25) PAG 0.62 0.26 (0.13 0.13 0.20 0.16 0.16) DER 1.03 1.28 (0.67 0.78 0.51 0.48 0.54 PEN 0.061 0.059 (0.024 0.028 0.025 0.019 1.14 0.023 0.021 0.026 0.022 0.86 0.60) 0.023 0.029) 7.62 0.029 Table 4: Classifier Performance evaluated as Maximum Risk Deviation from the optimal classifier for several real-world applications. Class-conditional risk deviations (R i − cii ) reported for the minMaxDev classifier. Experimental results for these data sets are shown in the following sections. The robustness of different decision machines under complete uncertainty of prior probabilities is analyzed in Section 6.3.1. If uncertainty is only partial, a similar study and comparison with the previous approach (complete uncertainty) is carried out in Section 6.3.2. 6.3.1 C LASSIFIER ROBUSTNESS U NDER C OMPLETE U NCERTAINTY We now study how different neural-based classifiers cope with worst-case situations in prior probabilities. The maximum deviation from the optimal classifier (see Table 4) is reported for the proposed minMaxDev strategy as well as for other alternative approaches: the one based on the assumption of stationary priors (standard) and the common alternative of deriving the classifier from an equally distributed data set (re-balanced). A comparison with the traditional minimax strategy is also provided. Together with the previously mentioned value (maximum deviation or regret), deviation for the L class-conditional extreme cases (Ri − cii ) is also reported for the minMaxDev classifier in Table 4. Results allow to verify that this solution is fairly close to the optimal one where deviation is not dependent on priors and thus, class-conditional deviations take the same value. Although the balanced class distribution to train the classifier can be obtained by means of undersampling and/or oversampling, it is simulated by altering the learning rate used in the training 1/L phase according to (19) as µi = µ (0) , where 1/L represents the simulated probability, equal for Pi all classes. Results evidence that the assumption of stationary priors may lead to significant deviations from the optimal decision rule under “unexpected”, but rather realistic, prior changes. This deviation may reach up to three times more than the robust minimax deviation strategy. Thus, for classification problems like Page-blocks the maximum deviation from the optimal classifier is 0.62 for the 124 M INIMAX R EGRET C LASSIFIER Standard Maximum Risk Re-balanced Minimax Deviation minMaxDev Minimax minMax GCRE 0.70 0.15 0.60 0.00 ACRE 0.01 0.02 0.86 -0.00 MCRE 0.05 0.20 0.59 0.00 COIL 0.76 0.99 0.86 0.02 DNA 0.34 0.53 0.25 0.13 PAG 0.62 0.26 0.20 0.10 DER -2.10 -1.68 -2.21 -2.38 PEN 0.061 0.059 0.029 0.029 Table 5: Classifier Performance measured as Maximum Risk for several real-world applications. standard classifier while this reduces to 0.20 for the minMaxDev one. Likewise, for the Insurance company(COIL) application the maximum deviation for the standard classifier is 2.78 compared with 0.92 for the minMaxDev model. The remaining databases also show the same behavior as it is presented in Table 4. On the other hand, the use of a classifier inferred from a re-balanced data set does not necessarily involve a decrease in the maximum deviation with respect to the standard classifier. In the same way, the traditional minimax classifier does not protect against prior changes in terms of maximum relative deviation from the minimum risk classifier. However, if our criterion is more conservative and our aim is the minimization of the maximum possible risk (not the minimization of the deviation), the traditional minimax classifier represents the best option. It is shown in Table 5 where the maximum risk for the different classifiers is reported. Positive values in this table indicate a cost while negative values represent a benefit. For instance, for the Page-blocks application the minimax classifier assures a maximum risk of 0.10 while the standard, re-balanced and minMaxDev classifiers reach values of 0.62, 0.26 and 0.20, respectively. It can be noticed that for the Pen-digits data set, the minimax deviation and minimax approaches attain the same results. The reason is that, for this problem, the R basis plane takes the same value (in this case, zero) in the probability space. 6.3.2 C LASSIFIER ROBUSTNESS UNDER PARTIAL U NCERTAINTY Unlike the previous section, we consider now that partial information about the class priors is available. The aim is to find a classifier that behaves well for a delimited and realistic range of priors what constitutes an aid in reducing the maximum deviation from the optimal classifier. This situation can be treated as a constrained minimax regret strategy where the constraints represent any extra information about prior probability value. Experimental results for several situations of partial prior uncertainty are presented in this section. We consider that lower bounds for the prior probabilities are available (see Table 6). In order to get the Γ-minMaxDev classifier, the risk for the different vertex of the uncertainty domain needs to be calculated. With them, the basis risk RΓ over which deviations are measured is derived. basis 125 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I Lower bound for prior probabilities Data Set P0l P1l GCRE 0.40 0.25 ACRE 0.20 0.25 MCRE 0.20 0.25 COIL 0.15 P2l P3l P4l P5l 0.03 DNA 0.10 0.10 0.22 0.02 0.00 0.01 0.1 0.20 0.10 0.10 0.10 0.10 0.06 0.06 0.10 0.10 0.06 P9l 0.06 0.10 0.05 0.05 0.02 PEN P8l 0.02 DER P7l 0.25 PAG P6l Table 6: Lower bounds for prior probabilities defining the uncertainty region, Γ region for the experimental data sets. Maximum Risk Deviation in the uncertainty region Standard Minimax Deviation Minimax Deviation with restriction minMaxDev Γ-minMaxDev GCRE 0.24 0.19 (0.10 0.09) ACRE 0.03 0.64 (0.03 0.03) MCRE 0.22 0.38 (0.13 0.10) COIL 2.33 0.77 (0.17 0.11) DNA 0.14 0.08 (0.07 0.07 0.06) PAG 0.37 0.15 (0.10 0.08 0.08 0.05 0.04) DER 0.08 0.05 (0.03 0.03 0.04 0.02 0.05 PEN 0.013 0.007 (0.003 0.001 0.003 0.000 0.001 0.001 0.000 0.003 0.05) 0.001 0.001) Table 7: Classifier Performance under partial knowledge of prior probabilities measured as Maximum Risk Deviation for several real-world applications. Class-conditional risk deviations (RΓ − cΓ ) are reported for the Γ-minMaxDev classifier. i ii Maximum deviation from the optimal in Γ is reported for the Γ-minMaxDev classifier together with the standard and the minMaxDev ones. For instance, the standard classifier for the Pageblocks data set deviates from the optimal classifier, in the defined uncertainty region, up to 0.37, while when complete uncertainty is assumed the maximum deviation is equal to 0.62. In the same way, reducing the uncertainty also means a reduction in the maximum deviation for minMaxDev classifier (trained without considering this partial knowledge). Thus, for Γ, this classifier assures a deviation bound of 0.15. However, taking into account this partial information to train a Γ-minMaxDev classifier allows to reduce the deviation for the worst-case conditions to 0.10. It can be seen the same behavior for the other databases in Table 7. 126 M INIMAX R EGRET C LASSIFIER 7. Conclusions This work concerns the design of robust neural-based classifiers when the prior probabilities of the classes are partially or completely unknown, even by the end user. This problem of uncertainty in the class priors is often ignored in supervised classification, even though it is a widespread situation in real world applications. As a result, the reliability of the inducted classifier can be greatly affected as previously shown by the experiments. To tackle this problem, we have proposed a novel minimax deviation strategy with the goal to minimize the maximum deviation with respect to the optimal classifier. A neural network training algorithm based on learning rate scaling has been developed. The experimental results show that this minimax deviation (minMaxDev) classifier protects against prior changes while other approaches like ignoring this uncertainty or use a balanced learning data set may result in large differences in performance with respect to the minimum risk classifier. Also, it has been shown that the conventional minimax classifier reduces the maximum possible risk following a conservative attitude but at the expense of large worst-case differences from the optimal classifier. Furthermore, a constrained minimax deviation approach (Γ-minMaxDev) has been derived for those situations where uncertainty is only partial. This may be seen as a general approach with some particular cases: a) precise knowledge of prior probabilities and b) complete uncertainty about the priors. In a) the region of uncertainty collapses to a point and we have the Bayes’ rule of minimum risk and in b) the pure minimax deviation strategy comes up. While the first one may be criticized for being quite unrealistic, the other may be seen rather pessimistic. The experimental results for this proposed intermediate situation show that the Γ-minMaxDev classifier allows to reduce the maximum deviation from the optimal and performs well over a range of prior probabilities. Acknowledgments The authors thank the four referees and the associate editor for their helpful comments. This work was partially supported by the project TEC2005-06766-C03-02 from the Spanish Ministry of Education and Science. References N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11, 2004. N. M. Adams and D. J. Hand. Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32(7):1139–1147, March 1998. R. Alaiz-Rodriguez, A. Guerrero-Curieses, and J. Cid-Sueiro. Minimax classifiers based on neural networks. Pattern Recognition, 38(1):29–39, January 2005. R. Barandela, J. S. Sanchez, V. Garc´a, and E. Rangel. Strategies for learning in class imbalance ı problems. Pattern Recognition, 36(3):849–851, March 2003. J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, second edition, 1985. 127 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/ mlearn/MLRepository.html. URL L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, NY, 1984. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. J. Cid-Sueiro and A. R. Figueiras-Vidal. On the structure of strict sense Bayesian cost functions and its applications. IEEE Transactions on Neural Networks, 12(3):445–455, May 2001. C. Drummond and R. C. Holte. Explicitly representing expected cost: An alternative to ROC representation. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 198–207. ACM Press, 2000. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, 2001. Y. C. Eldar and N. Merhav. Minimax approach to robust estimation of random parameters. IEEE Trans. on Signal Processing, 52(7):1931–1946, July 2004. Y. C. Eldar, A. Ben-Tal, and A. Nemirovski. Linear minimax regret estimation of deterministic parameters with bounded data uncertainties. IEEE Trans. on Signal Processing, 52(8):2177– 2188, August 2004. M. Feder and N. Merhav. Universal composite hypothesis testing: A competitive minimax approach. IEEE Trans. on Information Theory, 48(6):1504–1517, June 2002. A. Guerrero-Curieses, R. Alaiz-Rodriguez, and J. Cid-Sueiro. A fixed-point algorithm to minimax learning with neural networks. IEEE Transactions on Systems, Man and Cybernetics Part C, 34 (4):383–392, November 2004. ¨ H. A. G¨ venir, N. Emeksiz, N. Ikizler, and N. Ormeci. Diagnosis of gastric carcinoma by classifiu cation on feature projections. Artificial Intelligence in Medicine, 31(3), 2004. N. Ikizler. Benefit maximizing classification using feature intervals. Technical Report BU-CE-0208, Bilkent University, Ankara, Turkey, 2002. N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis Journal, 6(5):429–450, November 2002. M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of Fifth International Conference on SIG Knowledge Discovery and Data Mining (SIGKDD), pages 367–371, San Diego, CA, 1999. H. J. Kim. On a constrained optimal rule for classification with unknown prior individual group membership. Journal of Multivariate Analysis, 59(2):166–186, November 1996. M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings 14th International Conference on Machine Learning, pages 179–186. Morgan Kaufmann, 1997. 128 M INIMAX R EGRET C LASSIFIER M. Kubat, R. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2/3):195–215, 1998. S. Lawrence, I. Burns, A. D. Back, A. C. Tsoi, and C. L. Giles. Neural network classification and ¨ unequal prior class probabilities. In G. Orr, K.-R. Muller, and R. Caruana, editors, Tricks of the Trade, Lecture Notes in Computer Science State-of-the-Art Surveys, pages 299–314. Springer Verlag, 1998. T. K. Moon and W. C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall, 2000. A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In ICML ’05: Proceedings of the 22nd International Conference on Machine learning, pages 625– 632, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-180-5. E. Polak. Optimization: Algorithms and Consistent Approximations. Springer, 1997. F. Provost. Learning with imbalanced data sets 101. In Invited paper for the AAAI 2000 Workshop on Imbalanced Data Sets. AAAI Press. Technical Report WS-00-05, 2000. F. Provost and T. Fawcett. Robust classification systems for imprecise environments. Machine Learning, 42(3):203–231, March 2001. M. Saerens, P. Latinne, and C. Decaestecker. Adjusting a classifier for new a priori probabilities: A simple procedure. Neural Computation, 14:21–41, January 2002. E. Takimoto and M. Warmuth. The minimax strategy for Gaussian density estimation. In Proceedings 13th Annual Conference on Computational Learning Theory, pages 100–106. Morgan Kaufmann, San Francisco, 2000. K. M. Ting. A study of the effect of class distribution using cost-sensitive learning. In Proceedings of the Fifth International Conference on Discovery Science, pages 98–112. Berlin: Springer-Verlag, 2002. H. L. VanTrees. Detection, Estimation and Modulation Theory. John Wiley and Sons, 1968. G. I. Webb and K. M. Ting. On the application of ROC analysis to predict classification performance under varying class distributions. Machine Learning, 58(1):25–32, 2005. W. Wei, T. K. Leen, and E. Barnard. A fast histogram-based postprocessor that improves posterior probability estimates. Neural Computation, 11(5):1235 – 1248, July 1999. B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 204–213. ACM Press, 2001. B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Eighth International Conference on Knowledge Discovery and Data Mining, 2002. 129 A LAIZ -RODR´GUEZ , G UERRERO -C URIESES , AND C ID -S UEIRO I B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the third IEEE International Conference on Data Mining, pages 435–442, 2003. Z. H. Zhou and X. Y. LiuJ. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, January 2006. 130
2 0.55024785 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
3 0.31738621 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
4 0.30957666 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
5 0.30833775 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
7 0.30563742 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
8 0.30561456 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
9 0.30424225 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
10 0.30423427 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
11 0.30302879 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
12 0.3025136 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
14 0.30223829 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
15 0.30167019 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
16 0.30163565 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
18 0.29980987 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
19 0.29854465 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
20 0.29853612 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm