nips nips2004 nips2004-86 knowledge-graph by maker-knowledge-mining

86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Source: pdf

Author: Shyam Visweswaran, Gregory F. Cooper

Abstract: Classification algorithms typically induce population-wide models that are trained to perform well on average on expected future instances. We introduce a Bayesian framework for learning instance-specific models from data that are optimized to predict well for a particular instance. Based on this framework, we present a lazy instance-specific algorithm called ISA that performs selective model averaging over a restricted class of Bayesian networks. On experimental evaluation, this algorithm shows superior performance over model selection. We intend to apply such instance-specific algorithms to improve the performance of patient-specific predictive models induced from medical data. 1 In t ro d u c t i o n Commonly used classification algorithms, such as neural networks, decision trees, Bayesian networks and support vector machines, typically induce a single model from a training set of instances, with the intent of applying it to all future instances. We call such a model a population-wide model because it is intended to be applied to an entire population of future instances. A population-wide model is optimized to predict well on average when applied to expected future instances. In contrast, an instance-specific model is one that is constructed specifically for a particular instance. The structure and parameters of an instance-specific model are specialized to the particular features of an instance, so that it is optimized to predict especially well for that instance. Usually, methods that induce population-wide models employ eager learning in which the model is induced from the training data before the test instance is encountered. In contrast, lazy learning defers most or all processing until a response to a test instance is required. Learners that induce instance-specific models are necessarily lazy in nature since they take advantage of the information in the test instance. An example of a lazy instance-specific method is the lazy Bayesian rule (LBR) learner, implemented by Zheng and Webb [1], which induces rules in a lazy fashion from examples in the neighborhood of the test instance. A rule generated by LBR consists of a conjunction of the attribute-value pairs present in the test instance as the antecedent and a local simple (naïve) Bayes classifier as the consequent. The structure of the local simple Bayes classifier consists of the attribute of interest as the parent of all other attributes that do not appear in the antecedent, and the parameters of the classifier are estimated from the subset of training instances that satisfy the antecedent. A greedy step-forward search selects the optimal LBR rule for a test instance to be classified. When evaluated on 29 UCI datasets, LBR had the lowest average error rate when compared to several eager learning methods [1]. Typically, both eager and lazy algorithms select a single model from some model space, ignoring the uncertainty in model selection. Bayesian model averaging is a coherent approach to dealing with the uncertainty in model selection, and it has been shown to improve the predictive performance of classifiers [2]. However, since the number of models in practically useful model spaces is enormous, exact model averaging over the entire model space is usually not feasible. In this paper, we describe a lazy instance-specific averaging (ISA) algorithm for classification that approximates Bayesian model averaging in an instance-sensitive manner. ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. While the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment. The rest of this paper is structured as follows. Section 2 introduces a Bayesian framework for instance-specific learning. Section 3 describes the implementation of ISA. In Section 4, we evaluate ISA and compare its performance to that of LBR. Finally, in Section 5 we discuss the results of the comparison. 2 Deci si on Th eo ret i c F rame wo rk We use the following notation. Capital letters like X, Z, denote random variables and corresponding lower case letters, x, z, denote specific values assigned to them. Thus, X = x denotes that variable X is assigned the value x. Bold upper case letters, such as X, Z, represent sets of variables or random vectors and their realization is denoted by the corresponding bold lower case letters, x, z. Hence, X = x denotes that the variables in X have the states given by x. In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and denotes a generic test instance that is not in D. We now characterize population-wide and instance-specific model selection in decision theoretic terms. Given training data D and a separate generic test instance , the Bayes optimal prediction for Zt is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows: P (Z t | X t , D ) = ∫ P( Z t | X t , M ) P ( M | D )dM . (1) M The optimal population-wide model for predicting Zt is as follows:   max∑ U P( Z t | X t , D), P (Z t | X t , M ) P ( X | D) , M  Xt  [ ] (2) where the function U gives the utility of approximating the Bayes optimal estimate P(Zt | Xt , D), with the estimate P(Zt | Xt , M) obtained from model M. The term P(X | D) is given by: P ( X | D) = ∫ P ( X | M ) P ( M | D)dM . (3) M The optimal instance-specific model for predicting Zt is as follows: { [ ]} max U P ( Z t | X t = x t , D), P (Z t | X t = x t , M ) , M (4) where xt are the values of the attributes of the test instance Xt for which we want to predict Zt. The Bayes optimal estimate P(Zt | Xt = xt, D), in Equation 4 is derived using Equation 1, for the special case in which Xt = xt . The difference between the population-wide and the instance-specific models can be noted by comparing Equations 2 and 4. Equation 2 for the population-wide model selects the model that on average will have the greatest utility. Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance Xt = xt . For predicting Zt in a given instance Xt = xt, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. This observation provides support for developing instance-specific models. Equations 2 and 4 represent theoretical ideals for population-wide and instancespecific model selection, respectively; we are not suggesting they are practical to compute. The current paper focuses on model averaging, rather than model selection. Ideal Bayesian model averaging is given by Equation 1. Model averaging has previously been applied using population-wide models. Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models. 3 In st an ce- S p eci fi c Algo ri t h m We present the implementation of the lazy instance-specific algorithm based on the above framework. ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Zt . A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. We introduce Bayesian networks briefly and then describe ISA in detail. 3.1 B ay e s i a n N e t w or k s A Bayesian network is a probabilistic model that combines a graphical representation (the Bayesian network structure) with quantitative information (the parameters of the Bayesian network) to represent the joint probability distribution over a set of random variables [3]. Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, ΘG ). G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. ΘG represents the parameterization of the model. In a Bayesian network M, the immediate predecessors of a node X i in X are called the parents of X i and the successors, both immediate and remote, of Xi in X are called the descendants of X i . The immediate successors of X i are called the children of X i . For each node Xi there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. The complete joint probability distribution over X, represented by the parameterization ΘG, can be factored into a product of local probability distributions defined on each node in the network. This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: A node (representing a variable) is independent of its nondescendants given just its parents. According to this Markov condition, the joint probability distribution on model variables X = (X1 , X 2, …, X n ) can be factored as follows: n P ( X 1 , X 2 , ..., X n ) = ∏ P ( X i | parents( X i )) , (5) i =1 where parents(Xi ) denotes the set of nodes that are the parents of X i . If Xi has no parents, then the set parents(Xi ) is empty and P(Xi | parents(X i)) is just P(Xi ). 3.2 I S A M od e l s The LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). We use the same class of Bayesian networks for the ISA models, to facilitate comparison between the two algorithms. In Figure 1, all nodes represent attributes that are discrete. Each node in X has either an outgoing arc into target node, Z, or receives an arc from Z. That is, each node is either a parent or a child of Z. Thus, X is partitioned into two sets: the first containing nodes (X 1 , …, X j in Figure 1) each of which is a parent of Z and every node in the second set, and the second containing nodes (X j+1 , …, X k in Figure 1) that have as parents the node Z and every node in the first set. The nodes in the first set are instantiated to the corresponding values in the test instance for which Zt is to be predicted. Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent. ... X1= x1 Xi = xi Z Xi+1 ... Xk Figure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes of which X1 , …, X j are instantiated to values x 1 , …, x j in xt . X 1, …, X j are present in the antecedent of the LBR rule and Z, X j+1 , …, X k (that form the local simple Bayes classifier) are present in the consequent. The indices need not be ordered as shown, but are presented in this example for convenience of exposition. 3.3 M od e l A ve r ag i n g For Bayesian networks, Equation 1 can be evaluated as follows: P ( Z t | x t , D ) = ∑ P ( Z t | x t , M ) P( M | D ) , (6) M with M being a Bayesian network comprised of structure G and parameters ΘG. The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks where the weight is the probability of the Bayesian network given the data. Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. Then, with selective model averaging, P(Zt | xt, D) is estimated as: ∑RP( Z t | x t , M ) P(M | D) P (Z t | x t , D) ≅ M ∈ . ∑RP(M | D) M∈ (7) Assuming uniform prior belief over all possible models, the model posterior P(M | D) in Equation 7 can be replaced by the marginal likelihood P(D | M), to obtain the following equation: P ( Z | x , D) ≅ t t ∑ P ( Z t | x t , M ) P( D | M ) . ∑RP( D | M ) M∈ M ∈R (8) The (unconditional) marginal likelihood P(D | M) in Equation 8, is a measure of the goodness of fit of the model to the data and is also known as the model score. While this score is suitable for assessing the model’s fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution which is the focus in prediction and classification tasks, as is the case here. A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as: d score( D, M ) = ∏ P ( z p | x1 ,..., x p ,z 1 ,...,z p −1 ,M ) . (9) p =1 This score is computed in a predictive and sequential fashion: for the pth training instance the probability of predicting the observed value zp for the target variable is computed based on the values of all the variables in the preceding p-1 training instances and the values xp of the attributes in the pth instance. One limitation of this score is that its value depends on the ordering of the data. Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4]. The parameters of the Bayesian network M, used in the above computations, are defined as follows: P ( X i = k | parents ( X i ) = j ) ≡ θ ijk = N ijk + α ijk N ij + α ij , (10) where (i) Nijk is the number of instances in the training dataset D where variable Xi has value k and the parents of X i are in state j, (ii) N ij = ∑k N ijk , (iii) αijk is a parameter prior that can be interpreted as the belief equivalent of having previously observed αijk instances in which variable Xi has value k and the parents of X i are in state j, and (iv) α ij = ∑k α ijk . 3.4 M od e l Se a r c h We use a two-phase best-first heuristic search to sample the model space. The first phase ignores the evidence xt in the test instance while searching for models that have high scores as given by Equation 9. This is followed by the second phase that searches for models having the greatest impact on the prediction of Zt for the test instance, which we formalize below. The first phase searches for models that predict Z in the training data very well; these are the models that have high conditional model scores. The initial model is the simple Bayes network that includes all the attributes in X as children of Z. A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from it to Z and the remaining children, and instantiating this node to the value in the test instance. This process is performed for each child in the current model. An incoming arc of a child node is considered for reversal only if the node’s value is not missing in the test instance. The newly derived models are added to a priority queue, Q. During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, following which new models are generated as described just above, scored and added to Q. The first phase terminates after a user-specified number of models have accumulated in R. The second phase searches for models that change the current model-averaged estimate of P(Zt | xt , D) the most. The idea here is to find viable competing models for making this posterior probability prediction. When no competitive models can be found, the prediction becomes stable. During each iteration of the search, the highest ranked model M* is removed from Q and added to R. The ranking is based on how much the model changes the current estimate of P(Zt | xt , D). More change is better. In particular, M* is the model in Q that maximizes the following function: f ( R, M *) = g ( R) − g ( R U {M *}) , (11) where for a set of models S, the function g(S) computes the approximate model averaged prediction for Zt, as follows: g (S ) = ∑ P(Z M ∈S t | x t , M ) score( D, M ) ∑ score( D, M ) ∈ . (12) M S The second phase terminates when no new model can be found that has a value (as given by Equation 11) that is greater than a user-specified minimum threshold T. The final distribution of Zt is then computed from the models in R using Equation 8. 4 Ev a lu a t i o n We evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. For SB and LBR, we used the Weka implementations (Weka v3.3.6, http://www.cs.waikato.ac.nz/ml/weka/) with default settings [5]. We implemented the ISA algorithm as a standalone application in Java. The following settings were used for ISA: a maximum of 100 phase-1 models, a threshold T of 0.001 in phase-2, and an upper limit of 500 models in R. For the parameter priors in Equation 10, all αijk were set to 1. All error rates were obtained by averaging the results from two stratified 10-fold cross-validation (20 trials total) similar to that used by Zheng and Webb. Since, both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy based discretization method described in [6]. For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. The error rates of two algorithms on a dataset were compared with a paired t-test carried out at the 5% significance level on the error rate statistics obtained from the 20 trials. The results are shown in Table 1. Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. On two datasets, chess and tic-tac-toe, ISA shows considerable improvement in performance over both SB and LBR. With respect to computation Table 1: Percent error rates of simple Bayes (SB), Lazy Bayesian Rule (LBR) and Instance-Specific Averaging (ISA). A - indicates that the ISA error rate is statistically significantly lower than the marked SB or LBR error rate. A + indicates that the ISA error rate is statistically significantly higher. Dataset Size Annealing Audiology Breast (W) Chess (KR-KP) Credit (A) Echocardiogram Glass Heart (C) Hepatitis Horse colic House votes 84 Hypothyroid Iris Labor LED 24 Liver disorders Lung cancer Lymphography Pima Postoperative Primary tumor Promoters Solar flare Sonar Soybean Splice junction Tic-Tac-Toe Wine Zoo 898 226 699 3169 690 131 214 303 155 368 435 3163 150 57 200 345 32 148 768 90 339 106 1389 208 683 3177 958 178 101 No. of classes 6 24 2 2 2 2 6 2 2 2 2 2 3 2 10 2 3 4 2 3 22 2 2 2 19 3 2 3 7 Num. Attrib. 6 0 9 0 6 6 9 13 6 7 0 7 4 8 0 6 0 0 8 1 0 0 0 60 0 0 0 13 0 Nom. Attrib. 32 69 0 36 9 1 0 0 13 15 16 18 0 8 24 0 56 18 0 7 17 57 10 0 35 60 9 0 16 Percent error rate SB LBR ISA 1.9 3.5 2.7 29.6 29.4 30.9 3.7 2.9 + 2.8 + 1.1 12.1 3.0 13.8 14.0 13.9 33.2 34.0 35.9 26.9 27.8 29.0 16.2 16.2 17.5 14.2 - 14.2 - 11.3 20.2 16.0 17.8 5.1 10.1 7.0 0.9 0.9 1.4 6.0 6.0 5.3 8.8 6.1 7.0 40.5 40.5 40.3 36.8 36.8 36.8 56.3 56.3 56.3 15.5 - 15.5 - 13.2 21.8 22.0 22.3 33.3 33.3 33.3 54.4 53.5 54.2 7.5 7.5 7.5 20.2 18.3 + 19.4 15.4 15.6 15.9 7.1 7.2 7.9 4.7 4.3 4.4 30.3 - 13.7 - 10.3 1.1 1.1 1.1 6.4 8.4 8.4 - times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM. 5 C o n c lu si o n s a n d Fu t u re R e s ea rc h We have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. An instance-specific algorithm like LBR that does model selection has been shown by Zheng and Webb to perform classification better than several eager algorithms [1]. Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR, which provides support that we can obtain additional prediction improvement by performing instance-specific model averaging rather than just instance-specific model selection. In future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. We will also investigate methods to improve the computational efficiency of ISA. In addition, we plan to examine other heuristics for model search as well as more general model spaces such as unrestricted Bayesian networks. The instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation. In the future, we plan to explore other models using this framework. Our ultimate interest is to apply these instancespecific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care. A c k n ow l e d g me n t s This work was supported by the grant T15-LM/DE07059 from the National Library of Medicine (NLM) to the University of Pittsburgh’s Biomedical Informatics Training Program. We would like to thank the three anonymous reviewers for their helpful comments. References [1] Zheng, Z. and Webb, G.I. (2000). Lazy Learning of Bayesian Rules. Machine Learning, 41(1):53-84. [2] Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian Model Averaging: A Tutorial. Statistical Science, 14:382-417. [3] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA. [4] Kontkanen, P., Myllymaki, P., Silander, T., and Tirri, H. (1999). On Supervised Selection of Bayesian Networks. In Proceedings of the 15th International Conference on Uncertainty in Artificial Intelligence, pages 334-342, Stockholm, Sweden. Morgan Kaufmann. [5] Witten, I.H. and Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA. [6] Fayyad, U.M., and Irani, K.B. (1993). Multi-Interval Discretization of ContinuousValued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 1022-1027, San Mateo, CA. Morgan Kaufmann.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Classification algorithms typically induce population-wide models that are trained to perform well on average on expected future instances. [sent-6, score-0.127]

2 We introduce a Bayesian framework for learning instance-specific models from data that are optimized to predict well for a particular instance. [sent-7, score-0.084]

3 Based on this framework, we present a lazy instance-specific algorithm called ISA that performs selective model averaging over a restricted class of Bayesian networks. [sent-8, score-0.5]

4 We intend to apply such instance-specific algorithms to improve the performance of patient-specific predictive models induced from medical data. [sent-10, score-0.127]

5 1 In t ro d u c t i o n Commonly used classification algorithms, such as neural networks, decision trees, Bayesian networks and support vector machines, typically induce a single model from a training set of instances, with the intent of applying it to all future instances. [sent-11, score-0.321]

6 We call such a model a population-wide model because it is intended to be applied to an entire population of future instances. [sent-12, score-0.13]

7 A population-wide model is optimized to predict well on average when applied to expected future instances. [sent-13, score-0.109]

8 In contrast, an instance-specific model is one that is constructed specifically for a particular instance. [sent-14, score-0.086]

9 The structure and parameters of an instance-specific model are specialized to the particular features of an instance, so that it is optimized to predict especially well for that instance. [sent-15, score-0.083]

10 Usually, methods that induce population-wide models employ eager learning in which the model is induced from the training data before the test instance is encountered. [sent-16, score-0.41]

11 In contrast, lazy learning defers most or all processing until a response to a test instance is required. [sent-17, score-0.317]

12 Learners that induce instance-specific models are necessarily lazy in nature since they take advantage of the information in the test instance. [sent-18, score-0.345]

13 An example of a lazy instance-specific method is the lazy Bayesian rule (LBR) learner, implemented by Zheng and Webb [1], which induces rules in a lazy fashion from examples in the neighborhood of the test instance. [sent-19, score-0.681]

14 A rule generated by LBR consists of a conjunction of the attribute-value pairs present in the test instance as the antecedent and a local simple (naïve) Bayes classifier as the consequent. [sent-20, score-0.319]

15 The structure of the local simple Bayes classifier consists of the attribute of interest as the parent of all other attributes that do not appear in the antecedent, and the parameters of the classifier are estimated from the subset of training instances that satisfy the antecedent. [sent-21, score-0.416]

16 A greedy step-forward search selects the optimal LBR rule for a test instance to be classified. [sent-22, score-0.21]

17 When evaluated on 29 UCI datasets, LBR had the lowest average error rate when compared to several eager learning methods [1]. [sent-23, score-0.141]

18 Typically, both eager and lazy algorithms select a single model from some model space, ignoring the uncertainty in model selection. [sent-24, score-0.475]

19 Bayesian model averaging is a coherent approach to dealing with the uncertainty in model selection, and it has been shown to improve the predictive performance of classifiers [2]. [sent-25, score-0.36]

20 However, since the number of models in practically useful model spaces is enormous, exact model averaging over the entire model space is usually not feasible. [sent-26, score-0.391]

21 In this paper, we describe a lazy instance-specific averaging (ISA) algorithm for classification that approximates Bayesian model averaging in an instance-sensitive manner. [sent-27, score-0.751]

22 ISA extends LBR by adding Bayesian model averaging to an instance-specific model selection algorithm. [sent-28, score-0.324]

23 While the ISA algorithm is currently able to directly handle only discrete variables and is computationally more intensive than comparable eager algorithms, the results in this paper show that it performs well. [sent-29, score-0.149]

24 In medicine, such lazy instance-specific algorithms can be applied to patient-specific modeling for improving the accuracy of diagnosis, prognosis and risk assessment. [sent-30, score-0.25]

25 Capital letters like X, Z, denote random variables and corresponding lower case letters, x, z, denote specific values assigned to them. [sent-37, score-0.101]

26 In addition, Z denotes the target variable being predicted, X denotes the set of attribute variables, M denotes a model, D denotes the training dataset, and denotes a generic test instance that is not in D. [sent-41, score-0.424]

27 We now characterize population-wide and instance-specific model selection in decision theoretic terms. [sent-42, score-0.09]

28 Given training data D and a separate generic test instance , the Bayes optimal prediction for Zt is obtained by combining the predictions of all models weighted by their posterior probabilities, as follows: P (Z t | X t , D ) = ∫ P( Z t | X t , M ) P ( M | D )dM . [sent-43, score-0.253]

29 (3) M The optimal instance-specific model for predicting Zt is as follows: { [ ]} max U P ( Z t | X t = x t , D), P (Z t | X t = x t , M ) , M (4) where xt are the values of the attributes of the test instance Xt for which we want to predict Zt. [sent-46, score-0.457]

30 The Bayes optimal estimate P(Zt | Xt = xt, D), in Equation 4 is derived using Equation 1, for the special case in which Xt = xt . [sent-47, score-0.138]

31 Equation 2 for the population-wide model selects the model that on average will have the greatest utility. [sent-49, score-0.174]

32 Equation 4 for the instance-specific model, however, selects the model that will have the greatest expected utility for the specific instance Xt = xt . [sent-50, score-0.391]

33 For predicting Zt in a given instance Xt = xt, the model selected using Equation 2 can never have an expected utility greater than the model selected using Equation 4. [sent-51, score-0.238]

34 Equations 2 and 4 represent theoretical ideals for population-wide and instancespecific model selection, respectively; we are not suggesting they are practical to compute. [sent-53, score-0.098]

35 The current paper focuses on model averaging, rather than model selection. [sent-54, score-0.104]

36 Ideal Bayesian model averaging is given by Equation 1. [sent-55, score-0.234]

37 Model averaging has previously been applied using population-wide models. [sent-56, score-0.182]

38 Studies have shown that approximate Bayesian model averaging using population-wide models can improve predictive performance over population-wide model selection [2]. [sent-57, score-0.451]

39 The current paper concentrates on investigating the predictive performance of approximate Bayesian model averaging using instance-specific models. [sent-58, score-0.274]

40 3 In st an ce- S p eci fi c Algo ri t h m We present the implementation of the lazy instance-specific algorithm based on the above framework. [sent-59, score-0.204]

41 ISA searches the space of a restricted class of Bayesian networks to select a subset of the models over which to derive a weighted (averaged) posterior of the target variable Zt . [sent-60, score-0.249]

42 A key characteristic of the search is the use of a heuristic to select models that will have a significant influence on the weighted posterior. [sent-61, score-0.09]

43 Specifically, a Bayesian network M representing the set of variables X consists of a pair (G, ΘG ). [sent-65, score-0.085]

44 G is a directed acyclic graph that contains a node for every variable in X and an arc between every pair of nodes if the corresponding variables are directly probabilistically dependent. [sent-66, score-0.322]

45 Conversely, the absence of an arc between a pair of nodes denotes probabilistic independence between the corresponding variables. [sent-67, score-0.194]

46 In a Bayesian network M, the immediate predecessors of a node X i in X are called the parents of X i and the successors, both immediate and remote, of Xi in X are called the descendants of X i . [sent-69, score-0.403]

47 The immediate successors of X i are called the children of X i . [sent-70, score-0.104]

48 For each node Xi there is a local probability distribution (that may be discrete or continuous) on that node given the state of its parents. [sent-71, score-0.21]

49 The complete joint probability distribution over X, represented by the parameterization ΘG, can be factored into a product of local probability distributions defined on each node in the network. [sent-72, score-0.183]

50 This factorization is determined by the independences captured by the structure of the Bayesian network and is formalized in the Bayesian network Markov condition: A node (representing a variable) is independent of its nondescendants given just its parents. [sent-73, score-0.207]

51 According to this Markov condition, the joint probability distribution on model variables X = (X1 , X 2, …, X n ) can be factored as follows: n P ( X 1 , X 2 , . [sent-74, score-0.138]

52 , X n ) = ∏ P ( X i | parents( X i )) , (5) i =1 where parents(Xi ) denotes the set of nodes that are the parents of X i . [sent-77, score-0.285]

53 2 I S A M od e l s The LBR models of Zheng and Webb [1] can be represented as members of a restricted class of Bayesian networks (see Figure 1). [sent-80, score-0.177]

54 In Figure 1, all nodes represent attributes that are discrete. [sent-82, score-0.151]

55 Each node in X has either an outgoing arc into target node, Z, or receives an arc from Z. [sent-83, score-0.352]

56 That is, each node is either a parent or a child of Z. [sent-84, score-0.192]

57 Thus, X is partitioned into two sets: the first containing nodes (X 1 , …, X j in Figure 1) each of which is a parent of Z and every node in the second set, and the second containing nodes (X j+1 , …, X k in Figure 1) that have as parents the node Z and every node in the first set. [sent-85, score-0.661]

58 The nodes in the first set are instantiated to the corresponding values in the test instance for which Zt is to be predicted. [sent-86, score-0.21]

59 Thus, the first set of nodes represents the antecedent of the LBR rule and the second set of nodes represents the consequent. [sent-87, score-0.243]

60 Xk Figure 1: An example of a Bayesian network LBR model with target node Z and k attribute nodes of which X1 , …, X j are instantiated to values x 1 , …, x j in xt . [sent-94, score-0.514]

61 X 1, …, X j are present in the antecedent of the LBR rule and Z, X j+1 , …, X k (that form the local simple Bayes classifier) are present in the consequent. [sent-95, score-0.121]

62 3 M od e l A ve r ag i n g For Bayesian networks, Equation 1 can be evaluated as follows: P ( Z t | x t , D ) = ∑ P ( Z t | x t , M ) P( M | D ) , (6) M with M being a Bayesian network comprised of structure G and parameters ΘG. [sent-98, score-0.137]

63 The probability distribution of interest is a weighted average of the posterior distribution over all possible Bayesian networks where the weight is the probability of the Bayesian network given the data. [sent-99, score-0.115]

64 Since exhaustive enumeration of all possible models is not feasible, even for this class of simple Bayesian networks, we approximate exact model averaging with selective model averaging. [sent-100, score-0.372]

65 Let R be the set of models selected by the search procedure from all possible models in the model space, as described in the next section. [sent-101, score-0.195]

66 Then, with selective model averaging, P(Zt | xt, D) is estimated as: ∑RP( Z t | x t , M ) P(M | D) P (Z t | x t , D) ≅ M ∈ . [sent-102, score-0.085]

67 ∑RP( D | M ) M∈ M ∈R (8) The (unconditional) marginal likelihood P(D | M) in Equation 8, is a measure of the goodness of fit of the model to the data and is also known as the model score. [sent-104, score-0.184]

68 While this score is suitable for assessing the model’s fit to the joint probability distribution, it is not necessarily appropriate for assessing the goodness of fit to a conditional probability distribution which is the focus in prediction and classification tasks, as is the case here. [sent-105, score-0.44]

69 A more suitable score in this situation is a conditional model score that is computed from training data D of d instances as: d score( D, M ) = ∏ P ( z p | x1 ,. [sent-106, score-0.273]

70 Despite this limitation, it has been shown to be an effective scoring criterion for classification models [4]. [sent-115, score-0.184]

71 4 M od e l Se a r c h We use a two-phase best-first heuristic search to sample the model space. [sent-118, score-0.149]

72 The first phase ignores the evidence xt in the test instance while searching for models that have high scores as given by Equation 9. [sent-119, score-0.345]

73 This is followed by the second phase that searches for models having the greatest impact on the prediction of Zt for the test instance, which we formalize below. [sent-120, score-0.252]

74 The first phase searches for models that predict Z in the training data very well; these are the models that have high conditional model scores. [sent-121, score-0.309]

75 The initial model is the simple Bayes network that includes all the attributes in X as children of Z. [sent-122, score-0.227]

76 A succeeding model is derived from a current model by reversing the arc of a child node in the current model, adding new outgoing arcs from it to Z and the remaining children, and instantiating this node to the value in the test instance. [sent-123, score-0.528]

77 An incoming arc of a child node is considered for reversal only if the node’s value is not missing in the test instance. [sent-125, score-0.291]

78 During each iteration of the search, the model with the highest score (given by Equation 9) is removed from Q and placed in a set R, following which new models are generated as described just above, scored and added to Q. [sent-127, score-0.178]

79 The first phase terminates after a user-specified number of models have accumulated in R. [sent-128, score-0.121]

80 The second phase searches for models that change the current model-averaged estimate of P(Zt | xt , D) the most. [sent-129, score-0.282]

81 The idea here is to find viable competing models for making this posterior probability prediction. [sent-130, score-0.082]

82 When no competitive models can be found, the prediction becomes stable. [sent-131, score-0.082]

83 The ranking is based on how much the model changes the current estimate of P(Zt | xt , D). [sent-133, score-0.19]

84 (12) M S The second phase terminates when no new model can be found that has a value (as given by Equation 11) that is greater than a user-specified minimum threshold T. [sent-136, score-0.12]

85 4 Ev a lu a t i o n We evaluated ISA on the 29 UCI datasets that Zheng and Webb used for the evaluation of LBR. [sent-138, score-0.098]

86 On the same datasets, we also evaluated a simple Bayes classifier (SB) and LBR. [sent-139, score-0.111]

87 All error rates were obtained by averaging the results from two stratified 10-fold cross-validation (20 trials total) similar to that used by Zheng and Webb. [sent-151, score-0.182]

88 Since, both LBR and ISA can handle only discrete attributes, all numeric attributes were discretized in a pre-processing step using the entropy based discretization method described in [6]. [sent-152, score-0.129]

89 For each pair of training and test folds, the discretization intervals were first estimated from the training fold and then applied to both folds. [sent-153, score-0.137]

90 Compared to SB, ISA has significantly fewer errors on 9 datasets and significantly more errors on one dataset. [sent-156, score-0.206]

91 Compared to LBR, ISA has significantly fewer errors on 7 datasets and significantly more errors on two datasets. [sent-157, score-0.206]

92 4 - times, ISA took 6 times longer to run than LBR on average for a single test instance on a desktop computer with a 2 GHz Pentium 4 processor and 3 GB of RAM. [sent-254, score-0.113]

93 5 C o n c lu si o n s a n d Fu t u re R e s ea rc h We have introduced a Bayesian framework for instance-specific model averaging and presented ISA as one example of a classification algorithm based on this framework. [sent-255, score-0.393]

94 An instance-specific algorithm like LBR that does model selection has been shown by Zheng and Webb to perform classification better than several eager algorithms [1]. [sent-256, score-0.336]

95 Our results show that ISA, which extends LBR by adding Bayesian model averaging, improves overall on LBR, which provides support that we can obtain additional prediction improvement by performing instance-specific model averaging rather than just instance-specific model selection. [sent-257, score-0.367]

96 In future work, we plan to explore further the behavior of ISA with respect to the number of models being averaged and the effect of the number of models selected in each of the two phases of the search. [sent-258, score-0.162]

97 In addition, we plan to examine other heuristics for model search as well as more general model spaces such as unrestricted Bayesian networks. [sent-260, score-0.171]

98 The instance-specific framework is not restricted to the Bayesian network models that we have used in this investigation. [sent-261, score-0.133]

99 In the future, we plan to explore other models using this framework. [sent-262, score-0.083]

100 Our ultimate interest is to apply these instancespecific algorithms to improve patient-specific predictions (for diagnosis, therapy selection, and prognosis) and thereby to improve patient care. [sent-263, score-0.114]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('isa', 0.528), ('lbr', 0.436), ('lazy', 0.204), ('bayesian', 0.2), ('parents', 0.187), ('averaging', 0.182), ('zt', 0.179), ('ijk', 0.146), ('xt', 0.138), ('classification', 0.131), ('zheng', 0.12), ('eager', 0.115), ('webb', 0.115), ('sb', 0.107), ('node', 0.105), ('arc', 0.096), ('antecedent', 0.092), ('attributes', 0.09), ('classifier', 0.085), ('significantly', 0.081), ('score', 0.073), ('instance', 0.073), ('equation', 0.067), ('nodes', 0.061), ('od', 0.06), ('bayes', 0.056), ('models', 0.053), ('model', 0.052), ('network', 0.051), ('child', 0.05), ('searches', 0.05), ('induce', 0.048), ('instancespecific', 0.046), ('prognosis', 0.046), ('pth', 0.046), ('shyam', 0.046), ('weka', 0.046), ('instances', 0.046), ('biomedical', 0.046), ('datasets', 0.044), ('attribute', 0.044), ('fit', 0.044), ('phase', 0.041), ('test', 0.04), ('predictive', 0.04), ('successors', 0.04), ('discretization', 0.039), ('greatest', 0.039), ('selection', 0.038), ('xi', 0.038), ('search', 0.037), ('letters', 0.037), ('denotes', 0.037), ('informatics', 0.037), ('parent', 0.037), ('chess', 0.036), ('goodness', 0.036), ('instantiated', 0.036), ('medicine', 0.036), ('rp', 0.036), ('networks', 0.035), ('improve', 0.034), ('variables', 0.034), ('diagnosis', 0.034), ('specifically', 0.034), ('children', 0.034), ('pittsburgh', 0.034), ('morgan', 0.034), ('selective', 0.033), ('predicting', 0.033), ('artificial', 0.032), ('predict', 0.031), ('selects', 0.031), ('specific', 0.03), ('immediate', 0.03), ('plan', 0.03), ('kaufmann', 0.03), ('prediction', 0.029), ('mateo', 0.029), ('percent', 0.029), ('training', 0.029), ('rule', 0.029), ('restricted', 0.029), ('posterior', 0.029), ('intelligent', 0.028), ('dm', 0.028), ('lu', 0.028), ('assessing', 0.028), ('outgoing', 0.028), ('utility', 0.028), ('target', 0.027), ('terminates', 0.027), ('ij', 0.027), ('joint', 0.027), ('variable', 0.026), ('bold', 0.026), ('defined', 0.026), ('evaluated', 0.026), ('future', 0.026), ('factored', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Author: Shyam Visweswaran, Gregory F. Cooper

2 0.13397653 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We provide a worst-case analysis of selective sampling algorithms for learning linear threshold functions. The algorithms considered in this paper are Perceptron-like algorithms, i.e., algorithms which can be efﬁciently run in any reproducing kernel Hilbert space. Our algorithms exploit a simple margin-based randomized rule to decide whether to query the current label. We obtain selective sampling algorithms achieving on average the same bounds as those proven for their deterministic counterparts, but using much fewer labels. We complement our theoretical ﬁndings with an empirical comparison on two text categorization tasks. The outcome of these experiments is largely predicted by our theoretical results: Our selective sampling algorithms tend to perform as good as the algorithms receiving the true label after each classiﬁcation, while observing in practice substantially fewer labels. 1

3 0.1316376 48 nips-2004-Convergence and No-Regret in Multiagent Learning

Author: Michael Bowling

Abstract: Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner’s particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identiﬁable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normalform games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR). 1

4 0.073341034 183 nips-2004-Temporal-Difference Networks

Author: Richard S. Sutton, Brian Tanner

Abstract: We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single prediction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predictions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a ﬁxed interval, which is not possible with conventional TD methods. Secondly, we show that if the interpredictive relationships are made conditional on action, then the usual learning-efﬁciency advantage of TD methods over Monte Carlo (supervised learning) methods becomes particularly pronounced. Thirdly, we demonstrate that TD networks can learn predictive state representations that enable exact solution of a non-Markov problem. A very broad range of inter-predictive temporal relationships can be expressed in these networks. Overall we argue that TD networks represent a substantial extension of the abilities of TD methods and bring us closer to the goal of representing world knowledge in entirely predictive, grounded terms. Temporal-difference (TD) learning is widely used in reinforcement learning methods to learn moment-to-moment predictions of total future reward (value functions). In this setting, TD learning is often simpler and more data-efﬁcient than other methods. But the idea of TD learning can be used more generally than it is in reinforcement learning. TD learning is a general method for learning predictions whenever multiple predictions are made of the same event over time, value functions being just one example. The most pertinent of the more general uses of TD learning have been in learning models of an environment or task domain (Dayan, 1993; Kaelbling, 1993; Sutton, 1995; Sutton, Precup & Singh, 1999). In these works, TD learning is used to predict future values of many observations or state variables of a dynamical system. The essential idea of TD learning can be described as “learning a guess from a guess”. In all previous work, the two guesses involved were predictions of the same quantity at two points in time, for example, of the discounted future reward at successive time steps. In this paper we explore a few of the possibilities that open up when the second guess is allowed to be different from the ﬁrst. To be more precise, we must make a distinction between the extensive deﬁnition of a prediction, expressing its desired relationship to measurable data, and its TD deﬁnition, expressing its desired relationship to other predictions. In reinforcement learning, for example, state values are extensively deﬁned as an expectation of the discounted sum of future rewards, while they are TD deﬁned as the solution to the Bellman equation (a relationship to the expectation of the value of successor states, plus the immediate reward). It’s the same prediction, just deﬁned or expressed in different ways. In past work with TD methods, the TD relationship was always between predictions with identical or very similar extensive semantics. In this paper we retain the TD idea of learning predictions based on others, but allow the predictions to have different extensive semantics. 1 The Learning-to-predict Problem The problem we consider in this paper is a general one of learning to predict aspects of the interaction between a decision making agent and its environment. At each of a series of discrete time steps t, the environment generates an observation o t ∈ O, and the agent takes an action at ∈ A. Whereas A is an arbitrary discrete set, we assume without loss of generality that ot can be represented as a vector of bits. The action and observation events occur in sequence, o1 , a1 , o2 , a2 , o3 · · ·, with each event of course dependent only on those preceding it. This sequence will be called experience. We are interested in predicting not just each next observation but more general, action-conditional functions of future experience, as discussed in the next section. In this paper we use a random-walk problem with seven states, with left and right actions available in every state: 1 1 0 2 0 3 0 4 0 5 0 6 1 7 The observation upon arriving in a state consists of a special bit that is 1 only at the two ends of the walk and, in the ﬁrst two of our three experiments, seven additional bits explicitly indicating the state number (only one of them is 1). This is a continuing task: reaching an end state does not end or interrupt experience. Although the sequence depends deterministically on action, we assume that the actions are selected randomly with equal probability so that the overall system can be viewed as a Markov chain. The TD networks introduced in this paper can represent a wide variety of predictions, far more than can be represented by a conventional TD predictor. In this paper we take just a few steps toward more general predictions. In particular, we consider variations of the problem of prediction by a ﬁxed interval. This is one of the simplest cases that cannot otherwise be handled by TD methods. For the seven-state random walk, we will predict the special observation bit some numbers of discrete steps in advance, ﬁrst unconditionally and then conditioned on action sequences. 2 TD Networks A TD network is a network of nodes, each representing a single scalar prediction. The nodes are interconnected by links representing the TD relationships among the predictions and to the observations and actions. These links determine the extensive semantics of each prediction—its desired or target relationship to the data. They represent what we seek to predict about the data as opposed to how we try to predict it. We think of these links as determining a set of questions being asked about the data, and accordingly we call them the question network. A separate set of interconnections determines the actual computational process—the updating of the predictions at each node from their previous values and the current action and observation. We think of this process as providing the answers to the questions, and accordingly we call them the answer network. The question network provides targets for a learning process shaping the answer network and does not otherwise affect the behavior of the TD network. It is natural to consider changing the question network, but in this paper we take it as ﬁxed and given. Figure 1a shows a suggestive example of a question network. The three squares across the top represent three observation bits. The node labeled 1 is directly connected to the ﬁrst observation bit and represents a prediction that that bit will be 1 on the next time step. The node labeled 2 is similarly a prediction of the expected value of node 1 on the next step. Thus the extensive deﬁnition of Node 2’s prediction is the probability that the ﬁrst observation bit will be 1 two time steps from now. Node 3 similarly predicts the ﬁrst observation bit three time steps in the future. Node 4 is a conventional TD prediction, in this case of the future discounted sum of the second observation bit, with discount parameter γ. Its target is the familiar TD target, the data bit plus the node’s own prediction on the next time step (with weightings 1 − γ and γ respectively). Nodes 5 and 6 predict the probability of the third observation bit being 1 if particular actions a or b are taken respectively. Node 7 is a prediction of the average of the ﬁrst observation bit and Node 4’s prediction, both on the next step. This is the ﬁrst case where it is not easy to see or state the extensive semantics of the prediction in terms of the data. Node 8 predicts another average, this time of nodes 4 and 5, and the question it asks is even harder to express extensively. One could continue in this way, adding more and more nodes whose extensive deﬁnitions are difﬁcult to express but which would nevertheless be completely deﬁned as long as these local TD relationships are clear. The thinner links shown entering some nodes are meant to be a suggestion of the entirely separate answer network determining the actual computation (as opposed to the goals) of the network. In this paper we consider only simple question networks such as the left column of Figure 1a and of the action-conditional tree form shown in Figure 1b. 1−γ 1 4 γ a 5 b L 6 L 2 7 R L R R 8 3 (a) (b) Figure 1: The question networks of two TD networks. (a) a question network discussed in the text, and (b) a depth-2 fully-action-conditional question network used in Experiments 2 and 3. Observation bits are represented as squares across the top while actual nodes of the TD network, corresponding each to a separate prediction, are below. The thick lines represent the question network and the thin lines in (a) suggest the answer network (the bulk of which is not shown). Note that all of these nodes, arrows, and numbers are completely different and separate from those representing the random-walk problem on the preceding page. i More formally and generally, let yt ∈ [0, 1], i = 1, . . . , n, denote the prediction of the 1 n ith node at time step t. The column vector of predictions yt = (yt , . . . , yt )T is updated according to a vector-valued function u with modiﬁable parameter W: yt = u(yt−1 , at−1 , ot , Wt ) ∈ n . (1) The update function u corresponds to the answer network, with W being the weights on its links. Before detailing that process, we turn to the question network, the deﬁning TD i i relationships between nodes. The TD target zt for yt is an arbitrary function z i of the successive predictions and observations. In vector form we have 1 zt = z(ot+1 , ˜t+1 ) ∈ n , y (2) where ˜t+1 is just like yt+1 , as in (1), except calculated with the old weights before they y are updated on the basis of zt : ˜t = u(yt−1 , at−1 , ot , Wt−1 ) ∈ n . y (3) (This temporal subtlety also arises in conventional TD learning.) For example, for the 1 2 1 3 2 4 4 nodes in Figure 1a we have zt = o1 , zt = yt+1 , zt = yt+1 , zt = (1 − γ)o2 + γyt+1 , t+1 t+1 1 1 1 4 1 4 1 5 5 6 3 7 8 zt = zt = ot+1 , zt = 2 ot+1 + 2 yt+1 , and zt = 2 yt+1 + 2 yt+1 . The target functions z i are only part of specifying the question network. The other part has to do with making them potentially conditional on action and observation. For example, Node 5 in Figure 1a predicts what the third observation bit will be if action a is taken. To arrange for such i semantics we introduce a new vector ct of conditions, ci , indicating the extent to which yt t i is held responsible for matching zt , thus making the ith prediction conditional on ci . Each t ci is determined as an arbitrary function ci of at and yt . In vector form we have: t ct = c(at , yt ) ∈ [0, 1]n . (4) For example, for Node 5 in Figure 1a, c5 = 1 if at = a, otherwise c5 = 0. t t Equations (2–4) correspond to the question network. Let us now turn to deﬁning u, the update function for yt mentioned earlier and which corresponds to the answer network. In general u is an arbitrary function approximator, but for concreteness we deﬁne it to be of a linear form yt = σ(Wt xt ) (5) m where xt ∈ is a feature vector, Wt is an n × m matrix, and σ is the n-vector form of the identity function (Experiments 1 and 2) or the S-shaped logistic function σ(s) = 1 1+e−s (Experiment 3). The feature vector is an arbitrary function of the preceding action, observation, and node values: xt = x(at−1 , ot , yt−1 ) ∈ m . (6) For example, xt might have one component for each observation bit, one for each possible action (one of which is 1, the rest 0), and n more for the previous node values y t−1 . The ij learning algorithm for each component wt of Wt is ij ij i i wt+1 − wt = α(zt − yt )ci t i ∂yt , (7) ij ∂wt where α is a step-size parameter. The timing details may be clariﬁed by writing the sequence of quantities in the order in which they are computed: yt at ct ot+1 xt+1 ˜t+1 zt Wt+1 yt+1 . y (8) Finally, the target in the extensive sense for yt is (9) y∗ = Et,π (1 − ct ) · y∗ + ct · z(ot+1 , y∗ ) , t t+1 t where · represents component-wise multiplication and π is the policy being followed, which is assumed ﬁxed. 1 In general, z is a function of all the future predictions and observations, but in this paper we treat only the one-step case. 3 Experiment 1: n-step Unconditional Prediction In this experiment we sought to predict the observation bit precisely n steps in advance, for n = 1, 2, 5, 10, and 25. In order to predict n steps in advance, of course, we also have to predict n − 1 steps in advance, n − 2 steps in advance, etc., all the way down to predicting one step ahead. This is speciﬁed by a TD network consisting of a single chain of predictions like the left column of Figure 1a, but of length 25 rather than 3. Random-walk sequences were constructed by starting at the center state and then taking random actions for 50, 100, 150, and 200 steps (100 sequences each). We applied a TD network and a corresponding Monte Carlo method to this data. The Monte Carlo method learned the same predictions, but learned them by comparing them to the i actual outcomes in the sequence (instead of zt in (7)). This involved signiﬁcant additional complexity to store the predictions until their corresponding targets were available. Both algorithms used feature vectors of 7 binary components, one for each of the seven states, all of which were zero except for the one corresponding to the current state. Both algorithms formed their predictions linearly (σ(·) was the identity) and unconditionally (c i = 1 ∀i, t). t In an initial set of experiments, both algorithms were applied online with a variety of values for their step-size parameter α. Under these conditions we did not ﬁnd that either algorithm was clearly better in terms of the mean square error in their predictions over the data sets. We found a clearer result when both algorithms were trained using batch updating, in which weight changes are collected “on the side” over an experience sequence and then made all at once at the end, and the whole process is repeated until convergence. Under batch updating, convergence is to the same predictions regardless of initial conditions or α value (as long as α is sufﬁciently small), which greatly simpliﬁes comparison of algorithms. The predictions learned under batch updating are also the same as would be computed by least squares algorithms such as LSTD(λ) (Bradtke & Barto, 1996; Boyan, 2000; Lagoudakis & Parr, 2003). The errors in the ﬁnal predictions are shown in Table 1. For 1-step predictions, the Monte-Carlo and TD methods performed identically of course, but for longer predictions a signiﬁcant difference was observed. The RMSE of the Monte Carlo method increased with prediction length whereas for the TD network it decreased. The largest standard error in any of the numbers shown in the table is 0.008, so almost all of the differences are statistically signiﬁcant. TD methods appear to have a signiﬁcant data-efﬁciency advantage over non-TD methods in this prediction-by-n context (and this task) just as they do in conventional multi-step prediction (Sutton, 1988). Time Steps 50 100 150 200 1-step MC/TD 0.205 0.124 0.089 0.076 2-step MC TD 0.219 0.172 0.133 0.100 0.103 0.073 0.084 0.060 5-step MC TD 0.234 0.159 0.160 0.098 0.121 0.076 0.109 0.065 10-step MC TD 0.249 0.139 0.168 0.079 0.130 0.063 0.112 0.056 25-step MC TD 0.297 0.129 0.187 0.068 0.153 0.054 0.118 0.049 Table 1: RMSE of Monte-Carlo and TD-network predictions of various lengths and for increasing amounts of training data on the random-walk example with batch updating. 4 Experiment 2: Action-conditional Prediction The advantage of TD methods should be greater for predictions that apply only when the experience sequence unfolds in a particular way, such as when a particular sequence of actions are made. In a second experiment we sought to learn n-step-ahead predictions conditional on action selections. The question network for learning all 2-step-ahead pre- dictions is shown in Figure 1b. The upper two nodes predict the observation bit conditional on taking a left action (L) or a right action (R). The lower four nodes correspond to the two-step predictions, e.g., the second lower node is the prediction of what the observation bit will be if an L action is taken followed by an R action. These predictions are the same as the e-tests used in some of the work on predictive state representations (Littman, Sutton & Singh, 2002; Rudary & Singh, 2003). In this experiment we used a question network like that in Figure 1b except of depth four, consisting of 30 (2+4+8+16) nodes. The conditions for each node were set to 0 or 1 depending on whether the action taken on the step matched that indicated in the ﬁgure. The feature vectors were as in the previous experiment. Now that we are conditioning on action, the problem is deterministic and α can be set uniformly to 1. A Monte Carlo prediction can be learned only when its corresponding action sequence occurs in its entirety, but then it is complete and accurate in one step. The TD network, on the other hand, can learn from incomplete sequences but must propagate them back one level at a time. First the one-step predictions must be learned, then the two-step predictions from them, and so on. The results for online and batch training are shown in Tables 2 and 3. As anticipated, the TD network learns much faster than Monte Carlo with both online and batch updating. Because the TD network learns its n step predictions based on its n − 1 step predictions, it has a clear advantage for this task. Once the TD Network has seen each action in each state, it can quickly learn any prediction 2, 10, or 1000 steps in the future. Monte Carlo, on the other hand, must sample actual sequences, so each exact action sequence must be observed. Time Step 100 200 300 400 500 1-Step MC/TD 0.153 0.019 0.000 0.000 0.000 2-Step MC TD 0.222 0.182 0.092 0.044 0.040 0.000 0.019 0.000 0.019 0.000 3-Step MC TD 0.253 0.195 0.142 0.054 0.089 0.013 0.055 0.000 0.038 0.000 4-Step MC TD 0.285 0.185 0.196 0.062 0.139 0.017 0.093 0.000 0.062 0.000 Table 2: RMSE of the action-conditional predictions of various lengths for Monte-Carlo and TD-network methods on the random-walk problem with online updating. Time Steps 50 100 150 200 MC 53.48% 30.81% 19.26% 11.69% TD 17.21% 4.50% 1.57% 0.14% Table 3: Average proportion of incorrect action-conditional predictions for batch-updating versions of Monte-Carlo and TD-network methods, for various amounts of data, on the random-walk task. All differences are statistically signiﬁcant. 5 Experiment 3: Learning a Predictive State Representation Experiments 1 and 2 showed advantages for TD learning methods in Markov problems. The feature vectors in both experiments provided complete information about the nominal state of the random walk. In Experiment 3, on the other hand, we applied TD networks to a non-Markov version of the random-walk example, in particular, in which only the special observation bit was visible and not the state number. In this case it is not possible to make accurate predictions based solely on the current action and observation; the previous time step’s predictions must be used as well. As in the previous experiment, we sought to learn n-step predictions using actionconditional question networks of depths 2, 3, and 4. The feature vector xt consisted of three parts: a constant 1, four binary features to represent the pair of action a t−1 and observation bit ot , and n more features corresponding to the components of y t−1 . The features vectors were thus of length m = 11, 19, and 35 for the three depths. In this experiment, σ(·) was the S-shaped logistic function. The initial weights W0 and predictions y0 were both 0. Fifty random-walk sequences were constructed, each of 250,000 time steps, and presented to TD networks of the three depths, with a range of step-size parameters α. We measured the RMSE of all predictions made by the networks (computed from knowledge of the task) and also the “empirical RMSE,” the error in the one-step prediction for the action actually taken on each step. We found that in all cases the errors approached zero over time, showing that the problem was completely solved. Figure 2 shows some representative learning curves for the depth-2 and depth-4 TD networks. .3 Empirical RMS error .2 α=.1 .1 α=.5 α=.5 α=.75 0 0 α=.25 depth 2 50K 100K 150K 200K 250K Time Steps Figure 2: Prediction performance on the non-Markov random walk with depth-4 TD networks (and one depth-2 network) with various step-size parameters, averaged over 50 runs and 1000 time-step bins. The “bump” most clearly seen with small step sizes is reliably present and may be due to predictions of different lengths being learned at different times. In ongoing experiments on other non-Markov problems we have found that TD networks do not always ﬁnd such complete solutions. Other problems seem to require more than one step of history information (the one-step-preceding action and observation), though less than would be required using history information alone. Our results as a whole suggest that TD networks may provide an effective alternative learning algorithm for predictive state representations (Littman et al., 2000). Previous algorithms have been found to be effective on some tasks but not on others (e.g, Singh et al., 2003; Rudary & Singh, 2004; James & Singh, 2004). More work is needed to assess the range of effectiveness and learning rate of TD methods vis-a-vis previous methods, and to explore their combination with history information. 6 Conclusion TD networks suggest a large set of possibilities for learning to predict, and in this paper we have begun exploring the ﬁrst few. Our results show that even in a fully observable setting there may be signiﬁcant advantages to TD methods when learning TD-deﬁned predictions. Our action-conditional results show that TD methods can learn dramatically faster than other methods. TD networks allow the expression of many new kinds of predictions whose extensive semantics is not immediately clear, but which are ultimately fully grounded in data. It may be fruitful to further explore the expressive potential of TD-deﬁned predictions. Although most of our experiments have concerned the representational expressiveness and efﬁciency of TD-deﬁned predictions, it is also natural to consider using them as state, as in predictive state representations. Our experiments suggest that this is a promising direction and that TD learning algorithms may have advantages over previous learning methods. Finally, we note that adding nodes to a question network produces new predictions and thus may be a way to address the discovery problem for predictive representations. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Satinder Singh, Doina Precup, Michael Littman, Mark Ring, Vadim Bulitko, Eddie Rafols, Anna Koop, Tao Wang, and all the members of the rlai.net group. References Boyan, J. A. (2000). Technical update: Least-squares temporal difference learning. Machine Learning 49:233–246. Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1/2/3):33–57. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation 5(4):613–624. James, M. and Singh, S. (2004). Learning and discovery of predictive state representations in dynamical systems with reset. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 417–424. Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, pp. 167–173. Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107–1149. Littman, M. L., Sutton, R. S. and Singh, S. (2002). Predictive representations of state. In Advances In Neural Information Processing Systems 14:1555–1561. Rudary, M. R. and Singh, S. (2004). A nonlinear predictive state representation. In Advances in Neural Information Processing Systems 16:855–862. Singh, S., Littman, M. L., Jong, N. K., Pardoe, D. and Stone, P. (2003) Learning predictive state representations. In Proceedings of the Twentieth Int. Conference on Machine Learning, pp. 712–719. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3:9–44. Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In A. Prieditis and S. Russell (eds.), Proceedings of the Twelfth International Conference on Machine Learning, pp. 531–539. Morgan Kaufmann, San Francisco. Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artiﬁcial Intelligence 112:181–121.

5 0.072881997 82 nips-2004-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni

Abstract: We study the problem of hierarchical classiﬁcation when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classiﬁer for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classiﬁer with a ﬂat SVM classiﬁer and with classiﬁers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classiﬁer performs, after just one training epoch, nearly as well as the hierarchical SVM classiﬁer (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classiﬁer based on the true model parameters. 1 Introduction and basic deﬁnitions We study the problem of classifying data in a given taxonomy of labels, where the taxonomy is speciﬁed as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that multilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf). Learning algorithms for hierarchical classiﬁcation have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classiﬁcation tasks, and we derive the corresponding Bayes-optimal classiﬁer under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based ∗ This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reﬂects the authors’ views. algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classiﬁer; 3) a simpliﬁed variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classiﬁer based on the true model parameters. Our incremental algorithms are based on training linear-threshold classiﬁers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classiﬁcations as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classiﬁcation system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a ﬂat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x ∈ Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1 , . . . , yN ) ∈ {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are deﬁned over the set of labels. A multilabel y ∈ {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some ﬁxed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate φ over a set Ω, we will use {φ} to denote both the subset of Ω where φ is true and the indicator function of this subset. 2 The H-loss Though several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), no one has emerged as a standard yet. Since hierarchical losses are deﬁned over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels y = (y1 , ..., yN ) and y = (y1 , ..., yN ): the 0/1-loss 0/1 (y, y) = {∃i : yi = yi } and the symmetric difference loss ∆ (y, y) = {y1 = y1 } + . . . + {yN = yN }. There are several ways of making these losses depend on a given taxonomy G. In this work, we follow the intuition “if a mistake is made at node i, then further mistakes made in the subtree rooted at i are unimportant”. That is, we do not require the algorithm be able to make ﬁne-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is deﬁned by N H (y, y) = i=1 ci {yi = yi ∧ yj = yj , j ∈ anc(i)}, where c1 , . . . , cN > 0 are ﬁxed cost coefﬁcients. This loss, which we call H-loss, can also be described as follows: all paths in G from a root down to a leaf are examined and, whenever we encounter a node i such that yi = yi , we add ci to the loss, whereas all the loss contributions in the subtree rooted at i are discarded. Note that if c1 = . . . = cN = 1 then 0/1 ≤ H ≤ ∆ . Choices of ci depending on the structure of G are proposed in Section 4. Given a multilabel y ∈ {0, 1}N deﬁne its G-truncation as the multilabel y = (y1 , ..., yN ) ∈ {0, 1}N where, for each i = 1, . . . , N , yi = 1 iff yi = 1 and yj = 1 for all j ∈ anc(i). Note that the G-truncation of any multilabel always respects G. A graphical (a) (b) (c) (d) Figure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy G, hence in this case N = 12. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect G; (b) its G-truncation. (c) A second multilabel that respects G. (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). representation of the notions introduced so far is given in Figure 1. In the next lemma we show that whenever y respects G, then H (y, y) cannot be smaller than H (y , y). In other words, when the multilabel y to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. Lemma 1 Let G be a taxonomy, y, y ∈ {0, 1}N be two multilabels such that y respects G, and y be the G-truncation of y. Then H (y , y) ≤ H (y, y) . Proof. For each i = 1, . . . , N we show that yi = yi and yj = yj for all j ∈ anc(i) implies yi = yi and yj = yj for all j ∈ anc(i). Pick some i and suppose yi = yi and yj = yj for all j ∈ anc(i). Now suppose yj = 0 (and thus yj = 0) for some j ∈ anc(i). Then yi = 0 since y respects G. But this implies yi = 1, contradicting the fact that the G-truncation y respects G. Therefore, it must be the case that yj = yj = 1 for all j ∈ anc(i). Hence the G-truncation of y left each node j ∈ anc(i) unchanged, implying yj = yj for all j ∈ anc(i). But, since the G-truncation of y does not change the value of a node i whose ancestors j are such that yj = 1, this also implies yi = yi . Therefore yi = yi and the proof is concluded. 3 A probabilistic data model Our learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model deﬁnes a probability distribution fG over the set of multilabels respecting a given taxonomy G by associating with each node i of G a Bernoulli random variable Yi and deﬁning fG (y | x) = N i=1 P Yi = yi | Ypar(i) = ypar(i) , X = x . To guarantee that fG (y | x) = 0 whenever y ∈ {0, 1}N does not respect G, we set P Yi = 1 | Ypar(i) = 0, X = x = 0. Notice that this deﬁnition of fG makes the (rather simplistic) assumption that all Yk with the same parent node i (i.e., the children of i) are independent when conditioned on Yi and x. Through fG we specify an i.i.d. process {(X 1 , Y 1 ), (X 2 , Y 2 ), . . .}, where, for t = 1, 2, . . ., the multilabel Y t is distributed according to fG (· | X t ) and X t is distributed according to a ﬁxed and unknown distribution D. Each example (xt , y t ) is thus a realization of the corresponding pair (X t , Y t ) of random variables. Our parametric model for fG is described as follows. First, we assume that the support of D is the surface of the d-dimensional unit sphere (i.e., instances x ∈ R d are such that ||x|| = 1). With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . Then, we deﬁne the conditional probabilities for a nonroot node i with parent j by P (Yi = 1 | Yj = 1, X = x) = (1 + ui x)/2. If i is a root node, the previous equation simpliﬁes to P (Yi = 1 | X = x) = (1 + ui x)/2. 3.1 The Bayes-optimal classiﬁer for the H-loss We now describe a classiﬁer, called H - BAYES, that is the Bayes-optimal classiﬁer for the H-loss. In other words, H - BAYES classiﬁes any instance x with the multilabel y = argminy∈{0,1} E[ H (¯ , Y ) | x ]. Deﬁne pi (x) = P Yi = 1 | Ypar(i) = 1, X = x . y ¯ When no ambiguity arises, we write pi instead of pi (x). Now, ﬁx any unit-length instance x and let y be a multilabel that respects G. For each node i in G, recursively deﬁne H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ) + k∈child(i) H k,x (y) . The classiﬁer H - BAYES operates as follows. It starts by putting all nodes of G in a set S; nodes are then removed from S one by one. A node i can be removed only if i is a leaf or if all nodes j in the subtree rooted at i have been already removed. When i is removed, its value yi is set to 1 if and only if pi 2 − k∈child(i) H k,x (y)/ci ≥ 1 . (1) (Note that if i is a leaf then (1) is equivalent to yi = {pi ≥ 1/2}.) If yi is set to zero, then all nodes in the subtree rooted at i are set to zero. Theorem 2 For any taxonomy G and all unit-length x ∈ Rd , the multilabel generated by H - BAYES is the Bayes-optimal classiﬁcation of x for the H-loss. Proof sketch. Let y be the multilabel assigned by H - BAYES and y ∗ be any multilabel minimizing the expected H-loss. Introducing the short-hand Ex [·] = E[· | x], we can write Ex H (y, Y )= N i=1 ci (pi (1 − yi ) + (1 − pi )yi ) j∈anc(i) pj {yj = 1} . Note that we can recursively decompose the expected H-loss as Ex H (y, Y )= i∈root(G) where Ex Hi (y, Y ) = ci (pi (1 − yi ) + (1 − pi )yi ) Ex Hi (y, Y ), pj {yj = 1} + j∈anc(i) Ex Hk (y, Y ) . (2) k∈child(i) ∗ Pick a node i. If i is a leaf, then the sum in the RHS of (2) disappears and yi = {pi ≥ 1/2}, ∗ which is also the minimizer of H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ), implying yi = yi . ∗ Now let i be an internal node and inductively assume yj = yj for all j ∈ sub(i). Notice ∗ that the factors j∈anc(i) pj {yj = 1} occur in both terms in the RHS of (2). Hence yi does not depend on these factors and we can equivalently minimize ci (pi (1 − yi ) + (1 − pi )yi ) + pi {yi = 1} k∈child(i) H k,x (y), (3) where we noted that, for each k ∈ child(i), Ex Hk (y, Y ) = j∈anc(i) pj {yj = 1} pi {yi = 1}H k,x (y) . ∗ Now observe that yi minimizing (3) is equivalent to the assignment produced by H - BAYES. ∗ ∗ To conclude the proof, note that whenever yi = 0, Lemma 1 requires that yj = 0 for all nodes j ∈ sub(i), which is exactly what H - BAYES does. 4 The algorithms We consider three incremental algorithms. Each one of these algorithms learns a hierarchical classiﬁer by training a decision function gi : Rd → {0, 1} at each node i = 1, . . . , N . For a given set g1 , . . . , gN of decision functions, the hierarchical classiﬁer generated by these algorithms classiﬁes an instance x through a multilabel y = (y1 , ..., yN ) deﬁned as follows: yi = gi (x) 0 if i ∈ root(G) or yj = 1 for all j ∈ anc(i) otherwise. (4) Note that y computed this way respects G. The classiﬁers (4) are trained incrementally. Let gi,t be the decision function at node i after training on the ﬁrst t − 1 examples. When the next training example (xt , y t ) is available, the algorithms compute the multilabel y t using classiﬁer (4) based on g1,t (xt ), . . . , gN,t (xt ). Then, the algorithms consider for an update only those decision functions sitting at nodes i satisfying either i ∈ root(G) or ypar(i),t = 1. We call such nodes eligible at time t. The decision functions of all other nodes are left unchanged. The ﬁrst algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H - PERC. The decision functions at time t are deﬁned by gi,t (xt ) = {wi,t xt ≥ 0}. In the update phase, the Perceptron rule wi,t+1 = wi,t + yi,t xt is applied to every node i eligible at time t and such that yi,t = yi,t . The second algorithm, called APPROX - H - BAYES, approximates the H - BAYES classiﬁer of Section 3.1 by replacing the unknown quantities pi (xt ) with estimates (1+w i,t xt )/2. The weights w i,t are regularized least-squares estimates deﬁned by (i) wi,t = (I + Si,t−1 Si,t−1 + xt xt )−1 Si,t−1 y t−1 . (5) The columns of the matrix Si,t−1 are all past instances xs that have been stored at node i; (i) the s-th component of vector y t−1 is the i-th component yi,s of the multilabel y s associated with instance xs . In the update phase, an instance xt is stored at node i, causing an update of wi,t , whenever i is eligible at time t and |w i,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. The corresponding decision functions gi,t are of the form gi,t (xt ) = {w i,t xt ≥ τi,t }, where the threshold τi,t ≥ 0 at node i depends on the margin values w j,t xt achieved by nodes j ∈ sub(i) — recall (1). Note that gi,t is not a linear-threshold function, as xt appears in the deﬁnition of w i,t . The margin threshold (5 ln t)/Ni,t , controlling the update of node i at time t, reduces the space requirements of the classiﬁer by keeping matrices Si,t suitably small. This threshold is motivated by the work [4] on selective sampling. The third algorithm, which we call H - RLS (Hierarchical Regularized Least Squares), is a simpliﬁed variant of APPROX - H - BAYES in which the thresholds τi,t are set to zero. That is, we have gi,t (xt ) = {w i,t xt ≥ 0} where the weights w i,t are deﬁned as in (5) and updated as in the APPROX - H - BAYES algorithm. Details on how to run APPROX - H - BAYES 2 and H - RLS in dual variables and perform an update at node i in time O(Ni,t ) are found in [3] (where a mistake-driven version of H - RLS is analyzed). 5 Experimental results The empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The ﬁrst dataset consists of the ﬁrst (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the ﬁfth and ﬁrst group are considered adjacent), and then averaged the test set performance over the 5 experiments. The second dataset is a speciﬁc subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 Table 1: Experimental results on two hierarchical text classiﬁcation tasks under various loss functions. We report average test errors along with standard deviations (in parenthesis). In bold are the best performance ﬁgures among the incremental algorithms. RCV1 PERC H - PERC H - RLS AH - BAY SVM H - SVM OHSU. PERC H - PERC H - RLS AH - BAY SVM H - SVM 0/1-loss 0.702(±0.045) 0.655(±0.040) 0.456(±0.010) 0.550(±0.010) 0.482(±0.009) 0.440(±0.008) unif. H-loss 1.196(±0.127) 1.224(±0.114) 0.743(±0.026) 0.815(±0.028) 0.790(±0.023) 0.712(±0.021) norm. H-loss 0.100(±0.029) 0.099(±0.028) 0.057(±0.001) 0.090(±0.001) 0.057(±0.001) 0.055(±0.001) ∆-loss 1.695(±0.182) 1.861(±0.172) 1.086(±0.036) 1.465(±0.040) 1.173(±0.051) 1.050(±0.027) 0/1-loss 0.899(±0.024) 0.846(±0.024) 0.769(±0.004) 0.819(±0.004) 0.784(±0.003) 0.759(±0.002) unif. H-loss 1.938(±0.219) 1.560(±0.155) 1.200(±0.007) 1.197(±0.006) 1.206(±0.003) 1.170(±0.005) norm. H-loss 0.058(±0.005) 0.057(±0.005) 0.045(±0.000) 0.047(±0.000) 0.044(±0.000) 0.044(±0.000) ∆-loss 2.639(±0.226) 2.528(±0.251) 1.957(±0.011) 2.029(±0.009) 1.872(±0.005) 1.910(±0.007) classes and 55,503 documents. We made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there are a signiﬁcant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus in a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. In the training set we kept more documents than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classiﬁcation problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H - PERC), the hierarchical regularized leastsquares algorithm (H - RLS), and the approximated Bayes-optimal algorithm (APPROX - H BAYES ), all described in Section 4. The results are summarized in Table 1. APPROX - H BAYES ( AH - BAY in Table 1) was trained using cost coefﬁcients c i chosen as follows: if i ∈ root(G) then ci = |root(G)|−1 . Otherwise, ci = cj /|child(j)|, where j is the parent of i. Note that this choice of coefﬁcients amounts to splitting a unit cost equally among the roots and then splitting recursively each node’s cost equally among its children. Since, in this case, 0 ≤ H ≤ 1, we call the resulting loss normalized H-loss. We also tested a hierarchical version of SVM (denoted by H - SVM in Table 1) in which each node is an SVM classiﬁer trained using a batch version of our hierarchical learning protocol. More precisely, each node i was trained only on those examples (xt , y t ) such that ypar(i),t = 1 (note that, as no conditions are imposed on yi,t , node i is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierachical classiﬁcation scheme (4). We tried both the C and ν parametrizations [18] for SVM and found the setting C = 1 to work best for our data. 1 We ﬁnally tested the “ﬂat” variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the libSVM implementation [6]. All the tested algorithms used a linear kernel. 1 It should be emphasized that this tuning of C was actually chosen in hindsight, with no crossvalidation. As far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefﬁcients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by ∆-loss). Note that H - SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the ﬂat variant of H - SVM ) is surprising. However, with a single epoch of training H - RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H - SVM. On the other hand, the performance of APPROX - H - BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H - RLS, though it always does better than PERC and H - PERC. A possible explanation for this behavior is that APPROX - H - BAYES is very sensitive to errors in the estimates of pi (x) (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H - BAYES, seem to work better in practice on simpler (and possibly more robust) algorithms, such as H - RLS. The lower values of normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained because a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a ﬁnal remark, we observe that the normalized H-loss gave too small a range of values to afford ﬁne comparisons among the best performing algorithms. 6 Regret bounds for the H-loss In this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H - RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution D and unknown coefﬁcients u1 , . . . , uN . We deﬁne the regret of a classiﬁer assigning label y to instance X as E H (y, Y t ) − E H (y, Y ), where the expected value is with respect the random draw of (X, Y ) and y is the multilabel assigned by classiﬁer (4) when the decision functions gi are zero-threshold functions of the form gi (x) = {ui x ≥ 0}. The theorem below shows that the regret of the classiﬁer learned by a variant of H - RLS after t training examples, with t large enough, is exponentially small in t. In other words, H - RLS learns to classify as well as the algorithm that is given the true parameters u1 , . . . , uN of the underlying data-generating process. We have been able to prove the theorem only for the variant of H - RLS storing all instances at each node. That is, every eligible node at time t is updated, irrespective of whether |w i,t xt | ≤ (5 ln t)/Ni,t . Given the i.i.d. data-generating process (X 1 , Y 1 ), (X 2 , Y 2 ), . . ., for each node k we deﬁne the derived process X k1 , X k2 , . . . including all and only the instances X s of the original process that satisfy Ypar(k),s = 1. We call this derived process the process at node k. Note that, for each k, the process at node k is an i.i.d. process. However, its distribution might depend on k. The spectrum of the process at node k is the set of eigenvalues of the correlation matrix with entries E[Xk1 ,i Xk1 ,j ] for i, j = 1, . . . , d. We have the following theorem, whose proof is omitted due to space limitations. Theorem 3 Let G be a taxonomy with N nodes and let fG be a joint density for G parametrized by N unit-norm vectors u1 , . . . , uN ∈ Rd . Assume the instance distribution is such that there exist γ1 , . . . , γN > 0 satisfying P |ui X t | ≥ γi = 1 for i = 1, . . . , N . Then, for all t > max maxi=1,...,N E H (y t , Y t ) −E 16 µ i λ i γi , maxi=1,...,N 192d µi λ 2 i the regret H (y t , Y t ) of the modiﬁed H - RLS algorithm is at most N 2 2 µi t e−κ1 γi λi t µi + t2 e−κ2 λi t µi cj , i=1 j∈sub(i) where κ1 , κ2 are constants, µi = E j∈anc(i) (1 + uj X)/2 eigenvalue in the spectrum of the process at node i. and λi is the smallest 7 Conclusions and open problems In this work we have studied the problem of hierarchical classiﬁcation of data instances in the presence of partial and multiple path labellings. We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classiﬁer, and empirically compared an incremental approximation to this classiﬁer with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simpliﬁed variant of the approximated Bayes-optimal algorithm. Our investigation leaves several open issues. The current approximation to the Bayesoptimal classiﬁer is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayesoptimal classiﬁer, rather than with respect to a classiﬁer that makes a suboptimal use of the true model parameters. References [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/. [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/. [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121–137. Springer, 2002. [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classiﬁers via selective sampling. In Proc. 16th COLT, pages 373–386. Springer, 2003. [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classiﬁcation with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear. [6] C.-C. Chang and C.-J. Lin. Libsvm — a library for support vector machines. URL: www.csie.ntu.edu.tw/∼cjlin/libsvm/. [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001. [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classiﬁcation. In Proc. 21st ICML. Omnipress, 2004. [9] S.T. Dumais and H. Chen. Hierarchical classiﬁcation of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256–263. ACM Press, 2000. [10] M. Granitzer. Hierarchical Text Classiﬁcation using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003. [11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003. [12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML, Morgan Kaufmann, 1997. [13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classiﬁcation and clustering. URL: www-2.cs.cmu.edu/∼mccallum/bow/. [14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classiﬁcation by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998. [15] D. Mladenic. Turning yahoo into an automatic web-page classiﬁer. In Proceedings of the 13th European Conference on Artiﬁcial Intelligence, pages 473–474, 1998. [16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386–408, 1958. [17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002. [18] B. Sch¨ lkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. o Neural Computation, 12:1207–1245, 2000. [19] B. Sch¨ lkopf and A. Smola. Learning with kernels. MIT Press, 2002. o [20] A. Sun and E.-P. Lim. Hierarchical text classiﬁcation and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521–528. IEEE Press, 2001. [21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.

6 0.069163762 28 nips-2004-Bayesian inference in spiking neurons

7 0.0673948 130 nips-2004-Newscast EM

8 0.065454774 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

9 0.060780983 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

10 0.057906639 151 nips-2004-Rate- and Phase-coded Autoassociative Memory

11 0.056010555 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

12 0.053637266 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

13 0.053382393 26 nips-2004-At the Edge of Chaos: Real-time Computations and Self-Organized Criticality in Recurrent Neural Networks

14 0.053138249 116 nips-2004-Message Errors in Belief Propagation

15 0.052551549 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

16 0.050763369 138 nips-2004-Online Bounds for Bayesian Algorithms

17 0.049793825 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

18 0.048210204 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

19 0.048103087 83 nips-2004-Incremental Learning for Visual Tracking

20 0.047870521 178 nips-2004-Support Vector Classification with Input Data Uncertainty

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.165), (1, -0.027), (2, 0.112), (3, 0.014), (4, 0.096), (5, -0.042), (6, 0.032), (7, 0.047), (8, 0.053), (9, -0.095), (10, 0.012), (11, 0.032), (12, -0.067), (13, -0.061), (14, 0.023), (15, 0.014), (16, -0.019), (17, 0.033), (18, 0.11), (19, 0.02), (20, -0.023), (21, 0.051), (22, 0.038), (23, -0.016), (24, 0.085), (25, 0.018), (26, -0.063), (27, -0.027), (28, 0.139), (29, 0.003), (30, -0.091), (31, 0.028), (32, -0.05), (33, 0.021), (34, -0.05), (35, -0.029), (36, -0.173), (37, -0.111), (38, -0.136), (39, 0.024), (40, 0.028), (41, 0.051), (42, -0.021), (43, -0.03), (44, -0.057), (45, 0.165), (46, -0.1), (47, -0.002), (48, -0.021), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91624415 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Author: Shyam Visweswaran, Gregory F. Cooper

2 0.55240017 183 nips-2004-Temporal-Difference Networks

Author: Richard S. Sutton, Brian Tanner

3 0.55051547 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

4 0.54118246 48 nips-2004-Convergence and No-Regret in Multiagent Learning

Author: Michael Bowling

5 0.5154686 82 nips-2004-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni

6 0.47287628 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters

7 0.46387017 130 nips-2004-Newscast EM

8 0.44785666 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models

9 0.42997137 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms

10 0.42303669 185 nips-2004-The Convergence of Contrastive Divergences

11 0.39921311 136 nips-2004-On Semi-Supervised Classification

12 0.38493198 122 nips-2004-Modelling Uncertainty in the Game of Go

13 0.36613053 95 nips-2004-Large-Scale Prediction of Disulphide Bond Connectivity

14 0.36227655 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

15 0.35179949 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps

16 0.3303481 33 nips-2004-Brain Inspired Reinforcement Learning

17 0.32781145 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons

18 0.3265864 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

19 0.31775641 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation

20 0.3098129 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.101), (15, 0.111), (17, 0.018), (25, 0.012), (26, 0.063), (31, 0.032), (33, 0.165), (35, 0.023), (39, 0.019), (41, 0.012), (50, 0.035), (51, 0.015), (77, 0.013), (83, 0.235), (87, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81285131 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

Author: Shyam Visweswaran, Gregory F. Cooper

2 0.78475952 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classiﬁer. The criterion commonly used to measure the ranking quality of a classiﬁcation algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of conﬁdence for its value. This paper provides conﬁdence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard conﬁdence intervals given in the case of the error rate. They are compared with previous approaches in several standard classiﬁcation tasks demonstrating the beneﬁts of our analysis. 1 Motivation In many machine learning applications, the ranking quality of a classiﬁer is critical. For example, the ordering of the list of relevant documents returned by a search engine or a document classiﬁcation system is essential. The criterion widely used to measure the ranking quality of a classiﬁcation algorithm is the area under an ROC curve (AUC). But, to measure and report the AUC properly, it is crucial to determine an interval of conﬁdence for its value as it is customary for the error rate and other measures. It is also important to make the computation of the conﬁdence interval practical by relying only on a small and simple number of parameters. In the case of the error rate, such intervals are often derived from just the sample size N . We present an extensive theoretical analysis of the AUC and show that a similar conﬁdence interval can be derived for its value using only simple parameters such as the error rate k/N , the number of positive examples m, and the number of negative examples n = N − m. Thus, our results extend to AUC the computation of conﬁdence intervals from a small number of readily available parameters. Our analysis is distribution-independent in the sense that it makes no assumption about the distribution of the scores of negative or positive examples. The use of the error rate helps determine tight conﬁdence intervals. This contrasts with existing approaches presented in the statistical literature [11, 5, 2] which are based either on weak distribution-independent assumptions resulting in too loose conﬁdence intervals, or strong distribution-dependent assumptions leading to tight but unsafe conﬁdence intervals. We show that our results are of practical use. We also compare them with previous approaches in several standard classiﬁcation tasks demonstrating the beneﬁts of our analysis. Our results are also useful for testing the statistical signiﬁcance of the difference of the AUC values of two classiﬁers. The paper is organized as follows. We ﬁrst introduce the deﬁnition of the AUC, its connection with the Wilcoxon-Mann-Whitney statistic (Section 2), and brieﬂy review some essential aspects of the existing literature related to the computation of conﬁdence intervals for the AUC. Our computation of the expected value and variance of the AUC for a ﬁxed error rate requires establishing several combinatorial identities. Section 4 presents some existing identities and gives the proof of novel ones useful for the computation of the variance. Section 5 gives the reduced expressions for the expected value and variance of the AUC for a ﬁxed error rate. These can be efﬁciently computed and used to determine our conﬁdence intervals for the AUC (Section 6). Section 7 reports the result of the comparison of our method with previous approaches, including empirical results for several standard tasks. 2 Deﬁnition and Properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally introduced in signal detection theory [6] in connection with the study of radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [14, 13, 7, 12, 16, 3]. The ROC curve for a binary classiﬁcation problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classiﬁcation threshold from the most positive classiﬁcation value to the most negative. For a fully random classiﬁcation, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classiﬁcation results in an ROC curve at least partially above this straight line. The AUC is deﬁned as the area under the ROC curve. Consider a binary classiﬁcation task with m positive examples and n negative examples. Let C be a ﬁxed classiﬁer that outputs a strictly ordered list for these examples. Let x1 , . . . , xm be the output of C on the positive examples and y1 , . . . , yn its output on the negative examples and denote by 1X the indicator function of a set X. Then, the AUC, A, associated to C is given by: A= m i=1 n j=1 1xi >yj (1) mn which is the value of the Wilcoxon-Mann-Whitney statistic [10]. Thus, the AUC is closely related to the ranking quality of the classiﬁcation. It can be viewed as a measure based on pairwise comparisons between classiﬁcations of the two classes. It is an estimate of the probability Pxy that the classiﬁer ranks a randomly chosen positive example higher than a negative example. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC, and the expected AUC value for a random ranking is 0.5. 3 Overview of Related Work This section brieﬂy describes some previous distribution-dependent approaches presented in the statistical literature to derive conﬁdence intervals for the AUC and compares them to our method. The starting point for these analyses is a formula giving the variance of the AUC, A, for a ﬁxed distribution of the scores Px of the positive examples and Py of the negative examples [10, 1]: 2 σA = A(1 − A) + (m − 1)(Pxxy − A2 ) + (n − 1)(Pxyy − A2 ) mn (2) where Pxxy is the probability that the classiﬁer ranks two randomly chosen positive examples higher than a negative one, and Pxyy the probability that it ranks two randomly chosen negative examples lower than a positive one. To compute the variance exactly using Equation 2, the distributions Px and Py must be known. Hanley and McNeil [10] argue in favor of exponential distributions, loosely claiming that this upper-bounds the variance of normal distributions with various means and ratios of A 2A2 variances. They show that for exponential distributions Pxxy = 2−A and Pxyy = 1+A . The resulting conﬁdence intervals are of course relatively tight, but their validity is questionable since they are based on a strong assumption about the distributions of the positive and negative scores that may not hold in many cases. An alternative considered by several authors to the exact computation of the variance is to determine instead the maximum of the variance over all possible continuous distributions with the same expected value of the AUC. For all such distributions, one can ﬁx m and n and compute the expected AUC and its variance. The maximum variance is denoted by 2 σmax and is given by [5, 2]: 2 σmax = A(1 − A) 1 ≤ min {m, n} 4 min {m, n} (3) Unfortunately, this often yields loose conﬁdence intervals of limited practical use. Our approach for computing the mean and variance of the AUC is distribution-independent and inspired by the machine learning literature where analyses typically center on the error rate. We require only that the error rate be measured and compute the mean and variance of the AUC over all distributions Px and Py that maintain the same error rate. Our approach is in line with that of [5, 2] but it crucially avoids considering the maximum of the variance. We show that it is possible to compute directly the mean and variance of the AUC assigning equal weight to all the possible distributions. Of course, one could argue that not all distributions Px and Py are equally probable, but since these distributions are highly problem-dependent, we ﬁnd it risky to make any general assumption on the distributions and thereby limit the validity of our results. Our approach is further justiﬁed empirically by the experiments reported in the last section. 4 Combinatorial Analysis The analysis of the statistical properties of the AUC given a ﬁxed error rate requires various combinatorial calculations. This section describes several of the combinatorial identities that are used in our computation of the conﬁdence intervals. For all q ≥ 0, let Xq (k, m, n) be deﬁned by: k M M Xq (k, m, n) = xq (4) x x x=0 where M = m − (k − x) + x, M = n + (k − x) − x, and x = k − x. In previous work, we derived the following two identities which we used to compute the expected value of the AUC [4]: k X0 (k, m, n) = x=0 n+m+1 x k X1 (k, m, n) = (k − x)(m − n) + k n + m + 1 2 x x=0 To simplify the expression of the variance of the AUC, we need to compute X2 (k, m, n). Proposition 1 Let k, m, n be non-negative integers such that k ≤ min{m, n}, then: k X2 (k, m, n) = P2 (k, m, n, x) x=0 m+n+1 x (5) where P2 is the following 4th-degree polynomial: P2 (k, m, n, x) = (k − x)/12(−2x3 + 2x2 (2m − n + 2k − 4) + x(−3m2 + 3nm + 3m − 5km − 2k 2 + 2 + k + nk + 6n) + (3(k − 1)m2 − 3nm(k − 1) + 6km + 5m + k 2 m + 8n + 8 − 9nk + 3k + k 2 + k 2 n)). Proof. 5 The proof of the proposition is left to a longer version of this paper. Expectation and Variance of the AUC This section presents the expression of the expectation and variance of the AUC for a ﬁxed error rate k/N assuming that all classiﬁcations or rankings with k errors are equiprobable. For a given classiﬁcation, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is ﬁxed, there are x = k − x false negative examples. The expression Xq discussed in the previous section represents the q-th moment of x over all classiﬁcations with exactly k errors. In previous work, we gave the exact expression of the expectation of the AUC for a ﬁxed number of errors k: Proposition 2 ([4]) Assume that a binary classiﬁcation task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classiﬁcations with k errors is given by: E[A] = 1 − k (n − m)2 (m + n + 1) − m+n 4mn k − m+n k−1 m+n x=0 x k m+n+1 x=0 x . Note that the two sums in this expression cannot be further simpliﬁed since they are known not to admit a closed form [9]. We also gave the expression of the variance of the AUC in terms of the function F deﬁned for all Y by: F (Y ) = k M M x=0 x x k M M x=0 x x Y . (6) The following proposition reproduces that result: Proposition 3 ([4]) Assume that a binary classiﬁcation task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classiﬁcations with k errors is given by: σ 2 (A) = F ((1 − k−x x n+ m )2 ) 2 2 2 F ( mx +n(k−x) +(m(m+1)x+n(n+1)(k−x))−2x(k−x)(m+n+1) ). 12m2 n2 − F ((1 − k−x x n+ m 2 ))2 + Because of the products of binomial terms, the computation of the variance using this expression is inefﬁcient even for relatively small values of m and n. This expression can however be reduced using the identities presented in the previous section which leads to signiﬁcantly more efﬁcient computations that we have been using in all our experiments. Corollary 1 ([4]) Assume that a binary classiﬁcation task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classiﬁcations with ((m+n−2)Z4 −(2m−n+3k−10)Z3 ) k errors is given by: σ 2 (A) = (m+n+1)(m+n)(m+n−1)T72m2 n2 + (m+n+1)(m+n)T (m2 −nm+3km−5m+2k2 −nk+12−9k)Z2 48m2 n2 (m+n+1)Q1 Z1 kQ0 + 144m2 n2 with: 2 n2 72m Pk−i m+n+1−i Zi = x=0 Pk ( x=0 x (m+n+1) x ) − 2 (m+n+1)2 (m−n)4 Z1 16m2 n2 − , T = 3((m − n)2 + m + n) + 2, and: Q0 = (m + n + 1)T k2 + ((−3n2 + 3mn + 3m + 1)T − 12(3mn + m + n) − 8)k + (−3m2 + 7m + 10n + 3nm + 10)T − 4(3mn + m + n + 1) Q1 = T k3 + 3(m − 1)T k2 + ((−3n2 + 3mn − 3m + 8)T − 6(6mn + m + n))k + (−3m2 + 7(m + n) + 3mn)T − 2(6mn + m + n) Proof. The expression of the variance given in Proposition 3 requires the computation of Xq (k, m, n), q = 0, 1, 2. Using the identities giving the expressions of X0 and X1 and Proposition 1, which provides the expression of X2 , σ 2 (A) can be reduced to the expression given by the corollary. 6 Theory and Analysis Our estimate of the conﬁdence interval for the AUC is based on a simple and natural assumption. The main idea for its computation is the following. Assume that a conﬁdence interval E = [e1 , e2 ] is given for the error rate of a classiﬁer C over a sample S, with the conﬁdence level 1 − . This interval may have have been derived from a binomial model of C, which is a standard assumption for determining a conﬁdence interval for the error rate, or from any other model used to compute that interval. For a given error rate e ∈ E, or equivalently for a given number of misclassiﬁcations, we can use the expectation and variance computed in the previous section and Chebyshev’s inequality to predict a conﬁdence interval Ae for the AUC at the conﬁdence level 1 − . Since our equiprobable model for the classiﬁcations is independent of the model used to compute the interval of conﬁdence for the error rate, we can use E and Ae , e ∈ E, to compute a conﬁdence interval of the AUC at the level (1 − )(1 − ). Theorem 1 Let C be a binary classiﬁer and let S be a data sample of size N with m positive examples and n negative examples, N = m + n. Let E = [e1 , e2 ] be a conﬁdence interval for the error rate of C over S at the conﬁdence level 1 − . Then, for any , 0 ≤ ≤ 1, we can compute a conﬁdence interval for the AUC value of the classiﬁer C at the conﬁdence level (1 − )(1 − ) that depends only on , , m, n, and the interval E. Proof. Let k1 = N e1 and k2 = N e2 be the number of errors associated to the error rates e1 and e2 , and let IK be the interval IK = [k1 , k2 ]. For a ﬁxed k ∈ IK , by Propositions 2 and Corollary 1, we can compute the exact value of the expectation E[Ak ] and variance σ 2 (Ak ) of the AUC Ak . Using Chebyshev’s inequality, for any k ∈ IK and any k > 0, σ(Ak ) P |Ak − E[Ak ]| ≥ √ k ≤ (7) k where E[Ak ] and σ(Ak ) are the expressions given in Propositions 2 and Corollary 1, which depend only on k, m, and n. Let α1 and α2 be deﬁned by: α1 = min k∈IK σ(Ak ) E[Ak ] − √ σ(Ak ) α2 = max E[Ak ] + √ k∈IK k (8) k α1 and α2 only depend on IK (i.e., on e1 and e2 ), and on k, m, and n. Let IA be the conﬁdence interval deﬁned by IA = [α1 , α2 ] and let k = for all k ∈ IK . Using the fact that the conﬁdence interval E is independent of our equiprobability model for ﬁxed-k AUC values and the Bayes’ rule: P(A ∈ IA ) = k∈R+ ≥ k∈IK P (A ∈ IA | K = k)P (K = k) (9) P (A ∈ IA | K = k)P (K = k) (10) ≥ (1 − ) k∈IK P (K = k) ≥ (1 − )(1 − ) (11) where we used the property of Eq. 7 and the deﬁnitions of the intervals IK and IA . Thus, IA constitutes a conﬁdence interval for the AUC value of C at the conﬁdence level (1 − )(1 − ). In practice, the conﬁdence interval E is often determined as a result of the assumption that C follows a binomial law. This leads to the following theorem. .020 .035 Standard deviation Standard deviation .030 .015 .010 Max Distribution−dependent Distribution−independent .005 .025 .020 .015 Max Distribution−dependent Distribution−independent .010 .005 0.75 0.80 0.85 0.90 0.95 1.00 0.6 0.7 0.8 AUC (a) 0.9 1.0 AUC (b) Figure 1: Comparison of the standard deviations for three different methods with: (a) m = n = 500; (b) m = 400 and n = 200. The curves are obtained by computing the expected AUC and its standard deviations for different values of the error rate using the maximum-variance approach (Eq. 3), our distribution-independent method, and the distribution-dependent approach of Hanley [10]. Theorem 2 Let C be a binary classiﬁer, let S be a data sample of size N with m positive examples and n negative examples, N = m + n, and let k0 be the number of misclassiﬁcations of C on S. Assume that C follows a binomial law, then, for any , 0 ≤ ≤ 1, we can compute a conﬁdence interval of the AUC value of the classiﬁer C at the conﬁdence level 1 − that depends only on , k0 , m, and n. Proof. Assume that C follows a binomial law with coefﬁcient p. Then, Chebyshev’s inequality yields: 1 p(1 − p) ≤ (12) P(|C − E[C]| ≥ η) ≤ 2 Nη 4N η 2 1 1 Thus, E = [ k0 − √ √ , k0 + √ √ ] forms a conﬁdence interval for the N 2 (1− 1− )N N 2 √ (1− 1− )N error rate of C at the conﬁdence level 1 − . By Theorem 1, we can√ compute for the √ AUC value a conﬁdence interval at the level (1 − (1 − 1 − ))(1 − (1 − 1 − )) = 1 − depending only on , m, n, and the interval E, i.e., k0 , N = m + n, and . For large N , we can use the normal approximation of the binomial law to determine a ﬁner interval E. Indeed, for large N , √ (13) P(|C − E[C]| ≥ η) ≤ 2Φ(2 N η) with Φ(u) = ∞ e−x2 /2 √ dx. u 2π Thus, E = [ k0 − N √ Φ−1 ( 1− 21− ) k0 √ ,N 2 N √ conﬁdence interval for the error rate at the conﬁdence level √ + Φ−1 ( 1− 21− ) √ ] 2 N is the 1− . For simplicity, in the proof of Theorem 2, k was chosen to be a constant ( k = ) but, in general, it can be another function of k leading to tighter conﬁdence intervals. The results presented in the next section were obtained with k = a0 exp((k − k0 )2 /2a2 ), where a0 1 and a1 are constants selected so that the inequality 11 be veriﬁed. 7 Experiments and Comparisons The analysis in the previous section provides a principled method for computing a conﬁdence interval of the AUC value of a classier C at the conﬁdence level 1 − that depends only on k, n and m. As already discussed, other expressions found in the statistical literature lead to either too loose or unsafely narrow conﬁdence intervals based on questionable assumptions on the probability functions Px and Py [10, 15]. Figure 1 shows a comparison of the standard deviations obtained using the maximum-approach (Eq. 3), the distribution-dependent expression from [10], and our distribution-independent method for NAME m+n n m+n AUC k m+n σindep σA σdep σmax 368 700 303 1159 2473 201 0.63 0.67 0.54 0.17 0.10 0.37 0.70 0.63 0.87 0.85 0.84 0.85 0.24 0.26 0.13 0.05 0.03 0.13 0.0297 0.0277 0.0176 0.0177 0.0164 0.0271 0.0440 0.0330 0.0309 0.0161 0.0088 0.0463 0.0269 0.0215 0.0202 0.0176 0.0161 0.0306 0.0392 0.0317 0.0281 0.0253 0.0234 0.0417 pima yeast credit internet-ads page-blocks ionosphere Table 1: Accuracy and AUC values for AdaBoost [8] and estimated standard deviations for several datasets from the UC Irvine repository. σindep is a distribution-independent standard deviation obtained using our method (Theorem 2). σA is given by Eq. (2) with the values of A, Pxxy , and Pxyy derived from data. σdep is the distribution-dependent standard deviation of Hanley [10], which is based on assumptions that may not always hold. σmax is deﬁned by Eq. (3). All results were obtained on a randomly selected test set of size m + n. various error rates. For m = n = 500, our distribution-independent method consistently leads to tighter conﬁdence intervals (Fig. 1 (a)). It also leads to tighter conﬁdence intervals for AUC values more than .75 for the uneven distribution m = 400 and n = 200 (Fig. 1 (b)). For lower AUC values, the distribution-dependent approach produces tighter intervals, but its underlying assumptions may not hold. A different comparison was made using several datasets available from the UC Irvine repository (Table 1). The table shows that our estimates of the standard deviations (σindep ) are in general close to or tighter than the distribution-dependent standard deviation σdep of Hanley [10]. This is despite we do not make any assumption about the distributions of positive and negative examples. In contrast, Hanley’s method is based on speciﬁc assumptions about these distributions. Plots of the actual ranking distribution demonstrate that these assumptions are often violated however. Thus, the relatively good performance of Hanley’s approach on several data sets can be viewed as fortuitous and is not general. Our distribution-independent method provides tight conﬁdence intervals, in some cases tighter than those derived from σA , in particular because it exploits the information provided by the error rate. Our analysis can also be used to determine if the AUC values produced by two classiﬁers are statistically signiﬁcant by checking if the AUC value of one falls within the conﬁdence interval of the other. 8 Conclusion We presented principled techniques for computing useful conﬁdence intervals for the AUC from simple parameters: the error rate, and the negative and positive sample sizes. We demonstrated the practicality of these conﬁdence intervals by comparing them to previous approaches in several tasks. We also derived the exact expression of the variance of the AUC for a ﬁxed k, which can be of interest in other analyses related to the AUC. The Wilcoxon-Mann-Whitney statistic is a general measure of the quality of a ranking that is an estimate of the probability that the classiﬁer ranks a randomly chosen positive example higher than a negative example. One could argue that accuracy at the top or the bottom of the ranking is of higher importance. This, however, contrarily to some belief, is already captured to a certain degree by the deﬁnition of the Wilcoxon-Mann-Whitney statistic which penalizes more errors at the top or the bottom of the ranking. It is however an interesting research problem to determine how to incorporate this bias in a stricter way in the form of a score-speciﬁc weight in the ranking measure, a weighted WilcoxonMann-Whitney statistic, or how to compute the corresponding expected value and standard deviation in a general way and design machine learning algorithms to optimize such a mea- sure. A preliminary analysis suggests, however, that the calculation of the expectation and the variance are likely to be extremely complex in that case. Finally, it could also be interesting but difﬁcult to adapt our results to the distribution-dependent case and compare them to those of [10]. Acknowledgments We thank Rob Schapire for pointing out to us the problem of the statistical signiﬁcance of the AUC, Daryl Pregibon for the reference to [11], and Saharon Rosset for various discussions about the topic of this paper. References [1] D. Bamber. The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Math. Psychology, 12, 1975. [2] Z. W. Birnbaum and O. M. Klose. Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics, 38, 1957. [3] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting Customer Groups using Gain and Cost Matrix; a Marketing Application. Technical report, ERIC Laboratory - University of Lyon 2, 2001. [4] Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press. [5] D. Van Dantzig. On the Consistency and Power of Wilcoxon’s Two Sample Test. In Koninklijke Nederlandse Akademie van Weterschappen, Series A, volume 54, 1915. [6] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975. [7] C. Ferri, P. Flach, and J. Hern´ ndez-Orallo. Learning Decision Trees Using the Area a Under the ROC Curve. In Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002. [8] Yoav Freund and Robert E. Schapire. A Decision Theoretical Generalization of OnLine Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995. [9] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics. Addison-Wesley, Reading, Massachusetts, 1994. [10] J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 1982. [11] E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, California, 1975. [12] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC Curve: Constrained Optimization of Classiﬁer Performance. In Neural Information Processing Systems (NIPS 2002). MIT Press, 2002. [13] C. Perlich, F. Provost, and J. Simonoff. Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research, 2003. [14] F. Provost and T. Fawcett. Analysis and Visualization of Classiﬁer Performance: Comparison under Imprecise Class and Cost Distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI, 1997. [15] Saharon Rosset. Ranking-Methods for Flexible Evaluation and Efﬁcient Comparison of 2-Class Models. Master’s thesis, Tel-Aviv University, 1999. [16] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classiﬁer Performance via the Wilcoxon-Mann-Whitney Statistics. In Proceedings of the International Conference on Machine Learning, 2003.

3 0.70006502 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

Author: Nicolò Cesa-bianchi, Claudio Gentile, Luca Zaniboni

4 0.69916135 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: Prediction sufﬁx trees (PST) provide a popular and effective tool for tasks such as compression, classiﬁcation, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efﬁcient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the ﬁrst provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any ﬁxed PST determined in hindsight. 1

5 0.69876856 131 nips-2004-Non-Local Manifold Tangent Learning

Author: Yoshua Bengio, Martin Monperrus

Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1

6 0.69546109 69 nips-2004-Fast Rates to Bayes for Kernel Machines

7 0.69398558 102 nips-2004-Learning first-order Markov models for control

8 0.69322383 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

9 0.69002128 116 nips-2004-Message Errors in Belief Propagation

10 0.68999851 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

11 0.68980664 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

12 0.68951875 70 nips-2004-Following Curved Regularized Optimization Solution Paths

13 0.6893481 163 nips-2004-Semi-parametric Exponential Family PCA

14 0.68865162 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

15 0.68862975 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

16 0.68722224 178 nips-2004-Support Vector Classification with Input Data Uncertainty

17 0.68698132 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

18 0.68635565 127 nips-2004-Neighbourhood Components Analysis

19 0.68595517 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

20 0.68557203 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition