jmlr jmlr2007 jmlr2007-39 jmlr2007-39-reference knowledge-graph by maker-knowledge-mining

39 jmlr-2007-Handling Missing Values when Applying Classification Models


Source: pdf

Author: Maytal Saar-Tsechansky, Foster Provost

Abstract: Much work has studied the effect of different treatments of missing values on model induction, but little work has analyzed treatments for the common case of missing values at prediction time. This paper first compares several different methods—predictive value imputation, the distribution-based imputation used by C4.5, and using reduced models—for applying classification trees to instances with missing values (and also shows evidence that the results generalize to bagged trees and to logistic regression). The results show that for the two most popular treatments, each is preferable under different conditions. Strikingly, the reduced-models approach, seldom mentioned or used, consistently outperforms the other two methods, sometimes by a large margin. The lack of attention to reduced modeling may be due in part to its (perceived) expense in terms of computation or storage. Therefore, we then introduce and evaluate alternative, hybrid approaches that allow users to balance between more accurate but computationally expensive reduced modeling and the other, less accurate but less computationally expensive treatments. The results show that the hybrid methods can scale gracefully to the amount of investment in computation/storage, and that they outperform imputation even for small investments.

Keywords: missing data, classification, classification trees, decision trees, imputation
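To make the contrast concrete, the following is a minimal sketch (not the authors' experimental code) of the two prediction-time treatments the abstract compares: predictive value imputation versus a reduced model trained only on the features observed in the test instance. It assumes scikit-learn and purely illustrative data; the distribution-based (C4.5-style) imputation is not shown because scikit-learn trees do not implement it.

    # Illustrative sketch only: contrasts value imputation with a reduced model
    # at prediction time. Data, feature layout, and models are assumptions.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.impute import SimpleImputer

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))                   # three fully observed training features
    y_train = (X_train[:, 0] + X_train[:, 2] > 0).astype(int)

    x_new = np.array([0.4, np.nan, -1.2])                 # feature 1 is missing at prediction time
    missing = np.isnan(x_new)

    # (a) Predictive value imputation: fill in the missing value, then apply the full model.
    full_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    imputer = SimpleImputer(strategy="mean").fit(X_train)
    print("imputation:", full_model.predict(imputer.transform(x_new.reshape(1, -1))))

    # (b) Reduced model: train (or retrieve from storage) a model that uses only
    # the features actually observed in the test instance.
    reduced_model = DecisionTreeClassifier(random_state=0).fit(X_train[:, ~missing], y_train)
    print("reduced model:", reduced_model.predict(x_new[~missing].reshape(1, -1)))

The hybrid approaches discussed in the paper trade off between these extremes, for example by storing reduced models only for the most likely missingness patterns and falling back to imputation otherwise; that logic is not sketched here.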


reference text

Gustavo E. A. P. A. Batista and Maria Carolina Monard. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6):519–533, 2003.

E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1-2):105–139, 1999.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proc. of the 11th Annual Conf. on Computational Learning Theory, pages 92–100, Madison, WI, 1998.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

Y. Ding and J. Simonoff. An investigation of missing data methods for classification trees. Working paper 2006-SOR-3, Stern School of Business, New York University, 2006.

A. J. Feelders. Handling missing data in trees: Surrogate splits or statistical imputation? In Principles of Data Mining and Knowledge Discovery, pages 329–334, Berlin/Heidelberg, 1999. Springer. Lecture Notes in Computer Science, Vol. 1704.

J. H. Friedman, R. Kohavi, and Y. Yun. Lazy decision trees. In Howard Shrobe and Ted Senator, editors, Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, pages 717–724, Menlo Park, California, 1996. AAAI Press.

N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Proc. of the 12th Conference on Uncertainty in Artificial Intelligence (UAI-96), pages 252–262, 1996.

L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679–707, 2002.

Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via the EM approach. In Advances in Neural Information Processing Systems 6, pages 120–127, 1994.

Z. Ghahramani and M. I. Jordan. Mixture models for learning from incomplete data. In R. Greiner, T. Petsche, and S. J. Hanson, editors, Computational Learning Theory and Natural Learning Systems, volume IV, pages 7–85. MIT Press, Cambridge, MA, 1997.

R. Greiner, A. J. Grove, and A. Kogan. Knowing what doesn't matter: Exploiting the omission of irrelevant data. Artificial Intelligence, 97(1-2):345–380, 1997a.

R. Greiner, A. J. Grove, and D. Schuurmans. Learning Bayesian nets that perform well. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 198–207, 1997b.

E. H. Herskovits and G. F. Cooper. Algorithms for Bayesian belief-network precomputation. In Methods of Information in Medicine, pages 362–370, 1992.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, New York, August 2001.

D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. M. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, 2000.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

N. Landwehr, M. Hall, and E. Frank. Logistic model trees. Machine Learning, 59(1-2):161–205, 2005.

C. X. Ling, Q. Yang, J. Wang, and S. Zhang. Decision trees with minimal costs. In Proc. of 21st International Conference on Machine Learning (ICML-2004), 2004.

R. Little and D. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 1987.

C. J. Merz, P. M. Murphy, and D. W. Aha. Repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA, 1996. http://www.ics.uci.edu/~mlearn/mlrepository.html.

J. Neville and D. Jensen. Relational dependency networks. Journal of Machine Learning Research, 8:653–692, 2007.

A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proc. of 22nd International Conference on Machine Learning (ICML-2005), pages 625–632, New York, NY, USA, 2005. ACM Press. ISBN 1-59593-180-5.

K. Nigam and R. Ghani. Understanding the behavior of co-training. In Proc. of 6th Intl. Conf. on Knowledge Discovery and Data Mining (KDD-2000), 2000.

B. Padmanabhan, Z. Zheng, and S. O. Kimbrough. Personalization from incomplete data: what you don't know can hurt. In Proc. of 7th Intl. Conf. on Knowledge Discovery and Data Mining (KDD-2001), pages 154–163, 2001.

C. Perlich, F. Provost, and J. S. Simonoff. Tree induction vs. logistic regression: a learning-curve analysis. Journal of Machine Learning Research, 4:211–255, 2003. ISSN 1533-7928.

B. W. Porter, R. Bareiss, and R. C. Holte. Concept learning and heuristic classification in weak-theory domains. Artificial Intelligence, 45:229–263, 1990.

J. R. Quinlan. Unknown attribute values in induction. In Proc. of 6th International Workshop on Machine Learning, pages 164–168, Ithaca, NY, June 1989.

J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York, 1987.

J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman & Hall, London, 1997.

D. Schuurmans and R. Greiner. Learning to classify incomplete examples. In Computational Learning Theory and Natural Learning Systems IV: Making Learning Systems Practical, pages 87–105. MIT Press, Cambridge, MA, 1997.

L. G. Valiant. A theory of the learnable. Communications of the Association for Computing Machinery, 27(11):1134–1142, 1984.

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, 1999.