
Gradient Tree Boosting for Training Conditional Random Fields


Source: pdf

Author: Thomas G. Dietterich, Guohua Hao, Adam Ashenfelter

Abstract: Conditional random fields (CRFs) provide a flexible and powerful model for sequence labeling problems. However, existing learning algorithms are slow, particularly in problems with large numbers of potential input features and feature combinations. This paper describes a new algorithm for training CRFs via gradient tree boosting. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees. Because each tree compactly encodes a set of feature interactions, the algorithm never needs to enumerate the potentially large parameter space explicitly. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions, rather than exponentially as in previous algorithms based on iterative scaling and gradient descent. Gradient tree boosting also makes it possible to use instance weighting (as in C4.5) and surrogate splitting (as in CART) to handle missing values. Experimental studies of these two methods (as well as standard imputation and indicator-feature methods) show that instance weighting is the best method in most cases when feature values are missing at random.

Keywords: sequential supervised learning, conditional random fields, functional gradient, gradient tree boosting, missing values
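
The abstract's central idea, representing each potential function as a growing sum of regression trees fit to the functional gradient of the log-likelihood, can be made concrete with a short sketch. The version below is deliberately simplified to order-zero potentials, so each position is labeled independently and a softmax stands in for the forward-backward marginals that the full sequence algorithm would compute; the function names, the plain gradient step, and the use of scikit-learn's DecisionTreeRegressor are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of functional gradient tree boosting (Python).
    # Simplification: order-zero potentials F_k(x), i.e., boosted
    # multiclass logistic regression. A real CRF trainer would fit
    # trees over (x_t, y_{t-1}) pairs and obtain P(y_t = k | x) from
    # a forward-backward pass instead of the softmax below.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_potentials(X, y, n_classes, n_rounds=50, max_depth=3):
        """Learn each potential F_k(x) as a sum of regression trees."""
        n = X.shape[0]
        F = np.zeros((n, n_classes))            # current values F_k(x_i)
        ensembles = [[] for _ in range(n_classes)]
        for _ in range(n_rounds):
            # Class probabilities under the current potentials.
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)
            for k in range(n_classes):
                # Functional gradient of the log-likelihood w.r.t. F_k
                # at each training example: I(y_i = k) - P(y_i = k).
                residual = (y == k).astype(float) - P[:, k]
                tree = DecisionTreeRegressor(max_depth=max_depth)
                tree.fit(X, residual)           # tree approximates the gradient
                ensembles[k].append(tree)
                F[:, k] += tree.predict(X)      # step in function space
        return ensembles

    def predict(ensembles, X):
        """Score each class by summing its trees; label by argmax."""
        scores = np.column_stack(
            [sum(t.predict(X) for t in trees) for trees in ensembles])
        return scores.argmax(axis=1)

Because each round adds one shallow tree per label, the representation grows only with the feature interactions the trees actually discover, which is the intuition behind the abstract's claim of linear rather than exponential scaling in the order of the feature interactions.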


References

Alan Agresti. An Introduction to Categorical Data Analysis. Wiley, New York, 1996.
Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector machines. In Tom Fawcett and Nina Mishra, editors, Proceedings of the 20th International Conference on Machine Learning (ICML 2003), pages 3–10. AAAI Press, 2003.
Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society B, 36(2):192–236, 1974.
Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth Publishing Company, 1984.
Thomas G. Dietterich, Adam Ashenfelter, and Yaroslav Bulatov. Training conditional random fields via gradient tree boosting. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), pages 217–224, Banff, Canada, 2004. ACM Press.
Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning (ICML 1996), pages 148–156. Morgan Kaufmann, 1996.
Jerome Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.
Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, November 1984.
John M. Hammersley and Peter Clifford. Markov fields on finite graphs and lattices. Unpublished manuscript, 1971.
John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pages 282–289. Morgan Kaufmann, 2001.
Lin Liao, Tanzeem Choudhury, Dieter Fox, and Henry A. Kautz. Training conditional random fields using virtual evidence boosting. In Manuela M. Veloso, editor, Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2530–2535, Hyderabad, India, January 2007.
Andrew McCallum. Efficiently inducing features of conditional random fields. In Christopher Meek and Uffe Kjaerulff, editors, Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI 2003), pages 403–410. Morgan Kaufmann, 2003.
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 591–598. Morgan Kaufmann, 2000.
Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
Andrew Y. Ng and Michael Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002.
Ning Qian and Terrence J. Sejnowski. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202:865–884, 1988.
J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142, Somerset, New Jersey, 1996. Association for Computational Linguistics.
Terrence J. Sejnowski and Charles R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145–168, 1987.
Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Marti Hearst and Mari Ostendorf, editors, HLT-NAACL 2003: Main Proceedings, pages 213–220, Edmonton, Alberta, Canada, May–June 2003. Association for Computational Linguistics.
Charles Sutton, Michael Sindelar, and Andrew McCallum. Reducing weight undertraining in structured discriminative learning. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 89–95, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32. MIT Press, Cambridge, MA, 2004.
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning (ICML 2004), pages 823–830, Banff, Canada, 2004. ACM Press.
S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In William W. Cohen and Andrew Moore, editors, Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pages 969–976, New York, NY, USA, 2006. ACM.