Author: Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann
Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods, which demonstrate the benefits of the mixture weight method: it consumes fewer resources while achieving performance comparable to that of standard approaches.
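As a rough illustration of the mixture weight method sketched in the abstract, the snippet below trains an independent conditional maxent model on each data shard and then averages the resulting parameter vectors. This is a minimal sketch, not the authors' implementation: the function name `train_mixture_weight`, the use of scikit-learn's `LogisticRegression` as the per-shard maxent trainer, the uniform 1/p mixture weights, and the toy data are all assumptions made for illustration.

```python
# Hypothetical sketch of the mixture weight method: train one L2-regularized
# conditional maxent (multinomial logistic regression) model per data shard,
# then take a uniform average of the per-shard parameter vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_mixture_weight(shards, C=1.0):
    """Train a maxent model on each shard and return the uniform average
    of their coefficient and intercept vectors (assumed mixture weights 1/p)."""
    coefs, intercepts = [], []
    for X, y in shards:
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X, y)
        coefs.append(clf.coef_)
        intercepts.append(clf.intercept_)
    return np.mean(coefs, axis=0), np.mean(intercepts, axis=0)

if __name__ == "__main__":
    # Toy binary-classification data split into 4 shards for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(int)
    shards = [(X[i::4], y[i::4]) for i in range(4)]
    w, b = train_mixture_weight(shards)
    print("averaged weights:", w, "averaged intercept:", b)
```

In a genuinely distributed setting each shard would be trained on its own machine and only the final parameter vectors would be shipped over the network, which is what makes this approach cheap in network time relative to a distributed gradient computation.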
[1] A. Berger, V. Della Pietra, and S. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[3] S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50, 2000.
[4] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In Advances in Neural Information Processing Systems, 2007.
[5] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48, 2002.
[6] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT 2008, volume 5254 of LNCS, pages 38–53. Springer, 2008.
[7] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, pages 1470–1480, 1972.
[8] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
[9] K. Ganchev and M. Dredze. Small statistical models by random feature mixing. In Workshop on Mobile Language Processing, ACL, 2008.
[10] D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword Third Edition. Linguistic Data Consortium, Philadelphia, 2007.
[11] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
[12] J. Jeon and R. Manmatha. Using maximum entropy for automatic image annotation. In International Conference on Image and Video Retrieval, 2004.
[13] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, pages 447–454, 2001.
[14] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[15] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In International Conference on Computational Linguistics (COLING), 2002.
[16] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[17] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, Cambridge, 1989.
[18] J. Nocedal and S. Wright. Numerical Optimization. Springer, 1999.
[19] C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50–64, 1951.
[20] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In International Conference on Machine Learning, 2009.
[21] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, 2004.