Author: Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann
Abstract: Training conditional maximum entropy models on massive data sets requires significant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efficient technique. We also report the results of large-scale experiments comparing these three methods, which demonstrate the benefits of the mixture weight method: it consumes fewer resources while achieving performance comparable to that of standard approaches.
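As a rough illustration of the mixture weight method sketched in the abstract, the snippet below trains an independent conditional maxent model on each data shard and then averages the resulting parameter vectors. This is a minimal sketch, not the authors' implementation: the function name `train_mixture_weight`, the use of scikit-learn's `LogisticRegression` as the per-shard maxent trainer, the uniform 1/p mixture weights, and the toy data are all assumptions made for illustration.

```python
# Hypothetical sketch of the mixture weight method: train one L2-regularized
# conditional maxent (multinomial logistic regression) model per data shard,
# then take a uniform average of the per-shard parameter vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_mixture_weight(shards, C=1.0):
    """Train a maxent model on each shard and return the uniform average
    of their coefficient and intercept vectors (assumed mixture weights 1/p)."""
    coefs, intercepts = [], []
    for X, y in shards:
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X, y)
        coefs.append(clf.coef_)
        intercepts.append(clf.intercept_)
    return np.mean(coefs, axis=0), np.mean(intercepts, axis=0)

if __name__ == "__main__":
    # Toy binary-classification data split into 4 shards for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 5))
    y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) > 0).astype(int)
    shards = [(X[i::4], y[i::4]) for i in range(4)]
    w, b = train_mixture_weight(shards)
    print("averaged weights:", w, "averaged intercept:", b)
```

In a genuinely distributed setting each shard would be trained on its own machine and only the final parameter vectors would be shipped over the network, which is what makes this approach cheap in network time relative to a distributed gradient computation.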
[1] A. Berger, V. Della Pietra, and S. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[3] S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50, 2000.
[4] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In Advances in Neural Information Processing Systems, 2007.
[5] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48, 2002.
[6] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In Proceedings of ALT 2008, volume 5254 of LNCS, pages 38–53. Springer, 2008.
[7] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, pages 1470–1480, 1972.
[8] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.
[9] K. Ganchev and M. Dredze. Small statistical models by random feature mixing. In Workshop on Mobile Language Processing, ACL, 2008.
[10] D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword Third Edition. Linguistic Data Consortium, Philadelphia, 2007.
[11] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.
[12] J. Jeon and R. Manmatha. Using maximum entropy for automatic image annotation. In International Conference on Image and Video Retrieval, 2004.
[13] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, pages 447–454, 2001.
[14] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[15] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In International Conference on Computational Linguistics (COLING), 2002.
[16] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[17] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, Cambridge, 1989.
[18] J. Nocedal and S. Wright. Numerical Optimization. Springer, 1999.
[19] C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50–64, 1951.
[20] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In International Conference on Machine Learning, 2009.
[21] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, 2004.