jmlr jmlr2012 jmlr2012-28 jmlr2012-28-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural-language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.
Keywords: online learning, confidence prediction, text categorization
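The probabilistic constraint described in the abstract can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not necessarily any of the versions evaluated in the paper: it assumes a full-covariance Gaussian N(mu, sigma) over weights, labels y in {-1, +1}, and the linearized "variance" form of the constraint, y (mu . x) >= phi (x^T sigma x) with phi the standard normal quantile of the target probability eta. The function name `cw_variance_update` and its parameters are hypothetical. The step size alpha is the non-negative root of the quadratic obtained by requiring the constraint to hold with equality after the update.

```python
import math
from statistics import NormalDist


def cw_variance_update(mu, sigma, x, y, eta=0.9):
    """One confidence-weighted style update for example (x, y), y in {-1, +1}.

    Sketch only: enforces the linearized constraint
        y * (mu . x) >= phi * (x^T sigma x),  phi = Phi^{-1}(eta), eta > 0.5,
    and returns the updated (mu, sigma) as plain Python lists.
    """
    phi = NormalDist().inv_cdf(eta)
    n = len(x)
    # sigma @ x, the direction in which the mean is moved
    sx = [sum(sigma[i][j] * x[j] for j in range(n)) for i in range(n)]
    m = y * sum(mu[i] * x[i] for i in range(n))  # signed mean margin M
    v = sum(x[i] * sx[i] for i in range(n))      # margin variance V
    if v <= 0.0:
        return mu, sigma  # no uncertainty along x, nothing to update

    # Solve 2*phi*V^2 * a^2 + V*(1 + 2*phi*M) * a + (M - phi*V) = 0 for a,
    # which makes the constraint tight after the update.
    b = 1.0 + 2.0 * phi * m
    disc = b * b - 8.0 * phi * (m - phi * v)
    alpha = max(0.0, (-b + math.sqrt(disc)) / (4.0 * phi * v))
    if alpha == 0.0:
        return mu, sigma  # constraint already satisfied

    # Mean moves along sigma @ x; high-variance weights move the most,
    # which is the "adaptive per-weight rate" reading of the update.
    new_mu = [mu[i] + alpha * y * sx[i] for i in range(n)]
    # Precision gains 2*alpha*phi * x x^T; apply it to sigma directly
    # via the Sherman-Morrison identity.
    shrink = 2.0 * alpha * phi / (1.0 + 2.0 * alpha * phi * v)
    new_sigma = [[sigma[i][j] - shrink * sx[i] * sx[j] for j in range(n)]
                 for i in range(n)]
    return new_mu, new_sigma


if __name__ == "__main__":
    mu0 = [0.0, 0.0]
    sigma0 = [[1.0, 0.0], [0.0, 1.0]]
    mu1, sigma1 = cw_variance_update(mu0, sigma0, [1.0, 2.0], +1)
    print(mu1, sigma1)
```

After an update, the variance along the observed feature direction shrinks, so features seen often accumulate confidence and change slowly, while rare features keep large variance and can still move quickly when they do appear.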
Galen Andrew and Jianfeng Gao. Scalable training of L1-regularized log-linear models. In ICML ’07: Proceedings of the 24th International Conference on Machine Learning, pages 33–40, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi: http://doi.acm.org/10.1145/1273496.1273501.
Steffen Bickel. ECML-PKDD discovery challenge overview. In The ECML-PKDD Discovery Challenge Workshop, 2006.
John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.
Antoine Bordes and Léon Bottou. The Huller: a simple and efficient online SVM. In European Conference on Machine Learning (ECML), LNAI 3720, 2005.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Xavier Carreras, Michael Collins, and Terry Koo. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In Conference on Natural Language Learning (CoNLL), 2008.
Vitor R. Carvalho and William W. Cohen. Single-pass online learning: Performance, voting schemes and online feature selection. In KDD-2006, 2006.
Nicolò Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.
Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
David Chiang, Yuval Marton, and Philip Resnik. Online large-margin training of syntactic and structural translation features. In Empirical Methods in Natural Language Processing (EMNLP), 2008.
1923 CRAMMER, DREDZE AND PEREIRA
Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing (EMNLP), 2002.
Koby Crammer and Daniel D. Lee. Learning via Gaussian herding. In Advances in Neural Information Processing Systems 24, 2010.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006a.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006b.
Koby Crammer, Mark Dredze, and Fernando Pereira. Exact convex confidence-weighted learning. In Neural Information Processing Systems (NIPS), 2008.
Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. In Advances in Neural Information Processing Systems 23, pages 414–422, 2009a.
Koby Crammer, Mehryar Mohri, and Fernando Pereira. Gaussian margin machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), 2009b.
Mark Dredze and Koby Crammer. Active learning with confidence. In Association for Computational Linguistics (ACL), 2008a.
Mark Dredze and Koby Crammer. Online methods for multi-domain learning and adaptation. In Empirical Methods in Natural Language Processing (EMNLP), 2008b.
Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning (ICML), 2008.
Mark Dredze, Alex Kulesza, and Koby Crammer. Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79(1-2):123–149, 2010.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Proceedings of the Twenty Third Annual Conference on Learning Theory, 2010.
Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. A comparative study of parameter estimation methods for statistical natural language processing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 824–831, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-1104.
Amir Globerson, Terry Y. Koo, Xavier Carreras, and Michael Collins. Exponentiated gradient algorithms for log-linear structured prediction. In Proceedings of the 24th International Conference on Machine Learning, pages 305–312. ACM New York, NY, USA, 2007.
Edward Harrington, Ralf Herbrich, Jyrki Kivinen, John Platt, and Robert C. Williamson. Online Bayes point machines. In 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2003.
1924 CONFIDENCE-WEIGHTED LINEAR CLASSIFICATION FOR TEXT CATEGORIZATION
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.
Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination, 1999.
Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 5:361–397, 2004.
Nick Littlestone. Mistake bounds and logarithmic linear-threshold learning algorithms. PhD thesis, U. C. Santa Cruz, March 1989.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying suspicious URLs: An application of large-scale online learning. In Proc. of the International Conference on Machine Learning (ICML), 2009.
Justin Ma, Alex Kulesza, Koby Crammer, Mark Dredze, Lawrence Saul, and Fernando Pereira. Exploiting feature covariance in high-dimensional online learning. In AISTATS, 2010.
Andrew McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
Ryan McDonald, Koby Crammer, and Fernando Pereira. Large margin online learning algorithms for scalable structured classification. In NIPS Workshop on Structured Outputs, 2004.
Ryan McDonald, Koby Crammer, and Fernando Pereira. Flexible text segmentation with structured multilabel classification. In Empirical Methods in Natural Language Processing (EMNLP), 2005a.
Ryan McDonald, Koby Crammer, and Fernando Pereira. Online large-margin training of dependency parsers. In Association for Computational Linguistics (ACL), 2005b.
Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In North American Chapter of the Association for Computational Linguistics (NAACL), 2010.
H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty Third Annual Conference on Learning Theory, 2010.
Thomas P. Minka, Rongjing Xiang, and Yuan Alan Qi. Virtual vector machine for Bayesian online classification. In Proceedings of the Twenty Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
Francesco Orabona and Koby Crammer. New adaptive algorithms for online classification. In Advances in Neural Information Processing Systems 24, 2010.
Kaare B. Petersen and Michael S. Pedersen. The matrix cookbook, Oct 2008. URL http://www2.imm.dtu.dk/pubdb/p.php?3274. Version 20081110.
John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, and A. J. Smola, editors, Advances in Large Margin Classifiers. MIT Press, 1998.
Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing (MIT Press, 1988).)
Libin Shen, Giorgio Satta, and Aravind K. Joshi. Guided learning for bidirectional sequence classification. In Association for Computational Linguistics (ACL), 2007.
Pannaga Shivaswamy and Tony Jebara. Ellipsoidal kernel machines. In Artificial Intelligence and Statistics (AISTATS), 2007.
Pannaga Shivaswamy and Tony Jebara. Empirical Bernstein boosting. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, volume 9 of JMLR: W&CP, pages 733–740, May 13-15 2010a.
Pannagadatta K. Shivaswamy and Tony Jebara. Maximum relative margin and data-dependent regularization. Journal of Machine Learning Research, 11:747–788, 2010b.
Richard S. Sutton. Adapting bias by gradient descent: an incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992.
Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research (JMLR), 2001.
Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning (ICML), 2004.