
Scaling Semi-supervised Naive Bayes with Feature Marginals (ACL 2013)


Source: pdf

Authors: Michael Lucas; Doug Downey

Abstract: Semi-supervised learning (SSL) methods augment standard machine learning (ML) techniques to leverage unlabeled data. SSL techniques are often effective in text classification, where labeled data is scarce but large unlabeled corpora are readily available. However, existing SSL techniques typically require multiple passes over the entirety of the unlabeled data, meaning the techniques are not applicable to large corpora being produced today. In this paper, we show that improving marginal word frequency estimates using unlabeled data can enable semi-supervised text classification that scales to massive unlabeled data sets. We present a novel learning algorithm, which optimizes a Naive Bayes model to accord with statistics calculated from the unlabeled corpus. In experiments with text topic classification and sentiment analysis, we show that our method is both more scalable and more accurate than SSL techniques from previous work.
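The core idea in the abstract — constraining a Naive Bayes model so that its class-conditional word distributions, mixed by the class prior, reproduce marginal word frequencies estimated from a large unlabeled corpus — can be illustrated with a toy sketch. This is not the paper's MNB-FM optimization; it is a minimal illustration under the assumption of a known positive-class prior, with one conditional estimated from labeled data and the other solved from the marginal constraint P(w) = P(+)P(w|+) + P(−)P(w|−). All function and variable names here are illustrative.

```python
import numpy as np

def nb_with_feature_marginals(X_lab, y, p_w_unlab, prior_pos=0.5, alpha=1.0):
    """Toy sketch of marginal-constrained Naive Bayes (not the paper's exact
    MNB-FM algorithm). X_lab: labeled bag-of-words count matrix; y: 0/1 labels;
    p_w_unlab: marginal word distribution estimated from unlabeled data."""
    X_lab = np.asarray(X_lab, dtype=float)
    y = np.asarray(y)
    # Smoothed positive-class word distribution from labeled data alone.
    counts_pos = X_lab[y == 1].sum(axis=0) + alpha
    theta_pos = counts_pos / counts_pos.sum()
    # Solve P(w) = P(+)P(w|+) + P(-)P(w|-) for the negative-class conditional,
    # clipping at a small floor and renormalizing to keep a valid distribution.
    theta_neg = (p_w_unlab - prior_pos * theta_pos) / (1.0 - prior_pos)
    theta_neg = np.clip(theta_neg, 1e-12, None)
    theta_neg = theta_neg / theta_neg.sum()
    return theta_pos, theta_neg

def predict_log_odds(x, theta_pos, theta_neg, prior_pos=0.5):
    """Multinomial NB log-odds for a single bag-of-words count vector x."""
    x = np.asarray(x, dtype=float)
    return (np.log(prior_pos) - np.log(1.0 - prior_pos)
            + x @ (np.log(theta_pos) - np.log(theta_neg)))
```

Because the marginal constraint only needs word counts from the unlabeled corpus, the unlabeled data can be summarized in a single streaming pass — which is the scalability property the abstract emphasizes, in contrast to EM-style methods that iterate over every unlabeled document.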


reference text

Mikhail Belkin and Partha Niyogi. 2004. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1):209–239.

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics, Prague, Czech Republic.

O. Chapelle, B. Schölkopf, and A. Zien, editors. 2006. Semi-Supervised Learning. MIT Press, Cambridge, MA.

Thorsten Joachims. 1999. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 200–209, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397.

Frank Lin and William W. Cohen. 2011. Adaptation of graph-based semi-supervised methods to large-scale text data. In The 9th Workshop on Mining and Learning with Graphs.

Wei Liu, Junfeng He, and Shih-Fu Chang. 2010. Large graph construction for scalable semi-supervised learning. In ICML, pages 679–686.

Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. J. Mach. Learn. Res., 11:955–984, March.

Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Mach. Learn., 39(2-3):103–134, May.

Jiang Su, Jelber Sayyad Shirab, and Stan Matwin. 2011. Large scale text classification using semi-supervised multinomial Naive Bayes. In Lise Getoor and Tobias Scheffer, editors, ICML, pages 97–104. Omnipress.

Amar Subramanya and Jeff A. Bilmes. 2009. Entropic graph regularization in non-parametric semi-supervised classification. In Neural Information Processing Society (NIPS), Vancouver, Canada, December.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.

Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 42–49. ACM.

X. Zhu and Z. Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University.

Xiaojin Zhu. 2006. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison.