jmlr jmlr2011 jmlr2011-84 jmlr2011-84-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amarnag Subramanya, Jeff Bilmes
Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 108 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models
A. Alexandrescu and K. Kirchhoff. Data-driven graph construction for semi-supervised graph-based learning in nlp. In Proc. of the Human Language Technologies Conference (HLT-NAACL), 2007a. A. Alexandrescu and K. Kirchhoff. Graph-based learning for statistical machine translation. In Proc. of the Human Language Technologies Conference (HLT-NAACL), 2007b. S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In ACMSIAM Symp. on Discrete Algorithms (SODA), 1993. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 1998. A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with bregman divergences. Journal of Machine Learning Research, 2005. R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 3:1183–1208, 2003. ISSN 1533-7928. M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. In Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS), 2005. Y. Bengio, O. Delalleau, and N. L. Roux. Label propogation and quadratic criterion. In O. Chapelle, B. Scholkopf, and A. Zien, editors, Semi-Supervised Learning. MIT Press, 2007. D. Bertsekas. Nonlinear Programming. Athena Scientific Publishing, 1999. 3365 S UBRAMANYA AND B ILMES J. Bilmes and A. Subramanya. Parallel graph-based semi-supervised learning. In R. Bekkerman, M. Bilenko, and J. Langford, editors, Scaling Up Machine Learning. Cambridge University Press, 2011. Forthcoming. C. Bishop, editor. Neural Networks for Pattern Recognition. Oxford University Press, 1995. J. Blitzer and J. Zhu. ACL 2008 tutorial on Semi-Supervised learning. http://ssl-acl08. wikidot.com/, 2008. A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning, pages 19–26. Morgan Kaufmann, San Francisco, CA, 2001. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, 1998. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2006. O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2007. W. Cheney and A. Goldstien. Proximity maps for convex sets. American Mathematical Society, 1959. R. Collobert, F. Sinz, J. Weston, L. Bottou, and T. Joachims. Large scale transductive svms. Journal of Machine Learning Research, 2006. A. Corduneanu and T. Jaakkola. On information regularization. In Uncertainty in Artificial Intelligence, 2003. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. Wiley, New York, 1991. I. Csiszar and G Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, 1984. O. Delalleau, Y. Bengio, and N. L. Roux. Efficient non-parametric function induction in semisupervised learning. In Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS), 2005. A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977. N. Deshmukh, A. Ganapathiraju, A. Gleeson, J. Hamaker, and J. Picone. Resegmentation of switchboard. In Proceedings of the International Conference on Spoken Language Processing, pages 1543–1546, Sydney, Australia, November 1998. S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In CIKM ’98: Proceedings of the Seventh International Conference on Information and Knowledge Management, New York, NY, USA, 1998. 3366 G RAPH - BASED S EMI -S UPERVISED L EARNING WITH M EASURE P ROPAGATION G. Evermann, H. Y. Chan, M. J. F. Gales, B. Jia, D. Mrva, P. C. Woodland, and K. Yu. Training lvcsr systems on thousands of hours of data. In Proc. of ICASSP, 2005. W. Fei and Z. Changshui. Label propagation through linear neighborhoods. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 985–992, New York, NY, USA, 2006. ACM. J.H. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transaction on Mathematical Software, 3, 1977. J. Godfrey, E. Holliman, and J. McDaniel. Switchboard: telephone speech corpus for research and development. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520, San Francisco, California, March 1992. A. B. Goldberg and X. Zhu. Keepin’ it real: Semi-supervised learning with realistic tuning. In SemiSupLearn ’09: Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, Morristown, NJ, USA, 2009. Association for Computational Linguistics. Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In CAP, 2005. S Greenberg. The switchboard transcription project. Technical report, The Johns Hopkins University (CLSP) Summer Research Workshop, 1995. A. Gunawardena. The Information Geometri of EM Variants for Speech and Image Processing. PhD thesis, Johns Hopkins University, 2001. A. K. Halberstadt and J. R. Glass. Heterogeneous acoustic measurements for phonetic classification. In Proc. Eurospeech ’97, pages 401–404, Rhodes, Greece, 1997. URL citeseer.ist.psu.edu/ article/halberstadt97heterogeneous.html. H.L.Royden. Real Analysis. Prentice Hall, 1988. G. Ji and J. Bilmes. Dialog act tagging using graphical models. In Proc. of ICASSP, 2005. T. Joachims. Transductive inference for text classification using support vector machines. In Proc. of the International Conference on Machine Learning (ICML), 1999. T. Joachims. SVM Light. http://svmlight.joachims.org, 2002. T. Joachims. Transductive learning via spectral graph partitioning. In Proc. of the International Conference on Machine Learning (ICML), 2003. T. Joachims. SGT light. http://sgt.joachims.org, 2004. D. Jurafsky and C. V. Ess-Dykema. Switchboard discourse language modeling project. Johns Hopkins Summer Workshop, 1997. M. Karlen, J. Weston, A. Erkan, and R. Collobert. Large scale manifold transduction. In International Conference on Machine Learning, ICML, 2008. 3367 S UBRAMANYA AND B ILMES J. Lafferty, S. D. Pietra, and V. D. Pietra. Statistical learning algorithms based on bregman distances. In Proceedings of 1997 Canadian Workshop on Information Theory, 1997. K. F. Lee and H. Hon. Speaker independant phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11), 1989. D. Lewis et al. Reuters-21578. http://www.daviddlewis.com/resources/testcollections/ reuters21578, 1987. X. Li and J. Bilmes. Regularized adaptation of discriminative classifiers. In Proc. IEEE Intl. Conf. on Acoustic, Speech and Signal Processing, September 2006. J. Malkin, A. Subramanya, and J. A. Bilmes. On the semi-supervised learning of multi-layered perceptrons. In Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), Brighton, UK, September 2009. G. S. Mann and A. McCallum. Simple, robust, scalable, semi-supervised learning via expectation regularization. In Proc. of the International Conference on Machine Learning (ICML), 2007. C. Mart´nez-Hinarejos, J. Bened´, and R. Granell. Statistical framework for a spanish spoken diaı ı logue corpus. Speech Communication, 50(11-12), 2008. N. Morgan. Personal communication, 2009. B. Nadler, N. Srebro, and X. Zhou. Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In Advances in Neural Information Processing Systems (NIPS), 2010. A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems (NIPS), 2002. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. In AAAI ’98/IAAI ’98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pages 792–799, 1998. J. Pearl. Jeffrey’s Rule, Passage of Experience and Neo-Bayesianism in Knowledge Representation and Defeasible Reasoning. Kluwer Academic Publishers, 1990. M.F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980. V. Raghavan, P. Bollmann, and G. S. Jung. A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst., 7(3):205–229, 1989. ISSN 10468188. R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 2004. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical report, Cornell University, Ithaca, NY, USA, 1987. H. J. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11, 1965. 3368 G RAPH - BASED S EMI -S UPERVISED L EARNING WITH M EASURE P ROPAGATION M. Seeger. Learning with labeled and unlabeled data. Technical report, University of Edinburgh, U.K., 2000. V. Sindhwani and S. Keerthi. Large scale semi-supervised linear svms. In SIGIR ’06: Proceedings of the 29th Annual International ACM SIGIR, 2006. V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: From transductive to semisupervised learning. In Proc. of the International Conference on Machine Learning (ICML), 2005. P. Somervuo. Experiments with linear and nonlinear feature transformations in HMM based phone recognition. In Proc. of ICASSP 2003, 2003. URL citeseer.ist.psu.edu/ somervuo03experiments.html. A. Subramanya and J. Bilmes. Soft-supervised text classification. In EMNLP, 2008. A. Subramanya and J. Bilmes. The semi-supervised switchboard transcription project. In Interspeech, 2009. A. Subramanya, C. Bartels, J. Bilmes, and P. Nguyen. Uncertainty in training large vocabulary speech recognizers. In Proc. of the IEEE Workshop on Speech Recognition and Understanding, 2007. M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, volume 14, 2001. A. Tomkins. Keynote speech. CIKM Workshop on Search and Social Media, 2008. I. W. Tsang and J. T. Kwok. Large-scale sparsified manifold regularization. In Advances in Neural Information Processing Systems (NIPS) 19, 2006. K. Tsuda. Propagating distributions on a hypergraph by dual information regularization. In Proceedings of the 22nd International Conference on Machine Learning, 2005. K. Tsuda, G. R¨ tsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learna ing and bregman projection. J. Mach. Learn. Res., 6:995–1018, 2005. ISSN 1533-7928. V. Vladimir. Statistical Learning Theory. Wiley Series, 1998. J. Wang, T. Jebara, and S. Chang. Graph transduction via alternating minimization. In Proc. of the International Conference on Machine Learning (ICML), 2008. C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1): 95–103, 1983. W. Zangwill. Nonlinear Programming: a Unified Approach. Prentice-Hall International Series in Management, Englewood Cliffs: N.J., 1969. X.H. Zhang and W.S. Lee. Hyperparameter learning for semi-supervised learning algorithms. In NIPS, 2006. 3369 S UBRAMANYA AND B ILMES D. Zhou, J. Huang, and B. Sch¨ lkopf. Learning from labeled and unlabeled data on a directed graph. o In Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005. X. Zhu. Semi-Supervised Learning with Graphs. PhD thesis, Carnegie Mellon University, 2005a. X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005b. X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University, 2002a. X. Zhu and Z. Ghahramani. Towards semi-supervised classification with Markov random fields. Technical Report CMU-CALD-02-106, Carnegie Mellon University, 2002b. X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using gaussian fields and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML), 2003. X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS), 2005. V.W. Zue, S. Seneff, and J. Glass. Speech database development at MIT:TIMIT and beyond. In Speech Comminication, 1990. 3370