
82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification


Source: pdf

Author: Amarnag Subramanya, Jeff A. Bilmes

Abstract: We prove certain theoretical properties of a graph-regularized transductive learning objective that minimizes a loss based on the Kullback-Leibler (KL) divergence. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution, and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations, which ensures that the algorithm scales to large data sets. Through empirical evaluation on the TIMIT and Switchboard I corpora, we show that this approach outperforms other state-of-the-art semi-supervised learning (SSL) approaches. In one instance, we solve a problem on a 120-million-node graph.
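For intuition about the kind of objective the abstract describes, a KL-divergence-based graph-regularized criterion typically has the following shape. This is only a sketch of the general template; the symbols p_i, r_i, w_ij, N(i), mu, and nu are notation assumed here for illustration, not quoted from the paper body:

\[
\mathcal{C}(p) \;=\; \sum_{i \in \mathcal{L}} D_{\mathrm{KL}}\!\left(r_i \,\|\, p_i\right)
\;+\; \mu \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}\, D_{\mathrm{KL}}\!\left(p_i \,\|\, p_j\right)
\;-\; \nu \sum_{i} H(p_i)
\]

Here p_i is the predicted label distribution at node i, r_i the empirical label distribution at a node in the labeled set \mathcal{L}, w_ij an edge weight, H the Shannon entropy, and mu, nu non-negative trade-off weights. Alternating minimization of such an objective introduces a second copy of the per-node distributions and optimizes the two sets in turn, each step being a convex subproblem; the convergence result mentioned in the abstract concerns a procedure of exactly this kind.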


reference text

[1] O. Chapelle, B. Schölkopf, and A. Zien, eds., Semi-Supervised Learning. MIT Press, 2007.

[2] X. Zhu, “Semi-supervised learning literature survey,” tech. rep., Computer Sciences, University of Wisconsin-Madison, 2005.

[3] M. Szummer and T. Jaakkola, “Partially labeled classification with Markov random walks,” in Advances in Neural Information Processing Systems, vol. 14, 2001.

[4] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” tech. rep., Carnegie Mellon University, 2002.

[5] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proc. of the International Conference on Machine Learning (ICML), 2003.

[6] T. Joachims, “Transductive learning via spectral graph partitioning,” in Proc. of the International Conference on Machine Learning (ICML), 2003.

[7] A. Corduneanu and T. Jaakkola, “On information regularization,” in Uncertainty in Artificial Intelligence, 2003.

[8] K. Tsuda, “Propagating distributions on a hypergraph by dual information regularization,” in Proc. of the 22nd International Conference on Machine Learning (ICML), 2005.

[9] M. Belkin, P. Niyogi, and V. Sindhwani, “On manifold regularization,” in Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS), 2005.

[10] C. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[11] A. Subramanya and J. Bilmes, “Soft-supervised text classification,” in EMNLP, 2008.

[12] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Large scale transductive SVMs,” Journal of Machine Learning Research, 2006.

[13] V. Sindhwani and S. S. Keerthi, “Large scale semi-supervised linear SVMs,” in SIGIR ’06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.

[14] O. Delalleau, Y. Bengio, and N. Le Roux, “Efficient non-parametric function induction in semi-supervised learning,” in Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS), 2005.

[15] M. Karlen, J. Weston, A. Erkan, and R. Collobert, “Large scale manifold transduction,” in Proc. of the International Conference on Machine Learning (ICML), 2008.

[16] I. W. Tsang and J. T. Kwok, “Large-scale sparsified manifold regularization,” in Advances in Neural Information Processing Systems (NIPS) 19, 2006.

[17] A. Tomkins, “Keynote speech.” CIKM Workshop on Search and Social Media, 2008.

[18] A. Jansen and P. Niyogi, “Semi-supervised learning of speech sounds,” in Interspeech, 2007.

[19] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley Series in Telecommunications, New York: Wiley, 1991.

[20] Y. Bengio, O. Delalleau, and N. Le Roux, Semi-Supervised Learning, ch. Label Propagation and Quadratic Criterion. MIT Press, 2007.

[21] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[22] T. Abatzoglou and B. O’Donnell, “Minimization by coordinate descent,” Journal of Optimization Theory and Applications, 1982.

[23] W. Zangwill, Nonlinear Programming: A Unified Approach. Englewood Cliffs, N.J.: Prentice-Hall International Series in Management, 1969.

[24] C. F. J. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, no. 1, pp. 95–103, 1983.

[25] I. Csiszár and G. Tusnády, “Information geometry and alternating minimization procedures,” Statistics and Decisions, 1984.

[26] A. K. Halberstadt and J. R. Glass, “Heterogeneous acoustic measurements for phonetic classification,” in Proc. Eurospeech ’97, (Rhodes, Greece), pp. 401–404, 1997.

[27] K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, 1989.

[28] J. Godfrey, E. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in Proceedings of ICASSP, vol. 1, (San Francisco, California), pp. 517–520, March 1992.

[29] N. Deshmukh, A. Ganapathiraju, A. Gleeson, J. Hamaker, and J. Picone, “Resegmentation of Switchboard,” in Proceedings of ICSLP, (Sydney, Australia), pp. 1543–1546, November 1998.

[30] S. Greenberg, “The Switchboard transcription project,” tech. rep., The Johns Hopkins University (CLSP) Summer Research Workshop, 1995.

[31] J. Friedman, J. Bentley, and R. Finkel, “An algorithm for finding best matches in logarithmic expected time,” ACM Transactions on Mathematical Software, vol. 3, 1977.

[32] S. Arya and D. M. Mount, “Approximate nearest neighbor queries in fixed dimensions,” in ACM-SIAM Symp. on Discrete Algorithms (SODA), 1993.