nips nips2001 nips2001-129 nips2001-129-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
[1] M. Collins, R. Schapire, and Y. Singer (2000). Logistic regression, adaBoost, and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
[2] J. N. Darroch and D. Ratcliff (1972). Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics 43:1470–1480.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. B 39: 1–37.
[4] S. Della Pietra, V. Della Pietra, and J. Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4): 380–393.
[5] P.S. Gopalakrishnan, D. Kanevsky, A. Ndas and D. Nahamoo (1991). An inequality for rational functions with applications to some statistical estimation problems. IEEE Transactions on Information Theory 37: 107–113.
[6] T. Jebara and A. Pentland (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. In M. Kearns, S. Solla, and D. Cohn (eds.). Advances in Neural Information Processing Systems 11, 494–500. MIT Press: Cambridge, MA.
[7] J. Kivinen and M. Warmuth (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132: 1–64.
[8] D. D. Lee and H. S. Seung (1999). Learning the parts of objects with nonnegative matrix factorization. Nature 401: 788–791.
[9] D. D. Lee and H. S. Seung (2000). Algorithms for nonnegative matrix factorization. In T. Leen, T. Dietterich, and V. Tresp (eds.). Advances in Neural Information Processing Systems 13. MIT Press: Cambridge, MA.
[10] Y.LeCun, L. Jackel, L.Bottou, A.Brunot, C.Cortes, J. Denker, H.Drucker, I.Guyon, U. Muller, E.Sackinger, P.Simard, and V.Vapnik (1995). A comparison of learning algorithms for handwritten digit recognition. In F.Fogelman and P.Gallinari (eds.). International Conference on Artificial Neural Networks, 1995, Paris: 53–60.
[11] G. McLachlan and K. Basford (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker.
[12] Y. Normandin (1991). Hidden Markov Models, Maximum Mutual Information Estimation and the Speech Recognition Problem. Ph.D. Thesis, McGill University, Montreal.
[13] J. A. O’Sullivan (1998). Alternating minimization algorithms: from Blahut-Arimoto to Expectation-Maximization. In A. Vardy (ed.). Codes, Curves, and Signals: Common Threads in Communications. Kluwer: Norwell, MA.
[14] V. Vapnik (1999). The Nature of Statistical Learning Theory. Springer Verlag.