jmlr jmlr2009 jmlr2009-55 jmlr2009-55-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jun Zhu, Eric P. Xing
Abstract: The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and the ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural max-entropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the well-known maximum margin Markov networks (M3N) as a special case, and leads to a model similar to an L1-regularized M3N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization-based M3N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov networks. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov networks for structured input/output problems, and the basic principle can be generalized to learning arbitrary graphical models.
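For concreteness, the following is a minimal sketch, in LaTeX notation, of the structured maximum entropy discrimination problem that the abstract describes: a distribution p(w) over Markov-network coefficients is learned by minimizing its KL-divergence to a prior subject to expected margin constraints, and prediction averages over the learned distribution. The symbols used here (p_0(w) for the prior, F(x, y; w) for the model score, Delta-ell_i(y) for the structured loss, U(xi) for the slack penalty) are notational assumptions for illustration, not quotations from the paper.

    \min_{p(\mathbf{w}),\,\boldsymbol{\xi}} \;\; \mathrm{KL}\big(p(\mathbf{w}) \,\|\, p_0(\mathbf{w})\big) + U(\boldsymbol{\xi})
    \text{s.t.} \;\; \int p(\mathbf{w})\big[\Delta F_i(\mathbf{y};\mathbf{w}) - \Delta\ell_i(\mathbf{y})\big]\, d\mathbf{w} \;\ge\; -\xi_i, \quad \forall i,\; \forall \mathbf{y} \neq \mathbf{y}^i; \qquad \xi_i \ge 0,

    \text{where } \Delta F_i(\mathbf{y};\mathbf{w}) = F(\mathbf{x}^i,\mathbf{y}^i;\mathbf{w}) - F(\mathbf{x}^i,\mathbf{y};\mathbf{w}),
    \qquad h(\mathbf{x}) = \arg\max_{\mathbf{y}} \int p(\mathbf{w})\, F(\mathbf{x},\mathbf{y};\mathbf{w})\, d\mathbf{w}.

Under this reading, the abstract's item 3) corresponds to the choice of p_0(w): a normal prior recovers M3N as a special case, while a Laplace prior yields the simultaneously primal- and dual-sparse variant; item 4)'s modular algorithm alternates variational inference over p(w) with an M3N-style convex subproblem.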
Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector machines. In International Conference on Machine Learning, 2003.
Yasemin Altun, David McAllester, and Mikhail Belkin. Maximum margin semi-supervised learning for structured variables. In Advances in Neural Information Processing Systems, 2006.
Galen Andrew and Jianfeng Gao. Scalable training of l1-regularized log-linear models. In International Conference on Machine Learning, 2007.
David F. Andrews and Colin L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, B(1):99–102, 1974.
Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Advances in Neural Information Processing Systems, 2004.
Kristin P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, (1):23–34, 1992.
Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Ulf Brefeld and Tobias Scheffer. Semi-supervised learning for structured output variables. In International Conference on Machine Learning, 2006.
Antoni B. Chan, Nuno Vasconcelos, and Gert R. G. Lanckriet. Direct convex relaxations of sparse SVM. In International Conference on Machine Learning, 2007.
Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning, 2008.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, 2008.
Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, (8):1217–1260, 2007.
Susana Eyheramendy, Alexander Genkin, Wen-Hua Ju, David D. Lewis, and David Madigan. Sparse Bayesian classifiers for text categorization. Technical report, Rutgers University, 2003.
Mario Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–1159, 2003.
Mario A. T. Figueiredo. Adaptive sparseness using Jeffreys prior. In Advances in Neural Information Processing Systems, 2001.
Amir Globerson, Terry Y. Koo, Xavier Carreras, and Michael Collins. Exponentiated gradient algorithms for log-linear structured prediction. In International Conference on Machine Learning, 2007.
Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems, 1999.
Tony Jebara. Multitask sparsity via maximum entropy discrimination. To appear in Journal of Machine Learning Research, 2009.
Tony Jebara and Tommi Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, 2000.
Ata Kaban. On Bayesian classification with Laplace priors. Pattern Recognition Letters, 28(10):1271–1282, 2007.
John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, 2003.
John Langford, Matthias Seeger, and Nimrod Megiddo. An improved predictive accuracy bound for averaging classifiers. In International Conference on Machine Learning, 2001.
Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, 2001.
Su-In Lee, Varun Ganapathi, and Daphne Koller. Efficient structure learning of Markov networks using l1-regularization. In Advances in Neural Information Processing Systems, 2006.
Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, (45):503–528, 1989.
David McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data, edited by G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar, and S. V. N. Vishwanathan. MIT Press, 2007.
David McAllester. PAC-Bayesian model averaging. In the Twelfth Annual Conference on Computational Learning Theory, 1999.
Yuan (Alan) Qi, Martin Szummer, and Thomas P. Minka. Bayesian conditional random fields. In International Conference on Artificial Intelligence and Statistics, 2005.
Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems, 2004.
Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. (Online) subgradient methods for structured prediction. In International Conference on Artificial Intelligence and Statistics, 2007.
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
Peter Sollich. Bayesian methods for support vector machines: Evidence and predictive class probabilities. Journal of Machine Learning Research, (3):21–52, 2002.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, 2003.
Ben Taskar, Simon Lacoste-Julien, and Michael I. Jordan. Structured prediction via the extragradient method. In Advances in Neural Information Processing Systems, 2006.
Robert Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, B(58):267–288, 1996.
Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, (1):211–244, 2001.
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning, 2004.
Martin J. Wainwright, Pradeep Ravikumar, and John Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. In Advances in Neural Information Processing Systems, 2006.
David Wipf, Jason Palmer, and Bhaskar Rao. Perspectives on sparse Bayesian learning. In Advances in Neural Information Processing Systems, 2003.
Linli Xu, Dana Wilkinson, Finnegan Southey, and Dale Schuurmans. Discriminative unsupervised learning of structured predictors. In International Conference on Machine Learning, 2006.
Zhihua Zhang and Michael I. Jordan. Bayesian multicategory support vector machines. In Uncertainty in Artificial Intelligence, 2006.
Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems, 2004.
Jun Zhu and Eric P. Xing. On primal and dual sparsity of Markov networks. In International Conference on Machine Learning, 2009.
Jun Zhu, Zaiqing Nie, Bo Zhang, and Ji-Rong Wen. Dynamic hierarchical Markov random fields for integrated web data extraction. Journal of Machine Learning Research, (9):1583–1614, 2008a.
Jun Zhu, Eric P. Xing, and Bo Zhang. Laplace maximum margin Markov networks. In International Conference on Machine Learning, 2008b.
Jun Zhu, Eric P. Xing, and Bo Zhang. Partially observed maximum entropy discrimination Markov networks. In Advances in Neural Information Processing Systems, 2008c.
Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: Maximum margin supervised topic models for regression and classification. In International Conference on Machine Learning, 2009a.
Jun Zhu, Eric P. Xing, and Bo Zhang. Primal sparse max-margin Markov networks. In International Conference on Knowledge Discovery and Data Mining (KDD), 2009b.