
129 emnlp-2011-Structured Sparsity in Structured Prediction


Source: pdf

Author: Andre Martins; Noah Smith; Mario Figueiredo; Pedro Aguiar

Abstract: Linear models have enjoyed great success in structured prediction in NLP. While much progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1 regularization; both ignore the structure of the feature space, preventing practitioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
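To make the abstract's notion of a structured-sparsity regularizer concrete, the sketch below shows the proximal (group soft-thresholding) step of a group-lasso penalty, the basic operation behind the kind of efficient algorithms the abstract refers to. It is a minimal illustration of the general technique, not the paper's actual algorithm; the feature groups, step size, and regularization strength are made-up values used only for demonstration.

import numpy as np

def prox_group_lasso(w, groups, step, lam):
    """Proximal step for the group-lasso penalty lam * sum_g ||w_g||_2.

    Each group's sub-vector is shrunk toward zero by a common factor;
    groups whose norm falls below step * lam are zeroed out entirely,
    which is how whole blocks of features get discarded at once.
    """
    w = w.copy()
    for g in groups:  # g is an index array selecting one feature group
        norm = np.linalg.norm(w[g])
        if norm <= step * lam:
            w[g] = 0.0  # the entire group is removed from the model
        else:
            w[g] *= 1.0 - step * lam / norm
    return w

# Toy usage with three hypothetical feature "templates" as groups.
w = np.array([0.9, -0.4, 0.05, 0.02, 1.3, 0.7])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(prox_group_lasso(w, groups, step=0.5, lam=0.2))

Applied inside a proximal-gradient or online learning loop, this step drives entire feature groups, rather than individual weights, to zero.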


reference text

Y. Altun, I. Tsochantaridis, and T. Hofmann. 2003. Hidden Markov support vector machines. In Proc. of ICML.
G. Andrew and J. Gao. 2007. Scalable training of L1-regularized log-linear models. In Proc. of ICML.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. 2011. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press.
F. Bach. 2008. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 21.
S. Bakin. 1999. Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University.
S. Buchholz and E. Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proc. of CoNLL.
B. Carpenter. 2008. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, Alias-i.
R. Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
S. F. Chen and R. Rosenfeld. 2000. A survey of smoothing techniques for maximum entropy models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50.
M. Collins. 2002a. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. of EMNLP.
M. Collins. 2002b. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proc. of ACL.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. 2006. Online passive-aggressive algorithms. JMLR, 7:551–585.
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:380–393.
J. Duchi and Y. Singer. 2009. Efficient online and batch learning using forward backward splitting. JMLR, 10:2873–2908.
J. Eisenstein, N. A. Smith, and E. P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proc. of ACL.
M. A. T. Figueiredo. 2002. Adaptive sparseness using Jeffreys' prior. In Advances in Neural Information Processing Systems.
J. Friedman, T. Hastie, and R. Tibshirani. 2010. A note on the group lasso and a sparse group lasso. Unpublished manuscript.
J. Gao, G. Andrew, M. Johnson, and K. Toutanova. 2007. A comparative study of parameter estimation methods for statistical natural language processing. In Proc. of ACL.
J. Goodman. 2004. Exponential priors for maximum entropy models. In Proc. of NAACL.
J. Graça, K. Ganchev, B. Taskar, and F. Pereira. 2009. Posterior vs. parameter sparsity in latent variable models. In Advances in Neural Information Processing Systems.
I. Guyon and A. Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182.
R. Jenatton, J.-Y. Audibert, and F. Bach. 2009. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. 2010. Proximal methods for sparse hierarchical dictionary learning. In Proc. of ICML.
J. Kazama and J. Tsujii. 2003. Evaluation and extension of maximum entropy models with inequality constraints. In Proc. of EMNLP.
S. Kim and E. P. Xing. 2010. Tree-guided group lasso for multi-task regression with structured sparsity. In Proc. of ICML.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML.
G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. 2004. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72.
J. Langford, L. Li, and T. Zhang. 2009. Sparse online learning via truncated gradient. JMLR, 10:777–801.
T. Lavergne, O. Cappé, and F. Yvon. 2010. Practical very large scale CRFs. In Proc. of ACL.
J. Liu and J. Ye. 2010. Moreau-Yosida regularization for grouped tree structure learning. In Advances in Neural Information Processing Systems.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. 2010. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems.
A. F. T. Martins, N. A. Smith, E. P. Xing, P. M. Q. Aguiar, and M. A. T. Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In Proc. of EMNLP.
A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. 2011. Online learning of structured predictors with multiple kernels. In Proc. of AISTATS.
A. McCallum. 2003. Efficiently inducing features of conditional random fields. In Proc. of UAI.
R. T. McDonald, F. Pereira, K. Ribarov, and J. Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proc. of HLT-EMNLP.
G. Obozinski, B. Taskar, and M. I. Jordan. 2010. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252.
S. Petrov and D. Klein. 2008a. Discriminative log-linear grammars with latent variables. Advances in Neural Information Processing Systems, 20:1153–1160.
S. Petrov and D. Klein. 2008b. Sparse multi-scale grammars for discriminative latent variable parsing. In Proc. of EMNLP.
A. Quattoni, X. Carreras, M. Collins, and T. Darrell. 2009. An efficient projection for l1,∞ regularization. In Proc. of ICML.
E. F. T. K. Sang and S. Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proc. of CoNLL-2000 and LLL-2000.
E. F. T. K. Sang and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of CoNLL.
E. F. T. K. Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proc. of CoNLL.
M. Schmidt and K. Murphy. 2010. Convex structure learning in log-linear models: Beyond pairwise potentials. In Proc. of AISTATS.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. of ICML.
N. Sokolovska, T. Lavergne, O. Cappé, and F. Yvon. 2010. Efficient learning of sparse conditional random fields for supervised sequence labelling. IEEE Journal of Selected Topics in Signal Processing, 4(6):953–964.
B. Taskar, C. Guestrin, and D. Koller. 2003. Max-margin Markov networks. In Advances in Neural Information Processing Systems.
R. Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (B), pages 267–288.
I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. 2004. Support vector machine learning for interdependent and structured output spaces. In Proc. of ICML.
Y. Tsuruoka, J. Tsujii, and S. Ananiadou. 2009. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In Proc. of ACL.
S. J. Wright, R. Nowak, and M. A. T. Figueiredo. 2009. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493.
L. Xiao. 2009. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems.
M. Yuan and Y. Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society (B), 68(1):49–67.
P. Zhao, G. Rocha, and B. Yu. 2009. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497.
H. Zou and T. Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B (Statistical Methodology), 67(2):301–320.