
176 emnlp-2013-Structured Penalties for Log-Linear Language Models


Source: pdf

Author: Anil Kumar Nelakanti ; Cedric Archambeau ; Julien Mairal ; Francis Bach ; Guillaume Bouchard

Abstract: Language models can be formalized as log-linear regression models where the input features represent previously observed contexts up to a certain length m. The complexity of existing algorithms to learn the parameters by maximum likelihood scales linearly in nd, where n is the length of the training corpus and d is the number of observed features. We present a model that grows logarithmically in d, making it possible to efficiently leverage longer contexts. We account for the sequential structure of natural language using tree-structured penalized objectives to avoid overfitting and achieve better generalization.
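
For concreteness, the following is a minimal sketch of the setup the abstract describes; the notation (features \phi, weights W, vocabulary \mathcal{V}, penalty \Omega) is assumed for illustration and is not taken from the paper. A log-linear language model scores the next word y given a context x of at most m preceding symbols and is fit by penalized maximum likelihood:

% Sketch only; \phi(x) stacks indicator features for the observed suffixes of x (lengths 1..m),
% so d is the number of distinct context features seen in training.
p(y \mid x; W) = \frac{\exp\bigl(w_y^{\top}\phi(x)\bigr)}{\sum_{y' \in \mathcal{V}} \exp\bigl(w_{y'}^{\top}\phi(x)\bigr)},
\qquad
\hat{W} \in \arg\min_{W} \; -\sum_{t=1}^{n} \log p(y_t \mid x_t; W) + \lambda\,\Omega(W).

A generic gradient step on the likelihood term touches up to d features at each of the n training positions, which is the O(nd) scaling the abstract refers to; the tree-structured penalties mentioned above correspond to particular choices of \Omega.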


Reference text

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. 2012. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, pages 1–106.
A. Beck and M. Teboulle. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.
Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
P. Brucker. 1984. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3:163–166.
L. Burget, P. Matejka, P. Schwarz, O. Glembek, and J. H. Cernocky. 2007. Analysis of feature extraction and channel compensation in a GMM speaker recognition system. IEEE Transactions on Audio, Speech and Language Processing, 15(7):1979–1986, September.
Y.-W. Chang and M. Collins. 2011. Exact decoding of phrase-based translation models through Lagrangian relaxation. In Proc. Conf. Empirical Methods in Natural Language Processing, pages 26–37.
S. F. Chen and R. Rosenfeld. 2000. A survey of smoothing techniques for maximum entropy models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50.
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. 1990. Introduction to Algorithms. MIT Press.
S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393.
J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. 2008. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. 25th Int. Conf. Machine Learning.
R. Giegerich and S. Kurtz. 1997. From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica.
J. Goodman. 2001. A bit of progress in language modeling. Computer Speech and Language, pages 403–434, October.
J. Goodman. 2004. Exponential priors for maximum entropy models. In Proc. North American Chapter of the Association for Computational Linguistics.
C. Hu, J. T. Kwok, and W. Pan. 2009. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems.
R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. 2011. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334.
C. R. Kennington, M. Kay, and A. Friedrich. 2012. Suffix trees as language models. In Language Resources and Evaluation Conference.
R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, volume 1.
A. F. T. Martins, N. A. Smith, P. M. Q. Aguiar, and M. A. T. Figueiredo. 2011. Structured sparsity in structured prediction. In Proc. Conf. Empirical Methods in Natural Language Processing, pages 1500–1511.
P. McCullagh and J. Nelder. 1989. Generalized Linear Models. Chapman and Hall, 2nd edition.
A. Mnih and G. Hinton. 2007. Three new graphical models for statistical language modelling. In Proc. 24th Int. Conf. Machine Learning.
A. Mnih and G. Hinton. 2008. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems.
Y. Nesterov. 2007. Gradient methods for minimizing composite objective function. CORE Discussion Paper.
B. Roark, M. Saraclar, M. Collins, and M. Johnson. 2004. Discriminative language modeling with conditional random fields and the perceptron algorithm. In Proc. Association for Computational Linguistics.
A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proc. Int. Conf. Spoken Language Processing, 2:901–904.
E. Ukkonen. 1995. On-line construction of suffix trees. Algorithmica.
S. Vargas, P. Castells, and D. Vallet. 2012. Explicit relevance models in intent-oriented information retrieval diversification. In Proc. 35th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, SIGIR '12, pages 75–84. ACM.
F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y.-W. Teh. 2009. A stochastic memoizer for sequence data. In Proc. 26th Int. Conf. Machine Learning.
F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y.-W. Teh. 2011. The sequence memoizer. Communications of the ACM, 54:91–98.
J. Wu and S. Khudanpur. 2000. Efficient training methods for maximum entropy language modeling. In Proc. 6th Int. Conf. Spoken Language Technologies, pages 114–117.
P. Zhao, G. Rocha, and B. Yu. 2009. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497.