Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach (ACL 2012, paper 74)

Source: pdf

Authors: Hao Tang; Joseph Keshet; Karen Livescu

Abstract: We address the problem of learning the mapping between words and their possible pronunciations in terms of sub-word units. Most previous approaches have involved generative modeling of the distribution of pronunciations, usually trained to maximize likelihood. We propose a discriminative, feature-rich approach using large-margin learning. This approach allows us to optimize an objective closely related to a discriminative task, to incorporate a large number of complex features, and still do inference efficiently. We test the approach on the task of lexical access; that is, the prediction of a word given a phonetic transcription. In experiments on a subset of the Switchboard conversational speech corpus, our models thus far improve classification error rates from a previously published result of 29.1% to about 15%. We find that large-margin approaches outperform conditional random field learning, and that the Passive-Aggressive algorithm for large-margin learning is faster to converge than the Pegasos algorithm.
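
The large-margin training mentioned in the abstract can be illustrated with a minimal sketch of one Passive-Aggressive (PA-I) update for lexical access, following the general form in Crammer et al. (2006). This is not the authors' implementation: the feature map (phone unigram and bigram counts tied to each candidate word), the fixed margin of 1, the toy phone strings, and all function names are assumptions made for illustration only.

```python
from collections import Counter

def features(phones, word):
    """Hypothetical joint feature map: phone unigram and bigram counts tied to a word."""
    feats = Counter()
    padded = ["<s>"] + list(phones) + ["</s>"]
    for p in phones:
        feats[(word, p)] += 1.0
    for a, b in zip(padded, padded[1:]):
        feats[(word, a, b)] += 1.0
    return feats

def score(w, feats):
    """Linear score: dot product of the sparse weight dict with sparse features."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def predict(w, phones, vocab):
    """Lexical access: return the highest-scoring word for a phone sequence."""
    return max(vocab, key=lambda word: score(w, features(phones, word)))

def pa_update(w, phones, gold, vocab, C=1.0):
    """One PA-I step with a fixed margin of 1 (an assumption); returns the hinge loss."""
    rivals = [v for v in vocab if v != gold]
    if not rivals:
        return 0.0
    f_gold = features(phones, gold)
    # Highest-scoring incorrect word is the most violated constraint.
    rival = max(rivals, key=lambda v: score(w, features(phones, v)))
    f_rival = features(phones, rival)
    loss = max(0.0, 1.0 - score(w, f_gold) + score(w, f_rival))
    if loss == 0.0:
        return 0.0  # passive: margin already satisfied
    delta = Counter(f_gold)
    delta.subtract(f_rival)
    sq_norm = sum(v * v for v in delta.values())
    tau = min(C, loss / sq_norm)  # aggressive step size, capped by C
    for k, v in delta.items():
        w[k] = w.get(k, 0.0) + tau * v
    return loss

def train(data, vocab, epochs=10, C=1.0):
    """Run several online passes over (phones, word) pairs."""
    w = {}
    for _ in range(epochs):
        for phones, gold in data:
            pa_update(w, phones, gold, vocab, C)
    return w

# Toy data, invented for this sketch (not from Switchboard).
TOY_DATA = [(["s", "eh", "n", "s"], "sense"), (["s", "ih", "n", "s"], "since")]
VOCAB = ["sense", "since"]
```

Calling `train(TOY_DATA, VOCAB)` and then `predict(w, phones, VOCAB)` gives a word hypothesis for a phone sequence. A realistic system would need a much richer feature set and a full vocabulary, which this sketch does not attempt.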

reference text

H. Bourlard, S. Furui, N. Morgan, and H. Strik. 1999. Special issue on modeling pronunciation variation for automatic speech recognition. Speech Communication, 29(2-4).
C. P. Browman and L. Goldstein. 1992. Articulatory phonology: an overview. Phonetica, 49(3-4).
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. 2006. Online passive aggressive algorithms. Journal of Machine Learning Research, 7.
K. Filali and J. Bilmes. 2005. A dynamic Bayesian framework to model context and memory in edit distance learning: An application to pronunciation classification. In Proc. Association for Computational Linguistics (ACL).
L. Fissore, P. Laface, G. Micca, and R. Pieraccini. 1989. Lexical access to large vocabularies for speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(8).
E. Fosler-Lussier, I. Amdal, and H.-K. J. Kuo. 2002. On the road to improved lexical confusability metrics. In ISCA Tutorial and Research Workshop (ITRW) on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology.
J. E. Fosler-Lussier. 1999. Dynamic Pronunciation Models for Automatic Speech Recognition. Ph.D. thesis, U. C. Berkeley.
S. Greenberg, J. Hollenback, and D. Ellis. 1996. Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. In Proc. International Conference on Spoken Language Processing (ICSLP).
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. 2005. Hidden conditional random fields for phone classification. In Proc. Interspeech.
T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu. 2005. Pronunciation modeling using a finite-state transducer representation. Speech Communication, 46(2).
T. Holter and T. Svendsen. 1999. Maximum likelihood modelling of pronunciation variation. Speech Communication.
C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In Proc. International Conference on Machine Learning (ICML).
B. Hutchinson and J. Droppo. 2011. Learning nonparametric models of pronunciation. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
D. Jurafsky, W. Ward, Z. Jianping, K. Herold, Y. Xiuyang, and Z. Sen. 2001. What kind of pronunciation variation is hard for triphones to model? In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
P. Jyothi, K. Livescu, and E. Fosler-Lussier. 2011. Lexical access experiments with context-dependent articulatory feature-based models. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
J. Keshet, S. Shalev-Shwartz, Y. Singer, and D. Chazan. 2007. A large margin algorithm for speech and audio segmentation. IEEE Transactions on Acoustics, Speech, and Language Processing, 15(8).
J. Keshet, D. McAllester, and T. Hazan. 2011. PAC-Bayesian approach for minimization of phoneme error rate. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
F. Korkmazskiy and B.-H. Juang. 1997. Discriminative training of the pronunciation networks. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proc. International Conference on Machine Learning (ICML).
K. Livescu and J. Glass. 2004. Feature-based pronunciation modeling with trainable asynchrony probabilities. In Proc. International Conference on Spoken Language Processing (ICSLP).
K. Livescu. 2005. Feature-based Pronunciation Modeling for Automatic Speech Recognition. Ph.D. thesis, Massachusetts Institute of Technology.
D. McAllaster, L. Gillick, F. Scattone, and M. Newman. 1998. Fabricating conversational speech data with acoustic models: A program to examine model-data mismatch. In Proc. International Conference on Spoken Language Processing (ICSLP).
J. Morris and E. Fosler-Lussier. 2008. Conditional random fields for integrating local discriminative classifiers. IEEE Transactions on Acoustics, Speech, and Language Processing, 16(3).
R. Prabhavalkar, E. Fosler-Lussier, and K. Livescu. 2011. A factored conditional random field model for articulatory feature forced transcription. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and G. Zavaliagkos. 1999. Stochastic pronunciation modelling from hand-labelled phonetic corpora. Speech Communication, 29(2-4).
E. S. Ristad and P. N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2).
G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM, 18.
M. Saraçlar and S. Khudanpur. 2004. Pronunciation change in conversational speech and its implications for automatic speech recognition. Computer Speech and Language, 18(4).
H. Schramm and P. Beyerlein. 2001. Towards discriminative lexicon optimization. In Proc. Eurospeech.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proc. International Conference on Machine Learning (ICML).
B. Taskar, C. Guestrin, and D. Koller. 2003. Max-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS) 17.
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6.
V. Venkataramani and W. Byrne. 2001. MLLR adaptation techniques for pronunciation modeling. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
O. Vinyals, L. Deng, D. Yu, and A. Acero. 2009. Discriminative pronunciation learning using phonetic decoder and minimum-classification-error criterion. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
G. Zweig, P. Nguyen, and A. Acero. 2010. Continuous speech recognition with a TF-IDF acoustic model. In Proc. Interspeech.
G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G.S.V.S. Sivaram, S. Bowman, and J. Kao. 2011. Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP).