NIPS 2010, paper 206: Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine and Deep Belief Networks
Source: pdf
Author: George Dahl, Marc'Aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
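As a sketch of the contrast the abstract draws, the following notation is assumed here (it follows the factored mcRBM of references [6] and [10], up to sign and scaling conventions, and is not taken from this paper's text). A GRBM with unit-variance visible units v and binary hidden units h has the conditional

\[ p(\mathbf{v} \mid \mathbf{h}) = \mathcal{N}\big(\mathbf{v};\, \mathbf{b} + W\mathbf{h},\, \mathbf{I}\big), \]

a fixed diagonal covariance, which is why the components of v are conditionally independent given the hidden state. In the mcRBM, conditioning on the binary precision units h^c and mean units h^m instead gives

\[ p(\mathbf{v} \mid \mathbf{h}^c, \mathbf{h}^m) = \mathcal{N}\big(\mathbf{v};\, \Sigma(\mathbf{h}^c)\, W\mathbf{h}^m,\, \Sigma(\mathbf{h}^c)\big), \qquad \Sigma(\mathbf{h}^c)^{-1} = C\,\mathrm{diag}(P\mathbf{h}^c)\,C^{\top} + \mathbf{I}, \]

so every configuration of the precision units h^c selects a different full precision matrix over the acoustic input, rather than the single shared diagonal covariance of the GRBM.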
[1] S. Young, “Statistical modeling in continuous speech recognition (CSR),” in UAI ’01: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 2001, pp. 562–571, Morgan Kaufmann Publishers Inc.
[2] C. K. I. Williams, “How to pretend that correlated variables are independent by using difference observations,” Neural Computation, vol. 17, no. 1, pp. 1–6, 2005.
[3] J. S. Bridle, “Towards better understanding of the model implied by the use of dynamic features in HMMs,” in Proceedings of the International Conference on Spoken Language Processing, 2004, pp. 725–728.
[4] K. C. Sim and M. J. F. Gales, “Minimum phone error training of precision matrix models,” IEEE Transactions on Audio, Speech & Language Processing, vol. 14, no. 3, pp. 882–889, 2006.
[5] A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
[6] M. Ranzato and G. Hinton, “Modeling pixel means and covariances using factorized third-order Boltzmann machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[7] T. N. Sainath, B. Ramabhadran, and M. Picheny, “An exploration of large vocabulary tools for small vocabulary phonetic recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop, 2009.
[8] G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.
[9] N. Morgan and H. Bourlard, “Continuous speech recognition,” IEEE Signal Processing Magazine, vol. 12, no. 3, pp. 24–42, May 1995.
[10] M. Ranzato, A. Krizhevsky, and G. Hinton, “Factored 3-way restricted Boltzmann machines for modeling natural images,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, vol. 13.
[11] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
[12] R. M. Neal, “Connectionist learning of belief networks,” Artificial Intelligence, vol. 56, no. 1, pp. 71–113, 1992.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[14] K. F. Lee and H. W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
[15] V. Mnih, “CUDAMat: a CUDA-based matrix class for Python,” Tech. Rep. UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.
[16] V. Nair and G. E. Hinton, “3-d object recognition with deep belief nets,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 1339–1347.
[17] V. V. Digalakis, M. Ostendorf, and J. R. Rohlicek, “Fast algorithms for phone classification and recognition using segment-based models,” IEEE Transactions on Signal Processing, vol. 40, pp. 2885–2896, 1992.
[18] J. Morris and E. Fosler-Lussier, “Combining phonetic attributes using conditional random fields,” in Proc. Interspeech, 2006, pp. 597–600.
[19] F. Sha and L. Saul, “Large margin Gaussian mixture modeling for phonetic classification and recognition,” in Proc. ICASSP, 2006, pp. 265–268.
[20] Y. Hifny and S. Renals, “Speech recognition using augmented conditional random fields,” IEEE Transactions on Audio, Speech & Language Processing, vol. 17, no. 2, pp. 354–365, 2009.
[21] A. Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[22] J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in Proc. ICASSP, 1998, pp. 409–412.
[23] L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp. 445–448.
[24] A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple classifiers for speech recognition,” in Proc. ICSLP, 1998.