NIPS 2010, paper 206: Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine and Deep Belief Networks
Source: pdf
Author: George Dahl, Marc'Aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
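As a sketch of the contrast the abstract draws, the following notation is assumed here (it follows the factored mcRBM of references [6] and [10], up to sign and scaling conventions, and is not taken from this paper's text). A GRBM with unit-variance visible units v and binary hidden units h has the conditional

\[ p(\mathbf{v} \mid \mathbf{h}) = \mathcal{N}\big(\mathbf{v};\, \mathbf{b} + W\mathbf{h},\, \mathbf{I}\big), \]

a fixed diagonal covariance, which is why the components of v are conditionally independent given the hidden state. In the mcRBM, conditioning on the binary precision units h^c and mean units h^m instead gives

\[ p(\mathbf{v} \mid \mathbf{h}^c, \mathbf{h}^m) = \mathcal{N}\big(\mathbf{v};\, \Sigma(\mathbf{h}^c)\, W\mathbf{h}^m,\, \Sigma(\mathbf{h}^c)\big), \qquad \Sigma(\mathbf{h}^c)^{-1} = C\,\mathrm{diag}(P\mathbf{h}^c)\,C^{\top} + \mathbf{I}, \]

so every configuration of the precision units h^c selects a different full precision matrix over the acoustic input, rather than the single shared diagonal covariance of the GRBM.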
[1] S. Young, “Statistical modeling in continuous speech recognition (CSR),” in UAI ’01: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 2001, pp. 562–571, Morgan Kaufmann Publishers Inc.
[2] C. K. I. Williams, “How to pretend that correlated variables are independent by using difference observations,” Neural Computation, vol. 17, no. 1, pp. 1–6, 2005.
[3] J. S. Bridle, “Towards better understanding of the model implied by the use of dynamic features in HMMs,” in Proceedings of the International Conference on Spoken Language Processing, 2004, pp. 725–728.
[4] K. C. Sim and M. J. F. Gales, “Minimum phone error training of precision matrix models,” IEEE Transactions on Audio, Speech & Language Processing, vol. 14, no. 3, pp. 882–889, 2006.
[5] A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
[6] M. Ranzato and G. Hinton, “Modeling pixel means and covariances using factorized third-order Boltzmann machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[7] T. N. Sainath, B. Ramabhadran, and M. Picheny, “An exploration of large vocabulary tools for small vocabulary phonetic recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop, 2009.
[8] G. E. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.
[9] N. Morgan and H. Bourlard, “Continuous speech recognition,” IEEE Signal Processing Magazine, vol. 12, no. 3, pp. 24–42, May 1995.
[10] M. Ranzato, A. Krizhevsky, and G. Hinton, “Factored 3-way restricted Boltzmann machines for modeling natural images,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, vol. 13.
[11] R. M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1996.
[12] R. M. Neal, “Connectionist learning of belief networks,” Artificial Intelligence, vol. 56, no. 1, pp. 71–113, 1992.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[14] K. F. Lee and H. W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 11, pp. 1641–1648, 1989.
[15] V. Mnih, “CUDAMat: a CUDA-based matrix class for Python,” Tech. Rep. UTML TR 2009-004, Department of Computer Science, University of Toronto, November 2009.
[16] V. Nair and G. E. Hinton, “3-d object recognition with deep belief nets,” in Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, Eds., 2009, pp. 1339–1347.
[17] V. V. Digalakis, M. Ostendorf, and J. R. Rohlicek, “Fast algorithms for phone classification and recognition using segment-based models,” IEEE Transactions on Signal Processing, vol. 40, pp. 2885–2896, 1992.
[18] J. Morris and E. Fosler-Lussier, “Combining phonetic attributes using conditional random fields,” in Proc. Interspeech, 2006, pp. 597–600.
[19] F. Sha and L. Saul, “Large margin Gaussian mixture modeling for phonetic classification and recognition,” in Proc. ICASSP, 2006, pp. 265–268.
[20] Y. Hifny and S. Renals, “Speech recognition using augmented conditional random fields,” IEEE Transactions on Audio, Speech & Language Processing, vol. 17, no. 2, pp. 354–365, 2009.
[21] A. Robinson, “An application of recurrent nets to phone probability estimation,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, 1994.
[22] J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in Proc. ICASSP, 1998, pp. 409–412.
[23] L. Deng and D. Yu, “Use of differential cepstra as acoustic features in hidden trajectory modelling for phonetic recognition,” in Proc. ICASSP, 2007, pp. 445–448.
[24] A. Halberstadt and J. Glass, “Heterogeneous measurements and multiple classifiers for speech recognition,” in Proc. ICSLP, 1998.