
253 nips-2009-Unsupervised feature learning for audio classification using convolutional deep belief networks



Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.


reference text

[1] E. C. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439:978–982, 2006.

[2] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

[3] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In UAI, 2007.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[5] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.

[6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.

[7] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.

[8] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief network model for visual area V2. In NIPS, 2008.

[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[10] G. Desjardins and Y. Bengio. Empirical evaluation of convolutional RBMs for vision. Technical report, 2008.

[11] M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In CVPR, 2009.

[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

[13] W. Fisher, G. Doddington, and K. Goudie-Marshall. The DARPA speech recognition research database: Specifications and status. In DARPA Speech Recognition Workshop, 1986.

[14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.

[15] P. Clarkson and P. J. Moreno. On the use of support vector machines for phonetic classification. In ICASSP’99, pages 585–588, 1999.

[16] D. A. Reynolds. Speaker identification and verification using Gaussian mixture speaker models. Speech Commun., 17(1-2):91–108, 1995.

[17] F. Sha and L. K. Saul. Large margin Gaussian mixture modeling for phonetic classification and recognition. In ICASSP’06, 2006.

[18] Y.-H. Sung, C. Boulis, C. Manning, and D. Jurafsky. Regularization, adaptation, and nonindependent features improve hidden conditional random fields for phone classification. In IEEE ASRU, 2007.

[19] S. Petrov, A. Pauls, and D. Klein. Learning structured models for phone recognition. In EMNLP-CoNLL, 2007.

[20] D. Yu, L. Deng, and A. Acero. Hidden conditional random field with distribution constraints for phone classification. In Interspeech, 2009.