
Unsupervised feature learning for audio classification using convolutional deep belief networks (NIPS 2009, paper 253)

Source: pdf

Author: Honglak Lee, Peter Pham, Yan Largman, Andrew Y. Ng

Abstract: In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning approaches have not been extensively studied for auditory data. In this paper, we apply convolutional deep belief networks to audio data and empirically evaluate them on various audio classification tasks. In the case of speech data, we show that the learned features correspond to phones/phonemes. In addition, our feature representations learned from unlabeled audio data show very good performance for multiple audio classification tasks. We hope that this paper will inspire more research on deep learning approaches applied to a wide range of audio recognition tasks.

reference text

[1] E. C. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439:978–982, 2006.

[2] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

[3] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In UAI, 2007.

[4] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[5] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.

[6] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.

[7] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.

[8] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief network model for visual area V2. In NIPS, 2008.

[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[10] G. Desjardins and Y. Bengio. Empirical evaluation of convolutional RBMs for vision. Technical report, 2008.

[11] M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In CVPR, 2009.

[12] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

[13] W. Fisher, G. Doddington, and K. Goudie-Marshall. The DARPA speech recognition research database: Specifications and status. In DARPA Speech Recognition Workshop, 1986.

[14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.

[15] P. Clarkson and P. J. Moreno. On the use of support vector machines for phonetic classification. In ICASSP99, pages 585–588, 1999.

[16] D. A. Reynolds. Speaker identification and verification using Gaussian mixture speaker models. Speech Commun., 17(1-2):91–108, 1995.

[17] F. Sha and L. K. Saul. Large margin Gaussian mixture modeling for phonetic classification and recognition. In ICASSP’06, 2006.

[18] Y.-H. Sung, C. Boulis, C. Manning, and D. Jurafsky. Regularization, adaptation, and nonindependent features improve hidden conditional random fields for phone classification. In IEEE ASRU, 2007.

[19] S. Petrov, A. Pauls, and D. Klein. Learning structured models for phone recognition. In EMNLP-CoNLL, 2007.

[20] D. Yu, L. Deng, and A. Acero. Hidden conditional random field with distribution constraints for phone classification. In Interspeech, 2009.