nips nips2012 nips2012-150 nips2012-150-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli
Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as in understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first-layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second-layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.
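The abstract outlines the generative architecture without its mechanics, so the following is a minimal NumPy sketch of how such a two-layer spiking generative model could be instantiated. All names (gammatone, W, z, etc.), dimensions, kernel shapes, and the Bernoulli approximation to the spiking process are illustrative assumptions, not the authors' implementation; the recurrent within-first-layer interactions, fine-scale time and amplitude jitter, and the learning procedure are omitted.

import numpy as np

rng = np.random.default_rng(0)

# --- illustrative dimensions (assumptions, not the paper's settings) ---
n_freq   = 32      # first-layer kernel channels (log-spaced center frequencies)
n_time   = 2000    # sound samples
n_feat   = 4       # second-layer features
kern_len = 128     # length of each first-layer kernel (samples)
feat_len = 20      # temporal extent of a second-layer feature (samples)

# First-layer kernels: gammatone-like bandpass filters tiling time-frequency
def gammatone(cf, n, fs=16000.0, order=4, bw=100.0):
    t = np.arange(n) / fs
    g = t ** (order - 1) * np.exp(-2 * np.pi * bw * t) * np.cos(2 * np.pi * cf * t)
    return g / np.linalg.norm(g)

center_freqs = np.logspace(np.log10(200), np.log10(6000), n_freq)
phi = np.stack([gammatone(cf, kern_len) for cf in center_freqs])    # (n_freq, kern_len)

# Second-layer features: each defines a time-frequency pattern of first-layer
# log-rate modulations (e.g. a harmonic stack, a sweep, a temporal onset)
W = 0.5 * rng.standard_normal((n_feat, n_freq, feat_len))

# Second-layer spikes: sparse events placed coarsely in time
z = (rng.random((n_feat, n_time)) < 1e-3).astype(float)

# Log firing rate of first-layer units: low baseline plus second-layer drive
b = -6.0 * np.ones((n_freq, 1))
log_rate = np.tile(b, (1, n_time))
for k in range(n_feat):
    for f in range(n_freq):
        log_rate[f] += np.convolve(z[k], W[k, f], mode="same")

# First-layer spikes drawn from the rate (Bernoulli stand-in for a point process)
s = (rng.random((n_freq, n_time)) < np.exp(log_rate)).astype(float)

# Sound pressure waveform: superposition of kernels at spike positions, plus noise
x = np.zeros(n_time + kern_len)
for f in range(n_freq):
    for t0 in np.flatnonzero(s[f]):
        x[t0:t0 + kern_len] += phi[f]
x = x[:n_time] + 0.01 * rng.standard_normal(n_time)

In the model described in the abstract, the first-layer log-rates also receive recurrent input from recent first-layer spikes (capturing fine-scale structure), and because the generative chain bottoms out in a distribution over sound pressure waveforms, the same machinery supports direct synthesis from second-layer spikes and model-based denoising.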
[1] C. Févotte, B. Torrésani, L. Daudet, and S. Godsill, “Sparse linear regression with structured priors and application to denoising of musical audio,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, pp. 174–185, Jan. 2008.
[2] M. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. Davies, “Sparse representations in audio and music: From coding to source separation,” Proceedings of the IEEE, vol. 98, pp. 995–1005, June 2010.
[3] D. J. Klein, P. König, and K. P. Körding, “Sparse spectrotemporal coding of sounds,” EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 659–667, Jan. 2003.
[4] H. Lee, Y. Largman, P. Pham, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in Neural Information Processing Systems, pp. 1096–1104, The MIT Press, 2009.
[5] E. Smith and M. S. Lewicki, “Efficient coding of time-relative structure using spikes,” Neural Computation, vol. 17, no. 1, pp. 19–45, 2005.
[6] M. Lewicki and T. Sejnowski, “Coding time-varying signals using sparse, shift-invariant representations,” in Advances in Neural Information Processing Systems, pp. 730–736, The MIT Press, 1999.
[7] S. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Transactions on Signal Processing, vol. 41, pp. 3397–3415, Dec. 1993.
[8] E. Smith and M. S. Lewicki, “Efficient auditory coding,” Nature, vol. 439, no. 7079, 2006.
[9] P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed. London: Chapman & Hall, 1989.
[10] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” 1993.
[11] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed. Academic Press, 2008.