356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

Source: pdf

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.

reference text

[1] E. Wold, T. Blum, D. Keislar, and J.W. Wheaton. Content-based classification, search, and retrieval of audio. IEEE Multimedia, 3:27–36, 1996.

[2] A.G. Hauptmann and M.J. Witbrock. Informedia: News-on-demand multimedia information acquisition and retrieval. In Proceedings of Intelligent Multimedia Information Retrieval, pages 213–239. AAAI Press, 1997.

[3] G. Guo and S.Z. Li. Content-based audio classification and retrieval by support vector machines. IEEE Transactions on Neural Networks, 14, 2003.

[4] M. Slaney. Mixture of probability experts for audio retrieval and indexing. In Proceedings of the International Conference of Multimedia and Expo, 2002.

[5] M. Slaney. Semantic audio retrieval. In Proceedings of the International Conference on Acoustic Speech and Signal Processing, 2002.

[6] S.F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, and J. Luo. Large-scale multimodal semantic concept detection for consumer video. In Proceedings of the MIR workshop, ACM-Multimedia, 2007.

[7] S. Sundaram and S. Narayanan. Classification of sound clips by two schemes: using onomatopoeia and semantic labels. In Proceedings of the IEEE International Conference of Multimedia and Expo, 2008.

[8] Z. Liu, J. Huang, and Y. Wang. Classification of TV programs based on audio information using hidden Markov model. In Proceedings of the 2nd IEEE Workshop on Multimedia Signal Processing, 1998.

[9] S. Berrani, G. Manson, and P. Lechat. A non-supervised approach for repeated sequence detection in TV broadcast streams. In Signal Processing: Image Communication, volume 23, pages 525–537, 2008.

[10] G. Friedland, L. Gottlieb, and A. Janin. Using artistic markers and speaker identification for narrative-theme navigation of Seinfeld episodes. In Workshop on Content-Based Audio/Video Analysis for Novel TV Services, 11th IEEE International Symposium on Multimedia, 2009.

[11] S. Kim, S. Sundaram, P. Georgiou, and S. Narayanan. Audio scene understanding using topic models. In Proceedings of the NIPS Workshop on Applications for Topic Models: Text and Beyond, 2009.

[12] S. Kim, S. Sundaram, P. Georgiou, and S. Narayanan. Acoustic stopwords for unstructured audio information retrieval. In Proceedings of the 18th European Signal Processing Conference, 2010.

[13] X. Zhu. Semi-supervised learning with graphs. PhD Thesis, 2005.

[14] S. Chaudhuri, M. Harvilla, and B. Raj. Unsupervised learning of acoustic unit descriptors for audio content representation and classification. In Proceedings of Interspeech, 2011.

[15] W. Li. Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE Transactions on Information Theory, 38:1842–1845, 1992.

[16] J. Eeckhout. Gibrat’s law for (all) cities. American Economic Review, 94:1429–1451, 2004.

[17] D. Mochihashi, T. Yamada, and N. Ueda. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proceedings of the 47th Meeting of the Association for Computational Linguistics, 2009.

[18] H. Poon, C. Cherry, and K. Toutanova. Unsupervised morphological segmentation with log-linear models. In Proceedings of the 47th Meeting of the Association for Computational Linguistics, 2009.

[19] S. Goldwater, T.L. Griffiths, and M. Johnson. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112:21–54, 2009.

[20] M. Johnson and S. Goldwater. Improving nonparametric Bayesian inference: Experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of Human Language Technologies: North American Chapter of the Association for Computational Linguistics, 2009.

[21] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[22] TRECVID Multimedia Event Detection Task. http://www.nist.gov/itl/iad/mig/med11.cfm. 2011.

[23] The Art of Foley. http://www.sound-ideas.com/artfoley.html. 2005.

[24] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.