Title: Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains.
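The abstract names two inference procedures: mean-field inference that fuses both modalities into one latent representation, and sampling from the conditional distribution to fill in an absent modality. The toy NumPy sketch below illustrates both for a simplified bimodal DBM. The layer sizes, weight names (W_vi, W_ij, etc.), random stand-in parameters, and the use of binary units throughout are illustrative assumptions, not the paper's model, which uses Gaussian image units and replicated-softmax text units.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer sizes: image input, text input, the two
# modality-specific hidden layers, and the shared joint layer.
n_vi, n_vt, n_hi, n_ht, n_hj = 64, 32, 20, 20, 16

# Randomly initialized weights stand in for trained DBM parameters.
W_vi = 0.01 * rng.standard_normal((n_vi, n_hi))  # image input  -> image hidden
W_vt = 0.01 * rng.standard_normal((n_vt, n_ht))  # text input   -> text hidden
W_ij = 0.01 * rng.standard_normal((n_hi, n_hj))  # image hidden -> joint layer
W_tj = 0.01 * rng.standard_normal((n_ht, n_hj))  # text hidden  -> joint layer
b_hi, b_ht, b_hj, b_vt = (np.zeros(n) for n in (n_hi, n_ht, n_hj, n_vt))

def fused_representation(v_img, v_txt, n_steps=10):
    """Mean-field inference for q(hidden layers | v_img, v_txt).
    In a DBM every hidden layer receives input from below *and* above,
    so the updates are iterated to a fixed point; the joint-layer
    marginals serve as the unified multimodal representation."""
    mu_i = sigmoid(v_img @ W_vi + b_hi)  # bottom-up initialization
    mu_t = sigmoid(v_txt @ W_vt + b_ht)
    mu_j = sigmoid(mu_i @ W_ij + mu_t @ W_tj + b_hj)
    for _ in range(n_steps):
        mu_i = sigmoid(v_img @ W_vi + mu_j @ W_ij.T + b_hi)
        mu_t = sigmoid(v_txt @ W_vt + mu_j @ W_tj.T + b_ht)
        mu_j = sigmoid(mu_i @ W_ij + mu_t @ W_tj + b_hj)
    return mu_j

def fill_in_text(v_img, n_steps=200):
    """Sample the missing text modality from p(v_txt | v_img) by
    alternating Gibbs sampling with the image input clamped.
    {v_txt, h_joint} and {h_img, h_txt} form two conditionally
    independent blocks, so each block is sampled in one shot."""
    v_txt = (rng.random(n_vt) < 0.5).astype(float)  # random start
    h_j = np.zeros(n_hj)
    for _ in range(n_steps):
        h_i = (rng.random(n_hi) < sigmoid(v_img @ W_vi + h_j @ W_ij.T + b_hi)).astype(float)
        h_t = (rng.random(n_ht) < sigmoid(v_txt @ W_vt + h_j @ W_tj.T + b_ht)).astype(float)
        v_txt = (rng.random(n_vt) < sigmoid(h_t @ W_vt.T + b_vt)).astype(float)
        h_j = (rng.random(n_hj) < sigmoid(h_i @ W_ij + h_t @ W_tj + b_hj)).astype(float)
    return v_txt

# Usage: fuse both modalities, or handle a unimodal (image-only) query
# by first filling in the absent text.
v_img = rng.random(n_vi)
rep = fused_representation(v_img, fill_in_text(v_img))
print(rep.shape)  # (16,) -- the fused joint representation
```

Because the pathway graph is bipartite, the sampler in fill_in_text can update each block of layers jointly, which is what makes conditional sampling with a clamped image modality tractable in this sketch.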
[1] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[2] Mark J. Huiskes, Bart Thomee, and Michael S. Lew. New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In Multimedia Information Retrieval, pages 527–536, 2010.
[3] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 902–909, June 2010.
[4] Eric P. Xing, Rong Yan, and Alexander G. Hauptmann. Mining associated text and images with dual-wing harmoniums. In UAI, pages 633–641. AUAI Press, 2005.
[5] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In International Conference on Machine Learning (ICML), Bellevue, USA, June 2011.
[6] Ruslan Salakhutdinov and Geoffrey E. Hinton. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614. Curran Associates, Inc., 2009.
[7] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[8] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML. ACM, 2008.
[9] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, March 17, 2000.
[10] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation. In MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA, 2008. ACM.
[11] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[12] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[13] B. S. Manjunath, J.-R. Ohm, V. V. Vasudevan, and A. Yamada. Color and texture descriptors. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):703–715, 2001.
[14] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[15] Muhammet Bastan, Hayati Cam, Ugur Gudukbay, and Ozgur Ulusoy. BilVideo-7: An MPEG-7 compatible video indexing and retrieval system. IEEE Multimedia, 17:62–73, 2010.