Im2Text: Describing Images Using 1 Million Captioned Photographs



Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg

Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset: performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state-of-the-art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally, we introduce a new objective performance measure for image captioning.
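The non-parametric approach the abstract describes reduces, at its core, to nearest-neighbor caption transfer: represent every image in the captioned collection with a global descriptor, find the collection images closest to the query, and reuse their captions. The Python sketch below illustrates this retrieval step with a toy average-pooling descriptor; the function names and the descriptor itself are hypothetical stand-ins for illustration, not the authors' implementation.

import numpy as np

def tiny_descriptor(image, size=8):
    # Toy global descriptor: average-pool each channel to a size x size
    # grid and flatten. A crude, hypothetical stand-in for the global
    # scene features used for whole-image matching.
    h, w, c = image.shape
    cropped = image[:h - h % size, :w - w % size]
    pooled = cropped.reshape(size, h // size, size, w // size, c).mean(axis=(1, 3))
    return pooled.reshape(-1)

def transfer_captions(query_desc, collection_descs, captions, k=5):
    # Non-parametric captioning: return the captions attached to the k
    # collection images whose descriptors are closest to the query.
    dists = np.linalg.norm(collection_descs - query_desc, axis=1)
    return [captions[i] for i in np.argsort(dists)[:k]]

# Toy usage with random images standing in for the captioned collection.
rng = np.random.default_rng(0)
collection = rng.random((100, 64, 64, 3))
captions = ["caption %d" % i for i in range(100)]
descs = np.stack([tiny_descriptor(im) for im in collection])
query = rng.random((64, 64, 3))
print(transfer_captions(tiny_descriptor(query), descs, captions, k=3))

With descriptors this small, even a million-image collection can be searched by a brute-force distance scan, which is part of what makes such a simple data-driven scheme practical.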


References

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.

[2] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, E. Learned-Miller, Y. Teh, and D. Forsyth. Names and faces in the news. In CVPR, 2004.

[3] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In ECCV, 2010.

[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In CVPR, 2009.

[6] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation. In ECCV, 2002.

[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.

[8] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth. Describing objects by their attributes. In CVPR, 2009.

[9] A. Farhadi, M. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth. Every picture tells a story: generating sentences for images. In ECCV, 2010.

[10] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.

[11] Y. Feng and M. Lapata. How many words is a picture worth? Automatic caption generation for news images. In ACL, pages 1239–1249, 2010.

[12] V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS, 2007.

[13] J. Hays and A. A. Efros. im2gps: estimating geographic information from a single image. In CVPR, 2008.

[14] D. Hoiem, A. A. Efros, and M. Hebert. Recovering surface layout from an image. Int. J. Comput. Vision, 75:151–172, October 2007.

[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. In CVPR, 2011.

[16] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, 2009.

[17] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.

[18] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.

[19] W. Li, W. Xu, M. Wu, C. Yuan, and Q. Lu. Extractive summarization using inter- and intra-event relevance. In COLING, 2006.

[20] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object bank: A high-level image representation for scene classification and semantic feature sparsification. In NIPS, 2010.

[21] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.

[22] R. Mihalcea. Language independent extractive summarization. In AAAI, pages 1688–1689, 2005.

[23] A. Nenkova, L. Vanderwende, and K. McKeown. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In SIGIR, 2006.

[24] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.

[25] D. R. Radev and T. Allison. MEAD: a platform for multidocument multilingual text summarization. In LREC, 2004.

[26] J. Tighe and S. Lazebnik. Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.

[27] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30, 2008.

[28] K.-F. Wong, M. Wu, and W. Li. Extractive summarization using supervised and semi-supervised learning. In COLING, pages 985–992, 2008.

[29] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.

[30] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proc. IEEE, 98(8), 2010.