emnlp emnlp2011 emnlp2011-34 emnlp2011-34-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yezhou Yang ; Ching Teo ; Hal Daume III ; Yiannis Aloimonos
Abstract: We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. As predicting actions from still images directly is unreliable, we use a language model trained on the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.
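The decoding step implied by the HMM formulation in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `viterbi_chain`, the three-slot layout (noun, verb, scene), and every probability below are hypothetical. In the paper the language-side estimates come from a Gigaword-trained language model and the emissions from trained detectors. Because verbs are not detected from the image, the sketch gives the verb slot a constant, uninformative emission.

```python
import math


def viterbi_chain(candidates, emission, transition):
    """Return the most likely label sequence for a chain of slots.

    candidates: list of lists, the labels allowed in each slot
    emission:   list of dicts, emission[i][label] = P(detection_i | label)
    transition: list of dicts, transition[i][(prev, cur)] = P(cur | prev),
                linking slot i to slot i + 1
    All probabilities must be strictly positive (logs are taken).
    """
    # best[label] = (log-probability, path) of the best path ending in label
    best = {c: (math.log(emission[0][c]), [c]) for c in candidates[0]}
    for i in range(1, len(candidates)):
        step = {}
        for cur in candidates[i]:
            # Pick the predecessor that maximizes the path score into `cur`.
            score, path = max(
                (prev_score + math.log(transition[i - 1][(prev, cur)]), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[cur] = (score + math.log(emission[i][cur]), path + [cur])
        best = step
    return max(best.values())[1]


# Hypothetical toy inputs: slot 0 = noun, slot 1 = verb, slot 2 = scene.
candidates = [["person", "dog"], ["ride", "sit"], ["street", "kitchen"]]

emission = [
    {"person": 0.6, "dog": 0.4},      # made-up object detector scores
    {"ride": 1.0, "sit": 1.0},        # verbs are unobserved: constant emission
    {"street": 0.7, "kitchen": 0.3},  # made-up scene detector scores
]

transition = [
    # P(verb | noun), standing in for language-model estimates
    {("person", "ride"): 0.7, ("person", "sit"): 0.3,
     ("dog", "ride"): 0.1, ("dog", "sit"): 0.9},
    # P(scene | verb), standing in for language-model estimates
    {("ride", "street"): 0.8, ("ride", "kitchen"): 0.2,
     ("sit", "street"): 0.4, ("sit", "kitchen"): 0.6},
]

result = viterbi_chain(candidates, emission, transition)
print(result)  # ['person', 'ride', 'street']
```

With these toy numbers the verb is chosen purely from the language-side transition probabilities, which mirrors the abstract's point that actions are estimated from text and not from the image.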
Berg, T. L., Berg, A. C., Edwards, J., and Forsyth, D. A. (2004). Who’s in the picture? In NIPS.
Bonnie, D. Z. and Dorr, B. (2004). BBN/UMD at DUC-2004: Topiary. In Proceedings of the 2004 Document Understanding Conference (DUC 2004) at HLT/NAACL 2004, pages 112–119.
Choi, J. D. and Palmer, M. (2010). Robust constituent-to-dependency conversion for English. In Proceedings of the 9th International Workshop on Treebanks and Linguistic Theories, pages 55–66, Tartu, Estonia.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2008). The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results.
Farhadi, A., Hejrati, S. M. M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth, D. A. (2010). Every picture tells a story: Generating sentences from images. In Daniilidis, K., Maragos, P., and Paragios, N., editors, ECCV (4), volume 6314 of Lecture Notes in Computer Science, pages 15–29. Springer.
Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2008). Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/ pff/latentrelease4/.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645.
Golland, D., Liang, P., and Klein, D. (2010). A game-theoretic approach to generating spatial descriptions. In Proceedings of EMNLP.
Graff, D. (2003). English Gigaword. In Linguistic Data Consortium, Philadelphia, PA.
Jie, L., Caputo, B., and Ferrari, V. (2009). Who’s doing what: Joint modeling of names and verbs for simultaneous face and pose annotation. In Advances in Neural Information Processing Systems (NIPS).
Kojima, A., Izumi, M., Tamura, T., and Fukunaga, K. (2000). Generating natural language description of human behavior from video images. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 4, pages 728–731.
Kourtzi, Z. (2004). But still, it moves. Trends in Cognitive Sciences, 8(2):47–49.
Liang, P., Jordan, M. I., and Klein, D. (2009). Learning from measurements in exponential families. In International Conference on Machine Learning (ICML).
Lin, C. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL-HLT.
Lin, H.-T., Lin, C.-J., and Weng, R. C. (2007). A note on Platt’s probabilistic outputs for support vector machines. Mach. Learn., 68:267–276.
Mann, G. S. and McCallum, A. (2007). Simple, robust, scalable semi-supervised learning via expectation regularization. In The 24th International Conference on Machine Learning.
McKeown, K. (2009). Query-focused summarization using text-to-text generation: When information comes from multilingual sources. In Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009), page 3, Suntec, Singapore. Association for Computational Linguistics.
Oliva, A. and Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175.
Schwartz, W., Kembhavi, A., Harwood, D., and Davis, L. (2009). Human detection using partial least squares analysis. In International Conference on Computer Vision.
Torralba, A., Murphy, K. P., Freeman, W. T., and Rubin, M. A. (2003). Context-based vision system for place and object recognition. In ICCV, pages 273–280. IEEE Computer Society.
Traum, D., Fleischman, M., and Hovy, E. (2003). NL generation for virtual humans in a complex social environment. In Proceedings of the AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue, pages 151–158.
Urgesi, C., Moro, V., Candidi, M., and Aglioti, S. M. (2006). Mapping implied body actions in the human motor system. J Neurosci, 26(30):7942–9.
Yang, W., Wang, Y., and Mori, G. (2010). Recognizing human actions from still images with latent poses. In CVPR.
Yao, B. and Fei-Fei, L. (2010). Grouplet: a structured image representation for recognizing human and object interactions. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA.
Yao, B., Yang, X., Lin, L., Lee, M. W., and Zhu, S.-C. (2010). I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508.