Source: pdf
Author: Sanja Fidler, Abhishek Sharma, Raquel Urtasun
Abstract: We are interested in holistic scene understanding where images are accompanied by text in the form of complex sentential descriptions. We propose a holistic conditional random field model for semantic parsing which reasons jointly about which objects are present in the scene, their spatial extent, and the semantic segmentation of the image, and employs both text and image information as input. We automatically parse the sentences, extract objects and their relationships, and incorporate them into the model, both via potentials and by re-ranking candidate detections. We demonstrate the effectiveness of our approach on the challenging UIUC sentences dataset, showing segmentation improvements of 12.5% over the visual-only model and detection improvements of 5% AP over deformable part-based models [8].
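To make the text-to-detection coupling concrete, below is a minimal Python sketch (standard library only) of the re-ranking idea: detections whose class the accompanying sentences mention receive a score boost. The class list, synonym map, and the mentioned_classes/rerank helpers are hypothetical illustrations; the actual model extracts nouns and relations with a dependency parser [5, 28] and folds text evidence into CRF potentials rather than applying a hard score offset.

import re
from collections import Counter

# Hypothetical class vocabulary; the UIUC sentences dataset uses the
# PASCAL VOC object classes.
CLASSES = {"dog", "sheep", "train", "person", "cow", "boat", "table", "bus"}

# Toy word-to-class map standing in for the parser-based noun
# extraction used in the paper [5, 28].
SYNONYMS = {"passengers": "person", "man": "person", "men": "person",
            "cattle": "cow", "locomotive": "train", "sailboat": "boat"}

def mentioned_classes(sentences):
    # Count how often each object class is mentioned across the sentences.
    counts = Counter()
    for sent in sentences:
        for tok in re.findall(r"[a-z]+", sent.lower()):
            if tok in SYNONYMS:
                tok = SYNONYMS[tok]
            elif tok not in CLASSES and tok.endswith("s"):
                tok = tok[:-1]  # naive depluralization
            if tok in CLASSES:
                counts[tok] += 1
    return counts

def rerank(detections, sentences, boost=0.5):
    # detections: list of (class_name, score) pairs, e.g. DPM [8] outputs.
    # Detections of classes the text mentions get a constant score boost.
    mentions = mentioned_classes(sentences)
    rescored = [(c, s + (boost if mentions[c] else 0.0)) for c, s in detections]
    return sorted(rescored, key=lambda cs: -cs[1])

sents = ["A dog herding two sheep.",
         "A sheep dog and two sheep walking in a field."]
dets = [("cow", 0.6), ("dog", 0.4), ("sheep", 0.3)]
print(rerank(dets, sents))
# -> [('dog', 0.9), ('sheep', 0.8), ('cow', 0.6)]: text evidence promotes
#    the classes the sentences actually mention.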
[1] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video-in-sentences out. In UAI, 2012. 1, 2
[2] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 2003. 1, 2
[3] A. Berg, T. Berg, H. Daumé III, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos, and K. Yamaguchi. Understanding and predicting importance in images. In CVPR, 2012. 1, 2
[4] D. Blei and M. Jordan. Modeling annotated data. In ACM SIGIR, 2003. 2
[5] M. de Marneffe, B. MacCartney, and C. Manning. Generating typed dependency parses from phrase structure parses. In LREC, 2006. 2
[6] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In ECCV, 2002. 1, 2
[7] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences for images. In ECCV, 2010. 1, 2, 6
[8] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9), 2010. 1, 2, 4
[9] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008. 2
[10] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011. 6
[11] T. Hazan and R. Urtasun. A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, 2010. 3, 6
[Figure 7. A few example images, the sentences per image, and the final segmentation.]
[Figure 8. Results as a function of the number of sentences employed, compared against Yao et al. [29].]
[12] Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In ICCV, 2011. 1, 2
[13] D. Klein and C. Manning. Fast exact inference with a factored model for natural language parsing. In NIPS, 2003. 2
[14] P. Kohli, M. P. Kumar, and P. H. S. Torr. P3 and beyond: Solving energies with higher order cliques. In CVPR, 2007. 4
[15] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011. 1, 2
[16] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Graph cut based inference with co-occurrence statistics. In ECCV, 2010. 3, 4
[17] D. C. Lee, A. Gupta, M. Hebert, and T. Kanade. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS, 2010. 3
[18] L. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, 2009. 1, 2
[19] V. Ordonez, G. Kulkarni, and T. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011. 1, 2
[20] D. Putthividhy, H. Attias, and S. Nagarajan. Topic regression multi-modal latent Dirichlet allocation for image annotation. In CVPR, 2010. 2
[21] A. Quattoni, M. Collins, and T. Darrell. Learning visual representations using images with captions. In CVPR, 2007. 1, 2
[22] K. Saenko and T. Darrell. Filtering abstract senses from image search results. In NIPS, 2009. 2
[23] A. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed message passing for large scale graphical models. In CVPR, 2011. 6
[24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, 2008. 4
[25] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling appearance, shape and context. IJCV, 81(1), 2009. 6
[26] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR, 2010. 1, 2
[27] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012. 1, 2
[28] K. Toutanova, D. Klein, and C. Manning. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL, 2003. 2
[29] Y. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012. 2, 3, 4, 5, 6, 7, 8