nips nips2008 nips2008-191 nips2008-191-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena.
[1] F. Jelinek and J. D. Lafferty, “Computation of the probability of initial substring generation by stochastic context-free grammars,” Computational Linguistics, vol. 17, no. 3, pp. 315–323, 1991.
[2] M. Collins, “Head-driven statistical models for natural language parsing,” Ph.D. Thesis, University of Pennsylvania, 1999.
[3] K. Lari and S. J. Young, “The estimation of stochastic context-free grammars using the inside-outside algorithm,” in Computer Speech and Language, 1990.
[4] M. Shilman, P. Liang, and P. A. Viola, “Learning non-generative grammatical models for document analysis,” in Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 962–969.
[5] Z. Tu and S. C. Zhu, “Image segmentation by data-driven Markov chain Monte Carlo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 657–673, 2002.
[6] Z. Tu, X. Chen, A. L. Yuille, and S. C. Zhu, “Image parsing: Unifying segmentation, detection, and recognition,” in Proceedings of IEEE International Conference on Computer Vision, 2003, pp. 18–25.
[7] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of International Conference on Machine Learning, 2001, pp. 282–289.
[8] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, “TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in Proceedings of European Conference on Computer Vision, 2006, pp. 1–15.
[9] M. Collins, “Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms,” in Proceedings of Annual Meeting on Association for Computational Linguistics conference on Empirical methods in natural language processing, 2002, pp. 1–8.
[10] X. He, R. S. Zemel, and M. A. Carreira-Perpiñán, “Multiscale conditional random fields for image labeling,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. 695–702.
[11] S. Kumar and M. Hebert, “A hierarchical field framework for unified context-based classification,” in Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 1284–1291.
[12] E. L. Allwein, R. E. Schapire, and Y. Singer, “Reducing multiclass to binary: A unifying approach for margin classifiers,” Journal of Machine Learning Research, vol. 1, pp. 113–141, 2000.
[13] Y. Boykov and M.-P. Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images,” in Proceedings of IEEE International Conference on Computer Vision, 2001, pp. 105–112.
[14] A. Oliva and A. Torralba, “Building the gist of a scene: the role of global image features in recognition,” Progress in Brain Research, vol. 155, pp. 23–36, 2006.
[15] A. Levin and Y. Weiss, “Learning to combine bottom-up and top-down segmentation,” in Proceedings of European Conference on Computer Vision, 2006, pp. 581–594.
[16] E. B. Sudderth, A. B. Torralba, W. T. Freeman, and A. S. Willsky, “Learning hierarchical models of scenes, objects, and parts,” in Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 1331–1338.
[17] Y. Chen, L. Zhu, C. Lin, A. L. Yuille, and H. Zhang, “Rapid inference on a novel AND/OR graph for object detection, segmentation and parsing,” in Advances in Neural Information Processing Systems, 2007.
[18] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, “Max-margin parsing,” in Proceedings of Annual Meeting on Association for Computational Linguistics conference on Empirical methods in natural language processing, 2004.
[19] L. Zhu, Y. Chen, X. Ye, and A. L. Yuille, “Structure-perceptron learning of a hierarchical log-linear model,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[20] J. Verbeek and B. Triggs, “Region classification with Markov field aspect models,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[21] Z. Tu, “Auto-context and its application to high-level vision tasks,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[22] J. Verbeek and B. Triggs, “Scene segmentation with CRFs learned from partially labeled images,” in Advances in Neural Information Processing Systems, vol. 20, 2008.