
55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

Source: pdf

Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman

Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object-centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images.
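The idea of shared mixture components that are transformed separately in each image can be illustrated with a toy generative sketch. This is not the paper's model or its Gibbs sampler. It assumes a fixed list of shared 2-D cluster means chosen uniformly (the TDP instead draws clusters from a global Dirichlet process), Gaussian translations as the only transformation, and a Chinese restaurant process for assigning features to object instances. The function `sample_scene` and all of its parameters are hypothetical names introduced for illustration.

```python
import random


def sample_scene(num_features, shared_means, alpha=1.0,
                 shift_scale=5.0, noise=0.5, rng=None):
    """Sample 2-D feature positions for one image from a toy transformed mixture.

    Assumptions (not from the paper): clusters are a fixed list of means,
    transformations are Gaussian translations, and features are Gaussian
    around the translated mean.
    """
    rng = rng or random.Random()
    instances = []  # one entry per object instance in this image
    features = []   # (x, y, instance_index) triples

    for _ in range(num_features):
        # Chinese restaurant process: join an existing instance with weight
        # equal to its current size, or open a new one with weight alpha.
        # The number of instances is therefore not fixed in advance.
        weights = [inst["count"] for inst in instances] + [alpha]
        idx = rng.choices(range(len(weights)), weights=weights)[0]

        if idx == len(instances):
            # New instance: pick a shared cluster and draw its own translation.
            cluster = rng.randrange(len(shared_means))
            shift = (rng.gauss(0.0, shift_scale), rng.gauss(0.0, shift_scale))
            instances.append({"cluster": cluster, "shift": shift, "count": 0})

        inst = instances[idx]
        inst["count"] += 1

        # Feature position = shared object-centered mean + image-specific shift.
        mean_x, mean_y = shared_means[inst["cluster"]]
        x = rng.gauss(mean_x + inst["shift"][0], noise)
        y = rng.gauss(mean_y + inst["shift"][1], noise)
        features.append((x, y, idx))

    return features, instances
```

Calling `sample_scene(50, [(0.0, 0.0), (10.0, 0.0)])` for several images reuses the same two cluster means while each image gets its own instances and translations, so one image may contain several shifted copies of the same cluster.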

reference text

[1] P. Viola and M. J. Jones. Robust real–time face detection. IJCV, 57(2):137–154, 2004.

[2] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In ICCV, 2005.

Figure 4: TDP analysis of street scenes containing cars (red), buildings (green), and roads (blue). Top right: Global model G0 describing object shape (solid) and expected transformations (dashed). Bottom right: ROC curves (detection rate versus false alarm rate) for the car, building, and road categories, comparing TDP feature segmentation performance to an LDA model of feature appearance. Left: Four test images (first row), estimated segmentations of features into object categories (second row), transformed global clusters associated with each image interpretation (third row), and features assigned to different instances of the transformed car cluster (fourth row).

[3] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. J. Amer. Stat. Assoc., 90(430):577–588, June 1995.

[4] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, U.C. Berkeley Statistics, October 2004.

[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. In NIPS 17, pages 1385–1392. MIT Press, 2005.

[6] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Comp., 12:1247–1283, 2000.

[7] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV, volume 2, pages 1134–1141, 2003.

[8] J. M. Tenenbaum and H. G. Barrow. Experiments in interpretation-guided segmentation. Artif. Intel., 8:241–274, 1977.

[9] A. J. Storkey and C. K. I. Williams. Image modeling with position-encoding dynamic trees. IEEE Trans. PAMI, 25(7):859–871, July 2003.

[10] J. M. Siskind et al. Spatial random tree grammars for modeling hierarchal structure in images. Submitted to IEEE Trans. PAMI, 2004.

[11] Z. Tu, X. Chen, A. L. Yuille, and S. C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. In ICCV, volume 1, pages 18–25, 2003.

[12] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. In IJCAI 19, pages 1352–1359, 2005.

[13] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

[14] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, volume 2, pages 524–531, 2005.

[15] K. Barnard et al. Matching words and pictures. JMLR, 3:1107–1135, 2003.

[16] E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.

[17] E. G. Miller, N. E. Matsakis, and P. A. Viola. Learning from one example through shared densities on transforms. In CVPR, volume 1, pages 464–471, 2000.

[18] N. Jojic and B. J. Frey. Learning flexible sprites in video layers. In CVPR, volume 1, pages 199–206, 2001.

[19] D. G. Lowe. Distinctive image features from scale–invariant keypoints. IJCV, 60(2):91–110, 2004.