311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

Source: pdf

Author: Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, Daphne Koller

Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video.
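
The iterative schedule described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the ridge-regression scorer, the hypothetical function names fit_ridge and self_paced_adapt, the linear schedule for the set sizes, and the rule of ranking examples by score or margin. It operates on plain feature vectors and does not model trajectory tracks or target-specific features.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-2):
    """Fit a linear scorer w by ridge regression on labels in {-1, +1}.

    Stand-in for a real detector; chosen only because it is deterministic.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def self_paced_adapt(Xs, ys, Xt, n_iters=4):
    """Hypothetical self-paced schedule.

    Each round uses more target examples (most confident first) and
    fewer source examples, then re-trains the scorer on their union.
    Returns the final weights and the (n_source, n_target) sizes per round.
    """
    w = fit_ridge(Xs, ys)  # initial scorer trained on labeled source data only
    history = []
    for k in range(1, n_iters + 1):
        frac = k / n_iters

        # Target: take the highest-|score| ("easiest") examples and
        # pseudo-label them with the sign of the current score.
        scores_t = Xt @ w
        n_t = max(1, int(round(frac * len(Xt))))
        idx_t = np.argsort(-np.abs(scores_t))[:n_t]
        yt = np.sign(scores_t[idx_t])
        yt[yt == 0] = 1.0

        # Source: keep a shrinking subset, largest margin first
        # (assumed selection rule; the set shrinks to half by the last round).
        margins_s = ys * (Xs @ w)
        n_s = max(1, int(round((1.0 - frac / 2.0) * len(Xs))))
        idx_s = np.argsort(-margins_s)[:n_s]

        X = np.vstack([Xs[idx_s], Xt[idx_t]])
        y = np.concatenate([ys[idx_s], yt])
        w = fit_ridge(X, y)
        history.append((n_s, n_t))
    return w, history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Xs = rng.normal(size=(40, 3))            # synthetic "image" features
    ys = np.where(Xs[:, 0] > 0, 1.0, -1.0)   # synthetic source labels
    Xt = rng.normal(size=(20, 3)) + 0.5      # shifted, unlabeled "video" features
    w, history = self_paced_adapt(Xs, ys, Xt)
    print(history)
```

The per-round sizes in history make the schedule visible: the source count falls while the target count rises, which is the behavior the abstract describes.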

reference text

[1] P. Over, G. Awad, M. Michel, J. Fiscus, W. Kraaij, and A. F. Smeaton. TRECVID 2011 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID 2011. NIST, USA, 2011.

[2] J. Yuen, B. C. Russell, C. Liu, and A. Torralba. LabelMe video: Building a video database with human annotations. In ICCV, 2009.

[3] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local svm approach. In ICPR, 2004.

[4] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE TPAMI, 2007.

[5] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.

[6] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

[8] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, 1981.

[9] C. Tomasi and T. Kanade. Detection and tracking of point features. Technical report, CMU, 1991.

[10] P. Sharma, C. Huang, and R. Nevatia. Unsupervised incremental learning for improved object detection in a video. In CVPR, 2012.

[11] X. Wang, G. Hua, and T. X. Han. Detection by detections: Non-parametric detector adaptation for a video. In CVPR, 2012.

[12] M. Yang, S. Zhu, F. Lv, and K. Yu. Correspondence driven adaptation for human profile recognition. In CVPR, 2011.

[13] N. Cherniavsky, I. Laptev, J. Sivic, and A. Zisserman. Semi-supervised learning of facial attributes in video. In ECCV, 2010.

[14] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.

[15] L.-J. Li and L. Fei-Fei. OPTIMOL: automatic Online Picture collecTion via Incremental MOdel Learning. IJCV, 2009.

[16] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.

[17] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, 2011.

[18] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.

[19] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS, 2010.

[20] G. Schweikert, C. Widmer, B. Schölkopf, and G. Rätsch. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In NIPS, 2008.

[21] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive svms. In ACM Multimedia, 2007.

[22] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. In CVPR, 2010.

[23] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, 1999.

[24] T. Tommasi, F. Orabona, and B. Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In CVPR, 2010.

[25] C. Zhang, R. Hamid, and Z. Zhang. Taylor expansion based classifier adaptation: Application to person detection. In CVPR, 2008.

[26] P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.

[27] J. J. Lim, R. Salakhutdinov, and A. Torralba. Transfer learning by borrowing examples for multiclass object detection. In NIPS, 2011.

[28] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition. In ICCV, 2011.

[29] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.

[30] B. Taskar, M.-F. Wong, and D. Koller. Learning on the test data: Leveraging ‘unseen’ features. In ICML, 2003.

[31] E. Krupka and N. Tishby. Generalization in clustering with unobserved features. In NIPS, 2005.

[32] C. M. Christoudias, R. Urtasun, M. Salzmann, and T. Darrell. Learning to recognize objects from unseen modalities. In ECCV, 2010.

[33] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.

[34] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[35] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 2008.