iccv iccv2013 iccv2013-274 iccv2013-274-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu
Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address the challenges of this domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, and also enables multitarget tracking. Standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors and tracking their detections in long video footage. We address this problem by formulating cost-sensitive inference of the ST-AOG as Monte Carlo Tree Search (MCTS). Given a query about an activity in the video, MCTS optimally schedules which detectors and trackers to run, and where to apply them in the space-time volume. Evaluation on benchmark datasets demonstrates that MCTS enables speed-ups of two orders of magnitude without compromising accuracy relative to standard cost-insensitive inference.
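As background for the scheduling formulation described in the abstract, the sketch below shows the generic UCT-style Monte Carlo Tree Search loop (selection, expansion, rollout, backpropagation) that bandit-based planning of this kind builds on [5, 11]. It is an illustrative toy in Python, not the authors' ST-AOG inference: the legal_actions, apply_action, and simulate_reward hooks, the state representation, and the exploration constant C are hypothetical placeholders standing in for the paper's detectors, trackers, and accuracy-versus-cost reward.

# Minimal UCT-style MCTS sketch for scheduling inference actions
# (which detector/tracker to run next, and where). Illustrative only;
# all hooks below are hypothetical placeholders.
import math
import random

C = 1.4  # UCB1 exploration constant (assumed value)

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state      # what has been detected/tracked so far
        self.parent = parent
        self.action = action    # the (detector, window) choice that led here
        self.children = []
        self.visits = 0
        self.value = 0.0        # running sum of rollout rewards

    def ucb1(self, child):
        # Balance exploiting schedules that paid off against exploring
        # untried detector/tracker placements.
        return (child.value / child.visits +
                C * math.sqrt(math.log(self.visits) / child.visits))

def mcts(root_state, legal_actions, apply_action, simulate_reward, n_iters=200):
    """Return the best first action found by n_iters of MCTS.

    legal_actions(state)  -> list of schedulable actions (hypothetical hook)
    apply_action(s, a)    -> successor state after running action a
    simulate_reward(s)    -> rollout estimate of parse quality minus cost
    """
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend via UCB1 while the node is fully expanded.
        while node.children and len(node.children) == len(legal_actions(node.state)):
            node = max(node.children, key=node.ucb1)
        # 2. Expansion: try one untried action, if any remain.
        tried = {c.action for c in node.children}
        untried = [a for a in legal_actions(node.state) if a not in tried]
        if untried:
            a = random.choice(untried)
            node = Node(apply_action(node.state, a), parent=node, action=a)
            node.parent.children.append(node)
        # 3. Simulation: cheap rollout estimating accuracy gain vs. run-time cost.
        reward = simulate_reward(node.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda c: c.visits).action

In the paper's setting, an action would correspond to running a particular detector or tracker on a particular space-time window, and the rollout reward would trade off the expected gain in parse accuracy against the computational cost of the scheduled operations; the UCB1 rule [11] is what balances exploiting placements that have paid off so far against exploring untried ones.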
[1] J. Aggarwal and M. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43:16:1–16:43, 2011.
[2] M. Amer and S. Todorovic. A Chains model for localizing group activities in videos. In ICCV, 2011.
[3] M. Amer and S. Todorovic. Sum-product networks for modeling activities with stochastic structure. In CVPR, 2012.
[4] M. Amer, D. Xie, M. Zhao, S. Todorovic, and S.-C. Zhu. Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In ECCV, 2012.
[5] C. Browne, E. J. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intellig. and AI in Games, 4(1):1–43, 2012.
[6] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, 2012.
[7] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatiotemporal relationship among people. In ICCV, 2009.
[8] A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR, 2009.
Table 1. Average precision and false positive rates on the UCLA Courtyard Dataset for group activities. The larger the allowed number of actions, the better the precision. Results are shown in %, and time is in seconds.
Table 2. Average precision and false positive rates on the UCLA Courtyard Dataset for individual actions. The larger the allowed number of actions, the better the precision. Results are shown in %, and time is in seconds.
Table 3. Average classification accuracy and running times on the Collective Activity Dataset [7]. We use B = ∞. Results are shown in %, and time is in seconds.
Table 4. Average classification accuracy and running times on the New Collective Activity Dataset [6]. We use B = ∞. Results are shown in %, and time is in seconds.
Table 5. Tracking results on [7] and [6]. All results are shown in %.
[9] S. Khamis, V. I. Morariu, and L. S. Davis. Combining per-frame and per-track cues for multi-person action recognition. In ECCV, 2012.
[10] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
[11] L. Kocsis and C. Szepesvari. Bandit based Monte Carlo planning. In ECML, 2006.
[12] T. Lan, Y. Wang, W. Yang, S. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. TPAMI, 34(8):1549–1562, 2012.
[13] P. Matikainen, M. Hebert, and R. Sukthankar. Representing pairwise spatial and temporal relations for action recognition. In ECCV, 2010.
[14] M. Pei, Y. Jia, and S.-C. Zhu. Parsing video events with goal inference and intent prediction. In ICCV, 2011.
[15] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
[16] Z. Si, M. Pei, B. Yao, and S.-C. Zhu. Unsupervised learning of event AND-OR grammar and semantics from video. In ICCV, 2011.
[17] T. Wu and S.-C. Zhu. A numerical study of the bottom-up and top-down inference processes in AND-OR graphs. IJCV, 93:226–252, June 2011.
[18] S.-C. Zhu and D. Mumford. A stochastic grammar of images. Found. Trends Comput. Graph. Vis., 2(4):259–362, 2006.