Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these spatio-temporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification, where they demonstrate state-of-the-art performance on the UCF50 and Olympics datasets.

1 We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. [sent-9, score-0.504]

2 These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. [sent-10, score-0.779]

3 What defines these spatiotemporal patches is their discriminative and representative properties. [sent-11, score-0.781]

4 We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. [sent-12, score-2.378]

5 Furthermore, these patches can be used as a discriminative vocabulary for action classification, where they demonstrate state-of-the-art performance on the UCF50 and Olympics datasets. [sent-13, score-1.257]

6 Currently, the most common answer to this question involves recognizing the particular event or action that occurs in the video. [sent-17, score-0.555]

7 But this level of description does not address issues such as the temporal extent of the action [27]. [sent-19, score-0.517]

8 This could be addressed by modeling videos in terms of their constituent semantic actions and objects [10, 30, 4]. [sent-24, score-0.515]

9 However, recent research in object and action recognition has shown that current computational models for identifying semantic entities are not robust enough to serve as a basis for video analysis [7]. [sent-30, score-0.671]

10 Following recent work on discriminative patch-based representation [2, 31], we represent videos in terms of discriminative spatio-temporal patches rather than global feature vectors or a set of semantic entities. [sent-32, score-1.228]

11 These spatio-temporal patches might correspond to a primitive human action, a semantic object, a human-object pair, or perhaps a random but informative spatio-temporal patch in the video. [sent-33, score-0.762]

12 They are determined by their discriminative properties and their ability to establish correspondences with videos from similar classes. [sent-34, score-0.64]

13 We automatically mine these discriminative patches from training data consisting of hundreds of videos. [sent-35, score-0.809]

14 Figure 1(d)(left) shows some of the mined discriminative patches for the “weightlifting” class. [sent-36, score-0.696]

15 We show how these mined patches can act as a discriminative vocabulary for action classification and demonstrate state-of-the-art performance on the Olympics Sports dataset [23] and the UCF-50 dataset. [sent-37, score-1.294]

16 But, more importantly, we demonstrate how these patches can be used to establish strong correspondence between spatio-temporal patches in training and test videos. [sent-38, score-1.253]

17 We can use this correspondence to align the videos and perform tasks such as object localization, finer-level action detection etc. [sent-39, score-0.931]

18 Figure 2 shows an example of how aligned videos (shown in Figure 1(d)(right)) are used to localize humans and objects, detect finer action categories and estimate human poses. [sent-43, score-0.908]

19 Given a query video (a), one can represent it using a global feature vector and use it for action classification (b). [sent-51, score-0.614]

20 Our approach discovers representative and discriminative spatio-temporal patches for a given action class (d-left). [sent-54, score-1.272]

21 These patches are then used to establish correspondence, followed by alignment (d-right). [sent-55, score-0.678]

22 Typically these representations are most appropriate for classification; they are not well-suited as action detectors or for establishing correspondence. [sent-58, score-0.555]

23 The third class of approaches is structural and decomposes videos into constituent parts. [sent-59, score-0.461]

24 While these approaches attempt to develop a rich representation and learn the structure of the videos in terms of constituent objects, one of their inherent drawbacks is that they are highly dependent on the success of object and action detection algorithms. [sent-61, score-0.924]

25 A more recent approach is based on using discriminative spatio-temporal patches rather than semantic entities [12, 26]. [sent-63, score-0.79]

26 For example, [26] uses manually selected spatio-temporal patches to create a dictionary of discriminative patches for each action class. [sent-64, score-1.643]

27 These patches are then correlated with test video patches and a new feature vector is created using pooling. [sent-65, score-1.053]
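
The correlate-then-pool step described for [26] can be illustrated with a small sketch. The excerpt only says "pooling", so the choice of max pooling here is an assumption, a plain dot product stands in for the correlation, and the names `correlate` and `pooled_feature` are hypothetical.

```python
def correlate(template, patch):
    """Dot product of two flattened patches (a stand-in for correlation)."""
    return sum(t * p for t, p in zip(template, patch))


def pooled_feature(dictionary, video_patches):
    """One feature per dictionary patch: its best response over the video.

    Max pooling is an assumption; the source only states that pooling is used.
    """
    return [max(correlate(t, p) for p in video_patches) for t in dictionary]
```

The resulting vector has one entry per dictionary patch, which is why the size and selection of the dictionary matter in the questions raised next.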

28 There are several issues here: 1) What are the criteria for selecting spatio-temporal patches to create the dictionary? [sent-66, score-0.507]

29 2) How many patches are needed to capture all the variations in the data? [sent-67, score-0.477]

30 Motivated by work in object recognition [7], recent approaches have attempted to decompose an action or event into a set of discriminative “parts” or spatio-temporal “patches” designed to capture the local spatio-temporal structure of the data [33, 23]. [sent-68, score-0.701]

31 However, these approaches still focus on the problem of classification and cannot establish strong correspondence or explain why a video is classified as a member of a certain class. [sent-69, score-0.328]

32 The key idea is that instead of using semantic parts/constituents, videos are represented in terms of discriminative spatio-temporal patches that can establish correspondences across videos. [sent-71, score-1.182]

33 Recent approaches have tried to circumvent the key point annotation problem by using manually-labeled discriminative regions [16] or objectness criteria [29] to create candidate discriminative regions. [sent-74, score-0.436]

34 We do not use any priors (such as objectness) to select discriminative patches; rather we let the data select the patches of appropriate scale and location. [sent-75, score-0.785]

35 [31] and extract “video poselets” from just action labels. [sent-77, score-0.476]

36 Strong alignment allows us to richly annotate test videos using a simple label transfer technique. [sent-82, score-0.53]

37 assume that there exists a consistent spatio-temporal patch across positive examples (positive instance in the bag); instead we want to extract multiple discriminative patches per action class depending on the style in which an action is performed. [sent-83, score-1.798]

38 Mining Discriminative Patches Given a set of training videos, we first find discriminative spatio-temporal patches which are representative of each action class. [sent-85, score-1.199]

39 These patches satisfy two conditions: 1) they occur frequently within a class; 2) they are distinct from patches in other classes. [sent-86, score-0.954]

40 The challenge is that the space of potential spatio-temporal patches is extremely large given that these patches can occur over a range of scales. [sent-87, score-0.954]

41 Moreover, the overwhelming majority of video patches are uninteresting, consisting of background clutter (track, grass, sky, etc.). [sent-88, score-0.541]

42 One approach would be to follow the bag-of-words paradigm: sample a few thousand patches, perform k-means clustering to find representative clusters and then rank these clusters based on membership in different action classes. [sent-89, score-0.72]

43 For example, Figure 3 shows a query patch (left) and similar patches retrieved using Euclidean distance (right). [sent-96, score-0.594]

44 Instead, we learn a discriminative distance metric to retrieve similar patches and, hence, representative clusters. [sent-98, score-0.723]

45 However, in many cases, assigning cluster memberships to rare background patches is hard. [sent-101, score-0.518]

46 Every spatio-temporal patch is considered as a possible cluster center and we determine whether or not a discriminative cluster for some action class can be formed around that patch. [sent-104, score-0.899]

47 The training partition is used to learn a discriminative distance metric and form clusters and the validation partition is used to rank the clusters based on representativeness. [sent-112, score-0.476]
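
A minimal sketch of the two-partition setup follows. The excerpt does not state how the videos are divided, so the per-class split, the 50/50 ratio, and the name `split_by_class` are all assumptions made for illustration.

```python
import random


def split_by_class(videos, val_fraction=0.5, seed=0):
    """Split (video_id, label) pairs into training and validation partitions.

    The split is done per class so both partitions see every action class.
    The 50/50 default is an assumption, not a value taken from the source.
    """
    rng = random.Random(seed)
    by_class = {}
    for video_id, label in videos:
        by_class.setdefault(label, []).append(video_id)

    train, val = [], []
    for label, ids in sorted(by_class.items()):
        ids = ids[:]          # copy so the caller's data is not shuffled
        rng.shuffle(ids)
        n_val = int(len(ids) * val_fraction)
        val += [(v, label) for v in ids[:n_val]]
        train += [(v, label) for v in ids[n_val:]]
    return train, val
```

Clusters would then be formed using the first list and ranked using the second.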

48 We sample a few hundred patches from each video in the training partition as candidates. [sent-113, score-0.654]

49 We bias the sampling to avoid background patches - patches with uniform or no motion should be rejected. [sent-114, score-0.987]
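
One way to picture the biased sampling is the sketch below. The excerpt does not say how motion is measured, so the mean absolute frame difference, the threshold `min_energy`, and both function names are assumptions. This simple measure only filters out patches with no motion; rejecting uniform motion (for example a camera pan) would need a richer descriptor.

```python
import random


def motion_energy(cuboid):
    """Mean absolute difference between consecutive frames of a cuboid.

    `cuboid` is a list of frames; each frame is a list of rows of gray values.
    """
    total, count = 0.0, 0
    for prev, curr in zip(cuboid, cuboid[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for p, c in zip(row_prev, row_curr):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0


def sample_candidates(cuboids, n_samples, min_energy=1.0, seed=0):
    """Randomly sample up to n_samples cuboids whose motion energy is high enough."""
    moving = [c for c in cuboids if motion_energy(c) >= min_energy]
    rng = random.Random(seed)
    return rng.sample(moving, min(n_samples, len(moving)))
```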

50 However, learning an e-SVM for all the sampled patches is still computationally infeasible (assuming 50 training videos per class and 200 sampled patches per video, we have approximately 10K candidate patches per class). [sent-115, score-1.483]

51 Based on this ranking, we select a few hundred patches per action class and use the e-SVM to learn patch-specific discriminative distance metrics. [sent-119, score-1.334]

52 These e-SVMs are then used to form clusters by retrieving similar patches from the training and validation partitions. [sent-120, score-0.626]

53 Ranking Our goal is to select a smaller dictionary (set of representative patches) from the candidate patches for each class. [sent-124, score-0.67]

54 (b) Purity: To represent the purity/discriminativeness of each cluster we use tf-idf scores: the ratio of how many patches it retrieves from videos of the same action class to the number of patches retrieved from videos of different classes. [sent-127, score-2.128]

55 All patches are ranked using a linear combination of the two scores. [sent-128, score-0.477]
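
The ranking can be sketched as below. The purity ratio follows the description above; the first score (a) is not part of this excerpt, so it appears only as a precomputed `consistency` value. The smoothing constant `eps`, the weight `alpha`, and the dictionary layout are assumptions.

```python
def purity(retrieved_labels, target_label, eps=1.0):
    """Ratio of same-class retrievals to other-class retrievals.

    `eps` avoids division by zero for perfectly pure clusters (an assumption).
    """
    same = sum(1 for label in retrieved_labels if label == target_label)
    other = len(retrieved_labels) - same
    return same / (other + eps)


def rank_patches(candidates, alpha=0.5):
    """Sort candidates by a linear combination of the two scores, best first.

    Each candidate is a dict with keys 'id', 'label', 'consistency' and
    'retrieved' (labels of the videos its detector fired on).
    """
    def score(c):
        return alpha * c["consistency"] + (1 - alpha) * purity(c["retrieved"], c["label"])

    return sorted(candidates, key=score, reverse=True)
```

Keeping only the first few entries of the returned list per class gives the smaller dictionary.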

56 Figure 4 shows a set of top-ranked discriminative spatio-temporal patches for different classes selected by this approach. [sent-129, score-0.659]

57 As the figure shows, our spatio-temporal patches are quite representative of various actions. [sent-130, score-0.541]

58 As expected, our discriminative patches are not always semantically meaningful. [sent-132, score-0.659]

59 To further demonstrate that our patches are quite representative and capture the essence of actions, we extracted spatio-temporal patches that exemplify “Gangnam” style dance. [sent-134, score-1.018]

60 We use 30 Gangnam dance-step YouTube videos as our positive set and 340 random videos as a negative set. [sent-135, score-0.715]

61 Figure 5 shows the top discriminative patches selected which indeed represent the dance steps associated with gangnam-dance. [sent-136, score-0.695]

62 Action Classification We first evaluate our discriminative patches for action classification. [sent-140, score-1.135]

63 Beyond Classification: Explanation via Discriminative Patches We now discuss how we can use detections of discriminative patches for establishing correspondences between training and test videos. [sent-147, score-0.992]

64 Once a strong correspondence is established and the videos are aligned, we can perform a variety of other tasks such as object localization, finer-level action detection, etc. [sent-148, score-0.889]

65 Our vocabulary consists of hundreds of discriminative patches; many of the corresponding e-SVMs fire on any given test video. [sent-150, score-0.384]

66 If our inferred action class is l, then our goal is to select the xi such that the cost function Jl is minimized. [sent-165, score-0.612]

67 Appearance: This term encourages selection of patches with high e-SVM scores. [sent-178, score-0.509]

68 For example, for the weightlifting class it prefers selection of the patches with man and bar with vertical motion. [sent-183, score-0.638]

69 We penalize if: 1) e-SVMs i and j do not fire frequently together in the training data; 2) the e-SVMs i and j are trained from different action classes. [sent-186, score-0.56]
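
The two penalty conditions can be sketched as a pairwise function. The exact form of the term is not recoverable from this excerpt, so the additive combination, the weights `w_cooc` and `w_class`, and the co-occurrence table (fraction of training videos in which both detectors fire) are assumptions.

```python
def pairwise_penalty(i, j, cooccurrence, patch_class, w_cooc=1.0, w_class=1.0):
    """Penalty for selecting e-SVMs i and j together.

    cooccurrence: dict mapping frozenset({i, j}) to a value in [0, 1];
                  missing pairs are treated as never firing together.
    patch_class:  dict mapping e-SVM index to the action class it was trained on.
    """
    # 1) penalize pairs that rarely fire together in the training data
    penalty = w_cooc * (1.0 - cooccurrence.get(frozenset((i, j)), 0.0))
    # 2) penalize pairs trained from different action classes
    if patch_class[i] != patch_class[j]:
        penalty += w_class
    return penalty
```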

70 The set of patches which maximizes this cost function is then used for label transfer and to infer finer details of the underlying action. [sent-204, score-0.676]

71 Experimental Evaluation We demonstrate the effectiveness of our representation for the task of action classification and establishing correspondence. [sent-206, score-0.628]

72 We will also show how correspondence between training and test videos can be used for label transfer and to construct detailed descriptions of videos. [sent-207, score-0.615]

73 Datasets: We use two benchmark action recognition datasets for experimental evaluation: UCF-50 and Olympics Sports Dataset [24]. [sent-208, score-0.476]

74 We use UCF-50 to qualitatively evaluate how discriminative patches can be used to establish correspondences and transfer labels from training to test videos. [sent-209, score-1.026]

75 We manually annotated the videos in 13 of these classes with annotations including the bounding boxes of objects and humans (manually annotating the whole dataset would have required too much human effort). [sent-210, score-0.412]

76 Quantitatively, we evaluate the performance of our approach on action classification on the UCF-50 and the complete Olympics dataset. [sent-215, score-0.519]

77 Implementation Details: Our current implementation considers only cuboid patches, and takes patches at scales ranging from 120x120x50 to the entire video. [sent-216, score-0.511]

78 At the initial step, we sample 200 spatio-temporal patches per video. [sent-219, score-0.505]

79 The nearest neighbor step selects 500 patches per class for which e-SVMs are learned. [sent-220, score-0.578]

80 We evaluate performance by counting the number of videos correctly classified out of the total number of videos in each class. [sent-230, score-0.584]
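
The per-class counting described above amounts to the following sketch; the function name and the list-based inputs are illustrative choices.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified videos within each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}
```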

81 Table 1 shows performance of our algorithm compared to the action bank approach [26] on the 13 class subset (run with same test-train set as our approach) and a bag-of-words approach as a baseline. [sent-231, score-0.603]

82 Table 4 shows the performance variation with the number of e-SVM patches trained per class. [sent-233, score-0.505]

83 Finally, we evaluate action classification for all 50 classes in UCF (group-wise) and get an improvement of 3. [sent-234, score-0.519]

84 Correspondence and Label Transfer We now demonstrate how our discriminative patches can be used to establish correspondence and align the videos. [sent-240, score-0.894]

85 It can be seen that our spatio-temporal patches are insensitive to background changes and establish strong alignment. [sent-242, score-0.616]

86 We also use the aligned videos to generate annotations of test videos by a simple label-transfer technique. [sent-243, score-0.688]

87 We manually labeled 50 discriminative patches per class with extra annotations such as objects of interaction (e. [sent-244, score-0.806]

88 videos we transfer these annotations to the new test videos. [sent-247, score-0.496]
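
A simple form of this transfer can be sketched for bounding-box annotations. The sketch assumes each annotation is stored relative to the annotated patch (as fractions of its width and height) and that a detection gives the matched patch's box in a test frame; the temporal dimension is ignored and the name `transfer_box` is hypothetical.

```python
def transfer_box(rel_box, detection):
    """Map a patch-relative annotation box into test-frame coordinates.

    rel_box:   (rx, ry, rw, rh) as fractions of the annotated patch size
    detection: (x, y, w, h) of the matched patch in the test frame
    """
    rx, ry, rw, rh = rel_box
    x, y, w, h = detection
    return (x + rx * w, y + ry * h, rw * w, rh * h)
```

For instance, an object annotated on the right half of a patch lands on the right half of wherever that patch is detected in the test video.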

89 But again, using our discriminative spatio-temporal patches we just align the videos using motion and appearance and then transfer the poses from the training videos to test videos. [sent-253, score-1.559]

90 We automatically mine these patches from hundreds of training videos using an exemplar-based clustering approach. [sent-267, score-0.949]

91 We have also shown how these patches can be used to obtain strong correspondence and align the videos for transferring annotations. [sent-268, score-0.943]

92 Furthermore, these patches can be used as a vocabulary to achieve state-of-the-art results for action classification. [sent-269, score-1.032]

93 Learning person-object interactions for action recognition in still images. [sent-297, score-0.476]

94 Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. [sent-376, score-0.699]

95 Discriminative figure-centric models for joint action localization and recognition. [sent-382, score-0.476]

96 Unsupervised learning of human action categories using spatial-temporal words. [sent-434, score-0.517]

97 A 3-dimensional SIFT descriptor and its application to action recognition. [sent-457, score-0.476]

98 Similarity constrained latent support vector machine: An application to weakly supervised action classification. [sent-465, score-0.476]

99 Action recognition by learning bases of action attributes and parts. [sent-555, score-0.476]

100 Hidden part models for human action recognition: probabilistic versus max margin. [sent-568, score-0.517]

