cvpr cvpr2013 cvpr2013-355 knowledge-graph by maker-knowledge-mining

355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches


Source: pdf

Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate state-of-the-art performance on UCF50 and Olympics datasets.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. [sent-9, score-0.504]

2 These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. [sent-10, score-0.779]

3 What defines these spatiotemporal patches is their discriminative and representative properties. [sent-11, score-0.781]

4 We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. [sent-12, score-2.378]

5 Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate state-of-the-art performance on UCF50 and Olympics datasets. [sent-13, score-1.257]

6 Currently, the most common answer to this question involves recognizing the particular event or action that occurs in the video. [sent-17, score-0.555]

7 But this level of description does not address issues such as the temporal extent of the action [27]. [sent-19, score-0.517]

8 This could be addressed by modeling videos in terms of their constituent semantic actions and objects [10, 30, 4]. [sent-24, score-0.515]

9 However, recent research in object and action recognition has shown that current computational models for identifying semantic entities are not robust enough to serve as a basis for video analysis [7]. [sent-30, score-0.671]

10 Following recent work on discriminative patch-based representation [2, 31], we represent videos in terms of discriminative spatio-temporal patches rather than global feature vectors or a set of semantic entities. [sent-32, score-1.228]

11 These spatio- temporal patches might correspond to a primitive human action, a semantic object, human-object pair or perhaps a random but informative spatio-temporal patch in the video. [sent-33, score-0.762]

12 They are determined by their discriminative properties and their ability to establish correspondences with videos from similar classes. [sent-34, score-0.64]

13 We automatically mine these discriminative patches from training data consisting of hundreds of videos. [sent-35, score-0.809]

14 Figure 1(d)(left) shows some of the mined discriminative patches for the “weightlifting” class. [sent-36, score-0.696]

15 We show how these mined patches can act as a discriminative vocabulary for action classification and demonstrate state-of-the-art performance on the Olympics Sports dataset [23] and the UCF-50 dataset. [sent-37, score-1.294]

16 But, more importantly, we demonstrate how these patches can be used to establish strong correspondence between spatio-temporal patches in training and test videos. [sent-38, score-1.253]

17 We can use this correspondence to align the videos and perform tasks such as object localization, finer-level action detection, etc. [sent-39, score-0.931]

18 Figure 2 shows an example of how aligned videos (shown in Figure 1(d)(right)) are used to localize humans and objects, detect finer action categories and estimate human poses. [sent-43, score-0.908]

19 Given a query video (a), one can represent it using a global feature vector and use it for action classification (b). [sent-51, score-0.614]

20 Our approach discovers representative and discriminative spatio-temporal patches for a given action class (d-left). [sent-54, score-1.272]

21 These patches are then used to establish correspondence, followed by alignment (d-right). [sent-55, score-0.678]

22 Typically these representations are most appropriate for classification; they are not well-suited as action detectors or for establishing correspondence. [sent-58, score-0.555]

23 The third class of approaches is structural and decomposes videos into constituent parts. [sent-59, score-0.461]

24 While these approaches attempt to develop a rich representation and learn the structure of the videos in terms of constituent objects, one of their inherent drawbacks is that they are highly dependent on the success of object and action detection algorithms. [sent-61, score-0.924]

25 A more recent approach is based on using discriminative spatio-temporal patches rather than semantic entities [12, 26]. [sent-63, score-0.79]

26 For example, [26] uses manually selected spatio-temporal patches to create a dictionary of discriminative patches for each action class. [sent-64, score-1.643]

27 These patches are then correlated with test video patches and a new feature vector is created using pooling. [sent-65, score-1.053]
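
As a concrete illustration of this pooling step, here is a minimal sketch (hypothetical function and variable names, not code from the paper) that max-pools the responses of a bank of linear patch detectors over one video to produce a fixed-length descriptor, which could then be fed to a standard classifier such as a linear SVM.

```python
import numpy as np

def pooled_response_feature(patch_descriptors, detectors):
    """Build a fixed-length video descriptor by max-pooling detector responses.

    patch_descriptors: (n_patches, d) spatio-temporal patch features sampled
                       from one video (e.g., HOG/HOF-style descriptors).
    detectors:         (n_detectors, d) linear detector weights, one row per
                       dictionary entry (biases omitted for brevity).
    Returns a (n_detectors,) vector: the strongest response of each detector
    anywhere in the video.
    """
    responses = patch_descriptors @ detectors.T   # (n_patches, n_detectors)
    return responses.max(axis=0)                  # max-pool over the whole video
```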

28 There are several issues here: 1) What is the criterion for selecting spatio-temporal patches to create the dictionary? [sent-66, score-0.507]

29 2) How many patches are needed to capture all the variations in the data? [sent-67, score-0.477]

30 Motivated by work in object recognition [7], recent approaches have attempted to decompose an action or event into a set of discriminative “parts” or spatio-temporal “patches” designed to capture the local spatio-temporal structure of the data [33, 23]. [sent-68, score-0.701]

31 However, these approaches still focus on the problem of classification and cannot establish strong correspondence or explain why a video is classified as a member of certain class. [sent-69, score-0.328]

32 The key idea is that instead of using semantic parts/constituents, videos are represented in terms of discriminative spatio-temporal patches that can establish correspondences across videos. [sent-71, score-1.182]

33 Recent approaches have tried to circumvent the key point annotation problem by using manually-labeled discriminative regions [16] or objectness criteria [29] to create candidate discriminative regions. [sent-74, score-0.436]

34 We do not use any priors (such as objectness) to select discriminative patches; rather we let the data select the patches of appropriate scale and location. [sent-75, score-0.785]

35 [31] and extract “video poselets” from just action labels. [sent-77, score-0.476]

36 Strong alignment allows us to richly annotate test videos using a simple label transfer technique. [sent-82, score-0.53]

37 We do not assume that there exists a consistent spatio-temporal patch across positive examples (a positive instance in the bag); instead we want to extract multiple discriminative patches per action class depending on the style in which an action is performed. [sent-83, score-1.798]

38 Mining Discriminative Patches. Given a set of training videos, we first find discriminative spatio-temporal patches which are representative of each action class. [sent-85, score-1.199]

39 These patches satisfy two conditions: 1) they occur frequently within a class; 2) they are distinct from patches in other classes. [sent-86, score-0.954]

40 The challenge is that the space of potential spatio-temporal patches is extremely large given that these patches can occur over a range of scales. [sent-87, score-0.954]

41 And the overwhelming majority of video patches are uninteresting, consisting of background clutter (track, grass, sky, etc.). [sent-88, score-0.541]

42 One approach would be to follow the bag-of-words paradigm: sample a few thousand patches, perform k-means clustering to find representative clusters, and then rank these clusters based on membership in different action classes. [sent-89, score-0.72]
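
To make this baseline concrete, here is a minimal sketch of the bag-of-words alternative described above; the use of scikit-learn's KMeans and the specific class-membership ranking score are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def rank_clusters_by_class(descriptors, labels, n_clusters=1000, target_class=0):
    """Baseline: cluster sampled patch descriptors with k-means, then rank each
    cluster by how strongly its members come from the target action class.

    descriptors: (n_patches, d) patch features pooled from all training videos
    labels:      (n_patches,) action-class label of the video each patch came from
    Returns cluster indices sorted from most to least class-specific.
    """
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(descriptors)
    scores = np.zeros(n_clusters)
    for c in range(n_clusters):
        members = labels[km.labels_ == c]
        if members.size:
            scores[c] = np.mean(members == target_class)  # fraction from target class
    return np.argsort(-scores)
```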

43 For example, Figure 3 shows a query patch (left) and similar patches retrieved using Euclidean distance (right). [sent-96, score-0.594]

44 Instead, we learn a discriminative distance metric to retrieve similar patches and, hence, representative clusters. [sent-98, score-0.723]

45 However, in many cases, assigning cluster memberships to rare background patches is hard. [sent-101, score-0.518]

46 Every spatio-temporal patch is considered as a possible cluster center and we determine whether or not a discriminative cluster for some action class can be formed around that patch. [sent-104, score-0.899]

47 The training partition is used to learn a discriminative distance metric and form clusters, and the validation partition is used to rank the clusters based on representativeness. [sent-112, score-0.476]

48 We sample a few hundred patches from each video in the training partition as candidates. [sent-113, score-0.654]

49 We bias the sampling to avoid background patches: patches with uniform or no motion should be rejected. [sent-114, score-0.987]
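
A minimal sketch of such motion-biased rejection sampling is given below; the optical-flow-magnitude input, patch size, and motion threshold are assumptions made for illustration only.

```python
import numpy as np

def sample_candidate_patches(flow_mag, n_samples=200, size=(50, 120, 120),
                             min_motion=1.0, rng=None):
    """Randomly sample cuboid patches, rejecting those with little motion.

    flow_mag: (T, H, W) per-pixel optical-flow magnitude for one video
    size:     patch extent as (frames, height, width)
    Returns a list of (t, y, x) corners of accepted patches.
    """
    rng = rng or np.random.default_rng(0)
    T, H, W = flow_mag.shape
    dt, dy, dx = size
    accepted, tries = [], 0
    while len(accepted) < n_samples and tries < 100 * n_samples:
        tries += 1
        t = rng.integers(0, max(T - dt, 1))
        y = rng.integers(0, max(H - dy, 1))
        x = rng.integers(0, max(W - dx, 1))
        cuboid = flow_mag[t:t + dt, y:y + dy, x:x + dx]
        if cuboid.mean() >= min_motion:   # reject near-static background patches
            accepted.append((t, y, x))
    return accepted
```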

50 However, learning an e-SVM for all the sampled patches is still computationally infeasible (assuming 50 training videos per class and 200 sampled patches per video, we have approximately 10K candidate patches per class). [sent-115, score-1.483]

51 Based on this ranking, we select a few hundred patches per action class and use the e-SVM to learn patch-specific discriminative distance metrics. [sent-119, score-1.334]

52 These e-SVMs are then used to form clusters by retrieving similar patches from the training and validation partitions. [sent-120, score-0.626]
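
The following sketch illustrates the exemplar-SVM idea used here: a linear SVM is trained with a single positive (the candidate patch) against many negatives from other classes, and its scores retrieve a cluster of similar patches. The heavy positive class weighting, retrieval size, and use of scikit-learn's LinearSVC are assumptions for illustration, not the paper's exact training recipe.

```python
import numpy as np
from sklearn.svm import LinearSVC

def exemplar_svm_cluster(exemplar, negatives, pool, top_k=10, C=1.0):
    """Train an exemplar-SVM for one candidate patch and retrieve a cluster.

    exemplar:  (d,) descriptor of the candidate patch (the single positive)
    negatives: (n_neg, d) descriptors sampled from videos of other classes
    pool:      (n_pool, d) patches from the training/validation partitions
               from which the cluster is retrieved
    Returns (detector weights, indices of the top_k highest-scoring pool patches).
    """
    X = np.vstack([exemplar[None, :], negatives])
    y = np.r_[1, np.zeros(len(negatives), dtype=int)]
    # Up-weight the single positive so it is not swamped by the negatives.
    svm = LinearSVC(C=C, class_weight={1: len(negatives), 0: 1}).fit(X, y)
    scores = pool @ svm.coef_.ravel() + svm.intercept_[0]
    return svm.coef_.ravel(), np.argsort(-scores)[:top_k]
```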

53 Ranking. Our goal is to select a smaller dictionary (a set of representative patches) from the candidate patches for each class. [sent-124, score-0.67]

54 (b) Purity: To represent the purity/discriminativeness of each cluster we use tf-idf scores: the ratio of how many patches it retrieves from videos of the same action class to the number of patches retrieved from videos of different classes. [sent-127, score-2.128]

55 All patches are ranked using a linear combination of the two scores. [sent-128, score-0.477]
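
A plausible form of this ranking, combining a frequency/coverage score with the tf-idf-like purity score described above, is sketched below; the exact definitions of the two scores and the trade-off weight alpha are assumptions, not values given in the extracted sentences.

```python
import numpy as np

def rank_candidate_patches(retrievals, n_class_videos, target_class, alpha=0.5):
    """Rank candidate patches by a linear combination of frequency and purity.

    retrievals:     list where entry i holds (video_id, class_label) pairs
                    retrieved by candidate i's e-SVM on the validation partition
    n_class_videos: number of validation videos of the target action class
    Returns candidate indices sorted from best to worst.
    """
    scores = []
    for hits in retrievals:
        same_videos = {v for v, c in hits if c == target_class}
        same = sum(c == target_class for _, c in hits)
        diff = sum(c != target_class for _, c in hits)
        frequency = len(same_videos) / max(n_class_videos, 1)  # fires across many class videos
        purity = same / max(same + diff, 1)                    # tf-idf-like discriminativeness
        scores.append(alpha * frequency + (1 - alpha) * purity)
    return np.argsort(-np.array(scores))
```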

56 Figure 4 shows a set of top-ranked discriminative spatio-temporal patches for different classes selected by this approach. [sent-129, score-0.659]

57 As the figure shows, our spatio-temporal patches are quite representative of various actions. [sent-130, score-0.541]

58 As expected, our discriminative patches are not always semantically meaningful. [sent-132, score-0.659]

59 To further demonstrate that our patches are quite representative and capture the essence of actions, we extracted spatio-temporal patches that exemplify “Gangnam” style dance. [sent-134, score-1.018]

60 We use 30 Gangnam dance step YouTube videos as our positive set and 340 random videos as a negative set. [sent-135, score-0.715]

61 Figure 5 shows the top discriminative patches selected, which indeed represent the dance steps associated with the Gangnam dance. [sent-136, score-0.695]

62 Action Classification. We first evaluate our discriminative patches for action classification. [sent-140, score-1.135]

63 Beyond Classification: Explanation via Discriminative Patches. We now discuss how we can use detections of discriminative patches for establishing correspondences between training and test videos. [sent-147, score-0.992]

64 Once a strong correspondence is established and the videos are aligned, we can perform a variety of other tasks such as object localization, finer-level action detection, etc. [sent-148, score-0.889]

65 Our vocabulary consists of hundreds of discriminative patches; many of the corresponding e-SVMs fire on any given test video. [sent-150, score-0.384]

66 If our inferred action class is l, then our goal is to select the xi such that the cost function Jl is minimized. [sent-165, score-0.612]

67 This appearance term encourages selection of patches with high e-SVM scores. [sent-178, score-0.509]

68 For example, for the weightlifting class it prefers selection of patches containing the man and bar with vertical motion. [sent-183, score-0.638]

69 We penalize if: 1) e-SVMs i and j do not fire frequently together in the training data; 2) e-SVMs i and j are trained from different action classes. [sent-186, score-0.56]

70 The set of patches which maximizes this cost function is then used for label transfer and to infer finer details of the underlying action. [sent-204, score-0.676]
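
The extracted sentences describe only the ingredients of this objective, not its exact form. A plausible form consistent with them is written below: binary variables select among firing detections, a unary term rewards high e-SVM scores, and a pairwise term accounts for how often two e-SVMs co-fire in training and whether they come from the same action class. The notation is assumed, not taken verbatim from the paper; terms such as "integer" and "ipfp" in the paper's vocabulary list suggest the resulting quadratic binary problem is solved with an IPFP-style relaxation.

```latex
% Hedged sketch of the selection objective; notation (s_i, P_{ij}, \lambda) is assumed.
% x_i \in \{0,1\}: keep detection i;  s_i: its e-SVM score;
% P_{ij}: co-occurrence / class-consistency term between e-SVMs i and j.
J_l(\mathbf{x}) \;=\; \sum_i s_i\, x_i \;+\; \lambda \sum_{i \neq j} P_{ij}\, x_i x_j,
\qquad \mathbf{x} \in \{0,1\}^N .
```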

71 Experimental Evaluation. We demonstrate the effectiveness of our representation for the task of action classification and establishing correspondence. [sent-206, score-0.628]

72 We will also show how correspondence between training and test videos can be used for label transfer and to construct detailed descriptions of videos. [sent-207, score-0.615]

73 Datasets: We use two benchmark action recognition datasets for experimental evaluation: UCF-50 and Olympics Sports Dataset [24]. [sent-208, score-0.476]

74 We use UCF-50 to qualitatively evaluate how discriminative patches can be used to establish correspondences and transfer labels from training to test videos. [sent-209, score-1.026]

75 We manually annotated the videos in 13 of these classes with annotations including the bounding boxes of objects and humans (manually annotating the whole dataset would have required too much human effort). [sent-210, score-0.412]

76 Quantitatively, we evaluate the performance of our approach on action classification on the UCF-50 and the complete Olympics dataset. [sent-215, score-0.519]

77 Implementation Details: Our current implementation considers only cuboid patches, and takes patches at scales ranging from 120x120x50 to the entire video. [sent-216, score-0.511]

78 At the initial step, we sample 200 spatio-temporal patches per video. [sent-219, score-0.505]

79 The nearest neighbor step selects 500 patches per class for which e-SVMs are learned. [sent-220, score-0.578]

80 We evaluate performance by counting the number of videos correctly classified out of the total number of videos in each class. [sent-230, score-0.584]

81 Table 1 shows performance of our algorithm compared to the action bank approach [26] on the 13-class subset (run with the same test-train set as our approach) and a bag-of-words approach as a baseline. [sent-231, score-0.603]

82 Table 4 shows the performance variation with the number of e-SVM patches trained per class. [sent-233, score-0.505]

83 Finally, we evaluate action classification for all 50 classes in UCF (group-wise) and get an improvement of 3. [sent-234, score-0.519]

84 Correspondence and Label Transfer We now demonstrate how our discriminative patches can be used to establish correspondence and align the videos. [sent-240, score-0.894]

85 It can be seen that our spatio-temporal patches are insensitive to background changes and establish strong alignment. [sent-242, score-0.616]

86 We also use the aligned videos to generate annotations of test videos by a simple label-transfer technique. [sent-243, score-0.688]

87 We manually labeled 50 discriminative patches per class with extra annotations such as objects of interaction. [sent-244, score-0.806]

88 Using the aligned videos, we transfer these annotations to the new test videos. [sent-247, score-0.496]

89 But again, using our discriminative spatio-temporal patches, we simply align the videos using motion and appearance and then transfer the poses from the training videos to the test videos. [sent-253, score-1.559]
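
As an illustration of this simple label-transfer step, the sketch below maps a bounding-box annotation attached to a training patch into a matched test-video patch via a shift plus scaling of the patch extents; the cuboid and box parameterization is assumed for illustration and is not the paper's exact procedure.

```python
def transfer_box(train_patch, test_patch, train_box):
    """Transfer a bounding-box annotation from a training patch to the matched
    patch detected in a test video (a simple label-transfer sketch).

    Patches are (t, y, x, dt, dy, dx) cuboids; train_box is (y0, x0, y1, x1)
    in training-video coordinates. Returns the corresponding box in test-video
    coordinates, assuming the annotation moves with the patch and scales with
    its spatial extent.
    """
    _, ty, tx, _, tdy, tdx = train_patch
    _, sy, sx, _, sdy, sdx = test_patch
    scale_y, scale_x = sdy / tdy, sdx / tdx
    y0, x0, y1, x1 = train_box
    return (sy + (y0 - ty) * scale_y,
            sx + (x0 - tx) * scale_x,
            sy + (y1 - ty) * scale_y,
            sx + (x1 - tx) * scale_x)
```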

90 We automatically mine these patches from hundreds of training videos using an exemplar-based clustering approach. [sent-267, score-0.949]

91 We have also shown how these patches can be used to obtain strong correspondence and align the videos for transferring annotations. [sent-268, score-0.943]

92 Furthermore, these patches can be used as a vocabulary to achieve state-of-the-art results for action classification. [sent-269, score-1.032]

93 Learning person-object interactions for action recognition in still images. [sent-297, score-0.476]

94 Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. [sent-376, score-0.699]

95 Discriminative figure-centric models for joint action localization and recognition. [sent-382, score-0.476]

96 Unsupervised learning of human action categories using spatial-temporal words. [sent-434, score-0.517]

97 A 3-dimensional sift descriptor and its application to action recognition. [sent-457, score-0.476]

98 Similarity constrained latent support vector machine: An application to weakly supervised action classification. [sent-465, score-0.476]

99 Action recognition by learning bases of action attributes and parts. [sent-555, score-0.476]

100 Hidden part models for human action recognition:probabilistic versus max margin. [sent-568, score-0.517]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('patches', 0.477), ('action', 0.476), ('videos', 0.292), ('olympics', 0.191), ('discriminative', 0.182), ('ipfp', 0.131), ('transfer', 0.123), ('detections', 0.11), ('establish', 0.1), ('gangnam', 0.095), ('actions', 0.091), ('patch', 0.086), ('correspondence', 0.082), ('establishing', 0.079), ('vocabulary', 0.079), ('clusters', 0.075), ('gupta', 0.074), ('class', 0.073), ('constituent', 0.067), ('entities', 0.066), ('correspondences', 0.066), ('semantic', 0.065), ('representative', 0.064), ('video', 0.064), ('cli', 0.064), ('lary', 0.064), ('select', 0.063), ('mine', 0.06), ('spatiotemporal', 0.058), ('poselets', 0.058), ('club', 0.056), ('groupwise', 0.056), ('weightlifting', 0.056), ('bank', 0.054), ('align', 0.053), ('primitive', 0.052), ('svm', 0.049), ('hundreds', 0.047), ('storyline', 0.047), ('golf', 0.047), ('annotations', 0.046), ('throw', 0.045), ('event', 0.043), ('classification', 0.043), ('training', 0.043), ('purity', 0.042), ('human', 0.041), ('temporal', 0.041), ('umd', 0.041), ('fire', 0.041), ('cluster', 0.041), ('alignment', 0.04), ('singh', 0.04), ('label', 0.04), ('strong', 0.039), ('ucf', 0.039), ('iarpa', 0.039), ('mined', 0.037), ('objectness', 0.037), ('finer', 0.036), ('dance', 0.036), ('recognizing', 0.036), ('pij', 0.036), ('niebles', 0.036), ('candidate', 0.035), ('test', 0.035), ('jl', 0.035), ('hundred', 0.035), ('partitioning', 0.035), ('partition', 0.035), ('cuboid', 0.034), ('bag', 0.034), ('lan', 0.034), ('humans', 0.033), ('motion', 0.033), ('euclidean', 0.033), ('integer', 0.033), ('selection', 0.032), ('dictionary', 0.031), ('drawbacks', 0.031), ('query', 0.031), ('validation', 0.031), ('ranking', 0.03), ('aligned', 0.03), ('infeasible', 0.03), ('clustering', 0.03), ('representation', 0.03), ('selecting', 0.03), ('paris', 0.03), ('poses', 0.029), ('extracts', 0.029), ('activity', 0.029), ('penalty', 0.029), ('score', 0.029), ('detection', 0.028), ('per', 0.028), ('nect', 0.028), ('oftest', 0.028), ('humanobject', 0.028), ('minimiz', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate state-of-the-art performance on UCF50 and Olympics datasets.

2 0.35546607 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

Author: Feng Shi, Emil Petriu, Robert Laganière

Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.

3 0.35271001 287 cvpr-2013-Modeling Actions through State Changes

Author: Alireza Fathi, James M. Rehg

Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.

4 0.34457445 40 cvpr-2013-An Approach to Pose-Based Action Recognition

Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille

Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations ofbodyparts in one frame (by spatial-part-sets) as well as the body part movements(by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.

5 0.32550254 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest

Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla

Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE)from a novelperspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therfore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.

6 0.29767624 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

7 0.2673482 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition

8 0.25304088 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)

9 0.25101283 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

10 0.24297768 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

11 0.24215207 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition

12 0.23761067 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images

13 0.23739956 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes

14 0.20041288 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

15 0.1973924 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition

16 0.18996705 393 cvpr-2013-Separating Signal from Noise Using Patch Recurrence across Scales

17 0.17577308 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

18 0.16642165 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

19 0.16195424 388 cvpr-2013-Semi-supervised Learning of Feature Hierarchies for Object Detection in a Video

20 0.16039085 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.317), (1, -0.187), (2, -0.034), (3, -0.223), (4, -0.348), (5, 0.007), (6, -0.125), (7, 0.077), (8, -0.101), (9, -0.118), (10, 0.005), (11, -0.059), (12, 0.01), (13, 0.025), (14, -0.074), (15, -0.086), (16, 0.032), (17, -0.05), (18, 0.118), (19, 0.097), (20, 0.075), (21, 0.012), (22, 0.034), (23, -0.029), (24, 0.004), (25, -0.088), (26, -0.038), (27, -0.089), (28, 0.016), (29, -0.078), (30, -0.06), (31, -0.12), (32, 0.051), (33, 0.056), (34, 0.026), (35, 0.035), (36, -0.047), (37, -0.058), (38, -0.04), (39, -0.059), (40, 0.061), (41, -0.016), (42, -0.013), (43, -0.052), (44, 0.05), (45, 0.072), (46, -0.037), (47, -0.001), (48, 0.048), (49, -0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96967179 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.

2 0.84585541 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition

Author: Michalis Raptis, Leonid Sigal

Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and minefor hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.

3 0.8383916 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah

Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.

4 0.83498806 287 cvpr-2013-Modeling Actions through State Changes

Author: Alireza Fathi, James M. Rehg

Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.

5 0.82195097 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

Author: Feng Shi, Emil Petriu, Robert Laganière

Abstract: Local spatio-temporal features and bag-of-features representations have become popular for action recognition. A recent trend is to use dense sampling for better performance. While many methods claimed to use dense feature sets, most of them are just denser than approaches based on sparse interest point detectors. In this paper, we explore sampling with high density on action recognition. We also investigate the impact of random sampling over dense grid for computational efficiency. We present a real-time action recognition system which integrates fast random sampling method with local spatio-temporal features extracted from a Local Part Model. A new method based on histogram intersection kernel is proposed to combine multiple channels of different descriptors. Our technique shows high accuracy on the simple KTH dataset, and achieves state-of-the-art on two very challenging real-world datasets, namely, 93% on KTH, 83.3% on UCF50 and 47.6% on HMDB51.

6 0.78461874 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition

7 0.74294323 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)

8 0.74048996 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition

9 0.69745511 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

10 0.68900687 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition

11 0.67851442 40 cvpr-2013-An Approach to Pose-Based Action Recognition

12 0.65653747 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

13 0.63000262 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition

14 0.61543429 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest

15 0.60375577 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images

16 0.5945183 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes

17 0.58970261 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization

18 0.58037406 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

19 0.50958133 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

20 0.50355542 393 cvpr-2013-Separating Signal from Noise Using Patch Recurrence across Scales


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.128), (16, 0.032), (26, 0.061), (28, 0.022), (33, 0.321), (39, 0.011), (67, 0.082), (69, 0.066), (80, 0.026), (87, 0.072), (92, 0.105)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97654802 115 cvpr-2013-Depth Super Resolution by Rigid Body Self-Similarity in 3D

Author: unknown-author

Abstract: We tackle the problem of jointly increasing the spatial resolution and apparent measurement accuracy of an input low-resolution, noisy, and perhaps heavily quantized depth map. In stark contrast to earlier work, we make no use of ancillary data like a color image at the target resolution, multiple aligned depth maps, or a database of highresolution depth exemplars. Instead, we proceed by identifying and merging patch correspondences within the input depth map itself, exploiting patchwise scene self-similarity across depth such as repetition of geometric primitives or object symmetry. While the notion of ‘single-image ’ super resolution has successfully been applied in the context of color and intensity images, we are to our knowledge the first to present a tailored analogue for depth images. Rather than reason in terms of patches of 2D pixels as others have before us, our key contribution is to proceed by reasoning in terms of patches of 3D points, with matched patch pairs related by a respective 6 DoF rigid body motion in 3D. In support of obtaining a dense correspondence field in reasonable time, we introduce a new 3D variant of Patch- Match. A third contribution is a simple, yet effective patch upscaling and merging technique, which predicts sharp object boundaries at the target resolution. We show that our results are highly competitive with those of alternative techniques leveraging even a color image at the target resolution or a database of high-resolution depth exemplars.

2 0.95881665 393 cvpr-2013-Separating Signal from Noise Using Patch Recurrence across Scales

Author: Maria Zontak, Inbar Mosseri, Michal Irani

Abstract: Recurrence of small clean image patches across different scales of a natural image has been successfully used for solving ill-posed problems in clean images (e.g., superresolution from a single image). In this paper we show how this multi-scale property can be extended to solve ill-posed problems under noisy conditions, such as image denoising. While clean patches are obscured by severe noise in the original scale of a noisy image, noise levels drop dramatically at coarser image scales. This allows for the unknown hidden clean patches to “naturally emerge ” in some coarser scale of the noisy image. We further show that patch recurrence across scales is strengthened when using directional pyramids (that blur and subsample only in one direction). Our statistical experiments show that for almost any noisy image patch (more than 99%), there exists a “good” clean version of itself at the same relative image coordinates in some coarser scale of the image.This is a strong phenomenon of noise-contaminated natural images, which can serve as a strong prior for separating the signal from the noise. Finally, incorporating this multi-scale prior into a simple denoising algorithm yields state-of-the-art denois- ing results.

3 0.95254838 29 cvpr-2013-A Video Representation Using Temporal Superpixels

Author: Jason Chang, Donglai Wei, John W. Fisher_III

Abstract: We develop a generative probabilistic model for temporally consistent superpixels in video sequences. In contrast to supervoxel methods, object parts in different frames are tracked by the same temporal superpixel. We explicitly model flow between frames with a bilateral Gaussian process and use this information to propagate superpixels in an online fashion. We consider four novel metrics to quantify performance of a temporal superpixel representation and demonstrate superior performance when compared to supervoxel methods.

4 0.9514727 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, Silvio Savarese

Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects whichfrequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.

5 0.95033431 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

Author: Brandon Rothrock, Seyoung Park, Song-Chun Zhu

Abstract: In this paper we present a compositional and-or graph grammar model for human pose estimation. Our model has three distinguishing features: (i) large appearance differences between people are handled compositionally by allowingparts or collections ofparts to be substituted with alternative variants, (ii) each variant is a sub-model that can define its own articulated geometry and context-sensitive compatibility with neighboring part variants, and (iii) background region segmentation is incorporated into the part appearance models to better estimate the contrast of a part region from its surroundings, and improve resilience to background clutter. The resulting integrated framework is trained discriminatively in a max-margin framework using an efficient and exact inference algorithm. We present experimental evaluation of our model on two popular datasets, and show performance improvements over the state-of-art on both benchmarks.

6 0.95025462 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

7 0.94972616 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

8 0.94800246 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

9 0.94785988 30 cvpr-2013-Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

10 0.947519 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

same-paper 11 0.9475041 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

12 0.94702232 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors

13 0.94692266 325 cvpr-2013-Part Discovery from Partial Correspondence

14 0.94678605 287 cvpr-2013-Modeling Actions through State Changes

15 0.94631606 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection

16 0.94506699 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

17 0.94470209 414 cvpr-2013-Structure Preserving Object Tracking

18 0.94466454 43 cvpr-2013-Analyzing Semantic Segmentation Using Hybrid Human-Machine CRFs

19 0.94462216 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

20 0.94460613 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning