cvpr cvpr2013 cvpr2013-408 knowledge-graph by maker-knowledge-mining

408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection


Source: pdf

Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah

Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. [sent-8, score-0.873]

2 Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. [sent-9, score-0.787]

3 For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. [sent-10, score-0.8]

4 Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions. [sent-12, score-0.28]

5 This paper focuses on the related problem of action detection [7, 21], sometimes termed action localization [12] or event detection [8, 9], where the goal is to detect every occurrence of a given action within a long video, and to localize each detection both in space and time. [sent-15, score-1.533]

6 As observed by others [1, 8, 28], the action detection problem can be viewed as a spatiotemporal generalization of 2D object detection in images; thus, it is fruitful to study how successful approaches pertaining to the latter could be extended to the former. [sent-16, score-0.766]

7 [8] investigate spatiotemporal extensions of Viola-Jones [23]; we study how the current state-of-the-art method for object detection in images, the deformable part model (DPM) [6], should best be generalized to spatiotemporal representations (see Fig. [sent-18, score-0.633]

8 Although trained in videos with cluttered background at a different scale, the SDPM successfully localizes the target action in both space and time. [sent-24, score-0.513]

9 unable to capture the intra-class spatiotemporal variation of many actions [12]. [sent-34, score-0.429]

10 Clearly, a more sophisticated approach is warranted, and in this paper we propose a spatiotemporal deformable part model (SDPM) that stays true to the structure of the original DPM (see Fig. [sent-35, score-0.345]

11 2) while generalizing the parts to capture spatiotemporal structure. [sent-36, score-0.395]

12 A key difference between SDPM and earlier approaches is that our proposed model employs volumetric parts that displace in both time and space; this has important implications for actions that exhibit significant intra-class variation in terms of execution and also improves performance in clutter. [sent-40, score-0.346]

13 The primary aim of this paper is to comprehensively evaluate spatiotemporal extensions of the deformable part model to understand how well the DPM approach for object detection generalizes to action detection in video. [sent-41, score-0.906]

14 We believe that a hybrid action detection system that incorporates our ideas could achieve further gains. [sent-44, score-0.478]

15 Related Work. Bag-of-words representations [13, 14, 20, 24] have demonstrated excellent results in action recognition. [sent-46, score-0.422]

16 However, such approaches typically ignore the spatiotemporal distribution of visual words, preventing localization of actions within a video. [sent-47, score-0.431]

17 [27] apply pLSA to capture the spatiotemporal relationship of visual words. [sent-50, score-0.254]

18 Although some examples of action localization are shown, the localization is performed in simple or controlled settings and no quantitative results on action detection are presented. [sent-51, score-0.998]

19 Earlier work proposes several strategies for template matching approaches to action localization. [sent-52, score-0.453]

20 [18] generalize the traditional MACH filter to video and vector-valued data, and detect actions by analyzing the response of such filters. [sent-54, score-0.269]

21 [11] localize human actions by a track-aligned HOG3D action representation, which (unlike our method) requires human detection and tracking. [sent-56, score-0.628]

22 [9] introduce the notion of parts and efficiently match the volumetric representation of an event against oversegmented spatiotemporal video volumes; however, these parts are manually specified using prior knowledge and exhibit limited robustness to intra-class variation. [sent-58, score-0.579]

23 Second, both the global template and set of part templates in SDPM are spatiotemporal volumes, and we search for the best fit across scale, space and time. [sent-68, score-0.314]

24 As a 3D subvolume, each part jointly considers appearance and motion information spanning several frames, which is better suited for actions than 2D parts in a single frame [12] that primarily capture pose. [sent-69, score-0.355]

25 Third, we employ a dense scanning approach that matches parts to a large state space, avoiding the potential errors caused by hard decisions on video segmentation, which are then used for matching parts [17]. [sent-70, score-0.293]

26 Finally, we focus explicitly on demonstrating the effectiveness of action detection within a DPM framework, without resorting to global bag-of-words information [12, 17], trajectories [17] or expensive video segmentation [2, 9]. [sent-71, score-0.549]

27 Generalizing DPM from 2D to 3D. Generalizing deformable part models from 2D images to 3D spatiotemporal volumes involves some subtleties that stem from the inherent asymmetry between space and time that is often ignored by volumetric approaches. [sent-73, score-0.487]

28 Briefly: 1) Perspective effects, which cause large variation in observed object/action size, do not affect the temporal dimension; similarly, viewpoint changes affect the spatial configuration of parts while leaving their temporal orderings unchanged. [sent-74, score-0.243]

29 First, consider the difference between a bounding box circumscribing an object in a 2D image and the corresponding cuboid enclosing an action in a video. [sent-77, score-0.52]

30 By contrast, for actions, particularly those that involve whole-body translation (such as walking) or large limb articulations (such as kicking or waving), the bounding volume is primarily composed of background pixels. [sent-79, score-0.315]

31 This is because enclosing the set of pixels swept during even a single cycle of the action requires a large spatiotemporal box (see Fig. [sent-80, score-0.846]

32 The immediate consequence of this phenomenon, as confirmed in our experiments, is that a detector without parts (solely using the root filter on the enclosing volume) is no longer competitive. [sent-82, score-0.32]

33 Finding discriminative parts is thus more important for action detection than learning the analogous parts for DPMs for 2D objects. [sent-83, score-0.711]

34 To quantify the severity of this effect, we analyze the masks in the Weizmann dataset and see that for nine out of ten actions, the percentage of pixels occupied by the actor in a box bounding a single cycle of the action is between 18% and 30%; the highest is ‘pjump’ with 35. [sent-84, score-0.549]

35 This observation drives our decision during training to select a set of parts such that in total they occupy 50% of the action cycle volume. [sent-87, score-0.661]

36 Second, in the construction of spatiotemporal pyramids that enable efficient search across scale, space and time must be treated differently. [sent-89, score-0.259]

37 The variation in action duration is principally caused by differences between actors; it is relatively small and better handled by shifting parts. [sent-92, score-0.469]

38 Deformable Part Models. Inspired by the 2D models in [6], we propose a spatiotemporal model with deformable parts for action detection. [sent-98, score-0.822]

39 The model we employ consists of a root filter F0 and several part models. [sent-99, score-0.259]

40 propose the HOG3D [10] descriptor based on a histogram of oriented spatiotemporal gradients as a volumetric generalization of the popular HOG [4] descriptor. [sent-105, score-0.326]

41 During training, we extract HOG3D features over an action cycle volume to train the root filter and part filters. [sent-116, score-0.861]
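
To make the cycle-level feature step concrete, here is a minimal sketch assuming the action cycle is a grayscale (T, H, W) numpy array. The real HOG3D descriptor quantizes spatiotemporal gradients on an icosahedron; this illustration substitutes a simpler spatial-orientation histogram weighted by 3D gradient magnitude.

    import numpy as np

    def cell_grid_features(volume, nx=3, ny=3, nt=3, n_bins=9):
        # Divide a (T, H, W) cycle volume into nt*ny*nx cells and compute one
        # gradient-orientation histogram per cell (a simplified HOG3D stand-in).
        gt, gy, gx = np.gradient(volume.astype(np.float32))
        mag = np.sqrt(gx ** 2 + gy ** 2 + gt ** 2)
        ang = np.arctan2(gy, gx) % np.pi                 # spatial orientation only
        bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
        T, H, W = volume.shape
        feats = []
        for kt in range(nt):
            for ky in range(ny):
                for kx in range(nx):
                    sl = (slice(kt * T // nt, (kt + 1) * T // nt),
                          slice(ky * H // ny, (ky + 1) * H // ny),
                          slice(kx * W // nx, (kx + 1) * W // nx))
                    hist = np.bincount(bins[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_bins)
                    feats.append(hist / (np.linalg.norm(hist) + 1e-6))
        return np.concatenate(feats)      # descriptor for the whole cycle volume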

42 During detection, HOG3D features of the whole test video volume are used to form feature maps and construct a feature pyramid to enable efficient search through scale and spatiotemporal location. [sent-117, score-0.417]

43 Root filter. We follow the overall DPM training paradigm, as influenced by the discussion in Section 3: during training, for positive instances, we select from each video a single box enclosing one cycle of the given action. [sent-120, score-0.311]

44 These negatives are supplemented with random volumes drawn at different scales from videos that do not contain the given action to help better discriminate the given action from background. [sent-122, score-0.951]

45 The root filter captures the overall information of the action cycle and is obtained by applying an SVM to the HOG3D features of the action cycle volume. [sent-123, score-1.285]
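
A hedged illustration of how the root filter could be obtained: a linear SVM is trained on cycle-level descriptors and its weight vector is kept as F0. The cell_grid_features function is the illustrative descriptor from the sketch above (not real HOG3D), and the regularization constant C is an assumption.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_root_filter(pos_cycles, neg_volumes):
        # pos_cycles: action cycle volumes from positive videos
        # neg_volumes: cycle-sized volumes from other classes / random subvolumes
        X = np.array([cell_grid_features(v) for v in pos_cycles + neg_volumes])
        y = np.array([1] * len(pos_cycles) + [0] * len(neg_volumes))
        clf = LinearSVC(C=1.0).fit(X, y)
        return clf.coef_.ravel()          # root filter weight vector F0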

46 How to divide the action volume is important for good performance. [sent-124, score-0.502]

47 In our experiments, to train the root filter, we have experimentally determined that dividing the spatial extent of an action cycle volume into 3×3 cells works well. [sent-127, score-0.767]

48 This is an instance of the asymmetry between space and time discussed in Section 3 since the observed spatial extent of an action varies greatly with camera pose but is similar across actions, while temporal durations are invariant to camera pose but very dependent on the type of action. [sent-130, score-0.525]

49 Thus, the number of stages T is determined automatically for each action type according to its distribution of durations computed over its positive examples, such that each stage of the model contains 5–10 frames. [sent-132, score-0.475]
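
A small sketch of picking T from the duration statistics of a class; the target of about 7 frames per stage is an assumed midpoint of the 5–10 frame range quoted above, not a value taken from the paper.

    import numpy as np

    def num_temporal_stages(durations_in_frames, target_frames_per_stage=7):
        # Pick T so that a typical action cycle of this class is split into
        # stages of roughly 5-10 frames each.
        typical = float(np.median(durations_in_frames))
        return max(1, int(round(typical / target_frames_per_stage)))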

50 As shown in Fig. 3(a), only a small fraction of the pixels in a bounding action volume correspond to the actor. [sent-139, score-0.562]

51 As confirmed by our experiments, these issues are more serious in volumetric action detection than in images, so the role of automatically learned deformable parts in SDPM to address them is consequently crucial. [sent-141, score-0.711]

52 Right: spatial area corresponding to the bounding volume, which (for this action type) is divided into 3 cells in x (yellow), 3 cells in y (red), 3 cells in t (green) to compute the HOG3D features for the root filter. [sent-145, score-0.751]

53 Our experiments confirm that extracting HOG3D features for part models at twice the resolution and with more cells in space (but not time) enables the learned parts to capture important details; this is consistent with Felzenszwalb et al. [sent-149, score-0.241]

54 Analogous to the parts in DPM, we allow the parts selected by SDPM to overlap in space. [sent-151, score-0.234]

55 After applying the SVM to the extracted features, subvolumes with higher weights, meaning that they are more discriminative for the given action type, are selected as parts, while those with lower weights are ignored. [sent-152, score-0.462]

56 In our setting, the action volume is divided into 12×12×T cells to extract the HOG3D features, and each part is a subvolume occupying 3×3×1 cells. [sent-153, score-0.615]

57 Then, we greedily select the N parts with the highest energy such that their union fills 50% of the action cycle volume. [sent-154, score-0.239]

58 The weights in a subvolume are cleared after that subvolume has been selected as a part, and this process continues until all N parts are determined. [sent-156, score-0.242]
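
The greedy selection described in the last three sentences could be sketched as follows, with weights being the learned SVM weight vector reshaped onto the 12×12×T cell grid (one weight sub-vector per cell); using the sum of squared weights as the energy measure is an assumption carried over from common DPM practice rather than a detail stated here.

    import numpy as np

    def select_parts(weights, part_shape=(3, 3, 1), coverage=0.5):
        # weights: array of shape (12, 12, T, D) -- one weight vector per cell.
        w = weights.copy()
        NX, NY, NT = w.shape[:3]
        px, py, pt = part_shape
        target_cells = coverage * NX * NY * NT
        covered = np.zeros((NX, NY, NT), dtype=bool)
        parts = []
        while covered.sum() < target_cells:
            best_e, best_pos = -np.inf, None
            for x in range(NX - px + 1):
                for y in range(NY - py + 1):
                    for t in range(NT - pt + 1):
                        e = np.sum(w[x:x + px, y:y + py, t:t + pt] ** 2)
                        if e > best_e:
                            best_e, best_pos = e, (x, y, t)
            if best_pos is None or best_e <= 0:
                break                      # nothing informative left to select
            x, y, t = best_pos
            parts.append(best_pos)
            # Clear the chosen subvolume's weights so the next pick favours
            # cells that are not yet part of any selected part.
            w[x:x + px, y:y + py, t:t + pt] = 0
            covered[x:x + px, y:y + py, t:t + pt] = True
        return parts                       # anchor positions of the N parts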

59 In our model, each part represents a spatiotemporal volume. [sent-157, score-0.283]

60 For example, xi < xj means that the ith part occurs to the left of the jth part, and ti < tj means that the ith part occurs before the jth part in time. [sent-165, score-0.264]

61 Additionally, to address the high degree of intra-class variability in each action type, we allow each part of the model to shift within a certain spatiotemporal region. [sent-166, score-0.705]

62 An action cycle is divided into three temporal stages, with each stage containing several frames. [sent-183, score-0.611]

63 In this case, HOG3D features for the root filter are computed by dividing the action cycle volume into 3×3×3 cells. [sent-184, score-0.838]

64 The large yellow rectangle indicates the region covered by the root filter; the small red, magenta, and green ones are the selected parts in each temporal stage. [sent-187, score-0.292]

65 Thus, filters and deformation cost coefficients di are updated to better capture action characteristics. [sent-202, score-0.506]

66 Action detection with SDPM. Given a test video volume, we build a spatiotemporal feature pyramid by computing HOG3D features at different scales, enabling SDPM to efficiently evaluate models in scale, space and time. [sent-204, score-0.366]
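
The pyramid construction can be sketched as below; in line with the space-time asymmetry discussed in Section 3, only the spatial dimensions are rescaled. The number of levels and the scale step are illustrative assumptions.

    import numpy as np
    from scipy.ndimage import zoom

    def spatiotemporal_pyramid(video, n_levels=5, scale_step=0.8):
        # video: (T, H, W) array; rescale H and W only, leave time untouched.
        levels = []
        for l in range(n_levels):
            s = scale_step ** l
            levels.append(zoom(video.astype(np.float32), (1.0, s, s), order=1))
        return levels   # HOG3D feature maps would then be computed per level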

67 Score maps for root and part filters are computed at every level of the feature pyramid using template matching. [sent-209, score-0.248]

68 For level l, the score map S(l) of each filter can be obtained by correlation of the filter F with the features φ(l) of the test video volume: S(l, i, j, k) = Σ_{x,y,t} F(x, y, t) · φ(l, i+x, j+y, k+t). (1)

69 At level l in the feature pyramid, the score of a detection volume centered at (x, y, t) is the sum of the score of the root filter on this volume and the scores from each part filter on its best possible subvolume: score(x, y, t, l) = F0 · α(x, y, t, l) + Σ_i max over placements of part i [F_i · β_i(placement, l) − deformation cost of that placement].

70 We choose the highest score from all possible placements in the detection volume as the score of each part model, and for each placement, the score is computed by the filter response minus deformation cost. [sent-231, score-0.378]
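
Below is a minimal sketch of this score combination at one pyramid level, assuming root and part score maps have already been computed by correlation as in Eq. (1). The brute-force displacement search, the search radius, the quadratic deformation cost d_i · (dx², dy², dt²), and the neglect of the doubled part-map resolution are illustrative assumptions; an efficient implementation would use a generalized distance transform instead.

    import numpy as np

    def detection_score(root_map, part_maps, anchors, d, x, y, t, radius=2):
        # root_map: root filter score map; part_maps[i]: score map of part i
        # anchors[i]: (ax, ay, at) anchor of part i relative to the root
        # d[i]: deformation cost coefficients (length 3) for part i
        score = root_map[x, y, t]
        for i, (ax, ay, at) in enumerate(anchors):
            best = -np.inf
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    for dt in range(-radius, radius + 1):
                        px, py, pt = x + ax + dx, y + ay + dy, t + at + dt
                        if not (0 <= px < part_maps[i].shape[0]
                                and 0 <= py < part_maps[i].shape[1]
                                and 0 <= pt < part_maps[i].shape[2]):
                            continue
                        deform = float(np.dot(d[i], [dx * dx, dy * dy, dt * dt]))
                        best = max(best, part_maps[i][px, py, pt] - deform)
            score += best
        return score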

71 If a detection volume scores above a threshold, then that action is detected at the given spatiotemporal location. [sent-232, score-0.79]

72 This strikes an effective balance between exhaustive search and computational efficiency, covering the target video volume with sufficient spatiotemporal overlap. [sent-234, score-0.36]

73 As with DPM, our root filter expresses the overall structure of the action while part filters capture the finer details. [sent-235, score-0.702]

74 This combination of root and part filters ensures good detection performance. [sent-237, score-0.243]

75 In experiments, we observe that the peak of the score map obtained by combining the root score and part scores is more distinct, stable and accurate than that of the root score map alone. [sent-238, score-0.349]

76 Since the parts can ignore the background pixels in the bounding volume and focus on the distinctive aspects of the given action, the part-based SDPM is significantly more effective. [sent-239, score-0.245]

77 More importantly, since this paper’s primary goal is to study the correct way to generalize DPM to spatiotemporal settings, we stress reproducibility by employing standard features, eschewing parameter tweaking and making our source code available. [sent-243, score-0.303]

78 better understand how SDPM root and part filters work in spatiotemporal volumes. [sent-254, score-0.419]

79 SDPM achieves 100% recognition (without use of masks) and localizes every action occurrence correctly, which is an excellent sanity check. [sent-255, score-0.471]

80 We evaluate action detection on the MSR-II Dataset using SDPMs trained solely on the KTH Dataset. [sent-259, score-0.478]

81 Our results on MSR-II confirm that parts are critical for action detection in crowded and complex scenes. [sent-260, score-0.608]

82 For action detection (spatiotemporal localization), we employ the usual “intersection-over-union” criterion and generate ROC curves when the overlap criterion equals 0. [sent-261, score-0.533]
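
For reference, the spatiotemporal intersection-over-union between a detected cuboid and a ground-truth cuboid can be computed as in the sketch below; the (x1, y1, t1, x2, y2, t2) tuple layout is an assumed convention.

    def cuboid_iou(a, b):
        # a, b: (x1, y1, t1, x2, y2, t2) spatiotemporal boxes
        ix = max(0.0, min(a[3], b[3]) - max(a[0], b[0]))
        iy = max(0.0, min(a[4], b[4]) - max(a[1], b[1]))
        it = max(0.0, min(a[5], b[5]) - max(a[2], b[2]))
        inter = ix * iy * it
        vol = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
        union = vol(a) + vol(b) - inter
        return inter / union if union > 0 else 0.0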

83 For action recognition (whole clip, forced-choice classification), we apply an SDPM for each action class to each clip and assign the clip to the class with the highest number of detections. [sent-264, score-0.946]
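
A sketch of that forced-choice rule; run_detector is a hypothetical helper returning the detection scores that one class's SDPM produces on a clip, and the score threshold is left as a parameter.

    def classify_clip(clip, models, threshold):
        # models: {class_name: trained SDPM}; returns the predicted class name.
        counts = {}
        for name, model in models.items():
            scores = run_detector(model, clip)          # hypothetical helper
            counts[name] = sum(1 for s in scores if s > threshold)
        return max(counts, key=counts.get)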

84 We provide action recognition results mainly to show that SDPM is also competitive on this task, even though detection is our primary goal. [sent-265, score-0.505]

85 Experiments on Weizmann Dataset. The Weizmann dataset [1] is a popular action dataset with nine people performing ten actions. [sent-268, score-0.422]

86 Weizmann does not come with occurrence-level annotations so we annotate a single action cycle from each video clip to provide positive training instances; as usual, negatives include such instances from other classes augmented with randomly-sampled subvolumes from other classes. [sent-272, score-0.741]

87 Experiments on UCF Sports Dataset. The UCF Sports Dataset [18] consists of videos from sports broadcasts, with a total of 150 videos from 10 action classes, such as golf, lifting and running. [sent-297, score-0.6]

88 From the provided frame-level annotations, we create a new large bounding volume that circumscribes all of the annotations for a given action cycle. [sent-299, score-0.56]

89 For action recognition (not our primary goal), SDPM’s forced-choice classification accuracy, averaged over action classes, is 75. [sent-303, score-0.871]

90 We evaluate action localization using the standard “intersection-over-union” measure. [sent-324, score-0.471]

91 Following [12], an action occurrence is counted as correct when the measure exceeds 0. [sent-325, score-0.45]

92 [12] on action detection; we are unable to directly compare detection accuracy against Raptis et al. [sent-334, score-0.478]

93 For each model, the training set consists of a single action cycle from each KTH clip (positives) and instances from the other two classes (negatives). [sent-342, score-0.606]

94 We attribute this robustness to SDPM’s ability to capture the intrinsic spatiotemporal structure of actions. [sent-346, score-0.254]

95 Conclusion. We present SDPM for action detection by extending deformable part models from 2D images to 3D spatiotemporal volumes. (Footnote 4: we note that the description of precision and recall in [3] is reversed.) [sent-348, score-0.823]

96 Naive approaches to generalizing DPMs fail because the fraction of discriminative pixels in action volumes is smaller than that in corresponding 2D bounding boxes. [sent-356, score-0.567]

97 Discriminative figure-centric models for joint action localization and recognition. [sent-447, score-0.471]

98 Unsupervised learning of human action categories using spatial-temporal words. [sent-475, score-0.422]

99 Discovering discriminative action parts from mid-level video representations. [sent-481, score-0.576]

100 Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. [sent-487, score-0.493]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sdpm', 0.711), ('action', 0.422), ('spatiotemporal', 0.232), ('actions', 0.15), ('cycle', 0.133), ('ucf', 0.11), ('parts', 0.106), ('root', 0.104), ('dpm', 0.103), ('weizmann', 0.088), ('sports', 0.08), ('volume', 0.08), ('filter', 0.071), ('aser', 0.067), ('volumetric', 0.065), ('cells', 0.062), ('deformable', 0.062), ('dpms', 0.058), ('subvolume', 0.058), ('temporal', 0.056), ('detection', 0.056), ('clip', 0.051), ('part', 0.051), ('volumes', 0.05), ('localization', 0.049), ('video', 0.048), ('kl', 0.046), ('cell', 0.041), ('subvolumes', 0.04), ('lifting', 0.04), ('bounding', 0.039), ('enclosing', 0.039), ('boxing', 0.036), ('lan', 0.035), ('generalizing', 0.035), ('marsza', 0.034), ('stages', 0.033), ('employ', 0.033), ('handclapping', 0.033), ('handwaving', 0.033), ('pjump', 0.033), ('sdpms', 0.033), ('filters', 0.032), ('raptis', 0.032), ('template', 0.031), ('ti', 0.031), ('pyramid', 0.03), ('deformation', 0.03), ('score', 0.03), ('roc', 0.03), ('videos', 0.029), ('icosahedron', 0.029), ('actor', 0.029), ('descriptor', 0.029), ('occurrence', 0.028), ('negatives', 0.028), ('ith', 0.028), ('dividing', 0.028), ('niebles', 0.027), ('primary', 0.027), ('enable', 0.027), ('asymmetry', 0.027), ('ke', 0.027), ('primarily', 0.026), ('yellow', 0.026), ('auc', 0.026), ('variation', 0.025), ('reproducibility', 0.025), ('rahul', 0.025), ('crcv', 0.025), ('golf', 0.024), ('crowded', 0.024), ('yi', 0.024), ('xi', 0.024), ('trajectories', 0.023), ('mach', 0.022), ('duration', 0.022), ('event', 0.022), ('overlap', 0.022), ('capture', 0.022), ('cluttered', 0.021), ('localizes', 0.021), ('fraction', 0.021), ('ek', 0.021), ('analogous', 0.021), ('continues', 0.02), ('durations', 0.02), ('iarpa', 0.02), ('activity', 0.02), ('treated', 0.02), ('laptev', 0.02), ('masks', 0.02), ('box', 0.02), ('background', 0.02), ('rodriguez', 0.019), ('brendel', 0.019), ('magenta', 0.019), ('stress', 0.019), ('occupied', 0.019), ('annotations', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah

Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.

2 0.31146884 287 cvpr-2013-Modeling Actions through State Changes

Author: Alireza Fathi, James M. Rehg

Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.

3 0.29767624 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis

Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate stateof-the-art performance on UCF50 and Olympics datasets.

4 0.29758632 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest

Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla

Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE)from a novelperspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therfore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.

5 0.27951238 40 cvpr-2013-An Approach to Pose-Based Action Recognition

Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille

Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations ofbodyparts in one frame (by spatial-part-sets) as well as the body part movements(by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.

6 0.26495525 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

7 0.23439004 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition

8 0.23043799 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition

9 0.22769313 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes

10 0.22450329 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)

11 0.18610355 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

12 0.18426664 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition

13 0.16877364 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

14 0.16853687 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images

15 0.16553734 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

16 0.15876712 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

17 0.15800786 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

18 0.14042042 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition

19 0.12584303 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition

20 0.12034508 248 cvpr-2013-Learning Collections of Part Models for Object Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.211), (1, -0.117), (2, -0.005), (3, -0.239), (4, -0.318), (5, -0.015), (6, -0.049), (7, 0.086), (8, -0.077), (9, -0.094), (10, -0.05), (11, -0.054), (12, -0.017), (13, -0.011), (14, -0.057), (15, -0.038), (16, 0.022), (17, -0.019), (18, 0.074), (19, 0.196), (20, 0.025), (21, -0.007), (22, 0.037), (23, 0.115), (24, 0.096), (25, -0.057), (26, 0.038), (27, -0.04), (28, 0.0), (29, -0.022), (30, 0.012), (31, 0.004), (32, 0.011), (33, 0.013), (34, 0.016), (35, 0.031), (36, 0.006), (37, -0.055), (38, -0.033), (39, 0.018), (40, 0.028), (41, 0.018), (42, -0.054), (43, -0.009), (44, -0.005), (45, -0.023), (46, -0.017), (47, -0.015), (48, 0.005), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9553408 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah

Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.

2 0.92705941 287 cvpr-2013-Modeling Actions through State Changes

Author: Alireza Fathi, James M. Rehg

Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.

3 0.86729568 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition

Author: Michalis Raptis, Leonid Sigal

Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and minefor hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.

4 0.83194715 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)

Author: Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos

Abstract: The problem of action recognition and human activity has been an active research area in Computer Vision and Robotics. While full-body motions can be characterized by movement and change of posture, no characterization, that holds invariance, has yet been proposed for the description of manipulation actions. We propose that a fundamental concept in understanding such actions, are the consequences of actions. There is a small set of fundamental primitive action consequences that provides a systematic high-level classification of manipulation actions. In this paper a technique is developed to recognize these action consequences. At the heart of the technique lies a novel active tracking and segmentation method that monitors the changes in appearance and topological structure of the manipulated object. These are then used in a visual semantic graph (VSG) based procedure applied to the time sequence of the monitored object to recognize the action consequence. We provide a new dataset, called Manipulation Action Consequences (MAC 1.0), which can serve as testbed for other studies on this topic. Several ex- periments on this dataset demonstrates that our method can robustly track objects and detect their deformations and division during the manipulation. Quantitative tests prove the effectiveness and efficiency of the method.

5 0.77171087 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition

Author: LiMin Wang, Yu Qiao, Xiaoou Tang

Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.

6 0.75719571 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches

7 0.74599677 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

8 0.69844949 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition

9 0.6877808 40 cvpr-2013-An Approach to Pose-Based Action Recognition

10 0.65587193 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes

11 0.65123439 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

12 0.64764571 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest

13 0.64497155 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition

14 0.64082724 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition

15 0.59366453 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

16 0.5610652 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images

17 0.50161487 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization

18 0.49657783 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences

19 0.46447942 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

20 0.44165832 407 cvpr-2013-Spatio-temporal Depth Cuboid Similarity Feature for Activity Recognition Using Depth Camera


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.175), (16, 0.016), (26, 0.042), (28, 0.013), (33, 0.277), (39, 0.011), (67, 0.099), (69, 0.042), (80, 0.023), (87, 0.081), (97, 0.125)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94119245 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection

Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah

Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.

2 0.93169868 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

Author: Ian Endres, Kevin J. Shih, Johnston Jiaa, Derek Hoiem

Abstract: We propose a method to learn a diverse collection of discriminative parts from object bounding box annotations. Part detectors can be trained and applied individually, which simplifies learning and extension to new features or categories. We apply the parts to object category detection, pooling part detections within bottom-up proposed regions and using a boosted classifier with proposed sigmoid weak learners for scoring. On PASCAL VOC 2010, we evaluate the part detectors ’ ability to discriminate and localize annotated keypoints. Our detection system is competitive with the best-existing systems, outperforming other HOG-based detectors on the more deformable categories.

3 0.92943293 414 cvpr-2013-Structure Preserving Object Tracking

Author: Lu Zhang, Laurens van_der_Maaten

Abstract: Model-free trackers can track arbitrary objects based on a single (bounding-box) annotation of the object. Whilst the performance of model-free trackers has recently improved significantly, simultaneously tracking multiple objects with similar appearance remains very hard. In this paper, we propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation ofour structure-preserving object tracker (SPOT) reveals significant performance improvements in multi-object tracking. We also show that SPOT can improve the performance of single-object trackers by simultaneously tracking different parts of the object.

4 0.92665559 314 cvpr-2013-Online Object Tracking: A Benchmark

Author: Yi Wu, Jongwoo Lim, Ming-Hsuan Yang

Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.

5 0.92550105 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation

Author: Brandon Rothrock, Seyoung Park, Song-Chun Zhu

Abstract: In this paper we present a compositional and-or graph grammar model for human pose estimation. Our model has three distinguishing features: (i) large appearance differences between people are handled compositionally by allowingparts or collections ofparts to be substituted with alternative variants, (ii) each variant is a sub-model that can define its own articulated geometry and context-sensitive compatibility with neighboring part variants, and (iii) background region segmentation is incorporated into the part appearance models to better estimate the contrast of a part region from its surroundings, and improve resilience to background clutter. The resulting integrated framework is trained discriminatively in a max-margin framework using an efficient and exact inference algorithm. We present experimental evaluation of our model on two popular datasets, and show performance improvements over the state-of-art on both benchmarks.

6 0.92212832 285 cvpr-2013-Minimum Uncertainty Gap for Robust Visual Tracking

7 0.92154014 325 cvpr-2013-Part Discovery from Partial Correspondence

8 0.91967964 324 cvpr-2013-Part-Based Visual Tracking with Online Latent Structural Learning

9 0.9169268 147 cvpr-2013-Ensemble Learning for Confidence Measures in Stereo Vision

10 0.91614646 400 cvpr-2013-Single Image Calibration of Multi-axial Imaging Systems

11 0.91581535 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation

12 0.91580528 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image

13 0.91506571 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation

14 0.91422617 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

15 0.91405922 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors

16 0.91296035 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

17 0.91292775 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

18 0.91227508 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities

19 0.91200322 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path

20 0.91194415 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence