cvpr cvpr2013 cvpr2013-336 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes: collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
Reference: text
sentIndex sentText sentNum sentScore
1 Poselet Key-framing: A Model for Human Activity Recognition Michalis Raptis and Leonid Sigal Disney Research, Pittsburgh {mraptis, lsigal}@disneyresearch.com [sent-1, score-0.044]
2 An action is modeled as a very sparse sequence of temporally local discriminative keyframes: collections of partial key-poses of the actor(s), depicting key states in the action sequence. [sent-3, score-1.817]
3 We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. [sent-4, score-1.508]
4 This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. [sent-5, score-0.958]
5 Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine for hard negatives to boost localization performance. [sent-6, score-0.167]
6 This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. [sent-7, score-0.273]
7 We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting. [sent-8, score-0.03]
8 Introduction It is compelling to think of an action, or interaction with another person, as a sequence of keyframes: key-poses of the actor(s) depicting key states in the action sequence. [sent-10, score-1.248]
9 This representation is compact and sparse, which is desirable computationally and for robustness, yet is rich and descriptive. [sent-11, score-0.053]
10 The sparsity and compactness come from the fact that keyframes are, by definition, temporally very local, in our case, each spanning just two frames (using the second frame to compute optical flow). [sent-12, score-0.842]
11 It is worth noting that this use of local temporal information is in sharp contrast to most research in video-based action recognition, where often long temporal trajectories [24] or features computed on a much larger temporal scale (20- or 100-frame segments [32]) are deemed necessary. [sent-13, score-1.021]
12 Using a sparse local keyframe representation, however, does have certain benefits. [sent-14, score-0.244]
13 First, it allows our model to focus on the most distinct parts of the action and disregard frames that are not discriminative or relevant. [sent-15, score-0.055] [sent-17, score-0.566]
14 [Figure caption: poselet activations are max-pooled over the spatial extent of each frame; an action is modeled by a set of latent keyframes discriminatively selected using a max-margin learning framework.] [sent-16, score-1.104]
16 Second, it translates to robustness to variation in action duration or dropped frames because these changes minimally affect our representation. [sent-18, score-0.634]
17 Further, in perception, it has long been shown that certain discriminant static images of humans engaged in activity can convey dynamic information (an effect known as implied motion [15]1). [sent-19, score-0.293]
18 These studies, along with the success of keyframe summaries as means of motion illustration [1] and/or synthesis in computer graphics, motivate us to consider local keyframes as sufficient for our task. [sent-20, score-0.997]
19 However, discovering such keyframe representations is challenging because, intuitively, it requires having accurate pose information for the entire video [30], which is both difficult and computationally expensive. [sent-21, score-0.376]
20 Motivated by the success of poselets in human detection [3] and pose estimation [35], we posit that representing a keyframe as a learned collection of poselets has a number of significant benefits over the more holistic person/pose-based representation [34]. [sent-22, score-0.743]
21 The key benefit of poselets is that they can capture discriminative action parts that carry partial pose (and, in our case, motion) information; this is extremely useful in complex environments where there is clutter or the actor experiences severe occlusions. [sent-23, score-0.929]
22 Moreover, poselets also allow for a semantic, spatially localizable, mid-level representation of the keyframe. [sent-24, score-0.243]
23 1Further, it has been shown that the same parts of the brain that engage in processing and analysis of motion are engaged in processing implied motion in static images [15]. [sent-25, score-0.261]
24 For video-based action recognition, which is the goal of this work, one keyframe may not be sufficient to adequately characterize the action. [sent-26, score-0.685]
25 How many distinct (key)frames are needed to characterize a human activity [29]? [sent-27, score-0.111]
26 For instance, a handshake between two people can be summarized using 3 distinctive keyframes: (i) the two persons approach each other, (ii) they extend their hands towards one another when near, and (iii) they touch and shake their hands. [sent-28, score-0.074]
27 While it may be sensible to specify the number of keyframes to use in a representation, specifying the location of such keyframes for many actions would be too tedious and, frankly, prone to errors. [sent-29, score-1.473]
28 Humans are notoriously bad at the task, and semantic meaningfulness may not translate well to discriminability. [sent-30, score-0.08]
29 Contributions: We cast the learning in a max-margin discriminative framework where we treat keyframes as latent variables. [sent-31, score-0.848]
30 This allows us to (jointly) learn a set of the most discriminative keyframes while also learning the local temporal context between them. [sent-32, score-0.958]
31 First, it allows temporal localization of actions and action parts by modeling actions as sequences of keyframes. [sent-34, score-0.882]
32 Second, it is tolerant to variations in action duration, as our model only assumes partial ordering between the keyframes. [sent-35, score-0.529]
33 Third, it implicitly allows spatial localization of actions within keyframes by representing the keyframes with poselets. [sent-36, score-1.529]
34 Fourth, our formulation is amenable to on-line inference for early detection [22]. [sent-37, score-0.029]
35 Finally, our model generates semantic interpretations (summarizations) of actions by modeling them as action stories: contextual temporal orderings of discriminant partial poses. [sent-38, score-0.92]
36 Related Work The problem of action recognition has been studied extensively in the literature. [sent-41, score-0.39]
37 Sequential models: Hidden semi-Markov Models (HSMM) [10], CRFs [31], and finite-state machines [13] have been used to model the temporal evolution of human activities. [sent-43, score-0.201]
38 These works model the entire video sequence, both its progression and the temporal duration of each phase. [sent-46, score-0.419]
39 As a consequence, during training these models also need to encode events that are irrelevant to the action, making the learning procedure inherently challenging. [sent-47, score-0.39]
40 Our model is more flexible, selecting and modeling only a very sparse, discriminative subset of the sequence. [sent-48, score-0.069]
41 Static image activity recognition: Most approaches in still image action recognition assume a single actor and rely on either explicit [8, 33] or implicit pose [36] information and, often, bag-of-words (BoW) terms for background context [8, 36]. [sent-49, score-0.696]
42 For example, in [33] actions are represented by histograms of pose primitives. [sent-50, score-0.156]
43 [8] uses a more traditional part-based pose model instead. [sent-52, score-0.063]
44 These methods, however, implicitly inherit the problems of traditional pose estimation. [sent-53, score-0.14]
45 [21] side-steps these problems by proposing to use poselets as a mid-level representation for pose; in [21], poselet activation vectors are used directly for recognition, whereas in [36] an intermediate latent 3D pose is constructed to bootstrap performance. [sent-56, score-0.481]
46 We leverage a poselet-like mid-level representation for the frames, but focus on a video scenario where multiple discriminative frames must be selected to encode the action. [sent-57, score-0.239]
47 Key volumes: Recent video-based action recognition methods [12, 28] observed that limiting the bag of spatio-temporal interest points representation [9, 17] to a temporal segment boosts recognition performance. [sent-58, score-0.644]
48 [23] combines the BoW representation of the entire video (global term) with sub-volumes that capture the temporal composition of the action. [sent-60, score-0.323]
49 However, the proposed model lacks the ability to spatially localize action parts. [sent-61, score-0.488]
50 Moreover, the model of [23], similar to [24], relies on global terms, assuming that rough temporal segmentation is given. [sent-62, score-0.201]
51 In contrast, we propose an extremely local action model geared towards analyzing longer image sequences. [sent-63, score-0.43]
52 Brendel and Todorovic [4] introduce a generative model that describes the spatio-temporal structure of the action as a weighted directed graph defined on a spatio-temporal oversegmentation of the video. [sent-64, score-0.39]
53 Chen and Grauman [6] propose an irregular sub-graph model in which local temporal topological changes are allowed. [sent-65, score-0.201]
54 While key-volume methods focus on the discriminative portions of the video, their volumetric nature is susceptible to variabilities in the temporal execution of the action. [sent-66, score-0.393]
55 In contrast, our method decouples action representation from its exact temporal execution, focusing only on temporally local keyframes that are less variable. [sent-67, score-1.415]
56 Keyframes: A number of approaches have proposed the use of keyframes as a representation. [sent-68, score-0.66]
57 Unlike [5], we select keyframes automatically and return, rather than require as in [20], spatio-temporal localization; our keyframe representation is also more compact, utilizing fewer keyframes (up to 4, or 4% of the sequence). [sent-71, score-0.957]
58 [34] propose a max-margin framework for modeling interactions as sequences of key-poses performed by a pair of actors. [sent-73, score-0.059]
59 To model interactions, the approach requires complete tracks of both actors across the entire sequence. [sent-74, score-0.057]
60 In contrast, we rely on a collection of poselets to characterize frames and hence can better deal with partial occlusions, and we are not limited to interaction scenarios. [sent-75, score-0.386]
61 [19] generate a representation of a given video based on the detected active attributes. [sent-81, score-0.093]
62 Our model tries to close this gap by building a localizable mid-level representation. [sent-85, score-0.098]
63 The Model A graphical representation of our model is illustrated in Fig. [sent-87, score-0.053]
64 This model has the ability to localize the discriminative action components both temporally and spatially. [sent-89, score-0.605]
65 It performs action classification and generates detailed spatial and temporal reports for each component. [sent-90, score-0.621]
66 The temporal context of the action, such as the ordering of the action’s components and their temporal correlations, is explicitly modeled. [sent-91, score-0.469]
67 Moreover, the model implicitly performs spatial localization of the image regions that are part of the action. [sent-92, score-0.116]
68 Model Formulation Given a set of video sequences {x1, . . . , xN} and their associated action labels, our goal is to learn a mapping function f : X → {−1, 1}. [sent-95, score-0.071]
69 This mapping will also enable the automatic temporal annotation of unseen video sequences. [sent-102, score-0.389]
70 Our input variables are sequences of images xi = {xi1, . . . }. [sent-103, score-0.046]
71 Our output variable consists of a “global” label y indicating whether a particular action occurs inside the video. [sent-107, score-0.39]
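The sentence listing above states that each frame is described by poselet activations max-pooled over its spatial extent (the figure caption reproduced in the listing). Below is a minimal, hypothetical sketch of that pooling step; the function name and the (P, H, W) response-map layout are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def maxpool_poselet_activations(activation_maps):
    """Collapse dense per-frame poselet responses into one frame descriptor.

    activation_maps: assumed array of shape (P, H, W), one dense response
    map per poselet detector evaluated on a single frame.  Max-pooling over
    the spatial extent keeps, for each poselet, its strongest activation
    anywhere in the frame, giving a P-dimensional descriptor.
    """
    P = activation_maps.shape[0]
    return activation_maps.reshape(P, -1).max(axis=1)
```

Under this sketch, a video becomes one such descriptor per frame, which is the input assumed by the inference sketch that follows.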
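The contributions and model-formulation entries above describe scoring a video as the best placement of a small number of temporally ordered latent keyframes under a max-margin model. The sketch below shows one way such inference could be carried out by dynamic programming; it is a simplification under stated assumptions, not the paper's method: per-keyframe linear scores on pooled frame descriptors and a linear term on the gap between consecutive keyframes stand in for the paper's poselet (HoG/BoW) and temporal-context potentials, and all names (score_video, w_key, w_pair) are hypothetical.

```python
import numpy as np

def score_video(frame_feats, w_key, w_pair, K):
    """Best placement of K temporally ordered latent keyframes.

    frame_feats: list of per-frame descriptors (e.g. max-pooled poselet
      activations); w_key[k]: weight vector scoring frames as keyframe k;
      w_pair[k]: scalar weighting the gap between keyframes k-1 and k.
    Returns the maximal score and the chosen keyframe indices.
    """
    T = len(frame_feats)
    unary = np.array([[float(np.dot(w_key[k], frame_feats[t]))
                       for t in range(T)] for k in range(K)])   # K x T scores
    dp = np.full((K, T), -np.inf)
    back = np.zeros((K, T), dtype=int)
    dp[0] = unary[0]
    for k in range(1, K):
        for t in range(k, T):                    # keyframes keep their temporal order
            gaps = np.arange(t, 0, -1)           # gap t - s for s = 0 .. t-1
            prev = dp[k - 1, :t] + w_pair[k] * gaps
            back[k, t] = int(np.argmax(prev))
            dp[k, t] = prev[back[k, t]] + unary[k, t]
    t_last = int(np.argmax(dp[K - 1]))
    keyframes = [t_last]
    for k in range(K - 1, 0, -1):                # backtrack keyframe placements
        keyframes.append(int(back[k, keyframes[-1]]))
    return float(dp[K - 1, t_last]), keyframes[::-1]
```

The returned maximum would play the role of the classification score f(x) (thresholded for a binary decision), while the recovered keyframe indices provide the temporal localization the listing describes.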
wordName wordTfidf (topN-words)
[('keyframes', 0.66), ('action', 0.39), ('keyframe', 0.244), ('temporal', 0.201), ('poselets', 0.156), ('actor', 0.115), ('hsmm', 0.098), ('localizable', 0.098), ('actions', 0.093), ('frames', 0.077), ('temporally', 0.077), ('localization', 0.074), ('engaged', 0.073), ('poselet', 0.071), ('discriminative', 0.069), ('duration', 0.067), ('bow', 0.064), ('pose', 0.063), ('partial', 0.062), ('dropped', 0.06), ('activity', 0.06), ('activation', 0.055), ('depicting', 0.055), ('latent', 0.054), ('execution', 0.053), ('representation', 0.053), ('exemplars', 0.051), ('characterize', 0.051), ('implied', 0.047), ('sequence', 0.046), ('andg', 0.044), ('handshake', 0.044), ('meaningfulness', 0.044), ('variab', 0.044), ('leonid', 0.044), ('disneyre', 0.044), ('vahdat', 0.044), ('implicitly', 0.042), ('static', 0.041), ('video', 0.04), ('minimally', 0.04), ('nocft', 0.04), ('disney', 0.04), ('stories', 0.04), ('geared', 0.04), ('fle', 0.04), ('discriminant', 0.04), ('rely', 0.04), ('ordering', 0.039), ('tolerant', 0.038), ('fnu', 0.038), ('experiences', 0.038), ('strokes', 0.036), ('progression', 0.036), ('engage', 0.036), ('notoriously', 0.036), ('sfu', 0.036), ('todorovic', 0.036), ('holistic', 0.036), ('cast', 0.036), ('key', 0.036), ('localize', 0.035), ('inherit', 0.035), ('posit', 0.035), ('spatially', 0.034), ('orderings', 0.034), ('ail', 0.034), ('decouples', 0.034), ('variabilities', 0.034), ('summaries', 0.033), ('tennis', 0.033), ('delaitre', 0.033), ('motion', 0.032), ('states', 0.032), ('raptis', 0.032), ('sensible', 0.032), ('sequences', 0.031), ('interpretations', 0.03), ('touch', 0.03), ('streaming', 0.03), ('pittsburgh', 0.03), ('disregard', 0.03), ('generates', 0.03), ('entire', 0.029), ('amenable', 0.029), ('compelling', 0.029), ('brendel', 0.029), ('treat', 0.029), ('sigal', 0.029), ('bootstrap', 0.029), ('lacks', 0.029), ('context', 0.028), ('interactions', 0.028), ('tedious', 0.028), ('associations', 0.028), ('motivate', 0.028), ('compactness', 0.028), ('niebles', 0.028), ('deemed', 0.028), ('actors', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes: collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
2 0.29707801 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera networks, which restrict their use in unconstrained environments. We therefore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of the deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.
4 0.2673482 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate state-of-the-art performance on UCF50 and Olympics datasets.
5 0.26147515 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
6 0.23439004 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
7 0.21203615 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
8 0.21048328 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
9 0.20164315 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
10 0.19462608 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
11 0.16671591 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
12 0.15639466 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
13 0.15524566 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
14 0.15316194 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
15 0.15233439 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
16 0.1510362 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
17 0.14251798 120 cvpr-2013-Detecting and Naming Actors in Movies Using Generative Appearance Models
18 0.14240903 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition
19 0.13994034 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
20 0.13696578 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
topicId topicWeight
[(0, 0.204), (1, -0.1), (2, -0.02), (3, -0.244), (4, -0.324), (5, -0.02), (6, -0.058), (7, 0.045), (8, -0.049), (9, -0.064), (10, 0.008), (11, 0.006), (12, -0.047), (13, 0.04), (14, -0.046), (15, 0.021), (16, 0.031), (17, -0.01), (18, 0.046), (19, 0.082), (20, 0.019), (21, -0.014), (22, 0.009), (23, 0.067), (24, 0.009), (25, -0.059), (26, 0.017), (27, -0.016), (28, 0.002), (29, 0.01), (30, 0.034), (31, -0.04), (32, -0.006), (33, 0.019), (34, -0.024), (35, -0.0), (36, 0.038), (37, -0.041), (38, 0.01), (39, -0.013), (40, -0.03), (41, -0.001), (42, -0.028), (43, -0.019), (44, 0.021), (45, -0.013), (46, -0.043), (47, -0.042), (48, -0.001), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.9713909 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes: collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
2 0.93037802 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
3 0.87658072 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah
Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.
4 0.82583034 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
Author: Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos
Abstract: The problem of action recognition and human activity has been an active research area in Computer Vision and Robotics. While full-body motions can be characterized by movement and change of posture, no characterization that holds invariance has yet been proposed for the description of manipulation actions. We propose that a fundamental concept in understanding such actions is the consequences of actions. There is a small set of fundamental primitive action consequences that provides a systematic high-level classification of manipulation actions. In this paper a technique is developed to recognize these action consequences. At the heart of the technique lies a novel active tracking and segmentation method that monitors the changes in appearance and topological structure of the manipulated object. These are then used in a visual semantic graph (VSG) based procedure applied to the time sequence of the monitored object to recognize the action consequence. We provide a new dataset, called Manipulation Action Consequences (MAC 1.0), which can serve as a testbed for other studies on this topic. Several experiments on this dataset demonstrate that our method can robustly track objects and detect their deformations and division during the manipulation. Quantitative tests prove the effectiveness and efficiency of the method.
5 0.80045056 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
6 0.78262323 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
7 0.75341451 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
9 0.72345144 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
10 0.71463466 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
11 0.6708107 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
12 0.64858735 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
13 0.64113301 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
14 0.61579436 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
15 0.61308378 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
16 0.57261914 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
17 0.57196456 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
18 0.53750658 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
19 0.53384525 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
20 0.5135749 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
topicId topicWeight
[(10, 0.13), (16, 0.018), (26, 0.037), (33, 0.324), (50, 0.182), (67, 0.091), (69, 0.052), (80, 0.019), (87, 0.06)]
simIndex simValue paperId paperTitle
1 0.93739605 8 cvpr-2013-A Fast Approximate AIB Algorithm for Distributional Word Clustering
Author: Lei Wang, Jianjia Zhang, Luping Zhou, Wanqing Li
Abstract: Distributional word clustering merges the words having similar probability distributions to attain reliable parameter estimation, compact classification models and even better classification performance. Agglomerative Information Bottleneck (AIB) is one of the typical word clustering algorithms and has been applied to both traditional text classification and recent image recognition. Although enjoying theoretical elegance, AIB has one main issue with its computational efficiency, especially when clustering a large number of words. Different from existing solutions to this issue, we analyze the characteristics of its objective function, the loss of mutual information, and show that by merely using the ratio of word-class joint probabilities of each word, good candidate word pairs for merging can be easily identified. Based on this finding, we propose a fast approximate AIB algorithm and show that it can significantly improve the computational efficiency of AIB while well maintaining or even slightly increasing its classification performance. Experimental study on both text and image classification benchmark data sets shows that our algorithm can achieve more than 100 times speedup on large real data sets over the state-of-the-art method.
2 0.93237513 417 cvpr-2013-Subcategory-Aware Object Classification
Author: Jian Dong, Wei Xia, Qiang Chen, Jianshi Feng, Zhongyang Huang, Shuicheng Yan
Abstract: In this paper, we introduce a subcategory-aware object classification framework to boost category level object classification performance. Motivated by the observation of considerable intra-class diversities and inter-class ambiguities in many current object classification datasets, we explicitly split data into subcategories by ambiguity guided subcategory mining. We then train an individual model for each subcategory rather than attempt to represent an object category with a monolithic model. More specifically, we build the instance affinity graph by combining both intra-class similarity and inter-class ambiguity. Visual subcategories, which correspond to the dense subgraphs, are detected by the graph shift algorithm and seamlessly integrated into the state-of-the-art detection assisted classification framework. Finally the responses from subcategory models are aggregated by subcategory-aware kernel regression. The extensive experiments over the PASCAL VOC 2007 and PASCAL VOC 2010 databases show the state-of-the-art performance from our framework.
same-paper 3 0.92687136 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes: collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
4 0.92487478 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors
Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan
Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that of multiple human evaluators. Finally, we present results for our framework tested on hundreds of user-generated videos.
5 0.91958702 321 cvpr-2013-PDM-ENLOR: Learning Ensemble of Local PDM-Based Regressions
Author: Yen H. Le, Uday Kurkure, Ioannis A. Kakadiaris
Abstract: Statistical shape models, such as Active Shape Models (ASMs), suffer from their inability to represent a large range of variations of a complex shape and to account for the large errors in detection of model points. We propose a novel method (dubbed PDM-ENLOR) that overcomes these limitations by locating each shape model point individually using an ensemble of local regression models and appearance cues from selected model points. Our method first detects a set of reference points which were selected based on their saliency during training. For each model point, an ensemble of regressors is built. From the locations of the detected reference points, each regressor infers a candidate location for that model point using local geometric constraints, encoded by a point distribution model (PDM). The final location of that point is determined as a weighted linear combination, whose coefficients are learnt from the training data, of candidates proposed from its ensemble's component regressors. We use different subsets of reference points as explanatory variables for the component regressors to provide varying degrees of locality for the models in each ensemble. This helps our ensemble model to capture a larger range of shape variations as compared to a single PDM. We demonstrate the advantages of our method on the challenging problem of segmenting gene expression images of mouse brain.
6 0.89602804 413 cvpr-2013-Story-Driven Summarization for Egocentric Video
7 0.89235747 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
8 0.89225596 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
9 0.89199996 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
10 0.8910436 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
11 0.8907274 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
12 0.89058924 325 cvpr-2013-Part Discovery from Partial Correspondence
13 0.89028382 206 cvpr-2013-Human Pose Estimation Using Body Parts Dependent Joint Regressors
14 0.89028037 14 cvpr-2013-A Joint Model for 2D and 3D Pose Estimation from a Single Image
15 0.89011431 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
16 0.8900786 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
17 0.88974565 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
18 0.88969725 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
19 0.88942134 414 cvpr-2013-Structure Preserving Object Tracking
20 0.88940483 387 cvpr-2013-Semi-supervised Domain Adaptation with Instance Constraints