cvpr cvpr2013 cvpr2013-287 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we present a model of action based on the change in the state of the environment. [sent-3, score-0.786]
2 The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. [sent-5, score-0.516]
3 Our results outperform state-of-the-art action recognition and activity segmentation results. [sent-9, score-0.788]
4 The common theme among all these works is that they model an action by encoding motion and appearance throughout the interval in which it is performed. [sent-16, score-0.695]
5 For example, “open coffee jar” and “close coffee jar” are two different actions; in fact, they are inverses. [sent-19, score-0.58]
6 For example, the opening action changes the state of an object from closed to open. [sent-23, score-0.913]
7 Figure 1: By comparing the initial and final frames of an action, and exploiting the action label, we can learn to detect the meaningful changes in the state of objects and materials produced by the action. [sent-66, score-1.014]
8 In example (a), our method recognizes the action of “close coffee jar”, as a result of detecting regions corresponding to open and closed coffee jar. [sent-67, score-1.404]
9 Similarly in example (b), the action of “spread jelly on bread” is recognized by detecting the regions corresponding to plain bread loaf and jelly spread on bread respectively in the initial and final frames of the action. [sent-68, score-1.474]
10 Based on this observation, we introduce a method for recognizing daily actions by recognizing the changes in the state of objects and materials. [sent-69, score-0.73]
11 For instance, the action “spread jelly on bread using knife” requires jelly to be on the knife but not on the bread when it is applied. [sent-72, score-1.284]
12 This action changes the state of the jelly from being on the knife to being spread on the bread. [sent-73, score-1.109]
13 Or for example, “take cup” is an action before which the cup is not being held by the hand, but once it is performed the cup is grasped by the hand (see footnote 1). [sent-74, score-0.839]
14 Footnote 1: Note that this notion of actions as state-changing processes holds for most cases; however, there are exceptions such as “dancing” that do not create any describable or observable changes in the environment. [sent-75, score-0.497]
15 coffee powder mixing with water, jelly getting spread on bread, egg getting scrambled). [sent-80, score-0.572]
16 For example, the following actions take place during the activity of “making coffee”: (open coffee jar), (scoop coffee using spoon), (pour coffee into cup), (put hot water into cup), and (close coffee jar). [sent-81, score-1.642]
17 Throughout these actions, the coffee jar changes states from closed to open and again to closed. [sent-82, score-0.911]
18 Likewise, the coffee powder changes state from being in the coffee jar to being on the spoon, and then being in the cup, and finally dissolving into hot water. [sent-83, score-1.236]
19 In contrast, the egocentric view puts the environment into the center of the action interpretation problem. [sent-86, score-0.765]
20 Once these state detectors are learned, we run them at each frame of the videos and describe the environment at each moment in time based on the existence or absence of detected object and material states. [sent-90, score-0.474]
21 We introduce methods that leverage the changes in the state of the environment to recognize actions and segment activities. [sent-91, score-0.662]
22 Our results outperform state-of-the-art action recognition and activity segmentation results. [sent-92, score-0.788]
23 There have been various attempts in the past to model object context for action and activity recognition. [sent-101, score-0.761]
24 [14] use scene context to improve action recognition performance. [sent-111, score-0.589]
25 Vaina and Jaulent [25] suggested that a comprehensive description of an action requires understanding its goal. [sent-116, score-0.589]
26 They refer to the conditions necessary for achieving the goal of an action as action requirements and model the compatibility of an object with those goals. [sent-117, score-1.178]
27 However, change of intensity cannot describe complex state changes caused by an action like opening an object. [sent-120, score-0.917]
28 In this paper, we describe a method that discovers state-specific regions from the training videos and models their changes to recognize actions during the testing phase. [sent-121, score-0.525]
29 In the AI and robotics literature, an action is performed by an intelligent agent through actuators and results in particular changes to the environment [19]. [sent-122, score-0.783]
30 Such changes can be perceived by the agent’s sensors, and lead to decision making, resulting in a perception and action loop. [sent-123, score-0.691]
31 Method: Our task is to model daily actions via the changes they induce in the state of the environment. [sent-127, score-0.584]
32 First for each action instance, we compare its initial frames with its final frames to extract the regions that are changed. [sent-136, score-0.869]
33 In the second stage we discard the changes that are not common over the examples of their corresponding action type. [sent-137, score-0.743]
34 During the testing phase we apply the trained region detectors to describe actions and states. [sent-139, score-0.406]
35 In order to find the regions that are changed as a result of an action, we compare their appearance before the action starts to their appearance after the action ends. [sent-140, score-1.335]
36 Using our method, we show significant gains in action recognition and video segmentation performance. [sent-142, score-0.648]
37 An activity like making a peanut-butter and jelly sandwich consists of a sequence of atomic actions (e. [sent-144, score-0.562]
38 For each training activity video, the actions are annotated. [sent-148, score-0.416]
39 The annotation for each action contains its start frame, end frame, a verb (e. [sent-149, score-0.589]
40 3, we propose a method for recognizing actions based on the change in the detected object states and materials. [sent-160, score-0.454]
41 In this stage, we make two assumptions: (1) an object state or material does not change unless an action is performed and (2) an object state or material change is associated with an action only if it consistently occurs at all instances of that action. [sent-166, score-1.726]
42 In the first stage, we identify regions that either appear or disappear as a result of the execution of each action instance. [sent-168, score-0.748]
43 For example, the region corresponding to the lid of the coffee jar will change as a result of performing the action of open coffee. [sent-169, score-1.39]
44 Change Detection: In this stage, we find the regions that either appear or disappear throughout each action instance in the training set. [sent-175, score-0.755]
45 Each action instance corresponds to a short interval which is a sub-part of a longer activity video. [sent-176, score-0.897]
46 For each action instance, we sample a few frames from its beginning and a few frames from its end. [sent-177, score-0.85]
47 For example, pouring water into a cup containing coffee powder results in the appearance of a new dark brown liquid region in the cup. [sent-185, score-0.566]
48 Consistent Regions: In the previous stage, we extract regions that have changed between the initial and final frames of each action instance. [sent-192, score-0.843]
49 Now in this stage, we only keep the subset of those regions that consistently occur across the instances of an action type. [sent-193, score-0.759]
50 For example, a region that corresponds to the coffee jar’s lid consistently appears at the beginning of “open coffee”, but a spurious region would not. [sent-194, score-0.582]
51 A region r consistently occurs in an action class A if there is a region r̂ similar to r in each instance a of that action class (a ∈ A). [sent-195, score-1.351]
52 [23], we cluster the N extracted regions from instances of action class A into k sets {S1, S2 , . [sent-198, score-0.755]
53 , Sk} by enforcing the regions in each cluster to be drawn from the majority of action instances. [sent-201, score-0.726]
54 We further add an additional constraint that each action instance can at most contribute one example to each cluster. [sent-202, score-0.619]
55 This objective function is similar to the objective function of k-means with two additional constraints that enforce a cluster Si to have samples from at least h action instances. [sent-205, score-0.616]
56 We make sure each of these regions is picked from a different action instance. [sent-208, score-0.699]
57 We continue this until either k clusters are returned or there are fewer than h action instances with regions left in them. [sent-210, score-0.728]
58 We train a linear SVM by using the regions belonging to the cluster as the positive set and all the regions in activities that do not contain the action as the negative set. [sent-215, score-0.904]
59 States as Action Requirements: An action can be performed only if certain conditions are satisfied in the environment. [sent-226, score-0.589]
60 For example, “clean the table” is an action that requires the table to be dirty before [sent-227, score-0.589]
61 Figure 3: In contrast to the features of conventional action recognition methods, our features are meaningful to humans. [sent-275, score-0.589]
62 For example in (a), in the first frame the SVM puts a high weight on the open mouth of the peanut-butter jar and in the last frame puts a high weight on the peanut-butter jar’s lid. [sent-277, score-0.628]
63 Thus, the key to recognizing a goal-oriented action is to be able to recognize the state of the environment both before and after that action. [sent-279, score-0.913]
64 We represent the environment state based on two criteria: (1) existence or absence of state-specific regions and (2) whether or not an object (region) is grasped and is being manipulated by the hands. [sent-280, score-0.409]
65 In order to model if the regions are being grasped by the hands or not, we use the foreground segmentation method of [6] which identifies if a region is being moved by the hands or not. [sent-287, score-0.4]
66 Modeling Actions through State Changes: The majority of common action recognition approaches rely on analyzing the motion and appearance content of the action intervals. [sent-292, score-1.178]
67 However, most daily object-manipulation tasks are goal-oriented actions that are defined by the changes they cause to the state of the environment. [sent-294, score-0.584]
68 Given a test action interval, we build two response vectors. [sent-295, score-0.66]
69 One is based on the response of the detectors on its beginning frames, and the other is based on the responses on its ending frames. [sent-296, score-0.437]
70 We use a linear SVM to train a classifier for each action type. [sent-298, score-0.611]
71 Since we have concatenated the vectors of beginning and ending frames, a linear SVM can model the change of an object state or material by putting weights on its corresponding responses. [sent-299, score-0.499]
72 We show visualizations of the classifier weight vectors for a few action instances in Fig 3. [sent-301, score-0.618]
73 In order to do so, often one takes all the action detection scores as input and infers the frames that are assigned to each action in that video. [sent-305, score-1.251]
74 In order to handle detection errors, a common strategy is to apply the action classifiers to every possible interval, and then use non-maximum suppression or dynamic programming [16, 6]. [sent-306, score-0.589]
75 In state detection, unlike action recognition, the problem is to assign a state label to each frame of the video. [sent-308, score-0.969]
76 The set of possible states is: (1) before a particular action starts, (2) during that action, and (3) after that action ends. [sent-309, score-1.28]
77 open coffee), we train two state detectors, one using its beginning frames and one using its ending frames. [sent-313, score-0.584]
78 The state detectors are learned on top of the frame’s responses to pre-trained state-specific region detectors (Sec 3. [sent-314, score-0.425]
79 The state detectors are trained using a linear SVM by taking the action’s beginning or ending frames as the positive set [sent-316, score-0.562]
80 Given a test activity video, we apply all the trained beginning and ending state detectors on its frames. [sent-319, score-0.661]
81 This results in two |A| × T matrices, SB and SE, where |A| is the number of action types, T is the number of frames in the test activity video, and SB[a, t] and SE[a, t] respectively contain the classification scores of detecting the initial and final frames of action a at frame t. [sent-320, score-1.598]
82 An interval I_i has a few properties: I_i^a identifies its action label, I_i^st identifies its initial frame number, and I_i^en identifies its final frame number. [sent-325, score-1.004]
83 The score of interval I_i : {I_i^a, I_i^st, I_i^en} is computed by adding the response of the detector corresponding to the initial frame of action I_i^a on frame I_i^st to the response of the detector corresponding to the final frame of action I_i^a on frame I_i^en. [sent-329, score-1.76]
84 The first two constraints prevent action intervals from overlapping with each other. [sent-331, score-0.618]
85 We train the matrix M based on observed action transitions in training activities. [sent-334, score-0.651]
86 In order to do this, in addition to the first frame of the interval I_i^st and its last frame I_i^en, we add two auxiliary states for it: during (I_i^dur) and after (I_i^aft). The score of entering these states is zero, and they are only used to enforce the constraints of Eq 2. [sent-337, score-0.466]
87 7% on 61 classes of action which is significantly higher than the baseline (23%). [sent-341, score-0.589]
88 The set of possible state transitions for an action are shown in Fig 4. [sent-345, score-0.78]
89 There are 61 actions in this dataset, after omitting the background action classes and fixing some of the mistakes in the original annotation. [sent-353, score-0.856]
90 We show the comparison on the actions of the activity of making coffee. [sent-367, score-0.416]
91 Our Method: Given an action interval, we build two response vectors: one using its beginning frames and one using its ending frames, as described in Sec 3. [sent-372, score-0.988]
92 We train 10 detectors from the consistent changes of each action type. [sent-375, score-0.796]
93 Finally we represent each interval with a 610 × 2 × 2 = 2440-dimensional feature vector, and train a linear SVM for each action type. [sent-383, score-0.717]
94 Classifier Visualization: Many conventional action recognition methods rely on features such as corners, point tracks, etc. [sent-390, score-0.589]
95 The action recognition results in [6] are frame based. [sent-393, score-0.667]
96 The weights of the linear SVM classifier trained on examples of an action determine the state-specific parts and materials that should exist in its initial or final frames. [sent-397, score-0.666]
97 In our previous work, we train a CRF for each activity and apply it to the action scores to enforce transition constraints. [sent-404, score-0.807]
98 Discussion: In this paper, we present a model for actions based on the changes in the state of objects and materials. [sent-411, score-0.519]
99 We show significant gains in both action recognition and detection results. [sent-412, score-0.589]
100 In addition, we introduce a method for discovering state-specific regions from the training action examples. [sent-414, score-0.726]
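The interval scoring described in sent-320 and sent-329 above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the array names `SB` and `SE` follow the text, while the function names `interval_score` and `best_interval`, the toy score values, and the brute-force search are hypothetical and introduced here only for illustration. The non-overlap and transition constraints mentioned in the source (sent-331 to sent-337) are omitted.

```python
import numpy as np

def interval_score(SB, SE, action, start, end):
    """Beginning-state score at frame `start` plus ending-state score at frame `end`."""
    return SB[action, start] + SE[action, end]

def best_interval(SB, SE, action):
    """Exhaustively search all (start, end) pairs with start < end for one action.

    Assumption: a plain brute-force search over a single action, ignoring
    the constraints between intervals that the source describes.
    """
    T = SB.shape[1]
    best = (-np.inf, None, None)
    for start in range(T - 1):
        for end in range(start + 1, T):
            score = interval_score(SB, SE, action, start, end)
            if score > best[0]:
                best = (score, start, end)
    return best

# Toy |A| x T score matrices: 2 action types, 6 frames (made-up values).
SB = np.array([[0.1, 0.9, 0.2, 0.0, 0.1, 0.0],
               [0.0, 0.1, 0.0, 0.8, 0.2, 0.1]])
SE = np.array([[0.0, 0.1, 0.7, 0.1, 0.0, 0.2],
               [0.1, 0.0, 0.1, 0.0, 0.3, 0.9]])

score, start, end = best_interval(SB, SE, action=0)
print(f"action 0: frames {start}..{end}, score {score:.2f}")
```

With these toy values, action 0 is best explained by an interval that starts where its beginning-state detector peaks and ends where its ending-state detector peaks afterwards.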
wordName wordTfidf (topN-words)
[('action', 0.589), ('jar', 0.292), ('coffee', 0.29), ('actions', 0.244), ('activity', 0.172), ('bread', 0.167), ('state', 0.151), ('jelly', 0.146), ('ending', 0.14), ('isit', 0.125), ('beginning', 0.115), ('regions', 0.11), ('interval', 0.106), ('states', 0.102), ('changes', 0.102), ('cup', 0.097), ('iia', 0.093), ('fig', 0.087), ('daily', 0.087), ('iein', 0.084), ('powder', 0.084), ('detectors', 0.083), ('open', 0.083), ('frame', 0.078), ('egocentric', 0.074), ('sec', 0.074), ('frames', 0.073), ('knife', 0.069), ('spoon', 0.069), ('environment', 0.065), ('recognizing', 0.062), ('grasped', 0.056), ('region', 0.056), ('materials', 0.053), ('stage', 0.052), ('spread', 0.052), ('responses', 0.052), ('stip', 0.05), ('response', 0.047), ('material', 0.047), ('changed', 0.047), ('activities', 0.046), ('change', 0.046), ('recognize', 0.046), ('hands', 0.044), ('identifies', 0.043), ('scoop', 0.042), ('closed', 0.042), ('transitions', 0.04), ('water', 0.039), ('puts', 0.037), ('pour', 0.037), ('svm', 0.035), ('iai', 0.034), ('milk', 0.034), ('lid', 0.034), ('dancing', 0.032), ('video', 0.032), ('correspond', 0.032), ('chance', 0.031), ('consistently', 0.031), ('bag', 0.03), ('rehg', 0.03), ('understand', 0.03), ('instance', 0.03), ('instances', 0.029), ('intervals', 0.029), ('opening', 0.029), ('leverage', 0.028), ('sb', 0.028), ('concatenation', 0.028), ('discovering', 0.027), ('existence', 0.027), ('segmentation', 0.027), ('hot', 0.027), ('agent', 0.027), ('cluster', 0.027), ('segment', 0.026), ('disappear', 0.026), ('si', 0.026), ('stuff', 0.025), ('initial', 0.024), ('transition', 0.024), ('weakly', 0.024), ('build', 0.024), ('phase', 0.023), ('videos', 0.023), ('mistakes', 0.023), ('patterns', 0.023), ('discovery', 0.023), ('quantizing', 0.023), ('mouth', 0.023), ('detector', 0.023), ('pick', 0.023), ('trains', 0.023), ('execution', 0.023), ('objects', 0.022), ('train', 0.022), ('accomplish', 0.022), ('foreground', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
Author: Tsz-Ho Yu, Tae-Kyun Kim, Roberto Cipolla
Abstract: This work addresses the challenging problem of unconstrained 3D human pose estimation (HPE) from a novel perspective. Existing approaches struggle to operate in realistic applications, mainly due to their scene-dependent priors, such as background segmentation and multi-camera network, which restrict their use in unconstrained environments. We therefore present a framework which applies action detection and 2D pose estimation techniques to infer 3D poses in an unconstrained video. Action detection offers spatiotemporal priors to 3D human pose estimation by both recognising and localising actions in space-time. Instead of holistic features, e.g. silhouettes, we leverage the flexibility of deformable part model to detect 2D body parts as a feature to estimate 3D poses. A new unconstrained pose dataset has been collected to justify the feasibility of our method, which demonstrated promising results, significantly outperforming the relevant state-of-the-arts.
3 0.35271001 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What defines these spatiotemporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate state-of-the-art performance on UCF50 and Olympics datasets.
4 0.35060954 40 cvpr-2013-An Approach to Pose-Based Action Recognition
Author: Chunyu Wang, Yizhou Wang, Alan L. Yuille
Abstract: We address action recognition in videos by modeling the spatial-temporal structures of human poses. We start by improving a state of the art method for estimating human joint locations from videos. More precisely, we obtain the K-best estimations output by the existing method and incorporate additional segmentation cues and temporal constraints to select the “best” one. Then we group the estimated joints into five body parts (e.g. the left arm) and apply data mining techniques to obtain a representation for the spatial-temporal structures of human actions. This representation captures the spatial configurations of body parts in one frame (by spatial-part-sets) as well as the body part movements (by temporal-part-sets) which are characteristic of human actions. It is interpretable, compact, and also robust to errors on joint estimations. Experimental results first show that our approach is able to localize body joints more accurately than existing methods. Next we show that it outperforms state of the art action recognizers on the UCF sport, the Keck Gesture and the MSR-Action3D datasets.
5 0.34793872 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
Author: Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos
Abstract: The problem of action recognition and human activity has been an active research area in Computer Vision and Robotics. While full-body motions can be characterized by movement and change of posture, no characterization, that holds invariance, has yet been proposed for the description of manipulation actions. We propose that a fundamental concept in understanding such actions, are the consequences of actions. There is a small set of fundamental primitive action consequences that provides a systematic high-level classification of manipulation actions. In this paper a technique is developed to recognize these action consequences. At the heart of the technique lies a novel active tracking and segmentation method that monitors the changes in appearance and topological structure of the manipulated object. These are then used in a visual semantic graph (VSG) based procedure applied to the time sequence of the monitored object to recognize the action consequence. We provide a new dataset, called Manipulation Action Consequences (MAC 1.0), which can serve as testbed for other studies on this topic. Several experiments on this dataset demonstrate that our method can robustly track objects and detect their deformations and division during the manipulation. Quantitative tests prove the effectiveness and efficiency of the method.
6 0.31146884 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
7 0.29707801 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
8 0.29157731 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
9 0.29007301 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
10 0.2533136 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
11 0.24696511 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
12 0.23712184 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
13 0.23255464 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
14 0.23087579 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
15 0.2050669 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
16 0.19255093 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
17 0.18717572 153 cvpr-2013-Expanded Parts Model for Human Attribute and Action Recognition in Still Images
18 0.18585868 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
19 0.16438946 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
20 0.16025692 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos
topicId topicWeight
[(0, 0.233), (1, -0.126), (2, -0.005), (3, -0.285), (4, -0.428), (5, -0.035), (6, -0.108), (7, 0.071), (8, -0.112), (9, -0.07), (10, 0.015), (11, -0.075), (12, -0.052), (13, 0.044), (14, -0.099), (15, -0.009), (16, 0.023), (17, -0.026), (18, 0.101), (19, 0.202), (20, 0.031), (21, -0.007), (22, -0.013), (23, 0.109), (24, 0.068), (25, -0.071), (26, 0.032), (27, 0.019), (28, 0.033), (29, 0.044), (30, 0.011), (31, 0.009), (32, -0.01), (33, 0.034), (34, -0.027), (35, -0.007), (36, 0.019), (37, -0.034), (38, -0.015), (39, -0.006), (40, -0.05), (41, 0.037), (42, -0.014), (43, -0.018), (44, 0.007), (45, 0.003), (46, 0.024), (47, -0.038), (48, 0.014), (49, -0.0)]
simIndex simValue paperId paperTitle
same-paper 1 0.97381836 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
2 0.89589924 336 cvpr-2013-Poselet Key-Framing: A Model for Human Activity Recognition
Author: Michalis Raptis, Leonid Sigal
Abstract: In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes, collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially-localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on structured SVM formulation to align our components and mine for hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an on-line streaming setting.
3 0.87171334 408 cvpr-2013-Spatiotemporal Deformable Part Models for Action Detection
Author: Yicong Tian, Rahul Sukthankar, Mubarak Shah
Abstract: Deformable part models have achieved impressive performance for object detection, even on difficult image datasets. This paper explores the generalization of deformable part models from 2D images to 3D spatiotemporal volumes to better study their effectiveness for action detection in video. Actions are treated as spatiotemporal patterns and a deformable part model is generated for each action from a collection of examples. For each action model, the most discriminative 3D subvolumes are automatically selected as parts and the spatiotemporal relations between their locations are learned. By focusing on the most distinctive parts of each action, our models adapt to intra-class variation and show robustness to clutter. Extensive experiments on several video datasets demonstrate the strength of spatiotemporal DPMs for classifying and localizing actions.
4 0.85979539 123 cvpr-2013-Detection of Manipulation Action Consequences (MAC)
Author: Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos
Abstract: The problem of action recognition and human activity has been an active research area in Computer Vision and Robotics. While full-body motions can be characterized by movement and change of posture, no characterization, that holds invariance, has yet been proposed for the description of manipulation actions. We propose that a fundamental concept in understanding such actions, are the consequences of actions. There is a small set of fundamental primitive action consequences that provides a systematic high-level classification of manipulation actions. In this paper a technique is developed to recognize these action consequences. At the heart of the technique lies a novel active tracking and segmentation method that monitors the changes in appearance and topological structure of the manipulated object. These are then used in a visual semantic graph (VSG) based procedure applied to the time sequence of the monitored object to recognize the action consequence. We provide a new dataset, called Manipulation Action Consequences (MAC 1.0), which can serve as testbed for other studies on this topic. Several experiments on this dataset demonstrate that our method can robustly track objects and detect their deformations and division during the manipulation. Quantitative tests prove the effectiveness and efficiency of the method.
5 0.76135981 291 cvpr-2013-Motionlets: Mid-level 3D Parts for Human Motion Recognition
Author: LiMin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motionlet, a mid-level and spatiotemporal part, for human motion recognition. Motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlet for action recognition: high motion saliency, multiple scale representation, and representative-discriminative ability. Towards this goal, we develop a data-driven approach to learn motionlets from training videos. First, we extract 3D regions with high motion saliency. Then we cluster these regions and preserve the centers as candidate templates for motionlet. Finally, we examine the representative and discriminative power of the candidates, and introduce a greedy method to select effective candidates. With motionlets, we present a mid-level representation for video, called motionlet activation vector. We conduct experiments on three datasets, KTH, HMDB51, and UCF50. The results show that the proposed methods significantly outperform state-of-the-art methods.
6 0.72604781 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition
7 0.71652097 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
8 0.70249772 3 cvpr-2013-3D R Transform on Spatio-temporal Interest Points for Action Recognition
9 0.69589096 98 cvpr-2013-Cross-View Action Recognition via a Continuous Virtual Path
10 0.69238794 40 cvpr-2013-An Approach to Pose-Based Action Recognition
11 0.64923614 302 cvpr-2013-Multi-task Sparse Learning with Beta Process Prior for Action Recognition
12 0.64691728 205 cvpr-2013-Hollywood 3D: Recognizing Actions in 3D Natural Scenes
13 0.64329898 444 cvpr-2013-Unconstrained Monocular 3D Human Pose Estimation by Action Detection and Cross-Modality Regression Forest
14 0.63398618 149 cvpr-2013-Evaluation of Color STIPs for Human Action Recognition
15 0.62438411 459 cvpr-2013-Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
16 0.53736264 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities
17 0.53116614 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
18 0.52746546 32 cvpr-2013-Action Recognition by Hierarchical Sequence Summarization
19 0.51035547 196 cvpr-2013-HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences
20 0.45975351 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition
topicId topicWeight
[(10, 0.097), (16, 0.034), (26, 0.031), (28, 0.012), (33, 0.319), (67, 0.074), (69, 0.075), (76, 0.018), (87, 0.07), (92, 0.18)]
simIndex simValue paperId paperTitle
1 0.94025809 115 cvpr-2013-Depth Super Resolution by Rigid Body Self-Similarity in 3D
Author: unknown-author
Abstract: We tackle the problem of jointly increasing the spatial resolution and apparent measurement accuracy of an input low-resolution, noisy, and perhaps heavily quantized depth map. In stark contrast to earlier work, we make no use of ancillary data like a color image at the target resolution, multiple aligned depth maps, or a database of high-resolution depth exemplars. Instead, we proceed by identifying and merging patch correspondences within the input depth map itself, exploiting patchwise scene self-similarity across depth such as repetition of geometric primitives or object symmetry. While the notion of ‘single-image’ super resolution has successfully been applied in the context of color and intensity images, we are to our knowledge the first to present a tailored analogue for depth images. Rather than reason in terms of patches of 2D pixels as others have before us, our key contribution is to proceed by reasoning in terms of patches of 3D points, with matched patch pairs related by a respective 6 DoF rigid body motion in 3D. In support of obtaining a dense correspondence field in reasonable time, we introduce a new 3D variant of PatchMatch. A third contribution is a simple, yet effective patch upscaling and merging technique, which predicts sharp object boundaries at the target resolution. We show that our results are highly competitive with those of alternative techniques leveraging even a color image at the target resolution or a database of high-resolution depth exemplars.
2 0.90996933 393 cvpr-2013-Separating Signal from Noise Using Patch Recurrence across Scales
Author: Maria Zontak, Inbar Mosseri, Michal Irani
Abstract: Recurrence of small clean image patches across different scales of a natural image has been successfully used for solving ill-posed problems in clean images (e.g., super-resolution from a single image). In this paper we show how this multi-scale property can be extended to solve ill-posed problems under noisy conditions, such as image denoising. While clean patches are obscured by severe noise in the original scale of a noisy image, noise levels drop dramatically at coarser image scales. This allows for the unknown hidden clean patches to “naturally emerge” in some coarser scale of the noisy image. We further show that patch recurrence across scales is strengthened when using directional pyramids (that blur and subsample only in one direction). Our statistical experiments show that for almost any noisy image patch (more than 99%), there exists a “good” clean version of itself at the same relative image coordinates in some coarser scale of the image. This is a strong phenomenon of noise-contaminated natural images, which can serve as a strong prior for separating the signal from the noise. Finally, incorporating this multi-scale prior into a simple denoising algorithm yields state-of-the-art denoising results.
same-paper 3 0.90635294 287 cvpr-2013-Modeling Actions through State Changes
Author: Alireza Fathi, James M. Rehg
Abstract: In this paper we present a model of action based on the change in the state of the environment. Many actions involve similar dynamics and hand-object relationships, but differ in their purpose and meaning. The key to differentiating these actions is the ability to identify how they change the state of objects and materials in the environment. We propose a weakly supervised method for learning the object and material states that are necessary for recognizing daily actions. Once these state detectors are learned, we can apply them to input videos and pool their outputs to detect actions. We further demonstrate that our method can be used to segment discrete actions from a continuous video of an activity. Our results outperform state-of-the-art action recognition and activity segmentation results.
4 0.90151018 29 cvpr-2013-A Video Representation Using Temporal Superpixels
Author: Jason Chang, Donglai Wei, John W. Fisher_III
Abstract: We develop a generative probabilistic model for temporally consistent superpixels in video sequences. In contrast to supervoxel methods, object parts in different frames are tracked by the same temporal superpixel. We explicitly model flow between frames with a bilateral Gaussian process and use this information to propagate superpixels in an online fashion. We consider four novel metrics to quantify performance of a temporal superpixel representation and demonstrate superior performance when compared to supervoxel methods.
5 0.89348996 355 cvpr-2013-Representing Videos Using Mid-level Discriminative Patches
Author: Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
Abstract: How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these spatio-temporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification where they demonstrate state-of-the-art performance on UCF50 and Olympics datasets.
6 0.88551879 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases
7 0.88466281 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video
8 0.88282168 70 cvpr-2013-Bottom-Up Segmentation for Top-Down Detection
9 0.88255876 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision
10 0.88250524 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection
11 0.88242006 256 cvpr-2013-Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning
12 0.88223821 318 cvpr-2013-Optimized Pedestrian Detection for Multiple and Occluded People
13 0.88222879 36 cvpr-2013-Adding Unlabeled Samples to Categories by Learned Attributes
14 0.88220179 82 cvpr-2013-Class Generative Models Based on Feature Regression for Pose Estimation of Object Categories
15 0.88197464 132 cvpr-2013-Discriminative Re-ranking of Diverse Segmentations
16 0.88192898 372 cvpr-2013-SLAM++: Simultaneous Localisation and Mapping at the Level of Objects
17 0.88145196 329 cvpr-2013-Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
18 0.88088191 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
19 0.88075316 202 cvpr-2013-Hierarchical Saliency Detection
20 0.88072139 242 cvpr-2013-Label Propagation from ImageNet to 3D Point Clouds