iccv iccv2013 iccv2013-37 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.
Reference: text
sentIndex sentText sentNum sentScore
1 The first level comprises the root space-time segments that may contain a human body. [sent-7, score-0.868]
2 The second level comprises multi-grained space-time segments that contain parts of the root. [sent-8, score-0.661]
3 Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results. [sent-10, score-1.695]
4 Introduction Human action recognition is an important topic of interest, due to its wide-ranging applications in automatic video analysis, video retrieval, and more. [sent-12, score-0.598]
5 We argue that both non-static and relevant static parts in the video are important for action recognition and localization. [sent-16, score-0.68]
6 For example, for the golf swing action, instead of just relying on the regions that cover the hands and arms, which have significant motion, the overall body pose can also provide important information that may be exploited to better discriminate this action from others. [sent-18, score-0.614]
7 Extracted segments from example video frames of the UCF Sports dataset. [sent-20, score-0.657]
8 In this paper, we propose a representation that we call hierarchical space-time segments for both action recognition and localization. [sent-26, score-1.135]
9 In this representation, the space-time segments of videos are organized in a two-level hierarchy. [sent-27, score-0.525]
10 The first level comprises the root space-time segments that may contain the whole human body. [sent-28, score-0.9]
11 The second level comprises space-time segments that contain parts of the root. [sent-29, score-0.661]
12 We present an algorithm to extract hierarchical spacetime segments from videos. [sent-30, score-0.781]
13 Fig. 1 shows some example video frames and extracted hierarchical segments from the UCF-Sports video dataset [19], and more examples are shown in Fig. [sent-33, score-0.886]
14 These segments are then tracked in time to produce space-time segments as shown in Fig. [sent-38, score-1.089]
15 We first apply hierarchical segmentation on each video frame to get a set of segment trees, each of which is considered as a candidate segment tree of the human body. [sent-41, score-0.957]
16 Red boxes show a segment tree on a frame, and the space-time segments are produced by tracking these segments. [sent-45, score-0.844]
17 Finally, we track each segment of the remaining segment trees in time both forward and backward. [sent-47, score-0.545]
18 These space-time segments are subsequently grouped into tracks according to their space-time overlap. [sent-49, score-0.565]
19 We then utilize these space-time segments in computing a bag-of-words representation. [sent-50, score-0.525]
20 Our hierarchical segmentation-based representation preserves hierarchical relationships naturally during extraction and, by following the temporal continuity of these space-time segments, the temporal relationships are also preserved. [sent-53, score-0.703]
21 We show in experiments that by using both parts and root spacetime segments together, better recognition is achieved. [sent-54, score-0.879]
22 Leveraging temporal relationships among root space-time segments, we can also localize the whole track of the action by identifying a sparse set of space-time segments. [sent-55, score-0.821]
23 A new hierarchical space-time segments representation designed for both action recognition and localization that incorporates a multi-grained representation of the parts and the whole body in a hierarchical way. [sent-57, score-1.693]
24 An algorithm to extract the proposed hierarchical space-time segments that preserves both static and non-static relevant space-time segments as well as their hierarchical and temporal relationships. [sent-59, score-1.565]
25 Using just a simple linear SVM on the bag of hierarchical space-time segments representation, better than or comparable to state-of-the-art action recognition performance is achieved without using human bounding box annotations. [sent-63, score-1.204]
26 At the same time, as the results demonstrate, our proposed representation produces good action localization results. [sent-64, score-0.572]
27 Related Work In recent years, action recognition methods that use bag of space-time interest points [13] or dense trajectories [22] have performed well on many benchmark datasets. [sent-66, score-0.598]
28 Many attempts have been made to explore such relationships for action recognition, which usually resort to higher order statistics of the already extracted STIPs or dense trajectories, such as pairs [24, 11], groups [7], point clouds [3], or clusters [18, 6]. [sent-68, score-0.545]
29 In contrast, the extraction of both space-time segments and their hierarchical and temporal relationships are integrated in our approach. [sent-69, score-0.844]
30 Action localization is usually done in the action detection setting [20, 21] and relatively few works do both action recognition and localization. [sent-70, score-0.972]
31 Action recognition methods that use holistic representations of the human figure may have the potential to localize the action performer, such as motion history images [2], space-time shape models [8] and human silhouettes [23]. [sent-71, score-0.657]
32 In [12] the bag of STIP approach was extended beyond action recognition to localization using latent SVM. [sent-75, score-0.624]
33 In this paper, we show that by using hierarchical space-time segments we can do action localization within the bag-ofwords framework. [sent-76, score-1.211]
34 The method in [4] applies a general video segmentation method to produce video sub-volumes for action recognition. [sent-80, score-0.597]
35 However, for action recognition and localization, general video segmentation methods may produce many more irrelevant space-time segments than our method, which exploits human-action-related cues to effectively prune irrelevant ones. [sent-81, score-1.701]
36 Some work proposed object-centric video segmentation [14, 16], but these methods do not extract space-time segments of parts. [sent-82, score-0.639]
37 Video Frame Hierarchical Segmentation For human action recognition, segments in a video frame that contain motion are useful as they may belong to moving body parts. [sent-87, score-1.403]
38 However, some static segments may belong to the static body parts, and thus may be useful for the pose information they contain. [sent-88, score-0.904]
39 Moreover, for localizing the action performer, both static and non-static segments of the human body are needed. [sent-89, score-1.241]
40 Based on this observation, we design our video frame segmentation method to preserve segments of the whole body and the parts while suppressing the background. [sent-90, score-0.938]
41 The idea is to use both color and motion information to reduce boundaries within the background and rigid objects and to strengthen internal motion boundaries of the human body resulting from different motions of body parts. [sent-91, score-0.51]
42 Then, a subsequent hierarchical segmentation may further reduce irrelevant segments of the background and rigid objects while retaining dense multi-grained segments on the human body. [sent-92, score-1.412]
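As an illustration of this idea, below is a minimal sketch (not the paper's exact formulation) of how a per-frame color boundary map could be modulated by motion boundaries so that boundaries unsupported by motion are attenuated; the function name combine_boundaries, the gradient-based motion boundary, and the attenuation factor alpha are assumptions for illustration.

```python
import numpy as np

def combine_boundaries(color_boundary, flow, alpha=0.2):
    """color_boundary: HxW map in [0, 1]; flow: HxWx2 dense optical flow."""
    # Motion boundary strength: magnitude of the spatial gradients of the flow field.
    du = np.gradient(flow[..., 0])   # gradients of horizontal flow: [d/dy, d/dx]
    dv = np.gradient(flow[..., 1])   # gradients of vertical flow:   [d/dy, d/dx]
    motion_boundary = np.sqrt(du[0] ** 2 + du[1] ** 2 + dv[0] ** 2 + dv[1] ** 2)
    motion_boundary /= motion_boundary.max() + 1e-8
    # Color boundaries supported by motion keep (near) full strength; boundaries
    # inside the background or rigid objects, where the flow is smooth, are attenuated.
    return color_boundary * (alpha + (1.0 - alpha) * motion_boundary)
```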
43 The UCM represents a hierarchical segmentation of a video frame [1], in which the root is the whole video frame. [sent-95, score-0.593]
44 We traverse this segment tree to remove redundant segments as well as segments that are too large or too small and unlikely to be a human body or body parts. [sent-96, score-1.71]
45 Specifically, at each segment, if its size is larger than 2/3 of its parent's size, or falls outside some size thresholds (parameters of the system), its children are re-attached to its parent segment and the segment itself is removed. [sent-97, score-0.873]
46 We then remove the root of the segment tree and get a set of segment trees Tt (t is the index of the frame). [sent-98, score-0.711]
47 Each Tjt ∈ Tt is considered as a candidate segment tree of a human body, and we denote Tjt = {sitj}, where each sitj is a segment and s0tj is the root segment. [sent-99, score-1.131]
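The size-based traversal and pruning of the per-frame segment tree described above can be sketched as follows; the dict-based tree layout, the size thresholds, and the 2/3 ratio used here are illustrative parameters rather than values taken from the paper.

```python
# A minimal sketch of the size-based pruning step: a segment is removed if it is
# nearly as large as its parent (i.e., redundant) or falls outside the size
# thresholds; its children are re-attached to its parent.

def prune_segment_tree(node, min_size=200, max_size=50000, ratio=2.0 / 3.0, parent=None):
    """node: dict with keys 'size' (pixel count) and 'children' (list of nodes)."""
    # Process children first so removals propagate upward correctly.
    for child in list(node["children"]):
        prune_segment_tree(child, min_size, max_size, ratio, parent=node)

    if parent is None:
        return node  # keep the root of this candidate tree

    too_similar = node["size"] > ratio * parent["size"]
    bad_size = node["size"] < min_size or node["size"] > max_size
    if too_similar or bad_size:
        parent["children"].remove(node)
        parent["children"].extend(node["children"])  # re-attach children to grandparent
    return parent


# Toy example: a root segment with one redundant child and one too-small child.
root = {"size": 1000, "children": [
    {"size": 900, "children": [{"size": 300, "children": []}]},  # redundant (> 2/3 of parent)
    {"size": 50, "children": []},                                 # too small
]}
prune_segment_tree(root, min_size=100, max_size=5000)
print([c["size"] for c in root["children"]])  # -> [300]
```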
48 Pruning Candidate Segment Trees We want to extract both static and non-static relevant segments, so the pruning should preserve segments that are static but relevant. [sent-104, score-0.812]
49 We achieve this by exploring the hierarchical relationships among the segments so that the decision to prune a segment is not made using only the local information contained in the segment itself, but the information of all segments of the same candidate segment tree. [sent-105, score-2.089]
50 Especially in the golf action, there are only slight motions at the hands of the human bodies, but we can still correctly extract the whole human body and the static body parts. [sent-114, score-0.611]
51 We explore multiple action related cues to prune the candidate segment trees, described in the order of our pipeline (Fig. [sent-115, score-0.749]
52 Motion cue: For each segment sitj ∈ Tjt, we compute the average motion magnitude within sitj. [sent-139, score-0.569]
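A minimal sketch of this motion cue, assuming a dense optical flow field and a per-pixel segment label map are available for the frame:

```python
import numpy as np

def segment_motion_cue(flow, labels):
    """flow: HxWx2 array of (dx, dy); labels: HxW integer segment ids.
    Returns the average flow magnitude inside every segment."""
    magnitude = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    cues = {}
    for seg_id in np.unique(labels):
        mask = labels == seg_id
        cues[seg_id] = float(magnitude[mask].mean())
    return cues

# Toy usage: a 4x4 frame with two segments; segment 1 moves, segment 0 is static.
flow = np.zeros((4, 4, 2))
flow[:, 2:, 0] = 3.0                 # right half moves 3 px horizontally
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1
print(segment_motion_cue(flow, labels))   # {0: 0.0, 1: 3.0}
```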
53 Second, segments of the foreground human body are more consistently present than segments caused by artificial edges and erroneous segmentation, and we account for this as a global color cue over the whole video sequence. [sent-147, score-1.491]
54 For the root segment s0tj we define its height h0tj to be 1, and for a non-root sitj its height hitj is the number of edges on the path from the root to it plus one. [sent-164, score-0.786]
55 2, colors of more frequently appearing segments and segments with greater heights will receive more votes. [sent-167, score-1.05]
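A hedged sketch of this global color cue: each segment votes for its quantized colors with a weight that grows with its height, and colors of segments that reappear across frames accumulate more votes; the RGB quantization and the exact vote weight are assumptions, not the paper's precise formulation.

```python
import numpy as np

def accumulate_color_votes(video_segments, bins=8):
    """video_segments: per frame, a list of (pixels, height) pairs, where pixels
    is an Nx3 uint8 RGB array of one segment and height is its tree height (root = 1)."""
    hist = np.zeros((bins, bins, bins))
    for frame_segments in video_segments:
        for pixels, height in frame_segments:
            idx = (pixels.astype(int) * bins) // 256                  # quantize colors
            np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), height)  # height-weighted votes
    # Normalized vote histogram over the whole video; higher mass suggests
    # colors that belong to consistently present (foreground) segments.
    return hist / max(hist.sum(), 1e-8)
```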
56 For each segment sitj in a candidate segment tree Tjt ∈ Tt, we set its probability as the ...
57 3 that by using the foreground map, we can effectively prune the background segments in the example frames. [sent-175, score-0.655]
58 In this respect, our space-time segments have a relatively short extent, with a maximum temporal length of 15. [sent-230, score-0.631]
59 Although segment tree Tjt may have deep structures with height larger than 3, these structures may not persistently occur in other video frames due to the change in human motion and the errors in segmentation. [sent-231, score-0.622]
60 For robustness and simplicity, we keep only two levels in the resultant hierarchical structure of space-time segments: the root level is the space-time segment obtained by tracking the root segment s0tj of Tjt, and the part level comprises the space-time segments obtained by tracking the non-root segments sitj, ∀i ≠ 0. [sent-232, score-2.115]
61 Since space-time segments are constructed using the segment trees of every video frame, many of them may temporally overlap and contain the same object, e.g. [sent-233, score-0.998]
62 spacetime segments produced by tracking the segments that contain the same human body but are from two consecutive frames. [sent-235, score-1.421]
63 Specifically, two root space-time segments that have significant spatial overlap on any frame are grouped into the same track. [sent-238, score-0.772]
64 The part space-time segments are subsequently grouped into the same track as their roots. [sent-241, score-0.582]
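The grouping of root space-time segments into tracks can be sketched with a simple union-find over pairwise overlaps, assuming each root space-time segment is stored as a mapping from frame index to bounding box; the 0.5 IoU threshold is a placeholder, not a value from the paper.

```python
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter + 1e-8)

def group_into_tracks(segments, thresh=0.5):
    """segments: list of dicts {frame_idx: box}; two root space-time segments are
    joined if their boxes overlap significantly on any common frame."""
    parent = list(range(len(segments)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            common = set(segments[i]) & set(segments[j])
            if any(iou(segments[i][t], segments[j][t]) > thresh for t in common):
                parent[find(i)] = find(j)   # union the two groups
    tracks = {}
    for i in range(len(segments)):
        tracks.setdefault(find(i), []).append(i)
    return list(tracks.values())

# Toy example: segments 0 and 1 overlap on frame 3; segment 2 is separate.
segs = [{3: (0, 0, 10, 10), 4: (1, 1, 11, 11)},
        {3: (1, 1, 10, 10)},
        {7: (50, 50, 60, 60)}]
print(group_into_tracks(segs))   # [[0, 1], [2]]
```

Part space-time segments would simply inherit the track id of their root, and tracks of short temporal extent can then be discarded as described below.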
65 Note that now we have temporal relationships among root spacetime segments of the same track. [sent-242, score-0.967]
66 As described before, many root spacetime segments of a track may temporally overlap on some frames. [sent-244, score-0.933]
67 On each of those frames, every overlapping root space-time segment will provide a candidate bounding box. [sent-245, score-0.515]
68 As we assume the foreground human body is relatively consistently present in the video, we further prune irrelevant space-time segments by removing tracks of short temporal extent. [sent-248, score-1.037]
69 Action Recognition and Localization To better illustrate the effectiveness of hierarchical space-time segments for action recognition and localization, we use simple learning techniques to train action classifiers and the learned models are then used for action localization. [sent-251, score-1.903]
70 Note that we do not limit the space-time segments to have the same length, as the method in [22] did for dense trajectories. [sent-253, score-0.603]
71 We do not use spacetime segments that are too short (length < 6) as they may not be discriminative enough. [sent-255, score-0.661]
72 We also split long space-time segments (length > 12) into two to produce shorter ones while keeping the original one. [sent-256, score-0.525]
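A minimal sketch of this length filtering and splitting step, assuming each space-time segment is stored as a per-frame list of boxes or descriptors:

```python
def filter_and_split(st_segments, min_len=6, split_len=12):
    """Drop space-time segments shorter than min_len frames; split those longer
    than split_len into two halves while also keeping the original."""
    out = []
    for seg in st_segments:
        if len(seg) < min_len:
            continue                              # too short to be discriminative
        out.append(seg)
        if len(seg) > split_len:
            mid = len(seg) // 2
            out.extend([seg[:mid], seg[mid:]])    # keep original plus two halves
    return out

print([len(s) for s in filter_and_split([[0] * 4, [0] * 8, [0] * 14])])  # [8, 14, 7, 7]
```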
73 We build separate codebooks for root and part spacetime segments using k-means clustering. [sent-258, score-0.794]
74 Subsequently each test video is encoded in the BoW (bag of words) representation using max pooling over the similarity values between its space-time segments and the code words, where the similarity is measured by histogram intersection. [sent-259, score-0.681]
75 We train one-vs-all linear SVMs on the training videos' BoW representations for multiclass action classification, and the action label of a test video is the class y ∈ Y with the maximum SVM score. [sent-260, score-0.915]
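A hedged sketch of the encoding and classification steps just described: max pooling of histogram-intersection similarities against a codebook, followed by a one-vs-all linear SVM argmax. Descriptor extraction, k-means codebook learning, and SVM training are assumed to happen upstream, and the score form w_y^T x + b_y is the standard linear SVM score rather than the paper's exact expression; root and part segments have separate codebooks, whose BoW vectors would be encoded the same way and scored with their own weight vectors.

```python
import numpy as np

def histogram_intersection(codebook, descriptor):
    """Similarity of one descriptor (D,) to every code word in a (K, D) codebook."""
    return np.minimum(codebook, descriptor).sum(axis=-1)

def encode_bow(descriptors, codebook):
    """Max pooling: for each code word, keep the best similarity over all
    space-time segment descriptors of the video. Returns a (K,) BoW vector."""
    sims = np.stack([histogram_intersection(codebook, d) for d in descriptors])
    return sims.max(axis=0)

def predict_action(bow, weights, biases):
    """weights: (C, K) one-vs-all linear SVM weights; biases: (C,)."""
    scores = weights @ bow + biases         # per-class linear SVM scores
    return int(np.argmax(scores))           # y = arg max_y  w_y^T x + b_y
```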
76 For action localization, in a test video we find the space-time segments that make a positive contribution to the classification score of the video and output the tracks that contain them. [sent-263, score-1.207]
77 We then output all the tracks that have at least one space-time segment in the set U as action localization results. [sent-270, score-0.791]
78 In this way, although space-time segments in U may only cover a sparse set of frames, our algorithm is able to output denser localization results. [sent-271, score-0.718]
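A hedged sketch of this localization step, in which a segment is taken to contribute to the classification score through the code word it max-pools into; the paper's exact contribution accounting may differ.

```python
import numpy as np

def localize(descriptors, seg_to_track, codebook, w_pred):
    """descriptors: (N, D) segment histograms; seg_to_track: length-N track ids;
    codebook: (K, D) code words; w_pred: (K,) weights of the predicted class."""
    selected_tracks = set()
    for i, d in enumerate(descriptors):
        sims = np.minimum(codebook, d).sum(axis=-1)   # histogram intersection
        word = int(np.argmax(sims))                   # code word this segment pools into
        if w_pred[word] > 0:                          # positive contribution to the score
            selected_tracks.add(seg_to_track[i])      # keep the whole containing track
    return selected_tracks
```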
79 Essentially these results benefit from the temporal relationships (before, after) among the root space-time segments in the same track. [sent-272, score-0.857]
80 Experiments We conducted experiments on the UCF-Sports [19] and High Five [17] datasets to evaluate the proposed hierarchical space-time segments representation. [sent-283, score-0.671]
81 The parameters for extracting hierarchical space-time segments are empirically chosen without extensive tuning and are mostly kept the same for both datasets. [sent-288, score-0.729]
82 We build a codebook of 2000 words for root space-time segments and 4000 words for parts. [sent-294, score-0.635]
83 We build a codebook of 1800 words for root space-time segments and 3600 words for parts. [sent-301, score-0.684]
84 Note that, [20] and [21] need bounding boxes in training and their models are only for binary action detection, so their results are not directly comparable to ours. [sent-318, score-0.512]
85 The method in [12] uses a figure-centric visual word representation in a latent SVM formulation for both action localization and recognition. [sent-329, score-0.572]
86 Although [18] achieves comparable classification performance with ours, it cannot provide meaningful action localization results. [sent-335, score-0.54]
87 None of the compared methods perform action localization as our method does. [sent-342, score-0.54]
88 To assess the benefit of extracting the relevant static space-time regions that are contained in the root space-time segments, we compare with a baseline that only uses space-time segments of parts. [sent-343, score-0.855]
89 0% on High Five) if space-time segments of roots are not used. [sent-346, score-0.555]
90 This supports our hypothesis that pose information captured by root space-time segments is useful for action recognition. [sent-347, score-1.084]
91 Action localization: Table 3 and Table 4 show the action localization results on the UCF-Sports and HighFive datasets. [sent-348, score-0.54]
92 The methods of [20] and [21] only report action localization results on 3 classes (running, diving and horse riding) of UCF Sports. [sent-356, score-0.57]
93 [20] and [21] have reported action localization results only on the kiss class, which are 18. [sent-366, score-0.54]
94 Conclusion and Future Work In this work, we propose hierarchical space-time segments for action recognition that can be utilized to effectively answer what and where an action happened in realistic videos as demonstrated in our experiments. [sent-371, score-1.536]
95 Compared to previous methods such as STIPs and dense trajectories, this representation preserves both relevant static and nonstatic space-time segments as well as their hierarchical relationships, which helped us in both action recognition and localization. [sent-372, score-1.375]
96 One direction for future work is to make the method more robust to low video quality, as it may fail to extract good space-time segments when there is significant blur or jerky camera motion. [sent-373, score-0.634]
97 A particularly promising direction for future work is to apply more advanced machine learning techniques to explore the hierarchical and temporal relationships provided within this representation for even better action recognition and localization. [sent-374, score-0.783]
98 Discriminative figure-centric models for joint action localization and recognition. [sent-452, score-0.54]
99 The inclusion of one box in another indicates the parent-children relationships; the region covered by the red mask is our action localization output. [sent-463, score-0.687]
100 Discriminative hierarchical part-based models for human parsing and action recognition. [sent-549, score-0.621]
wordName wordTfidf (topN-words)
[('segments', 0.525), ('action', 0.4), ('tjt', 0.238), ('segment', 0.211), ('sitj', 0.179), ('root', 0.159), ('highfive', 0.157), ('body', 0.155), ('hierarchical', 0.146), ('localization', 0.14), ('spacetime', 0.11), ('stips', 0.099), ('ya', 0.097), ('relationships', 0.095), ('xa', 0.093), ('xyaa', 0.089), ('pruning', 0.089), ('static', 0.086), ('video', 0.083), ('ucf', 0.079), ('temporal', 0.078), ('candidate', 0.077), ('human', 0.075), ('foreground', 0.069), ('bounding', 0.068), ('trees', 0.066), ('trajectories', 0.064), ('tree', 0.064), ('prune', 0.061), ('mf', 0.06), ('frame', 0.059), ('track', 0.057), ('iou', 0.057), ('performer', 0.056), ('flow', 0.055), ('parts', 0.053), ('bag', 0.052), ('box', 0.052), ('comprises', 0.052), ('sports', 0.051), ('dense', 0.05), ('frames', 0.049), ('motion', 0.049), ('tran', 0.046), ('ckr', 0.045), ('nonstatic', 0.045), ('stij', 0.045), ('ucfsports', 0.045), ('wyp', 0.045), ('wyr', 0.045), ('boxes', 0.044), ('histogram', 0.041), ('tracks', 0.04), ('ckp', 0.04), ('spb', 0.04), ('height', 0.039), ('tracked', 0.039), ('thresholds', 0.038), ('mc', 0.038), ('optical', 0.037), ('ucm', 0.037), ('daily', 0.036), ('actions', 0.035), ('recognising', 0.035), ('irrelevant', 0.034), ('annotations', 0.034), ('parent', 0.033), ('golf', 0.033), ('five', 0.033), ('realistic', 0.033), ('preserves', 0.033), ('whole', 0.032), ('recognition', 0.032), ('representation', 0.032), ('extracting', 0.032), ('curvature', 0.031), ('contain', 0.031), ('segmentation', 0.031), ('diving', 0.03), ('roots', 0.03), ('bow', 0.03), ('overlap', 0.029), ('length', 0.028), ('cit', 0.028), ('tt', 0.027), ('denser', 0.027), ('raptis', 0.027), ('temporally', 0.027), ('contained', 0.027), ('color', 0.027), ('boundary', 0.027), ('mbh', 0.027), ('lan', 0.027), ('relevant', 0.026), ('sp', 0.026), ('grid', 0.026), ('stip', 0.026), ('pruned', 0.026), ('may', 0.026), ('kept', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.
2 0.34027028 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
Author: Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, James M. Rehg
Abstract: We propose an unsupervised video segmentation approach by simultaneously tracking multiple holistic figureground segments. Segment tracks are initialized from a pool of segment proposals generated from a figure-ground segmentation algorithm. Then, online non-local appearance models are trained incrementally for each track using a multi-output regularized least squares formulation. By using the same set of training examples for all segment tracks, a computational trick allows us to track hundreds of segment tracks efficiently, as well as perform optimal online updates in closed-form. Besides, a new composite statistical inference approach is proposed for refining the obtained segment tracks, which breaks down the initial segment proposals and recombines for better ones by utilizing highorder statistic estimates from the appearance model and enforcing temporal consistency. For evaluating the algorithm, a dataset, SegTrack v2, is collected with about 1,000 frames with pixel-level annotations. The proposed framework outperforms state-of-the-art approaches in the dataset, show- ing its efficiency and robustness to challenges in different video sequences.
3 0.30644834 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu
Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP)frameworkfor low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.
4 0.28582713 86 iccv-2013-Concurrent Action Detection with Structural Prediction
Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu
Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.
5 0.27523696 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
6 0.26240659 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
7 0.23684531 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
8 0.21829408 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
9 0.21813484 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
10 0.21760286 39 iccv-2013-Action Recognition with Improved Trajectories
11 0.21346177 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
12 0.2022908 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
13 0.20058967 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
14 0.19912186 57 iccv-2013-BOLD Features to Detect Texture-less Objects
15 0.19501379 379 iccv-2013-Semantic Segmentation without Annotating Segments
16 0.19435322 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
17 0.19082312 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
18 0.18599389 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition
19 0.16895206 160 iccv-2013-Fast Object Segmentation in Unconstrained Video
20 0.16698483 317 iccv-2013-Piecewise Rigid Scene Flow
topicId topicWeight
[(0, 0.293), (1, 0.16), (2, 0.188), (3, 0.348), (4, 0.076), (5, -0.003), (6, 0.04), (7, -0.006), (8, -0.01), (9, 0.012), (10, 0.044), (11, 0.153), (12, 0.073), (13, -0.062), (14, 0.063), (15, 0.01), (16, 0.008), (17, -0.026), (18, -0.022), (19, -0.09), (20, 0.062), (21, 0.014), (22, -0.045), (23, -0.095), (24, -0.065), (25, -0.006), (26, -0.046), (27, -0.026), (28, -0.084), (29, -0.028), (30, -0.057), (31, 0.014), (32, -0.115), (33, -0.038), (34, -0.157), (35, 0.174), (36, -0.001), (37, -0.023), (38, 0.104), (39, -0.077), (40, 0.041), (41, 0.135), (42, 0.024), (43, -0.021), (44, 0.049), (45, 0.155), (46, -0.045), (47, 0.051), (48, -0.091), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.98640043 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
Author: Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, Stan Sclaroff
Abstract: We propose Hierarchical Space-Time Segments as a new representation for action recognition and localization. This representation has a two-level hierarchy. The first level comprises the root space-time segments that may contain a human body. The second level comprises multi-grained space-time segments that contain parts of the root. We present an unsupervised method to generate this representation from video, which extracts both static and non-static relevant space-time segments, and also preserves their hierarchical and temporal relationships. Using simple linear SVM on the resultant bag of hierarchical space-time segments representation, we attain better than, or comparable to, state-of-the-art action recognition performance on two challenging benchmark datasets and at the same time produce good action localization results.
2 0.75129342 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition
Author: Shahriar Shariat, Vladimir Pavlovic
Abstract: The problem of human activity recognition is a central problem in many real-world applications. In this paper we propose a fast and effective segmental alignmentbased method that is able to classify activities and interactions in complex environments. We empirically show that such model is able to recover the alignment that leads to improved similarity measures within sequence classes and hence, raises the classification performance. We also apply a bounding technique on the histogram distances to reduce the computation of the otherwise exhaustive search.
3 0.72906333 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
Author: Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, James M. Rehg
Abstract: We propose an unsupervised video segmentation approach by simultaneously tracking multiple holistic figureground segments. Segment tracks are initialized from a pool of segment proposals generated from a figure-ground segmentation algorithm. Then, online non-local appearance models are trained incrementally for each track using a multi-output regularized least squares formulation. By using the same set of training examples for all segment tracks, a computational trick allows us to track hundreds of segment tracks efficiently, as well as perform optimal online updates in closed-form. Besides, a new composite statistical inference approach is proposed for refining the obtained segment tracks, which breaks down the initial segment proposals and recombines for better ones by utilizing highorder statistic estimates from the appearance model and enforcing temporal consistency. For evaluating the algorithm, a dataset, SegTrack v2, is collected with about 1,000 frames with pixel-level annotations. The proposed framework outperforms state-of-the-art approaches in the dataset, show- ing its efficiency and robustness to challenges in different video sequences.
4 0.72320545 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis
Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localiza- tion as additional output. This output sheds further light into detailed action understanding.
5 0.70042747 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
Author: Limin Wang, Yu Qiao, Xiaoou Tang
Abstract: This paper proposes motion atom and phrase as a midlevel temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we firstly design a discriminative clustering method to automatically discovera set ofrepresentative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representativepower. We introduce a bottom-upphrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recent published methods on both datasets.
6 0.69547278 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
7 0.68521392 86 iccv-2013-Concurrent Action Detection with Structural Prediction
8 0.67492515 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
9 0.61898214 38 iccv-2013-Action Recognition with Actons
10 0.60678929 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
11 0.60653299 57 iccv-2013-BOLD Features to Detect Texture-less Objects
12 0.60286945 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
13 0.60278267 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
14 0.59616119 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
15 0.58351213 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
16 0.58052665 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
17 0.57004112 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
18 0.56632096 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
19 0.53712237 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition
20 0.53684521 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
topicId topicWeight
[(2, 0.061), (7, 0.01), (26, 0.078), (31, 0.032), (42, 0.088), (64, 0.417), (73, 0.026), (89, 0.168), (98, 0.014)]
simIndex simValue paperId paperTitle
1 0.96526128 99 iccv-2013-Cross-View Action Recognition over Heterogeneous Feature Spaces
Author: Xinxiao Wu, Han Wang, Cuiwei Liu, Yunde Jia
Abstract: In cross-view action recognition, “what you saw” in one view is different from “what you recognize ” in another view. The data distribution even the feature space can change from one view to another due to the appearance and motion of actions drastically vary across different views. In this paper, we address the problem of transferring action models learned in one view (source view) to another different view (target view), where action instances from these two views are represented by heterogeneous features. A novel learning method, called Heterogeneous Transfer Discriminantanalysis of Canonical Correlations (HTDCC), is proposed to learn a discriminative common feature space for linking source and target views to transfer knowledge between them. Two projection matrices that respectively map data from source and target views into the common space are optimized via simultaneously minimizing the canonical correlations of inter-class samples and maximizing the intraclass canonical correlations. Our model is neither restricted to corresponding action instances in the two views nor restricted to the same type of feature, and can handle only a few or even no labeled samples available in the target view. To reduce the data distribution mismatch between the source and target views in the commonfeature space, a nonparametric criterion is included in the objective function. We additionally propose a joint weight learning method to fuse multiple source-view action classifiers for recognition in the target view. Different combination weights are assigned to different source views, with each weight presenting how contributive the corresponding source view is to the target view. The proposed method is evaluated on the IXMAS multi-view dataset and achieves promising results.
2 0.95193064 298 iccv-2013-Online Robust Non-negative Dictionary Learning for Visual Tracking
Author: Naiyan Wang, Jingdong Wang, Dit-Yan Yeung
Abstract: This paper studies the visual tracking problem in video sequences and presents a novel robust sparse tracker under the particle filter framework. In particular, we propose an online robust non-negative dictionary learning algorithm for updating the object templates so that each learned template can capture a distinctive aspect of the tracked object. Another appealing property of this approach is that it can automatically detect and reject the occlusion and cluttered background in a principled way. In addition, we propose a new particle representation formulation using the Huber loss function. The advantage is that it can yield robust estimation without using trivial templates adopted by previous sparse trackers, leading to faster computation. We also reveal the equivalence between this new formulation and the previous one which uses trivial templates. The proposed tracker is empirically compared with state-of-the-art trackers on some challenging video sequences. Both quantitative and qualitative comparisons show that our proposed tracker is superior and more stable.
3 0.94005257 88 iccv-2013-Constant Time Weighted Median Filtering for Stereo Matching and Beyond
Author: Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, Enhua Wu
Abstract: Despite the continuous advances in local stereo matching for years, most efforts are on developing robust cost computation and aggregation methods. Little attention has been seriously paid to the disparity refinement. In this work, we study weighted median filtering for disparity refinement. We discover that with this refinement, even the simple box filter aggregation achieves comparable accuracy with various sophisticated aggregation methods (with the same refinement). This is due to the nice weighted median filtering properties of removing outlier error while respecting edges/structures. This reveals that the previously overlooked refinement can be at least as crucial as aggregation. We also develop the first constant time algorithmfor the previously time-consuming weighted median filter. This makes the simple combination “box aggregation + weighted median ” an attractive solution in practice for both speed and accuracy. As a byproduct, the fast weighted median filtering unleashes its potential in other applications that were hampered by high complexities. We show its superiority in various applications such as depth upsampling, clip-art JPEG artifact removal, and image stylization.
4 0.92679101 242 iccv-2013-Learning People Detectors for Tracking in Crowded Scenes
Author: Siyu Tang, Mykhaylo Andriluka, Anton Milan, Konrad Schindler, Stefan Roth, Bernt Schiele
Abstract: People tracking in crowded real-world scenes is challenging due to frequent and long-term occlusions. Recent tracking methods obtain the image evidence from object (people) detectors, but typically use off-the-shelf detectors and treat them as black box components. In this paper we argue that for best performance one should explicitly train people detectors on failure cases of the overall tracker instead. To that end, we first propose a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes. To explicitly address remaining failure modes of the tracker we explore two methods. First, we analyze typical failures of trackers and train a detector explicitly on these cases. And second, we train the detector with the people tracker in the loop, focusing on the most common tracker failures. We show that our joint multi-person detector significantly improves both de- tection accuracy as well as tracker performance, improving the state-of-the-art on standard benchmarks.
5 0.90195894 166 iccv-2013-Finding Actors and Actions in Movies
Author: P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic
Abstract: We address the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. Specifically, we extract actor/action pairs from the script and use them as constraints in a discriminative clustering framework. The corresponding optimization problem is formulated as a quadratic program under linear constraints. People in video are represented by automatically extracted and tracked faces together with corresponding motion features. First, we apply the proposed framework to the task of learning names of characters in the movie and demonstrate significant improvements over previous methods used for this task. Second, we explore the joint actor/action constraint and show its advantage for weakly supervised action learning. We validate our method in the challenging setting of localizing and recognizing characters and their actions in feature length movies Casablanca and American Beauty.
same-paper 6 0.89542019 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
7 0.87439263 441 iccv-2013-Video Motion for Every Visible Point
8 0.86033726 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation
9 0.81849962 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
10 0.80453485 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation
11 0.7832253 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
12 0.76422679 86 iccv-2013-Concurrent Action Detection with Structural Prediction
13 0.7526927 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
14 0.75223732 424 iccv-2013-Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines
15 0.73425704 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
16 0.72812927 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
17 0.71277511 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
18 0.68593675 338 iccv-2013-Randomized Ensemble Tracking
19 0.67876989 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
20 0.6783849 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition