iccv iccv2013 iccv2013-41 knowledge-graph by maker-knowledge-mining

41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos


Source: pdf

Author: Sunil Bandla, Kristen Grauman

Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. [sent-3, score-0.759]

2 We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. [sent-4, score-0.551]

3 Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. [sent-5, score-0.773]

4 To deal with this problem, we propose a detection-based active learner to train action category models. [sent-6, score-0.713]

5 We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. [sent-7, score-0.747]

6 On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data. [sent-8, score-0.903]

7 Introduction. Locating where an action occurs in a video is known as action localization or action detection, while categorizing an action is called action recognition. [sent-10, score-2.274]

8 Researchers have made much progress in recent years to deal with these issues, most notably with learning-based methods that discover discriminative patterns to distinguish each action of interest, e. [sent-12, score-0.453]

9 Unfortunately, data collection is particularly intensive for activity recognition, since annotators must not only identify what actions are present, but also specify the time interval (and possibly spatial bounding box) where that action occurs. [sent-16, score-0.755]

10 Yet, the amount of manual supervision required to annotate large datasets with many action categories can be daunting and expensive. [sent-18, score-0.484]

11 While active learning is a natural way to try to reduce annotator effort, its success has been concentrated in other recognition settings; Figure 1 illustrates the challenge posed by untrimmed video. [sent-19, score-0.389]

12 Left: If we somehow had unlabeled videos that were trimmed to the action instances of interest (just not labeled by their category), then useful instances would be relatively apparent, and one could apply traditional active learning methods directly. [sent-20, score-1.363]

13 Right: In reality, though, an unlabeled video is untrimmed: it may contain multiple action classes, and it need not be temporally cropped to where actions of interest occur. [sent-21, score-1.043]

14 Our approach takes untrimmed and unlabeled videos as input and repeatedly selects the most useful one for annotation, as determined by its current action detector. [sent-23, score-1.249]

15 In contrast, standard active classifier training does not translate easily to learning actions in video. [sent-29, score-0.543]

16 Before passing through the hands of an annotator, a typical video will not be trimmed in the temporal dimension to focus on a single action (let alone cropped in the spatial dimensions). [sent-31, score-0.654]

17 Hence, applying active learning to video clips is non-trivial. [sent-32, score-0.467]

18 In fact, most prior active learning efforts for video have focused on problems other than activity recognition—namely, segmenting tracked objects [27, 5, 24] or identifying people [29] in video frames. [sent-33, score-0.557]

19 Most active learning algorithms assume that data points in the unlabeled pool have a single label, and that the feature descriptor for an unlabeled point reflects that instance alone. [sent-34, score-0.9]

20 Yet in real unlabeled video, there may be multiple simultaneous actions as well as extraneous frames belonging to no action category of interest. [sent-35, score-0.888]

21 As a result, untrimmed videos cannot be used directly to estimate standard active selection criteria. [sent-36, score-0.848]

22 For example, an uncertainty-based sampling strategy using a classifier is insufficient; even if the action of interest is likely present in some untrimmed clip, the classifier may not realize it if applied to all the features pooled together. [sent-37, score-0.92]

23 Our goal is to perform active learning of actions in untrimmed videos. [sent-39, score-0.916]

24 To achieve this, we introduce a technique to measure the information content of an untrimmed video, rank all unlabeled videos based on these scores, and request annotations on the most valuable video. [sent-40, score-0.905]

25 In order to predict the informativeness of an untrimmed video, we first use a Houghbased action detector to estimate the spatio-temporal extents in which an action of interest could occur. [sent-42, score-1.535]

26 Next, we forecast how each of these predicted action intervals would influence the action detector, were we to get it labeled by a human. [sent-44, score-1.058]

27 To this end, we develop a novel uncertainty metric computed based on entropy in the 3D vote space of a video. [sent-45, score-0.413]

28 Importantly, rather than simply look for the single untrimmed video that has the highest uncertainty, we estimate how much each candidate video will reduce the total uncertainty across all videos. [sent-46, score-0.772]

29 That is, the best video to get labeled is the one that, once used to augment the Hough detector, will more confidently localize actions in all unlabeled videos. [sent-47, score-0.678]

30 We evaluate our method on three action localization datasets. [sent-50, score-0.455]

31 The results demonstrate that accounting for the untrimmed nature of unlabeled video data is critical; in fact, we find directly applying a standard active learning criterion can even underperform the passive learning strategy. [sent-52, score-1.265]

32 While most work focuses on the action recognition task alone, some recent work tackles action detection, which further entails the localization problem. [sent-59, score-0.918]

33 Compared to both of the above, voting-based methods for action detection [12, 28, 30] have the appeal of circumventing person tracking (a hard problem of its own), allowing detection to succeed even with partial local evidence, and naturally supporting incremental updates to the training set. [sent-62, score-0.476]

34 Our active learning method capitalizes on these advantages, and we adapt ideas from [28, 30] to build a Hough detector (Sec. [sent-63, score-0.498]

35 While we tailor the details to best suit our goals, the novelty of our contribution is the active learning idea for untrimmed video, not the action detector it employs. [sent-66, score-1.324]

36 Active learning has been explored for object recognition with images [14, 22, 20, 25, 8, 23] and to facilitate annotation of objects in video [29, 27, 5, 24] or actions in trimmed clips [25]. [sent-67, score-0.522]

37 Unlike any of these methods, we propose to actively learn an action detector. [sent-68, score-0.473]

38 As discussed above, untrimmed video makes direct application of existing active learning methods problematic, both conceptually and empirically, as we will see in the results. [sent-72, score-0.842]

39 Our formulation is distinct in that it handles active learning of actions in video, and it includes a novel entropy metric computed in a space-time vote space. [sent-77, score-0.817]

40 Aside from active learning, researchers have also pursued interactive techniques to minimize manual effort in video annotation. [sent-78, score-0.478]

41 Approach. Given a small pool of labeled videos, our method initializes a voting-based action detector for the class of interest (e. [sent-87, score-0.701]

42 Then we survey all remaining unlabeled videos, for which both the action labels and intervals are unknown, and identify the video whose annotation (if requested) is likely to most improve the current detector. [sent-90, score-1.054]
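
To make this select-annotate-retrain cycle concrete, here is a minimal sketch of the outer loop. It is not the authors' code: `train`, `score_candidate`, and `annotate` are hypothetical callables standing in for detector training, the uncertainty-reduction score described below, and the human annotation step.

```python
def active_learning_loop(labeled, unlabeled, train, score_candidate, annotate, rounds=10):
    """Sketch of detection-based active learning over untrimmed videos."""
    detector = train(labeled)                      # initialize from the small labeled pool
    for _ in range(rounds):
        if not unlabeled:
            break
        # Pick the untrimmed video whose (predicted) annotation is expected to
        # most reduce uncertainty over the whole unlabeled pool.
        best = max(unlabeled, key=lambda v: score_candidate(detector, v, unlabeled))
        labeled.append((best, annotate(best)))     # human supplies label + intervals
        unlabeled.remove(best)
        detector = train(labeled)                  # retrain with the new annotation
    return detector
```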

43 To measure the value of each candidate video, we predict how it would reduce the uncertainty among all unlabeled videos. [sent-91, score-0.415]

44 We gauge uncertainty by the entropy in the vote space that results when the current detector’s training data is temporarily augmented to include the candidate’s predicted action intervals. [sent-92, score-0.881]

45 Video Annotations. The few labeled videos used to initialize the action detector are annotated with both the temporal extent(s) of the action as well as its spatial bounding box within each frame. [sent-97, score-0.757]

46 In contrast, videos in the unlabeled pool are not trimmed to any specific action. [sent-98, score-0.498]

47 This means they could contain multiple instances of one action and/or multiple action classes. [sent-99, score-0.883]

48 When our system requests a manual annotation for a clip, it will get back two pieces of information: 1) whether the action class of interest is present, and 2) if it is, each spatiotemporal subvolume where it appears (which may be more than one). [sent-100, score-0.601]
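
A small, hypothetical container for what such an annotation request returns might look like the following; the field names are illustrative, not taken from the paper or the VATIC tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Subvolume:
    """One spatio-temporal occurrence of the action (illustrative fields)."""
    t_start: int                                   # first frame of the interval
    t_end: int                                     # last frame of the interval
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # per-frame (x, y, w, h)

@dataclass
class Annotation:
    """What the annotator returns for one requested clip."""
    contains_action: bool                          # 1) is the class of interest present?
    subvolumes: List[Subvolume] = field(default_factory=list)  # 2) each occurrence (may be several)
```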

49 To aid in the latter, we enhanced the VATIC tool [26] to allow annotators to specify the temporal interval in which an action is located. [sent-101, score-0.557]

50 Building the Action Detector. Our active learning approach requires an action detector as a subroutine. [sent-106, score-0.924]

51 For the second step, we rank the words in both tables according to their discriminative power for the action of interest. [sent-119, score-0.525]

52 Our full action detector D trained on data T consists of the Hough tables and sorted discriminative words: D(T) = (Hhog, Hhof, Whog, Whof). [sent-141, score-0.597]
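
One plausible way to hold D(T) = (Hhog, Hhof, Whog, Whof) in code is sketched below; the container and types are assumptions that follow the notation above, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vote = Tuple[float, float, float]                  # 3D displacement (dx, dy, dt) to an action center

@dataclass
class HoughActionDetector:
    """Sketch of D(T): Hough tables plus discriminative-word rankings."""
    H_hog: Dict[int, List[Vote]] = field(default_factory=lambda: defaultdict(list))  # HOG word id -> votes
    H_hof: Dict[int, List[Vote]] = field(default_factory=lambda: defaultdict(list))  # HOF word id -> votes
    W_hog: List[int] = field(default_factory=list) # HOG words sorted by discriminative power
    W_hof: List[int] = field(default_factory=list) # HOF words sorted by discriminative power
```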

53 Thus, as active learning proceeds, certain words may gain or lose entry into the list of those that get to cast votes during detection. [sent-143, score-0.52]

54 Then, for any words appearing in the top N = 100 discriminative words in Whog and Whof, we look up the corresponding entries in Hhog and Hhof, and use those 3D displacement vectors to vote on the probable action center (x, y, t). [sent-148, score-0.51]
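
Building on the hypothetical `HoughActionDetector` sketched earlier, the voting step could be written roughly as follows. The feature tuple format `(channel, word_id, x, y, t)` is an assumption made for illustration.

```python
import numpy as np

def cast_votes(detector, features, top_n=100):
    """Accumulate (x, y, t) votes for the action center from discriminative words only."""
    top = {'hog': set(detector.W_hog[:top_n]), 'hof': set(detector.W_hof[:top_n])}
    tables = {'hog': detector.H_hog, 'hof': detector.H_hof}
    votes = []
    for channel, word_id, x, y, t in features:
        if word_id not in top[channel]:
            continue                               # only the top-N words get to vote
        for dx, dy, dt in tables[channel].get(word_id, []):
            votes.append((x + dx, y + dy, t + dt)) # displaced vote for the action center
    return np.asarray(votes, dtype=float).reshape(-1, 3)
```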

55 This voting procedure assumes that an action at test time will occur at roughly the same time-scale as some exemplar from training. [sent-153, score-0.476]

56 In the ideal case, a true detection would have nearly all votes cast near its action center. [sent-156, score-0.602]

57 Active Selection of Untrimmed Videos. At this stage, we have an action detector D(T) and an unlabeled pool of untrimmed video clips U. [sent-160, score-0.649]

58 The key technical issue is how to identify the untrimmed video that, if annotated, would most benefit the detector. [sent-164, score-0.515]

59 Motivated by this general idea, we define a criterion tailored to our problem setting that seeks the unlabeled video that, if used to augment the action detector, would most reduce the vote-space uncertainty across all unlabeled data. [sent-166, score-1.259]

60 Here, S(·) is a scoring function that takes a training set as input, and estimates the confidence of a detector trained on that data and applied to all unlabeled data (to be defined next). [sent-168, score-0.531]
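
The selection rule itself is not reproduced in this summary; a hedged reconstruction of its general form, using the notation above, is:

```latex
% Assumed form of the active selection criterion (reconstruction, not copied from the paper):
% choose the unlabeled video whose predicted positive intervals \hat{I}(v_l), added to the
% training set T, give the highest pool-wide confidence score S.
v^{*} \;=\; \operatorname*{argmax}_{v_l \in \mathcal{U}} \; S\big(\mathcal{T} \cup \hat{I}(v_l)\big)
```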

61 In particular, if we were to consider vl as a potential positive (denoted v+) by simply updating the detector with vote vectors extracted from all features within it, we are likely to see no reduction in uncertainty—even if that video contains a great example of the action. [sent-171, score-0.594]

62 The reason is that the features outside the true positive action interval would introduce substantial noise into the Hough tables. [sent-172, score-0.538]

63 We overcome this problem by first estimating any occurrences of the action within v using the current detector and the procedure given in Sec. [sent-173, score-0.597]

64 Since the unlabeled video may very well contain no positive intervals, we must also consider it as a negative instance. [sent-189, score-0.472]

65 Recall that positive intervals modify the detector in two ways: 1) by updating the Hough tables with votes for the new action centers, and 2) by updating the top N most discriminative words in Whog and Whof. [sent-193, score-1.098]
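
In code, those two updates might look roughly like the sketch below (again using the hypothetical detector container from above). The ratio-based re-ranking of discriminative words is an assumption; the paper's exact criterion may differ.

```python
def add_positive_interval(detector, interval_features, center,
                          pos_counts, neg_counts, top_n=100):
    """Sketch: fold one (possibly hypothesized) positive interval into D(T)."""
    cx, cy, ct = center
    # 1) Update the Hough tables with votes toward the new action center.
    for channel, word_id, x, y, t in interval_features:
        table = detector.H_hog if channel == 'hog' else detector.H_hof
        table[word_id].append((cx - x, cy - y, ct - t))
        pos_counts[(channel, word_id)] = pos_counts.get((channel, word_id), 0) + 1

    # 2) Re-rank the discriminative words (illustrative positive/negative ratio score).
    def rerank(channel):
        words = {w for (c, w) in pos_counts if c == channel}
        score = lambda w: pos_counts[(channel, w)] / (1.0 + neg_counts.get((channel, w), 0))
        return sorted(words, key=score, reverse=True)[:top_n]

    detector.W_hog = rerank('hog')
    detector.W_hof = rerank('hof')
```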

66 We now define the scoring function used in Eqns. 2 and 3, which ought to reflect how well the detector can localize actions in every unlabeled video. [sent-198, score-0.663]

67 We use each candidate detector (modified as described above) to cast votes in each unlabeled video. [sent-200, score-0.625]

68 Intuitively, a vote space with good cluster(s) indicates there is consensus on the location(s) of the action center, whereas spread out votes suggest confusion among the features about the action center placement. [sent-214, score-1.128]

69 While videos with one clear detection will score best, multiple tight vote clusters will still have lower entropy than diffuse votes, as desired. [sent-215, score-0.475]

70 Then the normalized entropy over the entire vote space Vv when using detector D(T) is denoted H(Vv; D(T)). [sent-221, score-0.472]
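
The closed form is cut off in this summary; a standard normalized Shannon entropy over a binned (x, y, t) vote space, which matches the description but may differ in detail from the paper's exact definition, would be:

```latex
% Assumed reconstruction: B is a set of bins discretizing the (x, y, t) vote space of video v,
% and p_b is the fraction of vote mass falling in bin b.
H\big(\mathcal{V}_v;\,\mathcal{D}(\mathcal{T})\big) \;=\; -\,\frac{1}{\log |B|} \sum_{b \in B} p_b \log p_b
```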

71 Using this entropy-based uncertainty metric, we define the confidence of a detector in localizing actions on the entire unlabeled set as a VALUE score. [sent-225, score-0.773]
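
A minimal sketch of this pool-wide confidence score follows. The binning, the handling of empty vote spaces, and the sign convention (higher VALUE means lower total entropy) are assumptions; `vote_fn` is a hypothetical callable that casts the detector's votes for one video.

```python
import numpy as np

def normalized_vote_entropy(votes, bins=(16, 16, 16)):
    """Normalized Shannon entropy of a 3D (x, y, t) vote space (assumed binning)."""
    if len(votes) == 0:
        return 1.0                                 # no votes at all: treat as maximally uncertain
    hist, _ = np.histogramdd(votes, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(hist.size))

def value_score(detector, unlabeled_videos, vote_fn):
    """Pool-wide confidence: the lower the total entropy over all unlabeled videos,
    the higher the VALUE of the (temporarily augmented) detector."""
    total_entropy = sum(normalized_vote_entropy(vote_fn(detector, v)) for v in unlabeled_videos)
    return -total_entropy
```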

72 Rather, it selects the video which is most expected to reduce entropy among all unlabeled videos. [sent-232, score-0.568]

73 The former would be susceptible to selecting uninformative false positive videos, where even a small number of words agreeing strongly on an action center could yield a low entropy score. [sent-234, score-0.668]

74 Essentially, our method helps discover true positive intervals, since once added to the detector they most help “explain” the remainder of the unlabeled data. [sent-241, score-0.525]

75 This means, for example, that a false positive interval in an unlabeled video will not overshadow the positive impact of a true positive interval in the same video. [sent-246, score-0.693]

76 We consider each unlabeled video in turn, and score its predicted influence on the uncertainty of all unlabeled videos. [sent-248, score-0.773]

77 If an action is split between two shots, the shot with the major portion is assigned that action’s label. [sent-267, score-0.426]

78 While MeanShift can return a variable number of modes, for efficiency during active selection we limit the number of candidates per video to K = 2. [sent-274, score-0.447]
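
A sketch of that step using scikit-learn's MeanShift is shown below; the bandwidth handling and the ranking of modes by vote support are assumptions, and only the K = 2 limit comes from the text.

```python
import numpy as np
from sklearn.cluster import MeanShift

def candidate_centers(votes, k=2, bandwidth=None):
    """Return up to k candidate action centers (modes of the 3D vote space)."""
    votes = np.asarray(votes, dtype=float)
    if len(votes) == 0:
        return []
    ms = MeanShift(bandwidth=bandwidth).fit(votes)  # bandwidth=None lets sklearn estimate it
    support = np.bincount(ms.labels_)               # number of votes assigned to each mode
    best = np.argsort(support)[::-1][:k]            # keep the k best-supported modes
    return [tuple(ms.cluster_centers_[i]) for i in best]
```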

79 Evaluation metric. We quantify performance with learning curves: after each iteration, we score action localization accuracy on an unseen test set. [sent-280, score-0.495]
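
As a simplified, temporal-only illustration of such an overlap-based localization score (the paper's metric may also involve spatial overlap), one could compute:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal intervals given as (start, end) frames."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(detections, ground_truth, thresh=0.5):
    """Fraction of test videos whose top detection overlaps the ground truth enough."""
    hits = sum(1 for d, g in zip(detections, ground_truth) if temporal_iou(d, g) >= thresh)
    return hits / len(ground_truth) if ground_truth else 0.0
```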

80 Methods compared. We test two variants of our approach: one that uses the true action intervals when evaluating a video for active selection (Active GT-Ints), and one that uses the (noisier) Hough-predicted intervals (Active Pred-Ints). [sent-289, score-1.211]

81 It is valuable because it isolates the impact of the proposed active learning strategy from that of the specific detector we use. [sent-291, score-0.525]

82 We build an SVM action classifier using a χ2 kernel and bag-of-words histograms computed for every bounding volume in the initial labeled videos. [sent-296, score-0.53]

83 The classifier predicts the probability each unlabeled video is positive, and we request a label for the one whose confidence is nearest 0.5. [sent-297, score-0.537]
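
A sketch of this classifier-based baseline using scikit-learn is given below. It assumes binary labels with 1 as the positive class; pooling every feature of an untrimmed clip into one histogram is exactly the simplification the paper argues against.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def uncertainty_sampling_baseline(X_labeled, y_labeled, X_unlabeled):
    """Chi^2-kernel SVM on clip-level bag-of-words histograms; request a label for
    the unlabeled clip whose predicted P(positive) is nearest 0.5."""
    K_train = chi2_kernel(X_labeled, X_labeled)                # precomputed chi^2 Gram matrix
    clf = SVC(kernel='precomputed', probability=True).fit(K_train, y_labeled)
    K_pool = chi2_kernel(X_unlabeled, X_labeled)
    p_pos = clf.predict_proba(K_pool)[:, list(clf.classes_).index(1)]
    return int(np.argmin(np.abs(p_pos - 0.5)))                 # index of the most uncertain clip
```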

84 Active Entropy: This method selects the most uncertain video for labeling, where detector uncertainty is computed with the same vote space entropy in Eqn. [sent-300, score-0.584]

85 Whereas our method selects the video that should reduce uncertainty over all data, this baseline seeks the video that itself appears most uncertain. [sent-302, score-0.376]

86 In 7 of the 8 action classes, Active GT-Ints clearly outperforms all the baseline methods. [sent-305, score-0.426]

87 On SitUp, however, it is a toss-up, likely because parts of this action are very similar to the beginning and ending sequences of StandUp and SitDown, respectively, and hence the detector votes for action centers of other classes. [sent-306, score-1.183]

88 We see that our advantage is best for those classes where more videos are available in the dataset; this allows us to initialize the detectors with more positive samples (L = 8), and the stronger initial detectors make uncertainty reduction estimates more reliable. [sent-313, score-0.392]

89 Our active detector quickly explores the distinct useful types for annotation. [sent-319, score-0.458]

90 In this action, the actor stands still and points at something, which leads to fewer STIPs in the action bounding volume, and thus an insufficient set of discriminative words for voting. [sent-322, score-0.541]

91 We find the Hough detector occasionally generates large clusters around action centers of a different action class. [sent-323, score-1.095]

92 Periodicity is problematic for voting because the features specific to the action re-occur at multiple places in the bounding volume. [sent-351, score-0.516]

93 This supports our claim that simply estimating individual video uncertainty is insufficient for our setting; our use of total entropy reduction over all unlabeled data is more reliable. [sent-356, score-0.719]

94 This underscores the importance of reasoning about the untrimmed nature of video when performing active learning. [sent-358, score-0.802]

95 Figure 8 shows Active Pred-Int’s mean overlap accuracy per dataset at three different points: 1) at the onset, using the small initial labeled set, 2) after 10-20 rounds of active learning, and 3) after adding all the videos with their annotations. [sent-362, score-0.511]

96 Typically the active approach produces a detector nearly as accurate as the one trained with all data, yet it costs much less annotator time, e. [sent-363, score-0.52]

97 Again, this is an upshot of formulating the selection criterion in terms of total uncertainty reduction on all unlabeled data. [sent-371, score-0.496]

98 Rather than request labels on individual videos that look uncertain, our method searches for examples that help explain the plausible detections in all remaining unlabeled examples. [sent-372, score-0.517]

99 Conclusion. Untrimmed video is what exists “in the wild” before any annotators touch it, yet it is ill-suited for traditional active selection metrics. [sent-380, score-0.489]

100 We introduced an approach to discover informative action instances among such videos. [sent-381, score-0.484]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('action', 0.426), ('untrimmed', 0.4), ('active', 0.287), ('unlabeled', 0.273), ('actions', 0.189), ('detector', 0.171), ('intervals', 0.169), ('vote', 0.155), ('entropy', 0.146), ('hough', 0.129), ('votes', 0.121), ('whog', 0.12), ('videos', 0.116), ('hollywood', 0.115), ('video', 0.115), ('msr', 0.113), ('uncertainty', 0.112), ('trimmed', 0.082), ('whof', 0.08), ('annotation', 0.071), ('request', 0.066), ('boxing', 0.066), ('annotator', 0.062), ('hhog', 0.06), ('interval', 0.058), ('vijayanarasimhan', 0.057), ('tables', 0.057), ('confidence', 0.056), ('clip', 0.055), ('passive', 0.055), ('positive', 0.054), ('clapping', 0.053), ('hug', 0.053), ('annotations', 0.05), ('voting', 0.05), ('actively', 0.047), ('selection', 0.045), ('rounds', 0.045), ('informativeness', 0.044), ('stips', 0.044), ('effort', 0.043), ('temporarily', 0.042), ('annotators', 0.042), ('words', 0.042), ('hof', 0.041), ('bounding', 0.04), ('learning', 0.04), ('interest', 0.04), ('reduction', 0.04), ('altof', 0.04), ('hhof', 0.04), ('revising', 0.04), ('situp', 0.04), ('standup', 0.04), ('subvolumes', 0.04), ('waving', 0.04), ('centers', 0.039), ('entails', 0.037), ('labeled', 0.037), ('detections', 0.036), ('getoutcar', 0.035), ('sitdown', 0.035), ('detectors', 0.035), ('augment', 0.034), ('selects', 0.034), ('laptev', 0.033), ('insufficient', 0.033), ('manual', 0.033), ('handshake', 0.033), ('optimistic', 0.033), ('requested', 0.033), ('clusters', 0.033), ('instances', 0.031), ('scoring', 0.031), ('subvolume', 0.031), ('temporal', 0.031), ('candidate', 0.03), ('vl', 0.03), ('cast', 0.03), ('interface', 0.03), ('localize', 0.03), ('negative', 0.03), ('underperform', 0.029), ('updating', 0.029), ('localization', 0.029), ('extents', 0.028), ('kick', 0.028), ('curves', 0.028), ('annotating', 0.028), ('classifier', 0.027), ('impact', 0.027), ('discover', 0.027), ('pool', 0.027), ('overlap', 0.026), ('criterion', 0.026), ('explain', 0.026), ('detection', 0.025), ('clips', 0.025), ('annotate', 0.025), ('word', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999869 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

Author: Sunil Bandla, Kristen Grauman

Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.

2 0.35247242 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

3 0.33310953 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

Author: Mihai Zanfir, Marius Leordeanu, Cristian Sminchisescu

Abstract: Human action recognition under low observational latency is receiving a growing interest in computer vision due to rapidly developing technologies in human-robot interaction, computer gaming and surveillance. In this paper we propose a fast, simple, yet powerful non-parametric Moving Pose (MP)frameworkfor low-latency human action and activity recognition. Central to our methodology is a moving pose descriptor that considers both pose information as well as differential quantities (speed and acceleration) of the human body joints within a short time window around the current frame. The proposed descriptor is used in conjunction with a modified kNN classifier that considers both the temporal location of a particular frame within the action sequence as well as the discrimination power of its moving pose descriptor compared to other frames in the training set. The resulting method is non-parametric and enables low-latency recognition, one-shot learning, and action detection in difficult unsegmented sequences. Moreover, the framework is real-time, scalable, and outperforms more sophisticated approaches on challenging benchmarks like MSR-Action3D or MSR-DailyActivities3D.

4 0.29343137 6 iccv-2013-A Convex Optimization Framework for Active Learning

Author: Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, S. Shankar Sasrty

Abstract: In many image/video/web classification problems, we have access to a large number of unlabeled samples. However, it is typically expensive and time consuming to obtain labels for the samples. Active learning is the problem of progressively selecting and annotating the most informative unlabeled samples, in order to obtain a high classification performance. Most existing active learning algorithms select only one sample at a time prior to retraining the classifier. Hence, they are computationally expensive and cannot take advantage of parallel labeling systems such as Mechanical Turk. On the other hand, algorithms that allow the selection of multiple samples prior to retraining the classifier, may select samples that have significant information overlap or they involve solving a non-convex optimization. More importantly, the majority of active learning algorithms are developed for a certain classifier type such as SVM. In this paper, we develop an efficient active learning framework based on convex programming, which can select multiple samples at a time for annotation. Unlike the state of the art, our algorithm can be used in conjunction with any type of classifiers, including those of the fam- ily of the recently proposed Sparse Representation-based Classification (SRC). We use the two principles of classifier uncertainty and sample diversity in order to guide the optimization program towards selecting the most informative unlabeled samples, which have the least information overlap. Our method can incorporate the data distribution in the selection process by using the appropriate dissimilarity between pairs of samples. We show the effectiveness of our framework in person detection, scene categorization and face recognition on real-world datasets.

5 0.27676231 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

6 0.25409669 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

7 0.24877618 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

8 0.24008156 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

9 0.23684531 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

10 0.23138805 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

11 0.2045746 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

12 0.20256636 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

13 0.19885683 166 iccv-2013-Finding Actors and Actions in Movies

14 0.18865308 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

15 0.17901424 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

16 0.16983575 39 iccv-2013-Action Recognition with Improved Trajectories

17 0.16401918 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

18 0.16096 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

19 0.14922369 43 iccv-2013-Active Visual Recognition with Expertise Estimation in Crowdsourcing

20 0.14799054 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.257), (1, 0.273), (2, 0.116), (3, 0.269), (4, 0.036), (5, -0.043), (6, 0.11), (7, -0.075), (8, -0.034), (9, 0.021), (10, 0.05), (11, 0.074), (12, -0.023), (13, -0.098), (14, 0.206), (15, -0.07), (16, -0.038), (17, -0.035), (18, -0.108), (19, -0.065), (20, -0.069), (21, 0.004), (22, -0.176), (23, -0.023), (24, -0.023), (25, 0.001), (26, 0.054), (27, 0.064), (28, 0.046), (29, -0.095), (30, -0.131), (31, -0.002), (32, 0.076), (33, -0.075), (34, 0.045), (35, 0.018), (36, 0.079), (37, -0.039), (38, -0.011), (39, 0.04), (40, -0.02), (41, 0.0), (42, 0.001), (43, -0.007), (44, 0.026), (45, -0.041), (46, -0.056), (47, -0.057), (48, 0.013), (49, -0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98091835 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

Author: Sunil Bandla, Kristen Grauman

Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.

2 0.84237045 86 iccv-2013-Concurrent Action Detection with Structural Prediction

Author: Ping Wei, Nanning Zheng, Yibiao Zhao, Song-Chun Zhu

Abstract: Action recognition has often been posed as a classification problem, which assumes that a video sequence only have one action class label and different actions are independent. However, a single human body can perform multiple concurrent actions at the same time, and different actions interact with each other. This paper proposes a concurrent action detection model where the action detection is formulated as a structural prediction problem. In this model, an interval in a video sequence can be described by multiple action labels. An detected action interval is determined both by the unary local detector and the relations with other actions. We use a wavelet feature to represent the action sequence, and design a composite temporal logic descriptor to describe the action relations. The model parameters are trained by structural SVM learning. Given a long video sequence, a sequential decision window search algorithm is designed to detect the actions. Experiments on our new collected concurrent action dataset demonstrate the strength of our method.

3 0.8276118 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis

Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localiza- tion as additional output. This output sheds further light into detailed action understanding.

4 0.80266505 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

Author: Jiang Wang, Ying Wu

Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.

5 0.77210295 38 iccv-2013-Action Recognition with Actons

Author: Jun Zhu, Baoyuan Wang, Xiaokang Yang, Wenjun Zhang, Zhuowen Tu

Abstract: With the improved accessibility to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton ” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) observe theproperties ofbeing compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness ofapplying the learned actons in our two-layer structure, and show the state-ofthe-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.

6 0.77141863 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

7 0.77078122 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition

8 0.7267648 166 iccv-2013-Finding Actors and Actions in Movies

9 0.71508312 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

10 0.68014878 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

11 0.62776768 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

12 0.60157102 244 iccv-2013-Learning View-Invariant Sparse Representations for Cross-View Action Recognition

13 0.60153294 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

14 0.59610438 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach

15 0.59148449 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition

16 0.58935952 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

17 0.58472073 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

18 0.56646335 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

19 0.55667257 6 iccv-2013-A Convex Optimization Framework for Active Learning

20 0.55306172 43 iccv-2013-Active Visual Recognition with Expertise Estimation in Crowdsourcing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.075), (7, 0.017), (12, 0.026), (26, 0.066), (31, 0.036), (40, 0.035), (42, 0.125), (64, 0.115), (71, 0.173), (73, 0.025), (78, 0.012), (89, 0.174), (95, 0.02), (98, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85975015 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos

Author: Sunil Bandla, Kristen Grauman

Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.

2 0.85523534 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

3 0.83659655 264 iccv-2013-Minimal Basis Facility Location for Subspace Segmentation

Author: Choon-Meng Lee, Loong-Fah Cheong

Abstract: In contrast to the current motion segmentation paradigm that assumes independence between the motion subspaces, we approach the motion segmentation problem by seeking the parsimonious basis set that can represent the data. Our formulation explicitly looks for the overlap between subspaces in order to achieve a minimal basis representation. This parsimonious basis set is important for the performance of our model selection scheme because the sharing of basis results in savings of model complexity cost. We propose the use of affinity propagation based method to determine the number of motion. The key lies in the incorporation of a global cost model into the factor graph, serving the role of model complexity. The introduction of this global cost model requires additional message update in the factor graph. We derive an efficient update for the new messages associated with this global cost model. An important step in the use of affinity propagation is the subspace hypotheses generation. We use the row-sparse convex proxy solution as an initialization strategy. We further encourage the selection of subspace hypotheses with shared basis by integrat- ing a discount scheme that lowers the factor graph facility cost based on shared basis. We verified the model selection and classification performance of our proposed method on both the original Hopkins 155 dataset and the more balanced Hopkins 380 dataset.

4 0.82453734 27 iccv-2013-A Robust Analytical Solution to Isometric Shape-from-Template with Focal Length Calibration

Author: Adrien Bartoli, Daniel Pizarro, Toby Collins

Abstract: We study the uncalibrated isometric Shape-fromTemplate problem, that consists in estimating an isometric deformation from a template shape to an input image whose focal length is unknown. Our method is the first that combines the following features: solving for both the 3D deformation and the camera ’s focal length, involving only local analytical solutions (there is no numerical optimization), being robust to mismatches, handling general surfaces and running extremely fast. This was achieved through two key steps. First, an ‘uncalibrated’ 3D deformation is computed thanks to a novel piecewise weak-perspective projection model. Second, the camera’s focal length is estimated and enables upgrading the 3D deformation to metric. We use a variational framework, implemented using a smooth function basis and sampled local deformation models. The only degeneracy which we easily detect– for focal length estimation is a flat and fronto-parallel surface. Experimental results on simulated and real datasets show that our method achieves a 3D shape accuracy – slightly below state of the art methods using a precalibrated or the true focal length, and a focal length accuracy slightly below static calibration methods.

5 0.81964999 338 iccv-2013-Randomized Ensemble Tracking

Author: Qinxun Bai, Zheng Wu, Stan Sclaroff, Margrit Betke, Camille Monnier

Abstract: We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of stateof-the-art approaches.

6 0.81832278 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes

7 0.81556332 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation

8 0.81517494 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

9 0.81496733 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments

10 0.81474245 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

11 0.81388056 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning

12 0.8125124 86 iccv-2013-Concurrent Action Detection with Structural Prediction

13 0.81097907 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning

14 0.80862373 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation

15 0.80378306 303 iccv-2013-Orderless Tracking through Model-Averaged Posterior Estimation

16 0.80313849 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information

17 0.80091166 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps

18 0.8003509 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests

19 0.79987645 166 iccv-2013-Finding Actors and Actions in Movies

20 0.79778951 168 iccv-2013-Finding the Best from the Second Bests - Inhibiting Subjective Bias in Evaluation of Visual Tracking Algorithms