iccv iccv2013 iccv2013-127 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event-specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The problem of adaptively selecting pooling regions for the classification of complex video events is considered. [sent-2, score-0.69]
2 Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. [sent-3, score-0.569]
3 A dynamic pooling operator is defined so as to enable a unified solution to the problems of event specific video segmentation, temporal structure modeling, and event detection. [sent-4, score-1.48]
4 Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. [sent-5, score-1.012]
5 This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. [sent-6, score-0.951]
6 Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. [sent-8, score-0.558]
7 Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets. [sent-9, score-0.469]
8 Figure 1: Challenges of event recognition in open source video (best viewed in color). [sent-83, score-0.432]
9 Videos of the same event, e.g., “wedding”, can differ substantially in the atomic actions that compose them and in their corresponding durations (indicated by color bars). [sent-89, score-0.251]
10 For example, the upper “wedding” video includes the atomic actions “walking the bride” (red), “dancing” (light grey), “flower throwing” (orange), “cake cutting” (yellow) and “bride and groom traveling” (green). [sent-90, score-0.426]
11 In the “feeding an animal” examples, only a small portion (red box) of the video actually depicts the action of handing food to an animal. [sent-93, score-0.265]
12 Unlike the recognition of primitive or atomic actions, such as “walking” or “running”, from carefully assembled video, complex events depict human behaviors in unconstrained scenes, performing more sophisticated activities that involve more complex interactions with the environment, e.g. [sent-95, score-1.008]
13 For all of these reasons, the detection of complex events presents two major challenges beyond those commonly addressed in the action recognition literature. [sent-103, score-0.25]
14 The first is that the video is usually not precisely segmented to include only the behaviors of interest. [sent-104, score-0.342]
15 For example, as shown in Figure 1, while the event “feeding an animal” is mostly about the behavior of handing the animal food, a typical YouTube video in this class depicts a caretaker approaching the animal, playing with it, checking its health, etc. [sent-105, score-0.571]
16 The second challenge is that the behaviors of interest can have a complex temporal structure. [sent-106, score-0.496]
17 In general, a complex event can have multiple such behaviors and these can appear with great variability of temporal configurations. [sent-107, score-0.824]
18 For example, the “birthday party” and “wedding” events of Figure 1, have significant variation in the continuity, order, and duration of characteristic behaviors such as “walking the bride,” “dancing,” “flower throwing,” or “cake cutting”. [sent-108, score-0.523]
19 One operation critical for the success of the bag-of-features (BoF) representation is the pooling of visual features into a holistic video representation. [sent-110, score-0.521]
20 However, while fixed pooling strategies, such as average pooling or temporal pyramid matching, are suitable for carefully manicured video, they have two strong limitations for complex event recognition. [sent-111, score-1.459]
21 First, by integrating information in a pre-defined manner, they cannot adapt to the temporal structure of the behaviors of interest. [sent-112, score-0.47]
22 Second, by pooling features from video regions that do not depict characteristic behaviors, they produce noisy histograms, where the feature counts due to characteristic behavior can be easily overwhelmed by those due to uninformative content. [sent-114, score-0.841]
23 In this work, we address both limitations by proposing a pooling scheme adaptive to the temporal structure of the particular video to be recognized. [sent-115, score-0.753]
24 The video sequence is decomposed into segments, and the most informative segments for detection of a given event are identified, so as to dynamically determine the pooling operator most suited for that particular video sequence. [sent-116, score-1.27]
25 This dynamic pooling is implemented by treating the locations of the characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. [sent-117, score-0.951]
26 In this way, only the portions of the video informative about the event of interest are used for its representation. [sent-119, score-0.469]
27 The proposed pooling scheme can be seen either as 1) a discriminant form of segmentation and grouping, which eliminates histogram noise due to uninformative content, or 2) a discriminant approach to modeling video structure, which automatically identifies the locations of behaviors of interest. [sent-120, score-0.873]
28 Besides the coarse-level location of segments, finer modeling of structure can be achieved by jointly pooling histograms of segment-tuples. [sent-122, score-0.489]
29 This is akin to recent attempts at modeling the short-term temporal layout of simple actions [9], but relies on adaptively rather than manually specified video segments. [sent-123, score-0.41]
30 Related Work There has, so far, been limited work on pooling mechanisms for complex event detection. [sent-126, score-0.811]
31 The authors of [13] extend spatial pyramid matching to the video domain and propose a BoF temporal pyramid (BoF-TP) matching scheme for atomic action recognition in movie clips. [sent-128, score-0.664]
32 The authors of [5] use unsupervised clustering of image features to guide feature pooling at the image level. [sent-130, score-0.417]
33 Since these pooling schemes cannot 1) select informative video segments, or 2) model the temporal structure of the underlying activities, they have limited applicability to complex event modeling. [sent-131, score-1.184]
34 While [10] assumes that the optimal spatial regions (receptive fields) for pooling descriptors of a given category are fixed, our work addresses content-driven pooling regions, dynamically or adaptively discovered on a sequence-by-sequence basis. [sent-133, score-0.895]
35 Several works have addressed the modeling of temporal structure of human activities. [sent-134, score-0.264]
36 In [21], Schindler and Van Gool show that simple actions can be recognized almost instantaneously, with a signature video segment less than 1 second long. [sent-139, score-0.334]
37 ..., 1) ignoring the temporal structure within subsequences, 2) limiting the hypothesis space of video cropping to continuous subsequences (which precludes temporally disconnected subsequences that are potentially more discriminant for complex event recognition), and 3) limited ... [sent-143, score-1.039]
38 ...actions of “jumping with board” and “landing”, which are mined out to represent the event either by segment or segment-pair pooling. [sent-193, score-0.42]
39 We address this problem by proposing an efficient procedure to dynamically determine the most discriminant segments for video classification. [sent-196, score-0.375]
40 The second class aims to factorize activities into sequences of atomic behaviors, and characterize their temporal dependencies [17, 8, 4, 23, 24, 14, 16]. [sent-197, score-0.521]
41 The authors of [8] raise the semantics of the representation, explicitly characterizing activities as sequences of atomic actions (e.g. [sent-203, score-0.411]
42 Li and Vasconcelos extend this idea by characterizing the dynamics of action attributes, using a binary dynamic system (BDS) to model trajectories of human activity in attribute space [14], and then a bag of words for attribute dynamics (BoWAD) [16]. [sent-206, score-0.304]
43 Some drawbacks of these approaches include the need for manual 1) segmentation of activities into predefined atomic actions, or 2) annotation of training sets for learning attributes or atomic actions. [sent-207, score-0.498]
44 Some automated methods have, however, been proposed for discovery of latent temporal structure. [sent-208, score-0.288]
45 Most methods in this group assume that 1) the entire video sequence is well described by the associated label, and 2) video sequences are precisely cropped and aligned with activities of interest. [sent-211, score-0.418]
46 Event Detection via Dynamic Pooling In this section we introduce a detector of complex events using dynamic pooling. [sent-214, score-0.277]
47 Complex Events A complex event is defined as an event composed of several local behaviors. [sent-217, score-0.722]
48 A video sequence v is first divided into a series of short-term temporal segments S = {s_i}_{i=1}^τ, which we denote atomic segments. [sent-218, score-0.515]
49 By determining the composition of the subset S¯, h controls the temporal pooling of visual word counts. [sent-244, score-0.647]
50 A fixed h implements a static pooling mechanism, e.g. [sent-245, score-0.417]
51 In this work, we introduce a dynamic pooling operator, by making h a latent variable, adapted to each sequence so as to maximize classification accuracy. [sent-248, score-0.634]
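To make the pooling operator concrete, here is a minimal sketch (not the paper's code; the function name, the (τ, d) matrix layout, and the count normalization are assumptions): the segment histograms are stacked into a matrix X, and a binary selector h decides which rows are summed into the pooled bag-of-features.

```python
import numpy as np

def dynamic_pool(X, h):
    """Pool the histograms of the segments selected by a binary indicator h.

    X : (tau, d) array; row i is the visual-word histogram of segment s_i.
    h : (tau,) binary array; h[i] = 1 if segment s_i is selected.
    Returns the sum of the selected histograms, renormalized by the number
    of selected segments (one plausible normalization, assumed here).
    """
    h = np.asarray(h, dtype=float)
    pooled = X.T @ h                    # sum of the selected rows of X
    return pooled / max(h.sum(), 1.0)   # guard against an empty selection

# A fixed all-ones h recovers static average pooling (plain BoF); letting h
# vary per sequence is the dynamic pooling described in the text.
tau, d = 20, 4000
X = np.random.rand(tau, d)
bof = dynamic_pool(X, np.ones(tau))
```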
52 Prediction Rule A detector for event class c is implemented as d(v) = sign[f_w(v)], where f_w(v) is a linear predictor that quantifies the confidence with which v belongs to c. [sent-252, score-0.507]
53 In this case, a, b are fixed hyperparameters, encoding prior knowledge on event structure. [sent-262, score-0.328]
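The paper obtains the globally optimal h through a series of linear programs. The sketch below covers only a simplified special case, assuming exactly k segments are selected and the pooled feature of the previous sketch: the objective becomes (1/k) Σ_i h_i (w · x_i), whose optimum over binary h with exactly k ones is simply the top-k per-segment scores, so no LP solver is needed in this restricted setting.

```python
import numpy as np

def infer_h_topk(X, w, k):
    """Maximize w . Phi(v, h) with Phi(v, h) = X^T h / k and ||h||_1 = k.

    The objective is (1/k) * sum_i h[i] * (w . x_i), so the binary optimum
    selects the k segments with the largest per-segment scores.
    """
    scores = X @ w                   # (tau,) score of each segment
    top = np.argsort(scores)[-k:]    # indices of the k best segments
    h = np.zeros(X.shape[0])
    h[top] = 1.0
    return h

def detect(X, w, k, bias=0.0):
    """Prediction rule d(v) = sign(f_w(v)), with f_w the latent maximum."""
    h = infer_h_topk(X, w, k)
    return np.sign(w @ (X.T @ h / k) + bias)
```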
54 Hypothesis Space for Pooled Features In this section we discuss several possibilities for the hypothesis space of the proposed complex event detector. [sent-332, score-0.479]
55 The fourth is an unconstrained selector h, which is a special temporally localized selector with window (1, τ). [sent-358, score-0.481]
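A hypothetical sketch of the temporally localized selector, reusing infer_h_topk from the previous sketch: candidate windows (t, δ) are enumerated, selection is confined to one window at a time, and the highest-scoring window wins. The window sizes in deltas are illustrative, not the paper's settings; the unconstrained selector corresponds to the single window (1, τ).

```python
import numpy as np

def infer_h_localized(X, w, k, deltas=(4, 8, 16)):
    """Best selection of k segments confined to one window (t, t + delta)."""
    tau = X.shape[0]
    best_score, best_h = -np.inf, None
    for delta in deltas:
        for t in range(max(tau - delta + 1, 0)):
            sub = X[t:t + delta]           # segments inside the window
            if sub.shape[0] < k:
                continue                   # window too small for k picks
            h_sub = infer_h_topk(sub, w, k)
            score = w @ (sub.T @ h_sub / k)
            if score > best_score:
                h = np.zeros(tau)
                h[t:t + delta] = h_sub
                best_score, best_h = score, h
    return best_h                          # None if no window fits k
```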
56 Structure of Pooled Features So far, we have assumed that the features x_i of (1) are histograms of visual word counts of the video segments s_i. [sent-363, score-0.351]
57 The first consists of the sequence of atomic behaviors “car accelerates” and “car crashes”, corresponding to regular traffic accidents. [sent-367, score-0.457]
58 In the absence of an explicit encoding of the temporal sequence of the atomic behaviors, the two events cannot be disambiguated. [sent-369, score-0.514]
59 Another possibility is to extend the proposed pooling scheme to tuples of pooling regions. [sent-372, score-0.871]
60 For example, dynamic pooling can be applied to segment pairs, by simply replacing the segment set S with S^2 = {(s_i, s_j) | L1 <= j - i <= L2}. [sent-373, score-0.672]
61 For example, when L1 = L2 = 1, the pair pooling strategy is similar to the localized version of the two-level temporal pyramid matching scheme of [13], albeit with dynamically selected pooling windows. [sent-393, score-1.185]
62 This is akin to the representation of [8], where activities are manually decomposed into three atomic actions. [sent-395, score-0.329]
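A small sketch of segment-pair pooling follows, under two stated assumptions: the pair constraint L1 <= j - i <= L2 is the reconstruction inferred from context above, and histogram concatenation is just one plausible pair descriptor. The point is that ordered pairs preserve temporal order, which is exactly what disambiguates "car accelerates then crashes" from the reverse.

```python
import numpy as np

def pair_features(X, L1=1, L2=4):
    """Features for ordered segment pairs (s_i, s_j) with L1 <= j - i <= L2.

    Concatenation keeps the order of the two histograms, so events that
    differ only in the order of their atomic behaviors pool differently.
    """
    tau, _ = X.shape
    feats, index = [], []
    for i in range(tau):
        for j in range(i + L1, min(i + L2, tau - 1) + 1):
            feats.append(np.concatenate([X[i], X[j]]))
            index.append((i, j))
    return np.asarray(feats), index

# With L1 = L2 = 1, only adjacent pairs are formed, mirroring the localized
# two-segment case discussed in the text.
```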
63 In this second learning stage, the hidden-variable selector h of (1) is restricted to a continuous pooling window (CPW), producing a latent SVM with parameter w_CPW. [sent-407, score-0.786]
64 This parameter is next used to initialize the CCCP algorithm for learning a latent SVM with a temporally localized window for single-segment pooling (SSP), i.e. [sent-408, score-0.812]
65 (Results-figure residue: per-category labels for the Olympic Sports classes pole-vault, gymnastics-vault, shot-put, snatch, clean-jerk, javelin throw, hammer throw, discus throw, diving-platform, diving-springboard, basketball-layup, bowling, tennis-serve.) [sent-429, score-0.275]
66 Finally, w_SSP is used to initialize CCCP for learning a latent SVM with a temporally localized pooling window and segment-pair selection (SPP), i.e. [sent-517, score-0.812]
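The staged training just described can be sketched as below, with two loud caveats: a plain subgradient hinge-loss update stands in for the paper's actual solver, and for brevity the latent variables of negative examples are completed the same way as positives, which CCCP proper does not do. infer_h_topk (from the earlier sketch) stands in for whichever selector the stage uses (CPW, SSP, or SPP).

```python
import numpy as np

def cccp_stage(videos, labels, w, k, outer=10, inner=20, lr=0.1, lam=0.01):
    """One training stage: alternate latent completion and weight updates.

    videos : list of (tau_v, d) histogram matrices; labels in {-1, +1}.
    """
    for _ in range(outer):
        # (1) Complete the latent selector for every example with w fixed.
        pooled = [X.T @ infer_h_topk(X, w, k) / k for X in videos]
        # (2) A few subgradient steps on the regularized hinge loss.
        for _ in range(inner):
            grad = lam * w
            for phi, y in zip(pooled, labels):
                if y * (w @ phi) < 1:
                    grad -= y * phi / len(videos)
            w = w - lr * grad
    return w

# Staged initialization per the text: train with the CPW-restricted selector
# first, use w_CPW to seed the SSP stage, then w_SSP to seed the SPP stage.
```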
67 Experiments Several experiments were conducted to evaluate the performance of the proposed event detector, using three datasets and a number of benchmark methods for activity or event recognition. [sent-521, score-0.742]
68 Unless otherwise specified, all experiments relied on the popular spatio-temporal interest point (STIP) descriptor of [13], and parameters of dynamic pooling were selected by cross-validation in the training set. [sent-523, score-0.488]
69 While not really an open-source video collection (many of the sequences are extracted from sports broadcasts and depict a single well-defined activity), this dataset is challenging for two main reasons: 1) some activities (e.g. [sent-527, score-0.356]
70 , “tennis serve”, or “basketball layup”) have a variety of signature behaviors of variable location or duration, due to intra-class variability and poor segmentation/alignment; and 2) it contains pairs of confusing activities (e.g. [sent-529, score-0.454]
71 , sub-types of a common category, such as the weight lifting activities of “snatch” and “clean-andjerk”), whose discrimination requires fine-grained models of temporal structure. [sent-531, score-0.352]
72 Low-level features were extracted from video segments of 30 frames (with an overlap of 15 frames) and quantized with a 4000-word codebook. [sent-532, score-0.273]
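A minimal sketch of this feature pipeline, assuming the stated settings (30-frame segments, 15-frame stride, 4000-word codebook); the function name, input layout, and brute-force quantization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def segment_histograms(point_frames, descriptors, codebook,
                       seg_len=30, stride=15):
    """Quantize local (e.g., STIP) descriptors and pool word counts into
    overlapping temporal segments.

    point_frames : (n,) int array, frame index of each interest point
    descriptors  : (n, d) array of local descriptors
    codebook     : (V, d) array of visual-word centers (V = 4000 in the text)
    Returns a (num_segments, V) matrix of per-segment word counts.
    """
    # Brute-force nearest-codeword assignment; a kd-tree scales better.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    T = int(point_frames.max()) + 1
    starts = list(range(0, max(T - seg_len, 0) + 1, stride))
    H = np.zeros((len(starts), codebook.shape[0]))
    for row, t0 in enumerate(starts):
        in_seg = (point_frames >= t0) & (point_frames < t0 + seg_len)
        np.add.at(H[row], words[in_seg], 1)   # count words in this segment
    return H
```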
73 Pooling Strategy We first evaluated the benefits of the various pooling structures of Section 4. [sent-534, score-0.417]
74 The top of Figure 3 shows results for four structures: average pooling over the whole sequence (BoF), average pooling over a continuous window (CW) (t, δ), the temporally localized (TL) selector, and the unconstrained (U) selector. [sent-535, score-0.709]
75 The latter two were repeated for two feature configurations - single segments (SSP) and segment pairs (SPP) - for a total of 6 configurations. [sent-536, score-0.261]
76 All dynamic pooling mechanisms outperformed BoF, with gains as high as 10%. [sent-537, score-0.52]
77 The only exception was the U selector which, while beating BoF and CW, underperformed its temporally localized counterpart (TL). [sent-541, score-0.325]
78 This suggests that it is important to rely on a flexible selector h, but it helps to localize the region from which segments are selected. [sent-542, score-0.306]
79 With respect to features, pooling of segment pairs (SPP) substantially outperformed single segment pooling (SSP). [sent-543, score-1.05]
80 This is intuitive, since the SPP representation accounts for long-term temporal video structure, which is important for the discrimination of similar activities (see discussion below). [sent-544, score-0.456]
81 Given these observations, we adopted the TL pooling strategy in all remaining experiments. [sent-545, score-0.417]
82 Modeling Temporal Structure We next compared the proposed detector to prior methods for modeling the temporal structure of complex activities. [sent-546, score-0.367]
83 Keyframes of the characteristic segments are shown with their anchor points in the timeline. [sent-556, score-0.29]
84 This suggests that there are two important components of activity representation: 1) the selection of signature segments depicting characteristic behaviors; and 2) the temporal structure of these behaviors. [sent-569, score-0.664]
85 Note, in fact, that the prior models underperform even TL-SSP on categories with characteristic behaviors widely scattered across the video, e.g. [sent-573, score-0.394]
86 This is illustrated in Figure 4, which shows the segments selected by TL-SSP for the activities “tennis-serve”, “basketball layup” and “bowling”. [sent-576, score-0.329]
87 Note that, despite the large variability in the location of the characteristic behaviors in the videos of these categories, e.g. [sent-577, score-0.463]
88 This ability is also quantified in Figure 3 by a small experiment, where we 1) manually annotated the characteristic behaviors of “bowling” and “tennis-serve”, and 2) compared this ground-truth to the video portion selected by TL-SSP. [sent-580, score-0.463]
89 TRECVID-MED11 The second and third sets of experiments were conducted on the 2011 TRECVID multimedia event detection (MED) dataset [19]. [sent-589, score-0.328]
90 It contains over 45,000 videos of 15 high-level event classes (denoted “E001” to “E015”) collected from a variety of Internet resources. [sent-590, score-0.328]
91 The training set (denoted “EC”), contains 100 to 200 ground-truth instances of each event class, totaling over 2000 videos. [sent-591, score-0.328]
92 The large variation in temporal duration, scenes, illumination, cutting, resolution, etc. in these video clips, together with the size of the negative class, makes the detection task extremely difficult. [sent-595, score-0.332]
93 To improve discriminative power, we implemented the feature mapping of [26] for dynamic pooling and the baseline BoF-TP of [13]. [sent-597, score-0.525]
94 This is too much for approaches modeling holistic temporal structure, like DMS [17], VD-HMM [23], and BDS [14], which significantly underperform the baseline BoF-TP. [sent-600, score-0.299]
95 Both the identification of characteristic segments and the modeling of their temporal structure are important. [sent-674, score-0.554]
96 Conclusion We proposed a joint framework for extracting characteristic behaviors, modeling temporal structure, and recognizing activity in videos of complex events. [sent-678, score-0.601]
97 It was shown that, under this formulation, efficient and exact inference for the selection of signature video portions is possible over the combinatorial space of segment selections. [sent-679, score-0.252]
98 An experimental comparison to various benchmarks for event detection, on challenging datasets, justified the effectiveness of the proposed approach. [sent-680, score-0.328]
99 (Reference-list residue: entries [6] through [19] are garbled; the only surviving title is “Scene aligned pooling for complex video recognition.”) [sent-712, score-0.587]
100 Modeling temporal structure of decomposable motion segments for activity classification. [sent-796, score-0.487]
wordName wordTfidf (topN-words)
[('pooling', 0.417), ('event', 0.328), ('behaviors', 0.238), ('temporal', 0.192), ('segments', 0.169), ('atomic', 0.169), ('activities', 0.16), ('ssp', 0.158), ('bof', 0.152), ('selector', 0.137), ('characteristic', 0.121), ('olympic', 0.118), ('cccp', 0.11), ('spp', 0.105), ('video', 0.104), ('events', 0.103), ('latent', 0.096), ('segment', 0.092), ('wedding', 0.091), ('animal', 0.091), ('temporally', 0.087), ('activity', 0.086), ('actions', 0.082), ('action', 0.081), ('throwing', 0.078), ('party', 0.076), ('bowling', 0.074), ('birthday', 0.072), ('bds', 0.071), ('bride', 0.071), ('groom', 0.071), ('wdttxhh', 0.071), ('dynamic', 0.071), ('cutting', 0.068), ('complex', 0.066), ('trecvid', 0.065), ('devt', 0.063), ('dancing', 0.063), ('duration', 0.061), ('window', 0.061), ('dynamically', 0.061), ('localized', 0.059), ('dms', 0.059), ('fw', 0.057), ('signature', 0.056), ('cake', 0.055), ('sports', 0.054), ('throw', 0.053), ('subsequences', 0.05), ('sequence', 0.05), ('cw', 0.049), ('predictor', 0.048), ('crashes', 0.048), ('handing', 0.048), ('ilfp', 0.048), ('lfp', 0.048), ('subsequence', 0.047), ('hypothesis', 0.046), ('gaidon', 0.044), ('tl', 0.044), ('beating', 0.042), ('snatch', 0.042), ('yifw', 0.042), ('si', 0.042), ('receptive', 0.041), ('discriminant', 0.041), ('feeding', 0.041), ('programming', 0.04), ('clips', 0.04), ('structure', 0.04), ('counts', 0.04), ('flower', 0.04), ('hidden', 0.04), ('pyramid', 0.039), ('dtf', 0.039), ('devo', 0.039), ('ihi', 0.039), ('layup', 0.039), ('possibilities', 0.039), ('depict', 0.038), ('word', 0.038), ('youtube', 0.038), ('implemented', 0.037), ('detector', 0.037), ('informative', 0.037), ('united', 0.037), ('tuples', 0.037), ('etc', 0.036), ('parade', 0.035), ('vasconcelos', 0.035), ('underperform', 0.035), ('pooled', 0.035), ('continuous', 0.035), ('convex', 0.034), ('divakaran', 0.034), ('accelerates', 0.034), ('attribute', 0.033), ('modeling', 0.032), ('outperformed', 0.032), ('food', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event-specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
2 0.34851846 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
3 0.33531204 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate for the limitations of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
4 0.30466959 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
Author: Arash Vahdat, Kevin Cannons, Greg Mori, Sangmin Oh, Ilseo Kim
Abstract: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel, multiple kernel learning (MKL) latent support vector machine (SVM) is defined, that is used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
5 0.26736039 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
6 0.24777967 396 iccv-2013-Space-Time Robust Representation for Action Recognition
7 0.23767366 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
8 0.23212855 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
9 0.21346177 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
10 0.20864011 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
11 0.19787207 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
12 0.17588505 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection
13 0.16998014 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
14 0.16871215 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
15 0.16727665 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
16 0.1575563 81 iccv-2013-Combining the Right Features for Complex Event Recognition
17 0.15645039 249 iccv-2013-Learning to Share Latent Tasks for Action Recognition
18 0.1552114 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
19 0.15505025 86 iccv-2013-Concurrent Action Detection with Structural Prediction
20 0.15220237 198 iccv-2013-Hierarchical Part Matching for Fine-Grained Visual Categorization
topicId topicWeight
[(0, 0.258), (1, 0.243), (2, 0.127), (3, 0.217), (4, 0.085), (5, 0.07), (6, 0.099), (7, -0.072), (8, -0.042), (9, -0.134), (10, -0.131), (11, -0.099), (12, -0.034), (13, 0.174), (14, -0.236), (15, -0.109), (16, 0.055), (17, 0.071), (18, 0.058), (19, 0.053), (20, 0.107), (21, -0.007), (22, -0.014), (23, -0.027), (24, -0.021), (25, 0.012), (26, -0.008), (27, 0.05), (28, 0.015), (29, 0.004), (30, -0.041), (31, -0.013), (32, -0.099), (33, 0.023), (34, -0.026), (35, 0.011), (36, -0.015), (37, -0.07), (38, 0.079), (39, -0.003), (40, 0.006), (41, 0.066), (42, -0.004), (43, 0.014), (44, -0.023), (45, 0.067), (46, -0.01), (47, 0.045), (48, -0.008), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.96970582 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event-specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
2 0.86685169 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
3 0.86066931 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate for the limitations of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
4 0.82529765 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
Author: Chen Sun, Ram Nevatia
Abstract: The goal of high-level event classification from videos is to assign a single, high-level event label to each query video. Traditional approaches represent each video as a set of low-level features and encode it into a fixed-length feature vector (e.g. Bag-of-Words), which leaves a big gap between low-level visual features and high-level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed-length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset show our approach performs favorably over the state-of-the-art.
5 0.82296002 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
6 0.78165126 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
7 0.76177764 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
8 0.75429708 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
9 0.66203445 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
10 0.6125049 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
11 0.59521097 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
12 0.53322572 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
13 0.52751541 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition
14 0.52645797 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
15 0.52560425 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
16 0.51328439 243 iccv-2013-Learning Slow Features for Behaviour Analysis
17 0.50845194 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
18 0.50756949 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
19 0.50354999 38 iccv-2013-Action Recognition with Actons
20 0.49237102 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
topicId topicWeight
[(2, 0.057), (4, 0.016), (7, 0.013), (12, 0.036), (26, 0.092), (31, 0.058), (34, 0.012), (35, 0.012), (40, 0.016), (42, 0.093), (64, 0.089), (68, 0.014), (73, 0.026), (77, 0.023), (78, 0.022), (80, 0.145), (89, 0.183), (95, 0.011), (98, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.88999379 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event-specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
2 0.87587011 171 iccv-2013-Fix Structured Learning of 2013 ICCV paper k2opt.pdf
Author: empty-author
Abstract: Submodular functions can be exactly minimized in polynomial time, and the special case that graph cuts solve with max flow [19] has had significant impact in computer vision [5, 21, 28]. In this paper we address the important class of sum-of-submodular (SoS) functions [2, 18], which can be efficiently minimized via a variant of max flow called submodular flow [6]. SoS functions can naturally express higher order priors involving, e.g., local image patches; however, it is difficult to fully exploit their expressive power because they have so many parameters. Rather than trying to formulate existing higher order priors as an SoS function, we take a discriminative learning approach, effectively searching the space of SoS functions for a higher order prior that performs well on our training set. We adopt a structural SVM approach [15, 34] and formulate the training problem in terms of quadratic programming; as a result we can efficiently search the space of SoS priors via an extended cutting-plane algorithm. We also show how the state-of-the-art max flow method for vision problems [11] can be modified to efficiently solve the submodular flow problem. Experimental comparisons are made against the OpenCV implementation of the GrabCut interactive segmentation technique [28], which uses hand-tuned parameters instead of machine learning. On a standard dataset [12] our method learns higher order priors with hundreds of parameter values, and produces significantly better segmentations. While our focus is on binary labeling problems, we show that our techniques can be naturally generalized to handle more than two labels.
3 0.87413526 154 iccv-2013-Face Recognition via Archetype Hull Ranking
Author: Yuanjun Xiong, Wei Liu, Deli Zhao, Xiaoou Tang
Abstract: The archetype hull model is playing an important role in large-scale data analytics and mining, but is rarely applied to vision problems. In this paper, we migrate such a geometric model to address face recognition and verification together through proposing a unified archetype hull ranking framework. Upon a scalable graph characterized by a compact set of archetype exemplars whose convex hull encompasses most of the training images, the proposed framework explicitly captures the relevance between any query and the stored archetypes, yielding a rank vector over the archetype hull. The archetype hull ranking is then executed on every block of face images to generate a blockwise similarity measure that is achieved by comparing two different rank vectors with respect to the same archetype hull. After integrating blockwise similarity measurements with learned importance weights, we accomplish a sensible face similarity measure which can support robust and effective face recognition and verification. We evaluate the face similarity measure in terms of experiments performed on three benchmark face databases, Multi-PIE, Pubfig83, and LFW, demonstrating performance superior to the state-of-the-art.
4 0.85375255 284 iccv-2013-Multiview Photometric Stereo Using Planar Mesh Parameterization
Author: Jaesik Park, Sudipta N. Sinha, Yasuyuki Matsushita, Yu-Wing Tai, In So Kweon
Abstract: We propose a method for accurate 3D shape reconstruction using uncalibrated multiview photometric stereo. A coarse mesh reconstructed using multiview stereo is first parameterized using a planar mesh parameterization technique. Subsequently, multiview photometric stereo is performed in the 2D parameter domain of the mesh, where all geometric and photometric cues from multiple images can be treated uniformly. Unlike traditional methods, there is no need for merging view-dependent surface normal maps. Our key contribution is a new photometric stereo based mesh refinement technique that can efficiently reconstruct meshes with extremely fine geometric details by directly estimating a displacement texture map in the 2D parameter domain. We demonstrate that intricate surface geometry can be reconstructed using several challenging datasets containing surfaces with specular reflections, multiple albedos and complex topologies.
5 0.84319198 338 iccv-2013-Randomized Ensemble Tracking
Author: Qinxun Bai, Zheng Wu, Stan Sclaroff, Margrit Betke, Camille Monnier
Abstract: We propose a randomized ensemble algorithm to model the time-varying appearance of an object for visual tracking. In contrast with previous online methods for updating classifier ensembles in tracking-by-detection, the weight vector that combines weak classifiers is treated as a random variable and the posterior distribution for the weight vector is estimated in a Bayesian manner. In essence, the weight vector is treated as a distribution that reflects the confidence among the weak classifiers used to construct and adapt the classifier ensemble. The resulting formulation models the time-varying discriminative ability among weak classifiers so that the ensembled strong classifier can adapt to the varying appearance, backgrounds, and occlusions. The formulation is tested in a tracking-by-detection implementation. Experiments on 28 challenging benchmark videos demonstrate that the proposed method can achieve results comparable to and often better than those of state-of-the-art approaches.
6 0.84091389 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
7 0.84065175 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
8 0.83776152 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness
9 0.83664209 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
10 0.83635974 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
11 0.83621138 180 iccv-2013-From Where and How to What We See
12 0.83559424 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
13 0.83533633 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
14 0.83516788 150 iccv-2013-Exemplar Cut
15 0.83425009 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
16 0.83372962 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
17 0.8337152 414 iccv-2013-Temporally Consistent Superpixels
18 0.83307374 379 iccv-2013-Semantic Segmentation without Annotating Segments
19 0.8329531 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
20 0.83253956 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses