iccv iccv2013 iccv2013-203 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially, as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. [sent-5, score-0.652]
2 For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). [sent-6, score-0.474]
3 The positive exemplars which exactly convey the precise semantics of an event are hard to obtain. [sent-13, score-1.208]
4 It would be beneficial to utilize the related exemplars for complex event detection. [sent-14, score-1.195]
5 However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. [sent-15, score-1.308]
6 Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the “attempting a bike trick” event. [sent-16, score-0.659]
7 To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. [sent-19, score-1.294]
8 Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection. [sent-20, score-0.713]
9 Both video sequences are of the event “marriage proposal” in the TRECVID MED dataset. [sent-26, score-0.582]
10 The first event took place in a classroom, while in the second video a man proposed outdoors. [sent-27, score-0.588]
11 Two video sequences of the event “marriage proposal” in the TRECVID MED dataset. [sent-37, score-0.582]
12 An event may take place in different places with huge variations in terms of lighting, resolution, duration and so forth. [sent-38, score-0.495]
13 Compared to concept (action, scene and object) analysis, event detection is more challenging in the following aspects: firstly, an event is a higher-level semantic abstraction of video sequences than a concept and consists of multiple concepts. [sent-40, score-1.134]
14 For example, a “marriage proposal” event can be described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). [sent-41, score-0.474]
15 Secondly, a concept can be detected in a shorter video sequence or even in a single frame, but an event is usually contained in a longer video clip. [sent-50, score-0.646]
16 In contrast, a video sequence of the event “birthday party” may last longer. [sent-52, score-0.56]
17 If we see only a few frames showing some people chatting, we could not know if it is a “birthday party” event or not. [sent-53, score-0.601]
18 Thirdly, different video sequences of a particular event may have dramatic variations. [sent-54, score-0.582]
19 Taking the “giving directions to a location” event as an example, it may take place in the street, inside a shopping mall or even in a car, where the visual features are very different. [sent-55, score-0.474]
20 While much progress has been made on visual concept recognition recently, the detection of complex events is still in its infancy. [sent-58, score-0.555]
21 Since 2012, only a limited number of studies focusing on complex event analysis of web videos have been reported. [sent-62, score-0.61]
22 In [6], researchers proposed a graph-based approach to analyze the relationships among different concepts, such as action, scene, and object, for complex event analysis. [sent-63, score-0.536]
23 However, they only focused on event recognition while event detection is a more challenging task. [sent-64, score-0.996]
24 The authors of [14] have experimentally compared seven visual features for complex event detection in web videos and found that MoSIFT [3] is the most discriminative feature. [sent-66, score-0.658]
25 The work in [9] proposed to adapt auxiliary knowledge from a pre-labeled video dataset to facilitate event detection where only 10 positive exemplars are available. [sent-68, score-1.342]
26 The study in [12] has combined acoustic, texture and visual features for event detection. [sent-69, score-0.501]
27 The authors of [18] have proposed a decision-level fusion algorithm for event detection, which jointly considers a threshold and a smoothing factor to learn optimal weights of multiple features. [sent-71, score-0.522]
28 In the literature, the Support Vector Machine (SVM) with a χ2 kernel has been shown to be an effective tool for event detection in research papers and the TRECVID competition [20] [11] [9] [12] [14]. [sent-72, score-0.543]
29 In [10], event detection and video attribute classification are integrated into a joint framework to leverage the mutual benefit. [sent-73, score-0.608]
30 Compared to concepts, an event is a higher-level abstraction of a longer video clip. [sent-74, score-0.59]
31 Due to the semantic richness of an event in longer web videos, we may need more positive exemplars for training. [sent-78, score-1.251]
32 For example, if all the positive exemplars of “marriage proposal” we have are indoor videos, the system may not be able to detect the second video in Figure 1 as a “marriage proposal”. [sent-79, score-0.82]
33 Our goal is to detect a complex event in web videos when only 10 positive and 10 related video exemplars are available. [sent-85, score-1.468]
34 The premise is that it is a non-trivial task to collect a positive exemplar video which conveys the precise semantics of a particular event and excludes any irrelevant information. [sent-86, score-0.78]
35 The related exemplars can be of any other event, e.g. [sent-89, score-0.659]
36 Due to these difficulties, although NIST has provided related exemplars for event detection in TRECVID, none of the existing systems has ever used these data. [sent-93, score-1.181]
37 In this paper, we aim to detect complex events using only 10 positive exemplars along with 10 related video exemplars. [sent-94, score-2.057]
38 To the best of our knowledge, this paper is the first research attempt to automatically assess the relatedness of each related exemplar and utilize them adaptively, thereby resulting in more reliable event detection when the positive data are few. [sent-95, score-0.937]
39 Motivations and Problem Formulation: Detecting complex events using few positive exemplars is more challenging than the settings of existing works, which use more than 100 positive exemplars for training [12] [14]. [sent-97, score-1.975]
40 Figure 2 shows some frames from a video clip marked as related to the event “marriage proposal” in TRECVID MED dataset. [sent-98, score-0.653]
41 If we have sufficient positive exemplars for a particular event, including the related exemplars may not improve the performance. [sent-102, score-1.393]
42 However, given that only a few positive exemplars are available, it is crucial to make the utmost use of all the information. [sent-103, score-0.734]
43 Related exemplars are easier to obtain, but are much more difficult to use. [sent-104, score-0.621]
44 Simply assigning identical labels to different related exemplars does not make much sense as a related exemplar can be either closely or loosely related to the target event. [sent-110, score-0.896]
45 The video looks pretty much like a “marriage proposal” event but it is not. [sent-112, score-0.586]
46 Consequently, adaptively assigning soft labels to related exemplars by automatically assessing their relatedness turns out to be an important research challenge. [sent-116, score-0.973]
47 Next, we give our algorithm which is able to assign labels to related exemplars adaptively. [sent-117, score-0.69]
48 Hereafter, a null video is any video sequence other than the positive and related exemplars. [sent-130, score-0.492]
49 To better differentiate related and positive exemplars, two label sets, Y˜ and Y, are used: if x_i is a related exemplar, its adaptive soft label Y_i^a is set to be Y_i^a = 1 − S_i, while a true positive exemplar keeps the label 1. [sent-141, score-0.475]
50 The underlying intuition is that related exemplars have positive attributes, but are less positive than the true positive exemplars. [sent-143, score-0.998]
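To make the labeling scheme concrete, here is a minimal Python sketch with hypothetical variable names; it assumes a per-exemplar relatedness penalty S_i in [0, 1], matching the subtraction described above:

```python
import numpy as np

def adaptive_labels(is_positive, is_related, S):
    """Build the adaptive soft-label vector Y^a for one event.

    is_positive, is_related: boolean masks over the training videos.
    S: relatedness penalties; a larger S[i] means related exemplar i
       is less related to the target event, so more is subtracted.
    Null videos keep the label 0.
    """
    y = np.zeros(len(is_positive), dtype=float)
    y[is_positive] = 1.0                 # true positives keep the full label
    y[is_related] = 1.0 - S[is_related]  # related exemplars: soft labels below 1
    return y
```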
51 The basic model to adaptively assess the positiveness of related exemplars is formulated as a joint minimization over the transformation matrix P, the relatedness variables S and the adaptive labels Y^a, i.e., \min_{P,S,Y^a}(\cdot) in Eq. (1). [sent-149, score-0.744]
52 As the adaptive non-negative label matrix Y^a is an optimization variable in (1), the model is able to adaptively utilize the related exemplars on a per-exemplar basis. [sent-168, score-0.744]
53 When there are no related exemplars available, Y is the same as Y˜, and the algorithm reduces to least-squares regression. [sent-169, score-0.659]
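As an illustration of how such a model can be solved, the sketch below alternates a closed-form ridge-regression step for P with a refit of the adaptive labels of related exemplars; this is a simplified stand-in for the paper's solver, and the regularizer and the label-update rule are our assumptions:

```python
import numpy as np

def fit_adaptive_model(X, y_base, related_mask, lam=1.0, n_iter=10):
    """X: (n, d) video features; y_base: fixed labels Y~ (1 for positive
    and related exemplars, 0 for null videos); related_mask: rows that
    are related exemplars. Alternating minimization: with labels fixed,
    P has a closed-form least-squares solution; with P fixed, the soft
    labels of related exemplars move toward the current predictions,
    clipped to [0, 1] so they never exceed the true-positive label."""
    n, d = X.shape
    y = y_base.astype(float).copy()
    for _ in range(n_iter):
        P = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        pred = X @ P
        y[related_mask] = np.clip(pred[related_mask], 0.0, 1.0)
    return P, y
```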
54 The less related an exemplar is to the target event, the larger the value S_i that will be subtracted from its label. As the number of null videos is much larger than the numbers of positive and related exemplars, we further cluster the null videos into k clusters by K-means as a preprocessing step. [sent-172, score-0.46]
55 In this way, the training exemplars are grouped into k negative sets and one positive set (including related exemplars). [sent-173, score-0.818]
56 As in (1), if x_i is a null video from the j-th negative cluster, then Y^r_{ij} = 1; if x_i is a positive or related exemplar, then Y^r_{i,k+1} = 1. [sent-178, score-0.529]
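A sketch of this preprocessing step, assuming scikit-learn's KMeans and rows ordered as null videos first, then positive/related exemplars (an illustrative layout, not the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_label_matrix(X_null, n_pos_rel, k=5, seed=0):
    """Cluster null videos into k groups and build Y^r with k+1 columns:
    column j marks the j-th negative cluster; the last column marks the
    positive set (positive plus related exemplars)."""
    clusters = KMeans(n_clusters=k, random_state=seed).fit_predict(X_null)
    n_null = X_null.shape[0]
    Y_r = np.zeros((n_null + n_pos_rel, k + 1))
    Y_r[np.arange(n_null), clusters] = 1.0  # one-hot over negative clusters
    Y_r[n_null:, k] = 1.0                   # positive and related exemplars
    return Y_r
```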
57 Further to the basic model shown in (1), we constrain the transformation matrix P in (2) to share some common structure with the detector learnt from positive and null exemplars only, as positive exemplars are more accurate than related ones. [sent-179, score-1.589]
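One simple way to realize such a structural constraint (our assumption of a plausible form, not necessarily the paper's exact formulation) is to regularize P toward a reference detector P0 learnt from positive and null exemplars only:

```python
import numpy as np

def fit_with_reference(X, y, P0, lam=1.0, mu=1.0):
    """Solve min_P ||X P - y||^2 + lam ||P||^2 + mu ||P - P0||^2,
    where P0 is a detector trained on positive and null exemplars only.
    Setting the gradient to zero gives the closed form
    (X^T X + (lam + mu) I) P = X^T y + mu * P0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + (lam + mu) * np.eye(d), X.T @ y + mu * P0)
```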
58 The Dataset: In 2011, NIST collected a large-scale video dataset, namely the MED 11DEV-O collection, as the test bed for event detection. [sent-401, score-0.56]
59 NIST provided about 2,000 positive exemplars of the 10 new events as the MED 12 Develop collection. [sent-406, score-0.805]
60 There are two types of event detection tasks defined by NIST. [sent-407, score-0.522]
61 The first one is to detect a complex event using about 150 positive exemplars. [sent-408, score-0.62]
62 The other one is to detect events using only 10 positive exemplars and 10 related exemplars. [sent-409, score-0.843]
63 We use the 10 positive exemplars and the related exemplars of each event identified by NIST for training. [sent-410, score-1.867]
64 Since the labels for the TRECVID MED 12 testing collection are not released, we remove the 10 positive and related exemplars from the MED 12 Develop collection and merge the remaining videos into the MED 11DEV-O collection as the testing data. [sent-413, score-0.869]
65 Leveraging related exemplars for event detection is so far an unexploited area. [sent-425, score-1.181]
66 To show the advantage of our algorithm in utilizing related exemplars, we report the results of SVM and KR using related exemplars as positive exemplars, which are denoted as SVMRP and KRRP (Table 1). [sent-429, score-0.81]
67 In addition, as the related exemplars may not be closely related to the target event, we also report the results of SVM and KR using related exemplars as negative exemplars, which are denoted as SVMRN and KRRN. [sent-437, score-1.425]
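The four baselines differ only in the label assigned to related exemplars; a small sketch of that assignment (the RP/RN names follow the text, the rest is illustrative):

```python
import numpy as np

def baseline_labels(y_base, related_mask, mode):
    """y_base: +1 for positive exemplars, -1 for null videos.
    mode 'RP' treats related exemplars as positive (SVMRP / KRRP);
    mode 'RN' treats them as negative (SVMRN / KRRN)."""
    y = y_base.astype(float)
    y[related_mask] = 1.0 if mode == "RP" else -1.0
    return y
```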
68 The χ2 kernel as described in [16] has been demonstrated to be the most effective kernel for event detection [12] [14] [20] [11]. [sent-441, score-0.564]
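For reference, the exponential χ2 kernel commonly used with bag-of-words histograms can be computed as below; this is the standard textbook form, given here as a sketch rather than the exact setup of [16]:

```python
import numpy as np

def chi2_kernel(A, B, gamma=None):
    """Exponential chi-square kernel between histogram rows of A and B:
    K(x, z) = exp(-gamma * sum_i (x_i - z_i)^2 / (x_i + z_i)),
    with the convention 0/0 = 0. If gamma is None, it is set to the
    inverse mean chi-square distance (a common heuristic)."""
    num = (A[:, None, :] - B[None, :, :]) ** 2
    den = A[:, None, :] + B[None, :, :]
    D = np.where(den > 0, num / np.where(den > 0, den, 1.0), 0.0).sum(axis=2)
    if gamma is None:
        gamma = 1.0 / max(D.mean(), 1e-12)
    return np.exp(-gamma * D)
```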
69 The images of Bush are divided into two subsets: 10 as positive training exemplars and the remaining 520 images for testing. [sent-453, score-0.734]
70 Images of the other Bush are used as related exemplars because of the father-child relationship. [sent-456, score-0.659]
71 1000 null images are used as negative training exemplars and the remaining 4746 images are used as null testing images. [sent-461, score-0.833]
72 The frames sampled from two video sequences marked as related exemplars of the event “birthday party” by NIST. [sent-472, score-1.296]
73 The frames sampled from two video sequences marked as related to the event “town hall meeting” by NIST. [sent-478, score-0.675]
74 Figure 4 shows the frames sampled from two video sequences marked as related exemplars of the event “birthday party” by NIST. [sent-481, score-1.296]
75 It is also related to a “birthday party” event as there are several people in the video. [sent-485, score-0.541]
76 Figure 5 shows another example, which includes the frames sampled from two related exemplars of the event “town hall meeting”. [sent-490, score-1.161]
77 These examples also demonstrate that it is less reasonable to fix the labels of related exemplars as a smaller constant, e.g. [sent-496, score-0.69]
78 Our algorithm outperforms KR by almost 9% in relative terms, indicating that it is beneficial to utilize related exemplars for event detection. [sent-502, score-1.162]
79 As human assessment of relatedness is subjective, the selection of related exemplars is somewhat arbitrary. [sent-503, score-0.811]
80 Some of the related exemplars should be regarded as positive exemplars but others are much less positive. [sent-504, score-1.393]
81 This observation indicates that different related exemplars should be utilized adaptively. [sent-507, score-0.659]
82 Using related exemplars as either positive or negative will degrade the overall performance for both SVM and KR. [sent-508, score-0.818]
83 That could also be the reason why none of the existing event detection systems built in 2012 used related exemplars for event detection [20] [11] [9] [14], although NIST has provided them. [sent-509, score-1.703]
84 The upper two subfigures of Figure 6 show the event detection performance using the MoSIFT feature. [sent-513, score-0.58]
85 The lower two subfigures of Figure 6 show the event detection performance using the Color SIFT feature. [sent-514, score-0.58]
86 As SVM and kernel regression models have been demonstrated to be very effective for complex event detection [12] [14] [20] [11] [9], this experiment shows that our model not only gains the best MAP over all the events, but its performance is also stable across multiple events. [sent-517, score-0.58]
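Results above are reported as MAP over events; for clarity, average precision for a single event can be computed as follows (standard definition, illustrative only):

```python
import numpy as np

def average_precision(scores, labels):
    """Rank test videos by detection score, then average the precision
    at every position holding a true positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(labels)[order].astype(bool)
    precision_at_k = np.cumsum(hits) / (np.arange(hits.size) + 1)
    return precision_at_k[hits].mean() if hits.any() else 0.0

# MAP is the mean of average_precision over all events.
```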
87 The Limitations: Although the proposed algorithm gains promising performance in event detection, it still has some limitations. [sent-520, score-0.499]
88 As the number of positive examples is much smaller than that of negative examples, all the induced soft labels of related videos will be the same as negative labels. [sent-523, score-0.433]
89 If we look at the video, it is very similar to the event visually and should have a higher score. [sent-526, score-0.474]
90 The frames sampled from a video sequence marked as related to the event “Getting a vehicle unstuck” by NIST. [sent-530, score-0.653]
91 Conclusions: We have focused on how to utilize related exemplars for event detection when the positive exemplars are few. [sent-533, score-1.944]
92 The main challenge is that the human labels of related exemplars are subjective. [sent-534, score-0.713]
93 We propose to automatically learn the relatedness and assign soft labels to related exemplars adaptively. [sent-535, score-0.917]
94 Extensive experiments indicate that 1) taking related exemplars either as positive or negative exemplars may degrade the performance; 2) our algorithm is able to effectively leverage the information from related exemplars by exploiting the relatedness of a video sequence. [sent-536, score-2.312]
95 Future work will apply our model to interactive information retrieval where the users may not be able to get the exact search exemplars for relevance feedback. [sent-537, score-0.621]
96 Recognizing complex events using large margin joint low-level event model. [sent-584, score-0.578]
Knowledge adaptation for ad hoc multimedia event detection with few exemplars. [sent-605, score-0.557]
98 Multimodal feature fusion for robust event detection in web videos. [sent-641, score-0.565]
99 Evaluation of low-level features and their combinations for complex event detection in open source video. [sent-658, score-0.555]
100 Informedia e-lamp @ TRECVID 2012, multimedia event detection and recounting. [sent-703, score-0.557]
wordName wordTfidf (topN-words)
[('exemplars', 0.621), ('event', 0.474), ('marriage', 0.264), ('med', 0.17), ('yrk', 0.132), ('relatedness', 0.128), ('trecvid', 0.126), ('positive', 0.113), ('exemplar', 0.107), ('mosift', 0.102), ('etett', 0.099), ('soft', 0.099), ('video', 0.086), ('null', 0.083), ('proposal', 0.081), ('party', 0.074), ('xtp', 0.073), ('events', 0.071), ('birthday', 0.07), ('george', 0.064), ('videos', 0.06), ('subfigures', 0.058), ('eq', 0.057), ('adaptively', 0.056), ('nist', 0.055), ('yra', 0.054), ('yia', 0.051), ('bush', 0.049), ('svmrn', 0.049), ('twt', 0.049), ('xtpt', 0.049), ('detection', 0.048), ('negative', 0.046), ('kr', 0.044), ('lady', 0.044), ('web', 0.043), ('label', 0.042), ('young', 0.041), ('related', 0.038), ('asks', 0.036), ('sebe', 0.035), ('multimedia', 0.035), ('natarajan', 0.034), ('bouquet', 0.033), ('cheers', 0.033), ('girlfriend', 0.033), ('kneeling', 0.033), ('krrp', 0.033), ('prom', 0.033), ('svmrp', 0.033), ('wtwt', 0.033), ('complex', 0.033), ('labels', 0.031), ('actions', 0.031), ('abstraction', 0.03), ('singapore', 0.03), ('utilize', 0.029), ('positiveness', 0.029), ('blacklight', 0.029), ('psc', 0.029), ('supercomputing', 0.029), ('guys', 0.029), ('restaurant', 0.029), ('concepts', 0.029), ('people', 0.029), ('xi', 0.028), ('frames', 0.028), ('man', 0.028), ('acoustic', 0.027), ('kpca', 0.027), ('town', 0.027), ('marked', 0.027), ('tamrakar', 0.026), ('pretty', 0.026), ('gains', 0.025), ('music', 0.024), ('boy', 0.024), ('klaser', 0.024), ('assessment', 0.024), ('toy', 0.024), ('action', 0.024), ('svm', 0.023), ('ma', 0.023), ('someone', 0.023), ('confronted', 0.023), ('imbalanced', 0.023), ('derived', 0.023), ('target', 0.023), ('sift', 0.023), ('vitaladevuni', 0.022), ('walker', 0.022), ('collection', 0.022), ('sequences', 0.022), ('dancing', 0.022), ('yt', 0.021), ('kernel', 0.021), ('duration', 0.021), ('iarpa', 0.021), ('hauptmann', 0.021), ('prasad', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially, as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
2 0.30203563 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
3 0.29981613 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen’s interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
4 0.29561883 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT) to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on HMDB [8] action recognition dataset and CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
5 0.28025907 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
6 0.23212855 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
7 0.21717446 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
8 0.20877354 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
9 0.19060718 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.18282272 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling
11 0.1678374 81 iccv-2013-Combining the Right Features for Complex Event Recognition
12 0.15678926 150 iccv-2013-Exemplar Cut
13 0.1565022 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
14 0.13016717 62 iccv-2013-Bird Part Localization Using Exemplar-Based Models with Enforced Pose and Subcategory Consistency
15 0.10324166 149 iccv-2013-Exemplar-Based Graph Matching for Robust Facial Landmark Localization
16 0.098053455 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
17 0.097855963 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
18 0.094682664 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
19 0.094315194 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection
20 0.089127257 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
topicId topicWeight
[(0, 0.154), (1, 0.168), (2, 0.05), (3, 0.117), (4, 0.076), (5, 0.062), (6, 0.111), (7, -0.044), (8, -0.041), (9, -0.109), (10, -0.178), (11, -0.175), (12, -0.067), (13, 0.207), (14, -0.28), (15, -0.094), (16, 0.035), (17, 0.04), (18, 0.021), (19, 0.079), (20, 0.126), (21, 0.0), (22, 0.019), (23, 0.092), (24, 0.064), (25, 0.04), (26, -0.025), (27, 0.085), (28, -0.019), (29, -0.084), (30, -0.128), (31, -0.025), (32, 0.071), (33, 0.021), (34, 0.035), (35, -0.002), (36, -0.028), (37, -0.009), (38, -0.028), (39, 0.04), (40, 0.044), (41, -0.109), (42, 0.043), (43, 0.043), (44, 0.038), (45, 0.035), (46, -0.063), (47, -0.071), (48, -0.014), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.96193349 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good perform- z. ance for complex event detection.
2 0.8251465 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
3 0.81627417 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen’s interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
4 0.79007107 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
Author: Chen Sun, Ram Nevatia
Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.
5 0.73987764 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT) to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on HMDB [8] action recognition dataset and CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
6 0.73306602 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
7 0.68369031 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
8 0.59007806 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
9 0.47304815 81 iccv-2013-Combining the Right Features for Complex Event Recognition
10 0.4409838 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling
11 0.42648327 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
12 0.42373973 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
13 0.42101076 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
14 0.40908447 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
15 0.40846571 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
16 0.39392713 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
17 0.33853865 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection
18 0.31358498 167 iccv-2013-Finding Causal Interactions in Video Sequences
19 0.30694595 397 iccv-2013-Space-Time Tradeoffs in Photo Sequencing
20 0.2961202 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
topicId topicWeight
[(2, 0.069), (4, 0.064), (7, 0.018), (12, 0.015), (26, 0.057), (31, 0.039), (42, 0.082), (64, 0.052), (68, 0.238), (73, 0.023), (77, 0.015), (78, 0.019), (89, 0.18), (98, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.81143248 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially, as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
2 0.7503531 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
Author: Kaiye Wang, Ran He, Wei Wang, Liang Wang, Tieniu Tan
Abstract: Cross-modal matching has recently drawn much attention due to the widespread existence of multimodal data. It aims to match data from different modalities, and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous works mainly focus on solving the first problem. In this paper, we propose a novel coupled linear regression framework to deal with both problems. Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. And in the learning procedure, the ℓ2,1-norm penalties are imposed on the two projection matrices separately, which leads to selecting relevant and discriminative features from coupled feature spaces simultaneously. A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of different modal data with connections. We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. The experimental results on two challenging cross-modal datasets demonstrate that the proposed method outperforms the state-of-the-art approaches.
3 0.72360873 436 iccv-2013-Unsupervised Intrinsic Calibration from a Single Frame Using a "Plumb-Line" Approach
Author: R. Melo, M. Antunes, J.P. Barreto, G. Falcão, N. Gonçalves
Abstract: Estimating the amount and center of distortion from lines in the scene has been addressed in the literature by the so-called “plumb-line” approach. In this paper we propose a new geometric method to estimate not only the distortion parameters but the entire camera calibration (up to an “angular” scale factor) using a minimum of 3 lines. We propose a new framework for the unsupervised simultaneous detection of natural image lines and camera parameter estimation, enabling a robust calibration from a single image. Comparative experiments with existing automatic approaches for the distortion estimation and with ground truth data are presented.
4 0.72237933 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
5 0.71481979 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps
Author: Fan Wang, Qixing Huang, Leonidas J. Guibas
Abstract: Joint segmentation of image sets has great importance for object recognition, image classification, and image retrieval. In this paper, we aim to jointly segment a set of images starting from a small number of labeled images or none at all. To allow the images to share segmentation information with each other, we build a network that contains segmented as well as unsegmented images, and extract functional maps between connected image pairs based on image appearance features. These functional maps act as general property transporters between the images and, in particular, are used to transfer segmentations. We define and operate in a reduced functional space optimized so that the functional maps approximately satisfy cycle-consistency under composition in the network. A joint optimization framework is proposed to simultaneously generate all segmentation functions over the images so that they both align with local segmentation cues in each particular image, and agree with each other under network transportation. This formulation allows us to extract segmentations even with no training data, but can also exploit such data when available. The collective effect of the joint processing using functional maps leads to accurate information sharing among images and yields superior segmentation results, as shown on the iCoseg, MSRC, and PASCAL data sets.
6 0.69049704 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
7 0.69017231 158 iccv-2013-Fast High Dimensional Vector Multiplication Face Recognition
8 0.6901719 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
9 0.68619466 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
10 0.68586993 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
11 0.68308896 71 iccv-2013-Category-Independent Object-Level Saliency Detection
12 0.68284947 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
13 0.67728835 195 iccv-2013-Hidden Factor Analysis for Age Invariant Face Recognition
14 0.67480743 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
15 0.67456323 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
16 0.67122912 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
17 0.67077887 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
18 0.67075908 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
19 0.67052817 128 iccv-2013-Dynamic Probabilistic Volumetric Models
20 0.67003167 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning