iccv iccv2013 iccv2013-4 knowledge-graph by maker-knowledge-mining

4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification


Source: pdf

Author: Chen Sun, Ram Nevatia

Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), leaving a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset show that our approach performs favorably over the state-of-the-art.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The goal of high level event classification from videos is to assign a single, high level event label to each query video. [sent-2, score-0.88]

2 Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). [sent-6, score-0.918]

3 A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). [sent-7, score-0.679]

4 We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. [sent-8, score-0.412]

5 Compared with these, high level event classification for web videos focuses on classifying complex events. [sent-15, score-0.596]

6 (Top) A video from the Wedding ceremony event is separated into a sequence of clips, each of which corresponds to an activity concept like kissing and dancing. [sent-27, score-1.154]

7 (Bottom) Each dimension in our representation corresponds to an activity concept transition. [sent-28, score-0.634]

8 First, unlike lower level activity concepts which have relatively discriminative motion or visual patterns, most high level events consist of complex human object interaction and various scene backgrounds, which pose difficulty for the existing low level frameworks. [sent-36, score-1.152]

9 Finally, many activity concepts have pairwise relationships in the temporal domain. [sent-38, score-0.776]

10 To overcome these limitations, we propose to encode activity concept transitions with Fisher kernel techniques [7]. [sent-41, score-0.795]

11 In this model, a video event is a sequence of activity concepts. [sent-46, score-0.689]

12 By using this model, we bridge low level features and high level events via activity concepts, and utilize the temporal relationships of activity concepts explicitly. [sent-49, score-1.466]

13 Our activity concept classifiers are pre-learned and offer a good abstraction of low level features. [sent-60, score-0.812]

14 This makes it possible to learn robust event classifiers with very limited training data. [sent-61, score-0.337]

15 Our approach can utilize both mid-level and atomic activity concepts, even when they do not occur in high level events directly, or are collected from a different domain. [sent-62, score-0.622]

16 In this case, activity concepts can be seen as groups of low level features such that each of them can provide useful statistics. [sent-63, score-0.844]

17 The key contributions of this paper are threefold: First, we propose to encode activity concept transitions with Fisher Vectors for high level event classification and description. [sent-66, score-1.191]

18 Related Work Low-level features are widely used in high level video event classification. [sent-70, score-0.432]

19 For high level video event classification, [20] evaluates different types of low-level visual features, and shows a late fusion of these features can improve performance. [sent-74, score-0.432]

20 The idea of using concepts has been adopted in image [22] [11] [3] and video classification [12] under different names. [sent-75, score-0.459]

21 A set of object concept classifiers called classemes is used for image classification and novel class detection tasks in [22]. [sent-76, score-0.361]

22 Our activity concepts, on the other hand, are detected on fixed length short video clips. [sent-78, score-0.403]

23 Like [22] and [11], our framework doesn’t require the concept classifiers to be perfect or directly related to the target domain. [sent-79, score-0.431]

24 Recently, [5] models the activity concepts as latent variables with pairwise correlations, and applies latent SVM for classification. [sent-80, score-0.738]

25 Unlike our approach, their activity concepts have single responses for the entire video and cannot model the evolution of concepts over time. [sent-81, score-1.186]

26 Among the frameworks which also exploit temporal structure, [21] uses explicit-duration HMM where both concepts and concept durations are hidden variables, and [4] uses a generative temporal model by estimating the distribution of relative concept spacings. [sent-82, score-1.076]

27 Our approach is different from them as our model with activity concepts is not used for classification directly. [sent-83, score-0.772]

28 A similar approach is used in high level video event classification and shows promising results [19]. [sent-86, score-0.466]

29 An input video is separated into fixed length clips, each of which has a vector of activity concept responses. [sent-91, score-0.679]

30 Video Representation In this section, we describe how we represent videos with activity concepts, as well as how to get the representation. [sent-95, score-0.43]

31 To avoid confusion, we first define the term events and activity concepts used throughout the paper. [sent-96, score-0.841]

32 • An activity concept is an atomic action or an activity containing simple interactions among objects over a short period of time. [sent-97, score-1.03]

33 (e.g. less than 10 seconds) • An event is a complex activity consisting of several activity concepts over a relatively long period of time. [sent-100, score-1.382]

34 (e.g. 30 seconds or higher) The activity concepts we use are predefined and trained under supervision. [sent-103, score-0.738]

35 All techniques for event classification with low level features can be used, resulting in a single fixed-length descriptor x for each video clip. [sent-105, score-0.471]

36 We then train a 1-vs-rest classifier φc for each activity concept c. [sent-106, score-0.634]
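
As a small illustration of this step, the sketch below trains one-vs-rest concept classifiers over pre-computed clip descriptors. It is a hedged example rather than the paper's implementation: the use of scikit-learn, the linear SVM, and the names clip_features / clip_concept_labels are all assumptions made for the example.

    # Hypothetical sketch: train a 1-vs-rest classifier phi_c per activity concept.
    # clip_features: (N, D) fixed-length clip descriptors (e.g. Bag-of-Words over
    # low level features); clip_concept_labels: (N,) concept indices in {0..K-1}.
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    def train_concept_classifiers(clip_features, clip_concept_labels):
        model = OneVsRestClassifier(LinearSVC(C=1.0))
        model.fit(clip_features, clip_concept_labels)
        return model  # model.decision_function(X) returns one response per concept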

37 Once the concept classifiers [φ1 . . . φK] are obtained, we scan the video with fixed-length sliding windows and represent each video by a T by K matrix M = [φt,k], where T is the number of sliding windows, K is the number of activity concepts, and φt,k is the classifier response of the k-th activity concept for the t-th sliding window. [sent-112, score-1.631]
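
Continuing the illustration, a video can then be turned into the T by K response matrix M. The sigmoid squashing of raw classifier scores into (0, 1) responses is an assumption here; the summary does not say how the responses are normalized.

    # Hypothetical sketch: represent a video as a T x K matrix M = [phi_{t,k}]
    # of activity concept responses over fixed-length sliding windows.
    import numpy as np

    def video_concept_matrix(window_features, concept_model):
        # window_features: (T, D) descriptors, one per sliding window.
        raw = concept_model.decision_function(window_features)  # (T, K) raw scores
        return 1.0 / (1.0 + np.exp(-raw))                       # squashed responses (assumption)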

38 HMM Fisher Vector In this section, we introduce how we model and encode the transitions of activity concepts in videos. [sent-114, score-0.874]

39 Our Model We use HMM to model a video event with activity concept transitions over time. [sent-118, score-1.101]

40 There are K states, each of which corresponds to an activity concept. [sent-119, score-0.358]

41 Every two concepts i, j have a transition probability P(Cj|Ci) from concept i to j. [sent-120, score-0.765]

42 Since we are working with a generative model, the emission probability of x given concept Ci is derived from P(x|Ci) ∝ φi(x) / P(Ci), where φi(x) is the activity concept classifier output and P(Ci) is the prior probability of concept i. [sent-122, score-1.344]

43 • τi|j is the transition probability from concept j to concept i. • θx|i is the emission probability of x given concept i. [sent-126, score-0.744]

44 As the emission probabilities are derived from activity concept classifiers, we only use partial derivatives over transition probability parameters τi|j to derive the Fisher Vector UX. [sent-133, score-0.795]

45 First, we randomly select activity concept responses from neighboring sliding windows of all events. [sent-159, score-0.728]

46 The dimension of HMMFV is K^2, where K is the number of activity concepts. [sent-166, score-0.358]

47 Intuitively, by looking at Equation 4, HMMFV accumulates the difference between the actual expected concept transitions and the model’s prediction based on the previous concept. [sent-167, score-0.354]
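
To make this intuition concrete, here is a hedged sketch of an HMM Fisher Vector over the transition parameters: emissions follow the φi(x)/P(Ci) derivation above, posteriors come from a standard scaled forward-backward pass, and each entry accumulates the expected transitions minus the transition the general model predicts from the previous concept. The exact form and normalization of Equation 4 are not reproduced; this is one plausible reading, with trans and prior standing in for the general model's parameters.

    # Hypothetical HMMFV sketch: a K^2-dimensional vector of transition-parameter scores.
    import numpy as np

    def hmm_fisher_vector(M, trans, prior):
        # M: (T, K) concept responses phi_{t,k}; trans[j, i] ~ P(C_i | C_j); prior: (K,) P(C_i).
        T, K = M.shape
        emis = M / prior                               # emission likelihood ~ phi_i(x_t) / P(C_i)
        alpha = np.zeros((T, K))
        beta = np.ones((T, K))
        alpha[0] = prior * emis[0]
        alpha[0] /= alpha[0].sum()
        for t in range(1, T):                          # scaled forward pass
            alpha[t] = emis[t] * (alpha[t - 1] @ trans)
            alpha[t] /= alpha[t].sum()
        for t in range(T - 2, -1, -1):                 # scaled backward pass
            beta[t] = trans @ (emis[t + 1] * beta[t + 1])
            beta[t] /= beta[t].sum()
        fv = np.zeros((K, K))
        for t in range(T - 1):
            xi = alpha[t][:, None] * trans * (emis[t + 1] * beta[t + 1])[None, :]
            xi /= xi.sum()                             # posterior transition counts at step t
            gamma = alpha[t] * beta[t]
            gamma /= gamma.sum()                       # posterior of the previous concept
            fv += xi - gamma[:, None] * trans          # observed minus predicted transitions
        return fv.reshape(-1)                          # K^2-dimensional HMMFV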

48 This is especially useful for high level event classification since the videos often contain irrelevant information. [sent-170, score-0.493]

49 Consider a birthday party event, in which people singing and people dancing often follow each other; FV(dancing, singing) and FV(singing, dancing) should both have high positive energy, indicating that their transition probabilities are underestimated in the general model. [sent-172, score-0.411]

50 Based on this property, we can describe a video using activity concept transitions with high positive values in HMMFV. [sent-174, score-0.838]

51 Then we compare our approach with baseline, and study the influence of activity concept selection. [sent-177, score-0.634]

52 For the purpose of training activity concept classifiers, we used two datasets. [sent-199, score-0.634]

53 We obtained the 60 activity concept annotations used in [5] by communicating with the authors. [sent-200, score-0.634]

54 The concepts were annotated on the EventKit and are highly related to high level events. [sent-201, score-0.481]

55 We call these concepts Same Domain Concepts, some of the concept names are shown in Table 1. [sent-202, score-0.656]

56 Another dataset we used for training concepts is the UCF 101 [18] dataset. [sent-203, score-0.38]

57 The Dense Trajectory (DT) feature [23] was used as the low level feature for activity concept classification. [sent-213, score-0.74]

58 Since same domain concepts were annotated on all EventKit videos, we used only the annotations in the training partition to train the activity concept classifiers. [sent-219, score-1.104]

59 Max pooling (Max) was selected as the baseline; it represents a video by the maximum activity concept responses. [sent-221, score-0.71]

60 Suppose the concept responses from the t-th sliding window are [φt,1 φt,2 . . . [sent-223, score-0.348]
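
Under the same T x K response-matrix representation sketched earlier, the max pooling baseline reduces to an element-wise maximum over the sliding windows; the snippet below is an assumed reading of that baseline.

    # Hypothetical max pooling baseline: a video is summarized by
    # [max_t phi_{t,1}, ..., max_t phi_{t,K}].
    import numpy as np

    def max_pooling(M):           # M: (T, K) concept responses
        return M.max(axis=0)      # (K,) video-level descriptor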

61 Average precision for same domain concepts; average precision with same domain concepts and cross domain concepts, where bold numbers correspond to the higher performance in their groups. [sent-256, score-1.101]

62 Table 1 shows the performance of our concept classifiers; the parameters were selected by 5-fold cross validation. [sent-259, score-0.347]

63 20 of the 60 concepts are randomly selected due to space limitations. [sent-260, score-0.38]

64 Max pooling achieves better performance in the woodworking event; this may happen when some concept classifiers have strong performance and are highly correlated with a single event. [sent-263, score-0.683]

65 The red line shows a randomly selected subset of cross domain concepts from 20 to 101. [sent-270, score-0.541]

66 The cyan line illustrates a randomly selected subset of same domain concepts from 20 to 60. [sent-271, score-0.47]

67 Cross Domain Concepts Although same domain concepts can provide semantic meanings for the videos, the annotations can be expensive and time consuming to obtain, and adding new event classes can become cumbersome. [sent-274, score-0.756]

68 Interestingly, the fourth and fifth columns of Table 2 show that even if the activity concepts are not related to events directly, our framework still achieves comparable performance. [sent-276, score-0.841]

69 It is quite likely that the concept classifiers capture some inherent appearance and motion information, so that they can still be used to provide discriminative information for event classification. [sent-277, score-0.639]

70 HMMFV still achieves better performance than max pooling, which indicates that temporal information is also useful when the activity concepts are from a different domain. [sent-278, score-0.797]

71 To study the influence of domain relevance, we randomly selected 20, 40 and 60 concepts from same domain concepts, and 20, 40 and 80 concepts from cross domain concepts for HMMFV. [sent-279, score-1.391]

72 Same domain concepts can easily outperform cross domain concepts given the same number of concepts, and reach the same level of performance with only 60 concepts, compared with 101 from cross domain concepts. [sent-282, score-1.63]

73 Besides, as shown in Table 3, if we combine the two sets of results by late fusion, mean AP can be further improved by 4%, which indicates that the HMMFVs obtained from same and cross domain concepts have complementary performance. [sent-283, score-0.541]
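
One common way to realize such late fusion is a weighted average of the per-event classifier scores produced from the two HMMFV types; the equal default weight below is an assumption, since the summary does not give the fusion weights.

    # Hypothetical late fusion of same-domain and cross-domain HMMFV scores.
    import numpy as np

    def late_fusion(scores_same_domain, scores_cross_domain, w=0.5):
        s1 = np.asarray(scores_same_domain, dtype=float)
        s2 = np.asarray(scores_cross_domain, dtype=float)
        return w * s1 + (1.0 - w) * s2   # fused per-event scores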

74 Mean average precisions when using same domain concepts, cross domain concepts and both for generating HMMFV. (Table columns: Event ID, Joint [5], LL [15], HMMFV, HMMFV+LL; Joint refers to the joint modeling of activity concepts as described in [5], and LL refers to the low level approach as described in [15].) [sent-288, score-1.095]

75 It models the joint occurrence of concepts without considering temporal information. [sent-292, score-0.418]

76 We used both same domain and cross domain concepts in our framework. [sent-296, score-0.631]

77 Our framework outperforms the joint modeling of activity concepts approach in 11 of the 15 events. [sent-298, score-0.738]

78 Moreover, we used only a single type of low level feature, and did not fuse the event classification results obtained from low level features directly. [sent-299, score-0.556]

79 These comparisons indicate that encoding concept transitions provides useful information for event classification. [sent-304, score-0.728]

80 (Table residue: method and mean AP for each event under different numbers of training samples.) Since only the pairwise relationship of activity concepts is modeled, temporal information is not preserved. [sent-307, score-0.682]

81 Classification with Limited Training Samples In real world retrieval problems, it is desirable to let users provide just a few video samples of some event they want before the system can build a reasonably good event classifier. [sent-310, score-0.617]

82 The training videos were randomly selected from the original training partition, and we used the previous concept classifiers since they didn’t include test information. [sent-312, score-0.399]

83 As shown in Table 5, when the number of training samples is limited, the performance of the low level approach decreases more significantly than that of the activity concept based HMMFV; their relative mean AP difference is 14. [sent-313, score-0.74]

84 One possible explanation is that by using activity concepts, our framework has a better level of abstraction, which can be captured by discriminative classifiers even with a few training samples. [sent-315, score-0.513]

85 Another interesting observation is that, when the number of training samples is limited, the performance of same domain concepts is 8. [sent-316, score-0.47]

86 This is understandable since some same domain concepts are highly correlated with high level events. [sent-318, score-0.674]

87 Event Description with Concept Transitions Finally, we show how to describe high level events with concept transitions. [sent-324, score-0.48]

88 As shown with Equation 4, FV(i, j) has high positive energy if the transition probability from concept j to i is high and is underestimated by the general model. [sent-326, score-0.437]

89 A direct application is to sort the HMMFV values in descending order and use the activity concept transitions with the largest values to describe the video. [sent-327, score-0.77]
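
A small sketch of this description step follows, assuming the HMMFV can be reshaped back to a K x K matrix indexed as (previous concept, next concept) and that a list of concept names is available; both assumptions come from the example above, not from the paper.

    # Hypothetical sketch: list the activity concept transitions (j -> i) with the
    # largest positive HMMFV values to describe a video or an event.
    import numpy as np

    def top_transitions(hmmfv, concept_names, n=3):
        K = len(concept_names)
        fv = np.asarray(hmmfv).reshape(K, K)
        flat_order = np.argsort(fv, axis=None)[::-1][:n]      # largest values first
        return [(concept_names[j], concept_names[i], fv[j, i])
                for j, i in (np.unravel_index(k, fv.shape) for k in flat_order)]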

90 Compared with the description in [5], our method returns not only the activity concepts, but also the transition patterns over time. [sent-328, score-0.436]

91 We show the event level descriptions in Figure 4; the HMMFVs were obtained by averaging over all HMMFVs from test videos of a single event. [sent-329, score-0.465]

92 For example, in a parkour event, the top three concept transitions are: jumping to jumping, flipping to jumping and dancing to jumping. [sent-331, score-0.742]

93 Some descriptions are not exact but still informative, like spreading cream to hands visible in a Making a sandwich event. [sent-332, score-0.435]

94 Conclusion This paper addresses the high level event classification problem by encoding the activity concept transitions over time. [sent-334, score-1.221]

95 It can also be used to describe videos with activity concept transitions. [sent-337, score-0.706]

96 Experimental results show that our approach achieves better results compared with state-of-the-art concept based framework and low level framework. [sent-338, score-0.382]

97 Our system can utilize different types of activity concepts; we recommend same domain concepts for video description and compact HMMFV generation when they are available, and cross domain concepts to reduce the need for event specific concept annotations. [sent-340, score-1.998]

98 High level events and their top rated descriptions based on activity concept transitions. [sent-364, score-0.844]

99 Recognizing complex events using large margin joint low-level event model. [sent-396, score-0.389]

100 Large-scale web video event classification by use of fisher vectors. [sent-485, score-0.486]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hmmfv', 0.455), ('concepts', 0.38), ('activity', 0.358), ('event', 0.286), ('concept', 0.276), ('hands', 0.17), ('transitions', 0.136), ('hmm', 0.127), ('fisher', 0.121), ('dancing', 0.115), ('kissing', 0.115), ('jumping', 0.109), ('events', 0.103), ('singing', 0.1), ('sewing', 0.095), ('domain', 0.09), ('eventkit', 0.087), ('reeling', 0.086), ('visible', 0.083), ('transition', 0.078), ('marching', 0.078), ('level', 0.078), ('ceremony', 0.074), ('wrench', 0.072), ('videos', 0.072), ('cross', 0.071), ('wedding', 0.067), ('vehicle', 0.062), ('cream', 0.062), ('animal', 0.059), ('washing', 0.057), ('fv', 0.057), ('ux', 0.055), ('spreading', 0.054), ('hmmfvs', 0.052), ('steering', 0.052), ('turning', 0.052), ('emission', 0.052), ('classifiers', 0.051), ('sliding', 0.049), ('carving', 0.048), ('video', 0.045), ('generative', 0.044), ('parkour', 0.039), ('woodworking', 0.039), ('atomic', 0.038), ('temporal', 0.038), ('sandwich', 0.037), ('ucf', 0.037), ('jlogp', 0.035), ('tojumping', 0.035), ('classification', 0.034), ('party', 0.034), ('actions', 0.033), ('birthday', 0.032), ('flipping', 0.032), ('eating', 0.031), ('hugging', 0.031), ('southern', 0.031), ('pooling', 0.031), ('probability', 0.031), ('encoding', 0.03), ('feeding', 0.03), ('clips', 0.03), ('walking', 0.029), ('underestimated', 0.029), ('descriptions', 0.029), ('xt', 0.028), ('aser', 0.028), ('dt', 0.028), ('low', 0.028), ('gmm', 0.027), ('grooming', 0.027), ('discriminative', 0.026), ('appliance', 0.026), ('parade', 0.026), ('kernel', 0.025), ('mob', 0.025), ('hidden', 0.024), ('ap', 0.024), ('fuse', 0.024), ('jaakkola', 0.024), ('unstuck', 0.024), ('tire', 0.023), ('responses', 0.023), ('high', 0.023), ('kl', 0.022), ('flash', 0.022), ('iarpa', 0.022), ('landing', 0.022), ('repairing', 0.022), ('windows', 0.022), ('utilize', 0.022), ('person', 0.022), ('laptev', 0.022), ('gathering', 0.022), ('max', 0.021), ('abstraction', 0.021), ('med', 0.021), ('ci', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

Author: Chen Sun, Ram Nevatia

Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.

2 0.24026623 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu

Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.

3 0.2393624 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints

Author: Yifan Zhang, Qiang Ji, Hanqing Lu

Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable result. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extendAllen ’s interval algebra network (IAN) [2]from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.

4 0.23767366 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos

Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarselevel location of segments, a finer model of video struc- ture is implemented by jointly pooling features of segmenttuples. Experimental evaluation demonstrates that the re- sulting event detector has state-of-the-art performance on challenging video datasets.

5 0.22064279 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM

Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool

Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.

6 0.20877354 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

7 0.1961865 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach

8 0.15640357 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

9 0.15389344 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis

10 0.15336603 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

11 0.15307653 446 iccv-2013-Visual Semantic Complex Network for Web Images

12 0.14361748 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

13 0.13392963 81 iccv-2013-Combining the Right Features for Complex Event Recognition

14 0.13005291 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

15 0.09088321 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs

16 0.08619047 191 iccv-2013-Handling Uncertain Tags in Visual Recognition

17 0.08552134 428 iccv-2013-Translating Video Content to Natural Language Descriptions

18 0.083934553 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks

19 0.082226619 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories

20 0.080724657 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.162), (1, 0.167), (2, 0.064), (3, 0.115), (4, 0.07), (5, 0.068), (6, 0.087), (7, -0.055), (8, -0.033), (9, -0.092), (10, -0.117), (11, -0.165), (12, -0.055), (13, 0.172), (14, -0.188), (15, -0.077), (16, -0.004), (17, 0.033), (18, 0.065), (19, 0.051), (20, 0.081), (21, -0.03), (22, 0.038), (23, 0.082), (24, -0.014), (25, 0.026), (26, -0.012), (27, 0.025), (28, -0.013), (29, -0.015), (30, -0.018), (31, -0.096), (32, -0.01), (33, -0.027), (34, -0.014), (35, -0.045), (36, -0.017), (37, -0.005), (38, -0.04), (39, 0.018), (40, -0.026), (41, -0.002), (42, -0.013), (43, 0.063), (44, -0.011), (45, -0.007), (46, 0.048), (47, -0.035), (48, 0.041), (49, -0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95940185 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

Author: Chen Sun, Ram Nevatia

Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.

2 0.87511516 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann

Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good perform- z. ance for complex event detection.

3 0.85718298 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints

Author: Yifan Zhang, Qiang Ji, Hanqing Lu

Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable result. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extendAllen ’s interval algebra network (IAN) [2]from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.

4 0.85147321 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu

Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.

5 0.78041893 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM

Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool

Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.

6 0.76734853 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

7 0.75749105 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis

8 0.6288777 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach

9 0.55724674 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

10 0.54514343 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

11 0.527394 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation

12 0.51201952 191 iccv-2013-Handling Uncertain Tags in Visual Recognition

13 0.48328382 81 iccv-2013-Combining the Right Features for Complex Event Recognition

14 0.48175722 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

15 0.47483009 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB

16 0.45322448 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection

17 0.42944169 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

18 0.41913489 428 iccv-2013-Translating Video Content to Natural Language Descriptions

19 0.40347636 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes

20 0.3976503 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.113), (4, 0.015), (7, 0.012), (12, 0.023), (13, 0.022), (26, 0.067), (31, 0.032), (42, 0.061), (47, 0.204), (64, 0.071), (68, 0.011), (73, 0.017), (77, 0.034), (78, 0.018), (89, 0.197)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83601725 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

Author: Chen Sun, Ram Nevatia

Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.

2 0.80125368 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification

Author: Bo Wang, Zhuowen Tu, John K. Tsotsos

Abstract: In graph-based semi-supervised learning approaches, the classification rate is highly dependent on the size of the availabel labeled data, as well as the accuracy of the similarity measures. Here, we propose a semi-supervised multi-class/multi-label classification scheme, dynamic label propagation (DLP), which performs transductive learning through propagation in a dynamic process. Existing semi-supervised classification methods often have difficulty in dealing with multi-class/multi-label problems due to the lack in consideration of label correlation; our algorithm instead emphasizes dynamic metric fusion with label information. Significant improvement over the state-of-the-art methods is observed on benchmark datasets for both multiclass and multi-label tasks.

3 0.79081571 210 iccv-2013-Image Retrieval Using Textual Cues

Author: Anand Mishra, Karteek Alahari, C.V. Jawahar

Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-artmethods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, we introduce.

4 0.76649171 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies

Author: Karteek Alahari, Guillaume Seguin, Josef Sivic, Ivan Laptev

Abstract: We seek to obtain a pixel-wise segmentation and pose estimation of multiple people in a stereoscopic video. This involves challenges such as dealing with unconstrained stereoscopic video, non-stationary cameras, and complex indoor and outdoor dynamic scenes. The contributions of our work are two-fold: First, we develop a segmentation model incorporating person detection, pose estimation, as well as colour, motion, and disparity cues. Our new model explicitly represents depth ordering and occlusion. Second, we introduce a stereoscopic dataset with frames extracted from feature-length movies “StreetDance 3D ” and “Pina ”. The dataset contains 2727 realistic stereo pairs and includes annotation of human poses, person bounding boxes, and pixel-wise segmentations for hundreds of people. The dataset is composed of indoor and outdoor scenes depicting multiple people with frequent occlusions. We demonstrate results on our new challenging dataset, as well as on the H2view dataset from (Sheasby et al. ACCV 2012).

5 0.76174265 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

Author: Limin Wang, Yu Qiao, Xiaoou Tang

Abstract: This paper proposes motion atom and phrase as a midlevel temporal “part” for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we firstly design a discriminative clustering method to automatically discovera set ofrepresentative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representativepower. We introduce a bottom-upphrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recent published methods on both datasets.

6 0.7577765 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach

7 0.75743812 426 iccv-2013-Training Deformable Part Models with Decorrelated Features

8 0.75602818 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

9 0.75591648 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction

10 0.75490844 229 iccv-2013-Large-Scale Video Hashing via Structure Learning

11 0.75312281 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

12 0.75112593 238 iccv-2013-Learning Graphs to Match

13 0.75020051 396 iccv-2013-Space-Time Robust Representation for Action Recognition

14 0.75015777 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria

15 0.74945474 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction

16 0.74914503 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition

17 0.74873781 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary

18 0.74872637 313 iccv-2013-Person Re-identification by Salience Matching

19 0.74859726 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation

20 0.7482574 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation