iccv iccv2013 iccv2013-268 knowledge-graph by maker-knowledge-mining

268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition


Source: pdf

Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu

Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. [sent-6, score-0.409]

2 Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. [sent-8, score-0.452]

3 For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. [sent-10, score-0.245]

4 For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. [sent-11, score-0.738]

5 Introduction The past decade has seen remarkable progress in event understanding [5, 8, 12, 22]. [sent-14, score-0.569]

6 It can be decomposed into several sequential sub-events or atomic events in the temporal domain [12]. [sent-16, score-0.733]

7 In addition to recognizing the entire event, modeling and recognizing these atomic events are also important, especially in real applications, like predicting the agent's goal and intention in actions [12]. [sent-17, score-0.673]

8 A cellphone is a cellphone because of its ability to allow the agent to perform the action make a call. [sent-21, score-0.392]

9 For example, in the event fetch water from dispenser, the dispenser almost stays still, and the mug smoothly moves with the hand. [sent-31, score-1.316]

10 An event is a sequence of time-varying interactions between human and objects with hierarchical structures in 3D spatial domain and 1D temporal domain. [sent-32, score-0.926]

11 In this paper, we propose a 4D human-object interaction model (4DHOI) for event recognition and object detection. [sent-33, score-0.654]

12 The human-object interaction relation is embedded in 4D space: i) the semantic co-occurrence and geometric compatibility of human pose and object in 3D spatial domain; ii) the atomic event transition and object coherence in 1D temporal domain. [sent-35, score-1.525]

13 An event is decomposed into several sequential atomic events. [sent-37, score-1.043]

14 The atomic event is decomposed into human pose and objects. [sent-38, score-1.158]

15 Given the RGBD video and the human pose from the Kinect camera [18], we design an online dynamic programming beam search algorithm to segment the video, recognize the events, and detect the objects in each video frame. [sent-39, score-0.506]

16 The possible interpretations of this frame are jointly proposed according to the human pose, the objects, and the 3D spatial relations between them. [sent-41, score-0.306]

17 In this way, the algorithm generates the hierarchical event interpretation and the corresponding object labeling. [sent-43, score-0.694]

18 To evaluate our method, we built a large-scale 3D event dataset with human-object interactions. [sent-46, score-0.569]

19 It includes 8 event categories and 11 interacting object classes. [sent-48, score-0.64]

20 Each event category includes about 477 video sequence instances. [sent-57, score-0.717]

21 In recent years, many works have applied the human-object mutual context to event and object recognition [2, 5, 7, 9, 11, 14, 15, 23, 24]. [sent-62, score-0.606]

22 Events are usually recognized by combining the human body features and the temporal relations [8, 10, 12, 17, 19, 22]. [sent-76, score-0.253]

23 Some works [10, 22] treated event recognition as a classification problem. [sent-77, score-0.569]

24 They represented the pre-segmented video as a feature vector, and classified it to an event category. [sent-78, score-0.642]

25 In addition to event classification, our model can segment the video and recognize the atomic events and objects in each frame. [sent-82, score-1.225]

26 [12] represented an action with several atomic actions and employed a temporal filter embedded in an And-or graph for video parsing. [sent-92, score-0.722]

27 Hierarchical Graph Model of Event In the 1D temporal domain, an event is decomposed into multiple ordered smaller atomic events. [sent-95, score-1.134]

28 For example, the event fetch water from dispenser in Figure 2 is decomposed into three sequential atomic events - approach the dispenser, fetch water, and leave the dispenser. [sent-96, score-2.041]

29 In the 3D spatial domain, each atomic event is decomposed into human pose, interacting objects, and the geometric relations between them. [sent-97, score-1.241]

30 An atomic event integrates a specific type of human pose and one or more objects. [sent-98, score-1.111]

31 The semantic relation between the object class and a specific atomic event is a hard constraint. [sent-99, score-1.116]

32 For example, the atomic event fetch water consists of the pose fetch and the interacting objects. [Figure 2: the event fetch water from dispenser is decomposed into approach the dispenser, fetch water, and leave the dispenser.] [sent-100, score-2.601]
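
As a rough illustration of this hierarchy (a sketch only, not the authors' implementation; the object lists attached to each atomic event below are illustrative assumptions), the event / atomic-event / object decomposition can be written as nested records:

```python
# A minimal sketch of the hierarchical event graph: an event category decomposes
# into ordered atomic events, and each atomic event is tied to a pose model and
# the object classes it may co-occur with (the hard semantic constraint).
EVENT_GRAPH = {
    "fetch water from dispenser": [
        # (atomic event, interacting object classes) -- object lists are illustrative
        ("approach the dispenser", ["dispenser"]),
        ("fetch water",            ["dispenser", "mug"]),
        ("leave the dispenser",    ["dispenser", "mug"]),
    ],
    # ...the remaining event categories would be listed analogously
}

def atomic_events(event):
    """Return the ordered atomic-event labels allowed for an event category."""
    return [name for name, _objects in EVENT_GRAPH[event]]
```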

33 V = (I1, . . . , Iτ) is an event video sequence in the time interval [1, τ], where It is the RGBD frame at time t. [sent-106, score-0.86]

34 E ∈ {1, . . . , |∆|} is the event category, like fetch water from dispenser. [sent-110, score-0.965]

35 a ∈ {1, . . . , KE} is the atomic event class, like fetch water. [sent-128, score-1.226]

36 Each event category ei has its own distinct atomic event set Ωei, i.e. [sent-133, score-1.587]

37 the relations between an event and its atomic events are hard constraints. [sent-135, score-1.238]

38 Because each event has its own distinct atomic event set, we omit the variable E in the right side of Eq. [sent-145, score-1.565]

39 Semantic co-occurrence means a specific type of human pose and some object classes appear together in an atomic event. [sent-150, score-0.579]

40 Suppose zti is the 3D bounding box center of the object oti in the 3D space. [sent-162, score-0.284]

41 The probability of object oti at zti is obtained by normalizing the SVM detector output with Platt scaling, p(oti | zti) = 1/{1 + exp{u s(zti) + v}} [6, 13], where s(zti) is the score of the linear SVM object detector with RGBD HOG features at location zti. [sent-164, score-0.297]
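
For concreteness, the Platt-scaling step above can be sketched as follows; the parameter values are placeholders, not the trained ones, and the detector score is assumed to be given.

```python
import math

def platt_probability(svm_score, u, v):
    """Map a raw linear-SVM detector score s(z) to a probability
    p(o | z) = 1 / (1 + exp(u * s(z) + v)), i.e. Platt scaling."""
    return 1.0 / (1.0 + math.exp(u * svm_score + v))

# Illustrative parameters only (u is typically negative so that higher detector
# scores map to probabilities closer to 1); the paper fits u, v on training data.
print(platt_probability(svm_score=1.8, u=-2.0, v=0.5))
```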

42 In an atomic event, the location of an object is closely related to the locations and directions of some body parts, which we call the key parts, such as the arm pointing to the dispenser in Figure 3. [sent-177, score-0.82]

43 The subscript (oit, at) indicates that the human-object geometric relation varies in different atomic events and objects. [sent-187, score-0.659]
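
One plausible way to encode such a human-object geometric relation is sketched below; the parameterization (object offset split along and across the limb direction) and the example numbers are assumptions, since the paper learns the (atomic event, object)-specific geometric model from training data.

```python
import numpy as np

def object_part_feature(object_center, key_joint, limb_direction):
    """Sketch of a geometric feature relating an object to a key body part:
    the offset of the object's 3D center from the joint, split into the
    component along the limb direction and the orthogonal residual."""
    offset = np.asarray(object_center, dtype=float) - np.asarray(key_joint, dtype=float)
    d = np.asarray(limb_direction, dtype=float)
    d /= np.linalg.norm(d)
    along = float(offset @ d)                            # distance along the pointing arm
    across = float(np.linalg.norm(offset - along * d))   # deviation from the arm axis
    return np.array([along, across])

# Example: a dispenser roughly in front of the extended forearm (made-up coordinates).
print(object_part_feature([0.1, 0.0, 1.6], [0.0, 0.1, 1.0], [0.15, -0.15, 0.95]))
```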

44 Temporal Relation. The temporal relation Ψ(l1:t−1, lt) is decomposed as Ψ(l1:t−1, lt) = ψ1(a1:t−1, at) + ψ2(ot−1, ot) (Eq. 5), where a1:t−1 are the atomic event labels of the frames from time 1 to t−1. [sent-191, score-1.227]

45 The first term encodes the atomic event transition, and the second term encodes the object coherence. [sent-192, score-0.996]

46 In an event, the transition probability from the current atomic event to the next atomic event is related to the duration of current atomic event. [sent-195, score-2.55]

47 Suppose ωk−1 and ωk are two neighboring atomic events of event E. [sent-197, score-1.14]

48 Given E and at−1 = ωk−1, the next frame’s atomic event at can be ωk−1 (repeat the same atomic event) or ωk (start a new atomic event). [sent-198, score-1.85]

49 If approach the dispenser has lasted for a long time, as in interval 3, then the probability of staying in approach the dispenser will be much smaller than the probability of moving to fetch water. [sent-217, score-0.983]

50 Intervals 1 and 3 describe the common duration distribution of the atomic event, and interval 2 reflects the variance. [sent-219, score-0.671]
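
A hedged sketch of this duration-dependent transition follows; the sigmoid form and the parameter values are illustrative assumptions, whereas the paper learns the duration statistics from training data.

```python
import math

def transition_probabilities(elapsed_frames, typical_duration, sharpness=0.1):
    """The longer the current atomic event has lasted relative to its typical
    duration, the lower the probability of repeating it and the higher the
    probability of starting the next atomic event (illustrative form only)."""
    p_repeat = 1.0 / (1.0 + math.exp(sharpness * (elapsed_frames - typical_duration)))
    return {"repeat current atomic event": p_repeat,
            "start next atomic event": 1.0 - p_repeat}

# Early in "approach the dispenser" (interval 1) staying is likely; once it has
# lasted well beyond its typical duration (interval 3), switching dominates.
print(transition_probabilities(elapsed_frames=10, typical_duration=45))
print(transition_probabilities(elapsed_frames=90, typical_duration=45))
```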

51 In an event, the locations of some objects, like the dispenser, rarely change. [sent-223, score-0.373]

52 Other objects, like the mug, can move when a human action is applied. [sent-224, score-0.292]

53 Each sequence contains one instance of the event E from the beginning to the end. [sent-232, score-0.652]

54 [Figure 5: samples of the learned atomic events and their interacting objects, e.g., drink with mug, fetch water, pour water, press button.] [sent-234, score-0.613]

55 Based on the KE segments, we can obtain KE atomic events for E. [sent-236, score-0.653]

56 The pose model of the kth atomic event is the kth component of the Gaussian mixture. [sent-237, score-1.106]

57 The co-occurring object categories in all the frames of the kth segment are set as the interacting object classes for the kth atomic event. [sent-238, score-0.6]

58 Figure 5 shows some samples of the learned atomic events. [sent-241, score-0.427]

59 The energy with which the video V is interpreted by the graph list G is E(G|V) = Σ_{q=1}^{Q} En(Gq|Vq) (Eq. 8), where En(Gq|Vq) is the energy of each video clip, as defined in Eq. [sent-253, score-0.244]

60 We extend it to the video interpretation and exploit the characteristic of the event graph structure to accelerate the computation. [sent-263, score-0.742]

61 The general idea is that, based on the interpretations of the past video frames, we compute all the interpretations of the current frame. [sent-264, score-0.285]

62 This process iterates forward frame by frame until the video sequence ends. [sent-266, score-0.262]

63 Gt−1^1, . . . , Gt−1^J are J possible interpretation graph lists for the video sequence in the time interval [1, t−1], with the energies Et−1^1, . . . , Et−1^J. [sent-271, score-0.279]

64 Suppose at−1 and at are the atomic event labels of frame It−1 and It, respectively. [sent-281, score-1.064]

65 Given the jth path, there are three types of interpretation of the current frame It (shown on the right side of Figure 6): 1) at repeats the same atomic event as at−1; 2) at is the next atomic event after at−1 in the same event; 3) at is an atomic event of a new event. [sent-282, score-3.048]

66 In the third case, at can be any atomic event in the given set, which makes our model able to handle the cases of event insertion, interruption, and repetition. [sent-283, score-1.029]
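
An illustrative skeleton of this expansion step in the dynamic programming beam search is given below; only the control flow mirrors the description above, while the energy functions, the event_graph layout, and all names are placeholder assumptions.

```python
def expansions(prev_label, event_graph):
    """The three expansion cases: repeat the previous atomic event, advance to
    the next atomic event of the same event, or start any atomic event of a new
    event (which handles event insertion, interruption, and repetition)."""
    options = set()
    for event, sequence in event_graph.items():
        if prev_label in sequence:
            options.add(prev_label)                      # case 1: repeat
            i = sequence.index(prev_label)
            if i + 1 < len(sequence):
                options.add(sequence[i + 1])             # case 2: next in same event
        else:
            options.update(sequence)                     # case 3: new event
    return sorted(options)

def beam_search(frames, event_graph, frame_energy, transition_energy, beam_width=10):
    """Keep the beam_width lowest-energy partial labelings and expand each of
    them at every frame. event_graph maps an event category to its ordered
    atomic-event labels; frame_energy and transition_energy are placeholders."""
    all_labels = sorted({a for seq in event_graph.values() for a in seq})
    beam = [([], 0.0)]                                   # (atomic-event label path, energy)
    for frame in frames:
        candidates = []
        for path, energy in beam:
            labels = expansions(path[-1], event_graph) if path else all_labels
            for label in labels:
                e = energy + frame_energy(frame, label) + transition_energy(path, label)
                candidates.append((path + [label], e))
        beam = sorted(candidates, key=lambda c: c[1])[:beam_width]
    return beam[0]                                       # lowest-energy interpretation so far
```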

67 Gt^1, . . . , Gt^J with energies Et^1, . . . , Et^J are kept as the interpretations of the video in the interval [1, t]. [sent-302, score-0.276]

68 In addition to recognizing the event and the atomic event, it also detects and labels the objects in each frame. [sent-307, score-1.081]

69 Multiview 3D Event Dataset To evaluate our algorithm, we collect a large-scale multiview 3D event dataset. [sent-311, score-0.602]

70 Each subject repeats an event. [sent-314, score-0.569]

71 The graph list in the dashed green box is the interpretation of the video in the time interval [1, t − 1]. [sent-326, score-0.294]

72 MV: multiview; SN: the number of total video sequences; ASN: the average number of sequences for each event category; AVL: the average length (frames) of each video sequence. [sent-330, score-0.741]

73 To label the video, we manually cut the original long videos into short sequences such that each sequence contains one event from the beginning to the end. [sent-333, score-0.678]

74 In total, our labeled dataset contains 3815 event video sequences and 383,036 RGBD frames. [sent-334, score-0.668]

75 Each event category has about 477 sequence instances on average. [sent-335, score-0.644]

76 But due to the various styles of the actors' actions, the effective number of viewpoints for each event is much larger than three. [sent-339, score-0.569]

77 Second, our events involve various objects and have complex temporal structures. [sent-340, score-0.708]

78 Finally, our dataset has large variety due to the various styles in which each actor performs the events. [Figure 7: 1 drink with mug, 2 call with cellphone, 3 read book, 4 use mouse, 5 type on keyboard, 6 fetch water, 7 pour water, 8 press button.] [sent-341, score-1.078]

79 Table 1 gives the comparison of our dataset with two typical human-object interaction event datasets. [sent-344, score-0.617]

80 Event Recognition Event recognition is to predict an event label for each video sequence which contains one event from the beginning to the end. [sent-347, score-1.294]

81 We use two classical event recognition methods as baselines - motion template (MT) [10] and the traditional hidden Markov model (HMM) [16]. [sent-350, score-0.569]

82 It outperforms the other methods in 6 of the 8 event categories, and improves the overall accuracy greatly, which demonstrates the strength of our method. [sent-355, score-0.569]

83 The comparison between 4DH and 4DHOI demonstrates the effect of human-object interaction on event recognition. [sent-361, score-0.617]

84 For example, the human body movements in the events drink with mug and call with cellphone are highly similar. [sent-362, score-1.0]

85 By incorporating the object information of the mug and the cellphone, the two events are better distinguished. [sent-364, score-0.279]

86 Consider another event, pour water from kettle: it is complex in its temporal structure and human body movement because it involves the movement of both arms and the coordination between them. [sent-365, score-1.022]

87 The object kettle has a distinct appearance and only exists in the event pour water from kettle, which allows it to provide strong support for this event. [sent-366, score-0.899]

88 The new event interpretation segments the video into clips which correspond to different events. [sent-373, score-0.73]

89 We use 10 unsegmented long event sequences to test the segmentation. [sent-374, score-0.595]

90 Our segmentation data is challenging because many of the highly similar events successively occur in one sequence, and some events occur many times in one sequence. [sent-376, score-0.288]

91 Object Recognition and Localization. In video, object recognition determines the object class, which is related to event recognition since the connection between object class and event category is a hard constraint. [sent-397, score-1.298]

92 Different from previous work, which only localized objects in one video frame or just recognized pre-detected object motion, we localize the object in the 3D point cloud of each video frame (from which the 2D location in the image is available by projection). [sent-399, score-0.367]
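
Since the 2D image location is obtained from the 3D point cloud by projection, a minimal pinhole-projection sketch is shown below; the intrinsic parameter values are placeholders, not the Kinect's actual calibration.

```python
def project_to_image(point_3d, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point (X, Y, Z) to pixel
    coordinates: u = fx * X / Z + cx, v = fy * Y / Z + cy."""
    X, Y, Z = point_3d
    return (fx * X / Z + cx, fy * Y / Z + cy)

# Example: the detected 3D center of a mug mapped to its 2D image location.
print(project_to_image((0.2, -0.1, 1.5), fx=525.0, fy=525.0, cx=319.5, cy=239.5))
```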

93 The objects involved in the event present large appearance variance. [sent-408, score-0.617]

94 Some small objects are always occluded by the human body in the action, like cellphone and mouse. [sent-411, score-0.318]

95 The human action information can facilitate the localization by using the temporal and human body context. [sent-414, score-0.303]

96 Conclusion We proposed a 4D human-object interaction model for event and object recognition. [sent-418, score-0.654]

97 The human-object interactions defined in the 3D spatial domain boost the reliability of atomic event recognition. [sent-419, score-1.072]

98 Ambiguities in interpreting the video frames are resolved by integrating temporal relation between frames. [sent-420, score-0.257]

99 Through the dynamic programming beam search algorithm, we can efficiently segment the video, recognize events, and localize objects simultaneously. [sent-421, score-0.276]

100 The experiments on our large-scale multiview 3D event dataset demonstrate the effectiveness of our method. [sent-422, score-0.602]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('event', 0.569), ('atomic', 0.427), ('dispenser', 0.3), ('fetch', 0.205), ('zti', 0.164), ('cellphone', 0.154), ('water', 0.144), ('events', 0.144), ('interpretations', 0.106), ('kettle', 0.1), ('mug', 0.098), ('interval', 0.097), ('temporal', 0.091), ('beam', 0.088), ('keyboard', 0.082), ('transition', 0.081), ('ot', 0.078), ('ht', 0.074), ('video', 0.073), ('button', 0.071), ('relations', 0.071), ('dk', 0.07), ('frame', 0.068), ('dpbs', 0.067), ('rdh', 0.067), ('gq', 0.061), ('human', 0.061), ('interpretation', 0.06), ('action', 0.06), ('oti', 0.059), ('relation', 0.056), ('rgbd', 0.056), ('pose', 0.054), ('sequence', 0.053), ('interactions', 0.052), ('oit', 0.051), ('gtj', 0.05), ('yoit', 0.05), ('duration', 0.05), ('lt', 0.049), ('pour', 0.049), ('objects', 0.048), ('book', 0.048), ('interaction', 0.048), ('decomposed', 0.047), ('programming', 0.042), ('graph', 0.04), ('recognize', 0.037), ('object', 0.037), ('frames', 0.037), ('read', 0.036), ('mouse', 0.035), ('drink', 0.035), ('hmm', 0.035), ('vq', 0.034), ('interacting', 0.034), ('adipspproeancsehr', 0.033), ('ofevent', 0.033), ('xoit', 0.033), ('kinect', 0.033), ('multiview', 0.033), ('compatibility', 0.032), ('geometric', 0.032), ('actions', 0.031), ('localize', 0.031), ('body', 0.03), ('nn', 0.03), ('dynamic', 0.03), ('beginning', 0.03), ('yibiao', 0.03), ('zit', 0.03), ('sung', 0.03), ('energy', 0.029), ('clips', 0.028), ('kth', 0.028), ('hierarchical', 0.028), ('activity', 0.027), ('hard', 0.027), ('affordance', 0.027), ('suppose', 0.027), ('gj', 0.027), ('movement', 0.027), ('ke', 0.027), ('call', 0.026), ('sequences', 0.026), ('markov', 0.025), ('rgb', 0.025), ('like', 0.025), ('lnp', 0.025), ('box', 0.024), ('arms', 0.024), ('domain', 0.024), ('koppula', 0.024), ('someone', 0.024), ('agent', 0.024), ('sliding', 0.023), ('clip', 0.023), ('prest', 0.023), ('recognizing', 0.023), ('category', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu

Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.

2 0.47811511 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints

Author: Yifan Zhang, Qiang Ji, Hanqing Lu

Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen’s interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.

3 0.34851846 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos

Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.

4 0.34496501 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM

Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool

Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.

5 0.30203563 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann

Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.

6 0.24753837 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

7 0.24026623 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

8 0.20957062 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach

9 0.19233611 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling

10 0.18403485 86 iccv-2013-Concurrent Action Detection with Structural Prediction

11 0.17815834 81 iccv-2013-Combining the Right Features for Complex Event Recognition

12 0.16223188 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis

13 0.15379992 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition

14 0.1424994 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

15 0.13494395 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks

16 0.13306127 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

17 0.11049613 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation

18 0.10157552 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB

19 0.10093142 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

20 0.098476015 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.175), (2, 0.097), (3, 0.22), (4, 0.112), (5, 0.044), (6, 0.114), (7, -0.073), (8, -0.07), (9, -0.094), (10, -0.179), (11, -0.169), (12, -0.097), (13, 0.275), (14, -0.285), (15, -0.068), (16, 0.033), (17, 0.02), (18, 0.014), (19, 0.089), (20, 0.139), (21, -0.003), (22, 0.045), (23, 0.021), (24, 0.036), (25, -0.003), (26, -0.053), (27, 0.088), (28, 0.0), (29, -0.032), (30, -0.09), (31, -0.003), (32, 0.057), (33, 0.032), (34, 0.056), (35, -0.006), (36, -0.013), (37, 0.004), (38, 0.009), (39, 0.02), (40, 0.01), (41, 0.032), (42, -0.029), (43, 0.036), (44, 0.019), (45, -0.025), (46, 0.013), (47, 0.006), (48, 0.093), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95611984 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu

Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.

2 0.94209754 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints

Author: Yifan Zhang, Qiang Ji, Hanqing Lu

Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen’s interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.

3 0.890387 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?

Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann

Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.

4 0.82711041 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

Author: Chen Sun, Ram Nevatia

Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.

5 0.78721786 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM

Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool

Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.

6 0.78453898 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

7 0.68548095 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis

8 0.59488314 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach

9 0.50233364 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

10 0.49141273 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks

11 0.47946963 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation

12 0.46631068 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set

13 0.45045963 81 iccv-2013-Combining the Right Features for Complex Event Recognition

14 0.45031264 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB

15 0.44328725 344 iccv-2013-Recognising Human-Object Interaction via Exemplar Based Modelling

16 0.41971263 397 iccv-2013-Space-Time Tradeoffs in Photo Sequencing

17 0.40530357 191 iccv-2013-Handling Uncertain Tags in Visual Recognition

18 0.3994073 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

19 0.37118241 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection

20 0.34190643 167 iccv-2013-Finding Causal Interactions in Video Sequences


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.05), (12, 0.028), (26, 0.085), (31, 0.041), (40, 0.012), (42, 0.081), (44, 0.015), (48, 0.013), (64, 0.067), (73, 0.022), (78, 0.068), (82, 0.154), (89, 0.235), (95, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91613269 251 iccv-2013-Like Father, Like Son: Facial Expression Dynamics for Kinship Verification

Author: Hamdi Dibeklioglu, Albert Ali Salah, Theo Gevers

Abstract: Kinship verification from facial appearance is a difficult problem. This paper explores the possibility of employing facial expression dynamics in this problem. By using features that describe facial dynamics and spatio-temporal appearance over smile expressions, we show that it is possible to improve the state ofthe art in thisproblem, and verify that it is indeed possible to recognize kinship by resemblance of facial expressions. The proposed method is tested on different kin relationships. On the average, 72.89% verification accuracy is achieved on spontaneous smiles.

same-paper 2 0.90324092 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition

Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu

Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.

3 0.86594021 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

Author: Weiyu Zhang, Menglong Zhu, Konstantinos G. Derpanis

Abstract: This paper presents a novel approach for analyzing human actions in non-scripted, unconstrained video settings based on volumetric, x-y-t, patch classifiers, termed actemes. Unlike previous action-related work, the discovery of patch classifiers is posed as a strongly-supervised process. Specifically, keypoint labels (e.g., position) across spacetime are used in a data-driven training process to discover patches that are highly clustered in the spacetime keypoint configuration space. To support this process, a new human action dataset consisting of challenging consumer videos is introduced, where notably the action label, the 2D position of a set of keypoints and their visibilities are provided for each video frame. On a novel input video, each acteme is used in a sliding volume scheme to yield a set of sparse, non-overlapping detections. These detections provide the intermediate substrate for segmenting out the action. For action classification, the proposed representation shows significant improvement over state-of-the-art low-level features, while providing spatiotemporal localiza- tion as additional output. This output sheds further light into detailed action understanding.

4 0.85719472 252 iccv-2013-Line Assisted Light Field Triangulation and Stereo Matching

Author: Zhan Yu, Xinqing Guo, Haibing Lin, Andrew Lumsdaine, Jingyi Yu

Abstract: Light fields are image-based representations that use densely sampled rays as a scene description. In this paper, we explore geometric structures of 3D lines in ray space for improving light field triangulation and stereo matching. The triangulation problem aims to fill in the ray space with continuous and non-overlapping simplices anchored at sampled points (rays). Such a triangulation provides a piecewise-linear interpolant useful for light field superresolution. We show that the light field space is largely bilinear due to 3D line segments in the scene, and direct triangulation of these bilinear subspaces leads to large errors. We instead present a simple but effective algorithm to first map bilinear subspaces to line constraints and then apply Constrained Delaunay Triangulation (CDT). Based on our analysis, we further develop a novel line-assisted graphcut (LAGC) algorithm that effectively encodes 3D line constraints into light field stereo matching. Experiments on synthetic and real data show that both our triangulation and LAGC algorithms outperform state-of-the-art solutions in accuracy and visual quality.

5 0.85609603 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks

Author: Xiaoyu Ding, Wen-Sheng Chu, Fernando De_La_Torre, Jeffery F. Cohn, Qiao Wang

Abstract: Automatic facial Action Unit (AU) detection from video is a long-standing problem in facial expression analysis. AU detection is typically posed as a classification problem between frames or segments of positive examples and negative ones, where existing work emphasizes the use of different features or classifiers. In this paper, we propose a method called Cascade of Tasks (CoT) that combines the use ofdifferent tasks (i.e., , frame, segment and transition)for AU event detection. We train CoT in a sequential manner embracing diversity, which ensures robustness and generalization to unseen data. In addition to conventional framebased metrics that evaluate frames independently, we propose a new event-based metric to evaluate detection performance at event-level. We show how the CoT method consistently outperforms state-of-the-art approaches in both frame-based and event-based metrics, across three public datasets that differ in complexity: CK+, FERA and RUFACS.

6 0.85578203 248 iccv-2013-Learning to Rank Using Privileged Information

7 0.85537338 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition

8 0.85011256 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

9 0.84814233 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition

10 0.84764576 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints

11 0.84535539 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

12 0.84534943 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction

13 0.84283972 410 iccv-2013-Support Surface Prediction in Indoor Scenes

14 0.84236121 160 iccv-2013-Fast Object Segmentation in Unconstrained Video

15 0.84213704 319 iccv-2013-Point-Based 3D Reconstruction of Thin Objects

16 0.84162426 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness

17 0.84161294 190 iccv-2013-Handling Occlusions with Franken-Classifiers

18 0.84102684 78 iccv-2013-Coherent Motion Segmentation in Moving Camera Videos Using Optical Flow Orientations

19 0.84040582 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary

20 0.83995527 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns