iccv iccv2013 iccv2013-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The task of recognizing events in photo collections is central for automatically organizing images. [sent-4, score-0.915]
2 It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. [sent-5, score-0.85]
3 In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. [sent-7, score-1.332]
4 Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. [sent-8, score-1.882]
5 However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. [sent-9, score-0.536]
6 Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). [sent-10, score-0.266]
7 With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections. [sent-12, score-0.96]
8 Introduction With the advent of digital photography, we have witnessed the explosion of personal and professional photo collections, both online and offline. [sent-14, score-0.459]
9 The vast amount of pictures that users accumulate raises the need for automatic photo organization. [sent-15, score-0.453]
10 Figure 1: Eight examples of photo collections from four event classes in our data set. [sent-21, score-1.282]
11 However, these works seldom exploit the simple fact that online and offline images frequently come in collections: people organize their personal photos in directories, corresponding either to particular contents (persons, things of interest) or to particular events. [sent-26, score-0.26]
12 Online photo sharing websites such as Flickr, Panoramio or Facebook adopted this scheme and are organised in albums (examples shown in Fig. [sent-27, score-0.429]
13 The benefits from recognizing event types are evident: Automatic organisation helps users keep order in their photo collections and also enables the retrieval of similar event types in large photo repositories. [sent-29, score-2.193]
14 As in videos, discriminative features in photo collections are often outnumbered by many diverse and semantically ambiguous frames that contribute little to the understanding of an event class: portraits, group photos and landscapes all occur in multiple types of events. [sent-35, score-1.151]
15 In contrast to videos where images are sampled at a fixed frame rate, photo collections instead present a very sparse sampling of visual data, such that relating consecutive images is typically a harder task, c. [sent-36, score-0.804]
16 A great benefit of photo collections, however, is that the frequency of sampling is itself a measure of the relative importance of photos [8], and that we can exploit this information to distinguish between event classes. [sent-40, score-1.092]
17 Unfortunately, there is no standard benchmark data set for studying the challenging problem of event recognition for photo collections. [sent-41, score-0.932]
18 In the literature on classifying photo collections [18, 26, 29], only small and private data sets are used. [sent-42, score-0.759]
19 As a contribution of this paper, we have collected a large data set of more than 61,000 images in 807 collections from Flickr and manually annotated it with 14 event classes as we describe in Sect. [sent-44, score-0.853]
20 These collections correspond to real-world personal photo collections taken by individual photographers. [sent-46, score-1.098]
21 As a second contribution, we propose to modify a recent state-of-the-art model [25], initially designed for videos, for event recognition in photo collections. [sent-51, score-0.932]
22 This includes a proper multi-class formulation and a modified hidden Markov model where the transition probabilities depend on observed temporal gaps between images. [sent-52, score-0.287]
23 6) that our model outperforms alternative event classification schemes for photo collections based on feature or score pooling or simple hidden Markov models and present our conclusions in Sect. [sent-60, score-1.338]
24 While these algorithms focus on finding structure in unorganized data, our goal is to exploit the collection structure that is often found in personal and professional photo archives. [sent-67, score-0.643]
25 [6] exploit photo collections to reduce the complexity of propagating labels between images by observing that images within a collection are more likely to depict similar scenes. [sent-69, score-0.835]
26 The authors use a data set of 100 collections and label each image with an event and a scene label. [sent-70, score-0.823]
27 [7] further extends this idea towards a hierarchical model where a photo collection is split in a sub-sequence of so-called “events”, composed of images from similar scenes, and exploits additional information such as GPS tracks. [sent-71, score-0.513]
28 GPS tracks make it simpler to distinguish between events such as backyard parties, hikes and road trips [29] because of the difference of their geographical extent, but are still not very common in photo collections. [sent-72, score-0.627]
29 [18] proposes a simple scheme to aggregate the SVM scores of each photo in a collection, and uses it for classification into 8 social classes. [sent-73, score-0.506]
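The score aggregation idea mentioned above can be illustrated with a minimal sketch. The text does not say how [18] combines the per-photo scores, so plain averaging is an assumption made here, and the function name `pool_scores` and the dictionary input format are hypothetical.

```python
from statistics import mean


def pool_scores(per_photo_scores):
    """Average per-photo class scores over a collection.

    per_photo_scores: one dict per photo, mapping class name -> classifier
    score (hypothetical input format). Averaging is an assumed pooling rule.
    Returns (best_class, pooled_scores).
    """
    if not per_photo_scores:
        raise ValueError("empty collection")
    classes = list(per_photo_scores[0].keys())
    pooled = {c: mean(photo[c] for photo in per_photo_scores) for c in classes}
    best = max(pooled, key=pooled.get)
    return best, pooled


# Toy collection of three photos scored against two event classes.
collection = [
    {"birthday": 0.2, "hiking": -0.5},
    {"birthday": -0.1, "hiking": 0.9},
    {"birthday": 0.1, "hiking": 0.8},
]
label, pooled = pool_scores(collection)
```

Pooling of this kind ignores both the order of the photos and the time between them, which is the information the sequence models discussed later try to use.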
30 For instance, the generative model in [16] allows its authors to integrate cues such as scene, object categories and people to segment and recover the event category in a single image. [sent-75, score-0.556]
31 [19] exploits user context, location and user-provided tags and comments on a photo sharing website to improve automatic image annotation. [sent-79, score-0.462]
32 The most related works to ours deal with event classification in videos [12, 25]. [sent-80, score-0.552]
33 Both works consider the use of latent sub-events in a discriminative learning framework, to maximize predictive performance. [sent-81, score-0.186]
34 However, [12] relies on known sub-events and uses them as an intermediate representation of collections for event classification. [sent-82, score-0.8]
35 Inspired by discretely observed Markov jump processes [4], we propose a Markov model where transition probabilities are functions of the temporal gap between images as if it were measured by a stopwatch (c. [sent-88, score-0.353]
36 Data Set In this section, we describe our efforts to collect and annotate a large data set of personal photo collections for use as an event recognition benchmark. [sent-93, score-1.304]
37 We first defined event classes of interest by using the most popular tags on Flickr and Picasa as well as Wikipedia categories that correspond to social events. [sent-94, score-0.617]
38 Because we did not have direct access to large private photo collections we formulated different keyword queries by using variations of the event’s name or by adding year numbers to retrieve single images from Flickr. [sent-95, score-0.783]
39 If a returned image was contained in a Flickr set and if we could access the original image and its EXIF meta data, we downloaded the whole photo set. [sent-96, score-0.511]
40 As these sets only loosely correspond to collections, we manually reviewed and discarded those sets that did not consist of a personal album or one single event, had wrong or missing meta data or were heavily retouched. [sent-97, score-0.19]
41 About 60% of the downloaded photo sets had to be discarded. [sent-98, score-0.429]
42 This led to the choice of 14 event classes as shown in Tab. [sent-99, score-0.556]
43 1, with in total 807 photo collections which together contain 61,364 photos with EXIF data. [sent-100, score-0.861]
44 The Stopwatch Hidden Markov Model People usually do not take pictures at fixed intervals when photographing at an event they attend. [sent-109, score-0.527]
45 For each of the 14 classes, we detail the number of photo collections and the total number of images that they contain. [sent-113, score-0.726]
46 Other events might even expose a more subtle and thus latent substructure. [sent-116, score-0.282]
47 In this work, we assume that the photo bursts act as a proxy for this sub-structure. [sent-117, score-0.56]
48 Since events of the same type show a very large variety in their temporal composition, it can be difficult even for humans to identify and thus annotate sub-events. [sent-118, score-0.208]
49 This is why we treat the sub-events as latent in this work and learn them while training the event classifier. [sent-119, score-0.628]
50 , xT} of T + 1 time-ordered images originating from a single event, our goal is to predict the correct event class label y in a set Y of K possible labels. [sent-123, score-0.562]
51 We cast this prediction task in the framework of structured-output SVM with latent variables [20, 28], where the output is a multi-class prediction y∗ parametrized by Θ: $y^* = f_\Theta(X) = \arg\max_{y} \max_{Z} \ldots$ [sent-124, score-0.236]
52 3, we detail how we learn the parameters given a set of training photo collections with manual annotations. [sent-139, score-0.726]
53 Our model for photo collection classification is based on a hidden Markov model, as commonly done for modelling sequences [14, 24, 25]. [sent-145, score-0.622]
54 Each observed image xt in the collection is associated with an unobserved latent variable zt representing its state among S possible ones. [sent-146, score-0.623]
55 In the specific context of event recognition, those latent states are often called sub-events, to stress their intended semantics. [sent-147, score-0.628]
56 Figure 3: Factor graph corresponding to our photo collection event recognition model. [sent-149, score-1.016]
57 $\sum_{t=0}^{T-1} \theta_{p,z_t,z_{t+1},y} \cdot \phi_p(x_t, x_{t+1}, z_t, z_{t+1}, y)$ (2) The feature map Φg (X, y) allows the integration of global cues from the full sequence into the event prediction. [sent-159, score-0.556]
58 The maps Φl (xt, zt, y) represent images xt and their assignments to latent sub-events zt for a particular event class y. [sent-160, score-1.066]
59 Finally, the pairwise features φp(xt, xt+1 , zt , zt+1 , y) encode the sub-event transition costs between consecutive images. [sent-161, score-0.336]
60 3 shows the factor graph corresponding to a photo collection. [sent-167, score-0.429]
61 This allows learning sub-events that help discriminate between events in a multi-class setting, whereas [25] only considers binary CRFs. [sent-169, score-0.184]
62 Indeed, inspired by Markov Jump Processes [4], we use the observed time gap δt→t+1 = τ(xt+1) − τ(xt) between two consecutive images xt and xt+1 to influence the transition probabilities. [sent-172, score-0.346]
63 Our Stopwatch Hidden Markov model can model the intuition that the transition matrices for short temporal gaps should typically be close to the identity matrix (i. [sent-173, score-0.199]
64 The transition matrix between two consecutive images depends on the temporal gap δt→t+1 . [sent-178, score-0.226]
65 This allows modelling bursts of photos and the typical durations of sub-events. [sent-179, score-0.271]
66 Intuitively, the model “trusts” a transition more if the observed time gap is consistent with the time gaps observed for class y. [sent-183, score-0.19]
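To make the gap-dependent transition idea concrete, here is a minimal sketch based on the textbook continuous-time Markov chain relation P(δ) = exp(Qδ), the kind of jump process the text cites as inspiration. It is not the paper's parametrisation, which the extracted sentences do not spell out. The rate matrix `RATE`, the function names and the example gaps are all assumptions chosen for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_matrix(rate, gap, squarings=10, terms=12):
    """Return exp(rate * gap) using a Taylor series with scaling and squaring.

    rate: generator matrix (rows sum to zero, off-diagonal entries >= 0).
    gap:  time gap between two consecutive photos, in the units of 1/rate.
    """
    n = len(rate)
    scale = gap / (2 ** squarings)
    step = [[rate[i][j] * scale for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term becomes step^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical two-state rate matrix, in transitions per second.
RATE = [[-0.01, 0.01], [0.02, -0.02]]
p_burst = transition_matrix(RATE, gap=2.0)     # photos two seconds apart
p_pause = transition_matrix(RATE, gap=3600.0)  # photos one hour apart
```

With these assumed rates, the matrix for a two-second gap stays close to the identity, so photos in a burst tend to keep the same latent state, while the matrix for a one-hour gap has nearly identical rows, meaning the previous state carries little information.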
67 estimating the event and sub-event label can be simply done as shown in Sect. [sent-186, score-0.503]
68 Inference Given a photo collection, inferring the event class label and the latent sub-events means to jointly maximize over the latent variables and the class labels as in Eq. [sent-194, score-1.396]
69 This can be done efficiently by observing that, for a fixed event label y, the problem of inferring over the latent variables Z, i. [sent-196, score-0.687]
70 To perform inference in the full model, we therefore simply apply the Viterbi algorithm to infer the latent variables Zy∗ for each choice of event label y, and then maximize the corresponding prediction function over y: $y^* = \arg\max_{y} \ldots$ [sent-203, score-0.764]
71 This is therefore equivalent to having one chain model per event class, and predicting the class with the highest confidence. [sent-207, score-0.591]
72 Figure 5: Factor graph corresponding to our photo collection classification model when the event label y is fixed. [sent-209, score-1.04]
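The inference procedure described above (one Viterbi pass per event class, then a maximum over classes) can be sketched as follows. This is a minimal max-sum illustration under assumed inputs: the score tables `local` and `pairwise` are hypothetical stand-ins for the learned local and pairwise terms, and global features are omitted.

```python
def viterbi(local, pairwise):
    """Max-sum Viterbi over one chain.

    local[t][s]:       score of latent state s for image t.
    pairwise[t][r][s]: score of moving from state r at image t to state s
                       at image t + 1 (may differ per t, e.g. by time gap).
    Returns (best_score, best_state_sequence).
    """
    n_states = len(local[0])
    score = list(local[0])
    back = []
    for t in range(1, len(local)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_r = max(range(n_states),
                         key=lambda r: score[r] + pairwise[t - 1][r][s])
            pointers.append(best_r)
            new_score.append(score[best_r] + pairwise[t - 1][best_r][s]
                             + local[t][s])
        score = new_score
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


def predict(models):
    """models: dict mapping class label -> (local, pairwise) score tables.

    Runs Viterbi once per class and returns (label, score, state_sequence)
    for the class whose best chain scores highest.
    """
    results = {y: viterbi(loc, pw) for y, (loc, pw) in models.items()}
    label = max(results, key=lambda y: results[y][0])
    return label, results[label][0], results[label][1]


# Toy problem: 3 images, 2 latent states, 2 classes (all numbers made up).
zeros = [[0.0, 0.0], [0.0, 0.0]]
sticky = [[0.0, -1.0], [-1.0, 0.0]]
models = {
    "a": ([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], [zeros, zeros]),
    "b": ([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]], [sticky, sticky]),
}
label, score, path = predict(models)
```

Each Viterbi pass costs time proportional to the number of images times the square of the number of states, and the outer loop visits every class once, so the total cost grows linearly with the number of classes and the collection size.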
73 linear in the number of event classes and size of the photo collection, but quadratic in the number of sub-events. [sent-215, score-0.985]
74 , (XN, yN)} of N photo collections with their class labels yi ∈ Y, which together form the training set D. [sent-221, score-0.57]
75 The key is to take again advantage of the photo bursts in the time domain. [sent-271, score-0.533]
76 Our assumption is again that such bursts act as a proxy to latent sub-events. [sent-273, score-0.256]
77 To do so, we segment each photo collection using Hierarchical Agglomerative Clustering [sent-274, score-0.513]
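A minimal sketch of burst segmentation on timestamps follows. The extracted text does not give the linkage criterion or the stopping rule of the clustering, so this sketch assumes single-linkage clustering cut at a distance threshold; on sorted one-dimensional timestamps that is the same as starting a new burst whenever the gap to the previous photo exceeds the threshold. The function name and the `max_gap` parameter are hypothetical.

```python
def segment_bursts(timestamps, max_gap):
    """Group photo indices into bursts.

    timestamps: capture times in seconds, sorted in increasing order.
    max_gap:    assumed threshold; a gap larger than this starts a new burst.
    Returns a list of bursts, each a list of photo indices.
    """
    bursts = []
    for i, t in enumerate(timestamps):
        if i == 0 or t - timestamps[i - 1] > max_gap:
            bursts.append([])
        bursts[-1].append(i)
    return bursts


# Six photos: three within seconds, two ten minutes later, one much later.
times = [0, 5, 9, 600, 604, 4000]
bursts = segment_bursts(times, max_gap=60)
```

Burst indices obtained this way could serve as an initial guess for the latent sub-event assignments, which the learning procedure is then free to revise.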
78 Global features are functions of the whole photo collection and help capture holistic properties. [sent-286, score-0.54]
79 We define different cues based on time and aggregate them over the photo collection in different histograms. [sent-290, score-0.566]
80 Those cues include time of day, day of week, month and the duration to help recognize events that show specific patterns in the time domain. [sent-291, score-0.276]
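The temporal cues listed above can be sketched as simple normalised histograms. The bin layout (24 hour bins, 7 weekday bins, 12 month bins) and the function name `temporal_features` are assumptions; the extracted text only names the cues. The sketch also assumes that EXIF capture times have already been parsed into `datetime` objects.

```python
from datetime import datetime


def temporal_features(timestamps):
    """Aggregate capture times of one collection into global features.

    timestamps: non-empty list of datetime objects, one per photo.
    Returns a dict with normalised histograms and the duration in hours.
    """
    if not timestamps:
        raise ValueError("empty collection")
    hour_hist = [0] * 24
    weekday_hist = [0] * 7   # Monday is index 0, as in datetime.weekday()
    month_hist = [0] * 12
    for ts in timestamps:
        hour_hist[ts.hour] += 1
        weekday_hist[ts.weekday()] += 1
        month_hist[ts.month - 1] += 1
    n = len(timestamps)
    duration = (max(timestamps) - min(timestamps)).total_seconds() / 3600.0
    return {
        "hour": [c / n for c in hour_hist],
        "weekday": [c / n for c in weekday_hist],
        "month": [c / n for c in month_hist],
        "duration_hours": duration,
    }


# Toy collection spanning an evening and the following morning.
photos = [
    datetime(2012, 12, 24, 18, 30),
    datetime(2012, 12, 24, 19, 0),
    datetime(2012, 12, 25, 10, 15),
]
features = temporal_features(photos)
```

Concatenating these histograms with the duration gives one fixed-length vector per collection, regardless of how many photos it contains.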
81 Out of the pool of 807 photo collections, we randomly selected 10 collections for each of the 14 classes as test set, which we use to report our evaluations. [sent-337, score-0.779]
82 We also sampled 6 random collections per class to validate the hyperparameter. [sent-338, score-0.356]
83 All the remaining collections can be used for learning the parameters of the algorithms for event recognition. [sent-339, score-0.8]
84 Each event class has at least 24 training collections. [sent-340, score-0.562]
85 In the experiments that we report below, we have balanced our training data and used 24 random collections for each event class. [sent-343, score-0.8]
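The split described above (10 test collections and 6 validation collections per class, with 24 balanced training collections per class) can be sketched as follows. The function name, the dictionary layout and the fixed seed are assumptions; the authors' actual sampling code is not part of this text.

```python
import random


def split_collections(by_class, n_test=10, n_val=6, n_train=24, seed=0):
    """Randomly split collection ids into test, validation and training sets.

    by_class: dict mapping class label -> list of collection ids.
    Returns {"test": {...}, "val": {...}, "train": {...}}, each mapping
    class label -> list of ids, with equal counts for every class.
    """
    rng = random.Random(seed)
    split = {"test": {}, "val": {}, "train": {}}
    needed = n_test + n_val + n_train
    for label, ids in by_class.items():
        if len(ids) < needed:
            raise ValueError(f"class {label!r} needs at least {needed} collections")
        shuffled = list(ids)
        rng.shuffle(shuffled)
        split["test"][label] = shuffled[:n_test]
        split["val"][label] = shuffled[n_test:n_test + n_val]
        split["train"][label] = shuffled[n_test + n_val:needed]
    return split


# Toy data with made-up collection ids for two classes.
by_class = {
    "birthday": [f"b{i}" for i in range(45)],
    "hiking": [f"h{i}" for i in range(40)],
}
split = split_collections(by_class)
```

Drawing the same number of training collections for every class keeps the classifier from favouring classes that happen to have more collections in the data set.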
86 Instead, the latent sub-events are independently assigned to each image to maximize the prediction on the training set. [sent-360, score-0.205]
87 Note how events taking place in different scene types can be discriminated properly, but events that have a similar scenery are confused (e. [sent-377, score-0.364]
88 single event classes much better than a single SVM. [sent-386, score-0.556]
89 For instance, the correct event is among the top two predictions for 72. [sent-409, score-0.503]
90 8, we can sometimes clearly identify semantic concepts: outdoor view for the Hiking class, a typical photo setting for Graduation, painting frames for Exhibitions. [sent-412, score-0.429]
91 This highlights the benefits of using a latent model for event recognition, as it can provide some additional semantic knowledge that eventually increases the ability to automatically understand, organize and exploit images in photo collections. [sent-413, score-1.107]
92 9 some examples of photo collections that our approach correctly and incorrectly classified. [sent-415, score-0.726]
93 Conclusion In this paper, we have introduced a novel data set for event recognition in photo collections. [sent-422, score-0.932]
94 We believe that semantic hierarchies would help model events as well as complex sub-events, while scaling sublinearly with the number of event classes and sub-events. [sent-426, score-0.74]
95 The predicted event class labels are shown, and the color indicates whether the SHMM predicted them correctly (correct labels are shown in braces; only a selected subset of images is shown). [sent-427, score-0.562]
96 Annotating photo collections by label propagation according to multiple similarity cues. [sent-469, score-0.726]
97 Image annotation within the context of personal photo collections using hierarchical event and scene models. [sent-477, score-1.327]
98 Recognizing complex events using large margin joint low-level event model. [sent-509, score-0.66]
99 Compositional object pattern: a new model for album event recognition. [sent-607, score-0.534]
100 Mining GPS traces and visual words for event classification. [sent-628, score-0.503]
wordName wordTfidf (topN-words)
[('event', 0.503), ('photo', 0.429), ('collections', 0.297), ('zt', 0.204), ('stopwatch', 0.164), ('events', 0.157), ('xt', 0.145), ('photos', 0.135), ('latent', 0.125), ('shmm', 0.117), ('bursts', 0.104), ('markov', 0.095), ('ayd', 0.094), ('dayh', 0.094), ('easter', 0.094), ('graduation', 0.094), ('halloween', 0.094), ('birthday', 0.086), ('hidden', 0.085), ('collection', 0.084), ('hiking', 0.083), ('transition', 0.079), ('personal', 0.075), ('christmas', 0.062), ('concert', 0.062), ('hmm', 0.061), ('class', 0.059), ('meta', 0.058), ('gps', 0.057), ('cues', 0.053), ('consecutive', 0.053), ('classes', 0.053), ('exif', 0.052), ('trip', 0.052), ('temporal', 0.051), ('cruise', 0.047), ('subevents', 0.047), ('gaps', 0.046), ('transitions', 0.046), ('viterbi', 0.045), ('cao', 0.044), ('prediction', 0.043), ('gap', 0.043), ('skiing', 0.042), ('road', 0.041), ('flickr', 0.041), ('confusion', 0.039), ('unobserved', 0.039), ('day', 0.039), ('coin', 0.039), ('yi', 0.038), ('svm', 0.037), ('maximize', 0.037), ('bag', 0.036), ('portraits', 0.036), ('discretely', 0.036), ('children', 0.035), ('kautz', 0.035), ('attributes', 0.034), ('inferring', 0.034), ('tags', 0.033), ('private', 0.033), ('jump', 0.033), ('durations', 0.032), ('metadata', 0.032), ('recognizing', 0.032), ('luo', 0.032), ('aggregated', 0.032), ('album', 0.031), ('quattoni', 0.031), ('cccp', 0.031), ('inference', 0.031), ('professional', 0.03), ('hmms', 0.03), ('assignments', 0.03), ('chain', 0.029), ('boat', 0.029), ('social', 0.028), ('difficulty', 0.028), ('help', 0.027), ('confused', 0.027), ('proxy', 0.027), ('observed', 0.026), ('learnt', 0.026), ('discarded', 0.026), ('videos', 0.025), ('exploit', 0.025), ('scores', 0.025), ('organize', 0.025), ('variables', 0.025), ('acknowledge', 0.024), ('indoor', 0.024), ('access', 0.024), ('classification', 0.024), ('predictive', 0.024), ('pictures', 0.024), ('ambiguity', 0.024), ('scene', 0.023), ('matrices', 0.023), ('recall', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
2 0.36864769 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate for the limitations of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
3 0.34496501 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
4 0.28025907 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
5 0.26736039 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
6 0.25795785 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
7 0.22064279 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
8 0.18574631 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
9 0.17967249 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
10 0.17155431 81 iccv-2013-Combining the Right Features for Complex Event Recognition
11 0.16763927 219 iccv-2013-Internet Based Morphable Model
12 0.15001643 223 iccv-2013-Joint Noise Level Estimation from Personal Photo Collections
13 0.14591315 444 iccv-2013-Viewing Real-World Faces in 3D
14 0.12365457 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
15 0.11982999 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
16 0.11443593 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
17 0.11246041 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
18 0.11084509 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
19 0.10956219 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection
20 0.10759526 187 iccv-2013-Group Norm for Learning Structured SVMs with Unstructured Latent Variables
topicId topicWeight
[(0, 0.205), (1, 0.16), (2, 0.038), (3, 0.078), (4, 0.102), (5, 0.054), (6, 0.131), (7, -0.067), (8, -0.027), (9, -0.119), (10, -0.176), (11, -0.2), (12, -0.046), (13, 0.256), (14, -0.306), (15, -0.086), (16, 0.01), (17, 0.045), (18, 0.059), (19, 0.042), (20, 0.079), (21, -0.037), (22, 0.028), (23, 0.027), (24, 0.046), (25, 0.003), (26, 0.022), (27, 0.005), (28, 0.074), (29, 0.022), (30, -0.01), (31, 0.102), (32, 0.051), (33, 0.013), (34, -0.032), (35, -0.057), (36, 0.102), (37, 0.1), (38, 0.015), (39, -0.048), (40, -0.02), (41, 0.025), (42, 0.023), (43, 0.014), (44, 0.048), (45, -0.072), (46, -0.018), (47, 0.002), (48, -0.015), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.95951086 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
2 0.88117856 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate for the limitations of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
3 0.81161737 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
4 0.79590452 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
5 0.77788824 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
Author: Chen Sun, Ram Nevatia
Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.
6 0.70820463 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
7 0.68067235 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
8 0.6744132 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
9 0.55500913 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
10 0.52673793 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
11 0.49043509 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
12 0.48443249 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
13 0.46641824 81 iccv-2013-Combining the Right Features for Complex Event Recognition
14 0.43561813 243 iccv-2013-Learning Slow Features for Behaviour Analysis
15 0.41001275 397 iccv-2013-Space-Time Tradeoffs in Photo Sequencing
16 0.4059062 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes
17 0.39757347 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection
18 0.38091668 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
19 0.37824717 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
20 0.35341886 286 iccv-2013-NYC3DCars: A Dataset of 3D Vehicles in Geographic Context
topicId topicWeight
[(2, 0.112), (4, 0.012), (7, 0.014), (12, 0.012), (26, 0.07), (31, 0.042), (34, 0.03), (42, 0.092), (48, 0.011), (54, 0.157), (64, 0.052), (73, 0.021), (78, 0.014), (89, 0.24), (98, 0.015)]
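The listing does not state how the simValue scores that follow are derived from the topic weights above. As a rough illustration only, the sketch below assumes the scores come from cosine similarity between sparse topic-weight vectors; the function names and the second vector (other_paper) are hypothetical and not taken from the source.

```python
import ast
import math


def parse_topic_vector(line):
    """Parse a '[(topicId, weight), ...]' line into a {topicId: weight} dict."""
    return {int(topic): float(weight) for topic, weight in ast.literal_eval(line)}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[topic] for topic, weight in a.items() if topic in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Topic weights exactly as listed for this paper.
this_paper = parse_topic_vector(
    "[(2, 0.112), (4, 0.012), (7, 0.014), (12, 0.012), (26, 0.07), "
    "(31, 0.042), (34, 0.03), (42, 0.092), (48, 0.011), (54, 0.157), "
    "(64, 0.052), (73, 0.021), (78, 0.014), (89, 0.24), (98, 0.015)]"
)

# Hypothetical topic weights for some other paper (made up for illustration).
other_paper = {5: 0.3, 54: 0.2, 89: 0.1}

score = cosine_similarity(this_paper, other_paper)
```

Only topics present in both vectors contribute to the dot product, so two papers with no shared topics score 0 under this assumption.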
simIndex simValue paperId paperTitle
1 0.93856734 389 iccv-2013-Shortest Paths with Curvature and Torsion
Author: Petter Strandmark, Johannes Ulén, Fredrik Kahl, Leo Grady
Abstract: This paper describes a method of finding thin, elongated structures in images and volumes. We use shortest paths to minimize very general functionals of higher-order curve properties, such as curvature and torsion. Our globally optimal method uses line graphs and its runtime is polynomial in the size of the discretization, often in the order of seconds on a single computer. To our knowledge, we are the first to perform experiments in three dimensions with curvature and torsion regularization. The largest graphs we process have almost one hundred billion arcs. Experiments on medical images and in multi-view reconstruction show the significance and practical usefulness of regularization based on curvature while torsion is still only tractable for small-scale problems.
2 0.93107277 407 iccv-2013-Subpixel Scanning Invariant to Indirect Lighting Using Quadratic Code Length
Author: Nicolas Martin, Vincent Couture, Sébastien Roy
Abstract: We present a scanning method that recovers dense subpixel camera-projector correspondence without requiring any photometric calibration nor preliminary knowledge of their relative geometry. Subpixel accuracy is achieved by considering several zero-crossings defined by the difference between pairs of unstructured patterns. We use gray-level band-pass white noise patterns that increase robustness to indirect lighting and scene discontinuities. Simulated and experimental results show that our method recovers scene geometry with high subpixel precision, and that it can handle many challenges of active reconstruction systems. We compare our results to state of the art methods such as micro phase shifting and modulated phase shifting.
3 0.92782801 68 iccv-2013-Camera Alignment Using Trajectory Intersections in Unsynchronized Videos
Author: Thomas Kuo, Santhoshkumar Sunderrajan, B.S. Manjunath
Abstract: This paper addresses the novel and challenging problem of aligning camera views that are unsynchronized by low and/or variable frame rates using object trajectories. Unlike existing trajectory-based alignment methods, our method does not require frame-to-frame synchronization. Instead, we propose using the intersections of corresponding object trajectories to match views. To find these intersections, we introduce a novel trajectory matching algorithm based on matching Spatio-Temporal Context Graphs (STCGs). These graphs represent the distances between trajectories in time and space within a view, and are matched to an STCG from another view to find the corresponding trajectories. To the best of our knowledge, this is one of the first attempts to align views that are unsynchronized with variable frame rates. The results on simulated and real-world datasets show trajectory intersections are a viable feature for camera alignment, and that the trajectory matching method performs well in real-world scenarios.
same-paper 4 0.90736639 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
Author: Lukas Bossard, Matthieu Guillaumin, Luc Van Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
5 0.89527082 293 iccv-2013-Nonparametric Blind Super-resolution
Author: Tomer Michaeli, Michal Irani
Abstract: Super resolution (SR) algorithms typically assume that the blur kernel is known (either the Point Spread Function ‘PSF’ of the camera, or some default low-pass filter, e.g. a Gaussian). However, the performance of SR methods significantly deteriorates when the assumed blur kernel deviates from the true one. We propose a general framework for “blind” super resolution. In particular, we show that: (i) Unlike the common belief, the PSF of the camera is the wrong blur kernel to use in SR algorithms. (ii) We show how the correct SR blur kernel can be recovered directly from the low-resolution image. This is done by exploiting the inherent recurrence property of small natural image patches (either internally within the same image, or externally in a collection of other natural images). In particular, we show that recurrence of small patches across scales of the low-res image (which forms the basis for single-image SR), can also be used for estimating the optimal blur kernel. This leads to significant improvement in SR results.
6 0.87523437 238 iccv-2013-Learning Graphs to Match
7 0.87515545 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
8 0.87221599 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
9 0.87178838 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
10 0.87110984 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
11 0.868415 404 iccv-2013-Structured Forests for Fast Edge Detection
12 0.8681674 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
13 0.86724317 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
14 0.8670426 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
15 0.86694103 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
16 0.86650038 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
17 0.86590779 169 iccv-2013-Fine-Grained Categorization by Alignments
18 0.86584443 396 iccv-2013-Space-Time Robust Representation for Action Recognition
19 0.86570752 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
20 0.86553705 24 iccv-2013-A Non-parametric Bayesian Network Prior of Human Pose
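The abstract of paper 147 (the same-paper entry above) describes a Stopwatch HMM whose state transitions are a function of the time gap between consecutive images. The sketch below is one minimal way to picture that idea, not the authors' implementation: it assumes the gap is discretized into a few bins, each with its own transition matrix, and it scores a sequence with a plain generative forward pass rather than the discriminative training the paper uses. The bin edges, the two-state matrices, and all function names are hypothetical.

```python
import bisect
import math

# Hypothetical gap thresholds in seconds: burst, short pause, long pause, new scene.
GAP_EDGES = [10.0, 60.0, 600.0]


def gap_bin(dt):
    """Map a time gap in seconds to the index of its bin."""
    return bisect.bisect_right(GAP_EDGES, dt)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_emissions, timestamps, init, transitions):
    """Forward algorithm where the transition matrix depends on the time gap.

    log_emissions: T rows of K log-probabilities (one row per photo)
    timestamps:    T capture times in seconds, non-decreasing
    init:          K initial state probabilities
    transitions:   one K x K row-stochastic matrix per gap bin
    """
    k = len(init)
    alpha = [math.log(init[j]) + log_emissions[0][j] for j in range(k)]
    for t in range(1, len(timestamps)):
        a = transitions[gap_bin(timestamps[t] - timestamps[t - 1])]
        alpha = [
            logsumexp([alpha[i] + math.log(a[i][j]) for i in range(k)])
            + log_emissions[t][j]
            for j in range(k)
        ]
    return logsumexp(alpha)


def sticky(p_stay):
    """Two-state matrix that keeps the current state with probability p_stay."""
    return [[p_stay, 1.0 - p_stay], [1.0 - p_stay, p_stay]]


# Hypothetical parameters: photos within a burst tend to stay in the same sub-event,
# while long gaps make the next state close to independent of the previous one.
TRANSITIONS = [sticky(0.95), sticky(0.8), sticky(0.65), sticky(0.5)]
INIT = [0.5, 0.5]

# Two photos that both look like sub-event 0, taken 2 seconds vs 1 hour apart.
looks_like_state0 = [math.log(0.9), math.log(0.1)]
burst = forward_log_likelihood([looks_like_state0] * 2, [0.0, 2.0], INIT, TRANSITIONS)
apart = forward_log_likelihood([looks_like_state0] * 2, [0.0, 3600.0], INIT, TRANSITIONS)
```

With these made-up parameters, two consistent-looking photos score higher when taken in a burst than when taken an hour apart, which is the kind of timing sensitivity a classical HMM with a single transition matrix cannot express.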