iccv iccv2013 iccv2013-443 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Abstract: Generating a coherent synopsis for a surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. Additionally, the proposed model is robust to partial or missing non-visual information. We demonstrate the effectiveness of our framework on two crowded public surveillance datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Generating a coherent synopsis for a surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. [sent-13, score-0.637]
2 In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. [sent-14, score-1.224]
3 Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. [sent-15, score-0.526]
4 Introduction A critical task in visual surveillance is to automatically make sense of the massive amount of video data by summarising its content using higher-level intrinsic physical events, beyond low-level key-frame visual feature statistics and/or object detection counts. [sent-19, score-0.341]
5 In most contemporary techniques, low-level imagery visual cues are typically exploited as the sole information source for video summarisation tasks [11, 17, 6, 12]. [sent-20, score-0.4]
6 On the other hand, in complex and cluttered public scenes there are intrinsically more interesting and relevant higher-level events that can provide a more concise and meaningful summarisation of the video data. [sent-21, score-0.475]
7 However, such events may not be immediately observable visually and cannot be detected reliably by visual cues alone. [sent-22, score-0.219]
8 In particular, surveillance visual data from public spaces is often inaccurate and/or incomplete due to uncontrollable sources of variation, changes in illumination, occlusion, and background clutter [8]. [sent-23, score-0.647]
9 The proposed CC-Forest discovers latent correlations among heterogeneous visual and non-visual data sources, which can be both inaccurate and incomplete, for video synopsis of crowded public scenes. [sent-31, score-0.901]
10 Examples of non-visual sources include weather reports, GPS-based traffic speed data, geo-location data, textual data from social networks, and on-line event schedules. [sent-33, score-0.694]
11 Effectively discovering and exploiting such a latent correlation space can bridge the semantic gap between low-level imagery features and high-level semantic interpretation. [sent-36, score-0.288]
12 a college event calendar) data for video interpretation and structured synopsis (Fig. [sent-41, score-0.577]
13 The learned model can then be used for event inference and ambiguity reasoning in unseen video data. [sent-43, score-0.288]
14 Unsupervised mining of latent association and interaction between heterogeneous data sources is non-trivial due to: (1) Disparate sources significantly differ in representation (continuous or categorical), and largely vary in scale and covariance, which is also known as the heteroscedasticity problem [4]. [sent-44, score-0.903]
15 In addition, the dimension of visual sources often exceeds that of non-visual information to a large extent. [sent-45, score-0.499]
16 (2) Both visual and non-visual data in isolation can be inaccurate and incomplete, especially in surveillance data of public spaces. [sent-49, score-0.254]
17 Non-visual data, e.g. event timetables, may not necessarily be available or synchronised with the visual observations. [sent-52, score-0.239]
18 Firstly, we show that coherent and meaningful multi-source based video synopsis can be constructed in an unsupervised manner by learning collectively from heterogeneous visual and non-visual sources. [sent-56, score-0.778]
19 This is made possible by formulating a novel Constrained-Clustering Forest (CC-Forest) with a reformulated information gain function that seamlessly handles multi-heterogeneous data sources dissimilar in representation, distribution, and dimension. [sent-57, score-0.429]
20 Although both visual and non-visual data in isolation can be inaccurate and incomplete, our model is capable of uncovering and subsequently exploiting the shared latent correlation for video synopsis. [sent-59, score-0.433]
21 As shown in the experiments, combining visual and non-visual data using the proposed method improves the accuracy in video clustering and segmentation, leading towards more meaningful video synopsis. [sent-60, score-0.444]
22 In particular, we demonstrate the usefulness of our framework through generating video synopsis enriched by plausible semantic explanation, providing structured event-based summarisation beyond object detection counts or key-frame feature statistics. [sent-65, score-0.738]
23 Related Work Contemporary video summarisation methods can be broadly classified into two paradigms, keyframe-based [7, 21, 12] and object-based [18, 17, 6] methods. [sent-67, score-0.28]
24 They are neither suitable nor scalable to complex scenes where visual data are inherently incomplete and inaccurate, as is mostly the case in typical surveillance videos. [sent-73, score-0.21]
25 Our work differs significantly from these studies in that we exploit not only visual data without object tracking, but also non-visual sources as complementary information in order to discover higher-level events that are visually subtle and difficult to detect. [sent-74, score-0.549]
26 In addition, the complementary sources are well synchronised, mostly noise-free and complete as they are extracted from the embedded text metadata. [sent-77, score-0.33]
27 Importantly, although it seeks the optimal weighted combination of the affinity matrices, it does not consider the dependency between different data sources in model learning. [sent-88, score-0.438]
28 To overcome these problems, in this work a single affinity matrix that captures correlation between diverse types of sources is derived from a reformulated model of clustering forest. [sent-89, score-0.617]
29 Video Summarisation from Diverse Sources We consider the following different sources of information to be taken into account in a multi-source input feature space (Fig. [sent-92, score-0.33]
30 We then extract a d-dimensional visual descriptor from the ith clip, denoted by xi = (xi,1, . . . , xi,d). [sent-94, score-0.228]
31 We collectively represent the m types of non-visual data associated with the ith clip as yi = (yi,1, . . . , yi,m). [sent-102, score-0.263]
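To make this multi-source input concrete, here is a toy sketch (the dimensionality, tag names and values are invented for illustration, not taken from the paper) of how one clip's visual descriptor xi and its heterogeneous non-visual record yi might be laid out:

```python
import numpy as np

# Hypothetical multi-source record for a single video clip (toy values).
d = 512                    # visual descriptor dimension; typically large
x_i = np.random.rand(d)    # d-dimensional visual descriptor x_i

# Non-visual record y_i: m heterogeneous variables, mixing continuous and
# categorical types, kept as named sources rather than blindly
# concatenated with x_i.
y_i = {
    "traffic_speed_kmh": 32.5,           # continuous
    "weather": "sunny",                   # categorical
    "scheduled_event": "No Schd. Event",  # categorical
}
```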
32 To facilitate video summarisation with plausible semantic explanation, we need to model latent associations between visual events in video clips and non-visual semantic explanations from independent sources, given a large corpus of video clips and non-visual data. [sent-110, score-1.22]
33 An unsupervised solution is by discovering the natural groupings/clusters from these multiple heterogeneous data sources, so that each cluster represents a meaningful collection of clips with coherent events, associated with unique distributions of nonvisual data types. [sent-111, score-0.736]
34 Given a long unseen video, one can then apply a nearest neighbour search in the cluster space to infer the non-visual distribution of any clip in the unseen video. [sent-112, score-0.396]
35 Discovering coherent heterogeneous data groupings requires the mining of multi-source correlation, which is nontrivial (Sec. [sent-113, score-0.255]
36 5), since the notion of proximity becomes less precise when a single distance function is used for quantifying the groupings of heterogeneous sources differing in representation, distribution, and dimension. [sent-116, score-0.537]
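The following toy demonstration (ours, not from the paper) makes this concrete: with a single Euclidean distance over naively concatenated sources, the high-dimensional, high-variance visual block numerically swamps the low-dimensional non-visual one, so agreement on non-visual tags contributes nothing to the proximity measure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clips that differ visually but share identical non-visual tags.
x_a = rng.normal(0.0, 10.0, 512)   # visual: high dimension, large variance
x_b = rng.normal(0.0, 10.0, 512)
y_a = y_b = np.array([1.0, 0.0])   # identical 2-d non-visual encoding

# A single Euclidean distance over the concatenation is dominated by the
# visual block; the non-visual agreement is numerically invisible.
d_concat = np.linalg.norm(np.concatenate([x_a, y_a]) -
                          np.concatenate([x_b, y_b]))
d_visual = np.linalg.norm(x_a - x_b)
print(d_concat - d_visual)  # 0.0: the non-visual sources change nothing
```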
37 A decision forest [3, 23], particularly the clustering forest [1, 13], appears to be a viable solution since its model learning is based on unsupervised information gain optimisation. [sent-119, score-0.46]
38 Nevertheless, the conventional clustering forest is not well suited to solving our problem since it expects a full concatenated representation of visual + non-visual sources as input during both the model training and deployment stages. [sent-120, score-0.733]
39 This does not conform to the assumption of only visual data being available during the model deployment for unseen video synopsis. [sent-121, score-0.324]
40 Training steps for learning a multi-source synopsis model. [sent-123, score-0.324]
41 To overcome the limitations of the conventional clustering forest, we develop a new constrained clustering forest (CC-Forest) by reformulating its optimisation objective function. [sent-125, score-0.418]
42 In a conventional clustering forest, the information gain ΔI is defined as ΔI = Ip − (nl/np) Il − (nr/np) Ir, (2) where p, l and r refer to a splitting node and its left and right child nodes; n denotes the number of samples at a node, with np = nl + nr. [sent-131, score-0.297]
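A minimal sketch of this conventional gain, using a log-determinant-of-covariance impurity (a common choice for clustering forests; the impurity actually used in the paper may differ):

```python
import numpy as np

def impurity(samples):
    # Unsupervised node impurity: log-determinant of the sample
    # covariance, with a small ridge for numerical stability.
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(samples.shape[1])
    return np.linalg.slogdet(cov)[1]

def info_gain(parent, left, right):
    # Eqn. (2): gain of splitting `parent` into `left` and `right`.
    n_p, n_l, n_r = len(parent), len(left), len(right)
    return impurity(parent) - (n_l / n_p) * impurity(left) \
                            - (n_r / n_p) * impurity(right)

# Toy usage: split 200 five-dimensional samples on their first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
mask = X[:, 0] > 0
print(info_gain(X, X[mask], X[~mask]))
```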
43 Specifically, we define a new information gain for node splitting as follows: ΔI = ΔIx + ΔIy, (3) where ΔIx is the information gain over the visual source and ΔIy the gain over the non-visual sources. [sent-133, score-0.214]
44 This new term plays a critical role in that the node splitting is no longer solely dependent on visual data. [sent-150, score-0.191]
45 It is this re-formulation of joint information gain optimisation that makes it possible to associate multiple heterogeneous data sources, while simultaneously balancing the influence exerted by both visual and non-visual information on node splitting. [sent-153, score-0.473]
46 Temporal term - We also add a temporal smoothness gain ΔIt to encourage temporally adjacent video clips to be grouped together. [sent-154, score-0.403]
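Reusing the impurity/info_gain helpers from the previous sketch, a hedged reading of the combined criterion might look as follows; the weights alpha and beta are illustrative placeholders, not the paper's (possibly adaptive) weighting:

```python
def joint_info_gain(split, visual, nonvisual_sources, timestamps,
                    alpha=1.0, beta=1.0):
    # split:             boolean mask over the n samples at the node
    # visual:            (n, d) visual features
    # nonvisual_sources: list of (n, d_i) arrays, one per non-visual source
    # timestamps:        (n,) clip time indices
    l, r = split, ~split
    gain = info_gain(visual, visual[l], visual[r])      # visual term ΔIx
    for y in nonvisual_sources:                         # non-visual terms ΔIy
        gain += alpha * info_gain(y, y[l], y[r])
    # Temporal smoothness term ΔIt: favour splits whose children are
    # temporally compact, so adjacent clips tend to stay together.
    t = timestamps[:, None]
    gain += beta * info_gain(t, t[l], t[r])
    return gain
```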
47 To this end, we consider spectral clustering on the manifold to discover latent clusters in a lower-dimensional space (Fig. 2-c). [sent-175, score-0.256]
48 Spectral clustering [24] groups data using eigenvectors of an affinity matrix derived from the data. [sent-176, score-0.202]
49 For each clustering tree, we first compute a tree-level Nv × Nv affinity matrix At with elements defined as Ait,j = exp(−distt(xi, xj)), where distt(xi, xj) = 0 if l(xi) = l(xj) and +∞ otherwise, (4) with l(x) denoting the leaf into which x falls. [sent-179, score-0.202]
50 That is, we assign the maximum affinity (affinity = 1, distance = 0) to points xi and xj if they fall into the same leaf node, and the minimum affinity (affinity = 0, distance = +∞) otherwise. [sent-181, score-0.264]
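A sketch of this affinity construction, together with the spectral step that follows (assuming scikit-learn is available; the random leaf assignments stand in for clips channelled through a trained forest):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def forest_affinity(leaf_ids):
    # leaf_ids[i, t] is the leaf that clip i reaches in tree t. Per tree,
    # affinity is 1 iff two clips share a leaf (Eqn. (4)); the forest
    # affinity is the average of the tree-level matrices.
    n, T = leaf_ids.shape
    A = np.zeros((n, n))
    for t in range(T):
        A += leaf_ids[:, t, None] == leaf_ids[None, :, t]
    return A / T

# Toy stand-in: 100 clips channelled through a 20-tree forest.
rng = np.random.default_rng(0)
leaf_ids = rng.integers(0, 8, size=(100, 20))
A = forest_affinity(leaf_ids)

# Spectral clustering on the averaged affinity recovers latent clusters.
labels = SpectralClustering(n_clusters=5, affinity="precomputed",
                            random_state=0).fit_predict(A)
```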
51 We then discover the latent clusters of the training clips by spectral clustering on the affinity matrix averaged over all trees. [sent-189, score-0.312]
52 Each training clip xi is then assigned to a cluster ci ∈ C, with C the set of all clusters. [sent-191, score-0.188]
53 The learned clusters group similar clips both visually and semantically. [sent-192, score-0.277]
54 Structure-Driven Non-Visual Tag Inference To summarise a long unseen video with high-level interpretation, we first need to infer the semantic content of each clip in the video. [sent-200, score-0.383]
55 A straightforward way to compute the tag distribution p(yi|x∗) of an unseen clip x∗ is to search for its nearest cluster c∗ ∈ C, and let p(yi|x∗) = p(yi|c∗). [sent-204, score-0.234]
56 3) for generating video synopsis enriched by non-visual semantic labels. [sent-231, score-0.576]
57 Inference: (a) Channel an unseen clip x∗ into individual trees; (b) estimate the nearest clusters of x∗ within the leaves it falls into (hollow circles denote clusters); (c) compute the tag distributions by averaging tree-level predictions. [sent-235, score-0.404]
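A toy sketch of this inference procedure; the two lambda "trees" and the tag histograms below are invented stand-ins for the forest internals, not the paper's implementation:

```python
import numpy as np

# Stand-ins: each "tree" maps a clip descriptor to the nearest training
# cluster among the clips sharing the leaf it falls into (hypothetical).
trees = [
    lambda x: 0 if x[0] < 0.5 else 1,
    lambda x: 0 if x[1] < 0.5 else 1,
]

# Per-cluster tag distributions p(y | c), e.g. weather histograms over
# (sunny, cloudy, rainy) estimated from each cluster's training clips.
cluster_tag_dists = {
    0: np.array([0.8, 0.1, 0.1]),
    1: np.array([0.1, 0.3, 0.6]),
}

def infer_tag_distribution(x_new):
    # (a) channel the unseen clip into each tree; (b) take the nearest
    # cluster within its leaf; (c) average the tree-level predictions.
    per_tree = [cluster_tag_dists[tree(x_new)] for tree in trees]
    return np.mean(per_tree, axis=0)   # p(y | x*) averaged over trees

print(infer_tag_distribution(np.array([0.2, 0.9])))
```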
58 There are a total of 7324 video clips spanning over 14 days in the TISI dataset, whilst a total of 13817 clips were collected across a period of two months in the ERCe dataset. [sent-238, score-0.6]
59 The TISI dataset is challenging due to severe inter-object occlusion, complex behaviour patterns, and large illumination variations caused by both natural and artificial light sources at different times of day. [sent-243, score-0.33]
60 Visual and non-visual sources - We extracted a variety of visual features from each video clip: (a) colour features including RGB and HSV; (b) local texture features. [sent-247, score-0.524]
61 For the ERCe dataset, we collected data from multiple independent on-line sources about the timetable of events, including No Scheduled Event (No Schd.). [sent-263, score-0.439]
62 Note that other visual features and non-visual data types can be considered without altering the training and inference methods of our model as the CC-Forest can cope with different families of visual features as well as distinct types of non-visual sources. [sent-270, score-0.214]
63 Baselines - We compare the proposed model Visual + NonVisual + CC-Forest (VNV-CC-Forest) with: (1) VO-Forest - a conventional forest [1] trained with visual features alone, to demonstrate the benefits of using non-visual sources. [sent-271, score-0.578]
64 (2) VNV-Kmeans - k-means using both visual and nonvisual sources, to highlight the heteroscedasticity and dimensionality discrepancy problems caused by heterogeneous visual and non-visual data. [sent-272, score-0.642]
65 (3) VNV-AASC - a state-of-the-art multi-modal spectral clustering method [10] learned with both visual and non-visual data, to demonstrate the superiority of VNV-CC-Forest in handling diverse data representations and correlating multiple sources through joint information gain optimisation. [sent-273, score-0.76]
66 (5) VPNV(R)-CC-Forest - a variation of our model but with R% of training samples having an arbitrary number of partial non-visual types, to evaluate the robustness of our model in coping with partial/missing nonvisual data. [sent-274, score-0.242]
67 Implementation details - The clustering forest size Tc was set to 1000. [sent-275, score-0.211]
68 Multi-Source Latent Cluster Discovery For validating the effectiveness of different clustering models for multi-source clustering in order to provide more coherent video content grouping (Sec. [sent-287, score-0.387]
69 (Eqn. (3)) is more effective in handling heterogeneous data than conventional clustering models. [sent-303, score-0.323]
70 It is evident that only the VNV-CC-Forest is able to provide coherent video grouping, with only a slight decrease in clustering purity given partial/missing non-visual data. [sent-306, score-0.298]
71 These non-relevant clips are visually ‘close’ to sunny weather, but semantically not. [sent-308, score-0.334]
72 The VNV-CC-Forest model avoids this mistake by correlating both visual and non-visual sources in an information theoretic sense. [sent-309, score-0.565]
73 (X/Y) in the brackets - X refers to the number of clips with sunny weather as shown in the images in the first two columns. [sent-312, score-0.478]
74 The frames inside the red boxes refer to those inconsistent clips in a cluster. [sent-314, score-0.186]
75 Weather tagging confusion matrices on the TISI Dataset. [sent-330, score-0.205]
76 Contextually-Rich Multi-Source Synopsis Generating video synopsis with semantically meaningful contextual labels requires accurate tag prediction (Sec. [sent-333, score-0.642]
77 In this experiment we compared the performance of different methods in inferring tag labels given unseen video clips extracted from long video streams. [sent-336, score-0.653]
78 For quantitative evaluation, we manually annotated three different weather conditions (sunny, cloudy and rainy) and four traffic speeds on all the TISI test clips, as well as eight event categories on all the ERCe test clips. [sent-337, score-0.24]
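Tagging performance against such annotations is typically summarised with row-normalised confusion matrices; below is a small self-contained sketch with invented predictions (not the paper's results):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    # Rows are ground-truth tags, columns are predicted tags; each row is
    # normalised so the diagonal reads as per-class tagging accuracy.
    idx = {label: i for i, label in enumerate(labels)}
    C = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        C[idx[t], idx[p]] += 1
    return C / C.sum(axis=1, keepdims=True)

labels = ["sunny", "cloudy", "rainy"]            # annotated weather tags
y_true = ["sunny", "rainy", "cloudy", "sunny"]   # toy ground truth
y_pred = ["sunny", "cloudy", "cloudy", "sunny"]  # toy predictions
print(confusion_matrix(y_true, y_pred, labels))
```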
79 Correlating and tagging video by weather and traffic conditions - Video synopsis by tagging weather and traffic conditions was tested using the TISI outdoor dataset. [sent-339, score-1.236]
80 Further comparisons of their confusion matrices on weather conditions tagging are provided in Fig. [sent-343, score-0.383]
81 It is worth pointing out that VNV-CC-Forest not only outperforms the other baselines in isolating sunny weather, but also performs well in distinguishing the visually ambiguous cloudy and rainy conditions. [sent-345, score-0.256]
82 Correlating and tagging video by semantic events - Video synopsis by correlating and tagging higher-level semantic events was tested using the ERCe dataset. [sent-380, score-0.28]
83 It is evident that using visual information alone is not sufficient to discover this type of event without the support of additional nonvisual sources (the semantic gap problem). [sent-387, score-0.774]
84 Due to the typically high dimension of visual sources in comparison to non-visual sources, the latter are often overwhelmed by the former in the representation. [sent-388, score-0.406]
85 This suggests that conventional distance-based clustering copes poorly with the inherent heteroscedasticity and dimension discrepancy problems when modelling heterogeneous multi-source independent data. [sent-391, score-0.542]
86 In contrast, the proposed VNV-CC-Forest correlates different sources via a joint information gain criterion to effectively alleviate the heteroscedasticity and dimension discrepancy problems, leading to more robust and accurate tagging performance. [sent-394, score-0.725]
87 Again, it is observed that VPNV(10/20)-CC-Forest performed comparably to VNV-CC-Forest, further validating the robustness of VNV-CC-Forest in tackling partial/missing nonvisual data with the proposed adaptive weighting mechanism (Sec. [sent-395, score-0.218]
88 After inferring the non-visual semantics for the unseen clips, one can readily generate various types of concise video synopsis with enriched contextual interpretation or relevant high-level physical events, using a similar strategy as [14]. [sent-401, score-0.63]
89 In Fig. 8 we show a synopsis with a multi-scale overview of weather changes and traffic conditions over multiple days. [sent-404, score-0.587]
90 In Fig. 9 we depict a synopsis highlighting some of the key events taking place during the first two months of a new semester on a university campus. [sent-407, score-0.53]
91 Further Analysis The superior performance of VNV-CC-Forest can be better explained by examining more closely the capability of CC-Forest in uncovering and exploiting the intrinsic association among different visual sources and more critically among visual and non-visual auxiliary sources. [sent-410, score-0.575]
92 This indirect correlation among multi-heterogeneous data sources results in well-structured decision trees, subsequently leading to more consistent clusters and more accurate semantics inference. [sent-411, score-0.474]
93 Moreover, visual sources also benefited from the correlational support from non-visual information through the cross-source optimisation of individual information gains (Eqn. [sent-416, score-0.5]
94 The latent correlations among heterogeneous visual and multiple non-visual sources discovered on the TISI dataset. [sent-423, score-0.732]
95 Conclusion We have presented a novel unsupervised method for generating contextually-rich and semantically-meaningful video synopsis by correlating visual features and independent sources of non-visual information. [sent-425, score-1.009]
96 The proposed model, which is learned based on a joint information gain criterion for learning latent correlations among different independent data sources, naturally copes with diverse types of data with different representation, distribution, and dimension. [sent-426, score-0.268]
97 Crucially, it is robust to partial and missing nonvisual data. [sent-427, score-0.238]
98 Experimental results have demonstrated that combining both visual and non-visual sources facilitates more accurate video event clustering, with richer semantic interpretation and video tagging than using visual information alone. [sent-428, score-1.115]
99 The usefulness of the proposed model is not limited to video summarisation, and can be explored for other tasks such as multi-source video retrieval and indexing. [sent-429, score-0.236]
100 In addition, the semantic tag distributions inferred by the model can be exploited as the prior for other surveillance tasks such as social role and/or identity inference. [sent-430, score-0.206]
wordName wordTfidf (topN-words)
[('sources', 0.33), ('synopsis', 0.324), ('tisi', 0.278), ('clips', 0.186), ('erce', 0.185), ('nonvisual', 0.185), ('weather', 0.178), ('heterogeneous', 0.174), ('summarisation', 0.162), ('tag', 0.162), ('tagging', 0.134), ('correlating', 0.125), ('video', 0.118), ('forest', 0.117), ('clip', 0.116), ('sunny', 0.114), ('events', 0.109), ('affinity', 0.108), ('event', 0.101), ('gain', 0.099), ('clustering', 0.094), ('heteroscedasticity', 0.093), ('traffic', 0.085), ('yi', 0.08), ('visual', 0.076), ('cluster', 0.072), ('surveillance', 0.071), ('aasc', 0.069), ('vpnv', 0.069), ('unseen', 0.069), ('latent', 0.069), ('discrepancy', 0.069), ('node', 0.066), ('incomplete', 0.063), ('synchronised', 0.062), ('heteroscedastic', 0.062), ('deployment', 0.061), ('whilst', 0.059), ('inaccurate', 0.059), ('ct', 0.058), ('optimisation', 0.058), ('uncovering', 0.057), ('gun', 0.057), ('coping', 0.057), ('clusters', 0.057), ('conventional', 0.055), ('correlation', 0.054), ('rainy', 0.054), ('enriched', 0.054), ('cloudy', 0.054), ('missing', 0.053), ('months', 0.051), ('discovered', 0.05), ('splitting', 0.049), ('leaf', 0.048), ('public', 0.048), ('coherent', 0.048), ('agncor', 0.046), ('csutgomucpccha', 0.046), ('inpmcg', 0.046), ('qmu', 0.046), ('semester', 0.046), ('transcripts', 0.046), ('imagery', 0.044), ('semantic', 0.044), ('service', 0.042), ('tc', 0.042), ('vehicle', 0.041), ('distt', 0.041), ('confusion', 0.039), ('tree', 0.038), ('evident', 0.038), ('wind', 0.038), ('impurity', 0.038), ('mha', 0.038), ('meaningful', 0.038), ('auxiliary', 0.036), ('spectral', 0.036), ('ith', 0.036), ('pages', 0.036), ('copes', 0.036), ('summarise', 0.036), ('busy', 0.036), ('correlational', 0.036), ('generating', 0.036), ('xc', 0.034), ('nv', 0.034), ('enjoy', 0.034), ('semantical', 0.034), ('mistake', 0.034), ('interpretation', 0.034), ('visually', 0.034), ('discovering', 0.033), ('correlations', 0.033), ('decision', 0.033), ('groupings', 0.033), ('validating', 0.033), ('pritch', 0.033), ('matrices', 0.032), ('types', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999875 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
Author: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Abstract: Generating a coherent synopsis for a surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. Additionally, the proposed model is robust to partial or missing non-visual information. We demonstrate the effectiveness of our framework on two crowded public surveillance datasets.
2 0.14756781 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
Author: Arash Vahdat, Greg Mori
Abstract: Gathering accurate training data for recognizing a set of attributes or tags on images or videos is a challenge. Obtaining labels via manual effort or from weakly-supervised data typically results in noisy training labels. We develop the FlipSVM, a novel algorithm for handling these noisy, structured labels. The FlipSVM models label noise by “flipping” labels on training examples. We show empirically that the FlipSVM is effective on images-and-attributes and video tagging datasets.
3 0.12875642 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
4 0.11982746 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering
Author: Zhuwen Li, Jiaming Guo, Loong-Fah Cheong, Steven Zhiying Zhou
Abstract: This paper addresses real-world challenges in the motion segmentation problem, including perspective effects, missing data, and unknown number of motions. It first formulates the 3-D motion segmentation from two perspective views as a subspace clustering problem, utilizing the epipolar constraint of an image pair. It then combines the point correspondence information across multiple image frames via a collaborative clustering step, in which tight integration is achieved via a mixed norm optimization scheme. For model selection, we propose an over-segment and merge approach, where the merging step is based on the property of the ℓ1-norm of the mutual sparse representation of two oversegmented groups. The resulting algorithm can deal with incomplete trajectories and perspective effects substantially better than state-of-the-art two-frame and multi-frame methods. Experiments on a 62-clip dataset show the significant superiority of the proposed idea in both segmentation accuracy and model selection.
5 0.11407956 437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading
Author: Yuru Pei, Tae-Kyun Kim, Hongbin Zha
Abstract: Lipreading from visual channels remains a challenging topic considering the various speaking characteristics. In this paper, we address an efficient lipreading approach by investigating the unsupervised random forest manifold alignment (RFMA). The density random forest is employed to estimate the affinity of patch trajectories in speaking facial videos. We propose novel criteria for node splitting to avoid the rank-deficiency in learning density forests. By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently, which are used to find embeddings of the speaking video clips by a graph-based algorithm. Lipreading is formulated as matching between manifolds of query and reference video clips. We employ the manifold alignment technique for matching, where the L∞-norm-based manifold-to-manifold distance is proposed to find the matching pairs. We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. The experiments demonstrate that lipreading can be performed effectively, and outperforms the state of the art.
6 0.11084509 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
7 0.11049613 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
8 0.10769603 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
9 0.10635307 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
10 0.098623775 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
11 0.093360513 404 iccv-2013-Structured Forests for Fast Edge Detection
12 0.092051685 81 iccv-2013-Combining the Right Features for Complex Event Recognition
13 0.091110311 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
14 0.087172851 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
15 0.085452989 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
16 0.083706692 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies
17 0.083274834 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
18 0.080975905 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
19 0.079323217 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
20 0.079198413 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
topicId topicWeight
[(0, 0.203), (1, 0.074), (2, 0.002), (3, 0.033), (4, 0.032), (5, 0.074), (6, 0.022), (7, 0.03), (8, 0.036), (9, -0.046), (10, -0.053), (11, -0.074), (12, -0.049), (13, 0.092), (14, -0.073), (15, -0.025), (16, -0.038), (17, -0.034), (18, -0.006), (19, 0.072), (20, -0.056), (21, 0.021), (22, 0.026), (23, 0.066), (24, 0.001), (25, -0.023), (26, 0.025), (27, 0.029), (28, 0.029), (29, -0.102), (30, 0.052), (31, 0.017), (32, -0.033), (33, 0.044), (34, 0.011), (35, 0.018), (36, -0.018), (37, 0.047), (38, -0.023), (39, -0.085), (40, -0.029), (41, -0.002), (42, -0.08), (43, -0.154), (44, -0.032), (45, -0.054), (46, 0.009), (47, -0.001), (48, 0.033), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.93973601 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
Author: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Abstract: Generating a coherent synopsis for a surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. Additionally, the proposed model is robust to partial or missing non-visual information. We demonstrate the effectiveness of our framework on two crowded public surveillance datasets.
2 0.65253824 437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading
Author: Yuru Pei, Tae-Kyun Kim, Hongbin Zha
Abstract: Lipreading from visual channels remains a challenging topic considering the various speaking characteristics. In this paper, we address an efficient lipreading approach by investigating the unsupervised random forest manifold alignment (RFMA). The density random forest is employed to estimate the affinity of patch trajectories in speaking facial videos. We propose novel criteria for node splitting to avoid the rank-deficiency in learning density forests. By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently, which are used to find embeddings of the speaking video clips by a graph-based algorithm. Lipreading is formulated as matching between manifolds of query and reference video clips. We employ the manifold alignment technique for matching, where the L∞-norm-based manifold-to-manifold distance is proposed to find the matching pairs. We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. The experiments demonstrate that lipreading can be performed effectively, and outperforms the state of the art.
3 0.64196575 178 iccv-2013-From Semi-supervised to Transfer Counting of Crowds
Author: Chen Change Loy, Shaogang Gong, Tao Xiang
Abstract: Regression-based techniques have shown promising results for people counting in crowded scenes. However, most existing techniques require expensive and laborious data annotation for model training. In this study, we propose to address this problem from three perspectives: (1) Instead of exhaustively annotating every single frame, the most informative frames are selected for annotation automatically and actively. (2) Rather than learning from only labelled data, the abundant unlabelled data are exploited. (3) Labelled data from other scenes are employed to further alleviate the burden for data annotation. All three ideas are implemented in a unified active and semi-supervised regression framework with ability to perform transfer learning, by exploiting the underlying geometric structure of crowd patterns via manifold analysis. Extensive experiments validate the effectiveness of our approach.
4 0.62501204 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
5 0.62322557 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable results. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
6 0.60493058 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
7 0.59317613 404 iccv-2013-Structured Forests for Fast Edge Detection
8 0.59282243 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
9 0.58778733 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
10 0.58552551 352 iccv-2013-Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees
11 0.57986706 145 iccv-2013-Estimating the Material Properties of Fabric from Video
12 0.57886243 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
13 0.57445258 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
14 0.57142764 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
15 0.54899663 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
16 0.54303074 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
18 0.53242034 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
19 0.52607787 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
20 0.52020556 130 iccv-2013-Dynamic Structured Model Selection
topicId topicWeight
[(2, 0.098), (7, 0.027), (12, 0.035), (13, 0.012), (26, 0.056), (31, 0.059), (35, 0.01), (40, 0.02), (42, 0.107), (48, 0.014), (64, 0.057), (65, 0.195), (73, 0.03), (89, 0.168), (95, 0.013), (98, 0.01)]
simIndex simValue paperId paperTitle
1 0.87674385 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
Author: Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, Tamara L. Berg
Abstract: Entry-level categories, the labels people will use to name an object, were originally defined and studied by psychologists in the 1980s. In this paper we study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. Our models combine visual recognition predictions with proxies for word “naturalness” mined from the enormous amounts of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people. We also learn mappings between concepts predicted by existing visual recognition systems and entry-level concepts that could be useful for improving human-focused applications such as natural language image description or retrieval.
same-paper 2 0.8482855 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
Author: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Abstract: Generating a coherent synopsis for a surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. Additionally, the proposed model is robust to partial or missing non-visual information. We demonstrate the effectiveness of our framework on two crowded public surveillance datasets.
3 0.76816094 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
Author: Cheng Deng, Rongrong Ji, Wei Liu, Dacheng Tao, Xinbo Gao
Abstract: Visual reranking has been widely deployed to refine the quality of conventional content-based image retrieval engines. The current trend lies in employing a crowd of retrieved results stemming from multiple feature modalities to boost the overall performance of visual reranking. However, a major challenge pertaining to current reranking methods is how to take full advantage of the complementary property of distinct feature modalities. Given a query image and one feature modality, a regular visual reranking framework treats the top-ranked images as pseudo positive instances which are inevitably noisy, difficult to reveal this complementary property, and thus lead to inferior ranking performance. This paper proposes a novel image reranking approach by introducing a Co-Regularized Multi-Graph Learning (Co-RMGL) framework, in which the intra-graph and inter-graph constraints are simultaneously imposed to encode affinities in a single graph and consistency across different graphs. Moreover, weakly supervised learning driven by image attributes is performed to denoise the pseudo-labeled instances, thereby highlighting the unique strength of individual feature modality. Meanwhile, such learning can yield a few anchors in graphs that vitally enable the alignment and fusion of multiple graphs. As a result, an edge weight matrix learned from the fused graph automatically gives the ordering to the initially retrieved results. We evaluate our approach on four benchmark image retrieval datasets, demonstrating a significant performance gain over the state of the art.
4 0.76650715 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
5 0.76323736 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
Author: Ming-Ming Cheng, Jonathan Warrell, Wen-Yan Lin, Shuai Zheng, Vibhav Vineet, Nigel Crook
Abstract: Detecting visually salient regions in images is one of the fundamental problems in computer vision. We propose a novel method to decompose an image into large scale perceptually homogeneous elements for efficient salient region detection, using a soft image abstraction representation. By considering both appearance similarity and spatial distribution of image pixels, the proposed representation abstracts out unnecessary image details, allowing the assignment of comparable saliency values across similar regions, and producing perceptually accurate salient region detection. We evaluate our salient region detection approach on the largest publicly available dataset with pixel accurate annotations. The experimental results show that the proposed method outperforms 18 alternate methods, reducing the mean absolute error by 25.2% compared to the previous best result, while being computationally more efficient.
6 0.7608093 338 iccv-2013-Randomized Ensemble Tracking
7 0.76051438 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
8 0.75956917 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
9 0.75882828 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
10 0.75855726 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
11 0.75841695 349 iccv-2013-Regionlets for Generic Object Detection
12 0.7577073 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
13 0.75766218 384 iccv-2013-Semi-supervised Robust Dictionary Learning via Efficient l-Norms Minimization
14 0.75731146 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
15 0.75708723 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
16 0.75696248 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
17 0.75651187 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
18 0.75647807 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
19 0.75635803 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
20 0.75622511 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification