iccv iccv2013 iccv2013-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT) to effectively fuse various features. [sent-10, score-0.483]
2 FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. [sent-11, score-0.241]
3 To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. [sent-12, score-0.496]
4 Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. [sent-13, score-0.622]
5 In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection. [sent-14, score-0.735]
6 Introduction. The huge number of videos uploaded and viewed on the Internet makes video analysis a hot topic in the computer vision and multimedia communities. [sent-16, score-0.247]
7 , SIFT [12], Color SIFT [21]), and acoustic features (e.g., MFCC). [sent-21, score-0.18]
8 In the video action recognition and event detection tasks, researchers have developed systems which combine multiple features. [sent-31, score-0.416]
9 While performing action recognition on large-scale video datasets, Reddy and Shah [17] found that combining scene features with motion features improves performance. [sent-32, score-0.185]
10 As for event detection tasks, reports from teams with top performance [26, 14, 15] in the TRECVID MED competition show that fusion, either feature-level or decision-level, brings performance gains to the detection task. [sent-37, score-1.196]
11 Fusion mechanisms can be grouped into two types: feature-level fusion and decision-level fusion. [sent-38, score-0.443]
12 In the feature-level fusion, a linear combination of kernel matrices from different features is used to capture the structure of video data [18]. [sent-39, score-0.132]
13 The other fusion mechanism is decision-level fusion, which applies classifiers to individual features and then fuses the results based on their confidence scores. [sent-41, score-0.66]
14 The authors of [9] find that combining the decision values obtained from the kernel matrices of individual features and the average distances of all the features achieves better performance than using the decision values from each individual feature alone. [sent-43, score-0.29]
15 The most widely used decision-level fusion method is to assign equal weights to the confidence scores from each feature, which may limit the overall performance due to the inconsistency and incomparability of confidence scores from different models. [sent-44, score-0.989]
16 For the event “Birthday party”, the acoustic feature MFCC achieves the best prediction performance, and it is much better than visual motion features. [sent-54, score-0.408]
17 In contrast, for the event “Changing a vehicle tire”, acoustic information becomes less discriminative, so MFCC performs worse than the Dense Trajectories feature. [sent-56, score-0.397]
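To make the score-scale problem concrete, here is a minimal sketch (Python/NumPy; all numbers are invented for illustration, not values from the paper) of how uniform averaging lets the feature with the widest score range dominate the fused ranking:

```python
import numpy as np

# Hypothetical decision values for four test videos from three models whose
# output scales differ (e.g., SVM margins vs. probability-like scores).
scores = {
    "mfcc":               np.array([ 1.8,  1.2, -0.4, -1.5]),   # wide margin scale
    "dense_trajectories": np.array([ 0.9,  0.1,  0.3, -0.2]),
    "sift":               np.array([ 0.55, 0.48, 0.52, 0.41]),  # narrow, probability-like
}

# Uniform ("average-weight") fusion: the feature with the largest numeric
# range dominates the fused ranking regardless of how discriminative it is.
fused = np.mean(list(scores.values()), axis=0)
print(fused)  # ordering is driven almost entirely by the MFCC column
```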
18 Another issue in decision-level fusion is the difference in thresholds among confidence scores from different models. [sent-60, score-0.871]
19 Assume that we retrieve the top 500 videos among 32,000 testing videos according to the confidence scores. [sent-61, score-0.4]
20 Table 2 shows that the threshold of confidence scores from different models can be very different. [sent-62, score-0.293]
21 For example, the Dense Trajectories feature has a higher threshold than the others, which means that in the prediction using the Dense Trajectories feature, only videos with very high confidence scores should be considered positive results. [sent-63, score-0.456]
22 If the differences in thresholds among predictive results are ignored, the discriminative ability of the fusion result degrades. [sent-64, score-0.631]
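The threshold effect can be sketched numerically: fixing a top-500 retrieval depth induces a different operating threshold for each feature whenever their score distributions differ. The distributions below are synthetic stand-ins, not the values behind Table 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic confidence scores over 32,000 test videos for two features with
# different score distributions.
dense_traj = rng.normal(loc=0.6, scale=0.1, size=32_000)
mfcc = rng.normal(loc=0.1, scale=0.3, size=32_000)

def top_k_threshold(scores, k=500):
    """Score of the k-th highest video, i.e. the operating threshold that
    retrieves exactly the top-k results."""
    return np.partition(scores, -k)[-k]

print(top_k_threshold(dense_traj))  # high threshold
print(top_k_threshold(mfcc))        # much lower threshold
# Averaging raw scores ignores this offset; FWOT subtracts a learned
# per-feature threshold b_i before combining.
```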
23 As aforementioned, the weights and thresholds of multiple features are two factors to be considered for feature fusion. [sent-67, score-0.33]
24 In light of this, the fusion algorithm proposed in this paper integrates feature weighting and threshold selection into a joint framework. [sent-68, score-0.718]
25 Inspired by [9], we include the early fusion result in the decision-level fusion. [sent-71, score-0.47]
26 To the best of our knowledge, this is the first work which optimizes weights and thresholds simultaneously for fusion. [sent-72, score-0.254]
27 Instead of directly solving a non-convex and time-consuming problem, we preset a series of thresholds as candidates, which in turn transforms the problem from detecting the optimal thresholds to selecting the best thresholds from the candidates. [sent-73, score-0.683]
28 In that way, the optimized weights and thresholds can be obtained. [sent-82, score-0.254]
29 Related Work. Multiple Kernel Learning (MKL) [16] is the most popular way of combining different kernels to utilize the advantages of different features in applications such as visual object classification, object detection and video semantic analysis. [sent-84, score-0.143]
30 The experiment shows that it is beneficial to exploit the unlabeled data for multiple feature fusion when the labeled data are scarce. [sent-91, score-0.483]
31 Other work proposes to use multiple features to learn different types of video attributes for event detection. [sent-93, score-0.324]
32 [14, 15] propose a decision-level fusion method particularly for event detection. [sent-96, score-0.667]
33 The algorithm adaptively fuses multiple features, assigning weights to videos based on the detection thresholds. [sent-97, score-0.259]
34 The adaptive decision-level fusion assigns lower weights to confidence scores that are near the threshold, and higher weights to videos whose confidence scores are far away from the threshold. [sent-98, score-1.317]
35 Though it is a reasonable way to weight features according to the detection threshold, this method depends heavily on the preset detection threshold. [sent-100, score-0.278]
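The idea can be sketched as follows; the exact weighting function of ALF [14, 15] is not reproduced in this text, so the monotone distance-based form below is only an illustrative assumption:

```python
import numpy as np

def adaptive_confidence_weight(scores, threshold, alpha=1.0):
    """Illustrative stand-in for adaptive decision-level weighting:
    near-threshold scores get weight close to 0, scores far from the
    threshold get weight close to 1. Not ALF's actual formula."""
    return 1.0 - np.exp(-alpha * np.abs(scores - threshold))

scores = np.array([0.49, 0.51, 0.95, 0.05])
print(adaptive_confidence_weight(scores, threshold=0.5))
# A badly preset threshold shifts these weights for every video, which is
# exactly the sensitivity noted above.
```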
36 An illustration of our Feature Weighting via Optimal Thresholding (FWOT) fusion method. [sent-102, score-0.443]
37 Then we show the detailed steps to obtain the optimal fusion function. [sent-117, score-0.472]
38 Problem Formulation. Suppose there are n training videos; we denote each video as a variable x_m ∈ R^d (1 ≤ m ≤ n) and its label as y_m ∈ {−1, +1}, where y_m = +1 indicates x_m is a positive exemplar and y_m = −1 indicates x_m is a negative one. [sent-121, score-0.574]
39 One simple function to combine the confidence scores is f(x) = Σ_{i=1}^{v} w_i · sign(d_i(x) − b_i), (1)
40 where v is the number of features, w_i and b_i are the weight and the threshold for the confidence scores of the i-th feature, respectively, and d_i(x) denotes the confidence score of the i-th feature on video x. [sent-130, score-0.333]
41 The function in (1) indicates that, for the i-th feature, if the confidence score is above the threshold b_i, the video is labeled +1 and otherwise −1; we then combine the label values according to the weights w_i. [sent-131, score-0.337]
42 Thus the final fusion function can be formulated as f(x) = Σ_{i=1}^{v} w_i · tanh((d_i(x) − b_i)/a_i), (2) where a_i is the smoothing parameter of the i-th feature.
43 As the smoothing parameters a are tightly correlated with the thresholds, we formulate the problem as selecting the most appropriate combination of thresholds b and smoothing parameters a, based on which the optimal weights w are learned. [sent-140, score-0.407]
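A minimal sketch of the two fusion functions for a single video; the tanh((d − b)/a) smoothing is a reconstruction consistent with the surrounding text (the original equation was garbled in extraction), so the exact parameterization should be treated as an assumption:

```python
import numpy as np

def fwot_score(d, w, b, a):
    """Evaluate Eq. (1) and Eq. (2) for one video.

    d : (v,) confidence scores from the v individual features
    w : (v,) per-feature weights
    b : (v,) per-feature thresholds
    a : (v,) per-feature smoothing parameters

    As a -> 0, tanh((d - b) / a) approaches the hard indicator
    sign(d - b), so Eq. (2) is a smooth surrogate of Eq. (1).
    """
    hard = float(np.dot(w, np.sign(d - b)))          # Eq. (1)
    smooth = float(np.dot(w, np.tanh((d - b) / a)))  # Eq. (2)
    return hard, smooth

d = np.array([0.8, 0.3, 0.6])   # illustrative scores
w = np.array([0.5, 0.2, 0.3])
b = np.array([0.7, 0.4, 0.5])
a = np.array([0.1, 0.1, 0.1])
print(fwot_score(d, w, b, a))
```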
44 In particular, after we get the confidence scores for the i-th feature, we can uniformly sample s confidence scores as threshold candidates, denoted as b_{i1}, b_{i2}, ..., b_{is}. [sent-141, score-0.533]
45 We also preset r smoothing parameters a_{i1}, a_{i2}, ..., a_{ir}. [sent-145, score-0.152]
46 Denoting the fusion classifier as f(x) = w^T g_D(x), where g_D(x) stacks the tanh-smoothed, thresholded decision values over all candidate pairs, a straightforward way to learn weights for the different features is to minimize an empirical risk of the form Σ_{m=1}^{n} ℓ(y_m w^T g_D(x_m)) over the training videos.
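One way to realize this construction is to expand each video's confidence scores over all (threshold, smoothing) candidate pairs so that the fusion classifier becomes linear in w; the exact layout of g_D below is an assumption consistent with the text, not the paper's verbatim definition:

```python
import numpy as np

def g_D(d, thresh_cands, smooth_cands):
    """Candidate feature map so that f(x) = w^T g_D(x) is linear in w.

    d            : (v,) confidence scores for one video
    thresh_cands : (v, s) threshold candidates b_i1..b_is per feature
    smooth_cands : (v, r) smoothing candidates a_i1..a_ir per feature

    One coordinate per (feature, threshold, smoothing) triple; learning a
    sparse w then *selects* the best candidate pair per feature instead of
    optimizing b and a directly.
    """
    cols = []
    for i in range(d.shape[0]):
        for b in thresh_cands[i]:
            for a in smooth_cands[i]:
                cols.append(np.tanh((d[i] - b) / a))
    return np.asarray(cols)  # length v * s * r

d = np.array([0.8, 0.3])
B = np.array([[0.2, 0.5, 0.7], [0.2, 0.5, 0.7]])
A = np.array([[0.05, 0.1], [0.05, 0.1]])
print(g_D(d, B, A).shape)  # (12,) = 2 features * 3 thresholds * 2 smoothings
```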
47 Our approach generates a pool of threshold-smoothing parameter candidates iteratively with the cutting plane algorithm, which keeps the number of base matrices in each iteration much smaller than in the original problem. [sent-264, score-0.141]
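A high-level sketch of such a cutting-plane loop, assuming a hypothetical restricted solver `solve_restricted` and using a hinge-residual proxy for the "most violated" candidate (the paper's actual violation criterion is not reproduced in this text):

```python
import numpy as np

def cutting_plane_fwot(G_full, y, solve_restricted, n_iter=20, tol=1e-6):
    """G_full : (n_videos, n_candidates) precomputed g_D coordinates
    y        : (n_videos,) labels in {-1, +1}
    solve_restricted(G_active, y) -> w : hypothetical convex subproblem solver

    Keeps a small active candidate set, solves the restricted problem, and
    adds the candidate column that most violates the current solution.
    """
    active = [0]                                # start from one candidate
    w = solve_restricted(G_full[:, active], y)
    for _ in range(n_iter):
        f = G_full[:, active] @ w
        resid = np.maximum(0.0, 1.0 - y * f)    # hinge residuals
        viol = np.abs(G_full.T @ (resid * y))   # proxy violation score
        viol[active] = -np.inf                  # never re-add active columns
        j = int(np.argmax(viol))
        if viol[j] <= tol:                      # no violated candidate left
            break
        active.append(j)
        w = solve_restricted(G_full[:, active], y)
    return active, w
```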
48 Experiments. We test our approach on three publicly available datasets: the HMDB action dataset [8], the Columbia Consumer Video (CCV) dataset [5] and the TRECVID MED 2011 dataset [1] (including the DEV-T and DEV-O collections). [sent-329, score-0.175]
49 In the experiments, we use the same pipeline as described in [24] to evaluate the performance of the proposed method on action recognition, video classification and event detection. [sent-330, score-0.373]
50 In the CCV dataset, we use all the acoustic and visual features provided by the authors of [5]. [sent-331, score-0.18]
51 In addition to visual features, we use 4,096-dimensional MFCC BoWs [26, 15, 19] as the acoustic feature in the event detection experiment. [sent-336, score-0.483]
52 In the classification process, we adopt LIBSVM to generate the confidence scores from the probability outputs, and the χ2-kernel is applied to each type of feature. [sent-337, score-0.24]
53 In addition to the confidence scores from the basic features, we also use the predictive scores from the average of the kernel matrices (early fusion) to enhance performance. [sent-339, score-0.358]
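A sketch of this per-feature classification step using scikit-learn's SVC (which wraps LIBSVM) with a precomputed χ2 kernel; the helper and its inputs are illustrative rather than the authors' exact code:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def per_feature_scores(X_feats, y, X_test_feats):
    """X_feats / X_test_feats: lists of non-negative per-feature BoW
    matrices for the same train/test videos; y: binary labels.
    Returns an (n_test, v + 1) matrix of confidence scores: one column per
    feature plus one "early fusion" column from the averaged kernel."""
    scores, kernels = [], []
    for X, X_test in zip(X_feats, X_test_feats):
        K_train = chi2_kernel(X, X)
        K_test = chi2_kernel(X_test, X)
        clf = SVC(kernel="precomputed", probability=True).fit(K_train, y)
        scores.append(clf.predict_proba(K_test)[:, 1])  # positive class
        kernels.append((K_train, K_test))
    # Early fusion: one more classifier on the average of the kernel matrices.
    K_avg_tr = np.mean([k[0] for k in kernels], axis=0)
    K_avg_te = np.mean([k[1] for k in kernels], axis=0)
    clf = SVC(kernel="precomputed", probability=True).fit(K_avg_tr, y)
    scores.append(clf.predict_proba(K_avg_te)[:, 1])
    return np.column_stack(scores)
```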
54 We compare the result with state-of-the-art fusion algorithms, including Early Kernel Fusion (EKF) [18], Multiple Kernel Learning (MKL) [16], and LPBoost [4]. [sent-340, score-0.443]
55 Other late fusion methods, such as a linear SVM on top of normalized decision scores from all the different features, have a similar optimization goal and performance consistent with LPBoost. [sent-341, score-0.674]
56 Thus, among the late fusion comparison algorithms, we only report the result of LPBoost. [sent-342, score-0.515]
57 In TRECVID MED DEV-T and DEV-O collections, we additionally compare the result with Adaptive Late Fusion (ALF) [15], which is particularly designed for event detection. [sent-344, score-0.224]
58 In the stage of presetting threshold-smoothing parameter candidates, we sample every 10th confidence score as a threshold candidate and empirically set the smoothing parameter candidates as {0. [sent-345, score-0.451]
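The presetting step might look like the sketch below; the smoothing values are assumptions, since the original candidate set is truncated in this text:

```python
import numpy as np

def preset_candidates(train_scores, step=10):
    """train_scores: (n,) cross-validated confidence scores for one feature.
    Takes every 10th sorted score as a threshold candidate, as described in
    the text; the smoothing candidates below are illustrative placeholders
    for the truncated '{0. ...}' set."""
    thresh_cands = np.sort(train_scores)[::step]
    smooth_cands = np.array([0.01, 0.05, 0.1])  # assumed values
    return thresh_cands, smooth_cands
```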
59 There are 6,766 videos in total from 51 distinct action categories in HMDB. [sent-367, score-0.208]
60 The huge diversity in visible body parts, camera motion, camera viewpoint, number of people in the action and video quality makes it a very difficult benchmark dataset for state-of-the-art action recognition methods (Table 3 reports mean accuracy, with Dense Trajectories [23] as the best individual feature). [sent-370, score-0.179]
61 The top row shows the performance of the best individual feature, and the others indicate the performance of fusion methods. [sent-375, score-0.481]
62 In our experiment, we use the three official standard training/testing splits identified by [8], which contain 70 videos for training and 30 videos for testing per action category. [sent-380, score-0.246]
63 Before the fusion stage, we train a multi-class SVM classifier for each visual feature with a one-vs-all approach. [sent-382, score-0.483]
64 Confidence scores for training videos are obtained by 5-fold cross-validation. [sent-383, score-0.209]
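A sketch of obtaining such out-of-fold scores with scikit-learn, so the fusion weights are never fit on scores a base classifier produced for its own training data; the RBF kernel here is a stand-in for the precomputed χ2 kernel used elsewhere in the pipeline:

```python
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def cv_train_scores(X, y):
    """X: (n_train, d) feature matrix, y: binary labels (hypothetical names).
    Returns 5-fold out-of-fold positive-class probabilities."""
    clf = SVC(kernel="rbf", probability=True)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
    return proba[:, 1]
```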
65 After the weighted fusion, we choose the action category with the highest confidence score as the predicted result. [sent-384, score-0.239]
66 Results are shown in Table 3, in which we list the performance of the best individual feature, Dense Trajectories, to show the improvement of the fusion methods over the individual feature. [sent-385, score-0.559]
67 The comparison in Table 3 shows that, for action recognition in unconstrained videos on the HMDB dataset, our proposed method outperforms the state-of-the-art fusion methods by appropriately assigning optimal weights to multiple features. [sent-386, score-0.746]
68 Experiments on the Columbia Consumer Video dataset. For the video classification task, we use the Columbia Consumer Video (CCV) dataset [5] to compare the performance of different fusion methods. [sent-389, score-0.567]
69 In the CCV dataset, there are 9,317 videos in total across 20 semantic categories, of which 4,659 videos are used as training data and 4,658 videos as testing data. [sent-390, score-0.369]
70 Consumer videos contain very diverse content and have far fewer textual tags and descriptions, which motivates content analysis based on both acoustic and visual features. [sent-392, score-0.267]
71 The top row shows the performance of the best individual feature, and the others indicate the performance of fusion methods. [sent-398, score-0.481]
72 In Table 4, we report the experimental results of different fusion methods; the performance of the best individual feature, SIFT, is reported as well. [sent-410, score-0.521]
73 We can see from the table that our proposed method can discriminate among features in different situations and achieves a significant improvement over the other fusion methods. [sent-416, score-0.479]
74 MED raises a question for the multimedia and computer vision communities: given some descriptions of an event and a set of illustrative video exemplars, can a system detect the occurrence of an event using acoustic and visual information (individually or together)? [sent-420, score-0.716]
75 In 2011, NIST collected a dataset which consists of about 32,000 testing videos from various Internet video hosting sites, namely the DEV-O collection. [sent-421, score-0.217]
76 MED 11 DEV-O collection: 10 events are used in the DEV-O collection to test the performance of multimedia event detection system. [sent-424, score-0.469]
77 The total duration of the DEV-O collection is about 1,200 hours, which makes it possibly the largest available dataset with meaningful labels for video analysis. [sent-427, score-0.158]
78 Different from the recognition datasets, many videos in the MED 2011 DEV-T and DEV-O collections do not belong to any event; these are called null data. [sent-430, score-0.191]
79 The videos in the DEV-T and DEV-O collections have huge variance in terms of quality, duration, scene and so forth [1], which makes MED a great challenge for content-based video analysis. [sent-431, score-0.255]
80 In our experiment, all of the positive video exemplars for each event are used in the training data. [sent-432, score-0.321]
81 When detecting one event, we train a binary χ2-kernel SVM classifier for each feature to obtain the confidence scores. [sent-436, score-0.194]
82 5-fold cross-validation is used to get the confidence scores for training data. [sent-437, score-0.24]
83 We report the Average Precision and Pmiss@TER=12.5 of different methods on the DEV-T and DEV-O collections in Table 5 and Table 6. [sent-447, score-0.128]
84 We additionally compare our algorithm to Adaptive Late Fusion (ALF), which was proposed in [15] particularly for event detection. [sent-448, score-0.224]
85 Note that in the Adaptive Late Fusion (ALF) algorithm, thresholds are set before the fusion process, and bad thresholds lead to weak performance of the ALF method. [sent-450, score-0.819]
86 Comparison on the DEV-T and DEV-O collections shows that ALF may suffer from the difficulty of obtaining a good detection threshold and can show unstable performance in the fusion stage. [sent-460, score-0.637]
87 In contrast, our method learns proper thresholds jointly with the fusion weights, which makes the fusion more robust in the event detection system. [sent-461, score-0.945]
88 Table 6 reports the Pmiss@TER=12.5 of different fusion methods on every event in the TRECVID MED 11 DEV-O collection. [sent-463, score-0.667]
89 We can see that our fusion method outperforms other state-of-the-art fusion algorithms in 8 out of 10 events in the TRECVID MED 11 DEV-O collection. [sent-464, score-0.964]
90 Conclusion. In this paper, we have introduced an approach to leverage multiple features by decision-level fusion, which simultaneously optimizes the weights and thresholds applied to the features' confidence scores. [sent-466, score-0.566]
91 We formulate the problem as selecting the most appropriate combination of thresholds and smoothing parameters, based on which the optimal weights are learned. [sent-467, score-0.345]
92 We first preset a large pool of threshold and smoothing parameter candidates, then use the cutting plane algorithm to obtain the optimal weights and thresholds, which is efficient even for large-scale problems. [sent-468, score-0.528]
93 Experiments on the HMDB dataset and the CCV dataset show that our approach outperforms other state-of-the-art methods on action recognition and consumer video classification. [sent-469, score-0.275]
94 In addition, we achieve the best performance among different fusion methods on the large-scale TRECVID MED 2011 video dataset (including the DEV-T and DEV-O collections) using both Average Precision and Pmiss@TER=12.5. [sent-470, score-0.537]
95 The experimental results confirm that our method is superior to other fusion methods for different video analysis tasks. [sent-472, score-0.507]
96 Multi-channel shape-flow kernel descriptors for robust video event detection and retrieval. [sent-585, score-0.363]
97 Multimodal feature fusion for robust event detection in web videos. [sent-595, score-0.75]
98 Evaluation of low-level features and their combinations for complex event detection in open source videos. [sent-627, score-0.303]
99 Multi-feature fusion via hierarchical regression for multimedia analysis. [sent-675, score-0.503]
100 Informedia E-Lamp @ TRECVID 2012: Multimedia event detection and recounting (MED and MER). [sent-686, score-0.561]
wordName wordTfidf (topN-words)
[('fusion', 0.443), ('med', 0.294), ('pmiss', 0.286), ('event', 0.224), ('thresholds', 0.188), ('hmdb', 0.178), ('trecvid', 0.168), ('confidence', 0.154), ('ccv', 0.144), ('acoustic', 0.144), ('mfcc', 0.136), ('xm', 0.134), ('videos', 0.123), ('stip', 0.115), ('alf', 0.11), ('dijk', 0.11), ('fwot', 0.11), ('tanh', 0.097), ('ter', 0.094), ('preset', 0.09), ('scores', 0.086), ('action', 0.085), ('xq', 0.085), ('events', 0.078), ('violated', 0.075), ('bows', 0.075), ('late', 0.072), ('trajectories', 0.07), ('collections', 0.068), ('consumer', 0.066), ('weights', 0.066), ('pkp', 0.066), ('qymyqkd', 0.066), ('tmraejethcotodries', 0.066), ('mip', 0.065), ('video', 0.064), ('collection', 0.064), ('cutting', 0.062), ('smoothing', 0.062), ('multimedia', 0.06), ('mosift', 0.058), ('mkl', 0.058), ('threshold', 0.053), ('candidates', 0.048), ('weighting', 0.047), ('sebe', 0.047), ('dense', 0.046), ('natarajan', 0.045), ('lmepfkbkwfoloo', 0.044), ('lpboost', 0.044), ('minimax', 0.044), ('qymyq', 0.044), ('wcohrkainb', 0.044), ('ymwtgd', 0.044), ('gd', 0.044), ('md', 0.043), ('detection', 0.043), ('party', 0.042), ('thresholding', 0.041), ('feature', 0.04), ('birthday', 0.04), ('nm', 0.04), ('lan', 0.039), ('psc', 0.039), ('dp', 0.039), ('individual', 0.038), ('columbia', 0.037), ('decision', 0.037), ('ap', 0.037), ('features', 0.036), ('ym', 0.036), ('kd', 0.035), ('indicator', 0.034), ('oss', 0.034), ('kuehne', 0.034), ('minm', 0.034), ('multipliers', 0.034), ('exemplars', 0.033), ('parade', 0.032), ('klaser', 0.032), ('kernel', 0.032), ('dimensional', 0.032), ('convex', 0.031), ('nist', 0.031), ('inequality', 0.031), ('plane', 0.031), ('dataset', 0.03), ('sift', 0.03), ('getting', 0.03), ('vitaladevuni', 0.03), ('rix', 0.03), ('optimal', 0.029), ('vehicle', 0.029), ('tire', 0.029), ('reddy', 0.029), ('lagrange', 0.029), ('iarpa', 0.028), ('early', 0.027), ('fuses', 0.027), ('zhuang', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
2 0.29561883 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
3 0.21868806 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
Author: Arash Vahdat, Kevin Cannons, Greg Mori, Sangmin Oh, Ilseo Kim
Abstract: We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel, multiple kernel learning (MKL) latent support vector machine (SVM) is defined, that is used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.
4 0.21069013 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
Author: Dan Oneata, Jakob Verbeek, Cordelia Schmid
Abstract: Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex models, we focus on the low-level features and their encoding. We evaluate the use of Fisher vectors as an alternative to bag-of-word histograms to aggregate a small set of state-of-the-art low-level descriptors, in combination with linear classifiers. We present a large and varied set of evaluations, considering (i) classification of short actions in five datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that for basic action recognition and localization MBH features alone are enough for stateof-the-art performance. For complex events we find that SIFT and MFCC features provide complementary cues. On all three problems we obtain state-of-the-art results, while using fewer features and less complex models.
5 0.18574631 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
Author: Lukas Bossard, Matthieu Guillaumin, Luc Van_Gool
Abstract: The task of recognizing events in photo collections is central for automatically organizing images. It is also very challenging, because of the ambiguity of photos across different event classes and because many photos do not convey enough relevant information. Unfortunately, the field still lacks standard evaluation data sets to allow comparison of different approaches. In this paper, we introduce and release a novel data set of personal photo collections containing more than 61,000 images in 807 collections, annotated with 14 diverse social event classes. Casting collections as sequential data, we build upon recent and state-of-the-art work in event recognition in videos to propose a latent sub-event approach for event recognition in photo collections. However, photos in collections are sparsely sampled over time and come in bursts from which transpires the importance of specific moments for the photographers. Thus, we adapt a discriminative hidden Markov model to allow the transitions between states to be a function of the time gap between consecutive images, which we coin as Stopwatch Hidden Markov model (SHMM). In our experiments, we show that our proposed model outperforms approaches based only on feature pooling or a classical hidden Markov model. With an average accuracy of 56%, we also highlight the difficulty of the data set and the need for future advances in event recognition in photo collections.
6 0.17957173 81 iccv-2013-Combining the Right Features for Complex Event Recognition
7 0.16871215 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
8 0.16454485 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
9 0.16223188 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
10 0.15389344 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
11 0.14691864 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
12 0.14549822 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
13 0.10804414 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
14 0.10519257 39 iccv-2013-Action Recognition with Improved Trajectories
15 0.10319988 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
16 0.10125253 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
17 0.099388912 155 iccv-2013-Facial Action Unit Event Detection by Cascade of Tasks
18 0.09874098 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
19 0.092948146 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
20 0.090525702 6 iccv-2013-A Convex Optimization Framework for Active Learning
topicId topicWeight
[(0, 0.191), (1, 0.153), (2, 0.051), (3, 0.133), (4, 0.038), (5, 0.089), (6, 0.095), (7, -0.045), (8, -0.025), (9, -0.073), (10, -0.121), (11, -0.118), (12, -0.01), (13, 0.125), (14, -0.118), (15, -0.075), (16, 0.028), (17, 0.037), (18, 0.045), (19, 0.055), (20, 0.027), (21, 0.015), (22, -0.021), (23, 0.089), (24, 0.062), (25, 0.052), (26, -0.003), (27, 0.114), (28, 0.031), (29, -0.015), (30, -0.057), (31, 0.006), (32, 0.019), (33, 0.016), (34, -0.003), (35, -0.069), (36, 0.0), (37, -0.033), (38, -0.013), (39, 0.033), (40, -0.02), (41, -0.069), (42, 0.045), (43, 0.116), (44, 0.027), (45, 0.002), (46, -0.008), (47, -0.059), (48, -0.093), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.93358219 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
2 0.88971692 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, complex event is a higher level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantic of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
3 0.81882614 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
Author: Chen Sun, Ram Nevatia
Abstract: The goal of high level event classification from videos is to assign a single, high level event label to each query video. Traditional approaches represent each video as a set of low level features and encode it into a fixed length feature vector (e.g. Bag-of-Words), which leave a big gap between low level visual features and high level events. Our paper tries to address this problem by exploiting activity concept transitions in video events (ACTIVE). A video is treated as a sequence of short clips, all of which are observations corresponding to latent activity concept variables in a Hidden Markov Model (HMM). We propose to apply Fisher Kernel techniques so that the concept transitions over time can be encoded into a compact and fixed length feature vector very efficiently. Our approach can utilize concept annotations from independent datasets, and works well even with a very small number of training samples. Experiments on the challenging NIST TRECVID Multimedia Event Detection (MED) dataset shows our approach performs favorably over the state-of-the-art.
4 0.75280547 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
Author: Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
Abstract: Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the cooccurrence and geometric constraints of human pose and object in 3D space; ii) the sub-events transition and objects coherence in 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multiview 3D event dataset which contains 3815 video sequences and 383,036 RGBD frames captured by the Kinect cameras. The experiment results on this dataset show the effectiveness of our method.
5 0.74413234 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
Author: Yifan Zhang, Qiang Ji, Hanqing Lu
Abstract: In complex scenes with multiple atomic events happening sequentially or in parallel, detecting each individual event separately may not always obtain robust and reliable result. It is essential to detect them in a holistic way which incorporates the causality and temporal dependency among them to compensate the limitation of current computer vision techniques. In this paper, we propose an interval temporal constrained dynamic Bayesian network to extend Allen's interval algebra network (IAN) [2] from a deterministic static model to a probabilistic dynamic system, which can not only capture the complex interval temporal relationships, but also model the evolution dynamics and handle the uncertainty from the noisy visual observation. In the model, the topology of the IAN on each time slice and the interlinks between the time slices are discovered by an advanced structure learning method. The duration of the event and the unsynchronized time lags between two correlated event intervals are captured by a duration model, so that we can better determine the temporal boundary of the event. Empirical results on two real world datasets show the power of the proposed interval temporal constrained model.
6 0.74087828 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
7 0.73003829 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
8 0.66348529 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
9 0.65510553 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
10 0.5737837 81 iccv-2013-Combining the Right Features for Complex Event Recognition
11 0.54975855 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
12 0.52218461 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
13 0.48982376 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
14 0.47772375 400 iccv-2013-Stable Hyper-pooling and Query Expansion for Event Detection
15 0.4536823 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
16 0.44170007 38 iccv-2013-Action Recognition with Actons
17 0.42126369 116 iccv-2013-Directed Acyclic Graph Kernels for Action Recognition
18 0.40946871 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
19 0.40423369 167 iccv-2013-Finding Causal Interactions in Video Sequences
20 0.39316893 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
topicId topicWeight
[(2, 0.07), (4, 0.311), (7, 0.02), (26, 0.053), (31, 0.026), (42, 0.077), (48, 0.012), (64, 0.033), (68, 0.019), (73, 0.043), (77, 0.018), (78, 0.017), (89, 0.192), (98, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.78113306 163 iccv-2013-Feature Weighting via Optimal Thresholding for Video Analysis
Author: Zhongwen Xu, Yi Yang, Ivor Tsang, Nicu Sebe, Alexander G. Hauptmann
Abstract: Fusion of multiple features can boost the performance of large-scale visual classification and detection tasks like the TRECVID Multimedia Event Detection (MED) competition [1]. In this paper, we propose a novel feature fusion approach, namely Feature Weighting via Optimal Thresholding (FWOT), to effectively fuse various features. FWOT learns the weights, thresholding and smoothing parameters in a joint framework to combine the decision values obtained from all the individual features and the early fusion. To the best of our knowledge, this is the first work to consider the weight and threshold factors of the fusion problem simultaneously. Compared to state-of-the-art fusion algorithms, our approach achieves promising improvements on the HMDB [8] action recognition dataset and the CCV [5] video classification dataset. In addition, experiments on two TRECVID MED 2011 collections show that our approach outperforms the state-of-the-art fusion methods for complex event detection.
2 0.73136163 195 iccv-2013-Hidden Factor Analysis for Age Invariant Face Recognition
Author: Dihong Gong, Zhifeng Li, Dahua Lin, Jianzhuang Liu, Xiaoou Tang
Abstract: Age invariant face recognition has received increasing attention due to its great potential in real world applications. In spite of the great progress in face recognition techniques, reliably recognizing faces across ages remains a difficult task. The facial appearance of a person changes substantially over time, resulting in significant intra-class variations. Hence, the key to tackle this problem is to separate the variation caused by aging from the person-specific features that are stable. Specifically, we propose a new method, called Hidden Factor Analysis (HFA). This method captures the intuition above through a probabilistic model with two latent factors: an identity factor that is age-invariant and an age factor affected by the aging process. Then, the observed appearance can be modeled as a combination of the components generated based on these factors. We also develop a learning algorithm that jointly estimates the latent factors and the model parameters using an EM procedure. Extensive experiments on two well-known public domain face aging datasets: MORPH (the largest public face aging database) and FGNET, clearly show that the proposed method achieves notable improvement over state-of-the-art algorithms.
3 0.71950281 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
Author: Jian Sun, Jean Ponce
Abstract: In this paper, we address the problem of learning discriminative part detectors from image sets with category labels. We propose a novel latent SVM model regularized by group sparsity to learn these part detectors. Starting from a large set of initial parts, the group sparsity regularizer forces the model to jointly select and optimize a set of discriminative part detectors in a max-margin framework. We propose a stochastic version of a proximal algorithm to solve the corresponding optimization problem. We apply the proposed method to image classification and cosegmentation, and quantitative experiments with standard benchmarks show that it matches or improves upon the state of the art.
4 0.7155354 158 iccv-2013-Fast High Dimensional Vector Multiplication Face Recognition
Author: Oren Barkan, Jonathan Weill, Lior Wolf, Hagai Aronowitz
Abstract: This paper advances descriptor-based face recognition by suggesting a novel usage of descriptors to form an over-complete representation, and by proposing a new metric learning pipeline within the same/not-same framework. First, the Over-Complete Local Binary Patterns (OCLBP) face representation scheme is introduced as a multi-scale modified version of the Local Binary Patterns (LBP) scheme. Second, we propose an efficient matrix-vector multiplication-based recognition system. The system is based on Linear Discriminant Analysis (LDA) coupled with Within Class Covariance Normalization (WCCN). This is further extended to the unsupervised case by proposing an unsupervised variant of WCCN. Lastly, we introduce Diffusion Maps (DM) for non-linear dimensionality reduction as an alternative to the Whitened Principal Component Analysis (WPCA) method which is often used in face recognition. We evaluate the proposed framework on the LFW face recognition dataset under the restricted, unrestricted and unsupervised protocols. In all three cases we achieve very competitive results.
5 0.68137002 71 iccv-2013-Category-Independent Object-Level Saliency Detection
Author: Yangqing Jia, Mei Han
Abstract: It is known that purely low-level saliency cues such as frequency does not lead to a good salient object detection result, requiring high-level knowledge to be adopted for successful discovery of task-independent salient objects. In this paper, we propose an efficient way to combine such high-level saliency priors and low-level appearance models. We obtain the high-level saliency prior with the objectness algorithm to find potential object candidates without the need of category information, and then enforce the consistency among the salient regions using a Gaussian MRF with the weights scaled by diverse density that emphasizes the influence of potential foreground pixels. Our model obtains saliency maps that assign high scores for the whole salient object, and achieves state-of-the-art performance on benchmark datasets covering various foreground statistics.
6 0.65871119 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
7 0.64435208 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
8 0.63363475 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
9 0.62012911 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
10 0.61518407 193 iccv-2013-Heterogeneous Auto-similarities of Characteristics (HASC): Exploiting Relational Information for Classification
11 0.6133002 106 iccv-2013-Deep Learning Identity-Preserving Face Space
12 0.61133981 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
13 0.61090684 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
14 0.60953313 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
15 0.6055823 142 iccv-2013-Ensemble Projection for Semi-supervised Image Classification
16 0.6048978 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
17 0.60482645 85 iccv-2013-Compositional Models for Video Event Detection: A Multiple Kernel Learning Latent Variable Approach
18 0.60471255 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
19 0.59761018 74 iccv-2013-Co-segmentation by Composition
20 0.5916087 238 iccv-2013-Learning Graphs to Match