iccv iccv2013 iccv2013-396 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolas Ballas, Yi Yang, Zhen-Zhong Lan, Bertrand Delezoide, Françoise Prêteux, Alexander Hauptmann
Abstract: We address the problem of action recognition in unconstrained videos. We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, which outperforms the state-of-the-art by a relative 7.3%.
Reference: text
sentIndex sentText sentNum sentScore
1 Space-Time Robust Video Representation for Action Recognition Nicolas Ballas CEA-List & Mines-ParisTech [sent-1, score-0.038]
2 Zhen-zhong Lan Carnegie Mellon University [sent-9, score-0.096]
3 Abstract We address the problem of action recognition in unconstrained videos. [sent-13, score-0.219]
4 We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. [sent-14, score-0.6]
5 Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. [sent-15, score-0.312]
6 Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. [sent-16, score-1.283]
7 To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. [sent-17, score-1.001]
8 A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. [sent-18, score-0.038]
9 We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). [sent-19, score-0.219]
10 Introduction With the constant expansion of visual online collections, action recognition has become an important problem in computer vision. [sent-24, score-0.219]
11 It is a difficult task since online videos are subject to large visual diversity. [sent-25, score-0.037]
12 A BoF (Bag of Features) is computed in 3 steps: (1) local feature extraction, (2) local feature coding and (3) local feature pooling. [sent-27, score-0.09]
13 Such an algorithm discards the local feature position information in the video space-time volume. [sent-30, score-0.124]
14 However, this space-time context carries discriminative information. Figure 1: “Soccer” and “Running” are likely to be distinguished by the area surrounding the human legs while “Clap” and “Wave” are more easily distinguished by the upper-bodies. [sent-31, score-0.192]
15 Figure 2 (Inter-Videos / Intra-Video): In different videos, action localization can be subject to variation due to camera viewpoint changes. [sent-32, score-0.037]
16 But, even within a single video sequence, the action area can change across frames. [sent-33, score-0.313]
17 Indeed, discriminative information is not equally distributed in the video space-time domain as shown by Figure 1. [sent-35, score-0.176]
18 To benefit from this context, spatial pooling [12, 11] divides a video using fixed segmentation grids and pools the features locally in each grid cell. [sent-36, score-0.884]
19 Despite the performance improvement, spatial pooling loses the BoF space-time invariance. [sent-37, score-0.437]
20 Different action instances with various localizations in the space-time volume can result in divergent representations. [sent-38, score-0.315]
21 This problem is severe for actions which have dramatic space-time variance, as illustrated in Figure 2. [sent-39, score-0.088]
22 In this case, spatial pooling divides one action across different grid cells which may lead to a significant performance drop. [sent-40, score-0.809]
23 A BoF representation robust to space-time variance is therefore critical. Figure 3 (legend: action words, background words; panels: fixed grid segmentation, dynamic segmentation): Illustration of the importance of space-time robustness. [sent-41, score-0.041]
24 In this work, we propose to take advantage of the discriminative space-time context with an emphasis on retaining the space-time robustness. [sent-43, score-0.169]
25 Beyond standard spatial pooling which uses fixed segmentation grids, we segment a video according to its content through saliency maps. [sent-44, score-1.201]
26 Our algorithm relies on the idea that the discriminative information has a non-uniform distribution in saliency spaces. [sent-45, score-0.625]
27 For example, “Running” is more likely to be distinguished from “Walking” by regions subject to high motion. [sent-46, score-0.153]
28 In addition, different saliencies can highlight different regions in the video space-time volumes. [sent-47, score-0.216]
29 We introduce a novel space-time invariant pooling which leverages the space-time context. [sent-50, score-0.465]
30 We first extract video structural cues using various saliency measures. [sent-51, score-0.807]
31 We then aggregate the local feature statistics over fixed saliency sub-regions, each sub-region defining a structural primitive. [sent-52, score-0.743]
32 Focusing on different structural aspects, cornerness, light and motion saliencies are investigated. [sent-53, score-0.37]
33 Cornerness highlights regions repeatable under geometric transformations, motion identifies regions with strong dynamics and light provides coarse object segmentation. [sent-54, score-0.355]
34 To automatically determine the optimal combination of structural primitives associated with a specific action, we introduce a sparse feature weighting regularizer, which is able to assign optimal weights to different feature groups. [sent-55, score-0.297]
35 We apply an ℓ2,p norm regularizer to a linear SVM classifier and propose a Weighted SVM (WSVM) for action recognition. [sent-59, score-0.257]
36 Related Work Spatial pooling [12, 11] has successfully demonstrated a performance improvement over classic BoF. [sent-62, score-0.393]
37 However, to be fully effective, the feature space-time statistics must align with the segmentation grids, which have fixed aspect ratios. [sent-63, score-0.218]
38 Recent efforts [22, 5, 7, 2] have tried to exploit richer spatial or temporal information by learning segmentation grids adapted to a specific task. [sent-64, score-0.262]
39 Jia [7] relies on sparsity to select segmentation grids in an overcomplete basis while Sharma and Harada [22, 5] learn weighting schemes associated with predefined segmentation grids. [sent-65, score-0.348]
40 Since all those approaches partition local features in the spatial domain, they are not robust to space-time changes. [sent-66, score-0.112]
41 It is also not robust to time variation since the local features are pooled in the temporal domain. [sent-70, score-0.03]
42 Rahtu [17] uses saliency to segment objects from images. [sent-72, score-0.548]
43 Wang [26] uses saliency to compute highly discriminative local descriptors. [sent-73, score-0.621]
44 In an image recognition context, Parikh, Shahbaz and Moosmann [14, 15, 16, 21] define sparse sampling strategies to detect local features. [sent-74, score-0.03]
45 We do not use saliency information to sample features but to pool them. [sent-76, score-0.548]
46 We identify prominent regions in a video through saliency to model the space-time context while preserving the space-time robustness. [sent-77, score-0.809]
47 In the remainder of this paper, we start by introducing our space-time invariant pooling. [sent-78, score-0.034]
48 Space-Time Robust Representation Figure 3 compares two pooling schemes using a 2 × 2 static grid segmentation or a dynamic segmentation based on motion saliency. [sent-82, score-0.61]
49 Due to its localization variance, the action falls in different cells of the static grid, leading to two spatial BoFs having low similarity despite depicting the same action. [sent-83, score-0.469]
50 By segmenting the video dynamically, the second pooling scheme remains robust to the action space-time variance while still taking advantage of the local feature space-time context. [sent-84, score-0.836]
51 This motivates us to propose a novel pooling algorithm using video content information. [sent-85, score-0.55]
52 Content Driven Pooling In the following, we first reformulate the spatial pooling problem and then extend this formulation to take advantage of video content information. [sent-88, score-0.623]
53 Let D = {d1, ..., dM} be a set of local features extracted from a video. [sent-92, score-0.035]
54 Let G = {G1, ..., GL} be a set of grids; Gi is a binary matrix indicating which video voxels are active, Gi ∈ {0, 1}^(sx × sy × st), with (sx, sy, st) being the video dimensions. [sent-98, score-0.132]
55 Based on those definitions, we express the max spatial pooling operation as (1). [sent-99, score-0.474]
56 Xi,j = max_(x,y,t) Gi,j(x,y,t) × code(d_ω(x,y,t))  (1), where ω : R3 → [1, M] is a function indexing the descriptors D based on their positions. [sent-100, score-0.085]
57 The coding function code : D → R^K is a local feature coding scheme such as sparse-coding. [sent-101, score-0.099]
58 Figure 4: Illustration of the space-time invariant pooling and the WSVM algorithm. [sent-106, score-0.427]
59 (1) relies on max pooling since it improves the class separation [1]. [sent-108, score-0.427]
60 Traditional spatial pooling uses a set of pre-defined pyramidal grids segmenting the video in increasingly finer cells. [sent-109, score-0.784]
61 Recent pooling works [22, 5, 7] learn G directly from data achieving task-specific segmentation. [sent-110, score-0.393]
62 Both approaches pool local features in the space-time domain. [sent-111, score-0.03]
63 Differently, we aim at modeling the space-time context while remaining robust to the space-time variance. [sent-112, score-0.05]
64 To do so, we identify prominent regions using saliency. [sent-113, score-0.117]
65 As shown in Figure 4a, we (i) extract saliency information from a video, then, (ii) order local features in rank lists according to each saliency and (iii) capture local feature statistics in various rank list sub-regions. [sent-114, score-1.156]
66 As a result, our pooling scheme does not require space-time information to compute video regions, and, it performs video-specific segmentation based on their structural cues. [sent-115, score-0.711]
67 Since our pooling uses ranks to group features instead of absolute values, it remains invariant to global translation in the saliency space. [sent-116, score-0.975]
68 To formulate our content driven pooling, we modify the indexing function ω in (1) to include video structural cues. [sent-117, score-0.411]
69 Let P = {p1, ..., pM} be the saliency values for each local feature. [sent-121, score-0.548]
70 Let φ : [1, M] → [1, M] be a ranking function ordering the local features according to P. [sent-122, score-0.038]
71 To obtain the ordering Φ = {φ(1), ..., φ(M)}, we minimize the functional min_Φ Σ_{i=1}^{M} i · p_φ(i), so that d_φ(1) is the local feature having the highest saliency value. [sent-126, score-0.03]
72 Xi,k = max_{j ∈ [1, M]} Gi,k(j) × code(d_φ(j))  (2). With (2), the pooling is performed in the saliency instead of the space-time domain. [sent-130, score-0.941]
73 We use a set of pyramidal grids S0, ..., S_{L−1} such that each grid Si is composed of 2^i equally sized cells: G = {Gi,1, ..., Gi,2^i}. [sent-135, score-0.099]
74 Xi,k captures the distribution of local features over a saliency sub-region. [sent-140, score-0.578]
75 The structural primitives Xi,k are ℓ2 normalized and concatenated to obtain the video signature X. [sent-145, score-0.038]
76 When using several saliency functions, we repeat this pooling operation for each measure and concatenate all the resulting structural primitives. [sent-151, score-1.143]
77 We take advantage of the video visual data through saliency measures to identify prominent or salient areas: pi = s(di). [sent-158, score-0.687]
78 s : D → [0, 1] is a local measure that describes how much a feature di differs relative to its immediate neighborhood [6]. [sent-159, score-0.03]
80 We focus on 3 different saliency functions : “cornerness”, “light” and “motion”. [sent-161, score-0.548]
81 The cornerness saliency highlights visually distinctive features, which are repeatable under geometric transformation. [sent-162, score-0.863]
82 Feature cornerness is estimated with the Harris-Laplace transform [14]. [sent-163, score-0.216]
83 The L (Light) component of the color space is divided in 60 equal-sized bins and the light saliency is computed by an efficient center-surround operation using sliding windows [17]. [sent-166, score-0.704]
84 Motion saliency considers the video optical flow computed for each video frame through the Farneback algorithm [3]. [sent-167, score-0.736]
85 The motion saliency is then computed with the same sliding windows approach as the light saliency [17]. [sent-169, score-1.254]
86 Weighting Structural Primitives As shown in Figure 5, the saliency measures emphasize different areas of the video space-time volume. [sent-171, score-0.674]
87 The discriminative power of those regions is non-uniform and tends to be action dependent, i.e. [sent-172, score-0.043]
88 the saliency measures are not equally discriminative for the different actions. [sent-174, score-0.63]
89 Figure 5 (columns: Reference, Cornerness, Light, Motion): Illustration of prominent areas detected with the different saliency measures. [sent-175, score-1.239]
90 The most discriminative saliency measure for each action is indicated by the red contour. [sent-176, score-0.81]
91 For instance, motion saliency emphasizes foreground as well as background areas for an action subject to strong camera movement, while light saliency remains robust to this phenomenon. [sent-177, score-0.893]
92 By focusing on only a few structural primitives at classification, we could take advantage of the saliency functions which best fit the action of interest while discarding areas containing irrelevant or noisy information. [sent-178, score-1.205]
93 In this section, we introduce an SVM algorithm with a sparse feature weighting regularizer, illustrated in Figure 4b, that determines the optimal layout of structural primitives given an action. [sent-179, score-0.031]
94 Weighted SVM Model Let X ∈ R^(N×d) be N training video signatures and Y ∈ {0, 1}^N their corresponding binary labels. [sent-182, score-0.094]
95 Each video signature Xi is the concatenation of the structural primitives. [sent-183, score-0.391]
96 A linear SVM combined with max-pooling has demonstrated encouraging results in the context of image classification while limiting the training complexity to O(n) [27]. [sent-189, score-0.05]
97 This norm attaches the same importance to each coefficient in W, i.e. the SVM model weights all structural primitives equally. [sent-206, score-0.038]
98 To leverage the non-uniform discriminative power of structural primitives, we propose to prioritize only the most substantial groups Wg for an action while discarding the irrelevant ones by adding a sparsity constraint on W. [sent-209, score-0.573]
wordName wordTfidf (topN-words)
[('saliency', 0.548), ('pooling', 0.393), ('action', 0.219), ('cornerness', 0.216), ('wsvm', 0.216), ('structural', 0.165), ('grids', 0.159), ('wg', 0.142), ('primitives', 0.132), ('bof', 0.111), ('video', 0.094), ('light', 0.089), ('cornernes', 0.086), ('segmenta', 0.086), ('xiw', 0.086), ('saliencies', 0.077), ('prominent', 0.072), ('distinguished', 0.071), ('carnegie', 0.064), ('mellon', 0.064), ('hmdb', 0.064), ('alex', 0.064), ('pyramidal', 0.064), ('content', 0.063), ('zh', 0.061), ('localizations', 0.061), ('grid', 0.06), ('segmentation', 0.059), ('repeatable', 0.059), ('cmu', 0.058), ('driven', 0.056), ('svm', 0.055), ('sy', 0.052), ('regularizer', 0.05), ('context', 0.05), ('spacetime', 0.047), ('cells', 0.047), ('divides', 0.046), ('regions', 0.045), ('spatial', 0.044), ('discarding', 0.043), ('discriminative', 0.043), ('sx', 0.043), ('variance', 0.041), ('highlights', 0.04), ('yi', 0.04), ('pm', 0.039), ('equally', 0.039), ('motion', 0.039), ('leverages', 0.038), ('visa', 0.038), ('otraf', 0.038), ('tition', 0.038), ('bal', 0.038), ('bofs', 0.038), ('cdr', 0.038), ('isproposed', 0.038), ('ega', 0.038), ('vfi', 0.038), ('fran', 0.038), ('tsotr', 0.038), ('ofan', 0.038), ('tfhinee', 0.038), ('norm', 0.038), ('identifies', 0.038), ('operation', 0.037), ('sparsity', 0.037), ('subject', 0.037), ('focusing', 0.036), ('dwe', 0.035), ('oflocal', 0.035), ('clap', 0.035), ('divergent', 0.035), ('harada', 0.035), ('ffic', 0.035), ('franco', 0.035), ('fou', 0.035), ('gma', 0.035), ('ldo', 0.035), ('invariant', 0.034), ('relies', 0.034), ('peo', 0.033), ('fers', 0.033), ('prioritize', 0.033), ('illustration', 0.033), ('indexing', 0.033), ('irrelevant', 0.033), ('areas', 0.032), ('nicolas', 0.032), ('tri', 0.032), ('transformations', 0.032), ('layout', 0.031), ('nat', 0.031), ('tohdee', 0.031), ('thei', 0.031), ('local', 0.03), ('sliding', 0.03), ('segmenting', 0.03), ('advantage', 0.029), ('pools', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 396 iccv-2013-Space-Time Robust Representation for Action Recognition
Author: Nicolas Ballas, Yi Yang, Zhen-Zhong Lan, Bertrand Delezoide, Françoise Prêteux, Alexander Hauptmann
Abstract: We address the problem of action recognition in unconstrained videos. We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, which outperforms the state-of-the-art by a relative 7.3%.
2 0.43327251 71 iccv-2013-Category-Independent Object-Level Saliency Detection
Author: Yangqing Jia, Mei Han
Abstract: It is known that purely low-level saliency cues such as frequency does not lead to a good salient object detection result, requiring high-level knowledge to be adopted for successful discovery of task-independent salient objects. In this paper, we propose an efficient way to combine such high-level saliency priors and low-level appearance models. We obtain the high-level saliency prior with the objectness algorithm to find potential object candidates without the need of category information, and then enforce the consistency among the salient regions using a Gaussian MRF with the weights scaled by diverse density that emphasizes the influence of potential foreground pixels. Our model obtains saliency maps that assign high scores for the whole salient object, and achieves state-of-the-art performance on benchmark datasets covering various foreground statistics.
3 0.42393997 372 iccv-2013-Saliency Detection via Dense and Sparse Reconstruction
Author: Xiaohui Li, Huchuan Lu, Lihe Zhang, Xiang Ruan, Ming-Hsuan Yang
Abstract: In this paper, we propose a visual saliency detection algorithm from the perspective of reconstruction errors. The image boundaries are first extracted via superpixels as likely cues for background templates, from which dense and sparse appearance models are constructed. For each image region, we first compute dense and sparse reconstruction errors. Second, the reconstruction errors are propagated based on the contexts obtained from K-means clustering. Third, pixel-level saliency is computed by an integration of multi-scale reconstruction errors and refined by an object-biased Gaussian model. We apply the Bayes formula to integrate saliency measures based on dense and sparse reconstruction errors. Experimental results show that the proposed algorithm performs favorably against seventeen state-of-the-art methods in terms of precision and recall. In addition, the proposed algorithm is demonstrated to be more effective in highlighting salient objects uniformly and robust to background noise.
4 0.36035717 91 iccv-2013-Contextual Hypergraph Modeling for Salient Object Detection
Author: Xi Li, Yao Li, Chunhua Shen, Anthony Dick, Anton Van_Den_Hengel
Abstract: Salient object detection aims to locate objects that capture human attention within images. Previous approaches often pose this as a problem of image contrast analysis. In this work, we model an image as a hypergraph that utilizes a set of hyperedges to capture the contextual properties of image pixels or regions. As a result, the problem of salient object detection becomes one of finding salient vertices and hyperedges in the hypergraph. The main advantage of hypergraph modeling is that it takes into account each pixel’s (or region ’s) affinity with its neighborhood as well as its separation from image background. Furthermore, we propose an alternative approach based on centerversus-surround contextual contrast analysis, which performs salient object detection by optimizing a cost-sensitive support vector machine (SVM) objective function. Experimental results on four challenging datasets demonstrate the effectiveness of the proposed approaches against the stateof-the-art approaches to salient object detection.
5 0.33146128 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
Author: Ali Borji, Hamed R. Tavakoli, Dicky N. Sihite, Laurent Itti
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scanpath sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
6 0.31766868 373 iccv-2013-Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics
7 0.27122563 217 iccv-2013-Initialization-Insensitive Visual Tracking through Voting with Salient Local Features
8 0.25502333 374 iccv-2013-Salient Region Detection by UFO: Uniqueness, Focusness and Objectness
9 0.24777967 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
10 0.23850431 370 iccv-2013-Saliency Detection in Large Point Sets
11 0.23150857 371 iccv-2013-Saliency Detection via Absorbing Markov Chain
12 0.21864042 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
13 0.19890678 369 iccv-2013-Saliency Detection: A Boolean Map Approach
14 0.17501229 86 iccv-2013-Concurrent Action Detection with Structural Prediction
15 0.15747246 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
16 0.15653123 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
17 0.15346639 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
18 0.14703734 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
19 0.14576715 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
20 0.13978553 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
topicId topicWeight
[(0, 0.228), (1, 0.082), (2, 0.511), (3, -0.068), (4, -0.163), (5, 0.019), (6, 0.082), (7, -0.11), (8, -0.002), (9, -0.002), (10, -0.014), (11, 0.064), (12, 0.006), (13, -0.036), (14, 0.045), (15, -0.104), (16, 0.108), (17, -0.006), (18, -0.007), (19, 0.056), (20, 0.005), (21, 0.006), (22, -0.003), (23, -0.07), (24, -0.026), (25, -0.004), (26, 0.036), (27, -0.009), (28, 0.039), (29, 0.022), (30, 0.009), (31, -0.031), (32, -0.038), (33, 0.024), (34, -0.009), (35, 0.008), (36, -0.048), (37, -0.075), (38, 0.037), (39, 0.03), (40, -0.055), (41, -0.015), (42, 0.015), (43, 0.021), (44, -0.014), (45, 0.042), (46, -0.022), (47, 0.009), (48, -0.018), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.95835823 396 iccv-2013-Space-Time Robust Representation for Action Recognition
Author: Nicolas Ballas, Yi Yang, Zhen-Zhong Lan, Bertrand Delezoide, Françoise Prêteux, Alexander Hauptmann
Abstract: We address the problem of action recognition in unconstrained videos. We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, which outperforms the state-of-the-art by a relative 7.3%.
2 0.85654652 91 iccv-2013-Contextual Hypergraph Modeling for Salient Object Detection
Author: Xi Li, Yao Li, Chunhua Shen, Anthony Dick, Anton Van_Den_Hengel
Abstract: Salient object detection aims to locate objects that capture human attention within images. Previous approaches often pose this as a problem of image contrast analysis. In this work, we model an image as a hypergraph that utilizes a set of hyperedges to capture the contextual properties of image pixels or regions. As a result, the problem of salient object detection becomes one of finding salient vertices and hyperedges in the hypergraph. The main advantage of hypergraph modeling is that it takes into account each pixel’s (or region ’s) affinity with its neighborhood as well as its separation from image background. Furthermore, we propose an alternative approach based on centerversus-surround contextual contrast analysis, which performs salient object detection by optimizing a cost-sensitive support vector machine (SVM) objective function. Experimental results on four challenging datasets demonstrate the effectiveness of the proposed approaches against the stateof-the-art approaches to salient object detection.
3 0.83103687 369 iccv-2013-Saliency Detection: A Boolean Map Approach
Author: Jianming Zhang, Stan Sclaroff
Abstract: A novel Boolean Map based Saliency (BMS) model is proposed. An image is characterized by a set of binary images, which are generated by randomly thresholding the image ’s color channels. Based on a Gestalt principle of figure-ground segregation, BMS computes saliency maps by analyzing the topological structure of Boolean maps. BMS is simple to implement and efficient to run. Despite its simplicity, BMS consistently achieves state-of-the-art performance compared with ten leading methods on five eye tracking datasets. Furthermore, BMS is also shown to be advantageous in salient object detection.
4 0.8293401 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
Author: Ali Borji, Hamed R. Tavakoli, Dicky N. Sihite, Laurent Itti
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scanpath sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
5 0.82554978 71 iccv-2013-Category-Independent Object-Level Saliency Detection
Author: Yangqing Jia, Mei Han
Abstract: It is known that purely low-level saliency cues such as frequency does not lead to a good salient object detection result, requiring high-level knowledge to be adopted for successful discovery of task-independent salient objects. In this paper, we propose an efficient way to combine such high-level saliency priors and low-level appearance models. We obtain the high-level saliency prior with the objectness algorithm to find potential object candidates without the need of category information, and then enforce the consistency among the salient regions using a Gaussian MRF with the weights scaled by diverse density that emphasizes the influence of potential foreground pixels. Our model obtains saliency maps that assign high scores for the whole salient object, and achieves state-of-the-art performance on benchmark datasets covering various foreground statistics.
6 0.82549852 372 iccv-2013-Saliency Detection via Dense and Sparse Reconstruction
7 0.82203668 374 iccv-2013-Salient Region Detection by UFO: Uniqueness, Focusness and Objectness
8 0.80615336 373 iccv-2013-Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics
9 0.79428738 370 iccv-2013-Saliency Detection in Large Point Sets
10 0.79417264 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
11 0.7532776 371 iccv-2013-Saliency Detection via Absorbing Markov Chain
12 0.60274249 217 iccv-2013-Initialization-Insensitive Visual Tracking through Voting with Salient Local Features
13 0.49778625 38 iccv-2013-Action Recognition with Actons
14 0.47405827 40 iccv-2013-Action and Event Recognition with Fisher Vectors on a Compact Feature Set
15 0.47016394 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
16 0.44435841 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding
17 0.42549405 86 iccv-2013-Concurrent Action Detection with Structural Prediction
18 0.42545661 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
19 0.39152187 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
20 0.3903656 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
topicId topicWeight
[(2, 0.08), (4, 0.018), (7, 0.016), (10, 0.181), (26, 0.131), (31, 0.04), (35, 0.011), (42, 0.078), (48, 0.01), (64, 0.065), (73, 0.026), (89, 0.179), (97, 0.046), (98, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.85039854 396 iccv-2013-Space-Time Robust Representation for Action Recognition
Author: Nicolas Ballas, Yi Yang, Zhen-Zhong Lan, Bertrand Delezoide, Françoise Prêteux, Alexander Hauptmann
Abstract: We address the problem of action recognition in unconstrained videos. We propose a novel content driven pooling that leverages space-time context while being robust toward global space-time transformations. Being robust to such transformations is of primary importance in unconstrained videos where the action localizations can drastically shift between frames. Our pooling identifies regions of interest using video structural cues estimated by different saliency functions. To combine the different structural information, we introduce an iterative structure learning algorithm, WSVM (weighted SVM), that determines the optimal saliency layout of an action model through a sparse regularizer. A new optimization method is proposed to solve the WSVM's highly non-smooth objective function. We evaluate our approach on standard action datasets (KTH, UCF50 and HMDB). Most noticeably, the accuracy of our algorithm reaches 51.8% on the challenging HMDB dataset, which outperforms the state-of-the-art by a relative 7.3%.
2 0.80063176 392 iccv-2013-Similarity Metric Learning for Face Recognition
Author: Qiong Cao, Yiming Ying, Peng Li
Abstract: Recently, there is a considerable amount of efforts devoted to the problem of unconstrained face verification, where the task is to predict whether pairs of images are from the same person or not. This problem is challenging and difficult due to the large variations in face images. In this paper, we develop a novel regularization framework to learn similarity metrics for unconstrained face verification. We formulate its objective function by incorporating the robustness to the large intra-personal variations and the discriminative power of novel similarity metrics. In addition, our formulation is a convex optimization problem which guarantees the existence of its global solution. Experiments show that our proposed method achieves the state-of-the-art results on the challenging Labeled Faces in the Wild (LFW) database [10].
3 0.79845005 35 iccv-2013-Accurate Blur Models vs. Image Priors in Single Image Super-resolution
Author: Netalee Efrat, Daniel Glasner, Alexander Apartsin, Boaz Nadler, Anat Levin
Abstract: Over the past decade, single image Super-Resolution (SR) research has focused on developing sophisticated image priors, leading to significant advances. Estimating and incorporating the blur model, that relates the high-res and low-res images, has received much less attention, however. In particular, the reconstruction constraint, namely that the blurred and downsampled high-res output should approximately equal the low-res input image, has been either ignored or applied with default fixed blur models. In this work, we examine the relative importance ofthe imageprior and the reconstruction constraint. First, we show that an accurate reconstruction constraint combined with a simple gradient regularization achieves SR results almost as good as those of state-of-the-art algorithms with sophisticated image priors. Second, we study both empirically and theoretically the sensitivity of SR algorithms to the blur model assumed in the reconstruction constraint. We find that an accurate blur model is more important than a sophisticated image prior. Finally, using real camera data, we demonstrate that the default blur models of various SR algorithms may differ from the camera blur, typically leading to over- smoothed results. Our findings highlight the importance of accurately estimating camera blur in reconstructing raw low- res images acquired by an actual camera.
4 0.79745215 414 iccv-2013-Temporally Consistent Superpixels
Author: Matthias Reso, Jörn Jachalsky, Bodo Rosenhahn, Jörn Ostermann
Abstract: Superpixel algorithms represent a very useful and increasingly popular preprocessing step for a wide range of computer vision applications, as they offer the potential to boost efficiency and effectiveness. In this regards, this paper presents a highly competitive approach for temporally consistent superpixelsfor video content. The approach is based on energy-minimizing clustering utilizing a novel hybrid clustering strategy for a multi-dimensional feature space working in a global color subspace and local spatial subspaces. Moreover, a new contour evolution based strategy is introduced to ensure spatial coherency of the generated superpixels. For a thorough evaluation the proposed approach is compared to state of the art supervoxel algorithms using established benchmarks and shows a superior performance.
5 0.78824627 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
Author: Jifeng Dai, Ying Nian Wu, Jie Zhou, Song-Chun Zhu
Abstract: Cosegmentation refers to theproblem ofsegmenting multiple images simultaneously by exploiting the similarities between the foreground and background regions in these images. The key issue in cosegmentation is to align common objects between these images. To address this issue, we propose an unsupervised learning framework for cosegmentation, by coupling cosegmentation with what we call “cosketch ”. The goal of cosketch is to automatically discover a codebook of deformable shape templates shared by the input images. These shape templates capture distinct image patterns and each template is matched to similar image patches in different images. Thus the cosketch of the images helps to align foreground objects, thereby providing crucial information for cosegmentation. We present a statistical model whose energy function couples cosketch and cosegmentation. We then present an unsupervised learning algorithm that performs cosketch and cosegmentation by energy minimization. Experiments show that our method outperforms state of the art methods for cosegmentation on the challenging MSRC and iCoseg datasets. We also illustrate our method on a new dataset called Coseg-Rep where cosegmentation can be performed within a single image with repetitive patterns.
6 0.78726494 102 iccv-2013-Data-Driven 3D Primitives for Single Image Understanding
7 0.78719866 8 iccv-2013-A Deformable Mixture Parsing Model with Parselets
8 0.78710645 411 iccv-2013-Symbiotic Segmentation and Part Localization for Fine-Grained Categorization
9 0.78641647 20 iccv-2013-A Max-Margin Perspective on Sparse Representation-Based Classification
10 0.78373575 150 iccv-2013-Exemplar Cut
11 0.78358495 371 iccv-2013-Saliency Detection via Absorbing Markov Chain
12 0.78256315 71 iccv-2013-Category-Independent Object-Level Saliency Detection
13 0.78247762 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
14 0.78240997 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
15 0.78032404 423 iccv-2013-Towards Motion Aware Light Field Video for Dynamic Scenes
16 0.77820039 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
17 0.77748752 372 iccv-2013-Saliency Detection via Dense and Sparse Reconstruction
18 0.7774719 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
19 0.77742106 196 iccv-2013-Hierarchical Data-Driven Descent for Efficient Optimal Deformation Estimation
20 0.7772904 295 iccv-2013-On One-Shot Similarity Kernels: Explicit Feature Maps and Properties