Author: Dmitry Rudoy, Dan B. Goldman, Eli Shechtman, Lihi Zelnik-Manor

Abstract: During recent years remarkable progress has been made in visual saliency modeling. Our interest is in video saliency. Since videos are fundamentally different from still images, they are viewed differently by human observers. For example, the time each video frame is observed is a fraction of a second, while a still image can be viewed leisurely. Therefore, video saliency estimation methods should differ substantially from image saliency methods. In this paper we propose a novel method for video saliency estimation, which is inspired by the way people watch videos. We explicitly model the continuity of the video by predicting the saliency map of a given frame, conditioned on the map from the previous frame. Furthermore, accuracy and computation speed are improved by restricting the salient locations to a carefully selected candidate set. We validate our method using two gaze-tracked video datasets and show we outperform the state-of-the-art.

1 Learning video saliency from human gaze using candidate selection. Dmitry Rudoy (Technion, Haifa, Israel), Dan B Goldman (Adobe Research, Seattle, WA). [sent-1, score-1.345]

2 During recent years remarkable progress has been made in visual saliency modeling. [sent-4, score-0.543]

3 Therefore, video saliency estimation methods should differ substantially from image saliency methods. [sent-8, score-1.188]

4 In this paper we propose a novel method for video saliency estimation, which is inspired by the way people watch videos. [sent-9, score-0.685]

5 We explicitly model the continuity of the video by predicting the saliency map of a given frame, conditioned on the map from the previous frame. [sent-10, score-0.749]

6 Another application that might take advantage of human gaze prediction is video editing [1]: knowing where the viewer looks could help to create smoother shot transitions. [sent-17, score-0.683]

7 Moreover, we hypothesize that reliable gaze prediction may drive gaze-aware video compression or key-frame selection [15]. [sent-18, score-0.581]

8 Image saliency is well explored in the computer vision community. [sent-19, score-0.543]

9 The saliency maps overlaid on the images show that video saliency is tighter and more concentrated on a single object, while image saliency covers several interesting locations. [sent-30, score-1.76]

10 The difference between human fixations when viewing a static image versus a video frame is exemplified in Figure 1. [sent-33, score-0.578]

11 In this work we propose a method that predicts saliency by explicitly accounting for gaze transitions over time. [sent-35, score-1.101]

12 Rather than trying to model where people look in each frame independently, we predict the gaze location given the previous frame’s fixation map. [sent-36, score-0.834]

13 In this way, we handle inter-frame dynamics of the gaze transitions, along with within-frame salient locations. [sent-37, score-0.666]

14 To this end we learn a model that predicts a saliency map for a frame given the fixation map from a recent preceding moment and test it on a large set of realistic videos. [sent-38, score-0.97]

15 A key contribution of this work is the observation that saliency in video is typically very sparse and computing it at each and every pixel is redundant. [sent-39, score-0.645]

16 Instead, we select a set of candidate gaze locations, and compute saliency only at these locations. [sent-40, score-1.194]

17 We verify experimentally that our candidate-based approach outperforms the pixel-based approach, and is significantly better than an image-saliency-based approach. [sent-42, score-0.573]

18 Since a video is a stream of frames, the human gaze in each frame depends on the previous gaze locations. [sent-44, score-1.177]

19 Later, Koch and Ullman [20] proposed a feedforward model for the integration, along with the concept of a saliency map, a measure of the visual attraction of every point in the scene. [sent-64, score-0.582]

20 Since then much progress in image saliency has been made. [sent-67, score-0.543]

21 Seo and Milanfar [26] propose using self-resemblance in both static and space-time saliency detection. [sent-76, score-0.652]

22 [8] take a different approach: they concentrate on motion saliency only and detect it by using temporal spectral analysis. [sent-78, score-0.631]

23 Our work differs from previous video saliency methods by narrowing the focus to a small number of candidate gaze locations, and learning conditional gaze transitions over time. [sent-81, score-1.825]

24 Motivation and overview. Most previous saliency modeling methods calculate a saliency value for every pixel. [sent-83, score-1.133]

25 In our work we propose to calculate saliency at a small set of candidate locations, instead of at every pixel. [sent-84, score-0.787]

26 First, we observe that image saliency studies concentrate on a single image stimulus, without any prior. [sent-86, score-0.572]

27 This is usually achieved by “resetting” the participants’ gaze: presenting a black screen or a single target in the center. [sent-87, score-0.454]

28 Here, the gaze varies little between frames, and when it does change significantly it is highly constrained to local regions. [sent-89, score-0.454]

29 Our second observation is that when watching dynamic scenes people usually follow the action and the characters by shifting their gaze to a new interesting location in the scene. [sent-91, score-0.639]

30 Focusing on a sparse candidate set of salient locations allows us to model and learn these transitions explicitly with a relatively small computational effort. [sent-92, score-0.377]

31 To accommodate these observations our system consists of three phases: identifying candidate gaze locations at each frame (Section 4), extracting features for those locations (Section 5. [sent-93, score-0.859]

32 1) and learning or predicting gaze probabilities for each candidate (Section 5. [sent-94, score-0.677]

33 The static and semantic candidate locations are generated separately for every video frame. [sent-103, score-0.507]

34 The motion candidates are computed using optical flow between neighboring pairs of frames, and therefore implicitly account for the dynamics in the video. [sent-104, score-0.427]

35 Static candidates. Since a video is composed of individual frames, we start with candidates that attract people's attention due to static cues. [sent-108, score-0.848]

36 For a given frame of interest we calculate the graph-based visual saliency (GBVS), proposed by Harel et al. [sent-109, score-0.708]

37 We preferred GBVS over other image saliency methods for two main reasons: (i) it has been shown that GBVS accurately predicts human fixations in static images [3], and (ii) it is fast to calculate compared to more accurate methods [18]. [sent-111, score-0.944]

38 We hypothesize that other image saliency detection methods may be used instead. [sent-112, score-0.543]

39 Given the image saliency map we wish to find the most attractive candidate regions within it. [sent-113, score-0.833]

40 We treat the normalized saliency map as a distribution and use it to sample a large number of random points. [sent-114, score-0.582]
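
The sampling step can be sketched as follows. This is a minimal illustration, assuming the saliency map is a small nested list of non-negative values; the function name `sample_points`, the toy map, and the fixed seed are hypothetical and not taken from the paper.

```python
import random

def sample_points(saliency, n_samples, seed=0):
    """Draw (row, col) positions with probability proportional to saliency."""
    rng = random.Random(seed)
    height, width = len(saliency), len(saliency[0])
    cells = [(r, c) for r in range(height) for c in range(width)]
    weights = [saliency[r][c] for r, c in cells]
    if sum(weights) <= 0:
        raise ValueError("saliency map must contain positive mass")
    # random.choices normalizes the weights internally, so the map
    # does not have to sum to one beforehand.
    return rng.choices(cells, weights=weights, k=n_samples)

# Toy map (assumption): most of the mass sits at the center cell.
toy_map = [
    [0.0, 0.0, 0.0],
    [0.0, 8.0, 1.0],
    [0.0, 1.0, 0.0],
]
points = sample_points(toy_map, n_samples=1000)
```

Cells with zero saliency are never drawn, so the samples concentrate around the salient regions and can then be grouped into a few candidate locations.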

41 Finally, we estimate the covariance matrix of each candidate by fitting a Gaussian to the saliency map in the neighborhood of the candidate location. [sent-117, score-1.037]
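
One way to picture the covariance estimate is a saliency-weighted moment computation over a window around the candidate. The sketch below assumes moment matching in a square window; the paper's actual fitting procedure is not given in this excerpt, and `fit_local_gaussian` and `radius` are hypothetical names.

```python
def fit_local_gaussian(saliency, center, radius):
    """Saliency-weighted mean and 2x2 covariance in a square window."""
    row0, col0 = center
    height, width = len(saliency), len(saliency[0])
    window = [
        (r, c, saliency[r][c])
        for r in range(max(0, row0 - radius), min(height, row0 + radius + 1))
        for c in range(max(0, col0 - radius), min(width, col0 + radius + 1))
    ]
    total = sum(w for _, _, w in window)
    if total <= 0:
        raise ValueError("window contains no saliency mass")
    mean_r = sum(w * r for r, _, w in window) / total
    mean_c = sum(w * c for _, c, w in window) / total
    var_rr = sum(w * (r - mean_r) ** 2 for r, _, w in window) / total
    var_cc = sum(w * (c - mean_c) ** 2 for _, c, w in window) / total
    cov_rc = sum(w * (r - mean_r) * (c - mean_c) for r, c, w in window) / total
    return (mean_r, mean_c), [[var_rr, cov_rc], [cov_rc, var_cc]]

# Toy symmetric blob (assumption): the covariance should be isotropic.
blob = [
    [0.0, 1.0, 0.0],
    [1.0, 4.0, 1.0],
    [0.0, 1.0, 0.0],
]
mean, cov = fit_local_gaussian(blob, center=(1, 1), radius=1)
```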

42 Motion candidates. Modeling the saliency in independent frames is insufficient for videos since it ignores the dynamics. [sent-126, score-0.887]

43 To produce motion candidates we first calculate the optical flow between consecutive frames [22]. [sent-129, score-0.483]

44 The motion candidates are created from the DoG map in the same way as the static candidates are created from the image saliency map (i. [sent-132, score-1.323]

45 Semantic candidates. Finally, we wish to add semantic candidates to our set. [sent-140, score-0.588]

46 The original frame is shown in gray (for visualization). It is overlaid with: (a) the GBVS saliency map and (b) optical flow magnitude. [sent-143, score-0.768]

47 Modeling gaze dynamics. Having extracted a set of candidates, we next wish to select the most salient one. [sent-165, score-0.818]

48 We accomplish this by learning a transition probability: the probability of shifting from one gaze location in a source frame to a new one in a destination frame. [sent-166, score-1.106]

49 This transition is different from a saccade: we are dealing with a shift of the entire distribution, while a saccade is a rapid movement of a gaze point. [sent-167, score-0.632]

50 This allows us to model the gaze dynamics in the video and predict the saliency more accurately. [sent-169, score-1.133]

51 Features. To model changes in focus of attention we associate a feature vector with pairs of source and destination candidates in a given pair of frames. [sent-172, score-0.67]

52 The features can be categorized into two sets: destination frame features and inter-frame features. [sent-174, score-0.379]

53 We experimented with the use of source frame features as well, but found these features led to overfitting in the learning process, as they are only slightly different from the destination frame features. [sent-175, score-0.578]

54 It is important to note that all types of features are computed for all the destination candidates regardless of the type of the candidate. [sent-187, score-0.501]

55 Gaze transitions for training. We pose the learning problem as classification: whether a gaze transition occurs from a given source candidate to a given target candidate. [sent-190, score-0.882]

56 To train such a classifier based on the features described in the previous section we need (i) to choose relevant pairs of frames, and (ii) to label positive and negative gaze transitions between these frames. [sent-191, score-0.555]

57 Since it takes 5 to 10 frames for humans to fixate on a new object of interest we set the destination frame 15 frames after the cut [13]. [sent-194, score-0.578]

58 This ensures that we will not learn from incomplete or partial gaze transitions. [sent-195, score-0.454]

59 Next, we need to obtain examples of positive and negative gaze transitions. [sent-198, score-0.454]

60 We take all pairs of source locations and destination candidates for our training set. [sent-204, score-0.653]

61 Pairs with a destination candidate near a focus of the destination frame are labeled as positive. [sent-205, score-0.837]
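
The labeling rule can be sketched as a distance test. This is only an illustration: the excerpt does not say what counts as "near" or how negatives are chosen, so the `radius` threshold and the choice to label every other pair as negative are assumptions, as is the name `label_pairs`.

```python
import math

def label_pairs(sources, destinations, foci, radius):
    """Build (source, destination, label) triples for training.

    A pair is labeled 1 when its destination lies within `radius`
    pixels of any focus of the destination frame, otherwise 0.
    """
    labeled = []
    for source in sources:
        for destination in destinations:
            near_focus = any(math.dist(destination, focus) <= radius
                             for focus in foci)
            labeled.append((source, destination, 1 if near_focus else 0))
    return labeled

# Toy coordinates (assumption): one source, two destination candidates.
pairs = label_pairs(
    sources=[(0, 0)],
    destinations=[(10, 10), (50, 50)],
    foci=[(12, 10)],
    radius=5,
)
```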

62 At the inference stage the trained model classifies every transition between source and destination candidates and provides a confidence value. [sent-217, score-0.657]

63 We use the normalized confidence as the transition probability $P(d|s_i)$, the transition probability from the source $s_i$ to the current destination candidate $d$. [sent-218, score-0.769]

64 The transition pairs are overlaid on the source (top) and destination (bottom) frames, together with source (magenta) and destination (yellow) gaze maps. [sent-222, score-1.268]

65 candidate saliency and S is the set of all the sources. [sent-227, score-0.543]
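
The equation that these sentences describe did not survive in this excerpt. The sketch below is therefore a guess at its general shape: it assumes the saliency of a destination candidate is obtained by summing $P(d|s_i)$ over all sources in S, weighted by each source's saliency, and that classifier confidences are normalized per source. The function name and the toy numbers are hypothetical.

```python
def candidate_saliency(confidence, source_saliency):
    """Combine per-source transition confidences into destination saliency.

    confidence[i][j] is a non-negative classifier confidence for the
    transition from source i to destination j; source_saliency[i] is
    the weight of source i.
    """
    n_destinations = len(confidence[0])
    saliency = [0.0] * n_destinations
    for row, p_source in zip(confidence, source_saliency):
        row_sum = sum(row)
        if row_sum <= 0:
            continue  # this source offers no usable transition
        for j, value in enumerate(row):
            # value / row_sum plays the role of P(d_j | s_i).
            saliency[j] += p_source * value / row_sum
    return saliency

# Toy example (assumption): two sources, two destination candidates.
result = candidate_saliency(
    confidence=[[3.0, 1.0], [1.0, 1.0]],
    source_saliency=[0.5, 0.5],
)
```

With source weights that sum to one, the resulting destination weights also sum to one, so they can be used directly as mixture weights.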

66 Finally, we produce the saliency map in a similar fashion to how Gaussian mixture models are used to create a continuous distribution: we replace each candidate with a Gaussian of corresponding covariance and sum them up using the candidate saliency as weight. [sent-228, score-1.577]
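
The mixture-style rendering can be sketched in plain Python. This is a minimal illustration, assuming each candidate carries a mean, a 2x2 covariance and a saliency weight; the final normalization to unit mass and the name `render_saliency` are assumptions.

```python
import math

def render_saliency(shape, candidates):
    """Sum weighted 2D Gaussians into a dense map that sums to one.

    candidates: list of (mean, cov, weight), where mean is (row, col)
    and cov is a 2x2 nested list.
    """
    height, width = shape
    dense = [[0.0] * width for _ in range(height)]
    for (mean_r, mean_c), cov, weight in candidates:
        a, b = cov[0]
        c, d = cov[1]
        det = a * d - b * c
        if det <= 0:
            raise ValueError("covariance must be positive definite")
        norm = weight / (2.0 * math.pi * math.sqrt(det))
        for r in range(height):
            for col in range(width):
                dr, dc = r - mean_r, col - mean_c
                # Mahalanobis distance using the closed-form 2x2 inverse.
                maha = (d * dr * dr - (b + c) * dr * dc + a * dc * dc) / det
                dense[r][col] += norm * math.exp(-0.5 * maha)
    total = sum(map(sum, dense))
    return [[value / total for value in row] for row in dense]

# Toy candidates (assumption): a strong one and a weak one.
identity = [[1.0, 0.0], [0.0, 1.0]]
saliency_map = render_saliency(
    (10, 10),
    [((2, 2), identity, 0.8), ((7, 7), identity, 0.2)],
)
```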

67 Experimental validation. In this section we experimentally validate the proposed video saliency detection method. [sent-230, score-0.675]

68 The dataset is provided together with gaze tracks of about 50 participants per video. [sent-233, score-0.454]

69 (b), (c) Example frames, together with human fixation points (green) and our extracted candidates (yellow). [sent-238, score-0.458]

70 Verification of the candidates. First, we wish to demonstrate that human fixations can be modeled well by our limited candidate set. [sent-242, score-0.683]

71 To do so we count the number of candidate locations that are “close enough” to a fixation point. [sent-243, score-0.411]

72 This means that on most of the frames most of the fixations can be modeled well by our candidate set. [sent-249, score-0.433]

73 Since our method computes the probability to shift from a location in a source frame to a location in a destination frame, we calculate the video saliency in a sequential order. [sent-257, score-1.269]

74 For every following frame we compute the transition probability to its candidate set using the predicted saliency map from the previous frame as the source. [sent-259, score-1.116]

75 This method does not drift over time, since the transitions are largely independent of the source frame properties (recall that features of the source frame were excluded and the destination candidates are computed independently for each frame). [sent-260, score-0.974]

76 The first metric is the area under the curve (AUC), which uses the receiver operating characteristic (ROC) curve to compute the similarity between human fixations and the predicted saliency map. [sent-264, score-0.759]
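
For intuition, the AUC can be computed as the probability that a fixated pixel receives a higher saliency value than a non-fixated one, with ties counted as one half. The sketch below uses this rank-based form; which AUC variant the paper uses is not stated in this excerpt, and the split into fixated and non-fixated pixels is an assumption.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))

# Toy 2x2 saliency map with a single fixation (assumption).
saliency = [[0.1, 0.2], [0.7, 0.9]]
fixations = {(1, 1)}
fixated = [saliency[r][c] for r, c in fixations]
others = [saliency[r][c] for r in range(2) for c in range(2)
          if (r, c) not in fixations]
score = auc(fixated, others)
```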

77 Since the AUC considers the saliency results only at the locations of the ground truth fixation points, it cannot distinguish well between a peaky saliency map and a smooth one. [sent-267, score-1.371]

78 The χ2 distance will prefer a peaky saliency map over a broad one, when comparing them to the tight distribution of the ground truth. [sent-270, score-0.614]
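
A small numeric illustration of this preference follows. It assumes one common form of the χ2 distance between two distributions, half the sum of squared differences divided by sums; the exact form used in the paper is not given in this excerpt, and the toy distributions are made up.

```python
def chi_square_distance(p, q):
    """Chi-square distance between two flattened distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    distance = 0.0
    for a, b in zip(p, q):
        if a + b > 0:  # skip bins that are empty in both
            distance += (a - b) ** 2 / (a + b)
    return 0.5 * distance

# Toy distributions (assumption): a tight ground truth, a peaky
# prediction near it, and a uniform (broad) prediction.
truth = [0.0, 0.9, 0.1, 0.0]
peaky = [0.0, 0.8, 0.2, 0.0]
broad = [0.25, 0.25, 0.25, 0.25]
d_peaky = chi_square_distance(truth, peaky)
d_broad = chi_square_distance(truth, broad)
```

Here the peaky prediction is much closer to the tight ground truth than the broad one, which is the behavior the text describes.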

79 We convert the sparse ground truth fixation map, recorded by the gaze tracker, to a dense probability map by convolving it with a constant size Gaussian kernel. [sent-272, score-0.688]
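
Convolving a map of fixation impulses with a Gaussian is equivalent to placing one Gaussian at each fixation and summing. A minimal sketch under that view, assuming an isotropic kernel; the value of `sigma` and the function name are hypothetical, since the kernel size is not given in this excerpt.

```python
import math

def fixation_density(shape, fixations, sigma):
    """Dense probability map from sparse (row, col) fixation points."""
    height, width = shape
    dense = [[0.0] * width for _ in range(height)]
    for fix_r, fix_c in fixations:
        for r in range(height):
            for c in range(width):
                squared = (r - fix_r) ** 2 + (c - fix_c) ** 2
                dense[r][c] += math.exp(-squared / (2.0 * sigma ** 2))
    total = sum(map(sum, dense))
    if total <= 0:
        raise ValueError("at least one fixation is required")
    return [[value / total for value in row] for row in dense]

# Toy example (assumption): a single fixation at the center of a 9x9 frame.
density = fixation_density((9, 9), fixations=[(4, 4)], sigma=1.5)
```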

80 We compare the proposed saliency prediction approach with five different methods. [sent-275, score-0.568]

81 The first, referred to as humans, serves as an upper bound for the saliency prediction and measures how much the fixation map explains itself. [sent-276, score-0.776]

82 We further compare our results to the image saliency approach of GBVS [12], and two video saliency methods PQFT [11] and the method of Hou and Zhang [14] (annotated in figures and tables as Hou for brevity). [sent-282, score-1.188]

83 Both methods are among the highest rated video saliency algorithms according to the recent benchmark of Borji et al. [sent-283, score-0.645]

84 Using χ2 further emphasizes the benefits of our approach: we produce a tight distribution that is more similar to the original gaze map. [sent-303, score-0.454]

85 We further visually compare our saliency maps to those of other methods. [sent-311, score-0.543]

86 As can be seen, the saliency maps produced [sent-313, score-0.543]

87 by the proposed method are more visually consistent with the shape, size, and location of the ground truth gaze map than the maps of the other methods. [sent-330, score-0.519]

88 Conclusions. In this paper we proposed a novel method for video saliency prediction. [sent-338, score-0.645]

89 The method is substantially different from existing methods and uses a sparse candidate set to model the saliency map. [sent-339, score-0.74]

90 It is shown experimentally that using candidates boosts the accuracy of the saliency prediction and speeds up the algorithm. [sent-340, score-0.838]

91 Furthermore, the proposed method accounts for the temporal dimension of the video by learning the probability to shift between saliency locations. [sent-341, score-0.71]

92 When determining the motion candidates we filter out all regions with optical flow magnitude lower than 2 pixels. [sent-346, score-0.391]
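
The magnitude filter can be sketched as a per-pixel mask. This assumes the optical flow is available as two nested lists of horizontal and vertical displacements in pixels; `motion_mask` and the toy flow field are hypothetical.

```python
import math

def motion_mask(flow_u, flow_v, min_magnitude=2.0):
    """True where the optical flow magnitude is at least min_magnitude."""
    return [
        [math.hypot(u, v) >= min_magnitude for u, v in zip(row_u, row_v)]
        for row_u, row_v in zip(flow_u, flow_v)
    ]

# Toy flow field (assumption): magnitudes 0, 5, 1.5 and about 2.12 pixels.
flow_u = [[0.0, 3.0], [1.5, 1.5]]
flow_v = [[0.0, 4.0], [0.0, 1.5]]
mask = motion_mask(flow_u, flow_v)
```

Only pixels where the mask is true would contribute to the motion candidates.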

93 When calculating static and motion features in the neighborhood of a candidate we use three different neighborhoods, sized 5×5, 9×9 and 17×17 pixels. [sent-357, score-0.396]

94 We thank the DIEM database for making the gaze tracking results publicly available. [sent-363, score-0.454]

95 Quantitative analysis of human-model agreement in visual saliency modeling: A comparative study. [sent-378, score-0.543]

96 Examples of saliency detection results using different methods show that the saliency predicted by the proposed method better approximates the human gaze map. [sent-418, score-1.589]

97 Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform. [sent-423, score-0.569]

98 Spatiotemporal saliency detection and its applications in static and dynamic scenes. [sent-465, score-0.693]

99 Clustering of gaze during dynamic scene viewing is predicted by motion. [sent-491, score-0.528]

100 Predicting human gaze using quaternion DCT image signature saliency and face detection. [sent-496, score-1.096]

simIndex simValue paperId paperTitle

1 0.90174627 65 cvpr-2013-Blind Deconvolution of Widefield Fluorescence Microscopic Data by Regularization of the Optical Transfer Function (OTF)

Author: Margret Keuper, Thorsten Schmidt, Maja Temerinac-Ott, Jan Padeken, Patrick Heun, Olaf Ronneberger, Thomas Brox

Abstract: With volumetric data from widefield fluorescence microscopy, many emerging questions in biological and biomedical research are being investigated. Data can be recorded with high temporal resolution while the specimen is only exposed to a low amount of phototoxicity. These advantages come at the cost of strong recording blur caused by the infinitely extended point spread function (PSF). For widefield microscopy, its magnitude only decays with the square of the distance to the focal point and consists of an airy bessel pattern which is intricate to describe in the spatial domain. However, the Fourier transform of the incoherent PSF (denoted as Optical Transfer Function (OTF)) is well localized and smooth. In this paper, we present a blind deconvolution method that improves results of state-of-the-art deconvolution methods on widefield data by exploiting the properties of the widefield OTF.

2 0.87018806 424 cvpr-2013-Templateless Quasi-rigid Shape Modeling with Implicit Loop-Closure

Author: Ming Zeng, Jiaxiang Zheng, Xuan Cheng, Xinguo Liu

Abstract: This paper presents a method for quasi-rigid objects modeling from a sequence of depth scans captured at different time instances. As quasi-rigid objects, such as human bodies, usually have shape motions during the capture procedure, it is difficult to reconstruct their geometries. We represent the shape motion by a deformation graph, and propose a model-to-part method to gradually integrate sampled points of depth scans into the deformation graph. Under an as-rigid-as-possible assumption, the model-to-part method can adjust the deformation graph non-rigidly, so as to avoid error accumulation in alignment, which also implicitly achieves loop-closure. To handle the drift and topological error for the deformation graph, two algorithms are introduced. First, we use a two-stage registration to largely keep the rigid motion part. Second, in the step of graph integration, we topology-adaptively integrate new parts and dynamically control the regularization effect of the deformation graph. We demonstrate the effectiveness and robustness of our method by several depth sequences of quasi-rigid objects, and an application in human shape modeling.

same-paper 3 0.86550575 258 cvpr-2013-Learning Video Saliency from Human Gaze Using Candidate Selection

Author: Dmitry Rudoy, Dan B. Goldman, Eli Shechtman, Lihi Zelnik-Manor

Abstract: During recent years remarkable progress has been made in visual saliency modeling. Our interest is in video saliency. Since videos are fundamentally different from still images, they are viewed differently by human observers. For example, the time each video frame is observed is a fraction of a second, while a still image can be viewed leisurely. Therefore, video saliency estimation methods should differ substantially from image saliency methods. In this paper we propose a novel method for video saliency estimation, which is inspired by the way people watch videos. We explicitly model the continuity of the video by predicting the saliency map of a given frame, conditioned on the map from the previous frame. Furthermore, accuracy and computation speed are improved by restricting the salient locations to a carefully selected candidate set. We validate our method using two gaze-tracked video datasets and show we outperform the state-of-the-art.

4 0.86101317 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation

Author: Magnus Burenius, Josephine Sullivan, Stefan Carlsson

Abstract: We consider the problem of automatically estimating the 3D pose of humans from images, taken from multiple calibrated views. We show that it is possible and tractable to extend the pictorial structures framework, popular for 2D pose estimation, to 3D. We discuss how to use this framework to impose view, skeleton, joint angle and intersection constraints in 3D. The 3D pictorial structures are evaluated on multiple view data from a professional football game. The evaluation is focused on computational tractability, but we also demonstrate how a simple 2D part detector can be plugged into the framework.

5 0.85963738 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

Author: Luming Zhang, Mingli Song, Zicheng Liu, Xiao Liu, Jiajun Bu, Chun Chen

Abstract: Weakly supervised image segmentation is a challenging problem in the computer vision field. In this paper, we present a new weakly supervised image segmentation algorithm by learning the distribution of spatially structured superpixel sets from image-level labels. Specifically, we first extract graphlets from each image where a graphlet is a small-sized graph consisting of superpixels as its nodes and it encapsulates the spatial structure of those superpixels. Then, a manifold embedding algorithm is proposed to transform graphlets of different sizes into equal-length feature vectors. Thereafter, we use GMM to learn the distribution of the post-embedding graphlets. Finally, we propose a novel image segmentation algorithm, called graphlet cut, that leverages the learned graphlet distribution in measuring the homogeneity of a set of spatially structured superpixels. Experimental results show that the proposed approach outperforms state-of-the-art weakly supervised image segmentation methods, and its performance is comparable to those of the fully supervised segmentation models.

6 0.85552347 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

7 0.85476577 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

8 0.85412961 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues

9 0.85366136 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers

10 0.85150516 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification

11 0.85093677 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence

12 0.84906858 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking

13 0.84884536 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search

14 0.84793329 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

15 0.84584874 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation

16 0.84578705 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

17 0.84306526 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

18 0.84282106 438 cvpr-2013-Towards Pose Robust Face Recognition

19 0.84242618 89 cvpr-2013-Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation

20 0.84194577 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections