Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár

Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.

1 Abstract We describe novel but simple motion features for the problem of detecting objects in video sequences. [sent-5, score-0.478]

2 Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. [sent-6, score-0.813]

3 We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. [sent-7, score-0.492]

4 Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. [sent-8, score-1.433]

5 We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. [sent-9, score-0.374]

6 In this work, we explore the motion counterpart for object detection in video. [sent-17, score-0.368]

7 We show that one can exploit simple motion features to significantly increase detection accuracy with little additional computation. [sent-18, score-0.424]

8 We classify image motion into three types using a stationary world coordinate frame and a moving object coordinate frame. [sent-20, score-0.432]

9 Camera-centric motion is the movement of the camera with respect to the world. [sent-21, score-0.397]

10 Object-centric motion is the movement of the object centroid with respect to the world. [sent-22, score-0.328]

11 Finally, part-centric motion is the movement of object parts with respect to the object. [sent-23, score-0.356]

12 Figure 1: Illustration of various types of video stabilization: (a) no stabilization, (b) camera motion stabilization, (c) object-centric motion stabilization, (d) camera and object-centric motion stabilization, and (e) full stabilization of camera, object-centric, and part-centric motion. [sent-28, score-1.58]

13 We posit that for detecting articulated objects such as people the majority of useful motion information is contained in part-centric motion. [sent-29, score-0.576]

14 A simple approach is to directly compute image motion features on raw video. [sent-32, score-0.398]

15 Methods that define motion features using optical flow or spacetime gradients often take this route [29]. [sent-34, score-0.799]

16 One can partly remove camera motion by looking at differences of flow [8]. [sent-35, score-0.424]

17 A more direct approach is to simply compute motion features on a stationary camera, such as [27]. [sent-36, score-0.439]

18 Such motion features encode both object- and part-centric motion. [sent-37, score-0.367]

19 When the camera is moving, one may try to register frames using a homography or egomotion estimation [18, 19], which removes some camera-centric motion but can be challenging for dynamic scenes or those with complex 3D geometry. [sent-39, score-0.494]

20 Finally, other techniques compute optical flow in an object-centric coordinate frame [13]; Figure 1(c) shows that such an approach actually encodes both camera- and part-centric motion. [sent-40, score-0.448]

21 In this paper, we posit (and verify by experiment) that the majority of useful motion information for detecting articulated objects such as people is contained in part-centric motion. [sent-41, score-0.484]

22 To allow the temporal features to easily extract part-centric motion information, we attempt to stabilize both camera and object-centric motion, Figure 1(d). [sent-43, score-0.709]

23 We accomplish this by using coarse-scale optical flow to align a sequence of image frames. [sent-44, score-0.339]

24 Weak stabilization using coarse-scale flow has the benefit of aligning large objects such as the background or a person’s body without removing detailed motion such as an object’s parts, Figure 1(d,e). [sent-45, score-0.984]

25 While artifacts may exist around large flow discontinuities, we demonstrate that coarse-scale flow is robust in practice. [sent-46, score-0.426]

26 We use temporal difference features to capture the part-centric motion that remains after weak stabilization. [sent-47, score-0.794]

27 While features based on fine-scale optical flow [13, 8] may be extracted from the stabilized frames, fine-scale flow is notoriously difficult to extract for small parts such as arms [4]. [sent-48, score-0.759]

28 We demonstrate that when sampled at the proper temporal intervals, simple temporal difference features are an effective alternative capable of achieving state-of-the-art results. [sent-49, score-0.627]

29 We perform a thorough evaluation of motion features for object detection in video. [sent-50, score-0.473]

30 We focus on detecting pedestrians in moving cameras [12] as well as pose estimation from static cameras [1]. [sent-51, score-0.34]

31 We demonstrate significant improvements from integrating our motion features into three distinct approaches: rigid SVM detectors defined on HOG features [7], articulated part models defined on HOG features [16, 31], and boosted detectors defined on channel features [11]. [sent-52, score-0.77]

32 Related Work Optical-flow-based features: A popular strategy for video-based recognition is to extend static image features into the temporal domain through use of optical flow. [sent-57, score-0.597]

33 Examples include spatially blurred flow fields [13] or histograms of optical flow vectors [8, 28]. [sent-58, score-0.554]

34 For stationary cameras, temporal difference features can be computed on background models, yielding background-subtraction masks [24]. [sent-69, score-0.406]

35 Our approach can be seen as a combination of optical-flow and temporal differencing as we compute differences on spacetime windows that are weakly-stabilized with coarse optical flow. [sent-70, score-0.661]

36 Action classification: Many of the above motion features have been explored in the context of action classification [10, 21]. [sent-71, score-0.412]

37 In particular, [29] performs a thorough evaluation of motion descriptors, discovering that histograms of flow perform well. [sent-72, score-0.577]

38 Tracking: An alternate use of temporal information to improve detection reliability is to explicitly track objects. [sent-75, score-0.287]

39 Most trackers tend to define motion models on static image features, although exceptions do exist [15]. [sent-77, score-0.468]

40 We then describe our approach to weakly-stabilizing video frames and our resulting motion features. [sent-83, score-0.486]

41 Static features: In addition to the motion features introduced below, we use one of two sets of static features densely computed on the current frame. [sent-89, score-0.64]

42 Our first set of static features are the channel features described in [11]. [sent-90, score-0.405]

43 (b) Using fine-scale LK flows, the overall body is stabilized onto the last frame at the cost of distortion in body parts (most visible at the heads and legs of the top row). [sent-95, score-0.385]

44 Our second type of static features is the commonly used Histogram of Oriented Gradients (HOG) descriptor [7]. [sent-98, score-0.303]

45 Stabilizing videos Our goal is to compute motion features based on part-centric motion, such as the movement of a person’s limbs. [sent-102, score-0.537]

46 This requires weakly stabilizing image frames to remove both camera and object-centric motion while preserving the part-centric motion. [sent-103, score-0.642]

47 We accomplish this by using coarse-scale optical flow to align a sequence of frames. [sent-104, score-0.339]

48 We estimate optical flow using the approach of Lucas-Kanade [22] but applied in a somewhat non-standard manner. [sent-105, score-0.307]

49 Lucas-Kanade proposed a differential approach to flow [...] [Figure 3: [...] unstabilized and weakly stabilized frames spaced one frame apart (m = 1) and 8 frames apart (m = 8).] [sent-106, score-0.914]

50 With larger frame spans (m = 8) temporal differences appear. [sent-108, score-0.414]

51 However, weak stabilization is needed to remove non-informative differences resulting from camera and object motion. [sent-109, score-0.628]

52 In practice, we find $W_{t,t-1}$ stabilizes the majority of motion due to camera and object-centric motion, as shown in Figure 2. [sent-121, score-0.459]

53 Computing the coarse flows is fast (no need to compute flow at finest scale) and fairly robust (due to the large σ). [sent-122, score-0.361]

54 When stabilizing across multiple frames, we compute the global motion $W_{t,t-n}$ by progressively warping and summing pairwise flow fields. [sent-123, score-0.626]

55 We found this to work better in practice than computing the potentially large flow directly between frames $I_t$ and $I_{t-n}$. [sent-124, score-0.357]
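
One way to read "progressively warping and summing pairwise flow fields" is sketched below. It is an illustration under stated assumptions, not the authors' implementation: flows are stored as nested lists of (dy, dx) tuples, lookups use nearest-neighbor sampling with border clamping, and the names `sample_nearest` and `compose_flows` are hypothetical.

```python
def sample_nearest(field, y, x):
    """Read field at (y, x), rounding to the nearest pixel and clamping to the border."""
    h, w = len(field), len(field[0])
    yi = min(max(int(round(y)), 0), h - 1)
    xi = min(max(int(round(x)), 0), w - 1)
    return field[yi][xi]


def compose_flows(pairwise):
    """Accumulate pairwise flows into a single flow from frame t to frame t-n.

    pairwise[i][y][x] is assumed to be a (dy, dx) displacement that takes
    pixel (y, x) of frame t-i to its match in frame t-i-1.
    """
    h, w = len(pairwise[0]), len(pairwise[0][0])
    total = [[(0.0, 0.0) for _ in range(w)] for _ in range(h)]
    for flow in pairwise:
        updated = []
        for y in range(h):
            row = []
            for x in range(w):
                ty, tx = total[y][x]
                # Warp: read the next pairwise flow where this pixel currently lands.
                dy, dx = sample_nearest(flow, y + ty, x + tx)
                # Sum: add that displacement to the accumulated motion.
                row.append((ty + dy, tx + dx))
            updated.append(row)
        total = updated
    return total
```

Each pairwise flow stays small, so the per-step lookups remain local even when the accumulated displacement over n frames is large. A real pipeline would likely use bilinear interpolation instead of the nearest-neighbor lookup assumed here.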

56 Motion features Given (weakly) stabilized image frames, we propose the use of simple temporal differencing or temporal gradient features. [sent-127, score-0.82]

57 The temporal gradient is defined as the difference between two frames, $D^{\sigma} = I_t - I_{t-1,t}$ (1), where $I_{t-1,t}$ is frame $t-1$ stabilized to frame $t$ and σ is the scale of the computed flow. [sent-129, score-0.35]

58 Because σ is tuned to be roughly the size of an object, we expect the temporal gradient to contain useful cues about nonrigid object motion that are helpful for detection, as in Figure 3. [sent-130, score-0.551]

59 We denote the temporal gradient on unstabilized frames as $D^{US}$: $D^{US} = I_t - I_{t-1}$ (2). Using multiple frames: We previously defined the difference features over pairs of frames. [sent-131, score-0.549]
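
The stabilized and unstabilized temporal gradients of Equations (1) and (2) can be contrasted with a small sketch. It assumes grayscale frames stored as nested lists, a flow field of (dy, dx) tuples pointing from each pixel of the current frame to its match in the previous frame, and nearest-neighbor warping; `warp_to_current` and `temporal_difference` are hypothetical names, not functions from the paper.

```python
def warp_to_current(prev, flow):
    """Resample the previous frame so that it is aligned with the current frame."""
    h, w = len(prev), len(prev[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dy, dx = flow[y][x]
            sy = min(max(int(round(y + dy)), 0), h - 1)  # clamp to the image border
            sx = min(max(int(round(x + dx)), 0), w - 1)
            row.append(prev[sy][sx])
        out.append(row)
    return out


def temporal_difference(cur, prev, flow=None):
    """Return I_t - I_{t-1}; if a flow is given, warp I_{t-1} onto I_t first."""
    ref = prev if flow is None else warp_to_current(prev, flow)
    return [[c - p for c, p in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(cur, ref)]


# A bright pixel shifts one column to the right, e.g. because the camera pans.
prev_frame = [[0, 9, 0, 0]]
cur_frame = [[0, 0, 9, 0]]
pan = [[(0, -1)] * 4]  # every pixel of the current frame came from one column to the left

unstabilized = temporal_difference(cur_frame, prev_frame)
stabilized = temporal_difference(cur_frame, prev_frame, pan)
```

In this toy case the unstabilized difference responds to the pan, while the stabilized difference is zero everywhere because the only motion present is the one the flow explains.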

60 In many instances, the amount of motion observed between subsequent frames may be quite small, especially with slow moving objects. [sent-132, score-0.425]

61 First, we consider the simple approach of computing multiple frame differences between the current frame and k = n/m other frames spaced apart temporally by m frames from t − m to t − n. [sent-136, score-0.725]
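
The frame selection described here can be sketched as follows. The sketch assumes the frames have already been stabilized to the current frame and are stored as nested lists indexed by time; `frame_offsets` and `multiframe_differences` are hypothetical names.

```python
def frame_offsets(span_n, skip_m):
    """Offsets m, 2m, ..., n of the k = n/m past frames used for differencing."""
    if span_n % skip_m != 0:
        raise ValueError("span n must be a multiple of skip m")
    return list(range(skip_m, span_n + 1, skip_m))


def multiframe_differences(frames, t, span_n, skip_m):
    """Differences between frame t and the frames at t-m, t-2m, ..., t-n."""
    if t < span_n:
        raise ValueError("not enough past frames for the requested span")
    cur = frames[t]
    diffs = []
    for j in frame_offsets(span_n, skip_m):
        past = frames[t - j]
        diffs.append([[c - p for c, p in zip(cur_row, past_row)]
                      for cur_row, past_row in zip(cur, past)])
    return diffs
```

With n = 8 and m = 4 this yields k = 2 difference images (against t − 4 and t − 8); with m = 1 it yields eight.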

62 Another approach is to compute the set of differences between neighboring frames within a multiframe set, [...] mean frame $M_t$ and the neighboring frames, [...] Rectified features: Previously, we defined our temporal difference features using the signed temporal gradient. [sent-143, score-1.151]

63 Several other possibilities also exist for encoding the temporal differences, such as using the absolute value of the temporal gradient or using rectified gradients. [sent-144, score-0.57]

64 Rectified gradients compute two features for each pixel’s temporal gradient $d_t$, corresponding to max(0, $d_t$) and max(0, −$d_t$). [sent-145, score-0.465]
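
The three encodings (signed, absolute value, rectified) can be compared in a few lines. This is a minimal sketch assuming the temporal differences are given as a nested list of numbers; `rectify`, `encode`, and the mode strings are hypothetical.

```python
def rectify(d):
    """Split a signed temporal gradient into (positive part, negative part)."""
    return max(0, d), max(0, -d)


def encode(diff, mode="signed"):
    """Encode a 2-D list of signed temporal differences as a list of feature maps."""
    if mode == "signed":
        return [diff]
    if mode == "abs":
        return [[[abs(d) for d in row] for row in diff]]
    if mode == "rectified":
        pos = [[rectify(d)[0] for d in row] for row in diff]
        neg = [[rectify(d)[1] for d in row] for row in diff]
        return [pos, neg]
    raise ValueError("unknown mode: " + mode)
```

Rectification keeps the sign information in two non-negative maps: their difference recovers the signed gradient and their sum recovers its absolute value.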

65 We begin by exploring the feature parameter space on the task of pedestrian detection using a boosting classifier [11]. [sent-166, score-0.375]

66 the frame skip m, (b) other forms of stabilization, (c) frame skip m vs. [sent-178, score-0.534]

67 frame span n, (d) various types of reference frames for computing D(m, n), (e) different types of rectification for utilizing the color channels, and (f) boosting vs. [sent-179, score-0.508]

68 The best results, $D^{16}_0(8, 4)$, are achieved using σ = 16, m = 4, n = 8, the current frame as reference, and the signed temporal differences of the luminance channel. [sent-181, score-0.545]

69 We measure accuracy using the standard log-average miss rate for the detections [12], which is computed by averaging the miss rate at nine false positives per image (FPPI) rates evenly spaced in log-space between $10^{-2}$ and $10^{0}$. A detection is labeled as correct if the area of overlap is greater than 50%. [sent-184, score-0.491]
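
The log-average miss rate can be sketched as below. Several details are assumptions rather than statements from this text: the nine reference points are taken evenly in log-space over [10^-2, 10^0], the average is a geometric mean (a mean in the log domain), a reference point with no curve sample at or below it counts as a miss rate of 1, and the function names are hypothetical.

```python
import math


def reference_fppi(lo_exp=-2.0, hi_exp=0.0, count=9):
    """Reference FPPI values evenly spaced in log-space (assumed range)."""
    step = (hi_exp - lo_exp) / (count - 1)
    return [10 ** (lo_exp + i * step) for i in range(count)]


def miss_rate_at(curve, fppi_ref):
    """Miss rate at the largest curve FPPI not exceeding fppi_ref.

    curve is a list of (fppi, miss_rate) pairs sorted by increasing fppi.
    """
    best = 1.0  # no operating point this strict: treat everything as missed
    for fppi, miss in curve:
        if fppi <= fppi_ref:
            best = miss
    return best


def log_average_miss_rate(curve):
    """Geometric mean of the miss rates sampled at the reference FPPI values."""
    samples = [max(miss_rate_at(curve, r), 1e-10) for r in reference_fppi()]
    return math.exp(sum(math.log(s) for s in samples) / len(samples))
```

A detector with a constant 50% miss rate over the whole range scores 0.5 under this sketch; lower values are better.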

70 [11], as reported in [12], is a 56% log-average miss rate using only static features and trained on the INRIA dataset [7]. [sent-187, score-0.404]

71 We perform our sweeps using boosting and the 10 static channel features described in Section 3. [sent-195, score-0.508]

72 Lastly, we combine the optimal temporal features found for boosting with the static HOG features for use by linear SVMs. [sent-198, score-0.703]

73 frame skip: We first explore the space of two parameters; the scale of LK flows, σ, and the skip between two frames used to compute the temporal difference, m, see Fig. [sent-203, score-0.733]

74 When the pair of frames are temporally nearby, stabilization plays a smaller role, since objects are relatively well aligned even without stabilization. [sent-208, score-0.621]

75 As we increase the skip m between the pair of frames, stabilization becomes critical. [sent-209, score-0.586]

76 Ideally, the optical flow scale should roughly cover an object, and so would be defined relative to the size of the candidate window being evaluated. [sent-211, score-0.338]

77 Multiframe: Given a fixed scale σ = 16, we now examine the question of the optimal multiframe span n, skip m, and reference frame. [sent-218, score-0.4]

78 We find that a large span n = 8 and small skip value m = 1 performs best, although a larger skip m = 4 also does well, see Fig. [sent-220, score-0.362]

79 This yields the final multiframe motion feature of $D_0$(n = 8, m = 4). [sent-225, score-0.403]

80 4(e), using three temporal differences across the LUV color channels. [sent-227, score-0.304]

81 The “Max” scheme uses the maximum temporal difference across the 3 channels, while the “Lum” scheme just uses the luminance (L) channel. [sent-228, score-0.335]

82 The normalization has minimal effect on the performance of the boosting classifier, presumably because boosting classifiers can train more flexible decision boundaries that perform implicit normalization. [sent-235, score-0.297]

83 5 we compare with previous work including ‘MultiFtr+Motion’ [28] (which uses motion features) and ‘MultiresC’ [23] (which uses static features trained on the same data as [12]). [sent-239, score-0.585]

84 6 shows several examples of detections using our approach compared to using static features alone. [sent-243, score-0.326]

85 Several false detections are removed around the car’s boundary as temporal features remove the ambiguities. [sent-244, score-0.419]

86 Temporal features can also help discover missed detections, such as the pedestrian riding a bicycle in the second row. [sent-245, score-0.29]

87 Our new temporal features lead to a significant improvement across all FPPI rates. [sent-251, score-0.316]

88 The annotated frames are evenly split into training and testing, and used to evaluate the ability of our motion features to perform human pose estimation in video sequences. [sent-256, score-0.653]

89 Baseline articulated part model: We describe our baseline articulated part model [31], and show how to extend it to incorporate our motion features. [sent-257, score-0.413]

90 Motion features: For our experiments, we simply augment the appearance descriptor to include both HOG and [...] [...] models; one trained only with static features (left), and the other trained with both static and our motion features (right). [sent-266, score-0.919]

91 Note that our motion features help detect instances that are considered hard due to abnormal pose (biking) or occlusion, and significantly reduce false positives. [sent-267, score-0.469]

92 (7) The above formulation allows us to easily incorporate our motion features into the existing pipeline at both test-time and train-time. [sent-270, score-0.367]

93 We show estimates from the pose model of [31] trained using our motion features. [sent-276, score-0.364]

94 Multiple people often interact and occlude each other, making pose estimation and motion extraction difficult. [sent-283, score-0.377]

95 Conclusion We described a family of temporal features utilizing weakly stabilized video frames. [sent-285, score-0.581]

96 Weak stabilization enables our detectors to easily extract part-centric information by removing most camera- and object-centric motion. [sent-286, score-0.429]

97 We experimentally show that simple temporal differences extracted across large time-spans are capable of producing [...] [Table 1] [sent-287, score-0.304]

98 [...] our motion features produces consistently better part localizations. [sent-289, score-0.367]

99 Large displacement optical flow: descriptor matching in variational motion estimation. [sent-320, score-0.405]

100 Human detection using oriented histograms of flow and appearance. [sent-340, score-0.304]

1 0.9598754 350 cvpr-2013-Reconstructing Loopy Curvilinear Structures Using Integer Programming

Author: Engin Türetken, Fethallah Benmansour, Bjoern Andres, Hanspeter Pfister, Pascal Fua

Abstract: We propose a novel approach to automated delineation of linear structures that form complex and potentially loopy networks. This is in contrast to earlier approaches that usually assume a tree topology for the networks. At the heart of our method is an Integer Programming formulation that allows us to find the global optimum of an objective function designed to allow cycles but penalize spurious junctions and early terminations. We demonstrate that it outperforms state-of-the-art techniques on a wide range of datasets.

2 0.93852228 263 cvpr-2013-Learning the Change for Automatic Image Cropping

Author: Jianzhou Yan, Stephen Lin, Sing Bing Kang, Xiaoou Tang

Abstract: Image cropping is a common operation used to improve the visual quality of photographs. In this paper, we present an automatic cropping technique that accounts for the two primary considerations of people when they crop: removal of distracting content, and enhancement of overall composition. Our approach utilizes a large training set consisting of photos before and after cropping by expert photographers to learn how to evaluate these two factors in a crop. In contrast to the many methods that exist for general assessment of image quality, ours specifically examines differences between the original and cropped photo in solving for the crop parameters. To this end, several novel image features are proposed to model the changes in image content and composition when a crop is applied. Our experiments demonstrate improvements of our method over recent cropping algorithms on a broad range of images.

3 0.93071783 78 cvpr-2013-Capturing Layers in Image Collections with Componential Models: From the Layered Epitome to the Componential Counting Grid

Author: Alessandro Perina, Nebojsa Jojic

Abstract: Recently, the Counting Grid (CG) model [5] was developed to represent each input image as a point in a large grid of feature counts. This latent point is a corner of a window of grid points which are all uniformly combined to match the (normalized) feature counts in the image. Being a bag of word model with spatial layout in the latent space, the CG model has superior handling of field of view changes in comparison to other bag of word models, but with the price of being essentially a mixture, mapping each scene to a single window in the grid. In this paper we introduce a family of componential models, dubbed the Componential Counting Grid, whose members represent each input image by multiple latent locations, rather than just one. In this way, we make a substantially more flexible admixture model which captures layers or parts of images and maps them to separate windows in a Counting Grid. We tested the models on scene and place classification where their componential nature helped to extract objects, to capture parallax effects, thus better fitting the data and outperforming Counting Grids and Latent Dirichlet Allocation, especially on sequences taken with wearable cameras.

same-paper 4 0.92405206 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

Author: Dennis Park, C. Lawrence Zitnick, Deva Ramanan, Piotr Dollár

Abstract: We describe novel but simple motion features for the problem of detecting objects in video sequences. Previous approaches either compute optical flow or temporal differences on video frame pairs with various assumptions about stabilization. We describe a combined approach that uses coarse-scale flow and fine-scale temporal difference features. Our approach performs weak motion stabilization by factoring out camera motion and coarse object motion while preserving nonrigid motions that serve as useful cues for recognition. We show results for pedestrian detection and human pose estimation in video sequences, achieving state-of-the-art results in both. In particular, given a fixed detection rate our method achieves a five-fold reduction in false positives over prior art on the Caltech Pedestrian benchmark. Finally, we perform extensive diagnostic experiments to reveal what aspects of our system are crucial for good performance. Proper stabilization, long time-scale features, and proper normalization are all critical.

5 0.91804129 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

6 0.91515553 171 cvpr-2013-Fast Trust Region for Segmentation

7 0.91285157 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

8 0.91226435 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

9 0.91117603 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors

10 0.90791529 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

11 0.90723616 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence

12 0.90669656 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

13 0.90651405 438 cvpr-2013-Towards Pose Robust Face Recognition

14 0.90613252 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

15 0.905963 202 cvpr-2013-Hierarchical Saliency Detection

16 0.90581769 204 cvpr-2013-Histograms of Sparse Codes for Object Detection

17 0.90561581 167 cvpr-2013-Fast Multiple-Part Based Object Detection Using KD-Ferns

18 0.9054029 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

19 0.90504128 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

20 0.90458012 338 cvpr-2013-Probabilistic Elastic Matching for Pose Variant Face Verification