Author: Haroon Idrees, Imran Saleemi, Cody Seibert, Mubarak Shah
Abstract: We propose to leverage multiple sources of information to compute an estimate of the number of individuals present in an extremely dense crowd visible in a single image. Due to problems including perspective, occlusion, clutter, and few pixels per person, counting by human detection in such images is almost impossible. Instead, our approach relies on multiple sources such as low confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with confidence associated with observing individuals, in an image region. Secondly, we employ a global consistency constraint on counts using Markov Random Field. This caters for disparity in counts in local neighborhoods and across scales. We tested our approach on a new dataset of fifty crowd images containing 64K annotated humans, with the head counts ranging from 94 to 4543. This is in stark contrast to datasets used for existing methods which contain not more than tens of individuals. We experimentally demonstrate the efficacy and reliability of the proposed approach by quantifying the counting performance.
1 Abstract We propose to leverage multiple sources of information to compute an estimate of the number of individuals present in an extremely dense crowd visible in a single image. [sent-3, score-0.873]
2 Due to problems including perspective, occlusion, clutter, and few pixels per person, counting by human detection in such images is almost impossible. [sent-4, score-0.297]
3 Instead, our approach relies on multiple sources such as low confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis to estimate counts, along with confidence associated with observing individuals, in an image region. [sent-5, score-0.404]
4 Secondly, we employ a global consistency constraint on counts using Markov Random Field. [sent-6, score-0.52]
5 This caters for disparity in counts in local neighborhoods and across scales. [sent-7, score-0.529]
6 We tested our approach on a new dataset of fifty crowd images containing 64K annotated humans, with the head counts ranging from 94 to 4543. [sent-8, score-1.179]
7 Introduction The problem of counting the number of objects, specifically people, in images and videos arises in several real-world applications including crowd management, design and analysis of buildings and spaces, and safety and security. [sent-12, score-0.817]
8 In certain scenarios, obtaining the people count is of direct importance, e. [sent-13, score-0.286]
9 The manual counting of individuals in very dense crowds is an extremely laborious task, but is performed nonetheless by experienced personnel when needed [18]. [sent-16, score-0.624]
10 Computer vision research in the area of crowd analysis has resulted in several automated and semi-automated solutions for density estimation and counting. [sent-17, score-0.689]
11 On average, each image in the crowd counting dataset contains around 1280 humans. [sent-21, score-0.755]
12 1) rather than a few tens of individuals [4, 5]; and (2) reliance on temporal constraints in crowd videos [20], which are not applicable to the more prevalent still images. [sent-24, score-0.767]
13 Some methods proposed in literature for crowd detection perform image segmentation without actual counting or localization [1], while others simply estimate the coarse density range within local regions [24]. [sent-26, score-1.036]
14 In terms of experimental data, most of the existing algorithms for exact counting have been tested on low to medium density crowds, e. [sent-27, score-0.395]
15 , UCSD dataset with density of 11−46 people per frame [4], Mall dataset with density of 13−53 individuals per frame [5], and PETS dataset containing 3−40 people per frame [9]. [sent-29, score-0.563]
16 The proposed approach is motivated by the fact that in extremely dense crowds of people, no single feature or detection method is reliable enough to provide an accurate count due to low resolution, severe occlusion, foreshortening, and perspective. [sent-33, score-0.446]
17 We observe however that densely packed crowds of individuals can be treated as a texture, albeit irregular and inhomogeneous at a coarse scale. [sent-35, score-0.342]
18 Furthermore, there does exist a spatial relationship that is expected to constrain the counting estimates in neighboring local image regions in terms of similarity of counts. [sent-37, score-0.297]
19 This observation has been used successfully for crowd detection in [1], although not for counting or localization. [sent-40, score-0.782]
20 Another main contribution of the proposed framework is the use of frequency-domain analysis in crowd counting. [sent-42, score-0.535]
21 Fourier transform has been used extensively in texture analysis [2], and specifically in crowd analysis [17]. [sent-43, score-0.617]
22 Given geometrically arranged texture elements, the Fourier transform can provide reliable estimates of the texton counts [14]. [sent-44, score-0.638]
23 In the domain of crowd counting however, the application of frequency analysis is severely limited due to two main reasons: (1) the spatial arrangement of texture elements is very irregular; and (2) the Fourier transform is not useful in localizing the repeating elements. [sent-45, score-0.898]
24 First, we employ Fourier analysis along with head detections and interest-point based counts in local neighborhoods on multiple scales to avoid the problem of irregularity in the perceived textures emanating from images of dense crowds. [sent-47, score-0.759]
25 The count estimates from this localized multi-scale analysis are then aggregated subject to global consistency constraints. [sent-48, score-0.312]
26 , Fourier, interest points and head detection, with their respective confidences, we compute counts at localized patches independently, which are then globally constrained to get an estimate of count for the entire image. [sent-55, score-0.987]
27 We propose a solution to obtain counts from multi-scale grid MRF which infers the solution simultaneously at all scales while enforcing the count consistency constraint. [sent-57, score-0.743]
28 This category of methods however is not useful for the kind of images we deal with, because human, or even head and face, detection in these images is difficult due to severe occlusion and clutter, low resolution, and few pixels per individual due to foreshortening. [sent-64, score-0.352]
29 We demonstrate this fact by reporting quantitative results of detection on our crowd image dataset. [sent-65, score-0.541]
30 Computation of such patterns of motion were also proposed in [22, 23, 12], but not with explicit application to the problem of crowd counting. [sent-67, score-0.514]
31 These algorithms require video frames as input, with reasonably high frame rate for reliable motion estimation, but are not suitable to still images of crowds, or even videos if the individuals in the crowd show nominal or no motion, e. [sent-68, score-0.742]
32 Another category of techniques proposed for crowd counting relies on estimation of direct relationships between low-level or local features and counts, by learning regression functions. [sent-71, score-0.791]
33 This assumption is largely invalid in most real world scenarios due to perspective, changes in viewpoint, and changes in crowd density. [sent-74, score-0.514]
34 Chen et al [5] have recently proposed that information sharing among regions should allow more accurate and robust crowd counting. [sent-79, score-0.534]
35 They propose a single multioutput model for joint localized crowd counting based on ridge regression. [sent-80, score-0.783]
36 Their proposed framework employs interdependent local features from local spatial regions as input and people count from individual regions as multidimensional structured output. [sent-81, score-0.347]
37 We also collected, annotated, and tested on a large dataset of real world crowd images. [sent-84, score-0.514]
38 This variation in density may be inherent to the scene that the image captures (different distribution of individuals in different parts of the scene) or it may arise due to the viewpoint and perspective effects of the camera. [sent-90, score-0.369]
39 Thus, the proposed framework begins by counting individuals in small patches uniformly sampled over the image. [sent-92, score-0.573]
40 But, even though the density varies across the image, it does so smoothly, suggesting the density in adjacent patches should be similar. [sent-93, score-0.428]
41 When counting people in patches, we assume the density is uniform but implicitly assume that the number of people in each patch is independent of adjacent patches. [sent-95, score-0.89]
42 Once we estimate density or counts in each patch, we remove the independence assumption and place them in a multi-scale Markov Random Field to model the dependence in counts among nearby patches. [sent-98, score-1.206]
43 Counting in Patches Given a patch P, we estimate the counts from three different and complementary sources, alongside confidences for those counts. [sent-101, score-0.762]
44 The three sources are later combined to obtain a single estimate of count for that patch using the individual counts and confidences. [sent-102, score-0.997]
45 1 HOG based Head Detections The simplest approach to estimate counts is through human detections. [sent-105, score-0.528]
46 However, a quick glance at images of dense crowds reveals that the bodies are almost entirely occluded, leaving only heads for counting and analysis. [sent-106, score-0.441]
47 The consistency in scale and confidence is a measure of how reliable head detections are in that patch. [sent-113, score-0.272]
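As a minimal sketch of how such consistency could be summarized, the snippet below computes, for the head detections falling inside one patch, the detection count together with the mean and standard deviation of their scales and confidences. The function name `detection_stats`, the `(scale, confidence)` tuple layout, and the choice of standard deviation as the consistency measure are assumptions made for illustration; the extract does not give the exact statistics used.

```python
from statistics import mean, pstdev

def detection_stats(detections):
    """Summarize head detections in a patch.

    `detections` is a list of (scale, confidence) pairs (hypothetical layout).
    Low standard deviations indicate consistent, hence more reliable, detections.
    """
    if not detections:
        return {"count": 0, "scale_mean": 0.0, "scale_std": 0.0,
                "conf_mean": 0.0, "conf_std": 0.0}
    scales = [s for s, _ in detections]
    confs = [c for _, c in detections]
    return {
        "count": len(detections),
        "scale_mean": mean(scales),
        "scale_std": pstdev(scales),
        "conf_mean": mean(confs),
        "conf_std": pstdev(confs),
    }

# Illustrative inputs only: one consistent patch and one scattered patch.
consistent = detection_stats([(1.0, 0.8), (1.0, 0.8), (1.0, 0.8)])
scattered = detection_stats([(0.5, 0.2), (1.0, 0.9), (2.0, 0.4)])
```

In this sketch the scattered patch yields larger deviations in both scale and confidence, which a downstream regressor could treat as a sign of unreliable detections.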
48 2 Fourier Analysis When a crowd image contains thousands of individuals, with each individual occupying only tens of pixels, especially those far away from the camera in an image with perspective distortion, histograms of gradients do not impart any useful information. [sent-116, score-0.609]
49 However, a crowd is inherently repetitive in nature, since all humans appear the same from a distance. [sent-117, score-0.514]
50 The positive correlation is evident from the number of local maxima in the reconstructed patch, and the ground truth counts shown at the bottom. [sent-119, score-0.659]
51 , crowd density in the patch is uniform, can be captured by Fourier Transform, f(ξ), where the periodic occurrence of heads shows as peaks in the frequency domain. [sent-122, score-0.965]
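The idea of reading a count off the frequency domain can be sketched as follows: transform the patch, keep only low frequencies, reconstruct, and count the local maxima of the reconstruction. This is an illustrative sketch only. The cutoff value, the square low-pass mask, the periodic border handling and all function names are assumptions, and a naive DFT is used to keep the example dependency-free (a library FFT would normally be used).

```python
import cmath
import math

def dft2(grid, inverse=False):
    """Naive separable 2-D DFT of a square grid (adequate for small patches)."""
    n = len(grid)
    sign = 1 if inverse else -1
    w = [[cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)]
         for k in range(n)]
    # Transform along rows first, then along columns.
    rows = [[sum(row[t] * w[k][t] for t in range(n)) for k in range(n)]
            for row in grid]
    out = [[sum(rows[t][x] * w[k][t] for t in range(n)) for x in range(n)]
           for k in range(n)]
    scale = n * n if inverse else 1
    return [[v / scale for v in row] for row in out]

def low_pass(spectrum, cutoff):
    """Zero every frequency whose wrapped index exceeds `cutoff` on either axis."""
    n = len(spectrum)
    def freq(k):
        return min(k, n - k)  # distance from the DC term with wrap-around
    return [[v if freq(ky) <= cutoff and freq(kx) <= cutoff else 0
             for kx, v in enumerate(row)] for ky, row in enumerate(spectrum)]

def count_local_maxima(grid):
    """Count cells strictly greater than all 8 neighbours (periodic borders)."""
    n = len(grid)
    count = 0
    for y in range(n):
        for x in range(n):
            neighbours = [grid[(y + dy) % n][(x + dx) % n]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            if all(grid[y][x] > v for v in neighbours):
                count += 1
    return count

def fourier_count(patch, cutoff):
    """Low-pass the patch in the frequency domain and count peaks afterwards."""
    spectrum = low_pass(dft2(patch), cutoff)
    reconstructed = [[v.real for v in row] for row in dft2(spectrum, inverse=True)]
    return count_local_maxima(reconstructed)

# Synthetic 16 x 16 patch: a periodic "head" pattern plus high-frequency clutter.
N = 16
patch = [[math.cos(2 * math.pi * 2 * x / N) + math.cos(2 * math.pi * 2 * y / N)
          + 0.8 * math.cos(2 * math.pi * 7 * x / N)
          for x in range(N)] for y in range(N)]
heads = fourier_count(patch, cutoff=3)
```

On this synthetic patch the clutter creates extra peaks in the raw signal, while the low-pass reconstruction keeps only the four peaks of the periodic pattern.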
52 3 Interest Points based Counting We use interest points not only to estimate counts but also to get a confidence whether the patch represents crowd or not. [sent-131, score-1.277]
53 2) and Fourier Analysis is crowd-blind, it is important to discard counts from such patches. [sent-133, score-0.515]
54 In order to obtain counts or densities using sparse SIFT features, we use Support Vector Regression using the counts computed at each patch from ground truth. [sent-135, score-1.219]
55 From the perspective of statistics, the number of individuals in a particular patch can be seen as a spatial Poisson counting process with parameter λ (corresponding to density), i. [sent-136, score-0.385]
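For reference, under a homogeneous spatial Poisson process with density λ, the number of individuals in a patch of area A follows a Poisson distribution with mean λA. The short sketch below evaluates that probability mass function; the function name and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(k, density, area):
    """P(N = k) for a homogeneous spatial Poisson process.

    `density` is the rate per unit area, so the expected count is density * area.
    """
    mu = density * area
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Hypothetical numbers: 0.02 people per squared pixel over a 10 x 10 pixel patch.
p_empty = poisson_pmf(0, 0.02, 100)   # probability that the patch contains no one
expected_count = 0.02 * 100           # mean of the distribution
```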
56 123e 4s on the left have confidence of crowd likelihood obtained through Eq. [sent-155, score-0.579]
57 In the top image, the gap between stadium tiers gets low confidence of crowd presence. [sent-157, score-0.579]
58 Given a set of positive (+) and negative (−) examples, the relative densities (frequencies normalized by area) of the feature vary in positive and negative images, and can be used to identify crowd patches from non-crowd ones. [sent-177, score-0.634]
59 Assuming independence among features, the log-likelihood ϕ(P) of the ratio of patch containing crowd to non-crowd is [1]: log(γ1, γ2, . [sent-178, score-0.713]
60 The above equation (Eq. 2) gives us a confidence for the presence of crowd in a patch. [sent-190, score-0.6]
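Since the equation itself is truncated in this extract, the following is only a sketch of one way such a confidence could be computed. It assumes that the count of each feature type in a patch is an independent Poisson variable with a class-specific expected count, which gives a log-likelihood ratio that is a simple sum over feature types. The function name, argument names and rates are hypothetical.

```python
import math

def crowd_log_likelihood_ratio(counts, rate_pos, rate_neg):
    """Log of P(counts | crowd) / P(counts | non-crowd).

    Assumes (for this sketch) independent Poisson counts per feature type,
    with expected counts rate_pos[i] and rate_neg[i] for a patch of this size.
    The k! terms of the two Poisson likelihoods cancel in the ratio.
    """
    return sum(
        k * (math.log(lp) - math.log(ln)) + ln - lp
        for k, lp, ln in zip(counts, rate_pos, rate_neg)
    )

# Hypothetical expected counts for three feature types.
rate_pos = [4.0, 1.0, 2.0]   # in crowd patches
rate_neg = [1.0, 1.0, 3.0]   # in non-crowd patches
crowd_like = crowd_log_likelihood_ratio([5, 1, 1], rate_pos, rate_neg)
background_like = crowd_log_likelihood_ratio([0, 1, 4], rate_pos, rate_neg)
```

A positive value favors the crowd hypothesis and a negative value favors non-crowd, so the value can be thresholded or passed on as a confidence feature.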
61 Fusion of Three Sources For learning and fusion at the patch level, we densely sample overlapping patches from the training images and [Figure 5: Multi-scale Markov Random Field for inferring counts for the entire image.] [sent-195, score-0.785]
62 using the annotation, obtain counts for the corresponding patches. [sent-197, score-0.495]
63 Computing counts and confidences from the three sources, we scale individual features and regress using ? [sent-198, score-0.58]
64 Counting in Images In order to impose smoothness among counts from different patches, we place them in an MRF framework with grid structure. [sent-202, score-0.495]
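To make the smoothness idea concrete, here is a deliberately simplified, single-scale sketch. It assumes an energy with a confidence-weighted quadratic data term and an absolute-difference smoothness term on a 4-connected grid, and minimizes it approximately with iterated conditional modes (ICM). Both the energy and ICM are stand-ins chosen for brevity, since this extract does not specify the energy function or the inference procedure of the original framework; all names and parameters are hypothetical.

```python
def icm_smooth(counts, confidence, labels, weight=1.0, sweeps=10):
    """Approximately minimise, over a 4-connected grid,
        sum_i confidence[i] * (x_i - counts[i])**2 + weight * sum_{i~j} |x_i - x_j|
    using iterated conditional modes (ICM)."""
    h, w = len(counts), len(counts[0])
    # Start from the candidate label closest to each observed count.
    x = [[min(labels, key=lambda label: abs(label - c)) for c in row]
         for row in counts]
    for _ in range(sweeps):
        changed = False
        for r in range(h):
            for c in range(w):
                nbrs = [x[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < h and 0 <= cc < w]

                def local_energy(label):
                    data = confidence[r][c] * (label - counts[r][c]) ** 2
                    return data + weight * sum(abs(label - n) for n in nbrs)

                best = min(labels, key=local_energy)
                if best != x[r][c]:
                    x[r][c] = best
                    changed = True
        if not changed:
            break  # converged: no label changed during a full sweep
    return x

# A 3 x 3 neighbourhood whose centre estimate is low and has low confidence.
counts = [[10, 10, 10], [10, 2, 10], [10, 10, 10]]
confidence = [[1.0, 1.0, 1.0], [1.0, 0.05, 1.0], [1.0, 1.0, 1.0]]
smoothed = icm_smooth(counts, confidence, labels=range(16))
```

With these illustrative numbers the low-confidence center estimate is pulled up to the value of its neighbors, while a high-confidence center would move only part of the way.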
65 Then, the beliefs in the groups of 2 × 2 are added, giving the beliefs for the intermediate nodes above the bottom layer. [sent-256, score-0.291]
66 The sum of labels (counts) at the bottom layer gives the count for the image. [sent-273, score-0.281]
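The relationship between layers can be illustrated on plain counts: each node in a coarser layer is the sum of a 2 × 2 group in the layer below, so every layer sums to the same image count. The sketch below assumes a square bottom grid whose side is a power of two; the function names and the example grid are hypothetical.

```python
def sum_2x2(grid):
    """Sum non-overlapping 2 x 2 blocks; assumes even height and width."""
    return [[grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

def build_pyramid(bottom):
    """Return layers from finest to coarsest, ending with a single root node.

    Assumes a square grid whose side length is a power of two.
    """
    layers = [bottom]
    while len(layers[-1]) > 1:
        layers.append(sum_2x2(layers[-1]))
    return layers

# Illustrative patch counts for a 4 x 4 bottom layer.
bottom = [[3, 5, 2, 0],
          [4, 6, 1, 1],
          [7, 8, 9, 2],
          [6, 9, 8, 3]]
layers = build_pyramid(bottom)
image_count = sum(sum(row) for row in bottom)
```

Here the intermediate layer holds the four 2 × 2 sums and the root holds the image count, which makes the consistency requirement across scales explicit.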
67 6 shows three instances where the estimated count of patch was improved based on neighbors (both spatial and layer). [sent-275, score-0.416]
68 In all cases, the patch under consideration lies in the center of a 3 × 3 patch set. [sent-276, score-0.34]
69 With the MRF constraint, overestimated counts are reduced, becoming closer to ground truth. [sent-278, score-0.534]
70 The patch in the middle had a much lower count than its neighbors, which, after inference, increased and became similar to its neighbors. [sent-280, score-0.42]
71 Although the new estimate is closer to ground truth, the increase is not necessarily correct since the lower count was due to presence of a non-human object (an ambulance). [sent-281, score-0.274]
72 The second row shows the ground truth counts, and the estimated counts before and after MRF inference are shown in third and fourth rows, respectively. [sent-286, score-0.592]
73 consists of 50 images with counts ranging between 94 and 4543 with an average of 1280 individuals per image. [sent-288, score-0.711]
74 One of the images is a painting while another is an abstract depiction of a crowd (the one with the least count, shown in Fig. [sent-290, score-0.514]
75 Some examples of images with the associated ground truth counts can be seen in Fig. [sent-293, score-0.57]
76 We used two simple measures to quantify the results: mean and deviation of Absolute Difference (AD), and mean and deviation of Normalized Absolute Difference (NAD), which was obtained by normalizing the absolute difference with the actual count for each image. [sent-296, score-0.352]
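The two measures can be written down directly. The sketch below computes the mean and deviation of AD and NAD over a set of images; the use of the population standard deviation is an assumption (the extract does not say which deviation is meant), and the numbers are made up for illustration rather than taken from the reported results.

```python
from statistics import mean, pstdev

def counting_errors(estimated, actual):
    """Return (mean AD, std AD, mean NAD, std NAD) over a set of images."""
    ad = [abs(e - a) for e, a in zip(estimated, actual)]
    # NAD normalizes each absolute difference by that image's actual count.
    nad = [d / a for d, a in zip(ad, actual)]
    return mean(ad), pstdev(ad), mean(nad), pstdev(nad)

# Hypothetical estimates for three images (not results from the paper).
actual = [100, 1000, 4000]
estimated = [120, 900, 4400]
mean_ad, std_ad, mean_nad, std_nad = counting_errors(estimated, actual)
```

In this example the absolute error grows with the crowd size while the normalized error stays between 10% and 20%, which is why the two measures are reported together.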
77 The first row in Table 1 shows the results of using counts from Fourier Analysis only, giving AD of 703. [sent-299, score-0.539]
78 Including counts from head detections improves AD marginally to 510. [sent-307, score-0.656]
79 Adding counts from regression on sparse SIFT features reduces error in both measures, giving values of 468. [sent-309, score-0.575]
80 Finally, inferring counts for complete images using counts from patches through multi-scale MRF further improves AD taking it to 419. [sent-312, score-1.11]
81 8a show average of actual counts per patch in that image. [sent-318, score-0.741]
82 For easier analysis, the x-axis shows images sorted with respect to actual counts in both plots. [sent-319, score-0.57]
83 It can be seen that AD per patch increases as the actual counts increase, except for the images in the range 25 to 45 with corresponding actual counts in the range of 1000−2500 per image. [sent-320, score-1.312]
84 NAD, but lowest deviations as well, which means the approach consistently predicts correct counts for patches in this range. [sent-322, score-0.642]
85 The reason for better performance in the middle range is obvious: the counts range from 94 to 4543, so the largest count is a tremendous 4832% of the smallest count. [sent-323, score-0.723]
86 Figure 8 (panels (a) and (b), x-axis: image number): This figure shows analysis of patch estimates in terms of absolute and normalized absolute differences. [sent-327, score-0.345]
87 Means are shown in black asterisk, standard deviations with red bars, and ground truth counts with olive dots. [sent-329, score-0.637]
88 [20], and Lempitsky and Zisserman [13], which were suitable for this dataset since other methods for crowd counting mostly deal with videos or use human detection, and cannot be used for testing on this dataset. [sent-332, score-0.775]
89 The x-axis shows the average counts of each of the 10 groups. [sent-342, score-0.495]
90 Density aware person detection [20] performs best around counts of 1000, but its error increases as we move away. [sent-343, score-0.559]
91 The reason becomes obvious when we look at the absolute counts output by the method in Fig. [sent-344, score-0.554]
92 The reason lies in the algorithm itself, as it is designed to minimize the maximum AD across images when training, and since images with higher counts tend to have higher AD, the learning focuses on such images. [sent-348, score-0.495]
93 The learner gets biased towards high density images, thus, producing a lower AD overall, but overestimating at lower counts (Fig. [sent-349, score-0.672]
94 8a reveals that patch density increases super-linearly for this group, which otherwise is linear for the first nine groups. [sent-356, score-0.324]
95 At very high density, the relative frequencies across patches with different density may become similar, resulting in a loss of discriminative power. [sent-359, score-0.274]
96 Conclusion We presented an approach to count the number of individuals in extremely dense crowds, on a scale not tackled before. [sent-361, score-0.452]
97 We fuse information from three sources in terms of counts, confidences and different measures at the patch level, and then enforce a smoothness constraint on nearby patches to improve estimates of incorrect patches, thereby [sent-362, score-0.466]
98 Possible improvements include explicit preprocessed estimation of crowd density, and making regression an explicit function of density so that it better adapts to various crowd sizes. [sent-375, score-1.218]
99 Privacy preserving crowd monitoring: Counting people without people models or tracking. [sent-404, score-0.682]
100 A neural-based crowd estimation by hybrid global learning algorithm. [sent-417, score-0.514]
