
MSc Individual Project


Crowd counting is the area of computer vision concerned with estimating the number of individuals present in an image. It has a wide range of applications, such as traffic control and biological studies. One of its most notable uses is monitoring the density of people, which is crucial during the global COVID-19 pandemic. Videos from surveillance cameras can be decomposed into frames, and these frames can then be fed into crowd counting models to infer the number of people present. This technique lets governments gauge how crowded a place is and take corresponding measures to inhibit the transmission of the virus. Another example is that public transport companies can calculate the number of passengers in a station and adjust the timetable dynamically to reduce running costs. These two instances demonstrate that crowd counting has vast potential in surveillance systems.

However, two challenges arise when deploying crowd counting models. The first is inference speed: although state-of-the-art algorithms have achieved exceptional results on several benchmark datasets, their backbones slow down inference. Most use VGG networks as front ends, but this family of neural networks is composed of standard convolutions, so hundreds of billions of arithmetic operations are required to make a prediction on a 720p image. In contrast, Inception-V3 relies on many 1×1 and spatially separable convolutions, which significantly reduces the total amount of computation. A more detailed comparison of the inference times of different CNN models can be found on Keras Applications.
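The saving from 1×1 and spatially separable convolutions can be illustrated with a rough operation count. The sketch below is only an illustration: the layer size (a 720p feature map with 64 input and 64 output channels) and the helper name `conv_macs` are assumptions made for the example, not values taken from any of the models discussed here, and it counts multiply-accumulate operations for a stride-1, same-padded layer while ignoring biases.

```python
def conv_macs(height, width, c_in, c_out, k_h, k_w):
    """Multiply-accumulate count for one stride-1, same-padded convolution."""
    return height * width * c_in * c_out * k_h * k_w


# Hypothetical layer, chosen only for illustration: 720p map, 64 -> 64 channels.
H, W, C = 720, 1280, 64

# Standard 3x3 convolution, the building block of VGG-style backbones.
standard_3x3 = conv_macs(H, W, C, C, 3, 3)

# Spatially separable version: a 1x3 convolution followed by a 3x1 convolution.
separable_3x3 = conv_macs(H, W, C, C, 1, 3) + conv_macs(H, W, C, C, 3, 1)

# Pointwise 1x1 convolution.
pointwise_1x1 = conv_macs(H, W, C, C, 1, 1)

print(f"standard 3x3 : {standard_3x3 / 1e9:.2f} G MACs")
print(f"1x3 then 3x1 : {separable_3x3 / 1e9:.2f} G MACs")
print(f"1x1          : {pointwise_1x1 / 1e9:.2f} G MACs")
```

For this layer the separable pair needs two thirds of the operations of the standard 3×3 convolution, and the 1×1 convolution needs one ninth. The gap widens for larger kernels, since an n×n kernel costs n² multiplications per output value while the 1×n and n×1 pair costs 2n.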

The other difficulty is the lack of appropriately pre-trained models. The most popular benchmark datasets are UCF_CC_50, ShanghaiTech (A and B), and UCF-QNRF, so most researchers only open-source weights tuned on them to prove their models’ superiority. However, although ShanghaiTech B is from a surveillance perspective, it solely represents crowded scenes, and the other three are free-view and much more congested. Hence, models trained on these datasets usually fail to generalise well to real-world data from surveillance cameras. By comparison, Crowd Surveillance is another surveillance-view dataset containing both sparse and crowded scenes. This property endows models with better generalisation on surveillance videos.

Hence, in this paper, inspired by Inception-V3, a more compact crowd counting model is proposed, and its weights pre-trained on Crowd Surveillance [43] will be open-sourced to facilitate its use in monitoring systems. In addition, multiple experiments are conducted, including evaluations on ShanghaiTech (A and B) and Mall, to show that it requires fewer computational resources while preserving high accuracy.

Model Structure

The structure of ICC.



Comparison of models on the ShanghaiTech datasets.
Methods          Complexity   Part A MAE   Part A RMSE   Part B MAE   Part B RMSE
MCNN             56.21 G      110.2        173.2         26.4         41.3
CMTL             243.80 G     101.3        152.4         20.0         31.1
CSRNet           857.84 G     68.2         115.0         10.6         16.0
CAN              908.05 G     62.3         100.0         7.8          12.2
DM-Count         853.70 G     59.7         95.7          7.4          11.8
M-SegNet         749.73 G     60.55        100.8         6.8          10.4
SASNet           1.84 T       54.59        88.38         6.35         9.9
ICC (proposed)   125.53 G     76.97        130.16        8.46         15.20
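The trade-off in the table can be read more easily as relative changes. The following sketch copies the complexity and the first error column for Part A and Part B from three rows of the table and computes how much computation ICC saves and how much its error grows against each baseline. Treating the first error column as MAE is an assumption based on the Mall table's column names, and the helper `relative_change` is introduced only for this example.

```python
# (complexity in G operations, Part A MAE, Part B MAE), copied from the table.
baselines = {
    "CSRNet": (857.84, 68.2, 10.6),
    "CAN": (908.05, 62.3, 7.8),
    "DM-Count": (853.70, 59.7, 7.4),
}
icc = (125.53, 76.97, 8.46)


def relative_change(new, old):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


for name, (ops, mae_a, mae_b) in baselines.items():
    saved = -relative_change(icc[0], ops)      # computation saved by ICC
    worse_a = relative_change(icc[1], mae_a)   # increase in Part A MAE
    worse_b = relative_change(icc[2], mae_b)   # increase in Part B MAE
    print(f"{name:9s} computation -{saved:.1f}%  "
          f"Part A MAE +{worse_a:.1f}%  Part B MAE +{worse_b:.1f}%")
```

With these numbers, the reduction in computation relative to DM-Count comes out at roughly 85.3 percent, while the Part A MAE against the same baseline is roughly 28.9 percent higher.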

Ground-truth annotations for an image from ShanghaiTech B.
The prediction of the crowd distribution.


Comparison of models on the Mall dataset.
Methods                                          MAE    RMSE
Chen et al.                                      3.15   15.7
ConvLSTM                                         2.24   8.50
DecideNet                                        1.52   1.90
DRSAN                                            1.72   2.10
SAAN                                             1.28   1.68
LA-Batch                                         1.34   1.90
ICC (proposed)                                   2.16   2.74
ICC (proposed; trained on Crowd Surveillance)    3.79   4.77
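For reference, the two metrics in the table are the mean absolute error and the root mean squared error between predicted and true per-image counts. The sketch below uses their standard definitions; the four predicted and actual counts are made up for illustration and are not taken from the Mall experiments.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error between predicted and true counts."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical per-image counts, for illustration only.
predicted_counts = [31.2, 28.7, 40.1, 25.0]
actual_counts = [30, 30, 38, 27]

print(f"MAE  = {mae(predicted_counts, actual_counts):.2f}")
print(f"RMSE = {rmse(predicted_counts, actual_counts):.2f}")
```

RMSE is never smaller than MAE on the same data and penalises large individual errors more heavily, so a wide gap between the two suggests a few images with large miscounts.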

Ground-truth annotations of an image from Mall.
The predicted crowd distribution.


In this paper, a convolutional neural network built on Inception-V3 and CAN has been proposed to facilitate the deployment of crowd counting models in surveillance systems. The proposed method is far less computationally complex than state-of-the-art algorithms, while its performance is not significantly harmed. This property has been verified by various experiments on benchmark datasets. Both context-aware components within the model have been shown to be useful, and this work also shows that models pre-trained on Crowd Surveillance generalise well.


Recent sophisticated CNN-based algorithms have demonstrated an extraordinary ability to automate crowd counting from images, thanks to structures designed to address the issue of varying head scales. However, these complicated architectures also increase computational complexity enormously, making real-time estimation implausible. Thus, in this paper, a new method based on Inception-V3 is proposed to reduce the amount of computation. The proposed approach (ICC) exploits the first five inception blocks and the contextual module designed in CAN to extract features at different receptive fields, thereby being context-aware. Employing these two different strategies can also increase the model’s robustness. Experiments show that ICC can reduce computation by up to 85.3 percent at the cost of a 24.4 percent loss in performance. This high efficiency contributes significantly to the deployment of crowd counting models in surveillance systems to safeguard public safety. The code will be available on GitHub, and its pre-trained weights on the Crowd Surveillance dataset, which comprises a large variety of scenes from surveillance perspectives, will also be open-sourced.


Yiming Ma


Inception-Based Crowd Counting