MSc Individual Project

Introduction

Crowd counting is the area of computer vision concerned with estimating the number of individuals present in an image. It has a wide range of applications, such as traffic control and biological studies. One of its most notable uses is monitoring the density of people, which is crucial during the global COVID-19 pandemic. Videos from surveillance cameras can be decomposed into frames, which can then be fed into crowd counting models to infer the number of people present. This technique lets governments gauge how crowded a place is and take corresponding measures to inhibit the transmission of the virus. Similarly, public transport companies can estimate the number of passengers in a station and adjust the timetable dynamically to reduce running costs. These two examples demonstrate that crowd counting has vast potential in surveillance systems.

However, two challenges arise when deploying crowd counting models. The first is inference speed: although state-of-the-art algorithms have achieved exceptional results on several benchmark datasets, their backbones slow down inference. Most use VGG networks as front ends, but this family of neural networks is composed of standard convolutions, so hundreds of billions of arithmetic operations are required to make a prediction on a 720P image. In contrast, Inception-V3 relies on many $1 \times 1$ and spatially separable convolutions, which significantly reduces the total amount of computation. A more detailed comparison of the inference time of different CNN models can be found on Keras Applications.
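The saving from $1 \times 1$ and spatially separable convolutions can be illustrated with a rough operation count. The sketch below is only an illustration: the layer shape (64 input and 64 output channels at full 720P resolution) and the helper `conv_macs` are assumptions made here, not values taken from VGG, Inception-V3, or the proposed model.

```python
def conv_macs(height, width, c_in, c_out, k_h, k_w):
    """Multiply-accumulate count of one stride-1 convolution whose output
    keeps the input's spatial size (bias terms are ignored)."""
    return height * width * c_in * c_out * k_h * k_w


# Hypothetical layer: a 1280 x 720 feature map with 64 channels in and out.
H, W, C = 720, 1280, 64

# Standard 3x3 convolution, as used throughout VGG-style networks.
standard_3x3 = conv_macs(H, W, C, C, 3, 3)

# Spatially separable alternative: a 1x3 convolution followed by a 3x1 one.
separable_3x3 = conv_macs(H, W, C, C, 1, 3) + conv_macs(H, W, C, C, 3, 1)

# Pointwise 1x1 convolution.
pointwise_1x1 = conv_macs(H, W, C, C, 1, 1)

for name, macs in [
    ("standard 3x3", standard_3x3),
    ("separable 1x3 + 3x1", separable_3x3),
    ("pointwise 1x1", pointwise_1x1),
]:
    share = macs / standard_3x3
    print(f"{name:<20} {macs / 1e9:7.2f} GMACs ({share:.0%} of standard)")
```

Under these assumptions, the separable pair needs two thirds of the operations of the standard $3 \times 3$ layer, and the $1 \times 1$ layer needs one ninth. The gap widens for larger kernels.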

The second difficulty is the lack of appropriately pre-trained models. The most popular benchmark datasets are UCF_CC_50, ShanghaiTech (A and B), and UCF-QNRF, so most researchers only open-source weights tuned on them to demonstrate their models’ superiority. However, although ShanghaiTech B is captured from a surveillance perspective, it only represents crowded scenes, and the other three are free-view and much more congested. Hence, models trained on these datasets usually fail to generalise well to real-world data from surveillance cameras. By comparison, Crowd Surveillance is another surveillance-view dataset that contains both sparse and crowded scenes, so models trained on it generalise better to surveillance videos.

Hence, in this paper, a smaller crowd counting model inspired by Inception-V3 is proposed, and its weights pre-trained on Crowd Surveillance [43] will be open-sourced to facilitate its deployment in monitoring systems. In addition, multiple experiments are conducted, including evaluations on ShanghaiTech (A and B) and Mall, to show that the model requires fewer computational resources while preserving high accuracy.

Model Structure

The structure of ICC.

Results

ShanghaiTech

Comparison of models on the ShanghaiTech datasets.
Methods	Complexity	Part A MAE	Part A RMSE	Part B MAE	Part B RMSE
MCNN	56.21 G	110.2	173.2	26.4	41.3
CMTL	243.80 G	101.3	152.4	20.0	31.1
CSRNet	857.84 G	68.2	115.0	10.6	16.0
CAN	908.05 G	62.3	100.0	7.8	12.2
DM-Count	853.70 G	59.7	95.7	7.4	11.8
M-SegNet	749.73 G	60.55	100.8	6.8	10.4
SASNet	1.84 T	54.59	88.38	6.35	9.9
ICC (proposed)	125.53 G	76.97	130.16	8.46	15.20
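The trade-off in the table can be made explicit by expressing ICC's complexity and MAE relative to each baseline. The sketch below copies the complexity and MAE columns from the table. The helper `percent_change` and the conversion of 1.84 T to 1840 G (taking T as 1000 G) are assumptions made here for illustration.

```python
# (complexity in G, Part A MAE, Part B MAE), copied from the table above.
baselines = {
    "MCNN": (56.21, 110.2, 26.4),
    "CMTL": (243.80, 101.3, 20.0),
    "CSRNet": (857.84, 68.2, 10.6),
    "CAN": (908.05, 62.3, 7.8),
    "DM-Count": (853.70, 59.7, 7.4),
    "M-SegNet": (749.73, 60.55, 6.8),
    "SASNet": (1840.0, 54.59, 6.35),  # 1.84 T, assuming 1 T = 1000 G
}
icc = (125.53, 76.97, 8.46)


def percent_change(baseline, value):
    """Signed change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


comparison = {}
for name, (complexity, mae_a, mae_b) in baselines.items():
    comparison[name] = {
        "complexity": percent_change(complexity, icc[0]),
        "mae_a": percent_change(mae_a, icc[1]),
        "mae_b": percent_change(mae_b, icc[2]),
    }
    row = comparison[name]
    print(
        f"{name:<9} complexity {row['complexity']:+7.1f}%  "
        f"Part A MAE {row['mae_a']:+6.1f}%  Part B MAE {row['mae_b']:+6.1f}%"
    )
```

Negative values mean ICC is lower than the baseline, which is desirable for both complexity and error. For instance, ICC's complexity is about 85.3 percent lower than DM-Count's, and its Part B MAE is about 24.4 percent higher than M-SegNet's.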

Ground-truth annotations for an image from ShanghaiTech B.

The prediction of the crowd distribution.

Mall

Comparison of models on the Mall dataset.
Methods	MAE	RMSE
Chen et al.	3.15	15.7
ConvLSTM	2.24	8.50
DecideNet	1.52	1.90
DRSAN	1.72	2.10
SAAN	1.28	1.68
LA-Batch	1.34	1.90
ICC (proposed)	2.16	2.74
ICC (proposed; trained on Crowd Surveillance)	3.79	4.77
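The MAE and RMSE columns in these tables are not defined on this page. The sketch below assumes the usual definitions over per-image counts. The function names and the example counts are made up for illustration and are not taken from any of the datasets.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalises large per-image errors more than MAE."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Made-up counts for four images, purely for illustration.
predicted_counts = [31.0, 18.5, 42.0, 25.0]
actual_counts = [30, 20, 40, 25]

print(f"MAE  = {mae(predicted_counts, actual_counts):.3f}")
print(f"RMSE = {rmse(predicted_counts, actual_counts):.3f}")
```

Because RMSE squares each error before averaging, a model with a few badly miscounted images can have a reasonable MAE but a much larger RMSE.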

Conclusion

In this paper, a convolutional neural network built on Inception-V3 and CAN has been proposed to facilitate the deployment of crowd counting models in surveillance systems. The proposed method is much less computationally complex than state-of-the-art algorithms, while its performance is not significantly harmed. This property has been verified by various experiments on benchmark datasets. Both context-aware components within the model have been shown to be useful, and this work also shows that models pre-trained on Crowd Surveillance generalise well.

Abstract

Recent sophisticated CNN-based algorithms have demonstrated an extraordinary ability to automate crowd counting from images, thanks to structures designed to address the wide variation in head scales. However, these complicated architectures also increase computational complexity enormously, making real-time estimation impractical. Thus, in this paper, a new method based on Inception-V3 is proposed to reduce the amount of computation. The proposed approach, ICC, exploits the first five inception blocks and the contextual module designed in CAN to extract features at different receptive fields, thereby being context-aware. Employing these two different strategies can also increase the model’s robustness. Experiments show that ICC can reduce computation by up to 85.3 percent at the cost of a 24.4 percent loss in performance. This high efficiency contributes significantly to the deployment of crowd counting models in surveillance systems that safeguard public safety. The code will be available on GitHub, and its weights pre-trained on the Crowd Surveillance dataset, which comprises a large variety of scenes from surveillance perspectives, will also be open-sourced.

Contact:

Yiming Ma

Mathematics for Real-World Systems Centre for Doctoral Training