Attention in crowd counting using the transformer and density map to improve counting result

THÔNG TIN TÀI LIỆU

Thông tin cơ bản

Định dạng
Số trang	6
Dung lượng	737,18 KB

Nội dung

Attention in Crowd Counting Using the Transformer and Density Map to Improve Counting Result Attention in Crowd Counting Using the Transformer and Density Map to Improve Counting Result Phuc Thinh Do[.]

2021 8th NAFOSTED Conference on Information and Computer Science (NICS) Attention in Crowd Counting Using the Transformer and Density Map to Improve Counting Result Phuc Thinh Do Dong Nai Technology University Dong Nai, Vietnam dophucthinh@dntu.edu.vn other hand, with the ability to collect global information, the model solves the problem of limited receptive fields Abstract— With the vigorous development of CNN, most crowd counting methods have approached using CNN to estimate the density map and then infer the count However, these methods face many limitations due to limited receptive fields, background noise, etc With the advent of Transformer in natural language processing, it is possible to utilize this model for the crowd counting problem The Transformer can model the global context, so it helps to solve the problem of receptive fields On the other hand, with the attention mechanism, the model can focus on areas of concentration of people, helping to solve the problem of background noise In this paper, we propose a Crowd counting model combining Transformer and Density map (TDCrowd) to estimate the number of people in a crowd With the use of a Transformer, TDCrowd can still be trained so that it does not need information about the location of people in the crowd, but only information about the count Experiments on three datasets ShanghaiTech, UCF_QNRF, and JHU-Crowd++, show that TDCrowd gives better results when compared to regression-based methods (need only the count information) and density map-based (need the count information and location information) In summary, we propose a model that uses the Transformer to generate density maps instead of just estimating counts This approach allows our model to capture the global context, thus solving the problem of limited receptive fields of CNN On the other hand, we can leverage location information to improve this density map because we can generate a density map from the head positions The rest of the paper is organized as follows In section II, we will discuss crowd counting approaches and some related research Next, we will present the proposed model, baseline, and how to train the model In part IV, we will talk about experimenting and analyzing the results Finally, we will conclude and outline the future direction in the last part Keywords— crowd counting, convolutional neural networks, density map, Transformer, attention I INTRODUCTION Crowd counting refers to estimating the number of objects in a crowd, such as people, vehicles, trees, etc It is one of the essential tasks in surveillance systems The original approach of crowd counting was detecting objects and counting the number of detected objects Since object detection in images with dense density is not good, a new method has arisen, the regression method Methods of this type will attempt to map between the crowd image and the count Recently, with the development of deep learning, the crowd counting method has moved in a new direction, using density maps (Fig 1) This method helps to take advantage of spatial information by obtaining information about the position of objects in the image However, density mapbased methods will face problems such as limited receptive fields, background noise, etc With the advent of the Transformer [26] in natural language processing, many works have taken advantage of this model for image processing [1], [7] The advantage of Transformer is that it can capture global information, which can solve the problem of limited receptive fields of CNN in general and density map-based methods in particular In this paper, we propose a combined model of Transformer and density map (TDCrowd) This model can take advantage of the Attention mechanism to focus on crowded areas, thereby solving the problem of background noise On the 978-1-6654-1001-4/21/$31.00 ©2021 IEEE Fig Sample images in the JHU-Crowd++ dataset and their density maps II RELATED WORK Previous crowd counting methods often follow the direct counting approach or use regression models The direct counting methods use a sliding window over the image to detect objects [11], [29] Several works use CNN to build a detection model to predict bounding boxes [17], [21] The number of bounding boxes is also the number of people in the image However, with scenes with too many people, these models will have difficulty identifying individual objects Another approach is to use a regression model [1], [9], [22] Methods of this type will build a regression model to map the crowd image and the count number However, this makes the results of the model more difficult to interpret On the other hand, since only one count number information is used, the model lacks spatial information Lempitsky [15] proposed a mapping method from the input 65 2021 8th NAFOSTED Conference on Information and Computer Science (NICS) image to density map that can be integrated to obtain the final counts Density maps take advantage of spatial information, which is the location of people in the image With the development of CNN, many methods of using CNN to generate density maps are proposed MCNN [31] uses multi-column CNNs with different filter sizes to extract features in different scales Switch-CNN [23] improves M-CNN by using additional VGG-16 classifiers to select the appropriate CNN column, while Do et al [4], [5] focus on removing the non-human scene CSRNet [16] uses dilated convolution to increase receptive fields CHF [6] uses filters to improve density map quality Recently, many works have used Transformer for image processing and achieved many good results DETR [1], ViT [7] are the first works to apply Transformer for object detection and recognition Inspired by ViT, Liang et at [18] proposed TransCrowd, one of the first methods to use Transformers for crowd counting The input image will be converted to sequence data and turned into a count This method can collect global information but ignores information about the location of the person in the image A Input image processing Because the input of the Transformer [26] is sequence data, we convert the input image into equal parts With the input image of size ℎ (where is the width, ℎ is the height), we divide it into parts of size The number of patches obtained will be patches Then we stack these patches into (Fig 2) To convert sequence into a latent Ddimensional embedding feature, we use a learnable projection 1, 2, … , where has size D To maintain the : → position information, we adopt a specific position embedding 1, 2, … B Transformer encoder The Transformer encoder [26] consists of ! blocks of multihead self-attention (MSA) and Multilayer Perceptron (MLP) blocks Every block, layer normalization (LN), and residual connections are applied The MLP contains two layers with a GELU activation function [8] The first layer expands the embedding dimension from " to 4" , and the second one compresses the size from 4" to " The output of the Transformer encoder ($% is represented as follows: III PROPOSED METHOD Our proposed model uses the Transformer [26] to build the density map The sum of pixel values of the density map represents the number of people Inspired by ViT [7], our model uses the Transformer encoder to get information However, instead of mapping directly to the count number, the output of the Transformer encoder will be convolution by a 1×1 convolution to generate a density map This method helps solve the background noise problem because the attention mechanism helps to focus attention on areas where people are present Furthermore, the Transformer can receive global information, which helps solve the problem of limited receptive fields The model can be trained in two ways: with position information or only with count information Suppose the data has information about the location of people in the image In that case, the model will be trained using the L2 loss function between the estimated density map and ground truth density map On the other hand, if only the count information is available, from the estimated density map after convolution by a 1x1 convolution [31], the model will calculate the count and use the L1 loss function between prediction count and ground truth count The proposed model is depicted in Fig   '()*! $%& $% $% + , '!0*! $%& , $%- $%& / / 1, 2, … !  1, 2, … !    MSA has independent self-attention (SA) modules and a re-projection operation ( ) The input of each SA contains three information: query (2), key (3 , and value (4 The value of SA can be defined as follows:  56 718 () $%  $%- :, $%- ;, :; < √> ? 4 $%-  @  where , , are three learnable matrices The softmax function is applied for the input matrix Different from ViT [7], the desired output of the model is a density map Therefore, we add a 1 convolution layer to regress the density map (Fig 2) Fig The proposed model is in the training stage 66 2021 8th NAFOSTED Conference on Information and Computer Science (NICS) where is the number of images, P BC is the ground truth count of the -th image, P Θ is the predicted count of the -th image with parameters Θ The details of the algorithm used to build and train the model are shown in Fig Algorithm Training Input: Annotated image with ground truth density map dm (flag = 1) or ground truth count gtc (flag = 0) Output: Trained model Begin optimizer = Adam(lr=1e-5) if (flag == 1): // with the position information model = VisionTransformer model.Sequential(Conv2d(kernel_size=1)) if (flag == 0): // with the count information model = VisionTransformer foreach image in dataset: if (flag == 1): // Estimated density map est = model(image) l2loss = MSEloss(est, dm) // Optimized using the Pytorch library optimizer.zero_grad() l2loss.backward() optimizer.step() if (flag == 0): est = model(image) count = est.sum() // Calculate the count l1loss = L1loss(gtc, count) optimizer.zero_grad() loss.backward() optimizer.step() Fig The proposed model is in the testing stage C The ground truth density map For datasets with location information, to be able to train the model, we create the ground truth density map by using Gaussian kernel at each head position:  ∑ G E $ + $ , F  ABC  where ABC is the ground truth density map, E is a Gaussian kernel with standard deviation F , n is the number of head positions The final count is obtained by summing the values of the density map Similar to the method using density maps, we also choose F 15 End Fig Model training algorithm D Loss function After obtaining the density map by convolution with 1 convolution, the model will be trained using the L2 loss function to measure the difference between the estimated density map and the ground truth density map:  L2 Θ KL K ∑LG.MA Θ + ABC M  K IV EXPERIMENTS  where ' is the number of images, ABC is the ground truth density map of the -th image, A Θ is the estimated density map of the -th image with parameters Θ However, with the use of the Transformer [26], the proposed model can also be trained using the L1 loss function to measure the difference between predicted count and ground truth count:  L1 Θ N ∑NG.OP Θ + P BC O  We evaluate our method on three datasets, including ShanghaiTech, UCF-QRNF, and JHU-CROWD++ The model is implemented in python using the PyTorch library When processing the input image, we choose n = [4] To increase the training data, we use transformations such as rotation, inversion, and random cropping Due to memory limitations, we limit the size to 1024 and use sliding windows when dealing with UCFQRNF and JHU-CROWD++ datasets We used Adam optimizer [13] with a learning rate of 1e-5, weight decay 1e-4 The evaluation metric and results of the experiment are shown below A Evaluation Metric For comparison with the previous methods, we use two evaluation metrics; is Mean Absolute Error (MAE) and Mean Squared Error (MSE):  67 ')Q N BC ∑N OP + P O  2021 8th NAFOSTED Conference on Information and Computer Science (NICS) '(Q  BC K R ∑N  P +P  N where is the number of images, P is the estimated count, PBC is the ground truth count MAE indicates the accuracy of the predicted result, and MSE measures the robustness B ShanghaiTech dataset This dataset [31] is divided into parts A and B Part A includes 300 training images and 182 testing images scrawled from the internet Part B consists of 400 training images and 316 testing images Images in Part B are taken from the metropolis in Shanghai city In total, the dataset includes 1198 images with 330,165 annotations We can see the results in TABLE I , TDCrowd gives better results than other methods, especially for images with a high density of people TABLE I Method COMPARISON WITH OTHER METHODS ON THE SHANGHAITECH DATASET Part A Fig Visualization results of our model on the ShanghaiTech Part B dataset The left column is the sample image, and the right column is the estimated density map Part B MAE MSE MAE MSE MCNN [31] 110.2 173.2 26.4 41.3 Switch-CNN [23] 90.4 135.0 21.6 33.4 Do et al [4] 81.9 122.1 20.9 33.1 SSC [5] 69.7 120.2 11.8 17.2 CSRNet [16] 68.2 115.0 10.6 16.0 TransCrow [18] 66.1 105.1 9.3 16.1 CHF [6] 65.1 99.7 7.9 11.2 TEDnet [12] 64.2 109.1 8.2 12.8 CG-DRCN [25] 60.2 94.0 7.5 12.1 DM-Count [28] 59.7 95.7 7.4 11.8 GL [27] 61.3 95.4 7.3 11.7 TDCrowd (Our) 57.9 95.4 7.1 11.3 C UCF-QRNF dataset This dataset [10] includes 1535 images with 1.25 million annotations In particular, 1201 training images and 334 testing images This dataset is for crowd counting and localization, which contains realistic scenarios captured in the wild The total number of people in each image ranges from 49 to 12,865 On this large-scale dataset, TDCrowd outperforms the state-of-theart methods TDCrowd reduces the MAE of TransCrowd from 97.2 to 83.0 and MSE from 168.5 to 143.4 TABLE II COMPARISON WITH OTHER METHODS ON THE UCF-QRNF DATASET MAE MSE Switch-CNN [23] 252.0 514.0 Do et al [4] 245.3 512.7 SSC [5] 125.7 213.1 CSRNet [16] 120.3 208.5 Method Fig Visualization results of our model on the ShanghaiTech Part A dataset The left column is the sample image, and the right column is the estimated density map 68 Idrees et al [10] 132.0 191.0 CHF [6] 110.2 179.6 TransCrow [18] 97.2 168.5 TEDnet [12] 113.0 188.0 CG-DRCN [25] 112.2 176.3 DM-Count [28] 85.6 147.5 GL [27] 84.3 147.5 TDCrowd (Our) 83.0 143.4 2021 8th NAFOSTED Conference on Information and Computer Science (NICS) However, because the size of the dataset is quite large, there are few methods to use this data set to evaluate their model TABLE III shows the results of TDCrowd outperforming other methods when evaluated on the validation and testing sets E Ablation Study Comparison with Transformer method for crowd counting: TransCrow was one of the first methods to use Transformer for estimating the number of people in a crowd However, TransCrow only uses count information and focuses on exploiting the attention mechanism The experimental results in TABLE IV show that our method is better because it has more information about the location Comparison with methods using only count information: These methods use less information than methods using density maps, so we compare TDCrowd with these methods in the aspect that uses only the count and does not use the position information TABLE IV shows that TDCrowd ranked second when evaluated on the ShanghaiTech Part A dataset Although MAE and MSE of TDCrowd were higher than TransCrowd using GAP, TDCrowd still improved by 2.1 MAE points, 8.2 MSE points compared to TransCrowd using Token Fig Visualization results of our model on the UCF-QRNF dataset The left column is the sample image, and the right column is the estimated density map D JHU-Crowd++ dataset TABLE IV COMPARISON WITH METHODS THAT DO NOT USE LOCATION INFORMATION Method Label Location Number TDCrowd (Our) Fig Visualization results of our model on the JHU-Crowd++ dataset The left column is the sample image, and the right column is the estimated density map DATASET Method Validation Testing MAE MSE MAE MSE MCNN [31] 160.6 377.7 188.9 483.4 CSRNet [16] 72.2 249.9 85.9 309.2 CAN [20] 89.5 239.3 100.1 314.0 SANet [1] 82.1 272.6 91.1 320.4 CG-DRCN [25] 67.9 262.1 82.3 328.0 TransCrow [18] 56.8 193.6 - - TDCrowd (Our) 54.0 190.4 67.2 259.6 MSE 57.9 95.4 Yang et al [30] ✗ 104.6 145.2 MATT [14] ✗ 80.1 129.4 TransCrowd-Token [18] ✗ 69.0 116.5 TransCrowd-GAP [18] ✗ 66.1 105.1 TDCrowd (Our) ✗ 67.9 108.3 V CONCLUSION AND FUTURE WORK We have proposed a model using the Transformer for density map construction This model can capture global information to solve the problem of limited receptive fields On the other hand, with the use of density maps, the model still takes advantage of the location information of the crowd datasets Experimentally, when using the density map, our method is better than methods using the Transformer to estimate the count In the future, we will continue to improve the density map model for better results On the other hand, we will also study counting methods on other objects such as animals, fruits, books, etc COMPARISON WITH OTHER METHODS ON THE JHU-CROWD++ TABLE III MAE REFERENCES [1] [2] [3] The dataset [25] includes 2722 training images, 1600 testing images, and 500 validation images collected from diverse scenarios and weather conditions It has negative samples (images without people) with a count range of to 25,791 [4] 69 Cao, Xinkun, et al "Scale aggregation network for accurate and efficient crowd counting." Proceedings of the European Conference on Computer Vision (ECCV) 2018 Carion, Nicolas, et al "End-to-end object detection with transformers." European Conference on Computer Vision Springer, Cham, 2020 K Chen, C C Loy, S Gong, and T Xiang Feature mining for localised crowd counting In BMVC, 2012 Phuc Thinh Do, and Ngoc Quoc Ly A New Framework For Crowded Scene Counting Based On Weighted Sum Of Regressors and Human Classifier In SoICT ’18: Ninth International Symposium on Information and Communication Technology, 2018 2021 8th NAFOSTED Conference on Information and Computer Science (NICS) [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] Do, Phuc Thinh, Manh Thuong Phan, and Thien Tam Chan Le "A singlecolumn convolutional neural networks for crowd counting." 2019 6th NAFOSTED Conference on Information and Computer Science (NICS) IEEE, 2019 Do, Phuc Thinh, and Ngoc Quoc Ly "A New High Performance Approach for Crowd Counting Using Human Filter." 2020 7th NAFOSTED Conference on Information and Computer Science (NICS) IEEE, 2020 Dosovitskiy, Alexey, et al "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020) Hendrycks, Dan, and Kevin Gimpel "Bridging nonlinearities and stochastic regularizers with gaussian error linear units." (2016) H Idrees, I Saleemi, C Seibert, and M Shah Multi-source multi-scale counting in extremely densecrowd images In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 2547– 2554, 2013 H Idrees, M Tayyab, K Athrey, D Zhang, S Al-Maadeed, N Rajpoot, and M Shah, Composition loss for counting, density map estimation and localization in dense crowds In ECCV, 2018, pp 532–546 W Ge and R T Collins Marked point processes for crowd counting In Computer Vision and Pattern Recognition, 2009 CVPR 2009 IEEE Conference on, pages 2913–2920 IEEE, 2009 X Jiang, Z Xiao, B Zhang, X Zhen, X Cao, D Doermann, and L Shao, “Crowd counting and density estimation by trellis encoderdecoder network,” CVPR, 2019 DP Kingma, Diederik P, and Jimmy BA Adam: A method for stochastic optimization arXiv preprint arXiv:1412.6980, 2014 Lei, Yinjie, et al "Towards using count-level weak supervision for crowd counting." Pattern Recognition 109 (2021): 107616 V Lempitsky and A Zisserman Learning to count objects in images In Advances in neural information processing systems, pages 1324–1332, 2010 Yuhong Li, Xiaofan Zhang, and Deming Chen CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes CVPR, 2018 Lian, Dongze, et al "Density map regression guided detection network for rgb-d crowd counting and localization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019 Liang, Dingkang, et al "TransCrowd: Weakly-Supervised Crowd Counting with Transformer." arXiv preprint arXiv:2104.09116 (2021) [19] Liu, Ning, et al "Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019 [20] Liu, Weizhe, Mathieu Salzmann, and Pascal Fua "Context-aware crowd counting." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019 [21] Liu, Yuting, et al "Point in, box out: Beyond counting persons in crowds." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019 [22] Paragios, Nikos, and Visvanathan Ramesh "A MRF-based approach for real-time subway monitoring." Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition CVPR 2001 Vol IEEE, 2001 [23] D B Sam, S Surya, R V Babu Switching convolutional neural network for crowd counting In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017 [24] Karen Simonyan and Andrew Zisserman Very deep convolutional networks for large-scale image recognition arXivpreprint arXiv:1409.1556, 2014 [25] Sindagi, Vishwanath, Rajeev Yasarla, and Vishal MM Patel "Jhucrowd++: Large-scale crowd counting dataset and a benchmark method." IEEE Transactions on Pattern Analysis and Machine Intelligence (2020) [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin Attention is all you need In NIPS, 2017 [27] Wan, Jia, Ziquan Liu, and Antoni B Chan "A Generalized Loss Function for Crowd Counting and Localization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021 [28] Wang, Boyu, et al "Distribution matching for crowd counting." arXiv preprint arXiv:2009.13077 (2020) [29] M Wang and X Wang Automatic adaptation of a generic pedestrian detector to a specific traffic scene In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3401–3408 IEEE, 2011 [30] Yang, Yifan, et al "Weakly-supervised crowd counting learns from sorting rather than locations." Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16 Springer International Publishing, 2020 [31] Y Zhang, D Zhou, S Chen, S Gao, Y Ma Single image crowd counting via multi-column convolutional neural network In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016 70 ... methods using the Transformer to estimate the count In the future, we will continue to improve the density map model for better results On the other hand, we will also study counting methods on other... trained using the L2 loss function between the estimated density map and ground truth density map On the other hand, if only the count information is available, from the estimated density map. .. density map, E is a Gaussian kernel with standard deviation F , n is the number of head positions The final count is obtained by summing the values of the density map Similar to the method using density

Ngày đăng: 18/02/2023, 05:36