MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

MASTER'S THESIS
NGUYỄN MINH NGHĨA

DEVELOPING AN APPLICATION FOR CROWD DENSITY ESTIMATION USING DEEP LEARNING

MAJOR: ELECTRONIC ENGINEERING - 8520203
Scientific supervisor: Dr. Trần Vũ Hoàng

Ho Chi Minh City - 04/2022

CURRICULUM VITAE

I. PERSONAL INFORMATION
Full name: Nguyễn Minh Nghĩa. Gender: Male.
Date of birth: 28/06/1996. Place of birth: Bến Tre. Hometown: Bến Tre. Ethnicity: Kinh.
Contact address: 184 Trần Văn Kiểu, Ward 10, District 6, Ho Chi Minh City.
E-mail: minhnghia1996@gmail.com

II. EDUCATION
Bachelor's degree:
Training mode: Full-time. Training period: from 09/2014 to 09/2018.
Institution: Ho Chi Minh City University of Technology and Education.
Major: Electronics and Communication Engineering Technology.
Graduation project: Hand gesture recognition using a convolutional neural network.
Date and place of defense: 08/07/2018, Ho Chi Minh City University of Technology and Education.
Supervisor: MSc. Lê Minh Thành.

Master's degree:
Training mode: Full-time. Training period: from 09/2019 to 09/2022.
Institution: Ho Chi Minh City University of Technology and Education.
Major: Electronic Engineering.
Thesis title: Developing an application for crowd density estimation using deep learning.
Date and place of defense: 24/04/2022, Ho Chi Minh City University of Technology and Education.
Supervisor: Dr. Trần Vũ Hoàng.

III. PROFESSIONAL EXPERIENCE SINCE GRADUATION
Period: since 10/2018. Workplace: Công ty TNHH Giải pháp Phần mềm Tường Minh. Position: Software Engineer.

4.4 SOME RESULTS

4.4.1 The influence of α
Table 4.4 shows the experiments used to determine the value of α in this thesis. Based on those results, increasing α reduces the MAE but increases the MSE. Therefore, to obtain the best overall result, I fixed α at the setting of the first row of Table 4.4 (MAE 58.2, MSE 88.94) for the whole training process.

Table 4.4: The influence of α on the ShanghaiTech Part A dataset

  α      MAE      MSE
         58.2     88.94
         57.58    96.45
  10     57.5     99.07

4.4.2 Comparison with other methods
In this section, the proposed method is compared with the other methods listed in Table 4.5. All of the compared methods are trained on Part A and Part B of the ShanghaiTech dataset. Based on the results in Table 4.5, the proposed method outperforms most of the remaining methods, especially on ShanghaiTech Part B.

Here, two methods are selected for a closer comparison. The first is a direct method, i.e., one that uses only a single network architecture trained to estimate the crowd density map. For this group, DSNet is chosen. Compared with DSNet, the proposed method improves MAE by 5.67% and MSE by 13.32% on ShanghaiTech Part A, and MAE by 45.82% and MSE by 24.48% on ShanghaiTech Part B. The second is an indirect method, i.e., one that adds an auxiliary task during training to improve the accuracy of the estimated crowd density map. Here the proposed method is compared with DANet, which uses depth map estimation as its auxiliary task. Compared with DANet, the proposed method achieves 18.49% lower MAE and 26.25% lower MSE on ShanghaiTech Part A, and 60.11% lower MAE and 46.05% lower MSE on ShanghaiTech Part B.

Finally, these results show that the proposed method is effective for crowd density map estimation. Moreover, by combining the direct and the indirect approaches, the proposed method handles the variation in the scale of people within an image.

Table 4.5: Comparison of several density map estimation methods

  No.  Method                  Part A MAE   Part A MSE   Part B MAE   Part B MSE
  1    DSSINet [6]             60.63        96.04        6.85         10.34
  2    BL [23]                 62.8         101.8        7.7          12.7
  3    PGCNet [4]              57.0         86.0         8.8          13.7
  4    DANet [3]               71.4         120.6        9.1          14.7
  5    DSNet [1]               61.7         102.6        6.7          10.5
  6    Proposed (this thesis)  58.2         88.94        3.63         7.93
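The relative improvements quoted above can be recomputed directly from the MAE/MSE values in Table 4.5. The short Python check below is illustrative only and is not part of the thesis code; the helper name is hypothetical (the 13.31% obtained here versus the 13.32% in the text is only a rounding difference).

def improvement(baseline, proposed):
    # relative improvement in percent: how much lower the proposed error is
    return 100.0 * (baseline - proposed) / baseline

# versus DSNet (direct method), ShanghaiTech Part A and Part B
print(round(improvement(61.7, 58.2), 2))     # 5.67   (% MAE, Part A)
print(round(improvement(102.6, 88.94), 2))   # 13.31  (% MSE, Part A)
print(round(improvement(6.7, 3.63), 2))      # 45.82  (% MAE, Part B)
print(round(improvement(10.5, 7.93), 2))     # 24.48  (% MSE, Part B)

# versus DANet (indirect method)
print(round(improvement(71.4, 58.2), 2))     # 18.49  (% MAE, Part A)
print(round(improvement(120.6, 88.94), 2))   # 26.25  (% MSE, Part A)
print(round(improvement(9.1, 3.63), 2))      # 60.11  (% MAE, Part B)
print(round(improvement(14.7, 7.93), 2))     # 46.05  (% MSE, Part B)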
Figure 4.7 shows some prediction results of the proposed method. The first row contains crowd images with different scale ranges, from a few tens to a few thousand people. The second row shows the ground truth maps and the last row shows the predicted results. These results demonstrate the counting accuracy of this thesis under different scale variations. Through these examples, the proposed method handles the different variations in the scale of people within an image well and accurately.

Figure 4.7: Some prediction results of the proposed method

4.4.3 Evaluation of transfer learning
The purpose of this experiment is to demonstrate the generality of the trained model, using transfer learning. The training set is ShanghaiTech Part A and the testing set is UCF_CC_50 [24]. In this experiment, the model is not retrained on the UCF_CC_50 data. DSNet and the method of this thesis are compared. As shown in Table 4.6, indirectly embedding more scale information into the features improves their generality: the performance is improved by about 20% on both MAE and MSE.

Table 4.6: Results with transfer learning

  Method                  MAE      MSE
  DSNet [1]               503.46   733.25
  Proposed (this thesis)  402.12   586.52

Chapter 5: CONCLUSION AND FUTURE DIRECTIONS

5.1 CONCLUSION
This thesis proposed a method that tackles the variation in the scale of people within an image for the crowd counting task in both a direct and an indirect way. First, dense scale information is learned directly through a network designed with dense dilated convolution blocks (DDCB) and dense residual connections (DRC) between the blocks. Then, information about the scale of people in the image is further added to the features indirectly by introducing an auxiliary depth map estimation task. By doing these two things simultaneously, the proposed method outperforms the other methods tested on the ShanghaiTech dataset. In addition, the proposed method was published at the International Conference on System Science and Engineering (ICSSE) 2021; the full paper is reproduced in the appendix.

5.2 FUTURE DIRECTIONS
1) Experiment with additional auxiliary tasks related to scale variation, namely:
- perspective map estimation,
- crowd density level classification.
2) Experiment on more crowd datasets to verify the effectiveness of the proposed method.

REFERENCES
[1] F. Dai, H. Liu, Y. Ma, J. Cao, Q. Zhao, and Y. Zhang, "Dense Scale Network for Crowd Counting," arXiv preprint arXiv:1906.09707, 2019.
[2] G. Gao, J. Gao, Q. Liu, Q. Wang, and Y. Wang, "CNN-based Density Estimation and Crowd Counting: A Survey," arXiv preprint arXiv:2003.12783, 2020.
[3] V. Huynh, V. Tran, and C. Huang, "DAnet: Depth-Aware Network for Crowd Counting," 2019 IEEE International Conference on Image Processing (ICIP), pp. 3001-3005, 2019.
[4] Z. Yan et al., "Perspective-Guided Convolution Networks for Crowd Counting," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 952-961, 2019.
[5] V. Huynh, V. Tran, and C. Huang, "IUML: Inception U-Net Based Multi-Task Learning for Density Level Classification and Crowd Density Estimation," 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 3019-3024, 2019.
[6] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin, "Crowd Counting With Deep Structured Scale Integration Network," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[7] X. Cao, Z. Wang, Y. Zhao, and F. Su, "Scale aggregation network for accurate and efficient crowd counting," in ECCV, pp. 734-750, 2018.
[8] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-Image Crowd Counting via Multi-Column Convolutional Neural Network," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] D. Sam, S. Surya, and R. Babu, "Switching convolutional neural network for crowd counting," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 3, 2017.
[10] V. A. Sindagi and V. M. Patel, "CNN-Based cascaded multi-task learning of high-level prior and density estimation for crowd counting," 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6, 2017.
[11] V. A. Sindagi and V. M. Patel, "Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs," 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[12] X. Liu, J. van de Weijer, and A. D. Bagdanov, "Leveraging Unlabeled Data for Crowd Counting by Learning to Rank," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[13] Y. Zhang and Q. Yang, "An overview of multi-task learning," National Science Review, pp. 30-43, 2018, doi: 10.1093/nsr/nwx105.
[14] S. Ruder, "An Overview of Multi-Task Learning in Deep Neural Networks," 2017.
[15] Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes," pp. 1091-1100, 2018, doi: 10.1109/CVPR.2018.00120.
[16] B. Fang, Y. Li, H. Zhang, and J. Chan, "Hyperspectral Images Classification Based on Dense Convolutional Networks with Spectral-Wise Attention Mechanism," Remote Sensing, vol. 11, 159, 2019, doi: 10.3390/rs11020159.
[17] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[18] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity Invariant CNNs," 2017 International Conference on 3D Vision (3DV), 2017.
[19] A. Levin, D. Lischinski, and Y. Weiss, "Colorization using Optimization," ACM Transactions on Graphics, 2004.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library," Advances in Neural Information Processing Systems 32, 2019.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009, doi: 10.1109/CVPR.2009.5206848.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] Z. Ma, X. Wei, X. Hong, and Y. Gong, "Bayesian Loss for Crowd Count Estimation With Point Supervision," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[24] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source Multi-scale Counting in Extremely Dense Crowd Images," 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.

APPENDIX

Depth Embedded and Dense Dilated Convolutional Network for Crowd Density Estimation

Minh-Nghia Nguyen
Faculty of Electrical and Electronics Engineering, HCMC University of Education and Technology, Ho Chi Minh City, Viet Nam
minhnghia1996@gmail.com

Vu-Hoang Tran
Faculty of Electrical and Electronics Engineering, HCMC University of Education and Technology, Ho Chi Minh City, Viet Nam
hoangtv@hcmute.edu.vn

Ton-Nghia Huynh
Academic Affairs Office, HCMC University of Education and Technology, Ho Chi Minh City, Viet Nam
nghiaht@hcmute.edu.vn

Abstract—In recent years, due to the rapid growth of the urban population, the management of public security has become extremely necessary. Therefore, accurate crowd counting and density distribution estimation play an important role in many situations, especially during the Covid-19 pandemic which has been spreading around the world. Although many studies have been proposed, it remains a challenging
task because of the vivid intra-scene scale variations of people caused by depth effects. In this paper, we propose a novel unified system that allows the scale variation problem to be solved both directly and indirectly. To allow the network to have an understanding of depth when estimating crowd density, we first propose to embed this information into the crowd density estimation network indirectly, through the training process, by means of multi-task learning. Our network is thus designed to solve not only the main task of estimating crowd density, but also a side task: depth estimation. Besides, to learn the large-scale features directly, dense dilated convolution blocks are used in our encoder. The experimental results demonstrate that by using both such direct and indirect methods, we can boost the performance and achieve good results compared to existing methods. Besides, with the multi-task design, we can completely cut off the unnecessary branches of the network related to the side task to speed up computation during the testing phase.

Keywords—Crowd counting, Deep learning, Multi-task learning, Crowd density estimation, Depth estimation

I. INTRODUCTION

The Covid-19 pandemic has been spreading all over the world and taking many people's lives today. Most countries strictly implement anti-epidemic measures such as social distancing rules to keep people separated by a safe distance. However, in many countries, people still congregate at traditional festivals, sporting events, music festivals or family parties, which makes pandemic control difficult. Therefore, to make public spaces safer, crowd analysis is critical. One of the most basic tasks for crowd analysis is crowd counting. Many studies related to this task have been proposed in recent years, but they can be divided into two main directions: detection-based and regression-based [1]. In the first approach, the quantity of people in the crowd is estimated by detecting their bodies or heads, but this approach would be difficult to deploy in dense crowd images due to serious occlusions. Therefore, the second approach, which first estimates the crowd density map of a given image and then sums all of its values to give the quantity of people in the crowd, was proposed.

With the success of deep learning technology, especially convolutional neural networks, crowd density maps have been increasingly accurately estimated. However, crowd counting is still a challenging task due to large scale variation, occlusion, background noise and perspective distortion [1, 2]. Among them, scale variation, as shown in Fig. 1, is the main issue affecting the quality of crowd counting and has attracted the attention of many recent studies [1, 3, 4, 5]. The traditional way to solve this issue is to apply multi-scale [6, 7] or multi-column [8, 9] structures to combine information from different scales. However, because these methods are designed with discrete scales and columns, they still have limitations when solving continuous scale variations in real scenarios. Besides, when increasing the number of columns or scales to solve the scale variation problem, the computational cost also increases, causing a waste of resources in terms of processing time as well as the required hardware.

Fig. 1. Some images about scale variation.

To solve this problem, Feng Dai et al. proposed DSNet [1] for crowd counting. In this research, to output features that capture different scales densely, dense dilated convolution blocks (DDCB) were proposed. By combining different dilation rates within a block and stacking multiple blocks together with residual connections, DSNet can achieve a large scale range and thus can deal with the large scale variation problem in crowd counting. Also to address continuous scale variations, PGCNet, which uses perspective information to allocate spatially variant receptive fields adaptively, was proposed in [4]. However, this network is also quite cumbersome and takes a lot of time to process because it needs an extra network to handle the perspective estimation task. Besides, this network might achieve very poor results if trained in an end-to-end manner.

Another recently used approach for solving the scale variation problem is multi-task learning. With this approach, in addition to the main task of estimating crowd density maps, the main network is also trained to simultaneously handle other side tasks to embed the necessary information about scale variation. One benefit of this approach is that multiple tasks can be trained concurrently in an end-to-end manner; besides, the side tasks can be easily removed from the network to reduce computation cost in the test phase. The side tasks were selected differently in different studies. In [5, 10, 11], the authors added a density level classification task to the main model to improve crowd counting quality. In [12], besides solving the scale variation problem, the training data is also increased to boost performance by adding a density ranking task. However, in these methods, to train multiple tasks simultaneously, datasets containing all kinds of labels for these tasks are required. Preparing such datasets is both time consuming and laborious. Therefore, DAnet [3], which uses separate and individual datasets in training, was proposed. In this work, by embedding depth information, which is related to scale variation, from the depth estimation dataset into the crowd counting network during training, the counting performance is improved.

As a result, we propose to solve this scale variation problem in this paper in both direct and indirect ways simultaneously: (1) learning dense-scale features directly by applying dense dilated convolution blocks in the encoder; (2) embedding scale information indirectly into the network by training the encoder simultaneously with a depth estimation dataset. Direct learning helps to enrich our features, while indirect learning through another related dataset not only makes use of relevant information but also solves the lack-of-data problem. Besides, by removing the side task during the testing phase, the computation cost is not increased at all. The results of the experiments show that by learning in both direct and indirect ways simultaneously, our method can solve the scale variation problem efficiently and outperform state-of-the-art methods.
Fig. 2. The illustration of our proposed method. (a) The multi-task framework for estimating the crowd density map and the depth map; (b) the details of the encoder and decoder block.

II. PROPOSED METHOD

As shown in Fig. 2, our framework consists of two branches corresponding to two tasks: crowd density map estimation and depth map estimation. The two branches have an identical structure including two parts: encoder and decoder. The top branch is used to learn dense-scale features for solving crowd density map estimation directly. The bottom branch is used to provide an understanding of scale variation to the network indirectly through training with the depth estimation task. To integrate depth information into density map estimation, we share the encoder's weights between the two branches. The details of the proposed network are explained in the following subsections.

A. Encoder

As aforementioned, in this paper we would like to directly obtain multi-scale features in a single-column network, so we inherit the idea of DSNet [1] for designing the encoder. The structure of our encoder is given in Fig. 2b; there are three special things that need to be paid attention to in this structure:

- The first ten layers of VGG-16 [13] are used as the backbone network. The idea behind it is: for a faster algorithm, it is recommended to use more convolutional layers with small kernels instead of fewer layers with larger kernels.
- To capture a large scale range as densely as possible, three dilated convolutional layers with increasing dilation rates of 1, 2, and 3 are densely connected together to form dense dilated convolution blocks (DDCB), as shown in the middle of Fig. 2b. With this kind of setting, the acquired scale diversity is increased, which can handle the large scale variation problem directly.
- To further improve the information flow, the output of each DDCB is densely connected to each layer of the later DDCBs, as shown in the top of Fig. 2b. With this dense residual connection, the scale range is further enlarged. Besides, for a specific image, the suitable scale features will be preserved adaptively through the residual connections.
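A minimal PyTorch sketch of this encoder design is given below. It is an illustration rather than the released implementation: the growth width, the 1x1 projection back to the input width, and the simplified block-to-block residual link are assumptions made here for a self-contained example.

import torch
import torch.nn as nn
from torchvision.models import vgg16

class DDCB(nn.Module):
    """One dense dilated convolution block: 3x3 convolutions with dilation 1, 2, 3,
    densely connected inside the block."""
    def __init__(self, in_ch, growth=64):
        super().__init__()
        # each branch sees the block input concatenated with all previous branch outputs
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, growth, 3, padding=1, dilation=1), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_ch + growth, growth, 3, padding=2, dilation=2), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_ch + 2 * growth, growth, 3, padding=3, dilation=3), nn.ReLU(inplace=True))
        # 1x1 projection back to the input width so blocks can be stacked (assumed choice)
        self.project = nn.Conv2d(in_ch + 3 * growth, in_ch, 1)

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(torch.cat([x, y1], dim=1))
        y3 = self.conv3(torch.cat([x, y1, y2], dim=1))
        return self.project(torch.cat([x, y1, y2, y3], dim=1))

class Encoder(nn.Module):
    """VGG-16 front end followed by stacked DDCBs with residual links between blocks."""
    def __init__(self, num_blocks=3, channels=512):
        super().__init__()
        # first ten convolutional layers of VGG-16 (up to conv4_3); weights=None here,
        # a backbone pre-trained on ImageNet would be used in practice
        self.backbone = nn.Sequential(*list(vgg16(weights=None).features.children())[:23])
        self.blocks = nn.ModuleList([DDCB(channels) for _ in range(num_blocks)])

    def forward(self, x):
        f = self.backbone(x)
        for block in self.blocks:
            f = f + block(f)          # simplified residual connection between blocks
        return f

if __name__ == "__main__":
    feats = Encoder()(torch.randn(1, 3, 256, 256))
    print(feats.shape)                # torch.Size([1, 512, 32, 32])

Sharing an encoder of this kind between the density and depth branches is what allows the depth task to shape the density features during training, as described above.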
B. Decoder

There are two independent decoders corresponding to the two branches: density map estimation and depth map estimation. Each decoder consists of three layers: two convolutional layers with a kernel size of 3x3 and a convolutional layer with a kernel size of 1x1, as shown in Fig. 2b.

C. Loss Function

In this paper, we use the Euclidean distance to measure the difference between the estimated map and the ground truth. The loss function is defined as follows:

L_D = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{D}_i - D_i \rVert,   (1)

where N is the number of images in the training batch, and D_i and \hat{D}_i are the ground truth map and the estimation from the network, respectively, of the i-th image in the batch.

Based on [1], we also use a multi-scale density level consistency loss to ensure the consistency of both global and local density levels between the estimated density map and the ground truth. The loss function is defined as follows:

L_C = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{S} \lVert P_{ave}(\hat{D}_i, k_j) - P_{ave}(D_i, k_j) \rVert_1,   (2)

where S is the number of scale levels for consistency checking, which separates the density map into different sub-regions, P_{ave} is the average pooling operation applied in each separated sub-region, and k_j is the specified output size of the average pooling.

The total loss for the density map estimation branch is calculated as follows:

L_{den} = L_D + \lambda L_C,   (3)

where λ is the weight to balance the pixel-wise and density level consistency losses. To summarize, the whole network is optimized and trained by using the following overall objective function:

L = \alpha L_{den} + L_{depth},   (4)

where L_{den} and L_{depth} are the losses for the density map estimation and the depth map estimation tasks respectively, and α is the weight to balance these two losses. L_{depth} is simply calculated by (1).
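The following PyTorch sketch shows one way to implement losses (1)-(4). It is an illustration, not the paper's code: the function names, the pooling output sizes (1, 2, 4) used for k_j, and the exact normalization of (2) are assumptions.

import torch
import torch.nn.functional as F

def pixel_loss(pred, gt):
    # Eq. (1): mean per-image Euclidean distance between maps of shape (N, 1, H, W)
    return torch.norm((pred - gt).flatten(1), dim=1).mean()

def consistency_loss(pred, gt, sizes=(1, 2, 4)):
    # Eq. (2): L1 difference of average-pooled maps at S output sizes k_j
    loss = 0.0
    for k in sizes:
        diff = F.adaptive_avg_pool2d(pred, k) - F.adaptive_avg_pool2d(gt, k)
        loss = loss + diff.abs().flatten(1).sum(dim=1).mean()
    return loss

def total_loss(pred_den, gt_den, pred_depth, gt_depth, lam=1000.0, alpha=1.0):
    # Eq. (3): density branch loss; Eq. (4): overall multi-task objective.
    # lam = 1000 follows the implementation details below; alpha is the balance
    # weight tuned in the "Influence of alpha" experiment (placeholder default here).
    l_den = pixel_loss(pred_den, gt_den) + lam * consistency_loss(pred_den, gt_den)
    l_depth = pixel_loss(pred_depth, gt_depth)
    return alpha * l_den + l_depth

# illustrative call with random maps
den = torch.rand(2, 1, 64, 64)
print(total_loss(den, torch.rand_like(den), den, torch.rand_like(den)).item())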
III. EXPERIMENTAL RESULTS

In our paper, we use the ShanghaiTech dataset for the crowd density estimation task and KITTI depth completion for the depth map estimation task. The details of these datasets are given in the following sub-sections. To verify the performance of the proposed method, we use two conventional evaluation metrics: the mean absolute error (MAE) and the mean square error (MSE), given in Eq. (5) and (6):

MAE = \frac{1}{N} \sum_{i=1}^{N} |C_i - \hat{C}_i|,   (5)

MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (C_i - \hat{C}_i)^2},   (6)

where N is the total number of testing images, and C_i and \hat{C}_i are the ground truth count and estimated count of the i-th image respectively. The estimated count \hat{C}_i is obtained by summing all the values in the i-th estimated density map.

A. Datasets

ShanghaiTech [8]: The ShanghaiTech dataset is divided into two parts: Part A and Part B. Part A includes 482 highly congested images obtained from the Internet; in this part the counts vary from 33 to 3139. Part B includes 716 images captured from fixed cameras on the streets of Shanghai, and the counts vary from 12 to 578 in this part. Both of them are divided into training and testing sets; the details are given in Table I.

KITTI depth completion [14]: There are 85,898 training samples in the dataset. However, we choose only 20,540 images when training with ShanghaiTech Part A and 41,600 images when training with ShanghaiTech Part B, to balance the training samples for the two tasks. We also use the algorithm proposed in [15] to smooth the sparse depth maps for better training.

Because the amount of training data in these datasets is too small, we apply some data augmentation techniques such as random cropping, flipping, blurring, and noise adding. The amount of data after augmentation is also given in Table I. The ground truth is created by the method described in [8] so that the crowd counts can be yielded by summing all the values in the density map.

TABLE I. DATASETS USED IN OUR EXPERIMENTS

  Dataset                  #Training samples (before/after augmentation)   #Testing samples
  ShanghaiTech Part A      300 / 20,540                                    182
  ShanghaiTech Part B      400 / 41,600                                    316
  KITTI depth completion   20,540                                          -

B. Implementation details

Our network is trained end-to-end using the PyTorch library [16]. In order to initialize our network, we use the model pre-trained on ImageNet to initialize the VGG-16 backbone. For the other layers, Gaussian initialization with zero mean and a standard deviation of 0.01 is used. The Adam [17] optimizer is used with a fixed learning rate of 1e-6, weight decay of 5e-4 and a batch size of one. In our experiments, we set λ equal to 1000. These parameters are set based on the suggestions in [1].
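As a concrete illustration of these implementation details (not the authors' code; the tiny stand-in model and the tensor shapes are placeholders), the initialization and optimizer described above could be set up as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-in for the two-branch network (the real model shares a VGG-16/DDCB encoder)
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 1, 1))

def init_weights(m):
    # Gaussian initialization (zero mean, std 0.01) for layers not taken from ImageNet
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_weights)

# Adam with the fixed learning rate, weight decay and batch size given in the text
optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=5e-4)

# one illustrative optimization step with a batch size of one
image = torch.randn(1, 3, 256, 256)
target = torch.rand(1, 1, 256, 256)
loss = F.mse_loss(model(image), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()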
C. Influence of α

Table II shows the experiments we conducted to determine α. Based on those results, we found that increasing the value of α reduces the MAE but increases the MSE. In order to have a good trade-off between MAE and MSE, we decided to use the first α setting of Table II (the row with MAE 58.2 and MSE 88.94) in our training system.

TABLE II. INFLUENCE OF α ON SHANGHAITECH PART A

  α      MAE      MSE
         58.2     88.94
         57.58    96.45
  10     57.5     99.07

D. Comparison with State-of-the-Art

The results, as shown in Table III, show that our network outperforms the other methods in Part B and has comparable performance to the state-of-the-art method PGCNet in Part A. Compared to using a direct method only (DSNet), by introducing more scale information indirectly through another task, our method obtains 5.67% MAE and 13.32% MSE improvements in Part A, and 45.82% MAE and 24.48% MSE improvements in Part B. Compared to using the indirect method only (DANet), our method delivers 18.49% lower MAE and 26.25% lower MSE in Part A; in Part B we get a huge improvement with 60.11% lower MAE and 46.05% lower MSE. These numbers show the effectiveness of simultaneously solving the scale problem in both direct and indirect ways. To be more intuitive, several examples with different scale variations and our estimations are presented in Fig. 3. These examples once again confirm the assumptions above about solving the scale diversity in crowd counting problems: whatever the scale variation, our method offers accurate crowd counting.

TABLE III. COMPARISON WITH STATE-OF-THE-ART

  Method        Part A MAE   Part A MSE   Part B MAE   Part B MSE
  DSSINet [6]   60.63        96.04        6.85         10.34
  BL [18]       62.8         101.8        7.7          12.7
  PGCNet [4]    57.0         86.0         8.8          13.7
  DANet [3]     71.4         120.6        9.1          14.7
  DSNet [1]     61.7         102.6        6.7          10.5
  Our proposed  58.2         88.94        3.63         7.93

Fig. 3. A demonstration of estimated crowd density maps and crowd counts generated by the proposed method. The first row shows four examples with different scale ranges, increasing from several tens to several thousand. The second row shows the ground truth maps, while the bottom row shows our corresponding estimations. The estimated counts point out that our method gives accurate crowd counts under different scale variations.

E. Evaluation of transfer learning

We use the transfer learning setting to demonstrate the generalizability of the learned model. The training set (source) is the ShanghaiTech Part A dataset, and the testing set (target) is the UCF_CC_50 dataset [19], which contains only 50 grayscale images. In this experiment, we do not fine-tune the network on the target dataset. The DSNet model and our proposed model are compared. Table IV shows that by embedding more scale information indirectly into the features, we can improve their generality. The evidence is that the performance is improved by about 20% on both MAE and MSE.

TABLE IV. TRANSFER LEARNING TEST ACROSS DATASETS

  Method        MAE      MSE
  DSNet [1]     503.46   733.25
  Our proposed  402.12   586.52

IV. CONCLUSION

A novel way, which tackles the scale variation problem for crowd counting tasks in both direct and indirect ways, was proposed in this paper. First, the dense scale information is learned directly through the main network, which is designed with dense dilated convolution blocks and dense residual connections among the blocks. Then, the scale information is further crammed into the features indirectly through learning depth information from an auxiliary depth dataset. By doing these two things simultaneously, the proposed approach outperforms state-of-the-art methods in experiments on the ShanghaiTech dataset.

REFERENCES

[1] F. Dai, H. Liu, Y. Ma, J. Cao, Q. Zhao, and Y. Zhang, "Dense Scale Network for Crowd Counting," arXiv preprint arXiv:1906.09707, 2019.
[2] G. Gao, J. Gao, Q. Liu, Q. Wang, and Y. Wang, "CNN-based Density Estimation and Crowd Counting: A Survey," arXiv preprint arXiv:2003.12783, 2020.
[3] V. Huynh, V. Tran, and C. Huang, "DAnet: Depth-Aware Network for Crowd Counting," 2019 IEEE International Conference on Image Processing (ICIP), pp. 3001-3005, 2019.
[4] Z. Yan et al., "Perspective-Guided Convolution Networks for Crowd Counting," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 952-961, 2019.
[5] V. Huynh, V. Tran, and C. Huang, "IUML: Inception U-Net Based Multi-Task Learning for Density Level Classification and Crowd Density Estimation," 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 3019-3024, 2019.
[6] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin, "Crowd Counting With Deep Structured Scale Integration Network," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[7] X. Cao, Z. Wang, Y. Zhao, and F. Su, "Scale aggregation network for accurate and efficient crowd counting," in ECCV, pp. 734-750, 2018.
[8] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-Image Crowd Counting via Multi-Column Convolutional Neural Network," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] D. Sam, S. Surya, and R. Babu, "Switching convolutional neural network for crowd counting," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 3, 2017.
[10] V. A. Sindagi and V. M. Patel, "CNN-Based cascaded multi-task learning of high-level prior and density estimation for crowd counting," 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6, 2017.
[11] V. A. Sindagi and V. M. Patel, "Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs," 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[12] X. Liu, J. van de Weijer, and A. D. Bagdanov, "Leveraging Unlabeled Data for Crowd Counting by Learning to Rank," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
[13] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[14] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, "Sparsity Invariant CNNs," 2017 International Conference on 3D Vision (3DV), 2017.
[15] A. Levin, D. Lischinski, and Y. Weiss, "Colorization using Optimization," ACM Transactions on Graphics, 2004.
[16] A. Paszke, S. Gross, S. Chintala, G. Chanan, et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library," Advances in Neural Information Processing Systems 32, 2019.
[17] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] Z. Ma, X. Wei, X. Hong, and Y. Gong, "Bayesian Loss for Crowd Count Estimation With Point Supervision," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[19] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source Multi-scale Counting in Extremely Dense Crowd Images," 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.