People Counting Using Detection and Tracking Techniques for Smart Video Surveillance

Ha Thi Oanh
Ha Noi University of Science and Technology
Supervisor: Assoc. Prof. Tran Thi Thanh Hai

In partial fulfillment of the requirements for the degree of Master of Computer Science
April 20, 2023

Acknowledgements

First of all, I would like to express my gratitude to my primary advisor, Assoc. Prof. Tran Thi Thanh Hai, who guided me throughout this project. I would like to thank Assoc. Prof. Le Thi Lan and Assoc. Prof. Vu Hai for giving me deep insight, valuable recommendations and brilliant ideas. I am grateful for my time spent at the MICA International Research Institute, where I learnt a lot about research and enjoyed a very warm and friendly working atmosphere. In particular, I wish to extend my special thanks to Dr. Doan Thi Huong Giang, who directly supported me. This master's thesis was carried out within the framework of the ministerial-level scientific research project "Research and development of an automatic system for assessing learning activities in class based on image processing technology and artificial intelligence", code CT2020.02.BKA.02, led by Assoc. Prof. Dr. Le Thi Lan, for whose support the author is sincerely grateful. Finally, I wish to show my appreciation to all my friends and family members who helped me finalize the project.

Abstract

Real-time people counting from video or images has multiple applications in intelligent transportation, density estimation, class management, and so on. Although this problem has been widely studied, it still faces major challenges due to crowded scenes and occlusion. In a common approach, the problem is addressed by detecting people with conventional detectors. However, this approach can fail when people appear in various postures or occlude each other. We notice that even when a major part of the human body is occluded, the face and head usually remain observable. In addition, a person who is missed in one frame may still be detected in the previous or following frames. In this thesis, we attempt to improve people counting based on these observations. We first deploy two detectors (Yolo and RetinaFace) to detect the heads and faces of people in the scene. We then develop a pairing technique that aligns the face and the head of each person. This alignment helps to recover missed detections of heads or faces and thus increases the true positive rate. To overcome the case where both the face and the head are missed in a certain frame, we apply a tracking technique (i.e., SORT) to the combined detection result. Putting all of these techniques into a unified framework increases the true positive rate from 90.36% to 96.21% on the ClassHead Part dataset.

Contents

List of Acronyms

1 Introduction
1.1 Introduction to people counting
1.2 Scientific and practical significance
1.2.1 Scientific significance
1.2.2 Practical significance
1.2.3 Challenges and Motivation
1.3 Objectives and Contributions
1.3.1 Objectives
1.3.2 Contributions
1.4 Thesis outline

2 Related works
2.1 Detection based people counting
2.1.1 Face detection based people counting
2.1.2 Head detection based people counting
2.1.3 Hybrid detection based people counting
2.2 Density estimation based people counting
2.3 People tracking
2.3.1 Overview of object tracking
2.3.2 Multiple Object Tracking
2.3.3 Tracking techniques
2.3.3.1 Kalman filter
2.3.3.2 SORT
2.3.3.3 DeepSORT
2.3.4 Tracking-based people counting
2.4 Conclusion of the chapter

3 Proposed method for people counting
3.1 The proposed people counting framework
3.2 Yolo-based head detection
3.2.1 Yolo revisit
3.2.2 Yolov5
3.2.3 Implementation of Yolov5 for head detection
3.3 RetinaFace based face detection
3.3.1 RetinaFace architecture
3.3.2 Implementation of RetinaFace for face detection
3.4 Combination of head and face detection
3.4.1 Linear sum assignment problem
3.4.2 Head-face pairing cost
3.5 Person tracking
3.6 Conclusion

4 Experiments
4.1 Dataset and Evaluation Metrics
4.1.1 Our collected dataset: ClassHead
4.1.1.1 ClassHead Part
4.1.1.2 ClassHead Part
4.1.2 Hollywood Heads dataset
4.1.3 Casablanca dataset
4.1.4 Wider Face dataset
4.1.5 Evaluation metrics
4.1.5.1 Intersection over Union (IoU)
4.1.5.2 Precision and Recall
4.1.5.3 F1-score
4.1.5.4 AP and mAP
4.1.5.5 Mean Absolute Error
4.2 Experimental Results
4.2.1 Evaluation on Hollywood dataset
4.2.2 Evaluation on Casablanca dataset
4.2.3 Evaluation on Wider Face dataset
4.2.4 Evaluation on ClassHead Part dataset

5 Conclusions
5.1 Conclusion
5.2 Future Works

References

List of Figures

1.1 Illustration of the input and output of people counting from an image
1.2 Some challenges in crowd counting [1]
2.1 Framework for people counting based on face detection and tracking in a video [2]
2.2 System framework for depth-assisted face detection and association for people counting
2.3 System framework for a people counting method based on head detection and tracking
2.4 Network structure of Double Anchor R-CNN
2.5 Architecture of JointDet
2.6 Examples of people density estimation
2.7 Example of Multiple Object Tracking
2.8 Hungarian Algorithm
2.9 The tracking process of the SORT algorithm
2.10 Architecture of the proposed people counting and tracking system
2.11 Flow architecture of the proposed smart surveillance system
3.1 The proposed framework for people counting by pairing head and face detection and tracking
3.2 Output of the Yolo network [3]
3.3 Yolov5 architecture [4]
3.4 Spatial Pyramid Pooling
3.5 Path Aggregation Network
3.6 Automatic learning of bounding box anchors [4]
3.7 Activation functions used in Yolov5: (a) SiLU function, (b) Sigmoid function [4]
3.8 Example of creating dataset.yaml
3.9 Overview of the single-stage dense face localisation approach: RetinaFace is designed on feature pyramids with independent context modules, after which a multi-task loss is calculated for each anchor
3.10 Organization of the dataset for Yolo training
3.11 Example of RetinaFace testing on the Wider Face dataset
3.12 Flowchart of combining object detection and tracking to improve the true positive rate
4.1 Camera layout in the simulated classroom and an image obtained from each camera view
4.2 Illustration of the LabelMe interface and main operations to annotate an image
4.3 Illustration of images taken from five camera views in the ClassHead Part dataset: (a) View 1, (b) View 2, (c) View 3, (d) View 4 and (e) View 5
4.4 Some example images of the ClassHead Part dataset: view ch03 (a), view ch04 (b), view ch05 (c), view ch12 (d) and view ch13 (e)
4.5 Some example images of the Hollywood Heads dataset (first row), the Casablanca dataset (second row), the Wider Face dataset (third row), and the ClassHead Part of our dataset (last row)
4.6 Calculating IoU
4.7 Precision and Recall metrics
4.8 MAE measurement results of the proposed methods on the Hollywood Heads dataset
4.9 Results on the Hollywood Heads dataset: (a) head detection; (b) face detection; (c) matching head and face detections using the Hungarian algorithm. Heads are denoted in green, faces in yellow, missed ground truths in red, and head-face pairings in cyan
4.10 MAE measurement results of the proposed methods on the Casablanca dataset
4.11 Results on the Casablanca dataset: (a) head detection; (b) face detection; (c) matching head and face detections using the Hungarian algorithm. Heads are denoted in green, faces in yellow, missed ground truths in red, and head-face pairings in cyan
4.12 Results on the Wider Face dataset: (a) head detection; (b) face detection; (c) matching head and face detections using the Hungarian algorithm. Heads are denoted in green, faces in yellow, missed ground truths in red, and head-face pairings in cyan
4.13 MultiDetect results on ClassHead Part: (a) head detections, (b) face detections, (c) MultiDetect
4.14 Head tracking results on the ClassHead Part dataset: (a) head detections at frame 1, (b) head tracking at frame 100
4.15 MultiDetect with Track results on the ClassHead Part dataset: (a) MultiDetect with Track at frame 1, (b) MultiDetect with Track at frame 100
4.16 MAE measurement results of the proposed methods on the ClassHead Part dataset

List of Tables

4.1 Camera setup parameters for data collection
4.2 ClassHead Part dataset for training and testing the Yolov5 head detector
4.3 ClassHead Part dataset
4.4 Results of the proposed method on the Hollywood Heads dataset
4.5 Results of the proposed method on the Casablanca dataset
4.6 Results of the proposed method on the Wider Face dataset
4.7 Results of the head detection method on the ClassHead Part dataset
4.8 Results of the MultiDetect method on the ClassHead Part dataset
4.9 Results of Head Tracking on the ClassHead Part dataset
4.10 Results of the MultiDetect with Track method on the ClassHead Part dataset
4.11 Experimental results on the ClassHead Part dataset for all methods

4.2 Experimental Results

Figure 4.12: Results on the Wider Face dataset. (a) Results of head detection; (b) results of face detection; (c) matching head and face detections using the Hungarian algorithm. Heads are denoted in green, faces in yellow, missed ground truths in red, and head-face pairings in cyan.

Table 4.7: Results of the head detection method on the ClassHead Part dataset.

View   Precision (%)   Recall (%)   F1-score (%)   AP (%)
ch03   71.43           76.00        73.64          66.22
ch04   95.51           96.58        96.04          95.58
ch05   91.64           89.40        90.51          88.62
ch12   88.97           95.41        92.08          87.65
ch13   95.08           94.40        94.74          93.97

The results of the combined head and face detection (MultiDetect) method are described in detail for the five views from ch03 to ch13. As Tab. 4.8 shows, the method increases Precision by 9.71%, 3.01%, 0.83% and 0.46% for views ch03, ch04, ch05 and ch12 respectively, compared with the head detection method. With this method, Recall also increases by 12.6%, 3.09%, 1.04%, 1.05% and 1.68% for views ch03, ch04, ch05, ch12 and ch13 respectively. We illustrate the results of the method in Fig. 4.13. In Fig. 4.13(a), the green bounding boxes show the results of head detection using Yolov5, while in Fig. 4.13(b) the yellow bounding boxes represent the results of face detection. Finally, Fig. 4.13(c) shows the combined result of head and face detections, which includes both green and yellow bounding boxes. We treat face detections that have no paired head as heads missed by the Yolov5 detector.

Table 4.8: Results of the MultiDetect method on the ClassHead Part dataset.

View   Precision (%)   Recall (%)   F1-score (%)   AP (%)
ch03   81.14           88.60        84.71          81.91
ch04   98.52           99.67        99.09          99.56
ch05   92.47           90.44        91.44          89.90
ch12   89.43           96.46        92.81          88.73
ch13   92.85           96.08        94.44          95.59

Figure 4.13: MultiDetect results on ClassHead Part. (a) Head detections, (b) face detections, (c) MultiDetect.
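In implementation terms, the MultiDetect pairing step solves a linear sum assignment problem between head and face boxes (Sections 3.4.1 and 3.4.2). The following is a minimal sketch of that idea, assuming boxes in [x1, y1, x2, y2] form and a pairing cost of one minus the head-face IoU; the exact cost function of Section 3.4.2 may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def pair_heads_faces(heads, faces, min_iou=0.1):
    """Hungarian matching between head and face boxes.

    Returns (head_idx, face_idx) pairs and the indices of faces that
    found no head; those faces are treated as heads missed by Yolov5."""
    if not len(heads) or not len(faces):
        return [], list(range(len(faces)))
    cost = np.array([[1.0 - iou(h, f) for f in faces] for h in heads])
    rows, cols = linear_sum_assignment(cost)  # minimises total pairing cost
    pairs = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
    unmatched_faces = set(range(len(faces))) - {c for _, c in pairs}
    return pairs, sorted(unmatched_faces)

# Two heads, one face overlapping head 0, and one face with no head:
heads = [[10, 10, 50, 60], [120, 15, 160, 70]]
faces = [[18, 25, 45, 55], [200, 30, 230, 65]]
pairs, extra_faces = pair_heads_faces(heads, faces)
count = len(heads) + len(extra_faces)  # 2 heads + 1 unpaired face = 3
```

The unpaired faces are added back to the detection set, which is exactly what lifts Recall from Tab. 4.7 to Tab. 4.8.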
Head Tracking

We evaluate this approach only on our own ClassHead Part dataset: SORT tracks objects across many consecutive frames, and on the other datasets the frames are discrete, so tracking cannot be applied there. Object tracking is one of the most commonly used techniques today, and we apply it to the people counting problem. The results, given in Tab. 4.9, are quite positive. Tracking increases Precision at views ch03, ch04 and ch05 by 9.49%, 2.689% and 0.219% respectively when compared with the head detection method. Recall likewise increases by 12.8%, 3.13%, 1.2%, 1.28% and 1.28% for views ch03, ch04, ch05, ch12 and ch13. We illustrate the tracking process in Fig. 4.14: in Fig. 4.14(a) the green bounding boxes are the initial objects obtained from object detection using Yolov5, and Fig. 4.14(b) shows the tracked objects as cyan bounding boxes next to the green detection boxes.

Figure 4.14: Head tracking results on the ClassHead Part dataset. (a) Head detections at frame 1, (b) head tracking at frame 100.

Table 4.9: Results of Head Tracking on the ClassHead Part dataset.

View   Precision (%)   Recall (%)   F1-score (%)   AP (%)
ch03   80.92           88.80        84.68          81.80
ch04   98.19           99.71        98.94          99.56
ch05   91.85           90.60        91.22          89.86
ch12   88.60           96.69        92.47          88.69
ch13   94.21           95.68        94.94          94.56
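For reference, SORT (Section 2.3.3.2) associates detections across frames by predicting each track's box with a constant-velocity Kalman filter and matching predictions to new detections with the Hungarian algorithm. The sketch below keeps only the IoU association and track-ageing logic, reusing iou() from the previous sketch; the Kalman prediction step of real SORT is omitted for brevity, so this is an illustration rather than the thesis implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

class SimpleTracker:
    """SORT-style tracker reduced to IoU matching and track ageing."""

    def __init__(self, iou_threshold=0.3, max_age=5):
        self.tracks = {}        # track id -> (last box, frames since update)
        self.next_id = 0
        self.iou_threshold = iou_threshold
        self.max_age = max_age  # drop a track unseen for this many frames

    def update(self, detections):
        ids = list(self.tracks)
        rows, cols = [], []
        if ids and len(detections):
            # Cost matrix between existing tracks and new detections
            cost = np.array([[1.0 - iou(self.tracks[i][0], d)
                              for d in detections] for i in ids])
            rows, cols = linear_sum_assignment(cost)
        matched_ids, matched_dets = set(), set()
        for r, c in zip(rows, cols):
            if 1.0 - cost[r, c] >= self.iou_threshold:
                self.tracks[ids[r]] = (detections[c], 0)
                matched_ids.add(ids[r])
                matched_dets.add(c)
        # Unmatched detections start new tracks
        for c in range(len(detections)):
            if c not in matched_dets:
                self.tracks[self.next_id] = (detections[c], 0)
                self.next_id += 1
        # Age unmatched tracks; a surviving track keeps counting a person
        # through a short gap in which the detectors miss them
        for i in ids:
            if i not in matched_ids:
                box, age = self.tracks[i]
                if age + 1 > self.max_age:
                    del self.tracks[i]
                else:
                    self.tracks[i] = (box, age + 1)
        return self.tracks  # current people count = len(self.tracks)
```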
MultiDetect with Track

The previous results show that combining head and face detections (MultiDetect) is very promising. We therefore combine MultiDetect with the tracking method above into the MultiDetect with Track method; the results are given in Tab. 4.10. With MultiDetect with Track, Recall goes up by 5.1%, 0.08%, 1.24%, 1.21% and 2.2% at views ch03, ch04, ch05, ch12 and ch13 respectively when compared with the MultiDetect method. We illustrate the results of the MultiDetect with Track method in Fig. 4.15: in Fig. 4.15(a) the green bounding boxes are the initial objects obtained from object detection using Yolov5, and Fig. 4.15(b) shows the tracked objects as cyan bounding boxes next to the green and yellow boxes obtained from the MultiDetect method.

Figure 4.15: MultiDetect with Track results on the ClassHead Part dataset. (a) MultiDetect with Track at frame 1, (b) MultiDetect with Track at frame 100.

Table 4.10: Results of the MultiDetect with Track method on the ClassHead Part dataset.

View   Precision (%)   Recall (%)   F1-score (%)   AP (%)
ch03   69.55           93.667       79.83          74.73
ch04   92.899          99.75        96.20          98.67
ch05   85.49           91.68        88.48          86.64
ch12   82.653          97.666       89.53          85.89
ch13   82.505          98.28        89.70          89.54

To compare with the previous methods, we aggregate all results in Tab. 4.11. According to Tab. 4.11, the average Precision is highest for the MultiDetect method at 90.88%, while the average Recall is highest for the MultiDetect with Track method at 96.21%. In addition, we evaluate the MAE between the model predictions and the ground truth; the results are shown in Fig. 4.16.

Figure 4.16: MAE measurement results of the proposed methods on the ClassHead Part dataset.

Table 4.11: Experimental results on the ClassHead Part dataset for all methods.

Metrics             Head Detection   MultiDetect   MultiDetect with Track
Precision AVG (%)   88.53            90.75         82.62
Recall AVG (%)      90.36            94.29         96.21
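For completeness, the MAE used in Fig. 4.16 is the standard mean absolute error between predicted and ground-truth counts over the evaluated frames (the notation here is ours, as the excerpt does not state the formula):

\[
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| c_i - \hat{c}_i \right|
\]

where N is the number of test frames, c_i is the ground-truth people count and \hat{c}_i is the predicted count in frame i.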
Chapter 5

Conclusions

5.1 Conclusion

In this thesis, we attempted to improve people counting results based on the observations above. We first deployed two detectors (Yolo and RetinaFace) to detect the heads and faces of people in the scene. We then developed a pairing technique that aligns the face and the head of each person. This alignment helps to recover missed detections of heads or faces and thus increases the true positive rate. To overcome the case where both the face and the head are missed in a certain frame, we applied a tracking technique (i.e., SORT) to the combined detection result. Putting all of these techniques into a unified framework increases the true positive rate from 90.36% to 96.21% on the ClassHead Part dataset. The proposed method, "Improvement of People Counting by Pairing Head and Face Detections from Still Images", was published at the 2021 MAPR conference [31].
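As an illustration only, the per-frame loop summarised above can be sketched by composing the earlier examples; detect_heads() and detect_faces() are hypothetical wrappers around the trained Yolov5 and RetinaFace models, not functions from the thesis code.

```python
def count_people(frames, detect_heads, detect_faces):
    """Sketch of the full pipeline: detect, pair (MultiDetect), track, count.

    Reuses pair_heads_faces() and SimpleTracker from the sketches above;
    detect_heads/detect_faces return lists of [x1, y1, x2, y2] boxes."""
    tracker = SimpleTracker()
    counts = []
    for frame in frames:
        heads = detect_heads(frame)
        faces = detect_faces(frame)
        # MultiDetect: keep all heads plus faces whose head was missed
        _, extra_faces = pair_heads_faces(heads, faces)
        merged = list(heads) + [faces[i] for i in extra_faces]
        # Tracking bridges frames where both detectors miss a person
        tracks = tracker.update(merged)
        counts.append(len(tracks))
    return counts
```

This corresponds to the flow of Fig. 3.12, where tracking is applied on top of the combined detections to raise the true positive rate.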
5.2 Future Works

People counting is the process of accurately measuring the number of people entering, exiting, or passing through a specific area or location. The data generated by people counting can be used to optimize operations, improve customer experience, ensure public safety, and measure the effectiveness of marketing campaigns, among other applications. People counting is commonly used in retail stores, shopping malls, airports, public transportation systems, museums, and other public spaces. In the near future, we will continue to work on people counting and on methods of combining human body parts. More specifically, we will study methods of combining the head with the upper body of a person, thereby building a complete end-to-end network to improve the efficiency of people counting.

References

[1] G. Gao, J. Gao, Q. Liu, Q. Wang, and Y. Wang, "CNN-based density estimation and crowd counting: A survey," arXiv preprint arXiv:2003.12783, 2020.
[2] X. Zhao, E. Delleandrea, and L. Chen, "A people counting system based on face detection and tracking in a video," in 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 67–72, IEEE, 2009.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, 2016.
[4] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "Yolov4: Optimal speed and accuracy of object detection," arXiv preprint arXiv:2004.10934, 2020.
[5] T.-Y. Chen, C.-H. Chen, D.-J. Wang, and Y.-L. Kuo, "A people counting system based on face-detection," in 2010 Fourth International Conference on Genetic and Evolutionary Computing, pp. 699–702, IEEE, 2010.
[6] G. Zhao, H. Liu, L. Yu, B. Wang, and F. Sun, "Depth-assisted face detection and association for people counting," in CCPR, pp. 251–258, 2012.
[7] B. Li, J. Zhang, Z. Zhang, and Y. Xu, "A people counting method based on head detection and tracking," in 2014 International Conference on Smart Computing, pp. 136–141, IEEE, 2014.
[8] S. D. Khan, H. Ullah, M. Ullah, N. Conci, F. A. Cheikh, and A. Beghdadi, "Person head detection based deep model for people counting in sports videos," in 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–8, IEEE, 2019.
[9] K. Zhang, F. Xiong, P. Sun, L. Hu, B. Li, and G. Yu, "Double anchor R-CNN for human detection in a crowd," arXiv preprint arXiv:1909.09998, 2019.
[10] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou, "Relational learning for joint head and human detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10647–10654, 2020.
[11] P. Karpagavalli and A. Ramprasad, "Estimating the density of the people and counting the number of people in a crowd environment for human safety," in 2013 International Conference on Communication and Signal Processing, pp. 663–667, IEEE, 2013.
[12] V. Lempitsky and A. Zisserman, "Learning to count objects in images," Advances in Neural Information Processing Systems, vol. 23, 2010.
[13] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, "Bytetrack: Multi-object tracking by associating every detection box," in Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 1–21, Springer, 2022.
[14] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
[15] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," CoRR, vol. abs/1602.00763, 2016.
[16] https://en.oi-wiki.org/graph/graph-matching/graph-match/
[17] N. Wojke, A. Bewley, and D. Paulus, "Deep SORT: Simple online and realtime tracking with a deep association metric," arXiv preprint arXiv:1703.07402, 2017.
[18] S. Vogt, A. Khamene, F. Sauer, and H. Niemann, "Single camera tracking of marker clusters: Multiparameter cluster optimization and experimental verification," in Proceedings International Symposium on Mixed and Augmented Reality, pp. 127–136, IEEE, 2002.
[19] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in Computer Vision - ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II, pp. 17–35, Springer, 2016.
[20] T. Schmidt, K. Hertkorn, R. Newcombe, Z. Marton, M. Suppa, and D. Fox, "Depth-based tracking with physical constraints for robot manipulation," in 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 119–126, IEEE, 2015.
[21] M. Pervaiz, Y. Y. Ghadi, M. Gochoo, A. Jalal, S. Kamal, and D.-S. Kim, "A smart surveillance system for people counting and tracking using particle flow and modified SOM," Sustainability, vol. 13, no. 10, p. 5367, 2021.
[22] A. Shehzed, A. Jalal, and K. Kim, "Multi-person tracking in smart surveillance system for crowd counting and normal/abnormal events detection," in 2019 International Conference on Applied and Engineering Mathematics (ICAEM), pp. 163–168, IEEE, 2019.
[23] J. Redmon and A. Farhadi, "Yolo9000: Better, faster, stronger," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, 2017.
[24] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[26] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou, "Retinaface: Single-stage dense face localisation in the wild," arXiv preprint arXiv:1905.00641, 2019.
[27] T.-H. Vu, A. Osokin, and I. Laptev, "Context-aware CNNs for person head detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2893–2901, 2015.
[28] X. Ren, "Finding people in archive films through tracking," in 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, IEEE, 2008.
[29] S. Yang, P. Luo, C.-C. Loy, and X. Tang, "Wider face: A face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5533, 2016.
[30] https://machinelearningcoban.com/2017/08/31/evaluation/
[31] T.-O. Ha, H.-N. Tran, H.-Q. Nguyen, T.-H. Tran, P.-D. Nguyen, H.-G. Doan, V.-T. Nguyen, H. Vu, and T.-L. Le, "Improvement of people counting by pairing head and face detections from still images," in 2021 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pp. 1–6, IEEE, 2021.
[32] M. Tan, R. Pang, and Q. V. Le, "Efficientdet: Scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790, 2020.
[33] S. Seong, J. Song, D. Yoon, J. Kim, and J. Choi, "Determination of vehicle trajectory through optimization of vehicle bounding boxes using a convolutional neural network," Sensors, vol. 19, no. 19, p. 4263, 2019.
[34] Y.-Q. Huang, J.-C. Zheng, S.-D. Sun, C.-F. Yang, and J. Liu, "Optimized YOLOv3 algorithm and its application in traffic flow detections," Applied Sciences, vol. 10, no. 9, p. 3079, 2020.
[35] J. Redmon and A. Farhadi, "Yolo9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[36] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, June 2017.
[37] D. Peng, Z. Sun, Z. Chen, Z. Cai, L. Xie, and L. Jin, "Detecting heads using feature refine net and cascaded multi-scale architecture," in 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2528–2533, 2018.
[38] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems, pp. 379–387, 2016.
[39] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456, 2015.
[40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision, pp. 21–37, Springer, 2016.
[41] T. Kong, A. Yao, Y. Chen, and F. Sun, "Hypernet: Towards accurate region proposal generation and joint object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 845–853, 2016.
[42] W. Liu, A. Rabinovich, and A. C. Berg, "Parsenet: Looking wider to see better," arXiv preprint arXiv:1506.04579, 2015.
[43] Z. Huang, J. Wang, X. Fu, T. Yu, Y. Guo, and R. Wang, "DC-SPP-YOLO: Dense connection and spatial pyramid pooling based YOLO for object detection," Information Sciences, 2020.
[44] Y. Sun, X. Wang, and X. Tang, "Deeply learned face representations are sparse, selective, and robust," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2892–2900, 2015.
[45] M. Liu, R. Wang, S. Li, S. Shan, Z. Huang, and X. Chen, "Combining multiple kernel methods on Riemannian manifold for emotion recognition in the wild," in Proceedings of the 16th International Conference on Multimodal Interaction, pp. 494–501, 2014.
[46] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu, "Semantic segmentation with second-order pooling," in European Conference on Computer Vision, pp. 430–443, Springer, 2012.
[47] E. U. Haq, H. Jianjun, K. Li, and H. U. Haq, "Human detection and tracking with deep convolutional neural networks under the constrained of noise and occluded scenes," Multimedia Tools and Applications, vol. 79, no. 41, pp. 30685–30708, 2020.
[48] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[49] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning, pp. 6105–6114, PMLR, 2019.
[50] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.
[51] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, 2015.
[52] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, pp. 91–99, 2015.
[53] T.-O. Ha, T. N. P. Truong, H.-Q. Nguyen, T.-H. Tran, T.-L. Le, H. Vu, and H.-G. Doan, "Automatic student counting in images using deep learning techniques, application in smart classroom management (in Vietnamese)," in The 23rd National Conference on Electronics, Communications and Information Technology (REV-ECIT), pp. 142–146, 2020.
[54] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, T. Mei, and S. Z. Li, "Improved selective refinement network for face detection," arXiv preprint arXiv:1901.06651, 2019.
[55] T. Teixeira and A. Savvides, "Lightweight people counting and localizing in indoor spaces using camera sensor nodes," in 2007 First ACM/IEEE International Conference on Distributed Smart Cameras, pp. 36–43, IEEE, 2007.
[56] J. Luo, J. Wang, H. Xu, and H. Lu, "A real-time people counting approach in indoor environment," in International Conference on Multimedia Modeling, pp. 214–223, Springer, 2015.
[57] T. Zhao and R. Nevatia, "Bayesian human segmentation in crowded situations," in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-459, IEEE, 2003.
[58] E. Zhang and F. Chen, "A fast and robust people counting method in video surveillance," in 2007 International Conference on Computational Intelligence and Security (CIS 2007), pp. 339–343, IEEE, 2007.
[59] T.-H. Vu, A. Osokin, and I. Laptev, "Context-aware CNNs for person head detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2893–2901, 2015.
[60] W. Wong, D. Q. Huynh, and M. Bennamoun, "Upper body detection in unconstrained still images," in 2011 6th IEEE Conference on Industrial Electronics and Applications, pp. 287–292, IEEE, 2011.
[61] C. Gao, P. Li, Y. Zhang, J. Liu, and L. Wang, "People counting based on head detection combining Adaboost and CNN in crowded surveillance environment," Neurocomputing, vol. 208, pp. 108–116, 2016.
[62] W. Liu, M. Salzmann, and P. Fua, "Counting people by estimating people flows," arXiv preprint arXiv:2012.00452, 2020.
[63] X. Shi, X. Li, C. Wu, S. Kong, J. Yang, and L. He, "A real-time deep network for crowd counting," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2328–2332, IEEE, 2020.
[64] B. Wei, M. Chen, Q. Wang, and X. Li, "Multi-channel deep supervision for crowd counting," arXiv preprint arXiv:2103.09553, 2021.
[65] D. Liang, X. Chen, W. Xu, Y. Zhou, and X. Bai, "Transcrowd: Weakly-supervised crowd counting with transformer," arXiv preprint arXiv:2104.09116, 2021.
[66] J. Liu, J. Liu, and M. Zhang, "A detection and tracking based method for real-time people counting," in 2013 Chinese Automation Congress, pp. 470–473, IEEE, 2013.
[67] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, and T.-K. Kim, "Multiple object tracking: A literature review," Artificial Intelligence, vol. 293, p. 103448, 2021.