VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF SCIENCE

DO PHUC THINH

BUILDING A CROWD SURVEILLANCE SYSTEM

Major: Computer Science
Major code: 8480101

MASTER'S THESIS IN COMPUTER SCIENCE

SCIENTIFIC ADVISOR: Assoc. Prof. Dr. LY QUOC NGOC

Ho Chi Minh City, 2018

DECLARATION

I declare that the content presented in this thesis is my own research, carried out under the supervision of Assoc. Prof. Dr. Ly Quoc Ngoc, University of Science, Vietnam National University, Ho Chi Minh City. Knowledge inherited from related works is fully cited in this thesis. The source code of the implemented system and the experiments, results, figures, and images used in this thesis are truthful.

Ho Chi Minh City, November 28, 2018
Do Phuc Thinh

ACKNOWLEDGMENTS

During the preparation of this thesis, I would like to sincerely thank:
• Assoc. Prof. Dr. Ly Quoc Ngoc, who dedicatedly guided and oriented my work and spent valuable time giving feedback so that I could complete this thesis.
• The Office of Postgraduate Studies of the University of Science, VNU-HCM, and the lecturers of the Faculty of Information Technology, who taught me, passed on valuable knowledge and experience, and created favorable conditions for me to complete this thesis.

Sincerely,
Do Phuc Thinh

TABLE OF CONTENTS

Chapter 1. Introduction
  1.1 Overview
  1.2 Research motivation
    1.2.1 Scientific significance
    1.2.2 Practical significance
  1.3 Problem statement
  1.4 Problem scope
  1.5 System overview
  1.6 Contributions of the thesis
  1.7 Structure of the thesis
Chapter 2. Background and Related Work
  2.1 Background
    2.1.1 The normal distribution
    2.1.2 Convolutional Neural Networks (CNN)
  2.2 Existing approaches
    2.2.1 Detection-based methods
    2.2.2 Regression-based methods
    2.2.3 Density-estimation-based methods
  2.3 Related work
  2.4 Approach of the thesis
    2.4.1 Proposed improvements
    2.4.2 Approach of the thesis
Chapter 3. Crowd Estimation System
  3.1 Introduction
  3.2 The Human Classifier model
  3.3 Estimating the number of people in a crowd
    3.3.1 Building the ground truth density map
    3.3.2 Regressor
    3.3.3 Switch Classifier
    3.3.4 The crowd counting model
Chapter 4. Experiments and Evaluation
  4.1 Introduction
  4.2 Benchmark datasets
    4.2.1 The UCF_CC_50 dataset
    4.2.2 The ShanghaiTech dataset
  4.3 Evaluation method
  4.4 Implementation and demo application
    4.4.1 Environment and implementation language
    4.4.2 Application interface
    4.4.3 Generating data for training and testing the models
    4.4.4 Training the models via the console interface
    4.4.5 Testing the models via the console interface
  4.5 Experimental results
    4.5.1 The UCF_CC_50 dataset
    4.5.2 The ShanghaiTech dataset
Chapter 5. Conclusion
  5.1 Conclusion
  5.2 Future work
PUBLICATIONS
REFERENCES

LIST OF ABBREVIATIONS

CCTV: closed-circuit television
CNN: Convolutional Neural Network
conv: convolutional
ReLU: Rectified Linear Unit
CCNN: Counting Convolutional Neural Network
MCNN: Multi-column Convolutional Neural Network
Switch-CNN: Switching Convolutional Neural Network
GAP: global average pooling

LIST OF TABLES

Table 2.1 Summary of the network architectures and the input/output information of the related works
Table 2.2 Weight-learning models of the related works
Table 2.3 Count estimation methods of the related works
Table 2.4 Shortcomings of some methods
Table 2.5 Approach of the thesis
Table 4.1 Evaluation results on the UCF_CC_50 dataset compared with existing methods
Table 4.2 Evaluation results with the MRE metric on ShanghaiTech Part A compared with other methods
Table 4.3 Evaluation results on the ShanghaiTech dataset compared with existing methods

LIST OF FIGURES

Figure 1.1 Overall model of the crowd counting system in the offline and online stages
Figure 1.2 Results of density map generation and count estimation
Figure 2.1 Some data distributions
Figure 2.2 CNN architecture
Figure 2.3 How a filter is computed
Figure 2.4 Shape of the activation function f(x)
Figure 2.5 Example of max pooling with a 2x2 filter and stride 2
Figure 2.6 MCNN architecture
Figure 2.7 Input image, density map, and estimated count
Figure 2.8 The Switch-CNN model in the online stage
Figure 2.9 Some proposed improvements
Figure 3.1 Architecture of the Human Classifier model
Figure 3.2 Illustration of the density map computation
Figure 3.3 Two-dimensional Gaussian distribution over x, y with mean at (0,0) and unit variance
Figure 3.4 Distances from the considered point to its nearest annotated head positions (with k = 5)
Figure 3.5 Architecture of the Regressors with columns R1, R2, R3
Figure 3.6 Architecture of the Switch Classifier model

Some review comments on the published paper (the paper received 5/6 points):

REVIEW 1 — Overall evaluation: (weak accept)
In this paper, the authors propose the Weighted Sum of Regressors and Human Classifier for crowded scene counting. The main idea is to use the human classifier (VGG-16) to recognize the positive or negative sample regions, after which regressors (MCNN) are trained to estimate the number of people in the crowd. The numerical test results on the ShanghaiTech and UCF_CC_50 datasets show that the proposed method achieves high prediction correctness compared to the state-of-the-art methods. The paper is relevant to the AI & Big Data Analytics community. The presentation is not bad. The authors should:
- provide the implementation details for the proposed method, including the library and language used;
- illustrate how the numerical test results of the other methods were obtained.

REVIEW 2 — Overall evaluation: (accept)
The readability of this paper is good. The survey of related work is reasonable, and the authors discuss the advantage of the proposed method by comparing it with recent work and show better performance. This is one of the positive aspects of this paper. The readability and presentation of the paper are also good. However, the small letters in some figures should be larger for better readability. The heading of the Acknowledgement section should be moved to the top of the right column. In addition, other formatting errors should be corrected before publication.

REVIEW 3 — Overall evaluation: (accept)
The paper proposes a framework for Crowded Scene Counting. The two main points in this framework are: (1) a pre-processing step using a human classifier based on a deep CNN, and (2) a combination of density maps based on weights. The experiments are sufficient and show that the proposed method outperforms Switch-CNN and MCNN on the ShanghaiTech and UCF_CC_50 datasets. The following comments/questions might help the authors improve the paper:
- Why is each image divided into image patches? It would be nice to see an explanation of how the images are divided. The authors should also present more detail about how the filter sizes for the regressors, as well as the number of regressors, are defined. The content related to the "Fourth Industrial Revolution" in the introduction section can be removed if space is needed.
- There are several typos, for example in Subsection 3.2.2: "The architecture of the Regressors is shown in Fig 3." (Fig. 4, not Fig. 3?)
A New Framework For Crowded Scene Counting Based On Weighted Sum Of Regressors and Human Classifier

Phuc Thinh Do
VNUHCM - University of Science
227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Vietnam
tenshiren@gmail.com

Ngoc Quoc Ly
VNUHCM - University of Science
227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Vietnam
lqngoc@fit.hcmus.edu.vn

ABSTRACT
Crowd density estimation is an important task in surveillance camera systems; it serves security, traffic, business, etc. At present, the trend of monitoring is moving from individuals to crowds, but traditional counting techniques are inefficient in this case because of issues such as scale, cluttered backgrounds, and occlusion. Most previous methods have focused on modeling work to accurately estimate the density map and thus infer the count. However, with non-human scenes containing clouds, trees, houses, seas, etc., these models are often confused, resulting in inaccurate count estimates. To overcome this problem, we propose the "Weighted Sum of Regressors and Human Classifier" (WSRHC) method. Our model consists of two main parts: human vs. non-human classification and crowd count estimation. First of all, we build a Human Classifier, which filters out negative sample images (non-human images) before they enter the regressors. Then, the count estimation is based on the regressors, where the difference between the regressors is the size of their filters. The essence of this method is that the count depends on the weighted average of the density maps obtained from these regressors. This overcomes the defects of previous models: Switching Convolutional Neural Network (Switch-CNN) selects the count as the output of only one of the regressors, and Multi-column Convolutional Neural Network (MCNN) combines the counts of the regressors with fixed weights, while our approach adapts the weights to individual images. Our experiments show that our method outperforms Switch-CNN and MCNN on the ShanghaiTech and UCF_CC_50 datasets.

1 INTRODUCTION
The Fourth Industrial Revolution is taking place. It can be said that this is a revolution to promote the intelligibility of production processes and social management. Digital technology is one of the most important areas with a decisive influence on this process, and recent advances in artificial intelligence are due to machine learning, especially deep learning. Currently, in the process of social management, people want to build Smart Cities with closed-circuit television (CCTV) surveillance systems, which shows the essential role of automated monitoring. There are many popular tasks in automatic monitoring, such as detection, identification, classification, and tracking, as well as newer statistical analysis tasks such as density estimation and object counting. However, monitoring has grown from individuals to crowds, so previous solutions designed for scenes with sparse and medium density are no longer suitable. Today, more and more cities are built up, making population density denser and denser. On the other hand, many events, meetings, demonstrations, etc. are also held. In parallel, the security situation has become very complex; in particular, terrorist organizations always target crowds to cause great damage. Therefore, it is necessary to improve automatic monitoring by surveillance cameras to ensure public order. Crowds are not simply the sum of their components, so traditional counting solutions are no longer appropriate. There are many tasks related to crowds, such as crowd counting, crowd density estimation, crowd behavior analysis, and crowd attribute recognition. This paper focuses on crowd counting and crowd density estimation. Based on our surveys, we found two main waves in crowd counting and crowd density estimation: direct and indirect approaches.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence → Computer vision → Computer vision tasks; Neural networks
KEYWORDS
Crowd Density Estimation, Crowded Scene Counting, Convolutional Neural Networks (CNN)

ACM Reference format:
Phuc Thinh Do and Ngoc Quoc Ly. 2018. A New Framework For Crowded Scene Counting Based On Weighted Sum Of Regressors and Human Classifier. In SoICT '18: Ninth International Symposium on Information and Communication Technology, December 6–7, 2018, Da Nang City, Viet Nam. ACM, New York, NY, USA. https://doi.org/10.1145/3287921.3287980

Figure 1: Crowd samples from the ShanghaiTech Part A, ShanghaiTech Part B, and UCF_CC_50 datasets.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SoICT 2018, December 6–7, 2018, Danang City, Viet Nam
© 2018 Association for Computing Machinery
ACM ISBN 978-1-4503-6539-0/18/12…$15.00
https://doi.org/10.1145/3287921.3287980

Direct approaches, based on human detection using handcrafted features such as local binary patterns (LBP) and histograms of oriented gradients (HOG) [21] together with probabilistic models, are often effective in sparse to medium density environments, but fail in crowded environments because of challenges such as scale, clutter, and occlusion. Currently, object detection solutions based on Deep Convolutional Neural Network (DCNN) models such as Fast R-CNN, Faster R-CNN, and YOLO [5, 17, 16] have achieved high performance in sparse environments. However, their effectiveness is severely reduced when detecting objects in dense environments. Thus, an indirect approach is based on the density map [6, 7, 12, 22]. The essence of this approach is similar to a probability density function, since the count is defined as the integral over the entire density map. Some recent works, such as Zhang et al. with the MCNN [27] and Sam et al. with the Switch-CNN [19], use this approach and achieve good results. However, these solutions have a shortcoming: they do not handle non-human regions well, so our system adds a human classifier before putting the image into the density map estimation model. Switch-CNN also has a shortcoming: when the wrong CNN model is selected, counting is negatively affected. Our system proposes a soft assignment solution, meaning that the result is the sum of the outputs of the CNNs with corresponding weights. With these two improvements, our system builds adaptive weights for each image and reduces counting error through the human classifier. The experimental results show that our method outperforms MCNN and Switch-CNN in both medium and dense environments.

The remainder of this paper is organized as follows: related works are reviewed in Section 2. Our proposed method is presented in Section 3. Experiments and evaluation are shown in Section 4. Finally, conclusions and future work are drawn in Section 5.

2 RELATED WORK
People counting methods are usually divided into the following three approaches [21, 11]:

Detection-based approaches: Most of these methods focus on object detection [1, 4, 10, 18, 23, 24], using a sliding window to detect people; Nguyen et al. [13] use a DCNN to detect head regions and use this information to count the number of people. The disadvantage of these methods is that the greater the crowd density, the lower the counting performance.

Regression-based approaches: This approach consists of two tasks: extracting features from the image and constructing a regression model that maps these features to the count [3, 15]. The approach avoids the challenge of detecting individual objects. However, it does not exploit useful spatial information such as the positions of heads in the image. Wang et al. [25] were among the pioneers in applying CNNs to estimating the number of people in a crowd. They chose the AlexNet architecture [8] with a minor revision: the last fully-connected layer of AlexNet was replaced with a single neuron that estimates the count. They also augmented the training set by adding negative samples and set the number of people in these samples to zero; the purpose is to reduce the estimation error on non-human objects such as houses, trees, etc.

Density estimation-based approaches: The principle of this approach is based on the probability density function: the count is the integral of the density function over a specified range. These methods take advantage of spatial information in the image by using head positions as the ground truth. Lempitsky [9] proposed a counting method based on the density map, which is created by evaluating a Gaussian kernel at each head position; the count is calculated by integrating over the entire density map. Zhang [26] claims that the density map proposed by Lempitsky [9] is only suitable for circular objects such as cells or bacteria, and may not be optimal for crowds because the camera view is often tilted. To solve this problem, they propose evaluating the Gaussian kernel at both the head and body parts. In the offline stage, the density map is estimated by a deep CNN whose input is the image and whose output is the density map. Boominathan et al. [2] propose a model that estimates the density map with two CNNs, one shallow and one deep; the outputs of the deep network and the shallow network are combined with a 1x1 convolution layer to form the estimated density map. Zhang et al. [27] and Onoro et al. [14] proposed multi-column CNN architectures in which each column consists of filters of different sizes. Sam et al. [19] improved the MCNN architecture by adding a classifier that selects the CNN column conforming to the input image. In the online stage, the deep CNN model estimates the density map from the input image and then calculates the count from the density map.

3 PROPOSED METHOD
Our framework consists of two main modules: the Human Classifier (HC) and crowd count estimation. The HC classifies an image as human or non-human. For each non-human image patch, the count is set to zero; human images are used to estimate the count. With the approach of Zhang et al. [27], after all the samples are learned, the model has a set of fixed weights and uses these weights to estimate the count on new samples. The disadvantage of this approach is that fixed weights do not adapt to each image. Therefore, in the crowd count estimation, we use three separate CNN models with different sizes of filters (SoF) to estimate the count. The three CNN models give three different sets of weights, which makes the model more adaptable to each image. Unlike Sam et al. [19], we weight each density map obtained from the three CNNs and combine them linearly into a single density map; the final count is calculated on this density map.

Figure 2: a) Patch sample from the ShanghaiTech dataset [27]; b) Head position map; c) Ground truth density map; d) Estimated density map.

3.1 Human classifier
To the best of our knowledge, a human classifier has not previously been used in a crowd counting system. A patch is labeled 0 if its number of people equals zero and labeled 1 otherwise. The problem of imbalanced data is handled by randomly duplicating samples until the two classes have equal numbers of samples. With this dataset, the HC is trained via the cross-entropy loss

L = -(1/N) Σ_{i=1}^{N} y_i log(ŷ_i)    (1)

where N is the number of samples, y_i represents the target probability of the i-th patch, and ŷ_i represents the predicted probability of the i-th patch.
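To make the Human Classifier head concrete, here is a minimal NumPy sketch of the head that replaces VGG-16's fully-connected layers (global average pooling over the conv feature map, a 512-node fully-connected layer with ReLU, and a 2-class softmax). The feature-map shape and the random weights are hypothetical placeholders; the real model uses trained VGG-16 convolutional features.

```python
import numpy as np

def global_average_pool(features):
    # features: (C, H, W) conv feature map -> (C,) channel-wise averages
    return features.mean(axis=(1, 2))

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def human_classifier_head(features, W_fc, b_fc, W_out, b_out):
    """Classifier head: GAP -> 512-node FC (ReLU) -> 2-class softmax."""
    pooled = global_average_pool(features)           # (512,)
    hidden = np.maximum(0.0, W_fc @ pooled + b_fc)   # (512,) after ReLU
    return softmax(W_out @ hidden + b_out)           # (2,) class probabilities

# Toy usage with random placeholder weights (real weights come from training).
rng = np.random.default_rng(0)
feat = rng.standard_normal((512, 7, 7))  # hypothetical VGG-16 conv output
W_fc, b_fc = rng.standard_normal((512, 512)) * 0.01, np.zeros(512)
W_out, b_out = rng.standard_normal((2, 512)) * 0.01, np.zeros(2)
p = human_classifier_head(feat, W_fc, b_fc, W_out, b_out)
print(p)  # [P(non-human), P(human)]; the larger probability decides the class
```

At inference time, the class with the greater probability is chosen, and patches classified as non-human are assigned a count of zero before any regressor runs.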
With traditional counting methods using CNNs, when encountering a new sample the CNN model always estimates a non-zero value, which increases the error when the new sample is a non-human image. Therefore, we pre-classify non-human samples and assign their counts a value of zero. As mentioned earlier, most current methods focus on the model to increase counting accuracy, and the count depends heavily on the density map. However, when facing scenes with many non-human objects such as trees, patterns, and buildings, these models are easily confused and output inaccurate density maps, which increases the counting error. So we propose adding a classifier that classifies an image as human or non-human. As mentioned in Section 1, methods based on handcrafted features such as HOG and LBP do not work well in dense environments; therefore, we choose a deep CNN for our classification model. The Human Classifier is a CNN based on the VGG-16 architecture [20], chosen because it is easy to implement and quick to train. The VGG-16 architecture could be replaced by a more efficient DCNN model, but this paper aims to demonstrate the effectiveness of human pre-classification in crowd density estimation. Following the settings of Sam et al. [19], the weights are initialized from the VGG-16 architecture. We drop the three fully-connected layers of this architecture and replace them with a global average pooling layer, a fully-connected layer with 512 nodes, and a 2-class softmax classifier to classify human or non-human patches. The input of the Human Classifier (HC) is the two-dimensional matrix of the grayscale input image. The outputs are the classification probabilities, and the class with the greater probability is chosen. Because the datasets are already annotated, we create the ground truth for training by automatically assigning labels based on the number of people in each patch.

3.2 Crowd counting estimation
Inspired by the approaches of Sam et al. [19] and Zhang et al. [27], our framework of density map estimation is as follows. First, we show how to create the ground truth density maps from the dataset. Then we use them to train the regressors. These regressors have different SoF, so they give three different density maps. Therefore, we combine the density maps with weights produced by the Switch Classifier (SC).

3.2.1 Ground truth
We prepare the ground truth for the regressors to learn the density map. Similar to the approach of Zhang et al. [27], the image dataset is annotated at each head position. The ground truth density map is obtained by applying a 2-dimensional Gaussian kernel at each head position; the density function is defined as

F(p) = H(p) * G(p)    (2)

where F(p) is the density function, * is the convolution operator, H(p) is the head position map obtained from the head positions annotated as a set of 2-dimensional points, and G(p) is the Gaussian function at pixel p. Because the center point is the origin when calculating the average value, the mean μ of G(p) is zero. So G(p) is

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))    (3)

where (x, y) are coordinates calculated from the origin, which is the head position. There are two ways to choose σ [27]:
- Fixed kernels: choose a fixed σ depending on the dataset.
- Geometry-adaptive kernels: calculate σ_i based on the distances from the head position to its k nearest annotated neighbors, σ_i = β·d̄_i, where d̄_i = (1/k) Σ_{j=1}^{k} d_j is the average of those distances.

Algorithm: Create the ground truth density map
Input: image I with head positions annotated by coordinates
Output: ground truth density map F(p)

Begin
  H(p): head position map; G(p): Gaussian kernel
  σ: parameter of G(p); β = 0.3; k = 5
  F(p): ground truth density map
  // Create head position map
  Initialize H(p) = zero(array);
  foreach head_position p(x, y)
    H(x, y) = 255;
  end foreach
  if (fixed σ)
    Calculate G(p) at σ
  else  // Geometry-adaptive kernels
    d̄_i = (1/k) Σ_{j=1}^{k} d_j; σ_i = β·d̄_i
    Calculate G(p) at σ_i
  end if
  foreach head_position p in H(p)
    Calculate F(p) = H(p) * G(p)
  end foreach
End

Figure 3: Algorithm for calculating the ground truth density map.

We chose β = 0.3 and k = 5, following the settings of Zhang et al. [27]. We use fixed kernels
with σ = 15 for unequally distributed datasets such as ShanghaiTech Part B, and geometry-adaptive kernels for the equally distributed datasets ShanghaiTech Part A and UCF_CC_50. The number of people in image I is calculated by integrating the density function F(p), i.e., summing the density map, over the entire image:

Count = ∬ F(x, y) dx dy ≈ Σ_{x,y} F(x, y)    (4)

3.2.2 Regressor
The regressors are used to estimate the density map. Unlike a CNN architecture used for classification, a regressor has no fully-connected layer, in order to keep the two-dimensional structure of the image. The architecture of the Regressors is shown in Fig. 4; its structure is similar to the columns of the MCNN [27]. The regressors vary in their SoF. The purpose is that large filters (Regressor R1) extract features from human heads that are large in size, so this regressor is suitable for patches with low density. Similarly, R2 is suitable for patches with medium density, and R3 for patches with high density. The number of regressors and the sizes of the filters can be changed; however, following the settings of Zhang et al. [27] and Sam et al. [19], we choose the number of regressors and the filter sizes as in Fig. 4.

Figure 4: a) The architecture of the Human Classifier; b) The architecture of the Regressors R1, R2, R3 with different filter sizes.

The activation function is the Rectified Linear Unit (ReLU). The inputs of the Regressors are the two-dimensional matrices of the grayscale input image. The output is a two-dimensional matrix representing the density map, whose size is ¼ the size of the input image (because there are two pooling layers in the network). Therefore, during the learning process, the samples are resized to ¼ of the original size. Each regressor gives its own density map, so the SC is proposed to weight and combine these density maps.

3.2.3 Switch Classifier
The purpose of the SC is to weight and combine the density maps. Similar to the HC in Section 3.1 and to Sam et al. [19], the SC is transfer-learned from the VGG-16 architecture. The input of the SC is the two-dimensional matrix of the grayscale input image. The outputs are the classification probabilities. These probabilities are multiplied with the respective density maps to create the final density map.

3.2.4 Crowded Scene Counting Model
Our suggested crowded scene counting framework is shown in Fig. 5. Each training image is divided into image patches to increase the number of samples as well as the accuracy of the regressors. First, a label is assigned to each patch based on its count: a patch is assigned label 0 or label 1 if the number of people in it is equal to zero or not, respectively. The HC is trained with these labels via the cross-entropy loss function; the problem of imbalanced data is handled as mentioned in Section 3.1.

Figure 5: Our suggested crowded scene counting framework in the online stage.

Second, the regressors R1, R2, R3 are trained to estimate the density map. The loss function used in the learning process is

L(Θ) = (1/(2N)) Σ_{i=1}^{N} ‖F(X_i; Θ) − F_i‖²    (5)

where N is the number of samples, F_i represents the ground truth density map of image patch X_i, and F(X_i; Θ) represents the estimated density map, with parameters Θ, for image patch X_i (the output of the Regressors).

// Divide each image in the training set into image patches
Input: N training patches P_i with ground truth density maps
Output: parameters of the Human Classifier, Regressors, and Switch Classifier

Algorithm: Model Learning
Begin
  // Create training set for Human Classifier
  count(P_i) == 0 ? label(P_i) = 0 : label(P_i) = 1
  // Train Human Classifier
  Initialize HC with VGG-16 weights
  for t = 1 to T1_epochs do:  // 30 epochs
    Train HC with label(P_i)
  end for
  // Pre-train Regressors
  Initialize regressors with random weights
  for t = 1 to T2_epochs do:  // 100 epochs
    Train regressors
  end for
  // Train Regressors
  for t = 1 to T3_epochs do:  // 100 epochs
    foreach patch P_i do:
      Calculate C_i^k from each regressor R_k
      best_regressor = argmin_k |C_i^k − C_i^GT|
      Train best_regressor using loss function (5)
    end foreach
    for t' = 1 to T4_epochs do:  // 30 epochs
      // Create training set for Switch Classifier
      foreach patch P_i do:
        Calculate F(P_i; Θ_k) from each regressor R_k
        type_i = argmin_k |C_i^k − C_i^GT|
      end foreach
      Switch_train_set = {(P_i, type_i)}
      Train Switch Classifier with Switch_train_set
      // Re-train the Regressor chosen by the Switch Classifier
      foreach patch P_i do:
        k = argmax Switch_Classifier(P_i)
        Train regressor R_k
      end foreach
    end for
  end for
End

Figure 6: Our suggested crowded scene counting algorithm.

Third, because the regressors have different SoF, their counts differ. Thus the parameters Θ of each regressor are adjusted again through the count error: for each image patch, the regressor giving the result with the lowest count error has its parameters adjusted, and the patch is assigned the corresponding label (R1, R2, or R3) for SC training:

type_i = argmin_t |C_i^t − C_i^GT|    (6)

where t is the type of regressor, C_i^t is the count from regressor t for the i-th patch, and C_i^GT is the ground truth count for the i-th patch. Fourth, the SC is trained, with grayscale image patches as input and the labels obtained in the previous step as output, based on transfer learning from VGG-16. Finally, the Regressors are trained again: as the training data goes into the model, the SC selects the regressor with the appropriate SoF, and the parameters of this regressor are updated. The above
processes are trained until the accuracy reaches the threshold The dataset is introduced by Zhang et al [27] The dataset has a total of 1198 annotated images with 330,165 heads are already annotated The dataset is divided into two parts: Part A consists of 482 images randomly selected on the Internet and part B consists of 716 images (768x1024) taken in urban areas in Shanghai The density of people in part A is higher than part B Both part A and part B already were divided into trainsets (300 images for part A, 400 for part B) and testsets (182 images for part A, 316 for part B) a) b) EXPERIMENTS We evaluated the performance of the model using the main datasets ShanghaiTech and UCF_CC_50 The proposed method is implemented on python using theano library We used Stochastic Gradient Descent (SGD) with learning rate = 1e-6, momentum = 0.9 The number of epoch is shown in Fig The experiments were performed on the i7-7820HK@2.90GHz, 16GB RAM, GTX 1070 8GB, Windows 10 Home 64bit laptop c) 4.1 Evaluation Metric Similar to the previous approaches, the two main measures used to evaluate model performance were Mean Absolute Error (MAE) and Mean Squared Error (MSE), which were defined as follows: 1 | | (7) ( ) (8) We also used Mean Relative Error (MRE) to evaluate the count error more accurately (9) where is the crowd count estimated by the model, is the ground truth crowd count, is number of images Thus, the lower the MAE, MSE and MRE is, the better the model is Since the density estimation problem is not as complex as the classification problem, we use the following datasets with not many samples but still enough to solve this problem 4.2 ShanghaiTech dataset d) Figure 7: Row a) Sample from the ShanghaiTech Part A dataset [27]; Row b) Ground truth density map; Row c) Estimate density map without Human Classifier; Row d) Estimate density map with Human Classifier Table Comparison of our method and other methods on ShanghaiTech dataset Method Part A Part B MAE MSE MAE MSE 
Cross-scene [26] MCNN [27] Switch-CNN [19] WSR (without HC) 181.8 110.2 90.4 87.5 277.7 173.2 135.0 125.6 32.0 26.4 21.6 21.3 49.8 41.3 33.4 33.2 WSRHC 81.9 122.1 20.9 33.1 In Part A, we use geometry-adaptive kernels to generate the ground truth density map For Part B, we chose fixed kernels σ 15 Table show the results for our proposed WSRHC and other methods The accuracy of the Human Classifier is 96.2% in Part A and 94.0% in Part B WSRHC improved 8.5 points (9.4%) in MAE on Part A and 0.7 points (3.2%) in MAE on Part B in comparison to Switch-CNN [19] Table show that WSRHC also improved 3.0% with MRE metric sizes The range of people is between 94 and 4543 and the average people in a image is 1280 The data set contains 63,075 people already annotated Table Comparison of our method with other methods on UCF_CC_50 dataset Table Comparison of our method and other methods on ShanghaiTech part A with MRE Metric Method ShanghaiTech Part A MRE Switch-CNN [19] 23.235% WSR (without HC) 23.943% WSRHC 20.232% a) a) b) b) c) d) Method Multi-source multi-scale [6] MAE 468.0 MSE 590.3 Cross-scene [26] MCNN [27] 467.0 377.6 498.5 509.1 Hydra-CNN [14] 333.7 425.2 Switch-CNN [19] WSR (without HC) 318.1 310.3 439.2 401.2 WSRHC 250.5 383.7 c) d) Figure 8: Row a) Sample from the ShanghaiTech Part B dataset [27]; Row b) Ground truth density map; Row c) Estimate density map without Human Classifier; Row d) Estimate density map with Human Classifier Figure 9: Row a) Sample from the UCF_CC_50 dataset [6]; Row b) Ground truth density map; Row c) Estimate density map without Human Classifier; Row d) Estimate density map with Human Classifier 4.3 UCF_CC_50 dataset This data set is challenging due to not only the low number of images but also the number of people in the image fluctuate quite high The dataset is introduced by Idrees et al [6] The dataset contains 50 images of extremely dense crowd Images are collected from the image hosting service FLICKR with different We use 
We use geometry-adaptive kernels and 5-fold cross-validation to validate the performance, following the standard setting in [6]. The results are shown in Table 3. The accuracy of the Human Classifier is 98.9%. WSRHC outperforms all other methods on both the MAE and the MSE metric.

CONCLUSION AND FUTURE WORKS

In this paper, we have proposed the Human Classifier as a preprocessing step. This component is simple but plays an important role in increasing the efficiency of the model. We also combine density maps based on learned weights instead of choosing only one, which makes our model more flexible. Experiments show that our method outperforms the state-of-the-art methods on the ShanghaiTech dataset and the challenging UCF_CC_50 dataset. In the future, we will investigate models that estimate the density map more accurately. We also plan to study related fields such as traffic, breeding, and cultivation, and to extend the pre-classification approach to other objects such as vehicles, animals, and fruits.

ACKNOWLEDGEMENT

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. B2018-18-01.

REFERENCES

[1] A. Bansal and K. Venkatesh. 2015. People counting in high density crowds from still images. arXiv preprint arXiv:1507.08445.
[2] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. 2016. CrowdNet: A deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference, ACM, pages 640–644.
[3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. 2008. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR.
[4] P. Dollar, C. Wojek, B. Schiele, and P. Perona. 2012. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 743–761.
[5] R. Girshick. 2015. Fast R-CNN. In ICCV.
[6] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. 2013. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pages 2547–2554. IEEE.
[7] S. A. Kasmani, X. He,
W. Jia, D. Wang, and M. Zeibots. 2018. A-CCNN: Adaptive CCNN for density estimation and crowd counting. arXiv:1804.06958.
[8] A. Krizhevsky, I. Sutskever, and G. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
[9] V. Lempitsky and A. Zisserman. 2010. Learning to count objects in images. In Advances in Neural Information Processing Systems, pages 1324–1332.
[10] M. Li, Z. Zhang, K. Huang, and T. Tan. 2008. Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection. In ICPR.
[11] C. C. Loy, K. Chen, S. Gong, and T. Xiang. 2013. Crowd counting and profiling: Methodology and evaluation. In Modeling, Simulation, and Visual Analysis of Large Crowds. Springer.
[12] M. Marsden, K. McGuinness, S. Little, and N. E. O'Connor. 2017. ResnetCrowd: A residual deep learning architecture for crowd counting, violent behaviour detection and crowd density level classification. arXiv preprint arXiv:1705.10698.
[13] A. H. Nguyen and N. Q. Ly. 2018. A new framework for people counting from coarse to fine could be robust to viewpoint and illumination. In: N. Nguyen, D. Hoang, T. P. Hong, H. Pham, and B. Trawiński (eds), Intelligent Information and Database Systems, ACIIDS 2018, Lecture Notes in Computer Science, vol. 10752. Springer, Cham.
[14] D. Onoro-Rubio and R. J. López-Sastre. 2016. Towards perspective-free object counting with deep learning. In Proceedings of the ECCV. Springer, pages 615–629.
[15] V. Rabaud and S. Belongie. 2006. Counting crowded moving objects. In CVPR, pages 705–711.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. 2016. You only look once: Unified, real-time object detection. In CVPR.
[17] S. Ren, K. He, R. Girshick, and J. Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS.
[18] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. 2011. Density-aware person detection and tracking in crowds. In ICCV.
[19] D. B. Sam, S. Surya, and R. V. Babu. 2017. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.
[20] K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[21] V. A. Sindagi and V. M. Patel. 2017. A survey of recent advances in CNN-based single image crowd counting and density estimation. In Pattern Recognition Letters.
[22] V. A. Sindagi and V. M. Patel. 2017. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Advanced Video and Signal Based Surveillance, 2017 IEEE International Conference on. IEEE.
[23] P. Viola and M. J. Jones. 2001. Rapid object detection using a boosted cascade of simple features. In CVPR, issue 1, pages 511–518.
[24] P. Viola, M. J. Jones, and D. Snow. 2003. Detecting pedestrians using patterns of motion and appearance. In The 9th ICCV, Nice, France, volume 1, pages 734–741.
[25] C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao. 2015. Deep people counting in extremely dense crowds. In Proceedings of the 23rd ACM International Conference on Multimedia, ACM, pages 1299–1302.
[26] C. Zhang, H. Li, X. Wang, and X. Yang. 2015. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 833–841.
[27] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. 2016. Single image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 589–597.
smart cities (Smart City) with CCTV camera surveillance systems; this means that, through devices and software with artificial intelligence at their core, they are able to automatically monitor and manage objects and people.