2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

UIT-Anomaly: A Modern Vietnamese Video Dataset for Anomaly Detection

Dung T.T. Vo (18520641@gm.uit.edu.vn), Tung Minh Tran (tungtm.ncs@grad.uit.edu.vn), Nguyen D. Vo (nguyenvd@uit.edu.vn), Khang Nguyen (khangnttm@uit.edu.vn)
University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam

Abstract—Anomaly detection in videos is of utmost importance for numerous tasks in the field of computer vision. We introduce the UIT-Anomaly dataset, captured in Vietnam, with a total duration of 200 minutes. It contains 224 videos with six different types of anomalies. Moreover, we apply a method for weakly supervised video anomaly detection, called Robust Temporal Feature Magnitude learning (RTFM), which uses feature magnitude learning to detect abnormal snippets. The applied method yields competitive results compared to other state-of-the-art algorithms on publicly available datasets such as ShanghaiTech and UCF-Crime.

Index Terms—Anomaly Detection, Weak Supervision, Multiple Instance Learning

I. INTRODUCTION

Nowadays, remote anomaly detection has become more popular due to the increase in the number of surveillance cameras. However, these surveillance systems are still not timely and require manual labour. Therefore, it is necessary to leverage the power of computer vision to automatically detect anomalies in videos. The goal of this problem is to find a model that accurately identifies the start and end points of an anomalous event. The input and output of the problem are demonstrated in Figure.

In this paper, we apply a method for weakly supervised anomaly detection called RTFM, which learns from training videos annotated at the video level. Each video is represented as a bag of video snippets, and this anomaly detection approach is based on the temporal feature magnitude of the snippets in the video. Specifically, normal snippets are represented with low feature magnitude, and abnormal snippets are denoted by high feature magnitude. In this approach, the k snippets with the highest feature magnitude are selected from normal and abnormal videos, leading to a higher probability of selecting abnormal snippets in anomalous videos than in the MIL method [9]; these k snippets play a vital role in training a snippet classifier.

One of the biggest challenges of the anomaly detection problem in Vietnam is the lack of data. Benchmark datasets for this problem are often taken from movies and are rarely extracted from surveillance cameras. Hence, they cannot provide high realism or distinctive features such as the settings, individuals, or forms of violence found in Vietnam. Therefore, we built a novel dataset of normal and abnormal videos in Vietnam, called UIT-Anomaly. Our dataset consists of 224 videos with six types of unusual behavior common in Vietnam. All the videos in our dataset capture actual events in a variety of contexts, which makes UIT-Anomaly more diverse than other benchmark datasets, as presented in Section IV.

Fig. Demonstration of the input and output of the problem. Input: videos filmed in Vietnam. Output: the time window (from the starting frame to the ending frame) that contains the abnormal events.

II. RELATED WORK

In this section, we present two aspects: anomaly detection and the multiple instance learning (MIL) method [9].

A. Anomaly Detection

There are three main approaches to solving the anomaly detection problem in videos, namely unsupervised, supervised, and weakly supervised anomaly detection.

1) Unsupervised anomaly detection: as mentioned earlier, one of the challenges
of anomaly detection is the lack of data. Anomalies do not often occur in real life, which hinders collecting data in a variety of contexts. By contrast, normal samples are easy to collect and do not take much time. Therefore, only normal videos in the training set are used in this approach. This helps to save time and effort when building a dataset.

2) Supervised anomaly detection: with this approach, frame-level annotation is required for both the training and the test sets, i.e., abnormal frames must be distinguished from normal frames. This is the most expensive step in the process of building a dataset.

3) Weakly supervised anomaly detection: in this approach, the training set is annotated at the video level, but the model must still learn frame-level prediction because the test set is fully annotated as in the supervised approach. Therefore, compared with the supervised approach, the annotation cost of the weakly supervised approach is very low, and it achieves much better performance than the unsupervised approach. The weakly supervised approach is thus the best option for the anomaly detection problem.

B. MIL Method

Sultani et al. [9] proposed a method for anomaly detection based on multiple instance learning. In this method, before extracting features, each video was resampled to 30 fps at a size of 240 × 320 pixels. Each video was represented as a bag of 32 different snippets. Each snippet was divided into sets of 16 frames, whose features were extracted at the FC6 layer of the C3D network; the feature vector of each set of 16 frames was thus a 4096-D one. The method averaged the features of all sets of 16 frames in a snippet with l2 normalization and used the result as the feature of the whole snippet, so the feature vector of each snippet was also 4096-D. In the snippet classification step, the features of all snippets of a video are fed into an FC neural network. The input was a 4096-D vector, and the layers contain 512, 32, and 1 units, respectively. There was 60% dropout [8] between the layers. Moreover, the method used ReLU [6] activation for the first FC layer and Sigmoid activation for the last FC layer. In addition, the anomaly score of each snippet was taken as the anomaly score of the frames in it. Figure illustrates the MIL method.

Fig. Deep MIL Ranking method [9].

Although this method works effectively, it still has many limitations. In MIL, the snippet with the highest anomaly score was used to represent each video, so it is likely that the snippet with the highest anomaly score is not an anomaly, because the abnormal snippets are overwhelmed by the normal ones. In the case of more than one outlier, the chance to learn more abnormal snippets in each video might be missed when using this method.

III. METHODOLOGY

Yu Tian et al. [10] proposed a method for anomaly detection called RTFM to overcome the MIL method's drawbacks. Similar to MIL, RTFM detects anomalous and normal snippets by learning from weakly labeled videos to identify each snippet as normal or anomalous. Each video was represented as a bag consisting of T snippets. The snippets in each video were feature-extracted via I3D [2] or C3D [11]. After that, F denoted the features of the video, with dimension D, from the T snippets.

Fig. RTFM method [10].

The multi-scale temporal network (MTN) [10] incorporated two modules, a pyramid of dilated convolutions (PDC) [13] and a temporal self-attention (TSA) [12] module, on the temporal dimension, capturing both the multi-resolution local temporal dependencies and the global temporal dependencies between video snippets, as presented in Figure. Furthermore, the output of MTN was the temporal features transformed from F, denoted X. To classify a video or a snippet as normal or anomalous, the method used l2 normalization to calculate the feature magnitude of each snippet and then selected the k snippets with the highest feature magnitude.
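This top-k selection by l2 feature magnitude can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function name and the toy feature dimensions are our own.

```python
import numpy as np

def topk_feature_magnitude(snippet_features, k):
    """Pick the k snippets with the largest l2 feature magnitude.

    snippet_features: array of shape (T, D), one D-dim feature per snippet.
    Returns the indices of those snippets and their mean magnitude --
    the quantity that feature magnitude learning maximizes for abnormal
    videos and minimizes for normal ones.
    """
    magnitudes = np.linalg.norm(snippet_features, axis=1)  # shape (T,)
    topk_idx = np.argsort(magnitudes)[-k:]                 # indices of k largest
    return topk_idx, float(magnitudes[topk_idx].mean())

# Toy example: 3 snippets with 2-D features, magnitudes 5, 1, and 10.
X = np.array([[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]])
idx, score = topk_feature_magnitude(X, k=2)
print(sorted(idx.tolist()), score)  # -> [0, 2] 7.5
```

In practice T = 32 snippets and D = 2048 (I3D features), and the same selection is applied inside each mini-batch to both normal and abnormal videos.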
Assuming that a normal snippet has a smaller feature magnitude than an anomalous one, RTFM optimized the average feature magnitude of the top-k snippets of each video. In the feature magnitude learning phase, the highest feature magnitudes of normal videos' snippets were minimized, whereas the highest feature magnitudes of anomalous videos' snippets were maximized. This increased the ability to separate normal and anomalous videos. Finally, RTFM used the k snippets with the highest feature magnitude for training. To sum up, the RTFM training process optimizes three modules: (1) multi-scale temporal feature learning; (2) feature magnitude learning; and (3) snippet classifier training. RTFM's training process is shown in Figure.

Fig. Multi-scale temporal network (MTN) [13].

IV. DATASET

Nowadays, there are publicly available datasets for anomaly detection problems such as UMN [7], Violent-Flows [3], and Avenue [5]. However, they still have some disadvantages. For example, anomalous behavior is staged or extracted from movies, which leads to a lack of realism; in particular, such datasets cannot express the context and features of Vietnam regarding the environment, culture, people, and types of violence. Therefore, we built a novel surveillance video dataset, called UIT-Anomaly.

A. Selecting Anomaly Categories

As far as we know, it is difficult to completely define anomalous behavior because it has many aspects and presentations in the real world, so we describe anomalous activities clearly to minimize ambiguity in creating the ground truth. To mitigate the above issues, we consider the following six anomaly classes: Stealing, Traffic Accident, Fighting, Unsporting Behavior, Against, and Dog Thief. We are interested in these anomalies because they have distinct features in Vietnam. Additionally, some samples of our dataset are presented in Figure.

B. Video Collection

We collect videos from YouTube by using keywords like "street violence", "street stealing", "marital combat", "dog thief", and other words with a similar meaning for each type of anomaly. For normal behavior, we search for "security camera at school", "CCTV at street", "CCTV at home", etc. Anomalous behavior in Vietnam is not often captured by CCTV, so we also collect videos captured by smartphones and car black boxes. However, videos captured by CCTV still account for the majority, and all videos are shot from real events.

C. Video Cleaning

As a rule, we only use videos that have not been manually edited, so our annotator team checks each video to make sure. Since the number of videos that satisfy this rule is not sufficient, we also keep edited videos whose content is still intact. After that, we re-edit these videos by deleting borders, changing the video speed back to normal, etc., intending to make each video as close to the original as possible. Furthermore, we remove videos with excessive modifications or with unclear anomalies.

D. Video Annotation

Our dataset is annotated under the weakly supervised approach, so the training set only needs annotation at the video level. In addition, the test set is also annotated at the frame level to evaluate the performance of methods in the testing phase, that is, to confirm the start and end frames of anomalous activities. The dataset was finally completed after intense effort over several months.

E. Dataset Statistics

The UIT-Anomaly dataset includes a total of 224 muted videos captured at a frame rate of 30 fps with various resolutions. It has 104 normal and 120 anomalous videos. The total duration is more than 200 minutes, corresponding to 392,188 frames. We divide these videos into two subsets: the training set includes 90 abnormal and 90 normal videos, while the test set consists of the remaining 30 abnormal and 14 normal videos. Both training and test sets contain the six classes of anomalies. The video
distributions in terms of length and the number of frames of test videos are presented in Figures 7 and 8, respectively. We compare the UIT-Anomaly dataset with the others in Table I. Our dataset overwhelms the rest of the datasets in both size and length. Overall, the abnormal activities in UIT-Anomaly are very different from those of other datasets; for example, the Against and Dog Thief classes are two anomalies that are very rare elsewhere. Regarding diversity, the number of anomalous types in our dataset is larger. In addition, we collect videos in both indoor and outdoor places such as streets, homes, and restaurants, whereas other datasets only focus on one specific space. Moreover, other datasets cannot represent the specific features of anomalous events in Vietnam.

Fig. Some samples of six anomalous activities in the UIT-Anomaly dataset.

TABLE I: Comparison between UIT-Anomaly and benchmark datasets

| Dataset | # of videos | Length | Anomalous behavior | Scene(s) | Vietnam |
| UMN [7] | 11 | - | Run | University | No |
| Avenue [5] | 37 | 30 min | Run, throw, new object | Campus avenue | No |
| UCSD Ped1 [4] | 70 | - | Bikers, small carts, walking across walkways | Walkway | No |
| UCSD Ped2 [4] | 28 | - | Bikers, small carts, walking across walkways | Walkway | No |
| Subway Entrance [1] | - | 1.5 hours | Wrong direction, no payment | Subway | No |
| Subway Exit [1] | - | 1.5 hours | Wrong direction, no payment | Subway | No |
| UIT-Anomaly (Ours) | 224 | 3.5 hours | Stealing, Traffic Accident, Fighting, Unsporting Behavior, Against, Dog Thief | Street, house, restaurant, grocery store, office, etc. | Yes |

V. EXPERIMENTS

A. Implementation Details

In our approach, each video is divided into 32 snippets. We conduct experiments on the UIT-Anomaly dataset with k = 3, 4, 5, 6, 7, 8, 9, 10, 11. Moreover, 2048-D features are extracted at the mix_5c layer of the pre-trained I3D network. Furthermore, the three FC layers in MTN have 512, 128, and 1 units, respectively; each FC layer uses a ReLU activation function, with 70% dropout between FC layers. For all experiments, we train the RTFM method using the Adam optimizer with a weight decay of 0.0005 and a batch size of 16. In addition, we set the learning rate to 0.001 for 15,000 epochs. Additionally, each mini-batch includes samples from 32 randomly selected normal and abnormal videos.

B. Performance and Evaluation

From the results in Table II, we see that the framework using the k snippets with the largest feature magnitude achieves effective detection results. Noticeably, the ability to distinguish between abnormal and normal videos tends to improve as k increases, since expanding the scope of training makes the model learn better. However, if k is too high, the model will fail to detect snippets as anomalies because the abnormal samples are overwhelmed by the normal ones in both normal and abnormal videos.

TABLE II: AUC performance on UIT-Anomaly

| k | AUC (%) |
| 3 | 75.89 |
| 4 | 74.37 |
| 5 | 74.40 |
| 6 | 73.26 |
| 7 | 73.11 |
| 8 | 73.66 |
| 9 | 76.07 |
| 10 | 73.98 |
| 11 | 72.12 |

Furthermore, in Table III we also compare the results in two cases: training from scratch on UIT-Anomaly and using the best sets of parameters pre-trained on the ShanghaiTech and UCF-Crime datasets from [10]. Although the parameters trained on ShanghaiTech and UCF-Crime achieve high accuracy on their own datasets (AUC = 97.21% and 84.30%, respectively [10]), they have the lowest performance on the UIT-Anomaly dataset. Therefore, building a dataset for anomaly detection problems in Vietnam is essential.

TABLE III: Comparison of training results with parameters trained on benchmark datasets, in terms of the AUC metric (%)

| Setting | AUC |
| Trained parameters from ShanghaiTech | 42.84 |
| Trained parameters from UCF-Crime | 49.84 |
| Trained on UIT-Anomaly (k = 9) | 76.07 |

Fig. 7: Distribution of videos according to length in both sets.

Fig. 8: Distribution of video frames in the testing videos.

As shown in Figure 9, we visualize anomaly detection on the testing videos. The videos include well-recorded anomalous events such as Traffic_Accident_051 and Dog_Thief_033, and the model almost correctly detects the start and end times of the unusual events. The experimental results on normal videos such as Normal_088 and Normal_074 also show good performance. However, failure cases are observed in the testing videos. For instance, in the Stealing_058 video, the model does not detect any anomalous event because the thief snatches the bag instantaneously and the CCTV is installed too high to capture the situation. Similarly, the model does not detect the start and end times of the anomalous event occurring during a soccer match in Unsporting_Behavior_008. This happens because the constant and disordered movement of the players can make detection difficult for the model and can potentially be mistaken for anomalies. For videos with many anomalies occurring in a very short time, such as the Fighting_100 video, RTFM cannot detect the unusual events separately; instead, it treats them as a single anomalous event. This is one of the challenges of the UIT-Anomaly dataset: even a SOTA method like RTFM, which achieves high performance on other benchmark datasets, still has drawbacks when tested on it.

VI. CONCLUSION

We introduce the UIT-Anomaly dataset, a novel dataset for video anomaly detection in Vietnam. Moreover, frame-level ground truth labels for the anomalous events in the videos are provided to evaluate anomaly detection methods across a variety of approaches. Several state-of-the-art methods are thoroughly evaluated on this dataset and show that there is still room for improvement. We hope that the proposed dataset will stimulate the development of new weakly supervised or unsupervised anomaly detection methods.

ACKNOWLEDGEMENT

This work was supported by the Multimedia Processing Lab (MMLab) at
the University of Information Technology, VNU-HCM. We would also like to show our gratitude to the UIT-Together research group for sharing their pearls of wisdom with us during this research.

Fig. 9: Qualitative results of the RTFM method on testing videos. The pink window shows the ground truth anomalous region. In order from left to right and top to bottom: Traffic_Accident_051, Normal_088, Normal_074, Dog_Thief_033, Against_033, Stealing_051, Stealing_058, Unsporting_Behavior_008, and Fighting_100.

REFERENCES

[1] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and Daviv Reinitz. "Robust real-time unusual event detection using multiple fixed-location monitors". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 30.3 (2008), pp. 555-560.
[2] Joao Carreira and Andrew Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299-6308.
[3] Tal Hassner, Yossi Itcher, and Orit Kliper-Gross. "Violent flows: Real-time detection of violent crowd behavior". In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 2012, pp. 1-6.
[4] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. "Anomaly detection and localization in crowded scenes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 36.1 (2013), pp. 18-32.
[5] Cewu Lu, Jianping Shi, and Jiaya Jia. "Abnormal event detection at 150 fps in Matlab". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 2720-2727.
[6] Vinod Nair and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines". In: ICML. 2010.
[7] R. Raghavendra, A.D. Bue, and M. Cristani. Unusual crowd activity dataset of University of Minnesota. 2006.
[8] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929-1958.
[9] Waqas Sultani, Chen Chen, and Mubarak Shah. "Real-world anomaly detection in surveillance videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6479-6488.
[10] Yu Tian et al. "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning". In: arXiv preprint arXiv:2101.10030 (2021).
[11] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489-4497.
[12] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 7794-7803.
[13] Fisher Yu and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions". In: arXiv preprint arXiv:1511.07122 (2015).