2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Guided Anchoring Cascade R-CNN: An Intensive Improvement of R-CNN in Vietnamese Document Detection

Hai Le, Truong Nguyen, Vy Le, Thuan Trong Nguyen, Nguyen D. Vo, Khang Nguyen
University of Information Technology, Vietnam National University Ho Chi Minh City, Viet Nam
{20520481, 20522087, 20520355, 18521471}@gm.uit.edu.vn, nguyenvd@uit.edu.vn, khangnttm@uit.edu.vn

Abstract—Along with the development of the world, digital documents are gradually replacing paper documents. Therefore, the need to extract information from digital documents is increasing and has become one of the main interests in computer vision, particularly reading comprehension of document images. Detecting objects in document images (figures, tables, formulas) is one of the premise problems for analyzing and extracting information from documents. Previous studies have mostly focused on English documents. In this study, we experiment on the Vietnamese document image dataset UIT-DODV, which includes four classes: Table, Figure, Caption, and Formula. We test common state-of-the-art object detection models such as Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, and achieve the highest result of 73.6% mAP with Guided Anchoring. Because we assume that high-quality anchor boxes are key to the success of an anchor-based object detector, we adopt Guided Anchoring in our research. Moreover, we attempt to raise the quality of the predicted bounding boxes by utilizing the Cascade R-CNN architecture, whose multi-stage scheme allows us to filter out as many confusing bounding boxes as possible. Based on the initial evaluation results of the common state-of-the-art models, we propose an object detection model for Vietnamese document images based on Cascade R-CNN and Guided Anchoring. Our proposed model achieves up to 76.6% mAP, 2.1% higher than the baseline model on the UIT-DODV dataset.

Index Terms—Document Object Detection, Vietnamese Document Images, Cascade R-CNN, Guided Anchoring

I. INTRODUCTION

The document digitization process has been taking place in many organizations and businesses since the start of the Industry 4.0 era. Traditional documents such as papers, books, and invoices are gradually being transformed and replaced by digital documents (PDF, Word, Excel) stored on cloud computing services for convenient access, searching, and archiving. With such an amount of documents, document search becomes more difficult than ever. Thus, a good model to identify the elements in document images is necessary, and we cast this task as an object detection problem. Document Object Detection (DOD) [1] [8] [16] aims at the automatic detection of important elements (Caption, Table, Figure, Formula) and of the structure of the document page (Fig. 1). Current document detection models [13] [20] [11] often target common languages such as English and Mandarin Chinese. However, Vietnamese documents [4] [9] pose many challenges due to their different presentation. In this research, we focus on improving the performance of document object detection in Vietnamese document images. In particular, we experiment with the state-of-the-art methods Double-Head R-CNN [19], Libra R-CNN [10], and Guided Anchoring [17] on the UIT-DODV dataset [4], the first Vietnamese document image dataset annotated with Caption, Table, Figure, and Formula objects. The main feature of UIT-DODV is Vietnamese documents, which bring many new challenges; for example, the presentation of semantic objects complicates feature extraction, and formulas appear not only in standard mathematical form but also in non-mathematical forms. We experiment with common object detection models on the UIT-DODV dataset and achieve an initial result of 73.6% mAP with Guided Anchoring. Based on an analysis of the performance of these models, we propose an object detection model based on Cascade R-CNN and Guided Anchoring. Our proposal achieves up to 76.6% mAP, which is 2.1% higher than the baseline presented in [4].

The rest of the paper is organized as follows: Section II reviews related work; Section III introduces the experimental methods; Section IV presents the experiments and analyzes the obtained results; the paper is concluded and future research directions are addressed in Section V.

Fig. 1: Document object detection. The input is the document image and the output is the likely object locations in the image; the green (Table), blue (Figure), red (Caption), and yellow (Formula) bounding boxes indicate detected objects.

II. RELATED WORK

A. One-stage Object Detection

Duan et al. [5] proposed CenterNet, which reformulates object detection as keypoint estimation, from which the object size is deduced and the bounding box is computed. FCOS [15] is a fully convolutional one-stage object detector that solves detection per pixel. FCOS uses multi-level prediction to improve recall and resolve overlapping bounding boxes, and adds a "centerness" branch to remove low-quality, off-center bounding boxes and improve overall performance. YOLOv4 [18] provides a series of speed improvements over YOLOv3 [14], combining a CSPNet architecture with the Darknet-53 backbone used in YOLOv3.

B. Two-stage Object Detection

Faster R-CNN [12] adds a Region Proposal Network (RPN) in place of Selective Search to extract regions that potentially contain objects, and then proceeds similarly to Fast R-CNN while being much faster and trainable end to end. FPN [6], proposed by Lin et al., uses a top-down architecture combined with lateral connections so that the network takes full advantage of high-level semantic feature maps at all scales; as a result, FPN has shown remarkable improvements in research on and applications of pyramid features. Cai and Vasconcelos [2] proposed Cascade R-CNN as a high-quality object detector with different heads used at different stages, each designed for a specific IoU threshold, from small to large. Cascade R-CNN reduces overfitting during training and reduces the quality mismatch at inference time.
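To make the cascade idea more concrete, the sketch below shows a minimal multi-stage head in PyTorch-style code, assuming the IoU thresholds 0.5/0.6/0.7 used in the original Cascade R-CNN paper. The linear heads, the feature dimension, and the apply_deltas helper are illustrative stand-ins rather than any implementation used in this paper, and a real cascade would re-pool RoI features with RoIAlign after every stage.

```python
import torch
import torch.nn as nn


def apply_deltas(boxes, deltas):
    """Shift and scale (x1, y1, x2, y2) boxes by predicted (dx, dy, dw, dh) deltas."""
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w + deltas[:, 0] * w
    cy = boxes[:, 1] + 0.5 * h + deltas[:, 1] * h
    nw = w * torch.exp(deltas[:, 2])
    nh = h * torch.exp(deltas[:, 3])
    return torch.stack([cx - nw / 2, cy - nh / 2, cx + nw / 2, cy + nh / 2], dim=1)


class CascadeStages(nn.Module):
    """Schematic cascade: each stage refines the previous stage's boxes and is
    trained against a progressively stricter IoU threshold (0.5 -> 0.6 -> 0.7)."""

    def __init__(self, feat_dim=256, num_classes=5, iou_thresholds=(0.5, 0.6, 0.7)):
        super().__init__()
        self.iou_thresholds = iou_thresholds
        self.cls_heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in iou_thresholds])
        self.reg_heads = nn.ModuleList(
            [nn.Linear(feat_dim, 4) for _ in iou_thresholds])

    def forward(self, roi_feats, boxes):
        # roi_feats: (N, feat_dim) pooled features per box; boxes: (N, 4).
        # For brevity the same pooled features are reused by every stage.
        outputs = []
        for cls_head, reg_head, thr in zip(self.cls_heads, self.reg_heads,
                                           self.iou_thresholds):
            scores = cls_head(roi_feats)
            boxes = apply_deltas(boxes, reg_head(roi_feats))  # refined boxes feed the next stage
            outputs.append((scores, boxes, thr))  # thr only assigns labels at train time
        return outputs
```

Training each stage with a stricter threshold means that later stages only see relatively good proposals, which is exactly the property the proposed model in Section III-B relies on to filter out low-quality bounding boxes.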
III. METHOD

A. Previous methods for improving R-CNN

1) Double-Head R-CNN: Double-Head R-CNN [19] is deployed on top of a Feature Pyramid Network (FPN), as presented in Fig. 2. Contrary to previous methods, which use only a single head on the region of interest (RoI) features for both the classification and bounding-box regression problems, the Double-Head R-CNN architecture is divided into two dedicated heads, one for classification and one for localization.

Fig. 2: Differences between single head and double head: (a) a single fully connected (2-fc) head; (b) a single convolution head; (c) Double-Head, which is separated into a fully connected head and a convolution head for the classification and localization tasks, respectively; and (d) Double-Head-Ext, an extension that adds supervision from the unfocused tasks during training and leverages classification scores from both heads for the final prediction [19].

Wu et al. suggested that the fc-head handles the classification task remarkably better than the conv-head and, conversely, that the conv-head handles bounding-box regression better than the fc-head. Thus, a double-head architecture (a combination of conv-head and fc-head) creates a premise for outstanding results in object detection. Consequently, the double-head architecture can fully exploit the strengths of the fc-head and the conv-head for the classification and localization problems, respectively. Moreover, in the Double-Head-Ext architecture, the fc-head and conv-head support each other on the classification problem to obtain the best result.
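A rough sketch of this two-head split is given below, assuming 7×7 pooled RoI features; the layer widths are illustrative only, and the paper's conv-head uses residual blocks rather than the plain convolutions shown here.

```python
import torch.nn as nn


class DoubleHead(nn.Module):
    """Schematic Double-Head: an fc-head for classification and a conv-head
    for bounding-box regression, both operating on the same pooled RoI features."""

    def __init__(self, in_channels=256, roi_size=7, num_classes=5):
        super().__init__()
        # fc-head: flattens the RoI and classifies it.
        self.fc_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * roi_size * roi_size, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.cls_score = nn.Linear(1024, num_classes)
        # conv-head: keeps the spatial layout and regresses the box.
        self.conv_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.bbox_pred = nn.Linear(in_channels, 4)

    def forward(self, roi_feats):  # roi_feats: (N, in_channels, roi_size, roi_size)
        cls_logits = self.cls_score(self.fc_head(roi_feats))
        box_deltas = self.bbox_pred(self.conv_head(roi_feats))
        return cls_logits, box_deltas
```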
2) Libra R-CNN: Pang et al. [10] proposed Libra R-CNN, based on FPN Faster R-CNN, which attempts to balance the training process. Libra R-CNN proved to be highly efficient, as the design can operate on various backbones for both single-stage and two-stage detectors. As observed in several previous methods, the imbalance that limits object detection performance generally occurs at three levels: sample level, feature level, and objective level, and the Libra R-CNN architecture addresses all three. At the sample level, Pang et al. designed a probability function to pick a certain number of negative samples from the total pool of corresponding candidates. At the feature level, the imbalance occurs because the amount of information obtained at each layer is not equal, leading to differences in information and features between levels; for that reason, Pang et al. balance the information corresponding to each layer of the pyramid architecture to capture balanced semantic features through some transformations, as shown in Fig. 3.

Fig. 3: Pipeline and heatmap visualization of the balanced feature pyramid [10].

In addition, the authors apply a non-local module to refine the obtained features. Finally, a balanced L1 loss is proposed to balance the effects of the two tasks, classification and localization, and the gradient contribution of the obtained samples.
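For reference, the sketch below implements the balanced L1 loss following the formulation in [10], with the default α = 0.5 and γ = 1.5 reported there; per-sample weighting and reduction options are omitted, so it is illustrative rather than the exact training loss used in our experiments.

```python
import math
import torch


def balanced_l1_loss(pred, target, alpha=0.5, gamma=1.5, beta=1.0):
    """Balanced L1 loss from Libra R-CNN: small regression errors receive a
    promoted but bounded gradient, large errors keep a clipped L1-like gradient."""
    b = math.exp(gamma / alpha) - 1  # chosen so the gradient is continuous at |x| = beta
    diff = torch.abs(pred - target)
    loss = torch.where(
        diff < beta,
        alpha / b * (b * diff + 1) * torch.log(b * diff / beta + 1) - alpha * diff,
        gamma * diff + gamma / b - alpha * beta,
    )
    return loss.sum()
```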
3) Guided Anchoring: Guided Anchoring [17] is built to improve the region proposal generation process; hence it takes the Region Proposal Network (RPN) as its baseline. The design predicts the locations where the centers of objects of interest potentially exist, together with the scales and aspect ratios at those locations. In particular, Guided Anchoring gets rid of anchor generation with default parameters; in other words, an anchor's scale and aspect ratio can now vary dynamically instead of being fixed as before, which creates diversity during the training process. Besides, Wang et al. also study the influence of high-quality proposals on two-stage detectors.

Fig. 4: Basic framework of Guided Anchoring [17].

In the Guided Anchoring method, Wang et al. apply an anchor generation module that contains two branches for location and anchor-shape prediction, as shown in Fig. 4. The anchor location prediction branch predicts the anchor's location by creating a probability map, from which the likely locations of objects in the image can easily be read off. The anchor shape prediction branch then leverages the (x, y) coordinates of the locations generated in the previous step to predict the (w, h) values representing the width and height of the anchor, respectively. Thereby, the model can generate anchor shapes that most closely match the ground-truth bounding boxes. However, the authors empirically found that predicting these two values directly is not stable, so certain transformations are applied, using a sub-network with a 1×1 convolution to generate an appropriate map.
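A simplified sketch of this two-branch module is shown below; the stride and the σ scale factor are illustrative assumptions, and the deformable-convolution feature adaptation step of [17] is omitted, so this is only a schematic of the idea rather than the exact module.

```python
import torch
import torch.nn as nn


class GuidedAnchorHead(nn.Module):
    """Schematic Guided Anchoring module with two 1x1-conv branches: one predicts
    where object centers are likely to be, the other predicts an anchor width and
    height for every location, instead of enumerating fixed anchors."""

    def __init__(self, in_channels=256, stride=16, sigma=8.0):
        super().__init__()
        self.loc_branch = nn.Conv2d(in_channels, 1, kernel_size=1)    # location probability map
        self.shape_branch = nn.Conv2d(in_channels, 2, kernel_size=1)  # (dw, dh) per location
        self.stride = stride
        self.sigma = sigma  # scale factor mapping (dw, dh) to pixel sizes

    def forward(self, feat):
        # feat: (B, C, H, W) feature map from one FPN level.
        loc_prob = torch.sigmoid(self.loc_branch(feat))  # (B, 1, H, W)
        dw_dh = self.shape_branch(feat)                  # (B, 2, H, W)
        # The exp transform keeps width/height positive and covers a wide size range.
        anchor_wh = self.sigma * self.stride * torch.exp(dw_dh)
        return loc_prob, anchor_wh
```

Because anchors are only kept where the predicted location probability is high, far fewer and better-fitting proposals are produced than with a dense RPN, which is what Fig. 6 later illustrates.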
B. Proposed Method

As mentioned in Section III-A, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring are architectures added to improve the results of the Faster R-CNN model. These architectures can therefore be considered modules that can easily be attached to other detectors; in the original works, the authors used Faster R-CNN. In addition, Cascade R-CNN is a multi-stage extension of R-CNN in which the detector stages are applied sequentially, thereby addressing the limitations of Faster R-CNN discussed in [2].

Fig. 5: Guided Anchoring Cascade R-CNN architecture.

Thanks to the impressive anchor generation in Guided Anchoring, we can fully leverage semantic features to guide the anchoring. Moreover, we also aim to improve the training phase by fully exploiting the potential of Cascade R-CNN [2], which can filter out most poor bounding boxes and reduce the overfitting problem. Therefore, we design a model based on the Cascade R-CNN architecture combined with Guided Anchoring to improve object detection performance on document images. Recognizing the power of multi-level features, we use FPN in this method to capture the valuable information represented in feature maps whose dimensions correspond to each FPN level. Before that, we use ResNeXt-101 as our feature extractor; the overall architecture is described in Fig. 5. Wang et al. introduced Guided Anchoring, which reduces the number of anchor boxes generated compared with previous methods, specifically RPN, as shown in Fig. 6, thereby reducing cost and significantly increasing processing speed during training. We then feed these proposals into the Cascade R-CNN network.

Fig. 6: Comparison between RPN proposals (top row) and GA-RPN proposals (bottom row) [17].
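Putting the pieces together, the inference flow of the proposed detector can be sketched roughly as below, reusing the GuidedAnchorHead and CascadeStages sketches above; the location threshold and the zero-filled RoI features are placeholders only, since a real implementation (ours is built on MMDetection) would pool features with RoIAlign and re-pool them at every cascade stage.

```python
import torch


def propose_and_refine(feat, ga_head, cascade, loc_threshold=0.1):
    """Schematic inference flow of Guided Anchoring Cascade R-CNN on one FPN level.

    feat    : (1, C, H, W) feature map from the ResNeXt-101 + FPN backbone
    ga_head : GuidedAnchorHead sketch from Section III-A
    cascade : CascadeStages sketch from Section II-B
    """
    loc_prob, anchor_wh = ga_head(feat)
    # 1) Keep only locations where an object center is likely -> sparse, adaptive proposals.
    ys, xs = (loc_prob[0, 0] > loc_threshold).nonzero(as_tuple=True)
    cx = (xs.float() + 0.5) * ga_head.stride
    cy = (ys.float() + 0.5) * ga_head.stride
    w = anchor_wh[0, 0, ys, xs]
    h = anchor_wh[0, 1, ys, xs]
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    # 2) Pool RoI features for the proposals; a zero tensor stands in for RoIAlign
    #    here just to keep the sketch self-contained.
    roi_feats = torch.zeros(boxes.shape[0], 256)
    # 3) Three cascade stages progressively refine the boxes (IoU 0.5 / 0.6 / 0.7).
    return cascade(roi_feats, boxes)
```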
IV. EXPERIMENT

A. Experiment Settings

Dataset: We conduct experiments on the UIT-DODV dataset, which includes 1,440 images for the training set, 234 images for the validation set, and 720 images for the testing set.

Configuration: We run experiments on two RTX 2080 Ti GPUs using the MMDetection toolbox [3].

Evaluation Metric: We aim to compare with the baseline results on the UIT-DODV dataset; therefore, mAP (mean Average Precision) [7] is used for evaluation. The mAP score is the average of AP over all classes: for each class, AP is calculated at different IoU thresholds and averaged. We also report the AP50 and AP75 metrics, computed at IoU thresholds of 0.5 and 0.75, respectively.
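Because mAP here follows the COCO convention [7], scores in the style of Table I can be reproduced with the standard COCO evaluator once the ground truth and the detections are exported to COCO-format JSON; the file names below are placeholders, and MMDetection's own COCO-style evaluation is built on the same library.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names: COCO-format ground truth and detection results.
coco_gt = COCO("uit_dodv_test_annotations.json")
coco_dt = coco_gt.loadRes("guided_anchoring_cascade_rcnn_results.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[.5:.95] (mAP), AP@.5, and AP@.75, among others
```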
B. Analysis

First of all, we evaluated the effectiveness of three models, Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, on the AP50, AP75, and mAP metrics. Our empirical results are summarized in Table I. In general, among the common object detection methods, the best results came from Guided Anchoring, with AP50, AP75, and mAP of 91.0%, 80.8%, and 73.6%, respectively. In contrast, Double-Head R-CNN achieved the lowest results, with AP50 and AP75 of 88.7% and 78.4%, and mAP of 71.0%.

Next, we analyzed the performance on each class. Guided Anchoring yielded the highest accuracy among the three methods on every class of the dataset, with per-class AP of 92.7%, 81.6%, 73.3%, and 46.9% for the 'Table', 'Figure', 'Caption', and 'Formula' classes, respectively. Meanwhile, Double-Head R-CNN only reached 91.5%, 80.6%, 65.6%, and 46.3%, clearly losing by a large margin. The performance of Double-Head R-CNN is not necessarily poor; however, compared with Guided Anchoring Cascade R-CNN, Guided Anchoring clearly brings many noteworthy improvements: the features used during training are filtered carefully and the quality of the anchors is assured, so the results of Guided Anchoring prove extremely convincing.

The initial results of the default Guided Anchoring model on the Faster R-CNN baseline, like those of Double-Head and Libra R-CNN, showed outstanding efficiency. Thus, we propose using Cascade R-CNN alongside Guided Anchoring, as described in Section III-B. The results show that our proposed model brings exceptional efficiency, with mAP up to 76.6%, which is 2.1% higher than the baseline published by Dieu et al. [4], whose CascadeTabNet with Fused Loss achieves only 74.5%.

TABLE I: Experimental results on the UIT-DODV dataset (AP in %).

Method                                | Table | Figure | Caption | Formula | AP@.5 | AP@.75 | mAP
Double-Head R-CNN                     | 91.5  | 80.6   | 65.6    | 46.3    | 88.7  | 78.4   | 71.0
Libra R-CNN                           | 92.5  | 81.0   | 68.2    | 46.0    | 89.6  | 79.1   | 71.9
Guided Anchoring Faster R-CNN         | 92.7  | 81.6   | 73.3    | 46.9    | 91.0  | 80.8   | 73.6
CascadeTabNet + Fused Loss [4]        | 94.3  | 83.0   | 73.3    | 47.5    | 89.1  | 81.6   | 74.5
Guided Anchoring Cascade R-CNN (Ours) | 95.4  | 84.8   | 75.9    | 50.5    | 91.8  | 83.1   | 76.6

We then visualize the prediction results of the three methods Double-Head R-CNN, Libra R-CNN, and Guided Anchoring in Fig. 7. The 'Table' class still has false predictions with Double-Head R-CNN (Fig. 7a) and Libra R-CNN (Fig. 7b), while Guided Anchoring (Fig. 7c) does an excellent job. Moreover, the characteristics of the 'Figure' and 'Caption' classes cause many difficulties for the prediction process of Double-Head R-CNN. Although Libra R-CNN identifies 'Figure' and 'Caption' well, it mispredicts 'Formula' objects. Overall, Guided Anchoring possesses the strengths of the other two methods and predicts all four classes correctly, as can be observed in the visualization.

Fig. 7: Visualization of the prediction results of (a) Double-Head R-CNN, (b) Libra R-CNN, and (c) Guided Anchoring on the UIT-DODV dataset (green - Table, blue - Figure, red - Caption, yellow - Formula).

In addition, we evaluate the effectiveness of our proposed Guided Anchoring Cascade R-CNN model on the UIT-DODV dataset through the results visualized in Fig. 8, compared with the baseline model; the baseline model using Guided Anchoring is built on Faster R-CNN. Both methods, Faster R-CNN (Fig. 8a) and Cascade R-CNN (Fig. 8b), were run experimentally in combination with Double-Head R-CNN and yielded objective results. Faster R-CNN performed very well on the object detection problem in all four classes; however, there were still overlapping boxes in the 'Table' class, a problem that affects model accuracy. Meanwhile, this problem is completely solved by the Cascade R-CNN method. This shows that Cascade R-CNN handles the characteristics of document objects in general and of the UIT-DODV dataset in particular.

Fig. 8: Comparison between (a) Faster R-CNN and (b) Cascade R-CNN using Double-Head R-CNN (green - Table, blue - Figure, red - Caption, yellow - Formula).

V. CONCLUSION AND FUTURE WORK

Realizing that the demand for information extraction and document data analysis is increasing, we conducted in-depth studies of the object detection problem on the Vietnamese dataset UIT-DODV. After experimenting with state-of-the-art models such as Double-Head R-CNN, Libra R-CNN, and Guided Anchoring, we obtained the highest result with Guided Anchoring at 73.6% mAP. With this premise, we proposed the Guided Anchoring Cascade R-CNN object detection model, a combination of the two methods Guided Anchoring and Cascade R-CNN. Our proposed model reaches 76.6% mAP, higher than the baseline model on the UIT-DODV dataset by 2.1%.

ACKNOWLEDGMENT

This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number DSC2021-26-03. This work was supported by the MMLab at the University of Information Technology, VNU-HCM. We would also like to show our gratitude to the UIT-Together research group for sharing their pearls of wisdom with us during this research.

REFERENCES

[1] Jwalin Bhatt, Khurram Azeem Hashmi, Muhammad Zeshan Afzal, and Didier Stricker, "A Survey of Graphical Page Object Detection with Deep Neural Networks," Applied Sciences, vol. 11, no. 12, 2021. DOI: 10.3390/app11125344. URL: https://www.mdpi.com/2076-3417/11/12/5344.
[2] Zhaowei Cai and Nuno Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
[3] Kai Chen et al., "MMDetection: Open MMLab Detection Toolbox and Benchmark," arXiv preprint arXiv:1906.07155, 2019.
[4] Linh Truong Dieu, Thuan Trong Nguyen, Nguyen D. Vo, Tam V. Nguyen, and Khang Nguyen, "Parsing Digitized Vietnamese Paper Documents," in Computer Analysis of Images and Patterns, Cham: Springer International Publishing, 2021, pp. 382–392.
[5] Kaiwen Duan et al., "CenterNet: Keypoint triplets for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6569–6578.
[6] Tsung-Yi Lin et al., "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[7] Tsung-Yi Lin et al., "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[8] Duong Phi Long, Nguyen Trung Hieu, Nguyen Thanh Tuong Vi, Vo Duy Nguyen, and Nguyen Tan Tran Minh Khang, "Phát hiện bảng tài liệu dạng ảnh sử dụng phương pháp định vị góc CornerNet" (Table detection in document images using the CornerNet corner localization method), in Proceedings of Fundamental and Applied Information Technology Research (FAIR), 2020.
[9] Thuan Trong Nguyen, Thuan Q. Nguyen, Long Duong, Nguyen D. Vo, and Khang Nguyen, "CDeRSNet: Towards High Performance Object Detection in Vietnamese Document Images," in International Conference on Multimedia Modelling (MMM), 2022.
[10] Jiangmiao Pang et al., "Libra R-CNN: Towards balanced learning for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 821–830.
[11] Bui Hai Phong, Thang Manh Hoang, and Thi-Lan Le, "An end-to-end framework for the detection of mathematical expressions in scientific document images," Expert Systems, 2021, e12800.
[12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," 2016. arXiv: 1506.01497 [cs.CV].
[13] Ningning Sun, Yuanping Zhu, and Xiaoming Hu, "Table Detection Using Boundary Refining via Corner Locating," in Pattern Recognition and Computer Vision, Zhouchen Lin et al. (Eds.), Cham: Springer International Publishing, 2019, pp. 135–146. ISBN: 978-3-030-31654-9.
[14] Yunong Tian et al., "Apple detection during different growth stages in orchards using the improved YOLOV3 model," Computers and Electronics in Agriculture, vol. 157, 2019, pp. 417–426.
[15] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, "FCOS: Fully convolutional one-stage object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.
[16] Nguyen D. Vo, Khanh Nguyen, Tam V. Nguyen, and Khang Nguyen, "Ensemble of deep object detectors for page object detection," in Proceedings of the 12th International Conference on Ubiquitous Information Management and Communication, 2018, pp. 1–6.
[17] Jiaqi Wang, Kai Chen, Shuo Yang, Chen Change Loy, and Dahua Lin, "Region proposal by guided anchoring," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2965–2974.
[18] Dihua Wu, Shuaichao Lv, Mei Jiang, and Huaibo Song, "Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments," Computers and Electronics in Agriculture, vol. 178, 2020, p. 105742.
[19] Yue Wu et al., "Rethinking classification and localization for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10186–10195.
[20] Junaid Younas et al., "FFD: Figure and formula detection from document images," in 2019 Digital Image Computing: Techniques and Applications (DICTA), IEEE, 2019, pp. 1–7.
