Developing a pipeline for table extraction in document images

VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
HO CHI MINH UNIVERSITY OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

GRADUATION THESIS

DEVELOPING A PIPELINE FOR TABLE EXTRACTION IN DOCUMENT IMAGES

MAJOR: COMPUTER SCIENCE

THESIS COMMITTEE: COMPUTER SCIENCE
SUPERVISORS: Dr. Tran Tuan Anh, Mr. Nguyen Nam Quan, Dr. Nguyen Tien Thinh
REVIEWER: Dr. Le Thanh Sach
STUDENT: Lu Anh Khoa (1852112)

HO CHI MINH CITY, 02/2023

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science and Engineering
DEPARTMENT: Computer Science

GRADUATION THESIS ASSIGNMENT

Note: the student must attach this sheet to the first page of the thesis report.

FULL NAME: Lu Anh Khoa
STUDENT ID: 1852112
MAJOR: Computer Science

Thesis title: Developing a pipeline for table extraction in the document image

Task (requirements on content and initial data):
Tables are among the most important components of document images; they appear in almost every report and document. Table extraction is also one of the topics of greatest concern in digital transformation today. Table recognition and comprehension require many techniques and much research, and they remain a major challenge for scientists and industrial applications. This thesis focuses on studying the problem of table information extraction within a complete document image processing system. The main tasks of this research are:
- Build a comprehensive pipeline for table extraction in document images
- Develop a model for detecting table regions in document images
- Develop a classification model to classify each detected table as a borderless table or a bordered table, for extracting the table's cells
- Benchmark the models with public/private datasets

Assignment date: 20/12/2021
Completion date: 20/12/2022

Supervisors: 1) Tran Tuan Anh, 2) Nguyen Nam Quan, 3) Nguyen Tien Thinh
Guidance areas listed on the form: development direction, integration, testing and evaluation; model and technology development, data collection; content, presentation, structure and research methodology.

The content and requirements of the thesis have been approved by the Department.
HEAD OF DEPARTMENT (signature and full name)
MAIN SUPERVISOR (signature and full name)

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
26 December 2022

THESIS DEFENSE EVALUATION SHEET (for the supervisor)

Student: Lu Anh Khoa
Major: Computer Science
Student ID: 1852112
Title: Developing a pipeline for table extraction in the document image
Supervisor: Tran Tuan Anh

Main strengths of the thesis:
- This thesis presents a pipeline for table extraction in document images. The student also develops a model for detecting table regions in the document image and a classification model to classify each detected table as a borderless table or a bordered table, for extracting the table's cells.
- The presented pipeline is also used in many applications in an industrial environment.
- This thesis has researched many different methods and approaches and made appropriate assessments and analyses.
- The thesis has also built some practical data and used it in the evaluation; constructing diverse data is very necessary.
- The student has carried out good experiments, evaluations, and demos.

Main shortcomings of the thesis:
- This thesis is inclined towards application, so the direction of research development has not been made clear.
- The overview of related works has not been fully detailed, nor is the assessment comprehensive.

Recommendation (options on the form): approved for defense / additions required before defense / not approved for defense
Questions the student must answer before the committee:
- Make clear the future works, including research/pipeline development
Overall evaluation (excellent, good, average): Excellent
Score: 9.3/10
Signed: Tran Tuan Anh

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
28 December 2022

THESIS DEFENSE EVALUATION SHEET (for the reviewer)

Student: Lu Anh Khoa
Major: Computer Science
Student ID: 1852112
Title: Developing a pipeline for table extraction in document images
Reviewer: Dr. Le Thanh Sach

Main strengths of the thesis:
- The author has a strong background in deep learning and its applications for computer vision.
- The author has proposed a pipeline for table extraction in document images and focused on (implemented) two main tasks inside it: table detection and table classification.
- Table detection consists of two steps: detecting tables using the YOLOv7 model and post-processing the detected tables with morphology and connected-component analysis.
- Table classification: classifying extracted tables into two classes, bordered vs borderless, using the MobileNetV3 (Large) model.
- The proposed method can produce accurate results and is comparable with other methods in table extraction.

Main shortcomings of the thesis:
The thesis needs to be re-written:
- to add a survey of related works in the research fields (table detection and classification)
- to add a detailed explanation of the proposed method and its results (quantitative and qualitative)
- to add an evaluation of the processing time

Recommendation (options on the form): approved for defense / additions required before defense / not approved for defense
Overall evaluation (excellent, good, average): Excellent
Score: /10
Signed: Le Thanh Sach

Ho Chi Minh University of Technology
Faculty of Computer Science and Engineering

DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan, and that the work and the results contained in it are truthful and have not violated research ethics. The data and figures presented in this thesis for analysis, comments, and evaluations come from various sources and from my own work, and have been fully acknowledged in the reference part. In addition, comments, reviews, and data from other authors and organizations have been acknowledged and explicitly cited. I will take full responsibility for any fraud detected in my thesis.

HO CHI MINH CITY, Dec 2022
Author

Thesis, Semester 1, Academic year 2022 - 2023

ACKNOWLEDGMENT

I would like to acknowledge and give my warmest thanks to my advisors, Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan, who made this work possible. They are the ones who laid the first bricks of my scientific career. I would also like to thank all of the instructors of Ho Chi Minh University of Technology, who have given me motivation, encouragement, and precious knowledge during the long road of my university life. Last but not least, I would like to thank my family, who are always there for me and have supported me throughout my life.

ABSTRACT

The "Developing a pipeline for table extraction in document images" research topic aims to develop a system that extracts tabular regions from scanned or captured document images (invoices, reports, research papers, ...) with high accuracy and reasonable response time. This thesis proposes a pipeline consisting of several steps to detect and classify tabular regions and to extract data from them.

Contents

1 Introduction
1.1 The need for table extraction
1.2 The goal
2 Background knowledge
2.1 Convolution and cross-correlation in image processing
2.2 Convolutional neural network (CNN)
2.2.1 Building blocks
2.2.2 Hyperparameters
2.2.3 Regularization methods
2.3 Image segmentation
2.3.1 Thresholding
2.3.2 K-means clustering
2.3.3 Trainable segmentation
2.4 Object detection
2.5 Object detection - Metric
2.6 Faster R-CNN
2.6.1 RPN (Region Proposal Network)
2.6.2 Anchors
2.6.3 RoI pooling
2.6.4 Detection model
2.6.5 Loss function
2.7 YOLO model
2.8 YOLO approach for object detection
2.9 YOLOv1
2.10 YOLOv2
2.11 YOLOv3
2.12 YOLOv4
2.13 YOLOv5
2.14 YOLOv7 architecture
2.14.1 CSP-ize a block
2.14.2 Backbone
2.14.3 Neck
2.14.4 Head
2.15 YOLOv7 - training techniques
2.15.1 Label assignment: Simple Optimal Transport Assignment (SimOTA)
2.15.2 Augmentation: Mosaic
2.15.3 Augmentation: Mixup
2.16 YOLOv7 - Loss function
2.17 Image classification
2.18 Image classification - Metric
2.19 MobileNetV3
2.19.1 Depthwise Separable Convolutions
2.19.2 Inverted residual block
2.19.3 Squeeze and excite (SE)
3 Related works
3.1 Traditional methods
3.2 Convolutional Neural Networks (CNN)
4 Proposed method
4.1 Pipeline
4.2 Table detection
4.3 Postprocess for table detection
4.3.1 Why?
4.3.2 How?
4.3.3 Result
4.4 Table classification
4.5 Table structure recognition
4.6 Structure of a table
4.7 Bordered table - just use some image processing
4.7.1 Result
4.8 Putting everything together
5 Experiments
5.1 Datasets
5.2 Training process
5.3 Quantitative results
5.3.1 Table detection - mAP
5.3.2 Table detection - Weighted Average F1
5.3.3 Table classification - Model comparisons
5.3.4 Speed
5.4 Qualitative results
6 Conclusion
6.1 Achievements
6.2 Limitation
6.3 Future works

List of Figures

1 A typical CNN
2 Maxpool 2x2: an example of pooling layer
3 Some activation functions
4 Example of Dilated Convolution
5 Dropout visualization
6 Example of data augmentation
7 An example of image segmentation output
8 Different IOU values. Red bounding boxes are ground truths while the green ones are predictions
9 RCNN architecture
10 Faster RCNN architecture
11 RPN in Faster-RCNN
12 Anchors
13 Detection model in Faster-RCNN
14 YOLOv1 architecture
15 YOLOv2 architecture
16 YOLOv3 backbone
17 YOLOv3 architecture
18 Darknet53 vs CSPDarknet53
19 CSPResBlock
20 Left: Sample image, Center: DropOut, Right: DropBlock
21 PAN structure
22 SPP in YOLOv4
23 Hard label vs Smooth label
24 Cosine Annealing Learning rate
25 Groundtruth: green. Predict: black. Using L1 yields 9.07 for all cases but their IOUs differ by a large margin. IOU is also used to evaluate an object detection model, so using IOU as a loss is a logical improvement
26 YOLOv5 architecture
27 YOLOv7 architecture
28 CSP-ized block
29 ELAN block
30 Transition block
31 YOLOv7 backbone
32 CSP-OSA block
33 RepConv block
34 Implicit knowledge
35 YOLOv7 head, with implicit knowledge
36 An example of mosaic augmentation: images are "merged" together to create a new sample. Red bounding boxes annotate labels
37 Mixup augmentation example
38 Depthwise Convolution, visualized
39 Left: Normal convolution. Right: Depthwise separable convolution
40 Squeeze and excitation block
41 CascadeTabNet architecture from [23]
42 Overview
43 Before (left) and after (right). Notice that on the left the table does not have enough outer border lines
44 Examples of bordered tables
45 Examples of borderless tables
46 Original bordered table
47 After cell indexing, each cell has the form start row, start col, end row, end col
48 Excel result of figure 46. Note that tesseract cannot read the text in some cells
49 Correct detection on publaynet dataset
50 Correct detection on fintabnet dataset
51 Wrong cases: partial detection (top), missed tables (bottom). Green boxes are ground truth while red boxes are predictions

List of Tables

1 Data statistics for table detection
2 Data statistics for table classification
3 Results on the test set of each dataset, table detection
4 Comparison with previous participants of the ICDAR19 Competition on Table Detection and Recognition, track A2. Scores of other teams are taken directly from [11]
5 Comparisons between different classification models
6 Speed measurements

• The score of each channel is multiplied with the respective channel of feature map U to produce the final output

In mathematical notation, $F_{sq}(U)$ produces a vector $s$ of length $C$, in which each element $s_c$ is computed as follows:

$$s_c = \frac{1}{H \times W} \sum_{i=1}^{W} \sum_{j=1}^{H} U_c(i, j)$$

Then the vector $s$ goes through the excitation function $F_{ex}$ (some convolution layers, an activation function and a sigmoid gate):

$$s = F_{ex}(s, W) = \sigma(W_2 \, \mathrm{ReLU}(W_1 s))$$

And finally,
$F_{sc}$ scales the input feature map $U$ with the vector $s$ by multiplying channel-wise:

$$U_c = F_{sc}(U_c, s_c) = s_c \, U_c$$

3 Related works

3.1 Traditional methods

Traditional methods use geometric, hand-crafted features extracted from ruling lines, pixel distributions and blank gaps. One of the earlier works in the area of table analysis was done by Kieninger et al. [19]. In that paper, Kieninger used an algorithm to cluster the bounding boxes of words into blocks (block segmentation), then used post-processing techniques to correct those clusters by merging, splitting and other rules that handle edge cases. Harit et al. [16] relied on separators such as ruling lines, the thickness of white-space separators, background colors, and horizontal and vertical lines to identify the start and end patterns of a tabular region. Tran et al. [6] consider table detection as a part of page segmentation. After separating text and non-text regions using a minimum homogeneity algorithm and adaptive mathematical morphology, a set of rules is applied to determine tabular regions among the non-text components.

3.2 Convolutional Neural Networks (CNN)

Since AlexNet won the ImageNet competition in 2012, CNNs have surged as a superior method for image tasks. Over the years, CNN models have become larger and stronger, and they remain among the strongest approaches for most image tasks. For table extraction, object detection is the most common approach. With the improvement of object detection algorithms from Fast R-CNN [15] to Faster R-CNN [27], tables came to be treated as objects in document images. Gilani et al. [14] employed a deep learning method on images to detect tables. The technique involves image transformation as a pre-processing step, followed by table detection. In the image transformation part, the Euclidean distance transform, linear distance transform, and max distance transform are applied to the blue, green and red channels of the image, respectively. Gilani et al. [14] used Faster R-CNN [27] as the core of the table detection part. The backbone of their Region Proposal Network (RPN) is based on ZFNet [36]. Their approach was able to beat the state-of-the-art results on the UNLV [29] dataset. DeepDeSRT [28] not only detects the tabular region but also recognizes the structure of the table by identifying rows, columns and cells. For the table detection step, transfer learning is performed by fine-tuning a pre-trained Faster R-CNN model. They experimented with ZFNet [36] and the much deeper VGG-16 [30]. The model is pre-trained on the Pascal VOC dataset [10]. In the TableBank paper [21], Faster R-CNN [27] is used as a baseline for table detection. Prasad et al. [23] proposed CascadeTabNet, an end-to-end system for table detection and table structure recognition. The model is based on the structure of the Cascade Mask R-CNN model [8]. It uses HRNet [35] as the backbone to extract features, as HRNet can maintain features from the high-resolution original image, and then feeds these features in a cascade manner into multiple heads (region proposal head, mask head, bounding box head), so every head receives information from the backbone as well as the output of the head before it. The model can detect tabular regions and also classify them into bordered tables and borderless tables (see Figure 41).

Figure 41: CascadeTabNet architecture from [23]

4 Proposed method

4.1 Pipeline

The pipeline consists of several steps: first, we detect table regions; then we classify each region as bordered or borderless; finally, we apply the recognition method for that type of table. A summary can be seen in Figure 42.

Figure 42: Overview

4.2 Table detection

To detect tables in images, we frame the task as an object detection problem in which tables are the objects. The most convenient approach is therefore to use a CNN object detection model. In this thesis, YOLOv7 is used as the main detector for this stage.

4.3 Postprocess for table detection

4.3.1 Why?

When observing the results of the detection model, we see that for bordered tables the outer border lines are usually missing from the detected region, causing the classification step to perform worse.

4.3.2 How?

For this part, we post-process only bordered tables, without affecting borderless tables. Following the method in [6] and the idea of non-maximum suppression (NMS), the steps can be summarized as follows:
• Binarize the image
• Use connected component analysis (CCA) and the heuristics in [6] to determine table regions
• Match the predictions from the detection model with the regions from step 2, from the highest to the lowest confidence, based on IOU

Algorithm 2: Table detection post process
Input:
    Ta: the set of tables from the output of CCA, from [6]
    Tm: the set of predictions from the deep learning model, sorted in descending order of confidence
    rl, rr, ru, rd: expand ratios for left, right, up and down, respectively
    iouThresh: IOU threshold for accepting CCA tables
Result: Tm, the final set of tables, which will be the output of the table detection module
for t1 ∈ Tm
    w ← t1.x2 − t1.x1
    h ← t1.y2 − t1.y1
    tm ← t1
    tm.x1 ← t1.x1 − w × rl
    tm.x2 ← t1.x2 + w × rr
    tm.y1 ← t1.y1 − h × ru
    tm.y2 ← t1.y2 + h × rd
    t2 ← argmax over tk ∈ Ta of IOU(tm, tk)
    if IOU(t1, t2) ≥ iouThresh then
        t1 ← t2
        Ta ← Ta \ {t2}
    end
end

The hyperparameters rl, rr, ru, rd and iouThresh are set to 0.2, 0.2, 0.2, 0.2 and 0.8, respectively. The theoretical complexity of the algorithm is O(|Ta| × |Tm|). In practice, the maximum number of tables in an image across all mentioned datasets does not exceed 10, so the algorithm runs fast enough.

4.3.3 Result

The results before and after post-processing can be seen in Figure 43.

Figure 43: Before (left) and after (right). Notice that on the left the table does not have enough outer border lines.
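As an illustration of Algorithm 2, the matching step could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the `Box` class, the function names `iou` and `refine_tables`, and the choice to keep the detector's confidence on a replaced box are all hypothetical, and boxes are assumed to be axis-aligned with (x1, y1) as the top-left corner.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; `conf` is the detector confidence (hypothetical field)."""
    x1: float
    y1: float
    x2: float
    y2: float
    conf: float = 1.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes (0.0 when they do not overlap)."""
    iw = min(a.x2, b.x2) - max(a.x1, b.x1)
    ih = min(a.y2, b.y2) - max(a.y1, b.y1)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union


def refine_tables(cca_tables, predictions,
                  rl=0.2, rr=0.2, ru=0.2, rd=0.2, iou_thresh=0.8):
    """Replace each prediction by its best-matching CCA table, as in Algorithm 2."""
    remaining = list(cca_tables)          # Ta
    refined = []
    # Visit predictions from highest to lowest confidence (Tm is sorted).
    for t1 in sorted(predictions, key=lambda b: b.conf, reverse=True):
        w, h = t1.x2 - t1.x1, t1.y2 - t1.y1
        # tm: the prediction expanded by the four ratios.
        tm = replace(t1, x1=t1.x1 - w * rl, x2=t1.x2 + w * rr,
                     y1=t1.y1 - h * ru, y2=t1.y2 + h * rd)
        if remaining:
            # The candidate is the CCA table that best overlaps the expanded box.
            t2 = max(remaining, key=lambda tk: iou(tm, tk))
            # Acceptance is tested against the original prediction.
            if iou(t1, t2) >= iou_thresh:
                remaining.remove(t2)
                # Assumption: the replaced box keeps the detector's confidence.
                t1 = replace(t2, conf=t1.conf)
        refined.append(t1)
    return refined


predictions = [Box(200, 200, 300, 300, 0.5), Box(10, 10, 100, 100, 0.9)]
cca_tables = [Box(0, 0, 50, 50), Box(8, 8, 104, 102)]
for box in refine_tables(cca_tables, predictions):
    print(box)
```

In this example the high-confidence prediction is snapped to the slightly larger CCA region that includes the outer border, while the second prediction has no sufficiently overlapping CCA region and is returned unchanged.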
4.4 Table classification

Why do we need a separate classification step instead of using YOLO for that too? Everyone wants an end-to-end model. In practice, however, accuracy drops because the number of false positives is too high when we try to use YOLO for classification. Also, almost no public dataset classifies tables, as table classes are a matter of opinion rather than consensus or fact.

Method

After table detection post-processing, we crop each table and pass it to the classification stage. The main model of the classification stage is MobileNetV3. We divide tables into two classes:
• Bordered: fully ruled tables whose cells are recognizable by line separators (Figure 44)
• Borderless: tables that do not comply with the bordered definition (Figure 45)

Figure 44: Examples of bordered tables

Figure 45: Examples of borderless tables

4.5 Table structure recognition

Table structure recognition is the task of turning a table region into the logical format of the table, which can be represented in HTML or Excel format. The main approach for table structure recognition in this thesis is based on separators.

4.6 Structure of a table

A table contains many cells; each cell has:
• top-left, bottom-right: the cell's image coordinates
• start-column, start-row, end-column and end-row for logical indexing
• Cell content (usually text)

4.7 Bordered table - just use some image processing

Steps:
• Binarize the image
• Use morphology operations to detect horizontal and vertical lines
• Get cells
• Index the cells

To binarize the image, we apply a Gaussian blur with kernel size (5, 5) and then threshold it with the Otsu method [22]. We then use thin stripe-shaped kernels to detect horizontal and vertical lines. Next, we invert the image and get the cells, which are the connected components that lie within the table. For each cell, we first approximate it with a rectangular region. Then for each x1[i], we find the largest x2[j] such that x2[j]
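The line-detection and cell-extraction steps of Section 4.7 can be sketched in plain Python as follows. This is only a sketch under stated assumptions: the input is taken to be an already binarized table crop given as nested lists with ink pixels equal to 1 (the blur and Otsu steps are not shown), and the function names `open_rows`, `line_mask` and `cells` are hypothetical. Opening with a 1 x k stripe is implemented directly as "keep runs of at least k foreground pixels", which is what a morphological opening with such a kernel computes; an image-processing library would normally be used instead.

```python
from collections import deque


def open_rows(mask, k):
    """Opening with a 1 x k horizontal stripe: keep only foreground runs
    that are at least k pixels long (short strokes such as text vanish)."""
    out = [[0] * len(row) for row in mask]
    for y, row in enumerate(mask):
        x = 0
        while x < len(row):
            if not row[x]:
                x += 1
                continue
            start = x
            while x < len(row) and row[x]:
                x += 1
            if x - start >= k:
                for i in range(start, x):
                    out[y][i] = 1
    return out


def transpose(mask):
    return [list(col) for col in zip(*mask)]


def line_mask(mask, k):
    """Union of horizontal and vertical lines of length >= k."""
    horizontal = open_rows(mask, k)
    vertical = transpose(open_rows(transpose(mask), k))
    return [[h | v for h, v in zip(hr, vr)]
            for hr, vr in zip(horizontal, vertical)]


def cells(lines):
    """Bounding boxes (x1, y1, x2, y2) of the 4-connected regions of the
    inverted line mask, i.e. the areas enclosed by the detected lines."""
    height, width = len(lines), len(lines[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for sy in range(height):
        for sx in range(width):
            if lines[sy][sx] or seen[sy][sx]:
                continue
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x1, y1, x2, y2 = sx, sy, sx, sy
            while queue:
                x, y = queue.popleft()
                x1, y1 = min(x1, x), min(y1, y)
                x2, y2 = max(x2, x), max(y2, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not lines[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x1, y1, x2, y2))
    return boxes


# A 9 x 5 table crop: outer border, one vertical divider, one "text" pixel.
grid = ["111111111",
        "100010001",
        "101010001",
        "100010001",
        "111111111"]
mask = [[int(ch) for ch in row] for row in grid]
print(cells(line_mask(mask, 4)))  # [(1, 1, 3, 3), (5, 1, 7, 3)]
```

The isolated text pixel is removed by the opening, so it does not split the left cell, and the two cells are returned as rectangles ready for logical indexing.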
