Development of a System for Digitalization and Information Extraction from Official Documents



MINISTRY OF EDUCATION AND TRAINING

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

GRADUATION THESIS MECHANICAL ENGINEERING

DEVELOPMENT OF A SYSTEM FOR DIGITALIZATION

AND INFORMATION EXTRACTION FROM

OFFICIAL DOCUMENTS

INSTRUCTOR: BUI HA DUC, PhD

STUDENT: PHAM VU DUNG, HOANG PHI HAI

SKL012629

Ho Chi Minh City, January, 2024


MINISTRY OF EDUCATION AND TRAINING

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION

SUMMARY GRADUATION THESIS

FACULTY OF MECHANICAL ENGINEERING

DEVELOPMENT OF A SYSTEM FOR DIGITALIZATION AND INFORMATION EXTRACTION FROM OFFICIAL DOCUMENTS

Supervisor: BUI HA DUC, PhD

Student: PHAM VU DUNG

Student ID: 19134074

Student: HOANG PHI HAI

Student ID: 19134076

Year Of Admission: 2019-2023


Chapter 3 Information extraction system

3.1 Design document layout analysis algorithm

3.2 Optical Character Recognition

3.3 Design named entity recognition algorithm

Chapter 4 Server and Database


Chapter 1 INTRODUCTION

1.1 Motivation

Vietnam aims to reach advanced digital status by 2030, guided by Prime Minister Pham Minh Chinh's vision of digital governance, a digital economy, and a digital society. The nation's swift digital progress is evidenced by initiatives such as the national data sharing platform, which facilitates 1.6 million daily transactions and links 95% of civil servant data across ministries. Digitization plays a pivotal role: it converts analog data to digital data, streamlines processes, enhances security, and cuts costs. Meanwhile, the evolution of NLP and Transformer models fuels the growth of Document AI, despite challenges such as limited models and datasets, especially in fields like finance and medicine.

Recognizing these challenges, our research focuses on automating the digitization process: we develop software that extracts information from forms, is compatible with various scanners, and features a user-friendly interface.

1.2 Objectives

- Design software that communicates with the hardware component to perform form digitalization

- Develop algorithms for form information extraction with acceptable accuracy

- Deploy deep learning models on a LAN server to send and store data to the database

1.3 System Flow

Figure 1.1: System Diagram

The diagram shows the activity flow of the system. It starts with paper documents; after the final processing step, the extracted information takes the form of "Key: Value" pairs and is sent to the database.


Chapter 2 DIGITALIZATION SYSTEM

The digitization system incorporates two components: the hardware for converting paper documents into digital format and the graphical user interface (GUI).

Figure 2.1: Block diagram of digitalization component

2.1 Hardware system

We communicate with the Windows Image Acquisition (WIA) system of Windows because scanners with Windows drivers generally communicate through WIA. In this project, we use the MFC-795CW scanner as an example of an automatic document feeder (ADF) scanner. After the driver is installed on the computer, the scanning process proceeds as follows:

Figure 2.2: Block diagram of scanning process of ADF scanner

- Communication with scanner

This is the process of establishing a connection between the computer and the scanner.

Figure 2.3: Block diagram of setup scanner device process

First, we establish a connection to the local WIA device manager by creating an instance of the IWiaDevMgr2 (Windows Image Acquisition Device Manager) object. After that, we enumerate information about the available imaging devices on the system through the EnumDeviceInfo method of the IWiaDevMgr2 object. With information about the connected scanners, we can select a scanner. After selecting a device, we set the output properties of the scanner using the EnumDeviceInfo method of the IWiaDevMgr2 object.

- Scanning process

This is the communication process between the scanning device and the GUI.

Figure 2.4: Diagram of scanning process

First, the GUI calls the scan function to instruct the scanner to pick up the paper. While scanning, the GUI continuously receives information about the documents in bytes. During this process, the GUI checks whether any bytes indicate a request to stop. If no error occurs, then when each page is finished the program receives data matching the value of the variable 'end_of_page_cb' and proceeds to save the image before continuing to scan the next sheet. Conversely, if an error occurs during the process or if there are no more documents to scan, the scanner sends the value 'end_of_scan_cb' and requests that the scanning process stop.
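The control flow of the scanning process can be sketched as a small event loop. This is a minimal illustration under stated assumptions, not the thesis implementation: the event tuples, the `run_scan` function, and the "data" and "stop" event kinds are hypothetical stand-ins for the byte stream delivered by the scanner driver. Only the names `end_of_page_cb` and `end_of_scan_cb` come from the text.

```python
from typing import Iterable, List, Tuple

# Hypothetical event encoding: (kind, payload). Only the two *_cb names
# appear in the text; "data" and "stop" are assumptions for illustration.
Event = Tuple[str, bytes]


def run_scan(events: Iterable[Event]) -> List[bytes]:
    """Collect one bytes object per completed page from a stream of events."""
    pages: List[bytes] = []
    current = bytearray()
    for kind, payload in events:
        if kind == "data":
            # Image bytes arriving while the current page is being scanned.
            current.extend(payload)
        elif kind == "end_of_page_cb":
            # Page finished: store it, then continue with the next sheet.
            pages.append(bytes(current))
            current.clear()
        elif kind in ("end_of_scan_cb", "stop"):
            # Error, empty feeder, or stop request: leave the loop.
            # Bytes of an unfinished page are discarded.
            break
    return pages


stream = [
    ("data", b"page-1-part-a"),
    ("data", b"page-1-part-b"),
    ("end_of_page_cb", b""),
    ("data", b"page-2"),
    ("end_of_page_cb", b""),
    ("end_of_scan_cb", b""),
]
print(len(run_scan(stream)))  # number of completed pages
```

In this sketch each completed page would then be written to disk as an image file; the saving step is left out because the file naming scheme is not described in the text.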

Figure 2.5: Document scanned image

After a document is scanned, the computer saves the page images for further processing, such as extracting information with an AI model. The images are typically saved in JPG format.

2.2 GUI layout

The GUI is designed to assist users in operating the scanner, communicating with the AI system, and managing the extracted information.

Figure 2.6: GUI design


Chapter 3 INFORMATION EXTRACTION SYSTEM

Our system first determines regions of interest such as titles, text, tables, etc. Next, OCR is applied to those regions to obtain text. Finally, key information is extracted from that text.

Figure 3.1: Block diagram of information extraction system

3.1 Design document layout analysis algorithm

- Objectives:

+ Preprocess the scanned image of the document: alleviate the effect of underexposure and deskew the image

+ Layout analysis: after preprocessing, the image is fed into the layout analysis model

Figure 3.2: Block diagram of layout analysis process

The scanned image of the document is first binarized and its contrast is enhanced to counter underexposure. Then, we deskew the image if it is tilted.

Figure 3.3: Image before and after preprocessing
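The contrast enhancement and binarization steps can be illustrated with a small dependency-free sketch. The specific techniques are assumptions, since the text does not name them: linear contrast stretching and Otsu's threshold are used here as common choices, contrast is stretched before thresholding, and deskewing is omitted. The function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels in the range 0..255


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale pixel values so they span the full 0..255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [row[:] for row in img]  # flat image: nothing to stretch
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def otsu_threshold(img: Image) -> int:
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]        # pixels with value <= t (background class)
        sum_bg += t * hist[t]
        w_fg = total - w_bg    # pixels with value > t (foreground class)
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def preprocess(img: Image) -> Image:
    """Stretch contrast, then map each pixel to 0 or 255 using Otsu's threshold."""
    stretched = stretch_contrast(img)
    t = otsu_threshold(stretched)
    return [[255 if p > t else 0 for p in row] for row in stretched]
```

An underexposed image concentrates its pixel values in a narrow dark range; stretching first spreads them out so that the threshold separates ink from paper more reliably.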

- Layout analysis model

LayoutLMv3 [37] is a well-known multimodal model in Document AI that does not need a pre-trained CNN or Faster R-CNN backbone to extract visual features. It is specifically designed for Document AI tasks, which encompass applications such as information extraction, question answering, and document understanding.

Figure 3.4: LayoutLMv3 architecture

- Training procedure

+ Number of images: ~150 images

+ Fine-tuning on a Google Colab Tesla T4 GPU; the fine-tuning took 2 hours to complete

+ Evaluation result: Intersection over Union (IoU): 0.777; mean average precision (mAP): 72%
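For reference, the Intersection over Union metric reported for the layout model compares a predicted region with a ground-truth region. The sketch below shows the standard definition for axis-aligned boxes; the box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions for illustration, not the thesis evaluation code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes that overlap on half of their width.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

A value of 1.0 means the predicted region matches the ground truth exactly, and 0.0 means the two regions do not overlap.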


Figure 3.5: Inference result

While our model may not currently reach state-of-the-art performance, it still meets the objectives we set out, so the current performance is deemed acceptable. Nonetheless, there is room for improvement in the future by incorporating additional data and fine-tuning the hyperparameters.

3.2 Optical Character Recognition

In the next step, we use TesseractOCR to digitize the text in the cropped text regions. TesseractOCR is an open-source OCR engine available for various operating systems. It can recognize more than 100 languages and supports many input image formats and output formats.


Figure 3.6: TesseractOCR flowchart

3.3 Design named entity recognition algorithm

- Objectives:

The main purpose of this project is information extraction; thus, the NER model must perform well. In addition, the inference process must be quick and precise.

Figure 3.7: Flowchart for NER process

- NER model

With these requirements, we use an ELECTRA model pre-trained on a Vietnamese dataset and fine-tune it for our task. ELECTRA involves a generator and a discriminator. The generator is tasked with substituting tokens within a sequence, so it is trained using masked language modeling (MLM). The discriminator aims to identify the tokens in the sequence that were replaced by the generator. This forces the discriminator to learn better representations of

Figure 3.8: ELECTRA architecture
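The discriminator's training target can be illustrated with a toy example. This sketch only builds the per-token labels (1 for a token the generator replaced, 0 for an original token); it does not model the generator or any neural network, and the function name and example sentence are made up for illustration.

```python
from typing import List


def replaced_token_labels(original: List[str], corrupted: List[str]) -> List[int]:
    """Label each position 1 if the token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same number of tokens")
    # A generator output identical to the original token counts as "original".
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # generator replaced one token
print(replaced_token_labels(original, corrupted))  # prints [0, 0, 1, 0, 0]
```

Because every position receives a label, the discriminator gets a training signal from the whole sequence rather than only from the masked positions.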

- Training procedure

Fine-tuning procedure:

+ Training data: ~81000 sentences

+ Fine-tuning on a Google Colab Tesla T4 GPU; the fine-tuning took almost 2 hours to complete

+ Evaluation result:

Precision   Recall   F1       Accuracy   Validation loss
0.975610    0.9950   0.9852   0.991489   0.025635
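As a consistency check, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name below is hypothetical; the numbers are the ones reported in the evaluation result.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall of the fine-tuned NER model.
print(round(f1_score(0.975610, 0.9950), 4))  # prints 0.9852
```

The recomputed value agrees with the reported F1 of 0.9852.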


Figure 3.9: Inference result

The result shows that the model has achieved acceptable accuracy. However, the model is restricted to the labeled key-value pairs; for entirely new key-value pairs, the model needs further fine-tuning in order to adapt.


Chapter 4 SERVER AND DATABASE

Running the AI models requires a powerful computing configuration. Therefore, the project uses one workstation for information extraction and multiple machines for running the GUI.

4.1 Architecture

The client-server model is appropriate for the current system. Communication between the client and the server relies on the request-response model: the client sends a request containing the information to be processed; the server receives and processes the request, then sends the result back to the client as a response. The architecture uses a RESTful API (Representational State Transfer Application Programming Interface) over HTTP. With the client-server model, the system's block diagram looks as follows:

Figure 4.1: System’s block diagram
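The request-response exchange can be sketched with Python's standard library. This is an illustration under stated assumptions, not the project's server: the `/extract` endpoint, the `image_id` field, and the response layout are hypothetical, and `http.server` is used only to keep the sketch dependency-free.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def handle_extract(payload: dict) -> Tuple[int, dict]:
    """Validate a request body and build the (status, response) pair."""
    if "image_id" not in payload:
        return 400, {"error": "image_id is required"}
    # A real server would run layout analysis, OCR, and NER here.
    return 200, {"image_id": payload["image_id"], "entities": {}}


class ExtractHandler(BaseHTTPRequestHandler):
    """Maps POST /extract to handle_extract and replies with JSON."""

    def do_POST(self) -> None:
        if self.path != "/extract":
            status, body = 404, {"error": "unknown endpoint"}
        else:
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length) or b"{}")
            except ValueError:  # malformed JSON or invalid encoding
                payload = None
            if isinstance(payload, dict):
                status, body = handle_extract(payload)
            else:
                status, body = 400, {"error": "body must be a JSON object"}
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the server and block until interrupted."""
    HTTPServer((host, port), ExtractHandler).serve_forever()
```

Keeping the request logic in a plain function (`handle_extract`) separate from the HTTP handler means the same logic can be tested without a network connection and moved to a web framework with little change.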

4.2 Server

The server is a LAN server. Previously, we looked for an online server that could run the model for free, but those servers were too slow to meet our needs. Because the current design has a server connected to the AI models, we can run the AI models and the server on the same computer. The server is an API server because it needs to send data and images to the GUI. To implement these functions, each button press in the GUI calls an API on the server, which performs the corresponding action.

The project uses an HTTP server. Since the server side of the project is not very complex and requires quick, flexible development, the Flask library was chosen for the implementation.

4.3 Database

The database for this project should be online, quick to develop with, expandable, and free of charge. A NoSQL database offers significant advantages in AI projects: it manages unstructured data efficiently, scales horizontally to handle large volumes of data, and improves flexibility and performance in data storage, retrieval, and processing. Based on these considerations, Firebase Realtime Database by Google was chosen for this project.
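Firebase Realtime Database stores data as a JSON tree, and its REST interface addresses a node by appending `.json` to the node's path. The sketch below prepares a record and its REST URL without sending anything. The record layout, the example project URL, and the function names are hypothetical, since the project's data templates are not reproduced here; the key sanitization reflects the characters that Realtime Database keys may not contain.

```python
import re
from typing import Dict

# Characters not allowed in Realtime Database keys: . $ # [ ] / and
# ASCII control characters.
_FORBIDDEN = re.compile(r"[.$#\[\]/\x00-\x1f\x7f]")


def safe_key(key: str) -> str:
    """Replace characters that are not allowed in a database key."""
    return _FORBIDDEN.sub("_", key.strip())


def build_record(form_type: str, fields: Dict[str, str]) -> Dict[str, object]:
    """Wrap extracted key-value pairs in a JSON-serializable record (assumed layout)."""
    return {
        "form_type": form_type,
        "fields": {safe_key(k): v for k, v in fields.items()},
    }


def rest_url(base_url: str, path: str) -> str:
    """Build the REST URL of a node: <base>/<path>.json."""
    return f"{base_url.rstrip('/')}/{path.strip('/')}.json"


record = build_record("withdrawal", {"student.id": "00000000", "name": "Example"})
print(record)
print(rest_url("https://example-project.firebaseio.com", "forms/0001"))
```

Sending the record would then be an HTTP PUT or POST of its JSON encoding to that URL, subject to the database's security rules.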


Table 1: Data templates


Chapter 5 EXPERIMENT

● Number of testing forms: 14 “anh văn đầu ra” (English exit requirement) forms, 12 “đơn xin thôi học” (withdrawal request) forms, 13 “đơn xin học môn thay thế” (substitute course request) forms

● Result:

- The percentage of forms in which more than 80% of the entities are extracted correctly is 71%

- Frequently appearing entities such as name, student ID, etc. are extracted reliably, yet the system sometimes cannot recognize entities such as subject ID and subject name. This is because the training data contains few examples of these entities
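The form-level metric used in this result can be computed as sketched below. The exact-match comparison of entity values and the function names are assumptions, since the text does not describe the scoring rule; the sample values in the test data are invented and are not the experiment's data.

```python
from typing import Dict, List


def entity_accuracy(predicted: Dict[str, str], expected: Dict[str, str]) -> float:
    """Fraction of expected entities whose extracted value matches exactly."""
    if not expected:
        return 0.0
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return correct / len(expected)


def share_of_forms_above(rates: List[float], threshold: float = 0.8) -> float:
    """Fraction of forms whose entity accuracy exceeds the threshold."""
    if not rates:
        return 0.0
    return sum(1 for rate in rates if rate > threshold) / len(rates)
```

With one accuracy value per test form, `share_of_forms_above(rates)` gives the percentage reported in the result once multiplied by 100.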

Figure 5.1: System result


Figure 5.2: System result


Chapter 6 CONCLUSION

Overall, we accomplished all the objectives that we proposed. We designed software that communicates with and controls the scanner to scan documents. We developed deep learning models that identify the layout structure of a document and extract information automatically. The evaluation suggests that the performance of the system is acceptable: in 71% of the test forms, more than 80% of the entities were extracted correctly.

Nonetheless, our project still has shortcomings. Firstly, the average processing time for a single document file is quite long. Secondly, our project focuses solely on fill-in forms of the FME faculty; hence, without fine-tuning, it cannot be used to analyze other types of unstructured data, such as birth certificates or citizen identity cards.

Title	Development of a System for Digitalization and Information Extraction from Official Documents
Authors	Pham Vu Dung, Hoang Phi Hai
Supervisor	Bui Ha Duc, PhD
University	Ho Chi Minh City University of Technology and Education
Major	Mechanical Engineering
Type	Graduation Thesis
Year of publication	2024
City	Ho Chi Minh City
