MINISTRY OF EDUCATION AND TRAINING
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
GRADUATION THESIS MECHANICAL ENGINEERING
DEVELOPMENT OF A SYSTEM FOR DIGITALIZATION
AND INFORMATION EXTRACTION FROM
OFFICIAL DOCUMENTS
INSTRUCTOR: BUI HA DUC, PhD
STUDENT: PHAM VU DUNG, HOANG PHI HAI
SKL012629
Ho Chi Minh City, January, 2024
SUMMARY GRADUATION THESIS
FACULTY OF MECHANICAL ENGINEERING
DEVELOPMENT OF A SYSTEM FOR DIGITALIZATION AND INFORMATION EXTRACTION FROM OFFICIAL DOCUMENTS
Supervisor: BUI HA DUC, PhD
Student: PHAM VU DUNG
Student ID: 19134074
Student: HOANG PHI HAI
Student ID: 19134076
Year Of Admission: 2019-2023
Chapter 3 Information extraction system
3.1 Design document layout analysis algorithm
3.2 Optical Character Recognition
3.3 Design named entity recognition algorithm
Chapter 4 Server and Database
Chapter 1 INTRODUCTION
1.1 Motivation
Vietnam aims for advanced digital status by 2030, led by Prime Minister Pham Minh Chinh's vision of digital governance, economy, and society. The nation's swift digital progress is evidenced by initiatives such as the national data sharing platform, which facilitates 1.6 million daily transactions and links 95% of civil servant data across ministries. Digitization plays a pivotal role: it converts analog data to digital data, streamlines processes, enhances security, and cuts costs. Meanwhile, the evolution of NLP and Transformer models fuels the growth of Document AI, despite challenges such as limited models and datasets, especially in fields like finance and medicine.
Recognizing these challenges, our research focuses on automating the digitization process by developing software that extracts information from forms, is compatible with various scanners, and features a user-friendly interface.
1.2 Objectives
- Design software that communicates with the hardware component to perform form digitalization.
- Develop algorithms for form information extraction with acceptable accuracy.
- Deploy deep learning models on a LAN server that sends data to the database and stores it there.
1.3 System Flow
Figure 1.1: System Diagram
The diagram illustrates the activity flow of the system. It starts with paper documents and, after final processing, produces "Key: Value" pairs that are sent to the database.
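To make the final "Key: Value" output concrete, here is a minimal sketch of how extracted pairs could be packed into a JSON record before being sent to the database. The function name `to_record`, the field names, and the sample values are illustrative assumptions, not the thesis implementation.

```python
import json


def to_record(form_type, pairs):
    """Build a JSON-serializable record from extracted (key, value) pairs.

    The record layout is a hypothetical example, not the thesis schema.
    """
    record = {"form_type": form_type, "fields": {}}
    for key, value in pairs:
        # Collapse stray whitespace that OCR output often contains.
        record["fields"][key.strip()] = " ".join(value.split())
    return record


# Made-up placeholder values, used only to show the shape of the output.
sample_pairs = [("name", "  Nguyen  Van A "), ("student_id", "00000000")]
payload = json.dumps(to_record("sample_form", sample_pairs), ensure_ascii=False)
```

Keeping the extracted fields in a nested dictionary means a new entity type can be added without changing the record's top-level shape.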
Chapter 2 DIGITALIZATION SYSTEM
The digitization system incorporates two components: hardware for converting paper documents into digital format, and a graphical user interface (GUI).
Figure 2.1: Block diagram of digitalization component
2.1 Hardware system
We communicate with the scanner through the Windows Image Acquisition (WIA) system, because scanners on Windows communicate through WIA. In this project, we use the MFC-795CW scanner as an example of an ADF scanner. After the driver is installed on the computer, the scanning process proceeds as follows:
Figure 2.2: Block diagram of scanning process of ADF scanner
- Communication with scanner
This is the process of establishing a connection between the computer and the scanner.
Figure 2.3: Block diagram of setup scanner device process
First, we establish a connection to the local WIA device manager by creating an instance of the IWiaDevMgr2 (Windows Image Acquisition Device Manager) object. We then enumerate the imaging devices available on the system through the EnumDeviceInfo method of the IWiaDevMgr2 object. With the list of connected scanners, we select the scanner to use. After selecting the device, we set the scanner's output properties through the IWiaPropertyStorage interface of the device item.
- Scanning process
This is the communication process between the scanning device and the GUI.
Figure 2.4: Diagram of scanning process
First, the GUI calls the scan function to instruct the scanner to pick up the paper. While scanning, the GUI continuously receives the document data as bytes and checks whether any of them indicates a request to stop. If no error occurs, then at the end of each page the program receives data matching the value of the variable 'end_of_page_cb' and saves the image before continuing with the next sheet. Conversely, if an error occurs or there are no more documents to scan, the scanner sends the value 'end_of_scan_cb' and requests that the scanning process stop.
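The control flow of this scanning loop can be sketched in a few lines. This is a simplified simulation under stated assumptions: the scanner is modelled as a sequence of `(kind, payload)` events, and only the names 'end_of_page_cb' and 'end_of_scan_cb' come from the text. The "data" and "stop" event kinds and the function `run_scan` are hypothetical, and no real WIA call is made.

```python
def run_scan(events):
    """Consume (kind, payload) events and return one bytes object per page.

    kind is "data", "end_of_page_cb", "end_of_scan_cb", or "stop".
    The event encoding is an assumption made for illustration.
    """
    pages = []
    buffer = bytearray()
    for kind, payload in events:
        if kind == "stop":
            # A stop request aborts the scan; the unfinished page is dropped.
            break
        if kind == "data":
            buffer.extend(payload)
        elif kind == "end_of_page_cb":
            # The real GUI would save the page as an image file here.
            pages.append(bytes(buffer))
            buffer.clear()
        elif kind == "end_of_scan_cb":
            # Error or empty feeder: stop scanning.
            break
    return pages


two_pages = run_scan([
    ("data", b"ab"), ("data", b"c"), ("end_of_page_cb", None),
    ("data", b"d"), ("end_of_page_cb", None),
    ("end_of_scan_cb", None),
])
```

Buffering until 'end_of_page_cb' arrives keeps each saved image aligned with exactly one sheet, which is what allows the scan to continue page by page.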
Figure 2.5: Document scanned image
After scanning a document, the computer saves the resulting images for further processing, such as extracting information with an AI model. The images are typically saved in JPG format.
2.2 GUI layout
The GUI is designed to assist users in operating the scanner, communicating with the AI system, and managing the extracted information.
Figure 2.6: GUI design
Chapter 3 INFORMATION EXTRACTION SYSTEM
Our system first determines regions of interest such as titles, text, and tables. Next, OCR is applied to those regions to obtain text. Finally, key information is extracted from that text.
Figure 3.1: Block diagram of information extraction system
3.1 Design document layout analysis algorithm
- Objectives:
Preprocess the scanned document image: alleviate the effect of underexposure and deskew the image.
Layout analysis: after preprocessing, the image is fed into the layout analysis model.
Figure 3.2: Block diagram of layout analysis process
The scanned image of the document is first binarized and its contrast is enhanced to counter underexposure. Then, we deskew the image if it is tilted.
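The text does not say which binarization method is used. As one common choice, the sketch below implements Otsu's method in pure Python on a flat list of 8-bit gray levels. Treat the method, the function names `otsu_threshold` and `binarize`, and the sample pixels as assumptions. Contrast enhancement and deskewing are not shown.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t form the dark class; pixels > t form the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 255 (bright)."""
    return [255 if p > threshold else 0 for p in pixels]


# Made-up sample: three dark "ink" pixels and three bright "paper" pixels.
sample = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(sample)
binary = binarize(sample, t)
```

A data-driven threshold of this kind adapts to each scan, which matters when some pages are darker than others. A fixed threshold would be simpler but more sensitive to underexposure.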
Figure 3.3: Image before and after preprocessing
- Layout analysis model
LayoutLMv3 [37] is a well-known multimodal model in Document AI that does not need a pre-trained CNN or Faster R-CNN backbone to extract visual features. It is specifically designed for Document AI tasks, which encompass applications such as information extraction, question answering, and document understanding.
Figure 3.4: LayoutLMv3 architecture
- Training procedure
+ Training data: ~150 images
+ Fine-tuning on a Google Colab Tesla T4 GPU; the fine-tuning took 2 hours to complete
+ Evaluation result: Intersection over Union (IoU): 0.777; mean average precision (mAP): 72%
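For readers unfamiliar with the Intersection over Union metric reported here, this is a minimal sketch of how IoU is computed for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are assumptions for illustration; the thesis evaluation code is not shown in the source.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
example = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An IoU of 1.0 means the predicted region matches the labelled region exactly, and 0.0 means the two do not overlap at all.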
Figure 3.5: Inference result
Although our model does not currently reach state-of-the-art performance, it meets the objectives we set, so the current performance is deemed acceptable. Nonetheless, there is room for improvement in the future by incorporating additional data and tuning the hyperparameters.
3.2 Optical Character Recognition
In the next step, we use TesseractOCR to digitize the text in the cropped text regions. TesseractOCR is an open-source OCR engine available for various operating systems. It can recognize more than 100 languages and supports many types of input images and output formats.
Figure 3.6: TesseractOCR flowchart
3.3 Design named entity recognition algorithm
- Objectives:
The main point of this project is information extraction, so the NER model must perform well. In addition, inference must be quick and precise.
Figure 3.7: Flowchart for NER process
- NER model
Given these requirements, we use an ELECTRA model pre-trained on a Vietnamese dataset and fine-tune it for our task. ELECTRA consists of a generator and a discriminator. The generator substitutes tokens within a sequence, so it is trained with masked language modeling (MLM). The discriminator aims to detect which tokens in the sequence were replaced by the generator. This forces the discriminator to learn better representations.
Figure 3.8: ELECTRA architecture
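The discriminator's training target in ELECTRA pre-training can be illustrated with a toy example. This sketch only shows how the per-token "replaced or original" labels are derived from a sentence and its generator-corrupted copy. It does not train any model, and the function name `rtd_labels` and the English tokens are illustrative assumptions.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 for each position where the token was replaced, else 0.

    If the generator happens to sample the original token again, that
    position still counts as "original" (label 0).
    """
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("token sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]   # generator replaced "cooked"
labels = rtd_labels(original, corrupted)
```

Because every token position receives a label, the discriminator gets a training signal from the whole sequence and not only from the masked positions.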
- Training procedure
Fine-tuning procedure:
+ Training data: ~81000 sentences
+ Fine-tuning on a Google Colab Tesla T4 GPU; the fine-tuning took almost 2 hours to complete
+ Evaluation result:
Precision   Recall   F1       Accuracy   Validation loss
0.975610    0.9950   0.9852   0.991489   0.025635
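As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values in the evaluation result; the function name `f1_score` is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the evaluation result.
f1 = f1_score(0.975610, 0.9950)
```

The recomputed value rounds to 0.9852, which agrees with the reported F1.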
Figure 3.9: Inference result
The result shows that the model achieves acceptable accuracy. However, the model is restricted to the labeled key-value pairs. For entirely new key-value pairs, the model needs further fine-tuning in order to adapt.
Chapter 4 SERVER AND DATABASE
Running the AI models requires a powerful machine. Therefore, the project uses one workstation for information extraction and multiple machines for running the GUI.
4.1 Architecture
The appropriate model for the current system is the client-server model. Communication between the client and server relies on the request-response model. The client sends the server a request containing the information to be processed. The server receives and processes the request, then sends the result back to the client as a response. The architecture uses a RESTful API (Representational State Transfer Application Programming Interface) over HTTP. With the client-server model, the current system's block diagram looks as follows:
Figure 4.1: System’s block diagram
4.2 Server
The server is a LAN server. We first looked for an online server that could run the model for free, but such servers were too slow and did not meet our needs. Because the current design is a server connected to an AI model, we can host the AI model and the server on the same computer. The server is an API server because it needs to send data and images to the GUI. To implement these functionalities, each button press in the GUI calls an API on the server, which performs the corresponding action.
The project's server is an HTTP server. Since the project's server-side requirements are not complex and call for quick, flexible development, the Flask library is chosen for the implementation.
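The "one button press, one API call" pattern can be sketched without any web framework. The example below is a dependency-free stand-in for route handling and is not Flask's API: the `route` decorator, the `ROUTES` table, the endpoint paths `/scan` and `/results`, and the response fields are all hypothetical. In the actual project, Flask would take the role of `handle_request`.

```python
import json

ROUTES = {}  # maps (method, path) to a handler function


def route(method, path):
    """Register a handler for an HTTP method and path (hypothetical helper)."""
    def register(func):
        ROUTES[(method, path)] = func
        return func
    return register


@route("POST", "/scan")
def start_scan(payload):
    # A GUI "Scan" button press would trigger this handler.
    return {"status": "scanning", "pages": payload.get("pages", 1)}


@route("GET", "/results")
def get_results(payload):
    # A GUI "Results" button press would fetch the extracted fields.
    return {"status": "ok", "fields": {}}


def handle_request(method, path, body=""):
    """Dispatch one request and return (status_code, json_body)."""
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, json.dumps({"error": "not found"})
    payload = json.loads(body) if body else {}
    return 200, json.dumps(handler(payload))


status, body = handle_request("POST", "/scan", '{"pages": 3}')
```

Keeping one handler per GUI action makes the request-response cycle easy to trace: the client sends JSON, the server runs exactly one function, and the result goes back as JSON.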
4.3 Database
The project's database should be online, quick to develop with, expandable, and free of charge. A NoSQL database offers significant advantages in AI projects: it manages unstructured data efficiently, scales horizontally to handle large volumes of data, and improves flexibility and performance in data storage, retrieval, and processing. Based on these considerations, Firebase Realtime Database by Google is chosen for this project.
Table 1: Data templates
Chapter 5 EXPERIMENT
● Number of testing forms: 14 “anh văn đầu ra” (English exit requirement) forms, 12 “đơn xin thôi học” (withdrawal application) forms, and 13 “đơn xin học môn thay thế” (application to take a substitute course) forms.
● Result:
- 71% of the forms have more than 80% of their entities extracted correctly.
- Frequently occurring entities such as name and student ID are extracted reliably, yet the system sometimes cannot recognize entities such as subject ID and subject name. This is because these entities are underrepresented in the training data.
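The headline metric, the share of forms whose correct-entity rate exceeds 80%, can be computed as in the sketch below. The per-form counts are made-up toy numbers used only to show the calculation; they are not the thesis data, and the function name `share_of_forms_above` is hypothetical.

```python
def share_of_forms_above(results, threshold=0.8):
    """Fraction of forms whose correct-entity rate strictly exceeds threshold.

    results is a list of (correct_entities, total_entities), one per form.
    """
    if not results:
        raise ValueError("results must contain at least one form")
    passed = sum(
        1 for correct, total in results
        if total > 0 and correct / total > threshold
    )
    return passed / len(results)


# Toy data: four forms with 10 entities each (not the thesis results).
toy_results = [(9, 10), (8, 10), (5, 10), (10, 10)]
share = share_of_forms_above(toy_results)
```

The comparison is strict, so a form with exactly 80% correct entities does not count as exceeding the threshold.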
Figure 5.1: System result
Figure 5.2: System result
Chapter 6 CONCLUSION
Overall, we accomplished all the objectives that we proposed. We designed software that communicates with and controls the scanner to scan documents. We developed deep learning models that identify the layout structure of a document and extract information automatically. The evaluation suggests that the system performs reasonably well: 71% of the test forms had more than 80% of their entities extracted correctly.
Nonetheless, our project still has shortcomings. First, the average processing time for a single document file is quite long. Second, our project focuses solely on fill-in forms of the Faculty of Mechanical Engineering (FME), so it cannot be used to analyze other types of documents, such as birth certificates or citizen identity cards, without fine-tuning.