
Digitalization of Administrative Documents: A Digital Transformation Step in Practice


2021 8th NAFOSTED Conference on Information and Computer Science (NICS)

Sinh Van Nguyen, Dung Anh Nguyen, Lam Son Quoc Pham
School of Computer Science and Engineering, International University, Vietnam National University of HCMC, Ho Chi Minh, Vietnam
ITITIU17073@student.hcmiu.edu.vn; pqslam@hcmiu.edu.vn
Corresponding author: nvsinh@hcmiu.edu.vn (ORCID: 0000-0003-0424-5542)

Abstract: Digital transformation has been one of the most popular keywords in recent years. It is not only a research trend driven by the development of information technology, but also a task that companies and organizations are now expected to carry out. Digitalization of administrative documents is therefore considered the first step in the digital transformation of a public organization. Through the digitizing process, information held in written form or as hard copies is converted into a digital format (e.g., document files) for storing, mining, processing and managing the documents. This paper presents a method to build a web application for digitizing the administrative documents used in most public organizations. The method is based on OCR (Optical Character Recognition) combined with image processing techniques. Our digitizing process consists of the following steps: (i) scanning the hard copies of the administrative documents; (ii) removing noise and filtering the necessary information in the content using image processing techniques; (iii) automatically classifying the acquired contents into the respective components of a template form that follows the structured format of the Vietnam Government; (iv) automatically generating a document file. The application can process a document with a single page or multiple pages. Compared with similar applications, our application processes a document page in a few seconds, places no limit on the number of pages per document, and obtains the accuracy we expected.

Index Terms: Digital Transformation, Document Digitalization, OCR, Image Processing, Smart Web Application

978-1-6654-1001-4/21/$31.00 ©2021 IEEE

I. INTRODUCTION

The development of Information Technology (IT) brings advantages to daily work, study, research and entertainment. IT is a common tool in official activities and also a means of administrative management. This has led organizations and countries all over the world to take their first steps in digital transformation (DT). From the perspective of work, DT is the transformation of work from traditional to digital activities on the basis of IT and communication devices. From another perspective, DT is formed by the merger of personal and corporate IT environments, based on the intersection of digital technologies such as cloud computing, big data, IoT and AI, to serve all activities of an organization [1], [2].

Digitalization is the process of transforming data or information into a computer-based digital format. Document digitalization [3], [4], [16] refers to the technique of scanning the hard copy of a document and converting its content into an electronic version in a document file format such as doc, docx or pdf. In these digital formats, information is arranged into distinct data components stored in computer memory that can be processed individually, and it is therefore readable by a computer. Document digitalization proceeds as follows: a hard copy of a document is scanned and saved as a picture or a pdf file, page by page. The light and dark regions of the scanned image are analyzed by an optical character recognition engine, which turns each letter or number into an ASCII code; the system then analyzes the ASCII characters and divides them into several small portions that may be saved for later use.

In practice, a great many documents need to be kept and preserved carefully for a long time because of their value to individuals or even
to the history of a country. They can be certificates, degrees, legal or administrative documents, etc., and their number increases day by day. Normally they are made of paper, or sometimes of special wood, and are difficult to keep for a long time. The problem comes from the capacity of the storage space and from the work involved in document preservation. According to the State Records and Archives Department of Vietnam [5], there are six national archives centers, which store a huge number of national documents. In public organizations and companies in practice, document storage and management take up half of the working time of an average employee, and even more in large firms where documents pile up quickly. Such a minor task should not consume so much space and time, which is why it should be automated. As a solution to this problem, a computer-based application can be developed to convert all traditional paper documents into their digital counterparts. The documents in this application are stored in a structured way in a robust architecture to support the administrative workload of employees. Such an application can help large organizations keep information safe, up to date and accessible to all authorized parties.

This paper aims to propose a method and create an application for digitizing administrative documents. Our solution is a web application that supports management, storing, searching and mining in large firms and public organizations of the Vietnam government. The method consists of the following steps: (i) the hard copies of the administrative documents are scanned and saved as picture or pdf files; (ii) image processing techniques are used to filter the scanned data by removing noise [6]; (iii) the acquired contents are automatically classified using the Tesseract OCR [7] technique and put into the right
positions of a template form that follows the structured format of the Vietnam government; (iv) document files are automatically created for use and management. The final product provides tools that assist companies in administrative work such as organizing, managing, storing, mining, using and reproducing their documents. The whole process is considered a DT step in administrative management.

The remainder of the paper is structured as follows. Section II reviews several methods, solutions and applications in practice. Section III presents our method and the system architecture of the application in detail. The implementation and obtained results are presented in Section IV. We compare and discuss the methods and functions of the applications in Section V. Section VI is our conclusion.

II. RELATED WORK

Digitization of data is an important step in almost all data mining and management activities in companies and public organizations. It is also a module in the DT process of governments [8]. The existing IT-based tools, techniques and methods are key factors in the whole process. The OCR technique [3] is used to read and identify characters and image information in scanned documents. Part of the conversion is to recognize characters within the uploaded image of a document and export them to a digital copy. The benefit of digital copies of the source materials is that they are easy to manage in large quantities and (in theory) can be used indefinitely. This technology [9], [10] aims to transform administrative workloads in large organizations and firms by eliminating the hassle of paperwork and opting instead for a digital solution that is both reliable and manageable.

Koichi Kise [11] presented a method for classifying a document image into homogeneous components such as text blocks, figures and tables. The method is based on image processing techniques that distinguish background, foreground, object components, and the color and intensity of pixels in the image. The obtained results showed a promising way to extract and recognize characters in document images. For processing images, OpenCV is one of the popular open source libraries and is very fast. Chung B. W. [12] introduced, step by step, how to install and work with OpenCV. This guide is useful to researchers and developers.

Image processing techniques are widely applied in computer graphics and computer vision. Minh et al. [13] introduced a method for creating a virtual museum based on a virtual reality application. The method is considered a digital transformation step in the field of digital heritage. The application allows users to visit and interact with the relics in the museum as in real life. In the medical field, image processing techniques can be applied to build applications that support doctors in disease diagnosis. Sinh et al. [14] presented a method for building and visualizing medical data objects in a web application. This web app is useful for both medical staff and patients. Applying image processing techniques to develop a web-based system for digitalizing administrative documents is therefore a practical and widely used approach.

According to the format of the Vietnam Government [15], the structure of an administrative document is as presented in Fig. 1. This is the required document template (on A4 paper) for creating an administrative document in all public organizations. It is also used in private companies that follow the administrative laws of Vietnam. The structure of the document is numbered and distributed over different positions of the document parts as follows:

(1) National name
(2) Name of the organization that issued the document
(3) Document ID
(4) Place and date issued
(5a) Type of document; (5b) Abstract
(6) Main content
(7a, 7b, 7c) Title, full name and signature of the competent person, respectively
(8) Seal (stamp) and signature of the organization
(9a, 9b) Recipient
(10a, 10b) Confidentiality indicator, urgency indicator
(11) Scope of circulation indication
(12) Writer notation and number of editions
(13) Contact information of the organization
(14) Digital signature of the organization for the version of the document copied into electronic format

Among these components, the main content (6) can extend over more than one page.

Fig. 1. Format of an administrative document by the Vietnam Government.

Ruili Zhang et al. [16] presented a framework for digital document processing. The proposed method includes several important steps such as scanning, indexing, quality checking, archiving and backup of electronic documentary information. However, the method works on non-structured documents, and its accuracy is not compared with that of existing methods in the same context. The next section presents in detail our proposed method and application for digitalizing and managing administrative documents.

III. PROPOSED METHOD

A. System architecture

The system is a web application built on the MVC model, with a database management system for working, storing and mining in practice. Our proposed method and application follow the steps shown in Fig. 2.

Fig. 2. Our proposed workflow of the digitizing process.

B. Proposed algorithm

To start the digitalization process, the system first receives and loads a scanned image of the document. After scanning, a cropped replica of the image is generated (it is used later to locate the stamp). Then we use image processing techniques to convert it into a pure black-and-white image, using a threshold function to filter noisy elements. The threshold operation alters the value of pixels: if the pixel value exceeds the threshold value, the pixel is set to white; otherwise, it is set to black. After that, the spaces between the characters are filled to form a uniform partition to
identify each component of the document in the image; this is done with the dilate method, which increases the character thickness to a specific level. In the next step, to detect image or text regions, the four corner points of a rectangular area surrounding each partition are identified to determine its contour (called a bounding box). The structure of the administrative document (Fig. 1) is then used to determine and extract the text information in the document image. The location of each partition corresponding to a component in the picture can be predicted from the four inferred points (x, y coordinates) and the partition size (width, height). The partitions are matched with the numbered components of the document structure (name of organization, ID, place and date, document type, abstract, content, recipient, position, stamp and signature). If the detected component is a stamp, the system proceeds to detect the circle that represents the stamp's border, uses an RGB filter to keep only the red color, and saves the result as a PNG picture. Note that this image is not used to extract the text inside the stamp but is kept as a replica of the stamp. After the components are determined, the image of each component is cropped from the replica (made at the start) and stored as a series of pictures labeled with the component's name. This process allows us to treat each component independently. Finally, an OCR engine transforms each component's picture into text and returns the results. Our proposed algorithm (Algorithm 1) is described as follows.

Algorithm 1: DocumentDigitization()
Input: image files
Output: document files
Load an image file
Crop the image
Create a replica of the image
Convert the replica into pure black and white using a threshold function
Dilate the characters in the image
Find the contours of the image components and store them in array C
for each contour c in C do
    Create the bounding rectangle of c
    Get x, y, width, height of the bounding rectangle
    Classify the component based on x, y, width, height
    if the component is a stamp then
        Find the mask of the bounding circle of the stamp
        Extract the stamp from the image (using BitwiseAND with the original image)
    else
        Crop the area at (x, y, x+width, y+height) of the image replica
        Extract the text from the area and push it into array texts
    end if
end for
Return the image of the stamp and texts

IV. IMPLEMENTATION AND RESULTS

The application can be used by several kinds of users (actors) who interact with the system. Anyone can view, search and use the system. Official staff can process and manage all administrative documents of the organization, both outgoing and incoming. The administrator manages the users of the system. We use the JavaScript programming language with NodeJS [19] and ReactJS [17], [18], and the Visual Studio Code IDE [21]. Tesseract OCR [23] was selected among many OCR engines for its support for various languages and its performance on different fonts. In combination with the image processing techniques, we use OpenCV to enhance image quality and obtain the best results.

The main page for document digitalization is shown in Fig. 3. The picture of a document is loaded on the left-hand side of the web page. It is then digitalized by extracting all the text and the picture of the stamp. These are recognized and transformed into the structured data form on the right-hand side of the web page. The data are then corrected in each component of the form (if necessary) and stored in a MongoDB [22] database, and a new document can be reproduced from them using the document management function (see Fig. 4). In this form, the user can choose any document by its ID to display a new document file whose content is exactly that of the input document image.

TABLE I. COMPARISON OF PERFORMANCE (#E/#C: THE NUMBER OF ERRORS / THE NUMBER OF CHARACTERS),
ACCURACY (ACC %) AND PROCESSING TIME (MS: MILLISECONDS)

DocID       Quality   Type       #E/#C     # of comp   Time (ms)   Acc (%)
480/QBGDT   Low       original   83/1037   10          6000        91.996
                      clean      85/1037   10          5730        91.803
110/TBHQT   Medium    original   45/1030   11          5550        95.631
                      clean      47/1030   12          4990        95.437
187/QUBND   High      original   41/1514   12          5990        96.019
                      clean      46/1514   12          5300        95.533

Fig. 3. The main page to digitize an administrative document.

Fig. 4. The page of documents management.

V. DISCUSSION AND COMPARISON

In this section, we test our application on different types of documents, focusing on administrative documents, and compare the obtained results. Several experiments were carried out to assess the application's efficiency. In the context of OCR and image processing, each test was examined carefully and independently. Table I shows the overall findings for a variety of document characteristics. Each entry in the table reflects a specific experiment carried out on the corresponding document. We tested three input document pictures (their IDs are shown in the first column) of different resolutions (the quality). Each of them was tested in two forms (the original input and the cleaned one after noise removal). For example, a scanned image with a DPI (dots per inch) of 400 or more is considered a high-resolution image; from 300 to 400, a medium-resolution image; and less than 300, a low-resolution image. Depending on the quality of the scanned documents and their resolution after the denoising process, the number of components (# of comp) detected in each document differs. The processing time and the accuracy [20] of the recognition are presented in Table I. The accuracy of the OCR engine is measured at the character level and is determined by equation (1):

$$Ac = 1 - \frac{E}{C} \tag{1}$$

where E is the number of erroneous characters and C is the number of all characters in the document. We count the errors and compute the ratio of the number of errors to the number of characters in each document (#E/#C) to obtain the accuracy of our proposed method.

In general, the clean documents received more precise results in terms of character-level precision, obtained the exact number of component placements, and took slightly less processing time. The obtained results show that the better the input data, the more efficient the outcome. The accuracy for the clean documents (in the medium and high cases) is above 95%. The results indicate that the algorithms in the application can meet the demanding criteria. With a more capable system and higher-quality input obtained with other scanning methods, documents can be digitized more accurately, which will significantly improve the outcome.

However, processing handwriting is still a challenge for researchers [10], especially the signature and the stamp (or seal) of the organization. For security reasons, the seal is managed according to the rules and legal system of the Vietnam Government. In this application, we only capture the shape of the seal from its boundary to demonstrate the technique, and we do not reuse it when reproducing new documents. The second point is the signature: by rule, it always overlaps the seal by one third, from the left. Therefore, we did not extract it, in order to preserve the legal validity of the original documents. In practice, an administrative document of any organization or company has legal value when it is sealed (stamped) and signed by the leader, although electronic signatures have been approved by Vietnamese law (but only in some cases).

Compared with the existing methods and applications in practice (see Table II and Table III), our application has many advantages. It is built as a web app, so it is easy to use. It can handle many documents at the same time on the connected network
system within an organization or a company. The storage capacity is provided by the hardware of the data server. The application is therefore practical for effective management, storing and mining in the DT process of an organization.

In Table II, VietOCR is a free tool that runs as both a web and a desktop application. It is the fastest of all the applications, and its accuracy is also higher than that of our application. The online test version can extract exact text data such as the information on an identification card, a medical card or a one-page document. It is very useful for developing the QR or bar code scanners used in supermarkets or stores for checking items and processing payments. However, it is not used for digitizing administrative documents with multiple pages. ABBYY FineReader comes close to our application, with similar support and accuracy, and better processing time in some cases. However, it is proprietary software, which may cause compatibility and licensing issues. SodaPDF obtained low accuracy and the longest processing time on two of the three documents, and it sometimes generates overlapping words and noise. Omnipage Docudirect does not support Vietnamese, which makes it difficult to evaluate. Ignoring the Vietnamese accents, it generates decent characters, but the accuracy is still low. Neither SodaPDF nor Omnipage Docudirect could properly process document 480/Q-BGDT because of its low resolution, which resulted in much lower accuracy. In contrast, the other applications, including ours, suffered only a small loss of accuracy on that sample. Therefore, the higher the quality of the document images, the better the results, in both accuracy and processing time. None of the other applications used in our experiments was able to extract the stamp separately, which can cause legal problems because the stamp often overlaps the signature. Our product has adequate accuracy and is not time-consuming, while supporting Vietnamese and separately
extracting the stamp and signature.

TABLE II. COMPARISON OF THE ACCURACY AND PROCESSING TIME

Doc ID      Application           Error of chars   Acc (%)   Time (ms)
480/QBGDT   Our Application       84               91.899    5724
            VietOCR               25               97.589    1410
            SodaPDF               490              52.427    11340
            ABBYY FineReader      88               91.514    4280
            Omnipage Docudirect   436              57.955    14420
110/TBHQT   Our Application       46               95.534    5466
            VietOCR               -                99.339    1600
            SodaPDF               98               90.485    15900
            ABBYY FineReader      45               95.641    3750
            Omnipage Docudirect   175              83.009    6600
187/QUBND   Our Application       43               95.776    5558
            VietOCR               12               99.207    1230
            SodaPDF               140              90.752    15200
            ABBYY FineReader      38               97.490    5890
            Omnipage Docudirect   262              82.694    8200

We compare the supported functions, the ability to process stamp and text, the license and the platform of several applications in practice in Table III. The advantages of our application are that it is free to use, it is designed as a web application, and it can process a document with multiple pages. Omnipage Docudirect does not support Vietnamese but can still generate relatively accurate words. SodaPDF, ABBYY FineReader and Omnipage Docudirect cannot separately extract the image of the stamp and signature. This also distinguishes them from our method and VietOCR. In addition, a fee must be paid to use them. In general, our product has adequate accuracy, supports Vietnamese, separately extracts stamps and signatures, processes standard documents and exports the output in the structured format of the administrative document form. Importantly, our application is available free of charge to support organizations in their digital transformation.

TABLE III. COMPARISON OF UTILITIES BETWEEN THE APPLICATIONS

Apps              Support VN   Extract stamp   Structure form   License   Platform
Our application   Yes          Yes             Yes              Free      Web-app

Remaining rows (VietOCR, SodaPDF, ABBYY FineReader, Omnipage Docudirect), cell values in extraction order: Yes, No, Not clear, Not clear, Not clear, No, Free, Trial Vers, Trial Vers, Trial Vers, Web-app, Web-app/desktop, Yes, Yes, No, Yes, Yes, Yes, Desktop, Desktop.

VI. CONCLUSION

In this research, we have proposed a method and built an
application for digitizing administrative documents. The research relies on the fields of computer graphics, computer vision and image processing. We used ReactJS and NodeJS together with the OpenCV and Tesseract OCR libraries to build a web application for digitizing and managing administrative documents. The obtained results reached more than 91% accuracy, and the processing time is just a few seconds per document page (for both scanning and character recognition). The application is useful in administrative document management and can support staff in office activities.

The current system has difficulty processing handwritten documents, as they are inconsistent and do not follow any rules. However, this can be addressed with deep learning models in the future. Moreover, the system uses a predetermined structure of an administrative document, which does not cover all real-world use cases. This issue is minor because the structure of the document form can easily be modified and improved. Although the accuracy of the obtained results is not 100%, the system offers a fast response time with sufficient accuracy, performs best on printed documents, and provides an option for editing before export. With such accuracy and response time, the application can be used in large organizations and firms, where a huge number of administrative documents are processed each day. Moreover, integrating the OCR module into the web application did not degrade performance; rather, it enhanced the user experience by providing an easy-to-use interface for conversion, editing and management. With the resulting web application, the daily administrative workload can be significantly reduced, and a fast and secure solution is provided. Improvements in accuracy and a wider range of use cases can be
covered with a larger dataset containing different forms of documents, as well as with a wider variety of character dictionaries beyond those of Tesseract OCR. Further research and tests are necessary to optimize the system. In addition, we will study the processing of handwritten characters to improve the next version.

VII. ACKNOWLEDGMENT

The research work in this paper is funded by the student project of the International University, Vietnam National University of Ho Chi Minh City (HCMIU), under project ID SV20220-IT-05. We would like to thank the university for the funding.

REFERENCES

[1] Thomas M. Siebel. "Digital Transformation: Survive and Thrive in an Era of Mass Extinction". First edition, RosettaBooks, 2019.
[2] Ziyadin S., Suieubayeva S. and Utegenova A. "Digital Transformation in Business". International Scientific Conference "Digital Transformation of the Economy: Challenges, Trends, New Opportunities" (ISCDTE 2019). Lecture Notes in Networks and Systems, vol. 84, pp. 408-415, https://doi.org/10.1007/978-3-030-27015-5_49, Springer, 2020.
[3] Johan, M., Tan, R., Suteja, B. and Afiany, N. "Document Digitalization and Scoring System of Students Final Project". Jurnal Teknik Informatika Dan Sistem Informasi, 6(3), https://doi.org/10.28932/jutisi.v6i3.3126, 2020.
[4] Johan, M., Tan, R., Suteja, B. and Afiany, N. "Document digitalization through use of cloud computing technology". International Journal of Engineering Applied Sciences and Technology, Vol. 4, Issue 10, ISSN 2455-2143, pp. 260-262, 2020.
[5] The State Records and Archives Department of Vietnam, https://luutru.gov.vn/home.htm, accessed Nov. 2021.
[6] Fan, L., Zhang, F. and Fan, H. "Brief review of image denoising techniques". Vis Comput Ind Biomed, https://doi.org/10.1186/s42492-019-0016-7, 2019.
[7] Ray Smith. "An Overview of the Tesseract OCR engine". International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, pp. 629-633, 2007.
[8] M. Borg, T. Olsson, U. Franke and S. Assar. "Digitalization of Swedish Government Agencies - A Perspective Through the Lens of a Software Development Census". IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), pp. 37-46, 2018.
[9] Chirag Patel, Atul Patel and Dharmendra Patel. "Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study". International Journal of Computer Applications, Volume 55, No. 10, 2012.
[10] J. Memon, M. Sami, R. A. Khan and M. Uddin. "Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR)". IEEE Access, Vol. 8, pp. 142642-142668, 2020.
[11] Koichi Kise. "Page Segmentation Techniques in Document Analysis". Handbook of Document Image Processing and Recognition, pp. 135-175, Springer, 2014.
[12] Chung B. W. "Getting Started with Processing and OpenCV". Pro Processing for Images and Computer Vision with OpenCV, pp. 1-37, doi.org/10.1007/978-1-4842-2775-6_1, 2017.
[13] Minh Khai Tran, Sinh Van Nguyen, Nghia Tuan To and Marcin Maleszka. "Processing and Visualizing the 3D Models in Digital Heritage". 13th International Conference on Computational Collective Intelligence (ICCCI 2021, Rank B). Lecture Notes in Computer Science, vol. 12876, Springer, pp. 613-625, 2021.
[14] Nguyen Van Sinh, Tran Manh Ha and Le Son Truong. "Visualization of Medical Images Data Based on Geometric Modeling". Lecture Notes in Computer Science 11814, ISSN 0302-9743, pp. 560-576, Springer, 2019.
[15] Vietnam Government. "Format of the administrative document". Number 30/2020/ND-CP, March 23, 2020.
[16] Ruili Zhang, Yanming Yang and Wenxiu Wang. "Research on document digitization processing technology". MATEC Web of Conferences 309, 02014, CSCNS2019, pp. 1-6, doi.org/10.1051/matecconf/202030902014, 2020.
[17] ReactJS, A JavaScript library for building user interfaces, https://reactjs.org, accessed Nov. 2021.
[18] Sanchit Aggarwal. "Modern Web-Development using ReactJS". International Journal of Recent Research Aspects, ISSN 2349-7688, Vol. 5, Issue 1, pp. 133-137, March 2018.
[19] Introduction to NodeJS, https://nodejs.dev/learn, accessed Nov. 2021.
[20] Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos. "Flexible character accuracy measure for reading-order-independent evaluation". Pattern Recognition Letters 131, pp. 390-397, doi.org/10.1016/j.patrec.2020.02.003, 2020.
[21] Visual Studio Code for the Web, https://code.visualstudio.com, accessed Nov. 2021.
[22] Christudas B. "Install, Configure, and Run MongoDB". Practical Microservices Architectural Patterns, 2019.
[23] Ray W. Smith. "History of the Tesseract OCR engine: what worked and what didn't". Document Recognition and Retrieval XX, 865802, https://doi.org/10.1117/12.2010051, 2013.
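The first steps of the algorithm in Section III-B (threshold, dilate, find bounding rectangles) can be illustrated with a minimal JavaScript sketch. It is not the authors' implementation: the paper uses OpenCV, whereas this sketch works on a plain array of grayscale values and replaces contour detection with a simple flood fill. The function names, the convention that ink is dark (0) on a white (255) background, the threshold of 128 and the dilation radius are all assumptions chosen for illustration.

```javascript
// Binarize a grayscale image given as rows of 0-255 values.
// Pixels above `threshold` become white (255), all others black (0).
function thresholdImage(gray, threshold) {
  return gray.map((row) => row.map((v) => (v > threshold ? 255 : 0)));
}

// Thicken the ink (black, 0) with a (2r+1) x (2r+1) square window so that
// neighbouring characters merge into one partition.
function dilateInk(binary, r) {
  const h = binary.length;
  const w = binary[0].length;
  return binary.map((row, y) =>
    row.map((_, x) => {
      for (let dy = -r; dy <= r; dy++) {
        for (let dx = -r; dx <= r; dx++) {
          const yy = y + dy;
          const xx = x + dx;
          if (yy >= 0 && yy < h && xx >= 0 && xx < w && binary[yy][xx] === 0) {
            return 0; // an ink pixel lies inside the window
          }
        }
      }
      return 255;
    })
  );
}

// Bounding rectangles (x, y, width, height) of the 4-connected ink regions.
// This flood fill stands in for OpenCV-style contour detection.
function boundingBoxes(binary) {
  const h = binary.length;
  const w = binary[0].length;
  const seen = binary.map((row) => row.map(() => false));
  const boxes = [];
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      if (binary[y][x] !== 0 || seen[y][x]) continue;
      let minX = x, maxX = x, minY = y, maxY = y;
      const stack = [[x, y]];
      seen[y][x] = true;
      while (stack.length > 0) {
        const [cx, cy] = stack.pop();
        minX = Math.min(minX, cx);
        maxX = Math.max(maxX, cx);
        minY = Math.min(minY, cy);
        maxY = Math.max(maxY, cy);
        for (const [nx, ny] of [[cx + 1, cy], [cx - 1, cy], [cx, cy + 1], [cx, cy - 1]]) {
          if (nx >= 0 && nx < w && ny >= 0 && ny < h &&
              binary[ny][nx] === 0 && !seen[ny][nx]) {
            seen[ny][nx] = true;
            stack.push([nx, ny]);
          }
        }
      }
      boxes.push({ x: minX, y: minY, width: maxX - minX + 1, height: maxY - minY + 1 });
    }
  }
  return boxes;
}

// Toy 5 x 9 "page": two nearby ink pixels (one word) and one further away.
const B = 240; // background gray level
const K = 30;  // ink gray level
const page = [
  [B, B, B, B, B, B, B, B, B],
  [B, B, B, B, B, B, B, B, B],
  [B, K, B, K, B, B, B, K, B],
  [B, B, B, B, B, B, B, B, B],
  [B, B, B, B, B, B, B, B, B],
];
const binary = thresholdImage(page, 128);
const merged = dilateInk(binary, 1);
console.log(boundingBoxes(binary).length); // 3 separate characters
console.log(boundingBoxes(merged));        // 2 partitions after dilation
```

In this toy page, dilation merges the two nearby ink pixels into one partition while the distant one stays separate, which is the effect the method relies on before matching each bounding rectangle to a component of the template in Fig. 1.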

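The character-level accuracy of equation (1) in Section V, Ac = 1 - E/C, can be checked against Table I with a short JavaScript sketch. The function and parameter names are illustrative and do not come from the paper; rounding to three decimals is assumed from the precision shown in the table.

```javascript
// Character-level accuracy from equation (1): Ac = 1 - E / C.
// `errors` is E (erroneous characters), `chars` is C (all characters).
function characterAccuracy(errors, chars) {
  if (!Number.isInteger(errors) || !Number.isInteger(chars)) {
    throw new TypeError("errors and chars must be integers");
  }
  if (chars <= 0 || errors < 0 || errors > chars) {
    throw new RangeError("expected 0 <= errors <= chars and chars > 0");
  }
  return 1 - errors / chars;
}

// Accuracy as a percentage rounded to three decimals, as printed in Table I.
function accuracyPercent(errors, chars) {
  return Number((characterAccuracy(errors, chars) * 100).toFixed(3));
}

console.log(accuracyPercent(83, 1037)); // 91.996 (480/QBGDT, original)
console.log(accuracyPercent(45, 1030)); // 95.631 (110/TBHQT, original)
```

Both values match the corresponding Acc (%) entries of Table I for the original scans of documents 480/QBGDT and 110/TBHQT.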
Date posted: 22/02/2023, 22:42
