Applying image pre-processing and post-processing in optical character recognition: a study applied to Vietnamese business cards


KỶ YẾU HỘI NGHỊ KHOA HỌC THƯỜNG NIÊN TRƯỜNG ĐẠI HỌC ĐÀ LẠT NĂM 2018 (Proceedings of the Annual Scientific Conference of Dalat University, 2018)

APPLYING IMAGE PRE-PROCESSING AND POST-PROCESSING TO OCR: A CASE STUDY FOR VIETNAMESE BUSINESS CARDS

Thái Duy Quý [a,*], Võ Phương Bình [a], Trần Nhật Quang [a], Phan Thị Thanh Nga [a]
[a] The Faculty of Information Technology, Dalat University, Lamdong, Vietnam
[*] Corresponding author: Email: quytd@dlu.edu.vn

Abstract

This paper proposes image pre-processing and Vietnamese post-processing algorithms for use with the open-source Tesseract Optical Character Recognition (OCR) library. We built an Android mobile application and applied the results to Vietnamese business cards. The experimental results show that the proposed method, implemented as an Android application, achieves higher accuracy than the original OCR library.

Keywords: Android; OCR; Image pre-processing; Post-processing; Vietnamese Business Card

INTRODUCTION

In daily work, we often receive business cards from friends or partners. A business card typically carries information such as a name, an address, and a phone number. In the contact list of a smartphone, the user can also store the same
contact information as a business card. Therefore, our goal is to build an application that extracts the text of a business card and saves the contact information to a smartphone. The Android application captures an image of the business card directly with the phone's camera. Noise in the image is then removed, and the image is passed to the Optical Character Recognition (OCR) engine, which extracts the necessary information and saves it to the contact list. To improve the efficiency of the extraction process, we developed improved algorithms for image pre-processing and post-processing. Our application is implemented on an Android device and tested with Vietnamese business cards. The OCR engine used in this paper is the open-source Tesseract library.

RELATED WORK

OCR systems have been under development in research and industry since the 1950s, using knowledge-based and statistical pattern recognition techniques to transform scanned or photographed images of text into machine-editable text files (Eason, Noble, & Sneddon, 1955). Shalin, Chopra, Ghadge, and Onkar (2014) developed an early OCR system. Image pre-processing techniques, used as an initial step in character recognition systems, were presented; among the steps of optical character recognition, feature extraction was identified as the most important. To improve the accuracy of image recognition, Mande and Hansheng (2015) and Matteo, Ratko, Matija, and Tihomir (2017) proposed efficient methods to remove background noise and to enhance low-quality images, respectively. In addition, Nirmala and Nagabhushan (2009) proposed an approach that can handle document images with varying backgrounds of multiple colors. Bhaskar, Lavassar, and Green (2015); Pal, Rajani, Poojary, and Prasad (2017); and Yorozu, Hirano, Oka, and Tagawa (1987) presented tutorials on improving the accuracy of OCR when converting printed words into digital text.

Although there are many OCR applications that are highly
accurate for the English language (Badla, 2014; Chang & Steven, 2009; Kulkarni, Jadhav, Kalpe, & Kurkut, 2014; Palan, Bhatt, Mehta, Shavdia, & Kambli, 2014; Phan, Nguyen, Nguyen, Thai, & Vo, 2017; Trần, 2013), OCR systems for non-English languages may have several problems. Vietnamese is a tonal, monosyllabic language (Phan et al., 2017). We did not find any relevant study reporting a 100% recognition rate for Vietnamese, although some applications have been implemented, such as that of Trần (2013). Among commercial products, CamCard is popular, but it offers little support for Vietnamese-language business cards. Business Card Reader Free, available on the Google Store, supports Vietnamese, but its experimental accuracy is not high.

OCR AND TESSERACT

OCR is the process of converting scanned images of typewritten or printed text into machine-encoded text. OCR has been in development for almost 90 years: the first patent for an OCR machine was filed in Germany in 1929 by Gustav Tauschek, and an American patent followed in 1935. OCR has many applications, including use in the postal service, language translation, and digital libraries. Currently, OCR is even in the hands of the general public in the form of mobile applications. The input to an OCR system is an image containing text that cannot be edited; the output is editable text extracted from that image. The OCR process is illustrated in Fig. 1.

Figure 1. OCR process

The OCR process converts an image to text in several stages. To simplify these steps, we use the open-source software Tesseract as the kernel of our project. Tesseract was first built in 1985 by Hewlett Packard. The project later changed hands and was further developed by the University of Nevada-Las Vegas from 1996 to 2006 (Matteo et al., 2017). Since 2007, Google has
sponsored this project under the Apache 2.0 license as open-source software. Today, Tesseract is widely considered one of the most accurate free OCR engines and is one of the most widely used in the world. Tesseract now provides support for 139 languages (Mande & Hansheng, 2015). The Tesseract OCR process can be represented by the flow chart in Figure 2. The system has eight stages (Bhaskar et al., 2017):

• Gray-scale or color image input: The input data should ideally be a "flat" image from a flatbed scanner or a near-parallel image capture.
• Adaptive threshold: Reduces the gray-scale image to a binary image using Otsu's method (Bhaskar et al., 2017). The algorithm assumes that an image contains foreground (black) pixels and background (white) pixels. It then calculates the optimal threshold separating the two pixel classes, so that the variance within each class is minimal (equivalently, the variance between the two classes is maximal).
• Connected-component labeling: Using the binary image, Tesseract identifies the foreground pixels and marks the potential characters.
• Line finding algorithm: Lines of text are found by analyzing the image space adjacent to potential characters.
• Baseline fitting algorithm: Finds a baseline for each of the lines. After each line of text is found, Tesseract examines it to find the approximate text height across the line.
• Fixed pitch detection: The other step in setting up character detection is finding the approximate character width. This allows the correct incremental extraction of characters as Tesseract progresses along a line.
• Non-fixed pitch spacing delimiting: Characters that are not of uniform width, or not of a width that agrees with the surrounding neighbourhood, are reclassified to be processed in an alternate manner.
• Word recognition: After finding all of the possible character "blobs" in the document, Tesseract performs word
recognition on a word-by-word, line-by-line basis. Words are then passed through a contextual and syntactical analyzer, which helps ensure accurate recognition.

Figure 2. Tesseract flow chart

PROPOSED METHOD

4.1 Pre-processing

The Tesseract engine is the kernel of the OCR system in our project. To improve the accuracy of the process, we apply several pre-processing techniques to the input images. The first technique is to fix a frame around the card after the picture is taken and to convert the image to gray-scale. After that, we use the methods proposed by Mande and Hansheng (2015); Matteo et al. (2017); and Shivananda and Nagabhushan (2009).

When the user finishes taking the picture, the program automatically identifies a frame for it, which is the outline of the business card. The size and shape of the frame can be changed to suit text recognition. This step not only helps increase the accuracy of the captured image but also removes unnecessary parts of the business card. Figure 3 shows an example of the frame selection for a photographed business card. We used the open-source OpenCV library, which is an efficient tool for image processing. OpenCV can also convert a color picture to a gray-scale picture, which is very convenient for the next step of our OCR process.

Figure 3. A frame after taking a picture

In addition, the images can be processed further before they are passed to Tesseract, so we applied several methods proposed by previous authors. First, the original color image is converted into a gray-scale image using the formula proposed by Li, Jia-bing, and Shan-shan (2010), shown in Equation (1):

    Y = 0.299R + 0.587G + 0.114B    (1)

where R, G, and B are the normalized red, green, and blue pixel values, respectively.

Second, we applied the two techniques proposed by Badla (2014), luminosity and DPI enhancement, to the images. Both of these techniques use the OpenCV
library to perform the conversion. Luminosity is a method for converting an image into gray-scale while preserving some of the color intensities (Badla, 2014). The code below describes the image luminosity process:

    import java.awt.Color;
    import java.awt.image.BufferedImage;

    // Convert a buffered image to gray-scale using luminosity weights.
    static BufferedImage toLuminosity(BufferedImage input) {
        int width = input.getWidth();
        int height = input.getHeight();
        BufferedImage output = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        // iterate over all the pixels in the image
        for (int w = 0; w < width; w++) {
            for (int h = 0; h < height; h++) {
                // getRGB() returns the packed color of the pixel; Color(int) unpacks it
                Color color = new Color(input.getRGB(w, h));
                // weight the red, green, and blue components
                int luminosity = (int) (0.2126 * color.getRed() + 0.7152 * color.getGreen() + 0.0722 * color.getBlue());
                // set the gray pixel in the newly formed image
                output.setRGB(w, h, new Color(luminosity, luminosity, luminosity).getRGB());
            }
        }
        return output;
    }

To get the best results from the image, we need to set its resolution to at least 300 DPI, the minimum acceptable for Tesseract (Badla, 2014). The algorithm for DPI enhancement, in pseudocode, is as follows:

    start edge_extract(low, high) {
        // define edge
        Edge edge;
        // form image matrices
        int imgx[3][3] = {};
        int imgy[3][3] = {};
        Img height;
        Img width;
        // get diff in DPI on the x edge and on the y edge
        diffx = height * width;
        diffy = r_Height * r_Width;
        img magnitude = sizeof(int) * r_Height * r_Width;
        memset(diffx, 0, sizeof(int) * r_Height * r_Width);
        memset(diffy, 0, sizeof(int) * r_Height * r_Width);
        memset(mag, 0, sizeof(int) * r_Height * r_Width);
        // compute the angles and magnitude in the input image
        for (int y = 0 to y = height)
            for (int x = 0 to x = width) {
                Result_xside += pixel * x[dy][dx];
                Result_yside += pixel * y[dy][dx];
            }
        // return the recreated image
        result = new Image(edge, r_Height, r_Width);
        return result;
    }

Finally, we use the methods proposed by Mande and Hansheng (2015) and Matteo et al. (2017) for low-quality images and images with background noise. Tesseract requires a minimum text size for reasonable accuracy: if the x-height of the text is below 20px, the accuracy drops off. The first pre-processing method proposed by Matteo
et al. (2017) is image resizing so that the image height is 100px. Resizing is applied only if the height of the original image is below 100px. The second pre-processing method of Matteo et al. (2017) is image sharpening. The main reason for using it is to enhance the contrast at edges, i.e., between text and background. The image sharpening is achieved using unsharp masking, represented by Equation (2):

    g(i, j) = f(i, j) - f_smooth(i, j)    (2)
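The resizing rule and Equation (2) can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name UnsharpSketch, the nearest-neighbour resize, the 3x3 box blur standing in for f_smooth, and the final step f + k * g with a gain k are all choices made here for the example. The text above specifies only the 100px height threshold and the mask g = f - f_smooth.

```java
/** Sketch only: class and method names are illustrative assumptions. */
final class UnsharpSketch {

    /** Upscales (nearest neighbour) only when the image is shorter than minHeight. */
    static int[][] resizeToMinHeight(int[][] gray, int minHeight) {
        int h = gray.length;
        int w = gray[0].length;
        if (h >= minHeight) {
            return gray; // tall enough: leave the image untouched
        }
        double scale = (double) minHeight / h;
        int newW = Math.max(1, (int) Math.round(w * scale));
        int[][] out = new int[minHeight][newW];
        for (int y = 0; y < minHeight; y++) {
            int srcY = Math.min(h - 1, (int) (y / scale));
            for (int x = 0; x < newW; x++) {
                int srcX = Math.min(w - 1, (int) (x / scale));
                out[y][x] = gray[srcY][srcX];
            }
        }
        return out;
    }

    /** f_smooth: 3x3 box blur, with border pixels replicated (an assumed smoothing filter). */
    static int[][] boxBlur(int[][] f) {
        int h = f.length;
        int w = f[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = Math.min(h - 1, Math.max(0, y + dy));
                        int xx = Math.min(w - 1, Math.max(0, x + dx));
                        sum += f[yy][xx];
                    }
                }
                out[y][x] = Math.round(sum / 9.0f);
            }
        }
        return out;
    }

    /** Computes g = f - f_smooth (Equation 2), then returns f + k * g clamped to 0..255. */
    static int[][] unsharpMask(int[][] f, double k) {
        int[][] smooth = boxBlur(f);
        int h = f.length;
        int w = f[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int g = f[y][x] - smooth[y][x];            // the unsharp mask
                int v = (int) Math.round(f[y][x] + k * g); // add the mask back (assumed step)
                out[y][x] = Math.min(255, Math.max(0, v));
            }
        }
        return out;
    }
}
```

For a gray-scale card image held as an int[][] array named gray (a hypothetical variable), UnsharpSketch.unsharpMask(UnsharpSketch.resizeToMinHeight(gray, 100), 1.0) would apply both steps in order. On a 4x4 test image whose rows are all [100, 100, 150, 150], a gain of k = 1 turns the two pixels on either side of the edge into 83 and 167, so the contrast across the edge grows from 50 to 84 while flat regions stay unchanged.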

Date posted: 28/02/2023, 20:42