NATIONAL UNIVERSITY OF HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
INSTRUCTOR: ASSOC. PROF. QUẢN THÀNH THƠ
REVIEWER: ASSOC. PROF. BÙI HOÀI THẮNG
STUDENT: LÊ BÁ THÀNH - 1552340
Ho Chi Minh City, July 2021
UNIVERSITY OF TECHNOLOGY
FACULTY: Computer Science and Engineering    GRADUATION THESIS ASSIGNMENT
DEPARTMENT: Computer Science    Note: The student must attach this sheet to the first page of the thesis report
1. Thesis title:
Development of a mobile application to process house ownership certificate
2. Tasks (requirements on content and initial data):
✔ Design and implement a user-friendly UI that lets users easily interact with the application and retrieve information
✔ Define which parts of the certificate contain useful information to extract
✔ Build a pipeline that processes the relevant part of the certificate to retrieve the location of the land on the map for quick review
✔ Improve the performance of location processing so that it runs faster
✔ Evaluate the whole process and the accuracy of the application
3. Date of thesis assignment:
4. Date of task completion:
5. Supervisor's full name: Supervised part:
HEAD OF DEPARTMENT    MAIN SUPERVISOR
Assoc. Prof. Dr. Quản Thành Thơ
SECTION FOR THE FACULTY AND DEPARTMENT:
Approved by (preliminary grading):
UNIVERSITY OF TECHNOLOGY    SOCIALIST REPUBLIC OF VIETNAM
FACULTY OF COMPUTER SCIENCE AND ENGINEERING    Independence - Freedom - Happiness
2. Thesis topic: Development of a mobile application to process house ownership certificate
3. Full name of supervisor/reviewer: Assoc. Prof. Dr. Quản Thành Thơ
4. Overview of the thesis report:
6. Main strengths of the thesis:
The thesis addresses some challenging issues in processing an image of a house ownership certificate, namely text extraction and, especially, identifying the location of the land. The student conducted a literature review and proposed suitable solutions based on state-of-the-art methods in computer vision, which successfully solved the challenges. As a result, the student introduced a full-fledged mobile application and published a paper at a scientific conference of OISP students.
7. Main shortcomings of the thesis:
- The writing is not very good and contains a number of grammatical errors.
8. Recommendation: Approved for defense □  Needs supplements before defense □  Not approved for defense □
9. Three questions the student must answer before the committee:
a
b
c
10. Overall assessment (in words: excellent, good, average):  Score: 8.7/10
Signature (with full name)
Assoc. Prof. Dr. Quản Thành Thơ
Major (specialization): Computer Science
2. Thesis topic: Development of a mobile application to process house ownership certificate
3. Full name of supervisor/reviewer: Bùi Hoài Thắng
4. Overview of the thesis report:
6. Main strengths of the thesis:
- Presented the motivation and then the problem of recognizing the text in the House Ownership Certificate to help the buyer find the actual location of the house/land for sale in order to avoid fraud.
- Presented a study of the background knowledge for that area, including deep learning and artificial intelligence techniques, particularly in image processing, text detection, and character recognition.
- Proposed a model that combines those techniques in a mobile system that lets buyers take a photo of the House Ownership Certificate and see the actual house on Google Maps, with a Street View option.
7. Main shortcomings of the thesis:
8. Recommendation: Approved for defense X  Needs supplements before defense  Not approved for defense
9. Three questions the student must answer before the committee:
a
b
c
10. Overall assessment (in words: excellent, good, average): Excellent  Score: 9.0/10
Signature (with full name)
Bùi Hoài Thắng
Student ID: 1552340
I would like to express my deep gratitude and appreciation to Assoc. Prof. Quản Thành Thơ, my instructor, for his guidance and instructions throughout this project. He never stopped challenging me and helping me develop my ideas with patience and advice.
My appreciation also extends to the teachers in the Faculty of Computer Science and Engineering in particular, and to all the teachers at the Ho Chi Minh City University of Technology in general.
LE BA THANH
PREFACE
Owning a house or land is becoming part of many individuals' plans, but few people have enough time to follow the seller to the house or land to see the property in reality. With this application, a few touches are enough for users to obtain the property's location from the house ownership certificate, instead of spending an hour in conversation to find out where it is.
Pham Thanh Hung (Shark Hung in the show Shark Tank Viet Nam), chairman of the joint-stock company CENINVEST, described this site-viewing problem: an employee received a house ownership certificate for one place, but the site visited turned out to be another place, because the location shown did not match the coordinates of the house on the certificate.
This thesis proposes a mobile application that locates the property described by a house ownership certificate based on the coordinate table printed on the certificate. Furthermore, it combines information on the type of house/land, the form of use, the purpose of use, and the area to infer whether the property is sufficient for borrowing money from a bank to invest. It can also indicate the opportunity to upgrade the land from farmland to residential land by paying the government fee for changing the land type.
LIST OF FIGURES
1.1 House owner certificate sample
2.1 A normal pipeline in OCR process
2.2 Use case of Google Vision API
2.3 The result of text processed and the return JSON encoded
2.4 Schematic of CRAFT network architecture
2.5 A normal pipeline in OCR process
2.6 Convolution example
2.7 A 3*3 filter matrix applied to a 5*5 image matrix creates a 3*3 feature map
2.8 The matrix after padding by 1
2.9 The matrix with padding 1 with stride 1 and padding 1 with stride 2
2.10 Max pooling with stride 2 and kernel 2*2
2.11 CNN architecture for detecting handwritten numbers
2.12 AttentionOCR architecture
2.13 CNN flattens out the feature maps
2.14 TransformerOCR architecture
3.1 React Native Architecture
3.2 Machine Learning libraries in Python
3.3 Client-Server Model
3.4 Old server model
3.5 Virtualization server model
3.6 Containerization server model
3.7 The morphological dilation on binary image
3.8 The morphological erosion on binary image
3.9 Matrix inside the diamond kernel structure element
4.1 Overall architecture OCR process
4.2 Application flowchart
5.1 House owner certificate left side
5.2 House owner certificate right side
5.3 Original left side certificate
5.4 Binary image after threshold
5.5 All the contours detected on the image
5.6 Image text detection: original and after using GDBSCAN
5.7 The raw text that Tesseract has processed
5.8 Image text detection on Hinh1: full page and after using GDBSCAN
5.9 House owner certificate right side
5.10 Image with text enhanced and background noise removed
5.11 Vertical binary image and horizontal binary image
5.12 The combined horizontal and vertical edge image
5.13 The joint nodes between horizontal and vertical lines with dilation
5.14 All the contours extracted from the blended image and the table coordinates extracted by joint value
5.15 Main Screen and Upload Screen
5.16 Upload Screen before and after process
5.17 Review screen and map screen
6.1 Response from backend API with all the fields extracted as JSON
6.2 Curved text image processed by the normal pipeline with GDBSCAN
6.3 Curved text image processed by the RefineNet model of CRAFT
6.4 The polygon bounding the text with the cropped polygon
6.5 Result after rectifying the polygon text
A.1 Working plan
LIST OF TABLES
1.1 Types of land that appear inside the house ownership certificate
2.1 Accuracy tested by the author of VietOCR
5.1 Testing Hinh1 with two modules, VietOCR and Tesseract
5.3 Testing with different sizes of Hinh1
6.2 Summary of the information to extract with regular expression rules
6.3 Evaluation of recognition on the dataset with CPU and GPU
6.4 Evaluation of the proposed method on the dataset
CNN Convolutional Neural Network
CRAFT Character Region Awareness For Text Detection
OCR Optical Character Recognition
RN React Native
CONTENTS
Preface
Chapter 1 OVERVIEW
1.1 Introduction
1.2 Motivation for this work
1.3 Objective and Research scope
1.4 Thesis architecture
Chapter 2 RELATED WORK
2.1 Google Cloud Vision OCR
2.2 CRAFT Character Region Awareness for text detection
2.3 Convolutional neural network
2.4 Vietnamese Optical Character Reader
2.4.1 Attention Optical Character Reader
2.4.2 Transformer Optical Character Reader
2.5 Tesseract OCR
Chapter 3 THEORETICAL FRAMEWORK
3.1 Mobile Development
3.1.1 React Native
3.2 Back-end Programming Language
3.2.1 Python
3.3 Network Architecture
3.3.1 Flask
3.3.2 REST API
3.3.3 Client-Server Model
3.3.4 Firebase
3.3.5 Docker
3.4 Image Processing
3.4.1 Color Picture
3.4.2 Morphological transformation
Chapter 4 SYSTEM DESIGN
4.1 Overall System
4.1.1 System Architecture
4.1.2 Diagrammatic Design
Chapter 5 IMPLEMENTATION
5.1 Overview
5.2 Introduce Dataset
5.2.1 Dataset using
5.3 CRAFT configuration
5.4 Left side certificate process part
5.5 Right side certificate process part
5.6 Work Flow
Chapter 6 EXPERIMENT AND EVALUATION
6.1 Block text extract with morphology
6.2 Table extract with morphology
6.3 Discussion
Chapter 7 SUMMARY WORK
7.1 Final result
7.1.1 Pros
7.1.2 Cons
7.1.3 System overall
7.2 Difficulties
7.3 Future Works
CHAPTER 1 OVERVIEW
In this chapter, I outline the topic and the structure of the thesis.
Content
1.1 Introduction
1.2 Motivation for this work
1.3 Objective and Research scope
1.4 Thesis architecture
1.1 Introduction
The real estate market in Viet Nam is developing and has proved more conspicuous than ever. It generates passive profit for investors, since a property can gain value by itself over time. This business also attracts a massive number of people who want to invest, and the value of transactions in this field increases slightly every day. This is especially true in fast-paced cities such as Ho Chi Minh City, Ha Noi, and Da Nang, known as leading economic centers, and in the special economic zones Phu Quoc, Van Don, and Bac Van Phong, where prices and market heat keep rising. This leads to a surge of data and a growing tendency to trade houses and land. Because of the popularity and revenue of this market, some people abuse it to commit fraud and take advantage of others. Investors range from individuals and households to well-known groups, which shows the demand of society for this business. However, not all of these investors are equipped with enough knowledge, and they can become targets of sophisticated tricks in trading real estate.
The demand to invest in real estate these days is, in a word, full of promise and opportunities. Still, few people have enough time to review and understand all the information given in a house ownership certificate. The House Ownership Certificate is of primary importance in buying and selling real estate in general and houses in particular. It not only shows the house/land owner's information for authentication, but also contains a great deal of relevant information that customers can use to decide whether to buy or invest in that land, such as the form of land use, the purpose of use, the type of house/land, the address, the area, and the expiry date.
In recent times, computer vision has proved powerful in many different fields of our daily activities. It reduces human labor, improves productivity, and saves people time. Typical examples include recognizing motorbike or car license plates so that security staff can protect the property of customers, users, or residents of an apartment building. Beyond private security, police departments use smart camera systems to recognize vehicles moving on the road. By capturing images and converting them into machine-readable data for processing, the police can rely on the result to issue a penalty to the vehicle's owner. In office work, the ubiquity of handwritten and printed documents means massive paperwork for employees. Users have to record, read, and review documents, carry them around, and send them to partners. Traditionally, they have to take a photo or even type the whole document into a file and then share it. To overcome the limitations of this old-fashioned working style, computer vision can help users scan records such as historical, educational, and business data as pictures and convert them into digital documents that can be shared with a few simple actions.
What's more, recognition of personal identity cards and payment bills has recently been provided by computer vision and has attracted a lot of interest from researchers. Some companies have used this technology to process massive amounts of data and improve the user experience. Retrieving the required information has become easier because the traditional manual work is replaced by OCR, also known as Optical Character Reader, which is a branch of computer vision.
Word detection by OCR (Optical Character Recognition) is currently one of the most studied problems in image processing. Its main purpose is to give machines the missing ability to read and recognize the content of an image or paper document and convert it to text corresponding to the original. The growth of computing power and deep learning techniques has played a key role in the significant improvement of image processing on computers.
With machine learning, the real estate industry is now studied by many researchers, professors, and experts. Features such as economic indices, number of floors, house or apartment, address, ward, district, and street front or alley can be used to infer the price of a house and predict its price for years to come.
We therefore want to contribute our work to this industry.
1.2 Motivation for this work
The house ownership certificate plays a crucial role in the transaction between seller and customer. Based on the information on the certificate, the customer can assess the potential of the house using the criteria they care about. In a standard transaction, buyer and seller discuss and negotiate the price and the ability to upgrade the type of land, then conclude by buying or not. After looking at the design and other factors of the land, customers want to survey the real condition of the house or land, because they want to buy a property that is near goods and service areas, has a nice view, and so on. Some sellers play tricks to appear to satisfy these needs, such as driving the buyer to another place that meets the buyer's requirements. This happens not only to new investors but even to large corporations. The problem was mentioned on the TV show Shark Tank, where Shark Hung, also known as Pham Thanh Hung, pointed out the difficulty of verifying that the location on the certificate matches the actual site.
Figure 1.1: House owner certificate sample
The figure above shows a full-size house ownership certificate, which contains information related to the property being sold. To a beginner investor this information is meaningless, since they have no experience in this area. Still, if we can extract that information, we can use the data to infer persuasive advice for users on whether they should invest in the property. In particular, the house ownership certificate contains three essential categories, which we describe below.
With the land number and map number, we can access the government map and check the status of the real estate, that is, whether it falls within planned land. However, the seller usually wants to hide these numbers to prevent the customer from searching for data related to land planning projects. Next is the address of the residence. This field only exists if a house has been built, registered, and assigned an address by the government or the local authority.
The second part of the certificate gives the surface area and the type of the land or house. The area indicates the total size of the land and, depending on the kind of land, it can be divided into smaller pieces of land, which we will discuss later. The table below lists the three main types of land, which are a crucial factor in deciding whether to invest in a property. Most investors want residential land, as it has the highest trading rate among the land types in Viet Nam. Still, other land types can be upgraded to residential land after a few government procedures and a fee that depends on the current class of the land. The price can be calculated with the government formula.
No.  Land type  Code
I  AGRICULTURAL LAND GROUP
1  Land specialized for wet rice cultivation  LUC
2  Other wet rice land  LUK
3  Upland rice land  LUN
…  … other annual crops  NHK
6  Perennial crop land  CLN
7  Production forest land  RSX
8  Protection forest land  RPH
9  Special-use forest land  RDD
10  Aquaculture land  NTS
11  Salt production land  LMU
12  Other agricultural land  NKH
II  NON-AGRICULTURAL LAND GROUP
1  Rural residential land  ONT
2  Urban residential land  ODT
3  Land for construction of agency offices  TSC
4  Land for construction of offices of public service organizations  DTS
5  Land for construction of cultural facilities  DVH
6  Land for construction of health facilities  DYT
7  Land for construction of education and training facilities  DGD
15  Industrial park land  SKK
16  Export processing zone land  SKT
17  Industrial cluster land  SKN
18  Land for non-agricultural production facilities  SKC
20  Land used for mineral activities  SKS
21  Land for production of construction materials and ceramics  SKX
22  Transport land  DGT
23  Irrigation land  DTL
24  Land for energy works  DNL
25  Land for postal and telecommunications works  DBV
26  Community activity land  DSH
27  Land for public recreation and entertainment areas  DKV
28  Market land  DCH
29  Land with historical and cultural relics  DDT
30  Land with scenic landscapes  DDL
31  Land for waste dumping and waste treatment  DRA
32  Land for other public works  DCK
33  Land for religious facilities  TON
34  Land for belief facilities  TIN
35  Land for cemeteries, graveyards, funeral homes, and crematoria  NTD
3  Rocky mountain without forest  NCS
Table 1.1: Types of land that appear inside the house ownership certificate
1.3 Objective and Research scope
The core objective of our thesis is to build an end-to-end mobile application that integrates deep learning and computer vision, lets users take and upload photos on their own initiative, and contributes to the real estate industry. The fourth technological revolution is reaching ever closer to our daily lives. The most profitable companies globally, such as Google, Facebook, Amazon, and Netflix, are built on artificial intelligence, with machine learning as their backbone [1]. Other fields follow, and real estate is no exception. Users need machine learning to process the house ownership certificate because sellers and buyers hesitate, invest on reckless decisions, or fear fraud in the legal procedures of trading real estate. Yet few people realize the potential of technology in real estate. Given this situation, we want to contribute by studying documents on mobile applications together with deep learning techniques, and by designing a system architecture for communication between clients and servers.
The scope includes:
• Research deep learning and artificial intelligence models, particularly in image processing, text detection, and text recognition
• Research how to build a mobile application system
• Build a basic system architecture between the client and the server
• Propose the tools, resources, and IDE to be used in the project, then create a detailed roadmap from the beginning to the result
• Propose an application that can take a photo and upload it for image processing
1.4 Thesis architecture
The research content is divided into the chapters below.
Chapter 1 Overview. In this chapter, I outline the topic and the structure of the thesis.
Chapter 2 Related work. While explaining the existing related work, I also mention and explain the structures, architectures, and knowledge that I referenced.
Chapter 3 Theoretical Framework. I explain the theory behind each tool we used, how it works, its pros and cons, and why we chose it for our project. Understanding the theory is the first step to grasping the tools and using them to their full efficiency.
Chapter 4 System Design. This chapter presents the charts that describe the construction of my application.
Chapter 5 Implementation. In this chapter, I discuss the techniques that I used to process the input data and produce the expected output, and clarify how both sides of the house ownership certificate are processed.
Chapter 6 Experiment and evaluation. This chapter presents the testing process used to experiment with and evaluate the overall system.
Chapter 7 Summary Work. This chapter summarizes the work, its evaluation, and future work.
CHAPTER 2 RELATED WORK
In this chapter, we discuss existing projects covering important parts of the optical character reader process, from which we have taken inspiration to complete our thesis.
Content
2.1 Google Cloud Vision OCR
2.2 CRAFT Character Region Awareness for text detection
2.3 Convolutional neural network
2.4 Vietnamese Optical Character Reader
2.5 Tesseract OCR
Most optical character reader systems in industry share a similar workflow, which is shown in the image below and explained afterwards.
Figure 2.1: A normal pipeline in OCR process
Dataset creation: Every deep learning problem needs a large and varied dataset, and optical character reading is no exception. We need data, images, or documents that are relevant to our situation.
Image preprocessing: This is a crucial step in practical OCR problems, since the input data comes from different sources and sometimes has to be normalized to a uniform standard. For instance, an input image may be taken in the wild, such as a traffic sign on the street or a store's signboard. Such an image is usually not straight or flat, and the text is not even, so we have to deskew the text before moving to the next step to keep the extraction accuracy stable. Moreover, noise and redundant information in the input image can reduce the total speed of the process. Image quality and resolution also matter greatly in the pipeline, because if the image quality is too low the result cannot be precise. Thus, to make the whole process run well, we have to preprocess the data with different techniques or open-source image libraries.
Text segmentation: After dealing with the noise in the image, we need to segment each line of text before extracting its content. For this stage, we refer to well-known work on detecting and recognizing text boxes, such as EAST, CTPN, and Faster R-CNN, which offer different trade-offs between accuracy and speed when segmenting each character, word, or line of text. These methods can be used for both text detection and text segmentation.
Optical character recognition: After segmentation or detection, the predicted text boxes are passed to a text recognition model, which recognizes the content of each text line.
Re-structuring: In this stage, the recognized text segments are organized back into their original positions in the image, because the recognition process does not read text the way a human does, from left to right and top to bottom.
Natural language processing: Some problems need one more step that uses an NLP (Natural Language Processing) model such as an RNN (Recurrent Neural Network), or regular expressions for well-structured documents. To use an RNN, we have to prepare a dataset, label the information we want to extract, and then train the model to pick out the exact information among the other redundant information.
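The re-structuring step of this pipeline can be illustrated with a minimal sketch. It assumes that each recognized text box arrives as a hypothetical `(x, y, width, height, text)` tuple, and it groups boxes into lines by the vertical distance between their top edges before sorting each line from left to right. The box format and the `line_tolerance` value are assumptions made for illustration and are not taken from the thesis.

```python
from typing import List, Tuple

# Hypothetical box format for illustration: (x, y, width, height, text),
# where (x, y) is the top-left corner in image pixels.
Box = Tuple[int, int, int, int, str]


def restructure(boxes: List[Box], line_tolerance: int = 10) -> List[str]:
    """Group boxes into text lines and return the lines in reading order."""
    lines: List[List[Box]] = []
    # Visit boxes from top to bottom so that lines are created in order.
    for box in sorted(boxes, key=lambda b: b[1]):
        # A box joins the current line when its top edge is close to the
        # top edge of the first box of that line.
        if lines and abs(box[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Inside each line, order the boxes from left to right.
    return [
        " ".join(b[4] for b in sorted(line, key=lambda b: b[0]))
        for line in lines
    ]


if __name__ == "__main__":
    detected = [
        (220, 52, 80, 20, "ownership"),
        (20, 120, 60, 20, "Area:"),
        (20, 50, 90, 20, "House"),
        (90, 118, 70, 20, "120.5"),
        (310, 49, 90, 20, "certificate"),
    ]
    for line in restructure(detected):
        print(line)
```

A fixed pixel tolerance is the simplest possible grouping rule; a skewed or curved page would likely need a tolerance relative to the box height or a clustering step instead.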
2.1 Google Cloud Vision OCR
To lower the barriers to the practical use of machine learning algorithms and to reach customers, big companies such as Amazon, Google, Microsoft, BigML, and others have developed cloud-based services. The benefit for customers and other companies is that they can use OCR features without training and hosting their own models. In this part, I discuss the existing service provided by Google, a technology company providing Internet-related services and products in fields such as cloud computing, computer software, computer hardware, artificial intelligence, and advertising. One service similar to the feature we implement and introduce later is Google Cloud Vision OCR. This service allows the user to upload an image. After processing, the response contains the labels the service has predicted, the text blocks, the x and y coordinates of the bounding polygons of the text in the image, the text inside each bounding polygon, and the likelihood that the image contains inappropriate content (adult, spoof, medical, violence, and racy), rated on the scale Unknown, Very Unlikely, Unlikely, Possible, Likely, and Very Likely [2].
After trying Google's OCR API, we found that it analyzes the image with different programmed rules and returns data with many meaningful fields to the user. Specifically, there are two annotation types, and the user can choose whichever suits their demand. Here is the use case of Google's service.
Figure 2.2: Use case of Google Vision API
Text_annotation: It extracts machine-encoded text from the user's image and returns it for later processing by the computer. It was originally designed to handle images with widely varying lighting conditions. This model is robust at scanning and reading words of different styles, but only for sparse text. The service encodes the response data as JSON, and the user can easily retrieve the needed information by key and value.
Document_Text_annotation: This is designed for dense text documents such as scanned books. Hence, it is more suitable for full-text documents than for images in the wild such as a traffic sign, although this does not mean it cannot handle them [2].
Figure 2.3: The result of text processed and the return JSON encoded
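As a rough illustration of retrieving information from the JSON response by key and value, the sketch below parses a small hand-written sample shaped like a Cloud Vision text detection response. The field names `responses`, `textAnnotations`, `description`, `boundingPoly`, and `vertices` follow the public REST response format; the sample values, the helper name `extract_words`, and the output layout are assumptions made for this example.

```python
import json

# Hand-written sample shaped like a Cloud Vision text detection response.
# The values are invented for illustration only.
SAMPLE_RESPONSE = """
{
  "responses": [{
    "textAnnotations": [
      {"description": "Area: 120.5",
       "boundingPoly": {"vertices": [{"x": 10, "y": 10}, {"x": 200, "y": 10},
                                     {"x": 200, "y": 40}, {"x": 10, "y": 40}]}},
      {"description": "Area:",
       "boundingPoly": {"vertices": [{"x": 10, "y": 10}, {"x": 90, "y": 10},
                                     {"x": 90, "y": 40}, {"x": 10, "y": 40}]}},
      {"description": "120.5",
       "boundingPoly": {"vertices": [{"x": 110, "y": 10}, {"x": 200, "y": 10},
                                     {"x": 200, "y": 40}, {"x": 110, "y": 40}]}}
    ]
  }]
}
"""


def extract_words(response_text):
    """Return each detected word with an axis-aligned bounding box."""
    response = json.loads(response_text)["responses"][0]
    annotations = response.get("textAnnotations", [])
    words = []
    # The first annotation holds the full text; the rest are single words.
    for item in annotations[1:]:
        vertices = item["boundingPoly"]["vertices"]
        # A coordinate may be missing from a vertex, so default to 0.
        xs = [v.get("x", 0) for v in vertices]
        ys = [v.get("y", 0) for v in vertices]
        words.append({
            "text": item["description"],
            "box": (min(xs), min(ys), max(xs), max(ys)),
        })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_RESPONSE):
        print(word["text"], word["box"])
```

Reducing each polygon to an axis-aligned box keeps the example short; rotated or curved text would need the full list of vertices.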
We should admit that the Google Vision OCR service has many advantages: normal users who do not know about AI, deep learning, or OCR can easily use it as the backbone of a business scanning system. Thanks to an enormous training database, it currently supports more than 60 languages. Moreover, Google's long-term support suggests that recognition quality and speed will be maintained and improved on its large storage and scalable systems, with pricing that encourages customers to use the service: the price is lower when a customer makes over 5 million API calls.
2.2 CRAFT Character Region Awareness for text detection
Scene text detection has received growing attention in recent years due to its broad applicability in document analysis, image data analysis, and scene understanding. The driving factors stem from both application promise and research value. Many proposed detectors can now detect horizontal text (rectangular text boxes), curved text, and arbitrarily shaped text, which helps us capture the useful information that frequently appears in scene images for tasks such as product search [3], place discovery from pictures, and autonomous driving. Conventional methods proposed a long time ago, such as SWT (Stroke Width Transform) and MSER (Maximally Stable Extremal Regions), have largely given way to deep-learning-based methods backed by deep neural networks. Briefly, modern approaches fall into four classes: regression-based, segmentation-based, end-to-end, and character-level methods.
Regression-based text detection: Various text detectors built on ideas from general object detection frameworks have been proposed. Scene text differs from objects in natural images: it often appears in many shapes with different aspect ratios. In Vietnamese, for example, words with tone marks can be narrower than other words.
Segmentation-based text detection: This approach casts text detection as a semantic segmentation problem, seeking text regions at the pixel level and inferring word bounding areas from them. Models proposed on this foundation include Multi-scale FCN, Holistic-prediction, and PixelLink.
End-to-end text detectors: An end-to-end approach trains the detection and recognition modules concurrently to improve detection precision by leveraging the recognition result. Training the text detector together with the recognition module helps it detect text on cluttered backgrounds more precisely and robustly than the two methods above.
Character-level text detector: This approach builds a detector at the character level, using text block candidates distilled by MSER, the traditional blob detection method mentioned above.
All the methods above borrow strengths from one another, and each author optimizes them into an improved version. Among the existing scene text detectors, CRAFT (Character Region Awareness for Text Detection) is the best choice for my case.
CRAFT is an abbreviation of Character Region Awareness For Text Detection. It uses a convolutional neural network as a backbone to produce a character region score and an affinity score. The region score is used to localize individual characters in the image, and the affinity score is used to group the characters into a single instance. The framework adds weakly supervised learning that estimates character-level ground truths in existing real word-level datasets. The objective of this methodology is to localize each character in natural images precisely, using a deep neural network to predict character regions and the affinity between characters. CRAFT is a fully convolutional network architecture that adopts VGG-16 with batch normalization as its backbone. The model has skip connections in the decoding part, and its output has two channels of score maps: the region score and the affinity score. Here is the network architecture [4]:
Figure 2.4: Schematic of CRAFT network architecture
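To make the role of the two score maps more concrete, the following sketch shows one simple way to turn a region score map and an affinity score map into text instances: threshold both maps, merge them into one binary mask, and label connected components. It is a simplified illustration on toy data, not the CRAFT authors' implementation; the function name `group_characters` and both threshold values are assumptions.

```python
from collections import deque


def group_characters(region, affinity, region_thr=0.5, affinity_thr=0.5):
    """Return one box (x_min, y_min, x_max, y_max) per text instance."""
    height, width = len(region), len(region[0])
    # A pixel belongs to text if either score is above its threshold, so
    # characters joined by high affinity fall into the same component.
    mask = [
        [region[y][x] > region_thr or affinity[y][x] > affinity_thr
         for x in range(width)]
        for y in range(height)
    ]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search over 4-connected neighbours.
            queue = deque([(x, y)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


if __name__ == "__main__":
    empty = [0.0] * 9
    # Toy maps: two characters linked by affinity, then a separate one.
    region = [empty, [0.9, 0.9, 0.1, 0.9, 0.9, 0.0, 0.0, 0.8, 0.8], empty]
    affinity = [empty, [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], empty]
    print(group_characters(region, affinity))
```

In the toy example, the high affinity value between the first two characters merges them into one instance, while the third character stays separate because no affinity links it to the others.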
The author has provided two separate models for different purposes:
General purpose: This model has been trained on the datasets SynthText, IC13, and IC17. It has been tested on multiple languages and gives good results.
Arbitrary purpose: This model, named RefineNet, does not generate a bounding box per word annotation in general. Instead, it can produce polygon data bounding a sentence of arbitrary shape, such as long curved text. Its limitation is that it can only generate polygons with a maximum of 14 points.
About the datasets: ICDAR2013 (IC13) was released during the ICDAR 2013 Robust Reading Competition for focused scene text detection. It consists of high-resolution images, split into 229 for training and 233 for testing, containing texts in English. The annotations are at word level, using rectangular boxes. ICDAR2017 (IC17) contains 7,200 training images, 1,800 validation images, and 9,000 testing images with texts in 9 languages for multi-lingual scene text detection. CTW-1500 (CTW) consists of 1,000 training and 500 testing images. Every image has curved text examples, which are annotated by polygons with 14 vertices.
2.3 Convolutional neural network
A convolutional neural network (CNN) is a type of deep neural network, usually applied to computer vision, and also to NLP for better results in sentence classification problems [].
Figure 2.5: A normal pipeline in the OCR process
The core value of a CNN is that it extracts a unique signature of the image by passing the input image through many hidden layers, extracting a feature map at each layer.
Convolution
There are multiple layers in a CNN, and the first layer that the input image comes into contact with is the convolutional layer. The goal of this layer is to extract the distinctive features of the image and discard noise that is not useful for the learning process. The operation takes a two-dimensional image matrix and a filter matrix as input. In an ordinary neural network, the input passes through many hidden layers before reaching the output. In a CNN, a convolutional layer is a hidden layer that contains feature maps, and each feature map is like a scan of the input image that keeps only the extracted features. The scan relies on the convolutional filter (also called the filter or kernel matrix), which slides over the input data from left to right and then top to bottom. At each position, every scanned value is multiplied by the corresponding value of the kernel matrix defined at the beginning, the products are summed, and the result is passed to the activation function. The kernel matrix determines which features are extracted from the input data. It has dimension F*F*D, where D is the depth of the kernel and F is the edge length of the kernel; F is usually an odd number. For example, applying three kernels yields three distinct features from the input image matrix. In the figure below, convolving the image matrix with the filter matrix enhances the X feature, because the 1's in the filter matrix form an X shape.
Figure 2.6: Convolution example
The result of the convolution is a feature map, as demonstrated in the picture below:
Figure 2.7: The 3*3 filter matrix applied to the 5*5 image matrix creates a 3*3 feature map
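The sliding multiply-and-sum described above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions: a single-channel square image (kernel depth D = 1), no activation function, and an X-shaped image and filter chosen for the example rather than taken from the figures.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image and sum the products.

    Assumes a single channel (depth D = 1) and no padding.
    """
    f = len(kernel)
    out_size = (len(image) - f) // stride + 1
    feature_map = []
    for i in range(out_size):          # top to bottom
        row = []
        for j in range(out_size):      # left to right
            total = 0
            for u in range(f):
                for v in range(f):
                    total += image[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(total)
        feature_map.append(row)
    return feature_map


# A 5*5 image containing an X shape and a 3*3 filter whose 1's form an X.
IMAGE = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
X_FILTER = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

if __name__ == "__main__":
    for row in convolve2d(IMAGE, X_FILTER):
        print(row)
    # prints [3, 0, 3], [0, 5, 0], [3, 0, 3]
```

The largest response (5) appears at the centre of the 3*3 feature map, where the filter lines up exactly with the X in the image.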
Stride and Padding
Stride is the distance between two consecutive kernel positions when scanning the image matrix. For example, with a stride of 1 the kernel moves to the adjacent cell, whereas with a stride of 2 the kernel moves two cells at a time. The following images illustrate the meaning of stride more clearly.
Figure 2.8: The matrix after padding by 1
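The way kernel size, stride, and zero padding determine the size of the feature map can be sketched with the commonly used formula floor((n + 2p - f) / s) + 1, where n is the input edge, f the kernel edge, p the padding, and s the stride. The helper names output_size and zero_pad are hypothetical and introduced only for this illustration.

```python
def output_size(n, f, stride=1, padding=0):
    """Edge length of the feature map for an n*n input and an f*f kernel."""
    return (n + 2 * padding - f) // stride + 1


def zero_pad(matrix, padding):
    """Surround a 2D list with `padding` rings of zero-valued cells."""
    width = len(matrix[0]) + 2 * padding
    top = [[0] * width for _ in range(padding)]
    bottom = [[0] * width for _ in range(padding)]
    body = [[0] * padding + list(row) + [0] * padding for row in matrix]
    return top + body + bottom


if __name__ == "__main__":
    print(output_size(5, 3))                        # 3: no padding, stride 1
    print(output_size(5, 3, stride=1, padding=1))   # 5: padding keeps the size
    print(output_size(5, 3, stride=2, padding=1))   # 3: a larger stride shrinks it
    print(zero_pad([[1, 2], [3, 4]], 1))
```

With a 3*3 kernel and stride 1, a padding of 1 is exactly what keeps a 5*5 input at 5*5.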
The stride and the kernel size are inversely related to the size of the feature map: the larger the stride and the kernel, the smaller the output feature map. To maintain the feature map at its initial size, padding adds a layer of cells around the matrix. For instance, with padding set to one, one ring of zero-valued cells is added around the input to expand it, as in the figures below.
Figure 2.9: The matrix with padding 1 and stride 1, and with padding 1 and stride 2
Pooling layer
The aim of the pooling layer is to reduce the size of the data produced by the convolutional layer while retaining its main features, which reduces the computation in the model and helps avoid over-fitting. There are three types of pooling: max, average, and sum pooling. The most popular, and the one used in this thesis, is max pooling, which takes the maximum value in the window. Pooling operates much like convolution: a sliding window scans the feature map produced by the convolutional layer (the input matrix of the pooling layer) and, since this is max pooling, picks the maximum among the values in the window. The figure below illustrates max pooling with a 2 * 2 window and a stride of 2, so that windows do not overlap.
Figure 2.10: Max pooling with stride 2 and kernel 2 * 2
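Max pooling with a 2 * 2 window and a stride of 2 can be sketched as follows. The 4 * 4 feature map is an arbitrary example chosen for illustration, not data from the thesis.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of every size*size window of a 2D list."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + u][j * stride + v]
                for u in range(size)
                for v in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


FEATURE_MAP = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 8],
]

if __name__ == "__main__":
    print(max_pool(FEATURE_MAP))  # prints [[6, 4], [7, 9]]
```

Because the stride equals the window size, each input value is visited exactly once and the 4 * 4 map shrinks to 2 * 2.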
Fully connected
The fully connected part is itself a fully connected artificial neural network. After many convolutional and pooling layers there are usually two fully connected layers, which combine the extracted features and convert the three-dimensional feature volume into a two- or one-dimensional vector. Finally, an activation function such as softmax or sigmoid classifies the outputs into categories. The output layer has a number of neurons that depends on the number of output classes expected. For example, the MNIST handwritten digit database contains the digits 0 to 9, so the output layer in this case has 10 neurons.
Figure 2.11: CNN architecture for recognizing handwritten digits
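The final classification step (flatten, one fully connected layer, softmax over 10 classes) can be sketched as below. The weights are hand-picked placeholders rather than trained values, and the tiny 2 x 2 x 2 feature volume is an assumption made to keep the example readable.

```python
import math


def flatten(feature_maps):
    """Turn a channel x height x width nested list into a flat vector."""
    return [value for channel in feature_maps for row in channel for value in row]


def dense(inputs, weights, biases):
    """One fully connected layer: each output neuron is a weighted sum plus bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical inputs: 2 channels of 2 x 2 features, 10 output neurons (digits 0-9).
FEATURE_MAPS = [[[0, 1], [2, 0]], [[0, 0], [0, 3]]]
WEIGHTS = [[1.0 if i == k % 8 else 0.0 for i in range(8)] for k in range(10)]
BIASES = [0.0] * 10

if __name__ == "__main__":
    probabilities = softmax(dense(flatten(FEATURE_MAPS), WEIGHTS, BIASES))
    predicted = probabilities.index(max(probabilities))
    print(predicted, round(sum(probabilities), 6))
```

The index of the largest probability is taken as the predicted class; with trained weights this index would be the recognized digit.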
2.4 Vietnamese Optical Character Reader
VietOCR, or Vietnamese Optical Character Reader, is a tool provided by Pham Ba Cuong Quoc. It offers two OCR models for the Vietnamese text recognition step: Attention-OCR and Transformer-OCR. Attention-OCR uses attention seq2seq (sequence to sequence), a technique widely used for comprehension problems in Natural Language Processing. Transformer-OCR applies the Transformer architecture, which has brought immeasurable benefits to the Natural Language Processing community, to optical character recognition. The framework also allows users to quickly retrain and fine-tune the model to suit various use cases.
2.4.1 Attention Optical Character Reader
AttentionOCR is a combination of a CNN model and Attention Seq2seq, and it operates similarly to a Machine Translation (MT) model. The attention mechanism is a weighted sum of the features relevant to the problem at hand. For example, take a sample in MT and use the attention mechanism to generate the word little. We need to calculate a context vector C, which is the weighted sum of the vectors h1, h2, h3, h4 that correspond to the words mặt, trời, bé, nhỏ respectively. Each weight is a scalar learned by the model. The picture below depicts the process of generating a machine translation sample [][].
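The context vector C can be sketched as a weighted sum of the four hidden vectors. In this illustration the weights come from a softmax over alignment scores, as is usual in attention seq2seq models; the scores and the two-dimensional hidden vectors are made-up values, not outputs of a trained model.

```python
import math


def softmax(scores):
    """Normalize alignment scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, scores):
    """Return the attention weights and the weighted sum of the hidden states."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Hypothetical hidden vectors h1..h4 for the words mặt, trời, bé, nhỏ.
HIDDEN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

if __name__ == "__main__":
    # Higher scores on h3 and h4 (bé, nhỏ) when generating the word "little".
    weights, context = context_vector(HIDDEN, [0.0, 0.0, 2.0, 2.0])
    print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

With equal scores every word contributes equally and C is simply the mean of h1 to h4; raising the scores of bé and nhỏ pulls C toward h3 and h4.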
Figure 2.12: AttentionOCR architecture
In machine translation, attention takes a text sequence in the source language as input and, by encoding it into a vector, produces text with the same meaning in the target language. Attention-OCR instead takes an image as input. When an image passes through the CNN layers, the output is a set of feature maps with shape channel x height x width, and these feature maps become the input of the LSTM (long short-term memory). The LSTM layer takes input of shape hidden x time_step, so to meet this requirement the last two dimensions of the feature maps are flattened out.
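The flattening step can be sketched as follows: the height and width dimensions are merged into one time axis, so that every time step carries one value per channel. The row-major flattening order and the helper names are assumptions for this illustration and may differ from the VietOCR implementation.

```python
def flatten_spatial(feature_maps):
    """channel x height x width  ->  channel x (height * width)."""
    return [[value for row in channel for value in row] for channel in feature_maps]


def to_time_steps(feature_maps):
    """Return a list of time steps, each holding one value per channel."""
    flat = flatten_spatial(feature_maps)
    steps = len(flat[0])
    return [[channel[t] for channel in flat] for t in range(steps)]


# Hypothetical CNN output: 2 channels, height 2, width 3.
FEATURES = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

if __name__ == "__main__":
    sequence = to_time_steps(FEATURES)
    print(len(sequence), sequence[0], sequence[-1])  # prints 6 [1, 7] [6, 12]
```

Here the 2 x 2 x 3 volume becomes a sequence of 6 time steps with a hidden size of 2, which is the shape a recurrent layer expects.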
2.4.2 Transformer Optical Character Reader
In this model, the author takes full advantage of the transformer architecture by using it in place of the LSTM to predict the output text. The system architecture is shown below.
Figure 2.13: The CNN flattens out the feature maps
Figure 2.14: TransformerOCR architecture
According to the author's measurements, the proposed system reaches a precision of around 88%; it was trained on 10 million images.
Backbone               Config            Precision (full sequence)   Time
VGG19-bn-Transformer   vgg_transformer   0.8800                      86ms on GPU 1080Ti
VGG19-bn-Seq2Seq       vgg_seq2seq       0.8701                      12ms on GPU 1080Ti
Table 2.1: Accuracy as tested by the author of VietOCR
2.5 Tesseract OCR
The most popular framework for optical character recognition is Tesseract, a significant open-source project with long-term support, whose development has been sponsored by Google for more than a decade and which officially provides pre-trained models for more than 100 languages. Tesseract's current recognition engine is based on an LSTM neural network. The existing model data is reported to have been trained on about 400000 text lines spanning about 4500 fonts for the Latin-based languages, which makes it suitable for the Vietnamese language model. However, Tesseract is accurate only on some kinds of document text, and its results depend highly on the document's condition and on the preprocessing step.
The framework offers many arguments to configure Tesseract before use, such as the page segmentation modes (PSM). The engine modes include Legacy and LSTM, since the version we are using is the officially released 4.0.0, which supports the LSTM architecture. Here is the list of configurations we figured out during the early stage of text recognition.
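As an illustration of how such arguments are passed, the sketch below assembles a Tesseract command line with the real --oem, --psm, and -l options. It does not reproduce the configuration chosen in this thesis: the helper name tesseract_command, the file name, and the default values (Vietnamese language data, LSTM engine, PSM 6) are assumptions for the example, and the command is only built, not executed.

```python
def tesseract_command(image_path, lang="vie", oem=1, psm=6):
    """Build the argument list for the tesseract command-line program.

    oem: OCR engine mode, 0 = legacy, 1 = LSTM, 2 = legacy + LSTM, 3 = default.
    psm: page segmentation mode, an integer from 0 to 13.
    """
    if oem not in range(4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    # "stdout" as the output base makes tesseract print the recognized text.
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


if __name__ == "__main__":
    # The image name is a placeholder.
    print(" ".join(tesseract_command("certificate.png")))
```

The resulting list could be handed to subprocess.run on a machine where Tesseract and the Vietnamese language data are installed.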
3.1 Mobile Development
3.1.1 React Native
These days, developing mobile applications is considerably easier than it was a few years ago, since many frameworks and platforms have been released to developers. Most of them are based on JavaScript, a well-liked and multipurpose programming language, and React Native is one of them.
React Native started as the product of a hackathon project that grew out of Facebook's hacker culture in 2013. Compared to ReactJS, React Native at that moment was a remarkably unconventional idea, because it was uncertain whether the framework could actually make JavaScript communicate with native components on mobile while keeping the performance of a native language. On 26 March 2015, the first version of React Native was released in its GitHub repository.
React Native should not be confused with ReactJS, a JavaScript library dedicated to building UIs, which renders UI components in website development and uses the ReactDOM library to render to the screen. React Native is a collection of "special" React components, and it knows how to compile these components to the corresponding native widgets, for Android or iOS in this case. Besides that, it also provides native platform APIs that developers use to access features of the specific operating system, such as the camera, calendar, settings, etc. [5]
3.1.1.1 React Native Architecture
React Native has one main difference compared to Cordova-based apps: it runs using native views instead of a webview renderer. It has direct access to all the Android and iOS APIs and offers a common preset of tools built on those APIs. Thus, it is simple to build a variety of apps and obtain the performance of a native application at the same time.
CHAPTER 3 THEORETICAL FRAMEWORK
Figure 3.1: React Native Architecture
A React Native project can be split into two central parts: the component part and the logic part. The component part is a preset of components that are compiled and rendered directly to native views instead of just a webview as in Cordova [6]. The logic part is the JavaScript code, which can tap into native platform features such as the camera, as used in my product. This is possible because the JavaScript runs on JSC (JavaScriptCore), a virtual machine hosted by React Native inside the real device. The JavaScript virtual machine communicates with the native features of the application through a particular component called the Bridge. This component is built in C++ and C and exposes a C++ API to the JavaScript side, so what ultimately runs on the real device is C++ code, not JavaScript code. This stage is the important hidden piece of React Native, because C++ is the language shared by both native platforms, namely Android and iOS. As its name suggests, the Bridge is the key element that makes cross-platform development a reality; it is in charge of handling non-blocking asynchronous commands from both sides.
Chaining all the elements together, we have two different programming languages talking to each other at the endpoints of the application; even though they are not the same language, they still work together efficiently.
The sections below discuss the benefits and drawbacks of this framework.
3.1.1.2 Advantages
React Native has many advantages; the most remarkable are listed below:
• Faster to Build: When the same feature is built in an application written in Swift and in another written in React Native, the React Native version takes significantly less time, a saving of 33% compared to Swift.
• Cross-Platform Usage: "Learn once, write anywhere": one codebase can be compiled into versions for the native platforms without needing to change anything.
• Class Performance: The View components are compiled into native code and no external renderer is required, so UI performance is very high, allowing a smooth experience in use.
• JavaScript: It uses JavaScript, which is a common and popular programming language.
• Community: It has a growing community that provides many presets and tools to help in developing your apps.
• Hot Reloading: When you change the code during development, the changes can be rendered and displayed right away on your device, lessening the time spent debugging.
3.1.1.3 Disadvantages
However, there are still some negative sides:
• Compatibility and debugging problems: Developers who are new to React Native may get confused and spend a lot of time configuring package compatibility and the debugging tools provided by React Native.
• Limited custom modules: Developers sometimes need to build their own solution from scratch, and the finished component ends up with three codebases: the React Native, iOS, and Android parts.
• Native developers still needed: This framework cannot completely replace the native languages in a large project, because some features are native-only.
However, this framework's community is growing rapidly day by day, and its members support each other in resolving issues and releasing new features.
3.2 Back-end Programming Language
3.2.1 Python
Python is a high-level programming language created by Guido van Rossum and first released in 1991, and it achieved very early success in many commercial software applications.
Figure 3.2: Machine Learning libraries in Python
Python is a multi-paradigm programming language: it supports different programming approaches. One of the popular approaches to solving a programming problem is creating objects. It also has many other strengths:
• Versatile, easy to use, and fast to develop with
• Open source with a vibrant community
• Suitable for machine learning problems, because many algorithms can be run with a few lines of code. This gives programmers more time to improve the algorithm instead of managing memory allocation
• A vast range of support and a variety of machine learning libraries such as NumPy, Pandas, Sklearn, TensorFlow, and Keras
3.3 Network Architecture
3.3.1 Flask
Flask is a lightweight Web Server Gateway Interface (WSGI) micro web application framework. It is written in Python, has minimal dependencies on external libraries, was designed to be quick and easy to get started with, and can scale up to complex applications. Flask can be installed with pip, the package manager for Python, by running: pip install Flask
Moreover, Flask has many impressive features:
• Built-in development server and fast debugger
• Integrated support for unit testing
• RESTful request dispatching
• Jinja2 templating
• Support for secure cookies on the client side
• WSGI 1.0 compliant
• Unicode based
• Google App Engine compatibility
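To show what WSGI compliance means in practice, here is a minimal sketch of a WSGI application written with the Python standard library only. Flask itself is deliberately not used: the example illustrates the interface that Flask implements and wraps with routing and request objects. The /health route, the JSON payloads, and the call_app helper are hypothetical names chosen for this illustration.

```python
import json
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI app: a callable that receives the request environ and a callback."""
    if environ["REQUEST_METHOD"] == "GET" and environ["PATH_INFO"] == "/health":
        status, payload = "200 OK", {"status": "ok"}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, method, path):
    """Invoke a WSGI app in-process, the way a unit test would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["REQUEST_METHOD"] = method
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


if __name__ == "__main__":
    print(call_app(application, "GET", "/health"))
    print(call_app(application, "GET", "/missing"))
```

Because the application is just a callable, it can be exercised without starting a server, which is the same idea behind Flask's integrated support for unit testing.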
The following part covers the pros and cons of Flask:
3.3.1.1 Advantages
• Simpler development, as Python is well known
• Flexibility, as you build from scratch with no hindrance
• Because there are fewer levels of abstraction, performance is inherently better from the start
• Modularity allows multiple Flask applications and servers to be built for different purposes and distributed across the network
• Good for machine learning API purposes
3.3.1.2 Disadvantages
• Because of its simplicity and minimalism, Flask does not enforce a standard project structure
• For big projects, using Flask can be quite time consuming
• Flask does not have an asynchronous architecture
• It does not have a full toolset; instead, users must seek out extensions and libraries from the community
Data transfer between two or more systems has always been a top-priority requirement in software development. That is where Representational State Transfer (REST) comes into play; it is now probably the most used web service technology. REST is a method for two systems to talk over HTTP, similar to the way a web browser and a server do.
This architecture is described by five design constraints that, when followed, allow the product to achieve many specific properties, including performance, simplicity, and reliability. They are listed below:
• Client-Server: The client and server work independently of each other.
• Stateless: The server does not record the state of the client side.
• Cacheable: The server marks whether data is cacheable.
• Uniform interface: The client and server interact in a uniform and predictable way. An essential aspect of this is that the server exposes resources.
• Layered system: The application behaves the same regardless of any intermediaries between the client and server.
Overall, REST is one of the most popular design styles for web APIs
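Some of these constraints can be made concrete with a small sketch of a stateless, resource-oriented request handler. It is written in plain Python without a web framework; the /certificates resource, its fields, and the handle_request helper are hypothetical and are not the API of the application described in this thesis.

```python
def handle_request(method, path, store):
    """Map an HTTP-style request to (status, headers, body).

    Stateless: everything needed is in the arguments; no per-client session is kept.
    Uniform interface: resources are addressed as /certificates/<id>.
    Cacheable: responses are marked through the Cache-Control header.
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) != 2 or parts[0] != "certificates":
        return 404, {}, {"error": "unknown resource"}
    cert_id = parts[1]
    if method == "GET":
        if cert_id not in store:
            return 404, {}, {"error": "not found"}
        return 200, {"Cache-Control": "max-age=60"}, store[cert_id]
    if method == "DELETE":
        if store.pop(cert_id, None) is None:
            return 404, {}, {"error": "not found"}
        return 204, {"Cache-Control": "no-store"}, None
    return 405, {"Allow": "GET, DELETE"}, {"error": "method not allowed"}


if __name__ == "__main__":
    # Placeholder data for the illustration.
    store = {"42": {"id": "42", "district": "example"}}
    print(handle_request("GET", "/certificates/42", store))
    print(handle_request("DELETE", "/certificates/42", store))
    print(handle_request("GET", "/certificates/42", store))
```

The same resource path is used for every operation and only the HTTP method changes, which is what makes the interface uniform and predictable.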
3.3.3 Client-Server Model
Figure 3.3: Client-Server Model
The Client-Server model is a term used to describe two or more computer systems connected via the internet, one of which is a server providing services to each connected client; each computer or process on the network is considered either a client or a server. For example, storage servers, web servers, Gmail servers, and file sharing servers provide resources to clients, usually following the scheme in which one server serves multiple clients simultaneously. In particular, when the server is operated for commercial purposes, it can also be considered a hosting service.
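The idea of one server providing a service to several clients can be sketched with the Python standard library. This is a toy example under stated assumptions: the "service" simply returns the received line in upper case, everything runs on the local machine (127.0.0.1) on a port chosen by the operating system, and the names UpperCaseHandler, ask_server, and run_demo are introduced only for this illustration.

```python
import socket
import socketserver
import threading


class UpperCaseHandler(socketserver.StreamRequestHandler):
    """Server side: read one line from a client and send it back in upper case."""

    def handle(self):
        line = self.rfile.readline()
        self.wfile.write(line.upper())


def ask_server(address, message):
    """Client side: open a connection, send one request, return the reply."""
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall(message.encode("utf-8") + b"\n")
        with conn.makefile("rb") as reader:
            return reader.readline().decode("utf-8").rstrip("\n")


def run_demo(messages):
    """Start one server and let one client connection per message talk to it."""
    # Port 0 asks the operating system for any free port.
    with socketserver.ThreadingTCPServer(("127.0.0.1", 0), UpperCaseHandler) as server:
        worker = threading.Thread(target=server.serve_forever, daemon=True)
        worker.start()
        try:
            return [ask_server(server.server_address, m) for m in messages]
        finally:
            server.shutdown()


if __name__ == "__main__":
    print(run_demo(["scan", "map"]))
```

ThreadingTCPServer handles each connection in its own thread, which mirrors the scheme of one server serving multiple clients.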
Deploying this model provides opportunities to maintain services and quality. Moreover, server computing can effectively increase an organization's productivity and cost-effectiveness, and offers an improved user interface, enhanced data storage, vast connectivity, and reliable application services. Here is a list of the pros and cons of this infrastructure.
3.3.3.1 Advantages
• Centralized: Since all data is stored on the server, it can serve as a backup for the clients
• Security: All connected clients are monitored on the same computer, so shared resources are easier to check and manage, which helps prevent viruses or harmful files from spreading to other servers
• Performance: With a dedicated server, sending and receiving data is noticeably quicker
• Scalability: To make the system provide more services, it is only necessary to enlarge the number
3.3.3.2 Disadvantages
• Cost: Even a dedicated server usually needs robust hardware and a reliable internet connection to serve a large number of clients and ensure the quality of service
• Overload: When there are frequent simultaneous client requests, the server can become severely overloaded, causing traffic congestion
Since the model is centralized, if a critical server breaks down, client requests cannot be fulfilled. Therefore, the client-server model lacks the robustness of a good network.