
Computer Science Graduation Thesis: Building A Diagram Recognition Problem with Machine Vision Approach



VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

Tran Hoang Thinh - 1752516


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY | SOCIALIST REPUBLIC OF VIETNAM - Independence - Freedom - Happiness

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY: Computer Science & Engineering | GRADUATION THESIS ASSIGNMENT

DEPARTMENT: Computer Science | Note: the student must attach this sheet to the first page of the thesis report

FULL NAME: Tran Hoang Thinh - STUDENT ID: 1752516 | MAJOR: _ | CLASS: _

1. Thesis title:

Building A Diagram Recognition Problem with Machine Vision Approach

2. Thesis tasks (required content and initial data):

- Investigate approaches to the diagram recognition problem
- Research machine learning approaches for the problem
- Prepare data for the problem
- Propose and implement the diagram recognition system
- Evaluate the proposed model

3. Assignment date: 1/3/2021  4. Completion date: 30/6/2021

5. Supervisor(s) and assigned parts: 1) Nguyen Duc Dung  2) _  3) _

The content and requirements of the thesis have been approved by the department.

Date ... month ... year ...

HEAD OF DEPARTMENT - MAIN SUPERVISOR

SECTION FOR THE FACULTY AND DEPARTMENT:

Preliminary reviewer: _  Unit: _  Defense date: _  Final grade: _  Thesis archive location: _


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY | SOCIALIST REPUBLIC OF VIETNAM

6. Main strengths of the thesis:

The team has successfully proposed the diagram recognition system. They built the initial dataset and performed the labeling of the data for this task. The team utilized their knowledge of computer vision and machine learning to propose a suitable approach to this problem. The evaluation results are promising.

7. Main shortcomings of the thesis:

The dataset they built is still small, and the number of components that the model can recognize is also limited. Despite obtaining high accuracy, the team has not performed experiments under real conditions, i.e. images captured with shadows, low contrast, thin sketches, etc.

8. Recommendation: Approved for defense [ ]  Revise and resubmit for defense [ ]  Not approved for defense [ ]

9. Three questions the student must answer before the committee:

a. _  b. _  c. _

10. Overall assessment (in words: excellent, good, average): Excellent. Grade: 9/10. Signature (full name):

Nguyễn Đức Dũng


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY | SOCIALIST REPUBLIC OF VIETNAM

STUDENT ID: 1752516. Major (specialization): Computer Science

2. Thesis: "Building A Diagram Recognition Problem with Machine Vision Approach"

3. Reviewer: Nguyen An Khuong

4. Overview of the report:

Number of tables: 4. Number of figures: 18. Number of references: 53. Computational software: _. Artifacts (products): _

and Algorithms 3,4 for diagram building.

• The thesis uses the Mask R-CNN model and its variant, Keypoint R-CNN, with some improvements and augmentation to solve the offline diagram recognition task with rather high accuracy (~90%) and acceptable performance (< 2 s per diagram).

7. Main shortcomings of the thesis:

• The thesis is not well written and is too short.

• The contributions of the author are not presented in a clear manner.

8. Recommendation: Approved for defense [ ]  Revise and resubmit for defense [ ]  Not approved for defense [ ]

9. Questions the student must answer before the committee:

a. Is there any commercial or prototype app/software that solves this problem or similar ones? If YES, can you give some comments and remarks benchmarking your work against those?

b. Arrow keypoints often seem to coincide with one border of the bounding box, so how should we reduce the overlap between the arrow keypoint detection and bounding box detection tasks?

10. Overall assessment (in words: excellent, good, average): Excellent. Grade: 9/10

Signature (full name):

Nguyễn An Khương


We hereby undertake that this is our own research project, conducted under the guidance of Dr. Nguyen. The research content and results are truthful and have never been published before. The data used for the analysis and comments were collected by us from many different sources and will be clearly stated in the references.

In addition, we also use several reviews and figures from other authors and organizations. All have citations and origins.

If any fraud is detected, we take full responsibility for the content of our graduate internship. Ho Chi Minh City University of Technology is not related to the copyright infringement caused by us in the implementation process.

Best regards,
Tran Hoang Thinh


First and foremost, we would like to express our sincere gratitude to our advisor, Dr. Nguyen Duc Dung, for his support of our thesis and for his patience, enthusiasm, experience, and knowledge. He shared the experience and knowledge that helped us in our research and in producing a good thesis. We also want to thank Dr. Nguyen An Khuong and Dr. Le Thanh Sach for their support in reviewing our thesis proposal and thesis. Finally, we would like to show our appreciation to the Computer Science and Engineering Faculty and Ho Chi Minh City University of Technology for providing the academic environment for us to become what we are today.

Best regards,
Tran Hoang Thinh


Diagrams have been among the most effective illustration tools for demonstrating and sharing ideas and suggestions. Besides text and images, drawing flow charts is the best way to give others a clearer view of a plan with the least amount of work. Nowadays, many meetings require a blackboard so everyone can express their thoughts. This raises a problem with saving these drawings as a reference for future use, since taking a picture cannot solve the problem of re-editing these ideas, and they need to be redrawn to be suitable for professional documents. On the other hand, digitizing the chart requires redrawing the entire diagram using a computer or a special device such as a drawing board or digital pen, which costs a lot and is not the most convenient tool to use.

Therefore, it is necessary to find a way to convert traditional hand-drawn diagrams into a digital version, simplifying the sharing process between users. Moreover, the digitized diagram also helps the user modify it and convert it to other forms that satisfy their requirements. This thesis focuses on stating the problem of digitizing diagrams and proposing a solution.


2.2 Diagram recognition 6

3 Background 7
3.1 Faster R-CNN 7

3.2.2 Feature Pyramid Network 12

3.2.3 Region of Interest Align 13

3.3 Keypoint R-CNN 15

4 Proposed method 17
4.1 Scope of the thesis 17


5.2.1 Perform training and inference without keypoints 40

5.2.2 Perform training and inference with keypoints 40

5.2.3 Building diagram structure from predictions 42

6 Conclusion 46
6.1 Summary 46

6.2 Challenges 46

6.3 Future works 46


List of Figures

3.1 ResNet50 model, from [1] 8

3.2 Non-Maximum Suppression, from [2] 10

3.3 MaskRCNN model, from [3] 12

3.4 Mask Sample, the pink colored pixels are for the object 12

3.5 Feature Pyramid Network, from [4] 13

3.6 RoIPooling in Faster R-CNN 14

3.7 RoIAlign layer used in Mask R-CNN 15

4.1 Sample of an entry in DiDi dataset 21

4.2 Python code to save a drawing as PNG image 21

4.3 A sample with its labels and the JSON label information 23

4.4 Sample drawing with bounding boxes 25

4.5 Pipeline of the model 26

4.6 Feature Pyramid Network with ResNet, from [5] 27

4.7 A drawing with its predictions 30

4.8 Structure sample 32

4.9 Model fails to detect intersected arrows 33

4.10 Example when Euclidean distance does not work 34

4.11 Sample for Weighted Euclidean 34

5.1 Sample prediction 40

5.2 Sample prediction with rotated input 41

5.3 Loss over iteration of proposed model without keypoints 41

5.4 Sample diagram without text 42

5.5 Sample diagram with text 43

5.6 Loss over iteration of proposed model with keypoints 43

5.7 Drawing without predictions at 60% score 44

5.8 Example of impossibility in prediction 45

5.9 Sample output result 45


List of Algorithms

1 Non-Maximum Suppression 10

2 DiDi image generation 20

3 COCO Format Generation 24

4 Improved Non-Maximum Suppression 29

5 Arrow Refinement 33

6 Weighted Euclidean for Symbol-Arrow relationship 36


Chapter 1: Introduction

Compared to a few decades ago, artificial intelligence (AI) has developed faster than one could imagine. Tracing back to the 90s, right after the second "AI Winter" ended, there were numerous advances in which computers achieved milestones once believed to be impossible. In 1994, Chinook [6], a checkers (English draughts) engine, won the United States tournament by an enormous margin. It beat the second-best player, Don Lafferty, while making Marion Tinsley, the best at the time, withdraw in the middle of the game. 1997, on the other hand, is the year that changed the history of chess forever, when IBM's Deep Blue [7] chess machine defeated Grandmaster Garry Kasparov with a score of 3½ to 2½. In the same year, Logistello [8] beat the world champion, Takeshi Murakami, with an overwhelming score of six to zero. Nowadays, AI can be seen everywhere in modern life, from work-related examples like email spam filters and virtual assistants to the entertainment industry with recommendation systems, chat and gaming bots, and voice and text recognition. AlphaZero [9], developed by Google DeepMind, defeated the reigning champion, Stockfish, in a one-sided match with 28 wins, 72 draws, and zero losses. Another project, AlphaGo [10], beat the champion Lee Sedol 4 - 1, marking the first time in the history of artificial intelligence that a computer had beaten a human in Go.

Within the area of computer vision, a subset of artificial intelligence that deals with the science of enabling computers or engines to visualize images, a smaller section deals with the ability to detect objects, for example, humans, animals, furniture, etc. Recently, there have been many applications that can help deal with this task. Google Lens [11] is an image recognition technology developed by Google, which can detect objects, text, bar codes, QR codes, and math equations, and "google" for related results or information using a two-step detector combined with Long Short-Term Memory (LSTM). Microsoft Math [12] can use natural language processing (NLP) to recognize and solve math problems step by step. Other contributions include Vivino [13], which can scan and detect wines, and Screenshop [14], which builds a shopping catalog from an image and gives recommendations.

Regarding diagram recognition, since the introduction of the Online Handwritten Flowchart Dataset [15] in 2011, there have been numerous attempts at digitizing diagrams. Mainstream ideas are divided into two main approaches: online diagram recognition and offline diagram recognition. In online diagram recognition, the user continuously draws a diagram on a device with a touchscreen, such as a tablet or a smartphone, using a pen or finger. Meanwhile, the program captures the input as a sequence of strokes. These are later used to detect objects and the relations between them. Alternatively, the input of offline diagram recognition is a raw image from a source such as a phone camera. The input is broken down into a set of features, and these features are then used to visualize the objects. Recently, there has been more attention on online recognition, as it is more flexible than its counterpart in both precision and real-time constraints. However, in many real-life situations such as meetings or conferences, when the discussion between people is displayed on a blackboard or on paper, online methods, although possible, are not preferred. One would capture an image of the blackboard and use offline methods to digitize the diagram.

This project develops a model that can perform offline diagram recognition and digitize the diagram in a suitable format. It will not stop with the model and algorithm, but will be developed into an application that can serve actual clients to solve real-life problems.

The report is organized as follows:

• Chapter 2 briefly surveys the application of object detection in real life, related work on object detection in general, and flowchart detection in particular.

• Chapter 3 provides sufficient knowledge to understand the project.

• Chapter 4 shows our proposed system, including how the application works.

• Chapter 5 lists our experiments and results.

• Chapter 6 summarizes what we have done, along with challenges and future work.

The target of this thesis is to build a model that can convert preprocessed diagram images into a reasonable and understandable structure. This structure can later serve as part of an application that helps solve many real-life problems.


2.1.2 Traditional detector

Most of the early object detection algorithms were built on manually crafted features with multiple complex models. Due to the lack of computational resources and large image sizes, numerous speed-up methods were required.

In 2001, P. Viola and M. Jones achieved real-time face detection using sliding windows [16, 17]. The algorithm goes through all possible locations and scales in an image to find human faces. It speeds up the computational process with three important techniques. First, the integral image speeds up box filtering or the convolution process, using Haar wavelets as the feature representation of the image. Second, feature selection uses the AdaBoost algorithm to select a small set of features from a huge pool of features. Third, a multi-stage detection paradigm reduces computation by spending less time on the background than on likely face locations. Although the algorithm is simple, it stretched the computational power of computers at that time.

The Histogram of Oriented Gradients (HOG) was created in 2005 by N. Dalal and B. Triggs [18]. It is designed to be computed on a grid of equal cells and uses overlapping local contrast normalization to improve accuracy. To detect objects of different sizes, HOG resizes the input image multiple times to match the detection window size. It has been an important foundation for many object detectors and a large variety of computer vision applications [19, 20].


RELATED WORKS

2.1.3 CNN-based Detector

As traditional methods showed their disadvantages by becoming progressively more complex while progress slowed down, researchers tried to find alternatives to increase accuracy and performance. In 2012, Krizhevsky et al. brought back the age of the Convolutional Neural Network with a paper on object classification on ImageNet [21]. As a DCNN can classify an image based on its feature set, subsequent papers showed interest in applying the newly rediscovered method to object detection. Over the past decades, multiple object detection models have been proposed and studied to improve detection accuracy, such as LeNet-5 [22], AlexNet [23], VGG-16 [24], Inception [25, 26], ResNet [27], etc. Studies have also discovered techniques that improve the training process and prevent overfitting, for example, dropout [23], Auto-Encoders [28], Restricted Boltzmann Machines (RBM) [29], Data Augmentation [30], and Batch Normalization [31].

There are two main groups of CNN-based detection: "two-step detection" and "one-step detection". In the first group, the image is examined to generate proposals, and these proposals are delivered to another network for classification and regression. In the second group, objects are recognized and classified directly within one network model.

2.1.3.1 CNN-based Two-Stage Detection (Region Proposal based)

Released in 2014 by Girshick et al. [32, 33], R-CNN was the first attempt to build a Convolutional Neural Network for object detection. The idea of R-CNN is divided into three main stages:

• Proposals are generated using selective search.

• Proposals are resized to a fixed resolution. These proposals are then fed into the CNN model to extract the feature map.

• The feature map is classified using SVMs for multiple classes to deduce the final bounding box.

Despite having certain advantages over traditional methods and bringing CNNs back to practical use, R-CNN has some fatal disadvantages. Training has multiple stages, and feature maps are stored separately, thus increasing time and space complexity. Moreover, the number of overlapping proposals is large (over 2000 proposals per image). The CNN model also requires a fixed-size image, so any input must be resized, and on certain occasions the object gets cropped, creating severe distortions.

Later in the same year, He et al. introduced a novel CNN architecture named SPP-Net [34] using Spatial Pyramid Matching (SPM) [35, 36]. The Convolutional Neural Network is combined with a Spatial Pyramid Pooling (SPP) layer, which enables the generation of a fixed-length feature representation without scaling the input image. The model removes the proposal overlapping and the need to resize the image; however, it still requires multi-step training, including feature extraction, network fine-tuning, SVM training, and bounding-box regression. Additionally, the convolution layers before the SPP layer cannot be modified with the algorithm shown in [34].

In 2015, Girshick proposed Fast R-CNN [37], a model with the ability to perform multi-task classification and bounding-box regression within the same network. Similar to SPP-Net, the whole image is processed with convolution layers to produce feature maps. Then, a fixed-length feature vector is extracted from each region proposal with a region of interest (RoI) pooling layer. Each feature vector is then fed into a sequence of fully connected layers before branching into two outputs: one used for the classifier and the other encoding the bounding-box location. Apart from region proposal generation, the training of all network layers can be processed in a single stage, saving the extra cost of storage.


In the same year, Ren et al. introduced Faster R-CNN, a method that optimizes Fast R-CNN further by replacing the proposal generation via selective search with a similar network called the Region Proposal Network (RPN) [38]. It is a fully convolutional network that can predict object bounding boxes and scores at each position at the same time. With the proposal of Faster R-CNN, region-proposal-based CNN architectures for object detection can be trained in an end-to-end way. However, the alternating training algorithm is time-consuming, and the RPN does not perform well when dealing with objects of extreme scales or shapes. As a result, multiple adjustments have been made. Some noticeable improvements are the Region-based Fully Convolutional Network (R-FCN) [39], the Feature Pyramid Network (FPN) [4], Mask R-CNN [40], and its variant, Keypoint R-CNN. We will look at the details of these methods in Chapter 3.

2.1.3.2 CNN-based One-Stage Detection (Regression/Classification based)

Region-proposal-based frameworks are composed of several correlated stages, including region proposal generation, feature extraction, classification, and bounding-box regression. Even in Faster R-CNN and its variants, training the parameters is still required between the Region Proposal Network and the detection network. As a result, achieving real-time detection with two-stage detection is a big challenge. One-stage detection, on the other hand, deals with the image directly by mapping image pixels to bounding-box coordinates and class probabilities.

• You Only Look Once (YOLO): YOLO [41] was proposed by J. Redmon et al. in 2015 as the first entry in the one-stage detection era. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. YOLO consists of 24 convolution layers and 2 FC layers, of which some convolution layers construct ensembles of inception modules with 1 × 1 reduction layers followed by 3 × 3 conv layers. Furthermore, YOLO produces fewer false positives in the background, which makes cooperation with Fast R-CNN possible. The improved versions YOLO v2, v3, and v4 were later proposed, adopting several impressive strategies such as BN, anchor boxes, dimension clusters, and multi-scale training [42, 43, 44].

• Single Shot MultiBox Detector (SSD): SSD [45] was proposed by W. Liu et al. in 2016 as the second entry among one-stage detectors. SSD introduces multi-reference and multi-resolution detection techniques that significantly improve detection accuracy. The main difference between SSD and other detectors is that SSD detects objects of different scales on different layers of the network rather than detecting them all at the final layer.

• RetinaNet: RetinaNet [46] uses a Feature Pyramid Network (FPN) with a CNN-based backbone. The FPN adds top-level feature maps to the feature maps below them before making predictions. It involves upscaling the top-level map, matching the dimensionality of the map below using a 1x1 convolution, and performing element-wise addition of both. RetinaNet achieves results comparable to two-stage detection while maintaining higher speed.

• Refinement Neural Network for Object Detection (RefineDet): RefineDet [47] is based on a feed-forward convolutional network similar to SSD. It produces a fixed number of bounding boxes and scores indicating different classes of objects in those boxes, followed by non-maximum suppression to produce the final result. RefineDet is composed of two inter-connected modules:

– Anchor Refinement Module (ARM): Removes negative anchors and adjusts the locations/sizes of anchors to initialize the regressor.

– Object Detection Module (ODM): Performs regression on object locations and predicts multi-class labels based on the refined anchors.


There are three core components in RefineDet: the Transfer Connection Block (TCB) converts the features from the ARM to the ODM; Two-Step Cascaded Regression conducts regression on the locations and sizes of objects; and Negative Anchor Filtering rejects well-classified negative anchors and reduces the imbalance problem.

2.2 Diagram recognition

In general, diagram recognition can be grouped into two smaller areas: online diagram recognition and offline diagram recognition. In online recognition, the model is typically an RNN that recognizes each stroke and generates candidate matches.

Valois et al. [48] proposed a method for recognizing electrical diagrams. Each set of ink strokes is detected as a match with a corresponding confidence factor using probabilistic normalization functions. The disadvantage of the model is the simplicity of the system and its low accuracy, preventing it from being used in real situations. Feng et al. [49] used a more modern technique for detecting electrical circuits. Symbol hypotheses are generated and classified using a Hidden Markov Model (HMM) and traced with 2D-DP. However, it has a drawback of complexity when the diagram and the number of hypotheses are immense, making it impractical for real-life cases. ChemInk [50], a system for detecting chemical formula sketches, categorizes strokes into elements and the bonds between them. The final joint inference is performed using conditional random fields (CRF), which combine features from a three-layer hierarchy: ink points, segments, and candidate symbols. Qi et al. [51] used a similar approach to recognize diagram structure with Bayesian CRF - ARD. These methods outperform traditional techniques; however, by using pairwise terms at the final layer, it is harder to combine features for future adaptations. Coming to flowchart recognition, after the release of the Online Handwritten Flowchart Dataset (OHFCD), multiple studies tackled this dataset. Lemaitre et al. [52] proposed DMOS (Description and MOdification of the Segmentation) for online flowchart recognition. Wang et al. [53] used a max-margin Markov Random Field to perform segmentation and recognition. In [54] they extended their work by adding a grammatical description that combines the labeled isolated strokes while ensuring global consistency of the recognition. Bresler et al. proposed a pipeline model in which they separate strokes and text using a text/non-text classifier, then detect symbol candidates using a max-sum model over groups of temporally and spatially close strokes. The authors also propose an offline extension that uses a preprocessing model to reconstruct the strokes from the flowchart [55, 56].

While online flowchart recognition detects candidates based on ink strokes, offline flowchart recognition performs object detection on an image from the user. It is possible to reconstruct online strokes from offline data [57]; however, that preprocessing step is not necessary because we can recognize the whole diagram structure independently of strokes. As online recognition attracts more researchers, there have not been many studies on offline detection. A. Bhattacharya et al. [58] used morphological and binary mapping to detect electrical circuits. Although it can work on a smaller scale, binary mapping cannot detect curved or zig-zag lines. Julca-Aguilar and Hirata proposed a method using Faster R-CNN to detect candidates and evaluated its accuracy on OHFCD. The model can detect components in the diagram, including arrows; however, it cannot detect the arrowhead.


Chapter 3: Background

In this chapter, we provide the basic knowledge of the techniques used in our project's system. This knowledge is based on our surveys in Chapter 2 and will be used in Chapter 4. We will summarize three models in order: Section 3.1 covers Faster R-CNN, used in object detection; Section 3.2 introduces Mask R-CNN, a descendant of Faster R-CNN that adds an object segmentation task; finally, Section 3.3 shows Keypoint R-CNN, a variation of Mask R-CNN, which is important in this project.

3.1 Faster R-CNN

Introduced in Chapter 2, Faster R-CNN is an object detection model that extends Fast R-CNN. Faster R-CNN replaces the old looping method with a new sub-model called the Region Proposal Network (RPN). As a result, a Faster R-CNN model consists of three main components: the backbone network, the RPN, and the regression-classification layers. Because we do not use the Faster R-CNN model itself in this project, we will briefly discuss the first two components; the last one will be mentioned in Section 3.2.

3.1.1 Backbone network

The backbone Convolutional Neural Network is an important part of the algorithm. It plays the role of a feature extractor, which encodes the image input and returns a feature map. The better the backbone is, the better the result the model will achieve.

Figure 3.1 shows the pipeline of ResNet50, one example of a Residual Network [27]. The Residual Network inherits the idea of stacking many convolution layers to create a model. It consists of two main blocks: the convolution block and the identity block. These blocks have similar structures: they both have two paths, one going through several convolutional layers and one going through a shortcut path. The main path follows three steps:

1. Conv layer, kernel size 1*1, with Batch Normalization and ReLU.

2. Conv layer, kernel size k*k, with Batch Normalization and ReLU. In Figure 3.1, k = 3.

3. Conv layer, kernel size 1*1, with Batch Normalization.

For the shortcut path, while the convolution block contains one conv layer with Batch Normalization, the identity block simply uses the original input. The results of both paths are then added together, followed by a ReLU activation function. The number of layers counted for both blocks is three, since the conv layer in the shortcut path of the convolution block is redundant.


Figure 3.1: ResNet50 model, from [1]

Most ResNet models have the same structure; the only difference between them is the number of convolution and identity blocks. In general, the input is an image resized to 224 by 224 with three channels (RGB). It then goes through a conv layer with kernel size 7*7 with Batch Normalization and ReLU activation, then a 3x3 MaxPool. After that, it goes through a series of convolution and identity blocks before a global average pool with a softmax function returns the output. In Figure 3.1, we can see that the model contains four convolution blocks and twelve identity blocks. Hence, the total number of layers is (4 + 12) * 3 + 2 = 50.
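The layer count above can be checked with a one-line helper (a sketch; the function name is ours, not from the thesis):

```python
# Sketch: count ResNet layers as described above.  Each convolution or
# identity block contributes 3 conv layers on its main path; the initial
# 7*7 conv and the final FC layer add 2 more.
def resnet_layer_count(conv_blocks: int, identity_blocks: int) -> int:
    return 3 * (conv_blocks + identity_blocks) + 2

print(resnet_layer_count(4, 12))  # 50, hence the name ResNet50
```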

3.1.2 Region Proposal Network

The Region Proposal Network (RPN) is an addition in Faster R-CNN. It is a sub-model whose role is to generate proposals from input feature maps and use these proposals for regression and classification. The way this model solves the task can be divided into three main steps.

In the first step, the model receives the feature maps. If the model is in training, it also receives ground-truth boxes. It then runs a sliding window over the feature maps. For each position of the sliding window, the model creates a set of m*n anchors with m different aspect ratios and n different scales. In practice, m and n are equal to five and three, respectively. Each anchor A has four attributes (x_a, y_a, w_a, h_a): the x and y coordinates of the center, and the width and height from the center (half of the total width and height).
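The anchor generation described above can be sketched as follows (a minimal illustration; the `make_anchors` helper, the `base` stride, and the concrete ratio/scale values are our assumptions, not the thesis code):

```python
from itertools import product

def make_anchors(cx, cy, ratios, scales, base=16):
    """One anchor per (ratio, scale) pair, centred on the sliding-window
    position (cx, cy).  `base` is an assumed backbone stride; the aspect
    ratio r = w / h is applied while keeping the anchor area fixed."""
    anchors = []
    for r, s in product(ratios, scales):
        w = base * s * (r ** 0.5)
        h = base * s / (r ** 0.5)
        anchors.append((cx, cy, w, h))  # (x_a, y_a, w_a, h_a)
    return anchors

# m = 5 aspect ratios and n = 3 scales, as stated in the text above.
anchors = make_anchors(32, 32, ratios=[0.33, 0.5, 1.0, 2.0, 3.0],
                       scales=[8, 16, 32])
print(len(anchors))  # 15 anchors per sliding-window position
```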

The second step involves labeling anchors. For each anchor and ground-truth box, a value called Intersection over Union (IoU) is calculated, indicating the overlapping ratio with the ground-truth bounding box. The anchor is labeled positive or negative depending on this value:

label_box = 1 if k ≥ FOREGROUND_THRESHOLD; 0 if k ≤ BACKGROUND_THRESHOLD; −1 otherwise

where k is the Intersection over Union of the anchor and the ground-truth box:

k = (AnchorBox ∩ GroundTruthBox) / (AnchorBox ∪ GroundTruthBox)


Normally, FOREGROUND_THRESHOLD = 0.7 and BACKGROUND_THRESHOLD = 0.3. All anchors labeled −1 are ignored. Because we have a large number of ground-truth boxes, each anchor produces a vector of labels, where each label is the relationship between the anchor and the corresponding ground-truth box.
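The labeling rule above can be sketched in a few lines of plain Python (the helper names are ours; boxes are assumed to be in (x1, y1, x2, y2) corner format):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(k, fg=0.7, bg=0.3):
    """1 = foreground, 0 = background, -1 = ignored, per the rule above."""
    if k >= fg:
        return 1
    if k <= bg:
        return 0
    return -1

print(label_anchor(iou((0, 0, 10, 10), (0, 0, 10, 10))))  # 1
print(label_anchor(0.5))                                  # -1 (ignored)
```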

The last step of the sub-model is to generate proposals. Since each anchor contains its label and bounding box, the model tries to predict five parameters: the first four parameters come from the regression task, giving the box coordinates, and the last parameter comes from the classification task, given the label vector. In training, these parameters are used to calculate the loss function. As a result, the loss function of the RPN has two terms: a box regression loss and a label classification loss. For the classification loss, we can use the binary cross-entropy function between the predicted label and the anchor label:

L_rpn_cls = BCE(label_predicted, label_gt)

For the regression loss, the four parameters the model tries to predict are:

r_x = (x − x_a) / w_a
r_y = (y − y_a) / h_a
r_w = log(w / w_a)
r_h = log(h / h_a)

where (x, y, w, h) is the predicted box and (x_a, y_a, w_a, h_a) is the anchor. Similarly, from the anchor and the ground-truth box we also have:

r*_x = (x* − x_a) / w_a
r*_y = (y* − y_a) / h_a
r*_w = log(w* / w_a)
r*_h = log(h* / h_a)

where (x*, y*, w*, h*) is the ground-truth box. From these parameters, we can calculate the box regression loss for each anchor using the Smooth L1 loss:

L_rpn_reg = SL1((r_x, r_y, r_w, r_h), (r*_x, r*_y, r*_w, r*_h))
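A minimal sketch of the regression targets and Smooth L1 loss above (the function names and the `beta` parameter are our assumptions; boxes are in center/width/height form as in the equations):

```python
import math

def reg_targets(box, anchor):
    """(r_x, r_y, r_w, r_h) offsets of `box` relative to `anchor`, both
    given as (x, y, w, h), using the log-ratio form for width/height."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four offsets: quadratic for small
    differences (|d| < beta), linear otherwise."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

# A perfectly matching anchor produces all-zero targets and zero loss.
r = reg_targets((10, 10, 20, 20), (10, 10, 20, 20))
print(smooth_l1(r, (0.0, 0.0, 0.0, 0.0)))  # 0.0
```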

Therefore, the total loss of the RPN model is the sum of the regression and classification loss terms over every anchor, divided by a normalization term. In the original paper, the authors use a hyperparameter to balance the loss between regression and classification. Our implementation in Chapter 4 will discuss balancing these variables.

3.1.3 Non-Maximum Suppression

Non-Maximum Suppression (NMS) is an effective algorithm for filtering out similar predictions from detectors. This algorithm is used in both the training and inference phases of Faster R-CNN, although inference is where it matters most. The main idea of the algorithm is to remove redundant or overlapping boxes to reduce the computational cost in later steps, especially in inference, where the time constraint is a big issue. Figure 3.2 shows an example of its usage in a detection model. As we can see in the image on the left, both the cat and the dog have been labeled by many bounding boxes. Our aim is the picture on the right, where only one box covers each object. By doing so, we reduce the number of boxes from 9 to 2, a 4.5x improvement in computing speed for later steps.

Figure 3.2: Non-Maximum Suppression, from [2]

The following algorithm briefly demonstrates the idea. It receives two arrays, B and S, storing the bounding boxes and scores of each prediction, along with an Intersection over Union (IoU) threshold t. The result is two arrays storing the remaining bounding boxes and scores after applying the technique. The algorithm chooses the box with the highest score; for every other box, the IoU between the two is calculated, and if it is higher than t the box is removed. After each iteration, NMS simply chooses the next box with the highest remaining score and repeats until the array B is empty.

Algorithm 1: Non-Maximum Suppression
Input: B = [b1, b2, ..., bN] (list of boxes), S = [s1, s2, ..., sN] (list of scores), t (IoU threshold)
Output: D, S' (remaining boxes and scores)
1  D = []; S' = []
2  while B ≠ [] do
3      m = argmax(S)
4      move b_m from B to D and s_m from S to S'
5      for each b_i remaining in B do
6          if IoU(b_m, b_i) > t then
7              remove b_i from B and s_i from S
8  return D, S'

The time complexity of the algorithm is O(N) in the best case, where one large box covers all others, and O(N^2) in the worst case, where no box is removed, i.e. IoU < t for every pair of boxes. The average case matters little here because the behavior depends on the threshold t, which we choose beforehand: the higher t is, the harsher the complexity becomes, eventually approaching O(N^2). The memory complexity of the algorithm is O(N), as a new array is created.

3.2 Mask R-CNN

This section covers Mask R-CNN[40], an extension of Faster R-CNN in two-stage object detection. Introduced in 2017, Mask R-CNN quickly became the leading two-stage detector by introducing several new techniques while also solving a new task, object segmentation.

Figure 3.3 shows the basic structure of Mask R-CNN. Noticeable differences appear compared to Faster R-CNN: a new branch, called the object mask detector, is added alongside the usual regression and classification branches to predict object masks. An object mask is a layer storing information about the object's location in the picture. The difference between the object mask and the bounding box is that instead of storing the rectangular coordinates of the object, the mask stores a pixel-to-pixel location. As a result, it helps the computer understand the object's properties more easily, especially when multiple objects have similar bounding boxes overlapping each other. Each mask is a 28×28 binary matrix with each value representing a pixel; each pixel is predicted to either belong to the object (true, or 1) or to the background (false, or 0). After that, the binary mask is applied to the object by resizing its matrix to fit the image using either the bilinear or the nearest-neighbor algorithm.

Mask R-CNN also solved two disadvantages that its predecessor struggled with by adding and replacing certain techniques. The first is the inability to deal with small objects. In Faster R-CNN, small objects are often negligible in the big picture; however, in situations where small objects must be detected, Faster R-CNN performs poorly. Section 3.2.2 shows the solution to this problem, the Feature Pyramid Network. The other disadvantage is the lack of pixel-to-pixel alignment between inputs and outputs: RoIPooling does not deal effectively with floating-point locations and often rounds down the location before performing spatial quantization. Mask R-CNN replaces RoIPooling with RoIAlign to perform pooling at floating-point positions.

The following sections discuss the new techniques used in the Mask R-CNN model: the binary mask in section 3.2.1, the Feature Pyramid Network in section 3.2.2, and RoIAlign in section 3.2.3.

3.2.1 Binary Masks

The methodology Mask R-CNN uses to generate masks is similar to how Faster R-CNN creates and predicts bounding boxes. For each anchor A = (x_A, y_A, w_A, h_A) created by the sliding window, a 28-by-28 binary mask is created based on the bounding-box coordinates; each pixel of this mask is 1 if the pixel belongs to the object. In training, this mask is compared to the ground-truth mask using the binary cross-entropy loss, similarly to the RPN classification loss:

L_mask = BCE(mask_predicted, mask_gt)

For inference, the 28-by-28 mask is resized to fit the predicted bounding box: the width is stretched by a factor of w_A/28 and the height by h_A/28, while the center of the binary mask is placed at the center of the box. To reduce the computational cost, it is recommended to generate masks only after the bounding-box array has been filtered with NMS, so that we do not need to compute a mask for every anchor.
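The resize step can be sketched as a nearest-neighbor upscale of a binary mask to an arbitrary box size (an illustrative simplification; as noted above, a real pipeline may use bilinear interpolation followed by a 0.5 threshold instead):

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a binary mask to (out_h, out_w)."""
    in_h, in_w = mask.shape
    # index of the source row/column for each output row/column
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows][:, cols]
```

Upscaling a 2×2 checkerboard to 4×4 with this function simply duplicates each pixel into a 2×2 block, which is exactly the blocky "spike" artifact discussed below.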


Figure 3.3: Mask R-CNN model, from [3]

Figure 3.4: Mask sample; the pink-colored pixels belong to the object

Figure 3.4 shows an example of a sample mask. Ignoring the label on the top left and the bounding box, each pixel colored pink is said to belong to the object (the parallelogram). We can see spikes in the mask at the edge of the object, indicating the misalignment of pixels during mask stretching. This could be corrected, but it is not worth solving here, as masks are not used extensively in this project; they mostly serve to create bounding boxes and visual effects.

3.2.2 Feature Pyramid Network

The Feature Pyramid Network, or FPN, is a technique used in Mask R-CNN that replaces the old backbone network with a new model. This model provides multiple feature maps which can be fed as inputs to the RPN to create proposals. When multiple inputs are needed, one easy solution is to downscale the image multiple times to create a pyramid; each layer image in this pyramid is then processed to produce a feature map separately. The disadvantage of this method, however, is the computational cost, as the feature map must be produced multiple times. The FPN model ensures that only one pass is needed to generate multiple feature maps. Since many feature maps are created, each at a different scale, it becomes easier to detect small objects.

Figure 3.5 shows the basic idea of the Feature Pyramid Network. It consists of two main components: the bottom-up path and the top-down path, each with a structure similar to a pyramid. In the bottom-up pathway, the input image goes through multiple downscaling steps; the output of each lower layer is the input of the next layer in the pyramid. In the original paper, the downscaling network is a convolutional network, which computes a hierarchy of feature maps


Figure 3.5: Feature Pyramid Network, from [4]

with a scaling factor (or stride) of 2. For example, assuming a 5-layer pyramid with the bottom layer having stride 1, the other layers in the bottom-up pathway have strides of 2, 4, 8, and 16, respectively. Each layer in the pyramid is used in the top-down pathway to make the final maps. To simplify the calculation in the top-down pathway, assume the pyramid in the bottom-up pathway has four layers, called C1 to C4 respectively.

In the top-down pathway, each layer of the pyramid is calculated from the top to the bottom to progressively create higher-resolution maps. Assuming that we discard C1 in the calculation due to its large memory consumption and call the layers in the top-down pathway, from top to bottom, P4 to P2, the method to generate these layers can be summarized as follows. First, the P4 layer is created from C4 by passing it through a 1×1 convolution to reduce its dimension. Each coarser-resolution layer P(i+1) is then upsampled by a scaling factor equal to the stride used in the bottom-up pathway. The upsampled feature map is merged by element-wise addition with the corresponding bottom-up layer Ci, which likewise goes through a 1×1 convolution. This process continues down to P2, since C1 is not used in the algorithm. These three layers in the top-down pathway are forwarded to the RPN as input maps.
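The top-down merge just described can be sketched with NumPy, replacing the 1×1 lateral convolutions with identity maps and using nearest-neighbor 2× upsampling; this only shows the flow of shapes and additions, not the real learned network:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(c2, c3, c4):
    """Merge bottom-up maps C2..C4 (each half the size of the previous)
    into P2..P4. The 1x1 lateral convolutions are replaced by identity
    for brevity, so all maps must share the same channel count."""
    p4 = c4                    # lateral(C4)
    p3 = c3 + upsample2x(p4)   # lateral(C3) + upsampled P4
    p2 = c2 + upsample2x(p3)   # lateral(C2) + upsampled P3
    return p2, p3, p4
```

Each P-layer keeps the spatial size of its C-layer while accumulating context from every coarser level, which is why the finest map P2 sees information from the whole pyramid.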

Another interesting characteristic of the Feature Pyramid Network is that it can be used along with other backbone networks, including ResNet, Inception, and VGG. We will discuss this combination in section 4.3.

3.2.3 Region of Interest Align

In Faster R-CNN, proposals from the Region Proposal Network are normalized through a technique called Region of Interest Pooling (RoIPooling). This layer originates from the idea of accelerating both the training and inference phases of the object detection model. Normally there are many proposals per frame, yet the same feature map input can be reused for all of them, improving processing speed by a large factor. From an array of proposals containing bounding boxes of different sizes, the method quickly produces a list of corresponding feature maps with a fixed size. Note that the dimension of this layer's output depends on neither the size of the input feature map nor the size of the proposals. It


Figure 3.6: RoIPooling in Faster R-CNN

is chosen specifically as a predefined parameter to control the number of sections we perform pooling on.

The algorithm takes two inputs: the feature map(s) transported from the backbone network and an (N, 5) matrix representing an array of RoIs (in this situation, we can specifically call these bounding boxes), where each RoI R = (idx, x, y, w, h) contains an index and the coordinates of a proposal. Some models require another parameter for the output size of the layer so that each feature map is calculated at a different scale. For each RoI, the layer transmutes the feature map into a fixed-size matrix (for example, 2×2). The scaling method consists of two steps:

• Divide the region into equal-size sections; the number of sections equals the output dimension. Each section coordinate is an integer.

• Perform MaxPooling in each section.

Figure 3.6 shows an example of RoIPooling in Faster R-CNN. The input feature map is a 4-by-4 matrix, and the RoI has center-format coordinates R = (1.5, 2.5, 1.5, 1.5). The output of RoIPooling is a 2×2 matrix. Following the steps, we first convert the RoI from center coordinates to top-left/bottom-right coordinates. The new coordinates of the RoI are:

x1 = x − w = 0
y1 = y − h = 1
x2 = x + w = 3
y2 = y + h = 4

Next, we divide the region (0, 1)–(3, 4) into a 2-by-2 grid, obtaining the four regions shown in the second image. Finally, we take the max value in each section to obtain the final matrix, which is the result of the RoIPooling layer.
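The worked example above can be reproduced with a small NumPy sketch of RoIPooling (an illustrative version with integer bin edges computed by floor division, not a library implementation):

```python
import numpy as np

def roi_pool(fmap, x1, y1, x2, y2, m):
    """RoIPooling: split [x1, x2) x [y1, y2) into an m x m grid with
    integer bin edges, then take the max of fmap in each bin."""
    out = np.zeros((m, m))
    # integer edges: the source of the round-off error discussed below
    xs = [x1 + (x2 - x1) * i // m for i in range(m + 1)]
    ys = [y1 + (y2 - y1) * j // m for j in range(m + 1)]
    for j in range(m):
        for i in range(m):
            out[j, i] = fmap[ys[j]:max(ys[j + 1], ys[j] + 1),
                             xs[i]:max(xs[i + 1], xs[i] + 1)].max()
    return out
```

Running it on a 4×4 map whose cell values are their row-major indices, with the RoI region (0, 1)–(3, 4) and a 2×2 output, picks the maximum index in each of the four unequal integer bins.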

From the implementation of RoIPooling, we can see the disadvantage of the algorithm. It lies in the first step of the scaling method: by dividing the proposal region at integer coordinates, we introduce round-off errors in each section. While this is not a big issue in Faster R-CNN, it creates a substantial amount of error when using FPN in particular and Mask R-CNN in general. Since the Feature Pyramid Network creates multiple feature maps at different scales to handle small-object detection, there are occasions where an object with a small bounding box is fed to the pooling layer; after division, many sections end up empty, breaking the core of the algorithm. To negate this problem, Mask R-CNN uses RoIAlign, an improvement over RoIPooling that removes the rounding issue. The scaling method in RoIAlign consists of two steps:

• Divide the region into equal-size sections; the number of sections equals the output dimension. Each section coordinate is a floating-point number.


Figure 3.7: RoIAlign layer used in Mask R-CNN

• Perform MaxPooling in each section. The value of each segmented feature is the multiplication of the value of the feature itself and the proportion of its area that falls inside the section.

Taking an example similar to the previous technique, figure 3.7 shows a 4-by-4 feature map with a 2×2 output. The coordinates of the RoI are R = (1.5, 2.5, 1.5, 1.5). Performing the first step as in RoIPooling, we obtain the RoI region (0, 1)–(3, 4). Dividing this region into two-by-two sections gives the result shown in the second image, with the red lines indicating the segmentation. Taking the bottom-right section as a sample, we can see how the feature value of each segment is recalculated before performing MaxPooling to obtain the output.
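The two steps just described can be sketched literally in NumPy (an illustrative version of the area-weighted variant described here; production RoIAlign implementations instead sample bilinearly at fixed points inside each bin):

```python
import numpy as np

def roi_align_max(fmap, x1, y1, x2, y2, m):
    """RoI pooling with floating-point bins: each feature cell
    overlapping a bin is weighted by the fraction of the cell's area
    inside the bin, then the max weighted value is kept."""
    out = np.zeros((m, m))
    bw, bh = (x2 - x1) / m, (y2 - y1) / m   # float bin size
    H, W = fmap.shape
    for j in range(m):
        for i in range(m):
            bx1, by1 = x1 + i * bw, y1 + j * bh
            bx2, by2 = bx1 + bw, by1 + bh
            best = 0.0
            for r in range(int(by1), min(H, int(np.ceil(by2)))):
                for c in range(int(bx1), min(W, int(np.ceil(bx2)))):
                    # overlap of cell [c, c+1) x [r, r+1) with the bin
                    ov = (min(c + 1, bx2) - max(c, bx1)) * \
                         (min(r + 1, by2) - max(r, by1))
                    best = max(best, fmap[r, c] * ov)
            out[j, i] = best
    return out
```

Unlike the integer-edge version, no bin can collapse to zero size here, which is exactly the property that makes the technique safe for the small FPN proposals discussed above.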

Considering the complexity of both RoI layers, we can see that both algorithms have a time complexity of O(n^2), where n is the input size. To show this, we first count the loops used. The algorithm uses two main loops: the first iterates over sections to perform MaxPooling. Denoting the output size m × m, this loop performs m · m iterations, giving O(m^2). The second loop runs over all segmented features in a section to find the max value. Given an RoI of size (2w, 2h) divided into m × m sections, the number of segmented features N in a section is bounded by:

For RoIPooling: ⌊2w/m⌋ · ⌊2h/m⌋ ≤ N ≤ ⌈2w/m⌉ · ⌈2h/m⌉

For RoIAlign: (2w/m) · (2h/m) ≤ N ≤ (2w/m + 1) · (2h/m + 1)

Since both w and h are proportional to the size of the feature map n, in both situations the inner loop is roughly O((n/m)^2) per section. Combining the two loops gives a final complexity of O(n^2). The cost of calculating the segmented feature values in RoIAlign resides in the second loop and is negligible compared to the overall complexity.

3.3 Keypoint R-CNN

Introduced as a variation of Mask R-CNN, Keypoint R-CNN is a tuned model used to solve the keypoint detection task in the Human Pose Estimation challenge. Because Keypoint R-CNN is a variation of Mask R-CNN, most Mask R-CNN features stay in this model. The main difference between the two lies in the third branch of the model: while Mask R-CNN uses a 28-by-28 binary mask to store the object mask for segmentation, Keypoint R-CNN uses a floating-point heatmap of size 56×56, with each value in the heatmap ranging from -1 to 1. In training, a heatmap is produced and correlated with the ground-truth keypoints. From the correlation, the loss function is calculated using cross-entropy:

L_kp = CE(kp_predicted, kp_gt)

To create the correct order of keypoints, inputs fed to the model have to follow a specific order. At inference, this order is used to generate the correlation map and find the top N keypoints. The keypoint selection technique must also ensure that selected keypoints are not too close to each other. The final keypoint map is stretched to fit the bounding-box size. Keypoint R-CNN is an important model here since the locations of keypoints can be used for other tasks, especially connecting objects, one of the main objectives of this project. However, the huge disadvantage of this model is its application to multi-class detection, as Keypoint R-CNN only performs well on a single class.


Chapter 4

Proposed method

This chapter presents our contribution and solution to the diagram recognition problem. It contains not only knowledge from chapter 3 about the known models, but also our improvements to these models and an additional method for solving the relationship issues. This chapter contains four main sections: section 4.1 explains the constraints and scope of the thesis; section 4.2 clarifies our objective and how the task can be completed; section 4.3 introduces our proposed model to solve the object detection problem; finally, the last section shows our proposed method to generate a full diagram from the output of the object detection model.

4.1 Scope of the thesis

This thesis is a contribution to solving the task of diagram digitization. In more detail, it serves as the backend of a server connected to user devices, mainly Android. The application allows the user to convert an image of a diagram into an online diagram that the user can modify, duplicate, and convert to other formats. Due to the limitations of the user device, the following constraints are considered:

• Quantity: Most Android phones have screens smaller than 5.5 inches. Using a phone screen and camera, the user may not be able to capture a complex diagram without shrinking the resolution. To keep the diagram easy to modify after conversion, the original diagram should not contain more than 15 symbols; otherwise editing becomes impractical.

• Quality: The diagram is drawn with ordinary tools (chalk, ballpoint pen, etc.). Since this thesis uses preprocessed rather than original images, it is vital to avoid capturing issues such as blurred images. Moreover, the image should be captured in the correct orientation, with at most a 10-degree rotation, to avoid shape and text misinterpretation.

• Time constraint: Since data travel between the user device and the server, the total inference time of the model should not exceed a certain threshold. A threshold of 8 seconds is selected as the maximum inference time.

Other constraints relate to the objectives of this thesis. Python, specifically PyTorch, is used in this model. It is still under consideration whether the diagram server will use Python or C++, as both have libraries that fully optimize the computational cost. The final system should recognize diagram objects such as “symbol”, “arrow”, and “text”. For each symbol and connector, the system returns a structure containing the object information. For text objects, while the result could be returned as a string of raw text, this thesis does not cover the OCR problem; instead, it returns the bounding box of the text. In the future, it is recommended to add OCR to the final model for text recognition. The system is also able to construct the full relationships between every predicted


object, including the relationships between symbols and connectors, texts and symbols, and texts and connectors. The output of this construction is a well-formed JSON file storing the object information and relationships. The JSON file should not contain duplicate or unused information. The last constraint of the thesis concerns the connectors: connectors can intersect but must not overlap each other, because it is impossible to distinguish similar objects when they share a part (intuitively, this resembles the two-flowers-one-stem situation).

So, how do we digitize diagrams? While it is viable to draw a diagram directly with tools such as draw.io or Visio, sketching by hand and then converting the drawing to a digital diagram is much faster. As mentioned in section 2.2, there are two ways to digitize hand-drawn diagrams: online and offline diagram recognition. Online diagram recognition has become popular in recent years due to its convenience in practice: the user only needs a pen and a tablet or phone to draw a diagram, and a background process constructs the diagram automatically in real time. However, in an offline meeting where most information is displayed on paper or a whiteboard/blackboard, the limitation of online drawing quickly appears: to record the events, the user most likely has to redraw a lot of information, consuming a considerable amount of time. For offline drawing, the user's role is only to capture the image, and the backend converts the image into a modifiable diagram, saving time and resources.

This thesis proposes a method to solve the following problem: given an image containing a diagram and text, create a model to convert this image into an output structure that can be used to build digital diagrams in an Android application.

To solve the defined problem, we need to consider the possible approaches and solutions. There are two types of diagrams, online and offline; since an online diagram can be converted into a static image, offline detection is preferred. The next step is to choose the type of object detection. Since there are two approaches, one-stage and two-stage detectors, it is important to select which branch the proposed model belongs to. The main difference between one-stage and two-stage models is inference time versus precision: while one-stage models perform inference extremely fast, some even in real time, two-stage models have the edge in precision. Because the system will be used as a background process, a two-stage detector is preferred, as there are other time-consuming issues such as sending images and data between the user device and the server, or building the diagram in the application. Therefore, a two-stage object detection model is selected for this project.

4.2.3 Preparing input

In this section, we walk through the steps taken to construct the final input for this project. The first problem to consider is finding a suitable dataset. From section 4.1, one constraint of this thesis is that the images should be preprocessed, due to the fact that not many training samples are available on the internet. While creating our own samples is feasible, biased errors would occur frequently; therefore, it is critical to rely on existing datasets. To the best of our knowledge, the following diagram datasets exist:

• FC database[59], with 672 flowchart diagrams drawn by 24 users from Czech Technical University, stored in InkML format.

• KONDATE dataset[60], from the Nakagawa lab at Tokyo University of Agriculture and Technology, containing 670 free-form handwritten Japanese documents by 67 writers.

• University of Nantes flowchart dataset[15], consisting of 419 drawings.

• Digital Ink Diagram Data, or DiDi[61], from Google, containing 58,655 flowchart diagrams.

Due to the popularity of online diagram recognition in the past, all of these datasets are stored in the online diagram format. Nevertheless, we choose the DiDi dataset due to its overwhelming number of entries, while the other datasets can serve as reference points in the future. The DiDi dataset has two components: images and drawings. The images in the dataset are created using GraphViz and stored in PNG format. The following algorithm shows the author's idea in
