
Computer Science Graduation Thesis: Building A Diagram Recognition Problem with Machine Vision Approach



VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF COMPUTER SCIENCE AND ENGINEERING

Tran Hoang Thinh - 1752516


VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY | SOCIALIST REPUBLIC OF VIETNAM - Independence - Freedom - Happiness

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY: Computer Science & Engineering | GRADUATION THESIS ASSIGNMENT

DEPARTMENT: Computer Science | Note: the student must attach this sheet to the first page of the thesis report

FULL NAME: Tran Hoang Thinh - STUDENT ID: 1752516 | MAJOR: _ | CLASS: _

1. Thesis title:

Building A Diagram Recognition Problem with Machine Vision Approach

2. Thesis tasks (required content and initial data):

- Investigate approaches to the diagram recognition problem
- Research machine learning approaches for the problem
- Prepare data for the problem
- Propose and implement the diagram recognition system
- Evaluate the proposed model

3. Assignment date: 1/3/2021  4. Completion date: 30/6/2021

5. Supervisor(s) and assigned parts: 1) Nguyen Duc Dung  2) _  3) _

The content and requirements of the thesis have been approved by the department.

Date ... month ... year ...

HEAD OF DEPARTMENT - MAIN SUPERVISOR

SECTION FOR THE FACULTY AND DEPARTMENT:

Preliminary reviewer: _  Unit: _  Defense date: _  Final grade: _  Thesis archive location: _


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY | SOCIALIST REPUBLIC OF VIETNAM

6. Main strengths of the thesis:

The team has successfully proposed the diagram recognition system. They built the initial dataset and performed the labeling of the data for this task. The team utilized their knowledge of computer vision and machine learning to propose a suitable approach to this problem. The evaluation results are promising.

7. Main shortcomings of the thesis:

The dataset they built is still small, and the number of components that the model can recognize is also limited. Despite obtaining high accuracy, the team has not performed experiments under real conditions, i.e. images captured with shadows, low contrast, thin sketches, etc.

8. Recommendation: Approved for defense [ ]  Revise and resubmit for defense [ ]  Not approved for defense [ ]

9. Three questions the student must answer before the committee:

a. _  b. _  c. _

10. Overall assessment (in words: excellent, good, average): Excellent. Grade: 9/10. Signature (full name):

Nguyễn Đức Dũng


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY | SOCIALIST REPUBLIC OF VIETNAM

STUDENT ID: 1752516. Major (specialization): Computer Science

2. Thesis: "Building A Diagram Recognition Problem with Machine Vision Approach"

3. Reviewer: Nguyen An Khuong

4. Overview of the report:

Number of tables: 4. Number of figures: 18. Number of references: 53. Computational software: _. Artifacts (products): _

and Algorithms 3,4 for diagram building.

• The thesis uses the Mask R-CNN model and its variant, Keypoint R-CNN, with some improvements and augmentation to solve the offline diagram recognition task with rather high accuracy (~90%) and acceptable performance (< 2 s per diagram).

7. Main shortcomings of the thesis:

• The thesis is not well written and is too short.

• The contributions of the author are not presented in a clear manner.

8. Recommendation: Approved for defense [ ]  Revise and resubmit for defense [ ]  Not approved for defense [ ]

9. Questions the student must answer before the committee:

a. Is there any commercial or prototype app/software that solves this problem or similar ones? If YES, can you give some comments and remarks benchmarking your work against those?

b. Arrow keypoints often seem to coincide with one border of the bounding box, so how should we reduce the overlap between the arrow keypoint detection and bounding box detection tasks?

10. Overall assessment (in words: excellent, good, average): Excellent. Grade: 9/10

Signature (full name):

Nguyễn An Khương


We hereby undertake that this is our own research project, conducted under the guidance of Dr. Nguyen. The research content and results are truthful and have never been published before. The data used for the analysis and comments were collected by us from many different sources and will be clearly stated in the references.

In addition, we also use several reviews and figures from other authors and organizations. All have citations and origins.

If any fraud is detected, we take full responsibility for the content of our graduate internship. Ho Chi Minh City University of Technology is not related to the copyright infringement caused by us in the implementation process.

Best regards,
Tran Hoang Thinh


First and foremost, we would like to express our sincere gratitude to our advisor, Dr. Nguyen Duc Dung, for his support of our thesis and for his patience, enthusiasm, experience, and knowledge. He shared the experience and knowledge that helped us in our research and in producing a good thesis. We also want to thank Dr. Nguyen An Khuong and Dr. Le Thanh Sach for their support in reviewing our thesis proposal and thesis. Finally, we would like to show our appreciation to the Computer Science and Engineering Faculty and Ho Chi Minh City University of Technology for providing the academic environment for us to become what we are today.

Best regards,
Tran Hoang Thinh


Diagrams have been among the most effective illustration tools for demonstrating and sharing ideas and suggestions. Besides text and images, drawing flow charts is the best way to give others a clearer view of a plan with the least amount of work. Nowadays, many meetings require a blackboard so everyone can express their thoughts. This raises a problem with saving these drawings as a reference for future use, since taking a picture cannot solve the problem of re-editing these ideas, and they need to be redrawn to be suitable for professional documents. On the other hand, digitizing the chart requires redrawing the entire diagram using a computer or a special device such as a drawing board or digital pen, which costs a lot and is not the most convenient tool to use.

Therefore, it is necessary to find a way to convert traditional hand-drawn diagrams into a digital version, simplifying the sharing process between users. Moreover, the digitized diagram also helps the user modify it and convert it to other forms that satisfy their requirements. This thesis focuses on stating the problem of digitizing diagrams and proposing a solution.


2.2 Diagram recognition 6

3 Background 7
3.1 Faster R-CNN 7

3.2.2 Feature Pyramid Network 12

3.2.3 Region of Interest Align 13

3.3 Keypoint R-CNN 15

4 Proposed method 17
4.1 Scope of the thesis 17


5.2.1 Perform training and inference without keypoints 40

5.2.2 Perform training and inference with keypoints 40

5.2.3 Building diagram structure from predictions 42

6 Conclusion 46
6.1 Summary 46

6.2 Challenges 46

6.3 Future works 46


List of Figures

3.1 ResNet50 model, from [1] 8

3.2 Non-Maximum Suppression, from [2] 10

3.3 MaskRCNN model, from [3] 12

3.4 Mask Sample, the pink colored pixels are for the object 12

3.5 Feature Pyramid Network, from [4] 13

3.6 RoIPooling in Faster R-CNN 14

3.7 RoIAlign layer used in Mask R-CNN 15

4.1 Sample of an entry in DiDi dataset 21

4.2 Python code to save a drawing as PNG image 21

4.3 A sample with its labels and the JSON label information 23

4.4 Sample drawing with bounding boxes 25

4.5 Pipeline of the model 26

4.6 Feature Pyramid Network with ResNet, from [5] 27

4.7 A drawing with its predictions 30

4.8 Structure sample 32

4.9 Model fails to detect intersected arrows 33

4.10 Example when Euclidean distance does not work 34

4.11 Sample for Weighted Euclidean 34

5.1 Sample prediction 40

5.2 Sample prediction with rotated input 41

5.3 Loss over iteration of proposed model without keypoints 41

5.4 Sample diagram without text 42

5.5 Sample diagram with text 43

5.6 Loss over iteration of proposed model with keypoints 43

5.7 Drawing without predictions at 60% score 44

5.8 Example of impossibility in prediction 45

5.9 Sample output result 45


List of Algorithms

1 Non-Maximum Suppression 10

2 DiDi image generation 20

3 COCO Format Generation 24

4 Improved Non-Maximum Suppression 29

5 Arrow Refinement 33

6 Weighted Euclidean for Symbol-Arrow relationship 36


Chapter 1: Introduction

Compared to a few decades ago, artificial intelligence (AI) has developed faster than one could imagine. Tracing back to the 90s, right after the second "AI Winter" ended, there were numerous advances in which computers achieved milestones once believed to be impossible. In 1994, Chinook [6], a checkers (English draughts) engine, won the United States tournament by an enormous margin. It beat the second-best player, Don Lafferty, while making Marion Tinsley, the best at the time, withdraw in the middle of the game. 1997, on the other hand, is the year that changed the history of chess forever, when IBM's Deep Blue [7] chess machine defeated Grandmaster Garry Kasparov with a score of 3½ to 2½. In the same year, Logistello [8] beat the world champion, Takeshi Murakami, with an overwhelming score of six to zero. Nowadays, AI can be seen everywhere in modern life, from work-related examples like email spam filters and virtual assistants to the entertainment industry with recommendation systems, chat and gaming bots, and voice and text recognition. AlphaZero [9], developed by Google DeepMind, defeated the reigning champion, Stockfish, in a one-sided match with 28 wins, 72 draws, and zero losses. Another project, AlphaGo [10], beat the champion Lee Sedol 4 - 1, marking the first time in the history of artificial intelligence that a computer had beaten a human in Go.

Within the area of computer vision, a subset of artificial intelligence that deals with the science of enabling computers or engines to visualize images, a smaller section deals with the ability to detect objects, for example, humans, animals, furniture, etc. Recently, there have been many applications that can help deal with this task. Google Lens [11] is an image recognition technology developed by Google, which can detect objects, text, bar codes, QR codes, and math equations, and "google" for related results or information using a two-step detector combined with Long Short-Term Memory (LSTM). Microsoft Math [12] can use natural language processing (NLP) to recognize and solve math problems step by step. Other contributions include Vivino [13], which can scan and detect wines, and Screenshop [14], which builds a shopping catalog from an image and gives recommendations.

Regarding diagram recognition, since the introduction of the Online Handwritten Flowchart Dataset [15] in 2011, there have been numerous attempts at digitizing diagrams. Mainstream ideas are divided into two main approaches: online diagram recognition and offline diagram recognition. In online diagram recognition, the user continuously draws a diagram on a device with a touchscreen, such as a tablet or a smartphone, using a pen or finger. Meanwhile, the program captures the input as a sequence of strokes. These are later used to detect objects and the relations between them. Alternatively, the input of offline diagram recognition is a raw image from a source such as a phone camera. The input is broken down into a set of features, and these features are then used to visualize the objects. Recently, there has been more attention on online recognition, as it is more flexible than its counterpart in both precision and real-time constraints. However, in many real-life situations such as meetings or conferences, when the discussion between people is displayed on a blackboard or on paper, online methods, although possible, are not preferred. One would capture an image of the blackboard and use offline methods to digitize the diagram.

This project develops a model that can perform offline diagram recognition and digitize the diagram in a suitable format. It will not stop with the model and algorithm, but will be developed into an application that can serve actual clients to solve real-life problems.

The report is organized as follows:

• Chapter 2 briefly surveys the application of object detection in real life, related work on object detection in general, and flowchart detection in particular.

• Chapter 3 provides sufficient knowledge to understand the project.

• Chapter 4 shows our proposed system, including how the application works.

• Chapter 5 lists our experiments and results.

• Chapter 6 summarizes what we have done, along with challenges and future work.

The target of this thesis is to build a model that can convert preprocessed diagram images into a reasonable and understandable structure. This structure can later serve as part of an application that helps solve many real-life problems.


2.1.2 Traditional detector

Most of the early object detection algorithms were built on manually crafted features with multiple complex models. Due to the lack of computational resources and large image sizes, numerous speed-up methods were required.

In 2001, P. Viola and M. Jones achieved real-time face detection using sliding windows [16, 17]. The algorithm goes through all possible locations and scales in an image to find human faces. It speeds up the computational process with three important techniques. First, the integral image speeds up box filtering or the convolution process, using Haar wavelets as the feature representation of the image. Second, feature selection uses the AdaBoost algorithm to select a small set of features from a huge pool of features. Third, a multi-stage detection paradigm reduces computation by spending less time on the background than on likely face locations. Although the algorithm is simple, it stretched the computational power of computers at that time.

The Histogram of Oriented Gradients (HOG) was created in 2005 by N. Dalal and B. Triggs [18]. It is designed to be computed on a grid of equal cells and uses overlapping local contrast normalization to improve accuracy. To detect objects of different sizes, HOG resizes the input image multiple times to match the detection window size. It has been an important foundation for many object detectors and a large variety of computer vision applications [19, 20].


RELATED WORKS

2.1.3 CNN-based Detector

As traditional methods showed their disadvantages by becoming progressively more complex while progress slowed down, researchers tried to find alternatives to increase accuracy and performance. In 2012, Krizhevsky et al. brought back the age of the Convolutional Neural Network with a paper on object classification on ImageNet [21]. As a DCNN can classify an image based on its feature set, subsequent papers showed interest in applying the newly rediscovered method to object detection. Over the past decades, multiple object detection models have been proposed and studied to improve detection accuracy, such as LeNet-5 [22], AlexNet [23], VGG-16 [24], Inception [25, 26], ResNet [27], etc. Studies have also discovered techniques that improve the training process and prevent overfitting, for example, dropout [23], Auto-Encoders [28], Restricted Boltzmann Machines (RBM) [29], Data Augmentation [30], and Batch Normalization [31].

There are two main groups of CNN-based detection: "two-step detection" and "one-step detection". In the first group, the image is examined to generate proposals, and these proposals are delivered to another network for classification and regression. In the second group, objects are recognized and classified directly within one network model.

2.1.3.1 CNN-based Two-Stage Detection (Region Proposal based)

Released in 2014 by Girshick et al. [32, 33], R-CNN was the first attempt to build a Convolutional Neural Network for object detection. The idea of R-CNN is divided into three main stages:

• Proposals are generated using selective search.

• Proposals are resized to a fixed resolution. These proposals are then fed into the CNN model to extract the feature map.

• The feature map is classified using SVMs for multiple classes to deduce the final bounding box.

Despite having certain advantages over traditional methods and bringing CNNs back to practical use, R-CNN has some fatal disadvantages. Training has multiple stages, and feature maps are stored separately, thus increasing time and space complexity. Moreover, the number of overlapping proposals is large (over 2000 proposals per image). The CNN model also requires a fixed-size image, so any input must be resized, and on certain occasions the object gets cropped, creating severe distortions.

Later in the same year, He et al. introduced a novel CNN architecture named SPP-Net [34] using Spatial Pyramid Matching (SPM) [35, 36]. The Convolutional Neural Network is combined with a Spatial Pyramid Pooling (SPP) layer, which enables the generation of a fixed-length feature representation without scaling the input image. The model removes the proposal overlapping and the need to resize the image; however, it still requires multi-step training, including feature extraction, network fine-tuning, SVM training, and bounding-box regression. Additionally, the convolution layers before the SPP layer cannot be modified with the algorithm shown in [34].

In 2015, Girshick proposed Fast R-CNN [37], a model with the ability to perform multi-task classification and bounding-box regression within the same network. Similar to SPP-Net, the whole image is processed with convolution layers to produce feature maps. Then, a fixed-length feature vector is extracted from each region proposal with a region of interest (RoI) pooling layer. Each feature vector is then fed into a sequence of fully connected layers before branching into two outputs: one used for the classifier and the other encoding the bounding-box location. Apart from region proposal generation, the training of all network layers can be processed in a single stage, saving the extra cost of storage.


In the same year, Ren et al. introduced Faster R-CNN, a method that optimizes Fast R-CNN further by replacing the proposal generation via selective search with a similar network called the Region Proposal Network (RPN) [38]. It is a fully convolutional network that can predict object bounding boxes and scores at each position at the same time. With the proposal of Faster R-CNN, region-proposal-based CNN architectures for object detection can be trained in an end-to-end way. However, the alternating training algorithm is time-consuming, and the RPN does not perform well when dealing with objects of extreme scales or shapes. As a result, multiple adjustments have been made. Some noticeable improvements are the Region-based Fully Convolutional Network (R-FCN) [39], the Feature Pyramid Network (FPN) [4], Mask R-CNN [40], and its variant, Keypoint R-CNN. We will look at the details of these methods in Chapter 3.

2.1.3.2 CNN-based One-Stage Detection (Regression/Classification based)

Region-proposal-based frameworks are composed of several correlated stages, including region proposal generation, feature extraction, classification, and bounding-box regression. Even in Faster R-CNN and its variants, training the parameters is still required between the Region Proposal Network and the detection network. As a result, achieving real-time detection with two-stage detection is a big challenge. One-stage detection, on the other hand, deals with the image directly by mapping image pixels to bounding-box coordinates and class probabilities.

• You Only Look Once (YOLO): YOLO [41] was proposed by J. Redmon et al. in 2015 as the first entry in the one-stage detection era. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. YOLO consists of 24 convolution layers and 2 FC layers, of which some convolution layers construct ensembles of inception modules with 1 × 1 reduction layers followed by 3 × 3 conv layers. Furthermore, YOLO produces fewer false positives in the background, which makes cooperation with Fast R-CNN possible. The improved versions YOLO v2, v3, and v4 were later proposed, adopting several impressive strategies such as BN, anchor boxes, dimension clusters, and multi-scale training [42, 43, 44].

• Single Shot MultiBox Detector (SSD): SSD [45] was proposed by W. Liu et al. in 2016 as the second entry among one-stage detectors. SSD introduces multi-reference and multi-resolution detection techniques that significantly improve detection accuracy. The main difference between SSD and other detectors is that SSD detects objects of different scales on different layers of the network rather than detecting them all at the final layer.

• RetinaNet: RetinaNet [46] uses a Feature Pyramid Network (FPN) with a CNN-based backbone. The FPN adds top-level feature maps to the feature maps below them before making predictions. It involves upscaling the top-level map, matching the dimensionality of the map below using a 1x1 convolution, and performing element-wise addition of both. RetinaNet achieves results comparable to two-stage detection while maintaining higher speed.

• Refinement Neural Network for Object Detection (RefineDet): RefineDet [47] is based on a feed-forward convolutional network similar to SSD. It produces a fixed number of bounding boxes and scores indicating different classes of objects in those boxes, followed by non-maximum suppression to produce the final result. RefineDet is composed of two inter-connected modules:

– Anchor Refinement Module (ARM): Removes negative anchors and adjusts the locations/sizes of anchors to initialize the regressor.

– Object Detection Module (ODM): Performs regression on object locations and predicts multi-class labels based on the refined anchors.


There are three core components in RefineDet: the Transfer Connection Block (TCB) converts the features from the ARM to the ODM; Two-Step Cascaded Regression conducts regression on the locations and sizes of objects; and Negative Anchor Filtering rejects well-classified negative anchors and reduces the imbalance problem.

2.2 Diagram recognition

In general, diagram recognition can be grouped into two smaller areas: online diagram recognition and offline diagram recognition. In online recognition, the model is typically an RNN that recognizes each stroke and generates candidate matches.

Valois et al. [48] proposed a method for recognizing electrical diagrams. Each set of ink strokes is detected as a match with a corresponding confidence factor using probabilistic normalization functions. The disadvantage of the model is the simplicity of the system and its low accuracy, preventing it from being used in real situations. Feng et al. [49] used a more modern technique for detecting electrical circuits. Symbol hypotheses are generated and classified using a Hidden Markov Model (HMM) and traced with 2D-DP. However, it has a drawback of complexity when the diagram and the number of hypotheses are immense, making it impractical for real-life cases. ChemInk [50], a system for detecting chemical formula sketches, categorizes strokes into elements and the bonds between them. The final joint inference is performed using conditional random fields (CRF), which combine features from a three-layer hierarchy: ink points, segments, and candidate symbols. Qi et al. [51] used a similar approach to recognize diagram structure with Bayesian CRF - ARD. These methods outperform traditional techniques; however, by using pairwise terms at the final layer, it is harder to combine features for future adaptations. Coming to flowchart recognition, after the release of the Online Handwritten Flowchart Dataset (OHFCD), multiple studies tackled this dataset. Lemaitre et al. [52] proposed DMOS (Description and MOdification of the Segmentation) for online flowchart recognition. Wang et al. [53] used a max-margin Markov Random Field to perform segmentation and recognition. In [54] they extended their work by adding a grammatical description that combines the labeled isolated strokes while ensuring global consistency of the recognition. Bresler et al. proposed a pipeline model in which they separate strokes and text using a text/non-text classifier, then detect symbol candidates using a max-sum model over groups of temporally and spatially close strokes. The authors also propose an offline extension that uses a preprocessing model to reconstruct the strokes from the flowchart [55, 56].

While online flowchart recognition detects candidates based on ink strokes, offline flowchart recognition performs object detection on an image from the user. It is possible to reconstruct online strokes from offline data [57]; however, that preprocessing step is not necessary because we can recognize the whole diagram structure independently of strokes. As online recognition attracts more researchers, there have not been many studies on offline detection. A. Bhattacharya et al. [58] used morphological and binary mapping to detect electrical circuits. Although it can work on a smaller scale, binary mapping cannot detect curved or zig-zag lines. Julca-Aguilar and Hirata proposed a method using Faster R-CNN to detect candidates and evaluated its accuracy on OHFCD. The model can detect components in the diagram, including arrows; however, it cannot detect the arrowhead.


Chapter 3: Background

In this chapter, we provide the basic knowledge of the techniques used in our project's system. This knowledge is based on our surveys in Chapter 2 and will be used in Chapter 4. We will summarize three models in order: Section 3.1 covers Faster R-CNN, used in object detection; Section 3.2 introduces Mask R-CNN, a descendant of Faster R-CNN that adds an object segmentation task; finally, Section 3.3 shows Keypoint R-CNN, a variation of Mask R-CNN, which is important in this project.

3.1 Faster R-CNN

Introduced in Chapter 2, Faster R-CNN is an object detection model that extends Fast R-CNN. Faster R-CNN replaces the old looping method with a new sub-model called the Region Proposal Network (RPN). As a result, a Faster R-CNN model consists of three main components: the backbone network, the RPN, and the regression-classification layers. Because we do not use the Faster R-CNN model itself in this project, we will briefly discuss the first two components; the last one will be mentioned in Section 3.2.

3.1.1 Backbone network

The backbone Convolutional Neural Network is an important part of the algorithm. It plays the role of a feature extractor, which encodes the image input and returns a feature map. The better the backbone is, the better the result the model will achieve.

Figure 3.1 shows the pipeline of ResNet50, one example of a Residual Network [27]. The Residual Network inherits the idea of stacking many convolution layers to create a model. It consists of two main blocks: the convolution block and the identity block. These blocks have similar structures: they both have two paths, one going through several convolutional layers and one going through a shortcut path. The main path follows three steps:

1. Conv layer, kernel size 1*1, with Batch Normalization and ReLU.

2. Conv layer, kernel size k*k, with Batch Normalization and ReLU. In Figure 3.1, k = 3.

3. Conv layer, kernel size 1*1, with Batch Normalization.

For the shortcut path, while the convolution block contains one conv layer with Batch Normalization, the identity block simply uses the original input. The results of both paths are then added together, followed by a ReLU activation function. The number of layers counted for both blocks is three, since the conv layer in the shortcut path of the convolution block is redundant.


Figure 3.1: ResNet50 model, from [1]

Most ResNet models have the same structure; the only difference between them is the number of convolution and identity blocks. In general, the input is an image resized to 224 by 224 with three channels (RGB). It then goes through a conv layer with kernel size 7*7 with Batch Normalization and ReLU activation, then a 3x3 MaxPool. After that, it goes through a series of convolution and identity blocks before a global average pool with a softmax function returns the output. In Figure 3.1, we can see that the model contains four convolution blocks and twelve identity blocks. Hence, the total number of layers is (4 + 12) * 3 + 2 = 50.
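The layer count above can be checked with a one-line helper (a sketch; the function name is ours, not from the thesis):

```python
# Sketch: count ResNet layers as described above.  Each convolution or
# identity block contributes 3 conv layers on its main path; the initial
# 7*7 conv and the final FC layer add 2 more.
def resnet_layer_count(conv_blocks: int, identity_blocks: int) -> int:
    return 3 * (conv_blocks + identity_blocks) + 2

print(resnet_layer_count(4, 12))  # 50, hence the name ResNet50
```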

3.1.2 Region Proposal Network

The Region Proposal Network (RPN) is an addition in Faster R-CNN. It is a sub-model whose role is to generate proposals from input feature maps and use these proposals for regression and classification. The way this model solves the task can be divided into three main steps.

In the first step, the model receives the feature maps. If the model is in training, it also receives ground-truth boxes. It then runs a sliding window over the feature maps. For each position of the sliding window, the model creates a set of m*n anchors with m different aspect ratios and n different scales. In practice, m and n are equal to five and three, respectively. Each anchor A has four attributes (x_a, y_a, w_a, h_a): the x and y coordinates of the center, and the width and height from the center (half of the total width and height).
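The anchor generation described above can be sketched as follows (a minimal illustration; the `make_anchors` helper, the `base` stride, and the concrete ratio/scale values are our assumptions, not the thesis code):

```python
from itertools import product

def make_anchors(cx, cy, ratios, scales, base=16):
    """One anchor per (ratio, scale) pair, centred on the sliding-window
    position (cx, cy).  `base` is an assumed backbone stride; the aspect
    ratio r = w / h is applied while keeping the anchor area fixed."""
    anchors = []
    for r, s in product(ratios, scales):
        w = base * s * (r ** 0.5)
        h = base * s / (r ** 0.5)
        anchors.append((cx, cy, w, h))  # (x_a, y_a, w_a, h_a)
    return anchors

# m = 5 aspect ratios and n = 3 scales, as stated in the text above.
anchors = make_anchors(32, 32, ratios=[0.33, 0.5, 1.0, 2.0, 3.0],
                       scales=[8, 16, 32])
print(len(anchors))  # 15 anchors per sliding-window position
```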

The second step involves labeling anchors. For each anchor and ground-truth box, a value called Intersection over Union (IoU) is calculated, indicating the overlapping ratio with the ground-truth bounding box. The anchor is labeled positive or negative depending on this value:

label_box = 1 if k ≥ FOREGROUND_THRESHOLD; 0 if k ≤ BACKGROUND_THRESHOLD; −1 otherwise

where k is the Intersection over Union of the anchor and the ground-truth box:

k = (AnchorBox ∩ GroundTruthBox) / (AnchorBox ∪ GroundTruthBox)


Normally, FOREGROUND_THRESHOLD = 0.7 and BACKGROUND_THRESHOLD = 0.3. All anchors labeled −1 are ignored. Because we have a large number of ground-truth boxes, each anchor produces a vector of labels, where each label is the relationship between the anchor and the corresponding ground-truth box.
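The labeling rule above can be sketched in a few lines of plain Python (the helper names are ours; boxes are assumed to be in (x1, y1, x2, y2) corner format):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(k, fg=0.7, bg=0.3):
    """1 = foreground, 0 = background, -1 = ignored, per the rule above."""
    if k >= fg:
        return 1
    if k <= bg:
        return 0
    return -1

print(label_anchor(iou((0, 0, 10, 10), (0, 0, 10, 10))))  # 1
print(label_anchor(0.5))                                  # -1 (ignored)
```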

The last step of the sub-model is to generate proposals. Since each anchor contains its label and bounding box, the model tries to predict five parameters: the first four parameters come from the regression task, giving the box coordinates, and the last parameter comes from the classification task, given the label vector. In training, these parameters are used to calculate the loss function. As a result, the loss function of the RPN has two terms: a box regression loss and a label classification loss. For the classification loss, we can use the binary cross-entropy function between the predicted label and the anchor label:

L_rpn_cls = BCE(label_predicted, label_gt)

For the regression loss, the four parameters the model tries to predict are:

r_x = (x − x_a) / w_a
r_y = (y − y_a) / h_a
r_w = log(w / w_a)
r_h = log(h / h_a)

where (x, y, w, h) is the predicted box and (x_a, y_a, w_a, h_a) is the anchor. Similarly, from the anchor and the ground-truth box we also have:

r*_x = (x* − x_a) / w_a
r*_y = (y* − y_a) / h_a
r*_w = log(w* / w_a)
r*_h = log(h* / h_a)

where (x*, y*, w*, h*) is the ground-truth box. From these parameters, we can calculate the box regression loss for each anchor using the Smooth L1 loss:

L_rpn_reg = SL1((r_x, r_y, r_w, r_h), (r*_x, r*_y, r*_w, r*_h))
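A minimal sketch of the regression targets and Smooth L1 loss above (the function names and the `beta` parameter are our assumptions; boxes are in center/width/height form as in the equations):

```python
import math

def reg_targets(box, anchor):
    """(r_x, r_y, r_w, r_h) offsets of `box` relative to `anchor`, both
    given as (x, y, w, h), using the log-ratio form for width/height."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four offsets: quadratic for small
    differences (|d| < beta), linear otherwise."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

# A perfectly matching anchor produces all-zero targets and zero loss.
r = reg_targets((10, 10, 20, 20), (10, 10, 20, 20))
print(smooth_l1(r, (0.0, 0.0, 0.0, 0.0)))  # 0.0
```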

Therefore, the total loss of the RPN model is the sum of the regression and classification loss terms over every anchor, divided by a normalization term. In the original paper, the authors use a hyperparameter to balance the loss between regression and classification. Our implementation in Chapter 4 will discuss balancing these variables.

3.1.3 Non-Maximum Suppression

Non-Maximum Suppression (NMS) is an effective algorithm for filtering out similar predictions from detectors. This algorithm is used in both the training and inference phases of Faster R-CNN, although inference is where it matters most. The main idea of the algorithm is to remove redundant or overlapping boxes to reduce the computational cost in later steps, especially in inference, where the time constraint is a big issue. Figure 3.2 shows an example of its usage in a detection model. As we can see in the image on the left, both the cat and the dog have been labeled by many bounding boxes. Our aim is the picture on the right, where only one box covers each object. By doing so, we reduce the number of boxes from 9 to 2, a 4.5x improvement in computing speed for later steps.

Figure 3.2: Non-Maximum Suppression, from [2]

The following algorithm briefly demonstrates the idea. It receives two arrays, B and S, storing the bounding boxes and scores of each prediction, along with an Intersection over Union (IoU) threshold t. The result is two arrays storing the remaining bounding boxes and scores after applying the technique. The algorithm chooses the box with the highest score; for every other box, the IoU between the two is calculated, and if it is higher than t the box is removed. After each iteration, NMS simply chooses the next box with the highest remaining score and repeats until the array B is empty.

Algorithm 1: Non-Maximum Suppression
Input: B = [b1, b2, ..., bN] (list of boxes), S = [s1, s2, ..., sN] (list of scores), t (IoU threshold)
Output: D, S' (remaining boxes and scores)
1  D = []; S' = []
2  while B ≠ [] do
3      m = argmax(S)
4      move b_m from B to D and s_m from S to S'
5      for each b_i remaining in B do
6          if IoU(b_m, b_i) > t then
7              remove b_i from B and s_i from S
8  return D, S'

The time complexity of the algorithm is O(N) in the best case, where one large box covers all others, and O(N^2) in the worst case, where no box is removed, i.e. IoU < t for every pair of boxes. The average case matters little here because the behavior depends on the threshold t, which we choose beforehand: the higher t is, the harsher the complexity becomes, eventually approaching O(N^2). The memory complexity of the algorithm is O(N), as a new array is created.

3.2 Mask R-CNN

This section covers Mask R-CNN[40], an extension of Faster R-CNN in two-stage object detection. Introduced in 2017, Mask R-CNN quickly became the leading two-stage detector by introducing several new techniques while also solving a new task, object segmentation.

Figure 3.3 shows the basic structure of Mask R-CNN. Noticeable differences appear compared to Faster R-CNN: a new branch, called the object mask detector, is added alongside the usual regression and classification branches to predict object masks. An object mask is a layer storing information about the object's location in the picture. The difference between the object mask and the bounding box is that instead of storing the rectangular coordinates of the object, the mask stores a pixel-to-pixel location. As a result, it helps the computer understand the object's properties more easily, especially when multiple objects have similar bounding boxes overlapping each other. Each mask is a 28×28 binary matrix with each value representing a pixel; each pixel is predicted to either belong to the object (true, or 1) or to the background (false, or 0). After that, the binary mask is applied to the object by resizing its matrix to fit the image using either the bilinear or the nearest-neighbor algorithm.

Mask R-CNN also solved two disadvantages that its predecessor struggled with by adding and replacing certain techniques. The first is the inability to deal with small objects. In Faster R-CNN, small objects are often negligible in the big picture; however, in situations where small objects must be detected, Faster R-CNN performs poorly. Section 3.2.2 shows the solution to this problem, the Feature Pyramid Network. The other disadvantage is the lack of pixel-to-pixel alignment between inputs and outputs: RoIPooling does not deal effectively with floating-point locations and often rounds down the location before performing spatial quantization. Mask R-CNN replaces RoIPooling with RoIAlign to perform pooling at floating-point positions.

The following sections discuss the new techniques used in the Mask R-CNN model: the binary mask in section 3.2.1, the Feature Pyramid Network in section 3.2.2, and RoIAlign in section 3.2.3.

3.2.1 Binary Masks

The methodology Mask R-CNN uses to generate masks is similar to how Faster R-CNN creates and predicts bounding boxes. For each anchor A = (x_A, y_A, w_A, h_A) created by the sliding window, a 28-by-28 binary mask is created based on the bounding-box coordinates; each pixel of this mask is 1 if the pixel belongs to the object. In training, this mask is compared to the ground-truth mask using the binary cross-entropy loss, similarly to the RPN classification loss:

L_mask = BCE(mask_predicted, mask_gt)

For inference, the 28-by-28 mask is resized to fit the predicted bounding box: the width is stretched by a factor of w_A/28 and the height by h_A/28, while the center of the binary mask is placed at the center of the box. To reduce the computational cost, it is recommended to generate masks only after the bounding-box array has been filtered with NMS, so that we do not need to compute a mask for every anchor.
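The resize step can be sketched as a nearest-neighbor upscale of a binary mask to an arbitrary box size (an illustrative simplification; as noted above, a real pipeline may use bilinear interpolation followed by a 0.5 threshold instead):

```python
import numpy as np

def resize_mask_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a binary mask to (out_h, out_w)."""
    in_h, in_w = mask.shape
    # index of the source row/column for each output row/column
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows][:, cols]
```

Upscaling a 2×2 checkerboard to 4×4 with this function simply duplicates each pixel into a 2×2 block, which is exactly the blocky "spike" artifact discussed below.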


Figure 3.3: Mask R-CNN model, from [3]

Figure 3.4: Mask sample; the pink-colored pixels belong to the object

Figure 3.4 shows an example of a sample mask. Ignoring the label on the top left and the bounding box, each pixel colored pink is said to belong to the object (the parallelogram). We can see spikes in the mask at the edge of the object, indicating the misalignment of pixels during mask stretching. This could be corrected, but it is not worth solving here, as masks are not used extensively in this project; they mostly serve to create bounding boxes and visual effects.

3.2.2 Feature Pyramid Network

The Feature Pyramid Network, or FPN, is a technique used in Mask R-CNN that replaces the old backbone network with a new model. This model provides multiple feature maps which can be fed as inputs to the RPN to create proposals. When multiple inputs are needed, one easy solution is to downscale the image multiple times to create a pyramid; each layer image in this pyramid is then processed to produce a feature map separately. The disadvantage of this method, however, is the computational cost, as the feature map must be produced multiple times. The FPN model ensures that only one pass is needed to generate multiple feature maps. Since many feature maps are created, each at a different scale, it becomes easier to detect small objects.

Figure 3.5 shows the basic idea of the Feature Pyramid Network. It consists of two main components: the bottom-up path and the top-down path, each with a structure similar to a pyramid. In the bottom-up pathway, the input image goes through multiple downscaling steps; the output of each lower layer is the input of the next layer in the pyramid. In the original paper, the downscaling network is a convolutional network, which computes a hierarchy of feature maps


Figure 3.5: Feature Pyramid Network, from [4]

with a scaling factor (or stride) of 2. For example, assuming a 5-layer pyramid with the bottom layer having stride 1, the other layers in the bottom-up pathway have strides of 2, 4, 8, and 16, respectively. Each layer in the pyramid is used in the top-down pathway to make the final maps. To simplify the calculation in the top-down pathway, assume the pyramid in the bottom-up pathway has four layers, called C1 to C4 respectively.

In the top-down pathway, each layer of the pyramid is calculated from the top to the bottom to progressively create higher-resolution maps. Assuming that we discard C1 in the calculation due to its large memory consumption and call the layers in the top-down pathway, from top to bottom, P4 to P2, the method to generate these layers can be summarized as follows. First, the P4 layer is created from C4 by passing it through a 1×1 convolution to reduce its dimension. Each coarser-resolution layer P(i+1) is then upsampled by a scaling factor equal to the stride used in the bottom-up pathway. The upsampled feature map is merged by element-wise addition with the corresponding bottom-up layer Ci, which likewise goes through a 1×1 convolution. This process continues down to P2, since C1 is not used in the algorithm. These three layers in the top-down pathway are forwarded to the RPN as input maps.
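The top-down merge just described can be sketched with NumPy, replacing the 1×1 lateral convolutions with identity maps and using nearest-neighbor 2× upsampling; this only shows the flow of shapes and additions, not the real learned network:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(c2, c3, c4):
    """Merge bottom-up maps C2..C4 (each half the size of the previous)
    into P2..P4. The 1x1 lateral convolutions are replaced by identity
    for brevity, so all maps must share the same channel count."""
    p4 = c4                    # lateral(C4)
    p3 = c3 + upsample2x(p4)   # lateral(C3) + upsampled P4
    p2 = c2 + upsample2x(p3)   # lateral(C2) + upsampled P3
    return p2, p3, p4
```

Each P-layer keeps the spatial size of its C-layer while accumulating context from every coarser level, which is why the finest map P2 sees information from the whole pyramid.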

Another interesting characteristic of the Feature Pyramid Network is that it can be used along with other backbone networks, including ResNet, Inception, and VGG. We will discuss this combination in section 4.3.

3.2.3 Region of Interest Align

In Faster R-CNN, proposals from the Region Proposal Network are normalized through a technique called Region of Interest Pooling (RoIPooling). This layer originates from the idea of accelerating both the training and inference phases of the object detection model. Normally there are many proposals per frame, yet the same feature map input can be reused for all of them, improving processing speed by a large factor. From an array of proposals containing bounding boxes of different sizes, the method quickly produces a list of corresponding feature maps with a fixed size. Note that the dimension of this layer's output depends on neither the size of the input feature map nor the size of the proposals. It


Figure 3.6: RoIPooling in Faster R-CNN

is chosen specifically as a predefined parameter to control the number of sections we perform pooling on.

The algorithm takes two inputs: the feature map(s) transported from the backbone network and an (N, 5) matrix representing an array of RoIs (in this situation, we can specifically call these bounding boxes), where each RoI R = (idx, x, y, w, h) contains an index and the coordinates of a proposal. Some models require another parameter for the output size of the layer so that each feature map is calculated at a different scale. For each RoI, the layer transmutes the feature map into a fixed-size matrix (for example, 2×2). The scaling method consists of two steps:

• Divide the region into equal-size sections; the number of sections equals the output dimension. Each section coordinate is an integer.

• Perform MaxPooling in each section.

Figure 3.6 shows an example of RoIPooling in Faster R-CNN. The input feature map is a 4-by-4 matrix, and the RoI has center-format coordinates R = (1.5, 2.5, 1.5, 1.5). The output of RoIPooling is a 2×2 matrix. Following the steps, we first convert the RoI from center coordinates to top-left/bottom-right coordinates. The new coordinates of the RoI are:

x1 = x − w = 0
y1 = y − h = 1
x2 = x + w = 3
y2 = y + h = 4

Next, we divide the region (0, 1)–(3, 4) into a 2-by-2 grid, obtaining the four regions shown in the second image. Finally, we take the max value in each section to obtain the final matrix, which is the result of the RoIPooling layer.
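The worked example above can be reproduced with a small NumPy sketch of RoIPooling (an illustrative version with integer bin edges computed by floor division, not a library implementation):

```python
import numpy as np

def roi_pool(fmap, x1, y1, x2, y2, m):
    """RoIPooling: split [x1, x2) x [y1, y2) into an m x m grid with
    integer bin edges, then take the max of fmap in each bin."""
    out = np.zeros((m, m))
    # integer edges: the source of the round-off error discussed below
    xs = [x1 + (x2 - x1) * i // m for i in range(m + 1)]
    ys = [y1 + (y2 - y1) * j // m for j in range(m + 1)]
    for j in range(m):
        for i in range(m):
            out[j, i] = fmap[ys[j]:max(ys[j + 1], ys[j] + 1),
                             xs[i]:max(xs[i + 1], xs[i] + 1)].max()
    return out
```

Running it on a 4×4 map whose cell values are their row-major indices, with the RoI region (0, 1)–(3, 4) and a 2×2 output, picks the maximum index in each of the four unequal integer bins.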

From the implementation of RoIPooling, we can see the disadvantage of the algorithm. It lies in the first step of the scaling method: by dividing the proposal region at integer coordinates, we introduce round-off errors in each section. While this is not a big issue in Faster R-CNN, it creates a substantial amount of error when using FPN in particular and Mask R-CNN in general. Since the Feature Pyramid Network creates multiple feature maps at different scales to handle small-object detection, there are occasions where an object with a small bounding box is fed to the pooling layer; after division, many sections end up empty, breaking the core of the algorithm. To negate this problem, Mask R-CNN uses RoIAlign, an improvement over RoIPooling that removes the rounding issue. The scaling method in RoIAlign consists of two steps:

• Divide the region into equal-size sections; the number of sections equals the output dimension. Each section coordinate is a floating-point number.


Figure 3.7: RoIAlign layer used in Mask R-CNN

• Perform MaxPooling in each section. The value of each segmented feature is the multiplication of the value of the feature itself and the proportion of its area that falls inside the section.

Taking an example similar to the previous technique, figure 3.7 shows a 4-by-4 feature map with a 2×2 output. The coordinates of the RoI are R = (1.5, 2.5, 1.5, 1.5). Performing the first step as in RoIPooling, we obtain the RoI region (0, 1)–(3, 4). Dividing this region into two-by-two sections gives the result shown in the second image, with the red lines indicating the segmentation. Taking the bottom-right section as a sample, we can see how the feature value of each segment is recalculated before performing MaxPooling to obtain the output.
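The two steps just described can be sketched literally in NumPy (an illustrative version of the area-weighted variant described here; production RoIAlign implementations instead sample bilinearly at fixed points inside each bin):

```python
import numpy as np

def roi_align_max(fmap, x1, y1, x2, y2, m):
    """RoI pooling with floating-point bins: each feature cell
    overlapping a bin is weighted by the fraction of the cell's area
    inside the bin, then the max weighted value is kept."""
    out = np.zeros((m, m))
    bw, bh = (x2 - x1) / m, (y2 - y1) / m   # float bin size
    H, W = fmap.shape
    for j in range(m):
        for i in range(m):
            bx1, by1 = x1 + i * bw, y1 + j * bh
            bx2, by2 = bx1 + bw, by1 + bh
            best = 0.0
            for r in range(int(by1), min(H, int(np.ceil(by2)))):
                for c in range(int(bx1), min(W, int(np.ceil(bx2)))):
                    # overlap of cell [c, c+1) x [r, r+1) with the bin
                    ov = (min(c + 1, bx2) - max(c, bx1)) * \
                         (min(r + 1, by2) - max(r, by1))
                    best = max(best, fmap[r, c] * ov)
            out[j, i] = best
    return out
```

Unlike the integer-edge version, no bin can collapse to zero size here, which is exactly the property that makes the technique safe for the small FPN proposals discussed above.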

Considering the complexity of both RoI layers, we can see that both algorithms have a time complexity of O(n^2), where n is the input size. To show this, we first count the loops used. The algorithm uses two main loops: the first iterates over sections to perform MaxPooling. Denoting the output size m × m, this loop performs m · m iterations, giving O(m^2). The second loop runs over all segmented features in a section to find the max value. Given an RoI of size (2w, 2h) divided into m × m sections, the number of segmented features N in a section is bounded by:

For RoIPooling: ⌊2w/m⌋ · ⌊2h/m⌋ ≤ N ≤ ⌈2w/m⌉ · ⌈2h/m⌉

For RoIAlign: (2w/m) · (2h/m) ≤ N ≤ (2w/m + 1) · (2h/m + 1)

Since both w and h are proportional to the size of the feature map n, in both situations the inner loop is roughly O((n/m)^2) per section. Combining the two loops gives a final complexity of O(n^2). The cost of calculating the segmented feature values in RoIAlign resides in the second loop and is negligible compared to the overall complexity.

3.3 Keypoint R-CNN

Introduced as a variation of Mask R-CNN, Keypoint R-CNN is a tuned model used to solve the keypoint detection task in the Human Pose Estimation challenge. Because Keypoint R-CNN is a variation of Mask R-CNN, most Mask R-CNN features stay in this model. The main difference between the two lies in the third branch of the model: while Mask R-CNN uses a 28-by-28 binary mask to store the object mask for segmentation, Keypoint R-CNN uses a floating-point heatmap of size 56×56, with each value in the heatmap ranging from -1 to 1. In training, a heatmap is produced and correlated with the ground-truth keypoints. From the correlation, the loss function is calculated using cross-entropy:

L_kp = CE(kp_predicted, kp_gt)

To create the correct order of keypoints, inputs fed to the model have to follow a specific order. At inference, this order is used to generate the correlation map and find the top N keypoints. The keypoint selection technique must also ensure that selected keypoints are not too close to each other. The final keypoint map is stretched to fit the bounding-box size. Keypoint R-CNN is an important model here since the locations of keypoints can be used for other tasks, especially connecting objects, one of the main objectives of this project. However, the huge disadvantage of this model is its application to multi-class detection, as Keypoint R-CNN only performs well on a single class.


Chapter 4

Proposed method

This chapter presents our contribution and solution to the diagram recognition problem. It contains not only knowledge from chapter 3 about the known models, but also our improvements to these models and an additional method for solving the relationship issues. This chapter contains four main sections: section 4.1 explains the constraints and scope of the thesis; section 4.2 clarifies our objective and how the task can be completed; section 4.3 introduces our proposed model to solve the object detection problem; finally, the last section shows our proposed method to generate a full diagram from the output of the object detection model.

4.1 Scope of the thesis

This thesis is a contribution to solving the task of diagram digitization. In more detail, it serves as the backend of a server connected to user devices, mainly Android. The application allows the user to convert an image of a diagram into an online diagram that the user can modify, duplicate, and convert to other formats. Due to the limitations of the user device, the following constraints are considered:

• Quantity: Most Android phones have screens smaller than 5.5 inches. Using a phone screen and camera, the user may not be able to capture a complex diagram without shrinking the resolution. To keep the diagram easy to modify after conversion, the original diagram should not contain more than 15 symbols; otherwise editing becomes impractical.

• Quality: The diagram is drawn with ordinary tools (chalk, ballpoint pen, etc.). Since this thesis uses preprocessed rather than original images, it is vital to avoid capturing issues such as blurred images. Moreover, the image should be captured in the correct orientation, with at most a 10-degree rotation, to avoid shape and text misinterpretation.

• Time constraint: Since data travel between the user device and the server, the total inference time of the model should not exceed a certain threshold. A threshold of 8 seconds is selected as the maximum inference time.

Other constraints relate to the objectives of this thesis. Python, specifically PyTorch, is used in this model. It is still under consideration whether the diagram server will use Python or C++, as both have libraries that fully optimize the computational cost. The final system should recognize diagram objects such as “symbol”, “arrow”, and “text”. For each symbol and connector, the system returns a structure containing the object information. For text objects, while the result could be returned as a string of raw text, this thesis does not cover the OCR problem; instead, it returns the bounding box of the text. In the future, it is recommended to add OCR to the final model for text recognition. The system is also able to construct the full relationships between every predicted


object, including the relationships between symbols and connectors, texts and symbols, and texts and connectors. The output of this construction is a well-formed JSON file storing the object information and relationships. The JSON file should not contain duplicate or unused information. The last constraint of the thesis concerns the connectors: connectors can intersect but must not overlap each other, because it is impossible to distinguish similar objects when they share a part (intuitively, this resembles the two-flowers-one-stem situation).

So, how do we digitize diagrams? While it is viable to draw a diagram directly with tools such as draw.io or Visio, sketching by hand and then converting the drawing to a digital diagram is much faster. As mentioned in section 2.2, there are two ways to digitize hand-drawn diagrams: online and offline diagram recognition. Online diagram recognition has become popular in recent years due to its convenience in practice: the user only needs a pen and a tablet or phone to draw a diagram, and a background process constructs the diagram automatically in real time. However, in an offline meeting where most information is displayed on paper or a whiteboard/blackboard, the limitation of online drawing quickly appears: to record the events, the user most likely has to redraw a lot of information, consuming a considerable amount of time. For offline drawing, the user's role is only to capture the image, and the backend converts the image into a modifiable diagram, saving time and resources.

This thesis proposes a method to solve the following problem: given an image containing a diagram and text, create a model to convert this image into an output structure that can be used to build digital diagrams in an Android application.

To solve the defined problem, we need to consider the possible approaches and solutions. There are two types of diagrams, online and offline; since an online diagram can be converted into a static image, offline detection is preferred. The next step is to choose the type of object detection. Since there are two approaches, one-stage and two-stage detectors, it is important to select which branch the proposed model belongs to. The main difference between one-stage and two-stage models is inference time versus precision: while one-stage models perform inference extremely fast, some even in real time, two-stage models have the edge in precision. Because the system will be used as a background process, a two-stage detector is preferred, as there are other time-consuming issues such as sending images and data between the user device and the server, or building the diagram in the application. Therefore, a two-stage object detection model is selected for this project.

4.2.3 Preparing input

In this section, we walk through the steps taken to construct the final input for this project. The first problem to consider is finding a suitable dataset. From section 4.1, one constraint of this thesis is that the images should be preprocessed, due to the fact that not many training samples are available on the internet. While creating our own samples is feasible, biased errors would occur frequently; therefore, it is critical to rely on existing datasets. To the best of our knowledge, the following diagram datasets exist:

• FC database[59], with 672 flowchart diagrams drawn by 24 users from Czech Technical University, stored in InkML format.

• KONDATE dataset[60], from the Nakagawa lab at Tokyo University of Agriculture and Technology, containing 670 free-form handwritten Japanese documents by 67 writers.

• University of Nantes flowchart dataset[15], consisting of 419 drawings.

• Digital Ink Diagram Data, or DiDi[61], from Google, containing 58,655 flowchart diagrams.

Due to the popularity of online diagram recognition in the past, all of these datasets are stored in the online diagram format. Nevertheless, we choose the DiDi dataset due to its overwhelming number of entries, while the other datasets can serve as reference points in the future. The DiDi dataset has two components: images and drawings. The images in the dataset are created using GraphViz and stored in PNG format. The following algorithm shows the author's idea in
