
GRADUATION THESIS

DEVELOPING A PIPELINE FOR TABLE EXTRACTION

IN DOCUMENT IMAGES

MAJOR: COMPUTER SCIENCE

Mr. Nguyen Nam Quan, Dr. Nguyen Tien Thinh

HO CHI MINH CITY, 02/2023


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

1. Thesis title:

Xây dựng hệ thống nhận dạng bảng trong ảnh tài liệu

Developing a pipeline for table extraction in the document image

2. Tasks (requirements on content and initial data):

Table extraction is one of the most critical components of document image processing; tables appear almost everywhere, in nearly every report and document. It is also a topic of great concern in today's digital transformation. Table recognition and comprehension require many techniques and extensive research.

It remains a massive challenge for both scientists and industrial applications.

This thesis focuses on studying the problem of table information extraction within a complete document image processing system.

There are four main tasks in this research:

- Build a comprehensive pipeline for table extraction in the document image.

- Develop a model for detecting table regions in the document image.

- Develop a classification model to classify the detected table into a borderless table or a bordered table for extracting the table's cells.

- Benchmark the models with public/private datasets.

3. Thesis assignment date: 20/12/2021

4. Thesis completion date: 20/12/2022

5. Supervisors and their guidance roles:

1) Tran Tuan Anh: guidance on development orientation, integration, testing, and evaluation

2) Nguyen Nam Quan: guidance on model and technology development, and data collection

3) Nguyen Tien Thinh: guidance on content, presentation layout, and research methodology


December 26, 2022

THESIS DEFENSE EVALUATION SHEET

(For the supervisor/reviewer)

1. Student full name: Lu Anh Khoa

2. Topic: Developing a pipeline for table extraction in the document image (Xây dựng hệ thống nhận dạng bảng trong ảnh tài liệu).

3. Supervisor/reviewer full name: Tran Tuan Anh

4. Overview of the written report:

Number of references: Computational software: Physical products:

5. Overview of the drawings:

- Number of drawings: A1 size: A2 size: Other sizes: - Number of hand-drawn drawings: Number of computer-drawn drawings:

6. Main strengths of the thesis:

- This thesis presents a pipeline for table extraction in the document image. The student also develops a model for detecting table regions in the document image and a classification model to classify the detected table into a borderless table or a bordered table for extracting the table's cells.

- The presented pipeline is also used in many applications in an industrial environment.

- The thesis has researched many different methods and approaches and made appropriate assessments and analyses.

- The thesis has also compiled some practical data and put it into the evaluation; the construction of diverse data is very necessary.

- The student has carried out good experiments, evaluations, and demos.

7. Main shortcomings of the thesis:

- The thesis is inclined towards application, so the direction of research development has not been made clear.

- The overview of related works has not been fully detailed, nor is the assessment comprehensive.

8. Recommendation: Approved for defense / Needs additional revision before defense / Not approved for defense

9. Three questions the student must answer before the committee:

- Make clear the future work, including research/pipeline development.

10. Overall evaluation (in words: excellent, good, average): Excellent. Score: 9.3/10

Signature (full name)

Tran Tuan Anh


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING

SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

December 28, 2022

THESIS DEFENSE EVALUATION SHEET

(For the reviewer)

1. Student full name: Lu Anh Khoa

Student ID: 1852112. Major: Computer Science

2. Topic: Developing a pipeline for table extraction in document images

3. Reviewer full name: Dr. Le Thanh Sach

4. Overview of the written report:

Number of references: Computational software: Physical products:

5. Overview of the drawings:

- Number of drawings: A1 size: A2 size: Other sizes: - Number of hand-drawn drawings: Number of computer-drawn drawings:

6. Main strengths of the thesis:

• The author has a strong background in deep learning and its applications for computer vision.

• The author has proposed a pipeline for table extraction in document images and focused on (implemented) two main tasks inside it: table detection and table classification.

• Table detection consists of two steps: detecting tables using a YOLOv7 model and post-processing the detected tables with morphology and connected-component analysis.

• Table classification: classifying the extracted tables into two classes, bordered vs. borderless, using a MobileNetV3 (Large) model.

• The proposed method can produce accurate results and can be compared with other table extraction methods.

7. Main shortcomings of the thesis:

• The thesis needs to be rewritten:

  • to add a survey of related works in the research fields (table detection and classification);

  • to add a detailed explanation of the proposed method and its results (quantitative and qualitative);

  • to add an evaluation of the processing time.

8. Recommendation: Approved for defense: yes / Needs additional revision before defense / Not approved for defense

9. Three questions the student must answer before the committee:

10. Overall evaluation (in words: excellent, good, average): Excellent. Score: 9/10

Signature (full name)

Le Thanh Sach


DECLARATION OF AUTHORSHIP

I hereby declare that this thesis was carried out by myself under the guidance and supervision of Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan; and that the work contained and the results in it are true by its author and have not violated research ethics. The data and figures presented in this thesis are for analysis, comments, and evaluations from various resources by my own work and have been fully acknowledged in the reference part.

In addition, other comments, reviews, and data used by other authors and organizations have been acknowledged and explicitly cited. I will take full responsibility for any fraud detected in my thesis.

HO CHI MINH CITY, Dec 2022

Author


I would like to acknowledge and give my warmest thanks to my advisors, Dr. Tran Tuan Anh, Dr. Nguyen Tien Thinh, and Mr. Nguyen Nam Quan, who made this work possible. They are the ones who laid the first bricks of my scientific career.

Besides, I would also like to acknowledge all of the instructors of Ho Chi Minh City University of Technology, who have given me motivation, encouragement, and precious knowledge during the long road of my university life.

Last but not least, I would like to thank my family, who are always there, supporting me throughout my life.


The "Developing a pipeline for table extraction in document images" research topic aims to develop a system to extract tabular regions from scanned/captured document images (invoices, reports, research papers, etc.) with high accuracy and a reasonable response time. This thesis proposes a pipeline consisting of several steps to detect, classify, and extract data from tabular regions.


2.2 Convolutional neural network (CNN) 8

2.15 YOLOv7 - training techniques 32

2.15.1 Label assignment: Simple Optimal Transport Assignment (SimOTA) 32

2.19.1 Depthwise Separable Convolutions 36

2.19.2 Inverted residual block 37

2.19.3 Squeeze and excite (SE) 37

3 Related works 38
3.1 Traditional methods 38

3.2 Convolutional Neural Networks (CNN) 38

4 Proposed method 39
4.1 Pipeline 39


5.2 Training process 45

5.3 Quantitative results 45

5.3.1 Table detection - mAP 45

5.3.2 Table detection - Weighted Average F1 45

5.3.3 Table classification - Model comparisons 46

5.3.4 Speed 46

5.4 Qualitative results 47

6 Conclusion 48
6.1 Achievements 48

6.2 Limitation 48

6.3 Future works 49

List of Figures
1 A typical CNN 9

2 Maxpool 2x2: an example of pooling layer 9

3 Some activation function 10

4 Example of Dilated Convolution 11

5 Dropout visualization 12

6 Example of data augmentation 12

7 An example of image segmentation output 13

8 Different IOU with Red bounding boxes are ground truths while the Green ones are predictions 15

23 Hard label vs Smooth label 25

24 Cosine Annealing Learning rate 25

25 Groundtruth: green. Predict: black. Using L1 yields 9.07 for all 3 cases but their IOU are different by a large margin. IOU is also used to evaluate an object detection model, so using IOU as a loss is a logical improvement 26


36 An example of mosaic augmentation: 4 images are "merged" together to create a new sample. Red bounding boxes annotate labels 34

37 Mixup augmentation example 34

38 Depthwise Convolution, visualized 36

39 Left: Normal convolution Right: Depthwise separable convolution 37

40 Squeeze and excitation block 37

41 CascadeTabNet architecture from [23] 39

42 Overview 39

43 Before (left) and After (right). Notice that on the left the table does not have enough outer bordered lines 40

44 Examples of bordered tables 41

45 Examples of borderless tables 41

46 Original bordered table 43

47 After cell indexing, each cell has the form (start row, start col, end row, end col) 44

48 Excel result of figure 46. Note that Tesseract cannot read the text in some cells 44

49 Correct detection on publaynet dataset 47

50 Correct detection on fintabnet dataset 48

51 Wrong cases: partial detection (top), missed tables (bottom). Green boxes are ground truths while red boxes are predictions 49

List of Tables
1 Data statistic for table detection 45

2 Data statistic for table classification 45

3 Result on test set of each dataset, table detection 45

4 Comparison with ICDAR19 Competition on Table Detection and Recognition, track A2, with previous participants. Scores of other teams are taken directly from [11] 46

5 Comparisons between different classification models 46

6 Speed measurements 47


1.1 The need for table extraction

With the trend of digital conversion, the amount of document images has increased exponentially. To automate the process of extracting information from those images, many methods have been proposed for different types of information arrangements. Besides text, tables are one of the most used means of arranging information in documents. Their purpose is to group information related to a topic together to help the reader compare and retrieve information faster. However, because of their complex and diverse styles, it is hard to parse tabular data from document images into a well-structured, machine-readable format.

Document types that contain tables as one of the main elements include invoices, financial reports, and forms. To understand these types of documents effectively, a table extraction tool for images is necessary. That is the motivation for this thesis.

1.2 The goal

The goal of this thesis is to create a pipeline consisting of deep learning models and image processing techniques to extract tables from an input document image.

There are four sub-goals, as follows:

• Build a comprehensive pipeline for table extraction in the document image (table extraction pipeline)

• Develop a model for detecting table regions in the document image (table detection)

• Develop a classification model to classify the detected table into a borderless table or a bordered table for extracting the table's cells (table classification)

• Benchmark the models with public/private datasets.

Some constraints on tables:

• Captions and table names do not count as parts of a table (usually these elements are text and are very close to tables).

• The table types consist of bordered tables and borderless tables (defined below).

• Table size ranges from small (2 to 5 rows and/or columns, often seen in research papers) to large (many rows, occupying a large area of the image, around 80%).

• The cells, texts, and lines may contain colors rather than black and white.

For the input image:

• The input image must come from scanned documents or exported PDFs (as opposed to documents captured with mobile devices).

• Image quality must be acceptable (bad image quality includes blurred, distorted, or noisy images, etc.).

2.1 Convolution and cross-correlation in image processing

In image processing, convolution is the process of transforming an image by applying a kernel over each pixel and its local neighbors across the entire image. The kernel is a matrix of values whose size and values determine the transformation effect of the convolution process.

Mathematically, the convolution between an image and a kernel can be written as:

$$g(x, y) = \omega * f(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy)\, f(x + dx, y + dy)$$

where $g(x, y)$ is the filtered image, $f(x, y)$ is the original image, and $\omega$ is the filter kernel. Every element of the filter kernel is indexed by $-a \leq dx \leq a$ and $-b \leq dy \leq b$.

The difference between convolution and cross-correlation can be seen below:

$$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} * \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} [2, 2] = (i \cdot 1) + (h \cdot 2) + (g \cdot 3) + (f \cdot 4) + (e \cdot 5) + (d \cdot 6) + (c \cdot 7) + (b \cdot 8) + (a \cdot 9)$$

$$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \star \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} [2, 2] = (a \cdot 1) + (b \cdot 2) + (c \cdot 3) + (d \cdot 4) + (e \cdot 5) + (f \cdot 6) + (g \cdot 7) + (h \cdot 8) + (i \cdot 9)$$

A convolution can be seen as a cross-correlation with the kernel rotated by 180°.

With carefully hand-crafted kernels, blurring, sharpening, and edge detection are a few of the image processing effects that can be achieved with convolutions.
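As an illustration (not part of the thesis pipeline), the following is a minimal NumPy sketch of applying a hand-crafted kernel by convolution; the `convolve2d` helper and the sharpening kernel are assumptions chosen for the example.

```python
import numpy as np

def convolve2d(image, kernel):
    """2D convolution (kernel flipped 180 degrees, 'same' output size).

    Minimal sketch: single-channel float image, zero padding, stride 1.
    """
    k = np.flipud(np.fliplr(kernel))            # rotate the kernel by 180 degrees
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            region = padded[y:y + kh, x:x + kw]
            out[y, x] = np.sum(region * k)       # cross-correlation with the flipped kernel
    return out

# Example: a hand-crafted 3x3 sharpening kernel applied to a random "image".
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)
img = np.random.rand(8, 8)
print(convolve2d(img, sharpen).shape)  # (8, 8)
```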

2.2 Convolutional neural network (CNN)

Instead of relying on hand-crafted kernels to produce the desired output, the goal of a CNN is to learn the parameters of the kernels themselves.

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.

CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters. Therefore, on a scale of connectivity and complexity, CNNs are on the lower extreme.

Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage.

A typical CNN can be seen in figure 1.

2.2.1 Building blocks

Convolutional layer

The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.


Figure 1: A typical CNN

Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.

Pooling layer

Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling, of which max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.

Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint, and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture. While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN, unless a form of global pooling is used. The pooling layer commonly operates independently on every depth, or slice, of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:

$$f_{X,Y}(S) = \max_{a,b=0}^{1} S_{2X+a,\,2Y+b}$$

A visualization can be seen in figure 2.

Figure 2: Maxpool 2x2: an example of pooling layer
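For illustration, a minimal NumPy sketch of 2x2 max pooling with stride 2, matching the formula above; the `maxpool2x2` helper is an assumption for this example and ignores channels and odd-sized inputs.

```python
import numpy as np

def maxpool2x2(feature_map):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = feature_map.shape
    # Group pixels into non-overlapping 2x2 blocks and take the max of each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(maxpool2x2(x))
# [[ 5.  7.]
#  [13. 15.]]
```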

Activation layer

The activation layer usually lies after a convolutional layer. The most common activation layer is ReLU. ReLU is the abbreviation of REctified Linear Unit, and applies the non-saturating activation function $f(x) = \max(0, x)$. It effectively removes negative values from an activation map by setting them to zero. It introduces nonlinearities to the decision function and to the overall network without affecting the receptive fields of the convolution layers.


Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent $f(x) = \tanh(x)$ and the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.

Some commonly used activation functions are shown in figure 3.

Figure 3: Some activation function

Fully connected layer

After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).

Loss layer

The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output of the network and the true data labels (during supervised learning). Various loss functions can be used, depending on the specific task.

2.2.2 Hyperparameters

Kernel size

The kernel size is the number of pixels processed together. It is typically expressed as the kernel's dimensions, e.g., 2x2 or 3x3.

Padding is the addition of (typically) 0-valued pixels on the borders of an image. This is done so that the border pixels are not undervalued (lost) from the output, because they would ordinarily participate in only a single receptive field instance. The padding applied is typically one less than the corresponding kernel dimension; for example, a convolutional layer using 3x3 kernels would receive a 2-pixel pad, that is, 1 pixel on each side of the image.

The stride is the number of pixels that the analysis window moves on each iteration. A stride of 2 means that each kernel is offset by 2 pixels from its predecessor.
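The usual formula that ties input size, kernel size, padding, stride, and dilation to the output size can be sketched as below; the `conv_output_size` helper is an illustrative assumption, not code from the thesis.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (in_size + 2 * padding - effective_kernel) // stride + 1

# A 3x3 kernel with 1 pixel of padding on each side keeps a 224-pixel input unchanged.
print(conv_output_size(224, kernel=3, stride=1, padding=1))  # 224
# Stride 2 halves the spatial size.
print(conv_output_size(224, kernel=3, stride=2, padding=1))  # 112
```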

Number of filters

Since the feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature map values and pixel positions is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

The number of feature maps directly controls the capacity and depends on the number of available examples and task complexity.

Filter size

Common filter sizes found in the literature vary greatly and are usually chosen based on the data set. The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.

Pooling type and size

Max pooling is typically used, often with a 2x2 dimension. This implies that the input is drastically downsampled, reducing processing cost.

Large input volumes may warrant 4×4 pooling in the lower layers. Greater pooling reduces the dimension of the signal and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.

Dilation involves ignoring pixels within a kernel. This reduces processing/memory cost, potentially without significant signal loss. A dilation of 2 on a 3x3 kernel expands the kernel to 5x5 while still processing 9 (evenly spaced) pixels. Accordingly, a dilation of 4 expands the kernel to 9x9.

Figure 4: Example of Dilated Convolution

2.2.3 Regularization methods

Dropout

Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout. At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability 1 − p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights. In the training stages, p is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.

At testing time after training has finished, we would ideally like to find a sample average of all possible $2^n$ dropped-out networks; unfortunately this is unfeasible for large values of n. However, we can find an approximation by using the full network with each node's output weighted by a factor of p, so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates $2^n$ neural nets and as such allows for model combination, at test time only a single network needs to be tested.

By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading them to learn more robust features that better generalize to new data.

A visualization is shown in figure 5.

Figure 5: Dropout visualization
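A minimal sketch of the dropout behaviour just described (drop nodes with probability 1 − p during training, scale outputs by p at test time); the function name and the use of NumPy are assumptions for illustration, not the thesis's training code.

```python
import numpy as np

def dropout(activations, p_keep=0.5, training=True, rng=None):
    """Classic dropout: zero units with probability 1 - p_keep when training,
    and scale by p_keep at test time so expected outputs match."""
    rng = rng or np.random.default_rng()
    if training:
        mask = rng.random(activations.shape) < p_keep
        return activations * mask
    # Test time: use the full network, weight each output by p_keep.
    return activations * p_keep

x = np.ones(4)
print(dropout(x, p_keep=0.5, training=True))   # roughly half the units zeroed
print(dropout(x, p_keep=0.5, training=False))  # [0.5 0.5 0.5 0.5]
```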

Artificial data / Data augmentation

Because the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb the existing data to create new samples. Data augmentation ranges from simple image processing, such as changing the hue, scale, or rotation angle of an existing image, to more modern techniques like mixup [37], where a new data point is created from multiple existing data points. An example of data augmentation can be seen in figure 6.

Figure 6: Example of data augmentation

Early stopping

One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.

Number of parameters

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".

Weight decay

A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of the weights (L1 norm) or their squared magnitude (L2 norm), to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant (the 'alpha' hyperparameter), thus increasing the penalty for large weight vectors.

L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs, this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
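A small sketch of adding an L1/L2 weight-decay penalty to a data loss, with 'alpha' as the proportionality constant mentioned above; the helper name and example values are illustrative assumptions.

```python
import numpy as np

def loss_with_weight_decay(data_loss, weights, alpha=1e-4, norm="l2"):
    """Add a weight-decay penalty to a data loss.

    weights: list of parameter arrays.  L1 penalizes the sum of absolute values,
    L2 the sum of squared magnitudes.
    """
    if norm == "l1":
        penalty = sum(np.abs(w).sum() for w in weights)
    else:
        penalty = sum((w ** 2).sum() for w in weights)
    return data_loss + alpha * penalty

w1, w2 = np.random.randn(10, 5), np.random.randn(5, 2)
print(loss_with_weight_decay(0.37, [w1, w2], alpha=1e-4))
```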

2.3 Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects (sets of pixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. An example of image segmentation is shown in figure 7.

Figure 7: An example of image segmentation output

There are two classes of image segmentation techniques:

• Classical computer vision approaches

• AI-based techniques

2.3.3 Trainable segmentation

Most of the aforementioned segmentation methods are based only on the color information of pixels in the image. Humans use much more knowledge when performing image segmentation, but implementing this knowledge would cost considerable human engineering and computational time, and would require a huge domain knowledge database which does not currently exist. Trainable segmentation methods, such as neural network segmentation, overcome these issues by modeling the domain knowledge from a dataset of labeled pixels.

2.4 Object detection

Object detection is the task of detecting objects in an image or a video frame. With the rise and superior results of deep learning, all state-of-the-art object detection methods today are built with deep learning approaches. They can be categorized into two main types: one-stage methods and two-stage methods. One-stage methods prioritize inference speed while two-stage methods prioritize detection accuracy.

2.5 Object detection - Metrics

The main metric of object detection is mAP (mean average precision). To understand mAP, we first have to know about IOU, recall, precision, and average precision.

Intersection over union (IOU) measures how much the predicted region overlaps with the actual ground truth region, as shown in figure 8:

$$\text{IoU} = \frac{\text{Area of overlap region}}{\text{Area of union region}} \quad (1)$$

Precision is defined as the fraction of predicted regions that are actually tables:

$$P = \frac{\#\text{tables among the predicted regions}}{\#\text{predicted regions}} \quad (2)$$

Recall is defined as the fraction of ground-truth tables that are detected:

$$R = \frac{\#\text{tables among the predicted regions}}{\#\text{ground truths}} \quad (3)$$

Figure 8: Different IOUs; red bounding boxes are ground truths while the green ones are predictions

The average precision for a class is

$$AP = \frac{1}{n} \sum_{k} \left(\text{Recall}[k] - \text{Recall}[k-1]\right) \cdot \text{Precision}[k] \quad (4)$$

where n is the number of IOU thresholds and Recall[k], Precision[k] are the recall and precision at IOU threshold iouThreshold[k] (iouThreshold[] = [0.5, 0.55, 0.60, ..., 0.90, 0.95]).

To decide whether a predicted region matches a ground truth table, we compute their intersection over union (IOU); if the IOU is greater than the threshold, the predicted region is counted as a true positive.

Mean average precision is calculated by averaging the AP over all classes:

$$mAP = \frac{1}{N} \sum_{c=1}^{N} AP_c \quad (5)$$

where N is the number of classes.
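For illustration, a minimal sketch of the IoU computation that underlies these metrics, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the helper is an assumption for this example, not the thesis's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive when its IoU with a ground truth
# exceeds the chosen threshold (0.5, 0.55, ..., 0.95 in the schedule above).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```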

2.6.1 RPN (Region Proposal Network)

Faster-RCNN uses a sub-network called the RPN to extract the regions containing objects (RoI - Region of Interest), which differs from its predecessors, RCNN and Fast-RCNN.

RCNN uses Selective Search as its region proposal extractor. Around 2000 regions are extracted; the regions are then resized to the same size and passed through a pretrained CNN model, which localizes the offsets and the object class. But 2000 regions is a large number, making the model run very slowly (figure 9).

Fast-RCNN improves this by using a pretrained CNN to extract feature maps and then running Selective Search on those feature maps instead of the original image, so the speed increases by a large margin. But because of Selective Search, the model inference time is still too long (around 2 s/image) (figure 10).

With Faster-RCNN, instead of using Selective Search, a sub-network is used to extract regions, making it even faster, and the whole model is designed as an end-to-end trainable network.

The RPN uses one conv layer with 512 channels and kernel size (3, 3) on the feature map. It then splits into 2 branches: one for object classification and one for bounding box regression. Both use one conv layer with kernel size (1, 1) but with different output channels. The binary object classification branch has 2k output channels, where k is the number of anchors, to determine whether each anchor contains an object or background. The bounding box regression branch has 4k output channels, where 4 represents the 4 offsets (x, y, w, h).

Because the input image size is not fixed, the RPN output size has the same property. For example, with an input image of size WxHx3 and a downsampling factor of 16, the RPN classification and RPN bounding box outputs have 18 × (W/16) × (H/16) and 36 × (W/16) × (H/16) values, respectively.
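A minimal PyTorch-style sketch of such an RPN head (a 3x3 conv followed by 1x1 classification and regression branches); the class name, channel count, and anchor count k = 9 follow the description above, but the code itself is an illustrative assumption, not Faster-RCNN's reference implementation.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head sketch: 3x3 conv, then 1x1 branches with
    2k channels (object/background) and 4k channels (box offsets)."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls_branch = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)
        self.reg_branch = nn.Conv2d(512, 4 * num_anchors, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls_branch(x), self.reg_branch(x)

# For a W x H x 3 input downsampled by 16, the feature map is (W/16) x (H/16),
# so the outputs have 18 and 36 channels at every feature-map position.
head = RPNHead()
cls_out, reg_out = head(torch.randn(1, 512, 38, 50))
print(cls_out.shape, reg_out.shape)  # [1, 18, 38, 50] and [1, 36, 38, 50]
```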


Figure 9: RCNN architecture

Figure 10: Faster RCNN architecture

Figure 11: RPN in Faster-RCNN

2.6.2 Anchors

What are anchors?

Anchors are pre-defined boxes, known before training the model. In Faster-RCNN, 9 anchors are defined for every pixel in the feature map. The total number of anchors depends on the size of the feature map: for example, if the feature map after the backbone has size WxHxC (with C the number of channels of the feature map), then the total number of anchors is WxHx9 (9 being the number of anchors per pixel).

Anchors have different sizes and aspect ratios (figure 12).

Anchors are assigned as positive/negative (object/background) based on their overlap area, or IOU, with the ground truth bounding box, following these rules:


Figure 12: Anchors

• The anchor with the highest IOU with a ground truth box will be positive.

• Anchors with IOU ≥ 0.7 will be positive.

• Anchors with IOU < 0.3 will be negative (background).

• Anchors with 0.3 ≤ IOU < 0.7 will be neutral and are not considered in model training.

The RoIs after the RPN step contain overlapping regions, so a method called non-maximum suppression (NMS) is used to filter out those regions. The idea is simple:

• Let R be the set containing the RoIs after the RPN step and S the set of their confidence scores, respectively; also choose an overlap threshold N and an empty set D.

• Take the RoI with the highest confidence score, remove it from R, and insert it into D.

• Compare this RoI with every remaining RoI in R using IOU. If the IOU is greater than the overlap threshold N, remove that RoI from R.

• Repeat steps 2 and 3 until R is empty.

But NMS has its own weakness. For example, with N = 0.5, some RoIs with IOU = 0.51 whose confidence scores are very high can still be removed from R. Conversely, RoIs with IOU < 0.5 and low confidence scores are not removed from R, making the model appear worse.

Soft-NMS is proposed to solve this problem. Instead of removing RoIs that have a high overlap and a high confidence score, we decrease the confidence score based on the IOU:

$$s_i = \begin{cases} s_i, & \text{IOU}(M, b_i) < N \\ s_i \left(1 - \text{IOU}(M, b_i)\right), & \text{IOU}(M, b_i) \geq N \end{cases}$$
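For illustration, a minimal NumPy sketch of greedy NMS with an optional Soft-NMS linear score decay following the formula above; the function names and structure are assumptions for this example, not the thesis implementation.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an array of boxes."""
    ix1 = np.maximum(box[0], boxes[:, 0]); iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2]); iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area(box) + area(boxes) - inter
    return inter / np.maximum(union, 1e-9)

def nms(boxes, scores, overlap_threshold=0.5, soft=False):
    """Greedy NMS over RoIs, with an optional Soft-NMS score decay."""
    boxes = np.asarray(boxes, float)
    scores = np.asarray(scores, float).copy()
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        if rest.size == 0:
            break
        ious = box_iou(boxes[best], boxes[rest])
        if soft:
            # s_i <- s_i * (1 - IoU) when IoU >= N, otherwise keep s_i unchanged.
            scores[rest] = np.where(ious >= overlap_threshold,
                                    scores[rest] * (1.0 - ious), scores[rest])
            order = rest[np.argsort(scores[rest])[::-1]]
        else:
            order = rest[ious < overlap_threshold]
    return (keep, scores) if soft else keep
```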

2.6.3 RoI pooling

RoI pooling makes the output size of the feature map fixed. RoI pooling is a must, as the final layers of the model are 2 fully connected branches which require a fixed input size.

2.6.4 Detection model

After RoI pooling, we have output feature maps with a fixed size; they are flattened and passed through 2 fully connected layers (figure 13):

• Object classification with N+1 classes (N is the number of classes, +1 for the background)

• Bounding box regression to locate the RoI, with 4N outputs representing the 4 coordinates (x, y, w, h)


Figure 13: Detection model in Faster-RCNN

NMS is applied as in the RPN step above.

2.6.5 Loss function

Faster-RCNN loss consists of 4 parts:

• RPN classification (object or background)

• RPN regression (anchor - region proposal)

• Fast-RCNN classification (N+1 classes)

• Fast-RCNN bounding box regression (region proposal - ground truth)

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where

• i is the index of an anchor in a mini-batch and $p_i$ is the predicted probability that anchor i is an object.

• $L_{cls}$ is the binary cross entropy for the question "does the anchor contain an object?" in the RPN, and the multi-class cross entropy for Faster-RCNN.

• $L_{reg}$ is the bounding box regression loss using the Smooth L1 loss, which can be seen as a combination of the L1 and L2 losses:

$$\text{smooth}_{L1}(x) = \begin{cases} \dfrac{0.5\,x^2}{\alpha}, & |x| \leq \alpha \\ |x| - 0.5\,\alpha, & |x| > \alpha \end{cases}$$

The YOLO family of models has continued to evolve since its initial release.

• YOLOv2 [25] made a number of iterative improvements on top of YOLO, including BatchNorm, higher resolution, and anchor boxes.

• YOLOv3 [26] built upon previous models by adding an objectness score to bounding box prediction, added connections to the backbone network layers, and made predictions at three separate levels of granularity to improve performance on smaller objects.

• YOLOv4 [7] introduced improvements like improved feature aggregation, a "bag of freebies" (with augmentations), mish activation, and more.

• YOLOv5 [1] is the first model in the "YOLO family" not released with an accompanying paper, and it is under ongoing development. The Focus layer [?] introduced in this version evolved from the YOLOv3 structure. It helps to reduce the required CUDA memory, reduce parameters, and increase forward and backward propagation speed.

• YOLOv7 [32] is the successor of YOLOv4; it incorporates the techniques from YOLOv4 and YOLOv5 plus a "trainable bag of freebies", pushing the limits of object detection even further.

2.8 YOLO approach for object detection

The main idea of YOLO is to divide the image into an S x S grid. For each grid cell, there is a set of anchors, each of which predicts one object with the representation (x, y, width, height, class).

First, the image goes through a CNN to create an S × S feature map, called a grid. YOLO detects objects in each of the S × S cells. Each cell prediction contains B bounding boxes and probabilities for C classes. Each bounding box consists of 5 variables: center coordinates (x, y), width and height (w, h), and confidence. The confidence of a bounding box represents whether that bounding box contains any object. So for one cell, YOLO predicts a tensor with B × 5 + C elements, where B is the number of bounding boxes, 5 is the number of variables of a bounding box, and C is the number of classes. For an S × S feature map, the shape of the output tensor from YOLO is S × S × (B × 5 + C).
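A small sketch of how such an output tensor can be sliced per cell, assuming the YOLOv1-style defaults S = 7, B = 2, C = 20 (illustrative values, not the thesis's configuration); the array contents here are dummies.

```python
import numpy as np

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
prediction = np.random.rand(S, S, B * 5 + C)

cell = prediction[3, 4]                  # one grid cell
boxes = cell[:B * 5].reshape(B, 5)       # each row: x, y, w, h, confidence
class_probs = cell[B * 5:]               # C class probabilities for this cell
print(prediction.shape, boxes.shape, class_probs.shape)
# (7, 7, 30) (2, 5) (20,)
```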

Its architecture is shown in figure 14.

Figure 14: YOLOv1 architecture

Its loss function consists of several parts:

$$\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2
\end{aligned}$$

For every cell in the feature map and for every bounding box in the cell, if that cell contains an object then the loss is calculated; otherwise the loss is 0. The square root is used for the width and height of bounding boxes; the idea is that, for a small bounding box, the impact of a wrong regression is greater than for larger boxes.

$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2$$

For each of the B predicted bounding boxes of cell i, if bounding box j has the largest IOU with the ground truth bounding box, then $\mathbb{1}_{ij}^{obj} = 1$, otherwise 0. $\mathbb{1}_{ij}^{noobj}$ takes the opposite value: $\mathbb{1}_{ij}^{noobj} = 1 - \mathbb{1}_{ij}^{obj}$.

$C_i$ is the IOU of the predicted bounding box and the ground truth bounding box.

The number of no-object bounding boxes is large, so a hyper-parameter $\lambda_{noobj}$ is added to balance the two parts of the loss.
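A simplified sketch of these loss terms (coordinates, sizes with square roots, and object/no-object confidence; class terms omitted), assuming predictions and targets have already been matched to cells. The function and the λ values are illustrative assumptions, not the thesis's training code.

```python
import numpy as np

def yolo_v1_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified YOLOv1-style loss sketch.

    pred, target: (S, S, B, 5) arrays holding x, y, w, h, confidence per box;
    obj_mask: (S, S, B) boolean, True where box j of cell i is responsible for
    a ground-truth object (the indicator 1_ij^obj above).
    """
    noobj_mask = ~obj_mask
    xy_err = ((pred[..., 0:2] - target[..., 0:2]) ** 2).sum(-1)
    # Square roots of width/height so small boxes are penalized relatively more.
    wh_err = ((np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4])) ** 2).sum(-1)
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    return (lambda_coord * (xy_err[obj_mask].sum() + wh_err[obj_mask].sum())
            + conf_err[obj_mask].sum()
            + lambda_noobj * conf_err[noobj_mask].sum())

S, B = 7, 2
pred, target = np.random.rand(S, S, B, 5), np.random.rand(S, S, B, 5)
mask = np.zeros((S, S, B), dtype=bool); mask[3, 4, 0] = True
print(yolo_v1_loss(pred, target, mask))
```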

Anchor box. In YOLOv2, anchor boxes were used similarly to Faster-RCNN. The input image size was changed from 448 x 448 to 416 x 416 because the author wanted the final feature map size to be an odd number (with 448 x 448 the final feature map size would be 14 x 14). The idea is that images in the COCO dataset usually have an object at the center of the image, so having a center cell improves the chance that one of its anchor boxes detects the object. Using anchor boxes, the mAP of the model decreased but its recall increased, meaning that the model can detect more objects, but the quality of detection is worse.

In two-stage models (the R-CNN family), anchor boxes work well because the first stage also optimizes the anchor box positions, while YOLO does not have that stage, so good initial anchor boxes are very important for the model. YOLOv2 generates anchors through the k-means algorithm. Also, YOLOv2 predicts the displacements of the anchor boxes $t_x, t_y, t_w, t_h$ and an objectness score $t_o$, with $t_x, t_y$ limited to the interval [0, 1]. This limits the center coordinates x, y of the bounding box when applying


Figure 15: YOLOv2 architecture

transformations on $t_x, t_y$, which means $t_x, t_y$ in a grid cell will not push the center of the bounding boxes in that cell outside that cell.

Backbone. YOLOv3 uses a new backbone called Darknet-53. YOLOv1's backbone used 1x1 convolution (bottleneck) from the Inception network; YOLOv2 added BatchNorm; YOLOv3 applies skip-connections from ResNet, called a Residual Block (figure 16).

Neck. In previous versions, detecting small objects was always a weak spot. Although YOLOv2 used a skip connection from early layers to move information from a bigger feature map to a later, smaller feature map, it was not enough. YOLOv3 is an upgrade for this problem: it uses a Feature Pyramid Network (FPN) and detects objects at 3 different scales (figure 17).

Other changes

Classification prediction. Previous YOLO models used softmax for the classification output, but from YOLOv3 the classification output is changed to sigmoid. The sigmoid function is used because some objects in some datasets are classified into 2 classes (person and woman, for example).

Bounding box prediction. Keeping the idea of anchor boxes with k-means from YOLOv2, YOLOv3 makes its way of choosing bounding boxes clearer. In a grid cell of a feature map, YOLOv3 generates 9 anchor boxes (YOLOv2 used 5), with 3 anchor boxes belonging to each scale.

PAN (Path Aggregation Network) is a variation of FPN (Feature Pyramid Network). In FPN, a branch is created for information to flow from deep layers to shallow layers; PAN adds another branch to bring the information from shallow layers back to deep layers (figure 21).

SPP


Figure 16: YOLOv3 backbone

SPP (Spatial Pyramid Pooling) is a special block at the end of the backbone. It outputs 4 feature maps with the same H × W shape (the same shape as the backbone output), which are then concatenated together (figure 22).

Remove Grid sensitivity

YOLOv4 uses a new formula to calculate the bounding box position from the predictions $(t_x, t_y, t_w, t_h)$:

$$b_x = \sigma(t_x) \cdot 1.1 - 0.05 + c_x$$
$$b_y = \sigma(t_y) \cdot 1.1 - 0.05 + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
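A minimal sketch of decoding a prediction with this grid-sensitivity factor; the helper name and the example prior sizes are assumptions, and scale = 1.0 recovers the earlier YOLOv2/v3 decoding.

```python
import numpy as np

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, scale=1.1):
    """Decode raw predictions into a box centre and size (grid units)."""
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigmoid(tx) * scale - (scale - 1) / 2 + cx   # sigma(tx) * 1.1 - 0.05 + cx
    by = sigmoid(ty) * scale - (scale - 1) / 2 + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# Cell (3, 4) with a prior (anchor) of size 2.0 x 1.5 grid units.
print(decode_box(0.2, -0.1, 0.3, 0.0, cx=4, cy=3, pw=2.0, ph=1.5))
```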

Using multiple anchors for one ground truth bounding box

In YOLOv3, only the anchor with the highest IOU with the ground truth is chosen as a positive anchor. Anchors whose IOU with the ground truth is smaller than a threshold (0.5, for example) are considered negative anchors. The others are not included in the model's loss and are called neutral anchors.


Figure 17: YOLOv3 architecture

Figure 18: Darknet53 vs CSPDarknet53

But in YOLOv4, these neutral anchors are considered positive and participate in the loss calculation.

Label smoothing


Figure 19: CSPResBlock

Figure 20: Left: Sample image, Center: DropOut, Right: DropBlock.

Figure 21: PAN structure

Label Smoothing is a regularization technique that introduces noise for the labels. This accounts for the fact that datasets may have mistakes in them, so maximizing the likelihood of log p(y|x) directly can be harmful. Assume, for a small constant ϵ, that the training set label y is correct with probability 1 − ϵ and incorrect otherwise. Label Smoothing regularizes a model based on a softmax output by replacing the hard 0 and 1 classification targets with targets of $\frac{\epsilon}{k-1}$ and $1 - \epsilon$, respectively.
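A small sketch of building such a smoothed target vector; the helper name and ϵ = 0.1 are illustrative assumptions.

```python
import numpy as np

def smooth_labels(class_index, num_classes, epsilon=0.1):
    """Turn a hard label into a smoothed target: the true class gets
    1 - epsilon and every other class gets epsilon / (k - 1)."""
    target = np.full(num_classes, epsilon / (num_classes - 1))
    target[class_index] = 1.0 - epsilon
    return target

print(smooth_labels(2, num_classes=5, epsilon=0.1))
# [0.025 0.025 0.9   0.025 0.025]
```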
