The need for table extraction
With the trend of digital conversion, the amount of document images has increased exponentially. To automate the process of extracting information from those images, many methods have been proposed for different types of information arrangements. Besides text, tables are one of the most common ways to arrange information in documents. Their purpose is to group information related to a topic together, helping the reader compare and retrieve information faster. However, because of their complex and diverse styles, it is hard to parse tabular data from document images into a well-structured, machine-readable format.
Document types that contain tables as one of their main elements include invoices, financial reports, and forms. To understand these types of documents effectively, a table extraction tool for images is necessary. That is the motivation for this thesis.
The goal
The goal of this thesis is to create a pipeline consisting of deep learning models and image processing techniques to extract tables from an input document image.
There are four sub-goals, as follows:
• Build a comprehensive pipeline for table extraction in the document image (table extraction pipeline)
• Develop a model for detecting table regions in the document image (table detection)
• Develop a classification model to classify the detected table into a borderless table and a bordered table for extracting the table’s cells (table classification)
• Benchmark the models with public/private datasets.
• Captions and table names do not count as parts of a table (these elements are usually text placed very close to tables).
• The table types consist of bordered tables and borderless tables (defined below).
• Table sizes range from small (2 to 5 rows and/or columns, often seen in research papers) to large (many rows, occupying a large area of the image, around 80%).
• The cells, text, and lines may be colored rather than black and white.
• Input images must come from scanned documents or exported PDFs (as opposed to documents captured with mobile devices).
• Image quality must be acceptable (bad image quality includes blur, distortion, noise, etc.).
Convolution and cross-correlation in image processing
In image processing, convolution is the process of transforming an image by applying a kernel over each pixel and its local neighbors across the entire image. The kernel is a matrix of values whose size and values determine the transformation effect of the convolution process.
Mathematically, the convolution between an image and a kernel can be written as:

$$g(x, y) = \omega * f(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy)\, f(x - dx, y - dy)$$

and the cross-correlation between an image and a kernel is defined as:

$$g(x, y) = \omega \star f(x, y) = \sum_{dx=-a}^{a} \sum_{dy=-b}^{b} \omega(dx, dy)\, f(x + dx, y + dy)$$

where $g(x, y)$ is the filtered image, $f(x, y)$ is the original image, and $\omega$ is the filter kernel. Every element of the filter kernel is indexed by $-a \le dx \le a$ and $-b \le dy \le b$.
The difference between convolution and cross-correlation can be seen below:
Convolution: g[2,2] = (i·1) + (h·2) + (g·3) + (f·4) + (e·5) + (d·6) + (c·7) + (b·8) + (a·9)
Cross-correlation: g[2,2] = (a·1) + (b·2) + (c·3) + (d·4) + (e·5) + (f·6) + (g·7) + (h·8) + (i·9)
where the kernel values are a through i and the image patch values are 1 through 9.
A convolution can be seen as a cross-correlation with the kernel rotated by 180°.
With carefully hand-crafted kernels, blurring, sharpening, and edge detection are a few of the image processing operations that convolutions make possible.
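To make the distinction concrete, the short Python sketch below (assuming NumPy and SciPy are available) applies the same asymmetric 3×3 kernel to a small image with both operations; the image and kernel values are made up for illustration.

import numpy as np
from scipy import ndimage

# A small grayscale "image" and an asymmetric 3x3 kernel (hypothetical values).
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 2., 3.],
                   [4., 5., 6.],
                   [7., 8., 9.]])

# Cross-correlation: the kernel is slid over the image as-is.
corr = ndimage.correlate(image, kernel, mode='constant', cval=0.0)

# Convolution: equivalent to cross-correlation with the kernel rotated by 180 degrees.
conv = ndimage.convolve(image, kernel, mode='constant', cval=0.0)
flipped = np.flip(kernel)                      # rotate the kernel by 180 degrees
conv_check = ndimage.correlate(image, flipped, mode='constant', cval=0.0)

assert np.allclose(conv, conv_check)           # the two definitions agree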
Convolutional neural network (CNN)
Building blocks
The convolutional layer is the core building block of a CNN. The layer’s parameters consist of a set of learnable filters (or kernels), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input and producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when they detect some specific type of feature at some spatial position in the input.
Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as the output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. There are several non-linear functions to implement pooling, of which max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.
Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint, and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture. While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN unless a form of global pooling is used. The pooling layer commonly operates independently on every depth slice of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations:

$$f_{X,Y}(S) = \max_{a,b=0}^{1} S_{2X+a,\,2Y+b}$$
A visualization can be seen in figure 2.
Figure 2: Max pooling 2×2: an example of a pooling layer
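A minimal NumPy sketch of 2×2 max pooling with stride 2, assuming the feature map height and width are even:

import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map (H and W assumed even)."""
    h, w = x.shape
    # Group the map into non-overlapping 2x2 blocks and take the max of each block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
print(max_pool_2x2(feature_map))   # [[6 8]
                                   #  [3 4]]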
An activation layer usually follows a convolutional layer. The most common activation layer is ReLU. ReLU is the abbreviation of Rectified Linear Unit, which applies the non-saturating activation function f(x) = max(0, x). It effectively removes negative values from an activation map by setting them to zero. It introduces nonlinearities to the decision function and to the overall network without affecting the receptive fields of the convolution layers.
Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent f(x) = tanh(x) and the sigmoid function σ(x) = (1 + e^{-x})^{-1}. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.
Some commonly used activation functions are shown in figure 3.
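For reference, the three activation functions mentioned above can be evaluated directly; a small NumPy sketch:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # f(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # sigma(x) = (1 + e^-x)^-1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negative values are set to zero
print(np.tanh(x))     # saturates at -1 and 1
print(sigmoid(x))     # saturates at 0 and 1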
After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation: a matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).
The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output of the network and the true data labels (during supervised learning). Various loss functions can be used, depending on the specific task.
Hyperparameters
The kernel size is the number of pixels processed together. It is typically expressed as the kernel’s dimensions, e.g., 2×2 or 3×3.
Padding is the addition of (typically) 0-valued pixels on the borders of an image. This is done so that the border pixels are not undervalued (lost) in the output, as they would otherwise participate in only a single receptive field instance. The padding applied is typically one less than the corresponding kernel dimension; for example, a convolutional layer using 3×3 kernels would receive a 2-pixel pad, that is, 1 pixel on each side of the image.
The stride is the number of pixels that the analysis window moves on each iteration. A stride of 2 means that each kernel is offset by 2 pixels from its predecessor.
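Kernel size, per-side padding, and stride together determine the spatial size of the output feature map via the standard formula out = ⌊(W − K + 2P)/S⌋ + 1. The small Python helper below (a hypothetical name, for illustration only) makes this concrete:

def conv_output_size(width, kernel, padding, stride):
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1,
    where P is the padding applied on each side."""
    return (width - kernel + 2 * padding) // stride + 1

# A 224-pixel-wide input with a 3x3 kernel and 1 pixel of padding per side
# keeps its size at stride 1; stride 2 halves it.
print(conv_output_size(224, kernel=3, padding=1, stride=1))  # 224
print(conv_output_size(224, kernel=3, padding=1, stride=2))  # 112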
Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of the number of feature maps and the number of pixel positions is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.
The number of feature maps directly controls the capacity and depends on the number of available examples and task complexity.
Common filter sizes found in the literature vary greatly, and are usually chosen based on the data set.
The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.
Max pooling is typically used, often with a 2×2 dimension. This implies that the input is drastically downsampled, reducing processing cost.
Large input volumes may warrant 4×4 pooling in the lower layers. Greater pooling reduces the dimension of the signal and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.
Dilation involves ignoring pixels within a kernel. This reduces processing and memory costs, potentially without significant signal loss. A dilation of 2 on a 3×3 kernel expands the kernel to 5×5 while still processing 9 (evenly spaced) pixels. Accordingly, a dilation of 4 expands the kernel to 9×9.
Figure 4: Example of Dilated Convolution
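A minimal PyTorch sketch of the effect of dilation, assuming torch is available: with dilation 2, the 9 weights of a 3×3 kernel are spread over a 5×5 window, so the output shrinks as if a 5×5 kernel had been applied.

import torch
import torch.nn as nn

conv_plain   = nn.Conv2d(1, 1, kernel_size=3)              # effective window 3x3
conv_dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # effective window 5x5

x = torch.randn(1, 1, 32, 32)
print(conv_plain(x).shape)    # torch.Size([1, 1, 30, 30])
print(conv_dilated(x).shape)  # torch.Size([1, 1, 28, 28])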
Regularization methods
Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout. At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability 1 − p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights.
In the training stages, p is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.
At testing time after training has finished, we would ideally like to find a sample average of all possible 2^n dropped-out networks; unfortunately this is unfeasible for large values of n. However, we can find an approximation by using the full network with each node’s output weighted by a factor of p, so that the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates 2^n neural nets, and as such allows for model combination, at test time only a single network needs to be tested.
By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading them to learn more robust features that generalize better to new data.
A visualization is shown in figure 5.
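A minimal NumPy sketch of the dropout scheme described above (p is the keep probability; at test time the full network is used and outputs are scaled by p); the function names are hypothetical:

import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p=0.5):
    """Training pass: keep each node with probability p, drop it (zero it) otherwise."""
    mask = rng.random(activations.shape) < p
    return activations * mask

def dropout_test(activations, p=0.5):
    """Test pass: use the full network, scaling outputs by p so the expected
    value of each node matches the training stages."""
    return activations * p

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(h))   # some entries zeroed out at random
print(dropout_test(h))    # [0.5 1.  1.5 2. ]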
Because the degree to which a model overfits is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new examples. Data augmentation can range from simple image processing, such as changing the hue, scale, or rotation angle of an existing image, to more modern techniques such as mixup [37], where a new data point is created from multiple existing data points, as sketched below.
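As an illustration of mixup, the Python/NumPy sketch below (with hypothetical inputs and one-hot labels) forms a new training example as a convex combination of two existing ones:

import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create a new example as a convex combination of two existing ones (mixup);
    labels are assumed to be one-hot vectors."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

img_a, label_a = np.ones((32, 32, 3)), np.array([1.0, 0.0])
img_b, label_b = np.zeros((32, 32, 3)), np.array([0.0, 1.0])
mixed_img, mixed_label = mixup(img_a, label_a, img_b, label_b)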
An example of data augmentation can be seen in figure 6.
Figure 6: Example of data augmentation

Early stopping
One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.
Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".
A simple form of added regularizer is weight decay, which simply adds an additional error, proportional to the sum of weights (L1 norm) or squared magnitude (L2 norm) of the weight vector, to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant (the 'alpha' hyperparameter), thus increasing the penalty for large weight vectors.
L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs, this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.
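A minimal sketch of how an L2 penalty can be added to the objective (Python/NumPy; the weight shapes and the data-loss value are placeholders):

import numpy as np

def l2_penalty(weights, alpha=1e-4):
    """L2 regularization term added to the objective: alpha * sum of squared weights."""
    return alpha * sum(np.sum(w ** 2) for w in weights)

# Hypothetical weight matrices of a two-layer network.
weights = [np.random.randn(128, 64), np.random.randn(64, 10)]
data_loss = 0.42                        # placeholder value for the data term
total_loss = data_loss + l2_penalty(weights)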
Image segmentation
Thresholding
The simplest method of image segmentation is the thresholding method. This method uses a clip level (or threshold value) to turn a gray-scale image into a binary image.
The key to this method is selecting the threshold value (or values, when multiple levels are used). Several popular methods are used in industry, including the maximum entropy method, balanced histogram thresholding, Otsu’s method [22] (maximum variance), and k-means clustering.
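As an example, Otsu's method is available in OpenCV; the sketch below binarizes a gray-scale document image ('document.png' is a hypothetical path):

import cv2

# Read the image as grayscale ('document.png' is a hypothetical path).
gray = cv2.imread('document.png', cv2.IMREAD_GRAYSCALE)

# Otsu's method picks the threshold that maximizes between-class variance;
# the chosen clip level is returned alongside the binary image.
threshold_value, binary = cv2.threshold(
    gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(threshold_value)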
K-means clustering
The K-means algorithm is an iterative technique used to partition an image into K clusters. The basic algorithm is:
• Pick K cluster centers, either randomly or based on some heuristic method, for example K-means++
• Assign each pixel in the image to the cluster that minimizes the distance between the pixel and the cluster center
• Re-compute the cluster centers by averaging all of the pixels in the cluster
• Repeat steps 2 and 3 until convergence is attained (i.e., no pixels change clusters)
In this case, distance is the squared or absolute difference between a pixel and a cluster center. The difference is typically based on pixel color, intensity, texture, and location, or a weighted combination of these factors. K can be selected manually, randomly, or by a heuristic. This algorithm is guaranteed to converge, but it may not return the optimal solution. The quality of the solution depends on the initial set of clusters and the value of K.
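A minimal sketch of K-means color segmentation using scikit-learn, with a random image standing in for real input:

import numpy as np
from sklearn.cluster import KMeans

# image: an (H, W, 3) color image; a random array stands in for a real image here.
image = np.random.randint(0, 256, size=(64, 64, 3)).astype(np.float32)

# Treat each pixel's color as a 3-dimensional point and partition into K clusters.
K = 4
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with its cluster center to obtain the segmented image.
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)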
Trainable segmentation
Most of the aforementioned segmentation methods are based only on color information of pixels in the image Humans use much more knowledge when performing image segmentation, but implementing this knowledge would cost considerable human engineering and computational time, and would require a huge domain knowledge database which does not currently exist Trainable segmentation methods, such as neural network segmentation, overcome these issues by modeling the domain knowledge from a dataset of labeled pixels.
Object detection
Object detection is the task of detecting objects in an image or a video frame. With the rise and superior results of deep learning, all state-of-the-art object detection methods today are built with deep learning approaches. They can be categorized into two main types: one-stage methods and two-stage methods. One-stage methods prioritize inference speed, while two-stage methods prioritize detection accuracy.
Object detection - Metric
The main metric of object detection is mAP (mean average precision). To understand mAP, we first have to know about IOU, recall, precision, and average precision:
Intersection over union (IOU) measures how much the predicted region overlaps with the actual ground truth region, as shown in Figure 8.
$$\text{IoU} = \frac{\text{Area of overlap}}{\text{Area of union}}$$
Precision is defined as the ratio of the number of predicted regions that are actually tables (true positives) to the total number of predicted regions.
Recall is defined as the ratio of the number of predicted regions that are actually tables (true positives) to the total number of ground truth regions.
Figure 8: Different IOU values; red bounding boxes are ground truths while green ones are predictions.
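A small Python helper illustrating the IOU computation for axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are made up:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the overlap region.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    overlap = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

ground_truth = (10, 10, 110, 60)
prediction   = (30, 20, 130, 70)
print(iou(ground_truth, prediction))  # ~0.47: below a 0.5 threshold, so not a true positive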
Average precision (AP) is calculated as:

$$AP = \frac{1}{n} \sum_{k} \left(Recall[k] - Recall[k-1]\right) \cdot Precision[k] \quad (4)$$

where n is the number of IOU thresholds, and Recall[k], Precision[k] are the recall and precision at IOU threshold iouThreshold[k] (iouThreshold = [0.5, 0.55, 0.60, ..., 0.90, 0.95]).
To determine whether a predicted region is a table given a ground truth, we compute their intersection over union (IOU); if their IOU is greater than a threshold, the predicted region is counted as a true positive.
Mean average precision is calculated by taking the average AP over all classes:

$$mAP = \frac{1}{n} \sum_{k} AP[k] \quad (5)$$

where n is the number of classes and AP[k] is the average precision of class k.
Faster R-CNN
RPN (Region Proposal Network)
Faster R-CNN uses a sub-network called the RPN to extract the regions containing objects (RoI, Region of Interest), which differentiates it from its predecessors, R-CNN and Fast R-CNN.
R-CNN uses Selective Search as its region proposal extractor. The number of regions extracted is around 2000. The regions are then resized to the same size and passed through a pretrained CNN model, which localizes the offsets and predicts the object class. But 2000 regions is a large number, making the model run very slowly (figure 9).
Fast R-CNN improves this by using a pretrained CNN to extract feature maps and then running Selective Search on those feature maps instead of the original image, so the speed increases by a large margin. But because of Selective Search, the model inference time is still too long (around 2 s/image) (figure 10).
With Faster R-CNN, instead of using Selective Search, a sub-network is used to extract regions, making it even faster; the whole model is designed as an end-to-end trainable network.
The RPN uses one convolutional layer with 512 channels and kernel size (3, 3) on the feature map. It then splits into 2 branches: one for object classification and one for bounding box regression. Both of them use one convolutional layer with kernel size (1, 1) but with different numbers of output channels. The binary object classification branch has 2k output channels, where k is the number of anchors, to determine whether each anchor contains an object or is background. The bounding box regression branch has 4k output channels, where 4 represents the 4 offsets (x, y, w, h).
Because the input image size is not fixed, the output size of the RPN has the same property. For example, with an input image of size W×H×3 and a down-sampling factor of 16, the RPN classification and RPN bounding box outputs have sizes (W/16)×(H/16)×18 and (W/16)×(H/16)×36, respectively (with k = 9 anchors).
Figure 11: RPN in Faster-RCNN
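A minimal PyTorch sketch of the RPN head described above (k = 9 anchors, a 3×3 convolution with 512 channels followed by two 1×1 convolutions); the backbone output channels and image size are hypothetical:

import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head described above (k = 9 anchors per position)."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)   # object vs. background
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)   # (x, y, w, h) offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# A feature map from a hypothetical backbone with down-sampling 16:
# an 800x608 image gives a 50x38 map with 512 channels.
features = torch.randn(1, 512, 38, 50)
scores, offsets = RPNHead()(features)
print(scores.shape, offsets.shape)  # [1, 18, 38, 50] and [1, 36, 38, 50]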
Anchors
Anchors are pre-defined boxes that are known before training the model. In Faster R-CNN, 9 anchors are defined for every pixel in the feature map. The total number of anchors depends on the size of the feature map. For example, if the feature map produced by the backbone has size W×H×C (where C is the number of channels), then the total number of anchors will be W×H×9 (9 being the number of anchors per pixel).
Anchors come in different sizes and aspect ratios (figure 12).
Anchors are assigned as positive/negative (object/background) based on their IOU overlap with the ground truth bounding boxes, following these rules:
• The anchor with the highest IOU with a ground truth box will be positive.
• Anchors with IOU ≥ 0.7 will be positive.
• Anchors with IOU < 0.3 will be negative (background).