Graduation Project in Information Technology, Da Nang University of Science and Technology (Bách khoa Đà Nẵng), 2016, in the field of artificial intelligence (Machine Learning)
THIS THESIS IS APPROVED BY:
Instructor: Le Thi My Hanh, Ph.D
Date:
Suggestions/Comments:

SUMMARY
Topic title: MathOCR: Solving math problems using machine learning
Student name: Bui Dang Quang Dung
Student ID: 103160193
Class: 16TCLC3
Mathematics is one of the most important fields of human knowledge; it is widely studied, developed, and applied in real life, and it helps solve many everyday problems. Everyone studies mathematics for many years. Many people love the subject, but many others have difficulty solving math problems. With the vigorous development of science and technology, Artificial Intelligence (AI) in particular has produced outstanding achievements and can now take over many human tasks. This thesis introduces an application that helps solve math problems by applying several machine learning algorithms.

DA NANG UNIVERSITY
UNIVERSITY OF SCIENCE AND TECHNOLOGY
FACULTY OF INFORMATION TECHNOLOGY

THE SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

GRADUATION PROJECT REQUIREMENTS
Student name: BUI DANG QUANG DUNG
Student ID: 103160193
Class: 16TCLC3
Faculty: Information Technology
Major: Information Technology
Topic title: MathOCR: Solving math problems using machine learning
Project topic: ☐ has signed intellectual property agreement for the final result
Initial figures and data: Data is collected from many sources
Content of the explanations and calculations: The content contains five parts:
- Machine Learning, Computer Vision, Natural Language Processing and their applications
- Models and details in MathOCR
- Introduction to the MathOCR application
- MathOCR experiments: the data preprocessing process, the data generation process for the models in MathOCR, and the experimental results of the models
- Conclusion
Drawings, charts:
Instructor name: Le Thi My Hanh, PhD, Information Technology Faculty, University of Danang - University of Science and Technology
Date of assignment: / /2020
Date of completion: / /2020
Da Nang, December 2020
Head of Division
Instructor

MathOCR: Solving math problems using machine learning

PREFACE
I would like to express my sincere thanks to Le Thi My Hanh, PhD, for giving me many ideas, solutions, and knowledge to complete this project.
I am also grateful to the teachers and students of the Faculty of Information Technology, University of Danang - University of Science and Technology, for helping me over the past four years of study and for passing on the knowledge and valuable experience I needed to carry out this project.
Finally, I would like to express my special thanks to my family, who supported and motivated me, both financially and spiritually, throughout this project.
Although I tried my best on this project, mistakes and omissions are impossible to avoid. I hope that I can receive valuable comments and recommendations from the
teachers to complete my thesis.
Da Nang, December 11th 2020
Student
Bui Dang Quang Dung

ASSURANCE
I understand the University's policy on anti-plagiarism and guarantee that:
- The contents of this thesis project were carried out by myself under the guidance of Le Thi My Hanh, PhD.
- All references used in this thesis are cited clearly and faithfully, with the author's name, the project's name, and the time and place of publication.
- This project's contents are my own work and have not been copied from other sources or previously submitted for award or assessment.
Student Performed
Bui Dang Quang Dung

TABLE OF CONTENT
SUMMARY
GRADUATION PROJECT REQUIREMENTS
PREFACE
ASSURANCE
LIST OF PICTURE
LIST OF TABLE
LIST OF ACRONYM
INTRODUCTION
  Reason for doing thesis
  Scope and Objective
  Overview
CHAPTER 1: MACHINE LEARNING, COMPUTER VISION, NATURAL LANGUAGE AND THEIR APPLICATION
  1.1 Introduction
  1.2 Machine Learning
    1.2.1 What is Machine Learning
    1.2.2 Supervised Learning
    1.2.3 Unsupervised Learning
    1.2.4 Reinforcement Learning
  1.3 Computer Vision
    1.3.1 What is Computer Vision
    1.3.2 Computer Vision tasks
    1.3.3 Applications of Computer Vision
  1.4 Natural Language Processing
    1.4.1 What is Natural Language Processing
    1.4.2 Natural Language Processing tasks and their application
CHAPTER 2: MODELS, DETAILS IN MATHOCR
  2.1 Introduction
  2.2 The Vietnamese recognition model with the Transformer
    2.2.1 Introduction
    2.2.2 Backbone
    2.2.3 Encoder
    2.2.4 Decoder
    2.2.5 Multi-Head Attention
    2.2.6 Position-wise Feed-Forward Networks
    2.2.7 Positional Encoding
  2.3 The Image to Latex model
    2.3.1 Introduction
    2.3.2 Model Architecture
    2.3.3 Encoder
      2.3.3.1 Convolution
      2.3.3.2 Positional Encoding
    2.3.4 Decoder
      2.3.4.1 Token Embedding
      2.3.4.2 LSTM network
  2.4 The YOLOv4 Model
    2.4.1 Introduction
    2.4.2 YOLOv4 Architecture
      2.4.2.1 Backbone
      2.4.2.2 Neck
      2.4.2.3 Head
      2.4.2.4 Bag Of Freebies
      2.4.2.5 Bag Of Specials
  2.5 Metric Evaluation
    2.5.1 BLEU
    2.5.2 mAP
      2.5.2.1 Precision and Recall
      2.5.2.2 IoU
      2.5.2.3 AP
CHAPTER 3: INTRODUCE MATHOCR APPLICATION
  3.1 Introduction
  3.2 Front-end
  3.3 Server
  3.4 Features Specification
    3.3.1 Document Scanner
    3.3.2 Math Formula Recognition
    3.3.3 Vietnamese Text Recognition
    3.3.4 Solving Math equations
CHAPTER 4: MATHOCR EXPERIMENTS
  4.1 Introduction
  4.2 Data Preprocessing Process
  4.3 VietnameseOCR Model Experiments
    4.3.1 Data Source
    4.3.2 Training parameters
    4.3.3 Experimental Results
  4.4 Im2LaTeX Model
    4.4.1 Data Source
    4.4.2 Training parameters
    4.4.3 Experimental Results
  4.5 YOLOv4 Model
    4.5.1 Data Source
    4.5.2 Training parameters
    4.5.3 Experiment result
CHAPTER 5: CONCLUSION
  5.1 Achieved results
  5.2 Limitations
  5.3 Development
REFERENCES

LIST OF PICTURE
Figure 1.1 Supervised Learning
Figure 1.2 Unsupervised Learning
Figure 1.3 Reinforcement Learning
Figure 1.4 Subfields of Computer Vision
Figure 2.1 The Transformer Architecture
Figure 2.2 (left) Scaled Dot-Product Attention, (right) Multi-Head Attention consists of several attention layers running in parallel
Figure 2.3 Object detector
Figure 2.4 DenseNet CSP
Figure 2.5 Modified PAN
Figure 2.6 Modified SAM
Figure 2.7 IoU
Figure 2.8 The result calculated Precision and Recall
Figure 2.9 The results after being smoothed
Figure 2.10 Calculate max of precision at each level
Figure 2.11 Normalize by VOC format
Figure 3.1 The React Native Framework
Figure 3.2 Python, Flask and Pytorch
Figure 3.3 Document Scanner Screen
Figure 3.4 Math Formula Recognition Screen
Figure 3.5
Vietnamese Text Recognition Screen
Figure 3.6 Solving Math Equations Screen
Figure 4.1 Data preprocessing process
Figure 4.2 Data normalization
Figure 4.3 Requests and Beautiful Soup libraries
Figure 4.4 Process of generating data from existing text
Figure 4.5 Process of generating data from PDF files
Figure 4.6 Results of Vietnamese text recognition
Figure 4.7 Good prediction results
Figure 4.7 Unexpected prediction results
Figure 4.8 The statistics bounding box of objects
Figure 4.9 Experiment result
Figure 4.10 Good prediction results
Figure 4.11 The predicted results are average
Figure 4.12 Unexpected prediction results

3.3.4 Solving Math equations
Figure 3.6 Solving Math Equations Screen
The application can solve mathematical equations. This function is combined with the math formula recognition function. The user takes a picture containing the equation to be solved. The math formula recognition function then recognizes it and produces the corresponding LaTeX string. The system then validates and solves that equation.
The system supports several types of problems:
- Integrals
- Derivatives
- Function graphs
- Solving equations
Features of the function:
- Pros: Solves common problems efficiently.
- Cons: The result depends on the math formula recognition function, and complicated equations may be impossible to solve or may take a long time to compute.
Student: Bui Dang Quang Dung Instructor: Le Thi My Hanh Ph.D

CHAPTER 4: MATHOCR EXPERIMENTS
4.1 Introduction
This chapter describes how the machine learning models were built, including the process of creating and processing data for training, testing, and evaluation, and presents the results of each model. It also touches on the React Native framework used to build the mobile app and on the effectiveness of the application's functions.

4.2 Data Preprocessing Process
Data preprocessing is one of the essential steps in building a useful machine learning model. Most datasets used in machine learning problems need to be processed, cleaned, and transformed before an algorithm can be trained on them. This removes noise, enriches the data, and improves the model's predictive quality. Building a sound data processing pipeline is therefore important. For this thesis, I propose the following data preprocessing process:
Figure 4.1 Data preprocessing process
The data preprocessing process consists of the following main parts: reading data, data augmentation, data normalization, data shuffling, and data preparation for training.
Reading data: The data to be read are mainly images. When training with hundreds of thousands of images, as with the VietnameseOCR and Im2LaTeX models, faulty images are likely to occur and must be handled. At the same time, during training, data is read from storage (disk) into main memory (RAM) and then converted into a format that CUDA (NVIDIA GPU) can use for computation. A multi-threaded data reading mechanism is needed to minimize data reading time and speed up training.
Data augmentation: Data augmentation is used to increase the amount and variety of data by adding slightly modified copies of existing data or synthetic data newly created from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model. Some data augmentation techniques:
- Radial Distortion
- Perspective Transform
- Simple Blur
- Motion Blur
- Drop Shadows
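The two blur-based techniques in the list above can be sketched in a few lines. The following is a minimal, dependency-free illustration in Python, not the implementation used in this thesis. It assumes a grayscale image stored as a list of rows of pixel values in 0-255. The function names `box_blur`, `motion_blur`, and `augment`, the blur sizes, and the 0.5 probability are hypothetical choices made only for this example; in practice an image library would normally do this work.

```python
import random


def box_blur(image, radius=1):
    """Simple blur: replace each pixel by the mean of its neighbourhood."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the (2 * radius + 1) square window at the image borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


def motion_blur(image, length=3):
    """Horizontal motion blur: average each pixel with the pixels to its left."""
    blurred = []
    for row in image:
        out = []
        for x in range(len(row)):
            # Window of at most `length` pixels ending at position x.
            window = row[max(0, x - length + 1): x + 1]
            out.append(sum(window) / len(window))
        blurred.append(out)
    return blurred


def augment(image, rng, probability=0.5):
    """Apply each augmentation independently with the given probability."""
    if rng.random() < probability:
        image = box_blur(image)
    if rng.random() < probability:
        image = motion_blur(image)
    return image


if __name__ == "__main__":
    sample = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
    print(augment(sample, random.Random(0)))
```

Applying each transform with some probability, rather than always, leaves part of the data unmodified, so the model still sees clean samples alongside the distorted ones.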
Normalization: This step standardizes the data. RGB images and ordinary grayscale images are both 8-bit images, with values in the range 0-255. Pixels with larger values dominate training, which can make the model's predictions wrong or reduce its accuracy. Normalization also speeds up training and reduces the risk of the optimizer settling in a poor local minimum. There are two ways to standardize data:
- Option 1: Scale the image values from 0-255 to 0-1 by dividing the image by 255. This method is simple and works relatively well. It should be chosen for grayscale data.
- Option 2 (suitable for RGB images): Normalize using the per-channel mean and standard deviation of the ImageNet dataset [16], which have the following values:
  mean = [0.485, 0.456, 0.406]
  std = [0.229, 0.224, 0.225]
  How to normalize: divide the image by 255, then normalize it according to the following formula:
  image = (image - mean) / std
Figure 4.2 Data normalization
Shuffle data: When the model is trained with the iterative optimization algorithm Gradient Descent, similar samples arranged next to each other cause the optimization to follow their shared characteristics. The gradients over that stretch of training are almost identical, so the updates differ very little. This loses the generality of the data and hurts the model's predictive ability. Shuffling the data avoids this.
Prepare data for training: This is also an important step. With advances in hardware, GPUs are becoming more powerful and can perform parallel computations much faster than a CPU. At the same time, their memory capacity has increased, making it possible to train on more images in one batch. Choosing a suitable batch size maximizes hardware usage. Batches whose size differs from the configured batch size, usually the last batch, must also be checked, because they can crash the program.

4.3 VietnameseOCR Model Experiments
4.3.1 Data Source
Data for training the model are gathered from a variety of sources, including:
- VietOCR data, with more than million self-generated images
- Handwriting OCR for Vietnamese Addresses, with over 4000 images
Because the model mainly focuses on recognizing Vietnamese text in documents, the training data must include this kind of data. It is collected and generated from two sources: existing text and text in PDF documents.
For existing text: This text can be scraped from any website on the internet, such as wiki.org and vnexpress.net. Two well-known Python libraries for scraping data from web pages are Requests and Beautiful Soup.
Figure 4.3 Requests and Beautiful Soup libraries
After scraping, the raw text is split into sentences, and the sentences themselves serve as the data labels. Each sentence is then rendered into an image containing that sentence. Documents usually use different fonts, different font sizes, italic or bold text, and various colours. Therefore, to increase the richness of the data, several properties are varied as follows:
- Colour: the primary colours are black, red, blue, and green
- Italic and bold text
- Text size: to 22
- Font: a random font among the fonts supporting Vietnamese, in which the probability of choosing Times New Roman is 30% and Arial is 20%
Figure 4.4 Process of generating data from existing text
For PDF files: PDF is a well-known and widely used document format. Searching for PDF files is very
easy; PDF files can be scraped from online learning sites in the same way as the text above. PDF is a great document format, but it is not easy to extract its content as text and crop it into images. The reason is that a PDF essentially just positions symbols such as letters, punctuation, and diacritics on the page, so it is not easy to find the relationship between sentences and their positions. To handle this, the PDF is converted into two new formats, HTML and image. In the HTML file, the sentences or lines are wrapped in tags, and each tag has CSS attributes such as the font and its coordinates. By matching these coordinates against the image rendered from the PDF file, it is easy to crop the text; the text inside the tag also serves as the label of the data.
Figure 4.5 Process of generating data from PDF files
Detailed amount of data for the model:
- For training: 204089 images
- For testing: 22778 images
- For validating: 21987 images

4.3.2 Training parameters
- Iteration: 250000 iterations
- Batch size: 32
- Learning rate: the learning rate takes values in $[10^{-5}, 10^{-3}]$
- Loss function: Cross-Entropy
- Optimizer: Adam

4.3.3 Experimental Results
The model was trained for 250,000 iterations, giving a BLEU-4 result of 96.9%.
Figure 4.6 Results of Vietnamese text recognition

4.4 Im2LaTeX Model
4.4.1 Data Source
For the mathematical formula recognition model, the main data used is IM2LATEX-100K, a large collection of rendered real-world mathematical expressions collected from published articles. This dataset provides a challenging test-bed for the image-to-markup task, which is based on reconstructing mathematical markup, originally written by scientists, from rendered images. A model
is trained to generate LaTeX markup with the goal of rendering to the exact source image. The original dataset consists of approximately one hundred thousand math formula sequences written in LaTeX and has already been divided into training, testing, and validation sets.
To complete the dataset, it is necessary to create images containing each of those LaTeX sequences. Like Vietnamese text, formulas written in LaTeX can appear in many fonts. Scientific papers typically use the LaTeX font; in MS Word, fonts from the STIX family are popular. Other fonts are also used, such as Asana Math, Neo Euler, Gyre Pagella, Gyre Termes, and Latin Modern. To create these images, the LaTeX code is injected into an HTML page and the browser is used to display the corresponding math formula. The HTML page is then converted to an image, which is cropped to retain only the area containing the math formula. The conversion process is shown in Figure 4.7. One of the libraries that helps render LaTeX strings into math formulas is MathJax. However, rendering from LaTeX to HTML and from HTML to an image takes a long time.
Figure 4.7 Process of generating mathematical image data
Detailed amount of data for the model:
- For training: 83885 images
- For testing: 9321 images
- For validating: 8371 images

4.4.2 Training parameters
- Iteration: 100000 iterations
- Batch size: 32
- Learning rate: the learning rate takes values in $[10^{-5}, 10^{-3}]$
- Loss function: Cross-Entropy
- Optimizer: Adam

4.4.3 Experimental Results
The model was trained for 100,000 iterations, giving a BLEU-4 result of 88.9%.
Examples of good predictions:
Figure 4.7 Good prediction results
Examples of unexpected predictions:
Figure 4.7 Unexpected prediction results

4.5 YOLOv4 Model
4.5.1 Data Source
The data used to train the model are images that contain math problems. The three main objects to be detected are Vietnamese text, mathematical formulas, and function graphs. No existing dataset is available for this task, so the raw data was labelled with the YOLO mark tool. The dataset contains 1278 images in total; the split ratios for training, testing, and validation are 0.7, 0.15, and 0.15.
Figure 4.8 The statistics bounding box of objects
Figure 4.8 shows the bounding box statistics for each object class: Vietnamese text, math formula, and function graph.

4.5.2 Training parameters
- Epoch: 150 epochs
- Batch size: 16
- Learning rate: the learning rate takes values in $[10^{-5}, 10^{-3}]$
- Loss function: the Complete-IoU loss

4.5.3 Experiment result
Figure 4.9 Experiment result
Figure 4.9 shows the results of testing on the validation dataset. The mAP@0.5 is about 96%, and the mAP@0.5:0.95 is about 77%.
Examples of good predictions:
Figure 4.10 Good prediction results
Examples of average predictions:
Figure 4.11 The predicted results are average
Examples of unexpected predictions:
Figure 4.12 Unexpected prediction results

CHAPTER 5: CONCLUSION
5.1 Achieved results:
- Learned and applied knowledge in the field of Machine Learning
- Explored the libraries and frameworks for building the application
Experimental results:
- Built machine learning models that recognize Vietnamese writing and math formulas and detect objects with high accuracy and speed
- Combined models to create functions that solve math problems
- Proposed data preprocessing processes and processes for generating data from sources on the Internet for the respective
models
- Built an application that solves math problems

5.2 Limitations:
- The application solves only a limited range of problems and cannot yet solve many more complex ones
- The machine learning models give unexpected results in some cases where the input images contain a lot of noise
- The application is only deployed locally and has not yet had the opportunity to scale up and demonstrate the system's quality

5.3 Development direction:
- Study and improve the YOLOv4 model so that it can detect curved text
- Improve the data preprocessing and data generation processes and increase data quality to improve the machine learning models
- Learn new techniques to help solve more math problems

REFERENCES
[1] Wikipedia. Machine Learning. https://en.wikipedia.org/wiki/Machine_learning
[2] IBM. Supervised Learning. https://www.ibm.com/cloud/learn/supervised-learning
[3] IBM. Unsupervised Learning. https://www.ibm.com/cloud/learn/unsupervised-learning
[4] DeepAI. Computer Vision. https://deepai.org/machine-learning-glossary-and-terms/computer-vision
[5] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv preprint arXiv:1706.03762, 2017
[6] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014
[7] Wikipedia. Long short-term memory. https://en.wikipedia.org/wiki/Long_short-term_memory
[8] Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv preprint arXiv:2004.10934, 2020
[9] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv preprint arXiv:1911.11929, 2019
[10] https://leimao.github.io/blog/BLEU-Score/
[11] Medium. mAP (mean Average Precision) for Object Detection. https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173
[12] Wikipedia. React Native. https://en.wikipedia.org/wiki/React_Native
[13] Wikipedia. Python (programming language). https://en.wikipedia.org/wiki/Python_(programming_language)
[14] Wikipedia. Flask (web framework). https://en.wikipedia.org/wiki/Flask_(web_framework)
[15] Wikipedia. PyTorch. https://en.wikipedia.org/wiki/PyTorch
[16] Wikipedia. ImageNet. https://en.wikipedia.org/wiki/ImageNet

... (VGG16, VGG19) [6]. In the Vietnamese word recognition model, the backbone is the VGG19 model. The model also needs to be customized with parameters to suit the task. The architecture VGG 19 finetune