
Research on enhancing the performance of mathematical expression detection in document images





MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

BUI HAI PHONG

ENHANCING PERFORMANCE OF MATHEMATICAL EXPRESSION DETECTION IN SCIENTIFIC DOCUMENT IMAGES

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION IN COMPUTER SCIENCE

SUPERVISORS:
1. Assoc. Prof. Hoang Manh Thang
2. Assoc. Prof. Le Thi Lan

Hanoi - 2021

DECLARATION OF AUTHORSHIP

I, Bui Hai Phong, declare that the thesis titled "Enhancing performance of mathematical expression detection in scientific document images" has been entirely composed by myself. I assure the following points:

- This work was done wholly or mainly while in candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
- The work has not been submitted for any other degree or qualification at Hanoi University of Science and Technology or any other institution.
- Appropriate acknowledgement has been given within this thesis where reference has been made to the published work of others.
- The thesis submitted is my own, except where work done in collaboration has been included; the collaborative contributions have been clearly indicated.

Hanoi, September 2021
PhD Student

SUPERVISORS:
1. Assoc. Prof. Hoang Manh Thang
2. Assoc. Prof. Le Thi Lan

ACKNOWLEDGEMENT

I decided to pursue a PhD in Computer Science at the MICA International Research Institute, Hanoi University of Science and Technology (HUST) in 2017. It has been one of the best decisions I could have made. HUST is a truly special place where I have accumulated immense knowledge. I would like to thank the Executive Board and all members of the MICA Research Institute, HUST, for their kind support during the PhD course.

I wish to express my deepest gratitude to my supervisors, Assoc. Prof. Hoang Manh Thang and Assoc. Prof. Le Thi Lan, for their continuous instruction, advice and support during the PhD course. The thesis could not have been completed without their specific direction.

I wish to thank all members of the Computer Vision Department, MICA Research Institute, HUST, for their frequent support during the PhD course. I wish to thank the Executive Board and all members of the School of Graduate Education, the School of Electronics and Telecommunications and the School of Information and Communication Technology, HUST, for their specific comments and suggestions on the thesis. I wish to thank all members of the Faculty of Information Technology, Hanoi Architectural University, for their support in the professional work during the completion of the PhD.

I wish to thank Professor Akiko Aizawa and the members of the Aizawa Laboratory, National Institute of Informatics, Tokyo, Japan, where I gained many scientific experiences during my PhD internship. I also wish to thank the anonymous reviewers for their valuable comments during the completion of the PhD.

I gratefully acknowledge the funding from the SAHEP HUST project number T2020SAHEP-008 and the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation 2019-2021.

I wish to express my sincere gratitude to my family and friends for their continuous support and encouragement in the completion of the PhD.

Hanoi, 2021
Ph.D. Student

ABSTRACT

Mathematical expressions (MEs) play an important role in scientific documents, and a huge number of scientific documents have been produced over the years. Therefore, the demand for document digitization for research and study purposes has continuously increased.
purposes has continuously increased Detection and recognition of MEs in documents are considered as essential steps for document digitization The detection of expressions aims to locate the position of expressions within documents Meanwhile, the recognition of MEs aims at converting expressions from image format to string In the documents, mathematical expressions are classified in two categories: isolated (displayed) and inline (embedded) expressions An isolated expression displays in a separate line, an inline expression is mixed with other components (texts) Mathematical expressions may consist of mathematical operators (e.g +, -, ì, ữ), functions (log sin, cos) and variables (i, j, r) Large expressions may consist of multiple text lines Meanwhile, small expressions may consist of one character The accuracy of the detection of isolated expressions has been gradually improved However, the detection of inline expressions is considered as a challenging task In practice, the detection and recognition of MEs in document images are closely related The accuracy of the detection allows to obtain accuracy of the recognition In contrast, the incorrect detection may cause errors in the recognition of MEs This thesis presents three main contributions in the detection and recognition of MEs in scientific document images: (1) First, a hybrid method of two stages has been proposed for the effective detection of MEs At first stage, the layout analysis of entire document images is introduced to improve the accuracy of text line and word segmentation At second stage, both isolated and inline MEs in document images are detected Both hand-crafted and deep learning features are extensively investigated and combined to improve the detection accuracy In the handcrafted feature extraction approach, the Fast Fourier Transform (FFT) is applied for text line images for the detection of isolated MEs The Gaussian parameters of projection profile are applied as the feature extraction for the detection of inline MEs After the feature extraction, various machine learning classifiers have been fine tuned for the detection In the deep learning approach, the CNNs (Alexnet and ResNet) have been optimized for the detection of MEs The fusion of handcrafted and deep learning features based on the prediction scores has been applied The merit of the method is that it can operate directly on the ME images without the employment of character recognition (2) Second, an end-to-end framework for mathematical expression detection in sciiii entific document images is proposed without using any Optical Character Recognition (OCR) or Document Analysis techniques as in conventional methods The distance transform is firstly applied for input document images in order to take advantages of the distinguished features of spatial layout of MEs Then, the transformed images are fed into the Faster Region with Convolutional Neural Network (Faster R-CNN) that has been optimized to improve the accuracy of the detection Specifically, the optimization and generation strategies of anchor boxes of the Region Proposal Network have been proposed to improve the accuracy of expression detection of various sizes The proposed methods for the detection of MEs have been tested on two public datasets (Marmot and GTDB) The obtained accuracies of isolated and inline expressions in the Marmot dataset are 92.09% and 85.90% while those in the GTDB dataset are 91.04% and 85.15%, respectively The performance comparison with conventional methods shows the 
This thesis presents three main contributions in the detection and recognition of MEs in scientific document images:

(1) First, a two-stage hybrid method has been proposed for the effective detection of MEs. In the first stage, layout analysis of the entire document image is introduced to improve the accuracy of text line and word segmentation. In the second stage, both isolated and inline MEs in document images are detected. Both handcrafted and deep learning features are extensively investigated and combined to improve the detection accuracy. In the handcrafted feature extraction approach, the Fast Fourier Transform (FFT) is applied to text line images for the detection of isolated MEs, and the Gaussian parameters of the projection profile are used as features for the detection of inline MEs. After the feature extraction, various machine learning classifiers have been fine-tuned for the detection. In the deep learning approach, CNNs (AlexNet and ResNet) have been optimized for the detection of MEs. The fusion of handcrafted and deep learning features based on the prediction scores has been applied (a brief sketch of this fusion is given after this list of contributions). The merit of the method is that it can operate directly on the ME images without employing character recognition.

(2) Second, an end-to-end framework for mathematical expression detection in scientific document images is proposed without using any Optical Character Recognition (OCR) or document analysis techniques, which conventional methods rely on. The distance transform is first applied to the input document images in order to take advantage of the distinctive spatial layout features of MEs. The transformed images are then fed into a Faster Region-based Convolutional Neural Network (Faster R-CNN) that has been optimized to improve the detection accuracy. Specifically, optimization and generation strategies for the anchor boxes of the Region Proposal Network have been proposed to improve the detection accuracy for expressions of various sizes. The proposed methods for the detection of MEs have been tested on two public datasets (Marmot and GTDB). The obtained accuracies for isolated and inline expressions are 92.09% and 85.90% on the Marmot dataset and 91.04% and 85.15% on the GTDB dataset, respectively. The performance comparison with conventional methods shows the effectiveness of the proposed method.

(3) Finally, the detection and recognition of MEs have been integrated into a system. The MEs in document images are detected and recognized, and the recognition results are represented in LaTeX. The application aims to allow end users to use the detection and recognition of MEs in document images conveniently.
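As referenced in contribution (1), the handcrafted and deep learning branches are combined at the level of their prediction scores. The following is a minimal sketch, not the thesis implementation: it assumes each branch outputs a probability that a candidate region (text line or word) contains an ME and combines the two probabilities with a weighted sum; the weight and threshold values are illustrative only.

```python
def fuse_scores(p_handcrafted, p_cnn, weight_cnn=0.6, threshold=0.5):
    """Late fusion of two ME-detection scores by a weighted sum.

    p_handcrafted, p_cnn: probabilities in [0, 1] produced by the
    handcrafted-feature classifier and the CNN for the same region.
    weight_cnn and threshold are illustrative values, not thesis settings.
    """
    fused = (1.0 - weight_cnn) * p_handcrafted + weight_cnn * p_cnn
    return fused, fused >= threshold

# Scores for three hypothetical candidate regions.
for ph, pc in zip([0.82, 0.30, 0.55], [0.91, 0.20, 0.72]):
    score, is_me = fuse_scores(ph, pc)
    print(f"fused={score:.2f} -> {'ME' if is_me else 'non-ME'}")
```

Other fusion operations (for example, the maximum or the product of the two scores) can be plugged in at the same point; Section 2.7.2 compares the detection performance obtained with different operations.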
CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
ABSTRACT
CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
  0.1 Motivation
  0.2 Hypotheses
  0.3 Objectives of the thesis
  0.4 Introduction of the ME detection and recognition
    0.4.1 Introduction of MEs
    0.4.2 Introduction of ME detection
    0.4.3 Introduction of ME recognition
  0.5 Contributions of this thesis
  0.6 Structure of this thesis
CHAPTER 1. LITERATURE REVIEW
  1.1 Document analysis
  1.2 ME detection methods in document images
    1.2.1 Rule-based detection
    1.2.2 Handcrafted feature extraction methods for the ME detection
    1.2.3 Deep neural network for ME detection
      1.2.3.1 Deep neural networks
      1.2.3.2 Deep neural network models for ME detection
  1.3 ME recognition
    1.3.1 Traditional approaches for ME recognition
    1.3.2 Neural network approaches for ME recognition
  1.4 Datasets and evaluation metrics
    1.4.1 Datasets
    1.4.2 Evaluation metrics
  1.5 Existing systems for ME recognition
  1.6 Summary of the chapter
CHAPTER 2. THE DETECTION OF MEs USING THE LATE FUSION OF HANDCRAFTED AND DEEP LEARNING FEATURES
  2.1 Overview of the proposed method
  2.2 Page segmentation
  2.3 Handcrafted feature extraction for ME detection
    2.3.1 Handcrafted feature extraction for isolated ME detection
    2.3.2 Handcrafted feature extraction for inline ME detection
  2.4 Deep learning method for ME detection
  2.5 Late fusion of handcrafted and deep learning features for ME detection
  2.6 Post-processing for ME detection
  2.7 Experimental results
    2.7.1 Performance evaluation of the detection of MEs using different machine learning algorithms
    2.7.2 Performance evaluation of the detection of MEs using the fusion of handcrafted and deep learning features with different operations
    2.7.3 Performance evaluation of the detection of isolated and inline MEs on different public datasets
    2.7.4 Evaluation of the impact of image resolution on the ME detection
    2.7.5 Evaluation of the impact of the post-processing
    2.7.6 Visualization of extracted features of images using the handcrafted and deep learning feature approaches
    2.7.7 Error analysis and discussion
    2.7.8 Measurement of execution time
  2.8 Summary of the chapter
CHAPTER 3. THE DETECTION OF MEs USING THE COMBINATION OF THE DISTANCE TRANSFORM AND FASTER R-CNN
  3.1 Overview of the proposed method for ME detection using the DT and the Faster R-CNN
  3.2 The detection of MEs using the DT and the Faster R-CNN
    3.2.1 Distance transform of document image
    3.2.2 ME detection using a Faster R-CNN
      3.2.2.1 Region proposal network
      3.2.2.2 Fully connected detection network
      3.2.2.3 Loss function of the training Faster R-CNN
  3.3 Experimental results
    3.3.1 Loss function of the training process of Faster R-CNN
    3.3.2 Evaluation of the impact of the DT and anchor box generation on the performance of the ME detection
    3.3.3 Comparison of Faster R-CNN models in ME detection
    3.3.4 Comparison of the proposed and state-of-the-art methods used in ME detection
    3.3.5 Performance comparison of the proposed method on cross datasets
    3.3.6 Illustration of feature extraction of the ResNet-50
    3.3.7 Error analysis and discussion
    3.3.8 Measurement of execution time
  3.4 Summary of the chapter
CHAPTER 4. THE DETECTION AND RECOGNITION OF MEs IN DOCUMENT IMAGES
  4.1 Overview of the proposed system for the detection and recognition of MEs
  4.2 ME recognition using the WAP network
    4.2.1 Watcher module of the WAP network
    4.2.2 Parser module of the WAP network
    4.2.3 Training the WAP network
  4.3 Experimental results
    4.3.1 Performance evaluation of the detection and recognition of MEs
    4.3.2 Error analysis and discussion
    4.3.3 Measurement of execution time
  4.4 Summary of the chapter
CONCLUSIONS
PUBLICATIONS
Bibliography

ABBREVIATIONS

CNN - Convolutional Neural Network
DT - Distance Transform
ExpRate - Expression Error Rate
FFT - Fast Fourier Transform
Faster R-CNN - Faster Region-based Convolutional Neural Network
GRU - Gated Recurrent Unit
HOG - Histogram of Oriented Gradients
HPP - Horizontal Projection Profile
IoU - Intersection over Union
kNN - k-Nearest Neighbour
LSTM - Long Short-Term Memory
Mask R-CNN - Mask Region with Convolutional Neural Network
ME - Mathematical Expression
OCR - Optical Character Recognition
ResNet - Residual Neural Network
RF - Random Forest
RNN - Recurrent Neural Network
ROIs - Regions of Interest
RPN - Region Proposal Network
SSD - Single Shot Detector
SVM - Support Vector Machine
t-SNE - t-Distributed Stochastic Neighbor Embedding
VPP - Vertical Projection Profile
WAP - Watcher Attend Parser Neural Network
WER - Word Error Rate
YOLO - You Only Look Once
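Among these abbreviations, IoU underlies the detection metrics used throughout the thesis (Section 1.4.2) and the detection accuracies quoted in the abstract and conclusions. As a reminder of what it measures, here is a minimal sketch of the IoU between a predicted box and a ground-truth box, assuming boxes given as (x1, y1, x2, y2) pixel coordinates; the values in the example are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)   # intersection rectangle
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A detected ME box compared with its ground-truth box (illustrative values).
print(round(iou((10, 20, 110, 60), (15, 22, 118, 58)), 2))  # ~0.8
```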
4.3.2 Error analysis and discussion

The main causes of errors in the recognition are:

(1) Much information of large MEs is lost when resizing images to match the WAP network input requirements. Therefore, some symbols are missed in the recognition of large MEs. Figures 4.13 and 4.14 show errors in the recognition of large MEs. In addition, to reduce the ambiguity of the recognition of MEs, additional context information is valuable for identifying symbols correctly. For instance, to reduce the wrong recognition of some variables (e.g., "i" and "j"), the context information of the variables can be extracted. In the context of equations, the symbol "=" can be extracted to reduce the ambiguity.

(2) The Marmot dataset is a small dataset for training the WAP network. Thus, some specific symbols were not seen during training, and such untrained symbols normally cause errors in the recognition. As shown in Figure 4.12, some symbols (in blue) are missed when recognizing large MEs. The recognition accuracy can be improved by training the network with larger datasets.

(3) Errors in the ME detection obviously cause errors in the recognition. Figure 4.16 illustrates the impact of the detection of MEs on the recognition result. Some characters are missed in the detection of MEs, so the recognition results are not correct. For instance, the ME y_n = -x_n is recognized as y = -x.

4.3.3 Measurement of execution time

The recognition system using the WAP network was implemented in Python and TensorFlow on a PC with 32 GB RAM, a Core i7 3.2 GHz processor and a GeForce GTX 1080 GPU with 11 GB of memory. The average execution time of the ME recognition is 1.2 seconds per page, and one page contains 23 expressions on average. The ME recognition using the WAP network obtains higher accuracy; however, it runs slower than Tesseract and Infty Reader, which apply traditional image processing and handcrafted features for the recognition. Therefore, in terms of execution speed, Tesseract and Infty Reader outperform the WAP network.

4.4 Summary of the chapter

This chapter has presented a system for the detection and recognition of expressions in document images. The detection of expressions is performed by the DT and the Faster R-CNN, while the recognition of the detected expressions is performed using the WAP network. The overall performance of the detection and recognition of MEs was evaluated on the Marmot dataset. The recognition accuracies for isolated and detected MEs are 51.77% and 45.50%, respectively. Compared with traditional ME recognition systems, the proposed system obtained better performance. The results have shown the promising applicability of the system. The main results in this chapter have been published in publication [C4].

CONCLUSION AND FUTURE WORKS

Conclusion

The detection and recognition of MEs in scientific document images are attractive and challenging tasks for researchers. Moreover, the results of the detection and recognition of MEs enable many practical applications for users, such as document digitization, mathematical information retrieval and text-to-speech conversion. This thesis has presented contributions in the field of ME detection and recognition in scientific document images. Concretely, three main approaches have been proposed in the thesis to improve the accuracy of ME detection and recognition in scientific document images:

(1) The fusion of handcrafted and deep learning features has been proposed for the detection of MEs. In particular, the accuracy of the detection of inline MEs has been significantly improved compared with traditional handcrafted feature extraction methods. The performance of the overall system was evaluated on two public datasets, Marmot and GTDB. Generic performance metrics based on IoU are applied to give a clear evaluation of the system. The performance evaluation has shown that the proposed fusion strategy is an efficient way to improve the accuracy of the detection of MEs in scientific document images without using OCR techniques.

(2) To further improve the detection accuracy of MEs, an end-to-end framework for ME detection has been proposed. In this framework, the DT and the Faster R-CNN are combined to detect MEs in document images. The distance transform, with various distance metrics including Euclidean, City Block and Chessboard, is applied to the document images in order to take advantage of the layout of mathematical expressions (a brief illustration of the distance transform is given after this list). Moreover, optimization and generation strategies for the anchor boxes of the RPN of the Faster R-CNN are proposed to improve the detection accuracy. The proposed system has been tested on two public datasets, Marmot and GTDB. Compared with conventional methods, the use of the DT with the Euclidean metric and the Faster R-CNN has shown higher detection accuracy while reducing the required human effort. The obtained results show that the DT significantly enhances the discriminative features of MEs and that the optimization of anchor boxes plays an important role in improving the accuracy of ME detection in document images.

(3) Finally, the detection and recognition of MEs have been integrated into an applicable system. The MEs in document images are detected and recognized simultaneously. In the system, MEs are detected using the DT and the Faster R-CNN; then the WAP network is applied to recognize the detected MEs. The recognition results are represented in LaTeX, which is a popular format for scientific documents. The application supports end users in using the detection and recognition of MEs conveniently. Compared with conventional methods for ME recognition, the WAP network recognizes MEs more accurately. Moreover, the network has shown efficient performance in the structural analysis of MEs with complex layouts.
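To illustrate the distance transform named in contribution (2), the sketch below computes the transform of a binarized page under the three metrics mentioned above using SciPy. It is a minimal illustration of the preprocessing idea under these assumptions, not the thesis implementation.

```python
import numpy as np
from scipy import ndimage

def distance_transform(binary_page, metric="euclidean"):
    """Distance transform of a binarized page.

    binary_page: 2-D array with text pixels set to True/1. Every background
    pixel receives its distance to the nearest text pixel under the metric.
    """
    non_text = ~binary_page.astype(bool)      # measure distances from text pixels
    if metric == "euclidean":
        return ndimage.distance_transform_edt(non_text)
    if metric == "cityblock":
        return ndimage.distance_transform_cdt(non_text, metric="taxicab")
    if metric == "chessboard":
        return ndimage.distance_transform_cdt(non_text, metric="chessboard")
    raise ValueError(f"unknown metric: {metric}")

# Toy page with a single text pixel in the centre.
page = np.zeros((5, 5), dtype=bool)
page[2, 2] = True
for m in ("euclidean", "cityblock", "chessboard"):
    print(m, distance_transform(page, m).max())   # ~2.83, 4, 2
```

In the proposed framework, the transformed images are the input to the Faster R-CNN detector, as stated in the abstract and covered in Section 3.2.1.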
Future works

The thesis has achieved competitive results in the detection and recognition of MEs in scientific document images. However, there is still room for improvement. In the future, the following directions can be pursued in order to improve the accuracy of expression detection and recognition:

(1) The performance of the ME detection can be improved by combining different deep neural networks. In particular, strategies for the detection of tiny symbols can be applied to further improve the detection accuracy of inline MEs. Moreover, the contextual information of MEs can be investigated to improve the detection accuracy.

(2) The ME detection and recognition can be further improved to obtain an applicable and robust system. In particular, the ME recognition can be investigated further. A larger number of mathematical notations can be added to train the recognition models. Moreover, recent advances in encoder-decoder network architectures can be applied to improve the recognition accuracy, and improved attention mechanisms can be applied to improve the accuracy of the ME recognition.

(3) The document images considered in the thesis are assumed to be non-skewed. Therefore, deskew algorithms will be investigated for the proposed approaches in the thesis. Deskew algorithms are considered as preprocessing steps for document images; after deskewing, the detection and recognition approaches can be applied to the document images (a sketch of one such deskew approach follows this list).
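As a pointer for direction (3): one common projection-profile-based approach (in the spirit of the skew-detection work cited as [92] and [94], not a method of this thesis) searches for the rotation angle that maximizes the variance of the horizontal projection profile and applies it as the correction. A minimal sketch under these assumptions:

```python
import numpy as np
from scipy import ndimage

def estimate_correction_angle(binary_page, angles=np.arange(-5.0, 5.5, 0.5)):
    """Return the rotation (degrees) that best straightens the text lines.

    binary_page: 2-D array with text pixels set to 1. The candidate rotation
    whose horizontal projection profile has the highest variance is chosen,
    because straight text lines produce sharp peaks in the profile.
    """
    scores = []
    for angle in angles:
        rotated = ndimage.rotate(binary_page.astype(float), angle,
                                 reshape=False, order=1)
        scores.append(np.var(rotated.sum(axis=1)))   # row-wise projection
    return float(angles[int(np.argmax(scores))])

def deskew(binary_page):
    """Rotate the page by the estimated correction angle."""
    angle = estimate_correction_angle(binary_page)
    return ndimage.rotate(binary_page.astype(float), angle,
                          reshape=False, order=1)
```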
PUBLICATIONS

Conferences

[C1] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le (2017). A new method for displayed mathematical expression detection based on FFT and SVM. 4th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, Hanoi, Vietnam, ISBN: 978-1-5386-3210-9, DOI: 10.1109/NAFOSTED.2017.8108044, pp. 90-96, 2017.
[C2] Bui Hai Phong, Thang Manh Hoang and Thi-Lan Le (2019). Mathematical variable detection based on CNN and SVM. 2nd International Conference on Multimedia Analysis and Pattern Recognition (MAPR), ISBN: 978-1-7281-1829-1, DOI: 10.1109/MAPR.2019.8743543, pp. 1-5, 2019.
[C3] Bui Hai Phong, Thang Manh Hoang and Thi-Lan Le (2019). A unified system for mathematical expression detection in scientific document images. Korea-Vietnam International Joint Workshop on Communications and Information Sciences (KICS), ISBN: 978-89-950043-7-1[93560], Hanoi, Viet Nam, pp. 14-16, 2019.
[C4] Bui Hai Phong, Luong Tan Dat, Nguyen Thi Yen, Thang Manh Hoang and Thi-Lan Le (2020). A deep learning based system for mathematical expression detection and recognition in scientific document images. The 12th IEEE International Conference on Knowledge and Systems Engineering (KSE), ISBN: 978-1-7281-3003-3, pp. 85-90, 2020, DOI: 10.1109/KSE.2019.8919461.

Journals

[J1] Bui Hai Phong, Thang Manh Hoang and Thi-Lan Le (2020). A hybrid method for mathematical expression detection in scientific document images. IEEE Access, vol. 8, pp. 83663-83684, 2020, ISSN: 2169-3536 (Print), 2169-3536 (Online), DOI: 10.1109/ACCESS.2020.2992067 (ISI, Q1, IF=4.098).
[J2] Bui Hai Phong, Thang Manh Hoang and Thi-Lan Le (2021). Mathematical variable detection in scientific document images. International Journal of Computational Vision and Robotics, Vol. 11, No. 1, pp. 66-89, 2021, ISSN online: 1752-914X, ISSN print: 1752-9131, DOI: 10.1504/IJCVR.2021.111876 (SCOPUS).
[J3] Bui Hai Phong, Thang Manh Hoang and Thi-Lan Le (2021). An end-to-end framework for the detection of mathematical expressions in scientific document images. Expert Systems, Online ISSN: 1468-0394, DOI: 10.1111/exsy.12800 (ISI, Q2, IF=2.587).

Related publication of the thesis

[C5] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le and Akiko Aizawa (2019). Mathematical variable detection in PDF scientific documents. 11th Asian Conference on Intelligent Information and Database Systems (ACIIDS), Springer, Cham, Indonesia, DOI: https://doi.org/10.1007/978-3-030-14802-7-60, ISBN: 978-3-030-14802-7, 2019.

Bibliography

[1] K. Iwatsuki and A. Aizawa (2017). Detecting in-line mathematical expressions in scientific documents. ACM Symposium on Document Engineering, pp. 141-144. doi:10.1145/3103010.3121041
[2] N. Nikolaou et al. (2010). Segmentation of historical machine-printed documents using adaptive run length smoothing and skeleton segmentation paths. Image and Vision Computing, 28.4, pp. 590-604. doi:10.1016/j.imavis.2009.09.013
[3] Muno F.A. (2015). Mathematical expression recognition based on probabilistic grammar. PhD thesis, Technical University of Valencia, Spain.
[4] Alkalai M. et al. (2013). Improving formula analysis with line and mathematics identification. 12th International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2013.74
[5] Zanibbi R. and Blostein D. (2012). Recognition and retrieval of mathematical expressions. International Journal on Document Analysis and Recognition, 15.4, pp. 331-357.
[6] M. Suzuki et al. (2003). INFTY: an integrated OCR system for mathematical documents. Proceedings of the 2003 ACM Symposium on Document Engineering. doi:10.1145/958220.958239
[7] Arnold D. (January 2013). Prealgebra textbook. College of the Redwoods.
[8] Redden J. (2011). Elementary algebra textbook. Saylor Foundation.
[9] D.F. Chan and D. Yeung (2000). Mathematical expression recognition: a survey. International Journal on Document Analysis and Recognition, 3.1, pp. 3-15.
[10] Mali P. (2019). Scanning single shot detector for math in document images. Master thesis, Rochester Institute of Technology, USA.
[11] Lin X., Gao L., Tang Z., Lin X., and Hu X. (March 2012). Performance evaluation of mathematical formula identification. International Workshop on Document Analysis Systems. doi:10.1109/das.2012.68
[12] Lin X., Gao L., Tang Z., Baker J., and Sorge V. (2014). Mathematical formula identification and performance evaluation in PDF documents. International Journal on Document Analysis and Recognition, 17.3, pp. 239-255.
[13] Wang Z. et al. (November 2020). Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. International Journal on Document Analysis and Recognition. doi:10.1007/s10032-020-00360-2
[14] A. Awal et al. (November 2010). The problem of handwritten mathematical expression recognition evaluation. Conference on Frontiers in Handwriting Recognition. doi:10.1109/icfhr.2010.106
[15] L. Lamport (1994). LaTeX: a document preparation system. Addison-Wesley Professional, 2nd Edition.
[16] Chu W. and Liu F. (December 2013). Mathematical formula detection in heterogeneous document images. Proceedings of the International Conference on Technologies and Applications of Artificial Intelligence. doi:10.1109/taai.2013.38
[17] Oliveira S. et al. (August 2018). dhSegment: a generic deep-learning approach for document segmentation. International Conference on Frontiers in Handwriting Recognition, pp. 7-12. doi:10.1109/icfhr-2018.2018.00011
[18] Clausner C. et al. (2019). ICDAR2019 competition on recognition of documents with complex layouts - RDCL2019. 2019 International Conference on Document Analysis and Recognition (ICDAR). doi:10.1109/icdar.2019.00245
[19] Tran T.A., Na I.S., and Kim S.H. (2016). Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology. International Journal on Document Analysis and Recognition, 19.3, pp. 191-209.
[20] Wahl F.M., Wong K.Y., and Casey R.G. (1982). Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing, 20.4, pp. 375-390.
[21] Wang D. and Srihari S. (1989). Classification of newspaper image blocks using texture analysis. Computer Graphics and Image Processing, 47.3, pp. 327-352.
[22] Caponetti L. et al. (2008). Document page segmentation using neuro-fuzzy approach. Applied Soft Computing, 8.1, pp. 118-126.
[23] Agrawal M. and Doermann D. (2009). Voronoi++: a dynamic page segmentation approach based on Voronoi and docstrum features. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2009.270
[24] Cheng H. and Bouman C. (2001). Multiscale Bayesian segmentation using a trainable context model. IEEE Transactions on Image Processing, 10.4, pp. 511-525.
[25] Shi Z. and Govindaraju V. (August 2005). Multi-scale techniques for document page segmentation. International Conference on Document Analysis and Recognition. doi:10.1109/ICDAR.2005.165
[26] Dai-Ton H., Duc-Dung N., and Duc-Hieu L. (2016). An adaptive over-split and merge algorithm for page segmentation. Pattern Recognition Letters, 80, pp. 137-143.
[27] Breuel T. (January 2008). The OCRopus open source OCR system. Proceedings of the Conference on Document Recognition and Retrieval XV. doi:10.1117/12.783598
[28] Smith R. (September 2007). An overview of the Tesseract OCR engine. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2007.4376991
[29] Anoop M. and Anil K. (2007). Document structure and layout analysis. Digital Document Processing, pp. 29-48. doi:10.1007/978-1-84628-726-8_2
[30] Lin X. et al. (November 2013). A text line detection method for mathematical formula recognition. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2017.161
[31] Chen K. et al. (November 2017). Convolutional neural networks for page segmentation of historical document images. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2017.161
[32] He K., Zhang X., Ren S., and Sun J. (June 2016). Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. doi:10.1109/cvpr.2016.90
[33] T. Lu and A. Dooms (2020). Probabilistic homogeneity for document image segmentation. Pattern Recognition, 109, pp. 107591-107605. doi:10.1016/j.patcog.2020.107591
[34] S. Bhowmik et al. (2019). GiB: a game theory inspired binarization technique for degraded document images. IEEE Transactions on Image Processing, 28.3, pp. 1443-1455.
[35] A. Antonacopoulos et al. (2009). A realistic dataset for performance evaluation of document layout analysis. 10th International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2009.271
[36] Fateman R. (December 1997). How to find mathematics on a scanned page. Proceedings of SPIE - The International Society for Optical Engineering. doi:10.1117/12.373482
[37] Lee H. and Wang J. (1997). Design of a mathematical expression understanding system. Pattern Recognition Letters, 18.3, pp. 289-298.
[38] J. Toumit et al. (1999). A hierarchical and recursive model of mathematical expressions for automatic reading of mathematical documents. Proceedings of the Fifth International Conference on Document Analysis and Recognition, September 1999. doi:10.1109/icdar.1999.791739
[39] Kacem A. et al. (2001). Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. International Journal on Document Analysis and Recognition, 4.2, pp. 97-108.
[40] Garain U. (2009). Identification of mathematical expressions in document images. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2009.203
[41] Garain U. (1998). Automatic detection of italic, bold and all-capital words in document images. International Conference on Pattern Recognition 1998. doi:10.1109/icpr.1998.711217
[42] Yamazaki S. et al. (September 2011). Embedding a mathematical OCR module into OCRopus. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2011.180
[43] D. Drake and Baird H. (2005). Distinguishing mathematics notation from English text using computational geometry. International Conference on Document Analysis and Recognition. doi:10.1109/ICDAR.2005.89
[44] R. Duda et al. (2000). Pattern classification, 2nd edition. John Wiley & Sons.
[45] Katherine L. et al. (2011). A low complexity sign detection and text localization method for mobile applications. IEEE Transactions on Multimedia, 13.5, pp. 922-934.
[46] Zhao Z. (2019). Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems, 30.11, pp. 3212-3232.
[47] Georgevici A.I. and Terblanche M. (2019). Neural networks and deep learning: a brief introduction. Intensive Care Medicine, 45.5, pp. 712-714.
[48] Wang H. et al. (2017). On the origin of deep learning. https://arxiv.org/pdf/1702.07800.pdf
[49] Fukushima K. (1980). Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36.4, pp. 193-202.
[50] Rumelhart D. (1986). Learning internal representations by error propagation. Parallel Distributed Processing, 1, pp. 318-362.
[51] Y. LeCun et al. (1980). Handwritten digit recognition with a back-propagation network. Biol. Cybernetics, pp. 193-202.
[52] Hochreiter S. et al. (1997). Long short-term memory. Neural Computation, 9.8, pp. 1735-1780.
[53] Krizhevsky A., Sutskever I., and Hinton G. (2012). ImageNet classification with deep convolutional neural networks. International Conference on Neural Information Processing Systems, pp. 1097-1105.
[54] Manessi F. and R. Alessandro (2019). Learning combinations of activation functions. https://arxiv.org/pdf/1801.09403.pdf
[55] K. Simonyan and A. Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
[56] Szegedy C. et al. (2014). Going deeper with convolutions. arXiv:1409.4842
[57] Girshick R. and Malik J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition.
[58] Uijlings J. and Smeulders A. (2013). Selective search for object recognition. International Journal of Computer Vision, 104.2, pp. 154-171.
[59] Girshick R. et al. (2015). Fast R-CNN. IEEE International Conference on Computer Vision. doi:10.1109/iccv.2015.169
[60] Ren S., He K., Girshick R., and Sun J. (2017). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE International Conference on Computer Vision, 39.6, pp. 1137-1149.
[61] Redmon J. et al. (June 2016). You only look once: unified, real-time object detection. International Conference on Computer Vision and Pattern Recognition. doi:10.1109/cvpr.2016.91
[62] Liu W. et al. (2016). SSD: single shot multibox detector. Lecture Notes in Computer Science, 9905, pp. 21-37.
[63] Ohyama W. et al. (2019). Detecting mathematical expressions in scientific document images using a U-Net trained on a diverse dataset. IEEE Access, 7, pp. 144030-144042.
[64] He W. and Liu C. (2016). Context-aware mathematical expression recognition: an end-to-end framework and a benchmark. International Conference on Pattern Recognition (ICPR), pp. 3246-3251.
[65] Ronneberger O., Fischer P., and Brox T. (2015). U-Net: convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234-241.
[66] Mahdavi M. et al. (2019). ICDAR 2019 CROHME + TFD: competition on recognition of handwritten mathematical expressions and typeset formula detection. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2019.00247
[67] R.H. Anderson (1967). Syntax-directed recognition of hand-printed two-dimensional mathematics. Symposium on Interactive Systems for Experimental Applied Mathematics: Proceedings of the Association for Computing Machinery Inc., pp. 436-459. doi:10.1145/2402536.2402585
[68] Garain U. and Chaudhuri B. (2004). Recognition of online handwritten mathematical expressions. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), pp. 2366-2376.
[69] He W. et al. (2016). Context-aware mathematical expression recognition: an end-to-end framework and a benchmark. International Conference on Pattern Recognition (ICPR), pp. 3246-3251.
[70] B. HaKan et al. (January 2007). Online handwritten mathematical expression recognition. Document Recognition and Retrieval XIV. doi:10.1117/12.704043
[71] J. Ha et al. (1995). Understanding mathematical expressions from document images. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.1995.602060
[72] Zanibbi R. et al. (2002). Recognizing mathematical expressions using tree transformation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1455-1467.
[73] Alvaro F. et al. (September 2011). Recognition of printed mathematical expressions using two-dimensional stochastic context-free grammars. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2011.247
[74] Lavirotte S. and Pottier L. (April 1998). Mathematical formula recognition using graph grammar. Document Recognition V. doi:10.1117/12.304644
[75] Celik M. and Yanikoglu B. (2011). Probabilistic mathematical formula recognition using a 2D context-free graph grammar. 2011 International Conference on Document Analysis and Recognition. doi:10.1117/12.304644
[76] F. Alvaro et al. (2014). Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Pattern Recognition Letters, 35, pp. 58-67.
[77] C. Wang et al. (2015). Understanding mathematical expressions from camera images. The 33rd Workshop on Combinatorial Mathematics and Computation Theory.
[78] J. Zhang et al. (2017). Watch, attend and parse: an end-to-end neural network based approach to handwritten mathematical expression recognition. Pattern Recognition, 71, pp. 196-206.
[79] Zhang T. et al. (October 2016). Online handwritten mathematical expressions recognition by merging multiple 1D interpretations. International Conference on Frontiers in Handwriting Recognition. doi:10.1109/icfhr.2016.0045
[80] J. Wu et al. (2019). Image-to-markup generation via paired adversarial learning. Machine Learning and Knowledge Discovery in Databases, pp. 18-34. doi:10.1007/978-3-030-10925-7_2
[81] Zhang J. and J. Da (August 2018). Multi-scale attention with dense encoder for handwritten mathematical expression recognition. International Conference on Pattern Recognition (ICPR). doi:10.1109/icpr.2018.8546031
[82] J. Wang, Y. S., and S. Wang (2019). Image to LaTeX with DenseNet encoder and joint attention. Procedia Computer Science, pp. 374-380.
[83] L.A. Duc (2020). Recognizing handwritten mathematical expressions via paired dual loss attention network and printed mathematical expressions. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 566-567.
[84] Everingham M. et al. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88.2, pp. 303-338.
[85] Mathpix (March 2021). Mathpix. https://mathpix.com
[86] Minh N.Q. (2013). Semantic enrichment of mathematical expressions for mathematical search. PhD thesis, National Institute of Informatics, Tokyo, Japan.
[87] Mathur A. (June 2019). AI based reading system for blind using OCR. 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA). doi:10.1109/iceca.2019.8822226
[88] P. Wells (2015). Math in the dark: tools for expressing mathematical content by visually impaired students. PhD thesis, Nova Southern University.
[89] Edward S. (July 2018). Text-to-speech device for visually impaired people. International Journal of Pure and Applied Mathematics, 119.15.
[90] Selvaraj C. and Natarajan B. (January 2018). Enhanced portable text to speech converter for visually impaired. International Journal of Intelligent Systems Technologies and Applications, 17.1.
[91] DoIT (March 2021). DoIT. http://doit.uet.vnu.edu.vn/
[92] A. Papandreou and B. Gatos (September 2011). A novel skew detection technique based on vertical projections. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2011.85
[93] K. Taeho et al. (November 2017). Robust document image dewarping using text lines and line segments. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2017.146
[94] Papandreou A. and Gatos B. (September 2011). A novel skew detection technique on vertical projections. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2011.85
[95] Frigo M. and Johnson S.G. (1998). FFTW: an adaptive software architecture for the FFT. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp. 1381-1384.
[96] J. Friedman et al. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, pp. 337-407.
[97] Young I.T. et al. (1995). Fundamentals of image processing. Delft University of Technology.
[98] M. Narwaria et al. (2012). Fourier transform-based scalable image quality measure. IEEE Transactions on Image Processing, 21.8, pp. 3364-3377.
[99] Joachims T. (2002). Optimizing search engines using clickthrough data. ACM Conference on Knowledge Discovery and Data Mining.
[100] C. Blatter et al. (1977). Analysis. Heidelberger Taschenbücher (Book 151).
[101] Ghahramani S. (2000). Fundamentals of probability. Prentice Hall: New Jersey.
[102] Roger L. (2008). Linguistics 251 lecture notes, Fall 2008.
[103] P. Napoletano et al. (January 2018). Anomaly detection in nanofibrous materials by CNN-based self-similarity. Sensors, 18.1. doi:10.3390/s18010209
[104] Diederik K. and Ba J. (2014). Adam: a method for stochastic optimization. arXiv:1412.6980
[105] Murphy K. (2012). Machine learning: a probabilistic perspective. The MIT Press, Cambridge, Massachusetts, First edition.
[106] Zhu Q., Zhang P., Wang Z., and Ye X. (2019). A new loss function for CNN classifier based on predefined evenly-distributed class centroids. IEEE Access, 8, pp. 10888-10895. doi:10.1109/access.2019.2960065
[107] A. Herrera and H. Müller (2014). Fusion techniques in biomedical information retrieval. Fusion in Computer Vision, pp. 209-228.
[108] S. Lee M. and H. Muller (2019). Late fusion of deep learning and handcrafted visual features for biomedical image modality classification. IET Image Processing, pp. 382-391.
[109] Saba T. et al. (2019). Brain tumor detection using fusion of hand crafted and deep learning features. Cognitive Systems Research, pp. 221-230.
[110] Liu Z. and Smith R. (August 2013). A simple equation region detector for printed document images in Tesseract. International Conference on Document Analysis and Recognition. doi:10.1109/icdar.2013.56
[111] Degtyarenko I., Radyvonenko O., Bokhan K., and Khomenko V. (2016). Text/shape classifier for mobile applications with handwriting input. International Journal on Document Analysis and Recognition, 19.4, pp. 369-379.
[112] Maaten V.L. and Hinton G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, pp. 2579-2605.
[113] Torgerson W.S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, pp. 401-419.
[114] Williams C. et al. (2002). On a connection between kernel PCA and metric multidimensional scaling. Machine Learning, 46, pp. 11-19.
[115] Xu D. and Li H. (2006). Euclidean distance transform of digital images in arbitrary dimensions. Advances in Multimedia Information Processing - PCM 2006, pp. 72-79.
[116] Srivastava S. et al. (2021). Comparative analysis of deep learning image detection algorithms. Journal of Big Data, 8.1. doi:10.1186/s40537-021-00434-w
[117] D. Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473
[118] J.K. Chorowski et al. (2015). Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 1, pp. 577-585.
[119] K. Cho et al. (2015). Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17.11, pp. 1875-1886.

... 2D handwritten MEs are recognized from the multiple 1D sequences. The BLSTM has been used as a strong sequence classifier for the recognition task. The work in [80] has proposed the neural network
