Examples of misclassified cases

4.4 Results of the graphical object retrieval system

The data used in the experiments consists of two parts:

– Query set: all graphical object images cropped from the bounding boxes in IIIT-AR-13K (validation) and IIIT-AR-13K (test). The bounding-box positions are the ground-truth labels of the dataset. Each cropped image is assigned, as its class name, the name of the document that contains it.

– Database set, prepared as follows:

  – All images in IIIT-AR-13K (validation) and IIIT-AR-13K (test) are fed into the YOLOv3 model that was trained to detect graphical objects. This is the model reported in the single-label object detection results.

  – The output of this YOLOv3 model is the set of bounding-box positions. We crop the images at those positions and pass the crops through several feature extraction models: DINO, ViT, ResNet18, ResNet50, and EfficientNet-B2.
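The cropping and labeling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the page is a toy nested list of pixel values, boxes are assumed to be (x1, y1, x2, y2) in pixel coordinates with exclusive upper bounds, and the names `crop_box` and `build_query_set` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x1, y1, x2, y2), x2/y2 exclusive
Image = List[List[int]]          # toy grayscale image: rows of pixel values


def crop_box(image: Image, box: Box) -> Image:
    """Cut the region covered by one bounding box out of a page image."""
    x1, y1, x2, y2 = box
    height, width = len(image), len(image[0])
    # Clamp the box to the page so slightly out-of-range boxes still crop.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def build_query_set(pages: Dict[str, Tuple[Image, List[Box]]]) -> List[Tuple[str, Image]]:
    """Crop every box and label the crop with the name of its source document."""
    queries = []
    for doc_name, (image, boxes) in pages.items():
        for box in boxes:
            queries.append((doc_name, crop_box(image, box)))
    return queries


# Toy 4x6 page whose pixel value encodes its position (10 * row + column).
page = [[10 * r + c for c in range(6)] for r in range(4)]
queries = build_query_set({"report_001": (page, [(1, 1, 4, 3), (4, 0, 9, 2)])})
for label, crop in queries:
    print(label, crop)
```

The same cropping would apply to both parts of the data: ground-truth boxes for the query set and YOLOv3 predictions for the database set. In a real pipeline each crop would then be passed to a feature extractor rather than printed.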

Table 4.11 details the recall top-k results of the graphical object retrieval system. With DINO as the image feature extraction model, the system achieves the highest recall top 3 averaged over images, 93.07%.
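One way to compute recall top k for such a system is sketched below. The text does not spell out the similarity measure or the hit rule, so the sketch makes two assumptions: items are ranked by cosine similarity, and a query counts as a hit when its document name appears among the k nearest database items. All function names and the toy 2-D features are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_documents(query: Vector, database: List[Tuple[str, Vector]], k: int) -> List[str]:
    """Return the document names of the k database items most similar to the query."""
    ranked = sorted(database, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def recall_at_k(queries, database, k):
    """Per-image and per-class recall top k.

    `queries` holds (document name, object class, feature vector) triples.
    """
    hits_per_class: Dict[str, List[int]] = {}
    for doc, cls, vec in queries:
        hit = int(doc in top_k_documents(vec, database, k))
        hits_per_class.setdefault(cls, []).append(hit)
    all_hits = [h for hits in hits_per_class.values() for h in hits]
    per_image = sum(all_hits) / len(all_hits)          # every query weighs the same
    class_recalls = {c: sum(h) / len(h) for c, h in hits_per_class.items()}
    per_class = sum(class_recalls.values()) / len(class_recalls)  # every class weighs the same
    return per_image, per_class, class_recalls


# Toy 2-D features; real extractors produce much longer vectors.
database = [("doc_a", (1.0, 0.0)), ("doc_b", (0.0, 1.0)), ("doc_c", (1.0, 1.0))]
queries = [
    ("doc_a", "table", (0.9, 0.1)),
    ("doc_b", "table", (0.8, 0.3)),
    ("doc_c", "table", (1.0, 0.9)),
    ("doc_b", "logo", (0.1, 0.9)),
]
for k in (1, 2, 3):
    per_image, per_class, _ = recall_at_k(queries, database, k)
    print(k, round(per_image, 4), round(per_class, 4))
```

In this toy data one of the three table queries misses at k = 1 while the single logo query hits, so the per-image recall (3 of 4 queries) differs from the per-class recall (the mean of 2/3 and 1). The same effect would explain why the two average columns of a results table need not agree when classes have different sizes.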

Table 4.11: Recall top-k results of the retrieval system with different feature extraction models

| Feature extraction model | Recall top k | Table | Figure | Natural image | Logo | Signature | Per-class average | Per-image average |
|---|---|---|---|---|---|---|---|---|
| DINO | 1 | 0.8348 | 0.7966 | 0.9608 | 0.7612 | 0.9300 | 0.8567 | 0.8462 |
| | 3 | 0.9305 | 0.8867 | 0.9866 | 0.8458 | 0.9800 | 0.9259 | 0.9307 |
| | 5 | 0.9471 | 0.9078 | 0.9866 | 0.8507 | 0.9850 | 0.9354 | 0.9452 |
| | 10 | 0.9651 | 0.9248 | 0.9888 | 0.8657 | 0.9850 | 0.9459 | 0.9604 |
| | 50 | 0.9836 | 0.9470 | 0.9922 | 0.9005 | 0.9850 | 0.9617 | 0.9775 |
| ViT | 1 | 0.5992 | 0.6674 | 0.9474 | 0.7065 | 0.7950 | 0.7431 | 0.6610 |
| | 3 | 0.7414 | 0.8061 | 0.9821 | 0.8259 | 0.9450 | 0.8601 | 0.7887 |
| | 5 | 0.7864 | 0.8432 | 0.9821 | 0.8557 | 0.9600 | 0.8855 | 0.8257 |
| | 10 | 0.8371 | 0.8941 | 0.9843 | 0.8706 | 0.9850 | 0.9142 | 0.8685 |
| | 50 | 0.9292 | 0.9417 | 0.9888 | 0.9005 | 0.9900 | 0.9500 | 0.9393 |
| ResNet18 | 1 | 0.8225 | 0.7871 | 0.9586 | 0.796 | 0.885 | 0.84984 | 0.8360 |
| | 3 | 0.9193 | 0.8941 | 0.9832 | 0.8607 | 0.975 | 0.92646 | 0.9239 |
| | 5 | 0.9421 | 0.9174 | 0.9854 | 0.8706 | 0.975 | 0.9381 | 0.9432 |
| | 10 | 0.9585 | 0.9322 | 0.9866 | 0.8756 | 0.985 | 0.94758 | 0.9569 |
| | 50 | 0.9819 | 0.9523 | 0.991 | 0.9055 | 0.99 | 0.96414 | 0.9771 |
| ResNet50 | 1 | 0.8533 | 0.7638 | 0.9597 | 0.7562 | 0.9050 | 0.8476 | 0.8535 |
| | 3 | 0.9330 | 0.8591 | 0.9854 | 0.8507 | 0.9750 | 0.9206 | 0.9286 |
| | 5 | 0.9500 | 0.8867 | 0.9877 | 0.8607 | 0.9900 | 0.9350 | 0.9449 |
| | 10 | 0.9624 | 0.9163 | 0.9888 | 0.8706 | 0.9900 | 0.9456 | 0.9577 |
| | 50 | 0.9807 | 0.9460 | 0.9899 | 0.8856 | 0.9900 | 0.9584 | 0.9748 |
| EfficientNet-B2 | 1 | 0.8369 | 0.8061 | 0.9630 | 0.7512 | 0.8600 | 0.8434 | 0.8469 |
| | 3 | 0.9205 | 0.8972 | 0.9810 | 0.8259 | 0.9650 | 0.9179 | 0.9236 |
| | 5 | 0.9373 | 0.9206 | 0.9821 | 0.8358 | 0.9900 | 0.9332 | 0.9393 |
| | 10 | 0.9552 | 0.9322 | 0.9832 | 0.8507 | 0.9900 | 0.9423 | 0.9537 |
| | 50 | 0.9807 | 0.9544 | 0.9877 | 0.8905 | 0.9900 | 0.9607 | 0.9757 |
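The per-class average in Table 4.11 appears to be the unweighted mean of the five class columns. The short check below recomputes it for three DINO rows; the values are copied from the table, the column order (table, figure, natural image, logo, signature) is taken from the header, and the helper name `macro_average` is hypothetical.

```python
# Per-class recall for three DINO rows of Table 4.11, in the order
# table, figure, natural image, logo, signature, plus the reported average.
dino_rows = {
    1: ([0.8348, 0.7966, 0.9608, 0.7612, 0.9300], 0.8567),
    3: ([0.9305, 0.8867, 0.9866, 0.8458, 0.9800], 0.9259),
    5: ([0.9471, 0.9078, 0.9866, 0.8507, 0.9850], 0.9354),
}


def macro_average(values):
    """Unweighted mean: every class counts equally, whatever its size."""
    return sum(values) / len(values)


for k, (per_class, reported) in dino_rows.items():
    computed = macro_average(per_class)
    print(f"k={k}: computed {computed:.4f}, reported {reported:.4f}")
```

The per-image average cannot be recomputed the same way, because it weights each class by its number of query images, and those counts are not listed in the table.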

Chapter 5

Conclusion

Digital transformation and document digitization have brought businesses substantial benefits, namely lower operating costs and higher work efficiency. Decisions are now made faster and more accurately thanks to seamless, timely reporting systems, and employee productivity is optimized. In enterprises, the number of daily reports that must be processed and stored keeps growing, which makes the search problem increasingly complex. Users no longer need only to search the text content of documents; they also need to search for graphical objects such as tables, graphs, and charts.

For a graphical search system to work well, the first step is to manage and build an index of the graphical objects in documents. Before these objects can be indexed, they must first be detected and recognized. Most current methods and models for recognizing graphical objects in document images achieve good results only when trained on a dataset that provides both the position and the label name of each graphical object. Labeling such datasets takes considerable effort and a great deal of time. In this thesis we therefore propose a method that can make use of a dataset with incomplete class label names to recognize graphical objects. The main contributions of the thesis are:

– ... localization of graphical objects with a dataset that contains only bounding-box positions.

– A method that uses the self-supervised deep network DINO for the image feature extraction step when classifying graphical object classes on a dataset with a limited number of images per class.

– The design and development of a system for searching and retrieving graphical objects in document images.

The method for recognizing graphical objects in document images presented in this thesis has been tested and achieves a result of approximately 74% on the test and validation datasets. Applying it to build a system that searches and retrieves documents from graphical objects also gives good results, with recall reaching 93.3%.

Although the results on the dataset are good, the method does not yet work effectively on documents with heavy noise and hard folds, such as photographed or scanned documents. Future work should therefore improve accuracy on noisy documents such as photographs and scans.
