Information Extraction from Text Using Deep Transfer Learning

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER'S THESIS

Information Extraction from Text Using Deep Transfer Learning

NGUYỄN THANH BÌNH
thanhbinh030296@gmail.com

Major: Mathematics and Informatics
Supervisor: Dr. Trần Ngọc Thăng (supervisor's signature)
School: School of Applied Mathematics and Informatics

HANOI, 4/2021

MASTER'S THESIS, NGUYỄN THANH BÌNH

Acknowledgments

First of all, the author would like to sincerely thank Dr. Trần Ngọc Thăng for his guidance and enthusiastic support throughout the research, from the author's first days as a graduate student until now. The author would also like to thank the School of Applied Mathematics and Informatics, the Training Office (Postgraduate Training Management Division) of Hanoi University of Science and Technology, and the CMC research institute for creating favorable conditions for the author to complete this thesis.

Abstract

Information extraction has many practical applications, from extraction systems for specialized domains such as biology, medicine, and crime prevention to systems that support learning and teaching (e-learning). However, large deep learning models need a lot of data to be effective. They must be trained on thousands or even millions of data points before they can make accurate predictions. They are expensive in computational resources, time, and data, and each model is built to solve a single task. Other tasks in the future will need a comparable amount of data and computation to build a new deep learning model. The human brain does not work this way. It does not discard previously acquired knowledge when it solves a new task. Instead, it makes decisions based on what it has learned in the past. What was acquired as knowledge from one task is reused in a similar way to solve related tasks. Transfer learning in general, and deep transfer learning in particular, aims to mimic this behavior in deep learning in order to save computational resources and time and to improve model performance.

Keywords: Transfer learning, Deep transfer learning, Deep learning, Named entity recognition

Hanoi, 5 April 2021
Student: Nguyễn Thanh Bình

Table of contents

INTRODUCTION
1.1 Introduction to deep learning
1.2 Overview of the problem of extracting information from text
1.3 Research objectives
1.4 Thesis structure
TRANSFER LEARNING AND DEEP TRANSFER LEARNING
2.1 Introduction to transfer learning
2.2 Definition of transfer learning
2.3 Methods and approaches
2.4 Deep transfer learning
2.4.1 Off-the-shelf pretrained models as feature extractors
2.4.2 Fine-tuning pretrained models
2.4.3 Pretrained models
2.5 Types of deep transfer learning
2.5.1 Domain adaptation
2.5.2 Domain confusion
2.5.3 Multitask learning
2.5.4 One-shot learning
2.6 Advantages and challenges of deep transfer learning
2.7 Applications of transfer learning
INFORMATION EXTRACTION FROM TEXT WITH DEEP TRANSFER LEARNING
3.1 The problem of extracting information from customs declarations
3.2 System architecture
3.3 The attention mechanism
3.3.1 Introduction to attention
3.3.2 The Transformer model
3.4 BERT
3.4.1 BERT architecture
3.4.2 Input data representation
3.4.3 Model pretraining
3.5 Data description
3.6 Baseline model
3.7 BERT model for NER
3.8 Summary
CONCLUSION
References
A Related scientific publications

List of figures

1.1 A simple neural network (a) and a deep learning neural network (b)
1.2 Example of BIO labeling
2.1 The ImageNet challenge, based on the ImageNet dataset
2.2 A traditional machine learning model (left) and a model that follows the transfer learning principle
2.3 Transfer learning in deep learning
2.4 The idea of transfer learning
2.5 Deep transfer learning with a pretrained deep learning model used as a feature extractor
2.6 Performance of off-the-shelf pretrained models
2.7 Example of a face recognition network whose earlier layers are used to extract features
2.8 Freeze the weights or fine-tune
2.9 Multitask learning receives information from all tasks simultaneously
3.1 Goods descriptions
3.2 Overall architecture
3.3 Traditional sequence-to-sequence model
3.4 Sequence-to-sequence model with attention
3.5 Attention
3.6 Multi-Head Attention
3.7 Transformer architecture
3.8 BERT input data representation
3.9 Statistics of the number of words per passage/sentence in the goods descriptions
3.10 The author's proposed model for hierarchical customs code classification from goods descriptions
3.11 Example of B-I-O labeling
3.12 Baseline model architecture
3.13 BERT-NER model architecture
3.14 BERT-NER model with the pretrained BERT weights frozen
3.15 BERT-NER with frozen weights and BERT-NER with fine-tuning
A.1 Evidence of the published paper
A.2 Evidence of the conference presentation

List of tables

3.1 Description of the data labels
3.2 Results of the baseline model
3.3 Results of the BERT-NER model with weights updated during training
3.4 Results of the BERT-NER model without updating the weights of the BERT part during training

Abbreviations

Abbreviation    Meaning
AGI             Artificial general intelligence
NLP             Natural language processing
ML              Machine learning
DL              Deep learning
BERT            Bidirectional Encoder Representations from Transformers
MLM             Masked Language Model
IID             Independent and Identically Distributed
IE              Information Extraction
CNN             Convolutional neural network
NaN             Not a Number
LSTM            Long short-term memory
bi-LSTM         Bidirectional long short-term memory

CHAPTER 1. INTRODUCTION

1.1 Introduction to deep learning

Before discussing deep learning, we first need to discuss machine learning. A machine learning system is trained rather than explicitly programmed. It is presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task [1].

Deep learning is a specific subfield of machine learning: a method of learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The "deep" in deep learning is not a reference to any kind of deeper understanding achieved by the approach; rather, it stands for the idea of successive layers of representations [1]. In deep learning, these layered representations are (almost always) learned via models called neural networks, structured in literal layers stacked on top of each other. The term neural network is a reference to neurobiology, and some central concepts in deep learning were developed in part by drawing inspiration from our understanding of the brain; in practice, however, deep learning models are not models of the brain.

Figure 1.1: A simple neural network (a) and a deep learning neural network (b)

1.2 Overview of the problem of extracting information from text

Information extraction (IE) is a research branch that focuses on extracting semantic information from text. It has applications in many domains, such as web mining (extracting the names of famous people and popular products, comparing product prices, studying competitors, analyzing customer sentiment), biomedicine, business intelligence, finance (assessing the market from different sources: rises and falls in fuel prices, news about wars, national politics, laws governing business markets), and terrorism events (which weapons were used, who the targets of the attack were) [2]. IE follows preprocessing steps that focus on vocabulary and syntax: sentence segmentation, word segmentation, syntactic parsing, and part-of-speech tagging. IE can then be simplified into the following problems: named entity recognition (NER: people, organization, location), coreference resolution, and relation extraction (RE) between two entities. Models are evaluated experimentally with the precision, recall, and F1-score metrics [3]. This thesis focuses on entity extraction.

Named entity recognition was first introduced in 1995 at MUC-6, and NER has received much attention from the community since then. The entity types listed at MUC-6 are person, organization, location, date, time, monetary values, and percentages. There are two approaches to entity extraction: rule-based and statistical learning.

• Rule-based approach: A rule set consists of rules that are predefined or automatically generated. The input text is compared against the rule set, and extraction is performed when a rule is satisfied. A rule consists of a pattern and an action. A pattern is usually a regular expression defined over the features of tokens. When the pattern is satisfied, the action is triggered. An action can assign an entity label to a sequence of tokens, add start/end labels for an entity, or identify several entities at once. For example, to match the proper noun “Mr B”, where B is a
capitalized word that denotes a person's name, we [...]

Figure 3.14: BERT-NER model with the pretrained BERT weights frozen

With the evaluation method following [22] and [3], the two result tables clearly show that the F1-score for the NAME field is slightly lower than for the other fields in both BERT-NER models. This can be explained by the imbalance between entity types that are rare and entity types that are frequent in the data. Nevertheless, the micro F1-score of the models is high. Although the training time drops to about 16-17 minutes per epoch, the results of the two models are similar in micro F1-score and macro F1-score. Even so, the BERT-NER model with fine-tuned weights (BERT-NER fine-tuning) has a lower loss than the BERT-NER model without fine-tuning (BERT-NER freeze), as shown in Figure 3.15.

Figure 3.15: BERT-NER with frozen weights and BERT-NER with fine-tuning

This can be explained as follows. According to the BERT paper [18], the model uses WordPiece [20] for embedding. Although the model is multilingual and Vietnamese accounts for 13.5% of its 119,547-entry vocabulary [23], letting BERT learn how to represent sentences from the new data gives better results than not fine-tuning (freeze). This is why the BERT-NER model whose weights are updated during training achieves a higher F1-score than the one whose weights are not updated.

Across the three models (the baseline model, BERT-NER with weight updates, and BERT-NER without weight updates), we see that a specialized task with specialized data such as this one needs a suitable strategy for building a model with deep transfer learning. For the problem of extracting information from customs declarations for automobile goods, the BERT-NER model with updated weights achieves the best results. This is a characteristic of transfer learning: instead of keeping the architecture and weights of another, similar model unchanged to solve a new problem, we need to fine-tune them to fit that problem. Of course, there are many cases in which a pretrained model works well on a specific problem as it is. Usually, however, these are cases in which the data domain of the pretrained model contains the data domain of the problem to be solved. For example, suppose we need to classify the sentiment of user comments about school supplies on the e-commerce site tiki, and we have a pretrained model that classifies the sentiment of user comments on an e-commerce site across all goods sold. Clearly our problem is a subproblem of the one that the pretrained model already solves.

We can clearly see that we should not always try to build and improve a model from scratch. Instead, we should find out whether this kind of problem has already been tackled, whether similar problems exist, what results were obtained, and whether they can be reused. By taking advantage of transfer learning, we obtained a model that handles information extraction from goods descriptions in Vietnamese customs declarations. Transfer learning with a modern model such as BERT requires long training time, but given the results it brings, and compared with the time and effort that improving a model would otherwise cost, it is worthwhile.

At this point we can answer the question posed in Section 2.4.2, “When should we fine-tune a model, and when should we keep its weights frozen?”, and we can see the point the author made in Section 2.1: a state-of-the-art model cannot always be exploited well on a specialized problem, and the appropriate approach depends on the problem and the available data. Specifically, for this information extraction problem, the model weights need to be fine-tuned.

3.8 Summary

Using transfer learning in general, and deep transfer learning in particular, is an effective approach for problems with little training data. Knowledge from a large, general problem is applied to a small, specific one, which addresses the lack of a large training set. In the author's experiments, the transfer learning model was highly effective when the author's proposed model was replaced by the BERT-NER model. However, not every state-of-the-art architecture gives good results when it is applied to a specific problem. Depending on the problem and its dataset, we can fine-tune the model or keep its weights frozen (freeze model) to retain the features learned by the source model for use in the target model.

The experimental results show that, with a small dataset of automobile descriptions, the system is promising and highly applicable, especially the strategy of carrying over the weights of the pretrained model into the target model. For customs declaration data in which the descriptions of automobiles and vehicles consistently provide the information needed for retrieval, the proposed deep transfer learning model can help Vietnam Customs retrieve and check the criteria used to assess automobiles, vehicles, and complete motorized two-wheelers during “post-clearance inspection” [24]. This success is a foundation for developing models that extract information from customs goods and e-commerce goods (E-Commerce Commodity Customs Product).

CONCLUSION

Within the scope of this thesis, the results achieved include:

• An introduction to and synthesis of the concepts of machine learning, deep learning, information extraction, transfer learning, and deep transfer learning.
• An application of deep transfer learning to information extraction on Vietnamese data, specifically extracting information from customs declarations that describe automobiles and vehicles for air, waterway, and rail transport.
• An analysis of how the experimental models give different results under the different deep transfer learning strategies described in the thesis.

With these results, the thesis shows strong potential for applying transfer learning and deep transfer learning to information extraction when training data are limited. The method is useful and easy to apply in machine learning and deep learning research, and for businesses, companies, individuals, and organizations that have a problem to solve but only limited data, computing resources, or advanced hardware.

Some directions for future work:

• Improve the accuracy of the model and apply it to other fields, for example extracting information from medical text that describes drugs, their uses, and their ingredients.
• Build a larger model to extract information from customs goods and e-commerce customs goods (E-Commerce Commodity Customs Product).

References

[1] F. Chollet, Deep Learning with Python. Manning, Nov. 2017.
[2] C. Aggarwal and C. Zhai, Mining Text Data. Boston, MA: Springer US, June 2012.
[3] D. S. Batista, “Named-entity evaluation metrics based on entity-level,” 2018.
[4] M. T. Bahadori, Y. Liu, and D. Zhang, “A general framework for scalable transductive transfer learning,” Knowl. Inf. Syst., vol. 38, no. 1, pp. 61–83, 2014.
[5] N. Farajidavar, T. deCampos, and J. Kittler, “Adaptive transductive transfer machine,” in Proceedings of the British Machine Vision Conference, BMVA Press, 2014.
[6] D. D. Sarkar, “A comprehensive hands-on guide to transfer learning with real-world applications in deep learning.” https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a
[7] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global, 2010.
[8] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Adaptive Computation and Machine Learning, MIT Press, 2016.
[9] S. V. Lab, “Imagenet dataset.” http://image-net.org/index
[10] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[11] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), vol. 26, Curran Associates, Inc., 2013.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
[15] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” CoRR, vol. abs/1412.3474, 2014.
[16] Li Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, 2006.
[17] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” CoRR, vol. abs/1411.1792, 2014.
[18] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
[19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” ArXiv, vol. 1409, Sept. 2014.
[20] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016.
[21] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[22] H. Nakayama, “seqeval: A python framework for sequence labeling evaluation,” 2018. Software available from https://github.com/chakki-works/seqeval
[23] J. Ács’s blog, “Exploring BERT’s vocabulary.” http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html
[24] C. H. Q. tỉnh Đồng Nai, “Khái quát về kiểm tra sau thông quan.”

APPENDIX A. Related scientific publications

Nguyen Thanh Binh, Huy Anh Nguyen, Pham Ngoc Linh, Nguyen Linh Giang, Tran Ngoc Thang, “Attentive RNN for HS Code Hierarchy Classification on Vietnamese Goods Declaration”, Intelligent Systems and Networks (ICICSN 2021).

Figure A.1: Evidence of the published paper
Figure A.2: Evidence of the conference presentation

Attentive RNN for HS Code Hierarchy Classification on Vietnamese Goods Declaration

Nguyen Thanh Binh (1), Huy Anh Nguyen (2), Pham Ngoc Linh (1), Nguyen Linh Giang (3), and Tran Ngoc Thang (1)

(1) School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Hanoi, Vietnam, thanhbinh030296@gmail.com, phamngoclinh96th@gmail.com, thang.tranngoc@hust.edu.vn
(2) Computer Science Department, Stony Brook University, Stony Brook, NY 11794, USA, anh.h.nguyen@stonybrook.edu
(3) School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam, giang.nguyenlinh@hust.edu.vn

Abstract. The Harmonized Commodity Description and Coding System (HS Code System), created by the World Customs Organization (WCO), is used internationally to classify traded goods from their descriptions. The system uses a four-level hierarchical structure to arrange thousands of different codes. In practice, however, classifying a large number of items manually is labor-intensive and error-prone. To assist customs officers and companies, we propose a deep learning model with a self-attention mechanism alongside hierarchical classification layers to improve the accuracy of classifying short Vietnamese texts from goods declarations. Experimental results indicate the potential of this approach, with high accuracy.

Keywords: Hierarchical classification, Self-attention mechanism, HS Code, Vietnamese text, Recurrent neural network

1 Introduction

As international trade expanded drastically in both volume and value, reducing clearance time, especially the time related to the goods declaration process, became one of the most important goals for every nation. The former systems, however, were regional and inconsistent, so completing the declaration procedure could take a long time, up to a week. Thus, in 1988, the World Customs Organization (WCO) introduced the Harmonized Commodity Description and Coding System (Harmonized System, HS), a complex hierarchical system based on economic activity or component material, in order to standardize the names and numbers of traded products. The system has been widely adopted: over 200 countries and territories use it, and it covers 98% of merchandise in international trade according to [1].

However, choosing the right HS code is not an easy task. A noticeable amount of merchandise is misclassified even though companies employ dedicated experts or experienced agencies for this work. To ease these difficulties, Vietnam Customs provides a website where users enter a goods description and receive a list of candidate HS codes. However, this website, like many third-party services, relies mostly on keyword search, which makes it hard to find codes that match the semantic content rather than just the words themselves. For instance, "83099099, Dây buộc PVC lõi kẽm", "56090000, Dây buộc giày Polyester", and "39269099, Dây buộc cáp nhựa" clearly belong to different codes, yet these tools and websites return the same result for them. For this reason, building an automated HS code classification system that uses knowledge-based techniques to capture the meaning of descriptions is a highly promising way to solve these problems.

Generally, the HS code classification task deals with short texts (one to two sentences) and hierarchical labels. Some approaches tackle similar tasks; for example, [4] proposed a method that uses Background Net to classify goods declarations in chapters 22 and 90. However, descriptions written in Vietnamese have some unique characteristics. Hence, the main purpose of this paper is to introduce an efficient system for classifying declarations written in Vietnamese.

The paper is organized as follows. The introduction presents the importance of the HS code classification problem and some related work. Section 2 introduces the data and the problem formulation. Sections 3 and 4 present the model, the evaluation metrics, and the experimental results. The last section gives the conclusion and discussion.

2 Problem Formulation

The Harmonized System, as mentioned above, is a hierarchical system divided into 99 chapters, 1224 headings, and 5224 subheadings. The first 6 digits of a code are used internationally across the member countries. The WCO also allows members to add digits after the first 6, depending on each country's purposes (normally for statistical reasons). Vietnam uses an 8-digit system to classify commodities, which gives 11371 different HS codes in total (in 2017). This large number of codes, in other words distinct labels in a classification task, makes the problem highly complex.

These labels are also hierarchical. For example, the code 40.16.10.10 can be interpreted as four levels. In our model, we propose using four fully connected (FC) layers, each of which outputs one level of the HS code. The first layer determines the chapter the product belongs to, and the second and third layers produce the heading and subheading, respectively. The last layer decides the HS code by combining the outputs of the three previous layers with its own output.

The second problem lies in the semantic differences between the descriptions and the HS nomenclature. Vietnamese is often ambiguous when words are combined. To deal with this problem, we improve the embedding layer by using a bidirectional LSTM [3]. An attention layer is added after the embedding layer to represent accurately the importance of particular words in each sentence. A more detailed explanation is given in the following section.

The last question is how to cope with HS amendments. The WCO revises the HS every 5-6 years, adding, removing, or changing a small number of codes to reflect changes in international trade. The last amendment was in 2017, when 233 changes were accepted.

3 Methodology

3.1 Preprocessing data

The input of our proposed model is the description of the goods in text form. The raw data are Vietnamese records that are hard to process: some use problematic character encodings, some contain strange characters, and some lack accents. In addition, the data contain many descriptions of the same HS code and the same type of item that differ only in quantity or volume. We cleaned the raw data by removing redundant characters and erroneous characters that destroy the meaning of words in a sentence, and by restoring accents in Vietnamese descriptions that lacked them. To handle the duplicates mentioned above, we used the longest common substring (LCS) method [2] to find descriptions whose similarity is 0.78 or more. Because a description of goods can be a short text without many features, among descriptions with the same HS code we kept the longer, more detailed one for tokenization and removed the sketchy ones.

3.2 Building Model

As mentioned above, we built two deep learning models: one with hierarchical classification and a self-attention mechanism (Figure 1), and one without. We present the first model in this section. It consists of three parts: the word representation layer, the bidirectional LSTM layer with self-attention, and the fully connected layers with the output layer.

Figure 1. The architecture of the model we build. The first part is the word representation. After that, the sentence embedding is made up of two parts: a bidirectional LSTM and the self-attention mechanism. The rest of the model is the fully connected layers and the output layer.

3.2.1 Word representation layer

With the commodity description preprocessed as above, the input of the model is tokenized. Let $n$ be the maximum number of words in a description, with each word represented by a vector $x_t$ of dimension $d$. The output of this layer is $X = (x_1, x_2, \ldots, x_n)$ with $x_t \in \mathbb{R}^d$, $t = 1, \ldots, n$.

3.2.2 Bidirectional LSTM Layer and Self-attention

We build the sentence embedding model on [5], with two components: a bidirectional LSTM layer followed by the self-attention mechanism. We summarize the process here. We feed each word $x_t$ of $X = (x_1, x_2, \ldots, x_n)$ into a bidirectional LSTM to learn mutual information between neighboring words in a sentence:

\[ \overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}) \tag{1} \]
\[ \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t+1}}) \tag{2} \]

Each vector $h_t$, the concatenation of $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, is passed to the self-attention mechanism to form a matrix sentence embedding. The purpose is to represent several aspects of the sentence semantics. Hence, the output of this part is a matrix sentence embedding. It is discussed in more detail in [5].

3.2.3 Four Fully Connected Layers and Output Layer

Our model uses four fully connected layers, each connected to its own output: $y_{\text{chapter}}$, $y_{\text{heading}}$, $y_{\text{sub heading}}$, and $y_{\text{country extension}}$, where each $y_i$ is the prediction probability for the corresponding label $i$.

The base model has no attention mechanism and no hierarchical classes. The hierarchy is replaced with a softmax layer connected immediately after the last fully connected layer.

4 Experiment and Results

4.1 Evaluation Metrics

We present the measures used to evaluate the models in our experiments: precision, recall, and F1-score. Assume that $B(TP, TN, FP, FN)$ is a binary evaluation measure computed from the true positives ($TP$), true negatives ($TN$), false positives ($FP$), and false negatives ($FN$). Then

\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}. \]

To combine these two metrics, their harmonic mean, the F1-score, is used:

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \]

The average result over all classes is determined by micro-averaging. The micro-average method sums the individual true positives (TP), false positives (FP), and false negatives (FN) of the different classes and applies the formulas above to the totals. For $n$ classes, the average precision and recall are

\[ \text{Micro-average of precision} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FP_i)}, \qquad \text{Micro-average of recall} = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} (TP_i + FN_i)}. \]

The micro F1-score is simply the harmonic mean of the micro-averaged precision and recall.

4.2 Data Description and Preparation

We gathered the dataset from real transactions recorded by the Vietnam Customs Department. We purposely chose the goods descriptions from chapter 03, chapter 62, chapter 85, chapter 38, and chapter 40, since they are heavily misclassified and have a large impact on detecting tax evasion. Normally, a goods description has no more than one sentence and falls into only one HS code. Hence, we feed pairs of a sentence and an HS code into our model. Before using the data, we performed preprocessing that consists of removing numbers and punctuation, lowercasing all characters, and removing duplicated records in each category. We used customs declarations with Vietnamese descriptions in chapters 3, 62, 85, 30, 38, and 40, available from 2017 to the present, with 1,167 different labels and 751,328 samples.

Table 1: The description data

Chapter              3       62       85       30       38      40
Number of labels     201     168      459      88       104     147
Number of samples    84306   187266   159890   197308   56240   66318

4.3 Results

We tested and compared the performance of two models: the hierarchical model and the baseline model. For both the medium dataset (chapters 3, 62, and 85) and the large dataset (chapters 3, 62, 85, 30, 38, and 40), we used 200 hidden units in the LSTM and four fully connected layers with 100 units each. For the baseline on the large dataset we used 1,167 units per layer, while the hierarchical model kept 100 units per layer. Both models were trained with the Adam optimizer, a learning rate of 0.001, and a batch size of 1024. We ran 20 and 50 epochs for the medium-scale and large-scale datasets, respectively. We split the data into 80% for training and 20% for testing. We used an early stopping strategy to avoid overfitting and accelerate training. Table 2 gives the micro F1-score for each part of an HS code. Table 3 gives the micro F1-score for each model. We used the same training and test sets for both models, but the baseline model failed to make useful predictions on the larger dataset.

Table 2: Micro F1-Score of hierarchical model for datasets

                     Medium dataset                  Large dataset
                     Precision  Recall  Micro F1     Precision  Recall  Micro F1
Chapter              0.9981     0.9981  0.9981       0.9587     0.9587  0.9587
Heading              0.9015     0.9015  0.9015       0.8507     0.8507  0.8507
Sub heading          0.8004     0.8004  0.8004       0.7556     0.7556  0.7556
Country extension    0.8207     0.8207  0.8207       0.7687     0.7687  0.7687
First digits         0.7778     0.7778  0.7778       0.7279     0.7279  0.7279
Full HS Code         0.6982     0.6982  0.6982       0.6443     0.6443  0.6443

Table 3: Performance micro F1-score comparison of models

Dataset               Medium dataset   Large dataset
Baseline model        0.7500           0.0436
Hierarchical model    0.6982           0.6443

5 Conclusion and Discussion

We demonstrated a novel architecture for classifying HS codes automatically. Although the results have some limitations, the work introduces the idea of a hierarchical HS code classification model, which could help solve HS code classification as the number of codes grows. We observed that as the amount of data and the number of labels grew, the baseline model became ineffective at classification. For the baseline model, the error rate over all digits is higher than for the hierarchical model: each code is treated separately, not hierarchically, so a wrong code and the correct code cannot be related to each other. For the hierarchical model, an incorrectly predicted HS code can still share the chapter, heading, and subheading with the correct code and differ only in the country extension. We will improve the model in the future by incorporating Vietnamese word processing techniques or pretrained Vietnamese embeddings. In addition, we will combine deep learning models to achieve better results on a larger dataset.

Acknowledgement

This paper is supported by CMC Institute of Science and Technology (CIST), CMC Corporation, and Hanoi University of Science and Technology, Vietnam. This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2019.316.

References

1. World Customs Organization. http://www.wcoomd.org/
2. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, USA (1997)
3. Schuster, M., Paliwal, K. K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673-2681 (1997). doi:10.1109/78.650093
4. Liya, D., Zhen, Z. F., Dong, L. C.: Auto-Categorization of HS Code Using Background Net Approach. Procedia Computer Science 60, pp. 1462-1471 (2015). doi:10.1016/j.procs.2015.08.224
5. Lin, Z., Feng, M., Dos Santos, C., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A Structured Self-attentive Sentence Embedding. ICLR Conference (2017). ArXiv preprint arXiv:1703.03130
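The micro-averaged metrics defined in Section 4.1 of the paper above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function name `micro_scores` and the toy chapter labels are assumptions made for this example. The sketch pools true positives, false positives, and false negatives over all classes before computing the ratios.

```python
from collections import Counter


def micro_scores(y_true, y_pred):
    """Return micro-averaged (precision, recall, f1) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    # Pool the per-class counts before dividing (micro-averaging).
    sum_tp, sum_fp, sum_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    precision = sum_tp / (sum_tp + sum_fp) if sum_tp + sum_fp else 0.0
    recall = sum_tp / (sum_tp + sum_fn) if sum_tp + sum_fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical chapter labels, for illustration only.
y_true = ["03", "62", "85", "85", "40"]
y_pred = ["03", "62", "85", "40", "40"]
precision, recall, f1 = micro_scores(y_true, y_pred)
print(precision, recall, f1)
```

When every sample has exactly one true label and one predicted label, each mistake adds one false positive and one false negative, so the pooled precision, recall, and F1 coincide. This is consistent with the identical Precision, Recall, and Micro F1 columns in Table 2.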

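The thesis labels tokens with the B-I-O scheme and evaluates the NER models at the entity level, following [22] and [3]. The sketch below shows one way such an evaluation can work in plain Python; it does not call seqeval and is not the author's code. The function names `bio_to_spans` and `entity_f1` are hypothetical, and of the labels used, only NAME appears in the thesis excerpt; BRAND and MODEL are assumptions for illustration.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (label, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an entity of the
    same label is treated as the start of a new entity (a lenient choice).
    """
    spans = set()
    label, start = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.add((label, start, i))
            label, start = None, None
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    return spans


def entity_f1(true_tags, pred_tags):
    """Entity-level F1: a predicted span counts only on an exact match."""
    true_spans = bio_to_spans(true_tags)
    pred_spans = bio_to_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(true_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags for a six-token goods description.
true_tags = ["B-NAME", "I-NAME", "I-NAME", "B-BRAND", "B-MODEL", "O"]
pred_tags = ["B-NAME", "I-NAME", "O", "B-BRAND", "B-MODEL", "O"]
score = entity_f1(true_tags, pred_tags)
print(sorted(bio_to_spans(true_tags)), score)
```

In this example the predicted NAME entity is one token too short, so it does not count as correct even though two of its three tokens match. This strictness is what distinguishes entity-level scoring from token-level accuracy.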
Title	Information Extraction from Text Using Deep Transfer Learning
Author	Nguyễn Thanh Bình
Supervisor	Dr. Trần Ngọc Thăng
Institution	Hanoi University of Science and Technology
Major	Applied Mathematics and Informatics
Type	Master's thesis
Year of publication	2021
City	Hanoi

Format
Pages	60
File size	3.77 MB
