HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS

Improving the quality of Inverse text normalization based on neural network and numerical entities recognition

PHAN TUAN ANH
Anh.PT211263M@sis.hust.edu.vn
School of Information and Communication Technology

Supervisor: Associate Professor Le Thanh Huong (Supervisor's signature)
School: Information and Communication Technology

May 15, 2023


SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Full name of the thesis author: Phan Tuấn Anh
Thesis topic: Improving the quality of inverse text normalization based on neural networks and numerical entity recognition
Major: Data Science and Artificial Intelligence (Elitech)
Student ID: 20211263M

The author, the scientific supervisor, and the thesis examination committee confirm that the author has corrected and supplemented the thesis according to the minutes of the committee meeting on 22/4/2023, with the following contents:
1. Corrected spelling errors and reviewed the wording and layout of the thesis.
2. Removed descriptive text from the captions of figures and tables.
3. Replaced the old figures 3.1 and 3.2 with new figures 3.1 and 3.2.
4. Replaced the old figure 1.1 with a new figure 1.1 to describe the role of the ITN module in a speech processing system.
5. Added a detailed evaluation of the individual modules (in section 4.4.1.1).
6. Added an analysis of the errors encountered on the Vietnamese data (in section 4.4.1.2).
7. Added examples of the types of errors the model makes on the Vietnamese data (in section 4.5).
8. Added the data construction process for Vietnamese and clarified the label set (in section 3.3).
9. Removed tables 4.2 and 4.3 and replaced them with charts 4.1, 4.2, 4.3, 4.4, 4.5, and 4.6 to visualize the results.
10. Added remarks explaining why the Vietnamese dataset gives lower results than the English dataset (in section 4.4.1.2).
11. Revised the references (arXiv citations are no longer used).

Date: ..... / ..... / .....
Supervisor                                             Author of the thesis

CHAIR OF THE EXAMINATION COMMITTEE


Graduation Thesis Assignment

Name: Phan Tuan Anh
Phone: +84355538467
Email: Anh.PT211263M@sis.hust.edu.vn; phantuananhkt2204k60@gmail.com
Class: 21A-IT-KHDL-E
Affiliation: Hanoi University of Science and Technology

I, Phan Tuan Anh, hereby declare that this thesis on the topic "Improving the quality of Inverse text normalization based on neural network and numerical entities recognition" is my personal work, performed under the supervision of Associate Professor Le Thanh Huong. All data used for analysis in the thesis come from my own research, analyzed objectively and honestly, with a clear origin, and have not been published in any form. I take full responsibility for any dishonesty in the information used in this study.

Student Signature and Name


Acknowledgement

I wish that a few lines of short text could help me convey my most sincere gratitude to my supervisor, Associate Professor Le Thanh Huong, who has driven and encouraged me throughout the years of my master's course. She has listened to my ideas and given me much valuable advice on my proposal. Besides, she also pointed out the shortcomings of my thesis, which was very helpful for perfecting it.

I would like to thank Dr. Bui Khac Hoai Nam and the other members of the NLP team at Viettel Cyberspace Center, who always support me and provide me with foundational knowledge. In particular, my leader, Mr. Nguyen Ngoc Dung, always creates favorable conditions for me to conduct extensive experiments in this study.

Last but not least, I would like to thank my family, who play the most important role in my life. They are constantly my motivation to accept and overcome the challenges I face at this time.


Abstract

Neural inverse text normalization (ITN) has recently become an emerging approach to post-processing the output of automatic speech recognition for readability. In particular, leveraging ITN with neural network models has achieved remarkable results, instead of relying on the accuracy of manual rules. However, ITN is a highly language-dependent task and is especially tricky in ambiguous languages. In this study, we focus on improving the performance of the ITN task by combining neural network models with rule-based systems. Specifically, we first use a seq2seq model to detect numerical segments (e.g., cardinals, ordinals, and dates) in input sentences. The detected segments are then converted into written form by rule-based systems. Technically, a major difference in our method is that we only use neural network models to detect numerical segments, which makes it possible to deal with low-resource and ambiguous scenarios in target languages. In addition, to further improve the quality of the proposed model, we also integrate a pre-trained language model, BERT, and a variant of BERT (RecogNum-BERT) as initialization points for the parameters of the encoder. Regarding the experiments, we evaluate two languages, English and Vietnamese, to indicate the advantages of the proposed method. Accordingly, the empirical evaluations provide promising results for our method compared with state-of-the-art models in this research field, especially in low-resource and complex data scenarios.

Student Signature and Name
TABLE OF CONTENTS

CHAPTER 1. Introduction
  1.1 Research background
  1.2 Research motivation
  1.3 Research objective
  1.4 Related publication
  1.5 Thesis organization
CHAPTER 2. Literature Review
  2.1 Related works
    2.1.1 Rule-based methods
    2.1.2 Neural network model
    2.1.3 Hybrid model ..... 11
  2.2 Background ..... 11
    2.2.1 Encoder-decoder model ..... 11
    2.2.2 Transformer ..... 15
    2.2.3 BERT ..... 18
CHAPTER 3. Methodology ..... 20
  3.1 Baseline model ..... 20
  3.2 Proposed framework ..... 20
  3.3 Data creation process ..... 23
  3.4 Number recognizer ..... 24
    3.4.1 RNN-based and vanilla transformer-based ..... 24
    3.4.2 BERT-based ..... 25
    3.4.3 RecogNum-BERT-based ..... 25
  3.5 Number converter ..... 28
CHAPTER 4. Experiment ..... 30
  4.1 Datasets ..... 30
  4.2 Hyper-parameter configurations ..... 31
    4.2.1 RNN-based and vanilla transformer-based configurations ..... 31
    4.2.2 BERT-based and RecogNum-BERT-based configurations ..... 32
  4.3 Evaluation metrics ..... 33
    4.3.1 Bilingual evaluation understudy (BLEU) ..... 33
    4.3.2 Word error rate (WER) ..... 33
    4.3.3 Number precision (NP) ..... 34
  4.4 Result and Analysis ..... 34
    4.4.1 Experiments without pre-trained LM ..... 35
    4.4.2 Experiments with pre-trained LM ..... 41
  4.5 Visualization ..... 43
CHAPTER 5. Conclusion ..... 44
  5.1 Summary ..... 44
  5.2 Future work ..... 45

LIST OF FIGURES

1.1 The role of the inverse text normalization module in spoken dialogue systems
2.1 The pipeline of the NeMo toolkit for inverse text normalization
2.2 The overview of the encoder-decoder architecture; the example is machine translation (English → Vietnamese) ..... 12
2.3 The overview of using an LSTM as the encoder block (left) and the architecture of the LSTM (right) ..... 14
2.4 The illustration of the decoding process ..... 15
2.5 The general architecture of the vanilla Transformer, introduced in [20] ..... 16
2.6 The description of scaled dot-product attention (left) and multi-head attention (right) ..... 17
2.7 The overview of the pre-training procedure of BERT, which is trained on a large corpus with next sentence prediction and masked token prediction ..... 18
3.1 The overview of my baseline model (the seq2seq model for the ITN problem) [7] ..... 21
3.2 The general framework of the proposed method (hybrid model) for the neural ITN approach ..... 21
3.3 The overview of the data creation pipeline for training the Number Recognizer ..... 23
3.4 The training and inference processes of applying BERT as the initialization of the encoder of the proposed model ..... 26
3.5 The overview of the training and inference processes of my proposed model when integrating RecogNum-BERT ..... 27
3.6 Our architecture for creating RecogNum-BERT ..... 28
3.7 The pipeline of data preparation for fine-tuning RecogNum-BERT ..... 29
4.1 The comparison of models on the English test set with BLEU score (higher is better) ..... 36
4.2 The comparison of models on the English test set with WER score (lower is better) ..... 37
4.3 The comparison of models on the English test set with NP score (higher is better) ..... 38
4.4 The comparison of models on the Vietnamese test set with BLEU score (higher is better) ..... 39
4.5 The comparison of models on the Vietnamese test set with WER score (lower is better) ..... 40
4.6 The comparison of models on the Vietnamese test set with NP score (higher is better) ..... 41
4.4.1 Experiments without pre-trained LM

4.4.1.1 Results of separate modules

a) Results of Number Recognizer: To evaluate the performance of the Number Recognizer, we report the results of the seq2seq model on the validation set with the BLEU score (table 4.2).

Table 4.2: The results of the Number Recognizer on the validation set (BLEU score).

| Validation / Train set | English RNN | English Transformer | Vietnamese RNN | Vietnamese Transformer |
| 20k / 80k              | 0.9028      | 0.8396              | 0.855          | 0.8327                 |
| 40k / 160k             | 0.9128      | 0.8888              | 0.8635         | 0.8928                 |
| 100k / 400k            | 0.915       | 0.9025              | 0.8591         | 0.8986                 |
| 200k / 800k            | 0.917       | 0.9323              | 0.8559         | 0.8988                 |

Accordingly, there are several crucial conclusions that I can draw:

• Overall, the Number Recognizer shows good quality for both the RNN-based and the Transformer-based approach. The BLEU score of the module increases gradually as I supplement additional training data. When the volume of data reaches its peak (1m samples), the Number Recognizer obtains a 0.9323 BLEU score with the Transformer model in English and a 0.8988 BLEU score in Vietnamese.

• Regarding the English dataset, applying the RNN for the Number Recognizer gives a better performance than the Transformer in the range from 100k to 200k samples. This shows the advantage of the RNN-based model over the Transformer-based model in the low-resource setting. By contrast, when the volume of data is increased (from 500k to 1m), the performance of the Transformer-based model is better than that of the RNN-based one. Besides, as I gradually supplement more data for the Number Recognizer, the performance of the RNN plateaus at approximately 0.91, while the Transformer-based model still improves (roughly 0.93).

• On the Vietnamese dataset, I see a similar trend. Notably, the performance on Vietnamese is lower than on English in all cases. The crucial reason behind this phenomenon is that the average sequence length of the Vietnamese samples is longer than that of the English ones (table 4.1), which may have a negative influence on the performance of the seq2seq model.

b) Results of Number Converter: A validation dataset for measuring the performance of the Number Converter has to meet the following requirements:

• The input sentence must supply the crucial information: the content of the entities (e.g., twenty-two, November, ten dollars) and the type of number that the content belongs to (e.g., DATE, TIME, MEASURE, ...).

• The gold output must supply the written form of the corresponding numerical entities (e.g., 22, 10$, ...).

Unfortunately, there is no available dataset that satisfies the aforementioned criteria. Therefore, I do not conduct a separate experiment to measure the performance of the Number Converter alone. The quality of this module is demonstrated together with the performance of the hybrid model; a minimal sketch of such a rule-based converter is given below.
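To make the role of this module concrete, the following is a minimal, illustrative sketch of a rule-based Number Converter. It assumes the Number Recognizer has already produced (segment, entity type) pairs; the entity labels (CARDINAL, MONEY) and the helper names are chosen here only for illustration and are not the exact rule set or label inventory used in this thesis.

```python
# Minimal sketch of a rule-based Number Converter (illustrative, not the thesis rule set).
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
SCALES = {"thousand": 1_000, "million": 1_000_000,
          "billion": 1_000_000_000, "trillion": 1_000_000_000_000}

def words_to_number(words: str) -> int:
    """Convert a spoken-form English cardinal into an integer."""
    total, current = 0, 0
    for w in words.replace("-", " ").split():
        if w in UNITS:
            current += UNITS[w]
        elif w in TENS:
            current += TENS[w]
        elif w == "hundred":
            current *= 100
        elif w in SCALES:              # thousand / million / billion / trillion
            total += current * SCALES[w]
            current = 0
        elif w == "and":
            continue
        else:
            raise ValueError(f"unknown token: {w}")
    return total + current

def convert(segment: str, entity_type: str) -> str:
    """Render one detected segment in written form."""
    if entity_type == "CARDINAL":
        return str(words_to_number(segment))
    if entity_type == "MONEY":
        # Assume the currency word is the last token, e.g. "five hundred thousand dollars".
        *number_words, currency = segment.split()
        symbol = {"dollar": "$", "dollars": "$"}.get(currency, currency)
        return f"{symbol}{words_to_number(' '.join(number_words))}"
    return segment                     # unknown types pass through unchanged

if __name__ == "__main__":
    print(convert("five hundred thousand dollars", "MONEY"))                         # $500000
    print(convert("four million eight hundred eleven thousand one", "CARDINAL"))     # 4811001
```

In the actual system, the rule set would have to cover every entity type in the label set (DATE, TIME, MEASURE, ...) as well as the Vietnamese number system, which is where most of the hand-crafted effort lies.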
4.4.1.2 Results of hybrid model

Figures 4.1, 4.2, and 4.3 show the comparison results of my experiments on the English test set in BLEU, WER, and NP score, respectively.

Figure 4.1: The comparison of models on the English test set with BLEU score (higher is better)
Figure 4.2: The comparison of models on the English test set with WER score (lower is better)
Figure 4.3: The comparison of models on the English test set with NP score (higher is better)

Note that for the BLEU and NP scores higher is better, contrary to the case of the WER score (a sketch of the WER computation appears after the following list). Accordingly, there are several observations I can summarize based on the results:

• For the smallest amount of data (100k), my method using the RNN as the backbone achieves the best score on all of the criteria: 0.8334 for BLEU, 0.1017 for WER, and 0.8566 for NP. In both the baseline and my proposal, the performance of the RNN surpasses that of the Transformer.

• For 200k samples, my proposed model with the RNN obtains the highest BLEU score: 0.8353, whereas applying the Transformer model gives the best WER: 0.1003 and the highest NP: 0.8611.

• In the range from 100k to 200k samples, my proposed model shows its advantage over the baseline. Additionally, the RNN-based backbone benefits the model significantly more than the Transformer-based one on the smallest dataset (100k). As the amount of data progressively increases, the performance of the Transformer-based model becomes better than that of the RNN-based one.

• On the 500k dataset, the Transformer dominates in both the baseline model and my proposed model. In particular, the baseline achieves the best BLEU score (0.8558) and WER score (0.0708), while my proposed model achieves the best result on the NP score: 0.8687. At this volume of data, the baseline with only the seq2seq model starts to overtake my proposed model. This phenomenon shows that, when facing rich resources (more than 500k samples in English), the strength of the neural network comes into play: in my proposed model, the Number Converter stage is built from a set of rules, which generalizes less well and is therefore at a disadvantage compared with the end-to-end model.

• For the largest volume of data (1m samples), the best values of all criteria (BLEU, WER, and NP) are obtained by the baseline model with the Transformer. These values are not only remarkably higher than those of the RNN-based model but also slightly higher than the best figures from my proposed model: 0.9138 versus 0.8933 in BLEU, 0.0405 versus 0.0517 in WER, and 0.9318 versus 0.9105 in NP.

• Based on the aforementioned results, the three types of evaluation metrics show a high correlation when evaluating the performance of the ITN problem. Most importantly, the value 500k can be considered the threshold that distinguishes the low-resource from the rich-resource regime for the English dataset.
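For reference, the WER used in these comparisons is the standard word error rate: the word-level edit distance between hypothesis and reference divided by the reference length. The short sketch below is a minimal illustration of this definition; the thesis may compute it with an existing toolkit, and the example sentences are invented.

```python
# Minimal sketch of word error rate (WER) via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # One substituted word out of five reference words -> WER = 0.2
    print(wer("up to $500000 in cash", "up to $5000000 in cash"))
```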
The results of my experiments in Vietnamese are presented in figures 4.4, 4.5, and 4.6.

Figure 4.4: The comparison of models on the Vietnamese test set with BLEU score (higher is better)
Figure 4.5: The comparison of models on the Vietnamese test set with WER score (lower is better)
Figure 4.6: The comparison of models on the Vietnamese test set with NP score (higher is better)

In particular, several remarkable points can be highlighted:

• The 100k data size shows the best results for my proposed model. Specifically, the best BLEU score is 0.7019 and the best WER score is 0.1897 when using the RNN model, whereas the best NP is 0.6359 when using the Transformer model. Notably, the gap between the two methods is very large. Regarding the BLEU score, my RNN-based model is significantly higher than the baseline by 0.025; the gaps for WER and NP are 0.06 and 0.3, respectively. This large margin is clear evidence of the efficiency of my approach when the number of samples is extremely small.

• For the 200k data size, my method with the Transformer model leads on all criteria: 0.7286 for BLEU, 0.1774 for WER, and 0.6775 for NP. The baseline, although reaching an approximately equal BLEU value, is notably worse than my method by roughly 0.04 in WER and 0.2 in NP.

• Similar trends appear for the 500k and 1m data sizes. Firstly, when applying the RNN model, providing more data does not bring an improvement for either the baseline model or my approach, which indicates the limited ability of the RNN model when applied to Vietnamese. Leveraging the Transformer in my method still surpasses the remaining settings: my method achieves 0.7885 for the BLEU score, 0.1199 for the WER score, and 0.699 for the NP score. Compared to the baseline method, my results are still considerably higher, by 0.03 in BLEU, 0.06 in WER, and 0.1 in NP.

When comparing the performance of the baseline and proposed models in Vietnamese and English, it is clear that the outcomes in Vietnamese tend to be lower than those in English. For instance, with 1m samples, our proposed model + Transformer obtains only a 0.7885 BLEU score, compared to 0.8933 in English. The lower results on the Vietnamese test set may come from several issues:

• The average sequence length of the Vietnamese samples is quite long (as in table 4.1). This may have a strong influence on the performance of the seq2seq task on the test set.

• On the Vietnamese test set, the seq2seq model may produce repeated words, which severely harms the performance of the hybrid model.

• On the Vietnamese test set, there are cases where one number can be written in different forms, for instance '500 nghìn' and '500.000' (five hundred thousand). Although they carry the same meaning, this difference can lead to a lower BLEU or WER score.

• Some examples that fail in Vietnamese can be seen in table 4.6.

In both English and Vietnamese, the recurrent seq2seq models with attention achieve better performance than the Transformer in low-resource scenarios, while Transformer-based models achieve the best results as the number of training samples increases. Therefore, combining the two methods (hybrid models) could further improve performance; I take this issue as future work for this study.
4.4.2 Experiments with pre-trained LM

Table 4.3 reports my BLEU results for four approaches to building the number recognizer: the vanilla Transformer, BERT, RecogNum-BERT with numerical loss plus mask token prediction loss, and RecogNum-BERT with numerical loss only.

Table 4.3: Comparison between variants of the proposed method with different encoders for the number recognizer (vanilla Transformer, BERT, RecogNum-BERT) in BLEU score.

| Training data                              | 100k   | 200k   | 500k   | 1m     |
| Our-Vanilla Transformer                    | 0.741  | 0.8188 | 0.8394 | 0.8933 |
| Our-BERT                                   | 0.8697 | 0.8772 | 0.8779 | 0.8794 |
| Our-RecogNum-BERT (num loss + mask loss)   | 0.8554 | 0.8738 | 0.8759 | 0.8781 |
| Our-RecogNum-BERT (num loss)               | 0.8662 | 0.8782 | 0.879  | 0.8801 |

Looking at the table, I can highlight several remarkable points:

• It is clear that in the range from 100k to 500k samples, my method with BERT and the other variants of BERT shows a significant improvement over using only the vanilla Transformer model. This difference is largest when the data has only 100k samples and gradually decreases as I supplement data. This evidence shows the efficiency of my proposal of taking advantage of a pre-trained language model to improve the quality of the Number Recognizer module in the low-resource setting.

• Nevertheless, similar to the results in section 4.4.1, when the size of the data is large enough, the knowledge from the pre-trained language model may no longer be helpful for this module. This conclusion is clear when I look at the results in the final column (1m), in which the BLEU score for the vanilla Transformer reaches its peak at 0.8933, while the figures for the three remaining methods are only 0.8794, 0.8781, and 0.8801, respectively.

• Looking at the performance of the three pre-trained models, over the whole range of data, RecogNum-BERT with both the numerical loss and the mask token prediction loss brings the least benefit. Meanwhile, the performance of RecogNum-BERT with the numerical loss only is comparable to BERT on the 100k dataset and surpasses all others on the remaining datasets. Despite being only a minimal improvement, adding the numerical loss prediction also demonstrates its contribution to building a better Number Recognizer.

Table 4.4 expresses my results in WER score for all of my approaches over the whole range of data.

Table 4.4: Comparison between variants of the proposed method with different encoders for the number recognizer (vanilla Transformer, BERT, RecogNum-BERT) in WER score.

| Training data                              | 100k    | 200k   | 500k   | 1m      |
| Our-Vanilla Transformer                    | 0.152   | 0.1003 | 0.0874 | 0.0517  |
| Our-BERT                                   | 0.06195 | 0.0565 | 0.0542 | 0.0535  |
| Our-RecogNum-BERT (num loss + mask loss)   | 0.0783  | 0.059  | 0.0569 | 0.05398 |
| Our-RecogNum-BERT (num loss)               | 0.0666  | 0.0550 | 0.0538 | 0.05295 |

According to this table, I can easily see the following points:

• There is a strong correlation between the BLEU score and the WER score, as a similar trend appears for both criteria. With low resources like 100k, 200k, and even 500k samples, applying BERT and the variants of BERT still brings considerable improvements.

• Although the improvement is not large, adding the numerical loss to the original BERT shows its efficiency in advancing the performance of the whole model (a brief sketch of this encoder-initialization idea follows below).
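For illustration, the sketch below shows the general idea of reusing pre-trained encoder weights before fine-tuning the number recognizer. It uses the Hugging Face transformers package and assumes a bert-base-cased checkpoint; the actual integration in this thesis (and the RecogNum-BERT variant) may differ in checkpoint, tokenization, and how the decoder is attached.

```python
# Minimal sketch: initialize the number-recognizer encoder from pre-trained BERT weights.
# The checkpoint name and the (omitted) decoder wiring are illustrative assumptions.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased")   # pre-trained weights as the starting point

# Encode a spoken-form sentence; a decoder (not shown) would then generate the
# tagged output consumed by the Number Converter.
batch = tokenizer(["up to five hundred thousand dollars"], return_tensors="pt")
with torch.no_grad():
    hidden_states = encoder(**batch).last_hidden_state   # shape: (1, seq_len, 768)
print(hidden_states.shape)
```

During fine-tuning, these encoder parameters would be updated together with the decoder, which is what allows the pre-trained knowledge to help most in the 100k-500k low-resource range observed above.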
4.5 Visualization

In this section, I visualize some examples in English in both spoken form and written form (table 4.5). Besides, some samples that are predicted wrongly in Vietnamese can be found in table 4.6.

Table 4.5: Examples of prediction errors of the baseline model in English.

| Spoken form | Baseline method | Our method |
| up to five hundred thousand dollars | up to $5000000 | up to $500000 |
| one hundred eleven point nine people | 11.9 people | 111.9 people |
| nine trillion seven hundred eighty one billion nine hundred eight million four hundred forty three thousand seven | 97819984437 | 9781908443007 |
| four million eight hundred eleven thousand one | 48111 | 4811001 |

Table 4.6: Examples of error predictions of the proposed model in Vietnamese.

| Spoken form (Vietnamese - English) | Our method | Gold answer |
| dãy số may mắn không một không bảy không tám không chín (the lucky number sequence is zero one zero seven zero eight zero nine) | dãy số may mắn 1789 (the lucky number sequence is 1789) | dãy số may mắn 01 07 08 09 (the lucky number sequence is 01 07 08 09) |
| từ sáu trăm bảy trăm năm mươi nghìn (lost from six hundred seven hundred and fifty thousand) | từ 600 750000 (lost from 600 750000) | từ 600 750 nghìn (lost from 600 750 thousand) |
| số điện thoại không tám hai hai một năm hai ba sáu chín (the phone number zero eight two two one five two three six nine) | số điện thoại 2215 2369 (the phone number 2215 2369) | số điện thoại (08)22152369 (the phone number (08)22152369) |
| sáu đến mười một phần tinh bột (six to eleven parts starch) | 6-11phầntinhbột (6-11phầntinhbột) | 6-11 phần tinh bột (6-11 parts starch) |


Chapter 5. Conclusion

5.1 Summary

In this study, I introduce a new method for the neural ITN approach. Specifically, differing from previous works, I divide the neural ITN problem into two stages. In the first stage, a neural model is used to detect numerical segments: the number recognizer. Subsequently, the written form is extracted based on a set of rules in the second stage: the number converter. In this regard, my method is able to deal with low-resource scenarios, where not much data is available for training. Furthermore, I showed that my method can easily be extended to other languages without requiring linguistic knowledge. The evaluation on two different language datasets (i.e., English and Vietnamese) with different sizes of training samples (i.e., 100k, 200k, 500k, and 1000k) indicates that my method achieves comparable results in English and the highest results in Vietnamese. Moreover, by leveraging the strength of a pre-trained language model (BERT), I improve the performance by using its parameters as the initial parameters of the number recognizer. In addition, to further explore the effect of additional knowledge about the appearance of numerical entities, I propose a new variant of BERT, called RecogNum-BERT, apply it successfully to the number recognizer, and observe minimal improvements. Finally, to my knowledge, my work is the first study that considers the ITN problem under the low-resource data scenario. I hope this work can promote research interest in enhancing the performance of ITN tasks in Vietnamese and other low-resource languages.
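To make the two-stage design concrete, the following is a minimal, self-contained sketch of how the stages fit together. The tag format, the stubbed recognizer, and the toy converter are illustrative assumptions only; in the actual system the recognizer is a trained seq2seq model and the converter is a full rule set covering every entity type.

```python
# Minimal sketch of the two-stage hybrid pipeline (illustrative assumptions, not the thesis code).
import re

def recognize(spoken: str) -> str:
    """Stand-in for the seq2seq Number Recognizer: wraps numerical segments in
    <TYPE>...</TYPE> tags. A trained model would produce this tagged output."""
    return spoken.replace("five hundred thousand dollars",
                          "<MONEY> five hundred thousand dollars </MONEY>")

def convert_money(segment: str) -> str:
    """Toy rule: spoken-form '<number words> dollars' -> '$<digits>'."""
    values = {"five": 5, "hundred": 100, "thousand": 1000}
    total, current = 0, 0
    for w in segment.split():
        if w == "dollars":
            continue
        if values[w] < 100:
            current += values[w]
        elif w == "hundred":
            current *= 100
        else:                         # thousand
            total += current * 1000
            current = 0
    return f"${total + current}"

def inverse_normalize(spoken: str) -> str:
    """Glue the two stages: tag numerical segments, then rewrite only those segments."""
    tagged = recognize(spoken)
    return re.sub(r"<MONEY>\s*(.*?)\s*</MONEY>",
                  lambda m: convert_money(m.group(1)), tagged)

print(inverse_normalize("up to five hundred thousand dollars"))   # up to $500000
```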
5.2 Future work

Regarding future work for this study, I have some ideas to focus on in order to further advance the quality of my proposed model:

• As shown by the results in section 4.4.2, transferring knowledge from a pre-trained language model into the number recognizer brings large advantages. However, in this work I only take a well-known model, BERT, as the backbone of my experiments. BERT is trained on corpora from many domains and has the ability to generalize to many NLP tasks. This trait also makes BERT less competitive than a specific pre-trained model that belongs to a single domain. Therefore, I would like to build a pre-trained language model whose data comes entirely from the spoken-language domain; the data would be collected from the spoken-language output of the ASR system. Nevertheless, this process can require severe computational resources and time.


REFERENCES

[1] S. Pramanik and A. Hussain, "Text normalization using memory augmented neural networks", Speech Commun., vol. 109, pp. 15–23, 2019. DOI: 10.1016/j.specom.2019.02.003.
[2] P. Ebden and R. Sproat, "The Kestrel TTS text normalization system", Nat. Lang. Eng., vol. 21, no. 3, pp. 333–353, 2015. DOI: 10.1017/S1351324914000175.
[3] Y. Zhang, E. Bakhturina, and B. Ginsburg, "NeMo (inverse) text normalization: From development to production", in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August – 3 September 2021, H. Hermansky, H. Černocký, L. Burget, L. Lamel, O. Scharenborg, and P. Motlíček, Eds., ISCA, 2021, pp. 4857–4859. [Online]. Available: http://www.isca-speech.org/archive/interspeech_2021/zhang21ja_interspeech.html.
[4] H. Zhang, R. Sproat, A. H. Ng, et al., "Neural models of text normalization for speech applications", Comput. Linguistics, vol. 45, no. 2, pp. 293–337, 2019. DOI: 10.1162/coli_a_00349.
[5] E. Pusateri, B. R. Ambati, E. Brooks, O. Plátek, D. McAllaster, and V. Nagesha, "A mostly data-driven approach to inverse text normalization", in Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech), ISCA, 2017, pp. 2784–2788.
[6] M. Ihori, A. Takashima, and R. Masumura, "Large-context pointer-generator networks for spoken-to-written style conversion", in Proceedings of the 45th International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 8189–8193. DOI: 10.1109/ICASSP40776.2020.9053930.
[7] M. Sunkara, C. Shivade, S. Bodapati, and K. Kirchhoff, "Neural inverse text normalization", in Proceedings of the 46th International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 7573–7577. DOI: 10.1109/ICASSP39728.2021.9414912.
[8] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks", in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 3104–3112. [Online]. Available: https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
[9] R. Sproat and N. Jaitly, "An RNN model of text normalization", in Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech), ISCA, 2017, pp. 754–758.
[10] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition", in Proceedings of the 41st International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 4960–4964. DOI: 10.1109/ICASSP.2016.7472621.
[11] S. Yolchuyeva, G. Németh, and B. Gyires-Tóth, "Text normalization with convolutional neural networks", Int. J. Speech Technol., vol. 21, no. 3, pp. 589–600, 2018. DOI: 10.1007/s10772-018-9521-x.
[12] C. Mansfield, M. Sun, Y. Liu, A. Gandhe, and B. Hoffmeister, "Neural text normalization with subword units", in Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Association for Computational Linguistics, 2019, pp. 190–196. DOI: 10.18653/v1/N19-2024.
[13] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing", in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 66–71. DOI: 10.18653/v1/D18-2012. [Online]. Available: https://aclanthology.org/D18-2012.
[14] M. Lewis, Y. Liu, N. Goyal, et al., "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension", in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online: Association for Computational Linguistics, Jul. 2020, pp. 7871–7880. DOI: 10.18653/v1/2020.acl-main.703. [Online]. Available: https://aclanthology.org/2020.acl-main.703.
[15] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding".
[16] S. Clinchant, K. W. Jung, and V. Nikoulina, "On the use of BERT for neural machine translation", in Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong: Association for Computational Linguistics, Nov. 2019, pp. 108–117. DOI: 10.18653/v1/D19-5611. [Online]. Available: https://aclanthology.org/D19-5611.
[17] A. Graves, "Long short-term memory", in Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37–45, 2012.
[18] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation", in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), The Association for Computational Linguistics, 2015, pp. 1412–1421. DOI: 10.18653/v1/d15-1166.
[19] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate", in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.0473.
[20] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need", in Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[21] J. Zhu, Y. Xia, L. Wu, et al., "Incorporating BERT into neural machine translation", in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020, OpenReview.net, 2020. [Online]. Available: https://openreview.net/forum?id=Hyl7ygStwB.
[22] M. E. Peters, M. Neumann, M. Iyyer, et al., "Deep contextualized word representations", in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237. DOI: 10.18653/v1/N18-1202. [Online]. Available: https://aclanthology.org/N18-1202.
[23] Y. Liu, M. Ott, N. Goyal, et al., "RoBERTa: A robustly optimized BERT pretraining approach", CoRR, vol. abs/1907.11692, 2019. arXiv: 1907.11692. [Online]. Available: http://arxiv.org/abs/1907.11692.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. [Online]. Available: https://aclanthology.org/N19-1423.