
Nghiên cứu phương pháp cải thiện chất lượng hệ thống ghi nhật ký người nói (A Study on Improving Speaker Diarization System)




DOCUMENT INFORMATION

Basic information

Title: Nghiên Cứu Phương Pháp Cải Thiện Chất Lượng Hệ Thống Ghi Nhật Ký Người Nói (A Study on Improving Speaker Diarization System)
Author: Nguyễn Tùng Lâm
Supervisor: Dr. T. Anh Xuan Tran
University: Hanoi University of Science and Technology
Major: Control Engineering and Automation (Kỹ thuật Điều khiển và Tự động hóa)
Document type: Master thesis
Year: 2022
City: Hanoi
Pages: 90
File size: 2.94 MB

Content

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS

A Study on Improving Speaker Diarization System

TUNG LAM NGUYEN
lamfm95@gmail.com
Dept. of Control Engineering and Automation

Supervisor: Dr. T. Anh Xuan Tran
School: School of Electrical Engineering

Hanoi, March 1, 2022

SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS

Author: Nguyễn Tùng Lâm
Thesis title: Nghiên cứu phương pháp cải thiện chất lượng hệ thống ghi nhật ký người nói (A Study on Improving Speaker Diarization System)
Major: Control Engineering and Automation
Student ID: CBC19018

The author, the scientific supervisor, and the Thesis Examination Board confirm that the author has revised and supplemented the thesis according to the minutes of the Board meeting on 31/12/2021, with the following changes:

- Pages 4, 5: Redrew the basic system diagrams, separating the speaker diarization part from the speaker recognition part; added a general description of the system.
- Page 26: Added a block diagram of the Agglomerative Hierarchical Clustering (AHC) algorithm.
- Page 27: Revised Figure 2.15, the example of the AHC algorithm.
- Pages 32-34: Added a Frameworks section, which presents the self-developed library Kal-Star and notes that Kal-Star is used for the system described on pages 4 and 5.
- Pages 41-49: Replaced the old system diagrams with new ones; rearranged the content.
- Pages 53, 56: Updated the positions of the result tables for easier reading.

February 28, 2022
Supervisor                    Thesis author

CHAIRMAN OF THE EXAMINATION BOARD

SĐH.QT9.BM11 (form issued 11/11/2014)

Declaration of Authorship

I, Tung Lam NGUYEN, declare that this thesis titled, “A Study on Improving Speaker Diarization System”, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.

Signed:
Date:

“I’m not much but I’m all I have.”
- Philip K. Dick, Martian Time-Slip
Abstract

Speaker diarization is the method of dividing a conversation into segments spoken by the same speaker, usually referred to as “who spoke when”. At Viettel, this task is especially important to the IP contact center (IPCC) automatic quality assurance system, through which hundreds of thousands of calls are processed every day. Integrated within a speaker recognition system, speaker diarization helps distinguish between agents and customers within each support call and gives further useful insights (e.g., agent attitude and customer satisfaction). The key to performing this task accurately is to learn discriminative speaker representations. X-Vectors, bottleneck features of a time-delay neural network (TDNN), have emerged as the speaker representation of choice for many speaker diarization systems. On the other hand, ECAPA-TDNN, a recent development of the X-Vectors network with residual connections and attention on both time and feature channels, has shown state-of-the-art results on popular English corpora. The aim of this work is therefore to explore the capability of ECAPA-TDNN versus X-Vectors in the current Vietnamese speaker diarization system. Both the baseline and the proposed systems are evaluated on two tasks: speaker verification, to evaluate the discriminative characteristics of the speaker representations; and speaker diarization, to evaluate how these speaker representations affect the whole complex system. The data used include private data sets (IPCC_110000, VTR_1350) and a public data set (ZALO_400). In general, the conducted experiments show that the proposed system outperforms the baseline system on all tasks and all data sets.

Chapter 4: Results

Furthermore, both systems show a consistent degradation in terms of equal error rate (EER) as the dimension reduction ratio increases, with only two exceptions. The first exception occurs in the tests with the IPCC_110000 test split: the EER of the proposed system falls from 1.44% down to 1.39% and then rises to 1.54% as the target dimension falls from 180 to 172 and then to 160. The second exception occurs in the tests with the ZALO_400 data set, where the EER of the proposed system falls from 8.08% down to 7.91% and then rises to 8.16% over the same dimension changes. In both of these exceptions, however, the swings are insignificant and do not affect the overall trend of the EER. As for MinDCF, this metric does not show a clear trend against the reduction of embedding dimensions. In summary, the proposed system shows significant improvements over the baseline system, and the embedding dimension should not be further reduced in the PLDA scoring stage.

4.1 Speaker Verification Task

IPCC_110000 (test split) (8000 Hz, K=3, # Trials = 3888)

  X-Vector
  PLDA ratio   dim   MinDCF p=0.01   MinDCF p=0.001   EER (%)
  1.00         128   0.6240          0.8112            3.91
  0.95         120   0.6183          0.8117            3.96
  0.90         112   0.6183          0.8066            4.01
  0.85         108   0.6163          0.8102            4.17
  0.80         100   0.6317          0.8050            4.27
  0.70          88   0.6497          0.8020            4.32
  0.60          76   0.6445          0.8138            4.53
  0.50          64   0.6533          0.7917            4.78

  ECAPA-TDNN
  PLDA ratio   dim   MinDCF p=0.01   MinDCF p=0.001   EER (%)
  1.00         192   0.1024          0.1085            1.44
  0.95         180   0.1065          0.1080            1.44
  0.90         172   0.1096          0.1096            1.39
  0.85         160   0.0998          0.0998            1.54
  0.80         152   0.1070          0.1070            1.54
  0.70         132   0.0983          0.0983            1.65
  0.60         112   0.1101          0.1101            1.85
  0.50          96   0.1240          0.1240            2.01

VTR_1350 (8000 Hz (resampled), K=3, # Trials = 7902)

  X-Vector
  PLDA ratio   dim   MinDCF p=0.01   MinDCF p=0.001   EER (%)
  1.00         128   0.7588          0.8651            9.69
  0.95         120   0.7603          0.8641            9.82
  0.90         112   0.7472          0.8659            9.87
  0.85         108   0.7325          0.8669            9.92
  0.80         100   0.7327          0.8502           10.02
  0.70          88   0.7423          0.8322           10.10
  0.60          76   0.7261          0.8461           10.30
  0.50          64   0.7459          0.8494           10.48

  ECAPA-TDNN
  PLDA ratio   dim   MinDCF p=0.01   MinDCF p=0.001   EER (%)
  1.00         192   0.2680          0.3192            3.29
  0.95         180   0.2721          0.3680            3.49
  0.90         172   0.2797          0.3936            3.54
  0.85         160   0.3055          0.4052            3.59
  0.80         152   0.2971          0.3941            3.62
  0.70         132   0.3214          0.4432            3.70
  0.60         112   0.3774          0.4450            3.77
  0.50          96   0.3991          0.4938            3.77

ZALO_400 (8000 Hz (resampled), K=3, # Trials = 2376)

  X-Vector
  PLDA ratio   dim   MinDCF p=0.01   MinDCF p=0.001   EER (%)
  1.00         128   0.9470          0.9470           14.39
  0.95         120   0.9562          0.9562           14.56
  0.90         112   0.9444          0.9444           14.65
  0.85         108   0.9444          0.9444           14.73
  0.80         100   0.9402          0.9402           14.90
  0.70          88   0.9478          0.9478           14.90
  0.60          76   0.9621          0.9621           15.24
  0.50          64   0.9739          0.9739           14.65

  ECAPA-TDNN
  PLDA ratio   dim   MinDCF p=0.01   MinDCF p=0.001   EER (%)
  1.00         192   0.7340          0.7340            7.83
  0.95         180   0.7424          0.7424            8.08
  0.90         172   0.7306          0.7306            7.91
  0.85         160   0.7214          0.7214            8.16
  0.80         152   0.6987          0.6987            8.16
  0.70         132   0.7079          0.7079            8.67
  0.60         112   0.7374          0.7374            9.34
  0.50          96   0.7332          0.7332            9.93

TABLE 4.1: EER and MinDCF performance.
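The EER and MinDCF numbers in Table 4.1 can be computed from raw trial scores in a few lines of NumPy. The following is a minimal sketch under assumed conventions (the function name, unit costs C_miss = C_fa = 1, and labels with 1 marking same-speaker trials are illustrative choices, not the thesis code); p_target = 0.01 and 0.001 correspond to the two MinDCF columns of the table.

    # Minimal sketch: EER and MinDCF from verification trial scores.
    # `scores` are PLDA log-likelihood ratios, `labels` mark target trials (1).
    import numpy as np

    def eer_and_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
        order = np.argsort(scores)[::-1]            # sweep threshold from high to low
        labels = np.asarray(labels, dtype=float)[order]
        n_target = labels.sum()
        n_nontarget = len(labels) - n_target
        fa = np.cumsum(1.0 - labels) / n_nontarget  # false-alarm rate at each cut-off
        miss = 1.0 - np.cumsum(labels) / n_target   # miss rate at each cut-off
        eer_idx = np.argmin(np.abs(fa - miss))      # operating point where fa ~= miss
        eer = (fa[eer_idx] + miss[eer_idx]) / 2.0
        # NIST detection cost, normalized by the best trivial system.
        dcf = c_miss * miss * p_target + c_fa * fa * (1.0 - p_target)
        min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
        return eer, min_dcf

    # Toy example: two target and two non-target trials, perfectly separated.
    eer, min_dcf = eer_and_min_dcf([2.1, 1.3, -0.4, -1.7], [1, 1, 0, 0])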
4.2 Speaker Diarization Task

In this task, the whole baseline and proposed systems are tested on mock conversations consisting of different numbers of engaging speakers and with different uniform sub-segmenting configurations. In this test, oracle VAD (i.e., ground-truth VAD) is used to remove the effect of any voice activity detection module, PLDA scoring is carried out without dimension reduction, and the exact number of engaging speakers in each conversation is known before the clustering process. Results are reported in Table 4.2, where {x : y} represents a uniform segmentation configuration of windows of length x seconds with y seconds of overlap (a short sketch of this windowing is given after Table 4.2).

FIGURE 4.1: A speaker diarization output of a 3-way conversation in the VTR_1350 test set.

Both systems perform relatively well on the IPCC_110000 test split's mock conversations, where DERs are all below 4.5 percent. These results match the fact that these conversations are generated from the data set that is in-domain with the embedding extractor's training data. The results on the ZALO_400 mock conversations are much worse, with the DER going up to 17.15% for the baseline system and 11.20% for the proposed system. With the VTR_1350 mock conversations, the results are worse still, going up to 24.25% for the baseline system and 22.33% for the proposed system. In most cases, the proposed system with ECAPA-TDNN outperforms the baseline system, across most conversation types and all sub-segmentation configurations. In each set of conversations with the same number of speakers, the best DER among the different sub-segmenting configurations of the proposed system usually beats that of the baseline system by 30% to 70%.

Furthermore, while both systems perform better with a wider provided context (i.e., a larger sub-segmentation window size), the proposed system does not show a significantly better relative DER reduction as the context window grows. In other words, the proposed system's DER would be expected to decrease faster than the baseline system's, since ECAPA-TDNN can theoretically make use of the wide context thanks to its attention mechanism, yet this is not what is observed. This experiment shows that in the speaker diarization task, where speech segments are sub-segmented into windows of only a few seconds, the attention over the time channel is not very effective. In conclusion, the proposed system gives a significant improvement in DER performance over the baseline system.

IPCC_110000.test - Mock Conversations
(8000 Hz; 2-way, 3-way, 4-way and 5-way conversations, 200 each)

  # spk   subseg         X-Vector DER (%)   ECAPA DER (%)
  2       {1.5 : 0.75}    2.93               2.66
  2       {2.0 : 1.00}    2.72               2.06
  2       {3.0 : 1.50}    2.54               1.70
  2       {4.0 : 2.00}    2.17               1.50
  3       {1.5 : 0.75}    2.65               2.65
  3       {2.0 : 1.00}    2.72               1.36
  3       {3.0 : 1.50}    2.34               0.93
  3       {4.0 : 2.00}    2.10               0.89
  4       {1.5 : 0.75}    4.35               4.33
  4       {2.0 : 1.00}    4.43               2.82
  4       {3.0 : 1.50}    3.53               2.36
  4       {4.0 : 2.00}    2.88               2.93
  5       {1.5 : 0.75}    5.27               3.88
  5       {2.0 : 1.00}    3.78               2.63
  5       {3.0 : 1.50}    3.26               2.91
  5       {4.0 : 2.00}    3.03               3.14

VTR_1350 - Mock Conversations
(8000 Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations, 200 each)

  # spk   subseg         X-Vector DER (%)   ECAPA DER (%)
  2       {1.5 : 0.75}   11.31              18.59
  2       {2.0 : 1.00}    9.32               8.41
  2       {3.0 : 1.50}    5.31               2.66
  2       {4.0 : 2.00}    2.45               1.31
  3       {1.5 : 0.75}   17.77              20.13
  3       {2.0 : 1.00}   12.52              12.80
  3       {3.0 : 1.50}    8.39               5.63
  3       {4.0 : 2.00}    7.08               3.42
  4       {1.5 : 0.75}   21.18              21.63
  4       {2.0 : 1.00}   16.54              12.71
  4       {3.0 : 1.50}   10.34               5.34
  4       {4.0 : 2.00}    8.20               4.44
  5       {1.5 : 0.75}   24.25              22.33
  5       {2.0 : 1.00}   18.24              11.70
  5       {3.0 : 1.50}   12.16               4.90
  5       {4.0 : 2.00}    8.63               2.95

ZALO_400 - Mock Conversations
(8000 Hz (resampled); 2-way, 3-way, 4-way and 5-way conversations, 200 each)

  # spk   subseg         X-Vector DER (%)   ECAPA DER (%)
  2       {1.5 : 0.75}    5.62               4.86
  2       {2.0 : 1.00}    4.78               3.60
  2       {3.0 : 1.50}    5.40               2.72
  2       {4.0 : 2.00}    6.61               6.06
  3       {1.5 : 0.75}   10.85               7.11
  3       {2.0 : 1.00}   10.11               6.15
  3       {3.0 : 1.50}    9.72               5.77
  3       {4.0 : 2.00}   10.92               7.59
  4       {1.5 : 0.75}   15.35               8.85
  4       {2.0 : 1.00}   12.83               6.59
  4       {3.0 : 1.50}   11.06               6.14
  4       {4.0 : 2.00}   12.17               6.58
  5       {1.5 : 0.75}   17.15              11.20
  5       {2.0 : 1.00}   13.79               8.72
  5       {3.0 : 1.50}   12.31               6.76
  5       {4.0 : 2.00}   12.44               7.09

TABLE 4.2: DER performance.
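For concreteness, here is a minimal sketch of the uniform sub-segmentation behind the {x : y} notation in Table 4.2. The function name and the way the final window absorbs the remainder are illustrative assumptions, not necessarily the thesis implementation.

    # Minimal sketch: cut a speech segment [start, end) into uniform windows
    # of `win` seconds with `overlap` seconds of overlap ({win : overlap}).
    def uniform_subsegments(start, end, win=1.5, overlap=0.75):
        step = win - overlap                 # hop size, e.g. 0.75 s for {1.5 : 0.75}
        subsegments, t = [], start
        while t + win < end:
            subsegments.append((t, t + win))
            t += step
        subsegments.append((t, end))         # last window absorbs the remainder
        return subsegments

    # Example: a 5-second segment with the {2.0 : 1.00} configuration.
    print(uniform_subsegments(0.0, 5.0, win=2.0, overlap=1.0))
    # -> [(0.0, 2.0), (1.0, 3.0), (2.0, 4.0), (3.0, 5.0)]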
Chapter 5: Conclusions and Future Works

In this thesis, a new deep neural network architecture, ECAPA-TDNN, was experimented with in comparison to the baseline system based on X-Vectors, and it showed significant overall improvements. The proposed system outperformed the baseline on all Vietnamese data sets and on both tasks: speaker verification and speaker diarization. Thanks to the attention mechanism that operates on both time and feature channels, the proposed network can learn which data in the context and which features are more important. This context-aware property of ECAPA-TDNN is remarkably important, since different languages have different ways of constructing sentences, and word positioning in Vietnamese is totally different from that of English or French. In this sense, ECAPA-TDNN can be adapted to a wide variety of languages with different writing styles, and indeed it worked with Vietnamese conversations. The following are some highlighted pros and cons of using ECAPA-TDNN in the speaker diarization system:

Pros:
• ECAPA-TDNN provides context-aware embeddings, with attention on both time and feature channels, that work exceptionally well with Vietnamese data.
• Based entirely on the PyTorch framework, ECAPA-TDNN is much easier to train, test and customize than in Kaldi (see the sketch at the end of this chapter).

Cons:
• Both the training and inference processes are slower due to the complexity of the network. Even with an NVIDIA A100 GPU, it still takes 80 hours to complete 20 training epochs on the IPCC_110000 data set.
• The network is not yet production-ready, while the X-Vectors network, trained with Kaldi, has long been used in production for both speaker verification and diarization systems.

Further research directions that can be explored to improve the understanding of ECAPA-TDNN's capability in speaker diarization include:

• Trial and error with more configurations (in this thesis, only some minor changes were made to the original network configuration).
• Exploring other types of clustering methods.
• Studying how effective the proposed system is on conversations with multiple overlaps.
• Applying post-processing methods to the diarization result.
• Building a Vietnamese conversation data set based on real conversations.
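As referenced in the pros above, the following is a hedged sketch, not the thesis code, of how a pretrained ECAPA-TDNN embedding extractor can be loaded and used from PyTorch via the SpeechBrain toolkit. The public VoxCeleb checkpoint name, the wav file names, and the plain cosine scoring are stand-in assumptions: the thesis trains its own Vietnamese models and scores trials with PLDA.

    # Hedged sketch: extract ECAPA-TDNN speaker embeddings with SpeechBrain
    # and score one verification trial by cosine similarity. The checkpoint
    # and wav paths below are illustrative stand-ins, not the thesis setup.
    import torch
    import torchaudio
    from speechbrain.pretrained import EncoderClassifier

    encoder = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",  # public stand-in model
        savedir="pretrained_ecapa",
    )

    def embed(wav_path):
        signal, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
        emb = encoder.encode_batch(signal)               # shape (1, 1, 192)
        return emb.squeeze()                             # 192-dim, as in Table 4.1

    e1, e2 = embed("enroll.wav"), embed("test.wav")      # placeholder file names
    score = torch.nn.functional.cosine_similarity(e1, e2, dim=0)
    print(f"cosine score: {score.item():.3f}")           # higher = more likely same speaker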
