Audio source separation using a generic source spectral model based on nonnegative matrix factorization

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:
ASSOC. PROF. DR. NGUYEN QUOC CUONG
DR. NGUYEN CONG PHUONG

Hanoi - 2019

DECLARATION OF AUTHORSHIP

I, Duong Thi Hien Thanh, hereby declare that this thesis is my original work and that it has been written by me in its entirety. I confirm that:
• This work was done wholly during candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, May 2019
Ph.D. Student
Duong Thi Hien Thanh

SUPERVISORS
Assoc. Prof. Dr. Nguyen Quoc Cuong
Dr. Nguyen Cong Phuong

ACKNOWLEDGEMENT

This thesis was written during my doctoral study at the International Research Institute Multimedia, Information, Communication, and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank the many people who have contributed towards shaping this thesis.

First and foremost, I would like to express my most sincere gratitude to my supervisors, Assoc. Prof. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong, for their great guidance and support throughout my Ph.D. study. I am grateful to them for devoting their precious time to discussing research ideas, proofreading, and explaining how to write good research papers. I would like to thank them for encouraging my research and empowering me to grow as a research scientist.

I would like to express my appreciation to my supervisor in my Master's course, Prof. Nguyen Thanh Thuy, School of Information and Communication Technology - HUST, and to Dr. Nguyen Vu Quoc Hung, my supervisor in my Bachelor's course at Hanoi National University of Education. They shaped my knowledge and helped me excel in my studies.

In carrying out and completing my research, I have received much support from the board of MICA directors and my colleagues at the Speech Communication department. In particular, I am very thankful to Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son and Dr. Dao Trung Kien, who gave me the opportunity to join the research work at the MICA institute and to access its laboratory and research facilities. Without their precious support, it would have been impossible to conduct this research. My warm thanks go to my colleagues at the Speech Communication department of the MICA institute for their useful comments on my study and their unconditional support over four years, both at work and outside of work.

I am very grateful to my internship supervisor, Prof. Nobutaka Ono, and the members of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me into their lab and for the helpful research collaboration they offered. I greatly appreciate his help in funding my conference trip and introducing me to the wider audio research community. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima, MSc. Chiho Haruta and other researchers at Rion Co., Ltd., Japan, for welcoming me to their company and providing me with experimental data.

I would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, dean of the Economic Informatics Department, at Hanoi University of Mining and Geology (HUMG), where I work. I have received financial support and time from my office and its leaders for completing my doctoral thesis. Grateful thanks also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The Binh, Nguyen Thuy Duong, Nong Thi Oanh and Nguyen Thi Hai Yen, who have given me unconditional support and help over a long time. Special thanks go to Dr. Le Hong Anh for his encouragement and precious advice.

Last but not least, I would like to express my deepest gratitude to my family. I am very grateful to my mother-in-law and father-in-law for their support in times of need and for always allowing me to focus on my work. I dedicate this thesis to my mother and father with special love; they have been great mentors in my life and have constantly encouraged me to be a better person. The struggle and sacrifice of my parents always motivate me to work hard in my studies. I would also like to express my love to my younger sisters and younger brother for their encouragement and help. A special love goes to my beloved husband, Tran Thanh Huan, for his patience and understanding, and for always being there for me to share the good and bad times. I also appreciate my sons, Tran Tuan Quang and Tran Tuan Linh, for always cheering me up with their smiles. This work has become more wonderful because of the love and affection that they have provided.

Thank you all!
Hanoi, May 2019
Ph.D. Student
Duong Thi Hien Thanh

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
NOTATIONS AND GLOSSARY
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION

Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART
1.1 Audio source separation: a solution for cocktail party problem
    1.1.1 General framework for source separation
    1.1.2 Problem formulation
1.2 Literature review on international research works
    1.2.1 Spectral models
        1.2.1.1 Gaussian Mixture Model
        1.2.1.2 Nonnegative Matrix Factorization
        1.2.1.3 Deep Neural Networks
    1.2.2 Spatial models
        1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)
        1.2.2.2 Rank-1 covariance matrix
        1.2.2.3 Full-rank spatial covariance model
1.3 Literature review on research works in Vietnam
1.4 Source separation performance evaluation
    1.4.1 Energy-based criteria
    1.4.2 Perceptually-based criteria
1.5 Summary

Chapter 2. NONNEGATIVE MATRIX FACTORIZATION APPLYING TO AUDIO SPECTRAL DECOMPOSITION
2.1 NMF introduction
    2.1.1 NMF in a nutshell
    2.1.2 Cost function for parameter estimation
    2.1.3 Multiplicative update rules
2.2 Application of NMF to audio source separation
    2.2.1 Audio spectra decomposition
    2.2.2 NMF-based audio source separation
2.3 Proposed application of NMF to unusual sound detection
    2.3.1 Problem formulation
    2.3.2 Proposed methods for non-stationary frame detection
        2.3.2.1 Signal energy based method
        2.3.2.2 Global NMF-based method
        2.3.2.3 Local NMF-based method
    2.3.3 Experiment
        2.3.3.1 Dataset
        2.3.3.2 Algorithm settings and evaluation metrics
        2.3.3.3 Results and discussion
2.4 Summary

Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT
3.1 General workflow of the proposed approach
3.2 GSSM formulation
3.3 Model fitting with sparsity-inducing penalties
    3.3.1 Block sparsity-inducing penalty
    3.3.2 Component sparsity-inducing penalty
    3.3.3 Proposed mixed sparsity-inducing penalty
3.4 Derived algorithm in unsupervised case
3.5 Derived algorithm in semi-supervised case
    3.5.1 Semi-GSSM formulation
    3.5.2 Model fitting with mixed sparsity and algorithm
3.6 Experiment
    3.6.1 Experiment data
        3.6.1.1 Synthetic dataset
        3.6.1.2 SiSEC-MUS dataset
        3.6.1.3 SiSEC-BNG dataset
    3.6.2 Single-channel source separation performance with unsupervised setting
        3.6.2.1 Experiment settings
        3.6.2.2 Evaluation method
        3.6.2.3 Results and discussion
    3.6.3 Single-channel source separation performance with semi-supervised setting
        3.6.3.1 Experiment settings
        3.6.3.2 Evaluation method
        3.6.3.3 Results and discussion
3.7 Computational complexity
3.8 Summary

Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
4.1 Formulation and modeling
    4.1.1 Local Gaussian model
    4.1.2 NMF-based source variance model
    4.1.3 Estimation of the model parameters
4.2 Proposed GSSM-based multichannel approach
    4.2.1 GSSM construction
    4.2.2 Proposed source variance fitting criteria
        4.2.2.1 Source variance denoising
        4.2.2.2 Source variance separation
    4.2.3 Derivation of MU rule for updating the activation matrix
    4.2.4 Derived algorithm
4.3 Experiment
    4.3.1 Dataset and parameter settings
    4.3.2 Algorithm analysis
        4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations
        4.3.2.2 Separation results with different choices of λ and γ
    4.3.3 Comparison with the state of the art
4.4 Computational complexity
4.5 Summary

CONCLUSIONS AND PERSPECTIVES
BIBLIOGRAPHY
LIST OF PUBLICATIONS

NOTATIONS AND GLOSSARY

Standard mathematical symbols
$\mathbb{C}$    Set of complex numbers
$\mathbb{R}$    Set of real numbers
$\mathbb{Z}$    Set of integers
$\mathbb{E}$    Expectation of a random variable
$\mathcal{N}_c$    Complex Gaussian distribution

Vectors and matrices
$a$    Scalar
$\mathbf{a}$    Vector
$\mathbf{A}$    Matrix
$\mathbf{A}^T$    Matrix transpose
$\mathbf{A}^H$    Matrix conjugate transposition (Hermitian conjugation)
$\mathrm{diag}(\mathbf{a})$    Diagonal matrix with $\mathbf{a}$ as its diagonal
$\det(\mathbf{A})$    Determinant of matrix $\mathbf{A}$
$\mathrm{tr}(\mathbf{A})$    Matrix trace
$\mathbf{A} \odot \mathbf{B}$    The element-wise Hadamard product of two matrices (of the same dimension), with elements $[\mathbf{A} \odot \mathbf{B}]_{ij} = A_{ij} B_{ij}$
$\mathbf{A}^{(n)}$    The matrix with entries $[\mathbf{A}]_{ij}^{(n)}$
$\|\mathbf{a}\|$    Norm of vector
$\|\mathbf{A}\|_1$    $\ell_1$-norm of matrix

Indices
$f$    Frequency index
$i$    Channel index
$j$    Source index
$n$    Time frame index
$t$    Time sample index

...

$p(\mathbf{s}_j(n)) = \sum_{k=1}^{K} \delta_{jk}\, \mathcal{N}_c(\mathbf{s}_j(n) \mid \mathbf{0}, \boldsymbol{\Sigma}_{jk})$,    (1.5)

where $\mathbf{0}$ denotes a vector of zeros, and $\delta_{jk}$, which satisfies $\sum_{k=1}^{K} \delta_{jk} = 1, \forall j$, and $\boldsymbol{\Sigma}_{jk} = \mathrm{diag}([v_{jk}(f)]_f)$ are the weight and the diagonal spectral covariance matrix of the...

... Non-negative Matrix Factorization
MAP    Maximum A Posteriori
MFCC    Mel-Frequency Cepstral Coefficients
ML    Maximum Likelihood
MMSE    Minimum Mean Square Error
MU    Multiplicative Update
NMF    Non-negative Matrix

Format
Number of pages	133
File size	1.86 MB