Tách nguồn âm thanh sử dụng mô hình phổ nguồn tổng quát trên cơ sở thừa số hóa ma trận không âm (Audio Source Separation Exploiting NMF-based Generic Source Spectral Model)
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:
Assoc. Prof. Dr. Nguyen Quoc Cuong
Dr. Nguyen Cong Phuong

Hanoi - 2019

DECLARATION OF AUTHORSHIP

I, Duong Thi Hien Thanh, hereby declare that this thesis is my original work and has been written by me in its entirety. I confirm that:
• This work was done wholly during candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, May 2019
Ph.D. Student
Duong Thi Hien Thanh

SUPERVISORS
Assoc. Prof. Dr. Nguyen Quoc Cuong
Dr. Nguyen Cong Phuong

ACKNOWLEDGEMENT

This thesis was written during my doctoral study at the International Research Institute Multimedia, Information, Communication, and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank the numerous people who have contributed towards shaping this thesis.

First and foremost, I would like to express my most sincere gratitude to my
supervisors, Assoc. Prof. Dr. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong, for their great guidance and support throughout my Ph.D. study. I am grateful to them for devoting their precious time to discussing research ideas, proofreading, and explaining how to write good research papers. I would like to thank them for encouraging my research and empowering me to grow as a research scientist.

I would like to express my appreciation to my supervisor during my Master's course, Prof. Nguyen Thanh Thuy, School of Information and Communication Technology - HUST, and to Dr. Nguyen Vu Quoc Hung, my supervisor during my Bachelor's course at Hanoi National University of Education. They shaped the knowledge that allowed me to excel in my studies.

In the process of carrying out and completing my research, I have received much support from the board of MICA directors and my colleagues at the Speech Communication department. In particular, I am very thankful to Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son, and Dr. Dao Trung Kien, who provided me with the opportunity to join research work at the MICA institute and to have access to the laboratory and research facilities. Without their precious support, it would have been impossible to conduct this research. My warm thanks go to my colleagues at the Speech Communication department of the MICA institute for their useful comments on my study and their unconditional support over four years, both at work and outside of work.

I am very grateful to my internship supervisor Prof. Nobutaka Ono and the members of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me into their lab and for the helpful research collaboration they offered. I much appreciate his help in funding my conference trip and introducing me to the wider audio research community. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima, MSc. Chiho Haruta, and other researchers at Rion Co., Ltd., Japan, for welcoming me to their company and providing me with data for my experiments. I
would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, dean of the Economic Informatics Department, at Hanoi University of Mining and Geology (HUMG), where I am working. I have received financial and time support from my office and leaders for completing my doctoral thesis.

Grateful thanks also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The Binh, Nguyen Thuy Duong, Nong Thi Oanh, and Nguyen Thi Hai Yen, who have given me unconditional support and help over a long time. Special thanks go to Dr. Le Hong Anh for his encouragement and precious advice.

Last but not least, I would like to express my deepest gratitude to my family. I am very grateful to my mother-in-law and father-in-law for their support in times of need, and for always allowing me to focus on my work. I dedicate this thesis to my mother and father with special love; they have been great mentors in my life and have constantly encouraged me to be a better person. The struggle and sacrifice of my parents always motivate me to work hard in my studies. I would also like to express my love to my younger sisters and younger brother for their encouragement and help. A special love goes to my beloved husband Tran Thanh Huan for his patience and understanding, and for always being there for me to share the good and bad times. I also appreciate my sons Tran Tuan Quang and Tran Tuan Linh for always cheering me up with their smiles. This work has become more wonderful because of the love and affection that they have provided. Thank you all!
Hanoi, May 2019
Ph.D. Student
Duong Thi Hien Thanh

CONTENTS

DECLARATION OF AUTHORSHIP i
ACKNOWLEDGEMENT ii
CONTENTS iv
NOTATIONS AND GLOSSARY viii
LIST OF TABLES xii
LIST OF FIGURES xiii
INTRODUCTION
Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART 10
1.1 Audio source separation: a solution for the cocktail party problem 10
1.1.1 General framework for source separation 10
1.1.2 Problem formulation 11
1.2 Literature review on international research works 13
1.2.1 Spectral models 13
1.2.1.1 Gaussian Mixture Model 14
1.2.1.2 Nonnegative Matrix Factorization 15
1.2.1.3 Deep Neural Networks 16
1.2.2 Spatial models 18
1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD) 18
1.2.2.2 Rank-1 covariance matrix 19
1.2.2.3 Full-rank spatial covariance model 20
1.3 Literature review on research works in Vietnam 21
1.4 Source separation performance evaluation 22
1.4.1 Energy-based criteria 23
1.4.2 Perceptually-based criteria 24
1.5 Summary 24
Chapter 2. NONNEGATIVE MATRIX FACTORIZATION APPLYING TO AUDIO SPECTRAL DECOMPOSITION 26
2.1 NMF introduction 26
2.1.1 NMF in a nutshell 26
2.1.2 Cost function for parameter estimation 28
2.1.3 Multiplicative update rules 29
2.2 Application of NMF to audio source separation 31
2.2.1 Audio spectra decomposition 31
2.2.2 NMF-based audio source separation 32
2.3 Proposed application of NMF to unusual sound detection 34
2.3.1 Problem formulation 35
2.3.2 Proposed methods for non-stationary frame detection 36
2.3.2.1 Signal energy based method 36
2.3.2.2 Global NMF-based method 37
2.3.2.3 Local NMF-based method 37
2.3.3 Experiment 39
2.3.3.1 Dataset 39
2.3.3.2 Algorithm settings and evaluation metrics 39
2.3.3.3 Results and discussion 40
2.4 Summary 45
Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT 46
3.1 General workflow of the proposed approach 46
3.2 GSSM formulation 48
3.3 Model fitting with
sparsity-inducing penalties 48
3.3.1 Block sparsity-inducing penalty 49
3.3.2 Component sparsity-inducing penalty 50
3.3.3 Proposed mixed sparsity-inducing penalty 51
3.4 Derived algorithm in unsupervised case 51
3.5 Derived algorithm in semi-supervised case 54
3.5.1 Semi-GSSM formulation 54
3.5.2 Model fitting with mixed sparsity and algorithm 56
3.6 Experiment 56
3.6.1 Experiment data 56
3.6.1.1 Synthetic dataset 57
3.6.1.2 SiSEC-MUS dataset 57
3.6.1.3 SiSEC-BNG dataset 58
3.6.2 Single-channel source separation performance with unsupervised setting 59
3.6.2.1 Experiment settings 59
3.6.2.2 Evaluation method 59
3.6.2.3 Results and discussion 63
3.6.3 Single-channel source separation performance with semi-supervised setting 67
3.6.3.1 Experiment settings 67
3.6.3.2 Evaluation method 67
3.6.3.3 Results and discussion 67
3.7 Computational complexity 68
3.8 Summary 69
Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK 70
4.1 Formulation and modeling 70
4.1.1 Local Gaussian model 70
4.1.2 NMF-based source variance model 72
4.1.3 Estimation of the model parameters 73
4.2 Proposed GSSM-based multichannel approach 74
4.2.1 GSSM construction 74
4.2.2 Proposed source variance fitting criteria 75
4.2.2.1 Source variance denoising 75
4.2.2.2 Source variance separation 76
4.2.3 Derivation of MU rule for updating the activation matrix 77
4.2.4 Derived algorithm 80
4.3 Experiment 80
4.3.1 Dataset and parameter settings 80
4.3.2 Algorithm analysis 82
4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations 82
4.3.2.2 Separation results with different choices of λ and γ 83
4.3.3 Comparison with the state of the art 84
4.4 Computational complexity 93
4.5 Summary 93
CONCLUSIONS AND PERSPECTIVES 95
BIBLIOGRAPHY 98
LIST OF PUBLICATIONS 116

NOTATIONS AND GLOSSARY

Standard mathematical symbols
C Set of complex numbers
R Set of real numbers
Z Set of integers
E Expectation of a random variable
Nc
Complex Gaussian distribution

Vectors and matrices
a Scalar
a Vector
A Matrix
A^T Matrix transpose
A^H Matrix conjugate transpose (Hermitian conjugation)
diag(a) Diagonal matrix with a as its diagonal
det(A) Determinant of matrix A
tr(A) Matrix trace
A ⊙ B Element-wise Hadamard product of two matrices (of the same dimension), with elements [A ⊙ B]ij = Aij Bij
A^(n) The matrix with entries ([A]ij)^n
‖a‖1 ℓ1-norm of vector a
‖A‖1 ℓ1-norm of matrix A

Indices
f Frequency index
i Channel index
j Source index
n Time frame index
t Time sample index
130(4):1–6 Published by Foundation of Computer Science (FCS), NY, USA [55] Huang, A (2013) NMF Face Recognition Method Based on Alpha Divergence In Zhong, Z., editor, Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012, volume 217, pages 477–483 Springer London [56] Huang, P., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P (2015) Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation IEEE/ACM Trans Audio, Speech & Language Processing, 23(12):2136–2147 103 [57] Huber, R and Kollmeier, B (2006) PEMO-Q - A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception IEEE Transactions on Audio, Speech, and Language Processing, 14(6):1902–1911 [58] Hurmalainen, A., Saeidi, R., and Virtanen, T (2012) Group Sparsity for Speaker Identity Discrimination in Factorisation-based Speech Recognition In Proc Interspeech, pages 17–20 [59] Ito, N., Araki, S., and Nakatani, T (2013) Permutation-free Convolutive Blind Source Separation via Full-band Clustering Based on Frequency-independent Source Presence Priors In Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 3238–3242 [60] Izumi, Y., Ono, N., and Sagayama, S (2007) Sparseness-Based 2ch BSS using the EM Algorithm in Reverberant Environment In Proc IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 147–150 [61] Jeter, M and Pye, W (1981) A Note on Nonnegative Rank Factorizations Linear Algebra and its Applications, 38:171–173 [62] Jiang, Y., Wang, D., Liu, R., and Feng, Z (2014) Binaural Classification for Reverberant Speech Segregation Using Deep Neural Networks IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2112–2121 [63] Jourjine, A., Rickard, S., and Yılmaz, O (2000) Blind Separation of Disjoint Orthogonal Signals: Demixing N Sources from Mixtures In Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 
2985–2988 [64] Kim, G and Loizou, P (2010) Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms IEEE Trans Audio, Speech, Language Processing, 18(8):2080–2090 [65] Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Sehr, A., Kellermann, W., and Maas, R (2013) The Reverb Challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech In Proc IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–4, NY, USA [66] Kitamura, D., Ono, N., Sawada, H., Kameoka, H., and Saruwatari, H (2016a) Determined Blind Source Separation Unifying Independent Vector Analysis and 104 Nonnegative Matrix Factorization IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1626–1641 [67] Kitamura, D., Ono, N., Sawada, H., Kameoka, H., and Saruwatari, H (2016b) Determined Blind Source Separation Unifying Independent Vector Analysis and Nonnegative Matrix Factorization IEEE/ACM Trans on Audio, Speech and Language Processing, 24(9):1622–1637 [68] Kompass, R (2007) A Generalized Divergence Measure for Nonnegative Matrix Factorization Neural Computation, 19(3):780–791 [69] Kuttruff, H (2000) Room Acoustics Spon Press, New York, 4rd edition edition [70] Lafay, G., Benetos, E., and Lagrange, M (2017) Sound Event Detection in Synthetic Audio: Analysis of the Dcase 2016 Task Results In 2016 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 11– 15 [71] Le Magoarou, L., Ozerov, A., and Duong, N Q K (2013) Text-informed Audio Source Separation using Nonnegative Matrix Partial Co-factorization In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6 [72] Lee, D D and Seung, H S (1999) Learning the Parts of Objects by Nonnegative Matrix Factorization Nature, 401 6755:788–91 [73] Lee, D D and Seung, H S (2001) Algorithms for Non-negative Matrix Factorization In Advances in Neural and Information Processing Systems 13, 
pages 556–562 [74] Lef`evre, A., Bach, F., and F´evotte, C (2011) Itakura-Saito Non-negative Matrix Factorization with Group Sparsity In IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 21–24 [75] Lef`evre, A., Bach, F., and F´evotte, C (2011) Online algorithms for Nonnegative Matrix Factorization with the Itakura-Saito divergence In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–4 [76] Leglaive, S., S¸ims¸ekli, U., Liutkus, A., Badeau, R., and Richard, G (2017) Alpha-stable Multichannel Audio Source Separation In IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 576–580 105 [77] Li, Y and Ngom, A (2013) The Non-negative Matrix Factorization Toolbox for Biological Data Mining Source Code for Biology and Medicine, 8(1):1–10 [78] Linh-Trung, N., Aissa-El-Bey, A., Abel-Meraim, K., and Belounchrani, A (2005) Underdetermined Blind Source Seperation of Non-disjoint Nonstationary Sources in the Time-Frequency Domain In Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, 2005., volume 1, pages 46– 49 [79] Liutkus, A., Badeau, R., and Richard, G (2011) Gaussian Processes for Underdetermined Source Separation IEEE Transactions on Signal Processing, 59(7):3155–3167 [80] Liutkus, A., Durrieu, J L., Daudet, L., and Richard, G (2013) An Overview of Informed Audio Source Separation In Proc IEEE Int Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pages 1–4 [81] Liutkus, A., Fitzgerald, D., and Rafii, Z (2015) Scalable Audio Separation with Light Kernel Additive Modelling In Proc IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP), pages 76–80 [82] Liutkus, A., Stăoter, F R., Rafii, Z., Kitamura, D., Rivet, B., Ito, N., Ono, N., and Fontecave, J (2017a) The 2016 Signal Separation Evaluation Campaign In Proc Int Conf on Latent Variable Analysis and Signal Separation, pages 323–332 [83] Liutkus, A., Stăoter, F.-R., Rafii, 
Z., Kitamura, D., Rivet, B., Ito, N., Ono, N., and Fontecave, J (2017b) The 2016 Signal Separation Evaluation Campaign In Latent Variable Analysis and Signal Separation, volume 10169, pages 323–332 Springer International Publishing, Cham [84] L´opez, A R., Ono, N., Remes, U., Palomăaki, K., and Kurimo, M (2015) Designing Multichannel Source Separation based on Single-channel Source Separation In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 469–473 [85] Magoarou, L L., Ozerov, A., and Duong, N Q K (2014) Text-informed Audio Source Separation Example-based Approach Using Non-negative Matrix Partial Co-factorization Journal of Signal Processing Systems, pages 1–5 106 [86] Magron, P., Badeau, R., and Liutkus, A (2017) L´evy NMF for Robust Nonnegative Source Separation In Proc IEEE Int Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 259–263 [87] Makino, S., Lee, T.-W., and Sawada, H (2007) Blind Speech Separation Springer [88] Mandel, M I., Weiss, R J., and Ellis, D P W (2010) Model-Based ExpectationMaximization Source Separation and Localization IEEE Transactions on Audio, Speech, and Language Processing, 18(2):382–394 [89] McCowan, I and Bourlard, H (2003) Microphone Array Post-filter based on Noise Field Coherence IEEE Transactions on Speech and Audio Processing, 11(6):709–716 [90] Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T., and Plumbley, M D (2018) Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393 [91] Mohammadiha, N., Smaragdis, P., and Leijon, A (2013) Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2140–2151 [92] Naik, G R and Wang, W., editors (2014) Blind Source Separation: Advances in Theory, Algorithms and 
Applications Signals and Communication Technology Springer, Berlin [93] Nakajima, Y., Sunohara, M., Naito, T., Sunago, N., Ohshima, T., and Ono, N (2016) DNN-based Environmental Sound Recognition with Real-recorded and Artificially-mixed Training Data [94] Naylor, P A and Gaubitch, N D., editors (2010) Speech Dereverberation Signals and Communication Technology Springer London [95] Nesta, F and Omologo, M (2012) Generalized State Coherence Transform for Multidimensional TDOA Estimation of Multiple Sources IEEE Transactions on Audio, Speech, and Language Processing, 20(1):246–260 107 [96] Nikunen, J and Virtanen, T (2014) Direction of Arrival Based Spatial Covariance Model for Blind Sound Source Separation IEEE/ACM Trans on Audio, Speech, and Language Processing, 22(3):727–739 [97] Nix, J and Hohmann, V (2007) Combined Estimation of Spectral Envelopes and Sound Source Direction of Concurrent Voices by Multidimensional Statistical Filtering IEEE Transactions on Audio, Speech and Language Processing, 15(3):995– 1008 [98] Nugraha, A., Liutkus, A., and Vincent, E (2016) Multichannel Audio Source Separation With Deep Neural Networks IEEE/ACM Transactions on Audio, Speech, and Language Processing, 14(9):1652–1664 [99] O’Grady, P D., Pearlmutter, B A., and Rickard, S T (2005) Survey of Sparse and Non-sparse Methods in Source Separation International Journal of Imaging Systems and Technology, 15(1):18–33 [100] Ono, N., Koldovsk´y, Z., Miyabe, S., and Ito, N (2013a) The 2013 Signal Separation Evaluation Campaign In 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6 [101] Ono, N., Koldovsk´y, Z., Miyabe, S., and Ito, N (2013b) The 2013 Signal Separation Evaluation Campaign In Proc IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6 [102] Ono, N., Rafii, Z., Kitamura, D., Ito, N., and Liutkus, A (2015a) The 2015 Signal Separation Evaluation Campaign In Latent Variable Analysis and Signal Separation, 
volume 9237, pages 387–395 Springer International Publishing, Cham [103] Ono, N., Rafii, Z., Kitamura, D., Ito, N., and Liutkus, A (2015b) The 2015 Signal Separation Evaluation Campaign In Latent Variable Analysis and Signal Separation (LVAICA), volume 9237, pages 387–395 Springer [104] Ozerov, A and Fevotte, C (2010) Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation IEEE Transactions on Audio, Speech, and Language Processing, 18(3):550–563 [105] Ozerov, A and F´evotte, C (2010) Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation IEEE Trans on Audio, Speech and Language Processing, 18(3):550–563 108 [106] Ozerov, A., Fevotte, C., Blouet, R., and Durrieu, J.-L (2011) Multichannel Nonnegative Tensor Factorization with Structured Constraints for User-guided Audio Source Separation In Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 257–260 [107] Ozerov, A., F´evotte, C., and Vincent, E (2017) An Introduction to Multichannel NMF for Audio Source Separation In Audio Source Separation, Signals and Communication Technology Springer [108] Ozerov, A., Philippe, P., Bimbot, F., and Gribonval, R (2007) Adaptation of Bayesian Models for Single-Channel Source Separation and its Application to Voice/Music Separation in Popular Songs IEEE Transactions on Audio, Speech and Language Processing, 15(5):1564–1578 [109] Ozerov, A., Vincent, E., and Bimbot, F (2012) A general Flexible Framework for the Handling of Prior Information in Audio Source Separation IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1118–1133 [110] Paatero, P (1997) Least Squares Formulation of Robust Non-negative Factor Analysis Chemometrics and Intelligent Laboratory Systems, 37(1):23–35 [111] Paatero, P and Tapper, U (1994) Positive Matrix Factorization: A Nonnegative Factor Model with Optimal Utilization of Error Estimates of Data Values Environmetrics, 5(2):111–126 
[112] Parekh, S., Essid, S., Ozerov, A., Duong, N Q K., Perez, P., and Richard, G (2017) Motion Informed Audio Source Separation In Proc IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10 [113] Parvaix, M and Girin, L (2011) Informed Source Separation of Linear Instantaneous Under-Determined Audio Mixtures by Source Index Embedding IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1721–1733 [114] Pedersen, M S., Larsen, J., Kjems, U., and Parra, L C (2007) A Survey of Convolutive Blind Source Separation Methods In Springer Handbook of Speech Processing, pages 1–34 Springer [115] Quang, T T., Huy, T Q., and Phuong, N H (2011) Blind Source Separation applied to Sound in Various Conditions Journal of Science and Technology Development, 14:34–42 109 [116] Quir´os, A and Wilson, S P (2012) ependent Gaussian Mixture Models for Source Separation EURASIP Journal on Advances in Signal Processing, 2012(1) [117] Rafii, Z and Pardo, B (2013) Online REPET-SIM for Real-time Speech Enhancement In Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 848–852 [118] Rennie, S J., Hershey, J R., and Olsen, P A (2008) Efficient Model-based Speech Separation and Denoising using Nonnegative Subspace Analysis In In: Proc of ICASSP Las Vegas, pages 1833–1836 [119] Revit, L J and Schulein, R B (2013) Sound Reproduction Method and Apparatus for Assessing Real-world Performance of Hearing and Hearing aids The Journal of the Acoustical Society of America, 133(2):1196–1199 [120] Reynolds, D A., Quatieri, T F., and Dunn, R B Speaker Verification Using Adapted Gaussian Mixture Models Digital Signal Processing, 10(1):19–41 [121] Roy, R and Kailath, T (1989) ESPRIT-estimation of Signal Parameters via Rotational Invariance Techniques IEEE/ACM Transactions on Audio, Speech, and Language Processing, 37(7):984–995 [122] Sainath, T N., Weiss, R J., Wilson, K W., Li, B., Narayanan, A., Variani, E., Bacchiani, M., Shafran, I., Senior, 
A., Chin, K., Misra, A., and Kim, C (2017) Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(5):965–979 [123] Sandler, R and Lindenbaum, M (2011) Nonnegative Matrix Factorization with Earth Mover’s Distance Metric for Image Analysis IEEE Trans Pattern Anal Mach Intell., 33(8):1590–1602 [124] Sawada, H., Araki, S., and Makino, S (2011) Underdetermined Convolutive Blind Source Separation via Frequency Bin-wise Clustering and Permutation Alignment IEEE Transactions on Audio, Speech, and Language Processing, 19(3):516– 527 [125] Sawada, H., Kameoka, H., Araki, S., and Ueda, N (2013) Multichannel Extensions of Non-Negative Matrix Factorization With Complex-Valued Data IEEE Transactions on Audio, Speech, and Language Processing, 21(5):971–982 110 [126] Smaragdis, P and Mysore, G J (2009) Separation by Humming: User-guided Sound Extraction from Monophonic Mixtures In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 69–72 [127] Smaragdis, P., Raj, B., and Shashanka, M (2007) Supervised and Semisupervised Separation of Sounds from Single-channel Mixtures In Int Conf on Independent Component Analysis and Signal Separation (ICA), pages 414–421 [128] Smith, J O (2011) Spectral Audio Signal Processing W3K Publishing [129] Souvira`a-Labastie, N., Olivero, A., Vincent, E., and Bimbot, F (2015) Multichannel Audio Source Separation using Multiple Deformed References IEEE/ACM Transactions on Audio, Speech and Language Processing, 23:1775–1787 [130] Sprechmann, P., Bronstein, A M., and Sapiro, G (2015) Supervised Nonnegative Matrix Factorization for Audio Source Separation In Excursions in Harmonic Analysis, Volume 4, pages 407–420 Springer International Publishing, Cham [131] Sun, D L and Mysore, G J (2013) Universal Speech Models for Speaker Independent Single-channel Source Separation In Proc IEEE Int Conf on Acoustics, Speech, and 
Signal Processing (ICASSP), pages 141–145 [132] Sunohara, M., Haruta, C., and Ono, N (2017) Low-latency Real-time Blind Source Separation for Hearing Aids based on Time-domain Implementation of Online Independent Vector Analysis with Truncation of Non-causal Components In Proc IEEE Int Conf on Acoustics, Speech, and Signal Processing (ICASSP), pages 216–220 [133] Tan, V Y F and Fevotte, C (2013) Automatic Relevance Determination in Nonnegative Matrix Factorization with the /spl beta/-Divergence IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1592–1605 [134] Thuy-Duong, N T., Linh-Trung, N., Tran-Duc, T., and Boashash, B (2013) Separation of Nonstationary EEG Epileptic Seizures Using Time-Frequency-Based Blind Signal Processing Techniques In Proceedings of the 4th International Conference on Biomedical Engineering in Vietnam, pages 317–323 [135] Traa, J., Smaragdis, P., Stein, N D., and Wingate, D (2015) Directional NMF for Joint Source Localization and Separation In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5 111 [136] Uhlich, S., Giron, F., and Mitsufuji, Y (2015) Deep Neural Network based Instrument Extraction from Music In Proc IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP), pages 2135–2139 [137] Vincent, E., Araki, S., and Bofill, P (2009) The 2008 Signal Separation Evaluation Campaign: A Community- Based Approach to Large- Scale Evaluation In Proc Int Conf on Independent Component Analysis and Signal Separation (ICA), pages 734–741 [138] Vincent, E., Araki, S., Theis, F., Nolte, G., Bofill, P., Sawada, H., Ozerov, A., Gowreesunker, V., Lutter, D., and Duong, N Q (2012) The Signal Separation Evaluation Campaign (2007 2010): Achievements and Remaining Challenges Signal Processing, 92(8):1928–1936 [139] Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., and Matassoni, M (2013) The Second ’CHiME’ Speech Separation and Recognition Challenge” Datasets, Tasks 
and Baselines In IEEE Int Conf on Acoustics, Speech and Signal Processing (ICASSP), pages 126–130 [140] Vincent, E., Bertin, N., Gribonval, R., and Bimbot, F (2014) From Blind to Guided Audio Source Separation: How Models and Side Information can Improve the Separation of Sound IEEE Signal Processing Magazine, 31(3):107–115 [141] Vincent, E., Gribonval, R., and Fevotte, C (2006a) Performance Measurement in Blind Audio Source Separation IEEE Transactions on Audio, Speech and Language Processing, 14(4):1462–1469 [142] Vincent, E., Jafari, M G., Abdallah, S A., Plumbley, M D., and Davies, M E (2010) Probabilistic Modeling Paradigms for Audio Source Separation In In Machine Audition: Principles, Algorithms and Systems, pages 162–185 IGI Global [143] Vincent, E., Jafari, M G., and Plumbley, M D (2006b) Preliminary Guidelines for Subjective Evalutation of Audio Source Separation Algorithms In UK ICA Research Network Workshop, Southampton, United Kingdom [144] Vincent, E., Sawada, H., Bofill, P., Makino, S., and Rosca, J P (2007) First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results In Independent Component Analysis and Signal Separation, pages 552–559 Springer Berlin Heidelberg 112 [145] Vincent, E., Virtanen, T., and Gannot, S., editors (2017) Audio Source Separation and Speech Enhancement Wiley [146] Virtanen, T (2007) Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria IEEE Transactions on Audio, Speech and Language Processing, 15(3):1066–1074 [147] Virtanen, T., Singh, R., and Raj, B., editors (2012) Techniques for Noise Robustness in Automatic Speech Recognition Wiley, Chichester, West Sussex, U.K ; Hoboken, N.J [148] Vuong-Hoang, N., Nguyen-Quoc, T., and Tran-Hoai, L (2010) Blind Speech Separation in Convolutive Mixtures using Non-Gaussianity Maximization and Inverse Filters In International Conference on Communications and Electronics 2010, pages 190–194 [149] Wang, 
D. (2017). Deep Learning Reinvents the Hearing Aid. IEEE Spectrum, 54(3):32–37.
[150] Wang, D. and Brown, G. J., editors (2006). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press; Wiley Interscience.
[151] Wang, L., Ding, H., and Yin, F. (2011). A Region-Growing Permutation Alignment Approach in Frequency-Domain Blind Source Separation of Speech Mixtures. IEEE Transactions on Audio, Speech and Language Processing, 19(3):549–557.
[152] Wang, Y.-X. and Zhang, Y.-J. (2013). Nonnegative Matrix Factorization: A Comprehensive Review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353.
[153] Wang, Z.-Q., Roux, J. L., and Hershey, J. R. (2018). Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
[154] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., and Schuller, B. (2015). Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In Latent Variable Analysis and Signal Separation, volume 9237, pages 91–99. Springer International Publishing.
[155] Weninger, F., Hershey, J. R., Le Roux, J., and Schuller, B. (2014). Discriminatively Trained Recurrent Neural Networks for Single-channel Speech Separation. In IEEE Global Conference on Signal and Information Processing, pages 577–581.
[156] Winter, S., Kellermann, W., Sawada, H., and Makino, S. (2006). MAP-Based Underdetermined Blind Source Separation of Convolutive Mixtures by Hierarchical Clustering and ℓ1-Norm Minimization. EURASIP Journal on Advances in Signal Processing, 2007(1):024717.
[157] Wölfel, M. and McDonough, J. (2009). Distant Speech Recognition. Wiley, Chichester, U.K.
[158] Wood, S. and Rouat, J. (2016). Blind Speech Separation with GCC-NMF. In Proc. Interspeech, pages 3329–3333.
[159] Wood, S. U. N., Rouat, J., Dupont, S., and Pironkov, G. (2017). Blind Speech Separation and Enhancement With
GCC-NMF. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4):745–755.
[160] Xiao, X., Watanabe, S., Erdogan, H., Lu, L., Hershey, J., Seltzer, M. L., Chen, G., Zhang, Y., Mandel, M., and Yu, D. (2016). Deep Beamforming Networks for Multi-channel Speech Recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5745–5749. IEEE.
[161] Yilmaz, Y. K., Cemgil, A. T., and Simsekli, U. (2011). Generalised Coupled Tensor Factorisation. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 2151–2159, USA. Curran Associates Inc.
[162] Yu, D. and Deng, L. (2011). Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]. IEEE Signal Processing Magazine, 28(1):145–154.
[163] Zdunek, R. (2011). Convolutive Nonnegative Matrix Factorization with Markov Random Field Smoothing for Blind Unmixing of Multichannel Speech Recordings. In Proc. The 5th International Conference on Advances in Nonlinear Speech Processing, NOLISP'11, pages 25–32. Springer-Verlag.
[164] Zdunek, R. (2013). Improved Convolutive and Under-Determined Blind Audio Source Separation with MRF Smoothing. Cognitive Computation, 5(4):493–503.
[165] Zhang, Z.-Y. (2012). Nonnegative Matrix Factorization: Models, Algorithms and Applications. In Data Mining: Foundations and Intelligent Paradigms, volume 24, pages 99–134. Springer Berlin Heidelberg.

LIST OF PUBLICATIONS

Hien-Thanh Thi Duong, Quoc-Cuong Nguyen, Cong-Phuong Nguyen, Thanh Huan Tran, and Ngoc Q. K. Duong (2015). Speech enhancement based on nonnegative matrix factorization with mixed group sparsity constraint. Proc. ACM International Symposium on Information and Communication Technology (SoICT 2015), pp. 247-251, Hue, Vietnam. ISBN 978-1-4503-3843-1, DOI: 10.1145/2833258.2833276.

Hien-Thanh Thi Duong, Quoc-Cuong Nguyen, Cong-Phuong Nguyen, and Ngoc Q. K. Duong (2016). Single-channel speaker-dependent speech enhancement exploiting generic noise model
learned by nonnegative matrix factorization. Proc. IEEE International Conference on Electronics, Information and Communication, pp. 268-271, Danang, Vietnam. Electronic ISBN 978-1-4673-8016-4, PoD ISBN 978-1-4673-8017-1, DOI: 10.1109/ELINFOCOM.2016.7562952.

Thanh Thi Hien Duong, Nobutaka Ono, Yasutaka Nakajima, and Toshiya Ohshima (2016). Non-stationary Segment Detection Methods based on Single-basis Non-negative Matrix Factorization for Effective Annotation. Proc. IEEE Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (IEEE APSIPA ASC), pp. 1-6, Jeju, Korea. Electronic ISBN 978-9-8814-7682-1, PoD ISBN 978-1-5090-2401-8, DOI: 10.1109/APSIPA.2016.7820760.

Thanh Thi Hien Duong, Phuong Cong Nguyen, and Cuong Quoc Nguyen (2018). Exploiting Nonnegative Matrix Factorization with Mixed Group Sparsity Constraint to Separate Speech Signal from Single-channel Mixture with Unknown Ambient Noise. EAI Endorsed Transactions on Context-Aware Systems and Applications, vol. 18(13), pp. 1-8. ISSN 2409-0026.

Duong Thi Hien Thanh, Nguyen Cong Phuong, and Nguyen Quoc Cuong (2018). Combination of Nonnegative Matrix Factorization and mixed group sparsity constraint to exploit generic source spectral model in single-channel audio source separation. Journal of Military Science and Technology, vol. 45(4), pp. 83-94. ISSN 1859-1043. (In Vietnamese)

Thanh Thi Hien Duong, Ngoc Q. K. Duong, Phuong Cong Nguyen, and Cuong Quoc Nguyen (2018). Multichannel source separation exploiting NMF-based generic source spectral model in Gaussian modeling framework. In: Deville Y., Gannot S., Mason R., Plumbley M., Ward D. (eds), Latent Variable Analysis and Signal Separation, LVA/ICA 2018, Lecture Notes in Computer Science, vol. 10891, pp. 547-557. Springer, Cham. DOI: 10.1007/978-3-319-93764-9_50.

Thanh Thi Hien Duong, Ngoc Q. K. Duong, Phuong Cong Nguyen, and Cuong Quoc Nguyen (2019). Gaussian modeling-based multichannel audio source separation exploiting generic source spectral model. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, vol. 27(1), pp. 32-43. ISSN 2329-9304, DOI: 10.1109/TASLP.2018.2869692. (ISI - Q1)

NOTATION (excerpt)

a : vector
A : matrix
A^T : transpose of matrix A
A^H : conjugate (Hermitian) transpose of matrix A
diag(a) : diagonal matrix with a as its diagonal
det(A) : determinant of matrix A
tr(A) : trace of matrix A

... directly estimating the time-frequency mask [149] or for estimating the source spectra whose ratio yields a time-frequency mask [5, 56, 136]. Time-frequency masking, as its name suggests, estimates the ...

ABBREVIATIONS (excerpt)

MAP : Maximum A Posteriori
MFCC : Mel-Frequency Cepstral Coefficients
ML : Maximum Likelihood
MMSE : Minimum Mean Square Error
MU : Multiplicative Update
NMF : Non-negative Matrix Factorization
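The excerpt above mentions estimating source spectra whose ratio yields a time-frequency mask. A minimal NumPy sketch of this soft (ratio) masking idea follows; the random spectrograms `V1` and `V2` are illustrative stand-ins for model outputs (e.g. from NMF), not data from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 4, 6                        # toy size: frequency bins x time frames

# Assumed inputs: nonnegative spectrogram estimates of two sources.
V1 = rng.random((F, N)) + 0.1      # estimated power spectrogram of source 1
V2 = rng.random((F, N)) + 0.1      # estimated power spectrogram of source 2

# Toy complex mixture STFT: combined magnitude with random phase.
phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (F, N)))
X = (V1 + V2) * phase

eps = 1e-12                        # guard against division by zero
mask1 = V1 / (V1 + V2 + eps)       # soft (ratio) mask for source 1
mask2 = V2 / (V1 + V2 + eps)       # soft (ratio) mask for source 2

S1_hat = mask1 * X                 # masked mixture = STFT estimate of source 1
S2_hat = mask2 * X                 # STFT estimate of source 2

# The soft masks partition the mixture bin by bin.
assert np.all((mask1 >= 0) & (mask1 <= 1))
assert np.allclose(mask1 + mask2, 1.0)
assert np.allclose(S1_hat + S2_hat, X)
```

Each mask value lies in [0, 1] and the two masks sum to one per bin, so the source estimates add back up to the mixture; an inverse STFT of `S1_hat` would give the time-domain source estimate.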