In this part, we introduce the motivation and the problem that we focus on throughout this thesis. Then, we state the objectives and the scope of our work. In addition, our contributions are summarized in order to give a clear view of what has been achieved. Finally, the structure of the thesis is presented chapter by chapter.

1. Background and Motivation

1.1. Cocktail party problem

Real-world sound scenarios are usually very complicated, as they are mixtures of many different sound sources. Fig. 1 depicts the scenario of a typical cocktail party, where many people are attending, many conversations are going on simultaneously, and there are various disturbances such as loud music, people screaming, and a lot of hustle and bustle. Similar situations also occur in daily life, for example in outdoor recordings, where there is interference from a variety of environmental sounds, or in a music concert, where a number of musical instruments are played and the audience listens to the collective sound. In such settings, what is actually heard by the ears is a mixture of various sounds generated by various audio sources. The mixing process can also include many sound reflections from walls and ceilings, which is known as reverberation. Humans with normal hearing are generally able to locate, identify, and differentiate sound sources that are heard simultaneously, and thus to understand the conveyed information. However, this task has remained extremely challenging for machines, especially in highly noisy and reverberant environments. The cocktail party effect described above hinders both humans and machines in perceiving the target sound sources [2, 12, 145], and the creation of machine listening algorithms that can automatically separate sound sources in difficult mixing conditions remains an open problem.
Audio source separation aims at providing machine listeners with a function similar to that of the human ears by separating and extracting the signals of individual sources from a given mixture. This technique is formally termed blind source separation (BSS) when no prior information about either the sources or the mixing condition is available, and is described in Fig. 2. Audio source separation is also known in the audio signal processing community as an effective solution to the cocktail party problem [85, 90, 138, 143, 152]. Depending on the specific application, some source separation approaches focus on speech separation, in which the speech signal is extracted from a mixture containing multiple background noises and other unwanted sounds. Other methods deal with music separation, in which the singing voice and certain instruments are recovered from a mixture or song containing multiple musical instruments. The separated source signals may be either listened to or further processed, giving rise to many potential applications. Speech separation is mainly used for speech enhancement in hearing aids and hands-free phones, or for automatic speech recognition (ASR) in adverse conditions [11, 47, 64, 116, 129]. Music separation, in turn, has many interesting applications, including editing/remixing in music post-production, up-mixing, music information retrieval, rendering of stereo recordings, and karaoke [37, 51, 106, 110].
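To make the mixing process described in this section concrete, the following minimal Python sketch builds a multichannel mixture in which every source reaches every microphone through a short filter standing in for the direct path plus reflections (reverberation). It assumes the standard convolutive mixing model, x_i(t) = sum over j and tau of a_ij(tau) s_j(t - tau), which this section does not state explicitly. The function names `convolve` and `mix` and the toy signals are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def convolve(signal: Sequence[float], filt: Sequence[float]) -> List[float]:
    """Full linear convolution of one source signal with one mixing filter."""
    out = [0.0] * (len(signal) + len(filt) - 1)
    for t, s in enumerate(signal):
        for tau, a in enumerate(filt):
            out[t + tau] += a * s
    return out


def mix(sources: Sequence[Sequence[float]],
        filters: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Assumed convolutive mixing model (illustrative, not taken from the text).

    filters[i][j] is the filter from source j to microphone i.
    Returns one mixture signal per microphone.
    """
    n_mics = len(filters)
    # Length of the longest filtered source, so that all images fit.
    length = max(len(s) + len(filters[i][j]) - 1
                 for i in range(n_mics) for j, s in enumerate(sources))
    mixture = [[0.0] * length for _ in range(n_mics)]
    for i in range(n_mics):
        for j, s in enumerate(sources):
            image = convolve(s, filters[i][j])  # image of source j at microphone i
            for t, value in enumerate(image):
                mixture[i][t] += value
    return mixture


if __name__ == "__main__":
    # Two toy sources and two microphones; the second filter tap mimics one reflection.
    toy_sources = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    toy_filters = [[[1.0, 0.5], [0.3]],
                   [[0.2], [1.0, 0.5]]]
    for channel in mix(toy_sources, toy_filters):
        print(channel)
```

Source separation is the inverse task: given only the mixture channels, recover the individual source signals or their images at the microphones. In the blind setting, neither the sources nor the filters are known.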
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

Major: Computer Science
Code: 9480101

SUPERVISORS:
ASSOC. PROF. DR. NGUYEN QUOC CUONG
DR. NGUYEN CONG PHUONG

Hanoi - 2019

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
NOTATIONS AND GLOSSARY
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION

Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART
  1.1 Audio source separation: a solution for cocktail party problem
    1.1.1 General framework for source separation
    1.1.2 Problem formulation
  1.2 State of the art
    1.2.1 Spectral models
      1.2.1.1 Gaussian Mixture Model
      1.2.1.2 Nonnegative Matrix Factorization
      1.2.1.3 Deep Neural Networks
    1.2.2 Spatial models
      1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)
      1.2.2.2 Rank-1 covariance matrix
      1.2.2.3 Full-rank spatial covariance model
  1.3 Source separation performance evaluation
    1.3.1 Energy-based criteria
    1.3.2 Perceptually-based criteria
  1.4 Summary

Chapter 2. NONNEGATIVE MATRIX FACTORIZATION
  2.1 NMF introduction
    2.1.1 NMF in a nutshell
    2.1.2 Cost function for parameter estimation
    2.1.3 Multiplicative update rules
  2.2 Application of NMF to audio source separation
    2.2.1 Audio spectra decomposition
    2.2.2 NMF-based audio source separation
  2.3 Proposed application of NMF to unusual sound detection
    2.3.1 Problem formulation
    2.3.2 Proposed methods for non-stationary frame detection
      2.3.2.1 Signal energy based method
      2.3.2.2 Global NMF-based method
      2.3.2.3 Local NMF-based method
    2.3.3 Experiment
      2.3.3.1 Dataset
      2.3.3.2 Algorithm settings and evaluation metrics
      2.3.3.3 Results and discussion
  2.4 Summary

Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT
  3.1 General workflow of the proposed approach
  3.2 GSSM formulation
  3.3 Model fitting with sparsity-inducing penalties
    3.3.1 Block sparsity-inducing penalty
    3.3.2 Component sparsity-inducing penalty
    3.3.3 Proposed mixed sparsity-inducing penalty
  3.4 Derived algorithm in unsupervised case
  3.5 Derived algorithm in semi-supervised case
    3.5.1 Semi-GSSM formulation
    3.5.2 Model fitting with mixed sparsity and algorithm
  3.6 Experiment
    3.6.1 Experiment data
      3.6.1.1 Synthetic dataset
      3.6.1.2 SiSEC-MUS dataset
      3.6.1.3 SiSEC-BGN dataset
    3.6.2 Single-channel source separation performance with unsupervised setting
      3.6.2.1 Experiment settings
      3.6.2.2 Evaluation method
      3.6.2.3 Results and discussion
    3.6.3 Single-channel source separation performance with semi-supervised setting
      3.6.3.1 Experiment settings
      3.6.3.2 Evaluation method
      3.6.3.3 Results and discussion
  3.7 Summary

Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
  4.1 Formulation and modeling
    4.1.1 Local Gaussian model
    4.1.2 NMF-based source variance model
    4.1.3 Estimation of the model parameters
  4.2 Proposed GSSM-based multichannel approach
    4.2.1 GSSM construction
    4.2.2 Proposed source variance fitting criteria
      4.2.2.1 Source variance denoising
      4.2.2.2 Source variance separation
    4.2.3 Derivation of MU rule for updating the activation matrix
    4.2.4 Derived algorithm
  4.3 Experiment
    4.3.1 Dataset and parameter settings
    4.3.2 Algorithm analysis
      4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations
      4.3.2.2 Separation results with different choices of λ and γ
    4.3.3 Comparison with the state of the art
  4.4 Summary

CONCLUSIONS AND PERSPECTIVES
BIBLIOGRAPHY
LIST OF PUBLICATIONS

NOTATIONS AND GLOSSARY

Standard mathematical symbols
  $\mathbb{C}$: Set of complex numbers
  $\mathbb{R}$: Set of real numbers
  $\mathbb{Z}$: Set of integers
  $\mathbb{E}$: Expectation of a random variable
  $\mathcal{N}_c$: Complex Gaussian distribution

Vectors and matrices
  $a$: Scalar
  $\mathbf{a}$: Vector
  $\mathbf{A}$: Matrix
  $\mathbf{A}^{T}$: Matrix transpose
  $\mathbf{A}^{H}$: Matrix conjugate transposition (Hermitian conjugation)
  $\mathrm{diag}(\mathbf{a})$: Diagonal matrix with $\mathbf{a}$ as its diagonal
  $\det(\mathbf{A})$: Determinant of matrix $\mathbf{A}$
  $\mathrm{tr}(\mathbf{A})$: Matrix trace
  $\mathbf{A} \odot \mathbf{B}$: The element-wise Hadamard product of two matrices (of the same dimension), with elements $[\mathbf{A} \odot \mathbf{B}]_{ij} = A_{ij} B_{ij}$
  $\mathbf{A}^{.(n)}$: The matrix with entries $[\mathbf{A}]_{ij}^{n}$
  $\|\mathbf{a}\|_1$: $\ell_1$-norm of vector
  $\|\mathbf{A}\|_1$: $\ell_1$-norm of matrix

Indices
  $f$: Frequency index
  $i$: Channel index
  $j$: Source index
  $n$: Time frame index
  $t$: Time sample index

Sizes
  $I$: Number of channels
  $J$: Number of sources
  $L$: STFT filter length
  $F$: Number of frequency bins
  $N$: Number of time frames
  $K$: Number of spectral bases

Mixing filters
  $\mathbf{A} \in \mathbb{R}^{I \times J \times L}$: Matrix of filters
  $\mathbf{a}_j(\tau) \in \mathbb{R}^{I}$: Mixing filter of the $j$-th source to all microphones, where $\tau$ is the time delay
  $a_{ij}(t) \in \mathbb{R}$: Filter coefficient at the $t$-th time index
  $\mathbf{a}_{ij} \in \mathbb{R}^{L}$: Time-domain filter vector
  $\mathbf{a}_{ij} \in \mathbb{C}^{L}$: Frequency-domain filter vector
  $a_{ij}(f) \in \mathbb{C}$: Filter coefficient at the $f$-th frequency index

General parameters
  $\mathbf{x}(t) \in \mathbb{R}^{I}$: Time-domain mixture signal
  $\mathbf{s}(t) \in \mathbb{R}^{J}$: Time-domain source signals
  $\mathbf{c}_j(t) \in \mathbb{R}^{I}$: Time-domain $j$-th source image
  $s_j(t) \in \mathbb{R}$: Time-domain $j$-th original source signal
  $\mathbf{x}(n, f) \in \mathbb{C}^{I}$: Time-frequency domain mixture signal
  $\mathbf{s}(n, f) \in \mathbb{C}^{J}$: Time-frequency domain source signals
  $\mathbf{c}_j(n, f) \in \mathbb{C}^{I}$: Time-frequency domain $j$-th source image
  $v_j(n, f) \in \mathbb{R}$: Time-dependent variances of the $j$-th source
  $\mathbf{R}_j(f) \in \mathbb{C}$: Time-independent covariance matrix of the $j$-th source
  $\mathbf{\Sigma}_j(n, f) \in \mathbb{C}^{I \times I}$: Covariance matrix of the $j$-th source image
  $\mathbf{\Sigma}_x(n, f) \in \mathbb{C}^{I \times I}$: Empirical mixture covariance
  $\mathbf{V} \in \mathbb{R}_{+}^{F \times N}$: Power spectrogram matrix
  $\mathbf{W} \in \mathbb{R}_{+}^{F \times K}$: Spectral basis matrix
  $\mathbf{H} \in \mathbb{R}_{+}^{K \times N}$: Time activation matrix
  $\mathbf{U} \in \mathbb{R}_{+}^{F \times K}$: Generic source spectral model

Abbreviations
  APS: Artifacts-related Perceptual Score
  BSS: Blind Source Separation
  DoA: Direction of Arrival
  DNN: Deep Neural Network
  EM: Expectation Maximization
  ICA: Independent Component Analysis
  IPS: Interference-related Perceptual Score
  IS: Itakura-Saito
  ISR: source Image to Spatial distortion Ratio
  ISTFT: Inverse Short-Time Fourier Transform
  IID (i.i.d): Interchannel Intensity Difference
  ITD (i.t.d): Interchannel Time Difference
  GCC-PHAT: Generalized Cross Correlation Phase Transform
  GMM: Gaussian Mixture Model
  GSSM: Generic Source Spectral Model
  KL: Kullback-Leibler
  LGM: Local Gaussian Model
  MAP: Maximum A Posteriori
  ML: Maximum Likelihood
  MU: Multiplicative Update
  NMF: Non-negative Matrix Factorization
  OPS: Overall Perceptual Score
  PLCA: Probabilistic Latent Component Analysis
  SAR: Signal to Artifacts Ratio
  SDR: Signal to Distortion Ratio
  SIR: Signal to Interference Ratio
  SiSEC: Signal Separation Evaluation Campaign
  SNMF: Spectral Non-negative Matrix Factorization
  SNR: Signal to Noise Ratio
  STFT: Short-Time Fourier Transform
  TDOA: Time Difference of Arrival
  T-F: Time-Frequency
  TPS: Target-related Perceptual Score

LIST OF TABLES

2.1 Total number of different events detected from three recordings in spring
2.2 Total number of different events detected from three recordings in summer
2.3 Total number of different events detected from three recordings in winter
3.1 List of snip songs in the SiSEC-MUS dataset
3.2 Source separation performance obtained on the Synthetic and SiSEC-MUS dataset with unsupervised setting
3.3 Speech separation performance obtained on the SiSEC-BGN. ∗ indicates submissions by the authors and “-” indicates missing information [81, 98, 100]
3.4 Speech separation performance obtained on the Synthetic dataset with semi-supervised setting
4.1 Speech separation performance obtained on the SiSEC-BGN-devset - Comparison with closed baseline methods
4.2 Speech separation performance obtained on the SiSEC-BGN-devset - Comparison with s-o-t-a methods in SiSEC. ∗ indicates submissions by the authors and “-” indicates missing information
4.3 Speech separation performance obtained on the test set of the SiSEC-BGN. ∗ indicates submissions by the authors [81]

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(7):1462–1476.
[39] Fan, H.-T., Hung, J.-w., Lu, X., Wang, S.-S., and Tsao, Y. (2014). Speech enhancement using segmental nonnegative matrix factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4483–4487.
[40] Févotte, C., Bertin, N., and Durrieu, J. L. (2009). Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830.
[41] Févotte, C., Gribonval, R., and Vincent, E. (2005). BSS EVAL Toolbox User Guide – Revision 2.0. Technical report. Developed with the support of the French GdR-ISIS/CNRS Workgroup “Resources for Audio Source Separation”.
[42] Févotte, C. and Idier, J. (2011). Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Computation, 23(9):2421–2456.
[43] Févotte, C., Vincent, E., and Ozerov, A. (2017). Single-channel audio source separation with NMF: divergences, constraints and algorithms. In Audio Source Separation. Springer.
[44] Fitzgerald, D. (2012). User assisted separation using tensor factorisations. 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 2412–2416.
[45] Fox, B., Sabin, A., Pardo, B., and Zopf, A. (2007). Modeling perceptual similarity of audio signals for blind source separation evaluation. In Independent Component Analysis and Signal Separation - 7th International Conference, ICA 2007, Proceedings, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 454–461.
[46] Fritsch, J. and Plumbley, M. (2013). Score informed audio source separation using
constrained nonnegative matrix factorization and score synthesis. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 888–891.
[47] Gannot, S., Vincent, E., Markovich-Golan, S., and Ozerov, A. (2017). A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4):692–730.
[48] Gerber, T., Dutasta, M., Girin, L., and Févotte, C. (2012). Professionally-produced music separation guided by covers. In International Society for Music Information Retrieval Conference (ISMIR 2012), pages 85–90, Porto, Portugal.
[49] Gribonval, R., Vincent, E., Févotte, C., and Benaroya, L. (2003). Proposals for performance measurement in source separation. In 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003), pages 763–768.
[50] Gustafsson, T., Rao, B., and Trivedi, M. (2003). Source localization in reverberant environments: modeling and statistical analysis. IEEE Transactions on Speech and Audio Processing, 11(6):791–803.
[51] Hennequin, R., David, B., and Badeau, R. (2011). Score informed audio source separation using a parametric model of non-negative spectrogram. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 45–48.
[52] Hershey, J. R., Chen, Z., Roux, J. L., and Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35.
[53] Heymann, J., Drude, L., and Haeb-Umbach, R. (2017). A generic neural acoustic beamforming architecture for robust multi-channel speech processing. Computer Speech & Language, 46:374–385.
[54] Houda, A. and Otman, C. (2015). Article: Blind audio source separation: State-of-art. International Journal of Computer Applications, 130(4):1–6. Published by Foundation of Computer Science (FCS), NY, USA.
[55] Huang, A. (2013). NMF Face Recognition Method Based on Alpha Divergence. In Zhong, Z.,
editor, Proceedings of the International Conference on Information Engineering and Applications (IEA) 2012, volume 217, pages 477–483. Springer London.
[56] Huang, P., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. (2015). Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio, Speech & Language Processing, 23(12):2136–2147.
[57] Huber, R. and Kollmeier, B. (2006). PEMO-Q - A new method for objective audio quality assessment using a model of auditory perception. IEEE Transactions on Audio, Speech, and Language Processing, 14(6):1902–1911.
[58] Hurmalainen, A., Saeidi, R., and Virtanen, T. (2012). Group sparsity for speaker identity discrimination in factorisation-based speech recognition. In Proc. Interspeech, pages 17–20.
[59] Ito, N., Araki, S., and Nakatani, T. (2013). Permutation-free convolutive blind source separation via full-band clustering based on frequency-independent source presence priors. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 3238–3242.
[60] Izumi, Y., Ono, N., and Sagayama, S. (2007). Sparseness-Based 2ch BSS using the EM Algorithm in Reverberant Environment. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 147–150.
[61] Jeter, M. and Pye, W. (1981). A note on nonnegative rank factorizations. Linear Algebra and its Applications, 38:171–173.
[62] Jiang, Y., Wang, D., Liu, R., and Feng, Z. (2014). Binaural classification for reverberant speech segregation using deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2112–2121.
[63] Jourjine, A., Rickard, S., and Yılmaz, O. (2000). Blind separation of disjoint orthogonal signals: Demixing N sources from mixtures. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 2985–2988.
[64] Kim, G. and Loizou, P. (2010). Improving speech intelligibility in noise using environment-optimized algorithms. IEEE Trans. Audio, Speech, Language
Processing, 18(8):2080–2090.
[65] Kinoshita, K., Delcroix, M., Yoshioka, T., Nakatani, T., Sehr, A., Kellermann, W., and Maas, R. (2013). The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–4, NY, USA.
[66] Kitamura, D., Ono, N., Sawada, H., Kameoka, H., and Saruwatari, H. (2016a). Determined Blind Source Separation Unifying Independent Vector Analysis and Nonnegative Matrix Factorization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1626–1641.
[67] Kitamura, D., Ono, N., Sawada, H., Kameoka, H., and Saruwatari, H. (2016b). Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization. IEEE/ACM Trans. on Audio, Speech and Language Processing, 24(9):1622–1637.
[68] Kompass, R. (2007). A generalized divergence measure for nonnegative matrix factorization. Neural Computation, 19(3):780–791.
[69] Kuttruff, H. (2000). Room Acoustics. Spon Press, New York, 4th edition.
[70] Lafay, G., Benetos, E., and Lagrange, M. (2017). Sound event detection in synthetic audio: Analysis of the DCASE 2016 task results. In 2016 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 11–15.
[71] Le Magoarou, L., Ozerov, A., and Duong, N. Q. K. (2013). Text-informed audio source separation using nonnegative matrix partial co-factorization. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6.
[72] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.
[73] Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural and Information Processing Systems 13, pages 556–562.
[74] Lefèvre, A., Bach, F., and Févotte, C. (2011). Itakura-Saito non-negative matrix factorization with group sparsity. In IEEE Int. Conf. on
Acoustics, Speech, and Signal Processing (ICASSP), pages 21–24.
[75] Leglaive, S., Şimşekli, U., Liutkus, A., Badeau, R., and Richard, G. (2017). Alpha-stable multichannel audio source separation. In IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 576–580.
[76] Li, Y. and Ngom, A. (2013). The non-negative matrix factorization toolbox for biological data mining. Source Code for Biology and Medicine, 8(1):1–10.
[77] Liutkus, A., Badeau, R., and Richard, G. (2011). Gaussian Processes for Underdetermined Source Separation. IEEE Transactions on Signal Processing, 59(7):3155–3167.
[78] Liutkus, A., Durrieu, J. L., Daudet, L., and Richard, G. (2013). An overview of informed audio source separation. In Proc. IEEE Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pages 1–4.
[79] Liutkus, A., Fitzgerald, D., and Rafii, Z. (2015). Scalable audio separation with light kernel additive modelling. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 7680.
[80] Liutkus, A., Stöter, F. R., Rafii, Z., Kitamura, D., Rivet, B., Ito, N., Ono, N., and Fontecave, J. (2017a). The 2016 signal separation evaluation campaign. In Proc. Int. Conf. on Latent Variable Analysis and Signal Separation, pages 323–332.
[81] Liutkus, A., Stöter, F.-R., Rafii, Z., Kitamura, D., Rivet, B., Ito, N., Ono, N., and Fontecave, J. (2017b). The 2016 Signal Separation Evaluation Campaign. In Latent Variable Analysis and Signal Separation, volume 10169, pages 323–332. Springer International Publishing, Cham.
[82] López, A. R., Ono, N., Remes, U., Palomäki, K., and Kurimo, M. (2015). Designing multichannel source separation based on single-channel source separation. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 469–473.
[83] Magoarou, L. L., Ozerov, A., and Duong, N. Q. K. (2014). Text-informed audio source separation example-based approach using non-negative matrix partial co-factorization. Journal of Signal Processing
Systems, pages 1–5.
[84] Magron, P., Badeau, R., and Liutkus, A. (2017). Lévy NMF for robust nonnegative source separation. In Proc. IEEE Int. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 259–263.
[85] Makino, S., Lee, T.-W., and Sawada, H. (2007). Blind Speech Separation. Springer.
[86] Mandel, M. I., Weiss, R. J., and Ellis, D. P. W. (2010). Model-based expectation-maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing, 18(2):382–394.
[87] McCowan, I. and Bourlard, H. (2003). Microphone array post-filter based on noise field coherence. IEEE Transactions on Speech and Audio Processing, 11(6):709–716.
[88] Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T., and Plumbley, M. D. (2018). Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2):379–393.
[89] Mohammadiha, N., Smaragdis, P., and Leijon, A. (2013). Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization. IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2140–2151.
[90] Naik, G. R. and Wang, W., editors (2014). Blind source separation: advances in theory, algorithms and applications. Signals and communication technology. Springer, Berlin.
[91] Nakajima, Y., Sunohara, M., Naito, T., Sunago, N., Ohshima, T., and Ono, N. (2016). DNN-based environmental sound recognition with real-recorded and artificially-mixed training data.
[92] Naylor, P. A. and Gaubitch, N. D., editors (2010). Speech Dereverberation. Signals and Communication Technology. Springer London.
[93] Nesta, F. and Omologo, M. (2012). Generalized State Coherence Transform for Multidimensional TDOA Estimation of Multiple Sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):246–260.
[94] Nikunen, J. and Virtanen, T. (2014). Direction of arrival based spatial covariance model for blind sound source
separation. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 22(3):727–739.
[95] Nix, J. and Hohmann, V. (2007). Combined Estimation of Spectral Envelopes and Sound Source Direction of Concurrent Voices by Multidimensional Statistical Filtering. IEEE Transactions on Audio, Speech and Language Processing, 15(3):995–1008.
[96] Nugraha, A., Liutkus, A., and Vincent, E. (2016). Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 14(9):1652–1664.
[97] O’Grady, P. D., Pearlmutter, B. A., and Rickard, S. T. (2005). Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology, 15(1):18–33.
[98] Ono, N., Koldovský, Z., Miyabe, S., and Ito, N. (2013a). The 2013 Signal Separation Evaluation Campaign. In 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6.
[99] Ono, N., Koldovský, Z., Miyabe, S., and Ito, N. (2013b). The 2013 Signal Separation Evaluation Campaign. In Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6.
[100] Ono, N., Rafii, Z., Kitamura, D., Ito, N., and Liutkus, A. (2015a). The 2015 Signal Separation Evaluation Campaign. In Latent Variable Analysis and Signal Separation, volume 9237, pages 387–395. Springer International Publishing, Cham.
[101] Ono, N., Rafii, Z., Kitamura, D., Ito, N., and Liutkus, A. (2015b). The 2015 Signal Separation Evaluation Campaign. In Latent Variable Analysis and Signal Separation (LVA/ICA), volume 9237, pages 387–395. Springer.
[102] Ozerov, A. and Févotte, C. (2010). Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):550–563.
[103] Ozerov, A. and Févotte, C. (2010). Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. on Audio, Speech and Language Processing, 18(3):550–563.
[104]
Ozerov, A., Févotte, C., Blouet, R., and Durrieu, J.-L. (2011). Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 257–260.
[105] Ozerov, A., Févotte, C., and Vincent, E. (2017). An introduction to multichannel NMF for audio source separation. In Audio Source Separation, Signals and Communication Technology. Springer.
[106] Ozerov, A., Philippe, P., Bimbot, F., and Gribonval, R. (2007). Adaptation of Bayesian Models for Single-Channel Source Separation and its Application to Voice/Music Separation in Popular Songs. IEEE Transactions on Audio, Speech and Language Processing, 15(5):1564–1578.
[107] Ozerov, A., Vincent, E., and Bimbot, F. (2012). A general flexible framework for the handling of prior information in audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1118–1133.
[108] Paatero, P. (1997). Least squares formulation of robust non-negative factor analysis. Chemometrics and Intelligent Laboratory Systems, 37(1):23–35.
[109] Paatero, P. and Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126.
[110] Parekh, S., Essid, S., Ozerov, A., Duong, N. Q. K., Perez, P., and Richard, G. (2017). Motion informed audio source separation. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 6–10.
[111] Parvaix, M. and Girin, L. (2011). Informed Source Separation of Linear Instantaneous Under-Determined Audio Mixtures by Source Index Embedding. IEEE Transactions on Audio, Speech, and Language Processing, 19(6):1721–1733.
[112] Pedersen, M. S., Larsen, J., Kjems, U., and Parra, L. C. (2007). A survey of convolutive blind source separation methods. In Springer Handbook of Speech Processing, pages 1–34. Springer.
[113] Quirós, A. and Wilson, S. P. (2012). Dependent Gaussian mixture models for
source separation. EURASIP Journal on Advances in Signal Processing, 2012(1).
[114] Rafii, Z. and Pardo, B. (2013). Online REPET-SIM for real-time speech enhancement. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 848–852.
[115] Rennie, S. J., Hershey, J. R., and Olsen, P. A. (2008). Efficient model-based speech separation and denoising using nonnegative subspace analysis. In Proc. of ICASSP, Las Vegas, pages 1833–1836.
[116] Revit, L. J. and Schulein, R. B. (2013). Sound reproduction method and apparatus for assessing real-world performance of hearing and hearing aids. The Journal of the Acoustical Society of America, 133(2):1196–1199.
[117] Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1):19–41.
[118] Roy, R. and Kailath, T. (1989). ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 37(7):984–995.
[119] Sainath, T. N., Weiss, R. J., Wilson, K. W., Li, B., Narayanan, A., Variani, E., Bacchiani, M., Shafran, I., Senior, A., Chin, K., Misra, A., and Kim, C. (2017). Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(5):965–979.
[120] Sandler, R. and Lindenbaum, M. (2011). Nonnegative matrix factorization with earth mover’s distance metric for image analysis. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1590–1602.
[121] Sawada, H., Araki, S., and Makino, S. (2011). Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Transactions on Audio, Speech, and Language Processing, 19(3):516–527.
[122] Sawada, H., Kameoka, H., Araki, S., and Ueda, N. (2013). Multichannel Extensions of Non-Negative Matrix Factorization With Complex-Valued Data. IEEE Transactions on Audio, Speech, and Language Processing, 21(5):971–982.
[123] Smaragdis, P.
and Mysore, G. J. (2009). Separation by humming: User-guided sound extraction from monophonic mixtures. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 69–72.
[124] Smaragdis, P., Raj, B., and Shashanka, M. (2007). Supervised and semi-supervised separation of sounds from single-channel mixtures. In Int. Conf. on Independent Component Analysis and Signal Separation (ICA), pages 414–421.
[125] Smith, J. O. (2011). Spectral audio signal processing. W3K Publishing.
[126] Souviraà-Labastie, N., Olivero, A., Vincent, E., and Bimbot, F. (2015). Multichannel audio source separation using multiple deformed references. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23:1775–1787.
[127] Sprechmann, P., Bronstein, A. M., and Sapiro, G. (2015). Supervised nonnegative matrix factorization for audio source separation. In Excursions in Harmonic Analysis, Volume 4, pages 407–420. Springer International Publishing, Cham.
[128] Sun, D. L. and Mysore, G. J. (2013). Universal speech models for speaker independent single channel source separation. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 141–145.
[129] Sunohara, M., Haruta, C., and Ono, N. (2017). Low-latency real-time blind source separation for hearing aids based on time-domain implementation of online independent vector analysis with truncation of non-causal components. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pages 216–220.
[130] Tan, V. Y. F. and Févotte, C. (2013). Automatic Relevance Determination in Nonnegative Matrix Factorization with the β-Divergence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1592–1605.
[131] Traa, J., Smaragdis, P., Stein, N. D., and Wingate, D. (2015). Directional NMF for joint source localization and separation. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5.
[132] Uhlich, S., Giron, F., and Mitsufuji, Y. (2015). Deep neural
network based instrument extraction from music. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 2135–2139.
[133] Vincent, E., Araki, S., and Bofill, P. (2009). The 2008 signal separation evaluation campaign: A community-based approach to large-scale evaluation. In Proc. Int. Conf. on Independent Component Analysis and Signal Separation (ICA), pages 734–741.
[134] Vincent, E., Araki, S., Theis, F., Nolte, G., Bofill, P., Sawada, H., Ozerov, A., Gowreesunker, V., Lutter, D., and Duong, N. Q. (2012). The signal separation evaluation campaign (2007–2010): Achievements and remaining challenges. Signal Processing, 92(8):1928–1936.
[135] Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F., and Matassoni, M. (2013). The second ‘CHiME’ speech separation and recognition challenge: Datasets, tasks and baselines. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 126–130.
[136] Vincent, E., Bertin, N., Gribonval, R., and Bimbot, F. (2014). From Blind to Guided Audio Source Separation: How models and side information can improve the separation of sound. IEEE Signal Processing Magazine, 31(3):107–115.
[137] Vincent, E., Gribonval, R., and Févotte, C. (2006a). Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4):1462–1469.
[138] Vincent, E., Jafari, M. G., Abdallah, S. A., Plumbley, M. D., and Davies, M. E. (2010). Probabilistic modeling paradigms for audio source separation. In Machine Audition: Principles, Algorithms and Systems, pages 162–185. IGI Global.
[139] Vincent, E., Jafari, M. G., and Plumbley, M. D. (2006b). Preliminary guidelines for subjective evaluation of audio source separation algorithms. In UK ICA Research Network Workshop, Southampton, United Kingdom.
[140] Vincent, E., Sawada, H., Bofill, P., Makino, S., and Rosca, J. P. (2007). First Stereo Audio Source Separation Evaluation Campaign: Data, Algorithms and Results. In Independent Component Analysis and Signal
Separation, pages 552–559. Springer Berlin Heidelberg.
[141] Vincent, E., Virtanen, T., and Gannot, S., editors (2017). Audio Source Separation and Speech Enhancement. Wiley.
[142] Virtanen, T. (2007). Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1066–1074.
[143] Virtanen, T., Singh, R., and Raj, B., editors (2012). Techniques for noise robustness in automatic speech recognition. Wiley, Chichester, West Sussex, U.K.; Hoboken, N.J.
[144] Wang, D. (2017). Deep learning reinvents the hearing aid. IEEE Spectrum, 54(3):32–37.
[145] Wang, D. and Brown, G. J., editors (2006). Computational auditory scene analysis: principles, algorithms, and applications. IEEE Press; Wiley Interscience.
[146] Wang, L., Ding, H., and Yin, F. (2011). A region-growing permutation alignment approach in frequency-domain blind source separation of speech mixtures. IEEE Transactions on Audio, Speech and Language Processing, 19(3):549–557.
[147] Wang, Y.-X. and Zhang, Y.-J. (2013). Nonnegative Matrix Factorization: A Comprehensive Review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353.
[148] Wang, Z.-Q., Le Roux, J., and Hershey, J. R. (2018). Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
[149] Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J. R., and Schuller, B. (2015). Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In Latent Variable Analysis and Signal Separation, volume 9237, pages 91–99. Springer International Publishing.
[150] Weninger, F., Hershey, J. R., Le Roux, J., and Schuller, B. (2014). Discriminatively trained recurrent neural networks for single-channel speech separation. In IEEE Global Conference on Signal and Information Processing, pages
577–581.
[151] Winter, S., Kellermann, W., Sawada, H., and Makino, S. (2006). MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and ℓ1-norm minimization. EURASIP Journal on Advances in Signal Processing, 2007(1):024717.
[152] Wölfel, M. and McDonough, J. (2009). Distant speech recognition. Wiley, Chichester, U.K.
[153] Wood, S. and Rouat, J. (2016). Blind speech separation with GCC-NMF. In Proc. Interspeech, pages 3329–3333.
[154] Wood, S. U. N., Rouat, J., Dupont, S., and Pironkov, G. (2017). Blind Speech Separation and Enhancement With GCC-NMF. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(4):745–755.
[155] Xiao, X., Watanabe, S., Erdogan, H., Lu, L., Hershey, J., Seltzer, M. L., Chen, G., Zhang, Y., Mandel, M., and Yu, D. (2016). Deep beamforming networks for multi-channel speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5745–5749. IEEE.
[156] Yilmaz, Y. K., Cemgil, A. T., and Simsekli, U. (2011). Generalised coupled tensor factorisation. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 2151–2159, USA. Curran Associates Inc.
[157] Yu, D. and Deng, L. (2011). Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]. IEEE Signal Processing Magazine, 28(1):145–154.
[158] Zdunek, R. (2011). Convolutive nonnegative matrix factorization with Markov random field smoothing for blind unmixing of multichannel speech recordings. In Proc. The 5th International Conference on Advances in Nonlinear Speech Processing, NOLISP'11, pages 25–32. Springer-Verlag.
[159] Zdunek, R. (2013). Improved Convolutive and Under-Determined Blind Audio Source Separation with MRF Smoothing. Cognitive Computation, 5(4):493–503.
[160] Zhang, Z.-Y. (2012). Nonnegative Matrix Factorization: Models, Algorithms and Applications. In Data Mining: Foundations and Intelligent Paradigms, volume 24, pages 99–134. Springer Berlin
Heidelberg.

LIST OF PUBLICATIONS

Hien-Thanh Thi Duong, Quoc-Cuong Nguyen, Cong-Phuong Nguyen, Thanh Huan Tran, and Ngoc Q. K. Duong (2015). Speech enhancement based on nonnegative matrix factorization with mixed group sparsity constraint. Proc. ACM International Symposium on Information and Communication Technology (SoICT 2015), pp. 247-251, Hue, Vietnam. ISBN 978-1-4503-3843-1, DOI: 10.1145/2833258.2833276.

Hien-Thanh Thi Duong, Quoc-Cuong Nguyen, Cong-Phuong Nguyen, and Ngoc Q. K. Duong (2016). Single-channel speaker-dependent speech enhancement exploiting generic noise model learned by nonnegative matrix factorization. Proc. IEEE International Conference on Electronics, Information and Communication, pp. 268-271, Danang, Vietnam. Electronic ISBN 978-1-4673-8016-4, PoD ISBN 978-1-4673-8017-1, DOI: 10.1109/ELINFOCOM.2016.7562952.

Thanh Thi Hien Duong, Nobutaka Ono, Yasutaka Nakajima, and Toshiya Ohshima (2016). Non-stationary Segment Detection Methods based on Single-basis Non-negative Matrix Factorization for Effective Annotation. Proc. IEEE Asia-Pacific Signal and Information Processing Association Annual Summit Conference (IEEE APSIPA ASC), pp. 1-6, Jeju, Korea. Electronic ISBN 978-9-8814-7682-1, PoD ISBN 978-1-5090-2401-8, DOI: 10.1109/APSIPA.2016.7820760.

Thanh Thi Hien Duong, Phuong Cong Nguyen, and Cuong Quoc Nguyen (2018). Exploiting Nonnegative Matrix Factorization with Mixed Group Sparsity Constraint to Separate Speech Signal from Single-channel Mixture with Unknown Ambient Noise. EAI Endorsed Transactions on Context-Aware Systems and Applications, vol. 18(13), pp. 1-8. ISSN 2409-0026.

Duong Thi Hien Thanh, Nguyen Cong Phuong, and Nguyen Quoc Cuong (2018). Combination of Nonnegative Matrix Factorization and mixed group sparsity constraint to exploit generic source spectral model in single-channel audio source separation. Journal of Military Science and Technology, vol. 45(4), pp. 83-94. ISSN 1859-1043 (in Vietnamese).

Thanh Thi Hien Duong, Ngoc Q. K. Duong, Phuong Cong Nguyen, and
Cuong Quoc Nguyen (2018). Multichannel source separation exploiting NMF-based generic source spectral model in Gaussian modeling framework. In Latent Variable Analysis and Signal Separation, vol. 10891, pp. 547-557. Springer International Publishing. DOI: 10.1007/978-3-319-93764-9_50 (SCOPUS).

Thanh Thi Hien Duong, Ngoc Q. K. Duong, Phuong Cong Nguyen, and Cuong Quoc Nguyen (2019). Gaussian modeling-based multichannel audio source separation exploiting generic source spectral model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27(1), pp. 32-43. ISSN 2329-9304, DOI: 10.1109/TASLP.2018.2869692 (ISI - Q1).

...
$a$: Vector
$A$: Matrix
$A^T$: Matrix transpose
$A^H$: Matrix conjugate transposition (Hermitian conjugation)
$\mathrm{diag}(a)$: Diagonal matrix with $a$ as its diagonal
$\det(A)$: Determinant of matrix $A$
$\mathrm{tr}(A)$: Matrix trace
...
directly estimating the time-frequency mask [144] or for estimating the source spectra whose ratio yields a time-frequency mask [4, 56, 132]. Time-frequency masking, as its name suggests, estimates the...
domain filter vector
$a_{ij}(f) \in \mathbb{C}$: Filter coefficient at the $f$-th frequency index
General parameters
$x(t) \in \mathbb{R}^I$: Time-domain mixture signal
$s(t) \in \mathbb{R}^J$: Time-domain source signals
$c_j(t) \in \mathbb{R}^I$: Time-domain