MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

Hanoi - 2019

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL

Major: Computer Science
Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:
Assoc. Prof. Dr. Nguyen Quoc Cuong
Dr. Nguyen Cong Phuong

Hanoi - 2019

DECLARATION OF AUTHORSHIP

I, Duong Thi Hien Thanh, hereby declare that this thesis is my original work and that it has been written by me in its entirety. I confirm that:

• This work was done wholly during candidature for a Ph.D. research degree at Hanoi University of Science and Technology.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at Hanoi University of Science and Technology or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, February 2019
Ph.D. Student: Duong Thi Hien Thanh
SUPERVISORS: Assoc. Prof. Dr. Nguyen Quoc Cuong, Dr. Nguyen Cong Phuong

ACKNOWLEDGEMENT

This thesis was written during my doctoral study at the International Research Institute Multimedia, Information, Communication, and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank the numerous people who have contributed towards shaping this thesis.

First and foremost, I would like to express my most sincere gratitude to my supervisors, Assoc. Prof. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong, for their great guidance and support throughout my Ph.D. study. I am grateful to them for devoting their precious time to discussing research ideas, proofreading, and explaining how to write good research papers. I would like to thank them for encouraging my research and empowering me to grow as a research scientist. I could not have imagined having better advisors and mentors for my Ph.D. study.

I would like to express my appreciation to my supervisor during my Master's course, Prof. Nguyen Thanh Thuy, School of Information and Communication Technology, HUST, and to Dr. Nguyen Vu Quoc Hung, my supervisor during my Bachelor's course at Hanoi National University of Education. They shaped the knowledge that allowed me to excel in my studies.

In the process of carrying out and completing my research, I received much support from the board of MICA directors and my colleagues at the Speech Communication department. In particular, I am very thankful to Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son, and Dr. Dao Trung Kien, who provided me with the opportunity to join research work at the MICA institute and gave me access to the laboratory and research facilities. Without their precious support it would have been impossible to conduct this research. My warm thanks go to my colleagues at the Speech Communication department of the MICA institute for their useful comments on my study and their unconditional support over four years, both at work and outside of work. I am very grateful to my
internship supervisor, Prof. Nobutaka Ono, and the members of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me into their lab and for the helpful research collaboration they offered. I much appreciate his help in funding my conference trip and in introducing me to the signal processing research communities. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima, MSc. Chiho Haruta, and the other researchers at Rion Co., Ltd., Japan, for welcoming me to their company and providing me with data for my experiments.

I would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, dean of the Economic Informatics Department, at Hanoi University of Mining and Geology (HUMG), where I work. I have received financial and time support from my office and leaders for completing my doctoral thesis. Grateful thanks also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The Binh, Nguyen Thuy Duong, Nong Thi Oanh, and Nguyen Thi Hai Yen, who have given me unconditional support and help over a long time. A special thanks goes to Dr. Le Hong Anh for his encouragement and precious advice.

Last but not least, I would like to express my deepest gratitude to my family. I am very grateful to my mother-in-law and father-in-law for their support in times of need and for always allowing me to focus on my work. I dedicate this thesis with special love to my mother and father; they have been great mentors in my life and have constantly encouraged me to be a better person. The struggle and sacrifice of my parents always motivate me to work hard in my studies. I would also like to express my love to my younger sisters and younger brother for their encouragement and help. This work has become more wonderful because of the love and affection that they have provided. A special love goes to my beloved husband, Tran Thanh Huan, for his patience and understanding, and for always being there for me to share the good and bad times. I also appreciate my sons Tran Tuan Quang and Tran Tuan Linh for always cheering me up with their smiles. Without their love, this thesis would not have been completed. Thank you all!
Hanoi, February 2019
Ph.D. Student
Duong Thi Hien Thanh

CONTENTS

DECLARATION OF AUTHORSHIP
ACKNOWLEDGEMENT
CONTENTS
NOTATIONS AND GLOSSARY
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION

Chapter 1. AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART
  1.1 Audio source separation: a solution for the cocktail party problem
    1.1.1 General framework for source separation
    1.1.2 Problem formulation
  1.2 State of the art
    1.2.1 Spectral models
      1.2.1.1 Gaussian Mixture Model
      1.2.1.2 Nonnegative Matrix Factorization
      1.2.1.3 Deep Neural Networks
    1.2.2 Spatial models
      1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)
      1.2.2.2 Rank-1 covariance matrix
      1.2.2.3 Full-rank spatial covariance model
  1.3 Source separation performance evaluation
    1.3.1 Energy-based criteria
    1.3.2 Perceptually-based criteria
  1.4 Summary

Chapter 2. NONNEGATIVE MATRIX FACTORIZATION
  2.1 NMF introduction
    2.1.1 NMF in a nutshell
    2.1.2 Cost function for parameter estimation
    2.1.3 Multiplicative update rules
  2.2 Application of NMF to audio source separation
    2.2.1 Audio spectra decomposition
    2.2.2 NMF-based audio source separation
  2.3 Proposed application of NMF to unusual sound detection
    2.3.1 Problem formulation
    2.3.2 Proposed methods for non-stationary frame detection
      2.3.2.1 Signal energy based method
      2.3.2.2 Global NMF-based method
      2.3.2.3 Local NMF-based method
    2.3.3 Experiment
      2.3.3.1 Dataset
      2.3.3.2 Algorithm settings and evaluation metrics
      2.3.3.3 Results and discussion
  2.4 Summary

Chapter 3. SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT
  3.1 General workflow of the proposed approach
  3.2 GSSM formulation
  3.3 Model fitting with sparsity-inducing penalties
    3.3.1 Block sparsity-inducing penalty
    3.3.2 Component sparsity-inducing penalty
    3.3.3 Proposed mixed sparsity-inducing penalty
  3.4 Derived algorithm in unsupervised case
  3.5 Derived algorithm in semi-supervised case
    3.5.1 Semi-GSSM formulation
    3.5.2 Model fitting with mixed sparsity and algorithm
  3.6 Experiment
    3.6.1 Experiment data
      3.6.1.1 Synthetic dataset
      3.6.1.2 SiSEC-MUS dataset
      3.6.1.3 SiSEC-BGN dataset
    3.6.2 Single-channel source separation performance with unsupervised setting
      3.6.2.1 Experiment settings
      3.6.2.2 Evaluation method
      3.6.2.3 Results and discussion
    3.6.3 Single-channel source separation performance with semi-supervised setting
      3.6.3.1 Experiment settings
      3.6.3.2 Evaluation method
      3.6.3.3 Results and discussion
  3.7 Summary

Chapter 4. MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK
  4.1 Formulation and modeling
    4.1.1 Local Gaussian model
    4.1.2 NMF-based source variance model
    4.1.3 Estimation of the model parameters
  4.2 Proposed GSSM-based multichannel approach
    4.2.1 GSSM construction
    4.2.2 Proposed source variance fitting criteria
      4.2.2.1 Source variance denoising
      4.2.2.2 Source variance separation
    4.2.3 Derivation of MU rule for updating the activation matrix
    4.2.4 Derived algorithm
  4.3 Experiment
    4.3.1 Dataset and parameter settings
    4.3.2 Algorithm analysis
      4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations
      4.3.2.2 Separation results with different choices of λ and γ
    4.3.3 Comparison with the state of the art
  4.4 Summary
CONCLUSIONS AND PERSPECTIVES
BIBLIOGRAPHY
LIST OF PUBLICATIONS

NOTATIONS AND GLOSSARY

Standard mathematical symbols
  C        Set of complex numbers
  R        Set of real numbers
  Z        Set of integers
  E        Expectation of a random variable
  Nc       Complex Gaussian distribution

Vectors and matrices
  a        Scalar
  a        Vector (bold lowercase)
  A        Matrix (bold uppercase)
  A^T      Matrix transpose
  A^H      Matrix conjugate transpose (Hermitian conjugation)
  diag(a)  Diagonal matrix with a as its diagonal
  det(A)   Determinant of matrix A
  tr(A)    Trace of matrix A

Recent works in audio source separation have considered two penalty functions, namely the block sparsity-inducing penalty [128] and the component sparsity-inducing penalty [8]. The former enforces the activation of relevant examples only, while omitting irrelevant ones, since their corresponding activation blocks in $\tilde{\mathbf{H}}$ will likely converge to zero. The latter, on the other hand, enforces the activation of relevant components in $\mathbf{U}$ only. It is motivated by the fact that only a part of the spectral model learned from an example may fit well with the targeted source in the mixture, while the remaining components in the model do not. Thus, instead of activating the whole block, the component sparsity-inducing penalty allows selecting only the more likely relevant spectral components from $\mathbf{U}$. Inspired by the advantages of these penalty functions, in our recent work we proposed to combine them in a more general form, as in (3.7):

$$\Omega(\tilde{\mathbf{H}}) = \gamma \sum_{p=1}^{P} \log\big(\epsilon + \|\mathbf{H}_p\|_1\big) + (1-\gamma) \sum_{k=1}^{K} \log\big(\epsilon + \|\mathbf{h}_k\|_1\big), \qquad (4.20)$$

where the first term on the right-hand side of the equation represents the block sparsity-inducing penalty, the second term represents the component sparsity-inducing penalty, and $\gamma \in [0, 1]$ weights the contribution of each term. In (4.20), $\mathbf{h}_k \in \mathbb{R}_+^{1 \times N}$ is a row (or component) of $\tilde{\mathbf{H}}$, $\mathbf{H}_p$ is a subset of $\tilde{\mathbf{H}}$ representing the activation coefficients for the $p$-th block, $P$ is the total number of blocks, $\epsilon$ is a non-zero constant, and $\|\cdot\|_1$ denotes the $\ell_1$-norm operator. In the considered setting, a block represents one training example for a source, and $P$ is the total number of used examples (i.e., $P = \sum_{j=1}^{J} L_j$). By putting (4.20) into (4.19), we now have a complete criterion for estimating the activation matrix $\tilde{\mathbf{H}}$ given $\tilde{\mathbf{V}}$ and the pre-trained spectral model $\mathbf{U}$. The derivation of the MU rule for updating $\tilde{\mathbf{H}}$ is presented in Section 4.2.3.

4.2.3 Derivation of MU rule for updating the activation matrix

Let $L(\tilde{\mathbf{H}})$ denote the minimization criterion (4.19) with the mixed sparsity constraint $\Omega(\tilde{\mathbf{H}})$ defined as in (4.20) and $D(\cdot\|\cdot)$ being the IS divergence. The partial derivative of $L(\tilde{\mathbf{H}})$ with respect to an entry $h_{kn}$ is

$$\nabla_{h_{kn}} L(\tilde{\mathbf{H}}) = \sum_{f=1}^{F} \left( \frac{u_{fk}}{[\mathbf{U}\tilde{\mathbf{H}}]_{n,f}} - \frac{u_{fk}\, v(n,f)}{[\mathbf{U}\tilde{\mathbf{H}}]^2_{n,f}} \right) + \frac{\lambda\gamma}{\epsilon + \|\mathbf{H}_p\|_1} + \frac{\lambda(1-\gamma)}{\epsilon + \|\mathbf{h}_k\|_1}. \qquad (4.21)$$

This $\nabla_{h_{kn}} L(\tilde{\mathbf{H}})$ can be written as the difference of two nonnegative parts, denoted by $\nabla^+_{h_{kn}} L(\tilde{\mathbf{H}}) \geq 0$ and $\nabla^-_{h_{kn}} L(\tilde{\mathbf{H}}) \geq 0$, respectively, as

$$\nabla_{h_{kn}} L(\tilde{\mathbf{H}}) = \nabla^+_{h_{kn}} L(\tilde{\mathbf{H}}) - \nabla^-_{h_{kn}} L(\tilde{\mathbf{H}}) \qquad (4.22)$$

with

$$\nabla^+_{h_{kn}} L(\tilde{\mathbf{H}}) \triangleq \sum_{f=1}^{F} \frac{u_{fk}}{[\mathbf{U}\tilde{\mathbf{H}}]_{n,f}} + \frac{\lambda\gamma}{\epsilon + \|\mathbf{H}_p\|_1} + \frac{\lambda(1-\gamma)}{\epsilon + \|\mathbf{h}_k\|_1}, \qquad \nabla^-_{h_{kn}} L(\tilde{\mathbf{H}}) \triangleq \sum_{f=1}^{F} \frac{u_{fk}\, v(n,f)}{[\mathbf{U}\tilde{\mathbf{H}}]^2_{n,f}}. \qquad (4.23)$$

Following a standard approach for MU rule derivation [40, 73], $h_{kn}$ is updated as

$$h_{kn} \leftarrow h_{kn} \left( \frac{\nabla^-_{h_{kn}} L(\tilde{\mathbf{H}})}{\nabla^+_{h_{kn}} L(\tilde{\mathbf{H}})} \right)^{\eta}, \qquad (4.24)$$

where $\eta = 0.5$, following the derivation in [42, 74], which was shown to produce an accelerated descent algorithm. Putting (4.23) into (4.24) and rewriting it in matrix form, we obtain the update of $\tilde{\mathbf{H}}$ as

$$\tilde{\mathbf{H}} \leftarrow \tilde{\mathbf{H}} \odot \left( \frac{\mathbf{U}^\top \big(\hat{\mathbf{V}}^{-2} \odot \tilde{\mathbf{V}}\big)}{\mathbf{U}^\top \hat{\mathbf{V}}^{-1} + \lambda\big(\gamma \mathbf{Y} + (1-\gamma)\mathbf{Z}\big)} \right)^{\frac{1}{2}}, \qquad (4.25)$$

where $\hat{\mathbf{V}} = \mathbf{U}\tilde{\mathbf{H}}$, $\mathbf{Y} = [\mathbf{Y}_1^\top, \ldots, \mathbf{Y}_P^\top]^\top$ with $\mathbf{Y}_p$, $p = 1, \ldots, P$, a uniform matrix of the same size as $\mathbf{H}_p$ whose entries are $\frac{1}{\epsilon + \|\mathbf{H}_p\|_1}$, and $\mathbf{Z} = [\mathbf{z}_1^\top, \ldots, \mathbf{z}_K^\top]^\top$ with $\mathbf{z}_k$, $k = 1, \ldots, K$, a uniform vector of the same size as $\mathbf{h}_k$ whose entries are $\frac{1}{\epsilon + \|\mathbf{h}_k\|_1}$.
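To make the update (4.25) concrete, the following is a minimal NumPy sketch of one MU iteration under the mixed group sparsity penalty. It is an illustrative rendering, not the thesis code (which was implemented in Matlab); all function and variable names are chosen here, and `blocks` is assumed to list the rows of $\tilde{\mathbf{H}}$ belonging to each training example.

```python
import numpy as np

def mu_update_mixed_sparsity(H, U, V, blocks, lam=10.0, gamma=0.2, eps=1e-9):
    """One multiplicative update of the activation matrix H-tilde, eq. (4.25).

    H      : (K, N) nonnegative activation matrix H-tilde
    U      : (F, K) fixed GSSM spectral dictionary
    V      : (F, N) nonnegative source variance matrix V-tilde to be fitted
    blocks : list of row-index arrays, one per training example (the blocks H_p)
    lam    : sparsity weight lambda; gamma trades block vs. component sparsity
    eps    : the small non-zero constant epsilon of eq. (4.20)
    """
    V_hat = np.maximum(U @ H, eps)            # current model V-hat = U H-tilde

    # Y: uniform inside each block, entries 1 / (eps + ||H_p||_1)
    Y = np.empty_like(H)
    for rows in blocks:
        Y[rows, :] = 1.0 / (eps + H[rows, :].sum())

    # Z: uniform per row, entries 1 / (eps + ||h_k||_1)
    Z = np.broadcast_to(1.0 / (eps + H.sum(axis=1, keepdims=True)), H.shape)

    num = U.T @ (V_hat ** -2 * V)             # U^T (V-hat^{-2} . V-tilde)
    den = U.T @ (V_hat ** -1) + lam * (gamma * Y + (1.0 - gamma) * Z)
    return H * np.sqrt(num / den)             # elementwise, exponent eta = 1/2
```

In the full algorithm this update is applied MU-iteration times inside each M step, with `V` taken as the intermediate source variances $\tilde{\mathbf{V}}$ estimated by the EM algorithm.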
4.2.4 Derived algorithm

Within the LGM, a generalized EM algorithm has been used to estimate the parameters $\{v_j(n,f), \mathbf{R}_j(f)\}_{j,n,f}$ by considering the set of hidden STFT coefficients of all the source images $\{\mathbf{c}_j(n,f)\}_{n,f}$ as the complete data. The hints for the GEM derivation are presented in Section 4.1.3, and more details can be found in [28, 107]. For the proposed approach, as far as the GSSM is concerned, the E step of the algorithm remains the same. In the M step, we additionally perform the optimization defined either by (4.17) (for source variance denoising) or by (4.19) (for source variance separation). This is done by the MU rules, so that the estimated intermediate source variances $v_j(n,f)$ are further updated with the supervision of the GSSM. The details of the overall proposed algorithm with source variance separation are summarized in the two-part Algorithm below. Note that this generalized EM algorithm requires the same order of computation as the existing methods [6, 107], since the sparsity constraint and the bigger GSSM size do not significantly affect the overall computational time. As an example, for separating a 10-second mixture from our experiment, both [6] and our proposed method (when non-optimally implemented in Matlab) take about 400 seconds on a laptop with an Intel Core i5 processor at 2.2 GHz and 8 GB of RAM.

Algorithm: Proposed GSSM + SV separation algorithm - Part 1

Require: Mixture signal $x(t)$; list of examples of each source in the mixture $\{s_j^l(t)\}_{j=1:J,\, l=1:L_j}$; hyper-parameters $\lambda$, $\gamma$, MU-iteration.
Ensure: Source images $\hat{c}_j(t)$ separated from $x(t)$.
- Compute the mixture STFT coefficients $\mathbf{x}(n,f) \in \mathbb{C}^{F \times N}$ and then $\hat{\Psi}_x(n,f) \in \mathbb{C}^{I \times I}$ by (4.5).
- Construct the GSSM model $\mathbf{U}_j$ by (3.2), then $\mathbf{U} \in \mathbb{R}_+^{F \times K}$ by (3.3).
- Initialize the spatial covariance matrices $\mathbf{R}_j(f)$, $\forall j, f$.
- Initialize the nonnegative time activation matrix for each source $\tilde{\mathbf{H}}_j$ randomly, then $\tilde{\mathbf{H}} = [\tilde{\mathbf{H}}_1^\top, \ldots, \tilde{\mathbf{H}}_J^\top]^\top \in \mathbb{R}_+^{K \times N}$.
- Initialize the source variances $v_j(n,f) = [\mathbf{U}_j \tilde{\mathbf{H}}_j]_{n,f}$.

Algorithm: Proposed GSSM + SV separation algorithm - Part 2

// Generalized EM algorithm for the parameter estimation:
repeat
    // E step (perform the calculation for all $j, n, f$):
    $\Sigma_j(n,f) = v_j(n,f)\,\mathbf{R}_j(f)$   // eq. (4.2)
    $\Sigma_x(n,f) = \sum_{j=1}^{J} v_j(n,f)\,\mathbf{R}_j(f)$   // eq. (4.3)
    $\mathbf{G}_j(n,f) = \Sigma_j(n,f)\,\Sigma_x^{-1}(n,f)$   // Wiener gain
    $\hat{\Sigma}_j(n,f) = \mathbf{G}_j(n,f)\,\hat{\Psi}_x(n,f)\,\mathbf{G}_j^H(n,f) + (\mathbf{I} - \mathbf{G}_j(n,f))\,\Sigma_j(n,f)$   // eq. (4.9)
    // M step: update the spatial covariance matrices and the unconstrained source spectra:
    $\mathbf{R}_j(f) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{v_j(n,f)}\,\hat{\Sigma}_j(n,f)$   // eq. (4.11)
    $v_j(n,f) = \frac{1}{I}\,\mathrm{tr}\big(\mathbf{R}_j^{-1}(f)\,\hat{\Sigma}_j(n,f)\big)$   // eq. (4.12)
    $\mathbf{V}_j = \{v_j(n,f)\}_{n,f}$,  $\tilde{\mathbf{V}} = \sum_{j=1}^{J} \mathbf{V}_j$
    // MU rules for NMF inside the M step, to further constrain the source spectra by the GSSM:
    for iter = 1, ..., MU-iteration do
        for p = 1, ..., P do:  $\mathbf{Y}_p \leftarrow \frac{1}{\epsilon + \|\mathbf{H}_p\|_1}$  end for
        $\mathbf{Y} = [\mathbf{Y}_1^\top, \ldots, \mathbf{Y}_P^\top]^\top$
        for k = 1, ..., K do:  $\mathbf{z}_k \leftarrow \frac{1}{\epsilon + \|\mathbf{h}_k\|_1}$  end for
        $\mathbf{Z} = [\mathbf{z}_1^\top, \ldots, \mathbf{z}_K^\top]^\top$
        // Update the activation matrix:
        $\hat{\mathbf{V}} = \mathbf{U}\tilde{\mathbf{H}}$
        $\tilde{\mathbf{H}} \leftarrow \tilde{\mathbf{H}} \odot \Big( \frac{\mathbf{U}^\top (\hat{\mathbf{V}}^{-2} \odot \tilde{\mathbf{V}})}{\mathbf{U}^\top \hat{\mathbf{V}}^{-1} + \lambda(\gamma\mathbf{Y} + (1-\gamma)\mathbf{Z})} \Big)^{\frac{1}{2}}$   // eq. (4.25)
    end for
    $v_j(n,f) = [\mathbf{U}_j \tilde{\mathbf{H}}_j]_{n,f}$   // update the constrained spectra
until convergence
- Source separation by multichannel Wiener filtering (4.7).
- The time-domain source images $\hat{c}_j(t)$ are obtained by the inverse STFT of $\hat{c}_j(n,f)$.
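The E and M steps of the two-part pseudocode above can be written compactly per time-frequency bin. The sketch below is again an illustrative NumPy rendering with invented names, not the thesis implementation: it computes the Wiener gain and posterior covariance of eq. (4.9) and the re-estimates of eqs. (4.11) and (4.12), assuming strictly positive variances and complex-typed covariance arrays. The GSSM-constrained MU updates of the previous sketch would then refine `v` before the next EM sweep.

```python
import numpy as np

def em_sweep(Psi_x, v, R):
    """One generalized EM sweep of the LGM parameters (E step + M step).

    Psi_x : (F, N, I, I) empirical mixture covariances Psi-hat_x(n, f)
    v     : (J, F, N)    source variances v_j(n, f)
    R     : (J, F, I, I) spatial covariance matrices R_j(f), complex dtype
    Returns the updated (v, R).
    """
    J, F, N = v.shape
    n_chan = Psi_x.shape[-1]
    eye = np.eye(n_chan)
    Sig_post = np.zeros((J, F, N, n_chan, n_chan), dtype=complex)

    # E step: posterior covariance of each source image, eq. (4.9)
    for f in range(F):
        for n in range(N):
            Sig_j = v[:, f, n, None, None] * R[:, f]       # eq. (4.2)
            Sig_x_inv = np.linalg.inv(Sig_j.sum(axis=0))   # inverse of eq. (4.3)
            for j in range(J):
                G = Sig_j[j] @ Sig_x_inv                   # Wiener gain
                Sig_post[j, f, n] = (G @ Psi_x[f, n] @ G.conj().T
                                     + (eye - G) @ Sig_j[j])

    # M step: spatial covariances, eq. (4.11), then source variances, eq. (4.12)
    for j in range(J):
        for f in range(F):
            R[j, f] = np.mean(Sig_post[j, f] / v[j, f, :, None, None], axis=0)
            R_inv = np.linalg.inv(R[j, f])
            for n in range(N):
                v[j, f, n] = np.trace(R_inv @ Sig_post[j, f, n]).real / n_chan
    return v, R
```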
4.3 Experiment

4.3.1 Dataset and parameter settings

We validated the performance of the proposed approach in an important speech enhancement use case where we already know the two types of sources in the mixture: speech and noise. For a better comparison with the state of the art, we used the benchmark development dataset (devset) of the "Two-channel mixtures of speech and real-world background noise" (BGN) task¹ within SiSEC 2016 [81]. This devset was described in Section 3.6.1.3 and is called "SiSEC-BGN-devset" in the following sections.

Our work on the single-channel case in Section 3.6 and preliminary tests on the multichannel case show that only a few examples per source can be enough to train an efficient GSSM. Thus, for training the generic speech spectral model, we took only one male voice and two female voices from SiSEC 2015². These three speech examples are also 10 seconds long. For training the generic noise spectral model, we extracted five noise examples from the Diverse Environments Multichannel Acoustic Noise Database (DEMAND)³. Again, they were 10 seconds long and contained three types of environmental noise: cafeteria, square, and metro. We performed a listening check to confirm that the examples used for the speech and noise model training are different from those in the devset, which were used for testing.

¹ https://sisec.inria.fr/sisec-2016/bgn-2016/
² https://sisec.inria.fr/sisec-2015/2015-underdetermined-speech-and-music-mixtures/
³ http://parole.loria.fr/DEMAND/

The STFT window length was 1024 for all training and test files. The number of NMF components in $\mathbf{W}_j^l$ for each speech example was set to 32, while that for each noise example was 16. These values were found to be reasonable in [9] and in our work on the single-channel case. Each $\mathbf{W}_j^l$ was obtained by optimizing (4.14) with 20 MU iterations.

Initialization of the spatial covariance matrices: As suggested in [28], we first tried to initialize the spatial covariance matrix $\mathbf{R}_j(f)$ by performing hierarchical clustering on the mixture STFT coefficients $\mathbf{x}(n,f)$. But this strategy did not give good separation performance, as the noise source in the considered mixtures is diffuse (i.e., it does not come from a single direction). Thus, we initialized the noise spatial covariance matrix based on the diffuse model, where noise is assumed to come uniformly from all spatial directions. Under this assumption, the diagonal entries of the noise spatial covariance matrix are one, and the off-diagonal entries are real-valued, computed as in [69]:

$$r_{1,2}(f) = r_{2,1}(f) = \frac{\sin(2\pi f d / v)}{2\pi f d / v}, \qquad (4.26)$$

where $d$ is the distance between the two microphones and $v = 334$ m/s is the sound velocity. The spatial covariance matrix for the speech source was initialized by the full-rank direct+diffuse model detailed in [28], where the speech's direction of arrival (DoA) was set to 90 degrees. This DoA initialization was chosen to balance the fact that the speech direction can vary between 0 and 180 degrees in each mixture, while we did not have access to the ground-truth information when performing the test.
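A sketch of the diffuse-field initialization of eq. (4.26) is given below. The sampling rate and microphone spacing are assumptions introduced for the example (neither value is stated in this section); only the 334 m/s sound velocity, the unit diagonal, and the sinc-shaped off-diagonals come from the text.

```python
import numpy as np

def diffuse_noise_covariance(n_bins, n_fft=1024, fs=16000.0, d=0.05, v=334.0):
    """Initial 2-mic noise spatial covariance under the diffuse model, eq. (4.26).

    n_bins : number of positive STFT frequency bins (n_fft // 2 + 1)
    n_fft  : STFT window length (1024 in the experiments)
    fs     : sampling rate in Hz (assumed value)
    d      : microphone spacing in metres (assumed value)
    v      : sound velocity in m/s
    Returns an (n_bins, 2, 2) array with unit diagonal and sinc off-diagonals.
    """
    f_hz = np.arange(n_bins) * fs / n_fft        # centre frequency of each bin
    x = 2.0 * np.pi * f_hz * d / v
    r12 = np.ones(n_bins)
    r12[1:] = np.sin(x[1:]) / x[1:]              # sin(x)/x, with r12 -> 1 as f -> 0
    R = np.tile(np.eye(2), (n_bins, 1, 1))
    R[:, 0, 1] = r12
    R[:, 1, 0] = r12
    return R
```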
The source separation performance of all approaches was evaluated by two sets of criteria. The four energy-based criteria are the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR), the signal-to-artifacts ratio (SAR), and the source-image-to-spatial-distortion ratio (ISR), measured in dB, where higher is better [137]. The four perceptually-motivated criteria are the overall perceptual score (OPS), the target-related perceptual score (TPS), the artifact-related perceptual score (APS), and the interference-related perceptual score (IPS) [32], where a higher score is also better. As energy-based criteria are more widely used in the source separation community, the hyper-parameters for each algorithm were chosen to maximize the SDR, the most important metric, as it reflects the overall signal distortion.

4.3.2 Algorithm analysis

4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations

We first investigate the convergence, in terms of separation performance, of the derived Algorithm by varying the number of EM and MU iterations and computing the separation results obtained on the benchmark BGN dataset. In this experiment, we set λ = 10 and γ = 0.2, as we will show in the next section that these values offer both stability and good separation performance. The speech and noise separation results, measured by the SDR, SIR, SAR, and ISR, averaged over all mixtures in the dataset and illustrated as functions of the EM and MU iterations, are shown in Fig. 4.2. As can be seen, the SDR generally increases as the number of EM and MU iterations increases. With 10 or 25 MU iterations, the algorithm converges nicely and saturates after about 10 EM iterations. The best separation performance was obtained with 10 MU iterations and 15 EM iterations. It is also interesting to see that with a small number of MU iterations, such as 1, 2, or 3, the separation results are quite poor and the algorithm is less stable, as the results vary significantly even with a large number of EM iterations. This reveals the effectiveness of the proposed NMF constraint (4.19).

Figure 4.2: Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of EM and MU iterations. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR.

4.3.2.2 Separation results with different choices of λ and γ

We further investigate the sensitivity of the proposed algorithm to the two parameters λ and γ, which determine the contribution of the sparsity penalty to the NMF constraint in (4.19). For this purpose, we varied the values of these parameters, λ ∈ {1, 10, 25, 50, 100, 200, 500} and γ ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}, and applied the corresponding source separation algorithm presented in the Algorithm to the benchmark BGN dataset. The numbers of EM and MU iterations were set to 15 and 10, respectively, as these values guarantee the algorithm's convergence, as shown in Fig. 4.2. The speech and noise separation results, measured by the SDR, SIR, SAR, and ISR, averaged over all mixtures in the dataset and represented as functions of λ and γ, are shown in Fig. 4.3. It can be seen that the proposed algorithm is less sensitive to the choice of γ and more sensitive to the choice of λ, and the separation performance greatly decreases for λ > 10. The best choices for these parameters in terms of the SDR are λ = 10 and γ = 0.2. With a small value of λ (e.g., λ = 1), varying γ does not really affect the separation performance, as the evaluation criteria are quite stable. We note that with γ = 0.2, the algorithm offers SDRs that are 0.2 dB and 1.0 dB higher than with γ = 0 and γ = 1, respectively. This confirms the effectiveness of the mixed sparsity penalty (4.20) in the multichannel setting.

Figure 4.3: Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of λ and γ. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR.
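The λ/γ sweep of this section can be reproduced with a simple grid search that scores each setting by the average speech SDR. The sketch below is only a stand-in: it uses the third-party mir_eval package as an approximate BSS Eval implementation (the official SiSEC evaluations used the BSS Eval [137] and PEASS [32] toolboxes), it scores mono reference sources rather than stereo images, and `separate_fn` is a hypothetical wrapper around the separation algorithm, not part of the thesis code.

```python
import itertools
import numpy as np
import mir_eval  # third-party reimplementation of the BSS Eval criteria

def grid_search_sdr(mixtures, references, separate_fn):
    """Choose (lambda, gamma) maximising the average speech SDR (Sec. 4.3.2.2).

    mixtures    : list of mixture signals
    references  : list of (n_src, n_samples) ground-truth source arrays,
                  with the speech source at index 0
    separate_fn : callable(mixture, lam, gamma) -> (n_src, n_samples) estimates
    """
    lambdas = [1, 10, 25, 50, 100, 200, 500]
    gammas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    best_params, best_sdr = None, -np.inf
    for lam, gamma in itertools.product(lambdas, gammas):
        sdr_speech = []
        for mix, ref in zip(mixtures, references):
            est = separate_fn(mix, lam, gamma)
            sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(ref, est)
            sdr_speech.append(sdr[0])            # index 0: the speech source
        mean_sdr = float(np.mean(sdr_speech))
        if mean_sdr > best_sdr:
            best_params, best_sdr = (lam, gamma), mean_sdr
    return best_params, best_sdr
```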
4.3.3 Comparison with the state of the art

We compare the speech separation performance obtained on the SiSEC-BGN-devset by the proposed approach, with its different optimization criteria and settings, to Arberet's algorithm [6], which is the closest method to ours.

• Arberet's method [6]: using a similar local Gaussian model, this algorithm further constrains the intermediate source variances by unsupervised NMF with criterion (4.13). The algorithm is implemented by Ozerov et al. in [107]. This method is actually the most relevant prior art to compare with, as it falls in the same LGM framework.

• GSSM + SV denoising: the proposed GSSM + full-rank spatial covariance approach, where the estimated variances of each source $\mathbf{V}_j$ are further constrained by criterion (4.17). We submitted results obtained by this method to the SiSEC 2016 BGN task and obtained the best performance over the actual test set in terms of SDR, SIR, and ISR [80].

• GSSM + SV separation (no sparsity constraint): the proposed approach with source variance separation by optimizing criterion (4.19). In order to investigate the benefit of the sparsity constraint, we further report the results obtained by this method when λ = 0.

• GSSM + SV separation (GSSM' + component sparsity): the proposed approach with source variance separation by optimizing criterion (4.19). To confirm the effectiveness of the GSSM construction by (3.2), we report the results obtained when a GSSM of the same size is learned jointly by concatenating all example spectrograms $\mathbf{S}_j^l$ as in (4.15). In this case, only the component sparsity is applied (i.e., γ = 0), as blocks do not exist.

• GSSM + SV separation: the proposed approach with source variance separation by optimizing criterion (4.19).

Furthermore, we compare the speech separation performance obtained by our approach to the state-of-the-art methods presented at the SiSEC campaigns over different years since 2013. Note that the devset of the "Two-channel mixtures of speech and real-world background noise" task within SiSEC has remained the same over the years. The results of these methods were submitted by their authors and evaluated by the SiSEC organizers [81, 99, 101]. All compared methods are summarized as follows:

• Wang's method [146] (in SiSEC 2013): this algorithm performs the well-known frequency-domain independent component analysis (ICA). The associated permutation problem is solved by a novel region-growing permutation alignment technique.

• Le Magoarou's method [83] (in SiSEC 2013): this approach uses a text transcript of the speech source in the mixture as prior information to guide the source separation process. The algorithm is based on nonnegative matrix partial co-factorization.

• Rafii's method [114] (in SiSEC 2013): this technique uses a similarity matrix to separate the repeating background from the non-repeating foreground in a mixture. The underlying assumption is that the background is dense and low-ranked, while the foreground is sparse and varied.

• Ito's method [59] (in SiSEC 2015): this is a permutation-free frequency-domain blind source separation algorithm via full-band clustering of the T-F components. The clustering is performed via MAP estimation of the parameters with an EM algorithm.

• Liu's method [81] (in SiSEC 2016): the algorithm performs Time Difference of Arrival (TDOA) clustering based on GCC-PHAT.

• Wood's method [153] (in SiSEC 2016): this recently proposed algorithm first applies NMF to the magnitude spectrograms of the mixtures, with the channels concatenated in time. Each dictionary atom is then clustered to either the speech or the noise according to its spatial origin.
The separation results obtained by the different methods for each noisy environment (Ca, Sq, Su) and the average over all mixtures are summarized in Table 4.1 and Table 4.2. The charts illustrating the average speech separation performance obtained on the devset by the proposed methods, compared to the closest existing algorithms in terms of the energy-based criteria and the perceptually-based criteria, are shown in Fig. 4.4 and Fig. 4.5, respectively. Fig. 4.6 and Fig. 4.7 are the charts illustrating the average speech separation performance of our methods compared to the other state-of-the-art methods. The boxplot illustrating the variance of the results obtained by the two proposed methods is shown in Fig. 4.8. A number of interesting findings from the experimental results are discussed in the following.

Table 4.1: Speech separation performance obtained on the SiSEC-BGN-devset - comparison with the closest baseline methods. For each environment (Ca1, Sq1, Su1) and for the average over all mixtures, the table reports the energy-based criteria (SDR, SIR, SAR, ISR, in dB) together with the corresponding perceptual scores (OPS, IPS, APS, TPS) for five methods: Arberet [6, 107], GSSM + SV denoising, GSSM + SV separation (no sparsity constraint), GSSM + SV separation (GSSM' + component sparsity), and GSSM + SV separation (λ = 10, γ = 0.2).