

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING

NMF-BASED GENERIC SOURCE SPECTRAL MODEL

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

Hanoi - 2019


MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

DUONG THI HIEN THANH

AUDIO SOURCE SEPARATION EXPLOITING

NMF-BASED GENERIC SOURCE SPECTRAL MODEL

Major: Computer Science

Code: 9480101

DOCTORAL DISSERTATION OF COMPUTER SCIENCE

SUPERVISORS:

1. Assoc. Prof. Dr. Nguyen Quoc Cuong

2. Dr. Nguyen Cong Phuong

Hanoi - 2019

DECLARATION OF AUTHORSHIP

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Hanoi, February 2019
Ph.D. Student

Duong Thi Hien Thanh

SUPERVISORS

Assoc. Prof. Dr. Nguyen Quoc Cuong          Dr. Nguyen Cong Phuong

ACKNOWLEDGEMENT

This thesis has been written during my doctoral study at the International Research Institute Multimedia, Information, Communication, and Applications (MICA), Hanoi University of Science and Technology (HUST). It is my great pleasure to thank the numerous people who have contributed towards shaping this thesis.

First and foremost, I would like to express my most sincere gratitude to my supervisors, Assoc. Prof. Nguyen Quoc Cuong and Dr. Nguyen Cong Phuong, for their great guidance and support throughout my Ph.D. study. I am grateful to them for devoting their precious time to discussing research ideas, proofreading, and explaining how to write good research papers. I would like to thank them for encouraging my research and empowering me to grow as a research scientist. I could not have imagined having a better advisor and mentor for my Ph.D. study.

I would like to express my appreciation to my Master's course supervisor, Prof. Nguyen Thanh Thuy, School of Information and Communication Technology, HUST, and to Dr. Nguyen Vu Quoc Hung, my Bachelor's course supervisor at Hanoi National University of Education. They shaped the knowledge that allowed me to excel in my studies.

In the process of carrying out and completing my research, I have received much support from the board of MICA directors and my colleagues at the Speech Communication department. In particular, I am very thankful to Prof. Pham Thi Ngoc Yen, Prof. Eric Castelli, Dr. Nguyen Viet Son, and Dr. Dao Trung Kien, who provided me with the opportunity to join research work at the MICA institute and to access the laboratory and research facilities. Without their precious support, it would have been impossible to conduct this research. My warm thanks go to my colleagues at the Speech Communication department of the MICA institute for their useful comments on my study and their unconditional support over four years, both at work and outside of work.

I am very grateful to my internship supervisor, Prof. Nobutaka Ono, and the members of Ono's Lab at the National Institute of Informatics, Japan, for warmly welcoming me into their lab and for the helpful research collaboration they offered. I much appreciate his help in funding my conference trip and introducing me to the signal processing research communities. I would also like to thank Dr. Toshiya Ohshima, MSc. Yasutaka Nakajima, MSc. Chiho Haruta, and other researchers at Rion Co., Ltd., Japan for welcoming me to their company and providing me with data for the experiments.

I would also like to sincerely thank Dr. Nguyen Quang Khanh, dean of the Information Technology Faculty, and Assoc. Prof. Le Thanh Hue, head of the Economic Informatics Department, at Hanoi University of Mining and Geology (HUMG), where I am working. I have received financial and time support from my office and leaders for completing my doctoral thesis. Grateful thanks also go to my wonderful colleagues and friends Nguyen Thu Hang, Pham Thi Nguyet, Vu Thi Kim Lien, Vo Thi Thu Trang, Pham Quang Hien, Nguyen The Binh, Nguyen Thuy Duong, Nong Thi Oanh, and Nguyen Thi Hai Yen, who have given me unconditional support and help over a long time. Special thanks go to Dr. Le Hong Anh for his encouragement and precious advice.

Last but not least, I would like to express my deepest gratitude to my family. I am very grateful to my mother-in-law and father-in-law for their support in times of need and for always allowing me to focus on my work. I dedicate this thesis to my mother and father with special love; they have been great mentors in my life and have constantly encouraged me to be a better person. The struggle and sacrifice of my parents always motivate me to work hard in my studies. I would also like to express my love to my younger sisters and younger brother for their encouragement and help. This work has become more wonderful because of the love and affection they have provided.

A special love goes to my beloved husband, Tran Thanh Huan, for his patience and understanding, and for always being there for me to share the good and bad times. I also thank my sons, Tran Tuan Quang and Tran Tuan Linh, for always cheering me up with their smiles. Without their love, this thesis would not have been completed. Thank you all!

Hanoi, February 2019
Ph.D. Student
Duong Thi Hien Thanh

CONTENTS

DECLARATION OF AUTHORSHIP i

ACKNOWLEDGEMENT ii

CONTENTS iv

NOTATIONS AND GLOSSARY viii

LIST OF TABLES xi

LIST OF FIGURES xii

INTRODUCTION 1

Chapter 1 AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART 10

1.1 Audio source separation: a solution for the cocktail party problem 10

1.1.1 General framework for source separation 10

1.1.2 Problem formulation 11

1.2 State of the art 13

1.2.1 Spectral models 13

1.2.1.1 Gaussian Mixture Model 14

1.2.1.2 Nonnegative Matrix Factorization 15

1.2.1.3 Deep Neural Networks 16

1.2.2 Spatial models 18

1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD) 18

1.2.2.2 Rank-1 covariance matrix 19

1.2.2.3 Full-rank spatial covariance model 20

1.3 Source separation performance evaluation 21

1.3.1 Energy-based criteria 22

1.3.2 Perceptually-based criteria 23

1.4 Summary 23

Chapter 2 NONNEGATIVE MATRIX FACTORIZATION 24

2.1 NMF introduction 24


2.1.1 NMF in a nutshell 24

2.1.2 Cost function for parameter estimation 26

2.1.3 Multiplicative update rules 27

2.2 Application of NMF to audio source separation 29

2.2.1 Audio spectra decomposition 29

2.2.2 NMF-based audio source separation 30

2.3 Proposed application of NMF to unusual sound detection 32

2.3.1 Problem formulation 33

2.3.2 Proposed methods for non-stationary frame detection 34

2.3.2.1 Signal energy based method 34

2.3.2.2 Global NMF-based method 35

2.3.2.3 Local NMF-based method 35

2.3.3 Experiment 37

2.3.3.1 Dataset 37

2.3.3.2 Algorithm settings and evaluation metrics 37

2.3.3.3 Results and discussion 38

2.4 Summary 43

Chapter 3 SINGLE-CHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GENERIC SOURCE SPECTRAL MODEL WITH MIXED GROUP SPARSITY CONSTRAINT 44

3.1 General workflow of the proposed approach 44

3.2 GSSM formulation 46

3.3 Model fitting with sparsity-inducing penalties 46

3.3.1 Block sparsity-inducing penalty 47

3.3.2 Component sparsity-inducing penalty 48

3.3.3 Proposed mixed sparsity-inducing penalty 49

3.4 Derived algorithm in unsupervised case 49

3.5 Derived algorithm in semi-supervised case 52

3.5.1 Semi-GSSM formulation 52

3.5.2 Model fitting with mixed sparsity and algorithm 54

3.6 Experiment 54

3.6.1 Experiment data 54

3.6.1.1 Synthetic dataset 55


3.6.1.2 SiSEC-MUS dataset 55

3.6.1.3 SiSEC-BGN dataset 56

3.6.2 Single-channel source separation performance with unsupervised setting 57

3.6.2.1 Experiment settings 57

3.6.2.2 Evaluation method 57

3.6.2.3 Results and discussion 61

3.6.3 Single-channel source separation performance with semi-supervised setting 65

3.6.3.1 Experiment settings 65

3.6.3.2 Evaluation method 65

3.6.3.3 Results and discussion 65

3.7 Summary 66

Chapter 4 MULTICHANNEL AUDIO SOURCE SEPARATION EXPLOITING NMF-BASED GSSM IN GAUSSIAN MODELING FRAMEWORK 68

4.1 Formulation and modeling 68

4.1.1 Local Gaussian model 68

4.1.2 NMF-based source variance model 70

4.1.3 Estimation of the model parameters 71

4.2 Proposed GSSM-based multichannel approach 72

4.2.1 GSSM construction 72

4.2.2 Proposed source variance fitting criteria 73

4.2.2.1 Source variance denoising 73

4.2.2.2 Source variance separation 74

4.2.3 Derivation of MU rule for updating the activation matrix 75

4.2.4 Derived algorithm 77

4.3 Experiment 79

4.3.1 Dataset and parameter settings 79

4.3.2 Algorithm analysis 80

4.3.2.1 Algorithm convergence: separation results as functions of EM and MU iterations 80

4.3.2.2 Separation results with different choices of λ and γ 81

4.3.3 Comparison with the state of the art 82


4.4 Summary 91

CONCLUSIONS AND PERSPECTIVES 93

BIBLIOGRAPHY 96

LIST OF PUBLICATIONS 113


NOTATIONS AND GLOSSARY

Standard mathematical symbols

C Set of complex numbers

R Set of real numbers

Z Set of integers

E Expectation of a random variable

Nc Complex Gaussian distribution

Vectors and matrices

AT Matrix transpose

AH Matrix conjugate transposition (Hermitian conjugation)

diag(a) Diagonal matrix with a as its diagonal

det(A) Determinant of matrix A

tr(A) Matrix trace

A ⊙ B Element-wise (Hadamard) product of two matrices of the same dimension, with elements [A ⊙ B]ij = Aij Bij

A^{.(n)} Element-wise n-th power of A, with entries Aij^n

‖a‖₁ ℓ1-norm of vector a

‖A‖₁ ℓ1-norm of matrix A

Indices

f Frequency index

i Channel index

j Source index

n Time frame index

t Time sample index


I Number of channels

J Number of sources

L STFT filter length

F Number of frequency bins

N Number of time frames

K Number of spectral basis components

Mixing filters

A ∈ R^{I×J×L} Matrix of filters

aj(τ) ∈ R^I Mixing filter from the j-th source to all microphones, τ being the time delay

aij(t) ∈ R Filter coefficient at the t-th time index

aij ∈ R^L Time-domain filter vector

âij ∈ C^L Frequency-domain filter vector

âij(f) ∈ C Filter coefficient at the f-th frequency index

General parameters

x(t) ∈ R^I Time-domain mixture signal

s(t) ∈ R^J Time-domain source signals

cj(t) ∈ R^I Time-domain j-th source image

sj(t) ∈ R Time-domain j-th original source signal

x(n, f) ∈ C^I Time-frequency domain mixture signal

s(n, f) ∈ C^J Time-frequency domain source signals

cj(n, f) ∈ C^I Time-frequency domain j-th source image

vj(n, f) ∈ R Time-dependent variance of the j-th source

Rj(f) ∈ C^{I×I} Time-independent spatial covariance matrix of the j-th source

Σj(n, f) ∈ C^{I×I} Covariance matrix of the j-th source image

Σ̂x(n, f) ∈ C^{I×I} Empirical mixture covariance matrix

V ∈ R₊^{F×N} Power spectrogram matrix

W ∈ R₊^{F×K} Spectral basis matrix

H ∈ R₊^{K×N} Time activation matrix

U ∈ R₊^{F×K} Generic source spectral model


APS Artifacts-related Perceptual Score

BSS Blind Source Separation

DoA Direction of Arrival

EM Expectation Maximization

ICA Independent Component Analysis

IPS Interference-related Perceptual Score

ISR source Image to Spatial distortion Ratio

ISTFT Inverse Short-Time Fourier Transform

IID Interchannel Intensity Difference

ITD Interchannel Time Difference

GCC-PHAT Generalized Cross Correlation with Phase Transform

GMM Gaussian Mixture Model

GSSM Generic Source Spectral Model

LGM Local Gaussian Model

MU Multiplicative Update

NMF Non-negative Matrix Factorization

OPS Overall Perceptual Score

PLCA Probabilistic Latent Component Analysis

SAR Signal to Artifacts Ratio

SDR Signal to Distortion Ratio

SIR Signal to Interference Ratio

SiSEC Signal Separation Evaluation Campaign

SNMF Spectral Non-negative Matrix Factorization

SNR Signal to Noise Ratio

STFT Short-Time Fourier Transform

TDOA Time Difference of Arrival

TPS Target-related Perceptual Score


LIST OF TABLES

2.1 Total number of different events detected from three recordings in spring 40

2.2 Total number of different events detected from three recordings in summer 41

2.3 Total number of different events detected from three recordings in winter 42

3.1 List of song snippets in the SiSEC-MUS dataset 56

3.2 Source separation performance obtained on the Synthetic and SiSEC-MUS datasets with unsupervised setting 59

3.3 Speech separation performance obtained on the SiSEC-BGN. ∗ indicates submissions by the authors and "-" indicates missing information [81, 98, 100] 60

3.4 Speech separation performance obtained on the Synthetic dataset with semi-supervised setting 66

4.1 Speech separation performance obtained on the SiSEC-BGN dev set - comparison with closest baseline methods 85

4.2 Speech separation performance obtained on the SiSEC-BGN dev set - comparison with state-of-the-art methods in SiSEC. ∗ indicates submissions by the authors and "-" indicates missing information 86

4.3 Speech separation performance obtained on the test set of the SiSEC-BGN. ∗ indicates submissions by the authors [81] 91


LIST OF FIGURES

1 A cocktail party effect 2

2 Audio source separation 3

3 Live recording environments 4

1.1 Source separation general framework 11

1.2 Audio source separation: a solution for the cocktail party problem 13

1.3 IID corresponding to two sources in an anechoic environment 19

2.1 Decomposition model of NMF [36] 25

2.2 Spectral decomposition model based on NMF (K = 2) [66] 29

2.3 General workflow of supervised NMF-based audio source separation 30

2.4 Image of overlapping blocks 34

2.5 General workflow of the NMF-based non-stationary segment extraction 35

2.6 Number of different events detected by the methods from (a) the recordings in spring, (b) the recordings in summer, and (c) the recordings in winter 39

3.1 Proposed weakly-informed single-channel source separation approach 45

3.2 Generic source spectral model (GSSM) construction 47

3.3 Estimated activation matrix H: (a) without a sparsity constraint, (b) with a block sparsity-inducing penalty (3.5), (c) with a component sparsity-inducing penalty (3.6), and (d) with the proposed mixed sparsity-inducing penalty (3.7) 48

3.4 Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of MU iterations 61

3.5 Average separation performance obtained by the proposed method with unsupervised setting over the Synthetic dataset as a function of λ and γ 62

3.6 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods over the dev set in SiSEC-BGN 63

3.7 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods over the test set in SiSEC-BGN 63


4.1 General workflow of the proposed source separation approach. The top green dashed box describes the training phase for the GSSM construction. Bottom blue boxes indicate processing steps for source separation. Green dashed boxes indicate the novelty compared to the existing works [6, 38, 107] 73

4.2 Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of EM and MU iterations. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR 81

4.3 Average separation performance obtained by the proposed method over stereo mixtures of speech and noise as functions of λ and γ. (a): speech SDR, (b): speech SIR, (c): speech SAR, (d): speech ISR, (e): noise SDR, (f): noise SIR, (g): noise SAR, (h): noise ISR 82

4.4 Average speech separation performance obtained by the proposed methods and the closest existing algorithms in terms of the energy-based criteria 88

4.5 Average speech separation performance obtained by the proposed methods and the closest existing algorithms in terms of the perceptually-based criteria 88

4.6 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods in terms of the energy-based criteria 89

4.7 Average speech separation performance obtained by the proposed methods and the state-of-the-art methods in terms of the perceptually-based criteria 89

4.8 Boxplot of the speech separation performance obtained by the proposed "GSSM + SV denoising" (P1) and "GSSM + SV separation" (P2) methods 90

INTRODUCTION

In this part, we introduce the motivation and the problem that we focus on throughout this thesis. Then we present the objectives and the scope of our work. In addition, our contributions are summarized in order to give a clear view of the achievements. Finally, the structure of the thesis is presented chapter by chapter.

1 Background and Motivation

1.1 Cocktail party problem

Real-world sound scenes are usually very complicated, as they are mixtures of many different sound sources. Fig. 1 depicts the scenario of a typical cocktail party, where many people are attending, many conversations are going on simultaneously, and there are various disturbances such as loud music, shouting, and a lot of hustle and bustle. Similar situations also occur in daily life, for example in outdoor recordings, where there is interference from a variety of environmental sounds, or in a music concert, where a number of musical instruments are played and the audience listens to the collective sound. In such settings, what the ears actually hear is a mixture of various sounds generated by various audio sources. The mixing process can include many sound reflections from walls and ceilings, which is known as reverberation. Humans with normal hearing are generally able to locate, identify, and differentiate sound sources heard simultaneously so as to understand the conveyed information. However, this task remains extremely challenging for machines, especially in highly noisy and reverberant environments. The cocktail party effect described above hinders both humans and machines in perceiving target sound sources [2, 12, 145], and the creation of machine listening algorithms that can automatically separate sound sources in difficult mixing conditions remains an open problem.

Audio source separation aims at providing machine listeners with a function similar to that of human ears, by separating and extracting the signals of individual sources from a given mixture. This technique is formally termed blind source separation (BSS) when no prior information about either the sources or the mixing condition is available, and is depicted in Fig. 2. Audio source separation is also known in the audio signal processing community as an effective solution to the cocktail party problem [85, 90, 138, 143, 152]. Depending on the specific application, some source separation approaches focus on speech separation, in which the speech signal is extracted from a mixture containing background noise and other unwanted sounds. Other methods deal with music separation, in which the singing voice and certain instruments are recovered from a mixture or song containing multiple musical instruments. The separated source signals may be either listened to or further processed, giving rise to many potential applications. Speech separation is mainly used for speech enhancement in hearing aids, hands-free phones, or automatic speech recognition (ASR) in adverse conditions [11, 47, 64, 116, 129], while music separation has many interesting applications, including music post-production editing/remixing, up-mixing, music information retrieval, rendering of stereo recordings, and karaoke [37, 51, 106, 110].

Figure 1: A cocktail party effect¹.

Over the last couple of decades, efforts have been undertaken by the scientific community, from various backgrounds such as signal processing, mathematics, statistics, neural networks, and machine learning, to build audio source separation systems, as described in [14, 15, 22, 43, 85, 105, 125].

¹ Some icons of Fig. 1 are from: http://clipartix.com/.


Figure 2: Audio source separation.

The audio source separation problem has been studied at various levels of complexity, and different approaches and systems have been developed. Despite these numerous efforts, the problem is not yet completely solved, as the obtained separation results are still far from perfect, especially in challenging conditions such as moving sound sources and high reverberation.

1.2 Basic notations and target challenges

• Overdetermined, determined, and underdetermined mixture

There are three different settings in audio source separation according to the relationship between the number of sources J and the number of microphones I. In case the number of microphones is larger than the number of sources, J < I, there are more observed variables than unknowns, and the problem is referred to as the overdetermined case. If J = I, we have as many observed variables as unknowns, and this is the determined case. The more difficult source separation case is when the number of unknowns exceeds the number of observed variables, J > I, which is called the underdetermined case.

Furthermore, if I = 1 it is a single-channel case; if I > 1 it is a multi-channel case.

• Instantaneous, anechoic, and reverberant mixing environment

Apart from the mixture settings based on the relationship between the number of sources and the number of microphones, audio source separation algorithms can also be distinguished by the target mixing condition they deal with.


The simplest case deals with instantaneous mixtures, such as certain music mixtures generated by amplitude panning. In this case there is no time delay: the mixture at a given time is essentially a weighted sum of the source signals at the same time instant. There are two other typical types of live recording environments, anechoic and reverberant, as shown in Fig. 3. In anechoic environments, such as a studio or outdoors, the microphones capture only the direct sound propagation from each source. In reverberant environments, such as real meeting rooms or chambers, the microphones capture not only the direct sound but also many sound reflections from walls, ceilings, and floors. Modeling the reverberant environment is much more difficult than the instantaneous and anechoic cases.

Figure 3: Live recording environments².

State-of-the-art audio source separation algorithms perform quite well in instantaneous or noiseless anechoic conditions, but remain far from perfect as the amount of reverberation grows. These numerical performance results are clearly shown in the recent community-based Signal Separation Evaluation Campaigns (SiSEC) [5, 99, 101, 133, 134] and others [65, 135]. This shows that addressing the separation of reverberant mixtures, a common case in real-world recording applications, remains one of the key scientific challenges in the source separation community. Moreover, when the desired sound is corrupted by high-level background noise, i.e., the Signal-to-Noise Ratio (SNR) is 0 dB or lower, the separation performance is even lower.

² Some icons of Fig. 3 are from: http://clipartix.com/.


To improve separation performance, informed approaches have been proposed and have emerged in the literature over the last decade [78, 136]. Such approaches exploit side information about one or all of the sources themselves, or about the mixing condition, in order to guide the separation process. Examples of investigated side information include deformed or hummed references of one or more sources in a given mixture [123, 126], text associated with spoken speech [83], scores associated with musical sources [37, 51], and motion associated with audio-visual objects in a video [110]. Following this trend, our research focuses on a weakly-informed strategy to target the determined/underdetermined and highly reverberant audio source separation challenge. We use very abstract semantic information, namely the types of audio sources present in the mixture, to guide the separation process.

2 Objective and scope

2.1 Objective

The main objective of the thesis is to investigate and develop efficient audio source separation algorithms that can deal with determined/underdetermined mixtures and high reverberation in real-world recording conditions.

In order to do that, we start by studying state-of-the-art approaches so as to select one of the most well-established frameworks that can deal with the targeted challenges. We then develop novel algorithms grounded in this modeling framework, i.e., the Local Gaussian Model (LGM) with Nonnegative Matrix Factorization (NMF) as the spectral model, for both the single-channel and multi-channel cases. In our proposed approach, we exploit information only about the types of audio sources in the mixture to guide the separation process. For instance, in a speech enhancement application, we know that one source in a noisy recording should be speech and the other background noise. We further investigate the algorithms' convergence as well as their sensitivity to the parameter settings, in order to provide guidance for practical parameter choices where applicable.

For evaluation, both speech and music separation are considered. We consider speech separation for the speech enhancement task, and both singing voice and musical instrument separation for the music task. In order to fairly compare the obtained separation results with other existing methods, we use a benchmark dataset in addition to our own synthetic dataset. This well-designed benchmark dataset is from the Signal Separation Evaluation Campaign (SiSEC³), covering the speech and real-world background noise separation task and the music separation task. Using these datasets allows us to join our research community's activities. In particular, we target participation in the SiSEC challenge so as to bring our developed algorithms to the international research community.

2.2 Scope

In our study, we aim to recover the original sources (in the single-channel setting) or the spatial images of the sources (in the multi-channel setting) from the observed audio mixture. The source spatial images are the contributions of those sources to the mixture signal. For example, for speech recordings in real-world environments, the spatial images are the speech signals recorded at the microphones after propagating from the speaker to the microphones.

Furthermore, as we focus on weakly-informed source separation, we assume that the number of sources and the types of sources are known a priori. For instance, the mixture is composed of speech and noise in the speech separation context, or of vocals and musical instruments in the music separation context.

3 Contributions

Aiming to tackle real-world recordings with the challenging settings mentioned earlier, we have proposed novel separation algorithms for both the single-channel and multi-channel cases. The achieved results have been described in seven publications. The results of our algorithms were also submitted to the international source separation campaign SiSEC 2016⁴ [81] and obtained the best performance in terms of energy-based criteria. More specifically, the main contributions are as follows:

• We have proposed a novel single-channel audio source separation algorithm weakly guided by source examples. This algorithm exploits a generic source spectral model (GSSM), which represents the spectral characteristics of audio sources, to guide the separation process. With it, a new sparsity-inducing penalty for the cost function has also been proposed.

³ http://sisec.inria.fr/

⁴ http://sisec.inria.fr/sisec-2016/


We have validated the speech separation performance of the proposed algorithm in both unsupervised and semi-supervised settings. We have also analyzed the algorithm's convergence as well as its stability with respect to the parameter settings.

These contributions were published in four scientific papers (papers 1, 2, 4, and 5 in the "List of publications").

• A novel multi-channel audio source separation algorithm weakly guided by source examples has been proposed. This algorithm exploits a generic source spectral model learned by NMF within the well-established local Gaussian modeling framework. We have proposed two new optimization criteria: the first constrains the variances of each source by NMF, while the second constrains the total variance of all sources together. The corresponding EM algorithms for parameter estimation have also been derived. We have investigated the sensitivity of the proposed algorithm to the parameters, as well as its convergence, in order to guide parameter settings in practical implementations.

As another important contribution, we participated in the SiSEC challenges so that our proposed approach would be visible to the international research community. Evaluated fairly by the SiSEC organizers, our proposed algorithm obtained the best source separation results in terms of the energy-based criteria in SiSEC 2016.

These achievements were described in two papers (papers 6 and 7 in the "List of publications").

• In addition to the two main contributions mentioned above, by studying the NMF model and its applications in the acoustic processing field, we have proposed novel unsupervised methods for automatically detecting non-stationary segments in single-channel real-world recordings. These methods aim at effective acoustic-event annotation. They were proposed during my research internship at Ono's Lab, National Institute of Informatics, Japan, and transferred to the RION company in Japan for potential use.

This work was published in paper 3 in the "List of publications".

4 Structure of thesis

The work presented in this thesis is structured in four chapters as follows:


• Chapter 1: Audio source separation: Formulation and State of the art

We introduce the general framework and the mathematical formulation of the considered audio source separation problem, as well as the notations used in this thesis. This is followed by an overview of state-of-the-art audio source separation methods, which exploit different spectral and spatial models. The two families of criteria used for source separation performance evaluation are also presented in this chapter.

• Chapter 2: Nonnegative matrix factorization

This chapter first introduces NMF, which has received a lot of attention in the audio processing community. It is followed by a baseline supervised algorithm based on the NMF model, aiming to separate audio sources from the observed mixture. At the end of this chapter, we propose novel methods for automatically detecting non-stationary segments using NMF for effective sound annotation.

• Chapter 3: Proposed single-channel audio source separation approach

We present the proposed weakly-informed method for single-channel audio source separation, targeting both unsupervised and semi-supervised settings. The algorithm is based on NMF with mixed sparsity constraints. In this method, the generic spectral characteristics of the sources are first learned from several training signals by NMF. They are then used to guide the factorization of the observed power spectrogram into each source. We also propose to combine two existing group sparsity-inducing penalties in the optimization process and adapt the corresponding algorithm for parameter estimation based on multiplicative update (MU) rules. The last section of this chapter is devoted to the experimental evaluation. We show the effectiveness of the proposed approach in both unsupervised and semi-supervised settings.

• Chapter 4: Proposed multichannel audio source separation approach

This chapter is a significant extension of the work in Chapter 3 to the multi-channel case. We describe a novel multichannel audio source separation algorithm weakly guided by source examples, where the NMF-based GSSM is combined with the full-rank spatial covariance model in a Gaussian modeling paradigm. We then present the generalized expectation-maximization (EM) algorithm for parameter estimation. In particular, for guiding the estimation of the intermediate source variances in each EM iteration, we investigate the use of two criteria: (1) the estimated variances of each source are constrained by NMF, and (2) the total variances of all sources are constrained by NMF altogether. In the experiments, the separation performance obtained by the proposed algorithms is analyzed and compared with state-of-the-art and baseline algorithms. Moreover, the sensitivity of the proposed algorithms to parameter settings, as well as their convergence, is also addressed in this chapter.

In the last part of the thesis, we present the conclusions and perspectives for future research directions.


CHAPTER 1 AUDIO SOURCE SEPARATION: FORMULATION AND STATE OF THE ART

In this chapter, we introduce the audio source separation technique as a solution to the cocktail party problem. After briefly describing the general audio source separation framework, we present the basic settings regarding the mixing conditions and recording environments. Then the state-of-the-art models exploiting spectral cues as well as spatial cues for the separation process are summarized. Finally, we introduce the two families of criteria that are used for source separation performance evaluation.

1.1 Audio source separation: a solution for the cocktail party problem

Audio source separation is the signal processing task that consists in recovering the constitutive sounds, called sources, of an observed mixture, which can be single-channel or multichannel [43, 78, 85, 90, 105]. This separation needs a system able to perform many processes, such as estimating the number of sources, estimating the required number of frequency bases and convolutive parameters to be assigned to each source, applying separation algorithms, and reconstructing the sources [6, 25, 28, 102, 111, 121, 158, 159]. Two types of cues can be exploited for the separation process: spectral cues and spatial cues. Spectral cues describe the spectral structures of the sources, while spatial cues carry information about the source spatial positions [22, 85, 97]. They are discussed in more detail in Sections 1.2.1 and 1.2.2, respectively. Spectral cues alone cannot distinguish sources with similar pitch range and timbre, while spatial cues alone may not be sufficient to distinguish sources coming from nearby directions, so most existing systems exploit both types of cues.

In general, the source separation algorithm operates in the time-frequency domain after the short-time Fourier transform (STFT) and consists of two modeling components, as in Fig. 1.1: (1) a spectral model exploiting the spectral characteristics of the sources, and (2) a spatial model exploiting spatial information. Finally, the estimated time-domain source signals are obtained via the inverse short-time Fourier transform (ISTFT).

Figure 1.1: Source separation general framework

1.1.2 Problem formulation

Multichannel audio mixtures are the type of recordings obtained when microphone arrays are employed [14, 22, 85, 90, 92]. Let us formulate the multichannel mixture signal, where J sources are observed by an array of I microphones, with indices j ∈ {1, 2, ..., J} and i ∈ {1, 2, ..., I} indicating a specific source j and channel i. The mixture signal is denoted by x(t) = [x1(t), ..., xI(t)]^T ∈ R^{I×1} and is the sum of the contributions of all sources [85]:

$$\mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t), \quad (1.1)$$

where cj(t) ∈ R^I denotes the spatial image of the j-th source.

From a physical viewpoint, sound sources are typically divided into two types: point sources and diffuse sources. A point source is one in which sound emits from a single point in space, e.g., a non-moving human speaker, a water drop, or a singer singing alone. A diffuse source is one in which sound comes from a region of space, e.g., water drops in the rain, or singers singing in a choir. Diffuse sources can be considered as collections of point sources [85, 141]. In the case where the j-th source is a point source, its spatial image cj(t) is written as the convolution [85]

$$\mathbf{c}_j(t) = \sum_{\tau} \mathbf{a}_j(\tau)\, s_j(t - \tau), \quad (1.2)$$

where aj(τ) ∈ R^I denotes the mixing filters from the j-th source to the I microphones and sj(t) ∈ R is the single-channel source signal.
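To make this mixing model concrete, the following minimal Python sketch, which we add purely for illustration (the sampling rate, filter length, and exponentially decaying random filters are our own assumptions, not taken from the thesis), simulates a two-channel convolutive mixture of two point sources according to (1.1) and (1.2):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, L = 2, 2, 64           # channels, sources, mixing-filter length
T = 16000                    # signal length in samples (1 s at 16 kHz)

# Toy source signals s_j(t): noise-like stand-ins for real audio
s = rng.standard_normal((J, T))

# Random FIR mixing filters a_ij(tau) playing the role of A in (1.2),
# with an exponential decay mimicking a room impulse response tail
A = rng.standard_normal((I, J, L)) * np.exp(-np.arange(L) / 10.0)

# Spatial images c_j(t) per (1.2) and mixture x(t) per (1.1)
c = np.zeros((J, I, T))
for j in range(J):
    for i in range(I):
        c[j, i] = np.convolve(s[j], A[i, j])[:T]
x = c.sum(axis=0)            # x(t) = sum_j c_j(t), shape (I, T)
print(x.shape)
```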

Audio source separation systems often operate in the time-frequency (T-F) domain, in which the temporal and spectral characteristics of audio can be jointly represented. The most commonly used time-frequency representation is the short-time Fourier transform (STFT) [3, 125]. STFT analysis computes the time-frequency representation from the time-domain waveform by creating overlapping frames along the waveform and applying the Fourier transform to each frame.

Switching to the T-F domain, equation (1.1) can be written as

$$\mathbf{x}(n, f) = \sum_{j=1}^{J} \mathbf{c}_j(n, f), \quad (1.3)$$

where n = 1, 2, ..., N denotes the time frame index and f = 1, 2, ..., F the frequency bin index.
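Continuing the illustrative sketch above, the mixture can be moved to the T-F domain with an off-the-shelf STFT; the window length and overlap below are arbitrary example choices, not values prescribed by the thesis:

```python
from scipy.signal import stft, istft

# x from the previous sketch has shape (I, T); stft transforms along
# the last axis, yielding the tensor x(n, f) of (1.3)
f_axis, n_axis, X = stft(x, fs=16000, nperseg=1024, noverlap=768)
print(X.shape)   # (I, F, N): channels x frequency bins x time frames

# Inverse STFT recovers the time-domain signal (up to edge effects)
_, x_rec = istft(X, fs=16000, nperseg=1024, noverlap=768)
```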

A common assumption in array signal processing is the narrowband assumption on the source signal [118]. Under the narrowband assumption, the convolutive mixing model (1.2) may be approximated by a complex-valued multiplication in each time-frequency bin (n, f), given by

$$\mathbf{c}_j(n, f) \approx \mathbf{a}_j(f)\, s_j(n, f), \quad (1.4)$$

where cj(n, f) and sj(n, f) are the STFT coefficients of cj(t) and sj(t), respectively, and aj(f) is the Fourier transform of aj(τ).

Figure 1.2: Audio source separation: a solution for the cocktail party problem.

Source separation consists in recovering either the J original source signals sj(t) or their spatial images cj(t) given the I-channel mixture signal x(t). The objective of our research, as mentioned previously, is to recover the spatial images cj(t) of the sources from the observed mixture, as shown in Fig. 1.2. Note that in our study, background noise is also considered as a source. This definition applies to both point sources and diffuse sources, in both live recordings and artificially-mixed recordings.

1.2 State of the art

As discussed in Section 1.1.1, a standard architecture for a source separation system includes two models: the spectral model formulates the spectral characteristics of the sources, and the spatial model exploits the spatial information of the sources. An advantage of this architecture is its modularity: any mixing filter estimation technique can be combined with any spectral source estimation technique. Besides, some source separation approaches can also recover the sources by directly exploiting either the source spectra or the mixing filters. The whole BSS picture built over more than two decades of research is very large, consisting of many different techniques and requiring an extensive survey, see e.g. [22, 54, 85, 112, 138, 141]. In this section, we limit our discussion to some popular spectral and spatial models. They are combined or used individually in state-of-the-art algorithms in different ways.

1.2.1 Spectral models

This section reviews three typical source spectral models that have been studied extensively in the literature: the spectral Gaussian Mixture Model (spectral GMM), spectral Nonnegative Matrix Factorization (spectral NMF), and Deep Neural Networks (DNNs).


1.2.1.1 Gaussian Mixture Model

We start with the principles of the Gaussian model-based approaches, known as spectral GMM [7, 77, 106, 113], where the redundancy and structure of each audio source can be exploited for audio source separation.

The short-time Fourier spectrum of the j-th source is the column vector composed of all elements sj(n, f), with f = 1, ..., F, i.e., sj(n) = [sj(n, f)]_f. The spectral GMM approach models sj(n) as a multidimensional zero-mean complex-valued K-state Gaussian mixture with probability density function (pdf) given by [7, 106]

$$p(\mathbf{s}_j(n)) = \sum_{k=1}^{K} \pi_{jk}\, \mathcal{N}_c\big(\mathbf{s}_j(n);\, \mathbf{0},\, \mathrm{diag}([v_{jk}(f)]_f)\big),$$

where πjk are the state weights and vjk(f) the state-dependent spectral variances.

The spectral GMM defines K × F free variances vjk(f) and exploits the global structure of the sources to estimate them. However, the GMM does not explicitly model the amplitude variation of sound sources, so signals having similar spectral shape but different amplitude levels may result in different estimated spectral variance templates [vjk(f)]_f. To overcome this issue, another version of the GMM, called the spectral Gaussian Scaled Mixture Model (spectral GSMM), was proposed in 2006 [13]. In the spectral GSMM, a time-varying scaling parameter gjk(n) is incorporated into each spectral GMM state. The pdf of the GSMM is then written as [13]

$$p(\mathbf{s}_j(n)) = \sum_{k=1}^{K} \pi_{jk}\, \mathcal{N}_c\big(\mathbf{s}_j(n);\, \mathbf{0},\, g_{jk}(n)\, \mathrm{diag}([v_{jk}(f)]_f)\big).$$


Spectral GMM and spectral GSMM were applied to single-channel audio source separation [13, 16], and to stereo separation of moving sources [95]. The GMM was also considered for multichannel instantaneous music mixtures [7], where the spectral GMMs are learnt from the mixture signals.
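For illustration only (the weights πjk and variances vjk(f) below are random placeholders, and the mixture form follows the pdf given above), the log-likelihood of one complex STFT frame under such a spectral GMM could be evaluated as:

```python
import numpy as np

def spectral_gmm_loglik(s_frame, pi, v):
    """Log-likelihood of one complex STFT frame s_frame (shape (F,))
    under a zero-mean complex GMM with weights pi (K,) and
    state variances v (K, F)."""
    # log N_c(s; 0, diag(v_k)) = sum_f [-log(pi*v_kf) - |s_f|^2 / v_kf]
    log_comp = -np.sum(np.log(np.pi * v) + (np.abs(s_frame) ** 2) / v, axis=1)
    log_comp += np.log(pi)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # stable log-sum-exp

F, K = 513, 8
rng = np.random.default_rng(1)
v = rng.random((K, F)) + 0.1
pi = np.full(K, 1.0 / K)
s_frame = (rng.standard_normal(F) + 1j * rng.standard_normal(F)) / np.sqrt(2)
print(spectral_gmm_loglik(s_frame, pi, v))
```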

1.2.1.2 Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a dimension reduction technique that works with nonnegative data. NMF has been applied to many fields of machine learning and audio signal processing [43, 72, 73, 102, 105, 108, 109, 127]. A more detailed description of NMF is given in Chapter 2, as it is the baseline method for our study. In the following, we review NMF as a structured spectral source model applied to audio source separation, known as spectral NMF.

In the spectral NMF model, each source sj is the sum of Kj spectral bases (also called frequency bases, basis spectra, or latent components), whose STFT coefficients are modeled as [102]

$$c_k(n, f) \sim \mathcal{N}_c(0,\, h_{nk} w_{kf}), \quad (1.9)$$

where wkf ∈ R₊ denotes a spectral basis representing the spectral structure of the signal, and hnk ∈ R₊ is the time-varying activation weighting that basis. The source STFT coefficients sj(n, f) are then also modeled as independent zero-mean Gaussian random variables with free variances $\sum_{k=1}^{K_j} h_{nk} w_{kf}$.


Maximum likelihood estimation of the parameters Wj and Hj then amounts to minimizing

$$-\log p(\mathbf{S}_j \mid \mathbf{H}_j, \mathbf{W}_j) \;\stackrel{c}{=}\; \sum_{n,f} d\!\left(|s_j(n,f)|^2 \,\middle\|\, [\mathbf{W}_j \mathbf{H}_j]_{fn}\right), \quad (1.11)$$

where $\stackrel{c}{=}$ denotes equality up to a constant, and the divergence function d may be, e.g., the Kullback-Leibler (KL) divergence [73], $d_{KL}(x \,\|\, y) = x \log(x/y) - x + y$, or the Itakura-Saito (IS) divergence [40], $d_{IS}(x \,\|\, y) = x/y - \log(x/y) - 1$; more details are given in Chapter 2. Here NMF requires the estimation of only N Kj values of Hj and Kj F values of Wj instead of the N F values of the power spectrogram of the source, where N Kj + Kj F ≪ N F. Thus NMF is considered a form of dimension reduction in this context.
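For illustration, the two divergences and the cost (1.11) can be evaluated numerically as in the following sketch; the spectrogram and factor shapes are arbitrary placeholders:

```python
import numpy as np

def d_kl(x, y, eps=1e-12):
    """Generalized Kullback-Leibler divergence d_KL(x || y), element-wise."""
    x, y = np.maximum(x, eps), np.maximum(y, eps)
    return x * np.log(x / y) - x + y

def d_is(x, y, eps=1e-12):
    """Itakura-Saito divergence d_IS(x || y), element-wise."""
    x, y = np.maximum(x, eps), np.maximum(y, eps)
    return x / y - np.log(x / y) - 1.0

# Cost (1.11) for a power spectrogram V ~ |s(n,f)|^2 and a model W @ H
rng = np.random.default_rng(0)
V = rng.random((513, 200))                  # power-spectrogram stand-in
W, H = rng.random((513, 10)), rng.random((10, 200))
print(d_kl(V, W @ H).sum(), d_is(V, W @ H).sum())
```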

Spectral NMF has been applied to single-channel audio source separation [115, 142] and to multichannel audio source separation [102, 104] with different settings. In recent years, several studies have investigated user-guided NMF methods [26, 30, 37, 104, 126, 156] that incorporate specific information about the sources in order to improve the efficiency of the separation algorithm.

1.2.1.3 Deep Neural Networks

Recent studies have shown that deep neural networks (DNNs) are able to model complex functions and perform well on various tasks, including audio signal processing [4, 35, 53, 62, 119, 144, 155, 157]. The two former methods, GMM and NMF, first learn the characteristics of the speech and noise signals; the learned models are then used to guide the signal separation process. Deep learning based approaches, in contrast, can learn the separation mask or the separation model by end-to-end training, and have had a significant impact.

In DNN-based approaches, the mixture time-frequency representation is pre-processed to extract relevant features. Given these features as inputs, a DNN is utilized either for directly estimating the time-frequency mask [144] or for estimating the source spectra whose ratio yields a time-frequency mask [4, 56, 132]. Time-frequency masking, as its name suggests, estimates the spatial images by filtering the time-frequency representation of the mixture with a mask. This can be expressed as

$$\hat{c}_j(n, f) = \hat{m}_j(n, f)\, x(n, f), \quad (1.12)$$

where $\hat{m}_j(n, f)$ is the mask for time frame n and frequency bin f of the j-th source. In the audio enhancement scenario, the best possible binary and soft masks are called the ideal binary mask and the ideal ratio mask, respectively; they are derived from a typical real-valued scalar mask in [33]. Three cost functions for training the DNN have been investigated:

- The mask estimation error:

$$D_{MA} = \sum_{f,n} \left(m^{rat}_{targ}(f,n) - \hat{m}^{rat}_{targ}(f,n)\right)^2, \quad (1.15)$$

- The error of the spectra computed using the estimated mask:

$$D_{SA} = \sum_{f,n} \left(\hat{m}^{rat}_{targ}(f,n)\,|x(f,n)| - |s_{targ}(f,n)|\right)^2, \quad (1.16)$$

where $s_{targ}(f,n)$ is the target source spectrum.

- The error of the signal in the complex-valued T-F domain computed using the estimated mask:

$$D_{PSA} = \sum_{f,n} \left|\hat{m}^{rat}_{targ}(f,n)\,x(f,n) - s_{targ}(f,n)\right|^2. \quad (1.17)$$

These studies show that $D_{PSA}$ outperforms the other two cost functions. This indicates that taking phase information into account in DNN training is beneficial, even though the estimated mask is real-valued and thus does not affect the phase. Most studies have addressed the problem of single-channel source separation [18, 52, 56, 132, 150]. Recently, a few studies have exploited DNNs for multichannel sound source separation based on different approaches. In Nugraha's study [96], DNNs are used to estimate the spectral parameters of each source in the EM iteration. These estimated parameters, together with the spatial parameters, are used to derive a time-varying multichannel filter. The study of Wang et al. [148] combines spectral and spatial features in a deep clustering algorithm for blind source separation. In their approach, phase difference features are included in the input to a deep clustering network; the network encodes both spatial and spectral information in the embeddings it creates,
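The sketch below, added for illustration, evaluates the three objectives (1.15)-(1.17) for given masks and spectra. Since the exact ideal-mask formulas (1.13)-(1.14) from [33] are not reproduced above, the ideal ratio mask used here is one common variant and should be read as our assumption:

```python
import numpy as np

def mask_losses(m_hat, m_ideal, X, S_target):
    """The three training objectives (1.15)-(1.17) for a real-valued
    ratio mask m_hat, given mixture STFT X and target STFT S_target."""
    d_ma  = np.sum((m_ideal - m_hat) ** 2)                       # mask error
    d_sa  = np.sum((m_hat * np.abs(X) - np.abs(S_target)) ** 2)  # magnitude error
    d_psa = np.sum(np.abs(m_hat * X - S_target) ** 2)            # phase-sensitive error
    return d_ma, d_sa, d_psa

rng = np.random.default_rng(0)
F, N = 257, 100
S = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))  # target
V = rng.standard_normal((F, N)) + 1j * rng.standard_normal((F, N))  # interference
X = S + V
m_ideal = np.abs(S) / (np.abs(S) + np.abs(V))   # one common IRM variant (assumption)
m_hat = np.clip(m_ideal + 0.1 * rng.standard_normal((F, N)), 0, 1)
print(mask_losses(m_hat, m_ideal, X, S))
```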

or the full-rank spatial covariance matrix in local Gaussian model (LGM) where thenarrowband assumption is relaxed [28, 38, 94]

In this part, we present three typical existing models that exploit deterministic orprobabilistic parameterization for the spatial cues They are IID/ITD, rank-1 covari-ance matrix, and full-rank spatial covariance model

1.2.2.1 Interchannel Intensity/Time Difference (IID/ITD)

1.2.2 Spatial models

Spatial information can be parameterized either deterministically, e.g., through interchannel cues or the rank-1 mixing vector under the narrowband assumption, or probabilistically, e.g., through the full-rank spatial covariance matrix in the local Gaussian model (LGM), where the narrowband assumption is relaxed [28, 38, 94]. In this part, we present three typical existing models that exploit deterministic or probabilistic parameterizations of the spatial cues: IID/ITD, the rank-1 covariance matrix, and the full-rank spatial covariance model.

Spatial models encode any information related to the spatial positions of the sources. Many existing BSS algorithms exploit spatial cues such as the phase and amplitude differences between the mixture channels, called the interchannel time difference (ITD) and the interchannel intensity difference (IID). The ITD arises because it takes longer for the sound to arrive at the microphone that is farther from the source. The IID arises because some of the incoming sound energy is attenuated when reaching the microphone that is farther from the direction of the source [145].

Assuming two microphones and two sources in an anechoic mixing condition, the IID is illustrated in Fig. 1.3. Source position s1 is nearer to microphone 1, so the recorded signal level x1 is higher than x2, and the corresponding IID as the source amplitude varies is modeled by the solid line s1. On the contrary, source position s2 results in a smaller x1 than x2, and the corresponding IID is represented by the dotted line in Fig. 1.3. The observed IID is therefore constant over time and directly related to the source direction of arrival (DoA). IID/ITD cues have been widely exploited


in the history of both anechoic and convolutive source separation [1, 31, 86, 97, 112, 138, 145]. The state of the art also points out that IID/ITD cues are relevant for instantaneous and anechoic mixtures, but far from the actual characteristics of reverberant mixtures.

Figure 1.3: IID corresponding to two sources in an anechoic environment.

1.2.2.2 Rank-1 covariance matrix

Given the mixing model (1.4) under the narrowband assumption, the covariance matrix of cj(n, f), denoted Σj(n, f), is given by [28]

$$\boldsymbol{\Sigma}_j(n, f) = v_j(n, f)\, \mathbf{R}_j(n, f), \quad (1.18)$$

where vj(n, f) is the variance of sj(n, f) and Rj(n, f) is equal to the rank-1 matrix

$$\mathbf{R}_j(n, f) = \mathbf{a}_j(f)\, \mathbf{a}_j^H(f), \quad (1.19)$$

where aj(f) is the Fourier transform of the mixing filters aj(τ) and (·)^H denotes conjugate transposition. This rank-1 convolutive parameterization of the spatial covariance matrices has been exploited together with an NMF model of the source variances in [67, 103, 104, 121, 151].

In the case of an anechoic recording environment without reverberation and with omnidirectional microphones, each mixing filter combines a delay τij and a gain κij specified by the distance rij from the j-th source to the i-th microphone as [50]

$$\tau_{ij} = \frac{r_{ij}}{c}, \qquad \kappa_{ij} = \frac{1}{\sqrt{4\pi}\, r_{ij}}, \quad (1.20, 1.21)$$

where c is the speed of sound. The spatial covariance matrix then takes the rank-1 anechoic form $\mathbf{R}_j(f) = \mathbf{a}^{an}_j(f)\,(\mathbf{a}^{an}_j(f))^H$ (1.22), with the anechoic steering vector

$$\mathbf{a}^{an}_j(f) = \left(\kappa_{1j}\, e^{-2i\pi f \tau_{1j}}, \ldots, \kappa_{Ij}\, e^{-2i\pi f \tau_{Ij}}\right)^T. \quad (1.23)$$

1.2.2.3 Full-rank spatial covariance model

In an anechoic or low-reverberation recording environment, one possible interpretation of the narrowband approximation is that the sound of each source, as recorded at the microphones, comes from a single spatial position at each frequency f, as specified by aj(f) or a^an_j(f) [28]. But this approximation is not valid in a reverberant environment, because of the spatial spread of each source and the echoes at many different positions on the walls, ceilings, and floors of the recording room. Full-rank spatial covariance matrices model this spread better.

Assume that the spatial image of each source is composed of two uncorrelated parts: a direct part, modeled by a^an_j(f) as in (1.23), and a reverberant part [69]. The spatial covariance Rj(f) of each source is then a full-rank matrix defined as the sum of the covariance of its direct part and the covariance of its reverberant part [28]:

$$\mathbf{R}_j(f) = \mathbf{a}^{an}_j(f)\,(\mathbf{a}^{an}_j(f))^H + \sigma^2\, \boldsymbol{\Omega}(f), \quad (1.24)$$

where σ² is the variance of the reverberant part and Ωil(f) is a function of the microphone directivity pattern and the distance between the i-th and l-th microphones (such that Ωii(f) = 1). This full-rank direct+diffuse model assumes that the reverberation recorded at all microphones has the same power but is correlated as characterized by Ωil(f).

This model was employed for single-source localization in [50] and considered for multiple-source localization in [93]. The covariance matrix Ω(f) was usually employed for modeling diffuse background noise [60, 87]. For instance, the source separation algorithm in [60] assumed that the sources follow an anechoic model and represented the non-direct part of all sources by a shared diffuse noise component with covariance Ω(f) and constant variance. This algorithm did not account for the correlation between the variances of the direct part and the non-direct part.

A full-rank unconstrained covariance model was proposed in 2010 [28], which encodes the spatial positions of the sources as well as their spatial spread. This model parameterizes the spatial information of each source via a full-rank unconstrained Hermitian positive semi-definite spatial covariance matrix Rj(f) whose coefficients are not deterministically related a priori. This unconstrained parameterization is the most general possible parameterization of a covariance matrix. It generalizes the above three parameterizations in the sense that any matrix taking the form of (1.19), (1.22), or (1.24) can also be considered a particular case of an unconstrained matrix. Since then, the full-rank unconstrained spatial model has been applied more and more widely, in combination with different spectral models such as NMF [6, 105, 107] and DNNs [96].
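As an illustration (not code from the thesis), the anechoic steering vector (1.23) and the full-rank covariance (1.24) can be assembled as follows; the gain model for κij follows the reconstruction above, and the microphone-correlation matrix Ω is a toy placeholder:

```python
import numpy as np

def anechoic_steering(r, f, c_sound=343.0):
    """Anechoic steering vector a_j^an(f) of (1.23) for source-to-mic
    distances r (shape (I,)) at frequency f in Hz."""
    tau = r / c_sound                        # delays, (1.20)
    kappa = 1.0 / (np.sqrt(4 * np.pi) * r)   # gains, (1.21)
    return kappa * np.exp(-2j * np.pi * f * tau)

def full_rank_cov(a_an, sigma2, Omega):
    """Full-rank spatial covariance R_j(f) of (1.24):
    direct part a a^H plus reverberant part sigma2 * Omega."""
    return np.outer(a_an, a_an.conj()) + sigma2 * Omega

r = np.array([1.0, 1.3])                     # distances to 2 mics (metres)
a = anechoic_steering(r, f=1000.0)
Omega = np.array([[1.0, 0.4], [0.4, 1.0]])   # toy mic-correlation model
R = full_rank_cov(a, sigma2=0.05, Omega=Omega)
print(np.linalg.matrix_rank(R))              # 2: full rank, unlike (1.19)
```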

1.3 Source separation performance evaluation

The topic of source separation performance evaluation has long been studied in the literature. Several studies have been published, both in terms of objective quality [49, 137] and subjective quality [32, 45, 139]. In our study, we focus on two popular families of objective evaluation criteria, which can be applied to any audio mixture and any algorithm and do not require knowledge of the unmixing parameters or filters. These criteria, namely the energy-ratio criteria and the perceptually-motivated criteria, have been widely used in the community as well as in the recent evaluation campaigns [5, 65, 99, 101, 133-135, 140].

1.3.1 Energy-based criteria

Both families of criteria mentioned above are derived from the decomposition of each estimated source image ĉij(t) into four constituents [140]:

$$\hat{c}_{ij}(t) = c_{ij}(t) + e^{spat}_{ij}(t) + e^{inter}_{ij}(t) + e^{artif}_{ij}(t),$$

where cij(t) is the true source image and e^spat_ij(t), e^inter_ij(t), and e^artif_ij(t) are the spatial distortion, interference, and artifact components, respectively. These components are computed by least-squares projections: the target-related projector projects ĉij(t) onto the subspace spanned by ckj(t − τ), 1 ≤ k ≤ I, 0 ≤ τ ≤ L − 1, while the all-source projector projects onto the subspace spanned by ckl(t − τ), 1 ≤ k ≤ I, 1 ≤ l ≤ J, 0 ≤ τ ≤ L − 1, where L is the filter length, set to 32 ms [140].

Then the relative amounts of interference distortion, artifact distortion, and spatial distortion are measured using three energy-ratio criteria expressed in decibels (dB): the Signal to Interference Ratio (SIR), the Signal to Artifacts Ratio (SAR), and the source Image to Spatial distortion Ratio (ISR), defined by [140]

• Signal to Interference Ratio:

$$\mathrm{SIR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t \big(c_{ij}(t) + e^{spat}_{ij}(t)\big)^2}{\sum_{i=1}^{I} \sum_t e^{inter}_{ij}(t)^2}. \quad (1.29)$$

This measure estimates the rejection of the interference from the other sources.

• Signal to Artifacts Ratio:

$$\mathrm{SAR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t \big(c_{ij}(t) + e^{spat}_{ij}(t) + e^{inter}_{ij}(t)\big)^2}{\sum_{i=1}^{I} \sum_t e^{artif}_{ij}(t)^2}. \quad (1.30)$$

This measure estimates the artifacts introduced by the source separation process.

• Source Image to Spatial distortion Ratio:

$$\mathrm{ISR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t c_{ij}(t)^2}{\sum_{i=1}^{I} \sum_t e^{spat}_{ij}(t)^2}. \quad (1.31)$$

This measure represents the suppression of the spatial distortions.

The total error represents the overall performance of the source separation algorithm and is measured by the Signal to Distortion Ratio (SDR), calculated as follows.

• Signal to Distortion Ratio:

$$\mathrm{SDR} = 10 \log_{10} \frac{\sum_{i=1}^{I} \sum_t c_{ij}(t)^2}{\sum_{i=1}^{I} \sum_t \big(e^{spat}_{ij}(t) + e^{inter}_{ij}(t) + e^{artif}_{ij}(t)\big)^2}. \quad (1.32)$$

These criteria were implemented in Matlab and distributed for public use [41]¹. They are the most commonly used metrics in the source separation community so far.
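Given the four constituents of the decomposition, the four ratios reduce to simple energy computations, as the illustrative sketch below shows. Computing the decomposition itself (the least-squares projections) is what the BSS Eval toolbox [41] provides; it is not reimplemented here, and the inputs are assumed given:

```python
import numpy as np

def energy_ratios(c, e_spat, e_inter, e_artif):
    """SDR/SIR/SAR/ISR of (1.29)-(1.32) for one estimated source image,
    given its decomposition into four constituents.
    All inputs have shape (I, T)."""
    def e(x):  # total energy, summed over channels and time
        return np.sum(x ** 2)
    sir = 10 * np.log10(e(c + e_spat) / e(e_inter))
    sar = 10 * np.log10(e(c + e_spat + e_inter) / e(e_artif))
    isr = 10 * np.log10(e(c) / e(e_spat))
    sdr = 10 * np.log10(e(c) / e(e_spat + e_inter + e_artif))
    return sdr, sir, sar, isr

rng = np.random.default_rng(0)
I, T = 2, 16000
c = rng.standard_normal((I, T))              # true source image stand-in
sdr, sir, sar, isr = energy_ratios(c,
                                   0.05 * rng.standard_normal((I, T)),
                                   0.10 * rng.standard_normal((I, T)),
                                   0.02 * rng.standard_normal((I, T)))
print(f"SDR={sdr:.1f} dB, SIR={sir:.1f} dB, SAR={sar:.1f} dB, ISR={isr:.1f} dB")
```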

1.3.2 Perceptually-based criteria

In addition to the energy-ratio criteria, we consider the perceptually-motivated objective criteria of [32] to assess the quality of the estimated source image signals. These criteria are derived from the decomposition of the estimated source image signals into three distortion components, similarly to (1.26), (1.27), and (1.28), with the addition of the PEMO-Q perceptual salience measure [57]. The components are called the target distortion, interference distortion, and artifact distortion components. They are then used to compute four performance criteria akin to SDR, SAR, SIR, and ISR, termed the Overall Perceptual Score (OPS), Artifacts-related Perceptual Score (APS), Interference-related Perceptual Score (IPS), and Target-related Perceptual Score (TPS), respectively.

These criteria range from 0 to 100, where higher values indicate better performance. It was shown in [32] that the perceptually-motivated criteria can improve the correlation with subjective scores compared to the energy-ratio criteria, and since 2010 they have often been used alongside the energy-ratio criteria in the audio source separation community. The source code of these perceptually-motivated criteria is also available².

1.4 Summary

This chapter has introduced the big picture of audio source separation and formulated the general source separation problem that we focus on in this thesis. From there, we have surveyed the major techniques for exploiting the spectral or spatial information of the sources in the separation process. In addition, the two popular families of objective evaluation criteria that we will use to evaluate the source separation performance of the proposed methods in Chapters 3 and 4 have also been presented.

¹ http://bass-db.gforge.inria.fr/bss_eval/

² http://bass-db.gforge.inria.fr/peass/


CHAPTER 2 NONNEGATIVE MATRIX FACTORIZATION

Spectral decomposition by NMF has become a popular approach in many audio signal processing tasks, such as source separation, enhancement, and audio event detection. This chapter first presents the NMF formulation and its extensions. We then introduce NMF-based audio spectral decomposition. At the end of the chapter, we present the proposed methods for automatically detecting unusual sounds using NMF, aiming at effective sound annotation.

2.1 NMF introduction

Nonnegative matrix factorization is a dimension reduction technique that applies to nonnegative data. NMF became widely known and used after the publication of Lee and Seung in 1999 [72, 73], but it actually appeared nearly 20 years earlier under other names, such as nonnegative rank factorization [61] or positive matrix factorization [109]. Thanks to [72, 73], NMF has been used extensively in a wide variety of applications, such as bioinformatics [76], image processing [120], facial recognition [55], speech enhancement [39, 89], direction of arrival (DoA) estimation [131], blind source separation [40, 102, 107, 122, 130, 159], and informed source separation [25, 44, 46, 48]. Comprehensive reviews of NMF can be found in [147, 160]. In the following, we present some details about NMF so as to understand what it is and how it works.

Given a data matrix V ∈ R₊^{F×N} of dimensions F × N with nonnegative entries, NMF aims at finding two nonnegative matrices W and H such that WH is approximately equal to V [73]:

$$\mathbf{V} \approx \mathbf{W}\mathbf{H}, \quad (2.1)$$

where W ∈ R₊^{F×K} and H ∈ R₊^{K×N} are nonnegative matrices of dimensions F × K and K × N, respectively. NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate F-dimensional data vectors, the vectors are placed in the columns of an F × N matrix V, where F is the dimensionality of the data and N is the number of observations or examples in the dataset. NMF approximately factorizes V into an F × K matrix W and a K × N matrix H, as shown in Fig. 2.1, where K is the number of basis vectors (latent components). Usually, K is chosen to be smaller than F and N in order to achieve a compact decomposition, with F × K + K × N ≪ F × N [42, 73]. W and H are thus smaller than the original matrix V; they form a lower-rank representation of the original data matrix. That is why NMF is considered a dimensionality reduction technique.

Equation (2.1) can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H, respectively. In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. Therefore W is called the dictionary matrix, containing the bases optimized for the linear approximation of the data in V, while H contains the distribution of the bases of W and is called the distribution weight matrix or activation matrix. Usually, relatively few basis vectors suffice to represent many data vectors, so a good approximation can be achieved when the basis vectors successfully discover the latent structure in the data.

To sum up, NMF aims to find the nonnegative basic representative factors, which can be used for feature extraction, dimensionality reduction, eliminating redundant information, and discovering the hidden patterns behind a series of nonnegative vectors.

Figure 2.1: Decomposition model of NMF [36]
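As a minimal, self-contained sketch of the factorization just described (our illustration; the thesis derives the update rules formally in Section 2.1.3), the classical multiplicative updates of Lee and Seung for the generalized KL divergence can be written in a few lines of Python:

```python
import numpy as np

def nmf_mu(V, K, n_iter=200, eps=1e-12, seed=0):
    """Basic NMF V ~ W @ H with multiplicative updates minimizing the
    generalized KL divergence (Lee & Seung, 2001)."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, N)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # update activations
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # update bases
    return W, H

# Toy usage: factorize a random nonnegative "spectrogram"
V = np.abs(np.random.default_rng(1).standard_normal((513, 300)))
W, H = nmf_mu(V, K=10)
print(W.shape, H.shape)   # (513, 10) (10, 300)
```

Because the updates are multiplicative and all quantities start nonnegative, W and H remain nonnegative throughout, which is precisely the property that makes this scheme attractive for power-spectrogram data.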
