HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Tran Kim Hung
GRADUATION THESIS
SPEAKER ADAPTATION IMPROVEMENT METHODS IN SPEECH SYNTHESIS
BACHELOR OF COMPUTER SCIENCE
HO CHI MINH CITY, 2021
HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Tran Kim Hung - 18520811
GRADUATION THESIS
SPEAKER ADAPTATION IMPROVEMENT METHODS IN SPEECH SYNTHESIS
BACHELOR OF COMPUTER SCIENCE
THESIS ADVISORS:
Trinh Quoc Son, M.Sc.
Ngo Duc Thanh, Ph.D.
HO CHI MINH CITY, 2021
INFORMATION OF THE THESIS EXAMINATION COMMITTEE
The thesis examination committee was established under Decision No. 36 dated 17/1/2022 of the Rector of the University of Information Technology.
I would like to express my sincere gratitude to my advisor and instructor, Trinh Quoc Son. With the enthusiasm and support he gave me, I embarked on exploring the realm of speech synthesis and voice cloning in computer science.
My sincere appreciation also goes to the co-supervisor of this thesis, Dr. Ngo Duc Thanh from the Department of Computer Science, University of Information Technology. He took his valuable time to review my work and give me direction on my thesis. Without him, I could not have completed my research.
My final thanks go to the council of the Department of Computer Science, which gave me the chance to work on this thesis.
TABLE OF CONTENTS
ABSTRACT
CHAPTER 1 INTRODUCTION
2.3 Advantages of Neural Networks Over Hand-crafted Methods
2.3.1 Data Training Efficiency
2.3.2 Automation in Neural Speech Synthesis
2.3.3 Quality of Sound
2.4 Speaker Adaptation
2.4.1 Why Is Speaker Adaptation Important?
2.5 Existing Methods of Speaker Adaptation
2.5.1 Tuning Speaker Adaptation
2.5.2 Zero-Shot Speaker Adaptation
CHAPTER 3 BACKGROUND
3.6 How Speaker Embedding Improves Speech Synthesis
4.2.2 Modification to the Flow of Speaker Embedding
4.2.3 Add More Information to the Speaker Encoder
CHAPTER 5 EXPERIMENT
LIST OF FIGURES
Figure 1.1: Workflow of a speech synthesis system
Figure 2.1: Faber Wonderful Talking Machine [1]
Figure 2.2: Modules of a unit selection TTS system [2]
Figure 2.3: Statistical parametric speech synthesis [2]
Figure 2.4: The process of retraining another speaker
Figure 2.5: The process of speaker adaptation to a tuning-based neural speech synthesis
Figure 2.6: Zero-shot speaker adaptation pipeline [8]
Figure 2.7: The process of speaker adaptation in the zero-shot speaker adaptation method
Figure 3.1: Example synthesis of a sentence in different voices using the system. Mel spectrograms are visualized for the reference utterance used to generate the speaker embedding (left) and the corresponding synthesizer outputs (right). The text-to-spectrogram alignment [8]
Figure 3.2: Recurrent neural network
Figure 3.3: Long short-term memory
Figure 3.4: Bidirectional LSTM [5]
Figure 3.5: The differences between conventional RNN, LSTM, and GRU
Figure 3.6: Encoder-decoder architecture [2]
Figure 3.7: Alignment mechanism proposed by Bahdanau et al. [2]
Figure 3.8: Character embeddings [2]
Figure 3.9: Similarity matrix construction at training [7]
Figure 3.10: Visualization of speaker embeddings extracted from LibriSpeech utterances. Each color corresponds to a different speaker. Real and synthetic utterances appear nearby when they are from the same speaker; however, real and synthetic utterances consistently form distinct clusters
Figure 3.11: UMAP projections of utterance embeddings from randomly selected batches from the train set at different iterations of our model. Utterances from the same speaker are represented by a dot of the same color. We specifically omit passing labels to UMAP, so the clustering is entirely done by the model
Figure 3.12: Input and output of the speaker encoder
Figure 3.13: Inputs and output of the synthesizer system
Figure 3.14: Tacotron 2 with speaker embedding architecture [3]
Figure 3.15: Input and output of the vocoder
Figure 3.16: Fatchord WaveRNN architecture [3]
Figure 3.17: Three-stage training of the model [3]
Figure 4.1: Different modes of the multimodal speaker adaptive acoustic architecture. Dashed borders indicate modules with trainable parameters while bold solid borders indicate modules with immutable parameters [9]
Figure 4.2: Architecture of Deep Voice 3 [11]
Figure 4.3: Multi-speaker LDE TTS system. Encoder blocks are in orange, decoder blocks in blue, the post-net block in green, the speaker encoder block in red, and the vocoder block in yellow [10]
Figure 4.4: Our proposed architecture to pass the speaker embedding to the post-net and pre-net layers (we however did not apply the speaker embedding to the output of the post-net)
Figure 4.5: LST-TTS system [12]
Figure 4.6: Wav2vec feature extractor overview [13]
Figure 4.7: Our proposed model, concatenating the wav2vec2 speaker representation and the GE2E speaker representation
LIST OF TABLES
Table 1.1: Methods of speaker adaptation
Table 2.1: Comparison between the three speech synthesis methods [21]
Table 2.2: The result of experiments with Tacotron 2 [6]
Table 2.3: Result of the tuning model [22]
Table 2.4: Speaker similarity MOS with 95% confidence intervals [8]
Table 4.1: Our experiment on the SV2TTS system
Table 4.2: Cross-dataset evaluation for unseen speakers [8]
Table 4.3: Performance between different training loss functions [7]
Table 4.4: Performance using speaker encoders (SEs) on different datasets [8]
Table 5.1: Criteria for different evaluation measurements
Table 5.2: Experiment result of the proposed model trained on di…
Table 5.3: Comparison between architectures
Table 5.4: Cost of training and tuning
LIST OF ABBREVIATIONS
LSTM    Long Short-Term Memory
TTS     Text-to-Speech
RNN     Recurrent Neural Network
MOS     Mean Opinion Score
NMOS    Naturalness Mean Opinion Score
SMOS    Similarity Mean Opinion Score
BLSTM   Bidirectional LSTM
CBHG    1-D Convolution Bank + Highway network + Bidirectional GRU
GRU     Gated Recurrent Unit
STFT    Short-Time Fourier Transform
DNN     Deep Neural Network
SV2TTS  Speaker Verification to Text-to-Speech
GPU     Graphics Processing Unit
CPU     Central Processing Unit
ITU     International Telecommunication Union
EER     Equal Error Rate
ABSTRACT
Text-to-Speech (TTS) synthesis is the automatic conversion of written content to spoken language. TTS synthesis plays a critical role in natural, organic interaction between humans and computers. Although communication between man and machine can be satisfied by commands and text appearing on the screen, applications such as Siri or Cortana are prime examples of what communication between man and machine can become with the help of TTS.
Classic approaches to speech synthesis are limited to a definitive set of data and heavy loads of hand-crafted procedures (rule-based algorithms used to perform linguistic analysis). In this epoch of neural networks, the quality of sound and the flexibility of training a TTS model have improved rapidly throughout the years of development. However, some questions are still open for improvement. One of many problems is the issue of creating an appropriate model for speech production with minimal resources and time. Our objective is to tackle this topic and investigate what we can contribute to it. Many pieces of research have been conducted to address this concern; it is known as the problem of speaker adaptation.
CHAPTER 1 INTRODUCTION
Speech synthesis is the process of artificially generating human speech from text. It aims to synthesize intelligible and natural audio indistinguishable from human-recorded audio. A speech synthesizer is a computer system designed for this particular purpose. Speech synthesis has applications that can be useful for people with disabilities and dyslexia. One of the most iconic application cases is Dr. Stephen Hawking.
The TTS system is the automatic converter of written to spoken language. The input is text, and its mission is to generate a speech waveform that corresponds to the original text (Figure 1.1).
Figure 1.1: Workflow of a speech synthesis system
1.1 Problem Statement
In speech synthesis, it is essential to produce a system that can generate sound identical to the speaker. It aims to replicate the speaker in terms of naturalness, style, and similarity in tempo. However, it is also crucial to train and produce the system model quickly and efficiently to compete with other rivals in this growing market of artificial voice cloning.
There are many challenges concerning this topic, and they have been gatekeeping the breakthrough in this field. Those problems include:
- Fear of violation of the freedom of speech: With the improvement of speech synthesis naturalness throughout the years of development, it is natural for the public to imagine a dystopia where their voices are taken advantage of for malicious intents, and the closer the field gets to its breakthrough, the more this fear becomes a reality. In response to the fear, regulations are set to restrict the collection of voice data, which makes it harder to train models. This fear has proven to be a hindrance to the development of the field of speech synthesis. To generate a system model that can achieve the naturalness of a desired human speaker, the researcher needs an extravagant amount of high-quality data. Collecting high-quality speech data in large quantities is therefore problematic. With the fear of losing the freedom of speech as the base, retaliation from the people is inevitable. There are also laws authorized to protect people's voices from malicious intentions. To solve this case, we not only need to strategize a proposal to model a system with the very little data we have, but the system must also provide output with decent quality.
- Similar to the data-hungry problem, generating a system that reaches the minimum requirement of a human-like product requires massive computing power. However, this work was composed during a global chip shortage crisis, which made it impossible to own a system capable of generating an appropriate model for speech production in a short amount of time. Therefore, we need to find a solution to run the algorithm on a low-computing configuration.
1.2 Scope and Goals
1.2.1 Scope
The scope of our work is to survey existing work related to the problem of saving time on producing a model for speech synthesis.
During the endeavor to research this subject, we discovered what is known as speaker adaptation. It can be summarized as a set of techniques that allow researchers to generate speech for a specific speaker with very minimal data and computing power.
There are many related works on the study of speaker adaptation (Table 1.1). The details will be elaborated further in Section 2.5 of this thesis.
Table 1.1: Methods of speaker adaptation (header: Adaptation paradigm | Methods)
1.2.2 Goals
The first goal of this work is a legacy continuation of [8], which is to build a TTS system that can generate natural speech for various speakers in a data-efficient manner. We build on an existing zero-shot learning setting, where a few seconds of un-transcribed reference audio from a target speaker is used to synthesize new speech in that speaker's voice, without updating any model parameters. Such systems have accessibility applications, such as restoring the ability to communicate naturally to users who have lost their voice and cannot provide many new training examples. They could also enable new applications, such as transferring a voice across languages for more natural speech-to-speech translation, or generating realistic speech from text in low-resource settings. However, it is also important to note the potential for misuse of this technology, for example, replicating someone's voice without their consent.
The second goal of our work is to optimize the adaptation process and output speech quality so that it is indistinguishable from the original human voice. We also want to accelerate the production of an appropriate model for the speech-generating system, which means we want the model training to be faster to save time.
1.3 Contributions
In this thesis, we survey research related to the field of speaker adaptation. We aim to determine which method would bring the most noticeable improvement.
We also combine some of the proposed methods. The justification for fusing them is to observe how they complement each other. By orchestrating them together, we can inspect what kind of impact they have on one another.
The results of our work show that the changes we propose to address the problem can have a notable impact on the big picture.
CHAPTER 2 RELATED WORKS
2.1 History of TTS
It has been more than a century since the American scientist Joseph Henry came across the "Wonderful Talking Machine" at an exhibition in Philadelphia on December 20th, 1845 (Figure 2.1). The machine was the work of a mechanic from Freiburg named Joseph Faber.
Figure 2.1: Faber Wonderful Talking Machine [1]
The invention of Mr. Faber of Freiburg can be considered the oldest recorded predecessor of speech synthesis. Since then, speech synthesis has come a long way in digitalizing the art of TTS on computers.
In the early part of the modern history of the field, concatenative and statistical parametric methods of speech synthesis were the pillars of the domain and were presented as the methods to solve the speech synthesis problem. However, the procedures behind their operation were considerably complicated. The processes required extensive research on the characteristics of speech, bound into a set of rules (which can be called hand-crafted features). Last but not least, configurations like this were necessary to achieve a suitable TTS system, which was fixated on a single speaker [1].
When neural networks became more available and offered more automation than their predecessors, more models based on neural networks were released throughout the years. Notable are the releases of Tacotron [22] and Tacotron 2 [6].
2.2 Evaluation Methods
To evaluate the efficiency of speech synthesis, we will use the Mean Opinion Score (MOS) to evaluate both the similarity and naturalness of the end product of speech production. The MOS is a score from 1 to 5 on a scale of bad, poor, average, good, and very good.
Depending on what one wants to test, different questions should be asked. For instance: "Please listen to the sample and judge, using a five-point scale, its quality/naturalness/similarity to a certain speaker."
ITU 1994 [32] recommends MOS with seven different questions ranging from "Listening Effort" to "Overall Quality". See also [31] for a description of an expanded MOS test in which questions are asked about the following: Listening Effort, Comprehension Problems, Speech Sound Articulation, Precision, Voice Pleasantness, Voice Naturalness, Human-like Voice, Voice Quality, Emphasis, Rhythm, Intonation, Trust, Confidence, Enthusiasm, Persuasiveness. The criteria that affect the result of this work will be defined in more detail in Section 5.3.
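For illustration, the following is a minimal Python sketch of how per-listener ratings could be aggregated into a MOS with an approximate 95% confidence interval, matching the format of the scores reported later in this thesis. The listener ratings shown are hypothetical, and the normal-approximation interval is only one possible convention, not necessarily the exact procedure used in the cited works.

import math
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    # Mean Opinion Score with an approximate 95% confidence interval
    # (z = 1.96 for a normal approximation of the standard error).
    m = mean(ratings)
    ci = z * stdev(ratings) / math.sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, ci

naturalness_scores = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]   # hypothetical listener ratings (1-5)
print("NMOS = %.2f +/- %.2f" % mos_with_ci(naturalness_scores))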
2.3 Advantages of Neural Networks Over Hand-crafted Methods
The motivation for turning to neural network approaches is their ability to model the relation between the raw text and the final output without the need for hand-crafted feature engineering.
In exemplar-based speech synthesis (Figure 2.2), the system simply stores the speech corpus itself. The stored speech data are labeled so that appropriate parts of them can be found, extracted, and then concatenated during the synthesis phase. The most prominent exemplar-based technique, and one of the dominant approaches to speech synthesis, is unit selection (Figure 2.2). In this technique, units from a large speech database are selected according to how well they match a specification and how well they join together. The specification and the units are completely described by a structure, which can be any mixture of linguistic and acoustic features. The quality of the output derives directly from the quality of the recordings, and it appears that the larger the database, the better the coverage. Commercial systems have exploited these techniques to bring about a new level of synthetic speech. However, these techniques limit the output speech to the same style as the original recording. In addition, recording a massive database with variations is very difficult and costly [2].
Figure 2.3: Statistical parametric speech synthesis [2]
In model-based speech synthesis (Figure 2.3), the system fits a model to the speech corpus (during the training phase) and stores this model. Due to the presence of noise and unpredictable factors in speech training data, the models are usually statistical. Statistical parametric speech synthesis has also grown in popularity, in contrast to the selection of actual instances of speech. These models do not use stored exemplars; they describe the parameters of models using statistics (e.g., means and variances of probability density functions) which capture the distribution of parameter values found in the training data (Figure 2.3). The quality of speech produced by the initial statistical parametric systems was significantly lower than that of unit-selection systems [2].
2.3.1 Data Training Efficiency
In the era of deep learning, end-to-end neural networks, which are state-of-the-art for speech synthesis tasks, have become highly competitive with conventional TTS systems. An end-to-end TTS system can be trained on <text, audio> pairs without hand-crafted feature engineering. All the modules are trained together to optimize a global performance criterion, without manual integration of separately trained modules [2].
2.3.2 Automation in Neural Speech Synthesis
The two stages of a common TTS system are text analysis (text into an intermediate representation) and waveform synthesis (from the intermediate representation into a waveform). These stages are known as the front-end and the back-end respectively.
The front-end of the TTS (from input text to linguistic specification): text processing sees the text as the input to the synthesizer and tries to rewrite any "non-standard" text as proper "linguistic" text. It takes arbitrary text and performs the tasks of classifying the written signal with respect to its semiotic type (natural language or other), decoding the written signal into an unambiguous, structured representation, and, in the case of non-natural language, verbalizing this representation to generate words. The tasks that can be included in the front-end [1] are the following:
1 Pre-processing: possible identification of text genre, character encoding
issues, possible multilingual issues
2 Sentence splitting: segmentation of the document into a list of sentences
3 Tokenization: segmentation of each sentence into several tokens
4 Text analysis:
a) Semiotic classification: classification of each token as one of the semiotic classes: natural language, abbreviation, quantity, date, time, etc.
b) Decoding/parsing: finding the underlying identities of tokens
using a decoder or parser that is specific to the semiotic class
c) Verbalization: Conversion of non-natural language semiotic
classes into words that can be spoken
5 Homograph resolution: Determination of the correct underlying word for
any ambiguous natural language token
6 Parsing: Assigning a syntactic structure to the sentence
7 Prosody prediction: Attempting to predict a prosodic form for each
utterance from the text
When performing TTS, grapheme and phoneme analysis are crucial [19]. First, a grapheme form of the text input is found and converted to a phoneme form for synthesis. This approach is increasingly practical in languages where the grapheme-phoneme correspondence is relatively direct; finding the graphemes often means that phoneme tagging can be accomplished with high confidence. As a result, pronunciation can be accurately determined.
With an end-to-end system [6], [7], [18], we only need a small part of the above tasks, namely simple text decoding (word tokenization and normalization), because the characteristics of speech are expected to be learned automatically through the neural network. Word tokenization and normalization are generally done by cascades of simple regular expression substitutions or finite automata, although recent advancements in learned text normalization may render this unnecessary in the future. This makes for a straightforward and more automated approach to text analysis.
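As a toy illustration of such a regex cascade, the Python sketch below performs minimal word tokenization and normalization. The abbreviation table and digit verbalization are hypothetical and far simpler than a real production front-end.

import re

ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "etc.": "et cetera"}
DIGITS = "zero one two three four five six seven eight nine".split()

def normalize(text):
    # Lowercase, expand a few known abbreviations, verbalize digits, then tokenize.
    text = text.lower()
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # Verbalize digits one by one (a placeholder for proper number expansion).
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    # Keep only letter/apostrophe runs as tokens.
    return re.findall(r"[a-z']+", text)

print(normalize("Dr. Smith bought 2 apples."))
# -> ['doctor', 'smith', 'bought', 'two', 'apples']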
2.3.3 Quality of Sound
The work of [4] also suggested that letting the linguistic encoder process text into vector representations, learning the analysis automatically from data, seems to improve naturalness. It is a crucial ingredient in allowing the attention mechanism to function accurately. Conditioning the overall system on previously generated acoustic representations and self-evaluating weights to match the ground truth during training leads to significant naturalness gains.
Improvements can be verified by looking at past works related to end-to-end systems designed by many researchers in the field. In the research of [21], in an attempt to inspect the naturalness of the speech generated by Tacotron, it was put to compete with other conventional approaches at the time. Along with the parametric and concatenative methods of speech synthesis, the experiment result is presented in Table 2.1. While the naturalness MOS of the Tacotron samples lagged behind the concatenative method, it surpassed the parametric method, promising a future for speech synthesis with neural networks.
Method          Mean Opinion Score
Tacotron        3.82 ± 0.085
Parametric      3.69 ± 0.109
Concatenative   4.09 ± 0.119
Table 2.1: Comparison between the three speech synthesis methods [21]
Succeeding Tacotron [21] was the work of Tacotron 2 [6]. By implementing an extra module known as WaveNet, the result of Tacotron 2 (Table 2.2) surpassed all its predecessors and became state-of-the-art in the domain of neural speech synthesis.
Table 2.2: The result of experiments with Tacotron 2 [6]
2.4 Speaker Adaptation
In the domain of speech recognition, speaker adaptation refers to the range of techniques whereby a speech recognition system is adapted to the acoustic features of a specific user using a small sample of utterances from that user [16]. In speech production, when systems are focused on a single speaker, adaptation is understood as switching the current speaker the system is producing to a new speaker.
2.4.1 Why Is Speaker Adaptation Important?
The more "classic" end-to-end speech synthesizers, such as [6] and [21], are models that train on a definitive set of data. The result will still only reproduce the speaker that resided in the original (single-speaker) dataset (Figure 2.4).
Figure 2.4: The process of retraining another speaker
The problem with this approach is that it can consume a lot of computing power to train. The model needs time to match the designer's desire. It also needs to be trained on a single-speaker dataset.
In the research of [1], [6], and [21], a model that can engineer such a result needs to be trained for over 50,000 steps to produce good-quality sound. It is time-consuming to generate such a model, so it is impossible to create a decent model in a short amount of time. In a setting where a GPU is not available, speeding up the training is not an option when we rely on the CPU to train the model. It is paramount that we save time and resources (both in hardware and data) on reproducing an appropriate voice-generating model.
2.5 Existing Methods of Speaker Adaptation
2.5.1 Tuning Speaker Adaptation
It is common to adapt a well-trained, general acoustic model to new users or environmental conditions (Figure 2.5). Instead of retraining the whole system to produce a new model, we can tune a pre-trained model on new data.
Figure 2.5: The process of speaker adaptation to a tuning-based neural speech
synthesis
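As a minimal sketch of this tuning process (not the actual training code of the systems cited here), adaptation could be implemented along the following lines in PyTorch. The checkpoint name, the `encoder` attribute, and the forward signature from character ids to a mel spectrogram are hypothetical assumptions.

import torch
import torch.nn.functional as F

tts = torch.load("pretrained_tts.pt")            # assumed checkpoint of the base acoustic model
for p in tts.encoder.parameters():               # optionally freeze the text encoder
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in tts.parameters() if p.requires_grad], lr=1e-4)

def adaptation_step(char_ids, mel_target):
    # One gradient step on a <text, audio> pair from the new speaker.
    optimizer.zero_grad()
    mel_pred = tts(char_ids)
    loss = F.mse_loss(mel_pred, mel_target)
    loss.backward()
    optimizer.step()
    return loss.item()

Only the unfrozen weights move, so the adapted model keeps most of the knowledge learned from the original large corpus.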
Research such as [22] has shown that tuning the weights of the model can efficiently result in much better naturalness (Table 2.3).
Dataset       Sample tests   MOS average score
BigCorpus     44             3.47
SmallCorpus   80             4.13
Table 2.3: Result of the tuning model [22]
However, the tuning method needs an improvement that can reduce the amount of additional data and the number of training steps.
As [22] suggested, the number of training steps needed for such a model to adapt well to the new speaker is 35,000. If the system relies on a CPU alone, 35,000 is not a small number of steps. We then proceeded to other research on tuning models like [2]. However, the result may be faulty, because the quality of the data adapted with fewer steps proved unstable. The practical aspect of the tuning method was suppressed in the original Tacotron 2 by the heavy demand on the amount of data and machine power.
Speaker adaptation on a single seq2seq model via tuning needs a solution to reduce the amount of data and work fed into it. Zero-shot speaker adaptation then provides an answer to the problem of the massive amount of work on the model.
2.5.2 Zero-Shot Speaker Adaptation
When referring to the technique of zero-shot speaker adaptation, we are referring to a procedure that requires only a limited voice example (a single speech sample with a 5-second minimum length). Subsequently, the system can generate a waveform with vocal features similar to the example voice.
In the zero-shot speaker adaptation approach, a speaker encoder is implemented in the model; its task is to extract a d-vector from a speech sample. The d-vector is supposed to be a representation of the essence of the speaker's voice. After the d-vector (sometimes called a speaker embedding) is produced, it is fed to a synthesizer (Figure 2.6).
Figure 2.6: Zero-Shot Speaker Adaptation pipeline [8]
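As a minimal sketch of this pipeline (the module interfaces are hypothetical, not the exact SV2TTS API), the adaptation step could look as follows; note that no model parameters are updated at any point.

import torch

def clone_voice(speaker_encoder, synthesizer, vocoder, reference_wav, char_ids):
    # Zero-shot adaptation: one reference waveform, one text, no retraining.
    with torch.no_grad():
        d_vector = speaker_encoder(reference_wav)      # e.g. a 256-dim embedding from >= 5 s of audio
        d_vector = d_vector / d_vector.norm()          # d-vectors are typically L2-normalized
        mel = synthesizer(char_ids, d_vector)          # spectrogram carrying the target voice
        return vocoder(mel)                            # waveform in the target speaker's voice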
Another variant of the architecture is Deep Voice 3 [11]. The model adds a speaker embedding to multiple modules in the network to train a multi-speaker TTS model for thousands of speakers. Deep Voice 2 [17] used a jointly trained speaker encoder network to extract a speaker embedding of unseen speakers, while Jia et al. [8] used a separate speaker verification network. The VoiceLoop model jointly trains a speaker embedding with the acoustic model and can adapt to unseen speakers by using both the speech and transcriptions of the target speakers [9]. Nachmani et al. replaced the jointly trained speaker embedding of VoiceLoop with a speaker embedding obtained solely from acoustic features so that the model could adapt using un-transcribed speech [9].
Figure 2.7: The process of speaker adaptation in the zero-shot speaker adaptation method
The architecture offers the end-user much more versatile adapting power (Figure 2.7). The model needs only one sample of a voice to perform adaptation. While the tuning method requires the user to retrain the model, zero-shot speaker adaptation does not. This helps the model save time and resources on speaker adaptation.
There are many reasons for performing zero-shot speaker adaptation instead of conventional training, for instance, reducing the speaker footprint and quickly adapting to new speakers [9]. But the most important reason is its potential to handle unrefined adaptation data, whether in insufficient quantity or of unreliable quality, such as noisy speech, an incorrect transcript, or no transcript at all [9].
System           Speaker Set        VCTK           LibriSpeech
Ground truth     Same speaker       4.67 ± 0.04    4.33 ± 0.08
Ground truth     Same gender        2.25 ± 0.07    1.83 ± 0.07
Ground truth     Different gender   1.15 ± 0.04    1.04 ± 0.03
Embedding table  Seen               4.17 ± 0.06    3.70 ± 0.08
Proposed model   Seen               4.22 ± 0.06    3.28 ± 0.08
Proposed model   Unseen             3.28 ± 0.07    3.03 ± 0.09
Table 2.4: Speaker similarity MOS with 95% confidence intervals [8]
However, the architecture of zero-shot adaptation poses another problem. There is a noticeable gap, in terms of similarity to their respective ground truth, between speakers present in the original dataset and speakers who were not. To improve zero-shot speaker adaptation, we must minimize the difference between seen and unseen speakers of the data used to train the model. Looking at Table 2.4, we can see the difference when comparing the similarity between the "Seen" and "Unseen" speakers in both datasets used for the experiment.
CHAPTER 3 BACKGROUND
In this chapter, we present the architecture that motivated our research. It acts as the foundation for our team to dive into the speaker adaptation topic and gives us the goal of seeking methods to improve it. Many fundamental concepts will be explained to understand the overall architecture. Basic concepts, such as RNNs, LSTMs, sequence-to-sequence models, attention, and how the speaker embedding impacts the problem, will be presented throughout this chapter.
The core architecture of this work is zero-shot speaker adaptation, as introduced in Section 2.5.2 (Figure 2.6). It is a neural speech synthesis network that receives two inputs. The first input is a waveform example (5 seconds duration at minimum), and the second input is text. The end product of the process is a waveform that corresponds to the text and possesses a vocal range that resembles the input sound example (Figure 3.1). The particular system we have chosen for this thesis is the SV2TTS system, implemented by GitHub user CorentinJ [3].
Figure 3.1: Example synthesis of a sentence in different voices using the system. Mel spectrograms are visualized for the reference utterance used to generate the speaker embedding (left) and the corresponding synthesizer outputs (right). The text-to-spectrogram alignment [8]
We have chosen this system because it was presented as a compact version of SV2TTS and is very easy to use. Throughout the experiments, we also noticed that the system's activities are very gentle on the longevity of the hardware.
3.1 RNN
A recurrent neural network is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence (Figure 3.2). This allows it to exhibit temporal dynamic behavior for a time series. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. RNNs are relatively old, like many other deep learning algorithms. They were initially created in the 1980s but could only show their real potential in recent years, because of the increase in available computational power, the massive amounts of data that we have nowadays, and the invention of the LSTM in the 1990s. Because of their internal memory, RNNs can remember important things about the input they received, which enables them to be very precise in predicting what is coming next. This is the reason why they are the preferred algorithm for data like time series, speech, text, financial data, audio, video, weather, etc. In an RNN, the information cycles through a loop. When it makes a decision, it takes into consideration the current input and also what it has learned from the inputs it received previously.
Figure 3.2: Recurrent Neural Network
All RNNs have infinite memory. However, the information decays exponentially with time. The LSTM cell was presented as an improvement to keep information for longer periods than the basic RNN cell [2].
3.1.1 RNN Fundamentals
An RNN computes a sequence of hidden vectors $h = (h_1, \dots, h_T)$ and an output vector sequence $y = (y_1, \dots, y_T)$ for a given input vector sequence $x = (x_1, \dots, x_T)$ by iterating the following equations from $t = 1$ to $T$ [4]:

$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
$y_t = W_{hy} h_t + b_y$

where the $W$ terms are weight matrices (e.g., $W_{xh}$ is the weight matrix between the input and hidden vectors), the $b$ terms are bias vectors (e.g., $b_h$ is the bias vector for the hidden state), and $\mathcal{H}$ is the nonlinear activation function for the hidden nodes.
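This recurrence can be transcribed directly into code. The sketch below assumes tanh as the activation $\mathcal{H}$ and already-trained weight matrices; it is illustrative only.

import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    # Iterate the RNN equations over an input sequence x_seq (list of vectors).
    h_t = np.zeros(b_h.shape[0])
    hs, ys = [], []
    for x_t in x_seq:
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_t + b_h)   # h_t = H(W_xh x_t + W_hh h_{t-1} + b_h)
        y_t = W_hy @ h_t + b_y                         # y_t = W_hy h_t + b_y
        hs.append(h_t)
        ys.append(y_t)
    return np.array(hs), np.array(ys)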
When error gradients accumulate and grow exponentially (explode) in RNNs, the network becomes unstable and unable to learn from training data. At an extreme, the values of the weights can become so large as to overflow and result in NaN values [2].
Related to the exploding gradient problem is the vanishing gradient problem, where the gradient becomes vanishingly small, effectively preventing the weights from changing their values. In the worst case, this may completely stop the neural network from further training. This was a major problem in the 1990s and much harder to solve than exploding gradients. Fortunately, it was solved through the concept of the LSTM cell by Sepp Hochreiter and Juergen Schmidhuber. Later, the GRU cell also proved efficient in training RNNs without vanishing gradients [2].
Gradient clipping, weight regularization, weight normalization (Salimans and Kingma), and truncated BPTT may reduce this problem. However, the best practice to reduce the exploding gradient problem is to use gated RNN cells, like the LSTM and GRU [2].
3.1.3 LSTM & GRULSTM network is a type of RNN, and since the RNN is a simpler system,
the intuition gained by analyzing the RNN applies to the LSTM network as well
Importantly, the canonical RNN equations, which we derive from differentialequations, serve as the starting model that stipulates a perspicuous logical pathtoward ultimately arriving at the LSTM system architecture [14]
The Long Short-Term Memory (LSTM) network was invented with the goal of addressing the vanishing gradients problem. The key insight in the LSTM design was to incorporate nonlinear, data-dependent controls into the RNN cell, which can be trained to ensure that the gradient of the objective function with respect to the state signal (the quantity directly proportional to the parameter updates computed during training by gradient descent) does not vanish. The LSTM cell can be rationalized from the canonical RNN cell by reasoning about Equation 30 and introducing changes that make the system robust and versatile [14].
Figure 3.3: Long Short Term Memory
In the RNN system, the observable readout signal of the cell is the warped version of the cell's state signal itself. A weighted copy of this warped state signal is fed back from one step to the next as part of the update signal to the cell's state. This tight coupling between the readout signal at one step and the state signal at the next step directly impacts the gradient of the objective function with respect to the state signal. This impact is compounded during the training phase, culminating in vanishing/exploding gradients [14].
LSTM networks are series of LSTM cells (Figure 3.3). They can overcome problems where a conventional RNN cannot, by modeling signals that are too long for a traditional RNN to process. The LSTM is implemented with the following functions [5]:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$
$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$
$h_t = o_t \tanh(c_t)$
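A single step of this recurrence can be sketched as follows. The peephole terms ($W_{ci}$, $W_{cf}$, $W_{co}$) are omitted for brevity, and the dictionary-of-weights layout is purely an illustrative convention.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time step; W and b hold the per-gate weight matrices and bias vectors.
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])            # input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])            # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])            # output gate
    h_t = o_t * np.tanh(c_t)                                            # readout (hidden state)
    return h_t, c_t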
A bidirectional LSTM (Figure 3.4) splits the hidden layer into two parts, a forward state sequence and a backward state sequence [5]. The iterative process is:

$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$
$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})$
$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$
A variant of the LSTM is the GRU, which:
- Combines the forget and input gates into a single update gate
- Merges the memory cell and the hidden state
Figure 3.5: The Differences between conventional RNN, LSTM, and GRU
3.2 Sequence-to-Sequence Architecture
Sequence-to-sequence models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, text summarization, and text-to-speech synthesis. In machine translation, a single neural network takes as input a source sentence X and generates its translation Y (let $X = (x_1, x_2, \dots, x_{T_x})$ and $Y = (y_1, y_2, \dots, y_{T_y})$ be two variable-length sequences, where $x_t$ and $y_t$ are the source and target symbols) [2]. During training, the system learns the conditional probability:

$P(y_1, y_2, \dots, y_{T_y} \mid x_1, x_2, \dots, x_{T_x})$

During generation, given a source sequence X, the system samples Y according to the above probability. Neural machine translation models have three components: an encoder, a decoder, and an attention mechanism. A basic architecture has two components: an encoder and a decoder (the encoder is an RNN, and the decoder is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_t$) [2].
3.2.1 Encoder-Decoder
In the encoder-decoder framework, an encoder is used to encode a variable-length sequence into a fixed-length vector representation, and a decoder is used to decode a given fixed-length vector representation back into a variable-length sequence [8] (Figure 3.6). From a probabilistic perspective, this model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence $X = (x_1, \dots, x_{T_x})$. The input and output sequences may have different lengths. The most common approach is to use an RNN that reads each symbol of an input sequence X sequentially. As it reads each symbol, the hidden state of the RNN changes according to the equation for $h_t$. After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence. The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_{t-1}$. However, both $y_t$ and $h_{t-1}$ are also conditioned on $y_{t-1}$ and on the summary c of the input sequence [2]. Hence, the hidden state of the decoder at time t is computed by

$h_t = f(h_{t-1}, y_{t-1}, c)$
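As an illustrative PyTorch sketch of this encoder-decoder pattern (the dimensions, the use of teacher forcing, and the absence of attention are simplifications, not the configuration of the systems discussed in this thesis):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, c = self.encoder(self.src_emb(src_ids))            # c summarizes the whole input sequence
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), c)   # decoder conditioned on the summary c
        return self.out(dec_out)                              # logits over the next target symbol

The encoder's final hidden state plays the role of the summary c above, and the decoder predicts each next symbol conditioned on it and on the previously emitted symbols.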