HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Tran Kim Hung
GRADUATION THESIS
SPEAKER ADAPTATION IMPROVEMENT METHODS IN SPEECH SYNTHESIS
BACHELOR OF COMPUTER SCIENCE
HO CHI MINH CITY, 2021
HO CHI MINH CITY NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
Tran Kim Hung - 18520811
GRADUATION THESIS
SPEAKER ADAPTATION IMPROVEMENT METHODS IN SPEECH SYNTHESIS
BACHELOR OF COMPUTER SCIENCE
THESIS ADVISORS:
Trinh Quoc Son, M.Sc.
Ngo Duc Thanh, Ph.D.
HO CHI MINH CITY, 2021
INFORMATION OF THE THESIS EXAMINATION COMMITTEE
The thesis examination committee was established under Decision No. 36 dated 17/1/2022 of the Rector of the University of Information Technology.
I would like to express my sincere gratitude to my advisor and instructor, Trinh Quoc Son. With the enthusiasm and support he gave me, I embarked on exploring the realm of speech synthesis and voice cloning in computer science.
My sincere appreciation also goes to the co-supervisor of this thesis, Dr. Ngo Duc Thanh from the Department of Computer Science, University of Information Technology. He took his valuable time to review my work and give me direction on my thesis. Without him, I could not have completed my research.
My final thanks go to the council of the Department of Computer Science, which gave me the chance to work on this thesis.
TABLE OF CONTENTS
ABSTRACT
CHAPTER 1 INTRODUCTION
2.3 Advantages of Neural Networks Over Hand-crafted Methods
2.3.1 Data Training Efficiency
2.3.2 Automation in Neural Speech Synthesis
2.3.3 Quality of Sound
2.4 Speaker Adaptation
2.4.1 Why Is Speaker Adaptation Important?
2.5 Existing Methods of Speaker Adaptation
2.5.1 Tuning Speaker Adaptation
2.5.2 Zero-Shot Speaker Adaptation
CHAPTER 3 BACKGROUND
3.6 How Speaker Embedding Improves Speech Synthesis
4.2.2 Modification to the Flow of Speaker Embedding
4.2.3 Add More Information to the Speaker Encoder
CHAPTER 5 EXPERIMENT
LIST OF FIGURES
Figure 1.1: Workflow of a speech synthesis system
Figure 2.1: Faber Wonderful Talking Machine [1]
Figure 2.2: Modules of a unit selection TTS system [2]
Figure 2.3: Statistical parametric speech synthesis [2]
Figure 2.4: The process of retraining another speaker
Figure 2.5: The process of speaker adaptation to a tuning-based neural speech synthesis
Figure 2.6: Zero-shot speaker adaptation pipeline [8]
Figure 2.7: The process of speaker adaptation in the zero-shot speaker adaptation method
Figure 3.1: Example synthesis of a sentence in different voices using the system. Mel spectrograms are visualized for the reference utterance used to generate the speaker embedding (left) and the corresponding synthesizer outputs (right). The text-to-spectrogram alignment [8]
Figure 3.2: Recurrent neural network
Figure 3.3: Long short-term memory
Figure 3.4: Bidirectional LSTM [5]
Figure 3.5: The differences between conventional RNN, LSTM, and GRU
Figure 3.6: Encoder-decoder architecture [2]
Figure 3.7: Alignment mechanism proposed by Bahdanau et al. [2]
Figure 3.8: Character embeddings [2]
Figure 3.9: Similarity matrix construction at training [7]
Figure 3.10: Visualization of speaker embeddings extracted from LibriSpeech utterances. Each color corresponds to a different speaker. Real and synthetic utterances appear nearby when they are from the same speaker; however, real and synthetic utterances consistently form distinct clusters
Figure 3.11: UMAP projections of utterance embeddings from randomly selected batches from the train set at different iterations of our model. Utterances from the same speaker are represented by a dot of the same color. We specifically omit passing labels to UMAP, so the clustering is entirely done by the model
Figure 3.12: Input and output of the speaker encoder
Figure 3.13: Inputs and output of the synthesizer system
Figure 3.14: Tacotron 2 with speaker embedding architecture [3]
Figure 3.15: Input and output of the vocoder
Figure 3.16: Fatchord WaveRNN architecture [3]
Figure 3.17: Three-stage training of the model [3]
Figure 4.1: Different modes of the multimodal speaker adaptive acoustic architecture. Dashed borders indicate modules with trainable parameters while bold solid borders indicate modules with immutable parameters [9]
Figure 4.2: Architecture of Deep Voice 3 [11]
Figure 4.3: Multi-speaker LDE TTS system. Encoder blocks are in orange, decoder blocks in blue, the post-net block in green, the speaker encoder block in red, and the vocoder block in yellow [10]
Figure 4.4: Our proposed architecture to pass the speaker embedding to the post-net and pre-net layers (we however did not apply the speaker embedding to the output of the post-net)
Figure 4.5: LST-TTS system [12]
Figure 4.6: Wav2vec feature extractor overview [13]
Figure 4.7: Our proposed model, concatenating the wav2vec2 speaker representation and the GE2E speaker representation
LIST OF TABLES
Table 1.1: Methods of speaker adaptation
Table 2.1: Comparison between the three speech synthesis methods [21]
Table 2.2: The result of experiments with Tacotron 2 [6]
Table 2.3: Result of the tuning model [22]
Table 2.4: Speaker similarity MOS with 95% confidence intervals [8]
Table 4.1: Our experiment on the SV2TTS system
Table 4.2: Cross-dataset evaluation for unseen speakers [8]
Table 4.3: Performance between different training loss functions [7]
Table 4.4: Performance using speaker encoders (SEs) on different datasets [8]
Table 5.1: Criteria for different evaluation measurements
Table 5.2: Experiment result of the proposed model trained on di…
Table 5.3: Comparison between architectures
Table 5.4: Cost of training and tuning
LIST OF ABBREVIATIONS
LSTM    Long Short-Term Memory
TTS     Text-to-Speech
RNN     Recurrent Neural Network
MOS     Mean Opinion Score
NMOS    Naturalness Mean Opinion Score
SMOS    Similarity Mean Opinion Score
BLSTM   Bidirectional LSTM
CBHG    1-D Convolution Bank + Highway network + Bidirectional GRU
GRU     Gated Recurrent Unit
STFT    Short-Time Fourier Transform
DNN     Deep Neural Network
SV2TTS  Speaker Verification to Text-to-Speech
GPU     Graphics Processing Unit
CPU     Central Processing Unit
ITU     International Telecommunication Union
EER     Equal Error Rate
ABSTRACT
Text-to-Speech (TTS) synthesis is the automatic conversion of written content to spoken language. TTS synthesis plays a critical role in natural, organic interaction between humans and computers. Although communication between man and machine can be satisfied by commands and text appearing on the screen, applications such as Siri or Cortana are prime examples of what communication between man and machine can become with the help of TTS.
Classic approaches to speech synthesis are limited to a definitive set of data and heavy loads of hand-crafted procedures (rule-based algorithms used to perform linguistic analysis). In this epoch of neural networks, the quality of sound and the flexibility of training a TTS model have improved rapidly throughout the years of development. However, some questions are still open for improvement. One of many problems is the issue of creating an appropriate model for speech production with minimal resources and time. Our objective is to tackle this topic and investigate what we can contribute to it. Many pieces of research have been conducted to address this concern; it is known as the problem of speaker adaptation.
CHAPTER 1 INTRODUCTION
Speech synthesis is the process of artificially generating human speech from text. It aims to synthesize intelligible and natural audio indistinguishable from human-recorded audio. A speech synthesizer is a computer system designed for this particular purpose. Speech synthesis has applications that can be useful for people with disabilities and dyslexia. One of the most iconic application cases is Dr. Stephen Hawking.
The TTS system is the automatic converter of written to spoken language. The input is text, and its mission is to generate a speech waveform that corresponds to the original text (Figure 1.1).
Figure 1.1: Workflow of a speech synthesis system
1.1 Problem Statement
In speech synthesis, it is essential to produce a system that can generate sound identical to the speaker. It aims to replicate the speaker in terms of naturalness, style, and similarity in tempo. However, it is also crucial to train and produce the system model quickly and efficiently to compete with other rivals in this growing market of artificial voice cloning.
There are many challenges concerning this topic, and they have been gatekeeping the breakthrough in this field. Those problems include:
- Fear of violation of the freedom of speech: With the improvement of speech synthesis naturalness throughout the years of development, it is natural for the public to imagine a dystopia where their voices are taken advantage of for malicious intents, and the closer the field gets to its breakthrough, the more this fear becomes a reality. In response to the fear, regulations are set to restrict the collection of voice data, which makes it harder to train models. This fear has proven to be a hindrance to the development of the field of speech synthesis. To generate a system model that can achieve the naturalness of a desired human speaker, the researcher needs an extravagant amount of high-quality data. Collecting high-quality speech data in large quantities is therefore problematic. With the fear of losing the freedom of speech as the base, retaliation from the people is inevitable. There are also laws authorized to protect people's voices from malicious intentions. To solve this case, we not only need to strategize a proposal to model a system with the very little data we have, but the system must also provide output with decent quality.
- Similar to the data-hungry problem, generating a system that reaches the minimum requirement of a human-like product requires massive computing power. However, this work was composed during a global chip shortage crisis, which made it impossible to own a system capable of generating an appropriate model for speech production in a short amount of time. Therefore, we need to find a solution to run the algorithm on a low-computing configuration.
1.2 Scope and Goals
1.2.1 Scope
The scope of our work is to survey existing work related to the problem of saving time on producing a model for speech synthesis.
During the endeavor to research this subject, we discovered what is known as speaker adaptation. It can be summarized as a set of techniques that allow researchers to generate speech for a specific speaker with very minimal data and computing power.
There are many related works on the study of speaker adaptation (Table 1.1). The details will be elaborated further in Section 2.5 of this thesis.
Table 1.1: Methods of speaker adaptation (header: Adaptation paradigm | Methods)
1.2.2 Goals
The first goal of this work is a legacy continuation of [8], which is to build a TTS system that can generate natural speech for various speakers in a data-efficient manner. We build on an existing zero-shot learning setting, where a few seconds of un-transcribed reference audio from a target speaker is used to synthesize new speech in that speaker's voice, without updating any model parameters. Such systems have accessibility applications, such as restoring the ability to communicate naturally to users who have lost their voice and cannot provide many new training examples. They could also enable new applications, such as transferring a voice across languages for more natural speech-to-speech translation, or generating realistic speech from text in low-resource settings. However, it is also important to note the potential for misuse of this technology, for example, replicating someone's voice without their consent.
The second goal of our work is to optimize the adaptation process and output speech quality so that it is indistinguishable from the original human voice. We also want to accelerate the production of an appropriate model for the speech-generating system, which means we want the model training to be faster to save time.
1.3 Contributions
In this thesis, we survey research related to the field of speaker adaptation. We aim to determine which method would bring the most noticeable improvement.
We also combine some of the proposed methods. The justification for fusing them is to observe how they complement each other. By orchestrating them together, we can inspect what kind of impact they have on one another.
The results of our work show that the changes we propose to address the problem can have a notable impact on the big picture.
CHAPTER 2 RELATED WORKS
2.1 History of TTS
It has been more than a century since the American scientist Joseph Henry came across the "Wonderful Talking Machine" at an exhibition in Philadelphia on December 20th, 1845 (Figure 2.1). The machine was the work of a mechanic from Freiburg named Joseph Faber.
Figure 2.1: Faber Wonderful Talking Machine [1]
The invention of Mr. Faber of Freiburg can be considered the oldest recorded predecessor of speech synthesis. Since then, speech synthesis has come a long way in digitalizing the art of TTS on computers.
In the early part of the modern history of the field, concatenative and statistical parametric methods of speech synthesis were the pillars of the domain and were presented as the methods to solve the speech synthesis problem. However, the procedures behind their operation were considerably complicated. The processes required extensive research on the characteristics of speech, bound into a set of rules (which can be called hand-crafted features). Last but not least, configurations like this were necessary to achieve a suitable TTS system, which was fixated on a single speaker [1].
When neural networks became more available and offered more automation than their predecessors, more models based on neural networks were released throughout the years. Notable are the releases of Tacotron [22] and Tacotron 2 [6].
2.2 Evaluation Methods
To evaluate the efficiency of speech synthesis, we will use the Mean Opinion Score (MOS) to evaluate both the similarity and naturalness of the end product of speech production. The MOS is a score from 1 to 5 on a scale of bad, poor, average, good, and very good.
Depending on what one wants to test, different questions should be asked. For instance: "Please listen to the sample and judge, using a five-point scale, its quality/naturalness/similarity to a certain speaker."
ITU 1994 [32] recommends MOS with seven different questions ranging from "Listening Effort" to "Overall Quality". See also [31] for a description of an expanded MOS test in which questions are asked about the following: Listening Effort, Comprehension Problems, Speech Sound Articulation, Precision, Voice Pleasantness, Voice Naturalness, Human-like Voice, Voice Quality, Emphasis, Rhythm, Intonation, Trust, Confidence, Enthusiasm, Persuasiveness. The criteria that affect the result of this work will be defined in more detail in Section 5.3.
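For illustration, the following is a minimal Python sketch of how per-listener ratings could be aggregated into a MOS with an approximate 95% confidence interval, matching the format of the scores reported later in this thesis. The listener ratings shown are hypothetical, and the normal-approximation interval is only one possible convention, not necessarily the exact procedure used in the cited works.

import math
from statistics import mean, stdev

def mos_with_ci(ratings, z=1.96):
    # Mean Opinion Score with an approximate 95% confidence interval
    # (z = 1.96 for a normal approximation of the standard error).
    m = mean(ratings)
    ci = z * stdev(ratings) / math.sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return m, ci

naturalness_scores = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]   # hypothetical listener ratings (1-5)
print("NMOS = %.2f +/- %.2f" % mos_with_ci(naturalness_scores))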
2.3 Advantages of Neural Networks Over Hand-crafted Methods
The motivation for turning to neural network approaches is their ability to model the relation between the raw text and the final output without the need for hand-crafted feature engineering.
In exemplar-based speech synthesis (Figure 2.2), the system simply stores the speech corpus itself. The stored speech data are labeled so that appropriate parts of them can be found, extracted, and then concatenated during the synthesis phase. The most prominent exemplar-based technique, and one of the dominant approaches to speech synthesis, is unit selection (Figure 2.2). In this technique, units from a large speech database are selected according to how well they match a specification and how well they join together. The specification and the units are completely described by a structure, which can be any mixture of linguistic and acoustic features. The quality of the output derives directly from the quality of the recordings, and it appears that the larger the database, the better the coverage. Commercial systems have exploited these techniques to bring about a new level of synthetic speech. However, these techniques limit the output speech to the same style as the original recording. In addition, recording a massive database with variations is very difficult and costly [2].
Figure 2.3: Statistical parametric speech synthesis [2]
In model-based speech synthesis (Figure 2.3), the system fits a model to the speech corpus (during the training phase) and stores this model. Due to the presence of noise and unpredictable factors in speech training data, the models are usually statistical. Statistical parametric speech synthesis has also grown in popularity, in contrast to the selection of actual instances of speech. These models do not use stored exemplars; they describe the parameters of models using statistics (e.g., means and variances of probability density functions) which capture the distribution of parameter values found in the training data (Figure 2.3). The quality of speech produced by the initial statistical parametric systems was significantly lower than that of unit-selection systems [2].
2.3.1 Data Training Efficiency
In the era of deep learning, end-to-end neural networks, which are state-of-the-art for speech synthesis tasks, have become highly competitive with conventional TTS systems. An end-to-end TTS system can be trained on <text, audio> pairs without hand-crafted feature engineering. All the modules are trained together to optimize a global performance criterion, without manual integration of separately trained modules [2].
2.3.2 Automation in Neural Speech Synthesis
The two stages of a common TTS system are text analysis (text into an intermediate representation) and waveform synthesis (from the intermediate representation into a waveform). These stages are known as the front-end and the back-end respectively.
The front-end of the TTS (from input text to linguistic specification): text processing sees the text as the input to the synthesizer and tries to rewrite any "non-standard" text as proper "linguistic" text. It takes arbitrary text and performs the tasks of classifying the written signal with respect to its semiotic type (natural language or other), decoding the written signal into an unambiguous, structured representation, and, in the case of non-natural language, verbalizing this representation to generate words. The tasks that can be included in the front-end [1] are the following:
1 Pre-processing: possible identification of text genre, character encoding
issues, possible multilingual issues
2 Sentence splitting: segmentation of the document into a list of sentences
3 Tokenization: segmentation of each sentence into several tokens
4 Text analysis:
a) Semiotic classification: classification of each token as one of the semiotic classes: natural language, abbreviation, quantity, date, time, etc.
b) Decoding/parsing: finding the underlying identities of tokens
using a decoder or parser that is specific to the semiotic class
c) Verbalization: Conversion of non-natural language semiotic
classes into words that can be spoken
5 Homograph resolution: Determination of the correct underlying word for
any ambiguous natural language token
6 Parsing: Assigning a syntactic structure to the sentence
7 Prosody prediction: Attempting to predict a prosodic form for each
utterance from the text
When performing TTS, grapheme and phoneme analysis are crucial [19]. First, a grapheme form of the text input is found and converted to a phoneme form for synthesis. This approach is increasingly practical in languages where the grapheme-phoneme correspondence is relatively direct; finding the graphemes often means that phoneme tagging can be accomplished with high confidence. As a result, pronunciation can be accurately determined.
With an end-to-end system [6], [7], [18], we only need a small part of the above tasks, namely simple text decoding (word tokenization and normalization), because the characteristics of speech are expected to be learned automatically through the neural network. Word tokenization and normalization are generally done by cascades of simple regular expression substitutions or finite automata, although recent advancements in learned text normalization may render this unnecessary in the future. This makes for a straightforward and more automated approach to text analysis.
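As a toy illustration of such a regex cascade, the Python sketch below performs minimal word tokenization and normalization. The abbreviation table and digit verbalization are hypothetical and far simpler than a real production front-end.

import re

ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "etc.": "et cetera"}
DIGITS = "zero one two three four five six seven eight nine".split()

def normalize(text):
    # Lowercase, expand a few known abbreviations, verbalize digits, then tokenize.
    text = text.lower()
    for abbrev, expansion in ABBREVIATIONS.items():
        text = text.replace(abbrev, expansion)
    # Verbalize digits one by one (a placeholder for proper number expansion).
    text = re.sub(r"\d", lambda m: " " + DIGITS[int(m.group())] + " ", text)
    # Keep only letter/apostrophe runs as tokens.
    return re.findall(r"[a-z']+", text)

print(normalize("Dr. Smith bought 2 apples."))
# -> ['doctor', 'smith', 'bought', 'two', 'apples']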
2.3.3 Quality of Sound
The work of [4] also suggested that letting the linguistic encoder process text into vector representations, learning the analysis automatically from data, seems to improve naturalness. It is a crucial ingredient in allowing the attention mechanism to function accurately. Conditioning the overall system on previously generated acoustic representations and self-evaluating weights to match the ground truth during training leads to significant naturalness gains.
Improvements can be verified by looking at past works related to end-to-end systems designed by many researchers in the field. In the research of [21], in an attempt to inspect the naturalness of the speech generated by Tacotron, it was put to compete with other conventional approaches at the time. Along with the parametric and concatenative methods of speech synthesis, the experiment result is presented in Table 2.1. While the naturalness MOS of the Tacotron samples lagged behind the concatenative method, it surpassed the parametric method, promising a future for speech synthesis with neural networks.
Method          Mean Opinion Score
Tacotron        3.82 ± 0.085
Parametric      3.69 ± 0.109
Concatenative   4.09 ± 0.119
Table 2.1: Comparison between the three speech synthesis methods [21]
Succeeding Tacotron [21] was the work of Tacotron 2 [6]. By implementing an extra module known as WaveNet, the result of Tacotron 2 (Table 2.2) surpassed all its predecessors and became state-of-the-art in the domain of neural speech synthesis.
Table 2.2: The result of experiments with Tacotron 2 [6]
2.4 Speaker Adaptation
In the domain of speech recognition, speaker adaptation refers to the range of techniques whereby a speech recognition system is adapted to the acoustic features of a specific user using a small sample of utterances from that user [16]. In speech production, when systems are focused on a single speaker, adaptation is understood as switching the current speaker the system is producing to a new speaker.
2.4.1 Why Is Speaker Adaptation Important?
The more "classic" end-to-end speech synthesizers, such as [6] and [21], are models that train on a definitive set of data. The result will still only reproduce the speaker that resided in the original (single-speaker) dataset (Figure 2.4).
Figure 2.4: The process of retraining another speaker
The problem with this approach is that it can consume a lot of computing power to train. The model needs time to match the designer's desire. It also needs to be trained on a single-speaker dataset.
In the research of [1], [6], and [21], a model that can engineer such a result needs to be trained for over 50,000 steps to produce good-quality sound. It is time-consuming to generate such a model, so it is impossible to create a decent model in a short amount of time. In a setting where a GPU is not available, speeding up the training is not an option when we rely on the CPU to train the model. It is paramount that we save time and resources (both in hardware and data) on reproducing an appropriate voice-generating model.
2.5 Existing Methods of Speaker Adaptation
2.5.1 Tuning Speaker Adaptation
It is common to adapt a well-trained, general acoustic model to new users or environmental conditions (Figure 2.5). Instead of retraining the whole system to produce a new model, we can tune a pre-trained model on new data.
Figure 2.5: The process of speaker adaptation to a tuning-based neural speech
synthesis
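As a minimal sketch of this tuning process (not the actual training code of the systems cited here), adaptation could be implemented along the following lines in PyTorch. The checkpoint name, the `encoder` attribute, and the forward signature from character ids to a mel spectrogram are hypothetical assumptions.

import torch
import torch.nn.functional as F

tts = torch.load("pretrained_tts.pt")            # assumed checkpoint of the base acoustic model
for p in tts.encoder.parameters():               # optionally freeze the text encoder
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in tts.parameters() if p.requires_grad], lr=1e-4)

def adaptation_step(char_ids, mel_target):
    # One gradient step on a <text, audio> pair from the new speaker.
    optimizer.zero_grad()
    mel_pred = tts(char_ids)
    loss = F.mse_loss(mel_pred, mel_target)
    loss.backward()
    optimizer.step()
    return loss.item()

Only the unfrozen weights move, so the adapted model keeps most of the knowledge learned from the original large corpus.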
Research such as [22] has shown that tuning the weights of the model can efficiently result in much better naturalness (Table 2.3).
Dataset       Sample tests   MOS average score
BigCorpus     44             3.47
SmallCorpus   80             4.13
Table 2.3: Result of the tuning model [22]
However, the tuning method needs an improvement that can reduce the amount of additional data and the number of training steps.
As [22] suggested, the number of training steps needed for such a model to adapt well to the new speaker is 35,000. If the system relies on a CPU alone, 35,000 is not a small number of steps. We then proceeded to other research on tuning models like [2]. However, the result may be faulty, because the quality of the data adapted with fewer steps proved unstable. The practical aspect of the tuning method was suppressed in the original Tacotron 2 by the heavy demand on the amount of data and machine power.
Speaker adaptation on a single seq2seq model via tuning needs a solution to reduce the amount of data and work fed into it. Zero-shot speaker adaptation then provides an answer to the problem of the massive amount of work on the model.
2.5.2 Zero-Shot Speaker Adaptation
When referring to the technique of zero-shot speaker adaptation, we are referring to a procedure that requires only a limited voice example (a single speech sample with a 5-second minimum length). Subsequently, the system can generate a waveform with vocal features similar to the example voice.
In the zero-shot speaker adaptation approach, a speaker encoder is implemented in the model; its task is to extract a d-vector from a speech sample. The d-vector is supposed to be a representation of the essence of the speaker's voice. After the d-vector (sometimes called a speaker embedding) is produced, it is fed to a synthesizer (Figure 2.6).
Figure 2.6: Zero-Shot Speaker Adaptation pipeline [8]
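As a minimal sketch of this pipeline (the module interfaces are hypothetical, not the exact SV2TTS API), the adaptation step could look as follows; note that no model parameters are updated at any point.

import torch

def clone_voice(speaker_encoder, synthesizer, vocoder, reference_wav, char_ids):
    # Zero-shot adaptation: one reference waveform, one text, no retraining.
    with torch.no_grad():
        d_vector = speaker_encoder(reference_wav)      # e.g. a 256-dim embedding from >= 5 s of audio
        d_vector = d_vector / d_vector.norm()          # d-vectors are typically L2-normalized
        mel = synthesizer(char_ids, d_vector)          # spectrogram carrying the target voice
        return vocoder(mel)                            # waveform in the target speaker's voice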
Another variant of the architecture is Deep Voice 3 [11]. The model adds a speaker embedding to multiple modules in the network to train a multi-speaker TTS model for thousands of speakers. Deep Voice 2 [17] used a jointly trained speaker encoder network to extract a speaker embedding of unseen speakers, while Jia et al. [8] used a separate speaker verification network. The VoiceLoop model jointly trains a speaker embedding with the acoustic model and can adapt to unseen speakers by using both the speech and transcriptions of the target speakers [9]. Nachmani et al. replaced the jointly trained speaker embedding of VoiceLoop with a speaker embedding obtained solely from acoustic features so that the model could adapt using un-transcribed speech [9].
Figure 2.7: The process of speaker adaptation in the zero-shot speaker adaptation method
The architecture offers the end-user much more versatile adapting power (Figure 2.7). The model needs only one sample of a voice to perform adaptation. While the tuning method requires the user to retrain the model, zero-shot speaker adaptation does not. This helps the model save time and resources on speaker adaptation.
There are many reasons for performing zero-shot speaker adaptation instead of conventional training, for instance, reducing the speaker footprint and quickly adapting to new speakers [9]. But the most important reason is its potential to handle unrefined adaptation data, whether in insufficient quantity or of unreliable quality, such as noisy speech, an incorrect transcript, or no transcript at all [9].
System           Speaker Set        VCTK           LibriSpeech
Ground truth     Same speaker       4.67 ± 0.04    4.33 ± 0.08
Ground truth     Same gender        2.25 ± 0.07    1.83 ± 0.07
Ground truth     Different gender   1.15 ± 0.04    1.04 ± 0.03
Embedding table  Seen               4.17 ± 0.06    3.70 ± 0.08
Proposed model   Seen               4.22 ± 0.06    3.28 ± 0.08
Proposed model   Unseen             3.28 ± 0.07    3.03 ± 0.09
Table 2.4: Speaker similarity MOS with 95% confidence intervals [8]
However, the architecture of zero-shot adaptation poses another problem. There is a noticeable gap, in terms of similarity to their respective ground truth, between speakers present in the original dataset and speakers who were not. To improve zero-shot speaker adaptation, we must minimize the difference between seen and unseen speakers of the data used to train the model. Looking at Table 2.4, we can see the difference when comparing the similarity between the "Seen" and "Unseen" speakers in both datasets used for the experiment.
CHAPTER 3 BACKGROUND
In this chapter, we present the architecture that motivated our research. It acts as the foundation for our team to dive into the speaker adaptation topic and gives us the goal of seeking methods to improve it. Many fundamental concepts will be explained to understand the overall architecture. Basic concepts, such as RNNs, LSTMs, sequence-to-sequence models, attention, and how the speaker embedding impacts the problem, will be presented throughout this chapter.
The core architecture of this work is zero-shot speaker adaptation, as introduced in Section 2.5.2 (Figure 2.6). It is a neural speech synthesis network that receives two inputs. The first input is a waveform example (5 seconds duration at minimum), and the second input is text. The end product of the process is a waveform that corresponds to the text and possesses a vocal range that resembles the input sound example (Figure 3.1). The particular system we have chosen for this thesis is the SV2TTS system, implemented by GitHub user CorentinJ [3].
Figure 3.1: Example synthesis of a sentence in different voices using the system. Mel spectrograms are visualized for the reference utterance used to generate the speaker embedding (left) and the corresponding synthesizer outputs (right). The text-to-spectrogram alignment [8]
We have chosen this system because it was presented as a compact version of SV2TTS and is very easy to use. Throughout the experiments, we also noticed that the system's activities are very gentle on the longevity of the hardware.
3.1 RNN
A recurrent neural network is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence (Figure 3.2). This allows it to exhibit temporal dynamic behavior for a time series. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. RNNs are relatively old, like many other deep learning algorithms. They were initially created in the 1980s but could only show their real potential in recent years, because of the increase in available computational power, the massive amounts of data that we have nowadays, and the invention of the LSTM in the 1990s. Because of their internal memory, RNNs can remember important things about the input they received, which enables them to be very precise in predicting what is coming next. This is the reason why they are the preferred algorithm for data like time series, speech, text, financial data, audio, video, weather, etc. In an RNN, the information cycles through a loop. When it makes a decision, it takes into consideration the current input and also what it has learned from the inputs it received previously.
Figure 3.2: Recurrent Neural Network
All RNNs have infinite memory. However, the information decays exponentially with time. The LSTM cell was presented as an improvement to keep information for longer periods than the basic RNN cell [2].
3.1.1 RNN Fundamentals
An RNN computes a sequence of hidden vectors $h = (h_1, \dots, h_T)$ and an output vector sequence $y = (y_1, \dots, y_T)$ for a given input vector sequence $x = (x_1, \dots, x_T)$ by iterating the following equations from $t = 1$ to $T$ [4]:

$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
$y_t = W_{hy} h_t + b_y$

where the $W$ terms are weight matrices (e.g., $W_{xh}$ is the weight matrix between the input and hidden vectors), the $b$ terms are bias vectors (e.g., $b_h$ is the bias vector for the hidden state), and $\mathcal{H}$ is the nonlinear activation function for the hidden nodes.
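This recurrence can be transcribed directly into code. The sketch below assumes tanh as the activation $\mathcal{H}$ and already-trained weight matrices; it is illustrative only.

import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    # Iterate the RNN equations over an input sequence x_seq (list of vectors).
    h_t = np.zeros(b_h.shape[0])
    hs, ys = [], []
    for x_t in x_seq:
        h_t = np.tanh(W_xh @ x_t + W_hh @ h_t + b_h)   # h_t = H(W_xh x_t + W_hh h_{t-1} + b_h)
        y_t = W_hy @ h_t + b_y                         # y_t = W_hy h_t + b_y
        hs.append(h_t)
        ys.append(y_t)
    return np.array(hs), np.array(ys)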
When error gradients accumulate and grow exponentially (explode) in RNNs, the network becomes unstable and unable to learn from training data. At an extreme, the values of the weights can become so large as to overflow and result in NaN values [2].
Related to the exploding gradient problem is the vanishing gradient problem, where the gradient becomes vanishingly small, effectively preventing the weights from changing their values. In the worst case, this may completely stop the neural network from further training. This was a major problem in the 1990s and much harder to solve than exploding gradients. Fortunately, it was solved through the concept of the LSTM cell by Sepp Hochreiter and Juergen Schmidhuber. Later, the GRU cell also proved efficient in training RNNs without vanishing gradients [2].
Gradient clipping, weight regularization, weight normalization (Salimans and Kingma), and truncated BPTT may reduce this problem. However, the best practice to reduce the exploding gradient problem is to use gated RNN cells, like the LSTM and GRU [2].
3.1.3 LSTM & GRULSTM network is a type of RNN, and since the RNN is a simpler system,
the intuition gained by analyzing the RNN applies to the LSTM network as well
Importantly, the canonical RNN equations, which we derive from differentialequations, serve as the starting model that stipulates a perspicuous logical pathtoward ultimately arriving at the LSTM system architecture [14]
The Long Short-Term Memory (LSTM) network was invented with the goal of addressing the vanishing gradients problem. The key insight in the LSTM design was to incorporate nonlinear, data-dependent controls into the RNN cell, which can be trained to ensure that the gradient of the objective function with respect to the state signal (the quantity directly proportional to the parameter updates computed during training by gradient descent) does not vanish. The LSTM cell can be rationalized from the canonical RNN cell by reasoning about Equation 30 and introducing changes that make the system robust and versatile [14].
Figure 3.3: Long Short Term Memory
In the RNN system, the observable readout signal of the cell is the warped version of the cell's state signal itself. A weighted copy of this warped state signal is fed back from one step to the next as part of the update signal to the cell's state. This tight coupling between the readout signal at one step and the state signal at the next step directly impacts the gradient of the objective function with respect to the state signal. This impact is compounded during the training phase, culminating in vanishing/exploding gradients [14].
LSTM networks are series of LSTM cells (Figure 3.3). They can overcome problems where a conventional RNN cannot, by modeling signals that are too long for a traditional RNN to process. The LSTM is implemented with the following functions [5]:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$
$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$
$h_t = o_t \tanh(c_t)$
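A single step of this recurrence can be sketched as follows. The peephole terms ($W_{ci}$, $W_{cf}$, $W_{co}$) are omitted for brevity, and the dictionary-of-weights layout is purely an illustrative convention.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One LSTM time step; W and b hold the per-gate weight matrices and bias vectors.
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])            # input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])            # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])            # output gate
    h_t = o_t * np.tanh(c_t)                                            # readout (hidden state)
    return h_t, c_t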
A bidirectional LSTM (Figure 3.4) splits the hidden layer into two parts, a forward state sequence and a backward state sequence [5]. The iterative process is:

$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$
$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})$
$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y$
A variant of the LSTM is the GRU, which:
- Combines the forget and input gates into a single update gate
- Merges the memory cell and the hidden state
Figure 3.5: The Differences between conventional RNN, LSTM, and GRU
3.2 Sequence-to-Sequence Architecture
Sequence-to-sequence models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, text summarization, and text-to-speech synthesis. In machine translation, a single neural network takes as input a source sentence X and generates its translation Y (let $X = (x_1, x_2, \dots, x_{T_x})$ and $Y = (y_1, y_2, \dots, y_{T_y})$ be two variable-length sequences, where $x_t$ and $y_t$ are the source and target symbols) [2]. During training, the system learns the conditional probability:

$P(y_1, y_2, \dots, y_{T_y} \mid x_1, x_2, \dots, x_{T_x})$

During generation, given a source sequence X, the system samples Y according to the above probability. Neural machine translation models have three components: an encoder, a decoder, and an attention mechanism. A basic architecture has two components: an encoder and a decoder (the encoder is an RNN, and the decoder is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_t$) [2].
3.2.1 Encoder-Decoder
In the encoder-decoder framework, an encoder is used to encode a variable-length sequence into a fixed-length vector representation, and a decoder is used to decode a given fixed-length vector representation back into a variable-length sequence [8] (Figure 3.6). From a probabilistic perspective, this model is a general method to learn the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence $X = (x_1, \dots, x_{T_x})$. The input and output sequences may have different lengths. The most common approach is to use an RNN that reads each symbol of an input sequence X sequentially. As it reads each symbol, the hidden state of the RNN changes according to the equation for $h_t$. After reading the end of the sequence (marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence. The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol $y_t$ given the hidden state $h_{t-1}$. However, both $y_t$ and $h_{t-1}$ are also conditioned on $y_{t-1}$ and on the summary c of the input sequence [2]. Hence, the hidden state of the decoder at time t is computed by

$h_t = f(h_{t-1}, y_{t-1}, c)$
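As an illustrative PyTorch sketch of this encoder-decoder pattern (the dimensions, the use of teacher forcing, and the absence of attention are simplifications, not the configuration of the systems discussed in this thesis):

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, c = self.encoder(self.src_emb(src_ids))            # c summarizes the whole input sequence
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), c)   # decoder conditioned on the summary c
        return self.out(dec_out)                              # logits over the next target symbol

The encoder's final hidden state plays the role of the summary c above, and the decoder predicts each next symbol conditioned on it and on the previously emitted symbols.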