Vietnamese Speech Recognition and Synthesis in Embedded System Using T-Engine
Trinh Van Loan, La The Vinh
Department of Computer Engineering, Faculty of Information Technology, Hanoi University of Technology 1A DaiCoViet, Hanoi, Vietnam.
Abstract – In Vietnam, research in speech recognition and synthesis has started in recent years. Together with the developing trend of human-computer interaction systems using speech, optimizing speech recognition and synthesis modules in both speed and quality is an important problem, in order to combine the two modules in one interactive product. Based on well-known methods for the recognition and synthesis problems, we carry out experiments and enhancements to improve both the speed and the quality of the speech engines. Finally, we demonstrate a human-computer interaction software in the T-Engine embedded system.
I INTRODUCTION
In this paper, we are concerned with the combination of speech recognition and synthesis engines and their implementation in the T-Engine embedded system.
Based on previous research in Vietnamese speech recognition and synthesis, we propose some enhancements to improve the synthetic speech quality. Besides, the use of system resources (memory, CPU, storage) in the implementation is a considerable problem, especially for an embedded system such as T-Engine in our case.
The paper is organized as follows. In Section 2, a short introduction of T-Engine is given; a method proposed for speech recognition in T-Engine is provided in Section 3, followed by the Vietnamese speech synthesis method in Section 4. The two last sections present an application and concluding remarks.
II T-ENGINE INTRODUCTION
The T-Engine is a project to develop a standardized, open, real-time computing system and development environment. T-Engine standardizes the hardware (the T-Engine board) as well as the real-time operating system (T-Kernel). The hardware includes:
CPU with built-in MMU
Fig.1 T-Engine layout
III PROPOSED METHOD FOR SPEECH RECOGNITION IN T-ENGINE
Fig.2 Speech recognition in T-Engine
The UDA1342 audio codec in T-Engine provides a minimal sampling frequency (SF) of 44100 Hz. This SF is not really necessary for our recognition, while it increases the computation considerably. So, in order to improve the recognition speed, we need a downsampling module with a factor of four for pre-processing of the speech signal before the feature extraction phase.
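The downsampling step can be sketched as follows. This is an illustrative numpy implementation, not the authors' code: a simple windowed-sinc anti-aliasing low-pass filter followed by decimation by four, taking 44100 Hz input to 11025 Hz.

```python
import numpy as np

def downsample_by_4(x, num_taps=101):
    """Anti-alias low-pass filter, then keep every 4th sample.

    A windowed-sinc FIR with cutoff at 1/8 of the original sampling
    rate (the new Nyquist frequency) suppresses aliasing before
    decimation from 44100 Hz to 11025 Hz.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    cutoff = 0.125  # as a fraction of the original sampling rate
    h = 2 * cutoff * np.sinc(2 * cutoff * n)  # ideal low-pass impulse response
    h *= np.hamming(num_taps)                 # window to limit ripple
    h /= h.sum()                              # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::4]

# one second at 44100 Hz in -> 11025 samples out
x = np.random.randn(44100)
y = downsample_by_4(x)
print(len(y))  # 11025
```

In practice the number of taps trades filter sharpness against the per-sample cost on the embedded CPU.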
In Figure 2, the feature extraction module and the recognition model are the most important. In our system we use the MFCC features of the speech signal, since they have been proved to be good features for speech. To form a feature vector, we first divide the signal into frames; then for each frame a feature vector is calculated, including 13 MFCC values together with their first and second derivatives. Assume that x[0..L-1] is the speech signal; the k-th frame of the speech is constructed as:
s[n] = x[k*N + n] for n = 0..K-1
where K is the frame length and N is the shift length of frames. From the 13 MFCC values m[0..12], the first derivative m1 and the second derivative m2 are computed, giving a 39-dimensional feature vector. In the training phase, feature vectors are used to adjust the HMM parameters, such as the number of Gaussian mixtures, the transition probability matrix, and the observation probability matrix, for the best fit of the models to the input data. Table I illustrates the experimental results of our recognition system with MFCC features and hidden Markov models with twenty Gaussian mixtures and six left-to-right states.
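A minimal sketch of the framing and derivative computation described above. The frame lengths and the central-difference form of the derivative are illustrative choices, since the paper's exact derivative formulas are not reproduced here:

```python
import numpy as np

def frame_signal(x, K, N):
    """Split signal x into frames of length K with frame shift N,
    following s_k[n] = x[k*N + n], n = 0..K-1."""
    num_frames = 1 + (len(x) - K) // N
    return np.stack([x[k * N : k * N + K] for k in range(num_frames)])

def deltas(c):
    """First-order time derivative of a per-frame coefficient matrix
    (frames x 13 MFCCs here), via a simple central difference.
    Applying it twice gives the second derivative."""
    d = np.empty_like(c)
    d[1:-1] = (c[2:] - c[:-2]) / 2.0  # central difference
    d[0], d[-1] = d[1], d[-2]         # replicate at the edges
    return d

def feature_vectors(mfcc):
    """13 MFCCs + first + second derivative -> 39-dimensional vectors."""
    d1 = deltas(mfcc)
    d2 = deltas(d1)
    return np.hstack([mfcc, d1, d2])
```

For example, `frame_signal(x, K=256, N=128)` yields half-overlapping frames; the MFCC computation itself (filter bank and DCT) is omitted for brevity.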
Table I
Speech recognition result
Note that, in our recognition engine, we separate the training phase from the recognizing phase in order to reduce the system resources used in T-Engine. The training phase is implemented on a PC with speech data from T-Engine; only the recognition module is implemented in T-Engine.
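One way to organize this PC-to-device hand-off is to export the trained parameters to a flat binary file that the embedded recognizer can load without any training framework. The file layout below is purely hypothetical, for illustration:

```python
import numpy as np

def export_hmm(path, trans, means, variances, weights):
    """Pack trained HMM parameters into one flat float32 file.
    Hypothetical layout: 4 header values (num_states, num_mixtures,
    feature_dim, reserved) followed by the arrays in a fixed order."""
    S, M, D = means.shape  # states x mixtures x feature dimension
    header = np.array([S, M, D, 0], dtype=np.float32)
    blob = np.concatenate([header,
                           trans.astype(np.float32).ravel(),
                           means.astype(np.float32).ravel(),
                           variances.astype(np.float32).ravel(),
                           weights.astype(np.float32).ravel()])
    blob.tofile(path)

def load_hmm(path):
    """Read the parameters back in the same fixed order."""
    raw = np.fromfile(path, dtype=np.float32)
    S, M, D = int(raw[0]), int(raw[1]), int(raw[2])
    i = 4
    trans = raw[i:i + S * S].reshape(S, S); i += S * S
    means = raw[i:i + S * M * D].reshape(S, M, D); i += S * M * D
    variances = raw[i:i + S * M * D].reshape(S, M, D); i += S * M * D
    weights = raw[i:i + S * M].reshape(S, M)
    return trans, means, variances, weights
```

A fixed binary layout keeps the on-device loader to a few dozen lines and avoids pulling a model-file parser into the embedded build.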
IV VIETNAMESE SPEECH SYNTHESIS IN T-ENGINE
Previous research in Vietnamese speech synthesis has indicated that PSOLA is an effective method to synthesize speech, based on the concatenation of diphones with amplitude and tone balancing. PSOLA is not only a good-quality method but also a speed-optimal one. Hence, PSOLA is very suitable for implementation in embedded systems; this is the reason why we use it in our speech-based human-machine interaction product using T-Engine.
However, Vietnamese has some specific characteristics that make speech synthesis a little different than in other languages. The following are particular traits to consider in a Vietnamese TTS system: Vietnamese is a monosyllabic and tonal language; there are six tones, corresponding to different varying rules of the fundamental frequency (f0) of speech. Because of these features, there are two most common ways of synthesizing Vietnamese tones.
The first method is to change f0 to obtain the corresponding tone. This way reduces the size of the speech diphone data and the complexity of f0 balancing considerably, but the quality of the speech is not very good, especially with the "~" and "?" tones, as in "bão" and "bảo", because of the very complicated variation of amplitude together with f0.
The second method is to concatenate diphones with already recorded tones. In this manner, the size of the data increases noticeably, but the tones are created exactly like natural speech, so the speech quality is quite good. However, f0 balancing when concatenating recorded-tone diphones is a little more difficult. To solve this problem, we cut diphones into frames, where each frame is one speech period.
Fig.3 Speech signal frames
Then, frames are multiplied with Hamming window.
Fig.4 Speech signal frames multiplied by Hamming window
To keep the f0 contour smooth, frames are overlapped with the desired period.
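The frame cutting and windowing can be sketched as follows, assuming (as is common in PSOLA) that each frame is centred on a pitch mark and spans two periods; the pitch-mark positions are taken as given here:

```python
import numpy as np

def pitch_synchronous_frames(x, pitch_marks):
    """Cut a diphone into pitch-synchronous frames, each tapered by a
    Hamming window. A frame runs from the previous pitch mark to the
    next one (roughly two periods), so overlapping adjacent frames by
    one period reconstructs the signal smoothly."""
    frames = []
    for i in range(1, len(pitch_marks) - 1):
        left, right = pitch_marks[i - 1], pitch_marks[i + 1]
        frame = x[left:right].astype(np.float64)
        frames.append(frame * np.hamming(len(frame)))
    return frames
```

Because the window length follows the local period, the frames stay aligned with the waveform even where f0 varies within the diphone.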
The two contact frames are used for power balancing between the two diphones. Assume that x(n) and y(n) are the contact frames with length N; we compute a power factor by:
p = sqrt( sum(x(n)^2) / sum(y(n)^2) ) for n = 0..N-1
Then the overlapping is done with the second diphone's frames multiplied by p. Assume that x[0..L-1] is the current speech signal, y[0..N-1] is the next frame, and K is the period. The synthesized signal s[] is calculated by:
s[n] = x[n] for n = 0..L-N/2-1
s[n] = x[n] + p*y[n-L+N/2] for n = L-N/2..L-N/2+K-1
s[n] = p*y[n-L+N/2] for n = L-N/2+K..L+K-1
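A minimal numpy sketch of the power balancing and overlap-add equations above, assuming two-period frames (N = 2K) so that the three regions line up as written:

```python
import numpy as np

def power_factor(x_frame, y_frame):
    """p = sqrt(sum x^2 / sum y^2): scales the next diphone so the two
    contact frames have equal power."""
    return np.sqrt(np.sum(x_frame ** 2) / np.sum(y_frame ** 2))

def overlap_add(x, y, K, p):
    """Append windowed frame y (length N) to signal x (length L),
    overlapping the last N/2 samples of x and scaling y by p.
    With N = 2K this realises the three regions:
        s[n] = x[n]                     n = 0 .. L-N/2-1
        s[n] = x[n] + p*y[n-L+N/2]      n = L-N/2 .. L-N/2+K-1
        s[n] = p*y[n-L+N/2]             n = L-N/2+K .. L+K-1
    """
    L, N = len(x), len(y)
    s = np.zeros(L + K)
    s[:L] = x                                 # current signal, untouched head
    s[L - N // 2 : L - N // 2 + N] += p * y   # overlap-add the new frame
    return s
```

Synthesizing an utterance then reduces to repeatedly calling `overlap_add` with each successive windowed frame, with K chosen to follow the desired f0 contour.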
To reduce the memory needed to implement the above algorithm on the T-Engine embedded system, we store each diphone in a separate data file together with an index file. In this way, only two diphones are loaded in memory at the same time, so the memory use is reduced considerably. Table II shows the database index file structure.
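The two-diphones-in-memory policy can be sketched as below; `load_diphone` and `concatenate` are placeholders for the file loading and PSOLA joining steps described in this section, not functions from the paper:

```python
def synthesize(diphone_names, load_diphone, concatenate):
    """Stream through the utterance keeping at most two diphones in
    memory: the one already synthesized and the one being joined to it.
    `load_diphone` reads one diphone's samples from its own data file
    (using the index file to locate frames); `concatenate` joins two
    diphones with overlap-add and power balancing."""
    current = load_diphone(diphone_names[0])
    for name in diphone_names[1:]:
        nxt = load_diphone(name)          # second diphone enters memory
        current = concatenate(current, nxt)
        # `nxt` is merged into `current`; its buffer can be freed here
    return current
```

Only the growing output buffer and one pending diphone are resident at any moment, which is what makes the approach fit a small embedded heap.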
Table II
Diphone index file structure
2 BYTE: The end-point of the first frame, which is also the start-point of a period of the vowel. This field is available for the first diphone.
All the data in the table above is calculated manually to ensure the quality of the synthesized speech. In order to speed up the creation of the database, we have built a database tool supporting automatic frame detection. This is a very useful tool for creating the database with less effort. It can produce a pitch contour automatically from a wave data file with high accuracy, then save the contour to a database index file as described in Table II.
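One common way such a tool can detect the pitch automatically is autocorrelation peak picking; the following is a generic sketch of that technique, not the tool's actual method:

```python
import numpy as np

def pitch_period(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate the pitch period of one analysis frame by picking the
    autocorrelation peak inside a plausible f0 range. Sliding this over
    a recorded diphone yields the frame boundaries (pitch marks) that
    get written into the index file."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1 :]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag  # period in samples; f0 = fs / lag
```

Restricting the search to the 60-400 Hz band (typical for speech) avoids picking harmonics or sub-harmonics of the true f0.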
Fig.7 Screen shot of waiting for speech commands from users
V APPLICATION OF VIETNAMESE RECOGNITION AND SYNTHESIS
Vietnamese speech recognition and synthesis have a wide range of applications, especially in human-computer interaction (HCI).
Fig.6 Screen shot of Ho Chi Minh Museum introduction
To demonstrate the use of speech in HCI, we have combined speech recognition together with speech synthesis in our software running on T-Engine. This software allows users to use speech commands to query information about places in Ha Noi. Figure 7 illustrates the main screen of the software; on this screen the user can see a map of Ha Noi with some places in bold titles. When a user reads a title, for example "Bao tang Ho Chi Minh", the software tells the user some information about the place. Figure 6 is a screen shot of the Ho Chi Minh Museum introduction.
VI CONCLUSIONS
This paper is concerned with an advanced method for Vietnamese speech synthesis from a database of diphones with already recorded tones. We have carried out experiments with the implementation of a human-computer interaction system in the T-Engine embedded system; besides, our enhancements and optimizations allow the system to be implemented in low-resource embedded systems. The complete experiment consists of two parts, recognition and synthesis; Table I illustrates some recognition results with different voices.