
5.2. Recommendations and research directions

- The research results can be used as reference material and as a basis for further study by undergraduate and graduate students.

- The research results are a first step toward further studies on assistive devices for people with disabilities.

- This study is limited to one-way conversion from speech to sign language. Future studies could design two-way conversion: from speech to sign language and from sign language to speech. Furthermore, the current system performs speech recognition through online APIs and therefore requires an internet connection. Future studies should deploy speech recognition models that can operate without an internet connection.


APPENDIX

Paper included in the project's list of deliverables

Minh Le, Thanh Minh Le, Vu Duc Bui, Son Ngoc Truong, "A Low-Cost Speech to Sign Language Converter," International Journal of Computer Science and Network Security, vol. 21, no. 3, March 2021 (ESCI).


Manuscript received March 5, 2021; revised March 20, 2021.

https://doi.org/10.22937/IJCSNS.2021.21.3.5

A Low-Cost Speech to Sign Language Converter

Minh Le, Thanh Minh Le, Vu Duc Bui, Son Ngoc Truong

HCMC University of Technology and Education, Ho Chi Minh City, Vietnam

Summary

This paper presents a design of a speech to sign language converter for deaf and hard of hearing people. The device is low-cost, has low power consumption, and can work entirely offline. Speech recognition is implemented using an open-source API, the Pocketsphinx library. In this work, we proposed a context-oriented language model, which measures the similarity between the recognized speech and the predefined speech to decide the output. The output speech is selected from the recommended speech stored in the database that best matches the recognized speech. The proposed context-oriented language model can improve the speech recognition rate by 21% while working entirely offline. A decision module, based on measuring the similarity between the two texts using the Levenshtein distance, decides the output sign language. The output sign language corresponding to the recognized speech is generated as a set of sequential images. The speech to sign language converter is deployed on a Raspberry Pi Zero board as a low-cost deaf assistive device.

Key words:

Speech recognition, speech-to-text, sign language.

1. Introduction

Speech is a common way for humans to communicate with each other. However, it cannot be used by people who suffer from speech and hearing disabilities. Many people around the globe have disabilities in hearing and speaking. There is an existing language used by such people, called sign language. Sign language is a fully visual language with its own grammar, used by deaf and mute people [1]. In sign language, hand gestures, head and body movements, and facial expressions are used to convey information [2]. It is difficult for deaf people to understand information from speech coming from hearing people or even from media devices. Therefore, a translator is necessary to minimize the communication gap between hearing-impaired and hearing people. Such a translator comprises a speech recognizer that translates speech to text and a sign language generator that converts text to the appropriate sign language [3]-[5]. Speech recognition is a complicated task. Leading-edge speech recognition models are implemented using machine learning and deep neural networks [6]-[9]. However, deep neural networks require a massive number of computations, which consume a large amount of power and processing time. For this reason, a simple, portable, and low-cost translator is needed to help deaf people receive information from the real world. In this paper, we experimentally present a design of a low-cost, one-way, portable translator that can translate speech to sign language for deaf people. Speech recognition is performed using the open-source library Pocketsphinx, which can work entirely offline [10]-[13]. Pocketsphinx is a small, reconfigurable model that can be deployed on low-cost embedded systems for a mobile device. In this experimental demonstration, we deploy the speech recognition model and the speech to sign language converter on a low-cost Raspberry Pi Zero [14]-[16]. The optimized Pocketsphinx model has low accuracy because it employs limited acoustic and language models. To improve the accuracy, a context-oriented language model is proposed. The proposed context-oriented language model is based on the Levenshtein distance to measure the similarity between the recognized speech and the recommended speech.

2. Design of a speech to sign language converter

A speech to sign language converter for deaf people must satisfy several requirements, such as mobility, low power consumption, low cost, and high reliability. There are three necessary modules inside such a device: a speech-to-text module, a language understanding module, and a text to sign language converter module. To keep the device low-cost, we use a Raspberry Pi Zero as the control unit. The speech-to-text, language understanding, and text to sign language modules are all deployed on the Raspberry Pi Zero board.
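As a structural sketch only, the three modules described above can be pictured as the following Python outline. All function names here are hypothetical placeholders (the paper does not publish code); concrete versions of the individual stages are sketched further below.

```python
# Structural sketch of the three-module pipeline; all names are hypothetical placeholders.

def speech_to_text(audio) -> str:
    """Speech-to-text stage (Pocketsphinx in the actual design)."""
    raise NotImplementedError("see the Pocketsphinx sketch in Section 2")

def correct_text(text: str, database: list[str]) -> str:
    """Context-oriented language model: snap the text to the closest database sentence."""
    raise NotImplementedError("see the Levenshtein-based sketch in Section 2")

def text_to_sign(text: str) -> list[str]:
    """Map the corrected text to a sequence of sign-image files."""
    raise NotImplementedError("see the display sketch in Section 3")

def translate(audio, database: list[str]) -> list[str]:
    """Full pipeline: speech -> text -> corrected text -> sign image sequence."""
    return text_to_sign(correct_text(speech_to_text(audio), database))
```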

Fig. 1. The block diagram of the speech to sign language converter.


Fig. 1 shows a block diagram of the speech to sign language converter. The Raspberry Pi Zero is a low-cost embedded system suitable for portable devices. The Raspberry Pi Zero has only one channel for output audio, so in this design the input speech signal recorded from the microphone is passed to the Raspberry Pi Zero through a USB audio adapter, as shown in Fig. 1. The sign language output is an animation composed of a set of sequential images displayed on an LCD. The system is powered by a 4000 mAh battery, and the device can work continuously for 4 hours.

Speech to text is an important task that decides the reliability of the device. Recently, deep learning-based speech-to-text modules have been demonstrated as the leading-edge models for automatic speech recognition systems. However, deep neural networks are resource-hungry because they rely on heavy computation. To suit deployment on a low-cost computer such as the Raspberry Pi Zero, we instead use an offline, open-source speech-to-text module, Pocketsphinx. We propose a context-oriented language model to improve the speech recognition accuracy; the Pocketsphinx module followed by the proposed context-oriented language model is shown in Fig. 2. Natural language processing is commonly used for language understanding tasks. However, natural language processing is a complicated task that requires substantial resources. In this design, we use a simple method that compares the recognized speech with the sentences stored in the database to decide the output; the Levenshtein distance is suitable for such a task. The last task is the text to sign language converter. Once the output text has been decided, the sign language is generated as a set of consecutive images played on the output device. Fig. 2 shows the low-cost speech to sign language converter with the proposed context-oriented language model.
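As an illustration of the offline speech-to-text stage, the snippet below uses the documented LiveSpeech helper from the Pocketsphinx Python bindings. It is a minimal sketch with the default bundled US-English models; the authors' actual acoustic model, dictionary, and restricted language model files are not given in the paper.

```python
# Minimal offline recognition loop with the Pocketsphinx Python bindings.
# Uses the default bundled models; a domain-specific language model and
# dictionary can be supplied through LiveSpeech's keyword options.

from pocketsphinx import LiveSpeech

for phrase in LiveSpeech():        # captures audio from the default microphone
    recognized = str(phrase)       # hypothesis text for the utterance
    print("Pocketsphinx output:", recognized)
    # the recognized text is then passed to the context-oriented language model
```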

Fig. 2. The proposed speech to sign language converter with context-oriented language model.

In Fig. 2, the speech is recorded from the microphone and then enters the Raspberry Pi Zero, where it is converted to text by the Pocketsphinx module. Pocketsphinx is an optimized version of Sphinx for low-cost computers. The proposed context-oriented language model uses the Levenshtein distance to measure the similarity between the recognized speech and the desired speech stored in the database. The output of the context-oriented language model is the recommended sentence that best matches the recognized speech. By using the proposed context-oriented language model, the recognized speech is corrected according to the expected speech stored in the database. The corrected text is then passed to the language understanding module, where the output sign language is decided: the text is compared with the predefined texts to determine which sign language output to produce.
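The correction step just described can be sketched as follows. This is an illustrative implementation, not the authors' code: the Levenshtein distance is computed with the standard dynamic-programming recurrence and normalized by the longer string's length, and the best-scoring database sentence is returned when its score clears the 0.8 threshold mentioned in Section 3. Because the paper does not state its exact normalization, the scores produced here may differ slightly from those in Table I.

```python
# Sketch of the context-oriented language model: snap the recognized text to the
# most similar sentence in the database using a normalized Levenshtein similarity.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance to [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def correct_text(recognized: str, database: list[str], threshold: float = 0.8) -> str:
    """Return the recommended sentence that best matches the recognized text."""
    best = max(database, key=lambda s: similarity(recognized.lower(), s.lower()))
    if similarity(recognized.lower(), best.lower()) >= threshold:
        return best
    return recognized  # no confident match: keep the raw recognition

database = ["What are you doing", "Where do you go", "I am working"]
print(correct_text("What do you doing", database))  # -> "What are you doing"
```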

3. Experimental Results

The proposed architecture, with its three modules of speech to text, language understanding, and text to sign language, is deployed on a Raspberry Pi Zero board as a low-cost speech to sign language converter. The speech recognition accuracy is improved by the proposed context-oriented language model, which corrects the recognized speech. In Table I, we demonstrate the operation of the proposed context-oriented language model.

Table I. The proposed context-oriented language model

Pocketsphinx output | Database           | Similarity score | Context-oriented language model output
What do you doing   | What are you doing | 0.86             | What are you doing
What are you do     | What are you doing | 0.9              | What are you doing
why do you go       | What are you doing | 0.5              | Where do you go
why do you go       | Where do you go    | 0.8              | Where do you go
I go working        | I am working       | 0.83             | I am working
I have to walk now  | I have to work now | 0.89             | I have to work now

In Table I, we evaluate the performance of the proposed context-oriented language model in enhancing the accuracy of offline speech recognition. The recognized speech is compared with the recommended speech stored in the database, and the recommended speech is chosen as the output if the similarity score is higher than 0.8. The Levenshtein distance is used to measure the similarity of two strings. In this way, the text produced by Pocketsphinx is corrected. In Table I, the text recognized by Pocketsphinx is "What do you doing"; it is most similar to the recommended speech "What are you doing", so the corrected output is "What are you doing". Similarly, the recognized speech "why do you go" matches "Where do you go" better than "What are you doing", so the output is "Where do you go". Using the proposed context-oriented model, the speech recognition rate is improved significantly. The database is composed of the possible sentences. To evaluate the speech recognition module, we measure the accuracy on 500 sentences from 5 speakers, with the speech recorded from the microphone in real time. Pocketsphinx without the proposed context-oriented model can recognize the speech and convert it to text with an accuracy of 71%, while Pocketsphinx followed by the proposed context-oriented language model reaches an accuracy of 92%; the proposed context-oriented language model thus improves the accuracy by 21%. Once the speech has been recognized, the sign language is generated at the output. The sign language is based on a set of sequential images, as shown in Fig. 3.
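The paper does not spell out the scoring procedure behind the 71% and 92% figures; one plausible reading is sentence-level accuracy over the test utterances, which could be computed as in the sketch below (the test sentences shown are toy data for illustration only).

```python
# Hypothetical sketch of sentence-level accuracy scoring for the evaluation
# described above; the example sentences are illustrative, not the test set.

def sentence_accuracy(hypotheses: list[str], references: list[str]) -> float:
    """Fraction of hypotheses that exactly match their reference sentence."""
    assert len(hypotheses) == len(references)
    matches = sum(h.strip().lower() == r.strip().lower()
                  for h, r in zip(hypotheses, references))
    return matches / len(references)

references = ["what are you doing", "where do you go"]
raw_output = ["what do you doing", "where do you go"]   # Pocketsphinx alone
corrected  = ["what are you doing", "where do you go"]  # after the correction step
print(sentence_accuracy(raw_output, references))  # 0.5
print(sentence_accuracy(corrected, references))   # 1.0
```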

Fig. 3. The sign language of "what are you doing" is composed of a set of consecutive images.

The sign language corresponding to the text output from the speech-to-text module is a set of sequential images carrying a specific meaning, as shown in Fig. 3. The sign language is displayed on an LCD device.
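The display of the image sequence could look roughly like the sketch below. It is an assumption-laden illustration: the per-word image files, their naming, and the 480x320 LCD resolution are not specified in the paper and are chosen here only for the example, with pygame used as a convenient way to draw to the screen.

```python
# Sketch of the text-to-sign stage: show one sign image per word on the LCD.
# Image directory, file naming, and screen size are assumptions for illustration.

import time
import pygame

def play_sign_sequence(sentence: str, image_dir: str = "signs", delay: float = 0.5) -> None:
    pygame.init()
    screen = pygame.display.set_mode((480, 320))                  # assumed LCD resolution
    try:
        for word in sentence.lower().split():
            frame = pygame.image.load(f"{image_dir}/{word}.png")  # assumed naming scheme
            frame = pygame.transform.scale(frame, screen.get_size())
            screen.blit(frame, (0, 0))
            pygame.display.flip()
            time.sleep(delay)                                      # hold each sign briefly
    finally:
        pygame.quit()

# Example: play_sign_sequence("what are you doing")
```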

4. Conclusion

This paper presented a design of a speech to sign language converter for deaf people. The device is portable, has low power consumption, and can work without an internet connection. Speech recognition is implemented using the open-source Pocketsphinx library. To enhance the accuracy, we proposed a context-oriented language model, which measures the similarity between the recognized speech and the predefined speech to decide the output. The proposed model can improve the speech recognition accuracy by 21%. A decision module, based on the similarity between the two texts measured using the Levenshtein distance, decides the output sign language.

Acknowledgments

This work belongs to the project grant No: T2020-39TĐ, funded by Ho Chi Minh City University of Technology and Education, Vietnam.

References

[1] U. Bellugi and S. Fischer, "A comparison of sign language and spoken language," Cognition, vol. 1, no. 2-3, pp. 173-200, 1972.

[2] O. Aran and L. Akarun, “Sign Language Processing and Interactive Tools for Sign Language Education,” 2007 IEEE 15th Signal Processing and Communications Applications, Eskisehir, 2007, pp. 1-4.

[3] L. Boppana, R. Ahamed, H. Rane and R. K. Kodali, “Assistive Sign Language Converter for Deaf and Dumb,” 2019 International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), Atlanta, GA, USA, 2019, pp. 302-307.

[4] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney and R. Bowden, "Neural Sign Language Translation," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 7784-7793.

[5] L. Kau, W. Su, P. Yu and S. Wei, "A real-time portable sign language translation system," 2015 IEEE 58th International Midwest Symposium on Circuits and Systems (MWSCAS), Fort Collins, CO, 2015, pp. 1-4.

[6] P. Lakkhanawannakun and C. Noyunsan, "Speech Recognition using Deep Learning," 2019 34th International
