
Developing Vietnamese Sign Language to Text Translation System


VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
DEPT OF SOFTWARE ENGINEERING

Do Thi Thanh Nha - Le Thanh Luan

FINAL THESIS
Developing Vietnamese Sign Language to Text Translation System
SOFTWARE ENGINEERING MAJOR

Instructor: Dr. Nguyen Trinh Dong

HO CHI MINH CITY, JULY 2023


GRADUATION & THESIS EVALUATION COMMITTEE INFORMATION

The Graduation & Thesis Evaluation Committee was established according to Decision No. ..., dated ..., by the Rector of the University of Information Technology.

- Chairman
- Secretary
- Member
- Member


ACKNOWLEDGEMENT

We would like to express our deepest gratitude and appreciation to all those who have supported us throughout the journey of completing this thesis.

First and foremost, we are profoundly grateful to the esteemed members of the thesis committee, particularly those from the Faculty of Software Engineering at the University of Information Technology - VNUHCM. Their valuable insights, constructive criticism, and scholarly contributions have significantly enhanced the academic rigor of this thesis. We are indebted to their expertise and rigorous evaluation.

We would also like to express our heartfelt appreciation to our family and friends for their unconditional love, encouragement, and belief in our abilities. Their unwavering support, understanding, and patience have been the driving force behind our academic journey.

Furthermore, we would like to express our deep appreciation to Professor Nguyen Trinh Dong for his invaluable guidance, expertise, and mentorship throughout this thesis' journey. His unwavering support, constructive feedback, and scholarly insights have played a crucial role in shaping the academic rigor and overall quality of this work.

To everyone who has contributed to this thesis in various ways, whether directly or indirectly, we extend our heartfelt appreciation. Your support has been invaluable in the successful completion of this academic endeavor.

Thank you very much, we wish you all the best.

Ho Chi Minh City, July 2023
Students
Le Thanh Luan
Do Thi Thanh Nha


THE OBSERVATIONS

Members & Participation percentage

No.   Full name           Student ID   Contribution percentage
1     Le Thanh Luan       19520702     50%
2     Do Thi Thanh Nha    18529116     50%


Contents

1  Introduction
   1.1  Problem statement
   1.2  Approach
   1.3  Results

2  Foundational knowledge
   2.1  Vietnamese Sign Language (VSL)
   2.2  Survey of Existing Sign Language Translation Technologies
        2.2.1  Sign language recognition using sensor gloves
        2.2.2  A Cost Effective Design and Implementation of Arduino Based Sign Language Interpreter
        2.2.3  Neural Sign Language Translation
        2.2.4  Deep Learning for Vietnamese sign language recognition in video sequence
   2.3  Machine Learning Techniques and Algorithms
        2.3.1  Machine learning
        2.3.2  Deep Learning
        2.3.3  Model Architectures for Sign Language Recognition
        2.3.4  Deep Neural Network (DNN)
        2.3.5  Convolutional Neural Networks (CNN)
        2.3.6  Recurrent Neural Network (RNN)
        2.3.7  Long Short-Term Memory (LSTM)
   2.4  Software Background
        2.4.1  Python
        2.4.2  Javascript
        2.4.3  Tensorflow
        2.4.4  Keras
        2.4.5  scikit-learn
        2.4.6  Jupyter Notebook
        2.4.7  Mediapipe
        2.4.8  Expo-React Native
        2.4.9  SQLServer
        2.4.10 Strapi
        2.4.11 Websocket knowledge
        2.4.12 Visual Studio Code

3  Data Collection and Preprocessing
   3.1  Reason for Dataset Creation: Scarcity of Existing Datasets and Unresponsive Researchers
   3.2  Gathering and Preparing the Dataset for VSL Recognition Model
   3.3  Data Preprocessing Steps
   3.4  Matrix Formation - Input for Machine Learning Model
   3.5  File Organization

4  Model Training
   4.1  Model Architecture
   4.2  Training Process
   4.3  Model Evaluation

5  Mobile Application Development
   5.1  Designing the Mobile Application
        5.1.1  Use cases
        5.1.2  Database Diagram
        5.1.3  The streaming system for processing videos
   5.2  Implementation and Results
        5.2.1  Integrating the Model into the Mobile App with Python
        5.2.2  Results
   5.3  User Testing and Feedback: Improving the User Experience

6  Feedback and Discussion
   6.1  User Testing Results and Feedback on the Mobile Application
   6.2  Comparison of our Mobile Application with Existing Sign Language E-learning Apps
        6.2.1  ASL Bloom
        6.2.2  Lingvano
   6.3  Other Assistive Technologies

7  Conclusion and Future Work
   7.1  Summary of Contributions and Accomplishments
   7.2  Future Directions for the Project: Expanding the dictionary


List of Figures

2.1   VSL alphabet (Source: Circular 17/2020/TT-BGDĐT)
2.2   VSL shares similarities with global sign language dictionaries
2.3   An example of sensor glove used for detecting movement sequences. Source: Cornell University ECE 4760 Sign language glove prototype
2.4   LSTM architecture
2.5   Structure of a simple neural network with an input layer, an output layer, and two hidden layers
2.6   Starting the convolutional operation
2.7   Step two in the convolutional operation
2.8   Finish the convolution operation when the kernel goes through the 5*5 matrix
2.9   Example of the convolutional matrix
2.10  X matrix when adding outer padding of zeros
2.11  Convolutional operation when stride=1, padding=1
2.12  Convolutional operation with stride=2, padding=1
2.13  Illustration of convolutional operation on a color image with k=3
2.14  Tensor X, W dimensions are written as matrices
2.15  Max pooling layer with size=(3,3), stride=1, padding=0
2.16  Example of pooling layer
2.17  The structure of a recurrent neural network
2.18  The flowchart of the RNN-T algorithm used in Reliable Multi-Object Tracking Model Using Deep Learning and Energy Efficient Wireless Multimedia Sensor Networks [1]
2.19  Python Language Syntax
2.20  Javascript Language Syntax
2.21  Sklearn Metrics - Confusion matrix
2.22  Jupyter Notebook IDE
2.23  Mediapipe Hand Landmarks
2.24  Mediapipe Holistic Landmarks
2.25  Mediapipe Face Mesh Landmarks
2.26  Expo can run cross-platform
2.27  Strapi.io
2.28  Visual Studio Code Editor
3.1   1st dataset of Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Source: [11]
3.2   Some actual footage of the dataset used in Deep Learning for Vietnamese Sign Language Recognition in Video Sequence. Source: [11]
3.3   Letter D
3.4   Letter L
3.5   Letter V
3.6   Letter Y
3.7   Folder tree of the sign 'a'
4.1   Mediapipe Face Mesh Landmarks
5.1   The database diagram designed for SignItOut
5.2   The diagram illustrates the relationships among the elements within the API layer
5.3   The diagram illustrates the relationships among the elements within the Engine layer
5.4   Run Strapi with command prompt
5.5   Strapi UI run on localhost
5.6   Login Screen
5.7   Home screen
5.8   Course details Screen
5.9   Plain Text Lesson
5.10  Quiz Lesson
5.11  Video Lesson
6.1   The logo of ASL Bloom, one of the most popular sign language E-Learning applications with more than 100,000 users
6.2   Lingvano, the ASL learning application which uses artificial intelligence for giving feedback about users' signing accuracy


5.2.1 Integrating the Model into the Mobile App with Python

Events are passed between these components through queues, enabling them to process the data received and perform their respective functionalities without the need for explicit input handling methods.

For the implementation of the FrameReader, we opted to utilize the Python programming language along with its built-in socket library. This involved the setup and configuration of a socket server to receive input frames from the external environment, specifically the client's request. The received data is in a buffered format, represented as bytes, necessitating the conversion of these buffer inputs into images using Python's image processing capabilities. Once the conversion is completed, if the processing is successful, the component further transforms the image into a FrameEvent object. This FrameEvent is then pushed into the incoming queue of the subsequent component, which in this case is the MediapipeDetector.

To implement the MediapipeDetector, we focused on setting up and configuring the necessary components for the Mediapipe model. The data received from the FrameEvent serves as input to the Mediapipe model, enabling it to detect various features and landmarks within the image. Once the detection process is complete, the results are packaged into a MediapipeEvent and subsequently transmitted to the LandmarkExtractor component for further processing and analysis.

For the implementation of the LandmarkExtractor, our main objective was to define the necessary functions to extract landmarks and poses, which are then combined into a vector comprising 1662 elements. The algorithms used for extracting these features were described in detail in section 3.2 of the report. Once the vector is created, it is encapsulated within a NumpyArrayEvent and subsequently transmitted to the TextClassifier component for further processing and analysis.

To implement the TextClassifier, our primary focus was on sending the input vector to the pre-trained model for classification and obtaining the corresponding text output based on the output label. For the packaging and unpackaging of the model, we utilized Keras, a widely used and straightforward framework. Once the classification is performed, the resulting output is encapsulated within an OutputEvent and subsequently transmitted to the OutputEmittor component.

In the case of the OutputEmittor component, our focus was on establishing and configuring the socket server to emit output to clients, utilizing Python's built-in socket library. Unlike the FrameReader, in this scenario, we needed to convert the output text into a buffered format that can be effectively handled by the socket.
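To make the FrameReader, LandmarkExtractor, and TextClassifier steps concrete, the minimal sketch below shows one possible way to decode an incoming frame, flatten the Mediapipe Holistic output into the 1662-element vector described above, and classify a buffered window of frames with a pre-trained Keras model. The window length, label list, and model file name are illustrative assumptions, not values taken from this thesis.

    import cv2
    import numpy as np
    import mediapipe as mp
    from tensorflow.keras.models import load_model

    mp_holistic = mp.solutions.holistic
    holistic = mp_holistic.Holistic(min_detection_confidence=0.5,
                                    min_tracking_confidence=0.5)

    def decode_frame(buffer: bytes) -> np.ndarray:
        """FrameReader step: turn raw bytes received on the socket into a BGR image."""
        return cv2.imdecode(np.frombuffer(buffer, dtype=np.uint8), cv2.IMREAD_COLOR)

    def extract_keypoints(results) -> np.ndarray:
        """LandmarkExtractor step: flatten Mediapipe Holistic output into a single
        1662-element vector (33 pose x 4 + 468 face x 3 + 2 x 21 hand x 3).
        Parts that were not detected are filled with zeros."""
        pose = (np.array([[p.x, p.y, p.z, p.visibility]
                          for p in results.pose_landmarks.landmark]).flatten()
                if results.pose_landmarks else np.zeros(33 * 4))
        face = (np.array([[p.x, p.y, p.z]
                          for p in results.face_landmarks.landmark]).flatten()
                if results.face_landmarks else np.zeros(468 * 3))
        left = (np.array([[p.x, p.y, p.z]
                          for p in results.left_hand_landmarks.landmark]).flatten()
                if results.left_hand_landmarks else np.zeros(21 * 3))
        right = (np.array([[p.x, p.y, p.z]
                           for p in results.right_hand_landmarks.landmark]).flatten()
                 if results.right_hand_landmarks else np.zeros(21 * 3))
        return np.concatenate([pose, face, left, right])  # shape: (1662,)

    # Illustrative values only -- the actual model file, window length and label
    # set come from the training work described in Chapters 3 and 4.
    SEQUENCE_LENGTH = 30
    LABELS = ["xin chao", "cam on", "tam biet"]   # hypothetical sign labels
    model = load_model("vsl_sign_model.h5")       # hypothetical model path

    def classify_window(frames):
        """MediapipeDetector + TextClassifier steps for a window of BGR frames."""
        keypoints = [extract_keypoints(holistic.process(
            cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))) for frame in frames]
        batch = np.expand_dims(np.array(keypoints), axis=0)  # (1, 30, 1662)
        return LABELS[int(np.argmax(model.predict(batch)[0]))]

In the streaming design described above, this logic would be driven by events pulled from each component's incoming queue rather than called directly.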
5.2.2 Results

After implementation, we successfully ran Strapi on our machines; the Strapi UI can be viewed at localhost:1337/admin.

Strapi

Fig 5.4: Run Strapi with command prompt
Fig 5.5: Strapi UI run on localhost

Mobile App Screens

After implementation and integration with the Strapi backend and the model via websocket, we successfully ran the SignItOut Expo project on our machines, with the application configured to run on iOS devices for demo purposes. Here are some screens of our mobile app:

Fig 5.6: Login Screen
Fig 5.7: Home screen
Fig 5.8: Course details Screen
Fig 5.9: Plain Text Lesson
Fig 5.10: Quiz Lesson
Fig 5.11: Video Lesson

5.3 User Testing and Feedback: Improving the User Experience

To ensure the optimal user experience of our e-learning application, we conducted user testing and gathered valuable feedback from our trial users. This process allowed us to identify areas of improvement and make necessary adjustments to enhance the overall usability and functionality of the application.

Use Case UC001: User Login played a crucial role in our user testing phase. By observing how users interacted with the login process, we were able to identify pain points and make refinements to streamline the authentication flow. As a result, we implemented Expo Google Signin to provide a seamless and secure login experience for our users.

The Browse Courses feature, represented by Use Case UC002, was another area that underwent thorough user testing. We collected feedback on the course discovery process, navigation, and filtering options. Based on user feedback, we implemented an intuitive and responsive course catalog with modern-looking sliders, allowing users to easily explore and find the courses they are interested in.

Use Case UC004: View Course Details also received significant attention during user testing. We collected feedback on the presentation of course information, including the course description, instructor details, duration, and learning objectives. Based on user input, we revamped the course details page, providing a comprehensive and user-friendly overview of each course.

User feedback on Use Case UC005: Learn Course was invaluable in enhancing the learning experience. We gathered insights on the usability of video playback, readability of articles, and overall engagement with the course materials. Taking this feedback into account, we optimized the multimedia delivery, ensuring smooth video playback and improving the readability of the provided articles.

Tracking lesson progress, as defined in Use Case UC007, was another aspect we scrutinized during user testing. We gathered feedback on the visibility of completed sections, quiz scores, and overall progress tracking within a lesson. Based on user feedback, we improved the lesson progress tracking feature, offering users a clear and comprehensive overview of their progress within each lesson.

Lastly, Use Case UC008: User Logout received attention during the user testing phase to ensure a seamless logout experience. By observing users' logout behavior and collecting feedback, we refined the logout process to provide a smooth transition back to the login/signup page.

In conclusion, user testing and feedback played a crucial role in our application's development process. By incorporating user feedback from the mentioned use cases, we were able to iteratively improve our e-learning application, ensuring a user-centric and enhanced user experience. We are grateful to our users for their valuable input, which has been instrumental in shaping the application into its current user-friendly state.


Chapter 6
Feedback and Discussion

6.1 User Testing Results and Feedback on the Mobile Application

During the development of our mobile learning application integrated with the Vietnamese Sign Language recognition model, we conducted user testing sessions to gather valuable feedback and evaluate the effectiveness of the application in aiding sign language learning. The user testing involved a diverse group of individuals, including both learners and experienced sign language users.
The results of the user testing sessions were highly encouraging, demonstrating the positive impact and usability of the mobile application. Participants reported that the application provided an intuitive and user-friendly interface, allowing them to easily navigate through various features and interact with the learning content. The visual feedback provided by the application, including real-time recognition of sign language gestures, was found to be highly beneficial for learners to practice and improve their signing skills.

The user testing sessions also highlighted several areas for improvement and valuable insights for further development. Participants expressed the need for additional interactive exercises and challenges to enhance their learning experience. They suggested incorporating gamification elements, such as quizzes and rewards, to make the learning process more engaging and motivating. Additionally, participants emphasized the importance of clear and concise instructions, as well as the availability of comprehensive sign language dictionaries and resources within the application.

Furthermore, participants appreciated the integration of text-to-speech functionality, which provided spoken translations of written content. This feature facilitated the comprehension of sign language instruction and allowed for better understanding of written materials related to sign language learning.

Overall, the user testing sessions provided valuable feedback and insights for refining and enhancing the mobile application. The positive feedback and suggestions from participants reinforced the effectiveness of our approach and highlighted the potential of the application to serve as a valuable learning tool for Vietnamese Sign Language. Based on the user testing results, we will continue to iteratively improve the application, incorporating the valuable feedback received to ensure a more inclusive, user-centered, and effective learning experience for individuals interested in learning Vietnamese Sign Language.

6.2 Comparison of our Mobile Application with Existing Sign Language E-learning Apps

6.2.1 ASL Bloom

ASL Bloom is an ideal choice for beginners as it focuses on laying a strong foundation and employs spaced repetition techniques to enhance learning retention. This app is specifically designed to assist families and friends of deaf children in learning sign language. It offers comprehensive lessons starting from the basics, supported by instructional videos demonstrating the precise hand movements for each sign. The app goes beyond mere vocabulary and includes interactive features like practice quizzes, flashcards, and a searchable sign bank for easy reference. Moreover, ASL Bloom delves into important aspects of deaf culture, providing a holistic understanding of the language.

Fig 6.1: The logo of ASL Bloom, one of the most popular sign language E-Learning applications with more than 100,000 users

Although ASL Bloom continues to achieve success in the market, it is important to acknowledge that the development team has made a deliberate decision not to incorporate artificial intelligence (AI) into the application. As of now, there are no plans to integrate AI technology. Furthermore, the app currently does not include a feature for checking users' signs and providing corrections. Hence, the inclusion of artificial intelligence in our system is a significant advantage for us.
6.2.2 Lingvano

Lingvano is a mobile application that closely resembles our offering in the market. It provides an interactive and engaging platform for learning American Sign Language (ASL). With its enjoyable activities and well-structured lessons, Lingvano facilitates the development of sign language skills and the ability to hold meaningful conversations.

What sets Lingvano apart is its gamified approach to learning sign language. The app turns the learning process into an engaging game, making it enjoyable and motivating for users. Similar to the popular language learning app Duolingo, Lingvano guides learners through lessons where they acquire new signs and learn how to combine them to form sentences and engage in conversations.

Unlike many other ASL apps, Lingvano offers a variety of practice activities to reinforce learning. In addition to the traditional method of watching and repeating signs, the app presents alternative exercises such as selecting the meaning of a sign or choosing the correct sign for a given word. Moreover, Lingvano goes the extra mile by incorporating a feature that utilizes the device's camera to provide users with visual feedback on their signing accuracy. With its comprehensive range of activities and practice options, Lingvano offers ample opportunities for learners to hone their signing skills. The app prioritizes interactive engagement and provides a diverse learning experience that goes beyond basic repetition.

Fig 6.2: Lingvano, the ASL learning application which uses artificial intelligence for giving feedback about users' signing accuracy

While this application shares similarities with our idea, it focuses on American Sign Language (ASL), whereas our application supports Vietnamese Sign Language (VSL). However, Lingvano boasts a more diverse range of quizzes compared to our offering. The time constraints of our project limited our ability to expand our application and data set accordingly.

6.3 Other Assistive Technologies

Text-to-Speech Systems: Text-to-speech systems facilitate communication between deaf and hearing individuals by converting spoken language into written text and vice versa. These systems employ text-to-speech synthesis (TTS) technologies to enable real-time transcription and speech synthesis. By providing textual representations of spoken language, these systems assist deaf-mute individuals in participating in verbal conversations.

In our project, we employed Text-to-Speech (TTS) systems as a crucial component of our assistive technology solution. By harnessing the power of TTS technology, we aimed to bridge the gap between written and spoken language for individuals with hearing impairments. Through the integration of TTS systems into our educational platform, we provided spoken translations of written content, enabling users to access and comprehend textual materials by hearing them in spoken form. This approach allowed individuals to receive spoken representations of sign language instructional materials, dialogues, or descriptions, thereby enhancing their understanding of the language and facilitating their communication skills development. By leveraging TTS technology, we aimed to empower individuals with hearing impairments to participate more fully in verbal conversations and engage with spoken language in a more inclusive and accessible manner.
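As a rough illustration of the TTS integration described above, the snippet below turns a piece of written lesson content into spoken Vietnamese audio. The thesis does not name the specific TTS engine used in the platform; gTTS, the function name, and the output path here are purely illustrative assumptions.

    from gtts import gTTS  # illustrative engine; the project text does not name one

    def speak_lesson_text(text: str, out_path: str = "lesson_audio.mp3") -> str:
        """Convert written lesson content into a spoken Vietnamese audio file."""
        gTTS(text=text, lang="vi").save(out_path)
        return out_path

    # Example: voice the written translation shown alongside a sign video.
    speak_lesson_text("Xin chào, rất vui được gặp bạn.")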
Chapter 7
Conclusion and Future Work

In conclusion, this thesis has presented a comprehensive study on training a model for Vietnamese Sign Language (VSL) recognition and developing a mobile learning app integrated with the model to aid in sign language learning. The contributions and accomplishments achieved throughout this research can be summarized as follows:

7.1 Summary of Contributions and Accomplishments

In conclusion, this thesis has presented a comprehensive study on training a model for Vietnamese Sign Language (VSL) recognition and developing a mobile learning app integrated with the model to aid in sign language learning. The contributions and accomplishments achieved throughout this research have laid the foundation for further advancements in VSL recognition and sign language learning technology.

Firstly, an in-depth analysis of computer vision and machine learning techniques for sign language recognition was conducted, with a specific focus on Vietnamese Sign Language. Various methods, including pose estimation using Mediapipe and deep learning models like LSTM-DNN, were explored and implemented. These techniques demonstrated promising results in capturing the temporal dynamics and patterns of sign language gestures.

Additionally, a diverse and representative dataset of VSL video sequences was collected and prepared. The dataset was carefully divided into training, testing, and validation sets, ensuring effective model training and evaluation. This dataset serves as a valuable resource for future research and development in VSL recognition.

The development of the LSTM-DNN model for VSL recognition showcased its ability to accurately recognize sign language gestures. The model's performance was evaluated using various metrics, demonstrating its effectiveness in capturing temporal dependencies and improving recognition accuracy. This model serves as a solid foundation for further research and enhancements in VSL recognition technology.

Furthermore, the integration of the trained model into a mobile learning app provides an interactive and accessible platform for individuals to learn VSL at their own pace. The app's features, including real-time recognition, instructional videos, and interactive exercises, enhance the learning experience and make sign language learning more engaging and effective.
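For concreteness, the following is a minimal sketch of an LSTM-DNN classifier of the kind summarized above: stacked LSTM layers over fixed-length sequences of the 1662-element keypoint vectors, followed by dense layers and a softmax over the sign vocabulary. The layer sizes, sequence length, and class count are illustrative assumptions; the architecture actually trained is specified in Chapter 4, which is not reproduced here.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    NUM_SIGNS = 20          # illustrative size of the sign vocabulary
    SEQUENCE_LENGTH = 30    # illustrative number of frames per sample
    NUM_FEATURES = 1662     # keypoints per frame, as described in Chapter 5

    # LSTM layers model the temporal dynamics of a sign; dense layers map the
    # final summary vector to a probability distribution over the known signs.
    model = Sequential([
        LSTM(64, return_sequences=True, activation="tanh",
             input_shape=(SEQUENCE_LENGTH, NUM_FEATURES)),
        LSTM(128, return_sequences=True, activation="tanh"),
        LSTM(64, return_sequences=False, activation="tanh"),
        Dense(64, activation="relu"),
        Dense(32, activation="relu"),
        Dense(NUM_SIGNS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])
    model.summary()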
7.2 Future Directions for the Project: Expanding the dictionary

In terms of future work, there are several avenues for further improvement and expansion of this research. Firstly, incorporating more advanced techniques and models, such as attention mechanisms or transformer-based architectures, could potentially enhance the recognition accuracy and performance of the VSL model.

Secondly, the dataset could be expanded by including more diverse sign gestures, involving a larger number of signers, and considering variations in lighting conditions and camera angles. This would help make the model more robust and generalizable to real-world scenarios.

Furthermore, incorporating additional modalities, such as audio or textual information, into the VSL learning experience would provide learners with more comprehensive feedback and a richer learning environment. This integration could facilitate a deeper understanding of sign language and enhance the overall learning experience.

Moreover, conducting user studies and gathering feedback from the target user group would provide valuable insights into the usability and effectiveness of the developed mobile learning app. User feedback can guide further refinements and improvements to meet the specific needs and preferences of sign language learners, ensuring the app's effectiveness and user satisfaction.

In conclusion, the research presented in this thesis has made significant strides in VSL recognition and sign language learning technology. The developed model, dataset, and mobile learning app lay the groundwork for future advancements in this field. By continuously exploring advanced techniques, expanding datasets, integrating additional modalities, and considering user feedback, we can foster inclusive and effective sign language learning environments for individuals to communicate and connect more effectively.


Bibliography

[1] Bassam Alqaralleh, Sachi Mohanty, Deepak Gupta, Ashish Khanna, K. Shankar, and Thavavel Vaiyapuri. Reliable multi-object tracking model using deep learning and energy efficient wireless multimedia sensor networks. IEEE Access, 8:213426-213436, 2020.

[2] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[3] Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, and Richard Bowden. Neural sign language translation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7784-7793, 2018.

[4] Josh Fischer and Ning Wang. Grokking Streaming Systems. Manning Publications, 2022.

[5] S. A. Mehdi and Y. N. Khan. Sign language recognition using sensor gloves. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP '02), volume 5, pages 2204-2206, 2002.

[6] A. Nandhini, D. Roopan, S. Shiyaam, and S. Yogesh. Sign language recognition using convolutional neural network. Journal of Physics: Conference Series, 1916:012091, 2021.

[7] Anirbit Sengupta, Tausif Mallick, and Abhijit Das. A cost effective design and implementation of Arduino based sign language interpreter. In 2019 Devices for Integrated Circuit (DevIC), pages 12-15, 2019.

[8] Junzhong Shen, You Huang, Mei Wen, and Chun-yuan Zhang. Towards an efficient deep pipelined template-based architecture for accelerating the entire 2D and 3D CNNs on FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.

[9] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4489-4497, 2015.

[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[11] Anh Vo, Van-Huy Pham, and Bao Thien. Deep learning for Vietnamese sign language recognition in video sequence. International Journal of Machine Learning and Computing, 9, 2019.

[12] Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, and Zhou Zhao. Gloss attention for gloss-free sign language translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2551-2562, 2023.
