APPLYING MACHINE LEARNING METHODS FOR AUTOMATIC CLASSIFICATION OF MENTAL ILLNESS TYPES AMONG YOUNG PEOPLE IN VIETNAM BASED ON SOCIAL MEDIA TEXT DATA
VIETNAM NATIONAL UNIVERSITY, HANOI
Social Media Text Data
Team Leader: Vu Thi Que Anh
ID: 20070900 Class: BDA2020B
Hanoi, April 2024
TEAM LEADER INFORMATION
- Program: Business Data Analytics
- Address: Alley 39 Ho Tung Mau, Mai Dich, Cau Giay District, Hanoi City
- Phone no: 0367531963
- Email: 20070900@vnu.edu.vn
II Academic Results (from the first year to now)
Academic year Overall score Academic rating
III Other achievements:
- Incentive Scholarship for semester 2 of the 2020-2021 school year
- Incentive Scholarship for semester 1 and semester 2 of the 2021-2022 school year
1 Introduction
2 Literature Review
3 Data & Methodology
3.1 Dataset
3.2 Methodology
3.2.1 Data Collection
3.2.2 Preprocessing
3.2.3 Proposed Methods
3.2.4 Experimental setups
4 Results & Discussions
4.1 Comparison between models
4.2 Insights
4.3 Discussions
4.3.1 Importance of Pre-trained Language Models
4.3.2 Future Directions and Implications for Mental Health
5 Conclusion & Recommendations
6 References
LIST OF TABLES
Table 1: Signs of psychological disorders
Table 2: Number of labeling of manual labeling and labeling using ChatGPT
Table 3: Experimental results of the proposed methods
LIST OF FIGURES
Figure 1: BERT Architecture (Nguyen, 2020)
Figure 2: Word cloud gives noun type words for Label 1
Figure 3: Elbow methods
Figure 4: Visualize for 3 clusters
Figure 5: Word cloud gives adjective type words for Label 1
INTRODUCTION
APPLYING MACHINE LEARNING METHODS FOR AUTOMATIC CLASSIFICATION OF MENTAL ILLNESS TYPES AMONG YOUNG PEOPLE IN VIETNAM BASED ON SOCIAL MEDIA TEXT DATA
1 Project Code: CN.NC.SV.23_22
2 Member List:
3 Advisor(s):
Fullname: Tran Thi Oanh
Academic degree: Associate Professor
Academic field: Faculty of Applied Science
4 Abstract (300 words or less):
Mental illness is now common and a major source of distress, affecting the health and well-being of society as a whole. Mental illness, also known as a psychiatric disorder, refers to a behavioral or mental pattern that significantly impairs a person's ability to function or causes considerable distress. It can also involve a clinically significant disruption in a person's behavior, emotional control, or thought processes, frequently in social situations. These disruptions may be isolated incidents, persistent, or relapsing-remitting. The prevalence of depression in particular, and of mental health issues in general, among Vietnamese youth has become an important concern in recent years, yet prompt detection and classification of these illnesses remain a major challenge. Using the large volumes of text readily available on social media, this study proposes an automatic classification framework designed to distinguish between different mental health conditions, thoughts of self-harm, and suicidal ideation, as a possible basis for effective social intervention. Using natural language processing (NLP) techniques, two pretrained masked language models (PhoBERT and VisoBERT), and the machine learning models CNN and SVM, we analyze and classify social network text to distinguish individuals who show signs of psychological disorders from those who do not. Our results show an accuracy of up to about 84%. This suggests that machine learning can help physicians diagnose mental illness more accurately and efficiently. The research offers knowledge needed to understand the mental health of the online community, contributes to the advancement of automated mental health assessment tools, and holds potential for informing targeted intervention and support strategies tailored to the specific mental health needs of young people in Vietnam.
5 Keywords
Mental illness, Machine learning, Transformer, Natural Language Processing, Pre-trained model
SUMMARY REPORT IN STUDENT RESEARCH,
2023-2024 ACADEMIC YEAR
1 Introduction
Mental illnesses are health conditions involving changes in emotion, thinking, or behavior (or a combination of these). They can be associated with distress and/or problems functioning in social, work, or family activities. Mental illness is common. In a given year, nearly one in five (19%) U.S. adults experiences some form of mental illness, one in 24 (4.1%) has a serious mental illness, and one in 12 (8.5%) has a diagnosable substance use disorder (American Psychiatric Association, 2022). According to statistics from the World Health Organization, every year nearly 1 million people have problems with depressive disorders and nearly 800,000 people die by suicide. Depression is becoming one of the most common diseases globally, but only about 25% of those affected receive timely treatment. According to research at the National Children's Hospital in Hanoi in 2020, 26.3% of adolescents suffered from depression, 6.3% thought about death, 4.6% planned suicide, and 5.8% attempted suicide.
The transition to adulthood, coupled with societal pressures, academic stress, and rapid social change, has made young people particularly vulnerable to various forms of health problems and psychological disorders. Research on shifts from mental health discourse to suicidal ideation, using linguistic and interactional measures, has revealed that people who attempt suicide have mental health problems (De Choudhury et al., 2016). Psychological evaluations have historically relied on self-report surveys, diagnostic instruments, and clinical interviews conducted by qualified experts. Unfortunately, these approaches often suffer from subjectivity, high cost, and restricted accessibility, particularly in low-resource settings such as Vietnam. Social media data, particularly text-based content, presents a promising opportunity to gain insight into the mental health status of young people.
Advanced natural language processing (NLP) techniques and pretrained language models such as PhoBERT and VisoBERT offer a novel way to automatically classify mental health status among young people in Vietnam based on their social media text. PhoBERT, a Vietnamese pretrained language model developed by VinAI Research, and VisoBERT, which surpasses previous state-of-the-art models on multiple Vietnamese social media tasks, have demonstrated strong capabilities in understanding Vietnamese text.
This research project aims to help overcome the shortcomings of conventional mental health screening techniques and to offer a practical, scalable approach to early identification and intervention. By automatically classifying mental health issues among young people in Vietnam, we aim to inform targeted interventions, support strategies, and policy initiatives tailored to the particular needs of this demographic group. This will ultimately help to promote resilience and mental well-being in Vietnamese society.
2 Literature Review
Mental illness has become a pressing issue globally, affecting individuals of all ages, races, and socioeconomic backgrounds. Among young people, the prevalence of mental health disorders is particularly concerning, with factors such as academic pressure, social media influence, and economic instability contributing to the rise in cases. In Vietnam, as in many other countries, mental health issues among young people are on the rise, yet access to mental health services remains limited, leading to a significant treatment gap.
To address this gap and provide timely interventions, researchers have turned to social media text data and machine learning methods for automatic classification of mental illness types. This literature review surveys existing studies relevant to applying machine learning methods to the automatic classification of mental illness types among young people in Vietnam based on social media text data.
Recent studies have demonstrated the effectiveness of state-of-the-art natural language processing models such as RoBERTa in classifying mental illness types from social media text. Nguyen et al. (2022) applied RoBERTa to classify posts from Depression_Reddit, achieving an F1 score of 95.11% for distinguishing between depression and non-depression posts. Similarly, Tran et al. (2020) used RoBERTa to classify posts from SuicideWatch, achieving an F1 score of 95.47% for identifying suicidal and non-suicidal posts. These results suggest that RoBERTa captures subtle linguistic cues indicative of mental health conditions, making it a promising tool for automatic classification tasks.
In addition to transformer-based models like RoBERTa, researchers have explored hybrid architectures and attention mechanisms to enhance classification performance. Le et al. (2021) employed a BiLSTM model with attention mechanisms to classify suicide notes, last statements, and neutral posts. Their model achieved an F1 score of 93.3%, demonstrating the effectiveness of attention mechanisms in capturing important textual features. Similarly, Pham et al. proposed a hybrid deep learning model combining Fasttext and LSTM for classifying posts from Reddit and Twitter. With an accuracy of 87%, the hybrid model showed the potential of integrating traditional text processing techniques with deep learning approaches for mental illness classification tasks.
While deep learning models have shown promising results, traditional machine learning techniques such as KNN and SVM have also been employed in mental illness classification tasks. However, these methods often struggle with the complex linguistic patterns and context dependencies present in social media text data. For instance, Hoang et al. (2021) used KNN and SVM to classify data from CLPsych, achieving a relatively low F1 score of 74.10%. This underscores the limitations of traditional methods in capturing the nuanced semantics of mental health-related text.
Recent research has also explored the integration of multimodal information, such as text and emojis, to improve classification accuracy. Truong et al. (2023) used the SentiEmoDD dataset of Tweets containing both text and emojis and applied SVM to classify mental health-related posts. Their model achieved accuracies of 88.29% for text-only classification and 87.69% for text-and-emoji classification, showing that multimodal features are usable for this task, although in this case adding emojis did not improve on the text-only result.
Ji et al. (2022) introduced MentalBERT and MentalRoBERTa, two pretrained masked language models tailored for mental healthcare applications. Trained on a corpus collected from social forums dedicated to mental health discussions, these models aim to enhance machine learning for mental health detection tasks. Ji et al. (2022) conducted a comprehensive evaluation on several mental health detection benchmarks, demonstrating that language representations pretrained in the target domain improve the performance of mental health detection tasks. Their model achieved an F1 score of 95.11%. This work addresses the need for domain-specific pretrained language models in mental healthcare and provides valuable resources for the development of automated classification systems for mental illness types among young people in Vietnam based on social media text data.
Despite the progress made in automatic classification of mental illness types among young people in Vietnam, several challenges remain. One limitation is the lack of annotated datasets specific to the Vietnamese context, which may hinder the development of robust classification models. Additionally, data privacy and ethical considerations need to be carefully addressed, particularly when dealing with sensitive user-generated content on social media platforms. Future research may explore transfer learning techniques to adapt pretrained models to the Vietnamese language, as well as the development of interpretable models to enhance clinical applicability and transparency.
The summary table below provides an overview of the research findings.
❖ Motivation
The motivation for this research stems from a growing concern about mental health, especially among young people in Vietnam, who are increasingly vulnerable to psychological disorders due to a range of pressures including academic stress and social change. The limited accessibility of mental health services in Vietnam highlights a significant treatment gap, which calls for innovative approaches to early detection and intervention. The wide availability of social media data offers an opportunity to use machine learning techniques to analyze text for signs of mental health issues, thereby enabling timely and targeted interventions. This study aims to use advanced natural language processing tools and machine learning methods to automatically classify different types of mental illness among young people based on their social media interactions.
❖ Objectives
The primary objective of this study is to apply machine learning methods for the automatic classification of mental illness types among young people in Vietnam using social media text data. The specific goals include:
1 Developing a robust classification framework that can accurately distinguish between different mental health conditions based on textual data
2 Improving the understanding of mental health patterns among young individuals by analyzing the language used in social media posts
3 Contributing to the reduction of the mental health treatment gap in Vietnam by providing a tool that aids in the early detection of mental health issues
By achieving these objectives, the research seeks to enhance the capacity for early mental health intervention and support, tailored to the unique needs of young individuals in Vietnam.
3 Data & Methodology
3.2 Methodology
To conduct this study, we will follow a methodology comprising data collection, preprocessing, feature extraction, model selection and training, evaluation, and comparison with existing approaches.
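As a minimal sketch of the evaluation step named above, the snippet below computes accuracy, precision, recall, and F1 for a binary task (label 1 = signs of a psychological disorder, label 0 = none). The function name `binary_metrics` and the toy labels are illustrative assumptions and are not taken from the study's code or data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, made up for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Reporting F1 alongside accuracy is useful when the two labels are unbalanced, since accuracy alone can look high for a model that rarely predicts the minority label.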
3.2.1 Data Collection
We will gather 1,063 textual posts from social media platforms popular among Vietnamese youth, including Facebook. Specifically, we will focus on posts related to mental health, encompassing depression, anxiety, stress, self-harm, and suicidal ideation. Throughout this process, we will uphold ethical standards regarding data privacy and consent.
3.2.2 Preprocessing
3.2.2.1 Remove duplicate values
We first identified duplicate rows in the DataFrame and then removed them. There were 63 duplicate entries; after removal, 1,000 records remained.
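The deduplication step can be sketched as follows. This is a plain-Python illustration, not the study's code: the function name `drop_duplicate_posts` and the sample posts are hypothetical, and collapsing extra whitespace before comparing is an assumption. For exact-match deduplication on a pandas DataFrame, `DataFrame.drop_duplicates` plays the same role.

```python
def drop_duplicate_posts(posts):
    """Keep the first occurrence of each post, preserving the original order."""
    seen = set()
    unique = []
    for text in posts:
        # Assumption: posts that differ only in spacing count as duplicates.
        key = " ".join(text.split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Hypothetical sample posts, for illustration only.
posts = ["Tôi mệt quá", "Hôm nay vui", "Tôi  mệt quá ", "Hôm nay vui"]
unique = drop_duplicate_posts(posts)
removed = len(posts) - len(unique)
```

Counting `removed` gives the same kind of check reported in the text, where 1,063 collected posts minus 63 duplicates leaves 1,000 records.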
3.2.2.2 Label data:
In this research, we label the data twice, once manually and once using ChatGPT, and then compare the two sets of labels to check and improve the quality of the labeling process and to evaluate the accuracy of both methods.
3.2.2.2.1 Definition of labels
Label 0: Individuals who show no signs or symptoms of past or present psychological disorder and have no medical records related to psychological illness.
Label 1: Individuals who have had or are currently showing signs and symptoms of psychological disorders, have been or are undergoing treatment, or have medical records related to psychological health, at any level of severity.
Label 2: Posts for which the manual label and the ChatGPT label differ.
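The label definitions can be illustrated with a short sketch that merges the two label sets and measures how often they agree. The function names and toy labels are hypothetical. Cohen's kappa is used here as one common agreement measure; the text does not state which measure the study used, so this choice is an assumption.

```python
def reconcile_labels(manual, gpt, conflict_label=2):
    """Keep the label when both sources agree; otherwise assign the conflict label."""
    if len(manual) != len(gpt):
        raise ValueError("label lists must have the same length")
    return [m if m == g else conflict_label for m, g in zip(manual, gpt)]


def cohen_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels, made up for illustration only.
manual = [0, 1, 1, 0, 1, 0, 0, 1]
gpt = [0, 1, 0, 0, 1, 1, 0, 1]
final = reconcile_labels(manual, gpt)
kappa = cohen_kappa(manual, gpt)
```

Posts that end up with label 2 can then be reviewed again or excluded, depending on how the disagreement is to be handled.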
3.2.2.2.2 Labeling instructions
❖ Assign labels manually
We met with an expert, Master of Psychology Nguyen Thi Nhu Phuong of the International School, Vietnam National University, to discuss psychological disorders in particular and mental health in general. Drawing on her answers and guidance, we had reliable criteria for recognizing the signs of these disorders, so that we could label the data as fairly as possible (Tran Thanh Nam - Nguyen Phuong Hong Ngoc, 2020).
Table 1 summarizes some signs and manifestations of stress, anxiety disorders, and depressive disorders in particular, and of psychological diseases in general.
Table 1: Signs of psychological disorders
Mental health: Stress
Definition: Stress is a state of nervous tension involving physical and chemical factors and the reactions of an individual trying to adapt to a change or pressure from outside or inside.
Signs:
- Uncomfortable; anxiety and stress; sad, depressed, indifferent; feeling like you've lost your self-worth
- Angry, frustrated or hot-tempered; using stimulants; disruption of daily activities; eating too much or too little; becoming unreasonable…
- Headache, muscle pain; stomach-ache; dizziness; digestive disorders, difficulty breathing, chest pain
Causes:
- External environment: weather, noise, traffic, dust
- Physical problems: body changes, illness, pain, insufficient nutrients, etc.
- Stress from society and family: work, finances, family conflicts, friends
- Thinking: thinking a lot, being negative
Anxiety
disorders
Anxiety disorder is excessive fear of a situation that is unreasonable, repetitive, and prolonged, affecting
- Excessive fear of situations that occur, are unreasonable in nature, and repeat over
- Exposure to stressful and negative life or environmental events during