FINE-TUNE PHOBERT FOR VIETNAMESE FAKE NEWS DETECTION
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN DUC MANH - 19521827
Additionally, we express our deep gratitude to the lecturers at the University of Information Technology, especially those from the Information Systems Faculty. Their extensive knowledge and passionate teaching have contributed significantly to our academic journey. The understanding and insights they shared have laid a solid foundation that enabled us to complete this thesis.

Throughout the execution of this project, we endeavored to apply our foundational knowledge and explore new technologies to construct this graduate thesis. However, due to limited experience and expertise, imperfections were inevitable. We therefore eagerly look forward to receiving feedback and suggestions from our professors, which will help us refine our knowledge and skills.

We also wish to convey our sincere thanks to our families and friends, who have consistently supported and accompanied us throughout the research and completion of this thesis.
Thank you very much!
Authors
Nguyen Duc Manh
Nguyen Thanh
TABLE OF CONTENTS

ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS
CHAPTER 1: INTRODUCTION
  1.1 Context
  1.2 Purpose
  1.3 Objectives and Scope
    1.3.1 Objectives
    1.3.2 Scope
  1.4 Motivation
  1.5 Report Outline
CHAPTER 2: LITERATURE REVIEW AND THEORETICAL BACKGROUND
  2.1 Fake News and Its Impact
    2.1.1 Overview of Fake News
    2.1.2 The Significance of Detecting Fake News
  2.2 BERT and Its Variants in NLP
    2.2.1 Introduction to BERT
    2.2.2 PhoBERT: A BERT Variant for Vietnamese Language Processing
  2.3 Model Evaluation Metrics
    2.3.1 Accuracy, Precision, and Recall
    2.3.2 F1-Score
    2.3.3 ROC and AUC
    2.3.4 Confusion Matrix
  2.4 Related Models
    2.4.1 Support Vector Machine (SVM)
    2.4.2 Logistic Regression
  2.5 Related Research
CHAPTER 3: METHODOLOGY
  3.1 System Overview
  3.2 Data Collection and Preprocessing
    3.2.1 Data Collection
    3.2.2 Data Labeling
    3.2.3 Data Cleaning and Preprocessing
    3.2.4 Data Training
  3.3 PhoBERT for Vietnamese Fake News Detection
    3.3.1 PhoBERT - pretrained BERT for Vietnamese language
    3.3.2 Fine-Tune PhoBERT for Vietnamese Fake News Detection
  3.4 System Design
    3.4.1 Architecture of the Fake News Detection System
    3.4.2 Integration of the PhoBERT Model with the Application
  3.5 System Deployment and Testing
    3.5.1 Backend Development
    3.5.2 Frontend Development
CHAPTER 4: EXPERIMENTAL RESULTS AND APPLICATION DEMO
  4.1 Model Performance and Results
    4.1.2 Models Evaluation
  4.2 Application Demo
  4.3 Discussion of Findings
  4.4 Limitations of the Study
CHAPTER 5: CONCLUSIONS AND FUTURE WORKS
  5.1 Achieved Results
  5.2 Implications of the Research
  5.3 Recommendations for Future Research
REFERENCES
LIST OF TABLES

- The result of data collecting
- The result of data processing
- Data Labeling
- Term frequency (TF)
- Document frequency (DF)
- Inverse document frequency (IDF)
- TF-IDF
- Data Splitting
- Data Splitting
- Training Result of the PhoBERT Model
- Validation Result of the PhoBERT Model
- Testing Result of the PhoBERT Model
- Training results of the three evaluated models
- Testing results of the three evaluated models
LIST OF FIGURES

Figure 2 - 1: Perfect Classifier
Figure 2 - 2: Typical ROC Curve
Figure 2 - 3: Random Classifier
Figure 2 - 4: Confusion matrix
Figure 2 - 5: Samples on the margin are called the support vectors
Figure 3 - 1: Illustrate the process in the system
Figure 3 - 2: Illustrate the process for data collection
Figure 3 - 3: Illustrate a review in Vnexpress
Figure 3 - 4: Illustrate a review in Viettan
Figure 3 - 5: Text Processing and Preparation for PhoBERT Model Training
Figure 3 - 6: Fine-Tuning Process for PhoBERT Model
Figure 3 - 7: Illustrate the process for architecture of the Fake News Detection System
Figure 3 - 8: FastAPI logo
Figure 3 - 9: Illustrate user interface
Figure 4 - 1: Illustrate the running time in seconds of the trained models
Figure 4 - 2: Illustrate the metrics result of each trained classification model
Figure 4 - 3: Illustrate the running time in seconds of the test models
Figure 4 - 4: Illustrate the metrics result of each test classification model
Figure 4 - 5: Illustrate dataset
Figure 4 - 6: Illustrate result Real news
Figure 4 - 7: Illustrate result Fake news
Figure 4 - 8: Illustrate result Real news
Figure 4 - 9: Illustrate result Fake news
Figure 4 - 10: Illustrate Vnexpress web
Figure 4 - 11: Illustrate result of the news that does not exist
Figure 4 - 12: Illustrate Vnexpress web
Figure 4 - 13: Illustrate result of the news that does not exist
Figure 4 - 14: Illustrate result of the news that does not exist
LIST OF ACRONYMS AND ABBREVIATIONS
No.  Acronym   Meaning
1    NLP       Natural Language Processing
2    BERT      Bidirectional Encoder Representations from Transformers
3    AI        Artificial Intelligence
4    SLA       Service Level Agreement
5    NER       Named Entity Recognition
6    NLTK      Natural Language Toolkit
7    NLI       Natural Language Inference
8    ROC       Receiver Operator Characteristic
9    AUC       Area Under the Curve
10   SVM       Support Vector Machine
11   LSTM      Long Short-Term Memory
12   KNN       K-Nearest Neighbors Algorithm
13   TF-IDF    Term Frequency - Inverse Document Frequency
14   API       Application Programming Interface
15   ReLU      Rectified Linear Unit
16   TP        True Positive
17   TN        True Negative
18   FP        False Positive (Type 1 Error)
19   FN        False Negative (Type 2 Error)
CHAPTER 1: INTRODUCTION
1.1 Context
In the digital age, the Internet has become an essential part of our daily lives. It not only expands our access to information but also serves as the foundation for the growth of social networks, where people can share, exchange, and receive information quickly. However, this strong development also brings significant challenges, especially the issue of fake news. Fake news can cause misunderstanding and panic among the public and can negatively affect individuals and organizations when false information is spread.

In Vietnam, detecting and preventing fake news on online platforms is becoming an urgent issue. With the popularity of social media and online news sites, Vietnamese Internet users are increasingly at risk of encountering inaccurate or fabricated information. This creates an urgent need for a support system capable of accurately and efficiently predicting and classifying real and fake news.

This thesis focuses on building such a system, using advanced deep learning models to analyze and assess the authenticity of information, especially in the context of the Vietnamese language and culture. The goal is to create a useful tool that helps Internet users verify information themselves and protects them against the negative impacts of fake news.
1.2 Purpose
The main aim of this report is to delve into the essential concepts and algorithms that underpin fake news detection systems. We focus on the practical application of these algorithms, followed by an analysis and assessment of the results we obtain. To begin, we compile datasets from leading and credible online newspapers such as VnExpress, Thanh Niên, Tuổi Trẻ, Dân Trí, VietNamNet, and others. This text is processed through several steps, including data review and preprocessing tasks such as converting text to lowercase, stripping punctuation, and removing stop words. After data processing, machine learning models are applied. The news articles we collected carry pre-assigned labels indicating whether they are fake or real. This pre-labeling aids in the training and validation of our machine learning models, enabling them to effectively distinguish between fake and real news items.
1.3 Objectives and Scope
1.3.1 Objectives
- To generate all datasets by collecting news articles from various online platforms.
- To employ text mining techniques that encompass fake news detection algorithms and natural language processing tools.
1.3.2 Scope
- Datasets comprising online news articles and user-generated content.
- Fake news detection system.
- Natural Language Processing (NLP) techniques.
- Sentiment analysis to gauge the veracity of news content.
- Implementation of BERT and other NLP models for enhanced detection accuracy.
- System analysis and design to create a robust detection framework.
- Development of a user-friendly web interface for real-time fake news assessment.
1.4 Motivation

Our motivation is to develop a system that empowers users to discern the truthfulness of news content related to their interests or current events. In an era where misinformation can spread rapidly, our project addresses the critical challenge of identifying and flagging fake news, thereby saving users from the pitfalls of misinformation. The overwhelming presence of unverified news and the complexity of verifying each piece of information make it imperative to have a system that simplifies this process, making it less time-consuming and more accessible for everyday users.
1.5 Report Outline

Chapter 3: Methodology - research methods and approaches used in developing the fake news prediction support system.

Chapter 4: Experimental results and application demo - presents the results of the implemented system, a demo, and a discussion of the results.

Chapter 5: Conclusions and future works - summarizes the research, presents the conclusions drawn from the findings, and suggests directions for future research in related areas.
CHAPTER 2: LITERATURE REVIEW AND THEORETICAL BACKGROUND
2.1 Fake News and Its Impact
2.1.1 Overview of Fake News

Fake news or hoax news is false or misleading information (hoaxes, propaganda, and disinformation) presented as news. Fake news often aims to damage the reputation of a person or entity, or to make money through advertising revenue. Although false news has always been spread throughout history, the term "fake news" was first used in the 1890s, when sensational reports in newspapers were common. Nevertheless, the term does not have a fixed definition and has been applied broadly to any type of false information. It has also been used by high-profile people to describe any news unfavorable to them. Further, disinformation involves spreading false information with harmful intent and is sometimes generated and propagated by hostile foreign actors, particularly during elections. In some definitions, fake news includes satirical articles misinterpreted as genuine, and articles that employ sensationalist or clickbait headlines that are not supported in the text. Because of this diversity of types of false news, researchers are beginning to favor "information disorder" as a more neutral and informative term [1].
2.1.2 The Significance of Detecting Fake News
- Maintaining Public Trust: The spread of fake news can undermine trust in the media, government institutions, and reputable sources of information. Detecting and correcting fake news is necessary to protect or restore this trust.
- Countering Manipulation and Disinformation: Fake news is often used as a tool of manipulation for political or personal purposes. It can also be used by outside forces to cause social disorder. Detecting and addressing fake news helps prevent these manipulation efforts.
- Reducing Social and Political Divisions: Fake news often exacerbates social and political rifts. By exposing and countering misinformation, there is the potential to reduce division and promote fair, fact-based debate.
- Media Awareness: Responding to fake news also includes educating the public to recognize and evaluate news critically, emphasizing the importance of distinguishing between trustworthy information and misinformation.
- Facing Legal and Ethical Challenges: The growth of fake news brings new legal and ethical challenges. Detecting fake news is an important part of developing policies and regulations that balance the protection of freedom of expression with the need to prevent the spread of harmful information.
2.2 BERT and Its Variants in NLP

2.2.1 Introduction to BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a model in natural language processing. Its name succinctly describes BERT's bidirectional nature and the Transformer's attention mechanism, which processes each word in relation to all the other words in a sentence, unlike traditional models that process words sequentially. BERT is pre-trained on a vast corpus of text and is designed to understand the nuances of language by predicting missing words from their context and the relationship between consecutive sentences. This pre-training enables BERT to be fine-tuned for a variety of tasks, such as sentiment analysis, question answering, and language inference, with minimal additional task-specific training.
2.2.2 PhoBERT: A BERT Variant for Vietnamese Language Processing

Pre-trained PhoBERT models are state-of-the-art language models for Vietnamese ("Pho", i.e. "Phở", is the most popular Vietnamese food). PhoBERT has two versions, PhoBERT-base and PhoBERT-large, which are the first large-scale monolingual language models pre-trained for Vietnamese. These models significantly outperform the multilingual model XLM-R on various Vietnamese-specific NLP tasks [7].

PhoBERT employs the same architecture as BERT-base and BERT-large, optimized following RoBERTa. It was trained on a 20GB word-level Vietnamese corpus, addressing challenges in Vietnamese language modeling such as the differentiation between syllables and word tokens.

PhoBERT sets new state-of-the-art results on four Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition (NER), and natural language inference (NLI). It achieved these results by employing large-scale pre-training data and addressing the unique characteristics of the Vietnamese language.

PhoBERT represents a significant advancement in Vietnamese NLP. Its ability to outperform existing models on language-specific tasks demonstrates the effectiveness of dedicated, large-scale, monolingual language models. The release of PhoBERT is expected to foster future research and applications in Vietnamese NLP, including potential use in systems like fake news detection.
2.3 Model Evaluation Metrics
2.3.1 Accuracy, Precision, and Recall

Accuracy

Accuracy is a metric that measures how often a machine learning model correctly predicts the outcome. You can calculate accuracy by dividing the number of correct predictions by the total number of predictions.

This metric is simple to calculate and understand. Almost everyone has an intuitive perception of accuracy: a reflection of the model's ability to correctly classify data points [4].
Precision
Precision is a metric that measures how often a machine learning model correctly predicts the positive class. You can calculate precision by dividing the number of correct positive predictions (true positives) by the total number of instances the model predicted as positive (both true and false positives):

precision = true positives / (true positives + false positives)

You can measure precision on a scale of 0 to 1 or as a percentage. The higher the precision, the better. You can achieve a perfect precision of 1.0 when the model is always right when predicting the target class: it never flags anything in error [4].
Recall
Recall is a metric that measures how often a machine learning model correctly identifies positive instances (true positives) among all the actual positive samples in the dataset. You can calculate recall by dividing the number of true positives by the total number of positive instances. The latter includes true positives (successfully identified cases) and false negatives (missed cases):

recall = true positives / (true positives + false negatives)

You can measure recall on a scale of 0 to 1 or as a percentage. The higher the recall, the better. You can achieve a perfect recall of 1.0 when the model can find all instances of the target class in the dataset.

Recall can also be called sensitivity or true positive rate. The term "sensitivity" is more commonly used in medical and biological research than in machine learning. For example, you can refer to the sensitivity of a diagnostic medical test to describe its ability to correctly expose the majority of true positive cases. The concept is the same, but "recall" is the more common term in machine learning [4].
2.3.2 F1-Score

The F1 score is a machine learning evaluation metric that measures a model's accuracy. It combines the precision and recall scores of a model.

Precision measures how many of the "positive" predictions made by the model were correct.

Recall measures how many of the positive class samples present in the dataset were correctly identified by the model.

The F1 score combines precision and recall using their harmonic mean, and maximizing the F1 score implies simultaneously maximizing both precision and recall.
The F1 score is defined in terms of the precision and recall scores, which are mathematically defined as follows:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)
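The four metrics above can be computed directly from confusion-matrix counts. The following minimal Python sketch mirrors the formulas; the counts are hypothetical, for illustration only:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical run: 90 fake articles flagged correctly, 10 missed,
# 5 real articles flagged by mistake, 95 passed correctly.
acc, p, r, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))  # 0.925 0.947 0.9 0.923
```

Note how precision (0.947) and recall (0.9) can differ noticeably even when accuracy looks high, which is why the F1 score is reported alongside them.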
2.3.3 ROC and AUC

The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values, essentially separating the 'signal' from the 'noise.' In other words, it shows the performance of a classification model at all classification thresholds. The Area Under the Curve (AUC) measures the ability of a binary classifier to distinguish between classes and is used as a summary of the ROC curve.
Figure 2 - 1: Perfect Classifier.¹
When AUC = 1, the classifier can correctly distinguish between all the positive and negative class points. If, however, the AUC had been 0, the classifier would predict all negatives as positives and all positives as negatives.
Figure 2 - 2: Typical ROC Curve.
When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative ones. This is because the classifier detects more true positives and true negatives than false negatives and false positives.
Figure 2 - 3: Random Classifier.
When AUC = 0.5, the classifier is not able to distinguish between positive and negative class points, meaning that it predicts either a random class or a constant class for all the data points.

So, the higher the AUC value of a classifier, the better its ability to distinguish between positive and negative classes [6].
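The AUC has an equivalent interpretation that is easy to verify in code: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A brute-force sketch with hypothetical scores:

```python
def auc_score(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for fake (positive) and real (negative) articles.
print(auc_score([0.9, 0.8, 0.6], [0.7, 0.3, 0.2]))  # 8/9: close to a perfect classifier
```

A perfect classifier scores 1.0 on this check and a random one 0.5, matching the three cases discussed above.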
2.3.4 Confusion Matrix

The confusion matrix is a method for evaluating the results of classification problems that considers both accuracy and recall metrics for predictions in each class. A confusion matrix consists of the following four indices for each classification class:
¹ Source: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
Figure 2 - 4: Confusion matrix.²
To simplify, let's use the example of a cancer diagnosis problem to explain these four indices. In the cancer diagnosis problem there are two classes: the class diagnosed as positive (has cancer) and the class diagnosed as negative (no cancer).

- TP (True Positive): the number of correct positive predictions. This occurs when the model correctly predicts that a person has cancer.
- TN (True Negative): the number of correct negative predictions. This happens when the model correctly predicts that a person does not have cancer.
² Source: https://viblo.asia/p/tim-hieu-ve-confusion-matrix-trong-machine-learnng-Az45bRpo5xY
- FP (False Positive, Type 1 Error): the number of incorrect positive predictions. This occurs when the model predicts that a person has cancer, but the person is actually healthy.
- FN (False Negative, Type 2 Error): the number of incorrect negative predictions. This happens when the model predicts that a person does not have cancer, but the person actually has cancer [8].
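In the fake news setting, the same four counts are tallied by comparing predicted labels against gold labels. A small sketch (the label lists are illustrative):

```python
def confusion_counts(y_true, y_pred, positive="fake"):
    """Tally TP, FN, FP, TN for a binary labeling task."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts

y_true = ["fake", "fake", "real", "real", "real"]
y_pred = ["fake", "real", "real", "fake", "real"]
print(confusion_counts(y_true, y_pred))  # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 2}
```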
2.4 Related Models

2.4.1 Support Vector Machine (SVM)

Figure 2 - 5: Samples on the margin are called the support vectors.³
In more detail, given a set of training examples, each labeled as belonging to one of two classes, the SVM training algorithm builds a model that assigns new examples to one of these categories, making it a binary linear classifier. SVM maps the training examples to points in space so as to maximize the margin, i.e. the distance between the two classes. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall [12].
³ Source: https://en.wikipedia.org/wiki/Support_vector_machine
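Once trained, a linear SVM classifies a new article by the sign of its decision function w·x + b. The weights below are hypothetical, purely to illustrate the mechanics:

```python
def linear_svm_predict(w, b, x):
    """A trained linear SVM assigns the class by the sign of w·x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "fake" if score >= 0 else "real"

# Hypothetical weights over two TF-IDF features.
w, b = [1.5, -2.0], -0.25
print(linear_svm_predict(w, b, [0.8, 0.1]))  # fake  (score = 0.75)
print(linear_svm_predict(w, b, [0.1, 0.9]))  # real  (score = -1.9)
```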
2.4.2 Logistic Regression
Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.

- Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets.
- Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.
- Logistic regression uses the concept of predictive modeling as regression, which is why it is called logistic regression; however, because it is used to classify samples, it falls under the classification algorithms [13].
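Concretely, a trained logistic model converts a weighted sum of features into a probability through the sigmoid function and then thresholds it. The weights below are invented for illustration:

```python
import math

def logistic_predict(w, b, x, threshold=0.5):
    """Probability = sigmoid(w·x + b); classify as positive above the threshold."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob, ("fake" if prob >= threshold else "real")

prob, label = logistic_predict([2.0, -1.0], -0.5, [1.0, 0.0])
print(round(prob, 3), label)  # 0.818 fake
```

The probability output is what distinguishes logistic regression from a plain linear classifier: it lets the system report how confident the "fake" verdict is, not just the verdict itself.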
2.5 Related Research
In the field of news analysis and information credibility, various studies have contributed to advancing our understanding of effective methodologies for assessing the reliability of articles. This section provides a brief overview of noteworthy studies that have delved into aspects such as detecting fake news, authentication methods, and the utilization of machine learning techniques. By examining these prior endeavors, we aim to identify gaps, build upon successful methodologies, and contribute novel insights to the ongoing discourse surrounding the crucial challenge of distinguishing trustworthy information in today's media landscape.
Table 2 - 1: Related research.

1. Fake news detection using two machine learning algorithms [14] (M. Sudhakar, K.P. Kaliyamurthie). Methods: Naive Bayes, Logistic Regression. Based on probability, the system predicts the authenticity or falsity of an article.

2. Fake News Detection Through ML and Deep Learning Approaches for Better Accuracy [15] (Anil Kumar Dubey, Mala Saraswat). Methods: XGBoost, LSTM. Applied XGBoost and LSTM algorithms for better accuracy and received ideal conditions of acceptance to accuracy.

3. Fake News Prediction: A Survey [16] (Pinky Saikia Dutta, Meghasmita Das, Sumedha Biswas, Mriganka Bora, Sankar Swami Saikia). Methods: Naive Bayes, Logistic Regression, Decision Tree. Detects fake news by reviewing it in two stages: characterization and disclosure.

4. Fake News Detection Using Machine Learning [17]. Methods: Support Vector Machine (SVM).

5. … AND CHALLENGES [18] (Võ Trung Hùng, Ninh Khánh Chi, Trần Anh Kiệt). The application utilizes machine learning techniques, based on both traditional methods and deep learning, to analyze content.
CHAPTER 3: METHODOLOGY

3.1 System Overview
The overall aim of our fake news detection project is to build a system capable of accurately identifying and flagging fake news articles. By leveraging BERT and machine learning, the system enables users to judge the credibility of news content.
[Flowchart: Data collection -> Data processing (clean, remove noise) -> PhoBERT fine-tuning and hyperparameter optimization -> Evaluate model -> Create API -> Deploy on web]
Figure 3 - 1: Illustrate the process in the system.
Our system can be divided into six parts. Part 1 is data collection. In this part, we collected articles from many different sources, including mainstream newspapers as well as some reactionary and tabloid outlets. After collection, our dataset contained more than 1,600 articles, including both real and fake news. Each article is tagged as fake news or real news.

The second part focuses on preparing and transforming the collected data to optimize the training of the PhoBERT model. Multiple sequential data processing techniques are applied: a text cleaning function, text normalization, building the processing pipeline, TF-IDF vectorization, and truncation. These processing steps aim to optimize the data, remove noise, and prepare the most suitable input for training and working effectively with BERT.
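The cleaning steps can be sketched as follows; the stop-word list and regular expressions here are simplified stand-ins for the ones actually used in the thesis:

```python
import re

STOPWORDS = {"và", "là", "của"}  # tiny illustrative subset, not the real list

def clean_text(text, max_tokens=256):
    """Lowercase, strip URLs and punctuation, drop stop words, truncate."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^\w\s]", " ", text)       # strip punctuation, keep letters/digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens[:max_tokens])       # respect the model's input budget

print(clean_text("Tin NÓNG: xem tại https://example.com và chia sẻ ngay!"))
# tin nóng xem tại chia sẻ ngay
```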
The third step in the process is refining the PhoBERT model for the fake news detection context, while also optimizing hyperparameters and fine-tuning layers. Initially we used the pre-trained PhoBERT model as a basic encoding layer. Then came the fine-tuning process, tuning the PhoBERT model on our specific data so that it can understand and classify fake news more accurately. Hyperparameters such as the learning rate, number of epochs, and batch size were optimized to improve model performance. The model then went through the process of adjusting the weights in its neural networks to enhance its ability to detect fake news.
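With the Hugging Face transformers library, this setup can be sketched as below. The hyperparameter values are illustrative assumptions, not the exact values tuned in the thesis, and calling build_trainer_parts downloads the pre-trained weights:

```python
# Illustrative hyperparameters; the values actually tuned in the thesis may differ.
CONFIG = {
    "model_name": "vinai/phobert-base",
    "num_labels": 2,          # real vs fake
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 4,
    "max_length": 256,        # truncation limit for tokenized articles
}

def build_trainer_parts():
    """Load PhoBERT with a 2-class classification head plus training arguments."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import (AutoTokenizer,
                              AutoModelForSequenceClassification,
                              TrainingArguments)
    tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"])
    model = AutoModelForSequenceClassification.from_pretrained(
        CONFIG["model_name"], num_labels=CONFIG["num_labels"])
    args = TrainingArguments(
        output_dir="phobert-fakenews",
        learning_rate=CONFIG["learning_rate"],
        per_device_train_batch_size=CONFIG["batch_size"],
        num_train_epochs=CONFIG["epochs"])
    return tokenizer, model, args
```

A Trainer built from these parts and the labeled dataset then runs the fine-tuning loop, updating the classification head (and, optionally, the encoder layers) on the Vietnamese fake/real labels.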
After running the model, we assess our system's ability to detect fake news. We use accuracy, precision, recall, and F1 score to measure performance, ensuring our model distinguishes real news from fake effectively. Through cross-validation, we test the model's reliability across various data subsets. This step is vital to confirm that our system performs well and is ready to detect fake news.
After the evaluation phase, we create an Application Programming Interface (API) to provide access to our fake news detection model. The API acts as an intermediary, allowing users to submit news content and receive a credibility assessment. In the final part, we focused on integrating our fake news detection model with a web interface, leveraging the previously developed API. This step is crucial for providing a practical and accessible platform for users to interact with our system. Our web interface, which currently runs locally, serves as a direct application of the API. It features a simple and intuitive user interface where individuals can input news articles and then check whether the news is fake or real.
In conclusion, our fake news detection system showcases the effective use of NLP and machine learning in a student project context. We have developed a functional tool that combines data processing with a PhoBERT-based model and an API, all integrated into a web interface. While it currently runs locally, this project has been a valuable learning experience and a practical approach to tackling the issue of fake news at a smaller scale.
3.2 Data Collection and Preprocessing
3.2.1 Data Collection

Expectations for new information, a thirst for knowledge, interest in social events, and a desire to maintain an understanding of the surrounding world are reasons why we read the news. Reading the news helps us stay informed, supports decision-making, enhances awareness, and strengthens social connections. Research on the reliability of news for readers of online media is still scarce. Therefore, we aim to extract news from media sources as our data. In this study, we establish a process for extracting information from authentic and fake news on media platforms such as Vnexpress, Viettan, and tingia, covering topics like politics, society, news, and law. The figure below illustrates the components of this process:
Figure 3 - 2: Illustrate the process for data collection.
The first step of this process is to search for news on media platforms. The data collection process begins by selecting news websites as sources for articles and key information. These websites offer a wealth of diverse, up-to-date, and relevant content.

Next, for reliable news, we chose Vnexpress.net, one of the leading digital newspapers in Vietnam, known for its strict criteria regarding the quality and accuracy of information. Articles from Vnexpress.net reflect events and topics transparently and objectively, covering both domestic and international affairs.
Figure 3 - 3: Illustrate a review in Vnexpress.
For disinformation and sensational news, we carefully curated content from viettan.org, a source notorious for disseminating provocative information that negatively impacts communities and society. This selection helps us distinguish more clearly between credible information and content that elicits a negative reaction during the categorization process.
Figure 3 - 4: Illustration of an article on Viettan.
Our dataset is meticulously organized with the fields Title, Tags, Link, Content, and Label, where labels are predefined as either real or fake news. The data collection and labeling process is carried out carefully to ensure accuracy and reliability for training classification models.
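The record layout described above can be sketched as a small data class. The field types and example values below are illustrative assumptions, not the thesis's actual storage format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema mirroring the dataset fields named in the text:
# Title, Tags, Link, Content, Label (0 = real news, 1 = fake news).
@dataclass
class NewsArticle:
    title: str
    tags: List[str]
    link: str
    content: str
    label: int  # 0 = real, 1 = fake

# Illustrative record; the headline, link, and body are placeholders.
article = NewsArticle(
    title="Example headline",
    tags=["Thời sự"],
    link="https://vnexpress.net/example",
    content="Example body text",
    label=0,
)
```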
The result of data collection and preprocessing
After gathering data from the websites, the study collected a total of 1,732 news articles, including 1,440 real articles and 292 fake articles. Additional data analysis is presented in Table 3-1 below:
Table 3 - 1: The result of data collection.
Type of news | Name of tag | Number of tags | Number of news | Grand Total
Data cleaning is also carried out to eliminate irrelevant or noisy information such as advertisements or unnecessary personal details. These steps ensure that the final dataset is a clear, high-quality collection of data that accurately reflects the research goal of detecting fake news. After cleaning, the study retained a total of 1,687 news articles, comprising 1,440 real articles and 247 fake articles. Additional data analysis is presented in Table 3-2 below:
Real news | Pháp luật, Thế giới, Thời sự | 1,440
The table indicates that the majority of the articles collected from news websites are real: 85% of the total collected are real news articles, with the remaining 15% being fake news.
3.2.2 Data Labeling
The first step is labeling the data, with label 0 understood as authentic news and label 1 as fake news. Through this, we can provide a reasonable analysis of the characteristics of the data samples. Data samples labeled 0 exhibit authenticity and reliability: it can be assumed that they contain verified information and are highly rated for reliability. Articles or events in this category may be considered the most reliable and trustworthy information. On the contrary, data samples labeled 1 are likely to contain inaccurate, biased, or even rumor-based information. Observing these characteristics in the data can help us better understand the trends and patterns of the labeled information. The structured data is presented in Table 3-3 below:
Table 3 - 3: Data labeling
Kiếm hàng trăm triệu đồng nhờ xà phòng nghệ thuật
Thủ tướng: Đến năm 2030, Tây Nguyên cần hoàn thành 5 cao tốc
Bãi tắm Cửa Tùng thay đổi sau 20 năm
Với 90% cổ phần ở Ngân hàng SCB, bà Trương Mỹ Lan đã sử dụng hơn 1.000 công ty trong hệ sinh thái của mình để bỏ túi riêng 304 ngàn tỷ đồng
tử hình với Lê Văn Mạnh, người đã trải qua 7.000 ngày biệt giam
Nếu nói đến bộ môn cờ vây truyền thống của văn hóa Trung Hoa, hẳn nhiên Hoa Kỳ là một "tay mơ."
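The 0/1 labeling convention above can be sketched as a small helper that maps an article's source to a label. The domain-to-label mapping shown is an illustrative assumption; in the thesis, labels are assigned during collection.

```python
# Source-to-label mapping, assumed for illustration: articles collected
# from Vnexpress.net are labeled 0 (real), from viettan.org 1 (fake).
REAL_SOURCES = {"vnexpress.net"}
FAKE_SOURCES = {"viettan.org"}

def label_article(source_domain: str) -> int:
    """Return 0 for real news, 1 for fake news."""
    if source_domain in REAL_SOURCES:
        return 0
    if source_domain in FAKE_SOURCES:
        return 1
    raise ValueError(f"Unknown source: {source_domain}")
```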
3.2.3 Data Cleaning and Preprocessing
Before training, the data set should be cleaned and preprocessed. During natural language processing, we used data cleaning and normalization steps to ensure cleanliness and uniformity, creating the best conditions for machine learning and analysis. First, we removed URLs and HTML tags from the text; this helps remove irrelevant and potentially noisy elements. Next, we apply another regular expression rule to remove all characters other than numbers and letters (including Vietnamese accented letters) to simplify the text.
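A minimal sketch of these cleaning steps with Python's "re" module is shown below; the exact patterns are assumptions for illustration, not the thesis's actual code.

```python
import re

def clean_text(text: str) -> str:
    # 1. Remove URLs.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # 2. Remove HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # 3. Keep only letters, digits, and spaces; \w matches Unicode
    #    letters, so Vietnamese accented characters are preserved.
    text = re.sub(r"[^\w\s]", " ", text)
    # 4. Normalize whitespace so word spacing is consistent.
    text = re.sub(r"\s+", " ", text).strip()
    return text

clean_text("Xem <b>tin</b> tại https://vnexpress.net nhé!!!")  # "Xem tin tại nhé"
```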
Then, we normalize whitespace, removing excess spaces so that the spacing between words is consistent. This helps increase the accuracy of later text analysis. We also use the "text_normalize" function from the "underthesea" library to normalize the text, helping to homogenize expressions in the text. For example:
Original sentence: "Hnay tôi đi làm muộn vì trời mưa to quá!"
After applying "text_normalize": "Hôm nay tôi đi làm muộn vì trời mưa to quá!"
where the acronym "Hnay" is converted to the full form "Hôm nay".
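A toy illustration of this kind of abbreviation expansion is shown below; the lookup table is invented for illustration and is not how "text_normalize" is implemented internally.

```python
# Hypothetical abbreviation table; real normalizers use much larger
# dictionaries plus spelling and diacritic rules.
ABBREVIATIONS = {
    "Hnay": "Hôm nay",
    "ko": "không",
}

def expand_abbreviations(sentence: str) -> str:
    """Replace each known abbreviation with its full form."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in sentence.split())

expand_abbreviations("Hnay tôi đi làm muộn")  # "Hôm nay tôi đi làm muộn"
```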
Next, we perform tokenization using the "word_tokenize" function, also from the "underthesea" library, which allows us to effectively separate words in Vietnamese text. This word separation is an important step in preparing data for the deep learning models used later [9]. For instance:
Original sentence: "Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư"
After applying "word_tokenize(sentence)": ['Bác sĩ', 'bây giờ', 'có thể', 'thản nhiên', 'báo tin', 'bệnh nhân', 'bị', 'ung thư']
After applying "word_tokenize(sentence, format="text")": 'Bác_sĩ bây_giờ có_thể thản_nhiên báo_tin bệnh_nhân bị ung_thư'
After completing word tokenization, we applied Term Frequency - Inverse Document Frequency (TF-IDF). TF-IDF is frequently used in machine learning algorithms in various capacities. It includes two components: term frequency and inverse document frequency.
The term frequency measures how often a word occurs in a document. There are many ways of calculating this frequency, the simplest being a raw count of the instances a word appears in a document. The frequency can then be adjusted, for example by the length of the document, or by the raw frequency of the most frequent word in the document.
The inverse document frequency measures a word's rarity across a set of documents: a lower IDF value indicates a more common word. It is calculated by dividing the total number of documents by the number of documents containing the word, then taking the logarithm of this quotient.
The TF-IDF score for the word t in the document d from the document set C is:

tf-idf(t, d, C) = tf(t, d) × idf(t, C), where idf(t, C) = log(|C| / |Ct|)

- tf(t, d): the number of times term t appears in document d
- |C|: the number of documents in the corpus
- |Ct| = |{d ∈ C : t ∈ d}|: the number of documents containing term t
In order to calculate the TF-IDF weight, we need to compute the term frequency (TF) first.
Table 3 - 4: Term frequency (TF).

Term | Document 1 | Document 2
Mục tiêu | 1 | 0
To obtain the term frequency, we count the number of occurrences of term t in the document. For instance, "Mục tiêu" appears one time in document 1, hence tf("Mục tiêu", Doc1) = 1. In contrast, "Mục tiêu" does not appear in document 2, so tf("Mục tiêu", Doc2) = 0.
Next, we need to count the number of documents containing the term t. This value, the document frequency (df), corresponds to |Ct| in the formula.
Table 3 - 5: Document frequency (df).

The df of "Mục tiêu" is 1, since the term is present only in the first document. This is the case for the other terms as well.
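The raw-count TF and log-ratio IDF described above can be combined in a short sketch. The toy two-document corpus below is an assumption chosen so that "Mục tiêu" appears once in document 1 and never in document 2, mirroring the worked example; the second document's contents are invented for illustration.

```python
import math

def tf(term: str, doc: list) -> int:
    """Raw-count term frequency: occurrences of term in the document."""
    return doc.count(term)

def idf(term: str, corpus: list) -> float:
    """log(|C| / |Ct|): corpus size over documents containing the term."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term: str, doc: list, corpus: list) -> float:
    return tf(term, doc) * idf(term, corpus)

# Toy corpus: term "Mục tiêu" is in document 1 only.
doc1 = ["Mục tiêu", "phát triển"]
doc2 = ["phát triển"]
corpus = [doc1, doc2]

tf("Mục tiêu", doc1)              # 1
tf("Mục tiêu", doc2)              # 0
tf_idf("Mục tiêu", doc1, corpus)  # 1 * log(2/1) ≈ 0.693
```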
Then, we calculate the value of IDF.