FINE-TUNE PHOBERT FOR VIETNAMESE FAKE NEWS DETECTION
VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
NGUYEN DUC MANH - 19521827
Additionally, we express our deep gratitude to the lecturers at the University of Information Technology, especially those from the Information Systems Faculty. Their extensive knowledge and passionate teaching have contributed significantly to our academic journey. The understanding and insights they shared have laid a solid foundation that enabled us to complete this thesis.

Throughout the execution of this project, we endeavored to apply our foundational knowledge and explore new technologies to construct this graduate thesis. However, due to limited experience and expertise, imperfections were inevitable. We therefore eagerly look forward to receiving feedback and suggestions from our professors, which will help us refine our knowledge and skills.

We also wish to convey our sincere thanks to our families and friends, who have consistently supported and accompanied us throughout the research and completion of this thesis.
Thank you very much!
Authors
Nguyen Duc Manh
Nguyen Thanh
TABLE OF CONTENTS

ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND ABBREVIATIONS
CHAPTER 1: INTRODUCTION
  1.1 Context
  1.2 Purpose
  1.3 Objectives and Scope
    1.3.1 Objectives
    1.3.2 Scope
  1.4 Motivation
  1.5 Report Outline
CHAPTER 2: LITERATURE REVIEW AND THEORETICAL BACKGROUND
  2.1 Fake News and Its Impact
    2.1.1 Overview of Fake News
    2.1.2 The Significance of Detecting Fake News
  2.2 BERT and Its Variants in NLP
    2.2.1 Introduction to BERT
    2.2.2 PhoBERT: A BERT Variant for Vietnamese Language Processing
  2.3 Model Evaluation Metrics
    2.3.1 Accuracy, Precision, and Recall
    2.3.2 F1-Score
    2.3.3 ROC and AUC
    2.3.4 Confusion Matrix
  2.4 Related Models
    2.4.1 Support Vector Machine (SVM)
    2.4.2 Logistic Regression
  2.5 Related Research
CHAPTER 3: METHODOLOGY
  3.1 System Overview
  3.2 Data Collection and Preprocessing
    3.2.1 Data Collection
    3.2.2 Data Labeling
    3.2.3 Data Cleaning and Preprocessing
    3.2.4 Data Training
  3.3 PhoBERT for Vietnamese Fake News Detection
    3.3.1 PhoBERT - pretrained BERT for Vietnamese language
    3.3.2 Fine-Tune PhoBERT for Vietnamese Fake News Detection
  3.4 System Design
    3.4.1 Architecture of the Fake News Detection System
    3.4.2 Integration of the PhoBERT Model with the Application
  3.5 System Deployment and Testing
    3.5.1 Backend Development
    3.5.2 Frontend Development
CHAPTER 4: EXPERIMENTAL RESULTS AND APPLICATION DEMO
  4.1 Model Performance and Results
    4.1.2 Models Evaluation
  4.2 Application Demo
  4.3 Discussion of Findings
  4.4 Limitations of the Study
CHAPTER 5: CONCLUSIONS AND FUTURE WORKS
  5.1 Achieved Results
  5.2 Implications of the Research
  5.3 Recommendations for Future Research
REFERENCES
LIST OF TABLES

- The result of data collecting
- The result of data processing
- Data Labeling
- Term frequency (TF)
- Document frequency (DF)
- Inverse document frequency (IDF)
- TF-IDF
- Data Splitting
- Data Splitting
- Training Result of the PhoBERT Model
- Validation Result of the PhoBERT Model
- Testing Result of the PhoBERT Model
- Training results of the three evaluated models
- Testing results of the three evaluated models
LIST OF FIGURES

Figure 2 - 1: Perfect Classifier
Figure 2 - 2: Typical ROC Curve
Figure 2 - 3: Random Classifier
Figure 2 - 4: Confusion matrix
Figure 2 - 5: Samples on the margin are called the support vectors
Figure 3 - 1: Illustrate the process in the system
Figure 3 - 2: Illustrate the process for data collection
Figure 3 - 3: Illustrate a review in Vnexpress
Figure 3 - 4: Illustrate a review in Viettan
Figure 3 - 5: Text Processing and Preparation for PhoBERT Model Training
Figure 3 - 6: Fine-Tuning Process for PhoBERT Model
Figure 3 - 7: Illustrate the process for architecture of the Fake News Detection System
Figure 3 - 8: FastAPI logo
Figure 3 - 9: Illustrate user interface
Figure 4 - 1: Illustrate the running time in seconds of the trained models
Figure 4 - 2: Illustrate the metrics result of each trained classification model
Figure 4 - 3: Illustrate the running time in seconds of the test models
Figure 4 - 4: Illustrate the metrics result of each test classification model
Figure 4 - 5: Illustrate dataset
Figure 4 - 6: Illustrate result Real news
Figure 4 - 7: Illustrate result Fake news
Figure 4 - 8: Illustrate result Real news
Figure 4 - 9: Illustrate result Fake news
Figure 4 - 10: Illustrate Vnexpress web
Figure 4 - 11: Illustrate result of the news that does not exist
Figure 4 - 12: Illustrate Vnexpress web
Figure 4 - 13: Illustrate result of the news that does not exist
Figure 4 - 14: Illustrate result of the news that does not exist
LIST OF ACRONYMS AND ABBREVIATIONS
No.  Acronym   Meaning
1    NLP       Natural Language Processing
2    BERT      Bidirectional Encoder Representations from Transformers
3    AI        Artificial Intelligence
4    SLA       Service Level Agreement
5    NER       Named Entity Recognition
6    NLTK      Natural Language Toolkit
7    NLI       Natural Language Inference
8    ROC       Receiver Operator Characteristic
9    AUC       Area Under the Curve
10   SVM       Support Vector Machine
11   LSTM      Long Short-Term Memory
12   KNN       K-Nearest Neighbors Algorithm
13   TF-IDF    Term Frequency - Inverse Document Frequency
14   API       Application Programming Interface
15   ReLU      Rectified Linear Unit
16   TP        True Positive
17   TN        True Negative
18   FP        False Positive (Type 1 Error)
19   FN        False Negative (Type 2 Error)
CHAPTER 1: INTRODUCTION
1.1 Context
In the digital age, the Internet has become an essential part of our daily lives. It not only expands our access to information but also serves as the foundation for the growth of social networks, where people can share, exchange, and receive information quickly. However, this strong development also brings significant challenges, especially the issue of fake news. Fake news can cause misunderstanding and panic among the public and can negatively affect individuals and organizations when false information is spread.

In Vietnam, detecting and preventing fake news on online platforms is becoming an urgent issue. With the popularity of social media and online news sites, Vietnamese Internet users are increasingly at risk of encountering inaccurate or fabricated information. This creates an urgent need for a support system capable of accurately and efficiently predicting and classifying real and fake news.

This thesis focuses on building such a system, using advanced deep learning models to analyze and assess the authenticity of information, especially in the context of the Vietnamese language and culture. The goal is to create a useful tool that helps Internet users verify information themselves and protects them against the negative impacts of fake news.
1.2 Purpose
The main aim of this report is to delve into the essential concepts and algorithms that underpin fake news detection systems. We focus on the practical application of these algorithms, followed by an analysis and assessment of the results we obtain. To begin, we compile datasets from leading and credible online newspapers such as VnExpress, Thanh Niên, Tuổi Trẻ, Dân Trí, VietNamNet, and others. This text is processed through several steps, including data review and preprocessing tasks such as converting text to lowercase, stripping punctuation, and removing stop words. After data processing, machine learning models are applied. The news articles we collected carry pre-assigned labels indicating whether they are fake or real. This pre-labeling aids in the training and validation of our machine learning models, enabling them to effectively distinguish between fake and real news items.
1.3 Objectives and Scope
1.3.1 Objectives
- To generate all datasets by collecting news articles from various online platforms.
- To employ text mining techniques that encompass fake news detection algorithms and natural language processing tools.
1.3.2 Scope
- Datasets comprising online news articles and user-generated content.
- Fake news detection system.
- Natural Language Processing (NLP) techniques.
- Sentiment analysis to gauge the veracity of news content.
- Implementation of BERT and other NLP models for enhanced detection accuracy.
- System analysis and design to create a robust detection framework.
- Development of a user-friendly web interface for real-time fake news assessment.
1.4 Motivation

Our motivation is to develop a system that empowers users to discern the truthfulness of news content related to their interests or current events. In an era where misinformation can spread rapidly, our project addresses the critical challenge of identifying and flagging fake news, thereby saving users from the pitfalls of misinformation. The overwhelming presence of unverified news and the complexity of verifying each piece of information make it imperative to have a system that simplifies this process, making it less time-consuming and more accessible for everyday users.
1.5 Report Outline

Chapter 3: Methodology - research methods and approaches used in developing the fake news prediction support system.

Chapter 4: Experimental results and application demo - presents the results of the implemented system, a demo, and a discussion of the results.

Chapter 5: Conclusions and future works - summarizes the research, presents the conclusions drawn from the findings, and suggests directions for future research in related areas.
CHAPTER 2: LITERATURE REVIEW AND THEORETICAL BACKGROUND
2.1 Fake News and Its Impact
2.1.1 Overview of Fake News

Fake news or hoax news is false or misleading information (hoaxes, propaganda, and disinformation) presented as news. Fake news often aims to damage the reputation of a person or entity, or to make money through advertising revenue. Although false news has always been spread throughout history, the term "fake news" was first used in the 1890s, when sensational reports in newspapers were common. Nevertheless, the term does not have a fixed definition and has been applied broadly to any type of false information. It has also been used by high-profile people to describe any news unfavorable to them. Further, disinformation involves spreading false information with harmful intent and is sometimes generated and propagated by hostile foreign actors, particularly during elections. In some definitions, fake news includes satirical articles misinterpreted as genuine, and articles that employ sensationalist or clickbait headlines that are not supported in the text. Because of this diversity of types of false news, researchers are beginning to favor "information disorder" as a more neutral and informative term [1].
2.1.2 The Significance of Detecting Fake News
- Maintaining Public Trust: The spread of fake news can undermine trust in the media, government institutions, and reputable sources of information. Detecting and correcting fake news is necessary to protect or restore this trust.
- Countering Manipulation and Disinformation: Fake news is often used as a tool of manipulation for political or personal purposes. It can also be used by outside forces to cause social disorder. Detecting and addressing fake news helps prevent these manipulation efforts.
- Reducing Social and Political Divisions: Fake news often exacerbates social and political rifts. By exposing and countering misinformation, there is the potential to reduce division and promote fair, fact-based debate.
- Media Awareness: Responding to fake news also includes educating the public to recognize and evaluate news critically, emphasizing the importance of distinguishing between trustworthy information and misinformation.
- Facing Legal and Ethical Challenges: The growth of fake news brings new legal and ethical challenges. Detecting fake news is an important part of developing policies and regulations that balance the protection of freedom of expression with the need to prevent the spread of harmful information.
2.2 BERT and Its Variants in NLP

2.2.1 Introduction to BERT

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a model in natural language processing. Its name succinctly describes BERT's bidirectional nature and the Transformer's attention mechanism, which processes each word in relation to all the other words in a sentence, unlike traditional models that process words sequentially. BERT is pre-trained on a vast corpus of text and is designed to understand the nuances of language by predicting missing words from their context and the relationship between consecutive sentences. This pre-training enables BERT to be fine-tuned for a variety of tasks, such as sentiment analysis, question answering, and language inference, with minimal additional task-specific training.
2.2.2 PhoBERT: A BERT Variant for Vietnamese Language Processing

Pre-trained PhoBERT models are state-of-the-art language models for Vietnamese ("Pho", i.e. "Phở", is the most popular Vietnamese food). PhoBERT has two versions, PhoBERT-base and PhoBERT-large, which are the first large-scale monolingual language models pre-trained for Vietnamese. These models significantly outperform the multilingual model XLM-R on various Vietnamese-specific NLP tasks [7].

PhoBERT employs the same architecture as BERT-base and BERT-large, optimized following RoBERTa. It was trained on a 20GB word-level Vietnamese corpus, addressing challenges in Vietnamese language modeling such as the differentiation between syllables and word tokens.

PhoBERT sets new state-of-the-art results on four Vietnamese NLP tasks: part-of-speech tagging, dependency parsing, named-entity recognition (NER), and natural language inference (NLI). It achieved these results by employing large-scale pre-training data and addressing the unique characteristics of the Vietnamese language.

PhoBERT represents a significant advancement in Vietnamese NLP. Its ability to outperform existing models on language-specific tasks demonstrates the effectiveness of dedicated, large-scale, monolingual language models. The release of PhoBERT is expected to foster future research and applications in Vietnamese NLP, including potential use in systems like fake news detection.
2.3 Model Evaluation Metrics
2.3.1 Accuracy, Precision, and Recall

Accuracy

Accuracy is a metric that measures how often a machine learning model correctly predicts the outcome. You can calculate accuracy by dividing the number of correct predictions by the total number of predictions.

This metric is simple to calculate and understand. Almost everyone has an intuitive perception of accuracy: a reflection of the model's ability to correctly classify data points [4].
Precision
Precision is a metric that measures how often a machine learning model correctly predicts the positive class. You can calculate precision by dividing the number of correct positive predictions (true positives) by the total number of instances the model predicted as positive (both true and false positives):

precision = true positives / (true positives + false positives)

You can measure precision on a scale of 0 to 1 or as a percentage. The higher the precision, the better. You can achieve a perfect precision of 1.0 when the model is always right when predicting the target class: it never flags anything in error [4].
Recall
Recall is a metric that measures how often a machine learning model correctly identifies positive instances (true positives) among all the actual positive samples in the dataset. You can calculate recall by dividing the number of true positives by the total number of positive instances. The latter includes true positives (successfully identified cases) and false negatives (missed cases):

recall = true positives / (true positives + false negatives)

You can measure recall on a scale of 0 to 1 or as a percentage. The higher the recall, the better. You can achieve a perfect recall of 1.0 when the model can find all instances of the target class in the dataset.

Recall can also be called sensitivity or true positive rate. The term "sensitivity" is more commonly used in medical and biological research than in machine learning. For example, you can refer to the sensitivity of a diagnostic medical test to describe its ability to correctly expose the majority of true positive cases. The concept is the same, but "recall" is the more common term in machine learning [4].
2.3.2 F1-Score

The F1 score is a machine learning evaluation metric that measures a model's accuracy. It combines the precision and recall scores of a model.

Precision measures how many of the "positive" predictions made by the model were correct.

Recall measures how many of the positive class samples present in the dataset were correctly identified by the model.

The F1 score combines precision and recall using their harmonic mean, and maximizing the F1 score implies simultaneously maximizing both precision and recall.
The F1 score is defined in terms of the precision and recall scores, which are mathematically defined as follows:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)
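The four metrics above can be computed directly from confusion-matrix counts. The following minimal Python sketch mirrors the formulas; the counts are hypothetical, for illustration only:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical run: 90 fake articles flagged correctly, 10 missed,
# 5 real articles flagged by mistake, 95 passed correctly.
acc, p, r, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))  # 0.925 0.947 0.9 0.923
```

Note how precision (0.947) and recall (0.9) can differ noticeably even when accuracy looks high, which is why the F1 score is reported alongside them.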
2.3.3 ROC and AUC

The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values, essentially separating the 'signal' from the 'noise.' In other words, it shows the performance of a classification model at all classification thresholds. The Area Under the Curve (AUC) measures the ability of a binary classifier to distinguish between classes and is used as a summary of the ROC curve.
Figure 2 - 1: Perfect Classifier.¹
When AUC = 1, the classifier can correctly distinguish between all the positive and negative class points. If, however, the AUC had been 0, the classifier would predict all negatives as positives and all positives as negatives.
Figure 2 - 2: Typical ROC Curve.
When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative ones. This is because the classifier detects more true positives and true negatives than false negatives and false positives.
Figure 2 - 3: Random Classifier.
When AUC = 0.5, the classifier is not able to distinguish between positive and negative class points, meaning that it predicts either a random class or a constant class for all the data points.

So, the higher the AUC value of a classifier, the better its ability to distinguish between positive and negative classes [6].
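The AUC has an equivalent interpretation that is easy to verify in code: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A brute-force sketch with hypothetical scores:

```python
def auc_score(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model scores for fake (positive) and real (negative) articles.
print(auc_score([0.9, 0.8, 0.6], [0.7, 0.3, 0.2]))  # 8/9: close to a perfect classifier
```

A perfect classifier scores 1.0 on this check and a random one 0.5, matching the three cases discussed above.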
2.3.4 Confusion Matrix

The confusion matrix is a method for evaluating the results of classification problems that considers both accuracy and recall metrics for predictions in each class. A confusion matrix consists of the following four indices for each classification class:
¹ Source: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
Figure 2 - 4: Confusion matrix.²
To simplify, let's use the example of a cancer diagnosis problem to explain these four indices. In the cancer diagnosis problem there are two classes: the class diagnosed as positive (has cancer) and the class diagnosed as negative (no cancer).

- TP (True Positive): the number of correct positive predictions. This occurs when the model correctly predicts that a person has cancer.
- TN (True Negative): the number of correct negative predictions. This happens when the model correctly predicts that a person does not have cancer.
² Source: https://viblo.asia/p/tim-hieu-ve-confusion-matrix-trong-machine-learnng-Az45bRpo5xY
- FP (False Positive, Type 1 Error): the number of incorrect positive predictions. This occurs when the model predicts that a person has cancer, but the person is actually healthy.
- FN (False Negative, Type 2 Error): the number of incorrect negative predictions. This happens when the model predicts that a person does not have cancer, but the person actually has cancer [8].
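In the fake news setting, the same four counts are tallied by comparing predicted labels against gold labels. A small sketch (the label lists are illustrative):

```python
def confusion_counts(y_true, y_pred, positive="fake"):
    """Tally TP, FN, FP, TN for a binary labeling task."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for t, p in zip(y_true, y_pred):
        if t == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts

y_true = ["fake", "fake", "real", "real", "real"]
y_pred = ["fake", "real", "real", "fake", "real"]
print(confusion_counts(y_true, y_pred))  # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 2}
```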
2.4 Related Models

2.4.1 Support Vector Machine (SVM)

Figure 2 - 5: Samples on the margin are called the support vectors.³
In more detail, given a set of training examples, each labeled as belonging to one of two classes, the SVM training algorithm builds a model that assigns new examples to one of these categories, making it a binary linear classifier. SVM maps the training examples to points in space so as to maximize the margin, i.e. the distance between the two classes. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall [12].
³ Source: https://en.wikipedia.org/wiki/Support_vector_machine
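Once trained, a linear SVM classifies a new article by the sign of its decision function w·x + b. The weights below are hypothetical, purely to illustrate the mechanics:

```python
def linear_svm_predict(w, b, x):
    """A trained linear SVM assigns the class by the sign of w·x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "fake" if score >= 0 else "real"

# Hypothetical weights over two TF-IDF features.
w, b = [1.5, -2.0], -0.25
print(linear_svm_predict(w, b, [0.8, 0.1]))  # fake  (score = 0.75)
print(linear_svm_predict(w, b, [0.1, 0.9]))  # real  (score = -1.9)
```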
2.4.2 Logistic Regression
Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.

- Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets.
- Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.
- Logistic regression uses the concept of predictive modeling as regression, which is why it is called logistic regression; however, because it is used to classify samples, it falls under the classification algorithms [13].
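Concretely, a trained logistic model converts a weighted sum of features into a probability through the sigmoid function and then thresholds it. The weights below are invented for illustration:

```python
import math

def logistic_predict(w, b, x, threshold=0.5):
    """Probability = sigmoid(w·x + b); classify as positive above the threshold."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob, ("fake" if prob >= threshold else "real")

prob, label = logistic_predict([2.0, -1.0], -0.5, [1.0, 0.0])
print(round(prob, 3), label)  # 0.818 fake
```

The probability output is what distinguishes logistic regression from a plain linear classifier: it lets the system report how confident the "fake" verdict is, not just the verdict itself.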
2.5 Related Research
In the field of news analysis and information credibility, various studies have contributed to advancing our understanding of effective methodologies for assessing the reliability of articles. This section provides a brief overview of noteworthy studies that have delved into aspects such as detecting fake news, authentication methods, and the utilization of machine learning techniques. By examining these prior endeavors, we aim to identify gaps, build upon successful methodologies, and contribute novel insights to the ongoing discourse surrounding the crucial challenge of distinguishing trustworthy information in today's media landscape.
Table 2 - 1: Related research.

1. Fake news detection using two machine learning algorithms [14] (M. Sudhakar, K.P. Kaliyamurthie). Methods: Naive Bayes, Logistic Regression. Based on probability, the system predicts the authenticity or falsity of an article.

2. Fake News Detection Through ML and Deep Learning Approaches for Better Accuracy [15] (Anil Kumar Dubey, Mala Saraswat). Methods: XGBoost, LSTM. Applied XGBoost and LSTM algorithms for better accuracy and received ideal conditions of acceptance to accuracy.

3. Fake News Prediction: A Survey [16] (Pinky Saikia Dutta, Meghasmita Das, Sumedha Biswas, Mriganka Bora, Sankar Swami Saikia). Methods: Naive Bayes, Logistic Regression, Decision Tree. Detects fake news by reviewing it in two stages: characterization and disclosure.

4. Fake News Detection Using Machine Learning [17]. Methods: Support Vector Machine (SVM).

5. … AND CHALLENGES [18] (Võ Trung Hùng, Ninh Khánh Chi, Trần Anh Kiệt). The application utilizes machine learning techniques, based on both traditional methods and deep learning, to analyze content.
CHAPTER 3: METHODOLOGY

3.1 System Overview
The overall aim of our fake news detection project is to build a system capable of accurately identifying and flagging fake news articles. By leveraging BERT and machine learning, the system enables users to judge the credibility of news content.
[Flowchart: Data collection -> Data processing (clean, remove noise) -> PhoBERT fine-tuning and hyperparameter optimization -> Evaluate model -> Create API -> Deploy on web]
Figure 3 - 1: Illustrate the process in the system.
Our system can be divided into six parts. Part 1 is data collection. In this part, we collected articles from many different sources, including mainstream newspapers as well as some reactionary and tabloid outlets. After collection, our dataset contained more than 1,600 articles, including both real and fake news. Each article is tagged as fake news or real news.

The second part focuses on preparing and transforming the collected data to optimize the training of the PhoBERT model. Multiple sequential data processing techniques are applied: a text cleaning function, text normalization, building the processing pipeline, TF-IDF vectorization, and truncation. These processing steps aim to optimize the data, remove noise, and prepare the most suitable input for training and working effectively with BERT.
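The cleaning steps can be sketched as follows; the stop-word list and regular expressions here are simplified stand-ins for the ones actually used in the thesis:

```python
import re

STOPWORDS = {"và", "là", "của"}  # tiny illustrative subset, not the real list

def clean_text(text, max_tokens=256):
    """Lowercase, strip URLs and punctuation, drop stop words, truncate."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"[^\w\s]", " ", text)       # strip punctuation, keep letters/digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens[:max_tokens])       # respect the model's input budget

print(clean_text("Tin NÓNG: xem tại https://example.com và chia sẻ ngay!"))
# tin nóng xem tại chia sẻ ngay
```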
The third step in the process is refining the PhoBERT model for the fake news detection context, while also optimizing hyperparameters and fine-tuning layers. Initially we used the pre-trained PhoBERT model as a basic encoding layer. Then came the fine-tuning process, tuning the PhoBERT model on our specific data so that it can understand and classify fake news more accurately. Hyperparameters such as the learning rate, number of epochs, and batch size were optimized to improve model performance. The model then went through the process of adjusting the weights in its neural networks to enhance its ability to detect fake news.
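With the Hugging Face transformers library, this setup can be sketched as below. The hyperparameter values are illustrative assumptions, not the exact values tuned in the thesis, and calling build_trainer_parts downloads the pre-trained weights:

```python
# Illustrative hyperparameters; the values actually tuned in the thesis may differ.
CONFIG = {
    "model_name": "vinai/phobert-base",
    "num_labels": 2,          # real vs fake
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 4,
    "max_length": 256,        # truncation limit for tokenized articles
}

def build_trainer_parts():
    """Load PhoBERT with a 2-class classification head plus training arguments."""
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import (AutoTokenizer,
                              AutoModelForSequenceClassification,
                              TrainingArguments)
    tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"])
    model = AutoModelForSequenceClassification.from_pretrained(
        CONFIG["model_name"], num_labels=CONFIG["num_labels"])
    args = TrainingArguments(
        output_dir="phobert-fakenews",
        learning_rate=CONFIG["learning_rate"],
        per_device_train_batch_size=CONFIG["batch_size"],
        num_train_epochs=CONFIG["epochs"])
    return tokenizer, model, args
```

A Trainer built from these parts and the labeled dataset then runs the fine-tuning loop, updating the classification head (and, optionally, the encoder layers) on the Vietnamese fake/real labels.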
After running the model, we assess our system's ability to detect fake news. We use accuracy, precision, recall, and F1 score to measure performance, ensuring our model distinguishes real news from fake effectively. Through cross-validation, we test the model's reliability across various data subsets. This step is vital to confirm that our system performs well and is ready to detect fake news.
After the evaluation phase, we create an Application Programming Interface (API) to provide access to our fake news detection model. The API acts as an intermediary, allowing users to submit news content and receive a credibility assessment. In the final part, we focused on integrating our fake news detection model with a web interface, leveraging the previously developed API. This step is crucial for providing a practical and accessible platform for users to interact with our system. Our web interface, which currently runs locally, serves as a direct application of the API. It features a simple and intuitive user interface where individuals can input news articles and then check whether the news is fake or real.
In conclusion, our fake news detection system showcases the effective use of NLP and machine learning in a student project context. We have developed a functional tool that combines data processing with a PhoBERT-based model and an API, all integrated into a web interface. While it currently runs locally, this project has been a valuable learning experience and a practical approach to tackling the issue of fake news at a smaller scale.
3.2 Data Collection and Preprocessing
3.2.1 Data Collection

Expectations for new information, a thirst for knowledge, interest in social events, and a desire to maintain an understanding of the surrounding world are reasons why we read the news. Reading the news helps us stay informed, supports decision-making, enhances awareness, and strengthens social connections. Research on the reliability of news for readers of online media is still scarce. Therefore, we aim to extract news from media sources as our data. In this study, we establish a process for extracting information from authentic and fake news on media platforms such as Vnexpress, Viettan, and tingia, covering topics like politics, society, news, and law. The figure below illustrates the components of this process:
Figure 3 - 2: Illustrate the process for data collection.
The first step of this process is to search for news on media platforms. The data collection process begins by selecting news websites as sources for articles and key information. These websites offer a wealth of diverse, up-to-date, and relevant content.

Next, for reliable news, we chose Vnexpress.net, one of the leading digital newspapers in Vietnam, known for its strict criteria regarding the quality and accuracy of information. Articles from Vnexpress.net reflect events and topics transparently and objectively, covering both domestic and international affairs.
Figure 3 - 3: Illustrate a review in Vnexpress.
For disinformation and sensational news, we carefully curated content from viettan.org, a source notorious for disseminating provocative information that negatively impacts communities and society. This selection helps us distinguish more clearly between credible information and content that elicits a negative reaction during the categorization process.
Figure 3 - 4: Illustration of an article on Viettan.
Our dataset is meticulously organized with the fields Title, Tags, Link, Content, and Label, where labels are predefined as either real or fake news. The data collection and labeling process is carried out carefully to ensure accuracy and reliability for training classification models.
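The record layout described above can be sketched as a small data class. The field types and example values below are illustrative assumptions, not the thesis's actual storage format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical schema mirroring the dataset fields named in the text:
# Title, Tags, Link, Content, Label (0 = real news, 1 = fake news).
@dataclass
class NewsArticle:
    title: str
    tags: List[str]
    link: str
    content: str
    label: int  # 0 = real, 1 = fake

# Illustrative record; the headline, link, and body are placeholders.
article = NewsArticle(
    title="Example headline",
    tags=["Thời sự"],
    link="https://vnexpress.net/example",
    content="Example body text",
    label=0,
)
```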
The result of data collection and preprocessing
After gathering data from the websites, the study collected a total of 1,732 news articles, including 1,440 real articles and 292 fake articles. Additional data analysis is presented in Table 3-1 below:
Table 3 - 1: The result of data collection.
Type of news | Name of tag | Number of tags | Number of news | Grand Total
Data cleaning is also carried out to eliminate irrelevant or noisy information such as advertisements or unnecessary personal details. These steps ensure that the final dataset is a clear, high-quality collection of data that accurately reflects the research goal of detecting fake news. After cleaning, the study retained a total of 1,687 news articles, comprising 1,440 real articles and 247 fake articles. Additional data analysis is presented in Table 3-2 below:
Real news | Pháp luật, Thế giới, Thời sự | 1,440
The table indicates that the majority of the articles collected from news websites are real: 85% of the total collected are real news articles, with the remaining 15% being fake news.
3.2.2 Data Labeling
The first step is labeling the data, with label 0 understood as authentic news and label 1 as fake news. Through this, we can provide a reasonable analysis of the characteristics of the data samples. Data samples labeled 0 exhibit authenticity and reliability: it can be assumed that they contain verified information and are highly rated for reliability. Articles or events in this category may be considered the most reliable and trustworthy information. On the contrary, data samples labeled 1 are likely to contain inaccurate, biased, or even rumor-based information. Observing these characteristics in the data can help us better understand the trends and patterns of the labeled information. The structured data is presented in Table 3-3 below:
Table 3 - 3: Data labeling
Kiếm hàng trăm triệu đồng nhờ xà phòng nghệ thuật
Thủ tướng: Đến năm 2030, Tây Nguyên cần hoàn thành 5 cao tốc
Bãi tắm Cửa Tùng thay đổi sau 20 năm
Với 90% cổ phần ở Ngân hàng SCB, bà Trương Mỹ Lan đã sử dụng hơn 1.000 công ty trong hệ sinh thái của mình để bỏ túi riêng 304 ngàn tỷ đồng
tử hình với Lê Văn Mạnh, người đã trải qua 7.000 ngày biệt giam
Nếu nói đến bộ môn cờ vây truyền thống của văn hóa Trung Hoa, hẳn nhiên Hoa Kỳ là một "tay mơ."
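The 0/1 labeling convention above can be sketched as a small helper that maps an article's source to a label. The domain-to-label mapping shown is an illustrative assumption; in the thesis, labels are assigned during collection.

```python
# Source-to-label mapping, assumed for illustration: articles collected
# from Vnexpress.net are labeled 0 (real), from viettan.org 1 (fake).
REAL_SOURCES = {"vnexpress.net"}
FAKE_SOURCES = {"viettan.org"}

def label_article(source_domain: str) -> int:
    """Return 0 for real news, 1 for fake news."""
    if source_domain in REAL_SOURCES:
        return 0
    if source_domain in FAKE_SOURCES:
        return 1
    raise ValueError(f"Unknown source: {source_domain}")
```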
3.2.3 Data Cleaning and Preprocessing
Before training, the data set should be cleaned and preprocessed. During natural language processing, we used data cleaning and normalization steps to ensure cleanliness and uniformity, creating the best conditions for machine learning and analysis. First, we removed URLs and HTML tags from the text; this helps remove irrelevant and potentially noisy elements. Next, we apply another regular expression rule to remove all characters other than numbers and letters (including Vietnamese accented letters) to simplify the text.
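A minimal sketch of these cleaning steps with Python's "re" module is shown below; the exact patterns are assumptions for illustration, not the thesis's actual code.

```python
import re

def clean_text(text: str) -> str:
    # 1. Remove URLs.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # 2. Remove HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # 3. Keep only letters, digits, and spaces; \w matches Unicode
    #    letters, so Vietnamese accented characters are preserved.
    text = re.sub(r"[^\w\s]", " ", text)
    # 4. Normalize whitespace so word spacing is consistent.
    text = re.sub(r"\s+", " ", text).strip()
    return text

clean_text("Xem <b>tin</b> tại https://vnexpress.net nhé!!!")  # "Xem tin tại nhé"
```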
Then, we normalize whitespace, removing excess spaces so that the spacing between words is consistent. This helps increase the accuracy of later text analysis. We also use the "text_normalize" function from the "underthesea" library to normalize the text, helping to homogenize expressions in the text. For example:
Original sentence: "Hnay tôi đi làm muộn vì trời mưa to quá!"
After applying "text_normalize": "Hôm nay tôi đi làm muộn vì trời mưa to quá!"
where the acronym "Hnay" is converted to the full form "Hôm nay".
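A toy illustration of this kind of abbreviation expansion is shown below; the lookup table is invented for illustration and is not how "text_normalize" is implemented internally.

```python
# Hypothetical abbreviation table; real normalizers use much larger
# dictionaries plus spelling and diacritic rules.
ABBREVIATIONS = {
    "Hnay": "Hôm nay",
    "ko": "không",
}

def expand_abbreviations(sentence: str) -> str:
    """Replace each known abbreviation with its full form."""
    return " ".join(ABBREVIATIONS.get(w, w) for w in sentence.split())

expand_abbreviations("Hnay tôi đi làm muộn")  # "Hôm nay tôi đi làm muộn"
```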
Next, we perform tokenization using the "word_tokenize" function, also from the "underthesea" library, which allows us to effectively separate words in Vietnamese text. This word separation is an important step in preparing data for the deep learning models used later [9]. For instance:
Original sentence: "Bác sĩ bây giờ có thể thản nhiên báo tin bệnh nhân bị ung thư"
After applying "word_tokenize(sentence)": ['Bác sĩ', 'bây giờ', 'có thể', 'thản nhiên', 'báo tin', 'bệnh nhân', 'bị', 'ung thư']
After applying "word_tokenize(sentence, format="text")": 'Bác_sĩ bây_giờ có_thể thản_nhiên báo_tin bệnh_nhân bị ung_thư'
After completing word tokenization, we applied Term Frequency - Inverse Document Frequency (TF-IDF). TF-IDF is frequently used in machine learning algorithms in various capacities. It includes two components: term frequency and inverse document frequency.
The term frequency measures how often a word occurs in a document. There are many ways of calculating this frequency, the simplest being a raw count of the instances a word appears in a document. The frequency can then be adjusted, for example by the length of the document, or by the raw frequency of the most frequent word in the document.
The inverse document frequency measures a word's rarity across a set of documents: a lower IDF value indicates a more common word. It is calculated by dividing the total number of documents by the number of documents containing the word, then taking the logarithm of this quotient.
The TF-IDF score for the word t in the document d from the document set C is:

tf-idf(t, d, C) = tf(t, d) × idf(t, C), where idf(t, C) = log(|C| / |Ct|)

- tf(t, d): the number of times term t appears in document d
- |C|: the number of documents in the corpus
- |Ct| = |{d ∈ C : t ∈ d}|: the number of documents containing term t
In order to calculate the TF-IDF weight, we need to compute the term frequency (TF) first.
Table 3 - 4: Term frequency (TF).

Term | Document 1 | Document 2
Mục tiêu | 1 | 0
To obtain the term frequency, we count the number of occurrences of term t in the document. For instance, "Mục tiêu" appears one time in document 1, hence tf("Mục tiêu", Doc1) = 1. In contrast, "Mục tiêu" does not appear in document 2, so tf("Mục tiêu", Doc2) = 0.
Next, we need to count the number of documents containing the term t. This value, the document frequency (df), corresponds to |Ct| in the formula.
Table 3 - 5: Document frequency (df).

The df of "Mục tiêu" is 1, since the term is present only in the first document. This is the case for the other terms as well.
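The raw-count TF and log-ratio IDF described above can be combined in a short sketch. The toy two-document corpus below is an assumption chosen so that "Mục tiêu" appears once in document 1 and never in document 2, mirroring the worked example; the second document's contents are invented for illustration.

```python
import math

def tf(term: str, doc: list) -> int:
    """Raw-count term frequency: occurrences of term in the document."""
    return doc.count(term)

def idf(term: str, corpus: list) -> float:
    """log(|C| / |Ct|): corpus size over documents containing the term."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term: str, doc: list, corpus: list) -> float:
    return tf(term, doc) * idf(term, corpus)

# Toy corpus: term "Mục tiêu" is in document 1 only.
doc1 = ["Mục tiêu", "phát triển"]
doc2 = ["phát triển"]
corpus = [doc1, doc2]

tf("Mục tiêu", doc1)              # 1
tf("Mục tiêu", doc2)              # 0
tf_idf("Mục tiêu", doc1, corpus)  # 1 * log(2/1) ≈ 0.693
```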
Then, we calculate the value of IDF.