VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
DANG TRUC LAM - 19521736
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISORS
PROF. DO PHUC
MSC. NGUYEN THI KIM PHUNG
HO CHI MINH CITY, 2023
ASSESSMENT COMMITTEE
The Assessment Committee is established under the Decision … by the Rector of the University of Information Technology:
… - Chairman
… - Secretary
… - Member
NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY - VIETNAM
FACULTY OF INFORMATION SYSTEMS
MINUTES OF AMENDED THESIS DEFENSE
Student: Dang Truc Lam          Student ID: 19521736
Major: Advanced Program in Information Systems
Academic year: 2019
Thesis project: Detecting sources using YouTube original videos to manipulate contents with other purposes
Today, January 28th, 2024, I have completed the editing of my graduation thesis according to the comments from the Thesis Defense Council and the Thesis Reviewer, with the following content:
A. Based on the Reviewer's request:
1. To be amended (p. 49): Add the experimental results section fully and clearly.
   Amended (pp. 49, 50, 51, 52, 53, 54): Added more results of the experiment process.
B. Based on the Thesis Defense Council's request:
1. To be amended (p. 52): The result is not clear; % is not suitable for distance.
   Amended (p. 52): Changed the representation scale of the cosine distance.
2. To be amended (p. 7): Topic is too big and challenging.
   Amended (p. 7): Re-defined the scope and objectives of the project, making the goal of each process in the objectives section clearer.
3. To be amended (pp. 1, ii): Need to change the title.
   Amended (pp. 1, ii): Changed the title from "Detecting sources using YouTube original videos to manipulate contents with other purposes" to "Propose an approach to compare the contents of YouTube videos".
Attached to these minutes are the assessment sheet of the Thesis Reviewer and the Thesis Defense Council's evaluation minutes for the graduation thesis.
Ho Chi Minh City, January 28th, 2024
Confirmation of the Thesis Reviewer          Student
(Sign and write full name)                   (Sign and write full name)
Confirmation of the Head/Deputy Head of the Faculty/Department in charge
(Sign and write full name)
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
COMMENTS OF THESIS REVIEWER
1. Structure and layout
- Sections are quite clear and logical.
- However, the experimental content has not been fully presented; the video frame analysis part has not been written yet. The student must complete all sections of the report.
2. Contents
- The student investigated how to determine whether a video file has been edited. This is an interesting topic that still has many challenges.
- The student proposed a method focusing on analysis of the audio and video frames extracted from the original video, then used the cosine metric to calculate the similarity and drew conclusions based on that similarity.
- This approach is appropriate, but the experimental results are not fully shown in the report. Therefore, the student must add the experimental results section fully and clearly.
3. Experiments and application
- The ISOT dataset includes two categories of articles: fake and real news. The dataset was acquired from authentic sources, with the truthful articles extracted by web-crawling articles from Reuters.com. The fake articles were gathered from untrustworthy websites identified by PolitiFact (a reputable fact-checking group in the USA) and Wikipedia. This is a good dataset.
- The student carried out a YouTube2Text experiment using Whisper.
4. References
- Good.
5. Behavior
- The student showed a good attitude while working with the reviewer.
Overall assessment (please choose one of the following categories: Fair/Good/Excellent/Outstanding):
Mark:
Dang Truc Lam: 7.0/10
Ho Chi Minh City, January 22nd, 2023
Reviewer
Cao Thi Nhan
ACKNOWLEDGEMENTS
First of all, allow me to express my sincere appreciation to my advisors, Prof. Do Phuc and MSc. Nguyen Thi Kim Phung, for their extraordinary instruction, assistance, and knowledge during the course of this research undertaking. Their perceptive critique, support, and steadfast dedication have played a critical role in molding this thesis and fostering my scholarly development.
The guidance and clarity that Prof. Do Phuc's extensive knowledge and experience in Machine Learning have provided to my research have been invaluable. His commitment to my scholarly growth has been phenomenally motivating, and I am appreciative of the inspiration he has provided through his mentoring.
The supervision and constructive feedback provided by MSc. Nguyen Thi Kim Phung have made a substantial contribution to the enhancement of this thesis. Her erudite counsel and meticulous attention to detail were indispensable in elevating the caliber of my research endeavors. I express gratitude for her unwavering support and tolerance during this endeavor.
Additionally, I would like to express my gratitude to the faculty and staff of the Faculty of Information Systems for their invaluable assistance, educational materials, and supportive atmosphere, all of which have significantly enhanced my educational journey. Their contributions were crucial to the successful completion of this study.
In conclusion, I express my gratitude towards my fellows for their support, intellectual stimulation, and positive reinforcement, all of which have contributed to the overall enrichment of this scholarly expedition.
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
THESIS PROPOSAL
Advisors:
Prof. Do Phuc (Email: phucd@uit.edu.vn)
MSc. Nguyen Thi Kim Phung (Email: phungntk@uit.edu.vn)
Misinformation and manipulated content have become more widely distributed as a result of the exponential expansion of digital media [1], [2], particularly on platforms such as YouTube [3], [4], [5]. The objective of this study is to devise a technique for identifying the originators of misinformation and manipulated videos on YouTube. The purpose of this research is to develop a framework for the identification of manipulated videos through the application of deep learning algorithms and the analysis of their characteristics. The results of this study will help enhance the efficacy of approaches to counter misinformation on digital platforms.
The first part of the study examines how common misinformation (manipulated or modified information) is on digital platforms, focusing on YouTube as a major channel through which altered content circulates. After carefully reviewing the previous research on the topic, this study identifies problems with the current detection methods and suggests a new approach that uses deep learning algorithms to make detection more accurate.
The method involves collecting a diverse set of YouTube videos that include both real and fake content. Videos are broken down into components such as image frames and audio patterns so that deep learning models can be applied.
1. Objectives: Analyzing the distinctions between original YouTube videos and modified videos. This serves as a fundamental basis for information security and data integrity on digital platforms.
2. Scope: The subject matter covers the analysis of YouTube videos and video similarity comparison.
3. Methodologies: Applying algorithms and deep learning models specifically designed for Natural Language Processing (NLP) to address challenges such as YouTube2Text; assessing the vector parameters to ascertain whether the input video has been modified; and simultaneously evaluating the performance of existing models to determine the most suitable model for the procedure.
4. Expected result: A proposal for an optimized method for detecting videos that spread misleading or modified information with the intention of manipulating viewers.
References:
[1] C. Chen, H. Wang, M. Shapiro, Y. Xiao, F. Wang, and K. Shu, "Combating Health Misinformation in Social Media: Characterization, Detection, Intervention, and Open Issues." arXiv, Nov. 09, 2022. Accessed: Jan. 06, 2024. [Online]. Available: http://arxiv.org/abs/2211.05289
[2] S. Kumar and N. Shah, "False Information on Web and Social Media: A Survey." arXiv, Apr. 23, 2018. Accessed: Jan. 06, 2024. [Online]. Available: http://arxiv.org/abs/1804.08559
[3] H. O.-Y. Li, E. Pastukhova, O. Brandts-Longtin, M. G. Tan, and M. G. Kirchhof, "YouTube as a source of misinformation on COVID-19 vaccination: a systematic analysis," BMJ Glob. Health, vol. 7, no. 3, Mar. 2022, doi: 10.1136/bmjgh-2021-008334.
[4] L. Tang et al., "'Down the Rabbit Hole' of Vaccine Misinformation on YouTube: Network Exposure Study," J. Med. Internet Res., vol. 23, no. 1, p. e23262, Jan. 2021, doi: 10.2196/23262.
[5] I. Srba et al., "Auditing YouTube's Recommendation Algorithm for Misinformation Filter Bubbles," ACM Trans. Recomm. Syst., vol. 1, no. 1, pp. 1-33, Mar. 2023, doi: 10.1145/3568392.
Research plan:
- Identify and collect a diverse dataset of YouTube videos.
- Preprocess the data, including cleaning and organizing the videos for analysis.
- Extract features from the collected videos, including images and audio.
- Develop deep learning models for YouTube2Text, Doc2Vec, and document comparison.
- Evaluate the performance of the trained models.
- Compare model results with other related works.
Approved by the advisor(s)
Signature(s) of advisor(s)
Prof. Do Phuc
MSc. Nguyen Thi Kim Phung
Ho Chi Minh City, 01/09/2023
Signature(s) of student(s)
Dang Truc Lam
TABLE OF CONTENTS
THESIS PROPOSAL ... xi
TABLE OF CONTENTS ... xv
LIST OF FIGURES ... xvii
LIST OF TABLES ... xx
LIST OF ABBREVIATIONS ... xxi
ABSTRACT ... 1
Chapter 1 – Introduction ... 2
1.1 General introduction ... 2
1.2 Problem statement and challenges ... 3
1.2.1 Problem statement ... 3
1.2.2 Challenges ... 3
1.3 Surveying methods for analyzing video content ... 3
1.4 Study objectives and scope ... 7
1.4.1 Objectives ... 7
1.4.2 Scope ... 7
1.5 Implementation details ... 7
1.6 Report outline ... 8
Chapter 2 – Theoretical Background ... 9
2.1 Current approaches to solve the problem ... 9
2.1.1 Approaches to video content based on Audio ... 9
2.1.2 Approaches to video content based on Image ... 23
2.1.3 Vector comparison methodology ... 39
Chapter 3 – Experiments and Evaluations ... 43
3.1 Proposed methodology ... 43
3.2 Datasets introduction ... 44
3.2.1 Lack of video dataset survey ... 44
3.2.2 Alternative datasets for the problem ... 47
3.3 Environment ... 49
3.4 Implementation process description ... 49
3.4.1 Audio analysis ... 49
3.4.2 Video frame analysis ... 51
Chapter 4 – Conclusion and Future directions ... 56
4.1 Achievements and limitations ... 56
4.1.1 Achievements ... 56
4.1.2 Limitations ... 56
4.2 Future directions ... 56
REFERENCES ... 58
APPENDICES ... 65
LIST OF FIGURES
Figure 1.1 Topic-wide distribution of videos [7] ... 4
Figure 1.2 Pipeline of Jagtap et al. proposed methodology [7] ... 4
Figure 1.3 Multi-class classification: Best models and embeddings with highest weighted F1-score for each topic [7] ... 5
Figure 1.4 Summary of the datasets used in [10] ... 5
Figure 1.5 Evaluation of fine-tuning base transformer models and few-shot learning on the three datasets [10] ... 6
Figure 2.1 Baevski et al.'s Wav2vec 2.0 framework [11] ... 9
Figure 2.2 Conformer encoder model architecture [19] ... 11
Figure 2.3 Overview of Whisper methodology architecture [21] ... 13
Figure 2.4 Word Error Rate (WER) distribution of Whisper and state-of-the-art commercial and open-source ASR systems in long-form transcription, using 7 distinct long-form English-only datasets ... 14
Figure 2.5 A timeline overview of the latest advancements in fundamental network structures and models for word embeddings, starting from 2013 ... 15
Figure 2.6 Bag of words model [25] ... 16
Figure 2.7 Example of drawback of BoW [26] ... 16
Figure 2.8 TF formula [27] ... 17
Figure 2.9 IDF formula [27] ... 17
Figure 2.10 The difference between CBOW and Skip-gram [29] ... 18
Figure 2.11 PV-DM model ... 18
Figure 2.12 PV-DBOW model ... 19
Figure 2.13 Performance comparison on SST dataset ... 21
Figure 2.14 Performance comparison on IMDB dataset ... 22
Figure 2.15 Example of 3 paragraphs ... 23
Figure 2.16 Results of example in Figure 2.15 ... 23
Figure 2.17 Video to frames using OpenCV example [39] ... 24
Figure 2.18 Anaz and Faris, 2015 performance comparison between OpenCV and MATLAB in a real-time application [40] ... 25
Figure 2.19 The architecture of the proposed VONet ... 27
Figure 2.20 Average running time on an RGB image of size 768x512 from the Kodak dataset by the test methods ... 27
Figure 2.21 Example of feature extracting architecture ... 27
Figure 2.22 An example of CNN architecture for image classification [50] ... 28
Figure 2.23 Primary calculations executed at each step of a convolutional layer [50] ... 29
Figure 2.24 The convolution value is calculated by taking the dot product of the corresponding values in the kernel and the channel matrices [51] ... 29
Figure 2.25 ReLU function formula [50] ... 30
Figure 2.26 Three types of most frequently utilized pooling methods [50] ... 30
Figure 2.27 Fully connected layer [50] ... 31
Figure 2.28 Softmax mathematical formula [50] ... 32
Figure 2.29 RNN architecture [52] ... 32
Figure 2.30 … 33
Figure 2.31 Visualized diagram of vanishing gradient ... 34
Figure 2.32 Workflow comparison between RNN and LSTM ... 35
Figure 2.33 Transformer architecture ... 36
Figure 2.34 Self-attention helps the model understand more about the relationship between sentences ... 37
Figure 2.35 Positional encoding description ... 38
Figure 2.36 Comparison of BERT and recent improvements over it [55] ... 39
Figure 2.37 Document similarity using cosine example [57] ... 40
Figure 2.38 Cosine similarity formula ... 40
Figure 2.39 Euclidean distance formula ... 41
Figure 2.40 Euclidean distance example [60] ... 41
Figure 2.41 Manhattan distance formula ... 42
Figure 2.42 Manhattan distance example [60] ... 42
Figure 3.1 Proposed methodology architecture ... 44
Figure 3.2 Multimodal in terms of fake information description [61] ... 45
Figure 3.3 Summary of the most relevant works on factuality, covering different modalities and tasks [61] ... 46
Figure 3.4 Categories and number of articles per category ... 47
Figure 3.5 Comparison with other existing datasets ... 48
Figure 3.6 Average AUC performance of SOTA detection methods on each dataset ... 48
Figure 3.7 Example result of YouTube2Text using Whisper ... 49
Figure 3.8 Documents after stemming ... 50
Figure 3.9 Average similarity percent, using cosine ... 51
Figure 3.10 Video2Vec example ... 51
Figure 3.11 4-D array example ... 52
Figure 3.12 Cosine distance formula ... 52
Figure 3.13 Average distance over whole dataset ... 54
Figure 4.1 Stemming function in Doc2Vec ... 65
Figure 4.2 Calculate similarity percent in Doc2Vec ... 65
Figure 4.3 Function to calculate the percentage ... 65
Figure 4.4 Cosine distance percent in Video2Vec ... 67
LIST OF ABBREVIATIONS
NLP: Natural Language Processing
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
BERT: Bidirectional Encoder Representations from Transformers
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
CTC loss function: Connectionist Temporal Classification loss
WER: Word Error Rate
HMM baseline: Hidden Markov Models
WSPSR: Web-scale supervised pretraining for Speech Recognition
ASR: Automatic Speech Recognition system
BoW: Bag of Words
TF-IDF: Term Frequency - Inverse Document Frequency
CBOW: Continuous Bag of Words
DBOW: Distributed Bag of Words
PV-DM: Distributed memory version of paragraph vector
PV-DBOW: Distributed Bag of Words version of Paragraph Vector
RGB: Red, Green, Blue channel
GAP: Global Average Pooling
MSE: Mean Squared Error
LSTM: Long Short-Term Memory
SOTA: State-of-the-art
ABSTRACT
The advent of digital platforms has fundamentally transformed the dissemination of information and news, providing unparalleled possibilities for worldwide communication and the exchange of knowledge. The emergence of digital platforms, including social media, news websites, and content-sharing platforms, has facilitated widespread access to information, enabling people and organizations to rapidly reach large audiences.
YouTube, as one of the major worldwide video-sharing platforms, provides exceptional potential for content creators to effectively reach varied audiences with a wide range of information and news content. Nevertheless, the widespread availability of YouTube content distribution has also resulted in difficulties with the accuracy and reliability of material provided on the platform [1], [2], [3].
The research methodology entails gathering a heterogeneous dataset of authentic YouTube videos that encompass a wide range of genres and subjects. Deep learning models are used to extract and analyze features through image analysis and audio analysis, as well as natural language processing (NLP).
Overall, the study's findings are anticipated to enhance the creation of more efficient tactics for identifying and countering the exploitation of authentic YouTube videos for manipulative intentions. The research seeks to improve the openness and integrity of digital content on platforms like YouTube by identifying sources that participate in such behaviors.
Chapter 1 – Introduction
1.1 General introduction
The advent of digital platforms has fundamentally transformed the dissemination of information and news, providing unparalleled possibilities for worldwide communication and the exchange of knowledge. The emergence of digital platforms, including social media, news websites, and content-sharing platforms, has facilitated widespread access to information, enabling people and organizations to rapidly reach large audiences. This has enabled individuals to engage in public discussions, exchange various viewpoints, and access a broad array of news outlets. Nevertheless, this reality has also resulted in the widespread dissemination of false information, deliberate misinformation, and manipulative propaganda, which present substantial threats to public confidence and the stability of society [4], [5].
YouTube, as one of the major worldwide video-sharing platforms, provides exceptional potential for content creators to effectively reach varied audiences with a wide range of information and news content. Its extensive user base and algorithmic recommendation systems have significantly changed the way people access and interact with information, influencing public discussions worldwide. Nevertheless, the widespread availability of content distribution has also resulted in difficulties with the accuracy and reliability of material provided on the platform [1], [2], [3].
Given the aforementioned concerns, it is necessary to create techniques for detecting manipulated videos and identifying the sources of misinformation videos in order to ensure safety, security, and the accuracy of information. However, it is exceedingly difficult to distinguish subtle discrepancies in videos of long duration. Furthermore, it is almost unfeasible to manually authenticate a substantial volume of data. Hence, this study was undertaken to surmount these challenges.
1.2 Problem statement and challenges
1.2.1 Problem statement
… as it increases the likelihood of misleading information and copyright violations.
The problem is to develop a technique to analyze both original and modified YouTube videos, assess their characteristics, and ascertain whether a given video has undergone any alterations.
The input consists of video segments, and the final result is the distance percentage between those input videos.
1.2.2 Challenges
Lack of a pre-labeled dataset:
• Building a large dataset containing videos labeled as "original" and "modified copies" is a challenging task. Furthermore, collecting such data requires strict adherence to principles of data security, intellectual property rights, and copyright.
1.3 Surveying methods for analyzing video content
Jagtap et al., 2021 [7] utilized a technique that analyzes the characteristics and features of each video through its subtitles, commonly referred to as video captions, in order to detect videos spreading misleading data. The researchers utilized a dataset of YouTube videos consisting of metadata recorded in CSV format, including video titles, descriptions, view counts, likes, and dislikes. The dataset covered subjects such as the Vaccines Controversy, 9/11 Conspiracy, Chem-trail Conspiracy, Moon Landing Conspiracy, and Flat Earth Theory. The authors developed a multi-class prediction model utilizing natural language processing (NLP) to classify videos into classes such as "misinformation" and "debunking misinformation or neutral."
Topic | Original Count | Available videos with captions
Vaccines Controversy | 775 | 621 (28.5%)
9/11 Conspiracy | 654 | 436 (20.0%)
Chem-trail Conspiracy | 675 | 484 (22.2%)
Moon Landing Conspiracy | 466 | 317 (14.6%)
Flat Earth | 373 | 317 (14.6%)
Figure 1.1 Topic-wide distribution of videos [7]
Figure 1.2 Pipeline of Jagtap et al. proposed methodology [7]
The authors obtained video captions (subtitles) by using a video scraper to extract data from the links provided in the dataset. Subsequently, they carried out preprocessing procedures to exclude erroneous or noisy data. Afterwards, they performed word embedding using four state-of-the-art techniques [8], [9]: Stanford GloVe Wikipedia vectors (100D), Stanford GloVe Wikipedia vectors (300D), Word2Vec Google News (300D), and Word2Vec Twitter (200D). Ultimately, the authors calculated F1, AUC-ROC, precision, and recall scores to identify the best classifiers with the most impactful embeddings for all topics.
Figure 1.3 Multi-class classification: Best models and embeddings with highest weighted F1-score for each topic [7]
Christodoulou et al., 2023 [10] proposed a novel methodology to detect misinformation in YouTube videos. They achieved this by employing video classification techniques and cross-referencing information from video transcripts. The authors employed a combination of two transfer learning techniques: fine-tuning base transformer models (such as BERT, RoBERTa, and ELECTRA) and utilizing sentence transformers (MPNet and RoBERTa-large) for few-shot learning. They conducted experiments using three distinct datasets to provide the most thorough outcomes.
Dataset | Type | Number of Samples
YouTube Audit (Vaccines) | Misinformation | 652
YouTube Audit (Vaccines) | Non-misinformation | 636
YouTube Pseudoscience | Pseudoscience | 182
YouTube Pseudoscience | Science | 226
ISOT Fake News | Fake | 1000
ISOT Fake News | Real | 1000
Figure 1.4 Summary of the datasets used in [10]
The evaluation results indicate that RoBERTa outperformed BERT and ELECTRA among the fine-tuned models on the YouTube Audit (Vaccines) dataset, while MPNet few-shot demonstrated superior performance compared to RoBERTa-large. Within the YouTube Pseudoscience dataset, the few-shot learning models surpassed the fine-tuned models; specifically, MPNet few-shot outperformed RoBERTa-large with higher scores. ELECTRA achieved the highest performance on the ISOT Fake News dataset.
Figure 1.5 Evaluation of fine-tuning base transformer models and few-shot learning on the three datasets [10]
Unlike Jagtap et al., 2021 [7], who focused on topic-specific strategies using various classifiers and embeddings, the approach of Christodoulou et al., 2023 emphasizes the versatility of transformer models. These findings highlight the importance of context-specific approaches and the role of deep learning models, embeddings, and transfer learning in combating misinformation.
Additionally, several other studies have commonly focused on classification tasks based on video subtitles. However, such methodologies may fail to take into account the wide range of factors involved in detecting deceptive or altered material in videos, including but not limited to visuals, audio, and diverse other variables.
1.4 Study objectives and scope
1.4.1 Objectives
To tackle the urgent problem described earlier, this study proposes an approach to examine and compute the similarity and distance between videos, utilizing several natural language processing algorithms and deep learning models. This is accomplished concurrently with the processing of multimedia data, encompassing audio and video frame processing.
In the audio analysis process, the purpose is to obtain the average similarity percentage between the documents extracted from the audio.
In the video frame analysis process, instead of computing the similarity, calculating the distance between video frames is more suitable, since its representation uses positive values, which are easier to analyze. The goal is to obtain the average distance over the chosen dataset.
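To make the two measures concrete, the following is a minimal sketch, not the thesis's actual implementation, of cosine similarity between two document vectors and cosine distance between two frame vectors using NumPy; the example vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 mean near-identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity; non-negative for non-negative vectors."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical embedding vectors for an original and a candidate video.
doc_original = np.array([0.12, 0.80, 0.35, 0.41])
doc_candidate = np.array([0.10, 0.75, 0.40, 0.38])

print(f"similarity: {cosine_similarity(doc_original, doc_candidate):.2%}")
print(f"distance:   {cosine_distance(doc_original, doc_candidate):.4f}")
```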
1.4.2 Scope
The scope of this thesis is based on alternative datasets, one for audio analysis and one for video frame analysis, since collecting YouTube videos that satisfy the problem conditions is challenging.
Some technologies used include:
- Language: Python
- Methodologies: Whisper, Paragraph Vector, etc.
- Libraries: PyTube, NumPy, Pandas, Matplotlib, Gensim, etc.
1.5 Implementation details
Step 1: Research, collect, and preprocess data. Survey models and related problems. Select a model suitable for the given problem.
Step 2: Experiment with available models and methods; compare and adapt them to meet the initial requirements. Additionally, propose directions for modifying and advancing the problem if necessary.
Step 3: Evaluate the model's output results, continuously adapting and improving to achieve the best performance.
1.6 Report outline
This thesis is structured with the following chapters and sections:
Chapter 1: Introduction
• Overview of the topic
• Introduction and summary of the objectives, problem statement, field, and scope of the study
Chapter 2: Theoretical Background
• Approaches to solving multimedia data analysis problems, including YouTube2Text, Doc2Vec, Video2Frame, and Img2Vec
Chapter 3: Experiments and Evaluations
• Implementation of the methods and models studied on the dataset
• Evaluation of the results
• Comparison with other methods in the same field
Chapter 4: Conclusion and Future directions
• Summary of the topic
• Conclusions
• Future directions for further development of the problem
Chapter 2 – Theoretical Background
2.1 Current approaches to solve the problem
This section introduces the basic concepts of a variety of analysis methodologies, covering both audio and image analysis.
2.1.1 Approaches to video content based on Audio
2.1.1.1 YouTube2Text
The rapid development of digital devices and technologies has made the analysis and processing of multimedia data essential in the field of information technology, particularly in the domains of deep learning and artificial intelligence. In particular, audio data processing research is receiving growing interest, with scholars exploring numerous experimental methodologies.
Wav2Vec
Baevski et al., 2020 introduced a novel approach for unsupervised pre-training. They utilized a convolutional neural network (CNN) to encode raw audio waveforms into latent speech representations, which are subsequently fed into a Transformer network to generate contextualized representations as the final output.
The architecture of the Wav2vec 2.0 model comprises a multi-layer convolutional feature encoder, which accepts raw audio as input and generates latent speech representations. The Transformer processes these representations to produce contextualized representations that encompass information from the entire sequence. The feature encoder's output is discretized through a quantization module, which provides the targets in the self-supervised objective. The technique differs from earlier models by emphasizing the construction of context representations and the utilization of self-attention to capture dependencies throughout the whole sequence of latent representations [12].
In the pre-training phase, the objective is to acquire speech audio representations by solving a contrastive task: identifying the correct quantized latent speech representation for a masked time step from among a group of distractors.
Within the realm of speech recognition, pre-trained models are adapted by adding a linear projection layer on top of the existing network [13]. This layer represents the vocabulary specific to the given task. In the case of LibriSpeech, the models are optimized using a CTC loss function [14] and a modified technique known as SpecAugment, which masks time steps and channels during training [15]. This methodology helps mitigate overfitting and results in improved error rates.
An advantage of this technology is its ability to process raw audio data without relying on human labels, enabling it to handle a substantial volume of data. Nevertheless, because of its unsupervised learning, it may not achieve the utmost mapping performance, and a degree of fine-tuning is necessary in order to carry out tasks such as speech recognition.
recognition
Chan et al., 2021 [16] developed a speech recognition model named SpeechStew.This model combines different existing datasets, including AMI, Broadcast News,
Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal.
Importantly, no domain-specific adjustments were made to the datasets in terms of
re-10
Trang 34balancing or re-weighting The architecture utilized in this model is the Conformer
RNN-T [17] (Figure 2.2), as proposed by Gulati et al., 2020 [16] Additionally, the
model incorporates the wav2vec 2.0 model [11] for pre-training with 1 billionparameters [18]
Figure 2.2 Conformer encoder model architecture [19]
The transfer learning capabilities of SpeechStew allow for the fine-tuning and
adaptation of a general-purpose model trained on diverse datasets to a new application.The provided example is CHiME-6 [20], a dataset comprising of conversational audiorecorded in noisy surroundings This dataset poses a significant challenge when itcomes to training speech recognition models directly The authors showcase theefficacy of transfer learning in enhancing performance on the specified task by refining
SpeechStew using CHiME-6.
The transfer learning capabilities of SpeechStew are exceedingly useful, enablingthe authors to train a versatile model once and subsequently refine it for specific low-resource tasks This approach is economically efficient as fine-tuning necessitates only
lãi
Trang 35a limited number of thousand steps, in contrast to the roughly 100k steps required to
train a model from the beginning.
And as expected, SpeechStew consistently delivers nearly state-of-the-artperformance across a range of tasks, without relying on an external language model.The results of our study demonstrate a Word Error Rate (WER) of 9.0% on AMI-IHM,4.7% on Switchboard, 8.3% on CallHome, and 1.3% on WSJ, and with CHiME-6, theyachieve 38.9% WER without a language model, which compares to 38.6% WER to astrong HMM baseline with a language model These results indicate a significantimprovement over previous research, particularly when considering the utilization ofrobust external language models
Whisper
Radford et al., 2022 [21] conducted a study on speech processing systems that weretrained to predict transcripts of audio from the internet The researchers found thatwhen these systems were trained on a large amount of labeled audio data (680,000hours), which is an order of magnitude larger than previous datasets, with multilingualand multitask supervision, they performed well on standard benchmarks and werecompetitive with fully supervised models without the need for fine-tuning They
introduce a method called Whisper (basic name for WSPSR — Web-scale supervised
pretraining for Speech Recognition) in order to support further work on robust speechprocessing, based on the previous approaces
The dataset being used is constructed from the audio that is paired with transcripts
on the Internet, which is generated by both human and the output of ASR systems,since there existed numerous researches showing that training on datasets of mixedhuman and machine-generated data can significantly improve the modelperformance [22]
The researchers employed a well-established encoder-decoder Transformer [23]
design in their study on large-scale supervised pre-training for speech recognition since
it has demonstrated its reliability in terms of scalability The audio was enhanced byresampling it to a frequency of 16,000 Hz and generating an 80-channel log-magnitude
12
Trang 36Mel spectrogram representation The encoder utilized convolution layers with a filter
width of 3 and the GeLu activation function It was then followed by Transformer
blocks On the other hand, the decoder used learnt position embeddings and tied output token representations
input-@s (background music playing)
5 re Sequence-to-sequence learnin ee eae
Multitask training data (680k hours) od eq g [ew [em] 00 | The ck row
|
English transcription kz.
$ *Ask not what your country can do for -” a a]
#2 Ask not what your country can do for ——>
mm mm
Any-to-English speech translation i § :
: Hi :
$ “El rápido zorro marrón salta sobre -” Transformer Š “———>
Encoder Blocks 3 om) Transformer
2 The quick brown fox jumps over + Š >», =a Decoder Blocks
5
Non-English transcription SS!
[ ce | | oes]
$: "cit Pol Bet etc LISLE Wa We
Sinusoidal L——> (cross attention)
J AS Hol Sep ujeice 4S Wa Pe Positional (sa ateriion —)
PREV > text tokens _ r,, EOT
† ME || TP DỰNG eee nes | tes Exene J
Custom vocabulary / fa}
prompting Yi an Lew \ x= Le Text-only transcription
special text timestamp (VAD) Translation (allows dataset-specific fine-tuning)
tokens tokens tokens ts
Figure 2.3 Overview of Whisper methodology architecture [21]
- Multitask training format: the authors employed a single model to perform the entire speech processing pipeline, handling different tasks on the same audio input, including transcription, translation, voice activity detection, alignment, language identification, etc. Furthermore, the authors used a straightforward format in which all tasks and conditioning information are specified as a sequence of input tokens to the decoder, encompassing special tokens, text tokens, and timestamp tokens. Additionally, the decoder is trained to condition on the previous text history of the transcript, aiming to exploit longer-range text context to resolve ambiguities in the audio. To indicate the beginning of prediction, the authors use a <start of transcript> token, followed by a <no speech> token in case the input audio contains no speech, or a <language tag> if speech is recognized in the audio, before executing the next tasks of transcription and translation. Afterwards, a <no timestamps> token specifies whether timestamps need to be predicted, and lastly, an <end of transcript> token is placed at the end of the format.
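Schematically, and using the token names from the description above rather than Whisper's exact vocabulary strings, a decoder input for transcribing English speech without timestamps might look like this sketch:

```python
# Hypothetical decoder token sequence for English transcription, no timestamps.
# The names mirror the description above; the real vocabulary strings differ.
decoder_sequence = [
    "<start of transcript>",
    "<language tag: en>",            # emitted because speech was detected
    "<task: transcribe>",
    "<no timestamps>",
    "The", "quick", "brown", "fox",  # predicted text tokens
    "<end of transcript>",
]
```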
Due to factors like unclear speech and labeling errors, each dataset used for Automatic Speech Recognition (ASR) [24] has a varying level of irreducible error. Evaluating ASR performance solely on the Word Error Rate (WER) metric makes it challenging to determine the potential for improvement on each dataset. To assess how well the Whisper ASR system performs relative to humans, a study was conducted using 25 recordings from the Kincaid46 dataset; the results indicate that Whisper's English ASR performance is very close to human-level accuracy, with only a slight difference in WER (1.15% better).
Figure 2.4 Word Error Rate (WER) distribution of Whisper and state-of-the-art commercial and open-source ASR systems in long-form transcription, using 7 distinct long-form English-only datasets.
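Since the thesis pairs PyTube with Whisper for its YouTube2Text step (see Section 1.4.2 and Figure 3.7), the following is a minimal sketch of that flow under stated assumptions: the openai-whisper and pytube packages are installed, ffmpeg is available, and the URL is a placeholder.

```python
import whisper
from pytube import YouTube

# Download the audio-only stream of a video (placeholder URL).
yt = YouTube("https://www.youtube.com/watch?v=XXXXXXXXXXX")
audio_path = yt.streams.filter(only_audio=True).first().download(filename="audio.mp4")

# Transcribe with a small Whisper checkpoint; larger checkpoints trade
# speed for accuracy.
model = whisper.load_model("base")
result = model.transcribe(audio_path)

print(result["text"])  # transcript used for downstream document comparison
```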
2.1.1.2 Doc2Vec
Word embedding is an essential technique in Natural Language Processing (NLP) that plays a fundamental role in expressing words as dense vectors in a high-dimensional space. These vectors capture semantic and syntactic associations among words, enabling deep learning algorithms to efficiently analyze and comprehend textual material. Word embedding allows algorithms to analyze language by associating words with continuous vector representations, preserving the contextual and hierarchical information found in natural language. This technology has significantly transformed numerous NLP tasks, including language modeling, sentiment analysis, machine translation, and document categorization, by offering a way to represent and analyze textual data in a more meaningful and computationally efficient manner.
Figure 2.5 A timeline overview of the latest advancements in fundamental network structures and models for word embeddings, starting from 2013.
Bag of Words (BoW): a widely used method in Natural Language Processing (NLP) that converts text documents of different lengths into fixed-length vectors based on word frequencies. These vectors disregard the grammatical syntax of sentences and the sequential arrangement of words [25]. The technique involves tokenization, counting word frequencies, and encoding the text data into numerical values.
Figure 2.6 Bag of words model [25]
However, this technique has several drawbacks, such as insensitivity to word order, disregard for punctuation and grammatical structure, and limited semantic information. Since BoW considers every word in isolation, the relationships between words are entirely ignored, which leads to a loss of meaning during NLP processing, especially for complicated sentences [26].
Document D1, "The child makes the dog happy", and Document D2, "The dog makes the child happy", yield the identical count vector (the: 2, dog: 1, makes: 1, child: 1, happy: 1), so BoW cannot distinguish them.
Figure 2.7 Example of drawback of BoW [26]
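As a concrete illustration of the counting scheme above, here is a minimal sketch applying scikit-learn's CountVectorizer, an assumed dependency rather than the thesis's own code, to the two example documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The child makes the dog happy",
        "The dog makes the child happy"]

vectorizer = CountVectorizer()        # tokenizes and counts word frequencies
bow = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['child' 'dog' 'happy' 'makes' 'the']
print(bow.toarray())  # both rows are identical: word order is lost
```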
Term Frequency - Inverse Document Frequency (TF-IDF): TF-IDF is a popular statistical method in natural language processing and information retrieval. It weighs the importance of a term in a document against the corpus as a whole, vectorizing a word by multiplying its TF by its IDF [27]. The drawbacks of TF-IDF are the same as those of BoW.
• Term Frequency (TF): the number of times a term appears in a document, relative to the document's total word count.

$$\mathrm{TF}(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

Figure 2.8 TF formula [27]
• Inverse Document Frequency (IDF): IDF reflects the proportion of corpus documents that contain a term. Technical terms, which appear in a limited number of documents, are weighted higher than common words such as 'a', 'the', and 'and'.

$$\mathrm{IDF}(t) = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain term } t}\right)$$

Figure 2.9 IDF formula [27]
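Putting the two formulas together, the following is a minimal sketch, again assuming scikit-learn rather than reproducing the thesis's implementation, that computes TF-IDF vectors and compares two documents with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the dog is happy",
    "the child makes the dog happy",
    "deep learning models compare document vectors",
]

# TfidfVectorizer multiplies each term frequency by its inverse document frequency.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Similarity between the first two documents; shared terms raise the score.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```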
Word2Vec
Word2vec is a two-layer neural network that converts words into vectors that capture their meaning. The input of the system is a collection of written texts, known as a text corpus; the system then generates a set of feature vectors that represent the words found in the corpus. Although Word2vec is not classified as a deep neural network, it converts textual data into a numerical representation that deep neural networks can process [28]. The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space, detecting similarities mathematically using cosine similarity.
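As a small illustration of this grouping, here is a minimal sketch that trains Word2vec on a toy corpus with Gensim (a library listed in Section 1.4.2); the sentences and hyperparameters are placeholders, not the thesis's settings.

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
sentences = [
    ["the", "dog", "is", "happy"],
    ["the", "child", "makes", "the", "dog", "happy"],
    ["the", "cat", "is", "happy"],
]

# sg=0 selects the CBOW training scheme; sg=1 would select Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["dog"].shape)              # a 50-dimensional word vector
print(model.wv.similarity("dog", "cat"))  # cosine similarity of two words
```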
The vector used to represent a word is called a neural word embedding. There are two ways to implement the distribution of words in vector space: