VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
DANG TRUC LAM - 19521736
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISORS
PROF. DO PHUC
MSC. NGUYEN THI KIM PHUNG
HO CHI MINH CITY, 2023
ASSESSMENT COMMITTEE
The Assessment Committee is established under the Decision … by the Rector of the University of Information Technology:
… - Chairman
… - Secretary
… - Member
NATIONAL UNIVERSITY
UNIVERSITY OF INFORMATION TECHNOLOGY - VIETNAM
FACULTY OF INFORMATION SYSTEMS
MINUTES OF AMENDED THESIS DEFENSE
Student: Dang Truc Lam          Student ID: 19521736
Major: Advanced Program in Information Systems
Academic year: 2019
Thesis project: Detecting sources using YouTube original videos to manipulate contents with other purposes
Today, January 28th, 2024, I have completed the editing of my graduation thesis according to the comments from the Thesis Defense Council and the Thesis Reviewer, with the following content:
A. Based on the Reviewer's request:
1. To be amended (p. 49): Add the experimental results section fully and clearly.
   Amended (pp. 49, 50, 51, 52, 53, 54): Added more results of the experiment process.
B. Based on the Thesis Defense Council's request:
1. To be amended (p. 52): The result is not clear; % is not suitable for distance.
   Amended (p. 52): Changed the representation scale of the cosine distance.
2. To be amended (p. 7): Topic is too big and challenging.
   Amended (p. 7): Re-defined the scope and objectives of the project, making the goal of each process in the objectives section clearer.
3. To be amended (pp. 1, ii): Need to change the title.
   Amended (pp. 1, ii): Changed the title from "Detecting sources using YouTube original videos to manipulate contents with other purposes" to "Propose an approach to compare the contents of YouTube videos".
Attached to these minutes are the assessment sheet of the Thesis Reviewer and the Thesis Defense Council's evaluation minutes for the graduation thesis.
Ho Chi Minh City, January 28th, 2024
Confirmation of the Thesis Reviewer          Student
(Sign and write full name)                   (Sign and write full name)
Confirmation of the Head/Deputy Head of the Faculty/Department in charge
(Sign and write full name)
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
COMMENTS OF THESIS REVIEWER
1. Structure and layout
- Sections are quite clear and logical.
- However, the experimental content has not been fully presented; the video frame analysis part has not been written yet. The student must complete all sections of the report.
2. Contents
- The student investigated how to determine whether a video file has been edited. This is an interesting topic that still has many challenges.
- The student proposed a method focusing on analysis of the audio and video frames extracted from the original video, then used the cosine metric to calculate the similarity and drew conclusions based on that similarity.
- This approach is appropriate, but the experimental results are not fully shown in the report. Therefore, the student must add the experimental results section fully and clearly.
3. Experiments and application
- The ISOT dataset includes two categories of articles: fake and real news. The dataset was acquired from authentic sources, with the truthful articles extracted by web-crawling articles from Reuters.com. The fake articles were gathered from untrustworthy websites identified by PolitiFact (a reputable fact-checking group in the USA) and Wikipedia. This is a good dataset.
- The student carried out a YouTube2Text experiment using Whisper.
4. References
- Good.
5. Behavior
- The student showed a good attitude while working with the reviewer.
Overall assessment (please choose one of the following categories: Fair/Good/Excellent/Outstanding):
Mark:
Dang Truc Lam: 7.0/10
Ho Chi Minh City, January 22nd, 2023
Reviewer
Cao Thi Nhan
ACKNOWLEDGEMENTS
First of all, allow me to express my sincere appreciation to my advisors, Prof. Do Phuc and MSc. Nguyen Thi Kim Phung, for their extraordinary instruction, assistance, and knowledge during the course of this research undertaking. Their perceptive critique, support, and steadfast dedication have played a critical role in molding this thesis and fostering my scholarly development.
The guidance and clarity that Prof. Do Phuc's extensive knowledge and experience in Machine Learning have provided to my research have been invaluable. His commitment to my scholarly growth has been phenomenally motivating, and I am appreciative of the inspiration he has provided through his mentoring.
The supervision and constructive feedback provided by MSc. Nguyen Thi Kim Phung have made a substantial contribution to the enhancement of this thesis. Her erudite counsel and meticulous attention to detail were indispensable in elevating the caliber of my research endeavors. I express gratitude for her unwavering support and tolerance during this endeavor.
Additionally, I would like to express my gratitude to the faculty and staff of the Faculty of Information Systems for their invaluable assistance, educational materials, and supportive atmosphere, all of which have significantly enhanced my educational journey. Their contributions were crucial to the successful completion of this study.
In conclusion, I express my gratitude towards my fellows for their support, intellectual stimulation, and positive reinforcement, all of which have contributed to the overall enrichment of this scholarly expedition.
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
THESIS PROPOSAL
Advisors:
Prof. Do Phuc (Email: phucd@uit.edu.vn)
MSc. Nguyen Thi Kim Phung (Email: phungntk@uit.edu.vn)
Misinformation and manipulated content have become more widely distributed as a result of the exponential expansion of digital media [1], [2], particularly on platforms such as YouTube [3], [4], [5]. The objective of this study is to devise a technique for identifying the originators of misinformation and manipulated videos on YouTube. The purpose of this research is to develop a framework for the identification of manipulated videos through the application of deep learning algorithms and the analysis of their characteristics. The results of this study will help enhance the efficacy of approaches to counter misinformation on digital platforms.
The first part of the study examines how common misinformation (manipulated or modified information) is on digital platforms, focusing on YouTube as a major channel through which altered content circulates. After carefully reviewing the previous research on the topic, this study identifies problems with the current detection methods and suggests a new approach that uses deep learning algorithms to make detection more accurate.
The method involves collecting a diverse set of YouTube videos that include both real and fake content. Videos are broken down into components such as image frames and audio patterns so that deep learning models can be applied.
1. Objectives: Analyzing the distinctions between original YouTube videos and modified videos. This serves as a fundamental basis for information security and data integrity on digital platforms.
2. Scope: The subject matter covers the analysis of YouTube videos and video similarity comparison.
3. Methodologies: Applying algorithms and deep learning models specifically designed for Natural Language Processing (NLP) to address challenges such as YouTube2Text; assessing the vector parameters to ascertain whether the input video has been modified; and simultaneously evaluating the performance of existing models to determine the most suitable model for the procedure.
4. Expected result: A proposal for an optimized method for detecting videos that spread misleading or modified information with the intention of manipulating viewers.
References:
[1] C. Chen, H. Wang, M. Shapiro, Y. Xiao, F. Wang, and K. Shu, "Combating Health Misinformation in Social Media: Characterization, Detection, Intervention, and Open Issues." arXiv, Nov. 09, 2022. Accessed: Jan. 06, 2024. [Online]. Available: http://arxiv.org/abs/2211.05289
[2] S. Kumar and N. Shah, "False Information on Web and Social Media: A Survey." arXiv, Apr. 23, 2018. Accessed: Jan. 06, 2024. [Online]. Available: http://arxiv.org/abs/1804.08559
[3] H. O.-Y. Li, E. Pastukhova, O. Brandts-Longtin, M. G. Tan, and M. G. Kirchhof, "YouTube as a source of misinformation on COVID-19 vaccination: a systematic analysis," BMJ Glob. Health, vol. 7, no. 3, Mar. 2022, doi: 10.1136/bmjgh-2021-008334.
[4] L. Tang et al., "'Down the Rabbit Hole' of Vaccine Misinformation on YouTube: Network Exposure Study," J. Med. Internet Res., vol. 23, no. 1, p. e23262, Jan. 2021, doi: 10.2196/23262.
[5] I. Srba et al., "Auditing YouTube's Recommendation Algorithm for Misinformation Filter Bubbles," ACM Trans. Recomm. Syst., vol. 1, no. 1, pp. 1-33, Mar. 2023, doi: 10.1145/3568392.
Research plan:
- Identify and collect a diverse dataset of YouTube videos.
- Preprocess the data, including cleaning and organizing the videos for analysis.
- Extract features from the collected videos, including images and audio.
- Develop deep learning models for YouTube2Text, Doc2Vec, and document comparison.
- Evaluate the performance of the trained models.
- Compare model results with other related works.
Approved by the advisor(s)
Signature(s) of advisor(s)
Prof. Do Phuc
MSc. Nguyen Thi Kim Phung
Ho Chi Minh City, 01/09/2023
Signature(s) of student(s)
Dang Truc Lam
TABLE OF CONTENTS
THESIS PROPOSAL ... xi
TABLE OF CONTENTS ... xv
LIST OF FIGURES ... xvii
LIST OF TABLES ... xx
LIST OF ABBREVIATIONS ... xxi
ABSTRACT ... 1
Chapter 1 – Introduction ... 2
1.1 General introduction ... 2
1.2 Problem statement and challenges ... 3
1.2.1 Problem statement ... 3
1.2.2 Challenges ... 3
1.3 Surveying methods for analyzing video content ... 3
1.4 Study objectives and scope ... 7
1.4.1 Objectives ... 7
1.4.2 Scope ... 7
1.5 Implementation details ... 7
1.6 Report outline ... 8
Chapter 2 – Theoretical Background ... 9
2.1 Current approaches to solve the problem ... 9
2.1.1 Approaches to video content based on Audio ... 9
2.1.2 Approaches to video content based on Image ... 23
2.1.3 Vector comparison methodology ... 39
Chapter 3 – Experiments and Evaluations ... 43
3.1 Proposed methodology ... 43
3.2 Datasets introduction ... 44
3.2.1 Lack of video dataset survey ... 44
3.2.2 Alternative datasets for the problem ... 47
3.3 Environment ... 49
3.4 Implementation process description ... 49
3.4.1 Audio analysis ... 49
3.4.2 Video frame analysis ... 51
Chapter 4 – Conclusion and Future directions ... 56
4.1 Achievements and limitations ... 56
4.1.1 Achievements ... 56
4.1.2 Limitations ... 56
4.2 Future directions ... 56
REFERENCES ... 58
APPENDICES ... 65
LIST OF FIGURES
Figure 1.1 Topic-wide distribution of videos [7] ... 4
Figure 1.2 Pipeline of Jagtap et al. proposed methodology [7] ... 4
Figure 1.3 Multi-class classification: Best models and embeddings with highest weighted F1-score for each topic [7] ... 5
Figure 1.4 Summary of the datasets used in [10] ... 5
Figure 1.5 Evaluation of fine-tuning base transformer models and few-shot learning on the three datasets [10] ... 6
Figure 2.1 Baevski et al.'s Wav2vec 2.0 framework [11] ... 9
Figure 2.2 Conformer encoder model architecture [19] ... 11
Figure 2.3 Overview of Whisper methodology architecture [21] ... 13
Figure 2.4 Word Error Rate (WER) distribution of Whisper and state-of-the-art commercial and open-source ASR systems in long-form transcription, using 7 distinct long-form English-only datasets ... 14
Figure 2.5 A timeline overview of the latest advancements in fundamental network structures and models for word embeddings, starting from 2013 ... 15
Figure 2.6 Bag of words model [25] ... 16
Figure 2.7 Example of drawback of BoW [26] ... 16
Figure 2.8 TF formula [27] ... 17
Figure 2.9 IDF formula [27] ... 17
Figure 2.10 The difference between CBOW and Skip-gram [29] ... 18
Figure 2.11 PV-DM model ... 18
Figure 2.12 PV-DBOW model ... 19
Figure 2.13 Performance comparison on SST dataset ... 21
Figure 2.14 Performance comparison on IMDB dataset ... 22
Figure 2.15 Example of 3 paragraphs ... 23
Figure 2.16 Results of example in Figure 2.15 ... 23
Figure 2.17 Video to frames using OpenCV example [39] ... 24
Figure 2.18 Anaz and Faris, 2015 performance comparison between OpenCV and MATLAB in a real-time application [40] ... 25
Figure 2.19 The architecture of the proposed VONet ... 27
Figure 2.20 Average running time on an RGB image of size 768x512 from the Kodak dataset by the test methods ... 27
Figure 2.21 Example of feature extracting architecture ... 27
Figure 2.22 An example of CNN architecture for image classification [50] ... 28
Figure 2.23 Primary calculations executed at each step of a convolutional layer [50] ... 29
Figure 2.24 The convolution value is calculated by taking the dot product of the corresponding values in the kernel and the channel matrices [51] ... 29
Figure 2.25 ReLU function formula [50] ... 30
Figure 2.26 Three types of most frequently utilized pooling methods [50] ... 30
Figure 2.27 Fully connected layer [50] ... 31
Figure 2.28 Softmax mathematical formula [50] ... 32
Figure 2.29 RNN architecture [52] ... 32
Figure 2.30 … 33
Figure 2.31 Visualized diagram of vanishing gradient ... 34
Figure 2.32 Workflow comparison between RNN and LSTM ... 35
Figure 2.33 Transformer architecture ... 36
Figure 2.34 Self-attention helps the model understand more about the relationship between sentences ... 37
Figure 2.35 Positional encoding description ... 38
Figure 2.36 Comparison of BERT and recent improvements over it [55] ... 39
Figure 2.37 Document similarity using cosine example [57] ... 40
Figure 2.38 Cosine similarity formula ... 40
Figure 2.39 Euclidean distance formula ... 41
Figure 2.40 Euclidean distance example [60] ... 41
Figure 2.41 Manhattan distance formula ... 42
Figure 2.42 Manhattan distance example [60] ... 42
Figure 3.1 Proposed methodology architecture ... 44
Figure 3.2 Multimodal in terms of fake information description [61] ... 45
Figure 3.3 Summary of the most relevant works on factuality, covering different modalities and tasks [61] ... 46
Figure 3.4 Categories and number of articles per category ... 47
Figure 3.5 Comparison with other existing datasets ... 48
Figure 3.6 Average AUC performance of SOTA detection methods on each dataset ... 48
Figure 3.7 Example result of YouTube2Text using Whisper ... 49
Figure 3.8 Documents after stemming ... 50
Figure 3.9 Average similarity percent, using cosine ... 51
Figure 3.10 Video2Vec example ... 51
Figure 3.11 4-D array example ... 52
Figure 3.12 Cosine distance formula ... 52
Figure 3.13 Average distance over whole dataset ... 54
Figure 4.1 Stemming function in Doc2Vec ... 65
Figure 4.2 Calculate similarity percent in Doc2Vec ... 65
Figure 4.3 Function to calculate the percentage ... 65
Figure 4.4 Cosine distance percent in Video2Vec ... 67
LIST OF ABBREVIATIONS
NLP: Natural Language Processing
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
BERT: Bidirectional Encoder Representations from Transformers
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
CTC loss function: Connectionist Temporal Classification loss
WER: Word Error Rate
HMM baseline: Hidden Markov Models
WSPSR: Web-scale supervised pretraining for Speech Recognition
ASR: Automatic Speech Recognition system
BoW: Bag of Words
TF-IDF: Term Frequency - Inverse Document Frequency
CBOW: Continuous Bag of Words
DBOW: Distributed Bag of Words
PV-DM: Distributed memory version of paragraph vector
PV-DBOW: Distributed Bag of Words version of Paragraph Vector
RGB: Red, Green, Blue channel
GAP: Global Average Pooling
MSE: Mean Squared Error
LSTM: Long Short-Term Memory
SOTA: State-of-the-art
ABSTRACT
The advent of digital platforms has fundamentally transformed the dissemination of information and news, providing unparalleled possibilities for worldwide communication and the exchange of knowledge. The emergence of digital platforms, including social media, news websites, and content-sharing platforms, has facilitated widespread access to information, enabling people and organizations to rapidly reach large audiences.
YouTube, as one of the major worldwide video-sharing platforms, provides exceptional potential for content creators to effectively reach varied audiences with a wide range of information and news content. Nevertheless, the widespread availability of YouTube content distribution has also resulted in difficulties with the accuracy and reliability of material provided on the platform [1], [2], [3].
The research methodology entails gathering a heterogeneous dataset of authentic YouTube videos that encompass a wide range of genres and subjects. Deep learning models are used to extract and analyze features through image analysis and audio analysis, as well as natural language processing (NLP).
Overall, the study's findings are anticipated to enhance the creation of more efficient tactics for identifying and countering the exploitation of authentic YouTube videos for manipulative intentions. The research seeks to improve the openness and integrity of digital content on platforms like YouTube by identifying sources that participate in such behaviors.
Chapter 1 – Introduction
1.1 General introduction
The advent of digital platforms has fundamentally transformed the dissemination of information and news, providing unparalleled possibilities for worldwide communication and the exchange of knowledge. The emergence of digital platforms, including social media, news websites, and content-sharing platforms, has facilitated widespread access to information, enabling people and organizations to rapidly reach large audiences. This has enabled individuals to engage in public discussions, exchange various viewpoints, and access a broad array of news outlets. Nevertheless, this reality has also resulted in the widespread dissemination of false information, deliberate misinformation, and manipulative propaganda, which present substantial threats to public confidence and the stability of society [4], [5].
YouTube, as one of the major worldwide video-sharing platforms, provides exceptional potential for content creators to effectively reach varied audiences with a wide range of information and news content. Its extensive user base and algorithmic recommendation systems have significantly changed the way people access and interact with information, influencing public discussions worldwide. Nevertheless, the widespread availability of content distribution has also resulted in difficulties with the accuracy and reliability of material provided on the platform [1], [2], [3].
Given the aforementioned concerns, it is necessary to create techniques for detecting manipulated videos and identifying the sources of misinformation videos in order to ensure safety, security, and the accuracy of information. However, it is exceedingly difficult to distinguish subtle discrepancies in videos of long duration. Furthermore, it is almost unfeasible to manually authenticate a substantial volume of data. Hence, this study was undertaken to surmount these challenges.
1.2 Problem statement and challenges
1.2.1 Problem statement
… as it increases the likelihood of misleading information and copyright violations.
The problem is to develop a technique to analyze both original and modified YouTube videos, assess their characteristics, and ascertain whether a given video has undergone any alterations.
The input consists of video segments, and the final result is the distance percentage between those input videos.
1.2.2 Challenges
Lack of a pre-labeled dataset:
• Building a large dataset containing videos labeled as "original" and "modified copies" is a challenging task. Furthermore, collecting such data requires strict adherence to principles of data security, intellectual property rights, and copyright.
1.3 Surveying methods for analyzing video content
Jagtap et al., 2021 [7] utilized a technique that analyzes the characteristics and features of each video through its subtitles, commonly referred to as video captions, in order to detect videos spreading misleading data. The researchers utilized a dataset of YouTube videos consisting of metadata recorded in CSV format, including video titles, descriptions, view counts, likes, and dislikes. The dataset covered subjects such as the Vaccines Controversy, 9/11 Conspiracy, Chem-trail Conspiracy, Moon Landing Conspiracy, and Flat Earth Theory. The authors developed a multi-class prediction model utilizing natural language processing (NLP) to classify videos into classes such as "misinformation" and "debunking misinformation or neutral."
Topic | Original Count | Available videos with captions
Vaccines Controversy | 775 | 621 (28.5%)
9/11 Conspiracy | 654 | 436 (20.0%)
Chem-trail Conspiracy | 675 | 484 (22.2%)
Moon Landing Conspiracy | 466 | 317 (14.6%)
Flat Earth | 373 | 317 (14.6%)
Figure 1.1 Topic-wide distribution of videos [7]
Figure 1.2 Pipeline of Jagtap et al. proposed methodology [7]
The authors obtained video captions (subtitles) by using a video scraper to extract data from the links provided in the dataset. Subsequently, they carried out preprocessing procedures to exclude erroneous or noisy data. Afterwards, they performed word embedding using four state-of-the-art techniques [8], [9]: Stanford GloVe Wikipedia vectors (100D), Stanford GloVe Wikipedia vectors (300D), Word2Vec Google News (300D), and Word2Vec Twitter (200D). Ultimately, the authors calculated F1, AUC-ROC, precision, and recall scores to identify the best classifiers with the most impactful embeddings for all topics.
Figure 1.3 Multi-class classification: Best models and embeddings with highest weighted F1-score for each topic [7]
Christodoulou et al., 2023 [10] proposed a novel methodology to detect misinformation in YouTube videos. They achieved this by employing video classification techniques and cross-referencing information from video transcripts. The authors employed a combination of two transfer learning techniques: fine-tuning base transformer models (such as BERT, RoBERTa, and ELECTRA) and utilizing sentence transformers (MPNet and RoBERTa-large) for few-shot learning. They conducted experiments using three distinct datasets to provide the most thorough outcomes.
Dataset | Type | Number of Samples
YouTube Audit (Vaccines) | Misinformation | 652
YouTube Audit (Vaccines) | Non-misinformation | 636
YouTube Pseudoscience | Pseudoscience | 182
YouTube Pseudoscience | Science | 226
ISOT Fake News | Fake | 1000
ISOT Fake News | Real | 1000
Figure 1.4 Summary of the datasets used in [10]
The evaluation results indicate that RoBERTa outperformed BERT and ELECTRA among the fine-tuned models on the YouTube Audit (Vaccines) dataset, while MPNet few-shot demonstrated superior performance compared to RoBERTa-large. Within the YouTube Pseudoscience dataset, the few-shot learning models surpassed the fine-tuned models; specifically, MPNet few-shot outperformed RoBERTa-large with higher scores. ELECTRA achieved the highest performance on the ISOT Fake News dataset.
Figure 1.5 Evaluation of fine-tuning base transformer models and few-shot learning on the three datasets [10]
Unlike Jagtap et al., 2021 [7], who focused on topic-specific strategies using various classifiers and embeddings, the approach of Christodoulou et al., 2023 emphasizes the versatility of transformer models. These findings highlight the importance of context-specific approaches and the role of deep learning models, embeddings, and transfer learning in combating misinformation.
Additionally, several other studies have commonly focused on classification tasks based on video subtitles. However, such methodologies may fail to take into account the wide range of factors involved in detecting deceptive or altered material in videos, including but not limited to visuals, audio, and diverse other variables.
1.4 Study objectives and scope
1.4.1 Objectives
To tackle the urgent problem described earlier, this study proposes an approach to examine and compute the similarity and distance between videos, utilizing several natural language processing algorithms and deep learning models. This is accomplished concurrently with the processing of multimedia data, encompassing audio and video frame processing.
In the audio analysis process, the purpose is to obtain the average similarity percentage between the documents extracted from the audio.
In the video frame analysis process, instead of computing the similarity, calculating the distance between video frames is more suitable, since its representation uses positive values, which are easier to analyze. The goal is to obtain the average distance over the chosen dataset.
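To make the two measures concrete, the following is a minimal sketch, not the thesis's actual implementation, of cosine similarity between two document vectors and cosine distance between two frame vectors using NumPy; the example vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; values near 1 mean near-identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity; non-negative for non-negative vectors."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical embedding vectors for an original and a candidate video.
doc_original = np.array([0.12, 0.80, 0.35, 0.41])
doc_candidate = np.array([0.10, 0.75, 0.40, 0.38])

print(f"similarity: {cosine_similarity(doc_original, doc_candidate):.2%}")
print(f"distance:   {cosine_distance(doc_original, doc_candidate):.4f}")
```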
1.4.2 Scope
The scope of this thesis is based on alternative datasets, one for audio analysis and one for video frame analysis, since collecting YouTube videos that satisfy the problem conditions is challenging.
Some technologies used include:
- Language: Python
- Methodologies: Whisper, Paragraph Vector, etc.
- Libraries: PyTube, NumPy, Pandas, Matplotlib, Gensim, etc.
1.5 Implementation details
Step 1: Research, collect, and preprocess data. Survey models and related problems. Select a model suitable for the given problem.
Step 2: Experiment with available models and methods; compare and adapt them to meet the initial requirements. Additionally, propose directions for modifying and advancing the problem if necessary.
Step 3: Evaluate the model's output results, continuously adapting and improving to achieve the best performance.
1.6 Report outline
This thesis is structured with the following chapters and sections:
Chapter 1: Introduction
• Overview of the topic
• Introduction and summary of the objectives, problem statement, field, and scope of the study
Chapter 2: Theoretical Background
• Approaches to solving multimedia data analysis problems, including YouTube2Text, Doc2Vec, Video2Frame, and Img2Vec
Chapter 3: Experiments and Evaluations
• Implementation of the methods and models studied on the dataset
• Evaluation of the results
• Comparison with other methods in the same field
Chapter 4: Conclusion and Future directions
• Summary of the topic
• Conclusions
• Future directions for further development of the problem
Chapter 2 – Theoretical Background
2.1 Current approaches to solve the problem
This section introduces the basic concepts of a variety of analysis methodologies, covering both audio and image analysis.
2.1.1 Approaches to video content based on Audio
2.1.1.1 YouTube2Text
The rapid development of digital devices and technologies has made the analysis and processing of multimedia data essential in the field of information technology, particularly in the domains of deep learning and artificial intelligence. In particular, audio data processing research is receiving growing interest, with scholars exploring numerous experimental methodologies.
Wav2Vec
Baevski et al., 2020 introduced a novel approach for unsupervised pre-training. They utilized a convolutional neural network (CNN) to encode raw audio waveforms into latent speech representations, which are subsequently fed into a Transformer network to generate contextualized representations as the final output.
The architecture of the Wav2vec 2.0 model comprises a multi-layer convolutional feature encoder, which accepts raw audio as input and generates latent speech representations. The Transformer processes these representations to produce contextualized representations that encompass information from the entire sequence. The feature encoder's output is discretized through a quantization module, which provides the targets in the self-supervised objective. The technique differs from earlier models by emphasizing the construction of context representations and the utilization of self-attention to capture dependencies throughout the whole sequence of latent representations [12].
In the pre-training phase, the objective is to acquire speech audio representations by solving a contrastive task: identifying the correct quantized latent speech representation for a masked time step from among a group of distractors.
Within the realm of speech recognition, pre-trained models are adapted by adding a linear projection layer on top of the existing network [13]. This layer represents the vocabulary specific to the given task. In the case of LibriSpeech, the models are optimized using a CTC loss function [14] and a modified technique known as SpecAugment, which masks time steps and channels during training [15]. This methodology helps mitigate overfitting and results in improved error rates.
An advantage of this technology is its ability to process raw audio data without relying on human labels, enabling it to handle a substantial volume of data. Nevertheless, because of its unsupervised learning, it may not achieve the utmost mapping performance, and a degree of fine-tuning is necessary in order to carry out tasks such as speech recognition.
recognition
Chan et al., 2021 [16] developed a speech recognition model named SpeechStew.This model combines different existing datasets, including AMI, Broadcast News,
Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal.
Importantly, no domain-specific adjustments were made to the datasets in terms of
re-10
Trang 34balancing or re-weighting The architecture utilized in this model is the Conformer
RNN-T [17] (Figure 2.2), as proposed by Gulati et al., 2020 [16] Additionally, the
model incorporates the wav2vec 2.0 model [11] for pre-training with 1 billionparameters [18]
Figure 2.2 Conformer encoder model architecture [19]
The transfer learning capabilities of SpeechStew allow for the fine-tuning and
adaptation of a general-purpose model trained on diverse datasets to a new application.The provided example is CHiME-6 [20], a dataset comprising of conversational audiorecorded in noisy surroundings This dataset poses a significant challenge when itcomes to training speech recognition models directly The authors showcase theefficacy of transfer learning in enhancing performance on the specified task by refining
SpeechStew using CHiME-6.
The transfer learning capabilities of SpeechStew are exceedingly useful, enablingthe authors to train a versatile model once and subsequently refine it for specific low-resource tasks This approach is economically efficient as fine-tuning necessitates only
lãi
Trang 35a limited number of thousand steps, in contrast to the roughly 100k steps required to
train a model from the beginning.
And as expected, SpeechStew consistently delivers nearly state-of-the-artperformance across a range of tasks, without relying on an external language model.The results of our study demonstrate a Word Error Rate (WER) of 9.0% on AMI-IHM,4.7% on Switchboard, 8.3% on CallHome, and 1.3% on WSJ, and with CHiME-6, theyachieve 38.9% WER without a language model, which compares to 38.6% WER to astrong HMM baseline with a language model These results indicate a significantimprovement over previous research, particularly when considering the utilization ofrobust external language models
Whisper
Radford et al., 2022 [21] conducted a study on speech processing systems that weretrained to predict transcripts of audio from the internet The researchers found thatwhen these systems were trained on a large amount of labeled audio data (680,000hours), which is an order of magnitude larger than previous datasets, with multilingualand multitask supervision, they performed well on standard benchmarks and werecompetitive with fully supervised models without the need for fine-tuning They
introduce a method called Whisper (basic name for WSPSR — Web-scale supervised
pretraining for Speech Recognition) in order to support further work on robust speechprocessing, based on the previous approaces
The dataset being used is constructed from the audio that is paired with transcripts
on the Internet, which is generated by both human and the output of ASR systems,since there existed numerous researches showing that training on datasets of mixedhuman and machine-generated data can significantly improve the modelperformance [22]
The researchers employed a well-established encoder-decoder Transformer [23]
design in their study on large-scale supervised pre-training for speech recognition since
it has demonstrated its reliability in terms of scalability The audio was enhanced byresampling it to a frequency of 16,000 Hz and generating an 80-channel log-magnitude
12
Trang 36Mel spectrogram representation The encoder utilized convolution layers with a filter
width of 3 and the GeLu activation function It was then followed by Transformer
blocks On the other hand, the decoder used learnt position embeddings and tied output token representations
input-@s (background music playing)
5 re Sequence-to-sequence learnin ee eae
Multitask training data (680k hours) od eq g [ew [em] 00 | The ck row
|
English transcription kz.
$ *Ask not what your country can do for -” a a]
#2 Ask not what your country can do for ——>
mm mm
Any-to-English speech translation i § :
: Hi :
$ “El rápido zorro marrón salta sobre -” Transformer Š “———>
Encoder Blocks 3 om) Transformer
2 The quick brown fox jumps over + Š >», =a Decoder Blocks
5
Non-English transcription SS!
[ ce | | oes]
$: "cit Pol Bet etc LISLE Wa We
Sinusoidal L——> (cross attention)
J AS Hol Sep ujeice 4S Wa Pe Positional (sa ateriion —)
PREV > text tokens _ r,, EOT
† ME || TP DỰNG eee nes | tes Exene J
Custom vocabulary / fa}
prompting Yi an Lew \ x= Le Text-only transcription
special text timestamp (VAD) Translation (allows dataset-specific fine-tuning)
tokens tokens tokens ts
Figure 2.3 Overview of Whisper methodology architecture [21]
- Multitask training format: the authors employed a single model to perform the entire speech processing pipeline, handling different tasks on the same audio input, including transcription, translation, voice activity detection, alignment, language identification, etc. Furthermore, the authors used a straightforward format in which all tasks and conditioning information are specified as a sequence of input tokens to the decoder, encompassing special tokens, text tokens, and timestamp tokens. Additionally, the decoder is trained to condition on the previous text history of the transcript, aiming to exploit longer-range text context to resolve ambiguities in the audio. To indicate the beginning of prediction, the authors use a <start of transcript> token, followed by a <no speech> token in case the input audio contains no speech, or a <language tag> if speech is recognized in the audio, before executing the next tasks of transcription and translation. Afterwards, a <no timestamps> token specifies whether timestamps need to be predicted, and lastly, an <end of transcript> token is placed at the end of the format.
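Schematically, and using the token names from the description above rather than Whisper's exact vocabulary strings, a decoder input for transcribing English speech without timestamps might look like this sketch:

```python
# Hypothetical decoder token sequence for English transcription, no timestamps.
# The names mirror the description above; the real vocabulary strings differ.
decoder_sequence = [
    "<start of transcript>",
    "<language tag: en>",            # emitted because speech was detected
    "<task: transcribe>",
    "<no timestamps>",
    "The", "quick", "brown", "fox",  # predicted text tokens
    "<end of transcript>",
]
```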
Due to factors like unclear speech and labeling errors, each dataset used for Automatic Speech Recognition (ASR) [24] has a varying level of irreducible error. Evaluating ASR performance solely on the Word Error Rate (WER) metric makes it challenging to determine the potential for improvement on each dataset. To assess how well the Whisper ASR system performs relative to humans, a study was conducted using 25 recordings from the Kincaid46 dataset; the results indicate that Whisper's English ASR performance is very close to human-level accuracy, with only a slight difference in WER (1.15% better).
Figure 2.4 Word Error Rate (WER) distribution of Whisper and state-of-the-art commercial and open-source ASR systems in long-form transcription, using 7 distinct long-form English-only datasets.
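Since the thesis pairs PyTube with Whisper for its YouTube2Text step (see Section 1.4.2 and Figure 3.7), the following is a minimal sketch of that flow under stated assumptions: the openai-whisper and pytube packages are installed, ffmpeg is available, and the URL is a placeholder.

```python
import whisper
from pytube import YouTube

# Download the audio-only stream of a video (placeholder URL).
yt = YouTube("https://www.youtube.com/watch?v=XXXXXXXXXXX")
audio_path = yt.streams.filter(only_audio=True).first().download(filename="audio.mp4")

# Transcribe with a small Whisper checkpoint; larger checkpoints trade
# speed for accuracy.
model = whisper.load_model("base")
result = model.transcribe(audio_path)

print(result["text"])  # transcript used for downstream document comparison
```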
2.1.1.2 Doc2Vec
Word embedding is an essential technique in Natural Language Processing (NLP) that plays a fundamental role in expressing words as dense vectors in a high-dimensional space. These vectors capture semantic and syntactic associations among words, enabling deep learning algorithms to efficiently analyze and comprehend textual material. Word embedding allows algorithms to analyze language by associating words with continuous vector representations, preserving the contextual and hierarchical information found in natural language. This technology has significantly transformed numerous NLP tasks, including language modeling, sentiment analysis, machine translation, and document categorization, by offering a way to represent and analyze textual data in a more meaningful and computationally efficient manner.
Figure 2.5 A timeline overview of the latest advancements in fundamental network structures and models for word embeddings, starting from 2013.
Bag of Words (BoW): a widely used method in Natural Language Processing (NLP) that converts text documents of different lengths into fixed-length vectors based on word frequencies. These vectors disregard the grammatical syntax of sentences and the sequential arrangement of words [25]. The technique involves tokenization, counting word frequencies, and encoding the text data into numerical values.
Figure 2.6 Bag of words model [25]
However, this technique has several drawbacks, such as insensitivity to word order, disregard for punctuation and grammatical structure, and limited semantic information. Since BoW considers every word in isolation, the relationships between words are entirely ignored, which leads to a loss of meaning during NLP processing, especially for complicated sentences [26].
Document D1, "The child makes the dog happy", and Document D2, "The dog makes the child happy", yield the identical count vector (the: 2, dog: 1, makes: 1, child: 1, happy: 1), so BoW cannot distinguish them.
Figure 2.7 Example of drawback of BoW [26]
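As a concrete illustration of the counting scheme above, here is a minimal sketch applying scikit-learn's CountVectorizer, an assumed dependency rather than the thesis's own code, to the two example documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The child makes the dog happy",
        "The dog makes the child happy"]

vectorizer = CountVectorizer()        # tokenizes and counts word frequencies
bow = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # ['child' 'dog' 'happy' 'makes' 'the']
print(bow.toarray())  # both rows are identical: word order is lost
```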
Term Frequency - Inverse Document Frequency (TF-IDF): TF-IDF is a popular statistical method in natural language processing and information retrieval. It weighs the importance of a term in a document against the corpus as a whole, vectorizing a word by multiplying its TF by its IDF [27]. The drawbacks of TF-IDF are the same as those of BoW.
• Term Frequency (TF): the number of times a term appears in a document, relative to the document's total word count.

$$\mathrm{TF}(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

Figure 2.8 TF formula [27]
• Inverse Document Frequency (IDF): IDF reflects the proportion of corpus documents that contain a term. Technical terms, which appear in a limited number of documents, are weighted higher than common words such as 'a', 'the', and 'and'.

$$\mathrm{IDF}(t) = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain term } t}\right)$$

Figure 2.9 IDF formula [27]
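Putting the two formulas together, the following is a minimal sketch, again assuming scikit-learn rather than reproducing the thesis's implementation, that computes TF-IDF vectors and compares two documents with cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the dog is happy",
    "the child makes the dog happy",
    "deep learning models compare document vectors",
]

# TfidfVectorizer multiplies each term frequency by its inverse document frequency.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Similarity between the first two documents; shared terms raise the score.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```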
Word2Vec
Word2vec is a two-layer neural network that converts words into vectors that capture their meaning. The input of the system is a collection of written texts, known as a text corpus; the system then generates a set of feature vectors that represent the words found in the corpus. Although Word2vec is not classified as a deep neural network, it converts textual data into a numerical representation that deep neural networks can process [28]. The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space, detecting similarities mathematically using cosine similarity.
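As a small illustration of this grouping, here is a minimal sketch that trains Word2vec on a toy corpus with Gensim (a library listed in Section 1.4.2); the sentences and hyperparameters are placeholders, not the thesis's settings.

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens.
sentences = [
    ["the", "dog", "is", "happy"],
    ["the", "child", "makes", "the", "dog", "happy"],
    ["the", "cat", "is", "happy"],
]

# sg=0 selects the CBOW training scheme; sg=1 would select Skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["dog"].shape)              # a 50-dimensional word vector
print(model.wv.similarity("dog", "cat"))  # cosine similarity of two words
```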
The vector used to represent a word is called a neural word embedding. There are two ways to implement the distribution of words in vector space: