VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
LE SY THANH — 19522230 CHE NGUYEN MINH TUNG - 19522490
CONSTRUCTING A KNOWLEDGE GRAPH WITH FACT CHECKING ABOUT VIETNAMESE
CUISINE
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR: ASSOCIATE PROFESSOR DR. DO PHUC
HO CHI MINH CITY, 2023
ASSESSMENT COMMITTEE

The Assessment Committee is established under the Decision of the Rector of the University of Information Technology.

1. Assoc. Prof. Dr. Nguyễn Đình Thuận - Chairman
2. Dr. Cao Thi Nhan - Secretary
3. Dr. Lê Kim Hùng - Member
I extend my deepest gratitude to my mentor, Assoc. Prof. Dr. Đỗ Phúc, for his invaluable guidance and steadfast support throughout the completion of my project. His consistent backing has been a beacon of clarity, and I am truly appreciative. Special acknowledgment goes to M.S. Nguyễn Thị Kim Phụng, whose guidance and camaraderie have enriched our collaboration on this project.

In the second phase of this academic journey, my interactions with Dr. Cao Thi Nhan as my consultant have been particularly enriching. Dr. Nhan's benevolence extended beyond the scope of academia, fostering a sense of warmth and inclusion among our classmates throughout our university years. Her guidance has not only contributed to the academic aspects of my project but has also left an indelible mark on our collective experience.

Moving on to the third expression of gratitude, I find it imperative to recognize the meticulous efforts of the entire Information Systems Department faculty. Their responsiveness to my inquiries demonstrated a commitment to academic excellence that went above and beyond, significantly enhancing my understanding of the subject matter.

The fourth acknowledgment extends beyond the academic sphere, encompassing the support network that has been pivotal to my journey. My gratitude extends to my family, whose unwavering support has been a pillar of strength. Friends and classmates have formed a tapestry of encouragement and camaraderie, providing a backdrop of positivity that has propelled me forward. Their collective support and love serve as a constant reminder of the interconnectedness that makes academic pursuits all the more meaningful.
TABLE OF CONTENTS
ABSTRACT 1
Chapter 1 INTRODUCTION 2
1.1 Background and Context 2
1.2 Statement of the Problem 3
1.3 Objectives of the Study 4
1.4 Significance of the Study 5
1.5 Motivation 6
1.6 Contribution 8
1.7 Structure of the thesis 8
Chapter 2 BACKGROUND AND RELATED WORK 10
2.1 Resource Description Framework 10
2.3.4 Pre-training and Fine-Tuning 21
2.4 Softmax, Argmax and Loss functions 22
2.8.1 The transformer architecture 34
2.8.2 Semantic textual similarity 35
Chapter 3 SYSTEM DESIGN 37
3.1 Overview 37
3.2 Software and database design 38
3.2.1 System architecture 38
3.2.2 Knowledge graph construction
3.3 Algorithm processing flow
3.3.1 Integrate BERT-NER
3.3.2 System response processing
3.3.3 Fact-checking through semantic similarity
Chapter 4 SYSTEM IMPLEMENTATION
4.1 Overview
4.2 User query processing
4.3 Neo4j query processing
4.4 Answer results processing
4.5 Experiment and Discussion
Chapter 5 CONCLUSION & FUTURE WORK
LIST OF FIGURES

Figure 2-1 Triple examples 12
Figure 2-2 Example of a graph (Source: video "Introduction to Neo4j and Graph Databases", 2019) 13
Figure 2-3 BERT base and BERT large models (Source: web "BERT base vs BERT Large", 2019) 15
Figure 2-4 Architecture of Transformer (Source: web "The Transformer Model") 16
Figure 2-5 BERT input representation (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019) 17
Figure 2-6 Encoder function 19
Figure 2-7 Tensor dimensions 20
Figure 2-8 Overall pre-training and fine-tuning procedures for BERT (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019)
Figure 3-3 Pipeline of the system 43
Figure 3-4 Fact checking pipeline 49
Figure 4-1 Overview of the system processing flow 53
Figure 4-2 Output of Predict 55
Figure 4-3 Output of final results 55
Figure 4-4 JSON object of final result for triple 56
Figure 4-5 Execution plan for a Cypher query 58
Figure 4-6 Result from the plan 60
Figure 4-7 Result from Neo4j in JSON 61
Figure 4-8 Extraction results of related sentences 63
Figure 4-9 Word segmentation result 64
Figure 4-10 Cosine similarity result 65
Figure 4-11 Sort by highest similarity from high to low 66
Figure 4-12 Training Loss over Epochs 67
Figure 4-13 Result of precision, recall and F-score 68
Figure A-1 NER result after executed query (admin UI) 74
Figure A-2 Triple result after processing the label from NER (admin UI) 74
Figure A-3 Knowledge graph result from Neo4j 75
Figure A-4 Add triple interface 75
Figure A-5 Chatbot user interface 76
Figure A-6 Chatbot response with image and relevant URL reference 77
Figure A-7 Some examples of different queries 77
Figure A-8 Example with Vietnamese query 78
Figure C-1 Structure of the Knowledge Graph 81
LIST OF TABLES

Table 2-1 RDF triples 11
Table 2-2 Example softmax 23
Table 2-3 Example Argmax 23
Table 3-1 Data organization analysis 42
Table 3-2 List of dependency tags and NER example 45
Table 3-3 Our Sample Categorization 46
LIST OF ABBREVIATIONS

NLP Natural Language Processing
BERT Bidirectional Encoder Representations from Transformers
RDF Resource Description Framework
HTTP HyperText Transfer Protocol
NER Named Entity Recognition
ABSTRACT

With today's advanced technology, the use of chatbots is growing in popularity and strength. We want to learn more about this area, conduct further research, and create our own application. Rich data gathered from the Internet was used to create a knowledge graph, which we then used to build a chatbot - a question-answering application - based on the knowledge graph. Wikipedia content and online source pages are the main places to look for authenticity and dependability. The study presented in this thesis advances applications of knowledge representation and natural language processing. The research also addresses the challenges posed by the intrinsic depth and complexity of natural language, emphasizing the adaptability and flexibility of natural language, which makes it inherently difficult for computational analysis. The thesis aims to bridge this gap by leveraging advanced NLP approaches and innovative technologies to create a chatbot that can effectively engage users and serve as a reliable source of knowledge about Vietnamese culinary traditions. The subject of our knowledge graph is Vietnamese cuisine in 63 provinces. In this thesis, we present the process of natural language processing of queries from users, as well as the conditions for creating answers from chatbots and using information reference sources to support fact checking. Information reference sources become integral in this endeavor, acting as pillars of support in ensuring the accuracy and credibility of the knowledge imparted by our chatbot.
Chapter 1
INTRODUCTION
1.1 Background and Context
The swift advancement of digital technology and artificial intelligence has brought about revolutionary shifts in the ways people interact with online platforms and obtain information. Chatbots have become immensely useful tools in this age of rapid technological development, changing the way users engage with one another and providing prompt answers to questions. Artificial intelligence-powered chatbots are highly effective at mimicking human speech and are widely used to streamline communication and offer prompt support on a variety of online platforms.
Furthermore, a new era of intelligent conversation agents has been ushered in by the creation of Knowledge Graphs and advanced Natural Language Processing (NLP) techniques, which have coincided with the emergence of chatbots. A knowledge graph is an advanced data structure that is used to precisely record the complex relationships between concepts and things in a given domain. Cutting-edge models and algorithms are used in advanced NLP approaches to process and understand human language with ease. This feature greatly enhances the quality of user interactions by enabling chatbots to comprehend and react to user inquiries with increased precision and thoroughness.
This thesis explores how these innovative technologies come together to create a cutting-edge web chatbot with a unique emphasis on Vietnamese food. Through the integration of chatbot capabilities with sophisticated NLP and Knowledge Graph insights, this research aims to open up new avenues for providing a rich and dynamic user experience. The goal is to develop a chatbot that can effectively engage people and act as a reliable source of knowledge about the complex web of Vietnamese culinary traditions by investigating the intersection of various technological frontiers.
1.2 Statement of the Problem
Because of the language's intrinsic depth and complexity, texts written in natural language are by nature difficult to process automatically. The existence of ambiguity, which allows a statement to transmit wholly different meanings depending on its context, is one of the main causes of this complexity. Natural language is remarkably adaptable, as seen by its capacity to fit in with a variety of contexts with ease. But this very flexibility makes it extremely difficult for computers to understand.
Since natural language is inherently flexible, it is not feasible to cover every possible use case with a strict set of rules. Rather, the method uses algorithms created to take the meaning out of every sentence and extract the most important information. This methodology, which navigates the complex and dynamic nature of natural language and makes it accessible for computational analysis, is essential for enabling computer "comprehension" of languages.
Once the core data has been extracted, all of the important information related to the topic we chose to research is saved. We spent time gathering information from the internet, particularly Wikipedia, because it is the most trustworthy site available. Traditional databases are intended to store structured data rather than unstructured data such as text. Because natural language tends to link things together, we decided to store the data in a graph database.
Our assignment is to provide a brief overview of the problem we hope to solve. First, we gather a significant amount of information related to our research topic, including keywords and other important data. We then efficiently organize and store this data, setting the stage for further data processing. Specifically, we tailor our storage strategy to support both English and Vietnamese, offering flexibility for a range of linguistic scenarios. Right now, we are primarily concerned with data storage optimization, specifically for the Vietnamese language.
We address this challenge because of its significant practical implications, with one notable application already implemented by Google. Google search provides website links that show the most relevance to the search query. These results contain information obtained from processed articles containing the specified keywords. This shows that Google lists the most relevant websites first. Our application is specifically designed to target and implement similar functionality. In addition, to support information verification, we research and implement fact checking based on the knowledge graph. Fact checking is an important component to ensure the accuracy and reliability of the chatbot's information; we provide source pages from the internet to support this fact checking. In practice, this means providing references from websites to increase the credibility of the chatbot's answers.
1.3 Objectives of the Study
The main goal of this research is to design and implement a web-based chatbot that uses Knowledge Graph technology to provide users with detailed information about the different types of food that are available in different Vietnamese cities and provinces. The particular goals consist of:
a) Building a Comprehensive Knowledge Graph:
• Establishing a robust Knowledge Graph that intricately captures relationships between Vietnamese dishes, ingredients, and regional nuances.
b) Developing an Intuitive User Interface:
• Crafting a user-friendly website featuring an integrated chatbot interface, ensuring seamless and accessible interaction.
c) Implementing BERT-based Named Entity Recognition (BERT-NER):
• Extracting and facilitating support for website operations through the utilization of the Named Entity Recognition model as the designated and optimized model for data extraction.
d) Taking Advantage of the PhoBERT and VnCoreNLP Pre-trained Models:
• Investigating and evaluating the PhoBERT and VnCoreNLP pre-trained models to process and present relevant information and enrich the Knowledge Graph with additional information.
e) Evaluating Chatbot Accuracy and User Experience:
• Rigorously assessing the accuracy of the chatbot's responses and gauging the overall user experience to refine and optimize its functionalities.
f) Assessing Practicality and Effectiveness:
• Conducting an extensive evaluation of the system's practicality and effectiveness, especially concerning users seeking information about Vietnamese cuisine.
By delineating these specific objectives, this research aspires to contribute meaningfully to the integration of Knowledge Graph technology in a chatbot framework, offering users a valuable and enriching experience while exploring the diverse culinary landscape of Vietnam.
1.4 Significance of the Study
This research holds several significant implications:
a) Developing Technological Frontiers:
• This work makes a substantial contribution to the advancement of knowledge graphs, chatbot creation, and natural language processing (NLP). It pushes the limits of innovation and integration in these fields by showcasing a useful application of these technologies within the culinary domain.
b) Improving Access to Culinary Knowledge:
• The research serves as a clearinghouse for gastronomic insights and is a useful tool for a wide range of people who are interested in Vietnamese food. Serving travelers, foodies, and scholars alike, it provides a thorough and immersive entry point to discover the nuances of Vietnamese culinary customs.
c) Unveiling Interaction Dynamics:
• This study illuminates the dynamic field of user-centric artificial intelligence by exploring the opportunities and difficulties of building a chatbot that uses Knowledge Graphs for personalized interactions. The knowledge acquired advances both technology and our sophisticated understanding of human-machine interaction.
d) Opening the Path for Future Projects:
• This research has the potential to act as a model for similar projects in various culinary traditions and cultural domains, even beyond its immediate application. The aforementioned framework establishes a standard for subsequent endeavors aiming to utilize analogous technologies for the conservation and propagation of cultural and gastronomic legacy.
1.5 Motivation
This research is motivated by a number of factors, primarily the need to close important gaps in the current knowledge extraction systems. Although many systems perform well in English-only settings, our goal is to create a reliable system for data extraction and storage that is compatible with both Vietnamese and English. At present, however, we only extract English sentences because of limited time; Vietnamese extraction will be addressed in future work. In a nation of over 90 million people where internet usage is expanding quickly, the lack of a system that is sensitive to language and culture becomes a critical issue. Given Vietnam's status as a developing country, our research is motivated by the system's potential to revolutionize a number of facets of daily life. Furthermore, the absence of a system specifically designed for Vietnamese presents a noteworthy obstacle, and our inspiration stems from the conviction that a knowledge graph system can be crucial in bridging these knowledge gaps. Moreover, the use of graphs in NLP is becoming more and more common, which motivates us and gives us a chance to advance this developing field. Our main objective is to provide the Vietnamese people with a knowledge graph system that not only fulfills their information needs but also highlights the diversity of Vietnamese culture, especially in terms of food customs, and acts as a useful guide for foreign visitors.
The difficulties arise when attempting to glean information from Vietnamese paragraphs. Vietnamese-written text cannot be properly processed by the current set of tools. To be more precise, no system exists that can extract triples from text in Vietnamese. It became nearly impossible for us to complete the task in the limited time we had. In the end, we decided that it would be better to use the already-existing tools, which process English text well and can also be trained to comprehend Vietnamese text. Additionally, there are far more English texts available than Vietnamese, despite the fact that Vietnamese texts are much richer and more elusive. In order to train our model to comprehend and extract text in both Vietnamese and English, we also chose to construct a pre-existing knowledge graph that included text in both languages. Vietnamese cuisine is the exclusive focus of our system domain. It is anticipated that the system will quickly and easily produce a trained model that is stable and accurate.
Furthermore, we would like to demonstrate the knowledge graph's proficiency in NLP tasks. Our goal is to create an application that uses the knowledge graph we created to answer questions. The goal of this application is to have natural language communication with humans. In a strict sense, this application ought to be able to communicate with users and respond to their inquiries. Our knowledge graph and the model we have trained serve as this application's brain.
In spite of the fact that the questions may be challenging to comprehend or may be correct or incorrect depending on the individual, we hope to be able to construct and train our model robustly enough for the system to provide the most compelling responses when posed a query. The outcomes of this thesis should hopefully be a step in that direction.
1.6 Contribution
The thesis "Constructing a Knowledge Graph with Fact-Checking about Vietnamese Cuisine" makes several contributions to the field of natural language processing (NLP) and knowledge graph construction:
• Firstly, the thesis proposes a novel approach to constructing a knowledge graph related to Vietnamese cuisine, utilizing advanced NLP techniques such as Named Entity Recognition (NER) and Bidirectional Encoder Representations from Transformers (BERT). The integration of these techniques enables the creation of a comprehensive and accurate knowledge graph that captures the relationships between Vietnamese dishes, ingredients, and regional nuances.
• Secondly, the thesis presents a sophisticated web chatbot that utilizes the knowledge graph to provide users with detailed information about Vietnamese cuisine. The chatbot's intuitive user interface and fact-checking capabilities enhance the user experience and increase the credibility of the chatbot's answers.
• Thirdly, the thesis evaluates the effectiveness and practicality of the system, providing insights into the accuracy of the chatbot's responses and the overall user experience. The evaluation demonstrates the system's ability to effectively engage users and serve as a reliable source of knowledge about Vietnamese culinary traditions.
The thesis contributes to the advancement of NLP and knowledge graph construction by demonstrating the practical application of state-of-the-art algorithms and models in the context of Vietnamese cuisine. The integration of advanced NLP techniques, knowledge graph insights, and the development of a sophisticated web chatbot paves the way for a rich and dynamic user experience, offering new avenues for exploring the intersection of technology and culinary traditions.
1.7 Structure of the thesis
The structure of our thesis is as follows. The introduction summarizes the background, goals, problems stated, importance of the research, and inspiration behind this graduate thesis.
• Chapter 2, Theoretical Framework: This section outlines the theoretical framework explored to complete the thesis.
• Chapter 3 describes our proposed system design, including the approach, sources of data to be collected, and models chosen for the system's construction. It encompasses details about system components and design options.
• Chapter 4, Operational System Implementation: In addition to providing a description of the operational system, Chapter 4 covers inputs, outputs, service details, user interface, and operational system implementation details.
• Chapter 5, Conclusions and Development Plan: The final chapter presents the thesis's conclusions and outlines the development plan intended to be carried out soon.
Chapter 2
BACKGROUND AND RELATED WORK
2.1 Resource Description Framework
The Resource Description Framework (RDF) serves as a fundamental abstract model within the realm of the Semantic Web, a concept advanced by the World Wide Web Consortium (W3C). W3C designed RDF to standardize the encoding of metadata, making it a cornerstone in structuring knowledge within the Semantic Web [1]. RDF facilitates the representation of knowledge, employing a data model that can be stored in diverse formats such as JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
RDF operates on the principle of decomposing knowledge into discrete, manageable units. Any form of knowledge can be expressed as a triple, denoted by Subject, Predicate, and Object. For example, the statement "Pho is a specialty in Nam Dinh" can be captured in RDF (Turtle) format as follows:

:Pho :is_a :Food ;
     :specialty_in :Nam_Dinh .
:Nam_Dinh :is_a :Location .

In this RDF representation, "Pho" is the Subject, "is_a" is the Predicate, and "Food" is the Object, indicating that "Pho" is a type of food. Similarly, "Nam Dinh" is the Subject, "is_a" is the Predicate, and "Location" is the Object, conveying that "Nam Dinh" is a type of place.
RDF's flexibility allows it to express a broad spectrum of facts, encompassing both concrete entities, such as special foods or places, and abstract concepts like the relationship between a place and its associated foods. Additionally, RDF accommodates textual values known as literal values, expanding its capacity to represent diverse forms of data. Despite its simplicity, RDF offers a structured approach that computers can interpret and utilize effectively.
At the core of RDF lies the concept of triples, denoting instances of (Subject, Predicate, Object). Each triple embodies a factual relationship in the real world, acting as the fundamental building block of RDF. Below is a tabular representation of RDF triples conveying information about "Pho" and "Nam Dinh."
Table 2-1 RDF triples
Subject Predicate Object
Pho is_a Food
Nam Dinh is_a Location
Pho is_a_specialty_in Nam Dinh
In this table, three distinct facts are articulated, each encapsulated in a single line. A distinctive feature of RDF is its ability to maintain entity consistency even when referred to in various contexts. For instance, "Nam Dinh" in fact #2 and fact #3 refers to the same entity.
Triples within RDF are highly flexible, allowing for nesting and composition of triples. For example, a nested triple could be represented as (Pho, is_a_specialty_in, (Nam Dinh, type, Location)). This inherent flexibility empowers the linkage of real-world facts, thereby creating what is known as Triples of Knowledge.
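As a small, self-contained illustration (not part of the thesis pipeline), the three facts listed in Table 2-1 can be built and serialized with the Python rdflib library; the namespace URI is a placeholder chosen only for this example:

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")   # hypothetical namespace, for illustration only
g = Graph()

# The three facts from Table 2-1 as (subject, predicate, object) triples
g.add((EX.Pho, EX.is_a, EX.Food))
g.add((EX.Nam_Dinh, EX.is_a, EX.Location))
g.add((EX.Pho, EX.is_a_specialty_in, EX.Nam_Dinh))

print(g.serialize(format="turtle"))     # prints the Turtle form of the graph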
Figure 2-1 Triple examples
Triples of Knowledge essentially manifest as labeled, directed graphs, where each node signifies a subject or an object. The edges connecting nodes represent predicates, elucidating the relationship between the subject and the object. Specifically designed to store and retrieve triples, a triple store or RDF store serves as a specialized database for RDF knowledge representation.
2.2 Knowledge Graphs
Knowledge Graphs are intricate and structured representations of knowledge that excel at capturing complex relationships between entities within a defined domain. These entities can range from objects, events, and concepts to individuals, and their interactions are often represented by edges connecting them in a graph-like structure. This allows for a richer understanding of the semantic context and interconnections within a specific knowledge domain.
The foundation of a Knowledge Graph lies in organizing and integrating vast amounts of data and information into a coherent structure. Each entity in the graph is represented as a node, and the relationships or attributes associated with these entities are depicted as edges. These edges convey the nature and semantics of the relationships, enabling a nuanced depiction of the knowledge landscape.
One of the key strengths of Knowledge Graphs is their ability to facilitate semantic search and question answering. The structured representation allows for efficient querying and retrieval of information, significantly improving search accuracy and relevance. Moreover, the structured nature of Knowledge Graphs enables reasoning and inference, which is pivotal in extracting implicit knowledge and inferring new relationships based on the existing ones.
Figure 2-2 Example of a graph (Source: video "Introduction to Neo4j and Graph Databases", 2019)
Knowledge Graphs have garnered widespread adoption, particularly by leading search engines and recommendation systems. These technologies employ Knowledge Graphs to enhance information retrieval by providing context-aware results. By leveraging the relationships and attributes within the graph, search engines can offer more precise and personalized responses to queries, leading to an improved user experience.
Typically, knowledge graphs consist of datasets from multiple sources, many of which have different structural characteristics. Context, identities, and schemas combine to give diverse data structure. The knowledge graph is framed by schemas, the setting in which the knowledge exists is determined by identities, and the context classifies the underlying nodes appropriately. These elements aid in separating words with various meanings.
Semantic enrichment is the process by which knowledge graphs powered by machine learning use natural language processing (NLP) to create an all-encompassing view of nodes, edges, and labels. This procedure enables knowledge graphs to recognize distinct objects and comprehend the connections among various objects when data is ingested. After that, other datasets that are pertinent and comparable to this working knowledge are compared and integrated with it. When a knowledge graph is finished, it makes it possible for search and question-answering systems to find and utilize thorough responses to specific queries.
Recall from the previous section that the atomic data entity in the Resource Description Framework (RDF) data model is called a semantic triple, or simply a triple. Triples are expressed as follows: Subject, Predicate, Object. Ontology concepts are subjects and objects, and the relationship between a subject and an object is called the predicate. A predicate can be represented by a word, phrase, or sentence. After that, this data model is kept in a knowledge base for processing and access.
2.3 Bidirectional Encoder Representations from Transformers
BERT is one of the Transformer's most important achievements in the field of natural language processing. Rooted in bidirectional contextual language understanding, BERT has achieved excellent results in a variety of NLP tasks, from text classification to semantic search. BERT is intended to jointly train on both left and right context in all layers in order to pretrain deep bidirectional representations from unlabeled text. Therefore, without requiring significant task-specific architecture modifications, the pre-trained BERT model can be refined with just one extra output layer to produce state-of-the-art models for a variety of tasks, including question answering and language inference. BERT is both conceptually simple and empirically powerful [2].
2.3.1 BERT Architecture
The BERT model architecture is a multi-layer bidirectional Transformer encoder, built on the original Transformer implementation described by Vaswani et al. and released in the tensor2tensor library [2].
Figure 2-3 BERT base and BERT large models (Source: web "BERT base vs BERT Large", 2019)
In Figure 2-3, as the number of layers in BERT large increases, so do the number of parameters (weights) and the number of attention heads. BERT base has a total of 12 attention heads (each head lets every token in the input focus on other tokens) and 110 million parameters, whereas BERT large has 16 attention heads and 340 million parameters. BERT base has a hidden size of 768, whereas BERT large has a hidden size of 1024.
2.3.2 Multi-head Attention
BERT often uses several encoder layers, called "Transformer layers" or "Transformer blocks." This number of layers is usually a hyperparameter of the model and can vary depending on the specific version of BERT [2]. Each encoder layer uses multi-head attention to simultaneously process information from many different "attention heads". The number of attention heads is usually a hyperparameter. Each attention head produces a different representation of the input sequence, and then the results from all attention heads are combined.
The multi-head attention mechanism can process the whole paragraph or multiple sentences together to obtain their interrelationship. In the Transformer, the positional encoder, the multi-head attention mechanism, and the feedforward layer comprise a complete unit. The positional encoder marks the word position in each sentence, which effectively ensures that the position of each word can be fully considered in sentence analysis [7].
Figure 2-4 Architecture of Transformer (Source: web "The Transformer Model")
Each head of multi-head attention focuses on a specific aspect of the input information. Thanks to this diversity, BERT is capable of learning complex relationships and context characteristics.
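To make the mechanism more concrete, the following minimal PyTorch sketch (illustrative only, not the thesis's training code) shows the scaled dot-product attention that each head computes; the shapes mirror BERT base with 12 heads of size 64:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)   # attention distribution over the tokens
    return torch.matmul(weights, v)       # weighted sum of the value vectors

# Toy example: batch of 1, 12 heads (as in BERT base), 8 tokens, head size 64
q = k = v = torch.randn(1, 12, 8, 64)
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([1, 12, 8, 64])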
2.3.3 Input Representation
Regarding the input representation for BERT: the input can represent a single text sentence or a pair of text sentences (e.g., [Question, Answer]) packed into one sequence of tokens. Given a particular input sequence, the input representation is built by summing the token embeddings with the segment embeddings and the positional embeddings corresponding to the positions of the words in the sequence.
For ease of visualization, the input representation is shown in Figure 2-5:
Figure 2-5 BERT input representation (Source: “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding ”, 2019)
• WordPiece embeddings with a vocabulary of 30,000 tokens, using ## to mark sub-word pieces. For example, the word playing is split into play and ##ing.
• Positional embeddings with a maximum sequence length of 512 tokens.
• The first token of every sequence is by default a special token whose value is [CLS]. The output of the Transformer (final hidden state) corresponding to this token is used to represent the whole sentence in classification tasks. For non-classification tasks, this vector is ignored.
• In cases where pairs of sentences are grouped together into a single sequence, the sentences are distinguished in two ways. First, they are separated by a special token [SEP]. Second, one segment embedding is added for sentence A and another segment embedding for sentence B.
• When there is only one single sentence, the segment embedding is only for sentence A.
BERT does not use regular word-level tokens; instead it uses WordPiece tokens. This method divides words into sub-word units. Sub-word units can be whole words or word fragments. These tokens are then used as input to the BERT model.
When separating words, the algorithm will try to find the longest matching piece in the vocabulary. If the whole word is not found, it breaks the word into smaller parts. This process continues until all parts are in the vocabulary or only single characters remain.
Example with the word 'banh':
• The candidate pieces of the input word are {'banh', 'ban', 'ba', 'nh', 'b', 'a', 'n', 'h'}; suppose the vocabulary contains all of them except 'banh' itself.
• Starting with the word 'banh', we check to see if it is in the vocabulary. In this case, 'banh' is not in the vocabulary.
• So we split off the longest matching prefix, 'ban', leaving 'h'. Both 'ban' and 'h' are now found in the vocabulary. At this point, the word separation is complete, but we add the characters ## at the beginning of 'h' (## means that this piece does not have an independent meaning but is a part of the word 'banh').
=> 'banh' after the tokenizer: ban, ##h
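The same greedy longest-match behaviour can be observed with a real WordPiece tokenizer from the Hugging Face transformers library; the checkpoint below is just a public example, and the exact splits depend on that checkpoint's vocabulary:

from transformers import AutoTokenizer

# bert-base-uncased is an illustrative public checkpoint, not the model used in this thesis
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("banh"))      # out-of-vocabulary words are split into sub-word pieces
print(tokenizer.tokenize("playing"))   # words already in the vocabulary stay whole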
Segment embeddings are a component used in the BERT model to distinguish between different segments or sentences in a sequence of text. They're essential when the model needs to process multiple sentences or segments of text simultaneously, such as in tasks like natural language inference, question answering, or text classification involving multiple sentences.
In BERT, segment embeddings are employed to handle sentence-level information within a single input sequence. Consider the scenario where BERT is fed two sentences for a classification task; it needs a mechanism to differentiate between the sentences.
• Multiple segments or sentences are combined into a single input sequence. A special token, [SEP], is used to separate these segments within the input sequence. Example: "My dog is cute" and "he likes playing".
• After the first sentence (A) ends, segment embeddings add [SEP] to the end of the sentence and continue connecting that segment to the next sentence. This new [SEP] is treated as part of sentence A; the remaining tokens belong to sentence B.
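A short sketch with a public English checkpoint (chosen only for illustration) shows the [CLS]/[SEP] layout and the segment ids, called token_type_ids in the transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

# Two segments packed into one input sequence: [CLS] A [SEP] B [SEP]
enc = tokenizer("My dog is cute", "He likes playing")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])   # 0 for sentence A (including [CLS] and the first [SEP]), 1 for sentence B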
In Positional Encoding of the paper [3], the author explains why they need to encode the position of each token. The need to encode the position of each token arises from the absence of recurrence and convolution within the model. Without these mechanisms, it becomes crucial to incorporate information regarding the relative or absolute position of tokens in the sequence for the model to effectively utilize the sequence's order.
Figure 2-6 Encoder function
In Positional Encoding, the authors of [3] use sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

• pos: the position of the token in the input sequence
• i: the position along the dimension of the token embeddings (d_model is the embedding size)
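A minimal NumPy sketch of these sinusoidal positional encodings (written for illustration, not taken from the thesis implementation) is shown below:

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # token positions 0 .. max_len-1
    i = np.arange(d_model)[None, :]                # embedding dimensions 0 .. d_model-1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

print(sinusoidal_positional_encoding(max_len=512, d_model=768).shape)   # (512, 768)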
Tensors are the data structure used by machine learning systems, and getting to know them is an essential skill you should build early on. Tensors are mathematical objects that describe linear relationships between sets of multidimensional data. They are a generalization of scalars, vectors, and matrices, which are all types of tensors.
Figure 2-7 Tensor dimensions
In BERT, the tensor output by the encoder has shape (batch_size, sequence_length, hidden_size):
• batch_size: the number of input sequences in the batch
• sequence_length = max_length
• hidden_size: the size of the hidden layers in the BERT model
In BERT, token classification (also known as sequence labeling) involves predicting a label for each token in the input sequence. The shape of the logits produced by token classification in BERT is (batch_size, sequence_length, num_labels):
• batch_size: the number of input sequences in the batch
• sequence_length = max_length
• num_labels: the number of possible labels for each token present in the model
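These shapes can be checked directly with the transformers library; the checkpoint and the label count below are placeholders for illustration, not the fine-tuned model described later in this thesis:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# num_labels=9 is an arbitrary example (e.g. a CoNLL-style tag set); the head is randomly initialized here
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

batch = tokenizer(["Pho is a specialty in Nam Dinh"], padding="max_length",
                  max_length=32, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

print(logits.shape)   # torch.Size([1, 32, 9]) = (batch_size, sequence_length, num_labels)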
2.3.4 Pre-training and Fine-Tuning
The BERT authors used two datasets for the pre-training process: BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia, they only extracted text passages and ignored lists, tables, and headings. It is important to use a paragraph-level corpus rather than a jumbled collection of sentences.
To create an input sequence for the training process, they sample two consecutive spans from the corpus, which they call sentences, although they are often much longer than regular simple sentences. They sample such that, after combining, the combined sample contains at most 512 tokens. Masks for the MLM (masked language modeling) objective are applied after WordPiece tokenization at a uniform rate of 15%.
For sentence classification tasks, BERT's fine-tuning is very simple. To get a representation of an input sequence with a fixed number of dimensions, they just need to take the hidden state in the last layer (the output of the final Transformer layer) for the first token (the special token [CLS] placed at the start of the sequence).
All parameters of BERT, together with the classification weight matrix W, are fine-tuned to optimize the error function.
Padding in BERT refers to the process of adding special tokens to input sequences to make them of equal length. BERT requires fixed-length input sequences, so if the original sequences are shorter than the maximum sequence length, padding tokens are added at the end to match the maximum length. These padding tokens do not carry any meaningful information and are typically represented by a special token like [PAD]. Padding ensures that all input sequences have the same length, allowing efficient batch processing during training and inference.
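A brief illustration of padding with a Hugging Face tokenizer (the checkpoint is chosen only for the example) shows the [PAD] tokens and the accompanying attention mask that tells the model which positions to ignore:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # illustrative checkpoint

enc = tokenizer(["Pho is a specialty in Nam Dinh", "Banh mi"],
                padding="max_length", max_length=12, truncation=True)

for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(tokenizer.convert_ids_to_tokens(ids))   # the shorter input is filled with [PAD] tokens
    print(mask)                                   # 1 for real tokens, 0 for padding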
Figure 2-8 Overall pre-training and fine-tuning procedures for BERT (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019)
In Figure 2-8, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g., separating questions/answers).
2.4 Softmax, Argmax and Loss functions
2.4.1 Softmax
Softmax, a common normalizing function for producing a probability distribution from neural network logits, is defined as follows:

softmax(x_i) = e^(x_i) / Σ_(j=1..n) e^(x_j)

• e: Euler's number (approximately 2.718)
• x_i: the i-th value of the input vector
• Σ_(j=1..n) e^(x_j): the sum of the exponentials of all values in the input vector

A vector of real values can be converted into a vector of probabilities using the softmax function. The resulting probabilities are all within the range of 0 to 1, and their total is 1.
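A minimal NumPy sketch of the softmax function (illustrative only; the max-subtraction is a standard trick for numerical stability) is given below:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)         # approximately [0.659 0.242 0.099], each value in (0, 1)
print(probs.sum())   # 1.0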
2.4.2 Argmax
Argmax is a function commonly applied to neural network outputs (for example, to the probability distribution produced by softmax) to ascertain the location, i.e. the index, of the largest value inside that collection. The Argmax function is written as follows:

argmax_k P(k)

• k: ranges over the classification class labels

Table 2-3 Example Argmax
0 1 2 3

Indexing starts from 0 in Python, so index 3 refers to the fourth element in the array.
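Continuing the small sketch above, argmax simply picks the index of the largest value:

import numpy as np

probs = np.array([0.659, 0.242, 0.099])   # softmax output from the previous sketch
print(np.argmax(probs))                   # 0, the index of the largest probability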
2.4.3 Cross-entropy loss
The concept of cross-entropy, which measures the difference between two probability distributions for a given random variable or set of occurrences, is an extension of entropy from information theory. Throughout training, model weight adjustments are made using the cross-entropy loss. The goal is to reduce the loss; a better model has a smaller loss. The cross-entropy loss of a perfect model is 0. Multi-class and multi-label classification are usually supported.

CE = -Σ_i t_i · log(f(s)_i)

• t_i: the ground-truth value for class i
• s_i: the model (CNN) score for each class i in C

As an activation function (softmax) is usually applied to the scores before the CE loss computation, f(s)_i refers to the resulting activations.
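A small NumPy illustration of the cross-entropy loss with a one-hot target, reusing the numbers from the softmax sketch, shows that only the true class contributes to the sum:

import numpy as np

def cross_entropy(probs, target_index):
    # CE = -sum_i t_i * log(f(s)_i); with a one-hot target only the true class remains
    return -np.log(probs[target_index])

probs = np.array([0.659, 0.242, 0.099])
print(cross_entropy(probs, target_index=0))   # small loss: the true class already has high probability
print(cross_entropy(probs, target_index=2))   # large loss: the true class has low probability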
2.5 Named Entity Recognition
Token classification involves assigning labels to tokens within a text for tasks such as Named Entity Recognition (NER). NER models aim to identify distinct entities such as dates, persons, and locations in a text, rather than merely categorizing words as verbs, nouns, punctuation, and so on.
Named Entity Recognition (NER) is a foundational task in natural language processing (NLP) that involves the automatic identification and categorization of named entities within a given text [8]. Named entities are specific pieces of information, such as names of persons, organizations, locations, dates, numerical expressions, and more, which carry significant meaning and context in the text.
In our system, the NER model is applied to identify and classify named entities such as dish names, ingredient names, locations, and other relevant entities within the collected textual data. This process enriches the dataset by tagging entities with their respective categories.
In the discipline of Natural Language Processing (NLP), named entity recognition is one of the most crucial data processing tasks. It seeks to find and classify important data, or entities, in text data. These 'entities' can be any word or any group of words (often proper nouns) that consistently refer to the same object. For instance, an entity detection system might identify the phrase "News Catcher" in a text and categorize it as an "Organization".
Figure 2-9 NER example
All entity recognition algorithms basically follow these two steps:
• Identifying text entities
• Categorizing the entities into specified classes
In the first stage, the NER locates the token or group of tokens that make up an entity. Finding the beginning and ending indices of entities is frequently done using inside-outside-beginning (IOB) chunking. The assignment of entity categories is done in the second stage. Here are some of the most typical entity classes, though these categories can vary based on the use case:
• String patterns like email addresses, phone numbers, or IP addresses
Although some techniques for named entity recognition use rules, the majority of contemporary systems use a machine learning/deep learning paradigm. Text data has a great deal of inherent ambiguity because it was created by humans. For instance, the word "Sydney" can be used to describe both a place and a person.
Figure 2-10 NER tagging example
There is no sure-fire way of dealing with such ambiguities, but as a general rule of thumb, the more relevant the training data is to the task, the better the model performs. In this process we train a custom NER model that extracts specialties and locations in Viet Nam from the collected text.
Any application where a deep comprehension of a lot of text is required can benefit from NER. A strong NER helps the computer to swiftly classify documents according to their relevance and comprehend the subject or theme of a text at a glance.
We apply NER for information extraction. Using NER to extract and highlight important entities from user text is an important step in identifying the fundamental elements in the text. This lets us process input from the user, extract information from the text, and store the output of NER ready for the next extraction step.
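As a hedged illustration of this extraction step, the transformers pipeline API can run a token-classification model over a user query; the public English checkpoint named below is only an example and is not the custom model trained in this thesis:

from transformers import pipeline

# dslim/bert-base-NER is a public English NER checkpoint used here purely for illustration
ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Banh mi is a famous specialty of Hoi An in Quang Nam province"):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))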
Furthermore, NER can be enhanced through machine learning and deep learning approaches, enabling models to continuously improve their accuracy and precision in identifying named entities. Advanced techniques such as named entity linking (NEL) can be employed to link the recognized entities to existing knowledge bases, enhancing the depth and accuracy of the knowledge graph.
Figure 2-11 Illustration of BERT for NER (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019)
To tackle the aforementioned issue, we utilize the integrated BERT-NER model (https://github.com/kamalkraj/BERT-NER).
A new dataset is generated in the CoNLL 2003 format, starting with sentences manually chosen from Wikipedia. Utilizing GPT-3.5, these sentences are transformed into the CoNLL 2003 format, and additional labels are manually assigned to enhance the dataset.
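For orientation, a simplified two-column view of CoNLL-style training data places one token and its tag per line, with blank lines between sentences (the full CoNLL 2003 files also carry part-of-speech and chunk columns). An illustrative excerpt with hypothetical labels for this domain might look like:

Pho B-FOOD
is O
a O
specialty O
in O
Nam B-LOC
Dinh I-LOC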
2.6 Neo4j
Neo4j is a cutting-edge graph database management system that makes it easy to store, manage, and query intricate and highly connected data. It belongs to the class of NoSQL databases, which are designed to address the particular difficulties presented by data that has complex relationships.
Neo4j, which Emil Eifrem founded in 2000, has been essential to the development of graph database technology. Neo4j, Inc., the company that created Neo4j, is committed to expanding the potential of graph databases and encouraging their broad use in a variety of sectors.
Figure 2-12 Neo4j ecosystem (Source: web "Welcome to Neo4j")
Neo4j is fundamentally based on the mathematical framework of graph theory, which describes the relationships between entities. Neo4j models data as nodes and relationships, in contrast to traditional relational databases that rely on tables and predefined schemas. This enables a more natural and intuitive representation of real-world connections. It is a good option for storing semantic triples because of its robust and adaptable data model. When dealing with situations where relationships are just as important as, or even more crucial than, individual data points, this structure is especially helpful.
One of Neo4j's most notable features is its query language, Cypher, which was created especially for use with graph databases. Developers and data analysts can easily retrieve and manipulate graph data thanks to Cypher's expressive and powerful features. Its human-readable syntax closely resembles the natural intuitiveness of graph databases.
The Cypher Query Language, included with Neo4j, offers a logical and visually appealing method of matching patterns in node and relationship graphs. For instance, Neo4j stores the triple ("John", "lives in", "Paris") as shown in Figure 2-13:
Figure 2-13 Example triple store in neo4j
The example in Figure 2-13 can be expressed using the following Cypher pattern:

(:Person {name: 'John'})-[:LIVES_IN]->(:City {name: 'Paris'})

(:Person {name: 'John'}) is a node in this pattern. The node's type or label, "Person", is indicated by the word that follows the colon in that node. The node's properties are listed next to that, and its name is "John". Comparably, another node with the type "City" and the name "Paris" is (:City {name: 'Paris'}). The relationship, with or without direction, is located in between the nodes. There is a directed relationship of the type "LIVES_IN" in this specific instance. It's clear that Cypher's syntax provides a logical and visual interface for interacting with the graph's nodes and relationships.
And here is how to create such a triple with the following Cypher statements:

CREATE (:Person {name: 'John', age: 30})
CREATE (:City {name: 'Paris', country: 'France'})

This Cypher query creates two nodes, one for the person (John) and another for the city (Paris). The labels Person and City are used to categorize the nodes, and properties like name, age, and country store additional information.
MATCH (person:Person {name: 'John'}), (city:City {name: 'Paris'})
CREATE (person)-[:LIVES_IN]->(city)

This Cypher query establishes a relationship between the person and the city. The relationship type is LIVES_IN, and the arrow direction indicates that John lives in Paris.
Cypher offers additional commands to communicate with the Neo4j database in addition to CREATE. Some of the other commands are illustrated and explained in the text that follows:

MATCH (person:Person {name: 'John'})-[:LIVES_IN]->(city)
RETURN person, city
MATCH Clause:
• MATCH: This is a keyword in Cypher used to specify patterns to match in the graph.
• (person:Person {name: 'John'}): This part defines a pattern where we are looking for a node with the label Person and a property name with the value 'John'. The person is an alias assigned to this node for reference in the rest of the query.
• -[:LIVES_IN]->: This part of the pattern represents a directed relationship of type LIVES_IN going from the person node to the city node. The arrow -> indicates the direction of the relationship.
• (city): This specifies that we are looking for a node, and we don't specify a label or properties for the city node in this pattern. The query will find any node that is connected to the person node by a LIVES_IN relationship.
RETURN Clause:
• RETURN person, city: This part of the query specifies what information to return. Here, we want to return the person and city nodes that match the pattern described in the MATCH clause.
MERGE (person:Person {name: 'John'})
MERGE (city:City {name: 'Paris'})
MERGE (person)-[:LIVES_IN]->(city)
RETURN person, city

MERGE Clause:
• MERGE (person:Person {name: 'John'}): This line tries to find a node labeled as Person with the property name equal to 'John'. If such a node exists, it will be matched; otherwise, a new node with the specified properties will be created.
• MERGE (person)-[:LIVES_IN]->(city): This line establishes a LIVES_IN relationship between the person and city nodes. If the relationship already exists, it will be matched; otherwise, a new relationship will be created.
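For completeness, the same MERGE pattern can be issued from Python through the official Neo4j driver (recent 5.x versions); the connection URI, credentials, node label, and relationship type below are placeholders rather than the actual schema used in this thesis:

from neo4j import GraphDatabase

# Placeholder connection details; adjust to the local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_triple(tx, subject, predicate, obj):
    # MERGE keeps nodes and relationships unique, mirroring the Cypher shown above;
    # the predicate is stored as a property because Cypher cannot parameterize relationship types
    tx.run(
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        "MERGE (s)-[:RELATED {type: $predicate}]->(o)",
        subject=subject, object=obj, predicate=predicate,
    )

with driver.session() as session:
    session.execute_write(add_triple, "Pho", "is_a_specialty_in", "Nam Dinh")
driver.close()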
2.7 Fact checking
The overwhelming amount of data generated online has made fact-checking, a critical process for confirming the veracity and accuracy of information, more difficult. The amount of information available is so great that the conventional method of having human experts verify the facts finds it difficult to keep up. Nevertheless, a viable answer to this problem is offered by recently developed computational fact-checking techniques.
Computational fact checking employs advanced techniques in the field of natural language processing and information validation to evaluate the accuracy of statements or claims. There are various ways to accomplish this, including computing the shortest path between concept nodes derived from semantic closeness metrics, and evaluating the relevance between the claim sentence and the pertinent retrieved documents, from which the appropriate degree of similarity is ascertained. Thus, it is feasible to extract pertinent documents for low-resource languages, and this is what we aim to do. To do this, we use reference materials like Wikipedia and other online sources as links to our documents.
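As a rough sketch of this similarity-based step (using a public English sentence-embedding model purely for illustration; for the Vietnamese text handled in this thesis, PhoBERT-based embeddings would take its place), retrieved reference sentences can be ranked against a claim by cosine similarity:

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is a small public English model, used here only to illustrate the idea
model = SentenceTransformer("all-MiniLM-L6-v2")

claim = "Pho is a specialty of Nam Dinh"
evidence = [
    "Pho is widely considered a specialty originating from Nam Dinh and Hanoi.",
    "Banh mi is a popular street food in Ho Chi Minh City.",
]

claim_vec = model.encode(claim, convert_to_tensor=True)
evidence_vecs = model.encode(evidence, convert_to_tensor=True)
scores = util.cos_sim(claim_vec, evidence_vecs)[0]

# Rank the retrieved sentences from most to least similar, as the fact-checking step does
for sentence, score in sorted(zip(evidence, scores.tolist()), key=lambda x: -x[1]):
    print(round(score, 3), sentence)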