VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS
LE SY THANH — 19522230 CHE NGUYEN MINH TUNG - 19522490
CONSTRUCTING A KNOWLEDGE GRAPH WITH FACT CHECKING ABOUT VIETNAMESE
CUISINE
BACHELOR OF ENGINEERING IN INFORMATION SYSTEMS
THESIS ADVISOR: ASSOCIATE PROFESSOR DR. DO PHUC
HO CHI MINH CITY, 2023
ASSESSMENT COMMITTEE

The Assessment Committee is established under the Decision of the Rector of the University of Information Technology.

1. Assoc. Prof. Dr. Nguyễn Đình Thuận - Chairman
2. Dr. Cao Thi Nhan - Secretary
3. Dr. Lê Kim Hùng - Member
I extend my deepest gratitude to my mentor, Assoc. Prof. Dr. Đỗ Phúc, for his invaluable guidance and steadfast support throughout the completion of my project. His consistent backing has been a beacon of clarity, and I am truly appreciative. Special acknowledgment goes to M.S. Nguyễn Thị Kim Phụng, whose guidance and camaraderie have enriched our collaboration on this project.

In the second phase of this academic journey, my interactions with Dr. Cao Thi Nhan as my consultant have been particularly enriching. Dr. Nhan's benevolence extended beyond the scope of academia, fostering a sense of warmth and inclusion among our classmates throughout our university years. Her guidance has not only contributed to the academic aspects of my project but has also left an indelible mark on our collective experience.

Moving on to the third expression of gratitude, I find it imperative to recognize the meticulous efforts of the entire Information Systems Department faculty. Their responsiveness to my inquiries demonstrated a commitment to academic excellence that went above and beyond, significantly enhancing my understanding of the subject matter.

The fourth acknowledgment extends beyond the academic sphere, encompassing the support network that has been pivotal to my journey. My gratitude extends to my family, whose unwavering support has been a pillar of strength. Friends and classmates have formed a tapestry of encouragement and camaraderie, providing a backdrop of positivity that has propelled me forward. Their collective support and love serve as a constant reminder of the interconnectedness that makes academic pursuits all the more meaningful.
TABLE OF CONTENTS
ABSTRACT 1
Chapter 1 INTRODUCTION 2
1.1 Background and Context 2
1.2 Statement of the Problem 3
1.3 Objectives of the Study 4
1.4 Significance of the Study 5
1.5 Motivation 6
1.6 Contribution 8
1.7 Structure of the thesis 8
Chapter 2 BACKGROUND AND RELATED WORK 10
2.1 Resource Description Framework 10
2.3.4 Pre-training and Fine-Tuning 21
2.4 Softmax, Argmax and Loss functions 22
2.8.1 The transformer architecture 34
2.8.2 Semantic textual similarity 35
Chapter 3 SYSTEM DESIGN 37
3.1 Overview 37
3.2 Software and database design 38
3.2.1 System architecture 38
3.2.2 Knowledge graph construction
3.3 Algorithm processing flow
3.3.1 Integrate BERT-NER
3.3.2 System response processing
3.3.3 Fact-checking through semantic similarity
Chapter 4 SYSTEM IMPLEMENTATION
4.1 Overview
4.2 User query processing
4.3 Neo4j query processing
4.4 Answer results processing
4.5 Experiment and Discussion
Chapter 5 CONCLUSION & FUTURE WORK
LIST OF FIGURES

Figure 2-1 Triple examples 12
Figure 2-2 Example of a graph (Source: video "Introduction to Neo4j and Graph Databases", 2019) 13
Figure 2-3 BERT base and BERT large models (Source: web "BERT base vs BERT Large", 2019) 15
Figure 2-4 Architecture of Transformer (Source: web "The Transformer Model") 16
Figure 2-5 BERT input representation (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019) 17
Figure 2-6 Encoder function 19
Figure 2-7 Tensor dimensions 20
Figure 2-8 Overall pre-training and fine-tuning procedures for BERT (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019)
Figure 3-3 Pipeline of the system 43
Figure 3-4 Fact checking pipeline 49
Figure 4-1 Overview of the system processing flow 53
Figure 4-2 Output of Predict 55
Figure 4-3 Output of final results 55
Figure 4-4 JSON object of final result for triple 56
Figure 4-5 Execution plan for a Cypher query 58
Figure 4-6 Result from the plan 60
Figure 4-7 Result from Neo4j in JSON 61
Figure 4-8 Extraction results of related sentences 63
Figure 4-9 Word segmentation result 64
Figure 4-10 Cosine similarity result 65
Figure 4-11 Sort by highest similarity from high to low 66
Figure 4-12 Training Loss over Epochs 67
Figure 4-13 Result of precision, recall and F-score 68
Figure A-1 NER result after executed query (admin UI) 74
Figure A-2 Triple result after processing the label from NER (admin UI) 74
Figure A-3 Knowledge graph result from Neo4j 75
Figure A-4 Add triple interface 75
Figure A-5 Chatbot user interface 76
Figure A-6 Chatbot response with image and relevant URL reference 77
Figure A-7 Some examples of different queries 77
Figure A-8 Example with Vietnamese query 78
Figure C-1 Structure of the Knowledge Graph 81
LIST OF TABLES

Table 2-1 RDF triples 11
Table 2-2 Example softmax 23
Table 2-3 Example Argmax 23
Table 3-1 Data organization analysis 42
Table 3-2 List of dependency tags and NER example 45
Table 3-3 Our Sample Categorization 46
LIST OF ABBREVIATIONS

NLP Natural Language Processing
BERT Bidirectional Encoder Representations from Transformers
RDF Resource Description Framework
HTTP HyperText Transfer Protocol
NER Named Entity Recognition
ABSTRACT

With today's advanced technology, the use of chatbots is growing in popularity and strength. We want to learn more about this area, conduct further research, and create our own application. Rich data gathered from the Internet was used to create a knowledge graph, which we then used to build a chatbot - a question-answering application - based on the knowledge graph. Wikipedia content and online source pages are the main places to look for authenticity and dependability. The study presented in this thesis advances applications of knowledge representation and natural language processing. The research also addresses the challenges posed by the intrinsic depth and complexity of natural language, emphasizing the adaptability and flexibility of natural language, which makes it inherently difficult for computational analysis. The thesis aims to bridge this gap by leveraging advanced NLP approaches and innovative technologies to create a chatbot that can effectively engage users and serve as a reliable source of knowledge about Vietnamese culinary traditions. The subject of our knowledge graph is Vietnamese cuisine in 63 provinces. In this thesis, we present the process of natural language processing of queries from users, as well as the conditions for creating answers from chatbots and using information reference sources to support fact checking. Information reference sources become integral in this endeavor, acting as pillars of support in ensuring the accuracy and credibility of the knowledge imparted by our chatbot.
Chapter 1
INTRODUCTION
1.1 Background and Context
The swift advancement of digital technology and artificial intelligence has brought about revolutionary shifts in the ways people interact with online platforms and obtain information. Chatbots have become immensely useful tools in this age of rapid technological development, changing the way users engage with one another and providing prompt answers to questions. Artificial intelligence-powered chatbots are highly effective at mimicking human speech and are widely used to streamline communication and offer prompt support on a variety of online platforms.
Furthermore, a new era of intelligent conversation agents has been ushered in by the creation of Knowledge Graphs and advanced Natural Language Processing (NLP) techniques, which have coincided with the emergence of chatbots. A knowledge graph is an advanced data structure that is used to precisely record the complex relationships between concepts and things in a given domain. Cutting-edge models and algorithms are used in advanced NLP approaches to process and understand human language with ease. This feature greatly enhances the quality of user interactions by enabling chatbots to comprehend and react to user inquiries with increased precision and thoroughness.
This thesis explores how these innovative technologies come together to create a cutting-edge web chatbot with a unique emphasis on Vietnamese food. Through the integration of chatbot capabilities with sophisticated NLP and Knowledge Graph insights, this research aims to open up new avenues for providing a rich and dynamic user experience. The goal is to develop a chatbot that can effectively engage people and act as a reliable source of knowledge about the complex web of Vietnamese culinary traditions by investigating the intersection of various technological frontiers.
1.2 Statement of the Problem
Because of the language's intrinsic depth and complexity, texts written in natural language are by nature difficult to process automatically. The existence of ambiguity, which allows a statement to transmit wholly different meanings depending on its context, is one of the main causes of this complexity. Natural language is remarkably adaptable, as seen by its capacity to fit in with a variety of contexts with ease. But this very flexibility makes it extremely difficult for computers to understand.
Since natural language is inherently flexible, it is not feasible to cover every possible use case with a strict set of rules. Rather, the method uses algorithms created to take the meaning out of every sentence and extract the most important information. This methodology, which navigates the complex and dynamic nature of natural language and makes it accessible for computational analysis, is essential for enabling computer "comprehension" of languages.
Once the core data has been extracted, all of the important information related to the topic we chose to research is saved. We spent time gathering information from the internet, particularly Wikipedia, because it is the most trustworthy site available. Traditional databases are intended to store structured data rather than unstructured data such as text. Because natural language tends to link things together, we decided to store the data in a graph database.
Our assignment is to provide a brief overview of the problem we hope to solve. First, we gather a significant amount of information related to our research topic, including keywords and other important data. We then efficiently organize and store this data, setting the stage for further data processing. Specifically, we tailor our storage strategy to support both English and Vietnamese, offering flexibility for a range of linguistic scenarios. Right now, we are primarily concerned with data storage optimization, specifically for the Vietnamese language.
We address this challenge because of its significant practical implications, with one notable application already implemented by Google. Google search provides website links that show the most relevance to the search query. These results contain information obtained from processed articles containing the specified keywords. This shows that Google lists the most relevant websites first. Our application is specifically designed to target and implement similar functionality. In addition, to support information verification, we research and implement fact checking based on the knowledge graph. Fact checking is an important component to ensure the accuracy and reliability of the chatbot's information; we provide source pages from the internet to support this fact checking. In practice, this means providing references from websites to increase the credibility of the chatbot's answers.
1.3 Objectives of the Study
The main goal of this research is to design and implement a web-based chatbot that uses Knowledge Graph technology to provide users with detailed information about the different types of food that are available in different Vietnamese cities and provinces. The particular goals consist of:
a) Building a Comprehensive Knowledge Graph:
• Establishing a robust Knowledge Graph that intricately captures relationships between Vietnamese dishes, ingredients, and regional nuances.
b) Developing an Intuitive User Interface:
• Crafting a user-friendly website featuring an integrated chatbot interface, ensuring seamless and accessible interaction.
c) Implementing BERT-based Named Entity Recognition (BERT-NER):
• Extracting and facilitating support for website operations through the utilization of the Named Entity Recognition model as the designated and optimized model for data extraction.
d) Taking Advantage of the PhoBERT and VnCoreNLP Pre-trained Models:
• Investigating and evaluating the PhoBERT and VnCoreNLP pre-trained models to process and present relevant information and enrich the Knowledge Graph with additional information.
e) Evaluating Chatbot Accuracy and User Experience:
• Rigorously assessing the accuracy of the chatbot's responses and gauging the overall user experience to refine and optimize its functionalities.
f) Assessing Practicality and Effectiveness:
• Conducting an extensive evaluation of the system's practicality and effectiveness, especially concerning users seeking information about Vietnamese cuisine.
By delineating these specific objectives, this research aspires to contribute meaningfully to the integration of Knowledge Graph technology in a chatbot framework, offering users a valuable and enriching experience while exploring the diverse culinary landscape of Vietnam.
1.4 Significance of the Study
This research holds several significant implications:
a) Developing Technological Frontiers:
• This work makes a substantial contribution to the advancement of knowledge graphs, chatbot creation, and natural language processing (NLP). It pushes the limits of innovation and integration in these fields by showcasing a useful application of these technologies within the culinary domain.
b) Improving Access to Culinary Knowledge:
• The research serves as a clearinghouse for gastronomic insights and is a useful tool for a wide range of people who are interested in Vietnamese food. Serving travelers, foodies, and scholars alike, it provides a thorough and immersive entry point to discover the nuances of Vietnamese culinary customs.
c) Unveiling Interaction Dynamics:
• This study illuminates the dynamic field of user-centric artificial intelligence by exploring the opportunities and difficulties of building a chatbot that uses Knowledge Graphs for personalized interactions. The knowledge acquired advances both technology and our sophisticated understanding of human-machine interaction.
d) Opening the Path for Future Projects:
• This research has the potential to act as a model for similar projects in various culinary traditions and cultural domains, even beyond its immediate application. The aforementioned framework establishes a standard for subsequent endeavors aiming to utilize analogous technologies for the conservation and propagation of cultural and gastronomic legacy.
1.5 Motivation
This research is motivated by a number of factors, primarily the need to close important gaps in the current knowledge extraction systems. Although many systems perform well in English-only settings, our goal is to create a reliable system for data extraction and storage that is compatible with both Vietnamese and English. At present, however, we only extract English sentences because of limited time; Vietnamese extraction will be addressed in future work. In a nation of over 90 million people where internet usage is expanding quickly, the lack of a system that is sensitive to language and culture becomes a critical issue. Given Vietnam's status as a developing country, our research is motivated by the system's potential to revolutionize a number of facets of daily life. Furthermore, the absence of a system specifically designed for Vietnamese presents a noteworthy obstacle, and our inspiration stems from the conviction that a knowledge graph system can be crucial in bridging these knowledge gaps. Moreover, the use of graphs in NLP is becoming more and more common, which motivates us and gives us a chance to advance this developing field. Our main objective is to provide the Vietnamese people with a knowledge graph system that not only fulfills their information needs but also highlights the diversity of Vietnamese culture, especially in terms of food customs, and acts as a useful guide for foreign visitors.
The difficulties arise when attempting to glean information from Vietnamese paragraphs. Vietnamese-written text cannot be properly processed by the current set of tools. To be more precise, no system exists that can extract triples from text in Vietnamese. It became nearly impossible for us to complete the task in the limited time we had. In the end, we decided that it would be better to use the already-existing tools, which process English text well and can also be trained to comprehend Vietnamese text. Additionally, there are far more English texts available than Vietnamese, despite the fact that Vietnamese texts are much richer and more elusive. In order to train our model to comprehend and extract text in both Vietnamese and English, we also chose to construct a pre-existing knowledge graph that included text in both languages. Vietnamese cuisine is the exclusive focus of our system domain. It is anticipated that the system will quickly and easily produce a trained model that is stable and accurate.
Furthermore, we would like to demonstrate the knowledge graph's proficiency in NLP tasks. Our goal is to create an application that uses the knowledge graph we created to answer questions. The goal of this application is to have natural language communication with humans. In a strict sense, this application ought to be able to communicate with users and respond to their inquiries. Our knowledge graph and the model we have trained serve as this application's brain.
In spite of the fact that the questions may be challenging to comprehend or may be correct or incorrect depending on the individual, we hope to be able to construct and train our model robustly enough for the system to provide the most compelling responses when posed a query. The outcomes of this thesis should hopefully be a step in that direction.
1.6 Contribution
The thesis "Constructing a Knowledge Graph with Fact-Checking about Vietnamese Cuisine" makes several contributions to the field of natural language processing (NLP) and knowledge graph construction:
• Firstly, the thesis proposes a novel approach to constructing a knowledge graph related to Vietnamese cuisine, utilizing advanced NLP techniques such as Named Entity Recognition (NER) and Bidirectional Encoder Representations from Transformers (BERT). The integration of these techniques enables the creation of a comprehensive and accurate knowledge graph that captures the relationships between Vietnamese dishes, ingredients, and regional nuances.
• Secondly, the thesis presents a sophisticated web chatbot that utilizes the knowledge graph to provide users with detailed information about Vietnamese cuisine. The chatbot's intuitive user interface and fact-checking capabilities enhance the user experience and increase the credibility of the chatbot's answers.
• Thirdly, the thesis evaluates the effectiveness and practicality of the system, providing insights into the accuracy of the chatbot's responses and the overall user experience. The evaluation demonstrates the system's ability to effectively engage users and serve as a reliable source of knowledge about Vietnamese culinary traditions.
The thesis contributes to the advancement of NLP and knowledge graph construction by demonstrating the practical application of state-of-the-art algorithms and models in the context of Vietnamese cuisine. The integration of advanced NLP techniques, knowledge graph insights, and the development of a sophisticated web chatbot paves the way for a rich and dynamic user experience, offering new avenues for exploring the intersection of technology and culinary traditions.
1.7 Structure of the thesis
The structure of our thesis is as follows. The introduction summarizes the background, goals, problems stated, importance of the research, and inspiration behind this graduate thesis.
• Chapter 2, Theoretical Framework: This section outlines the theoretical framework explored to complete the thesis.
• Chapter 3 describes our proposed system design, including the approach, sources of data to be collected, and models chosen for the system's construction. It encompasses details about system components and design options.
• Chapter 4, Operational System Implementation: In addition to providing a description of the operational system, Chapter 4 covers inputs, outputs, service details, user interface, and operational system implementation details.
• Chapter 5, Conclusions and Development Plan: The final chapter presents the thesis's conclusions and outlines the development plan intended to be carried out soon.
Chapter 2
BACKGROUND AND RELATED WORK
2.1 Resource Description Framework
The Resource Description Framework (RDF) serves as a fundamental abstract model within the realm of the Semantic Web, a concept advanced by the World Wide Web Consortium (W3C). W3C designed RDF to standardize the encoding of metadata, making it a cornerstone in structuring knowledge within the Semantic Web [1]. RDF facilitates the representation of knowledge, employing a data model that can be stored in diverse formats such as JSON (JavaScript Object Notation) or XML (Extensible Markup Language).
RDF operates on the principle of decomposing knowledge into discrete, manageable units. Any form of knowledge can be expressed as a triple, denoted by Subject, Predicate, and Object. For example, the statement "Pho is a specialty in Nam Dinh" can be captured in RDF (Turtle) format as follows:

:Pho :is_a :Food ;
     :specialty_in :Nam_Dinh .
:Nam_Dinh :is_a :Location .

In this RDF representation, "Pho" is the Subject, "is_a" is the Predicate, and "Food" is the Object, indicating that "Pho" is a type of food. Similarly, "Nam Dinh" is the Subject, "is_a" is the Predicate, and "Location" is the Object, conveying that "Nam Dinh" is a type of place.
RDF's flexibility allows it to express a broad spectrum of facts, encompassing both concrete entities, such as special foods or places, and abstract concepts like the relationship between a place and its associated foods. Additionally, RDF accommodates textual values known as literal values, expanding its capacity to represent diverse forms of data. Despite its simplicity, RDF offers a structured approach that computers can interpret and utilize effectively.
At the core of RDF lies the concept of triples, denoting instances of (Subject, Predicate, Object). Each triple embodies a factual relationship in the real world, acting as the fundamental building block of RDF. Below is a tabular representation of RDF triples conveying information about "Pho" and "Nam Dinh."
Table 2-1 RDF triples
Subject Predicate Object
Pho is_a Food
Nam Dinh is_a Location
Pho is_a_specialty_in Nam Dinh
In this table, three distinct facts are articulated, each encapsulated in a single line. A distinctive feature of RDF is its ability to maintain entity consistency even when referred to in various contexts. For instance, "Nam Dinh" in fact #2 and fact #3 refers to the same entity.
Triples within RDF are highly flexible, allowing for nesting and composition of triples. For example, a nested triple could be represented as (Pho, is_a_specialty_in, (Nam Dinh, type, Location)). This inherent flexibility empowers the linkage of real-world facts, thereby creating what is known as Triples of Knowledge.
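As a small, self-contained illustration (not part of the thesis pipeline), the three facts listed in Table 2-1 can be built and serialized with the Python rdflib library; the namespace URI is a placeholder chosen only for this example:

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")   # hypothetical namespace, for illustration only
g = Graph()

# The three facts from Table 2-1 as (subject, predicate, object) triples
g.add((EX.Pho, EX.is_a, EX.Food))
g.add((EX.Nam_Dinh, EX.is_a, EX.Location))
g.add((EX.Pho, EX.is_a_specialty_in, EX.Nam_Dinh))

print(g.serialize(format="turtle"))     # prints the Turtle form of the graph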
Figure 2-1 Triple examples
Triples of Knowledge essentially manifest as labeled, directed graphs, where each node signifies a subject or an object. The edges connecting nodes represent predicates, elucidating the relationship between the subject and the object. Specifically designed to store and retrieve triples, a triple store or RDF store serves as a specialized database for RDF knowledge representation.
2.2 Knowledge Graphs
Knowledge Graphs are intricate and structured representations of knowledge that excel at capturing complex relationships between entities within a defined domain. These entities can range from objects, events, and concepts to individuals, and their interactions are often represented by edges connecting them in a graph-like structure. This allows for a richer understanding of the semantic context and interconnections within a specific knowledge domain.
The foundation of a Knowledge Graph lies in organizing and integrating vast amounts of data and information into a coherent structure. Each entity in the graph is represented as a node, and the relationships or attributes associated with these entities are depicted as edges. These edges convey the nature and semantics of the relationships, enabling a nuanced depiction of the knowledge landscape.
One of the key strengths of Knowledge Graphs is their ability to facilitate semantic search and question answering. The structured representation allows for efficient querying and retrieval of information, significantly improving search accuracy and relevance. Moreover, the structured nature of Knowledge Graphs enables reasoning and inference, which is pivotal in extracting implicit knowledge and inferring new relationships based on the existing ones.
Figure 2-2 Example of a graph (Source: video "Introduction to Neo4j and Graph Databases", 2019)
Knowledge Graphs have garnered widespread adoption, particularly by leading search engines and recommendation systems. These technologies employ Knowledge Graphs to enhance information retrieval by providing context-aware results. By leveraging the relationships and attributes within the graph, search engines can offer more precise and personalized responses to queries, leading to an improved user experience.
Typically, knowledge graphs consist of datasets from multiple sources, many of which have different structural characteristics. Context, identities, and schemas combine to give diverse data structure. The knowledge graph is framed by schemas, the setting in which the knowledge exists is determined by identities, and the context classifies the underlying nodes appropriately. These elements aid in separating words with various meanings.
Semantic enrichment is the process by which knowledge graphs powered by machine learning use natural language processing (NLP) to create an all-encompassing view of nodes, edges, and labels. This procedure enables knowledge graphs to recognize distinct objects and comprehend the connections among various objects when data is ingested. After that, other datasets that are pertinent and comparable to this working knowledge are compared and integrated with it. When a knowledge graph is finished, it makes it possible for search and question-answering systems to find and utilize thorough responses to specific queries.
Recall from the previous section that the atomic data entity in the Resource Description Framework (RDF) data model is called a semantic triple, or simply a triple. Triples are expressed as follows: Subject, Predicate, Object. Ontology concepts are subjects and objects, and the relationship between a subject and an object is called the predicate. A predicate can be represented by a word, phrase, or sentence. After that, this data model is kept in a knowledge base for processing and access.
2.3 Bidirectional Encoder Representations from Transformers
BERT is one of the Transformer's most important achievements in the field of natural language processing. Rooted in bidirectional contextual language understanding, BERT has achieved excellent results in a variety of NLP tasks, from text classification to semantic search. BERT is intended to jointly train on both left and right context in all layers in order to pretrain deep bidirectional representations from unlabeled text. Therefore, without requiring significant task-specific architecture modifications, the pre-trained BERT model can be refined with just one extra output layer to produce state-of-the-art models for a variety of tasks, including question answering and language inference. BERT is both conceptually simple and empirically powerful [2].
2.3.1 BERT Architecture
The BERT model architecture is a multi-layer bidirectional Transformer encoder, built on the original Transformer implementation described by Vaswani et al. and released in the tensor2tensor library [2].
Figure 2-3 BERT base and BERT large models (Source: web "BERT base vs BERT Large", 2019)
In Figure 2-3, as the number of layers in BERT large increases, so do the number of parameters (weights) and the number of attention heads. BERT base has a total of 12 attention heads (each head lets every token in the input focus on other tokens) and 110 million parameters, whereas BERT large has 16 attention heads and 340 million parameters. BERT base has a hidden size of 768, whereas BERT large has a hidden size of 1024.
2.3.2 Multi-head Attention
BERT often uses several encoder layers, called "Transformer layers" or "Transformer blocks." This number of layers is usually a hyperparameter of the model and can vary depending on the specific version of BERT [2]. Each encoder layer uses multi-head attention to simultaneously process information from many different "attention heads". The number of attention heads is usually a hyperparameter. Each attention head produces a different representation of the input sequence, and then the results from all attention heads are combined.
The multi-head attention mechanism can process the whole paragraph or multiple sentences together to obtain their interrelationship. In the Transformer, the positional encoder, the multi-head attention mechanism, and the feedforward layer comprise a complete unit. The positional encoder marks the word position in each sentence, which effectively ensures that the position of each word can be fully considered in sentence analysis [7].
Figure 2-4 Architecture of Transformer (Source: web "The Transformer Model")
Each head of multi-head attention focuses on a specific aspect of the input information. Thanks to this diversity, BERT is capable of learning complex relationships and context characteristics.
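To make the mechanism more concrete, the following minimal PyTorch sketch (illustrative only, not the thesis's training code) shows the scaled dot-product attention that each head computes; the shapes mirror BERT base with 12 heads of size 64:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)   # attention distribution over the tokens
    return torch.matmul(weights, v)       # weighted sum of the value vectors

# Toy example: batch of 1, 12 heads (as in BERT base), 8 tokens, head size 64
q = k = v = torch.randn(1, 12, 8, 64)
print(scaled_dot_product_attention(q, k, v).shape)   # torch.Size([1, 12, 8, 64])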
2.3.3 Input Representation
Regarding the input representation for BERT: the input can represent a single text sentence or a pair of text sentences (e.g., [Question, Answer]) packed into one sequence of tokens. Given a particular input sequence, the input representation is built by summing the token embeddings with the segment embeddings and the positional embeddings corresponding to the positions of the words in the sequence.
For ease of visualization, the input representation is shown in Figure 2-5:
Figure 2-5 BERT input representation (Source: “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding ”, 2019)
• WordPiece embeddings with a vocabulary of 30,000 tokens, using ## to mark sub-word pieces. For example, the word playing is split into play and ##ing.
• Positional embeddings with a maximum sequence length of 512 tokens.
• The first token of every sequence is by default a special token whose value is [CLS]. The output of the Transformer (final hidden state) corresponding to this token is used to represent the whole sentence in classification tasks. For non-classification tasks, this vector is ignored.
• In cases where pairs of sentences are grouped together into a single sequence, the sentences are distinguished in two ways. First, they are separated by a special token [SEP]. Second, one segment embedding is added for sentence A and another segment embedding for sentence B.
• When there is only one single sentence, the segment embedding is only for sentence A.
BERT does not use regular word-level tokens; instead it uses WordPiece tokens. This method divides words into sub-word units. Sub-word units can be whole words or word fragments. These tokens are then used as input to the BERT model.
When separating words, the algorithm will try to find the longest matching piece in the vocabulary. If the whole word is not found, it breaks the word into smaller parts. This process continues until all parts are in the vocabulary or only single characters remain.
Example with the word 'banh':
• The candidate pieces of the input word are {'banh', 'ban', 'ba', 'nh', 'b', 'a', 'n', 'h'}; suppose the vocabulary contains all of them except 'banh' itself.
• Starting with the word 'banh', we check to see if it is in the vocabulary. In this case, 'banh' is not in the vocabulary.
• So we split off the longest matching prefix, 'ban', leaving 'h'. Both 'ban' and 'h' are now found in the vocabulary. At this point, the word separation is complete, but we add the characters ## at the beginning of 'h' (## means that this piece does not have an independent meaning but is a part of the word 'banh').
=> 'banh' after the tokenizer: ban, ##h
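The same greedy longest-match behaviour can be observed with a real WordPiece tokenizer from the Hugging Face transformers library; the checkpoint below is just a public example, and the exact splits depend on that checkpoint's vocabulary:

from transformers import AutoTokenizer

# bert-base-uncased is an illustrative public checkpoint, not the model used in this thesis
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("banh"))      # out-of-vocabulary words are split into sub-word pieces
print(tokenizer.tokenize("playing"))   # words already in the vocabulary stay whole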
Segment embeddings are a component used in the BERT model to distinguish between different segments or sentences in a sequence of text. They're essential when the model needs to process multiple sentences or segments of text simultaneously, such as in tasks like natural language inference, question answering, or text classification involving multiple sentences.
In BERT, segment embeddings are employed to handle sentence-level information within a single input sequence. Consider the scenario where BERT is fed two sentences for a classification task; it needs a mechanism to differentiate between the sentences.
• Multiple segments or sentences are combined into a single input sequence. A special token, [SEP], is used to separate these segments within the input sequence. Example: "My dog is cute" and "he likes playing".
• After the first sentence (A) ends, segment embeddings add [SEP] to the end of the sentence and continue connecting that segment to the next sentence. This new [SEP] is treated as part of sentence A; the remaining tokens belong to sentence B.
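A short sketch with a public English checkpoint (chosen only for illustration) shows the [CLS]/[SEP] layout and the segment ids, called token_type_ids in the transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

# Two segments packed into one input sequence: [CLS] A [SEP] B [SEP]
enc = tokenizer("My dog is cute", "He likes playing")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])   # 0 for sentence A (including [CLS] and the first [SEP]), 1 for sentence B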
In Positional Encoding of the paper [3], the author explains why they need to encode the position of each token. The need to encode the position of each token arises from the absence of recurrence and convolution within the model. Without these mechanisms, it becomes crucial to incorporate information regarding the relative or absolute position of tokens in the sequence for the model to effectively utilize the sequence's order.
Figure 2-6 Encoder function
In Positional Encoding, the authors of [3] use sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

• pos: the position of the token in the input sequence
• i: the position along the dimension of the token embeddings (d_model is the embedding size)
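A minimal NumPy sketch of these sinusoidal positional encodings (written for illustration, not taken from the thesis implementation) is shown below:

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # token positions 0 .. max_len-1
    i = np.arange(d_model)[None, :]                # embedding dimensions 0 .. d_model-1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

print(sinusoidal_positional_encoding(max_len=512, d_model=768).shape)   # (512, 768)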
Tensors are the data structure used by machine learning systems, and getting to know them is an essential skill you should build early on. Tensors are mathematical objects that describe linear relationships between sets of multidimensional data. They are a generalization of scalars, vectors, and matrices, which are all types of tensors.
Figure 2-7 Tensor dimensions
In BERT, the tensor output by the encoder has shape (batch_size, sequence_length, hidden_size):
• batch_size: the number of input sequences in the batch
• sequence_length = max_length
• hidden_size: the size of the hidden layers in the BERT model
In BERT, token classification (also known as sequence labeling) involves predicting a label for each token in the input sequence. The shape of the logits produced by token classification in BERT is (batch_size, sequence_length, num_labels):
• batch_size: the number of input sequences in the batch
• sequence_length = max_length
• num_labels: the number of possible labels for each token present in the model
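These shapes can be checked directly with the transformers library; the checkpoint and the label count below are placeholders for illustration, not the fine-tuned model described later in this thesis:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# num_labels=9 is an arbitrary example (e.g. a CoNLL-style tag set); the head is randomly initialized here
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

batch = tokenizer(["Pho is a specialty in Nam Dinh"], padding="max_length",
                  max_length=32, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

print(logits.shape)   # torch.Size([1, 32, 9]) = (batch_size, sequence_length, num_labels)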
2.3.4 Pre-training and Fine-Tuning
The BERT authors used two datasets for the pre-training process: BooksCorpus (800M words) and English Wikipedia (2,500M words). For Wikipedia, they only extracted text passages and ignored lists, tables, and headings. It is important to use a paragraph-level corpus rather than a jumbled collection of sentences.
To create an input sequence for the training process, they sample two consecutive spans from the corpus, which they call sentences, although they are often much longer than regular simple sentences. They sample such that, after combining, the combined sample contains at most 512 tokens. Masks for the MLM (masked language modeling) objective are applied after WordPiece tokenization at a uniform rate of 15%.
For sentence classification tasks, BERT's fine-tuning is very simple. To get a representation of an input sequence with a fixed number of dimensions, they just need to take the hidden state in the last layer (the output of the final Transformer layer) for the first token (the special token [CLS] placed at the start of the sequence).
All parameters of BERT, together with the classification weight matrix W, are fine-tuned to optimize the error function.
Padding in BERT refers to the process of adding special tokens to input sequences to make them of equal length. BERT requires fixed-length input sequences, so if the original sequences are shorter than the maximum sequence length, padding tokens are added at the end to match the maximum length. These padding tokens do not carry any meaningful information and are typically represented by a special token like [PAD]. Padding ensures that all input sequences have the same length, allowing efficient batch processing during training and inference.
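A brief illustration of padding with a Hugging Face tokenizer (the checkpoint is chosen only for the example) shows the [PAD] tokens and the accompanying attention mask that tells the model which positions to ignore:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # illustrative checkpoint

enc = tokenizer(["Pho is a specialty in Nam Dinh", "Banh mi"],
                padding="max_length", max_length=12, truncation=True)

for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    print(tokenizer.convert_ids_to_tokens(ids))   # the shorter input is filled with [PAD] tokens
    print(mask)                                   # 1 for real tokens, 0 for padding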
Figure 2-8 Overall pre-training and fine-tuning procedures for BERT (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019)
In Figure 2-8, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g., separating questions/answers).
2.4 Softmax, Argmax and Loss functions
2.4.1 Softmax
Softmax, a common normalizing function for producing a probability distribution from neural network logits, is defined as follows:

softmax(x_i) = e^(x_i) / Σ_(j=1..n) e^(x_j)

• e: Euler's number (approximately 2.718)
• x_i: the i-th value of the input vector
• Σ_(j=1..n) e^(x_j): the sum of the exponentials of all values in the input vector

A vector of real values can be converted into a vector of probabilities using the softmax function. The resulting probabilities are all within the range of 0 to 1, and their total is 1.
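A minimal NumPy sketch of the softmax function (illustrative only; the max-subtraction is a standard trick for numerical stability) is given below:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)         # approximately [0.659 0.242 0.099], each value in (0, 1)
print(probs.sum())   # 1.0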
2.4.2 Argmax
Argmax is a function commonly applied to neural network outputs (for example, to the probability distribution produced by softmax) to ascertain the location, i.e. the index, of the largest value inside that collection. The Argmax function is written as follows:

argmax_k P(k)

• k: ranges over the classification class labels

Table 2-3 Example Argmax
0 1 2 3

Indexing starts from 0 in Python, so index 3 refers to the fourth element in the array.
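Continuing the small sketch above, argmax simply picks the index of the largest value:

import numpy as np

probs = np.array([0.659, 0.242, 0.099])   # softmax output from the previous sketch
print(np.argmax(probs))                   # 0, the index of the largest probability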
2.4.3 Cross-entropy loss
The concept of cross-entropy, which measures the difference between two probability distributions for a given random variable or set of occurrences, is an extension of entropy from information theory. Throughout training, model weight adjustments are made using the cross-entropy loss. The goal is to reduce the loss; a better model has a smaller loss. The cross-entropy loss of a perfect model is 0. Multi-class and multi-label classification are usually supported.

CE = -Σ_i t_i · log(f(s)_i)

• t_i: the ground-truth value for class i
• s_i: the model (CNN) score for each class i in C

As an activation function (softmax) is usually applied to the scores before the CE loss computation, f(s)_i refers to the resulting activations.
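A small NumPy illustration of the cross-entropy loss with a one-hot target, reusing the numbers from the softmax sketch, shows that only the true class contributes to the sum:

import numpy as np

def cross_entropy(probs, target_index):
    # CE = -sum_i t_i * log(f(s)_i); with a one-hot target only the true class remains
    return -np.log(probs[target_index])

probs = np.array([0.659, 0.242, 0.099])
print(cross_entropy(probs, target_index=0))   # small loss: the true class already has high probability
print(cross_entropy(probs, target_index=2))   # large loss: the true class has low probability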
2.5 Named Entity Recognition
Token classification involves assigning labels to tokens within a text for tasks such as Named Entity Recognition (NER). NER models aim to identify distinct entities such as dates, persons, and locations in a text, rather than merely categorizing words as verbs, nouns, punctuation, and so on.
Named Entity Recognition (NER) is a foundational task in natural language processing (NLP) that involves the automatic identification and categorization of named entities within a given text [8]. Named entities are specific pieces of information, such as names of persons, organizations, locations, dates, numerical expressions, and more, which carry significant meaning and context in the text.
In our system, the NER model is applied to identify and classify named entities such as dish names, ingredient names, locations, and other relevant entities within the collected textual data. This process enriches the dataset by tagging entities with their respective categories.
In the discipline of Natural Language Processing (NLP), named entity recognition is one of the most crucial data processing tasks. It seeks to find and classify important data, or entities, in text data. These 'entities' can be any word or any group of words (often proper nouns) that consistently refer to the same object. For instance, an entity detection system might identify the phrase "News Catcher" in a text and categorize it as an "Organization".
Figure 2-9 NER example
All entity recognition algorithms basically follow these two steps:
• Identifying text entities
• Categorizing the entities into specified classes
In the first stage, the NER locates the token or group of tokens that make up an entity. Finding the beginning and ending indices of entities is frequently done using inside-outside-beginning (IOB) chunking. The assignment of entity categories is done in the second stage. Here are some of the most typical entity classes, though these categories can vary based on the use case:
• String patterns like email addresses, phone numbers, or IP addresses
Although some techniques for named entity recognition use rules, the majority of contemporary systems use a machine learning/deep learning paradigm. Text data has a great deal of inherent ambiguity because it was created by humans. For instance, the word "Sydney" can be used to describe both a place and a person.
Figure 2-10 NER tagging example
There is no sure-fire way of dealing with such ambiguities, but as a general rule of thumb, the more relevant the training data is to the task, the better the model performs. In this process we train a custom NER model that extracts specialties and locations in Viet Nam from the collected text.
Any application where a deep comprehension of a lot of text is required can benefit from NER. A strong NER helps the computer to swiftly classify documents according to their relevance and comprehend the subject or theme of a text at a glance.
We apply NER for information extraction. Using NER to extract and highlight important entities from user text is an important step in identifying the fundamental elements in the text. This lets us process input from the user, extract information from the text, and store the output of NER ready for the next extraction step.
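As a hedged illustration of this extraction step, the transformers pipeline API can run a token-classification model over a user query; the public English checkpoint named below is only an example and is not the custom model trained in this thesis:

from transformers import pipeline

# dslim/bert-base-NER is a public English NER checkpoint used here purely for illustration
ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")

for entity in ner("Banh mi is a famous specialty of Hoi An in Quang Nam province"):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))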
Furthermore, NER can be enhanced through machine learning and deep learning approaches, enabling models to continuously improve their accuracy and precision in identifying named entities. Advanced techniques such as named entity linking (NEL) can be employed to link the recognized entities to existing knowledge bases, enhancing the depth and accuracy of the knowledge graph.
Figure 2-11 Illustration of BERT for NER (Source: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019)
To tackle the aforementioned issue, we utilize the integrated BERT-NER model (https://github.com/kamalkraj/BERT-NER).
A new dataset is generated in the CoNLL 2003 format, starting with sentences manually chosen from Wikipedia. Utilizing GPT-3.5, these sentences are transformed into the CoNLL 2003 format, and additional labels are manually assigned to enhance the dataset.
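For orientation, a simplified two-column view of CoNLL-style training data places one token and its tag per line, with blank lines between sentences (the full CoNLL 2003 files also carry part-of-speech and chunk columns). An illustrative excerpt with hypothetical labels for this domain might look like:

Pho B-FOOD
is O
a O
specialty O
in O
Nam B-LOC
Dinh I-LOC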
2.6 Neo4j
Neo4j is a cutting-edge graph database management system that makes it easy to store, manage, and query intricate and highly connected data. It belongs to the class of NoSQL databases, which are designed to address the particular difficulties presented by data that has complex relationships.
Neo4j, which Emil Eifrem founded in 2000, has been essential to the development of graph database technology. Neo4j, Inc., the company that created Neo4j, is committed to expanding the potential of graph databases and encouraging their broad use in a variety of sectors.
Figure 2-12 Neo4j ecosystem (Source: web "Welcome to Neo4j")
Neo4j is fundamentally based on the mathematical framework of graph theory, which describes the relationships between entities. Neo4j models data as nodes and relationships, in contrast to traditional relational databases that rely on tables and predefined schemas. This enables a more natural and intuitive representation of real-world connections. It is a good option for storing semantic triples because of its robust and adaptable data model. When dealing with situations where relationships are just as important as, or even more crucial than, individual data points, this structure is especially helpful.
One of Neo4j's most notable features is its query language, Cypher, which was created especially for use with graph databases. Developers and data analysts can easily retrieve and manipulate graph data thanks to Cypher's expressive and powerful features. Its human-readable syntax closely resembles the natural intuitiveness of graph databases.
The Cypher Query Language, included with Neo4j, offers a logical and visually appealing method of matching patterns in node and relationship graphs. For instance, Neo4j stores the triple ("John", "lives in", "Paris") as shown in Figure 2-13:
Figure 2-13 Example triple store in neo4j
The example in Figure 2-13 can be expressed using the following Cypher pattern:

(:Person {name: 'John'})-[:LIVES_IN]->(:City {name: 'Paris'})

(:Person {name: 'John'}) is a node in this pattern. The node's type or label, "Person", is indicated by the word that follows the colon in that node. The node's properties are listed next to that, and its name is "John". Comparably, another node with the type "City" and the name "Paris" is (:City {name: 'Paris'}). The relationship, with or without direction, is located in between the nodes. There is a directed relationship of the type "LIVES_IN" in this specific instance. It's clear that Cypher's syntax provides a logical and visual interface for interacting with the graph's nodes and relationships.
And here is how to create such a triple with the following Cypher statements:

CREATE (:Person {name: 'John', age: 30})
CREATE (:City {name: 'Paris', country: 'France'})

This Cypher query creates two nodes, one for the person (John) and another for the city (Paris). The labels Person and City are used to categorize the nodes, and properties like name, age, and country store additional information.
MATCH (person:Person {name: 'John'}), (city:City {name: 'Paris'})
CREATE (person)-[:LIVES_IN]->(city)

This Cypher query establishes a relationship between the person and the city. The relationship type is LIVES_IN, and the arrow direction indicates that John lives in Paris.
Cypher offers additional commands to communicate with the Neo4j database in addition to CREATE. Some of the other commands are illustrated and explained in the text that follows:

MATCH (person:Person {name: 'John'})-[:LIVES_IN]->(city)
RETURN person, city
MATCH Clause:
• MATCH: This is a keyword in Cypher used to specify patterns to match in the graph.
• (person:Person {name: 'John'}): This part defines a pattern where we are looking for a node with the label Person and a property name with the value 'John'. The person is an alias assigned to this node for reference in the rest of the query.
• -[:LIVES_IN]->: This part of the pattern represents a directed relationship of type LIVES_IN going from the person node to the city node. The arrow -> indicates the direction of the relationship.
• (city): This specifies that we are looking for a node, and we don't specify a label or properties for the city node in this pattern. The query will find any node that is connected to the person node by a LIVES_IN relationship.
RETURN Clause:
• RETURN person, city: This part of the query specifies what information to return. Here, we want to return the person and city nodes that match the pattern described in the MATCH clause.
MERGE (person:Person {name: 'John'})
MERGE (city:City {name: 'Paris'})
MERGE (person)-[:LIVES_IN]->(city)
RETURN person, city

MERGE Clause:
• MERGE (person:Person {name: 'John'}): This line tries to find a node labeled as Person with the property name equal to 'John'. If such a node exists, it will be matched; otherwise, a new node with the specified properties will be created.
• MERGE (person)-[:LIVES_IN]->(city): This line establishes a LIVES_IN relationship between the person and city nodes. If the relationship already exists, it will be matched; otherwise, a new relationship will be created.
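For completeness, the same MERGE pattern can be issued from Python through the official Neo4j driver (recent 5.x versions); the connection URI, credentials, node label, and relationship type below are placeholders rather than the actual schema used in this thesis:

from neo4j import GraphDatabase

# Placeholder connection details; adjust to the local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_triple(tx, subject, predicate, obj):
    # MERGE keeps nodes and relationships unique, mirroring the Cypher shown above;
    # the predicate is stored as a property because Cypher cannot parameterize relationship types
    tx.run(
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        "MERGE (s)-[:RELATED {type: $predicate}]->(o)",
        subject=subject, object=obj, predicate=predicate,
    )

with driver.session() as session:
    session.execute_write(add_triple, "Pho", "is_a_specialty_in", "Nam Dinh")
driver.close()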
2.7 Fact checking
The overwhelming amount of data generated online has made fact-checking, a critical process for confirming the veracity and accuracy of information, more difficult. The amount of information available is so great that the conventional method of having human experts verify the facts finds it difficult to keep up. Nevertheless, a viable answer to this problem is offered by recently developed computational fact-checking techniques.
Computational fact checking employs advanced techniques in the field of natural language processing and information validation to evaluate the accuracy of statements or claims. There are various ways to accomplish this, including computing the shortest path between concept nodes derived from semantic closeness metrics, and evaluating the relevance between the claim sentence and the pertinent retrieved documents, from which the appropriate degree of similarity is ascertained. Thus, it is feasible to extract pertinent documents for low-resource languages, and this is what we aim to do. To do this, we use reference materials like Wikipedia and other online sources as links to our documents.
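As a rough sketch of this similarity-based step (using a public English sentence-embedding model purely for illustration; for the Vietnamese text handled in this thesis, PhoBERT-based embeddings would take its place), retrieved reference sentences can be ranked against a claim by cosine similarity:

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is a small public English model, used here only to illustrate the idea
model = SentenceTransformer("all-MiniLM-L6-v2")

claim = "Pho is a specialty of Nam Dinh"
evidence = [
    "Pho is widely considered a specialty originating from Nam Dinh and Hanoi.",
    "Banh mi is a popular street food in Ho Chi Minh City.",
]

claim_vec = model.encode(claim, convert_to_tensor=True)
evidence_vecs = model.encode(evidence, convert_to_tensor=True)
scores = util.cos_sim(claim_vec, evidence_vecs)[0]

# Rank the retrieved sentences from most to least similar, as the fact-checking step does
for sentence, score in sorted(zip(evidence, scores.tolist()), key=lambda x: -x[1]):
    print(round(score, 3), sentence)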