Fact-checking through semantic similarity

Returning an answer is not enough on its own. Users today have high expectations: they want a chatbot's response to be accurate, not merely fluent. Given the volume and complexity of online information, this applies to chatbots of all kinds. To provide evidence that the chatbot's claims are true, we gathered and archived reliable online resources about Vietnamese food, including Wikipedia and reputable Vietnamese newspaper articles, as a source of information. To validate the chatbot's response, we compute the Vietnamese semantic similarity between each response from our system and related sentences from the source pages, and assess the match through the resulting similarity scores.

We switched to Vietnamese at this stage because the majority of the source pages on Vietnamese cuisine are written in Vietnamese. As a result, we added further libraries and Vietnamese natural language processing models. Due to time constraints we use only the pre-trained model, which already captures word context well in simple natural language sentences. This stage, the final part of our system, follows the pipeline shown in Figure 3-4.

Figure 3-4 Fact checking pipeline (stages shown: knowledge graph, extraction of relevant sentences, answer template processing, segmentation, embedding, answer, similarity)

After receiving the results from the knowledge graph, our system executes two tasks in parallel: it generates the bot response from the answer template, and it extracts related sentences from the source pages relevant to the entity, here mainly dishes. For the extraction we use Python's Beautiful Soup library, which parses HTML and XML pages and provides tools for navigating, searching, and extracting data from a page's parse tree. The system keeps only sentences that contain the related entity; this reduces the amount of data to process and leads to higher similarity scores. Once the sentences are available, the system moves on to data cleaning. Sentences must be preprocessed before they are converted into vectors for similarity comparison, so we apply word segmentation from VnCoreNLP [4] at this step.
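The extraction and filtering step can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `extract_entity_sentences`, the choice to read only `<p>` elements, the regular-expression sentence splitter, and the sample HTML are all assumptions made for the example.

```python
import re

from bs4 import BeautifulSoup


def extract_entity_sentences(html: str, entity: str) -> list[str]:
    """Return sentences from <p> elements that mention the entity."""
    soup = BeautifulSoup(html, "html.parser")
    sentences = []
    for paragraph in soup.find_all("p"):
        # Flatten nested tags (e.g. <b>) into plain text.
        text = paragraph.get_text(" ", strip=True)
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            # Case-insensitive substring match on the entity name.
            if entity.lower() in sentence.lower():
                sentences.append(sentence)
    return sentences


# Hypothetical page fragment used only for this illustration.
html = """
<html><body>
  <p>Phở là món ăn truyền thống của Việt Nam. Món này thường được ăn vào buổi sáng.</p>
  <p>Bánh mì là một món ăn đường phố. Nước dùng của <b>phở</b> được ninh từ xương bò.</p>
</body></html>
"""

related = extract_entity_sentences(html, "phở")
for sentence in related:
    print(sentence)
```

Only the two sentences that mention "phở" are kept. The substring test is sensitive to diacritics, so "phố" in the second paragraph does not match.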

In natural language processing (NLP), vector representations of sentences play an important role in understanding and processing linguistic information. Models like PhoBERT provide efficient language representations that benefit a variety of applications. A sentence vector helps the model capture the context in which each word appears: instead of considering individual words in isolation, the model derives meaning from the interactions between sentence components. This matters for both understanding and generating natural language. In question answering, understanding the context of a sentence can decide whether the answer is correct or incorrect, and the sentence vector helps the model capture the relevant information. To implement this method we apply the PhoBERT model. PhoBERT is a Vietnamese language model based on the BERT (Bidirectional Encoder Representations from Transformers) architecture. BERT is a well-known language representation model that captures the meaning of words together with their surrounding context. PhoBERT inherits these advantages and is trained on large-scale Vietnamese data, which helps it achieve high performance in many NLP tasks.

Before being passed to the model, natural language sentences usually go through word segmentation so that they are represented as word units. In Vietnamese a single word may consist of several syllables separated by spaces, so splitting on whitespace is not sufficient. For this reason we use VnCoreNLP [4] to segment words in Vietnamese sentences. After the sentence has been segmented, PhoBERT produces a representation vector for each word and for the sentence as a whole, based on the context and the relationships between linguistic components.
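To make the segmented format concrete, the sketch below shows the shape of input that PhoBERT's published usage examples expect: the syllables of a multi-syllable word are joined with underscores, and words are separated by spaces. The helper `to_phobert_input` is hypothetical, and the word list is segmented by hand for illustration; it is not actual VnCoreNLP output.

```python
def to_phobert_input(words: list[str]) -> str:
    """Join the syllables of each word with '_' and the words with spaces."""
    return " ".join("_".join(word.split()) for word in words)


# Hand-segmented for illustration; a real pipeline would get this from the segmenter.
words = ["Phở", "là", "món ăn", "truyền thống", "của", "Việt Nam", "."]
segmented = to_phobert_input(words)
print(segmented)  # Phở là món_ăn truyền_thống của Việt_Nam .
```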

PhoBERT is a language representation model built on the BERT architecture described in Chapter 2, Section 2-3, a well-known transformer architecture in natural language processing (NLP). PhoBERT's architecture includes:

• Input Layer:

For each input sentence, each word is mapped into the feature vector space through word embedding. Each word embedding is also combined with a position vector so that the model can recognize the order of words in the sentence.

• Encoder Layers:

PhoBERT uses multiple encoder layers, each containing a self-attention mechanism and a feed-forward network. The encoder layers help the model capture expressions and contextual relationships between words in a sentence.

• Sentence Representation:

After each encoder layer, the representation of each word in the sentence is updated based on the context around it. The representation of a sentence is a combination of the word-level representations, usually taken from the final encoder layer.

• Final Representation Vector:

The end result is a vector representation of the entire sentence, containing information about context, meaning, and relationships between words.
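One common way to combine word-level representations into a single sentence vector is mean pooling over the final-layer token vectors, skipping padding positions. The sketch below assumes this pooling strategy for illustration; the text above does not state which combination our system uses, and the function name and toy vectors are hypothetical.

```python
def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the vectors of real tokens (mask == 1), skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy 2-dimensional token vectors; the last position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Another widely used option is to take the vector of the first (classification) token instead of averaging; which one works better depends on the task.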

In natural language processing, measuring the similarity between representation vectors is an important part of evaluating and comparing texts. With PhoBERT embeddings, a popular choice is cosine similarity, a measure that has proven effective for comparing two vectors in high-dimensional space. It is the cosine of the angle between the two vectors and ranges from -1 (opposite directions) through 0 (orthogonal, no similarity) to 1 (same direction, completely similar).
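For two vectors u and v, cosine similarity is the dot product divided by the product of the vector lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A minimal sketch in plain Python, with toy vectors chosen to hit the three reference points of the range:

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)  # 1.0 0.0 -1.0
```

The score depends only on direction, not on vector length, which is why [1, 0] and [2, 0] score 1.0.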

Although cosine similarity is not a new algorithm, it still holds its place in natural language processing. There are several important benefits to using cosine similarity in the context of PhoBERT:

• Standard Similarity Measure: Cosine similarity is a standard and effective similarity measure, especially when applied to language representation vectors from models like PhoBERT.

• Computational Efficiency: Compared to some other methods, cosine similarity has low computational complexity, making it suitable for similarity assessment on large amounts of text data.

• High Interpretability: The results of cosine similarity are easy to interpret: the closer the value is to 1, the higher the similarity, and the lower the value, the lower the similarity. This makes it an effective tool for understanding and comparing texts.
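The efficiency and interpretability points can be illustrated with a batch version: once the vectors are normalized, scoring one answer against many source sentences is a single matrix-vector product, and the best score can be read directly. This is a sketch under assumptions: it uses NumPy, the name `score_evidence` and the toy vectors are hypothetical, and the 0.8 threshold is an illustrative value, not one taken from our system.

```python
import numpy as np


def score_evidence(answer_vec, evidence_vecs) -> np.ndarray:
    """Cosine similarity between one answer vector and each evidence row."""
    answer = np.asarray(answer_vec, dtype=float)
    evidence = np.asarray(evidence_vecs, dtype=float)
    # Normalize to unit length so the dot product equals the cosine.
    answer = answer / np.linalg.norm(answer)
    evidence = evidence / np.linalg.norm(evidence, axis=1, keepdims=True)
    return evidence @ answer


answer_vec = [1.0, 1.0, 0.0]
evidence_vecs = [
    [2.0, 2.0, 0.0],  # same direction as the answer
    [1.0, 0.0, 0.0],  # partially aligned
    [0.0, 0.0, 5.0],  # orthogonal to the answer
]

scores = score_evidence(answer_vec, evidence_vecs)
best = int(np.argmax(scores))

THRESHOLD = 0.8  # hypothetical cut-off chosen for illustration only
supported = bool(scores[best] >= THRESHOLD)
print(best, supported)  # 0 True
```

In practice the threshold would need to be tuned on labeled examples, since the score distribution depends on the embedding model and the pooling method.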

Chapter 4