
Graduation thesis: Natural Language Question-Answering System About Tourism Location


DOCUMENT INFORMATION

Basic information

Title: Natural Language Question-Answering System About Tourism Location
Authors: Le Duc Tin, Nguyen Van Quoc Viet
Advisor: Prof. Dr. Do Phuc
University: University of Information Technology
Major: Information Systems
Degree: Bachelor of Engineering
Year: 2023
City: Ho Chi Minh City
Format: 86 pages, 45.46 MB

Structure

  • 1.2. The problems and its significance
  • 1.3. Motivation
  • 1.4. Contributions
  • 1.5. Structure of the thesis
  • 2.1. Related Works
  • 2.2. Softmax
  • 2.3. Cross Entropy
  • 2.6. Position-wise Feed-Forward Network
  • 2.7. Evaluation Methods for Word Embedding Techniques
    • 2.7.1. Cosine Similarity
  • 3.3.4. Parallel Sentence Dataset
  • 3.3.5. Tokenization
  • 3.4. Pre-trained models
  • 3.6. Output Query
  • CHAPTER 5. CONCLUSIONS AND FUTURE WORKS
    • 5.1. Challenges
    • 5.2. Conclusion
    • 5.3. Future Work
  • REFERENCES
  • CHATBOT

Content

LIST OF ABBREVIATIONS

No.  Abbreviation  Full form
1    NLP           Natural Language Processing
2    AI            Artificial Intelligence
3    QA            Question and Answering
4    RNN           Recurrent Neural Network
5    BERT          Bidirectional Encoder Representations from Transformers

1.2. The problems and its significance; 1.3. Motivation; 1.4. Contributions; 1.5. Structure of the thesis

In the current context, the outstanding development of Artificial Intelligence (AI) and Natural Language Processing (NLP) has created a revolution in the application of machine learning and natural language processing. Modern machine learning models, such as question-and-answer models and large language models, are becoming more powerful thanks to rapid developments in computer hardware.

However, many important problems in natural language processing, such as Question Answering (QA), Machine Translation, Text Summarization, Sentiment Analysis, and Natural Language Inference (NLI), each face major challenges due to a lack of quality data. The situation has changed with the emergence of many large, high-quality datasets, especially for resource-rich languages such as English and Chinese.

Although this development is driving much groundbreaking research in resource-rich languages, we recognize that Vietnamese, a language with limited research resources, faces a lack of testing environments and research dedicated to NLP models. This is especially true of question answering (QA), an important problem in understanding human language, because it involves the ability to understand semantic relationships in language, which is the foundation for many applications such as QA systems, information extraction, and text summarization.

In 2022, several important datasets related to the NLI problem for Vietnamese were published, such as ViNewsQA, VLSP Shared Task 2021, and ViNLI. However, we realize that each of these datasets poses unique research goals for its author group and is a challenge for NLP models pre-trained on Vietnamese.

Another important issue is how challenging the data is for current machine learning models. To maintain the attractiveness of a research topic in NLP, data difficulty plays an important role. For QA problems, the data needs to be challenging to stimulate the research community to create effective and optimal solutions.

For these reasons, we decided to research a more robust approach to overcome the challenges of QA for Vietnamese and improve the accuracy of current models. We also look forward to applying QA to solve related problems, opening up many new opportunities for using NLP in real-world applications.

The reason we chose this topic is that it combines two important fields that are currently developing in Vietnam: Artificial Intelligence (AI) and tourism. Here are some specific reasons:

The development of AI and NLP: In recent decades, AI and NLP have made significant progress, opening many opportunities to create intelligent applications. We are interested in exploring how these advances can be applied to understanding and interacting with natural language data.

Unique Vietnamese culture and tourism: Scenic wonders are an indispensable part of Vietnamese tourism and are attracting the attention of many people at home and abroad. We want to explore how to use technology to help users learn about and discover the beauty of Vietnam.

New and unique dataset: We have collected and tested the dataset for this topic ourselves. This allows us to generate exclusive and accurate documentation, which improves the accuracy and performance of our QA system.

Serving the community: We hope this project will bring real value to the community of Vietnamese travel lovers, helping them easily learn about and explore the beautiful destinations of the country.

In short, this topic combines the ability to apply cutting-edge technology with an interest in local tourism, creating an exciting and meaningful project for both the NLP field and the tourism scene in Vietnam.

Despite numerous systems built to learn and respond to daily life issues, most of the available data is predominantly in English. There are not many question-answering models designed for natural language processing with Vietnamese data, limiting the diversity of Vietnamese natural language research across various fields. Vietnam, a developing country with over 97 million people and a rapidly increasing number of internet users, may face limitations due to the lack of diverse data, hindering access to AI and machine learning support in some areas.

Moreover, the recent emergence of ChatGPT has showcased the remarkable development and innovation in artificial intelligence at a new level. The new results and significant achievements globally underscore the importance of supportive tools in modern-day life. We aim to be part of this new and captivating trend, which led us to decide to construct a natural language question-answering system to serve the Vietnamese people.

Challenges arose as we attempted to query information about tourism and locations in Vietnam using Vietnamese text. There was not enough data to effectively process Vietnamese text for this purpose. Specifically, no system could provide comprehensive information about regional locations from Vietnamese text. Completing this task within a short period seemed nearly impossible if starting from scratch. Fortunately, we found materials and models from predecessors that greatly aided us in completing this thesis. Additionally, numerous articles and research works in Vietnamese have been an essential support for us, motivating our continued development and efforts to produce outcomes for our cause.

2.1. Related Works; 2.2. Softmax

There are many systems in the world today that are trying to solve this problem. The general methods are mostly the same, although each system may implement the solution differently. We will describe three works that are widely known today: IBM Watson for Travel & Tourism, Amazon Alexa and Google Assistant, and Expedia Smart Travel Tools.

2.1.2 BERT (Bidirectional Encoder Representations from Transformers)

BERT, short for Bidirectional Encoder Representations from Transformers, was introduced and developed by Google in 2018. It stands as one of the most significant advancements in the field of Natural Language Processing (NLP). BERT's uniqueness lies in its ability to comprehend language bidirectionally, both from left to right and right to left, enabling it to understand the context of each word within a sentence. Trained on a massive dataset from the internet, BERT learns how words are used in various contexts. This knowledge empowers BERT to grasp grammar, semantics, and sentence structure naturally and accurately, resulting in reliable outcomes across multiple NLP applications such as translation, information retrieval, and language summarization. This makes BERT a pivotal tool in contemporary Natural Language Processing. [1]

RoBERTa is an advanced NLP (Natural Language Processing) language model developed by Facebook AI. It builds on the Transformer architecture, along with special improvements and tunings that improve performance over the original model, BERT. The name "RoBERTa" indicates that it is a robustly optimized version of BERT. The main goal of RoBERTa is to improve performance by tweaking the architecture and the training process. Highlights of RoBERTa include its optimization techniques, pre-training tasks, transfer learning, and performance. RoBERTa has become one of the most popular and influential models in the field of natural language processing (NLP), and it has been widely used for applications ranging from translation to sentiment analysis and many other tasks. [5]

The PhoBERT model is a pre-trained language model for Vietnamese, developed by Dat Quoc Nguyen and Anh Tuan Nguyen. This model is built upon the Transformer architecture. The Transformer model is the first fully self-attention-based architecture for computing input and output representations without using Recurrent Neural Networks (RNN) or sequence-aligned convolutions. Additionally, the PhoBERT model is trained on a large volume of Vietnamese text, around 20GB, consisting of articles, social media posts, and more. PhoBERT comprises PhoBERT-base and PhoBERT-large versions. [2]

The viBERT model is a monolingual model trained for Vietnamese, developed by FPT AI, based on the BERT (Bidirectional Encoder Representations from Transformers) architecture. Like other BERT-based models, viBERT is pre-trained on a large volume of text data using the Masked Language Model technique. During pre-training, the model is presented with text input where some words are randomly masked, and then it is trained to predict the missing words based on the context of the surrounding words. The model has a 12-layer architecture and 110 million parameters, making it a relatively large and powerful language model. Additionally, it was trained on a 10GB dataset comprising articles, website posts, etc. As a result, it possesses the capability to comprehend and generate text across various styles and genres. [3]

2.1.6 vELECTRA

vELECTRA is a model researched by The Viet Bui, Thi Oanh Tran, and Phuong Le Hong of the FPT Technology Research Institute, FPT University, Hanoi, Vietnam and Vietnam National University, Hanoi, Vietnam. vELECTRA is a variant of the ELECTRA language model fine-tuned for Vietnamese language processing. ELECTRA is a high-capacity model built on the Transformer architecture, known for its ability to efficiently compress data and understand context within text. vELECTRA's strength is its ability to train on a large amount of Vietnamese data, helping it deeply understand text and produce accurate predictions in many different language processing tasks. This model is often used in tasks such as masked word prediction, text classification, or semantic evaluation. [3]

VNCoreNLP is a natural language processing (NLP) library first introduced in 2018, developed by a research team at the Research Center for Information Technology (UIT-RCI) of the University of Information Technology, Vietnam National University Ho Chi Minh City. Built upon Stanford University's CoreNLP technology, VNCoreNLP has been fine-tuned and tailored to suit the Vietnamese language. This library offers various features for processing Vietnamese text, including parsing, part-of-speech tagging, entity extraction, sentiment analysis, and grammar pattern recognition. VNCoreNLP supports NLP applications such as text summarization, information extraction, and natural language processing in artificial intelligence applications. With its compatibility and specialization for the Vietnamese language, VNCoreNLP has provided the development and research community in the field of NLP with a valuable tool for studying and building intelligent applications tailored to Vietnam's language. [4]

The softmax function is a generalized form of the logistic function, characterized by transforming a real-valued C-dimensional vector into a C-dimensional vector whose values lie within the range [0, 1] and sum up to 1. This method is widely used as a classification technique.

a_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \quad i = 1, \dots, C    (2.1)

where:
- C is the number of input dimensions (classes).
- a_i is the i-th element after the softmax transformation, representing the probability of the data point falling into class i.
- z_i is the i-th element in the input vector, which can be negative or positive; the greater z_i is, the greater a_i is, and the higher the probability of the data falling into class i.

A few examples of the softmax transformation are shown in Figure 2-1.

Figure 2-1: The transformation of vectors using softmax.
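As an illustration of formula (2.1), the following minimal NumPy sketch (not part of the original thesis code; the input vector is made up) computes the softmax of a vector and checks that the outputs sum to 1:

import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)         # approximately [0.659, 0.242, 0.099]
print(a.sum())   # 1.0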

2.3. Cross Entropy

The loss function is used to compute the loss or error, minimizing the difference between the predicted output and the actual output.

Cross Entropy between two discrete distributions p and q is defined as follows:
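For reference, the standard definition of the cross entropy between two discrete distributions p = (p_1, ..., p_C) and q = (q_1, ..., q_C) is:

H(p, q) = -\sum_{i=1}^{C} p_i \log q_i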

This is an example in the case where C = 2 and p_1 takes the values 0.5, 0.1, and 0.8, respectively.

Figure 2-2: Comparison between cross-entropy function and squared distance.

There are two crucial observations here:

- The minimum value of both functions is attained when q = p at the abscissa of the points colored in green.

- More importantly, the cross-entropy function takes on a very high value (i.e., high loss) when q is far from p. In contrast, in the squared distance function (q - p)^2, the difference between losses near and far from the solution is negligible.

In terms of optimization, the cross-entropy function favors solutions closer to p because distant solutions are penalized heavily.
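A quick numerical sketch of this observation (illustrative only; p is fixed at 0.5 as in the figure, and the q values are arbitrary):

import numpy as np

p = 0.5
for q in [0.49, 0.3, 0.05]:
    ce = -(p * np.log(q) + (1 - p) * np.log(1 - q))   # cross entropy for C = 2
    sq = (q - p) ** 2                                  # squared distance
    print(f"q={q:0.2f}  cross-entropy={ce:0.3f}  squared distance={sq:0.3f}")

# As q moves away from p, the cross entropy grows much more sharply than (q - p)^2.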

Word Embedding is a vector space used to represent data that can describe relationships, semantic similarities, and contexts within the data. This space comprises multiple dimensions, and words with similar contexts or meanings will be positioned close to each other in that space. The operation process of BERT embedding in our thesis unfolds as follows:

1. Tokenization: The input text is divided into tokens (words or subwords) for processing. BERT utilizes a specialized token encoding process called WordPiece to generate these tokens.

2. Input Preparation: Each encoded token is fed into BERT as input. This involves adding special tokens to mark the beginning and end positions of the sentence, as well as adding the [CLS] token at the start of the sentence to represent the entire sentence.

3. Contextual Representation: BERT employs a transformer architecture to compute vector representations for each token. In particular, BERT uses the encoder layers within the transformer to consider both the left and right contexts of each token, creating contextualized representations for each word in the sentence.

4. Embedding Creation: After passing through the transformer encoder, contextualized representations of the tokens are generated. For each word in the sentence, BERT creates a fixed-dimensional embedding vector containing information about the word's meaning and its context within the sentence.

5. Output Embedding: During training or application of BERT for a specific task (e.g., text classification), the embedding of the [CLS] token, representing the entire sentence, is commonly used to generate the model's output. A rough code sketch of these steps follows.
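The sketch below illustrates steps 1 to 5 (not the thesis code; it assumes the Hugging Face transformers library and the FPTAI/vibert-base-cased checkpoint listed later in Section 3.4):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("FPTAI/vibert-base-cased")
model = AutoModel.from_pretrained("FPTAI/vibert-base-cased")

# Steps 1-2: tokenize and add the special [CLS] / [SEP] tokens.
inputs = tokenizer("Việt và Tin làm khóa luận", return_tensors="pt")

# Steps 3-4: run the encoder to obtain contextualized token embeddings.
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state      # shape (1, sequence_length, 768)
cls_embedding = token_embeddings[:, 0, :]         # step 5: the [CLS] vector representing the sentence
print(token_embeddings.shape, cls_embedding.shape)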

Token embedding has the task of converting words or tokens into numerical vectors of fixed length so that computers can comprehend and process them. This process aids in representing words in a numerical space, allowing models to understand and perform computations or predictions based on the characteristics, context, and relationships between words.

Input: "Việt và Tin làm khóa luận" (6 words)

Step 1: Split the input into tokens, adding two tokens, [CLS] and [SEP], at the beginning and end of the sentence, respectively:

[CLS] 'Việt' 'và' 'Tin' 'làm' 'khóa' 'luận' [SEP] (8 tokens)

Step 2: Project through a matrix of shape (V, H), where V is the number of words in the vocabulary and H is the feature size (768 for BERT base).

Result: the token embedding matrix has shape (8, 768), or a tensor of shape (1, 8, 768) if it includes the batch axis.
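A minimal PyTorch sketch of this lookup (illustrative only; the token ids are made up, and V is taken from the phoBERT configuration given in Section 3.4):

import torch
import torch.nn as nn

V, H = 64001, 768                      # vocabulary size and feature size (BERT base)
token_embedding = nn.Embedding(V, H)   # the (V, H) projection matrix

token_ids = torch.tensor([[0, 5, 17, 42, 9, 23, 8, 2]])   # 8 hypothetical token ids, with batch axis
vectors = token_embedding(token_ids)
print(vectors.shape)                   # torch.Size([1, 8, 768])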

Segment embedding in models like BERT serves the purpose of identifying and distinguishing different segments of text with distinct meanings, such as questions and answers or different sections within a document. It aids the model in understanding the relationship between various parts of the text and supports learning the context and meaning. Segment embedding is commonly used to assign separate feature vectors to each section of the text during model training, enabling the model to differentiate between the input data segments.

Example: now we have two sentences.

Input: "Việt và Tin làm khóa luận" and "Rất hăng say"

Step 1: Repeat the tokenization process, adding [CLS] and [SEP] tokens, and concatenate the two sentences together:

[CLS] 'Việt' 'và' 'Tin' 'làm' 'khóa' 'luận' [SEP] 'rất' 'hăng' 'say' [SEP]

Step 2: Label the tokens with (0, 1) to differentiate between the two sentences, where 0 represents the first sentence and 1 represents the second sentence (in the case of an input with only one sentence, assign 0 to all tokens):

[CLS] 'Việt' 'và' 'Tin' 'làm' 'khóa' 'luận' [SEP] 'rất' 'hăng' 'say' [SEP]
  0     0     0     0      0      0      0     0     1     1     1    1

Step 3: Project through a matrix of shape (2, 768), where 2 is the number of labels (0 and 1) and 768 is the feature size of BERT.

Result: the segment embedding matrix has shape (12, 768), or a tensor with shape (1, 12, 768) including the batch axis.
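The same lookup idea as for token embeddings, applied to the 0/1 segment labels (a sketch, not taken from the thesis):

import torch
import torch.nn as nn

segment_embedding = nn.Embedding(2, 768)   # 2 segment labels, feature size 768

# 0 for the first sentence (8 tokens incl. [CLS]/[SEP]), 1 for the second (4 tokens incl. final [SEP])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
vectors = segment_embedding(segment_ids)
print(vectors.shape)                       # torch.Size([1, 12, 768])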

Position embeddings in the model serve to represent the relative positions of words within a sentence. In contrast to the vocabulary represented by token embeddings, position embeddings assist the model in recognizing the positions of words within a sentence without relying solely on sequential order. This enables the model to grasp the grammatical structure of a sentence and establish a contextual space to understand the positional relationships between words in a broader context. [8]

Using the Sinusoidal Position Encoding method, which has the formula:

PE(i, j) = \sin\left(\frac{i}{10000^{j / d_{emb}}}\right) \text{ if } j \text{ is even}, \qquad PE(i, j) = \cos\left(\frac{i}{10000^{(j-1) / d_{emb}}}\right) \text{ if } j \text{ is odd}    (2.3)

where:
- i: the position of the token in the input
- j: the position along the dimension of the token embeddings
- d_emb: the dimensionality of the token embeddings
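A small NumPy sketch of formula (2.3) (illustrative; the sequence length and embedding dimension here are arbitrary):

import numpy as np

def sinusoidal_position_encoding(n, d_emb):
    # PE[i, j] = sin(i / 10000^(j/d_emb)) for even j, cos(i / 10000^((j-1)/d_emb)) for odd j
    pe = np.zeros((n, d_emb))
    positions = np.arange(n)[:, None]          # i = 0 .. n-1
    even_j = np.arange(0, d_emb, 2)
    pe[:, even_j] = np.sin(positions / 10000 ** (even_j / d_emb))
    pe[:, even_j + 1] = np.cos(positions / 10000 ** (even_j / d_emb))
    return pe

print(sinusoidal_position_encoding(8, 768).shape)   # (8, 768)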

These representations are element-wise summed to create a single representation of shape (1, n, 768). This is a three-dimensional input with dimensions (batch size, sequence length, feature size 768), which is fed into the Encoder of the BERT model.

Figure 2-4: Data preprocessing in BERT.

2.4.1.4 The difference between RoBERTa and BERT in the embedding process.

In terms of embedding, we have referenced and utilized several models related to RoBERTa, and overall the steps share similarities. However, RoBERTa takes a different approach to position embedding compared to BERT: it does not use sinusoidal positional embeddings based on sin-cos functions like BERT.

Instead, RoBERTa employs learned positional embeddings. Specifically, positions in the sentence are not encoded based on a cyclical sin-cos model; rather, they are learned directly from the data during training. This means each position in the sentence has a unique positional embedding vector learned separately, not following a fixed cycle like BERT's sin-cos positional embeddings. [5]

This approach allows the model to learn positional representations more effectively, unconstrained by a fixed cycle model, and enables it to learn better positional representations across diverse contexts.

Both models are pre-trained for language representation and based on the Transformer architecture, but there are several key differences between the two models:

1. Training Corpus: RoBERTa is trained on a larger corpus of text compared to BERT, which allows it to learn a more robust and nuanced representation of the language. [8]

2. Dynamic Masking: RoBERTa uses a dynamic masking strategy, where different tokens are masked in each training example. This allows the model to learn a more diverse set of representations, as it must predict different masks in different contexts. [8]

3. No Next Sentence Prediction Loss: Unlike BERT, RoBERTa does not use a next sentence prediction (NSP) loss during pre-training. This allows RoBERTa to focus solely on the masked language modeling objective, leading to a more expressive language representation. [8]

4. Large Byte-Pair Encoding Vocabulary: RoBERTa uses a larger byte-pair encoding (BPE) vocabulary of 50k tokens compared to BERT's 30k, allowing the model to learn a more fine-grained representation of the language. [5]

2.6. Position-wise Feed-Forward Network; 2.7. Evaluation Methods for Word Embedding Techniques

2.7.1. Cosine Similarity

In data analysis, cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It follows that the cosine similarity does not depend on the magnitudes of the vectors, but only on their angle. The cosine similarity always belongs to the interval [-1, 1].

\text{Cosine similarity} = S_C(A, B) := \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}    (2.8)
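A minimal NumPy sketch of formula (2.8), the measure used throughout our system to compare sentence vectors (the example vectors are made up):

import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(u, v))   # 1.0: same direction, regardless of magnitude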

MSE (Mean Squared Error) loss is the average of the squared differences between the actual and the predicted values. It is the most commonly used regression loss function, with the equation:
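In standard notation, with Ŷ_i denoting the predicted value as in the list below:

MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2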

- n is the number of data points.
- Ŷ_i represents the predicted value for the i-th data point.
- Y_i represents the actual or observed value for the i-th data point.

The term (Ŷ_i - Y_i)^2 calculates the squared difference between the predicted and actual values for each data point, and 1/n is the reciprocal of the number of data points, used to calculate the average.

We give our appreciation to doanhieung, who translated the STS Benchmark to Vietnamese. The STS Benchmark (Semantic Textual Similarity Benchmark) is a standardized dataset used to evaluate the semantic similarity between pairs of sentences, as judged by humans and manually labeled accordingly. The dataset consists of approximately 8,630 rows; we use 20% of it for the dev and test splits and the rest for training together with our own dataset. Its columns are: split, genre, dataset, year, sid, score, sentence1, sentence2.

Our thesis proposes applying the SBERT model to develop a natural language question-answering system about tourism location, delving deeper into the exploration of algorithms applied to this technique in the subsequent chapters.

In Chapter 2, we outlined the general theories used in our model In Chapter 3, we will elaborate on the data processing steps and provide detailed information about the algorithms or models utilized in our research.

Figure 3-1: Training the student model to mimic the teacher model.

In the figure above, we trained with more than 10k pairs of synonyms related to landscapes. We designed the training model based on a student model learning from a teacher model, following the approach of making monolingual sentence embeddings multilingual [12]. Moreover, the student model learns a sentence embedding space with an important property: vector spaces are aligned across tokenizations, i.e., identical sentences with different tokenizations are close to each other. For the teacher model, we apply the Vietnamese-SBERT model, which provides similarities between Vietnamese sentences. The teacher model is designed based on two main modules: pre-trained models and mean pooling (see Section 2.4.3).

In the system design part, we focus on developing three pre-trained embedding models, phoBERT, viBERT, and vELECTRA, with different tokenization mechanisms, and finally we choose the best model for application in our product.

(over 2k questions about Vietnamese tourism locations)

In the question process, the user's input question is first cleaned by removing stopwords. Both the cleaned question and the dataset questions are then embedded by the pre-trained models, all token embeddings are passed through pooling, and finally the two sentence vectors are compared by cosine similarity.
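A condensed sketch of this inference flow using the sentence-transformers library (the model name is a placeholder for our fine-tuned model, and the example questions are hypothetical):

from sentence_transformers import SentenceTransformer, util

# Placeholder: in practice we load our fine-tuned phoBERT-based SBERT model.
model = SentenceTransformer("keepitreal/vietnamese-sbert")

dataset_questions = ["Tỉnh ly An Giang", "An Giang danh lam thắng cảnh"]
dataset_embeddings = model.encode(dataset_questions, convert_to_tensor=True)

user_question = "Tỉnh ly của An Giang là gì"          # already cleaned (stopwords removed)
query_embedding = model.encode(user_question, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, dataset_embeddings)[0]
best = int(scores.argmax())
print(dataset_questions[best], float(scores[best]))   # best-matching dataset question and its score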

Data preprocessing is a crucial part of preparing data for model training The main steps in data processing may include:

In our thesis, there are two important data sections: a dataset of questions and answers related to tourist destinations, and a dataset for training to recognize synonyms. Both datasets were collected manually by us.

Regarding the question data, we collected information about tourist destinations and attractions from reputable websites (gov, Wikipedia, ...). After collecting the data, we compared the information from the various sources to check its accuracy. Next, we cleaned the data, removing junk characters and spelling errors. Finally, the cleaned data was organized into question-answer pairs, as shown in Figure 3-3.

Figure 3-3: Question-Answer training data

In terms of synonym recognition training data, we rely on frequently asked questions in spoken and written Vietnamese to create synonymous sentence pairs. We tried to create a dataset as close as possible to users' questions so that the system could give the best results. For example:

tỉnh ly An Giang | trung tâm hành chính An Giang | khu trung tâm của An Giang
An Giang danh lam thắng cảnh | An Giang điểm đến du lịch nổi tiếng | An Giang thu hút du khách

Figure 3-4: Training data for similar-meaning sentences

While manual data collection took us a lot of time and effort, it ensures the accuracy and reliability needed for a new problem and lets us control the quality of the dataset.

Before feeding the data into the model for training, it needs to go through several steps to clean the text data:

- To remove punctuation marks and non-ASCII characters.

- To convert characters in proper nouns to their correct format.

- Stop words are commonly occurring words in text that may not carry significant meaning when analyzing NLP data. Filtering them out optimizes the subsequent data processing.

- With phoBERT, an additional word-segmentation step is required; the other models do not need it (a minimal cleaning sketch follows this list).
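A minimal, illustrative cleaning sketch (not the thesis code; the stopword list is a made-up subset, and the word-segmentation step for phoBERT is omitted since it relies on an external Vietnamese word segmenter):

import re
import string

VIETNAMESE_STOPWORDS = {"hãy", "cho", "tôi", "biết", "của", "là", "gì"}   # illustrative subset only

def clean_question(text: str) -> str:
    # Remove punctuation marks and junk characters such as % or ???
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Remove stopwords that do not affect the search
    words = [w for w in text.split() if w.lower() not in VIETNAMESE_STOPWORDS]
    return " ".join(words)

print(clean_question("Hãy cho tôi biết tỉnh ly của Bắc% Giang nằm ở đâu ???"))
# -> "tỉnh ly Bắc Giang nằm ở đâu"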

Table 3-1: Example of removing non-ASCII characters.

Raw data | Cleaned data (word-segmented) | Cleaned data
Tỉnh ly của Bắc% Giang nằm ở đâu | Tỉnh_ly Bắc_Giang ở đâu | Tỉnh ly Bắc Giang ở đâu
Tỉnh ly của bắc giang nằm ở đâu | Tỉnh_ly Bắc_Giang ở đâu | Tỉnh ly Bắc Giang ở đâu
Tỉnh ly của Bắc Giang nằm ở đâu ??? | Tỉnh_ly Bắc_Giang ở đâu | Tỉnh ly Bắc Giang ở đâu

In addition, we make a slight change to the user's question input, removing some words that do not affect the search, which makes comparing sentences easier and gives better scores:

Hãy cho tôi biết danh lam thắng cảnh của Quảng Nam | Danh lam thắng cảnh Quảng Nam
Vị trí của Thanh Hóa nằm ở đâu | Vị trí Thanh Hóa ở đâu
Tỉnh ly của Cần Thơ ở đâu | Tỉnh ly Cần Thơ ở đâu
Nét riêng đặc trưng của Phú Yên có gì | Nét riêng đặc trưng Phú Yên

Moreover, we can handle capitalizing the names of scenic spots and provinces:

Table 3-3: Example of capitalizing words.

hồ gươm | Hồ Gươm
an giang | An Giang
thanh hóa | Thanh Hóa

In the Datasets module, all dataset questions are encoded by the pre-trained models and then compared with the encoded user question to obtain their similarity scores.

We created a dataset with over 1.5k questions on topics such as provincial capitals, scenic spots, and historical features.

Each question in the dataset, for example "tỉnh ly Tây Ninh", is encoded by the pre-trained models. The dataset questions follow several templates, such as "Thừa Thiên Huế danh lam thắng cảnh".


3.3.4. Parallel Sentence Dataset; 3.3.5. Tokenization; 3.4. Pre-trained models

In the training stage, for each model we list all synonym sentences in our tourism-location domain, to make sure that when a user gives a sentence close in meaning to our dataset it can still be matched.

Sentence | Synonym Sentence | Synonym Sentence | Synonym Sentence
Tỉnh ly An Giang | Trung tâm hành_chính An Giang | Trung tâm hành chính An Giang | Trung tâm hành chính An Giang
An Giang danh lam thắng cảnh | An Giang điểm đến du lịch nổi tiếng | An Giang điểm đến du lịch nổi tiếng | An Giang điểm đến du lịch nổi tiếng
Vị trí Điện Biên | Điện Biên ở đâu | Điện Biên ở đâu | Điện Biên ở đâu
Bà Rịa Vũng Tàu có gì đặc trưng | Nét riêng của Bà Rịa Vũng Tàu | Nét riêng của Bà Rịa Vũng Tàu | Nét riêng của Bà Rịa Vũng Tàu

Tokenization is the process of breaking down text into smaller tokens for easier handling in subsequent text processing.

Cleaned data | Tokenization | Tokenization | Tokenization
Tỉnh ly Bắc Giang ở đâu | 'tỉnh_ly', 'của', 'Bắc_Giang', 'ở', ... | 'tinh', 'ly', 'cu', '##a', 'ba', '##c', 'gian', '##g', ... | 'tinh', 'ly', 'cua', 'bac', 'giang', 'o', ...
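The different tokenizations can be reproduced with the Hugging Face tokenizers of the three checkpoints listed in the next subsection (a sketch; the exact subword splits depend on each model's vocabulary):

from transformers import AutoTokenizer

# phoBERT expects word-segmented input (underscores join multi-syllable words);
# viBERT and vELECTRA take the plain cleaned sentence.
sentences = {
    "vinai/phobert-base": "Tỉnh_ly Bắc_Giang ở đâu",
    "FPTAI/vibert-base-cased": "Tỉnh ly Bắc Giang ở đâu",
    "FPTAI/velectra-base-discriminator-cased": "Tỉnh ly Bắc Giang ở đâu",
}

for name, sentence in sentences.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(sentence))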

For the pre-trained models, we install them from Hugging Face:

- phoBERT: vinai/phobert-base (https://huggingface.co/vinai/phobert-base), with the following configuration:
  - Max position embeddings: 258
  - Vocabulary size: 64001
  - Layer normalization: 1e-05
  - Hidden size: 768
  - Model type: Roberta
- viBERT: FPTAI/vibert-base-cased (https://huggingface.co/FPTAI/vibert-base-cased), with configuration:
  - Max position embeddings: 512
  - Vocabulary size: 38168
  - Layer normalization: 1e-12
  - Hidden size: 768
  - Model type: Bert
- vELECTRA: FPTAI/velectra-base-discriminator-cased (https://huggingface.co/FPTAI/velectra-base-discriminator-cased), with configuration:
  - Max position embeddings: 512
  - Vocabulary size: 32054
  - Layer normalization: 1e-12
  - Hidden size: 768
  - Model type: Electra

Because we want a fixed-size representation of the entire input sequence, the type of pooling we use is average (mean) pooling, which combines all token embeddings in a sentence before computing cosine similarities. Specifically, the sequence of vectors is summed element-wise and then divided by the total number of vectors, so the output of average pooling is a single vector (the average of all input vectors).

Based on that, we can take the single vector and compare it by cosine similarity.
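A minimal PyTorch sketch of mean pooling over a batch of token embeddings (mask-aware, so padding tokens do not contribute; illustrative, not the sentence-transformers implementation itself):

import torch

def mean_pooling(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden), attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)      # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)           # number of real tokens
    return summed / counts                             # (batch, hidden)

embeddings = torch.randn(1, 12, 768)
mask = torch.ones(1, 12)
print(mean_pooling(embeddings, mask).shape)            # torch.Size([1, 768])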

In Figure 3-1, let u and v be the two single vectors obtained from the pooling process. We compare them by cosine similarity, but some cases, such as synonym meanings (Section 3.2.3), have prerequisite thresholds for the score. When a question matches with a high enough score, we take its question id and return the answer we created beforehand, from the set of about 1.5k answers in the tourism-location domain.

We train these models based on teacher models with an average pooling layer that have already been trained, while the datasets are designed by us for the tourism domain. The teacher models for our three student models are: keepitreal/vietnamese-sbert for phoBERT, and alfaneo/bert-base-multilingual-sts for vELECTRA and viBERT.

Moreover, we fine-tuned some parameters to suit our expectations:

- The maximum number of tokens: Max_seq_length = 256.
- The number of training examples utilized in one iteration: Batch_size = 64.
- The learning rate, which determines the step size at each iteration: 2e-6.
- Epsilon: 1e-6 (a small constant added to the denominator to prevent division by zero).
- Loss function: MSE (Mean Squared Error).

All of these parameters are applied in the following code:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Student modules: a phoBERT encoder followed by mean pooling (see Sections 3.4 and 3.5)
embedding_model = models.Transformer('vinai/phobert-base', max_seq_length=256)
pooling_model = models.Pooling(embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[embedding_model, pooling_model])

# Teacher model providing the target sentence embeddings
teacher_model = SentenceTransformer('keepitreal/vietnamese-sbert')

train_batch_size = 16

# Define the dataset
train_reader = ParallelSentencesDataset(student_model=model, teacher_model=teacher_model)
train_reader.load_data('/content/train_ver2.txt')

# Create a DataLoader for training
train_dataloader = DataLoader(train_reader, shuffle=True, batch_size=train_batch_size)

# Define the loss function
train_loss = losses.MSELoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    evaluation_steps=500,
    warmup_steps=1000,
    scheduler='warmupconstant',
    save_best_model=True,
    optimizer_params={'lr': 2e-6, 'eps': 1e-6},
)

The '/content/train_ver2.txt' file for phoBERT looks like this:

tỉnh ly An Giang    trung_tâm hành_chính An_Giang
An Giang danh lam thắng cảnh    An Giang điểm đến du lịch nổi tiếng
Sóc Trăng có gì đặc trưng    Nét riêng của Sóc_Trăng
Bình Thuận vị trí    Bình Thuận thuộc vùng nào

That file is different for viBERT and vELECTRA (no word segmentation):

tỉnh ly An Giang    trung tâm hành chính An Giang
An Giang danh lam thắng cảnh    An Giang điểm đến du lịch nổi tiếng
Sóc Trăng có gì đặc trưng    Nét riêng của Sóc Trăng
Bình Thuận vị trí    Bình Thuận thuộc vùng nào

The Parallel Sentences Dataset helps us learn parallel sentences with the same meaning and handle cases where the tokens differ but the meaning is the same. We train on over 5k rows of same-meaning pairs in our traveling domain, together with data collected from Hugging Face.

For the training time, we recorded the duration for each of the three models and compared them in a bar chart of training time in minutes.

Part of this result may be due to phoBERT's word-concatenation (word segmentation) phase, which reduces the number of tokens and makes phoBERT's training a bit faster; other factors may be the different design architectures of viBERT and vELECTRA, which make them slower than phoBERT.

In the results section, we use the F1-score and the Pearson correlation coefficient to evaluate each model:

Evaluation chart of the three models: phoBERT, viBERT, vELECTRA.

For phoBERT, we see that it achieves a precision of 0.825, indicating that a significant proportion of the instances predicted as positive are indeed true positives. It also demonstrates a recall of 0.868, capturing a substantial proportion of the actual positive instances, so the F1-score reaches a good value of 0.846 at a cosine-similarity threshold of 0.8. viBERT has a lower precision of 0.607 in accurately identifying positive instances compared to phoBERT, but it shows a high recall of 0.961, indicating that a large proportion of actual positive instances are successfully captured; its F1-score of 0.738 does not meet our expectations. vELECTRA displays a precision of 0.701, indicating a reasonable ability to correctly classify positive instances, and achieves a high recall of 0.962, showcasing its effectiveness in capturing most actual positive instances; its F1-score of 0.806 reflects a balanced performance in terms of precision and recall.
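These metrics can be computed as in the following sketch, where predictions are obtained by thresholding the cosine similarity at 0.8 (the labels, scores, and gold similarities shown are hypothetical):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
from scipy.stats import pearsonr

labels = np.array([1, 0, 1, 1, 0])                  # hypothetical ground-truth matches
scores = np.array([0.95, 0.55, 0.83, 0.78, 0.40])   # hypothetical cosine similarities

preds = (scores >= 0.8).astype(int)                 # threshold = 0.8
print("precision", precision_score(labels, preds))
print("recall", recall_score(labels, preds))
print("F1", f1_score(labels, preds))

# Pearson correlation between predicted similarities and gold similarity scores
gold = np.array([1.0, 0.4, 0.9, 0.8, 0.2])          # hypothetical gold scores
print("pearson", pearsonr(scores, gold)[0])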

Figure 4-3: Pearson correlation coefficient chart.

phoBERT achieves a Pearson correlation coefficient of 0.873, dominant over the others. This can be explained by its specific architecture, such as word segmentation, which helps improve the model and differs from viBERT and vELECTRA.

Here are some cases in which the models return results for similar-meaning sentences:

Given Question (Level) | Pre-trained model | Given Result (Matching Question) | Score (Cosine) | True/False

Hãy cho tôi biết tỉnh ly của An Giang là gì? (Easy)
- phoBERT: Tỉnh_ly An_Giang, 0.956, True
- viBERT: Tỉnh ly An Giang, 0.998, True
- vELECTRA: Tỉnh ly An Giang, 0.953, True

Tỉnh ly của Kon Tum? (Easy)
- phoBERT: Tỉnh ly Kon Tum, 0.922, True
- viBERT: Tỉnh ly Kon Tum, 0.994, True
- vELECTRA: Tỉnh ly Kon Tum, 0.964, True

Hãy đưa ra thông tin Thừa Thiên Huế thuộc vùng nào? (Medium)
- phoBERT: Thừa Thiên Huế vị trí, 0.807, True
- viBERT: Thiền viện Trúc Lâm Bạch Mã ở Thừa Thiên Huế, 0.973, False
- vELECTRA: Các đảo Quan Lạn và Cô Tô thuộc địa phận ..., 0.821, False

Hãy cho tôi một số nét riêng của Hà Nội (Hard)
- phoBERT: Hà Nội có gì đặc trưng, 0.768, True
- viBERT: Sìn Hồ ở Lai Châu, 0.895, False
- vELECTRA: Chợ nổi ở Cà Mau, 0.734, False

Hãy cho tôi biết vị trí Suối Mơ nằm ở đâu và thuộc tỉnh nào nước ta? (Hard)
- phoBERT: Suối Mơ ở Đồng Nai, 0.774, True
- viBERT: Khu du lịch sinh thái Măng Đen ở Kon Tum, 0.921, False
- vELECTRA: Đá ba chồng ở Đồng Nai, 0.875, False

Hãy cho tôi một số đặc trưng của Trà Vinh? (Medium)
- phoBERT: Trà Vinh có gì đặc trưng, 0.904, True
- viBERT: Trà Vinh có gì đặc trưng, 0.985, True
- vELECTRA: Điện Biên vị trí, 0.875, False

Thông tin của Chùa Phước Kiến ở Đồng Tháp (Easy)
- phoBERT: Chùa Phước Kiến ở Đồng Tháp, 0.999, True
- viBERT: Chùa Phước Kiến ở Đồng Tháp, 1.000, True
- vELECTRA: Chùa Phước Kiến ở Đồng Tháp, 1.000, True

Vị trí của Đèo Mã Pí Lèng (Medium)
- phoBERT: Đèo Mã Pí Lèng ở Hà_Giang, 0.956, True
- viBERT: Đèo Mã Pí Lèng ở Hà Giang, 0.978, True
- vELECTRA: Núi lửa Chư Đăng Ya ở Gia Lai, 0.897, False

Thông tin của trung tâm hành chính Hải Dương (Medium)
- phoBERT: Tỉnh ly Hải Dương, 0.99, True
- viBERT: Tỉnh ly Hải Dương, 0.998, True
- vELECTRA: Tỉnh ly Hải Dương, 0.997, True

Những điểm đến du lịch nổi tiếng của Bình Phước? (Hard)
- phoBERT: Bình Phước danh lam thắng cảnh, 0.939, True
- viBERT: Hải Dương danh lam thắng cảnh, 0.994, False
- vELECTRA: Du lịch Huế đặc điểm văn hóa lịch sử, 0.893, False

Based on the table above, we see that phoBERT can handle cases such as "Hãy cho tôi biết vị trí của Suối Mơ và thuộc tỉnh nào nước ta?". This is a hard case for our dataset, and it is good that phoBERT can return the expected matching question.

Significantly, in the case "Hãy cho tôi một số nét riêng của Hà Nội", we matched and trained the synonym meanings "nét riêng" and "đặc trưng", as well as "du lịch nổi tiếng" and "danh lam thắng cảnh", for all three models, but viBERT and vELECTRA did not give the right connection to the questions, while phoBERT handled this case well. Luckily, the synonyms "trung tâm hành chính" and "tỉnh ly" are matched to the correct questions by all three trained models.

In discussion, because of its dominance over the others, we decided to choose phoBERT as the main model to serve our question-answering system for the tourism-location domain in our application.

Remarkably, the next part presents our application design and implementation.


CHAPTER 5. CONCLUSIONS AND FUTURE WORKS; 5.1. Challenges
