Research content: Students can explore, research, and present content related to the thesis topic, which is applying deep learning to process the natural language of arithmetic word problems in Vietnamese.
Virtual assistant
On 4" October 2011, the world’s first virtual assistant was released as a feature of the iPhone S4 [5] Since then, there are a huge number of virtual assistants have appeared and become an important part of human life such as Google Assistant of Google, Cortana of Microsoft, or Bixby of Samsung.
By adopting DL applications, the possibilities of these virtual assistants seem limitless. They can handle many human tasks such as making calls, playing music, finding information, and of course, solving arithmetic word problems. The strength of these virtual assistants is that they interact very well with humans and understand most problems posed in natural language. However, there are still certain types of math that they cannot handle, and they have not been programmed to handle problems in Vietnamese, which is a huge omission for users in Vietnam.
While there are many websites and applications designed to solve arithmetic word problems, most of them are only programmed to solve problems in English. To our knowledge, there are no tools or apps available for solving arithmetic word problems in Vietnamese. Currently, there are nearly 9 million primary school students in our country, and many of them probably cannot yet read English with comprehension, let alone solve English arithmetic word problems. This situation has led to certain difficulties in supporting teaching and learning in Vietnam. For that reason, we hope to create an algorithm that can handle and solve arithmetic word problems in Vietnamese for elementary school students.
Besides, AI in general and DL in particular have appeared and developed rapidly in all aspects of life today. That development has brought many positive results in serving people. Desiring to be a part of that trend, we decided to use those advanced technologies in our project, in the hope of creating an application that can interact with and serve Vietnamese people.
While working on this graduate project, we achieved the following things:
We applied the necessary algorithms and tools to extract and analyze arithmetic word problem sentences in Vietnamese natural language, and to determine the correlation between the objects appearing in the problem.
We used Recurrent Neural Networks (RNNs) to recognize existing problem types as well as new mathematical types, minimizing the processing time and increasing the accuracy of the solution through the self-learning ability of computers.
We built a Chatbot that interacts with students to increase efficiency and interest in solving arithmetic word problems.
Our system combines a variety of areas in Computer Science (CS), ranging from Natural Language Processing (NLP) to RNNs. The project we chose is not new in the world, but it is the first application capable of solving arithmetic word problems in Vietnamese natural language. We hope this essay can be used as a reference for anyone with the same passion and goal.
This thesis will contain 5 main chapters with the following structure:
Chapter 1: This chapter will introduce and summarize the context, the problem and its significance, works related to this project, the motivation, and the contributions of this graduate thesis.
Chapter 2: This chapter covers the background and theory that we need to research in order to understand and finish this thesis.
Chapter 3: This chapter will describe the design of our system and the components inside it, including the descriptions, inputs, and outputs of the services.
Chapter 4: This chapter will explain the evaluation details, system installation, and the graphical user interface (GUI).
Chapter 5: This chapter will present the conclusions of this graduate thesis, future developments, and our recommendations.
In summary, in this chapter, we introduced the context, issues, and the importance of researching and building an application capable of solving arithmetic word problems in Vietnamese natural language. We outlined some strengths and limitations of existing methods and applications, and the features applicable in this graduate thesis. Also, we highlighted the need for an application that can serve the learning purposes of primary school students in Vietnam, as well as our contributions in this essay. In the next chapter, we will elaborate on background concepts and theories that are directly or indirectly related to our research topic.
After considering and studying solutions to the problem set out, we realized that some issues need to be dealt with, as follows:
1. How can we make computers understand and analyze the given Vietnamese arithmetic word problems in natural language?
To make sure that the computers can understand and analyze the given word problems, we will use some NLP techniques such as word segmentation, part-of-speech (POS) tagging, word embedding, and so on. Besides that, some available packages such as VnCoreNLP will be used to help the computers process Vietnamese text.
2. After understanding and analyzing the components as well as the numbers inside the given Vietnamese arithmetic word problems, how can we help computers detect, predict, or classify the type of math?
One of the most popular and most powerful techniques for solving language problems is using neural networks. These networks can help us detect or classify the given problems into several types, which makes it easier for computers to calculate and return the most accurate result. In our graduate thesis, we will apply Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, which are special types of neural networks, to fulfill the requirements that we set out.
3. If the computers can identify the types of math, how can they calculate the most accurate answers for those Vietnamese arithmetic word problems and return the results to users?
- We will create several patterns manually; each of them will contain calculation formulas that can be used to solve a specific type of problem. Therefore, after detecting the type of word problem, we will put the problem into the respective pattern and the computer will calculate the final result.
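To make the pattern idea concrete, below is a minimal sketch of this step. The pattern names and the `solve` helper are our own illustration, not the thesis' actual code; "change" patterns return the new quantity for each owner.

```python
# Illustrative pattern table: each detected math type maps to a formula over
# the quantities extracted from the problem (names here are hypothetical).
PATTERNS = {
    "combine":    lambda x, y: x + y,              # total owned by both owners
    "increase":   lambda x, y: x + y,              # Owner 1 gains y objects
    "decrease":   lambda x, y: x - y,              # Owner 1 loses y objects
    "change_in":  lambda x, y, z: (x + z, y - z),  # z moves Owner 2 -> Owner 1
    "change_out": lambda x, y, z: (x - z, y + z),  # z moves Owner 1 -> Owner 2
}

def solve(math_type, *quantities):
    """Apply the formula pattern matching the predicted math type."""
    return PATTERNS[math_type](*quantities)

# "An has 12 candies, Binh has 9. Binh gives An 5." -> type "change_in"
print(solve("change_in", 12, 9, 5))  # (17, 4)
```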
According to Wikipedia [6], NLP is a branch of AI which mainly focuses on developing applications based on human natural language. This process can be considered one of the most complicated in AI, since it is not easy to understand the meaning of language. Without it, computers will never be able to think and communicate with humans, or support us in language-related jobs such as translation, text data analysis, speech recognition, etc. Being able to analyze natural language and help computers understand the context correctly is a very difficult task. Unlike programming languages, natural language does not have many concrete rules or logic; it tends to rely more on the feelings of the speaker or writer. This causes certain difficulties for computers, which cannot understand human emotions. Besides that, the diversity of semantics and contexts is also a source of difficulty for computers when processing natural language, and if we do not clarify this, the computer may misunderstand what the author wants to convey.
In NLP, there are two aspects that we should be concerned about: Natural Language Understanding and Natural Language Generation.
With Natural Language Understanding (NLU), there are 4 steps, described below:
- Morphological Analysis: This step identifies, analyzes, and describes the structure of the given words. Word segmentation and part-of-speech tagging (POST) are typical problems in this field.
- Syntactic Analysis (parsing): This is the process of analyzing a series of symbols, which may be presented in the form of natural language or computer language, following a formal grammar. Some formal grammars often used in NLP are context-free grammar, combinatory categorial grammar, and dependency grammar. Typical algorithms are the Cocke-Younger-Kasami algorithm, Earley, Chart, etc.
- Semantic Analysis: To understand the meaning of linguistic input, NLP uses this step to link semantic constructs, from the phrase, clause, sentence, and paragraph level to the whole-article level, with their independent meanings.
- Discourse Analysis: This step is applied to study the relationship between language and context of use (including human or object identity). This step also studies how discourse is constituted and how the listener understands who is talking to him/her.
Background and theory
Word Embedding
Word embedding is the name of a set of models in NLP. Word embedding converts words to numbers for the computer to understand. Its job is to find synonyms and relationships between words in a sentence and help the computer understand the input text. There are 2 basic models of word embedding:
- Frequency-based embedding
- Prediction-based embedding
Frequency-based embedding is based on the frequency with which a word appears in documents to create a vector. There are 3 popular techniques of this model:
- Count vector
- TF-IDF vector
- Co-occurrence matrix
The count vector technique counts the appearances of each word in a document and creates a DxN matrix, where D is the number of documents in a set and N is the number of unique words (not including stop words) in the set. Each value in the matrix is the number of times a word appears in a document.
For example, we have 2 documents:
D1: “Sara goes to the park.”
D2: “Sara invites Jane to the park.”
The word "Sara" appears one time in both documents, so the value in column "Sara" for rows D1 and D2 is 1. We count the same way for the other words; Table 2.3 is the count vector of D1 and D2.
Table 2.3: Count vector of D1 and D2

     | Sara | goes | invites | Jane | park
D1   |  1   |  1   |    0    |  0   |  1
D2   |  1   |  0   |    1    |  1   |  1
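For readers who want to reproduce Table 2.3, the sketch below uses scikit-learn's CountVectorizer; the tool choice is our own for illustration, since the thesis does not name an implementation (note that it lowercases tokens by default).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Sara goes to the park.", "Sara invites Jane to the park."]
vectorizer = CountVectorizer(stop_words="english")  # drops "to", "the", ...
X = vectorizer.fit_transform(docs)                  # D x N count matrix

print(vectorizer.get_feature_names_out())  # ['goes' 'invites' 'jane' 'park' 'sara']
print(X.toarray())                         # [[1 0 0 1 1]
                                           #  [0 1 1 1 1]]
```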
The Term Frequency - Inverse Document Frequency (TF-IDF) vector can classify better than the count vector because it does not just count the appearances of a word in a document but also calculates its weight. Term frequency calculates the frequency with which a term appears in a document:

tf(t, d) = f(t, d) / Σ_{t'∈d} f(t', d)

- f(t, d): the number of times term t appears in document d
- Σ_{t'∈d} f(t', d): the total number of terms in the document
IDF (inverse document frequency) calculates the weight of a term. If a term appears in many documents, its weight is reduced, because that term does not help distinguish one document from the others.
idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|)

- |D|: the number of documents in set D
- |{d ∈ D : t ∈ d}|: the number of documents in set D that contain term t
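The two formulas above translate into code directly. The following sketch (our own illustration, with hypothetical helper names) computes TF-IDF weights for the example documents D1 and D2; note how "sara", which occurs in every document, receives weight 0.

```python
import math
from collections import Counter

def tf(term, doc):
    counts = Counter(doc)                       # f(t, d) over all terms
    return counts[term] / sum(counts.values())  # f(t, d) / sum of f(t', d)

def idf(term, corpus):
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)   # log(|D| / |{d : t in d}|)

corpus = [["sara", "goes", "park"], ["sara", "invites", "jane", "park"]]
print(tf("sara", corpus[0]) * idf("sara", corpus))  # 0.0 ("sara" is in every doc)
print(tf("goes", corpus[0]) * idf("goes", corpus))  # ~0.231 (distinctive for D1)
```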
The co-occurrence matrix counts how many times a word w2 appears around (before or after) a word w1 within a window of a certain length.
To understand this better, let us have a look at an example. We have 2 documents in a corpus:
- D3: "A monkey jumping on a bed"
- D4: "A monkey fall out of a bed"
From D3 and D4 we can create a co-occurrence matrix with all the unique words. Assume the window length = 1; this is the number of words we will consider around the focus word.
- Context words (for the focus word "a"):
  - "monkey", "bed" => words after the focus word
  - "on", "of" => words before the focus word
In Figure 2.1, we can see that the words "monkey" and "bed" appear after the word "a" twice, one time in D3 and one time in D4. Therefore, the values in the co-occurrence matrix at column "monkey" (and column "bed") of row "a" are 2. Similarly for the words "on" and "of": the columns corresponding to "on" and "of" in the row of the focus word "a" in Table 2.4 are increased by 1.
Figure 2.1: Visualizing the count of the first iteration in D3 and D4
- Context words (for the focus word "monkey"):
  - "jumping", "fall" => words after the focus word
  - "a" => word before the focus word
After doing this process for all the words in D3 and D4, we get the matrix shown in Table 2.4.
Table 2.4: Co-occurrence matrix of D3 and D4
Focus \ Context |  a  | monkey | jumping | on | bed | fall | out | of
a               |  0  |   2    |    0    |  1 |  2  |  0   |  0  |  1
monkey          |  2  |   0    |    1    |  0 |  0  |  1   |  0  |  0
jumping         |  0  |   1    |    0    |  1 |  0  |  0   |  0  |  0
on              |  1  |   0    |    1    |  0 |  0  |  0   |  0  |  0
bed             |  2  |   0    |    0    |  0 |  0  |  0   |  0  |  0
fall            |  0  |   1    |    0    |  0 |  0  |  0   |  1  |  0
out             |  0  |   0    |    0    |  0 |  0  |  1   |  0  |  1
of              |  1  |   0    |    0    |  0 |  0  |  0   |  1  |  0
The co-occurrence matrix needs a huge amount of memory to store and process. To reduce the size of the matrix, we can remove stop words or use the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) methods.
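As a sketch of how such a matrix can be built (our own illustration, not the thesis' code), the function below counts context words within a given window and reproduces the values discussed above.

```python
from collections import defaultdict

def cooccurrence(docs, window=1):
    """Count how often each context word appears within `window` positions
    before or after each focus word, across all documents."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in docs:
        for i, focus in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[focus][tokens[j]] += 1
    return counts

d3 = "a monkey jumping on a bed".split()
d4 = "a monkey fall out of a bed".split()
m = cooccurrence([d3, d4])
print(m["a"]["monkey"], m["a"]["bed"], m["monkey"]["fall"])  # 2 2 1
```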
In 2013, Tomas Mikolov, leader of a research team at Google, presented the Word2Vec technique [8]. This technique is a combination of 2 methods: Skip-gram and CBOW (Continuous Bag of Words).
Figure 2.2: The Skip-gram model architecture (Source: "Word2Vec: Skip-Gram")
Given a single word, skip-gram predicts the surrounding words. The architecture of the skip-gram model is shown in Figure 2.2; it is a simple neural network containing 3 layers (an input layer, a hidden layer, and an output layer). We have a vocabulary V that contains all the unique words in our dataset, and each unique word is represented by a one-hot encoding. The input vector is a [V, 1]-dimensional vector (V is the number of unique words in the vocabulary). There are 2 weight matrices in this model: the Input-Hidden Weights Matrix (VxN) between the input layer and the hidden layer, and the Hidden-Output Weights Matrix (NxV) between the hidden and output layers (N is the number of neurons in the hidden layer). These matrices are generated automatically. The input vector is multiplied by the Input-Hidden Weights matrix to get the hidden activation vector; this result is then multiplied by the Hidden-Output Weights matrix, and the softmax function is applied to get a probability vector. This probability vector is compared with the expected output vector to calculate the error between them. The hidden activation is computed as:

h = W_input^T · x ∈ R^N
Below is the formula of the softmax function as defined by E. Kim [9]:

p(w_context | w_center) = exp(W_output(context) · h) / Σ_{i=1..V} exp(W_output(i) · h)

- p(w_context | w_center) denotes the conditional probability of a word being a context word of the given center word.
- W_output(i) is the i-th column of the output embedding matrix.
- h is the hidden activation vector of size (Nx1).
After getting the error vector, we calculate the gradients of the input and output weight matrices with the formulas below:

∂J/∂W_output = h ⊗ e,  where e = Σ_{c=1..C} e_c

∂J/∂W_input = x ⊗ (W_output · e)

Here the symbol ⊗ denotes the outer product, W_output is the Hidden-Output Weights matrix, h is the hidden activation vector, and e_c is the prediction-error vector of the c-th context word, so Σ_{c=1..C} e_c equals the total error vector; x is the one-hot encoded input vector. The outer product of vectors u and v equals the multiplication u · v^T (v^T is the transpose of v). For the model to produce more correct results, we need to update the input and output weight matrices (these two matrices were generated randomly). The new weight matrix is updated by the following formula:

W_input(new) = W_input(old) - η · ∂J/∂W_input
η is the learning rate, which is a hyperparameter and often takes a positive value in the range [0, 1]. We keep updating the weight matrices until the error vector converges to 0.
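To tie the formulas above together, here is a single skip-gram training step in NumPy. It is a minimal sketch under assumed sizes and random initialization, not the thesis' implementation.

```python
import numpy as np

V, N, eta = 8, 4, 0.05            # vocabulary size, hidden size, learning rate
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # Input-Hidden weights  (V x N)
W_out = rng.normal(size=(N, V))   # Hidden-Output weights (N x V)

def softmax(u):
    exp_u = np.exp(u - u.max())
    return exp_u / exp_u.sum()

center, context = 0, [1, 4]       # word indices for one training sample
h = W_in[center]                  # hidden activation = W_input^T x  (shape N)
y_pred = softmax(h @ W_out)       # probability vector over the V words

# Total error: sum over context words of (prediction - one-hot target)
e = np.zeros(V)
for c in context:
    target = np.zeros(V)
    target[c] = 1.0
    e += y_pred - target

grad_in = W_out @ e               # dJ/dW_input for the center word's row
W_out -= eta * np.outer(h, e)     # dJ/dW_output = h (outer product) e
W_in[center] -= eta * grad_in     # gradient-descent update with rate eta
```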
CBOW (Continuous Bag of Words) is the opposite model of Skip-gram; its architecture is shown in Figure 2.3. It predicts the center word from given context words (the words surrounding the center word). For the CBOW model with many inputs, the steps are the same as for Skip-gram, except that the hidden activation vector equals the average of all the input vectors multiplied by the Input-Hidden weight matrix.
Figure 2.3: The Continuous Bag of Words model architecture (Source: "Introduction to Word Embedding and Word2Vec" - Dhruvil Karani, 2018)
The advantage of CBOW is that it takes less time to train than Skip-gram and gives better accuracy for frequent words. The downside of CBOW is that it represents words that are spelled the same but have different meanings with the same vector.
FastText is a word embedding model introduced in 2016 by Facebook. It can be considered an upgraded version of Word2Vec. FastText separates a word into n-grams (sub-words) and represents each gram as a vector. By using this technique, fastText can represent a word that is not in its vocabulary. Take the word "gastroenteritis" with n-grams = 3. First, we add angular brackets to denote the beginning and end of the word. Secondly, we slide a window of 3 characters over the word, from the beginning angular bracket to the end angular bracket, shifting the window one character at a time. Figure 2.4 shows all 3-grams of the word "gastroenteritis".
Figure 2.4: The word "gastroenteritis" split into n-grams in fastText.
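The n-gram extraction itself is easy to sketch; the helper below (our own illustration) reproduces the windows shown in Figure 2.4.

```python
def char_ngrams(word, n=3):
    """Split a word into fastText-style character n-grams, with angular
    brackets marking the beginning and end of the word."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("gastroenteritis"))
# ['<ga', 'gas', 'ast', 'str', 'tro', 'roe', 'oen', 'ent', 'nte',
#  'ter', 'eri', 'rit', 'iti', 'tis', 'is>']
```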
System design and implementation
Constructing and pre-processing data
During the formation and development of mankind, text has always played an extremely important role. It is the main tool for people to store the knowledge, inventions, and innovations of humanity and pass them on to the next generations. For that reason, helping the computer understand the knowledge and content contained in existing documents is essential for future development. However, a writing system is an unstructured data type, which makes it impossible for computers to understand text the way a human does. Therefore, before we want the computer to be able to understand the content of the given arithmetic word problems, we must convert those problems into a "language" that the computer can understand and process. In this research, we will build a pipeline dedicated to NLP to transform the problems into multidimensional vectors that computers can process. Figure 3.2 illustrates the construction of this pipeline.
As we can see in Figure 3.2, we initially create a data set to train and test the model. After having enough data, we start processing it by standardizing Vietnamese punctuation and formatting. Next, the standardized data is transferred to the word segmentation phase to separate words according to Vietnamese standards. Finally, the entire data set is embedded, converting it into vectors that the computer can understand and process. Below, we explain each step of this section in detail.
3.2.1 Identifying domain and generating related data
As mentioned, our current goal in this thesis is to solve arithmetic word problems for students at the elementary level, so constructing a dataset that includes the common types of math at this level is essential. Through surveys and practical research, we have synthesized 5 types of math that are currently commonly taught:
- Change in: Given Owner 1 owns x object(s) and Owner 2 owns y object(s), a number of object(s) are transferred from Owner 2 to Owner 1. Find the new quantity of object(s) of Owner 1 and Owner 2.
  Example: "An có 12 cái kẹo, Bình có 9 cái kẹo. Bình cho An 5 cái. Hỏi mỗi người có bao nhiêu cái kẹo?" - "An has 12 candies, Binh has 9 candies. Binh gives An 5 candies. How many candies does each have?"
- Change out: Given Owner 1 owns x object(s) and Owner 2 owns y object(s), a number of object(s) are transferred from Owner 1 to Owner 2. Find the new quantity of object(s) of Owner 1 and Owner 2.
  Example: "Linh có 17 cái bánh, Hằng có 5 cái bánh. Linh cho Hằng 5 cái bánh. Hỏi mỗi người có bao nhiêu cái bánh?" - "Linh has 17 cakes, Hang has 5 cakes. Linh gives Hang 5 cakes. How many cakes does each have?"
- Combine: Given Owner 1 owns x object(s) and Owner 2 owns y object(s), find the total number of objects of Owner 1 and Owner 2.
  Example: "Thầy Phúc có 10 cuốn sách, anh Hùng có 3 cuốn sách. Hỏi thầy Phúc và anh Hùng có tổng cộng bao nhiêu cuốn sách?" - "Mr. Phuc has 10 books, Hung has 3 books. How many books do Mr. Phuc and Hung have together?"
- Increase: Given Owner 1 owns x object(s) and receives y object(s) more, find the new quantity of object(s) of Owner 1.
  Example: "Hà có 6 viên bi, sau đó nhận thêm 6 viên bi. Hỏi Hà có bao nhiêu viên bi?" - "Ha has 6 marbles, then receives 6 more marbles. How many marbles does Ha have?"
- Decrease: Given Owner 1 owns x object(s) and loses y object(s), find the new quantity of object(s) of Owner 1.
  Example: "Lớp 12A1 có 30 em học sinh, sau đó 5 em chuyển đi. Hỏi lớp 12A1 còn bao nhiêu em học sinh?" - "Class 12A1 has 30 students, then 5 students transfer away. How many students does 12A1 have left?"
As you can observe from the given examples, each of them follows a format that contains special keywords for each form of math.
For example, in the example of the "Change in" form, the number of subjects is always 2 or more. At the same time, that example contains the word "cho", which means "give" or "pass" in English, indicating a transfer from Owner 2 to Owner 1. The "Change out" form has the same format and the same keyword, but the transfer goes from Owner 1 to Owner 2. Of course, with human ability, we can completely understand the content and requirements of the problem, and distinguish the two types of math "Change in" and "Change out" through the keywords and the positions of the subjects. However, computers are not capable of understanding such knowledge, as we mentioned above. That is why we need the next processing steps to support the computer.
The second processing step in this pipeline is to standardize words and punctuation according to the Vietnamese standard. In this processing step, there are two normalization issues that we need to keep in mind when working with Vietnamese: the first is Unicode standardization for Vietnamese, and the other is the standardization of the accented typing method.
Regarding Vietnamese Unicode standardization, two fairly popular Unicode forms are in use today: combining (decomposed) Unicode and precomposed (built-in) Unicode. To the human eye, characters typed in these two forms normally look identical. However, when analyzed by a computer, they lead to the case shown in Figure 3.3.
Figure 3.3: Comparison of combining Unicode and precomposed Unicode
In Figure 3.3, the first word "hiếu" is typed using combining Unicode, and when compared to itself, the computer returns True. However, if we compare it to the second word "hiếu", typed using precomposed Unicode, it returns False. With the computer's current treatment, these two typing methods produce strings of two different lengths, which is why the comparison returns False. You can read more about this problem and the construction of Vietnamese Unicode on Sourceforge.net [11].
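Python's standard unicodedata module illustrates this behaviour and its fix; the snippet below is a minimal example showing the length mismatch and how NFC normalization makes the two encodings compare equal.

```python
import unicodedata

composed = "hi\u1ebfu"           # "hiếu" with one precomposed code point (ế)
combining = "hie\u0302\u0301u"   # "hiếu" as e + combining circumflex + acute

print(composed == combining)                                # False
print(len(composed), len(combining))                        # 4 6
print(composed == unicodedata.normalize("NFC", combining))  # True
```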
There are many methods and libraries available to solve this problem easily. In this study, we use the regex library instead of the built-in re library to increase accuracy and convenience for Vietnamese natural language. With the re library, we need to define a string of characters written in precomposed Unicode in advance; when any Vietnamese text is matched against that regex string, the words in the sentence can then be adjusted. With the regex library (an extension of re), we do not need to define the regex string in advance: the library adjusts words to the correct form regardless of the language.
The second problem in Vietnamese standardization is related to the standardization of accent marks. Currently, there are at least two perspectives on where to place accent marks, each with certain strengths and weaknesses; they are classified as "new" and "old". The following table lists the scenarios in which the two ways of placing accent marks differ, according to Wikipedia [12]:
Table 3.1: Old and new accent mark positions
Old way              | New way
òa, óa, ỏa, õa, ọa   | oà, oá, oả, oã, oạ
òe, óe, ỏe, õe, ọe   | oè, oé, oẻ, oẽ, oẹ
ùy, úy, ủy, ũy, ụy   | uỳ, uý, uỷ, uỹ, uỵ
Table 3.1 shows the difference between the old-style and new-style accent mark positions. In the old style, the accent mark is placed on the first vowel, while in the new style, it is placed on the second vowel.
Similar to Unicode standardization, the difference between the old and new styles of placing accent marks results in the computer being unable to recognize two words that are semantically identical but differ in accent mark position. To overcome this shortcoming, we decided to convert all words written in the new style back to the old style. For the time being, we use hand-programmed functions to handle this problem, which will be discussed in more detail in Chapter 5; a small sketch of such a function is shown below.
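This is a minimal sketch of a hand-written conversion function (our own illustration, not the thesis' code); the replacement table covers only the pairs from Table 3.1 and would need to be extended for full coverage.

```python
# New-style -> old-style accent positions, per Table 3.1.
NEW_TO_OLD = {
    "oà": "òa", "oá": "óa", "oả": "ỏa", "oã": "õa", "oạ": "ọa",
    "oè": "òe", "oé": "óe", "oẻ": "ỏe", "oẽ": "õe", "oẹ": "ọe",
    "uỳ": "ùy", "uý": "úy", "uỷ": "ủy", "uỹ": "ũy", "uỵ": "ụy",
}

def to_old_style(text: str) -> str:
    """Rewrite new-style accent placements into the old style."""
    for new, old in NEW_TO_OLD.items():
        text = text.replace(new, old)
    return text

print(to_old_style("hoà bình"))  # "hòa bình"
```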
After standardizing the arithmetic word problems according to Vietnamese standards, we proceed to Vietnamese tokenization.
In English and many other languages, a text is usually composed of many single words separated by spaces. Hence, word tokenization is often described as the process of separating a large sample of text into many single words. In NLP, this is a compulsory process, especially for classification or prediction. However, this treatment is not suitable at all for the structured word organization of Vietnamese. In Vietnamese, word units can consist of single words (containing only one syllable) or compound words (consisting of 2 syllables or more). A computer cannot by itself determine what is a single word and what is a compound word, which can lead to many processing errors if it treats everything as single words by default.
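As an illustration of Vietnamese word segmentation, the sketch below assumes the vncorenlp Python wrapper and a locally downloaded VnCoreNLP-1.1.1.jar; the exact API and output format may differ between versions.

```python
# Word segmentation with VnCoreNLP -- a sketch; paths are assumptions.
from vncorenlp import VnCoreNLP

annotator = VnCoreNLP("VnCoreNLP-1.1.1.jar", annotators="wseg",
                      max_heap_size="-Xmx500m")

sentences = annotator.tokenize("Lớp 12A1 có 30 em học sinh.")
print(sentences)
# e.g. [['Lớp', '12A1', 'có', '30', 'em', 'học_sinh', '.']]
# Compound words such as "học sinh" come back joined by an underscore.
```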
Constructing model
After collecting and processing the required data set, the next step is to build and train the model. Since the results of this step affect all subsequent steps as well as the accuracy of the system, this can be considered the most important part. In the training system, we try 2 types of models: the LSTM model and the GRU model. Both are explained quite carefully and fully in Section 2.6. In this section, we describe the structure and components of these two models using the Keras library.
Figure 3.8: Top Python Libraries for Deep Learning, Natural Language Processing & Computer Vision (Source: KDnuggets)
Talking a little bit about the Keras library: this is an open-source Python library, designed at Google, for the artificial neural network field. In recent years, Keras has always been among the most widely used libraries for DL, besides TensorFlow and PyTorch [14]. Here are the reasons to choose Keras as a library for DL models:
- Keras prioritizes the programmer's experience.
- Keras has been widely used in the enterprise and research community.
- Keras makes it easy to turn designs into products.
- Keras supports training on multiple distributed GPUs.
- Keras supports multiple backend engines and does not limit you to one ecosystem.
With the support of the Keras library, we can easily define the structure of the LSTM and GRU models, as shown in Figure 3.9.
Figure 3.9: Structure of LSTM model
Figure 3.9 illustrates the structure of our LSTM model. Here we have a total of 7 layers: 1 input layer, 1 layer responsible for reshaping the input data, one LSTM layer, and 4 Dense layers acting as normal neural network layers (including the final output layer). A minimal Keras sketch follows this list. Details about the layers are as follows:
- Input layer: This layer receives the model's input, in this case 100-dimensional vectorized problems.
- Reshape: This layer is responsible for changing the shape of the LSTM layer's input, so that the input matches the shape expected by the network structure.
- LSTM: This layer is mainly responsible for modelling the sequential data. Details about LSTM were introduced in Section 2.6.3.
- Dense: This is the most basic layer in a neural network. A Dense layer connects the output of the previous layer to its unit neurons and then provides the output to the next layer.
- Output layer: This is also a Dense layer, but its output has 5 units corresponding to the 5 labels of the math types.
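A minimal Keras sketch of this architecture is given below. The layer widths are our reading of the partially legible Figure 3.9 and may not match the exact configuration; for the GRU variant, the LSTM layer would simply be replaced by layers.GRU.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(100,)),              # 100-dimensional problem vector
    layers.Reshape((10, 10)),               # reshape into a 10-step sequence
    layers.LSTM(512),                       # sequence-modelling layer
    layers.Dense(512, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(5, activation="softmax"),  # 5 labels, one per math type
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```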
Besides, each Dense layer has an activation function. Here we use two types of activation functions: "ReLU" and "softmax".
ReLU: This is a simple activation function that has been used very widely in recent years to train neural networks. The purpose of this function is to mitigate the vanishing gradient problem, making the model learn faster and better. The formula for the ReLU function is as follows: f(x) = max(0, x)
Figure 3.10: Graph of the ReLU function (Source: AJ Curious)
According to the above formula and Figure 3.10, the way the ReLU function works is quite simple: it removes all negative values, replacing them with 0, while keeping positive values unchanged.
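In NumPy, the whole function is one line; here is a tiny check (our own illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # elementwise max(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```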