
Graduation Thesis: Applying Knowledge Graph and BERT for Vietnamese Triple Classification


DOCUMENT INFORMATION

Basic information

Title: Applying Knowledge Graph and BERT for Vietnamese Triple Classification
Authors: Pham Binh An, Nguyen Huy Cuong
Advisor: Associate Professor Do Phuc
University: University of Information Technology
Major: Information Systems
Document type: Bachelor of Engineering Thesis
Year: 2021
City: Ho Chi Minh City
Format
Pages: 85
File size: 42.41 MB

Structure

  • 1.2 The problem and its significance
  • 1.3 Motivation
  • 1.4 Contributions
  • Chapter 2 RELATED WORKS
    • 2.1 Overview
    • 2.2 KG-BERT: BERT for Knowledge Graph Completion
    • 2.3 Google Knowledge Graph
    • 2.5 Neo4j
    • 2.6 Knowledge Graph Embedding
  • Chapter 3 METHODS
    • 3.1 Triple Classification
    • 3.2 Graph Database
      • 3.2.1 Basics
      • 3.2.2 Deeper matters
      • 3.2.3 Query language
      • 3.2.4 Use case
    • 3.3 Knowledge Graph
    • 3.4 Transfer learning
    • 3.5 BERT
      • 3.5.1 Overview of BERT
      • 3.5.2 Evolution of BERT
      • 3.5.3 BERT architecture and components
      • 3.5.4 BERT pre-training
      • 3.5.5 BERT Embedding
      • 3.5.6 Self-attention mechanism
      • 3.5.7 Position-wise Feed-Forward Networks

Content

VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
ADVANCED PROGRAM IN INFORMATION SYSTEMS

APPLYING KNOWLEDGE GRAPH AND BERT FOR VIETNAMESE TRIPLE CLASSIFICATION

Motivation

We are living in the world of Artificial Intelligence (AI); machines can help people communicate with those who are handicapped or incapable of using ordinary language, among others. To assist them, machines must understand human language in many formats such as text, voice and gesture. In our project, textual processing is addressed in the field of NLP. In many prevalent languages this can be supported by large public datasets, but the process is harder for Vietnamese. We are building a system that helps a machine understand the Vietnamese language at a relatively acceptable level of accuracy.

RELATED WORKS


Overview

Our task is to train a machine to understand Vietnamese in human language form. To do that, we combine the abilities of BERT and the Knowledge Graph, which will be introduced through the KG-BERT solution [1] and related works that apply a Knowledge Graph with Neo4j [2] as our backbone. An open-source tool named Doccano [3] is also used in the triple labeling process.

2.2 KG-BERT: BERT for Knowledge Graph Completion

Determining whether an unseen triple fact (h, r, t) exists or not is a challenge in the field of Language Understanding. To deeply understand a paragraph, the paper [1] applied the pre-trained model BERT to give a scoring function for the triple in the KG-BERT language model and achieve state-of-the-art performance. Their tasks included triple classification, link prediction and relation prediction. They showed that even with a limited portion of the original training dataset they can obtain high accuracy, specifically a test accuracy of 88.1% with 5% of the training triples and 87% with only 10% of the training triples.
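The following is a minimal sketch (not the authors' exact code) of the KG-BERT idea: the triple (h, r, t) is rendered as one text sequence and scored with a BERT sequence-classification head. The checkpoint name, the plain "[SEP]" packing and the example triple are assumptions for illustration; the head is only meaningful after fine-tuning on labeled triples.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # hypothetical choice of checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def score_triple(head: str, relation: str, tail: str) -> float:
    """Return P(triple is true) from the classification head."""
    text = f"{head} [SEP] {relation} [SEP] {tail}"      # pack h, r, t into one sequence
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits                  # shape (1, 2)
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(score_triple("Hà Nội", "là thủ đô của", "Việt Nam"))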

Google is often referred to as the largest knowledge base of mankind. The Google Knowledge Graph, first introduced in 2012 (https://blog.google/products/search/introducing-knowledge-graph-things-not/), can be considered one of their most valuable assets. Based on all of their available sources, Google gathers data about people, animals, events, history and other topics to support their well-known search engine. The Knowledge Graph contains a countless amount of information and data; Google connects all of it to deliver more focused search results for users. In response to users' search queries, providing accurate information is Google's ultimate goal.

Knowledge Graph Base, or KGBase, is a collaborative, robust database with versioning, analytics and visualizations (https://www.kgbase.com/pages/about). This tool requires no coding but can easily visualize knowledge graphs: users just need to import CSV files or spreadsheets and make their own knowledge graph. Users have to pay for a plan, starting with a 7-day free trial.

Being a native graph database, Neo4j is built from the ground up to store not only data but also data relationships. Neo4j connects data as it is stored, enabling queries never before envisioned, at speeds never thought possible (https://neo4j.com/). As opposed to traditional databases, which arrange data in rows, columns and tables, Neo4j has a flexible structure defined by stored relationships between data records. Neo4j is commonly applied in network and IT management, crime investigation, the ICIJ Panama Papers, movie recommendation, etc.

METHODS

Triple Classification

Triple classification is one of the most fundamental tasks in NLP; its purpose is to judge whether a triple exists in the context or not. For example, an input triple (h, r, t) is checked against a paragraph in which there is no sentence or phrase that is explicitly identical to the input.

Figure 3.1: Illustration of triple classification task

As demonstrated in Figure 3.1, if the triple is classified as correct given the ambiguous context, it is labeled true (numerically 1); otherwise it is labeled false (numerically 0).
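As a toy illustration of the task's input and output format (the triples below are made up, in the spirit of the Quang Nam example used later), each (head, relation, tail) drawn from a paragraph receives a binary label:

labeled_triples = [
    (("Quảng Nam", "có món ăn đặc sản", "mì Quảng"), 1),   # supported by the context
    (("Quảng Nam", "có món ăn đặc sản", "phở bò"), 0),     # not supported
]
for (h, r, t), label in labeled_triples:
    print(f"({h}, {r}, {t}) -> {label}")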

A graph, formally, is a set of nodes (vertices) and the relationships (edges) that link them. As a very simple instance, in Figure 3.2 the two nodes 'football' and 'sports' possess the relationship 'is a type of' pointing from 'football' to 'sports'.

Distinct from traditional database management systems, relationships take first-class importance in graph databases. Mechanisms like foreign keys or out-of-band processing, such as MapReduce, are not needed to model the data connections in your application.

Many mature corporations in the world are now using graph databases, such as Walmart, eBay and Pitney Bowes. Among graph database technologies, Neo4j is at the top of the industry as the most native in terms of both graph processing and graph storage [5].

Figure 3.3: An overview of the graph database space, from Graph Databases, 2nd Edition by O'Reilly and Neo4j

In Figure 3.3, Neo4j is leading in both graph processing and graph storage.

There are two bread-and-butter properties to grasp:
  • The underlying storage: using non-native storage such as a relational or object-oriented database makes storage slower; native graph storage, designed specifically for graph databases, is the better choice.
  • The processing engine: the substantial performance advantage of index-free adjacency, or native graph processing, comes from connected nodes physically "pointing" to each other in the database. Non-native processing may use other means to run CRUD (Create, Read, Update, Delete) operations, but with less efficiency.

It is worthwhile to note that native graph processing and native graph storage are neither good nor bad; they are ordinarily classic engineering trade-offs [6].
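A toy sketch of why index-free adjacency is fast: each node keeps direct references to its neighbours, so a traversal follows pointers instead of performing an index or join lookup per hop. The class and node names are illustrative only, not part of any Neo4j internals.

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []      # direct references to connected nodes

    def connect(self, other):
        self.neighbours.append(other)

football, sports = Node("football"), Node("sports")
football.connect(sports)          # football -[is a type of]-> sports

# One hop is one pointer dereference, independent of the total graph size.
print([n.name for n in football.neighbours])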

The graph model gives you the green light to put in new nodes and relationships while keeping the original nodes and relationships intact.

A test between a relational database and a Neo4j graph database was run to compare the two sides. In the table below, we can see that as the required depth of connectedness increased, the graph database's performance became more and more superior to the relational one's.


Depth | RDBMS execution time (s) | Neo4j execution time (s) | Records returned

Figure 3.4: An experiment run between RDBMS and Neo4j from Graph Databases for Beginners by Neo4j

In Figure 3.4, Neo4j shows better performance compared to the traditional database management system.
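A hedged sketch of the kind of "depth of connectedness" query behind Figure 3.4, using the official neo4j Python driver. The connection URI, credentials and the :Person/:KNOWS data model are placeholders, not values from the thesis.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def friends_up_to_depth(name: str, depth: int) -> int:
    # Variable-length pattern [:KNOWS*1..depth]: one graph traversal instead of
    # a chain of relational joins growing with the depth.
    query = (
        f"MATCH (p:Person {{name: $name}})-[:KNOWS*1..{depth}]->(friend) "
        "RETURN count(DISTINCT friend) AS reachable"
    )
    with driver.session() as session:
        return session.run(query, name=name).single()["reachable"]

print(friends_up_to_depth("An", 3))
driver.close()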

Figure 3.5: Comparison between SQL and NoSQL by Trilochan Parida (https://twitter.com/TechParida/status/1332348459043549184)

As the SQL and NoSQL comparison in Figure 3.5 shows, with its dynamic handling of key-value and unstructured data, NoSQL is a better and more optimal choice than SQL for handling data on present-day mobile devices.

Cypher is the best-known graph database query language. Cypher is intended to be easy to learn for SQL veterans while also being effortless for newcomers. At the same time, Cypher is different enough to emphasize that we are working with graphs, not relational sets.

Cypher intuitively follows the way we draw graphs, as below:

(An)-[:KNOWS]->(Cuong)

The most straightforward queries consist of a MATCH clause followed by a RETURN clause, as in the example below:

MATCH (a:Person {name: 'An'})-[:KNOWS]->(b)-[:KNOWS]->(c)
RETURN a, b, c

Other clauses include WHERE, CREATE, CREATE UNIQUE, MERGE, DELETE, REMOVE, SET, ORDER BY, SKIP, LIMIT, FOREACH, UNION and WITH.

A single query in SQL can be many lines longer than the same query in a graph database query language like Cypher, which is a prime example of Cypher's productive mapping from natural language to query. This well-liked language is not the only graph database query language; different graph databases have their own means of querying data as well.
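Purely for illustration, the same "friends of friends" question phrased in SQL (over a person/friendship schema assumed here) and in Cypher; both are held in plain strings so the length difference is easy to see.

sql_query = """
SELECT DISTINCT c.name
FROM person a
JOIN friendship ab ON ab.person_id = a.id
JOIN person b      ON b.id = ab.friend_id
JOIN friendship bc ON bc.person_id = b.id
JOIN person c      ON c.id = bc.friend_id
WHERE a.name = 'An';
"""

cypher_query = """
MATCH (a:Person {name: 'An'})-[:KNOWS]->()-[:KNOWS]->(c)
RETURN DISTINCT c.name;
"""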

With the advent of graph databases, data modeling has become much simpler. Nonetheless, you need to double-check that your data model is designed productively for your exact use case.

Graph databases can be applied to predictive modeling with some basic search algorithms, namely Depth-First and Breadth-First Search, Dijkstra's algorithm or the A* algorithm. Graph databases are extremely useful for making sense of big datasets in scenarios as varied as logistics route optimization, retail recommendation engines, fraud detection and social network monitoring.
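As a minimal example of the traversal idea these algorithms build on, here is a breadth-first search over a small in-memory adjacency list; the tiny logistics-style graph is made up for illustration.

from collections import deque

graph = {
    "warehouse": ["hub A", "hub B"],
    "hub A": ["store 1"],
    "hub B": ["store 1", "store 2"],
    "store 1": [],
    "store 2": [],
}

def bfs(start: str):
    visited, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order

print(bfs("warehouse"))   # visiting order from the source node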

Financial institutions and insurance organizations lose billions of dollars to fraudsters yearly, but many of them now rely on Neo4j to successfully uncover fraud rings by bringing previously hidden relationships to light.

Whether you need a solution that provides real-time recommendations, graph-based search or supply-chain management, be certain to read through all the distinct ways in which graph technology can work for your organization [7].

Knowledge Graph

The world changes every day and every hour; accessing information and data is not as difficult as it was before. This has led to the challenge of a huge increase in data, where not only storage is a problem but also how to represent the data in the best possible way. When you search for something on Google, the result may surprise you, because Google returns information related to your search, not just what you typed. For example, when I searched for 'Big Ben Tower', Google gave back the description, the year of commencement, the height, the architect and the like; then I clicked on one of the results and Google gave me some more relevant result links. This is the power of knowledge graphs. Today, with the Knowledge Graph, people have been able to access and understand data better. In such a graph:
  • A node may have zero or more labels, e.g. Person, Teacher, Worker, etc.
  • A relationship must have a type and a direction, e.g. knows, has, etc.
  • A node or relationship may have zero or more properties.


Figure 3.6: Knowledge Graph with three nodes and their properties

In Figure 3.6, a node with the properties name: 'Cuong Nguyen' and address: 'Ha Noi' has a relationship of type 'IS A COWORKER WITH' connecting it to another node with name: 'Anna' and address: 'Ha Noi'. Notably, the relationship type 'IS A COWORKER WITH' has a property since: May 2, 2020 that defines the relevance between the two nodes. 'Cuong Nguyen' -[IS A COWORKER WITH]-> 'Anna' is called a triple, with the structure (Head)-[Relation]->(Tail).

What is a Knowledge Graph? According to the Google Blog, the Knowledge Graph helps you search for anything that Google knows about (landscapes, celebrities, sports teams, buildings, geographical features, movies and more) and instantly get information related to your query. This is a critical first step towards constructing the next generation of search, which taps into the collective intelligence of the web and interprets the world a bit more like people do. Put simply, a knowledge graph is a graph that collects and connects information about objects in the real world; objects can be persons, places, groups, movies and more. Let us see an example below:

In Figure 3.7, the node 'Quang Nam' connects to two other nodes with the relations 'có danh lam thắng cảnh' (has a scenic attraction) and 'có món ăn đặc sản' (has a specialty dish).

As the volume and the required correctness of data increase day by day, a knowledge graph is built to handle these troubles. Three advantages make the knowledge graph powerful (https://neo4j.com/top-ten-reasons):
  • Performance: 'minutes to milliseconds' performance; using index-free adjacency, a graph database turns convoluted joins into rapid graph traversals, thereby maintaining millisecond performance regardless of the overall size of the dataset.
  • Flexibility: when applying a knowledge graph, IT and data architect teams can flex as applications and industries change.
  • Agility: the development team can see the agility of the graph database in today's agile environment, permitting the graph database to evolve in step with the changing industry.

Transfer learning

In machine learning and deep learning application development, three key properties that contribute to the efficiency of a model are qualified data, time and resources. Data is everywhere, but the necessary data is not always available; we are dealing with Vietnamese triple classification, and collecting enough qualified data to train a big model such as BERT really consumes time and resources. With a deep network, BERT needs a large amount of CPU and GPU power to train from scratch. That is why transfer learning becomes useful: transfer learning is a deep learning technique in which a model trained on one task is then applied to a second, related task. These methods only work well if the knowledge learnt from the first task is informative.

Know how to ride a bike --(transfer knowledge)--> Ride a motorbike
Know how to play guitar --(transfer knowledge)--> Play piano

Figure 3.8: An example of transfer learning

In Figure 3.8, we show an example of transfer learning applied in real life. People can learn to play the piano more easily if they already know another instrument in the same domain.

In the simplest terms, if two kids would like to learn to play the piano, one of whom has played guitar in the past while the other knows nothing, the first kid can learn the piano faster than the second, because knowledge of the two instruments can complement each other. After one task has been learnt, we transfer knowledge from the pre-trained model to train a new one, and can even tackle issues such as having less data for training the newer task. When using transfer learning, you must ask yourself three questions: what to transfer, when to transfer, and how to do it. In the pre-training phase a model may learn a huge amount of information from one task; a developer must decide what to transfer to the downstream task and when to transfer it, and must define the strategies to do so, because these choices affect the resulting algorithm.

Figure 3.9: Types of transfer learning
The figure distinguishes the types by whether labeled data are available in the source domain only, in the target domain, or in neither domain.

In Figure 3.9, transfer learning offers more than one strategy for developers depending on their specific situation. In our case, BERT is pre-trained on a large unlabeled plain-text corpus and applied to a labeled Vietnamese dataset, so we decided to approach the problem with Inductive Transfer Learning.
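A hedged sketch of this inductive-transfer setup: a BERT model pre-trained on unlabeled text is fine-tuned on a small labeled Vietnamese triple dataset. The checkpoint name, hyperparameters and sample triples are illustrative only; the classification head is newly initialized and learns only during fine-tuning.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)   # pre-trained encoder + new head

train_examples = [  # (triple rendered as text, label) - made-up samples
    ("Hà Nội [SEP] là thủ đô của [SEP] Việt Nam", 1),
    ("Hà Nội [SEP] là thủ đô của [SEP] Nhật Bản", 0),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for text, label in train_examples:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        batch["labels"] = torch.tensor([label])
        loss = model(**batch).loss      # cross-entropy from the classification head
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()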


Figure 3.10: Transfer knowledge from one model to another

In Figure 3.10, we show the sub-types of transfer learning; these can be applied in specific use cases.

Once a task has been trained, we cannot transfer everything to a new task; which components to transfer depends on the specific circumstances. Transferring knowledge from the source-domain task is one choice, but we only reuse the knowledge that can be incorporated into the target task to improve the performance of the application. In machine learning and deep learning, feature engineering is a must-have step before building a model; pre-training is often performed on a popular public dataset such as ImageNet, IEMOCAP or BookCorpus, which can cover most cases in its domain. Standardization and normalization are the two most used techniques in feature engineering; we can inherit the mean and standard deviation of these datasets to fit our input. To apply the model to a downstream task, the parameters are what we must transfer; they are the result of the learning process and absorb information from the pre-training phase.
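A small illustration of reusing pre-training statistics as described above: downstream inputs are standardized with the mean and standard deviation of the source dataset (the ImageNet channel statistics below are the commonly published values).

import numpy as np

imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std = np.array([0.229, 0.224, 0.225])

def standardize(image: np.ndarray) -> np.ndarray:
    """image: float array in [0, 1] with shape (H, W, 3)."""
    return (image - imagenet_mean) / imagenet_std

print(standardize(np.random.rand(4, 4, 3)).shape)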

Figure 3.11: Two methods of transfer learning

In Figure 3.11, to extract knowledge from a pre-trained model into another one, we transfer what can be combined with our data to obtain a good result.

Designing an architecture in deep learning requires considerable mathematical knowledge to ensure that it fits the dataset and produces a good result. Going through experimental stages leads people to propose models that beat the old benchmarks in the industry. Besides the properties above that can be reused, the architecture is the backbone that makes a model powerful. We can take an end-to-end architecture and fine-tune it with our dataset (this is called fine-tuning), or simply remove the last layer and use the rest as input to the new model (called a feature extractor). The choice depends on the downstream task and on the size of the dataset.


Figure 3.12: The details of fine-tuning and feature extractor

In Figure 3.12, we apply the two methods of transfer learning: feature extractor and fine-tuning. An example architecture contains four layers, conv1d -> ReLU -> max pooling -> classifier, and we can decide which parts of it will be transferred. On the left, we keep all layers except the classifier and use them as the input of the new model; because these layers were trained to capture the core information of the input, the purpose is to combine the strength of the new model with the knowledge from the previous one. On the other hand, we can design an appropriate classifier layer and train it with the target dataset. Both methods adapt the knowledge and parameters of the pre-trained model.
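A minimal PyTorch sketch of the two options in Figure 3.12, built around a toy conv1d -> ReLU -> max-pooling -> classifier network; the layer sizes and the idea that the source weights are already loaded are assumptions for illustration.

import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=8, kernel_size=3),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classifier = nn.Linear(8, num_classes)

    def forward(self, x):                      # x: (batch, 1, length)
        h = self.features(x).squeeze(-1)       # (batch, 8)
        return self.classifier(h)

pretrained = ToyNet()                          # imagine weights loaded from the source task

# Feature extractor: freeze the transferred layers, retrain only a new classifier head.
for p in pretrained.features.parameters():
    p.requires_grad = False
pretrained.classifier = nn.Linear(8, 3)        # new head for the target task

# Fine-tuning: keep everything trainable and update all weights on the target data.
finetuned = ToyNet(num_classes=3)              # would start from the same pre-trained weights
optimizer = torch.optim.Adam(finetuned.parameters(), lr=1e-4)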

Besides the need for valuable data, to clearly understand the Vietnamese language we need a model that is smart and learnable enough. Google has released a pre-trained NLP model named BERT (Bidirectional Encoder Representations from Transformers); BERT demonstrates its ability through bidirectional learning, context vectors and pre-training on a large plain-text corpus covering over 100 languages. For example, in the sentence "Hà Lan nhận bằng tốt nghiệp của cô ấy", a unidirectional model only comprehends "Hà Lan nhận bằng tốt nghiệp"; it cannot work out that "Hà Lan" refers to the same person as "cô ấy", but BERT can handle this language issue through the three techniques mentioned earlier.

In the field of Natural Language Processing, sequence data is used widely in many applications such as Question Answering, Video Captioning and Speech Emotion Recognition, among others. The fundamental architecture that addresses these problems, the Recurrent Neural Network (RNN), is introduced in [6], but this kind of neural network cannot remember information from a long sequence. An example is the next-word prediction task: the network must remember and learn information from previous words to decide which word or sentence comes next.

Long Short-Term Memory (LSTM) can be seen as a development of the traditional recurrent neural network. LSTM introduces a forget gate, an input gate and a cell state, which decide what information to keep or discard across layers [8]. However, LSTM cannot indicate which words in long sentences deserve more attention than others.

To give a representation of the input sentence, Sequence-to-Sequence models achieve better performance on the Machine Translation task [9] through the representation vector output by the encoder stage. Although this vector contains the information of the input, it is learnt over a long time, and the data flow is not as fluent as for a short sentence.

To make a big jump in Language Understanding, the attention mechanism was introduced to maximize translation performance. This mechanism tells us how one word can pay attention to other words through bidirectional learning, attention weights and context vectors. Attention is an irreplaceable piece of the Transformer, which is used to construct BERT.
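A compact sketch of the attention computation referred to above: attention weights over all positions produce a context vector for each word, following the scaled dot-product form softmax(QK^T / sqrt(d_k))V from the Transformer paper; the tensor sizes are toy values.

import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # how much each word attends to the others
    weights = torch.softmax(scores, dim=-1)         # attention weights
    return weights @ V, weights                     # context vectors, attention weights

Q = K = V = torch.randn(1, 5, 8)   # a toy sequence of 5 tokens with dimension 8
context, weights = scaled_dot_product_attention(Q, K, V)
print(context.shape, weights.shape)   # (1, 5, 8) (1, 5, 5)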

BERT was first introduced by Google [9]; its name, "Bidirectional Encoder Representations from Transformers", reveals that BERT uses a stack of Transformer encoders as its architecture. In the original Transformer paper [10], the authors proposed two significant components, the encoder and the decoder; they experimented on the machine translation task and achieved a BLEU score of 41.0, outperforming all previously published separate models while consuming less than 1/4 of the training cost of the previous state-of-the-art model. The Transformer architecture is shown in the following:

Figure 3.13: Transformer architecture from Attention Is All You Need by A. Vaswani et al.

Encoder: each Transformer contains a stack of 6 encoders. In Figure 3.13, the encoder (the left part, including the Input Embedding) is composed of 2 sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection is placed around each of the two sub-layers, and the output of each sub-layer is computed via layer normalization.
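A hedged sketch of one such encoder block: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization. The dimensions follow the base Transformer configuration (512 model size, 8 heads, 2048 feed-forward size); this is an illustration, not the paper's reference implementation.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)        # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)            # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))         # sub-layer 2: position-wise feed-forward network
        return x

layer = EncoderLayer()
print(layer(torch.randn(2, 10, 512)).shape)    # torch.Size([2, 10, 512])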
