VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
UNIVERSITY OF INFORMATION TECHNOLOGY
FACULTY OF INFORMATION SYSTEMS

QUESTION ANSWERING OVER KNOWLEDGE GRAPHS FOR COVID-19

BACHELOR OF ENGINEERING
Semantic web technologies
RDF
Resource Description Framework (RDF) is a representation language for describing information on the World Wide Web. The World Wide Web Consortium (W3C) published the initial RDF specification in 1997 [22], and it later became a W3C recommendation in 1999 [23]. Currently, the most recent version is RDF 1.1, which was released by the W3C in 2014 [18]. Figure 2.1 shows an example of an RDF triple.
RDF extends the Web's linking structure by using URIs to identify both the relationship between objects and the two ends of the link (this combination is commonly referred to as a "triple"). This straightforward model enables the mixing, exposure, and sharing of structured and semi-structured data across multiple applications. The linking structure forms a directed, labeled graph, with edges representing the named connection between two resources, which are represented by graph nodes. This graph view is the simplest mental model for RDF and is frequently used in simple visual explanations.
Figure 2.1: Resource Description Framework (RDF) triple example.
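To make the triple notation concrete, the relationship illustrated above can be written down explicitly. The following sketch uses SPARQL Update syntax and the DBpedia-style resource and property names that appear later in this thesis; the exact names shown in Figure 2.1 may differ.

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>

INSERT DATA {
  # subject: the pandemic article; predicate: the first-case property; object: the city
  dbr:COVID-19_pandemic_in_Vietnam dbp:firstCase dbr:Ho_Chi_Minh_City .
}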
SPARQL
SPARQL Protocol and RDF Query Language (SPARQL) is a structured language [20], similar to SQL, that is used for accessing and manipulating RDF graphs. The current version of SPARQL recommended by the W3C is SPARQL 1.1 [24].
RDF is a directed, labeled graph data format for expressing information on the Web. The SPARQL specification defines the syntax and semantics of the SPARQL query language for RDF. If the data is stored natively as RDF or is viewed as RDF via middleware, SPARQL may be used to express queries across many data sources. SPARQL is a powerful language for querying required and optional graph patterns, as well as their conjunctions and disjunctions. SPARQL additionally includes features such as extensible value testing and constraining queries based on the RDF graph that generated the query results. The results of SPARQL queries can be result sets or RDF graphs. The following example uses SPARQL to answer the NLQ "What is the city in Vietnam where the first case of COVID-19 occurred?". The SPARQL query is as follows:
SELECT DISTINCT ?uri WHERE {
  dbr:COVID-19_pandemic_in_Vietnam dbp:firstCase ?uri .
}
Web Ontology Language (OWL)
Developed by the World Wide Web Consortium (W3C), the Web Ontology Language (OWL) [25] is a Semantic Web language that is intended to convey rich and complex knowledge about objects, groups of objects, and relationships between objects. As a computational logic-based language, OWL allows knowledge expressed in it to be used by computer programs for a variety of purposes, including, but not limited to, verifying the coherence of that knowledge or making implicit knowledge explicit. OWL documents, known as ontologies, can be published on the Web and may reference or be referenced by other OWL ontologies. OWL is a component of the W3C's Semantic Web technology stack, which also includes RDF, RDFS, SPARQL, and other technologies.
OWL 2 [26], the current version of the Web Ontology Language, was created by the W3C OWL Working Group (which has since been closed) and released in 2009, with a Second Edition published in 2012. OWL 2 is an extension and revision of the original OWL, which was developed by the W3C Web Ontology Working Group and published in 2004. Among the deliverables that make up the OWL 2 specification is a Document Overview, which serves as an introduction to OWL 2, describes its relationship to the previous version of the specification (OWL 1), and serves as an entry point to the remaining deliverables via a Documentation Roadmap. The Second Edition of OWL 2 was published in 2012 [27].
Figure 2.2: Web Ontology Language (OWL) example¹
Linked Data
There are dates, titles, part numbers, chemical characteristics, and any other data that can be thought of on the Semantic Web; it is a Web of Data, in other
¹ http://www.cse.lehigh.edu/~heflin/IntroTOOWL.pdf
words. The set of Semantic Web technologies (RDF, OWL, SKOS, SPARQL, and so on) creates an environment in which applications may query data, draw conclusions from vocabularies, and so on.
To make the Web of Data a reality, however, it is necessary to make the massive amount of data available on the Web accessible in a standard format that can be reached and managed by Semantic Web tools. The Semantic Web requires not only access to data, but also relationships among data, in order to establish a Web of Data (as opposed to a sheer collection of datasets). Linked Data [28] is the term that refers to such a collection of interconnected datasets on the Web.
To accomplish and produce Linked Data, technologies should be available for a common format (RDF), allowing either conversion or on-the-fly access to existing databases (relational, XML, HTML, etc.). It is also important to be able to set up query endpoints in order to make it easier to retrieve that information. The W3C provides a variety of technologies for accessing the data (RDF, GRDDL, POWDER, RDFa, the forthcoming R2RML, RIF, and SPARQL).
A good example of a huge Linked Dataset is DBpedia [3], which, in essence, converts the content of Wikipedia into RDF and makes it accessible to the public.
In addition to including Wikipedia data, DBpedia also contains links to other datasets on the Web, such as Geonames. This is why DBpedia is so important. Thanks to those additional links (in the form of RDF triples), applications can leverage additional (and potentially more precise) knowledge from other datasets; by integrating facts from multiple datasets, an application can provide a significantly improved user experience.
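As a hedged illustration of how such links are followed in practice (assuming, as is commonly the case, that the DBpedia resource for Vietnam carries owl:sameAs links pointing to Geonames), a SPARQL query can retrieve the identifiers of the same entity in the other dataset:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?externalResource WHERE {
  # follow the interlinking triples from DBpedia to other datasets
  dbr:Vietnam owl:sameAs ?externalResource .
  # keep only the links that point to Geonames
  FILTER(CONTAINS(STR(?externalResource), "geonames.org"))
}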
Knowledge Graphs
A knowledge graph [29] is a set of interconnected descriptions of items, such as objects, events, or ideas. By placing data in context via linking and semantic information, knowledge graphs provide a framework for data integration, unification, analytics, and sharing, among other things.
(Figure: an example knowledge graph fragment describing the COVID-19 pandemic, linking it to Vietnam and Ho Chi Minh City.)
With regard to knowledge graphs, we are talking about a directed labeled graph in which the labels have well-defined meanings. Nodes, edges, and labels are all components of a directed labeled graph. Anything, including people, businesses, computers, and other technological devices, may function as a node. Edges connect a pair of nodes and capture the relationship of interest that exists between them, such as a friendship relationship between two people, the customer relationship between a company and an individual, or a network connection between two computers. The labels convey the meaning of the connection, such as the friendship relationship between two individuals.
Given a set of nodes N and a set of labels L, a knowledge graph is defined as a subset of the cross product N × L × N. Each member of this set is a triple with three components and may be represented graphically as illustrated in the diagram below.
The directed graph representation can be used in a number of ways, depending on the requirements of a particular application. A directed graph such as the one above, in which the nodes are people and the edges represent their friendship relationships, is also referred to as a data graph. A directed graph in which the nodes represent classes of items (for example, Book, Textbook) and the edges represent the subclass relationship is known as a taxonomy. In some data models, given a triple (A, B, C), A is referred to as the subject, B as the predicate, and C as the object.
Graph navigation may be used to perform a variety of interesting computations over graphs. For example, in the friendship knowledge graph we may navigate from a node A to all nodes B that are connected to it by a relation labeled friend, and then recursively to all nodes C that are connected by the friend relation to each of the nodes B.
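The same friend-of-a-friend navigation can be expressed declaratively in SPARQL. This is only a minimal sketch: the ex: prefix, the node ex:A, and the ex:friend property are hypothetical names used for illustration.

PREFIX ex: <http://example.org/>

SELECT DISTINCT ?friendOfFriend WHERE {
  ex:A ex:friend ?b .               # all nodes B connected to A by a friend edge
  ?b ex:friend ?friendOfFriend .    # and, recursively, the friends of each B
}
# The same navigation can be written more compactly with a property path:
#   ex:A ex:friend/ex:friend ?friendOfFriend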
If the graph G contains a path, the path consists of a sequence of nodes v_1 through v_n, where there is an edge from v_i to v_{i+1} for every i with 1 ≤ i < n.

[...] and should be supplied in the format YYYY-MM-DD unless otherwise specified. In order to explain more about the generation process, we introduce some clauses that can be used in SPARQL queries.
The FILTER keyword is used to restrict the results that are returned to those matching the graph pattern specified immediately before the FILTER keyword itself. The FILTER keyword prefixes an expression that may be composed of a broad range of operators, classified according to the number of parameters being manipulated. It is possible to use unary operators such as "!" to express the logical NOT and "BOUND" to check whether a variable is bound to a certain value. A question such as "Did Vietnam overcome the COVID-19 pandemic?" may be answered with the use of these operators, which can be used alone or in combination. In this
example, the solution can be obtained by simply checking whether the object of the relation res:COVID-19 dbo:overcomeDate ?date does not exist in the KG (i.e., FILTER(!BOUND(?date))). The binary operators AND "&&" and OR "||" are also included, as are all of the comparison operators between two values (such as ==, !=, >, and so on). These binary operators, used in conjunction with the logical connectives AND "&&" and OR "||", answer questions such as "Which nations have more than ten thousand COVID-19 cases?" by building a corresponding SPARQL query. The REGEX operator, a ternary operator that checks whether a given text (first argument) matches a given regular expression (second argument), is also included; a case-insensitive pattern match can be specified with the optional third argument. This ternary operator can be used, for example, to determine whether the name "The Socialist Republic of Vietnam" appears within the label of the resource dbr:Vietnam, and thus to answer questions such as "Is Vietnam's official name The Socialist Republic of Vietnam?".
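The two sketches below show how these operators appear in actual queries. They are illustrative only: dbo:overcomeDate and the resource res:COVID-19 are the hypothetical names from the example above, and dbr:Vietnam is assumed to carry the official name as one of its rdfs:label values.

PREFIX res:  <http://dbpedia.org/resource/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# "Did Vietnam overcome the COVID-19 pandemic?" answered with BOUND and logical NOT
ASK WHERE {
  OPTIONAL { res:COVID-19 dbo:overcomeDate ?date . }
  FILTER(!BOUND(?date))
}

# "Is Vietnam's official name The Socialist Republic of Vietnam?" answered with the ternary REGEX operator
ASK WHERE {
  dbr:Vietnam rdfs:label ?label .
  FILTER(REGEX(?label, "The Socialist Republic of Vietnam", "i"))   # "i" makes the match case-insensitive
}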
The ORDER BY modifier, the second most frequently used modifier, changes the order in which the solutions are presented. Although the dataset contains just one such question, it is worth pointing out that the ORDER BY modifier is almost always used in combination with other modifiers in the dataset. An example where ORDER BY occurs by itself is the question "What are the top ten most affected countries regarding COVID-19?", in which the answer is represented by all of the results of an ordered list of countries.
The LIMIT modifier is commonly used in conjunction with the ORDER BY and OFFSET modifiers to answer questions that contain superlatives, such as "Who is the Minister of Health in Vietnam during COVID-19?". Another combination consists of the previous three modifiers (LIMIT, ORDER BY, and OFFSET) together with the COUNT aggregation function, for questions in which the ORDER BY function must be applied to the object of the COUNT function. The use of the LIMIT modifier inside the QALD-style dataset is also worth discussing. For example, the question "Which nation has the highest number of COVID-19 cases?" may be translated into a SPARQL query with a LIMIT modifier, which can then be executed. This is needed because without the modifier only the first result of the ordered list represents the correct answer, while the remaining results are unrelated to the question. Although the use of the LIMIT modifier is not strictly necessary in this scenario, it is closely associated with the structure of DBpedia.
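A minimal sketch of such a superlative query is given below; the dbp:confirmedCases property name is an assumption made for illustration, not necessarily the property actually used in DBpedia.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

SELECT ?country ?cases WHERE {
  ?country a dbo:Country ;
           dbp:confirmedCases ?cases .
}
ORDER BY DESC(?cases)   # sort countries by their number of cases, highest first
LIMIT 1                 # keep only the top-ranked country
# Adding OFFSET 1 would skip the first result and return the second-ranked country instead.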
Apart from SELECT, the only other SPARQL query form used in the dataset is ASK, which is logical considering that QA systems over KGs are often factoid QA systems. The ASK query form is required in order to build queries that need a Yes/No response, thereby yielding a Boolean result, such as "Did Vietnam defeat COVID-19?".
Other than in conjunction with the ORDER BY and LIMIT operators, the COUNT modifier may be used on its own, for example in questions that begin with the words "How many", such as "How many cases of COVID-19 were detected in Vietnam?".
Data analysis
In order to verify that all queries created in the previous part operate correctly when querying the KGs, each generated query is verified once more; any queries that are found to be incorrect are passed on to the next step to be fixed and validated again.
Each question and query that did not pass the previous step is fixed again in this step before being sent on. First, we revised and, if required, rephrased all responses in order to make them sound more natural and fluent. Finally, in order to confirm that the dataset was grammatically accurate, we had all of the produced results peer-reviewed.
Finally, each question is examined by an independent reviewer. This second round yields higher data quality, since the reviewer is also permitted to change the questions if any inaccuracies are discovered from the first iteration.
We construct a sizable collection of benchmarking datasets that cover different domains and different languages.
LC-QUAD [55] is the Large-Scale Complex Question Answering Dataset. It consists of 5,000 question-answer pairs with the intended SPARQL queries over the DBpedia knowledge base.
Question Answering over Linked Data (QALD) [49] is a series of evaluation campaigns on question answering over linked data. The series began in 2011 and has run until 2020. QALD-9 [58] is the most recent edition, and it contains 408 questions that were collected and selected from prior challenges. It is available in eleven different languages (e.g., English, Spanish, German, Italian, French, Dutch, Romanian, and Farsi). The XML format is used for QALD-1 to QALD-5, and the JSON format is used from version 6 to 9. COVID-KGQA is also a multilingual dataset, covering two languages, English and Vietnamese.
In addition to the datasets described above, we also produced a dataset about COVID-19 to further enhance the diversity of biomedical datasets. Using the latest version of DBpedia as a starting point, we developed a corpus for COVID-19 question answering over the DBpedia knowledge graph (COVID-KGQA) with more than 1,000 bilingual question-answer pairs. The data format is approximately the same as that of the QALD dataset [58], with a simplified version from CBench [59], which makes evaluation through QA systems much more straightforward. An example of our data is illustrated in Figure 3.2.
The JSON format of our dataset is described below:
"String":"Where is the first case in Vietnam?
"keywords":"first case, COVID-19, Vietnam "
"string":"Truong hop ca nhiem COVID-19 dau tien cua Viet Nam la o dau?",
"keywords":"Ca nhiem dau tien, COVID-19, Viet
"spargl":"SELECT DISTINCT ?uri WHERE { ?uri }"
"value": "http://dbpedia.org/ resource/Ho_Chi_Minh_City"
The benchmark datasets are shown in Table 3.1, along with the number of questions contained in each and the KGs they target. QALD is an annual evaluation campaign for question answering that began in 2011 and continues to this day; as a result, it comprises nine benchmarks (QALD-1 to QALD-9). The challenge organisers provided both a training dataset and a test dataset for each task. When these datasets were first created, they comprised at a minimum the NL question, a SPARQL query, and the corresponding answers. Afterwards, more information such as keywords, answer type, and information about necessary aggregation functions was added to the datasets, as well as a new knowledge base other than DBpedia and hybrid question answering based on RDF and free text.
Table 3.1: Benchmark datasets, their sizes, target KGs, and domains

Dataset      Questions  Knowledge graph     Domain
COVID-KGQA   1,000      DBPedia             Biomedical, Cross-language
LC-QUAD      5,000      DBPedia             Generic
QALD-9       408        DBPedia             Generic, Cross-language
QALD-8       315        DBPedia, Wikidata   Generic, Cross-language
QALD-7       530        DBPedia, Wikidata   Generic, Cross-language
Figure 3.3, Figure 3.4, and Figure 3.5 depict the distribution of the trigrams with which the questions begin. In order to better understand the lexical diversity of questions in the datasets, we identify the trigram patterns that appear most frequently at the start of the questions. We can see that the datasets contain a large number of different language constructs: the question words How, What, Who, and Which appear most frequently in QALD-9 (Figure 3.3); How, Who, Which, and What appear most frequently in LC-QUAD (Figure 3.4); and When, Where, What, and Who appear most frequently in COVID-KGQA (Figure 3.5). Meanwhile, the prefixes What is, Who is, How many, and Which countries account for the majority of question types in QALD-9; How many, Who is, Which TV, and What is in LC-QUAD; and When is, Where is, What is, and How many in COVID-KGQA.
Figure 3.3: Distribution of the most popular question prefixes in QALD-9
Figure 3.4: Distribution of the most popular question prefixes in LC-QUAD
Figure 3.5: Distribution of the most popular question prefixes in COVID-KGQA
VIETNAMESE DBPEDIA KNOWLEDGE GRAPH
Introduction
DBpedia is a community initiative that extracts information from Wikipedia and organizes it in accordance with Semantic Web and Linked Data principles. As a result, it is a large-scale knowledge repository that is regarded as one of the most discussed topics in the Semantic Web field right now. Furthermore, as a result of the growth of the DBpedia community, many additional possibilities in the Linked Open Data (LOD) field are being developed every day. In turn, this results in the creation of dedicated DBpedia chapters in various languages throughout the world. However, there are numerous languages, like Vietnamese, that do not yet have their own chapters. The reason for this is that the DBpedia extraction technique requires the creation of human mappings between Wikipedia templates and the types and properties of the DBpedia ontology. So, as the first step toward creating the Vietnamese DBpedia chapter, we explain in this chapter how to automatically determine types for Vietnamese entities.
Recently, DBpedia has expanded to include 140 languages and publishes 21 billion triples per month¹. In other words, there are billions of pieces of information in the form of RDF triples that are freely available under an open license and cover a wide variety of domains and languages, including locations, people, and works.
In recent years, the number of contributors to DBpedia has increased significantly.
DBpedia Live extraction [60] is a real-time extraction process that allows Wikipedia editors to maintain the DBpedia ontology. In particular, it allows them to connect the structures of Wikipedia infoboxes to classes and properties in the DBpedia ontology. Up to this point, this has been accomplished via a human engineering effort inside the DBpedia project, and more than 2000 of these
¹ https://www.dbpedia.org/resources/knowledge-graphs/
mappings have been completed to date. As of today, DBpedia has opened its interface to allow the individuals who create the data to also have influence over how it is represented in DBpedia. Furthermore, the ontology itself may now be changed directly inside Wikipedia, paving the way for a collaborative ontology engineering process.
DIEF is an acronym for the DBpedia Information Extraction Framework [61], which was used to create the original English-language edition of DBpedia. By adding new chapters to DBpedia, this framework makes it possible to broaden the coverage of the database to many Wikipedia languages throughout the world. According to the DBpedia website, 20 releases of the international DBpedia chapters have been recorded.
The DBpedia chapters for Greece [62], Germany [63], Japan [64], and Korea [65] explain how to construct a local DBpedia chapter in each country. DBpedia versions for communities in other languages are being developed using this work as a guideline, with international and multilingual extraction being the ultimate goal.
At the time of this research, the Arabic chapter, presented in [66], is the most recent edition. Practices and efforts have been put in place to publish linked data in the Arabic language, as well as to add a new chapter to the international DBpedia chapters that are already accessible.
In a nutshell, the aforementioned studies create their own DBpedia chapters on the basis of the Extraction Framework and manually produce mappings. Many studies, however, switched to an automated strategy after discovering the drawbacks of the manual approach. Using the most specific ancestor baseline, Palmero and colleagues [67] developed a three-step technique to automatically map Wikipedia templates to the DBpedia ontology.
A technique termed DBTax was developed by Marco Fossati in [68] to assign types to DBpedia entities by learning a taxonomy from the Wikipedia category system, without the need for human supervision, using natural language processing. Furthermore, this technique addresses the problem of inaccurate categorization of
entities as well as the problem of overlap among the DBpedia types that occur most often, such as Place, Person, Work, and Organization.
Using inter-language links, Bouma et al. [69] aim to automatically populate Dutch Wikipedia infoboxes based on the information available through those links. The authors begin by examining the relationship between the English and Dutch Wikipedias. Afterwards, they use bidirectional matching to generate a collection of correct mapping pairs. In [70], the author also provided a case study in which a Chinese chapter was generated solely automatically. According to the findings of this study, the most noteworthy aspect is a model inference that made use of interpolation, external links from DBpedia, and trained sample data. Although this model is capable of producing some favorable outcomes, it has drawbacks, including a limited amount of training data and a lengthy processing time. It is evident that creating one's own DBpedia chapter using automated tasks rather than human labor requires consideration from a variety of angles, and there will continue to be both problems and possibilities.
The information is derived from the Wikipedia repositories and is available in many languages. When the first chapter of the Vietnamese DBpedia [71] was released in 2018, there were 292 active Wikipedia language versions, Vietnamese among them. However, the difficulty is that information on COVID-19 is not available in that version, since COVID-19 only emerged in December 2019; as a result, that particular version of DBpedia does not work for us. Moreover, the number of Wikipedia pages grew by more than 5,000,000 pages between November 2018 and November 2021. For these reasons, we were inspired to create a new version of the Vietnamese DBpedia that includes COVID-19 information. Figure 4.1 illustrates the statistics of Wikipedia pages from November 2018 to November 2021.
Vietnamese DBpedia generation

This section provides an overview of the generation of the DBpedia knowledge graph for the Vietnamese language. Additionally, we detail the generation process.
² https://stats.wikimedia.org/#/vi.wikipedia.org/contributing/new-pages/normal|bar|2018-10-
Figure 4.1: Statistics of Wikipedia pages from November 2018 to November 2021
Input. Wikipedia pages are accessed through a third-party source. Pages may be read directly from a Wikipedia dump or through the MediaWiki API of a MediaWiki installation. The data is sourced from Wikipedia repositories in multiple languages. As a result, we need to develop a web crawler to extract data from Wikipedia prior to performing this task. The output file format of this task is XML.
Creating Mapping. There are two stages in this job. Initially, the infobox is matched to the associated class (type), and then its attributes are mapped to the properties of that class in the DBpedia ontology. The details of this step are given in subsection 4.2.2.
Parsing. The wiki parser parses each Wikipedia page, converting the page's source code into an Abstract Syntax Tree (AST). Figure 4.3 shows an example of the AST of the image link Sunflower from Wikipedia³.
Figure 4.3: The AST representation of the image link [72]
Extraction. Extractors are applied to each Wikipedia page's Abstract Syntax Tree. DBpedia provides extractors for a variety of purposes, including label, abstract, and geographical coordinate extraction. Each extractor accepts an Abstract Syntax Tree as input and outputs a set of RDF statements. Details of this step are given in subsection 4.2.3.
³ https://en.wikipedia.org/wiki/File:A_sunflower.jpg
Output. The RDF statements that have been gathered are written to a sink. N-Triples, for example, is one of the supported formats.
The mapping objective is to connect Wikipedia templates with the matching classes in the DBpedia ontology in order to improve the quality of the ontology. As a result, every Wikipedia object has a type associated with it. After the mapping process, the entity Viet Nam, for example, will be assigned the type Country.
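After mapping, the assigned type is materialized as an ordinary rdf:type triple, so it can be checked with a simple query. The sketch below is an assumption for illustration; the exact resource label used for Viet Nam in the Vietnamese chapter may differ.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

ASK WHERE {
  dbr:Vietnam rdf:type dbo:Country .   # true once the mapping has assigned the type Country
}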
In recent years, these mappings have been created manually by volunteers from all over the globe for a variety of languages. The Vietnamese mapping community, however, is tiny, and the outcomes of hand-generated mapping are insignificant. In this part, we describe the characteristics of Vietnamese Wikipedia templates as well as the kinds of inferences that may be made about entities. There are two types of Wikipedia templates: infobox templates and macro templates. As discussed before, infoboxes are one of the most useful sources of structured information in Wikipedia articles that DBpedia makes use of; as a result, we are especially interested in them in this study. According to statistics from the mapping site, there are 1,295 Vietnamese Wikipedia templates, and the Vietnamese Wikipedia contains a total of 1.19 million articles. It is the biggest Wikipedia in a non-European language and the largest for a language that is officially recognized in only a single nation. Based on their early study, there are 826 infoboxes and 433 macros, despite the fact that more than half of the articles lack infoboxes. As a result, handwritten mapping is inefficient for the Vietnamese Wikipedia. Because of this, finding an automated approach to produce mappings is a wise decision.
Ontology mapping is a difficult task, since it requires comprehension of both prior knowledge in particular areas and the structures of Wikipedia templates. Furthermore, this activity must remain flexible in order to keep up with the rapid changes in Wikipedia articles. The mapping communities therefore cover only 32 languages, which is a modest number compared to the 142 languages supported by DBpedia. In a prior study [73], the authors proposed the Tprediction technique to provide automated type
prediction for DBpedia entities. These entities are all created from Wikipedia as well; as a result, we employ this strategy in order to infer the types of Vietnamese Wikipedia entries. The authors of the Tprediction algorithm combine two baselines, namely the specific level and majority voting, in order to exploit the types already assigned by the various languages in their predictions. The conformity Con(t) described by the authors is determined from the sum of the frequency of t and, recursively, the conformity of its parent. This value is critical in determining which type to return when a request is made. The formula for Con(t) is as follows: Con(t) = frequency(t) + Con(parent(t)). When applying this approach to Vietnamese entities, the authors modified the input and the number of pivot languages. In detail, the authors reduce the number of pivot languages from 32 to 15 by including the Vietnamese Wikipedia in the analysis. The rationale for this is based on their examination of Vietnamese entries as well as entries in other languages. According to their findings, the proportion of articles assigned a particular type among the articles shared between the Vietnamese Wikipedia and a candidate language is a critical factor to consider when choosing a pivot language. The authors therefore chose a language as a pivot if the proportion of articles assigned a type is greater than 50% of the total number of articles shared between the Vietnamese Wikipedia and that language.
As a result, the test results contain less noise.
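As a hedged illustration of the recursion (the frequencies below are hypothetical and only show how conformity accumulates along the DBpedia type hierarchy, where dbo:City is a subclass of dbo:Settlement, which is a subclass of dbo:PopulatedPlace), suppose frequency(dbo:City) = 40, frequency(dbo:Settlement) = 25, and frequency(dbo:PopulatedPlace) = 10, with dbo:PopulatedPlace having no typed parent. Then:

Con(dbo:PopulatedPlace) = 10
Con(dbo:Settlement) = 25 + Con(dbo:PopulatedPlace) = 25 + 10 = 35
Con(dbo:City) = 40 + Con(dbo:Settlement) = 40 + 35 = 75

Because Con(dbo:City) is the largest conformity value, dbo:City would be returned as the predicted type in this hypothetical case.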
The DBpedia extraction framework uses a variety of extractors to translate different sections of Wikipedia articles into RDF statements. DBpedia extractors are separated into four categories, as follows:
Infoboxes are the sort of Wikipedia content that is most valuable for the DBpedia extraction, since they contain the most structured information. Infoboxes are typically used to present the most important details about an article in the form of a table of attribute-value pairs at the top right-hand side of the Wikipedia page (for right-to-left languages, on the top left-hand side). The infoboxes that appear in a Wikipedia article are created from a template, which defines a set of properties that may be used to construct an infobox. Wikipedia makes use of a diverse collection of infobox templates; templates for infoboxes that represent people, organizations, or cars are common examples. As a result of the evolution of Wikipedia's infobox template system over time, various communities of Wikipedia editors use a variety of templates to describe the same types of objects (e.g., Infobox_city_Vietnam, Infobox_HoChiMinh_City, and Infobox_city). In addition, various templates use different names for the same characteristic (e.g., birthplace and placeofbirth). Because many Wikipedia editors do not precisely adhere to the recommendations provided on the page that specifies a template, attribute values are represented in a variety of formats and units of measurement. The following is an extract from an infobox based on a template describing the COVID-19 pandemic in Vietnam.
{{Infobox COVID-19 pandemic in Vietnam
| name           = COVID-19 pandemic in Vietnam
| arrival_date   = 2020-01-23
| confirmed_case = 473530
| recovery_case  = 248722
| first_case     = Ho Chi Minh city
| origin         = Wuhan, Hubei, China
( ... )
}}
The first line of this infobox indicates the infobox type, and the following lines provide the different attributes of the object being described. The following is an excerpt from the extracted data:

dbr:COVID-19_pandemic_in_Vietnam
    dbp:name COVID-19 pandemic in Vietnam ;
    dbp:arrivalDate 2020-01-23 ;
    dbp:confirmedCase 473530 ;
    dbp:recoveryCase 248722 ;
    dbp:firstCase Ho Chi Minh city ;
    dbp:origin Wuhan, Hubei, China .
Figure 4.4: Wikipedia information about COVID-19 pandemic in Vietnam
Vietnamese DBpedia analysis
Our solution outperforms the default Vietnamese data in terms of mapped properties; however, it is smaller than the English DBpedia in terms of mapped properties. We are unable to provide much more due to a lack of resources and time; extending it is part of our future work.
Table 4.1: Comparison between DBpedia knowledge graphs

                                         ViDefault   ViDBpedia   DBpedia (En)
Mapped templates per total templates     0.77%       29%         6.81%
Mapped properties per total properties   0.28%       0.3%        3.95%
Figure 4.5: Comparison between ViDBpedia and existing default Vietnamese in DBpedia
Table 4.2 shows the statistical results of the knowledge graph (KG) comparison. Based on [76], we conducted a statistical comparison of ViDBpedia and DBpedia, and we present our key conclusions here.
There are two KGs on this list. ViDBpedia is smaller than DBpedia, which is the larger of the two in terms of the number of triples. It seems that there is a relationship
Figure 4.6: Comparison between ViDBpedia, existing default Vietnamese and Total Vietnamese in DBpedia
Table 4.2: Overview information about ViDBpedia and DBpedia

                              ViDBpedia    DBpedia
Number of triples             2 645 822    411 885 960
Number of instances           309 683      20 764 283
Number of entities            192 436      4 298 433
Number of classes             184          736
Number of unique predicates   412          60 231

between the method used to build a KG and its size. For example, automatically generated KGs tend to be bigger, since the constraints on integrating new information become less onerous as the KG grows. Datasets that have been imported into KGs also have a significant effect; the number of triples and the number of facts in the KG are greatly influenced by them. The method of modelling data likewise has a significant influence on the number of triples: representing relations with intermediary nodes, for example, results in many more triples than if the relations were stated as simple statements. Last but not least, the number of supported languages has an impact on the number of triples.
Classes vary greatly across the KGs, ranging from 184 (ViDBpedia) to 736 (DBpedia). However, while ViDBpedia comprises numerous classes, only a small percentage of them are actually used at the instance level. Keep in mind, however, that this is not necessarily a negative.
When searching through the KGs, it is typical to discover relations that are only seldom used: A quarter of the DBpedia ontology’s relations are used more than 500 times in DBpedia, whereas half of the ontology’s relations are used just once.
DBpedia has by far one of the largest numbers of entities of any database on the Web today. Because each statement is instantiated, DBpedia exposes a disproportionately large number of instances in comparison to entities (in the sense of instances that represent real-world objects).
Summary

In this chapter, we outlined the processes required to create a Vietnamese DBpedia chapter and discussed in detail how to detect types automatically for Vietnamese entities. To the best of our knowledge, our study is the first practical document for the development of a Vietnamese DBpedia that includes COVID-19 information. This effort is just the first stage in the development of a Vietnamese DBpedia chapter, and we hope that it will be connected to other chapters in the DBpedia network as soon as feasible. This is a topic on which we will continue to work in the future. In particular, we plan to analyze the quality of the Vietnamese DBpedia dataset and to improve our algorithm in order to increase the number of extracted triples. We hope that this will prove to be a helpful data source for Semantic Web applications that might be of use to the Vietnamese people in the future.
QUESTION ANSWERING OVER KNOWLEDGE GRAPH
Introduction
In the present study, three standard approaches for KGQA are combined and extended: information retrieval-based (IR), template-based, and subgraph matching. In IR-based techniques, low-dimensional vectors are used to represent both the queries and the potential answers; these approaches retrieve the answers by ranking the candidate responses according to their matching score or by categorizing the candidates into positive and negative classes. Template-based approaches perform semantic parsing and guide the mapping of questions into structured queries by relying on predetermined templates and handcrafted rules. They are useful in the KGQA task because they make the semantic parsing of questions and the decomposition of query utterances much easier; templates may be created by hand or learnt via machine learning. At the other end of the spectrum, subgraph-based research efforts make use of subgraphs for semantic parsing in order to improve performance; they employ models with trainable parameters to build a semantic parser that is tailored to the training corpus provided to them. In this chapter, we discuss various well-known approaches to KGQA and choose the methods that will serve as representative examples for our thesis.
Subgraph matching techniques
The technique we use in this section is subgraph matching. Some subgraph matching approaches construct the query subgraph using a semantic tree, while others deviate significantly from this approach by building the subgraph from the entity. A natural language phrase may have many interpretations, each of which corresponds to a different set of semantic elements in the knowledge graph. After locating the semantic tree, such systems must extract the semantic relations from it before constructing the semantic query graph. Several techniques [77, 78] use this approach.
Figure 5.1: Overview of gAnswer model
5.2.1 gAnswer

The gAnswer system [79] is a QA system whose purpose is to convert natural language questions into query graphs that include semantic information. Once this is done, the system converts the query graphs into standard SPARQL queries, which are executed in a graph database in order to provide responses to users. For semantic disambiguation, the system uses a novel data-driven approach: it retains several candidate plans for entity and predicate mappings when creating query graphs, and performs semantic disambiguation (pruning incorrect mappings) in the query evaluation phase based on the entity and predicate matches. The answers in gAnswer are obtained using a graph data-driven solution that is divided into two phases: offline and online. During the offline phase, a graph mining technique is used to determine the semantic equivalence of relation phrases and predicates; a paraphrase dictionary is then constructed to record the semantic equivalences that have been discovered. The online phase is divided into two stages: question understanding and question evaluation. During the question understanding step, a semantic query graph is constructed to reflect the user's intent. This is accomplished by extracting semantic relations from the dependency tree of the natural language question using the previously constructed paraphrase dictionary. Following that, a subgraph of the knowledge graph is chosen that, via subgraph isomorphism, corresponds to the semantic query graph. In the question evaluation step, the final response is delivered based on the subgraph that was matched.
Figure 5.2: Overview of the semantic relation generation framework of gAnswer in the question understanding step [79]
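To make the outcome of this pipeline concrete, consider the running example question "What is the city in Vietnam where the first case of COVID-19 occurred?". Its semantic query graph contains two edges that share the variable node ?city, and the matched subgraph corresponds to a SPARQL query of the following form (a sketch with assumed DBpedia property names, not the exact output of gAnswer):

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>

SELECT DISTINCT ?city WHERE {
  dbr:COVID-19_pandemic_in_Vietnam dbp:firstCase ?city .   # edge 1: "where the first case of COVID-19 occurred"
  ?city dbo:country dbr:Vietnam .                           # edge 2: "the city in Vietnam"
}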
Template-based technique
Templates play a critical role in question answering (QA) over knowledge graphs (KGs), where user utterances are converted into structured queries via semantic parsing. An advantage of using templates is traceability: explanations can be created for users so that they understand why they receive certain responses. The systems in [80, 81] use this approach.
Based on isomorphic graph patterns, TeBaQA [82] creates templates that can be reused across many SPARQL queries. TeBaQA can be described in the following five steps:
Figure 5.3: Overview of TeBaQA model [82]
Preprocessing is the first step, in which all questions are processed to eliminate semantically unnecessary terms and generate a collection of meaningful n-grams. The Graph-Isomorphism Detection and Template Classification phase uses the training sets to train a classifier on natural language questions and SPARQL queries by analyzing the basic graph patterns for graph isomorphisms. Fundamentally, the underlying hypothesis is that structurally identical SPARQL queries express syntactically similar questions. When a question is asked, it is classified into a ranked list of SPARQL templates. In the Information Extraction step, TeBaQA collects all essential information from the question, such as entity names, relations between entities, and classes, and then determines the answer type using an index set that is independent of the underlying KG. The query building step involves inserting the extracted information into the top templates, determining what kind of SPARQL query to use, and adding query modifiers. It is then necessary to execute the generated SPARQL queries and compare their responses to the anticipated answer type. In the final ranking step, all of this information, as well as the natural language question and the returned responses, is taken into consideration. A two-step process is used to compile this ranking: first, TeBaQA filters the results according to 1) the anticipated answer type of the question compared to the actual answer type of the query and 2) the cardinality of the result set; second, the quality of the remaining
SPARQL queries is evaluated by TeBaQA. In the next part, we describe this model in greater depth in order to provide a better understanding of it.
Question Preprocessing. Natural language questions often contain words that provide no information for answering them. As a result, the authors distinguish between n-grams that are semantically meaningful and those that are not. Irrelevant n-grams may cause mistakes to propagate through the pipeline, which can be disastrous. The entity dbr:The_The is a good illustration of this problem: every time the word "The" appeared in a question, the system's performance would suffer significantly as a result of the incorrect association of that term with this entity. However, since seemingly irrelevant terms are occasionally part of entity names, such as dbr:The_Two_Towers, the authors are unable to filter out all of these words. As a result, the authors combine up to six neighbouring words from the question into n-grams and eliminate any n-grams that consist only of stop words. The authors create stop-word lists that include the most frequent terms in a given language that are very unlikely to contribute semantic value to the phrase; this allows irrelevant words to be detected quickly and easily. Part-of-speech tags (POS tags) are also used by TeBaQA to discriminate between relevant and irrelevant n-grams in a sentence: only n-grams that begin with the JJ, NN, or VB POS tags are considered significant in this context. Following this preprocessing stage, TeBaQA performs an information extraction step in which it maps the remaining n-grams to entities from DBpedia.
Graph-Isomorphism Detection and Template Classification. TeBaQA classifies a question in order to discover the isomorphic basic graph pattern (BGP) for it. Considering that SPARQL is a graph-based query language, an isomorphism check can be used to verify whether two SPARQL queries are structurally equivalent. TeBaQA can thus categorize incoming questions at runtime in order to determine the most appropriate query templates, into which the extracted semantic information is later inserted.

• SPARQL BGP Isomorphism to Create Template Classes. Using the training datasets, TeBaQA derives a basic graph pattern for each question and its associated SPARQL query. All SPARQL queries whose patterns are isomorphic are combined into a single class; each class thus contains natural language questions that are semantically different but structurally comparable in SPARQL.

• Question Features and Classification. Following that, TeBaQA trains a classifier using all questions from an isomorphism class as input to compute features for that class. A feature vector contains the information necessary to make a reliable determination as to which class a question belongs to. The characteristics may be separated into two categories: semantic and syntactic. Semantic features indicate specific content components of the question, such as specific people or subjects that are mentioned; examples are questions about music and movies (for example, "Who is the vocalist on the album The Dark Side of the Moon?") and questions about countries, cities, or where things are situated (for example, "In which country is Mecca located?"). The remaining features describe the question's structure. It should be noted that additional characteristics were studied, but none of them were found to increase the model's recognition rate; the authors describe them in order to facilitate future work in this area. With the aforementioned attributes extracted from the input question and the isomorphic basic graph patterns used to label the classes, TeBaQA trains a statistical classifier. The target class of a feature vector is identified by producing the basic graph pattern that corresponds to the relevant SPARQL query and assigning the feature vector the class that corresponds to this pattern.
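To illustrate the grouping by basic graph pattern, consider the questions "Who was the president of Vietnam?" and "Who is the vocalist on the album The Dark Side of the Moon?". They are semantically unrelated but structurally identical, so they fall into the same isomorphism class and share one template. The sketch below uses placeholder markers and assumed entity and property names for illustration only.

# Shared one-triple template for both questions:
SELECT DISTINCT ?answer WHERE {
  <ENTITY> <RELATION> ?answer .
}
# Possible instantiations produced during information extraction (names assumed):
#   <ENTITY> = dbr:Vietnam                      <RELATION> = dbp:president
#   <ENTITY> = dbr:The_Dark_Side_of_the_Moon    <RELATION> = dbp:vocalist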
Information Extraction. TeBaQA fills the placeholders of a specific SPARQL template, which specifies the entities, classes, and relations to be used. The absence of semantic context information in questions causes semantic entity linking tools such as DBpedia Spotlight [83] and MAG [84]
to underperform, and these tools are not KG-agnostic; moreover, questions are shorter than ordinary texts. When attempting to answer the question "Who was the president of Vietnam?", the term Vietnam must be linked to dbr:Vietnam and not to any other resource with the same name. As a result, the authors use a search index-based technique to find entity, relation, and class candidates that is independent of the knowledge base. TeBaQA makes use of three indexes, all of which are prepared before the system is started.

• Entity linking. This index comprises all of the entities that are present in the target knowledge graph. TeBaQA checks the label field of the index to determine whether an n-gram from the preprocessing stage corresponds to an entity. The index provides information on the entities, relations, and classes that are associated with the entity under consideration.

• Relation Index and Class Index. The KG classes and relations are stored in these two indexes, which include the whole OWL hierarchy. The indexes are needed to map n-grams to relations and classes in the ontology of the knowledge base. TeBaQA additionally indexes hypernyms and synonyms for all relations and classes. Consider the question "Who was the professor at UIT?": the relation dbo:professor exists in DBpedia, while a relation dbp:associateProfessor does not; nevertheless, the relation dbo:professor can still be discovered by using the term "associate professor". This example shows how natural language differs from knowledge graphs in terms of lexical and semantic variation.

• Disambiguation. In response to an n-gram query, we receive candidates for entities, relations, and classes whose labels contain all of the n-gram's tokens. Since the label of each candidate may include more tokens than the n-gram itself, a Levenshtein distance filter with a threshold of 0.8 is applied to the candidates. All of the remaining candidates are used to fill the provided templates.
• Template Filling. TeBaQA fills the templates using the information about related entities and associated relations for the entities found in the entity index. There are two possible outcomes for triples inside a template: 1) the triple consists of one placeholder for an entity and one placeholder for a relation between entities; in this case, only the related-relation information from the entity index is used, and combining an entity candidate e with a relation candidate p yields either a triple
< e,p,?u > or