QA System for Real Estate Law in Vietnam

PHẠM THANH HỮU

QA SYSTEM FOR REAL ESTATE LAW IN VIETNAM

Major: Computer Science

Major code : 8480101

MASTER’S THESIS

Supervisor 1: ASSOC PROF QUAN THANH THO, PhD

Supervisor 2: DR NGUYEN TIEN THINH, PhD

Examiner 1 : DR TRAN TUAN ANH, PhD

Examiner 2 : DR BUI THANH HUNG, PhD

This master’s thesis was defended at HCM City University of Technology, VNU-HCM City, on 13/07/2023.

Master’s Thesis Committee:

1 Chairman: ASSOC PROF VO THI NGOC CHAU

2 Secretary: DR PHAN TRONG NHAN

3 Reviewer 1: DR TRAN TUAN ANH

4 Reviewer 2: DR BUI THANH HUNG

5 Commissioner: DR BUI CONG GIAO

Approval of the Chairman of Master’s Thesis Committee and Dean of Faculty of Computer Science and Engineering after the thesis is corrected (If any)

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

THE TASK SHEET OF MASTER’S THESIS

Full name: PHAM THANH HUU Student code: 2171066

Date of birth: 03.03.1978 Place of birth: Quang Ngai

Major: Computer Science Major code : 8480101

I THESIS TITLE (In Vietnamese): HỆ THỐNG HỎI ĐÁP TỰ ĐỘNG LUẬT BẤT ĐỘNG SẢN VIỆT NAM

II THESIS TITLE (In English): QA SYSTEM FOR REAL ESTATE LAW IN VIETNAM

III TASKS AND CONTENTS: Developing a chatbot capable of responding to legal real estate queries

IV THESIS START DATE: 22.12.2022

V THESIS COMPLETION DATE: 09.06.2023

VI INSTRUCTOR: ASSOC PROF QUAN THANH THO, PhD and DR NGUYEN TIEN THINH, PhD

INSTRUCTOR INSTRUCTOR

HCM City, 09/06/2023

CHAIRMAN OF DEAN OF

I would like to express my deepest gratitude to my advisor, Assoc. Prof. Quan Thanh Tho, for his valuable and constructive suggestions during the planning and development of this research work. His willingness to give his time so generously has been very much appreciated. I am equally grateful for his advice on algorithms and his recommendations on solutions whenever I ran into problems during this research.

Intelligent legal services have emerged in recent years due to the application of AI technology to the law industry; however, these have yet to be developed in Vietnam since there is a lack of research into automatic processing in the Vietnamese language. In this thesis, the author proposes to build a chatbot that can effectively and automatically answer legal questions, especially those related to real estate. The most important module of the chatbot is Legal Statutes Identification (LSI), which identifies the legal statutes relevant to a given description of facts or evidence in a legal document (such as a legal question or a description of a legal fact). To deploy the LSI model, the author has built an LSI dataset including more than 300,000 legal questions and millions of judgments of the Supreme People’s Court of Vietnam. Three models are presented in this thesis. The first is an ML-based model in which LSI is performed by a Support Vector Machine after the input questions have been embedded with TF-IDF. The second model, based on deep learning, performs the LSI downstream task after using a new model called LegarBERT to construct word embeddings for the input question. Finally, the author attempts to build LSI using graph machine learning by encoding legal reasoning as nodes and edges that represent queries, legal articles, and legal keywords (legal terminology).

TÓM TẮT LUẬN VĂN THẠC SĨ

Các dịch vụ pháp lý thông minh đã xuất hiện trong những năm gần đây nhờ sự áp dụng của công nghệ Trí tuệ Nhân tạo vào ngành luật; tuy nhiên, tại Việt Nam, chúng vẫn chưa được phát triển do thiếu nghiên cứu về xử lý tự động trong tiếng Việt. Trong luận văn này, tác giả đề xuất xây dựng một chatbot có khả năng trả lời tự động và hiệu quả các câu hỏi pháp lý, đặc biệt là các câu hỏi liên quan đến bất động sản. Mô-đun quan trọng nhất của chatbot là Hệ thống Xác định Căn cứ Pháp lý (Legal Statutes Identification - LSI), được sử dụng để xác định các căn cứ pháp lý liên quan đến một mô tả cụ thể về sự kiện hoặc bằng chứng từ một văn bản pháp lý (như một câu hỏi pháp lý hoặc một mô tả về sự kiện pháp lý). Để triển khai mô hình LSI, tác giả đã xây dựng một tập dữ liệu LSI gồm hơn 300.000 câu hỏi pháp lý và hàng triệu bản án của Tòa án Nhân dân Tối cao Việt Nam. Luận văn này trình bày ba mô hình. Mô hình đầu tiên dựa trên máy học (ML), trong đó LSI được thực hiện bằng Máy Vector Hỗ trợ sau khi câu hỏi đầu vào được biểu diễn bằng phương pháp Nhúng TF-IDF. Mô hình thứ hai, dựa trên học sâu, sẽ thực hiện các tác vụ LSI sau khi sử dụng một mô hình mới được gọi là LegarBERT để xây dựng việc nhúng từ cho câu hỏi đầu vào. Cuối cùng, tác giả cố gắng xây dựng LSI bằng cách sử dụng học máy đồ thị bằng cách mã hóa lý luận pháp lý thành các nút và cạnh, biểu thị bằng các truy vấn, các điều khoản pháp lý và thuật ngữ pháp lý.

I guarantee that this research is my own, conducted under the supervision of Assoc. Prof. Quan Thanh Tho. The contents and results of this research are legitimate and have not been published in any form prior to this. The data and materials used for the analysis and feedback are derived from various resources, which are appropriately listed in the References section.

The data and results of several other authors and organizations have been used and have been aptly cited.

If there is any plagiarism, I take full responsibility for it. Ho Chi Minh City University of Technology is not responsible for any copyright infringement relating to this dissertation.

1 Introduction

1.1 Motivation

1.2 Objectives and Scope

1.2.1 Business Objectives

1.2.2 Research Objectives

1.2.3 Thesis’s Scope

1.3 Contributions of the Thesis

1.4 Organization of the Thesis

2 Legal Document Structure and Data

2.1 VN-LandLaw-2013 Corpus

2.2 Formal Structure of a Legal Document

2.3 Legal Data Sourcing

2.4 Data Preparation

2.4.1 LSI Labeling

2.4.2 Legal Entity Extraction

2.4.3 Legal Relation Extraction

2.4.4 The TF-IDF Matrix of Vietnam Land Law

2.5 Legal Data Summary Statistics

2.5.1 Legal Data Classification

2.5.2 Basic Legal Data Statistics

2.5.3 Unbalanced Legal Data

2.5.4 Vietnam Land Law Article Semantic Relations Matrix

2.5.5 Vietnam Land Law Article Co-occurrence Matrix in LSI Dataset

3 Literature Review

3.1 QAS Research in NLP

3.2 Law-related Global QAS Research

4.1 Technical Background

4.1.1 Term Frequency-Inverse Document Frequency (TF-IDF)

4.1.2 Support Vector Machine (SVM)

4.1.3 Attention Model

4.1.4 Autoencoder Model

4.1.5 BiLSTM

4.1.6 PhoBERT

4.1.7 Masked Language Modeling (MLM)

4.1.8 Fine-tune a Pretrained Model

4.1.9 Graph Convolution Neural Network (GCN) and Graph Attention Network (GAT)

4.2 Business Background

4.3 Legal Domain Background

5 The Proposed System

5.1 Business Model

5.2 Overall System Architecture

5.3 The Main User Cases

5.4 UI Design

5.5 The Evaluation/Acceptance Criteria

5.5.1 Chatbot System Acceptance Criteria

5.5.2 LSI Model Metrics

5.5.3 Comparative Methods

5.6 Primary Challenges

6 LSI by Linear Support Vector Classification with TF-IDF Embedding

6.1 Introduction

6.2 Related Works

6.3 Network Architecture

6.4 Experiments

6.4.1 Dataset Creation

6.4.2 Training Setup

6.4.3 Metrics

6.5 Results and Conclusions

7 LSI by Multi Label Classification with LegarBert

7.1 Introduction

7.2 Related Works

7.4.1 LegarBert Training from PhoBert

7.5 Experiments

7.5.1 Dataset Creation

7.6 Legal-Masked Strategy

7.6.1 Training Setup

7.6.2 Metrics

8 LegarHKB: A LSI Retrieval Model using Heterogeneous Knowledge Graph for the Vietnamese Law Domain

8.1 Introduction

8.2 Related Works

8.3 Network Architecture

8.4 Experiments

8.4.1 Dataset Creation

8.4.2 Training Setup

8.4.3 Metrics

1.1 LSI ChatBot’s response

2.1 Data collection procedure

2.2 Data labeling application screenshot - Login Page

2.3 Data labeling application screenshot - Home Page

2.4 Data labeling application screenshot - Labeling Page

2.5 The tree of ”Chủ thể.” Subjects of legal relations

2.6 The tree of ”Hành vi” or ”Quan hệ pháp lý.” Acts/ Legal relations

2.7 The TF-IDF Matrix of Vietnam Land Law

2.8 Legal data categories

2.9 Supervised LSI training data statistics per legal category

2.10 Supervised LSI training data statistics per legal category (Distribution)

2.11 Semi-supervised LegarBert (MLM) training data statistics per legal category (From books)

2.12 Semi-supervised LegarBert (MLM) training data statistics per legal category (From books) (Distribution)

2.13 Semi-supervised LegarBert (MLM) training data statistics per legal category (From Supreme People’s Court)

2.14 Semi-supervised LegarBert (MLM) training data statistics per legal category (From Supreme People’s Court) (Distribution)

2.15 Unbalanced legal data phenomenon

2.16 Heatmap of 212 legal documents’ TF-IDF vectors’ cosine similarity

2.17 Semantic relations of articles 35-51 of chapter IV (Land use master plans and plans)

2.18 Heatmap of Vietnam Land Law articles co-occurrence in LSI dataset

2.19 Article Samples

2.20 High concurrency, high semantic similarity

2.21 High concurrency, low semantic similarity

3.1 Timeline of automated law research

4.1 Architecture of attention Model

4.4 Architecture of PhoBERT

4.5 Graph Attention Neural Network

4.6 IVS JSC overview

4.7 Vietnamese legal structure

4.8 Vietnam real estate law structure

4.9 IRAC method

4.10 IRAC example

5.1 Business Model

5.2 Overall system architecture

5.3 The main user cases

5.4 Main screens of the chatbot

5.5 Legal quick lookup popup

5.6 Example of KU calculation

5.7 Vietnam Land Law 2013 Long-tail dataset

6.1 LSI by Support Vector Machine with TF-IDF Embedding model

7.1 The Answering Engine of the Legar System

7.2 LegarBert embedding model training by MLM tasks

7.3 LSI by Multi Label Classification with LegarBert

8.1 LSI by Heterogeneous Knowledge Graph

8.2 Data transformation process

8.3 Nodes and Edges Definition

8.4 Graph Design

8.5 Graph Demo with some nodes and edges

1.1 NLP techniques used in legal domain

2.1 S, O, R, TO, T legal question analysis

2.2 Entity extraction example

2.3 Top 100 single-word TF-IDF values from 212 Vietnamese Land Law

2.4 Top 200 single-word TF-IDF values from 212 Vietnamese Land Law

2.5 Legal Document Sentences and Words Statistics

2.6 Most-paired articles

3.1 QAS research in NLP

3.2 Law-related global QAS research

3.3 Vietnamese Law-related QAS research

5.1 Comparative Methods

5.2 Long-tail dataset

6.1 Train/Val/Test Dataset

6.2 Model 1 Hyperparameters

6.3 LSI by Support Vector Machine with TF-IDF Embedding results

7.1 Hyperparameter of LegarBert training

7.2 Hyperparameter of LSI by LegarBert

7.3 Perplexity comparing with MLM task

7.4 LSI by Multi Label Classification with LegarBert results

7.5 LSI by Multi Label Classification with LegarBert K-Utility

8.1 Hyperparameter of LSI by Heterogeneous Knowledge Graph

8.2 LSI by Heterogeneous Knowledge Graph results

8.3 LSI by Heterogeneous Knowledge Graph K-Utility

9.1 Comparing 3 models by Precision/Recall/F1

Introduction

1.1 Motivation

In most nations, the legal system is overburdened by a backlog of cases, particularly in low-level judiciaries. Though speedy justice acts exist, the process in the legal domain is extremely laborious. Artificial Intelligence (AI) tools can provide a way of automating these tasks, accelerating justice delivery [1]. The legislation by which businesses and citizens have to abide is growing at a constant rate in both complexity and volume. The data present in legislation is mostly in an unstructured format in legal documents [2]. This makes the task of retrieving information highly inefficient and time-consuming, particularly when huge quantities of data are involved. Further, the utility of such data differs broadly and relies on its representation and structure. In this scenario, legal professionals and users might find it highly problematic to explore the legal data while investigating a specific case or dealing with particular circumstances, even when the data is accessible [3]. These problems have resulted in the necessity of devising better methods for structuring and searching across huge amounts of legal data [4]. For this reason, Legal Statute Identification (LSI) is significant in the domain of the law: it identifies the probable set of statutory laws that are applicable, or that may have been violated, based on the factual description of a scenario in natural language. This process has to be carried out at various phases of litigation by experts, such as judges, lawyers, and police personnel. Therefore, automation of LSI can significantly increase law access for professionals and the wider public [5].

Due to the rapid advances in deep learning (DL) and natural language processing (NLP), numerous Question-answering systems (QAS) have been developed for applications such as navigation, virtual assistants, chatbots, and search engines [6][7], and thus can be applied in other fields, including law, to improve efficiency. The primary purpose of a QAS is to comprehend user intentions and provide appropriate responses. The QAS extracts its data autonomously in response to a user query [8]. There are many user-friendly documents on the Internet, but there are also many which are less so. Consequently, a successful QAS must use relevant documents efficiently.

Because a QAS requires input information, it is necessary to use Legal Statute Identification (LSI) [5]. As mentioned, LSI identifies the legal statutes that are pertinent to a given description of facts or evidence in a legal document (such as a legal question or a description of a legal fact).

data or finding inconsistencies in information. Further, they permit downstream applications, such as classification, prediction, dialogue, and QAS [9].

Legal Artificial Intelligence (LegalAI), which focuses predominantly on AI in legal applications, has garnered enormous interest in recent years [10]. LegalAI approaches rely on Natural Language Processing (NLP) [11] because the majority of resources in the judicial system are textual, such as contracts, judgment documents, and legal provisions. Knowledge graphs are a comparatively state-of-the-art technology in NLP and have significant strength in structuring legal data. They use graph models to describe knowledge and build relations among various entities. Knowledge graphs can be classified into two groups: general knowledge graphs and domain knowledge graphs. The former type is the most commonly utilized due to its broad information coverage across various domains [12]. Domain knowledge graphs, on the other hand, are mostly intended for particular domains, highlighting the depth of knowledge. As the information available in the legal domain has rigorous and complicated knowledge features, domain knowledge graphs are preferred. Simultaneously, advances in graph databases and machine learning have opened a potential way to construct legal document knowledge graphs [13].

In the legal field, knowledge graphs can be constructed by depicting the cases filed in courts as nodes and citations as edges, thereby enabling numerous graph processes. This way of representing legal documents can enhance the performance of several downstream applications, like finding similar cases, judgment prediction, text summarization, question-answering, and legal cognitive assistance [1]. Legal intelligence intends to utilize AI technologies, like speech recognition and NLP, for empowering the domain of intelligent justice [14]. Most LSI techniques are based on simple machine learning and statistical methods [5]. Recently, with the availability of high-quality legal texts, NLP approaches have become common in legal text mining, which is gradually becoming a widely studied topic [15]. Multiple models have been devised in the past for producing knowledge graphs from unstructured data.

Convolutional Neural Networks (CNN) and Sequence-to-Sequence (Seq2Seq) frameworks like Recurrent Neural Network (RNN) have produced better performance in several NLP tasks, such as document classification, information retrieval, and sentiment analysis [6] [2]. Furthermore, deep learning is a modern technique in AI and has been used extensively owing to its promising outcomes in classification and prediction problems across various fields. In particular, matching deep learning components (input features, hidden units and layers, and output predictions) with ontologies and knowledge graphs can make the internal mechanisms of these models more understandable and transparent. Additionally, the query and reasoning strategies of knowledge graphs allow innovative explanations, such as interactive and cross-disciplinary explanations [16].

Table 1.1 reviews existing schemes that apply NLP in the legal domain to identify the knowledge contained in legal texts, with an emphasis on deep learning approaches.

Table 1.1: NLP techniques used in legal domain.

| Authors | Methods | Advantages | Disadvantages |
|---|---|---|---|
| Paul, S., et al. [5] | Legal Statute Identification using Citation Network (LeSICiN) | LeSICiN offered high generalizability and was effective in producing a feature-rich and robust representation of the document. | It failed to capture the semantic relationships in the heterogeneous citation graph. |
| Sovrano, F., et al. [17] | Question-Answering (QA) Algorithm | This method was effective in determining the probable answers for every question, even with limitations to knowledge explicitly. | This algorithm was trained for solving processes associated with common sense, and so produced poor results with reasoning. |
| Li, G., et al. [18] | Graph Long Short-Term Memory (Graph LSTM) | This technique produced high classification accuracy and computational efficiency. | The Graph LSTM method was inefficient in producing better results with reduced data size. |
| Zhao, Q., et al. [14] | Graph neural network-based Legal Judgment Prediction (LJP) | This model was capable of extracting feature information fully and alleviating the data imbalance problem. | The network suffered from overfitting issues. |
| Ji, D., et al. [15] | Deep Neural Network (DNN) | This technique was capable of leveraging contextual information with high efficiency and modeling the implicit relationship among entities. | This scheme failed to consider incorporating structural information and domain knowledge for enriching the semantic meanings of abbreviations to enhance performance. |
| Sulis, E., et al. [19] | Co-occurrence network (CN) | This technique was effective in automatically identifying the link classes and implicit links between norms of legal texts with high accuracy. | The method was futile in improving the classification scheme to attain better results. |
| Zhu, G., et al. [20] | Bidirectional Long-Short-Term Memory with Conditional Random Field (Bi-LSTM-CRF) | The Bi-LSTM-CRF was successful in fully mining information tacit knowledge, and enhancing the retrieval efficiency. | The method suffered from high computational complexity. |
| Vuong, Y.T.H., et al. [21] | Supporting Model with BERT for Case law Retrieval (SM-BERT-CR) | This scheme was suitable for retrieving case details from all legal case documents, irrespective of the document length. | Though this model was efficient in identifying the support relation directly, it was unsuccessful in |

To overcome those drawbacks, we evaluated the following three approaches:

• Machine Learning based approach: Implementation of the LSI task by Linear Support Vector Classification with TF-IDF Embedding

• Deep Learning based approach: Implementation of the LSI task by Multi-Label Classification with LegarBert (our proposed BERT-based pre-trained model for the Vietnam Law Domain)

• Graph machine learning based approach: Implementation of the LSI task with LegarHKB (our proposed heterogeneous knowledge graph for the Vietnam Law Domain)

1.2 Objectives and Scope

1.2.1 Business Objectives

The business objective of this thesis is to build an intelligent legal service based on chatbot technology as the foundation for an on-demand legal service for IVS JSC.

1.2.2 Research Objectives

• Construct a chatbot application that can respond to legal real estate-related questions.

• Develop a knowledge graph based on expert legal knowledge using the most advanced AI technologies for Legal Statute Identification (LSI) to reduce the number of errors made by laypeople and attorneys.

• Determine the significance of Vietnamese words, which helps decide whether they should be included in Legal Statute Identification (LSI).

• Understand how to collect and process data, particularly Vietnamese legal data, in order to construct an efficient natural language processing system.

• Understand the BERT model and other fundamental AI models, including standard pre-trained language models and the training of a domain-specific language model.

• Understand the entire procedure for constructing a natural language processing system using DL/ML models.

1.2.3 Thesis’s Scope

• This thesis tests, selects, and deploys an NLP model to predict the pertinent [law, phrase, term] in response to a real estate legal query (LSI for the Vietnamese Land Law 2013). Figure 1.1 depicts a user inputting a query and the LSI ChatBot responding.

Figure 1.1: LSI ChatBot’s response

1.3 Contributions of the Thesis

As mentioned earlier, we introduce the following:

• Legal knowledge representation and legal reasoning representation using deep embedding and legal domain knowledge.

• The LegarBert language model repository is created for Vietnam Land Law. Digitized data from the Vietnam Land Law 2013 was processed and embedded using BERT’s masked language model.

• The Heterogeneous Knowledge Graph-based Vietnam Land Law model LegarHKB is new. Legal terminology, subject, object, relation, event, and time nodes from a large Vietnam Law database are included in the model. This graph data warehouse digitizes and labels approximately 1 million Supreme Court of Vietnam cases.

• We introduce a new metric suited to commercializing LSI products, where traditional ML and DL metrics such as Precision, Recall, and F1 are hard for consumers to interpret. This new metric is referred to as KU and is introduced in the following sections.

1.4 Organization of the Thesis

Legal Document Structure and Data

2.1 VN-LandLaw-2013 Corpus

The VN-LandLaw-2013 corpus is a collection of questions, answers, and corresponding labels related to the Vietnamese Land Law 2013. The process of preparing this dataset involved two main steps: data collection and data labeling. In the data collection step, we acquired a digitized version of the Vietnamese Land Law 2013. Then, we gathered conversations from land-law-related e-forums, which reflect real situations of legal consultation by experts concerning the application of the Vietnamese Land Law 2013. The collected conversations were labeled by our team of legal experts to extract relevant information such as doctype, legislation, article, clause, and point. Ultimately, a total of 5910 data samples were collected for this corpus.

The conversation given in Example 2.1 is, in fact, a real sample extracted for the VN-LandLaw-2013 corpus. Listing 1 provides the full information of this data item as stored in the corpus. As can be observed, a data item is annotated with a substantial amount of legal information. In the context of this thesis, we focus on the annotated Articles of the Land Law 2013, specifically Articles 168, 188, and 186. These Articles are used as labels for this item. Thus, our consultation problem is realized as a multi-label classification task, where the Legar system classifies a user’s concern into the corresponding Articles.

Example 2.1 In a real conversation concerning legal matter of land business, a user expresses a

{
  "id": "92",
  "question": "Tôi đang thực hiện giao dịch mua đất của Ông A, hiện tại đã ký xong hợp đồng chuyển nhượng có công chứng. Nay tôi được công chứng viên báo là ông A đang bị khởi tố vì tội lừa đảo chiếm đoạt tài sản và có thể giao dịch mua đất của tôi sẽ bị tạm dừng do tài sản của ông A sẽ bị phong toả? Xin hỏi công chứng viên nói như vậy có đúng không? Xin cảm ơn! ",
  "answer": " Điều 188 Luật Đất đai 2013 quy định điều kiện thực hiện quyền chuyển nhượng quyền sử dụng đất như sau: - Có Giấy chứng nhận, trừ trường hợp quy định tại khoản 3 Điều 186 và trường hợp nhận thừa kế quy định tại khoản 1 Điều 168 của Luật đất đai 2013; Đất không có tranh chấp; - Quyền sử dụng đất không bị kê biên để bảo đảm thi hành án; - Trong thời hạn sử dụng đất. Căn cứ quy định nêu trên, nếu mảnh đất mà bạn đang thực hiện giao dịch mua bán, chuyển nhượng đáp ứng các điều kiện trên thì người bán có quyền chuyển nhượng mảnh đất này. Hợp đồng chuyển nhượng được công chứng theo đúng quy định pháp luật. Căn cứ quy định tại Điều 82 Luật Công chứng 2014, tổ chức hành nghề công chứng có thể cập nhật thông tin về biện pháp ngăn chặn được áp dụng đối với tài sản có liên quan đến hợp đồng, giao dịch đã được công chứng, cụ thể như sau: “Điều 62 Cơ sở dữ liệu công chứng 1 Cơ sở dữ liệu công chứng bao gồm các thông tin về nguồn gốc tài sản, tình trạng giao dịch của tài sản và các thông tin về biện pháp ngăn chặn được áp dụng đối với tài sản có liên quan đến hợp đồng, giao dịch đã được công chứng.” Căn cứ quy định trên, văn phòng công chứng nơi bạn đã công chứng hợp đồng chuyển nhượng sẽ cập nhật được những thông tin ngăn chặn đối với tài sản mà bạn đang giao dịch. Do đó, nếu những thông tin ngăn chặn đối với tài sản bạn đang giao dịch đã được cập nhật trên cơ sở dữ liệu công chứng thì nội dung văn phòng công chứng đưa ra như bạn nêu trên là hoàn toàn có cơ sở. Tuy hợp đồng chuyển nhượng quyền sử dụng đất của bạn đã được công chứng nhưng theo quy định tại Điều 11a Thông tư 24/2014/TT-BTNMT (được bổ sung bởi Thông tư 33/2017/TT-BTNMT), cơ quan có thẩm quyền sẽ dừng giải quyết thủ tục cấp Giấy chứng nhận quyền sử dụng đất đối với hồ sơ xin cấp Giấy chứng nhận quyền sử dụng đất của bạn. Xem thêm: Trường hợp nào bị phong tỏa tài khoản ngân hàng? Trên đây là nội dung tư vấn dựa trên những thông tin mà luật sư đã nhận được. Nếu còn bất kỳ thắc mắc nào liên quan, vui lòng liên hệ 19006199 để được hỗ trợ kịp thời. Xin cảm ơn!",
  "summary": "Nếu những thông tin ngăn chặn đối với tài sản bạn đang giao dịch đã được cập nhật trên cơ sở dữ liệu công chứng thì nội dung văn phòng công chứng đưa ra như bạn nêu trên là hoàn toàn có cơ sở. Tuy hợp đồng chuyển nhượng quyền sử dụng đất của bạn đã được công chứng nhưng theo quy định tại Điều 11a Thông tư 24/2014/TT-BTNMT (được bổ sung bởi Thông tư 33/2017/TT-BTNMT), cơ quan có thẩm quyền sẽ dừng giải quyết thủ tục cấp Giấy chứng nhận quyền sử dụng đất đối với hồ sơ xin cấp Giấy chứng nhận quyền sử dụng đất của bạn.",
  "legals": [
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "188", "clause": "Không xác định", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "168", "clause": "1", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "186", "clause": "3", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Công chứng 2014", "article": "82", "clause": "Không xác định", "point": ""},
    {"doctype": "Luật", "legislation": "Luật Công chứng 2014", "article": "62", "clause": "Không xác định", "point": ""},
    {"doctype": "Thông tư", "legislation": "Thông tư 24/2014/TT-BTNMT (được bổ sung bởi Thông tư 33/2017/TT-BTNMT)", "article": "11", "clause": "Không xác định", "point": ""}
  ]
}
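As an illustration of how such a record could feed the multi-label classification task, the following minimal Python sketch filters the `legals` annotations down to Land Law 2013 articles and encodes them as a multi-hot vector. The helper names (`article_labels`, `multi_hot`) are hypothetical, the record is trimmed to the fields needed, and the vector length of 212 assumes one position per article of the Land Law 2013; the thesis may encode labels differently.

```python
import json

# Trimmed copy of the record above: only the fields needed for labeling.
record_json = """
{
  "id": "92",
  "legals": [
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "188"},
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "168"},
    {"doctype": "Luật", "legislation": "Luật Đất đai 2013", "article": "186"},
    {"doctype": "Luật", "legislation": "Luật Công chứng 2014", "article": "82"}
  ]
}
"""

TARGET_LAW = "Luật Đất đai 2013"  # the law whose articles serve as labels
NUM_ARTICLES = 212                # assumed: one slot per Land Law 2013 article


def article_labels(record, law=TARGET_LAW):
    """Return the sorted article numbers of `law` cited in one record."""
    return sorted(
        {int(item["article"]) for item in record["legals"]
         if item["legislation"] == law}
    )


def multi_hot(articles, size=NUM_ARTICLES):
    """Encode 1-based article numbers as a 0/1 vector of length `size`."""
    vector = [0] * size
    for number in articles:
        vector[number - 1] = 1
    return vector


record = json.loads(record_json)
labels = article_labels(record)
target = multi_hot(labels)
print(labels)       # [168, 186, 188]
print(sum(target))  # 3
```

Annotations that point to other legislation (here the Notarization Law) are dropped by the filter, which mirrors the focus on Land Law articles described above.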

2.2 Formal Structure of a Legal Document

People without specialized law knowledge and even inexperienced lawyers are unfamiliar with examining and interpreting complex law documents [10]. Legal text has grown exponentially on the Internet and in specialized systems, along with other natural language text data, such as scientific publications, news stories, or social media [22]. In contrast to other literature, legal texts contain strong logical links between sentences or other articles using words, phrases, concerns, concepts, and variables related to the law [23]. These logical links create ambiguity, which unfortunately makes finding information and providing answers in the legal field more difficult than in other fields [24].

Global researchers [25][26] identified some basic ontology design patterns regularly used to model legal norms: i) Agent-role-time; ii) Event-time-place-jurisdiction; iii) Agent-action-time [27]; iv) Object-document [27]; v) Legal deontic ontology [28][26]. These patterns, combined with linguistic taxonomies, could provide a good solution for creating a bridge between the variants of the legal definitions and the conceptualization level [29].

In legal science in Vietnam, in order to identify the specific laws and provisions that resolve a legal question, the following factors are usually analyzed: SUBJECT (S), OBJECT (O), LEGAL RELATIONSHIP (R), TRANSACTION OBJECT (TO), and TIME of legal events (T). Determining these factors is very important because each legal code governs its own set of S, O, R, TO, T, through which we can identify which terms and laws to use to solve the problem.

• Subject (”Chủ thể.” Subjects of legal relations): an individual or organization with legal capacity and legal act capacity participating in legal relations.

• Objects (”Khách thể.” Objects of legal relations): material benefits, spiritual benefits, or both, that the subject parties want to achieve when entering a particular legal relationship.

• Relationship (”Hành vi” or ”Quan hệ pháp lý.” Acts/ Legal relations): something that people do or cause to happen.

• Transaction Object (Đối tượng bị tác động): objects of legal transactions such as houses, land, money, and gold.

• Time: the time at which the legal action or event occurs.

Example and analysis:

Table 2.1: S, O, R, TO, T legal question analysis.

| Subject (S) | Objects (O) | Relationship (R) | Transaction object (TO) | Time (T) |
|---|---|---|---|---|
| Vợ chồng bà A; cơ quan có thẩm quyền | quyền sử dụng đất | sử dụng; chuyển nhượng; hợp đồng mua bán; chứng nhận; cấp giấy chứng nhận; giao dịch | đất | 1970 |

Assume that the labeling and identification of the sets S, O, R, TO, and T are correct and appropriate. In that case, we can anticipate developing an AI model that recognizes Article 101 and Article 188 of the Vietnam Land Law 2013, as described in the preceding illustration.

Due to time constraints, this thesis only examines the extraction of S and R and the development of an AI model that effectively leverages S and R to perform LSI. This is one of the thesis’s key techniques.
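To make the S, O, R, TO, T analysis concrete, here is a minimal Python sketch of a container for the five factors. The class name `LegalQuestionAnalysis`, its field names, and the `lsi_features` helper are hypothetical; pairing every subject with every relationship is only one possible way to expose S and R to a downstream LSI model and is not claimed to be the thesis's method. The sample values are taken from Table 2.1.

```python
from dataclasses import dataclass, field


@dataclass
class LegalQuestionAnalysis:
    """S, O, R, TO, T factors identified in one legal question."""
    subjects: list = field(default_factory=list)             # S
    objects: list = field(default_factory=list)              # O
    relationships: list = field(default_factory=list)        # R
    transaction_objects: list = field(default_factory=list)  # TO
    time: str = ""                                           # T

    def lsi_features(self):
        """Return all (S, R) pairs; an assumed feature layout for LSI."""
        return [(s, r) for s in self.subjects for r in self.relationships]


analysis = LegalQuestionAnalysis(
    subjects=["Vợ chồng bà A", "cơ quan có thẩm quyền"],
    objects=["quyền sử dụng đất"],
    relationships=["chuyển nhượng", "cấp giấy chứng nhận"],
    transaction_objects=["đất"],
    time="1970",
)
for pair in analysis.lsi_features():
    print(pair)
```

Keeping O, TO, and T in the structure even though only S and R are used leaves room for the remaining factors to be added later without changing the record layout.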

2.3 Legal Data Sourcing

The majority of the data is collected from online sources and law books (which are scanned and converted to digital format using OCR technology). The main data sources are as follows.

a) Academic Q&A group:

• https://hocluat.vn/

b) Question and answer group about specific cases:

• Ministry of Finance: https://mof.gov.vn/hoidapcstc/

• Ministry of Labour, Invalids and Social Affairs: http://bovoinddn.molisa.gov.vn/trang-chu/ho-tro-covid-19

• Ministry of Education and Training: https://moet.gov.vn/bovoinguoidan/Pages/hoi-dap.aspx

• Ministry of Justice: https://hdpl.moj.gov.vn/Pages/home.aspx

• Ministry of Industry and Trade: https://moit.gov.vn/hoi-dap

• Ministry of Transport: https://mt.gov.vn/vn/pages/Hoidap.aspx?cID=76• Ministry of Construction: https://moc.gov.vn/vn/pages/hoidap.aspx

• Ministry of Agriculture and Rural Development: https://www.mard.gov.vn/Pages/hoi-dap.aspx• Ministry of Information and Communications: https://mic.gov.vn/Pages/hoidap.aspx

• Ministry of Public Security: http://bocongan.gov.vn/hoi-dap.html

• Ministry of Science and Technology: http://www.most.gov.vn/vn/Pages/Hoidap.aspx

• Ministry of Home Affairs: https://www.moha.gov.vn/danh-muc.html?cateid=278

• Thuvienphapluat: https://danluat.thuvienphapluat.vn/

c) Judgments published by the Supreme People’s Court:

More than 1 million judgments published by the people’s courts are available at:

• https://congbobanan.toaan.gov.vn

2.4 Data Preparation

2.4.1 LSI Labeling


Figure 2.1: Data collection procedure


Below are some screenshots of the labeling application for this project.


Figure 2.3: Data labeling application screenshot - Home Page


2.4.2 Legal Entity Extraction

Legal entities and legal relations are extracted from a corpus built from three sources:

• Over a million Code/Law decisions.

• 300K questions and answers for all Codes/Laws (data source).

• All other legal documentation.

The VnCoreNLP [30] library assisted us in extracting all nouns and verbs in the corpus. Verbs are likely to be legal relationships, while nouns are likely to represent legal subjects. The first stage in legal entity extraction is to ask, “Is the following noun a legal subject?” The user responds with “yes” or “no”. The subsequent stages are labeled in the same way, following Figure 2.5 (the tree of “Chủ thể”, the subjects of legal relations).
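The candidate-collection and yes/no labeling steps can be sketched as follows. This is a minimal illustration that assumes the sentences have already been word-segmented and POS-tagged (for example by VnCoreNLP) into `(word, tag)` pairs; it does not call the library itself. The tag sets `NOUN_TAGS` and `VERB_TAGS` and the function names are assumptions, not part of the thesis.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed POS tags for nouns and verbs in the tagger output.
NOUN_TAGS = {"N", "Np", "Nc", "Nu"}
VERB_TAGS = {"V"}

TaggedSentence = List[Tuple[str, str]]


def collect_candidates(sentences: List[TaggedSentence]) -> Tuple[Counter, Counter]:
    """Count nouns (subject candidates) and verbs (relation candidates)."""
    nouns: Counter = Counter()
    verbs: Counter = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            if tag in NOUN_TAGS:
                nouns[word] += 1
            elif tag in VERB_TAGS:
                verbs[word] += 1
    return nouns, verbs


def confirmed_subjects(nouns: Counter, answers: Dict[str, str]) -> List[str]:
    """Keep only the nouns an annotator answered 'yes' for."""
    return [noun for noun in nouns if answers.get(noun) == "yes"]


# Toy pre-tagged input; real input would come from the tagger.
tagged = [
    [("bà", "Nc"), ("A", "Np"), ("chuyển_nhượng", "V"), ("đất", "N")],
    [("doanh_nghiệp", "N"), ("thuê", "V"), ("đất", "N")],
]
nouns, verbs = collect_candidates(tagged)

# Annotator answers to "Is the following noun a legal subject?"
answers = {"doanh_nghiệp": "yes", "đất": "no"}
subjects = confirmed_subjects(nouns, answers)
print(subjects, dict(verbs))
```

Counting occurrences rather than only collecting a set makes it possible to present the most frequent candidates to the annotator first.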

Figure 2.5: The tree of ”Chủ thể.” Subjects of legal relations


Table 2.2: Entity extraction example.

Thể nhân (Person) | Pháp nhân (Legal Person) | Chủ thể quản lý nhà nước (State management entities)

nhân_dân tổng_công_ty cơ_quan_chức_năng
thư_kí trường xã
nhà_cung_cấp chi_nhánh liên_đoàn
thí_sinh viện_kiểm_sát ubnd
giáo_viên tổ_chức phường
nhà_đầu_tư doanh_nghiệp huyện_uỷ
luật_sư đại_sứ_quán nhà_nước
gái bank thôn
mẹ_kế chi_cục
uỷ_viên uỷ_ban_nhân_dân
bị_đơn uỷ_ban
em_gái tổ_dân_phố
bạn_bè thanh_tra
thông_dịch_viên xã_phường
thầy chính_phủ
tình_nguyện_viên đảng_uỷ
thư_ký hội_đồng_nhân_dân
người_dân cơ_quan_dân_cử
công_tác_viên quốc_hội
nhà_thầu cơ_quan_hành_chính

2.4.3 Legal Relation Extraction


2.4.4 The TF-IDF Matrix of Vietnam Land Law

The following steps are applied to the 212 articles of the Vietnam Land Law 2013:

• Pre-process the text (including Vietnamese stopword processing), obtaining 911 distinct single and compound words (tokens). This again shows that legislators use only 911 words to write a large body of law such as the Land Law. This finding is intriguing because it can be exploited to improve the accuracy of LSI with a computational approach.

• Compute the TF-IDF matrix (referred to as the matrix M) by calculating the TF-IDF score of each term (compound word) in each document (article). With normalized scores, each value ranges from 0 to 1; a higher value means the term is frequent in that article but rare across the other articles, and a lower value means the term is common across articles or infrequent in that article.

• From the matrix M, we obtain Table 2.3 (top 100 single-word TF-IDF values from the 212 articles of the Vietnam Land Law) and Table 2.4 (top 200 single-word TF-IDF values from the 212 articles of the Vietnam Land Law), shown below.
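The computation of the matrix M can be sketched in a few lines. The thesis does not state which TF-IDF weighting or normalization was used, so this sketch assumes raw term counts, a smoothed IDF, and L2 normalization of each article vector (which keeps every score between 0 and 1). The function name `tfidf_matrix` and the three toy "articles" are illustrative only.

```python
import math
from collections import Counter
from typing import List, Tuple


def tfidf_matrix(docs: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Return (vocabulary, matrix) with one L2-normalized TF-IDF row per document."""
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}

    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # avoid division by zero
        rows.append([x / norm for x in raw])
    return vocab, rows


# Toy word-segmented "articles"; the real input would be the 212 articles.
articles = [
    ["đất", "sử_dụng", "quyền", "chuyển_nhượng"],
    ["đất", "thu_hồi", "bồi_thường"],
    ["đất", "quy_hoạch", "kế_hoạch", "sử_dụng"],
]
vocab, M = tfidf_matrix(articles)

# A token present in every article ("đất") scores lower than a token
# that appears in only one article ("chuyển_nhượng").
print(M[0][vocab.index("đất")], M[0][vocab.index("chuyển_nhượng")])
```

With only 911 tokens and 212 articles, a dense list-of-lists representation like this one would be small enough to hold in memory; a sparse representation would matter only for much larger corpora.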


Table 2.3: Top 100 single-word TF-IDF values from the 212 articles of the Vietnam Land Law.

đất thu nông nghĩa cá

dụng cấp sở đối liền

sử điều kế mục quốc

định luật hồi thì thời

quyền chức giá việc gắn

hoạch nhân dân để trình

có nhận dựng hộ ban

nước tiền xây kinh địa

của người này thường giấy

quy chính tại quản hữu

nhà các cơ đích khi

được chuyển lý ngoài khác

thuê giao tài trong án

và tổ vụ đầu ủy

cho về hành cư khoản

đai hợp trường pháp xuất

nghiệp không bồi quan tế

với hiện tư hạn xã

công thực quyết đình rừng


Table 2.4: Top 200 single-word TF-IDF values from the 212 articles of the Vietnam Land Law.

đất trường_hợp vốn kinh_doanh lần

sử_dụng đầu_tư mà bằng chuyển_đổi

quyền gia_đình cơ_sở một thủy_sản

có cá_nhân trách_nhiệm hàng sự_nghiệp

của hộ làm ngày bị

được tài_sản sau thủ_tục tranh_chấp

và nhân_dân thời_hạn giá bản_đồ

thuê tiền lập bộ cưỡng_chế

cho nhà chính_phủ chi_tiết liên_quan

đất_đai nghĩa_vụ chuyển_nhượng môi_trường bao_gồm

nhà_nước khu nam hệ_thống nuôi_trồng

quy_định liền việt đến vi_phạm_pháp_luật

cấp gắn điều_kiện loại sang

quy_hoạch khi do phi công_nghiệp

người sở_hữu thuộc trồng thông_qua

các khác là cả bảo_đảm

về quyết_định quốc_phòng đang tư_vấn

tổ_chức khoản huyện tài_chính kỳ

thu_hồi ủy_ban phê_duyệt cộng_đồng chủ

theo cơ_quan xã chính_sách đăng_ký

không phải diện_tích dữ_liệu căn_cứ

kế_hoạch nhận lại góp quỹ

xây_dựng công_trình đây công_cộng đô_thị

giao kinh_tế an_ninh xác_định vì

thực_hiện rừng hỗ_trợ thiệt_hại tài_nguyên

này chuyển doanh_nghiệp tôn_giáo từng

tại sản_xuất thông_tin đúng thừa_kế

nông_nghiệp dự_án định_cư phù_hợp hình_thức

điều giấy địa_phương thi_hành đủ

luật hoặc bảo_vệ giải_quyết hiệu_lực

với chứng_nhận hạn_mức muối mình

bồi_thường pháp_luật trả địa_chính điều_tra

thì đã quốc_gia phép thửa

để thu tái_định_cư trên hành_vi

quản_lý vào xã_hội đấu_giá đó

mục_đích tỉnh chưa còn thị_trấn

việc thẩm_quyền phát_triển điểm công_nhận

đối_với tiền_sử_dụng từ thời_gian gồm

trong năm dân_cư tặng phường


2.5 Legal Data Summary Statistics

2.5.1 Legal Data Classification

The collected legal data can be divided into the following 27 categories (Figure 2.8).

Figure 2.8: Legal data categories.

2.5.2 Basic Legal Data Statistics

The data was collected from three sources.


Figure 2.9: Supervised LSI training data statistics per legal category.


2. Legal documents are collected from the internet or from specialized books (scanned and converted into digital data using OCR technology). This data is unlabeled and is used to retrain BERT models.


Figure 2.12: Semi-supervised LegarBert (MLM) training data statistics per legal category (From books) (Distribution).


Figure 2.14: Semi-supervised LegarBert (MLM) training data statistics per legal category (From Supreme People’s Court) (Distribution).

Table 2.5: Legal Document Sentences and Words Statistics

Category | Field of law | Sentences | Size (bytes) | Size (MB) | Words
Judgment | Civil | 47,696 | 26,532,580 | 27 | 858,528,000
Judgment | Criminal | 164,066 | 101,579,545 | 102 | 1,640,660,000
Judgment | Marriage and family | 111,056 | 42,596,201 | 43 | 1,999,008,000
Judgment | Business and commerce | 4,988 | 1,624,607 | 2 | 74,820,000
Judgment | Labor | 1,082 | 927,185 | 1 | 20,558,000
Decision | Civil | 95,345 | 650,323,088 | 50 | 123,948,500
Decision | Marriage and family | 388,191 | 147,256,188 | 147 | 1,319,849,400
Decision | Business and commerce | 8,545 | 2,779,520 | 3 | 29,053,000
Decision | Labor | 1,721 | 1,397,418 | 1 | 2,925,700
Decision | Decision applying administrative sanction measures | 29,278 | 10,351,508 | 10 | 204,946,000
Decision | Decision applying administrative measures | 42,965 | 17,056,951 | 17 | 300,755,000
Decision | Decision declaring bankruptcy | 62 | 15,108 | 0 | 186,000
Total | | 894,995 | 402,448,899 | 402 | 6,575,237,600

2.5.3 Unbalanced Legal Data


• Hire experts or third parties to create additional Q&A data about the provisions of the Vietnam Land Law.

• Use meta-learning methods (such as few-shot learning) to adapt the learning model to a new context with less data.

Figure 2.15: Unbalanced legal data phenomenon.

2.5.4 Vietnam Land Law Article Semantic Relations Matrix

We vectorize the articles using TF-IDF and then apply cosine similarity to represent their semantic relationship. The results (Figure 2.16: Heatmap of the cosine similarity between the TF-IDF vectors of the 212 legal documents) show that articles from the same chapter of the Vietnam Land Law often have a strong semantic relationship (Figure 2.17: Semantic relations of Articles 35-51 of Chapter IV (Land use master plans and plans)). In the following section, we use this semantic relation matrix for an analogous analysis with co-occurrence matrices.
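For reference, the cosine similarity between two article vectors u and v is their dot product divided by the product of their Euclidean norms. The pairwise matrix can be sketched as below; the function names and the three toy vectors are illustrative assumptions standing in for the 212 TF-IDF article vectors.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Symmetric matrix S where S[i][j] is the similarity of articles i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors: the first two share terms, the third shares none with them.
vectors = [
    [0.8, 0.6, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
S = similarity_matrix(vectors)
for row in S:
    print([round(x, 2) for x in row])
```

Because TF-IDF scores are non-negative, every entry of such a matrix lies between 0 and 1, which is what makes it suitable for display as a heatmap.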
