Vietnam-Korea University of Information and Communication Technology, Vietnam
Faculty of Computer Science
SUMMER INTERNSHIP PROJECT
Vietnamese Word Pre-processing and Segmentation
Participants: Nguyễn Kết Đoàn – 20AD
Nguyễn Trần Tiến – 20SE5
Nguyễn Đức Bảo – 20AD
Tôn Thất Rôn – 21GIT
Võ Văn Nam – 21SE3
Phùng Ánh Sáng – 21GIT
Instructor: Dr. Nguyễn Hữu Nhật Minh
Đà Nẵng, July 2023
INSTRUCTOR'S COMMENT
ACKNOWLEDGEMENTS
We would like to sincerely thank Dr. Nguyễn Hữu Nhật Minh, lecturer of the Faculty of Computer Science, who dedicatedly and enthusiastically helped us throughout our time studying and working on this project. He spent a great deal of valuable time wholeheartedly guiding and orienting us, helping us complete the project and gain valuable experience.
We would also like to thank the teachers of the Faculty of Computer Science, as well as the teachers at eSTi, who enthusiastically taught, facilitated, and supported us while we carried out the project at Vietnam-Korea University of Information and Communication Technology.
We would like to thank our friends, especially our fellow interns in the summer internship at eSTi, for creating favorable learning and working conditions for us during the research and implementation of this project. Due to our limited knowledge and the relatively short time available for the topic, errors cannot be avoided, and we look forward to your guidance. Finally, we wish the teachers and interns at eSTi continued success in their teaching and learning activities.
Leader,
Nguyễn Kết Đoàn
INTRODUCTION
In today's fast-paced world, technology plays an important role in almost every aspect of life. With the explosion of AI (Artificial Intelligence) in particular, more and more AI models are being created in every area of life, such as robotics, computer vision, and language, assisting or even replacing people in some tasks. AI has achieved many great results, such as Midjourney in the field of images and ChatGPT in the field of language. ChatGPT in particular, a very powerful language model, is considered a wonder of the 21st century.
Observing the development of English language models, and with the desire to build a dedicated language model for Vietnamese that can understand and process Vietnamese in depth in terms of semantics and diacritics, we studied previous research and AI projects on Vietnamese. There have been many successful studies with notable results, such as those from Viettel, VinAI, and VietAI. However, my teammates and I found that previous projects have not focused on semantics or on the multiple meanings of words in Vietnamese. Therefore, in this summer internship, my team members and I chose the topic of Vietnamese word pre-processing and segmentation, a step in Vietnamese preprocessing for model training. We try to tokenize words as well as possible based on the context and the meaning they express. In addition, we also focus on processing Vietnamese data, such as correcting common Vietnamese errors in the dataset.
TABLE OF CONTENTS
Chapter 1. THEORETICAL BASIS
1. Overview of natural language processing in machine learning
Machine learning overview
Machine Learning is an area of artificial intelligence (AI) that studies methods and techniques that enable computers to learn from data and improve performance over time without being explicitly programmed. Basically, the Machine Learning process includes the following steps, illustrated with a short code sketch after the list:
Data collection: First, data is collected from sources such as databases, data collectors, or directly from data sources such as sensors.
Data preprocessing: After data collection, preprocessing is performed to clean and prepare the data for machine learning. This may include missing data processing, noise removal, data normalization, and feature extraction.
Model selection and training: In this step, the machine learning model is selected and trained on the preprocessed data. There are many types of machine learning models, such as supervised learning, unsupervised learning, and reinforcement learning.
Model evaluation and refinement: After the model is trained, it is evaluated using performance evaluation methods such as dividing the data into training and test sets. If the performance is not satisfactory, the model is refined by changing its hyperparameters or architecture.
Model deployment: Once the model has achieved good enough performance, it can be deployed and applied in a real environment. Implementation may include integrating the model into an application or automated system.
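To make the workflow concrete, here is a minimal, hedged sketch in Python. It assumes scikit-learn is available (the report does not prescribe a library), and the tiny in-line dataset, the label meanings, and the model choice are illustrative assumptions only.

```python
# A minimal sketch of the Machine Learning workflow described above.
# Assumes scikit-learn is installed; the tiny in-line dataset is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 1. Data collection (here: a hard-coded toy dataset)
texts = ["good product", "bad service", "great quality",
         "terrible experience", "excellent support", "awful packaging"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# 2. Data preprocessing + 3. Model selection: bag-of-words features + logistic regression
model = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression())

# 4. Evaluation: split into training and test sets, train, then score on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# 5. Deployment (simplified): use the trained model on new, unseen text
print(model.predict(["great service"]))
```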
Machine Learning is widely applied in many fields, including image recognition, natural language processing, financial forecasting, recommender systems, and many others. It has significantly contributed to the creation of intelligent technologies and applications, which enhance the automation and information processing capabilities of computers.
Natural language processing overview
In Natural Language Processing (NLP), Machine Learning is extensively used for various processing tasks. Here are some key areas where Machine Learning techniques are applied for language processing in NLP:
Tokenization: Machine Learning models are trained to split text into individual tokens or words. This is an essential step in NLP, where sentences are divided into smaller units for further analysis and processing.
Part-of-Speech Tagging: Machine Learning algorithms can be employed to assign grammatical tags to each word in a sentence, such as noun, verb, adjective, etc. This helps in understanding the syntactic structure and grammatical relationships within the text.
Named Entity Recognition (NER): Machine Learning models are used to identify and extract named entities from text, such as person names, organization names, locations, and more. NER helps in understanding the entities mentioned in the text and their relevance to the overall context.
Dependency Parsing: Machine Learning techniques can be utilized to analyze the grammatical structure of sentences by identifying the syntactic dependencies between words. Dependency parsing helps in understanding how words relate to each other within a sentence.
Semantic Role Labeling: Machine Learning models can be trained to identify the semantic roles of words or phrases within a sentence, such as the subject, object, or predicate. This aids in understanding the meaning and roles of different elements in a sentence.
Sentiment Analysis: Machine Learning algorithms are commonly used to analyze and classify the sentiment expressed in text. This involves training models on labeled data to identify and classify text as positive, negative, or neutral in terms of sentiment.
Language Modeling: Machine Learning models are trained on large amounts of text data to learn the probabilities of word sequences. This allows them to generate new text or predict the likelihood of a given sequence of words.
Machine Translation: Machine Learning models, especially sequence-to-sequence models, are employed for machine translation tasks. These models learn to translate text from one language to another by training on parallel corpora of translated sentences.
These are some examples of how Machine Learning is used for processing natural language in NLP. Machine Learning techniques enable computers to understand and process human language by learning patterns, rules, and relationships from data. A brief illustration of the first three tasks above follows.
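As a hedged illustration of tokenization, part-of-speech tagging, and named entity recognition, the short Python sketch below uses spaCy; the library choice and the model name en_core_web_sm are assumptions for the example, not something specified in this report.

```python
# Sketch: tokenization, POS tagging, and NER with spaCy.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm` were run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Da Nang in July 2023.")

# Tokenization + part-of-speech tags
for token in doc:
    print(token.text, token.pos_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```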
2. Theoretical basis
Sentence segmentation
Sentence segmentation, also known as sentence boundary detection or sentence tokenization, is the task of splitting a text into individual sentences. In Natural Language Processing (NLP), sentence segmentation is an important preprocessing step for various language processing tasks.
Machine Learning techniques can be applied to perform sentence segmentation. Here's an overview of how it can be approached:
Dataset Preparation: Annotated data is required for training a Machine Learning model for sentence segmentation. This dataset consists of text documents where sentences are manually segmented and labeled.
Feature Extraction: Various features can be extracted from the text to assist in sentence segmentation. These features may include punctuation marks, capitalization patterns, abbreviations, or specific language patterns.
Model Selection: Different Machine Learning algorithms can be utilized for sentence segmentation, such as Support Vector Machines (SVM), Conditional Random Fields (CRF), or Recurrent Neural Networks (RNN). The choice of model depends on the specific requirements and characteristics of the data.
Training: The extracted features and corresponding labels from the annotated dataset are used to train the Machine Learning model. The model learns to identify patterns and cues that indicate the boundaries between sentences.
Evaluation and Fine-tuning: The trained model is evaluated on a separate validation or test set to assess its performance. If necessary, the model can be fine-tuned by adjusting hyperparameters or modifying the feature set.
Inference: Once the model is trained and validated, it can be used for sentence segmentation on new, unseen text. The model takes a text document as input and predicts the sentence boundaries based on the learned patterns.
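In practice, a pre-trained statistical sentence tokenizer is often used instead of training one from scratch. The sketch below is a minimal example with NLTK's Punkt model; the library choice is an assumption, not part of the workflow described above.

```python
# Sentence segmentation with NLTK's pre-trained Punkt model.
# Assumes `pip install nltk`; the 'punkt' resource (or 'punkt_tab' in newer
# NLTK versions) must be downloaded once before use.
import nltk

nltk.download("punkt", quiet=True)

text = "Dr. Smith arrived at 9 a.m. He gave a talk on NLP. Everyone enjoyed it."
for sentence in nltk.sent_tokenize(text):
    print(sentence)
```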
It's worth noting that there are also rule-based approaches for sentence segmentation that rely on predefined linguistic rules and heuristics. These approaches can be effective in certain scenarios, especially when dealing with specific languages or domains.
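A tiny rule-based segmenter of this kind can be written with a single regular expression. The sketch below is deliberately naive (it splits after '.', '!', or '?' followed by whitespace and a capital letter) and will misfire on abbreviations such as "Dr."; it is only meant to illustrate the rule-based idea.

```python
import re

def split_sentences(text):
    """Naive rule-based sentence segmentation:
    split after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("It rained all day. We stayed inside! Did you finish the report?"))
# -> ['It rained all day.', 'We stayed inside!', 'Did you finish the report?']
```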
Sentence segmentation is a fundamental task in NLP, and accurate segmentation is crucial for subsequent language processing tasks such as part-of-speech tagging, named entity recognition, or sentiment analysis. Machine Learning techniques provide an automated and data-driven approach to tackle this task effectively.
Word pre-processing
Word preprocessing is an important step in Natural Language Processing (NLP) that involves transforming raw text into a format that is suitable for further analysis and modeling. The goal of word preprocessing is to clean and normalize the text data, reducing noise and inconsistencies, and improving the quality of the input for downstream tasks.
Here are some common techniques used in word preprocessing:
Tokenization: Tokenization is the process of splitting text into individual words or tokens. It breaks down the text into smaller units, making it easier to analyze and process. Tokens can be generated based on whitespace, punctuation marks, or more sophisticated subword tokenization techniques.
Lowercasing: Converting all text to lowercase is a common preprocessing step. It helps in standardizing the text and avoids treating the same word differently based on capitalization. For example, "Hello" and "hello" will be considered the same after lowercasing.
Removing Punctuation: Punctuation marks like commas, periods, and question marks are often removed as they usually do not contribute much to the meaning of the text in many NLP tasks. However, in some cases, punctuation can provide important context, so it depends on the specific task requirements.
Stop Word Removal: Stop words are commonly occurring words like "the," "is," and "and" that do not carry much semantic meaning. They can be removed as they can introduce noise and increase computational overhead. However, in certain applications like sentiment analysis or document classification, stop words may carry valuable information and might not be removed.
Lemmatization and Stemming: Lemmatization and stemming are techniques used to reduce words to their base or root form. Lemmatization aims to convert words to their dictionary form (lemma), while stemming uses heuristic rules to remove prefixes and suffixes. These techniques help to handle variations of words and reduce vocabulary size.
Handling Numerical and Special Characters: Depending on the task, numerical digits and special characters can be treated differently. For some applications, they might be relevant (e.g., sentiment analysis of product reviews), while in other cases, they can be removed or replaced with special tokens.
Handling Rare or Infrequent Words: Rare or infrequent words that occur in the text can be replaced with a special token or entirely removed to simplify the vocabulary and reduce noise. This is particularly useful when dealing with large datasets where rare words may not provide significant value.
It's important to note that word preprocessing techniques depend on the specific NLP task, domain, and dataset characteristics. The choice of preprocessing steps should be carefully considered to preserve meaningful information and avoid losing valuable context for the given task. A small code sketch of a typical pipeline follows.
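As a hedged example of such a pipeline, the plain-Python sketch below chains lowercasing, punctuation removal, digit handling, whitespace tokenization, and stop-word removal; the tiny stop-word list and the <num> token are assumptions made only for illustration, and a real project would use a fuller stop-word list or a library.

```python
import re
import string

# Illustrative stop-word list (an assumption; real pipelines use much larger lists)
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, replace digits, tokenize on whitespace, drop stop words."""
    text = text.lower()                                                # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "<num>", text)                              # replace digit runs with a special token
    tokens = text.split()                                             # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]                 # stop-word removal

print(preprocess("The price of the phone is 999 USD, and it is worth it!"))
# -> ['price', 'phone', '<num>', 'usd', 'it', 'worth', 'it']
```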
Word segmentation
Word segmentation is the process of dividing a sequence of characters or a continuous string of text into individual words or word-like units. Word segmentation is particularly important in languages like Chinese, Thai, and Vietnamese, where spaces do not reliably mark word boundaries (in Vietnamese, spaces separate syllables rather than words).
Here are some approaches and techniques used for word segmentation:
Rule-Based Segmentation: Rule-based approaches utilize predefined linguistic rules and heuristics to identify word boundaries. These rules are based on patterns, character frequencies, or morphological structures specific to the language. For example, in Chinese, word boundaries can be determined based on the combination of characters and their context.
Statistical and Machine Learning Methods: Machine Learning techniques, such as Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), or Recurrent Neural Networks (RNNs), can be trained to learn word boundaries based on annotated data. These models capture patterns and statistical dependencies between characters to make predictions about word segmentation.
Lexicon-Based Methods: Lexicon-based approaches leverage pre-existing dictionaries or lexicons to identify valid words in the text. The text is compared against the entries in the lexicon, and word boundaries are determined based on matches. However, these methods may struggle with out-of-vocabulary words or words not present in the lexicon (a minimal sketch of this idea appears after this list).
Hybrid Approaches: Hybrid methods combine multiple techniques to improve word segmentation accuracy. For example, a rule-based approach can be combined with statistical or machine learning methods to handle cases that are not covered by the rules. This hybridization helps to achieve better segmentation performance.
Domain-Specific Techniques: Word segmentation techniques can be tailored to specific domains or applications. For example, in biomedical text, domain-specific knowledge or dictionaries can be used to assist in word segmentation. This is because domain-specific texts often have unique vocabulary and linguistic characteristics.
Unsupervised or Semi-supervised Methods: Unsupervised or semi-supervised learning approaches can be used when labeled data for word segmentation is limited. These methods leverage unsupervised clustering or other techniques to identify word boundaries based on statistical patterns or distributional properties of the characters.
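As a minimal sketch of the lexicon-based idea, the Python code below performs greedy longest matching over Vietnamese syllables. The tiny dictionary, the three-syllable limit, and the underscore-joined output format are assumptions made for the example; a real system would use a full Vietnamese lexicon and handle ambiguity and out-of-vocabulary words.

```python
# Greedy longest-match word segmentation over Vietnamese syllables.
# The mini-lexicon below is illustrative only (an assumption, not a real resource).
LEXICON = {"học sinh", "sinh viên", "đại học", "công nghệ", "thông tin", "việt nam"}
MAX_WORD_SYLLABLES = 3  # longest multi-syllable entry we will try to match

def segment(sentence):
    """Join syllables into words by greedily matching the longest lexicon entry."""
    syllables = sentence.lower().split()
    words, i = [], 0
    while i < len(syllables):
        match = syllables[i]  # fall back to a single syllable if nothing matches
        for n in range(min(MAX_WORD_SYLLABLES, len(syllables) - i), 1, -1):
            candidate = " ".join(syllables[i:i + n])
            if candidate in LEXICON:
                match = candidate
                break
        words.append(match.replace(" ", "_"))  # underscore-joined output, a common convention
        i += len(match.split())
    return words

print(segment("sinh viên đại học công nghệ thông tin"))
# -> ['sinh_viên', 'đại_học', 'công_nghệ', 'thông_tin']
```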
Word segmentation is critical for various NLP tasks such as machine translation, named entity recognition, and part-of-speech tagging. Accurate word segmentation lays the foundation for subsequent analysis and understanding of text. The choice of