Vietnam-Korea University of Information and Communication Technology, Vietnam
Faculty of Computer Science
SUMMER INTERNSHIP PROJECT
Vietnamese Word Pre-processing and Segmentation
Participants: Nguyễn Kết Đoàn – 20AD
Nguyễn Trần Tiến – 20SE5
Nguyễn Đức Bảo – 20AD
Tôn Thất Rôn – 21GIT
Võ Văn Nam – 21SE3
Phùng Ánh Sáng – 21GIT
Instructor: Dr. Nguyễn Hữu Nhật Minh
Đà Nẵng, July 2023
INSTRUCTOR'S COMMENT
ACKNOWLEDGEMENTS
We would like to sincerely thank Dr. Nguyễn Hữu Nhật Minh, lecturer of the Faculty of Computer Science, who dedicatedly and enthusiastically helped us throughout our time studying and working on this project. He spent a great deal of valuable time wholeheartedly guiding and orienting us, helping us complete the project and gain valuable experience.
We would also like to thank the teachers of the Faculty of Computer Science, as well as the teachers at eSTi, who enthusiastically taught, facilitated, and supported us while we carried out the project at Vietnam-Korea University of Information and Communication Technology.
We would like to thank our friends, especially our fellow interns in the summer internship at eSTi, for creating favorable learning and working conditions for us during the research and implementation of this project. Due to our limited knowledge and the relatively short time available for the topic, errors cannot be avoided, and we look forward to your guidance. Finally, we wish the teachers and interns at eSTi continued success in their teaching and learning activities.
Leader,
Nguyễn Kết Đoàn
INTRODUCTION
In today's fast-paced world, technology plays an important role in almost every aspect of life. With the explosion of AI (Artificial Intelligence) in particular, more and more AI models are being created in every area of life, such as robotics, computer vision, and language, assisting or even replacing people in some tasks. AI has achieved many great results, such as Midjourney in the field of images and ChatGPT in the field of language. ChatGPT in particular, a very powerful language model, is considered a wonder of the 21st century.
Observing the development of English language models, and with the desire to build a dedicated language model for Vietnamese that can understand and process Vietnamese in depth in terms of semantics and diacritics, we studied previous research and AI projects on Vietnamese. There have been many successful studies with notable results, such as those from Viettel, VinAI, and VietAI. However, my teammates and I found that previous projects have not focused on semantics or on the multiple meanings of words in Vietnamese. Therefore, in this summer internship, my team members and I chose the topic of Vietnamese word pre-processing and segmentation, a step in Vietnamese preprocessing for model training. We try to tokenize words as well as possible based on the context and the meaning they express. In addition, we also focus on processing Vietnamese data, such as correcting common Vietnamese errors in the dataset.
TABLE OF CONTENTS
Chapter 1. THEORETICAL BASIS
1. Overview of natural language processing in machine learning
Machine learning overview
Machine Learning is an area of artificial intelligence (AI) that studies methods and techniques that enable computers to learn from data and improve performance over time without being explicitly programmed. Basically, the Machine Learning process includes the following steps, illustrated with a short code sketch after the list:
Data collection: First, data is collected from sources such as databases, data collectors, or directly from data sources such as sensors.
Data preprocessing: After data collection, preprocessing is performed to clean and prepare the data for machine learning. This may include missing data processing, noise removal, data normalization, and feature extraction.
Model selection and training: In this step, the machine learning model is selected and trained on the preprocessed data. There are many types of machine learning models, such as supervised learning, unsupervised learning, and reinforcement learning.
Model evaluation and refinement: After the model is trained, it is evaluated using performance evaluation methods such as dividing the data into training and test sets. If the performance is not satisfactory, the model is refined by changing its hyperparameters or architecture.
Model deployment: Once the model has achieved good enough performance, it can be deployed and applied in a real environment. Implementation may include integrating the model into an application or automated system.
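To make the workflow concrete, here is a minimal, hedged sketch in Python. It assumes scikit-learn is available (the report does not prescribe a library), and the tiny in-line dataset, the label meanings, and the model choice are illustrative assumptions only.

```python
# A minimal sketch of the Machine Learning workflow described above.
# Assumes scikit-learn is installed; the tiny in-line dataset is illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 1. Data collection (here: a hard-coded toy dataset)
texts = ["good product", "bad service", "great quality",
         "terrible experience", "excellent support", "awful packaging"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# 2. Data preprocessing + 3. Model selection: bag-of-words features + logistic regression
model = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression())

# 4. Evaluation: split into training and test sets, train, then score on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# 5. Deployment (simplified): use the trained model on new, unseen text
print(model.predict(["great service"]))
```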
Machine Learning is widely applied in many fields, including image recognition, natural language processing, financial forecasting, recommender systems, and many others. It has significantly contributed to the creation of intelligent technologies and applications, which enhance the automation and information processing capabilities of computers.
Natural language processing overview
In Natural Language Processing (NLP), Machine Learning is extensively used for various processing tasks. Here are some key areas where Machine Learning techniques are applied for language processing in NLP:
Tokenization: Machine Learning models are trained to split text into individual tokens or words. This is an essential step in NLP, where sentences are divided into smaller units for further analysis and processing.
Part-of-Speech Tagging: Machine Learning algorithms can be employed to assign grammatical tags to each word in a sentence, such as noun, verb, adjective, etc. This helps in understanding the syntactic structure and grammatical relationships within the text.
Named Entity Recognition (NER): Machine Learning models are used to identify and extract named entities from text, such as person names, organization names, locations, and more. NER helps in understanding the entities mentioned in the text and their relevance to the overall context.
Dependency Parsing: Machine Learning techniques can be utilized to analyze the grammatical structure of sentences by identifying the syntactic dependencies between words. Dependency parsing helps in understanding how words relate to each other within a sentence.
Semantic Role Labeling: Machine Learning models can be trained to identify the semantic roles of words or phrases within a sentence, such as the subject, object, or predicate. This aids in understanding the meaning and roles of different elements in a sentence.
Sentiment Analysis: Machine Learning algorithms are commonly used to analyze and classify the sentiment expressed in text. This involves training models on labeled data to identify and classify text as positive, negative, or neutral in terms of sentiment.
Language Modeling: Machine Learning models are trained on large amounts of text data to learn the probabilities of word sequences. This allows them to generate new text or predict the likelihood of a given sequence of words.
Machine Translation: Machine Learning models, especially sequence-to-sequence models, are employed for machine translation tasks. These models learn to translate text from one language to another by training on parallel corpora of translated sentences.
These are some examples of how Machine Learning is used for processing natural language in NLP. Machine Learning techniques enable computers to understand and process human language by learning patterns, rules, and relationships from data. A brief illustration of the first three tasks above follows.
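As a hedged illustration of tokenization, part-of-speech tagging, and named entity recognition, the short Python sketch below uses spaCy; the library choice and the model name en_core_web_sm are assumptions for the example, not something specified in this report.

```python
# Sketch: tokenization, POS tagging, and NER with spaCy.
# Assumes `pip install spacy` and `python -m spacy download en_core_web_sm` were run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Da Nang in July 2023.")

# Tokenization + part-of-speech tags
for token in doc:
    print(token.text, token.pos_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```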
2. Theoretical basis
Sentence segmentation
Sentence segmentation, also known as sentence boundary detection or sentence tokenization, is the task of splitting a text into individual sentences. In Natural Language Processing (NLP), sentence segmentation is an important preprocessing step for various language processing tasks.
Machine Learning techniques can be applied to perform sentence segmentation. Here's an overview of how it can be approached:
Dataset Preparation: Annotated data is required for training a Machine Learning model for sentence segmentation. This dataset consists of text documents where sentences are manually segmented and labeled.
Feature Extraction: Various features can be extracted from the text to assist in sentence segmentation. These features may include punctuation marks, capitalization patterns, abbreviations, or specific language patterns.
Model Selection: Different Machine Learning algorithms can be utilized for sentence segmentation, such as Support Vector Machines (SVM), Conditional Random Fields (CRF), or Recurrent Neural Networks (RNN). The choice of model depends on the specific requirements and characteristics of the data.
Training: The extracted features and corresponding labels from the annotated dataset are used to train the Machine Learning model. The model learns to identify patterns and cues that indicate the boundaries between sentences.
Evaluation and Fine-tuning: The trained model is evaluated on a separate validation or test set to assess its performance. If necessary, the model can be fine-tuned by adjusting hyperparameters or modifying the feature set.
Inference: Once the model is trained and validated, it can be used for sentence segmentation on new, unseen text. The model takes a text document as input and predicts the sentence boundaries based on the learned patterns.
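In practice, a pre-trained statistical sentence tokenizer is often used instead of training one from scratch. The sketch below is a minimal example with NLTK's Punkt model; the library choice is an assumption, not part of the workflow described above.

```python
# Sentence segmentation with NLTK's pre-trained Punkt model.
# Assumes `pip install nltk`; the 'punkt' resource (or 'punkt_tab' in newer
# NLTK versions) must be downloaded once before use.
import nltk

nltk.download("punkt", quiet=True)

text = "Dr. Smith arrived at 9 a.m. He gave a talk on NLP. Everyone enjoyed it."
for sentence in nltk.sent_tokenize(text):
    print(sentence)
```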
It's worth noting that there are also rule-based approaches for sentence segmentation that rely on predefined linguistic rules and heuristics. These approaches can be effective in certain scenarios, especially when dealing with specific languages or domains.
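A tiny rule-based segmenter of this kind can be written with a single regular expression. The sketch below is deliberately naive (it splits after '.', '!', or '?' followed by whitespace and a capital letter) and will misfire on abbreviations such as "Dr."; it is only meant to illustrate the rule-based idea.

```python
import re

def split_sentences(text):
    """Naive rule-based sentence segmentation:
    split after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("It rained all day. We stayed inside! Did you finish the report?"))
# -> ['It rained all day.', 'We stayed inside!', 'Did you finish the report?']
```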
Sentence segmentation is a fundamental task in NLP, and accurate segmentation is crucial for subsequent language processing tasks such as part-of-speech tagging, named entity recognition, or sentiment analysis. Machine Learning techniques provide an automated and data-driven approach to tackle this task effectively.
Word pre-processing
Word preprocessing is an important step in Natural Language Processing (NLP) that involves transforming raw text into a format that is suitable for further analysis and modeling. The goal of word preprocessing is to clean and normalize the text data, reducing noise and inconsistencies, and improving the quality of the input for downstream tasks.
Here are some common techniques used in word preprocessing:
Tokenization: Tokenization is the process of splitting text into individual words or tokens. It breaks down the text into smaller units, making it easier to analyze and process. Tokens can be generated based on whitespace, punctuation marks, or more sophisticated subword tokenization techniques.
Lowercasing: Converting all text to lowercase is a common preprocessing step. It helps in standardizing the text and avoids treating the same word differently based on capitalization. For example, "Hello" and "hello" will be considered the same after lowercasing.
Removing Punctuation: Punctuation marks like commas, periods, and question marks are often removed as they usually do not contribute much to the meaning of the text in many NLP tasks. However, in some cases, punctuation can provide important context, so it depends on the specific task requirements.
Stop Word Removal: Stop words are commonly occurring words like "the," "is," and "and" that do not carry much semantic meaning. They can be removed as they can introduce noise and increase computational overhead. However, in certain applications like sentiment analysis or document classification, stop words may carry valuable information and might not be removed.
Lemmatization and Stemming: Lemmatization and stemming are techniques used to reduce words to their base or root form. Lemmatization aims to convert words to their dictionary form (lemma), while stemming uses heuristic rules to remove prefixes and suffixes. These techniques help to handle variations of words and reduce vocabulary size.
Handling Numerical and Special Characters: Depending on the task, numerical digits and special characters can be treated differently. For some applications, they might be relevant (e.g., sentiment analysis of product reviews), while in other cases, they can be removed or replaced with special tokens.
Handling Rare or Infrequent Words: Rare or infrequent words that occur in the text can be replaced with a special token or entirely removed to simplify the vocabulary and reduce noise. This is particularly useful when dealing with large datasets where rare words may not provide significant value.
It's important to note that word preprocessing techniques depend on the specific NLP task, domain, and dataset characteristics. The choice of preprocessing steps should be carefully considered to preserve meaningful information and avoid losing valuable context for the given task. A small code sketch of a typical pipeline follows.
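As a hedged example of such a pipeline, the plain-Python sketch below chains lowercasing, punctuation removal, digit handling, whitespace tokenization, and stop-word removal; the tiny stop-word list and the <num> token are assumptions made only for illustration, and a real project would use a fuller stop-word list or a library.

```python
import re
import string

# Illustrative stop-word list (an assumption; real pipelines use much larger lists)
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, replace digits, tokenize on whitespace, drop stop words."""
    text = text.lower()                                                # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "<num>", text)                              # replace digit runs with a special token
    tokens = text.split()                                             # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]                 # stop-word removal

print(preprocess("The price of the phone is 999 USD, and it is worth it!"))
# -> ['price', 'phone', '<num>', 'usd', 'it', 'worth', 'it']
```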
Word segmentation
Word segmentation is the process of dividing a sequence of characters or a continuous string of text into individual words or word-like units. Word segmentation is particularly important in languages like Chinese, Thai, and Vietnamese, where spaces do not reliably mark word boundaries (in Vietnamese, spaces separate syllables rather than words).
Here are some approaches and techniques used for word segmentation:
Rule-Based Segmentation: Rule-based approaches utilize predefined linguistic rules and heuristics to identify word boundaries. These rules are based on patterns, character frequencies, or morphological structures specific to the language. For example, in Chinese, word boundaries can be determined based on the combination of characters and their context.
Statistical and Machine Learning Methods: Machine Learning techniques, such as Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), or Recurrent Neural Networks (RNNs), can be trained to learn word boundaries based on annotated data. These models capture patterns and statistical dependencies between characters to make predictions about word segmentation.
Lexicon-Based Methods: Lexicon-based approaches leverage pre-existing dictionaries or lexicons to identify valid words in the text. The text is compared against the entries in the lexicon, and word boundaries are determined based on matches. However, these methods may struggle with out-of-vocabulary words or words not present in the lexicon (a minimal sketch of this idea appears after this list).
Hybrid Approaches: Hybrid methods combine multiple techniques to improve word segmentation accuracy. For example, a rule-based approach can be combined with statistical or machine learning methods to handle cases that are not covered by the rules. This hybridization helps to achieve better segmentation performance.
Domain-Specific Techniques: Word segmentation techniques can be tailored to specific domains or applications. For example, in biomedical text, domain-specific knowledge or dictionaries can be used to assist in word segmentation. This is because domain-specific texts often have unique vocabulary and linguistic characteristics.
Unsupervised or Semi-supervised Methods: Unsupervised or semi-supervised learning approaches can be used when labeled data for word segmentation is limited. These methods leverage unsupervised clustering or other techniques to identify word boundaries based on statistical patterns or distributional properties of the characters.
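As a minimal sketch of the lexicon-based idea, the Python code below performs greedy longest matching over Vietnamese syllables. The tiny dictionary, the three-syllable limit, and the underscore-joined output format are assumptions made for the example; a real system would use a full Vietnamese lexicon and handle ambiguity and out-of-vocabulary words.

```python
# Greedy longest-match word segmentation over Vietnamese syllables.
# The mini-lexicon below is illustrative only (an assumption, not a real resource).
LEXICON = {"học sinh", "sinh viên", "đại học", "công nghệ", "thông tin", "việt nam"}
MAX_WORD_SYLLABLES = 3  # longest multi-syllable entry we will try to match

def segment(sentence):
    """Join syllables into words by greedily matching the longest lexicon entry."""
    syllables = sentence.lower().split()
    words, i = [], 0
    while i < len(syllables):
        match = syllables[i]  # fall back to a single syllable if nothing matches
        for n in range(min(MAX_WORD_SYLLABLES, len(syllables) - i), 1, -1):
            candidate = " ".join(syllables[i:i + n])
            if candidate in LEXICON:
                match = candidate
                break
        words.append(match.replace(" ", "_"))  # underscore-joined output, a common convention
        i += len(match.split())
    return words

print(segment("sinh viên đại học công nghệ thông tin"))
# -> ['sinh_viên', 'đại_học', 'công_nghệ', 'thông_tin']
```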
Word segmentation is critical for various NLP tasks such as machine translation, named entity recognition, and part-of-speech tagging. Accurate word segmentation lays the foundation for subsequent analysis and understanding of text. The choice of