
The purpose of this essay is to find out how I use machine learning to categorize Vietnamese news and how I do it


DOCUMENT INFORMATION

Basic information

Format
Number of pages: 53
File size: 4.74 MB

Content

NAME: TRAN NGUYEN ANH THOAI
MODULE:
COURSE CODE:
COURSEWORK LEADER:
DUE DATE:
CENTRE: GREENWICH, HCMC
WORD:

Commitment

This dissertation draws on documents, articles, and websites as listed in the references, and I cite each source where it is used. I hereby certify that, apart from the cited references, all contents and data in this essay were compiled by myself, based on research carried out under the supervision of Mr Le Minh Nhat Trieu. I accept full responsibility for any violation of the regulations.

Acknowledgements

First, I would like to thank my loved ones, especially my parents, who have raised me and given me the best things in life, both material and moral. Second, I want to thank FPT University and the University of Greenwich, and the teachers who create the classes and subjects for students. Thank you for the knowledge you have passed on. Thank you to Mr Le Minh Nhat Trieu, who accompanied me during the past school year in the TopUp semester and devoted his time and energy to supporting me as much as possible. Finally, I would like to thank the teachers and friends who have accompanied me over the past school years for their valuable knowledge and enthusiastic help.

Abstract

Nowadays, the development of IT has changed our lives greatly, especially data mining and machine learning. They have been applied in every field of life, from face and voice recognition to natural language processing. Natural language processing in particular is used in many areas today; for example, some places use robots capable of communication to replace humans in communication. A typical example is the surge of anti-epidemic robots during the COVID-19 period. In mass media, newspapers and news production are receiving more and more attention from the public. That brings a great deal of work: reading the title of every article in order to classify it by hand is a waste of time. This essay was written to address this weakness of the media industry, the problem of sorting articles by topic. Because time is limited, the topics covered are restricted to world, sports, life, law, and health.

Articles are usually stored as natural language, that is, as unstructured data. The simplest way to classify them is to represent them in a vector space. To vectorize the text, however, the data must be processed first. The specific tasks are segmenting words, removing accents, and eliminating stop words. In this work I use a segmentation tool to separate words, then construct vectors with the BoW (bag-of-words) and Word2Vec methods, and finally use Jupyter Notebook to show the results obtained when classifying news with ML methods.

Introduction

In digital life, nearly all information and knowledge, whether we know of it or not, is on the internet. This solves many human problems: documents can be stored without paper or pen, kept for a long time, and searched conveniently. Faced with this huge amount of information, categorizing it properly is an important concern. In practice, however, this job is done manually and takes a lot of time and effort, so automatic classification is very necessary. Seeing this need, I decided to explore the steps for classifying information using ML. The data set is news taken from Vietnamese online news sites. From there I build and apply the classification methods. This is a research project and also the subject of my graduation thesis. The purpose of this essay is to find out how I use machine learning to categorize Vietnamese news and how I do it.

1.1 Thesis layout

CHAPTER MACHINE LEARNING AND NATURAL LANGUAGE PROCESSING IN GENERAL

2.1 Background overview of machine learning

In today's industrialized life, the amount of data is increasing in both quality and quantity, but only a small part of that huge volume of data has value.
The desire to find and exploit the information and value in that data has opened a new branch of the information technology industry: extracting knowledge from databases (Knowledge Discovery from Data). The steps of data mining include:

● Identify the request and the associated data space (problem understanding and data understanding)
● Data preparation, including data cleaning, data integration, data selection, and data transformation
● Data mining, including identifying the target of the data to be exploited and the exploitation technique. The result will be an image or text source
● Evaluation, which filters the obtained data based on the criteria
● Deployment

The data mining process is repeated many times. Data extraction is the process of extracting data from a data set. It draws on knowledge from many fields, such as IT, AI, databases, and mathematics. Mining methods include:

● Classification: a technique that assigns an object to one or more given classes
● Regression: maps a data sample to a real-valued predicted variable
● Clustering: a "cluster" is a group of data objects. Similar objects are placed in the same cluster, so the result is groups of similar objects

2.2 What is machine learning

With the explosion of big data, and with classical algorithms that have not yet performed well, the emergence of machine learning was inevitable, and it has opened a new area for the IT industry. Machine learning is a field of artificial intelligence concerned with the research and construction of techniques that allow systems to "learn" automatically from data in order to solve specific problems. Machine learning is strongly related to statistics, as both fields study data analysis, but unlike statistics, machine learning focuses on the complexity of algorithms in performing computation. Many inference problems are classified as NP-hard, so part of machine learning is the study of approximate inference algorithms that are tractable.

Currently, thanks to the development of hardware, many new algorithms have been produced and improved, such as deep learning and reinforcement learning. All of them build on ML, which is the core of today's advanced algorithms.

The outstanding feature of this document classification task is the variety of topics: the number of topics and texts is unlimited. For example, popular topics in Vietnamese news include law, health, life, education, and economics.

Machine learning is widely used today, including in search engines, medical diagnostics, credit card fraud detection, stock market analysis, classification of DNA sequences, speech and handwriting recognition, automatic translation, game playing, and robot locomotion [1].

2.3 General Machine Learning Structure ...
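To make the preprocessing and vectorization steps named in the abstract more concrete (word segmentation, accent removal, stop-word removal, bag-of-words vectors), here is a minimal Python sketch. It is not the essay's actual implementation. It assumes the text has already been passed through a word segmenter that joins multi-syllable words with underscores, and the stop-word list and all function names are hypothetical, chosen only for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical, tiny stop-word list (already accent-stripped); a real list would be much longer.
STOP_WORDS = {"va", "la", "cua"}


def strip_accents(text):
    """Remove Vietnamese diacritics, e.g. 'bóng' -> 'bong'."""
    # 'đ' has no Unicode decomposition, so it is mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (category Mn) left over after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(sentence):
    """Lowercase, strip accents, split on whitespace, drop stop words.

    Assumes the input is already word-segmented (multi-syllable words
    joined by underscores), so whitespace splitting yields whole words.
    """
    tokens = strip_accents(sentence.lower()).split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_vocabulary(docs):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}


def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]


docs = [
    preprocess("Bóng_đá là môn thể_thao"),
    preprocess("Sức_khỏe và thể_thao"),
]
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each article becomes a fixed-length count vector over the shared vocabulary, which is the form a classifier can consume. Word2Vec, the other method the abstract mentions, would replace these sparse counts with dense learned embeddings and is not shown here.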

Date posted: 06/01/2022, 23:14
