Spam email filtering based on machine learning


In this paper we present a spam email filtering method based on machine learning, namely the Naïve Bayes classification method, chosen because this approach is highly effective. With its learning ability (self-improving performance), a system applying this method can automatically learn and improve its spam email classification. At the same time, the system's classification ability is updated by newly arriving emails, which makes it much more difficult for spammers to defeat the classifier than with traditional solutions.

Trịnh Minh Đức – Tạp chí KHOA HỌC & CÔNG NGHỆ 118(04): 133 - 137

SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trinh Minh Duc*
College of Information and Communication Technology – TNU

Key words: Machine learning, email spam filtering, Naïve Bayes

INTRODUCTION
Email classification is in fact a two-class text classification problem: the initial dataset consists of spam and non-spam emails, and the texts to be classified are the emails arriving in the inbox. The output of the classification process is the class label for an email, belonging to one of the two classes: spam or non-spam. The general model of the spam email classification problem can be described as follows. The categorization process is divided into two phases:

The training phase: the input of this phase is the set of spam and non-spam emails; the output is the trained data, produced by a suitable classification method, to serve the classification phase.

The classification phase: the input of this phase is an email, together with the trained data; the output is the classification result for that email: spam or non-spam.

The rest of this paper is organized as follows. In Sect. 2 we formulate the Naïve Bayes classification method and our solution. In Sect. 3 we show experimental results to evaluate the efficiency of this method. Finally, in Sect. 4 we conclude and discuss possible future directions.

Figure 1. The spam email classification model

* Tel: 0984215060; Email: duchoak15@gmail.com

NAÏVE BAYES CLASSIFICATION METHOD [4]
Naïve Bayes is a supervised learning classification method based on a probability model. The classification process is based on the probability values of the likelihood of the hypotheses. The technique rests on Bayes' theorem and is particularly suitable for cases where the input size is large. Although Naïve Bayes is quite simple, its classification capability is often comparable to that of much more complex methods. Through its relaxed statistical-dependence hypothesis, Naïve Bayes treats the attributes as conditionally independent of each other given the class.

Classification problem. Given a training set D, in which each training instance x is represented as an n-dimensional attribute-value vector x = (x1, x2, ..., xn), and a pre-defined set of classes C = {c1, c2, ..., cm}: given a new instance z, which class should z be classified into?
The probability P(ci | z) that a new instance z belongs to class ci is used to select the class:

c = argmax_{ci ∈ C} P(ci | z)
  = argmax_{ci ∈ C} P(ci | z1, z2, ..., zn)
  = argmax_{ci ∈ C} P(z1, z2, ..., zn | ci) · P(ci) / P(z1, z2, ..., zn)
  = argmax_{ci ∈ C} P(z1, z2, ..., zn | ci) · P(ci)

since P(z1, z2, ..., zn) is the same for all classes.

Assumption in the Naïve Bayes method: the attributes are conditionally independent given the classification:

P(z1, z2, ..., zn | ci) = ∏_{j=1..n} P(zj | ci)

The Naïve Bayes classifier therefore finds the most probable class for z:

c_NB = argmax_{ci ∈ C} P(ci) · ∏_{j=1..n} P(zj | ci)

The Naïve Bayes classifying algorithm can be described succinctly as follows.

The learning phase (given a training set): for each classification (i.e., class label) ci ∈ C, estimate the prior probability P(ci); for each attribute value xj, estimate the probability of that attribute value given classification ci: P(xj | ci).

The classification phase (given a new instance): for each classification ci ∈ C, compute P(ci) · ∏_{j=1..n} P(xj | ci), then select the most probable classification c*.

There are two issues we need to solve.

First, what happens if no training instance associated with class ci has attribute value xj? Then P(xj | ci) = 0, and hence P(ci) · ∏_{j=1..n} P(xj | ci) = 0. The solution is to use a Bayesian (m-estimate) approach to estimate P(xj | ci):

P(xj | ci) = (n(ci, xj) + m·p) / (n(ci) + m)

where n(ci) is the number of training instances associated with class ci; n(ci, xj) is the number of training instances associated with class ci that have attribute value xj; p is a prior estimate for P(xj | ci) (assuming uniform priors, p = 1/k if attribute Xk has k possible values); and m is a weight given to the prior, which augments the n(ci) actual observations with an additional m virtual samples distributed according to p.
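The two phases and the m-estimate smoothing above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class name `NaiveBayes`, the toy feature encoding, and the choice m = 1 are our own, and the scores are accumulated as log-probabilities for numerical stability.

```python
import math
from collections import Counter

class NaiveBayes:
    """Categorical Naive Bayes with m-estimate smoothing:
    P(xj|ci) = (n(ci,xj) + m*p) / (n(ci) + m)."""

    def __init__(self, m=1.0):
        self.m = m

    def fit(self, X, y):
        # Learning phase: estimate priors P(ci) and per-attribute counts.
        self.classes = sorted(set(y))
        self.priors = {c: y.count(c) / len(y) for c in self.classes}
        n_attrs = len(X[0])
        # Distinct values per attribute position -> uniform prior p = 1/k.
        self.values = [set(x[j] for x in X) for j in range(n_attrs)]
        self.n_c = Counter(y)
        self.counts = {c: [Counter() for _ in range(n_attrs)]
                       for c in self.classes}
        for x, c in zip(X, y):
            for j, v in enumerate(x):
                self.counts[c][j][v] += 1
        return self

    def _log_score(self, c, x):
        # log P(ci) + sum_j log P(xj|ci), with m-estimate smoothing.
        total = math.log(self.priors[c])
        for j, v in enumerate(x):
            p = 1.0 / len(self.values[j])           # uniform prior estimate
            num = self.counts[c][j][v] + self.m * p  # n(ci,xj) + m*p
            den = self.n_c[c] + self.m               # n(ci) + m
            total += math.log(num / den)
        return total

    def predict(self, x):
        # Classification phase: pick the most probable class c*.
        return max(self.classes, key=lambda c: self._log_score(c, x))

# Toy demo: two attributes per email (format, has_link); labels are ours.
X = [("html", "link"), ("html", "link"), ("plain", "nolink"),
     ("plain", "link"), ("plain", "nolink")]
y = ["spam", "spam", "ham", "ham", "ham"]
clf = NaiveBayes(m=1.0).fit(X, y)
print(clf.predict(("html", "link")))   # classified as spam on this toy data
```

Because m·p > 0, the smoothed estimate is never zero, so the logarithm is defined even for attribute values unseen in a class.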
The second issue is the limit of precision in computer arithmetic: P(xj | ci) < 1 for every attribute value xj and class ci, so when the number of attribute values is very large, the product of the probabilities underflows. The solution is to maximize a logarithm of the probability instead:

c* = argmax_{ci ∈ C} [ log P(ci) + Σ_{j=1..n} log P(zj | ci) ]
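The underflow problem is easy to demonstrate: with a few hundred attributes whose conditional probabilities are around 0.01, the raw product falls below the range of IEEE-754 doubles entirely, while the log-sum stays well-behaved. A small self-contained check (the numbers are illustrative, not from the paper):

```python
import math

# 300 per-attribute likelihoods of 0.01: the true product is 1e-600,
# far below the smallest positive double (~4.9e-324), so it underflows to 0.0.
probs = [0.01] * 300
product = math.prod(probs)

# Summing logarithms preserves the ordering of the scores without underflow.
log_score = sum(math.log(p) for p in probs)   # 300 * ln(0.01) ≈ -1381.55

print(product)     # 0.0
print(log_score)
```

Since log is monotonic, comparing log-scores selects the same class c* as comparing the raw products would, had they been representable.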
In the spam email classification problem, each sample we consider is an email, and the set of classes that each email can belong to is C = {spam, non-spam}. When we receive an email and know nothing about it, it is very hard to decide whether it is spam or non-spam. If we know certain characteristics or attributes of an email, such as its title, its content, or whether it carries an attachment, we can improve the efficiency of the classification. A simple example: if we know that 95% of HTML emails are spam and we receive an HTML email, we can use this prior probability to compute the probability that the received email is spam; if that probability is greater than the probability given non-spam, we can conclude that the email is spam. This conclusion alone is not very accurate, but the more information we have, the greater the probability of a correct classification. To obtain the prior probabilities, we use the Naïve Bayes method to train on the set of early template emails, then use these probabilities to classify a new email. The probability calculation is based on the Naïve Bayes formula; with the obtained probability values, we compare them with each other: if the spam probability is greater than the non-spam probability, we conclude that the email is spam, and otherwise non-spam [5].

EXPERIMENTAL RESULTS
We have implemented a test that applies the Naïve Bayes method to email classification. The total number of emails in the sample dataset is 4601, including 1813 spam emails (39.4%). This dataset, called Spambase, can be downloaded at http://archive.ics.uci.edu/ml/datasets/Spambase. It is divided into two disjoint subsets: the training set D_train (66.7%), for training the system, and the test set D_test (33.3%), for evaluating the trained system.

To evaluate a machine learning system's performance, we often use measures such as Precision (P), Recall (R), Accuracy rate (Acc), Error rate (Err), and the F1 measure, computed as follows:

P   = nS→S / (nS→S + nN→S)
R   = nS→S / (nS→S + nS→N)
Acc = (nN→N + nS→S) / (NN + NS)
Err = (nN→S + nS→N) / (NN + NS)
F1  = 2·P·R / (P + R)

where nS→S is the number of spam emails which the filter recognizes as spam; nS→N is the number of spam emails which the filter recognizes as non-spam; nN→S is the number of non-spam emails which the filter recognizes as spam; nN→N is the number of non-spam emails which the filter recognizes as non-spam; NN is the total number of non-spam emails; and NS is the total number of spam emails.

Experimental results on the Spambase dataset. We present test results for two options of dividing the Spambase dataset:

Experiment 1: divide the original Spambase dataset with a proportion k1 = 2/3, that is, 2/3 of the dataset for training and the remaining 1/3 for testing.

Experiment 2: divide the original Spambase dataset with a proportion k2 = 9/10, that is, 9/10 of the dataset for training and the remaining 1/10 for testing.

Table 1. Testing results

              nS→S  nS→N  nN→N  nN→S  Recall   Precision  Acc      Err      F1
Experiment 1   486   119   726   204  80.33%   70.43%     78.96%   21.04%   75.05%
Experiment 2   180     2   276     3  98.90%   98.36%     98.92%    1.18%   98.63%

The testing results in Experiment 2 have very high accuracy (approximately 99%).
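As a sanity check, the measures defined above can be recomputed from the Experiment 1 confusion counts in Table 1; the variable names are our own. The script reproduces the reported Recall, Precision, Acc and Err values to two decimal places.

```python
# Confusion counts reported for Experiment 1 (2/3 training, 1/3 testing).
n_ss, n_sn = 486, 119   # spam classified as spam / as non-spam
n_nn, n_ns = 726, 204   # non-spam classified as non-spam / as spam

total = n_ss + n_sn + n_nn + n_ns   # NN + NS = 1535 test emails

precision = n_ss / (n_ss + n_ns)    # 70.43%
recall    = n_ss / (n_ss + n_sn)    # 80.33%
accuracy  = (n_nn + n_ss) / total   # 78.96%
error     = (n_ns + n_sn) / total   # 21.04%
f1        = 2 * precision * recall / (precision + recall)

for name, value in [("P", precision), ("R", recall), ("Acc", accuracy),
                    ("Err", error), ("F1", f1)]:
    print(f"{name}: {100 * value:.2f}%")
```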
CONCLUSION
In this paper, we have examined the effect of the Naïve Bayes classification method. This is a classifier with a self-learning ability to improve classification performance, and it proved suitable for the email classification problem. Currently, we are continuing to build standard training samples and to adjust the Naïve Bayes algorithm to improve the accuracy of classification. In the near future, we will build a standard training and testing data system for both English and Vietnamese. This is a big problem and needs more focused effort.

REFERENCES
[1] Jonathan A. Zdziarski, Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification, No Starch Press, 2005.
[2] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail".
[3] Sun Microsystems, JavaMail API Design Specification, Version 1.4.
[4] T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
[5] Lê Nguyễn Bá Duy, Tìm hiểu hướng tiếp cận phân loại email và xây dựng phần mềm mail client hỗ trợ tiếng Việt, Đại học Khoa học Tự nhiên, Tp. Hồ Chí Minh, 2005.

ABSTRACT (translated from the Vietnamese summary)
SPAM EMAIL FILTERING BASED ON MACHINE LEARNING
Trần Minh Đức* – University of Information and Communication Technology – Thai Nguyen University
This paper introduces a spam email filtering method based on machine learning, an approach with high effectiveness. With its learning ability (self-improving performance), the system automatically learns and improves its spam classification performance. At the same time, the system's classification ability is continuously updated with new spam samples, so it is very difficult for spammers to defeat it, compared with other traditional approaches.
Keywords: machine learning, spam filtering, Naïve Bayes

Received: 13/3/2014; Reviewed: 15/3/2014; Accepted: 25/3/2014
Scientific reviewer: TS. Trương Hà Hải – Trường ĐH CNTT&TT – ĐH Thái Nguyên
