Using Artificial Neural Networks to Identify Image Spam

A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science

Priscilla Hope
August, 2008

Thesis Approved:
Advisor: Dr. Kathy J. Liszka
Faculty Reader: Dr. Timothy W. O'Neil
Faculty Reader: Dr. Tim Margush
Department Chair: Dr. Wolfgang Pelz

Accepted:
Dean of the College: Dr. Ronald F. Levant
Dean of the Graduate School: Dr. George R. Newkome
Date: _______________________________

ABSTRACT

Internet technology has made international communication easy and convenient. This convenience has led many people to rely on electronic mail in almost all spheres of life, both personal and business. Unscrupulous organizations and individuals have taken undue advantage of this convenience and fill users’ inboxes with unwanted messages, making email spam a menace. Even as anti-spam software producers think they have almost solved the problem, spammers come out with new techniques. One such tactic in the spammers’ toolbox is image spam: messages that contain little more than a link to an image rendered in an HTML mail reader. The image typically contains the spam message one hopes to avoid, yet it is able to bypass most filters due to the composition and format of these pictures. This research focuses on identifying these images as spam by using an artificial neural network (ANN), a software program used for recognizing patterns and modeled on the biological neural networks in our brains. As information propagates through a neural network, it “learns” about the data.
A large collection of both spam and non-spam images has been used to train an ANN, and the effectiveness of the trained network is then tested against sets of pictures that are either unidentified or already identified. This process involves formatting images and adding the desired training values expected by the ANN. Several different ANNs have been trained using different configurations of hidden layers and nodes per layer. A detailed process for preprocessing spam image files is given, followed by a description of how to train an artificial neural network to distinguish between ham and spam. Finally, the trained network is tested against both known and unknown images.

ACKNOWLEDGEMENTS

This research would not have been possible without Jason Bowling making his ideas available for further studies. I’m grateful to him for his generosity. I also appreciate Garth Bruen of Knujon for contributing spam images, without which my corpus would have been small. I appreciate my committee members, Dr. Tim O’Neil and Dr. Margush, for their insightful corrections. My sincerest gratitude goes to Dr. Kathy J. Liszka, my supervisor, with whose help this research became a joy to work on. Thanks Dr. Liszka, you are the best supervisor!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER
I. INTRODUCTION
II. THE NATURE OF SPAM
    2.1 Basic Definitions
    2.2 History and Statistics
    2.3 The Long Arm of Spam
    2.4 Spam Filters
    2.5 Who Are Spammers and Why Can’t We Stop Them
    2.6 The Cost of Spam
    2.7 Why Are We Reading Spam
    2.8 Getting Past the Spam Filter
    2.9 Related Research
III. IMAGES AND THE CORPUS
    3.1 Image Spam Creating Techniques
    3.2 Image Formats
    3.3 Image Preparation
    3.4 Corpus
IV. THE ARTIFICIAL NEURAL NETWORK
    4.1 Fast Artificial Neural Network (FANN)
    4.2 Creating the Artificial Neural Network
    4.3 Training the Artificial Neural Network
    4.4 Testing the Artificial Neural Network
V. TRAINING RESULTS
    5.1 Training files
    5.2 Test Results
    5.3 Sample Runs
VI. CONCLUSION AND FUTURE WORK
REFERENCES
APPENDIX

LIST OF TABLES

3.1 Corpus Statistics
5.1 Training Image Times for 50 Hidden Neurons
5.2 Training Image Times for 75 Hidden Neurons

LIST OF FIGURES

1.1 Image Spam Examples
2.1 The First Generally Acknowledged Email Spam
2.2 Sample Text-based Spam Message
3.1 Text-only image
3.2 Assembled Images
3.3 Original six individual images
3.4 Script for Checksum
3.5 Unix Script for Reformatting File Names
4.1 Perceptron or feed-forward ANN
4.2 Script automating executing image2fann utility
4.3 Sample content of a file containing a set of image files to be run through image2fann utility
4.4 Sample preprocessed images to be trained
4.5 ANN for Spam Image Identification
4.6 Sample partial output from train.c
4.7 Process flow of ANN training and testing
4.8 Process of testing a network
5.1 Sample preprocessed images to be tested
5.2 Sample output file from train.c (partial)
5.3 Sample output file from test.c (partial)
5.4 ANN of 572 trained images using 50 hidden neurons and tested with 53 untrained images
5.5 ANN of 572 trained images using 75 hidden neurons and tested with 53 untrained images
5.6 ANN of 227 trained images using 50 hidden neurons and tested with 53 untrained images
5.7 ANN of 227 trained images using 75 hidden neurons and tested with 53 untrained images
5.8 ANN of 2000 trained images using 75 hidden neurons and tested with 2000 trained images
5.9 ANN of 2000 trained images using 75 hidden neurons and tested with 100 untrained images
5.10 ANN of 2000 trained images using 50 hidden neurons and tested with 2000 trained images
5.11 ANN of 2000 trained images using 50 hidden neurons and tested with 100 untrained images
5.12 ANN of 2000 trained “images with mostly words” using 75 hidden neurons and tested with 100 trained images
5.13 ANN of 2000 trained “images with mostly words” using 75 hidden neurons and tested with 100 untrained images
5.14 ANN of 2000 trained “images with mostly words” using 50 hidden neurons and tested with 100 trained images
5.15 ANN of 2000 trained “images with mostly words” using 50 hidden neurons and tested with 100 untrained images
5.16 Jason Bowling on a hiking trip
5.17 Ham Image Wrongly Classified as Spam

[...]...
text-based spam message

Spam with an attached image is a relatively new phenomenon, which only started to appear in numbers in the second half of 2005. Image spam exploded in mid 2006 and, by year’s end, over 50% of total spam received was image spam. It has since declined and now accounts for around 20% [13]. According to the paper, Image Spam – the New Face of Email Threat, image spam forms around 12.87% of total...

ANNs have been used in [25] to identify spam by looking at the text-based header portion of spam email. PUREmail is a second-generation email filter that uses artificial intelligence to process images by visualization [35]. Although the results given are encouraging, this thesis uses an artificial neural network to classify spam and ham. Artificial neural networks have the capability to mimic human intelligence...

productivity to major business losses. To an organization, spam is not only a nuisance, it is expensive.

2.7 Why Are We Reading Spam?

We all claim that we delete our spam, but if that were true, spammers would have no reason to continue pushing it through the pipe to us. Obviously, enough people are reading enough spam to make it lucrative for spammers to continue. The strategy adopted by the spammer consists...

• Allow users to create their own “explicit deny list”
• Message reputation and fingerprinting: checking to see if the email content has elements of spam that have been seen before
• Image fingerprinting: checking images to see if they contain similarities to cataloged spam images
• Image property space: a technique that uses rules to extract properties of images in an email that might be spam
• Analyze...
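The image fingerprinting item in the list above can be illustrated with a minimal sketch. It is a deliberate simplification: it uses an exact-match SHA-256 digest rather than the similarity comparison the excerpt describes, and the names `fingerprint`, `is_cataloged`, and the 16-byte stand-in "image" are hypothetical, chosen only for illustration. It shows why an exact fingerprint stops matching as soon as a single pixel value changes.

```python
import hashlib


def fingerprint(image_bytes):
    """Return an exact-match fingerprint (SHA-256 hex digest) of raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


def is_cataloged(image_bytes, catalog):
    """Check whether the image's fingerprint is already in the spam catalog."""
    return fingerprint(image_bytes) in catalog


# Hypothetical stand-in for an image: 16 grayscale pixel values.
known_spam = bytes([200] * 16)
catalog = {fingerprint(known_spam)}

# The same "image" with a single pixel value changed.
variant = bytearray(known_spam)
variant[5] = 201
variant = bytes(variant)

print(is_cataloged(known_spam, catalog))  # True
print(is_cataloged(variant, catalog))     # False
```

An exact digest is cheap to compute and compare, but it only recognizes byte-identical copies, which is one reason a filter might prefer a similarity-based measure instead.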
It is the first research of this kind to be conducted for image spam [26]. In the next chapter, the use of images in spam and the manipulation of the images for this thesis research are addressed.

CHAPTER III
IMAGES AND THE CORPUS

From mere observation, one can conjecture that image spam is yet another clever way to avoid anti-spam devices. Spammers have resorted to this format, with some of the following...

in the image that is difficult to distinguish from the original image. Due to the randomization of the pixels, each iteration of the image will appear completely different to many image spam filters.
• Color Modification: Due to the unlimited flexibility in the number of colors and fonts, image spammers change the properties of their images, resulting in new pixel locations and identifiers.
• Stock spots...

many forms by way of file types, multipart images with images split into multiple images, and rotated by a slight degree. This research examines a method for identifying image spam by training an artificial neural network. Chapter two presents an overall view of the spam problem and a brief summary of current research. A detailed process for preprocessing spam image files is given in chapter three, along...

filters

Spam filters prevent spam from reaching an inbox. Manual weeding may still need to be done for spam that successfully bypasses the filter. Manual weeding is also necessary to identify and retrieve legitimate messages that have been marked as spam. Unfortunately, it is left to the user to identify the spam and manually delete it or report to the spam filter that a message...
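The introduction excerpt above mentions preprocessing image files before an ANN is trained on them. As a rough sketch of what such a step might produce, the function below writes labeled pixel vectors in the plain-text training format read by the FANN library: a header line with the number of training pairs, inputs, and outputs, followed by alternating input and output lines. Everything else is an assumption made for illustration: the function name `to_fann_training_text`, the scaling of pixel values to the range 0 to 1, the single output node, and the choice of 1 for spam and 0 for ham. The thesis's own image2fann utility may work differently.

```python
def to_fann_training_text(samples):
    """Format (pixels, label) pairs as FANN training-file text.

    samples: list of (pixels, label), where pixels is a list of ints in 0..255
    and label is 1 for spam or 0 for ham (an assumed convention).
    """
    if not samples:
        raise ValueError("need at least one sample")
    num_inputs = len(samples[0][0])
    # Header: number of training pairs, inputs per pair, outputs per pair.
    lines = [f"{len(samples)} {num_inputs} 1"]
    for pixels, label in samples:
        if len(pixels) != num_inputs:
            raise ValueError("all images must have the same number of pixels")
        # Assumed preprocessing: scale each pixel value into the range [0, 1].
        scaled = [p / 255.0 for p in pixels]
        lines.append(" ".join(f"{v:.4f}" for v in scaled))
        lines.append(str(label))
    return "\n".join(lines) + "\n"


# Two hypothetical 2x2 grayscale images: one labeled spam, one labeled ham.
samples = [([0, 255, 128, 64], 1), ([255, 255, 0, 0], 0)]
text = to_fann_training_text(samples)
print(text)
```

Every image must yield the same number of inputs because the input layer of a feed-forward network has a fixed size, so real images would need to be resized to common dimensions first.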
filtering spam images. In their research, they use a “content-based image similarity searching” technique to classify spam images. A false positive rate of 0.001% is maintained through their detection system [20]. Another approach [21] is based mostly on extracting text regions inside the images of interest and then using a Support Vector Machine (SVM) to distinguish between ham and spam images...

backgrounds: Image spammers are now using highly colored and patterned backgrounds, uneven letters, and randomly inserted pixels around the border. Each image is unique and hard to read by any software attempting to use Optical Character Recognition (OCR), a technology that scans an image to extract text and that requires known fonts to be effective.

Figure 3.1 Text-only image

• Multi-frame animated images: Spammers...
