Using Null Data Processing to Recognize Variant Computer Viruses for Rule-based Anti-virus Systems

TRUONG Minh Nhat Quang (1), HOANG Van Kiem (2), NGUYEN Thanh Thuy (3)

(1) Cantho Inservice University, Vietnam (e-mail: tmnquang@ctu.edu.vn)
(2) Vietnam National University HCM City (e-mail: hkiem@citd.edu.vn)
(3) Hanoi University of Technology, Vietnam (e-mail: thuynt@it-hut.edu.vn)

1-4244-0134-8/06/$20.00, 1-4244-0133-X/06/$20.00 © 2006 IEEE

Abstract: Null data processing is commonly used in Knowledge Discovery in Databases systems and Machine Learning systems. Null data must normally be eliminated from major databases because it often degrades the quality of data mining. In this paper, we introduce a new point of view: null data processing can also be used in a recognition system to identify strange objects. Our work involves three operations, ‘toNull’, ‘deNull’ and ‘fixNull’, along with the Data Fusion technique and a rule-based production inference algorithm. This process has been applied to our software, the Machine Learning Anti-virus System (MAV), in the security domain. The experimental results showed that MAV detects about as many viruses as other anti-virus software whose algorithms require a much bigger virus signature database.

Keywords: Data fusion, Machine learning, Knowledge based systems, Security

I. INTRODUCTION

RETRIEVING valuable information from the databases of Machine Learning (ML) systems and Knowledge Discovery in Databases (KDD) systems (also called target systems) depends on many factors, especially on the quality of the sample data set. Many data sources have no data in some fields of several records. Before applying a data mining process, target systems have to clean the input data. One cleaning technique is to use a null data process to recall possible values.

In this paper, we introduce a new point of view: null data processing can be used in a recognition system to identify strange objects. Our work consists of three steps. First, a process called ‘toNull’ empties all real null values and abnormal values of the objects’ attributes. Then, using null data processing tools such as data fusion and rule-based production inference, the ‘deNull’ process tries to estimate all possible values for these elements. Finally, a process named ‘fixNull’ recalls the original values of all modified elements and adds them to the knowledge base. In Section IV (Experimentations), we illustrate the results of this technique on a rule-based recognition target system: the Machine Learning Anti-virus System (MAV).

II. NULL DATA PROCESSING

A. Data cleaning in the knowledge discovery process

The knowledge discovery process consists of the following phases: data selection, cleaning, enrichment, coding, data mining and reporting. Cleaning is an important phase because the quality of the data directly affects the quality of data mining. The aim of this phase is to treat all data pollution. There are two types of data pollution: duplication of records and lack of domain consistency [1].

B. Null data processing in the cleaning phase

Besides a few high quality data sources, some data are collected from many sites with varying quality. For many reasons, lack of domain consistency is a common problem that target systems have to face. This type of pollution is particularly damaging because it is hard to trace, yet it greatly influences the type of data patterns found [1].

C. Null data processing with Data Fusion

There are many existing null data processes. In our research, we rely on data fusion. Data fusion originates from market studies [2],[3], especially media and consumption surveys, where it is often impossible to ask the same sample all the items when there are too many questions. In data fusion, the goal is to obtain a single database where all the variables have been completed for the union of units.

Basically, the problem may be formalized in terms of two data files: the first file contains observations for a whole set of $p+q$ variables measured on $n_0$ units; the second file contains observations of only a subset of $p$ variables for $n_1$ units. In some cases, $n_0$ is small compared to $n_1$. If X stands for the $p$ common variables and Y for the $q$ variables observed only in the first file, we have the scheme in Fig. 1. The problem here is to fill in the blank part of the table, where a lot of variables are missing because they have not been collected [4].

Fig. 1: Missing value in database [schematic: the first file holds $X_0$ and $Y_0$; the second file holds $X_1$, and its Y part is unknown (?)]

In Fig. 2, the file $(X_0, Y_0)$ is used to predict the unknown Y part of the second file. The first file is called the donor file and the second one the recipient file. In this approach, imputation with implicit models is based on the principle of copying and pasting: the donor's whole vector of Y variables is given to the receiver.

Fig. 2: Imputation scheme [schematic: donor file I $(X_0, Y_0)$ supplies the unknown Y part of recipient file J $(X_1, ?)$ through nearest neighbour imputation]

Let i be a receiver. The basic idea is to look for a donor j whose profile on the X variables is close to that of i: either a double, if all the common variables are identical, or the nearest neighbour, for which an appropriate distance $d(i,j)$ in the space $\mathbb{R}^p$ of common variables is minimal [4].

TABLE I
EXAMPLE OF A VIRUS SIGNATURE DATABASE

No  Virus Name    v0  v1  v2  v3   v4  v5  v6  v7  v8  v9
1   Family.a.vir  15  28  03  101  32  27  65  37  81  61
2   Family.b.vir  15  28  03  101  35  27  65  37  85  61
3   Family.c.vir  15  28  03  101  30  27  65  37  90  61
4   Family.d.vir  15  28  03  101  34  27  65  37  84  61
5   Family.e.vir  15  28  03  101  33  27  65  37  83  61
6   Family.f.vir  15  28  03  101  38  27  65  37  90  61
7   Family.g.vir  15  28  03  101  30  27  65  37  88  61
8   Family.h.vir  15  28  03  101  29  27  65  37  87  61
9   Family.i.vir  15  28  03  101  31  27  65  37  92  61

III. RULE-BASED ANTI-VIRUS SYSTEMS

A. Rule-based Anti-virus systems

With the rapid development of the Internet, computer viruses have become hot news. They infect, destroy and steal data ever more frequently and seriously, which directly affects the network security and data safety
of many computer systems in the world. There are many types of computer viruses, and each type has its own way of infecting [5]. Basically, scanning for computer viruses is a recognition process that matches a diagnosed object against the characteristics of viral code stored in an ID-virus library.

To diagnose computer viruses, a rule-based anti-virus program is built on the set $V_K$ of K virus signature vectors, $V_K = \{v_1, v_2, \dots, v_K\}$, and determines whether some $v_i$ exists in the diagnosed set S. The conventional rule has the form:

$R: p_1 \wedge p_2 \wedge \dots \wedge p_n \rightarrow q$    (1)

where $p_i$ represents a virus signature/behavior and $q$ is the result/conclusion of the process.

When a new virus appears, anti-virus experts debug it and update the ID-virus library with the correct viral signatures $v_i$. Using the information of the diagnosed object in the observation space and the viral characteristics in the ID-virus library, the data classification algorithm of the anti-virus program assigns the diagnosed object to one of two classes: possibly infected or not infected by a virus [6].

Suppose that an anti-virus program AV has an ID-virus library (Table I) and works on an observation space containing two special objects of interest: ObjectV and ObjectX (Table II). The scanning result shows that ObjectV is infected by virus Family.f.vir, because it matches that signature exactly, and that ObjectX is safe, because it matches no signature in the library.

B. Machine Learning approach to anti-virus system

The advantage of the rule above is clear: the anti-virus program can identify most known viruses in the test data. However, the searching algorithm fails when the data sources are lacking (e.g. virus signatures in the ID-virus library) [7]. For example, virus V may change to X, $X_N = \{x_1, x_2, \dots, x_n\}$. When we apply rule (1) to diagnose X, the result is evidently $\neg Q_V$, because at least one $x_u$ exists with $x_u \neq v_u$ ($u = 1, \dots, n$). Therefore, most anti-virus programs cannot recognize variant viruses [8].

Our solution is to create a conventional anti-virus program that utilizes an ML rule-based approach involving these basic tasks:
modeling a knowledge base, forming rule sets to recognize known viruses, diagnosing and discovering interesting rules using Data Mining algorithms, finding hidden attributes and predicting variant/unknown viruses [6].

Like a typical KDD system, MAV has these stages:
- Examining: Data selection, Cleaning, Enrichment
- Diagnosing: Coding, Data mining
- Treatment: Data processing
- Conclusion: Reporting

The technique referred to in this paper is part of the first stage of our ML system, the data cleaning process.

Analyzing virus characteristics, we defined virus classes corresponding to widespread viruses such as File-viruses, Boot-viruses, Worm-viruses, Macro-viruses and Text-viruses. Depending on the diagnostic purpose, a class is defined with its particular characteristics. In general, a standard virus class has an object-oriented form:

Object: Virus family identification
Property: Attributes/behavior
Method: Treatment/direction outline

TABLE II
TWO OF THE DIAGNOSED OBJECTS IN OBSERVATION SPACE

No  Object Name  a0  a1  a2  a3   a4  a5  a6  a7  a8  a9
…   …            …   …   …   …    …   …   …   …   …   …
v   ObjectV      15  28  03  101  38  27  65  37  90  61
…   …            …   …   …   …    …   …   …   …   …   …
x   ObjectX      15  28  03  101  39  27  65  37  91  61
…   …            …   …   …   …    …   …   …   …   …   …

Fig. 4: toNulling a diagnosing object
(a) ObjectX before nulling: 15 28 03 101 39 27 65 37 91 61
(b) ObjectX after nulling: 15 28 03 101 ? 27 65 37 ? 61

The null data process used to recognize variant viruses consists of three operations:
1) toNull: create a data backup and then empty all abnormal values of the objects’ attributes to isolate all new viral behaviors (Fig. 4). After this operation, there is a ‘virtual null’ data set (null but not null, since the original values are kept in the backup).
2) deNull: use data fusion to replace all null data by possible values. The goal of the deNull operation is to predict all strange objects that may be a nearest neighbour of known viruses (Fig. 5).
3) fixNull: recall all original values from the data backup (Fig. 6) and add them to the database so that the ‘variant of variant’ computer viruses can be recognized in the future (Table III).
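As a rough illustration of the toNull, deNull and fixNull operations, the following Python sketch runs them on ObjectX against the signatures of Table I. It is not MAV's implementation. The function names (`to_null`, `de_null`, `fix_null`), the rule that a value counts as abnormal when no library signature carries it at that position, and the Manhattan-distance tie-break on the backed-up values are all assumptions made for this example; the paper does not specify them.

```python
from typing import Dict, List, Optional, Tuple

Signature = List[int]
Nulled = List[Optional[int]]

# ID-virus library copied from Table I (the value printed as 03 is the integer 3).
LIBRARY: Dict[str, Signature] = {
    "Family.a.vir": [15, 28, 3, 101, 32, 27, 65, 37, 81, 61],
    "Family.b.vir": [15, 28, 3, 101, 35, 27, 65, 37, 85, 61],
    "Family.c.vir": [15, 28, 3, 101, 30, 27, 65, 37, 90, 61],
    "Family.d.vir": [15, 28, 3, 101, 34, 27, 65, 37, 84, 61],
    "Family.e.vir": [15, 28, 3, 101, 33, 27, 65, 37, 83, 61],
    "Family.f.vir": [15, 28, 3, 101, 38, 27, 65, 37, 90, 61],
    "Family.g.vir": [15, 28, 3, 101, 30, 27, 65, 37, 88, 61],
    "Family.h.vir": [15, 28, 3, 101, 29, 27, 65, 37, 87, 61],
    "Family.i.vir": [15, 28, 3, 101, 31, 27, 65, 37, 92, 61],
}


def to_null(obj: Signature, library: Dict[str, Signature]) -> Tuple[Nulled, Signature]:
    """Back up the object, then empty every abnormal value.

    Assumption: a value is abnormal if no signature has it at that position.
    """
    backup = list(obj)
    nulled: Nulled = [
        v if any(sig[i] == v for sig in library.values()) else None
        for i, v in enumerate(obj)
    ]
    return nulled, backup


def de_null(nulled: Nulled, backup: Signature,
            library: Dict[str, Signature]) -> Tuple[str, Signature]:
    """Fill the nulls from the nearest donor signature (copy-and-paste imputation).

    Donors are ranked by Manhattan distance on the non-null (common) positions;
    ties are broken by distance to the backed-up values (an assumption).
    """
    def cost(sig: Signature) -> Tuple[int, int]:
        common = sum(abs(s - v) for s, v in zip(sig, nulled) if v is not None)
        gap = sum(abs(s - b) for s, v, b in zip(sig, nulled, backup) if v is None)
        return common, gap

    donor = min(library, key=lambda name: cost(library[name]))
    filled = [library[donor][i] if v is None else v for i, v in enumerate(nulled)]
    return donor, filled


def fix_null(backup: Signature, new_name: str, library: Dict[str, Signature]) -> None:
    """Recall the original values and store them as a new signature."""
    library[new_name] = list(backup)


object_x = [15, 28, 3, 101, 39, 27, 65, 37, 91, 61]   # ObjectX from Table II
nulled, backup = to_null(object_x, LIBRARY)
donor, filled = de_null(nulled, backup, LIBRARY)
is_variant = filled == LIBRARY[donor]                  # rule (1) now fires on the donor
if is_variant:
    fix_null(backup, "Family.j.vir", LIBRARY)          # name taken from Table III

print(nulled)         # [15, 28, 3, 101, None, 27, 65, 37, None, 61]
print(donor, filled)  # Family.f.vir [15, 28, 3, 101, 38, 27, 65, 37, 90, 61]
```

On this data the sketch empties a4 and a8, selects Family.f.vir as the donor, and stores the original ObjectX values as a tenth signature. The tie-break matters here because every signature in Table I agrees on the eight non-null positions, so the distance on the common variables alone cannot separate the donors.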
Using this object-oriented class model, we suggest three knowledge bases:
- KB1: examine determined diseases
- KB2: diagnose new determined diseases
- KB3: diagnose unusual diseases

KB1 is used to treat most known viruses, KB2 to detect variants of known viruses, and KB3 to predict unknown viruses. In this arrangement, MAV also requires decisions at different usage levels: end-users, technicians and system administrators [6].

C. Cleaning data to recognize variant computer viruses

Most anti-virus programs are ineffective when facing variant viruses. In fact, virus variants are programmed by crackers from their parent’s source code. Therefore, many target codes in the same virus family are similar. In Fig. 3, the X-axis denotes the position of each executable code and the Y-axis denotes its corresponding value (0-255). Chart 3a shows the graphical executable code of virus Klez.a.worm.W32, and chart 3b shows the code of virus Klez.h.worm.W32, which is the seventh descendant of Klez.a.worm.W32.

Fig. 3: The graphical executable code of virus family Klez.worm.W32
(a) Executable code sequence of virus Klez.a.worm.W32
(b) Executable code sequence of virus Klez.h.worm.W32
[charts: X-axis, position of executable code; Y-axis, value of executable code]

If the number of varied attributes $x_i$ of X is smaller than the number of attributes $v_i$ of V, the recognition rule that identifies variant computer viruses can be defined as follows:

$R_X: a_1 \wedge a_2 \wedge \dots \wedge (a_u \leftarrow \mathrm{NULL}) \wedge \dots \wedge a_n \rightarrow Q_X$

where:
$a_i$ denotes values of virus V
$a_u$ denotes values that are abnormal compared to V
$Q_X$ denotes the conclusion of the inference process, asserting whether X is a descendant of V

The process to recognize variant computer viruses involves the three operations toNull, deNull and fixNull, illustrated in Figs. 4 to 6; its effect on the signature database is shown in Table III.

Fig. 5: deNulling all abnormal values
(a) Using the nearest neighbour to estimate possible values for the nulled ObjectX (15 28 03 101 ? 27 65 37 ? 61): over the signatures of Table I, the nearest neighbour (NN) is 38 for a4 and 90 for a8
(b) A variant of virus Family.f.vir detected. ObjectX*: 15 28 03 101 38 27 65 37 90 61

Fig. 6: Recalling original object values
New variant virus: 15 28 03 101 39 27 65 37 91 61

TABLE III
VIRUS SIGNATURE DATABASE AFTER FIXNULL

No  Virus Name    v0  v1  v2  v3   v4  v5  v6  v7  v8  v9
1   Family.a.vir  15  28  03  101  32  27  65  37  81  61
2   Family.b.vir  15  28  03  101  35  27  65  37  85  61
3   Family.c.vir  15  28  03  101  30  27  65  37  90  61
4   Family.d.vir  15  28  03  101  34  27  65  37  84  61
5   Family.e.vir  15  28  03  101  33  27  65  37  83  61
6   Family.f.vir  15  28  03  101  38  27  65  37  90  61
7   Family.g.vir  15  28  03  101  30  27  65  37  88  61
8   Family.h.vir  15  28  03  101  29  27  65  37  87  61
9   Family.i.vir  15  28  03  101  31  27  65  37  92  61
10  Family.j.vir  15  28  03  101  39  27  65  37  91  61

IV. EXPERIMENTATIONS

A. Practical processing

To estimate the effect of the method above, we have experimented with Norton Anti-virus 2003 Professional Edition, Virus Scan Professional Edition, Bit Defender v.8 and MAV, all of which have a smart scanning ability (Table IV). The system activities are described as follows:
• Merging the dataset of virus samples randomly into the observation space
• Installing anti-virus programs into the testing system
• For each testing anti-virus program:
  - To disable the auto-protect agents
  - To enable the smart scanning option at the highest level
  - To set up the program activities only for virus detection
  - To update the latest virus signature database from the program’s website
  - To scan the observation space for viruses
  - To take the results of each testing program

TABLE IV
TESTING ANTI-VIRUSES

Anti-virus         Manufacture   Virus Definition  Virus Signature  Engine Version
Norton Anti-virus  Symantec      1/25/2006         72,020           9.05.15
Virus Scan         McAfee        1/25/2006         N/A              4.0.4682
Bit Defender v.8   SoftWin       1/25/2006         253,993          7.05450
MAV                MAV Software  1/25/2006         890              N/A

B. Experimentation Results

Table V is the diagnostic result on an 80,000 KB dataset of 1,000 virus samples supplied by Kaspersky Anti-virus (Version 5.0.5/Release build #13, compiled on Nov 29 2004). Fig. 7 is the chart of the anti-virus test results.

TABLE V
EXPERIMENTAL RESULTS

Anti-virus         Detection  Precision  Prediction  Omission
Norton Anti-virus  907        889        18          93
Virus Scan         906        877        29          94
Bit Defender v.8   959        925        34          41
MAV                957        483        474         43

Fig. 7: Comparison of smart Anti-virus scannings [bar chart: number of virus samples by detections, precisions, predictions and omissions for Norton Anti-virus, Virus Scan, Bit Defender v.8 and MAV]

With respect to detection, MAV and Bit Defender (BitDef) have nearly the same results, with 957 and 959 viruses detected. Both are better than Norton Anti-virus (NAV) and Virus Scan, with 907 and 906. Although its number of updated viruses is very small (890 virus signatures), MAV can recognize about as many computer viruses as the other anti-virus programs, which have much bigger virus signature databases (72,020 and 253,993 virus signatures).

V. CONCLUSION

Using a new point of view which considers variant computer viruses as objects lacking domain consistency, MAV has successfully recognized variant viruses by using the three-step null data process powered by the Data Fusion technique. Although this research targets a rule-based anti-virus system, the approach can also be applied to other rule-based recognition systems in which strange objects must be learned and identified.

Because it is based on the nearest neighbour method, however, this technique has some limitations. MAV can only recognize the nearest variant computer viruses. When facing an unknown virus, MAV needs an additional decision from a built-in expert system [5].

REFERENCES

[1] Pieter Adriaans, Dolf Zantinge, “Data Mining”, Addison Wesley
Longman, 1996, 40-41.
[2] Baker K., Harris P., O’Brien J., “Data fusion: An appraisal and experimental evaluation”, Journal of the Market Research Society, 31 (1989), 153-212.
[3] Lejeune M., “De l'usage des fusions de données dans les études de marché”, Proceedings 50th Session of ISI-Beijing, 1995, Tome LVI, 923-935.
[4] Gilbert Saporta, “Data fusion and data grafting”, CNAM, F75141 Paris Cedex 03, France. Elsevier Science B.V., 2002.
[5] Hoang Kiem, Nguyen Thanh Thuy, Truong Minh Nhat Quang, “Machine Learning Approach to Anti-virus Expert System with Nearest Neighbor Rule-based Structural Risk Minimization”, RIVF’05, the 3rd International Conference in Computer Science: Research, Innovation and Vision for the Future, February 2005, Cantho-Vietnam, 295-298.
[6] Hoang Kiem, Nguyen Thanh Thuy, Truong Minh Nhat Quang, “A Machine Learning Approach to Anti-virus System”, Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ and IEICE-SIGAI on Active Mining, 4-7 December 2004, Hanoi-Vietnam, 61-65.
[7] Nguyen Thanh Thuy, Truong Minh Nhat Quang, “A Global Solution to Anti-virus Systems”, The Proceedings of the 1st International Conference on Advanced Communication Technology, 10-12 February 1999, Muju-Korea, 374-377.
[8] Nguyen Thanh Thuy, Truong Minh Nhat Quang, “Expert System Approach to Diagnosing and Destroying Unknown Computer Viruses”, The Scientific Conference Proceedings of the 5th ASEAN Science and Technology Week, 10-1998, Hanoi-Vietnam.