Nguyễn Văn Trường et al., Tạp chí KHOA HỌC & CÔNG NGHỆ (Journal of Science and Technology), 106(06): 41 - 47

COMBINING NEGATIVE SELECTION AND POSITIVE SELECTION IN ARTIFICIAL IMMUNE SYSTEMS

Nguyen Van Truong1*, Vu Thi Nguyet Thu1, Trinh Van Ha2
1 College of Education – TNU
2 College of Information and Communication Technology - TNU

SUMMARY

Artificial Immune System (AIS) is a diverse research area that combines the disciplines of immunology and computation. Negative Selection Algorithm (NSA) and Positive Selection Algorithm (PSA) are two well-known models of AIS designed for anomaly detection. Both contain two stages: generating a set D of detectors from a given set S of self samples, and detecting whether a given cell is self or non-self using the generated detectors. In this paper, we propose an improvement of r-chunk type detector-based NSA by combining negative selection and positive selection to reduce runtime complexity and memory complexity.

Key words: Artificial immune system, negative selection algorithm, positive selection algorithm, computer security, r-chunk detector

INTRODUCTION

The biological immune system is able to recognize which cells are its own (self) and which are foreign (non-self, such as bacteria or viruses). The representative immune cell is the T cell, which has a self-recognition component and an antigen receptor for locating and eliminating infected cells. A computer system that models these characteristics of the biological immune system, protecting against damage from external attacks and eliminating intruders, is called an artificial immune system [3].

The biological immune system is a complex, self-organizing and highly distributed system. It has no centralized control and uses learning and memory when solving particular tasks [11]. The learning process does not require negative examples, and acquired knowledge is represented in an explicit form: T cells are generated randomly and in large numbers, in the hope that every pathogen that infects the host is detected by at least some of
these cells. However, the host must ensure that no generated cell turns against the host itself; many severe diseases are caused by such autoimmune reactions. Hence, newborn T cells undergo the process of negative selection. In a special organ, the thymus, they are shown self proteins, which belong to the host. If a T cell detects any self protein, it is destroyed. In contrast with negative selection, in positive selection the T cells are tested for recognition of Major Histocompatibility Complex molecules expressed on the cortical epithelial cells. If a T cell fails to recognize any of the Major Histocompatibility Complex molecules, it is discarded; otherwise, it is kept.

* Tel: 0915016063; Email: nvtruongtn@gmail.com

Forrest et al. [2, 3] analyzed the biological immune system and found that the problem it faces is similar to one that today's computer systems face: it is difficult to defend a system against various unknown dangers, such as an exploit of a new security hole. The only reliable knowledge we have is the normal behavior of the system, which is the equivalent of self. The idea of the negative selection classification scheme is to mimic the T cells in the biological immune system: generate a set of detectors that do not match anything in self, then use these detectors to monitor the system for unusual behavior. An algorithmic abstraction of this biological process, called NSA, has found interesting applications: computer virus detection, monitoring UNIX processes, anomaly detection in time series, fault analysis, process diagnosis, numerical optimization, recognizing promoters in DNA sequences, and scheduling problems [4, 8, 9, 11, 12].

Digitized by the Learning Resource Center, Thai Nguyen University (Trung tâm Học liệu – Đại học Thái Nguyên), http://www.lrc-tnu.edu.vn

The outline of a typical NSA contains two stages: generation and detection [2]. In the generation stage (Fig. 1.a), the detectors are generated by some random process and censored by
trying to match given self samples taken from set S. Those candidates that match are eliminated and the rest are kept as detectors in set D. In the detection stage (Fig. 2.a), the collection of detectors (or detector set) is used to verify whether an incoming data instance is self or non-self. If it matches any detector, it is claimed as non-self, or an anomaly. Each negative detector will match a subset of the non-self set. By generating a sufficient number of independent detectors, good coverage of the non-self set could be obtained.

[Figure 1. Models of detector generation: (a) negative detector generation; (b) positive detector generation. Flowchart steps: generate random candidates; test whether a candidate matches self samples; accept it as a new detector or discard it; repeat until there are enough detectors.]

[Figure 2. Detection of new instances: (a) negative detection; (b) positive detection. Flowchart steps: input new samples; test whether a sample matches any detector; report "Nonself" or "Self".]

In positive selection, positive detectors are those that match some self samples, and an instance is claimed as self if it matches any detector. The generation and detection stages are illustrated in Fig. 1.b and Fig. 2.b, respectively. Each positive detector will cover a subset of the self set. Several studies have used the concept of positive selection to model their systems [1, 4, 8, 13].

The negative r-chunk and r-contiguous detectors considered here are among the most common ones in the AIS literature. Many authors originally studied the negative r-contiguous detectors, and negative r-chunk detectors were later introduced to achieve better results on data where adjacent regions of the input strings are not necessarily semantically correlated, such as network data packets [8, 13, 14]. Zhou Ji et al.
(2007) [14] showed that there are at least 16 representations of NSAs. All existing NSAs suffer from a worst-case size of D that is exponential in the total size of the input, which limits their practical applicability [7].

Our contribution is to develop an r-chunk type detector-based selection algorithm, combining negative selection and positive selection, that reduces both runtime and memory complexity.

The remainder of the paper is organized as follows. In the next section, we present r-chunk detector types and discuss a modification of positive detection that gives it a false negative rate matching that of negative selection. The subsequent section shows our efficient approach in detail. In the last section, we summarize our approach and discuss future work.

NEGATIVE AND POSITIVE STRING-BASED DETECTORS

In this paper, we consider NSA and PSA as classifiers operating on a binary string space Σ^ℓ, where Σ = {0, 1}. The limited alphabet Σ is used here only to make the approach easy to follow; our algorithm can feasibly be adjusted to real-world datasets over arbitrary alphabets. We also use the following notation: let s ∈ Σ^ℓ be a binary string. Then ℓ = |s| is the length of s, and s[i,…,j] is the substring of s that starts at position i and has length j – i + 1.

Definition 1 (Chunk detectors). An r-chunk detector (d, i) is a tuple of a string d ∈ Σ^r and an integer i ∈ {1,…, ℓ - r + 1}. It matches another string s ∈ Σ^ℓ if s[i,…, i + r - 1] = d; we then say that s matches detector (d, i) at position i.

Definition 2 (Positive chunk detectors). Given a self set S, an r-chunk detector (d, i) is a positive chunk detector if it matches the substring s[i,…, i + r - 1] of some s ∈ S.

Definition 3 (Negative chunk detectors). Given a self set S, an r-chunk detector (d, i) is a negative chunk detector if it does not match the substring s[i,…, i + r - 1] of any s ∈ S.

Example 1. Given a self set S of 6 binary strings, with ℓ = 5 and r = 3: S = {s1 = 00000; s2 = 00010; s3 = 10110; s4 =
10111; s5 = 11000; s6 = 11010}.

The set Dn of all negative 3-chunk detectors consists of (ℓ - r + 1) = 3 subsets, Dn = Dn1 ∪ Dn2 ∪ Dn3, where
Dn1 = {(001,1); (010,1); (011,1); (100,1); (111,1)},
Dn2 = {(010,2); (110,2); (111,2)},
Dn3 = {(001,3); (011,3); (100,3); (101,3)}.

The non-self space covered by Dn, called Nn, contains 26 elements of {0, 1}^5: {00001; 00011; 00100; 00101; 00110; 00111; 01000; 01001; 01010; 01011; 01100; 01101; 01110; 01111; 10000; 10001; 10010; 10011; 10100; 10101; 11001; 11011; 11100; 11101; 11110; 11111}.

The set Dp of all positive 3-chunk detectors is Dp = Dp1 ∪ Dp2 ∪ Dp3, where
Dp1 = {(000,1); (101,1); (110,1)},
Dp2 = {(000,2); (001,2); (011,2); (100,2); (101,2)},
Dp3 = {(000,3); (010,3); (110,3); (111,3)}.

It can be seen that Dni and Dpi are complements of each other in the space {0, 1}^3, i = 1, 2, 3. The self space detected by Dp, called Sp, contains 26 elements of {0, 1}^5: {00000; 00001; 00010; 00011; 00110; 00111; 01000; 01001; 01010; 01011; 01110; 01111; 10000; 10001; 10010; 10011; 10100; 10101; 10110; 10111; 11000; 11001; 11010; 11011; 11110; 11111}.

This example shows that the intersection of the two sets Nn and Sp is not empty. In other words, the spaces detected by the two types of 3-chunk detectors are not complements of each other. The resolution of this ambiguity can be found in the NSA literature: if a given cell s matches any r-chunk negative detector, it is non-self. Our approach originates from the dual of this NSA rule: if a given cell s fails to match an r-chunk positive detector at some position i, i = 1, 2,…, ℓ - r + 1, it is non-self; equivalently, s is self only if it matches a positive detector at every position. Using this dual selection method, the set Sp now contains only the 6 elements of S, each of which matches a 3-chunk positive detector at every position i, i = 1, 2,…, ℓ - r + 1. We now have two disjoint sets Sp and Nn with Sp ∪ Nn = {0, 1}^5. It means that NSA and PSA have the same false negative rate. This
simple modification of positive detection leads to our approach, described in the following section.

COMBINING NEGATIVE SELECTION AND POSITIVE SELECTION

Our approach is derived from Truong et al.'s work [8], in which only negative selection algorithms are used. In our approach, a binary tree is used as the data structure for combining positive selection and negative selection to reduce memory complexity, and therefore to reduce the time complexity of the detection phase.

Our algorithm first constructs ℓ - r + 1 binary self trees corresponding to the ℓ - r + 1 sets Dpi. Then all complete sub-trees of these trees are deleted to achieve a compact representation of the positive r-chunk detector set. Finally, for every self tree, if the i-th tree is optimal in memory (it has no more nodes than the non-self tree for the same position), it is selected; otherwise it is replaced by a non-self tree constructed from Dni. The selected tree is denoted Ti, i = 1,…, ℓ - r + 1. Therefore, there are two types of binary tree: self trees and non-self trees. The detection phase can be processed by traversing the optimal trees one by one.

In Example 1, the binary self trees built from Dp1, Dp2, and Dp3 are illustrated in Figures 3.a, 3.c, and 3.e, respectively. In these figures, the dashed arrows represent sub-trees that will be deleted. Moreover, the left child is implicitly labeled with 0 and the right one with 1.

[Figure 3. Binary trees constructed from Dni and Dpi, i = 1, 2, 3: (a) self tree of Dp1; (b) non-self tree of Dn1; (c) self tree of Dp2; (d) non-self tree of Dn2; (e) self tree of Dp3; (f) non-self tree of Dn3.]

The numbers of nodes of the trees in Figures 3.a to 3.f (after deleting redundant nodes) are 9, 10, 7, 6, 8, and 8, respectively. The three selected optimal trees are those in Figure 3.a (9 nodes), 3.d (6 nodes), and 3.e or 3.f (8 nodes). Our efficient algorithm, called PNSA (Positive-Negative Selection Algorithm), on r-chunk detectors is
presented as follows.

Procedure PNSA;
Input: a self set S, an integer r ∈ {1,…, ℓ}, and a cell string s* to be detected.
Output: detection of s* as self or non-self.
begin
  for i = 1 to ℓ - r + 1
  begin
    initialize an empty binary self tree Ti;
    for each s ∈ S insert s[i,…, r + i - 1] into Ti;
    for every non-leaf node n ∈ Ti do
      if n is root of complete binary sub-tree then delete this sub-tree;
    if this self tree is not optimal then create non-self tree
  end;
  flag = true;
  while (i