Int. J. Intelligent Information and Database Systems, Vol. 7, No. 2, 2013, p. 135

Data classification by parallel generation of set partitions

Hoang Chi Thanh
Deceased; formerly of: Department of Computer Science, VNU University of Science, 334 Nguyen Trai Rd., Thanh Xuan, Hanoi, Vietnam
E-mail: thanhhc@vnu.vn

Abstract: In this paper, we propose two data classification problems. As each crisp classification of a data object set is a partition of the set, we apply our partition generation algorithm to the data classification problems. First, we recall our new simple algorithm generating partitions and parallelise it. Then we separate a data object set into classes by partitioning the set.

Keywords: algorithm; data classification; parallel computing; set partition.

Reference to this paper should be made as follows: Thanh, H.C. (2013) 'Data classification by parallel generation of set partitions', Int. J. Intelligent Information and Database Systems, Vol. 7, No. 2, pp.135–147.

Biographical notes: Hoang Chi Thanh was an Associate Professor at Hanoi University of Science, Vietnam. He received his PhD in Computer Science from Warsaw Technical University, Poland, and his BSc in Computational Mathematics from The University of Hanoi, Vietnam. From 1974 to 2012, he worked for The University of Hanoi (currently: Hanoi University of Science). From 2000 to 2008, he was the Head of the Department of Computer Science. From 2004 to 2012, he was the Director of HUS Science Company. He published more than 50 refereed papers and eight books, and supervised four PhD students. His research interests included concurrency theory, combinatorics, data mining and knowledge-based systems. Hoang Chi Thanh passed away in December 2012.

This paper is a revised and expanded version of a paper entitled 'An efficient parallel algorithm for the set partition problem' presented at the 3rd Asian Conference on Intelligent Information and Database Systems (ACIIDS) held in Daegu, Korea in April 2011.
1 Introduction

In practice, many applications require a deterministic structure of subsets of a data object set. That means we have to separate data objects into classes by predefined criteria. This work creates data classification issues, which divide into the two following problems:

1 Separate a given data object set into non-empty classes by predefined criteria.

2 Given a data object set and a data object, where the set has been separated into non-empty classes by predefined criteria, add the data object into the most appropriate class according to those criteria.

Copyright © 2013 Inderscience Enterprises Ltd.

There exist many methods for data classification, e.g., classification using support vector machines (SVM), model-based classification and classification by decision tree (Liao, 2005; Rao and Das, 2011; Yang et al., 2005). These methods are still complicated, and they restrict the types of data they handle. As we know, each crisp classification of a data object set is a partition of the set. So we can apply a fast algorithm for set partition generation to data classification problems. The set partition problem thus becomes a base for data classification problems.

There are some algorithms for the set partition problem, e.g., a recursive algorithm and a pointer-based algorithm (Cormen et al., 2001; Papadimitriou and Steiglitz, 2000; Cameron et al., 2004). Recently, we proposed a new algorithm based on index arrays to generate all partitions of a set (Thanh and Thanh, 2011). The algorithm is short and simple. Its complexity is among the lowest, as it generates each partition in linear time. An important advantage of the algorithm is that it is easy to parallelise. So we parallelise it and apply the parallel algorithm to data classification problems.

With many problems, we know quite well the number of all desirable solutions and their arrangement in some sequence. Then, we can split the sequence of all solutions into subsequences and use a common programme (algorithm) executed in a parallel computing system to find the subsequences concurrently. This parallel computing organisation is an association of the bottom-up design and the divide-and-conquer one (Thanh, 2012). Its model is illustrated in the following figure.

Figure 1 The model of a parallel computing organisation to find all solutions of a problem (see online version for colours)

In order to make the parallel computing organisation realistic and optimal, the subsequences of solutions should satisfy the two following criteria:

1 It is easy to determine the input and the termination condition for the computing process of each subsequence.

2 The smaller the difference in lengths of the subsequences, the better.

Of course, the first input is indeed the input of the problem and the last termination condition is the termination condition of the algorithm. The first criterion ensures that the splitting of the solution sequence is realistic. In many cases, we can use (a part of) the input of the computing process for the next subsequence as the termination condition of the computing process for the previous subsequence. The second criterion balances the computation across the processors. Consequently, the parallel computing processes become optimal. Hence, the amount of time required for finding all the solutions is decreased by roughly a factor of the number of subsequences.

Thus, we apply the above parallelising technique to our set partition algorithm and then use the parallel algorithm in constructing two parallel algorithms for data classification problems. These algorithms are fast, simple and easy to implement.

The rest of this paper is organised as follows. In Section 2, we present two data classification problems. Section 3 is devoted to the application of the parallel algorithm generating partitions to the data classification problems. We recall our new simple algorithm generating partitions and parallelise it. Then we separate a data object set into classes by partitioning the set. The last section contains conclusions and future research.
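The organisation just described — one common routine run concurrently on subsequences delimited by (input, termination condition) pairs — can be sketched as follows. This is only a toy illustration with integer 'solutions'; the function names are illustrative, not from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def find_subsequence(start, stop):
    """The common routine: enumerate the solutions of one subsequence.
    `start` plays the role of the process input; `stop` encodes the
    termination condition (here: the next process's input)."""
    return list(range(start, stop))

def find_all_solutions(n_solutions, n_processes=2):
    # Split the solution sequence into nearly equal subsequences, then
    # run the common routine on each of them concurrently.
    bounds = [i * n_solutions // n_processes for i in range(n_processes + 1)]
    with ThreadPoolExecutor(max_workers=n_processes) as pool:
        parts = pool.map(find_subsequence, bounds[:-1], bounds[1:])
    return [s for part in parts for s in part]
```

Concatenating the subsequences in order reproduces the full solution sequence, which is the correctness requirement of the decomposition.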
2 Data classification

Data classification is the process of separating data into separate piles (e.g., classes, groups, …), to which different policies apply. The data classification problem is modelled as follows.

Figure 2 Geometrical model of data classification (see online version for colours)

Data classification organises data so that information technology can manage it better.

2.1 Data classification problems

We consider the two following data classification problems.

DC problem 1: Given a set S = {s1, s2, …, sn}, where the elements si (i = 1, 2, …, n) are data objects. Separate the set S into non-empty classes C1, C2, …, Ck (k ≤ n) by predefined criteria.

This problem is illustrated by some typical examples:

• separating companies by changes in the prices of their stocks
• identifying the behaviours of sea animals using the data acquired from their bodies
• recognising spoken words from audio signals
• identifying anomalous events from astronomical data.

After separating the set S, if some data object belongs to several classes, the classification is fuzzy. Otherwise, if every data object belongs to one class only, the classification is crisp. Naturally, each crisp classification of the set S becomes a partition of this set.

DC problem 2: Given a data object set S = {s1, s2, …, sn} and a data object sn+1. By predefined criteria, the set S has been separated into non-empty classes C1, C2, …, Ck. Add the data object sn+1 into the most appropriate class according to the above criteria.

Example 2.1: Based on its credit rating database, a bank has to decide whether or not to lend to a new partner. Based on a patient database, doctors can make a diagnosis and give treatment to a new patient.

Any algorithm for the DC problem 2 may be applied to the DC problem 1 when the set S = {s1, s2, …, sn} in the DC problem 1 is considered as S = S′ ∪ {sn}, where S′ = {s1, s2, …, sn−1}.

Data classification criteria depend on the policies of classification and the type of data. Policies of classification must be represented by a mathematical expression. Generally, the notion of distance between two data objects, between a data object and a data class, or between two data classes takes part in these policies.
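As a concrete (hypothetical) instance of such a distance-based policy, a crisp class criterion can be written as a predicate over a class; the names and the threshold rule below are illustrative assumptions, not taken from the paper.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two data objects given as numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def within_threshold(cls, objects, threshold):
    """A hypothetical crisp criterion: every pair of objects in the class
    `cls` (a list of indices into `objects`) lies within `threshold`."""
    members = [objects[i] for i in cls]
    return all(euclidean(u, v) <= threshold
               for j, u in enumerate(members) for v in members[j + 1:])
```

A classification procedure only needs such a predicate; the rest of the paper treats the criteria as a black box of exactly this kind.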
2.2 Some existing methods for data classification

Data classification is an area of data mining. Its aim is to build models expressing data classes. The classification process consists of two phases: a training phase and a classifying phase.

• Training phase: constructing a classifier by analysing/learning/supervising on a test set.
• Classifying phase: classifying new data objects if the precision of the classifier is admitted.

We recall some often used data classification methods (Liao, 2005; Rao and Das, 2011).

Data classification using SVM: it maps a data object to a point in a high-dimensional space, where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximises the distance between the two parallel hyperplanes. The points on these hyperplanes are called support vectors.

Model-based classification: the method involves three steps: model finding, model refinement and classification. The models used are linear functions. It excludes all models with large errors; from the remaining models it selects those which discriminate between classes.

Classification by decision tree: from a training dataset it constructs a decision tree, where each inner vertex represents an attribute check, the edges represent the results of the check and each leaf describes a class.

Besides, there are some other classification methods, e.g., classification by Bayesian network, classification based on neural networks, supervised functional classification, classification by rough sets, etc. But each method is well suited to some types of data only.
3 Data classification by set partitions

As presented above, each crisp classification of a data object set is a partition of the set. Hence, we can apply our algorithm generating set partitions presented in Thanh and Thanh (2011) to the data classification problems. We recall it now.

3.1 Set partition problem

Let X be a finite set.

Definition 3.1: A partition of the set X is a family {A1, A2, …, Ak} of subsets of X satisfying the three following properties:

1 Ai ≠ ∅, 1 ≤ i ≤ k
2 Ai ∩ Aj = ∅, 1 ≤ i < j ≤ k
3 A1 ∪ A2 ∪ … ∪ Ak = X.

Partition problem: Given a set X, find all partitions of the set X.

Set partitions are broadly used in many areas of science and technology. The number of all partitions of an n-element set is the Bell number Bn, calculated by the following recursive formula (Papadimitriou and Steiglitz, 2000):

    Bn = Σ_{i=0}^{n−1} C(n−1, i) · Bi, where B0 = 1.

Given an n-element set X, let us identify the set X = {1, 2, 3, …, n}. Let π = {A1, A2, …, Ak} be a partition of the set X. Each subset Ai is called a block of the partition π. To ensure the uniqueness of representation, blocks in a partition are sorted in the ascending order of the least element in each block. In the partition, block Ai (i = 1, 2, 3, …, k) has the index i, and element 1 always belongs to block A1. Each element j ∈ X belonging to some block Ai also has the index i. It means every element of X can be represented by the index of the block that includes the element. Of course, the index of element j is not greater than j.

Each partition can be represented by a sequence of n indices. The sequence can be considered as a word of length n on the alphabet X. So we sort these words increasingly by the lexicographical order. Then:

• The first word (the smallest) is 11…1. It corresponds to the partition {{1, 2, 3, …, n}}. This partition consists of one block only.

• The last word (the greatest) is 12…n. It corresponds to the partition {{1}, {2}, {3}, …, {n}}. This partition consists of n blocks and each block has only one element. This is the unique partition that has a block with the index n.
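The Bell number recurrence given above can be checked directly with a short script (a sketch; `math.comb` supplies the binomial coefficient):

```python
from math import comb

def bell(n):
    """Bell numbers via B_n = sum_{i=0}^{n-1} C(n-1, i) * B_i, with B_0 = 1."""
    B = [1]  # B[0] = 1
    for m in range(1, n + 1):
        B.append(sum(comb(m - 1, i) * B[i] for i in range(m)))
    return B[n]
```

For a three-element set this gives bell(3) = 5, matching the five partitions listed in Example 3.2 below.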
It is easy to show that the number of all partitions of an n-element set is not greater than the number of all permutations of the same set. It means Bn ≤ n!.

We use an integer array AI[1..n] to represent a partition, where AI[i] stores the index of the block that includes element i. Element 1 always belongs to the first block; element 2 may belong to the first or the second block. If element 2 belongs to the first block, element 3 may belong to the first or the second block only. And if element 2 belongs to the second block, element 3 may belong to the first, the second or the third block. Hence, element i may only belong to the blocks 1, 2, 3, …, max(AI[1], AI[2], …, AI[i−1]) + 1. It means, for every partition:

    1 ≤ AI[i] ≤ max(AI[1], AI[2], …, AI[i−1]) + 1 ≤ i, where i = 1, 2, …, n.

This is an invariant for all partitions of a set. We use it to find partitions.

Example 3.2: The partitions of a three-element set and their index representations are sorted in the following table.

Table 1 The partitions and their index representations (see online version for colours)

No | Partitions        | AI[1..3]
1  | {{1, 2, 3}}       | 111
2  | {{1, 2}, {3}}     | 112
3  | {{1, 3}, {2}}     | 121
4  | {{1}, {2, 3}}     | 122
5  | {{1}, {2}, {3}}   | 123

3.2 A new algorithm for partition generation

It is easy to determine a partition from its index array representation. So, instead of generating partitions of the set X, we find the index arrays AI[1..n], each of which represents a partition. These index arrays are sorted in the ascending order. The first index array is 11…1 and the last one is 12…n. So the termination condition of the algorithm is: AI[n] = n.

Starting with the first index array, our algorithm repeats a loop to generate an index array and to print the corresponding partition until the last index array has been generated. Assume that AI[1..n] is a just found index array representing some partition of X.
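The correspondence between an index array and its partition (as in Table 1) is easy to realise in code — a small sketch:

```python
def blocks_from_index(ai):
    """Recover the blocks of a partition from its index array.
    `ai` is a list where ai[i-1] is the 1-based block index of element i."""
    blocks = [[] for _ in range(max(ai))]
    for elem, idx in enumerate(ai, start=1):
        blocks[idx - 1].append(elem)   # element `elem` belongs to block `idx`
    return blocks
```

For instance, the word 112 yields {{1, 2}, {3}} and 123 yields {{1}, {2}, {3}}, as in the table.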
The algorithm has to generate the index array AI′[1..n] next to AI in the ascending order. To do so, we use an integer array Max[1..n], where Max[i] stores max(AI[1], AI[2], …, AI[i−1]). The array Max gives us the possibilities to increase indices of the array AI. Of course, Max[1] = 0 and

    Max[i] = max(Max[i−1], AI[i−1]), i = 2, 3, …, n.

Then,

    AI′[i] = AI[i], i = 1, 2, …, p−1, where p = max{j | AI[j] ≤ Max[j]};
    AI′[p] = AI[p] + 1; and
    AI′[j] = 1, j = p+1, p+2, …, n.

Based on the above properties of the index arrays, we construct the following algorithm generating all partitions of a set.

Algorithm 3.1 (Generation of a set's partitions)

Input: A positive integer n.
Output: The sequence of all partitions of an n-element set, whose index representations are sorted in the ascending order.

Begin
1    input n;
2    for i ← 1 to n − 1 do AI[i] ← 1;
3    AI[n] ← 0; Max[1] ← 0;
4    repeat
5      for i ← 2 to n do
6        if Max[i − 1] < AI[i − 1] then Max[i] ← AI[i − 1]
7          else Max[i] ← Max[i − 1];
8      p ← n;
9      while AI[p] = Max[p] + 1 do
10       { AI[p] ← 1; p ← p − 1 };
11     AI[p] ← AI[p] + 1;
12     print the corresponding partition;
13   until AI[n] = n;
End

The algorithm's complexity: the algorithm generates an index array and prints the corresponding partition with the complexity of O(n). Therefore, the total complexity of the algorithm is O(Bn·n). So the algorithm becomes one of the best among algorithms generating partitions of a set.

3.3 Parallel generation of partitions

We apply the technique presented above to our set partition algorithm. For simplicity of presentation, we split the sequence of all desirable partitions into two subsequences. If we want to split the sequence further, this technique can be applied to each subsequence.

The pivot is chosen as the partition represented by the index array 1 2 … [n/2]−1 [n/2] 1 1 … 1 1, i.e., AI[i] = i for i = 1, 2, …, [n/2] and AI[j] = 1 for j = [n/2]+1, …, n. The pivot is the last partition of the first subsequence. The chosen pivot and the last index array of the first subsequence are illustrated in the following figure.

Figure 3 The pivot and the last index array of the first subsequence
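Algorithm 3.1 transcribes almost line for line into Python (a sketch; the list `ai` holds the 1-based block indices, and the generator yields each index array in ascending order):

```python
def generate_partitions(n):
    """Yield the index arrays of all partitions of {1, ..., n} in
    ascending lexicographical order (Algorithm 3.1)."""
    ai = [1] * (n - 1) + [0]          # instructions 2-3: AI = 11...10
    maxes = [0] * n                   # Max[1] = 0
    while True:
        for i in range(1, n):         # Max[i] = max(Max[i-1], AI[i-1])
            maxes[i] = max(maxes[i - 1], ai[i - 1])
        p = n - 1
        while ai[p] == maxes[p] + 1:  # roll back saturated positions
            ai[p] = 1
            p -= 1
        ai[p] += 1                    # AI[p] <- AI[p] + 1
        yield list(ai)                # 'print the corresponding partition'
        if ai[-1] == n:               # termination: AI[n] = n
            return
```

Here list(generate_partitions(3)) returns exactly the five words of Table 1, in the order shown there.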
We have to determine a termination condition for the first computing process and an input for the second one. The termination condition for the first computing process, replacing the one in instruction 13, is:

    AI[i] = i, i = 2, 3, …, [n/2]−1, [n/2].

The input for the second computing process, replacing instructions 2 and 3, will be:

    AI[i] ← i, i = 2, 3, …, [n/2]−1, [n/2];
    AI[j] ← 1, j = [n/2]+1, [n/2]+2, …, n−1;
    AI[n] ← 1.

It is easy to show that the pivot is nearly in the middle of the sequence of all index arrays. So the above splitting is appropriate.

3.4 Application of partition generation to data classification problems

Now we are able to apply the fast parallel algorithm generating set partitions presented above to the two data classification problems. The data object set S = {s1, s2, …, sn} is stored in an array S[1..n], where S[i] = si, i = 1, 2, …, n. Each data object si is referred to by its index i in the array S[1..n]. Accordingly, a crisp classification of the data object set S corresponds to a partition of the index set X = {1, 2, …, n}.

Based on the fast parallel algorithm generating set partitions, we propose the following data classification scheme:

1 Generate a partition of the index set X = {1, 2, …, n}.
2 Check the partition: if the corresponding data classes satisfy the criteria, print these data classes and stop.

Detailing the above scheme, we have the following data classification algorithm for the DC problem 1.

Algorithm 3.2 (Classification for the DC problem 1)

Input: A data object array S[1..n].
Output: Data classes of the data object set in the array S[1..n].

Begin
1    input n, S[1..n];
2    for i ← 1 to n − 1 do A[i] ← 1;
3    A[n] ← 0; Max[1] ← 0;
4    repeat
5      for i ← 2 to n do
6        { Max[i] ← Max[i − 1];
7          if Max[i] < A[i − 1] then Max[i] ← A[i − 1] };
8      k ← n;
9      while A[k] = Max[k] + 1 do
10       { A[k] ← 1; k ← k − 1 };
11     A[k] ← A[k] + 1;
12     if the data classes formed from the array S[1..n] by the partition A[1..n] satisfy the criteria
         then { print the corresponding data classes; exit };
13   until A[n] = n;
End
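The two-subsequence splitting of Section 3.3 can be sketched by running the same generation loop twice with the stated inputs and termination conditions (shown sequentially here for clarity; in the parallel organisation the two calls run on separate processors):

```python
def generate_from(ai, done):
    """The generation loop of Algorithm 3.1, started from the current
    index array `ai` and stopped once `done` holds for a produced array."""
    n = len(ai)
    maxes = [0] * n
    out = []
    while True:
        for i in range(1, n):
            maxes[i] = max(maxes[i - 1], ai[i - 1])
        p = n - 1
        while ai[p] == maxes[p] + 1:
            ai[p] = 1
            p -= 1
        ai[p] += 1
        out.append(list(ai))
        if done(ai):
            return out

def two_process_partitions(n):
    h = n // 2
    pivot = list(range(1, h + 1)) + [1] * (n - h)   # 1 2 .. [n/2] 1 .. 1
    # First process: input 11...10, stop after producing the pivot.
    first = generate_from([1] * (n - 1) + [0],
                          lambda a: a[:h] == pivot[:h])
    # Second process: input is the pivot itself, stop at 1 2 ... n.
    second = generate_from(list(pivot), lambda a: a[-1] == n)
    return first, second
```

Concatenating the two outputs gives the full sequence of Bn index arrays with no gap and no overlap, since the second process starts from the pivot as its current array and first emits the pivot's successor.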
The algorithm's complexity still depends on the classifying criteria. In general, the algorithm's complexity is O(Bn·n²). The algorithm is simple and fast, and it does not restrict data types. We can parallelise Algorithm 3.2 as we did for Algorithm 3.1 to obtain a parallel classification algorithm for the DC problem 1.

Example 3.3: Given a database S of the average monthly stock prices of n companies in m months as the following array:

    S = [sij], i = 1, 2, …, n and j = 1, 2, …, m.

We have to separate the companies by the changes in the prices of their stocks. For company i, we construct its growth rates as follows:

    tik = (si,k+1 − sik) / sik, with k = 1, 2, …, m − 1.

The distance between two companies i and j is defined as the Euclidean distance:

    d(compi, compj) = sqrt( (1/(m−1)) · Σ_{k=1}^{m−1} (tik − tjk)² ).

Applying Algorithm 3.2 with the following criteria, we get the result: the first class consists of companies with the distance between any two being under the threshold a1; the second class is for those with the distance between any two companies being above the threshold a1 and under the threshold a2; and the third class contains the remaining companies.

Consider the following database of the average monthly stock prices of eight companies in six months.

Table 2 The average monthly stock prices

No | Company                     | Jan  | Feb  | Mar  | Apr  | May  | Jun
1  | ABC Diary Products Jsc      | 85   | 90   | 92   | 92   | 93   | 91
2  | Petro Drilling              | 54   | 52   | 56   | 54   | 52   | 53
3  | DE Construction Corporation | 13   | 14   | 14   | 18   | 12   | 11
4  | Vecom Corporation           | 95   | 98   | 99   | 102  | 110  | 115
5  | Nasan Consumer              | 140  | 142  | 143  | 156  | 158  | 162
6  | AS Commercial Bank          | 22.1 | 22.1 | 22.2 | 22.1 | 22.3 | 22.1
7  | FT Corporation              | 52   | 54   | 55   | 65   | 66   | 69
8  | Haco Minerals               | 8.1  | 8.8  | 9.5  | 10.2 | 10.7 | 11.8

The monthly growth rates of these companies are determined in the following table.
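The growth rates and the company distance of Example 3.3 can be computed as follows (a sketch of the two formulas; `prices` is one row of the array S):

```python
import math

def growth_rates(prices):
    """t_k = (s_{k+1} - s_k) / s_k, for k = 1, ..., m-1."""
    return [(prices[k + 1] - prices[k]) / prices[k]
            for k in range(len(prices) - 1)]

def company_distance(prices_i, prices_j):
    """d(comp_i, comp_j) = sqrt( (1/(m-1)) * sum_k (t_ik - t_jk)^2 )."""
    t_i, t_j = growth_rates(prices_i), growth_rates(prices_j)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_i, t_j)) / len(t_i))
```

For the first row of Table 2, growth_rates([85, 90, 92, 92, 93, 91]) gives approximately [0.06, 0.02, 0.00, 0.01, −0.02], the first row of the growth-rate table below.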
Table 3 The monthly growth rates

No | Company                     | Rate 1 | Rate 2 | Rate 3 | Rate 4 | Rate 5
1  | ABC Diary Products Jsc      | 0.06   | 0.02   | 0.00   | 0.01   | −0.02
2  | Petro Drilling              | −0.04  | 0.08   | −0.04  | −0.04  | 0.02
3  | DE Construction Corporation | 0.08   | 0.00   | 0.29   | −0.33  | −0.08
4  | Vecom Corporation           | 0.03   | 0.01   | 0.03   | 0.08   | 0.05
5  | Nasan Consumer              | 0.01   | 0.01   | 0.09   | 0.01   | 0.03
6  | AS Commercial Bank          | 0.00   | 0.00   | 0.00   | 0.01   | −0.01
7  | FT Corporation              | 0.04   | 0.02   | 0.18   | 0.02   | 0.05
8  | Haco Minerals               | 0.09   | 0.08   | 0.07   | 0.05   | 0.10

Calculating by Algorithm 3.2 with the thresholds a1 = 0.05 and a2 = 0.10, we get the following results:

• the first class is {1, 4, 5, 6}, moderately performing companies
• the second class is {7, 8}, outperforming companies and
• the third class is {2, 3}, underperforming companies.

To solve the DC problem 2, we recall the following interesting fact of the partition problem. Given a set X = {x1, x2, …, xn} and an element xn+1. Assume that π = {A1, A2, …, Ak} is a partition of the set X. Then the following families of subsets:

    {A1 ∪ {xn+1}, A2, …, Ak}
    {A1, A2 ∪ {xn+1}, …, Ak}
    …                                                          (3.1)
    {A1, A2, …, Ak ∪ {xn+1}}
    {A1, A2, …, Ak, {xn+1}}

become partitions of the set X″ = X ∪ {xn+1}, whose projections on the set X are indeed the partition π. In the sequence of partitions (3.1), the element xn+1 is moved in the partition π from the first block to the last one. We use this fact to solve the DC problem 2. The data classes C1, C2, …, Ck already satisfy the criteria. Therefore, we only have to check whether, when adding the data object sn+1 into some class, the class still satisfies the criteria or not. Then we construct the following simple algorithm for the DC problem 2.

Algorithm 3.3 (Classification for the DC problem 2)

Input: Classes C1, C2, …, Ck and a new data object sn+1.
Output: New data classes.

Begin
1    input k, C1, C2, …, Ck and sn+1;
2    i ← 1;
3    if Ci ∪ {sn+1} satisfies the criteria then
       { print the data classes C1, …, Ci−1, Ci ∪ {sn+1}, Ci+1, …, Ck; halt };
4    i ← i + 1;
5    if i ≤ k then goto 3
       else print the data classes C1, C2, …, Ck, {sn+1};
End
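Algorithm 3.3 is a direct scan over the classes. A Python sketch, with the criteria supplied as a predicate `satisfies` (an assumption of this illustration — the paper leaves the criteria abstract):

```python
def add_object(classes, obj, satisfies):
    """Algorithm 3.3: add `obj` to the first class that still satisfies
    the criteria after the addition; otherwise open a new singleton class."""
    for i, cls in enumerate(classes):          # instructions 2, 4, 5
        if satisfies(cls + [obj]):             # instruction 3
            return classes[:i] + [cls + [obj]] + classes[i + 1:]
    return classes + [[obj]]                   # no suitable class found
```

The k checks in the loop are independent of one another, which is what makes them suitable for parallel evaluation.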
Obviously, the complexity of this algorithm depends on the classifying criteria. In general, the complexity of this algorithm is O(k·n). The check 'Ci ∪ {sn+1} satisfies the criteria' with i = 1, 2, …, k in instruction 3 may be executed in parallel. Note that Algorithm 3.3 can be applied to data classes C1, C2, …, Ck in both cases, when the classification is crisp or fuzzy.

4 Conclusions

Data classification is an important task in data mining. It organises data so that information technology can analyse and manage them better. In this paper, we propose two data classification problems. As each crisp classification of a data object set is a partition of the set, we apply our partition generation algorithm to the data classification problems. First, we recall our new simple algorithm generating partitions and then we parallelise the algorithm. The parallel computing organisation presented is an illustration of the output decomposition technique. We apply the parallel generation of partitions to classification by constructing two algorithms for the two DC problems. These algorithms are fast, simple and easy to implement. In the future, we will apply the technique to other problems, namely time-series processing, time-series matching, scheduling and system control.

Acknowledgements

The author would like to thank the anonymous reviewers for valuable recommendations. This work is supported by a research grant of Vietnam National University, Hanoi for promoting science and technology.

References

Cameron, K., Eschen, E.M., Hoang, C.T. and Sritharan, R. (2004) 'The list partition problem for graphs', Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, pp.391–399.

Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. (2001) Introduction to Algorithms, The MIT Press, USA.
Liao, T.W. (2005) 'Clustering of time-series data – a survey', Pattern Recognition, Vol. 38, pp.1857–1874, Elsevier.

Papadimitriou, C.H. and Steiglitz, K. (2000) Combinatorial Optimization: Algorithms and Complexity, Dover Publications, USA.

Rao, N.S. and Das, S.K. (2011) 'Classification of herbal gardens in India using data mining', Journal of Theoretical and Applied Information Technology, Vol. 25, No. 2, pp.71–78.

Thanh, H.C. (2012) 'Parallel generation of permutations by inversion vectors', Proceedings of the IEEE-RIVF International Conference on Computing and Communication Technologies, pp.129–132, IEEE.

Thanh, H.C. and Thanh, N.Q. (2011) 'An efficient parallel algorithm for the set partition problem', in Nguyen, N.T., Trawinski, B. and Jung, J.J. (Eds.): New Challenges for Intelligent Information and Database Systems, Studies in Computational Intelligence, Vol. 351, pp.25–32, Springer.

Yang, Y. et al. (2005) 'Preprocessing time-series data for classification with application to CRM', Lecture Notes in Artificial Intelligence, Vol. 3809, pp.133–142, Springer.
