DSpace at VNU: Picture fuzzy clustering for complex data
Engineering Applications of Artificial Intelligence 56 (2016) 121–130
journal homepage: www.elsevier.com/locate/engappai

Picture fuzzy clustering for complex data

Pham Huy Thong, Le Hoang Son*
VNU University of Science, Vietnam National University, 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam

Article info
Article history: Received 24 April 2016; received in revised form August 2016; accepted August 2016.
Keywords: Complex data; Distinct structured data; Fuzzy clustering; Mix data type; Picture fuzzy clustering

Abstract
Fuzzy clustering is a useful segmentation tool which has been widely used in many real-life applications such as pattern recognition, recommender systems and forecasting. The fuzzy clustering algorithm on picture fuzzy sets (FC-PFS) is an advanced fuzzy clustering algorithm constructed on the basis of the picture fuzzy set, with three membership degrees, namely the positive, the neutral and the refusal degrees, combined with an entropy component in the objective function to handle the problem of incomplete modeling in fuzzy clustering. A disadvantage of FC-PFS is its limited capability to handle complex data, which include mix data types (categorical and numerical data) and distinct structured data. In this paper, we propose a novel picture fuzzy clustering algorithm for complex data called PFCA-CD that deals with both mix data types and distinct data structures. The idea of this method is to modify FC-PFS using a new measurement for categorical attributes, multiple centers for one cluster and an evolutionary strategy, particle swarm optimization. Experiments indicate that the proposed algorithm achieves better clustering quality than other algorithms according to clustering validity indices.
© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Fuzzy clustering is used for partitioning a dataset into clusters where each element in the dataset can
belong to all clusters with different membership values (Bezdek et al., 1984). Fuzzy clustering was first introduced by Bezdek et al. (1984) under the name "Fuzzy C-Means (FCM)". This algorithm is based on the idea of K-Means clustering, with membership values attached to the objective function for partitioning all data elements in the dataset into appropriate groups (Chen et al., 2016). FCM is more flexible than the K-Means algorithm, especially on overlapping and uncertain datasets (Bezdek et al., 1984). Moreover, FCM has many applications in real-life problems such as pattern recognition, recommender systems and forecasting (Son et al., 2012a, 2012b; Son et al., 2013, 2014; Thong and Son, 2014; Son, 2014a, 2014b; Son and Thong, 2015; Thong and Son, 2015; Son, 2015b, 2015c, 2016; Son and Tuan, 2016; Son and Hai, 2016; Wijayanto et al., 2016; Tuan et al., 2016; Thong et al., 2016; Tuan et al., 2016). However, FCM still has some limitations regarding clustering quality, hesitation, noise and outliers (Ferreira and de Carvalho, 2012; De Carvalho et al., 2013; Thong and Son, 2016b). Much research has been proposed to overcome these limitations; one direction is to build FCM on advanced fuzzy sets such as type-2 fuzzy sets (Mendel and John, 2002), intuitionistic fuzzy sets (Atanassov, 1986) and picture fuzzy sets (PFS) (Cuong, 2014).

* Corresponding author. E-mail addresses: thongph@vnu.edu.vn (P.H. Thong), sonlh@vnu.edu.vn, chinhson2002@gmail.com (L.H. Son).
http://dx.doi.org/10.1016/j.engappai.2016.08.009
0952-1976/© 2016 Elsevier Ltd. All rights reserved.

The fuzzy clustering algorithm on PFS (FC-PFS) (Son, 2015a; Thong and Son, 2016b) is an extension of FCM with the three membership degrees of picture fuzzy sets, namely the positive, the neutral and the refusal degrees, combined with an entropy component in the objective function to handle the problem of incomplete modeling in FCM (Yang et al., 2004). FC-PFS was shown to have better accuracy than
other fuzzy clustering schemes in the equivalent articles (Son, 2015a; Thong and Son, 2016b). Nonetheless, our experiments on various types of datasets show that the FC-PFS algorithm processes complex data inefficiently, where complex data include mix data types and distinct structure data. Mix data comprise categorical and numerical data, which can be processed effectively only with suitably equipped kernel functions (Ferreira and de Carvalho, 2012). Distinct structure data contain non-spherical structures, such as data scattered along a line or a ring, that prevent clustering algorithms from partitioning data elements into exact clusters. Almost all fuzzy clustering methods, including FC-PFS, find it hard to deal with complex data. Much research has developed new fuzzy clustering algorithms that employ dissimilarity distances and kernel functions to cope with complex data (Cominetti et al., 2010; Hwang, 1998; Ji et al., 2013, 2012). However, these methods address either mix data types or distinct structure data but not both, which is the motivation for this paper.

P.H. Thong, L.H. Son / Engineering Applications of Artificial Intelligence 56 (2016) 121–130

In this paper, we propose a novel picture fuzzy clustering algorithm for complex data called PFCA-CD that deals with both mix data types and distinct data structures. The idea of this method is to modify FC-PFS using a measurement for categorical attributes, multiple centers for one cluster and an evolutionary strategy, particle swarm optimization. Experiments indicate that the proposed algorithm achieves better clustering quality than other algorithms according to clustering validity indices.

The rest of the paper is organized as follows. Section 2 describes the background, with a literature review and some particular fuzzy clustering methods for complex data. Section 3 presents our proposed method. Section 4 validates the method on the benchmark UCI
datasets. Finally, conclusions and further works are covered in the last section.

2. Background

In this section, we first give an overview of the relevant methods for clustering complex data in Section 2.1. Sections 2.2–2.3 review two typical methods of this approach.

2.1. Literature review

The related works on clustering complex data are divided into two groups: mixed types of data, including categorical and numerical data, and distinct structures of data (Fig. 1).

In the first group, there has been much research on clustering both categorical and numerical data. Hwang (1998) extended the k-means algorithm for clustering large datasets including categorical values. Yang et al. (2004) used fuzzy clustering algorithms to partition mixed feature variables by giving a modified dissimilarity measure for symbolic and fuzzy data. Ji et al. (2012, 2013) proposed fuzzy k-prototype clustering algorithms that combine the mean and the fuzzy centroid to represent the prototype of a cluster and employ a new measure, based on the co-occurrence of values, to evaluate the dissimilarity between data objects and prototypes of clusters. Chen et al. (2016) presented a soft subspace clustering of categorical data that uses a novel soft feature-selection scheme to automatically assign each categorical attribute a weight that correlates with the smoothed dispersion of the categories in a cluster. De Carvalho et al. (2013) introduced a series of methods based on multiple dissimilarity matrices to handle mix data. The main idea of these methods is to let the different dissimilarity matrices play a collaborative role in reaching a final consensus partition. Although these methods can partition mixed data efficiently, they have difficulty with complex distinct structures of data.

In the second group, many researchers tried to partition complex structured data whose intrinsic geometry consists of non-spherical and non-convex clusters. Cominetti et al. (2010) proposed a method called DifFuzzy
combining ideas from FCM and diffusion on graphs to handle the problem of clusters with a complex nonlinear geometric structure. This method is applicable to a larger class of clustering problems which do not require any prior information on the number of clusters. Ferreira and de Carvalho (2012) presented kernel fuzzy clustering methods based on local adaptive distances to partition complex data. The main idea of these methods is a local adaptive distance in which dissimilarity measures are obtained as sums of the Euclidean distances between patterns and centroids, computed individually for each variable by means of kernel functions. The dissimilarity measure is used to learn the weights of variables during the clustering process, which improves the performance of the algorithms. However, this method can deal with numerical data only.

The DifFuzzy algorithm (Cominetti et al., 2010) and the fuzzy clustering algorithm based on multiple dissimilarity matrices (Dissimilarity) (De Carvalho et al., 2013) are typical clustering methods of the two groups. Therefore, we analyze these methods in more detail in the next sections.

Fig. 1. Classification of methods dealing with complex data: mix data types (categorical and numerical data) and distinct structure of data (different distribution of data).

2.2. DifFuzzy

The DifFuzzy clustering algorithm (Cominetti et al., 2010) is based on FCM and diffusion on graphs, and partitions the dataset into clusters with a complex nonlinear geometric structure. Firstly, the auxiliary function $F(\sigma): (0, \infty) \to \mathbb{N}$ is defined, where $\sigma \in (0, \infty)$ is a positive number. The $i$-th and $j$-th nodes are connected by an edge if

$$\| X_i - X_j \| < \sigma. \tag{1}$$

$F(\sigma)$ is equal to the number of components of the $\sigma$-neighborhood graph which contain at least $M$ vertices, where $M$ is the mandatory parameter of DifFuzzy. $F(\sigma)$ begins from zero, then increases to its maximum value, before settling back down to a value of 1. The number of clusters is

$$C = \max_{\sigma \in (0, \infty)} F(\sigma). \tag{2}$$

The weights are defined as

$$\hat{\omega}_{i,j}(\beta) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are hard points in the same core cluster}, \\ \exp\left( -\dfrac{\| X_i - X_j \|^2}{\beta} \right) & \text{otherwise}, \end{cases} \tag{3}$$

where $\beta$ is a positive real number. The function $L(\beta): (0, \infty) \to (0, \infty)$ is

$$L(\beta) = \sum_{i=1}^{N} \sum_{j=1}^{N} \hat{\omega}_{i,j}(\beta). \tag{4}$$

It has two well-defined limits:

$$\lim_{\beta \to 0} L(\beta) = N + \sum_{i=1}^{C} n_i (n_i - 1) \quad \text{and} \quad \lim_{\beta \to \infty} L(\beta) = N^2, \tag{5}$$

where $n_i$ is the number of points in the $i$-th core cluster. DifFuzzy then finds $\beta^*$ which satisfies the relation

$$L(\beta^*) = (1 - \gamma_1) \left( N + \sum_{i=1}^{C} n_i (n_i - 1) \right) + \gamma_1 N^2, \tag{6}$$

where $\gamma_1 \in (0, 1)$ is an internal parameter of the method. Its default value is 0.3. Then the auxiliary matrices are defined as follows:

$$W = \hat{W}(\beta^*). \tag{7}$$

The matrix $D$ is defined as a diagonal matrix with diagonal elements

$$D_{i,i} = \sum_{j=1}^{N} \omega_{i,j}, \quad i = 1, 2, \ldots, N, \tag{8}$$

where $\omega_{i,j}$ are the entries of the matrix $W$. Finally, the matrix $P$ is defined as

$$P = I + [W - D] \frac{\gamma_2}{\max_{i} D_{i,i}}, \tag{9}$$

where $I \in \mathbb{R}^{N \times N}$ is the identity matrix and $\gamma_2$ is an internal parameter of DifFuzzy. Its default value is 0.1. DifFuzzy also computes an auxiliary integer parameter $\alpha$ by

$$\alpha = \left\lfloor \frac{\gamma_3}{| \log \lambda_2 |} \right\rfloor, \tag{10}$$

where $\gamma_3$ is a further internal parameter of DifFuzzy, $\lambda_2$ is the second largest eigenvalue of $P$ and $\lfloor \cdot \rfloor$ denotes the integer part. In order to compute the diffusion distance between a soft point $X_s$ and the $c$-th cluster, the following formula is used:

$$\mathrm{dist}(X_s, c) = \| P^{\alpha} e - \hat{P}^{\alpha} e \|, \tag{11}$$

where $e(j) = 1$ if $j = s$ and $e(j) = 0$ otherwise, and $\hat{P}$ is the counterpart of $P$ obtained when $X_s$ is temporarily treated as a member of the $c$-th core cluster. Finally, the membership value of the soft point $X_s$ in the $c$-th cluster, $u_c(X_s)$, is determined as

$$u_c(X_s) = \frac{\mathrm{dist}(X_s, c)^{-1}}{\sum_{l=1}^{C} \mathrm{dist}(X_s, l)^{-1}}. \tag{12}$$

This procedure is applied to every soft data point $X_s$ and every cluster $c \in \{1, 2, \ldots, C\}$. The output of DifFuzzy is a number of clusters ($C$) and, for each data point, a set of $C$ numbers that represent the degree of membership in each cluster. The membership value of $X_i$, $i = 1, 2, \ldots, N$, in the $c$-th cluster, $c = 1, 2, \ldots, C$, is denoted by $u_c(X_i)$. The degree of membership is a number between 0 and 1, where values close to 1 correspond to points that are very likely to belong to that cluster. The sum of the membership values of a data point over all clusters is always one.

2.3. Dissimilarity

The Dissimilarity algorithm (De Carvalho et al., 2013) is Fuzzy K-Medoids with a relevance weight for each dissimilarity matrix, and consists of the steps below.

2.3.1. Initialization

Fix $C$ (the number of clusters), $2 \le C \ll n$; fix $m$, $1 < m < +\infty$; fix $s$, $1 \le s < +\infty$; fix $T$ (an iteration limit); fix $\varepsilon > 0$ with $\varepsilon \ll 1$. Fix the cardinality $1 \le q \ll n$ of the prototypes $G_k$ ($k = 1, \ldots, C$). Set $t = 0$. Set $\gamma_k^{(0)} = (\gamma_{k1}^{(0)}, \ldots, \gamma_{kr}^{(0)}) = (1, \ldots, 1)$ or set $\gamma_k^{(0)} = (\gamma_{k1}^{(0)}, \ldots, \gamma_{kr}^{(0)}) = (1/r, \ldots, 1/r)$, $k = 1, \ldots, C$. Randomly select $C$ distinct prototypes $G_k^{(0)} \in E^{(q)}$ ($k = 1, \ldots, C$). For each object $e_i$ ($i = 1, \ldots, n$), compute its membership degree $u_{ik}^{(0)}$ ($k = 1, \ldots, C$) in fuzzy cluster $C_k$:

$$u_{ik}^{(0)} = \left[ \sum_{h=1}^{C} \left( \frac{D_{(\gamma_k^{(0)}, s)}(e_i, G_k^{(0)})}{D_{(\gamma_h^{(0)}, s)}(e_i, G_h^{(0)})} \right)^{\frac{1}{m-1}} \right]^{-1} = \left[ \sum_{h=1}^{C} \left( \frac{\sum_{j=1}^{r} (\gamma_{kj}^{(0)})^{s} \sum_{e \in G_k^{(0)}} d_j(e_i, e)}{\sum_{j=1}^{r} (\gamma_{hj}^{(0)})^{s} \sum_{e \in G_h^{(0)}} d_j(e_i, e)} \right)^{\frac{1}{m-1}} \right]^{-1}, \quad i = 1, \ldots, n. \tag{13}$$

$$J^{(0)} = \sum_{k=1}^{C} \sum_{i=1}^{n} (u_{ik}^{(0)})^{m} D_{(\gamma_k^{(0)}, s)}(e_i, G_k^{(0)}) = \sum_{k=1}^{C} \sum_{i=1}^{n} (u_{ik}^{(0)})^{m} \sum_{j=1}^{r} (\gamma_{kj}^{(0)})^{s} \sum_{e \in G_k^{(0)}} d_j(e_i, e). \tag{14}$$

2.3.2. Compute the best prototypes

Set $t = t + 1$. The vector of relevance weights $\Lambda^{(t-1)} = (\gamma_1^{(t-1)}, \ldots, \gamma_C^{(t-1)})$ and the fuzzy partition represented by $U^{(t-1)} = (u_1^{(t-1)}, \ldots, u_n^{(t-1)})$ are fixed. The prototype $G_k^{(t)} = G^* \in E^{(q)}$ of fuzzy cluster $C_k$ ($k = 1, \ldots, C$) is calculated according to the following proposition: the prototype $G_k = G^* \in E^{(q)}$ of fuzzy cluster $C_k$ ($k = 1, \ldots, C$) is chosen to minimize the clustering criterion $J$:

$$\sum_{i=1}^{n} (u_{ik})^{m} \sum_{j=1}^{r} (\gamma_{kj})^{s} D_j(e_i, G^*) \to \min.$$

2.3.3. Compute the best relevance weight vector

When the vector of prototypes $G^{(t)} = (G_1^{(t)}, \ldots, G_C^{(t)})$ and the fuzzy partition represented by $U^{(t-1)} = (u_1^{(t-1)}, \ldots, u_n^{(t-1)})$ are fixed, the components $\gamma_{kj}^{(t)}$ ($j = 1, \ldots, r$) of the relevance weight vector $\gamma_k^{(t)}$ ($k = 1, \ldots, C$) are computed by Eq. (15) if the matching function is given by Eq. (16), or by Eq. (17) if the matching function is given by Eq. (18):

$$\gamma_{kj} = \frac{\left\{ \prod_{h=1}^{r} \left[ \sum_{i=1}^{n} (u_{ik})^{m} D_h(e_i, G_k) \right] \right\}^{\frac{1}{r}}}{\sum_{i=1}^{n} (u_{ik})^{m} D_j(e_i, G_k)} = \frac{\left\{ \prod_{h=1}^{r} \left[ \sum_{i=1}^{n} (u_{ik})^{m} \sum_{e \in G_k} d_h(e_i, e) \right] \right\}^{\frac{1}{r}}}{\sum_{i=1}^{n} (u_{ik})^{m} \sum_{e \in G_k} d_j(e_i, e)}, \tag{15}$$

$$D_{\gamma_k}(e_i, G_k) = \sum_{j=1}^{r} \gamma_{kj} D_j(e_i, G_k) = \sum_{j=1}^{r} \gamma_{kj} \sum_{e \in G_k} d_j(e_i, e), \tag{16}$$

$$\gamma_{kj} = \left[ \sum_{h=1}^{r} \left( \frac{\sum_{i=1}^{n} (u_{ik})^{m} D_j(e_i, G_k)}{\sum_{i=1}^{n} (u_{ik})^{m} D_h(e_i, G_k)} \right)^{\frac{1}{s-1}} \right]^{-1} = \left[ \sum_{h=1}^{r} \left( \frac{\sum_{i=1}^{n} (u_{ik})^{m} \sum_{e \in G_k} d_j(e_i, e)}{\sum_{i=1}^{n} (u_{ik})^{m} \sum_{e \in G_k} d_h(e_i, e)} \right)^{\frac{1}{s-1}} \right]^{-1}, \tag{17}$$

$$D_{(\gamma_k, s)}(e_i, G_k) = \sum_{j=1}^{r} (\gamma_{kj})^{s} D_j(e_i, G_k) = \sum_{j=1}^{r} (\gamma_{kj})^{s} \sum_{e \in G_k} d_j(e_i, e). \tag{18}$$

2.3.4. Define the best fuzzy partition

The vector of prototypes $G^{(t)} = (G_1^{(t)}, \ldots, G_C^{(t)})$ and the vector of relevance weights $\Lambda^{(t)} = (\gamma_1^{(t)}, \ldots, \gamma_C^{(t)})$ are fixed. The membership degree $u_{ik}^{(t)}$ of object $e_i$ ($i = 1, \ldots, n$) in fuzzy cluster $C_k$ ($k = 1, \ldots, C$) is calculated as in Eq. (19):

$$u_{ik}^{(t)} = \left[ \sum_{h=1}^{C} \left( \frac{D_{(\gamma_k^{(t)}, s)}(e_i, G_k^{(t)})}{D_{(\gamma_h^{(t)}, s)}(e_i, G_h^{(t)})} \right)^{\frac{1}{m-1}} \right]^{-1} = \left[ \sum_{h=1}^{C} \left( \frac{\sum_{j=1}^{r} (\gamma_{kj}^{(t)})^{s} \sum_{e \in G_k^{(t)}} d_j(e_i, e)}{\sum_{j=1}^{r} (\gamma_{hj}^{(t)})^{s} \sum_{e \in G_h^{(t)}} d_j(e_i, e)} \right)^{\frac{1}{m-1}} \right]^{-1}. \tag{19}$$

2.3.5. Stopping criterion

Compute:

$$J^{(t)} = \sum_{k=1}^{C} \sum_{i=1}^{n} (u_{ik}^{(t)})^{m} D_{(\gamma_k^{(t)}, s)}(e_i, G_k^{(t)}) = \sum_{k=1}^{C} \sum_{i=1}^{n} (u_{ik}^{(t)})^{m} \sum_{j=1}^{r} (\gamma_{kj}^{(t)})^{s} \sum_{e \in G_k^{(t)}} d_j(e_i, e). \tag{20}$$

If $| J^{(t)} - J^{(t-1)} | \le \varepsilon$ or $t > T$: STOP; otherwise go
to the prototype-computation step (Section 2.3.2).

2.4. Particle swarm optimization

Particle Swarm Optimization (PSO), first introduced by Eberhart and Kennedy (1995), is an evolutionary strategy that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. It simulates the movement of organisms in a bird flock or fish school searching for food. Suppose that there are popsize particles in the swarm; each of them represents one solution of the problem and is encoded by its location ($loc$) and velocity ($vec$). The PSO procedure consists of these steps: initializing the swarm, calculating the fitness values and updating the particles. Firstly, the location and velocity of each particle are initialized randomly. Secondly, the quality of each particle is measured by its fitness value. Depending on the demands of the problem, the fitness value is designed to assess the quality of a solution. Finally, the update process is given in Eqs. (21) and (22):

$$vec_i = vec_i + C_1 \, rand \, (locPbest - loc_i) + C_2 \, rand \, (locGbest - loc_i), \tag{21}$$

$$loc_i = vec_i + loc_i, \tag{22}$$

where $C_1, C_2 \ge 0$ are PSO's parameters, often set to fixed constants ($C_1 = C_2 = 1$ in the experiments of Section 4.1). $locPbest$ is the location at which particle $i$ has its best solution so far, and $locGbest$ is the location at which the swarm has its best solution so far. The whole process is repeated until the maximum number of iterations has been reached or the best solution has not changed in two consecutive steps. Details of this method are described in Fig. 2.

Fig. 2. Schema of PSO algorithm.

3. The proposed method

Based on the FC-PFS algorithm (Thong and Son, 2016a), a new picture fuzzy clustering method for complex data (PFCA-CD) is presented. The idea of this method is to overcome the complex structure and mix data by enhancing FC-PFS with multiple centers, a measurement for categorical attributes and an evolutionary strategy with Particle Swarm Optimization (PSO) (Eberhart and Kennedy, 1995). Therein, multiple centers are used to deal with complex structures of data, because data with complex structures have many different shapes that cannot be represented by one center. The centers are selected from the data elements themselves because of the categorical data. Moreover, categorical attributes cannot be measured in the same way as numerical attributes; therefore new measurements are used to cope with mix data. The subsections are as follows: Section 3.1 describes details of FC-PFS, Section 3.2 proposes a new measurement for categorical attributes and Section 3.3 presents the PFCA-CD algorithm, accompanied by remarks in Section 3.4.

3.1. Fuzzy clustering on picture fuzzy set

Definition 1. A Picture Fuzzy Set (PFS) (Cuong, 2014) in a non-empty set $X$ is

$$\dot{A} = \left\{ \langle x, \mu_{\dot{A}}(x), \eta_{\dot{A}}(x), \gamma_{\dot{A}}(x) \rangle \mid x \in X \right\}, \tag{23}$$

where $\mu_{\dot{A}}(x)$ is the positive degree of each element $x \in X$, $\eta_{\dot{A}}(x)$ is the neutral membership and $\gamma_{\dot{A}}(x)$ is the negative degree, satisfying the constraints

$$\mu_{\dot{A}}(x), \eta_{\dot{A}}(x), \gamma_{\dot{A}}(x) \in [0, 1], \quad \forall x \in X, \tag{24}$$

$$0 \le \mu_{\dot{A}}(x) + \eta_{\dot{A}}(x) + \gamma_{\dot{A}}(x) \le 1, \quad \forall x \in X. \tag{25}$$

The refusal degree of an element is

$$\xi_{\dot{A}}(x) = 1 - \left( \mu_{\dot{A}}(x) + \eta_{\dot{A}}(x) + \gamma_{\dot{A}}(x) \right), \quad \forall x \in X. \tag{26}$$

Based on the theory of picture fuzzy sets, Thong and Son (2016a) proposed a picture fuzzy model for the clustering problem called FC-PFS, which was proven to achieve better clustering quality than other relevant methods. Suppose there is a dataset $X$ consisting of $N$ data points in $r$ dimensions. The objective function for dividing the dataset into $C$ groups is

$$J = \sum_{k=1}^{N} \sum_{j=1}^{C} \left( \mu_{kj} (2 - \xi_{kj}) \right)^{m} \| x_k - V_j \|^2 + \sum_{k=1}^{N} \sum_{j=1}^{C} \eta_{kj} \left( \log \eta_{kj} + \xi_{kj} \right) \to \min. \tag{27}$$

Some constraints are defined as follows:

$$\mu_{kj}, \eta_{kj}, \xi_{kj} \in [0, 1], \tag{28}$$

$$\mu_{kj} + \eta_{kj} + \xi_{kj} \le 1, \tag{29}$$

$$\sum_{j=1}^{C} \left( \mu_{kj} (2 - \xi_{kj}) \right) = 1, \tag{30}$$

$$\sum_{j=1}^{C} \left( \eta_{kj} + \frac{\xi_{kj}}{C} \right) = 1, \quad k = 1, \ldots, N, \; j = 1, \ldots, C. \tag{31}$$

By using the Lagrangian method, the authors determined the optimal solutions of the model, given in Eqs. (32)–(35):

$$\xi_{kj} = 1 - (\mu_{kj} + \eta_{kj}) - \left( 1 - (\mu_{kj} + \eta_{kj})^{\alpha} \right)^{\frac{1}{\alpha}}, \quad (k = 1, \ldots, N, \; j = 1, \ldots, C), \tag{32}$$

where $\alpha \in (0, 1]$ is an exponent coefficient used to control the refusal degree in PFS sets,

$$\mu_{kj} = \frac{1}{\sum_{i=1}^{C} (2 - \xi_{ki}) \left( \dfrac{\| x_k - V_j \|}{\| x_k - V_i \|} \right)^{\frac{2}{m-1}}}, \quad (k = 1, \ldots, N, \; j = 1, \ldots, C), \tag{33}$$

$$\eta_{kj} = \frac{e^{-\xi_{kj}}}{\sum_{i=1}^{C} e^{-\xi_{ki}}} \left( 1 - \frac{1}{C} \sum_{i=1}^{C} \xi_{ki} \right), \quad (k = 1, \ldots, N, \; j = 1, \ldots, C), \tag{34}$$

$$V_j = \frac{\sum_{k=1}^{N} \left( \mu_{kj} (2 - \xi_{kj}) \right)^{m} x_k}{\sum_{k=1}^{N} \left( \mu_{kj} (2 - \xi_{kj}) \right)^{m}}, \quad (j = 1, \ldots, C). \tag{35}$$

Details of FC-PFS are described in Table 1.

Table 1. FC-PFS
Fuzzy clustering method on picture fuzzy sets
I: Data $X$ whose number of elements is $N$, in $r$ dimensions; number of clusters ($C$); the fuzzifier $m$; threshold $\varepsilon$; the maximum iteration maxSteps > 0
O: Matrices $\mu$, $\eta$, $\xi$ and centers $V$
FC-PFS:
1: $t = 0$
2: $\mu_{kj}^{(t)} \leftarrow$ random; $\eta_{kj}^{(t)} \leftarrow$ random; $\xi_{kj}^{(t)} \leftarrow$ random ($k = 1, \ldots, N$; $j = 1, \ldots, C$), satisfying Eqs. (28) and (29)
3: Repeat
4: $t = t + 1$
5: Calculate $V_j^{(t)}$ ($j = 1, \ldots, C$) by Eq. (35)
6: Calculate $\mu_{kj}^{(t)}$ ($k = 1, \ldots, N$; $j = 1, \ldots, C$) by Eq. (33)
7: Calculate $\eta_{kj}^{(t)}$ ($k = 1, \ldots, N$; $j = 1, \ldots, C$) by Eq. (34)
8: Calculate $\xi_{kj}^{(t)}$ ($k = 1, \ldots, N$; $j = 1, \ldots, C$) by Eq. (32)
9: Until $\| \mu^{(t)} - \mu^{(t-1)} \| + \| \eta^{(t)} - \eta^{(t-1)} \| + \| \xi^{(t)} - \xi^{(t-1)} \| \le \varepsilon$ or maxSteps has been reached

3.2. A new measurement for categorical attributes

Suppose that $d_h(x_i, x_j)$ is the distance between elements $x_i$ and $x_j$ on attribute $h$ ($i = 1, \ldots, N$; $j = 1, \ldots, N$; $h = 1, \ldots, R$). If the $h$-th attribute is numerical, $d_h(x_i, x_j)$ is calculated based on the Euclidean distance. Otherwise, if the $h$-th attribute is categorical, $d(x_{ih}, v_{jh})$ is calculated by Eq. (36):

$$d(x_{ih}, v_{jh}) = \begin{cases} 0 & \text{if } x_{ih} = v_{jh}, \\ 1 & \text{otherwise}. \end{cases} \tag{36}$$

This means that the input data have to be normalized to the range [0, 1], with 0 corresponding to the minimum distance between two objects and 1 to the maximum one. In Eq. (36), if two categorical objects are not equal, the distance between them is the maximum one.

3.3. The PFCA-CD algorithm

In order to partition a dataset with mix data types and distinct data structures, we combine FC-PFS with PSO as follows. Suppose that the dataset $X$ contains mixed numerical and categorical data with a complex structure and that the number of clusters $C$ is given. Instead of the iteration of FC-PFS, the PSO iteration is employed. The initial population of PSO is encoded as $P = \{ p_1, p_2, \ldots, p_{popsize} \}$, where each particle consists of the following components:

– ($\mu_{kj}$, $\eta_{kj}$, $\xi_{kj}$): the positive, neutral and refusal degrees of elements in $X$, respectively
– ($\mu Pbest_{kj}$, $\eta Pbest_{kj}$, $\xi Pbest_{kj}$): the positive, neutral and refusal degrees of elements in $X$ having the best clustering quality, respectively
– $V_j$ and $VPbest_j$: the sets of cluster centers corresponding to ($\mu_{kj}$, $\eta_{kj}$, $\xi_{kj}$) and ($\mu Pbest_{kj}$, $\eta Pbest_{kj}$, $\xi Pbest_{kj}$), respectively
– $Pbest_i$: the best quality value that a particle achieves

A particle starts from given values of ($\mu_{kj}$, $\eta_{kj}$, $\xi_{kj}$) and tries to change them to achieve the best fitness value. The fitness value is calculated in the same way as the optimization function (27), as in Eq. (37):

$$Fitness = \sum_{k=1}^{N} \sum_{j=1}^{C} \left( \mu_{kj} (2 - \xi_{kj}) \right)^{m} \| x_k - V_j \|^2 + \sum_{k=1}^{N} \sum_{j=1}^{C} \eta_{kj} \left( \log \eta_{kj} + \xi_{kj} \right). \tag{37}$$

This process can be regarded as best matching of the fitness value with the current status of the particle. If the achieved solutions are better than the previous ones, the local optimal solutions Pbest, that is ($\mu Pbest_{kj}$, $\eta Pbest_{kj}$, $\xi Pbest_{kj}$, $VPbest_{kj}$), of the particle are recorded. Then, the particle evolves by changing the values of ($\mu_{kj}$, $\eta_{kj}$, $\xi_{kj}$ and $V_j$). In the evolution, ($\mu_{kj}$, $\eta_{kj}$) are calculated by Eqs. (33)–(34) and (38)–(39) as below:

$$\mu_{kj} = \mu_{kj} + C_1 \, rand \, (\mu Pbest_{kj} - \mu_{kj}) + C_2 \, rand \, (\mu Gbest_{kj} - \mu_{kj}), \quad (k = 1, \ldots, N, \; j = 1, \ldots, C), \tag{38}$$

$$\eta_{kj} = \eta_{kj} + C_1 \, rand \, (\eta Pbest_{kj} - \eta_{kj}) + C_2 \, rand \, (\eta Gbest_{kj} - \eta_{kj}), \quad (k = 1, \ldots, N, \; j = 1, \ldots, C), \tag{39}$$

where ($\mu Gbest$, $\eta Gbest$, $\xi Gbest$ and $VGbest$) are the best values of the swarm (Gbest). The centers for cluster $j$ are chosen by the procedure in Table 2. The evolution of all particles continues until a given number of iterations is reached. The final solution, comprising the most suitable cluster centers and membership matrices, is the one with the minimum fitness value. Details of the proposed algorithm are presented in Fig. 3 and Table 3.

Table 2. Choosing centers for clusters
Determining center for cluster $j$ ($V_j$)
1: $V_j = \emptyset$
2: Repeat
3: Find $x_i \notin V_j$, $x_i \in X$, such that $i = \arg\min_{h = 1, \ldots, N} \sum_{k=1}^{N} \left( \mu_{kj} (2 - \xi_{kj}) \right)^{m} \| x_k - x_h \|^2$
4: $V_j = V_j \cup x_i$
5: Until $\min_{h = 1, \ldots, N;\, x_h \notin V_j} \sum_{k=1}^{N} \left( \mu_{kj} (2 - \xi_{kj}) \right)^{m} \| x_k - x_h \|^2 > EPS$

3.4. Remarks

The proposed method has some advantages:

– The proposed method uses multiple centers for each cluster, so that a cluster whose data elements are scattered in a non-spherical, distinct structure can be easily represented by these centers.
– The proposed method, employing FC-PFS with the PSO strategy, can enhance the convergence process.
– The proposed method employs a new measurement for categorical attribute values that is appropriate for calculating the distance between two objects.

However, this method still has some limitations:

– The PSO algorithm may produce good solutions, but they are not guaranteed to be the best ones.
– The computational time of the PSO strategy is quite high. The complexity of the proposed algorithm is $O(NC^2 + N^2)$ for one particle and one loop, where $N$ is the number of elements in the dataset and $C$ is the number of clusters. Then, the complexity of the algorithm is $O(popsize \times (NC^2 + N^2) \times numSteps)$, where numSteps and popsize are the number of iterations and the number of particles, respectively. Because popsize and $C$ are always small, the complexity of the proposed algorithm is about $O(N^2 \times numSteps)$. If numSteps = 1, the complexity of the algorithm is only $O(N^2)$. In the worst case, numSteps = maxSteps, the computational time of the proposed algorithm may be the highest.

4. Experiments

4.1. Materials and system configuration

The following benchmark datasets of the UCI Machine Learning Repository (University of California, 2007) are used to validate the performance of the algorithms (Table 4). They are very
well-known, standard datasets for clustering and classification, consisting of seven datasets with different sizes, numbers of attributes and numbers of classes. The largest dataset is ABALONE, with 4177 elements and numerical attributes. The dataset with the most attributes is AUTOMOBILE, with 15 numerical and 10 categorical attributes. In the experiments, we do not normalize the datasets. The aim is to verify the quality of the clustering algorithms from small to large sizes and on mix data types (numerical and categorical attributes). In order to assess the quality, the number of classes in each dataset is used as the 'correct' number of clusters.

Table 4. Descriptions of experimental datasets

| Dataset | No. elements | No. numerical attributes | No. categorical attributes | No. classes |
|---|---|---|---|---|
| IRIS | 150 | | | |
| GLASS | 214 | | | |
| ABALONE | 4177 | | | |
| AUTOMOBILE | 159 | 15 | 10 | |
| SERVO | 167 | | | |
| STATLOG | 1000 | | | |

(Further entries of the table, column unclear: 0, 13, 6.)

The proposed algorithm, PFCA-CD, has been implemented, in addition to the DifFuzzy algorithm (Cominetti et al., 2010) and the Dissimilarity algorithm (De Carvalho et al., 2013), in the C programming language and executed on a Linux Cluster 1350 with eight computing nodes of 51.2 GFlops. Each node contains two Intel Xeon dual core 3.2 GHz processors and 2 GB RAM. The experimental results are taken as the average values after 50 runs.

Fig. 3. Schema of PFCA-CD.

Table 3. Picture fuzzy clustering algorithm for complex data
Picture fuzzy clustering algorithm for complex data
I: Data $X$ whose number of elements is $N$, in $r$ dimensions; number of clusters ($C$); threshold $\varepsilon$; fuzzifier $m$ and the maximal number of iterations maxSteps > 0
O: Matrices $\mu$, $\eta$, $\xi$ and centers $V$
PFCA-CD:
1: $t = 0$
2: $\mu_{kj}^{(t)} \leftarrow$ random; $\eta_{kj}^{(t)} \leftarrow$ random; $\xi_{kj}^{(t)} \leftarrow$ random ($k = 1, \ldots, N$; $j = 1, \ldots, C$), satisfying Eqs. (28)–(29)
3: Repeat
4: $t = t + 1$
5: For each particle $i$
6: Choose centers $V_j^{(t)}$ ($j = 1, \ldots, C$) as in Table 2
7: Calculate $\mu_{kj}^{(t)}$ ($k = 1, \ldots, N$; $j = 1, \ldots, C$) by Eq. (38)
8: Calculate $\eta_{kj}^{(t)}$ ($k = 1, \ldots, N$; $j = 1, \ldots, C$) by Eq. (39)
9: Calculate $\xi_{kj}^{(t)}$ ($k = 1, \ldots, N$; $j = 1, \ldots, C$) by Eq. (32)
10: Calculate the fitness value by Eq. (37)
11: Update the Pbest value
12: Update the Gbest value
13: End
14: Until Gbest is unchanged or maxSteps has been reached
15: Output ($\mu$, $\eta$, $\xi$, $V$) = ($\mu Gbest$, $\eta Gbest$, $\xi Gbest$, $VGbest$)

Cluster validity measurement: Mean Accuracy (MA), the Davies-Bouldin (DB) index (Davies and Bouldin, 1979), the Rand index (RI) and the Alternative Silhouette (ASWC) (Vendramin et al., 2010) are used to evaluate the quality of the solutions of the clustering algorithms. The DB index is

$$DB = \frac{1}{C} \sum_{i=1}^{C} \left( \max_{j: j \ne i} \left\{ \frac{S_i + S_j}{M_{ij}} \right\} \right), \tag{40}$$

$$S_i = \sqrt{ \frac{1}{T_i} \sum_{j=1}^{T_i} \| X_j - V_i \|^2 }, \quad (i = 1, \ldots, C), \tag{41}$$

$$M_{ij} = \| V_i - V_j \|, \quad (i, j = 1, \ldots, C, \; i \ne j), \tag{42}$$

where $T_i$ is the size of the $i$-th cluster, $S_i$ is a measure of scatter within the cluster, and $M_{ij}$ is a measure of separation between the $i$-th and $j$-th clusters. A smaller value indicates better performance for the DB index. The Rand index is defined as

$$RI = \frac{a + d}{a + b + c + d}, \tag{43}$$

where $a$ ($b$) is the number of pairs of data points belonging to the same class in $R$ and to the same (different) cluster in $Q$, with $R$ and $Q$ being two partitions of the data, and $c$ ($d$) is the number of pairs of data points belonging to different classes in $R$ and to the same (different) cluster in $Q$. The larger the Rand index, the better. The Alternative Silhouette (ASWC) is also used to measure the clustering quality:

$$ASWC = \frac{1}{N} \sum_{i=1}^{N} s_{x_i}, \tag{44}$$

$$s_{x_i} = \frac{b_{p,i}}{a_{p,i} + \varepsilon}, \quad (i = 1, \ldots, N), \tag{45}$$

where $a_{p,i}$ is the average distance of element $i$ to all other elements in its cluster $p$, and $b_{p,i}$ is the minimum, over the clusters $q \ne p$, of the average distance of element $i$ to all elements in cluster $q$. $\varepsilon$ is a small constant (e.g. a small power of ten for normalized data) used to avoid division by zero when $a_{p,i} = 0$. A larger value indicates better performance for the ASWC index.

Parameters setting: The parameter values fuzzifier $m = 2$, $\varepsilon = 10^{-3}$ and maxSteps = 1000 are used for all algorithms. For PFCA-CD in particular, we set $C_1 = C_2 = 1$, $\alpha = 0.6$ and $\varepsilon = 10^{-3}$ (Thong and Son, 2016).

Objectives: We aim to evaluate the clustering quality of the algorithms through validity indices. Some experiments with various parameter settings are also considered.

4.2. Results and discussions

Table 5 indicates the average validity index values of the algorithms. It can be seen that the proposed PFCA-CD algorithm has better clustering quality, based on the validity indices, than the others. In most cases, the proposed algorithm has at least one best value among the validity indices. For instance, on the AUTOMOBILE and SERVO datasets, the values for PFCA-CD are better in the MA, DB and RI indices than those of DifFuzzy and Dissimilarity. For the AUTOMOBILE dataset, PFCA-CD reaches 93.157 (MA), 5.319 (DB) and 69.458 (RI), compared to 33.333 (MA), 6.437 (DB) and 63.912 (RI) for DifFuzzy and 82.146 (MA), 8.721 (DB) and 64.371 (RI) for Dissimilarity.

Table 5. The average validity index values of algorithms (bold values mean the best one in each dataset and validity index)

| Dataset | Algorithm | MA | ASWC | DB | RI |
|---|---|---|---|---|---|
| IRIS | DifFuzzy | 66.667 | 1.569 | **2.707** | **81.960** |
| IRIS | Dissimilarity | 92.639 | 1.946 | 9.915 | 79.092 |
| IRIS | PFCA-CD | **92.667** | **1.971** | 11.217 | 76.599 |
| GLASS | DifFuzzy | 66.667 | 0.621 | 4.654 | **67.557** |
| GLASS | Dissimilarity | 86.888 | 1.082 | **4.239** | 57.303 |
| GLASS | PFCA-CD | **88.785** | **1.142** | 11.808 | 66.994 |
| ABALONE | DifFuzzy | – | – | – | – |
| ABALONE | Dissimilarity | 96.404 | **1.715** | **3.812** | **62.56** |
| ABALONE | PFCA-CD | **96.464** | 1.147 | 4.936 | 61.346 |
| AUTOMOBILE | DifFuzzy | 33.333 | 0.745 | 6.437 | 63.912 |
| AUTOMOBILE | Dissimilarity | 82.146 | **1.411** | 8.721 | 64.371 |
| AUTOMOBILE | PFCA-CD | **93.157** | 0.937 | **5.319** | **69.458** |
| SERVO | DifFuzzy | 20 | 0.699 | 5.356 | 65.356 |
| SERVO | Dissimilarity | 77.35 | **1.21** | 8.279 | 63.862 |
| SERVO | PFCA-CD | **94.538** | 1.035 | **4.667** | **66.607** |
| STATLOG | DifFuzzy | – | – | – | – |
| STATLOG | Dissimilarity | **100** | **1.076** | 2.868 | **54.006** |
| STATLOG | PFCA-CD | 89.3 | 1.02 | **2.7** | 50.512 |

Fig. 4 gives more detail on the MA and RI values of the algorithms over the different datasets. On most datasets, the proposed method has better values than DifFuzzy and Dissimilarity. Fig. 5 shows the ASWC and DB values of all algorithms on the different datasets. It can be seen that the proposed algorithm yields smaller DB values than the others on the STATLOG, SERVO and AUTOMOBILE datasets. For ASWC, the proposed algorithm is better on the IRIS and GLASS datasets, which are purely numeric. This suggests that ASWC may not be well suited to complex data. Table 6 shows the number of times each algorithm reaches the best values in Table 5. PFCA-CD ranks first with 12 best values, Dissimilarity ranks second with 9 and the remaining algorithm has 3 best values. The fluctuation of the validity index values is presented in Table 7. In Table 7, the std values of the validity indices for the DifFuzzy algorithm do not change from run to run because this algorithm does not employ a heuristic strategy. The std values of PFCA-CD change less than those of Dissimilarity in general. This means that the proposed method produces more stable solutions than the Dissimilarity method. Table 8 shows the computational time of all algorithms.

Fig. 4. The chart of MA and RI values of all algorithms with different datasets.

… Dissimilarity. The discrepancy is less than one second and, in particular, the std value of the proposed algorithm is much smaller than that of the others; this means that the proposed algorithm is as good as the others in runtime for the IRIS dataset. Only for the AUTOMOBILE dataset does the proposed algorithm take more time to run (7643.86 compared to 5214.669 for Dissimilarity). This indicates that the proposed algorithm is not effective on large, purely numerical datasets.

5. Conclusions

Fig. 5. The chart of ASWC and DB values of all algorithms with different datasets.

Table 6. Times to achieve best values of algorithms (bold values mean the best one)

| Algorithms | Times to achieve best value |
|---|---|
| DifFuzzy | 3 |
| Dissimilarity | 9 |
| PFCA-CD | **12** |

Table 7. The STD values for validity indices of algorithms
IRIS GLASS ABALONE AUTOMOBILE SERVO STATLOG
DifFuzzy Dissimilarity PFCA-CD DifFuzzy Dissimilarity PFCA-CD
DifFuzzy Dissimilarity PFCA-CD DifFuzzy Dissimilarity PFCA-CD DifFuzzy Dissimilarity PFCA-CD DifFuzzy Dissimilarity PFCA-CD MA ASWC DB RI 7.144 8.55 38.586 3.184 – 2.63 0.7 9.694 1.938 5.23 3.435 – 4.612 0.447 0.267 1.014 0.094 – 0.039 0.002 1.433 0.132 0.088 0.053 – 2.96E À 0.011 6.512 3.663 5.352 3.354 – 4.113 0.086 12.587 5.938 13.358 5.089 – 0.433 9.891 11.375 6.886 49.76 0.936 – 0.247 0.072 4.23 0.806 1.121 0.864 – 0.075 0.688 Table The computational time (with STD values) for algorithms in seconds IRIS GLASS ABALONE AUTOMOBILE SERVO STATLOG DifFuzzy Dissimilarity PFCA-CD 31.048 (1.369) 522.184 (35.528) – 149.553 (0.058) 16.975 (2.9E À 3) – 4.165 (3.626) 122.44 (121.87) 5214.669 (4457.619) 318.622 (71.871) 19.124 (3.439) 3082.443 (253.439) 4.743 (0.919) 17.577 (1.39) 7643.86 (844.934) 22.9 (9.017) 19.064 (6.279) 108.688 (5.991) accompanied with STD values The computational time of the proposed method is less than those of other algorithms in GLASS, AUTOMOBILE, SERVO and STATLOG datasets Only in AUTOMOBILE and IRIS datasets, the proposed algorithm is higher In IRIS dataset, the proposed algorithm need 4.743 s compared to 4.165 s of In this paper, we presented a novel picture fuzzy clustering algorithm for complex data (PFCA-CD) that enables to cluster mix numerical and categorical data with distinct structures PFCA-CD made uses of hybridization between Particle Swarm Optimization strategy to Picture Fuzzy Clustering where combined solutions consisting of equivalent clustering centers and membership matrices are packed in PSO The idea of each cluster can be shortly captured as more than one center can deal with complex structure of data where the shape of data is not sphere The use of a novel measurement for categorical attributes can cope with mix data also Thus, this process created both the most suitable solutions for the problem The experimental results on the benchmark datasets of UCI Machine Learning Repository indicated that in most cases the PFCA-CD 
algorithm not only produced solutions with better clustering quality but was also faster than the other algorithms. Further research could proceed in the following directions: (i) investigate a distributed version of PFCA-CD; (ii) consider semi-supervised situations for PFCA-CD; (iii) apply the algorithm to recommender systems and other problems.

Appendix

Source codes and the experimental datasets of this paper can be retrieved at this link: https://sourceforge.net/p/complexdata/code/ci/master/tree/

References

Atanassov, K.T., 1986. Intuitionistic fuzzy sets. Fuzzy Sets Syst. 20 (1), 87–96.
Bezdek, J.C., Ehrlich, R., Full, W., 1984. FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. 10 (2), 191–203.
Chen, L., Wang, S., Wang, K., Zhu, J., 2016. Soft subspace clustering of categorical data with probabilistic distance. Pattern Recognit. 51, 322–332.
Cominetti, O., Matzavinos, A., Samarasinghe, S., Kulasiri, D., Liu, S., Maini, P., Erban, R., 2010. DifFUZZY: a fuzzy clustering algorithm for complex datasets. Int. J. Comput. Intell. Bioinform. Syst. Biol. (4), 402–417.
Cuong, B.C., 2014. Picture fuzzy sets. J. Comput. Sci. Cybern. 30 (4), 409–416.
Davies, D.L., Bouldin, D.W., 1979. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227.
De Carvalho, F.D.A., Lechevallier, Y., De Melo, F.M., 2013. Relational partitioning fuzzy clustering algorithms based on multiple dissimilarity matrices. Fuzzy Sets Syst. 215, 1–28.
Eberhart, R.C., Kennedy, J., 1995. A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1, pp. 39–43.
Ferreira, M.R., de Carvalho, F.D., 2012. Kernel fuzzy clustering methods based on local adaptive distances. In: Proceedings of 2012 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8.
Hwang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. (3), 283–304.
Ji, J., Pang, W., Zhou, C., Han, X., Wang, Z., 2012. A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowl.-Based Syst. 30, 129–135.
Ji, J., Bai, T., Zhou, C., Ma, C., Wang, Z., 2013. An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing 120, 590–596.
Mendel, J.M., John, R.I.B., 2002. Type-2 fuzzy sets made simple. IEEE Trans. Fuzzy Syst. 10 (2), 117–127.
Son, L.H., 2014a. Enhancing clustering quality of geo-demographic analysis using context fuzzy clustering type-2 and particle swarm optimization. Appl. Soft Comput. 22, 566–584.
Son, L.H., 2014b. HU-FCF: a hybrid user-based fuzzy collaborative filtering method in recommender systems. Expert Syst. Appl. 41 (15), 6861–6870.
Son, L.H., 2015a. DPFCM: a novel distributed picture fuzzy clustering method on picture fuzzy sets. Expert Syst. Appl. 42 (1), 51–66.
Son, L.H., 2015b. A novel kernel fuzzy clustering algorithm for geo-demographic analysis. Inf. Sci. 317, 202–223.
Son, L.H., 2015c. HU-FCF++: a novel hybrid method for the new user cold-start problem in recommender systems. Eng. Appl. Artif. Intell. 41, 207–222.
Son, L.H., 2016. Dealing with the new user cold-start problem in recommender systems: a comparative review. Inf. Syst. 58, 87–104.
Son, L.H., Thong, N.T., 2015. Intuitionistic fuzzy recommender systems: an effective tool for medical diagnosis. Knowl.-Based Syst. 74, 133–150.
Son, L.H., Tuan, T.M., 2016. A cooperative semi-supervised fuzzy clustering framework for dental X-ray image segmentation. Expert Syst. Appl. 46, 380–393.
Son, L.H., Hai, P.V., 2016. A novel multiple fuzzy clustering method based on internal clustering validation measures with gradient descent. Int. J. Fuzzy Syst. http://dx.doi.org/10.1007/s40815-015-0117-1
Son, L.H., Cuong, B.C., Long, H.V., 2013. Spatial interaction – modification model and applications to geo-demographic analysis. Knowl.-Based Syst. 49, 152–170.
Son, L.H., Linh, N.D., Long, H.V., 2014. A lossless DEM compression for fast retrieval method using fuzzy clustering and MANFIS neural network. Eng. Appl. Artif. Intell. 29, 33–42.
Son, L.H., Cuong, B.C., Lanzi, P.L., Thong, N.T., 2012a. A novel intuitionistic fuzzy clustering method for geo-demographic analysis. Expert Syst. Appl. 39 (10), 9848–9859.
Son, L.H., Lanzi, P.L., Cuong, B.C., Hung, H.A., 2012b. Data mining in GIS: a novel context-based fuzzy geographically weighted clustering algorithm. Int. J. Mach. Learn. Comput. (3), 235–238.
Thong, P.H., Son, L.H., 2014. A new approach to multi-variables fuzzy forecasting using picture fuzzy clustering and picture fuzzy rules interpolation method. In: Proceeding of 6th International Conference on Knowledge and Systems Engineering, pp. 679–690.
Thong, N.T., Son, L.H., 2015. HIFCF: an effective hybrid model between picture fuzzy clustering and intuitionistic fuzzy recommender systems for medical diagnosis. Expert Syst. Appl. 42 (7), 3682–3701.
Thong, P.H., Son, L.H., Fujita, H., 2016. Interpolative picture fuzzy rules: a novel forecast method for weather nowcasting. In: Proceeding of the 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2016), pp. 86–93.
Thong, P.H., Son, L.H., 2016a. Picture fuzzy clustering: a new computational intelligence method. Soft Comput. 20 (9), 3549–3562.
Thong, P.H., Son, L.H., 2016b. An overview of semi-supervised fuzzy clustering algorithms. Int. J. Eng. Technol. (4), 301–306.
Tuan, T.M., Ngan, T.T., Son, L.H., 2016. A novel semi-supervised fuzzy clustering method based on interactive fuzzy satisficing for dental X-ray image segmentation. Appl. Intell. 45 (2), 402–428.
Tuan, T.M., Duc, N.T., Hai, P.V., Son, L.H., 2016. Dental diagnosis from X-Ray images using fuzzy rule-based systems. Int. J. Fuzzy Syst. Appl. (in press).
University of California, 2007. UCI Repository of Machine Learning Databases. 〈http://archive.ics.uci.edu/ml/〉
Vendramin, L., Campello, R.J., Hruschka, E.R., 2010. Relative clustering validity criteria: a comparative overview. Stat. Anal. Data Min. (4), 209–235.
Wijayanto, A.W., Purwarianti, A., Son, L.H., 2016. Fuzzy geographically weighted clustering using artificial bee colony: an efficient geo-demographic analysis algorithm and applications to the analysis of crime behavior in population. Appl. Intell. 44 (2), 377–398.
Yang, M.S., Hwang, P.Y., Chen, D.H., 2004. Fuzzy clustering algorithms for mixed feature variables. Fuzzy Sets Syst. 141 (2), 301–317.
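Supplementary sketch (validity indices). The Rand index and the ASWC index of Eqs. (44)–(45), used in the experiments above, can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `rand_index` and `aswc` are hypothetical, the Euclidean distance stands in for the paper's mixed numerical/categorical measure, the default `eps=1e-6` is an illustrative choice, and singleton clusters are assumed to contribute a score of 0. The Rand index is returned as a fraction in [0, 1], whereas the result tables report percentages.

```python
import math
from itertools import combinations


def rand_index(reference, clustering):
    """Fraction of point pairs on which the two partitions agree.

    A pair agrees when it is in the same class and the same cluster,
    or in different classes and different clusters.
    """
    if len(reference) != len(clustering):
        raise ValueError("label lists must have the same length")
    if len(reference) < 2:
        raise ValueError("at least two points are needed")
    agree = 0
    total = 0
    for i, j in combinations(range(len(reference)), 2):
        same_class = reference[i] == reference[j]
        same_cluster = clustering[i] == clustering[j]
        agree += same_class == same_cluster
        total += 1
    return agree / total


def aswc(points, labels, eps=1e-6):
    """ASWC = (1/N) * sum_i b_i / (a_i + eps), Euclidean distance assumed."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("ASWC needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [k for k in clusters[lab] if k != i]
        if not own:
            continue  # assumption: a singleton cluster contributes 0
        # a: average distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[k]) for k in own) / len(own)
        # b: smallest average distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[k]) for k in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += b / (a + eps)
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabeled
    print(aswc(pts, [0, 0, 1, 1]))
```

Because ASWC divides the between-cluster distance by the within-cluster distance, it is unbounded above, which is consistent with the ASWC values greater than 1 reported in the experiments.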
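Supplementary sketch (counting best values). The count of how often each algorithm attains the best validity index value can be re-derived from the average values reported in Section 4.2. The sketch below is an illustration, not part of the original study: the numbers are transcribed from the average validity index table in the order (DifFuzzy, Dissimilarity, PFCA-CD), `None` stands for a "–" entry, and it assumes that larger is better for MA, ASWC and RI and smaller is better for DB, as the discussion in Section 4.2 implies. The names `VALUES`, `HIGHER_IS_BETTER` and `count_best` are hypothetical.

```python
ALGORITHMS = ("DifFuzzy", "Dissimilarity", "PFCA-CD")

# Rows follow the dataset order IRIS, GLASS, ABALONE, AUTOMOBILE, SERVO, STATLOG.
VALUES = {
    "MA": [(66.667, 92.639, 92.667), (66.667, 86.888, 88.785),
           (None, 96.404, 96.464), (33.333, 82.146, 93.157),
           (20, 77.35, 94.538), (None, 100, 89.3)],
    "ASWC": [(1.569, 1.946, 1.971), (0.621, 1.082, 1.142),
             (None, 1.715, 1.147), (0.745, 1.411, 0.937),
             (0.699, 1.21, 1.035), (None, 1.076, 1.02)],
    "DB": [(2.707, 9.915, 11.217), (4.654, 4.239, 11.808),
           (None, 3.812, 4.936), (6.437, 8.721, 5.319),
           (5.356, 8.279, 4.667), (None, 2.868, 2.7)],
    "RI": [(81.960, 79.092, 76.599), (67.557, 57.303, 66.994),
           (None, 62.56, 61.346), (63.912, 64.371, 69.458),
           (65.356, 63.862, 66.607), (None, 54.006, 50.512)],
}

# Assumed direction of each index (taken from the discussion in the text).
HIGHER_IS_BETTER = {"MA": True, "ASWC": True, "DB": False, "RI": True}


def count_best(values=VALUES):
    """Count, per algorithm, the dataset/index cells where it is best."""
    wins = dict.fromkeys(ALGORITHMS, 0)
    for index, rows in values.items():
        pick = max if HIGHER_IS_BETTER[index] else min
        for row in rows:
            # Skip algorithms with no reported value for this dataset.
            scored = [(v, name) for v, name in zip(row, ALGORITHMS) if v is not None]
            _, best_name = pick(scored, key=lambda pair: pair[0])
            wins[best_name] += 1
    return wins


if __name__ == "__main__":
    print(count_best())  # {'DifFuzzy': 3, 'Dissimilarity': 9, 'PFCA-CD': 12}
```

Under these assumptions the tally over the 24 dataset/index cells reproduces the 12 best values reported for PFCA-CD, with the remaining 12 split between Dissimilarity and DifFuzzy.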