Expert Systems with Applications 42 (2015) 51–66

DPFCM: A novel distributed picture fuzzy clustering method on picture fuzzy sets

Le Hoang Son *
VNU University of Science, Vietnam National University, Viet Nam

Article history: Available online 26 July 2014.
Keywords: Clustering quality; Distributed clustering; Facilitator model; Fuzzy clustering; Picture fuzzy sets.

Abstract

Fuzzy clustering is considered an important tool in pattern recognition and knowledge discovery from databases, and has therefore been applied broadly to various practical problems. Recent advances in data organization and processing, such as cloud computing technology, which are suitable for the management, privacy and storage of big datasets, have brought a significant breakthrough to the information sciences and to the enhancement of the efficiency of fuzzy clustering. Distributed fuzzy clustering is an efficient mining technique that adapts traditional fuzzy clustering to a new storage behavior where parts of the dataset are stored in different sites instead of a centralized main site. Several distributed fuzzy clustering algorithms have been presented, including the most effective one, the CDFCM of Zhou et al. (2013). Based upon the observation that the communication cost and the quality of results in CDFCM could be improved by integrating distributed picture fuzzy clustering with the facilitator model, in this paper we present a novel distributed picture fuzzy clustering method on picture fuzzy sets, called DPFCM. Experimental results on various datasets show that the clustering quality of DPFCM is better than those of CDFCM and relevant algorithms.

© 2014 Elsevier Ltd. All rights reserved.

* Official address: 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam. Tel.: +84 904171284; fax: +84 0438623938. E-mail addresses: sonlh@vnu.edu.vn, chinhson2002@gmail.com
http://dx.doi.org/10.1016/j.eswa.2014.07.026
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Fuzzy clustering is considered an important tool in pattern recognition and knowledge discovery from databases, and has therefore been applied broadly to various practical problems. The first fuzzy clustering algorithm is Fuzzy C-Means (FCM), proposed by Bezdek (1984). It is an iterative algorithm that modifies the centers and the partition matrix in each step in order to satisfy a given objective function. Bezdek proved that FCM converges to the saddle points of the objective function. Even though FCM was proposed a long time ago, it is still a popular fuzzy clustering algorithm that is applied to many practical problems for rule extraction and the discovery of implicit patterns wherever fuzziness exists, such as:

- Image segmentation (Ahmed, Yamany, Mohamed, Farag, & Moriarty, 2002; Cao, Deng, & Wang, 2012; Chen, Chen, & Lu, 2011; Chuang, Tzeng, Chen, Wu, & Chen, 2006; Krinidis & Chatzis, 2010; Li, Chui, Chang, & Ong, 2011; Ma & Staunton, 2007; Pham, Xu, & Prince, 2000; Siang Tan & Mat Isa, 2011; Zhang & Chen, 2004);
- Face recognition (Agarwal, Agrawal, Jain, & Kumar, 2010; Chen & Huang, 2003; Haddadnia, Faez, & Ahmadi, 2003; Lu, Yuan, & Yahagi, 2006, 2007);
- Gesture recognition (Li, 2003; Wachs, Stern, & Edan, 2003);
- Intrusion detection (Chimphlee, Abdullah, Noor Md Sap, Chimphlee, & Srinoy, 2005; Chimphlee, Abdullah, Noor Md Sap, Srinoy, & Chimphlee, 2006; Shah, Undercoffer, & Joshi, 2003; Wang, Hao, Ma, & Huang, 2010);
- Hot-spot spatial analysis (Di Martino, Loia, & Sessa, 2008);
- Risk analysis (Li, Li, & Kang, 2011);
- Bankruptcy prediction (Martin, Gayathri, Saranya, Gayathri, & Venkatesan, 2011);
- Geo-demographic analysis (Cuong, Son, & Chau, 2010; Son, 2014a, 2014b; Son, Cuong, Lanzi, & Thong, 2012, 2013, 2014; Son, Lanzi, Cuong, & Hung, 2012);
- Fuzzy time series forecasting and commercial systems (Bai, Dhavale, & Sarkis, 2014; Chu, Liau, Lin, & Su, 2012; Egrioglu, Aladag, & Yolcu, 2013; Egrioglu, 2011; Hadavandi, Shavandi, & Ghanbari, 2011; Izakian & Abraham, 2011; Roh, Pedrycz, & Ahn, 2014; Wang, Ma, Lao, & Wang, 2014; Zhang, Huang, Ji, & Xie, 2011).
Recent advances in data organization and processing, such as cloud computing technology, which are suitable for the management, privacy and storage of big datasets, have brought a significant breakthrough to the information sciences in general and to the enhancement of the efficiency of FCM in particular. For example, cloud computing is an Internet-based storage solution where ubiquitous computing resources are set up with the same configuration in order to develop and run applications as if they were constructed in a single centralized system. Users do not need to know where and how the computing resources operate, so maintenance and running costs can be reduced, thus guaranteeing the stable expansion of applications. In the cloud computing paradigm, data mining techniques, especially fuzzy clustering, are very much needed in order to retrieve meaningful information from the virtually integrated data warehouse. Petre (2012) and Geng and Yang (2013) stated that using data mining through cloud computing reduces the barriers that keep users from benefiting from data mining instruments, so that they pay only for the data mining tools without handling complex hardware and data infrastructures. Examples of deploying data mining and clustering algorithms in typical cloud computing service providers such as Amazon cloud, Google Apps, Microsoft, Salesforce and IBM can be found in energy-aware consolidation (Srikantaiah, Kansal, & Zhao, 2008), education (Ercan, 2010), workflow scheduling (Pandey, Wu, Guru, & Buyya, 2010) and others (Surcel & Alecu, 2008). Such algorithms are called distributed mining techniques.

Distributed fuzzy clustering is a distributed mining technique that adapts traditional fuzzy clustering to a new storage behavior where parts of the dataset are stored in different sites instead of a centralized main site. Distributed fuzzy clustering is extended from the distributed hard clustering algorithms. Several efforts on distributed hard/fuzzy clustering can be named, to list but a few. Lu, Gu, and Grossman (2010) presented a micro-cluster distributed clustering algorithm called dSimpleGraph, based on the relation between two micro-clusters, to classify data on the local machines and generate a determined global view from local views. Xie, Bai, and Lang (2010) aimed to accelerate the clustering method of Support Vector Machine for large-scale datasets and presented a distributed clustering method inspired by the Multi-Agent framework, in which data are divided among different agents and the global clustering result can be generalized from the agents. Gehweiler and Meyerhenke (2010) presented a distributed heuristic using only limited local knowledge for clustering static and dynamic graphs. Kwon et al. (2010) proposed a scalable, parallel algorithm for data clustering based on the MapReduce framework.
Karjee and Jamadagni (2011) constructed a distributed clustering algorithm based upon spatial data correlation among sensor nodes and performed data accuracy for each distributed cluster at its respective cluster head node. Le Khac and Kechadi (2011) proposed a distributed density-based clustering that both reduces the communication overheads and improves the quality of the global models by considering the shapes of local clusters. Ghanem, Kechadi, and Tari (2011) introduced a distributed clustering algorithm based on the aggregation of models produced locally; that is, datasets were processed locally on each node and the results were integrated to construct global clusters hierarchically. The aim of this approach is to minimize the communications, maximize the parallelism, balance the load among the different nodes of the system, and reduce the overhead due to extra processing while executing the hierarchical clustering. Gilhotra and Trikha (2012) presented a cohesive framework for cluster identification and outlier detection for distributed data, based on the idea of generating independent local models and combining them at a central server to obtain global clusters with the support of feedback loops. Bui, Kudireti, and Sohier (2012) presented a distributed random-walk-based clustering algorithm that builds a bounded-size core through a random-walk-based procedure. Branting (2013) presented a Distributed Pivot Clustering algorithm that takes only the distance function, which satisfies the triangle inequality and is of sufficiently high granularity to permit the data to be partitioned into canopies of optimal size based on distance to reference elements or pivots. Balcan, Ehrlich, and Liang (2013) provided two distributed clustering algorithms based on k-means and k-median; the basic idea is to reduce the problem of finding a clustering with low cost to the problem of finding a core-set of small size, then construct a global core-set. Hai, Zhang, Zhu, and Wang (2012), Jain and Maheswari (2013) and Singh and Gosain (2013) surveyed distributed clustering methods, including partitioning, hierarchical, density-based, soft-computing, neural network and fuzzy clustering methods. They argued that datasets in real-world applications often contain inconsistencies or outliers, where it is difficult to obtain homogeneous and meaningful global clusters, so distributed hard clustering should be incorporated with fuzzy set theory in order to handle the hesitancy originating from imperfect and imprecise information.

A parallel version of the FCM algorithm, called PFCM, aiming at distributed fuzzy clustering, was proposed by Rahimi, Zargham, Thakre, and Chhillar (2004). Vendramin, Campello, Coletta, and Hruschka (2011) modified the PFCM algorithm with a pre-processing procedure to estimate the number of clusters and also presented a consensus-based algorithm for distributed fuzzy clustering. Coletta, Vendramin, Hruschka, Campello, and Pedrycz (2012) gave a distributed version of PFCM, known as the PFCM–c* algorithm, which automatically calculates the number of clusters. Visalakshi, Thangavel, and Parvathi (2010) introduced an intuitionistic fuzzy based distributed clustering algorithm including two different levels: the local level and the global level. At the local level, numerical datasets are converted into intuitionistic fuzzy data and clustered independently from each other using a modified FCM algorithm.
At the global level, a global center is computed by clustering all local cluster centers. The global center is then transmitted back to the local sites to update the local cluster models. The communication model used in Visalakshi et al. (2010) is the facilitator, or Master–Slave, model. A distributed fuzzy clustering named CDFCM, working on the peer-to-peer (P2P) model, was proposed by Zhou, Chen, Chen, and Li (2013). In this algorithm, the cluster centers and attribute-weights are calculated at each peer and then updated by neighboring results through local communications. The process is repeated until a pre-defined stopping criterion holds, and the status quo of clusters in all peers accurately reflects the results of centralized clustering. CDFCM was experimentally validated and had better clustering quality than other relevant algorithms such as FCM (Bezdek, 1984), PFCM (Rahimi et al., 2004), Soft-DKM (Forero, Cano, & Giannakis, 2011) and WEFCM (Zhou & Philip Chen, 2011). It was considered one of the most effective distributed fuzzy clustering algorithms available in the literature.

The motivation of this paper is described as follows. In its operation, CDFCM solely updates the cluster centers and attribute-weights of each peer by those of neighboring peers. This requires large communication costs, approximately P x NB communications per iteration, with P being the number of peers and NB being the average number of neighbors of a given peer. Additionally, the quality of results in each peer may not be high, since only local updates with neighboring results are conducted. Based upon the idea that the communication cost and the quality of results in CDFCM could be improved by integrating distributed picture fuzzy clustering with the facilitator model, in this paper we present a novel distributed picture fuzzy clustering method on picture fuzzy sets, called DPFCM. The proposed algorithm utilizes the facilitator model, meaning that all peers transfer their results to a special, unique peer called the Master peer, so that only P communications are needed to complete the update process. Employing the Master peer in the facilitator model also helps increase the number of neighboring results that can be used in each update, thus advancing the quality of results.
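To make the communication argument concrete, the following toy calculation (our own illustration, not taken from the paper; the peer and neighbor counts are hypothetical) contrasts the per-iteration message counts of the two schemes:

```python
# Per-iteration message counts of the two communication schemes.
P = 10   # hypothetical number of peers
NB = 4   # hypothetical average number of neighbors per peer

p2p_messages = P * NB      # CDFCM-style peer-to-peer exchanges: 40
facilitator_messages = P   # DPFCM-style Slave-to-Master uploads: 10

print(p2p_messages, facilitator_messages)  # 40 10
```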
In order to enhance the clustering quality as much as possible, we also deploy the distributed fuzzy clustering algorithm on picture fuzzy sets (PFS) (Cuong & Kreinovich, 2013), which in essence are a generalization of the traditional fuzzy sets (FS) (Zadeh, 1965) and intuitionistic fuzzy sets (IFS) (Atanassov, 1986) used for the development of the existing CDFCM algorithm. PFS-based models can be applied to situations requiring human opinions involving answers of the types yes, abstain, no and refusal, which cannot be accurately expressed in traditional FS. Therefore, deploying the distributed clustering algorithm on PFS could give higher clustering quality than on FS and on IFS.

Our contribution in this paper is a novel distributed picture fuzzy clustering method (DPFCM) that utilizes the ideas of both the facilitator model and the deployment of clustering algorithms on PFS in order to improve the clustering quality. The proposed algorithm is implemented and validated in comparison with CDFCM and other relevant algorithms in terms of clustering quality. The significance of this research is not only the enhancement of the clustering quality of distributed fuzzy clustering algorithms but also the enrichment of the know-how of integrating picture fuzzy sets into clustering algorithms and deploying them in practical applications. Indeed, the contribution of this paper is meaningful to both the theoretical and the practical sides.

The rest of the paper is organized as follows. Section 2 gives the preliminaries about the PFS set. The formulation of clustering algorithms on PFS in association with the facilitator model is described in Section 3. Section 4 validates the proposed approach through a set of experiments involving benchmark data. Finally, Section 5 draws the conclusions and delineates future research directions.

2. Preliminary

In this section, we take a brief overview of some basic terms and notations in PFS, which are used throughout the paper.

Definition 1. A picture fuzzy set (PFS) (Cuong & Kreinovich, 2013) in a non-empty set X is

$$\dot{A} = \{\langle x, \mu_{\dot{A}}(x), \eta_{\dot{A}}(x), \gamma_{\dot{A}}(x)\rangle \mid x \in X\},\tag{1}$$

where $\mu_{\dot{A}}(x)$ is the positive degree of each element $x \in X$, $\eta_{\dot{A}}(x)$ is the neutral degree and $\gamma_{\dot{A}}(x)$ is the negative degree, satisfying the constraints

$$\mu_{\dot{A}}(x), \eta_{\dot{A}}(x), \gamma_{\dot{A}}(x) \in [0,1], \quad \forall x \in X,\tag{2}$$

$$\mu_{\dot{A}}(x) + \eta_{\dot{A}}(x) + \gamma_{\dot{A}}(x) \le 1, \quad \forall x \in X.\tag{3}$$

The refusal degree of an element is calculated as $\xi_{\dot{A}}(x) = 1 - (\mu_{\dot{A}}(x) + \eta_{\dot{A}}(x) + \gamma_{\dot{A}}(x)), \forall x \in X$. In case $\xi_{\dot{A}}(x) = 0$, PFS returns to intuitionistic fuzzy sets (IFS) (Atanassov, 1986), and when both $\eta_{\dot{A}}(x) = \xi_{\dot{A}}(x) = 0$, PFS returns to fuzzy sets (FS) (Zadeh, 1965). In order to illustrate the applications of PFS, let us consider some examples below.

Example 1. In a democratic election station, the council issues 500 voting papers for a candidate. The voting results are divided into four groups accompanied by the number of papers: "vote for" (300), "abstain" (64), "vote against" (115) and "refusal of voting" (21). Group "abstain" means that the voting paper is a white paper rejecting both "agree" and "disagree" for the candidate but still takes the vote. Group "refusal of voting" covers either invalid voting papers or abstention from the vote itself. This example happened in reality, and IFS could not handle it since the refusal degree (group "refusal of voting") does not exist there.

Example 2. A patient is given first emergency aid and diagnosed by four states after examining possible symptoms: "heart attack", "uncertain", "not heart attack", "appendicitis". In this case, we also have a PFS set.
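The constraints (2) and (3) and the refusal degree translate directly into a small data structure. The following Python sketch is our own illustration (the class name is ours, and the numeric values are Example 1 rewritten as fractions of the 500 papers); it assumes nothing beyond Definition 1:

```python
from dataclasses import dataclass

@dataclass
class PictureFuzzyValue:
    """Degrees of one element of a picture fuzzy set (Definition 1)."""
    mu: float     # positive degree
    eta: float    # neutral degree
    gamma: float  # negative degree

    def __post_init__(self):
        if not all(0.0 <= d <= 1.0 for d in (self.mu, self.eta, self.gamma)):
            raise ValueError("each degree must lie in [0, 1]")      # constraint (2)
        if self.mu + self.eta + self.gamma > 1.0:
            raise ValueError("mu + eta + gamma must not exceed 1")  # constraint (3)

    @property
    def refusal(self) -> float:
        """xi(x) = 1 - (mu(x) + eta(x) + gamma(x))."""
        return 1.0 - (self.mu + self.eta + self.gamma)

# Example 1: 300/500 "vote for", 64/500 "abstain", 115/500 "vote against";
# the refusal degree recovers the 21/500 "refusal of voting" papers.
vote = PictureFuzzyValue(mu=0.600, eta=0.128, gamma=0.230)
print(round(vote.refusal, 3))  # 0.042 = 21/500
```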
Now, we briefly present some basic picture fuzzy operations, picture distance metrics and picture fuzzy relations. Let PFS(X) denote the set of all PFS sets on the universe X.

Definition 2. For $A, B \in \mathrm{PFS}(X)$, the union, intersection and complement operations are defined as follows:

$$A \cup B = \{\langle x, \max\{\mu_A(x), \mu_B(x)\}, \min\{\eta_A(x), \eta_B(x)\}, \min\{\gamma_A(x), \gamma_B(x)\}\rangle \mid x \in X\},\tag{4}$$

$$A \cap B = \{\langle x, \min\{\mu_A(x), \mu_B(x)\}, \min\{\eta_A(x), \eta_B(x)\}, \max\{\gamma_A(x), \gamma_B(x)\}\rangle \mid x \in X\},\tag{5}$$

$$\bar{A} = \{\langle x, \gamma_A(x), \eta_A(x), \mu_A(x)\rangle \mid x \in X\}.\tag{6}$$

Definition 3. For $A, B \in \mathrm{PFS}(X)$, the Cartesian products of these PFS sets are

$$A \times_1 B = \{\langle (x,y), \mu_A(x)\cdot\mu_B(y), \eta_A(x)\cdot\eta_B(y), \gamma_A(x)\cdot\gamma_B(y)\rangle \mid x \in A,\ y \in B\},\tag{7}$$

$$A \times_2 B = \{\langle (x,y), \mu_A(x)\wedge\mu_B(y), \eta_A(x)\wedge\eta_B(y), \gamma_A(x)\vee\gamma_B(y)\rangle \mid x \in A,\ y \in B\}.\tag{8}$$

Definition 4. The distances between $A, B \in \mathrm{PFS}(X)$ are the normalized Hamming distance and the normalized Euclidean distance in Eqs. (9) and (10), respectively:

$$d_p(A,B) = \frac{1}{N}\sum_{i=1}^{N}\big(|\mu_A(x_i)-\mu_B(x_i)| + |\eta_A(x_i)-\eta_B(x_i)| + |\gamma_A(x_i)-\gamma_B(x_i)|\big),\tag{9}$$

$$e_p(A,B) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big((\mu_A(x_i)-\mu_B(x_i))^2 + (\eta_A(x_i)-\eta_B(x_i))^2 + (\gamma_A(x_i)-\gamma_B(x_i))^2\big)}.\tag{10}$$

Definition 5. The picture fuzzy relation R is a picture fuzzy subset of $A \times B$, given by

$$R = \{\langle (x,y), \mu_R(x,y), \eta_R(x,y), \gamma_R(x,y)\rangle \mid x \in A,\ y \in B\},\tag{11}$$

$$\mu_R, \eta_R, \gamma_R : A \times B \to [0,1],\tag{12}$$

$$\mu_R(x,y) + \eta_R(x,y) + \gamma_R(x,y) \le 1, \quad \forall (x,y) \in A \times B.\tag{13}$$

PFR(A x B) is the set of all picture fuzzy subsets on A x B. Some properties of PFS operations, the convex combination of PFS, etc., accompanied by proofs, can be found in Cuong and Kreinovich (2013).
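For readers who prefer running code to set notation, the sketch below is our transcription of Eqs. (4)–(6), (9) and (10); the array layout (a PFS over N elements as an (N, 3) NumPy array with columns mu, eta, gamma) and the function names are our own choices:

```python
import numpy as np

def pfs_union(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Union by Eq. (4): max positive, min neutral, min negative degrees."""
    return np.column_stack((np.maximum(a[:, 0], b[:, 0]),
                            np.minimum(a[:, 1], b[:, 1]),
                            np.minimum(a[:, 2], b[:, 2])))

def pfs_intersection(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Intersection by Eq. (5): min positive, min neutral, max negative."""
    return np.column_stack((np.minimum(a[:, 0], b[:, 0]),
                            np.minimum(a[:, 1], b[:, 1]),
                            np.maximum(a[:, 2], b[:, 2])))

def pfs_complement(a: np.ndarray) -> np.ndarray:
    """Complement by Eq. (6): swap the positive and negative degrees."""
    return a[:, [2, 1, 0]]

def hamming_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized Hamming distance, Eq. (9)."""
    return float(np.abs(a - b).sum(axis=1).mean())

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized Euclidean distance, Eq. (10)."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))
```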
3. The proposed method

3.1. The proposed distributed picture fuzzy clustering model

In this section, we propose a distributed picture fuzzy clustering model. The communication model is the facilitator, or Master–Slave, model having a Master peer and P Slave peers, and each Slave peer is allowed to communicate with the Master only. Each Slave peer has a subset of the original dataset X consisting of N data points in r dimensions. We call the subsets $Y_j$ $(j=\overline{1,P})$, with $\bigcup_{j=1}^{P} Y_j = X$ and $\sum_{j=1}^{P} |Y_j| = N$. The number of dimensions in a subset is exactly the same as that in the original dataset. Let us divide the dataset X into C groups satisfying the objective function below:

$$J = \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\sum_{j=1}^{C}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\sum_{h=1}^{r} w_{ljh}\,\|X_{lkh}-V_{ljh}\|^{2} + \gamma\sum_{l=1}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r} w_{ljh}\log w_{ljh} \to \min,\tag{14}$$

where $u_{lkj}$, $\eta_{lkj}$ and $\xi_{lkj}$ are the positive, the neutral and the refusal degrees of data point k to cluster j in Slave peer l. This reflects the clustering in the PFS set expressed through Definition 1. $w_{ljh}$ is the attribute-weight of attribute h to cluster j in Slave peer l. $V_{ljh}$ is the center of cluster j in Slave peer l according to attribute h. $X_{lkh}$ is the kth data point of Slave peer l according to attribute h. m and $\gamma$ are the fuzzifier and a positive scalar, respectively. The constraints for (14) are shown below:

$$u_{lkj}, \eta_{lkj}, \xi_{lkj} \in [0,1],\tag{15}$$

$$u_{lkj} + \eta_{lkj} + \xi_{lkj} \le 1,\tag{16}$$

$$\sum_{j=1}^{C}\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}} = 1,\tag{17}$$

$$\sum_{j=1}^{C}\left(\eta_{lkj} + \frac{\xi_{lkj}}{C}\right) = 1,\tag{18}$$

$$\sum_{h=1}^{r} w_{ljh} = 1,\tag{19}$$

$$V_{ljh} = V_{ijh}, \quad (\forall i \ne l;\ i,l=\overline{1,P}),\tag{20}$$

$$w_{ljh} = w_{ijh}, \quad (\forall i \ne l;\ i,l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}).\tag{21}$$

The proposed model in Eqs. (14)–(21) relies on the principles of the PFS set and the facilitator model. The differences between this model and the CDFCM model of Zhou et al. (2013) are expressed below.

- The proposed model is a generalization of the CDFCM model: when $\eta_{lkj} = \xi_{lkj} = 0$, which means the PFS set degrades to the FS set, it returns to the CDFCM model in both the objective function and the constraints. In other words, a new membership-like function $u_{lkj}/(1-\eta_{lkj}-\xi_{lkj})$ is appended to the objective function instead of $u_{lkj}$ in CDFCM. Moreover, the constraints (15)–(18), which describe the relations between the degrees in the PFS set, are integrated into the optimization problem. By doing so, the new distributed picture fuzzy clustering model is totally set up according to the PFS set.

- The proposed model utilizes the facilitator model to increase the number of neighboring results used to update a given peer, thus giving high accuracy of the final results. This is reflected in the constraints (20) and (21), where the cluster centers and the attribute-weights of any two peers must coincide, so that these local centers and attribute-weights converge to the global ones.

Additional remarks on the distributed picture fuzzy clustering model (14)–(21) are:

- The objective function in (14) both minimizes the dispersion within clusters and maximizes the entropy of attribute-weights, allowing important attributes to contribute greatly to the identification of clusters.

- The constraints (15) and (16) originate from the definition of PFS. Constraint (17) describes that the sum of memberships of a data point to all clusters in a Slave peer is equal to one. Analogously, constraint (18) states that the sum of hesitant memberships of a data point to all clusters in a Slave peer, expressed through the neutral and refusal degrees, is also equal to one.

- Constraint (19) forces the sum of attribute-weights for a given cluster in a peer to be equal to one. Thus, all attributes can be normalized for the clustering.

Outputs of the distributed picture fuzzy clustering model (14)–(21) are the optimal cluster centers $\{V_{ljh} \mid l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}\}$, the picture degrees $\{(u_{lkj}, \eta_{lkj}, \xi_{lkj}) \mid l=\overline{1,P};\ k=\overline{1,Y_l};\ j=\overline{1,C}\}$ in all peers, showing which cluster a data point belongs to, and the attribute-weights $\{w_{ljh} \mid l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}\}$. Based upon these results, the state of clusters in a given peer is determined, and the global results can be retrieved from the local ones according to a specific cluster.
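Before deriving the update rules, it may help to see the objective itself in executable form. The sketch below is our own rendering of Eq. (14) restricted to a single Slave peer l (variable names follow the paper's symbols; the function name is ours):

```python
import numpy as np

def objective_peer(X, U, Eta, Xi, W, V, m=2.0, gamma=1.0):
    """Peer-l contribution to Eq. (14).

    X: (Yl, r) data; U, Eta, Xi: (Yl, C) positive/neutral/refusal degrees;
    W: (C, r) attribute-weights; V: (C, r) cluster centers. Summing the
    returned value over all P peers gives the full objective J.
    """
    coef = (U / (1.0 - Eta - Xi)) ** m              # membership-like term
    J = 0.0
    for j in range(V.shape[0]):
        sq = ((X - V[j]) ** 2 * W[j]).sum(axis=1)   # weighted squared distances
        J += float(coef[:, j] @ sq)                 # dispersion term of (14)
    J += gamma * float((W * np.log(W)).sum())       # entropy term of (14)
    return J
```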
3.2. The solutions

In this section, we use the Lagrangian method and the Picard iteration to determine the optimal solutions of the model (14)–(21), as follows.

Theorem 1. The optimal solutions of the system (14)–(21) are:

$$u_{lkj} = \frac{1-\eta_{lkj}-\xi_{lkj}}{\sum_{i=1}^{C}\left(\frac{\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2}}{\sum_{h=1}^{r} w_{lih}\|X_{lkh}-V_{lih}\|^{2}}\right)^{\frac{1}{m-1}}}, \quad (\forall l=\overline{1,P};\ k=\overline{1,Y_l};\ j=\overline{1,C}),\tag{22}$$

$$h_{lijh} = h_{lijh} + \alpha_{1}\,(V_{ljh}-V_{ijh}), \quad (\forall i\ne l;\ i,l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}),\tag{23}$$

$$V_{ljh} = \frac{\sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m} w_{ljh}\,X_{lkh} - \sum_{i=1, i\ne l}^{P} h_{lijh}}{w_{ljh}\sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}}, \quad (\forall l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}),\tag{24}$$

$$\Delta_{lijh} = \Delta_{lijh} + \alpha_{2}\,(w_{ljh}-w_{ijh}), \quad (\forall i\ne l;\ i,l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}),\tag{25}$$

$$w_{ljh} = \frac{\exp\left(-\frac{1}{\gamma}\left[\sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\|X_{lkh}-V_{ljh}\|^{2} + \gamma + 2\sum_{i=1,i\ne l}^{P}\Delta_{lijh}\right]\right)}{\sum_{h'=1}^{r}\exp\left(-\frac{1}{\gamma}\left[\sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\|X_{lkh'}-V_{ljh'}\|^{2} + \gamma + 2\sum_{i=1,i\ne l}^{P}\Delta_{lijh'}\right]\right)}, \quad (\forall l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}),\tag{26}$$

$$\eta_{lkj} = 1 - \xi_{lkj} - \frac{(C-1)\left(1-\frac{1}{C}\sum_{i=1}^{C}\xi_{lki}\right)\left(u_{lkj}^{m}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2}\right)^{\frac{1}{m+1}}}{\sum_{i=1}^{C}\left(u_{lki}^{m}\sum_{h=1}^{r} w_{lih}\|X_{lkh}-V_{lih}\|^{2}\right)^{\frac{1}{m+1}}}, \quad (\forall l=\overline{1,P};\ k=\overline{1,Y_l};\ j=\overline{1,C}),\tag{27}$$

$$\xi_{lkj} = 1 - (u_{lkj}+\eta_{lkj}) - \left(1-(u_{lkj}+\eta_{lkj})^{\alpha}\right)^{1/\alpha}, \quad (\forall l=\overline{1,P};\ k=\overline{1,Y_l};\ j=\overline{1,C}).\tag{28}$$
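A quick numeric sanity check (our own code, with hypothetical random inputs) confirms that the positive degrees produced by Eq. (22) restore constraint (17) for every data point:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2.0
dist = rng.uniform(0.1, 2.0, size=(5, 3))   # D_lkj = sum_h w_ljh ||X_lkh - V_ljh||^2
Eta = rng.uniform(0.0, 0.2, size=(5, 3))    # neutral degrees
Xi = rng.uniform(0.0, 0.2, size=(5, 3))     # refusal degrees

# Eq. (22): u_lkj = (1 - eta - xi) / sum_i (D_lkj / D_lki)^(1/(m-1))
ratio = (dist[:, :, None] / dist[:, None, :]) ** (1.0 / (m - 1.0))
U = (1.0 - Eta - Xi) / ratio.sum(axis=2)

# Constraint (17): sum_j u_lkj / (1 - eta_lkj - xi_lkj) = 1 for every point.
print(np.allclose((U / (1.0 - Eta - Xi)).sum(axis=1), 1.0))  # True
```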
Proof. (A) Fix W, V, η, ξ. The Lagrangian function with respect to U is

$$L(U) = \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\sum_{j=1}^{C}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2} + \gamma\sum_{l=1}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r} w_{ljh}\log w_{ljh} - \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\kappa_{lk}\left(\sum_{j=1}^{C}\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}} - 1\right).\tag{29}$$

Setting the derivative to zero,

$$\frac{\partial L(U)}{\partial u_{lkj}} = \frac{m}{1-\eta_{lkj}-\xi_{lkj}}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m-1}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2} - \frac{\kappa_{lk}}{1-\eta_{lkj}-\xi_{lkj}} = 0,\tag{30}$$

$$u_{lkj} = (1-\eta_{lkj}-\xi_{lkj})\left(\frac{\kappa_{lk}}{m\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2}}\right)^{\frac{1}{m-1}}.\tag{31}$$

From constraint (17), we have

$$\kappa_{lk} = m\left(\sum_{i=1}^{C}\left(\frac{1}{\sum_{h=1}^{r} w_{lih}\|X_{lkh}-V_{lih}\|^{2}}\right)^{\frac{1}{m-1}}\right)^{-(m-1)}.\tag{32}$$

Substituting (32) into (31), we obtain the optimal solution for $u_{lkj}$ in Eq. (33), which coincides with Eq. (22).

(B) We fix all degrees and the attribute-weights to calculate the cluster centers by the Lagrangian function below:

$$L(V) = \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\sum_{j=1}^{C}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2} + \gamma\sum_{l=1}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r} w_{ljh}\log w_{ljh} + \sum_{l=1}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r}\sum_{i=1,i\ne l}^{P} h_{lijh}(V_{ljh}-V_{ijh}),\tag{34}$$

where $h_{lijh}$ is a Lagrangian multiplier matrix. Taking the derivative of L(V) with respect to $V_{ljh}$,

$$\frac{\partial L(V)}{\partial V_{ljh}} = -\sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m} w_{ljh}(X_{lkh}-V_{ljh}) + \sum_{i=1,i\ne l}^{P} h_{lijh} - \sum_{i=1,i\ne l}^{P} h_{iljh} = 0,\tag{35}$$

which yields Eq. (36), identical to Eq. (24). $h_{lijh}$ is calculated by a Picard iteration with $\alpha_{1}$ a positive scalar:

$$h_{lijh} = h_{lijh} + \alpha_{1}(V_{ljh}-V_{ijh}), \quad (\forall i\ne l;\ i,l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}).\tag{37}$$

(C) By a calculation similar to (B), we take the Lagrangian function with respect to W:

$$L(W) = \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\sum_{j=1}^{C}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2} + \gamma\sum_{l=1}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r} w_{ljh}\log w_{ljh} - \sum_{l=1}^{P}\sum_{j=1}^{C}\kappa_{lj}\left(\sum_{h=1}^{r} w_{ljh}-1\right) + \sum_{l=1}^{P}\sum_{i=1,i\ne l}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r}\Delta_{lijh}(w_{ljh}-w_{ijh}),\tag{38}$$

$$\frac{\partial L(W)}{\partial w_{ljh}} = \sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\|X_{lkh}-V_{ljh}\|^{2} + \gamma(\log w_{ljh}+1) - \kappa_{lj} + \sum_{i=1,i\ne l}^{P}(\Delta_{lijh}-\Delta_{iljh}) = 0,\tag{39}$$

$$w_{ljh} = \exp\left(-\frac{1}{\gamma}\left[\sum_{k=1}^{Y_l}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\|X_{lkh}-V_{ljh}\|^{2} + \gamma - \kappa_{lj} + 2\sum_{i=1,i\ne l}^{P}\Delta_{lijh}\right]\right).\tag{40}$$

Applying constraint (19) to (40), we obtain Eqs. (41) and (42), identical to Eq. (26). $\Delta_{lijh}$ is calculated by a Picard iteration with $\alpha_{2}$ a positive scalar:

$$\Delta_{lijh} = \Delta_{lijh} + \alpha_{2}(w_{ljh}-w_{ijh}), \quad (\forall i\ne l;\ i,l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r}).\tag{43}$$

(D) Fix W, V, u, ξ. The Lagrangian function with respect to η is

$$L(\eta) = \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\sum_{j=1}^{C}\left(\frac{u_{lkj}}{1-\eta_{lkj}-\xi_{lkj}}\right)^{m}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2} + \gamma\sum_{l=1}^{P}\sum_{j=1}^{C}\sum_{h=1}^{r} w_{ljh}\log w_{ljh} - \sum_{l=1}^{P}\sum_{k=1}^{Y_l}\kappa_{lk}\left(\sum_{j=1}^{C}\left(\eta_{lkj}+\frac{\xi_{lkj}}{C}\right)-1\right),\tag{44}$$

$$\frac{\partial L(\eta)}{\partial \eta_{lkj}} = \frac{m\,u_{lkj}^{m}}{(1-\eta_{lkj}-\xi_{lkj})^{m+1}}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2} - \kappa_{lk} = 0,\tag{45}$$

$$\eta_{lkj} = 1 - \xi_{lkj} - \left(\frac{m\,u_{lkj}^{m}\sum_{h=1}^{r} w_{ljh}\|X_{lkh}-V_{ljh}\|^{2}}{\kappa_{lk}}\right)^{\frac{1}{m+1}}.\tag{46}$$

Applying constraint (18) to (46), we obtain Eq. (47), identical to Eq. (27).

(E) Once we have $u_{lkj}$ and $\eta_{lkj}$, from constraint (16) we can use the Yager generating operator to determine the value of $\xi_{lkj}$:

$$\xi_{lkj} = 1 - (u_{lkj}+\eta_{lkj}) - \left(1-(u_{lkj}+\eta_{lkj})^{\alpha}\right)^{1/\alpha}, \quad (\forall l=\overline{1,P};\ k=\overline{1,Y_l};\ j=\overline{1,C}).\tag{48}$$

Notice that $\alpha > 0$ is an exponent coefficient used to control the refusal degree in PFS sets. The proof is complete. □
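Collecting the formulas of Theorem 1, one local sweep of a single Slave peer can be sketched as follows. This is our illustrative NumPy transcription, not the paper's MPI/C code: the neutral-degree update of Eq. (27) is left out for brevity, and `h` and `D` stand for the aggregated multiplier sums over all other peers received from the Master.

```python
import numpy as np

def slave_sweep(X, U, Eta, Xi, W, V, h, D, m=2.0, gamma=1.0, alpha=0.5):
    """One pass of Eqs. (22), (24), (26) and (28) on Slave peer l.

    X: (Yl, r); U, Eta, Xi: (Yl, C); W, V: (C, r);
    h[j], D[j]: (r,) sums over i != l of h_lijh and Delta_lijh.
    """
    C, r = V.shape
    coef = (U / (1.0 - Eta - Xi)) ** m                        # (Yl, C)
    for j in range(C):                                        # Eq. (24): centers
        num = (coef[:, j:j+1] * W[j] * X).sum(axis=0) - h[j]
        V[j] = num / (W[j] * coef[:, j].sum())
    E = np.empty((C, r))                                      # Eq. (26): weights
    for j in range(C):
        E[j] = (coef[:, j:j+1] * (X - V[j]) ** 2).sum(axis=0) + gamma + 2.0 * D[j]
    W = np.exp(-E / gamma)
    W /= W.sum(axis=1, keepdims=True)                         # denominator of (26)
    dist = np.stack([((X - V[j]) ** 2 * W[j]).sum(axis=1)     # Eq. (22): positive
                     for j in range(C)], axis=1)
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (1.0 / (m - 1.0))
    U = (1.0 - Eta - Xi) / ratio.sum(axis=2)
    s = np.clip(U + Eta, 0.0, 1.0)                            # Eq. (28): refusal
    Xi = 1.0 - s - (1.0 - s ** alpha) ** (1.0 / alpha)
    return U, Eta, Xi, W, V
```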
3.3. The DPFCM algorithm

In this section, we present the DPFCM algorithm in detail.

Distributed Picture Fuzzy Clustering Method (DPFCM)

Input (I): dataset X with N elements in r dimensions; number of clusters C; number of peers P + 1; fuzzifier m; threshold ε > 0; parameters γ, α1, α2, α; maxIter.
Output (O): $\{V_{ljh}\}$, $\{(u_{lkj}, \eta_{lkj}, \xi_{lkj})\}$ and $\{w_{ljh}\}$ for $l=\overline{1,P}$, $k=\overline{1,Y_l}$, $j=\overline{1,C}$, $h=\overline{1,r}$.

1S: Initialization:
- Set the number of iterations t = 0.
- Set $\Delta_{lijh}(t) = h_{lijh}(t) = 0$ $(\forall i\ne l;\ i,l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r})$.
- Randomize $\{(u_{lkj}(t), \eta_{lkj}(t), \xi_{lkj}(t))\}$ satisfying (16).
- Set $w_{ljh}(t) = 1/r$ $(l=\overline{1,P};\ j=\overline{1,C};\ h=\overline{1,r})$.
2S: Calculate cluster centers $V_{ljh}(t)$ from $(u_{lkj}(t), \eta_{lkj}(t), \xi_{lkj}(t))$, $w_{ljh}(t)$ and $h_{lijh}(t)$ by (24).
3S: Calculate attribute-weights $w_{ljh}(t+1)$ from $(u_{lkj}(t), \eta_{lkj}(t), \xi_{lkj}(t))$, $V_{ljh}(t)$ and $\Delta_{lijh}(t)$ by (26).
4S: Send $\{\Delta_{lijh}(t), h_{lijh}(t), V_{ljh}(t), w_{ljh}(t+1)\}$ to the Master.
5M: Calculate $\{\Delta_{lijh}(t+1), h_{lijh}(t+1)\}$ by (23) and (25) and send them to the Slave peers.
6S: Calculate cluster centers $V_{ljh}(t+1)$ from $(u_{lkj}(t), \eta_{lkj}(t), \xi_{lkj}(t))$, $w_{ljh}(t+1)$ and $h_{lijh}(t+1)$ by (24).
7S: Calculate positive degrees $u_{lkj}(t+1)$ from $(\eta_{lkj}(t), \xi_{lkj}(t))$, $w_{ljh}(t+1)$ and $V_{ljh}(t+1)$ by (22).
8S: Compute neutral degrees $\eta_{lkj}(t+1)$ from $(u_{lkj}(t+1), \xi_{lkj}(t))$, $w_{ljh}(t+1)$ and $V_{ljh}(t+1)$ by (27).
9S: Calculate refusal degrees $\xi_{lkj}(t+1)$ from $(u_{lkj}(t+1), \eta_{lkj}(t+1))$, $w_{ljh}(t+1)$ and $V_{ljh}(t+1)$ by (28).
10S: If $\max_l \max\{\|u_{lkj}(t+1)-u_{lkj}(t)\|, \|\eta_{lkj}(t+1)-\eta_{lkj}(t)\|, \|\xi_{lkj}(t+1)-\xi_{lkj}(t)\|\} < \varepsilon$ or t > maxIter, then stop the algorithm; otherwise set t = t + 1 and return to Step 3S.

Here "S" marks operations performed in the Slave peers and "M" marks operations performed in the Master peer.
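The Master/Slave choreography of Steps 4S and 5M can be summarized in the following control-flow skeleton. This is a sketch under our own naming (`local_step` and `membership_step` stand for Steps 2S–3S and 6S–9S and must be supplied by the caller); the actual implementation used in Section 4 is MPI/C.

```python
def dpfcm_loop(peers, local_step, membership_step,
               alpha1=1.0, alpha2=1.0, eps=0.01, max_iter=1000):
    """Facilitator loop of DPFCM. Each entry of `peers` is a dict holding
    the per-peer state 'V', 'W' (arrays) and 'h', 'D' (multiplier arrays
    indexed by the other peer i, as in Eqs. (23) and (25))."""
    P = len(peers)
    for t in range(max_iter):
        for p in peers:                    # Steps 2S-3S on every Slave
            local_step(p)
        # Step 4S: every Slave uploads (Delta, h, V, w) -> P messages.
        # Step 5M: the Master applies the Picard updates (23) and (25)
        # for every ordered pair of peers and sends the results back.
        for l in range(P):
            for i in range(P):
                if i != l:
                    peers[l]['h'][i] += alpha1 * (peers[l]['V'] - peers[i]['V'])
                    peers[l]['D'][i] += alpha2 * (peers[l]['W'] - peers[i]['W'])
        diffs = [membership_step(p) for p in peers]   # Steps 6S-9S
        if max(diffs) < eps:               # Step 10S stopping criterion
            break
    return peers
```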
3.4. Theoretical analyses of DPFCM

In this section, we analyze the DPFCM algorithm, including the deeper meaning of Theorem 1 and the advantages and disadvantages of the proposed work.

As recognized in the proposed model (Section 3.1), the problem (14)–(21) is an optimization problem aiming to derive the cluster centers, accompanied by the attribute-weights and the positive, neutral and refusal memberships of data points, from a given dataset and a facilitator system. Using the Lagrangian method and the Picard iteration, the optimal solutions of the problem are determined as in Eqs. (22)–(28). We clearly see that the cluster centers (24), the attribute-weights (26) and the positive (22), neutral (27) and refusal memberships (28) are affected by the facilitator model through the use of two Lagrangian multipliers, expressed in Eqs. (23) and (25). Specifically, $h_{lijh}$ directly drives the changes of the cluster centers in (24), and the multipliers are then updated in the Master peer by Eq. (23). The newly updated multipliers continue to be used in the next computation of the cluster centers in (24). Similarly, $\Delta_{lijh}$ contributes greatly to the changes of the values of the attribute-weights in (26). These weights are used for the calculation of all memberships and the cluster centers. Like $h_{lijh}$, the $\Delta_{lijh}$ are updated in the Master peer by those of the other peers. Using the facilitator model in this manner, expressed by the activities of the two Lagrangian multipliers $h_{lijh}$ and $\Delta_{lijh}$, helps the local results in a peer be updated with those of the other peers, so that the local clustering outputs can reach the global optimum. Besides the facilitator model, the utilization of the various memberships in (22), (27) and (28) both reflects the principle of the PFS set and improves the clustering quality of the algorithm. That is to say, the final cluster centers in (24) are affected by the membership-like quantity $u_{lkj}/(1-\eta_{lkj}-\xi_{lkj})$, whose membership components are calculated based upon the dataset and the previous cluster centers and memberships, thus regulating the next results according to the previous ones in a good manner. The meaning of Theorem 1 is not just the reflection of the ideas stated in Section 3.1 but also the expression of the calculation process, which can be easily interpreted into the algorithm in Section 3.3.

The advantages of the proposed algorithm are threefold. Firstly, the proposed clustering algorithm could be applied to various practical problems requiring fast processing of huge datasets; in fact, since the activities of the algorithm are performed simultaneously in all peers, the total operating time is reduced as a result. The clustering quality of the outputs is also better than those of the relevant distributed clustering algorithms, according to our theoretical analyses in Section 3.1. Secondly, the proposed algorithm is easy to implement and could be adapted to many parallel processing models such as the Message Passing Interface (MPI), Open Multi-Processing (OpenMP), Local Area Multicomputer (LAM/MPI), etc. Thirdly, the design of the DPFCM algorithm in this article could serve as a know-how tutorial for the development of fuzzy clustering algorithms on advanced fuzzy sets like the PFS set.

Besides the advantages, the proposed work still contains some limitations. Firstly, the DPFCM algorithm has a large computational time in comparison with some relevant algorithms such as FCM, PFCM, Soft-DKM, WEFCM and CDFCM, due to the extra computation on the membership degrees and the results of all peers. Secondly, the number of peers could affect the clustering quality of the outputs: a large number of peers may enhance the clustering quality but also increases the computational time of the algorithm. How many peers are enough to balance the clustering quality against the computational time? In the experimental section, we validate these remarks and seek the answers to these questions.
PðiÞPðjÞ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ; HðRÞHðQ Þ ð51Þ where R, Q are two partitions of the dataset having I and J clusters, respectively P(i) is the probability that a randomly selected object from the dataset falls into cluster Ri in the partition R P(i,j) is the Fig The initiation of peer 58 L.H Son / Expert Systems with Applications 42 (2015) 51–66 Fig The initiation of peer Fig The initiation of peer Fig The communication in each iteration step L.H Son / Expert Systems with Applications 42 (2015) 51–66 59 Fig The distribution of clusters of Peer in the second iteration Fig The distribution of clusters of Peer in the second iteration probability that an object belongs to cluster Ri in R and cluster Qj in Q H(R) is the entropy associated with probabilities P(i) (1 i I) in partition R AIN, ACR and ANMI are the average results after 100 runs Objective: (a) to illustrate the activities of DPFCM to classify a specific benchmark dataset of UCI Machine Learning Repository; (b) to evaluate the clustering qualities of algorithms through validity indices; (c) to measure the effect of the number of peers to the clustering quality; (d) to investigate the computational time of all algorithms positive, the neutral and the refusal matrices of the first peer are initialized in (52)–(54), respectively 0:082100 0:836100 0:011500 0:722100 0:002400 0:930900 0:365000 0:983200 0:578800 ð52Þ 0:002900 0:199000 0:608400 0:116700 0:462500 0:932100 0:052229 0:123007 0:143827 4.2 An illustration of DPFCM 0:131697 0:878686 0:036471 0:466915 0:002841 0:415851 Firstly, we illustrate the activities of the proposed algorithm – DPFCM to classify the IRIS dataset In this case, N = 150, r = 4, C = and the number of peers is P = The cardinalities of the first, second and third peers are 38, 39 and 73, respectively The initial 0:034799 0:450723 0:213030 0:747537 0:094331 0:017593 ð53Þ 60 L.H Son / Expert Systems with Applications 42 (2015) 51–66 Fig The distribution of clusters of Peer in the second iteration 0:477245 0:020119 0:244364 0:076669 0:050943 0:026051 0:065771 0:000965 0:001082 ð54Þ The values of Lagranian multipliers Dlijh(1), hlijh(1) in all peers after updating from the Master peer are described in (57) and (58), respectively The cluster centers Vljh(1) are then updated accordingly 0:560059 0:075800 0:085142 0:505968 0:482254 0:573009 0:467232 0:113362 0:394597 0:014931 0:496019 0:529548 0:542298 0:452127 0:495326 0:510318 0:562189 0:474282 From this initialization, the distribution of clusters of the first peer in the first iteration is depicted in Fig Similarly, the distributions of clusters of the second and third peer in the first iteration are depicted in Figs and 3, respectively Now, we illustrate the activities of the first peer The cluster centers Vljh(0) calculated by Eq (24) are expressed in Eq (55) Based upon Vljh(1), wljh(1) and the updated Lagranian multipliers, new positive, the neutral and the refusal matrices of the first peer are calculated in (60)–(62), respectively 0:146003 0:306919 0:202825 0:244076 0:025787 0:304908 0:568 0:523 0:603 0:466 0:563 0:564 0:608 0:470 ð55Þ 0:546 0:542 0:619 0:494 0:146620 0:359057 0:189975 ð60Þ 0:129143 0:158800 0:242726 The attribute-weights wljh(1) are computed from (26) and shown in (56) 0:045708 0:161369 0:344092 0:293576 0:638569 0:594259 0:137 0:225 0:291 0:346 0:181 0:268 0:253 0:298 ð56Þ 0:185 0:234 0:273 0:309 Now all Slave peers synchronize their pairs {Dlijh (0), hlijh(0), Vljh(0), wljh(1)} to the Master The 
communication model is depicted in Fig Peer1 : ð59Þ Peer2 : 0:740159 0:863683 0:634079 0:821898 0:585084 0:796155 ð61Þ 0:232520 0:758352 0:726713 0:640528 0:534759 0:625413 Peer3 : 0:000 0:000 0:000 0:000 À0:171 À 0:180 0:160 0:191 À0:139 À 0:359 0:214 0:284 0:000 0:000 0:000 0:000 À0:185 À 0:061 0:120 0:126 À0:056 À 0:399 0:196 0:260 ð57Þ 0:000 0:000 0:000 0:000 À0:195 À 0:087 0:140 0:141 À0:049 À 0:457 0:227 0:279 Peer1 : Peer2 : Peer3 : 0:000 0:000 0:000 0:000 0:159 0:157 0:181 0:099 0:065 0:081 0:044 À 0:113 0:000 0:000 0:000 0:000 0:169 0:098 0:282 0:183 0:118 0:123 0:111 À 0:054 0:000 0:000 0:000 0:000 0:148 0:091 0:276 0:192 0:092 0:099 0:122 À 0:033 ð58Þ L.H Son / Expert Systems with Applications 42 (2015) 51–66 61 Fig The distribution of clusters of Peer after 1000 iterations Fig The distribution of clusters of Peer after 1000 iterations 0:446857 0:053748 0:191423 0:015702 0:107295 0:060053 0:031231 0:055057 0:013822 0:479442 0:081057 0:030324 ð62Þ 0:284316 0:276429 0:030259 algorithm In this case, the value of the left side of the stopping condition is 0.539 which is larger than e = 0.01 so that we continue to make other iteration steps The final positive, the neutral, the refusal matrices, the cluster centers and the attribute-weights of the first peer after 1000 iterations are shown in Eqs (63)–(67), respectively 0:060858 0:055820 0:027631 0:008723 0:035924 0:013326 The distributions of clusters of the first, second and third peer in the second iteration are depicted in Figs 5–7, respectively By the similar process, we also calculate the new positive, the neutral and the refusal matrices of other peers These values are used to validate the stopping condition as in Step 10S of the DPFCM 0:012103 0:043641 0:029784 0:052452 0:009393 0:040776 0:005385 0:023104 0:036309 ð63Þ 62 L.H Son / Expert Systems with Applications 42 (2015) 51–66 Fig 10 The distribution of clusters of Peer after 1000 iterations Table The comparison of clustering quality of algorithms 0:546285 0:674677 0:210664 0:038853 0:796979 0:262033 0:010000 0:775321 0:328807 ð64Þ 0:715918 0:010000 0:404484 0:080086 0:416115 0:868126 DPFCM FCM WEFCM PFCM Soft-DKM CDFCM ACR (%) IRIS GLASS IONOSPHERE HABERMAN HEART 96.04 53.33 75.26 76.50 71.89 89.33 42.08 70.94 51.96 51.31 96.66 54.39 76.58 77.12 72.88 89.33 42.08 70.94 51.96 51.31 87.38 40.50 67.77 51.42 50.24 95.90 52.96 75.26 74.68 71.95 DPFCM 0.8785 0.4175 0.1961 0.0826 0.0395 FCM 0.7433 0.2974 0.1299 0.0024 0.0052 WEFCM 0.8801 0.4263 0.2026 0.0992 0.0445 PFCM 0.7433 0.2974 0.1299 0.0024 0.0052 Soft-DKM 0.7294 0.2848 0.1028 0.0018 0.0028 CDFCM 0.8705 0.4170 0.1961 0.0610 0.0408 ANMI 0:344102 0:248388 0:499719 0:341086 0:159465 0:498775 0:253134 0:172007 0:480467 Dataset ð65Þ 0:216394 0:239729 0:444036 0:413766 0:447034 0:093166 IRIS GLASS IONOSPHERE HABERMAN HEART Bold values emphasize the results of the proposed method 1:000000 1:000000 1:000000 1:000000 0:253521 0:562804 1:000000 0:118987 ð66Þ 0:586810 0:385400 0:702494 0:680071 0:356 0:423 0:191 0:030 0:336 0:504 0:128 0:032 ð67Þ 0:372 0:462 0:125 0:040 The distributions of clusters of the first, second and third peer after 1000 iterations are depicted in Figs 8–10, respectively Using the iteration scheme in DPFCM, the results of a Slave peer are balanced with those of the others and converge to optimal solutions 4.3 The comparison of clustering quality Secondly, we compare the clustering quality of algorithms through the ACR and ANMI indices The number of peers used in this section is The results in Table show that 
the clustering quality of DPFCM is mostly better than those of three distributed clustering algorithms namely CDFCM, Soft-DKM and PFCM It is also better than the traditional centralized clustering algorithm FCM, and is little smaller than the centralized weighted clustering WEFCM For example, the ARC value of DPFCM for the IRIS dataset is 96.04% whilst those of CDFCM, Soft-DKM, PFCM and FCM are 95.90%, 87.38%, 89.33% and 89.33%, respectively It is smaller than that of WEFCM (96.66%), but the difference of results between DPFCM and WEFCM is quite small Looking inside the ANMI results of all algorithms, we could recognize that the ANMI value of DPFCM is also larger than those of CDFCM, Soft-DKM, PFCM and FCM, and is smaller than that of WEFCM with the numbers being 0.8785, 0.8705, 0.7294, 0.7433, 0.7433 and 0.8801, respectively Similar observations of both ACR and ANMI indices are found for the GLASS and HABERMAN data Nevertheless, there are some cases that DPFCM results in lower clustering quality than CDFCM For example, the ACR value of DPFCM for the IONOSPHERE dataset is 75.26% which is equal to that of DPFCM, and both of them are smaller than that of WEFCM For the HEART dataset, the ACR value of DPFCM is 71.89%, smaller than those of CDFCM (71.95%) and WEFCM (72.88%) Analogous remarks are found with the ANMI index This means that using the update of all peers in the facilitator model does not always result in better clustering quality than using the update of some neighboring peers in the mechanism of CDFCM since some peers could be in the bad-initialization-states so that the final results would be affected by the balancing mechanism between peers Nonetheless, these cases are not much and most of the time DPFCM often has better clustering quality than CDFCM and other relevant algorithms The clustering qualities of algorithms also vary depending on the dataset For instance, the L.H Son / Expert Systems with Applications 42 (2015) 51–66 Fig 11 The ACR values of algorithms Fig 12 The ANMI values of algorithms Fig 13 The ACR values of DPFCM by number of peers 63 64 L.H Son / Expert Systems with Applications 42 (2015) 51–66 Fig 14 The ANMI values of DPFCM by number of peers Table The results of DPFCM by various numbers of peers Dataset ACR (%) IRIS GLASS IONOSPHERE HABERMAN HEART ANMI IRIS GLASS IONOSPHERE HABERMAN HEART a P=2 93.55 49.23 65.75 70.95 69.70 0.8725 0.4160 0.1893 0.0756 0.0325 P=3 96.04 53.33 75.26a 76.50 71.89a 0.8785 0.4175 0.1961a 0.0826 0.0395 P=4 96.08 57.62 64.30 78.26 66.83 0.8923 0.4525 0.1924 0.0954 0.0683 Table The comparison of AIN values of algorithms P=5 P=6 98.21a 76.20a 63.54 82.12 56.80 a 0.9217 0.4862a 0.1853 0.1084 0.0866a P=7 97.48 56.03 61.32 91.32a 55.40 90.54 30.50 59.72 70.95 52.57 0.9186 0.4480 0.1727 0.1456a 0.0432 0.8492 0.3475 0.1695 0.0756 0.0311 Indicate the maximum value for this dataset ACR values of algorithms for the IRIS dataset is absolute high, ranging from 87.33% to 96.66% However the results for the GLASS dataset are medium with the best clustering quality tracked from the WEFCM algorithm being 54.39% only In the other words, for every two checked data points, one of them is mostly wrong labeled The ranges of clustering qualities of algorithms for the IONOSPHERE, HABERMAN and HEART data are (67.77–76.58%), (51.42–77.12%) and (50.24–72.88%), respectively The ranges for the IRIS and IONOSPHERE are quite narrow which means that (i) all algorithms not differ remarkably in terms of clustering quality and tend to converge to the optimal results 
achieved by the WEFCM algorithm; and (ii) some datasets such as IONOSPHERE contain outliers, so that the range of clustering quality is not high. Nevertheless, this range is greater than 65% accuracy and can be accepted. For the other datasets, the ranges of clustering quality of the algorithms are broad, and the efficiency of the proposed DPFCM algorithm is expressed more obviously than in the cases of narrow ranges. For example, the clustering quality of DPFCM for the HABERMAN dataset is nearly equal to that of WEFCM and is much larger than those of FCM, PFCM and Soft-DKM. Thus, the efficiency of DPFCM is proven even on the noisy and narrow-range datasets. In Figs. 11 and 12, we illustrate the ACR and ANMI values of the algorithms across the various datasets. Obviously, the line of DPFCM is higher than those of the other algorithms except WEFCM. This affirms our remarks above about the efficiency of DPFCM.

[Fig. 11. The ACR values of algorithms.]
[Fig. 12. The ANMI values of algorithms.]

4.4. The impact of the number of peers

Thirdly, we measure the effect of the number of peers on the clustering quality of DPFCM. In Section 3.4, we stated a question involving the optimal number of peers to balance the clustering quality against the computational time. In order to answer this question, we have run the DPFCM algorithm with various numbers of peers on the experimental datasets and measured the ACR and ANMI index values for those cases. The results are demonstrated in Table 3. From this table, we depict the ACR and ANMI values of the DPFCM algorithm by number of peers in Figs. 13 and 14.

Table 3. The results of DPFCM by various numbers of peers (a indicates the maximum value for the dataset).

ACR (%)       P=2     P=3      P=4     P=5      P=6      P=7
IRIS          93.55   96.04    96.08   98.21a   97.48    90.54
GLASS         49.23   53.33    57.62   76.20a   56.03    30.50
IONOSPHERE    65.75   75.26a   64.30   63.54    61.32    59.72
HABERMAN      70.95   76.50    78.26   82.12    91.32a   70.95
HEART         69.70   71.89a   66.83   56.80    55.40    52.57

ANMI          P=2     P=3      P=4     P=5      P=6      P=7
IRIS          0.8725  0.8785   0.8923  0.9217a  0.9186   0.8492
GLASS         0.4160  0.4175   0.4525  0.4862a  0.4480   0.3475
IONOSPHERE    0.1893  0.1961a  0.1924  0.1853   0.1727   0.1695
HABERMAN      0.0756  0.0826   0.0954  0.1084   0.1456a  0.0756
HEART         0.0325  0.0395   0.0683  0.0866a  0.0432   0.0311

[Fig. 13. The ACR values of DPFCM by number of peers.]
[Fig. 14. The ANMI values of DPFCM by number of peers.]

The results in Table 3 and Figs. 13 and 14 clearly state that the optimal range of the number of peers is [3, 5]. It is obvious that the ACR values of DPFCM on the IRIS and GLASS datasets are maximal with P = 5. Similarly, the maximal ACR values of DPFCM on the IONOSPHERE and HEART datasets are achieved with P = 3. The last result, on the HABERMAN dataset, shows the maximal ACR value with P = 6. By a simple count of the maximal ANMI values of DPFCM in Table 3, we also recognize that P = 5 is the most suitable number of peers, since it contributes three cases of maximal ANMI values. Thus, our recommendation for choosing P is the range [3, 5].

4.5. The comparison of computational time

Lastly, we investigate the computational time of all the algorithms through the AIN index (see Table 4).

Table 4. The comparison of AIN values of algorithms (bold values emphasize the results of the proposed method).

Dataset       DPFCM   FCM    WEFCM   PFCM    Soft-DKM  CDFCM
IRIS          42.4    21.2   26.2    38.6    23.8      30.2
GLASS         113.6   56.2   63.8    107.6   73.4      86.2
IONOSPHERE    83.4    13.9   44.8    52.6    36.8      51.2
HABERMAN      46.3    17.1   18.2    29.4    21.6      26.8
HEART         81.2    41.0   48.4    67.4    46.2      56.8

The results clearly state that the proposed DPFCM takes a larger number of iterations than
relevant algorithms for this problem Theoretical analyses of the proposed algorithm including the meanings of some theorems proposed in this article and the advantages/disadvantages of the algorithm were also discussed The theoretical contribution of this paper could be useful for later development and applications of distributed fuzzy clustering to practical problems Experimental results have been conducted on the benchmark datasets of UCI Machine Learning Repository and divided into several different scenarios for purposes A numerical example on the IRIS dataset has been conducted to show step-by-step the activities of the proposed algorithm The measurements on the impact of the number of peers and the computational time of algorithms were also investigated The findings extracted from the experiments could be summarized as follow: (i) The clustering quality of DPFCM is better than those of other relevant distributed clustering algorithms; (ii) The average ACR value of DPFCM by various datasets is 74.6%; (iii) The number of peers used in the DPFCM algorithm should be chosen within the range [3, 5]; (iv) DPFCM takes longer computational time than other algorithms, yet the differences are not much and can be acceptable The insightful and practical implications of the proposed research work could be interpreted as follows Firstly, since many applications nowadays require fast processing on the large and very large datasets, the DPFCM algorithm could be used to simultaneously process those data without remarkable deducting the quality of outputted results Each local site is kept up-to-date with the others and the main site so that the proposed mechanism could be efficient for the world wide management Secondly, the theoretical contribution of this paper could expand a minor research direction about distributed fuzzy clustering on advanced fuzzy sets such as the picture fuzzy sets used in this article From these insightful implications, further works of this theme could be lay into several directions: (i) Extending DPFCM in the context of semi-supervised clustering; (ii) Adapting DPFCM for other special parallel processing models such as OpenMP and LAMP/MPI; (iii) Integrating DPFCM to recommender systems for the extraction of distributed fuzzy rules; (iv) Considering DPFCM as a part of a fuzzy time series forecasting system such as the ANFIS network, ANN, etc (v) Applying this algorithm for some group decision making problems Acknowledgement The authors are greatly indebted to the editor-in-chief, Prof B Lin and anonymous reviewers for their comments and their valuable suggestions that improved the quality and clarity of paper Another thank is sent to Msc Pham Huy Thong for the calculation works This work is sponsored by a VNU Project under contract No QG.14.60 65 References Agarwal, M., Agrawal, H., Jain, N., & Kumar, M (2010) Face recognition using principle component analysis, eigenface and neural network In Proceeding of IEEE international conference on signal acquisition and processing (ICSAP’10) (pp 310–314) Ahmed, M N., Yamany, S M., Mohamed, N., Farag, A A., & Moriarty, T (2002) A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data IEEE Transactions on Medical Imaging, 21(3), 193–199 Atanassov, K T (1986) Intuitionistic fuzzy sets Fuzzy Sets and Systems, 20, 87–96 Bache, K., & Lichman, M (2013) UCI Machine Learning Repository Irvine, CA: University of California, School of Information and Computer Science [Online] Available from: Bai, C., Dhavale, 
Bai, C., Dhavale, D., & Sarkis, J. (2014). Integrating fuzzy C-means and TOPSIS for performance evaluation: An application and comparative analysis. Expert Systems with Applications, 41(9), 4186–4196.
Balcan, M. F., Ehrlich, S., & Liang, Y. (2013). Distributed k-means and k-median clustering on general topologies. arXiv preprint arXiv:1306.0604v3.
Bezdek, J. C., et al. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10, 191–203.
Branting, L. K. (2013). Distributed pivot clustering with arbitrary distance. In Proceedings of the 2013 IEEE international conference on big data (pp. 21–27).
Bui, A., Kudireti, A., & Sohier, D. (2012). An adaptive random walk based distributed clustering algorithm. International Journal of Foundations of Computer Science, 23(04), 803–830.
Cao, H., Deng, H. W., & Wang, Y. P. (2012). Segmentation of M-FISH images for improved classification of chromosomes with an adaptive fuzzy C-means clustering algorithm. IEEE Transactions on Fuzzy Systems, 20(1), 1–8.
Chen, L., Chen, C. P., & Lu, M. (2011). A multiple-kernel fuzzy c-means algorithm for image segmentation. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(5), 1263–1274.
Chen, X. W., & Huang, T. (2003). Facial expression recognition: A clustering-based approach. Pattern Recognition Letters, 24(9), 1295–1302.
Chimphlee, W., Abdullah, A. H., Noor Md Sap, M., Chimphlee, S., & Srinoy, S. (2005). Integrating genetic algorithms and fuzzy c-means for anomaly detection. In Proceedings of IEEE INDICON (pp. 575–579).
Chimphlee, W., Abdullah, A. H., Noor Md Sap, M., Srinoy, S., & Chimphlee, S. (2006). Anomaly-based intrusion detection using fuzzy rough clustering. In Proceedings of the IEEE international conference on hybrid information technology (ICHIT'06) (Vol. 1, pp. 329–334).
Chuang, K. S., Tzeng, H. L., Chen, S., Wu, J., & Chen, T. J. (2006). Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics, 30(1), 9–15.
Chu, H. J., Liau, C. J., Lin, C. H., & Su, B. S. (2012). Integration of fuzzy cluster analysis and kernel density estimation for tracking typhoon trajectories in the Taiwan region. Expert Systems with Applications, 39(10), 9451–9457.
Coletta, L. F., Vendramin, L., Hruschka, E. R., Campello, R. J., & Pedrycz, W. (2012). Collaborative fuzzy clustering algorithms: Some refinements and design guidelines. IEEE Transactions on Fuzzy Systems, 20(3), 444–462.
Cuong, B. C., & Kreinovich, V. (2013). Picture fuzzy sets – a new concept for computational intelligence problems. In Proceedings of the 2013 third world congress on information and communication technologies (WICT 2013) (pp. 1–6).
Cuong, B. C., Son, L. H., & Chau, H. T. M. (2010). Some context fuzzy clustering methods for classification problems. In Proceedings of the 2010 ACM symposium on information and communication technology (pp. 34–40).
Di Martino, F., Loia, V., & Sessa, S. (2008). Extended fuzzy C-means clustering algorithm for hotspot events in spatial analysis. International Journal of Hybrid Intelligent Systems, 5(1), 31–44.
Egrioglu, E., et al. (2011). Fuzzy time series forecasting method based on Gustafson–Kessel fuzzy clustering. Expert Systems with Applications, 38(8), 10355–10357.
Egrioglu, E., Aladag, C. H., & Yolcu, U. (2013). Fuzzy time series forecasting with a novel hybrid approach combining fuzzy c-means and neural networks. Expert Systems with Applications, 40(3), 854–857.
Ercan, T. (2010). Effective use of cloud computing in educational institutions. Procedia – Social and Behavioral Sciences, 2(2), 938–942.
Forero, P. A., Cano, A., & Giannakis, G. B. (2011). Distributed clustering using wireless sensor networks. IEEE Journal of Selected Topics in Signal Processing, 5(4), 707–724.
Gehweiler, J., & Meyerhenke, H. (2010). A distributed diffusive heuristic for clustering a virtual P2P supercomputer. In Proceedings of the 2010 IEEE international symposium on parallel & distributed processing, workshops and PhD forum (IPDPSW) (pp. 1–8).
Geng, X., & Yang, Z. (2013). Data mining in cloud computing. In Proceedings of the 2013 international conference on information science and computer applications (ISCA 2013).
Ghanem, S., Kechadi, T., & Tari, A. (2011). New approach for distributed clustering. In Proceedings of the 2011 IEEE international conference on spatial data mining and geographical knowledge services (ICSDM) (pp. 60–65).
Gilhotra, E., & Trikha, P. (2012). Modification in "KNN" clustering algorithm for distributed data. International Journal of Computer Applications in Engineering Sciences, 22(3).
Hadavandi, E., Shavandi, H., & Ghanbari, A. (2011). An improved sales forecasting approach by the integration of genetic fuzzy systems and data clustering: Case study of printed circuit board. Expert Systems with Applications, 38(8), 9392–9399.
clustering using wireless sensor networks IEEE Journal of Selected Topics in Signal Processing, 5(4), 707–724 Gehweiler, J., & Meyerhenke, H (2010) A distributed diffusive heuristic for clustering a virtual P2P supercomputer In Proceeding of 2010 IEEE international symposium on parallel & distributed processing, workshops and phd forum (IPDPSW) (pp 1–8) Geng, X., & Yang, Z (2013) Data mining in cloud computing In Proceeding of 2013 international conference on information science and computer applications (ISCA 2013) Ghanem, S., Kechadi, T., & Tari, A (2011) New approach for distributed clustering In Proceeding of 2011 IEEE international conference on spatial data mining and geographical knowledge services (ICSDM) (pp 60–65) Gilhotra, E., & Trikha, P (2012) Modification in ‘‘KNN’’ clustering algorithm for distributed data International Journal of Computer Applications in Engineering Sciences, 22(3) Hadavandi, E., Shavandi, H., & Ghanbari, A (2011) An improved sales forecasting approach by the integration of genetic fuzzy systems and data clustering: Case study of printed circuit board Expert Systems with Applications, 38(8), 9392–9399 66 L.H Son / Expert Systems with Applications 42 (2015) 51–66 Haddadnia, J., Faez, K., & Ahmadi, M (2003) A fuzzy hybrid learning algorithm for radial basis function neural network with application in human face recognition Pattern Recognition, 36(5), 1187–1202 Hai, M., Zhang, S., Zhu, L., & Wang, Y (2012) A survey of distributed clustering algorithms In Proceeding of 2012 IEEE international conference on industrial control and electronics engineering (ICICEE) (pp 1142–1145) Huang, H C., Chuang, Y Y., & Chen, C S (2012) Multiple kernel fuzzy clustering IEEE Transactions on Fuzzy Systems, 20(1), 120–134 Izakian, H., & Abraham, A (2011) Fuzzy C-means and fuzzy swarm for fuzzy clustering problem Expert Systems with Applications, 38(3), 1835–1838 Jain, A K., & Maheswari, S (2013) Survey of recent clustering techniques in data mining Journal of Current Computer Science and Technology, 3(01) Karjee, J., & Jamadagni, H S (2011) Data accuracy model for distributed clustering algorithm based on spatial data correlation in wireless sensor networks arXiv preprint arXiv:1108.2644 Le Khac, N A., & Kechadi, M (2011) On a distributed approach for density-based clustering In Proceeding of 2011 10th international conference on machine learning and applications and workshops (ICMLA) (Vol 1, pp 283–286) Krinidis, S., & Chatzis, V (2010) A robust fuzzy local information C-means clustering algorithm IEEE Transactions on Image Processing, 19(5), 1328–1337 Kwon, Y., Nunley, D., Gardner, J P., Balazinska, M., Howe, B., & Loebman, S (2010) Scalable clustering algorithm for N-body simulations in a shared-nothing cluster Berlin, Heidelberg: Springer, pp 132–150 Li, X (2003) Gesture recognition based on fuzzy C-Means clustering algorithm University Of Tennessee Knoxville: Department Of Computer Science Li, B N., Chui, C K., Chang, S., & Ong, S H (2011) Integrating spatial fuzzy clustering with level set methods for automated medical image segmentation Computers in Biology and Medicine, 41(1), 1–10 Li, H., Li, J., & Kang, F (2011) Risk analysis of dam based on artificial bee colony algorithm with fuzzy c-means clustering Canadian Journal of Civil Engineering, 38(5), 483–492 Lu, L., Gu, Y., & Grossman, R (2010) dSimpleGraph: A novel distributed clustering algorithm for exploring very large scale unknown data sets In Proceeding of 2010 IEEE international conference on data mining workshops 
(ICDMW) (pp 162– 169) Lu, J., Yuan, X., & Yahagi, T (2006) A method of face recognition based on fuzzy clustering and parallel neural networks Signal Processing, 86(8), 2026–2039 Lu, J., Yuan, X., & Yahagi, T (2007) A method of face recognition based on fuzzy cmeans clustering and associated sub-NNs IEEE Transactions on Neural Networks, 18(1), 150–160 Martin, A., Gayathri, V., Saranya, G., Gayathri, P., & Venkatesan, P (2011) A hybrid model for bankruptcy prediction using genetic algorithm, fuzzy c-means and mars arXiv preprint arXiv:1103.2110 Ma, L., & Staunton, R C (2007) A modified fuzzy C-means image segmentation algorithm for use with uneven illumination patterns Pattern Recognition, 40(11), 3005–3011 Pandey, S., Wu, L., Guru, S M., & Buyya, R (2010) A particle swarm optimizationbased heuristic for scheduling workflow applications in cloud computing environments In Proceeding of 2010 24th IEEE international conference on advanced information networking and applications (AINA) (pp 400–407) Petre, R Sß (2012) Data mining in cloud computing Database Systems Journal, 3(3), 67–71 Pham, D L., Xu, C., & Prince, J L (2000) Current methods in medical image segmentation Annual Review of Biomedical Engineering, 2(1), 315–337 Rahimi, S., Zargham, M., Thakre, A., & Chhillar, D (2004) A parallel fuzzy c-mean algorithm for image segmentation In Proceeding of IEEE annual meeting of the fuzzy information processing (NAFIPS’04) (Vol 1, pp 234–237) Roh, S B., Pedrycz, W., & Ahn, T C (2014) A design of granular fuzzy classifier Expert Systems with Applications, 41(15), 6786–6795 Shah, H., Undercoffer, J., & Joshi, A (2003) Fuzzy clustering for intrusion detection In Proceeding of 12th IEEE international conference on fuzzy systems (FUZZ’03) (Vol 2, pp 1274–1278) Siang Tan, K., & Mat Isa, N A (2011) Color image segmentation using histogram thresholding–Fuzzy C-means hybrid approach Pattern Recognition, 44(1), 1–15 Singh, D., & Gosain, A (2013) A comparative analysis of distributed clustering algorithms: A survey In Proceeding of 2013 IEEE international symposium on computational and business intelligence (ISCBI) (pp 165–169) Son, L H (2014a) Enhancing clustering quality of geo-demographic analysis using context fuzzy clustering type-2 and particle swarm optimization Applied Soft Computing, 22, 566–584 Son, L H (2014b) HU-FCF: A hybrid user-based fuzzy collaborative filtering method in recommender systems Expert Systems With Applications, 41(15), 6861–6870 Son, L H., Cuong, B C., Lanzi, P L., & Thong, N T (2012) A novel intuitionistic fuzzy clustering method for geo-demographic analysis Expert Systems with Applications, 39(10), 9848–9859 Son, L H., Cuong, B C., & Long, H V (2013) Spatial interaction–modification model and applications to geo-demographic analysis Knowledge-Based Systems, 49, 152–170 Son, L H., Lanzi, P L., Cuong, B C., & Hung, H A (2012) Data mining in GIS: A novel context-based fuzzy geographically weighted clustering algorithm International Journal of Machine Learning and Computing, 2(3), 235–238 Son, L H., Linh, N D., & Long, H V (2014) A lossless DEM compression for fast retrieval method using fuzzy clustering and MANFIS neural network Engineering Applications of Artificial Intelligence, 29, 33–42 Srikantaiah, S., Kansal, A., & Zhao, F (2008) Energy aware consolidation for cloud computing In Proceedings of the 2008 conference on power aware computing and systems (Vol 10) Surcel, T., & Alecu, F (2008) Applications of cloud computing In Proceeding of the international conference of 
science and technology in the context of the sustainable development (pp 177–180) Vendramin, L., Campello, R J G B., Coletta, L F., & Hruschka, E R (2011) Distributed fuzzy clustering with automatic detection of the number of clusters In Proceeding of international symposium on distributed computing and artificial intelligence (pp 133–140) Visalakshi, N K., Thangavel, K., & Parvathi, R (2010) An intuitionistic fuzzy approach to distributed fuzzy clustering International Journal of Computer Theory and Engineering, 2(2), 1793–8201 Wachs, J., Stern, H., & Edan, Y (2003) Parameter search for an image processing fuzzy C-means hand gesture recognition system In Proceedings of 2003 IEEE international conference on image processing (ICIP 2003) (Vol 3, pp III-341) Wang, G., Hao, J., Ma, J., & Huang, L (2010) A new approach to intrusion detection using artificial neural networks and fuzzy clustering Expert Systems with Applications, 37(9), 6225–6232 Wang, Y., Ma, X., Lao, Y., & Wang, Y (2014) A fuzzy-based customer clustering approach with hierarchical structure for logistics network optimization Expert Systems with Applications, 41(2), 521–534 Xie, T., Bai, G., & Lang, H (2010) A novel distributed clustering algorithm based on OCSVM In Proceeding of 2010 IEEE international conference on intelligent computing and intelligent systems (ICIS) (Vol 1, pp 661–665) Zadeh, L A (1965) Fuzzy sets Information and Control, 8, 338–353 Zhang, D Q., & Chen, S C (2004) A novel kernelized fuzzy c-means algorithm with application in medical image segmentation Artificial Intelligence in Medicine, 32(1), 37–50 Zhang, Y., Huang, D., Ji, M., & Xie, F (2011) Image segmentation using PSO and PCM with Mahalanobis distance Expert Systems with Applications, 38(7), 9036–9040 Zhou, J., & Philip Chen, C L (2011) Attribute weighted entropy regularization in fuzzy c-means algorithm for feature selection In Proceeding of IEEE international conference on system science and engineering, (pp 59–64) Zhou, J., Chen, C., Chen, L., & Li, H (2013) A collaborative fuzzy clustering algorithm in distributed network environments IEEE Transactions on Fuzzy Systems http:// dx.doi.org/10.1109/TFUZZ.2013.2294205 ... Karjee and Jamadagni (2011) constructed a distributed clustering algorithm based upon spatial data correlation among sensor nodes and performed data accuracy for each distributed cluster at their... Average Iteration Number (AIN), the Average Classification Rate (ACR) (Eq 50) and the Average Normalized Mutual Information (ANMI) (Eq 51) (Huang, Chuang, & Chen, 2012) ACR and ANMI are the-largerthe-better... face recognition based on fuzzy cmeans clustering and associated sub-NNs IEEE Transactions on Neural Networks, 18(1), 150–160 Martin, A. , Gayathri, V., Saranya, G., Gayathri, P., & Venkatesan,