DSpace at VNU: Spatial interaction – modification model and applications to geo-demographic analysis


Knowledge-Based Systems 49 (2013) 152–170
Journal homepage: www.elsevier.com/locate/knosys

Spatial interaction – modification model and applications to geo-demographic analysis

Le Hoang Son a,*, Bui Cong Cuong b, Hoang Viet Long c

a VNU University of Science, Vietnam National University, Viet Nam
b Institute of Mathematics, Vietnam Academy of Science and Technology, Viet Nam
c Faculty of Basic Sciences, University of Transport and Communications, Viet Nam

Article history: Received January 2013; received in revised form May 2013; accepted May 2013; available online 23 May 2013

Keywords: Fuzzy clustering; Geo-demographic analysis; Geographic effects; Spatial interaction modification model; Data mining

Abstract

In this paper, we introduce a novel model, the Spatial Interaction – Modification Model (SIM2), for the classification of spatially referenced demographic data. It is integrated with the main part of IPFGWC, the best fuzzy clustering algorithm for the geo-demographic analysis problem, to form a new method named MIPFGWC. Theoretical and experimental analyses show that MIPFGWC achieves better clustering quality than IPFGWC and other available algorithms.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Geo-Demographic Analysis (GDA) is widely used in various applications such as the planning and distribution of products and services, the determination of common characteristics of a population, and the study of population variation in terms of gender, ages, sex, ethnicity, etc. Results of this kind of analysis are visualized on a map as several distinct groups that represent different levels of a population characteristic, e.g. "High density of chain-smokers" and "Low density of chain-smokers". This knowledge assists policy makers in two phases: (i) understanding the causes of such a distribution of the population characteristic and (ii)
giving effective policies to adjust the distribution in order to achieve a certain goal, e.g. limiting smoking. Thus, GDA is of great interest to engineers and managers alike.

In GDA, fuzzy clustering methods are often used to generate a specific distribution of population characteristics. Improving the quality of GDA, or of the fuzzy clustering used for GDA, is an important and necessary objective for describing the distribution accurately. In what follows, we briefly summarize some relevant works toward that goal.

* Corresponding author. Address: 334 Nguyen Trai, Thanh Xuan, Hanoi, Viet Nam. Tel.: +84 904171284; fax: +84 0438623938. E-mail address: sonlh@vnu.edu.vn (L.H. Son).
0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2013.05.005

Some authors, such as Bezdek and Ehrlich [1], Ji et al. [5], Khashei et al. [6], Son et al. [8,9], Yin et al. [11] and Zadegan et al. [12], introduced Fuzzy C-Means (FCM) and its variants to determine the distribution of a demographic feature on a map. Feng and Flowerdew [4] presented an improvement of FCM called Neighbourhood Effects (NE) to remedy the lack of geographic factors in that algorithm. Based on the principle of the Spatial Interaction Model (SIM) [2], NE modifies the cluster memberships by geographic parameters so that the final results are spatially referenced. Mason and Jacobson [7] improved NE in two ways: (i) the SIM model was extended to use the population factor instead of the length of the common boundary, and the new model was named the Spatial Interaction Model with Population Factor (SIM-PF); and (ii) the modification of cluster memberships was executed in each iteration instead of at the end of the algorithm. Their algorithm was named FGWC. Our previous work [8] improved FGWC by integrating it with some results of Intuitionistic Fuzzy Sets (IFS) to tackle the problems of sensitivity to outliers and crisp memberships that
remained in FGWC. The SIM-PF model was kept unchanged in the new algorithm, named IPFGWC.

Some remarks drawn from those articles are listed below.

(a) Experimental and theoretical analyses in [8] showed that the clustering quality of IPFGWC is better than those of FGWC, NE and FCM.
(b) The SIM-PF model used in IPFGWC has some limitations that may reduce the quality of the algorithm.
(c) Theoretical analyses of the impact of the SIM-PF (SIM) model on cluster memberships, as well as of the characteristics of the model, were not found in existing articles.

Let us analyze the second remark more deeply. The modification of the cluster membership based on SIM-PF is described as follows:

\[ u'_k = \alpha \times u_k + \beta \times \frac{1}{A} \sum_{j=1}^{C} w_{kj} \times u_j, \tag{1} \]

\[ \alpha + \beta = 1, \tag{2} \]

\[ w_{kj} = \frac{(\mathrm{pop}_k \times \mathrm{pop}_j)^{b}}{d_{kj}^{a}}. \tag{3} \]

In Eqs. (1)–(3), $u'_k$ ($u_k$) is the new (old) cluster membership of area k. The two parameters $\alpha$ and $\beta$ are scaling variables. A is a factor that scales the "sum" term to the range 0 to 1. $w_{kj}$ is the weight between two areas k and j, showing the level of influence of one area upon the other. It is calculated through Eq. (3), where $\mathrm{pop}_k$ ($\mathrm{pop}_j$) and $d_{kj}$ are the population of area k (j) and the distance between these areas, respectively; a and b are user-defined parameters.

Now, we give three examples to illustrate the weaknesses of the SIM-PF model.

Example 1 (The limitation of the model). In Eq. (1), a newly updated cluster membership cannot be used in the modification of the next ones. For example, in Fig. 1, area "A" is affected by the others under the SIM-PF model and moves to the new location "A'". Fig. 2 shows the next modification step of SIM-PF on area "B". However, instead of using "A'", the model still uses "A" to update area "B". This may lead to an inaccurate modification that decreases the quality of the algorithm.

Example 2 (The limitation of the weight – Common Boundary). The weight in Eq. (3) does not account for areas having common
boundaries. Thus, neighboring areas with low populations may receive a smaller weight than distant areas with high populations. Fig. 3 illustrates three areas "A", "B" and "C". Comparing the distances between the three centers, we see that "A" is nearer to "C" than to "B". The populations of {"A", "C"} are larger than those of {"A", "B"}. According to Eq. (3), the weight $w_{AC}$ is definitely larger than $w_{AB}$. However, areas {"A", "B"} share a common boundary containing the element "8", while {"A", "C"} do not. Naturally, it is reasonable to assume that {"A", "B"} are more closely related to each other than {"A", "C"}. Indeed, it should be that $w_{AB} > w_{AC}$.

Example 3 (The limitation of the weight – Immigration). The weight in Eq. (3) cannot reflect immigration behavior, which is the key demographic factor showing the strong connection between groups with high densities of immigrating elements. Fig. 4 illustrates three areas in which elements "6" and "8" moved from area "B" to "A", and element "4" went from "A" to "B". The formula in Eq. (3) indicates that "A" is related to "C" more than to "B". However, it should be that $w_{AB} > w_{AC}$, since the historic immigration points to the interaction between "A" and "B".

Those examples show the need for a new model that can ameliorate the limitations of SIM-PF. Additionally, theoretical analyses of the new model should be considered, such as the measurement of the influence between two areas, the difference of cluster memberships between SIM-PF and the new model, and the suitable selection of parameters for the best quality of the algorithm. The relevant articles lacked those systematic analyses, as noted in the third remark above. Last but not least, an improvement of IPFGWC that includes the new model is constructed, and the clustering quality of the new algorithm is expected to be better than that of IPFGWC since the limitations of the SIM-PF model are remedied. These considerations constitute our motivation
and objectives in this article.

The rest of the paper and our contributions are organized as follows. Section 2 presents our main contribution, including the novel model named the Spatial Interaction – Modification Model (SIM2) and some theoretical analyses of it. Specifically, in Section 2.1 we examine a new weighting function that handles the limitations of the weight stated in Examples 2 and 3. Moreover, we investigate some properties and theorems of that function that are useful for determining the values of some parameters, such as:

• The upper bound of the average influence of two ubiquitous areas, which helps determine the maximal value of any weight.
• The relation between some parameters of the weighting function that ensures the same influence of two ubiquitous pairs of areas under both the perfect and the imperfect assumptions.
• The sum of weights when the number of areas is very large.

In Section 2.2, a new model that handles the limitation stated in Example 1 is considered. We also investigate some properties and theorems such as:

• The difference of cluster memberships between the SIM-PF and SIM2 models.
• Conditions on the parameters for $\{u'_k\}$ to be a completely monotone increasing sequence.
• The relation between a monotone increasing sequence and a completely monotone increasing one.
• The suitable selection of parameters for the best quality of the algorithm in Section 3.

The results of Sections 2.1 and 2.2 together are called the SIM2 model. In Section 3, we integrate SIM2 with the main part of IPFGWC to form the new algorithm, called Modified IPFGWC (MIPFGWC). Section 4 validates the proposed approach through a set of experiments involving real-world data. Finally, Section 5 draws the conclusions and delineates future research directions.

Fig. 1. The first modification step of SIM-PF model.
Fig. 2. The second modification step of SIM-PF model.
Fig. 3. The common boundary.

2. Spatial interaction modification model

2.1. A new weighting function

Definition 1. A Weighting Function
(WF) w is a function

\[ w: \mathbb{R} \times \mathbb{R} \to \mathbb{R}, \qquad (k, j) \mapsto w_{kj} = \begin{cases} \dfrac{(\mathrm{pop}_k \times \mathrm{pop}_j)^{b} \times p_{kj}^{c} \times \mathrm{IM}_{kj}^{d}}{d_{kj}^{a}}, & k \neq j, \\[2mm] 0, & \text{else}, \end{cases} \tag{4} \]

where $\mathrm{pop}_k$ ($\mathrm{pop}_j$) is the population of area k (j); $d_{kj}$ is the distance between those areas; and $p_{kj}$ is the maximal distance between elements in the common boundary of the two areas. If the areas do not have a common boundary, or there is only one element in the boundary, $p_{kj}$ is set to one. $\mathrm{IM}_{kj}$ is the total number of elements immigrating from area k to j and vice versa. It measures the historic immigration in the previous calculation step. If no immigration between the two areas is found, $\mathrm{IM}_{kj} = 1$. Finally, the four numbers a, b, c, d are parameters specified by some theorems in this sub-section. Some constraints are attached to Eq. (4) as follows:

\[ \begin{cases} \sum_{k=1}^{C} \mathrm{pop}_k = N_0, \\ p_{kj} \le d_{kj}, \\ \mathrm{IM}_{kj} \le \mathrm{pop}_k + \mathrm{pop}_j, \end{cases} \tag{5} \]

where C ($N_0$) is the total number of areas (population and elements in common boundaries). When $p_{kj} = 0$, $N_0$ is equal to the total population (N).

Fig. 4. Immigration interaction.

Now, we investigate some interesting properties and theorems of the weighting function.

Property 1. It is easy to check that WF satisfies:

(a) Commutativity: $w_{k,j} = w_{j,k}$, $\forall j, k \wedge (j \neq k)$;
(b) $\forall j: w_{\emptyset,j} = 0$, where $\emptyset$ is an area having no population;
(c) Cardinality: $|w| = C$.

Some special cases are shown below:

\[ b = d = 0: \; w \to w_{\mathrm{SIM}}; \qquad (c = d = 0) \vee (p_{kj} = 1 \wedge \mathrm{IM}_{kj} = 1): \; w \to w_{\mathrm{SIM\text{-}PF}}. \]

Example 4. In Fig. 4, assume that a = b = c = d = 1, $d_{AB} = 10$ and $d_{AC} = 8$. We re-calculate the weights following Eq. (4):

\[ w_{AB} = [(3 \times 2) \times 3]/10 = 1.8, \tag{6} \]

\[ w_{AC} = [(3 \times 4) \times 1]/8 = 1.5. \tag{7} \]

Theorem 1. The upper bound of the average influence of area k on another one is

\[ w_k^{\mathrm{AVG}} \le \frac{1}{C} \, (\mathrm{pop}_k \times N_0)^{b} \, (\mathrm{pop}_k + N_0)^{d}. \tag{8} \]

Proof. From Eqs. (4) and (5), we have

\[ \sum_{j=1}^{C} w_{kj} = \sum_{j=1}^{C} \left( \mathrm{pop}_k^{b} \times \mathrm{pop}_j^{b} \times \mathrm{IM}_{kj}^{d} \times \frac{p_{kj}^{c}}{d_{kj}^{a}} \right) \tag{9} \]

\[ \le \mathrm{pop}_k^{b} \sum_{j=1}^{C} \mathrm{pop}_j^{b} \times \mathrm{IM}_{kj}^{d}. \tag{10} \]

Using the Hölder inequality, we obtain

\[ \sum_{j=1}^{C} w_{kj} \le \mathrm{pop}_k^{b} \times \left( \sum_{j=1}^{C} \mathrm{pop}_j \right)^{b} \times \left( \sum_{j=1}^{C} \mathrm{IM}_{kj} \right)^{d} \tag{11} \]

\[ \le (\mathrm{pop}_k \times N_0)^{b} \times \left( \sum_{j=1}^{C} \mathrm{IM}_{kj} \right)^{d}. \tag{12} \]

From constraint (5) and the Minkowski inequality, we get

\[ \sum_{j=1}^{C} w_{kj} \le (\mathrm{pop}_k \times N_0)^{b} \, (\mathrm{pop}_k + N_0)^{d}. \tag{13} \]

Thus, we obtain the result in Eq. (8). □

Consequence 1. By a similar proof, we obtain the upper bound of the average influence of a ubiquitous area on another one as follows:

\[ w^{\mathrm{AVG}} = (1/C)^{2} \sum_{k=1}^{C} \sum_{j=1}^{C} w_{kj} \le 2 N_0^{2b+d}. \tag{14} \]

This result gives an estimate of the upper bound of the influence $w_{kj}$, $\forall k \neq j$. This means that wherever two areas are on the map, the maximal impact between them is given by Eq. (14). In what follows, we consider another theorem of the weighting function.

Theorem 2. Given the following assumptions,

(a) All areas are not empty and do not intersect the others; (15)
(b) $b \le d$; (16)
(c) $d_{kj} = d_{ki} + \varepsilon$ with $k \neq i \neq j$, where $\varepsilon$ is an infinitesimal; (17)
(d) $\mathrm{IM}_{kj} = (\mathrm{pop}_k + \mathrm{pop}_j)/2$, (18)

the condition to ensure that the influences from area k to the others are equal is:

\[ (1 + \varepsilon)^{a} > \left( \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right)^{\frac{b+d}{2}}. \tag{19} \]

Proof. Consider the following lemma: for any a, b, x, y satisfying

\[ (0 \le a \le b) \wedge (0 < x \le y), \tag{20} \]

the following inequality holds:

\[ x^{a} y^{b} \ge (x \times y)^{\frac{a+b}{2}}. \tag{21} \]

Indeed, Eq. (21) can be obtained through the transformations below:

\[ (a - b)(\ln x - \ln y) \ge 0 \tag{22} \]

\[ \Leftrightarrow a \ln x + b \ln y \ge a \ln y + b \ln x \tag{23} \]

\[ \Leftrightarrow 2(a \ln x + b \ln y) \ge (a + b)(\ln x + \ln y) \tag{24} \]

\[ \Leftrightarrow x^{a} y^{b} \ge (x \times y)^{\frac{a+b}{2}}. \tag{25} \]

Because the influences from area k to the others are equal, we have

\[ w_{kj} = w_{ki}, \quad \forall i \neq j \neq k \tag{26} \]

\[ \Leftrightarrow \frac{(\mathrm{pop}_k \times \mathrm{pop}_j)^{b} \times p_{kj}^{c} \times \mathrm{IM}_{kj}^{d}}{d_{kj}^{a}} = \frac{(\mathrm{pop}_k \times \mathrm{pop}_i)^{b} \times p_{ki}^{c} \times \mathrm{IM}_{ki}^{d}}{d_{ki}^{a}} \tag{27} \]

\[ \Leftrightarrow \left( \frac{\mathrm{pop}_j}{\mathrm{pop}_i} \right)^{b} \left( \frac{\mathrm{IM}_{kj}}{\mathrm{IM}_{ki}} \right)^{d} = \left( \frac{d_{kj}}{d_{ki}} \right)^{a} \left( \frac{p_{ki}}{p_{kj}} \right)^{c}. \tag{28} \]

From the constraints (15), (17) and (18), Eq. (28) is re-written as

\[ \left( \frac{\mathrm{pop}_j}{\mathrm{pop}_i} \right)^{b} \left( \frac{\mathrm{pop}_k + \mathrm{pop}_j}{\mathrm{pop}_k + \mathrm{pop}_i} \right)^{d} = \left( 1 + \frac{\varepsilon}{d_{ki}} \right)^{a} \approx (1 + \varepsilon)^{a}. \tag{29} \]

Without loss of generality, we can assume that $\mathrm{pop}_j \le \mathrm{pop}_i$, $\forall i \neq j \neq k$. Applying lemma (21) to the left side of Eq. (29), we have

\[ \left( \frac{\mathrm{pop}_j}{\mathrm{pop}_i} \right)^{b} \left( \frac{\mathrm{pop}_k + \mathrm{pop}_j}{\mathrm{pop}_k + \mathrm{pop}_i} \right)^{d} \ge \left( \frac{\mathrm{pop}_j}{\mathrm{pop}_i} \times \frac{\mathrm{pop}_k + \mathrm{pop}_j}{\mathrm{pop}_k + \mathrm{pop}_i} \right)^{\frac{b+d}{2}}. \tag{30} \]

Because of constraint (15), the right side of Eq. (30) is minimal if its numerator is smallest and its denominator is largest. Thus, $\mathrm{pop}_k = \mathrm{pop}_j = 1$ and $\mathrm{pop}_i = N_0 - C + 1$, so that

\[ \left( \frac{\mathrm{pop}_j}{\mathrm{pop}_i} \times \frac{\mathrm{pop}_k + \mathrm{pop}_j}{\mathrm{pop}_k + \mathrm{pop}_i} \right)^{\frac{b+d}{2}} > \left( \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right)^{\frac{b+d}{2}}. \tag{31} \]

From Eqs. (29)–(31), we obtain the result in Eq. (19). □

Consequence 2. By a similar proof, we obtain the condition for the same influence of two ubiquitous pairs of areas [Eq. (32); the inequality is not legible in the source].

Theorem 2 shows the relation between some parameters that ensures the same impact from area k on the others. If a > 0, Eq. (19) always holds. Otherwise, it helps us choose suitable values of a, b, d given the number of areas and the population.

Theorem 3. If we replace a constraint in Theorem 2 with one of those below, then the new condition to ensure that the influences from area k to the others are equal is as follows.

(a) $\mathrm{IM}_{kj} = \min\{\mathrm{pop}_k, \mathrm{pop}_j\}$. The condition is:

\[ (1 + \varepsilon)^{a} > \left( \frac{1}{N_0 - C + 1} \right)^{b+d}. \tag{33} \]

(b) Assume that $\{d_{kj}\}$, $j = \overline{1, C}$, $j \neq k$, is a Padovan sequence:

\[ d_{k1} = d_{k2} = d_{k3} = D_0, \qquad d_{k,j} = d_{k,j-2} + d_{k,j-3}, \quad j = \overline{4, C}, \quad D_0: \text{constant}. \]

The condition is:

\[ P(C)^{a} > \left( \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right)^{\frac{b+d}{2}}, \tag{34} \]

where

\[ P(C) = \frac{1 + r_1}{r_1^{C+2}(2 + 3 r_1)} + \frac{1 + r_2}{r_2^{C+2}(2 + 3 r_2)} + \frac{1 + r_3}{r_3^{C+2}(2 + 3 r_3)}, \tag{35} \]

and $r_i$ ($i = \overline{1, 3}$) is the ith root of the equation $x^3 + x^2 - 1 = 0$.

(c) $p_{kj} = d_{kj}/\delta$ with $\delta$ being a constant. The condition is:

\[ (1 + \varepsilon)^{a - c} > \left( \frac{2}{(N_0 - C + 1)(N_0 - C + 2)} \right)^{\frac{b+d}{2}}. \tag{36} \]

In Theorem 2, the assumptions are quite ideal. We will certainly not see any part of the real world containing areas that do not overlap the others or are equidistant from a considered one. The purpose of this theorem is to examine the relation between some parameters of the weighting function that ensures the same influence from a given area on the others under perfect conditions. However, if one of these conditions does not hold, what does the new relation become? Theorem 3 answers this by replacing an assumption in Theorem 2 with a new one. For example, if we use the minimum operator instead of the average one, the new relation is set up as in Eq. (33), which is looser than Eq. (19). Similarly, when the distances from a given area to the others form a Padovan sequence, the new relation described in Eq. (34) is also looser than that in Theorem 2. If all areas intersect the others, then we get the result in Eq. (36), which is stricter than Eq. (19). Thus, the cases in Theorem 3 give us a reference for the relations between the parameters in different environments.

Now, we investigate the sum of weights when the number of areas is very large. Assume that we have the following constraints:

(a) Assumptions (15), (17) and (18) of Theorem 2; (37)
(b) $2b + d < -1/2$; (38)
(c) $\{\mathrm{pop}_k\}$, $k = \overline{1, C}$, is a Triangular sequence. (39)

Property 2.

(a)
\[ \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} w_{kj} = \frac{1}{D_0} \lim_{C \to \infty} \frac{4 \left( \pi^2 C^2 - 9 C^2 - 6 C^2 \psi^{(1)}(C+1) + 2 \pi^2 C - 12 C - 12 C \psi^{(1)}(C+1) - 6 \psi^{(1)}(C+1) + \pi^2 \right)}{3 (C+1)^2} \tag{40} \]

\[ = \frac{4}{3 D_0} (\pi^2 - 9), \tag{41} \]

where $\psi^{(n)}(x)$ is the nth derivative of the digamma function, and $2b + d = -2$. $D_0$ is the distance from a ubiquitous area to area k.

(b)
\[ \sum_{k=1}^{\infty} \sum_{j=1}^{\infty} w_{kj} = \frac{80 - 8 \pi^2}{D_0}, \quad \text{where } 2b + d = -3. \tag{42} \]

From this property, we recognize that there is a difference of value $(92 - 28\pi^2/3)/D_0$ between the two consecutive series.

Property 3.

(a)
\[ \sum_{k=1}^{\infty} \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} (w_{ki} \times w_{kj}) = \frac{1}{D_0^2} \lim_{C \to \infty} \frac{16 \, Q(C)}{45 (C+1)^4}, \tag{43} \]

with

\[ \begin{aligned} Q(C) = {} & \pi^4 C^4 + 150 \pi^2 C^4 - 1575 C^4 - 900 C^4 \psi^{(1)}(C+1) - 15 C^4 \psi^{(3)}(C+1) \\ & + 4 \pi^4 C^3 + 600 \pi^2 C^3 - 5400 C^3 - 3600 C^3 \psi^{(1)}(C+1) - 60 C^3 \psi^{(3)}(C+1) \\ & + 6 \pi^4 C^2 + 900 \pi^2 C^2 - 6300 C^2 - 5400 C^2 \psi^{(1)}(C+1) - 90 C^2 \psi^{(3)}(C+1) \\ & + 4 \pi^4 C + 600 \pi^2 C - 2520 C - 3600 C \psi^{(1)}(C+1) - 60 C \psi^{(3)}(C+1) \\ & - 900 \psi^{(1)}(C+1) - 15 \psi^{(3)}(C+1) + \pi^4 + 150 \pi^2, \end{aligned} \]

so that

\[ \sum_{k=1}^{\infty} \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} (w_{ki} \times w_{kj}) = \frac{16}{45 D_0^2} \left( -1575 + 150 \pi^2 + \pi^4 \right), \quad \text{where } 2b + d = -2. \tag{44} \]

(b) The corresponding closed form for $2b + d = -3$ [Eq. (45)] is not legible in the source.

Similarly, the difference between the two consecutive series in Property 3 is $(2696400 - 256755\pi^2 - 1667\pi^4)/D_0^2$.

2.2. A new model

Definition 2. The Spatial Interaction – Modification Model (SIM2) is defined as

\[ u'_k = \alpha \times u_k + \beta \times \sum_{j=1}^{k-1} w_{kj} \times u'_j + \frac{\gamma}{A} \times \sum_{j=k}^{C} w_{kj} \times u_j, \tag{46} \]

\[ \alpha + \beta + \gamma = 1. \tag{47} \]

In this model, the newly updated areas appear within Eq. (46). They contribute to the shift of the next areas toward the goal state. $(\alpha, \beta, \gamma)$ are user-defined parameters satisfying constraint (47). In Eq. (46), the weighting function $w_{kj}$ is defined in Eq. (4). The last parameter, A, is a scaling variable that forces the sum of cluster memberships to one. It is often calculated through the maximum operator. A special case of the SIM2 model is:

\[ \beta = 0: \; \mathrm{SIM2} \to \mathrm{SIM\text{-}PF}. \tag{48} \]

Similar to the previous section, we examine some important theorems from Definition 2.

Theorem 4. The maximal difference of cluster memberships between the SIM-PF and SIM2 models is:

\[ MD = \left| \alpha\beta - \frac{\gamma}{A} \right| \left[ N_0 (N_0 - C + 1) \right]^{b} (2 N_0 - C + 1)^{d} + \frac{2 (C - 1) C \beta \gamma}{A} N_0^{2b+d}. \tag{49} \]

Proof. Fix index k. We have

\[ (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} = \sum_{j=1}^{k-1} w_{kj} \times \left( \beta \times u'_j - \frac{\gamma}{A} \times u_j \right). \tag{50} \]

For simplicity, we assume that $(u'_j)^{\mathrm{SIM2}} = (u'_j)^{\mathrm{SIM\text{-}PF}}$, $\forall j < k$. Eq. (50) is now re-written as

\[ (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} = \sum_{j=1}^{k-1} w_{kj} \times \left[ \left( \alpha\beta - \frac{\gamma}{A} \right) u_j + \frac{\beta\gamma}{A} \sum_{i=1}^{C} (w_{ji} \times u_i) \right], \tag{51} \]

\[ (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} = \sum_{j=1}^{C} u_j \left[ \left( \alpha\beta - \frac{\gamma}{A} \right) \times w_{kj} + \frac{\beta\gamma}{A} \sum_{i=1}^{k-1} (w_{ki} \times w_{ij}) \right]. \tag{52} \]

Applying the Hölder inequality to Eq. (52), we get

\[ \left| (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} \right| \le \sum_{j=1}^{C} u_j \times \sum_{j=1}^{C} \left[ \left| \alpha\beta - \frac{\gamma}{A} \right| w_{kj} + \frac{\beta\gamma}{A} \sum_{i=1}^{k-1} (w_{ki} \times w_{ij}) \right]. \tag{53} \]

Because of the property of the cluster memberships $\sum_{j=1}^{C} u_j = 1$, and applying the Minkowski inequality to the right side of Eq. (53), we obtain

\[ \left| (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} \right| \le \left| \alpha\beta - \frac{\gamma}{A} \right| \sum_{j=1}^{C} w_{kj} + \frac{\beta\gamma}{A} \sum_{j=1}^{C} \sum_{i=1}^{k-1} (w_{ki} \times w_{ij}). \tag{54} \]

Applying the result in Eq. (13) of Theorem 1, we get

\[ \left| \alpha\beta - \frac{\gamma}{A} \right| \sum_{j=1}^{C} w_{kj} \le \left| \alpha\beta - \frac{\gamma}{A} \right| \times (\mathrm{pop}_k \times N_0)^{b} (\mathrm{pop}_k + N_0)^{d}. \tag{55} \]

Following Consequence 1,

\[ \sum_{j=1}^{C} \sum_{i=1}^{k-1} (w_{ki} \times w_{ij}) \le 2 N_0^{2b+d} (k - 1) C. \tag{56} \]

From Eqs. (54)–(56), we obtain the maximal difference of the kth cluster membership between the two models SIM2 and SIM-PF:

\[ \left| (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} \right| \le \left| \alpha\beta - \frac{\gamma}{A} \right| \times (\mathrm{pop}_k \times N_0)^{b} (\mathrm{pop}_k + N_0)^{d} + \frac{2 \beta \gamma N_0^{2b+d} (k - 1) C}{A}. \tag{57} \]

Thus, the maximal difference of cluster memberships between the SIM-PF and SIM2 models is

\[ \left\| (u')^{\mathrm{SIM2}} - (u')^{\mathrm{SIM\text{-}PF}} \right\| = \max_{k = \overline{1, C}} \left\{ \left| (u'_k)^{\mathrm{SIM2}} - (u'_k)^{\mathrm{SIM\text{-}PF}} \right| \right\} \le MD, \tag{58} \]

where MD is stated in Eq. (49). □

Theorem 4 gives an estimate of the difference of cluster memberships between the two models. Now, let us see some other definitions.

Definition 3. Vector $a = (a_1, a_2, \ldots, a_n)$ is said to be s-larger than $b = (b_1, b_2, \ldots, b_n)$ if

\[ (a > b)_s \Leftrightarrow \#\{a_i > b_i\} = s, \quad \forall i = \overline{1, n}, \tag{59} \]

where s is an integer belonging to [0, n]. If s = n then a is completely larger than b, and this relationship is denoted as $a \succ b$.

Definition 4 (Ordered relationship of vectors). Vector $a = (a_1, a_2, \ldots, a_n)$ is said to be larger than vector $b = (b_1, b_2, \ldots, b_n)$, denoted by $a > b$, if there exist two maximal values $s_1$, $s_2$ that satisfy $(a > b)_{s_1}$ and $(b > a)_{s_2}$ and $s_1 > s_2$.

Property 4.

(a) $(a > b)_0 \Leftrightarrow \exists i \le n: (b > a)_i$;
(b) $a = b \Leftrightarrow (a > b)_0 \wedge (b > a)_0$;
(c) $s_1 + s_2 \le n$;
(d) $a \succ b \equiv (a > b)_n$. (60)

Example 5. Given a = (5, 8, 3, 6, 9) and b = (5, 8, 1, 7, 2), we easily recognize that $(a > b)_2$ and $(b > a)_1$. Moreover, a > b.

Definition 5. A sequence of vectors $\{X_k; k = \overline{1, M}\}$ is said to be monotone increasing if $X_{k+1} > X_k$, $\forall k = \overline{1, M-1}$. If $X_{k+1} \succ X_k$ then $\{X_k; k = \overline{1, M}\}$ is called completely monotone increasing.

Theorem 5. The conditions on the parameters that ensure that the sequence $\{u'_k\}$ of the SIM2 model is completely monotone increasing are:

\[ \beta \ge \max_{k = \overline{2, C}} \left\{ \max_{i = \overline{k+1, C}} \left\{ \frac{w_{k-1,i} - w_{k,i}}{\sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,i}} \right\} \right\}, \tag{61} \]

\[ \frac{\gamma}{A \alpha} \ge \max_{k = \overline{2, C}} \left\{ \max_{i = \overline{1, k-2}} \left\{ \frac{w_{k-1,i} - w_{k,i}}{\sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,i}} \right\} \right\}, \tag{62} \]

\[ \frac{\beta \gamma}{A \alpha} \ge \max_{k = \overline{2, C}} \left\{ \frac{1 - \beta \times w_{k,k-1}}{\sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,k-1}} \right\}, \tag{63} \]

\[ \alpha \ge \frac{\gamma}{A} \times \max_{k = \overline{2, C}} \left\{ w_{k-1,k} - \beta \times \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,k} \right\}. \tag{64} \]

Proof. From Definition 2, we have:

\[ u'_k - u'_{k-1} = \alpha \times (u_k - u_{k-1}) + \beta \sum_{j=1}^{k-2} (w_{k,j} - w_{k-1,j}) u'_j + \beta \times w_{k,k-1} u'_{k-1} + \frac{\gamma}{A} \sum_{j=k}^{C} (w_{k,j} - w_{k-1,j}) u_j, \quad \forall k = \overline{2, C}, \tag{65} \]

\[ \begin{aligned} \Rightarrow u'_k - u'_{k-1} = {} & \alpha \times [u_k - (1 - \beta \times w_{k,k-1}) u_{k-1}] + \frac{\beta \gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) \sum_{i=1}^{C} w_{j,i} u_i \\ & + \left[ \alpha \beta \sum_{j=1}^{k-2} (w_{k,j} - w_{k-1,j}) u_j + \frac{\gamma}{A} \sum_{j=k}^{C} (w_{k,j} - w_{k-1,j}) u_j \right], \quad \forall k = \overline{2, C}. \end{aligned} \tag{66} \]

From Eq. (66), we obtain the linear combination in Eqs. (67) and (68):

\[ u'_k - u'_{k-1} = h_1 u_1 + h_2 u_2 + \cdots + h_{k-2} u_{k-2} + h_{k-1} u_{k-1} + h_k u_k + h_{k+1} u_{k+1} + \cdots + h_C u_C \ge 0, \quad \forall k = \overline{2, C}, \tag{67} \]

\[ h_i = \begin{cases} \dfrac{\beta \gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,i} + \alpha \beta (w_{k,i} - w_{k-1,i}), & i = \overline{1, k-2}, \\[2mm] \dfrac{\beta \gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,k-1} - \alpha \times (1 - \beta \times w_{k,k-1}), & i = k - 1, \\[2mm] \dfrac{\beta \gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,k} + \alpha + \dfrac{\gamma}{A} (w_{k,k} - w_{k-1,k}), & i = k, \\[2mm] \dfrac{\beta \gamma}{A} \sum_{j=1}^{k-1} (w_{k,j} - w_{k-1,j}) w_{j,i} + \dfrac{\gamma}{A} (w_{k,i} - w_{k-1,i}), & i = \overline{k+1, C}. \end{cases} \tag{68} \]

Inequality (67) holds when $h_i \ge 0$, $\forall i = \overline{1, C}$. Using Eq. (68), we obtain the conditions on the parameters for $u'_k - u'_{k-1} \ge 0$ where k is fixed. Repeating the same process for the other $k = \overline{2, C}$, we get the results in Eqs. (61)–(64). □

Theorem 6.

(a) If $\{u'_k\}$ is completely monotone increasing then it is monotone increasing.
(b) Conversely, if $\{u'_k\}$ is monotone increasing and all parameters of the SIM2 model satisfy the conditions (61)–(64), then it is completely monotone increasing.

Proof.

(a) By Definition 5, $\{u'_k; k = \overline{1, C}\}$ is completely monotone increasing. Thus, $u'_{k+1} \succ u'_k$, $\forall k = \overline{1, C-1}$. From Definition 4, $\exists s_1 = N \wedge s_2 = 0$: $(u'_{k+1} > u'_k)_N$ and $(u'_k > u'_{k+1})_0$. Applying Property 4, we get

\[ u'_{k+1} \succ u'_k \equiv (u'_{k+1} > u'_k)_N \wedge (u'_k > u'_{k+1})_0 \to u'_{k+1} > u'_k. \tag{69} \]

Performing the same proof for the other $k = \overline{1, C-1}$, we obtain the conclusion.

(b) If $\{u'_k\}$ is monotone increasing then

\[ \exists s_1 > s_2: \; (u'_{k+1} > u'_k)_{s_1} \text{ and } (u'_k > u'_{k+1})_{s_2}. \tag{70} \]

In order for $\{u'_k\}$ to be completely monotone increasing, the following conditions should hold:

\[ \begin{cases} \#\{u'_{k+1} > u'_k\} = N, \\ \#\{u'_k > u'_{k+1}\} = 0, \end{cases} \quad k = \overline{1, C-1}. \tag{71} \]

The condition (71) is obtained if the parameters of the SIM2 model satisfy constraints (61)–(64). Therefore, we get the conclusion. □

Theorem 6 provides the relation between a monotone increasing sequence and a completely monotone increasing one. While the first result in 6(a) is quite nice, the second one in 6(b) is not as good. In fact, if the parameters satisfy conditions (61)–(64) then $\{u'_k\}$ is completely monotone increasing by Theorem 5. Thus, we do not need the condition "$\{u'_k\}$ is monotone increasing" anymore. This remark tells us that it is impossible to transform a monotone increasing sequence into a completely monotone increasing one without the conditions on the parameters of the SIM2 model.

Property 5. Assume that $\{u'_k\}$ is a completely monotone increasing sequence. Then, the limits of some special means are as follows.

(a) Modified Harmonic Mean: the limit equals C [Eq. (72); the expression of the mean is not legible in the source].

(b) Root Mean Square:

\[ \lim_{k \to \infty} \left( \sqrt{\frac{\sum_{k=1}^{C} (u'_k)^2}{C}} \right) = 1. \]

(c) Geometric Mean:

\[ \lim_{k \to \infty} \left( \prod_{k=1}^{C} u'_k \right)^{1/C} = 1. \]

In what follows, we look for the conditions on the parameters that ensure $u_k \ge u'_k$ for a given area k. This is quite important because they can show us the trend of all areas in the map. For example, if $u_k \ge u'_k$ for most $k = \overline{1, C}$, all elements may concentrate on some of the last areas rather than the first ones. Indeed, outliers can occur and affect the final results. Thus, a suitable selection of parameters is required to avoid such cases and to guarantee the best quality of the algorithm.

Theorem 7. For a given area k, the condition on the parameters that ensures $u_k \ge u'_k$ is:

\[ \frac{\max\{\alpha\beta, \gamma/A\}}{1 - \alpha} \times \max_{i = \overline{1, C}} \{w_{ki}\} + \frac{\beta \gamma}{A (1 - \alpha)} \times \max_{j = \overline{1, k-1}} \left\{ \max_{i = \overline{1, C}} \{w_{kj} w_{ji}\} \right\} \le \frac{1}{(k - 1) C}. \tag{73} \]

Proof.

\[ u_k \ge u'_k \tag{74} \]

\[ \Leftrightarrow u_k (1 - \alpha) \ge \alpha \beta \times \sum_{j=1}^{k-1} w_{kj} u_j + \frac{\gamma}{A} \sum_{j=k}^{C} w_{kj} u_j + \frac{\beta \gamma}{A} \times \sum_{j=1}^{k-1} \sum_{i=1}^{C} w_{kj} w_{ji} u_i. \tag{75} \]

In order to obtain inequality (75), the following condition should hold:

\[ u_k (1 - \alpha) \ge \max\left\{ \alpha\beta, \frac{\gamma}{A} \right\} \times \sum_{i=1}^{C} w_{ki} u_i + \frac{\beta \gamma}{A} \sum_{j=1}^{k-1} \sum_{i=1}^{C} w_{kj} w_{ji} u_i \tag{76} \]

\[ \Leftrightarrow 1 \ge \frac{\max\{\alpha\beta, \gamma/A\}}{1 - \alpha} \sum_{i=1}^{C} w_{ki} \frac{u_i}{u_k} + \frac{\beta \gamma}{A (1 - \alpha)} \times \sum_{j=1}^{k-1} \sum_{i=1}^{C} w_{kj} w_{ji} \frac{u_i}{u_k}. \tag{77} \]

Because all components in the sums are positive, each one should satisfy the condition below to obtain inequality (77):

\[ \frac{1}{(k - 1) C} \ge \frac{\max\{\alpha\beta, \gamma/A\}}{1 - \alpha} \max_{i = \overline{1, C}} \left\{ w_{ki} \frac{u_i}{u_k} \right\} + \frac{\beta \gamma}{A (1 - \alpha)} \times \max_{j = \overline{1, k-1}} \left\{ \max_{i = \overline{1, C}} \left\{ w_{kj} w_{ji} \frac{u_i}{u_k} \right\} \right\}. \tag{78} \]

Due to the following fact,

\[ \lim_{k, i \to \infty} \frac{u_i}{u_k} = 1, \tag{79} \]

we can eliminate the cluster memberships in inequality (78) and get the result in (73). □

The findings in Section 2 help us understand the principal characteristics of the SIM2 model and are the basis for deploying the main algorithm, which is presented in the next section.

3. The modified IPFGWC algorithm

In this section, we integrate the SIM2 model with the main part of IPFGWC to form the new algorithm named Modified IPFGWC (MIPFGWC). Details of this algorithm are shown below.

Input:
• Geo-demographic data X.
• The number of elements (clusters): N (C).
• The dimension of the dataset: r.
• Threshold $\varepsilon$ and other parameters m, $\eta$, $\tau$, $a_i$ ($i = \overline{1, 3}$), $\gamma_j$ ($j = \overline{1, C}$).
• Geographic parameters $\alpha$, $\beta$, $\gamma$, a, b, c, d.

Output: Final membership values $u'_k$ and centers $V^{(t+1)}$.

MIPFGWC Algorithm:

Step 1: Set the number of clusters C, the threshold $\varepsilon > 0$ and the other parameters such as $m, \eta, \tau > 1$; $a_i > 0$ ($i = \overline{1, 3}$); $\gamma_j$ ($j = \overline{1, C}$) as in the IPFGWC algorithm [8].
Step 2: Initialize the centers of clusters $V_j$, $j = \overline{1, C}$, at t = 0.
Step 3: Set the geographic parameters $\alpha$, $\beta$, $\gamma$, a, b, c, d satisfying condition (47).
Step 4: Use the formulas (80)–(82) to calculate the membership values, the hesitation level and the typicality values, respectively:

\[ u_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{m-1}}}, \quad k = \overline{1, N}, \; j = \overline{1, C}, \tag{80} \]

\[ h_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \dfrac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{\tau-1}}}, \quad k = \overline{1, N}, \; j = \overline{1, C}, \tag{81} \]

\[ t_{kj} = \frac{1}{1 + \left( \dfrac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}}}, \quad k = \overline{1, N}, \; j = \overline{1, C}. \tag{82} \]

Step 5: Perform geographic modifications through Eqs. (4) and (46).
Step 6: If $\{u'_k\}$ is a completely monotone increasing sequence (Theorem 5) or $u_k \ge u'_k$ for most $k = \overline{1, C}$ (Theorem 7), then conclude that there is no suitable solution for the given geographic parameters. Otherwise, go to Step 7.
Step 7: Calculate the centers of clusters at t + 1 by Eq. (83):

\[ V_j = \frac{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) \times X_k}{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right)}, \quad j = \overline{1, C}. \tag{83} \]

Step 8: If the difference $\|V^{(t+1)} - V^{(t)}\| \le \varepsilon$ then stop the algorithm. Otherwise, assign $V^{(t)} = V^{(t+1)}$ and return to Step 4.

4. Results

4.1. Experimental environment

In this part, we describe the experimental environment:

• Experimental tools: We implemented the proposed algorithm (MIPFGWC), in addition to IPFGWC [8] and FGWC [7], in the C programming language and executed them on a PC with an Intel(R) Core(TM)2 Duo CPU T6570 @ 2.10 GHz (2 CPUs) and 2048 MB RAM, running Windows Professional 32-bit.
• Experimental datasets: We use two datasets: a small university dataset (Table 1) and a real one of socio-economic demographic variables from the United Nation Organization – UNO [10]. The second dataset was used for the experiments in [8].
• Cluster validity measurement: We use the validity function of fuzzy clustering for spatial data, namely IFV [3]. This index was shown to be robust and stable when clustering spatial data. Besides, it was also used to evaluate the clustering quality in [8].

\[ IFV = \frac{1}{C} \sum_{j=1}^{C} \left\{ \frac{1}{N} \sum_{k=1}^{N} u_{kj}^{2} \left[ \log_2 C - \frac{1}{N} \sum_{k=1}^{N} \log_2 u_{kj} \right]^{2} \right\} \times \frac{SD_{\max}}{\sigma_D}, \tag{84} \]

\[ SD_{\max} = \max_{k \neq j} \|V_k - V_j\|^2, \tag{85} \]

\[ \sigma_D = \frac{1}{C} \sum_{j=1}^{C} \left( \frac{1}{N} \sum_{k=1}^{N} \|X_k - V_j\| \right). \tag{86} \]

When IFV →
max, the value of IFV is said to yield the most optimal of the dataset Objective: We evaluate the clustering qualities of those algorithms on two different datasets through IFV index Some comparisons of computational times will also be considered 4.2 Evaluations on the university dataset 4.2.1 Case In this case, we divide these students into three groups: ‘‘High’’, ‘‘Medium’’ and ‘‘Low’’ following by all attributes Using two clustering algorithms MIPFGWC and IPFGWC with the parameters being set up below, e = 10À3, parameters cj ¼ 1j ¼ 1; C, m ¼ 3; s ¼ 2; g ¼ 2; a1 ¼ a2 ¼ a3 ¼ 1; 160 L.H Son et al / Knowledge-Based Systems 49 (2013) 152–170 IPFGWC: geographic parameters a = b = 1, a = 0.5, b = 0.5, MIPFGWC: geographic parameters a = b = c = d = 1, a = 0.5, b = 0.3, c = 0.2 These parameters were used in the article [8] to evaluate the clustering quality of algorithms The initial centers of clusters V(0) used for MIPFGWC and IPFGWC is, 17 12 8:6 0:3 B C V 0ị ẳ @ 24 26 7:2 0:5 A: 36 47 6:0 0:8 ð87Þ The results of MIPFGWC algorithm with those configurations are shown in Eqs (88) and (89) The computational time and the number of iteration steps of MIPFGWC are 0.021 s and 15, respectively Similarly, the results of IPFGWC algorithm are given in Eqs (90) and (91) The computational time and the number of iteration steps of IPFGWC are 0.019 s and 20, respectively Through Eqs (84)–(86), we compare IFV values of these algorithms and get the result in Eq (92) 19:306553 10:105724 7:374804 0:999004 B C V MIPFGWCị ẳ @ 17:479478 28:197003 7:735475 0:256288 A; 28:539287 33:201849 7:386632 0:638537 ð88Þ U ðMIPFGWCÞ 0:023007 B 0:091148 B B B 0:001554 B B 0:111198 B ¼B B 0:379026 B B 0:050533 B B @ 0:999511 0:927891 0:049102 0:384451 0:524402 C C C 0:994834 0:003612 C C 0:261117 0:627685 C C C; 0:251436 0:369538 C C 0:886195 0:063272 C C C 0:000320 0:000169 A ð89Þ 0:012554 0:041774 0:945672 19:345955 10:096919 7:370893 0:999568 B C V IPFGWCị ẳ @ 17:364905 27:968125 7:694930 0:263094 A; 
28:414465 34:814743 7:481142 0:557099 ð90Þ U ðIPFGWCÞ 0:044201 B 0:062008 B B B 0:034854 B B 0:068561 B ¼B B 0:213137 B B 0:054508 B B @ 0:499723 0:469462 0:486337 0:228704 0:709288 C C C 0:497449 0:467697 C C 0:184172 0:747267 C C C; 0:178216 0:608646 C C 0:458405 0:487087 C C C 0:034214 0:466063 A ð91Þ 0:031181 0:132587 0:836232 IFV MIPFGWC ¼ 7:110882 > IFV IPFGWC ¼ 2:249461: ð92Þ Thus, the clustering quality of MIPFGWC is better than that of IPFGWC From Eqs (89) and (91), we get another result kU MIPFGWCị U IPFGWCị k ẳ 1:89 < MD ẳ 5170: 93ị Eq (93) states that the maximal difference of cluster memberships using SIM-PF and SIM2 model calculated by experiments is 1.89 and is smaller than the theoretical value MD stated in Theorem Now, we verify the value of weighting function of MIPFGWC algorithm shown in Eqs (94)–(97) 0:018119 0:009682 B C w ¼ @ 0:018119 0:060905 A; 0:009682 0:060905 ð94Þ ẳ 0:027801 < pop1 N0 ịb pop1 ỵ N0 ịd ẳ 160; wAVG 95ị wAVG ẳ 0:079024 < pop2 N0 ịb pop2 ỵ N0 ịd ¼ 384; ð96Þ wAVG ¼ 0:070587 < ðpop3 Á N0 ịb pop3 ỵ N0 ịd ẳ 160: ð97Þ These results affirm that the average influence of an area on another one does not exceed the upper bound stated in Theorem In Table 2, we calculate the sum of weights both by Properties and and by experiments Results show that when the number of areas is small, the sum of weights is greater than that when the number of areas is large The experimental results in this case were performed with C = and are higher than those of Properties and Additionally, the difference between results of 2b + d = À2 and 2b + d = À3 is small In what follow, we study the distribution of pattern sets and centers before and after running two algorithms Results are depicted from Figs 5–8 From Fig 5, we recognize that there is an outlier in the original pattern set MIPFGWC algorithm can handle outlier(s) better than IPFGWC Fig shows that that outlier is disappeared in MIPFGWC whilst it still remains in IPFGWC (Fig 6) In IPFGWC, the patterns 
tend to move toward cluster “C”, while those of MIPFGWC are evenly distributed in the data space. The lack of immigration effects, common boundaries between regions and updated cluster memberships in IPFGWC results in the abnormal distribution of patterns shown in Fig. 6. MIPFGWC remedies this limitation through the SIM2 model and gives the distribution of patterns in Fig. 7. Fig. 8 describes the changes of centers after running the two algorithms. Intuitively, there is not much difference between the locations of these centers. This means that the SIM2 model has a greater impact on the distribution of patterns than on the centers. All the results above show that MIPFGWC obtains better clustering quality than IPFGWC.

Table 1. The university dataset (values listed column by column).

Age:            15    20    18    35    40    18    19    30
Income (K):     30    45    28    50    15    25    10    31
Average grade:  8.3   7.4   8.1   5.6   6.2   7.4   7.1
Male:           0     0     1     1

Table 2. The sum of weights in Case 1.

Sum of weights                                                  Properties 2 and 3                         Experimental results
                                                                2b + d = −2          2b + d = −3
$\sum_{k=1}^{C} \sum_{i=1}^{C} w_{ki}$                          0.09195              0.08273              0.177412
$\sum_{k=1}^{C} \sum_{i=1}^{C} \sum_{j} w_{ki} \times w_{kj}$   6.372 × 10^{−3}      6.305 × 10^{−3}      0.012

4.2.2 Case 2

In order to verify the consistency and robustness of the proposed algorithm, we change the parameters of the algorithms as follows.

IPFGWC: geographic parameters $\alpha = 0.7$, $\beta = 0.3$.
MIPFGWC: geographic parameters $\alpha = 0.7$, $\beta = 0.2$, $\gamma = 0.1$.

Other parameters are kept as in Case 1. The results of MIPFGWC with the initial cluster centers $V^{(0)}$ in Eq. (87) are:

\[
V^{(\mathrm{MIPFGWC})} = \begin{pmatrix} 20.986869 & 10.595425 & 7.231345 & 0.999561 \\ 17.637678 & 27.865067 & 7.649198 & 0.282628 \\ 31.018697 & 46.161775 & 8.406729 & 0.095751 \end{pmatrix}, \tag{98}
\]

\[
U^{(\mathrm{MIPFGWC})} = \begin{pmatrix}
0.028586 & 0.948662 & 0.022752 \\
0.068931 & 0.273096 & 0.657973 \\
0.001370 & 0.997771 & 0.000859 \\
0.000859 & 0.036672 & 0.946739 \\
0.516925 & 0.296017 & 0.187058 \\
0.045229 & 0.938912 & 0.015859 \\
0.983921 & 0.013168 & 0.002911 \\
0.161933 & 0.493241 & 0.344827
\end{pmatrix}. \tag{99}
\]

The computational time and the number of iteration
steps of MIPFGWC are 0.008 s and 9, respectively. Similarly, the results of the IPFGWC algorithm are given in Eqs. (100) and (101). The computational time and the number of iteration steps of IPFGWC are 0.005 s and 9, respectively. The IFV values of these algorithms are calculated in Eq. (102).

\[
V^{(\mathrm{IPFGWC})} = \begin{pmatrix} 20.846367 & 10.502604 & 7.243042 & 0.999342 \\ 17.660246 & 27.829394 & 7.643995 & 0.282196 \\ 30.688095 & 46.148977 & 8.380672 & 0.094085 \end{pmatrix}, \tag{100}
\]

\[
U^{(\mathrm{IPFGWC})} = \begin{pmatrix}
0.069170 & 0.665294 & 0.265535 \\
0.063570 & 0.207881 & 0.728549 \\
0.052247 & 0.698536 & 0.249217 \\
0.020084 & 0.058535 & 0.921382 \\
0.375461 & 0.241883 & 0.382656 \\
0.079164 & 0.660991 & 0.259845 \\
0.690987 & 0.058643 & 0.250370 \\
0.139263 & 0.364975 & 0.495762
\end{pmatrix}, \tag{101}
\]

\[
\mathrm{IFV}_{\mathrm{MIPFGWC}} = 23.229611 > \mathrm{IFV}_{\mathrm{IPFGWC}} = 9.313303. \tag{102}
\]

These results show that the clustering quality of MIPFGWC is better than that of IPFGWC. Additionally, the IFV values of both algorithms are larger than those of the previous case given in Eq. (92), and the computational times of both algorithms are smaller. This means that the larger value of the parameter α in this case, compared with the previous one, enhances the clustering quality and reduces the computational time of MIPFGWC.

Fig. 5. The distribution of the original pattern set by (Income, Age).

Fig. 6. The distribution of the pattern set after running the IPFGWC algorithm in Case 1.

Fig. 7. The distribution of the pattern set after running the MIPFGWC algorithm in Case 1.

Fig. 8. The changes of centers after running the two algorithms in Case 1.

Similar to Case 1, the corresponding theorem holds, since the maximal difference of cluster memberships is 0.7, which is smaller than the MD value. This difference is also smaller than that of the previous case. The values of the weighting function of MIPFGWC are calculated below.

\[
w = \begin{pmatrix} 0 & 0.025793 & 0.002922 \\ 0.025793 & 0 & 0.015529 \\ 0.002922 & 0.015529 & 0 \end{pmatrix}, \tag{103}
\]

\[
w^{AVG}_1 = 0.028715 < (pop_1 \cdot N_0)^b \, (pop_1 + N_0)^d = 160, \tag{104}
\]

\[
w^{AVG}_2 = 0.041322 < (pop_2 \cdot N_0)^b \, (pop_2 + N_0)^d = 384, \tag{105}
\]

\[
w^{AVG}_3 = 0.018451 < (pop_3 \cdot N_0)^b \, (pop_3 + N_0)^d = 160. \tag{106}
\]

From these results, we see that the average influence of an area on another one in this case is mostly smaller than that of the previous case given in Eqs. (95)–(97). In Table 3, we calculate the sum of weights both by Properties 2 and 3 and by experiments. The experimental results are clearly smaller than those of Table 2. Besides, they are close to the theoretical values of Properties 2 and 3.

Table 3. The sum of weights in Case 2.

Sum of weights                                                  Properties 2 and 3                         Experimental results
                                                                2b + d = −2          2b + d = −3
$\sum_{k=1}^{C} \sum_{i=1}^{C} w_{ki}$                          0.09195              0.08273              0.088488
$\sum_{k=1}^{C} \sum_{i=1}^{C} \sum_{j} w_{ki} \times w_{kj}$   6.372 × 10^{−3}      6.305 × 10^{−3}      0.002812

The distributions of patterns and centers are illustrated in Figs. 9–11. These results show that the distributions of patterns in both algorithms improve when the value of the parameter α increases. However, IPFGWC still cannot handle outliers, as shown in Fig. 10. Fig. 11 points out that the distance between the input centers and those of the two algorithms is smaller than the equivalent distance in Fig. 8. The reason may be the increase of α. The parameter α controls how much of the membership value is kept intact: the higher the value of α, the less the position of a pattern can change. Besides, the centers of the two algorithms in Fig. 11 nearly coincide. Through this case, we still see that MIPFGWC is better than IPFGWC.

Fig. 9. The distribution of the pattern set after running the MIPFGWC algorithm in Case 2.

Fig. 10. The distribution of the pattern set after running the IPFGWC algorithm in Case 2.

Fig. 11. The changes of centers after running the two algorithms in Case 2.

4.2.3 Case 3

In this case, we reduce the value of the parameter α in order to check whether MIPFGWC is still better than IPFGWC.

IPFGWC: geographic parameters $\alpha = 0.3$, $\beta = 0.7$.
MIPFGWC: geographic parameters $\alpha = 0.3$, $\beta = 0.3$, $\gamma = 0.4$.

Other parameters are kept as in Case 1. The results of MIPFGWC and IPFGWC are presented in Eqs. (107)–(111). The computational time and the number of iteration steps of MIPFGWC (IPFGWC) are 0.014 s and 12 (0.004 s and 8), respectively. MIPFGWC is still better than IPFGWC in this case. However, the IFV values are the smallest among all cases. Thus, the value of α should be set greater than 0.5 in order to keep the good clustering quality of MIPFGWC.

\[
V^{(\mathrm{MIPFGWC})} = \begin{pmatrix} 19.214866 & 10.061347 & 7.382195 & 0.999712 \\ 17.812722 & 29.013882 & 7.836187 & 0.317771 \\ 26.258217 & 30.968821 & 7.364882 & 0.571935 \end{pmatrix}, \tag{107}
\]

\[
U^{(\mathrm{MIPFGWC})} = \begin{pmatrix}
0.020882 & 0.897375 & 0.081742 \\
0.097268 & 0.454749 & 0.447983 \\
0.003527 & 0.976002 & 0.020471 \\
0.140956 & 0.352933 & 0.506111 \\
0.367145 & 0.243622 & 0.389233 \\
0.061936 & 0.751382 & 0.186683 \\
0.999755 & 0.000132 & 0.000113 \\
0.037066 & 0.134114 & 0.828819
\end{pmatrix}, \tag{108}
\]

\[
V^{(\mathrm{IPFGWC})} = \begin{pmatrix} 19.208872 & 10.063351 & 7.382896 & 0.999610 \\ 17.494326 & 28.243184 & 7.797305 & 0.237981 \\ 25.256041 & 29.590166 & 7.365978 & 0.568198 \end{pmatrix}, \tag{109}
\]

\[
U^{(\mathrm{IPFGWC})} = \begin{pmatrix}
0.017547 & 0.284645 & 0.697808 \\
0.038392 & 0.210621 & 0.750988 \\
0.011578 & 0.298887 & 0.689535 \\
0.048857 & 0.192894 & 0.758250 \\
0.117928 & 0.145605 & 0.736467 \\
0.025313 & 0.272655 & 0.702032 \\
0.299927 & 0.011244 & 0.688829 \\
0.016937 & 0.181681 & 0.801382
\end{pmatrix}, \tag{110}
\]

\[
\mathrm{IFV}_{\mathrm{MIPFGWC}} = 2.657823 > \mathrm{IFV}_{\mathrm{IPFGWC}} = 0.885499. \tag{111}
\]

With these results, we reach the same conclusions as in the previous cases about the two theorems. The values of the weighting function of MIPFGWC in this case are shown below.

\[
w = \begin{pmatrix} 0 & 0.011088 & 0.007178 \\ 0.011088 & 0 & 0.215023 \\ 0.007178 & 0.215023 & 0 \end{pmatrix}, \tag{112}
\]

\[
w^{AVG}_1 = 0.018266 < (pop_1 \cdot N_0)^b \, (pop_1 + N_0)^d = 64, \tag{113}
\]

\[
w^{AVG}_2 = 0.226111 < (pop_2 \cdot N_0)^b \, (pop_2 + N_0)^d = 384, \tag{114}
\]

\[
w^{AVG}_3 = 0.222201 < (pop_3 \cdot N_0)^b \, (pop_3 + N_0)^d = 264. \tag{115}
\]

The average influence of an area on another one in this case is larger than in the previous cases. This means that the influence grows as the value of α decreases. Analogously, the sum of weights grows as α decreases, as shown in Table 4.

Table 4. The sum of weights in Case 3.

Sum of weights                                                  Properties 2 and 3                         Experimental results
                                                                2b + d = −2          2b + d = −3
$\sum_{k=1}^{C} \sum_{i=1}^{C} w_{ki}$                          0.09195              0.08273              0.466578
$\sum_{k=1}^{C} \sum_{i=1}^{C} \sum_{j} w_{ki} \times w_{kj}$   6.372 × 10^{−3}      6.305 × 10^{−3}      0.100833

We examine the distribution of patterns and centers through Figs. 12–14. Clearly, the lower the value of the parameter α, the more outliers both algorithms produce. In this case, outliers appear in both algorithms, but MIPFGWC has fewer outliers than IPFGWC. Moreover, the distribution of patterns in IPFGWC is poor, since all patterns are located at cluster “C” (Fig. 13). MIPFGWC remedies this problem and gives the distribution in Fig. 12. A low value of α also pulls the centers of both algorithms far away from the input centers, which leads to the poor distribution of centers in Fig. 14.

Fig. 12. The distribution of the pattern set after running the MIPFGWC algorithm in Case 3.

Fig. 13. The distribution of the pattern set after running the IPFGWC algorithm in Case 3.

Fig. 14. The changes of centers after running the two algorithms in Case 3.

The final conclusion of Section 4.2 is as follows. The clustering quality of MIPFGWC is better than that of IPFGWC. The value of the parameter α should be chosen greater than 0.5 in order to keep a good distribution of patterns and centers. The characteristics of MIPFGWC stated in the theorems and properties have been examined. The computational time of MIPFGWC is slightly larger than that of IPFGWC.

4.3 Evaluations on the real dataset of UNO

In this section, we evaluate the proposed algorithm on a real dataset of UNO. The criteria are the IFV value and the computational time.

4.3.1 The evaluation by the number of clusters

We compare MIPFGWC with other algorithms in terms of
clustering quality for various numbers of clusters and geographic parameters. The results are shown in Table 5. The IFV values of MIPFGWC are larger than those of IPFGWC and FGWC. To interpret the experimental results, this sub-section evaluates them in detail by the number of clusters.

Table 5. IFV values by geographic parameters (α, β, γ) and C. Each row lists the values for C = 2, 3, 4, 5, 6, 7.

MIPFGWC
  (0.3, 0.25, 0.45)     4.277295   22.179148   30.581131   37.608902   42.934587   52.195249
  (0.35, 0.4, 0.25)     2.057448   19.582190   29.624704   37.411510   45.848958   53.488738
  (0.7, 0.2, 0.1)       6.745117   16.469071   28.365178   39.127854   43.268407   53.799952
  (0.55, 0.15, 0.3)     4.030379   15.201345   31.066573   37.586166   47.409964   53.607457
  (0.34, 0.33, 0.33)    2.033651   20.666177   28.349582   36.559658   46.418473   52.143056
  (0.5, 0.3, 0.2)       3.435273   20.799380   31.217660   38.386950   42.089580   51.283630

IPFGWC (a)
  (0.3, 0.25, 0.45)     0.404991    4.145717    6.498622    8.254484    9.667398   11.173161
  (0.35, 0.4, 0.25)     1.007727    4.948399    7.261183    9.212238   10.762856   12.553616
  Settings (0.7, 0.2, 0.1), (0.55, 0.15, 0.3), (0.34, 0.33, 0.33) and (0.5, 0.3, 0.2), in printed order:
                        2.122069    5.475086   10.710501   14.570259   17.278531   20.015909
                        1.359492    4.363690   10.130096   12.403330   15.510173   18.197986
                        2.786570    7.757731   12.280593   19.367472   23.835242   27.567060
                        0.586580    4.844952    7.269167    8.786706    9.860887   12.345906

FGWC (a)
  (0.3, 0.25, 0.45)     0.767451    3.798240    6.031358    7.856602    9.353371   11.013605
  (0.35, 0.4, 0.25)     0.878311    4.366424    7.217266    8.898565    9.735165   11.447735
  Settings (0.7, 0.2, 0.1), (0.55, 0.15, 0.3), (0.34, 0.33, 0.33) and (0.5, 0.3, 0.2), in printed order:
                        0.644008    4.144026   10.385179   13.410133   16.670794   19.049012
                        1.052985    6.367873   10.129579   12.098781   14.404322   17.403688
                        2.089112    7.015062   11.326344   18.752688   22.601350   26.099314
                        0.404003    4.669764    6.771677    8.711904    9.435227   11.122920

(a) The β value in this algorithm is equal to the sum of β and γ in MIPFGWC.

In Fig. 15, we illustrate the average IFV values of the three algorithms by the number of clusters. The IFV values of MIPFGWC are clearly larger than those of IPFGWC and FGWC, and they increase with the number of clusters. When C = 2, the IFV values of MIPFGWC, IPFGWC and FGWC are 3.76, 1.38 and 0.97, respectively. This means that the initial value of MIPFGWC is higher than those of IPFGWC and FGWC. The average increments of MIPFGWC, IPFGWC and FGWC per cluster are 9.8, 3.12 and 3.01, respectively. Each time a cluster is added, the IFV value of MIPFGWC is 3.17 times larger than that of IPFGWC and 3.52 times larger than that of FGWC. Thus, the more clusters are used, the larger the difference between MIPFGWC and the other algorithms. These results can be used to predict the IFV values of the algorithms for a given number of clusters.

Fig. 15. The average IFV of algorithms by the number of clusters.

We also measure the computational times of the three algorithms in Table 6. The results show that the computational time of MIPFGWC is larger than those of IPFGWC and FGWC. However, the difference in computational time between these algorithms is small. In some cases, e.g. for certain values of C with (α, β, γ) = (0.3, 0.25, 0.45), MIPFGWC is even faster than IPFGWC. Thus, the computational cost of MIPFGWC is acceptable. Fig. 16 shows the average computational times of the three algorithms by the number of clusters. This figure shows that MIPFGWC is slower than IPFGWC and FGWC. Nonetheless, the differences between the three lines are small, e.g. 0.27 s (MIPFGWC vs. IPFGWC) and 0.65 s (MIPFGWC vs. FGWC) on average. This re-affirms the remark above about the computational time of MIPFGWC.

4.3.2 The evaluation by cases

From Table 5, we denote the following cases for ease of calculation.

Case 1: (α, β, γ) = (0.3, 0.25, 0.45),
Case 2: (α, β, γ) = (0.35, 0.4, 0.25),
Case 3: (α, β, γ) = (0.7, 0.2, 0.1),
Case 4: (α, β, γ) = (0.55, 0.15, 0.3),
Case 5: (α, β, γ) = (0.34, 0.33, 0.33),
Case 6: (α, β, γ) = (0.5, 0.3, 0.2).

Fig. 17 shows the IFV values of MIPFGWC for those cases. In each case, we see that the higher the number of
clusters is, the larger the value of IFV is. However, which case gives the maximal increment of IFV values? Computing the average increment of IFV in each case, we obtain 8.58, 10.28, 9.41, 9.91, 10.02 and 9.57 for Cases 1 to 6, respectively. Indeed, Case 2 gives the maximal increment of IFV. In Case 2, the value of β is maximal. Thus, besides the parameter α, the parameter β deserves attention in order to obtain large IFV values of MIPFGWC.

Table 6. Computational times (s) by geographic parameters (α, β, γ) and C. Each row lists the values for C = 2, 3, 4, 5, 6, 7.

MIPFGWC
  (0.3, 0.25, 0.45)     0.1775   0.5597   1.2513   1.3622   1.7202   2.6462
  Other settings, in printed order:
                        0.2523   0.6563   1.2005   1.2903   1.9696   2.4851
                        0.1786   0.631    0.9881   1.3555   1.7106   2.5288
                        0.218    0.6485   1.0134   1.3873   1.812    2.5304
                        0.2231   0.7059   0.9596   1.3778   1.6378   1.9538
                        0.2124   0.6338   0.9905   1.2886   2.483    2.1251

IPFGWC (a)
  (0.3, 0.25, 0.45)     0.1365   0.6268   0.8144   1.5962   1.2309   1.5775
  Other settings, in printed order:
                        0.2187   0.601    0.8845   0.9912   1.745    2.1351
                        0.1739   0.5372   0.7911   1.1504   1.3134   1.4743
                        0.1921   0.6155   0.9593   0.9485   1.2977   1.6427
                        0.2413   0.4682   0.9165   1.4878   1.2943   1.8296
                        0.2071   0.5930   0.7753   1.0305   1.1737   1.8682

FGWC (a)
  (0.3, 0.25, 0.45)     0.1092   0.3247   0.7456   0.7084   0.8612   0.9814
  Other settings, in printed order:
                        0.1048   0.3039   0.6639   0.6798   0.9009   1.0003
                        0.1201   0.3713   0.542    0.6177   0.882    1.0611
                        0.0942   0.3178   0.518    0.6467   0.7749   1.2478
                        0.0870   0.3028   0.6041   0.6989   0.8970   1.0985
                        0.1314   0.3411   0.5347   0.6416   0.7734   1.0946

(a) The β value in this algorithm is equal to the sum of β and γ in MIPFGWC.

In Fig. 18, we measure the average IFV values of the three algorithms by cases. This figure shows that the average IFV values of MIPFGWC are larger than those of IPFGWC and FGWC. The average difference between MIPFGWC and IPFGWC is 21.46 IFV. Similarly, the average difference between MIPFGWC and FGWC is 21.99 IFV. Besides, the
average IFV values of MIPFGWC are stable across the cases. The maximal difference of average IFV values between cases is 0.6 IFV. The corresponding numbers for IPFGWC and FGWC are 8.91 and 8.18 IFV, respectively. This means that MIPFGWC is nearly independent of the selection of geographic parameters, whilst IPFGWC and FGWC are not. Using the SIM2 model helps MIPFGWC run stably through the various cases.

We also evaluate the average computational times of the three algorithms by cases in Fig. 19. MIPFGWC clearly runs longer than IPFGWC and FGWC. However, the differences between the lines are small, e.g. 0.27 s (MIPFGWC vs. IPFGWC) and 0.65 s (MIPFGWC vs. FGWC) on average. Generally, this difference is acceptable in our context.

The final conclusion of Section 4.3 is as follows. The clustering quality of MIPFGWC is better than those of IPFGWC and FGWC. Even though MIPFGWC is slower than IPFGWC and FGWC, the differences between the algorithms are small, and the computational cost of MIPFGWC is acceptable. The parameter β should be large in order to obtain high IFV values of MIPFGWC. MIPFGWC is stable through the various cases.

Fig. 16. The average computational times of algorithms by the number of clusters (s).

Fig. 17. IFV of MIPFGWC by cases.

Fig. 18. The average IFV of algorithms by cases.

Fig. 19. The average computational times of algorithms by cases (s).

Conclusions

This paper aimed to enhance the clustering quality of the IPFGWC algorithm for geo-demographic analysis. A novel geographic model called SIM2, incorporating a new weighting function with the spatial interaction – modification principle, was presented. It was integrated into the main part of IPFGWC to form a new algorithm, MIPFGWC. Theoretical and experimental analyses of MIPFGWC were performed to verify the effectiveness of the proposed algorithm. The results showed that the clustering quality of MIPFGWC is better than those of other algorithms. Consistency and robustness analyses were made, and remarks about parameter selection were given in order to obtain the best clustering quality of the new algorithm. Further research on this theme will investigate the cases when the pattern set is distributed and has missing values. Applications of our method to real practical problems will also be considered.

Acknowledgements

The authors are greatly indebted to the Editors-in-Chief, Prof. H. Fujita and Prof. J. Lu; the anonymous reviewers; Prof. Pham Ky Anh, VNU; and Prof. Nguyen Dinh Hoa, VNU, for their comments and valuable suggestions that improved the quality and clarity of the paper. Thanks also go to Ms. Bui Thi Cuc for some experimental work. This work is sponsored by NAFOSTED under contract No. 102.01-2012.14 and the VNU Project “Study about some clustering methods in Geographic Information Systems”.

References

[1] J.C. Bezdek, R. Ehrlich, et al., FCM: the fuzzy c-means clustering algorithm, Computers and Geosciences 10 (1984) 191–203.
[2] M. Birkin, G.P. Clarke, Spatial interaction in geography, Geography Review (5) (1991) 16–24.
[3] H. Chunchun, M. Lingkui, S. Wenzhong, Fuzzy clustering validity for spatial data, Geo-spatial Information Science 11 (3) (2008) 191–196.
[4] Z. Feng, R. Flowerdew, Fuzzy geodemographics: a contribution from fuzzy clustering methods, in: S. Carver (Ed.), Innovations in GIS 5, Taylor & Francis, London, 1998, pp. 119–127.
[5] J. Ji, W. Pang, C. Zhou, X. Han, Z. Wang, A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data, Knowledge-Based Systems 30 (2012) 129–135.
[6] M. Khashei, A.Z. Hamadani, M. Bijari, A fuzzy intelligent approach to the classification problem in gene expression data analysis, Knowledge-Based Systems 27 (2012) 465–474.
[7] G.A. Mason, R.D. Jacobson, Fuzzy geographically weighted clustering, in: Proceedings of the 9th International Conference on GeoComputation, Maynooth, Eire, Ireland, 2007 (electronic proceedings on CD-ROM).
[8] L.H. Son, P.L. Lanzi, B.C. Cuong, H.A. Hung, Data mining in GIS: a novel context-based fuzzy geographically weighted clustering algorithm, International Journal of Machine Learning and Computing (3) (2012) 235–238.
[9] L.H. Son, B.C. Cuong, P.L. Lanzi, N.T. Thong, A novel intuitionistic fuzzy clustering method for geo-demographic analysis, Expert Systems with Applications 39 (10) (2012) 9848–9859.
[10] UNSD Statistical Databases, 2011. Demographic Yearbook (accessed 14.07.12).
[11] X. Yin, T. Shu, Q. Huang, Semi-supervised fuzzy clustering with metric learning and entropy regularization, Knowledge-Based Systems 35 (2012) 304–311.
[12] S.M.R. Zadegan, M. Mirzaie, F. Sadoughi, Ranked k-medoids: a fast and accurate rank-based partitioning algorithm for clustering large datasets, Knowledge-Based Systems 39 (2013) 133–143.

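The Case 1 figures reported in Section 4.2.1 can be re-derived from the printed matrices. The sketch below is not the authors' code: the helper names `squared_difference` and `row_sums` are hypothetical, the norm in Eq. (93) is assumed to be the sum of squared entry-wise differences (the reading that comes out close to the reported 1.89), and each $w^{AVG}_k$ is assumed to be the k-th row sum of the weight matrix in Eq. (94). Under these assumptions the script recomputes the quantities in Eqs. (93) and (95)–(97) and the experimental column of Table 2.

```python
# Case 1 values as printed in Eqs. (89), (91) and (94).
U_MIPFGWC = [
    [0.023007, 0.927891, 0.049102],
    [0.091148, 0.384451, 0.524402],
    [0.001554, 0.994834, 0.003612],
    [0.111198, 0.261117, 0.627685],
    [0.379026, 0.251436, 0.369538],
    [0.050533, 0.886195, 0.063272],
    [0.999511, 0.000320, 0.000169],
    [0.012554, 0.041774, 0.945672],
]
U_IPFGWC = [
    [0.044201, 0.469462, 0.486337],
    [0.062008, 0.228704, 0.709288],
    [0.034854, 0.497449, 0.467697],
    [0.068561, 0.184172, 0.747267],
    [0.213137, 0.178216, 0.608646],
    [0.054508, 0.458405, 0.487087],
    [0.499723, 0.034214, 0.466063],
    [0.031181, 0.132587, 0.836232],
]
W = [
    [0.0, 0.018119, 0.009682],
    [0.018119, 0.0, 0.060905],
    [0.009682, 0.060905, 0.0],
]


def squared_difference(u1, u2):
    """Sum of squared entry-wise differences of two membership matrices.

    Assumption: this is the quantity written as a norm in Eq. (93).
    """
    return sum(
        (a - b) ** 2
        for row1, row2 in zip(u1, u2)
        for a, b in zip(row1, row2)
    )


def row_sums(w):
    """Row sums of the weight matrix, read here as w^AVG_k (assumption)."""
    return [sum(row) for row in w]


membership_gap = squared_difference(U_MIPFGWC, U_IPFGWC)
w_avg = row_sums(W)
total_weight = sum(w_avg)                    # sum_k sum_i w_ki
total_products = sum(s * s for s in w_avg)   # sum_k sum_i sum_j w_ki * w_kj

print("membership gap (compare Eq. 93):", round(membership_gap, 2))
print("w^AVG per area (compare Eqs. 95-97):", [round(s, 6) for s in w_avg])
print("sum of weights (compare Table 2):", round(total_weight, 6))
print("sum of weight products (compare Table 2):", round(total_products, 3))
```

Each membership row sums to 1 up to rounding, which is a quick way to confirm that a matrix has been transcribed row by row rather than column by column.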
Date posted: 16/12/2017, 03:20