Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 98 (2016) 368 – 373

The 6th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH 2016)

A Framework for Clustering Cardiac Patient’s Records Using Unsupervised Learning Techniques

Rao Muzamal Liaqat*, Bilal Mehboob, Nazar Abbas Saqib, Muazzam A. Khan
{muzamal.liaqat14, bilal.mehboob14, nazar.abbas, muazzamak}@ce.ceme.edu.pk
National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan

Abstract

Today we are surrounded by large volumes of data related to patients' health reports. In this paper we introduce a methodology for extracting useful information (patterns) from raw data using different unsupervised learning techniques. These hidden patterns help the practitioner understand the hidden relations (dependencies) among the data. With useful clustering we can predict hidden trends in patients. We use the correlation matrix followed by K-means (fast) to extract interesting patterns as well as the patient state, which helps the practitioner treat the patient wisely. According to the nature of the data, we categorize heart patients as normal, moderate, risk and critical. We apply different clustering algorithms and analyze the performance of each on the cardiac dataset. For this research we used a real dataset provided by AFIC (Armed Forces Institute of Cardiology). The dataset consists of 1500 records with 36 attributes.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Conference Program Chairs.

Keywords: Clustering; data mining; Unsupervised Learning; K-Mean (fast)

1. Introduction

It is common practice that a patient
comes to the doctor and, after routine procedures and tests, the doctor examines the patient and makes a diagnosis. As a result, a large amount of data remains unexplored in hospitals, which raises a significant problem in the healthcare domain. Certain questions then arise, e.g. “How can we get useful information from the data? Is there any hidden relation in the data that reveals a specific pattern to practitioners so that they can make wise decisions?” All of these can be answered by using data mining and machine learning algorithms to indicate the unseen or hidden patterns [1].

* Corresponding author. Tel.: +92-51-222-9561; fax: +92-51-927-8257. E-mail address: muzamal.liaqat14@ce.ceme.edu.pk
1877-0509. doi:10.1016/j.procs.2016.09.056

Nowadays we are surrounded by large datasets related to patient history. However, the current patient databases are not informative enough to extract useful information or to track a patient's disease. It is believed that, by using data mining techniques, a lot of hidden information can be extracted by discovering hidden patterns and correlations among attributes. Statistics is currently a very popular and commonly used technique for analyzing medical data. Researchers use different statistical tools and software to analyze the data and extract useful information. In our work we use data mining algorithms, which are more reliable than statistical models; we also compute the performance of the different algorithms. Basically, there are two types of algorithms used in data mining. The first is supervised learning (in which we have a labeled training dataset, e.g. SVM, Naïve Bayes). The second is unsupervised learning (in which we have
no training dataset or label attribute, e.g. K-Means, DBSCAN). The main focus of this paper is to extract hidden patterns and correlations among different attributes that will assist the practitioner in writing a better prescription for heart patients. In this paper we use unsupervised techniques such as K-means, K-means (fast), DBSCAN and K-medoids to find the hidden clusters and patterns for heart patients. The remainder of the paper is organized as follows. Section 2 describes the literature review and Section 3 describes the methodology; the detailed analysis of the clusters and the performance of the results follow, and the paper closes with the conclusion and future work.

2. Literature Review

A lot of work has been carried out in the literature on medical data analysis, to discover hidden patterns and extract useful information from large data by applying data mining techniques. Conventionally, information was extracted from data manually by professionals, which becomes impractical when the dataset grows in volume as well as in dimension. To deal with such data we need computing technologies. In the medical domain most of the work has been carried out on cardiac image segmentation, feature extraction, pattern recognition and correlation [7]. The decision tree is a widely used algorithm for mining hidden information and backtracking the root cause in medical data. A decision tree has a root node and leaf nodes; the leaf nodes represent concrete knowledge according to the label attribute. Commonly used decision tree algorithms are ID3, CHAID, Random Forest and Decision Stump, which are mostly used for mining useful information. Many intelligent systems have been developed to assist the practitioner in cardiac disease [10]. Researchers have used Naïve Bayes, ANN and decision trees to extract hidden patterns and correlations among attributes [11]. Our main focus is to process the data to get useful information and explore the hidden patterns. In this paper we
use the dataset provided by AFIC (Armed Forces Institute of Cardiology). The preprocessing steps and the performance of the different unsupervised learning algorithms are described in the methodology section.

3. Proposed Methodology

Our methodology for extracting hidden patterns and correlations among the attributes in the context of cardiac data is shown in Fig. 1.

Fig. 1: Knowledge Discovery Process Model

The model is divided into phases; each phase may involve certain inputs, outputs and operations. We explain each phase in detail.

3.1 Data Acquisition

Medical data mostly comes in the form of medical reports, lab reports and doctor reviews; all of these kinds of data can be categorized as unstructured data [12]. We obtained the data in report form from the Armed Forces Institute of Cardiology (AFIC). The raw data consists of 1500 records with 50 attributes. We then derive the target data from the raw data by applying feature selection on the basis of attribute weights and expert opinion.

3.2 Target Data (Attribute Selection)

The target data is the data of interest, mined from the raw data. We select the target attributes from the raw data by assigning weights to the attributes using the correlation matrix and the consensus of experts. The correlation operator applied to the cardiac patient data is shown in Fig. 2.

Fig. 2: Correlation Matrix

Fig. 3: Weight Assigned by Correlation Matrix

The correlation matrix assigns a different weight to each attribute; the weight of each attribute is shown in Fig. 3. Using the weights assigned by the correlation matrix together with expert opinion, we selected 16 attributes. We then extract the hidden patterns among these attributes using different data mining algorithms.

3.3 Preprocessed Data

In this step we make our data compatible with machine learning algorithms by applying some preprocessing steps. Our data usually contains missing values; to remove these values we apply
filtering so that more reliable results can be extracted from the data. In this paper our work concerns clustering (k-means, DBSCAN, k-means (fast), k-medoids). For this we have to convert the nominal and polynominal data into numeric form, because k-means does not work on such types of data. In the “Report Category” attribute we have the labels Normal, Moderate, Risk and Critical; these labels are replaced by the numeric values 0, 1, 2 and 3 respectively.

3.4 Transformed Data

Data transformation is carried out by running certain scripts on the data. It is closely related to data preprocessing steps such as data cleansing (in which we smooth the data by applying filtering to mitigate abrupt changes). Data reduction is also an important step in data transformation; it removes or excludes columns that are redundant or have zero effect on the overall results, as shown in Fig. 4.

Fig. 4: Transform Data to Exclude Column

3.5 Patterns/Models

This phase describes the hidden patterns extracted from the data. We explain the hidden patterns briefly in the results and discussion section; before that, we make some assumptions for better understanding and visualization of the results. These assumptions follow universal standards and expert recommendations. In our data the BMI column has a different range of values. According to the standard we can categorize BMI into four groups: 18 to 24 (normal weight), 25 to 30 (overweight), 31 onward (obesity) and ... 50 patient is normal; otherwise we categorize the patient as abnormal or affected, as shown in Fig. 10.

Acknowledgement

I am grateful to AFIC, Pakistan for providing me with the dataset for this research study. I am thankful to my HOD, Dr Shoab A Khan, for helping and guiding me during this work. I am also thankful to Dr Aqib Malik, RMO, EME College, for assisting me in this research.

References

1. K. Aziz, S. Aziz, Evaluation and Comparison of
Coronary Heart Disease Risk Factor Profiles of Children in a Country with Developing Economy.
2. Abu Khousa, E., Campbell, P., "Predictive data mining to support clinical decisions: An overview of heart disease prediction systems," Innovations in Information Technology (IIT), 2012 International Conference on, pp. 267-272, 2012.
3. Rao, R. B., Krishnan, S., & Niculescu, R. S. (2006), Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1), 3-10.
4. Kajabadi, A., Saraee, M. H., & Asgari, S. (2009, October). Data mining cardiovascular risk factors. In Application of Information and Communication Technologies, 2009. AICT 2009. International Conference on (pp. 1-5). IEEE.
5. Giudici, P., "Applied Data Mining: Statistical Methods for Business and Industry", New York: John Wiley, 2003.
6. Wamiq M. Ahmed (2008), Knowledge representation and data mining for biological imaging, Purdue University Cytometry Laboratories, Bindley Bioscience Center, 1203 W. State Street, West Lafayette, IN 47907, USA.
7. J. J. Sychra, D. G. Pavel, E. Olea (1988), Classification Images of Cardiac Wall Motion Abnormalities.
8. R. Bharat Rao, Glenn Fung, Balaji Krishnapuram (2010), Mining Medical Images.
9. J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, USA, 2011. http://docs.rapidi.com/files/rapidminer/RapidMiner_OperatorReference_en.pdf
10. Palaniappan, S., & Awang, R., "Intelligent heart disease prediction system using data mining technique". IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 8, 2008.
11. Ms. Ishtake S. H., Prof. Sanap S. A., Intelligent Heart Disease Prediction System Using Data Mining Techniques, International J. of Healthcare & Biomedical Research, Volume 1, pp. 94-101, 2013.
12. Unstructured Data Mining: The Tools You Need to Dig the Deep Web, posted February 13, 2013 @ 3:41 pm by Scott Raspa, http://www.ikanow.com/blog/02/13/unstructured-data-mining-digthe-deep-web
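The attribute weighting, label encoding and clustering steps of Sections 3.2 and 3.3 can be illustrated with a minimal sketch. The paper's experiments appear to rely on RapidMiner operators (see reference 9), so the pure-Python code below is not the authors' implementation. The function names, the rule of weighting an attribute by its absolute Pearson correlation with the report category, the plain Lloyd's k-means (standing in for K-means (fast)), the numeric codes 2 and 3 (inferred from the stated label order), and the six synthetic records with a hypothetical "noise" attribute are all assumptions made for illustration.

```python
import math
import random

# Label encoding of "Report Category" (Section 3.3); codes 2 and 3 are
# inferred from the order Normal, Moderate, Risk, Critical.
REPORT_CATEGORY = {"Normal": 0, "Moderate": 1, "Risk": 2, "Critical": 3}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def attribute_weights(columns, target):
    """Assumed weighting rule: |correlation| of each attribute with the target."""
    return {name: abs(pearson(vals, target)) for name, vals in columns.items()}


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Synthetic, made-up records (NOT the AFIC data): (bmi, second attribute, label).
records = [
    (22.0, 60.0, "Normal"), (23.5, 58.0, "Normal"), (24.0, 62.0, "Normal"),
    (33.0, 35.0, "Critical"), (35.5, 30.0, "Critical"), (34.0, 32.0, "Critical"),
]
target = [REPORT_CATEGORY[r[2]] for r in records]
columns = {
    "bmi": [r[0] for r in records],
    "noise": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],  # hypothetical weak attribute
}
weights = attribute_weights(columns, target)
labels, centroids = kmeans([(r[0], r[1]) for r in records], k=2)
print(weights)
print(labels)
```

In this toy setting the weakly correlated attribute receives a much lower weight than BMI, which mirrors how low-weight attributes would be dropped before clustering; on real data the final selection would also depend on expert opinion, as the paper describes.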