Experimental Setup and Results


The earlier version of our MCFS-ID, which, as already stated, is capable of discovering undirected networks of interdependent features, was validated on several real data sets (cf. [22,24–26]). Its current version, proposed in this chapter, was validated on a large, fairly complex real data set from [35]. Those authors, inter alia, collected the expression levels of 236 genes in CD4+ T cells activated in unbiased conditions and measured after 4 and 48 h, in conditions biased toward T helper 17 (TH17), or with the addition of IFN-β. The CD4+ T cells were sampled from the blood of 348 healthy donors of three different ancestries: European, Asian, and African-American. The authors reported an impact of ancestry on 94 of 229 genes.

Table 1  20 attributes with the highest RI score

Rank  Attribute              Rank  Attribute               Rank  Attribute
1     UTS2_Th17_48           8     HDGFRP3_Activated_48    15    OLR1_Activated_4
2     UTS2_Activated_48      9     Weight (kg)             16    LYZ_Th17_48
3     UTS2_Activated_4       10    FGL2_Activated_4        17    IFITM3_Th17_48
4     UTS2_Unstim_4          11    NPCDR1_IFNb_4           18    Age (years)
5     UTS2_IFNb_4            12    FGL2_Unstim_4           19    CYBB_Activated_48
6     MXRA7_Th17_48          13    IFITM3_Activated_48     20    CCL2_Activated_4
7     MXRA7_Activated_48     14    IFIT2_IFNb_4

Differentially responsive genes included key indicators of TH phenotype, IL17 family cytokines, and IFNG.

The main aim of our validation was to obtain a better understanding of the influence of ancestry on human immune system development and on current gene expression levels using nonlinear methods, in contrast to [35], who examined T cell responses in different populations using a linear model. In our study, we retained the observations removed from the study by [35]. This resulted in a decision table with 365 observations (objects), each with 1259 attributes. The attributes of the decision system were all gene expression features and the following donor's personal data: age, height.cm, weight.kg, bmi, systolic, diastolic and sex. The decision attribute (class) was chosen to be the donor's ancestry. The parameters of the MCFS-ID algorithm were set to their default values: s = 5000, t = 5 and m = 0.05·d (i.e., m = 63 in our case).
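As a quick illustration of these settings, the short sketch below (plain Python, with values taken from the text; it is not part of the MCFS-ID implementation itself) spells out the shape of the decision system and how the default subset size m is derived from the number of attributes d:

```python
# Shape of the decision system described above and the MCFS-ID defaults.
n_objects = 365        # donors (objects) retained in our study
d = 1259               # conditional attributes: gene expression features + personal data

s = 5000               # number of random feature subsets drawn by MCFS-ID
t = 5                  # decision trees built per subset
m = round(0.05 * d)    # attributes per subset: 0.05 * 1259 = 62.95, rounded to 63

print(n_objects, d, s, t, m)   # -> 365 1259 5000 5 63
```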

In sum, while our results are similar to those of [35], there are also differences, which are a consequence of the generality with which we take into account interdependencies between the features. For instance, two features excluded by [35] due to a high false discovery rate, namely the genes OLR1 and CCL2 activated in unbiased conditions and measured after 4 h, were returned in our study among the 20 topmost features.

MCFS-ID returned the features ranked according to their RI score and the ID Graph that shows interdependencies between the features (see Table 1 and Fig. 2).

We verified that the top ranked features are truly informative by using the first 50 of them to build several popular classifiers. In particular, using 10-fold cross-validation we obtained 78.0 % classification accuracy for KNN(5) and 82.4 % accuracy for SVM with polynomial kernel.
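This check can be reproduced along the following lines; a minimal scikit-learn sketch is shown, assuming the decision table has been loaded into a pandas DataFrame X with the ancestry labels in y, and that top50 holds the names of the 50 highest-RI attributes from the MCFS-ID ranking:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumed to be available: X (objects x attributes, pandas DataFrame),
# y (ancestry labels), top50 (names of the 50 highest-RI attributes).
X_top50 = X[top50]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
classifiers = {
    "KNN(5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM (polynomial kernel)": make_pipeline(StandardScaler(), SVC(kernel="poly")),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_top50, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean 10-fold CV accuracy = {scores.mean():.3f}")
```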

Secondly, for the 20 and for the 100 features with the highest RI scores returned by MCFS-ID, we built two sets of classifiers with the help of ROSETTA; these were in turn used as input to the Ciruvis tool. We used the same decision table as for MCFS-ID, reduced to the top 20 and the top 100 features, respectively. The feature values were discretized using Equal Frequency Binning with 3 levels. In our decision system there were three slightly unbalanced decision classes: Caucasian with 190 objects, African-American with 99 objects, and Asian with 76 objects.
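The discretization step can be sketched with scikit-learn's quantile discretizer (this is not ROSETTA's own preprocessing, only an illustration; X_top20, the decision table reduced to the 20 top-ranked attributes, is assumed to be a pandas DataFrame):

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Equal Frequency Binning with 3 levels: the quantile strategy puts roughly
# the same number of objects into each bin of every attribute.
efb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")

X_top20_binned = pd.DataFrame(
    efb.fit_transform(X_top20),          # X_top20 assumed loaded, see lead-in
    index=X_top20.index,
    columns=X_top20.columns,
)  # ordinal codes 0/1/2 correspond to the three equal-frequency levels
```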

Fig. 2 ID Graph created for the 50 attributes with the highest RI score and ID weights ≥ 6

To reduce the impact of data quality on a classifier we subsampled the datasets. For each of the cases, viz. 20 and 100 features, we thus obtained 100 subsampled datasets where the number of objects in each decision class was equal. (Actually, balancing the classes was not needed and we performed subsampling twice, without and with balancing, to obtain corresponding results; we report only those for the balanced data, as they are more transparent to interpret.) Accuracy was obtained by taking the average of the 10-fold cross-validation performed on 100 replicates returned from the subsampling for each of the datasets. The reducts were computed using the JohnsonReducer algorithm. To avoid over-fitting, the rules with support lower than five were not included.
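The resampling protocol can be sketched as follows. ROSETTA's Johnson reducer and rule filtering are not available in Python, so a plain decision tree stands in for the rule-based classifier purely to illustrate balanced subsampling and replicate-averaged 10-fold cross-validation (X_top20_binned and y as above):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier  # stand-in for the ROSETTA rule classifier

def balanced_subsample(X, y, rng):
    """Draw, without replacement, the same number of objects from every
    decision class (the size of the smallest class)."""
    n_min = y.value_counts().min()
    idx = np.concatenate([
        rng.choice(np.asarray(y.index[y == c]), size=n_min, replace=False)
        for c in y.unique()
    ])
    return X.loc[idx], y.loc[idx]

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for _ in range(100):                                   # 100 subsampled replicates
    Xb, yb = balanced_subsample(X_top20_binned, y, rng)
    scores.append(cross_val_score(DecisionTreeClassifier(random_state=0), Xb, yb, cv=cv).mean())

print(f"mean accuracy = {np.mean(scores):.3f} (SD = {np.std(scores):.3f})")
```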

The mean accuracy of the returned models based on 100 attributes was 0.691 (SD = 0.091) and based on 20 attributes was 0.671 (SD = 0.091). Rather surprisingly, the 80 % reduction of the number of attributes (from 100 to 20) caused only a minor accuracy reduction of just 3 %. The expected classifier accuracy from random guessing for three decision classes would be 33 %, which means that the obtained accuracies were over two times higher than those achievable by chance. We obtained almost 70 % likelihood of predicting the correct ancestry for unseen patients, thus confirming both the proper choice of the attributes and the high performance of our rule-based classifier.

Note that it is not the best possible rule-generating classifier that is sought here. We need one that can be considered reliable and, most importantly, one that provides rule networks that are as simple and clear-cut as possible, easy for domain experts to interpret. This is why we decided to use a rule-generating classifier that requires discretization of the data. By way of example, while discretizing blood pressure into just three crude levels of low, medium and high cannot yield the best possible classification results, it is not only the most widely used discretization, but it has proved sufficient to produce reasonable and reliable results.

All rules from the 100 replicates were pooled and filtered to remove duplicates and rules that were supersets of more significant rules. The rules' p-values were calculated using the hypergeometric distribution; cf. [28]. The rules were then ranked by these p-values, each of which represents the probability that a random selection of as many objects as the rule's support would contain a fraction of objects from the given decision class at least as large as the rule's accuracy. Finally, we built rule networks using Ciruvis with default settings.
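A hedged sketch of this p-value computation, using scipy's hypergeometric distribution (the numbers in the example call are illustrative, not taken from the study):

```python
from scipy.stats import hypergeom

def rule_p_value(n_total, n_class, support, hits):
    """Probability that a random selection of `support` objects out of
    `n_total` contains at least `hits` objects of a decision class that
    has `n_class` members in the decision table."""
    return hypergeom.sf(hits - 1, n_total, n_class, support)

# Illustrative call: a balanced replicate of 3 x 76 = 228 objects, and a rule
# with support 12 of which 10 objects belong to the target class.
print(rule_p_value(n_total=228, n_class=76, support=12, hits=10))
```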
