Chapter 8 Filter Based Feature Selection Schemes
8.3.3 Experiments with Corpus Based (CB) Scheme
As mentioned previously, for the corpus based scheme, the similarity between documents is central to computing the strength of the words. In view of this, the similarity values between document pairs were computed and are displayed as histograms for the three different datasets in Figure 8.5.
Figure 8.5: Distribution of similarity values between the record pairs for the different datasets
It can be seen that, for all three datasets investigated, a large proportion of the document pairs have similarity values between 0.05 and 0.2. In fact, very few document pairs have similarity values above 0.5. The experiments carried out using the corpus based approach used different threshold values to determine which documents are similar.
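The pairwise similarity histogram described above can be sketched as follows. This is only an illustration: it assumes cosine similarity over raw word counts, which may not be the similarity measure used in the experiments, and the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors (assumed measure)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(docs):
    """Similarity value for every unordered pair of documents."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    return [cosine_similarity(x, y) for x, y in combinations(bags, 2)]


def histogram(values, bin_width=0.05):
    """Count how many similarity values fall in each bin of [0, 1]."""
    n_bins = round(1.0 / bin_width)
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        index = min(int(v / bin_width), n_bins - 1)
        counts[index] += 1
    return counts


# Toy documents, invented purely for illustration.
docs = [
    "printer paper jam in tray",
    "paper jam reported in printer tray two",
    "network cable unplugged",
]
sims = pairwise_similarities(docs)
print(histogram(sims, bin_width=0.25))
```

A similarity threshold (ST) then simply selects the pairs whose value exceeds it, so the shape of this histogram indicates how many pairs survive a given threshold.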
The plots in Figure 8.6 show the variation of the strength measure when the features are arranged in ascending order in terms of strength. As can be seen from the graphs, the strength measure varies smoothly. The strength profiles vary slightly with the similarity threshold that has been set: for a lower threshold, the strength measures are found to be lower. For the Area and Solid datasets, the strength measure drops to almost zero after the top 400 and 300 features respectively, implying that the words beyond these are not very important. For the CDP dataset, this point is at about 550 features.
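As a rough illustration of how such a cut-off point might be located, the sketch below ranks words by strength and counts those whose strength is not negligible. The 1% tolerance, the function names and the strength values are assumptions made for this example, not values from the experiments.

```python
def rank_by_strength(strengths):
    """Return (word, strength) pairs, strongest first."""
    return sorted(strengths.items(), key=lambda item: item[1], reverse=True)


def count_informative(strengths, tolerance=0.01):
    """Number of words whose strength exceeds `tolerance` times the maximum.

    `tolerance` is a hypothetical knob for what counts as "almost zero".
    """
    values = list(strengths.values())
    if not values:
        return 0
    cutoff = tolerance * max(values)
    return sum(1 for v in values if v > cutoff)


# Invented strength values, for illustration only.
strengths = {"jam": 5.0, "printer": 3.0, "tray": 1.0, "the": 0.02, "a": 0.0}
ranked = rank_by_strength(strengths)
n_keep = count_informative(strengths)
selected = [word for word, _ in ranked[:n_keep]]
print(n_keep, selected)
```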
Figure 8.6: Strength Measures for Area, Solid and CDP datasets
Figure 8.7 shows the accuracy variation profiles for selected datasets and various thresholds. The last plot refers to the situation where the similarity between documents is based on the class labels. It can be seen that the accuracy variation profiles are similar to those obtained from the other methods. Given its smaller number of examples, the variation for the Solid dataset is slightly higher than for the others. It has an interquartile range (IQR) of about 5%, whereas the IQR for the other datasets is about 2-3%.
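The per-point statistics plotted in such accuracy profiles (minimum, quartiles, average, median, maximum) and the IQR quoted above can be computed from repeated trials as in the following sketch. The accuracy values are made up, and the choice of the "inclusive" quantile method is an assumption; other quantile conventions give slightly different quartiles.

```python
import statistics


def summarise_trials(accuracies):
    """Summary statistics of the accuracies from repeated trials at one feature count."""
    q1, median, q3 = statistics.quantiles(accuracies, n=4, method="inclusive")
    return {
        "minimum": min(accuracies),
        "q1": q1,
        "average": statistics.fmean(accuracies),
        "median": median,
        "q3": q3,
        "maximum": max(accuracies),
        "iqr": q3 - q1,  # interquartile range
    }


# Hypothetical accuracies (in %) from five trials, for illustration only.
trial_accuracies = [60.0, 62.0, 63.0, 64.0, 66.0]
summary = summarise_trials(trial_accuracies)
print(summary["iqr"])
```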
[Figure 8.6 panels: Area (ST = 0.500, 0.300, 0.100, 0.005), Solid (ST = 0.250, 0.150, 0.050, 0.005) and CDP (ST = 0.575, 0.375, 0.175, 0.005). x-axis: Feature Number; y-axis: Strength-Measure.]
Figure 8.7: Accuracy values for Corpus based feature reduction for selected datasets and thresholds
Figure 8.8 shows the accuracy values, averaged over 15 trials, for the various similarity thresholds selected (e.g. ST = 0.005, 0.1, ...). The accuracy obtained using the class based approach is also included. The profiles closely follow those obtained by the previous algorithms: an initial stable region is followed by a drop region. For most datasets the drop is about 20-25%. However, in the case of the Area dataset, a drop of about 50% is observed.
[Figure 8.7 panels: Corpus-Solid-0.5, Corpus-CDP-0.375, Corpus-Call-Type-0.005 and Corpus-Area-Class-Based. x-axis: No. of remaining features; y-axis: Accuracy. Legend: Minimum, 25th-percentile, Average, Median, 75th-percentile, Maximum.]
Figure 8.8: Averaged accuracy values for Corpus based feature reduction for various threshold values and datasets
[Figure 8.8 panels: Corpus-Esc, Corpus-CDP, Corpus-Call-Type, Corpus-Area and Corpus-Solid, with "Stable region" and "Drop region" annotations. x-axis: No. of remaining features; y-axis: Accuracy. Curves: ST = 0.005, 0.100, 0.300, 0.500 and Class-based (Esc, Call-Type, Area); ST = 0.005, 0.175, 0.375, 0.575 and Class-based (CDP); ST = 0.005, 0.150, 0.250, 0.500 and Class-based (Solid).]
It is interesting to observe that, for the datasets studied, the accuracy variation in the stable region is not affected much by the similarity threshold value. In the drop region, however, the lower thresholds seem to provide slightly better results in most cases. This can be roughly explained as follows. In the stable region, irrespective of the threshold, most of the features have a strength measure close to zero, so removing them does not really affect the classification accuracy. The drop region, as observed in Figure 8.6, corresponds to the region in which the strength measures tend to be lower for lower thresholds. Removing these lower strength words therefore results in a smaller reduction in accuracy, and hence slightly better results for the lower thresholds.
The class based approach performs as well as the lower threshold settings. One distinct difference, seen for all the datasets except Solid, is that the drop in accuracy is not as severe for the class based approach. For example, for the Area dataset, the accuracy from the class based approach is as high as 62%, as opposed to about 30% for the threshold values. Table 8.3 below highlights these differences when the number of features is less than 25. However, it is very unlikely that one would operate at such reduced accuracies.
Table 8.3: Average accuracies for class based and threshold based schemes

Dataset      Approximate drop-point   Approximate reduction,   Approximate reduction,
             (no. of features)        class based (%)          threshold based (%)
Area         400                      18                       50
Call-Type    400                      20                       25
Esc          400                      10                       20
CDP          750                      8                        20
Solid        300                      20                       20
Most often we are interested in operating in the stable region. Under these circumstances, the setting of the similarity threshold value is not very critical. In this event, the corpus based scheme becomes reasonably attractive since it does not require labelled examples for feature selection.
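To illustrate why no class labels are needed, the sketch below computes a strength score from document similarity alone. It assumes the classic term-strength definition, i.e. the estimated probability that a word occurs in one document of a similar pair given that it occurs in the other. The strength measure defined earlier in this chapter may well differ (the values plotted in Figure 8.6 exceed 1), and Jaccard similarity is used here only as a stand-in; all names and the toy documents are hypothetical.

```python
from itertools import permutations


def jaccard(x, y):
    """Jaccard similarity between two word sets (stand-in similarity measure)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def term_strength(docs, similarity, threshold):
    """Estimate P(word in y | word in x) over ordered pairs (x, y) of similar documents.

    Only document pairs whose similarity exceeds `threshold` are used.
    Words that never occur in a similar pair receive no entry.
    """
    in_first = {}
    in_both = {}
    for x, y in permutations(docs, 2):
        if similarity(x, y) <= threshold:
            continue
        for word in x:
            in_first[word] = in_first.get(word, 0) + 1
            if word in y:
                in_both[word] = in_both.get(word, 0) + 1
    return {w: in_both.get(w, 0) / n for w, n in in_first.items()}


# Toy documents as word sets; no class labels are involved anywhere.
docs = [
    {"printer", "paper", "jam"},
    {"printer", "jam", "tray"},
    {"network", "cable"},
]
strengths = term_strength(docs, jaccard, threshold=0.3)
print(sorted(strengths.items()))
```

Under this definition, lowering the threshold admits less similar pairs, which tends to lower the scores; this is in line with the lower strength profiles reported for lower thresholds.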