Experiments with Corpus Based (CB) Scheme

Part of the document Mining of textual databases within the product development process (pages 174 - 180)

Chapter 8 Filter Based Feature Selection Schemes

8.3.3 Experiments with Corpus Based (CB) Scheme

As mentioned previously, the corpus based scheme relies on the similarity between documents to compute the strength of the words. The similarity values between document pairs were therefore computed and are shown as histograms for the three datasets in Figure 8.5.

Figure 8.5: Distribution of similarity values between the record pairs for the different datasets

It can be seen that, for all three datasets investigated, a large proportion of the document pairs have similarity values between 0.05 and 0.2. In fact, very few document pairs have similarity values above 0.5. The experiments carried out using the corpus based approach used different threshold values to determine which documents are similar.
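The pairwise similarity computation and the thresholding step can be sketched as follows. This is a minimal illustration only: the section does not restate which similarity measure was used, so cosine similarity over whitespace-tokenised word counts is an assumption, as are the function names, the bin width of 0.05, and the choice of "at or above" for the threshold test.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(docs):
    """Return {(i, j): similarity} for every unordered document pair."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    return {
        (i, j): cosine_similarity(bags[i], bags[j])
        for i, j in combinations(range(len(bags)), 2)
    }


def histogram(values, bin_width=0.05):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    n_bins = int(round(1.0 / bin_width))
    counts = [0] * n_bins
    for v in values:
        k = min(int(v / bin_width), n_bins - 1)  # a similarity of 1.0 goes in the last bin
        counts[k] += 1
    return counts


def similar_pairs(similarities, threshold):
    """Pairs whose similarity is at or above the similarity threshold (ST)."""
    return [pair for pair, s in similarities.items() if s >= threshold]
```

Calling `histogram(pairwise_similarities(docs).values())` gives the kind of distribution plotted in Figure 8.5, and `similar_pairs` selects the document pairs treated as similar for a given threshold.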

The plots in Figure 8.6 show the variation of the strength measure when the features are arranged in ascending order of strength. As can be seen from the graphs, the strength measure varies smoothly. The strength profiles vary slightly depending on the similarity threshold that has been set; for a lower threshold, the strength measures are found to be lower. For the Area and the Solid datasets, the strength measure drops to almost zero after the top 400 and 300 features respectively, implying that the words beyond these are not very important. For the CDP dataset this value is about 550 features.

Figure 8.6: Strength Measures for Area, Solid and CDP datasets
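The strength measure itself is defined earlier in the source and is not reproduced in this section. As a stand-in, the sketch below uses the classical term-strength estimate, the conditional probability that a term occurring in one document of a similar pair also occurs in the other. This is an assumption made for illustration and is not necessarily the definition behind Figure 8.6; the function names are hypothetical.

```python
from collections import defaultdict


def term_strength(doc_terms, related_pairs):
    """Estimate s(t) = P(t in y | t in x) over ordered related pairs (x, y).

    doc_terms: list of sets of terms, one set per document.
    related_pairs: iterable of unordered index pairs (i, j) judged similar,
                   for example those above a similarity threshold.
    """
    in_first = defaultdict(int)  # ordered pairs whose first document contains t
    in_both = defaultdict(int)   # ordered pairs where both documents contain t
    for i, j in related_pairs:
        for x, y in ((i, j), (j, i)):  # count each pair in both directions
            for term in doc_terms[x]:
                in_first[term] += 1
                if term in doc_terms[y]:
                    in_both[term] += 1
    vocabulary = set().union(*doc_terms)
    # Terms that never occur in a related pair get strength 0.
    return {
        t: (in_both[t] / in_first[t]) if in_first[t] else 0.0
        for t in vocabulary
    }


def rank_by_strength(strengths):
    """Terms sorted from strongest to weakest (ties broken alphabetically)."""
    return sorted(strengths, key=lambda t: (-strengths[t], t))
```

Because the set of related pairs depends on the similarity threshold, recomputing `term_strength` for several thresholds yields one strength profile per threshold, which is the comparison made in Figure 8.6.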

Figure 8.7 shows the accuracy variation profiles for selected datasets and thresholds. The last plot refers to the situation where the similarity between documents is based on the class labels. The accuracy variation profile is similar to those obtained from the other methods. Given its smaller number of examples, the Solid dataset shows slightly higher variation than the others: it has an inter-quartile range (IQR) of about 5%, whereas the IQR for the other datasets is about 2-3%.
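For reference, the summary statistics behind an IQR figure can be computed as in the following sketch. The linear-interpolation percentile and the function names are assumptions; the source does not say which quantile convention was used.

```python
def percentile(sorted_values, p):
    """Linear-interpolation percentile (0 <= p <= 100) of pre-sorted data."""
    if not sorted_values:
        raise ValueError("no data")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarise_trials(accuracies):
    """Summarise the accuracies of repeated trials at one feature count."""
    ordered = sorted(accuracies)
    q1 = percentile(ordered, 25)
    q3 = percentile(ordered, 75)
    return {
        "minimum": ordered[0],
        "q1": q1,
        "median": percentile(ordered, 50),
        "average": sum(ordered) / len(ordered),
        "q3": q3,
        "maximum": ordered[-1],
        "iqr": q3 - q1,  # inter-quartile range
    }
```

Applying `summarise_trials` to the trial accuracies at each number of remaining features gives the minimum, quartile, median, average and maximum curves of an accuracy variation profile.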

[Figure 8.6 plots: Strength-Measure against Feature Number. Area: ST = 0.500, 0.300, 0.100, 0.005. Solid: ST = 0.250, 0.150, 0.050, 0.005. CDP: ST = 0.575, 0.375, 0.175, 0.005.]

Figure 8.7: Accuracy values for Corpus based feature reduction for selected datasets and thresholds

Figure 8.8 shows the accuracy values, averaged over 15 trials, for the various similarity thresholds selected (e.g. ST = 0.005, 0.1, ...). The accuracy obtained using the class based approach is also included. The profiles closely follow those obtained by the previous algorithms: an initial stable region is followed by a drop region. For most datasets the drop is about 20-25%. However, in the case of the Area dataset, a drop of about 50% is observed.
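One way to make the split between the stable region and the drop region concrete is sketched below. The criterion (accuracy falling more than a fixed tolerance below the accuracy of the largest feature set) is an assumption, as are the function names; the source appears to read the drop points off the plots approximately.

```python
def find_drop_point(curve, tolerance=2.0):
    """Return the smallest feature count still inside the stable region.

    curve: list of (n_features, accuracy_percent) pairs.
    tolerance: allowed loss, in percentage points, relative to the accuracy
               obtained with the largest feature set (an assumed criterion).
    """
    ordered = sorted(curve, reverse=True)  # most features first
    baseline = ordered[0][1]
    drop_point = ordered[0][0]
    for n_features, accuracy in ordered:
        if baseline - accuracy > tolerance:
            break  # entered the drop region
        drop_point = n_features
    return drop_point


def accuracy_reduction(curve, drop_point):
    """Loss in accuracy between the drop point and the smallest feature set."""
    by_count = dict(curve)
    return by_count[drop_point] - by_count[min(by_count)]


# Made-up curve for illustration only; these are not values from the source.
example_curve = [(1600, 80.0), (800, 80.5), (400, 79.0), (200, 70.0), (25, 30.0)]
example_drop_point = find_drop_point(example_curve)
example_reduction = accuracy_reduction(example_curve, example_drop_point)
```

The tolerance controls how strict the notion of "stable" is; a larger tolerance moves the drop point towards fewer features.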

[Figure 8.7 plots: Accuracy against No. of remaining features. Panels: Corpus-Solid-0.5, Corpus-CDP-0.375, Corpus-Call-Type-0.005, Corpus-Area-Class-Based. Curves: minimum, 25th percentile, average, median, 75th percentile, maximum.]

Figure 8.8: Averaged accuracy values for Corpus based feature reduction for various threshold values and datasets

[Figure 8.8 plots: Accuracy against No. of remaining features. Corpus-Esc: ST = 0.005, 0.100, 0.300, 0.500, class-based. Corpus-CDP: ST = 0.005, 0.175, 0.375, 0.575, class-based. Corpus-Call-Type: ST = 0.005, 0.100, 0.300, 0.500, class-based. Corpus-Area: ST = 0.005, 0.100, 0.300, 0.500, class-based, with the stable region and drop region marked. Corpus-Solid: ST = 0.005, 0.150, 0.250, 0.500, class-based.]

It is interesting to observe that, for the datasets studied, the accuracy in the stable region is largely unaffected by the similarity threshold value. In the drop region, however, the lower thresholds in most cases give slightly better results. A rough explanation is that, in the stable region, most of the features have a strength measure close to zero irrespective of the threshold, so removing them has little impact on the classification accuracy. The drop region, as observed in Figure 8.6, corresponds to the region in which the strength measures tend to be lower for lower thresholds. Removing these lower-strength words therefore causes a smaller reduction in accuracy, which gives the slightly better results observed for the lower thresholds.

The class based approach performs as well as the lower threshold settings. One distinct difference, seen for all the datasets except Solid, is that the drop in accuracy is not as severe for the class based approach. For example, for the Area dataset the accuracy from the class based approach is as high as 62%, as opposed to about 30% for the other threshold values. Table 8.3 below highlights these differences when the number of features is less than 25. However, it is very unlikely that one would operate at such reduced accuracies.

Table 8.3: Average accuracies for class based and threshold based schemes

Dataset      Approximate drop      Approximate reduction,    Approximate reduction,
             point (features)      class based (%)           threshold based (%)
Area         400                   18                        50
Call-Type    400                   20                        25
Esc          400                   10                        20
CDP          750                   8                         20
Solid        300                   20                        20
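The comparison in Table 8.3 can be made explicit with a little arithmetic. The sketch below copies the table values and computes, per dataset, how much larger the reduction is for the threshold based scheme than for the class based scheme. The variable and function names are hypothetical, and the reductions are assumed to be in percent, as the surrounding text suggests.

```python
# (dataset, approximate drop point, reduction class based, reduction threshold based)
TABLE_8_3 = [
    ("Area", 400, 18, 50),
    ("Call-Type", 400, 20, 25),
    ("Esc", 400, 10, 20),
    ("CDP", 750, 8, 20),
    ("Solid", 300, 20, 20),
]


def reduction_gap(rows):
    """Threshold based reduction minus class based reduction, per dataset."""
    return {
        name: threshold_based - class_based
        for name, _drop_point, class_based, threshold_based in rows
    }


gaps = reduction_gap(TABLE_8_3)
# gaps == {"Area": 32, "Call-Type": 5, "Esc": 10, "CDP": 12, "Solid": 0}
```

The gap is largest for Area and zero for Solid, which matches the observation that Solid is the exception.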

Most often we are interested in operating in the stable region. Under these circumstances, the setting of the similarity threshold value is not very critical. In this event, the corpus based scheme becomes reasonably attractive since it does not require labelled examples for feature selection.
