An SVM was used as the classifier. Each dataset was split into 70% for training and 30% for testing. This was done 5 times using a stratified sampling scheme to obtain 5 different trials for each dataset. The parameters of the SVM were tuned over a grid of powers of two, $2^k$: for the C parameter, k ranges from 1 to 11, whilst for sigma, k ranges from -11 to 1. A 3-fold cross-validation was carried out on the training set to estimate the accuracy at each parameter setting. The setting that gave the best cross-validation accuracy was then used to determine the accuracy on the corresponding test set.
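The tuning protocol can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`stratified_split`, `make_trials`, `select_best`) and the `cv_accuracy` callback are hypothetical, the rounding rule for the 70% cut is an assumption, and the SVM training itself is left to the caller because the kernel and library are not stated here.

```python
import random
from collections import defaultdict
from itertools import product

# Grid from the text: C = 2^k for k = 1..11, sigma = 2^k for k = -11..1.
C_VALUES = [2.0 ** k for k in range(1, 12)]
SIGMA_VALUES = [2.0 ** k for k in range(-11, 2)]
PARAM_GRID = list(product(C_VALUES, SIGMA_VALUES))


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the per-class cut is rounded to the nearest record.
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def make_trials(labels, n_trials=5):
    """One stratified 70/30 split per trial, each with its own seed."""
    return [stratified_split(labels, seed=trial) for trial in range(n_trials)]


def select_best(param_grid, cv_accuracy):
    """Pick the (C, sigma) pair with the highest cross-validated accuracy.

    `cv_accuracy` is a hypothetical caller-supplied function that runs
    3-fold cross-validation on the training split and returns a float.
    """
    return max(param_grid, key=lambda params: cv_accuracy(*params))
```

With this grid there are 11 values of C and 13 values of sigma, so each trial evaluates 143 parameter settings before the chosen setting is applied to the held-out 30%.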
Duncan’s Multiple Range Test was used to determine whether the mean accuracies obtained for the different weighting schemes on the different datasets differed from one another. A significance level of 0.05 was used. The test compares all possible pairs of means. The null hypothesis is $H_0: \mu_i = \mu_j$ for all $i \neq j$, where $\mu_i$ is the mean accuracy due to the i-th weighting scheme. If all possible pairs of means were tested using t-tests, the probability of a type I error for the entire set of comparisons could be greatly increased. To help avoid this problem, Duncan’s Multiple Range Test was used (Montgomery and Runger, 1994).
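To give a sense of how quickly the type I error grows, the short sketch below computes the chance of at least one false rejection across all pairwise tests. It assumes the tests are independent, which is only an approximation for t-tests that share group means, and it takes the six weighting schemes listed in the tables as the groups. The function name `familywise_error` is illustrative.

```python
from math import comb


def familywise_error(n_groups, alpha=0.05):
    """Chance of at least one false rejection over all pairwise tests.

    Assumes the pairwise tests are independent, which is only an
    approximation for t-tests that share the same group means.
    """
    n_comparisons = comb(n_groups, 2)
    return n_comparisons, 1.0 - (1.0 - alpha) ** n_comparisons


# Six weighting schemes give 15 pairwise comparisons.
n_pairs, fwer = familywise_error(6)
print(n_pairs, round(fwer, 3))
```

Under these assumptions, 15 unadjusted comparisons at the 0.05 level carry a combined type I error probability of more than one half.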
Table 6.1: Duncan’s Groupings for the ‘KB’ format
Dataset | A | B | C
Area | Tfidf-ln, Binary | Binary, Entropy, Tfidf-ls, Tfidf | Tf-n
Call-Type | Tfidf-ln, Tfidf-ls, Binary, Entropy | Binary, Entropy, Tfidf | Tfidf, Tf-n
Esc | Binary, Tfidf-ls, Tfidf-ln, Tfidf, Entropy, Tf-n | - | -
Table 6.2: Duncan’s Groupings for the ‘Free’ format
Dataset | A | B | C
Area | Binary, Tfidf-ls, Tfidf-ln, Entropy, Tfidf | Tf-n | -
Call-Type | Tfidf-ls, Binary, Tfidf-ln, Entropy, Tfidf | Tf-n | -
Esc | Tfidf-ls, Binary, Tfidf-ln, Tfidf | Tfidf-ln, Tfidf | Entropy, Tf-n
Table 6.3: Duncan’s Groupings for the ‘Both’ format
Dataset | A
Area | Binary, Entropy, Tf-n, Tfidf-ls, Tfidf-ln, Tfidf
Call-Type | Tfidf-ls, Tf-n, Binary, Tfidf-ln, Entropy, Tfidf
Esc | Tfidf-ls, Binary, Tf-n, Tfidf, Entropy, Tfidf-ln
Tables 6.1 to 6.3 show the results for the Area, Call-Type and Esc datasets for the various data formats and weighting schemes. The weighting schemes found within each cell are not statistically significantly different from one another at a significance level of 0.05. The mean accuracies decrease as we traverse the tables from left to right. In some instances, a particular weighting scheme is found only within a single cell, while in other instances it is found in two. In the latter case, the weighting scheme is not statistically different from any of the other weighting schemes found within either of the two cells.
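The reading rule for the groupings can be expressed as a small sketch. The example uses the Call-Type row of Table 6.1, transcribed by hand, so the division into cells is an assumption about the table layout. The helper names are illustrative.

```python
# Call-Type row of Table 6.1, transcribed by hand (an assumption about
# how the table cells divide).
groups = {
    "A": ["Tfidf-ln", "Tfidf-ls", "Binary", "Entropy"],
    "B": ["Binary", "Entropy", "Tfidf"],
    "C": ["Tfidf", "Tf-n"],
}


def letters_for(scheme, groups):
    """Grouping letters whose cell contains the scheme."""
    return sorted(
        letter for letter, members in groups.items() if scheme in members
    )


def significantly_different(a, b, groups):
    """Two schemes differ at the chosen level only if they share no cell."""
    return not (set(letters_for(a, groups)) & set(letters_for(b, groups)))


print(letters_for("Binary", groups))
print(significantly_different("Tfidf-ln", "Tfidf", groups))
print(significantly_different("Binary", "Tfidf", groups))
```

Here Binary sits in both the A and B cells, so it cannot be separated from Tfidf-ln (shared cell A) or from Tfidf (shared cell B), even though Tfidf-ln and Tfidf share no cell and therefore do differ.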
Investigation of the results reveals that the performance of the different weighting schemes depends on the dataset as well as the data format. In general, for the ‘KB’ and ‘Free’ formats (Table 6.1 and Table 6.2), the tfidf-ls, tfidf-ln and binary schemes outperform the rest of the schemes. In fact, the tfidf-ls scheme was found in the top two spots of the A-grouping 7 out of 9 times, as seen from Tables 6.1-6.3.
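The 7-out-of-9 count can be checked with a short tally. The sketch below lists the first two schemes of the A-grouping for each dataset and format, in the order they appear in Tables 6.1 to 6.3. The dictionary layout and the function name `count_top_two` are illustrative choices.

```python
# First two schemes listed in the A-grouping of each row, in table order
# (Tables 6.1, 6.2 and 6.3; rows Area, Call-Type, Esc).
top_two = {
    ("KB", "Area"): ["Tfidf-ln", "Binary"],
    ("KB", "Call-Type"): ["Tfidf-ln", "Tfidf-ls"],
    ("KB", "Esc"): ["Binary", "Tfidf-ls"],
    ("Free", "Area"): ["Binary", "Tfidf-ls"],
    ("Free", "Call-Type"): ["Tfidf-ls", "Binary"],
    ("Free", "Esc"): ["Tfidf-ls", "Binary"],
    ("Both", "Area"): ["Binary", "Entropy"],
    ("Both", "Call-Type"): ["Tfidf-ls", "Tf-n"],
    ("Both", "Esc"): ["Tfidf-ls", "Binary"],
}


def count_top_two(scheme):
    """Number of dataset/format rows where the scheme is in the top two."""
    return sum(scheme in pair for pair in top_two.values())


print(count_top_two("Tfidf-ls"), "of", len(top_two))
```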
In the cases investigated, complicated weighting schemes such as entropy weighting did not perform as well as a simple weighting scheme such as a binary representation.
A possible explanation is that, in most cases, a record is classified into a category based on the existence of a word or a group of words. This is especially true of free format text, which has an average length of about 12.5 words per record. In such instances, keywords are commonly not repeated within a given record, and the existence of a particular word or group of words is enough to determine the class. Hence the binary representation scheme does especially well in the ‘Free’ and ‘Both’ formats.
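The point about short records can be illustrated with a small sketch. Both example records are hypothetical and are not drawn from the datasets. When no word is repeated, raw term counts and binary indicators are the same vector, so frequency-based weighting adds no information within the record.

```python
from collections import Counter


def term_frequency(tokens):
    """Raw count of each word in a record."""
    return dict(Counter(tokens))


def binary_weights(tokens):
    """1 for every word that occurs, regardless of how often."""
    return {word: 1 for word in set(tokens)}


# Hypothetical short record with no repeated words.
short_record = "customer reports printer offline after driver update".split()
# Hypothetical record in which a keyword is repeated.
long_record = "printer offline check printer cable restart printer".split()

print(term_frequency(short_record) == binary_weights(short_record))
print(term_frequency(long_record) == binary_weights(long_record))
```

The two representations coincide for the first record and diverge for the second, where the repeated keyword receives a count of 3 under term frequency but 1 under the binary scheme.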
From Tables 6.1–6.3, it is interesting to note that for the Area dataset, the binary representation scheme was the best for the ‘Both’ and ‘Free’ formats but lost to the tfidf-ln scheme for the ‘KB’ format. In fact, there was a difference of about 4% between the binary and the tfidf-ln schemes for the ‘KB’ format. This difference could be attributed to the characteristics of the KB format. When a knowledge base is used to assist a customer, a number of questions are asked. During this questioning, it is highly likely that keywords associated with some of the classes within the problem area will arise. The existence of a particular word or group of words within the questioning procedure might then falsely indicate a particular class. This may explain the poorer performance of the binary weighting scheme, in comparison with the tfidf-ln scheme, for the KB format (Table 6.1). The effect is not observed for the Call-Type and Esc datasets, since the ‘keywords’ for these two datasets are generally not prevalent in the questions asked during use of the knowledge base system.
Table 6.4: Duncan’s Groupings for two other datasets
Dataset | A | B
CDP | Tfidf-ln, Binary, Entropy, Tf-n, Tfidf-ls | Entropy, Tf-n, Tfidf-ls, Tfidf
Solid | Tfidf-ls, Tf-n, Binary | Entropy, Tfidf-ln, Tfidf
A similar analysis of means was carried out on the CDP and Solid datasets, as shown in Table 6.4. For these two datasets only the ‘Free’ text format was available, and the response variable was similar to that of the Area dataset. From Table 6.4, it can be seen that the tfidf-ls, tf-n and binary schemes are present in the A-grouping for both datasets.
As before, it is interesting to note that the tfidf representation scheme in its original form produces very poor results. However, with a modification such as length normalization or logistic scaling, classification accuracies improve markedly.
A sample of the results of the analysis is provided in Appendix C.