Experimental Study On Term Weighting Schemes

Part of the thesis Mining of textual databases within the product development process (pages 132-136)

An SVM was used as the classifier. Each dataset was split into 70% for training and 30% for testing. This split was repeated five times using a stratified sampling scheme, yielding five trials per dataset. The SVM parameters were tuned over a grid of powers of two, 2^k: for the C parameter, k ranged from 1 to 11, while for sigma, k ranged from -11 to 1. Three-fold cross-validation was used to estimate the training-set accuracy at each parameter setting, and the setting that gave the best training accuracy was then used to measure the accuracy on the corresponding test set.
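The search grid described above can be sketched as follows. This is an illustrative reconstruction, not the thesis' code; the helper name `power_of_two_grid` is hypothetical.

```python
# Hypothetical sketch of the parameter grid described above: both SVM
# parameters are searched over powers of two, 2**k, with different k ranges.
def power_of_two_grid(k_min, k_max):
    """Return the candidate values 2**k for k in [k_min, k_max]."""
    return [2.0 ** k for k in range(k_min, k_max + 1)]

c_grid = power_of_two_grid(1, 11)       # C:     2^1  ... 2^11
sigma_grid = power_of_two_grid(-11, 1)  # sigma: 2^-11 ... 2^1

# Every (C, sigma) pair would be scored by 3-fold cross-validation on the
# 70% training split; the best-scoring pair is then evaluated once on the
# held-out 30% test split.
candidate_pairs = [(c, s) for c in c_grid for s in sigma_grid]
```

With 11 values of C and 13 values of sigma, each trial evaluates 143 parameter settings under cross-validation.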

Duncan's Multiple Range Test was used to determine whether the mean accuracies obtained for the different weighting schemes on the different datasets differed from one another, at a p-value of 0.05. The test compares all possible pairs of means; the null hypothesis is H0: mu_i = mu_j for all i != j, where mu_i is the mean accuracy due to the i-th weighting scheme. If all pairs of means were instead compared with individual t-tests, the probability of a type I error over the entire set of comparisons could be greatly inflated. Duncan's Multiple Range Test was used to help avoid this problem (Montgomery and Runger, 1994).
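The inflation of the type I error rate under naive pairwise t-tests can be quantified. As a rough illustration (assuming, for simplicity, independent comparisons; this is not how Duncan's test itself operates):

```python
# With m independent pairwise comparisons, each at per-test significance
# alpha, the probability of at least one false rejection across the whole
# family is 1 - (1 - alpha)**m, which grows quickly with m.
from math import comb

def familywise_error(alpha, n_groups):
    """Family-wise type I error rate for all pairwise comparisons,
    assuming (for illustration) the comparisons are independent."""
    m = comb(n_groups, 2)              # number of pairwise comparisons
    return 1.0 - (1.0 - alpha) ** m

# Six weighting schemes give 15 pairwise comparisons at alpha = 0.05,
# so the family-wise error rate is roughly 0.54 rather than 0.05.
fwer = familywise_error(0.05, 6)
```

This is why a multiple-comparison procedure such as Duncan's Multiple Range Test is preferred over a battery of unadjusted t-tests.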

Table 6.1: Duncan's Groupings for the 'KB' format

Dataset     A                                                 B                                   C
Area        Tfidf-ln, Binary                                  Binary, Entropy, Tfidf-ls, Tfidf    Tf-n
Call-Type   Tfidf-ln, Tfidf-ls, Binary, Entropy               Binary, Entropy, Tfidf              Tfidf, Tf-n
Esc         Binary, Tfidf-ls, Tfidf-ln, Tfidf, Entropy, Tf-n  -                                   -

Table 6.2: Duncan's Groupings for the 'Free' format

Dataset     A                                           B                  C
Area        Binary, Tfidf-ls, Tfidf-ln, Entropy, Tfidf  Tf-n               -
Call-Type   Tfidf-ls, Binary, Tfidf-ln, Entropy, Tfidf  Tf-n               -
Esc         Tfidf-ls, Binary, Tfidf-ln, Tfidf           Tfidf-ln, Tfidf    Entropy, Tf-n

Table 6.3: Duncan's Groupings for the 'Both' format

Dataset     A
Area        Binary, Entropy, Tf-n, Tfidf-ls, Tfidf-ln, Tfidf
Call-Type   Tfidf-ls, Tf-n, Binary, Tfidf-ln, Entropy, Tfidf
Esc         Tfidf-ls, Binary, Tf-n, Tfidf, Entropy, Tfidf-ln

Tables 6.1 to 6.3 show the results for the Area, Call-Type and Esc datasets under the various data formats and weighting schemes. The weighting schemes within a given cell are not statistically significantly different from one another at a significance level of 0.05, and the mean accuracies decrease as the tables are traversed from left to right. In some instances a particular weighting scheme appears in only a single cell, while in others it appears in two; in the latter case, the scheme is not statistically different from the other weighting schemes found in either of the two cells.

Investigation of the results reveals that the performance of the different weighting schemes depends on both the dataset and the data format. In general, for the 'KB' and 'Free' formats (Table 6.1 and Table 6.2), the tfidf-ls, tfidf-ln and binary schemes outperform the rest. In fact, the tfidf-ls scheme appears in the top two spots of the A-grouping 7 out of 9 times, as seen from Tables 6.1 to 6.3.

In the cases investigated, a complicated weighting scheme such as entropy weighting did not perform as well as a simple scheme such as the binary representation.

A possible explanation is that, in most cases, a record is classified into a category based on the presence of a word or a group of words. This is especially true of the free-format text, where records have an average length of about 12.5 words. In such short records, keywords are rarely repeated, and the mere presence of a particular word or group of words is sufficient to determine the class. Hence the binary representation scheme does especially well for the 'Free' and 'Both' formats.
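The point above can be made concrete with a toy record (the example text is invented for illustration): when no term repeats, a binary presence/absence vector carries the same information as raw term counts.

```python
# Toy illustration: with short free-format records, keywords rarely repeat,
# so binary presence/absence weights coincide with raw term frequencies.
from collections import Counter

record = "printer jam paper tray error"        # a short, hypothetical record
counts = Counter(record.split())               # raw term frequencies
binary = {term: 1 for term in counts}          # presence/absence weights

# Every term occurs exactly once, so the two representations agree here.
same = dict(counts) == binary
```

For longer records with repeated terms the two representations diverge, which is where frequency-based schemes such as tfidf can add information.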

From Tables 6.1 to 6.3, it is interesting to note that for the Area dataset the binary representation scheme was the best for the 'Both' and 'Free' formats but lost to the tfidf-ln scheme for the 'KB' format, by a margin of about 4%. This difference can be attributed to the characteristics of the 'KB' format. When a knowledge base is used to assist a customer, a number of questions are asked, and during this questioning it is highly likely that some of the keywords associated with particular classes in the problem area arise. The presence of such a word or group of words in the questioning procedure can falsely indicate a particular class, which may explain the poorer performance of the binary weighting scheme relative to the tfidf-ln scheme for the 'KB' format (Table 6.1). This effect is not observed for the Call-Type and Esc datasets, since the 'keywords' for these two datasets are generally not prevalent in the questions asked during use of the knowledge base system.

Table 6.4: Duncan's Groupings for two other datasets

Dataset     A                                           B
CDP         Tfidf-ln, Binary, Entropy, Tf-n, Tfidf-ls   Entropy, Tf-n, Tfidf-ls, Tfidf
Solid       Tfidf-ls, Tf-n, Binary                      Entropy, Tfidf-ln, Tfidf

A similar analysis of means was carried out on the CDP and Solid datasets, as shown in Table 6.4. For these two databases only the 'Free' text format was available, and the response variable was similar to that of the Area dataset. From Table 6.4 it can be seen that the tfidf-ls, tf-n and binary schemes are present in the A-grouping for both datasets.

As before, it is interesting to note that the tfidf representation scheme in its original form produces very poor results. However, with modifications such as length normalization or logistic scaling, classification accuracy improves markedly.
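The two modifications can be sketched as follows. The thesis' exact formulas for tfidf-ln and tfidf-ls are not reproduced here; one common reading, assumed for this sketch, is that length normalization divides a document's tfidf vector by its Euclidean norm, while logistic scaling squashes each weight through a sigmoid.

```python
# Hypothetical sketch of the two tfidf modifications discussed above,
# under the assumed definitions stated in the lead-in.
import math

def length_normalize(weights):
    """Scale a document's tfidf vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)

def logistic_scale(weights):
    """Squash each tfidf weight through the logistic function."""
    return [1.0 / (1.0 + math.exp(-w)) for w in weights]

raw = [3.2, 0.0, 1.1]            # raw tfidf weights for one document
ln = length_normalize(raw)       # unit-length vector; ratios preserved
ls = logistic_scale(raw)         # each weight mapped into (0, 1)
```

Both transformations damp the influence of very large raw tfidf weights, which is consistent with the observation that the unmodified scheme performs poorly.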

A sample of the results of the analysis is provided in Appendix C.

