sorting function. In such a case, by far the greatest proportion of the time is spent in sorting the keys, so that's where we'll focus our attention next.

Sorting Speed

Which sort should we use? Surely we can't improve on the standard sort supplied with the C++ compiler we're using (typically Quicksort) or one of its relatives such as heapsort. After all, the conventional wisdom is that even the fastest sorting algorithm takes time proportional to n*log(n). That is, if sorting 1000 items takes 1 second, sorting 1,000,000 items would take at least 2000 seconds, using an optimal sorting algorithm. Since these commonly used algorithms all have average times proportional to this best case, it seems that all we have to do is select the one best suited for our particular problem.

This is a common misconception: in fact, only sorts that require comparisons among keys have a lower bound proportional to n*log(n). The distribution counting sort we are going to use takes time proportional to n, not n*log(n), so that if it takes 1 second to sort 1000 items, it would take 1000 seconds to sort 1,000,000 items, not 2000 seconds.[2] Moreover, this sort is quite competitive with the commonly used sorts even for a few hundred items, and it is easy to code as well.

For some applications, however, its most valuable attribute is that its timing is data independent. That is, the sort takes the same amount of time to execute whether the data items are all the same, all different, already sorted, in reverse order, or randomly shuffled. This is particularly valuable in real-time applications, where knowing the average time is not sufficient.[3]

Actually, this sort takes time proportional to the number of keys multiplied by the length of each key. The reason that the length of the keys is important is that a distribution counting sort actually treats each character position of the sort keys as a separate "key"; these keys are used in order from the least to the most significant.
Therefore, this method actually takes time proportional to n*m, where n is the number of keys and m is the length of each key. However, in most real-world applications of sorting, the number of items to be sorted far exceeds the length of each item, and additions to the list take the form of more items, not lengthening of each one. If we had a few very long items to sort, this sort would not be as appropriate.

You're probably wondering how fast this distribution sort really is, compared to Quicksort. My tests sorted several different numbers of records, between 23480 and 234801, on 5-digit ZIP codes, using both Microsoft's implementation of the Quicksort algorithm (qsort) in version 5.0 of their Visual C++ compiler and my distribution counting sort, which I call "Megasort". According to the results, there's no contest.[4] The difference in performance ranged from approximately 43 to 1 at the low end up to an astonishing 316 to 1 at the high end! Figures promailx.00 and promail.00, near the end of this chapter, show the times in seconds when processing these varying sets of records.

Now let's see how such increases in performance can be achieved with a simple algorithm.

The Distribution Counting Sort

The basic method is to make two passes through the keys for each character position in the key: one pass to discover how many keys have each possible ASCII character in the character position that we are currently considering, and another pass to actually rearrange the keys.

As a simplified example, suppose that we have ten keys to be sorted and we want to sort only on the first letter of each key. The first pass consists of counting the number of keys that begin with each letter. In the example in Figure unsorted, we have three keys that begin with the letter 'A', five that begin with the letter 'B', and two that begin with the letter 'C'.
Since 'A' is the lowest character we have seen, the first key we encounter that starts with an 'A' should be the first key in the result array, the second key that begins with an 'A' should be the second key in the result array, and the third 'A' key should be the third key in the result array, since all of the 'A' keys should precede all of the 'B' keys.

Unsorted keys (Figure unsorted)

     1  bicycle
     2  airplane
     3  anonymous
     4  cashier
     5  bottle
     6  bongos
     7  antacid
     8  competent
     9  bingo
    10  bombardier

The next keys in sorted order will be the ones that start with 'B'; therefore, the first key that we encounter that starts with a 'B' should be the fourth key in the result array, the second through fifth 'B' keys should be the fifth through eighth keys in the result array, and the 'C' keys should be numbers nine and ten in the result array, since all of the 'B' keys should precede the 'C' keys. Figure counts.and.pointers illustrates these relationships among the keys.

Counts and pointers (Figure counts.and.pointers)

                        A    B    C
    Counts              3    5    2
    Starting indexes    1    4    9

    Key          Old index   New index   Explanation
    Bicycle          1           4       The first B goes to position 4.
    Airplane         2           1       The first A goes to position 1.
    Anonymous        3           2       The second A goes after the first.
    Cashier          4           9       The first C goes to position 9.
    Bottle           5           5       The second B goes after the first.
    Bongos           6           6       The third B goes after the second.
    Antacid          7           3       The third A goes after the second.
    Competent        8          10       The second C goes after the first.
    Bingo            9           7       The fourth B goes after the third.
    Bombardier      10           8       The fifth B goes after the fourth.
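The counting and placement steps described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the Megasort code itself: the function name distribution_pass and its parameters are hypothetical, every key is assumed to be long enough to have a character at position pos, and the indexes are 0-based rather than 1-based as in the figure.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// One distribution counting pass on the character at position `pos`.
// Hypothetical helper for illustration; assumes key.size() > pos for every key.
// Keys with the same character keep their original relative order.
std::vector<std::string> distribution_pass(const std::vector<std::string>& keys,
                                           std::size_t pos)
{
    // First pass: count how many keys have each possible character value.
    std::array<std::size_t, 256> count{};
    for (const std::string& key : keys)
        ++count[static_cast<unsigned char>(key[pos])];

    // Turn the counts into starting indexes (0-based here).
    std::array<std::size_t, 256> start{};
    std::size_t next = 0;
    for (std::size_t c = 0; c < 256; ++c) {
        start[c] = next;
        next += count[c];
    }

    // Second pass: copy each key to the next free slot for its character,
    // then advance that slot so the following key with the same character
    // lands right after it.
    std::vector<std::string> result(keys.size());
    for (const std::string& key : keys)
        result[start[static_cast<unsigned char>(key[pos])]++] = key;
    return result;
}
```

Calling distribution_pass(keys, 0) on the ten keys in the order shown in Figure counts.and.pointers places each key at the new index listed there (less one, because of the 0-based indexing). The sketch copies whole strings for simplicity; rearranging pointers to the keys would avoid that cost.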
If we rearrange the keys in the order indicated by their new indexes, we arrive at the situation shown in Figure afterfirst:

After the sort on the first character (Figure afterfirst)

    Index   Unsorted Keys   Sorted Keys
      1     Bicycle         Airplane
      2     Airplane        Anonymous
      3     Anonymous       Antacid
      4     Cashier         Bicycle
      5     Bottle          Bottle
      6     Bongos          Bongos
      7     Antacid         Bingo
      8     Competent       Bombardier
      9     Bingo           Cashier
     10     Bombardier      Competent

Multicharacter Sorting

Of course, we usually want to sort on more than the first character. As we noted earlier, each character position is actually treated as a separate key; that is, the pointers to the keys are rearranged to be in order by whatever character position we are currently sorting on. With this in mind, let's take the case of a two character string; once we can handle that situation, the algorithm doesn't change significantly when we add more character positions.

We know that the final result we want is for all the A's to come before all the B's, which must precede all the C's, etc., but within the A's, the AA's must come before the AB's, which have to precede the AC's, etc. Of course, the same applies to keys starting with B and C: the BA's must come before the BB's, etc. How can we manage this?

We already know that we are going to have to sort the data separately on each character position, so let's work backward from what the second sort needs as input. When we are getting ready to do the second (and final) sort, we need to know that all the AA's precede all the AB's, which precede all the AC's, etc., and that all the BA's precede the BB's, which precede the BC's. The same must be true for the keys starting with C, D, and any other letter.
The reason that the second sort will preserve this organization of its input data is that it moves keys with the same character at the current character position from input to output in order of their previous position in the input data. Therefore, any two keys that have the same character in the current position (both A's, B's, C's, etc.) will maintain their relative order in the output. For example, if all of the AA's precede all of the AB's in the input, they will also do so in the output, and similarly for the BA's and BB's. This is exactly what we want. Notice that we don't care at this point whether the BA's are behind or ahead of the AB's, as arranging the data according to the first character is the job of the second sort (which we haven't done yet).

But how can we ensure that all the AA's precede the AB's, which precede the AC's, etc. in the input? By sorting on the second character position first!

For example, suppose we are sorting the following keys: AB, CB, BA, BC, CA, BA, BB, CC. We start the sort by counting the number of occurrences of each character in the second position of each key (the less significant position). There are three A's, three B's, and two C's. Since A is the character closest to the beginning of the alphabet, the first key that has an A in the second position goes in the first slot of the output. The second and third keys that have an A in the second position follow the first one. Those that have a B in the second position are next, in output positions 4, 5, and 6. The C's bring up the rear, producing the situation in Figure lesser.char.

After this first sort (on the second character position), all of the keys that have an A in the second position are ahead of all of the keys that have a B in the second position, which precede all those that have a C in the second position. Therefore, all AA keys precede all AB keys, which precede all AC keys, and the same is true for BA, BB, BC and CA, CB, and CC as well.
This is the exact arrangement of input needed for the second sort.

Less significant character sort (Figure lesser.char)

    Index   Input   Output
      1     AB      BA *
      2     CB      CA
      3     BA      BA *
      4     BC      AB
      5     CA      CB
      6     BA      BB *
      7     BB      BC *
      8     CC      CC
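To make the least-significant-position-first idea concrete, here is a minimal sketch of a complete multicharacter sort in C++. It is not the Megasort implementation: the name distribution_sort and the key_length parameter are hypothetical, all keys are assumed to have at least key_length characters, and the sketch moves the strings themselves rather than pointers to them.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sorts keys by running one stable counting pass per character position,
// from the last (least significant) position to the first (most significant).
// Hypothetical sketch; assumes every key has at least key_length characters.
std::vector<std::string> distribution_sort(std::vector<std::string> keys,
                                           std::size_t key_length)
{
    std::vector<std::string> output(keys.size());

    // pos takes the values key_length-1, key_length-2, ..., 0.
    for (std::size_t pos = key_length; pos-- > 0; ) {
        // Count the keys having each character value at this position.
        std::array<std::size_t, 256> count{};
        for (const std::string& key : keys)
            ++count[static_cast<unsigned char>(key[pos])];

        // Convert the counts to 0-based starting indexes.
        std::array<std::size_t, 256> start{};
        std::size_t next = 0;
        for (std::size_t c = 0; c < 256; ++c) {
            start[c] = next;
            next += count[c];
        }

        // Move keys to the output in input order, so keys that share a
        // character here keep the order established by the earlier passes.
        for (std::string& key : keys)
            output[start[static_cast<unsigned char>(key[pos])]++] = std::move(key);

        // The output of this pass is the input of the next one.
        keys.swap(output);
    }
    return keys;
}
```

With the eight two-character keys from the example and key_length set to 2, the first trip through the loop produces the ordering shown in Figure lesser.char, and the second trip, on the first character, completes the sort. Each pass does work proportional to the number of keys, which is where the overall n*m behavior comes from.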