o Data can be accessed only by a unique index number, not an arbitrary key

The distribution counting sort (Chapters mail.htm and zensort.htm)

o Excellent time efficiency, proportional to the number of keys
o Predictable timing
o Extra memory needed for a copy of the pointer or data, depending on implementation
o Appropriate for sorting a few hundred or more keys
o Each type of key to be sorted must be handled separately

Caching (Chapter superm.htm)

o See entry in Figure ioopt

Some Thoughts on the Future

Although many managers might wish it otherwise, the writing of efficient computer programs will always be an art, not a science. This may appear to be a rash statement, but I think time will prove me correct. Of course, if I am wrong, I am following a rich tradition: according to (Arthur C.) Clarke's Law, if a distinguished elderly scientist [defined as over 30 years old] says that something is possible, he is almost always right; if he says that something is impossible, he is almost always wrong.3

However, I feel I am on reasonably firm ground. The main reason is that, as demonstrated by Goedel's Proof, no automatic procedure can discover all of the true statements in a consistent axiomatic system such as the behavior of a programmable computer.4 This implies that the development of new algorithms will always be an adventure. A fine example of this is the optimization in Chapter compress.htm, in which I discovered that I could replace 16-bit occurrence counts with 4-bit indexes and actually speed up the operation of the program at the same time; I'd like to see the optimizing compiler that could figure that out!

Obviously, such unpredictability can be quite expensive. While other engineering professionals, such as structural engineers, routinely estimate the time and money required for their projects to within a few percent, software engineers are not nearly so accurate; in fact, overruns of hundreds of percent are not uncommon. Why is this so?
Why Johnny Can't Estimate

The answer is quite simple: once a structural engineer has built a dozen bridges (for example), the thirteenth is mostly an application of the knowledge he has already acquired. Of course, this is not to say that the expertise of the engineer is unimportant; far from it. However, experience is a great teacher, and experience gained in such a specialty is extremely valuable.

We software engineers, however, face a fundamentally different situation: once we have solved a given problem, we should never need to solve it again, because we can reuse the solution already developed. Therefore, in our development projects, we are always venturing into unexplored territory, and we all know how to identify pioneers.5

Actually, I find it enjoyable to have a new set of challenges with each new project. We should take pride in our work; it is fairly clear to me, at least, that large software systems are the most complex of human creations. They combine the characteristics of art and engineering: first we conceive of and build something out of sheer thought, the most evanescent material any artist has ever used, and then it actually performs a useful function!6

Goodbye for Now

We have reached the end of this book. I hope you have learned as much from reading this volume as I have while writing it.

Suggested Approaches to Problems

Chapter superm.htm

1. Modifications to add capabilities to the program:

1. To delete records, we need to invent another flag value (for example, 14 or 0xe for the first digit) indicating that the record has been used in the past but is currently unused. This allows us to continue looking for a record that might be after the current position in the file; however, we should remember the position of this "previously owned" record, so that if the record we are seeking turns out not to be in the file, we can reuse this record rather than taking up an unused one.
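The deletion scheme just described can be sketched as follows. This is a minimal in-memory model, not the book's code: the real program works on a disk file and encodes the status in the first BCD digit of each record, whereas all of the names and the explicit Status enumeration here are invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Simplified in-memory stand-in for the price-file lookup.
enum class Status { Unused, InUse, Deleted };

struct Record {
    Status status = Status::Unused;
    std::string key;
};

constexpr std::size_t kSlots = 8;

// Linear-probe lookup that keeps searching past "previously owned"
// (deleted) slots, but remembers the first one seen so it can be
// reused if the key turns out not to be in the table.
std::size_t find_slot(std::array<Record, kSlots>& table,
                      const std::string& key) {
    std::size_t start = std::hash<std::string>{}(key) % kSlots;
    std::optional<std::size_t> first_deleted;
    for (std::size_t i = 0; i < kSlots; ++i) {
        std::size_t slot = (start + i) % kSlots;
        Record& r = table[slot];
        if (r.status == Status::InUse && r.key == key)
            return slot;                       // found the record
        if (r.status == Status::Deleted) {
            if (!first_deleted)
                first_deleted = slot;          // candidate for reuse
            continue;                          // record may be further on
        }
        // An unused slot means the key cannot occur later in the probe
        // sequence, so reuse the earliest deleted slot if there was one.
        return first_deleted ? *first_deleted : slot;
    }
    assert(first_deleted);                     // table entirely deleted slots
    return *first_deleted;
}
```

Note that a table filled entirely with in-use records would need an explicit "not found, no space" result in a full implementation; the sketch omits that case.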
Of course, this solution will not prevent the file from gradually filling up with previously used records, which will result in slower and slower access times, particularly when looking for records that are not in the file; clearly, if every record has been used at least once, we would have to search the entire file to determine that a particular record is not in the file. This can be taken care of by periodic maintenance such as that described in the answer to the next question.

2. In order to handle the case of a full file, or a file which has become nearly filled with deleted records, we can copy the records in the file to a larger one. This requires opening both the old file and a new one, with the initialization of the new file set to create a larger file. Then we read every valid record in the old file and write it to the new one; since we don't copy any deleted records, the new file starts out with a clean slate. This situation illustrates one reason for using a PriceFile structure to keep track of the data for each file: it is what allows us to have both the new and old files open at once.

3. Keeping track of the inventory of each item requires adding an inventory field to the ItemRecord structure and updating this information whenever items are sold or added to inventory.

2. Hash coding could be applied to tables in memory in much the same way as we have applied it to disk files; the main difference is that since memory is almost a true random-access device, we could use a different method of selecting which slot to use for overflow records, rather than having to use the next sequential slot.7

3. If we wish to reduce the time needed to look up an entry in a table in memory by a key, we can maintain a cache of the most recently seen entries, employing a hash code based on the key to be looked up. Each time we find an entry, we store its key and record number in the cache entry corresponding to its hash code.
Then, when we want to find a record, the first place we look is the cache entry corresponding to its key's hash code. If the key in that cache entry matches the key of the record we are looking for, we use the record number in the cache entry to access the record; if it doesn't match, we have to search the table as we otherwise would have anyway. Since the time needed to check the cache entry is small compared to the time needed to search the table, we should come out ahead in most cases.

Chapter mail.htm

1. One way to produce output in descending rather than ascending order is to subtract each character from 255 before using it as an index.

2. In order to make only one pass through the data for each character position, rather than two as at present, we have to add the pointer for each key to the end of a linked list for characters having the ASCII code of the character at the current position in the key. For example, if the current character in the key we are processing is 'A', then the pointer to that key would be added to the 'A' list; similarly for 'B', 'C', and any other character that might occur in the key. When the pointers for all the keys have been added to the appropriate lists, the lists are concatenated into one big list, with the 'B' list attached to the end of the 'A' list, the 'C' list attached to the end of the 'B' list, and so on. Then the process starts again for the next character position.

3. In order to sort integer or floating-point data rather than character data, we have to know the internal structure of the data. For example, on the Intel 80x86 machines, the low byte of a two-byte unsigned value has a lower memory address than the high byte. We can view this as a two-byte key, with the less significant part lower in memory. Since we have to sort by the less significant portion first, our outer loop index must count up from 0 to 1, rather than the other way around as it would with a two-byte string.
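A minimal sketch of this least-significant-byte-first approach for two-byte unsigned values follows. It extracts each byte with shifts, which matches the low-byte-first result described for the 80x86 without depending on the machine's actual memory layout; the function name and the use of std::vector are my own, not the book's.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Distribution counting sort of two-byte unsigned values: one counting
// pass per byte, least significant byte first, so the final pass on the
// high byte leaves the keys in ascending order.
std::vector<std::uint16_t> distribution_sort(std::vector<std::uint16_t> keys) {
    std::vector<std::uint16_t> buffer(keys.size());
    for (int pass = 0; pass < 2; ++pass) {       // byte 0, then byte 1
        std::array<std::size_t, 257> count{};    // one counter per byte value
        for (std::uint16_t k : keys)
            ++count[((k >> (8 * pass)) & 0xff) + 1];
        for (int i = 0; i < 256; ++i)            // running totals give each
            count[i + 1] += count[i];            // byte value's first slot
        for (std::uint16_t k : keys)             // stable redistribution
            buffer[count[(k >> (8 * pass)) & 0xff]++] = k;
        keys.swap(buffer);
    }
    return keys;
}
```

Stability of each pass is what makes the low-byte ordering survive the high-byte pass, which is why the passes must run from least to most significant.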
Of course, on a Motorola or other "high end first" machine, such unsigned values would be sorted in the same order as strings. Signed values pose additional difficulty, as the high-order byte of a negative value has a greater character value than the high-order byte of a positive value; one way around this is to negate all the keys both before and after the sort. Applying the distribution sort or any other noncomparison sort to floating-point values is even more complex, as the internal representation of a floating-point value is composed of several parts with differing functions; for this reason, the algorithm must be tailored to the individual circumstances. However, the performance of the resulting algorithm makes this effort worthwhile if the number of values to be sorted is reasonably large.

4. Probably the easiest way to handle variable-length strings as keys is to make one pass through the key array, calling strlen to identify the length of each key, and saving the results in a temporary array. Then, whenever we are ready to extract a character from a key, we first determine whether we are past the end of that key, in which case we substitute a null byte.

5. Another possible way to sort variable-length strings is to ignore the fact that they differ in length. Since a C string is normally terminated by a null byte, two strings that are identical up to the end of the shorter one will sort correctly if both are treated as having the length of the longer one; the null byte will make the shorter one sort before the longer one. However, this approach will not work under a protected-mode operating system if you try to access bytes following the end of a short string when those bytes have not been allocated to your program; you will get a protection violation.
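The effect of the terminating null byte can be demonstrated with memcmp, provided both buffers are large enough that no unallocated bytes are touched. This toy function is an invented illustration of the point, not part of the book's code:

```cpp
#include <cstring>

// Both buffers are allocated at the longer key's length (5 bytes), so
// reading past the shorter string's terminator is safe here; the text's
// warning applies when those extra bytes are NOT owned by the program.
// Treating both as 4-byte keys, the null byte (value 0) after "abc"
// compares less than 'd', so the shorter string sorts first.
bool shorter_sorts_first() {
    char shorter[5] = "abc";   // "abc\0" plus one more zero byte
    char longer[5]  = "abcd";
    return std::memcmp(shorter, longer, 4) < 0;
}
```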
Another aspect of this approach is that the garbage after the end of a string that is shorter than the key length being sorted will be used to rearrange identical keys that should remain in the same relative position in the file. This may cause problems in applications that rely on identical keys keeping their relative order.

Chapter compress.htm

1. In order to use arithmetic coding where each of a number of blocks of text must be decompressed without reference to the other blocks, we must generate a table of probabilities by analyzing a file or files which have characteristics similar to those to be compressed. Then, rather than computing the probabilities on the fly while compressing or decompressing one of our text blocks, we use the precomputed probability table. This also increases the speed of compressing and decompressing, as the table does not need to be updated.

2. If the data to be compressed consists entirely of characters with ASCII codes 0 through 127 (or any other subset of the ASCII range), we can eliminate a significant amount of the storage for the probability tables by allocating only those tables needed to encode or decode the characters that can occur in the data. In the example given, we need only 128 tables of 128 elements, rather than 256 tables of 256 elements; the resulting memory requirement is approximately 8 Kbytes rather than approximately 32 Kbytes, saving about 24 Kbytes.

Chapter quantum.htm

1. To provide a "mass load mode", you'll have to add a function to change the behavior of PutElement so that it will start a new block when the current "last added to" block doesn't have enough space. Of course, you'll also have to add a complementary function to reset its behavior when you're done with the mass load.

2. The directory utility will have to be able to add and delete array names randomly.
A hash-coded lookup would probably be the best solution for large numbers of arrays, although this is not critical in the current implementation, which is limited to 256 arrays per file.

3. The QFIX program has the following general description. The first phase starts from the main object index, checking the validity of each main object. The big pointer array quantum is checked to see that it belongs to the object in question. Then each of the small pointer arrays is similarly checked, and then each small pointer array entry is checked as well, along with the element to which it points. If all is well so far, the next main object is checked in the same manner. Any object that checks out completely is added to a list of valid objects; all other objects are noted as being incomplete or erroneous.
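As a rough illustration of the hash-coded directory lookup suggested above, the standard library's unordered_map (itself a hash table) can stand in for a hand-rolled one. This is an invented sketch, not the book's implementation; the array names used are hypothetical.

```cpp
#include <string>
#include <unordered_map>

// A hash-coded directory mapping array names to directory slots,
// supporting addition and deletion of names in any order.
using Directory = std::unordered_map<std::string, int>;

bool directory_demo() {
    Directory dir;
    dir["prices"] = 0;          // add array names at random
    dir["inventory"] = 1;
    dir.erase("prices");        // delete one at random
    return dir.count("prices") == 0 && dir.at("inventory") == 1;
}
```

A real version would also enforce the 256-arrays-per-file limit before adding a new name.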