mode in that it expects the record to be in the file, and displays the record data when found.

After process terminates when the user enters "*" instead of a code number to be looked up or entered, main finishes up by calling terminate_price_file (Figure suplook.11), which closes the price file and returns. All processing complete, main exits to the operating system.

The terminate_price_file function (from superm\suplook.cpp) (Figure suplook.11)

codelist/suplook.11

Summary

In this chapter, we have covered ways to save storage by using a restricted character set and to gain rapid access to data by an exact key, using hash coding and caching. In the next chapter we will see how to use bitmaps and distribution sorting to aid in rearranging information by criteria that can be specified at run time.

Problems

1. What modifications to the program would be needed to support:
   1. Deleting records?
   2. Handling a file that becomes full, as an off-line process?
   3. Keeping track of the inventory of each item?
2. How could hash coding be applied to tables in memory?
3. How could caching be applied to reduce the time needed to look up an entry in a table in memory?

(You can find suggested approaches to problems in Chapter artopt.htm).

Footnotes

1. Tape drives are the most commonly used sequential access devices.
2. In general, the "next open place" is not a very good place to put an overflow record if the hash table is kept in memory rather than on the disk; the added records "clog" the table, leading to slower access times. A linked list approach is much better for tables that are actually memory resident. Warning: this does not apply to tables in virtual memory, where linked lists provide very poor performance. For more discussion on overflow handling, see the dynamic hashing algorithm in Chapter dynhash.htm.
3. For a description of "last-come, first-served" hashing, see Patricio V. Poblete and J.
Ian Munro, "Last-Come-First-Served Hashing", in Journal of Algorithms 10, 228-248, or my article "Galloping Algorithms", in Windows Tech Journal, 2(February 1993), 40-43.
4. This is a direct-mapped cache.
5. Increasing the maximum number of records in the file by increasing FILE_CAPACITY would also increase the amount of memory required for the cache unless we reduced the cache size as a fraction of the file size by reducing the value .20 in the calculation of APPROXIMATE_CACHE_SIZE.
6. The arithmetic coding data compression algorithm covered in Chapter compress.htm, however, does not restrict the characters that can be represented; rather, it takes advantage of the differing probabilities of encountering each character in a given situation.
7. Note that this legal_chars array must be kept in synchronization with the lookup_chars array, shown in Figure radix40.02.
8. While we could theoretically have more than one of these files active at a time, our example program uses only one such file.
9. This function actually returns one more than the index to the last entry in the line because the standard C loop control goes from the first value up to one less than the ending value.
10. Of course, we might also find a record with the same key as the one we are trying to add, but this is an error condition, since keys must be unique.
11. If we were adding a new record with this key rather than trying to find one, we would use position 0.

A Mailing List System

Introduction

In this chapter we will use a selective mailing list system to illustrate rapid access to and rearrangement of information selected by criteria specified at run time. Our example will allow us to select certain customers of a mail-order business whose total purchases this year have been within a particular range and whose last order was within a certain time period.
This would be very useful for a retailer who wants to send coupons to lure back the (possibly lost) customers who have spent more than $100 this year but who haven't been in for 30 days. The labels for the letters should be produced in ZIP code order, to take advantage of the discount for presorted mail.

Algorithms Discussed

The Distribution Counting Sort, Bitmaps

A First Approach

To begin, let us assume that the information we have about each customer can be described by the structure definition in Figure custinfo.

Customer information (Figure custinfo)

    typedef struct
    {
        char last_name[LAST_NAME_LENGTH+1];
        char first_name[FIRST_NAME_LENGTH+1];
        char address1[ADDRESS1_LENGTH+1];
        char address2[ADDRESS2_LENGTH+1];
        char city[CITY_LENGTH+1];
        char state[STATE_LENGTH+1];
        char zip[ZIP_LENGTH+1];
        int date_last_here;
        int dollars_spent;
    } DataRecord;

A straightforward approach would be to store all of this information in a disk file, with one DataRecord record for each customer. In order to construct a selective mailing list, we read through the file, testing each record to see whether it meets our criteria. We extract the sort key (ZIP code) for each record that passes the tests of amount spent and last transaction date, keeping these keys and their associated record numbers in memory. After reaching the end of the file, we sort the keys (possibly with a heapsort or Quicksort) and rearrange the record numbers according to the sort order, then reread the selected records in the order of their ZIP codes and print the address labels.

It may seem simpler just to collect records in memory as they meet our criteria. However, the memory requirements of this approach might be excessive if a large percentage of the records are selected. The customer file of a mail-order business is often fairly large, with 250000 or more 100-byte records not unheard of; in this situation a one-pass approach might require as much as 25 megabytes of memory for storing the selected records.
However, if we keep only the keys of the selected records in memory and print the label for each customer as his record is reread on the second pass, we never have to have more than one record in memory at a time. In our example, the length of the key (ZIP code) is nine bytes and the record number is two bytes long, so that 250000 selected records would require only 2.75 megabytes. This is a much more reasonable memory budget, considering that our program might not be the only one running, especially under an operating system like Windows™.

Even so, there is no reason to allocate all the storage we might ever need in advance, and every reason not to. We'd like the program to run in the minimum memory possible, so that it would be useful even on a machine with limited memory or one that is loaded up with network drivers and memory-resident utilities.

A linked list is a fairly simple way to allocate memory as needed, but we must be careful to use this method efficiently. Allocating a new block of storage for each record that matches our criteria would be very wasteful, as extra control storage is needed to keep track of each allocation. In MS-DOS, each allocation requires at least 16 bytes of control storage, so 250000 allocations would use about 4 megabytes just for the control storage! This problem is easy to solve; we will allocate storage for a number of records at once, 1000 in our example, which reduces that 4 megabyte overhead to 4 kilobytes.¹

Saving Storage with Bitmaps

Let's start our optimization by trying to reduce the memory we need for each selected record during this first pass through the file. One possibility is to convert the ZIP codes to Radix40 representation to save memory. Unfortunately, such data is not suitable for sorting.

However, there is another way for us to save memory during the first pass: using a bitmap to keep track of the records that match our criteria.
A bitmap is an array of bits that can be used to record one characteristic of a number of items, as long as that characteristic can be expressed as yes/no, true/false, present/absent or some similar pair of opposites. In this case, all we want to remember is whether each record was selected. So if we allocate a bitmap with one bit for each possible record number, we can clear the whole bitmap at the start (indicating we haven't selected any records yet) and then set the bit for each record number we select. When we get done with the selection, we will know how many records were selected, so we can allocate storage for the record numbers we need to sort. Then we can go through the bitmap, and every time we find a bit that is set, we will add the corresponding record number to the end of the list to be passed to the sort.

Of course, we could use an array of bytes, one per record, instead of a bitmap, which would not require us to write bit access functions. However, the bitmap requires only one-eighth as much storage as an array of bytes, which can result in considerable savings with a big array. In our example, with a 250000-record file, a bitmap with one bit per record would occupy only about 31 KB, whereas an equivalent byte array would occupy 250 KB. While this difference may not be significant in the present case, larger files produce proportionally larger savings.

As with almost everything, there is a drawback to the use of bitmaps; they usually require more time to access than byte arrays. This is especially true for those machines (such as the 80x86) that do not have an instruction set designed for easy or efficient access to individual bits. Even on machines like the 68020 and its successors, which do have good bit manipulation instructions, compiler support for these functions may be poor.
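The bitmap bookkeeping described above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: the class name RecordBitmap and its member functions are hypothetical and are not the bit access functions used by the program in this chapter. The sketch clears all bits at construction, sets one bit per selected record number while counting the selections, and then walks the bitmap to collect the selected record numbers in ascending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration; names are not taken from the chapter's program.
class RecordBitmap {
public:
    // One bit per possible record number, all cleared at the start.
    explicit RecordBitmap(std::size_t record_count)
        : record_count_(record_count),
          selected_count_(0),
          bits_((record_count + 7) / 8, 0) {}

    // Mark a record as selected; count it only the first time.
    void set(std::size_t record_number) {
        assert(record_number < record_count_);
        unsigned char mask =
            static_cast<unsigned char>(1u << (record_number % 8));
        unsigned char& byte = bits_[record_number / 8];
        if ((byte & mask) == 0) {
            byte |= mask;
            ++selected_count_;
        }
    }

    // Report whether a record has been selected.
    bool test(std::size_t record_number) const {
        assert(record_number < record_count_);
        return ((bits_[record_number / 8] >> (record_number % 8)) & 1u) != 0;
    }

    std::size_t selected_count() const { return selected_count_; }
    std::size_t size_in_bytes() const { return bits_.size(); }

    // Walk the bitmap and append each selected record number to a list
    // whose storage is reserved from the known selection count.
    std::vector<std::size_t> selected_records() const {
        std::vector<std::size_t> result;
        result.reserve(selected_count_);
        for (std::size_t i = 0; i < record_count_; ++i) {
            if (test(i)) {
                result.push_back(i);
            }
        }
        return result;
    }

private:
    std::size_t record_count_;
    std::size_t selected_count_;
    std::vector<unsigned char> bits_;
};
```

For a 250000-record file, RecordBitmap(250000) holds 31250 bytes, which corresponds to the figure of about 31 KB quoted above, against 250000 bytes for one byte per record.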
However, since CPU time is unlikely to be the limiting factor while we are reading records from our input file, this extra processing time is probably unimportant in our case.

Increasing Processing Speed

Now that we have reduced the memory requirements for the first pass, is there any way to speed up the program? In my tests with large numbers of selected records, it turns out that the disk reading and writing time is a small part of the entire elapsed time required to run the program, at least when we are using a standard