extra space to prevent the hashing from getting too slow means that the file would end up taking up about 437 Kbytes. For this application, disk storage space would not be a problem; however, the techniques we will use to reduce the file size are useful in many other applications as well. Also, searching a smaller file is likely to be faster, because the heads have to move a shorter distance on the average to get to the record where we are going to start our search.

If you look back at Figure initstruct, you will notice that the upc field is ten characters long. Using the ASCII code for each digit, which is the usual representation for character data, takes one byte per digit or 10 bytes in all. I mentioned above that we would be using a limited character set to reduce the size of the records. UPC codes are limited to the digits 0 through 9; if we pack two digits into one byte, by using four bits to represent each digit, we can cut that down to five bytes for each UPC value stored. Luckily, this is quite simple, as you will see when we discuss the BCD (binary-coded decimal) conversion code below.

The other data compression method we will employ is to convert the item descriptions from strings of ASCII characters, limited to the sets 0-9, A-Z, and the special characters comma, period, minus, and space, to Radix40 representation, mentioned in Chapter prologue.htm. The main difference between Radix40 conversions and those for BCD is that in the former case we need to represent 40 different characters, rather than just the 10 digits, and therefore the packing of data must be done in a slightly more complicated way than just using four bits per character.

The Code

Now that we have covered the optimizations that we will use in our price lookup system, it's time to go through the code that implements these algorithms. This specific implementation is set up to handle a maximum of FILE_CAPACITY items, defined in superm.h (Figure superm.00a).
5 Each of these items, as defined in the ItemRecord structure in the same file, has a price, a description, and a key, which is the UPC code. The key would be read in by a bar-code scanner in a real system, although our test program will read it in from the keyboard.

Some User-Defined Types

Several of the fields in the ItemRecord structure definition require some explanation, so let's take a closer look at that definition, shown in Figure superm.00.

ItemRecord struct definition (from superm\superm.h) (Figure superm.00) codelist/superm.00

The upc field is defined as a BCD (binary-coded decimal) value of ASCII_KEY_SIZE digits (contained in BCD_KEY_SIZE bytes). The description field is defined as a Radix40 field DESCRIPTION_WORDS in size; each of these words contains three Radix40 characters.

A BCD value is stored as two digits per byte, each digit being represented by a four-bit code between 0000(0) and 1001(9). Function ascii_to_BCD in bcdconv.cpp (Figure bcdconv.00) converts a decimal number, stored as ASCII digits, to a BCD value by extracting each digit from the input argument and subtracting the code for '0' from the digit value; BCD_to_ascii (Figure bcdconv.01) does the reverse.

ASCII to BCD conversion function (from superm\bcdconv.cpp) (Figure bcdconv.00) codelist/bcdconv.00

BCD to ASCII conversion function (from superm\bcdconv.cpp) (Figure bcdconv.01) codelist/bcdconv.01

A UPC code is a ten-digit number between 0000000000 and 9999999999, which unfortunately is too large to fit in a long integer of 32 bits. Of course, we could store it in ASCII, but that would require 10 bytes per UPC code. So BCD representation saves five bytes per item compared to ASCII.

A Radix40 field, as mentioned above, stores three characters (from a limited set of possibilities) in 16 bits.
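Before going further into Radix40, the BCD packing described above can be illustrated with a minimal sketch. It is not the book's ascii_to_BCD or BCD_to_ascii: the names pack_bcd and unpack_bcd are hypothetical, the input is assumed to contain an even number of digits, and placing the first digit of each pair in the upper four bits is an assumption about the byte layout.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: pack an even-length string of ASCII digits into BCD,
// two digits per byte (first digit of each pair in the upper four bits).
std::vector<unsigned char> pack_bcd(const std::string& digits)
{
    assert(digits.size() % 2 == 0);  // this sketch handles whole bytes only
    std::vector<unsigned char> packed(digits.size() / 2);
    for (std::size_t i = 0; i < digits.size(); i += 2)
    {
        // Subtracting '0' turns an ASCII digit into its numeric value 0..9.
        unsigned char high = static_cast<unsigned char>(digits[i] - '0');
        unsigned char low = static_cast<unsigned char>(digits[i + 1] - '0');
        assert(high <= 9 && low <= 9);  // reject anything that is not a digit
        packed[i / 2] = static_cast<unsigned char>((high << 4) | low);
    }
    return packed;
}

// Hypothetical helper: expand BCD bytes back into ASCII digits.
std::string unpack_bcd(const std::vector<unsigned char>& packed)
{
    std::string digits;
    digits.reserve(packed.size() * 2);
    for (unsigned char byte : packed)
    {
        digits += static_cast<char>('0' + (byte >> 4));    // upper four bits
        digits += static_cast<char>('0' + (byte & 0x0F));  // lower four bits
    }
    return digits;
}
```

Under these assumptions, a ten-digit UPC such as "0123456789" packs into five bytes, and unpacking those bytes yields the original ten characters.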
The Radix40 algorithm (like some other data compression techniques) takes advantage of the fact that the number of bits required to store a character depends on the number of distinct characters to be represented. 6 The BCD functions described above are an example of this approach. In this case, however, we need more than just the 10 digits. If our character set can be limited to 40 characters (think of a Radix40 value as a "number" in base 40), we can fit three of them in 16 bits, because $40^3 = 64000$ is less than $2^{16} = 65536$.

Let's start by looking at the header file for the Radix40 conversion functions, which is shown in Figure radix40.00a.

The header file for Radix40 conversion (superm\radix40.h) (Figure radix40.00a) codelist/radix40.00a

The legal_chars array, shown in Figure radix40.00, defines the characters that can be expressed in this implementation of Radix40. 7 The variable weights contains the multipliers to be used to construct a two-byte Radix40 value from the three characters that we wish to store in it.

The legal_chars array (from superm\radix40.cpp) (Figure radix40.00) codelist/radix40.00

As indicated in the comment at the beginning of the ascii_to_radix40 function (Figure radix40.01), the job of that function is to convert a null-terminated ASCII character string to Radix40 representation. After some initialization and error checking, the main loop begins by incrementing the index to the current word being constructed after every third character is translated. It then translates the current ASCII character by indexing into the lookup_chars array, which is shown in Figure radix40.02. Any character that translates to a value with its high bit set is an illegal character and is converted to a hyphen; the result flag is changed to S_ILLEGAL if this occurs.
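The packing of three characters into one 16-bit word can be sketched as follows. This is an illustration under stated assumptions rather than the book's ascii_to_radix40: the names to_radix40, from_radix40, kLegalChars, and kWeights are hypothetical, the order of the 40 characters is assumed, a string search stands in for the lookup_chars table, and the decoder uses remainder and division by 40, which may not be how the book's decoding function is written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumed 40-character alphabet; the order in the book's legal_chars may differ.
const std::string kLegalChars = " ,-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
// Multipliers for the first, second, and third character of each word.
const unsigned kWeights[3] = {1, 40, 1600};

// Encode text into 16-bit words, three characters per word.
// *had_illegal is set to true if any character had to be replaced by '-'.
std::vector<std::uint16_t> to_radix40(const std::string& text, bool* had_illegal)
{
    assert(kLegalChars.size() == 40);
    *had_illegal = false;
    std::vector<std::uint16_t> words((text.size() + 2) / 3, 0);
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        std::size_t code = kLegalChars.find(text[i]);
        if (code == std::string::npos)
        {
            code = kLegalChars.find('-');  // illegal characters become hyphens
            *had_illegal = true;
        }
        // Largest possible word is 39 + 39*40 + 39*1600 = 63999, which fits in 16 bits.
        words[i / 3] = static_cast<std::uint16_t>(words[i / 3] + kWeights[i % 3] * code);
    }
    return words;
}

// Decode the first 'length' characters stored in 'words'.
std::string from_radix40(const std::vector<std::uint16_t>& words, std::size_t length)
{
    std::string text;
    for (std::size_t w = 0; w < words.size() && text.size() < length; ++w)
    {
        unsigned value = words[w];
        for (int cycle = 0; cycle < 3 && text.size() < length; ++cycle)
        {
            text += kLegalChars[value % 40];  // lowest base-40 digit first
            value /= 40;
        }
    }
    return text;
}
```

With this assumed alphabet, a seven-character description such as "MILK, 2" occupies three 16-bit words (six bytes) instead of seven bytes of ASCII, and unused positions in the last word decode as spaces.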
The ascii_to_radix40 function (from superm\radix40.cpp) (Figure radix40.01) codelist/radix40.01

The lookup_chars array (from superm\radix40.cpp) (Figure radix40.02) codelist/radix40.02

In the line radix40_data[current_word_index] += weights[cycle] * j;, the character is added into the current output word after being multiplied by the power of 40 that is appropriate to its position. The first character in a word is represented by its position in the legal_chars string. The second character is represented by 40 times that value and the third by 1600 times that value, as you would expect for a base-40 number.

The complementary function radix40_to_ascii (Figure radix40.03) decodes each character unambiguously. First, the current character is extracted from the current word by dividing by the weight appropriate to its position; then the current word is updated so the next character can be extracted. Finally, the ASCII value of the character is looked up in the legal_chars array.

The radix40_to_ascii function (from superm\radix40.cpp) (Figure radix40.03) codelist/radix40.03

Preparing to Access the Price File

Now that we have examined the user-defined types used in the ItemRecord structure, we can go on to the PriceFile structure, which is used to keep track of the data for a particular price file. 8 The best way to learn about this structure is to follow the program as it creates, initializes, and uses it. The function main, which is shown in Figure superm.01, after checking that it was called with the correct number of arguments, calls the initialize_price_file function (Figure suplook.00) to set up the PriceFile structure.

The main function (from superm\superm.cpp) (Figure superm.01) codelist/superm.01

The initialize_price_file function (from superm\suplook.cpp) (Figure suplook.00) codelist/suplook.00

The initialize_price_file function allocates storage for and initializes the PriceFile structure, which is used to control access to the price file.
This structure contains pointers to the file, to the array of cached records that we have in memory, and to the array of record numbers of those cached records. As we discussed earlier, the use of a cache can reduce the amount of time spent reading records from the disk by maintaining copies of a number of those records in memory, in the hope that they will be needed again. Of course, we have to keep track of which records we have cached, so that we can tell whether we have to read a particular record from the disk or can retrieve a copy of it from the cache instead.

When execution starts, we don't have any records cached; therefore, we initialize each entry in these arrays to an "invalid" state (the key is set to INVALID_BCD_VALUE). If file_mode is set to CLEAR_FILE, we write such an "invalid" record to every position in the price file as well, so that any old data left over from a previous run is erased.

Now that access to the price file has been set up, we can call the process function (Figure superm.02). This function allows us to enter items and/or look up their prices and descriptions, depending on mode.

The process function (from superm\superm.cpp) (Figure superm.02) codelist/superm.02

First, let's look at entering a new item (INPUT_MODE). We must get the UPC code, the description, and the price of the item. The UPC code is converted to BCD, the description to Radix40, and the price to unsigned. Then we call write_record (Figure suplook.01) to add the record to the file.

The write_record function (from superm\suplook.cpp) (Figure suplook.01) codelist/suplook.01

In order to write a record to the file, write_record calls lookup_record_number (Figure suplook.02) to determine where the record should be stored so that we can retrieve it quickly later. The lookup_record_number function does almost the same thing as lookup_record (Figure suplook.03), except that the latter returns a pointer to the record rather than its number.
Therefore, they are implemented as calls to a common function: lookup_record_and_number (Figure suplook.04).

The lookup_record_number function (from superm\suplook.cpp) (Figure suplook.02) codelist/suplook.02

The lookup_record function (from superm\suplook.cpp) (Figure suplook.03) codelist/suplook.03

The lookup_record_and_number function (from superm\suplook.cpp) (Figure suplook.04) codelist/suplook.04

After a bit of setup code, lookup_record_and_number determines whether the record we want is already in the cache, in which case we don't have to search the file for it. To do this, we call compute_cache_hash (Figure suplook.05), which in turn calls compute_hash (Figure suplook.06) to do most of the work of calculating the hash code.

The compute_cache_hash function (from superm\suplook.cpp) (Figure suplook.05) codelist/suplook.05
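The cache check described above, together with the idea of two thin lookup functions sharing one common routine, can be sketched roughly as follows. Everything in this block is hypothetical: the names PriceCache, CachedItem, cached_record, and cached_record_number are invented for illustration, the key is held as a 64-bit integer rather than a BCD value, the slot is chosen with a simple modulo instead of the book's compute_cache_hash, and each key is assumed to map to exactly one cache slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for INVALID_BCD_VALUE and "no record".
const std::uint64_t kInvalidKey = UINT64_MAX;
const long kNoRecord = -1;

struct CachedItem
{
    std::uint64_t key = kInvalidKey;  // every slot starts out "invalid"
    unsigned price = 0;
};

class PriceCache
{
public:
    explicit PriceCache(std::size_t slots)
        : items_(slots), record_numbers_(slots, kNoRecord)
    {
        assert(slots > 0);
    }

    // Remember a record and the file position it came from.
    void store(const CachedItem& item, long record_number)
    {
        std::size_t slot = slot_for(item.key);
        items_[slot] = item;
        record_numbers_[slot] = record_number;
    }

    // Common routine: reports both the cached record and its record number.
    bool find(std::uint64_t key, const CachedItem** item, long* record_number) const
    {
        assert(key != kInvalidKey);
        std::size_t slot = slot_for(key);
        if (items_[slot].key != key)
            return false;  // cache miss: the caller must search the file
        *item = &items_[slot];
        *record_number = record_numbers_[slot];
        return true;
    }

private:
    // Assumed hash: key modulo the number of slots.
    std::size_t slot_for(std::uint64_t key) const
    {
        return static_cast<std::size_t>(key % items_.size());
    }

    std::vector<CachedItem> items_;     // cached copies of records
    std::vector<long> record_numbers_;  // file positions of those records
};

// Thin wrappers over the common routine, each returning one of its results.
const CachedItem* cached_record(const PriceCache& cache, std::uint64_t key)
{
    const CachedItem* item = nullptr;
    long number = kNoRecord;
    return cache.find(key, &item, &number) ? item : nullptr;
}

long cached_record_number(const PriceCache& cache, std::uint64_t key)
{
    const CachedItem* item = nullptr;
    long number = kNoRecord;
    return cache.find(key, &item, &number) ? number : kNoRecord;
}
```

A freshly constructed PriceCache reports a miss for every key, mirroring the "invalid" initial state; after store is called for an item, both wrappers find it without any file access.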