
O'Reilly Network For Information About's Book part 157 pot


all we have to do is to increase the number of active slots by 1 (to 10) and rehash the elements in slot 1. It works just like the example above; none of slots 2-7 is affected by the change, because the second rule "folds" all of their hash calculations into their current slots.

When we have added a total of 96 elements, the number of "active" slots and the total number of slots will both be 16, so we will be back to a situation similar to the one we were in before we added our first expansion slot. What do we do when we have to add the 97th element? We double the total number of slots again and start over as before.[7] This can go on until the number of slots gets too large for the integer type in which we maintain it, or until we run out of memory, whichever comes first. Of course, if we want to handle a lot of data, we will probably run out of memory before we run out of slot numbers.

Since the ability to retrieve any record in a table of any size in constant time, on the average, would certainly be very valuable when dealing with very large databases, I considered that it was worth trying to adapt this algorithm to disk-based hash tables. All that was missing was a way to store the chains of records using a storage method appropriate for disk files.

Making a Quantum Leap

Upon further consideration, I realized that the quantum file access method could be used to store a variable-length "storage element" pointed to by each hash slot, with the rest of the algorithm implemented pretty much as it was in the article. This would make it possible to store a very large number of strings by key and get back any of them with an average of a little more than one disk access, without having to know how big the file would be when we created it. This algorithm also has the pleasant effect of making deletions fairly simple to implement, with the file storage of deleted elements automatically reclaimed as they are removed.
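The slot "folding" and one-slot-at-a-time expansion described at the start of this section can be sketched roughly as follows. This is only an illustration of standard linear hashing addressing, on the assumption that it matches the two rules referred to above (which are not reproduced in this excerpt); the names SlotTable, SlotFor, and Expand are hypothetical and do not come from the book's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bookkeeping for a linear hash table (names are illustrative).
struct SlotTable {
    std::uint32_t totalSlots;   // always a power of two, e.g. 16
    std::uint32_t activeSlots;  // between totalSlots/2 and totalSlots
};

// Map a hash code to an active slot. A hash that lands on a slot that is
// not active yet is "folded" back into the lower half of the table.
inline std::uint32_t SlotFor(const SlotTable& t, std::uint32_t hash) {
    std::uint32_t slot = hash % t.totalSlots;
    if (slot >= t.activeSlots)
        slot -= t.totalSlots / 2;
    return slot;
}

// Activate one more slot and return the number of the existing slot whose
// elements must be rehashed. If every slot is already active, the total
// is doubled first and the cycle starts over.
// (Overflow of the slot count is not handled in this sketch.)
inline std::uint32_t Expand(SlotTable& t) {
    if (t.activeSlots == t.totalSlots)
        t.totalSlots *= 2;
    std::uint32_t splitSlot = t.activeSlots - t.totalSlots / 2;
    ++t.activeSlots;
    return splitSlot;
}
```

With 16 total slots and 9 active, Expand reports that slot 1 must be rehashed and leaves 10 slots active, while hashes that would land on slots 10-15 keep folding into slots 2-7, as in the example in the text.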
Contrary to my usual practice elsewhere in this book, I have not developed a sample "application" program, but have instead opted to write a test program to validate the algorithm. This was very useful to me during the development of the algorithm; after all, this implementation of dynamic hashing is supposed to be able to handle hundreds of thousands of records while maintaining rapid access, so it is very helpful to be able to demonstrate that it can indeed do that. The test program is hashtest.cpp (Figure hashtest.00).

The test program for the dynamic hashing algorithm (quantum\hashtest.cpp) (Figure hashtest.00)
codelist/hashtest.00

This program stores and retrieves strings by a nine-digit key, stored as a String value. To reduce the overhead of storing each entry as a separate object in the quantum file, all strings having the same hash code are combined into one "storage element" in the quantum file system; each storage element is addressed by one "hash slot". The current version of the dynamic hashing algorithm used by hashtest.cpp allocates one hash slot for every six strings in the file. Since the average individual string length is about 60 characters, and there are 18 bytes of overhead for each string in a storage element (four bytes for the key length, four bytes for the data length, and 10 bytes for the key value including its null terminator), the average storage element will be a bit under 500 bytes long.

A larger element packing factor than six strings per element would produce a smaller hash table and would therefore be more space efficient. However, the choice of this value is not critical with the current implementation of this algorithm, because any storage elements that become too long to fit into a quantum will be broken up and stored separately by a mechanism which I will get to later.
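The size estimate for a storage element can be checked with a few lines of arithmetic. The sketch below simply restates the figures quoted above as constants; the constant and function names are made up for this illustration and are not taken from hashtest.cpp.

```cpp
#include <cassert>
#include <cstddef>

// Figures quoted in the text; the names are hypothetical.
constexpr std::size_t kStringsPerSlot  = 6;   // strings per hash slot
constexpr std::size_t kAvgDataBytes    = 60;  // average string length
constexpr std::size_t kKeyLengthBytes  = 4;   // stored key length field
constexpr std::size_t kDataLengthBytes = 4;   // stored data length field
constexpr std::size_t kKeyBytes        = 10;  // nine digits plus null terminator

// Per-string overhead inside a storage element.
constexpr std::size_t kOverheadPerString =
    kKeyLengthBytes + kDataLengthBytes + kKeyBytes;

// Average size of one storage element, in bytes.
constexpr std::size_t AverageElementBytes() {
    return kStringsPerSlot * (kAvgDataBytes + kOverheadPerString);
}

static_assert(kOverheadPerString == 18, "overhead should match the text");
static_assert(AverageElementBytes() < 500, "a bit under 500 bytes");
```

Six strings at 60 + 18 bytes each come to 468 bytes, which agrees with the "bit under 500 bytes" estimate.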
Of course, in order to run meaningful tests, we have to do more than store records in the hash file; we also have to retrieve what has been stored, which means that we have to store the keys someplace so that we can use them again to retrieve the records corresponding to those keys. In the current version of this algorithm, I use a FlexArray (i.e., a persistent array of strings, such as we examined in the previous chapter) to store the values of the keys. However, in the original version of this algorithm, I was storing the key as an unsigned long value, so I decided to use the quantum file storage to implement a persistent array of unsigned long values, and store the keys in such an array.

Persistence Pays Off

It was surprisingly easy to implement a persistent array of unsigned long values,[8] for which I defined a typedef of Ulong, mostly to save typing. The header file for this data type is persist.h (Figure persist.00).

The interface for the PersistentArrayUlong class (quantum\persist.h) (Figure persist.00)
codelist/persist.00

As you can see, this class isn't very complex, and most of the implementation code is also fairly straightforward. However, we get a lot out of those relatively few lines of code; these arrays are not only persistent, but they also automatically expand to any size up to and including the maximum size of a quantum file. With the current maximum of 10000 16K blocks, a maximum-size PersistentArrayUlong could contain approximately 40,000,000 elements!

Of course, we don't store each element directly in a separate addressable entry within a main object; that would be inappropriate, because the space overhead per item would be larger than the Ulongs we want to store! Instead, we employ a two-level system similar to the one used in the dynamic hashing algorithm; the quantum file system stores "segments" of data, each one containing as many Ulongs as will fit in a quantum.
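The capacity figure and the two-level addressing can be sketched as follows. This is a rough illustration under two stated assumptions: a Ulong occupies four bytes (std::uint32_t stands in for it here), and a segment fills an entire 16K block, whereas a real quantum would lose some space to its own bookkeeping. The names ElementAddress and Locate are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions for this sketch: 4-byte elements, one segment per full block.
constexpr std::uint32_t kBlockBytes = 16 * 1024;  // "16K blocks"
constexpr std::uint32_t kMaxBlocks  = 10000;      // current maximum file size
constexpr std::uint32_t kElementsPerSegment =
    kBlockBytes / sizeof(std::uint32_t);          // 4096 elements

// Upper bound on the number of elements in a maximum-size array.
constexpr std::uint64_t kMaxElements =
    static_cast<std::uint64_t>(kMaxBlocks) * kElementsPerSegment;

// Position of an element: which segment, and where inside that segment.
struct ElementAddress {
    std::uint32_t segment;
    std::uint32_t offset;
};

inline ElementAddress Locate(std::uint32_t index) {
    return { index / kElementsPerSegment, index % kElementsPerSegment };
}

static_assert(kMaxElements == 40960000ULL, "roughly 40,000,000 elements");
```

Under these assumptions, element 1000000 lives in segment 244 at offset 576.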
To store or retrieve an element, we determine which segment the element belongs to and access the element by its offset in that segment. However, before we can use a PersistentArrayUlong, we have to construct it, which we can do via the default constructor (Figure persist.01).

The default constructor for the PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.01)
codelist/persist.01

This constructor doesn't actually create a usable array; it is only there to allow us to declare a PersistentArrayUlong before we want to use it. When we really want to construct a usable array, we use the normal constructor shown in Figure persist.02.

The normal constructor for the PersistentArrayUlong class (from quantum\persist.cpp) (Figure persist.02)
codelist/persist.02

As you can see, to construct a real usable array, we provide a pointer to the quantum file in which it is to be stored, along with a name for the array. The object directory of that quantum file is searched for a main object with the name specified in the constructor; if it is found, the construction is complete. Otherwise, a new object is created with one element, expandable to fill up the entire file system if necessary.

To store an element in the array we have created, we can use StoreElement (Figure persist.03).

The PersistentArrayUlong::StoreElement function (from quantum\persist.cpp) (Figure persist.03)
codelist/persist.03

This routine first calculates which segment of the array contains the element we need to store, and the element number within that segment. Then, if we are running the "debugging" version (i.e., asserts are enabled), it checks whether the segment number is within the maximum range we set up when we created the array. This test should never fail unless there is something wrong with the calling routine (or its callers), so that the element number passed in is absurdly large.
As discussed above, with all such conditional checks, we have to try to make sure that our testing is good enough to find any errors that might cause this to happen; with a "release" version of the program, this would be a fatal error.

Next, we check whether the segment number we need is already allocated to the array; if not, we increase the number of segments as needed by calling GrowMainObject, but don't actually initialize any new segments until they're accessed, so that "sparse" arrays won't take up as much room as ones that are filled in completely. Next, we get a copy of the segment containing the element to be updated; if it's of zero length, that means we haven't initialized it yet, so we have to allocate memory for the new segment and fill it with zeros.

At this point, we are ready to create an AccessVector called TempUlongVector of type Ulong and use it as a "template" (no pun intended) to allow access to the element we want to modify. Since AccessVector has the semantics of an array, we can simply set the ElementNumberth element of TempUlongVector to the value of the input argument p_Element; the result of this is to place the new element value into the correct place in the TempUlongVector array. Finally, we store TempUlongVector back into the main object, replacing the old copy of the segment.

To retrieve an element from an array, we can use GetElement (Figure persist.04).

The PersistentArrayUlong::GetElement function (from quantum\persist.cpp) (Figure persist.04)
codelist/persist.04

First, we calculate the segment number and element number, and check (via qfassert) whether the segment number is within the range of allocated segments; if it isn't, we have committed the programming error of accessing an uninitialized value. Assuming this test is passed, we retrieve the segment, set up the temporary Vector TempUlongVector to allow access to the segment as an SVector of Ulongs, and return the value from the ElementNumberth element of the array.
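The store and retrieve logic just described can be imitated in memory with a vector of segments. The sketch below is not the PersistentArrayUlong code: the class name SegmentedArray is hypothetical, segments live in std::vector rather than in a quantum file, the segment is modified in place instead of being copied out and stored back, and reading from an allocated but uninitialized segment is assumed to yield zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// In-memory stand-in for a segmented, lazily initialized array.
class SegmentedArray {
public:
    // elementsPerSegment must be greater than zero.
    explicit SegmentedArray(std::size_t elementsPerSegment)
        : perSegment_(elementsPerSegment) {}

    void StoreElement(std::size_t index, std::uint32_t value) {
        std::size_t segment = index / perSegment_;
        std::size_t offset  = index % perSegment_;
        // Grow the list of segments if needed; new ones stay empty
        // (uninitialized) so sparse arrays take up little room.
        if (segment >= segments_.size())
            segments_.resize(segment + 1);
        std::vector<std::uint32_t>& seg = segments_[segment];
        // A zero-length segment has not been initialized: zero-fill it.
        if (seg.empty())
            seg.assign(perSegment_, 0);
        seg[offset] = value;
    }

    std::uint32_t GetElement(std::size_t index) const {
        std::size_t segment = index / perSegment_;
        std::size_t offset  = index % perSegment_;
        // Reading beyond the allocated segments is a programming error.
        assert(segment < segments_.size());
        const std::vector<std::uint32_t>& seg = segments_[segment];
        return seg.empty() ? 0 : seg[offset];  // assumption: unset reads as 0
    }

    std::size_t SegmentCount() const { return segments_.size(); }

    std::size_t InitializedSegments() const {
        std::size_t count = 0;
        for (const std::vector<std::uint32_t>& seg : segments_)
            if (!seg.empty()) ++count;
        return count;
    }

private:
    std::size_t perSegment_;
    std::vector<std::vector<std::uint32_t> > segments_;
};
```

Storing a single element at index 9 with four elements per segment allocates three segments but initializes only the last one, which is the space saving for sparse arrays that the text mentions.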
All this is very well if we want to write things like "Y.Put(100,100000L);" or "X = Y.Get(100);", to store or retrieve the 100th element of the Y "array", respectively. But wouldn't it be much nicer to be able to write "Y[100] = 100000L;" or "X = Y[100];" instead?

In Resplendent Array

Clearly, that would be a big improvement in the syntax; as it happens, it's not hard to make such references possible, with the addition of only a few lines of code.[9] Unfortunately, this code is not the most straightforward, but the syntactic improvement that it provides is worth the trouble. The key is operator[ ] (Figure persist.05).

The PersistentArrayUlong::operator[ ] function (from quantum\persist.cpp) (Figure persist.05)
codelist/persist.05

This function returns a temporary value of a type that behaves differently in the context of an "lvalue" reference (i.e., a "write") than it does when referenced as an "rvalue" (i.e., a "read"). In order to follow how this process works, let's use the example in Figure persist1.

Persistent array example (Figure persist1)
codelist/perexam.00

The first question to be answered is how the compiler decodes the following line:

Save[1000000L] = 1234567L;

According to the definition of PersistentArrayUlong::operator[ ], this operator returns a PersistentArrayUlongRef that is constructed with the two parameters *this and p_Index, where the former is the PersistentArrayUlong object for which operator[ ] was called (i.e., Save), and the latter is the value inside the [ ], which in this case is 1000000L. What is this return value? To answer this question, we have to look at the normal constructor for the PersistentArrayUlongRef class (Figure persist.06).
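The general technique at work here, returning a small "reference" object from operator[ ] that routes writes and reads to different functions, can be sketched as follows. This is a generic proxy-object illustration, not the book's PersistentArrayUlongRef; the classes SparseArray and ElementRef are hypothetical, the storage is an in-memory std::map, and unset elements are assumed to read as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

class SparseArray;

// Temporary object returned by operator[]; remembers the array and index.
class ElementRef {
public:
    ElementRef(SparseArray& array, std::size_t index)
        : array_(array), index_(index) {}
    ElementRef(const ElementRef&) = default;

    // "lvalue" use, e.g. Save[i] = v; becomes a Put on the array.
    ElementRef& operator=(std::uint32_t value);
    // Assigning one element to another copies the value, not the proxy.
    ElementRef& operator=(const ElementRef& other);
    // "rvalue" use, e.g. x = Save[i]; becomes a Get on the array.
    operator std::uint32_t() const;

private:
    SparseArray& array_;
    std::size_t  index_;
};

class SparseArray {
public:
    void Put(std::size_t index, std::uint32_t value) { data_[index] = value; }

    std::uint32_t Get(std::size_t index) const {
        std::map<std::size_t, std::uint32_t>::const_iterator it =
            data_.find(index);
        return it == data_.end() ? 0 : it->second;  // assumption: unset is 0
    }

    ElementRef operator[](std::size_t index) {
        return ElementRef(*this, index);
    }

private:
    std::map<std::size_t, std::uint32_t> data_;
};

inline ElementRef& ElementRef::operator=(std::uint32_t value) {
    array_.Put(index_, value);
    return *this;
}

inline ElementRef& ElementRef::operator=(const ElementRef& other) {
    return *this = static_cast<std::uint32_t>(other);
}

inline ElementRef::operator std::uint32_t() const {
    return array_.Get(index_);
}
```

With these definitions, a statement such as Save[1000000] = 1234567; first builds an ElementRef from Save and the index and then calls its assignment operator, while reading Save[1000000] into an integer variable goes through the conversion operator instead.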

Posted: 07/07/2014, 08:20