Document: Managing Time in Relational Databases - P19 pdf


    AND c.eff_beg_dt <= cl.row_crt_dt
    AND c.eff_end_dt > cl.row_crt_dt
    AND c.asr_beg_dt <= cl.row_crt_dt
    AND c.asr_end_dt > cl.row_crt_dt
WHERE cl.claim_amt > p.copay_amt
ORDER BY cl.adjud_dt, c.client_nbr, p.policy_nbr, p.eff_beg_dt;

To conclude this section, we show what this query might look like if the SQL language supported PERIOD datatypes, and also our taxonomy of Allen relationships. We suppose that the taxonomy node [fills⁻¹] is represented by the reserved word INCLUDES. With a SQL language like this, the Asserted Versioning schema no longer has pairs of dates to represent its two time periods. Instead, it has the single columns asr_per and eff_per.

SELECT c.client_nbr, c.client_nm, p.policy_nbr, p.policy_type,
    p.copay_amt, cl.service_dt, cl.claim_amt, cl.adjud_dt
FROM Claim cl
INNER JOIN Policy_AV p
    ON p.policy_oid = cl.policy_oid
    AND p.eff_per INCLUDES cl.service_dt
    AND p.asr_per INCLUDES cl.adjud_dt
INNER JOIN Client_AV c
    ON c.client_oid = p.client_oid
    AND c.eff_per INCLUDES cl.row_crt_dt
    AND c.asr_per INCLUDES cl.row_crt_dt
WHERE cl.claim_amt > p.copay_amt
ORDER BY cl.adjud_dt, c.client_nbr, p.policy_nbr, p.eff_beg_dt;

In either form, what is striking about the query is its simplicity relative to the complexity of the bi-temporal semantics that underlies it. Unlike queries in the standard temporal model and, for that matter, uni-temporal queries in the alternative temporal model as well, this query does not assemble a collection of rows and then proceed to check for temporal gaps and temporal overlaps within sub-selected collections of those rows. Asserted Versioning enforces bi-temporal semantics once, as the data is being created and modified, rather than each time the data is queried.
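The date-pair form of the join can be tried out with a small sketch. The following uses Python's built-in sqlite3 module and covers only the Claim-to-Policy_AV part of the query; the column subset, the sample rows, and the use of ISO date strings with 9999-12-31 as the "until further notice" value are assumptions made for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Policy_AV (
    policy_oid INTEGER, policy_nbr TEXT, copay_amt REAL,
    eff_beg_dt TEXT, eff_end_dt TEXT,   -- effective time, closed-open
    asr_beg_dt TEXT, asr_end_dt TEXT    -- assertion time, closed-open
);
CREATE TABLE Claim (
    policy_oid INTEGER, service_dt TEXT, adjud_dt TEXT, claim_amt REAL
);
-- Two versions of one policy with different copays (made-up data).
INSERT INTO Policy_AV VALUES
    (1, 'P100', 20.0, '2009-01-01', '2010-01-01', '2009-01-01', '9999-12-31'),
    (1, 'P100', 30.0, '2010-01-01', '9999-12-31', '2009-12-15', '9999-12-31');
INSERT INTO Claim VALUES
    (1, '2009-06-10', '2009-07-01', 150.0),
    (1, '2010-03-05', '2010-04-01', 25.0);
""")

# Each "period INCLUDES point" test is spelled out as a pair of date
# comparisons: begin <= point AND end > point.
rows = conn.execute("""
SELECT p.policy_nbr, p.copay_amt, cl.service_dt, cl.claim_amt
FROM Claim cl
INNER JOIN Policy_AV p
    ON  p.policy_oid = cl.policy_oid
    AND p.eff_beg_dt <= cl.service_dt AND p.eff_end_dt > cl.service_dt
    AND p.asr_beg_dt <= cl.adjud_dt   AND p.asr_end_dt > cl.adjud_dt
WHERE cl.claim_amt > p.copay_amt
ORDER BY cl.adjud_dt
""").fetchall()

print(rows)
```

Each claim is matched only to the policy version in effect on its service date, so the second claim (25.00 against a 30.00 copay) is filtered out and only the first claim is returned.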
Chapter 14 ALLEN RELATIONSHIP AND OTHER QUERIES (continued)

In Other Words

With appropriate temporal extensions to the SQL language, the expression of all thirteen Allen relationships, and of this and other relationships which are combinations of those thirteen relationships, would be greatly simplified. The first thing that is needed to support predicates for these relationships is to provide a PERIOD datatype, as we discussed in Chapter 3. With that datatype available, SQL could express each of the relationships we have discussed with one binary predicate relating two time periods (not two pairs of dates).

For example, instead of having to request data associated with two time periods such that the first starts before the second and ends after the second starts but before the second ends, we could simply request data associated with two time periods such that the first [overlaps] the second. Or, instead of having to request data associated with two time periods such that the first doesn’t start after the second and doesn’t end before the second, we could simply request data associated with two time periods such that the first [fills] the second.

It is clearly easier to think about what information one wants from the database at the higher level of abstraction provided by this new datatype and these new relationships, rather than at the level of abstraction in which begin and end dates have to be used, as they are in the original formulation of the example. And it is just as clearly easier to write the corresponding SQL.

But even with today’s SQL which lacks these temporal extensions, Asserted Versioning manages assertion and effective time date pairs as user-defined PERIOD datatypes, and supports all the Allen relationships as well as the other relationships in our Allen relationship taxonomy. Asserted Versioning thus provides a migration path to the day when these extensions are supported in the SQL standard and in commercial DBMSs.
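To make the contrast between date pairs and a single period value concrete, here is a minimal sketch of a period type with two binary predicates. The class name Period, the method names overlaps and covers, and the use of integer clock ticks are all hypothetical choices for this illustration; each method simply encodes one of the two date-pair conditions as worded in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """Closed-open time period [beg, end) over integer clock ticks (hypothetical type)."""
    beg: int
    end: int

    def overlaps(self, other: "Period") -> bool:
        # Starts before the other, and ends after the other starts
        # but before the other ends.
        return self.beg < other.beg and other.beg < self.end < other.end

    def covers(self, other: "Period") -> bool:
        # Does not start after the other and does not end before it.
        return self.beg <= other.beg and self.end >= other.end


a = Period(1, 5)
b = Period(3, 8)
print(a.overlaps(b))               # one predicate instead of three comparisons
print(Period(1, 10).covers(b))
```

The caller writes one predicate between two period values, and the begin and end comparisons stay hidden inside the type.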
Glossary References

Glossary entries whose definitions form strong interdependencies are grouped together in the following list. The same glossary entries may be grouped together in different ways at the end of different chapters, each grouping reflecting the semantic perspective of each chapter. There will usually be several other, and often many other, glossary entries that are not included in the list, and we recommend that the Glossary be consulted whenever an unfamiliar term is encountered.

We note, in particular, that none of the nodes in the Asserted Versioning taxonomy of Allen relationships are included in this list. In general, we leave taxonomy nodes out of these lists since they are long enough without them.

Allen relationships
Asserted Versioning Framework (AVF)
episode
clock tick
closed-open
contiguous
granularity
effective begin date
effective end date
object
PERIOD datatype
point in time
time period
temporal entity integrity (TEI)
temporal referential integrity (TRI)
the alternative temporal model
the standard temporal model
version

15 OPTIMIZING ASSERTED VERSIONING DATABASES

Bi-Temporal, Conventional, and Non-Temporal Databases
Data Volumes in Bi-Temporal and in Conventional Databases
Response Times in Bi-Temporal and in Conventional Databases
The Optimization Drill: Modify, Monitor, Repeat
Performance Tuning Bi-Temporal Tables Using Indexes
General Considerations
Indexes to Optimize Queries
Indexes to Optimize Temporal Referential Integrity
Other Techniques for Performance Tuning Bi-Temporal Tables
Avoiding MAX(dt) Predicates
NULL vs. 12/31/9999
Partitioning
Clustering
Materialized Query Tables
Standard Tuning Techniques
Glossary References

One concern about Asserted Versioning is with how well it will perform.
We believe that with recent improvements in technology, and with the use of the physical design techniques described in this chapter, Asserted Versioning databases can achieve performance very close to that of conventional databases. This is especially true for queries, which are usually the most frequent kind of access to any relational database. The AVF, our own implementation of Asserted Versioning, is designed to operate well with large data volume databases supporting a high volume of mixed-type data retrieval requests.

Managing Time in Relational Databases. Doi: 10.1016/B978-0-12-375041-9.00015-7 Copyright © 2010 Elsevier Inc. All rights of reproduction in any form reserved.

Bi-Temporal, Conventional, and Non-Temporal Databases

In this section, we compare data volumes and response times in bi-temporal and in conventional databases. We find that differences in both data volumes and response times are generally quite small, and are usually not good reasons for hesitating to implement bi-temporal data in even the largest databases of the world’s largest corporations.

Data Volumes in Bi-Temporal and in Conventional Databases

It might seem that a bi-temporal database will have a lot more data in it than a conventional database, and will consequently take a lot longer to process. It is true that the size of a bi-temporal database will be larger than that of an otherwise identical database which contains only current data about persistent objects. But in our consulting engagements, which span several decades and dozens of clients, we have found that in most mission-critical systems, temporal data is jury-rigged into ostensibly non-temporal databases.

There are any number of ways that this may happen. For example, in some systems a version date is added to the primary key of selected tables. In other systems, more advanced forms of best practice versioning (as described in Chapter 4) are employed.
Sometimes, history will be captured by triggering an insert into a history table every time a particular non-temporal table is modified. Another approach is to generate a series of periodic snapshot tables that capture the state of a non-temporal table at regular intervals.

Of course, a database with no temporal data at all will certainly be smaller than the same database with temporal data. But adding up the overhead associated with embedded best practice versioning, or with triggered history, periodic snapshots or some combination of these and other techniques, the amount of data in a so-called non-temporal database may be as much or even more than the amount of data in a bi-temporal database.

Throughout this book, we have been using the terms “non-temporal database” and “conventional database” as equivalent expressions. But now we have a reason to distinguish them. From now on, we will call a database “non-temporal” only if it contains no temporal data about persistent objects at all.¹ And from now on, we will use the term “conventional database” to refer to databases that may or may not contain temporal data about persistent objects (and that usually do), but that do not contain explicitly bi-temporal tables and instead incorporate temporal data by using variations on one or more of the ad hoc methods we have described.

Response Times in Bi-Temporal and in Conventional Databases

At the level of individual tables, a table lacking temporal data will clearly have less data than an otherwise identical table that also contains temporal data. But even if a table has more data than another table, it may perform nearly as well as that other table because response times are usually not linear to the amount of data in the target table. Response times will be approximately linear to the amount of data in the table in the case of full table scans, but will almost never be linear for direct access reads.
A direct (random) read to a table with five million rows will perform almost as well as a direct read to a table with only one million rows, provided that the table is indexed properly and that the number of non-leaf index levels is the same. And, in most cases, they will be the same, or very close to it.

In addition, when adding in the overhead of triggers, of an exponentially growing number of dependents, and of the often inefficient SQL used to access and maintain data in conventional databases, it is likely that using the AVF to manage temporal data in an Asserted Versioning database will prove to be a more efficient method of managing temporal data than directly invoking DBMS methods to manage temporal data in a conventional database.

¹ The point of adding “about persistent objects”, of course, is to distinguish between objects and events, as we did in our taxonomy in Chapter 2. So a “non-temporal database”, in this new sense, may contain event tables, i.e. tables of transactions. And it may also contain fact-dimension data marts. What it may not contain is data about any historical (or future) states of persistent objects.

The Optimization Drill: Modify, Monitor, Repeat

Performance optimization, also known as “performance tuning”, is usually an iterative approach to making and then monitoring modifications to an application and its database. It could involve adjusting the configuration of the database and server, or making changes to the applications and the SQL that maintain and query the database. As authors of this book, we can’t participate in the specific modify and monitor iterative processes being carried on by any of our readers and their IT organizations. But we can describe factors that are likely to apply to any Asserted Versioning implementation.
These factors include the number of users, the complexity of the application and the SQL, the volatility of the data, and the DBMS and server platform. The major DBMSs may optimize varying configurations differently, and may have extensions that can be used to simplify and improve a “plain vanilla” implementation of Asserted Versioning.

In this chapter, we will take a broad brush approach and, in general, discuss optimization techniques that apply to the temporalization of any relational database, regardless of what industry its owning organization is part of, and regardless of what types of applications it supports. Each reader will need to review these recommendations and determine if and how they apply to specific databases and applications that she may be responsible for.

To repeat a point once more before we read the following sections: although we use the term “date” in this book to describe the delimiters of assertion and effective time periods, those delimiters can actually be of any time duration, such as a day, minute, second or microsecond. We use a month as the clock tick granularity in many of our examples. But in most cases, a finer level of granularity will be chosen, such as a timestamp representing the smallest clock tick supported by the DBMS.

Performance Tuning Bi-Temporal Tables Using Indexes

Many indexes are designed using something similar to a B-tree (balanced tree) structure, in which each node points to its next-level child nodes, and the leaf nodes contain pointers to the desired data. These indexes are used by working down from the top of the hierarchy until the leaf node containing the desired pointer is reached. Each pointer is a specific index value paired with the physical address, page or row id of the row that matches that value. From that point, the DBMS can do a direct read and retrieve the I/O page that contains the desired data.
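The earlier claim that a direct read against five million rows costs about the same as one against one million rows follows from how slowly the depth of such a tree grows. The sketch below is a simplified model, not a description of any particular DBMS: it assumes every index node holds the same number of entries, and the fanout of 200 is an arbitrary illustrative value.

```python
def index_levels(row_count: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= row_count.

    Simplified model: every node, leaf or non-leaf, holds `fanout` entries.
    """
    levels = 1
    capacity = fanout
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


FANOUT = 200  # assumed entries per index node, for illustration only

for rows in (1_000_000, 5_000_000, 10_000_000):
    print(rows, index_levels(rows, FANOUT))
```

Under these assumptions the one-million-row and five-million-row tables both need three levels, so a direct read walks the same number of index nodes in each case; the model only adds a fourth level once the row count passes eight million.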
B-tree indexes for bi-temporal tables work no differently than B-tree indexes for non-temporal tables. Knowing how these indexes work, our design objective is to construct indexes that will optimize the speed of access to the most frequently accessed data. In bi-temporal tables, we believe, that will almost always be the currently asserted current versions of the objects represented in those tables.

As index designers, our task is two-fold. First, we need to determine the best columns to index on. Then we need to arrange those columns in the best sequence.

General Considerations

The physical sequence of columns within an index has a significant impact on the performance of queries that use that index. Our objective is to get to the desired row in a table with the minimum amount of I/O activity against the index, followed by a single direct read to the table itself. So in determining the sequence of columns in an index, a good idea is to put the most frequently used lookup columns in the leftmost (initial) nodes of the index. These columns are often the columns that make up the business key, or perhaps some other identifier such as the primary key, or a foreign key.

Against asserted version tables, most queries will be similar to queries against non-temporal tables except that a few temporal predicates will be added to the queries. These temporal predicates eliminate rows whose assertion time periods and/or effective time periods are not what the query is looking for.

An object that is represented by exactly one row in a non-temporal table may be represented by any number of rows in a temporal table. But for normal business use, the one current row in the temporal table, i.e. the row which corresponds to that one row in the non-temporal table, is likely to be accessed much more frequently than any of the other rows.
Unless we properly combine temporal columns with non-temporal columns in the index, access to that current row may require us to scan through many past or future rows to get to it. Of course, we are talking about both a scan of index leaf pages, as well as the more expensive scan of the table itself. When specific rows are being searched for, and when they may or may not be clustered close to one another in physical storage, we want to minimize any type of scan.

Another important consideration in determining the optimal sequence of columns in an index is that optimizers may decide not to use a column in an index unless values have been provided for all the columns to its left, those being the columns that help to more directly trace a path through the higher levels of the index tree, using the columns that match supplied predicates. So if we design an index with its temporal columns too far to the right, and with unqualified columns prior to them, a scan might still be triggered whenever the optimizer looks for the one current row for the object being queried.

On the other hand, as we will see, the solution is not to simply make the temporal columns left-most in the index. There will usually be many more non-current rows for an object, in an asserted version table, than the one current row for that object. The table may contain any number of rows representing the history of the object, and any number of rows representing anticipated future states of the object. The table may contain any number of no longer asserted rows for that object, as well as rows that we are not yet prepared to assert. So what we want the optimizer to do is to jump as directly as possible to the one currently asserted current version for an object, without having to scan through a potentially large number of non-current rows.

Indexes to Optimize Queries

Let’s look at an example. We will assume that it is currently September 2011.
So the next time the clock ticks, according to the clock tick granularity used in this book, it will be October 2011. In the table shown in Figure 15.1, there are nine rows representing the object whose object identifier is 55.

Three of those rows are historical versions. Their effectivity periods are past. They represent past states of the object they refer to. We designate them with “pe” (past effective) in the state column of the table.²

Another three of those rows are no longer asserted. Their assertion periods are past. They represent claims that we once made, claims that the statements which those rows made about the objects which they represented were true statements. But now we no longer make those claims. They exist in the assertion time past. We designate these rows with “pa” (past asserted) in the state column of the table.

² The state and row # columns are not columns of the table itself. They are metadata about the rows of the table, just like the row # column in the tables shown in other chapters in this book.

Two of those rows are not yet asserted. They are deferred assertions. We are not yet willing to claim that the statements made by those rows are true statements. We designate these rows with “fa” (future asserted) in the state column of the table.

There is one current row representing the object whose identifier is 55. This row is currently asserted and, within current assertion time, became effective in August 2009 and will remain in effect until further notice. Note, however, that it will remain asserted only until October 2012. At that time, if nothing in the data changes, the database will cease to say that the data for object 55 is Kiwi from August 2009 until further notice. Instead, it will say that data for object 55 is Kiwi from August 2009 to December 2013, and that from December 2013 until further notice, it will be Grapes.
We designate this earlier, but current, row with “cc” (currently asserted current version) in the state metadata column of the table.

The SQL to retrieve the one current row for object 55 is:

SELECT data
FROM mytable
WHERE oid = 55
    AND eff_beg_dt <= Now()
    AND eff_end_dt > Now()
    AND asr_beg_dt <= Now()
    AND asr_end_dt > Now()

Most optimizers will use the index tree to locate the row id (rid) of the qualifying row or rows using, first of all, the columns that have direct matching predicates, such as EQUALS or IN, columns which are sometimes called match columns. These optimizers will also use the index tree for a column with a range predicate, such as BETWEEN or LESS THAN OR EQUAL TO (<=), provided that it is the first column in the index or the first column following the direct match columns.

state  row #  oid  eff-beg  eff-end  asr-beg  asr-end  data
pa     1      55   Jan09    9999     Jan09    Feb09    Apples
pe     2      55   Jan09    Mar09    Feb09    9999     Apples
pa     3      55   Mar09    9999     Feb09    Jun09    Berries
pe     4      55   Mar09    Jun09    Jun09    9999     Berries
pa     5      55   Jun09    9999     Jun09    Aug09    Cherries
pe     6      55   Jun09    Aug09    Aug09    9999     Cherries
cc     7      55   Aug09    9999     Aug09    Oct12    Kiwi
fa     8      55   Aug09    Dec13    Oct12    9999     Kiwi
fa     9      55   Dec13    9999     Oct12    9999     Grapes

Figure 15.1 A Bi-Temporal Table.

[...]

... the index entries for current and deferred assertions together, and also separate from the index entries for assertions definitely known to be past assertions, resulting in a better buffer hit ratio. In fact, the index could be used as both a clustering and a partitioning index, in which case it would also keep more of the current rows in the target table in memory. To the circa flag eliminating definitely...
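The query for the one current row can be checked against the nine rows of Figure 15.1 with a short sketch using Python's built-in sqlite3 module. The row values below are transcribed from the flattened figure; writing months as YYYY-MM strings, writing 9999 as '9999-12', naming the metadata column row_nbr, and passing the current month as a parameter in place of Now() are all assumptions made for this illustration.

```python
import sqlite3

NOW = "2011-09"  # the text assumes it is currently September 2011

# (row #, oid, eff-beg, eff-end, asr-beg, asr-end, data), per Figure 15.1.
FIGURE_15_1 = [
    (1, 55, "2009-01", "9999-12", "2009-01", "2009-02", "Apples"),
    (2, 55, "2009-01", "2009-03", "2009-02", "9999-12", "Apples"),
    (3, 55, "2009-03", "9999-12", "2009-02", "2009-06", "Berries"),
    (4, 55, "2009-03", "2009-06", "2009-06", "9999-12", "Berries"),
    (5, 55, "2009-06", "9999-12", "2009-06", "2009-08", "Cherries"),
    (6, 55, "2009-06", "2009-08", "2009-08", "9999-12", "Cherries"),
    (7, 55, "2009-08", "9999-12", "2009-08", "2012-10", "Kiwi"),
    (8, 55, "2009-08", "2013-12", "2012-10", "9999-12", "Kiwi"),
    (9, 55, "2013-12", "9999-12", "2012-10", "9999-12", "Grapes"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE mytable (
    row_nbr INTEGER, oid INTEGER,
    eff_beg_dt TEXT, eff_end_dt TEXT,
    asr_beg_dt TEXT, asr_end_dt TEXT,
    data TEXT
)""")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?, ?)", FIGURE_15_1)

# Same predicates as the query in the text, with :now standing in for Now().
current = conn.execute("""
SELECT row_nbr, data
FROM mytable
WHERE oid = 55
  AND eff_beg_dt <= :now AND eff_end_dt > :now
  AND asr_beg_dt <= :now AND asr_end_dt > :now
""", {"now": NOW}).fetchall()

print(current)
```

Only row 7 (Kiwi) survives all four temporal predicates, which agrees with the "cc" state shown for that row in the figure.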
[...]

... of the index tree where all the leaf node pointers point to rows in the target table which satisfy those match predicates as well as that first range predicate. The most important thing to note here is that we get to this starting point in the search of the index without doing a scan. Our strategy is to get to the desired result using an index with little or no scanning. Once we reach that starting point,...

[...]

... effective begin date index column, and make it descending instead of ascending. But even with a descending sort order, there are still the same eight rows that qualify and need further filtering. In fact, most rows in a temporal database usually have an effective begin date less than Now(). So effective begin date does not appear to be a good column to place immediately after the last match column in the index...

[...]

... to find the one current row. Since physical I/Os are one of the main causes of performance problems, reducing them is one of our main opportunities for optimization. And this particular sequence of index columns doesn’t seem to do a good job in reducing I/O, either in the index or in the table itself. Since there are probably more rows for object 55’s past than for its future, we might consider reversing...

[...]

... other columns are in the index, it will probably apply those filters via the index. If no other columns are in the index, it will go to the target table itself and apply the criteria that are not included in the index. Doing so, it will return a result set containing only row 7. Row 7’s assertion end date has not yet been reached, so it is still currently asserted. And the assertion begin dates for rows...
[...]

... query’s result set without doing any scanning at all. When there are no more future effective versions found in the index scan, we will have assembled a list of index pointers to all rows which the index scan did not disqualify. But in this example, there is one more row with a future effective begin date, that being row 7. So, from its scan starting point, the index will scan rows 8, 7 and 9 and apply...

[...]

... But unlike that circa-first index, this index is also helpful for queries looking for as-was asserted data, that data being the mistakes we have made in our SSN data. If we are looking for past assertions, it may also improve performance to code the circa flag using an IN clause. Some optimizers will manage short IN clause lists in an index lookaside buffer, effectively utilizing the predicate as though...

[...]

... columns in an index is so important. Most important of all is to choose the correct range predicate column to place immediately after the common match predicate columns. To put the same point in other words: most important of all is to get positioned into the index for the desired row without resorting to scanning. Suppose that the sequence of columns in the index is {oid, eff_beg_dt, asr_beg_dt}. In this...

[...]

... remaining criteria including assertion begin date. And the same eight rows will be qualified by that scan. Finally, the DBMS will use the row ids (rids) of the qualifying rows, and read the table itself. If the table is physically clustered on exactly this sequence of columns, we might get all eight rows in one I/O. On the other hand, in the worst case,...
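The "eight rows" mentioned in these fragments can be reproduced with a small model of how an index led by {oid, eff_beg_dt} would be used. This is only a sketch of the reasoning, not of any real optimizer: the two list comprehensions stand in for index positioning and for the follow-up filtering, the Row type is hypothetical, and the dates are transcribed from Figure 15.1 using YYYY-MM strings, with 9999 written as '9999-12'.

```python
from collections import namedtuple

Row = namedtuple("Row", "row oid eff_beg eff_end asr_beg asr_end")

NOW = "2011-09"  # the text assumes it is currently September 2011

ROWS = [
    Row(1, 55, "2009-01", "9999-12", "2009-01", "2009-02"),
    Row(2, 55, "2009-01", "2009-03", "2009-02", "9999-12"),
    Row(3, 55, "2009-03", "9999-12", "2009-02", "2009-06"),
    Row(4, 55, "2009-03", "2009-06", "2009-06", "9999-12"),
    Row(5, 55, "2009-06", "9999-12", "2009-06", "2009-08"),
    Row(6, 55, "2009-06", "2009-08", "2009-08", "9999-12"),
    Row(7, 55, "2009-08", "9999-12", "2009-08", "2012-10"),
    Row(8, 55, "2009-08", "2013-12", "2012-10", "9999-12"),
    Row(9, 55, "2013-12", "9999-12", "2012-10", "9999-12"),
]

# Step 1: what the leading index columns can narrow down directly:
# the match column (oid) plus the first range column (eff_beg_dt <= Now).
positioned = [r for r in ROWS if r.oid == 55 and r.eff_beg <= NOW]

# Step 2: the remaining temporal criteria, applied to every row from step 1.
current = [
    r for r in positioned
    if r.eff_end > NOW and r.asr_beg <= NOW and r.asr_end > NOW
]

print(len(positioned), [r.row for r in current])
```

Eight of the nine rows pass the effective begin date test, so that column does little to narrow the search, and the remaining filters still have to examine all eight to find row 7.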
[...]

... things are currently like are precisely the rows we find in non-temporal tables. Rather than being some exotic kind of bi-temporal construct, they are actually the “plain vanilla” data that is the only data found in most of the tables in a conventional database. For queries to such data, asserted version tables containing a circa flag, and having the index just described, will nearly match the performance ...

Date posted: 26/01/2014, 08:20
