
Joe Celko's SQL for Smarties - Advanced SQL Programming


CHAPTER 29: TEMPORAL QUERIES

    1 1/2 hens * 1 1/2 days * rate = 1 1/2 eggs;   multiply by eggs per hen
    1 1/2 days * rate = 1 egg per hen;             divide by the number of hens
    rate = 2/3 egg per hen per day;                divide by 1 1/2 days

If you still do not get it, draw a graph.

29.1 Temporal Math

Almost every SQL implementation has a DATE data type, but the functions available for that data type vary quite a bit. The most common ones are a constructor that builds a date from integers or strings; extractors to pull out the month, day, or year; and some display options to format output.

You can assume that your SQL implementation has simple date arithmetic functions, although with different syntax from product to product, such as:

1. A date plus or minus a number of days yields a new date
2. A date minus a second date yields an integer number of days

Table 29.1 shows the valid combinations of <datetime> and <interval> data types in Standard SQL:

Table 29.1 Valid Combinations of Temporal Data Types in Standard SQL

    <datetime> - <datetime>        = <interval>
    <datetime> + <interval>        = <datetime>
    <interval> (* or /) <numeric>  = <interval>
    <interval> + <datetime>        = <datetime>
    <interval> + <interval>        = <interval>
    <numeric> * <interval>         = <interval>

Other rules, dealing with time zones and the relative precision of the two operands, are intuitively obvious.

There should also be a function that returns the current date from the system clock. This function has a different name with each vendor: TODAY, SYSDATE, NOW(), CURRENT DATE, and getdate() are some examples. There may also be a function to return the day of the week from a date, which is sometimes called DOW() or WEEKDAY(). Standard SQL provides for CURRENT_DATE, CURRENT_TIME [(<time precision>)], and CURRENT_TIMESTAMP [(<timestamp precision>)] functions, which are self-explanatory.
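As a rough illustration of the combinations in Table 29.1, here is a minimal Python sketch rather than SQL. It assumes that datetime.date can stand in for <datetime> and datetime.timedelta for <interval>; the variable names and sample dates are made up for the example, and the Python types only approximate the Standard SQL ones.

```python
from datetime import date, timedelta

trade = date(2005, 2, 1)     # sample <datetime> values (hypothetical)
later = date(2005, 2, 15)

# <datetime> - <datetime> = <interval>
gap = later - trade

# <datetime> + <interval> = <datetime>
due = trade + timedelta(days=3)

# <interval> * <numeric> = <interval>  and  <numeric> * <interval> = <interval>
doubled = gap * 2
also_doubled = 2 * gap

# <interval> / <numeric> = <interval>
halved = gap / 2

# <interval> + <interval> = <interval>
total = gap + timedelta(days=1)

print(gap.days, due.isoformat(), doubled.days, halved.days, total.days)
```

Adding two dates is not in the table, and Python agrees: date + date raises a TypeError.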
29.2 Personal Calendars

One of the most common applications of dates is to build calendars that list upcoming events or actions to be taken by their user. People have no trouble with using a paper calendar to trigger their own actions, but the idea of having an internal calendar as a table in their database is somehow strange. Programmers seem to prefer to write a function that calculates the date and matches it to events.

It is easier to create a table for cyclic data than people initially think. The months and days of the week within a year repeat themselves in a cycle of 28 years, provided the span does not cross a century year that is not a leap year (1901 through 2099 is safe). A table of just more than 10,000 rows (28 years is 10,227 days) can hold a complete cycle. The full Gregorian calendar repeats itself every 400 years, which is 146,097 days or exactly 20,871 weeks, so today is on the same day of the week that it was on 400 years ago.

As an example, consider the rule that a stockbroker must settle a transaction within three business days after a trade. Business days are defined as excluding Saturdays, Sundays, and certain holidays. The holidays are determined at the start of the year by the New York Stock Exchange, but this can be changed by an act of Congress or presidential decree, or the SEC can order that trading stop in a security. The problem is how to write an SQL query that will return the proper settlement date given a trade date.

There are several tricks in this problem. The real trick is to decide what you want and not to be fooled by what you have. You have a list of holidays, but you want a list of settlement days. Let’s start with a table of the given holidays and their names:

    CREATE TABLE Holidays -- Insert holiday list into this table
    (holiday_date DATE NOT NULL PRIMARY KEY,
     holiday_name CHAR(20) NOT NULL);

The next step is to build a table of trade and settlement dates for the whole year. Building the INSERT INTO statements to load the second table is easily done with a spreadsheet; these always have good date functions.
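If a spreadsheet is not at hand, a few lines of script can write the same INSERT INTO statement. The following is a minimal sketch under stated assumptions: the function names settlement_seed_rows and as_insert are hypothetical, the table and column names follow the Settlements table that the text defines next, and each row starts with settle_date equal to trade_date so that it can be corrected later.

```python
from datetime import date, timedelta

def settlement_seed_rows(first: date, last: date):
    """Yield one (trade_date, settle_date) pair per calendar day in the range.

    Both columns start out as the same day; weekends and holidays are
    deliberately left in, to be deleted by a later SQL statement.
    """
    day = first
    while day <= last:
        yield day.isoformat(), day.isoformat()
        day += timedelta(days=1)

def as_insert(rows):
    """Format the pairs as a single multi-row INSERT INTO statement."""
    values = ",\n".join(f"('{trade}', '{settle}')" for trade, settle in rows)
    return ("INSERT INTO Settlements (trade_date, settle_date)\n"
            f"VALUES\n{values};")

rows = list(settlement_seed_rows(date(2005, 2, 1), date(2005, 2, 3)))
statement = as_insert(rows)
print(statement)
```

For a whole year, pass date(2005, 1, 1) and a last date a little past December 31 so that the final trade dates still have settlement dates to point to.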
Let’s start by building a simple list of the dates over the range we want to use and putting them into a table called Settlements:

    CREATE TABLE Settlements
    (trade_date DATE NOT NULL PRIMARY KEY,
     settle_date DATE NOT NULL);

    INSERT INTO Settlements
    VALUES ('2005-02-01', '2005-02-01'),
           ('2005-02-02', '2005-02-02'),
           ('2005-02-03', '2005-02-03');
    etc.

We know that we cannot trade on a holiday or weekend. You probably could have excluded weekends in the spreadsheet, but if not, use this statement.

    DELETE FROM Settlements
     WHERE trade_date IN (SELECT holiday_date FROM Holidays)
        OR DayOfWeek(trade_date) IN ('Saturday', 'Sunday');

This does not handle the holiday settlements, however. The trouble with a holiday is that it can fall on a weekend, in which case we just handled it; it can last only one day; or it can last any number of days. The table of holidays is built on the assumption that each day of a multiday holiday has a row in the table.

We now have to update the table so that the regular settlement days are three business days forward of the trade date. But we have all the business days in the trade_date column of the Settlements table now.

    UPDATE Settlements
       SET settle_date
           = (SELECT S1.trade_date
                FROM Settlements AS S1
               WHERE Settlements.trade_date < S1.trade_date
                 AND (SELECT COUNT(*)
                        FROM Settlements AS S2
                       WHERE S2.trade_date > Settlements.trade_date
                         AND S2.trade_date <= S1.trade_date) = 3);

The count leaves out the trade date itself, so the row picked is the third business day after the trade. For the last three business days in the table the subquery finds no such row and returns NULL, which the NOT NULL constraint rejects, so load a few business days past the end of the range you actually need.

The final settlement table will be about 250 rows per year and only two columns wide. This is quite small; it will fit into main storage easily on any machine. Finding the settlement day is a straight, simple query; if you had built only the Holidays table, you would have had to provide procedural code.

29.3 Time Series

One of the major problems in the real world is how to handle a series of events that occur in the same time period or in some particular order.
The code is tricky and a bit hard to understand, but the basic idea is that you have a table with start and stop times for events, and you want to get information about them as a group.

29.3.1 Gaps in a Time Series

The timeline can be partitioned into intervals, and a set of intervals can be drawn from that partition for reporting. One of the stock questions on an employment form asks the prospective employee to explain any gaps in his record of employment. Most of the time this gap means that you were unemployed. If you are in data processing, you answer that you were consulting, which is a synonym for unemployed.

Given the following table, how would you write an SQL query to display the gaps and their durations for each of the candidates? You will have to assume that your version of SQL has DATE functions that can do some simple calendar math.

    CREATE TABLE JobApps
    (candidate CHAR(25) NOT NULL,
     jobtitle CHAR(15) NOT NULL,
     start_date DATE NOT NULL,
     end_date DATE, -- null means still employed
     CONSTRAINT started_before_ended
       CHECK (start_date <= end_date));

Notice that the end date of the current job is set to NULL because SQL does not support an “eternity” or “end of time” value for temporal data types. Using ‘9999-12-31 23:59:59.999999’, the highest possible date value that SQL can represent, is not a correct model and can cause problems when you do temporal arithmetic. The NULL can be handled with a COALESCE() function in the code, as I will demonstrate later.

It is obvious that this has to be a self-JOIN query, so you have to do some date arithmetic. The first day of each gap is the last day of an employment period plus one day, and the last day of each gap is the first day of the next job minus one day. This start-point and endpoint problem is the reason that SQL defined the OVERLAPS predicate the way it did.

All versions of SQL support temporal data types and arithmetic.
Unfortunately, no two implementations look alike, and few look like the ANSI standard. The first attempt at this query is usually something like the following, which will produce the right results, but with a lot of extra rows that are just plain wrong. Assume that if I add a number of days to a date, or subtract a number of days from it, I get a new date.

    SELECT J1.candidate,
           (J1.end_date + INTERVAL '1' DAY) AS gap_start,
           (J2.start_date - INTERVAL '1' DAY) AS gap_end,
           (J2.start_date - J1.end_date) AS gaplength
      FROM JobApps AS J1, JobApps AS J2
     WHERE J1.candidate = J2.candidate
       AND (J1.end_date + INTERVAL '1' DAY) < J2.start_date;

Here is why this does not work. Imagine that we have a table that includes a candidate named ‘Bill Jones’ with the following work history:

    JobApps
    candidate     jobtitle        start_date    end_date
    =======================================================
    'John Smith'  'Vice Pres'     '1999-01-10'  '1999-12-31'
    'John Smith'  'President'     '2000-01-12'  '2001-12-31'
    'Bill Jones'  'Scut Worker'   '2000-02-24'  '2000-04-21'
    'Bill Jones'  'Manager'       '2001-01-01'  '2001-01-05'
    'Bill Jones'  'Grand Poobah'  '2001-04-04'  '2001-05-15'

We would get this as a result:

    Result
    candidate     gap_start     gap_end       gaplength
    ==================================================
    'John Smith'  '2000-01-01'  '2000-01-11'   12
    'Bill Jones'  '2000-04-22'  '2000-12-31'  255
    'Bill Jones'  '2001-01-06'  '2001-04-03'   89
    'Bill Jones'  '2000-04-22'  '2001-04-03'  348  <= false data

The problem is that the ‘John Smith’ row looks just fine and can fool you into thinking that you are doing fine. He had two jobs; therefore, there was one gap in between. However, ‘Bill Jones’ cannot be right: only two gaps separate three jobs, yet the query shows three gaps. The query does its JOIN on all possible combinations of start and end dates in the original table. This gives false data in the results by counting the end of one job, ‘Scut Worker’, and the start of another, ‘Grand Poobah’, as a gap.
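The extra row is easy to reproduce. The sketch below runs the first-attempt join against the sample work history in an in-memory SQLite database through Python's sqlite3 module. It is an illustration under assumptions, not the author's code: SQLite has no INTERVAL syntax, so its date() and julianday() functions stand in for the Standard SQL date arithmetic, and ISO-8601 text stands in for the DATE type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE JobApps
    (candidate  TEXT NOT NULL,
     jobtitle   TEXT NOT NULL,
     start_date TEXT NOT NULL,  -- ISO-8601 text stands in for DATE
     end_date   TEXT,           -- NULL means still employed
     CHECK (start_date <= end_date))
""")
conn.executemany(
    "INSERT INTO JobApps VALUES (?, ?, ?, ?)",
    [("John Smith", "Vice Pres", "1999-01-10", "1999-12-31"),
     ("John Smith", "President", "2000-01-12", "2001-12-31"),
     ("Bill Jones", "Scut Worker", "2000-02-24", "2000-04-21"),
     ("Bill Jones", "Manager", "2001-01-01", "2001-01-05"),
     ("Bill Jones", "Grand Poobah", "2001-04-04", "2001-05-15")])

# First-attempt self-join: every job end is paired with every later job start.
naive = conn.execute("""
    SELECT J1.candidate,
           date(J1.end_date, '+1 day')   AS gap_start,
           date(J2.start_date, '-1 day') AS gap_end,
           CAST(julianday(J2.start_date) - julianday(J1.end_date)
                AS INTEGER)              AS gaplength
      FROM JobApps AS J1, JobApps AS J2
     WHERE J1.candidate = J2.candidate
       AND date(J1.end_date, '+1 day') < J2.start_date
     ORDER BY J1.candidate, gap_start, gap_end
""").fetchall()

for row in naive:
    print(row)
```

Three of the four rows belong to ‘Bill Jones’, one more than his three jobs can have gaps between; the row that runs from '2000-04-22' to '2001-04-03' skips over the ‘Manager’ job entirely.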
The idea is to pair each job only with the next job that starts after it ends. This can be done with a MIN() function and a correlated subquery. The final result is this:

    SELECT J1.candidate,
           (J1.end_date + INTERVAL '1' DAY) AS gap_start,
           (J2.start_date - INTERVAL '1' DAY) AS gap_end
      FROM JobApps AS J1, JobApps AS J2
     WHERE J1.candidate = J2.candidate
       AND J2.start_date
           = (SELECT MIN(J3.start_date)
                FROM JobApps AS J3
               WHERE J3.candidate = J1.candidate
                 AND J3.start_date > J1.end_date)
       AND (J1.end_date + INTERVAL '1' DAY)
           <= (J2.start_date - INTERVAL '1' DAY)
    UNION ALL
    SELECT J1.candidate, MAX(J1.end_date) + INTERVAL '1' DAY, CURRENT_DATE
      FROM JobApps AS J1
     GROUP BY J1.candidate
    HAVING COUNT(*) = COUNT(DISTINCT J1.end_date);

The <= in the last predicate of the first SELECT keeps gaps that are only one day long. The length of the gap can be determined with simple temporal arithmetic. The purpose of the UNION ALL is to add the current period of unemployment, if any, to the final answer. COUNT(DISTINCT J1.end_date) skips NULLs, so a candidate who still has a job with a NULL end date fails the HAVING test and gets no open-ended row; the test assumes that no two jobs of one candidate share an end date.

29.3.2 Continuous Time Periods

Given a series of jobs that can start and stop at any time, how can you be sure that an employee doing all these jobs was really working without any gaps? Let’s build a table of timesheets for one employee.
    CREATE TABLE Timesheets
    (job_code CHAR(5) NOT NULL PRIMARY KEY,
     start_date DATE NOT NULL,
     end_date DATE NOT NULL,
     CHECK (start_date <= end_date));

    INSERT INTO Timesheets (job_code, start_date, end_date)
    VALUES ('j1', '2008-01-01', '2008-01-03'),
           ('j2', '2008-01-06', '2008-01-10'),
           ('j3', '2008-01-05', '2008-01-08'),
           ('j4', '2008-01-20', '2008-01-25'),
           ('j5', '2008-01-18', '2008-01-23'),
           ('j6', '2008-02-01', '2008-02-05'),
           ('j7', '2008-02-03', '2008-02-08'),
           ('j8', '2008-02-07', '2008-02-11'),
           ('j9', '2008-02-09', '2008-02-10'),
           ('j10', '2008-02-01', '2008-02-11'),
           ('j11', '2008-03-01', '2008-03-05'),
           ('j12', '2008-03-04', '2008-03-09'),
           ('j13', '2008-03-08', '2008-03-14'),
           ('j14', '2008-03-13', '2008-03-20');

The most immediate answer is to build a search condition for all of the characteristics of a continuous time period. This algorithm is due to Mike Arney, a DBA at BORN Consulting. It uses derived tables to get the extreme start and ending dates of a contiguous run of durations: Early holds every start date that no earlier-starting job covers, and Latest holds every end date that no later-ending job covers.

    SELECT Early.start_date, MIN(Latest.end_date)
      FROM (SELECT DISTINCT start_date
              FROM Timesheets AS T1
             WHERE NOT EXISTS
                   (SELECT *
                      FROM Timesheets AS T2
                     WHERE T2.start_date < T1.start_date
                       AND T2.end_date >= T1.start_date)
           ) AS Early (start_date)
           INNER JOIN
           (SELECT DISTINCT end_date
              FROM Timesheets AS T3
             WHERE NOT EXISTS
                   (SELECT *
                      FROM Timesheets AS T4
                     WHERE T4.end_date > T3.end_date
                       AND T4.start_date <= T3.end_date)
           ) AS Latest (end_date)
           ON Early.start_date <= Latest.end_date
     GROUP BY Early.start_date;

    Result
    start_date    end_date
    ===========================
    '2008-01-01'  '2008-01-03'
    '2008-01-05'  '2008-01-10'
    '2008-01-18'  '2008-01-25'
    '2008-02-01'  '2008-02-11'
    '2008-03-01'  '2008-03-20'

However, another way of doing this is a query, which will also tell you which jobs bound the continuous periods.
    SELECT T2.start_date,
           MAX(T1.end_date) AS finish_date,
           MAX(T1.job_code || ' to ' || T2.job_code) AS job_code_pair
      FROM Timesheets AS T1, Timesheets AS T2
     WHERE T2.job_code <> T1.job_code
       AND T1.start_date BETWEEN T2.start_date AND T2.end_date
       AND T2.end_date BETWEEN T1.start_date AND T1.end_date
     GROUP BY T2.start_date;

The result shown here is what the query returns when only jobs ‘j1’ through ‘j8’ are loaded:

    Result
    start_date    finish_date   job_code_pair
    =========================================
    '2008-01-05'  '2008-01-10'  'j2 to j3'
    '2008-01-18'  '2008-01-25'  'j4 to j5'
    '2008-02-01'  '2008-02-08'  'j7 to j6'
    '2008-02-03'  '2008-02-11'  'j8 to j7'

    DELETE FROM Results
     WHERE EXISTS
           (SELECT R1.job_code_list
              FROM Results AS R1
             WHERE POSITION (Results.job_code_list
                             IN R1.job_code_list) > 0);

A third solution will handle an isolated job_code like ‘j1’, as well as three or more overlapping jobs, like ‘j6’, ‘j7’, and ‘j8’.

    SELECT T1.start_date,
           MIN(T2.end_date) AS finish_date,
           MIN(T2.end_date + INTERVAL '1' DAY)
           - MIN(T1.start_date) AS duration -- find any (T1.start_date)
      FROM Timesheets AS T1, Timesheets AS T2
     WHERE T2.start_date >= T1.start_date
       AND T2.end_date >= T1.end_date
       AND NOT EXISTS
           (SELECT *
              FROM Timesheets AS T3
             WHERE (T3.start_date <= T2.end_date
                    AND T3.end_date > T2.end_date)
                OR (T3.end_date >= T1.start_date
                    AND T3.start_date < T1.start_date))
     GROUP BY T1.start_date;

You will also want to look at how to consolidate overlapping intervals of integers.

A fourth solution uses the auxiliary Calendar table (see Section 29.9 for details) to find the dates that are and are not covered by any of the durations. The coverage flag and calendar date can then be used directly by other queries that need to look at the status of single days instead of date ranges.
    SELECT C1.cal_date,
           SUM(DISTINCT
               CASE WHEN C1.cal_date BETWEEN T1.start_date AND T1.end_date
                    THEN 1 ELSE 0 END) AS covered_date_flag
      FROM Calendar AS C1, Timesheets AS T1
     WHERE C1.cal_date BETWEEN (SELECT MIN(start_date) FROM Timesheets)
                           AND (SELECT MAX(end_date) FROM Timesheets)
     GROUP BY C1.cal_date;

This is reasonably fast because the WHERE clause uses static scalar subqueries to set the bounds, and the Calendar table uses cal_date as a primary key, so it will have an index.

A slightly different version of the problem is to group contiguous measurements into durations that have the value of that measurement. I have the following table:

    CREATE TABLE Calibrations
    (start_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL PRIMARY KEY,
     end_time TIMESTAMP NOT NULL,
     CHECK (end_time = start_time + INTERVAL '1' MINUTE),
     cal_value INTEGER NOT NULL);

The table has this data:

    Calibrations
    start_time                 end_time                   cal_value
    ==========================================================
    '2005-05-11 02:52:00.000'  '2005-05-11 02:53:00.000'  8
    '2005-05-11 02:53:00.000'  '2005-05-11 02:54:00.000'  8
    '2005-05-11 02:54:00.000'  '2005-05-11 02:55:00.000'  8
    '2005-05-11 02:55:00.000'  '2005-05-11 02:56:00.000'  8
    '2005-05-11 02:56:00.000'  '2005-05-11 02:57:00.000'  8
    '2005-05-11 02:57:00.000'  '2005-05-11 02:58:00.000'  9
    '2005-05-11 02:58:00.000'  '2005-05-11 02:59:00.000'  9
    '2005-05-11 02:59:00.000'  '2005-05-11 03:00:00.000'  9
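To make the goal concrete, here is a small procedural sketch in Python of what "grouping contiguous measurements into durations" should produce for this data. It is not the set-based SQL answer, which this excerpt does not reach. The function name merge_runs is hypothetical, and the sketch assumes that two rows belong to the same duration only when they carry the same cal_value and the earlier end_time equals the later start_time.

```python
def merge_runs(rows):
    """Collapse (start_time, end_time, cal_value) rows into contiguous runs.

    Rows are visited in start_time order; a row extends the current run only
    if it has the same value and begins exactly where the run ends.
    """
    runs = []
    for start, end, value in sorted(rows):
        if runs and runs[-1][2] == value and runs[-1][1] == start:
            runs[-1] = (runs[-1][0], end, value)   # extend the current run
        else:
            runs.append((start, end, value))       # begin a new run
    return runs

# The eight sample rows; ISO-style text sorts in chronological order.
calibrations = [
    ("2005-05-11 02:52:00.000", "2005-05-11 02:53:00.000", 8),
    ("2005-05-11 02:53:00.000", "2005-05-11 02:54:00.000", 8),
    ("2005-05-11 02:54:00.000", "2005-05-11 02:55:00.000", 8),
    ("2005-05-11 02:55:00.000", "2005-05-11 02:56:00.000", 8),
    ("2005-05-11 02:56:00.000", "2005-05-11 02:57:00.000", 8),
    ("2005-05-11 02:57:00.000", "2005-05-11 02:58:00.000", 9),
    ("2005-05-11 02:58:00.000", "2005-05-11 02:59:00.000", 9),
    ("2005-05-11 02:59:00.000", "2005-05-11 03:00:00.000", 9),
]

runs = merge_runs(calibrations)
for run in runs:
    print(run)
```

For the sample data this yields two durations: value 8 from 02:52 to 02:57, and value 9 from 02:57 to 03:00.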

Date posted: 06/07/2014, 09:20
