Joe Celko's SQL for Smarties - Advanced SQL Programming, P69

CHAPTER 29: TEMPORAL QUERIES
29.3 Time Series

    '2005-05-11 03:00:00.000'  '2005-05-11 03:01:00.000'  9
    '2005-05-11 03:01:00.000'  '2005-05-11 03:02:00.000'  9
    '2005-05-11 03:02:00.000'  '2005-05-11 03:03:00.000'  8
    '2005-05-11 03:03:00.000'  '2005-05-11 03:04:00.000'  8
    '2005-05-11 03:04:00.000'  '2005-05-11 03:05:00.000'  8

I want to be able to group this data so that it looks like this:

    start_time                 end_time                   cal_value
    ================================================================
    '2005-05-11 02:52:00.000'  '2005-05-11 02:57:00.000'  8
    '2005-05-11 02:57:00.000'  '2005-05-11 03:02:00.000'  9
    '2005-05-11 03:02:00.000'  '2005-05-11 03:05:00.000'  8

Background

The table being selected from is updated every minute with a new calibration value. The calibration value can change from minute to minute. I want a SELECT statement that will summarize each run by the start of the cal_value and the end of the calibration value before it changes.

    SELECT MIN(start_time) AS start_time,
           MAX(end_time) AS end_time,
           cal_value
      FROM (SELECT C1.start_time, C1.end_time, C1.cal_value,
                   MIN(C2.start_time)
              FROM Calibrations AS C1
                   LEFT OUTER JOIN
                   Calibrations AS C2
                   ON C1.start_time < C2.start_time
                      AND C1.cal_value <> C2.cal_value
             GROUP BY C1.start_time, C1.end_time, C1.cal_value)
           AS T (start_time, end_time, cal_value, x_time)
     GROUP BY cal_value, x_time;

29.3.3 Missing Times in Contiguous Events

Consider the following simple table, which we will use to illustrate how to handle missing times in events.

    CREATE TABLE Events
    (event_id CHAR(2) NOT NULL PRIMARY KEY,
     start_date DATE,
     end_date DATE,
     CHECK (start_date < end_date));

    INSERT INTO Events
    VALUES ('A', '2003-01-01', '2003-12-31'),
           ('B', '2004-01-01', '2004-01-31'),
           ('C', '2004-02-01', '2004-02-29'),
           ('D', '2004-02-01', '2004-02-29');

Due to circumstances beyond our control, the end_date column may contain a NULL instead of a valid date. Imagine that we had ('B', '2004-01-01', NULL) as a row in the table.
One reasonable solution is to populate the missing end_date with the (start_date − 1 day) of the next period. This is easy enough.

    UPDATE Events
       SET end_date
           = (SELECT MIN(E1.start_date) - INTERVAL '1' DAY
                FROM Events AS E1
               WHERE E1.start_date > Events.start_date)
     WHERE end_date IS NULL;

Likewise, due to circumstances beyond our control, the start_date column may contain a NULL instead of a valid date. Imagine that we had ('B', NULL, '2004-01-31') as a row in the table. Using the same logic, we could take the last known ending date and add one day to it to give us a guess at the missing starting value.

    UPDATE Events
       SET start_date
           = (SELECT MAX(E1.end_date) + INTERVAL '1' DAY
                FROM Events AS E1
               WHERE E1.end_date < Events.end_date)
     WHERE start_date IS NULL;

This has a nice symmetry to it, but it does not cover all possible cases. Consider an event where we know nothing about the times:

    INSERT INTO Events
    VALUES ('A', '2003-01-01', '2003-12-31'),
           ('B', NULL, NULL),
           ('C', '2004-02-01', '2004-02-29'),
           ('D', '2004-02-01', '2004-02-29');

You can run each of the previous UPDATE statements and get the NULLs filled in with values. However, you can combine them into one update:

    UPDATE Events
       SET end_date
           = CASE WHEN end_date IS NULL
                  THEN (SELECT MIN(E1.start_date) - INTERVAL '1' DAY
                          FROM Events AS E1
                         WHERE E1.start_date > Events.start_date)
                  ELSE end_date END,
           start_date
           = CASE WHEN start_date IS NULL
                  THEN (SELECT MAX(E1.end_date) + INTERVAL '1' DAY
                          FROM Events AS E1
                         WHERE E1.end_date < Events.end_date)
                  ELSE start_date END
     WHERE start_date IS NULL
        OR end_date IS NULL;

The real problem is having no boundary dates on contiguous events, like this:

    INSERT INTO Events
    VALUES ('A', '2003-01-01', '2003-12-31'),
           ('B', '2004-01-01', NULL),
           ('C', NULL, '2004-02-29'),
           ('D', '2004-02-01', '2004-02-29');

The result of applying the previous update is that we get an error because it will try to set the start_date equal to end_date in both rows.
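The first repair above, filling a missing end_date from the next known start_date, can be tried out with a small Python sketch using the standard sqlite3 module. This is only an illustration under stated assumptions: SQLite has no INTERVAL type, so dates are stored as ISO-8601 text and SQLite's date() function with a '-1 day' modifier stands in for INTERVAL '1' DAY. The helper name fill_missing_end_dates is hypothetical.

```python
import sqlite3


def fill_missing_end_dates(conn: sqlite3.Connection) -> None:
    """Set each NULL end_date to the day before the next known start_date."""
    conn.execute(
        """
        UPDATE Events
           SET end_date = (SELECT date(MIN(E1.start_date), '-1 day')
                             FROM Events AS E1
                            WHERE E1.start_date > Events.start_date)
         WHERE end_date IS NULL
        """
    )


# Assumption: dates are kept as ISO-8601 text, which sorts chronologically.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Events
    (event_id CHAR(2) NOT NULL PRIMARY KEY,
     start_date TEXT,
     end_date TEXT,
     CHECK (start_date < end_date))
    """
)
conn.executemany(
    "INSERT INTO Events VALUES (?, ?, ?)",
    [
        ("A", "2003-01-01", "2003-12-31"),
        ("B", "2004-01-01", None),  # the row with the missing end_date
        ("C", "2004-02-01", "2004-02-29"),
        ("D", "2004-02-01", "2004-02-29"),
    ],
)

fill_missing_end_dates(conn)
rows = conn.execute(
    "SELECT event_id, start_date, end_date FROM Events ORDER BY event_id"
).fetchall()
for row in rows:
    print(row)
```

With this sample data, event 'B' picks up the day before the earliest later start_date, and the rows that already had both dates are left alone.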
Given the restriction that each event lasts for at least one day, event 'B' could have finished on any day between '2004-01-02' and '2004-02-27', and likewise, event 'C' could have begun on any day between '2004-01-03' and '2004-02-28'; note the two different durations.

Any rules we make for resolving the NULLs are going to be arbitrary. For example, we could give event 'B' the benefit of the doubt and assume that it lasted until '2004-02-27', or just as well give event 'C' the same benefit. I might make a random choice of a pair of dates d, (d+1) in the gap between the 'B' and 'C' dates. I might pick a middle point. However, this pairwise approach does not solve the problem of all the possible combinations of NULL dates.

Let me propose these rules and apply them in order:

1. If the start_date is NOT NULL, and the end_date is NOT NULL, then leave the row alone.

2. If the table has too many NULLs in a series, then give up. Report too much missing data.

3. If the start_date IS NULL and the end_date IS NOT NULL, then set the start_date to the day before the end_date.

    UPDATE Events
       SET start_date
           = (SELECT MAX(E1.end_date) + INTERVAL '1' DAY
                FROM Events AS E1
               WHERE E1.end_date < Events.end_date)
     WHERE start_date IS NULL
       AND end_date IS NOT NULL;

4. If the start_date is NOT NULL and the end_date is NULL, then set the end_date to the day before the next known start_date.

    UPDATE Events
       SET end_date
           = (SELECT MIN(E1.start_date) - INTERVAL '1' DAY
                FROM Events AS E1
               WHERE E1.start_date > Events.start_date)
     WHERE start_date IS NOT NULL
       AND end_date IS NULL;

5. If the start_date and end_date are both NULL, then look at the prior and following events to get the minimal start_date and/or end_date. This will leave a gap in the dates that has to be handled later.
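The feasible date ranges quoted for events 'B' and 'C' before the rules, and the "middle point" choice, are simple date arithmetic that can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes only what the text states: each event lasts at least one day (start_date < end_date) and the following event starts the day after the prior one ends.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def feasible_boundaries(prior_start: date, next_end: date):
    """Return ((earliest_end, latest_end), (earliest_start, latest_start)).

    The first pair bounds the unknown end_date of the prior event; the
    second pair bounds the unknown start_date of the following event.
    """
    earliest_end = prior_start + ONE_DAY      # prior event lasts >= 1 day
    latest_start = next_end - ONE_DAY         # next event lasts >= 1 day
    latest_end = latest_start - ONE_DAY       # contiguous: end is start - 1
    earliest_start = earliest_end + ONE_DAY   # contiguous: start is end + 1
    if earliest_end > latest_end:
        raise ValueError("no room for two events of at least one day each")
    return (earliest_end, latest_end), (earliest_start, latest_start)


def midpoint_split(prior_start: date, next_end: date):
    """Pick the pair (d, d + 1) nearest the middle of the feasible range."""
    (earliest_end, latest_end), _ = feasible_boundaries(prior_start, next_end)
    d = earliest_end + (latest_end - earliest_end) // 2
    return d, d + ONE_DAY


# Event 'B' starts 2004-01-01; event 'C' ends 2004-02-29.
b_range, c_range = feasible_boundaries(date(2004, 1, 1), date(2004, 2, 29))
print(b_range)  # possible end dates for 'B'
print(c_range)  # possible start dates for 'C'
print(midpoint_split(date(2004, 1, 1), date(2004, 2, 29)))
```

Any pair inside these ranges satisfies the CHECK constraint, which is why the choice among them has to be made by an arbitrary rule.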
For example:

    ('A', '2003-01-01', '2003-12-31'),
    ('B', '2004-01-01', NULL),
    ('C', NULL, '2004-02-29'),
    ('D', '2004-02-01', '2004-02-29');

Becomes:

    ('A', '2003-01-01', '2003-12-31'),
    ('B', '2004-01-01', NULL),
    ('C', '2004-02-28', '2004-02-29'),  <= rule #3
    ('D', '2004-02-01', '2004-02-29');

Becomes:

    ('A', '2003-01-01', '2003-12-31'),
    ('B', '2004-01-01', '2004-02-27'),  <= rule #4
    ('C', '2004-02-28', '2004-02-29'),
    ('D', '2004-02-01', '2004-02-29');

Now consider this data:

    ('A', '2003-01-01', '2003-12-31'),
    ('B', NULL, NULL),
    ('C', '2004-02-01', '2004-02-29'),
    ('D', '2004-02-01', '2004-02-29');

The data becomes:

    ('A', '2003-01-01', '2003-12-31'),
    ('B', '2004-01-01', '2004-01-31'),  <= rule #5
    ('C', '2004-02-01', '2004-02-29'),
    ('D', '2004-02-01', '2004-02-29');

Consider this example:

    ('A', '2003-01-01', '2003-12-31'),
    ('B', NULL, NULL),
    ('C', NULL, '2004-02-29'),
    ('D', '2004-02-01', '2004-02-29');

29.3.4 Locating Dates

This little problem is sneakier than it sounds. I first saw it in Explain magazine, then met the author, Rudy Limeback, at the Database World conference in Boston years ago. The problem is to print a list of the employees whose birthdays will occur in the next 45 days. The employee files have each employee's date of birth. The answer will depend on what date functions you have in your implementation of SQL, but Rudy was working with DB2.

What makes this problem interesting is the number of possible false starts. Most versions of SQL also have a library function, MAKEDATE(year, month, day), or an equivalent, which will construct a date from three numbers representing a year, month, and day, as well as extraction functions to disassemble a date into integers representing the month, day, and year. The SQL standard would do this with the general function CAST (<string> AS DATE), but there is no provision in the standard for using integers without first converting them to strings, either explicitly or implicitly.
For example, direct use of strings to build a date:

    CAST ('2005-01-01' AS DATE)

Concatenation causes integers to be cast to strings:

    CAST (2005 || '-' || 01 || '-' || 01 AS DATE)

The first "gotcha" in this problem is trying to use the component pieces of the dates in a search condition. If you were looking for birthdays all within the same month, it would be easy:

    SELECT name, dob, CURRENT_DATE
      FROM Employees
     WHERE EXTRACT(MONTH FROM CURRENT_DATE) = EXTRACT(MONTH FROM dob);

Attempts to extend this approach fall apart, however, since a 45-day period could extend across three months and possibly into the following year; additionally, it might fall in a leap year. Very soon, the number of function calls is too high and the logic is too complex.

The second "gotcha" is trying to write a simple search condition with these functions to construct the birthday in the current year from the date of birth (dob) in the Employees table:

    SELECT name, dob, CURRENT_DATE
      FROM Employees
     WHERE MAKEDATE(EXTRACT (YEAR FROM CURRENT_DATE),  -- birthday this year
                    EXTRACT (MONTH FROM dob),
                    EXTRACT (DAY FROM dob))
           BETWEEN CURRENT_DATE
               AND (CURRENT_DATE + INTERVAL '45' DAY);

But a leap-year date of birth will cause an exception to be raised on an invalid date if this is not also a leap year. There is also another problem.

The third "gotcha" comes when the 45-day period wraps into the next year. For example, if the current month is December 1992, we should include January 1993 birthdays, but they are not constructed by the MAKEDATE() function. At this point, you can build a messy search condition that also goes into the next year when constructing birthdays.

Rory Murchison of the Aetna Institute pointed out that if you are working with DB2 or some other SQL implementations, you will have an AGE(date1 [, date2]) function. This returns the difference in years between date1 and date2. If date2 is missing, it defaults to CURRENT_DATE.
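An AGE()-style function of the kind just described can be sketched in Python to see why it sidesteps the second and third gotchas. This is a minimal sketch, not the DB2 function: the names age_in_years, birthday_within, and naive_birthday_this_year are hypothetical, and the age is computed from calendar fields, so under this convention a person born on February 29 turns a year older on March 1 in non-leap years.

```python
from datetime import date, timedelta
from typing import Optional


def age_in_years(date1: date, date2: Optional[date] = None) -> int:
    """Whole years from date1 to date2; date2 defaults to today."""
    if date2 is None:
        date2 = date.today()
    # Has the anniversary of date1 been reached yet in date2's year?
    reached = (date2.month, date2.day) >= (date1.month, date1.day)
    return date2.year - date1.year - (0 if reached else 1)


def birthday_within(dob: date, today: date, days: int = 45) -> bool:
    """True if the person is a year older `days` days from `today`."""
    later = today + timedelta(days=days)
    return age_in_years(dob, today) < age_in_years(dob, later)


def naive_birthday_this_year(dob: date, today: date) -> date:
    """The MAKEDATE-style construction; fails for Feb 29 in a common year."""
    return date(today.year, dob.month, dob.day)


# Third gotcha: a 45-day window starting in December 1992 reaches January 1993.
print(birthday_within(date(1960, 1, 10), date(1992, 12, 1)))

# Second gotcha: a leap-day birth date cannot be built in a non-leap year,
# but the age comparison needs no constructed birthday at all.
leap_dob = date(1964, 2, 29)
try:
    naive_birthday_this_year(leap_dob, date(1993, 2, 1))
except ValueError as exc:
    print("naive construction failed:", exc)
print(birthday_within(leap_dob, date(1993, 2, 1)))
```

The comparison never builds a date in the current year, so neither the leap-day exception nor the year wrap-around arises.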
The AGE() function can be constructed from other functions in implementations that do not support it. In Standard SQL, the expression would be (date2 - date1) YEAR, which would construct an INTERVAL value. That makes the answer quite simple:

    SELECT name, dob, CURRENT_DATE
      FROM Employees
     WHERE (CURRENT_DATE - dob) YEAR
           < (CURRENT_DATE - dob + INTERVAL '45' DAY) YEAR;

In English, this says that if the employee is a year older 45 days from now, he must have had a birthday in the meantime.

29.3.5 Temporal Starting and Ending Points

Dates can be stored in several different ways in a database. Designers of one product might decide that they want to keep things in a COBOL-like field-oriented format, which has a clear, separate area for the year, month, and day of each date. Another product might want to be more UNIX-like and keep the dates as a displacement in some small unit of time from a starting point, then calculate the display format when it is needed. Standard SQL does not say how a date should be stored, but the "COBOL approach" is easier to display, and the "temporal displacement approach" can do calculations more easily.

The result is that there is no best way to calculate either the first day of the month or the last day of the month from a given date. In the COBOL method, you can get the first day of the month easily by using extraction from the date to build the date, with a 1 in the day field.

To return the last day of the previous month, use this expression:

    CURRENT_TIMESTAMP - INTERVAL (EXTRACT(DAY FROM CURRENT_TIMESTAMP) DAYS)

Obviously, you can get the first day of this month by adding one day to that result:

    (CURRENT_TIMESTAMP - INTERVAL (EXTRACT(DAY FROM CURRENT_TIMESTAMP) DAYS))
    + INTERVAL '1' DAY;

Another way is with a user-defined function.
    CREATE FUNCTION LastDayOfMonth (IN my_date DATE)
    RETURNS INTEGER
    LANGUAGE SQL
    DETERMINISTIC
    RETURN
    CAST (CASE
          WHEN EXTRACT(MONTH FROM my_date) IN (1, 3, 5, 7, 8, 10, 12)
          THEN 31
          WHEN EXTRACT(MONTH FROM my_date) IN (4, 6, 9, 11)
          THEN 30
          ELSE CASE
               WHEN MOD(EXTRACT(YEAR FROM my_date), 4) <> 0 THEN 28
               WHEN MOD(EXTRACT(YEAR FROM my_date), 400) = 0 THEN 29
               WHEN MOD(EXTRACT(YEAR FROM my_date), 100) = 0 THEN 28
               ELSE 29 END
          END AS INTEGER);

Another problem that you will find in the real world is that people never read the ISO-8601 standards for temporal data; they insist upon writing midnight as '24:00:00' and letting it "leak" into the next day. That is, '2003-01-01 24:01:00' probably should be '2003-01-02 00:01:00' instead. The EXTRACT() function would have a really complicated definition if this were an acceptable format.

The best bet, if you cannot teach people to use the ISO-8601 Standards, is to correct the string at input time. This can be done with a simple auxiliary table that looks like this:

    CREATE TABLE FixTheClock
    (input_date_string CHAR(10) NOT NULL,
     input_time_pattern CHAR(25) NOT NULL PRIMARY KEY,
     correct_time_string VARCHAR(25) NOT NULL);

    INSERT INTO FixTheClock
    VALUES ('2003-01-01', '2003-01-01 24:__:__._____', '2003-01-02 00:');

Then use a LIKE predicate to replace the pattern with the corrected time.

    SELECT CASE WHEN R1.raw_input_timestamp LIKE F1.input_time_pattern
                THEN F1.correct_time_string
                     || SUBSTRING (R1.raw_input_timestamp FROM 15)
                ELSE R1.raw_input_timestamp END
      FROM RawData AS R1, FixTheClock AS F1
     WHERE F1.input_date_string
           = SUBSTRING (R1.raw_input_timestamp FROM 1 FOR 10);

Notice that this is strictly a string function and that the results will have to be cast to a temporal data type before being stored in the database.
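The same input-time correction can be sketched without a lookup table. The Python function below is a hypothetical helper, assuming the raw strings look like 'YYYY-MM-DD HH:MM:SS.fff' as in the example above; it rewrites an hour of 24 as hour 00 and lets date arithmetic roll the result into the next day, including across month and year boundaries.

```python
from datetime import datetime, timedelta

# Assumed input layout; %f accepts one to six fractional digits.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def fix_the_clock(raw: str) -> datetime:
    """Parse a timestamp string, treating hour 24 as hour 00 of the next day."""
    date_part, time_part = raw.split(" ", 1)
    hours, rest = time_part.split(":", 1)
    if hours == "24":
        # Parse as hour 00 on the stated date, then move forward one day.
        base = datetime.strptime(f"{date_part} 00:{rest}", TIMESTAMP_FORMAT)
        return base + timedelta(days=1)
    return datetime.strptime(raw, TIMESTAMP_FORMAT)


print(fix_the_clock("2003-01-01 24:01:00.000"))
print(fix_the_clock("2003-12-31 24:00:00.000"))
print(fix_the_clock("2003-01-01 23:59:00.000"))
```

Unlike the table-driven version, this needs no row per calendar date, but it returns a temporal value directly rather than a corrected string.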
29.3.6 Average Wait Times

Jerry posted a problem on May 26, 2005: take a simple table of service events and determine the average number of days between services for a group of machines. Dropping the unimportant columns, the table looked like this.

    CREATE TABLE ServiceLog
    (machine_nbr INTEGER NOT NULL,
     service_date DATE DEFAULT CURRENT_DATE NOT NULL,
     PRIMARY KEY (machine_nbr, service_date));

The first attempt posted built all of the waiting periods and then averaged them.

    SELECT machine_nbr, AVG(wait) AS avg_wait
      FROM (SELECT T1.machine_nbr,
                   CAST ((MIN(T2.service_date) - T1.service_date) DAY
                         AS INTEGER)
              FROM ServiceLog AS T1, ServiceLog AS T2
             WHERE T2.machine_nbr = T1.machine_nbr
               AND T2.service_date > T1.service_date
             GROUP BY T1.machine_nbr, T1.service_date)
           AS T (machine_nbr, wait)
     GROUP BY machine_nbr;

Instead, use this:

    SELECT machine_nbr,
           CAST ((MAX(service_date) - MIN(service_date)) DAY AS INTEGER)
           / (1.0 * COUNT(*) - 1) AS avg_gap
      FROM ServiceLog
     GROUP BY machine_nbr;

The two queries agree because the gaps between consecutive service dates add up to the span from the first service date to the last, and a machine with n service dates has (n - 1) gaps.

The first answer posted was caught in the "procedural mindset" trap. Instead, think of each machine as a grouping (subset) that has properties as a whole: duration range and count of events. If the example had been to find the average size of a shot of scotch given a bottle and (n) glasses of known sizes, it would have been obvious: average the volume of scotch over the number of glasses! But replace scotch with total time used and glasses with events, and the answer becomes hard to see.

29.4 Julian Dates

All SQL implementations support a DATE data type, but there is no standard defining how they should implement it internally. Some products represent the year, month, and day as parts of a double-word integer; others use Julianized dates; some use ISO ordinal dates; and some store dates as character strings. The programmer does not care as long as the dates work correctly.
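As one illustration of a day-count ("Julianized") representation, Python's date.toordinal() maps a date to the number of days since 0001-01-01 in the proleptic Gregorian calendar. Adding a fixed offset of 1721425 gives the astronomical Julian Day Number for that calendar date. The function names below are hypothetical, and the sketch only shows the idea of storing a date as an integer; it says nothing about how any particular SQL product stores its dates.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the
# astronomical Julian Day Number of the same calendar date.
JDN_OFFSET = 1_721_425


def julian_day_number(d: date) -> int:
    """Return the Julian Day Number of a calendar date."""
    return d.toordinal() + JDN_OFFSET


def days_between(d1: date, d2: date) -> int:
    """With day counts, date subtraction is plain integer subtraction."""
    return julian_day_number(d2) - julian_day_number(d1)


print(julian_day_number(date(2000, 1, 1)))
print(days_between(date(2004, 2, 1), date(2004, 3, 1)))
```

The second call shows the appeal of the representation: the leap day in February 2004 is accounted for by the numbering itself, with no month-length logic in the subtraction.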

Date posted: 06/07/2014, 09:20
