Joe Celko's SQL for Smarties - Advanced SQL Programming

CHAPTER 12: LIKE PREDICATE

The <escape character> is used in the <pattern> to specify that the character that follows it is to be interpreted literally rather than as a wildcard. This means that the escape character is followed by the escape character itself, an ‘_’, or a ‘%’. Old C programmers are used to this convention, where the language defines the escape character as ‘\’, so this is a good choice for SQL programmers, too.

12.1 Tricks with Patterns

In most implementations, the ‘_’ character tests much faster than the ‘%’ character. The reason is obvious: the parser that compares a string to the pattern needs only one operation to match an underscore before it can move to the next character, but has to do some look-ahead parsing to resolve a percentage sign. The wildcards can be inserted in the middle or beginning of a pattern. Thus, ‘B%K’ will match ‘BOOK’, ‘BLOCK’, and ‘BK’, but it will not match ‘BLOCKS’. The parser would scan each letter and classify it as a wildcard match or an exact match. In the case of ‘BLOCKS’, the initial ‘B’ would be an exact match and the parser would continue; ‘L’, ‘O’, and ‘C’ have to be wildcard matches, since they don’t appear in the pattern string; ‘K’ cannot be classified until we read the last letter. The last letter is ‘S’, so the match fails.

For example, given a column declared to be seven characters long, and a LIKE predicate looking for names that start with ‘Mac’, you would usually write:

    SELECT *
      FROM People
     WHERE (name LIKE 'Mac%');

However, this might actually run faster:

    SELECT *
      FROM People
     WHERE (name LIKE 'Mac_   ')
        OR (name LIKE 'Mac__  ')
        OR (name LIKE 'Mac___ ')
        OR (name LIKE 'Mac____');

The trailing blanks are also characters that can be matched exactly; each pattern is padded with blanks to the full seven characters of the column.

Putting a ‘%’ at the front of a pattern is very time-consuming.
For example, you might try to find all names that end in ‘-son’ with this query:

    SELECT *
      FROM People
     WHERE (name LIKE '%son');

The use of underscores instead will make a real difference in most SQL implementations for this query, because most of them always parse from left to right.

    SELECT *
      FROM People
     WHERE (name LIKE '_son   ')
        OR (name LIKE '__son  ')
        OR (name LIKE '___son ')
        OR (name LIKE '____son');

Remember that the ‘_’ character requires a matching character, and the ‘%’ character does not. Thus, this query:

    SELECT *
      FROM People
     WHERE (name LIKE 'John_%');

and this query:

    SELECT *
      FROM People
     WHERE (name LIKE 'John%');

are subtly different. Both will match ‘Johnson’ and ‘Johns’, but the first will not accept ‘John’ as a match. This is how you get a “one-or-more-characters” pattern match in SQL.

Remember that the <pattern> as well as the <match value> can be constructed with concatenation operators, SUBSTRING(), and other string functions. For example, let’s find people whose first names are part of their last names with the query:

    SELECT *
      FROM People
     WHERE (lastname LIKE '%' || firstname || '%');

This will show us people like ‘John Johnson’, ‘Anders Andersen’, and ‘Bob McBoblin’. This query will also run very slowly. It is also case-sensitive and would not work for names such as ‘Jon Anjon’, so you might want to modify the statement to:

    SELECT *
      FROM People
     WHERE (UPPER (lastname) LIKE '%' || UPPER (firstname) || '%');

12.2 Results with NULL Values and Empty Strings

As you would expect, a NULL in the predicate returns an UNKNOWN result. The NULL can be the escape character, pattern, or match value. If M and P are both character strings of length zero, M LIKE P defaults to TRUE. If one or both are longer than zero characters, you use the regular rules to test the predicate.

12.3 LIKE Is Not Equality

A very important point that is often missed is that two strings can be equal but not LIKE in SQL.
The test of equality first pads the shorter of the two strings with rightmost blanks, then matches the characters in each, one for one. Thus ‘Smith’ and ‘Smith   ’ (with three trailing blanks) are equal. However, the LIKE predicate does no padding, so 'Smith' LIKE 'Smith   ' tests FALSE because there is nothing to match to the blanks. A good trick to get around these problems is to use the TRIM() function to remove unwanted blanks from the strings within either or both of the two arguments.

12.4 Avoiding the LIKE Predicate with a Join

Beginners often want to write something similar to <string> IN LIKE (<pattern list>) rather than a string of ORed LIKE predicates. That syntax is illegal, but you can get the same results with a table of patterns and a join.

    CREATE TABLE Patterns
    (template VARCHAR(10) NOT NULL PRIMARY KEY);

    INSERT INTO Patterns
    VALUES ('Celko%'), ('Chelko%'), ('Cilko%'), ('Selko%'), ('Silko%');

    SELECT A1.lastname
      FROM Patterns AS P1, Authors AS A1
     WHERE A1.lastname LIKE P1.template;

This idea can be generalized to find strings that differ from a pattern by one position without actually using a LIKE predicate. First, assume that we have a table of sequential numbers and the following tables with sample data.
    -- the match patterns
    CREATE TABLE MatchList
    (pattern CHAR(9) NOT NULL PRIMARY KEY);

    INSERT INTO MatchList VALUES ('_========');
    INSERT INTO MatchList VALUES ('=_=======');
    INSERT INTO MatchList VALUES ('==_======');
    INSERT INTO MatchList VALUES ('===_=====');
    INSERT INTO MatchList VALUES ('====_====');
    INSERT INTO MatchList VALUES ('=====_===');
    INSERT INTO MatchList VALUES ('======_==');
    INSERT INTO MatchList VALUES ('=======_=');
    INSERT INTO MatchList VALUES ('========_');

    -- the strings to be matched or near-matched
    CREATE TABLE Target
    (nbr CHAR(9) NOT NULL PRIMARY KEY);

    INSERT INTO Target VALUES ('123456089'), ('543434344');

    -- the strings to be searched for those matches
    CREATE TABLE Source
    (nbr CHAR(9) NOT NULL PRIMARY KEY);

    INSERT INTO Source VALUES ('123456089');
    INSERT INTO Source VALUES ('123056789');
    INSERT INTO Source VALUES ('123456780');
    INSERT INTO Source VALUES ('123456789');
    INSERT INTO Source VALUES ('023456789');
    INSERT INTO Source VALUES ('023456780');

We use an equal sign in the match patterns as a signal to compare the character at that position in the source string with the character at the same position in the target string, and an underscore as a signal to skip that position.

    SELECT DISTINCT TR1.nbr
      FROM Sequence AS SE1, Source AS SR1,
           MatchList AS ML1, Target AS TR1
     WHERE NOT EXISTS
           (SELECT *
              FROM Sequence AS SE2, Source AS SR2,
                   MatchList AS ML2, Target AS TR2
             WHERE SUBSTRING (ML2.pattern FROM SE2.seq FOR 1) = '='
               AND SUBSTRING (SR2.nbr FROM SE2.seq FOR 1)
                   <> SUBSTRING (TR2.nbr FROM SE2.seq FOR 1)
               AND SR2.nbr = SR1.nbr
               AND TR2.nbr = TR1.nbr
               AND ML2.pattern = ML1.pattern
               -- every position must be tested, including the last one
               AND SE2.seq BETWEEN 1 AND CHAR_LENGTH (TR2.nbr));

This code is due to Jonathan Blitz.

12.5 CASE Expressions and LIKE Predicates

The CASE expression in Standard SQL lets the programmer use the LIKE predicate in some interesting ways. The simplest example is counting the number of times a particular string appears inside another string.
Assume that text_col is CHAR(25) and we want the count of a particular string, ‘term’, within it.

    SELECT text_col,
           CASE WHEN text_col LIKE '%term%term%term%term%term%term%'
                THEN 6
                WHEN text_col LIKE '%term%term%term%term%term%'
                THEN 5
                WHEN text_col LIKE '%term%term%term%term%'
                THEN 4
                WHEN text_col LIKE '%term%term%term%'
                THEN 3
                WHEN text_col LIKE '%term%term%'
                THEN 2
                WHEN text_col LIKE '%term%'
                THEN 1
                ELSE 0 END AS term_tally
      FROM Foobar
     WHERE text_col LIKE '%term%';

This depends on the fact that a CASE expression executes the WHEN clauses in order of their appearance. We know that the most times the four-character substring can appear is six, because text_col is only 25 characters long.

Another use of the CASE is to adjust the pattern within the LIKE predicate.

    name LIKE CASE WHEN language = 'English'
                   THEN 'Red%'
                   WHEN language = 'French'
                   THEN 'Rouge%'
                   ELSE 'R%' END

12.6 SIMILAR TO Predicates

As you can see, the LIKE predicate is pretty weak, especially if you have used a version of grep(), a utility program from the UNIX operating system. The name comes from the ed editor command g/re/p, “globally search for a regular expression and print,” and before you ask, a regular expression is a notation for describing a class of formal languages. If you are a computer science major, you have seen them; otherwise, don’t worry about it.

The bad news is that there are several versions of grep() in the UNIX community, such as egrep(), fgrep(), xgrep(), and a dozen or so others.

The SQL-99 standard added a regular expression predicate of the form <string expression> SIMILAR TO <pattern>, which is based on the POSIX version of grep() found in ISO/IEC 9945. The special symbols in a pattern are:

    |        means alternation (either of two alternatives)
    *        means repetition of the previous item zero or more times
    +        means repetition of the previous item one or more times
    ( )      parentheses, may be used to group items into a single unit
    [. . .]  a bracket expression, specifies a match to any of the characters inside the brackets
There are abbreviations for lists of commonly used character subsets, taken from POSIX.

    [:ALPHA:]  match any alphabetic character, regardless of case
    [:UPPER:]  match any uppercase alphabetic character
    [:LOWER:]  match any lowercase alphabetic character
    [:DIGIT:]  match any numeric digit
    [:ALNUM:]  match any numeric digit or alphabetic character

Examples:

1. The letters ‘foo’ or ‘bar’ followed by any string:

    Foobar SIMILAR TO '(foo|bar)%'

2. The serial number is a number sign followed by one or more digits. The abbreviation goes inside a bracket expression:

    serial_nbr SIMILAR TO '#[0-9]+'
    serial_nbr SIMILAR TO '#[[:DIGIT:]]+'

You should still read your product manual for details, but most grep() functions accept other special symbols for more general searching than the SIMILAR TO predicate:

    .        any character (same as the SQL underscore)
    ^        start of line (not used in an SQL string)
    $        end of line (not used in an SQL string)
    \        the next character is a literal and not a special symbol; this is called an ESCAPE in SQL
    [^. . .] match anything but the characters inside the brackets, after the caret

Regular expressions have a lot of nice properties.

12.7 Tricks with Strings

This is a list of miscellaneous tricks that you might not think about when using strings.

12.7.1 String Character Content

A weird way of providing an edit mask for a varying character column to see if it has only digits in it was proposed by Ken Sheridan on the CompuServe ACCESS forum in October 1999. If the first character is not a zero, then you can check that the VARCHAR(n) string is all digits with:

    CAST (LOG10 (CAST (test_column AS INTEGER)) AS INTEGER) + 1 = n

The LOG10() of an n-digit number truncates to (n - 1), so adding one gives the number of digits; this assumes that your product truncates rather than rounds when it casts to INTEGER. If the first (n) characters are not all digits, then it will not return (n). If they are all digits, but the (n+1) character is also a digit, it will return (n+1), and so forth. If there are nondigit characters in the string, then the innermost CAST() function will fail to convert the test_column into a number.
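The digit-count check can be tried outside the database. The following is only a sketch in Python, not SQL: the function name has_exactly_n_digits is hypothetical, int() stands in for the inner CAST(), and a failed conversion is reported as False where SQL would raise an error.

```python
import math


def has_exactly_n_digits(text: str, n: int) -> bool:
    """Mimic CAST(LOG10(CAST(text AS INTEGER)) AS INTEGER) + 1 = n.

    Assumption: the string does not start with a zero. Python's int() is
    more forgiving than a typical SQL CAST (it accepts surrounding blanks,
    a sign, and underscores), so this is an approximation only.
    """
    try:
        value = int(text)  # stands in for CAST(text AS INTEGER)
    except ValueError:
        return False  # SQL would raise a conversion error here
    if value <= 0:
        return False  # LOG10 is undefined for zero and negative numbers
    # floor(log10(value)) is one less than the number of digits.
    # Floating-point rounding can make this unreliable for long digit strings.
    return math.floor(math.log10(value)) + 1 == n


if __name__ == "__main__":
    for sample in ("12345", "1234", "12a45"):
        print(sample, has_exactly_n_digits(sample, 5))
```

The explicit floor() plays the role of a truncating CAST to INTEGER; a product that rounds instead would need a different expression.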
If you do have to worry about leading zeros or blanks, then concatenate ‘1’ to the front of the string.

Another trick is to think in terms of whole strings, rather than in a “character-at-a-time” mindset. So how can I tell if a string is all alphabetic, partly alphabetic, or completely nonalphabetic without scanning each character? The answer, from the folks at Ocelot software, is surprisingly easy:

    CREATE TABLE Foobar
    (no_alpha VARCHAR(6) NOT NULL
       CHECK (UPPER(no_alpha) = LOWER(no_alpha)),
     some_alpha VARCHAR(6) NOT NULL
       CHECK (UPPER(some_alpha) <> LOWER(some_alpha)),
     all_alpha VARCHAR(6) NOT NULL
       CHECK (UPPER(all_alpha) <> LOWER(all_alpha)
              AND LOWER(all_alpha) BETWEEN 'aaaaaa' AND 'zzzzzz'));

Letters have different upper and lowercase values, but other characters do not. This lets us edit a column for no alphabetic characters, some alphabetic characters, and all alphabetic characters.

12.7.2 Searching versus Declaring a String

You need to be very accurate when you declare a string column in your DDL, but thanks to doing that, you can slack off a bit when you search on those columns in your DML. For example, most credit card numbers are made up of four groups of four digits, and each group has some validation rule, thus:

    CREATE TABLE CreditCards
    (card_nbr CHAR(19) NOT NULL PRIMARY KEY
       CONSTRAINT valid_card_nbr_format
       CHECK (card_nbr SIMILAR TO
              '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'),
     CONSTRAINT valid_bank_nbr
       CHECK (SUBSTRING (card_nbr FROM 1 FOR 4) IN ('2349', '2345', ...)));

Since we are sure that the credit card number is stored correctly, we can search for it with a simple LIKE predicate.
For example, to find all the cards with 1234 in the third group, you can use this:

    SELECT card_nbr
      FROM CreditCards
     WHERE card_nbr LIKE '____-____-1234-____';

Or even:

    SELECT card_nbr
      FROM CreditCards
     WHERE card_nbr LIKE '__________1234_____';

The SIMILAR TO predicate will build an internal finite-state machine to parse the pattern, while the underscores in the LIKE can be optimized so that it can run in parallel down the whole column.

12.7.3 Creating an Index on a String

Many string-encoding techniques have the same prefix, because we read from left to right and tend to put the codes for the largest category to the left. For example, the first group of digits in the credit card numbers is the issuing bank.

If your SQL has the ability to define a function in an index, you can reverse or rearrange the string to give faster access. The syntax might look like this:

    CREATE INDEX acct_searching
    ON CreditCards WITH REVERSE(card_nbr); -- not Standard SQL

This is very dependent on your vendor, but often the query must explicitly use the same function as the index.

An alternative is to store the rearranged value in the base table and show the actual value in a view. When the view is invoked, the rearranged value will be used for the query without the users knowing it.
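To see why a reversed key helps, here is a minimal sketch in Python rather than SQL, with a sorted list standing in for the index. The names build_reversed_index and find_by_suffix are hypothetical, and the card numbers are made-up sample data. The point is only that a search on the end of a string becomes a search on the start of its reversal, which a sorted structure can answer with a short range scan.

```python
from bisect import bisect_left


def build_reversed_index(values):
    """Return a sorted list of (reversed value, original value) pairs.

    This plays the role of an index built on REVERSE(column).
    """
    return sorted((value[::-1], value) for value in values)


def find_by_suffix(index, suffix):
    """Find the original values that end with `suffix`.

    The suffix is reversed into a prefix, the first candidate is located
    by binary search, and the scan stops at the first non-matching key.
    """
    key = suffix[::-1]
    start = bisect_left(index, (key,))
    matches = []
    for reversed_value, original in index[start:]:
        if not reversed_value.startswith(key):
            break  # keys are sorted, so no later entry can match
        matches.append(original)
    return matches


if __name__ == "__main__":
    cards = [
        "1111-2222-3333-4444",
        "5555-6666-7777-4444",
        "1111-2222-3333-9999",
    ]
    index = build_reversed_index(cards)
    print(find_by_suffix(index, "4444"))
```

Without the reversed keys, the same question would need a leading ‘%’ in the pattern and a look at every row.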
