CHAPTER 5: CHARACTER DATA TYPES IN SQL

5.1.3 Problems of String Grouping

Because the equality test has to pad out the shorter of the two strings, you will often find that doing a GROUP BY on a VARCHAR(n) column has unpredictable results:

    CREATE TABLE Foobar (x VARCHAR(5) NOT NULL);
    INSERT INTO Foobar VALUES ('a'), ('a '), ('a  '), ('a   ');

Now, execute the query:

    SELECT x, CHAR_LENGTH(x)
      FROM Foobar
     GROUP BY x;

The value for CHAR_LENGTH(x) will vary for different products. The most common answers are 1, 4, or 5 in this example. A length of 1 is returned because it is the length of the shortest string or because it is the length of the first string physically in the table. A length of 4 is returned because it is the length of the longest string in the table, and a length of 5 because it is the greatest possible length of a string in the table. You might want to add a constraint that ensures trailing blanks have been trimmed, to avoid these problems.

5.2 Standard String Functions

SQL-92 defines a set of string functions that appear in most products, but with vendor-specific syntax. You will probably find that products will continue to support their own syntax, but will also add the Standard SQL syntax in new releases.

String concatenation is shown with the || operator, taken from PL/I.

The SUBSTRING(<string> FROM <start> FOR <length>) function uses three arguments: the source string, the starting position of the substring, and the length of the substring to be extracted. Truncation occurs when the implied starting and ending positions are not both within the given string.

DB2 and other products have a LEFT and a RIGHT function. The LEFT function returns a string consisting of the specified number of leftmost characters of the string expression, and the RIGHT function returns the specified number of rightmost characters.
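The truncation rule for SUBSTRING is easier to see in running code. The sketch below is written in Python only for illustration and follows my reading of the SQL-92 rules; the names `sql_substring`, `sql_left`, and `sql_right` are hypothetical, and a particular product may behave differently at the edges. Positions are 1-based, as in SQL.

```python
from typing import Optional


def sql_substring(source: str, start: int, length: Optional[int] = None) -> str:
    """Emulate SUBSTRING(source FROM start FOR length) with 1-based positions."""
    n = len(source)
    if length is None:
        # No FOR clause: the substring runs to the end of the source string.
        end = max(n + 1, start)
    else:
        if length < 0:
            raise ValueError("substring length must not be negative")
        # 'end' is the position just past the last requested character.
        end = start + length
    if start > n or end < 1:
        # The requested window lies wholly outside the string.
        return ""
    # Truncate the window to the part that overlaps the string.
    first = max(start, 1)
    stop = min(end, n + 1)
    return source[first - 1:stop - 1]


def sql_left(source: str, count: int) -> str:
    """Leftmost 'count' characters, as in the vendor LEFT function."""
    return source[:max(count, 0)]


def sql_right(source: str, count: int) -> str:
    """Rightmost 'count' characters, as in the vendor RIGHT function."""
    return source[-count:] if count > 0 else ""
```

With these definitions, `sql_substring("abcdef", 2, 3)` gives `"bcd"`, while `sql_substring("abcdef", 5, 10)` is truncated to `"ef"` and `sql_substring("abcdef", -1, 4)` to `"ab"`, because only part of the requested window falls inside the string.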
The fold functions are a pair of functions for converting all the lowercase characters in a given string to uppercase, UPPER(<string>), or all the uppercase ones to lowercase, LOWER(<string>).

TRIM([[<trim specification>] [<trim character>] FROM] <trim source>) produces a result string that is the source string with an unwanted character removed. The <trim source> is the original character value expression. The <trim specification> is either LEADING or TRAILING or BOTH, and the <trim character> is the single character that is to be removed. The TRIM() function removes the leading and/or trailing occurrences of a character from a string. The default character, if one is not given, is a space. The SQL-92 version is a very general function, but you will find that most SQL implementations have a version that works only with spaces. DB2 instead has two functions: LTRIM for leftmost (leading) blanks and RTRIM for rightmost (trailing) blanks.

A character translation is a function for changing each character of a given string according to some many-to-one or one-to-one mapping between two not necessarily distinct character sets. The syntax TRANSLATE(<string expression> USING <translation>) assumes that a special schema object, called a translation, has already been created to hold the rules for doing all of this.

CHAR_LENGTH(<string>), also written CHARACTER_LENGTH(<string>), determines the length of a given character string, as an integer, in characters. In most current products, this function is usually expressed as LENGTH(), and the next two functions do not exist at all; they assume that the database will only hold ASCII or EBCDIC characters.

BIT_LENGTH(<string>) determines the length of a given character string, as an integer, in bits.

OCTET_LENGTH(<string>) determines the length of a given character string, as an integer, in octets. Octets are units of eight bits that are used by the one-octet and two-octet (Unicode) character sets.
This function is the same as TRUNCATE(BIT_LENGTH(<string>)/8).

The POSITION(<search string> IN <source string>) function determines the first position, if any, at which the <search string> occurs within the <source string>. If the <search string> is of length zero, then it occurs at position 1 for any value of the <source string>. If the <search string> does not occur in the <source string>, zero is returned. You will also see LOCATE() in DB2 and CHARINDEX() in SQL Server.

5.3 Common Vendor Extensions

The original SQL-89 standard did not define any functions for CHAR(n) data types. Standard SQL added the basic functions that have been common to implementations for years. However, there are other common or useful functions, and it is worth knowing how to implement them outside of SQL.

Many vendors also have functions that will format data for display by converting the internal format to a text string. A vendor whose SQL is tied to a 4GL is much more likely to have these extensions, simply because the 4GL can use them. The most common one converts a date and time to a national format.

These functions generally use either a COBOL-style PICTURE parameter or a globally set default format. Some of this conversion work is done with the CAST() function in Standard SQL, but since SQL does not have any output statements, such things will be vendor extensions for some time to come.

Vendor extensions are varied, but there are some that are worth mentioning. The names will be different in different products, but the functionality will be the same:

SPACE(n) produces a string of (n) spaces.

REPLICATE(<string expression>, n) produces a string of (n) repetitions of the <string expression>. DB2 calls this one REPEAT(), and you will see other local names for it.

REPLACE(<target string>, <old string>, <new string>) replaces the occurrences of the <old string> with the <new string> in the <target string>.
As an aside, here is a nice trick to reduce several contiguous spaces in a string to a single space to format text:

    UPDATE Foobar
       SET sentence
           = REPLACE(
               REPLACE(
                 REPLACE(sentence, SPACE(1), '<>'),
                 '><', SPACE(0)),
               '<>', SPACE(1));

REVERSE(<string expression>) reverses the order of the characters in a string to make it easier to search. This function is impossible to write with the standard string operators, because it requires either iteration or recursion.

FLIP(<string expression>, <pivot>) will locate the pivot character in the string, then concatenate all the letters to the left of the pivot onto the end of the string and finally erase the pivot character. This is used to change the order of names from military format to civilian format; for example, FLIP('Smith, John', ',') yields John Smith. This function can be written with the standard string functions, however.

NUMTOWORDS(<numeric expression>) will write out the numeric value as a set of English words to be used on checks or other documents that require both numeric and text versions of the same value.

5.3.1 Phonetic Matching

People’s names are a problem for designers of databases. Names are variable-length, can have strange spellings, and are not unique. American names have a diversity of ethnic origins, which give us names pronounced the same way but spelled differently, and vice versa.

Aside from this diversity of names, errors in reading or hearing a name lead to mutations. Anyone who gets junk mail is aware of this. In addition to mail addressed to “Celko,” I get mail addressed to “Selco,” “Selko,” and “Celco,” which are phonetic errors. I also get some letters with typing errors, such as “Cellro,” “Chelco,” and “Chelko” in my mail stack. Such errors result in the mailing of multiple copies of the same item to the same address. To solve this problem, we need phonetic algorithms that can find similar-sounding names.
Soundex Functions

The Soundex family of algorithms is named after the original algorithm. A Soundex algorithm takes a person’s name as input and produces a character string that identifies a set of names that are (roughly) phonetically alike.

SQL products often have a Soundex algorithm in their library functions. It is also possible to compute a Soundex in SQL, using string functions and the CASE expression in Standard SQL. Names that sound alike do not always have the same Soundex code. For example, “Lee” and “Leigh” are pronounced alike, but have different Soundex codes because the silent ‘g’ in “Leigh” is given a code. Names that sound alike but start with a different first letter will always have different Soundex codes; “Carr” and “Karr,” for example, get separate codes.

Finally, Soundex is based on English pronunciation, so European and Asian names may not encode correctly. French surnames like “Beaux” (with a silent ‘x’) and “Beau” (without it) will result in two different Soundex codes.

Sometimes names that don’t sound alike have the same Soundex code. The relatively common names “Powers,” “Pierce,” “Price,” “Perez,” and “Park” all have the same Soundex code. Yet “Power,” a common way to spell Powers 100 years ago, has a different Soundex code.

The Original Soundex

Margaret O’Dell and Robert C. Russell patented the original Soundex algorithm in 1918. The method is based on the phonetic classification of sounds by how they are made. In case you wanted to know, the six groups are bilabial, labiodental, dental, alveolar, velar, and glottal.

The algorithm is fairly straightforward to code and requires no backtracking or multiple passes over the input word. This should not be too surprising, since it was in use before computers and had to be done by hand by clerks. Here is the algorithm:

1. Capitalize all letters in the word. Pad the word with rightmost blanks as needed during each procedure step.
2. Retain the first letter of the word.
3. Drop all occurrences of the following letters after the first position: A, E, H, I, O, U, W, Y.
4. Change letters from the following sets into the corresponding digits given:
       1 = B, F, P, V
       2 = C, G, J, K, Q, S, X, Z
       3 = D, T
       4 = L
       5 = M, N
       6 = R
5. Retain only one occurrence of consecutive duplicate digits from the string that resulted after step 4.
6. Pad the string that resulted from step 5 with trailing zeros and return only the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

An alternative version of the algorithm, due to Russell, changes the letters in step 3 to 9s and retains them. Then step 5 is replaced by two steps: 5.1, which removes redundant duplicates as before, followed by 5.2, which removes all 9s and closes up the spaces. This allows pairs of duplicate digits to appear in the result string. This version has more granularity and will work better for a larger sample of names.

The problem with Soundex is that it was a manual operation used by the Census Bureau long before computers. The algorithm used was not always applied uniformly from place to place. Surname prefixes, such as “La,” “De,” “von,” or “van,” are generally dropped from the last name for Soundex, but not always.

If you are searching for surnames such as “DiCaprio” or “LaBianca,” you should try the Soundex codes both with and without the prefix. Likewise, leading syllables like “Mc,” “Mac,” and “O” were also dropped.

Then there was a question about dropping H and W along with the vowels. The United States Census Soundex did it both ways, so a name like “Ashcraft” could be converted to “Ascrft” in the first pass, and finally Soundexed to “A261,” as it is in the 1920 New York Census. The Soundex code for the 1880, 1900, and 1910 censuses followed both rules. In this case, Ashcraft would be “A226” in some places.
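The six steps above, and Russell's variant with the 9s, can be sketched in a few lines. This is a minimal illustration in Python, not a reference implementation: the function name `soundex` and the `russell_variant` flag are my own, non-letters are simply ignored, and none of the census conventions about surname prefixes are modeled. It follows the steps literally, so the retained first letter takes no part in the duplicate-digit check.

```python
# Step 4: letter-to-digit table, built from the six groups in the text.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        CODES[letter] = digit

# Step 3: letters dropped after the first position.
DROPPED = set("AEHIOUWY")


def soundex(name: str, russell_variant: bool = False) -> str:
    """Return the four-character code; '' if the name has no ASCII letters."""
    # Step 1: capitalize; anything that is not an ASCII letter is ignored.
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    # Step 2: retain the first letter.
    first, rest = letters[0], letters[1:]

    digits = []
    for letter in rest:
        if letter in DROPPED:
            # Russell's variant keeps a 9 as a separator instead of dropping.
            if russell_variant:
                digits.append("9")
        else:
            digits.append(CODES[letter])

    # Step 5 (or 5.1): keep one of each run of consecutive duplicate digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)

    # Step 5.2 of the variant: remove the 9s and close up the gaps.
    collapsed = [d for d in collapsed if d != "9"]

    # Step 6: pad with zeros and keep the first four positions.
    return (first + "".join(collapsed) + "000")[:4]
```

Under these assumptions, “Powers,” “Pierce,” “Price,” “Perez,” and “Park” all come out as P620 while “Power” is P600, and “Lee” and “Leigh” give L000 and L200. For “Ashcraft,” `soundex("Ashcraft")` returns A261 and `soundex("Ashcraft", russell_variant=True)` returns A226, which matches the two census codings described above.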
The reliability of Soundex is 95.99%, with a selectivity factor of 0.213% for a name inquiry.

Metaphone

Metaphone is another improved Soundex that first appeared in Computer Language magazine (Philips 1990). A Pascal version written by Terry Smithwick (Smithwick 1991), based on the original C version by Lawrence Philips, is reproduced with permission here:

    FUNCTION Metaphone (p : STRING) : STRING;
    CONST
      VowelSet  = ['A', 'E', 'I', 'O', 'U'];
      FrontVSet = ['E', 'I', 'Y'];
      VarSonSet = ['C', 'S', 'T', 'G'];
      { variable sound - modified by following 'h' }

      FUNCTION SubStr (A : STRING; Start, Len : INTEGER) : STRING;
      BEGIN
        SubStr := Copy (A, Start, Len);
      END;

    VAR
      i, l, n : BYTE;
      silent, new : BOOLEAN;
      last, this, next, nnext : CHAR;
      m, d : STRING;
    BEGIN { Metaphone }
      IF (p = '')
      THEN BEGIN
        Metaphone := '';
        EXIT;
      END;
      { Fold to uppercase; assume all alphas }
      FOR i := 1 TO Length (p)
      DO p[i] := UpCase (p[i]);
      { initial preparation of string }
      d := SubStr (p, 1, 2);
      IF (d = 'KN') OR (d = 'GN') OR (d = 'PN') OR (d = 'AE') OR (d = 'WR')
      THEN p := SubStr (p, 2, Length (p) - 1); { drop silent first letter }
      IF (p[1] = 'X')
      THEN p := 'S' + SubStr (p, 2, Length (p) - 1);
      IF (d = 'WH')
      THEN p := 'W' + SubStr (p, 3, Length (p) - 2); { drop the 'H' }
      { Set up for Case statement }
      l := Length (p);
      m := ''; { Initialize the main variable }
      new := TRUE; { this variable only used next 10 lines!!! }
      n := 1; { Position counter }
      WHILE ((Length (m) < 6) AND (n <= l))
      DO BEGIN
        { Set up the 'pointers' for this loop-around }
        IF (n > 1)
        THEN last := p[n-1]
        ELSE last := #0; { use a nul terminated string }
        this := p[n];
        IF (n < l)
        THEN next := p[n+1]
        ELSE next := #0;
        IF ((n+1) < l)
        THEN nnext := p[n+2]
        ELSE nnext := #0;
        { skip a doubled letter, except 'CC' inside word }
        new := NOT ((n > 1) AND (last = this) AND (this <> 'C'));
        IF (new)
        THEN BEGIN
          IF ((this IN VowelSet) AND (n = 1))
          THEN m := this;
          CASE this OF
          'B' : IF NOT ((n = l) AND (last = 'M'))
                THEN m := m + 'B'; { -mb is silent }
          'C' : BEGIN { -sce, i, y = silent }
                IF NOT ((last = 'S') AND (next IN FrontVSet))
                THEN BEGIN
                  IF (next = 'I') AND (nnext = 'A')
                  THEN m := m + 'X' { -cia- }
                  ELSE IF (next IN FrontVSet)
                  THEN m := m + 'S' { -ce, i, y = 'S' }
                  ELSE IF (next = 'H') AND (last = 'S')
                  THEN m := m + 'K' { -sch- = 'K' }
                  ELSE IF (next = 'H')
                  THEN IF (n = 1) AND ((n+2) <= l)
                          AND NOT (nnext IN VowelSet)
                       THEN m := m + 'K'
                       ELSE m := m + 'X';
                END { Else silent }
                END; { Case C }
          'D' : IF (next = 'G') AND (nnext IN FrontVSet)
                THEN m := m + 'J'
                ELSE m := m + 'T';
          'G' : BEGIN
                silent := (next = 'H') AND (nnext IN VowelSet);
                IF (n > 1) AND (((n+1) = l)
                   OR ((next = 'N') AND (nnext = 'E')
                       AND (p[n+3] = 'D') AND ((n+3) = l)) { Terminal -gned }
                   AND (last = 'I') AND (next = 'N'))
                THEN silent := TRUE; { if not start and near -end or -gned.) }
                IF (n > 1) AND (last = 'D') AND (next IN FrontVSet)
                THEN silent := TRUE; { -dge, i, y }
                IF NOT silent
                THEN IF (next IN FrontVSet)
                     THEN m := m + 'J'
                     ELSE m := m + 'K';
                END;
          'H' : IF NOT ((n = l) OR (last IN VarSonSet))
                   AND (next IN VowelSet)
                THEN m := m + 'H'; { else silent (vowel follows) }
          'F', 'J', 'L', 'M', 'N', 'R' : m := m + this;
          'K' : IF (last <> 'C')
                THEN m := m + 'K';
          'P' : IF (next = 'H')
                THEN BEGIN
                  m := m + 'F';
                  INC (n);
                END { Skip the 'H' }
                ELSE m := m + 'P';
          'Q' : m := m + 'K';
          'S' : IF (next = 'H')
                   OR ((n > 1) AND (next = 'I') AND (nnext IN ['O', 'A']))
                THEN m := m + 'X'
                ELSE m := m + 'S';
          'T' : IF (n = 1) AND (next = 'H') AND (nnext = 'O')
                THEN m := m + 'T' { Initial Tho- }
                ELSE IF (n > 1) AND (next = 'I') AND (nnext IN ['O', 'A'])
                THEN m := m + 'X'
                ELSE IF (next = 'H')
                THEN m := m + '0'
                ELSE IF NOT ((next = 'C') AND (nnext = 'H'))
                THEN m := m + 'T'; { -tch = silent }
          'V' : m := m + 'F';
          'W', 'Y' : IF (next IN VowelSet)
                     THEN m := m + this; { else silent }
          'X' : m := m + 'KS';
          'Z' : m := m + 'S';
          END; { Case }
        END; { If new }
        INC (n);
      END; { While }
      Metaphone := m
    END; { Metaphone }

NYSIIS Algorithm

The New York State Identification and Intelligence System, or NYSIIS, algorithm is more reliable and selective than Soundex, especially for grouped phonetic sounds. It does not perform well with Y groups, because Y is not translated. NYSIIS yields an alphabetic string key that is filled or rounded to 10 characters.

(1) Translate first characters of name:
    MAC => MCC
    KN => NN
    K => C
    PH => FF
    PF => FF
    SCH => SSS
(2) Translate last characters of name:
    EE => Y
    IE => Y
    DT, RT, RD, NT, ND => D
(3) The first character of key = first character of name.
(4) Translate remaining characters by following rules, scanning one character at a time:
    a. EV => AF else A, E, I, O, U => A
    b. Q => G, Z => S, M => N
    c. KN => N else K => C