However, in PowerGrep, the regular expression pattern [t-r]ight won’t compile and produces the error shown in Figure 5-14.

Figure 5-14

There is typically no advantage in using reverse ranges in character classes, and I suggest that you avoid them.

A Potential Range Trap

Suppose that you want to allow for different separators in dates occurring in a document or set of documents. Among the issues this problem throws up is a possible trap in expressing character ranges.

As a first test document, we will use Dates.txt, shown here:

2004-12-31
2001/09/11
2003.11.19
2002/04/29
2000/10/19
2005/08/28
2006/09/18

129 Character Classes 08_574892 ch05.qxd 1/7/05 10:52 PM Page 129

As you can see, the dates in this file are in YYYY/MM/DD order, but the separator is sometimes a hyphen, sometimes a forward slash, and sometimes a period. Your task is to select all sequences of characters that represent dates (assume for this example that dates are expressed only using digits and separators, not names of months).

So if you wanted to select all dates, whether they use hyphens, forward slashes, or periods as separators, you might try a regular expression pattern like this:

(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]

In the character class [.-/], which you attempt to use to match the separator, the sequence of characters (period followed by hyphen followed by forward slash) is interpreted as the range from the period to the forward slash. However, as you can see in the top row of Figure 5-15, the hyphen is U+002D, which lies outside that range, and the period (U+002E) is the character immediately before the forward slash (U+002F). So, undesirably, the pattern .-/ specifies a range that contains only the period and forward-slash characters, and dates that use a hyphen as the separator are not matched.

Figure 5-15

Characters can be expressed using Unicode numeric references. The period is U+002E; uppercase A is U+0041.
The Windows Character Map shows this syntax for characters if you hover the mouse over characters of interest.

To use the hyphen without creating a range, the hyphen should be the first character in the character class:

[-./]

This gives a pattern that will match each of the sample dates in the file Dates.txt:

(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]

Try It Out Matching Dates

1. Open PowerGrep, and enter the regular expression pattern (20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9] in the Search text box.
2. Enter C:\BRegExp\Ch05 in the Folder text box, assuming that you have saved the Chapter 5 files from the download in that directory.
3. Enter Dates.txt in the File Mask text box.
4. Click the Search button, and inspect the results shown in Figure 5-16. Notice particularly that the first match, 2004-12-31, includes a hyphen, confirming that the regular expression pattern works as desired.

Figure 5-16

How It Works

The first part of the pattern, (20|19), allows a choice of 20 or 19 as the first two characters of the sequence of characters being tested.

Next, the pattern [0-9]{2} matches two successive numeric digits in the range 0 through 9.

Next, the character class pattern [-./] matches a single character, which is a hyphen, a period, or a forward slash.

The next component of the pattern, [01], matches the numeric digit 0 or 1, because months always have 0 or 1 as the first digit in this date format. Similarly, the next component, the character class [0-9], matches any digit from 0 through 9. Together these would allow month values such as 14 or 18, which are obviously undesirable. One of the exercises at the end of this chapter will ask you to provide a more specific pattern that allows only values from 01 to 12 inclusive.

Next, the character class pattern [-./] matches a single character that is a hyphen, a period, or a forward slash.
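The difference that the position of the hyphen makes in the separator class can also be checked outside PowerGrep. The following is a minimal sketch in Java, assuming the sample dates from Dates.txt are held in an in-memory array rather than read from disk; the class name SeparatorRangeDemo and the helper countMatches are hypothetical names chosen for illustration, and java.util.regex stands in for the PowerGrep engine.

```java
import java.util.regex.Pattern;

final class SeparatorRangeDemo {
    // Sample dates copied from Dates.txt (kept in memory for this sketch).
    static final String[] DATES = {
        "2004-12-31", "2001/09/11", "2003.11.19", "2002/04/29",
        "2000/10/19", "2005/08/28", "2006/09/18"
    };

    // Hyphen between period and slash: read as the range U+002E through U+002F.
    static final Pattern RANGE_TRAP =
        Pattern.compile("(20|19)[0-9]{2}[.-/][01][0-9][.-/][0123][0-9]");

    // Hyphen first in the class: read as a literal hyphen.
    static final Pattern HYPHEN_FIRST =
        Pattern.compile("(20|19)[0-9]{2}[-./][01][0-9][-./][0123][0-9]");

    // Counts how many of the sample dates the pattern matches in full.
    static int countMatches(Pattern p) {
        int count = 0;
        for (String date : DATES) {
            if (p.matcher(date).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("[.-/] matches " + countMatches(RANGE_TRAP) + " of " + DATES.length);
        System.out.println("[-./] matches " + countMatches(HYPHEN_FIRST) + " of " + DATES.length);
    }
}
```

With these seven dates, the range form misses only 2004-12-31, the one date that uses a hyphen, while the hyphen-first form matches all seven.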
Finally, the pattern [0123][0-9] matches days of the month beginning with 0, 1, 2, or 3. As written, the pattern would allow values for the day of the month such as 00, 34, or 38. A later exercise will ask you to create a more specific pattern to constrain values to 01 through 31.

Finding HTML Heading Elements

One potential use for character classes is in finding HTML/XHTML heading elements. As you probably know, HTML and XHTML 1.0 have six heading elements: h1, h2, h3, h4, h5, and h6. In XHTML the h must be lowercase. In HTML it is permitted to be h or H.

First, assume that all the elements are written using a lowercase h. It would then be possible to match the start tag of all six elements, assuming that there are no attributes, using a fairly cumbersome regular expression with parentheses:

<(h1|h2|h3|h4|h5|h6)>

In this case the < character is the literal left angle bracket, which is the first character in the start tag. Then there is a choice of six two-character sequences representing the element type of each HTML/XHTML heading element. Finally, a > is the final literal character of the start tag.

However, because there is a sequence of numbers from 1 to 6, you can use a character class to match the same start tags, either by listing each number literally:

<h[123456]>

or by using a range in the character class:

<h[1-6]>

The sample file, HTMLHeaders.txt, is shown here:

<h1>Some sample header text.</h1>
<h3>Some text.</h3>
<h6>Some header text.</h6>
<h4></h4>
<h5>Some text.</h5>
<h2>Some fairly meaningless text.</h2>

There is an example of each of the six headers.

Try It Out Matching HTML Headers

1. Open PowerGrep, and enter the regular expression pattern <h[1-6]> in the Search text box.
2. Enter C:\BRegExp\Ch05 in the Folder text box, assuming that you have saved the Chapter 5 files from the download in that directory.
3. Enter HTMLHeaders.txt in the File Mask text box.
4. Click the Search button, and inspect the results, as shown in Figure 5-17.

Figure 5-17

Metacharacter Meaning within Character Classes

Most, but not all, single characters have the same meaning inside a character class as they do outside.

The ^ metacharacter

The ^ metacharacter (also called a caret), when it is the first character after the left square bracket, indicates that the characters specified inside the square brackets are not to be matched. The use of the ^ metacharacter is discussed in the section on negated character classes a little later.

If the ^ metacharacter occurs in any position inside square brackets other than immediately after the left square bracket, it has its literal meaning; that is, it matches the ^ character.

A test file, Carets.txt, is shown here:

14^2 expresses the idea of 14 to the power 2.
The ^ character is called a caret.
The _ character is called an underscore or underline character.
3^2 = 9
Eating ^s helps you see in the dark. At least that’s what I think he said.

The problem definition can be expressed as follows:

Match any occurrence of the following characters: the underscore, the caret, or the numeric digit 3.

The character class to satisfy that problem definition is as follows:

[_^3]

Try It Out Using the ^ Inside a Character Class

This example matches the three characters mentioned in the preceding problem definition:

1. Open OpenOffice.org Writer, and open the test file Carets.txt.
2. Use the Ctrl+F keyboard shortcut to open the Find & Replace dialog box.
3. Check the Regular Expressions and Match Case check boxes, and enter the pattern [_^3] in the Search For text box.
4. Click the Find All button, and inspect the results, as shown in Figure 5-18.
5. Modify the regular expression pattern so that it reads [^_3].
6. Click the Find All button, and compare the results shown in Figure 5-19 with the previous results.
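The two patterns used in these steps can also be compared in code. The sketch below uses Java's java.util.regex in place of OpenOffice.org Writer, so it illustrates the character-class rules rather than reproducing Writer's behavior exactly; the class name CaretClassDemo and the helper matchedChars are hypothetical, and the input is one line taken from Carets.txt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaretClassDemo {
    // Returns, in order, every character of text that the character class matches.
    static String matchedChars(String characterClass, String text) {
        Matcher m = Pattern.compile(characterClass).matcher(text);
        StringBuilder matched = new StringBuilder();
        while (m.find()) {
            matched.append(m.group());
        }
        return matched.toString();
    }

    public static void main(String[] args) {
        String line = "3^2 = 9"; // one line from Carets.txt

        // Caret not in first position: literal caret, so the class is {_, ^, 3}.
        System.out.println("[_^3] -> \"" + matchedChars("[_^3]", line) + "\"");

        // Caret in first position: negated class, anything except _ and 3.
        System.out.println("[^_3] -> \"" + matchedChars("[^_3]", line) + "\"");
    }
}
```

For this line, the first class picks out only the 3 and the caret, while the negated class picks out every other character, including the caret and the spaces.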
How It Works

When the pattern is [_^3], the meaning is simply a character class that matches three characters: the underscore, the caret, and the numeric digit 3.

When the ^ immediately follows the left square bracket, as in [^_3], it creates a negated character class, which in this case has the meaning “Match any character except an underscore or the numeric digit 3.”

Figure 5-18

How to Use the - Metacharacter

You have already seen how the hyphen can be used to indicate a range inside a character class. The question therefore arises as to how you can specify a literal hyphen inside a character class.

The safest way is to use the hyphen as the first character after the left square bracket. In some tools, such as the Komodo Regular Expressions Toolkit, you can also use the hyphen as the character immediately before the right square bracket to match a hyphen. In OpenOffice.org Writer, for example, that doesn’t work.

Figure 5-19

Negated Character Classes

Negated character classes always attempt to match a character. So the following negated character class means “Match a character that is not in the range uppercase A through F.”

[^A-F]

Using that class in the following pattern will match AG and AZ, because each is an uppercase A followed by a character that is not in the range A through F:

A[^A-F]

The pattern will not match A on its own because, while the match for A succeeds, there is no following character for the negated character class [^A-F] to match.

Combining Positive and Negative Character Classes

Some languages, such as Java, allow you to combine positive and negative character classes. The following example shows how combined character classes can be used.

The problem definition is as follows:

Match characters A and E through Z.
An alternative way to express that notion is as follows:

Match characters A through Z but not B through D.

You can express that in Java by combining character classes, as follows:

[A-Z&&[^B-D]]

Notice the paired ampersands, which mean logical AND. So the pattern means “Match characters that are in the range A through Z AND are not in the range B through D.”

A simple Java command-line program is shown in CombinedClass2.java:

    import java.util.regex.*;

    public class CombinedClass2 {
        public static void main(String[] args) throws Exception {
            // The string to test is supplied as the first command-line argument.
            String testString = args[0];
            // Intersection: in the range A through Z AND not in the range B through D.
            String regex = "[A-Z&&[^B-D]]";
            Pattern p = Pattern.compile(regex);
            Matcher m = p.matcher(testString);
            String match = null;
            System.out.println("INPUT: " + testString);
            System.out.println("REGEX: " + regex);
            // Report each matching character in turn.
            while (m.find()) {
                match = m.group();
                System.out.println("MATCH: " + match);
            } // end while
            if (match == null) {
                System.out.println("There were no matches.");
            } // end if
        } // end main()
    }

[...]

... time looking at how you could use character classes to match IP addresses, using the following sample file, IPLike.txt:

12.12.12.12
255.255.256.255
12.255.12.255
256.123.256.123
8.234.88.55
196.83.83.191
8.234.88,55
88.173.71.66
241.92.88.103

161 Chapter 6

Now that you have looked at the meaning and use of the ^ and $ metacharacters, you are in a position to take that example to a successful conclusion...

... ABCPartNumbers.txt, is shown here:

ABC123
There is a part number ABC123
ABC234
A purchase order for 400 of ABC345 was received yesterday
ABC789

Notice that some lines consist only of a part number, whereas other lines include the part number as part of some surrounding text. The intention is to match lines that consist only of a part number.

The problem definition is as follows:

Match a beginning of line position,...
... metacharacter in the pattern, the ^ metacharacter, is matched against the regular expression engine’s current position. Because the regular expression engine is at the beginning of the file, the condition specified by the ^ metacharacter is satisfied, so the regular expression engine can proceed to attempt to match the other characters in the regular expression pattern. The next character in the pattern, the...

... box using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.

The regular expression pattern that works is shown here:

^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$

Figure 6-15

162 String, Line, and Word Boundaries

How It Works

First, let’s break the regular expression down into its component parts. The...

... metacharacter would match the beginning of the test text or the beginning of each line, because the two concepts were the same. However, in several tools and languages, it is possible to modify the behavior of the ^ metacharacter so that it matches the position before the first character of each line or only at the beginning of the first line of the test file. When using the Komodo Regular Expression Toolkit,...

... each line begins with the sequence of characters The. Some tools, such as PowerGrep, are in multiline mode by default, as shown here:

1. Open PowerGrep, and check the Regular Expressions check box.
2. Enter the regular expression pattern ^The in the Search text box.
3. Enter C:\BRegExp\Ch06 in the Folder text box. Adjust this if you chose to put the...
4. Enter TheatreMultiline.txt in the File Mask text box.
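The IP address pattern quoted in the excerpt above can be tried against a few of the sample values. This is a rough sketch only: it assumes the intended pattern has no internal spaces and that the three-digit alternative beginning with 1 reads 1[0-9][0-9], it uses Java's java.util.regex rather than the tool described in the text, and the class name IpLikeDemo and the helper isIpLike are hypothetical.

```java
import java.util.regex.Pattern;

final class IpLikeDemo {
    // One octet: 250-255, 200-249, 100-199, 10-99, or 0-9.
    static final String OCTET = "(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])";

    // Three "octet." groups followed by a final octet, anchored at both ends.
    static final Pattern IP_LIKE = Pattern.compile("^(" + OCTET + "\\.){3}" + OCTET + "$");

    // True when the whole string is four dot-separated values in the range 0 through 255.
    static boolean isIpLike(String candidate) {
        return IP_LIKE.matcher(candidate).find();
    }

    public static void main(String[] args) {
        // Three of the values listed in IPLike.txt.
        String[] samples = { "12.12.12.12", "255.255.256.255", "12.255.12.255" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isIpLike(sample));
        }
    }
}
```

Because the ^ and $ anchors pin the match to the whole string, 255.255.256.255 is rejected outright: 256 fits none of the octet alternatives, and no shorter prefix of it is followed by a period.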
1. Open PowerGrep, and check the Regular Expressions check box.
2. Enter the pattern the$ in the Search text box.
3. Enter C:\BRegExp\Ch06 in the Folder text box.
4. Enter Lathe.txt in the File Mask text box.
5. Click the Search button, and inspect the results displayed in the Results area, as shown in Figure 6-6.

Figure 6-6

Notice that there is only one match and that the sequence of characters The at the beginning...

... the beginning or end of a word. Many regular expression implementations have positional metacharacters that allow you to do that. This chapter provides you with the information needed to make matches based on the position of a sequence of characters.

The term anchor is sometimes used to refer to the metacharacters that match a position rather than a character. In some documentation (for example, the documentation...

... the ^ and $ metacharacters in the same pattern:

1. Open OpenOffice.org Writer, and open the test file ABCPartNumbers.txt.
2. Open the Find & Replace dialog box, using the Ctrl+F keyboard shortcut, and check the Regular Expressions and Match Case check boxes.
3. Enter the pattern ^ABC[0-9]{3}$ in the Search For text box.
4. Click the Find All button, and inspect the highlighted text, as shown in Figure 6-9...

... quantifier, against the character 3. That too matches. Finally, it attempts to match the $ metacharacter against the position following the character 3. That matches because it immediately precedes a Unicode newline character. Each component of the pattern matches; therefore, the entire pattern matches. At the beginning of the second line, the regular expression successfully...
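The whole-line matching described in the excerpt above, with ^ and $ used in the same pattern, can be sketched in Java as follows. Several things here are assumptions: the sample lines assume the part numbers in ABCPartNumbers.txt read ABC123, ABC234, and so on; java.util.regex replaces OpenOffice.org Writer; and the class name PartNumberLines and the helper wholeLineMatches are hypothetical. In Java, ^ and $ match at line boundaries only when the MULTILINE flag is set, so the sketch sets it explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartNumberLines {
    // MULTILINE lets ^ and $ match at the start and end of every line of the input.
    static final Pattern WHOLE_LINE = Pattern.compile("^ABC[0-9]{3}$", Pattern.MULTILINE);

    // Returns the part numbers that occupy an entire line of the text.
    static List<String> wholeLineMatches(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WHOLE_LINE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Lines modeled on ABCPartNumbers.txt, joined with newline characters.
        String text = String.join("\n",
            "ABC123",
            "There is a part number ABC123",
            "ABC234",
            "A purchase order for 400 of ABC345 was received yesterday",
            "ABC789");
        System.out.println(wholeLineMatches(text));
    }
}
```

Part numbers embedded in surrounding text are skipped because ^ cannot match in the middle of a line, so only the lines consisting solely of a part number are returned.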