Information about Bluetooth, 802.11, and more. </BODY>

    <[hH]1>.*</[hH]1>

The pattern <[hH]1>.*</[hH]1> matched the first header (from <H1> until </H1>) and would also match <h1> (as HTML is not case sensitive). But what pattern could be used to match any header (which may be using any of the six valid header levels)? One option would be to use a simple range instead of 1, like this:

    <BODY>
    <H1>Welcome to my Homepage</H1>
    Content is divided into two sections:<BR>
    <H2>ColdFusion</H2>
    Information about Macromedia ColdFusion.
    <H2>Wireless</H2>
    Information about Bluetooth, 802.11, and more.
    </BODY>

    <[hH][1-6]>.*?</[hH][1-6]>

That seemed to work; <[hH][1-6]> matches any header start tag (<H1> and <H2> in this example), and </[hH][1-6]> matches any header end tag (</H1> and </H2>).

Note
Notice that .*? (lazy) was used here, and not .* (greedy). As explained in Lesson 5, "Repeating Matches," quantifiers such as * are greedy, and so pattern <[hH][1-6]>.*</[hH][1-6]> could match all the way from the opening <H1> on the second line until the closing </H2> on the sixth line. Using the lazy quantifier .*? instead solves this problem. I said could, and not would, because this specific example would probably have worked even with the greedy quantifier. Metacharacter . usually does not match line breaks, and in the example, each header is on its own line. But there is no downside to using the lazy quantifier here, so better safe than sorry.

Success? Not exactly.
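Before moving on, the lazy versus greedy behavior described in the note can be checked with a short sketch. It uses Python's re module, which is an assumption of this example and not something the text prescribes; the variable names are illustrative only.

```python
import re

# Sample page from the text; each header sits on its own line.
html = """<BODY>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<BR>
<H2>ColdFusion</H2>
Information about Macromedia ColdFusion.
<H2>Wireless</H2>
Information about Bluetooth, 802.11, and more.
</BODY>"""

lazy = re.compile(r"<[hH][1-6]>.*?</[hH][1-6]>")
greedy = re.compile(r"<[hH][1-6]>.*</[hH][1-6]>")

# By default . does not match a line break, so with one header per line
# both patterns find the same three headers.
lazy_matches = lazy.findall(html)
greedy_matches = greedy.findall(html)
print(lazy_matches)
print(greedy_matches)

# Put the same markup on a single line and the difference shows up:
# the greedy pattern runs from the first start tag to the last end tag.
one_line = html.replace("\n", " ")
lazy_flat = lazy.findall(one_line)
greedy_flat = greedy.findall(one_line)
print(lazy_flat)
print(greedy_flat)
```

On the multi-line input the two patterns agree, which is why the note says "could" and not "would". On the single-line input the greedy pattern returns one long match, while the lazy pattern still returns each header separately.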
Look at the following example (using the same pattern):

    <BODY>
    <H1>Welcome to my Homepage</H1>
    Content is divided into two sections:<BR>
    <H2>ColdFusion</H2>
    Information about Macromedia ColdFusion.
    <H2>Wireless</H2>
    Information about Bluetooth, 802.11, and more.
    <H2>This is not valid HTML</H3>
    </BODY>

    <[hH][1-6]>.*?</[hH][1-6]>

A header tag starting with <H2> and ending with </H3> is invalid, and yet the pattern used here matched it. The problem is that the second part of the match (the part matching the end tag) has no knowledge of the first part of the match (the part matching the start tag). And this is where backreferences become very useful.

Matching with Backreferences

We'll revisit the header problem shortly. For now let's look at a simpler example, and one that cannot be solved at all without the use of backreferences. Suppose that you had a block of text and wanted to locate all repeated words (typos, where the same word was mistakenly typed twice). Obviously, when searching for the second occurrence of a word, that word must be known. Backreferences allow regular expression patterns to refer to previous matches (in this case, the previously matched word).

The best way to understand this is to see it used. Here is some text containing three sets of repeated words, all of which need to be located:

    This is a block of of text, several words here are are repeated, and and they should not be.

    [ ]+(\w+)[ ]+\1

The pattern apparently worked, but how did it work? [ ]+ matches one or more spaces, \w+ matches one or more alphanumeric characters (or underscores), and [ ]+ then matches any trailing spaces.
But notice that \w+ is enclosed within parentheses, making it a subexpression. This subexpression is not used for repeating matches; there is no repeat matching here. Rather, the subexpression is used simply to group an expression, to flag it and identify it for future use. The final part of this pattern is \1; this is a reference back to the subexpression, and so when (\w+) matched the word of, so did \1, and when (\w+) matched the word and, so did \1.
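The repeated-word pattern can be tried out with a minimal sketch. Python's re module is again an assumption of this example, not part of the original text, and the names text, pattern, and repeated are illustrative.

```python
import re

text = ("This is a block of of text, several words here are are "
        "repeated, and and they should not be.")

# [ ]+   one or more spaces
# (\w+)  a word, captured as subexpression 1
# [ ]+   one or more spaces
# \1     the same text that subexpression 1 just matched
pattern = re.compile(r"[ ]+(\w+)[ ]+\1")

# With a single capturing group, findall returns what the group captured.
repeated = pattern.findall(text)
print(repeated)  # ['of', 'are', 'and']
```

Because \1 only requires the same characters to appear again, this pattern as written would also match when the second word merely begins with the first (for example "the then"). Whether to guard against that depends on the text being searched.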