Calling simply new results in a leak!";

    if (boost::regex_search(s,m,reg)) {
      // Did new match?
      if (m[1].matched)
        std::cout << "The expression (new) matched!\n";
      if (m[2].matched)
        std::cout << "The expression (delete) matched!\n";
    }

The preceding program searches the input string for new or delete, and reports which one it finds first. By passing an object of type smatch to regex_search, we gain access to the details of how the algorithm succeeded. Our expression contains two subexpressions, so we can reach the subexpression for new through index 1 of match_results. That gives us an instance of sub_match, which contains a Boolean member, matched, that tells us whether the subexpression participated in the match. So, given the preceding input, running this code would output "The expression (new) matched!\n".

Now, you still have some more work to do. You need to continue applying the regular expression to the remainder of the input, and to do that, you use another overload of regex_search, which accepts two iterators denoting the character sequence to search. Because std::string is a container, it provides iterators. For each match, you must update the iterator denoting the beginning of the range so that it refers to the end of the previous match. Finally, add two variables to hold the counts for new and delete. Here's the complete program:

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      // Are there equally many occurrences of
      // "new" and "delete"?
      boost::regex reg("(new)|(delete)");
      boost::smatch m;
      std::string s=
        "Calls to new must be followed by delete. "
        "Calling simply new results in a leak!";
      int new_counter=0;
      int delete_counter=0;
      std::string::const_iterator it=s.begin();
      std::string::const_iterator end=s.end();

      while (boost::regex_search(it,end,m,reg)) {
        // New or delete?
        m[1].matched ?
          ++new_counter : ++delete_counter;
        // Continue searching right after the end of this match
        it=m[0].second;
      }

      if (new_counter!=delete_counter)
        std::cout << "Leak detected!\n";
      else
        std::cout << "Seems ok\n";
    }

Note that the program always sets the iterator it to m[0].second. match_results[0] returns a reference to the submatch that matched the whole regular expression, so we can be sure that the end of that match is always the correct location to start the next run of regex_search. Running this program outputs "Leak detected!", because there are two occurrences of new, and only one of delete. Of course, this is only a rough check: one variable could be deleted twice, there could be calls to new[] and delete[], and so forth.

By now, you should have a good understanding of how subexpression grouping works. It's time to move on to the final algorithm in Boost.Regex, one that is used to perform substitutions.

Replacing

The third in the family of Regex algorithms is regex_replace. As the name implies, it's used to perform text substitutions. It searches through the input data, finding all matches to the regular expression. For each match of the expression, the algorithm calls match_results::format and outputs the result to an output iterator that is passed to the function.

In the introduction to this chapter, I gave you the example of changing the British spelling of colour to the U.S. spelling of color. Changing the spelling without using regular expressions is very tedious, and extremely error prone. The problem is that the capitalization may vary, and that many words are affected (for example, colourize). To properly attack this problem, we need to split the regular expression into three subexpressions.

    boost::regex reg("(Colo)(u)(r)",
      boost::regex::icase|boost::regex::perl);

We have isolated the villain, the letter u, in order to surgically remove it from any matches. Also note that this regex is case-insensitive, which we achieve by passing the syntax option flag boost::regex::icase to the constructor of regex.
Note that you must also pass any other flags that you want to be in effect. A common user error when setting these flags is to omit the ones that regex turns on by default, but that doesn't work: you must always apply all of the flags that should be set.

When calling regex_replace, we are expected to provide a format string as an argument. This format string determines how the substitution will work. In the format string, it's possible to refer to subexpression matches, and that's precisely what we need here. You want to keep the first matched subexpression and the third, but let the second (u) silently disappear. The expression $N, where N is the index of a subexpression, expands to the match for that subexpression. So our format string becomes "$1$3", which means that the replacement text is the concatenation of the first and the third subexpression matches. By referring to the subexpression matches, we are able to retain any capitalization in the matched text, which would not be possible if we were to use a string literal as the replacement text. Here's a complete program that solves the problem.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(Colo)(u)(r)",
        boost::regex::icase|boost::regex::perl);

      std::string s="Colour, colours, color, colourize";

      // Keep subexpressions 1 and 3; drop the "u" in between
      s=boost::regex_replace(s,reg,"$1$3");
      std::cout << s;
    }

The output of running this program is "Color, colors, color, colorize". regex_replace is enormously useful for applying substitutions like this.

A Common User Misunderstanding

One of the most common questions that I see about Boost.Regex concerns the semantics of regex_match. It's easy to forget that all of the input to regex_match must match the regular expression. Thus, users often think that code like the following should yield true.

    boost::regex reg("\\d*");
    bool b=boost::regex_match("17 is prime",reg);

Rest assured that this call never results in a successful match.
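The difference between matching the whole input and finding a match somewhere in it can be checked with a minimal sketch. It assumes std::regex, the C++11 standard library facility modeled on Boost.Regex, as a stand-in for boost::regex; the helper names matches_whole_input and matches_somewhere are hypothetical and exist only for this illustration.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex. In both
// libraries, regex_match succeeds only if the expression matches
// the entire input.
inline bool matches_whole_input(const std::string& input) {
    static const std::regex reg("\\d*");
    return std::regex_match(input, reg);
}

// regex_search succeeds if the expression matches any part of
// the input (here, even an empty run of digits qualifies).
inline bool matches_somewhere(const std::string& input) {
    static const std::regex reg("\\d*");
    return std::regex_search(input, reg);
}
```

Under these assumptions, matches_whole_input("17 is prime") is false while matches_whole_input("17") is true, and matches_somewhere("17 is prime") is true.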
All of the input must be consumed for regex_match to return true! Almost all of the users asking why this doesn't work should use regex_search rather than regex_match.

    boost::regex reg("\\d*");
    bool b=boost::regex_search("17 is prime",reg);

This most definitely yields true. It is worth noting that it's possible to make regex_search behave like regex_match, using special buffer operators. \A matches the start of a buffer, and \Z matches the end of a buffer, so if you put \A first in your regular expression, and \Z last, you'll make regex_search behave like regex_match; that is, it must consume all input for a successful match. (Strictly speaking, in Boost.Regex \Z also matches just before any trailing newlines at the end of the buffer, whereas \z matches only at the very end.) The following regular expression always requires that the input be exhausted, regardless of whether you are using regex_match or regex_search.

    boost::regex reg("\\A\\d*\\Z");

Please understand that this does not imply that regex_match should not be used; on the contrary, its use should be a clear indication that the semantics we just talked about (that all of the input must be consumed) are in effect.

About Repeats and Greed

Another common source of confusion is the greediness of repeats. Some of the repeats, for example + and *, are greedy. This means that they will consume as much of the input as they possibly can. It's not uncommon to see regular expressions such as the following, with the intent of capturing the digits that follow a greedy repeat.

    boost::regex reg("(.*)(\\d{2})");

This regular expression succeeds, but it might not match the subexpressions that you think it should! The expression .* happily eats everything that the following subexpressions don't need in order to match.
Here's a sample program that exhibits this behavior:

    #include <iostream>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(.*)(\\d{2})");
      boost::cmatch m;
      const char* text = "Note that I'm 31 years old, not 32.";

      if (boost::regex_search(text,m,reg)) {
        if (m[1].matched)
          std::cout << "(.*) matched: " << m[1].str() << '\n';
        if (m[2].matched)
          std::cout << "Found the age: " << m[2] << '\n';
      }
    }

In this program, we are using another parameterization of match_results, through the type cmatch. It is a typedef for match_results<const char*>, and the reason we must use it rather than the type smatch we've been using before is that we're now calling regex_search with a const char* that points to a string literal rather than with an object of type std::string.

What do you expect the output of running this program to be? Typically, users new to regular expressions first think that both m[1].matched and m[2].matched will be true, and that the result of the second subexpression will be "31". Next, after realizing the effects of greedy repeats (that they consume as much input as possible), they tend to think that only the first subexpression can be true; that is, the .* has successfully eaten all of the input. Finally, they arrive at the correct conclusion: the expression matches both subexpressions, but the second subexpression matches the last possible sequence. Here, that means that the first subexpression will match "Note that I'm 31 years old, not " (including the trailing space) and the second will match "32".

So, what do you do when what you actually want is a repeat followed by the first occurrence of another subexpression? Use non-greedy repeats. By appending ? to the repeat, it becomes non-greedy. This means that the expression tries to find the shortest possible match that doesn't prevent the rest of the expression from matching. So, to make the previous regex work correctly, we need to update it like so.

    boost::regex reg("(.*?)(\\d{2})");

If we change the program to use this regular expression, both m[1].matched and m[2].matched will still be true. The expression .*?
consumes as little of the input as it can, which means that it stops at the first character 3, because that's what the expression needs in order to successfully match. Thus, the first subexpression matches "Note that I'm " (including the trailing space) and the second matches "31".
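The two behaviors can be compared side by side with a small sketch. It assumes std::regex, the C++11 standard library facility modeled on Boost.Regex, as a stand-in for boost::regex, and the helper split_at_digits is a hypothetical name introduced only for this illustration.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: searches text with a two-group pattern and
// returns {group 1, group 2}, or two empty strings if nothing matches.
// Assumption: std::regex stands in for boost::regex here.
inline std::pair<std::string, std::string>
split_at_digits(const std::string& text, const std::string& pattern) {
    const std::regex reg(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, reg))
        return {"", ""};
    return {m[1].str(), m[2].str()};
}
```

Applied to the sample sentence "Note that I'm 31 years old, not 32.", the greedy pattern "(.*)(\\d{2})" leaves "32" in the second group, while the non-greedy pattern "(.*?)(\\d{2})" leaves "31" there.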