O'Reilly Network For Information About's Book part 61 pot

A Look at regex_iterator

We have seen how to process all of an input sequence with repeated calls to regex_search, but there's another, more elegant way of doing that: using a regex_iterator. This iterator type enumerates all of the regular expression matches in a sequence. Dereferencing a regex_iterator yields a reference to an instance of match_results. When constructing a regex_iterator, you pass it the iterators denoting the input sequence and the regular expression to apply. Let's look at an example where the input data is a comma-separated list of integers. The regular expression is simple.

    boost::regex reg("(\\d+),?");

Adding the repeat ? (match zero or one times) to the end of the regular expression ensures that the last number is parsed successfully even if the input sequence does not end with a comma. The other repeat, +, makes \d match one or more times, so a number with several digits is captured whole.

Now, rather than making multiple calls to regex_search, we create a regex_iterator, call the algorithm for_each, and supply it with a function object to call with the result of dereferencing the iterator. Here's a function object that accepts any form of match_results thanks to its templated function call operator. All it does is add the value of the current match to a total (in our regular expression, the first subexpression is the one we're interested in).

    #include <cstdlib>  // std::atoi

    class regex_callback {
      int sum_;
    public:
      regex_callback() : sum_(0) {}

      // Accepts any match_results type; what[1] is the first subexpression.
      template <typename T> void operator()(const T& what) {
        sum_+=std::atoi(what[1].str().c_str());
      }

      int sum() const {
        return sum_;
      }
    };

You now pass an instance of this function object to std::for_each, which invokes the function call operator for every dereference of the iterator it; that is, once for every match of the regular expression.
    #include <algorithm>  // std::for_each
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("(\\d+),?");
      std::string s="1,1,2,3,5,8,13,21";
      boost::sregex_iterator it(s.begin(),s.end(),reg);
      boost::sregex_iterator end;  // default-constructed: past-the-end
      regex_callback c;
      // for_each returns a copy of the function object, which holds the total.
      int sum=std::for_each(it,end,c).sum();
    }

As you can see, the past-the-end iterator passed to for_each is simply a default-constructed instance of regex_iterator. Also, the type of it and end is boost::sregex_iterator, which is a typedef for regex_iterator<std::string::const_iterator>. Using regex_iterator this way is a much cleaner way of matching multiple times than what we did previously, where we had to advance the starting iterator manually and call regex_search in a loop.

Splitting Strings with regex_token_iterator

Another iterator type, or to be more precise, an iterator adaptor, is boost::regex_token_iterator. It is similar to regex_iterator, but may also be employed to enumerate each character sequence that does not match the regular expression, which is useful for splitting strings. It is also possible to select which subexpressions are of interest, so that when dereferencing the regex_token_iterator, only the subexpressions that are "subscribed to" are returned. Consider an application that receives input data where the entries are separated by a forward slash. Anything in between constitutes an item that the application needs to process. With regex_token_iterator, splitting the strings is easy. The regular expression is very simple.

    boost::regex reg("/");

The regex matches the separator between items. To use it for splitting the input, simply pass the special index -1 to the constructor of regex_token_iterator.
Here is the complete program:

    #include <algorithm>  // std::count
    #include <cassert>
    #include <string>
    #include <vector>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg("/");
      std::string s="Split/Values/Separated/By/Slashes,";
      std::vector<std::string> vec;
      // -1 selects the text between matches, that is, the items themselves.
      boost::sregex_token_iterator it(s.begin(),s.end(),reg,-1);
      boost::sregex_token_iterator end;
      while (it!=end)
        vec.push_back(*it++);

      assert(vec.size()==std::count(s.begin(),s.end(),'/')+1);
      assert(vec[0]=="Split");
    }

Similar to regex_iterator, regex_token_iterator is a template class parameterized on the iterator type for the sequence it wraps. Here, we're using sregex_token_iterator, which is a typedef for regex_token_iterator<std::string::const_iterator>. Each time the iterator it is dereferenced, it returns the current sub_match, and when the iterator is advanced, it tries to match the regular expression again. These two iterator types, regex_iterator and regex_token_iterator, are very useful; you'll know that you need them when you find yourself considering calling regex_search multiple times!

More Regular Expressions

You have already seen quite a lot of regular expression syntax, but there's still more to know. This section quickly demonstrates some of the remaining functionality that is useful in your everyday regular expressions. To begin, we will look at the whole set of repeats; we've already looked at *, +, and bounded repeats using {}. There's one more repeat, and that's ?. You may have noted that it is also used to declare non-greedy repeats, but by itself, it means that the expression may occur zero or one times. It's also worth mentioning that the bounded repeats are very flexible; here are three different ways of using them:

    boost::regex reg1("\\d{5}");
    boost::regex reg2("\\d{2,4}");
    boost::regex reg3("\\d{2,}");

The first regex matches exactly 5 digits. The second matches 2, 3, or 4 digits. The third matches 2 or more digits, without an upper limit.

Another important regular expression feature is negated character classes, which are formed with the metacharacter ^.
You use it to form character classes that match any character that is not part of the character class, that is, the complement of the elements you list in the character class. For example, consider this regular expression.

    boost::regex reg("[^13579]");

It contains a negated character class that matches any character that is not one of the odd digits. Take a look at the following short program, and try to figure out what the output will be.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      boost::regex reg4("[^13579]");
      std::string s="0123456789";
      boost::sregex_iterator it(s.begin(),s.end(),reg4);
      boost::sregex_iterator end;
      while (it!=end)
        std::cout << *it++;
    }

Did you figure it out? The output is "02468"; that is, all of the even digits. Note that this character class does not only match even digits; had the input string been "AlfaBetaGamma", every character of it would have matched just fine too.

The metacharacter we've just seen, ^, serves another purpose too. It is used to denote the beginning of a line. The metacharacter $ denotes the end of a line.

Bad Regular Expressions

A bad regular expression is one that doesn't conform to the rules that govern regexes. For example, if you happen to forget a closing parenthesis, there's no way the regular expression engine can successfully compile the regular expression. When that happens, an exception of type bad_expression is thrown. As I mentioned before, this name will change in the next version of Boost.Regex, and in the version that's going to be added to the Library Technical Report. The exception type bad_expression will be renamed to regex_error.

If all of your regular expressions are hardcoded into your application, you may be safe from having to deal with bad expressions, but if you're accepting user input in the form of regexes, you must be prepared to handle errors. Here's a program that prompts the user to enter a regular expression, followed by a string to be matched against the regex.
As always, when there's user input involved, there's a chance that the input will be invalid.

    #include <iostream>
    #include <string>
    #include "boost/regex.hpp"

    int main() {
      std::cout << "Enter a regular expression:\n";
      std::string s;
      std::getline(std::cin, s);
      try {
        boost::regex reg(s);  // throws if s is not a valid regular expression
        std::cout << "Enter a string to be matched:\n";
        std::getline(std::cin,s);
        if (boost::regex_match(s,reg))
          std::cout << "That's right!\n";
        else
          std::cout << "No, sorry, that doesn't match.\n";
      }
      catch(const boost::bad_expression& e) {
        std::cout << "That's not a valid regular expression! (Error: " << e.what() << ") Exiting \n";
      }
    }

To protect the application and the user, a try/catch block ensures that if boost::regex throws upon construction, an informative message will be printed, and the application will shut down gracefully. Putting this program to the test, let's begin with some reasonable input.

    Enter a regular expression:
    \d{5}
    Enter a string to be matched:
    12345
    That's right!

Now, here's grief coming your way, in the form of a very poor attempt at a regular expression.

    Enter a regular expression:
    (\w*))
    That's not a valid regular expression! (Error: Unmatched ( or \() Exiting

An exception is thrown when the regex reg is constructed, because the regular expression cannot be compiled. Consequently, the catch handler is invoked, and the program prints an error message and exits. There are only three places where you need to be aware of potential exceptions being thrown. One is when constructing a regular expression, as in the example you just saw; another is when assigning regular expressions to a regex, using the member function assign. Finally, the regex iterators and the algorithms can also throw exceptions if memory is exhausted or if the complexity of the match grows too quickly.
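The second of the three places listed above, the member function assign, can be guarded in the same way as construction. The following is a minimal sketch, not code from the book: it uses std::regex from the C++11 standard library in place of Boost.Regex so that it compiles without Boost, and it catches std::regex_error, the name the text says bad_expression is being renamed to. The helper name try_assign is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not part of Boost.Regex or the standard library):
// tries to assign a new pattern to an existing regex object and reports
// whether the pattern could be compiled.
bool try_assign(std::regex& re, const std::string& pattern) {
    try {
        re.assign(pattern);  // can throw std::regex_error, just as construction can
        return true;
    } catch (const std::regex_error&) {
        return false;        // the pattern was not a valid regular expression
    }
}
```

A caller that accepts patterns from a user could call try_assign(re, input) and ask for new input whenever it returns false, rather than letting the exception propagate.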
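The section on regex_token_iterator mentions that you can select which subexpressions are returned, but the splitting program only uses the special index -1. As a rough sketch of the selection case, assuming std::regex from the C++11 standard library as a stand-in for Boost.Regex (its sregex_token_iterator takes the same kind of index argument), the hypothetical helper collect_submatch below gathers one chosen subexpression from every match.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the book): returns the text of subexpression
// `index` for every match of `re` in `input`. Index 0 means the whole match,
// 1 the first parenthesized subexpression, and so on.
std::vector<std::string> collect_submatch(const std::string& input,
                                          const std::regex& re,
                                          int index) {
    std::vector<std::string> result;
    std::sregex_token_iterator it(input.begin(), input.end(), re, index);
    std::sregex_token_iterator end;  // default-constructed: past-the-end
    for (; it != end; ++it)
        result.push_back(it->str());  // *it is the selected sub_match
    return result;
}
```

With the regular expression (\d+),? from the regex_iterator example, index 1 yields just the numbers, whereas index 0 yields each number together with the comma that follows it.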

Posted: 07/07/2014, 08:20