8.2 Regular Expressions and Text Processing
8.2.3 Regular Expressions for Real Numbers
Applications of regular expressions in problems arising from numerical computing often involve interpreting text with real numbers. We then need regular expressions for describing real numbers. This is not a trivial issue, because real numbers can appear in different formats in a text. For example, the number 11 can be written as 11, 11.0, 11., 1.1E+01, 1.1E+1, 1.10000e+01, to mention some possibilities. There are three main formats for real numbers:
– integer notation (11),
– decimal notation (11.0),
– scientific notation (1.10E+01).
The regular expression for integers is very simple, \d+, but those for the decimal and scientific notations are more demanding.
A very simple regular expression for a real number is just a collection of the various characters that can appear in the three types of notation:
[0-9.Ee\-+]+
However, this pattern will also match text like 12-24, 24.-, --E1--, and +++++. Whether such matches are likely to occur depends on the type of text in which we want to search for real numbers. In the following we shall address safer and more sophisticated regular expressions that precisely describe the legal real number notations.
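The weakness of the character-collection pattern is easy to demonstrate. The following is a minimal sketch using Python's re module; the variable names and the list of samples are chosen here for illustration only.

```python
import re

# The "collection of characters" pattern from the text.
naive = r'[0-9.Ee\-+]+'

# Two genuine numbers followed by the four non-numbers mentioned above.
samples = ['11', '1.1E+01', '12-24', '24.-', '--E1--', '+++++']

# fullmatch succeeds only if the whole string fits the pattern.
accepted = [s for s in samples if re.fullmatch(naive, s)]
print(accepted)  # every sample is accepted, including the non-numbers
```

Since every sample consists solely of characters from the bracket, the pattern cannot tell numbers from non-numbers.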
Matching Real Numbers in Decimal Notation. Examples of the decimal notation are -33.9816, 0.11, 11., and .11. The number starts with an optional minus sign, followed by zero or more digits, followed by a dot, followed by zero or more digits. The regular expression is readily constructed from a direct translation of this description:
-?\d*\.\d*
Note that the dot must be quoted with a backslash: we mean the dot character, not its special interpretation in regular expressions (match any character).
The observant reader will claim that our last regular expression is not perfect: it matches non-numbers like -. and even a lone period (.). Matching a lone period is a serious flaw if the real numbers we want to extract appear in running text with periods. To fix this deficiency, we realize that any number in decimal notation must have a digit either before or after the dot. This can be easily expressed by means of the OR operator and parentheses:
-?(\d+\.\d*|\d*\.\d+)
A more compact pattern can be obtained by observing that the simple pattern \d+\.\d* fails to match numbers of the form .243, so we may just add this special form, \.\d+, in an OR operator:
-?(\d+\.\d*|\.\d+)
In the following we shall use the former, slightly longer, pattern as I find this a bit more readable.
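The three decimal-notation patterns can be compared on the examples from the text. This is a small sketch: the names first_try, longer, and compact, as well as the helper accepts, are introduced here and are not part of the original scripts. The compact pattern is written with \d* after the first dot.

```python
import re

first_try = r'-?\d*\.\d*'             # direct translation of the description
longer = r'-?(\d+\.\d*|\d*\.\d+)'     # a digit before OR after the dot
compact = r'-?(\d+\.\d*|\.\d+)'       # same idea, slightly shorter

numbers = ['-33.9816', '0.11', '11.', '.11']
non_numbers = ['.', '-.']

def accepts(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

for name, pattern in [('first_try', first_try),
                      ('longer', longer),
                      ('compact', compact)]:
    good = [accepts(pattern, s) for s in numbers]
    bad = [accepts(pattern, s) for s in non_numbers]
    print(name, good, bad)
```

All three patterns accept the four numbers, but only first_try also accepts the lone period and the string -. as numbers.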
A pattern that can match either the integer format or the decimal notation is expressed by nested OR operators:
-?(\d+|(\d+\.\d*|\d*\.\d+))
The problem with this pattern is that it may match only the integer part before the dot in a real number, i.e., 22 in the number 22.432. The reason is that it first checks if the text 22.432 can match the first operand in the OR expression (-?\d+), and that is possible (22). Hence, we need to check for the most complicated pattern before the simplest one in the OR test:
-?((\d+\.\d*|\d*\.\d+)|\d+)
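The effect of the operand order can be checked directly. In this sketch the names int_first and decimal_first are chosen here for the two orderings; re.search returns the first match found when scanning from the left.

```python
import re

int_first = r'-?(\d+|(\d+\.\d*|\d*\.\d+))'      # integer tried first
decimal_first = r'-?((\d+\.\d*|\d*\.\d+)|\d+)'  # decimal notation tried first

text = '22.432'

# The OR operator settles for the first operand that lets the match succeed,
# so the integer-first variant stops after the digits before the dot.
short_match = re.search(int_first, text).group(0)
full_match = re.search(decimal_first, text).group(0)
print(short_match)  # 22
print(full_match)   # 22.432

# A plain integer is still recognized by the reordered pattern.
print(re.search(decimal_first, 'n = 22').group(0))  # 22
```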
For documentation purposes, this quite complicated pattern is better constructed in terms of variables with sensible names:
int = r'\d+'
real_dn = r'(\d+\.\d*|\d*\.\d+)'
real = '-?(' + real_dn + '|' + int + ')'
Looking at our last regular expression,
-?((\d+\.\d*|\d*\.\d+)|\d+)
we realize that we can get rid of one of the OR operators by making the \.\d* optional, such that the first pattern of the OR expression for the decimal notation also can be an integer:
-?(\d+(\.\d*)?|\d*\.\d+)
This is a more compact pattern, but it is also more difficult to read and to break up into logical components like int and real_dn as just explained.
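That the compact pattern and the pattern built from named components accept the same strings can be spot-checked as follows. This is only a sketch on a handful of hand-picked samples, not a proof of equivalence. The integer pattern is stored as int_re here (a renaming made for this example) to avoid shadowing Python's built-in int.

```python
import re

# Pattern built step by step from named components.
int_re = r'\d+'                       # renamed from int for this sketch
real_dn = r'(\d+\.\d*|\d*\.\d+)'
real = '-?(' + real_dn + '|' + int_re + ')'

# The compact variant with an optional fractional part.
compact = r'-?(\d+(\.\d*)?|\d*\.\d+)'

samples = ['22', '22.432', '-0.5', '.5', '5.', '.', '-', 'abc', '1e5']

# For each sample, record whether each pattern accepts the whole string.
results = [(s,
            re.fullmatch(real, s) is not None,
            re.fullmatch(compact, s) is not None)
           for s in samples]
for s, by_parts, by_compact in results:
    print('%-8s %-6s %-6s' % (s, by_parts, by_compact))
```

On these samples the two patterns agree: the first five strings are accepted and the last four are rejected.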
Matching Real Numbers in Scientific Notation. Real numbers written in scientific notation require a lengthier regular expression. Examples of the format are 1.09876E+05, 9.2E-1, and -1.09876e+05. That is, the number starts with an optional minus sign, followed by one digit, followed by a dot, followed by a sequence of one or more digits, followed by E or e, then a plus or minus sign and finally one or two digits. Translating this to a regular expression results in
-?\d\.\d+[Ee][+\-]\d\d?
Notice that the minus sign has a special meaning as a range operator inside square brackets (for example, [A-Z]) so it is a good habit to quote it, as in [+\-], when we mean the character - (although a minus sign next to one of the brackets, like here, prevents it from being interpreted as a range operator).
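The behavior of the minus sign inside square brackets can be illustrated with a short experiment. The character classes and test characters below are chosen here for illustration; the range [+-9] relies on the ASCII ordering of characters.

```python
import re

# Three spellings of "a plus or a minus sign": quoted minus,
# minus next to the closing bracket, minus next to the opening bracket.
sign_classes = [r'[+\-]', r'[+-]', r'[-+]']
for cls in sign_classes:
    assert re.fullmatch(cls, '+') and re.fullmatch(cls, '-')
    assert not re.fullmatch(cls, ',')

# With a character on both sides, the minus sign denotes a range.
# [+-9] covers every character from '+' to '9' in ASCII order,
# which includes ',', '.', '/' and the digits.
range_class = r'[+-9]'
matched = [c for c in '+,-./059:' if re.fullmatch(range_class, c)]
print(matched)
```

Only the colon falls outside the range, which shows how an unquoted minus sign in the middle of a bracket silently widens the set of accepted characters.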
Sometimes the notation 1e+00 is also allowed. We can improve the regular expression to include this format as well, either
-?\d\.?\d*[Ee][+\-]\d\d?
or
-?\d(\.\d+|)[Ee][+\-]\d\d?
We could also let 1e1 and 1e001 be valid scientific notation, i.e., the sign in the exponent can be omitted and there must be one or more digits in the exponent:
-?\d\.?\d*[Ee][+\-]?\d+
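The differences between the scientific-notation patterns are easiest to see side by side. In the sketch below, the names strict, optional_fraction, and relaxed are labels chosen here for three of the patterns above, and the last sample (two digits before the dot) is added for illustration.

```python
import re

patterns = {
    'strict': r'-?\d\.\d+[Ee][+\-]\d\d?',
    'optional_fraction': r'-?\d(\.\d+|)[Ee][+\-]\d\d?',
    'relaxed': r'-?\d\.?\d*[Ee][+\-]?\d+',
}

samples = ['1.09876E+05', '9.2E-1', '-1.09876e+05',
           '1e+00', '1e1', '1e001', '12.5e+01']

# For each pattern, collect the samples it accepts in full.
accepted = {name: [s for s in samples if re.fullmatch(p, s)]
            for name, p in patterns.items()}

for name in patterns:
    print('%-18s %s' % (name, accepted[name]))
```

The strict pattern accepts only the first three samples, the variant with an optional fraction adds 1e+00, and the relaxed pattern adds 1e1 and 1e001 as well. None of them accepts 12.5e+01, since exactly one digit is required before the dot.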
A Pattern for Real Numbers. The pattern for real numbers in integer, decimal, and scientific notation can be constructed with the aid of the OR operator:
# integer:
int = r'-?\d+'
# real number in scientific notation:
real_sn = r'-?\d(\.\d+|)[Ee][+\-]\d\d?'
# real number in decimal notation:
real_dn = r'-?(\d+\.\d*|\d*\.\d+)'
# regex for real_sn OR real_dn OR int:
real = r'(' + real_sn + '|' + real_dn + '|' + int + r')'
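A possible way of using the combined pattern to extract numbers from a line of text is sketched below. The sample text is made up, and the integer pattern is named int_re (a renaming for this example) so that Python's built-in int is not shadowed. Because the pattern contains several groups, re.findall would return tuples of group contents, so we iterate with re.finditer and pick the whole match with group(0).

```python
import re

int_re = r'-?\d+'                              # integer
real_sn = r'-?\d(\.\d+|)[Ee][+\-]\d\d?'        # scientific notation
real_dn = r'-?(\d+\.\d*|\d*\.\d+)'             # decimal notation
# Most specific notation first, plain integers last.
real = r'(' + real_sn + '|' + real_dn + '|' + int_re + r')'

text = 'x = 11, y = -33.98, z = 1.1E+01 and w = .5'

# group(0) is the complete match, independent of the inner groups.
found = [m.group(0) for m in re.finditer(real, text)]
values = [float(s) for s in found]
print(found)   # ['11', '-33.98', '1.1E+01', '.5']
print(values)  # [11.0, -33.98, 11.0, 0.5]
```

The order of the operands matters here for the same reason as before: with the integer pattern first, 1.1E+01 would be split into pieces.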
A More Compact Pattern for Real Numbers. We have seen that the pattern for an integer and a real number in decimal notation could be combined into a more compact, compound pattern:
-?(\d+(\.\d*)?|\d*\.\d+)
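A number matching this pattern, optionally followed by an exponent such as [Ee][+\-]\d\d?, constitutes a real number. That is, we can construct a single expression that matches all types of real numbers (here with the more permissive exponent [eE][+\-]?\d+, which allows the sign to be omitted and any number of exponent digits):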
-?(\d+(\.\d*)?|\d*\.\d+)([eE][+\-]?\d+)?
This pattern does not match numbers starting with a plus sign (+3.54), so we might add an optional plus or minus sign. We end up with
real_short = r'[+\-]?(\d+(\.\d*)?|\d*\.\d+)([eE][+\-]?\d+)?'
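As a quick check, the short pattern can be used to validate individual tokens before converting them with float. The helper to_float and the token lists below are hypothetical and serve only to illustrate the idea.

```python
import re

real_short = r'[+\-]?(\d+(\.\d*)?|\d*\.\d+)([eE][+\-]?\d+)?'

def to_float(token):
    """Return float(token) if the whole token is a real number, else None."""
    if re.fullmatch(real_short, token):
        return float(token)
    return None

valid = ['+3.54', '-1.09876e+05', '11', '11.', '.11', '1e1', '1e001']
invalid = ['.', 'e5', '1.2.3', '--1', '+']

converted = [to_float(t) for t in valid]
rejected = [to_float(t) for t in invalid]
print(converted)
print(rejected)   # a list of None values
```

Using fullmatch rather than search is a deliberate choice in this sketch: search would find 1.2 inside 1.2.3 and report a number where the token as a whole is not one.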
We do not recommend constructing such expressions on the fly. Instead, one should build the expressions in a step-by-step fashion. This improves the documentation and usually makes it easier to adapt the expression to new applications.
The various regular expressions for real numbers treated in this subsection are coded and tested in the script src/py/regex/realre.py. For more information about recognizing real numbers, see the Perl FAQ, "How do I determine whether a scalar is a number/whole/integer/float?". You can access this entry through perldoc: run perldoc -q '/float' from the command line.