8.2 Regular Expressions and Text Processing
8.2.10 Example: Swapping Arguments in Function Calls
Suppose you have a C function superLibFunc taking two arguments,
void superLibFunc(char* method, float x)
and that you have redefined the function such that the float argument appears before the string:
void superLibFunc(float x, char* method)
How can we create a script that searches all C files and swaps the arguments in calls to superLibFunc? Such automatic editing may be important if there are many users of the library who need to update their application codes.
The tricky point is to define the proper regular expression to identify superLibFunc calls and each argument. The pattern to be matched has the form
superLibFunc(arg1,arg2)
with legal optional whitespace according to the rules of C. The texts arg1 and arg2 are patterns for arbitrary variable names in C, i.e., letters and numbers plus underscore, except that the names cannot begin with a number.
A suitable regular expression is
arg1 = r'[A-Za-z_][A-Za-z_0-9]*'
The char* argument may also be a string enclosed in double quotes so we may add that possibility:
arg1 = r'(".*"|[A-Za-z_][A-Za-z_0-9]*)'
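The arg1 pattern above can be tried out on a few candidate strings. The following is a minimal sketch; the helper name is_arg1 and the sample strings are hypothetical and chosen only for illustration.

```python
import re

# Pattern from the text: a double-quoted string or a C variable name.
arg1 = r'(".*"|[A-Za-z_][A-Za-z_0-9]*)'

def is_arg1(text):
    """Return True if the whole of text matches the arg1 pattern."""
    return re.fullmatch(arg1, text) is not None

# Hypothetical sample arguments.
for candidate in ['method1', '_m2', '"linear"', '3method', 'a-b']:
    print('%-10s %s' % (candidate, is_arg1(candidate)))
```

Using re.fullmatch here is a design choice for testing a single argument in isolation; inside the complete call pattern, the surrounding parentheses and comma play the anchoring role.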
The other argument, arg2, may be a C variable name or a floating point number, requiring us to include digits, a dot, minus and plus signs, lower and upper case letters as well as underscore. One possible pattern is to list all possible characters:
arg2 = r'[A-Za-z0-9_.\-+]+'
A more precise pattern for arg2 can make use of the real string from Chapter 8.2.3:
arg2 = '([A-Za-z_][A-Za-z_0-9]*|' + real + ')'
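To see the difference between the two arg2 patterns, one can compare them on a few strings. The sketch below is only an illustration: the real string is defined in Chapter 8.2.3 and is not shown in this section, so the definition used here is an assumed stand-in, and the helper names are hypothetical.

```python
import re

# ASSUMPTION: stand-in for the `real` string of Chapter 8.2.3
# (optional sign, digits with optional fraction, optional exponent).
real = r'[+\-]?(\d+(\.\d*)?|\d*\.\d+)([eE][+\-]?\d+)?'

# The two arg2 patterns from the text.
arg2_simple = r'[A-Za-z0-9_.\-+]+'
arg2_precise = r'([A-Za-z_][A-Za-z_0-9]*|' + real + ')'

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Hypothetical sample arguments.
for candidate in ['x', '-1.5e-3', '.5', '3method', '1.2.3']:
    print('%-10s simple=%-5s precise=%s' %
          (candidate, matches(arg2_simple, candidate),
           matches(arg2_precise, candidate)))
```

The character-list pattern accepts strings such as 1.2.3 and 3method, whereas the more precise pattern rejects them.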
Another complicating factor is that we perhaps also want to swap function arguments in a prototype of superLibFunc (in case there are several header files with superLibFunc prototypes). Then we need arg2 to match float followed by whitespace and an optional legal variable name as well. Embedded C comments /* ... */ are also allowed in the calls and the function declaration. In other words, we realize that the complexity of a precise regular expression grows significantly if we want to make a general script for automatic editing of code.
Despite all the mentioned difficulties, we can solve the whole problem with a much simpler regular expression for arg1 and arg2. The idea is to specify the arguments as some arbitrary text and rely on the surrounding structure, i.e., the name superLibFunc, the parentheses, and the comma. A first attempt might be
arg = r'.+'
Testing it with a line like
superLibFunc ( method1, a );
gives correct results, but
superLibFunc(a,x); superLibFunc(ppp,qqq);
results in the first argument matching a,x); superLibFunc(ppp and not just a. This can be avoided by demanding the regular expression to be non-greedy as explained in Chapter 8.2.5. Alternatively, we can replace the dot in .+ by “any character except comma”:
arg = r'[^,]+'
The advantage of this latter pattern is that it also matches embedded newlines (.+ would in that case require an re.S or re.DOTALL modifier).
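The three alternatives for the argument pattern can be compared directly. The sketch below builds the call pattern used in this section around each alternative and records what the first group captures; the helper first_arguments and the two-line sample call are hypothetical.

```python
import re

def first_arguments(arg, text, flags=0):
    """Return what group 1 (the first argument) captures for each match."""
    call = r'superLibFunc\s*\(\s*(%s),\s*(%s)\)' % (arg, arg)
    return [m.group(1) for m in re.finditer(call, text, flags)]

line = 'superLibFunc(a,x); superLibFunc(ppp,qqq);'
greedy = first_arguments(r'.+', line)       # runs on to the last comma
non_greedy = first_arguments(r'.+?', line)  # stops at the first comma
no_comma = first_arguments(r'[^,]+', line)  # cannot cross a comma
print(greedy, non_greedy, no_comma)

# Hypothetical call whose first argument carries a two-line comment.
multiline = 'superLibFunc(m1 /* two-line\n comment */, x);'
dot_plain = first_arguments(r'.+?', multiline)            # dot stops at newline
dot_all = first_arguments(r'.+?', multiline, re.DOTALL)   # dot matches newline
class_multi = first_arguments(r'[^,]+', multiline)        # no modifier needed
print(dot_plain, dot_all, class_multi)
```

The greedy dot swallows both calls in one match, while the non-greedy dot and the character class each find two separate calls. Only the character class handles the embedded newline without an extra modifier.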
To swap the arguments in the replacement string, we need to enclose each one of them as a group. The suitable regular expression for detecting superLibFunc calls and extracting the two arguments is hence
call = r'superLibFunc\s*\(\s*(%s),\s*(%s)\)' % (arg,arg)
Note that a whitespace specification \s* after the arg pattern is not necessary since [^,]+ matches the argument and optional additional whitespace.
Having stored the file in a string filestr, the command
filestr = re.sub(call, r'superLibFunc(\2, \1)', filestr)
performs the swapping of arguments throughout the file. Recall that \1 and \2 hold the contents of group number 1 and 2 in the regular expression.
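The pieces above can be put together in a small self-contained sketch. The wrapper function swap_args and the in-memory sample text are hypothetical; in a real script the string would be read from a C file.

```python
import re

arg = r'[^,]+'
call = r'superLibFunc\s*\(\s*(%s),\s*(%s)\)' % (arg, arg)

def swap_args(filestr):
    """Swap the two arguments in every superLibFunc call in filestr."""
    return re.sub(call, r'superLibFunc(\2, \1)', filestr)

# Hypothetical file contents held in a string.
sample = ('superLibFunc(a,x); superLibFunc(qqq,ppp);\n'
          'superLibFunc ( method1, method2 );\n')
print(swap_args(sample))
```

Note that the second group keeps the whitespace that precedes the closing parenthesis, which is why the swapped call reads method2 , method1 in the second line.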
Testing our regular expressions on a file containing the lines
superLibFunc(a,x); superLibFunc(qqq,ppp);
superLibFunc ( method1, method2 );
superLibFunc(3method /* illegal name! */, method2 ) ;
superLibFunc( _method1,method_2) ;
superLibFunc (
method1 /* the first method we have */ ,
super_method4 /* a special method that
deserves a two-line comment... */
) ;
results in the modified lines
superLibFunc(x, a); superLibFunc(ppp, qqq);
superLibFunc(method2 , method1);
superLibFunc(method2 , 3method /* illegal name! */) ;
superLibFunc(method_2, _method1) ;
superLibFunc(super_method4 /* a special method that
deserves a two-line comment... */
, method1 /* the first method we have */ ) ;
Observe that an illegal variable name like 3method is matched. However, it makes sense to construct regular expressions that are restricted to work for legal C code only, since syntax errors are found by a compiler anyway.
Improved readability of non-trivial substitutions can be obtained by applying named groups. In the current example, we can name the two groups arg1 and arg2 and also use the verbose regular expression form:
arg = r'[^,]+'
call = re.compile(r"""
superLibFunc # name of function to match
\s* # optional whitespace
\( # parenthesis before argument list
\s* # optional whitespace
(?P<arg1>%s) # first argument plus optional whitespace
, # comma between the arguments
\s* # optional whitespace
(?P<arg2>%s) # second argument plus optional whitespace
\) # closing parenthesis
""" % (arg,arg), re.VERBOSE)
The substitution command can now be written as
filestr = call.sub(r'superLibFunc(\g<arg2>, \g<arg1>)', filestr)
The swapping of arguments example is available in working scripts swap1.py and swap2.py in the directory src/py/regex. A suitable test file for both scripts is .test1.c.
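Named groups can also be inspected directly on a match object, which is convenient when debugging a substitution. The sketch below uses a compact, non-verbose form of the same pattern and a replacement function instead of a replacement string. It is an alternative formulation for illustration and not the code of swap2.py; the names call_named and swap are hypothetical.

```python
import re

# Compact form of the named-group pattern from the text.
call_named = re.compile(
    r'superLibFunc\s*\(\s*(?P<arg1>[^,]+),\s*(?P<arg2>[^,]+)\)')

def swap(match):
    """Build the replacement text from the named groups of one match."""
    return 'superLibFunc(%s, %s)' % (match.group('arg2'),
                                     match.group('arg1'))

line = 'superLibFunc ( method1, method2 );'
groups = call_named.search(line).groupdict()
print(groups)

swapped = call_named.sub(swap, line)
print(swapped)
```

A replacement function is handy when the new text requires more logic than a \g<name> reference can express, for example stripping whitespace from the arguments before they are written back.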
A primary lesson learned from this example is that the “perfect” regular expressions can have a complexity beyond what is feasible, but you can often get away with a very simple regular expression. The disadvantage of simple regular expressions is that they can “match too much” so you need to be prepared for unintended side effects. Our [^,]+ will fail if we have commas inside comments or if an argument is a call to another function, for instance
superLibFunc(m1, a /* large, random number */);
superLibFunc(m1, generate(c, q2));
In the first case, the first [^,]+ matches m1 and the second matches a /* large, i.e., as long a text as possible up to the next comma (greedy match, see Chapter 8.2.5), but the pattern then requires a closing parenthesis where the text has a comma, so the call expression cannot match the superLibFunc call.
The same thing happens in the second line. A complicated regular expression would be needed to fix these undesired effects. Actually, regular expressions are often an insufficient tool for interpreting program code. The only safe and general approach is to parse the code.
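The failure on these two lines is easy to confirm, and a tiny hand-written scanner gives a feeling for what parsing involves. The sketch below is only an illustration under simplifying assumptions: the function split_top_level is hypothetical, it tracks parentheses and /* */ comments only, and it ignores, e.g., commas inside string literals.

```python
import re

arg = r'[^,]+'
call = re.compile(r'superLibFunc\s*\(\s*(%s),\s*(%s)\)' % (arg, arg))

problem_lines = [
    'superLibFunc(m1, a /* large, random number */);',
    'superLibFunc(m1, generate(c, q2));',
]
# Neither line is matched, so a substitution would leave both unchanged.
results = [call.search(line) for line in problem_lines]
print(results)

def split_top_level(argtext):
    """Split argtext on commas outside parentheses and /* */ comments."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(argtext):
        if argtext.startswith('/*', i):
            # Copy the whole comment, commas included, in one step.
            end = argtext.find('*/', i + 2)
            end = len(argtext) if end == -1 else end + 2
            current.append(argtext[i:end])
            i = end
            continue
        ch = argtext[i]
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ',' and depth == 0:
            parts.append(''.join(current))   # argument separator found
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append(''.join(current))
    return parts

print(split_top_level('m1, a /* large, random number */'))
print(split_top_level('m1, generate(c, q2)'))
```

Even this small scanner needs explicit state (nesting depth, comment boundaries) that a single simple regular expression does not carry.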
Whitespace in the original text is not preserved by our specified substitution. It is quite difficult to fix this in a general way. The [^,]+ regular expression matches too much whitespace and cannot be used. A suggested solution is found in src/py/regex/swap3.py.
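One possible way to keep the whitespace in simple cases is to capture it in groups of its own and reinsert it at the same positions. The sketch below is an assumption-laden illustration and not the solution in swap3.py; the names call_ws and swap_keep_whitespace are hypothetical, and the limitations regarding commas in comments and nested calls remain.

```python
import re

# Each argument must end in a non-whitespace character ([^,\s]), so the
# whitespace around it lands in the separate \s* groups.
call_ws = re.compile(
    r'superLibFunc(\s*)\((\s*)([^,]*[^,\s])(\s*),'
    r'(\s*)([^,]*[^,\s])(\s*)\)')

def swap_keep_whitespace(text):
    """Swap groups 3 and 6 (the arguments), leaving whitespace in place."""
    return call_ws.sub(r'superLibFunc\1(\2\6\4,\5\3\7)', text)

print(swap_keep_whitespace('superLibFunc ( method1, method2 );'))
print(swap_keep_whitespace('superLibFunc(a,x);'))
```

The design choice here is to treat whitespace as part of the call's layout rather than as part of either argument, so the layout survives while only the argument texts change places.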