We shall in the following investigate the numerical efficiency of several implementations of a matrix-vector product. Various techniques for speeding up Python loops will be presented, including rewrites with reduce and map, migration of code to Fortran 77, use of run-time compiler tools such as Psyco and Weave, and of course calling a built-in NumPy function for the task. All the implementations and the test suite are available in the file
src/py/examples/efficiency/pyefficiency.py
Pure Python Loops. Here is a straightforward implementation of a matrix-vector product in pure Python:
def prod1(m, v):
    nrows, ncolumns = m.shape
    res = zeros(nrows)
    for i in xrange(nrows):
        for j in xrange(ncolumns):
            res[i] += m[i,j]*v[j]
    return res
Rewrite with map and reduce. Loops can often be replaced by certain combinations of the Python functions map, reduce, and filter. Here is a first try where we express the matrix-vector product as a sum of the scalar products between each row and the vector:
def prod2(m, v):
    nrows = m.shape[0]
    res = zeros(nrows)
    for i in range(nrows):
        res[i] = reduce(lambda x,y: x+y,
                        map(lambda x,y: x*y, m[i,:], v))
    return res
Below is an improved version where we rely on the NumPy multiplication operator, combined with reduce, to perform the scalar product, and a map over the row indices to replace the i loop:
def prod3(m, v):
    nrows = m.shape[0]
    index = xrange(nrows)
    return array(map(lambda i:
                     reduce(lambda x,y: x+y, m[i,:]*v), index))
The prod2 function runs slightly faster than prod1, while prod3 runs almost three times faster than prod1.
Migration to Fortran. The nested loops can straightforwardly be migrated to Fortran (see Chapter 5.3 for an introductory example and Chapter 9 for many more details):
      subroutine matvec1(m, v, w, nrows, ncolumns)
      integer nrows, ncolumns
      real*8 m(nrows,ncolumns), v(ncolumns)
      real*8 w(nrows)
Cf2py intent(out) w
      integer i, j
      real*8 h
C     algorithm: straightforward, stride n in matrix access
      do i = 1, nrows
         w(i) = 0.0
         do j = 1, ncolumns
            w(i) = w(i) + m(i,j)*v(j)
         end do
      end do
      return
      end
The problem with this implementation is that the second index in the matrix runs fastest. Fortran arrays are stored column by column, and the matrix is accessed with large jumps in memory. A more cache-friendly version is obtained by having the i loop inside the j loop. Another (potential) problem with the matvec1 subroutine is that w is an intent(out) argument, which means that the wrapper code allocates memory for w. If matvec1 is called a large number of times, this memory allocation might degrade performance considerably. F2PY enables reuse of such returned arrays by specifying w to be intent(out,cache).
An improved Fortran 77 implementation is shown below.
      subroutine matvec2(m, v, w, nrows, ncolumns)
      integer nrows, ncolumns
      real*8 m(nrows,ncolumns), v(ncolumns)
      real*8 w(nrows)
Cf2py intent(out,cache) w
      integer i, j
      real*8 h
      do i = 1, nrows
         w(i) = 0.0
      end do
      do j = 1, ncolumns
         h = v(j)
         do i = 1, nrows
            w(i) = w(i) + m(i,j)*h
         end do
      end do
      return
      end
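Assuming the subroutines are stored in a file, say matvec.f (the file and module names here are only illustrative), an extension module can be built with F2PY and the straightforward version called from Python as sketched below; the exact calling details, in particular for the cache variant, are covered in Chapter 9:

# build at the command line, e.g.:
#   f2py -c -m matvec matvec.f
from numpy import random
import matvec                  # the F2PY-generated extension module (name assumed)

m = random.random((2000, 2000))
v = random.random(2000)
w = matvec.matvec1(m, v)       # nrows and ncolumns are inferred from m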
Migration to C++ Using Weave. A simple and convenient way of migrating a slow Python loop to C++ is to use Weave (see link in doc.html). This basically means that we write the loop with C++ syntax in a string and then ask Weave to compile and run the string. In the present application the plain Python loop we want to migrate reads
for i in xrange(nrows):
    for j in xrange(ncolumns):
        res[i] += m[i,j]*v[j]
The corresponding C++ code to be used with Weave looks very similar:
def prod7(m, v):
    nrows, ncolumns = m.shape
    res = zeros(nrows)
    code = r"""
    for (int i=0; i<nrows; i++) {
        for (int j=0; j<ncolumns; j++) {
            res(i) += m(i, j)*v(j);
        }
    }
    """
    err = weave.inline(code,
                       ['nrows', 'ncolumns', 'res', 'm', 'v'],
                       type_converters=weave.converters.blitz,
                       compiler='gcc')
    return res

Weave is distributed as a part of SciPy, so if you have installed SciPy, you also have Weave. The C++ source to be compiled is contained in the code string.
Note that array subscripting in the C++ code uses standard parentheses (because we use Blitz++ arrays). The second argument to weave.inline is a list of all the variables that we need to transfer from Python to the C++ code. The third argument specifies how Python data types are converted to C++ data structures; in the present case we specify that NumPy arrays are converted to Blitz++ arrays. The final argument specifies the compiler to be used, and because Blitz++ is used, only a few advanced C++ compilers, including gcc, will compile the Blitz++ code. Fortunately, Weave forces compilation only if the code has changed since the last compilation.
Speeding up Python Code with Psyco. Psyco (see link in doc.html) is a kind of just-in-time compiler for pure Python code. The usage is extremely simple: just a call to psyco.full() (for small codes) or psyco.profile() (for larger codes) may be enough to cause significant speed-up. We refer to the Psyco documentation for how to take advantage of this module. In the present example, it is natural to instruct Psyco to compile a specific function, typically prod1, which employs pure Python loops:
import psyco
prod6 = psyco.proxy(prod1)
Now prod6 is a Psyco-accelerated version of prod1.
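For completeness, here is a sketch of the psyco.full() approach mentioned above; whether this global approach pays off depends on the program, and the call to prod1 is only illustrative:

import psyco
psyco.full()         # let Psyco compile functions as they get called

res = prod1(m, v)    # prod1 now runs as Psyco-compiled code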
Using NumPy Functions. The most obvious way to perform a matrix-vector product in Python is, of course, to apply NumPy functions. The function dot in numpy can be used for multiplying a matrix by a vector:
res = dot(m, v)
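As a quick illustration (the numbers below are arbitrary), dot reproduces the same product as the hand-written loops:

from numpy import array, dot
m = array([[1., 2.],
           [3., 4.]])
v = array([5., 6.])
print dot(m, v)      # [ 17.  39.], the same result as prod1(m, v)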
Results. Running matrix-vector products with a 2000×2000 dense matrix and numpy arrays gave the following relative timings:
method                          function name   CPU time
pure Python loops               prod1           490
map/reduce                      prod2           454
map/reduce                      prod3           209
Psyco                           prod6           327
Fortran                         prod4           2.9
Fortran, cache-friendly loops   prod5           1.0
Weave                           prod7           1.6
NumPy                           dot             1.0
All these results were obtained with double precision array elements.
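The complete test suite resides in the file listed at the beginning of this section; a minimal sketch of how such relative timings can be collected might look as follows (the helper function and the choice of time.clock are illustrative, and prod1-prod3 are assumed to be defined as above):

import time
from numpy import random, dot

n = 2000
m = random.random((n, n))
v = random.random(n)

def cpu_time(func, *args):
    t0 = time.clock()
    func(*args)
    return time.clock() - t0

t_dot = cpu_time(dot, m, v)            # reference timing
for func in [prod1, prod2, prod3]:     # extend with prod4-prod7 when built
    print func.__name__, cpu_time(func, m, v)/t_dot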
Fortran Programming with NumPy Arrays
Python loops over large array structures are known to run slowly. Tests with class Grid2D from Chapter 4.3.5 show that filling a two-dimensional array of size 1100×1100 with nested loops in Python may require about 150 times longer execution time than using Fortran 77 for the same purpose. With Numerical Python (NumPy) and vectorized expressions (from Chapter 4.2) one can speed up the code by a factor of about 50, which gives decent performance.
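To make the contrast concrete, the following sketch compares a scalar double loop with a vectorized fill of a grid array; the grid function sin(x*y) and the use of newaxis are illustrative choices, not the Grid2D code itself:

from numpy import zeros, linspace, sin, newaxis

n = 1100
x = linspace(0, 1, n)
y = linspace(0, 1, n)

# scalar version: nested Python loops over all grid points
a = zeros((n, n))
for i in xrange(n):
    for j in xrange(n):
        a[i,j] = sin(x[i]*y[j])

# vectorized version: one NumPy expression for the whole array
a = sin(x[:,newaxis]*y[newaxis,:])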
There are cases where vectorization in terms of NumPy functions is demanding or inconvenient: it would be a lot easier to just write straight loops over array structures in Fortran, C, or C++. This is quite easy, and the details of doing this in F77 are explained in the present chapter. Chapter 10 covers the same material in a C and C++ context.
The forthcoming material assumes that you are familiar with at least Chapter 5.2.1. Familiarity with Chapter 5.3 as well is an advantage.