We shall in the following investigate the numerical efficiency of several implementations of a matrix-vector product. Various techniques for speeding up Python loops will be presented, including rewrites with reduce and map, migration of code to Fortran 77, use of run-time compiler tools such as Psyco and Weave, and of course calling a built-in NumPy function for the task. All the implementations and the test suite are available in the file
src/py/examples/efficiency/pyefficiency.py
Pure Python Loops. Here is a straightforward implementation of a matrix-vector product in pure Python:
def prod1(m, v):
    nrows, ncolumns = m.shape
    res = zeros(nrows)
    for i in xrange(nrows):
        for j in xrange(ncolumns):
            res[i] += m[i,j]*v[j]
    return res
Rewrite with map and reduce. Loops can often be replaced by certain combinations of the Python functions map, reduce, and filter. Here is a first try where we express the matrix-vector product as a sum of the scalar products between each row and the vector:
def prod2(m, v):
    nrows = m.shape[0]
    res = zeros(nrows)
    for i in range(nrows):
        res[i] = reduce(lambda x,y: x+y,
                        map(lambda x,y: x*y, m[i,:], v))
    return res
Below is an improved version where we rely on the NumPy multiplication operator, combined with reduce, to perform the scalar product, and a map over the row indices to replace the i loop:
def prod3(m, v):
    nrows = m.shape[0]
    index = xrange(nrows)
    return array(map(lambda i:
                     reduce(lambda x,y: x+y, m[i,:]*v), index))
The prod2 function runs slightly faster than prod1, while prod3 runs almost three times faster than prod1.
Migration to Fortran. The nested loops can straightforwardly be migrated to Fortran (see Chapter 5.3 for an introductory example and Chapter 9 for many more details):
      subroutine matvec1(m, v, w, nrows, ncolumns)
      integer nrows, ncolumns
      real*8 m(nrows,ncolumns), v(ncolumns)
      real*8 w(nrows)
Cf2py intent(out) w
      integer i, j
      real*8 h
C     algorithm: straightforward, stride n in matrix access
      do i = 1, nrows
         w(i) = 0.0
         do j = 1, ncolumns
            w(i) = w(i) + m(i,j)*v(j)
         end do
      end do
      return
      end
The problem with this implementation is that the second index in the matrix runs fastest. Fortran arrays are stored column by column, and the matrix is accessed with large jumps in memory. A more cache-friendly version is obtained by having the i loop inside the j loop. Another (potential) problem with the matvec1 subroutine is that w is an intent(out) argument, which means that the wrapper code allocates memory for w. If matvec1 is called a large number of times, this memory allocation might degrade performance considerably. F2PY enables reuse of such returned arrays by specifying w to be intent(out,cache).
An improved Fortran 77 implementation is shown below.
      subroutine matvec2(m, v, w, nrows, ncolumns)
      integer nrows, ncolumns
      real*8 m(nrows,ncolumns), v(ncolumns)
      real*8 w(nrows)
Cf2py intent(out,cache) w
      integer i, j
      real*8 h
      do i = 1, nrows
         w(i) = 0.0
      end do
      do j = 1, ncolumns
         h = v(j)
         do i = 1, nrows
            w(i) = w(i) + m(i,j)*h
         end do
      end do
      return
      end
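Assuming the subroutines are stored in a file, say matvec.f (the file and module names here are only illustrative), an extension module can be built with F2PY and the straightforward version called from Python as sketched below; the exact calling details, in particular for the cache variant, are covered in Chapter 9:

# build at the command line, e.g.:
#   f2py -c -m matvec matvec.f
from numpy import random
import matvec                  # the F2PY-generated extension module (name assumed)

m = random.random((2000, 2000))
v = random.random(2000)
w = matvec.matvec1(m, v)       # nrows and ncolumns are inferred from m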
Migration to C++ Using Weave. A simple and convenient way of migrating a slow Python loop to C++ is to use Weave (see link in doc.html). This basically means that we write the loop with C++ syntax in a string and then ask Weave to compile and run the string. In the present application the plain Python loop we want to migrate reads
for i in xrange(nrows):
    for j in xrange(ncolumns):
        res[i] += m[i,j]*v[j]
The corresponding C++ code to be used with Weave looks very similar:
def prod7(m, v):
    nrows, ncolumns = m.shape
    res = zeros(nrows)
    code = r"""
    for (int i=0; i<nrows; i++) {
        for (int j=0; j<ncolumns; j++) {
            res(i) += m(i, j)*v(j);
        }
    }
    """
    err = weave.inline(code,
                       ['nrows', 'ncolumns', 'res', 'm', 'v'],
                       type_converters=weave.converters.blitz,
                       compiler='gcc')
    return res

Weave is distributed as a part of SciPy, so if you have installed SciPy, you also have Weave. The C++ source to be compiled is contained in the code string.
Note that array subscripting in the C++ code uses standard parentheses (because we use Blitz++ arrays). The second argument to weave.inline is a list of all the variables that we need to transfer from Python to the C++ code. The third argument specifies how Python data types are converted to C++ data structures; in the present case we specify that NumPy arrays are converted to Blitz++ arrays. The final argument specifies the compiler to be used, and because Blitz++ is used, only a few advanced C++ compilers, including gcc, will compile the Blitz++ code. Fortunately, Weave forces compilation only if the code has changed since the last compilation.
Speeding up Python Code with Psyco. Psyco (see link in doc.html) is a kind of just-in-time compiler for pure Python code. The usage is extremely simple: just a call to psyco.full() (for small codes) or psyco.profile() (for larger codes) may be enough to cause significant speed-up. We refer to the Psyco documentation for how to take advantage of this module. In the present example, it is natural to instruct Psyco to compile a specific function, typically prod1, which employs pure Python loops:
import psyco
prod6 = psyco.proxy(prod1)
Now prod6 is a Psyco-accelerated version of prod1.
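For completeness, here is a sketch of the psyco.full() approach mentioned above; whether this global approach pays off depends on the program, and the call to prod1 is only illustrative:

import psyco
psyco.full()         # let Psyco compile functions as they get called

res = prod1(m, v)    # prod1 now runs as Psyco-compiled code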
Using NumPy Functions. The most obvious way to perform a matrix-vector product in Python is, of course, to apply NumPy functions. The function dot in numpy can be used for multiplying a matrix by a vector:
res = dot(m, v)
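As a quick illustration (the numbers below are arbitrary), dot reproduces the same product as the hand-written loops:

from numpy import array, dot
m = array([[1., 2.],
           [3., 4.]])
v = array([5., 6.])
print dot(m, v)      # [ 17.  39.], the same result as prod1(m, v)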
Results. Running matrix-vector products with a 2000×2000 dense matrix and numpy arrays gave the following relative timings:
method                          function name   CPU time
pure Python loops               prod1           490
map/reduce                      prod2           454
map/reduce                      prod3           209
Psyco                           prod6           327
Fortran                         prod4           2.9
Fortran, cache-friendly loops   prod5           1.0
Weave                           prod7           1.6
NumPy                           dot             1.0
All these results were obtained with double precision array elements.
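The complete test suite resides in the file listed at the beginning of this section; a minimal sketch of how such relative timings can be collected might look as follows (the helper function and the choice of time.clock are illustrative, and prod1-prod3 are assumed to be defined as above):

import time
from numpy import random, dot

n = 2000
m = random.random((n, n))
v = random.random(n)

def cpu_time(func, *args):
    t0 = time.clock()
    func(*args)
    return time.clock() - t0

t_dot = cpu_time(dot, m, v)            # reference timing
for func in [prod1, prod2, prod3]:     # extend with prod4-prod7 when built
    print func.__name__, cpu_time(func, m, v)/t_dot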
Fortran Programming with NumPy Arrays
Python loops over large array structures are known to run slowly. Tests with class Grid2D from Chapter 4.3.5 show that filling a two-dimensional array of size 1100×1100 with nested loops in Python may require about 150 times longer execution time than using Fortran 77 for the same purpose. With Numerical Python (NumPy) and vectorized expressions (from Chapter 4.2) one can speed up the code by a factor of about 50, which gives decent performance.
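To make the contrast concrete, the following sketch compares a scalar double loop with a vectorized fill of a grid array; the grid function sin(x*y) and the use of newaxis are illustrative choices, not the Grid2D code itself:

from numpy import zeros, linspace, sin, newaxis

n = 1100
x = linspace(0, 1, n)
y = linspace(0, 1, n)

# scalar version: nested Python loops over all grid points
a = zeros((n, n))
for i in xrange(n):
    for j in xrange(n):
        a[i,j] = sin(x[i]*y[j])

# vectorized version: one NumPy expression for the whole array
a = sin(x[:,newaxis]*y[newaxis,:])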
There are cases where vectorization in terms of NumPy functions is demanding or inconvenient: it would be a lot easier to just write straight loops over array structures in Fortran, C, or C++. This is quite easy, and the details of doing this in F77 are explained in the present chapter. Chapter 10 covers the same material in a C and C++ context.
The forthcoming material assumes that you are familiar with at least Chapter 5.2.1. Familiarity with Chapter 5.3 as well is an advantage.