11. Maxeler Data-Flow in Computational Finance

11.4 Data-Flow Programming Principles

In the following, we outline the data-flow oriented programming model that is used in Maxeler systems. As described in the previous section, Maxeler data-flow systems are based on a combination of DFEs and CPUs. The basic logical architecture of such a system is illustrated in Fig. 11.5. The CPU is responsible for setting up and controlling the computation on the DFE. The DFE contains one or multiple data-flow kernels that perform the accelerated arithmetic and logical computations. Each DFE also contains a manager that is responsible for the connections between kernels, DFE memory, and the various interconnects such as PCIe, InfiniBand and MaxRing.

Fig. 11.5 Logical architecture of a data-flow computing system with one CPU and one DFE

Separating computation and communication into kernels and managers is beneficial because it allows the data-path inside the kernels to be deeply pipelined without any synchronisation issues.

When developing the kernel, a designer would simply focus on achieving high degrees of pipelining and parallelism without worrying about scheduling or synchronisation. The scheduling of operations inside the kernel will be performed automatically by the compiler. The manager code describes how kernels are connected to memory and other IO interfaces, and the necessary synchronisation logic will also be generated by the compiler.
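The benefit of deep pipelining can be made concrete with a small software model. The sketch below is not Maxeler code: it is a Python simulation written for illustration, and the names `run_pipeline` and `STAGES`, as well as the three stage functions, are assumptions. It models a data path with one register per stage. Once the pipeline is full, one result leaves on every tick, so N elements need N + D - 1 ticks for D stages.

```python
def run_pipeline(stages, data):
    """Simulate a fully pipelined data path with one register per stage.

    Returns (results, ticks). On every tick each stage works on a
    different element, so all stages are busy at the same time.
    """
    depth = len(stages)
    regs = [None] * depth            # regs[k] holds the output of stage k
    pending = list(data)
    results = []
    ticks = 0
    while len(results) < len(data):
        # Update from the last stage backwards so that each stage reads
        # the value its predecessor produced on the previous tick.
        for k in range(depth - 1, 0, -1):
            regs[k] = None if regs[k - 1] is None else stages[k](regs[k - 1])
        regs[0] = stages[0](pending.pop(0)) if pending else None
        ticks += 1
        if regs[-1] is not None:
            results.append(regs[-1])
    return results, ticks


# Hypothetical three-stage data path: square, add one, double.
STAGES = [lambda v: v * v, lambda v: v + 1, lambda v: v * 2]
```

Calling `run_pipeline(STAGES, [1, 2, 3, 4])` streams four elements through three stages in six ticks, whereas executing the three operations one after another for each element would take twelve steps.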

Developing an application for a DFE-based system therefore includes three parts:

A CPU application, typically written in C/C++, MATLAB, Python or FORTRAN;

One or multiple data-flow kernels written in extended Java;

A manager configuration, also written in extended Java.

The compilation flow of a Maxeler data-flow design is illustrated in Fig. 11.6. The design typically starts with a CPU application where a performance-critical part needs to be accelerated.

This part of the application will be targeted at a DFE. Designing a DFE application involves describing one or multiple kernels and a manager in MaxJ. MaxJ is a Java-based meta-language that describes data-flow. It is important to note that executing the MaxJ program will not perform the computations described within the program. Instead, it will trigger the generation of a configuration file for the DFE (the so-called .max file). The computation will later be performed by loading the .max configuration file into the DFE and streaming the data through it. Before we can do this, we need to modify the CPU application to invoke the DFE. To simplify this process, MaxCompiler will generate the necessary function prototypes and header files. The CPU code is then compiled as usual and is linked with the .max file and Maxeler’s Simple Live CPU (SLiC) interface library. The result is a single executable file that contains all the binary code to run on both the conventional CPUs and the DFEs in a system.

Fig. 11.6 Compiling a data-flow application with MaxCompiler

Let us focus on the principles of data-flow programming in MaxJ. As mentioned previously, MaxJ is a meta-language that describes data-flow computing structures; it uses Java syntax but is in principle different from regular Java programming (or other imperative programming paradigms that describe computations by changing state). The most important principle in MaxJ is that we describe a fixed spatial data-flow structure that performs computations simply by having data streamed through it, and not a sequence of instructions to be executed on a traditional processor.

To illustrate these principles, we show how a simple loop computation can be transformed into a data-flow description using MaxJ. Let us assume we want to calculate over a data set.

Even though there is nothing inherently sequential in this computation, a conventional C program would require a for loop. This is illustrated in Fig. 11.7. The calculation is repeated for the number of data elements in a loop. Within the loop body, all operations also run sequentially.

Fig. 11.7 C code of a simple computation inside a loop

In contrast, a data-flow implementation would focus on identifying the core part of the computation and creating a data path for it. Figure 11.8 illustrates such a data-flow implementation.

The same computation that is described inside the loop body can be performed by a fixed data path that contains two multipliers and two adders. One of the key features of data-flow computing is that several operators are present at the same time and run concurrently, instead of the computation sharing a single time-shared functional unit inside a processor. A practical data-flow implementation can have thousands of operators in a data path, all running concurrently. Another important principle is the absence of control and instructions. The data path is fixed and the computation is performed by streaming data from memory directly into the data path.

Fig. 11.8 A data-flow implementation for the computation inside the loop body
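The contrast between the loop and the fixed data path can be sketched in software. The Python model below is an illustration only, not MaxJ and not the code from the figures. The expression `x*x + 3*x + 1` is a stand-in chosen purely because it needs two multiplications and two additions, and the names `loop_version`, `DATA_PATH` and `stream` are hypothetical. The data path is written down once as a fixed wiring table, and the data is then streamed through it.

```python
import operator

OPS = {"mul": operator.mul, "add": operator.add}


def loop_version(xs):
    """Sequential style: the operations run one after another per element."""
    ys = []
    for x in xs:
        t1 = x * x
        t2 = 3.0 * x
        t3 = t1 + t2
        ys.append(t3 + 1.0)
    return ys


# Fixed data path: node name -> (operator, left source, right source).
# A source is the stream input "x", an earlier node, or a numeric constant.
DATA_PATH = {
    "m1": ("mul", "x", "x"),
    "m2": ("mul", 3.0, "x"),
    "a1": ("add", "m1", "m2"),
    "y": ("add", "a1", 1.0),
}


def stream(xs, graph=DATA_PATH, output="y"):
    """Push each element through the fixed operator graph."""
    for x in xs:
        wires = {"x": x}
        for name, (op, lhs, rhs) in graph.items():
            a = wires[lhs] if isinstance(lhs, str) else lhs
            b = wires[rhs] if isinstance(rhs, str) else rhs
            wires[name] = OPS[op](a, b)
        yield wires[output]
```

In this model the wiring table never changes while data flows; only the values on the wires do. On a DFE the four operators would exist side by side in hardware, which a sequential Python interpreter can only imitate.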

Figure 11.9 depicts the MaxJ kernel description that can generate the data path shown in Fig. 11.8. The MaxJ description begins by extending the Kernel class (line 1). The Kernel class is part of the Maxeler Java extensions, and users develop their own kernels by inheriting from it.

Next, we define a constructor for the SimpleCalc class (line 2). It is important to remember that this MaxJ program will only run once to build the DFE configuration; the constructor builds the data-flow implementation. To create the streaming inputs and outputs for the kernel, the methods io.input (line 3) and io.output (line 5) are used. Streaming inputs and outputs replace the for loop in the original C code that iterates over data. The input method takes two arguments: the name of the input, which the manager uses to connect the kernel, and the data type of the input. In this case, we use a standard single-precision floating-point format (8-bit exponent and 24-bit mantissa), but MaxJ also supports custom data types that can be defined by the user. This is useful when optimising the numerical behaviour and performance, which will be covered later. The output method takes three arguments: the name of the output to be used by the manager, the variable to connect to the output, and the data format. The computation itself is expressed in much the same way as in the original C code (line 4).

Fig. 11.9 A MaxJ description that generates the data-flow implementation shown in Fig. 11.8

In MaxJ, the DFEVar object is used to handle run-time data. Since MaxJ describes a data-flow graph rather than a procedure, we have to distinguish between run-time values and compile-time values. Regular Java variables, such as an int, are evaluated and fixed at compile time. Such variables can be used as constants for improved code readability, or to control the build of the data-flow graph. The values of DFEVars are known only at run time, when data is streamed through the kernel. This means that assigning a Java variable to a DFEVar will result in a constant. However, it is not possible to read a DFEVar and assign it to a Java variable (Fig. 11.10). This principle means that we can use Java variables and control constructs to shape the structure of our data-flow graph. Let us consider an example of a nested loop as shown in Fig. 11.11. We observe that the outer for loop performs an iteration over data, while the inner for loop describes a computation with a cyclic dependency of v from one loop iteration to another.

Fig. 11.10 DFEVars handle run-time data, Java constants are evaluated only at compile time
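The one-way relationship between build-time values and run-time values can be imitated in a few lines of Python. This is a sketch under stated assumptions, not the MaxJ API: the class `StreamVar` is a hypothetical stand-in for a DFEVar that only records the graph node it represents. Ordinary numbers combined with it are baked into the graph as constants, while any attempt to read it back as an ordinary number fails, because no data exists while the graph is being built.

```python
class StreamVar:
    """Stand-in for a run-time stream value; it has no value at build time."""

    def __init__(self, expr):
        self.expr = expr                      # description of the graph node

    @staticmethod
    def _describe(value):
        # Build-time numbers become constants inside the graph.
        return value.expr if isinstance(value, StreamVar) else f"const({value})"

    def __add__(self, other):
        return StreamVar(f"({self.expr} + {self._describe(other)})")

    def __mul__(self, other):
        return StreamVar(f"({self.expr} * {self._describe(other)})")

    def __int__(self):
        raise TypeError("a stream value is unknown while the graph is built")

    __float__ = __bool__ = __int__            # same restriction for float() and if


SCALE = 4                                     # ordinary build-time variable
x = StreamVar("x")                            # run-time stream input
y = x * SCALE + 1                             # SCALE and 1 become constants
```

Here `y.expr` is the string `((x * const(4)) + const(1))`, while `int(y)` or `if y:` raises a `TypeError`, mirroring the rule that a run-time value cannot be assigned to a build-time variable.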

Fig. 11.11 C code of a nested loop with dependency

This example can be effectively transformed into a data-flow description as illustrated in Fig. 11.12. Again, the outer loop is replaced by streaming inputs and outputs. The inner loop is described with the same for loop statement in Java, but the compilation of this loop will result in an unrolled implementation of the loop body in space, as depicted in Fig. 11.13. Unlike the original loop in C, the for loop in MaxJ does not carry out four iterations at run time. Instead, the compiler can resolve the dependency of v from one loop iteration to another and construct an unrolled, acyclic data path where the calculation inside the loop body is replicated four times, and each v is connected to the result from the previous iteration.

Fig. 11.12 A MaxJ implementation of the inner loop will be statically evaluated resulting in spatial replication

Fig. 11.13 The result of the MaxJ loop is an unrolled and pipelined data path
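The idea of a loop that runs at build time and replicates hardware, rather than iterating at run time, can be modelled as follows. This Python sketch is an illustration, not MaxJ. The loop body `v * v + x`, the starting value `v = x`, and the names `loop_body`, `build_unrolled` and `reference_loop` are all assumptions, since the code from the original figures is not reproduced here. The build function runs its loop once and returns a single fixed chain of four copies of the body.

```python
def loop_body(v, x):
    """Hypothetical loop body that depends on the previous value of v."""
    return v * v + x


def build_unrolled(n_copies):
    """Runs once at build time; returns (data_path, number_of_copies)."""
    path = lambda x: x                        # v starts as the stream input
    copies = 0
    for _ in range(n_copies):                 # build-time loop: adds copies, not time
        # Binding the current path as a default argument wires each new
        # copy to the result of the copy before it.
        path = lambda x, prev=path: loop_body(prev(x), x)
        copies += 1
    return path, copies


def reference_loop(x, n_iter):
    """The sequential loop with a run-time dependency, for comparison."""
    v = x
    for _ in range(n_iter):
        v = loop_body(v, x)
    return v
```

After `path, copies = build_unrolled(4)`, the returned `path` contains no loop of its own: it is a fixed chain of four bodies, and `path(x)` agrees with `reference_loop(x, 4)` for every input.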

The previous example has shown how a Java for loop can be used to control the replication of statements inside the loop body into an unrolled data path. Likewise, it is possible to use Java conditionals such as if or switch to control the construction of the data-flow graph. The Java if condition is evaluated at compile time, and the block of code inside the conditional statement is added to the data-flow graph only if the condition evaluates to true.

However, we cannot use a Java conditional on DFEVars because their values are known only at run time. As previously mentioned, run-time dependent behaviour is undesirable as it is against the principles of static data-flow computing. If a data-dependent decision needs to be made, it can be expressed using the ternary operator ?: (see Fig. 11.14). This example results in data-dependent control, but in the data path both y1 and y2 are computed concurrently. At the output we simply select one of the two results, depending on the value of a. This switching is very fast and does not delay or stall the stream processing. However, it also means that we require resources for both computations on the DFE chip even though only one of the two outputs is used at any time. This makes this type of control effective for fast, small-scale switching. For switching between larger blocks of computation, it might be more effective to implement separate DFE kernels and handle the switching and control from the CPU host.

Fig. 11.14 Data-dependent control with the ternary operator, and use of a custom number format
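The cost of this kind of data-dependent selection can be illustrated with a small Python model. It is a sketch, not MaxJ: the two branch computations, the condition `a > 0`, and the names `branch_y1`, `branch_y2`, `mux` and `select_kernel` are hypothetical. Because Python evaluates both arguments before calling `mux`, the model behaves like the hardware described above: both results are always produced and one is selected afterwards.

```python
calls = {"y1": 0, "y2": 0}                    # counts how often each branch runs


def branch_y1(x):
    calls["y1"] += 1
    return x + 1.0


def branch_y2(x):
    calls["y2"] += 1
    return x * 2.0


def mux(select, if_true, if_false):
    """Two-input multiplexer: both inputs already exist when it selects."""
    return if_true if select else if_false


def select_kernel(a_stream, x_stream):
    """For every element, compute both branches and pick one based on a."""
    for a, x in zip(a_stream, x_stream):
        yield mux(a > 0, branch_y1(x), branch_y2(x))
```

Streaming three elements through `select_kernel` runs each branch three times, regardless of how often each result is actually selected. That is the resource cost of this style of control.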

Figure 11.14 also illustrates that custom number formats other than conventional single- or double-precision floating point can be used. In this example, we use a 9-bit exponent and a 31-bit mantissa, which offers a larger range and higher precision than single precision (8, 24 bit) but less than double precision (11, 53 bit). Likewise, it is possible to use any arbitrary fixed-point or integer format. The application developer can use such custom number formats to tailor the implementation to the numerical requirements of the application, and using such custom formats will yield better resource utilisation and performance than relying on the next larger standard format.
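The effect of the mantissa width on precision can be explored in software before committing to a format. The following Python sketch is an illustration under stated assumptions: the function name `round_to_mantissa` is hypothetical, it models only the precision of the significand (counted including the implicit leading bit, as in the 24-bit and 53-bit figures above), and it ignores the exponent range entirely.

```python
import math
import struct


def round_to_mantissa(value, mantissa_bits):
    """Round value to the nearest number with the given significand width.

    Only precision is modelled; the exponent range is not limited.
    """
    if value == 0.0 or not math.isfinite(value):
        return value
    frac, exp = math.frexp(value)             # value == frac * 2**exp, 0.5 <= |frac| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(frac * scale) / scale, exp)


# Rounding error of 0.1 for the three mantissa widths mentioned in the text.
errors = {bits: abs(round_to_mantissa(0.1, bits) - 0.1) for bits in (24, 31, 53)}

# Cross-check: 24 bits should agree with a real single-precision round trip.
single = struct.unpack("f", struct.pack("f", 0.1))[0]
```

For the value 0.1, the 31-bit mantissa gives a smaller rounding error than the 24-bit one, and the 53-bit result is identical to the double-precision input. A developer could run an application's reference data through such a model to judge how many mantissa bits are really needed.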

All previous examples have considered operations where the output is a function of inputs with the same array index within the stream, e.g.:

$y_i = f(x_i)$ (11.1)

However, in some cases we need to access values that are ahead of or behind the current element in the data stream. For example, in a moving average filter we need to compute:

$y_i = (x_{i-1} + x_i + x_{i+1}) / 3$ (11.2)

In data-flow computing, x is a stream rather than an indexed array, and we need a way of accessing elements of the same stream with indices other than the current one. This can be achieved with the stream.offset method, which accesses values at a relative offset from the current value in the stream. In the moving average example, we need the previous value ($i - 1$) and the next value ($i + 1$) (Fig. 11.15).

Fig. 11.15 Using stream offsets to access values with relative offsets in the stream
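A software model of relative stream access might look as follows. This Python sketch is not the MaxJ stream.offset method itself: the helper `stream_offset`, the function `moving_average`, and in particular the treatment of the first and last element (averaging only the neighbours that exist) are assumptions made for illustration. The three-point window follows the previous and next values mentioned above.

```python
def stream_offset(stream, i, offset, default=None):
    """Value at a relative offset from element i, or default outside the stream."""
    j = i + offset
    return stream[j] if 0 <= j < len(stream) else default


def moving_average(xs):
    """Three-point moving average using relative offsets -1, 0 and +1."""
    ys = []
    for i, x in enumerate(xs):
        window = [stream_offset(xs, i, -1), x, stream_offset(xs, i, +1)]
        # Assumed boundary rule: average only the values that exist.
        present = [v for v in window if v is not None]
        ys.append(sum(present) / len(present))
    return ys
```

For the input `[1.0, 2.0, 3.0, 4.0]`, this model averages two values at each end of the stream and three values everywhere else.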

Figure 11.16 illustrates how a DFE application interacts with the CPU host application. On the right side, we see the moving average kernel MAVKernel from our last example. As previously mentioned, we also create a manager to describe the connectivity between the kernel and the available DFE interfaces. In Fig. 11.16, the kernel is connected directly to the CPU, and all communication takes place via PCIe. The manager also makes the names of all kernel streaming inputs and outputs visible to the CPU application. Compiling the manager and kernel produces a .max file that can be included in the host application code. In the host application, the moving average calculation is run with a simple function call to MAVKernel(). In this example, the host application is written in C, but MaxCompiler can also generate bindings for a variety of other languages such as MATLAB or Python.

Fig. 11.16 Interaction between host code, manager and kernel in a data-flow application

MaxelerOS and the SLiC library provide a software layer that facilitates the execution and control of DFE applications. The SLiC Application Programming Interface (API) is used to invoke the DFE and process data on it. In the example in Fig. 11.16, we use a simple SLiC interface, and the single function call MAVKernel() carries out all DFE control functions such as loading the binary configuration file and streaming data in and out over PCIe. More advanced SLiC interfaces are also available that provide the user with additional control over the DFE behaviour. For example, in many cases it is beneficial to transfer the data to DFE memory (LMem) first and then start the computation. This is one of many performance optimisations, which we will briefly cover in the next section.
