11. Maxeler Data-Flow in Computational Finance
11.3 Maxeler Data-Flow Systems
At the centre of Maxeler’s data-flow systems sits its proprietary DFE hardware. In the state-of-the-art MAX4 generation, each DFE is based on a large Altera Stratix V FPGA that provides the reconfigurable computing substrate for data-flow cores. The FPGA is surrounded by a large amount of DRAM (currently between 48 and 96 GB), called Large Memory (LMem), whose high capacity enables large computational problems. In addition, the FPGA itself provides embedded on-chip memories which are spread throughout the chip’s fabric and can hold local values of the computation. These embedded memories are called Fast Memory (FMem) because they can be accessed with an aggregate bandwidth of several terabytes per second. This is an important factor in the efficiency of DFE computations: data can be kept locally where it is needed and accessed at very high speed. This contrasts with CPU caches, where data is kept on a speculative basis and replicated several times, with only the smallest L1 cache providing very high speed to the computational unit.
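To illustrate how FMem appears to the programmer, the following sketch shows a kernel written in MaxJ, MaxCompiler’s Java-based kernel language (described at the end of this section), that keeps a small lookup table in on-chip memory. It is a minimal example modelled on the published MaxCompiler tutorials; the package paths and the mem.alloc and mapToCPU calls follow the tutorial style but may differ between MaxCompiler versions.

    import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
    import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
    import com.maxeler.maxcompiler.v2.kernelcompiler.stdlib.memory.Memory;
    import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

    public class FMemLookupKernel extends Kernel {
        FMemLookupKernel(KernelParameters parameters) {
            super(parameters);

            // Stream of 8-bit table indices, e.g. arriving from LMem or the CPU.
            DFEVar addr = io.input("addr", dfeUInt(8));

            // A 256-entry single-precision table held in on-chip FMem:
            // it is read every tick with no off-chip memory traffic.
            Memory<DFEVar> table = mem.alloc(dfeFloat(8, 24), 256);
            table.mapToCPU("tableContents"); // let the CPU initialise the table

            // One FMem read per tick; the looked-up value streams straight out.
            io.output("value", table.read(addr), dfeFloat(8, 24));
        }
    }

Because the table lives in FMem, the kernel can read it on every tick at fabric speed, leaving LMem bandwidth free for the bulk data streams.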
As previously mentioned, DFEs are not intended to fully replace conventional CPUs; instead, they are integrated into an HPC system consisting of CPUs, DFEs, storage, and networking. Various system architectures are possible, and the overall balance of components can be tailored to the requirements of the user application. As a key feature, DFEs always contain large amounts of DRAM to facilitate the previously described model of data-flow processing. Various configurations between DFEs and CPUs, as well as among multiple DFEs, are possible. In the following, we give a brief overview of the current Maxeler MPC-C, MPC-X, and MPC-N series.
Maxeler MPC-C systems couple x86 server-grade CPUs with up to four DFE cards (see Fig. 11.2). Each DFE card contains 48 GB of DRAM as LMem and is connected to the CPUs via a PCI Express (PCIe) bus. DFE cards are also directly connected to each other through a dedicated high-speed, low-latency link called MaxRing. This provides fast communication between neighbouring DFEs, enabling larger applications to scale across multiple DFEs without the PCIe link becoming a communication bottleneck. The system also includes storage and networking, and is integrated into a dense, industry-standard 1U rack unit. Such a system supports simple stand-alone deployment of DFE technology, tightly coupled with high-end CPUs. The architecture is beneficial for high-performance applications that run on a fixed number of CPU cores and continuously use one or more DFEs.
Fig. 11.2 MPC-C series architecture. A single node contains both x86 CPUs and four data-flow engines connected via PCIe and MaxRing
The MPC-X series enables a more heterogeneous system architecture, supporting dynamic balancing of CPU and DFE resources. MPC-X systems are pure DFE nodes without any CPUs (see Fig. 11.3). An MPC-X node combines eight DFE cards in a 1U chassis, directly connected through MaxRing. The DFEs are also linked through InfiniBand to a cluster of CPU nodes. The system can allocate large numbers of DFEs dynamically, providing good scalability and flexibility for applications with changing behaviour, e.g. when the computation has several stages that vary in their characteristics. The ratio of CPU servers to MPC-X nodes can be tuned to user application requirements. A sketch of this allocation pattern follows the figure below.
Fig. 11.3 MPC-X series architecture with eight data-flow engines inside a node. Multiple MPC-X nodes and CPU nodes are connected through an InfiniBand network, and the number of DFEs used by each CPU can be varied dynamically
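To picture the dynamic allocation model, the following sketch shows the acquire–use–release pattern an application might follow when the number of DFEs it needs varies between stages. DfePool, Dfe, and the kernel names are hypothetical stand-ins introduced purely for illustration; they are not the actual Maxeler run-time API.

    import java.util.List;

    // Illustrative stand-in for a DFE handle; not the Maxeler API.
    interface Dfe extends AutoCloseable {
        void run(String kernelName); // execute a loaded kernel (illustrative)
        void close();                // return the DFE to the shared pool
    }

    // Illustrative stand-in for the cluster-wide pool of MPC-X DFEs.
    interface DfePool {
        List<Dfe> acquire(int count); // allocate DFEs dynamically (illustrative)
    }

    public class StagedComputation {
        public static void runStages(DfePool pool) throws Exception {
            // Stage 1 is highly parallel: grab eight DFEs across MPC-X nodes.
            List<Dfe> many = pool.acquire(8);
            many.parallelStream().forEach(d -> d.run("monteCarloStage"));
            for (Dfe d : many) d.close();

            // Stage 2 is lighter: a single DFE suffices, freeing the rest
            // for other applications sharing the cluster.
            try (Dfe one = pool.acquire(1).get(0)) {
                one.run("aggregationStage");
            }
        }
    }

The point is that, unlike in the MPC-C architecture, the DFEs are a cluster-wide resource: they can be handed to whichever CPU node currently needs them and returned when a stage completes.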
Maxeler’s MPC-N series (see Fig. 11.4) is a network-oriented platform that provides Ethernet connections directly to the data-flow engines, supporting ultra-low-latency, line-rate processing of multiple 10–40 Gbit data streams. A single MPC-N node contains up to four DFE cards, similar to the MPC-C series architecture. However, each DFE card also supports up to three QSFP+ 40 Gbit Ethernet connections, where each 40 Gbit port can be split into 4 × 10 Gbit ports. Providing fast Ethernet connections directly to the DFE enables network processing with minimal latency. The memory architecture of the DFE also differs from the two previous system architectures: in addition to 24 GB of DRAM available as LMem, the DFE integrates 72 MB of QDR SRAM (QMem), supporting very low-latency off-chip data access. The system also contains additional 10 Gbit connections to the CPUs. MPC-N series systems are well suited to a range of networking applications, including gateways, aggregators, and endpoints.
Fig. 11.4 MPC-N series architecture with four data-flow engines inside a node. Each DFE card also provides three 40 Gbit Ethernet connections directly to the DFE
Maxeler systems are provided with a compilation and simulation environment (called MaxCompiler) for application development, and with the MaxelerOS system management environment. MaxelerOS coordinates the use of DFE resources at run time, and manages the scheduling and data movement within Maxeler systems. MaxCompiler provides a high-level programming environment to express data-flow structures, and produces the necessary binaries for both CPU and DFE.
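As a flavour of how data-flow structures are expressed, the following minimal MaxJ kernel computes a three-point moving average over an input stream. It is closely modelled on Maxeler’s published moving-average tutorial example; the package paths may differ between MaxCompiler versions, and the CPU-side code generated from the resulting .max file is omitted.

    import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
    import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
    import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

    public class MovingAverageKernel extends Kernel {
        MovingAverageKernel(KernelParameters parameters) {
            super(parameters);

            // Input stream of single-precision values.
            DFEVar x = io.input("x", dfeFloat(8, 24));

            // Neighbouring stream elements, accessed via static offsets;
            // the compiler implements these as small on-chip buffers.
            DFEVar prev = stream.offset(x, -1);
            DFEVar next = stream.offset(x, 1);

            // This arithmetic becomes a pipelined data-flow graph in the
            // FPGA fabric: one result emerges per tick once the pipeline fills.
            DFEVar y = (prev + x + next) / 3;

            io.output("y", y, dfeFloat(8, 24));
        }
    }

Note that the kernel describes the structure of the computation rather than a sequence of instructions: MaxCompiler turns it into a fully pipelined hardware data path, while MaxelerOS loads it onto a DFE and moves the streams at run time.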