2018 5th NAFOSTED Conference on Information and Computer Science (NICS)

A Reconfigurable Multi-function DMA Controller for High-Performance Computing Systems

Hung K. Nguyen, Khoi P. Dong, Xuan-Tu Tran
SISLAB, VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
Email: kiemhung@vnu.edu.vn

Abstract—The huge bandwidth demand, along with the requirement to synchronize data structures between different processing structures in a multiprocessor system-on-chip (MPSoC), leads to the need for dedicated memory access controllers. This paper presents the design of a reconfigurable multi-function direct memory access controller (ReDMAC) for high-performance MPSoCs. The ReDMAC supports dynamic reconfiguration: its hardware fabrics can be reconfigured to implement various functions even while the system is running. The ReDMAC supports four operating modes: direct memory access, matrix transposing, data sorting, and matrix merging. The ReDMAC has been modeled at the register transfer level (RTL) in VHDL. The controller has been simulated to evaluate its ability to be reconfigured for each individual function. It has also been synthesized with the Synopsys Design Compiler tool to compare its hardware cost with that of independent implementations of the individual functions. Simulation and synthesis results indicate that the proposed design meets the required functionality, while the area of the controller is about three times smaller than the total area of the independent function cores.

Keywords—ReDMAC, reconfigurable direct memory access controller, multiprocessor system-on-chip, high-performance computing, reconfigurable fabrics

I. INTRODUCTION

Recently, the research trend in the design of high-performance computing systems has shifted toward hybrid reconfigurable multiprocessor systems-on-chip (MPSoCs), e.g., MUSRA [1], Zynq UltraScale [2], ADRES [3], REMUS [4], and CPSoC [5].
These systems normally integrate many heterogeneous processing resources, such as software-programmable microprocessors (µP), hardwired IP (Intellectual Property) cores, and reconfigurable hardware architectures. To program such a system, a target application is first partitioned into a set of tasks, which are then mapped onto the heterogeneous computational and routing resources of the system. Partitioning and mapping the application so that it can be executed on several smaller processors in a parallel or pipelined fashion is more efficient than executing it on a single processor. In particular, the computation-intensive kernel functions of the application are mapped onto the reconfigurable hardware so that they achieve performance approximately equivalent to that of an ASIC while maintaining a degree of flexibility close to that of DSP processors [6]. Moreover, by dynamically reconfiguring the hardware, reconfigurable computing systems allow many hardware tasks to be mapped onto the same hardware platform, thus reducing the area and power consumption of the design [7].

However, designing such high-performance computing systems also poses some challenges. One of them is the communication and synchronization of data between different processing structures. Parallel processing architectures usually require a huge data bandwidth. The system must therefore provide enough bandwidth to ensure that data is always available, so that all resources can run concurrently without idle states. Moreover, because the processing structures have different execution models, the data structures exchanged between them need to be transformed to ensure compatibility.

A common method for data communication between processing units is a shared memory assisted by a direct memory access controller (DMAC). Here, the DMAC transfers data between the shared memory and the parallel processing arrays without the participation of the central processing unit (CPU). Hence, the DMAC is a very important component that helps to increase the data transfer rate and reduce the load on the CPU in computing systems. Unfortunately, a conventional DMAC [8] in a general-purpose computer usually supports only simple operations that copy contiguous data blocks from a source storage area to a destination one. This architecture is not efficient for accessing the complex data structures used by parallel processing architectures. Because of these limitations, traditional DMAC architectures cannot provide enough throughput to keep up with new technology trends. The role of the DMAC becomes more complicated in parallel computing architectures, and improving and optimizing its functionality becomes a key issue in designing high-performance computing systems [9]. Many DMACs ([10]-[14]) have been proposed with unique features dedicated to specific application domains.

In this paper, we propose and implement a reconfigurable multi-function DMA controller (ReDMAC) for the coarse-grained reconfigurable architecture named MUSRA [1]. Because MUSRA is designed to accelerate the computation of loops in multimedia processing applications, some loop-transformation techniques have to be applied when mapping a specific loop onto MUSRA. As a result, the data transferred between software modules running on microprocessors and loops executing on MUSRA also needs proper transformations, such as tiling, fusion, splitting, skewing, and sectioning [15]. Therefore, the proposed DMAC not only moves data from the system's memory to the parallel processing array, but also converts data structures into formats that are compatible with the execution model of MUSRA's parallel processing array. The DMAC supports four modes:

• Basic DMA mode moves a data block from one location to another;
• Fusing DMA mode merges an M×N matrix with an M×L matrix into an M×(N+L) matrix and then moves it to another location;
• Transposing DMA mode copies an M×N matrix from a specified location and transposes it before moving it to another location;
• Sorting DMA mode copies a data block from one location and sorts it before moving it to another location.

The rest of this paper is organized as follows. The operating principle and architecture of the proposed DMAC are presented in Section II. In Section III, experimental results and the evaluation of flexibility, performance, and implementation cost are reported and discussed. Finally, some conclusions are given in Section IV.

978-1-5386-7983-8/18/$31.00 ©2018 IEEE

II. PROPOSED ARCHITECTURE

A. Principle Overview

The ReDMAC is designed to act as an adapter between ARM AMBA-based processing systems and hardware accelerators. The accompanying figure shows the ReDMAC's interface and connectivity in a system-on-chip. The interface between the ReDMAC and the processing system complies with the AMBA AHB protocol specification [16]. It includes an AHB master interface for accessing the system's memory and an AHB slave interface for receiving DMA commands from the CPU. In addition, the ReDMAC has another interface for handshaking with the CPU or with peripherals that request a DMA session. Structurally, the ReDMAC consists of two parts: the DMAC wrapper and the DMAC core. The wrapper makes the interface of the DMAC core compatible with the AHB bus and the accelerator interface, thereby allowing the DMAC core to transfer data between memory and the accelerator.

… Generator (CCG), and Control Unit (CU). In particular, to offer real-time reconfigurability, the CU is in turn composed of a parameterized FSM (Finite State Machine), Reconfigurable Fabrics, and a Context Register File (CRF).

Fig. Functional block diagram of DMAC core. Labels in the figure: Control Unit; Parameterized FSM; Reconfigurable fabrics; Processing Blocks; Routing Blocks; CRF; Control signals; Status signals; Start; Done; FIFO Buffer; Configuration Context Generator (CCG); CMR; SAR; DAR; DADR_REG; Control Register File; Handshaking Interface; AHB Master Bus; AHB Slave Bus; CGRA Bus.

Flowchart labels: Reset; Clear all registers; Dreq = ‘1’? (T/F); CCG; Hreq
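As a behavioural reference for the four operating modes listed in Section I, the following is a minimal software sketch of the data transformation each mode performs. It is not the authors' RTL: the function names are hypothetical, the bus-level transfer (AHB addressing, handshaking, FIFO buffering) is ignored, and two details are assumptions the text does not state, namely that fusing places the columns of the first matrix before those of the second and that sorting is in ascending order.

```python
from typing import List, Sequence

Matrix = List[List[int]]


def basic_dma(block: Sequence[int]) -> List[int]:
    """Basic mode: copy a data block unchanged."""
    return list(block)


def fusing_dma(a: Matrix, b: Matrix) -> Matrix:
    """Fusing mode: merge an MxN and an MxL matrix into an Mx(N+L) matrix.

    Assumption: each output row is the row of `a` followed by the row of `b`.
    """
    if len(a) != len(b):
        raise ValueError("both matrices must have the same number of rows (M)")
    return [list(row_a) + list(row_b) for row_a, row_b in zip(a, b)]


def transposing_dma(a: Matrix) -> Matrix:
    """Transposing mode: turn an MxN matrix into an NxM matrix."""
    return [list(column) for column in zip(*a)]


def sorting_dma(block: Sequence[int]) -> List[int]:
    """Sorting mode: copy a data block and sort it (ascending is an assumption)."""
    return sorted(block)


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]   # 2x2 matrix
    b = [[5], [6]]         # 2x1 matrix
    print(basic_dma([3, 1, 2]))
    print(fusing_dma(a, b))
    print(transposing_dma([[1, 2, 3], [4, 5, 6]]))
    print(sorting_dma([3, 1, 2]))
```

A model of this kind could serve as a golden reference when checking simulation output of the controller mode by mode, since each function returns exactly the data that should appear at the destination.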