Int. J. High Performance Computing and Networking

Performance Evaluation of the Sun Fire Link SMP Clusters

Ying Qian, Ahmad Afsahi*, Nathan R. Fredrickson, Reza Zamani
Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, K7L 3N6, Canada
E-mail: {qiany, ahmad, fredrick, zamanir}@ee.queensu.ca
*Corresponding author

Abstract: The interconnection network and the communication system software are critical to achieving high performance in clusters of multiprocessors. Recently, Sun Microsystems has introduced a new system area network, the Sun Fire Link interconnect, for its Sun Fire cluster systems. Sun Fire Link is a memory-based interconnect, where Sun MPI uses the Remote Shared Memory (RSM) model for its user-level inter-node messaging protocol. In this paper, we present the overall architecture of the Sun Fire Link interconnect and explain how communication is done under RSM and Sun MPI. We provide an in-depth performance evaluation of a Sun Fire Link cluster of four Sun Fire 6800s at the RSM layer, the MPI microbenchmark layer, and the application layer. Our results indicate that put has much better performance than get on this interconnect. The Sun MPI implementation achieves an inter-node latency of a few microseconds, which is comparable to other contemporary interconnects. The uni-directional and bi-directional bandwidths are 695 MB/s and 660 MB/s, respectively. The LogP parameters indicate that the network interface is less capable of off-loading the host CPU as the message size increases. The performance of our applications under MPI is better than the OpenMP version, and equal to or slightly better than the mixed MPI-OpenMP version.

Keywords: System Area Networks, Remote Shared Memory, Clusters of Multiprocessors, Performance Evaluation, MPI, OpenMP

Reference to this paper should be made as follows: Qian, Y., Afsahi, A., Fredrickson, N.R. and Zamani, R. (2005) 'Performance Evaluation of the Sun Fire Link SMP Clusters', Int. J. High Performance Computing and Networking.

Biographical notes: Y. Qian received the BSc degree in electronics engineering from Shanghai Jiao-Tong University, China, in 1998, and the MSc degree from Queen's University, Canada, in 2004. She is currently pursuing her PhD at Queen's. Her research interests include parallel processing, high performance communications, user-level messaging, and network performance evaluation.

A. Afsahi is an Assistant Professor at the Department of Electrical and Computer Engineering, Queen's University. He received his PhD in electrical engineering from the University of Victoria, Canada, in 2000, an MSc in computer engineering from the Sharif University of Technology, and a BSc in computer engineering from Shiraz University. His research interests include parallel and distributed processing, network-based high-performance computing, cluster computing, power-aware high-performance computing, and advanced computer architecture.

N.R. Fredrickson received the BSc degree in Computer Engineering from Queen's University in 2002. He was a research assistant at the Parallel Processing Research Laboratory, Queen's University.

R. Zamani is currently a PhD student at the Department of Electrical and Computer Engineering, Queen's University. He received the BSc degree in communication engineering from the Sharif University of Technology, Iran, and the MSc degree from Queen's University, Canada, in 2005. His current research focuses on power-aware high-performance computing and high performance communications.

Copyright © 2005 Inderscience Enterprises Ltd.
1 INTRODUCTION

Clusters of Symmetric Multiprocessors (SMP) have been regarded as viable scalable architectures for achieving supercomputing performance. There are two main components in such systems: the SMP node, and the communication subsystem, which includes the interconnect and the communication system software. Considerable work has gone into the design of SMP systems, and several vendors such as IBM, Sun, Compaq, SGI, and HP offer small- to large-scale shared-memory systems. Sun Microsystems has introduced its Sun Fire systems in three categories of small, midsize, and large SMPs, supporting from two to 106 processors, backed by its Sun Fireplane interconnect (Charlesworth, 2002) used with the Sun UltraSPARC III Cu. The Sun Fireplane interconnect uses one to four levels of interconnect ASICs to provide better shared-memory performance. All Sun Fire systems use point-to-point signals with a crossbar rather than a data bus.

The interconnection network hardware and the communication system software are the keys to the performance of clusters of SMPs. High-performance interconnect technologies used in high-performance computers include Myrinet (Zamani et al., 2004), Quadrics (Petrini et al., 2003; Brightwell et al., 2004), and InfiniBand (Liu et al., 2005). Each of these interconnects provides different levels of performance, programmability, and integration with the operating system. Myrinet provides high bandwidth and low latency, and supports user-level messaging. Quadrics integrates the local virtual memory into a distributed virtual shared memory. The InfiniBand Architecture (http://www.infinibandta.org/) has been proposed to support the increasing demand for interprocessor communication as well as storage technologies. All of these interconnects support Remote Direct Memory Access (RDMA) operations. Other commodity interconnects include Gigabit Ethernet, 10-Gigabit Ethernet (Feng et al., 2005), and Giganet (Vogels et al., 2000). Gigabit Ethernet is the most widely used network architecture today, mostly due to its backward compatibility. Giganet directly implements the Virtual Interface Architecture (VIA) (Dunning et al., 1998) in hardware.

Recently, Sun Microsystems has introduced the Sun Fire Link interconnect (Sistare and Jackson, 2002) for its Sun Fire clusters. Sun Fire Link is a memory-based interconnect with layered system software components that implement a mechanism for user-level messaging based on direct access to remote memory regions of other nodes (Afsahi and Qian, 2003; Qian et al., 2004). This is referred to as Remote Shared Memory (RSM) (http://docs-pdf.sun.com/817-4415/817-4415.pdf/). Similar work in the past includes the VMMC memory model (Dubnicki et al., 1997) on the Princeton SHRIMP architecture, reflective memory in the DEC Memory Channel (Gillett, 1996), SHMEM (Barriuso and Knies, 1994) in the Cray T3E, and, in software, ARMCI (Nieplocha et al., 2001). Note, however, that these systems implement shared memory in different manners.

Message Passing Interface (MPI) (http://www.mpi-forum.org/docs/docs.html/) is the de-facto standard for parallel programming on clusters. OpenMP (http://www.openmp.org/specs/) has emerged as the standard for parallel programming on shared-memory systems. As small to large SMP clusters become more prominent, it is open to debate whether pure message-passing or mixed MPI-OpenMP is the programming paradigm of choice for higher performance. Previous work on small SMP clusters has shown contradictory results (Cappello and Etiemble, 2000; Henty, 2000).
It is interesting to discover what the case would be for clusters with large SMP nodes.

The authors in (Sistare and Jackson, 2002) have presented the latency and bandwidth of the Sun Fire Link interconnect at the MPI level, along with the performance of collective communications and the NAS parallel benchmarks (Bailey et al., 1995) on a cluster of Sun Fire 6800s. In this paper, we take on the challenge of an in-depth performance evaluation of Sun Fire Link interconnect clusters at the user level (RSM), at the microbenchmark level (MPI), as well as the performance of real applications under different parallel programming paradigms. We provide performance results on a cluster of four Sun Fire 6800s, each with 24 UltraSPARC III Cu processors, under Sun Solaris 9, Sun HPC ClusterTools 5.0, and the Forte Developer 6 (update release) compilers.

This paper has a number of contributions. Specifically, it presents the performance of the user-level RSM API primitives; detailed performance results for different point-to-point and collective communication operations, as well as different permutation traffic patterns at the MPI level; the parameters of the LogP model; and the performance of two applications from the ASCI Purple suite (Vetter and Mueller, 2003) under the MPI, OpenMP, and mixed-mode programming paradigms. Our results indicate that put has much better performance than get on this interconnect. The Sun MPI implementation achieves an inter-node latency of a few microseconds. The uni-directional and bi-directional bandwidths are 695 MB/s and 660 MB/s, respectively. The performance of our applications under MPI is better than the OpenMP version, and equal to or slightly better than the mixed MPI-OpenMP version.

The rest of this paper is organized as follows. In Section 2, we provide an overview of the Sun Fire Link interconnect. Section 3 describes communication under the Remote Shared Memory model. The Sun MPI implementation is discussed in Section 4. We describe our experimental framework in Section 5. Section 6 presents our experimental results. Related work is presented in Section 7. Finally, we conclude the paper in Section 8.

2 SUN FIRE LINK INTERCONNECT

Sun Fire Link is used to cluster Sun Fire 6800 and 15K/12K systems (http://docs.sun.com/db/doc/816-0697-11/). Nodes are connected to the network by a Sun Fire Link-specific I/O subsystem called the Sun Fire Link assembly. The Sun Fire Link assembly is the interface between the Sun Fireplane internal system interconnect and the Sun Fire Link fabric. However, it is not an interface adapter, but a direct connection to the system crossbar. Each Sun Fire Link assembly contains two optical transceiver modules called Sun Fire Link optical modules. Each optical module supports a full-duplex optical link. The transmitter uses a Vertical Cavity Surface Emitting Laser (VCSEL) with a 1.65 GB/s raw bandwidth and a lower theoretical sustained bandwidth after protocol handling. Sun Fire 6800s can have up to two Sun Fire Link assemblies (4 optical links), while Sun Fire 15K/12K systems can have up to eight assemblies (16 optical links). The availability of multiple Sun Fire Link assemblies allows message traffic to be striped across the optical links for higher bandwidth; it also provides protection against link failures. The Sun Fire Link network can support up to 254 nodes, but the current Sun Fire Link switch supports only up to eight nodes. The network connections for clusters of two to three Sun Fire systems can be point-to-point or through Sun Fire Link switches.
For four to eight nodes, switches are required. Figure 1 illustrates a 4-node configuration; four switches are needed for five or more nodes. Nodes can also communicate via TCP/IP for cluster administration.

Figure 1. A 4-node, 2-switch Sun Fire Link network (nodes, Sun Fire Link assemblies, optical links, and Sun Fire Link switches).

The network interface does not have a DMA engine. In contrast to the Quadrics QsNet and the InfiniBand Architecture, which use DMA for remote memory operations, the Sun Fire Link network interface uses programmed I/O. The network interface can initiate interrupts as well as poll for data transfer operations. It provides uncached read and write access to memory regions on remote nodes. A Remote Shared Memory Application Programming Interface (RSMAPI) offers a set of user-level functions for remote memory operations that bypass the kernel (http://docs-pdf.sun.com/817-4415/817-4415.pdf/).

3 REMOTE SHARED MEMORY

Remote Shared Memory is a memory-based mechanism that implements user-level inter-node messaging with direct access to memory resident on remote nodes. The complete set of API calls can be found in (http://docs-pdf.sun.com/817-4415/817-4415.pdf/). The RSMAPI can be divided into five categories: interconnect controller operations, cluster topology operations, memory segment operations, barrier operations, and event operations. Table I shows some of the RSM API calls with their definitions.

TABLE I. REMOTE SHARED MEMORY API (PARTIAL)

Interconnect controller operations:
    rsm_get_controller()                 get controller handle
    rsm_release_controller()             release controller handle

Cluster topology operations:
    rsm_free_interconnect_topology()     free interconnect topology
    rsm_get_interconnect_topology()      get interconnect topology

Memory segment operations:
    rsm_memseg_export_create()           resource allocation function for exporting memory segments
    rsm_memseg_export_destroy()          resource release function for exporting memory segments
    rsm_memseg_export_publish()          allow a memory segment to be imported by other nodes
    rsm_memseg_export_republish()        re-allow a memory segment to be imported by other nodes
    rsm_memseg_export_unpublish()        disallow a memory segment to be imported by other nodes
    rsm_memseg_import_connect()          create a logical connection between import and export sides
    rsm_memseg_import_disconnect()       break the logical connection between import and export sides
    rsm_memseg_import_get()              read from an imported segment
    rsm_memseg_import_put()              write to an imported segment
    rsm_memseg_import_map()              map an imported segment
    rsm_memseg_import_unmap()            unmap an imported segment

Barrier operations:
    rsm_memseg_import_close_barrier()    close a barrier for an imported segment
    rsm_memseg_import_destroy_barrier()  destroy a barrier for an imported segment
    rsm_memseg_import_init_barrier()     create a barrier for an imported segment
    rsm_memseg_import_open_barrier()     open a barrier for an imported segment
    rsm_memseg_import_order_barrier()    impose the order of writes in one barrier
    rsm_memseg_import_set_mode()         set mode for barrier scoping

Event operations:
    rsm_intr_signal_post()               signal an event
    rsm_intr_signal_wait()               wait for an event
Figure 2 shows the general message-passing structure under the Remote Shared Memory model. Communication under RSM involves two basic steps: segment setup and teardown, and the actual data transfer using the direct read and write models. In essence, an application process acting as the "export" side first creates an RSM export segment from its local address space and then publishes it to make it available to processes on other nodes. One or more remote processes acting as the "import" side then create an RSM import segment with a virtual connection between the import and export segments. This is called the setup phase. After the connection is established, the process at the "import" side can communicate with the process at the "export" side by writing into and reading from the shared memory. This is called the data transfer phase. When the data has been successfully transferred, the last step is to tear down the connection: the "import" side disconnects the connection, and the "export" side unpublishes the segment, destroys it, and releases the controller handle.

Figure 2. Setup, data transfer, and tear-down phases under RSM communication (export side: get_controller, export_create, export_publish, export_unpublish, export_destroy, release_controller; import side: get_controller, import_connect, read/write, import_disconnect, release_controller).

Figure 3 illustrates the main steps of the data transfer phase. The "import" side can use the RSM put/get primitives, or use the mapping technique, to read or write data. Put writes to (get reads from) the exported memory segment through the connection. The mapping method maps the exported segment into the importing address space and then uses CPU store/load memory operations for the data transfer. This could be done through memcpy; however, memcpy is not guaranteed to use the UltraSPARC's Block Store/Load instructions, so library routines should be used for this purpose. The barrier operations ensure that data transfers have completed successfully before they return. The order function is optional and can impose an order on multiple writes within one barrier. The signal operation is used to inform the "export" side that the "import" side has written something into the exported segment.

Figure 3. Steps in the data transfer phase: (a) get, (b) put, (c) map. Each variant brackets the data movement with barrier operations (init, open, optional order) and ends with a signal_post; the map variant uses Block Store/Load on the mapped segment and finishes with unmap.
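To make this call sequence concrete, the sketch below outlines an exporting and an importing process using the RSMAPI primitives from Table I. It is a minimal illustration only: error handling is omitted; the controller name, segment id, segment size, and node id are made up for the example; and the argument lists and constants reflect our reading of the Solaris librsm interface, so exact signatures should be checked against the RSMAPI documentation before use.

    /* Sketch of the RSM setup / put / teardown sequence of Figures 2 and 3.
     * Argument lists and constants are approximate and for illustration only. */
    #include <rsmapi.h>     /* Solaris RSMAPI (librsm) */
    #include <stdlib.h>

    #define CTRL_NAME  "my_controller"   /* platform-specific RSM controller name */
    #define SEG_ID     0x1234            /* arbitrary example segment id          */
    #define SEG_SIZE   (16 * 1024)       /* 16 KB, as in Table II                 */

    /* Export side: create and publish a segment, later tear it down. */
    void export_side(void)
    {
        rsmapi_controller_handle_t ctrl;
        rsm_memseg_export_handle_t seg;
        rsm_memseg_id_t id = SEG_ID;
        void *mem = valloc(SEG_SIZE);            /* export memory should be page-aligned */

        rsm_get_controller(CTRL_NAME, &ctrl);
        rsm_memseg_export_create(ctrl, &seg, mem, SEG_SIZE, 0);
        rsm_memseg_export_publish(seg, &id, NULL, 0);   /* empty access list: allow all */

        /* ... wait until the importer signals that data has arrived ... */

        rsm_memseg_export_unpublish(seg);
        rsm_memseg_export_destroy(seg);
        rsm_release_controller(ctrl);
    }

    /* Import side: connect, put data inside a barrier, signal, disconnect. */
    void import_side(rsm_node_id_t export_node)
    {
        rsmapi_controller_handle_t ctrl;
        rsm_memseg_import_handle_t seg;
        rsmapi_barrier_t bar;
        char msg[64] = "hello over Sun Fire Link";

        rsm_get_controller(CTRL_NAME, &ctrl);
        rsm_memseg_import_connect(ctrl, export_node, SEG_ID, RSM_PERM_RDWR, &seg);

        rsm_memseg_import_init_barrier(seg, RSM_BAR_DEFAULT, &bar);  /* barrier type per docs */
        rsm_memseg_import_open_barrier(&bar);
        rsm_memseg_import_put(seg, 0, msg, sizeof(msg));   /* write at offset 0 */
        rsm_memseg_import_close_barrier(&bar);             /* returns once the write completed */
        rsm_memseg_import_destroy_barrier(&bar);

        rsm_intr_signal_post(seg, 0);                      /* tell the exporter data is there */

        rsm_memseg_import_disconnect(seg);
        rsm_release_controller(ctrl);
    }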
4 SUN MPI

Sun MPI chooses the most efficient communication protocol based on the location of the processes and the available interfaces (http://docs-pdf.sun.com/817-0090-10/817-0090-10.pdf/). The library takes advantage of shared-memory mechanisms (shmem) for intra-node communication and RSM for inter-node communication; it also runs on top of the TCP stack. When a process enters an MPI call, Sun MPI (through the progress engine, a layer on top of shmem, RSM, and the TCP stack) may act on a variety of messages. A process may progress any outstanding non-blocking sends and receives; generally poll for all messages to drain system buffers; watch for message cancellation (MPI_Cancel) from other processes; and/or yield or deschedule itself if no useful progress is being made.

4.1 Shared-memory pair-wise communication

For intra-node point-to-point message-passing, the sender writes to shared-memory buffers, depositing pointers to these buffers into shared-memory postboxes. After the sender finishes writing, the receiver can read the postboxes and then the buffers. For small messages, instead of placing pointers into postboxes, the data itself is placed into the postboxes. For large messages, which may be split across several buffers, the reading and writing can be pipelined. For very large messages, to keep the message from overrunning the shared-memory area, the sender is allowed to advance only one postbox ahead of the receiver. Sun MPI uses the eager protocol for small messages, where the sender writes the message without explicitly coordinating with the receiver. For large messages, it employs the rendezvous protocol, where the receiver must explicitly notify the sender that it is ready to receive the message before the message can be sent.

4.2 RSM pair-wise communication

Sun MPI has been implemented on top of RSM for inter-node communication (http://docs-pdf.sun.com/817-0090-10/817-0090-10.pdf/). By default, remote connections are established as needed. Because segment setup and teardown have quite large overheads (Section 6.1), connections remain established for the duration of the application run unless they are explicitly torn down. Messages are sent in one of two fashions: short messages (smaller than 3912 bytes) and long messages. Short messages fit into multiple postboxes, 64 bytes each; buffers, barriers, and signal operations are not used because of their high overheads. Writing less than 64 bytes of data invokes a kernel interrupt on the remote node, which adds to the delay, so a full 64-byte block is deposited into the postbox. Long messages are sent in 1024-byte buffers under the control of multiple postboxes. Postboxes are used in order, and each postbox points to multiple buffers. Barriers are opened for each stripe to make sure the writes complete successfully. Long messages smaller than 256 KB are sent eagerly; otherwise, the rendezvous protocol is used.

The environment variable MPI_POLLALL can be set to '1' or '0'. Under general polling (the default, MPI_POLLALL = 1), Sun MPI polls for all incoming messages even if their corresponding receive calls have not been posted yet. Under directed polling (MPI_POLLALL = 0), it only searches the specified connection. Figure 4 shows the pseudo-code for the MPI_Send and MPI_Recv operations.
(a) MPI_Send pseudo-code:

    if send to itself
        copy the message into the buffer
    else
        if general poll
            exploit the progress engine
        endif
        establish the forward connection (if not done yet)
        if message < short message size (3912 bytes)
            set envelope as data in the postbox
            write data to postboxes
        else
            if message < rendezvous size (256 KB)
                set envelope as eager data
            else
                set envelope as rendezvous request
                wait for rendezvous Ack
                set envelope as rendezvous data
            endif
            reclaim the buffers if a message Ack has been received
            prepare the message in cache-line-size pieces
            open a barrier for each connection
            write data to buffers
            write pointers to the buffers in the postboxes
            close barrier
        endif
    endif

(b) MPI_Recv pseudo-code:

    if receive from itself
        copy data into the user buffer
    else
        if general poll
            exploit the progress engine
        endif
        establish the backward connection (if not done yet)
        wait for incoming data, and check the envelope
        switch (envelope)
            case rendezvous request:
                send rendezvous Ack
            case eager data, rendezvous data, or postbox data:
                copy data from buffers to the user buffer
                write message Ack back to the sender
        endswitch
    endif

Figure 4. Pseudo-code for (a) MPI_Send and (b) MPI_Recv.

4.3 Collective communications

Efficient implementation of collective communication algorithms is one of the keys to the performance of clusters. For intra-node collectives, processes communicate with each other via shared memory. The optimized algorithms use the local exchange method instead of a point-to-point approach (Sistare et al., 1999). For inter-node collective communications, one representative process is chosen on each SMP node. This process is responsible for delivering the message to all other processes on the same node that are involved in the collective operation (Sistare et al., 1999).

5 EXPERIMENTAL FRAMEWORK

We evaluate the performance of the Sun Fire Link interconnect, the Sun MPI implementation, and two application benchmarks on a cluster of Sun Fire 6800s at the High Performance Computing Virtual Laboratory (HPCVL), Queen's University. HPCVL is one of the world-wide Sun sites where Sun Fire Link is being used on Sun Fire cluster systems. HPCVL participated in a beta program with Sun Microsystems to test the Sun Fire Link hardware and software before its official release in November 2002. We experimented with this hardware using the latest Sun Fire Link software integrated in Solaris.

Each Sun Fire 6800 SMP node at HPCVL has 24 900 MHz UltraSPARC III processors, each with an external E-cache, and 24 GB of RAM. The cluster has 11.7 TB of Sun StorEdge T3 disk storage. The software environment includes Sun Solaris 9, Sun HPC ClusterTools 5.0, and Forte Developer 6 (update release). We had exclusive access to the cluster during our experimentation, and we bypassed the Sun Grid Engine in our tests. Our timing measurements were done using the high-resolution timer available in Solaris. In the following, we present our framework.

5.1 Remote Shared Memory API

The RSMAPI is the closest layer to the Sun Fire Link. We measure the performance of some of the RSMAPI calls shown in Table I, with varying parameters, over the Sun Fire Link.

5.2 MPI latency

Latency is defined as the time it takes for a message to travel from the sender process address space to the receiver process address space. In the uni-directional latency test, the sender transmits a message repeatedly to the receiver and then waits for the last message to be acknowledged. The number of messages sent is kept large enough to make the time for the acknowledgement negligible. The bi-directional latency test is the ping-pong test, where the sender sends a message and the receiver, upon receiving the message, immediately replies with the same message. This is repeated a sufficient number of times to eliminate the transient conditions of the network; the average round-trip time divided by two is then reported as the one-way latency. Tests are done using matching pairs of blocking sends and receives under the standard, synchronous, buffered, and ready modes of MPI. To expose the buffer management cost at the MPI level, we modify the standard ping-pong test such that each send operation uses a different message buffer; we call this method "Diff buf". Also, in the standard ping-pong test under load, we measure the average latency when simultaneous messages are in transit between pairs of processes on different nodes.
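For reference, the following is a minimal version of the bi-directional (ping-pong) test described above, written against the standard MPI C API. It is a simplified sketch: the message size, iteration counts, and warm-up handling are illustrative rather than the exact harness used in the paper.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, iters = 10000, warmup = 1000, size = 1;   /* default: 1-byte ping-pong */
        char *buf;
        double t0 = 0.0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (argc > 1) size = atoi(argv[1]);
        buf = malloc(size);

        for (int i = 0; i < warmup + iters; i++) {
            if (i == warmup) {                 /* start timing after warm-up iterations */
                MPI_Barrier(MPI_COMM_WORLD);
                t0 = MPI_Wtime();
            }
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)   /* one-way latency = average round-trip time / 2 */
            printf("%d bytes: %.2f us one-way\n", size, (t1 - t0) / iters / 2.0 * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Under Sun MPI, the same test can be run with directed instead of general polling by setting the MPI_POLLALL environment variable to 0 before launching, as described in Section 4.2.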
5.3 MPI bandwidth

In the bandwidth test, the sender constantly pumps messages into the network, and the receiver sends back an acknowledgment upon receiving all of them. Bandwidth is reported as the total number of bytes per unit time delivered during the measured interval. We also measure the aggregate bandwidth when simultaneous messages are in transit between pairs of processes on different nodes.

5.4 LogP parameters

The LogP model has been proposed to gain insight into the different components of a communication step (Culler et al., 1993). LogP models sequences of point-to-point communications of short messages. L is the network hardware latency for a one-word message transfer. O is the combined overhead of processing the message at the sender (os) and the receiver (or). P is the number of processors. The gap, g, is the minimum time interval between two consecutive message transmissions from a processor. LogGP (Alexandrov et al., 1995) extends LogP to cover long messages: the Gap per byte for long messages, G, is defined as the time per byte of a long message, so that sending an m-byte message costs roughly os + (m-1)G + L + or. An efficient method for measuring the LogP parameters has been proposed in (Kielmann et al., 2000). The method, called parameterized LogP, subsumes both the LogP and LogGP models. Its most significant advantage over the method introduced in (Iannello et al., 1998) is that it only requires saturation of the network to measure g(0), the gap between sending messages of size zero. For a message of size m, the latency, L, and the gaps for larger messages, g(m), can be calculated directly from g(0) and the round-trip times, RTT(m) (Kielmann et al., 2000).

5.5 Traffic patterns

In these experiments, our intention is to analyze the network performance under several traffic patterns, where each sender selects a random or a fixed destination. Message sizes and inter-arrival times are generated randomly using uniform and exponential distributions. These patterns may generate both intra-node and inter-node traffic in the cluster.

1) Uniform traffic: uniform traffic is one of the most frequently used patterns for evaluating network performance. Each sender selects its destination randomly with a uniform distribution.

2) Permutation traffic: these communication patterns are representative of the behavior of parallel numerical algorithms found in scientific applications. Each sender communicates with a fixed destination. We experiment with the following permutation patterns (a small routine computing these destination mappings is sketched after the list):

- Baseline: the ith baseline permutation is defined by Bi(a_{n-1}, ..., a_{i+1}, a_i, a_{i-1}, ..., a_1, a_0) = a_{n-1}, ..., a_{i+1}, a_0, a_i, a_{i-1}, ..., a_1 (0 <= i <= n-1).
- Bit-reversal: the process with binary coordinates a_{n-1}, a_{n-2}, ..., a_1, a_0 always communicates with the process a_0, a_1, ..., a_{n-2}, a_{n-1}.
- Butterfly: the ith butterfly permutation is defined by Bi(a_{n-1}, ..., a_{i+1}, a_i, a_{i-1}, ..., a_0) = a_{n-1}, ..., a_{i+1}, a_0, a_{i-1}, ..., a_1, a_i (0 <= i <= n-1), i.e., bits i and 0 are swapped.
- Complement: the process with binary coordinates a_{n-1}, a_{n-2}, ..., a_1, a_0 always communicates with the process whose coordinates are the bitwise complement, ~a_{n-1}, ~a_{n-2}, ..., ~a_1, ~a_0.
- Cube: the ith cube permutation is defined by Bi(a_{n-1}, ..., a_{i+1}, a_i, a_{i-1}, ..., a_0) = a_{n-1}, ..., a_{i+1}, ~a_i, a_{i-1}, ..., a_0 (0 <= i <= n-1), i.e., bit i is complemented.
- Matrix transpose: the process with binary coordinates a_{n-1}, a_{n-2}, ..., a_1, a_0 always communicates with the process a_{n/2-1}, ..., a_0, a_{n-1}, ..., a_{n/2}.
- Neighbor: processes are divided into pairs of adjacent processes; process 0 communicates with process 1, process 2 with process 3, and so on.
- Perfect shuffle: the process with binary coordinates a_{n-1}, a_{n-2}, ..., a_1, a_0 always communicates with the process a_{n-2}, a_{n-3}, ..., a_0, a_{n-1}.
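The following self-contained C helper (the function names are ours, for illustration) computes the fixed partner of each rank under these patterns, assuming P = 2^n processes with bit 0 being a_0. A routine like this can drive an MPI traffic generator in which each rank simply sends to dest(rank).

    #include <stdio.h>

    /* Destination computations for the permutation patterns of Section 5.5.
     * Ranks are 0 .. P-1 with P = 2^n. */

    static int bit_reversal(int r, int n) {          /* a_0..a_{n-1} reversed */
        int d = 0;
        for (int i = 0; i < n; i++)
            d |= ((r >> i) & 1) << (n - 1 - i);
        return d;
    }

    static int complement_perm(int r, int n) { return r ^ ((1 << n) - 1); }

    static int cube(int r, int i) { return r ^ (1 << i); }   /* flip bit i */

    static int butterfly(int r, int i) {             /* swap bit i with bit 0 */
        int b0 = r & 1, bi = (r >> i) & 1;
        return (r & ~((1 << i) | 1)) | (b0 << i) | bi;
    }

    static int baseline(int r, int i) {              /* rotate bits i..0, a_0 to position i */
        int mask = (1 << (i + 1)) - 1;
        int low = r & mask;
        int rotated = ((low & 1) << i) | (low >> 1);
        return (r & ~mask) | rotated;
    }

    static int matrix_transpose(int r, int n) {      /* swap lower n/2 bits with the rest */
        int h = n / 2, mask = (1 << h) - 1;
        return ((r & mask) << (n - h)) | (r >> h);
    }

    static int neighbor(int r) { return r ^ 1; }     /* 0<->1, 2<->3, ... */

    static int perfect_shuffle(int r, int n) {       /* left rotate: a_{n-1} wraps to a_0 */
        return ((r << 1) | (r >> (n - 1))) & ((1 << n) - 1);
    }

    int main(void) {                                 /* print a few partners for 32 ranks */
        int n = 5;
        for (int r = 0; r < (1 << n); r++)
            printf("%2d -> bitrev %2d  compl %2d  cube0 %2d  shuffle %2d\n",
                   r, bit_reversal(r, n), complement_perm(r, n),
                   cube(r, 0), perfect_shuffle(r, n));
        return 0;
    }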
5.6 MPI collective communications

We experimented with broadcast, scatter, gather, and all-to-all as representatives of the most commonly used collective communication operations in parallel applications. Our experiments are done with processes located on the same node and/or on different nodes. In the inter-node cases, we evenly divide the processes among the four Sun Fire 6800 nodes.
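As a concrete example of how such completion times can be measured, the loop below times MPI_Alltoall; the other collectives are measured the same way by swapping in the corresponding call. The iteration count and per-process message size are illustrative, and this is a generic measurement sketch rather than the exact harness used for the results in Section 6.6.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Time one collective (here MPI_Alltoall) for a given per-destination message size. */
    int main(int argc, char **argv)
    {
        int rank, nprocs, iters = 1000, count = 1024;    /* 1024 bytes per destination */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        char *sendbuf = malloc((size_t)count * nprocs);
        char *recvbuf = malloc((size_t)count * nprocs);

        MPI_Barrier(MPI_COMM_WORLD);                     /* start all processes together */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Alltoall(sendbuf, count, MPI_BYTE,
                         recvbuf, count, MPI_BYTE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        double local = (t1 - t0) / iters, worst;
        /* report the completion time of the slowest process */
        MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("alltoall, %d procs, %d bytes: %.2f us\n", nprocs, count, worst * 1e6);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }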
5.7 Applications

It is important to understand whether the performance delivered at the user level and the MPI level can be effectively utilized at the application level as well. We were able to experiment with two applications from the ASCI Purple suite (Vetter and Mueller, 2003), namely SMG2000 and Sphot, to evaluate the cluster performance under the MPI, OpenMP, and MPI-OpenMP programming paradigms.

1) Sphot: Sphot is a 2D photon transport code. Monte Carlo transport solves the Boltzmann transport equation by directly mimicking the behavior of photons as they are born in hot matter, moved through and scattered in different materials, and absorbed in or escape from the problem domain.

2) SMG2000: SMG2000 is a parallel semi-coarsening multigrid solver for the linear systems arising from finite difference, finite volume, or finite element discretizations of the diffusion equation -∇·(D∇u) + σu = f on logically rectangular grids. It solves both 2-D and 3-D problems.

6 EXPERIMENTAL RESULTS

6.1 Remote Shared Memory API

Table II shows the execution times of different RSMAPI primitives. Some API calls are affected by the memory segment size (shown here for a 16 KB memory segment), while others are not affected at all (Afsahi and Qian, 2003). The current implementation of RSM imposes a minimum memory segment size. Note that the API primitives marked with an asterisk are normally used only once per connection.

TABLE II. EXECUTION TIMES OF DIFFERENT RSMAPI CALLS

Export side                          Time (us)
    get_interconnect_topology() *       12.65
    get_controller() *                 841.00
    free_interconnect_topology() *       0.61
    export_create() 16 KB *            103.61
    export_publish() 16 KB *           119.36
    export_unpublish() 16 KB *          73.48
    export_destroy() 16 KB *            16.73
    release_controller() *               3.63

Import side                          Time (us)
    import_connect() *                 173.45
    import_map() *                      13.56
    import_init_barrier()                0.33
    import_set_mode()                    0.38
    import_open_barrier()                9.93
    import_order_barrier()              16.80
    import_put() 16 KB                  27.73
    import_get() 16 KB                 373.01
    import_close_barrier()               7.13
    import_destroy_barrier()             0.14
    signal_post()                       23.78
    import_unmap() *                    21.40
    import_disconnect() *              486.31

(* normally used only once per connection)

Figure 5 shows the percentage execution times for the "export" and "import" sides with a typical 16 KB memory segment and data size. It is clear that the connect and disconnect calls together take more than 80% of the execution time at the "import" side; however, these calls normally happen only once per connection. The times for the open barrier, close barrier, and signal primitives are not small compared to the time to put small messages. This is why Sun MPI does not use barriers for small message sizes and transfers the data through postboxes instead.

Figure 5. Percentage execution times for the export and import sides (16 KB segment and data size).

Figure 6 shows the times for several RSMAPI functions at the "export" side that are affected by the memory segment size; the export_destroy primitive is the least affected. The results imply that applications are better off creating one large memory segment for multiple connections instead of creating multiple small memory segments.

Figure 6. Execution times for several RSMAPI calls as a function of segment size.

Figure 7 compares the performance of the put and get operations. It is clear that put has much better performance than get for messages larger than 64 bytes; this is why Sun MPI (http://docs-pdf.sun.com/817-0090-10/817-0090-10.pdf/) uses push protocols over Sun Fire Link. The poor performance of put for messages smaller than 64 bytes (a cache line) is due to the kernel interrupt invoked on the remote node, which adds to the delay. From the sudden changes at the 256-byte and 16 KB message sizes, it is clear that RSM uses three different protocols for the put operation.

Figure 7. RSM put and get performance: (a) latency; (b) bandwidth.
6.2 MPI latency

Figure 8(a) shows the latency for intra-node communication in the range [1 ... 16 KB] for the uni-directional ping and for the Standard, Ready, Buffered, Synchronous, and Diff buf bi-directional modes. The uni-directional latency remains flat for messages up to 64 bytes, and the bi-directional latency is almost constant over the same range. The Buffered mode has a higher latency for larger messages. Figure 8(b) shows the latency for inter-node communication in the range [1 ... 16 KB]. The latency stays flat for messages up to 64 bytes for the uni-directional ping and for the Standard, Ready, Synchronous, and Diff buf modes, with the Buffered mode somewhat higher. Figure 8(b) also confirms that Sun MPI uses the short-message method for messages up to 3912 bytes. Our measurements were done under directed polling; a shorter latency (3.7 microseconds) has been reported in (Sistare and Jackson, 2002) for a zero-byte message with general polling. In summary, the Sun Fire Link short-message latency is comparable to those of Myrinet, Quadrics, and InfiniBand (Zamani et al., 2004; Petrini et al., 2003; Liu et al., 2005).

Figure 8. Message latencies: (a) intra-node; (b) inter-node.

We have also measured the ping-pong latency when simultaneous messages are in transit between pairs of processes, shown in Figure 9. For each curve, the message size is held constant while the number of pairs is increased. The latency in each case does not change much as the number of pairs increases.

Figure 9. Inter-node latency under load.

Figure 10 compares the standard MPI latency with the RSM put latency. Note that we have assumed the same execution time for put for 1- to 64-byte messages.

Figure 10. RSM put and MPI latency comparison.

6.3 MPI bandwidth

Figures 11(a) and 11(b) present the bandwidths for intra-node and inter-node communication, respectively. For intra-node communication (except for the buffered mode), the maximum bandwidth is about 655 MB/s. The uni-directional bandwidth is 695 MB/s for inter-node communication. The bi-directional ping achieves a bandwidth of approximately 660 MB/s, except for the buffered mode, which has the lowest bandwidth, 346 MB/s, due to the overhead of buffer management; the Diff buf mode does better, at 582 MB/s. The transition between the short and long message protocols is visible in Figure 11(b) at the 3912-byte message size.

Figure 11. Bandwidth: (a) intra-node; (b) inter-node.

Figure 12 shows the aggregate bi-directional inter-node bandwidth with a varying number of communicating pairs; the aggregate bandwidth is the sum of the individual bandwidths. The network is capable of providing higher bandwidth with an increasing number of communicating pairs. However, for message sizes of 256 KB and larger, the aggregate bandwidth is higher for 16 pairs than for 32 pairs.

Figure 12. Aggregate inter-node bandwidth with different numbers of communicating pairs.

6.4 LogP parameters

The LogP model provides greater detail about the different components of a communication step. The parameters os, or, and g of the parameterized LogP model are shown in Figure 13 for different message sizes. Interestingly, all three parameters, os (3 us), or (2 us), and g (2.29 us), remain fixed for zero- to 64-byte messages (the size of a postbox). However, they increase for larger message sizes (except for a decrease at 3912 bytes due to the protocol switch). It seems that the network interface is not very powerful, as the CPU has to do more work for larger message sizes, at both the sending and the receiving sides. The parameters of the basic LogP model can be calculated as in (Kielmann et al., 2000); they are as follows: L is 0.51 us, o is 2.50 us, and g is 2.29 us.

Figure 13. LogP parameters g(m), os(m), and or(m).
6.5 Traffic patterns

We have considered uniform and exponential distributions for both the message size (denoted by 'S') and the inter-arrival time (denoted by 'T'). Figure 14 shows the accepted bandwidth against the offered bandwidth under the uniform traffic pattern; the performance is not very sensitive to these distributions. The inter-node accepted bandwidth reaches roughly 2000 MB/s with 64 processes, 1500 MB/s with 32 processes, and 900 MB/s with 16 processes. The intra-node accepted bandwidth is much smaller: only around 250 MB/s for 16 processes, 500 MB/s for 32 processes, and 550 MB/s for 64 processes. It is clear that the network performance scales with the number of processes.

Figure 14. Uniform traffic accepted bandwidth (inter-node and intra-node, with exponential and uniform distributions of message size 'S' and inter-arrival time 'T').

Figure 15 shows the accepted bandwidth for the permutation patterns with 32 processes. Note that the Butterfly, Cube, and Baseline patterns have single-stage and multi-stage versions; the single-stage version is the highest-stage permutation, while the multi-stage version is the full-stage permutation. Among the permutation patterns, only the Complement, multi-stage Cube, and single-stage Cube patterns generate purely inter-node traffic, and only the Neighbor permutation generates purely intra-node traffic. The accepted bandwidth for Bit-reversal and single-stage Baseline (also Inverse Perfect Shuffle) is higher than for Perfect Shuffle, Matrix transpose, and multi-stage Butterfly. For the Complement permutation, the network delivered around 3300 MB/s, which is similar to the aggregate bandwidth for 64 processes with a 10 KB message size.

Figure 15. Permutation patterns accepted bandwidth.

6.6 MPI collective communications

We have measured the performance of the broadcast, scatter, gather, and all-to-all operations in terms of their completion time. Figure 16(a) shows the completion time for intra-node collectives with 16 processes, while Figures 16(b) and 16(c) illustrate the inter-node collective communication times for 16 and 64 processes, respectively. The intra-node performance is better than the inter-node performance in most cases. For the inter-node collectives, a difference in performance is visible between adjacent message sizes where the protocol switches. An overall look at the running times shows that the all-to-all operation takes the longest, followed by gather, scatter, and broadcast. We do not know the reasons behind the spikes in the figures; we ran our tests 1000 times and averaged the results, and the spikes were present in all cases.

Figure 16. Collective communication completion time: (a) 16 processes (intra-node); (b) 16 processes (inter-node); (c) 64 processes (inter-node).

6.7 Applications

1) Sphot: Sphot is a coarse-grained mixed-mode program. The researchers in (Vetter and Mueller, 2003) have shown that the average number of messages per process is 4 and the average message volume is 360 bytes, for 32 to 96 processes. Therefore, this application is not communication bound. As shown in Figure 17, the MPI performance is equal to or slightly better than the OpenMP performance. The application scales, but the scalability is not linear. Note that the MPI processes are evenly distributed among the four nodes.

Figure 17. Sphot scalability under MPI and OpenMP (16 to 64 processes/threads).

We now compare the performance of Sphot under MPI with that under MPI-OpenMP. We define the number of parallel entities (PE) as:

    #parallel entities = #processes x #threads per process

We ran Sphot with different numbers of parallel entities, and for each case we ran it with different combinations of processes and threads. Figure 18 presents the execution times for one to 64 parallel entities, each with different combinations of processes and threads. The results indicate that this application has almost the same performance under the MPI and MPI-OpenMP programming paradigms.

Figure 18. Sphot execution time under different combinations of the number of processes and threads (2 to 64 PEs).

2) SMG2000: The SMG2000 problem size is roughly equal to the input problem size (64x200x200) multiplied by the number of processes. For a fixed problem size, we reduce the input problem size accordingly as the number of processes increases. SMG2000 is a mixed-mode program and is highly communication intensive: the average number of messages per process is between 15306 and 16722, the average message volume is between 2.2 MB and 2.9 MB, and the average number of message destinations is between 23 and 64, all for 32 to 96 processes. This application is a tough test for the cluster. Figure 19 shows that the MPI performance is equal to or slightly better than the OpenMP performance. The scalability is not good, and it even drops after 32 processes.

Figure 19. SMG2000 scalability under MPI and OpenMP.

We then ran SMG2000 with one to 64 parallel entities. Figure 20 shows the execution times for the different combinations in each parallel entity. Pure MPI has better performance than MPI-OpenMP for up to 32 PEs; however, with 64 PEs, the mixed mode (8p8t, 16p4t, and 32p2t) has slightly better performance.

Figure 20. SMG2000 execution time under different combinations of the number of processes and threads.
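For readers unfamiliar with the mixed-mode setup used above, the skeleton below shows the general shape of an MPI-OpenMP program in C: each MPI process spawns an OpenMP team, so a run with 8 processes and 8 threads per process gives 64 parallel entities. This is a generic illustration, not code from Sphot or SMG2000, and it assumes an MPI library with MPI-2 style MPI_Init_thread support; with an MPI-1 library one would call MPI_Init and rely on the implementation's documented thread safety.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nprocs;

        /* FUNNELED is enough when only the master thread of each process makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = 0.0, global = 0.0;

        /* Intra-node parallelism: OpenMP threads share the process's memory. */
        #pragma omp parallel reduction(+:local)
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            /* each of the nprocs * nthreads parallel entities does a slice of the work */
            local += 1.0 / (rank * nthreads + tid + 1);
        }

        /* Inter-node parallelism: processes combine their results with MPI. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("%d processes x %d threads, result = %f\n",
                   nprocs, omp_get_max_threads(), global);

        MPI_Finalize();
        return 0;
    }

The number of threads per process would typically be chosen through OMP_NUM_THREADS so that #processes x #threads matches the desired PE count (for example, 8p8t, 16p4t, or 32p2t for 64 PEs).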
7 RELATED WORK

Research groups in academia and industry have been continuously studying the performance of clusters and their interconnection networks. Sun Microsystems has recently introduced the Sun Fire Link interconnect and the Sun MPI implementation (Sistare and Jackson, 2002). The performance of the Quadrics interconnection networks has been studied in (Petrini et al., 2003; Liu et al., 2003; Brightwell et al., 2004): Petrini et al. (2003) have shown the performance of QsNet under different communication patterns, and Brightwell et al. (2004) have shown the performance of QsNet II. Numerous research studies have been done on Myrinet, including (Zamani et al., 2004; Liu et al., 2003). Zamani and his colleagues (2004) presented the performance of Myrinet two-port E-card networks at the user level, the MPI level, and the application level. In (Liu et al., 2003), the authors compared the performance of their MPI implementation on top of InfiniBand with the MPI implementations over Quadrics QsNet and the Myrinet D-card. Recently, the performance of PCI-Express InfiniBand (Liu et al., 2005) and 10-Gigabit Ethernet (Feng et al., 2005) has been reported.

There are several different semantics supported by different networks. Sun Fire Link uses Remote Shared Memory (http://docs-pdf.sun.com/817-4415/817-4415.pdf/) for user-level inter-node messaging with direct access to memory resident on remote nodes. The Cray T3E uses the shared-memory concept (Barriuso and Knies, 1994), providing a globally accessible, physically distributed memory system for implicit communication. VMMC (Dubnicki et al., 1997) provides protected, direct communication between the sender's and receiver's virtual address spaces: the receiving process exports areas of its address space to allow the sending process to transfer data, the data are transferred from the sender's local virtual memory to a previously imported receive buffer, and there is no explicit receive operation. The reflective-memory model, supported by the DEC Memory Channel (Gillett, 1996), is a sort of hybrid between the explicit send-receive and implicit shared-memory models: it provides a write-only memory "window" in another process's address space, and all data written to that window go directly into the address space of the destination process. ARMCI (Nieplocha et al., 2001) is a software architecture for supporting remote memory operations on clusters.

8 CONCLUSION

Shared-memory multiprocessors have a large market, and clusters of multiprocessors have been regarded as viable platforms to provide supercomputing performance. However, the interconnection network and the supporting communication system software are the deciding factors in their performance. In this paper, we have measured the performance of a Sun Fire 6800 cluster with the recently introduced Sun Fire Link interconnect. Sun Fire Link is a memory-based interconnect, where the Sun MPI library uses the Remote Shared Memory model for its user-level messaging protocol. Our performance results include the execution times of the RSMAPI primitives; intra-node and inter-node latency and bandwidth measurements under different communication modes; the parameters of the LogP model; collective communication and permutation traffic results; and the performance of two mixed-mode applications.

Our RSM results indicate that put has better performance than get on this interconnect, as on other memory-based interconnects. We have also quantified the overhead of the Sun MPI implementation on top of the RSM layer. The Sun MPI implementation incurs an inter-node latency of a few microseconds; thus, the Sun Fire Link interconnect has a short-message latency comparable to other high-performance interconnects. The uni-directional and bi-directional bandwidths are 695 MB/s and 660 MB/s, respectively, and the aggregate bandwidth is 4.5 GB/s with 16 communicating pairs. Sun Fire Link achieves higher bandwidth than Myrinet under GM-2; however, its performance is not as good as QsNet II and InfiniBand. The source overhead is 3 us, the destination overhead is 2 us, and the gap is 2.29 us. The LogP parameters increase for message sizes larger than 64 bytes, which indicates that the host CPU is more involved in the communication, and thus the network interface is less capable of off-loading it. The performance of the intra-node collective communication operations is better than that of the inter-node collectives. Under the single-stage Cube permutation, the cluster achieves a maximum inter-node bandwidth of 3500 MB/s with 32 processes. The performance of the applications under MPI is better than under OpenMP, and almost equal to or slightly better than under the mixed mode (MPI-OpenMP). In general, the Sun Fire Link cluster performs relatively well in most cases.
ACKNOWLEDGEMENT

Special thanks go to E. Loh of Sun Microsystems for providing the latest RSM code for the Sun Fire Link. A. Afsahi would like to thank K. Edgecombe and H. Schmider of the High Performance Computing Virtual Laboratory at Queen's University (http://www.hpcvl.org/), and G. Braida of Sun Microsystems, for their kind help in accessing the Sun Fire cluster with its Sun Fire Link interconnect. This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) and Queen's University. Y. Qian was supported by the Ontario Graduate Scholarship in Science and Technology (OGSST).

REFERENCES

Afsahi, A. and Qian, Y. (2003) 'Remote Shared Memory over Sun Fire Link interconnect', 15th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS), pp.381-386.

Alexandrov, A., Ionescu, M., Schauser, K.E. and Scheiman, C. (1995) 'Incorporating long messages into the LogP model: one step closer towards a realistic model for parallel computation', 7th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA).

Bailey, D.H., Harris, T., Saphir, W., van der Wijngaart, R., Woo, A. and Yarrow, M. (1995) 'The NAS parallel benchmarks 2.0: report NAS-95-020', NASA Ames Research Center.

Barriuso, A. and Knies, A. (1994) 'SHMEM user's guide', Cray Research Inc., SN-2516.

Brightwell, R., Doerfler, D. and Underwood, K.D. (2004) 'A comparison of 4X InfiniBand and Quadrics Elan-4 technologies', IEEE International Conference on Cluster Computing (Cluster).

Cappello, F. and Etiemble, D. (2000) 'MPI versus MPI+OpenMP on IBM SP for the NAS benchmarks', Supercomputing Conference (SC).

Charlesworth, A. (2002) 'The Sun Fireplane interconnect', IEEE Micro, Vol. 22, No. 1, pp.36-45.

Culler, D.E., Karp, R.M., Patterson, D.A., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R. and von Eicken, T. (1993) 'LogP: towards a realistic model of parallel computation', 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.

Dubnicki, C., Bilas, A., Chen, Y., Damianakis, S. and Li, K. (1997) 'VMMC-2: efficient support for reliable, connection-oriented communication', Hot Interconnects V.

Dunning, D., Regnier, G., McAlpine, G., Cameron, D., Shubert, B., Berry, F., Merritt, A., Gronke, E. and Dodd, C. (1998) 'The Virtual Interface Architecture', IEEE Micro, March/April, pp.66-76.

Feng, W., Balaji, P., Baron, C., Bhuyan, L.N. and Panda, D.K. (2005) 'Performance characterization of a 10-Gigabit Ethernet TOE', to appear in 13th Annual Symposium on High-Performance Interconnects (Hot Interconnects).

Gillett, R. (1996) 'MEMORY CHANNEL network for PCI: an optimized cluster interconnect', IEEE Micro, February, pp.12-18.

Henty, D.S. (2000) 'Performance of hybrid message-passing and shared-memory parallelism for discrete element modeling', Supercomputing Conference (SC).

Iannello, G., Lauria, M. and Mercolino, S. (1998) 'LogP performance characterization of fast messages atop Myrinet', 6th EUROMICRO Workshop on Parallel and Distributed Processing (PDP).
Kielmann, T., Bal, H.E. and Verstoep, K. (2000) 'Fast measurement of LogP parameters for message passing platforms', 4th Workshop on Runtime Systems for Parallel Programming (RTSPP).

Liu, J., Chandrasekaran, B., Wu, J., Jiang, W., Kini, S., Yu, W., Buntinas, D., Wyckoff, P. and Panda, D.K. (2003) 'Performance comparison of MPI implementations over InfiniBand, Myrinet, and Quadrics', Supercomputing Conference (SC).

Liu, J., Mamidala, A., Vishnu, A. and Panda, D.K. (2005) 'Evaluating InfiniBand performance with PCI Express', IEEE Micro, January/February, pp.20-29.

Nieplocha, J., Ju, J. and Apra, E. (2001) 'One-sided communication on Myrinet-based SMP clusters using the GM message-passing library', Workshop on Communication Architecture for Clusters (CAC).

Petrini, F., Coll, S., Frachtenberg, E. and Hoisie, A. (2003) 'Performance evaluation of the Quadrics interconnection network', Journal of Cluster Computing, pp.125-142.

Qian, Y., Afsahi, A., Fredrickson, N.R. and Zamani, R. (2004) 'Performance evaluation of Sun Fire Link SMP clusters', 18th International Symposium on High Performance Computing Systems and Applications (HPCS), pp.145-156.

Sistare, S.J. and Jackson, C.J. (2002) 'Ultra-high performance communication with MPI and the Sun Fire Link interconnect', Supercomputing Conference (SC).

Sistare, S.J., Vande Vaart, F. and Loh, E. (1999) 'Optimization of MPI collectives on clusters of large-scale SMPs', Supercomputing Conference (SC).

Vetter, J.S. and Mueller, F. (2003) 'Communication characteristics of large-scale scientific applications for contemporary cluster architectures', Journal of Parallel and Distributed Computing, Vol. 63, pp.853-865.

Vogels, W., Follett, D., Hsieh, J., Lifka, D. and Stern, D. (2000) 'Tree-saturation control in the AC3 velocity cluster', Hot Interconnects.

Zamani, R., Qian, Y. and Afsahi, A. (2004) 'An evaluation of the Myrinet/GM2 two-port networks', 3rd IEEE Workshop on High-Speed Local Networks (HSLN), pp.734-742.

WEBSITES

InfiniBand Architecture Specifications, http://www.infinibandta.org/
Message Passing Interface Forum: MPI, A Message-Passing Interface standard, http://www.mpi-forum.org/docs/docs.html/
OpenMP C/C++ Application Programming Interface, version 2.0, http://www.openmp.org/specs/
Remote Shared Memory (RSM), http://docs-pdf.sun.com/817-4415/817-4415.pdf/
Sun Fire Link System Overview, http://docs.sun.com/db/doc/816-0697-11/
Sun HPC ClusterTools Software Performance Guide, http://docs-pdf.sun.com/817-0090-10/817-0090-10.pdf/
