User-Level Interprocess Communication for Shared Memory Multiprocessors
BRIAN N. BERSHAD, Carnegie Mellon University
THOMAS E. ANDERSON, EDWARD D. LAZOWSKA, and HENRY M. LEVY, University of Washington

Interprocess communication (IPC), in particular cross-address space communication (between address spaces on the same machine), has become central to the design of contemporary operating systems. IPC has traditionally been the responsibility of the kernel, but kernel-based IPC has two inherent problems. First, its performance is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. Second, applications that need fast communication and inexpensive thread management encounter functional and performance problems stemming from the interaction between kernel-based communication facilities and threads managed at the user level. These problems can be solved by moving communication out of the kernel. On a shared memory multiprocessor, kernel invocation can be bypassed and processor reallocation during a cross-address space call can often be avoided, since communicating address spaces can share memory and each address space can provide its own thread management. This paper describes User-Level Remote Procedure Call (URPC), which combines lightweight threads managed at the user level with cross-address space communication through shared memory. The programmer sees a conventional RPC interface; performance is improved by the unconventional structure beneath it, in which communication and thread management are the responsibility of each address space rather than of the kernel.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—modules, packages; D.4.1 [Operating Systems]: Process Management—concurrency, multiprocessing/multiprogramming; D.4.4 [Operating Systems]: Communications Management; D.4.6 [Operating Systems]: Security and Protection—access controls, information flow controls; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance—measurements

General Terms: Design, Measurement, Performance

Additional Key Words and Phrases: Modularity, remote procedure call, small-kernel operating system

This material is based on work supported by the National Science Foundation (grants CCR-8619663, CCR-8700106, CCR-8703049, and CCR-897666), the Washington Technology Center, and Digital Equipment Corporation (the Systems Research Center and the External Research Program). Bershad was supported by an AT&T Ph.D. Scholarship, and Anderson by an IBM Graduate Fellowship.

Authors' addresses: B. N. Bershad, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; T. E. Anderson, E. D. Lazowska, and H. M. Levy, Department of Computer Science and Engineering, FR-35, University of Washington, Seattle, WA 98195.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1991 ACM 0734-2071/91/0500-0175 $01.50

ACM Transactions on Computer Systems, Vol. 9, No. 2, May 1991, Pages 175-198.

1. INTRODUCTION
Efficient interprocess communication is central to the design of contemporary operating systems [16, 23]. An efficient communication facility encourages system decomposition across address space boundaries. Decomposed systems have several advantages over more monolithic ones, including failure isolation (address space boundaries prevent a fault in one module from "leaking" into another), extensibility (new modules can be added to the system without having to modify existing ones), and modularity (interfaces are enforced by mechanism rather than by convention).

Although address spaces can be a useful structuring device, the extent to which they can be used depends on the performance of the communication primitives. If cross-address space communication is slow, the structuring benefits that come from decomposition are difficult to justify to end users, whose primary concern is system performance, and who treat the entire operating system as a "black box" [18] regardless of its internal structure. Consequently, designers are forced to coalesce weakly related subsystems into the same address space, trading away failure isolation, extensibility, and modularity for performance.

Interprocess communication has traditionally been the responsibility of the operating system kernel. However, kernel-based communication has two problems:

—Architectural performance barriers. The performance of kernel-based synchronous communication is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. In our earlier work on Lightweight Remote Procedure Call (LRPC) [10], we show that it is possible to reduce the overhead of a kernel-mediated cross-address space call to nearly the limit possible on a conventional processor architecture: the time to perform a cross-address space call using LRPC is only slightly greater than that required to twice invoke the kernel and have it reallocate a processor from one address space to another. The efficiency of LRPC comes from taking a "common case" approach to communication, thereby avoiding unnecessary synchronization, kernel-level thread management, and data copying for calls between address spaces on the same machine. The majority of LRPC's overhead (70 percent) can be attributed directly to the fact that the kernel mediates every cross-address space call.

—Interaction between kernel-based communication and high-performance user-level threads. The performance of a parallel application running on a multiprocessor can strongly depend on the efficiency of thread management operations. Medium- and fine-grained parallel applications must use a thread management system implemented at the user level to obtain satisfactory performance [6, 36]. Communication and thread management have strong interdependencies, though, and the cost of partitioning them across protection boundaries (kernel-level communication and user-level thread management) is high in terms of performance and system complexity.

On a shared memory multiprocessor, these problems have a solution: eliminate the kernel from the path of cross-address space communication. Because address spaces can share memory directly, shared memory can be used as the data transfer channel. Because a shared memory multiprocessor has more than one processor, processor reallocation can often be avoided by taking advantage of a processor already active in the target address space, without involving the kernel.
User-level management of cross-address space communication results in improved performance over kernel-mediated approaches because:

—Messages are sent between address spaces directly, without invoking the kernel, reducing call overhead.

—Unnecessary processor reallocation between address spaces is eliminated, helping to preserve valuable cache and TLB contexts across calls.

—When processor reallocation does prove to be necessary, its cost can be amortized over several independent calls.

—Parallelism inherent in the sending and receiving components of a message transfer can be exploited, improving call performance.

User-Level Remote Procedure Call (URPC), the communication system described in this paper, isolates the three components of interprocess communication: thread management, data transfer, and processor reallocation. Only processor reallocation requires kernel involvement; thread management and data transfer are done by application-level libraries, rather than by the kernel. Presented to the programmer through a conventional procedure call interface, URPC provides a safe and efficient communication abstraction for calls between address spaces on the same machine.

URPC's approach stands in contrast to conventional "small kernel" systems in which the kernel is responsible for address spaces, thread management, and interprocess communication. Moving communication and thread management to the user level leaves the kernel responsible only for the mechanisms that allocate processors to address spaces. For reasons of performance and flexibility, this is an appropriate division of responsibility for operating systems of shared memory multiprocessors. (URPC can also be appropriate for uniprocessors running multithreaded applications.)

The latency of a simple cross-address space procedure call is 93 μsecs using a URPC implementation running on the Firefly, DEC SRC's multiprocessor workstation [37]. Operating as a pipeline, two processors can complete one call every 53 μsecs. On the same multiprocessor hardware, kernel-based RPC facilities are markedly slower, without providing adequate support for user-level thread management: LRPC, a high performance kernel-based implementation, takes 157 μsecs. The Fork operation for the user-level threads that accompany URPC takes 43 μsecs, while the Fork operation for the kernel-level threads that accompany LRPC takes over a millisecond. To put these figures in perspective, a same-address space procedure call takes 7 μsecs on the Firefly, and a protected kernel invocation (trap) takes 20 μsecs.

We describe the mechanics of URPC in more detail in the next section. In Section 3 we discuss the design rationale behind URPC. We discuss performance in Section 4. In Section 5 we survey related work, and finally, in Section 6 we present our conclusions.

2. A USER-LEVEL REMOTE PROCEDURE CALL FACILITY

In many contemporary uniprocessor and multiprocessor operating systems, applications communicate by sending messages across narrow channels or ports that support only a small number of control operations (create, send, receive, destroy). Messages permit communication between programs on the same machine, separated by address space boundaries,
or between programs on different machines, separated by physical boundaries. Although messages are a powerful communication primitive, they represent a data-structuring device that is foreign to traditional Algol-like languages: communication within an address space is with a typed, synchronous procedure call and shared memory, while communication between address spaces is with untyped, asynchronous messages. Programmers of message-passing systems who must use one of the many popular Algol-like languages as a systems-building platform must think, program, and structure according to two quite different programming paradigms. Consequently, almost every mature message-based operating system also includes support for Remote Procedure Call (RPC) [11], enabling messages to serve as the underlying transport mechanism beneath a procedure call interface.

Nelson defines RPC as the synchronous language-level transfer of control between programs in disjoint address spaces whose primary communication medium is a narrow channel [30]. The definition of RPC is silent about the operation of that narrow channel and how the processor scheduling (reallocation) mechanisms interact with data transfer. URPC exploits this silence in two ways:

—Messages are passed between address spaces through logical channels kept in memory that is pair-wise shared between the client and server. The channels are created and mapped once for every client/server pairing, and are used for all subsequent communication between the two address spaces, so that several interfaces can be multiplexed over the same channel. The integrity of data passed through the shared memory channel is ensured by a combination of the pair-wise mapping (message authentication is implied statically) and the URPC runtime system in each address space (message correctness is verified dynamically).

—Thread management is implemented entirely at the user level and is integrated with the user-level machinery that manages the message channels. A user-level thread sends a message to another address space by directly enqueuing the message on the appropriate outgoing channel link. No kernel calls are necessary to send a call or reply message.
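To make the channel structure concrete, the following sketch shows the essence of a send. It is an illustration only, written in C rather than the Modula-2+ of the actual implementation, and every name in it (urpc_msg, urpc_queue, and so on) is invented for the example:

    #define QUEUE_SIZE 16

    typedef struct { int proc_id; int args[8]; } urpc_msg;

    /* One direction of a channel kept in pair-wise shared memory. */
    typedef struct {
        volatile int lock;             /* test-and-set lock: 0 = free      */
        unsigned     head, tail;       /* consumer / producer indices      */
        urpc_msg     slots[QUEUE_SIZE];
    } urpc_queue;

    static int test_and_set(volatile int *l) {
        return __sync_lock_test_and_set(l, 1);   /* returns old value      */
    }

    /* Send by enqueuing directly into shared memory; no kernel call.
       Returns 0 if the lock or queue is busy; the caller goes on to
       other work and retries later (locks are nonspinning, Sec. 3.2.1). */
    int urpc_send(urpc_queue *q, const urpc_msg *m) {
        if (test_and_set(&q->lock))
            return 0;                  /* lock busy: never spin-wait       */
        if (q->tail - q->head == QUEUE_SIZE) {
            __sync_lock_release(&q->lock);
            return 0;                  /* queue full                       */
        }
        q->slots[q->tail % QUEUE_SIZE] = *m;  /* copy message into channel */
        q->tail++;
        __sync_lock_release(&q->lock);
        return 1;
    }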
Although a cross-address space procedure call is synchronous from the perspective of the programmer, it is asynchronous at and beneath the level of thread management. When a thread in a client address space invokes a procedure in a server, it blocks waiting for the reply message signifying the procedure's return; while blocked, its processor can run another ready thread in the same address space. In our earlier system, LRPC, the blocked thread and the ready thread were really the same; the thread just crossed an address space boundary. In contrast, URPC always tries to schedule another thread from the same address space on the client thread's processor. This is a scheduling operation that can be handled entirely by the user-level thread management system. When the reply arrives, the blocked client thread can be rescheduled on any of the processors allocated to the client's address space, again without kernel intervention. Similarly, execution of the call on the server side can be done by a processor already executing in the context of the server's address space, and need not occur synchronously with the call.

By preferentially scheduling threads within the same address space, URPC takes advantage of the fact that there is significantly less overhead involved in switching a processor to another thread in the same address space (we will call this context switching) than in reallocating it to a thread in another address space (we will call this processor reallocation). Processor reallocation requires changing the mapping registers that define the virtual address space context. On conventional architectures, these mapping registers are protected and can only be accessed when the processor is executing in privileged kernel mode. Several components contribute to the high overhead of processor reallocation: there are scheduling costs to decide the address space to which a processor should be reallocated; immediate costs to update the virtual memory mapping registers and to transfer the processor between address spaces; and long-term costs due to the diminished performance of the cache and translation lookaside buffer (TLB) that occurs when locality shifts from one address space to another [3]. Although there is a long-term cost associated with context switching within the same address space, that cost is less than when processors are frequently reallocated between address spaces [20].

To demonstrate the relative overhead involved in switching contexts between two threads in the same address space versus reallocating processors between address spaces, we introduce the concept of a minimal latency same-address space context switch. On the C-VAX, the minimal latency same-address space context switch requires saving and then restoring the machine's general-purpose registers, and takes about 15 μsecs. In contrast, reallocating the processor from one address space to another on the C-VAX takes about 55 μsecs, without including the long-term cost of the reallocation. Because of the cost difference between context switching and processor reallocation, URPC strives to avoid processor reallocation, instead context switching whenever possible.

Processor reallocation is sometimes necessary with URPC, though. If a client invokes a procedure in a server that has an insufficient number of processors to handle the call (e.g., the server's processors are busy doing other work), the client's processors may idle for some time awaiting the reply. This is a load-balancing problem. URPC considers an address space with pending incoming messages on its channel to be "underpowered." An idle processor in the client address space can balance the load by reallocating itself to a server that has pending (unprocessed) messages from that client. Processor reallocation requires that the kernel be invoked to change the processor's virtual memory context to that of the underpowered address space. That done, the kernel upcalls into a server routine that handles the client's outstanding requests. After these have been processed, the processor is returned to the client address space via the kernel.

The responsibility for detecting incoming messages and scheduling threads to handle them belongs to special, low-priority threads that are part of a URPC runtime library linked into each address space. Processors scan for incoming messages only when they would otherwise be idle.
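A low-priority scanner of the kind just described might, in outline, behave as follows. This is our sketch, with hypothetical routine names, not the URPC library itself:

    typedef struct { int proc_id; int args[8]; } urpc_msg;
    typedef int address_space;

    /* Primitives assumed from the URPC/FastThreads runtime (hypothetical): */
    int  urpc_poll_incoming(urpc_msg *m);            /* nonblocking receive  */
    void dispatch_handler_thread(const urpc_msg *m); /* start a local thread */
    int  find_underpowered_server(address_space *s); /* unanswered calls?    */
    void processor_donate(address_space s);          /* kernel trap (3.1.2)  */

    /* Run by a low-priority scanner thread when nothing else is ready. */
    void idle_scan(void) {
        for (;;) {
            urpc_msg m;
            if (urpc_poll_incoming(&m)) {       /* prefer local work         */
                dispatch_handler_thread(&m);
                continue;
            }
            address_space s;
            if (find_underpowered_server(&s))   /* balance load: reallocate  */
                processor_donate(s);            /* this processor for a while */
        }
    }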
Figure 1 illustrates a sample execution timeline for URPC, with time proceeding downward. One client, an editor, and two servers, a window manager (WinMgr) and a file cache manager (FCMgr), are shown. Each is in a separate address space. Two threads in the editor, T1 and T2, invoke procedures in the window manager and the file cache manager. Initially, the editor has one processor allocated to it, the window manager has one, and the file cache manager has none. T1 first makes a cross-address space call into the window manager by sending and then attempting to receive a message. The window manager receives the message and begins processing it. In the meantime, the thread scheduler in the editor has context switched from T1 (which is blocked waiting to receive a message) to another thread T2. T2 initiates a procedure call to the file cache manager by sending a message and waiting to receive a reply. T1 can be unblocked because the window manager has sent a reply message back to the editor. T1 then calls into the file cache manager and blocks. The file cache manager now has two pending messages from the editor, but no processors with which to handle them. Threads T1 and T2 are blocked in the editor, which has no other threads waiting to run. The editor's thread management system detects the pending outgoing messages and donates a processor to the file cache manager, which can then receive, process, and reply to the editor's two incoming calls before returning the processor back to the editor. At this point, the incoming reply messages from the file cache manager can be handled. The two threads T1 and T2 each terminate when they receive their replies.

Fig. 1. URPC timeline. (Timeline figure: the editor, WinMgr, and FCMgr address spaces; FCMgr initially has no processors; the editor's processor context switches between T1 and T2, reallocates to FCMgr, and is returned.)

URPC's treatment of pending messages is analogous to the treatment of runnable threads waiting on a ready queue. Each represents an execution context in need of attention from a physical processor; otherwise idle processors poll the queues looking for work in the form of these execution contexts. There are two main differences between the operation of message channels and thread ready queues. First, URPC enables one address space to spawn work in another address space by enqueuing a message (an execution context consisting of some control information and arguments or results). Second, URPC enables processors in one address space to execute in the context of another through an explicit processor reallocation if the workload between them is imbalanced.

Fig. 2. The software components of URPC. (Client and server applications and stubs, the URPC and FastThreads packages beneath them, a shared message channel, and processor reallocation through the kernel.)

2.1 The User View

It is important to emphasize that all of the URPC mechanisms described in this section exist "under the covers" of quite ordinary looking Modula-2+ [34] interfaces. The RPC paradigm provides the freedom to implement the control and data transfer mechanisms in ways that are best matched by the underlying hardware.

URPC consists of two packages. One package, called FastThreads [5], used by stubs and application code, provides lightweight threads that are managed and scheduled at the user level on top of middleweight kernel threads. The second package, called URPC, implements the channel management and message primitives described in this section. The URPC package lies directly beneath the stubs and closely interacts with FastThreads, as shown in Figure 2.
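To illustrate what "under the covers" means, consider what a generated client stub might do for a simple file-read procedure. The sketch is in C rather than Modula-2+, and the interface, message layout, and primitive names are all invented for the example:

    enum { READ_PROC = 1 };

    typedef struct { int proc_id; int args[8]; int results[8]; } urpc_msg;

    /* Assumed URPC/FastThreads primitives (hypothetical):                */
    void urpc_enqueue_call(urpc_msg *m);      /* place call on the channel */
    void thread_block_on_reply(urpc_msg *m);  /* run other ready threads
                                                 until the reply arrives   */
    void copy_out_results(char *dst, const urpc_msg *m);

    /* What a generated stub might do for:
          PROCEDURE Read(f: File; nbytes: INTEGER; VAR buf: Buffer): INTEGER */
    int Read_stub(int file, int nbytes, char *buffer) {
        urpc_msg m;
        m.proc_id = READ_PROC;           /* procedure index in the interface */
        m.args[0] = file;                /* marshal arguments into the       */
        m.args[1] = nbytes;              /* pair-wise mapped message buffer  */
        urpc_enqueue_call(&m);           /* send: no kernel involvement      */
        thread_block_on_reply(&m);       /* synchronous only to this thread  */
        copy_out_results(buffer, &m);    /* unmarshal and check (Sec. 3.2)   */
        return m.results[0];
    }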
3. RATIONALE FOR THE URPC DESIGN

In this section we discuss the design rationale behind URPC. In brief, this rationale is based on the observation that there are several independent components to a cross-address space call, and each can be implemented separately. The main components are the following:

—Thread management: blocking the caller's thread, running a thread through the procedure in the server's address space, and resuming the caller's thread on return;

—Data transfer: moving arguments between the client and server address spaces; and

—Processor reallocation: ensuring that there is a physical processor to handle the client's call in the server and the server's reply in the client.

In conventional message systems, these three components are combined beneath a kernel interface, leaving the kernel responsible for each. However, thread management and data transfer do not require kernel assistance; only processor reallocation does. In the subsections that follow, we describe how the three components of a cross-address space call can be isolated from one another, and the benefits that arise from such a separation.

3.1 Processor Reallocation

URPC attempts to reduce the frequency with which processor reallocation occurs, and hence its effect on the performance of a call, through the use of an optimistic reallocation policy. At call time, the policy optimistically assumes the following:

—The client has other work to do, in the form of runnable threads or incoming messages, so that a potential delay in the processing of a call will not leave the client's processors idle, and

—The server has, or will soon have, a processor with which it can service a message.

In terms of performance, URPC's optimistic assumptions can pay off in several ways. The first assumption makes it possible to do an inexpensive context switch between user-level threads during the blocking phase of a cross-address space call. The second assumption enables a URPC to execute in parallel with threads in the client's address space while avoiding a processor reallocation. An implication of both assumptions is that it is possible to amortize the cost of a single processor reallocation across several outstanding calls between the same address spaces. For example, if two threads in the same client make calls into the same server in succession, then a single processor reallocation can handle both.
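The call-time behavior implied by these assumptions can be sketched as follows; the names are hypothetical, and the actual URPC code differs:

    /* Assumed runtime state queries (hypothetical):                       */
    int  reply_arrived(void);
    int  local_thread_ready(void);
    void context_switch_to_ready_thread(void); /* user level, ~15 usecs on
                                                  the C-VAX (Section 2)    */
    int  server_lacks_processor(void);
    void processor_donate_to_server(void);     /* kernel-mediated, ~55 usecs */

    /* Client-side blocking phase of a call under the optimistic policy. */
    void blocking_phase(void) {
        while (!reply_arrived()) {
            if (local_thread_ready())
                context_switch_to_ready_thread();  /* assumption 1 pays off */
            else if (server_lacks_processor())
                processor_donate_to_server();      /* optimism failed: force
                                                      reallocation (3.1.1)  */
            /* otherwise just re-poll; the server should reply soon
               (assumption 2) */
        }
    }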
A user-level approach to communication and thread management is appropriate for shared memory multiprocessors where applications rely on aggressive multithreading to exploit parallelism while at the same time compensating for multiprogramming effects and memory latencies [2], where a few key operating system services are the target of the majority of all application calls [8], and where operating system functions are affixed to specific processing nodes for the sake of locality [31].

In contrast to URPC, contemporary communication and thread management structures are not well suited for use on shared memory multiprocessors:

—Kernel-level communication and thread management facilities rely on pessimistic processor reallocation policies, and are unable to exploit concurrency within an application to reduce the overhead of communication. Handoff scheduling [13] underscores this pessimism: a single kernel operation blocks the client and reallocates its processor directly to the server. Although handoff scheduling does improve performance, the improvement is limited by the cost of kernel invocation and processor reallocation.

—In a traditional operating system running on a multiprocessor, kernel resources are logically centralized, as in a kernel designed for a uniprocessor, but are shared by the many processors in the system. For a medium- to large-scale shared memory multiprocessor such as the Butterfly [7], Alewife [4], or DASH [25], URPC's user-level orientation to operating system design localizes system resources to those processors where the resources are in use, relaxing the performance bottleneck that comes from relying on centralized kernel data structures. This bottleneck is due to the contention for logical resources (locks) and physical resources (memory and interconnect bandwidth) that results when a few data structures, such as thread run queues and message channels, are shared by a large number of processors. By distributing the communication and thread management functions directly to the address spaces (and hence processors) that use them, contention is reduced.

URPC can also be appropriate on uniprocessors running multithreaded applications, where inexpensive threads can be used to express the logical and physical concurrency within a problem. Low overhead threads and communication make it possible to overlap even small amounts of external computation. Further, multithreaded applications that are able to benefit from delayed reallocation can do so without having to develop their own communication protocols [19]. Although processor reallocation will eventually be necessary on a uniprocessor, it can be delayed by scheduling within an address space for as long as possible.

3.1.1 The Optimistic Assumptions Won't Always Hold. In cases where the optimistic assumptions do not hold, it is necessary to invoke the kernel to force a processor reallocation from one address space to another. Examples of where it is inappropriate to rely on URPC's optimistic processor reallocation policy are single-threaded applications, real-time applications (where latency must be bounded), high-latency I/O operations (where it is best to initiate the I/O operation early since it will take a long time to complete), and priority invocations (where the thread executing the cross-address space call is of high priority). To handle these situations, URPC allows the client's address space to force a processor reallocation to the server's, even though there might still be runnable threads in the client's address space.

3.1.2 The Kernel Handles Processor Reallocation. The kernel implements the mechanism that supports processor reallocation. When an idle processor discovers an underpowered address space, it invokes a procedure called Processor.Donate, passing in the identity of the address space to which the processor (on which the invoking thread is running) should be reallocated. Processor.Donate transfers control down through the kernel and then up to a prespecified address in the receiving address space. The identity of the donating address space is made known to the receiver by the kernel.
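In outline, and with invented C names standing in for the Modula-2+ originals, the donation path looks like this:

    typedef int address_space;

    /* Kernel mechanism: trap, change the processor's virtual memory
       context, then upcall at a prespecified address in the target.   */
    void Processor_Donate(address_space target);

    void handle_outstanding_requests_from(address_space donor);

    /* Upcall entry registered by each URPC address space; the kernel
       supplies the donor's identity. */
    void donation_upcall(address_space donor) {
        handle_outstanding_requests_from(donor); /* policy: Section 3.1.3 */
        Processor_Donate(donor);                 /* return via the kernel */
    }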
3.1.3 Voluntary Return of Processors Cannot Be Guaranteed. A service interface defines a contract between a client and a server. In the case of URPC, as with traditional RPC systems, implicit in the contract is that the server obey the policies that determine when a processor is to be returned back to the client. The URPC communication library implements the following policy in the server: upon receipt of a processor from a client address space, return the processor when all outstanding messages from that client have generated replies, or when the client has become "underpowered" (there are outstanding messages and one of the server's processors is idle).

Although URPC's runtime libraries implement a specific processor-return protocol, there is no way to enforce that protocol. Just as with kernel-based systems, once the kernel transfers control to an application, there is no guarantee that the application will voluntarily return the processor back to the client by returning from the procedure that the client invoked. The server could, for example, use the processor to handle requests from other clients, even though this was not what the client had intended.

It is necessary to ensure that applications receive a fair share of the available processing power. URPC's direct processor reallocation deals only with the problem of load balancing between applications that are communicating with one another. Independent of communication, a multiprogrammed system requires policies and mechanisms that balance the load between noncommunicating (or noncooperative) address spaces. Applications, for example, must not be able to starve one another out for processors, and servers must not be able to delay clients indefinitely by not returning processors. Preemptive policies, which forcibly reallocate processors from one address space to another, are therefore necessary to ensure that applications make progress. A processor reallocation policy should be work conserving, in that no high-priority thread waits for a processor while a lower priority thread runs, and, by implication, that no processor idles when there is work for it to do anywhere in the system, even if the work is in another address space. The specifics of how to enforce this constraint in a system with user-level threads are beyond the scope of this paper, but are discussed by Anderson et al. in [6].

The direct processor reallocation done in URPC can be thought of as an optimization of a work-conserving policy. A processor idling in the address space of a URPC client can determine which address spaces are not responding to that client's calls, and therefore which address spaces are, from the standpoint of the client, the most eligible to receive a processor. There is no reason for the client to first voluntarily relinquish the processor to a global processor allocator, only to have the allocator then determine to which address space the processor should be reallocated. This is a decision that the client can make itself.
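Pulling Sections 3.1.2 and 3.1.3 together, the life of a donated processor can be sketched as follows; the helper names are hypothetical, and the two exit conditions are the ones stated above:

    typedef int address_space;
    typedef struct { int proc_id; int args[8]; } urpc_msg;

    int  dequeue_call_from(address_space c, urpc_msg *m); /* 0 if none left */
    void process_and_reply(const urpc_msg *m);
    int  some_server_processor_idle(void);
    void Processor_Donate(address_space target);

    /* The return policy the URPC library implements in the server:
       keep the donated processor only while the donor has unanswered
       calls and the server is not already overprovisioned.            */
    void serve_donated_processor(address_space client) {
        urpc_msg m;
        while (dequeue_call_from(client, &m)) {
            process_and_reply(&m);
            if (some_server_processor_idle())  /* client is underpowered */
                break;
        }
        Processor_Donate(client);              /* voluntary return        */
    }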
3.2 Data Transfer Using Shared Memory

The requirements placed on a communication system can be relaxed, and its efficiency improved, when it is layered beneath a procedure call interface rather than exposed to programmers directly. In particular, arguments that are part of a cross-address space procedure call can be passed using shared memory while still guaranteeing safe interaction between mutually suspicious subsystems.

Shared memory message channels do not increase the "abusability factor" of client-server interactions. As with traditional RPC, clients and servers can still overload one another, deny service, provide bogus results, and violate communication protocols (e.g., fail to release channel locks, or corrupt channel data structures). And, as with traditional RPC, it is up to higher level protocols to ensure that lower level abuses filter up to the application in a well-defined manner (e.g., by raising a call-failed exception or by closing down the channel).

In URPC, the safety of communication based on shared memory is the responsibility of the stubs. The arguments of a URPC are passed in message buffers that are allocated and pair-wise mapped during the binding phase that precedes the first cross-address space call between a client and server. On receipt of a message, the stubs unmarshal the message's data into procedure parameters, doing whatever copying and checking is necessary to ensure the application's safety.

Traditionally, safe communication is implemented by having the kernel copy data from one address space to another. Such an implementation is necessary when application programmers deal directly with raw data in the form of messages. But, when standard runtime facilities and stubs are used, as is the case with URPC, having the kernel copy data between address spaces is neither necessary nor sufficient to guarantee safety.

Copying in the kernel is not necessary because programming languages are implemented to pass parameters between procedures on the stack, the heap, or in registers. When data is passed between address spaces, none of these storage areas can, in general, be used directly by both the client and the server. Therefore, the stubs must copy the data into and out of the memory used to move arguments between address spaces. Safety is not increased by first doing an extra kernel-level copy of the data.

Copying in the kernel is not sufficient for ensuring safety when using type-safe languages such as Modula-2+ or Ada [1], since each actual parameter must be checked by the stubs for conformity with the type of its corresponding formal. Without such checking, for example, a client could crash the server by passing in an illegal (e.g., out of range) value for a parameter.

These points motivate the use of pair-wise shared memory for cross-address space communication. Pair-wise shared memory can be used to transfer data between address spaces more efficiently, but just as safely, as when messages are copied by the kernel between address spaces.

3.2.1 Controlling Channel Access. Data flows between URPC packages in different address spaces over a bidirectional shared memory queue with test-and-set locks on either end. To prevent processors from waiting indefinitely on message channels, the locks are nonspinning; i.e., the lock protocol is simply: if the lock is free, acquire it, or else go on to something else; never spin-wait. The rationale here is that the receiver of a message should never have to delay while the sender holds the lock. If the sender stands to benefit from the receipt of a message (as on call), then it is up to the sender to ensure that locks are not held indefinitely; if the receiver stands to benefit (as on return), then the sender could just as easily prevent the benefits from being realized by not sending the message in the first place.
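A minimal rendering of this nonspinning lock protocol, using a GCC-style atomic builtin (our illustration; the Firefly implementation differs):

    /* Try-lock: returns 1 on acquisition, 0 if the lock is busy.
       Callers that fail simply go do other work and retry later.    */
    static int channel_trylock(volatile int *lock) {
        return __sync_lock_test_and_set(lock, 1) == 0;
    }

    static void channel_unlock(volatile int *lock) {
        __sync_lock_release(lock);
    }

    /* Usage pattern: a receiver never waits on a sender that holds
       the lock:
           if (channel_trylock(&q->lock)) { dequeue; channel_unlock(...); }
           else { return to the scheduler and poll another channel }  */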
3.3 Cross-Address Space Procedure Call and Thread Management

The calling semantics of a cross-address space procedure call, like those of a normal procedure call, are synchronous with respect to the thread performing the call. Consequently, there is a strong interaction between the software system that manages threads and the one that manages cross-address space communication. Each underlying communication function (send and receive) involves a corresponding thread management function (start and stop). On call, the client thread sends a message to the server, which causes the server thread to be started; the client thread stops, waiting on a receive. When the server thread finishes, it stops, sends a response back to the client, which causes the client's waiting thread to be started again; the server's thread stops again on a receive, waiting for the next message.

In this subsection we explore the relationship between cross-address space communication and thread management. In brief, we argue the following:

—High performance communication and thread management facilities are necessary for fine-grained parallel programs.

—While thread management facilities can be provided either in the kernel or at the user level, high-performance thread management facilities can only be provided at the user level.

—The close interaction between communication and thread management can be exploited to achieve extremely good performance for both, when both are implemented together at the user level.

3.3.1 Concurrent Programming and Thread Management. Multiple threads within an address space simplify the management of concurrent activities. Concurrency needs can be internal to an application, as with a parallel program, or external, as with an application that needs to overlap some amount of its own computation with activity on its behalf in another address space. In either case, the programmer needs access to a thread management system that can create, schedule, and destroy threads.
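For concreteness, a thread management system of the kind discussed in this section exports roughly the following operations; this is a hypothetical C rendering, not the actual FastThreads interface:

    typedef struct thread thread;

    /* Core operations of a lightweight (user-level) thread package;
       these correspond to the operations measured in Section 4.1.    */
    thread *thread_fork(void (*func)(void *), void *arg); /* create + start */
    void    thread_join(thread *t);      /* block until t completes         */
    void    thread_yield(void);          /* full trip through the scheduler */
    void    lock_acquire(int *lock);     /* blocking (nonspinning) lock     */
    void    lock_release(int *lock);
    void    condition_signal(int *cond); /* paired signal/wait, as in the   */
    void    condition_wait(int *cond);   /* PINGPONG test of Section 4.1    */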
Support for the management of threads can be described in terms of a cost continuum on which there are three major points of reference: heavyweight, middleweight, and lightweight, for which thread management overheads are on the order of thousands, hundreds, and tens of μsecs, respectively.

Kernels supporting heavyweight threads [26, 32, 35] make no distinction between a thread, the dynamic component of a program, and its address space, the static component. Threads and address spaces are created, scheduled, and destroyed as single entities. Because an address space has a large amount of baggage, such as open file descriptors, page tables, and accounting state, that must be manipulated during thread management operations, operations on heavyweight threads are costly, taking tens of milliseconds.

Many contemporary operating system kernels provide support for middleweight, or kernel-level, threads [12, 36]. In contrast to heavyweight threads, address spaces and middleweight threads are decoupled, so that an address space can have many middleweight threads. As with heavyweight threads, though, the kernel is responsible for implementing the thread management functions, and directly schedules an application's threads on the available physical processors. With middleweight threads, a single kernel interface defines a general concurrent programming model for use by all applications.

Unfortunately, the cost of this generality affects the performance of all parallel applications, even those for which the features are unnecessary. For example, time slicing, or the saving and restoring of the floating point registers when switching between threads, can be sources of overhead even for applications that do not use those features. A kernel-level thread management system must also protect itself from malfunctioning or malfeasant programs, so the kernel must treat each invocation suspiciously, checking arguments and access privileges to ensure legitimacy. Expensive (relative to procedure call) protected kernel traps are required to invoke any thread management operation.

In addition to the direct overheads that contribute to the high cost of kernel-level thread management, there is an indirect, but more pervasive, effect of thread management policy on program performance. There are a large number of parallel programming models, and within these, a wide variety of scheduling disciplines that are most appropriate (for performance) to a given model [25, 33]. Performance, though, is strongly influenced by the choice of interfaces, data structures, and algorithms used to implement threads, so a single model represented by one style of kernel interface is unlikely to have an implementation that is efficient for all parallel programs.

In response to the costs of kernel-level threads, programmers have turned to lightweight threads that execute in the context of middleweight or heavyweight threads provided by the kernel, but that are managed at the user level by a library linked in with each application [9, 38]. Lightweight thread management implies two-level scheduling, since the application library schedules lightweight threads on top of weightier threads, which are themselves being scheduled by the kernel.

Two-level schedulers attack both the direct and indirect costs of kernel threads. Directly, it is possible to implement efficient user-level thread management functions because they are accessible through a simple procedure call rather than a trap, need not be bulletproofed against application errors, and can be customized to provide only the level of support needed by a given application. The cost of thread management operations in a user-level scheduler can be orders of magnitude less than for kernel-level threads.

3.3.2 The Problem with Having Functions at Two Levels. Threads block for two reasons: (1) to synchronize their activities within an address space (e.g., while controlling access to critical sections), and (2) to wait for external events in other address spaces (e.g., while reading keystrokes from a window manager). If threads are implemented at the user level (necessary for inexpensive and flexible concurrency), but communication is implemented in the kernel, then the performance benefits of inexpensive thread scheduling and context switching are lost when threads in separate address spaces synchronize by kernel-mediated communication. In effect, each synchronizing operation must be executed at two levels: (1) once in the kernel, so that applications in separate address spaces are properly notified of communication activity, and (2) again in the user-level scheduler, so that the scheduling state of the application accurately reflects the communication state. In contrast, when communication and thread management are implemented and moved together at the user level, the communication system can control thread scheduling directly, and can carry out synchronization with the same primitives that are used by applications.
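This integration shows up directly in the blocking path of a call. A sketch, with hypothetical names, of how the communication layer can use the thread scheduler's own primitives:

    typedef struct thread thread;
    typedef struct { int proc_id; int args[8]; } urpc_msg;

    thread *current_thread(void);
    void    thread_block(void);          /* switch to another ready thread */
    void    thread_unblock(thread *t);   /* make a thread ready again      */
    void    record_waiter(urpc_msg *m, thread *t);

    /* One level, not two: the channel code blocks and wakes threads
       with the same user-level primitives applications use; the kernel
       is never notified of the communication.                          */
    void wait_for_reply(urpc_msg *m) {
        record_waiter(m, current_thread());
        thread_block();
        /* Resumed when an idle processor polls the channel, finds the
           reply, and calls thread_unblock() on the recorded waiter.    */
    }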
3.4 Summary of Rationale

This section has described the rationale behind the design of URPC. The main advantages of a user-level approach to communication are that calls can be made without invoking the kernel, and without unnecessarily reallocating processors between address spaces. Further, when processor reallocation does turn out to be necessary, its cost can be amortized over multiple calls. Finally, communication and thread management can be cleanly integrated with one another, so that fine-grained parallel programs can inexpensively synchronize within, as well as between, address spaces.

4. THE PERFORMANCE OF URPC

This section describes the performance of URPC as measured on the Firefly, an experimental multiprocessor workstation built by DEC's Systems Research Center. A Firefly can have as many as six C-VAX microprocessors; the one used to collect the measurements described in this paper had four main processors, plus one dedicated to I/O, and 32 megabytes of memory.

4.1 User-Level Thread Management Performance

In this section we compare the performance of thread management primitives when implemented at the user level and in the kernel. The kernel-level threads are those provided by Taos, the Firefly's native operating system.

Table I shows the cost of several thread management operations implemented at the user level (FastThreads) and at the kernel level (Taos). PROCEDURE CALL invokes the null procedure and provides a baseline value for the machine's performance. FORK creates, starts, and terminates a thread. FORK;JOIN is like FORK, except that the thread which creates the new thread blocks until the forked thread completes. YIELD forces a thread to make a full trip through the scheduling machinery; when a thread yields the processor, it is blocked (state saved), made ready (enqueued on the ready queue), scheduled (dequeued), unblocked (state restored), and continued. ACQUIRE;RELEASE acquires and then releases a blocking (nonspinning) lock for which there is no contention. PINGPONG has two threads "pingpong" back and forth on a condition variable, blocking and unblocking one another by means of paired signal and wait operations. Each pingpong cycle blocks, schedules, and unblocks two threads in succession.

Table I. Comparative Performance of Thread Management Operations (μsecs)

    Test               FastThreads    Taos threads
    PROCEDURE CALL           7               7
    FORK                    43            1192
    FORK;JOIN              102            1574
    YIELD                   37              57
    ACQUIRE;RELEASE         27              27
    PINGPONG                53             271

Each test was run a large number of times as part of a loop; the elapsed time was measured and then divided by the loop limit to determine the values shown in the table. The FastThreads tests were constrained to run on a single processor. The kernel-level threads had no such constraint; Taos caches recently executed kernel-level threads on idle processors to reduce wakeup latency. Because the tests were run on an otherwise idle machine, the caching optimization worked well and we saw little variance in the measurements.

The table demonstrates the performance advantage of implementing thread management at the user level, where the operations have an overhead on the order of a few procedure calls, rather than a few hundred, as when the threads are provided by the kernel.
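The measurement method is the standard one: run each operation in a loop and divide the elapsed time by the iteration count. A minimal C harness of this kind (illustrative only; the actual tests were written for the Firefly):

    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000L

    /* Time one operation by running it in a loop and dividing the
       elapsed time by the iteration count.                          */
    static double usecs_per_op(void (*op)(void)) {
        clock_t start = clock();
        for (long i = 0; i < ITERS; i++)
            op();
        return (double)(clock() - start) * 1e6 / CLOCKS_PER_SEC / ITERS;
    }

    static void null_op(void) { /* e.g., call thread_yield() here */ }

    int main(void) {
        printf("overhead: %.1f usecs/op\n", usecs_per_op(null_op));
        return 0;
    }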
4.2 URPC Component Breakdown

In the case when an address space has enough processing power to handle all incoming messages without requiring a processor reallocation, the work done during a URPC can be broken down into four components: send (enqueuing a message on an outgoing channel); poll (detecting the channel on which a message has arrived); receive (dequeueing the message); and dispatch (dispatching the appropriate thread to handle an incoming message).

Table II shows the time taken by each component. The total processing times are almost the same for the client and the server. Nearly half of that time goes towards thread management (dispatch), demonstrating the influence that thread management overhead has on communication performance.

Table II. Breakdown of a URPC (μsecs)

    Component    Client    Server
    send           18        13
    poll            6         5
    receive        10        10
    dispatch       20        25
    Total          54        53

The send and receive times reflect the cost of moving data through the shared memory buffers, and are influenced by the amount of data passed between address spaces; on average, each additional word of data adds about one μsec to the cost of a call. Because most cross-address space procedure calls pass only a small amount of data in their parameters [8, 16, 17, 24], most of the latency of a call is incurred by the control transfer and thread management components, and not by the cost of data transfer.

4.3 Call Latency and Throughput

The figures in Table II are load independent. Two other important metrics, call latency and call throughput, are load dependent, though. Both depend on the number of client processors, C, the number of server processors, S, and the number of runnable threads in the client's address space, T, provided that there is no need for processor reallocation. To evaluate latency and throughput, we ran a series of tests in which we timed how long it took the client's T threads to make 100,000 "Null" procedure calls into the server inside a tight loop, using different values of T, C, and S. The Null procedure call takes no arguments, computes nothing, and returns no results; it exposes the performance characteristics of a round-trip cross-address space control transfer.
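The structure of these tests is simple; in outline (our sketch, with Null standing for the generated stub):

    #define TOTAL_CALLS 100000

    void Null(void);   /* the generated URPC stub: no arguments, no results */

    /* Each of the T calling threads runs this body. The elapsed time
       for all TOTAL_CALLS, divided by TOTAL_CALLS, gives per-call
       latency under load; TOTAL_CALLS divided by the elapsed time
       gives throughput.                                               */
    void caller(void *arg) {
        long calls = *(long *)arg;    /* TOTAL_CALLS / T */
        for (long i = 0; i < calls; i++)
            Null();                   /* blocks the thread, not the processor */
    }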
caller thread’s context to remain loaded on the processor if the ready queue is empty and there is no other work to be done anywhere in the system In this case, the idle loop executes in the context of the caller thread, enabling a fast wakeup when the reply message arrives At 2’ = C = S, the optimization is enabled The optimization is also responsible for the small “spike” in the throughput curves shown in Figure ACM Transactions on Computer Systems, Vol 9, No 2, May 1991 192 B N Bershad et al 1500– C=l, s=o C=l, S=2 s=l 14001300120011001000 – 900Call 800- Latency 700- (psecs) /c=l, 600500400 - /