
Proceedings of the Linux Symposium


Proceedings of the Linux Symposium

Volume One

June 27th–30th, 2007
Ottawa, Ontario, Canada

Contents

The Price of Safety: Evaluating IOMMU Performance  9
  Ben-Yehuda, Xenidis, Ostrowski, Rister, Bruemmer, & Van Doorn

Linux on Cell Broadband Engine status update  21
  Arnd Bergmann

Linux Kernel Debugging on Google-sized clusters  29
  M. Bligh, M. Desnoyers, & R. Schultz

Ltrace Internals  41
  Rodrigo Rubira Branco

Evaluating effects of cache memory compression on embedded systems  53
  Anderson Briglia, Allan Bezerra, Leonid Moiseichuk, & Nitin Gupta

ACPI in Linux – Myths vs. Reality  65
  Len Brown

Cool Hand Linux – Handheld Thermal Extensions  75
  Len Brown

Asynchronous System Calls  81
  Zach Brown

Frysk 1, Kernel 0?  87
  Andrew Cagney

Keeping Kernel Performance from Regressions  93
  T. Chen, L. Ananiev, & A. Tikhonov

Breaking the Chains—Using LinuxBIOS to Liberate Embedded x86 Processors  103
  J. Crouse, M. Jones, & R. Minnich

GANESHA, a multi-usage with large cache NFSv4 server  113
  P. Deniel, T. Leibovici, & J.-C. Lafoucrière

Why Virtualization Fragmentation Sucks  125
  Justin M. Forbes

A New Network File System is Born: Comparison of SMB2, CIFS, and NFS  131
  Steven French

Supporting the Allocation of Large Contiguous Regions of Memory  141
  Mel Gorman

Kernel Scalability—Expanding the Horizon Beyond Fine Grain Locks  153
  Corey Gough, Suresh Siddha, & Ken Chen

Kdump: Smarter, Easier, Trustier  167
  Vivek Goyal

Using KVM to run Xen guests without Xen  179
  R.A. Harper, A.N. Aliguori, & M.D. Day

Djprobe—Kernel probing with the smallest overhead  189
  M. Hiramatsu & S. Oshima

Desktop integration of Bluetooth  201
  Marcel Holtmann

How virtualization makes power management different  205
  Yu Ke

Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps  215
  J. Keniston, A. Mavinakayanahalli, P. Panchamukhi, & V. Prasad

kvm: the Linux Virtual Machine Monitor  225
  A. Kivity, Y. Kamay, D. Laor, U. Lublin, & A. Liguori

Linux Telephony  231
  Paul P. Komkoff, A. Anikina, & R. Zhnichkov

Linux Kernel Development  239
  Greg Kroah-Hartman

Implementing Democracy  245
  Christopher James Lahey

Extreme High Performance Computing or Why Microkernels Suck  251
  Christoph Lameter

Performance and Availability Characterization for Linux Servers  263
  Linkov & Koryakovskiy

“Turning the Page” on Hugetlb Interfaces  277
  Adam G. Litke

Resource Management: Beancounters  285
  Pavel Emelianov, Denis Lunev, & Kirill Korotaev

Manageable Virtual Appliances  293
  D. Lutterkort

Everything is a virtual filesystem: libferris  303
  Ben Martin

Conference Organizers

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
C. Craig Ross, Linux Symposium

Review Committee

Andrew J. Hutton, Steamballoon, Inc., Linux Symposium, Thin Lines Mountaineering
Dirk Hohndel, Intel
Martin Bligh, Google
Gerrit Huizenga, IBM
Dave Jones, Red Hat, Inc.
C. Craig Ross, Linux Symposium

Proceedings Formatting Team

John W. Lockhart, Red Hat, Inc.
Gurhan Ozen, Red Hat, Inc.
John Feeney, Red Hat, Inc.
Len DiMaggio, Red Hat, Inc.
John Poelstra, Red Hat, Inc.

Authors retain copyright to all submitted papers, but have granted unlimited redistribution rights to all as a condition of submission.
The Price of Safety: Evaluating IOMMU Performance

Muli Ben-Yehuda, IBM Haifa Research Lab, muli@il.ibm.com
Jimi Xenidis, IBM Research, jimix@watson.ibm.com
Michal Ostrowski, IBM Research, mostrows@watson.ibm.com
Karl Rister, IBM LTC, krister@us.ibm.com
Alexis Bruemmer, IBM LTC, alexisb@us.ibm.com
Leendert Van Doorn, AMD, Leendert.vanDoorn@amd.com

Abstract

IOMMUs, IO Memory Management Units, are hardware devices that translate device DMA addresses to machine addresses. An isolation-capable IOMMU restricts a device so that it can only access parts of memory it has been explicitly granted access to. Isolation-capable IOMMUs perform a valuable system service by preventing rogue devices from performing errant or malicious DMAs, thereby substantially increasing the system’s reliability and availability. Without an IOMMU, a peripheral device could be programmed to overwrite any part of the system’s memory. Operating systems utilize IOMMUs to isolate device drivers; hypervisors utilize IOMMUs to grant secure direct hardware access to virtual machines. With the imminent publication of the PCI-SIG’s IO Virtualization standard, as well as Intel and AMD’s introduction of isolation-capable IOMMUs in all new servers, IOMMUs will become ubiquitous.

Although they provide valuable services, IOMMUs can impose a performance penalty due to the extra memory accesses required to perform DMA operations. The exact performance degradation depends on the IOMMU design, its caching architecture, the way it is programmed, and the workload. This paper presents the performance characteristics of the Calgary and DART IOMMUs in Linux, both on bare metal and in a hypervisor environment. The throughput and CPU utilization of several IO workloads, with and without an IOMMU, are measured and the results are analyzed. The potential strategies for mitigating the IOMMU’s costs are then discussed. In conclusion, a set of optimizations and the resulting performance improvements are presented.

1 Introduction

An I/O Memory Management Unit (IOMMU) creates one or more unique address spaces which can be used to control how a DMA operation, initiated by a device, accesses host memory. This functionality was originally introduced to increase the addressability of a device or bus, particularly when 64-bit host CPUs were being introduced while most devices were designed to operate in a 32-bit world. The uses of IOMMUs were later extended to restrict the host memory pages that a device can actually access, thus providing an increased level of isolation, protecting the system from user-level device drivers and eventually virtual machines. Unfortunately, this additional logic does impose a performance penalty.

The widespread introduction of IOMMUs by Intel [1] and AMD [2] and the proliferation of virtual machines will make IOMMUs a part of nearly every computer system. There is no doubt about the benefits IOMMUs bring, but how much do they cost? We seek to quantify, analyze, and eventually overcome the performance penalties inherent in the introduction of this new technology.

1.1 IOMMU design

A broad description of current and future IOMMU hardware and software designs from various companies can be found in the OLS ’06 paper entitled Utilizing IOMMUs for Virtualization in Linux and Xen [3]. The design of a system with an IOMMU can be broadly broken down into the following areas:

• IOMMU hardware architecture and design.
• Hardware ↔ software interfaces.
• Pure software interfaces (e.g., between userspace and kernelspace, or between kernelspace and hypervisor).

It should be noted that these areas can and do affect each other: the hardware/software interface can dictate some aspects of the pure software interfaces, and the hardware design dictates certain aspects of the hardware/software interfaces.

This paper focuses on two different implementations of the same IOMMU architecture, which revolves around the basic concept of a Translation Control Entry (TCE). TCEs are described in detail in Section 1.1.2.

1.1.1 IOMMU hardware architecture and design

Just as a CPU-MMU requires a TLB with a very high hit-rate in order to not impose an undue burden on the system, so does an IOMMU require a translation cache to avoid excessive memory lookups. These translation caches are commonly referred to as IOTLBs.

The performance of the system is affected by several cache-related factors:

• The cache size and associativity [13].
• The cache replacement policy.
• The cache invalidation mechanism, and the frequency and cost of invalidations.

The optimal cache replacement policy for an IOTLB is probably significantly different than for an MMU-TLB. MMU-TLBs rely on spatial and temporal locality to achieve a very high hit-rate. DMA addresses from devices, however, do not necessarily have temporal or spatial locality. Consider, for example, a NIC which DMAs received packets directly into application buffers: packets for many applications could arrive in any order and at any time, leading to DMAs to wildly disparate buffers. This is in sharp contrast with the way applications access their memory, where both spatial and temporal locality can be observed: memory accesses to nearby areas tend to occur closely together.

Cache invalidation can have an adverse effect on the performance of the system. For example, the Calgary IOMMU (which will be discussed later in detail) does not provide a software mechanism for invalidating a single cache entry—one must flush the entire cache to invalidate an entry. We present a related optimization in Section 4.

It should be mentioned that the PCI-SIG IOV (IO Virtualization) working group is working on an Address Translation Services (ATS) standard. ATS brings in another level of caching, by defining how I/O endpoints (i.e., adapters) inter-operate with the IOMMU to cache translations on the adapter and communicate invalidation requests from the IOMMU to the adapter. This adds another level of complexity to the system, which needs to be overcome in order to find the optimal caching strategy.

1.1.2 Hardware ↔ Software Interface

The main hardware/software interface in the TCE family of IOMMUs is the Translation Control Entry (TCE). TCEs are organized in TCE tables. TCE tables are analogous to page tables in an MMU, and TCEs are similar to page table entries (PTEs). Each TCE identifies a 4KB page of host memory and the access rights that the bus (or device) has to that page. The TCEs are arranged in a contiguous series of host memory pages that comprise the TCE table. The TCE table creates a single unique IO address space (DMA address space) for all the devices that share it.

The translation from a DMA address to a host memory address occurs by computing an index into the TCE table by simply extracting the page number from the DMA address.
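As a concrete illustration of the table walk that this paragraph and the next describe, here is a minimal C sketch of the lookup in software. The entry layout, constants, and field names are invented for the example; the real Calgary and DART TCE formats pack their fields differently.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Invented TCE layout: a real page number plus two access bits. */
    #define IO_PAGE_SHIFT     12              /* 4KB I/O pages */
    #define TCE_READ_ALLOWED  (1ULL << 0)
    #define TCE_WRITE_ALLOWED (1ULL << 1)
    #define TCE_RPN_SHIFT     12

    typedef uint64_t tce_t;

    /* Translate a device-visible DMA address to a host address,
     * refusing the access if the TCE does not permit it. */
    static bool tce_translate(const tce_t *table, size_t entries,
                              uint64_t dma_addr, bool is_write,
                              uint64_t *host_addr)
    {
            /* The index is simply the IO page number of the DMA address. */
            uint64_t index = dma_addr >> IO_PAGE_SHIFT;
            if (index >= entries)
                    return false;

            tce_t tce = table[index];

            /* Validate the access rights before honoring the translation. */
            if (!(tce & (is_write ? TCE_WRITE_ALLOWED : TCE_READ_ALLOWED)))
                    return false;

            /* Host page number from the TCE, page offset from the address. */
            *host_addr = ((tce >> TCE_RPN_SHIFT) << IO_PAGE_SHIFT)
                       | (dma_addr & ((1ULL << IO_PAGE_SHIFT) - 1));
            return true;
    }

In hardware, this walk is performed only on an IOTLB miss, which is why the cache-related factors listed in Section 1.1.1 dominate the IOMMU’s cost.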
The index is used to compute a direct offset into the TCE table that results in a TCE that translates that IO page. The access control bits are then used to validate both the translation and the access rights to the host memory page. Finally, the translation is used by the bus to direct a DMA transaction to a specific location in host memory. This process is illustrated in Figure 1.

The TCE architecture can be customized in several ways, resulting in different implementations that are optimized for a specific machine. This paper examines the performance of two TCE implementations. The first one is the Calgary family of IOMMUs, which can be found in IBM’s high-end System x (x86-64 based) servers, and the second one is the DMA Address Relocation Table (DART) IOMMU, which is often paired with PowerPC [...]

[...] unload the contexts but wait until the end of the time slice. Also, like normal (non-gang) contexts, the gang will not be removed from the SPUs unless there is actually another thread waiting for them to become available, independent of whether or not any of the threads in the gang execute code at the end of the time slice.

4 Using SPEs from the kernel

As mentioned earlier, the SPU base code in the kernel [...]

[...] On the other hand, the fact that the throughput is roughly the same when the IOMMU code doesn’t overload the system strongly suggests that software is the culprit, rather than hardware. This is good, because software is easy to fix! Profile results from these tests strongly suggest that mapping and unmapping an entry in the TCE table is the biggest performance hog, possibly due to lock contention on the [...]

[...] the new overlay segment, overwriting the segment loaded into the overlay region before. This makes it possible to even do function calls in different segments of the same region. There can be any number of segments per region, and the number of regions is only limited by the size of the [...]

6 Profiling SPE tasks

Support for profiling SPE tasks with the oprofile tool has been implemented in the latest IBM Software [...]

[...] between the two. When GDB looks at the state of a thread, it now checks if it is in the process of executing the spu_run system call. If not, it shows the state of the thread on the PPE side using ptrace; otherwise it looks at the SPE registers through spufs. This can work because the SIGSTOP signal is handled similarly in both cases. When gdb sends this signal to a task running on the SPE, it returns from the [...]
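Returning to the IOMMU excerpt above: the mapping and unmapping that its profile identifies as the hot spot is exercised through the Linux DMA mapping API. A minimal sketch of a driver transmit hot path follows; the device, buffer, and function are hypothetical, error paths are trimmed, and the calls are spelled as in current kernels.

    #include <linux/dma-mapping.h>

    /* Hypothetical transmit path: one map and one unmap per buffer.
     * On a TCE-based IOMMU each call updates the TCE table, and on
     * Calgary an unmap may also require flushing the entire IOTLB,
     * since single-entry invalidation is not available. */
    static int example_xmit(struct device *dev, void *buf, size_t len)
    {
            dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

            if (dma_mapping_error(dev, dma))
                    return -ENOMEM;

            /* ... hand 'dma' to the device and wait for the transfer ... */

            dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
            return 0;
    }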
[...] asynchronously copy between the local memory and the virtual address space. The advantage of this approach is that a well-written application practically never needs to wait for a memory access but can do all of these in the background. The disadvantages include the limitation to 256KiB of directly addressable memory, which limits the set of applications that can be ported to the architecture, and the relatively long [...]

[...] Controller. The format of the TCEs is the first level of customization. Calgary is designed to be integrated with a Host Bridge Adapter or South Bridge that can be paired with several architectures—in particular ones with a huge addressable range. The Calgary TCE has the following format: the 36 bits of RPN represent a generous 48 bits (256 TB) of addressability in host memory. On the other hand, the DART, which is integrated with the [...]

[...] implementation of top that knows about SPU utilization. The second system call, spu_run, acts as a switch for a Linux thread to transfer the flow of control from the PPE to the SPE. As seen by the PPE, a thread calling spu_run blocks in that system call for an indefinite amount of time, during which the SPU context is loaded into an SPU and executed there. An equivalent to spu_run on the SPU itself is the stop-and-signal [...]

[...] applications on the SPE as well, which can interact with other applications running on the PPE. This approach makes it possible to take advantage of the wide range of applications available for Linux, while at the same time utilizing the performance gain provided by the SPE design, which could not be achieved by just recompiling regular applications for a new architecture. One key aspect of the SPE design is the way [...]
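A minimal userspace sketch of the spu_create/spu_run flow described above might look as follows. These system calls exist only on PowerPC kernels with spufs; the spufs mount point, the flags, and the omitted program load are assumptions for the example.

    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned int npc = 0;   /* start address in SPE local store */
            unsigned int event = 0; /* extended status, filled by the kernel */

            /* Create an SPE context; it appears as a directory in spufs.
             * "/spu/example" assumes spufs is mounted at /spu. */
            int ctx = syscall(SYS_spu_create, "/spu/example", 0, 0755);
            if (ctx < 0)
                    return 1;

            /* ... write the SPE program into the context's "mem" file ... */

            /* The calling thread blocks here while the SPE executes,
             * returning when the SPE program stops (stop-and-signal). */
            syscall(SYS_spu_run, ctx, &npc, &event);

            close(ctx);
            return 0;
    }

Because the PPE thread simply blocks inside spu_run, standard tools such as GDB and top can treat the SPE computation as part of a normal Linux thread, as the excerpts above describe.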
