SpringerBriefs in Computer Science

Eduardo H. M. Cruz, Matthias Diener, Philippe O. A. Navaux

Thread and Data Mapping for Multicore Systems: Improving Communication and Memory Accesses

Eduardo H. M. Cruz, Federal Institute of Parana (IFPR), Paranavai, Parana, Brazil
Matthias Diener, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Philippe O. A. Navaux, Informatics Institute, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Rio Grande do Sul, Brazil

ISSN 2191-5768, ISSN 2191-5776 (electronic)
ISBN 978-3-319-91073-4, ISBN 978-3-319-91074-1 (eBook)
https://doi.org/10.1007/978-3-319-91074-1
Library of Congress Control Number: 2018943692

© The Author(s), under exclusive licence to Springer International Publishing AG, part of Springer Nature 2018
Preface

This book has its origin in our research. Starting in 2010, we began researching better ways to perform thread mapping to optimize communication in parallel architectures. In 2012, we extended the research to data mapping, as multicore architectures with multiple memory controllers were becoming more popular. It is now the year 2018, and the research is still ongoing. In this book, we explain the theory behind thread and data mapping and how it can be used to reduce memory access latency. We also give an overview of the state of the art, showing how early mechanisms, which depended on expensive procedures such as simulation and source code modifications, evolved into modern mechanisms that are transparent to programmers and have such a low overhead that they are able to run online, during the execution of the applications.

We would like to thank our families and friends, who supported us during this long journey. We also thank our colleagues at the Parallel and Distributed Processing Group (GPPD) of UFRGS, who discussed research ideas with us, analyzed and criticized our work, and supported our research.

Paranavai, Brazil          Eduardo Henrique Molina da Cruz
Urbana, IL, USA            Matthias Diener
Porto Alegre, Brazil       Philippe Olivier Alexandre Navaux

Contents

1 Introduction
  1.1 Improving Memory Locality with Sharing-Aware Mapping
  1.2 Monitoring Memory Accesses for Sharing-Aware Mapping
  1.3 Organization of the Text
2 Sharing-Aware Mapping and Parallel Architectures
  2.1 Understanding Memory Locality in Shared Memory Architectures
    2.1.1 Lower Latency When Sharing Data
    2.1.2 Reduction of the Impact of Cache Coherence Protocols
    2.1.3 Reduction of Cache Misses
    2.1.4 Reduction of Memory Accesses to Remote NUMA Nodes
    2.1.5 Better Usage of Interconnections
  2.2 Example of Shared Memory Architectures Affected by Memory Locality
    2.2.1 Intel Harpertown
    2.2.2 Intel Nehalem/Sandy Bridge
    2.2.3 AMD Abu Dhabi
    2.2.4 Intel Montecito/SGI NUMAlink
  2.3 Locality in the Context of Network Clusters and Grids
3 Sharing-Aware Mapping and Parallel Applications
  3.1 Parallel Applications and Sharing-Aware Thread Mapping
    3.1.1 Considerations About the Sharing Pattern
    3.1.2 Sharing Patterns of Parallel Applications
    3.1.3 Varying the Granularity of the Sharing Pattern
    3.1.4 Varying the Number of Sharers of the Sharing Pattern
  3.2 Parallel Applications and Sharing-Aware Data Mapping
    3.2.1 Parameters that Influence Sharing-Aware Data Mapping
    3.2.2 Analyzing the Data Mapping Potential of Parallel Applications
    3.2.3 Influence of the Page Size on Sharing-Aware Data Mapping
    3.2.4 Influence of Thread Mapping on Sharing-Aware Data Mapping
4 State-of-the-Art Sharing-Aware Mapping Methods
  4.1 Sharing-Aware Static Mapping
    4.1.1 Static Thread Mapping
    4.1.2 Static Data Mapping
    4.1.3 Combined Static Thread and Data Mapping
  4.2 Sharing-Aware Online Mapping
    4.2.1 Online Thread Mapping
    4.2.2 Online Data Mapping
    4.2.3 Combined Online Thread and Data Mapping
  4.3 Discussion on Sharing-Aware Mapping and the State-of-the-Art
  4.4 Improving Performance with Sharing-Aware Mapping
    4.4.1 Mapping Mechanisms
    4.4.2 Methodology of the Experiments
    4.4.3 Results
5 Conclusions
References

Acronyms

DBA    Dynamic Binary Analysis
FSB    Front Side Bus
IBS    Instruction-Based Sampling
ILP    Instruction Level Parallelism
IPC    Instructions per Cycle
MMU    Memory Management Unit
MPI    Message Passing Interface
NUMA   Non-Uniform Memory Access
PMU    Performance Monitoring Unit
QPI    QuickPath Interconnect
TLB    Translation Lookaside Buffer
TLP    Thread Level Parallelism
UMA    Uniform Memory Access

Chapter 1
Introduction

Since the beginning of the information era, the demand for computing power has been unstoppable. Whenever technology advances enough to fulfill the needs of its time, new and more complex problems arise, such that the technology becomes insufficient again. In the past, the increase in performance happened mainly due to instruction level parallelism (ILP), with the introduction of several pipeline stages as well as out-of-order and speculative execution. Increasing the clock frequency was also an important way to improve performance. However, the ILP that compilers and architectures can exploit is reaching its limits (Caparros Cabezas and Stanley-Marbell 2011). The increase of the clock frequency is also reaching its limits, because it raises the energy consumption, which is an important issue for current and future architectures (Tolentino and Cameron 2012).

To keep performance increasing, processor architectures are becoming more dependent on thread level parallelism (TLP), employing several cores to compute in parallel. These parallel architectures put more pressure on the memory subsystem, since more bandwidth is required to move data between the cores and the main memory. To handle the additional bandwidth, current architectures introduce complex memory hierarchies, formed by multiple cache levels, some composed of multiple banks connected to a memory controller. The memory controller interfaces with a Uniform or Non-Uniform Memory Access (UMA or NUMA) system. However, with the upcoming increase of the number of cores, a demand for an even higher memory bandwidth is expected (Coteus et al. 2011).

In this context, the reduction of data movement is an important goal for future architectures, both to keep performance scaling and to decrease energy consumption (Borkar and Chien 2011). Most data movement in current architectures occurs due to memory accesses and communication between the threads; in shared memory environments, the communication itself is performed through accesses to blocks of memory shared between the threads. One of the solutions to reduce data movement consists of improving the memory locality (Torrellas 2009).
4.2 Sharing-Aware Online Mapping

4.2.1 Online Thread Mapping

In Cruz et al. (2012), the sharing pattern between threads is detected by comparing the contents of the TLBs of the cores, which requires hardware support to read the TLB entries in hardware-managed TLBs. The comparison of TLBs can demand a lot of time if there are many cores in the system. To reduce the overhead, the comparison of TLB entries can be done less frequently, which reduces the accuracy. Hardware changes are also necessary in Cruz et al. (2014), in which a sharing matrix is kept by the hardware by monitoring the source and destination of the invalidation messages of the cache coherence protocol. One disadvantage of both mechanisms is that their sharing history is limited to the size of the TLBs and cache memories, respectively. Therefore, their accuracy is reduced when an application uses too much memory.

The usage of the instructions per cycle (IPC) metric to guide thread mapping is evaluated in Autopin (Klug et al. 2008). Autopin itself does not detect the sharing pattern; it only measures the IPC of several mappings fed to it and executes the application with the thread mapping that presented the highest IPC. In Radojković et al. (2013), the authors propose BlackBox, a scheduler that, similarly to Autopin, selects the best mapping by measuring the performance that each mapping obtained. When the number of threads is low, all possible thread mappings are evaluated. When the number of threads makes it unfeasible to evaluate all possibilities, the authors execute the application with 1000 random mappings and select the best one. Mechanisms that rely on statistics from hardware counters take too much time to converge to an optimal mapping, since they first need to evaluate the statistics of each candidate mapping; convergence is usually not even possible, because the number of possible mappings is exponential in the number of threads. Also, these statistics do not accurately represent sharing patterns. As mentioned in Sect. 4.1.1, mechanisms that only perform thread mapping are not capable of decreasing the number of remote memory accesses in NUMA architectures.

4.2.2 Online Data Mapping

Traditional data mapping strategies, such as first-touch and next-touch (Löf and Holmgren 2005), have been used by operating systems to allocate memory on NUMA machines. In the case of first-touch, a memory page is mapped to the NUMA node of the thread that causes its first page fault. This strategy can hurt the performance of other threads that access the page, as pages are not migrated during the execution. When using next-touch, pages are migrated between nodes according to the memory accesses to them. However, if the same page is accessed from different nodes, next-touch can lead to excessive data migrations. Problems also arise in the case of thread migrations between nodes (Ribeiro et al. 2009).
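To make the migration primitive concrete, the following sketch shows how an online mechanism on Linux could move a single page to a chosen NUMA node with the move_pages(2) system call from libnuma. It is a minimal illustration, not the implementation of any mechanism discussed here; the address and the target node are assumed to come from whatever detection scheme the mechanism employs.

/* Minimal sketch: migrate the page containing 'addr' to 'target_node'.
 * Requires the libnuma headers; compile with: gcc migrate.c -lnuma */
#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int migrate_page(void *addr, int target_node)
{
    long page_size = sysconf(_SC_PAGESIZE);
    /* move_pages() expects page-aligned pointers. */
    void *page = (void *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));
    int status = 0;
    if (move_pages(0 /* calling process */, 1, &page, &target_node,
                   &status, MPOL_MF_MOVE) != 0 || status < 0) {
        fprintf(stderr, "page migration failed (status %d)\n", status);
        return -1;
    }
    return 0;   /* 'status' now holds the node the page resides on */
}

Calling move_pages() with a NULL node array turns the call into a pure query that reports the node on which each page currently resides, which a mechanism can use to verify a data mapping before paying for a migration.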
The NUMA Balancing policy (Corbet 2012b) was included in more recent versions of the Linux kernel. In this policy, the kernel introduces page faults during the execution of the application to perform lazy page migrations, reducing the amount of remote memory accesses. However, NUMA Balancing does not detect the sharing pattern between the threads. A previous proposal with similar goals was AutoNUMA (Corbet 2012a).

A page migration mechanism that uses queuing delays and row-buffer hit rates from the memory controllers is proposed in Awasthi et al. (2010). Two page migration mechanisms were developed. The first, called Adaptive First-Touch, consists of gathering statistics from the memory controller to map the data of the application in future executions. However, this mechanism fails if the behavior changes between the different phases of the application. The second mechanism uses the same information, but allows online page migration during the execution of the application. It selects the destination NUMA node considering the difference in access latency between the source and destination NUMA nodes, as well as the row-buffer hit rate and the queuing delays of the memory controllers. The major problem of this work is that information about which data each thread accesses is not considered: the page to be migrated is chosen randomly, which may increase the number of remote accesses.

The work described in Marathe and Mueller (2006) makes use of the PMU of the Itanium-2 to generate samples of the memory addresses accessed by each thread. Their profiling mechanism imposes a high overhead, as it requires traps to the operating system on every high-latency memory load operation or TLB miss. Hence, they enable the profiling mechanism only during the beginning of each application, losing the opportunity to handle behavior changes later in the execution. They also suggest that the programmer should instrument the source code of the application to inform the operating system which memory areas should have their accesses monitored. Similarly, Tikir and Hollingsworth (2008) use UltraSPARC III hardware monitors to guide data mapping. However, their proposal is limited to architectures with software-managed TLBs. As explained in Sect. 4.1.2, mechanisms that do not perform thread mapping are not able to reduce cache misses and remote memory accesses to pages accessed by several threads.

The Carrefour mechanism (Dashti et al. 2013) uses Instruction-Based Sampling (IBS), available on AMD architectures, to detect the memory access behavior, and keeps a history of memory accesses to limit unnecessary migrations. Additionally, it allows the replication of pages that are mostly read on different NUMA nodes. Due to the runtime overhead, Carrefour has to limit the number of samples it collects and the number of pages it characterizes, tracking at most 30,000 pages.

Piccoli et al. (2014) propose a technique named Selective Page Migration (SPM), which automatically includes instrumentation code at compile time and migrates pages at run time. In the compiler, their proposal analyzes parallel loops to identify their memory access behavior. This behavior is then used to migrate pages to the nodes that access them most during the execution of a loop. Although using the compiler for the analysis is interesting, it is only applicable to applications with simple parallelization patterns.

4.2.3 Combined Online Thread and Data Mapping

A library called ForestGOMP is introduced in Broquedis et al. (2010). This library integrates into the OpenMP runtime environment and gathers information about the different parallel sections of the applications. The threads of parallel applications are scheduled such that threads created by the same parallel region, which usually operate on the same data set, execute on cores nearby in the memory hierarchy, decreasing the number of cache misses. To map data, the library depends on hints provided by the programmer. The authors also claim to use statistics from hardware counters, although they do not explain this in the paper. ForestGOMP only works for OpenMP-based applications. Also, the need for hints to perform the mapping may be a burden to the programmer.

The kernel Memory Affinity Framework (kMAF) (Diener et al. 2014) uses page faults to detect the memory access behavior. Whenever a page fault happens, kMAF records the ID of the thread that generated the fault, as well as the faulting memory address. To increase its accuracy, kMAF introduces additional page faults. These additional page faults impose an overhead on the system, since each fault causes an interrupt to the operating system. Like Carrefour, kMAF limits the number of memory access samples to control this overhead, harming the accuracy of the detected memory access behavior.
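The effect of such induced page faults can be illustrated in user space with mprotect(), although kMAF itself performs this inside the kernel. The sketch below is only an analogue of the idea, with the actual recording of the (thread, page) pair left as a comment.

/* User-space analogue of fault-based access detection (a sketch only;
 * kMAF works inside the kernel). Pages are mapped without access
 * rights, so each first touch raises SIGSEGV; the handler records the
 * faulting address and unprotects the page so execution continues. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)info->si_addr
                          & ~((uintptr_t)page_size - 1));
    /* A real mechanism would record the (thread, page) pair here. */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);

    char *buf = mmap(NULL, 4 * page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    buf[0] = 1;              /* faults once, is sampled, then proceeds */
    buf[2 * page_size] = 1;  /* a second page is sampled the same way */
    return 0;
}

To keep observing an application over time, such a mechanism must periodically re-protect pages; each extra fault is exactly the overhead discussed above.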
Gennaro et al. (2016) describe the inaccuracy that comes from using page faults on a single page table to detect the sharing pattern of applications. They explain that a page fault caused by one thread can mask subsequent accesses to the same page by other threads. The authors overcome this issue by creating additional thread-specific page tables, and use the collected sharing pattern to map threads and their working sets of pages to the same NUMA node. Both kMAF and Gennaro et al. generate their mapping information from a very small number of samples, due to the overhead of the page faults.

The Intense Pages Mapping (IPM) mechanism (Cruz et al. 2016c) is implemented directly in the MMU of the architecture and uses the time a TLB entry stays cached in the TLB as the metric to determine the affinity between the threads and pages of parallel applications. In architectures with software-managed TLBs, IPM can be implemented natively; in architectures with hardware-managed TLBs, IPM is implemented as an addition to the MMU hardware. IPM proved to have a very good trade-off between hardware cost and accuracy. SAMMU (Cruz et al. 2016b) and LAPT (Cruz et al. 2016a) are other proposals that are implemented directly in the MMU. SAMMU requires more hardware modifications than IPM, while LAPT has a lower accuracy than IPM.
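Although their detection sources differ (page faults, TLB contents, coherence messages), these online mechanisms reduce their observations to a common abstraction: a sharing matrix whose entry (i, j) estimates how intensely threads i and j access common data. A minimal sketch of this bookkeeping is shown below, assuming a stream of sampled (thread, page) events and a history of the two most recent sharers per page; the names and the history depth are illustrative, not taken from any specific proposal.

/* Sketch: accumulate a sharing matrix from sampled (thread, page)
 * access events, keeping the two most recent sharers of each page. */
#define MAX_THREADS 64

static unsigned sharing_matrix[MAX_THREADS][MAX_THREADS];

struct page_info {
    int sharer[2];   /* last two threads seen accessing this page;
                        both fields are initialized to -1 */
};

/* Called for each sampled access of thread 'tid' to page 'pi'. */
void record_access(struct page_info *pi, int tid)
{
    for (int s = 0; s < 2; s++) {
        int other = pi->sharer[s];
        if (other >= 0 && other != tid) {
            /* 'tid' and 'other' recently touched the same page:
             * strengthen their affinity (matrix kept symmetric). */
            sharing_matrix[tid][other]++;
            sharing_matrix[other][tid]++;
        }
    }
    if (pi->sharer[0] != tid) {   /* update the sharer history */
        pi->sharer[1] = pi->sharer[0];
        pi->sharer[0] = tid;
    }
}

A thread mapping policy can then periodically place threads with large mutual entries on cores that share caches, while a data mapping policy assigns each page to the NUMA node of its dominant sharers.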
4.3 Discussion on Sharing-Aware Mapping and the State-of-the-Art

In this and the previous chapters, we analyzed the relation between sharing-aware mapping, computer architectures, and parallel applications. We observed that mapping is able to improve performance and energy efficiency in current architectures due to an improved memory locality, reducing cache misses and inter-chip interconnection usage. In the context of parallel applications, thread mapping affects performance only in applications whose threads present different amounts of data sharing among themselves, such that performance can be improved by mapping threads that share data to cores nearby in the memory hierarchy. Regarding data mapping, parallel applications whose memory pages present a high number of accesses from a single thread (or a single NUMA node) benefit from an improved mapping of pages to NUMA nodes. The related work offers a wide variety of solutions with very different characteristics, which we summarize in Tables 4.1 and 4.2.

Analyzing the impact of mapping on architectures and applications, as well as the related work, we identified the following characteristics that are desirable for a mechanism that performs sharing-aware mapping:

• Perform both thread and data mapping. As we observed, thread and data mapping influence performance in different ways, such that they can be combined for maximum performance. Most of the related work performs only thread mapping or only data mapping.
• Support dynamic environments. Several related works require a previous analysis of the parallel application and therefore do not support applications whose behavior changes between executions. Such mechanisms may also have issues supporting the wide variety of parallel architectures, since an analysis performed for one architecture may not be usable for another.
• Support dynamic memory allocation. Thread mapping mechanisms are not affected by this characteristic. Regarding data mapping, several mechanisms perform their analysis using memory traces or the compiler, which may not be enough to detect information for dynamically allocated memory. Also, dynamically allocated memory can change addresses between executions, which can invalidate previously generated information.
• Has access to memory addresses. In order to achieve a high accuracy, mechanisms should have access to the memory addresses accessed by the threads. Mechanisms that analyze the behavior solely through statistics such as cache misses or instructions per cycle have a low accuracy.
• Low overhead. Several mechanisms require memory traces, which are expensive to generate and analyze.
• Low trade-off between accuracy and overhead. In some mechanisms, increasing the accuracy drastically increases the overhead. Such mechanisms are usually based on sampling of memory addresses, such that the overhead is proportional to the number of samples used.
• No manual source code modification. Mechanisms should not depend on source code modifications, which impose a burden on programmers. The wide variety of architectures aggravates this burden.
• Do not require sampling. Mechanisms that sample memory addresses can have a lower accuracy, since the samples used may not characterize the behavior of the application correctly. Such mechanisms usually depend on sampling because their tracking method imposes a high overhead; sampling also slows down the response of the mechanism.
• No special hardware support. Ideally, a mechanism should be able to work on any common architecture. However, some mechanisms require special hardware support, which prevents their use on arbitrary architectures.
• Support complex memory access patterns. A mechanism should be able to detect any kind of memory access pattern, not only simple patterns such as domain decomposition.

Table 4.1 Summary of related work (first part). Each line represents a related work: Barrow-Williams et al. (2009), Bienia et al. (2008a), Diener et al. (2010), Cruz et al. (2010), Ribeiro et al. (2009), Ribeiro et al. (2011), Marathe et al. (2010), Cruz et al. (2011), Azimi et al. (2009), SPCD (Diener et al. 2013), Cruz et al. (2015a), and Cruz et al. (2014). Each column represents a desired property: thread mapping, data mapping, support for dynamic environments, support for dynamic memory allocation, access to memory addresses, low overhead, low trade-off between accuracy and overhead, no manual source code modification, do not require sampling, no special hardware support, and support for complex memory access patterns.

Table 4.2 Summary of related work (continued). Each line represents a related work: Autopin (Klug et al. 2008), BlackBox (Radojković et al. 2013), NUMA Balancing (Corbet 2012b), Awasthi et al. (2010), Marathe and Mueller (2006), Carrefour (Dashti et al. 2013), Piccoli et al. (2014), ForestGOMP (Broquedis et al. 2010), kMAF (Diener et al. 2014), Gennaro et al. (2016), LAPT (Cruz et al. 2016a), SAMMU (Cruz et al. 2016b), and IPM (Cruz et al. 2016c). The columns are the same properties as in Table 4.1.
As a general comparison, we can observe that most related work performs either thread or data mapping, but not both together. Thread mapping mechanisms alone are not able to reduce the number of remote memory accesses in NUMA architectures. On the other hand, data mapping mechanisms alone are not able to reduce cache misses or correctly handle the mapping of shared pages. The mechanisms we described that perform both mappings together have several disadvantages. Cruz et al. (2011) is a static mechanism that depends on information from previous executions and is thereby limited to applications whose behavior does not change between or during executions. ForestGOMP (Broquedis et al. 2010) operates online, but requires hints from the programmer to work properly. kMAF (Diener et al. 2014) uses sampling and has a high overhead when the number of samples is increased to achieve a higher accuracy. Other mechanisms rely on indirect statistics obtained from hardware counters, which do not accurately represent the memory access behavior of parallel applications. Several proposals require specific architectures, APIs, or programming languages to work, limiting their applicability.

4.4 Improving Performance with Sharing-Aware Mapping

In this section, we demonstrate the benefits of several mapping techniques (Diener et al. 2016). To that end, we make use of Ondes3D, a scientific application kernel that simulates how seismic waves propagate during earthquakes. Its model is based on a finite-differences numerical method (Aochi et al. 2013). The parallel version of Ondes3D used in this experiment uses OpenMP (Dupros et al. 2008). The memory access behavior of Ondes3D is static, that is, it is the same in every execution. Due to this, many mapping mechanisms are able to handle Ondes3D, including mechanisms that need to generate memory access traces. The communication pattern of Ondes3D is a classical domain decomposition pattern, in which neighboring threads present a high amount of communication.

4.4.1 Mapping Mechanisms

In the original source code of Ondes3D, the master thread initializes all the data, which leads to a high imbalance under a first-touch policy, since the NUMA node that first executes the master thread would be responsible for handling most memory requests. We compared four mapping mechanisms against this original implementation of Ondes3D. We selected these mechanisms because they are commonly adopted solutions that can be used in a wide range of applications and hardware architectures. All evaluated mechanisms handle both thread and data mapping.

4.4.1.1 Source Code Changes

We added manual thread and data mapping to Ondes3D. We map each thread to a core at the beginning of the parallel phase of the application. Neighboring threads are mapped to neighboring cores, following a Compact thread mapping strategy to exploit the domain decomposition pattern. The mapping was implemented using the sched_setaffinity() system call of Linux. To implement the data mapping, we made use of the first-touch policy of Linux to map the data used by each thread to that thread's NUMA node: each thread's data is initialized by the thread itself, such that first-touch allocates the data on the correct NUMA node, as sketched below.
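The sketch below illustrates both source-level changes on a generic OpenMP kernel. It is a schematic example rather than Ondes3D's actual code, and it assumes an identity thread-to-core numbering for the Compact mapping.

/* Sketch of manual thread pinning plus first-touch data mapping
 * (not Ondes3D's actual code). Compile with: gcc -fopenmp map.c */
#define _GNU_SOURCE
#include <sched.h>    /* sched_setaffinity(), cpu_set_t */
#include <stdlib.h>
#include <omp.h>

#define N (1 << 24)

int main(void)
{
    double *data = malloc(N * sizeof(*data));

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        /* Compact thread mapping: pin thread t to core t, so that
         * neighboring threads run on neighboring cores. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(tid, &set);
        sched_setaffinity(0, sizeof(set), &set);

        /* First-touch data mapping: each thread initializes the slice
         * it will later compute on, so Linux allocates those pages on
         * the thread's own NUMA node. */
        #pragma omp for schedule(static)
        for (long i = 0; i < N; i++)
            data[i] = 0.0;

        /* ... the computation loops use the same static schedule,
         * so each thread mostly accesses node-local pages ... */
    }

    free(data);
    return 0;
}

On a real machine, the logical core numbering should be checked against the hardware topology (for example with hwloc), since consecutive core IDs are not guaranteed to be neighbors in the memory hierarchy.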
4.4.1.2 Offline Profiling at the User Level

The Numalize mechanism (Diener et al. 2015b) was used to profile Ondes3D and generate a memory access trace. A kernel module receives the profile generated by Numalize and applies optimized thread and data mappings.

4.4.1.3 Runtime Options

Ondes3D was executed with a Compact thread mapping by setting the GOMP_CPU_AFFINITY environment variable. For the data mapping, we used an Interleave data mapping through the numactl tool.

4.4.1.4 Online Profiling at the System Level

To perform online mapping, we used the NUMA Balancing mechanism, which is part of the Linux kernel. NUMA Balancing tries to keep threads running on the NUMA node containing their data. It also migrates data to the node on which the accessing thread is running. We executed the experiments with the default NUMA Balancing configuration of Linux 3.13.

4.4.2 Methodology of the Experiments

We executed Ondes3D with two different inputs: small and large. Both use the same amount of memory, but differ in the number of iterations. The machine used in the experiments has several NUMA nodes, each containing an Intel Xeon X7550 processor with eight cores and 2-way SMT. The L1 and L2 caches are private to each core, while the L3 cache is shared among all cores within each processor. The Linux kernel 3.13 was used in all experiments. As compiler, we used gcc 4.6.3 with the -O2 optimization flag. Each experiment was executed 10 times.

4.4.3 Results

Figure 4.1 shows the results of the experiments. Performance improvements of more than 200% were achieved, which clearly shows the importance of thread and data mapping in current shared memory architectures. In Ondes3D, the mapping mechanisms based on source code changes and offline profiling provided the best performance improvements. These two mechanisms, on the other hand, have the highest up-front overhead. Both require an expensive analysis: in the case of source code changes, the programmer needs to identify which data each thread accesses most, while in offline profiling the expensive analysis is due to the usage of tools to generate and process memory traces. We can also note that offline profiling provided slightly better results than source code changes.

The mapping using runtime options also improved the performance, although not as much as the other mechanisms, since it does not improve data mapping locality. However, it represents a simple way to perform mapping without any deep knowledge about the application or more advanced mapping schemes. Online profiling presented better results for the larger input, since the mechanism needs to detect the application behavior before performing the mapping. With the large input, the time spent learning the behavior is amortized, such that online profiling can improve performance as much as offline profiling.

Fig. 4.1 Execution time of Ondes3D with the various mapping mechanisms (baseline, source code changes, offline profiling, runtime options, online profiling). (a) Small input set. (b) Large input set.
Chapter 5
Conclusions

The locality of memory accesses is one of the most important aspects to be considered when designing an architecture or developing software. With the introduction of multicore architectures, the memory hierarchy had to evolve to be able to provide the necessary bandwidth to several cores operating in parallel. With this evolution, memory hierarchies started to present several caches in the same level, some levels shared by multiple cores and others private to a core. Another important step was the incorporation of the memory controller into the processor chip, with which multiprocessor systems came to present NUMA characteristics. Due to the introduction of such technologies, the performance of the memory hierarchy, and of the system as a whole, became even more dependent on memory locality. In this context, techniques such as sharing-aware thread and data mapping are able to increase memory locality and thereby performance. Our experiments indicate performance improvements of up to 200% in a scientific application.

A large body of related work on sharing-aware mapping has been proposed, with a wide variety of characteristics and features. The majority of the proposals perform only static mapping and are therefore able to handle only applications whose memory access behavior stays the same across executions. Most work also handles only thread or only data mapping, not both together. Most related work that handles both thread and data mapping and operates online, during the execution of the application, has a steep trade-off between accuracy and overhead: to achieve a higher accuracy, the overhead of the memory access behavior detection has to increase as well. Some proposals achieve high accuracy with low overhead, but require special hardware support.

References

AMD (2012) AMD Opteron 6300 series processor quick reference guide. Tech. Rep., August
Aochi H, Ulrich T, Ducellier A, Dupros F, Michea D (2013) Finite difference simulations of seismic wave propagation for understanding earthquake physics and predicting ground motions: advances and challenges. J Phys Conf Ser 454(1):012010. https://doi.org/10.1088/1742-6596/454/1/012010
Awasthi M, Nellans DW, Sudan K, Balasubramonian R, Davis A (2010) Handling the problems and opportunities posed by multiple on-chip memory controllers. In: Parallel architectures and compilation techniques (PACT), pp 319–330
Azimi R, Tam DK, Soares L, Stumm M (2009) Enhancing operating system support for multicore processors by using hardware performance monitoring. ACM SIGOPS Oper Syst Rev 43(2):56–65. https://doi.org/10.1145/1531793.1531803
Bach M, Charney M, Cohn R, Demikhovsky E, Devor T, Hazelwood K, Jaleel A, Luk CK, Lyons G, Patil H, Tal A (2010) Analyzing parallel programs with Pin. IEEE Comput 43(3):34–41
Barrow-Williams N, Fensch C, Moore S (2009) A communication characterisation of SPLASH-2 and PARSEC. In: IEEE international symposium on workload characterization (IISWC), pp 86–97. https://doi.org/10.1109/IISWC.2009.5306792
Bellard F (2005) QEMU, a fast and portable dynamic translator. In: USENIX annual technical conference (ATEC). USENIX Association, Berkeley, pp 41–41
Bienia C, Kumar S, Li K (2008a) PARSEC vs. SPLASH-2: a quantitative comparison of two multithreaded benchmark suites on chip-multiprocessors. In: IEEE international symposium on workload characterization (IISWC), pp 47–56. https://doi.org/10.1109/IISWC.2008.4636090
Bienia C, Kumar S, Singh JP, Li K (2008b) The PARSEC benchmark suite: characterization and architectural implications. In: International conference on parallel architectures and compilation techniques (PACT), pp 72–81
Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, Hestness J, Hower DR, Krishna T, Sardashti S, Sen R, Sewell K, Shoaib M, Vaish N, Hill MD, Wood DA (2011) The gem5 simulator. ACM SIGARCH Comput Archit News 39(2):1–7
Borkar S, Chien AA (2011) The future of microprocessors. Commun ACM 54(5):67–77
Broquedis F, Aumage O, Goglin B, Thibault S, Wacrenier PA, Namyst R (2010) Structuring the execution of OpenMP applications for multicore architectures. In: IEEE international parallel & distributed processing symposium (IPDPS), pp 1–10
Caparros Cabezas V, Stanley-Marbell P (2011) Parallelism and data movement characterization of contemporary application classes. In: ACM symposium on parallelism in algorithms and architectures (SPAA)
Casavant TL, Kuhl JG (1988) A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Trans Softw Eng 14(2):141–154
Chishti Z, Powell MD, Vijaykumar TN (2005) Optimizing replication, communication, and capacity allocation in CMPs. ACM SIGARCH Comput Archit News 33(2):357–368. https://doi.org/10.1145/1080695.1070001
Conway P (2007) The AMD Opteron northbridge architecture. IEEE Micro 27(2):10–21
Corbet J (2012a) AutoNUMA: the other approach to NUMA scheduling. http://lwn.net/Articles/488709/
Corbet J (2012b) Toward better NUMA scheduling. http://lwn.net/Articles/486858/
Coteus PW, Knickerbocker JU, Lam CH, Vlasov YA (2011) Technologies for exascale systems. IBM J Res Develop 55(5):14:1–14:12. https://doi.org/10.1147/JRD.2011.2163967
Cruz EHM, Alves MAZ, Navaux POA (2010) Process mapping based on memory access traces. In: Symposium on computing systems (WSCAD-SCC), pp 72–79
Cruz E, Alves M, Carissimi A, Navaux P, Ribeiro C, Mehaut J (2011) Using memory access traces to map threads and data on hierarchical multi-core platforms. In: IEEE international symposium on parallel and distributed processing workshops and PhD forum (IPDPSW)
Cruz EHM, Diener M, Navaux POA (2012) Using the translation lookaside buffer to map threads in parallel applications based on shared memory. In: IEEE international parallel & distributed processing symposium (IPDPS), pp 532–543. https://doi.org/10.1109/IPDPS.2012.56
Cruz EHM, Diener M, Alves MAZ, Navaux POA (2014) Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols. J Parallel Distrib Comput 74(3):2215–2228. https://doi.org/10.1016/j.jpdc.2013.11.006
Cruz EHM, Diener M, Navaux POA (2015a) Communication-aware thread mapping using the translation lookaside buffer. Concurr Comput Pract Exp 22(6):685–701
Cruz EHM, Diener M, Pilla LL, Navaux POA (2015b) An efficient algorithm for communication-based task mapping. In: International conference on parallel, distributed, and network-based processing (PDP), pp 207–214
Cruz EH, Diener M, Alves MA, Pilla LL, Navaux PO (2016a) LAPT: a locality-aware page table for thread and data mapping. Parallel Comput 54(C):59–71. http://dx.doi.org/10.1016/j.parco.2015.12.001
Cruz EHM, Diener M, Pilla LL, Navaux POA (2016b) A sharing-aware memory management unit for online mapping in multi-core architectures. In: Euro-Par parallel processing, pp 659–671. https://doi.org/10.1007/978-3-319-43659-3
Cruz EHM, Diener M, Pilla LL, Navaux POA (2016c) Hardware-assisted thread and data mapping in hierarchical multicore architectures. ACM Trans Archit Code Optim 13(3):1–25. https://doi.org/10.1145/2975587
Dashti M, Fedorova A, Funston J, Gaud F, Lachaize R, Lepers B, Quéma V, Roth M (2013) Traffic management: a holistic approach to memory placement on NUMA systems. In: Architectural support for programming languages and operating systems (ASPLOS), pp 381–393
Diener M, Madruga FL, Rodrigues ER, Alves MAZ, Navaux POA (2010) Evaluating thread placement based on memory access patterns for multi-core processors. In: IEEE international conference on high performance computing and communications (HPCC), pp 491–496. http://doi.ieeecomputersociety.org/10.1109/HPCC.2010.114
Diener M, Cruz EHM, Navaux POA (2013) Communication-based mapping using shared pages. In: IEEE international parallel & distributed processing symposium (IPDPS), pp 700–711. https://doi.org/10.1109/IPDPS.2013.57
Diener M, Cruz EHM, Navaux POA, Busse A, Heiß HU (2014) kMAF: automatic kernel-level management of thread and data affinity. In: International conference on parallel architectures and compilation techniques (PACT), pp 277–288
Diener M, Cruz EHM, Navaux POA, Busse A, Heiß HU (2015a) Communication-aware process and thread mapping using online communication detection. Parallel Comput 43(March):43–63
Diener M, Cruz EHM, Pilla LL, Dupros F, Navaux POA (2015b) Characterizing communication and page usage of parallel applications for thread and data mapping. Perform Eval 88–89(June):18–36
Diener M, Cruz EHM, Alves MAZ, Navaux POA, Koren I (2016) Affinity-based thread and data mapping in shared memory systems. ACM Comput Surv 49(4):64:1–64:38. http://doi.acm.org/10.1145/3006385
Dupros F, Aochi H, Ducellier A, Komatitsch D, Roman J (2008) Exploiting intensive multithreading for the efficient simulation of 3D seismic wave propagation. In: IEEE international conference on computational science and engineering (CSE), pp 253–260. https://doi.org/10.1109/CSE.2008.51
Feliu J, Sahuquillo J, Petit S, Duato J (2012) Understanding cache hierarchy contention in CMPs to improve job scheduling. In: International parallel and distributed processing symposium (IPDPS). https://doi.org/10.1109/IPDPS.2012.54
Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A (2004) Open MPI: goals, concept, and design of a next generation MPI implementation. In: Recent advances in parallel virtual machine and message passing interface
Gennaro ID, Pellegrini A, Quaglia F (2016) OS-based NUMA optimization: tackling the case of truly multi-thread applications with non-partitioned virtual page accesses. In: IEEE/ACM international symposium on cluster, cloud, and grid computing (CCGRID), pp 291–300. https://doi.org/10.1109/CCGrid.2016.91
Intel (2008) Quad-core Intel Xeon processor 5400 series datasheet. Tech. Rep., March. http://www.intel.com/assets/PDF/datasheet/318589.pdf
Intel (2010a) Intel Itanium architecture software developer's manual. Tech. Rep.
Intel (2010b) Intel Xeon processor 7500 series. Tech. Rep., March
Intel (2012) 2nd generation Intel Core processor family. Tech. Rep., September
Jin H, Frumkin M, Yan J (1999) The OpenMP implementation of NAS parallel benchmarks and its performance. Tech. Rep., NASA, October
Johnson M, McCraw H, Moore S, Mucci P, Nelson J, Terpstra D, Weaver V, Mohan T (2012) PAPI-V: performance monitoring for virtual machines. In: International conference on parallel processing workshops (ICPPW), pp 194–199. https://doi.org/10.1109/ICPPW.2012.29
Klug T, Ott M, Weidendorfer J, Trinitis C (2008) Autopin: automated optimization of thread-to-core pinning on multicore systems. High Perform Embed Archit Compilers 3(4):219–235
LaRowe RP, Holliday MA, Ellis CS (1992) An analysis of dynamic page placement on a NUMA multiprocessor. ACM SIGMETRICS Perform Eval Rev 20(1):23–34
Löf H, Holmgren S (2005) Affinity-on-next-touch: increasing the performance of an industrial PDE solver on a cc-NUMA system. In: International conference on supercomputing (SC), pp 387–392
Magnusson P, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):50–58. https://doi.org/10.1109/2.982916
Marathe J, Mueller F (2006) Hardware profile-guided automatic page placement for ccNUMA systems. In: ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP), pp 90–99
Marathe J, Thakkar V, Mueller F (2010) Feedback-directed page placement for ccNUMA via hardware-generated memory traces. J Parallel Distrib Comput 70(12):1204–1219
Martin MMK, Hill MD, Sorin DJ (2012) Why on-chip cache coherence is here to stay. Commun ACM 55(7):78. https://doi.org/10.1145/2209249.2209269
Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. In: ACM SIGPLAN conference on programming language design and implementation (PLDI)
OpenMP (2013) OpenMP application program interface. Tech. Rep., July
Patel A, Afram F, Chen S, Ghose K (2011) MARSSx86: a full system simulator for x86 CPUs. In: Design automation conference 2011 (DAC'11)
Piccoli G, Santos HN, Rodrigues RE, Pousa C, Borin E, Quintão Pereira FM (2014) Compiler support for selective page migration in NUMA architectures. In: International conference on parallel architectures and compilation techniques (PACT), pp 369–380
Radojković P, Čakarević V, Verdú J, Pajuelo A, Cazorla FJ, Nemirovsky M, Valero M (2013) Thread assignment of multithreaded network applications in multicore/multithreaded processors. IEEE Trans Parallel Distrib Syst 24(12):2513–2525
Ribeiro CP, Méhaut JF, Carissimi A, Castro M, Fernandes LG (2009) Memory affinity for hierarchical shared memory multiprocessors. In: International symposium on computer architecture and high performance computing (SBAC-PAD), pp 59–66
Ribeiro CP, Castro M, Méhaut JF, Carissimi A (2011) Improving memory affinity of geophysics applications on NUMA platforms using Minas. In: International conference on high performance computing for computational science (VECPAR)
Shwartsman S, Mihocka D (2008) Virtualization without direct execution or jitting: designing a portable virtual machine infrastructure. In: International symposium on computer architecture (ISCA), Beijing
Swamy T, Ubal R (2014) Multi2Sim 4.2: a compilation and simulation framework for heterogeneous computing. In: International conference on architectural support for programming languages and operating systems (ASPLOS)
Tanenbaum AS (2007) Modern operating systems, 3rd edn. Prentice Hall Press, Upper Saddle River
Terboven C, an Mey D, Schmidl D, Jin H, Reichstein T (2008) Data and thread affinity in OpenMP programs. In: Workshop on memory access on future processors: a solved problem? (MAW), pp 377–384. https://doi.org/10.1145/1366219.1366222
Tikir MM, Hollingsworth JK (2008) Hardware monitors for dynamic page migration. J Parallel Distrib Comput 68(9):1186–1200
Tolentino M, Cameron K (2012) The optimist, the pessimist, and the global race to exascale in 20 megawatts. IEEE Comput 45(1):95–97
Torrellas J (2009) Architectures for extreme-scale computing. IEEE Comput 42(11):28–35
Verghese B, Devine S, Gupta A, Rosenblum M (1996) OS support for improving data locality on CC-NUMA compute servers. Tech. Rep., February
Villavieja C, Karakostas V, Vilanova L, Etsion Y, Ramirez A, Mendelson A, Navarro N, Cristal A, Unsal OS (2011) DiDi: mitigating the performance impact of TLB shootdowns using a shared TLB directory. In: International conference on parallel architectures and compilation techniques (PACT), pp 340–349. https://doi.org/10.1109/PACT.2011.65
Wang W, Dey T, Mars J, Tang L, Davidson JW, Soffa ML (2012) Performance analysis of thread mappings with a holistic view of the hardware resources. In: IEEE international symposium on performance analysis of systems & software (ISPASS)
Woodacre M, Robb D, Roe D, Feind K (2005) The SGI Altix 3000 global shared-memory architecture. Tech. Rep.
Zhou X, Chen W, Zheng W (2009) Cache sharing management for performance fairness in chip multiprocessors. In: International conference on parallel architectures and compilation techniques (PACT), pp 384–393. https://doi.org/10.1109/PACT.2009.40
Ziakas D, Baum A, Maddox RA, Safranek RJ (2010) Intel QuickPath Interconnect: architectural features supporting scalable system architectures. In: Symposium on high performance interconnects (HOTI), pp 1–6