High-Performance Computing on Complex Environments

WILEY SERIES ON PARALLEL AND DISTRIBUTED COMPUTING
Series Editor: Albert Y. Zomaya
A complete list of titles in this series appears at the end of this volume.

Emmanuel Jeannot
Inria

Julius Žilinskas
Vilnius University

Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general
information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Jeannot, Emmanuel.
High performance computing on complex environments / Emmanuel Jeannot, Julius Zilinskas.
pages cm
Includes bibliographical references and index.
ISBN 978-1-118-71205-4 (cloth)
High performance computing. I. Žilinskas, J. (Julius), 1973- II. Title.
QA76.88.J43 2014
004.1′1–dc23 2013048363

High-Performance Computing on Complex Environments / Emmanuel Jeannot and Julius Žilinskas

Printed in the United States of America

To our colleague Mark Baker

Contents

Contributors
Preface

PART I INTRODUCTION

Summary of the Open European Network for High-Performance Computing in Complex Environments
Emmanuel Jeannot and Julius Žilinskas
    1.1 Introduction and Vision
    1.2 Scientific Organization
        1.2.1 Scientific Focus
        1.2.2 Working Groups
    1.3 Activities of the Project
        1.3.1 Spring Schools
        1.3.2 International Workshops
        1.3.3 Working Groups Meetings
        1.3.4 Management Committee Meetings
        1.3.5 Short-Term Scientific Missions
    1.4 Main Outcomes of the Action
    1.5 Contents of the Book
    Acknowledgment

PART II NUMERICAL ANALYSIS FOR HETEROGENEOUS AND MULTICORE SYSTEMS

On the Impact of the Heterogeneous Multicore and Many-Core Platforms on Iterative Solution Methods and Preconditioning Techniques
Dimitar Lukarski and Maya Neytcheva
    2.1 Introduction
    2.2 General Description of Iterative Methods and Preconditioning
        2.2.1 Basic Iterative Methods
        2.2.2 Projection Methods: CG and GMRES
    2.3 Preconditioning Techniques
    2.4 Defect-Correction Technique
    2.5 Multigrid Method
    2.6 Parallelization of Iterative Methods
    2.7 Heterogeneous Systems
        2.7.1 Heterogeneous Computing
        2.7.2 Algorithm Characteristics and Resource Utilization
        2.7.3 Exposing Parallelism
        2.7.4 Heterogeneity in Matrix Computation
        2.7.5 Setup of Heterogeneous Iterative Solvers
    2.8 Maintenance and Portability
    2.9 Conclusion
    Acknowledgments
    References

Efficient Numerical Solution of 2D Diffusion Equation on Multicore Computers
Matjaž Depolli, Gregor Kosec, and Roman Trobec
    3.1 Introduction
    3.2 Test Case
        3.2.1 Governing Equations
        3.2.2 Solution Procedure
    3.3 Parallel Implementation
        3.3.1 Intel PCM Library
        3.3.2 OpenMP

REAL-TIME TOMOGRAPHIC RECONSTRUCTION THROUGH CPU + GPU COPROCESSING

implementations have provided an effective reduction of computation time, approximately proportional to the number of processors used. Recently, the trend has turned toward GPUs, and a number of approaches have been presented [17, 28–32], including the use of multi-GPU strategies [33, 34], that have achieved outstanding speedup factors. In most of them, the multiple threads within the GPUs work at the level of individual voxels.

We have proposed a novel matrix approach of WBP for GPUs that outperforms previous strategies [17]. The 2D WBP procedure given by the analytic expression of WBP shown in (23.1) is directly implemented as a sparse matrix-vector product (SpMV). The 3D reconstruction is then performed by a series of SpMV operations. An important point behind the efficiency of this implementation is that the matrix B is invariant and shared by all the slices to be reconstructed. The other pillar of the method's efficiency is the development and use of sparse matrix data structures optimized for GPUs. We use the ELL-R scheme [35, 36], which consists of two arrays of dimension $m \times (2\,n_{tilts})$, where $m = m_x m_y$ is the number of rows of B and $n_{tilts}$ is the maximum number of nonzeroes in the rows. The first array, $B_{sp}$, stores the
nonzeroes, and the second, I, stores the original column index (i) in matrix B for each value in $B_{sp}$. An additional vector rl of dimension m keeps the actual number of nonzeroes in each row. The arrays $B_{sp}$ and I store their elements in column-major order. Because each GPU thread computes one row, this layout makes consecutive threads read consecutive memory addresses, which ensures coalesced global memory access. Algorithm 23.2 shows how the SpMV operation is performed on the GPU using the ELL-R scheme (note that s[x] denotes $s_j$ in (23.1)). The reader is referred to [17] for details.

Algorithm 23.2 SpMV code on GPU using ELL-R scheme: $s = B_{sp}\,p$

{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < m) {
        int j, length;
        float svalue=0.0, value;
        length = rl[x];
        for(j=0; j