Alexandru-Petru Tanase · Frank Hannig · Jürgen Teich

Symbolic Parallelization of Nested Loop Programs

Alexandru-Petru Tanase
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Erlangen, Germany

Frank Hannig
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Erlangen, Germany

Jürgen Teich
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
Erlangen, Germany

ISBN 978-3-319-73908-3
ISBN 978-3-319-73909-0 (eBook)
https://doi.org/10.1007/978-3-319-73909-0

Library of Congress Control Number: 2018930020

© Springer International Publishing AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper.

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Contents

1 Introduction
   1.1 Goals and Contributions
   1.2 Symbolic Outer and Inner Loop Parallelization
   1.3 Symbolic Multi-Level Parallelization
   1.4 On-Demand Fault-Tolerant Loop Processing
   1.5 Book Organization

2 Fundamentals and Compiler Framework
   2.1 Invasive Computing
   2.2 Invasive Tightly Coupled Processor Arrays
      2.2.1 Processor Array
      2.2.2 Array Interconnect
      2.2.3 TCPA Peripherals
   2.3 Compiler Framework
      2.3.1 Compilation Flow
      2.3.2 Front End
      2.3.3 Loop Specification in the Polyhedron Model
      2.3.4 PAULA Language
      2.3.5 PARO
      2.3.6 Space-Time Mapping
      2.3.7 Code Generation
      2.3.8 PE Code Generation
      2.3.9 Interconnect Network Configuration
      2.3.10 GC and AG Configuration Stream

3 Symbolic Parallelization
   3.1 Symbolic Tiling
      3.1.1 Decomposition of the Iteration Space
      3.1.2 Embedding of Data Dependencies
   3.2 Symbolic Outer Loop Parallelization
      3.2.1 Tight Intra-Tile Schedule Vector Candidates
      3.2.2 Tight Inter-Tile Schedule Vectors
      3.2.3 Parametric Latency Formula
      3.2.4 Runtime Schedule Selection
   3.3 Symbolic Inner Loop Parallelization
      3.3.1 Tight Intra-Tile Schedule Vectors
      3.3.2 Tight Inter-Tile Schedule Vector Candidates
      3.3.3 Parametric Latency Formula
      3.3.4 Runtime Schedule Selection
   3.4 Runtime Schedule Selection on Invasive TCPAs
   3.5 Experimental Results
      3.5.1 Latency
      3.5.2 I/O and Memory Demand
      3.5.3 Scalability
   3.6 Related Work
   3.7 Summary

4 Symbolic Multi-Level Parallelization
   4.1 Symbolic Hierarchical Tiling
      4.1.1 Decomposition of the Iteration Space
      4.1.2 Embedding of Data Dependencies
   4.2 Symbolic Hierarchical Scheduling
      4.2.1 Latency-Minimal Sequential Schedule Vectors
      4.2.2 Tight Parallel Schedule Vectors
      4.2.3 Parametric Latency Formula
      4.2.4 Runtime Schedule Selection
   4.3 Experimental Results
      4.3.1 Latency
      4.3.2 I/O and Memory Balancing
      4.3.3 Scalability
   4.4 Related Work
   4.5 Summary

5 On-Demand Fault-Tolerant Loop Processing
   5.1 Fundamentals and Fault Model
   5.2 Fault-Tolerant Loop Execution
      5.2.1 Loop Replication
      5.2.2 Voting Insertion
      5.2.3 Immediate, Early, and Late Voting
   5.3 Voting Functions Implementation
   5.4 Adaptive Fault Tolerance Through Invasive Computing
      5.4.1 Reliability Analysis for Fault-Tolerant Loop Execution
   5.5 Experimental Results
      5.5.1 Latency Overhead
      5.5.2 Average Error Detection Latency
   5.6 Related Work
   5.7 Summary

6 Conclusions and Outlook
   6.1 Conclusions
   6.2 Outlook

Bibliography

Index

Acronyms

ABS     Anti-lock Braking System
AG      Address Generator
AST     Abstract Syntax Tree
CGRA    Coarse-Grained Reconfigurable Array
COTS    Commercial Off-The-Shelf
CPU     Central Processing Unit
CUDA    Compute Unified Device Architecture
DMR     Dual Modular Redundancy
DPLA    Dynamic Piecewise Linear Algorithm
ECC     Error-Correcting Code
EDC     Egregious Data Corruption
EDL     Error Detection Latency
FCR     Fault Containment Region
FSM     Finite State Machine
FU      Functional Unit
GC      Global Controller
GPU     Graphics Processing Unit
HPC     High-Performance Computing
iCtrl   Invasion Controller
i-let   Invasive-let
ILP     Integer Linear Program
IM      Invasion Manager
i-NoC   Invasive Network-on-Chip
LPGS    Locally Parallel Globally Sequential
LSGP    Locally Sequential Globally Parallel
MPSoC   Multi-Processor System-on-Chip
NMR     N-Modular Redundancy
PE      Processing Element
PFH     Probability of Failure per Hour
PGAS    Partitioned Global Address Space
PLA     Piecewise Linear Algorithm
SER     Soft Error Rate
SEU     Single-Event Upset
SIL     Safety Integrity Level
SoC     System-on-Chip
SPARC   Scalable Processor Architecture
TCPA    Tightly Coupled Processor Array
TMR     Triple Modular Redundancy
UDA     Uniform Dependence Algorithm
VLIW    Very Long Instruction Word
List of Symbols

D∗        – Set of tiled dependency vectors
E[LE,early] – The average error detection latency for early voting
E[LE,imm]  – The average error detection latency for immediate voting
E[LE,late] – The average error detection latency for late voting
G         – The number of quantified equations
I         – Original iteration vector
In        – Input space
J         – Intra-tile iteration vector
K         – Inter-tile iteration vector
Kf        – The tile to be executed first by a symbolic schedule vector λ
Kl        – The tile to be executed last by a symbolic schedule vector λ
L         – Latency
Lg        – Global latency
Ll        – Local latency
LE,early  – Error detection latency for early voting
LE,imm    – Error detection latency for immediate voting
LE,late   – Error detection latency for late voting
Lopt      – Optimal latency
M         – Maximal number of symbolic schedule candidates
Out       – Output space
P         – Tiling matrix
R         – Replicated iteration vector
S         – Path stride matrix
Φ         – The allocation matrix
B         – Set of protected variables
λ         – Schedule vector
λJ        – Intra-tile schedule vector
λK        – Inter-tile schedule vector
λR        – Schedule vector of replicated iteration space
P         – Processor space
Duller, A., Panesar, G., & Towner, D (2003) Parallel processing — The picoChip way! In Proceedings of Communicating Process Architectures (CPA), Enschede, The Netherlands, 2003 (pp 125–138) Darte, A., & Robert, Y (1995) Affine-by-statement scheduling of uniform and affine loop nests over parametric domains Journal of Parallel and Distributed Computing, 29(1), 43–59 Darte, A., Schreiber, R., Rau, B R., & Vivien, F (2000) A constructive solution to the juggling problem in systolic array synthesis In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) (pp 815–821) Di, P., & Xue, J (2011) Model-driven tile size selection for doacross loops on GPUs In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par, Berlin, Heidelberg, 2011 (pp 401–412) Di, P., Ye, D., Su, Y., Sui, Y., & Xue, J (2012) Automatic parallelization of tiled loop nests with enhanced fine-grained parallelism on GPUs In 2012 41st International Conference on Parallel Processing (pp 350–359) Eles, P., Izosimov, V., Pop, P., & Peng, Z (2008) Synthesis of fault-tolerant embedded systems In Proceedings of the Conference on Design, Automation and Test in Europe (pp 1117–1122) Eckhardt, U., & Merker, R (1997) Scheduling in co-partitioned array architectures In IEEE International Conference on Proceedings of the ApplicationSpecific Systems, Architectures and Processors, July 1997 (pp 219–228) Eckhardt, U., & Merker, R (1999) Hierarchical algorithm partitioning at system level for an improved utilization of memory structures IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(1), 14–24 Feautrier, P (1991) Dataflow analysis of array and scalar references International Journal of Parallel Programming, 20(1), 23–53 Feautrier, P., & Lengauer, C (2011) Polyhedron model In Encyclopedia of parallel computing (pp 1581–1592) Gall, H (2008) Functional safety IEC 61508/IEC 61511 the impact to certification and the 
user In AICCSA 2008 IEEE/ACS International Conference on Computer Systems and Applications, March 2008 (pp 1027–1031) Grudnitsky, A., Bauer, L., & Henkel, J (2017) Efficient partial online synthesis of special instructions for reconfigurable processors IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(2), 594–607 Goulding-Hotta, N., Sampson, J., Venkatesh, G., Garcia, S., Auricchio, J., Huang, P., et al (2011) The GreenDroid mobile application processor: An architecture for silicon’s dark future IEEE Micro, 31(2), 86–95 Gong, C., Melhem, R., & Gupta, R (1996) Loop transformations for fault detection in regular loops on massively parallel systems IEEE Transactions Parallel and Distributed Systems Impact Factor, 7(12), 1238–1249 Gangadharan, D., Sousa, E., Lari, V., Hannig, F., & Teich, J (2015) Applicationdriven reconfiguration of shared resources for timing predictability of MPSoC 162 [GSVP03] [GTHT14] [Gwe11] [Han09] [HBB+ 09] [HBRS10] [HCF97] [HCF99] [HDH+ 10] [HDT06] [HHB+ 12] [HLB+ 14] [HLD+ 09] [HRDT08] Bibliography platforms In Proceedings of Asilomar Conference on Signals, Systems, and Computers (ASILOMAR) (pp 398–403) Washington, DC, USA: IEEE Computer Society Gomaa, M., Scarbrough, C., Vijaykumar, T N., & Pomeranz, I (2003) Transientfault recovery for chip multiprocessors In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003 (pp 98–109) New York: IEEE Gangadharan, D., Tanase, A., Hannig, F., & Teich, J (2014) Timing analysis of a heterogeneous architecture with massively parallel processor arrays In DATE Friday Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES) ECSI Gwennup, L (2011) Adapteva: More Flops, Less Watts: Epiphany Offers Floating-Point Accelerator for Mobile Processors Microprocessor Report (2) Hannig, F (2009) Scheduling Techniques for High-throughput Loop Accelerators Dissertation, University of Erlangen-Nuremberg, Germany, Verlag Dr Hut, Munich, 
Germany ISBN: 978-3-86853-220-3 Hartono, A., Baskaran, M M., Bastoul, C., Cohen, A., Krishnamoorthy, S., Norris, B., et al (2009) Parametric multi-level tiling of imperfectly nested loops In Proceedings of the 23rd International Conference on Supercomputing (ICS), New York, NY, USA, 2009 (pp 147–157) Hartono, A., Baskaran, M M., Ramanujam, J., & Sadayappan, P (2010) DynTile: Parametric tiled loop generation for parallel execution on multicore processors In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) (pp 1–12) Högstedt, K., Carter, L., & Ferrante, J (1997) Determining the idle time of a tiling In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (pp 160–173) New York: ACM Högstedt, K., Carter, L., & Ferrante, J (1999) Selecting tile shape for minimal execution time In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, New York, NY, USA, 1999 (pp 201–211) Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, D., Ruhl, G., et al (2010) A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS In Proceedings of IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC) (pp 108–109) Hannig, F., Dutta, H., & Teich, J (2006) Mapping a class of dependence algorithms to coarse-grained reconfigurable arrays: Architectural parameters and methodology International Journal of Embedded Systems, 2(1/2), 114–127 Henkel, J., Herkersdorf, A., Bauer, L., Wild, T., Hübner, M., Pujari, R K., et al (2012) Invasive manycore architectures In 17th Asia and South Pacific Design Automation Conference (ASP-DAC) (pp 193–200) New York: IEEE Hannig, F., Lari, V., Boppu, S., Tanase, A., & Reiche, O (2014) Invasive tightly-coupled processor arrays: A domain-specific architecture/compiler codesign approach ACM Transactions on Embedded Computing Systems (TECS), 13(4s), 133:1–133:29 Hu, J., Li, F., Degalahal, V., Kandemir, M., Vijaykrishnan, 
N., & Irwin, M J (2009) Compiler-assisted soft error detection under performance and energy constraints in embedded systems ACM Transactions on Embedded Computing Systems, 8(4), 27:1–27:30 Hannig, F., Ruckdeschel, H., Dutta, H., & Teich, J (2008) PARO: Synthesis of hardware accelerators for multi-dimensional dataflow-intensive applications In Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC) Lecture notes in computer science, March 2008 (Vol 4943, pp 287–293) London, UK: Springer Bibliography [HRS+ 11] [HRT08] [HSL+ 13] [HT04] [HZW+ 14] [IDS12] [INKM05] [IR95] [IT88] [Jai86] [JLF03] [KHKT06a] [KHKT06b] [KHM03] [KMW67] [KR09] 163 Hannig, F., Roloff, S., Snelting, G., Teich, J., & Zwinkau, A (2011) Resourceaware programming and simulation of MPSoC architectures through extension of X10 In Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems (pp 48–55) New York: ACM Hannig, F., Ruckdeschel, H., & Teich, J (2008) The PAULA language for designing multi-dimensional dataflow-intensive applications In Proceedings of the GI/ITG/GMM-Workshop – Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (pp 129–138) Freiburg, Germany: Shaker Hannig, F., Schmid, M., Lari, V., Boppu, S., & Teich, J (2013) System integration of tightly-coupled processor arrays using reconfigurable buffer structures In Proceedings of the ACM International Conference on Computing Frontiers (CF) (pp 2:1–2:4) New York: ACM Hannig, F., & Teich, J (2004) Dynamic piecewise linear/regular algorithms In International Conference on Parallel Computing in Electrical Engineering PARELEC’04 (pp 79–84) New York: IEEE Heisswolf, J., Zaib, A., Weichslgartner, A., Karle, M., Singh, M., Wild, T., et al (2014) The invasive network on chip - a multi-objective many-core communication infrastructure In ARCS’14; Workshop Proceedings on Architecture of Computing Systems (pp 1–8) Irza, J., Doerr, M., & 
Solka, M (2012) A third generation many-core processor for secure embedded computing systems In 2012 IEEE Conference on High Performance Extreme Computing (HPEC) (pp 1–3) New York: IEEE Iyer, R K., Nakka, N M., Kalbarczyk, Z T., & Mitra, S (2005) Recent advances and new avenues in hardware-level reliability support Micro, IEEE, 25(6), 18–29 Brewer, F., & Radivojevic, I (1995) Symbolic scheduling techniques In IEICE Transactions on Information and Systems, Japan, March 1995 (pp 224–230) Irigoin, F., & Triolet, R (1988) Supernode partitioning In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), San Diego, CA, USA, January 1988 (pp 319–329) Jainandunsing, K (1986) Optimal partitioning scheme for wavefront/systolic array processors In Proceedings of IEEE Symposium on Circuits and Systems (pp 940–943) Jiménez, M., Llabería, J M., & Fernández, A (2003) A cost-effective implementation of multilevel tiling IEEE Transactions on Parallel and Distributed Systems, 14(10), 1006–1020 Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J (2006) A dynamically reconfigurable weakly programmable processor array architecture template In Proceedings of the International Workshop on Reconfigurable Communication Centric System-on-Chips (ReCoSoC) (pp 31–37) Kissler, D., Hannig, F., Kupriyanov, A., & Teich, J (2006) A highly parameterizable parallel processor array architecture In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), (pp 105– 112) New York: IEEE Kandasamy, N., Hayes, J P., & Murray, B T (2003) Transparent recovery from intermittent faults in time-triggered distributed systems IEEE Transactions on Computers, 52(2), 113–125 Karp, R M., Miller, R E., & Winograd, S (1967) The organization of computations for uniform recurrence equations Journal of the ACM, 14(3), 563–590 Kim, D., & Rajopadhye, S (2009) Efficient tiled loop generation: D-tiling In Workshop on Languages and Compilers for Parallel 
Computing (LCPC) Lecture notes in computer science (Vol 5898, pp 293–307) Berlin: Springer 164 Bibliography [KRR+ 07] [KRZ+ 10] [KSHT09] [KSSF10] [Kup09] [KWM12] [Lam74] [Lar16] [LBBG05] [LCB+ 10] [Len93] [Lin06] [LNHT11] [LNOM08] [LTHT14] [LTT+ 15a] [LTT+ 15b] [LWT+ 16] Kim, D., Renganarayanan, L., Rostron, D., Rajopadhye, S., & Strout, M M (2007) Multi-level tiling: M for the price of one In SC’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2007 (pp 1–12) Klues, K., Rhoden, B., Zhu, Y., Waterman, A., & Brewer, E (2010) Processes and resource management in a scalable many-core OS In HotPar10, Berkeley, CA, 2010 Kissler, D, Strawetz, A., Hannig, F., & Teich, J (2009) Power-efficient reconfiguration control in coarse-grained dynamically reconfigurable architectures Journal of Low Power Electronics, 5(1), 96–105 Kalla, R., Sinharoy, B., Starke, W J., & Floyd, M (2010) Power7: IBM’s nextgeneration server processor IEEE Micro, 30(2), 7–15 Kupriyanov, O (2009) Modeling and Efficient Simulation of Complex Systemon-a-Chip Architectures PhD thesis, Friedrich-Alexander-Universität ErlangenNürnberg, Germany Khudia, D S., Wright, G., & Mahlke, S (2012) Efficient soft error protection for commodity embedded microprocessors using profile information In ACM SIGPLAN Notices (Vol 47, pp 99–108) New York: ACM Lamport, L (1974) The parallel execution of loops Communications of the ACM, 17(2), 83–93 Lari, V (2016) Invasive tightly coupled processor arrays In Springer Book Series on Computer Architecture and Design Methodologies Berlin: Springer ISBN: 978-981-10-1058-3 Lindenmaier, G., Beck, M., Boesler, B., & Geiß, R (2005) FIRM, An Intermediate Language for Compiler Research Technical Report 2005-8, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany Leem, L., Cho, H., Bau, J., Jacobson, Q A., & Mitra, S (2010) Ersa: Error resilient system architecture for probabilistic applications In Design, Automation Test in Europe Conference 
Exhibition (DATE), 2010 (pp 1560–1565) Lengauer, C (1993) Loop parallelization in the polytope model In CONCUR (Vol 715, pp 398–416) Lindenmaier, G (2006) libFIRM – A Library for Compiler Optimization Research Implementing FIRM Technical Report 2002-5, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany Lari, V., Narovlyanskyy, A., Hannig, F., & Teich, J (2011) Decentralized dynamic resource management support for massively parallel processor arrays In Proceedings of the 22nd IEEE International Conference on Applicationspecific Systems, Architectures, and Processors (ASAP), Santa Monica, CA, USA, September 2011 Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J (2008) NVIDIA Tesla: A unified graphics and computing architecture IEEE Micro, 28(2), 39–55 Lari, V., Tanase, A., Hannig, F., & Teich, J (2014) Massively parallel processor architectures for resource-aware computing In Proceedings of the First Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing) (pp 1–7) Lari, V., Tanase, A., Teich, J., Witterauf, M., Khosravi, F., Hannig, F., et al (2015) A co-design approach for fault-tolerant loop execution on coarse-grained reconfigurable arrays In Proceedings of the 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS) (pp 1–8) New York: IEEE Lari, V., Teich, J., Tanase, A., Witterauf, M., Khosravi, F., & Meyer, B H (2015) Techniques for on-demand structural redundancy for massively parallel processor arrays Journal of Systems Architecture, 61(10), 615–627 Lari, V., Weichslgartner, A., Tanase, A., Witterauf, M., Khosravi, F., Teich, J., et al (2016) Providing fault tolerance through invasive computing Information Technology, 58(6), 309–328 Bibliography [LY07] [LYLW13] [MCLS11] [MEFS97] [MF86] [MJU+ 09] [MKR02] [Moo65] [Mot02] [Muk08] [Mun12] [MZW+ 06] [Nel90] [Nic99] [OSK+ 11] [OSM02] [Pra89] [PBD01] [Rao85] [Rau94] 165 Li, X., & Yeung, D (2007) Application-level correctness and its impact on fault tolerance In IEEE 
13th International Symposium on High Performance Computer Architecture, HPCA 2007 (pp 181–192) Liu, D., Yin, S., Liu, L., & Wei, S (2013) Polyhedral model based mapping optimization of loop nests for CGRAs In Proceedings of the Design Automation Conference (DAC) (pp 1–8) New York: IEEE Meyer, B H., Calhoun, B., Lach, J., & Skadron, K (2011) Cost-effective safety and fault localization using distributed temporal redundancy In CASES’11, October 2011 Merker, R., Eckhardt, U., Fimmel, D., & Schreiber, H (1997) A system for designing parallel processor arrays Computer Aided Systems Theory— EUROCAST’97 (pp 1–12) Moldovan, D I., & Fortes, J A B (1986) Partitioning and mapping algorithms into fixed size systolic arrays IEEE Transactions on Computers, C-35(1), 1–12 Mehrara, M., Jablin, T B., Upton, D., August, D I., Hazelwood, K., & Mahlke, S (2009) Compilation strategies and challenges for multicore signal processing IEEE Signal Processing Magazine, 26(6), 55–63 Mukherjee, S S., Kontz, M., & Reinhardt, S K (2002) Detailed design and evaluation of redundant multi-threading alternatives In Proceedings of the 29th Annual International Symposium on Computer Architecture’02 (pp 99–110) New York: IEEE Moore, G E (1965) Cramming more components onto integrated circuits Electronics, 38(8), 114–117 Motomura, M (2002) A dynamically reconfigurable processor architecture In Microprocessor Forum, San Jose, CA, USA, October 2002 Mukherjee, S (2008) Architecture design for soft errors Burlington, MA, USA: Morgan-Kaufmann Munshi, A (2012) The OpenCL Specification Version 1.2 Khronos OpenCL Working Group Mitra, S., Zhang, M., Waqas, S., Seifert, N., Gill, B., & Kim, K S (2006) Combinational logic soft error correction In IEEE International Test Conference, 2006 ITC’06 (pp 1–9) New York: IEEE Nelson, V P (1990) Fault-tolerant computing: Fundamental concepts Computer, 23(7), 19–25 Nicolaidis, M (1999) Time redundancy based soft-error tolerance to rescue nanometer technologies In Proceedings 
of the 17th IEEE, VLSI Test Symposium (pp 86–94) New York: IEEE Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., et al (2011) OctoPOS: A parallel operating system for invasive computing In R McIlroy, J Sventek, T Harris, & T Roscoe (Eds.), Proceedings of the International Workshop on Systems for Future Multi-Core Architectures (SFMA) USB Proceedings of Sixth International ACM/EuroSys European Conference on Computer Systems (EuroSys), EuroSys, 2011 (pp 9–14) Oh, N., Shirvani, P P., & McCluskey, E J (2002) Error detection by duplicated instructions in super-scalar processors IEEE Transactions on Reliability, 51(1), 63–75 Prasad, V B (1989) Fault tolerant digital systems IEEE Potentials, 8(1), 17–21 Punnekkat, S., Burns, A., & Davis, R (2001) Analysis of checkpointing for realtime systems Real-Time Systems, 20(1), 83–102 Rao, S K (1985) Regular Iterative Algorithms and Their Implementations on Processor Arrays PhD thesis, Stanford University Rau, B R (1994) Iterative modulo scheduling: An algorithm for software pipelining loops In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO), San Jose, CA, USA, November 1994 (pp 63–74) 166 Bibliography [RCV+ 05] [RG81] [RHMDR07] [RKRS07] [RKSR10] [RT99] [RTG+ 07] [Rup15] [RWZ88] [SF91] [SGFH06] [SHT15] [SHTT14] [SSE+ 11] [SSM+ 11] Reis, G A., Chang, J., Vachharajani, N., Rangan, R., & August, D I (2005) Swift: Software implemented fault tolerance In Proceedings of the International Symposium on Code Generation and Optimization (pp 243–254) Washington, DC, USA: IEEE Computer Society Rau, B R., & Glaeser, C D (1981) Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing SIGMICRO Newsletter, 12(4), 183–198 Renganarayana, L., Harthikote-Matha, M., Dewri, R., & Rajopadhye, S (2007) Towards optimal multi-level tiling for stencil computations In IEEE International Parallel and Distributed Processing Symposium, 
2007 IPDPS 2007 (pp 1–10) New York: IEEE Renganarayanan, L., Kim, D., Rajopadhye, S., & Strout, M M (2007) Parameterized tiled loops for free In Proceeding of the Conference on Programming Language Design and Implementation, San Diego, CA, USA, 2007 (pp 405–414) Renganarayanan, L., Kim, D., Strout, M M., Rajopadhye, S (2010) Parameterized loop tiling ACM Transactions on Programming Languages and Systems (pp 3:1–3:41) Rivera, G., & Tseng, C.-W (1999) Locality optimizations for multi-level caches In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (p 2) New York: ACM Rong, H., Tang, Z., Govindarajan, R., Douillet, A., & Gao, G R (2007) Singledimension software pipelining for multidimensional loops ACM Transactions on Architecture and Code Optimization (TACO), 4(1), 7:1–7:44 Rupp, K (2015) 40 years of microprocessor trend data https://www.karlrupp net/2015/06/40-years-of-microprocessor-trend-data/ Rosen, B K., Wegman, M N., & Zadeck, F K (1988) Global value numbers and redundant computations In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL’88, New York, NY, USA (pp 12–27) Shang, W., & Fortes, J A B (1991) Time optimal linear schedules for algorithms with uniform dependencies IEEE Transactions on Computers, 40(6), 723–742 Smolens, J C., Gold, B T., Falsafi, B., & Hoe, J C (2006) Reunion: Complexityeffective multicore redundancy In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (pp 223–234) Washington, DC, USA: IEEE Computer Society Sousa, E R., Hannig, F., & Teich, J (2015) Reconfigurable buffer structures for coarse-grained reconfigurable arrays In Proceedings of the 5th IFIP International Embedded Systems Symposium (IESS) Lecture notes in computer science Berlin: Springer Schmid, M., Hannig, F., Tanase, A., & Teich, J (2014) High-level synthesis revised – Generation of FPGA accelerators from a domain-specific language using the polyhedral model In Parallel 
Computing: Accelerating Computational Science and Engineering (CSE), Advances in Parallel Computing (Vol 25, pp 497–506) Amsterdam, The Netherlands: IOS Press Schweizer, T., Schlicker, P., Eisenhardt, S., Kuhn, T., & Rosenstiel, W (2011) Low-cost TMR for fault-tolerance on coarse-grained reconfigurable architectures In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig) (pp 135–140) New York: IEEE Saripalli, V., Sun, G., Mishra, A., Xie, Y., Datta, S., & Narayanan, V (2011) Exploiting heterogeneity for energy efficiency in chip multiprocessors IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 1(2), 109–119 Bibliography [STB+ 14] [STHT13a] [STHT13b] [STL+ 13] [Sut05] [TAB+ 97] [Tei93] [Tei08] [TGR+ 16] [THB+ 10] [THH+ 11] [Thi88] [Thi89] [THT12] [Til13] [TLHT13] 167 Schmid, M., Tanase, A., Bhadouria, V S., Hannig, F., Teich, J., & Ghoshal, D (2014) Domain-specific augmentations for high-level synthesis In Proceedings of the 25th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp 173–177) New York: IEEE Sousa, E R., Tanase, A., Hannig, F., & Teich, J (2013) A prototype of an adaptive computer vision algorithm on MPSoC architecture In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP), October 2013 (pp 361–362) ECSI Media Sousa, E R., Tanase, A., Hannig, F., & Teich, J (2013) Accuracy and performance analysis of Harris corner computation on tightly-coupled processor arrays In Proceedings of the Conference on Design and Architectures for Signal and Image Processing (DASIP) (pp 88–95) New York: IEEE Sousa, E R., Tanase, A., Lari, V., Hannig, F., Teich, J., Paul, J., et al (2013) Acceleration of optical flow computations on tightly-coupled processor arrays In Proceedings of the 25th Workshop on Parallel Systems and Algorithms (PARS), Mitteilungen – Gesellschaft für Informatik e V., Parallel-Algorithmen und 
Rechnerstrukturen (Vol 30, pp 80–89) Gesellschaft für Informatik e V Sutter, H (2005) The free lunch is over: A fundamental turn toward concurrency in software Dr Dobb’s Journal, 30(3), 202–210 Tylka, A J., Adams, J H., Boberg, P R., Brownstein, B., Dietrich, W F., Flueckiger, E O., et al (1997) CREME96: A revision of the cosmic ray effects on micro-electronics code IEEE Transactions on Nuclear Science, 44(6), 2150– 2160 Teich, J (1993) A compiler for application specific processor arrays Reihe Elektrotechnik Freiburg, Germany: Shaker ISBN: 9783861117018 Teich, J (2008) Invasive algorithms and architectures Information Technology, 50(5), 300–310 Teich, J., Gl, M., Roloff, S., Schrưder-Preikschat, W., Snelting, G., Weichslgartner, A., et al (2016) Language and compilation of parallel programs for *-predictable MPSoC execution using invasive computing In 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) (pp 313–320) Tavarageri, S., Hartono, A., Baskaran, M., Pouchet, L.-N., Ramanujam, J., & Sadayappan, P (2010) Parametric tiling of affine loop nests In 15th Workshop on Compilers for Parallel Computing (CPC), Vienna, Austria, July 2010 Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., SchröderPreikschat, W., & Snelting, G (2011) Multiprocessor System-on-Chip: Hardware Design and Tool Integration Invasive computing: An overview (Chap 11, pp 241– 268) Berlin: Springer Thiele, L (1988) On the hierarchical design of vlsi processor arrays In IEEE International Symposium on Circuits and Systems, 1988 (pp 2517–2520) New York: IEEE Thiele, L (1989) On the design of piecewise regular processor arrays In IEEE International Symposium on Circuits and Systems (Vol 3, pp 2239–2242) Tanase, A., Hannig, F., & Teich, J (2012) Symbolic loop parallelization of static control programs In Advanced Computer Architecture and Compilation for HighPerformance and Embedded Systems (ACACES) (pp 33–36) Tilera Corporation (2013) 
http://www.tilera.com Tanase, A., Lari, V., Hannig, F., & Teich, J (2012) Exploitation of quality/throughput tradeoffs in image processing through invasive computing In Proceedings of the International Conference on Parallel Computing (ParCo) (pp 53–62) 168 Bibliography [TP13] [TR91] [TT91] [TT93] [TT96] [TT02] [TTH13] [TTH14] [TTZ96] [TTZ97a] [TTZ97b] [TWOSP12] [TWS+ 16] [TWT+ 15] [TWTH14] [TWTH15] Thomas, A., & Pattabiraman, K (2013) Error detector placement for soft computation In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (pp 1–12) Thiele, L., & Roychowdhury, V P (1991) Systematic design of local processor arrays for numerical algorithms In Proceedings of the International Workshop on Algorithms and Parallel VLSI Architectures, Amsterdam, The Netherlands, 1991 (Vol A: Tutorials, pp 329–339) Teich, J., & Thiele, L (1991) Control generation in the design of processor arrays Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 3(1), 77–92 Teich, J., & Thiele, L (1993) Partitioning of processor arrays: A piecewise regular approach Integration-The Vlsi Journal,14(3), 297–332 Teich, J., & Thiele, L (1996) A new approach to solving resource-constrained scheduling problems based on a flow-model Technical Report 17, TIK, Swiss Federal Institute of Technology (ETH) Zürich Teich, J., & Thiele, L (2002) Exact partitioning of affine dependence algorithms In Embedded Processor Design Challenges Lecture notes in computer science (Vol 2268, pp 135–151) Berlin, Germany: Springer Teich, J., Tanase, A., & Hannig, F (2013) Symbolic parallelization of loop programs for massively parallel processor arrays In Proceedings of the 24th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp 1–9) New York: IEEE Best Paper Award Teich, J., Tanase, A., & Hannig, F (2014) Symbolic mapping of loop programs onto processor arrays Journal of Signal Processing Systems, 
77(1–2), 31–59.
[TTZ96] Teich, J., Thiele, L., & Zhang, L. (1996). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, ASAP'96 (p. 131). Washington, DC, USA: IEEE Computer Society.
[TTZ97a] Teich, J., Thiele, L., & Zhang, L. (1997). Scheduling of partitioned regular algorithms on processor arrays with constrained resources. Journal of VLSI Signal Processing, 17(1), 5–20.
[TTZ97b] Teich, J., Thiele, L., & Zhang, L. (1997). Partitioning processor arrays under resource constraints. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 17, 5–20.
[TWOSP12] Teich, J., Weichslgartner, A., Oechslein, B., & Schröder-Preikschat, W. (2012). Invasive computing - concepts and overheads. In Proceedings of the 2012 Forum on Specification and Design Languages (pp. 217–224).
[TWS+16] Tanase, A., Witterauf, M., Sousa, É. R., Lari, V., Hannig, F., & Teich, J. (2016). LoopInvader: A compiler for tightly coupled processor arrays. Tool presentation at the University Booth at Design, Automation and Test in Europe (DATE), Dresden, Germany.
[TWT+15] Tanase, A., Witterauf, M., Teich, J., Hannig, F., & Lari, V. (2015). On-demand fault-tolerant loop processing on massively parallel processor arrays. In Proceedings of the 26th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP) (pp. 194–201). New York: IEEE.
[TWTH14] Tanase, A., Witterauf, M., Teich, J., & Hannig, F. (2014). Symbolic inner loop parallelisation for massively parallel processor arrays. In Proceedings of the 12th ACM-IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE) (pp. 219–228).
[TWTH15] Tanase, A., Witterauf, M., Teich, J., & Hannig, F. (2015). Symbolic loop parallelization for balancing I/O and memory accesses on processor arrays. In Proceedings of the 13th ACM-IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE) (pp. 188–197). New York: IEEE.
[TWTH17] Tanase, A., Witterauf, M., Teich, J., & Hannig, F. (2017). Symbolic multi-level loop mapping of loop programs for massively parallel processor arrays. ACM Transactions on Embedded Computing Systems, 17(2), 31:1–31:27.
[Ver10] Verdoolaege, S. (2010). ISL: An integer set library for the polyhedral model. In Proceedings of the Third International Congress Conference on Mathematical Software (ICMS), Kobe, Japan, 2010 (pp. 299–302). Berlin: Springer.
[VG12] Verdoolaege, S., & Grosser, T. (2012). Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT'12), Paris, France.
[WBB+16] Wildermann, S., Bader, M., Bauer, L., Damschen, M., Gabriel, D., Gerndt, M., et al. (2016). Invasive computing for timing-predictable stream processing on MPSoCs. Information Technology, 58(6), 267–280.
[Wol96] Wolfe, M. J. (1996). High performance compilers for parallel computing. Boston, MA, USA: Addison-Wesley.
[WTHT16] Witterauf, M., Tanase, A., Hannig, F., & Teich, J. (2016). Modulo scheduling of symbolically tiled loops for tightly coupled processor arrays. In Proceedings of the 27th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP) (pp. 58–66). New York: IEEE.
[WTT+15] Witterauf, M., Tanase, A., Teich, J., Lari, V., Zwinkau, A., & Snelting, G. (2015). Adaptive fault tolerance through invasive computing. In Proceedings of the 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS) (pp. 1–8). New York: IEEE.
[Xue97] Xue, J. (1997). On tiling as a loop transformation. Parallel Processing Letters, 7(4), 409–424.
[Xue00] Xue, J. (2000). Loop tiling for parallelism. Norwell, MA, USA: Kluwer Academic Publishers.
[YI95] Yang, T., & Ibarra, O. H. (1995). On symbolic scheduling and parallel complexity of loops. In Proceedings of the IEEE Symposium on Parallel and Distributed Processing (pp. 360–367).
[YR13] Yuki, T., & Rajopadhye, S. (2013). Parametrically tiled distributed memory parallelization of
polyhedral programs. Technical Report, Citeseer.
[ZA01] Zimmermann, K.-H., & Achtziger, W. (2001). Optimal piecewise linear schedules for LSGP- and LPGS-decomposed array processors via quadratic programming. Computer Physics Communications, 139(1), 64–89.
[Zim97] Zimmermann, K.-H. (1997). A unifying lattice-based approach for the partitioning of systolic arrays via LPGS and LSGP. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 17(1), 21–41.

Index

A
Accelerators, 3, 155
Address Generators (AG), 16, 36
Affine transformations, 27, 34, 91
Anti-lock Braking System (ABS),

C
Central Processing Unit (CPU),
Chip multiprocessors (CMPs), 151
Coarse-Grained Reconfigurable Arrays (CGRAs), 2, 151
Code generation, 34
Common sub-expression elimination, 28
Compilers
  code generation, 34
  compilation flow
    compilation branches, 18–19
    libFIRM, 19
    LoopInvader, 18, 20
    loop programs, 19
    SCOP, 20
  front end
    DPLA, 22
    PAULA code, 21–22
    single assignment property, 21
    task, 20–21
  GC and AG configuration stream, 36
  Interconnect Network Configuration, 35
  loop specification, polyhedron model
    indexing function, 23
    iteration space, 23–26
    Piecewise Linear Algorithms (PLAs), 23
    Regular Iterative Algorithms (RIA), 22
    System of Uniform Recurrence Equations (SURE), 22
    Uniform Dependence Algorithm (UDA), 23
  PARO tool, 26–33
  PAULA language, 25–26
  PE code generation, 35
  space-time mapping, 33–34
  static loop tiling, 29–32
Compute Unified Device Architecture (CUDA), 86
Configuration and Communication Processor, 76, 77
Configuration Manager, 17
Constant and variable propagation, 27
Control/data flow graphs (CDFGs), 91

D
Data dependencies, embedding of, 39
  FIR filter loop specification, 41, 44–45
  intra-tile and inter-tile dependency vectors, 42–43
  non-zero elements, 44
  short dependency, 42
  symbolic hierarchical tiling, 97–100
Dead-code elimination, 27
Dependency vector, 42–44, 56–59, 64, 67, 83, 97, 99, 103, 127, 138
DMR, see Dual Modular Redundancy (DMR)
DPLAs, see Dynamic Piecewise Linear Algorithms (DPLAs)
Dual Modular Redundancy (DMR), 6, 7, 123, 128, 131, 140, 142, 143, 147–149, 157
Dynamic compilation techniques, 3, 155, 156
Dynamic Piecewise Linear Algorithms (DPLAs), 22, 26

E
Epiphany processor,
Error Detection by Duplicated Instructions (EDDI), 151
Error Detection Latency (EDL), 124, 131, 134–137, 139, 149, 152

F
Fault Containment Region (FCR), 137, 139
Fault-tolerant loop processing, 4, 157
  adaptive fault tolerance, invasive computing
    claimRet, 145
    CREME96, 143
    fault diagnosis, 142
    fault isolation, 142
    invade, infect and retreat, 143
    InvadeX10 code, 144
    PFH, 143
    reconfiguration, 142
    reinitialization, 142
    reliability analysis for, 145–146
    SERs, 143
    SIL, 142
    TCPA, 145
  architectural-level hardware redundancy techniques, 150
  automatic reliability analysis, 126
  CGRAs, 151
  CMPs, 151
  combinational circuits, 150
  DMR, 6, 123, 126, 152
  early voting, 136–138
  EDDI, 151
  error detection latency, experimental results
    average error detection latency, 149–150
    latency overhead, 146–149
  fundamentals and fault model, 124–126
  immediate voting, 134–136
  late voting, 138–140
  logic-level circuit hardening, 150
  logic-level hardware redundancy, 151
  loop replication, redundancy, 6, 123, 126–130
  massively parallel processor arrays, 151
  MPSoCs, 123
  multidimensional processor arrays, 152
  one-dimensional processor arrays, 151
  outer loop parallelization, 123
  PARO, compiler tool, 123
  processor array, 123, 126
  redundant combinatorial circuits, 150
  redundant instructions, 151
  reliability, 6, 124
  reunion approach, 151
  self-checking circuits, 150
  software-based fault tolerance, 151
  structural redundancy, 127
  SWIFT, 151
  system-level hardware redundancy, cost of, 150
  TCPA, 123, 126
  TMR, 6, 123, 126, 152
  VLIW, 151, 152
  voting functions implementation, 140–142
  voting insertion, 6, 123, 130–132

G
General purpose computation on graphics processing unit (GPGPU), 26
Global Controller (GC), 16, 36
Graphics Processing Unit (GPU), 86, 155

I
ILP, see Integer Linear Program (ILP)
Infeasible scanning matrices, 110
Integer Linear Program (ILP), 77, 100, 112, 113
Integer programming, 66
Interconnect Network Configuration, 35
InvadeX10, 10, 11, 18, 20, 144
Invasion Controller, 17
Invasion Manager (IM), 17, 76
Invasive computing, 3, 6, 37, 92, 155
  adaptive fault tolerance
    claimRet, 145
    CREME96, 143
    fault diagnosis, 142
    fault isolation, 142
    invade, infect and retreat, 143
    InvadeX10 code, 144
    PFH, 143
    reconfiguration, 142
    reinitialization, 142
    reliability analysis for, 145–146
    SERs, 143
    SIL, 142
    TCPA, 145
  constraints, 12
  definition, 10
  infect method, 11, 12
  invade method, 11
  InvadeX10, 10–11
  matmul method, 12
  OctoPOS, 10
  requirements, 12
  retreat method, 12
  state chart, 10–11
Invasive programming, 10
Invasive tightly coupled processor arrays, see Tightly Coupled Processor Array (TCPA)
Iteration space
  decomposition, 39–41, 95–97
  definition, 23
  dependency graph, 24–25
  iteration vector, 23
  PAULA language, 25
  UDA specification, 25

J
Just-in-time compilation, 37

L
Latency-minimal sequential schedule vectors
  coordinates of, 102
  data dependency, 103, 106
  determination of, 100
  iterations sequential execution order, 101, 102
  multi-level tiling, 103
  positive linear combination, 103
  schedule inequalities, 105
  stride matrix, 101, 102, 106
libFIRM, 19
Linear schedules, 46
Localization, 25, 28–29
Locally Parallel Globally Sequential (LPGS), see Symbolic inner loop parallelization
Locally Sequential Globally Parallel (LSGP), see Symbolic outer loop parallelization
Lock step, 130, 152
LoopInvader, 18–21
Loop perfectization, 27
Loop programs, 19
Loop replication, 6, 126–130, 157
Loop unrolling, 27

M
Many Integrated Core (MIC),
Massively parallel processor array, 2,
Mixed compile/runtime approach, 94
Moore's law, 1,
MPSoCs, see Multi-Processor System-on-Chips (MPSoCs)
Multi-level parallelization, see Symbolic multi-level parallelization
Multi-level tiling, 103
Multiple symbolic tiling matrices, 94
Multi-Processor System-on-Chips (MPSoCs), 1, 3, 13, 37, 122, 123, 155

O
On-demand fault-tolerant loop processing, see Fault-tolerant loop processing
On-demand redundancy technique, 150
OpenMP schedule, 86

P
Parallel computing, 22
Parametric latency formula
  symbolic LPGS schedule vectors, 66, 71–74
  symbolic LSGP schedule vectors
    input space, 60
    minimal latency-determining first and last tile, 48, 61–63
    output space, 60–61
  symbolic multi-level parallelization, 108–110
PARO tool
  design flow, 27–28
  high-level transformations, 27–28
  localization, 28–29
  on-demand fault-tolerant loop processing, 123
  static loop tiling, 29–32
  static scheduling, 32–33
  uses, 26–27
PAULA language, 25–26
PE code generation, 35
Piecewise Linear Algorithms (PLAs), 23–26
PLAs, see Piecewise Linear Algorithms (PLAs)
Polyhedron model, 4, 22–26
Power7 chip,
Probability of Failure per Hour (PFH), 143, 153
Processing Elements (PEs), 37, 42, 76, 77, 123, 124, 128, 147, 151, 152, 157
Processor arrays, 14–15, 109, 112, 113, 115
Processors
  applications,
  architectures,
  evolution of, 1,
  invasive computing,
Program block control graph, 35
Programming models, 6, 26

R
Reduced Dependency Graph (RDG), 26
Redundancy
  DMR, 6, 7, 123, 128, 131, 140, 142, 143, 147–149, 157
  on-demand redundancy technique, 150
  TMR, 6, see also Fault-tolerant loop processing
Regular Iterative Algorithms (RIA), 22
Reliability analysis, 6, 124, 126, 145–146
Replicated loop program, 126–130
Runtime schedule selection
  on invasive TCPAs, 76–77
  symbolic inner loop parallelization, 66, 74–75
  symbolic multi-level parallelization, 110–111
  symbolic outer loop parallelization, 48, 63–65

S
Safety Integrity Level (SIL), 142, 143
Scalable Processor Architecture (SPARC) processors, 18–19
Scheduling, 32–33
SCOPs, see Static Control Parts (SCOPs)
Self-checking circuit, 150
SEU, see Single-event upset (SEU)
Single-Chip Cloud Computer (SCC),
Single-event upset (SEU), 3, 124, 126, 135, 146
Soft Error Rate (SER), 124, 143, 157
Software Implemented Fault Tolerance (SWIFT), 151
Space-time mapping, 33–34
Static Control Parts (SCOPs), 20
Static loop tiling, 29–32, 37
Static scheduling, 32–33
Static Single Assignment (SSA), 19
Streaming multiprocessors, 86
Streaming processors, 86
Strip-mine and interchange tiling, 121
Symbolic hierarchical scheduling
  latency-minimal sequential schedule vectors
    coordinates of, 102
    data dependency, 103, 106
    determination of, 100
    iterations sequential execution order, 101, 102
    positive linear combination, 103
    schedule inequalities, 105
    stride matrix, 101, 102, 106
  parametric latency formula, 101, 108–110
  runtime schedule selection, 101, 110–111
  scheduling algorithms, 111
  tight parallel schedule vectors, 101, 106–107
  UDA, 100
Symbolic hierarchical tiling
  data dependencies, embedding of, 97–100
  iteration space decomposition, 95–97
  UDA, 94
Symbolic inner loop parallelization, 4–5, 38, 45, 91, 92, 115–117, 121, 156, 157
  CPU times, evaluation of, 85
  evaluation, optimal runtime schedule candidates, 86, 89–90
  I/O bandwidth demand, 82, 83
  iterations within a tile, 66
  latency, 78–82
  local memory demand, 83–84
  maximum number of symbolic schedules, 84, 85
  overview of, 46
  parametric latency formula, 66, 71–74
  runtime schedule selection, 66, 74–75
  tight inter-tile schedule vector candidates, 66, 68–71
  tight intra-tile schedule vectors, 66–68
Symbolic multi-level parallelization,
  arbitrary polyhedral iteration spaces, 121
  experimental results
    I/O and memory balancing, 115, 116
    latency, 112–114
    scalability, 115, 117
  LPGS mapping technique, 117, 121
  LSGP mapping technique, 117
  massively parallel distributed memory processor arrays, 121
  strip-mine and interchange tiling, 121
  symbolic hierarchical scheduling, 5,
    latency-minimal sequential schedule vectors, 101–106
    parametric latency formula, 108–110
    runtime schedule selection, 110–111
    tight parallel schedule vectors, 106–107
  symbolic hierarchical tiling
    data dependencies, embedding of, 97–100
    iteration space decomposition, 95–97
  symbolic tiled loops, 121
  two-level hierarchical tiling, 118–120
Symbolic multi-level schedule vectors, 101, 109–110
Symbolic outer loop parallelization, 4–5, 38, 45, 91, 92, 115–117, 156, 157
  CPU times, optimal runtime schedule candidates, 85–88
  feasible schedules, 46–47
  intra-tile and inter-tile schedule, 47
  I/O bandwidth demand, 82, 83
  latency, 78, 80–82
  linear schedules, 46
  local memory demand, 82–84
  maximum number of symbolic schedules, 84, 85
  overview of, 46
  parametric latency formula
    input space, 60
    minimal latency-determining first and last tile, 48, 61–63
    output space, 60–61
  runtime schedule selection, 48, 63–65
  tight inter-tile schedule vectors, 47–48, 54–60
  tight intra-tile schedule vector candidates, 47–54
    feasible stride matrices, 51–54
    intra-tile LSGP schedule bound, 51
    intra-tile LSGP schedule construction, 49–50
    path stride matrix, 48–50
Symbolic parallelization, 156, 185
  LPGS (see Symbolic inner loop parallelization)
  LSGP (see Symbolic outer loop parallelization)
  for two-level hierarchical tiling, 118–120
Symbolic scheduling
  CDFGs, resource-constrained scheduling of, 91
  partitioned loop program, 91
  symbolic inner loop parallelization, 4–5
    latency, 78–82
    parametric latency formula, 66, 71–74
    runtime schedule selection, 66, 74–75
    tight inter-tile schedule vector candidates, 66, 68–71
    tight intra-tile schedule vectors, 66–68
  symbolic outer loop parallelization, 4–5, 38
    feasible schedules, 46–47
    intra-tile and inter-tile schedule, 47
    latency, 78, 80–82
    linear schedules, 46
    parametric latency formula, 48, 60–63
    runtime schedule selection, 48, 63–65
    tight inter-tile schedule vectors, 47–48, 54–60
    tight intra-tile schedule vector candidates, 47–54
Symbolic tiling, 5, 86, 156
  choosing optimal tile sizes, 39
  data dependencies, embedding of, 39
    FIR filter loop specification, 41, 44–45
    intra-tile and inter-tile dependency vectors, 42–43
    non-zero elements, 44
    short dependency, 42
  for exposing coarse-grained parallelism, 39
  for high-level optimizations, 38
  iteration space decomposition
    FIR filter, data dependencies, 41
    perfect tilings, 41
    rectangular iteration space, 40
    UDAs, 39–41
  LPGS (see Symbolic inner loop parallelization)
  LSGP (see Symbolic outer loop parallelization)
  massively parallel architectures, 38
  tiling matrix, 39
System of Uniform Recurrence Equations (SURE), 22

T
TCPA, see Tightly Coupled Processor Array (TCPA)
Tight inter-tile schedule vector candidates, 66, 68–71
Tight intra-tile schedule vector candidates, 47
  feasible stride matrices, 51–54
  intra-tile LSGP schedule bound, 51
  intra-tile LSGP schedule construction, 49–50
  path stride matrix, 48–50
Tightly Coupled Processor Array (TCPA), 2–3, 37, 42, 113, 115, 123, 127, 128, 140, 143, 155
  architecture, 13–14
  array interconnect, 15–16
  peripherals
    Address Generators and I/O buffers, 16
    Configuration and Communication Processor, 17
    Configuration Manager, 17
    Global Controller, 16
    Invasion Controller, 17
    Invasion Managers, 17
  processor array, 14–15
  runtime schedule selection, 76–77
Tight parallel schedule vectors, 106–107
TILEPro 32-bit processor,
Tiling matrix, 39, 40, 94, 101, 122
Transistors, 1,
Triple Modular Redundancy (TMR), 6, 7, 123, 128, 131, 140, 141, 143, 148, 149, 157
Two-level hierarchical tiling, 118–120

U
UDA, see Uniform Dependence Algorithm (UDA)
Uniform Dependence Algorithm (UDA), 23, 25, 29, 39–42, 44, 45, 47, 49, 51, 54, 56, 59–61, 68, 69, 71, 72, 82, 83, 94, 96–102, 106, 108, 111, 127, 128, 131, 139, 146
Uniform Dependence Algorithms (UDAs), 39–42, 44

V
Very Long Instruction Words (VLIWs), 2, 124, 140, 142, 151, 152
Voting insertion, 6, 123, 130–132, 157

X
Xeon Phi coprocessor series,
…framework for mapping nested loop programs onto TCPAs. We also present the fundamentals in terms of how to specify nested loop programs and the considered class of nested loops in the polyhedron… variables of lower dimension into a common iteration space. Loop perfectization: Loop perfectization transforms a non-perfectly nested loop program into a perfectly nested one [Xue97]. Loop unrolling: Loop