Advanced Computer Architecture - Lecture 30: Memory hierarchy design. This lecture will cover the following: cache performance enhancement; reducing miss rate; classification of cache misses; reducing cache miss rate; way prediction and pseudo-associativity; compiler optimization;...
CS 704 Advanced Computer Architecture Lecture 30 Memory Hierarchy Design Cache Performance Enhancement (Reducing Miss Rate) Prof Dr M Ashraf Chughtai Today’s Topics Recap: Reducing Miss Penalty Classification of Cache Misses Reducing Cache Miss Rate Summary MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Improving Cache Performance ─ ─ ─ The miss penalty The miss rate The miss Penalty or miss rate via Parallelism ─ The time to hit in the cache MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Reducing Miss Penalty – Multilevel Caches – Critical Word first and Early Restart – Priority to Read Misses Over writes – Merging Write Buffers – Victim Caches MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Reducing Miss Penalty ‘Multi level caches’ ‘The more the merrier MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Reducing Miss Penalty “ Critical Word First and Early Restart’ intolerance Reduces miss-penalty MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Reducing Miss Penalty ‘Priority to read miss over the write miss’ Favoritism MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Reducing Miss Penalty ‘Merging write-buffer,’ Acquaintance Victim cache Salvage MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Recap: Reducing Miss Penalty Reduces miss penalty Multi level caches Reduces miss rate Cache-misses Methods to reduce the miss rate MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) Cache Misses – Compulsory Misses (cold start or first reference misses) – Capacity Misses – Conflict Misses (collision or interference misses) MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 10 5: Compiler Optimization Data misses are reduced Spatial locality Temporal locality Array calculation MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 37 5: Compiler Optimization – loop interchange – blocking MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 38 5: Compiler Optimization: Loop Interchange • program having nested loops that access data in non-sequential order for j (0100) and in sequential order for i (05000) MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 39 5: Compiler Optimization: Using Loop Interchange First Version: or (k = 0; k < 100; k = k+1) for (j = 0; j < 100; j = j+1) for (i = 0; i < 5000; i = i+1) x[i] [j] = * x[i] [j]; MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 40 5: Compiler Optimization Reorderd version: for (k = 0; k < 100; k = k+1) for (i = 0; i < 5000; i = i+1) for (j = 0; j < 100; j = j+1) x[i][j] = * x[i][j]; MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 41 5: Compiler Optimization: Using Blocking Example of improving Temporal Locality program to perform matrix multiplication MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 42 5: Compiler Optimization: Using Blocking ‘Row major order’ (row-by-row) ‘Column major order’ Iteration for matrix multiplication MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 43 5: Compiler Optimization: Using Blocking /* Initial version of matrix multiplication code */ for (i = 0; i < N; i = i+1) for (j = 0; j < N; j = j+1) {r = 0; for (k = 0; k < N; k = k+1) { r = r + y[i][k]*z[k][j]; }; x[i][j] = r; }; MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 44 Whereas, if the cache size is small and it can hold one N x N matrix and one row of N elements, then the full matrix Z and one ( i th ) row of Y can stay in the cache Else the misses may occur for both the x and z arrays MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 45 5: Compiler Optimization: Using Blocking B is chosen such that one row of B and one B x B matrix can fit in in cache This ensures that the y and z blocks are resident on cache Let us have a look on to the modified code which shows that the two inner loops now compute in steps of size B (blocking factor) rather than the full N x N size of arrays X and Z MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 46 5: Compiler Optimization: Using Blocking MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 47 Summary • Large block size to reduce compulsory misses • Large cache size to reduce capacity misses • Higher associativity to reduce conflict misses MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 48 Summary The way-prediction techniques checks a section of cache for hit and then on miss it checks the rest of the cache The final technique – loop interchange and blocking, is a software approach to optimize the cache performance Next time we will talk about the way to enhance performance by having processor and memory operate in parallel – till then MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 49 Example: Avg Memory Access Time vs Miss Rate Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs CCT direct mapped Cache Size (KB) 1-way 2.33 1.98 1.72 1.46 16 1.29 32 1.20 64 1.14 MAC/VU-Advanced Computer Architecture 128 1.10 Associativity 2-way 4-way 8-way 2.15 2.07 2.01 1.86 1.76 1.68 1.67 1.61 1.53 1.48 1.47 1.43 1.32 1.32 1.32 1.24 1.25 1.27 1.20 1.21 1.23 Lecture 30 Memory (6) 1.17 1.18 Hierarchy 1.20 50 Allah Hafiz MAC/VU-Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 51 ... cycles MAC/VU -Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 21 1: Larger Block Size: Solution Copy table 5.18 pp 428 MAC/VU -Advanced Computer Architecture Lecture 30 Memory Hierarchy. .. time) MAC/VU -Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 24 Associativity Absolute Miss Rate 3: Higher MAC/VU -Advanced Computer Architecture Lecture 30 Memory Hierarchy (6)... X and Z MAC/VU -Advanced Computer Architecture Lecture 30 Memory Hierarchy (6) 46 5: Compiler Optimization: Using Blocking MAC/VU -Advanced Computer Architecture Lecture 30 Memory Hierarchy (6)