Chapter 14: Memory Revisited, Caches and Virtual Memory

Objectives

When you are finished with this lesson, you will be able to:

• Explain the reason for caches and how caches are organized;
• Describe how various caches are organized;
• Design a typical cache organization;
• Discuss relative cache performance;
• Explain how virtual memory is organized; and
• Describe how computer architecture supports virtual memory management.

Introduction to Caches

As an introduction to the topic of caches and cache-based systems, let's review the types of memories that we discussed before. The major types of memories are static random access memory (SRAM), dynamic random access memory (DRAM), and nonvolatile read-only memory (ROM).

SRAM memory is based on the principle of cross-coupled, inverting logic gates. The output value feeds back to the input to keep the gate locked in one state or the other. SRAM memory is very fast, but each memory cell requires five or six transistors to implement, so it tends to be more expensive than DRAM memory.

DRAM memory stores the logical value as charge on a tiny charge-storage element called a capacitor. Since the charge can leak off the capacitor if it isn't refreshed periodically, this type of memory must be refreshed at regular intervals; this is why it is called dynamic RAM rather than static RAM. The memory access cycle for DRAM is also more complicated than for static RAM because these refresh cycles must be taken into account as well. However, the big advantage of DRAM memory is its density and low cost. Today, you can buy a single in-line memory module, or SIMM, for your PC with 512 Mbytes of DRAM for $60. At those prices, we can afford to put the complexity of managing the DRAM interface into specialized chips that sit between the CPU and the memory. If you're a computer hobbyist who likes to do your own PC upgrading, then you've no doubt purchased a new motherboard for your PC featuring the AMD, nVidia, Intel or VIA "chipsets." The chipsets have become as important a consideration as the CPU itself in determining the performance of your computer.

Our computer systems demand a growing amount of memory just to keep up with the growing complexity of our applications and operating systems. This chapter is being written on a PC with 1,024 Mbytes (1 Gbyte) of memory. Today this is considered to be more than an average amount of memory, but in three years it will probably be the minimal recommended amount. Not too long ago, 10 Mbytes of disk storage was considered a lot. Today, you can purchase a 200 Gbyte hard disk drive for around $100. That's a factor of 10,000 times improvement in storage capacity. Given our insatiable urge for ever-increasing amounts of storage, both volatile storage, such as RAM, and archival storage, such as a hard disk, it is appropriate that we also look at the ways we manage this complexity from an architectural point of view.

The Memory Hierarchy

There is a hierarchy of memory. In this case we don't mean a pecking order, with some memory being more important than others. In our hierarchy, the memory that is "closer" to the CPU is considered to be higher in the hierarchy than memory that is located further away from the CPU. Note that we are saying "closer" in a more general sense than just "physically closer" (although proximity to the CPU is an important factor as well).
In order to maximize processor throughput, the fastest memory is located closest to the processor. This fast memory is also the most expensive. Figure 14.1 is a qualitative representation of what is referred to as the memory hierarchy. Starting at the pinnacle, each level of the pyramid contains different types of memory with increasingly longer access times. Let's compare this to some real examples. Today, SRAM access times are in the 2–25 ns range at a cost of about $50 per Mbyte. DRAM access times are 30–120 ns at a cost of $0.06 per Mbyte. Disk access times are 10 to 100 million ns at a cost of $0.001 to $0.01 per Mbyte. Notice the exponential rise in capacity with each layer and the corresponding exponential rise in access time with the transition to the next layer.

[Figure 14.1: The memory hierarchy. As memory moves further away from the CPU, both the size and the access time increase. The pyramid runs from level 1 (L1) and level 2 (L2) nearest the CPU out to level N, with the capacity of the memory growing at each level.]

Figure 14.2 shows the memory hierarchy for a typical computer system that you might find in your own PC at home. Notice that there could be two separate caches in the system: an on-chip cache at level 1, often called an L1 cache, and an off-chip cache at level 2, or an L2 cache. It is easily apparent that the capacity increases and the speed of the memory decreases at each level of the hierarchy. We could also imagine that the final level of this pyramid is the Internet. Here the capacity is almost infinite, and it often seems like the access time takes forever as well.

[Figure 14.2: Memory hierarchy for a typical computer system. The CPU and bus interface unit sit above a primary cache, 2 K–1 Mbyte (1 ns); a secondary cache, 256 K–4 Mbyte (20 ns); main memory, 1 M–1.5 Gbyte (30 ns); hard disk, 1 G–200 Gbyte (100,000 ns); and tape backup, 50 G–10 Tbyte (seconds).]

Locality

Before we continue on about caches, let's be certain that we understand what a cache is. A cache is a nearby, local storage system. In a CPU we could call the register set the zero-level cache. Also on-chip, as we saw, there is another, somewhat larger cache memory system. This memory typically runs at the speed of the CPU, although it is sometimes slower than the CPU's regular access times. Processors will often have two separate L1 caches, one for instructions and one for data. As we've seen, this is an internal implementation of the Harvard architecture.

The usefulness of a cache stems from the general characteristic of programs that we call locality. There are two types of locality, although they are alternative ways to describe the same principle. Locality of reference asserts that programs tend to access data and instructions that were recently accessed before, or that are located in nearby memory locations. Programs tend to execute instructions in sequence from adjacent memory locations, and programs tend to have loops in which a group of nearby instructions is executed repeatedly. In terms of data structures, compilers store arrays in blocks of adjacent memory locations and programs tend to access array elements in sequence. Also, compilers store unrelated variables together, such as local variables on a stack. Temporal locality says that once an item is referenced it will tend to be referenced again soon, and spatial locality says that nearby items will tend to be referenced soon.
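To make spatial and temporal locality concrete, here is a short C sketch; the array dimensions are arbitrary and chosen only for illustration. Both functions perform exactly the same additions, but the first walks the array in the order C lays it out in memory, so successive accesses fall in the same or adjacent blocks, while the second strides across the rows and touches a distant location on every access. The loops themselves also exhibit temporal locality, since the same few instructions execute over and over.

    /* Two ways to sum the same 2-D array. Both perform identical arithmetic,
     * but sum_row_major() visits elements in the order they sit in memory
     * (good spatial locality), while sum_col_major() jumps COLS elements
     * ahead on every access (poor spatial locality). Sizes are arbitrary.
     */
    #include <stddef.h>

    #define ROWS 1024
    #define COLS 1024

    static int a[ROWS][COLS];           /* C stores this array row by row */

    long sum_row_major(void)
    {
        long sum = 0;
        for (size_t i = 0; i < ROWS; i++)       /* adjacent addresses:    */
            for (size_t j = 0; j < COLS; j++)   /* a[i][0], a[i][1], ...  */
                sum += a[i][j];
        return sum;
    }

    long sum_col_major(void)
    {
        long sum = 0;
        for (size_t j = 0; j < COLS; j++)       /* consecutive accesses   */
            for (size_t i = 0; i < ROWS; i++)   /* are COLS ints apart    */
                sum += a[i][j];
        return sum;
    }

On a cached machine the row-major version typically runs noticeably faster, even though the two functions are computationally identical.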
Let's examine the principle of locality in terms of a two-level memory hierarchy. This example will have an upper level (cache memory) and a lower level (main memory). The two-level structure means that if the data we want isn't in the cache, we will go to the lower level and retrieve at least one block of data from main memory. We'll also define a cache hit as a data or instruction request by the CPU to the cache memory where the information requested is in the cache, and a cache miss as the reverse situation: the CPU requests data and the data is not in the cache. We also need to define a block as the minimum unit of data transfer. A block could be as small as a byte, or several hundred bytes, but in practical terms it will typically be in the range of 16 to 64 bytes of information.

Now it is fair to ask the question, "Why load an entire block from main memory? Why not just get the instruction or data element that we need?" The answer is that locality tells us that if the first piece of information we need is not in the cache, the rest of the information that we'll need shortly is probably also not in the cache, so we might as well bring in an entire block of data while we're at it. There is another practical reason for doing this. DRAM memory takes some time to set up the first memory access, but after the access is set up, the CPU can transfer successive bytes from memory with little additional overhead, essentially in a burst of data from the memory to the CPU. This is called a burst mode access. The ability of modern SDRAM memories to support burst mode accesses is carefully matched to the capabilities of modern processors. Establishing the conditions for the burst mode access requires a number of clock cycles of overhead in order for the memory support chipsets to establish the initial addresses of the burst. However, after the addresses have been established, the SDRAM can output two memory read cycles for every clock period of the external bus clock. Today, with a bus clock of 200 MHz and a memory width of 64 bits, that translates to a memory-to-processor data transfer rate of 3.2 Gbytes per second during the actual burst transfer.

Let's make one more analogy about a memory hierarchy that is common in your everyday life. Imagine yourself working away at your desk, solving another one of those interminable problem sets that engineering professors seem to assign with depressing regularity. You exploit locality by keeping the books that you reference most often, say the required textbooks for your classes, on your desk or bookshelf. They're nearby, easily referenced when you need them, but there are only a few books around. Suppose that your assignment calls for you to go to the engineering library and borrow another book. The engineering library certainly has a much greater selection than you do, but the retrieval costs are greater as well. If the book isn't in the engineering library, then the Library of Congress in Washington, D.C. might be your next stop. At each level, in order to gain access to a greater amount of stored material, we incur a greater penalty in our access time. Also, our unit of transfer in this case is a book, so in this analogy, one block equals one book.
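To see why fetching a whole block on a miss pays off, here is a rough burst-bandwidth model in C. It uses the figures quoted above (a 200 MHz bus clock, a 64-bit wide memory path, and two transfers per bus clock) together with an assumed ten clocks of setup overhead; the setup count and the burst sizes are illustrative, not datasheet values.

    /* Rough burst-mode bandwidth model. Peak rate: 200e6 clocks/s
     * x 8 bytes x 2 transfers per clock = 3.2 Gbytes/s. The ten clocks
     * of setup overhead are an assumption for illustration only.
     */
    #include <stdio.h>

    int main(void)
    {
        const double bus_clock_hz   = 200e6;   /* 200 MHz external bus   */
        const double bytes_per_xfer = 8.0;     /* 64-bit wide memory     */
        const double xfers_per_clk  = 2.0;     /* two reads per clock    */
        const double setup_clks     = 10.0;    /* assumed setup overhead */

        double peak = bus_clock_hz * bytes_per_xfer * xfers_per_clk;

        for (int n = 16; n <= 256; n *= 2) {
            double burst_clks = n / (bytes_per_xfer * xfers_per_clk);
            double seconds    = (setup_clks + burst_clks) / bus_clock_hz;
            printf("%3d-byte burst: %.2f Gbytes/s (peak %.1f Gbytes/s)\n",
                   n, (n / seconds) / 1e9, peak / 1e9);
        }
        return 0;
    }

The output shows the 3.2 Gbytes per second peak rate and how longer bursts amortize the fixed setup cost, which is exactly the argument for transferring block-sized chunks rather than single bytes.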
Let's go back and redefine things in terms of this example:

• block: the unit of data transfer (one book);
• hit rate: the percentage of the data accesses that are in the cache (on your desk);
• miss rate: the percentage of accesses not in the cache (1 – hit rate);
• hit time: the time required to access data in the cache (grab the book on your desk);
• miss penalty: the time required to replace the block in the cache with the one you need (go to the library and get the other book).

We can derive a simple equation for the effective execution time. That is the actual time, on average, that it takes an instruction to execute, given the probability that the instruction will, or will not, be in the cache when you need it. There's a subtle point here that should be made. The miss penalty is the time delay imposed because the processor must execute all instructions out of the cache. Although most cached processors allow you to enable or disable the on-chip caches, we'll assume that you are running with the cache on.

Effective Execution Time = hit rate × hit time + miss rate × miss penalty

If the instruction or data is not in the cache, then the processor must reload the cache before it can fetch the next instruction. It cannot just go directly to memory to fetch the instruction. Thus, we have the block of time penalty that is incurred because the processor must wait while the cache is reloaded with a block from memory.

Let's do a real example. Suppose that we have a cached processor with a 100 MHz clock. Instructions in cache execute in two clock cycles. Instructions that are not in cache must be loaded from main memory in a 64-byte burst. Reading from main memory requires 10 clock cycles to set up the data transfer, but once set up, the processor can read a 32-bit wide word at one word per clock cycle. The cache hit rate is 90%.

1. The hard part of this exercise is calculating the miss penalty, so we'll do that one first.
   a. 100 MHz clock -> 10 ns clock period
   b. 10 cycles to set up the burst = 10 × 10 ns = 100 ns
   c. 32-bit wide word = 4 bytes -> 16 data transfers to load 64 bytes
   d. 16 × 10 ns = 160 ns
   e. Miss penalty = 100 ns + 160 ns = 260 ns
2. Each instruction takes 2 clocks, or 20 ns, to execute.
3. Effective execution time = 0.9 × 20 ns + 0.1 × 260 ns = 18 ns + 26 ns = 44 ns

Even this simple example illustrates the sensitivity of the effective execution time to the parameters surrounding the behavior of the cache. The effective execution time is more than twice the in-cache execution time. So, whenever there are factors of 100% improvement floating around, designers get busy. We can thus ask some fundamental questions:

1. How can we increase the cache hit ratio?
2. How can we decrease the cache miss penalty?

For #1, we could make the caches bigger. A bigger cache holds more of main memory, so that should increase the probability of a cache hit. We could change the design of the cache. Perhaps there are ways to organize the cache such that we can make better use of the cache we already have. Remember, memory takes up a lot of room on a silicon die compared to random logic, so adding an algorithm with a few thousand gates might get a better return than adding another 100K to the cache. We could look to the compiler designers for help. Perhaps they could better structure the code so that it would have a higher proportion of cache hits. This isn't an easy one to attack, because cache behavior can sometimes become very counter-intuitive.
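The arithmetic in the worked example above is easy to capture in a few lines of C, which also makes it easy to see how sensitive the result is to the hit rate and the miss penalty; the constants are the ones assumed in the text.

    /* The worked example above, in code: 100 MHz clock, 2-clock in-cache
     * instructions, 64-byte refill burst, 10 setup clocks, one 32-bit word
     * per clock during the burst, 90% hit rate.
     */
    #include <stdio.h>

    int main(void)
    {
        const double clock_ns     = 10.0;                    /* 100 MHz      */
        const double hit_time     = 2.0 * clock_ns;          /* 20 ns        */
        const double setup_ns     = 10.0 * clock_ns;         /* 100 ns       */
        const double transfers    = 64.0 / 4.0;              /* 16 words     */
        const double miss_penalty = setup_ns + transfers * clock_ns;  /* 260 ns */
        const double hit_rate     = 0.90;

        double t_eff = hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty;
        printf("miss penalty = %.0f ns, effective execution time = %.0f ns\n",
               miss_penalty, t_eff);                         /* 260 ns, 44 ns */
        return 0;
    }

Raising the hit rate from 0.90 to 0.95 drops the effective execution time from 44 ns to 32 ns, which is why so much design effort goes into the hit ratio.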
Small changes in an algorithm can sometimes lead to big fluctuations in the effective execution time. For example, in my Embedded Systems Laboratory class the students do a lab experiment trying to fine-tune an algorithm to maximize the difference in measured execution time between the algorithm running cache off and cache on. We turn it into a small contest. The best students can hand-craft their code to get a 15:1 ratio.

Cache Organization

The first issue that we will have to deal with is pretty simple: "How do we know if an item (instruction or data) is in the cache?" If it is in the cache, "How do we find it?" This is a very important consideration. Remember that your program was written, compiled and linked to run in main memory, not in the cache. In general, the compiler will not know about the cache, although there are some optimizations that it can make to take advantage of cached processors. The addresses associated with references are main memory addresses, not cache addresses. Therefore, we need to devise a method that somehow maps the addresses in main memory to the addresses in the cache.

We also have another problem. What happens if we change a value such that we must now write a new value back out to main memory? Efficiency tells us to write it to the cache, but this could lead to a potentially disastrous situation where the data in the cache and the data in main memory are no longer coherent (in agreement with each other). Finally, how do we design a cache such that we can maximize our hit rate? We'll try to answer these questions in the discussion to follow.

In our first example our block size will be exactly one word of memory. The cache design that we'll use is called a direct-mapped cache. In a direct-mapped cache, every word of memory at the lower level has exactly one location in the cache where it might be found. Thus, there will be lots of memory locations at the lower level for every memory location in the cache. This is shown in Figure 14.3.

[Figure 14.3: Mapping of a 1K direct-mapped cache (addresses 0x000–0x3FF) to a 1M main memory (addresses 0x00000–0xFFFFF). Every memory location in the cache maps to 1,024 memory locations in main memory.]

Referring to Figure 14.3, suppose that our cache is 1,024 words (1K) and main memory contains 1,048,576 words (1M). Each cache location maps to 1,024 main memory locations. This is fine, but now we need to be able to tell which of the 1,024 possible main memory locations is in a particular cache location at a particular point in time. Therefore, every memory location in the cache needs to contain more information than just the corresponding data from main memory. Each cache memory location consists of a number of cache entries, and each cache entry has several parts. We have some cache memory that contains the instructions or data that correspond to one of the 1,024 main memory locations that map to it. Each cache location also contains an address tag, which identifies which of the 1,024 possible memory locations happens to be in the corresponding cache location. This point deserves some further discussion.

Address Tags

When we first began our discussion of memory organization several lessons ago, we were introduced to the concept of paging. In this particular case, you can think of main memory as being organized as 1,024 pages with each page containing exactly 1,024 words. One page of main memory maps to one page of the cache. Thus, the first word of main memory has the binary address 0000 0000 0000 0000 0000, and the last word of main memory has the address 1111 1111 1111 1111 1111. Let's split this up in terms of a page and an offset.
The first word of main memory has the page address 00 0000 0000 and the offset address 00 0000 0000. The last word of main memory has the page address 11 1111 1111 and the offset address 11 1111 1111. In terms of hexadecimal addresses, we could say that the last word of memory in page/offset addressing has the address $3FF/$3FF. Nothing has changed; we've just grouped the bits differently so that we can represent the memory address in a way that is more aligned with the organization of the direct-mapped cache. Thus, any memory position in the cache also has to have storage for the page address that the data actually occupies in main memory.

Now, data in a cache memory is either a copy of the contents of main memory (instructions and/or data) or newly stored data that is not yet in main memory. The cache entry for that data, called a tag, contains the information about the block's location in main memory and validity (coherence) information. Therefore, every cache entry must contain the instruction or data contained in main memory, the page of main memory that the block comes from, and, finally, information about whether the data in the cache and the data in main memory are coherent. This is shown in Figure 14.4.

[Figure 14.4: A cache entry for the 1K direct-mapped cache. Assumptions: the cache is 1K deep, main memory contains 1M words, and memory words are 16 bits wide. Each entry holds the memory data (16 bits, D15–D0), an address tag (10 bits, A9–A0), and a validity bit indicating whether data has been written to the cache but not yet to main memory.]

We can summarize the cache operation quite simply. We must maximize the probability that whenever the CPU does an instruction fetch or a data read, the instruction or data is available in the cache. For many CPU designs, the algorithmic state machine design that is used to manage the cache is one of the most jealously guarded secrets of the company. The design of this complex hardware block will dramatically impact the cache hit rate and, consequently, the overall performance of the processor.

Most caches are really divided into three basic parts. Since we've already discussed each one, let's just take a moment to summarize our discussion:

• cache memory: holds the memory image;
• tag memory: holds the address information and validity bit. It determines if the data is in the cache and if the cache data and memory data are coherent;
• algorithmic state machine: the cache control mechanism. Its primary function is to guarantee that the data requested by the CPU is in the cache.

To this point, we've been using a model in which the cache and memory transfer data in blocks, and our block size has been one memory word. In reality, caches and main memory are divided into equally sized quantities called refill lines. A refill line is typically between four and 64 bytes long (a power of 2) and is the minimum quantity that the cache will deal with in terms of its interaction with main memory. Missing a single byte from main memory will result in a full filling of the refill line containing that byte. This is why most cached processors have burst modes to access memory and usually never read a single byte from memory. The refill line is another name for the data block that we previously discussed.

Today, there are four common cache types in general use. We call these:

1. direct-mapped
2. associative
3. set-associative
4. sector mapped

The one used most is the four-way set-associative cache, because it seems to have the best performance with acceptable cost and complexity. We'll now look at each of these cache designs.
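Before looking at each design, here is a minimal C sketch of the earlier one-word-block, direct-mapped example: a 1K-entry cache in front of a 1M-word memory, where a 20-bit word address splits into a 10-bit page number (the tag) and a 10-bit offset (the cache index). The structure and function names are ours, and the sketch keeps separate valid and dirty flags, whereas the text folds coherence information into a single validity bit.

    /* Sketch of the 1K-entry, one-word-block, direct-mapped example.
     * A 20-bit word address = 10-bit page number (tag) : 10-bit offset
     * (index into the cache). Names and the valid/dirty split are ours.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define CACHE_ENTRIES 1024u            /* 2^10 cache locations          */

    struct cache_entry {
        uint16_t data;    /* copy of the 16-bit memory word                 */
        uint16_t tag;     /* which of the 1,024 pages this word came from   */
        bool     valid;   /* entry currently holds meaningful data          */
        bool     dirty;   /* written in the cache but not yet to memory     */
    };

    static struct cache_entry cache[CACHE_ENTRIES];

    /* Returns true on a cache hit and copies the word into *out. */
    bool cache_lookup(uint32_t word_addr, uint16_t *out)
    {
        uint32_t index = word_addr & (CACHE_ENTRIES - 1u);  /* low 10 bits  */
        uint32_t tag   = (word_addr >> 10) & 0x3FFu;        /* page number  */

        const struct cache_entry *e = &cache[index];
        if (e->valid && e->tag == tag) {
            *out = e->data;                                 /* hit          */
            return true;
        }
        return false;     /* miss: the controller must reload this entry    */
    }

A real controller would also handle the refill on a miss and the write-back of dirty entries; the point here is just the tag comparison that decides hit or miss.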
Direct-Mapped Cache

We've already studied the direct-mapped cache as our introduction to cache design. Let's re-examine it in terms of refill lines rather than single words of data. The direct-mapped cache partitions main memory into an XY matrix consisting of K columns of N refill lines per column. The cache is one column wide and N refill lines long. The Nth row of the cache can hold the Nth refill line of any one of the K columns of main memory. The tag address holds the address of the memory column.

For example, suppose that we have a processor with a 32-bit, byte-addressable address space and a 256K, direct-mapped cache. Finally, the cache reloads with a 64-byte-long refill line. What does this system look like?

1. Repartition the cache and main memory in terms of refill lines.
   a. Main memory contains 2^32 bytes / 2^6 bytes per refill line = 2^26 refill lines.
   b. Cache memory contains 2^18 bytes / 2^6 bytes per refill line = 2^12 refill lines.
2. Represent cache memory as a single column with 2^12 rows and main memory as an XY matrix of 2^12 rows by 2^26 / 2^12 = 2^14 columns. See Figure 14.5.

[Figure 14.5: Example of a 256 Kbyte direct-mapped cache with a 4 Gbyte main memory. Refill line width is 64 bytes. Main memory is drawn as columns 0x0000 through 0x3FFF of refill lines, the cache as rows 0x000 through 0xFFF, each with an address tag; a byte address decomposes as column address, row address, offset address, with offsets 00 through 3F within a refill line.]

In Figure 14.5 we've divided main memory into three distinct regions:

• offset address in a refill line;
• row address in a column; and
• column address.

We map the corresponding byte positions of a refill line of main memory to the byte positions in the refill line of the cache. In other words, the offset addresses are the same in the cache and in main memory. Next, every row of the cache corresponds to every row of main memory. Finally, the same row (refill line) within each column of main memory maps to the same row, or refill line, of the cache memory, and its column address is stored in the tag RAM of the cache. The address tag field must be able to hold a 14-bit wide column address, corresponding to column addresses from 0x0000 to 0x3FFF. The main memory and cache have 4,096 rows, corresponding to row addresses 0x000 through 0xFFF.

As an example, let's take an arbitrary byte address and map it into this column/row/offset schema:

Byte address = 0xA7D304BE

Because the boundaries of the column, row and offset addresses do not all lie on the boundaries of hex digits (multiples of 4 bits), it will be easier to work the problem out in binary rather than hexadecimal. First we'll write out the byte address 0xA7D304BE as a 32-bit wide number and then group it according to the column, row and offset organization of the direct-mapped cache example.
1010 0111 1101 0011 0000 0100 1011 1110 (the 8 hexadecimal digits written out as 32 bits)

Offset: 11 1110 = 0x3E
Row: 1100 0001 0010 = 0xC12
Column: 10 1001 1111 0100 = 0x29F4

Therefore, the byte that resides in main memory at address 0xA7D304BE resides at column 0x29F4, row 0xC12, offset 0x3E when we remap main memory as an XY matrix of 64-byte wide refill lines. Also, when the refill line containing this byte is in the cache, it resides at row 0xC12 and the address tag is 0x29F4. Finally, the byte is located at offset 0x3E from the first byte of the refill line.

The direct-mapped cache is a relatively simple design to implement, but it is rather limited in its performance because of the restriction placed upon it that, at any point in time, only one refill line per row of main memory may be in the cache. In order to see how this restriction can affect the performance of a processor, consider the following example:

10854BCA  loop:        JSR subroutine
                       {some code}
                       BNE loop

10894BC0  subroutine:  {some code}
                       RTS

The two addresses for the loop and for the subroutine called by the loop look vaguely similar. If we break these down into their mappings in the cache example, we see that the loop address maps to:

• Offset = 0x0A
• Row = 0x52F
• Column = 0x0421

The subroutine maps to:

• Offset = 0x00
• Row = 0x52F
• Column = 0x0422

Thus, in this particular situation, which just might occur either through an assembly language algorithm or as a result of how the compiler and linker organize the object code image, the loop and the subroutine called by the loop are on the same cache row but in adjacent columns. Every time the subroutine is called, the cache controller must refill row 0x52F from column 0x0422 before it can begin to execute the subroutine. Likewise, when the RTS instruction is encountered, the cache row must once again be refilled from the adjacent column. As we've previously seen in the calculation for the effective execution time, this piece of code could easily run 10 times slower than it might if the two code segments were in different rows. The problem exists because of the limitations of the direct-mapped cache. Since there is only one place for each of the refill lines from a given row, we have no choice when another refill line from the same row needs to be accessed. At the other end of the spectrum in terms of flexibility is the associative cache. We'll consider this cache organization next.

Associative Cache

As we've discussed, the direct-mapped cache is rather restrictive because of the strict limitations on where a refill line from main memory may reside in the cache. If one particular refill line address in the cache is mapped to two refill lines of main memory that are both frequently used, the computer will be spending a lot of time swapping the two refill lines in and out of the cache. What would be an improvement is if we could map any refill line address in main memory to any available refill line position in the cache. We call a cache with this organization an associative cache. Figure 14.6 illustrates an associative cache. In Figure 14.6, we've taken an example of a 1 Mbyte memory space, a 4 Kbyte associative cache, and a 64-byte refill line size.
The cache contains 64 refill lines and main memory is organized as a single column of 2^14 (16K) refill lines. This example represents a fully associative cache. Any refill line of main memory may occupy any available refill position in the cache. This is as good as it gets; the associative cache has none of the limitations imposed by the direct-mapped cache architecture. Figure 14.6 attempts to show, in a multicolor manner, the almost random mapping of rows in the cache to rows in main memory. However, the complexity of the associative cache grows exponentially with cache size and main memory size. Consider two problems:

1. When all the available rows in the cache contain valid rows from main memory, how does the cache control hardware decide where in the cache to place the next refill line from main memory?
2. Since any refill line from main memory can be located at any refill line position in the cache, how does the cache control hardware determine if a main memory refill line is currently in the cache?

We can deal with issue #1 by placing a binary counter next to each row of the cache. On every clock cycle we advance all of the counters. Whenever we access the data in a particular row of the cache, we reset the counter associated with that row back to zero. When a counter reaches the maximum count, it remains at that value; it does not roll over to zero. All of the counters feed their values into a form of priority circuit that outputs the row address of the counter with the highest count value. This row address of the counter with the highest count [...]

[Figure 14.6: Example of a 4 Kbyte associative cache with a 1M main memory. Refill line width is 64 bytes. (This figure is included in color on the DVD-ROM.) The cache RAM holds 64 refill lines in rows 0x00 through 0x3F; the tag RAM records which main memory row (for example 0x0002, 0x0007, 0x0009, 0x3FF7, 0x3FF9) each cache row currently holds; main memory is a single column of refill lines, rows 0x0000 through 0x3FFF.]

[...] effect of the hardware architecture on performance cannot be minimized. A back-of-the-envelope calculation could yield a video frame rate of 400 frames per second for the PC and less than 1 frame per hour for the 8-bit processor; and even though this is a rather ludicrous example, it does factor in the significant issues relating hardware and performance [...]

[...] logical memory, physical memory and virtual memory, with virtual memory being the hard disk, as managed by the operating system. Let's look at the components of a virtual memory system in more detail. Refer to Figure 14.12.

[Figure 14.12: Components of a virtual memory system. The labels show logical memory (an instruction's opcode and operands, with the CPU's standard addressing modes used to generate a logical address), the virtual address issued by the CPU, an on-chip hardware and/or memory-based translation buffer, physical memory and the physical address, and an exception handler through which the operating system intervenes.]
The CPU executes an instruction and requests a memory operand, or the program counter issues the address of the next instruction. In any case, standard addressing methods (addressing modes) are used and an address is pointing [...]

[...] as its highest priority. A page fault and subsequent reloading of main memory from the disk can take hundreds of thousands of clock cycles. Thus, LRU algorithms are worth the price of their complexity. With the O/S taking over, the page faults can be handled in software instead of hardware. The obvious tradeoff is that it will be much slower than managing it with a hardware algorithm, but much more flexible [...]

[...] numbers are mapped to physical and virtual memory [...] needs a tag memory to store the mapping information between the cache and main memory, the O/S maintains a page map to store the mapping information between virtual memory and physical memory. The page map may contain other information that is of interest to the operating system and the applications that are [...]

[...] TLB is updated and the refill line is retrieved and placed in the cache. If the page is in virtual memory, then a page fault must be generated and the O/S intervenes to retrieve the page from the disk and place it in the page frame. The virtual paging process is shown as a flow chart in Figure 14.15.

Protection

Since the efficient management of the virtual memory depends upon having specialized hardware to [...]

[...] circuitry has been incorporated into the processor, and most modern high-performance processors have on-chip MMUs. Therefore, we can summarize some of the hardware-based protection functionality as follows:

• user mode and kernel (supervisor) mode;
• controlled user read-only access to the user/kernel mode bit and a mechanism to switch from user to kernel mode, [...]

[...] instruction and for data. Both are 2-way set associative, using an LRU replacement strategy. Instruction TLB: 128 entries; data TLB: 128 entries; TLB misses are handled in hardware. Split instruction and data caches, 16 KB each, 4-way set associative, LRU replacement, 32-byte lines, write-back or write-through.

Summary of Chapter 14

Chapter 14 covered:

• The concept of a memory hierarchy,
• The concept of locality and how [...]

[...] stored in a particular cache tag address location, the circuit indicates an address match (hit) and outputs the cache row address of the main memory tag address. As the size of the cache and the size of main memory increases, the number of bits that must be handled by the cache control hardware grows rapidly in size and complexity. Thus, for real-life cache situations, the fully associative cache is not an [...]

[...] summarize and wrap up our discussion of caches:

• There are two types of locality: spatial and temporal.
• Cache contents include data, tags, and validity bits.
• Spatial locality demands larger [...]
