Understanding the Linux Kernel
Daniel P. Bovet, Marco Cesati
Publisher: O'Reilly
First Edition October 2000
ISBN: 0-596-00002-2, 702 pages

Understanding the Linux Kernel helps readers understand how Linux performs best and how it meets the challenge of different environments. The authors introduce each topic by explaining its importance, and show how kernel operations relate to the utilities that are familiar to Unix programmers and users.

Table of Contents

Preface
    The Audience for This Book
    Organization of the Material
    Overview of the Book
    Background Information
    Conventions in This Book
    How to Contact Us
    Acknowledgments
1 Introduction
    1.1 Linux Versus Other Unix-Like Kernels
    1.2 Hardware Dependency
    1.3 Linux Versions
    1.4 Basic Operating System Concepts
    1.5 An Overview of the Unix Filesystem
    1.6 An Overview of Unix Kernels
2 Memory Addressing
    2.1 Memory Addresses
    2.2 Segmentation in Hardware
    2.3 Segmentation in Linux
    2.4 Paging in Hardware
    2.5 Paging in Linux
    2.6 Anticipating Linux 2.4
3 Processes
    3.1 Process Descriptor
    3.2 Process Switching
    3.3 Creating Processes
    3.4 Destroying Processes
    3.5 Anticipating Linux 2.4
4 Interrupts and Exceptions
    4.1 The Role of Interrupt Signals
    4.2 Interrupts and Exceptions
    4.3 Nested Execution of Exception and Interrupt Handlers
    4.4 Initializing the Interrupt Descriptor Table
    4.5 Exception Handling
    4.6 Interrupt Handling
    4.7 Returning from Interrupts and Exceptions
    4.8 Anticipating Linux 2.4
5 Timing Measurements
    5.1 Hardware Clocks
    5.2 The Timer Interrupt Handler
    5.3 PIT's Interrupt Service Routine
    5.4 The TIMER_BH Bottom Half Functions
    5.5 System Calls Related to Timing Measurements
    5.6 Anticipating Linux 2.4
6 Memory Management
    6.1 Page Frame Management
    6.2 Memory Area Management
    6.3 Noncontiguous Memory Area Management
    6.4 Anticipating Linux 2.4
7 Process Address Space
    7.1 The Process's Address Space
    7.2 The Memory Descriptor
    7.3 Memory Regions
    7.4 Page Fault Exception Handler
    7.5 Creating and Deleting a Process Address Space
    7.6 Managing the Heap
    7.7 Anticipating Linux 2.4
8 System Calls
    8.1 POSIX APIs and System Calls
    8.2 System Call Handler and Service Routines
    8.3 Wrapper Routines
    8.4 Anticipating Linux 2.4
9 Signals
    9.1 The Role of Signals
    9.2 Sending a Signal
    9.3 Receiving a Signal
    9.4 Real-Time Signals
    9.5 System Calls Related to Signal Handling
    9.6 Anticipating Linux 2.4
10 Process Scheduling
    10.1 Scheduling Policy
    10.2 The Scheduling Algorithm
    10.3 System Calls Related to Scheduling
    10.4 Anticipating Linux 2.4
11 Kernel Synchronization
    11.1 Kernel Control Paths
    11.2 Synchronization Techniques
    11.3 The SMP Architecture
    11.4 The Linux/SMP Kernel
    11.5 Anticipating Linux 2.4
12 The Virtual Filesystem
    12.1 The Role of the VFS
    12.2 VFS Data Structures
    12.3 Filesystem Mounting
    12.4 Pathname Lookup
    12.5 Implementations of VFS System Calls
    12.6 File Locking
    12.7 Anticipating Linux 2.4
13 Managing I/O Devices
    13.1 I/O Architecture
    13.2 Associating Files with I/O Devices
    13.3 Device Drivers
    13.4 Character Device Handling
    13.5 Block Device Handling
    13.6 Page I/O Operations
    13.7 Anticipating Linux 2.4
14 Disk Caches
    14.1 The Buffer Cache
    14.2 The Page Cache
    14.3 Anticipating Linux 2.4
15 Accessing Regular Files
    15.1 Reading and Writing a Regular File
    15.2 Memory Mapping
    15.3 Anticipating Linux 2.4
16 Swapping: Methods for Freeing Memory
    16.1 What Is Swapping?
    16.2 Swap Area
    16.3 The Swap Cache
    16.4 Transferring Swap Pages
    16.5 Page Swap-Out
    16.6 Page Swap-In
    16.7 Freeing Page Frames
    16.8 Anticipating Linux 2.4
17 The Ext2 Filesystem
    17.1 General Characteristics
    17.2 Disk Data Structures
    17.3 Memory Data Structures
    17.4 Creating the Filesystem
    17.5 Ext2 Methods
    17.6 Managing Disk Space
    17.7 Reading and Writing an Ext2 Regular File
    17.8 Anticipating Linux 2.4
18 Process Communication
    18.1 Pipes
    18.2 FIFOs
    18.3 System V IPC
    18.4 Anticipating Linux 2.4
19 Program Execution
    19.1 Executable Files
    19.2 Executable Formats
    19.3 Execution Domains
    19.4 The exec-like Functions
    19.5 Anticipating Linux 2.4
A System Startup
    A.1 Prehistoric Age: The BIOS
    A.2 Ancient Age: The Boot Loader
    A.3 Middle Ages: The setup( ) Function
    A.4 Renaissance: The startup_32( ) Functions
    A.5 Modern Age: The start_kernel( ) Function
B Modules
    B.1 To Be (a Module) or Not to Be?
    B.2 Module Implementation
    B.3 Linking and Unlinking Modules
    B.4 Linking Modules on Demand
C Source Code Structure
Colophon

Preface

In the spring semester of 1997, we taught a course on operating systems based on Linux 2.0. The idea was to encourage students to read the source code. To achieve this, we assigned term projects consisting of making changes to the kernel and performing tests on the modified version. We also wrote course notes for our students about a few critical features of Linux like task switching and task scheduling.

We continued along this line in the spring semester of 1998, but we moved on to the Linux 2.1 development version. Our course notes were becoming larger and larger. In July 1998 we contacted O'Reilly & Associates, suggesting they publish a whole book on the Linux kernel. The real work started in the fall of 1998 and lasted about a year and a half. We read thousands of lines of code, trying to make sense of them. After all this work, we can say that it was worth the effort. We learned a lot of things you don't find in books, and we hope we have succeeded in conveying some of this information in the following pages.

The Audience for This Book

All people curious about how Linux works and why it is so efficient will find answers here. After reading the book, you will find your way through the many thousands of lines of code, distinguishing between crucial data structures and secondary ones; in short, becoming a true Linux hacker.

Our work might be considered a guided tour of the Linux kernel: most of the significant data structures and many algorithms and programming tricks used in the kernel are discussed; in many cases, the relevant fragments of code are discussed line by line. Of course, you should have the Linux source code on hand and should be willing to spend some effort deciphering some of the functions that are not, for the sake of brevity, fully described.

On another level, the book will give valuable
insights to people who want to know more about the critical design issues in a modern operating system. It is not specifically addressed to system administrators or programmers; it is mostly for people who want to understand how things really work inside the machine! Like any good guide, we try to go beyond superficial features. We offer background, such as the history of major features and the reasons they were used.

Organization of the Material

When starting to write this book, we were faced with a critical decision: should we refer to a specific hardware platform or skip the hardware-dependent details and concentrate on the pure hardware-independent parts of the kernel? Other books on Linux kernel internals have chosen the latter approach; we decided to adopt the former one for the following reasons:

• Efficient kernels take advantage of most available hardware features, such as addressing techniques, caches, processor exceptions, special instructions, processor control registers, and so on. If we want to convince you that the kernel indeed does quite a good job in performing a specific task, we must first tell what kind of support comes from the hardware.
• Even if a large portion of a Unix kernel source code is processor-independent and coded in C language, a small and critical part is coded in assembly language. A thorough knowledge of the kernel thus requires the study of a few assembly language fragments that interact with the hardware.

When covering hardware features, our strategy will be quite simple: just sketch the features that are totally hardware-driven while detailing those that need some software support. In fact, we are interested in kernel design rather than in computer architecture.

The next step consisted of selecting the computer system to be described: although Linux is now running on several kinds of personal computers and workstations, we decided to concentrate on the very popular and cheap IBM-compatible personal
computers, and thus on the Intel 80x86 microprocessors and on some support chips included in these personal computers. The term Intel 80x86 microprocessor will be used in the forthcoming chapters to denote the Intel 80386, 80486, Pentium, Pentium Pro, Pentium II, and Pentium III microprocessors or compatible models. In a few cases, explicit references will be made to specific models.

One more choice was the order followed in studying Linux components. We tried to follow a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent. In fact, we'll make many references to the Intel 80x86 microprocessors in the first part of the book, while the rest of it is relatively hardware-independent. Two significant exceptions are made in Chapter 11 and Chapter 13. In practice, following a bottom-up approach is not as simple as it looks, since the areas of memory management, process management, and filesystem are intertwined; a few forward references (that is, references to topics yet to be explained) are unavoidable.

Each chapter starts with a theoretical overview of the topics covered. The material is then presented according to the bottom-up approach. We start with the data structures needed to support the functionalities described in the chapter. Then we usually move from the lowest level of functions to higher levels, often ending by showing how system calls issued by user applications are supported.

Level of Description

Linux source code for all supported architectures is contained in about 4500 C and Assembly files stored in about 270 subdirectories; it consists of about 2 million lines of code, which occupy more than 58 megabytes of disk space. Of course, this book can cover only a very small portion of that code. Just to figure out how big the Linux source is, consider that the whole source code of the book you are reading occupies less than 2 megabytes of disk space. Therefore, in order to list all code, without commenting on it, we
would need more than 25 books like this![1]

[1] Nevertheless, Linux is a tiny operating system when compared with other commercial giants. Microsoft Windows 2000, for example, reportedly has more than 30 million lines of code. Linux is also small when compared to some popular applications; the Netscape Communicator browser, for example, has about 17 million lines of code.

So we had to make some choices about the parts to be described. This is a rough assessment of our decisions:

• We describe process and memory management fairly thoroughly.
• We cover the Virtual Filesystem and the Ext2 filesystem, although many functions are just mentioned without detailing the code; we do not discuss other filesystems supported by Linux.
• We describe device drivers, which account for a good part of the kernel, as far as the kernel interface is concerned, but do not attempt analysis of any specific driver, including the terminal drivers.
• We do not cover networking, since this area would deserve a whole new book by itself.

In many cases, the original code has been rewritten in an easier-to-read but less efficient way. This occurs at time-critical points at which sections of programs are often written in a mixture of hand-optimized C and Assembly code. Once again, our aim is to provide some help in studying the original Linux code.

While discussing kernel code, we often end up describing the underpinnings of many familiar features that Unix programmers have heard of and about which they may be curious (shared and mapped memory, signals, pipes, symbolic links).

Overview of the Book

To make life easier, Chapter 1 presents a general picture of what is inside a Unix kernel and how Linux competes against other well-known Unix systems.

The heart of any Unix kernel is memory management. Chapter 2 explains how Intel 80x86 processors include special circuits to address data in memory and how Linux exploits them.

Processes are a fundamental abstraction offered by Linux and are introduced in
Chapter 3. Here we also explain how each process runs either in an unprivileged User Mode or in a privileged Kernel Mode. Transitions between User Mode and Kernel Mode happen only through well-established hardware mechanisms called interrupts and exceptions, which are introduced in Chapter 4. One type of interrupt is crucial for allowing Linux to take care of elapsed time; further details can be found in Chapter 5.

Next we focus again on memory: Chapter 6 describes the sophisticated techniques required to handle the most precious resource in the system (besides the processors, of course), that is, available memory. This resource must be granted both to the Linux kernel and to the user applications. Chapter 7 shows how the kernel copes with the requests for memory issued by greedy application programs.

Chapter 8 explains how a process running in User Mode makes requests to the kernel, while Chapter 9 describes how a process may send synchronization signals to other processes. Chapter 10 explains how Linux executes, in turn, every active process in the system so that all of them can progress toward their completions. Synchronization mechanisms are needed by the kernel too: they are discussed in Chapter 11 for both uniprocessor and multiprocessor systems.

Now we are ready to move on to another essential topic, that is, how Linux implements the filesystem. A series of chapters covers this topic: Chapter 12 introduces a general layer that supports many different filesystems. Some Linux files are special because they provide trapdoors to reach hardware devices; Chapter 13 offers insights on these special files and on the corresponding hardware device drivers. Another issue to be considered is disk access time; Chapter 14 shows how a clever use of RAM reduces disk accesses and thus improves system performance significantly. Building on the material covered in these last chapters, we can now explain, in Chapter 15, how user applications access normal files. Chapter 16 completes
our discussion of Linux memory management and explains the techniques used by Linux to ensure that enough memory is always available. The last chapter dealing with files is Chapter 17, which illustrates the most-used Linux filesystem, namely Ext2.

The last two chapters end our detailed tour of the Linux kernel: Chapter 18 introduces communication mechanisms other than signals available to User Mode processes; Chapter 19 explains how user applications are started.

Last but not least are the appendixes: Appendix A sketches out how Linux is booted, while Appendix B describes how to dynamically reconfigure the running kernel, adding and removing functionalities as needed. Appendix C is just a list of the directories that contain the Linux source code. The Source Code Index includes all the Linux symbols referenced in the book; you will find here the name of the Linux file defining each symbol and the book's page number where it is explained. We think you'll find it quite handy.

Background Information

No prerequisites are required, except some skill in the C programming language and perhaps some knowledge of Assembly language.

Conventions in This Book

The following is a list of typographical conventions used in this book:

Constant Width
    Is used to show the contents of code files or the output from commands, and to indicate source code keywords that appear in code.
Italic
    Is used for file and directory names, program and command names, command-line options, URLs, and for emphasizing new terms.

How to Contact Us

We have tested and verified all the information in this book to the best of our abilities, but you may find that features have changed or that we have let errors slip through the production of the book. Please let us know of any errors that you find, as well as suggestions for future editions, by writing to:

O'Reilly & Associates, Inc.
101 Morris St.
Sebastopol, CA 95472
(800) 998-9938 (in the U.S. or Canada)
(707) 829-0515 (international/local)
(707) 829-0104 (fax)
The boot loader, which is invoked by the BIOS by jumping to physical address 0x00007c00, performs the following operations:

1. Moves itself from address 0x00007c00 to address 0x00090000.
2. Sets up the Real Mode stack, from address 0x00003ff4. As usual, the stack will grow toward lower addresses.
3. Sets up the disk parameter table, used by the BIOS to handle the floppy device driver.
4. Invokes a BIOS procedure to display a "Loading" message.
5. Invokes a BIOS procedure to load the setup( ) code of the kernel image from the floppy disk and puts it in RAM starting from address 0x00090200.
6. Invokes a BIOS procedure to load the rest of the kernel image from the floppy disk and puts the image in RAM starting from either low address 0x00010000 (for small kernel images compiled with make zImage) or high address 0x00100000 (for big kernel images compiled with make bzImage). In the following discussion, we will say that the kernel image is "loaded low" or "loaded high" in RAM, respectively. Support for big kernel images was introduced quite recently: while it uses essentially the same booting scheme as the older one, it places data in different physical memory addresses to avoid problems with the ISA hole mentioned in Section 2.5.3 in Chapter 2.
7. Jumps to the setup( ) code.

A.2.2 Booting Linux from Hard Disk

In most cases, the Linux kernel is loaded from a hard disk, and a two-stage boot loader is required. The most commonly used Linux boot loader on Intel systems is named LILO (LInux LOader); corresponding programs exist for other architectures. LILO may be installed either on the MBR, replacing the small program that loads the boot sector of the active partition, or in the boot sector of a (usually active) disk partition. In both cases, the final result is the same: when the loader is executed at boot time, the user may choose which operating system to load.

The LILO boot loader is broken into two parts, since otherwise it would be too large to fit into the MBR. The MBR or the
partition boot sector includes a small boot loader, which is loaded into RAM starting from address 0x00007c00 by the BIOS. This small program moves itself to the address 0x0009a000, sets up the Real Mode stack (ranging from 0x0009b000 to 0x0009a200), and loads the second part of the LILO boot loader into RAM starting from address 0x0009b000. In turn, this latter program reads a map of available operating systems from disk and offers the user a prompt so she can choose one of them. Finally, after the user has chosen the kernel to be loaded (or let a time-out elapse so that LILO chooses a default), the boot loader may either copy the boot sector of the corresponding partition into RAM and execute it or directly copy the kernel image into RAM.

Assuming that a Linux kernel image must be booted, the LILO boot loader, which relies on BIOS routines, performs essentially the same operations as the boot loader integrated into the kernel image described in the previous section about floppy disks. The loader displays the "Loading Linux" message; then it copies the integrated boot loader of the kernel image to address 0x00090000, the setup( ) code to address 0x00090200, and the rest of the kernel image to address 0x00010000 or 0x00100000. Then it jumps to the setup( ) code.

A.3 Middle Ages: The setup( ) Function

The code of the setup( ) assembly language function is placed by the linker immediately after the integrated boot loader of the kernel, that is, at offset 0x200 of the kernel image file. The boot loader can thus easily locate the code and copy it into RAM starting from physical address 0x00090200.

The setup( ) function must initialize the hardware devices in the computer and set up the environment for the execution of the kernel program. Although the BIOS already initialized most hardware devices, Linux does not rely on it but reinitializes the devices in its own manner to enhance portability and robustness. setup( ) essentially performs the following
operations:

1. Invokes a BIOS procedure to find out the amount of RAM available in the system.
2. Sets the keyboard repeat delay and rate. (When the user keeps a key pressed past a certain amount of time, the keyboard device sends the corresponding keycode over and over to the CPU.)
3. Initializes the video adapter card.
4. Reinitializes the disk controller and determines the hard disk parameters.
5. Checks for an IBM Micro Channel bus (MCA).
6. Checks for a PS/2 pointing device (bus mouse).
7. Checks for Advanced Power Management (APM) BIOS support.
8. If the kernel image was loaded low in RAM (at physical address 0x00010000), moves it to physical address 0x00001000. Conversely, if the kernel image was loaded high in RAM, does not move it. This step is necessary because, in order to be able to store the kernel image on a floppy disk and to save time while booting, the kernel image stored on disk is compressed, and the decompression routine needs some free space to use as a temporary buffer following the kernel image in RAM.
9. Sets up a provisional Interrupt Descriptor Table (IDT) and a provisional Global Descriptor Table (GDT).
10. Resets the floating point unit (FPU), if any.
11. Reprograms the Programmable Interrupt Controller (PIC) and maps the 16 hardware interrupts (IRQ lines) to the range of vectors from 32 to 47. The kernel must perform this step because the BIOS erroneously maps the hardware interrupts in the range from 0 to 15, which is already used for CPU exceptions (see Section 4.2.3 in Chapter 4).
12. Switches the CPU from Real Mode to Protected Mode by setting the PE bit in the cr0 control register. As explained in Section 2.5.5 in Chapter 2, the provisional kernel page tables contained in swapper_pg_dir and pg0 identically map the linear addresses to the same physical addresses. Therefore, the transition from Real Mode to Protected Mode goes smoothly.
13. Jumps to the startup_32( ) assembly language function.

A.4 Renaissance: The startup_32( ) Functions

There are two different startup_32( ) functions; the
one we refer to here is coded in the arch/i386/boot/compressed/head.S file. After setup( ) terminates, the function has been moved either to physical address 0x00100000 or to physical address 0x00001000, depending on whether the kernel image was loaded high or low in RAM.

This function performs the following operations:

1. Initializes the segmentation registers and a provisional stack.
2. Fills the area of uninitialized data of the kernel identified by the _edata and _end symbols with zeros (see Section 2.5.3 in Chapter 2).
3. Invokes the decompress_kernel( ) function to decompress the kernel image. The "Uncompressing Linux..." message is displayed first. After the kernel image has been decompressed, the "OK, booting the kernel." message is shown. If the kernel image was loaded low, the decompressed kernel is placed at physical address 0x00100000. Otherwise, if the kernel image was loaded high, the decompressed kernel is placed in a temporary buffer located after the compressed image. The decompressed image is then moved into its final position, which starts at physical address 0x00100000.
4. Jumps to physical address 0x00100000.

The decompressed kernel image begins with another startup_32( ) function included in the arch/i386/kernel/head.S file. Using the same name for both the functions does not create any problems (besides confusing our readers), since both functions are executed by jumping to their initial physical addresses.

The second startup_32( ) function essentially sets up the execution environment for the first Linux process (process 0). The function performs the following operations:

1. Initializes the segmentation registers with their final values.
2. Sets up the Kernel Mode stack for process 0 (see Section 3.3.2 in Chapter 3).
3. Invokes setup_idt( ) to fill the IDT with null interrupt handlers (see Section 4.4.2 in Chapter 4).
4. Puts the system parameters obtained from the BIOS and the parameters passed to the operating system into the first page frame (see Section
2.5.3 in Chapter 2).
5. Identifies the model of the processor.
6. Loads the gdtr and idtr registers with the addresses of the GDT and IDT tables.
7. Jumps to the start_kernel( ) function.

A.5 Modern Age: The start_kernel( ) Function

The start_kernel( ) function completes the initialization of the Linux kernel. Nearly every kernel component is initialized by this function; we mention just a few of them:

• The page tables are initialized by invoking the paging_init( ) function (see Section 2.5.5 in Chapter 2).
• The page descriptors are initialized by the mem_init( ) function (see Section 6.1 in Chapter 6).
• The final initialization of the IDT is performed by invoking trap_init( ) (see Section 4.5 in Chapter 4) and init_IRQ( ) (see Section 4.6.2 in Chapter 4).
• The slab allocator is initialized by the kmem_cache_init( ) and kmem_cache_sizes_init( ) functions (see Section 6.2.4 in Chapter 6).
• The system date and time are initialized by the time_init( ) function (see Section 5.1.1 in Chapter 5).
• The kernel thread for process 1 is created by invoking the kernel_thread( ) function. In turn, this kernel thread creates the other kernel threads and executes the /sbin/init program, as described in Section 3.3.2 in Chapter 3.

Besides the "Linux version 2.2.14..." message, which is displayed right after the beginning of start_kernel( ), many other messages are displayed in this last phase both by the init functions and by the kernel threads. At the end, the familiar login prompt appears on the console (or in the graphical screen if the X Window System is launched at startup), telling the user that the Linux kernel is up and running.

Appendix B Modules

As stated in Chapter 1, modules are Linux's recipe for effectively achieving many of the theoretical advantages of microkernels without introducing performance penalties.

B.1 To Be (a Module) or Not to Be?
When system programmers want to add a new functionality to the Linux kernel, they are faced with an interesting dilemma: should they write the new code so that it will be compiled as a module, or should they statically link the new code to the kernel?

As a general rule, system programmers tend to implement new code as a module. Because modules can be linked on demand, as we will see later, the kernel does not have to be bloated with hundreds of seldom-used programs. Nearly every higher-level component of the Linux kernel (filesystems, device drivers, executable formats, network layers, and so on) can be compiled as a module.

However, some Linux code must necessarily be linked statically, which means that either the corresponding component is included in the kernel, or it is not compiled at all. This happens typically when the component requires a modification to some data structure or function statically linked in the kernel.

As an example, suppose that the component has to introduce new fields into the process descriptor. Linking a module cannot change an already defined data structure like task_struct since, even if the module uses its modified version of the data structure, all statically linked code continues to see the old version: data corruption will easily occur. A partial solution to the problem consists of "statically" adding the new fields to the process descriptor, thus making them available to the kernel component, no matter how it has been linked. However, if the kernel component is never used, such extra fields replicated in every process descriptor are a waste of memory. If the new kernel component increases the size of the process descriptor a lot, one would get better system performance by adding the required fields in the data structure only if the component is statically linked to the kernel.

As a second example, consider a kernel component that has to replace statically linked code. It's pretty clear that no such component can be compiled as a module because the kernel
cannot change the machine code already in RAM when linking the module. For instance, it is not possible to link a module that changes the way page frames are allocated, since the Buddy system functions are always statically linked to the kernel.

The kernel has two key tasks to perform in managing modules. The first task is making sure the rest of the kernel can reach the module's global symbols, such as the entry point to its main function. A module must also know the addresses of symbols in the kernel and in other modules. So references are resolved once and for all when a module is linked. The second task consists of keeping track of the use of modules, so that no module is unloaded while another module or another part of the kernel is using it. A simple reference count keeps track of each module's usage.

B.2 Module Implementation

Modules are stored in the filesystem as ELF object files. The kernel considers only modules that have been loaded into RAM by the /sbin/insmod program (see Section B.3) and for each of them it allocates a memory area containing the following data:

• A module object
• A null-terminated string that represents the name of the module (all modules should have unique names)
• The code that implements the functions of the module

The module object describes a module; its fields are shown in Table B-1. A singly linked list collects all module objects, where the next field of each object points to the next element in the list. The first element of the list is addressed by the module_list variable. But actually, the first element of the list is always the same: it is named kernel_module and refers to a fictitious module representing the statically linked kernel code.

Table B-1 The module Object

Type                              Name             Description
unsigned long                     size_of_struct   Size of module object
struct module *                   next             Next list element
const char *                      name             Pointer to module name
unsigned long                     size             Module size
atomic_t                          uc.usecount      Module usage counter
unsigned long                     flags            Module flags
unsigned int                      nsyms            Number of exported symbols
unsigned int                      ndeps            Number of referenced modules
struct module_symbol *            syms             Table of exported symbols
struct module_ref *               deps             List of referenced modules
struct module_ref *               refs             List of referencing modules
int (*)(void)                     init             Initialization method
void (*)(void)                    cleanup          Cleanup method
struct exception_table_entry *    ex_table_start   Start of exception table
struct exception_table_entry *    ex_table_end     End of exception table

The total size of the memory area allocated for the module (including the module object and the module name) is contained in the size field.

As already mentioned in Section 8.2.6 in Chapter 8, each module has its own exception table. The table includes the addresses of the fixup code of the module, if any. The table is copied in RAM when the module is linked, and its starting and ending addresses are stored in the ex_table_start and ex_table_end fields of the module object.

B.2.1 Module Usage Counter

Each module has a usage counter, stored in the uc.usecount field of the corresponding module object. The counter is incremented when an operation involving the module's functions is started and decremented when the operation terminates. A module can be unlinked only if its usage counter is zero.

As an example, suppose that the MS-DOS filesystem layer has been compiled as a module and that the module has been linked at runtime. Initially, the module usage counter is zero. If the user mounts an MS-DOS floppy disk, the module usage counter is incremented by 1. Conversely, when the user unmounts the floppy disk, the counter is decremented by 1.

B.2.2 Exporting Symbols

When linking a module, all references to global kernel symbols (variables and functions) in the module's object code must be replaced with suitable addresses. This operation, which is very similar to that performed by the linker while compiling a User Mode program (see Section 19.1.3 in Chapter 19), is
delegated to the /sbin/insmod external program (described later in Section B.3).

A special table is used by the kernel to store the symbols that can be accessed by modules together with their corresponding addresses. This kernel symbol table is contained in the __ksymtab section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ksymtab and __stop___ksymtab. The EXPORT_SYMBOL macro, when used inside the statically linked kernel code, forces the C compiler to add a specified symbol to the table.

Only the kernel symbols actually used by some existing module are included in the table. Should a system programmer need, within some module, to access a kernel symbol that is not already exported, he can simply add the corresponding EXPORT_SYMBOL macro into the kernel/ksyms.c file of the Linux source code.

Linked modules can also export their own symbols, so that other modules can access them. The module symbol table is contained in the __ksymtab section of the module code segment. If the module source code includes the EXPORT_NO_SYMBOLS macro, no symbols from that module are added to the table. To export a subset of symbols from the module, the programmer must define the EXPORT_SYMTAB macro before including the include/linux/module.h header file. Then he may use the EXPORT_SYMBOL macro to export a specific symbol. If neither EXPORT_NO_SYMBOLS nor EXPORT_SYMTAB appears in the module source code, all global symbols of the module are exported.

The symbol table in the __ksymtab section is copied into a memory area when the module is linked, and the address of the area is stored in the syms field of the module object. The symbols exported by the statically linked kernel and all linked-in modules can be retrieved by reading the /proc/ksyms file or using the query_module( ) system call (described in Section B.3).

B.2.3 Module Dependency

A module (B) can refer to the symbols exported by another module (A); in this case,
we say that B is loaded on top of A, or equivalently that A is used by B. In order to link module B, module A must have already been linked; otherwise, the references to the symbols exported by A cannot be properly linked in B. In short, there is a dependency between modules.

The deps field of the module object relative to B points to a list describing all modules that are used by B; in our example, A's module object would appear in that list. The ndeps field stores the number of modules used by B. Conversely, the refs field of A points to a list describing all modules that are loaded on top of A (thus, B's module object will be included when it is loaded). The refs list must be updated dynamically whenever a module is loaded on top of A. In order to ensure that module A is not removed before B, A's usage counter is incremented for each module loaded on top of it.

Besides A and B there could be, of course, another module (C) loaded on top of B, and so on. Stacking modules is an effective way to modularize the kernel source code in order to speed up its development and improve its portability.

B.3 Linking and Unlinking Modules

A user can link a module into the running kernel by executing the /sbin/insmod external program. This program performs the following operations:

1. Reads from the command line the name of the module to be linked.
2. Locates the file containing the module's object code in the system directory tree. The file is usually placed in some subdirectory below /lib/modules.
3. Computes the size of the memory area needed to store the module code, its name, and the module object.
4. Invokes the create_module( ) system call, passing to it the name and size of the new module. The corresponding sys_create_module( ) service routine performs the following operations:
   a. Checks whether the user is allowed to link the module (the current process must have the CAP_SYS_MODULE capability). In any situation where one is adding functionality to a kernel, which has
access to all data and processes on the system, security is a paramount concern.
   b. Invokes the find_module( ) function to scan the module_list list of module objects looking for a module with the specified name. If it is found, the module has already been linked, so the system call terminates.
   c. Invokes vmalloc( ) to allocate a memory area for the new module.
   d. Initializes the fields of the module object at the beginning of the memory area and copies the name of the module right below the object.
   e. Inserts the module object into the list pointed to by module_list.
   f. Returns the starting address of the memory area allocated to the module.
5. Invokes the query_module( ) system call with the QM_MODULES subcommand to get the names of all already linked modules.
6. Invokes the query_module( ) system call with the QM_SYMBOLS subcommand repeatedly, to get the kernel symbol table and the symbol tables of all modules that are already linked in.
7. Using the kernel symbol table, the module symbol tables, and the address returned by the create_module( ) system call, relocates the object code included in the module's file. This means replacing all occurrences of external and global symbols with the corresponding logical address offsets.
8. Allocates a memory area in the User Mode address space and loads it with a copy of the module object, the module's name, and the module's code relocated for the running kernel. The address fields of the object point to the relocated code. The init field is set to the relocated address of the module's init_module( ) function, if the module defines one. (Virtually all modules define a function of that name, which is invoked in the next step to perform any initialization required by the module.)
Similarly, the cleanup field is set to the relocated address of the module's cleanup_module( ) function, if one is present.
9. Invokes the init_module( ) system call, passing to it the address of the User Mode memory area set up in the previous step. The sys_init_module( ) service routine performs the following operations:
   a. Checks whether the user is allowed to link the module (the current process must have the CAP_SYS_MODULE capability).
   b. Invokes find_module( ) to find the proper module object in the list to which module_list points.
   c. Overwrites the module object with the contents of the corresponding object in the User Mode memory area.
   d. Performs a series of sanity checks on the addresses in the module object.
   e. Copies the remaining part of the User Mode memory area into the memory area allocated to the module.
   f. Scans the module list and initializes the ndeps and deps fields of the module object.
   g. Sets the module usage counter to 1.
   h. If defined, executes the init method of the module to initialize the module's data structures properly. The method is usually implemented by the init_module( ) function defined inside the module.
   i. Sets the module usage counter to 0 and returns.
10. Releases the User Mode memory area and terminates.

In order to unlink a module, a user invokes the /sbin/rmmod external program, which performs the following operations:

1. From the command line, reads the name of the module to be unlinked.
2. Invokes the query_module( ) system call with the QM_MODULES subcommand to get the list of linked modules.
3. Invokes the query_module( ) system call with the QM_REFS subcommand several times, to retrieve dependency information on the linked modules. If some module is linked on top of the one to be removed, terminates.
4. Invokes the delete_module( ) system call, passing the module's name to it. The corresponding sys_delete_module( ) service routine performs these operations:
   a. Checks whether the user is allowed to remove the module (the current
process must have the CAP_SYS_MODULE capability).
   b. Invokes find_module( ) to find the corresponding module object in the list to which module_list points.
   c. Checks whether both the refs field and the uc.usecount field of the module object are null; otherwise, returns an error code.
   d. If defined, invokes the cleanup method to perform the operations needed to cleanly shut down the module. The method is usually implemented by the cleanup_module( ) function defined inside the module.
   e. Scans the deps list of the module and removes the module from the refs list of any element found.
   f. Removes the module from the list to which module_list points.
   g. Invokes vfree( ) to release the memory area used by the module and returns 0 (success).

B.4 Linking Modules on Demand

A module can be automatically linked when the functionality it provides is requested and automatically removed afterward.

For instance, suppose that the MS-DOS filesystem has not been linked, either statically or dynamically. If a user tries to mount an MS-DOS filesystem, the mount( ) system call normally fails by returning an error code, since MS-DOS is not included in the file_systems list of registered filesystems. However, if support for automatic linking of modules has been specified when configuring the kernel, Linux makes an attempt to link the MS-DOS module, then scans the list of registered filesystems again. If the module was successfully linked, the mount( ) system call can continue its execution as if the MS-DOS filesystem were present from the beginning.

B.4.1 The modprobe Program

In order to automatically link a module, the kernel creates a kernel thread to execute the /sbin/modprobe external program,[A] which takes care of possible complications due to module dependencies. The dependencies were already discussed earlier: a module may require one or more other modules, and these in turn may require still other modules. For instance, the MS-DOS module requires another module named
fat containing some code common to all filesystems based on a File Allocation Table (FAT). Thus, if it is not already present, the fat module must also be automatically linked into the running kernel when the MS-DOS module is requested. Resolving dependencies and finding modules is a type of activity that's best done in User Mode, because it requires locating and accessing module object files in the filesystem.

[A] This is one of the few examples in which the kernel relies on an external program.

The /sbin/modprobe external program is similar to insmod, since it links in a module specified on the command line. However, modprobe also recursively links in all modules used by the module specified on the command line. For instance, if a user invokes modprobe to link the MS-DOS module, the program links the fat module, if necessary, followed by the MS-DOS module. Actually, modprobe just checks for module dependencies; the actual linking of each module is done by forking a new process and executing insmod.

How does modprobe know about module dependencies?
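Before turning to that question, note that the linking order just described (dependencies first, then the requested module) amounts to a post-order walk of the dependency graph. The following C sketch illustrates only that ordering; it is not modprobe's actual code. The mod_entry structure, the mod_table array, and the find_entry( ) and probe( ) functions are hypothetical names introduced here: the table stands in for the dependency information, the linked flag stands in for the list of already linked modules, and the point where the real program would fork and execute insmod is marked by a comment. The sketch assumes that the dependency graph contains no cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPS 4

/* Hypothetical stand-in for one module's dependency record. */
struct mod_entry {
    const char *name;
    const char *deps[MAX_DEPS]; /* modules used by this one, NULL-terminated */
    int linked;                 /* nonzero once the module counts as linked */
};

/* Illustrative table: the msdos module is loaded on top of fat. */
static struct mod_entry mod_table[] = {
    { "fat",   { NULL },        0 },
    { "msdos", { "fat", NULL }, 0 },
};

/* Looks up a module by name; returns NULL if it is not in the table. */
static struct mod_entry *find_entry(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(mod_table) / sizeof(mod_table[0]); i++)
        if (strcmp(mod_table[i].name, name) == 0)
            return &mod_table[i];
    return NULL;
}

/*
 * Links every module that `name` uses, then `name` itself.
 * Appends the names to order[] in linking order, starting at index
 * `count`, and returns the new count, or -1 if a module is unknown.
 * Assumes an acyclic dependency graph (no cycle detection).
 */
static int probe(const char *name, const char **order, int count, int max)
{
    struct mod_entry *m = find_entry(name);
    int i;

    if (m == NULL)
        return -1;
    if (m->linked)
        return count;           /* already linked: nothing to do */

    for (i = 0; i < MAX_DEPS && m->deps[i] != NULL; i++) {
        count = probe(m->deps[i], order, count, max);
        if (count < 0)
            return -1;
    }

    assert(count < max);        /* caller must supply a large enough array */
    m->linked = 1;              /* the real program would fork and run insmod here */
    order[count++] = m->name;
    return count;
}
```

Calling probe("msdos", order, 0, 8) on a fresh table, with an eight-slot order array, stores fat and then msdos in order and returns 2. A second call for msdos adds nothing, because both entries are already marked as linked.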
Another external program named /sbin/depmod is executed at system startup. It looks at all the modules compiled for the running kernel, which are usually stored inside the /lib/modules directory. Then it writes all module dependencies to a file named modules.dep. The modprobe program can thus simply compare the information stored in the file with the list of linked modules produced by the query_module( ) system call.

B.4.2 The request_module( ) Function

In some cases, the kernel may invoke the request_module( ) function to attempt automatic linking for a module.

Consider again the case of a user trying to mount an MS-DOS filesystem: if the get_fs_type( ) function discovers that the filesystem is not registered, it invokes the request_module( ) function in the hope that MS-DOS has been compiled as a module.

If the request_module( ) function succeeds in linking the requested module, get_fs_type( ) can continue as if the module were always present. Of course, this does not always happen; in our example, the MS-DOS module might not have been compiled at all. In this case, get_fs_type( ) returns an error code.

The request_module( ) function receives the name of the module to be linked as its parameter. It invokes kernel_thread( ) to create a new kernel thread that executes the exec_modprobe( ) function, then it simply waits until that kernel thread terminates.

The exec_modprobe( ) function, in turn, also receives the name of the module to be linked as its parameter. It invokes the execve( ) system call and executes the /sbin/modprobe external program,[B] passing the module name to it. In turn, the modprobe program actually links the requested module, along with any that it depends on.

[B] The name and path of the program executed by exec_modprobe( ) can be customized by writing into the /proc/sys/kernel/modprobe file.

Each module automatically linked into the kernel has the MOD_AUTOCLEAN flag in the flags field of the module object set. This flag allows
automatic unlinking of the module when it is no longer used.

In order to automatically unlink the module, a system process (like crond) periodically executes the rmmod external program, passing the -a option to it. The latter program executes the delete_module( ) system call with a NULL parameter. The corresponding service routine scans the list of module objects and removes all unused modules having the MOD_AUTOCLEAN flag set.

Appendix C. Source Code Structure

In order to help you to find your way through the files of the source code, we briefly describe the organization of the kernel directory tree. As usual, all pathnames refer to the main directory of the Linux kernel, which is, in most Linux distributions, /usr/src/linux.

Linux source code for all supported architectures is contained in about 4500 C and Assembly files stored in about 270 subdirectories; it consists of about million lines of code, which occupy more than 58 megabytes of disk space.

The following list illustrates the directory tree containing the Linux source code. Please notice that only the subdirectories somehow related to the target of this book have been expanded.

init              Kernel initialization code
kernel            Kernel core: processes, timing, program execution, signals, modules, ...
mm                Memory handling
arch              Platform-dependent code
-i386             IBM's PC architecture
--kernel          Kernel core
--mm              Memory management
--math-emu        Software emulator for floating point unit
--lib             Hardware-dependent utility functions
--boot            Bootstrapping
---compressed     Compressed kernel handling
---tools          Programs to build compressed kernel image
-alpha            Compaq's Alpha architecture
-s390             IBM's System/390 architecture
-sparc            Sun's SPARC architecture
-sparc64          Sun's Ultra-SPARC architecture
-mips             Silicon Graphics' MIPS architecture
-ppc              Motorola-IBM's PowerPC-based architectures
-m68k             Motorola's MC680x0-based architecture
-arm              Architectures based on ARM processor
fs                Filesystems
-proc             /proc virtual filesystem
-devpts           /dev/pts virtual filesystem
-ext2             Linux native Ext2 filesystem
-isofs            ISO9660 filesystem (CD-ROM)
-nfs              Network File System (NFS)
-nfsd             Integrated Network filesystem server
-fat              Common code for FAT-based filesystems
-msdos            Microsoft's MS-DOS filesystem
-vfat             Microsoft's Windows filesystem (VFAT)
-nls              Native Language Support
-ntfs             Microsoft's Windows NT filesystem
-smbfs            Microsoft's Windows Server Message Block (SMB) filesystem
-umsdos           UMSDOS filesystem
-minix            MINIX filesystem
-hpfs             IBM's OS/2 filesystem
-sysv             System V, SCO, Xenix, Coherent, and Version 7 filesystem
-ncpfs            Novell's Netware Core Protocol (NCP)
-ufs              Unix BSD, SunOs, FreeBSD, NetBSD, OpenBSD, and NeXTStep filesystem
-affs             Amiga's Fast File System (FFS)
-coda             Coda network filesystem
-hfs              Apple's Macintosh filesystem
-adfs             Acorn Disc Filing System
-efs              SGI IRIX's EFS filesystem
-qnx4             Filesystem for QNX OS
-romfs            Small read-only filesystem
-autofs           Directory automounter support
-lockd            Remote file locking support
net               Networking code
ipc               System V's Interprocess Communication
drivers           Device drivers
-block            Block device drivers
--paride          Support for accessing IDE devices from parallel port
-scsi             SCSI device drivers
-char             Character device drivers
--joystick        Joysticks
--ftape           Tape-streaming devices
--hfmodem         Ham radio devices
-ip2              IntelliPort's multiport serial controllers
-net              Network card devices
-sound            Audio card devices
-video            Video card devices
-cdrom            Proprietary CD-ROM devices (neither ATAPI nor SCSI)
-isdn             ISDN devices
-ap1000           Fujitsu's AP1000 devices
-macintosh        Apple's Macintosh devices
-sgi              Silicon Graphics' devices
-fc4              Fibre Channel devices
-acorn            Acorn's devices
-misc             Miscellaneous devices
-pnp              Plug-and-play support
-usb              Universal Serial Bus (USB) support
-pci              PCI bus support
-sbus             Sun's SPARC SBus support
-nubus            Apple's Macintosh Nubus support
-zorro            Amiga's Zorro bus support
-dio              Hewlett-Packard's HP300 DIO bus support
-tc               Digital's TurboChannel support (not yet finished)
lib               General-purpose kernel functions
include           Header files (.h)
-linux            Kernel core
--lockd           Remote file locking
--nfsd            Integrated Network File Server
--sunrpc          Sun's Remote Procedure Call
--byteorder       Byte-swapping functions
--modules         Module support
-asm-generic      Platform-independent low-level header files
-asm-i386         IBM's PC architecture
-asm-alpha        Compaq's Alpha architecture
-asm-mips         Silicon Graphics' MIPS architecture
-asm-m68k         Motorola's MC680x0-based architecture
-asm-ppc          Motorola-IBM's PowerPC architecture
-asm-s390         IBM's System/390 architecture
-asm-sparc        Sun's SPARC architecture
-asm-sparc64      Sun's Ultra-SPARC architecture
-asm-arm          Architectures based on ARM processor
-net              Networking
-scsi             SCSI support
-video            Video card support
-config           Header files containing the macros that define the kernel configuration
scripts           External programs for building the kernel image
Documentation     Text files with general explanations and hints about kernel components

Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

The cover image of a man with a bubble is adapted from a 19th-century engraving from the Dover Pictorial Archive.

Edie Freeman designed the cover. Emma Colby produced the cover with QuarkXPress™ 4.1, using the ITC Garamond Condensed font.

David Futato designed the interior layout based on a series design by Alicia Cech.

Catherine Morris was the production editor, and Norma Emory was the copyeditor for Understanding the Linux Kernel. Clairemarie Fisher O'Leary was the proofreader. Jeff Holcomb, Claire Cloutier, and Catherine Morris provided quality control. Judy Hoer and Joe Wizda wrote the index. Linley Dolby, Rachel Wheeler, and Deborah Smith provided production support.

The illustrations that appear in the book
were produced by Robert Romano using Macromedia FreeHand and Adobe Photoshop.