This document is an extract from William Stallings, Copyright 2008.

2.8 LINUX

History
WINDOWS/LINUX COMPARISON: General

Windows Vista: A commercial OS, with strong influences from VAX/VMS and requirements for compatibility with multiple OS personalities, such as DOS/Windows, POSIX, and, originally, OS/2.

Linux: An open-source implementation of UNIX, focused on simplicity and efficiency. Runs on a very large range of processor architectures.

Environment which influenced fundamental design decisions:

Windows: 32-bit program address space; Mbytes of physical memory; virtual memory; multiprocessor (4-way); micro-controller based I/O devices; client/server distributed computing; large, diverse user populations.

Linux: 16-bit program address space; Kbytes of physical memory; swapping system with memory mapping; uniprocessor; state-machine based I/O devices; standalone interactive systems; small number of friendly users.

Compare these with today's environment: 64-bit addresses; Gbytes of physical memory; virtual memory, virtual processors; multiprocessor (64-128); high-speed internet/intranet, Web services; single user, but vulnerable to hackers worldwide.

Although both Windows and Linux have adapted to changes in the environment, the original design environments (i.e., in 1989 and 1973) heavily influenced the design choices:

- Unit of concurrency: threads vs. processes [address space, uniprocessor]
- Process creation: CreateProcess() vs. fork() [address space, swapping]
- I/O: async vs. sync [swapping, I/O devices]
- Security: discretionary access vs. uid/gid [user populations]

System structure:

Windows: Modular core kernel, with explicit publishing of data structures and interfaces by components. Three layers:
- Hardware Abstraction Layer manages processor, interrupt, DMA, and BIOS details
- Kernel layer manages interrupts, synchronization, and CPU scheduling
- Executive layer implements the major OS functions in a fully threaded, mostly preemptive environment

Linux: Monolithic kernel.

Windows: Dynamic data structures and kernel address space organization; initialization code discarded after boot. Much kernel code and data is pageable. Non-pageable kernel code and data uses large pages for TLB efficiency.

Linux: Kernel code and data is statically allocated to non-pageable memory.
CHAPTER 2 / OPERATING SYSTEM OVERVIEW
Windows: File systems, networking, and devices are loadable/unloadable drivers (dynamic link libraries) using the extensible I/O system interfaces. Dynamically loaded drivers can provide both pageable and non-pageable sections.

Linux: Extensive support for loading/unloading kernel modules, such as device drivers and file systems. Modules cannot be paged, but can be unloaded.

Windows: Namespace root is virtual, with file systems mounted underneath. Types of system objects are easily extended, and leverage unified naming, referencing, lifetime management, security, and handle-based synchronization.

Linux: Namespace is rooted in a file system; adding new named system objects requires file system changes or mapping onto the device model.

Windows: OS personalities implemented as user-mode subsystems. Native NT APIs are based on the general kernel handle/object architecture and allow cross-process manipulation of virtual memory, threads, and other kernel objects.

Linux: Implements a POSIX-compatible, UNIX-like interface; kernel APIs far simpler than Windows. Can understand various types of executables.

Windows: Discretionary access controls, privileges, auditing.

Linux: User/group IDs; capabilities similar to NT privileges can also be associated with processes.
Key to the success of Linux has been the availability of free software packages under the auspices of the Free Software Foundation (FSF). FSF's goal is stable, platform-independent software that is free, high quality, and embraced by the user community. FSF's GNU project provides tools for software developers, and the GNU Public License (GPL) is the FSF seal of approval. Torvalds used GNU tools in developing his kernel, which he then released under the GPL. Thus, the Linux distributions that you see today are the product of FSF's GNU project, Torvalds' individual effort, and many collaborators all over the world.

In addition to its use by many individual programmers, Linux has now made significant penetration into the corporate world. This is not only because of the free software but also because of the quality of the Linux kernel. Many talented programmers have contributed to the current version, resulting in a technically impressive product. Moreover, Linux is highly modular and easily configured. This makes it easy to squeeze optimal performance from a variety of hardware platforms. Plus, with the source code available, vendors can tweak applications and utilities to meet specific requirements. Throughout this book, we will provide details of Linux kernel internals based on the most recent version, Linux 2.6.
Modular Structure
Most UNIX kernels are monolithic. Recall from earlier in this chapter that a monolithic kernel is one that includes virtually all of the OS functionality in one large block of code that runs as a single process with a single address space. All the functional components of the kernel have access to all of its internal data structures and routines. If changes are made to any portion of a typical monolithic OS, all the modules and routines must be relinked and reinstalled and the system rebooted before the changes can take effect. As a result, any modification, such as adding a new device driver or file system function, is difficult. This problem is especially acute for Linux, for which development is global and done by a loosely associated group of independent programmers.

Although Linux does not use a microkernel approach, it achieves many of the potential advantages of this approach by means of its particular modular architecture. Linux is structured as a collection of modules, a number of which can be automatically loaded and unloaded on demand. These relatively independent blocks are referred to as loadable modules [GOYE99]. In essence, a module is an object file whose code can be linked to and unlinked from the kernel at runtime. Typically, a module implements some specific function, such as a file system, a device driver, or some other feature of the kernel's upper layer. A module does not execute as its own process or thread, although it can create kernel threads for various purposes as necessary. Rather, a module is executed in kernel mode on behalf of the current process.

Thus, although Linux may be considered monolithic, its modular structure overcomes some of the difficulties in developing and evolving the kernel.

The Linux loadable modules have two important characteristics:
- Dynamic linking: A kernel module can be loaded and linked into the kernel while the kernel is already in memory and executing. A module can also be unlinked and removed from memory at any time.
- Stackable modules: The modules are arranged in a hierarchy. Modules serve as libraries when they are referenced by client modules higher up in the hierarchy, and as clients when they reference modules further down.
Dynamic linking [FRAN97] facilitates configuration and saves kernel memory. In Linux, a user program or user can explicitly load and unload kernel modules using the insmod and rmmod commands. The kernel itself monitors the need for particular functions and can load and unload modules as needed. With stackable modules, dependencies between modules can be defined. This has two benefits:

1. Code common to a set of similar modules (e.g., drivers for similar hardware) can be moved into a single module, reducing replication.
2. The kernel can make sure that needed modules are present, refraining from unloading a module on which other running modules depend, and loading any additional required modules when a new module is loaded.
Figure 2.17 is an example that illustrates the structures used by Linux to manage modules. The figure shows the list of kernel modules after only two modules have been loaded: FAT and VFAT. Each module is defined by two tables, the module table and the symbol table. The module table includes the following elements:

- *next: Pointer to the following module. All modules are organized into a linked list. The list begins with a pseudomodule (not shown in Figure 2.17).
- *name: Pointer to the module name.
Figure 2.17  Example List of Linux Kernel Modules
- usecount: Module usage counter. The counter is incremented when an operation involving the module's functions is started and decremented when the operation terminates.
- flags: Module flags.
- nsyms: Number of exported symbols.
- ndeps: Number of referenced modules.
- *syms: Pointer to this module's symbol table.
- *deps: Pointer to list of modules that are referenced by this module.
- *refs: Pointer to list of modules that use this module.
The symbol table defines those symbols controlled by this module that are used elsewhere. Figure 2.17 shows that the VFAT module was loaded after the FAT module and that the VFAT module is dependent on the FAT module.

Kernel Components
Figure 2.18, taken from [MOSB02], shows the main components of the Linux kernel as implemented on an IA-64 architecture (e.g., Intel Itanium). The figure shows several processes running on top of the kernel. Each box indicates a separate process, while each squiggly line with an arrowhead represents a thread of execution. The kernel itself consists of an interacting collection of components, with arrows indicating the main interactions. The underlying hardware is also depicted as a set of components, with arrows indicating which kernel components use or control which hardware components. All of the kernel components, of course, execute on the processor but, for simplicity, these relationships are not shown.

Figure 2.18  Linux Kernel Components

Briefly, the principal kernel components are the following:
- Signals: The kernel uses signals to call into a process. For example, signals are used to notify a process of certain faults, such as division by zero. Table 2.6 gives a few examples of signals.

Table 2.6  Some Linux Signals
SIGHUP    Terminal hangup          SIGCONT    Continue
SIGQUIT   Keyboard quit            SIGTSTP    Keyboard stop
SIGTRAP   Trace trap               SIGTTOU    Terminal write
SIGBUS    Bus error                SIGXCPU    CPU limit exceeded
SIGKILL   Kill signal              SIGVTALRM  Virtual alarm clock
SIGSEGV   Segmentation violation   SIGWINCH   Window size changed
SIGPIPE   Broken pipe              SIGPWR     Power failure
SIGTERM   Termination              SIGRTMIN   First real-time signal
- System calls: The system call is the means by which a process requests a specific kernel service. There are several hundred system calls, which can be roughly grouped into six categories: file system, process, scheduling, interprocess communication, socket (networking), and miscellaneous. Table 2.7 defines a few examples in each category.

Table 2.7  Some Linux System Calls
Filesystem related
link — Make a new name for a file.
open — Open and possibly create a file or device.
read — Read from file descriptor.
write — Write to file descriptor.

Process related
execve — Execute program.
exit — Terminate the calling process.
getpid — Get process identification.
setuid — Set user identity of the current process.
ptrace — Provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers.

Scheduling related
sched_setparam — Sets the scheduling parameters associated with the scheduling policy for the process identified by pid.
sched_get_priority_max — Returns the maximum priority value that can be used with the scheduling algorithm identified by policy.
sched_setscheduler — Sets both the scheduling policy (e.g., FIFO) and the associated parameters for the process pid.
sched_rr_get_interval — Writes into the timespec structure pointed to by the parameter tp the round-robin time quantum for the process pid.
sched_yield — A process can relinquish the processor voluntarily without blocking via this system call. The process will then be moved to the end of the queue for its static priority and a new process gets to run.

Interprocess Communication (IPC) related
msgrcv — A message buffer structure is allocated to receive a message. The system call then reads a message from the message queue specified by msqid into the newly created message buffer.
semctl — Performs the control operation specified by cmd on the semaphore set semid.
semop — Performs operations on selected members of the semaphore set semid.
shmat — Attaches the shared memory segment identified by shmid to the data segment of the calling process.
shmctl — Allows the user to receive information on a shared memory segment; set the owner, group, and permissions of a shared memory segment; or destroy a segment.
Table 2.7 (Continued)
Socket (networking) related
bind — Assigns the local IP address and port for a socket. Returns 0 for success and -1 for error.
connect — Establishes a connection between the given socket and the remote socket associated with sockaddr.
gethostname — Returns local host name.
send — Send the bytes contained in the buffer pointed to by *msg over the given socket.
setsockopt — Sets the options on a socket.

Miscellaneous
create_module — Attempts to create a loadable module entry and reserve the kernel memory that will be needed to hold the module.
fsync — Copies all in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage.
query_module — Requests information related to loadable modules from the kernel.
time — Returns the time in seconds since January 1, 1970.
vhangup — Simulates a hangup on the current terminal. This call arranges for other users to have a "clean" terminal at login time.
- Processes and scheduler: Creates, manages, and schedules processes.
- Virtual memory: Allocates and manages virtual memory for processes.
- File systems: Provides a global, hierarchical namespace for files, directories, and other file-related objects and provides file system functions.
- Network protocols: Supports the Sockets interface to users for the TCP/IP protocol suite.
- Character device drivers: Manages devices that require the kernel to send or receive data one byte at a time, such as terminals, modems, and printers.
- Block device drivers: Manages devices that read and write data in blocks, such as various forms of secondary memory (magnetic disks, CD-ROMs, etc.).
- Network device drivers: Manages network interface cards and communications ports that connect to network devices, such as bridges and routers.
- Traps and faults: Handles traps and faults generated by the processor, such as a memory fault.
- Physical memory: Manages the pool of page frames in real memory and allocates pages for virtual memory.
4.6 LINUX PROCESS AND THREAD MANAGEMENT

Linux Tasks

A process, or task, in Linux is represented by a task_struct data structure. The task_struct data structure contains information in a number of categories:

- State: The execution state of the process (executing, ready, suspended, stopped, zombie). This is described subsequently.

WINDOWS/LINUX COMPARISON

Windows:
- Processes are containers for the user-mode address space, a general handle mechanism for referencing kernel objects, and threads; threads run in a process and are the schedulable entities.
- Processes are created by discrete steps which construct the container for a new program and the first thread; a fork()-like native API exists, but is only used for POSIX compatibility.
- Process handle table used to uniformly reference kernel objects (representing processes, threads, memory sections, synchronization, I/O devices, drivers, open files, network connections, timers, kernel transactions).
- Up to 16 million handles on kernel objects are supported per process.
- Kernel is fully multithreaded, with kernel preemption enabled on all systems in the original design.
- Many system services implemented using a client/server computing model, including the OS personality subsystems that run in user mode and communicate using remote procedure calls.

Linux:
- Processes are both containers and the schedulable entities; processes can share address space and system resources, making processes effectively usable as threads.
- Processes are created by making virtual copies with fork() and then, if necessary, overwriting with exec() to run a new program.
- Kernel objects referenced by an ad hoc collection of APIs and mechanisms, including file descriptors for open files and sockets, and PIDs for processes and process groups.
- Up to 64 open files/sockets are supported per process.
- Few kernel processes used, and kernel preemption is a recent feature.
- Scheduling information: Information needed by Linux to schedule processes. A process can be normal or real time and has a priority. Real-time processes are scheduled before normal processes, and within each category, relative priorities can be used. A counter keeps track of the amount of time a process is allowed to execute.
- Identifiers: Each process has a unique process identifier and also has user and group identifiers. A group identifier is used to assign resource access privileges to a group of processes.
- Interprocess communication: Linux supports the IPC mechanisms found in UNIX SVR4, described in Chapter 6.
- Links: Each process includes a link to its parent process, links to its siblings (processes with the same parent), and links to all of its children.
- Times and timers: Includes process creation time and the amount of processor time so far consumed by the process. A process may also have associated one or more interval timers. A process defines an interval timer by means of a system call; as a result, a signal is sent to the process when the timer expires. A timer may be single use or periodic.
- File system: Includes pointers to any files opened by this process, as well as pointers to the current and the root directories for this process.
- Address space: Defines the virtual address space assigned to this process.
- Processor-specific context: The registers and stack information that constitute the context of this process.

Figure 4.18 shows the execution states of a process. These are as follows:
- Running: This state value corresponds to two states. A Running process is either executing or it is ready to execute.
- Interruptible: This is a blocked state, in which the process is waiting for an event, such as the end of an I/O operation, the availability of a resource, or a signal from another process.
- Uninterruptible: This is another blocked state. The difference between this and the Interruptible state is that in an Uninterruptible state, a process is waiting directly on hardware conditions and therefore will not handle any signals.
- Stopped: The process has been halted and can only resume by positive action from another process. For example, a process that is being debugged can be put into the Stopped state.
- Zombie: The process has been terminated but, for some reason, still must have its task structure in the process table.
Linux Threads
Figure 4.18  Linux Process/Thread Model
known as pthread (POSIX thread) libraries, with all of the threads mapping into a single kernel-level process. We have seen that modern versions of UNIX offer kernel-level threads. Linux provides a unique solution in that it does not recognize a distinction between threads and processes. Using a mechanism similar to the lightweight processes of Solaris, user-level threads are mapped into kernel-level processes. Multiple user-level threads that constitute a single user-level process are mapped into Linux kernel-level processes that share the same group ID. This enables these processes to share resources such as files and memory and to avoid the need for a context switch when the scheduler switches among processes in the same group.

A new process is created in Linux by copying the attributes of the current process. A new process can be cloned so that it shares resources, such as files, signal handlers, and virtual memory. When the two processes share the same virtual memory, they function as threads within a single process. However, no separate type of data structure is defined for a thread. In place of the usual fork() command, processes are created in Linux using the clone() command. This command includes a set of flags as arguments, defined in Table 4.5. The traditional fork() system call is implemented by Linux as a clone() system call with all of the clone flags cleared.

[Footnote] POSIX (Portable Operating System Interface based on UNIX) is an IEEE API standard that includes a standard for a thread API. Libraries implementing the POSIX Threads standard are known as Pthreads. Pthreads are most commonly used on UNIX-like POSIX systems such as Linux and Solaris, but Microsoft Windows implementations also exist.
CHAPTER 4 / THREADS, SMP, AND MICROKERNELS

Table 4.5  Linux clone() flags
CLONE_CLEARTID — Clear the task ID.
CLONE_DETACHED — The parent does not want a SIGCHLD signal sent on exit.
CLONE_FILES — Shares the table that identifies the open files.
CLONE_FS — Shares the table that identifies the root directory and the current working directory, as well as the value of the bit mask used to mask the initial file permissions of a new file.
CLONE_IDLETASK — Set PID to zero, which refers to the idle task. The idle task is employed when all available tasks are blocked waiting for resources.
CLONE_NEWNS — Create a new namespace for the child.
CLONE_PARENT — Caller and new task share the same parent process.
CLONE_PTRACE — If the parent process is being traced, the child process will be traced.
CLONE_SETTID — Write the TID back to user space.
CLONE_SETTLS — Create a new TLS for the child.
CLONE_SIGHAND — Shares the table that identifies the signal handlers.
CLONE_SYSVSEM — Shares System V SEM_UNDO semantics.
CLONE_THREAD — Inserts this process into the same thread group of the parent. If this flag is true, it implicitly enforces CLONE_PARENT.
CLONE_VFORK — If set, the parent does not get scheduled for execution until the child invokes the execve() system call.
CLONE_VM — Shares the address space (memory descriptor and all page tables).
When the Linux kernel performs a switch from one process to another, it checks whether the address of the page directory of the current process is the same as that of the to-be-scheduled process. If they are, then they are sharing the same address space, so that a context switch is basically just a jump from one location of code to another location of code. Although cloned processes that are part of the same process group can share the same memory space, they cannot share the same user stacks. Thus the clone() call creates separate stack spaces for each process.
Linux includes all of the concurrency mechanisms found in other UNIX systems, such as SVR4, including pipes, messages, shared memory, and signals. In addition, Linux 2.6 includes a rich set of concurrency mechanisms specifically intended for use when a thread is executing in kernel mode. That is, these are mechanisms used within the kernel to provide concurrency in the execution of kernel code. This section examines the Linux kernel concurrency mechanisms.
Atomic Operations
Linux provides a set of operations that guarantee atomic operations on a variable. These operations can be used to avoid simple race conditions. An atomic operation executes without interruption and without interference. On a uniprocessor system, a thread performing an atomic operation cannot be interrupted once the operation has started until the operation is finished. In addition, on a multiprocessor system, the variable being operated on is locked from access by other threads until this operation is completed.

Two types of atomic operations are defined in Linux: integer operations, which operate on an integer variable, and bitmap operations, which operate on one bit in a bitmap (Table 6.3). These operations must be implemented on any architecture that implements Linux. For some architectures, there are corresponding assembly language instructions for the atomic operations. On other architectures, an operation that locks the memory bus is used to guarantee that the operation is atomic.

For atomic integer operations, a special data type is used, atomic_t. The atomic integer operations can be used only on this data type, and no other operations are allowed on this data type. [LOVE04] lists the following advantages for these restrictions:

1. The atomic operations are never used on variables that might in some circumstances be unprotected from race conditions.
2. Variables of this data type are protected from improper use by nonatomic operations.
3. The compiler cannot erroneously optimize access to the value (e.g., by using an alias rather than the correct memory address).
4. This data type serves to hide architecture-specific differences in its implementation.

A typical use of the atomic integer data type is to implement counters. The atomic bitmap operations operate on one of a sequence of bits at an arbitrary memory location indicated by a pointer variable. Thus, there is no equivalent to the atomic_t data type needed for atomic integer operations.

Atomic operations are the simplest of the approaches to kernel synchronization. More complex locking mechanisms can be built on top of them.
Spinlocks
CHAPTER 6 / CONCURRENCY: DEADLOCK AND STARVATION

Table 6.3  Linux Atomic Operations
Atomic Integer Operations
ATOMIC_INIT(int i) — At declaration: initialize an atomic_t to i
int atomic_read(atomic_t *v) — Read integer value of v
void atomic_set(atomic_t *v, int i) — Set the value of v to integer i
void atomic_add(int i, atomic_t *v) — Add i to v
void atomic_sub(int i, atomic_t *v) — Subtract i from v
void atomic_inc(atomic_t *v) — Add 1 to v
void atomic_dec(atomic_t *v) — Subtract 1 from v
int atomic_sub_and_test(int i, atomic_t *v) — Subtract i from v; return 1 if the result is zero; return 0 otherwise
int atomic_add_negative(int i, atomic_t *v) — Add i to v; return 1 if the result is negative; return 0 otherwise (used for implementing semaphores)
int atomic_dec_and_test(atomic_t *v) — Subtract 1 from v; return 1 if the result is zero; return 0 otherwise
int atomic_inc_and_test(atomic_t *v) — Add 1 to v; return 1 if the result is zero; return 0 otherwise

Atomic Bitmap Operations
void set_bit(int nr, void *addr) — Set bit nr in the bitmap pointed to by addr
void clear_bit(int nr, void *addr) — Clear bit nr in the bitmap pointed to by addr
void change_bit(int nr, void *addr) — Invert bit nr in the bitmap pointed to by addr
int test_and_set_bit(int nr, void *addr) — Set bit nr in the bitmap pointed to by addr; return the old bit value
int test_and_clear_bit(int nr, void *addr) — Clear bit nr in the bitmap pointed to by addr; return the old bit value
int test_and_change_bit(int nr, void *addr) — Invert bit nr in the bitmap pointed to by addr; return the old bit value
int test_bit(int nr, void *addr) — Return the value of bit nr in the bitmap pointed to by addr
acquire the same lock will keep trying (spinning) until it can acquire the lock. In essence, a spinlock is built on an integer location in memory that is checked by each thread before it enters its critical section. If the value is 0, the thread sets the value to 1 and enters its critical section. If the value is nonzero, the thread continually checks the value until it is zero. The spinlock is easy to implement but has the disadvantage that locked-out threads continue to execute in a busy-waiting mode. Thus spinlocks are most effective in situations where the wait time for acquiring a lock is expected to be very short, say on the order of less than two context changes.

The basic form of use of a spinlock is the following:

    spin_lock(&lock)
    /* critical section */
    spin_unlock(&lock)
Table 6.4  Linux Spinlocks
void spin_lock(spinlock_t *lock) — Acquires the specified lock, spinning if needed until it is available
void spin_lock_irq(spinlock_t *lock) — Like spin_lock, but also disables interrupts on the local processor
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags) — Like spin_lock_irq, but also saves the current interrupt state in flags
void spin_lock_bh(spinlock_t *lock) — Like spin_lock, but also disables the execution of all bottom halves
void spin_unlock(spinlock_t *lock) — Releases given lock
void spin_unlock_irq(spinlock_t *lock) — Releases given lock and enables local interrupts
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) — Releases given lock and restores local interrupts to given previous state
void spin_unlock_bh(spinlock_t *lock) — Releases given lock and enables bottom halves
void spin_lock_init(spinlock_t *lock) — Initializes given spinlock
int spin_trylock(spinlock_t *lock) — Tries to acquire specified lock without spinning; returns nonzero if the lock was acquired and zero otherwise
int spin_is_locked(spinlock_t *lock) — Returns nonzero if lock is currently held and zero otherwise
Basic Spinlocks  The basic spinlock (as opposed to the reader-writer spinlock explained subsequently) comes in four flavors (Table 6.4):

- Plain: If the critical section of code is not executed by interrupt handlers, or if the interrupts are disabled during the execution of the critical section, then the plain spinlock can be used. It does not affect the interrupt state on the processor on which it is run.
- _irq: If interrupts are always enabled, then this spinlock should be used.
- _irqsave: If it is not known whether interrupts will be enabled or disabled at the time of execution, then this version should be used. When a lock is acquired, the current state of interrupts on the local processor is saved, to be restored when the lock is released.
- _bh: When an interrupt occurs, the minimum amount of work necessary is performed by the corresponding interrupt handler. A piece of code, called the bottom half, performs the remainder of the interrupt-related work, allowing the current interrupt to be enabled as soon as possible. The _bh spinlock is used to disable and then enable bottom halves to avoid conflict with the protected critical section.
Spinlocks are implemented differently on a uniprocessor system versus a multiprocessor system. For a uniprocessor system, the following considerations apply. If kernel preemption is turned off, so that a thread executing in kernel mode cannot be interrupted, then the locks are deleted at compile time; they are not needed. If kernel preemption is enabled, which does permit interrupts, then the spinlocks again compile away (that is, no test of a spinlock memory location occurs) but are simply implemented as code that enables/disables interrupts. On a multiprocessor system, the spinlock is compiled into code that does in fact test the spinlock location. The use of the spinlock mechanism in a program allows it to be independent of whether it is executed on a uniprocessor or multiprocessor system.
Reader-Writer Spinlock  The reader-writer spinlock is a mechanism that allows a greater degree of concurrency within the kernel than the basic spinlock. The reader-writer spinlock allows multiple threads to have simultaneous access to the same data structure for reading only, but gives exclusive access to the spinlock for a thread that intends to update the data structure. Each reader-writer spinlock consists of a 24-bit reader counter and an unlock flag, with the following interpretation:

Counter     Flag   Interpretation
0           1      The spinlock is released and available for use
0           0      Spinlock has been acquired for writing by one thread
n (n > 0)   0      Spinlock has been acquired for reading by n threads
n (n > 0)   1      Not valid

As with the basic spinlock, there are plain, _irq, and _irqsave versions of the reader-writer spinlock.

Note that the reader-writer spinlock favors readers over writers. If the spinlock is held for readers, then so long as there is at least one reader, the spinlock cannot be preempted by a writer. Furthermore, new readers may be added to the spinlock even while a writer is waiting.
Semaphores
At the user level, Linux provides a semaphore interface corresponding to that in UNIX SVR4. Internally, Linux provides an implementation of semaphores for its own use. That is, code that is part of the kernel can invoke kernel semaphores. These kernel semaphores cannot be accessed directly by the user program via system calls. They are implemented as functions within the kernel and are thus more efficient than user-visible semaphores. Linux provides three types of semaphore facilities in the kernel: binary semaphores, counting semaphores, and reader-writer semaphores.
Table 6.5  Linux Semaphores

Traditional Semaphores
void sema_init(struct semaphore *sem, int count) — Initializes the dynamically created semaphore to the given count
void init_MUTEX(struct semaphore *sem) — Initializes the dynamically created semaphore with a count of 1 (initially unlocked)
void init_MUTEX_LOCKED(struct semaphore *sem) — Initializes the dynamically created semaphore with a count of 0 (initially locked)
void down(struct semaphore *sem) — Attempts to acquire the given semaphore, entering uninterruptible sleep if semaphore is unavailable
int down_interruptible(struct semaphore *sem) — Attempts to acquire the given semaphore, entering interruptible sleep if semaphore is unavailable; returns -EINTR value if a signal other than the result of an up operation is received
int down_trylock(struct semaphore *sem) — Attempts to acquire the given semaphore, and returns a nonzero value if semaphore is unavailable
void up(struct semaphore *sem) — Releases the given semaphore

Reader-Writer Semaphores
void init_rwsem(struct rw_semaphore *rwsem) — Initializes the dynamically created semaphore with a count of 1
void down_read(struct rw_semaphore *rwsem) — Down operation for readers
void up_read(struct rw_semaphore *rwsem) — Up operation for readers
void down_write(struct rw_semaphore *rwsem) — Down operation for writers
void up_write(struct rw_semaphore *rwsem) — Up operation for writers
semaphores in Chapter 5 The function names down and up are used for the Tune- tions referred to in Chapter A counting semaphore is initialized using thesen_i i function, which gives 5 as somilait and sons! gna, respectively the semaphore a name and assigns an initial value to the semaphore Binary sem phores,called MUTEXes in Linus, UT Sx_LocKEDFunetions, which initialize the semaphore to Io O, respectively are initialized using the: is soTExand ni
Linux provides three versions of the down (semWait) operation:
1. The down function corresponds to the traditional semWait operation. That is, the thread tests the semaphore and blocks if the semaphore is not available. The thread will awaken when a corresponding up operation on this semaphore occurs. Note that this function name is used for an operation on either a counting semaphore or a binary semaphore.
CHAPTER 6 / CONCURRENCY: DEADLOCK AND STARVATION

2. The down_interruptible function allows the thread to receive and respond to a kernel signal while being blocked on the down operation. If the thread is woken up by a signal, the down_interruptible function increments the count value of the semaphore and returns an error code known in Linux as -EINTR. This alerts the thread that the invoked semaphore function has aborted. In effect, the thread has been forced to "give up" the semaphore. This feature is useful for device drivers and other services in which it is convenient to override a semaphore operation.
3. The down_trylock function makes it possible to try to acquire a semaphore without being blocked. If the semaphore is available, it is acquired. Otherwise, this function returns a nonzero value without blocking the thread.
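These kernel semaphore functions have no user-space API, but their calling conventions can be mimicked with Python's threading.Semaphore — a hedged analogy for illustration, not kernel code:

```python
import threading

# User-space analogy: threading.Semaphore mimics the semantics of the
# kernel's down()/down_trylock()/up() (this is NOT the kernel interface).
sem = threading.Semaphore(2)          # counting semaphore, like sema_init(sem, 2)

sem.acquire()                         # like down(): blocks until available
ok = sem.acquire(blocking=False)      # like down_trylock(): never blocks
print(ok)                             # True: one unit was still available

failed = sem.acquire(blocking=False)  # count is now 0, so this attempt fails
print(failed)                         # False: returned immediately, no blocking

sem.release()                         # like up(): release the semaphore
sem.release()
```

There is no direct analogue here of down_interruptible, which in the kernel behaves like down() but lets a signal abort the wait with -EINTR.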
Reader-Writer Semaphores The reader-writer semaphore divides users into readers and writers; it allows multiple concurrent readers (with no writers) but only a single writer (with no concurrent readers). In effect, the semaphore functions as a counting semaphore for readers but a binary semaphore (MUTEX) for writers. Table 6.5 shows the basic reader-writer semaphore operations. The reader-writer semaphore uses uninterruptible sleep, so there is only one version of each of the down operations.
Barriers
In some architectures, compilers and/or the processor hardware may reorder memory accesses in source code to optimize performance. These reorderings are done to optimize the use of the instruction pipeline in the processor. The reordering algorithms contain checks to ensure that data dependencies are not violated. For example, the code

a = 1;
b = 1;

may be reordered so that memory location b is updated before memory location a is updated. However, the code

a = 1;
b = a;
will not be reordered. Even so, there are occasions when it is important that reads or writes are executed in the order specified because of the use made of the information by another thread or a hardware device. To enforce the order in which instructions are executed, Linux provides the memory barrier facility. Table 6.6 lists the most important functions that are defined

Table 6.6 Linux Memory Barrier Operations

rmb()      Prevents loads from being reordered across the barrier
wmb()      Prevents stores from being reordered across the barrier
mb()       Prevents loads and stores from being reordered across the barrier
barrier()  Prevents the compiler from reordering loads or stores across the barrier
smp_rmb()  On SMP provides a rmb() and on UP provides a barrier()
smp_wmb()  On SMP provides a wmb() and on UP provides a barrier()
smp_mb()   On SMP provides a mb() and on UP provides a barrier()

SMP = symmetric multiprocessor; UP = uniprocessor
for this facility. The rmb() operation ensures that no reads occur across the barrier defined by the place of the rmb() in the code. Similarly, the wmb() operation ensures that no writes occur across the barrier defined by the place of the wmb() in the code. The mb() operation provides both a load and store barrier. Two important points to note about the barrier operations:

1. The barriers relate to machine instructions, namely loads and stores. Thus the higher-level language instruction a = b involves both a load (read) from location b and a store (write) to location a.
2. The rmb, wmb, and mb operations dictate the behavior of both the compiler and the processor. In the case of the compiler, the barrier operation dictates that the compiler not reorder instructions during the compile process. In the case of the processor, the barrier operation dictates that any instructions pending in the pipeline before the barrier must be committed for execution before any instructions encountered after the barrier.
8.4 LINUX MEMORY MANAGEMENT
Linux shares many of the characteristics of the memory management schemes of other UNIX implementations but has its own unique features. Overall, the Linux memory-management scheme is quite complex [DUBE98]. In this section, we give a brief overview of the two main aspects of Linux memory management: process virtual memory, and kernel memory allocation.
Linux Virtual Memory
Virtual Memory Addressing Linux makes use of a three-level page table structure, consisting of the following types of tables (each individual table is the size of one page):

• Page directory: An active process has a single page directory that is the size of one page. Each entry in the page directory points to one page of the page middle directory. The page directory must be in main memory for an active process.
• Page middle directory: The page middle directory may span multiple pages. Each entry in the page middle directory points to one page in the page table.
• Page table: The page table may also span multiple pages. Each page table entry refers to one virtual page of the process.
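The three-level lookup can be illustrated by splitting a virtual address into index fields. The field widths below (2/9/9/12 bits of a 32-bit address) are assumptions chosen for illustration; the actual widths are platform dependent:

```python
PGD_BITS, PMD_BITS, PTE_BITS, OFFSET_BITS = 2, 9, 9, 12  # assumed widths

def split_vaddr(vaddr):
    """Split a 32-bit virtual address into (pgd, pmd, pte, offset) indices."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    pte = (vaddr >> OFFSET_BITS) & ((1 << PTE_BITS) - 1)
    pmd = (vaddr >> (OFFSET_BITS + PTE_BITS)) & ((1 << PMD_BITS) - 1)
    pgd = vaddr >> (OFFSET_BITS + PTE_BITS + PMD_BITS)
    return pgd, pmd, pte, offset

# On a two-level platform the middle level collapses to size one, so the
# pmd index contributes nothing and the extra indirection optimizes away.
print(split_vaddr(0x12345678))  # -> (0, 145, 325, 1656)
```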
CHAPTER 8 / VIRTUAL MEMORY

Figure 8.25 Address Translation in Linux Virtual Memory Scheme
The Linux page table structure is platform independent and was designed to accommodate the 64-bit Alpha processor, which provides hardware support for three levels of paging. With 64-bit addresses, the use of only two levels of pages on the Alpha would result in very large page tables and directories. The 32-bit Pentium/x86 architecture has a two-level hardware paging mechanism. The Linux software accommodates the two-level scheme by defining the size of the page middle directory as one. Note that all references to an extra level of indirection are optimized away at compile time, not at run time. Therefore, there is no performance overhead for using the generic three-level design on platforms which support only two levels in hardware.
Page Allocation To enhance the efficiency of reading in and writing out pages to and from main memory, Linux defines a mechanism for dealing with contiguous blocks of pages mapped into contiguous blocks of page frames. For this purpose, the buddy system is used. The kernel maintains a list of contiguous page frame groups of fixed size; a group may consist of 1, 2, 4, 8, 16, or 32 page frames. As pages are allocated and deallocated in main memory, the available groups are split and merged using the buddy algorithm.
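The splitting half of the buddy discipline can be sketched as follows; this is an illustration of the idea, not the kernel's implementation, and merging on deallocation is omitted:

```python
def split_to_fit(group_size, request):
    """Repeatedly halve a free group of page frames until it just fits the
    request; return the allocated size plus the buddy halves released back
    to the free lists."""
    released = []
    while group_size // 2 >= request:
        group_size //= 2
        released.append(group_size)   # the unused buddy half
    return group_size, released

# Allocating 3 page frames from a free group of 16:
# 16 splits into 8+8, then 8 into 4+4; a 4-frame group is allocated,
# and buddies of size 8 and 4 go back on the free lists.
print(split_to_fit(16, 3))  # -> (4, [8, 4])
```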
Page Replacement Algorithm The Linux page replacement algorithm is based on the clock algorithm described in Section 8.2 (see Figure 8.16). In the simple clock algorithm, a use bit and a modify bit are associated with each page in main memory. In the Linux scheme, the use bit is replaced with an 8-bit age variable. Each time that a page is accessed, the age variable is incremented. In the background, Linux periodically sweeps through the global page pool and decrements the age variable for each page as it rotates through all the pages in main memory. A page with an age of 0 is an "old" page that has not been referenced in some time and is the best candidate for replacement. The larger the value of age, the more frequently
a page has been used in recent times and the less eligible it is for replacement. Thus, the Linux algorithm is a form of least frequently used policy.
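The aging policy can be sketched as follows (an illustration; saturating the counter at 255 is an assumption implied by the 8-bit age field):

```python
PAGES = {"a": 0, "b": 0, "c": 0}   # page -> 8-bit age value

def access(page):
    PAGES[page] = min(PAGES[page] + 1, 255)   # age incremented on each access

def sweep():
    for p in PAGES:                           # periodic background sweep
        PAGES[p] = max(PAGES[p] - 1, 0)       # decrement each page's age

def victim():
    return min(PAGES, key=PAGES.get)          # smallest age = least frequently used

access("a"); access("a"); access("b")
sweep()                                       # ages: a=1, b=0, c=0
print(victim())                               # -> b (an age-0 page goes first)
```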
Kernel Memory Allocation
The Linux kernel memory capability manages physical main memory page frames. Its primary function is to allocate and deallocate frames for particular uses. Possible owners of a frame include user-space processes (i.e., the frame is part of the virtual memory of a process that is currently resident in real memory), dynamically allocated kernel data, static kernel code, and the page cache.
The foundation of kernel memory allocation for Linux is the page allocation mechanism used for user virtual memory management. As in the virtual memory scheme, a buddy algorithm is used so that memory for the kernel can be allocated and deallocated in units of one or more pages. Because the minimum amount of memory that can be allocated in this fashion is one page, the page allocator alone would be inefficient because the kernel requires small short-term memory chunks in odd sizes. To accommodate these small chunks, Linux uses a scheme known as slab allocation [BONW94] within an allocated page. On a Pentium/x86 machine, the page size is 4 Kbytes, and chunks within a page may be allocated of sizes 32, 64, 128, 252, 508, 2040, and 4080 bytes. The slab allocator is relatively complex and is not examined in detail here; a good description can be found in [VAHA96]. In essence, Linux maintains a set of linked lists, one for each size of chunk. Chunks may be split and aggregated in a manner similar to the buddy algorithm, and moved between lists accordingly.
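The essence of the size-class idea — one free list per chunk size — can be sketched as follows (illustrative only; the real slab allocator is far more elaborate):

```python
SIZES = [32, 64, 128, 252, 508, 2040, 4080]   # chunk sizes within a 4-Kbyte page

def size_class(nbytes):
    """Return the smallest chunk size that can hold an nbytes request."""
    for size in SIZES:
        if nbytes <= size:
            return size
    raise ValueError("request larger than the largest chunk")

free_lists = {size: [] for size in SIZES}     # one list of free chunks per size

print(size_class(100))   # -> 128: a 100-byte request comes from the 128-byte list
print(size_class(300))   # -> 508
```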
10.3 LINUX SCHEDULING
Linux provided a real-time scheduling capability coupled with a scheduler for non-real-time processes that made use of the traditional UNIX scheduling algorithm described in Section 9.3. Linux 2.6 includes essentially the same real-time scheduling capability as previous releases and a substantially revised scheduler for non-real-time processes. We examine these two areas in turn.
Real-Time Scheduling
The three Linux scheduling classes are

• SCHED_FIFO: First-in-first-out real-time threads
• SCHED_RR: Round-robin real-time threads
• SCHED_OTHER: Other, non-real-time threads
Within each class, multiple priorities may be used, with priorities in the real-time classes higher than the priorities for the SCHED_OTHER class. The default values are as follows: real-time priority classes range from 0 to 99 inclusively, and SCHED_OTHER classes range from 100 to 139. A lower number equals a higher priority.
For FIFO threads, the following rules apply:

1. The system will not interrupt an executing FIFO thread except in the following cases:
   a. Another FIFO thread of higher priority becomes ready.
   b. The executing FIFO thread becomes blocked waiting for an event, such as I/O.
   c. The executing FIFO thread voluntarily gives up the processor following a call to the primitive sched_yield.
2. When an executing FIFO thread is interrupted, it is placed in the queue associated with its priority.
CHAPTER 10 / MULTIPROCESSOR AND REAL-TIME SCHEDULING

Figure 10.11 Example of Linux Real-Time Scheduling: (a) relative thread priorities; (b) flow with FIFO scheduling; (c) flow with RR scheduling
3. When a FIFO thread becomes ready, and if that thread has a higher priority than the currently executing thread, then the currently executing thread is preempted and the highest-priority ready FIFO thread is executed. If more than one thread has that highest priority, the thread that has been waiting the longest is chosen.
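The selection rule above — highest priority first, with ties broken by longest wait — can be written as a short sketch (a hypothetical helper, not kernel code):

```python
def pick_next(ready):
    """ready: list of (name, priority, waited) tuples; a lower priority
    number means higher priority, and a longer wait wins a priority tie."""
    return min(ready, key=lambda t: (t[1], -t[2]))[0]

# D has the highest priority; B and C share one, but B has waited longer.
ready = [("A", 3, 9), ("B", 2, 7), ("C", 2, 4), ("D", 1, 1)]
print(pick_next(ready))  # -> D
```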
The SCHED_RR policy is similar to the SCHED_FIFO policy, except for the addition of a timeslice associated with each thread. When a SCHED_RR thread has executed for its timeslice, it is suspended and a real-time thread of equal or higher priority is selected for running.

Figure 10.11 is an example that illustrates the distinction between FIFO and RR scheduling. Assume a process has four threads with three relative priorities assigned as shown in Figure 10.11a. Assume that all waiting threads are ready to execute when the current thread waits or terminates and that no higher-priority thread is awakened while a thread is executing. Figure 10.11b shows a flow in which all of the threads are in the SCHED_FIFO class. Thread D executes until it waits or terminates. Next, although threads B and C have the same priority, thread B starts because it has been waiting longer than thread C. Thread B executes until it waits or terminates, then thread C executes until it waits or terminates. Finally, thread A executes.

Figure 10.11c shows a sample flow if all of the threads are in the SCHED_RR class. Thread D executes until it waits or terminates. Next, threads B and C are time-sliced, because they both have the same priority. Finally, thread A executes.

The final scheduling class is SCHED_OTHER. A thread in this class can only execute if there are no real-time threads ready to execute.

Non-Real-Time Scheduling

The Linux 2.4 scheduler for the SCHED_OTHER class did not scale well with increasing number of processors and increasing number of processes. The drawbacks of this scheduler include the following:
• The Linux 2.4 scheduler uses a single runqueue for all processors in a symmetric multiprocessing system, so a task may be rescheduled from one processor to another, which hurts cache performance. For example, suppose a task executed on CPU-1, and its data were in that processor's cache. If the task got rescheduled to CPU-2, its data would need to be invalidated in CPU-1 and brought into CPU-2.
• The Linux 2.4 scheduler uses a single runqueue lock. Thus, in an SMP system, the act of choosing a task to execute locks out any other processor from manipulating the runqueues. The result is idle processors awaiting release of the runqueue lock and decreased efficiency.
• Preemption is not possible in the Linux 2.4 scheduler; this means that a lower-priority task can execute while a higher-priority task waits for it to complete.
To correct these problems, Linux 2.6 uses a completely new priority scheduler known as the O(1) scheduler. The scheduler is designed so that the time to select the appropriate process and assign it to a processor is constant, regardless of the load on the system or the number of processors.

The kernel maintains two scheduling data structures for each processor in the system, of the following form (Figure 10.12):

struct prio_array {
    int nr_active;                      /* number of tasks in this array */
    unsigned long bitmap[BITMAP_SIZE];  /* priority bitmap */
    struct list_head queue[MAX_PRIO];   /* priority queues */
};
A separate queue is maintained for each priority level. The total number of queues in the structure is MAX_PRIO, which has a default value of 140. The structure also includes a bitmap array of sufficient size to provide one bit per priority level. Thus, with 140 priority levels and 32-bit words, BITMAP_SIZE has a value of 5. This creates a bitmap of 160 bits, of which 20 bits are ignored. The bitmap indicates which queues are not empty. Finally, nr_active indicates the total number of tasks present on all queues. Two structures are maintained: an active queues structure and an expired queues structure. Initially, both bitmaps are set to all zeroes and all queues are empty. As a process becomes ready, it is assigned to the appropriate priority queue in the active queues structure and is assigned the appropriate timeslice. If a task is preempted before it completes its timeslice, it is returned to an active queue. When a task completes its timeslice, it goes into the appropriate queue in the expired queues structure and is assigned a new timeslice. All scheduling is done from among tasks in the active queues structure. When the active queues structure is empty, a simple pointer assignment results in a switch of the active and expired queues, and scheduling continues. Scheduling is simple and efficient. On a given processor, the scheduler picks the highest-priority nonempty queue. If multiple tasks are in that queue, the tasks are scheduled in round-robin fashion.
Figure 10.12 Linux Scheduling Data Structures for Each Processor
Linux also includes a mechanism for moving tasks from the queue lists of one processor to that of another. Periodically, the scheduler checks to see if there is a substantial imbalance among the number of tasks assigned to each processor. To balance the load, the scheduler can transfer some tasks. The highest-priority active tasks are selected for transfer, because it is more important to distribute high-priority tasks fairly.
Calculating Priorities and Timeslices Each non-real-time task is assigned an initial priority in the range of 100 to 139, with a default of 120. This is the task's static priority and is specified by the user. As the task executes, a dynamic priority is calculated as a function of the task's static priority and its execution behavior. The Linux scheduler is designed to favor I/O-bound tasks over processor-bound tasks. This preference tends to provide good interactive response. The technique used by Linux to determine the dynamic priority is to keep a running tab on how much time a process sleeps (waiting for an event) versus how much time the process runs. In essence, a task that spends most of its time sleeping is given a higher priority. Timeslices are assigned in the range of 10 ms to 200 ms. In general, higher-priority tasks are assigned larger timeslices.
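The text does not give the kernel's formulas, so the following is a purely hypothetical illustration of a sleep-ratio bonus and a priority-to-timeslice map; every constant here is an assumption:

```python
def dynamic_priority(static_prio, sleep_time, run_time):
    """Hypothetical: more sleeping (I/O-bound) earns a bigger bonus, i.e. a
    lower (better) priority number, clamped to the non-real-time range."""
    ratio = sleep_time / max(sleep_time + run_time, 1)
    bonus = round(5 * (ratio - 0.5) * 2)     # assumed bonus range [-5, +5]
    return min(max(static_prio - bonus, 100), 139)

def timeslice_ms(prio):
    """Hypothetical linear map: priority 100 -> 200 ms, priority 139 -> 10 ms."""
    return round(200 - (prio - 100) * (190 / 39))

print(dynamic_priority(120, sleep_time=90, run_time=10))  # -> 116 (I/O-bound)
print(timeslice_ms(100), timeslice_ms(139))               # -> 200 10
```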
Relationship to Real-Time Tasks Real-time tasks are handled in a different manner from non-real-time tasks in the priority queues. The following considerations apply:

1. All real-time tasks have only a static priority; no dynamic priority changes are made.
2. SCHED_FIFO tasks do not have assigned timeslices. Such tasks are scheduled in FIFO discipline. If a SCHED_FIFO task is blocked, it returns to the same priority queue in the active queue list when it becomes unblocked.
3. Although SCHED_RR tasks do have assigned timeslices, they are never moved to the expired queue list. When a SCHED_RR task exhausts its timeslice, it is returned to its priority queue with the same timeslice value. Timeslice values are never changed.
CHAPTER 11 / I/O MANAGEMENT AND DISK SCHEDULING
WINDOWS/LINUX COMPARISON

Windows: The I/O system is layered, using I/O Request Packets to represent each request and then passing the requests through layers of drivers (a data-driven architecture). Layered drivers can extend functionality, such as checking file data for viruses, or add features such as specialized encryption or compression.
Linux: I/O uses a plug-in model, based on tables of routines to implement the standard device functions, such as open, read, write, ioctl, close.

Windows: I/O is inherently asynchronous, as drivers at any layer can generally queue a request for later processing and return back to the caller.
Linux: Only network I/O and direct I/O, which bypass the page cache, can be asynchronous in current versions.

Windows: Drivers can be dynamically loaded/unloaded.
Linux: Drivers can be dynamically loaded/unloaded.

Windows: I/O devices and drivers are named in the system namespace.
Linux: I/O devices are named in the file system; drivers are accessed through instances of a device.

Windows: Advanced plug-and-play support based on dynamic detection of devices through bus enumeration, matching of drivers from a database, and dynamic loading/unloading.
Linux: Limited plug-and-play support.

Windows: Advanced power management including CPU clock rate management and system hibernation.
Linux: Limited power management based on CPU clock rate management.

Windows: I/O is prioritized according to thread priorities and system requirements (such as high-priority access for paging when memory is low, and idle-priority for background activities like the disk defragmenter).
Linux: Provides four different versions of I/O scheduling, including deadline-based scheduling and Complete Fairness Queuing to allocate I/O fairly among all processes.

Windows: I/O completion ports provide high-performance multithreaded applications with an efficient way of dealing with the completion of asynchronous I/O.
Disk Scheduling
The default disk scheduler in Linux 2.4 is known as the Linus Elevator, which is a variation on the LOOK algorithm discussed in Section 11.5. For Linux 2.6, the Elevator algorithm has been augmented by two additional algorithms: the deadline I/O scheduler and the anticipatory I/O scheduler [LOVE04]. We examine each of these in turn.
The Elevator Scheduler The elevator scheduler maintains a single queue for disk read and write requests and performs both sorting and merging functions on the queue. In general terms, the elevator scheduler keeps the list of requests sorted by block number. Thus, as the disk requests are handled, the drive moves in a single direction, satisfying each request as it is encountered. This general strategy is refined in the following manner. When a new request is added to the queue, four operations are considered in order:
"7" 2, Ifa request in the queue is sufficiently ol, the new request is inserted atthe tail of the queue 3 If there is a suitable location, the new request is inserted in sorted order,
Deadline I/O Scheduler In addition to the elevator's sorted queue, the deadline I/O scheduler maintains two FIFO queues: each incoming request is also placed at the tail of a read FIFO queue for a read request or a write FIFO queue for a write request. Thus, the read and write queues maintain a list of requests in the sequence in which the requests were made. Associated with each request is an expiration time, with a default value of 0.5 seconds for a read request and 5 seconds for a write request. Ordinarily, the scheduler dispatches from the sorted queue. When a request is satisfied, it is removed from the head of the sorted queue and also from the appropriate FIFO queue. However, when the item at the head of one of the FIFO queues becomes older than its expiration time, then the scheduler next dispatches from that FIFO queue, taking the expired request plus the next few requests from the queue. As each request is dispatched, it is also removed from the sorted queue. The deadline I/O scheduler scheme overcomes the starvation problem and also the read versus write problem.
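The dispatch rule can be sketched with per-request deadlines (0.5 s for reads and 5 s for writes, as in the text); the sorted queue is modeled as a heap on sector number, a simplification of the elevator order:

```python
import heapq

READ_EXPIRE, WRITE_EXPIRE = 0.5, 5.0   # seconds, as given in the text

class DeadlineScheduler:
    def __init__(self):
        self.sorted_q = []                 # heap keyed by sector number
        self.fifo = {"read": [], "write": []}

    def add(self, now, kind, sector):
        expire = now + (READ_EXPIRE if kind == "read" else WRITE_EXPIRE)
        heapq.heappush(self.sorted_q, (sector, kind))
        self.fifo[kind].append((expire, sector))

    def dispatch(self, now):
        for kind in ("read", "write"):     # expired head? serve that FIFO queue
            if self.fifo[kind] and self.fifo[kind][0][0] <= now:
                _, sector = self.fifo[kind].pop(0)
                self.sorted_q.remove((sector, kind))
                heapq.heapify(self.sorted_q)
                return sector
        sector, kind = heapq.heappop(self.sorted_q)   # ordinary case: sorted order
        self.fifo[kind].remove(next(e for e in self.fifo[kind] if e[1] == sector))
        return sector

s = DeadlineScheduler()
s.add(0.0, "write", 500)
s.add(0.0, "read", 900)
print(s.dispatch(0.1))   # -> 500: nothing has expired, lowest sector wins
print(s.dispatch(1.0))   # -> 900: the read expired at 0.5 s, served from its FIFO
```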
Anticipatory I/O Scheduler The original elevator scheduler and the deadline scheduler both are designed to dispatch a new request as soon as the existing request is satisfied, thus keeping the disk as busy as possible. This same policy applies to all of the scheduling algorithms discussed in Section 11.5. However, such a policy can be counterproductive if there are numerous synchronous read requests. Typically, an application will wait until a read request is satisfied and the data available before issuing the next request. The small delay between receiving the data for the last read and issuing the next read enables the scheduler to turn elsewhere for a pending request and dispatch that request. Because of the principle of locality, it is likely that successive reads from the same process will be to disk blocks that are near one another. If the scheduler were to delay a short period of time after satisfying a read request, to see if a new nearby read request is made, the overall performance of the system could be enhanced. This is the philosophy behind the anticipatory scheduler, proposed in [IYER01] and implemented in Linux 2.6.
In Linux, the anticipatory scheduler is superimposed on the deadline scheduler. When a read request is dispatched, the anticipatory scheduler causes the scheduling system to delay for up to 6 milliseconds, depending on the configuration. During this small delay, there is a good chance that the application that issued the last read request will issue another read request to the same region of the disk. If so, that request will be serviced immediately. If no such read request occurs, the scheduler resumes using the deadline scheduling algorithm.

[LOVE04] reports on two tests of the Linux scheduling algorithms. The first test involved the reading of a 200-MB file while doing a long streaming write in the background. The second test involved doing a read of a large file in the background while reading every file in the kernel source tree. The results are listed in the following table:
I/O Scheduler and Kernel              Test 1         Test 2
Linus elevator on 2.4                 45 seconds     30 minutes, 28 seconds
Deadline I/O scheduler on 2.6         40 seconds     3 minutes, 30 seconds
Anticipatory I/O scheduler on 2.6     4.6 seconds    15 seconds
As can be seen, the performance improvement depends on the nature of the workload. But in both cases, the anticipatory scheduler provides a dramatic improvement.
Linux Page Cache
In Linux 2.2 and earlier releases, the kernel maintained a page cache for reads and writes from regular file system files and for virtual memory pages, and a separate buffer cache for block I/O. For Linux 2.4 and later, there is a single unified page cache that is involved in all traffic between disk and main memory.

The page cache confers two benefits. First, when it is time to write back dirty pages to disk, a collection of them can be ordered properly and written out efficiently. Second, because of the principle of temporal locality, pages in the page cache are likely to be referenced again before they are flushed from the cache, thus saving a disk I/O operation. Dirty pages are written back to disk in two situations:

• When free memory falls below a specified threshold, the kernel reduces the size of the page cache to release memory to be added to the free memory pool.
12.9 LINUX VIRTUAL FILE SYSTEM
Linux includes a versatile and powerful file handling facility, designed to support a wide variety of file management systems and file structures. The approach taken in Linux is to make use of a virtual file system (VFS), which presents a single, uniform file system interface to user processes. The VFS defines a common file model that is capable of representing any conceivable file system's general feature and behavior. The VFS assumes that files are objects in a computer's mass storage memory that share basic properties regardless of the target file system or the underlying processor hardware.
CHAPTER 12 / FILE MANAGEMENT

Figure 12.17 Linux Virtual File System Context
Trang 36129 / LINUX VIRTUAL FILE SVSTEM 589 Syste cal
Syste cate thing VIS am ves ‘wine Bề den X Dis 40 ——
wermerice [Linx] “eal erase ‘all in me function ‘ote system x ile Fler onseootaey Figure 12.18 Linux Virtual File System Concept
Figure 12.18 indicates the role that VFS plays within the Linux kernel. When a process initiates a file-oriented system call (e.g., read), the kernel calls a function in the VFS. This function handles the file-system-independent manipulations and initiates a call to a function in the target file system code. This call passes through a mapping function that converts the call from the VFS into a call to the target file system. The VFS is independent of any file system, so the implementation of a mapping function must be part of the implementation of a file system on Linux. The target file system converts the file system request into device-oriented instructions that are passed to a device driver by means of page cache functions.

VFS is an object-oriented scheme. Because it is written in C, rather than a language that supports object programming (such as C++ or Java), VFS objects are implemented simply as C data structures. Each object contains both data and pointers to file-system-implemented functions that operate on data. The four primary object types in VFS are as follows:
• Superblock object: Represents a specific mounted file system
• Inode object: Represents a specific file
• Dentry object: Represents a specific directory entry
• File object: Represents an open file associated with a process
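Since the originals are C structs carrying function pointers, the pattern can be sketched in Python as data plus a table of file-system-supplied operations; the names here are illustrative, not the kernel's:

```python
class Superblock:
    def __init__(self, fs_type, ops):
        self.fs_type = fs_type    # data: e.g., which file system is mounted
        self.ops = ops            # table of file-system-implemented functions

    def read_inode(self, number):
        # the VFS dispatches through the table; each file system plugs in its own
        return self.ops["read_inode"](number)

# A toy "file system" supplies its implementation of the operation:
ramfs_ops = {"read_inode": lambda n: {"inode": n, "fs": "ramfs"}}
sb = Superblock("ramfs", ramfs_ops)
print(sb.read_inode(7))   # -> {'inode': 7, 'fs': 'ramfs'}
```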
This scheme is based on the concepts used in UNIX file systems, as described in Section 12.7. The key concepts of a UNIX file system to remember are the following. A file system consists of a hierarchical organization of directories. A directory is the same as what is known as a folder on many non-UNIX platforms and may contain files and/or other directories. Because a directory may contain other directories, a tree structure is formed. A path through the tree structure from the root consists of a sequence of directory entries, ending in either a directory entry (dentry) or a file name. In UNIX, a directory is implemented as a file that lists the files and directories contained within it. Thus, file operations can be performed on either files or directories.
The Superblock Object
The superblock object consists of a number of data items. Examples include the following:

• The device that this file system is mounted on
• The basic block size of the file system
• Dirty flag, to indicate that the superblock has been changed but not written back to disk
• File system type
• Flags, such as a read-only flag
• Pointer to the root of the file system directory
• List of open files
• Semaphore for controlling access to the file system
• List of superblock operations
The last item on the preceding list refers to an operations object contained within the superblock object. The operations object defines the object methods (functions) that the kernel can invoke against the superblock object. The methods defined for the superblock object include the following:
• read_inode: Read a specified inode from a mounted file system
• write_inode: Write given inode to disk
• put_inode: Release inode
• delete_inode: Delete inode from disk
• notify_change: Called when inode attributes are changed
• put_super: Called by the VFS on unmount to release the given superblock
• write_super: Called when the VFS decides that the superblock needs to be written to disk
• statfs: Obtain file system statistics
• remount_fs: Called by the VFS when the file system is remounted with new mount options
• clear_inode: Release inode and clear any pages containing related data
The Inode Object
An inode is associated with each file. The inode object holds all the information about a named file except its name and the actual data contents of the file. Items contained in an inode object include owner, group, permissions, access times for a file, size of data it holds, and number of links. The inode object also includes an inode operations object that describes the file system's implemented functions that the VFS can invoke on an inode. The methods defined for the inode object include the following:
• create: Creates a new inode for a regular file associated with a dentry object in some directory
• lookup: Searches a directory for an inode corresponding to a file name
• mkdir: Creates a new inode for a directory associated with a dentry object in some directory

The Dentry Object
A dentry (directory entry) is a specific component in a path. The component may be either a directory name or a file name. Dentry objects facilitate access to files and directories and are used in a dentry cache for that purpose. The dentry object includes a pointer to the inode and superblock. It also includes a pointer to the parent dentry and pointers to any subordinate dentrys.
The File Object
The file object is used to represent a file opened by a process. The object is created in response to the open() system call and destroyed in response to the close() system call. The file object consists of a number of items, including the following:

• Dentry object associated with the file
• File system containing the file
• File object's usage counter
• User's user ID
• User's group ID
16.7 BEOWULF AND LINUX CLUSTERS
In 1994, the Beowulf project was initiated under the sponsorship of the NASA High Performance Computing and Communications (HPCC) project. Its goal was to investigate the potential of clustered PCs for performing important computation tasks beyond the capabilities of contemporary workstations at minimum cost. Today, the Beowulf approach is widely implemented and is perhaps the most important cluster technology available.
Beowulf Features
Key features of Beowulf include the following [RIDG97]:

• Mass market commodity components
• Dedicated processors (rather than scavenging cycles from idle workstations)
• A dedicated, private network (LAN or WAN or internetted combination)
• No custom components
• Easy replication from multiple vendors
• Scalable I/O
• A freely available software base
• Use of freely available distribution computing tools with minimal changes
• Return of the design and improvements to the community
Although elements of Beowulf software have been implemented on a number of different platforms, the most obvious choice for a base is Linux, and most Beowulf implementations use a cluster of Linux workstations and/or PCs. Figure 16.18 depicts a representative configuration. The cluster consists of a number of workstations, perhaps of differing hardware platforms, all running the Linux operating system. Secondary storage at each workstation may be made available for distributed access (for distributed file sharing, distributed virtual memory, or other uses). The cluster nodes (the Linux systems) are interconnected with a commodity networking approach, typically Ethernet. The Ethernet support may be in the form of a single Ethernet switch or an interconnected set of switches. Commodity Ethernet products at the standard data rates (10 Mbps, 100 Mbps, 1 Gbps) are used.
Figure 16.18 Generic Beowulf Configuration

Beowulf Software
The Beowulf software environment is implemented as an add-on to commercially available, royalty-free base Linux distributions. The principal source of open-source Beowulf software is the Beowulf site at www.beowulf.org, but numerous other organizations also offer free Beowulf tools and utilities.
Each node in the Beowulf cluster runs its own copy of the Linux kernel and can function as an autonomous Linux system. To support the Beowulf cluster concept, extensions are made to the Linux kernel to allow the individual nodes to participate in a number of global namespaces. The following are examples of Beowulf system software:

• Beowulf distributed process space (BPROC): This package allows a process ID space to span multiple nodes in a cluster environment and also provides mechanisms for starting processes on other nodes. The goal of this package is to provide key elements needed for a single system image on a Beowulf cluster. BPROC provides a mechanism to start processes on remote nodes without ever logging into another node and by making all the remote processes visible in the process table of the cluster's front-end node.