This document is an extract from William Stallings, Copyright 2008.

2.8 LINUX

History
WINDOWS/LINUX COMPARISON: General

Windows Vista: A commercial OS, with strong influences from VAX/VMS and requirements for compatibility with multiple OS personalities, such as DOS/Windows, POSIX, and, originally, OS/2.

Linux: An open-source implementation of UNIX, focused on simplicity and efficiency. Runs on a very large range of processor architectures.

Environment which influenced fundamental design decisions:

Windows: 32-bit program address space; Mbytes of physical memory; virtual memory; multiprocessor (4-way); micro-controller based I/O devices; client/server distributed computing; large, diverse user populations.

Linux: 16-bit program address space; Kbytes of physical memory; swapping system with memory mapping; uniprocessor; state-machine based I/O devices; standalone interactive systems; small number of friendly users.

Compare these with today's environment: 64-bit addresses; Gbytes of physical memory; virtual memory, virtual processors; multiprocessor (64-128); high-speed internet/intranet, Web services; single user, but vulnerable to hackers worldwide.

Although both Windows and Linux have adapted to changes in the environment, the original design environments (i.e., in 1989 and 1973) heavily influenced the design choices:

- Unit of concurrency: threads vs. processes [address space, uniprocessor]
- Process creation: CreateProcess() vs. fork() [address space, swapping]
- I/O: async vs. sync [swapping, I/O devices]
- Security: discretionary access vs. uid/gid [user populations]

System structure:

Windows: Modular core kernel, with explicit publishing of data structures and interfaces by components. Three layers:
- Hardware Abstraction Layer manages processor, interrupt, DMA, and BIOS details
- Kernel layer manages interrupts, synchronization, and CPU scheduling
- Executive layer implements the major OS functions in a fully threaded, mostly preemptive environment

Linux: Monolithic kernel.

Windows: Dynamic data structures and kernel address space organization; initialization code discarded after boot. Much kernel code and data is pageable. Non-pageable kernel code and data uses large pages for TLB efficiency.

Linux: Kernel code and data is statically allocated to non-pageable memory.
CHAPTER 2 / OPERATING SYSTEM OVERVIEW
Windows: File systems, networking, and devices are loadable/unloadable drivers (dynamic link libraries) using the extensible I/O system interfaces. Dynamically loaded drivers can provide both pageable and non-pageable sections.

Linux: Extensive support for loading/unloading kernel modules, such as device drivers and file systems. Modules cannot be paged, but can be unloaded.

Windows: Namespace root is virtual, with file systems mounted underneath. Types of system objects are easily extended, and leverage unified naming, referencing, lifetime management, security, and handle-based synchronization.

Linux: Namespace is rooted in a file system; adding new named system objects requires file system changes or mapping onto the device model.

Windows: OS personalities implemented as user-mode subsystems. Native NT APIs are based on the general kernel handle/object architecture and allow cross-process manipulation of virtual memory, threads, and other kernel objects.

Linux: Implements a POSIX-compatible, UNIX-like interface; kernel APIs far simpler than Windows. Can understand various types of executables.

Windows: Discretionary access controls, privileges, auditing.

Linux: User/group IDs; capabilities similar to NT privileges can also be associated with processes.
Key to the success of Linux has been the availability of free software packages under the auspices of the Free Software Foundation (FSF). FSF's goal is stable, platform-independent software that is free, high quality, and embraced by the user community. FSF's GNU project provides tools for software developers, and the GNU Public License (GPL) is the FSF seal of approval. Torvalds used GNU tools in developing his kernel, which he then released under the GPL. Thus, the Linux distributions that you see today are the product of FSF's GNU project, Torvalds' individual effort, and many collaborators all over the world.

In addition to its use by many individual programmers, Linux has now made significant penetration into the corporate world. This is not only because of the free software but also because of the quality of the Linux kernel. Many talented programmers have contributed to the current version, resulting in a technically impressive product. Moreover, Linux is highly modular and easily configured. This makes it easy to squeeze optimal performance from a variety of hardware platforms. Plus, with the source code available, vendors can tweak applications and utilities to meet specific requirements. Throughout this book, we will provide details of Linux kernel internals based on the most recent version, Linux 2.6.
Modular Structure
Most UNIX kernels are monolithic. Recall from earlier in this chapter that a monolithic kernel is one that includes virtually all of the OS functionality in one large block of code that runs as a single process with a single address space. All the functional components of the kernel have access to all of its internal data structures and routines. If changes are made to any portion of a typical monolithic OS, all the modules and routines must be relinked and reinstalled and the system rebooted before the changes can take effect. As a result, any modification, such as adding a new device driver or file system function, is difficult. This problem is especially acute for Linux, for which development is global and done by a loosely associated group of independent programmers.

Although Linux does not use a microkernel approach, it achieves many of the potential advantages of this approach by means of its particular modular architecture. Linux is structured as a collection of modules, a number of which can be automatically loaded and unloaded on demand. These relatively independent blocks are referred to as loadable modules [GOYE99]. In essence, a module is an object file whose code can be linked to and unlinked from the kernel at runtime. Typically, a module implements some specific function, such as a file system, a device driver, or some other feature of the kernel's upper layer. A module does not execute as its own process or thread, although it can create kernel threads for various purposes as necessary. Rather, a module is executed in kernel mode on behalf of the current process.

Thus, although Linux may be considered monolithic, its modular structure overcomes some of the difficulties in developing and evolving the kernel.

The Linux loadable modules have two important characteristics:
- Dynamic linking: A kernel module can be loaded and linked into the kernel while the kernel is already in memory and executing. A module can also be unlinked and removed from memory at any time.
- Stackable modules: The modules are arranged in a hierarchy. Modules serve as libraries when they are referenced by client modules higher up in the hierarchy, and as clients when they reference modules further down.
Dynamic linking [FRAN97] facilitates configuration and saves kernel memory. In Linux, a user program or user can explicitly load and unload kernel modules using the insmod and rmmod commands. The kernel itself monitors the need for particular functions and can load and unload modules as needed. With stackable modules, dependencies between modules can be defined. This has two benefits:

1. Code common to a set of similar modules (e.g., drivers for similar hardware) can be moved into a single module, reducing replication.
2. The kernel can make sure that needed modules are present, refraining from unloading a module on which other running modules depend, and loading any additional required modules when a new module is loaded.
Figure 2.17 is an example that illustrates the structures used by Linux to manage modules. The figure shows the list of kernel modules after only two modules have been loaded: FAT and VFAT. Each module is defined by two tables, the module table and the symbol table. The module table includes the following elements:

- *next: Pointer to the following module. All modules are organized into a linked list. The list begins with a pseudomodule (not shown in Figure 2.17).
- *name: Pointer to the module name.
Figure 2.17  Example List of Linux Kernel Modules
- usecount: Module usage counter. The counter is incremented when an operation involving the module's functions is started and decremented when the operation terminates.
- flags: Module flags.
- nsyms: Number of exported symbols.
- ndeps: Number of referenced modules.
- *syms: Pointer to this module's symbol table.
- *deps: Pointer to list of modules that are referenced by this module.
- *refs: Pointer to list of modules that use this module.
The symbol table defines those symbols controlled by this module that are used elsewhere. Figure 2.17 shows that the VFAT module was loaded after the FAT module and that the VFAT module is dependent on the FAT module.

Kernel Components
Figure 2.18, taken from [MOSB02], shows the main components of the Linux kernel as implemented on an IA-64 architecture (e.g., Intel Itanium). The figure shows several processes running on top of the kernel. Each box indicates a separate process, while each squiggly line with an arrowhead represents a thread of execution. The kernel itself consists of an interacting collection of components, with arrows indicating the main interactions. The underlying hardware is also depicted as a set of components, with arrows indicating which kernel components use or control which hardware components. All of the kernel components, of course, execute on the processor but, for simplicity, these relationships are not shown.

Figure 2.18  Linux Kernel Components

Briefly, the principal kernel components are the following:
- Signals: The kernel uses signals to call into a process. For example, signals are used to notify a process of certain faults, such as division by zero. Table 2.6 gives a few examples of signals.

Table 2.6  Some Linux Signals
SIGHUP    Terminal hangup          SIGCONT    Continue
SIGQUIT   Keyboard quit            SIGTSTP    Keyboard stop
SIGTRAP   Trace trap               SIGTTOU    Terminal write
SIGBUS    Bus error                SIGXCPU    CPU limit exceeded
SIGKILL   Kill signal              SIGVTALRM  Virtual alarm clock
SIGSEGV   Segmentation violation   SIGWINCH   Window size changed
SIGPIPE   Broken pipe              SIGPWR     Power failure
SIGTERM   Termination              SIGRTMIN   First real-time signal
- System calls: The system call is the means by which a process requests a specific kernel service. There are several hundred system calls, which can be roughly grouped into six categories: file system, process, scheduling, interprocess communication, socket (networking), and miscellaneous. Table 2.7 defines a few examples in each category.

Table 2.7  Some Linux System Calls
Filesystem related
link — Make a new name for a file.
open — Open and possibly create a file or device.
read — Read from file descriptor.
write — Write to file descriptor.

Process related
execve — Execute program.
exit — Terminate the calling process.
getpid — Get process identification.
setuid — Set user identity of the current process.
ptrace — Provides a means by which a parent process may observe and control the execution of another process, and examine and change its core image and registers.

Scheduling related
sched_setparam — Sets the scheduling parameters associated with the scheduling policy for the process identified by pid.
sched_get_priority_max — Returns the maximum priority value that can be used with the scheduling algorithm identified by policy.
sched_setscheduler — Sets both the scheduling policy (e.g., FIFO) and the associated parameters for the process pid.
sched_rr_get_interval — Writes into the timespec structure pointed to by the parameter tp the round-robin time quantum for the process pid.
sched_yield — A process can relinquish the processor voluntarily without blocking via this system call. The process will then be moved to the end of the queue for its static priority and a new process gets to run.

Interprocess Communication (IPC) related
msgrcv — A message buffer structure is allocated to receive a message. The system call then reads a message from the message queue specified by msqid into the newly created message buffer.
semctl — Performs the control operation specified by cmd on the semaphore set semid.
semop — Performs operations on selected members of the semaphore set semid.
shmat — Attaches the shared memory segment identified by shmid to the data segment of the calling process.
shmctl — Allows the user to receive information on a shared memory segment; set the owner, group, and permissions of a shared memory segment; or destroy a segment.
Table 2.7 (Continued)
Socket (networking) related
bind — Assigns the local IP address and port for a socket. Returns 0 for success and -1 for error.
connect — Establishes a connection between the given socket and the remote socket associated with sockaddr.
gethostname — Returns local host name.
send — Send the bytes contained in the buffer pointed to by *msg over the given socket.
setsockopt — Sets the options on a socket.

Miscellaneous
create_module — Attempts to create a loadable module entry and reserve the kernel memory that will be needed to hold the module.
fsync — Copies all in-core parts of a file to disk, and waits until the device reports that all parts are on stable storage.
query_module — Requests information related to loadable modules from the kernel.
time — Returns the time in seconds since January 1, 1970.
vhangup — Simulates a hangup on the current terminal. This call arranges for other users to have a "clean" terminal at login time.
- Processes and scheduler: Creates, manages, and schedules processes.
- Virtual memory: Allocates and manages virtual memory for processes.
- File systems: Provides a global, hierarchical namespace for files, directories, and other file-related objects and provides file system functions.
- Network protocols: Supports the Sockets interface to users for the TCP/IP protocol suite.
- Character device drivers: Manages devices that require the kernel to send or receive data one byte at a time, such as terminals, modems, and printers.
- Block device drivers: Manages devices that read and write data in blocks, such as various forms of secondary memory (magnetic disks, CD-ROMs, etc.).
- Network device drivers: Manages network interface cards and communications ports that connect to network devices, such as bridges and routers.
- Traps and faults: Handles traps and faults generated by the processor, such as a memory fault.
- Physical memory: Manages the pool of page frames in real memory and allocates pages for virtual memory.
4.6 LINUX PROCESS AND THREAD MANAGEMENT

Linux Tasks

A process, or task, in Linux is represented by a task_struct data structure. The task_struct data structure contains information in a number of categories:

- State: The execution state of the process (executing, ready, suspended, stopped, zombie). This is described subsequently.

WINDOWS/LINUX COMPARISON

Windows:
- Processes are containers for the user-mode address space, a general handle mechanism for referencing kernel objects, and threads; threads run in a process and are the schedulable entities.
- Processes are created by discrete steps which construct the container for a new program and the first thread; a fork()-like native API exists, but is only used for POSIX compatibility.
- Process handle table used to uniformly reference kernel objects (representing processes, threads, memory sections, synchronization, I/O devices, drivers, open files, network connections, timers, kernel transactions).
- Up to 16 million handles on kernel objects are supported per process.
- Kernel is fully multithreaded, with kernel preemption enabled on all systems in the original design.
- Many system services implemented using a client/server computing model, including the OS personality subsystems that run in user mode and communicate using remote procedure calls.

Linux:
- Processes are both containers and the schedulable entities; processes can share address space and system resources, making processes effectively usable as threads.
- Processes are created by making virtual copies with fork() and then, if necessary, overwriting with exec() to run a new program.
- Kernel objects referenced by an ad hoc collection of APIs and mechanisms, including file descriptors for open files and sockets, and PIDs for processes and process groups.
- Up to 64 open files/sockets are supported per process.
- Few kernel processes used, and kernel preemption is a recent feature.
- Scheduling information: Information needed by Linux to schedule processes. A process can be normal or real time and has a priority. Real-time processes are scheduled before normal processes, and within each category, relative priorities can be used. A counter keeps track of the amount of time a process is allowed to execute.
- Identifiers: Each process has a unique process identifier and also has user and group identifiers. A group identifier is used to assign resource access privileges to a group of processes.
- Interprocess communication: Linux supports the IPC mechanisms found in UNIX SVR4, described in Chapter 6.
- Links: Each process includes a link to its parent process, links to its siblings (processes with the same parent), and links to all of its children.
- Times and timers: Includes process creation time and the amount of processor time so far consumed by the process. A process may also have associated one or more interval timers. A process defines an interval timer by means of a system call; as a result, a signal is sent to the process when the timer expires. A timer may be single use or periodic.
- File system: Includes pointers to any files opened by this process, as well as pointers to the current and the root directories for this process.
- Address space: Defines the virtual address space assigned to this process.
- Processor-specific context: The registers and stack information that constitute the context of this process.

Figure 4.18 shows the execution states of a process. These are as follows:
- Running: This state value corresponds to two states. A Running process is either executing or it is ready to execute.
- Interruptible: This is a blocked state, in which the process is waiting for an event, such as the end of an I/O operation, the availability of a resource, or a signal from another process.
- Uninterruptible: This is another blocked state. The difference between this and the Interruptible state is that in an Uninterruptible state, a process is waiting directly on hardware conditions and therefore will not handle any signals.
- Stopped: The process has been halted and can only resume by positive action from another process. For example, a process that is being debugged can be put into the Stopped state.
- Zombie: The process has been terminated but, for some reason, still must have its task structure in the process table.
Linux Threads
Figure 4.18  Linux Process/Thread Model
known as pthread (POSIX thread) libraries, with all of the threads mapping into a single kernel-level process. We have seen that modern versions of UNIX offer kernel-level threads. Linux provides a unique solution in that it does not recognize a distinction between threads and processes. Using a mechanism similar to the lightweight processes of Solaris, user-level threads are mapped into kernel-level processes. Multiple user-level threads that constitute a single user-level process are mapped into Linux kernel-level processes that share the same group ID. This enables these processes to share resources such as files and memory and to avoid the need for a context switch when the scheduler switches among processes in the same group.

A new process is created in Linux by copying the attributes of the current process. A new process can be cloned so that it shares resources, such as files, signal handlers, and virtual memory. When the two processes share the same virtual memory, they function as threads within a single process. However, no separate type of data structure is defined for a thread. In place of the usual fork() command, processes are created in Linux using the clone() command. This command includes a set of flags as arguments, defined in Table 4.5. The traditional fork() system call is implemented by Linux as a clone() system call with all of the clone flags cleared.

[Footnote] POSIX (Portable Operating System Interface based on UNIX) is an IEEE API standard that includes a standard for a thread API. Libraries implementing the POSIX Threads standard are known as Pthreads. Pthreads are most commonly used on UNIX-like POSIX systems such as Linux and Solaris, but Microsoft Windows implementations also exist.
CHAPTER 4 / THREADS, SMP, AND MICROKERNELS

Table 4.5  Linux clone() flags
CLONE_CLEARTID — Clear the task ID.
CLONE_DETACHED — The parent does not want a SIGCHLD signal sent on exit.
CLONE_FILES — Shares the table that identifies the open files.
CLONE_FS — Shares the table that identifies the root directory and the current working directory, as well as the value of the bit mask used to mask the initial file permissions of a new file.
CLONE_IDLETASK — Set PID to zero, which refers to the idle task. The idle task is employed when all available tasks are blocked waiting for resources.
CLONE_NEWNS — Create a new namespace for the child.
CLONE_PARENT — Caller and new task share the same parent process.
CLONE_PTRACE — If the parent process is being traced, the child process will be traced.
CLONE_SETTID — Write the TID back to user space.
CLONE_SETTLS — Create a new TLS for the child.
CLONE_SIGHAND — Shares the table that identifies the signal handlers.
CLONE_SYSVSEM — Shares System V SEM_UNDO semantics.
CLONE_THREAD — Inserts this process into the same thread group of the parent. If this flag is true, it implicitly enforces CLONE_PARENT.
CLONE_VFORK — If set, the parent does not get scheduled for execution until the child invokes the execve() system call.
CLONE_VM — Shares the address space (memory descriptor and all page tables).
When the Linux kernel performs a switch from one process to another, it checks whether the address of the page directory of the current process is the same as that of the to-be-scheduled process. If they are, then they are sharing the same address space, so that a context switch is basically just a jump from one location of code to another location of code. Although cloned processes that are part of the same process group can share the same memory space, they cannot share the same user stacks. Thus the clone() call creates separate stack spaces for each process.
Linux includes all of the concurrency mechanisms found in other UNIX systems, such as SVR4, including pipes, messages, shared memory, and signals. In addition, Linux 2.6 includes a rich set of concurrency mechanisms specifically intended for use when a thread is executing in kernel mode. That is, these are mechanisms used within the kernel to provide concurrency in the execution of kernel code. This section examines the Linux kernel concurrency mechanisms.
Atomic Operations
Linux provides a set of operations that guarantee atomic operations on a variable. These operations can be used to avoid simple race conditions. An atomic operation executes without interruption and without interference. On a uniprocessor system, a thread performing an atomic operation cannot be interrupted once the operation has started until the operation is finished. In addition, on a multiprocessor system, the variable being operated on is locked from access by other threads until this operation is completed.

Two types of atomic operations are defined in Linux: integer operations, which operate on an integer variable, and bitmap operations, which operate on one bit in a bitmap (Table 6.3). These operations must be implemented on any architecture that implements Linux. For some architectures, there are corresponding assembly language instructions for the atomic operations. On other architectures, an operation that locks the memory bus is used to guarantee that the operation is atomic.

For atomic integer operations, a special data type is used, atomic_t. The atomic integer operations can be used only on this data type, and no other operations are allowed on this data type. [LOVE04] lists the following advantages for these restrictions:

1. The atomic operations are never used on variables that might in some circumstances be unprotected from race conditions.
2. Variables of this data type are protected from improper use by nonatomic operations.
3. The compiler cannot erroneously optimize access to the value (e.g., by using an alias rather than the correct memory address).
4. This data type serves to hide architecture-specific differences in its implementation.

A typical use of the atomic integer data type is to implement counters. The atomic bitmap operations operate on one of a sequence of bits at an arbitrary memory location indicated by a pointer variable. Thus, there is no equivalent to the atomic_t data type needed for atomic integer operations.

Atomic operations are the simplest of the approaches to kernel synchronization. More complex locking mechanisms can be built on top of them.
Spinlocks
CHAPTER 6 / CONCURRENCY: DEADLOCK AND STARVATION

Table 6.3  Linux Atomic Operations
Atomic Integer Operations
ATOMIC_INIT(int i) — At declaration: initialize an atomic_t to i
int atomic_read(atomic_t *v) — Read integer value of v
void atomic_set(atomic_t *v, int i) — Set the value of v to integer i
void atomic_add(int i, atomic_t *v) — Add i to v
void atomic_sub(int i, atomic_t *v) — Subtract i from v
void atomic_inc(atomic_t *v) — Add 1 to v
void atomic_dec(atomic_t *v) — Subtract 1 from v
int atomic_sub_and_test(int i, atomic_t *v) — Subtract i from v; return 1 if the result is zero; return 0 otherwise
int atomic_add_negative(int i, atomic_t *v) — Add i to v; return 1 if the result is negative; return 0 otherwise (used for implementing semaphores)
int atomic_dec_and_test(atomic_t *v) — Subtract 1 from v; return 1 if the result is zero; return 0 otherwise
int atomic_inc_and_test(atomic_t *v) — Add 1 to v; return 1 if the result is zero; return 0 otherwise

Atomic Bitmap Operations
void set_bit(int nr, void *addr) — Set bit nr in the bitmap pointed to by addr
void clear_bit(int nr, void *addr) — Clear bit nr in the bitmap pointed to by addr
void change_bit(int nr, void *addr) — Invert bit nr in the bitmap pointed to by addr
int test_and_set_bit(int nr, void *addr) — Set bit nr in the bitmap pointed to by addr; return the old bit value
int test_and_clear_bit(int nr, void *addr) — Clear bit nr in the bitmap pointed to by addr; return the old bit value
int test_and_change_bit(int nr, void *addr) — Invert bit nr in the bitmap pointed to by addr; return the old bit value
int test_bit(int nr, void *addr) — Return the value of bit nr in the bitmap pointed to by addr
acquire the same lock will keep trying (spinning) until it can acquire the lock. In essence, a spinlock is built on an integer location in memory that is checked by each thread before it enters its critical section. If the value is 0, the thread sets the value to 1 and enters its critical section. If the value is nonzero, the thread continually checks the value until it is zero. The spinlock is easy to implement but has the disadvantage that locked-out threads continue to execute in a busy-waiting mode. Thus spinlocks are most effective in situations where the wait time for acquiring a lock is expected to be very short, say on the order of less than two context changes.

The basic form of use of a spinlock is the following:

    spin_lock(&lock)
    /* critical section */
    spin_unlock(&lock)
Table 6.4  Linux Spinlocks
void spin_lock(spinlock_t *lock) — Acquires the specified lock, spinning if needed until it is available
void spin_lock_irq(spinlock_t *lock) — Like spin_lock, but also disables interrupts on the local processor
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags) — Like spin_lock_irq, but also saves the current interrupt state in flags
void spin_lock_bh(spinlock_t *lock) — Like spin_lock, but also disables the execution of all bottom halves
void spin_unlock(spinlock_t *lock) — Releases given lock
void spin_unlock_irq(spinlock_t *lock) — Releases given lock and enables local interrupts
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags) — Releases given lock and restores local interrupts to given previous state
void spin_unlock_bh(spinlock_t *lock) — Releases given lock and enables bottom halves
void spin_lock_init(spinlock_t *lock) — Initializes given spinlock
int spin_trylock(spinlock_t *lock) — Tries to acquire specified lock without spinning; returns nonzero if the lock was acquired and zero otherwise
int spin_is_locked(spinlock_t *lock) — Returns nonzero if lock is currently held and zero otherwise
Basic Spinlocks  The basic spinlock (as opposed to the reader-writer spinlock explained subsequently) comes in four flavors (Table 6.4):

- Plain: If the critical section of code is not executed by interrupt handlers, or if the interrupts are disabled during the execution of the critical section, then the plain spinlock can be used. It does not affect the interrupt state on the processor on which it is run.
- _irq: If interrupts are always enabled, then this spinlock should be used.
- _irqsave: If it is not known whether interrupts will be enabled or disabled at the time of execution, then this version should be used. When a lock is acquired, the current state of interrupts on the local processor is saved, to be restored when the lock is released.
- _bh: When an interrupt occurs, the minimum amount of work necessary is performed by the corresponding interrupt handler. A piece of code, called the bottom half, performs the remainder of the interrupt-related work, allowing the current interrupt to be enabled as soon as possible. The _bh spinlock is used to disable and then enable bottom halves to avoid conflict with the protected critical section.
Spinlocks are implemented differently on a uniprocessor system versus a multiprocessor system. For a uniprocessor system, the following considerations apply. If kernel preemption is turned off, so that a thread executing in kernel mode cannot be interrupted, then the locks are deleted at compile time; they are not needed. If kernel preemption is enabled, which does permit interrupts, then the spinlocks again compile away (that is, no test of a spinlock memory location occurs) but are simply implemented as code that enables/disables interrupts. On a multiprocessor system, the spinlock is compiled into code that does in fact test the spinlock location. The use of the spinlock mechanism in a program allows it to be independent of whether it is executed on a uniprocessor or multiprocessor system.
Reader-Writer Spinlock  The reader-writer spinlock is a mechanism that allows a greater degree of concurrency within the kernel than the basic spinlock. The reader-writer spinlock allows multiple threads to have simultaneous access to the same data structure for reading only, but gives exclusive access to the spinlock for a thread that intends to update the data structure. Each reader-writer spinlock consists of a 24-bit reader counter and an unlock flag, with the following interpretation:

Counter     Flag   Interpretation
0           1      The spinlock is released and available for use
0           0      Spinlock has been acquired for writing by one thread
n (n > 0)   0      Spinlock has been acquired for reading by n threads
n (n > 0)   1      Not valid

As with the basic spinlock, there are plain, _irq, and _irqsave versions of the reader-writer spinlock.

Note that the reader-writer spinlock favors readers over writers. If the spinlock is held for readers, then so long as there is at least one reader, the spinlock cannot be preempted by a writer. Furthermore, new readers may be added to the spinlock even while a writer is waiting.
Semaphores
At the user level, Linux provides a semaphore interface corresponding to that in UNIX SVR4. Internally, Linux provides an implementation of semaphores for its own use. That is, code that is part of the kernel can invoke kernel semaphores. These kernel semaphores cannot be accessed directly by the user program via system calls. They are implemented as functions within the kernel and are thus more efficient than user-visible semaphores. Linux provides three types of semaphore facilities in the kernel: binary semaphores, counting semaphores, and reader-writer semaphores.
Table 6.5  Linux Semaphores

Traditional Semaphores
void sema_init(struct semaphore *sem, int count) — Initializes the dynamically created semaphore to the given count
void init_MUTEX(struct semaphore *sem) — Initializes the dynamically created semaphore with a count of 1 (initially unlocked)
void init_MUTEX_LOCKED(struct semaphore *sem) — Initializes the dynamically created semaphore with a count of 0 (initially locked)
void down(struct semaphore *sem) — Attempts to acquire the given semaphore, entering uninterruptible sleep if semaphore is unavailable
int down_interruptible(struct semaphore *sem) — Attempts to acquire the given semaphore, entering interruptible sleep if semaphore is unavailable; returns -EINTR value if a signal other than the result of an up operation is received
int down_trylock(struct semaphore *sem) — Attempts to acquire the given semaphore, and returns a nonzero value if semaphore is unavailable
void up(struct semaphore *sem) — Releases the given semaphore

Reader-Writer Semaphores
void init_rwsem(struct rw_semaphore *rwsem) — Initializes the dynamically created semaphore with a count of 1
void down_read(struct rw_semaphore *rwsem) — Down operation for readers
void up_read(struct rw_semaphore *rwsem) — Up operation for readers
void down_write(struct rw_semaphore *rwsem) — Down operation for writers
void up_write(struct rw_semaphore *rwsem) — Up operation for writers
semaphores in Chapter 5 The function names down and up are used for the Tune- tions referred to in Chapter A counting semaphore is initialized using thesen_i i function, which gives 5 as somilait and sons! gna, respectively the semaphore a name and assigns an initial value to the semaphore Binary sem phores,called MUTEXes in Linus, UT Sx_LocKEDFunetions, which initialize the semaphore to Io O, respectively are initialized using the: is soTExand ni
Linux provides three versions of the down (semWait) operation:
1. The down function corresponds to the traditional semWait operation. That is, the thread tests the semaphore and blocks if the semaphore is not available. The thread will awaken when a corresponding up operation on this semaphore occurs. Note that this function name is used for an operation on either a counting semaphore or a binary semaphore.
CHAPTER 6 / CONCURRENCY: DEADLOCK AND STARVATION

2. The down_interruptible function allows the thread to receive and respond to a kernel signal while being blocked on the down operation. If the thread is woken up by a signal, the down_interruptible function increments the count value of the semaphore and returns an error code known in Linux as -EINTR. This alerts the thread that the invoked semaphore function has aborted. In effect, the thread has been forced to "give up" the semaphore. This feature is useful for device drivers and other services in which it is convenient to override a semaphore operation.
3. The down_trylock function makes it possible to try to acquire a semaphore without being blocked. If the semaphore is available, it is acquired. Otherwise, this function returns a nonzero value without blocking the thread.
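These kernel semaphore functions have no user-space API, but their calling conventions can be mimicked with Python's threading.Semaphore — a hedged analogy for illustration, not kernel code:

```python
import threading

# User-space analogy: threading.Semaphore mimics the semantics of the
# kernel's down()/down_trylock()/up() (this is NOT the kernel interface).
sem = threading.Semaphore(2)          # counting semaphore, like sema_init(sem, 2)

sem.acquire()                         # like down(): blocks until available
ok = sem.acquire(blocking=False)      # like down_trylock(): never blocks
print(ok)                             # True: one unit was still available

failed = sem.acquire(blocking=False)  # count is now 0, so this attempt fails
print(failed)                         # False: returned immediately, no blocking

sem.release()                         # like up(): release the semaphore
sem.release()
```

There is no direct analogue here of down_interruptible, which in the kernel behaves like down() but lets a signal abort the wait with -EINTR.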
Reader-Writer Semaphores The reader-writer semaphore divides users into readers and writers; it allows multiple concurrent readers (with no writers) but only a single writer (with no concurrent readers). In effect, the semaphore functions as a counting semaphore for readers but a binary semaphore (MUTEX) for writers. Table 6.5 shows the basic reader-writer semaphore operations. The reader-writer semaphore uses uninterruptible sleep, so there is only one version of each of the down operations.
Barriers
In some architectures, compilers and/or the processor hardware may reorder memory accesses in source code to optimize performance. These reorderings are done to optimize the use of the instruction pipeline in the processor. The reordering algorithms contain checks to ensure that data dependencies are not violated. For example, the code

a = 1;
b = 1;

may be reordered so that memory location b is updated before memory location a is updated. However, the code

a = 1;
b = a;
will not be reordered. Even so, there are occasions when it is important that reads or writes are executed in the order specified because of the use made of the information by another thread or a hardware device. To enforce the order in which instructions are executed, Linux provides the memory barrier facility. Table 6.6 lists the most important functions that are defined

Table 6.6 Linux Memory Barrier Operations

rmb()      Prevents loads from being reordered across the barrier
wmb()      Prevents stores from being reordered across the barrier
mb()       Prevents loads and stores from being reordered across the barrier
barrier()  Prevents the compiler from reordering loads or stores across the barrier
smp_rmb()  On SMP provides a rmb() and on UP provides a barrier()
smp_wmb()  On SMP provides a wmb() and on UP provides a barrier()
smp_mb()   On SMP provides a mb() and on UP provides a barrier()

SMP = symmetric multiprocessor; UP = uniprocessor
for this facility. The rmb() operation ensures that no reads occur across the barrier defined by the place of the rmb() in the code. Similarly, the wmb() operation ensures that no writes occur across the barrier defined by the place of the wmb() in the code. The mb() operation provides both a load and store barrier. Two important points to note about the barrier operations:

1. The barriers relate to machine instructions, namely loads and stores. Thus the higher-level language instruction a = b involves both a load (read) from location b and a store (write) to location a.
2. The rmb, wmb, and mb operations dictate the behavior of both the compiler and the processor. In the case of the compiler, the barrier operation dictates that the compiler not reorder instructions during the compile process. In the case of the processor, the barrier operation dictates that any instructions pending in the pipeline before the barrier must be committed for execution before any instructions encountered after the barrier.
8.4 LINUX MEMORY MANAGEMENT
Linux shares many of the characteristics of the memory management schemes of other UNIX implementations but has its own unique features. Overall, the Linux memory-management scheme is quite complex [DUBE98]. In this section, we give a brief overview of the two main aspects of Linux memory management: process virtual memory, and kernel memory allocation.
Linux Virtual Memory
Virtual Memory Addressing Linux makes use of a three-level page table structure, consisting of the following types of tables (each individual table is the size of one page):

• Page directory: An active process has a single page directory that is the size of one page. Each entry in the page directory points to one page of the page middle directory. The page directory must be in main memory for an active process.
• Page middle directory: The page middle directory may span multiple pages. Each entry in the page middle directory points to one page in the page table.
• Page table: The page table may also span multiple pages. Each page table entry refers to one virtual page of the process.
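The three-level lookup can be illustrated by splitting a virtual address into index fields. The field widths below (2/9/9/12 bits of a 32-bit address) are assumptions chosen for illustration; the actual widths are platform dependent:

```python
PGD_BITS, PMD_BITS, PTE_BITS, OFFSET_BITS = 2, 9, 9, 12  # assumed widths

def split_vaddr(vaddr):
    """Split a 32-bit virtual address into (pgd, pmd, pte, offset) indices."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    pte = (vaddr >> OFFSET_BITS) & ((1 << PTE_BITS) - 1)
    pmd = (vaddr >> (OFFSET_BITS + PTE_BITS)) & ((1 << PMD_BITS) - 1)
    pgd = vaddr >> (OFFSET_BITS + PTE_BITS + PMD_BITS)
    return pgd, pmd, pte, offset

# On a two-level platform the middle level collapses to size one, so the
# pmd index contributes nothing and the extra indirection optimizes away.
print(split_vaddr(0x12345678))  # -> (0, 145, 325, 1656)
```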
CHAPTER 8 / VIRTUAL MEMORY

Figure 8.25 Address Translation in Linux Virtual Memory Scheme
The Linux page table structure is platform independent and was designed to accommodate the 64-bit Alpha processor, which provides hardware support for three levels of paging. With 64-bit addresses, the use of only two levels of pages on the Alpha would result in very large page tables and directories. The 32-bit Pentium/x86 architecture has a two-level hardware paging mechanism. The Linux software accommodates the two-level scheme by defining the size of the page middle directory as one. Note that all references to an extra level of indirection are optimized away at compile time, not at run time. Therefore, there is no performance overhead for using the generic three-level design on platforms which support only two levels in hardware.
Page Allocation To enhance the efficiency of reading in and writing out pages to and from main memory, Linux defines a mechanism for dealing with contiguous blocks of pages mapped into contiguous blocks of page frames. For this purpose, the buddy system is used. The kernel maintains a list of contiguous page frame groups of fixed size; a group may consist of 1, 2, 4, 8, 16, or 32 page frames. As pages are allocated and deallocated in main memory, the available groups are split and merged using the buddy algorithm.
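The splitting half of the buddy discipline can be sketched as follows; this is an illustration of the idea, not the kernel's implementation, and merging on deallocation is omitted:

```python
def split_to_fit(group_size, request):
    """Repeatedly halve a free group of page frames until it just fits the
    request; return the allocated size plus the buddy halves released back
    to the free lists."""
    released = []
    while group_size // 2 >= request:
        group_size //= 2
        released.append(group_size)   # the unused buddy half
    return group_size, released

# Allocating 3 page frames from a free group of 16:
# 16 splits into 8+8, then 8 into 4+4; a 4-frame group is allocated,
# and buddies of size 8 and 4 go back on the free lists.
print(split_to_fit(16, 3))  # -> (4, [8, 4])
```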
Page Replacement Algorithm The Linux page replacement algorithm is based on the clock algorithm described in Section 8.2 (see Figure 8.16). In the simple clock algorithm, a use bit and a modify bit are associated with each page in main memory. In the Linux scheme, the use bit is replaced with an 8-bit age variable. Each time that a page is accessed, the age variable is incremented. In the background, Linux periodically sweeps through the global page pool and decrements the age variable for each page as it rotates through all the pages in main memory. A page with an age of 0 is an "old" page that has not been referenced in some time and is the best candidate for replacement. The larger the value of age, the more frequently
a page has been used in recent times and the less eligible it is for replacement. Thus, the Linux algorithm is a form of least frequently used policy.
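The aging policy can be sketched as follows (an illustration; saturating the counter at 255 is an assumption implied by the 8-bit age field):

```python
PAGES = {"a": 0, "b": 0, "c": 0}   # page -> 8-bit age value

def access(page):
    PAGES[page] = min(PAGES[page] + 1, 255)   # age incremented on each access

def sweep():
    for p in PAGES:                           # periodic background sweep
        PAGES[p] = max(PAGES[p] - 1, 0)       # decrement each page's age

def victim():
    return min(PAGES, key=PAGES.get)          # smallest age = least frequently used

access("a"); access("a"); access("b")
sweep()                                       # ages: a=1, b=0, c=0
print(victim())                               # -> b (an age-0 page goes first)
```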
Kernel Memory Allocation
The Linux kernel memory capability manages physical main memory page frames. Its primary function is to allocate and deallocate frames for particular uses. Possible owners of a frame include user-space processes (i.e., the frame is part of the virtual memory of a process that is currently resident in real memory), dynamically allocated kernel data, static kernel code, and the page cache.
The foundation of kernel memory allocation for Linux is the page allocation mechanism used for user virtual memory management. As in the virtual memory scheme, a buddy algorithm is used so that memory for the kernel can be allocated and deallocated in units of one or more pages. Because the minimum amount of memory that can be allocated in this fashion is one page, the page allocator alone would be inefficient because the kernel requires small short-term memory chunks in odd sizes. To accommodate these small chunks, Linux uses a scheme known as slab allocation [BONW94] within an allocated page. On a Pentium/x86 machine, the page size is 4 Kbytes, and chunks within a page may be allocated of sizes 32, 64, 128, 252, 508, 2040, and 4080 bytes. The slab allocator is relatively complex and is not examined in detail here; a good description can be found in [VAHA96]. In essence, Linux maintains a set of linked lists, one for each size of chunk. Chunks may be split and aggregated in a manner similar to the buddy algorithm, and moved between lists accordingly.
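The essence of the size-class idea — one free list per chunk size — can be sketched as follows (illustrative only; the real slab allocator is far more elaborate):

```python
SIZES = [32, 64, 128, 252, 508, 2040, 4080]   # chunk sizes within a 4-Kbyte page

def size_class(nbytes):
    """Return the smallest chunk size that can hold an nbytes request."""
    for size in SIZES:
        if nbytes <= size:
            return size
    raise ValueError("request larger than the largest chunk")

free_lists = {size: [] for size in SIZES}     # one list of free chunks per size

print(size_class(100))   # -> 128: a 100-byte request comes from the 128-byte list
print(size_class(300))   # -> 508
```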
10.3 LINUX SCHEDULING
Linux provided a real-time scheduling capability coupled with a scheduler for non-real-time processes that made use of the traditional UNIX scheduling algorithm described in Section 9.3. Linux 2.6 includes essentially the same real-time scheduling capability as previous releases and a substantially revised scheduler for non-real-time processes. We examine these two areas in turn.
Real-Time Scheduling
The three Linux scheduling classes are

• SCHED_FIFO: First-in-first-out real-time threads
• SCHED_RR: Round-robin real-time threads
• SCHED_OTHER: Other, non-real-time threads
Within each class, multiple priorities may be used, with priorities in the real-time classes higher than the priorities for the SCHED_OTHER class. The default values are as follows: real-time priority classes range from 0 to 99 inclusively, and SCHED_OTHER classes range from 100 to 139. A lower number equals a higher priority.
For FIFO threads, the following rules apply:

1. The system will not interrupt an executing FIFO thread except in the following cases:
   a. Another FIFO thread of higher priority becomes ready.
   b. The executing FIFO thread becomes blocked waiting for an event, such as I/O.
   c. The executing FIFO thread voluntarily gives up the processor following a call to the primitive sched_yield.
2. When an executing FIFO thread is interrupted, it is placed in the queue associated with its priority.
CHAPTER 10 / MULTIPROCESSOR AND REAL-TIME SCHEDULING

Figure 10.11 Example of Linux Real-Time Scheduling: (a) relative thread priorities; (b) flow with FIFO scheduling; (c) flow with RR scheduling
3. When a FIFO thread becomes ready, and if that thread has a higher priority than the currently executing thread, then the currently executing thread is preempted and the highest-priority ready FIFO thread is executed. If more than one thread has that highest priority, the thread that has been waiting the longest is chosen.
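The selection rule above — highest priority first, with ties broken by longest wait — can be written as a short sketch (a hypothetical helper, not kernel code):

```python
def pick_next(ready):
    """ready: list of (name, priority, waited) tuples; a lower priority
    number means higher priority, and a longer wait wins a priority tie."""
    return min(ready, key=lambda t: (t[1], -t[2]))[0]

# D has the highest priority; B and C share one, but B has waited longer.
ready = [("A", 3, 9), ("B", 2, 7), ("C", 2, 4), ("D", 1, 1)]
print(pick_next(ready))  # -> D
```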
The SCHED_RR policy is similar to the SCHED_FIFO policy, except for the addition of a timeslice associated with each thread. When a SCHED_RR thread has executed for its timeslice, it is suspended and a real-time thread of equal or higher priority is selected for running.

Figure 10.11 is an example that illustrates the distinction between FIFO and RR scheduling. Assume a process has four threads with three relative priorities assigned as shown in Figure 10.11a. Assume that all waiting threads are ready to execute when the current thread waits or terminates and that no higher-priority thread is awakened while a thread is executing. Figure 10.11b shows a flow in which all of the threads are in the SCHED_FIFO class. Thread D executes until it waits or terminates. Next, although threads B and C have the same priority, thread B starts because it has been waiting longer than thread C. Thread B executes until it waits or terminates, then thread C executes until it waits or terminates. Finally, thread A executes.

Figure 10.11c shows a sample flow if all of the threads are in the SCHED_RR class. Thread D executes until it waits or terminates. Next, threads B and C are time-sliced, because they both have the same priority. Finally, thread A executes.

The final scheduling class is SCHED_OTHER. A thread in this class can only execute if there are no real-time threads ready to execute.

Non-Real-Time Scheduling

The Linux 2.4 scheduler for the SCHED_OTHER class did not scale well with increasing number of processors and increasing number of processes. The drawbacks of this scheduler include the following:
• The Linux 2.4 scheduler uses a single runqueue for all processors in a symmetric multiprocessing system, so a task may be rescheduled from one processor to another, which hurts cache performance. For example, suppose a task executed on CPU-1, and its data were in that processor's cache. If the task got rescheduled to CPU-2, its data would need to be invalidated in CPU-1 and brought into CPU-2.
• The Linux 2.4 scheduler uses a single runqueue lock. Thus, in an SMP system, the act of choosing a task to execute locks out any other processor from manipulating the runqueues. The result is idle processors awaiting release of the runqueue lock and decreased efficiency.
• Preemption is not possible in the Linux 2.4 scheduler; this means that a lower-priority task can execute while a higher-priority task waits for it to complete.
To correct these problems, Linux 2.6 uses a completely new priority scheduler known as the O(1) scheduler. The scheduler is designed so that the time to select the appropriate process and assign it to a processor is constant, regardless of the load on the system or the number of processors.

The kernel maintains two scheduling data structures for each processor in the system, of the following form (Figure 10.12):

struct prio_array {
    int nr_active;                      /* number of tasks in this array */
    unsigned long bitmap[BITMAP_SIZE];  /* priority bitmap */
    struct list_head queue[MAX_PRIO];   /* priority queues */
};
A separate queue is maintained for each priority level. The total number of queues in the structure is MAX_PRIO, which has a default value of 140. The structure also includes a bitmap array of sufficient size to provide one bit per priority level. Thus, with 140 priority levels and 32-bit words, BITMAP_SIZE has a value of 5. This creates a bitmap of 160 bits, of which 20 bits are ignored. The bitmap indicates which queues are not empty. Finally, nr_active indicates the total number of tasks present on all queues. Two structures are maintained: an active queues structure and an expired queues structure. Initially, both bitmaps are set to all zeroes and all queues are empty. As a process becomes ready, it is assigned to the appropriate priority queue in the active queues structure and is assigned the appropriate timeslice. If a task is preempted before it completes its timeslice, it is returned to an active queue. When a task completes its timeslice, it goes into the appropriate queue in the expired queues structure and is assigned a new timeslice. All scheduling is done from among tasks in the active queues structure. When the active queues structure is empty, a simple pointer assignment results in a switch of the active and expired queues, and scheduling continues. Scheduling is simple and efficient. On a given processor, the scheduler picks the highest-priority nonempty queue. If multiple tasks are in that queue, the tasks are scheduled in round-robin fashion.
Figure 10.12 Linux Scheduling Data Structures for Each Processor
Linux also includes a mechanism for moving tasks from the queue lists of one processor to that of another. Periodically, the scheduler checks to see if there is a substantial imbalance among the number of tasks assigned to each processor. To balance the load, the scheduler can transfer some tasks. The highest-priority active tasks are selected for transfer, because it is more important to distribute high-priority tasks fairly.
Calculating Priorities and Timeslices Each non-real-time task is assigned an initial priority in the range of 100 to 139, with a default of 120. This is the task's static priority and is specified by the user. As the task executes, a dynamic priority is calculated as a function of the task's static priority and its execution behavior. The Linux scheduler is designed to favor I/O-bound tasks over processor-bound tasks. This preference tends to provide good interactive response. The technique used by Linux to determine the dynamic priority is to keep a running tab on how much time a process sleeps (waiting for an event) versus how much time the process runs. In essence, a task that spends most of its time sleeping is given a higher priority. Timeslices are assigned in the range of 10 ms to 200 ms. In general, higher-priority tasks are assigned larger timeslices.
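The text does not give the kernel's formulas, so the following is a purely hypothetical illustration of a sleep-ratio bonus and a priority-to-timeslice map; every constant here is an assumption:

```python
def dynamic_priority(static_prio, sleep_time, run_time):
    """Hypothetical: more sleeping (I/O-bound) earns a bigger bonus, i.e. a
    lower (better) priority number, clamped to the non-real-time range."""
    ratio = sleep_time / max(sleep_time + run_time, 1)
    bonus = round(5 * (ratio - 0.5) * 2)     # assumed bonus range [-5, +5]
    return min(max(static_prio - bonus, 100), 139)

def timeslice_ms(prio):
    """Hypothetical linear map: priority 100 -> 200 ms, priority 139 -> 10 ms."""
    return round(200 - (prio - 100) * (190 / 39))

print(dynamic_priority(120, sleep_time=90, run_time=10))  # -> 116 (I/O-bound)
print(timeslice_ms(100), timeslice_ms(139))               # -> 200 10
```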
Relationship to Real-Time Tasks Real-time tasks are handled in a different manner from non-real-time tasks in the priority queues. The following considerations apply:

1. All real-time tasks have only a static priority; no dynamic priority changes are made.
2. SCHED_FIFO tasks do not have assigned timeslices. Such tasks are scheduled in FIFO discipline. If a SCHED_FIFO task is blocked, it returns to the same priority queue in the active queue list when it becomes unblocked.
3. Although SCHED_RR tasks do have assigned timeslices, they are never moved to the expired queue list. When a SCHED_RR task exhausts its timeslice, it is returned to its priority queue with the same timeslice value. Timeslice values are never changed.
CHAPTER 11 / I/O MANAGEMENT AND DISK SCHEDULING
WINDOWS/LINUX COMPARISON

Windows: The I/O system is layered, using I/O Request Packets to represent each request and then passing the requests through layers of drivers (a data-driven architecture). Layered drivers can extend functionality, such as checking file data for viruses, or add features such as specialized encryption or compression.
Linux: I/O uses a plug-in model, based on tables of routines to implement the standard device functions, such as open, read, write, ioctl, close.

Windows: I/O is inherently asynchronous, as drivers at any layer can generally queue a request for later processing and return back to the caller.
Linux: Only network I/O and direct I/O, which bypass the page cache, can be asynchronous in current versions.

Windows: Drivers can be dynamically loaded/unloaded.
Linux: Drivers can be dynamically loaded/unloaded.

Windows: I/O devices and drivers are named in the system namespace.
Linux: I/O devices are named in the file system; drivers are accessed through instances of a device.

Windows: Advanced plug-and-play support based on dynamic detection of devices through bus enumeration, matching of drivers from a database, and dynamic loading/unloading.
Linux: Limited plug-and-play support.

Windows: Advanced power management including CPU clock rate management and system hibernation.
Linux: Limited power management based on CPU clock rate management.

Windows: I/O is prioritized according to thread priorities and system requirements (such as high-priority access for paging when memory is low, and idle-priority for background activities like the disk defragmenter).
Linux: Provides four different versions of I/O scheduling, including deadline-based scheduling and Complete Fairness Queuing to allocate I/O fairly among all processes.

Windows: I/O completion ports provide high-performance multithreaded applications with an efficient way of dealing with the completion of asynchronous I/O.
Disk Scheduling
The default disk scheduler in Linux 2.4 is known as the Linus Elevator, which is a variation on the LOOK algorithm discussed in Section 11.5. For Linux 2.6, the Elevator algorithm has been augmented by two additional algorithms: the deadline I/O scheduler and the anticipatory I/O scheduler [LOVE04]. We examine each of these in turn.
The Elevator Scheduler The elevator scheduler maintains a single queue for disk read and write requests and performs both sorting and merging functions on the queue. In general terms, the elevator scheduler keeps the list of requests sorted by block number. Thus, as the disk requests are handled, the drive moves in a single direction, satisfying each request as it is encountered. This general strategy is refined in the following manner. When a new request is added to the queue, four operations are considered in order:
"7" 2, Ifa request in the queue is sufficiently ol, the new request is inserted atthe tail of the queue 3 If there is a suitable location, the new request is inserted in sorted order,
Deadline I/O Scheduler In addition to the elevator's sorted queue, the deadline I/O scheduler maintains two FIFO queues: each incoming request is also placed at the tail of a read FIFO queue for a read request or a write FIFO queue for a write request. Thus, the read and write queues maintain a list of requests in the sequence in which the requests were made. Associated with each request is an expiration time, with a default value of 0.5 seconds for a read request and 5 seconds for a write request. Ordinarily, the scheduler dispatches from the sorted queue. When a request is satisfied, it is removed from the head of the sorted queue and also from the appropriate FIFO queue. However, when the item at the head of one of the FIFO queues becomes older than its expiration time, then the scheduler next dispatches from that FIFO queue, taking the expired request plus the next few requests from the queue. As each request is dispatched, it is also removed from the sorted queue. The deadline I/O scheduler scheme overcomes the starvation problem and also the read versus write problem.
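The dispatch rule can be sketched with per-request deadlines (0.5 s for reads and 5 s for writes, as in the text); the sorted queue is modeled as a heap on sector number, a simplification of the elevator order:

```python
import heapq

READ_EXPIRE, WRITE_EXPIRE = 0.5, 5.0   # seconds, as given in the text

class DeadlineScheduler:
    def __init__(self):
        self.sorted_q = []                 # heap keyed by sector number
        self.fifo = {"read": [], "write": []}

    def add(self, now, kind, sector):
        expire = now + (READ_EXPIRE if kind == "read" else WRITE_EXPIRE)
        heapq.heappush(self.sorted_q, (sector, kind))
        self.fifo[kind].append((expire, sector))

    def dispatch(self, now):
        for kind in ("read", "write"):     # expired head? serve that FIFO queue
            if self.fifo[kind] and self.fifo[kind][0][0] <= now:
                _, sector = self.fifo[kind].pop(0)
                self.sorted_q.remove((sector, kind))
                heapq.heapify(self.sorted_q)
                return sector
        sector, kind = heapq.heappop(self.sorted_q)   # ordinary case: sorted order
        self.fifo[kind].remove(next(e for e in self.fifo[kind] if e[1] == sector))
        return sector

s = DeadlineScheduler()
s.add(0.0, "write", 500)
s.add(0.0, "read", 900)
print(s.dispatch(0.1))   # -> 500: nothing has expired, lowest sector wins
print(s.dispatch(1.0))   # -> 900: the read expired at 0.5 s, served from its FIFO
```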
Anticipatory I/O Scheduler The original elevator scheduler and the deadline scheduler both are designed to dispatch a new request as soon as the existing request is satisfied, thus keeping the disk as busy as possible. This same policy applies to all of the scheduling algorithms discussed in Section 11.5. However, such a policy can be counterproductive if there are numerous synchronous read requests. Typically, an application will wait until a read request is satisfied and the data available before issuing the next request. The small delay between receiving the data for the last read and issuing the next read enables the scheduler to turn elsewhere for a pending request and dispatch that request. Because of the principle of locality, it is likely that successive reads from the same process will be to disk blocks that are near one another. If the scheduler were to delay a short period of time after satisfying a read request, to see if a new nearby read request is made, the overall performance of the system could be enhanced. This is the philosophy behind the anticipatory scheduler, proposed in [IYER01] and implemented in Linux 2.6.
In Linux, the anticipatory scheduler is superimposed on the deadline scheduler. When a read request is dispatched, the anticipatory scheduler causes the scheduling system to delay for up to 6 milliseconds, depending on the configuration. During this small delay, there is a good chance that the application that issued the last read request will issue another read request to the same region of the disk. If so, that request will be serviced immediately. If no such read request occurs, the scheduler resumes using the deadline scheduling algorithm.

[LOVE04] reports on two tests of the Linux scheduling algorithms. The first test involved the reading of a 200-MB file while doing a long streaming write in the background. The second test involved doing a read of a large file in the background while reading every file in the kernel source tree. The results are listed in the following table:
I/O Scheduler and Kernel              Test 1         Test 2
Linus elevator on 2.4                 45 seconds     30 minutes, 28 seconds
Deadline I/O scheduler on 2.6         40 seconds     3 minutes, 30 seconds
Anticipatory I/O scheduler on 2.6     4.6 seconds    15 seconds
As can be seen, the performance improvement depends on the nature of the workload. But in both cases, the anticipatory scheduler provides a dramatic improvement.
Linux Page Cache
In Linux 2.2 and earlier releases, the kernel maintained a page cache for reads and writes from regular file system files and for virtual memory pages, and a separate buffer cache for block I/O. For Linux 2.4 and later, there is a single unified page cache that is involved in all traffic between disk and main memory.

The page cache confers two benefits. First, when it is time to write back dirty pages to disk, a collection of them can be ordered properly and written out efficiently. Second, because of the principle of temporal locality, pages in the page cache are likely to be referenced again before they are flushed from the cache, thus saving a disk I/O operation. Dirty pages are written back to disk in two situations:

• When free memory falls below a specified threshold, the kernel reduces the size of the page cache to release memory to be added to the free memory pool.
12.9 LINUX VIRTUAL FILE SYSTEM
Linux includes a versatile and powerful file handling facility, designed to support a wide variety of file management systems and file structures. The approach taken in Linux is to make use of a virtual file system (VFS), which presents a single, uniform file system interface to user processes. The VFS defines a common file model that is capable of representing any conceivable file system's general feature and behavior. The VFS assumes that files are objects in a computer's mass storage memory that share basic properties regardless of the target file system or the underlying processor hardware.
CHAPTER 12 / FILE MANAGEMENT

Figure 12.17 Linux Virtual File System Context
Trang 36129 / LINUX VIRTUAL FILE SVSTEM 589 Syste cal
Syste cate thing VIS am ves ‘wine Bề den X Dis 40 ——
wermerice [Linx] “eal erase ‘all in me function ‘ote system x ile Fler onseootaey Figure 12.18 Linux Virtual File System Concept
Figure 12.18 indicates the role that VFS plays within the Linux kernel. When a process initiates a file-oriented system call (e.g., read), the kernel calls a function in the VFS. This function handles the file-system-independent manipulations and initiates a call to a function in the target file system code. This call passes through a mapping function that converts the call from the VFS into a call to the target file system. The VFS is independent of any file system, so the implementation of a mapping function must be part of the implementation of a file system on Linux. The target file system converts the file system request into device-oriented instructions that are passed to a device driver by means of page cache functions.

VFS is an object-oriented scheme. Because it is written in C, rather than a language that supports object programming (such as C++ or Java), VFS objects are implemented simply as C data structures. Each object contains both data and pointers to file-system-implemented functions that operate on data. The four primary object types in VFS are as follows:
• Superblock object: Represents a specific mounted file system
• Inode object: Represents a specific file
• Dentry object: Represents a specific directory entry
• File object: Represents an open file associated with a process
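Since the originals are C structs carrying function pointers, the pattern can be sketched in Python as data plus a table of file-system-supplied operations; the names here are illustrative, not the kernel's:

```python
class Superblock:
    def __init__(self, fs_type, ops):
        self.fs_type = fs_type    # data: e.g., which file system is mounted
        self.ops = ops            # table of file-system-implemented functions

    def read_inode(self, number):
        # the VFS dispatches through the table; each file system plugs in its own
        return self.ops["read_inode"](number)

# A toy "file system" supplies its implementation of the operation:
ramfs_ops = {"read_inode": lambda n: {"inode": n, "fs": "ramfs"}}
sb = Superblock("ramfs", ramfs_ops)
print(sb.read_inode(7))   # -> {'inode': 7, 'fs': 'ramfs'}
```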
This scheme is based on the concepts used in UNIX file systems, as described in Section 12.7. The key concepts of a UNIX file system to remember are the following. A file system consists of a hierarchical organization of directories. A directory is the same as what is known as a folder on many non-UNIX platforms and may contain files and/or other directories. Because a directory may contain other directories, a tree structure is formed. A path through the tree structure from the root consists of a sequence of directory entries, ending in either a directory entry (dentry) or a file name. In UNIX, a directory is implemented as a file that lists the files and directories contained within it. Thus, file operations can be performed on either files or directories.
The Superblock Object
The superblock object consists of a number of data items. Examples include the following:

• The device that this file system is mounted on
• The basic block size of the file system
• Dirty flag, to indicate that the superblock has been changed but not written back to disk
• File system type
• Flags, such as a read-only flag
• Pointer to the root of the file system directory
• List of open files
• Semaphore for controlling access to the file system
• List of superblock operations
The last item on the preceding list refers to an operations object contained within the superblock object. The operations object defines the object methods (functions) that the kernel can invoke against the superblock object. The methods defined for the superblock object include the following:
• read_inode: Read a specified inode from a mounted file system
• write_inode: Write given inode to disk
• put_inode: Release inode
• delete_inode: Delete inode from disk
• notify_change: Called when inode attributes are changed
• put_super: Called by the VFS on unmount to release the given superblock
• write_super: Called when the VFS decides that the superblock needs to be written to disk
• statfs: Obtain file system statistics
• remount_fs: Called by the VFS when the file system is remounted with new mount options
• clear_inode: Release inode and clear any pages containing related data
The Inode Object
An inode is associated with each file. The inode object holds all the information about a named file except its name and the actual data contents of the file. Items contained in an inode object include owner, group, permissions, access times for a file, size of data it holds, and number of links. The inode object also includes an inode operations object that describes the file system's implemented functions that the VFS can invoke on an inode. The methods defined for the inode object include the following:
• create: Creates a new inode for a regular file associated with a dentry object in some directory
• lookup: Searches a directory for an inode corresponding to a file name
• mkdir: Creates a new inode for a directory associated with a dentry object in some directory

The Dentry Object
A dentry (directory entry) is a specific component in a path. The component may be either a directory name or a file name. Dentry objects facilitate access to files and directories and are used in a dentry cache for that purpose. The dentry object includes a pointer to the inode and superblock. It also includes a pointer to the parent dentry and pointers to any subordinate dentrys.
The File Object
The file object is used to represent a file opened by a process. The object is created in response to the open() system call and destroyed in response to the close() system call. The file object consists of a number of items, including the following:

• Dentry object associated with the file
• File system containing the file
• File object's usage counter
• User's user ID
• User's group ID
16.7 BEOWULF AND LINUX CLUSTERS
In 1994, the Beowulf project was initiated under the sponsorship of the NASA High Performance Computing and Communications (HPCC) project. Its goal was to investigate the potential of clustered PCs for performing important computation tasks beyond the capabilities of contemporary workstations at minimum cost. Today, the Beowulf approach is widely implemented and is perhaps the most important cluster technology available.
Beowulf Features
Key features of Beowulf include the following [RIDG97]:

• Mass market commodity components
• Dedicated processors (rather than scavenging cycles from idle workstations)
• A dedicated, private network (LAN or WAN or internetted combination)
• No custom components
• Easy replication from multiple vendors
• Scalable I/O
• A freely available software base
• Use of freely available distribution computing tools with minimal changes
• Return of the design and improvements to the community
Although elements of Beowulf software have been implemented on a number of different platforms, the most obvious choice for a base is Linux, and most Beowulf implementations use a cluster of Linux workstations and/or PCs. Figure 16.18 depicts a representative configuration. The cluster consists of a number of workstations, perhaps of differing hardware platforms, all running the Linux operating system. Secondary storage at each workstation may be made available for distributed access (for distributed file sharing, distributed virtual memory, or other uses). The cluster nodes (the Linux systems) are interconnected with a commodity networking approach, typically Ethernet. The Ethernet support may be in the form of a single Ethernet switch or an interconnected set of switches. Commodity Ethernet products at the standard data rates (10 Mbps, 100 Mbps, 1 Gbps) are used.
Figure 16.18 Generic Beowulf Configuration

Beowulf Software
The Beowulf software environment is implemented as an add-on to commercially available, royalty-free base Linux distributions. The principal source of open-source Beowulf software is the Beowulf site at www.beowulf.org, but numerous other organizations also offer free Beowulf tools and utilities.
Each node in the Beowulf cluster runs its own copy of the Linux kernel and can function as an autonomous Linux system. To support the Beowulf cluster concept, extensions are made to the Linux kernel to allow the individual nodes to participate in a number of global namespaces. The following are examples of Beowulf system software:

• Beowulf distributed process space (BPROC): This package allows a process ID space to span multiple nodes in a cluster environment and also provides mechanisms for starting processes on other nodes. The goal of this package is to provide key elements needed for a single system image on a Beowulf cluster. BPROC provides a mechanism to start processes on remote nodes without ever logging into another node and by making all the remote processes visible in the process table of the cluster's front-end node.