UNIX Filesystems: Evolution, Design, and Implementation (part 4)

UNIX Kernel Concepts

... be in memory when the process requests it. The data requested is read, but before returning to the user with the data, a strategy call is made to read the next block without a subsequent call to iowait().

To perform a write, a call is made to bwrite(), which simply needs to invoke the two-line sequence previously shown.

After the caller has finished with the buffer, a call is made to brelse(), which takes the buffer and places it at the back of the freelist. This ensures that the oldest free buffer will be reassigned first.

Mounting Filesystems

The section The UNIX Filesystem, earlier in this chapter, showed how filesystems were laid out on disk with the superblock occupying block 1 of the disk slice. Mounted filesystems were held in a linked list of mount structures, one per filesystem, with a maximum of NMOUNT mounted filesystems. Each mount structure has three elements, namely:

m_dev. This field holds the device ID of the disk slice and can be used in a simple check to prevent a second mount of the same filesystem.

m_buf. This field points to the superblock (struct filsys), which is read from disk during a mount operation.

m_inodp. This field references the inode for the directory onto which this filesystem is mounted. This is further explained in the section Pathname Resolution later in this chapter.

The root filesystem is mounted early on during kernel initialization. This involves a very simple code sequence that relies on the root device being hard coded into the kernel. The block containing the superblock of the root filesystem is read into memory by calling bread(); then the first mount structure is initialized to point to the buffer.

Any subsequent mounts need to come in through the mount() system call. The first task is to walk through the list of existing mount structures, checking m_dev against the device passed to mount().
If the filesystem is mounted already, EBUSY is returned; otherwise another mount structure is allocated for the newly mounted filesystem.

System Call Handling

Arguments passed to system calls are placed on the user stack prior to invoking a hardware instruction that then transfers the calling process from user mode to kernel mode. Once inside the kernel, any system call handler needs to be able to access the arguments. Because the process may sleep awaiting some resource, resulting in a context switch, the kernel needs to copy these arguments into the kernel address space.

The sysent[] array specifies all of the system calls available, including the number of arguments. By executing a hardware trap instruction, control is passed from user space to the kernel and the kernel trap() function runs to determine the system call to be processed. The C library function linked with the user program stores a unique value on the user stack corresponding to the system call. The kernel uses this value to locate the entry in sysent[] and so learn how many arguments are being passed.

For a read() or write() system call, the arguments are accessible as follows:

    fd      = u.u_ar0[R0]
    u_base  = u.u_arg[0]
    u_count = u.u_arg[1]

This is a little strange because the first argument is accessed differently from the subsequent ones. This is partly due to the hardware on which 5th Edition UNIX was based and partly due to the method that the original authors chose to handle traps.

If any error is detected during system call handling, u_error is set to record the error found. For example, if an attempt is made to mount an already mounted filesystem, the mount system call handler will set u_error to EBUSY. As part of completing the system call, trap() will set up the r0 register to contain the error code, which is then accessible as the return value of the system call once control is passed back to user space.
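
The dispatch and error path just described can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the historical kernel code: the names user_area, sysent_entry, mount_sketch, sys_mount_sketch, and trap_dispatch are hypothetical, the table holds a single call, an unknown call number is reported as EINVAL, and a full mount table is reported as EBUSY. The sketch also does not model how user space distinguishes an error code from an ordinary result.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAXARGS 4
#define NMOUNT  4

/* Hypothetical stand-in for the per-process user area. */
struct user_area {
    int u_arg[MAXARGS];   /* arguments copied in from the user stack */
    int u_error;          /* error recorded by the handler, 0 if none */
    int u_rval;           /* result handed back on success */
};

/* Hypothetical mount table entry; only m_dev comes from the text. */
struct mount_sketch {
    int m_dev;            /* device ID of the mounted slice */
    int in_use;           /* nonzero when the slot is occupied */
};

static struct mount_sketch mount_tbl[NMOUNT];

/* Walk the mount table; refuse a second mount of the same device. */
static void sys_mount_sketch(struct user_area *u)
{
    int dev = u->u_arg[0];
    struct mount_sketch *free_slot = NULL;

    for (int i = 0; i < NMOUNT; i++) {
        if (mount_tbl[i].in_use && mount_tbl[i].m_dev == dev) {
            u->u_error = EBUSY;          /* already mounted */
            return;
        }
        if (!mount_tbl[i].in_use && free_slot == NULL)
            free_slot = &mount_tbl[i];
    }
    if (free_slot == NULL) {
        u->u_error = EBUSY;              /* assumption: table full */
        return;
    }
    free_slot->m_dev = dev;
    free_slot->in_use = 1;
}

/* One entry per system call: argument count plus handler. */
struct sysent_entry {
    int   nargs;
    void (*handler)(struct user_area *);
};

static const struct sysent_entry sysent_tbl[] = {
    { 1, sys_mount_sketch },             /* call number 0 in this sketch */
};
#define NSYSENT (sizeof sysent_tbl / sizeof sysent_tbl[0])

/* Returns what would be placed in r0: the error code if one was set,
 * otherwise the handler's result. */
int trap_dispatch(struct user_area *u, unsigned code, const int *user_stack)
{
    u->u_error = 0;
    u->u_rval = 0;
    if (code >= NSYSENT) {
        u->u_error = EINVAL;             /* assumption: unknown call */
        return u->u_error;
    }
    const struct sysent_entry *se = &sysent_tbl[code];
    assert(se->nargs <= MAXARGS);
    for (int i = 0; i < se->nargs; i++)  /* copy arguments into the kernel */
        u->u_arg[i] = user_stack[i];
    se->handler(u);
    return u->u_error ? u->u_error : u->u_rval;
}
```

Calling trap_dispatch() twice with call number 0 and the same device argument mounts the device the first time and yields EBUSY the second time, mirroring the example in the text.
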
For further details on system call handling in early versions of UNIX, see [LION96]. Steve Pate's book UNIX Internals: A Practical Approach [PATE96] describes in detail how system calls are implemented at an assembly language level in System V Release 3 on the Intel x86 architecture.

Pathname Resolution

System calls often specify a pathname that must be resolved to an inode before the system call can continue. For example, in response to:

    fd = open("/etc/passwd", O_RDONLY);

the kernel must ensure that /etc is a directory and that passwd is a file within the /etc directory.

Where to start the search depends on whether the pathname specified is absolute or relative. If it is an absolute pathname, the search starts from rootdir, a pointer to the root inode in the root filesystem that is initialized during kernel bootstrap. If the pathname is relative, the search starts from u_cdir, the inode of the current working directory. Thus, changing a directory involves resolving a pathname to a base directory component and then setting u_cdir to reference the inode for that directory.

The routine that performs pathname resolution is called namei(). It uses fields in the user area, as do many other kernel functions. Much of the work of namei() involves parsing the pathname so that it can work on one component at a time. Consider, at a high level, the sequence of events that must take place to resolve /etc/passwd.

    if (absolute pathname) {
        dip = rootdir
    } else {
        dip = u.u_cdir
    }
    loop:
        name = next component
        scan dip for name / inode number
        iput(dip)
        dip = iget() to read in inode
        if last component {
            return dip
        } else {
            goto loop
        }

This is an oversimplification, but it illustrates the steps that must be performed. The routines iget() and iput() are responsible for retrieving an inode and releasing an inode respectively. A call to iget() scans the inode cache before reading the inode from disk.
Either way, the returned inode will have its hold count (i_count) increased. A call to iput() decrements i_count and, if it reaches 0, the inode can be placed on the free list.

To facilitate crossing mount points, fields in the mount and inode structures are used. The m_inodp field of the mount structure points to the directory inode on which the filesystem is mounted, allowing the kernel to perform a ".." traversal over a mount point. The inode that is mounted on has the IMOUNT flag set, which allows the kernel to go over a mount point.

Putting It All Together

To describe how all of the above subsystems work together, this section follows a call to open() on /etc/passwd followed by the read() and close() system calls. Figure 6.4 shows the main structures involved in actually performing the read. It is useful to have this figure in mind while reading through the following sections.

[Figure 6.4: Kernel structures used when reading from a file. The figure traces fd = open("/etc/passwd", O_RDONLY); read(fd, buf, 512); from user mode into kernel mode: struct user (u_base, u_ofile[3]), struct file (f_inode), the in-core inode for passwd (i_addr[0]), the buffer for (X, Y) / Z (b_dev = (X, Y), b_blkno = Z, b_addr), iomove(), the (*bdevsw[X].d_strategy)(bp) call into the RK disk driver, and the on-disk layout (block 0, superblock, inodes, data blocks) from which block Z is copied.]

Opening a File

The open() system call is handled by the open() kernel function. Its first task is to call namei() to resolve the pathname passed to open(). Assuming the pathname is valid, the inode for passwd is returned. A call to open1() is then made, passing the open mode. The split between open() and open1() allows the open() and creat() system calls to share much of the same code.

First of all, open1() must call access() to ensure that the process can access the file according to ownership and the mode passed to open().
If all is fine, a call to falloc() is made to allocate a file table entry. Internally this invokes ufalloc() to allocate a file descriptor from u_ofile[]. The newly allocated file descriptor will be set to point to the newly allocated file table entry. Before returning from open1(), the linkage between the file table entry and the inode for passwd is established as was shown in Figure 6.3.

Reading the File

The read() and write() system calls are handled by kernel functions of the same name. Both make a call to rdwr(), passing FREAD or FWRITE. The role of rdwr() is fairly straightforward: it sets up the appropriate fields in the user area to correspond to the arguments passed to the system call and invokes either readi() or writei() to read from or write to the file. The following pseudocode shows the steps taken for this initialization. Note that some of the error checking has been removed to simplify the steps taken.

    get file pointer from user area
    set u_base to u.u_arg[0];    /* user supplied buffer */
    set u_count to u.u_arg[1];   /* number of bytes to read/write */
    if (reading) {
        readi(fp->f_inode);
    } else {
        writei(fp->f_inode);
    }

The internals of readi() are fairly straightforward and involve making repeated calls to bmap() to obtain the disk block address from the file offset. The bmap() function takes a logical block number within the file and returns the physical block number on disk. This is used as an argument to bread(), which reads in the appropriate block from disk. The uiomove() function then transfers data to the buffer specified in the call to read(), which is held in u_base. This also increments u_base and decrements u_count so that the loop will terminate after all the data has been transferred.

If any errors are encountered during the actual I/O, the b_flags field of the buf structure will be set to B_ERROR and additional error information may be stored in b_error.
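
The readi() loop described above can be sketched as follows. This is a minimal illustration under several assumptions, not the original kernel routine: the disk is an in-memory array, the inode holds direct block addresses only, error handling is reduced to stopping the loop, and the names disk, inode_sketch, io_state, bmap_sketch, and readi_sketch are hypothetical. The io_state structure stands in for the u_base, u_count, and file offset fields of the user area.

```c
#include <assert.h>
#include <string.h>

#define BSIZE   512       /* block size used in the text's read() example */
#define NADDR   8         /* direct block addresses only (assumption) */
#define NBLOCKS 16        /* size of the toy disk */

static char disk[NBLOCKS][BSIZE];   /* in-memory stand-in for the disk */

struct inode_sketch {
    int  i_addr[NADDR];   /* physical block number for each logical block */
    long i_size;          /* file size in bytes */
};

struct io_state {         /* stands in for fields of the user area */
    char *u_base;         /* where the next byte goes in the user buffer */
    long  u_count;        /* bytes still wanted */
    long  u_offset;       /* current offset within the file */
};

/* Map a logical block number within the file to a physical block. */
static int bmap_sketch(const struct inode_sketch *ip, long lbn)
{
    return (lbn >= 0 && lbn < NADDR) ? ip->i_addr[lbn] : -1;
}

/* Copy up to u_count bytes from the file into u_base; returns the
 * number of bytes transferred. */
long readi_sketch(const struct inode_sketch *ip, struct io_state *io)
{
    long total = 0;

    while (io->u_count > 0 && io->u_offset < ip->i_size) {
        long lbn = io->u_offset / BSIZE;      /* logical block */
        long off = io->u_offset % BSIZE;      /* offset inside it */
        int  pbn = bmap_sketch(ip, lbn);      /* physical block */
        if (pbn < 0 || pbn >= NBLOCKS)
            break;                            /* unmapped block: stop */

        long n = BSIZE - off;                 /* bytes left in this block */
        if (n > io->u_count)
            n = io->u_count;
        if (n > ip->i_size - io->u_offset)
            n = ip->i_size - io->u_offset;

        /* The copy-out step: move data, then advance the bookkeeping. */
        memcpy(io->u_base, &disk[pbn][off], (size_t)n);
        io->u_base   += n;
        io->u_count  -= n;
        io->u_offset += n;
        total        += n;
    }
    return total;
}
```

The loop ends either when u_count reaches zero or when the end of the file is reached, so a request larger than the file returns a short count and leaves the untransferred remainder in u_count.
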
In response to an I/O error, the u_error field of the user structure will be set to either EIO or ENXIO. The b_resid field is used to record how many bytes out of a request size of u_count were not transferred. Both fields are used to notify the calling process of how many bytes were actually read or written.

Closing the File

The close() system call is handled by the close() kernel function. It performs little work other than obtaining the file table entry by calling getf(), zeroing the appropriate entry in u_ofile[], and then calling closef(). Note that because a previous call to dup() may have been made, the reference count of the file table entry must be checked before it can be freed. If the reference count (f_count) is 1, the entry can be removed and a call to closei() is made to free the inode. If the value of f_count is greater than 1, it is decremented and the work of close() is complete.

To release a hold on an inode, iput() is invoked. The additional work performed by closei() allows a device driver close call to be made if the file to be closed is a device. As with closef(), iput() checks the reference count of the inode (i_count). If it is greater than 1, it is decremented, and there is no further work to do. If the count is 1, this is the only hold on the file, so the inode can be released. One additional check is made to see whether the hard link count of the inode has reached 0. This implies that an unlink() system call was invoked while the file was still open. If this is the case, the inode can be freed on disk.

Summary

This chapter concentrated on the structures introduced in the early UNIX versions, which should provide readers with a basic grounding in UNIX kernel principles, particularly as they apply to how filesystems and files are accessed.
It says something for the design of the original versions of UNIX that many UNIX-based kernels still bear a great deal of similarity to the original versions developed over 30 years ago.

Lions' book Lions' Commentary on UNIX 6th Edition [LION96] provides a unique view of how 6th Edition UNIX was implemented and lists the complete kernel source code. For additional browsing, the source code is available online for download.

For a more concrete explanation of some of the algorithms and more details on the kernel in general, Bach's book The Design of the UNIX Operating System [BACH86] provides an excellent overview of System V Release 2. Pate's book UNIX Internals: A Practical Approach [PATE96] describes a System V Release 3 variant. The UNIX versions described in both books bear most resemblance to the earlier UNIX research editions.

Chapter 7: Development of the SVR4 VFS/Vnode Architecture

The development of the File System Switch (FSS) architecture in SVR3, the Sun VFS/vnode architecture in SunOS, and then the merge between the two to produce SVR4, substantially changed the way that filesystems were accessed and implemented. During this period, the number of filesystem types increased dramatically, including the introduction of commercial filesystems such as VxFS that allowed UNIX to move toward the enterprise computing market. SVR4 also introduced a number of other important concepts pertinent to filesystems, such as tying filesystem access to memory-mapped files, the DNLC (Directory Name Lookup Cache), and a separation between the traditional buffer cache and the page cache, which also changed the way that I/O was performed.

This chapter follows the developments that led up to the implementation of SVR4, which is still the basis of Sun's Solaris operating system and is also freely available under the auspices of Caldera's OpenUNIX.

The Need for Change

The research editions of UNIX had a single filesystem type, as described in Chapter 6.
The tight coupling between the kernel and the filesystem worked well at this stage because there was only one filesystem type and the kernel was single threaded, which means that only one process could be running in the kernel at the same time.

Before long, the need to add new filesystem types, including non-UNIX filesystems, resulted in a shift away from the old-style filesystem implementation to a newer, cleaner architecture that clearly separated the different physical filesystem implementations from those parts of the kernel that dealt with file and filesystem access.

Pre-SVR3 Kernels

With the exception of Lions' book on 6th Edition UNIX [LION96], no other UNIX kernels were documented in any detail until the arrival of System V Release 2, which was the basis for Bach's book The Design of the UNIX Operating System [BACH86]. In his book, Bach describes the on-disk layout as almost identical to that of the earlier versions of UNIX.

There was too little change between the research editions of UNIX and SVR2 to warrant describing the SVR2 filesystem architecture in detail. Around this time, most of the work on filesystem evolution was taking place at the University of California, Berkeley to produce the BSD Fast File System which would, in time, become UFS.

The File System Switch

Introduced with System V Release 3.0, the File System Switch (FSS) architecture provided a framework under which multiple different filesystem types could coexist in parallel.

The FSS was poorly documented and the source code for SVR3-based derivatives is not publicly available. [PATE96] describes in detail how the FSS was implemented. Note that the version of SVR3 described in that book contained a significant number of kernel changes (made by SCO) and therefore differed substantially from the original SVR3 implementation. This section highlights the main features of the FSS architecture.
As with earlier UNIX versions, SVR3 kept the mapping from file descriptors in the user area to the file table and on to in-core inodes. One of the main goals of SVR3 was to provide a framework under which multiple different filesystem types could coexist at the same time. Thus each time a call is made to mount, the caller could specify the filesystem type. Because the FSS could support multiple different filesystem types, the traditional UNIX filesystem needed to be named so it could be identified when calling the mount command. Thus, it became known as the s5 (System V) filesystem. Throughout the USL-based development of System V through to the various SVR4 derivatives, little development would occur on s5. SCO completely restructured their s5-based filesystem over the years and added a number of new features.

The boundary between the filesystem-independent layer of the kernel and the filesystem-dependent layer occurred mainly through a new implementation of the in-core inode. Each filesystem type could potentially have a very different on-disk representation of a file. Newer diskless filesystems such as NFS and RFS had different, non-disk-based structures once again. Thus, the new inode contained fields that were generic to all filesystem types, such as user and group IDs and file size, as well as the ability to reference data that was filesystem-specific. Additional fields used to construct the FSS interface were:

i_fsptr. This field points to data that is private to the filesystem and that is not visible to the rest of the kernel. For disk-based filesystems this field would typically point to a copy of the disk inode.

i_fstyp. This field identifies the filesystem type.

i_mntdev. This field points to the mount structure of the filesystem to which this inode belongs.

i_mton. This field is used during pathname traversal.
If the directory referenced by this inode is mounted on, this field points to the mount structure for the filesystem that covers this directory.

i_fstypp. This field points to a vector of filesystem functions that are called by the filesystem-independent layer.

The set of filesystem-specific operations is defined by the fstypsw structure. An array of the same name holds an fstypsw structure for each possible filesystem. The elements of the structure, and thus the functions through which the kernel can call into the filesystem, are shown in Table 7.1.

When a file is opened for access, the i_fstypp field is set to point to the fstypsw[] entry for that filesystem type. In order to invoke a filesystem-specific function, the kernel performs a level of indirection through a macro that accesses the appropriate function. For example, consider the definition of FS_READI() that is invoked to read data from a file:

    #define FS_READI(ip) (*fstypsw[(ip)->i_fstyp].fs_readi)(ip)

All filesystems must follow the same calling conventions so that they all understand how arguments will be passed. In the case of FS_READI(), the arguments of interest will be held in u_base and u_count. Before returning to the filesystem-independent layer, u_error will be set to indicate whether an error occurred and u_resid will contain a count of any bytes that could not be read or written.

Mounting Filesystems

The method of mounting filesystems in SVR3 changed because each filesystem's superblock could be different and, in the case of NFS and RFS, there was no superblock per se.
The list of mounted filesystems was moved into an array of mount structures that contained the following elements: [...]

Table 7.1 File System Switch Functions

FSS OPERATION   DESCRIPTION
fs_init         Each filesystem can specify a function that is called during kernel initialization, allowing the filesystem to perform any initialization tasks prior to the first mount call
fs_iread        Read the inode (during pathname resolution)
fs_iput         Release the inode
fs_iupdat       Update the inode timestamps
fs_readi        Called to read data from a file
fs_writei       Called to write data to a file
fs_itrunc       Truncate a file
fs_statf        Return file information required by stat()
fs_namei        Called during pathname traversal
fs_mount        Called to mount a filesystem
fs_umount       Called to unmount a filesystem
fs_getinode     Allocate a file for a pipe
fs_openi        Call the device open routine
fs_closei       Call the device close routine
fs_update       Sync the superblock to disk
fs_statfs       Used by statfs() and ustat()
fs_access       Check access permissions
fs_getdents     Read directory entries
fs_allocmap     Build a block list map for demand paging
fs_freemap      Free the demand paging block list map
fs_readmap      Read a page using the block list map
fs_setattr      Set file attributes
fs_notify       Notify the filesystem when file attributes change
fs_fcntl        Handle the fcntl() system call
fs_fsinfo       Return filesystem-specific information
fs_ioctl        Called in response to an ioctl() system call
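
The indirection through fstypsw[] can be illustrated with a small, self-contained C sketch. Only the shape of the FS_READI() macro, the i_fstyp and i_fsptr fields, and the fs_readi and fs_writei operation names come from the text; everything else here is an assumption made for illustration, including the reduced two-entry operations structure, the fstypsw_sketch type name, the FSTYP_ and DONE_BY_ constants, the FS_WRITEI() macro, the wrapper functions, and the stub filesystem routines, which simply return a tag identifying who handled the call.

```c
#include <assert.h>
#include <stddef.h>

struct inode;                          /* defined below */

/* Reduced operations vector: two of the Table 7.1 entries. */
struct fstypsw_sketch {
    int (*fs_readi)(struct inode *ip);
    int (*fs_writei)(struct inode *ip);
};

struct inode {
    int   i_fstyp;                     /* index into fstypsw[] */
    void *i_fsptr;                     /* filesystem-private data */
};

/* Hypothetical filesystem type numbers and result tags. */
enum { FSTYP_S5, FSTYP_NFS, NFSTYP };
enum { DONE_BY_S5 = 1, DONE_BY_NFS = 2 };

/* Stub filesystem routines: each reports which filesystem ran. */
static int s5_readi(struct inode *ip)   { (void)ip; return DONE_BY_S5; }
static int s5_writei(struct inode *ip)  { (void)ip; return DONE_BY_S5; }
static int nfs_readi(struct inode *ip)  { (void)ip; return DONE_BY_NFS; }
static int nfs_writei(struct inode *ip) { (void)ip; return DONE_BY_NFS; }

/* One entry per filesystem type. */
static const struct fstypsw_sketch fstypsw[NFSTYP] = {
    [FSTYP_S5]  = { s5_readi,  s5_writei  },
    [FSTYP_NFS] = { nfs_readi, nfs_writei },
};

/* Same shape as the macro quoted in the text. */
#define FS_READI(ip)  (*fstypsw[(ip)->i_fstyp].fs_readi)(ip)
#define FS_WRITEI(ip) (*fstypsw[(ip)->i_fstyp].fs_writei)(ip)

/* Filesystem-independent entry points: they never name a filesystem. */
int read_any(struct inode *ip)
{
    assert(ip->i_fstyp >= 0 && ip->i_fstyp < NFSTYP);
    return FS_READI(ip);
}

int write_any(struct inode *ip)
{
    assert(ip->i_fstyp >= 0 && ip->i_fstyp < NFSTYP);
    return FS_WRITEI(ip);
}
```

The point of the design is that read_any() and write_any() contain no filesystem-specific code at all: adding another filesystem type in this sketch means adding one more row to the table, with no change to the callers.
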
