Beginning Linux Programming, Third Edition, Part 10

Virtual Memory Areas

Above the page table reside the virtual memory areas. These constitute a map of contiguous virtual memory addresses as handed out to an application.

    struct vm_area_struct {
        unsigned long vm_start;
        unsigned long vm_end;
        pgprot_t vm_page_prot;
        struct vm_operations_struct *vm_ops;
        unsigned long vm_pgoff;
        struct file *vm_file;
    };

This is a heavily trimmed version of the structure; the full definition is in linux/mm.h. We will cover only the members that we will actually need later on. vm_start and vm_end represent the beginning and end of the virtual memory area, and vm_page_prot holds the protection attributes assigned to it: whether it is shared, private, executable, and so forth. vm_ops is similar to the file_operations structure used with character and block devices and forms an analogous abstraction for operations on virtual memory areas. Finally, vm_pgoff is the offset into the area, and vm_file is used in connection with memory mapping of files. We’ll take a much closer look at this structure when we dissect the mmap function of schar.

The mappings made by a specific process can be seen in /proc/<PID>/maps. Each line there corresponds to a separate vm_area_struct, and the size, span, and protection associated with the mapping can be read from the proc entry, among other things.

Address Space

The entire addressable area of memory (4 GB on 32-bit platforms) is split into two major areas: kernel space and user (or application) space. PAGE_OFFSET defines this split and is configurable in asm/page.h. Kernel space is located above the offset, and user space is kept below. The default for PAGE_OFFSET on the Intel platform is 0xc0000000, which provides the kernel with approximately 1 GB of memory and leaves 3 GB for user space. On the Intel platform, the virtual addresses seen from the kernel are therefore a direct offset from the physical address. This isn’t always the case on other platforms, so the conversion primitives must be used to translate between the two.
See Figure 18-4 for a visual representation of the address space.

759 Device Drivers b544977 Ch18.qxd 12/1/03 8:57 AM Page 759

Figure 18-4

Types of Memory Locations

There are three kinds of addresses you need to be aware of as a device-driver writer:

❑ Physical: This is the “real” address, the one that is used to index the memory bus on the motherboard.
❑ Virtual: Only the CPU and the kernel (via its page tables and TLB) know about virtual addresses.
❑ Bus: These are the addresses seen by all devices outside the CPU. On some platforms, they are identical to the physical addresses.

Now, if you want to talk to an add-on card, you can’t hand it a virtual memory address and tell it to transfer X number of bytes to you. The card knows absolutely nothing about the addressing scheme the kernel and CPU have agreed upon: it does not have access to the page tables and thus can’t make any sense of the memory address. Similarly, the kernel uses virtual addresses for everything, and accessing bus memory varies from platform to platform. Linux therefore provides convenient macros and functions to convert the three types of addresses back and forth.

    unsigned long virt_to_phys(void *address)
    void *phys_to_virt(unsigned long address)
    unsigned long virt_to_bus(void *address)
    void *bus_to_virt(unsigned long address)

Talking to a peripheral device requires translating back and forth between virtual addresses (which the kernel knows about) and bus addresses (which the devices know about). This holds regardless of the type of bus the peripheral is installed in, be it PCI, ISA, or any other. Note that jumping through the extra hoops of converting addresses is only necessary when you explicitly need to pass a pointer to a memory area directly to the device. This is the case with DMA transfers, for example. In other situations, you normally read the data from device I/O memory or I/O ports.
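To make the address arithmetic concrete, here is a minimal user-space sketch of the direct-offset relationship described for 32-bit x86. It is a model, not the kernel's implementation: the names MODEL_PAGE_OFFSET, model_is_kernel_address, model_virt_to_phys, and model_phys_to_virt are invented for this illustration, and the 0xC0000000 split is simply the default quoted earlier. Real drivers should call virt_to_phys and friends, since other platforms do not use a plain offset.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the default x86 split quoted in the text. */
#define MODEL_PAGE_OFFSET 0xC0000000u

/* Nonzero if a 32-bit virtual address lies in kernel space. */
static int model_is_kernel_address(uint32_t vaddr)
{
    return vaddr >= MODEL_PAGE_OFFSET;
}

/* Direct-mapped kernel virtual address -> physical address. */
static uint32_t model_virt_to_phys(uint32_t vaddr)
{
    /* Only kernel addresses have a fixed offset to physical memory. */
    assert(model_is_kernel_address(vaddr));
    return vaddr - MODEL_PAGE_OFFSET;
}

/* Physical address -> direct-mapped kernel virtual address. */
static uint32_t model_phys_to_virt(uint32_t paddr)
{
    /* Physical addresses that would wrap past 4 GB cannot be direct-mapped. */
    assert(paddr <= UINT32_MAX - MODEL_PAGE_OFFSET);
    return paddr + MODEL_PAGE_OFFSET;
}
```

With these assumptions, the kernel virtual address 0xC0100000 corresponds to physical address 0x00100000, and anything below 0xC0000000 is classified as user space.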
[Figure 18-4 labels: the address space runs from 0 to 4 GB; application space lies below PAGE_OFFSET (0xC0000000) and kernel space above it; the 16 MB mark is also shown.]

Getting Memory in Device Drivers

Memory is allocated in chunks of PAGE_SIZE on the target machine. The Intel platform has a page size of 4 KB, whereas the Alpha architecture uses 8 KB pages, and the page size is not a user-configurable option. Keep in mind that the page size varies depending on the platform. There are many ways of allocating memory for driver usage, the lowest-level one being a variant of

    unsigned long __get_free_page(int gfp_mask)
        Allocate exactly one page of memory.

gfp_mask describes the priority and attributes of the page we would like to get hold of. The most commonly used values in drivers are the following:

    GFP_ATOMIC   Memory should be returned, if any is available, without blocking or bringing in pages from swap.
    GFP_KERNEL   Memory should be returned, if any is available, but the call may block if pages need to be swapped out.
    GFP_DMA      The memory returned should be below the 16 MB mark and thus suitable as a DMA buffer. This flag is only needed for ISA peripherals, as these cannot address more than 16 MB of memory.

GFP_ATOMIC must always be specified if you wish to allocate memory at interrupt time, since it is guaranteed not to schedule out the current process if a suitable page is not available. ISA boards can only see up to 16 MB of memory, and hence you must specify GFP_DMA if you are allocating a buffer for DMA transfers on an ISA peripheral. Depending on how much memory is installed and the level of internal fragmentation, an allocation with GFP_DMA may not succeed. PCI devices do not suffer from this constraint and can use any memory returned by __get_free_page for DMA transfers.

__get_free_page is actually just a special case of __get_free_pages.

    unsigned long __get_free_pages(int gfp_mask, unsigned long order)

gfp_mask has the same meaning, but order is a new concept.
Pages can only be allocated in powers of 2, so the number of pages returned is 2^order. The PAGE_SHIFT define determines the software page size and is 12 on the x86 platform (2^12 bytes is 4 KB). An order of 0 returns one page of PAGE_SIZE bytes, an order of 1 returns two pages, and so forth. The kernel keeps internal lists of the different orders up to 5, which limits the maximum order to that amount, giving you a maximum of 2^5 times 4 KB, or 128 KB, on the x86 platform.

You may have wondered why the functions are prefixed with __; there is a perfectly good explanation for this. They are actually faster variants of get_free_page and get_free_pages, respectively, and the only difference lies in the fact that the __ versions don’t clear the page before returning it. If you copy memory back to user space applications, it may be beneficial to clear the page of previous contents that could inadvertently contain sensitive information that should not be passed to another process. __get_free_page and friends are quicker, and if the memory allocated is only to be used internally, clearing the pages may not be needed.

    void free_page(unsigned long addr)
    void free_pages(unsigned long addr, unsigned long order)
        Free the page(s) at memory location addr. You are expected to keep track of the size of allocated pages, since free_pages expects you to supply it with the order you used when allocating the memory.

kmalloc

Allocation of memory with get_free_page and the like is a bit troublesome and places a lot of the memory management work in the hands of the device driver. Depending on what you intend to use the memory for, a page-oriented scheme might not be the most appropriate. Besides, it is not often that the size requirement fits perfectly into the scheme of allocating pages in power-of-two multiples of the page size. This can lead to a lot of wasted memory.
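The power-of-two rule is easy to work through by hand. The sketch below is plain user-space C, not kernel code, and assumes the x86 figures from the text (a PAGE_SHIFT of 12 and a maximum order of 5); MODEL_PAGE_SHIFT, MODEL_MAX_ORDER, model_order_for_size, and model_wasted_bytes are hypothetical names used only here. It shows which order a given request would need and how much of the allocation goes unused.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions taken from the x86 figures in the text. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)
#define MODEL_MAX_ORDER  5

/* Smallest order whose 2^order pages can hold size bytes,
 * or -1 if the request exceeds the largest order. */
static int model_order_for_size(size_t size)
{
    int order;

    assert(size > 0);
    for (order = 0; order <= MODEL_MAX_ORDER; order++) {
        if ((MODEL_PAGE_SIZE << order) >= size)
            return order;
    }
    return -1;
}

/* Bytes handed out by the page allocator but not asked for. */
static size_t model_wasted_bytes(size_t size)
{
    int order = model_order_for_size(size);

    assert(order >= 0); /* caller must pass a size that can be satisfied */
    return (MODEL_PAGE_SIZE << order) - size;
}
```

Under these assumptions a 20,000-byte request needs order 3 (eight pages, 32,768 bytes) and leaves 12,768 bytes unused, while anything over 131,072 bytes cannot be satisfied at all.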
Linux provides kmalloc as an alternative, which lets you allocate memory of any size you want.

    void *kmalloc(size_t size, int flags)

size is the requested amount of memory and is rounded up to the nearest multiple of the page size. The flags parameter consists of a mask of priorities, just like with the get_free_page variants. The same size restrictions apply: You can only get up to 128 KB at a time. Trying to allocate more will result in an error in the log, saying “kmalloc: Size (135168) too large,” for example.

    void kfree(const void *addr)

kfree will free the memory previously allocated by kmalloc. If you are used to dynamically allocating memory in applications with malloc, you will feel right at home with kmalloc.

vmalloc

The third and final way to acquire memory is with vmalloc. While get_free_page and kmalloc both return memory that is physically contiguous, vmalloc provides memory that is contiguous in the virtual address space and thus serves a different purpose. It does so by allocating pages separately and manipulating the page tables.

    void *vmalloc(unsigned long size)
    void vfree(void *addr)

vmalloc allows you to allocate much larger arrays than kmalloc, but the returned memory can only be used from within the kernel. Regions passed to peripheral devices cannot be allocated with vmalloc because they are not contiguous in the physical address space. Virtual memory is only usable within the kernel/CPU context where it can be looked up in the page tables.

It is extremely important to free memory once you are done using it. The kernel does not reap allocated pages when the module is unloaded, and this makes it the module’s complete responsibility to do its own memory management.

vmalloc cannot be used at interrupt time either as it may sleep, since internally kmalloc is called without GFP_ATOMIC set.
This should not pose a serious problem, as it would be abnormal to need more memory than __get_free_pages can provide inside an interrupt handler. All things considered, vmalloc is most useful for internal storage. The RAM disk module, radimo, shown in the “Block Devices” section later in this chapter will provide an example of vmalloc usage.

Transferring Data between User and Kernel Space

Applications running on the system can only access memory below the PAGE_OFFSET mark. This ensures that no process is allowed to overwrite memory areas managed by the kernel, which would seriously compromise system integrity, but at the same time it poses problems regarding getting data back to user space. Processes running in the context of the kernel are allowed to access both regions of memory, but it must still be verified that the location given by the process is within its virtual memory area.

    int access_ok(int type, const void *addr, unsigned long size)

The above macro returns 1 if it is okay to access the desired memory range. The type of access (VERIFY_READ or VERIFY_WRITE) is specified by type, the starting address by addr, and the size of the memory region by size. Every transfer taking place to or from user space must make sure that the location given is a valid one. The code to do so is architecture-dependent and located in asm/uaccess.h.

The actual transfer of data is done by various functions, depending on the size of the transfer.

    get_user(void *x, const void *addr)
        Copy sizeof(*addr) bytes from user space address addr to the variable x.
    put_user(void *x, const void *addr)
        Copy sizeof(*addr) bytes from x to user space address addr.

The type of the pointer given in addr must be known and cast if necessary, which is why there is no need for a size argument. The implementation is quite intricate and can be found in the aforementioned include file. These functions are frequently used in implementing ioctl calls since those often copy single-value variables back and forth.
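The range check behind access_ok can be pictured with a small user-space model. This is only a sketch under the assumption that, on 32-bit x86 with the default split, the check amounts to making sure the whole range lies below PAGE_OFFSET; the real macro is architecture-dependent and lives in asm/uaccess.h. MODEL_PAGE_OFFSET, model_access_ok, and model_access_ok_array are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the default x86 split quoted in the text. */
#define MODEL_PAGE_OFFSET 0xC0000000u

/* Returns 1 if [addr, addr + size) lies entirely below PAGE_OFFSET,
 * 0 otherwise. Written so that addr + size is never computed and
 * therefore cannot wrap around. */
static int model_access_ok(uint32_t addr, uint32_t size)
{
    return size <= MODEL_PAGE_OFFSET && addr <= MODEL_PAGE_OFFSET - size;
}

/* One check covering count consecutive elements of elem_size bytes,
 * as you would do before several single-value copies. */
static int model_access_ok_array(uint32_t addr, uint32_t count,
                                 uint32_t elem_size)
{
    assert(elem_size > 0);
    if (count > UINT32_MAX / elem_size)
        return 0; /* count * elem_size would overflow 32 bits */
    return model_access_ok(addr, count * elem_size);
}
```

In this model a one-page range starting at 0xBFFFF000 is accepted because it ends exactly at PAGE_OFFSET, while the same range starting one byte higher is rejected.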
You may have wondered why the appropriate access_ok call was not included in schar, for example. Often the check is omitted by mistake, and the x_user functions therefore include the check. The return value is 0 if the copy was completed and -EFAULT in case of access violation.

    get_user(x, addr)

The versions prefixed with __ perform no checking. They are typically used when performing multiple single-value copies, where performing the access check several times is redundant.

    char foo[2];

    if (access_ok(VERIFY_WRITE, arg, 2*sizeof(*arg))) {
        __put_user(foo[0], arg);
        __put_user(foo[1], arg+1);
    } else {
        return -EFAULT;
    }

This is a trivial case, but the idea behind it should be clear. A third version of the x_user family also exists. Typically, the return value is checked and -EFAULT is returned in the case of access violation; this leads to the introduction of the last variant.

    void get_user_ret(x, addr, ret)
    void put_user_ret(x, addr, ret)

The _ret versions return the value in ret for you in case of error; they don’t return any error code back to you. This simplifies the programming of ioctls and leads to such simple code as

    get_user_ret(tmp, (long *)arg, -EFAULT);

Moving More Data

Often more data needs to be copied than just single variables, and it would be very inefficient and awkward to base the code on the primitives in the preceding section. Linux provides the functions needed to transfer larger amounts of data in one go. These functions are used in schar’s read and write functions:

    copy_to_user(void *to, void *from, unsigned long size)
    copy_from_user(void *to, void *from, unsigned long size)

They copy size bytes to and from the pointers specified. The return value is 0 in case of success and nonzero (the amount not transferred) if access is not permitted, as copy_xx_user also calls access_ok internally. An example of the usage can be found in schar.
    if (copy_to_user(buf, schar_buffer, count))
        return -EFAULT;

As with get_user, nonchecking versions also exist and are prefixed in the same manner with __.

    __copy_to_user(void *to, void *from, unsigned long size)
    __copy_from_user(void *to, void *from, unsigned long size)

Finally, _ret variants are also available that return ret in case of access violations.

    copy_to_user_ret(void *to, void *from, unsigned long size, int ret)
    copy_from_user_ret(void *to, void *from, unsigned long size, int ret)

All of the preceding examples rely on being run in the context of a process. This means that using them from interrupt handlers and timer functions, for example, is strictly prohibited. In these situations the kernel functions are not working on behalf of a specific process, and there is no way to know if current is related to you in any way. It is then far more advisable to copy data to a buffer maintained by the driver and later move the data to user space. Alternatively, as will be seen in the next section, memory mapping of device driver buffers can be implemented to solve the problem without resorting to an extra copy.

Simple Memory Mapping

Instead of copying data back and forth between user and kernel space incessantly, at times it is more advantageous to simply provide the applications a way to continuously view in-device memory. The concept is called memory mapping, and you may already have used it in applications to map entire files and read or write to them through pointers instead of using the ordinary file-oriented read or write. If not, Chapter 3 contains an explanation of what mmap is and how it is used in user space. In particular, many of the arguments are explained there, and they map directly to what we are going to do here.

It is not always safe or possible to copy data directly to user space.
The scheduler might schedule out the process in question, which would be fatal from an interrupt handler, for example. One possible solution is to maintain an internal buffer and have such functions write and read there and later copy the data to the appropriate place. That causes additional overhead because two copies of the same data have to be made, one to the internal buffer and an extra one to the application’s memory area. However, if the driver implements the mmap driver entry point, a given application can directly obtain a view into the driver buffer, and there is thus no need for a second copy.

schar_mmap is added to the file_operations structure to declare that we support this operation. Let’s look at the schar implementation:

    static int schar_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size;

        /* mmap flags - could be read and write, also */
        MSG("mmap: %s\n", vma->vm_flags & VM_WRITE ? "write" : "read");

        /* we will not accept an offset into the page */
        if (vma->vm_pgoff != 0) {
            MSG("mmap: offset must be 0\n");
            return -EINVAL;
        }

        /* schar_buffer is only one page */
        size = vma->vm_end - vma->vm_start;
        if (size != PAGE_SIZE) {
            MSG("mmap: wanted %lu, but PAGE_SIZE is %lu\n", size, PAGE_SIZE);
            return -EINVAL;
        }

        /* remap user buffer */
        if (remap_page_range(vma->vm_start, virt_to_phys(schar_buffer), size, vma->vm_page_prot))
            return -EAGAIN;

        return 0;
    }

We receive two arguments in the function: a file structure and the virtual memory area that will be associated with the mapping. As mentioned earlier, vm_start and vm_end signify the beginning and end of the mapping, and the total size wanted can be deduced from the difference between the two. schar’s buffer is only one page long, which is why mappings bigger than that are rejected. vm_pgoff would be the offset into the buffer.
In this case, it wouldn’t make much sense to allow an offset into a single page, and schar_mmap rejects the mapping if one was specified.

The final step is the most important one. remap_page_range updates the page tables starting at the vma->vm_start memory location, with size being the total length in bytes. The physical address is effectively mapped into the virtual address space.

    remap_page_range(unsigned long from, unsigned long phys_addr, unsigned long size, pgprot_t prot)

The return value is 0 in case of success and -ENOMEM if it failed. The prot argument specifies the protection associated with the area (MAP_SHARED for a shared area, MAP_PRIVATE for a private one, etc.). schar passes it directly from the one given to mmap in the application.

The page or pages being mapped must be locked so they won’t be considered for other use by the kernel. Every page present in the system has an entry in the kernel tables, and we can find which page we’re using based on the address and set the necessary attributes.

    struct page *virt_to_page(void *addr)
        Return the page for address.

schar allocates a page of memory and calls mem_map_reserve for the page returned by the virt_to_page function. The page is unlocked by mem_map_unreserve and freed in cleanup_module when the driver is unloaded. This order of operation is important, as free_page will not free a page that is reserved. The entire page structure, along with all the different flag attributes, can be found in linux/mm.h.

This was an example of how to access the kernel’s virtual memory from user space by making remap_page_range do the work for us. In many cases, however, memory mapping from drivers allows access to the buffers on peripheral devices. The next section will introduce I/O memory and, among other things, will briefly touch upon how to do just that.

I/O Memory

The last kind of address space we are going to look at is I/O memory.
This can be either ISA memory below the 1 MB boundary or high PCI memory, but we conceptually use the same access method for both. I/O memory is not memory in the ordinary sense, but rather ports or buffers mapped into that area. A peripheral may have a status port or onboard buffers that we would like to gain access to. The sample module Iomap gives a demonstration of these principles and can be used to read and write or memory-map a region of I/O memory.

Where I/O memory is mapped to depends highly on the platform in question. On the x86 platform, simple pointer dereferencing can be used to access low memory, but I/O memory is not always in the physical address space and therefore must be remapped before we can get hold of it.

    void *ioremap(unsigned long offset, unsigned long size)

ioremap maps a physical memory location to a kernel pointer of the wanted size. Iomap uses it to remap the frame buffer of a graphics adapter (the main intended use for the module) to a virtual address we can access from within the driver. An Iomap device consists of the following:

    struct Iomap {
        unsigned long base;
        unsigned long size;
        char *ptr;
    };

Here base is the starting location of the frame buffer, size is the length of the buffer, and ptr is what ioremap returns. The base address can be determined from /proc/pci, provided you have a PCI or AGP adapter; it is the prefetchable location listed there:

    $ cat /proc/pci
    PCI devices found:
      Bus 1, device 0, function 0:
        VGA compatible controller: NVidia Unknown device (rev 17).
          Vendor id=10de. Device id=29.
          Medium devsel. Fast back-to-back capable. IRQ 16. Master Capable. Latency=64. Min Gnt=5.Max Lat=1.
          Non-prefetchable 32 bit memory at 0xdf000000 [0xdf000000].
          Prefetchable 32 bit memory at 0xe2000000 [0xe2000008].

Find your graphics adapter among the different PCI devices in your system and locate the memory listed as prefetchable; as you can see, that would be 0xe2000000 on this system.
We’ll need this value when trying out Iomap a little later. Iomap can manage up to 16 different mappings, all set up through ioctl commands.

Once the region has been remapped, data can be read and written by Iomap using byte-size functions.

    unsigned char readb(void *addr)
    unsigned char writeb(unsigned char data, void *addr)

readb returns the byte read from addr, and writeb writes data to the specified location. The latter also returns what it wrote, if you need that functionality. In addition, word and long versions exist.

    unsigned short readw(void *addr)
    unsigned short writew(unsigned short data, void *addr)
    unsigned long readl(void *addr)
    unsigned long writel(unsigned long data, void *addr)

If IOMAP_BYTE_WISE is defined, this is how Iomap reads and writes data. As one would expect, these functions are not that fast when doing copies of megabyte size, since that is not their intended use. When IOMAP_BYTE_WISE is not defined, Iomap utilizes other functions to copy data back and forth.

    void *memcpy_fromio(void *to, const void *from, unsigned long size)
    void *memcpy_toio(void *to, const void *from, unsigned long size)

They work exactly like memcpy but operate on I/O memory instead. A memset version also exists that sets the entire region to a specific value.

    void *memset_io(void *addr, int value, unsigned long size)

Iomap’s read and write functions work basically just like schar’s, for example, so we are not going to list them here. Data is moved between user space and the remapped I/O memory through a kernel buffer, and the file position is incremented.

At module cleanup time, the remapped regions must be undone. The pointer returned from ioremap is passed to iounmap to delete the mapping.

    void iounmap(void *addr)

Assignment of Devices in Iomap

Iomap keeps a global array of the possible devices created, indexed by minor numbers.
This is a widely used approach to managing multiple devices and is easy to work with. The global array, iomap_dev, holds pointers to all the potentially accessed devices. In all the device entry points, the device being acted upon is extracted from the array.

    Iomap *idev = iomap_dev[MINOR(inode->i_rdev)];

In the cases where an inode is not directly passed to the function, it can be extracted from the file structure. The file structure contains a pointer to the dentry (directory entry) associated with the file, and the inode can be found in that structure.

    Iomap *idev = iomap_dev[MINOR(file->f_dentry->d_inode->i_rdev)];

I/O Memory mmap

In addition to being read and written ordinarily, Iomap supports memory mapping of the remapped I/O memory. The actual remapping of pages is very similar to schar, with the difference that because actual physical pages are not being mapped, no locking needs to be done. Remember that I/O memory is not real RAM, and thus no entries exist for it in mem_map.

    remap_page_range(vma->vm_start, idev->base, size, vma->vm_page_prot)

As with schar, remap_page_range is the heart of iomap_mmap. It does the hard work for us in setting up the page tables. The actual function doesn’t require much code.

The data returned by the read and write functions is in little endian format, whether that is the native byte ordering on the target machine or not. This is the ordering used in PCI peripherals’ configuration space, for example, and the preceding functions will byte swap the data if necessary. If data needs to be converted between the two byte orderings, Linux includes the primitives to do so. The “Portability” section later in the chapter gives us a closer look at that.
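The byte ordering issue can be illustrated without any hardware. The following is a user-space sketch, not the kernel's conversion primitives: model_le32_load, model_le32_store, and model_swab32 are names made up for this example. The load and store helpers assemble values byte by byte, so they give the same answer whatever the native ordering of the machine is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit value stored in little endian order
 * (least significant byte at the lowest address). */
static uint32_t model_le32_load(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Write a 32-bit value in little endian order. */
static void model_le32_store(unsigned char *p, uint32_t value)
{
    assert(p != NULL);
    p[0] = (unsigned char)(value & 0xFFu);
    p[1] = (unsigned char)((value >> 8) & 0xFFu);
    p[2] = (unsigned char)((value >> 16) & 0xFFu);
    p[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Reverse the byte order of a 32-bit value; this is the swap a
 * big endian machine needs when handed little endian data. */
static uint32_t model_swab32(uint32_t value)
{
    return ((value & 0x000000FFu) << 24) |
           ((value & 0x0000FF00u) << 8) |
           ((value & 0x00FF0000u) >> 8) |
           ((value & 0xFF000000u) >> 24);
}
```

For instance, the bytes 0x78, 0x56, 0x34, 0x12 in ascending address order load as 0x12345678, and swapping that value yields 0x78563412.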
[...] Documentation/sysrq.txt; here we will examine the p command. So go ahead and press Alt+SysRq+P:

    SysRq: Show Regs
    EIP: 0010:[] EFLAGS: 00003246
    EAX: 0000001f EBX: c022a000 ECX: c022a000 EDX: c0255378
    ESI: c0255300 EDI: c0106000 EBP: 000000a0 DS: 0018 ES: 0018
    CR0: 8005003b CR2: 4000b000 CR3: 00101000

This is a listing of the processor state, complete with flags and registers. EIP, the instruction pointer, shows... appropriate action.

Writing an ISR (Interrupt Service Routine) is often surrounded by mysticism, but that can only be because people have not seen how easy it really is to do in Linux. There is nothing special about it because Linux exports a very elegant and uncomplicated interface for registering interrupt handlers and (eventually) handling interrupts as they come in. An interrupt is a way for a device... handled internally by Linux is very architecture-dependent: It all depends on the interrupt controller that the platform is equipped with. If you are interested, you can find the necessary information in arch//kernel/irq.c file: arch/i386/kernel/irq.c, for example.

Interrupts that have no designated handler assigned to them are simply acknowledged and ignored by Linux. You can find... configuration information without resorting to nasty probing and guesswork. How to handle PCI devices is beyond the scope of this book. linux/pci.h is a good place to start if you want to deal with PCI, and, as always, plenty of examples exist within the Linux sources. The rest of this section will deal only with legacy devices. If the hardware allows you to retrieve the configuration directly,...
these days is not uncommon. Linux’s 2.0 kernel solved this problem by guarding the entire kernel space with a big lock, thus making sure that only one CPU at a time was spending time in the kernel. While this solution worked, it didn’t scale very well as the number of CPUs increased. During the 2.1 kernel development cycle, it became apparent that finer-grained locking was needed if Linux was to conquer machines... shared while local variables are different copies:

    pid = 909 : global = 0xc18005fc, local = 0xc08d3f2c
    pid = 910 : global = 0xc18005fc, local = 0xc098df2c

While having local variables residing in the kernel stack is a relief, it also places certain constraints on what you can fit in that space. The Linux kernel reserves approximately 7 KB of kernel stack per process, which should be sufficient for most... initiate the transfer of data in both directions.

Several items, including the request function, need to be defined in a special order at the beginning of the module. The normal order of include files applies, but the items in the following table must be defined before is included.

    #define MAJOR_NR
        The major number of the device. This is mandatory.
    #define DEVICE_NAME "radimo"
        The name of the device...

we’re going to focus on the methods used in the 2.4 series kernel. If you need a reference for an older version of the kernel, an older edition of this book can probably be found in your local library, but we strongly recommend that you upgrade your kernel instead.

    #if LINUX_VERSION_CODE < 0x20326
    /* This gets used if the kernel version is less than 2.3.36 */
    static struct file_operations radimo_fops...
interact with the device.

    # make
    # mknod /dev/radimo b 42 0
    # insmod radimo.o
    radimo: loaded
    radimo: sector size of 512, block size of 1024, total size = 2048Kb

The options printed can all be specified at load time by supplying the appropriate parameters to insmod. Browse back to the beginning of the radimo section and find them, or use modinfo to dig them out. The defaults will do fine for this session. Now that... block drivers in the kernel for you to study if you need to.

As block devices are used to host file systems, they normally support partition-based access. Linux offers generic partition support defined in the partition and gendisk structures in linux/genhd.h. The implementation can be found in gendisk.c in drivers/block, where various drivers that utilize the support are also located. Adding partition...

    ...* 1024;
    if (ioctl(fd1, IOMAP_SET, &dev1)) {
        perror("ioctl");
        return 2;
    }

    /* set up second device, offset the size of the first device */
    dev2.base = BASE + dev1.size;
    dev2.size = 512 * 1024;
    if...
