Web Server Programming, part 2 (ppt)

collection (pool) of ‘pre-forked’ processes to reduce the time delays and costs that are associated with the creation of new processes. There is a principal process (the ‘chief’) that monitors the port/socket combination where TCP/IP connection requests are received from clients. This ‘chief’ process never handles any HTTP requests from the clients; instead it distributes this work to subordinate processes (the ‘tribesmen’). Each Apache ‘tribesman’ acts as a serial server, dealing with one client at a time. When a tribesman process finishes with a client, it returns to the pool managed by the chief. As well as being responsible for the distribution of work, the chief process is also responsible for adjusting the number of child (tribesman) processes. If there are too few tribesmen, clients’ requests will be delayed; if there are too many tribesmen, system resources are ‘wasted’ (the computer may have other work it could do, and such work may be slowed if most of the main memory is allocated to Apache processes).

The Apache process group is started and stopped using scripts supplied as part of the package (the Windows version of Apache is installed with ‘start’ and ‘stop’ shortcuts in the Start menu). The first Apache process that is created becomes the chief; it reads the configuration files and forks a number of child processes. These child processes all immediately block at locks controlled by the chief. The chief process and its children share some memory (this is implementation-dependent: it may be a shared file rather than a shared memory segment). This shared memory ‘scoreboard’ structure holds data that the chief uses to monitor its tribesmen and the lock structures that the chief uses to control operations by tribesmen. When the chief has created its initial pool of tribesmen, it starts to monitor its socket for the HTTP port (usually port 80), blocking until there is input at this socket.
When a client attempts a TCP/IP connection, the socket is activated and the chief process resumes. The chief finds an idle tribesman, and changes the lock status for that tribesman, allowing it to resume execution. The chief can then check on its tribe’s state. If there are too few idle tribesmen waiting for work, the chief can fork a few more processes; if there are too many idle processes, some can be terminated.

When its lock is released, a tribesman process does an ‘accept’ on the server socket; this gives it a data socket that can be used to read data sent by a client, and to write data back to that client. The tribesman then reads the HTTP ‘Get’ or ‘Post’ request submitted by the client. The tribesman process itself handles a request for a simple static page, or for a page with dynamic content that will be produced by an internal Apache module (‘server-side includes’, PHP script etc.). If a request is for a dynamically generated page that has to be produced by a CGI program, the tribesman will have to fork a new process that will run this CGI program. The tribesman will communicate with its CGI process via a ‘pipe’ (and also via environment variables set prior to the fork operation); data relating to the request are stored in environment variables or are written to the pipe. The response from the CGI program is read from this pipe; this response must start with at least the Content-Type HTTP header information. The tribesman process adds a complete HTTP header to this response, and then writes the response on the data socket that connects back to the client.

If the client is using the HTTP/1.0 protocol, the tribesman closes its data socket immediately after writing the response; then it returns itself to the pool of idle processes (by updating the shared scoreboard structure and blocking itself at a lock controlled by the chief).
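The exchange between a tribesman and its CGI process can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Apache's own code: the child program is a made-up stand-in, the function name `run_cgi` is hypothetical, and the variable names REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH are the conventional CGI names rather than anything quoted in the text above.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for a CGI program: it reads request data from
# environment variables and writes a response that starts with a
# Content-Type header, then a blank line, then the body.
CGI_SOURCE = r'''
import os
print("Content-Type: text/plain")
print()
print("method=" + os.environ.get("REQUEST_METHOD", ""))
print("query=" + os.environ.get("QUERY_STRING", ""))
'''


def run_cgi(method, query_string, body=b""):
    """Run the stand-in CGI program as a child process; return (headers, body)."""
    env = dict(os.environ)
    # Request data passed through the environment, set before the child starts.
    env["REQUEST_METHOD"] = method
    env["QUERY_STRING"] = query_string
    env["CONTENT_LENGTH"] = str(len(body))
    # Any request body goes down the pipe to the child's standard input;
    # the child's standard output is read back through another pipe.
    result = subprocess.run(
        [sys.executable, "-c", CGI_SOURCE],
        input=body, stdout=subprocess.PIPE, env=env, check=True,
    )
    raw = result.stdout.replace(b"\r\n", b"\n")
    header_block, _, payload = raw.partition(b"\n\n")
    headers = {}
    for line in header_block.decode().split("\n"):
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if "content-type" not in headers:
        raise ValueError("CGI response must start with a Content-Type header")
    return headers, payload.decode()


if __name__ == "__main__":
    print(run_cgi("GET", "x=1"))
```

A real server would go on to add the remaining HTTP headers before writing the response to the client's data socket; the sketch stops at the point where the CGI output has been read and checked.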
If a request is made using HTTP/1.1, the tribesman will keep the connection open and do a blocking read operation on the data socket. If this attempted read operation times out, the process closes the socket and then rejoins the idle pool. If the client does submit another request via the open connection, this can be handled. The procedure can then be repeated up to a set maximum number of times.

It is fairly common for large C/C++ programs to leak memory a little. Leaks occur when temporary structures, created in the heap, are forgotten and never get deleted. The memory footprint of a process grows slowly when running a leaky program. Apache servers can contain modules from many third-party suppliers, and problems had been observed that were due to memory leaks (some operating systems have C libraries that contain leaks). Leaks can now be dealt with automatically. The tribesman processes can be configured so that they will ‘commit suicide’ after handling a specified number of client connections. The process simply removes its entries from the shared scoreboard and then exits. The chief process can create a fresh process to replace the one that terminated.

These details of process behavior are all controlled via a configuration file, httpd.conf, that must be edited by the server’s administrator. Entries in the file include the following that control the number of Apache processes:

● StartServers
  This defines the number of tribesman processes that the chief creates at start-up.
● MaxClients
  This is an upper limit on the number of processes that you are prepared to run.
● MinSpareServers, MaxSpareServers
  These values control the chief’s behavior with regard to idle tribesmen; if there are fewer than MinSpareServers, more are created; if there are more than MaxSpareServers, some are terminated.
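The way these process-count limits interact can be sketched as a small decision function. This is a simplified reading of the description above, not Apache's actual policy (the real server may, for instance, add processes gradually); the function name `pool_adjustment` and its parameter names are invented for the illustration.

```python
def pool_adjustment(total, idle, min_spare, max_spare, max_clients):
    """Return how many tribesmen to fork (positive) or terminate (negative).

    total       -- number of tribesman processes currently running
    idle        -- how many of them are waiting for work
    min_spare   -- MinSpareServers
    max_spare   -- MaxSpareServers
    max_clients -- MaxClients, the upper limit on running processes
    """
    if idle < min_spare:
        wanted = min_spare - idle
        room = max(0, max_clients - total)   # never exceed MaxClients
        return min(wanted, room)
    if idle > max_spare:
        return -(idle - max_spare)
    return 0


if __name__ == "__main__":
    # Illustrative values only, not Apache defaults.
    print(pool_adjustment(total=10, idle=2, min_spare=5, max_spare=10, max_clients=150))
```

Under this reading, MaxClients always wins: when the server is already at its limit, a shortage of idle tribesmen simply means that clients wait.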
The default values given in the supplied configuration files might suit a small web-hosting company with a multi-CPU PC; you should reduce the values before running an Apache system on an ordinary home/office PC.

A second group of parameters in the configuration file controls the behavior of the tribesman processes. These include:

● KeepAlive
  Does this server support HTTP/1.1 ‘persistent connections’ (it should)?
● MaxKeepAliveRequests, KeepAliveTimeout
  These parameters control an individual persistent connection. The client is allowed to submit up to the specified number of requests. The timeout parameter controls how long the tribesman will wait for the next request on the open connection.
● Timeout
  Another timeout is used in situations where a response is expected from a client. For example, if a user attempts to access a controlled resource, he or she will be prompted to enter a password that must be returned by the browser and checked on the server before the requested data will be sent. A user who does not respond to the prompt should eventually be disconnected.
● MaxRequestsPerChild
  This is the ‘suicide’ limit used to avoid problems from leaky code. A child process (tribesman) terminates once it has handled this number of requests.

The Windows version of Apache works slightly differently, requiring just two distinct processes. The first has roughly the same role as that just described for the ‘chief’ in a Linux/Unix system. Instead of a separate process for each tribesman, Windows Apache just has a thread in a second multi-threaded process. Some of the controls that apply in the Linux/Unix world are irrelevant; for example, the MaxRequestsPerChild control does not apply: the threads are not terminated in this way. Similarly, the MaxClients control is replaced by a limit on the number of threads.
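Returning to the persistent-connection parameters listed above, their effect on a single connection can be modelled as a short loop. The sketch is a toy model under stated assumptions: requests arrive as items in a Python list, a `None` item stands for a read that ran past KeepAliveTimeout, and the function name `serve_persistent_connection` is invented for the illustration.

```python
def serve_persistent_connection(incoming, max_keepalive_requests):
    """Return the requests handled on one connection before it is closed.

    incoming               -- requests in arrival order; None models a read
                              that timed out (KeepAliveTimeout expired)
    max_keepalive_requests -- MaxKeepAliveRequests
    """
    handled = []
    for request in incoming:
        if request is None:
            break                      # timed out: close socket, rejoin idle pool
        handled.append(request)
        if len(handled) >= max_keepalive_requests:
            break                      # per-connection request limit reached
    return handled


if __name__ == "__main__":
    print(serve_persistent_connection(["page", "logo", None, "late"], 100))
```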
3.2 Apache’s modules

The Apache server has a relatively small core that can handle HTTP ‘Get’ requests for static pages, together with modules that provide all the other services of a full HTTP-compliant server. The default configuration for an Apache server incorporates the modules for dynamic pages (CGI and SSI), for controlled access to resources, for content negotiation, and so forth. You can adjust this default configuration to meet your specific needs.

You have essentially total control over the structure of a Linux/Unix version of Apache. You use the ‘configure’ script to select the modules that you will require; this script builds the makefile that can then be used to create your Apache. The Windows version of Apache has a larger fixed part that incorporates many of the standard modules; the remaining modules are available as dlls. If another module is needed for a Windows version of Apache, you simply un-comment a commented-out LoadModule directive in the httpd.conf runtime configuration file.

Apache’s modules include:

● Core web server functionality
  – mod_cgi, mod_env
    These modules support CGI-style generation of dynamic content.
  – mod_include
    This module supports ‘server-side includes’ that allow the web server itself to create limited forms of dynamic content that are to be inserted into an HTML page prior to its return.
  – mod_log_config
    This module handles the basic logs for the server; these record all accesses to web resources and also all errors (such as requests for non-existent files that might indicate bad links in your web site).
  – mod_access
    This module allows access to selected web resources to be restricted to clients whose IP addresses satisfy specified constraints.
  – mod_auth, mod_auth_db, mod_auth_dbm
    These alternative modules all support access controls that require a client to supply a password before access is granted to specified web resources.
    They differ with respect to the storage used for the name and password collections.
  – mod_mime
    This module determines content type from file extension – so allowing the server to handle a get request for picture.gif by correctly returning a response with the HTTP header Content-Type: image/gif.
  – mod_negotiation
    This module deals with HTTP Accept request headers that specify preferred content type.
● Server administrator options
  – mod_status
    This generates an HTML page that displays information about the server; data shown include the number of processes currently handling requests, the number that have finished with their clients but which are still writing log data, and the number of idle processes.
  – mod_info
    This displays a page with details of the configuration options for the server.
  Both these displays are of interest only to the administrator of the web server and to hackers seeking to disrupt the service. (You use access controls to limit their use to the administrator!)
● Control of location of resources
  – mod_userdir
    By default, documents will be taken from the htdocs directory within the Apache system’s install directory. Sometimes you may have to allow individual users to have web pages in a subdirectory of their own home directories. This module supports such usage.
  – mod_alias
    This allows you to map pathnames, as specified in <a href= > links in web pages (and, consequently, appearing in HTTP get requests), onto different names – the names that actually represent the true file hierarchy. It allows you to conceal the location of resources or simply helps make your site more resilient to change by allowing you to move resources without breaking too many HTML links.
  – mod_rewrite
    This module applies rules for changing request URLs before the server attempts to find the file. There are various uses, but a common one relates to a mechanism for maintaining client session data.
    The URL rewriting approach to session state maintenance involves embedding a client identifier in every URL included as a link in a page returned to that client. This identifier must then be removed from the URLs used in requests that the client subsequently submits.
● More exotic modules
  – mod_imap
    This module supports server-side image-map processing. (Most web pages now rely on browsers to handle image-map interactions at the client side, so you shouldn’t really need this.)
  – mod_proxy
    This allows your Apache to act as a proxy server. Other machines on your network may not have direct access to the Internet; all their HTTP requests are instead directed to your proxy server. Your Apache can filter requests by blocking access to named sites, and forwarding other requests to the actual remote server. You can also enable caching; this may be of advantage if you expect many requests for the same resources (e.g. lots of students viewing the same material from a ‘Web resources’ list).
  – mod_speling (sic)
    Tries to guess what the client meant if there is no resource with the name appearing in the client’s request (e.g. guess home.html if the client asked for hme.html).
  – mod_so
    This module is required to support dynamic linking of optional modules.
● Add-on modules
  – mod_perl
    Perl is the most popular CGI scripting language; mod_perl improves overall system performance by incorporating a Perl interpreter into the web server, thereby avoiding the need to start and then communicate with a separate CGI process.
  – mod_php
    The interpreter for the PHP scripting language was designed to run within a web server.
  – mod_ssl
    Implements secure socket layer communications.

When choosing modules, you need to take account of issues other than functionality. The more modules that you add, the larger your Apache executable becomes.
If the executables grow too large, you risk problems from the operating system starting to swap programs between main memory and secondary storage. Any such swapping will have a major negative impact on performance. Other configuration choices trade functionality against performance or security; for example, while ‘server-side includes’ (SSI) offer an easy mechanism for adding a limited amount of dynamic content, they are also known to constitute a security risk. Poorly constructed SSI setups have permitted many hacking exploits. You have to decide whether to support SSI.

The modules that you build into your system define its capabilities, but many do not operate automatically. Most of the modules depend on control information in the runtime configuration file. For example, you might add mod_status and mod_info so that you can observe how your Apache system is operating; but your server will not display these performance data until the configuration files are changed. Similarly, you can include mod_access and mod_auth in your Apache, but this in itself will not result in any security restrictions being imposed on your website. You still have to change the runtime configuration file to include sections that identify the controlled resources (e.g. ‘all files in directory’) and the specific controls that you require (e.g. ‘client must be in our company domain’).

3.3 Access controls

There are two quite separate issues relating to access. First, if you are running on a Linux/Unix system, the normal controls on file access apply – your web servers will not be able to serve files that they do not have permission to read. Second, there are the controls that the web server can apply to restrict access by client domain, or in support of HTTP authentication.

On a Linux/Unix system, your Apache will be running with some specified Unix user-identifier; this user-id determines which files can be read.
If you launch your own Apache server, it will run with your user-id and will be able to access all your files. (Such a private server cannot use the standard port 80; by default it will use port 8080, although this port number can be changed in the configuration file.) An ‘official’ Apache web server that runs at port 80 must be launched by the system’s administrator (it requires ‘root’ privilege). Such a server will run with an effective user-id that is chosen by the system’s administrator – typically ‘www’, or ‘nobody’. If you are using such a server, you have to have permission to place your web files in the part of the file space that it uses, and you must set the privileges on your files to include global read permission. Many of the mistakes made by beginners involve incorrect Unix access permissions for their files.

The Apache server allows you to provide selective access to resources using restrictions on a client’s address, through a requirement for a password, or by a combination of both these methods. Typically, different policies are applied to resources in different directories, but you can have additional global constraints (it is for example possible to specify that clients may never access a file whose name starts with ‘.ht’ – such names are commonly used for Apache password files and some configuration files).

Controls on resources can be defined either in the main httpd.conf runtime configuration file or in .htaccess files located in the directories holding the resources (or holding the subdirectories with resources). Generally, it is best to centralize all controls in the main httpd.conf file. There are two problems with .htaccess files. First, they do add to the work that a web server must perform. If a server is asked for a resource located somewhere in the file space below a point where an .htaccess file might be defined, the server must check the directory, its parent directory, and so on back up the directory path.
If an .htaccess file is found, the server must read and apply the restrictions defined in that file. The second problem is that these .htaccess files may reduce the security of your web site. This is particularly likely to occur if you allow individual users to maintain files in their private directories and further allow them to specify their own access controls.

Basic controls (which come with mod_access) allow some restrictions based on the IP address or domain name included in the request. The controls allow you to specify that:

● A resource is generally available.
● Access to the resource is prohibited for clients with addresses that fall in a specified range of IP addresses (or a specified domain), but access is permitted from everywhere else.
● Or, more usefully, that access is prohibited except for clients whose IP addresses fall in a specified range or whose domain matches a specified domain.

Controls are defined in the httpd.conf file using Directory, DirectoryMatch or Files directives. These directives have the general form:

    <Directory resource-location>
        control options
    </Directory>

A Directory directive will have the full pathname of the directory to which the controls apply. A DirectoryMatch or Files directive can use simple regular expressions to identify the resources. The control options include several that are described later; those that relate to access are Order, Allow and Deny.

The Allow option is used to specify the IP range, or domain name, for those clients who are permitted access to a resource. The Deny option identifies those excluded. The Order option specifies how the checks are to apply. If the order is Deny,Allow then the default is that the resource is accessible; the client is checked against the Deny constraint and, if matched, will be blocked unless the client also matches the subsequent more specific Allow constraint.
If the order is Allow,Deny then the resource is by default inaccessible; if the client matches the Allow constraint, access will be permitted provided the client is not caught by a more closely targeted Deny constraint. (In the configuration file the two keywords are separated by a comma alone, with no space between them.)

The following examples illustrate constraints applied to the contents of directories (and their subdirectories). The examples assume that your Apache is installed in /local/apache. The first example defines a restricted subdirectory that is only to be accessed by students and others who are logged into the domain bigcampus.edu:

    <Directory "/local/apache/htdocs/onCampus">
        Order deny,allow
        Deny from all
        Allow from .bigcampus.edu
    </Directory>

When checking such a constraint, Apache will do a reverse lookup on the IP address of the client to obtain its domain and then check whether this ends with .bigcampus.edu. A second rather similar example would be appropriate if you had a resource that was for some reason not to be available to clients in France:

    <Directory /local/apache/htdocs/notForTheFrench>
        Order allow,deny
        Allow from all
        Deny from .fr
    </Directory>

(Such a constraint is not that far-fetched! French courts are trying to enforce French commercial laws on e-commerce transactions made by those residing in France; you might not want to bother with the need to employ French legal representation.)

The standard httpd.conf file contains an example of a Files directive:

    <Files ~ "^\.ht">
        Order allow,deny
        Deny from all
    </Files>

This has a regular expression match that defines all files that start with the sequence .ht; access to these files is globally prohibited.

Incorporation of any of the mod_auth, mod_auth_db or mod_auth_dbm modules into your Apache allows you to utilize HTTP authentication. These modules differ only with respect to how name and password data are stored. The _db and _dbm modules use various versions of the db/dbm simple database package that is available for Linux/Unix.
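Before going further into authentication, the Order, Allow and Deny rules described earlier in this section can be checked against the three examples with a short Python model. It is a sketch of the rules as stated in the text, not of Apache's implementation: the matcher handles only ‘all’, domain suffixes such as .fr and leading address parts such as 130.130, the client's host name is assumed to be already known (the reverse lookup is left out), and the function names are invented.

```python
def matches(client, pattern):
    """Hypothetical matcher: 'all', a domain suffix ('.fr'), or a host/IP prefix."""
    if pattern == "all":
        return True
    if pattern.startswith("."):
        return client.endswith(pattern)
    return client == pattern or client.startswith(pattern + ".")


def access_permitted(order, allow, deny, client):
    """Apply the Order rule to one client; allow and deny are lists of patterns."""
    allowed = any(matches(client, p) for p in allow)
    denied = any(matches(client, p) for p in deny)
    if order == "deny,allow":
        # Accessible by default; a Deny match blocks unless an Allow also matches.
        return allowed or not denied
    if order == "allow,deny":
        # Inaccessible by default; an Allow match admits unless a Deny also matches.
        return allowed and not denied
    raise ValueError("unknown order: " + order)


if __name__ == "__main__":
    # The onCampus example: Order deny,allow / Deny from all / Allow from .bigcampus.edu
    print(access_permitted("deny,allow", [".bigcampus.edu"], ["all"], "lab3.bigcampus.edu"))
    print(access_permitted("deny,allow", [".bigcampus.edu"], ["all"], "www.example.com"))
```

The host names used in the demonstration are made up. The same function reproduces the ‘.ht’ rule: with order allow,deny, an empty Allow list and Deny from all, no client is ever admitted.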
The basic mod_auth module works with text files defining your users and their passwords, and also any user-groups that you wish to have. (The passwords in the password file are held in encrypted form.) The text files are simpler; but if you are likely to have hundreds of users, you should use one of the db packages to avoid performance problems with large text files.

Authentication-based restrictions are typically applied to a directory (and its subdirectories) and are again defined using a Directory directive in the httpd.conf file. The first time that a client attempts to access a resource in a controlled directory, Apache will respond with an HTTP 401 ‘authorization required’ challenge. This challenge will contain a name (the ‘realm’ name) that the server administrator has chosen for the collection of resources. The client’s browser will handle the challenge by displaying a simple dialog informing the user that a name and password must be provided to access resources in the named ‘realm’. Apache keeps the connection open until the client’s identification data are returned and can be checked. If the name and password are validated, Apache returns the resource. The client’s browser keeps a record of the name, password, realm triple and will automatically handle any subsequent challenges related to other resources in the same realm.

Normally, the password is sent encoded as base 64; this is not a cryptographic encoding – it is really just a letter substitution scheme that avoids possible problems from special characters in a password. In principle, a more secure scheme based on the MD5 hashing algorithm can be used to secure passwords; in practice, most browsers do not support this feature (Internet Explorer 5 and above can handle more demanding security controls).
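A short Python sketch makes the point that base 64 offers no secrecy: anyone who sees the encoded value can reverse it. The sketch assumes the usual form of Basic authentication, in which the browser joins name and password with a colon and sends the encoded result in an Authorization request header; the function names and the sample credentials are invented for the illustration.

```python
import base64


def basic_credentials(name, password):
    """Build the value a browser would send for Basic authentication."""
    token = base64.b64encode(f"{name}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic(header_value):
    """Recover (name, password) from a Basic authorization value."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authorization value")
    name, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return name, password


if __name__ == "__main__":
    value = basic_credentials("carol", "secret")   # made-up credentials
    print(value)
    print(decode_basic(value))
```

Because decoding needs no key, Basic authentication protects a password only as well as the connection that carries it.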
The actual control on a resource may:

● simply require that the user has supplied a valid name-password combination;
● list the names of those users who are permitted access to the resource;
● specify the name of a user-group, as defined in a ‘groups’ file, whereby all members of the group are permitted to access the resource.

The web server administrator must allocate usernames and passwords and create the files (or db/dbm entries) for the users and groups. There is a utility program, /local/apache/bin/htpasswd, that can be used to create an initial password file or add a user to the password file:

    #Create the password file in current directory
    htpasswd -c .htpasswds firstuser
    #add another user
    htpasswd .htpasswds anotheruser

The htpasswd program prompts for the password that is to be allocated to the user. Group files are simple text files; each line in the file defines a group and its members:

    BridgePlayers: anne david carol phillip peter jon james

The password files should be placed in a directory in the main Apache installation directory. An example of a Directory directive specifying an authorization control is:

    <Directory /local/apache/htdocs/notices>
        AuthName "Private Departmental Notices"
        AuthType Basic
        AuthUserFile /local/apache/pwrds/.htpasswds
        AuthGroupFile /local/apache/pwrds/.htgroups
        Require valid-user
    </Directory>

The AuthName option specifies the name of the realm; the AuthType option will specify ‘Basic’ (if you are targeting browsers that support the feature, you can specify MD5 encryption of the passwords sent by clients). The AuthUserFile and AuthGroupFile options identify the locations of the associated password and group files. The Require valid-user control accepts any user who enters their password correctly. Alternative controls would be Require user carol phillip (list the names of the users who are allowed access to the resource) or Require group BridgePlayers (allow access by all members in the BridgePlayers group).
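The three forms of the Require control can be modelled in Python as follows. This is an illustrative sketch rather than Apache's code: it assumes the user's password has already been validated, treats the group file as described above (one group per line, a colon, then the members), and uses invented function names.

```python
def parse_groups(text):
    """Parse group-file text into a dict mapping group name to a set of members."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = set(members.split())
    return groups


def require_satisfied(require, user, groups):
    """Check one Require control for a user whose password was already validated."""
    kind, *names = require.split()
    if kind == "valid-user":
        return True
    if kind == "user":
        return user in names
    if kind == "group":
        return any(user in groups.get(group, set()) for group in names)
    raise ValueError("unsupported Require control: " + require)


if __name__ == "__main__":
    groups = parse_groups("BridgePlayers: anne david carol phillip peter jon james")
    print(require_satisfied("group BridgePlayers", "jon", groups))
    print(require_satisfied("user carol phillip", "anne", groups))
```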
Authorization and IP/domain restrictions can be combined:

    <Directory /local/apache/htdocs/DevelopMent/hotstuff>
        Order deny,allow
        Deny from all
        Allow from 130.130
        AuthName ...
        Require group staff
        Satisfy all
    </Directory>

This example requires that users be at hosts on the 130.130 network, and that they have established themselves, by entering a name and password, as being a member of the staff group. You could use a constraint Satisfy any; this would require that either the users were working from the specified domain, or that they had entered a name and password for a staff member.

3.4 Logs

Apache expects to maintain logs recording its work. In its standard configuration, Apache records all access attempts by clients and all server-side errors (subject to a minimum severity cutoff that is set by a control parameter). There is further provision for creation of custom logs. For example, you can arrange to log data identifying the browsers used (so, if you really want to know, you can find the proportions of your clients who use Opera, Netscape, IE or another browser). You should plan how to use the data from these logs or turn off as much as possible of the logging. The error logs naturally help you find problems with your site; an analysis of the data in the access logs may help you better organize and market your site.

The logs grow rapidly in size. You should never delete a large log file in the hope that Apache will start a fresh one. Apache keeps track of the file size and will continue to try to write at what it thinks should be the current end of file. There is a little helper program in the /bin directory that allows you to “rotate” log files; existing log files are renamed, and Apache is told to continue writing at the beginnings of the new log files.
An example fragment of an access log is as follows (each entry occupies a single line):

    130.130.189.103 - - [28/May/2001:14:37:17 +1000] "GET /~yz13/links.htm HTTP/1.0" 200 1011
    208.219.77.29 - - [28/May/2001:14:37:26 +1000] "GET /robots.txt HTTP/1.1" 404 216
    130.130.189.103 - - [28/May/2001:14:38:18 +1000] "GET /~yz13/image/tb.gif HTTP/1.0" 200 94496
    130.130.64.188 - yag [25/May/2001:11:39:49 +1000] "GET /controlled/printenv.pl HTTP/1.1" 401 486

An access record has: [...]

... examples of records from a server’s error log are:

    [Thu May 24 13:27:55 2001] [error] [client 202.129.93.44] File does not exist: /packages/csci213-www/documents/ma61
    [Thu May 24 14:00:30 2001] [error] [client 130.130.66.60] (13) Permission denied: file permissions deny server access: /packages/csci213-www/documents/cgi-bin/sp15/ass4/myApplet2.html
    [Fri May 25 10:31:34 2001] [error] [client 130.130.64.33] ...

... appear in generated web indices. This record indicated that someone at 208.219.77.29 was running a spider to map the resources on this web server (which was intriguing because this log came from a temporary server running at the non-standard port 2000). A reverse IP lookup identified the source as being someone at marvin.northernlight.com. It was not a well-behaved web spider; the rules for web spiders require ...

... next request is more interesting:

    208.219.77.29 - - [28/May/2001:14:37:26 +1000] "GET /robots.txt HTTP/1.1" 404 216

The request was for a robots.txt file in the root htdocs directory of this server; there was no such file, hence the 404 failure code. A robots.txt file is conventionally used to provide information to web spiders – programs that map all resources at a web site, maybe to build indices like ...
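Access log entries in the form shown in the example fragment can be pulled apart with a regular expression, which is the usual first step in a log-analysis program. The Python sketch below is an assumption-laden illustration: the field names (host, ident, user and so on) are labels chosen here, since the text's own description of the record fields is not reproduced above, and the pattern covers only the layout seen in the sample entries.

```python
import re
from datetime import datetime

# Hypothetical field names for the layout seen in the sample entries:
# host, identity, user, [timestamp], "request line", status, size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Split one access log entry into a dict of typed fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognized access log entry")
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    sample = ('130.130.64.188 - yag [25/May/2001:11:39:49 +1000] '
              '"GET /controlled/printenv.pl HTTP/1.1" 401 486')
    print(parse_access_line(sample))
```

Note that the month abbreviation is parsed with the current locale, so the sketch assumes an English or C locale.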
your web server down. It is more sensible to identify the client machines in a program that analyzes the logs. You can make your web server attempt to identify each client. There is a server called identd that can be run on a Unix machine. The identd server on a host machine can be asked for the user identifier of a process that is associated with a given TCP/IP port on the same machine. Your web server ...

... in the range 128–191 (decimal values); the second byte could have any value from 0–255. (Class B addresses can be recognized by the first two address bits being 10-binary.) There were something like sixteen thousand such network addresses available. Amongst those allocated in the early days of the Internet were 128.6 which went to Rutgers University, 128.29 for Mitre corporation, 128.232 for the University ...

... address and the bit pattern 255.255.0.0:

    10001000.10101010.11110001.01010101  (Address)       136.170.241.85
    11111111.11111111.00000000.00000000  (Network mask)  255.255.000.000
    -----------------------------------
    10001000.10101010.00000000.00000000  (Network)       136.170.0.0

    10001000.10101010.11110001.01010101  (Address)       136.170.241.85
    00000000.00000000.11111111.11111111  (Machine mask)  0.0.255.255
    ...

... does not have a value specified for the ServerName parameter; you may need to define something like ServerName localhost (or maybe ServerName 127.0.0.1). (If nothing is defined, Apache will try to find a DNS server that can tell it the correct server name based on your machine’s IP address and the DNS records; this attempt will fail if you are not linked to a DNS server, so Apache won’t start.) After ...

... sequence of four decimal numbers, each in the range 0–255, with each number representing one byte of the address. This led to the now familiar ‘dotted decimal’ form for IP addresses – e.g. 207.68.172.253 (this is one of Microsoft’s computers). The class A addresses used the first 8 bits of the 32-bit address to identify a network, and the remaining 24 bits of the address were for a machine identifier ...

... experiment with server-side includes. The next section of the file will include a Location directive:

    #<Location /server-status>
    #    SetHandler server-status
    #    Order deny,allow
    #    Deny from all
    #    Allow from your_domain.com
    #</Location>

(There is a similar commented-out server-info section.) These relate to support for the server monitoring facilities that might be needed by a webmaster. When ...

... The class C addresses used 24 bits for the network identifier and only 8 bits for the computer identifier. Class C addresses have a first byte with a value in the range 192–223 (the first three bits are 110-binary). While there were a couple of million such network addresses possible, each of these networks could have at most 254 machines (the machine addresses 0 and 255 are reserved for things like ...
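The masking arithmetic in the fragments above (an address combined with the bit pattern 255.255.0.0) can be reproduced with a few lines of Python. The sketch uses the standard `ipaddress` module only to convert between dotted decimal and integers; the function name `split_address` is invented, and the machine part is computed by ANDing with the inverted mask, which is an assumption about how the elided part of the worked example continues.

```python
import ipaddress


def split_address(address, netmask):
    """Return (network part, machine part) of an IPv4 address as dotted decimal."""
    addr = int(ipaddress.IPv4Address(address))
    mask = int(ipaddress.IPv4Address(netmask))
    network = addr & mask                    # bitwise AND with the network mask
    machine = addr & ~mask & 0xFFFFFFFF      # AND with the inverted (machine) mask
    return str(ipaddress.IPv4Address(network)), str(ipaddress.IPv4Address(machine))


if __name__ == "__main__":
    print(split_address("136.170.241.85", "255.255.0.0"))
```

For the class B address used in the text, the network part comes out as 136.170.0.0, matching the table.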
