CHAPTER 31

Process-Management Monitor

System process monitors can be a vital tool in determining the health of a running machine. Ensuring that the required processes are running and that the total number of each type of running process is appropriate is a good way to maintain system stability. The downside of these types of monitors is that they let you know only which processes are running and how many there are. They don’t give you an indication of the health of each individual process.

This script dives a little deeper into the condition of processes. By using the ps command with a customized format, we’ll be able to monitor the age, proportion of CPU usage, virtual-memory consumption, and amount of CPU time consumed by a particular process. If you are monitoring multiple instances of any given process, each instance will be held up to the standard being monitored.

One other feature of this process monitor is that it can be configured not only to warn you of impending peril from processes whose operational values are out of bounds, but also to take action in the form of killing the aberrant process when necessary. The monitor could be modified easily to perform other actions besides killing a process.

Using historical data, you can sometimes predict when a specific application will start to consume too many resources. It was one such application I was working with that prompted me to write this monitor. The monitor helped in characterizing exactly when the application ran out of control and in finding the cause of the behavior. Both were very helpful in fixing the problem.

The syntax for monitor configuration is fairly straightforward, with five colon-separated fields as shown in the following example. The fields are as follows: the process command, the indicator to track, a lower threshold, an upper threshold, and the kill option. You can configure multiple processes by including several records in the configuration string.

    kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"

The first field is the process command itself. This will be slightly different from, and hopefully simpler than, the traditional ps -ef output. The ps -ef default output (-e for all processes, -f for a full-format listing) includes the commands that are running, as well as any arguments they were passed. The ps -eo comm output is formatted to include only the commands that are running on a system, without any path or argument information.

With this switch combination (-eo) you can also format your output in many ways to show many other options, such as memory size, process age, process CPU time, and so on. (On some UNIX systems, you may need to define the UNIX95 variable within the script for the ps -eo command to function properly. The UNIX95 variable can be set to anything you’d like; it just needs to be defined.) When specifying the process for our script to monitor, you’ll want to use only the command name, as this is what the script will be looking for.

The second field contains the indicator you want to track. The options are cputime, which measures the number of minutes of CPU time the process has consumed; etime, which is the elapsed time in minutes since the process began running; pcpu, which represents the current percentage of the CPU capacity the process is consuming; and vsize, which shows the virtual-memory size in kilobytes for the process.

The third and fourth fields contain the desired lower and upper thresholds for the indicator you’re tracking.

The fifth and final field is the kill option. It is a value from 0 to 3:

0: Send a notification when either the low warning or high error threshold has been crossed, but don’t kill the process.
1: Send a warning notification when the low threshold has been crossed or an error notification when the high threshold has been crossed, and kill the process.
2: Send only a low-level warning notification when either the low or high threshold has been crossed, and kill the process.
3: Kill the process without any notification at all.

Note that for safety, if the kill option is not set or is set to anything but one of the values outlined here, processes will not be killed.

Notice that there are two levels of notification. I have used alphanumeric paging for the high level (error status) and e-mail for the low level (warning status). You may want to implement the notification method as appropriate for your needs.

The first section of the script sets up a few configuration variables, which alternatively could be stored in a separate configuration file and sourced each time the script runs through the loop. This would allow for live configuration changes to the script. The debug value is for testing, and the sleeptime value represents the number of seconds to delay between each run. The kill_plist variable is the main configuration value that lets the script know what processes and values it should be watching.

    #!/bin/sh
    debug=1
    sleeptime=3
    kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"

The following function performs all notifications and process terminations in the script. It is called with seven sequentially numbered parameters. The positional variables are somewhat difficult to understand and their values could have been assigned to more meaningfully named variables before they were used, for ease of debugging later. To streamline the script a little, I didn’t do this. Note that processes are killed only when debug is set to 0.

    notify () {
      case $2 in
        0)
          # Warn/error level and don't kill
          echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
        ;;
        1)
          # Warn/error level and kill
          echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
          test $debug -eq 0 && kill $4
        ;;
        2)
          # Warning level only, and kill
          echo "Warning: $3 process id $4 found with $5 $7. Should be less than $6."
          test $debug -eq 0 && kill $4
        ;;
        3)
          # Just kill, don't warn at all
          test $debug -eq 0 && kill $4
        ;;
        *)
          echo "Warning: killoption not set correctly, please validate configuration."
        ;;
      esac
    }

Here, for ease of reference, I define all of the arguments passed to this function:

$1: The text that starts the notification string; it distinguishes a warning from an error
$2: The kill option, which has a possible value of 0-3
$3: The process name that is being monitored
$4: The process ID of the process being monitored
$5: The current value of the indicator you are tracking
$6: The monitor’s lower threshold
$7: The text equivalent of the indicator you are tracking

This is also a good example of how a function can reduce the length and complexity of a script. The body of this function is code that would have to be repeated eight times throughout the script if it were not placed in a function. An older version of this script was written this way. Putting the code into a function reduced the script’s length by roughly 40 percent.

The following code is the beginning of the main loop. The script is intended to be run at system startup; it will then run continuously through an infinite loop. After each iteration completes, the script will sleep for a predetermined time before the next iteration. The first part here is a nested loop that progresses through each record in the configuration string to parse its fields and set up the monitor.

    while :
    do
      for pline in $kill_plist
      do
        process=`echo $pline | cut -d: -f1`
        process="`echo $process | sed -e \"s/%20/ /g\"`"
        type=`echo $pline | cut -d: -f2`
        value=`echo $pline | awk -F: '{print $3}'`
        errval=`echo $pline | awk -F: '{print $4}'`
        killoption=`echo $pline | awk -F: '{print $5}'`

The process variable is assigned the first field in the configuration record (pline).
It is possible that the process command name you’re monitoring will consist of more than one word, separated by spaces. Because the records in the configuration string are themselves separated by spaces, such spaces must be written in the configuration as %20, which is a commonly used substitute for the space character (as in URL encoding, for example). The sed command converts each %20 back to a space.

The type variable is the second field in the configuration record. As mentioned, it specifies the performance indicator to watch: cputime (amount of CPU time consumed), etime (elapsed time or age of process), pcpu (current percentage of the CPU consumed), or vsize (virtual-memory size).

The value variable holds the lower warning threshold for the monitored value, taken from the third field. The errval variable is assigned the value of the upper error threshold for the monitored value, taken from the fourth field.

The killoption variable is assigned the final field of the configuration record and specifies an action to perform when the process deviates from the normal range.

If the kill option was not specified, we set it to 0, the default. This makes sure no processes are killed unless one of the options for doing so is explicitly used.

        if [ "$killoption" = "" ]
        then
          killoption=0
        fi
        test $debug -gt 0 && echo "Kill $process processes if $type is greater than $errval"

Next we pare down the full list of processes running on the system to the ones running the command being monitored. Then we start a loop that iterates through the remaining processes.

        for pid in `ps -eo pid,comm | egrep "${process}$|${process}:$" | grep -v grep | awk '{print $1}'`
        do

For each process ID, the script has to gather the pertinent information. The embedded ps command gathers only the specific information we want.

          test $debug -gt 0 && echo "$process pid $pid"
          pid_string=`ps -eo pid,cputime,etime,pcpu,vsize,comm | \
            grep $pid | egrep "${process}$|${process}:$" | grep -v grep`

The following case statement is the heart of the monitor.
The script tests for the monitor type (cputime, etime, pcpu, or vsize); cputime is the first monitor type listed. The code for each type is slightly different, but all are very similar. Here we obtain the process time from the ps output, as well as the number of colon-separated fields that the proc_time variable contains.

          case $type in
            "cputime")
              proc_time=`echo $pid_string | awk '{print $2}'`
              fields=`echo $proc_time | awk -F: '{print NF}'`
              proc_time_min=`echo $proc_time | awk -F: '{print $(NF-1)}'`

Both of these are needed because the format of the time value varies depending on the amount of time it represents. The cputime and etime values have the form days-hours:minutes:seconds, hours:minutes:seconds, or minutes:seconds. A low value might look something like 00:28 for 28 seconds. A high value could be 1-18:32:29 for 1 day, 18 hours, 32 minutes, and 29 seconds. Every form has to be processed and converted to minutes. (Seconds are dropped for simplicity.) This makes the logic for handling the cputime and etime values the most complex of the four performance indicators.

              if [ $fields -lt 3 ]
              then
                proc_time_hr=0
                proc_time_day=0
              else
                proc_time_hr=`echo $proc_time | awk -F: '{print $(NF-2)}'`
                fields=`echo $proc_time_hr | awk -F- '{print NF}'`
                if [ $fields -ne 1 ]
                then
                  proc_time_day=`echo $proc_time_hr | awk -F- '{print $1}'`
                  proc_time_hr=`echo $proc_time_hr | awk -F- '{print $2}'`
                else
                  proc_time_day=0
                fi
              fi

Once all time values have been determined, we convert them to minutes for comparison with the monitor thresholds.

              curr_cpu_time=\
              `echo "$proc_time_day*1440+$proc_time_hr*60+$proc_time_min" | bc`
              test $debug -gt 0 && echo "Current cpu time for $process pid $pid is $curr_cpu_time minutes"

If the current cputime value is between the warning and error thresholds, we call the notify() function with the appropriate switches.
It will handle output and process termination, as described earlier.

              if test $curr_cpu_time -gt $value -a \
                $curr_cpu_time -lt $errval
              then
                notify "Warning" $killoption "$process" $pid \
                  $curr_cpu_time $value "minutes of CPU time"

If the current cputime has reached or exceeded the error threshold, we call the notify() function with a different set of options.

              elif test $curr_cpu_time -ge $errval
              then
                notify "Error" $killoption "$process" $pid \
                  $curr_cpu_time $value "minutes of CPU time"

The final condition handles the case where there is no issue with the running process: when debugging is enabled, the script just issues a message saying so.

              else
                test $debug -gt 0 && echo "process cpu time ok"
              fi
            ;;

The etime monitor is nearly the same as the cputime monitor. The primary difference is the field that is extracted from the ps output to get the current process age.

            "etime")
              proc_age=`echo $pid_string | awk '{print $3}'`
              fields=`echo $proc_age | awk -F: '{print NF}'`
              proc_age_min=`echo $proc_age | awk -F: '{print $(NF-1)}'`

Once again, we break the age of the process into the values that will then be used to calculate the age in minutes.

              if [ $fields -lt 3 ]
              then
                proc_age_hr=0
                proc_age_day=0
              else
                proc_age_hr=`echo $proc_age | awk -F: '{print $(NF-2)}'`
                fields=`echo $proc_age_hr | awk -F- '{print NF}'`
                if [ $fields -ne 1 ]
                then
                  proc_age_day=`echo $proc_age_hr | awk -F- '{print $1}'`
                  proc_age_hr=`echo $proc_age_hr | awk -F- '{print $2}'`
                else
                  proc_age_day=0
                fi
              fi

Expressing the process age in minutes makes the threshold check very simple.

              curr_age=\
              `echo "$proc_age_day*1440+$proc_age_hr*60+$proc_age_min" | bc`
              test $debug -gt 0 && echo "Current age of $process pid $pid is $curr_age minutes"

We now perform the comparison checks against the monitor thresholds as before. The first check determines if the current process age is between the low and high thresholds. The second sees if the current age is at or above the high threshold.
In both these cases, we call the notify() function for end-user output and process termination. The final possibility is that there is no issue, and in this case the script gives a message stating that the process is OK.

              if test $curr_age -gt $value -a $curr_age -lt $errval
              then
                notify "Warning" $killoption "$process" $pid \
                  $curr_age $value "minutes of elapsed time"
              elif test $curr_age -ge $errval
              then
                notify "Error" $killoption "$process" $pid \
                  $curr_age $value "minutes of elapsed time"
              else
                test $debug -gt 0 && echo "process age ok"
              fi
            ;;

The test for percentage CPU usage is quite simple. The value to be compared to the thresholds is obtained directly from the ps output. Apart from dropping the fractional part, there is no need for further calculation as was needed in the code for the cputime and etime monitors.

            "pcpu")
              curr_proc_cpu=`echo $pid_string | awk '{print $4}' | \
                awk -F. '{print $1}'`
              test $debug -gt 0 && echo "Current percent cpu of $process pid $pid is $curr_proc_cpu"

Once again, we compare the percentage CPU value with the configured low and high thresholds and call the notify() function to alert the user and perform any required process termination. If the CPU percentage has crossed neither threshold, the code outputs an “OK” message.

              if test $curr_proc_cpu -gt $value -a \
                $curr_proc_cpu -lt $errval
              then
                notify "Warning" $killoption "$process" $pid \
                  $curr_proc_cpu $value "percent of the CPU"
              elif test $curr_proc_cpu -ge $errval
              then
                notify "Error" $killoption "$process" $pid \
                  $curr_proc_cpu $value "percent of the CPU"
              else
                test $debug -gt 0 && echo "process cpu percent ok"
              fi
            ;;

The vsize monitor is as simple as the percent-CPU monitor. We obtain the current process’s memory footprint directly from the ps output.

            "vsize")
              curr_proc_size=`echo $pid_string | awk '{print $5}'`
              test $debug -gt 0 && echo "Current size of $process pid $pid is $curr_proc_size"

We have to check the current memory size against the monitor thresholds one last time.
If it falls in the warning or error range, we call the notify() function for output and termination. If not, the code outputs that the process size is OK.

              if test $curr_proc_size -gt $value -a \
                $curr_proc_size -lt $errval
              then
                notify "Warning" $killoption "$process" $pid \
                  $curr_proc_size $value "kilobytes of virtual size"
              elif test $curr_proc_size -ge $errval
              then
                notify "Error" $killoption "$process" $pid \
                  $curr_proc_size $value "kilobytes of virtual size"
              else
                test $debug -gt 0 && echo "process virtual size ok"
              fi
            ;;

Finally we close the monitor case statement and the two inner processing loops. The script then goes to sleep for the configured amount of time before starting over again. It will then continue its monitoring until the monitor itself dies or is killed or the system is shut down.

          esac
        done
      done
      sleep $sleeptime
    done
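
The cputime and etime branches repeat the same time-to-minutes conversion, and all four branches repeat the same three-way threshold comparison. As a sketch only, and not part of the original script, both pieces could be factored into small helper functions. The names to_minutes and classify are hypothetical, and the sketch assumes the ps time formats described in this chapter (days-hours:minutes:seconds, hours:minutes:seconds, or minutes:seconds); other ps implementations may report times differently.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; these names do not appear in
# the chapter's script.

# to_minutes: convert a ps time string of the form [[dd-]hh:]mm:ss to
# whole minutes. Seconds are dropped, as in the chapter's script.
to_minutes () {
  echo "$1" | awk '{
    days = 0
    t = $1
    # Split off the optional leading "days-" part.
    if (index(t, "-") > 0) {
      split(t, d, "-")
      days = d[1]
      t = d[2]
    }
    # The remaining part is hh:mm:ss or mm:ss.
    n = split(t, f, ":")
    hours = (n >= 3) ? f[n-2] : 0
    mins  = (n >= 2) ? f[n-1] : 0
    print days * 1440 + hours * 60 + mins
  }'
}

# classify: compare a current value ($1) with the low ($2) and high ($3)
# thresholds and print "ok", "warning", or "error", using the same
# boundaries as the monitor's test expressions.
classify () {
  if [ "$1" -ge "$3" ]; then
    echo error
  elif [ "$1" -gt "$2" ]; then
    echo warning
  else
    echo ok
  fi
}

to_minutes 1-18:32:29   # prints 2552 (1*1440 + 18*60 + 32)
to_minutes 00:28        # prints 0 (28 seconds is less than a minute)
classify 20 15 30       # prints warning
```

With helpers like these, each branch of the case statement would reduce to extracting one field from pid_string, classifying it, and calling notify() when the result is not "ok". Whether that trade is worthwhile depends on how portable the awk expressions need to be across the systems being monitored.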