
Chapter 26. System Performance and Profiling

26.1 Timing Is Everything

Whether you are a system administrator or user, the responsiveness of your Unix system is going to be the primary criterion for evaluating your machine. Of course, "responsiveness" is a loaded word. What about your system is responsive? Responsive to whom? How fast does the system need to be to be responsive? There is no one silver bullet that will slay all system latencies, but there are tools that isolate performance bottlenecks, the most important of which you carry on your shoulders.

This chapter deals with issues that affect system performance generally and how you go about finding and attenuating system bottlenecks. Of course, this chapter cannot be a comprehensive guide to how to maximize your system for your needs, since that is far too dependent on the flavors of Unix and the machines on which they run. However, there are principles and programs that are widely available that will help you assess how much more performance you can expect from your hardware.

One of the fundamental illusions in a multiuser, multiprocessing operating system like Unix is that every user and every process is made to think that they are alone on the machine. This is by design. At the kernel level, a program called the scheduler attempts to juggle the needs of each user, providing overall decent performance of:

System performance degrades when one of these goals overwhelms the others. These problems are very intuitive: if there are five times the normal number of users logged into your system, chances are that your session will be less responsive than at less busy times.

Performance tuning is a multifaceted problem. At its most basic, performance issues can be looked at as being either global or local problems. Global problems affect the system as a whole and can generally be fixed only by the system administrator. These problems include insufficient RAM or hard drive space, inadequately powerful CPUs, and scanty network bandwidth. The global problems are really the result of a host of local issues, which all involve how each process on the system consumes resources. Often, it is up to the users to fix the bottlenecks in their own processes.

Global problems are diagnosed with tools that report system-wide statistics. For instance, when a system appears sluggish, most administrators run uptime (Section 26.4) to see how many processes were recently trying to run. If these numbers are significantly higher than normal usage, something is amiss (perhaps your web server has been slashdotted).

If uptime suggests increased activity, the next tool to use is either ps or top to see if you can find the set of processes causing the trouble. Because it shows you "live" numbers, top can be particularly useful in this situation. I also recommend checking the amount of available free disk space with df, since a full filesystem is often an unhappy one, and its misery spreads quickly.
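
As a rough sketch of that first pass, the two helpers below rank processes by CPU share and flag nearly full filesystems. The function names (rank_by_cpu, nearly_full) and the 90 percent threshold are assumptions made for illustration. The ps and df options in the closing comments are the POSIX ones, but column layouts can vary between Unix flavors, so treat this as a starting point rather than a finished tool.

```shell
# Keep the N busiest lines from "pcpu pid command" input (default N is 5).
# LC_ALL=C makes sort read "12.5" as a number regardless of locale.
rank_by_cpu() {
    LC_ALL=C sort -k1,1nr | head -n "${1:-5}"
}

# Print "mountpoint use%" for filesystems at or above LIMIT percent full
# (default 90). Expects "df -P" output on standard input; mount points
# that contain spaces are not handled.
nearly_full() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit) print $6, $5
        }'
}

# Typical use:
#   ps -A -o pcpu= -o pid= -o comm= | rank_by_cpu 5
#   df -P | nearly_full 90
```

Both helpers read standard input, so you can save a snapshot of ps or df output while the system is misbehaving and analyze it later.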

Once particular processes have been isolated as being problematic, it's time to think locally. Process performance suffers either when the process cannot get enough CPU time to finish its task (this is known as a CPU-bound process) or when the process is waiting for some I/O resource, such as the hard drive or network (an I/O-bound process). One strategy for dealing with CPU-bound processes, if you have the source code for them, is to use a profiler like GNU's gprof. Profilers give an accounting of how much CPU time is spent in each subroutine of a given program. For instance, if I want to profile one of my programs, I'd first compile it with gcc and use the -pg compilation flag. Then I'd run the program. This creates the gmon.out data file that gprof can read. Now I can use gprof to give me a report with the following invocation:

$ gprof -b executable gmon.out

Here's an abbreviated version of the output:

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00        2     0.00     0.00  die_if_fault_occurred
  0.00      0.00     0.00        1     0.00     0.00  get_double
  0.00      0.00     0.00        1     0.00     0.00  print_values

Here, we see that three subroutines defined in this program (die_if_fault_occurred, get_double, and print_values) were called. In fact, the first subroutine was called twice. Because this program is neither processor- nor I/O-intensive, no significant time is shown to indicate how long each subroutine took to run. If one subroutine took a significantly longer time to run than the others, or one subroutine is called significantly more often than the others, you might want to see how you can make that problem subroutine faster. This is just the tip of the profiling iceberg. Consult your language's profiler documentation for more details.
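
Put together, the whole compile, run, and report cycle might look like the sketch below. The helper name profile_run and the file names in the usage comment are hypothetical. The script assumes gcc and gprof are installed and that the program exits normally with a zero status, since gmon.out is written only at a normal exit.

```shell
# profile_run SOURCE.c [ARGS...]
# Compile SOURCE.c with profiling enabled, run it once, and print a
# brief gprof report. Returns non-zero if any step fails.
profile_run() {
    src=$1
    shift
    exe=./${src##*/}        # executable named after the source file...
    exe=${exe%.c}.prof      # ...with .c replaced by .prof

    # -pg adds the instrumentation that records call counts and samples.
    gcc -pg -o "$exe" "$src" || return 1

    # Running the program writes gmon.out into the current directory.
    "$exe" "$@" || return 1

    # -b omits the long explanations gprof normally prints.
    gprof -b "$exe" gmon.out
}

# Typical use (myprog.c and inputdata are placeholders):
#   profile_run myprog.c inputdata
```

Keeping the profiled binary under a separate name (the .prof suffix here) is a design choice that avoids shipping an instrumented executable by accident.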

One less detailed way to look at processes is to get an accounting of how much time a program took to run in user space, in kernel space, and in real time. For this, the time (Section 26.2) command exists as part of both C and bash shells. As an external program, /bin/time gives a slightly less detailed report. No special compilation is necessary to use this program, so it's a good tool to use to get a first approximation of the bottlenecks in a particular process.

Resolving I/O-bound issues is difficult for users. Only administrators can both tweak the low-level system settings that control system I/O buffering and install new hardware, if needed. CPU-bound processes might be improved by dividing the program into smaller programs that feed data to each other. Ideally, these smaller programs can be spread across several machines. This is the basis of distributed computing.

Sometimes, you want a particular process to hog all the system resources. This is the definition of a dedicated server, like one that hosts the Apache web server or an Oracle database. Often, server software will have configuration switches that help the administrator allocate system resources based on typical usage. This, of course, is far beyond the scope of this book, but do check out Web Performance Tuning and Oracle Performance Tuning from O'Reilly for more details. For more system-wide tips, pick up System Performance Tuning, also from O'Reilly.

As with so many things in life, you can improve performance only so much. In fact, by improving performance in one area, you're likely to see performance degrade in other tasks. Unless you've got a machine that's dedicated to a very specific task, beware the temptation to over-optimize.

— JJ

26.2 Timing Programs

Two commands, time and /bin/time, provide simple timings. Their information is highly accurate, because no profiling overhead distorts the program's performance. Neither program provides any analysis on the routine or trace level. They report the total execution time, some other global statistics, and nothing more. You can use them on any program.

time and /bin/time differ primarily in that time is built into many shells, including bash. Therefore, it cannot be used in safely portable Bourne shell scripts or in makefiles. It also cannot be used if you prefer the Bourne shell (sh). /bin/time is an independent executable file and therefore can be used in any situation. To get a simple program timing, enter either time or /bin/time, followed by the command you would normally use to execute the program. For example, to time a program named analyze (that takes two command-line arguments, an input file and an output file), enter the following command:

% time analyze inputdata outputfile
9.0u 6.7s 0:30 18% 23+24k 285+148io 625pf+0w

This result (in the default C shell format) indicates that the program spent 9.0 seconds on behalf of the user (user time), 6.7 seconds on behalf of the system (system time, or time spent executing Unix kernel routines on the user's behalf), and a total of 30 seconds elapsed time. Elapsed time is the wall clock time from the moment you enter the command until it terminates, including time spent waiting for other users, I/O time, etc.

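On a single-processor system, the elapsed time is always at least as large as your total CPU time and can even be several times larger. (A multithreaded program on a multiprocessor machine can accumulate more CPU time than elapsed time.) You can set programs to be timed automatically (without typing time first) or change the output format by setting shell variables.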

The example above shows the CPU time as a percentage of the elapsed time (18 percent). The remaining data reports virtual memory management and I/O statistics. The meaning varies, depending on your shell; check your online csh manual page or article.

In this example, under SunOS 4.1.1, the other fields show the amount of shared memory used, the amount of nonshared memory used (k), the number of block input and output operations (io), and the number of page faults plus the number of swaps (pf and w). The memory management figures are unreliable in many implementations, so take them with a grain of salt.

/bin/time reports only the real time (elapsed time), user time, and system time. For example:

% /bin/time analyze inputdata outputfile
       60.8 real        11.4 user         4.6 sys

[If you use a shell without a built-in time command, you can just type time. — JP] This reports that the program ran for 60.8 seconds before terminating, using 11.4 seconds of user time and 4.6 seconds of system time, for a total of 16 seconds of CPU time. On Linux and some other systems, that external time command is in /usr/bin/time and may make a more detailed report.
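
The relationship among these three figures is simple arithmetic, sketched below with the numbers from the /bin/time example. The helper name cpu_percent is an assumption for illustration, not a standard command.

```shell
# cpu_percent REAL USER SYS
# Print total CPU time (user + system) as a percentage of elapsed time.
cpu_percent() {
    LC_ALL=C awk -v real="$1" -v user="$2" -v sys="$3" \
        'BEGIN { printf "%.1f\n", 100 * (user + sys) / real }'
}

cpu_percent 60.8 11.4 4.6    # 16.0 CPU seconds out of 60.8 elapsed
```

For the run above this prints 26.3. In other words, the program was actually executing for only about a quarter of the time it took; the rest was spent waiting for I/O or for other processes.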

There's a third timer on some systems: timex. It can give much more detail if your system has process accounting enabled. Check the timex(1) manpage.

— ML

26.3 What Commands Are Running and How Long Do They Take?

When your system is sluggish, you will want to see what users are on the system along with the processes they're running. To get a brief snapshot of this information, the tersely named w can show you who is logged in, from where, how long they've been idle, and what programs they're running. For instance, when I run w on my Red Hat box at home, I get this result:

  3:58pm  up 38 days,  4:37,  6 users,  load average: 0.00, 0.07, 0.07
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU  WHAT
jjohn    tty2     -                13Feb02  7:03m  1.32s  0.02s  /bin/sh /usr/X
jjohn    pts/1    :0                8:55am  7:02m  0.06s  0.06s  bash
jjohn    pts/3    :0                8:55am  0.00s 51.01s  0.05s  w
jjohn    pts/0    :0                8:55am  7:02m  0.06s  0.06s  bash
jjohn    pts/4    :0                8:55am  2:25m  2:01   0.12s  bash
jjohn    pts/2    mp3.daisypark.ne Tue 4pm  3:41m  0.23s  0.23s  -bash

Originally, I logged in at the console and started X. Most of the sessions are xterminals except for the last, which is an ssh session. The JCPU field accounts for the CPU time used by all the processes at that TTY. The PCPU simply accounts for the process named in the WHAT field. This is a quick and simple command to show you the state of your system, and it relies on no special process accounting from the kernel.

When you're debugging a problem with a program, trying to figure out why your CPU usage bill is so high [in the days when CPU cycles were rented — JJ], or curious what commands someone (including yourself) is running, the lastcomm command on Berkeley-like Unixes can help (if your computer has its process accounting system running, that is). Here's an example that lists the user lesleys:

% date
Mon Sep  4 16:38:13 EDT 2001
% lastcomm lesleys
emacs          lesleys  ttyp1      1.41 secs Wed Sep  4 16:28
cat          X lesleys  ttyp1      0.06 secs Wed Sep  4 16:37
stty           lesleys  ttypa      0.02 secs Wed Sep  4 16:36
tset           lesleys  ttypa      0.12 secs Wed Sep  4 16:36
sed            lesleys  ttypa      0.02 secs Wed Sep  4 16:36
hostname       lesleys  ttypa      0.00 secs Wed Sep  4 16:36
quota          lesleys  ttypa      0.16 secs Wed Sep  4 16:35
   ...

The processes are listed in the order completed, most recent first. The emacs process on the tty (Section 2.7) ttyp1 started 10 minutes ago and took 1.41 seconds of CPU time. Sometime while emacs was on ttyp1, lesleys ran cat and killed it (the X shows that). Because emacs ran on the same terminal as cat but finished later, Lesley might have stopped emacs with CTRL-z (Section 23.3) to run cat. The processes on ttypa are the ones run from her .cshrc and .login files (though you can't tell that from lastcomm). You don't see the login shell for ttypa (csh) here because it hasn't terminated yet; it will be listed after Lesley logs out of ttypa.

lastcomm can do more. See its manual page.

Here's a hint: on a busy system with lots of users and commands being logged, lastcomm is pretty slow. If you redirect the output into a file or pipe it through tee (Section 43.8), like this:

% lastcomm lesleys > lesley.cmds & 
% cat lesley.cmds 
    ...nothing...
% lastcomm lesleys | tee lesley.cmds 
    ...nothing...

the lastcomm output may be written to the file or pipe in big chunks instead of line-by-line. That can make it look as if nothing's happening. If you can tie up a terminal while lastcomm runs, there are two workarounds. If you're using a window system or terminal emulator with a "log to file" command, use it while lastcomm runs. Otherwise, to copy the output to a file, start script (Section 37.7) and then run lastcomm:

% script lesley.cmds
Script started, file is lesley.cmds
% lastcomm lesleys
emacs          lesleys  ttyp1      1.41 secs Wed Sep  4 16:28
cat          X lesleys  ttyp1      0.06 secs Wed Sep  4 16:37
   ...

% exit
Script done, file is lesley.cmds
%

A final word: lastcomm can't give information on commands that are built into the shell (Section 1.9). Those commands are counted as part of the shell's execution time; they'll be in an entry for csh, sh, etc. after the shell terminates.

— JP and JJ

26.4 Checking System Load: uptime

Go to http://examples.oreilly.com/upt3 for more information on: uptime

The BSD command uptime, also available under System V Release 4, AIX, and some System V Release 3 implementations, will give you a rough estimate of the system load:

% uptime
3:24pm up 2 days, 2:41, 16 users, load average: 1.90, 1.43, 1.33

uptime reports the current time, the amount of time the system has been up, and three load average figures. The load average is a rough measure of CPU use. These three figures report the average number of processes active during the last minute, the last 5 minutes, and the last 15 minutes. High load averages usually mean that the system is being used heavily and the response time is correspondingly slow. Note that the system's load average does not take into account the priorities and niceness (Section 26.5) of the processes that are running.

What's high? As usual, that depends on your system. Ideally, you'd like a load average under, say, 3, but that's not always possible given what some systems are required to do. Higher load averages are usually more tolerable on machines with more than one processor. Ultimately, "high" means high enough that you don't need uptime to tell you that the system is overloaded — you can tell from its response time.

Furthermore, different systems behave differently under the same load average. For example, on some workstations, running a single CPU-bound background job at the same time as the X Window System (Section 1.22) will bring response to a crawl even though the load average remains quite "low." In the end, load averages are significant only when they differ from whatever is "normal" on your system.
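
Because "normal" is something you have to establish for yourself, it can help to log the three figures regularly. The sketch below pulls them out of a line of uptime output. It assumes the "load average:" label shown above (some systems print "load averages:") and a period as the decimal point; the function name load_averages is hypothetical.

```shell
# Read one line of uptime output on standard input and print the
# 1-, 5-, and 15-minute load averages separated by spaces.
load_averages() {
    sed -e 's/.*load averages\{0,1\}: *//' -e 's/,//g'
}

# Using the sample line from this section:
echo '3:24pm up 2 days, 2:41, 16 users, load average: 1.90, 1.43, 1.33' |
    load_averages
```

On a live system you would run uptime | load_averages instead, perhaps appending the result and a timestamp to a file so you can compare busy periods against quiet ones.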

— AF

26.5 Know When to Be "nice" to Other Users...and When Not To

The BSD-System V split isn't so obvious in modern Unixes, but the different priority systems still live in various flavors. This article should help you understand the system in whatever version you have.

If you are going to run a CPU-bound (Section 26.1) process that will monopolize the CPU from other processes, you may reduce the urgency of that more intensive process in the eyes of the process scheduler by using nice before you run the program. For example:

$ nice executable_filename

On most systems, no user can directly change a process's priority (only the scheduler does that), and only the administrator can use nice to make a process more urgent. In practice, nice is rarely used on multiuser systems — the tragedy of the commons — but you may be able to get more processes running simultaneously by judicious use of this program.

If you're not familiar with Unix, you will find its definition of priority confusing — it's the opposite of what you would expect. A process with a high nice number runs at low priority, getting relatively little of the processor's attention; similarly, jobs with a low nice number run at high priority. This is why the nice number is usually called niceness: a job with a lot of niceness is very kind to the other users of your system (i.e., it runs at low priority), while a job with little niceness hogs the CPU. The term "niceness" is awkward, like the priority system itself. Unfortunately, it's the only term that is both accurate (nice numbers are used to compute priorities but are not the priorities themselves) and avoids horrible circumlocutions ("increasing the priority means lowering the priority...").

Many supposedly experienced users claim that nice has virtually no effect. Don't listen to them. As a general rule, reducing the priority of an I/O-bound job (a job that's waiting for I/O a lot of the time) won't change things very much. The system rewards jobs that spend most of their time waiting for I/O by increasing their priority. But reducing the priority of a CPU-bound process can have a significant effect. Compilations, batch typesetting programs (troff, TeX, etc.), applications that do a lot of math, and similar programs are good candidates for nice. On a moderately loaded system, I have found that nice typically makes a CPU-intensive job roughly 30 percent slower and consequently frees that much time for higher priority jobs. You can often significantly improve keyboard response by running CPU-intensive jobs at low priority.

Note that System V Release 4 has a much more complex priority system, including real-time priorities. Priorities are managed with the priocntl command. The older nice command is available for compatibility. Other Unix implementations (including HP and Concurrent) support real-time scheduling. These implementations have their own tools for managing the scheduler.

The nice command sets a job's niceness, which is used to compute its priority. It may be one of the most nonuniform commands in the universe. There are four versions, each slightly different from the others. BSD Unix has one nice that is built into the C shell, and another standalone version can be used by other shells. System V also has one nice that is built into the C shell and a separate standalone version.

Under BSD Unix, you must also know about the renice(8) command (Section 26.7); this lets you change the niceness of a job after it is running. Under System V, you can't modify a job's niceness once it has started, so there is no equivalent.

Think carefully before you nice an interactive job like a text editor. See Section 26.6.

We'll tackle the different variations of nice in order.

26.5.1 BSD C Shell nice

Under BSD Unix, nice numbers run from -20 to 20. The -20 designation corresponds to the highest priority; 20 corresponds to the lowest. By default, Unix assigns the nice number 0 to user-executed jobs. The lowest nice numbers (-20 to -17) are unofficially reserved for system processes. Assigning a user's job to these nice numbers can cause problems. Users can always request a higher nice number (i.e., a lower priority) for their jobs. Only the superuser (Section 1.18) can raise a job's priority.

To submit a job at a greater niceness, precede it with the modifier nice. For example, the following command runs an awk command at low priority:

% nice awk -f proc.awk datafile > awk.out

By default, the csh version of nice will submit this job with a nice level of 4. To submit a job with an arbitrary nice number, use nice one of these ways, where n is an integer between 0 and 20:

% nice +n command
% nice -n command

The +n designation requests a positive nice number (low priority); -n requests a negative nice number. Only a superuser may request a negative nice number.

26.5.2 BSD Standalone nice

The standalone version of nice differs from C shell nice in that it is a separate program, not a command built in to the C shell. You can therefore use the standalone version in any situation: within makefiles (Section 11.10), when you are running the Bourne shell, etc. The principles are the same. nice numbers run from -20 to 20, with the default being 0. Only the syntax has been changed to confuse you. For the standalone version, -n requests a positive nice number (lower priority) and --n requests a negative nice number (higher priority — superuser only). Consider these commands:

$ nice -6 awk -f proc.awk datafile > awk.out
# nice --6 awk -f proc.awk datafile > awk.out

The first command runs awk with a high nice number (i.e., 6). The second command, which can be issued only by a superuser, runs awk with a low nice number (i.e., -6). If no level is specified, the default argument is -10.
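
Most current systems also accept the POSIX spelling, nice -n increment, which avoids the confusion between an option dash and a minus sign. The sketch below is one way to confirm what a given increment does. It assumes a nice that prints the current niceness when run without a command (GNU coreutils behaves this way; other versions may not), and niceness_with is a hypothetical helper name.

```shell
# niceness_with INCREMENT
# Print the niceness a command would run at when started with
# "nice -n INCREMENT". The inner nice, run with no command, simply
# reports the niceness it inherited.
niceness_with() {
    nice -n "$1" nice
}

# Two spellings of the same request, six steps nicer than the shell:
#   nice -6 awk -f proc.awk datafile > awk.out      (older syntax)
#   nice -n 6 awk -f proc.awk datafile > awk.out    (POSIX syntax)
```

Running niceness_with 6 from a shell at the default niceness should report 6; if your shell is itself already niced, the result is offset accordingly.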

26.5.3 System V C Shell nice

System V takes a slightly different view of nice numbers. nice levels run from 0 to 39; the default is 20. The numbers are different but their meanings are the same: 39 corresponds to the lowest possible priority, and 0 is the highest. A few System V implementations support real-time submission via nice. Jobs submitted by root with extremely low nice numbers (-20 or below) allegedly get all of the CPU's time. Systems on which this works properly are very rare and usually advertise support for real-time processing. In any case, running jobs this way will destroy multiuser performance. This feature is completely different from real-time priorities in System V Release 4.

With these exceptions, the C shell version of nice is the same as its BSD cousin. To submit a job at a low priority, use the command:

% nice command

This increases the command's niceness by the default amount (4, the same as BSD Unix); command will run at nice level 24. To run a job at an arbitrary priority, use one of the following commands, where n is an integer between 0 and 19:

% nice +n command
% nice -n command

The +n entry requests a higher nice level (a decreased priority), while -n requests a lower nice level (a higher priority). Again, this is similar to BSD Unix, with one important difference: n is now relative to the default nice level. That is, the following command runs awk at nice level 26:

% nice +6 awk -f proc.awk datafile > awk.out

26.5.4 System V Standalone nice

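Once again, the standalone version of nice is useful if you are writing makefiles or shell scripts or if you use the Bourne shell as your interactive shell. It is similar to the C shell version, but the syntax differs: with the argument -n, nice increases the nice level by n (lowering the priority), and with the argument --n, it decreases the nice level by n (raising the priority; superuser only).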

Consider these commands:

$ nice -6 awk -f proc.awk datafile > awk.out
# nice --6 awk -f proc.awk datafile > awk.out

The first command runs awk at a higher nice level (i.e., 26, which corresponds to a lower priority). The second command, which can be given only by the superuser, runs awk at a lower nice level (i.e., 14).
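
The System V arithmetic is easy to get backwards, so here is a small sketch of it. It assumes the default level of 20 and the 0 to 39 range described above, and it assumes that out-of-range requests are clamped to that range. The name sysv_nice_level is hypothetical, and its argument is the signed change in niceness (6 for nice -6, -6 for nice --6).

```shell
# sysv_nice_level DELTA
# Print the System V nice level that results from changing the default
# level (20) by DELTA, clamped to the range 0..39.
sysv_nice_level() {
    level=$((20 + $1))
    if [ "$level" -lt 0 ]; then level=0; fi
    if [ "$level" -gt 39 ]; then level=39; fi
    echo "$level"
}

sysv_nice_level 6     # nice -6  gives 26 (lower priority)
sysv_nice_level -6    # nice --6 gives 14 (higher priority, superuser only)
```

Subtracting 20 from a System V level gives the corresponding BSD-style nice number, which is a handy way to compare figures from the two families.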

— ML

26.6 A nice Gotcha

It's not a good idea to nice a foreground job (Section 23.3). If the system gets busy, your terminal could "freeze" waiting to get enough CPU time to do something. You may not even be able to kill (Section 24.11) a nice'd job on a very busy system because the CPU may never give the process enough CPU time to recognize the signal waiting for it! And, of course, don't nice an interactive program like a text editor unless you like to wait... :-)

— JP

26.7 Changing a Running Job's Niceness

On Unix systems with BSD-style priority schemes, once a job is running, you can use the renice(8) command to change the job's priority:

% /etc/renice priority -p pid
% /etc/renice priority -g pgrp
% /etc/renice priority -u uname

where priority is the new nice level (Section 26.5) for the job. It must be a signed integer between -20 and 20. pid is the ID number (Section 24.3) (as shown by ps (Section 24.5)) of the process you want to change. pgrp is the number of a process group (Section 24.3), as shown by ps -l; this version of the command modifies the priority of all commands in a process group. uname may be a user's name, as shown in /etc/passwd; this form of the command modifies the priority of all jobs submitted by the user.

A nice level of 19 is the "nicest": the process will run only when nothing else on the system wants to. Negative values make a process get a greater percentage of the CPU's time than the default niceness (which is 0). Again, only the superuser can lower the nice number (raise a process' priority). Users can only raise the nice number (lower the priority), and they can modify the priorities of only the jobs they started.
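
On many current systems, renice lives in a directory on the normal search path rather than in /etc. As a minimal sketch, the helper below (be_nicer is a hypothetical name) applies the absolute form shown above to a single process and returns renice's exit status, so a script can tell whether the change was accepted.

```shell
# be_nicer PID NICENESS
# Set the niceness of process PID to the absolute value NICENESS.
# Ordinary users can only raise the value, and only for their own jobs.
be_nicer() {
    renice "$2" -p "$1" >/dev/null
}

# Typical use: start a long job, then decide it should yield to others.
#   sleep 600 &
#   be_nicer $! 10 && ps -o pid,nice,comm -p $!
```

The ps command in the usage comment uses the POSIX nice column to confirm the new value; some systems label that column NI.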

— ML

[1]  This list is modified from Tanenbaum and Woodhull's Operating Systems: Design and Implementation, Second Edition (Upper Saddle River: Prentice-Hall, Inc., 1997), 83.
