Previous Page
Next Page

7.6. Keeping Everything Running Smoothly

A significant part of maintenance is being aware that something is wrong before it becomes a real problem. If you catch a problem early, chances are it'll be that much easier to fix. As the old adage says, an ounce of prevention is worth a pound of cure.

This isn't quite troubleshootingwe'll devote an entire chapter to troubleshooting laterthink of it more as "pre-troubleshooting." Troubleshooting (the pound of cure) is what you have to do after your problem has developed complications and you need to identify the problem by its symptoms.

The next two sections deal with preventative maintenance: looking periodically at the syslog file and at the BIND nameserver statistics to see whether any problems are developing. Consider this a nameserver's medical checkup.

7.6.1. Common Syslog Messages

There are a large number of syslog messages that named can emit. In practice, you'll see only a few of them. We'll cover the most common syslog messages here, excluding reports of syntax errors in zone datafiles.

Every time you start named, it sends out a message at priority LOG_NOTICE. For a BIND 8 nameserver, it looks like this:

Jan 10 20:48:32 toystory named[3221]: starting.  named 8.2.3 Tue May 16 09:39:40
MDT 2000 cricket@huskymo.boulder.acmebw.com:/usr/local/src/bind-8.2.3/src/bin/
named

For BIND 9, it's significantly abridged:

Jul 27 16:18:41 toystory named[7045]: starting BIND 9.3.2

This message logs the fact that named started at this time and tells you the version of BIND you're running as well as who built it and where (for BIND 8). Of course, this is nothing to be concerned about. It is a good place to look if you're not sure what version of BIND your operating system supports.

Every time you send the nameserver a reload command, a BIND 8 nameserver sends out this message at priority LOG_NOTICE:

Jan 10 20:50:16 toystory named[3221]: reloading nameserver 


Here's the BIND 9 nameservers log:

Jul 27 16:27:45 toystory named[7047]: loading configuration from
                '/etc/named.conf'

These messages simply tell you that named reloaded its database (as a result of a reload command) at this time. Again, this is nothing to be concerned about. This message will most likely be of interest when you are tracking down how long a bad resource record has been in your zone data or how long a whole zone has been missing because of a mistake during an update.

Here's another message you may see shortly after your nameserver starts:

Jan 10 20:50:20 toystory named[3221]: cannot set resource limits on
                this system

This means that your nameserver thinks your operating system does not support the getrlimit() and setrlimit() system calls, which are used when you try to define coresize, datasize, stacksize, or files. It doesn't matter whether you're actually using any of these substatements in your configuration file; BIND will print the message anyway. If you are not using these substatements, ignore the message. If you are, and you think your operating system actually does support getrlimit() and setrlimit(), you'll have to recompile BIND with HAVE_GETRUSAGE defined. This message is logged at priority LOG_INFO.

If you run your nameserver on a host with many network interfaces (especially virtual network interfaces), you may see this message soon after startup or even after your nameserver has run well for a while:

Jan 10 20:50:31 toystory named[3221]: fcntl(dfd, F_DUPFD, 20): Too
                many open files
Jan 10 20:50:31 toystory named[3221]: fcntl(sfd, F_DUPFD, 20): Too
                many open files

This means that BIND has run out of file descriptors. BIND uses a fair number of file descriptors: two for each network interface it's listening on (one for UDP and one for TCP), and one for opening zone datafiles. If that's more than the limit your operating system places on processes, BIND won't be able to get any more file descriptors, and you'll see this message. The priority depends on which part of BIND fails to get the file descriptor: the more critical the subsystem, the higher the priority.

The next step is either to get BIND to use fewer file descriptors, or to raise the limit the operating system places on the number of file descriptors BIND can use:

  • If you don't need BIND listening on all your network interfaces (particularly the virtual ones), use the listen-on substatement to configure BIND to listen only on those interfaces it needs to. See Chapter 10 for details on the syntax of listen-on.

  • If your operating system supports getrlimit() and setrlimit() (as just described), configure your nameserver to use a larger number of files with the files substatement. See Chapter 10 for details on using the files substatement.

  • If your operating system places too restrictive a limit on open files, raise that limit before you start named with the ulimit command.

Every time a nameserver loads a zone, it sends out a message at priority LOG_INFO:

Jan 10 21:49:50 toystory named[3221]: zone movie.edu/IN
                loaded serial 2005011000

This tells you when the nameserver loaded the zone, the class of the zone (in this case, IN), and the serial number in the zone's SOA record.

About every hour, a BIND 8 nameserver sends a snapshot of the current statistics at priority LOG_INFO:

Feb 18 14:09:02 toystory named[3565]: USAGE 824681342 824600158
                CPU=13.01u/3.26s CHILDCPU=9.99u/12.71s
Feb 18 14:09:02 toystory named[3565]: NSTATS 824681342 824600158
                A=4 PTR=2
Feb 18 14:09:02 toystory named[3565]: XSTATS 824681342 824600158
                RQ=6 RR=2 RIQ=0 RNXD=0 RFwdQ=0 RFwdR=0 RDupQ=0 RDupR=0
                RFail=0 RFErr=0 RErr=0 RTCP=0 RAXFR=0 RLame=0 Ropts=0
                SSysQ=2 SAns=6 SFwdQ=0 SFwdR=0 SDupQ=5 SFail=0 SFErr=0
                SErr=0 RNotNsQ=6 SNaAns=2 SNXD=1

(BIND 9 doesn't send out the statistics as a log message.) The first two numbers for each message are times. If you subtract the second number from the first number, you'll find out how many seconds your server has been running. (You'd think the nameserver could do that for you.) The CPU entry tells you how much time your server has spent in user mode (13.01 seconds) and system mode (3.26 seconds). Then it tells you the same statistic for child processes. The NSTATS message lists the types of queries your server has received and the counts for each. The XSTATS message lists additional statistics. The statistics under NSTATS and XSTATS are explained in more detail later in this chapter.

If BIND finds a name that doesn't conform to RFC 952, it logs a syslog error:

Jul 24 20:56:26 toystory named[1496]: ID_4.movie.edu IN:
                                      bad owner name 
 (check-names)

This message is logged at level LOG_ERROR. See Chapter 4 for the host-naming rules.

Another syslog message, sent at priority LOG_ERROR, is a message about the zone data:

Jan 10 20:48:38 toystory2 named[3221]: ts2 has CNAME
                and other data (invalid)

This message means that there's a problem with your zone data. For example, you may have entries like these:

ts2                 IN  CNAME toystory2
ts2                 IN  MX    10 toystory2
toystory2           IN  A     192.249.249.10
toystory2           IN  MX    10 toystory2

The MX record for ts2 is incorrect and triggers the message just listed. ts2 is an alias for toystory2, which is the canonical name. As described earlier, when a nameserver looks up a name and finds a CNAME, it replaces the original name with the canonical name and then tries looking up the canonical name. Thus, when the server looks up the MX data for ts2, it finds a CNAME record and then looks up the MX record for toystory2. Since the server follows the CNAME record for ts2, it never uses the MX record for ts2; in fact, this record is illegal. In other words, all resource records for a host have to use the canonical name; it's an error to use an alias in place of the canonical name.

The following message indicates that a BIND 8 slave was unable to reach any master server when it tried to do a zone transfer:

Jan 10 20:52:42 wormhole named[2813]: zoneref: Masters for
                secondary zone "movie.edu" unreachable

BIND 9 slaves say:

Jul 27 16:50:55 toystory named[7174]: transfer of 'movie.edu/IN'
                from 192.249.249.3#53: failed to connect: timed out

This message is sent at priority LOG_NOTICE on BIND 8, and LOG_ERROR on BIND 9, and is sent only the first time the zone transfer fails. When the zone transfer finally succeeds, BIND tells you that the zone transferred by issuing another syslog message. When this message first appears, you don't need to take any immediate action. The nameserver will continue attempting to transfer the zone according to the retry period in the SOA record. After a few days (or half the expire time), you might check that the server was able to transfer the zone. Or, you can verify that the zone transferred by checking the timestamp on the backup zone datafile. When a zone transfer succeeds, a new backup file is created. When a nameserver finds a zone is up to date, it "touches" the backup file (à la the Unix touch command). In both cases, the timestamp on the backup file is updated, so go to the slave and give the command ls -l /usr/local/named/db*. This tells you when the slave last synchronized each zone with the master server. We'll cover how to troubleshoot slaves failing to transfer zones in Chapter 14.

If you are watching the syslog messages, you'll see a LOG_INFO syslog message when the slave picks up the new zone data or when a tool such as nslookup transfers a zone:

Mar  7 07:30:04 toystory named[3977]: client 192.249.249.1#1076:
                transfer of 'movie.edu/IN':AXFR started

If you're using the allow-transfer substatement (explained in Chapter 11) to limit which servers can load zones, you may see this message saying denied instead of started:

Jul 27 16:59:26 toystory named[7174]: client 192.249.249.1#1386:
                zone transfer 'movie.edu/AXFR/IN' denied

You'd see this syslog message only if you capture LOG_INFO syslog messages:

Jan 10 20:52:42 wormhole named[2813]: Malformed response 

                from 192.1.1.1

Most often, this message means that some bug in a nameserver caused it to send an erroneous response packet. The error probably occurred on the remote nameserver (192.1.1.1) rather than the local server (wormhole). Diagnosing this kind of error involves capturing the response packet in a network trace and decoding it. Decoding DNS packets manually is beyond the scope of this book, so we won't go into much detail. You'd see this type of error if the response packet said it contained several answers in the answer section (such as four address resource records), yet the answer section contained only a single answer. The only course of action is to notify the administrator of the offending host via email (assuming you can get the name of the host by looking up the address). You would also see this message if the underlying network altered (damaged) the UDP response packets in some way. Checksumming UDP packets is optional, so this error might not be caught at a lower level.

A BIND 8 named logs this message when you try to sneak records into your zone datafile that belong in another zone:

Jun 13 08:02:03 toystory named[2657]: db.movie.edu:28: data "foo.bar.edu"
                           outside zone "movie.edu" (ignored)

A BIND 9 named logs:

Jul 27 17:07:01 toystory named[7174]: dns_master_load:
                db.movie.edu:28: ignoring out-of-zone data

For instance, if we tried to use this zone data:

shrek       IN A  192.249.249.2
toystory    IN A  192.249.249.3

; Add this entry to the nameserver's cache
foo.bar.edu.  IN A  10.0.7.13

we'd be adding data for the bar.edu zone into our movie.edu zone datafile. This syslog message is logged at priority LOG_WARNING.

Earlier in the book, we said that you couldn't use a CNAME in the data portion of a resource record. BIND 8 will catch this misuse:

Jun 13 08:21:04 toystory named[2699]: "movie.edu IN NS" points to a
                                        CNAME (mi.movie.edu)

BIND 9 doesn't catch it as of 9.3.0.

Here is an example of the offending resource records:

@                        IN  NS       toystory.movie.edu.
                         IN  NS       mi.movie.edu.
toystory.movie.edu.      IN  A     192.249.249.3
monsters-inc.movie.edu.  IN  A     192.249.249.4
mi.movie.edu.            IN  CNAME monsters-inc.movie.edu.

The second NS record should have listed monsters-inc.movie.edu instead of mi.movie.edu. This syslog message won't show up immediately when you start your nameserver.

You'll only see the syslog message when the offending data is looked up. This syslog message is logged by a BIND 8 server at priority LOG_INFO.


The following message indicates that your nameserver may be guarding itself against one type of network attack:

Jun 11 11:40:54 toystory named[131]: Response from unexpected source 

                                        ([204.138.114.3].53)

Your nameserver sent a query to a remote nameserver, but the response that came wasn't returned from any of the addresses your nameserver had listed for the remote nameserver. The potential security breach is this: an intruder causes your nameserver to query a remote nameserver, and at the same time the intruder sends responses (pretending the responses are from the remote nameserver) that the intruder hopes your nameserver will add to its cache. Perhaps he sends along a false PTR record, pointing the IP address of one of his hosts to the domain name of a host you trust. Once the false PTR record is in your cache, the intruder uses one of the BSD "r" commands (e.g., rlogin) to gain access to your system.

Less paranoid admins will realize that this situation can also happen if a parent zone's nameserver knows about only one of the IP addresses of a multihomed nameserver for a child zone. The parent tells your nameserver the one IP address it knows, and when your server queries the remote nameserver, the remote nameserver responds from the other IP address. This shouldn't happen if BIND is running on the remote nameserver host, because BIND makes every effort to use the same IP address in the response as the query was sent to. This syslog message is logged at priority LOG_INFO.

Here's an interesting syslog message:

Jun 10 07:57:28 toystory named[131]: No root name servers for
                class 226

The only classes defined to date are: class 1, Internet (IN); class 3, Chaos (CH); and class 4, Hesiod (HS). What's class 226? That's exactly what your nameserver is saying with this syslog message: something is wrong because there's no class 226. What can you do about it? Nothing, really. This message doesn't give you enough information; you don't know who the query is from or what the query was for. Then again, if the class field is corrupted, the domain name in the query may be garbage too. The actual cause of the problem could be a broken remote nameserver or resolver, or a corrupted UDP datagram. This syslog message is logged at priority LOG_INFO.

This message might appear if you are backing up some other zone:

Jun 7 20:14:26 wormhole named[29618]: Zone "253.253.192.in-addr.arpa"
                (class 1) SOA serial# (3345) rcvd from [192.249.249.10]
                is < ours (563319491)

Ah, the pesky admin for 253.253.192.in-addr.arpa changed the serial number format and neglected to tell you about it. Some thanks you get for running a slave for this zone, huh? Drop the admin a note to see if this change was intentional or just a typo. If the change was intentional, or if you don't want to contact the admin, then you have to deal with it locallykill your slave, remove the backup copy of this zone, and restart your server. This procedure removes all knowledge your slave had of the old serial number, at which point it's quite happy with the new serial number. This syslog message is logged at priority LOG_NOTICE.

By the way, if that pesky admin was running a BIND 8 or 9 nameserver, then he must have missed (or ignored) a message his server logged, telling him that he'd rolled the zone's serial number back. On a BIND 8 nameserver, the message looks like:

Jun 7 19:35:14 toystory named[3221]: WARNING: new serial number < old
              (zp->z_serial < serial)

On a BIND 9 nameserver, it looks like:

Jun 7 19:36:41 toystory named[9832]: dns_zone_load: zone movie.edu/IN: zone
serial has gone backwards

This message is logged at LOG_NOTICE.

You might want to remind him of the wisdom of checking syslog after making any changes to the nameserver.

This BIND 8 message will undoubtedly become familiar to you:

Aug 21 00:59:06 toystory named[12620]: Lame server on 'foo.movie.edu'
         (in 'MOVIE.EDU'?): [10.0.7.125].53 'NS.HOLLYWOOD.LA.CA.US':
         learnt (A=10.47.3.62,NS=10.47.3.62)

Under BIND 9, it looks like this:

Jan 15 10:20:16 toystory named[14205]: lame server 
 on 'foo.movie.edu' (in
         'movie.EDU'?): 10.0.7.125#53

"Aye, Captain, she's sucking mud!" There's some mud out there in the Internet waters in the form of bad delegations. A parent nameserver is delegating a subdomain to a child nameserver, and the child nameserver is not authoritative for the subdomain. In this case, the edu nameserver is delegating movie.edu to 10.0.7.125, and the nameserver on this host is not authoritative for movie.edu. Unless you know the admin for movie.edu, there's probably nothing you can do about this. The syslog message is logged at LOG_INFO.

If your configuration file has:

logging { category queries { default_syslog; }; };

you will get a LOG_INFO syslog message for every query your nameserver receives:

Feb 20 21:43:25 toystory named[3830]:
            XX /192.253.253.2/carrie.movie.edu/A
Feb 20 21:43:32 toystory named[3830]:
            XX /192.253.253.2/4.253.253.192.in-addr.arpa/PTR

The format has changed slightly in BIND 9, though:

Jan 13 18:32:25 toystory named[13976]: client 192.253.253.2#1702:
           query: carrie.movie.edu IN A +
Jan 13 18:32:42 toystory named[13976]: client 192.253.253.2#1702:
           query: 4.253.253.192.in-addr.arpa IN PTR +

These messages include the IP address of the host that made the query as well as the query itself. On a BIND 8.2.1 or later nameserver, recursive queries are marked with XX+ instead of XX. A BIND 9 nameserver marks recursive queries with a + and non-recursive queries with a - character. BIND 8.4.3 and later and 9.3.0 and later even mark EDNS0 queries and TSIG-signed queries with E and S, respectively. (We'll talk about EDNS0 in Chapter 10 and TSIG in Chapter 11.)

Make sure you have lots of disk space if you log all the queries to a busy nameserver. (On a running server, you can toggle query logging on and off with the querylog command.)

Starting with BIND 8.1.2, you might see this set of syslog messages:

May 19 11:06:08 named[21160]: bind(dfd=20, [10.0.0.1].53):
                Address already in use
May 19 11:06:08 named[21160]: deleting interface [10.0.0.1].53
May 19 11:06:08 named[21160]: bind(dfd=20, [127.0.0.1].53):
                Address already in use
May 19 11:06:08 named[21160]: deleting interface [127.0.0.1].53
May 19 11:06:08 named[21160]: not listening on any interfaces
May 19 11:06:08 named[21160]: Forwarding source address
                is [0.0.0.0].1835
May 19 11:06:08 named[21161]: Ready to answer queries.

On BIND 9 nameservers, that looks like:

Jul 27 17:15:58 toystory named[7357]: listening on IPv4 interface lo, 127.0.0.1#53
Jul 27 17:15:58 toystory named[7357]: binding TCP socket: address in use
Jul 27 17:15:58 toystory named[7357]: listening on IPv4 interface eth0,
         206.168.194.122#53
Jul 27 17:15:58 toystory named[7357]: binding TCP socket: address in use
Jul 27 17:15:58 toystory named[7357]: listening on IPv4 interface eth1,
         206.168.194.123#53
Jul 27 17:15:58 toystory named[7357]: binding TCP socket: address in use
Jul 27 17:15:58 toystory named[7357]: couldn't add command channel
0.0.0.0#953: address in use

What has happened is that you had a nameserver running, and you started up a second nameserver without killing the first one. Unlike what you might expect, the second nameserver continues to run; it just isn't listening on any interfaces.

7.6.2. Understanding the BIND Statistics

Periodically, you should look over the statistics on some of your nameservers, if only to see how busy they are. We'll now show you an example of the nameserver statistics and discuss what each line means. Nameservers handle many queries and responses during normal operation, so first we need to show you what a typical exchange might look like.

Reading the explanations for the statistics is hard without a mental picture of what goes on during a lookup. To help you understand the nameserver's statistics, Figure 7-2 shows what might happen when an application tries to look up a domain name. The application, FTP, queries a local nameserver. The local nameserver had previously looked up data in this zone and knows where the remote nameservers are. It queries each of the remote nameserversone of them twicetrying to find the answer. In the meantime, the application times out and sends yet another query, asking for the same information.

Figure 7-2. Example query/response exchange


Keep in mind that even though a nameserver has sent a query to a remote nameserver, the remote nameserver may not receive the query right away. The query might be delayed or lost by the underlying network, or perhaps the remote nameserver host might be busy with another application.

Notice that a BIND nameserver is able to detect duplicate queries only while it is still trying to answer the original query. The local nameserver detects the duplicate query from the application because the local nameserver is still working on it. But remote nameserver 1 does not detect the duplicate query from the local nameserver because it answered the previous query. After the local nameserver receives the first response from remote nameserver 1, all other responses are discarded as duplicates. This dialog required the following exchanges:

Exchange

Number

Application to local nameserver

2 queries

Local nameserver to application

1 response

Local nameserver to remote nameserver 1

2 queries

Remote nameserver 1 to local nameserver

2 responses

Local nameserver to remote nameserver 2

1 query

Remote nameserver 2 to local nameserver

1 response

Local nameserver to remote nameserver 3

1 query

Remote nameserver 3 to local nameserver

0 responses


These exchanges would make the following contributions to the local nameserver's statistics:

Statistic

Cause

2 queries received

From the application on the local host

1 duplicate query

From the application on the local host

1 answer sent

To the application on the local host

3 responses received

From remote nameservers

2 duplicate responses

From remote nameservers

2 A queries

Queries for address information


In our example, the local nameserver received queries only from an application, yet it sent queries to remote nameservers. Normally, the local nameserver would also receive queries from remote nameservers (that is, in addition to asking remote servers for information it needs to know, the local server would also be asked by remote servers for information they need to know), but we didn't show any remote queries for the sake of simplicity.

7.6.2.1. BIND 8 statistics

Now that you've seen a typical exchange between applications and nameservers, as well as the statistics it generated, let's go over a more extensive example of the statistics. To get the statistics from your BIND 8 nameserver, use ndc:

# ndc stats

Wait a few seconds, look at the file named.stats in the nameserver's working directory. If the statistics are not dumped to this file, your server may not have been compiled with STATS defined and, thus, may not be collecting statistics. Following are the statistics from one of Paul Vixie's BIND 4.9.3 nameservers. BIND 8 nameservers have all of the same items listed here except for RnotNsQ, and the items are arranged in a different order. BIND 9 nameservers, as of 9.1.0, keep an entirely different set of statistics, which we'll show you in the next section.

+++ Statistics Dump +++ (800708260) Wed May 17 03:57:40 1995
746683    time since boot (secs)
392768    time since reset (secs)
14        Unknown query types
268459    A queries
3044      NS queries
5680      CNAME queries
11364     SOA queries
1008934   PTR queries
44        HINFO queries
680367    MX queries
2369      TXT queries
40        NSAP queries
27        AXFR queries
8336      ANY queries
++ Name Server Statistics ++
(Legend)
      RQ    RR    RIQ   RNXD    RFwdQ
      RFwdR RDupQ RDupR RFail   RFErr
      RErr  RTCP  RAXFR RLame   ROpts
      SSysQ SAns  SFwdQ SFwdR   SDupQ
      SFail SFErr SErr  RNotNsQ SNaAns
      SNXD
(Global)
   1992938 112600 0 19144 63462 60527 194 347 3420 0  5 2235 27 35289 0
   14886 1927930 63462 60527 107169  10025 119 0 1785426 805592  35863
[15.255.72.20]
   485 0 0 0 0  0 0 0 0 0  0 0 0 0 0  0 485 0 0 0  0 0 0 0 485  0
[15.255.152.2]
   441 137 0 1 2 108 0 0 0 0  0 0 0 0 0  13 439 85 7 84  0 0 0 0 431  0
[15.255.152.4]
   770 89 0 1 4  69 0 0 0 0  0 0 0 0 0  14 766 68 5 7  0 0 0 0 755  0
...  <lots of entries deleted>

If your BIND 8 nameserver doesn't include any per-IP address sections after "(Global)," you need to set host-statistics to yes in your options statement if you want to track per-host statistics:

options {
    host-statistics yes;
};

However, keeping host statistics requires a fair amount of memory, so you may not want to do it routinely unless you're trying to build a profile of your nameserver's activity.

Let's look at these statistics one line at a time.

+++ Statistics Dump +++ (800708260) Wed May 17 03:57:40 1995

This is when this section of the statistics was dumped. The number in parentheses (800708260) is the number of seconds since the Unix epoch, which was January 1, 1970. Mercifully, BIND converts that into a real date and time for you: May 17, 1995, 3:57:40 a.m.

746683    time since boot (secs)

This is how long the local nameserver has been running. To convert to days, divide by 86,400 (60 x 60 x 24, the number of seconds in a day). This server has been running for about 8.5 days.

392768    time since reset (secs)

This is how long the local nameserver has run since the last reload. You'll probably see this number differ from the time since boot only if the server is a primary nameserver for one or more zones. Nameservers that are slaves for a zone automatically pick up new data with zone transfers and are not usually reloaded. Since this server has been reset, it is probably the primary nameserver for some zone.

14        Unknown query types

This nameserver received 14 queries for data of a type it didn't recognize. Either someone is experimenting with new types, there is a defective implementation somewhere, or Paul needs to upgrade his nameserver.

268459    A queries

There have been 268,459 address lookups. Address queries are normally the most common type of query.

3044      NS queries

There have been 3,044 nameserver queries. Internally, nameservers generate NS queries when they are trying to look up servers for the root zone. Externally, applications such as dig and nslookup can also be used to look up NS records.

5680      CNAME queries

Some versions of sendmail make CNAME queries in order to canonicalize a mail address (replace an alias with the canonical name). Other versions of sendmail use ANY queries instead (we'll get to those shortly). Otherwise, the CNAME lookups are most likely from dig or nslookup.

11364     SOA queries

SOA queries are made by slave nameservers to check if their zone data is current. If the data is not current, an AXFR query follows to cause the zone transfer. Since this set of statistics does show AXFR queries, we can conclude that slave nameservers load zone data from this server.

1008934   PTR queries

The pointer queries map addresses to names. Many kinds of software look up IP addresses: inetd, rlogind, rshd, network management software, and network-tracing software.

44        HINFO queries

The host-information queries are most likely from someone interactively looking up HINFO records.

680367    MX queries

Mailers such as sendmail make mail exchanger queries as part of the normal electronic mail delivery process.

2369      TXT queries

Some application must be making text queries for this number to be this large. It might be a tool like Harvest, which is an information search-and-retrieval technology developed at the University of Colorado.

40        NSAP queries

This is a relatively new record type used to map domain names to OSI Network Service Access Point addresses.

27        AXFR queries

Slave nameservers make AXFR queries to initiate zone transfers.

8336      ANY queries

ANY queries request records of any type for a name. sendmail is the most common program to use this query type. Since sendmail looks up CNAME, MX, and address records for a mail destination, it will make a query for ANY record type so that all the resource records are cached right away at the local nameserver.

The rest of the statistics are kept on a per-host basis. If you look over the list of hosts your nameserver has exchanged packets with, you'll find out just how garrulous your nameserver is: you'll see hundreds or even thousands of hosts in the list. While the size of the list is impressive, the statistics themselves are only somewhat interesting. We'll explain all the statistics, even the ones with zero counts, although you'll probably find only a handful of the statistics useful. To make the statistics easier to read, you'll need a tool to expand the statistics because the output format is rather compact. We wrote a tool called bstat to do just this. Here's what its output looks like:

hpcvsop.cv.hp.com
        485 queries received
        485 responses sent to this name server
        485 queries answered from our cache
relay.hp.com
        441 queries received
        137 responses received
          1 negative response received
          2 queries for data not in our cache or authoritative data
        108 responses from this name server passed to the querier
         13 system queries sent to this name server
        439 responses sent to this name server
         85 queries sent to this name server
          7 responses from other name servers sent to this name server
         84 duplicate queries sent to this name server
        431 queries answered from our cache
hp.com
        770 queries received
         89 responses received
          1 negative response received
          4 queries for data not in our cache or authoritative data
         69 responses from this name server passed to the querier
         14 system queries sent to this name server
        766 responses sent to this name server
         68 queries sent to this name server
          5 responses from other name servers sent to this name server
          7 duplicate queries sent to this name server
        755 queries answered from our cache

In the raw statistics (not the bstat output), each host's IP address is followed by a table of counts. The column heading for this table is the cryptic legend at the beginning. The legend is broken into several lines, but the host statistics are all on a single line. In the following section, we'll explain briefly what each column means as we look at the statistics for one of the hosts this nameserver conversed with15.255.152.2 (relay.hp.com). For the sake of our explanation, we'll first show you the column heading from the legend (e.g., RQ) followed by the count for this column for relay.


RQ 441

RQ is the count of queries received from relay. These queries were made because relay needed information about a zone served by this nameserver.


RR 137

RR is the count of responses received from relay. These are responses to queries made from this nameserver. Don't try to correlate this number to RQ, because they are not related. RQ counts questions asked by relay; RR counts answers that relay gave to this nameserver (because this nameserver asked relay for information).


RIQ 0

RIQ is the count of inverse queries received from relay. Inverse queries were originally intended to map addresses to names, but that function is now handled by PTR records. Older versions of nslookup use an inverse query on startup, so you may see a nonzero RIQ count.


RNXD 1

RNXD is the count of "no such domain" answers received from relay.


RFwdQ 2

RFwdQ is the count of queries received (RQ) from relay that need further processing before they can be answered. This count is much higher for hosts that configure their resolver (with resolv.conf) to send all queries to your nameserver.


RFwdR 108

RFwdR is the count of responses received (RR) from relay that answer the original query and are passed back to the application that made the query.


RDupQ 0

RDupQ is the count of duplicate queries from relay. You'll see duplicates only when the resolver is configured (with resolv.conf) to query this nameserver.


RDupR 0

RDupR is the count of duplicate responses from relay. A response is a duplicate when the nameserver can no longer find the original query in its list of pending queries that caused the response.


RFail 0

RFail is the count of SERVFAIL responses from relay. A SERVFAIL response indicates some sort of server failure. Server failure responses often occur because the remote server reads a zone datafile and finds a syntax error. Any queries for data in the zone with the erroneous zone datafile results in a server failure answer from the remote nameserver. This is probably the most common cause of SERVFAIL responses. Server failure responses also occur when the remote nameserver tries to allocate more memory and can't, or when the remote slave nameserver's zone data expires.


RFErr 0

RFErr is the count of FORMERR responses from relay. FORMERR means that the remote nameserver said the local nameserver's query had a format error.


RErr 0

RErr is the count of errors that aren't either SERVFAIL or FORMERR.


RTCP 0

RTCP is the count of queries received on TCP connections from relay. (Most queries use UDP.)


RAXFR 0

RAXFR is the count of zone transfers initiated. The count indicates that relay is not a slave for any zones served by this nameserver.


RLame 0

RLame is the count of lame delegations received. If this count is not 0, it means that some zone is delegated to the nameserver at this IP address, and the nameserver is not authoritative for the zone.


ROpts 0

ROpts is the count of packets received with IP options set.


SSysQ 13

SSysQ is the count of system queries sent to relay. System queries are queries that are initiated by the local nameserver. Most system queries will go to root nameservers because system queries are used to keep the list of root nameservers up to date. But system queries are also used to find the address of a nameserver if the address record timed out before the nameserver record did. Since relay is not a root nameserver, these queries must have been sent for the latter reason.


SAns 439

SAns is the count of answers sent to relay. This nameserver answered 439 out of the 441 (RQ) queries relay sent to it. I wonder what happened to the two queries it didn't answer . . .


SFwdQ 85

SFwdQ is the count of queries that are sent (forwarded) to relay when the answer is not in this nameserver's zone data or cache.


SFwdR 7

SFwdR is the count of responses from a nameserver that are sent (forwarded) to relay.


SDupQ 84

SDupQ is the count of duplicate queries sent to relay. It's not as bad as it looks, though. The duplicate count is incremented if the query is sent to any other nameserver first. So relay might have answered all the queries it received the first time it received them, and the query still counted as a duplicate because it was sent to some other nameserver before relay.


SFail 0

SFail is the count of SERVFAIL responses sent to relay.


SFErr 0

SFErr is the count of FORMERR responses sent to relay.


SErr 0

SErr is the count of sendto() system calls that failed when the destination was relay.


RNotNsQ 0

RNotNsQ is the count of queries received that are not from port 53, the nameserver port. Prior to BIND 8, all nameserver queries came from port 53. Any queries from ports other than 53 came from a resolver. BIND 8 nameservers query from ports other than 53, however, which makes this statistic useless since you can no longer distinguish resolver queries from nameserver queries. Hence, BIND 8 dropped RNotNsQ from its statistics.


SNaAns 431

SNaAns is the count of nonauthoritative answers sent to relay. Out of the 439 answers (SAns) sent to relay, 431 are from cached data.


SNXD 0

SNXD is the count of "no such domain" answers sent to relay.

7.6.2.2. BIND 9 statistics

BIND 9.1.0 is the first version of BIND 9 to keep statistics. You use rndc to induce BIND 9 to dump its statistics:

% rndc stats

The nameserver dumps statistics, as a BIND 8 nameserver would, to a file called named.stats in its working directory. However, those statistics look completely different from BIND 8's. Here are the contents of the stats file from one of our BIND 9 nameservers:

+++ Statistics Dump +++ (979436130)
success 9
referral 0
nxrrset 0
nxdomain 1
recursion 1
failure 1
--- Statistics Dump --- (979436130)
+++ Statistics Dump +++ (979584113)
success 651
referral 10
nxrrset 11
nxdomain 17
recursion 296
failure 217
--- Statistics Dump --- (979584113)

The nameserver appends a new statistics block (the section between "+++ Statistics Dump +++" and "--- Statistics Dump ---") each time it receives a stats command. The number in parentheses (979436130) is, as in earlier stats files, the number of seconds since the Unix epoch. Unfortunately, BIND doesn't convert the value for you, but you can use the date command to convert it to something more readable. For example, to convert 979584113 seconds since the Unix epoch (January 1, 1970), you can use:

% date -d '1970-01-01 979584113 sec'
Mon Jan 15 18:41:53 MST 2001

Let's now go through these statistics one line at a time:


success 651

This is the number of successful queries the nameserver handled. Successful queries are those that didn't result in referrals or errors.


referral 10

This is the number of queries the nameserver handled that resulted in referrals.


nxrrset 11

This is the number of queries the nameserver handled that resulted in responses saying that the type of record the querier requested didn't exist for the domain name it specified.


nxdomain 17

This is the number of queries the nameserver handled that resulted in responses saying that the domain name the querier specified didn't exist.


recursion 296

This is the number of queries the nameserver received that required recursive processing to answer.


failure 217

This is the number of queries the nameserver received that resulted in errors other than those covered by nxrrset and nxdomain.

These are obviously not nearly as many statistics as a BIND 8 nameserver keeps, but future versions of BIND 9 will probably record more.

7.6.2.3. Using the BIND statistics

Is your nameserver "healthy"? How do you know what a "healthy" operation looks like? From a single snapshot, you can't really say whether a nameserver is healthy. You have to watch the statistics generated by your server over a period of time to get a feel for what sorts of numbers are normal for your configuration. These numbers will vary markedly among nameservers depending on the mix of applications generating lookups, the type of server (primary, slave, caching-only), and the level of the zones in the namespace it is serving.

One thing to watch for in the statistics is how many queries per second your nameserver receives. Take the number of queries received and divide by the number of seconds the nameserver has been running. Paul's BIND 4.9.3 nameserver received 1,992,938 queries in 746,683 seconds, or approximately 2.7 queries per secondnot a very busy server.[*] If the number you come up with for your nameserver seems out of line, look at which hosts are making all the queries and decide if it makes sense for them to be making all those queries. At some point, you may decide that you need more nameservers to handle the load. We'll cover that situation in the next chapter.

[*] Recall that the root nameservers, which run plain vanilla BIND, can handle thousands of queries per second.


Previous Page
Next Page