On some Unix systems, the spell command reads one or more files and prints a list of words that may be misspelled. You can redirect the output to a file, use grep (Section 13.1) to locate each of the words, and then use vi or ex to make the edits. It's also possible to hack up a shell and sed script that interactively displays the misspellings and fixes them on command, but realistically, this is too tedious for most users. (The ispell (Section 16.2) program solves many though not all of these problems.)
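The redirect-then-grep workflow can be sketched roughly as follows. The function name locate_words and the file names are made up for illustration, and the -w (whole-word) option is an assumption about your grep; it is common in GNU and BSD versions but not guaranteed everywhere.

```shell
# locate_words WORDLIST FILE  (hypothetical helper, not part of spell)
# WORDLIST holds one suspect word per line, for example the output of:
#   spell sample > suspects
locate_words() {
    # -n: show line numbers      -w: match whole words only
    # -F: fixed strings, no regex -f: read the words from a file
    grep -n -w -F -f "$1" -- "$2"
}
```

Each line of output is a line number, a colon, and the text of the line, which gives you something to jump to in vi or ex.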
When you run spell on a file, the list of words it produces usually includes a number of legitimate words or terms that the program does not recognize. spell is case sensitive; it's happy with Aaron but complains about aaron. You must cull out the proper nouns and other words spell doesn't know about to arrive at a list of true misspellings. For instance, look at the results on this sample sentence:
$ cat sample
Alcuin uses TranScript to convert ditroff into PostScript output for the LaserWriter printerr.
$ spell sample
Alcuin
ditroff
printerr
LaserWriter
PostScript
TranScript
Only one word in this list is actually misspelled.
On many Unix systems, you can supply a local dictionary file so that spell recognizes special words and terms specific to your site or application. After you have run spell and looked through the word list, you can create a file containing the words that were not actual misspellings. The spell command will check this list after it has gone through its own dictionary. On certain systems, your word-list file must be sorted (Section 22.1).
If you added the special terms in a file named dict, you could specify that file on the command line using the + option:
$ spell +dict sample
printerr
The output is reduced to the single misspelling.
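One way to build such a file is sketched below. The helper name make_dict is hypothetical, and the sketch assumes that a plain byte-order sort is what your spell expects; check your system's documentation if the local dictionary seems to be ignored.

```shell
# make_dict FILE  (hypothetical helper)
# Reads candidate words on standard input and writes a sorted,
# duplicate-free word list to FILE.
make_dict() {
    LC_ALL=C sort -u > "$1"
}

# Possible use, assuming spell is installed: keep everything spell
# reported except the one true misspelling, then recheck the file.
#   spell sample | grep -v '^printerr$' | make_dict dict
#   spell +dict sample
```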
The spell command will make some errors based on incorrect derivation of spellings from the root words contained in its dictionary. If you understand how spell works (Section 15.4), you may be less surprised by some of these errors.
As stated at the beginning, spell isn't on all Unix systems, e.g., Darwin and FreeBSD. In these other environments, check for the existence of alternative spell checking, such as ispell (Section 16.2). Or you can download and install the GNU version of spell at http://www.gnu.org/directory/spell.html.
DD and SP
The original Unix spell-checking program, spell (Section 15.1), is fine for quick checks of spelling in a short document, but it makes you cry out for a real spellchecker, which not only shows you the misspelled words in context, but offers to change them for you.
Go to http://examples.oreilly.com/upt3 for more information on: ispell
ispell, a very useful program that's been ported to Unix and enhanced over the years, does all this and more. Either it will be preinstalled or you'll need to install it for your Unix version.
Here's the basic usage: just as with spell, you spell check a document by giving ispell a filename. But there the similarities cease. ispell takes over your screen or window, printing two lines of context at the bottom of the screen. If your terminal can do reverse video, the offending word is highlighted. Several alternate possibilities are presented in the upper-left corner of the screen: any word in ispell's dictionary that differs by only one letter, has a missing or extra letter, or has transposed letters.
Faced with a highlighted word, you have eight choices:
Press the spacebar to accept the current spelling.
Type A to accept the current spelling, now and for the rest of this input file.
Type I to accept the current spelling now and for the rest of this input file and also to instruct ispell to add the word to your private dictionary. By default, the private dictionary is the file .ispell_words in your home directory, but it can be changed with the -p option or by setting the environment variable (Section 35.3) WORDLIST to the name of some other file. If you work with computers, this option will come in handy since we use so much jargon in this business! It makes a lot more sense to "teach" all those words to ispell than to keep being offered them for possible correction. (One gotcha: when specifying an alternate file, you must use an absolute pathname (Section 1.14), or ispell will look for the file in your home directory.)
Type the digit corresponding to one of ispell's alternative suggestions to use that spelling instead. For example, if you've typed "hnadle," as I did when writing this article, ispell will offer 0: handle in the upper-left corner of your screen. Typing 0 makes the change and moves on to the next misspelling, if any.
Type R if none of ispell's offerings do the trick and you want to be prompted for a replacement. Type in the new word, and the replacement is made.
Type L if ispell didn't make any helpful suggestions and you're at a loss as to how to spell the word correctly. ispell will prompt you for a lookup string. You can use * as a wildcard character (it appears to substitute for zero or one characters); ispell will print a list of matching words from its dictionary.
Type Q to quit, writing any changes made so far, but ignoring any misspellings later in the input file.
Type X to quit without writing any changes.
But that's not all! ispell also saves a copy of your original file with a .bak extension, just in case you regret any of your changes. If you don't want ispell making .bak files, invoke it with the -x option.
How about this: ispell knows about capitalization. It already knows about proper names and a lot of common acronyms; it can even handle words like "TeX" that have oddball capitalization. Speaking of TeX, ispell has special modes in which it recognizes TeX constructions.
If ispell isn't on your system by default, you should be able to find an installation of it packaged in your system's own unique software-installation packaging, discussed in Chapter 40.
In addition, you can look for a newer spell-checking utility, aspell, based on ispell but with improved processing. Though aspell is being considered a replacement for ispell, the latter is still the more commonly found and used of the two.
TOR
Are you writing a document and want to check the spelling of a word before you finish (if you aren't using a word processor with automatic spelling correction, that is)? A Unix system gives you several ways to do this.
If you aren't sure which of two possible spellings is right, you can use the spell command with no arguments to find out. Type the name of the command, followed by a RETURN, then type the alternative spellings you are considering. Press CTRL-d (on a line by itself) to end the list. The spell command will echo back the word(s) in the list that it considers to be in error:
$ spell
misspelling
mispelling
CTRL-d
mispelling
If you're using ispell (Section 16.2) or the newer aspell, you need to add the -a option. The purpose of this option is to let the speller interact with other programs; there are details in the programs' documentation. But, as with most Unix filters, you can also let these programs read a word from standard input and write their response on standard output; the response will either tell you that the spelling is right or give you a list of suggestions. aspell and ispell will use their local dictionaries and improved spelling rules.
As an example, let's check the spelling of outragous and whut with both ispell and aspell:
$ ispell -a
@(#) International Ispell Version 3.1.20 10/10/95
outragous whut
& outragous 1 0: outrageous
& whut 5 10: hut, shut, what, whet, whit
CTRL-d
$ aspell -a
@(#) International Ispell Version 3.1.20 (but really Aspell .32.6 alpha)
outragous whut
& outragous 3 0: outrageous, outrages, outrage's
& whut 5 10: what, whet, whit, hut, shut
CTRL-d
$
When these spellers start, they print a version message and wait for input. I type the words I want to check and press RETURN. The speller returns one result line for each word:
A result of * means the word is spelled correctly.
A line starting with & means the speller has suggestions. Then it repeats the word, the number of suggestions it has for that word, the character position that the word had on the input line, and finally the suggestions.
So ispell suggested that outragous might be outrageous. aspell also came up with outrages and outrage's. (I'd say that outrage's is barely a word. Be careful with aspell's suggestions.) Both spellers had five suggestions for whut; the differences are interesting . . .
A result of # means there were no suggestions.
After processing a line, the spellers both print an empty line. Press CTRL-d to end input.
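Because the result lines have a fixed shape, another program can read them easily. The following is a minimal sketch of such a reader; the function name explain_results is invented for illustration, and it handles only the three result types described above, silently skipping any other line the speller might print.

```shell
# explain_results  (hypothetical helper)
# Reads "ispell -a" style result lines on standard input and prints
# a short summary line for each result.
explain_results() {
    while IFS= read -r line; do
        case "$line" in
        '@(#)'*)            # version banner: nothing to report
            ;;
        '*')                # word is spelled correctly
            echo "ok" ;;
        '&'*)               # "& word count offset: suggestions"
            rest=${line#& }
            echo "${rest%% *}: try ${line#*: }" ;;
        '#'*)               # "# word offset": no suggestions
            rest=${line#\# }
            echo "${rest%% *}: no suggestions" ;;
        *)  ;;              # blank separator lines and anything else
        esac
    done
}

# Possible use, assuming ispell is installed:
#   echo 'outragous whut' | ispell -a | explain_results
```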
Another way to do the same thing is with look (Section 13.14). With just one argument, look searches the system word file, /usr/dict/words, for words starting with the characters in that one argument. That's a good way to check spelling or find a related word:
% look help
help
helpful
helpmate
look uses its -df options automatically when it searches the word list. -d ignores any character that isn't a letter, number, space or tab; -f treats upper- and lowercase letters the same.
JP and DD
[If you have ispell (Section 16.2), there's not a whole lot of reason for using spell any more. Not only is ispell more powerful, it's a heck of a lot easier to update its spelling dictionaries. Nonetheless, we decided to include this article, because it clarifies the kinds of rules that spellcheckers go through to expand on the words in their dictionaries. TOR]
On many Unix systems, the directory /usr/lib/spell contains the main program invoked by the spell command along with auxiliary programs and data files.
On some systems, the spell command is a shell script that pipes its input through deroff -w and sort -u (Section 22.6) to remove formatting codes and prepare a sorted word list, one word per line. On other systems, it is a standalone program that does these steps internally. Two separate spelling lists are maintained, one for American usage and one for British usage (invoked with the -b option to spell). These lists, hlista and hlistb, cannot be read or updated directly. They are compressed files, compiled from a list of words represented as nine-digit hash codes. (Hash coding is a special technique used to search for information quickly.)
The main program invoked by spell is spellprog. It loads the list of hash codes from either hlista or hlistb into a table, and it looks for the hash code corresponding to each word on the sorted word list. This eliminates all words (or hash codes) actually found in the spelling list. For the remaining words, spellprog tries to derive a recognizable word by performing various operations on the word stem based on suffix and prefix rules. A few of these manipulations follow:
The new words created as a result of these manipulations will be checked once more against the spell table. However, before the stem-derivative rules are applied, the remaining words are checked against a table of hash codes built from the file hstop. The stop list contains typical misspellings that stem-derivative operations might allow to pass. For instance, the misspelled word thier would be converted into thy using the suffix rule -y+ier. The hstop file accounts for as many cases of this type of error as possible.
The final output consists of words not found in the spell list even after the program tried to search for their stems and words that were found in the stop list.
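The order of these checks can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it uses plain text files instead of hashed lists, it knows a single suffix rule (-y+ier), and the names undo_suffix_rule and toy_spell are invented here. It is not how spellprog is actually implemented.

```shell
# undo_suffix_rule WORD DROPPED ADDED  (hypothetical helper)
# Reverses a rule written -DROPPED+ADDED: strips ADDED from the end
# of WORD and puts DROPPED back, printing the candidate stem.
undo_suffix_rule() {
    case "$1" in
    *"$3") printf '%s%s\n' "${1%"$3"}" "$2" ;;
    *)     return 1 ;;
    esac
}

# toy_spell WORD DICT STOP  (hypothetical helper)
# Prints WORD if the toy checker considers it misspelled.
toy_spell() {
    # 1. Words found in the spelling list are accepted.
    grep -qxF -- "$1" "$2" && return 0
    # 2. Words on the stop list are reported before any rule is tried.
    if grep -qxF -- "$1" "$3"; then
        echo "$1"
        return 0
    fi
    # 3. Otherwise try to derive a stem and look that up instead.
    stem=$(undo_suffix_rule "$1" y ier) && grep -qxF -- "$stem" "$2" && return 0
    # 4. Nothing matched: report the word.
    echo "$1"
}
```

With a dictionary holding happy and thy and a stop list holding thier, toy_spell accepts happier (stem happy) but still reports thier, because the stop list is consulted before the rule that would have turned it into thy.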
You can get a better sense of these rules in action by using the -v or -x option. The -v option eliminates the last look-up in the table and produces a list of words that are not actually in the spelling list, along with possible derivatives. It allows you to see which words were found as a result of stem-derivative operations and prints the rule used. (Refer to the sample file in Section 16.1.)
% spell -v sample
Alcuin
ditroff
LaserWriter
PostScript
printerr
TranScript
+out output
+s uses
The -x option makes spell begin at the stem-derivative stage and prints the various attempts it makes to find the stem of each word.
% spell -x sample
...
=into
=LaserWriter
=LaserWrite
=LaserWrit
=laserWriter
=laserWrite
=laserWrit
=output
=put
...
LaserWriter
...
The stem is preceded by an equals sign (=). At the end of the output are the words whose stem does not appear in the spell list.
One other file you should know about is spellhist. On some systems, each time you run spell, the output is appended through tee (Section 43.8) into spellhist, in effect creating a list of all the misspelled or unrecognized words for your site. The spellhist file is something of a "garbage" file that keeps on growing: you will want to reduce it or remove it periodically. To extract useful information from this spellhist, you might use the sort and uniq -c (Section 21.20) commands to compile a list of misspelled words or special terms that occur most frequently. It is possible to add these words back into the basic spelling dictionary, but this is too complex a process to describe here. It's probably easier just to use a local spelling dictionary (Section 16.1). Even better, use ispell; not only is it a more powerful spelling program, it is much easier to update the word lists it uses (Section 16.5).
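The sort and uniq -c step might look something like this. The function name top_misspellings is hypothetical, and the sketch assumes that the history file holds one word per line.

```shell
# top_misspellings FILE  (hypothetical helper)
# Prints "count word" pairs from FILE, most frequent word first.
top_misspellings() {
    # sort groups identical words so that uniq -c can count them;
    # sort -rn then orders by count; awk tidies uniq's padding.
    sort -- "$1" | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Possible use (the location of spellhist varies by system):
#   top_misspellings /usr/lib/spell/spellhist | head
```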
DD
ispell (Section 16.2) uses two lists for spelling verification: a master word list and a supplemental personal word list.
The master word list for ispell is normally the file /usr/local/lib/ispell/ispell.hash, though the location of the file can vary on your system. This is a "hashed" dictionary file. That is, it has been converted to a condensed, program-readable form using the buildhash program (which comes with ispell) to speed the spell-checking process.
The personal word list is normally a file called .ispell_english or .ispell_words in your home directory. (You can override this default with either the -p command-line option or the WORDLIST environment variable (Section 35.3).) This file is simply a list of words, one per line, so you can readily edit it to add, alter, or remove entries. The personal word list is normally used in addition to the master word list, so if a word usage is permitted by either list it is not flagged by ispell.
Custom personal word lists are particularly useful for checking documents that use jargon or special technical words that are not in the master word list, and for personal needs such as holding the names of your correspondents. You may choose to keep more than one custom word list to meet various special requirements.
You can add to your personal word list any time you use ispell: simply use the I command to tell ispell that the word it offered as a misspelling is actually correct, and should be added to the dictionary. You can also add a list of words from a file using the ispell -a (Section 16.3) option. The words must be one to a line, but need not be sorted. Each word to be added must be preceded with an asterisk. (Why? Because ispell -a has other functions as well.) So, for example, we could have added a list of Unix utility names to our personal dictionaries all at once, rather than one-by-one as they were encountered during spell checking.
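As a rough sketch, the asterisks can be added with sed just before the list is handed to ispell. The helper name star_words and the file name unix-names are made up for illustration. Whether your ispell also needs to be told to save the personal dictionary afterward is something to confirm in its manual page; the comment below reflects an assumption about common versions.

```shell
# star_words [FILE...]  (hypothetical helper)
# Prefixes every input line with an asterisk, the form ispell -a
# expects for "add this word to the personal dictionary".
star_words() {
    sed 's/^/*/' "$@"
}

# Possible use, assuming ispell is installed:
#   star_words unix-names | ispell -a
# Assumption: in many ispell versions, a final line containing only #
# asks ispell -a to save the personal dictionary.
```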
Obviously, though, in an environment where many people are working with the same set of technical terms, it doesn't make sense for each individual to add the same word list to his own private .ispell_words file. It would make far more sense for a group to agree on a common dictionary for specialized terms and always to set WORDLIST to point to that common dictionary.
If the private word list gets too long, you can create a "munched" word list. The munchlist script that comes with ispell reduces the words in a word list to a set of word roots and permitted suffixes according to rules described in the ispell(4) reference page that will be installed with ispell from the CD-ROM [see http://examples.oreilly.com/upt3]. This creates a more compact but still editable word list.
Another option is to provide an alternative master spelling list using the -d option. This has two problems, though:
The master spelling list should include spellings that are always valid, regardless of context. You do not want to overload your master word list with terms that might be misspellings in a different context. For example, perl is a powerful programming language, but in other contexts, perl might be a misspelling of pearl. You may want to place perl in a supplemental word list when documenting Unix utilities, but you probably wouldn't want it in the master word list unless you were documenting Unix utilities most of the time that you use ispell.
The -d option must point to a hashed dictionary file. What's more, you cannot edit a hashed dictionary; you will have to edit a master word list and use (or have the system administrator use) buildhash to hash the new dictionary to optimize spell checker performance.
To build a new hashed word list, provide buildhash with a complete list of the words you want included, one per line. (The buildhash utility can only process a raw word list, not a munched word list.) The standard system word list, /usr/dict/words on many systems, can provide a good starting point. This file is writable only by the system administrator and probably shouldn't be changed in any case. So make a copy of this file, and edit or add to the copy. After processing the file with buildhash, you can either replace the default ispell.hash file or point to your new hashed file with the -d option.
TOR and LK
The wc (word count) command counts the number of lines, words, and characters in the files you specify. (Like most Unix utilities, wc reads from its standard input if you don't specify a filename.) For example, the file letter has 120 lines, 734 words, and 4,297 characters:
% wc letter
120 734 4297 letter
You can restrict what is counted by specifying the options -l (count lines only), -w (count words only), and -c (count characters only). For example, you can count the number of lines in a file:
% wc -l letter
120 letter
or you can count the number of files in a directory:
% cd man_pages
% ls | wc -w
233
The first example uses a file as input; the second example pipes the output of an ls command to the input of wc. (Be aware that the -a option (Section 8.9) makes ls list dot files. If your ls command is aliased (Section 29.2) to include -a or other options that add words to the normal output (such as the line total nnn from ls -l), then you may not get the results you want.)
The following command will tell you how many more words are in new.file than in old.file:
% expr `wc -w < new.file` - `wc -w < old.file`
Many shells have built-in arithmetic commands and don't really need expr; however, expr works in all shells.
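For comparison, here is a sketch of the same calculation using the arithmetic expansion built into POSIX-style shells such as bash and ksh. The function name word_diff is invented for illustration; very old Bourne shells lack the $(( )) syntax.

```shell
# word_diff NEW OLD  (hypothetical helper)
# Prints how many more words NEW has than OLD (negative if fewer).
word_diff() {
    # $(( )) does the subtraction in the shell itself; any padding
    # that wc puts around its number is harmless inside it.
    echo $(( $(wc -w < "$1") - $(wc -w < "$2") ))
}
```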
Taking this concept a step further, here's a simple shell script to calculate the differences in word count between two files:
count_1=`wc -w < $1`     # number of words in file 1
count_2=`wc -w < $2`     # number of words in file 2
diff_12=`expr $count_1 - $count_2`     # difference in word count
# if $diff_12 is negative, reverse order and don't show the minus sign:
case "$diff_12" in
-*) echo "$2 has `expr $diff_12 : '-\(.*\)'` more words than $1" ;;
*)  echo "$1 has $diff_12 more words than $2" ;;
esac
If this script were called count.it, then you could invoke it like this:
% count.it draft.2 draft.1
draft.1 has 23 more words than draft.2
You could modify this script to count lines or characters.
Finally, two notes about file size:
wc -c isn't an efficient way to count the characters in large numbers of files. wc opens and reads each file, which takes time. The fourth or fifth column of output from ls -l (depending on your version) gives the character count without opening the file.
Using character counts (as in the previous item) doesn't give you the total disk space used by files. That's because, in general, each file takes at least one disk block to store. The du (Section 15.8) command gives accurate disk usage.
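The three measurements can be compared side by side with a sketch like the following. The function names are hypothetical, and size_from_ls assumes the size is in the fifth column of ls -l, which is the common layout but, as noted, not universal.

```shell
# Hypothetical helpers for comparing the ways of measuring a file.

# Character count by reading the whole file.
size_from_wc() { wc -c < "$1" | tr -d ' '; }

# Character count from the directory listing, without reading the
# file. Assumes the size is the fifth column of ls -l.
size_from_ls() { ls -l -- "$1" | awk '{ print $5 }'; }

# Disk space actually used, in 1024-byte units.
blocks_from_du() { du -k -- "$1" | awk '{ print $1 }'; }
```

The first two should agree for an ordinary file; the du figure is usually larger for small files because space is allocated a block at a time.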
JP, DG, and SP
One type of error that's hard to catch when proofreading is a doubled word. It's hard to miss the double "a" in the title of this article, but you might find yourself from time to time with a "the" on the end of one line and the beginning of another.
We've seen awk scripts to catch this, but nothing so simple as this shell function, which relies on uniq (Section 21.20). Here are two versions; the second is for the System V version of tr (Section 21.11):
ww( ) { cat $* | tr -cs "a-z'" "\012" | uniq -d; }
ww( ) { cat $* | tr -cs "[a-z]'" "[\012*]" | uniq -d; }
In the script ww.sh, the contents of the files are piped to tr, which breaks the stream into separate words, one per line; that list is then passed to uniq -d, which prints each word that appears on two or more consecutive lines.
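Because the character set given to tr contains only lowercase letters, a capital letter acts as a word break, so a pair such as "The the" slips through. A possible variant, sketched here under the hypothetical name ww_ci, folds everything to lowercase first. It uses the non-System V tr syntax.

```shell
# ww_ci [FILE...]  (hypothetical variant of ww)
# Reports doubled words regardless of capitalization.
ww_ci() {
    cat "$@" |
    tr 'A-Z' 'a-z' |            # fold uppercase to lowercase
    tr -cs "a-z'" '\012' |      # one word per line
    uniq -d                     # print words repeated back to back
}
```

Like the original, this catches a word repeated across a line break, because the line structure is discarded before uniq sees the words.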
TOR and JP
A common problem in text processing is making sure that items that need to occur in pairs actually do so.
Most Unix text editors include support for making sure that elements of C syntax such as parentheses and braces are closed properly. Some editors, such as Emacs (Section 19.1) and vim (Section 17.1), also support syntax coloring and checking for text documents -- HTML and SGML, for instance. There's much less support in command-line utilities for making sure that textual documents have the proper structure. For example, HTML documents that start a list with <UL> need a closing </UL>.
Unix provides a number of tools that might help you to tackle this problem. Here's a gawk (Section 20.11) script written by Dale Dougherty that makes sure <UL> and </UL> tags come in pairs:
#! /usr/local/bin/gawk -f
BEGIN {
    IGNORECASE = 1
    inList = 0
    LSlineno = 0
    LElineno = 0
    prevFile = ""
}
# if more than one file, check for unclosed list in first file
FILENAME != prevFile {
    if (inList)
        printf ("%s: found <UL> at line %d without </UL> before end of file\n",
            prevFile, LSlineno)
    inList = 0
    prevFile = FILENAME
}
# match <UL> and see if we are in list
/^<UL>/ {
    if (inList) {
        printf("%s: nested list starts: line %d and %d\n",
            FILENAME, LSlineno, FNR)
    }
    inList = 1
    LSlineno = FNR
}
/^<\/UL>/ {
    if (! inList)
        printf("%s: too many list ends: line %d and %d\n",
            FILENAME, LElineno, FNR)
    else
        inList = 0
    LElineno = FNR
}
# this catches end of input
END {
    if (inList)
        printf ("%s: found <UL> at line %d without </UL> before end of file\n",
            FILENAME, LSlineno)
}
You can adapt this type of script for any place you need to check for a start and finish to an item. Note, though, that not all systems have gawk preinstalled. You'll want to look for an installation of the utility for your system to use this script.
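As one possible adaptation, the sketch below wraps a simplified version of the same idea in a shell function that takes the start and end markers as arguments and uses only features found in standard awk, so gawk is not required. The name check_pairs and its messages are invented here. Like the script above, it only looks for markers at the start of a line, and it ignores case by comparing uppercased text instead of relying on IGNORECASE.

```shell
# check_pairs START END FILE...  (hypothetical helper)
# Reports START markers with no END, and END markers with no START.
check_pairs() {
    otag=$1 ctag=$2
    shift 2
    awk -v otag="$otag" -v ctag="$ctag" '
    # a new file begins: report a marker left open in the previous one
    FNR == 1 && open_line {
        printf "%s: %s at line %d without %s\n", prev, otag, open_line, ctag
        open_line = 0
    }
    { prev = FILENAME }
    # line starts with the START marker
    index(toupper($0), toupper(otag)) == 1 {
        if (open_line)
            printf "%s: nested %s: lines %d and %d\n", FILENAME, otag, open_line, FNR
        open_line = FNR
    }
    # line starts with the END marker
    index(toupper($0), toupper(ctag)) == 1 {
        if (!open_line)
            printf "%s: %s at line %d without %s\n", FILENAME, ctag, FNR, otag
        open_line = 0
    }
    END {
        if (open_line)
            printf "%s: %s at line %d without %s\n", prev, otag, open_line, ctag
    }' "$@"
}

# Possible use:
#   check_pairs '<UL>' '</UL>' *.html
```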
A more complete syntax-checking program could be written with the help of a lexical analyzer like lex. lex is normally used by experienced C programmers, but it can be used profitably by someone who has mastered awk and is just beginning with C, since it combines an awk-like pattern-matching process using regular-expression syntax with actions written in the more powerful and flexible C language. (See O'Reilly & Associates' lex & yacc.)
Of course, this kind of problem could be very easily tackled with the information in Chapter 41.
TOR and SP
In various textual-analysis scripts, you sometimes need just the words (Section 16.7).
I know two ways to do this. The deroff command was designed to strip out troff (Section 45.11) constructs and punctuation from files. The command deroff -w will give you a list of just the words in a document; pipe to sort -u (Section 22.6) if you want only one of each.
deroff has one major failing, though. It considers a word as just a string of characters beginning with a letter of the alphabet. A single character won't do, which leaves out one-letter words like the indefinite article "A."
A substitute is tr (Section 21.11), which can perform various kinds of character-by-character conversions.
To produce a list of all the individual words in a file, type the following:
% tr -cs A-Za-z '\012' < file
The -c option "complements" the first string passed to tr; -s squeezes out repeated characters. This has the effect of saying: "Take any nonalphabetic characters you find (one or more) and convert them to newlines (\012)."
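Wrapped as a pair of small functions, the idea might look like this. The names words and unique_words are made up for illustration, and the non-System V tr syntax is assumed.

```shell
# words  (hypothetical helper)
# Reads text on standard input and prints one word per line.
words() {
    tr -cs 'A-Za-z' '\012'
}

# unique_words  (hypothetical helper)
# Same, but sorted with duplicates removed.
unique_words() {
    words | LC_ALL=C sort -u
}
```

Unlike deroff -w, this keeps one-letter words such as "A". If the input begins with a nonalphabetic character, the first line of output will be empty.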
(Wouldn't it be nice if tr just recognized standard Unix regular expression syntax (Section 32.4)? Then, instead of -c A-Za-z, you'd say '[^A-Za-z]'. It's no less obscure, but at least it's used by other programs, so there's one less thing to learn.)
The System V version of tr (Section 21.11) has slightly different syntax. You'd get the same effect with this:
% tr -cs '[A-Z][a-z]' '[\012*]' < file
TOR
[1] You could also type cat new.file | wc -w, but this involves two commands, so it's less efficient (Section 43.2).